[best of the WEB]
Ivan Tashev
Kinect Development Kit: A Toolkit for Gesture- and Speech-Based Human–Machine Interaction
Digital Object Identifier 10.1109/MSP.2013.2266959
Date of publication: 20 August 2013
1053-5888/13/$31.00©2013IEEE
IEEE SIGNAL PROCESSING MAGAZINE [129] september 2013

Kinect is a device for human–machine interaction, which adds two more input modalities to the palette of the user interface designer: gestures and speech [1]. Kinect (Figure 1) is transforming how people interact with computers, kiosks, and other motion-controlled devices [2] from fun applications like playing a virtual violin [3], to applications in health care [4] and physical therapy [5], retail [6], [7], education [8], and training [9]. Kinect was released in 2010 as an accessory for the gaming console Xbox [10]. In 2011 the first public beta Kinect for Windows Development Kit (KDK) was released with a set of drivers, processing blocks, code samples, and sample applications. Version 1.7 of the KDK was released 18 March 2013 [11].

[Fig1] Kinect for Windows device. (Photo in public domain, used courtesy of Microsoft Corporation.)

The KDK is a toolbox for designing applications with human–machine interfaces that include gesture and speech. The Kinect for Windows SDK and toolkit contain drivers, tools, application programming interfaces, device interfaces, and code samples to simplify development of applications with Kinect. The KDK has integrated skeletal and facial tracking and gesture recognition. Voice recognition adds an additional dimension of human comprehension, while Kinect Fusion reconstructs data into three-dimensional (3-D) models.

Since the KDK provides the raw data (audio, video, and depth), it can also be a workbench to work with various audio processing, speech enhancement, speech recognition, and computer vision algorithms. The toolbox therefore is a platform to create new algorithms and user interfaces and can be a base for teaching digital signal processing, audio processing, computer vision, and building student's projects.

Sensors and drivers
The Kinect sensor contains four components: depth camera, color camera, four element microphone array, and tilting mechanism. On the computer it appears as a standard USB 2.0 device. Because the device requires more power than standard USB ports can supply, it makes use of a connector combining USB communication with additional power.

The depth camera in Kinect works on the unstructured scattered light principle: a projector illuminates the scene with infrared light in a specific pattern, and the image is captured by an infrared camera and converted into a depth image [12]. The depth image is a gray-scale image where each pixel value is not the brightness but the distance to the object [Figure 2(a)]. The depth image is 480 × 640 pixels with 11-bit depth resolution. The Kinect sensor has a “near mode,” which enables the sensor’s camera to see objects as close as 40 cm in front of the device without losing accuracy or precision, with graceful degradation out to 4 m.

The color camera in Xbox Kinect is a regular Web camera at 480 × 640 pixels, 8 bits. The Kinect camera is improved, with a resolution of up to 1,024 × 1,280 pixels and increased sensitivity. Figure 2(c) shows the output of the color camera.

Four supercardioid microphones form a microphone array that operates at 16-kHz sampling rate and 24-bit resolution. The acoustical design uses the enclosure shape to improve the directivity of the microphones. The nonuniform microphone array geometry is optimized to cover the bandwidth of the speech signal.

The purpose of the tilting mechanism in the gaming scenarios is to move the camera up and down so it can see both the kids and their parents. The range of tilting the camera is ±27°.

Processing modules in KDK
KDK provides sample code and processing modules for skeletal tracking, audio processing pipeline and speech recognizer, face detecting and tracking module, gesture recognition, and 3-D object scanning.

The skeletal tracking module [Figure 2(b)] processes the video stream from the depth camera. After the background is removed, each part of the human figure is matched to the part of the human body (or bodies) to which it belongs [13]. The actual representation of the body pose is a table with the number of the joint and its x, y, and z coordinates.
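The pose table described above, one row per joint with its number and x, y, and z coordinates, can be pictured with a small sketch. The joint numbers, the coordinate values, and the helper names below are illustrative assumptions, not the KDK's actual joint enumeration or API; the example only shows how such a table supports simple geometry on a tracked pose.

```python
import math
from typing import Dict, NamedTuple


class Joint(NamedTuple):
    """One row of the pose table: joint number and 3-D position in meters."""
    index: int
    x: float
    y: float
    z: float


# Hypothetical joint numbers and positions, chosen only for illustration.
HEAD, RIGHT_HAND = 3, 11
pose: Dict[int, Joint] = {
    HEAD: Joint(HEAD, 0.0, 0.6, 2.0),
    RIGHT_HAND: Joint(RIGHT_HAND, 0.3, 0.2, 2.0),
}


def joint_distance(a: Joint, b: Joint) -> float:
    """Euclidean distance between two joints, in the table's units."""
    return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))


def hand_above_head(table: Dict[int, Joint]) -> bool:
    """Toy gesture test: is the right hand higher than the head?"""
    return table[RIGHT_HAND].y > table[HEAD].y
```

Keeping the table keyed by joint number makes per-frame lookups cheap, and a gesture test reduces to comparing a few coordinates between frames.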
[Fig2] (a) Depth image, (b) skeletal tracking visualization, and (c) color image from Kinect for Windows sensor and processing modules. (Images courtesy of Ivan Tashev.)
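Figure 2(a) shows depth rendered as a gray-scale picture. A minimal sketch of such a rendering follows. It assumes the depth frame is already available as rows of distances in millimeters and that a value of 0 marks a pixel with no reading; these conventions, the function name, and the 400 to 4,000 mm display range (chosen to match the 40 cm near-mode limit and the 4 m range quoted in the text) are assumptions for illustration, not the KDK's data format.

```python
from typing import List

NEAR_MM = 400    # assumed nearest displayed distance (40 cm)
FAR_MM = 4000    # assumed farthest displayed distance (4 m)


def depth_to_gray(frame: List[List[int]]) -> List[List[int]]:
    """Map distances in millimeters to 8-bit gray levels (near = bright)."""
    span = FAR_MM - NEAR_MM
    image = []
    for row in frame:
        out_row = []
        for d in row:
            if d <= 0:
                out_row.append(0)  # no reading: draw as black
                continue
            # Clamp to the display range, then scale linearly to 0..255.
            clamped = min(max(d, NEAR_MM), FAR_MM)
            out_row.append(round(255 * (FAR_MM - clamped) / span))
        image.append(out_row)
    return image
```

In this sketch a pixel at the far limit and a pixel with no reading both come out black; a viewer that needs to tell them apart would reserve a separate gray level or color for missing data.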
The skeletal tracking module also has a seating mode that processes only the upper part of the human body (Figure 3). Kinect Interactions use the skeletal information to recognize not only basic gestures but also more complex interactions, such as “push” to select virtual objects and “grip” to pan and scroll. Kinect Interactions recognize up to four hands simultaneously [14].

[Fig3] The skeletal tracking in seating mode. (Image courtesy of Ivan Tashev.)

Part of the KDK is an audio processing pipeline that contains a mono acoustic echo canceller, a sound source localizer, a beamformer, an acoustic echo suppressor, and a noise suppressor [15]. This audio pipeline removes the signal from the loudspeaker, reduces the ambient noise and reverberation, and provides an enhanced audio stream with the voice of the person in front of the Kinect. The data from the sound source localizer is also available, and interfaces are exposed to control the capturing beam direction.

The KDK also comes with Microsoft Speech Server 11.0 with acoustic models trained on the audio pipeline output. The speech recognizer in KDK supports English (United States, Great Britain, Ireland, Australia, New Zealand, Canada), French (France, Canada), German (Germany), Italian (Italy), Japanese (Japan), and Spanish (Spain, Mexico). Overall, the microphone array, audio processing pipeline, and speech recognizer allow the creation of hands-free speech-enabled dialog systems.

The KDK also contains a processing module for detecting faces and face
features on the video stream coming from the RGB camera. This module, called the Face Detection Development Kit (FDK), is able to detect faces and to recognize and track more than 80 specific points on the detected face (Figure 4) for distances from 1 to 3 m, requiring normal lighting conditions to operate [16].

[Fig4] A visualization of the face detection in KDK. (Image courtesy of Ivan Tashev.)

Finally, version 1.7 of the KDK makes it possible to program Kinect applications that reconstruct high-quality 3-D scans of people and objects in real time (Figure 5). The KDK includes graphics processing unit-assisted 3-D object and scene reconstruction in real time and a non-real-time central processing unit-only mode for noninteractive scenarios [17].

[Fig5] Scanning objects in three dimensions with Kinect. (Images courtesy of Ivan Tashev.)

Code samples
In addition to the drivers and processing modules, KDK provides code samples that use these processing modules. Most of the applications are provided in C/C++, C#, and Visual Basic. While the majority of them use one processing module and demonstrate one technology, combining the provided sample code for creation of sophisticated applications is quite simple. The KDK also contains Kinect Bridge code samples for interaction with OpenCV and MATLAB. Kinect for Windows code samples are posted on CodePlex [18] and released under the Apache 2.0 license.

Author
Ivan Tashev ([email protected]) is a principal software architect with Microsoft Research in Redmond, Washington, and an affiliate professor at the University of Washington in Seattle.
References
[1] Kinect [Online]. Available: http://en.wikipedia.org/wiki/Kinect
[2] Startups use Kinect to solve problems in surgery, retail, filmmaking and more [Online]. Available: http://www.nbcnews.com/technology/startups-usekinect-solve-problems-surgery-retail-filmmakingmore-853771
[3] Kinectar [Online]. Available: http://ethnotekh.com/software/kinectar/
[4] Applications with Kinect for Windows [Online]. Available: http://apps.after-mouse.com/kinect-forwindows.html
[5] Kinect physical therapy [Online]. Available: http://kinectpt.net/
[6] My-Wardrobe exclusively launches virtual shopping window [Online]. Available: http://www.myretailmedia.com/blog/6396/my-wardrobe_exclusively_launches_virtual_shopping_window.php
[7] Virtual dressing room/interactive mirror Kinect [Online]. Available: http://www.youtube.com/watch?v=UhOzN2z3wtI
[8] Kinect in education [Online]. Available: http://www.kinecteducation.com/
[9] Nike + Kinect training [Online]. Available: http://www.nike.com/us/en_us/c/training/nike-plus-kinect-training
[10] Kinect for Xbox 360 [Online]. Available: http://www.xbox.com/en-US/kinect
[11] Kinect for Windows home page [Online]. Available: http://www.microsoft.com/en-us/kinectforwindows/
[12] PrimeSense [Online]. Available: http://www.primesense.com/
[13] Human pose estimation for Kinect [Online]. Available: http://research.microsoft.com/en-us/projects/vrkinect/default.aspx
[14] Getting started with Kinect Interactions [Online]. Available: http://channel9.msdn.com/coding4fun/kinect/Getting-started-with-Kinect-Interactions
[15] Kinect Audio: preparedness pays off [Online]. Available: http://research.microsoft.com/en-us/news/features/kinectaudio-041311.aspx
[16] Face SDK Beta [Online]. Available: http://research.microsoft.com/en-us/projects/facesdk/
[17] KinectFusion project page [Online]. Available: http://research.microsoft.com/en-us/projects/surfacerecon/
[18] Kinect for Windows on CodePlex [Online]. Available: http://kinectforwindows.codeplex.com/