Platform for Rapid-Prototyping of Computer Vision and Interaction
JAN HUMBLE
Master’s Degree Project Stockholm, Sweden 2005
TRITA-NA-E05016
Numerisk analys och datalogi KTH 100 44 Stockholm
Department of Numerical Analysis and Computer Science Royal Institute of Technology SE-100 44 Stockholm, Sweden
Master’s Thesis in Computer Science (20 credits) at the School of Computer Science and Engineering, Royal Institute of Technology, year 2005. Supervisor at Nada was Kristian Simsarian. Examiner was Yngve Sundblad.
PLATFORM FOR RAPID-PROTOTYPING OF COMPUTER VISION AND INTERACTION
Abstract
This Master's project has considered the development of, and experimentation in, "unencumbered interaction" through the use of camera input. The deliverables include a software library implementing various image processing interaction routines and the results of user interaction with the working system. The implemented image processing routines embody the notion of user interaction in a spatial environment by integrating simple image processing with appropriate and engaging interaction. While most of the image processing techniques are applications of textbook solutions, their inclusion in environments, interaction paradigms, and other settings (virtual and physical) is novel and required investigation and iterative development. The work takes advantage of assumptions about the user and setting, so that fundamental vision problems (e.g. foreground/background segmentation and motion flow direction) fall to the background, while the parameters of user interaction come into focus. The work was carried out primarily in the Java language, the DIVE virtual environment system, and the KidPad cooperative drawing tool. A Java toolbox package, Motion Studio, was developed to ease future prototyping, expansion of the current system, and the implementation of distributed solutions. The work is incorporated into the greater work of two international projects, one building storytelling spaces for children and the other a multi-user information space. The computer vision routines developed form a core set of basic interaction methods applicable in multiple research settings.
RAPID-PROTOTYPING PLATFORM FOR COMPUTER VISION AND INTERACTION
Summary (Sammanfattning)
This degree project concerns development of, and experiments in, "unencumbered interaction" (interaction without unnecessary clutter) through the use of input from cameras. The work includes a software library implementing various image processing routines, together with results from end users' experience. The implemented image processing routines introduce concepts from user interaction in a spatial environment using simple image processing. Although most of the image processing methods we use are taken from the literature in the field, their use in experimental scenarios, in interaction paradigms, and in other environments (virtual and physical) is new, which required further investigation. We take advantage of assumptions about the users and their environment, so that fundamental computer vision problems (e.g. foreground/background segmentation and motion flow direction) recede into the background while the parameters that define the user interaction come into focus. The work was implemented largely in the Java programming language, the DIVE system for creating virtual environments, and the drawing tool KidPad, intended for children working together. A Java toolbox, Motion Studio, was developed to simplify rapid prototyping, extension of the current system, and the development of distributed solutions. The work formed part of two international research projects, one constructing technology-rich rooms for children's collaborative storytelling, the other supporting multiple simultaneous users in digital spaces. The computer vision routines developed constitute a core set of basic interaction methods applicable in several research areas.
Preface
The following is a Master's Thesis report in Computer Science and Human-Computer Interaction for the Royal Institute of Technology (KTH), Department of Numerical Analysis and Computer Science (NADA). The work was done in conjunction with, and at the request of, the Swedish Institute of Computer Science (SICS), within the Interactive Collaborative Environments (ICE) group. It was initially to form part of the work for the KidStory project described below. The software library resulting from the project is entitled Motion Studio and was developed using the Java language and extended libraries for media handling. The examiner at KTH was Prof. Yngve Sundblad. The supervisor at SICS was Dr. Kristian Simsarian. Below are brief introductions to SICS, the ICE group, the KidStory project, and the relevant software systems included in this work.
SICS
The Swedish Institute of Computer Science (SICS) is a non-profit research institute sponsored by the Swedish National Board of Technical and Industrial Development (a governmental agency known as VINNOVA) and by an association of seven major companies: Celsius Tech Systems AB, Ericsson, IBM Svenska AB, Telia AB, Sun Microsystems, FMV (Defence Materiel Administration) and SJ (Swedish Railways). The main goal of SICS is to contribute to the competitive strength of Swedish industry by carrying out advanced research in selected areas of computer science and actively promoting the use of new ideas and results in industry. SICS research and development covers the following areas: distributed and parallel computer systems, high-speed and mobile networks, software methodology, formal design methods, knowledge-based systems, neural networks, multimedia, human-computer interaction, virtual reality, and language technology. SICS personnel are also engaged in education at several Swedish universities.
ICE
The Interactive Collaborative Environments Laboratory (ICE) at SICS focuses on the development of future computational environments that will support and promote cooperative interaction. Inhabitants of these environments will require support independently of physical location, as well as interaction and communication with fellow inhabitants both synchronously and asynchronously. Designing the next generation of control and interaction surfaces brings together work on tangible and physical interfaces coupled with distributed virtual reality technology, exploring how a number of heterogeneous devices and interfaces can be used in unison. The main research challenge is the exploration of how future information technology will weave a ubiquitous backdrop to our everyday activities, both at work and at home. This project work fits with these themes through its use of camera-based interaction.
KidStory
KidStory develops currently available technology so that it inherently supports social learning experiences, while concurrently developing novel technologies that can be used in the learning environments of tomorrow. KidStory works with children, educators and researchers from various disciplines throughout the development process, building an interdisciplinary, intergenerational, international design team. http://www.kidstory.org
KidPad: A Collaborative Storytelling Tool for Children
KidPad is a collaborative story-authoring tool for children and part of the work resulting from the KidStory project. It provides basic drawing functionality on a zooming canvas provided by Jazz, an extensible zoomable user interface graphics toolkit in Java. The narrative structure of a story is defined by creating spatial hyperlinks between objects on the canvas. Instead of using a standard WIMP (Windows, Icons, Menus, Pointer) user interface, KidPad uses local tools that can be picked up, used and dropped anywhere on the drawing surface. The local tools interface, together with MID, a Java/Windows 98 library developed at the University of Maryland, allows KidPad to support shoulder-to-shoulder collaboration. If multiple USB mice are connected to the computer, each mouse controls a tool in KidPad, making it possible for several children to create a story together simultaneously. http://www.kidpad.org
DIVE: Taxonomy of a Networked Virtual Environment
DIVE (Distributed Interactive Virtual Environment) (Carlsson and Hagsand 1993; Frécon and Stenius 1998) has been developed at the Swedish Institute of Computer Science. The DIVE system is a toolkit for building distributed VR applications in a heterogeneous network environment. The networking is based on peer-to-peer multicast communication (reliable multicast is used, except for continuous data streams, i.e. video and audio, which use unreliable multicast). http://www.sics.se/dive
Motion Studio
The Motion Studio software package is the product of this Master's Thesis work, incorporating Java classes for interfacing with camera hardware, applying computer vision algorithms, filtering and interpreting output, and interfacing with outside software. It is based on the core Java Media Framework API for interfacing with capture hardware and digital media. Motion Studio supplies real-time computer vision algorithms, software connectivity, and support for distributed network applications that relieve the processing load on a single CPU.
Acknowledgements This work was made possible with the help of the following people: Dr Adrian Bullock, Lennart Fahlén, Pär Hansson, Dr Kristian Simsarian, Olov Ståhl, Anders Wallberg, Karl-Petter Åkesson at SICS in Sweden; Carina Fast, Marita Kjellin, Prof Yngve Sundblad, and Gustav Taxen at the Royal Institute of Technology in Stockholm, Sweden; the KidStory research team at the University of Maryland in the USA and the University of Nottingham in the United Kingdom
Contents

Preface
    DIVE: Taxonomy of a Networked Virtual Environment
    Motion Studio
    Acknowledgements
Introduction
    Application Domain
Background
    Overview of a Real-Time Computer Vision Application
    Brief Analysis of Computer Vision Systems
        Pfinder
        Systems Based On Optical Flow
        Multi-Scale Feature Tracking
    Representation of Action
Technical Platform
    Java as a developing framework for Computer Vision
        Semantics
        Portability
        Robustness
        Networking
        Open Resources
        User Interfaces
    Practical Technology Overview
        Java Imaging Packages
    Motion Studio – The Java Computer Vision Application and Library
        Handling the Pixel Data
        Temporal Buffers
        Codecs
        Hotspots
        Results & Connectivity
        Filtering and Interpreting Output
    Gesture
        Detection of Presence
        Motion in image
        Motion flow for behaviour specification
        Mass of Object/Agent
        Colour, position and mass
        Shape of Object
Applications and Users
    Integration with Dive
        Setup
    Integration with KidPad
    Work with Child Users
        Set-up for Drawing with Tangible
        User Feedback
    Current Research
Future Aspects
Conclusion
Appendices
    Computer Vision
        Human Vision
        Image Representation
        Convolution
        Filtering
        Segmentation
        Histogram
        Edge Detection
        Edgel Detection
        Finding Connected Components
        Algorithm: Recursive Connected Components Algorithm
        Sequential Algorithm
        Sequential Connected Components Algorithm using 4-connectivity
        The Equivalence Table
        Colour Models
    Technical Issues
        Understanding JMF
        Processors
        Presentation Controls
        Standard User Interface Components
        Controller Events
        Processing
        Processor States
        Methods Available in Each Processor State
        Processing Controls
        Data Output
        Capture
    Images: KidStory Sessions
References
Introduction
As technology infiltrates our everyday surroundings we explore new ways to interact with it. New scenarios emerge with the arrival of smaller, more efficient, and perhaps unexpected technology. The concepts of tangible and non-tangible interfaces come forward, and new forms of human-computer interaction become more interesting. A great deal of HCI research is concerned with understanding, and in some cases developing, new forms of interface with the computer. It becomes necessary to implement new computer vision algorithms, or variations on existing ones, on envisioned devices, whether these are of purely academic interest or steps towards viable future devices. Developing such configurations is time consuming, and a considerable amount of time is spent reconfiguring and optimizing the system. The software may need to be deployable on a variety of target hardware, depending on what the research institution has access to or what is available to the end users. We therefore opt for rapid ways of prototyping interactive tools, in order to evaluate them and acquire some sense of their usability and value. This paper focuses on the development of software that aids the prototyping of interfaces based on computer vision technology. We aim to create a platform which is easily adaptable to a range of hardware, connectable to several research applications, and open to expansion for future needs.
Application Domain
Our concern is to develop an easy-to-operate software platform for building ubiquitous interfaces to readily available applications. Although a vast range of computer vision software exists, it tends to be tied to specific hardware components. Research projects requiring motion or presence detection as a means of interaction often have short lifespans, making it undesirable to spend time and budget developing the computer vision software from scratch, or even adapting other systems to the requirements at hand. We also tend to be limited to the hardware components already available, for several reasons. For example, the KidStory project [1] wished to make use of the PCs already deployed in the participating schools, thereby avoiding the introduction of delicate and expensive hardware. The goal of this work is to develop a set of tools and a portable application that serve as a base for implementing computer vision algorithms and integrating the output into functional applications. We look into solutions for achieving real-time results through code optimisation and distributed applications. Concepts of gesture recognition and interpretation, motion detection, and general machine vision are discussed. This discussion provides an overview of the limitations and potential of a system comprised of present standard components, and demonstrates how a useful user interface can be developed despite these limitations. This paper presents a software platform and algorithmic solutions for developing easily deployable computer vision applications. The platform is written entirely in the Java language, in order to support the most widely available operating systems and video hardware components. Alongside the technical aspects presented here, we discuss some of the current approaches and notions in the examination of physical motion, and present a couple of implementations based on the platform within the context of research projects. We establish a framework for developing future applications in the domain of unencumbered interaction, and consider different approaches towards implementing a tracking system. Part of this project was also to see how efficient a portable system with such practical functionality could be made.
Considerable care has been taken to optimise the Java code developed, and this paper is therefore also intended as a reference handbook for anyone interested in developing Java-based computer vision applications, or subsets of such solutions.
Approach
The work is divided into three main parts: a background study of current computer vision solutions and applications, a description of the core software infrastructure that was developed, and a brief description of how we incorporated our system into existing applications. Preliminary research was made into the current status of camera tracking devices for unencumbered interaction, including their limitations and the areas in which they do not succeed. Several systems, both experimental and applied, seem to work reasonably well; however, they rely on expensive, task-specific hardware. The core of any tracking system is image processing, and thus much of the report is devoted to different computer vision algorithms. We discuss the limitations and boundaries of such algorithms, and consider only those that support a real-time system, although the platform design also allows non-real-time applications to be implemented with ease.

We started with the development of a working platform for interfacing with a camera, capturing images, and subsequently processing them. Java was chosen since it provided a ready-made hardware-interfacing package, the Java Media Framework (JMF) [2], making prototyping easier at the cost of optimal system speed. It allowed for operating system independence, useful since we needed an easily transportable and portable set-up. The JMF package was only available as a beta release at the beginning of this project, and thus full functionality was not achieved in the first stages; later releases provided more stability and resolved some major issues in the video capture and processing operations. Java also provided easy socket communication, useful in cases where we needed to resort to set-ups with applications running on multiple machines. We made decisions on which camera set-up to use based on what media types and hardware the Java Media Framework supported. We opted for a mainstream web camera, which kept the system portable and offered an acceptable capture frame rate at lower resolutions. Despite a standard and static set-up, we first developed a simple software library for accessing camera hardware devices, allowing for easy configuration of camera settings and application deployment.

Once we had a stable configuration for acquiring and manipulating video sources, the next task was isolating and handling the video data as quickly and stably as possible. We compared and tested higher-level methods for advanced image handling, such as the Java Advanced Imaging API [3], against interpreting and manipulating the video data at its native level. The latter turned out to be the only option, despite the increase in development time, since no other method at the time allowed real-time computation. Next we introduced a generic way of applying computer vision algorithms to the captured video data stream. A programming interface based on coder-decoder modules, or codecs, was devised to simplify writing, or adopting readily available, algorithmic solutions for gesture recognition or general object tracking. Once an image-processing platform was established, we proceeded to assess which video processing schemes seemed suitable for some of the projects at hand. These schemes, consisting of the aforementioned codecs, video capture configuration software, and basic user and application interfaces, were packaged into what constituted the preliminary Motion Studio software library suite.
For testing the functionality and use of the library, several application interfaces were developed. A tool for tracking objects based on their colour properties (in a specific colour-model space) was used to provide output to other readily available interactive applications, such as KidPad and DIVE. The tools for connecting to these external applications are also part of the library, as are tools for interpreting the basic (mostly two-dimensional) outputs into forms that acquire new meaning in the environments to which they apply. Much care was given to providing an adequate graphical user interface, higher-level methods for connectivity, and input/output from properties files (e.g. saving and loading settings and results). It is relatively easy to create new codecs that apply algorithms to single video frames or sequences of frames, without needing to understand how the JMF package works or how the video stream is interpreted.
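The colour-based tracking mentioned above can be sketched roughly as follows. The HSB thresholds, the packed-RGB frame layout and the class name are illustrative assumptions, not Motion Studio's actual implementation.

```java
// Rough sketch of colour-based tracking: find the centroid of pixels whose hue lies
// in a target band (HSB space). Thresholds and frame layout are assumptions.
import java.awt.Color;

class ColourTracker {
    /** Returns {x, y} centroid of matching pixels, or null if none matched. */
    static double[] track(int[] rgbPixels, int width, int height,
                          float targetHue, float hueTolerance) {
        long sumX = 0, sumY = 0, count = 0;
        float[] hsb = new float[3];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = rgbPixels[y * width + x];                 // packed 0xRRGGBB pixel
                Color.RGBtoHSB((rgb >> 16) & 0xff, (rgb >> 8) & 0xff, rgb & 0xff, hsb);
                // Accept saturated, reasonably bright pixels whose hue is near the target
                // (hue wrap-around near red is ignored in this sketch).
                if (hsb[1] > 0.4f && hsb[2] > 0.2f
                        && Math.abs(hsb[0] - targetHue) < hueTolerance) {
                    sumX += x; sumY += y; count++;
                }
            }
        }
        return count == 0 ? null : new double[] { (double) sumX / count, (double) sumY / count };
    }
}
```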
Report Overview
This paper is divided into three parts, starting with a background study and finishing with real-world usage of the developed tools. The first part gives a background overview of existing computer vision technologies and approaches, and a discussion of the concepts of motion. This mainly establishes a base of knowledge in the domain to assist in the design decisions for our proposed system. The second part introduces the developed software, which forms the package that we now call Motion Studio. We describe the various features the system provides to ease, and in some cases permit, the implementation of unencumbered interaction prototypes. Since the time scale of this work did not permit a proper iterative process, in which feedback from actual usage and real-life scenarios drives development, we aimed to provide the necessary core tools from the very beginning. Finally we present real implementations coupled to other software tools that could benefit from unencumbered interaction. These were taken to users in an ongoing project, KidStory, in which children used instances of some of the Motion Studio-based tools in several experimental scenarios.
Background
Computer vision is a cross-disciplinary field that involves biological, mechanical and computer-based aspects (Figure 1). While computer vision systems are not replicas of biological vision, an understanding of biological vision can lead to elegant solutions to computer vision problems. For example, receptive fields in human vision are akin to the reduction of information through resolution and clustering/grouping in computer vision. Roberts [4] did the first substantial work on computer vision in the mid-1960s. His work led him to develop a system that could perceive the position and orientation of blocks in a scene. Despite the limitations imposed by 1960s computer technology, the ideas and algorithms he presented are still in use today.
Figure 1: Cross-disciplines for Computer Vision: biological studies, computer science and engineering, electronic engineering, artificial intelligence/cognitive studies, mechanical engineering, robotics, image processing and pattern recognition.
Advanced computer vision applications make use of complicated algorithms, pushing the boundaries of the processing power required for real-time performance. We define real-time as a computational procedure delivering output within milliseconds of the actual video capture. Not only do such applications aim to detect motion per se; they extend to attempting recognition of the scene itself. They tend to segment and recognize the object in question from its background context, greatly increasing the computation required to deliver results. Even the most powerful computers cannot compare to the computational power the human brain brings to cognitive tasks. Although the brain transmits neural impulses far more slowly than today's processor chips, it is nonetheless capable of very effective parallel processing, not to mention the almost immediate activation of a vast knowledge base about the world. A myriad of computer vision approaches exist, addressing different output requirements and demands. They share the limitation of the hardware required to meet their computational requirements in real time. As technology allows for affordable and faster processors, better and more complex analysis can be made of the captured scene.
Overview of a Real-Time Computer Vision Application
A typical real-time computer vision system includes three main components: a video capture device (and driver), the processing unit or algorithm bulk, and the processing of results for a specific task (Figure 2).

Figure 2: Real-time computer vision components: video input, image and computer vision processing, and processing of results.
The video signal from the camera is digitised and sampled onto one or more coordinate grids. Grid values usually hold colour and light intensity information from the sampled video. These grid matrices are then passed as input to the processing unit, where all the relevant feature-extraction algorithms are applied. An intermediate decoding step is required to extract the image data from the video stream; this step is usually handled by software or hardware modules called codecs (coder/decoder). Output from these generic feature extractors is then processed into useful output for other applications to make use of. A simple example is a wave of the hand being recognized and interpreted as a valid gesture, used as a cue to trigger an action in an external application. We intend to construct a platform that makes it easy to instantiate a video capture device, construct algorithms for motion feature extraction, and apply the results to other applications. A substantial number of computer vision systems already exist, and we proceed to review some of them and their approaches to interpreting motion. They were developed by different research groups and cover different problem sub-domains. Such systems in use bring a clearer understanding of the limitations and possibilities of the alternative routes chosen, hopefully allowing us to provide an architecture that facilitates the construction of similar systems if required.
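As a very rough illustration of this three-stage decomposition, the following sketch expresses the stages in Java. The interface and class names are hypothetical and do not correspond to the actual Motion Studio classes described later.

```java
// Illustrative decomposition of the three stages described above; names are hypothetical.
class FeatureResult { double x, y, strength; }          // e.g. position and magnitude of detected motion

interface FrameSource  { int[][] nextFrame(); }                    // digitised, sampled grid (null at end)
interface FeatureCodec { FeatureResult extract(int[][] frame); }   // vision algorithms applied to the grid
interface ResultSink   { void deliver(FeatureResult result); }     // map features onto application actions

class VisionPipeline {
    void run(FrameSource source, FeatureCodec codec, ResultSink sink) {
        int[][] frame;
        while ((frame = source.nextFrame()) != null) {
            sink.deliver(codec.extract(frame));        // capture -> processing -> results
        }
    }
}
```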
Brief Analysis of Computer Vision Systems
The area of computer vision, or machine vision in the robotics community, has been explored many times before. We now present a brief discussion of the most common systems implemented specifically with unencumbered interaction in mind, along with computer vision methods that are worth mentioning. We discuss why we chose, or chose not, to include these ideas in instantiations of our prototypes.
Pfinder
Pfinder [5] is a complete human-form tracking system developed at the Massachusetts Institute of Technology. It not only segments a human subject from the background, but also attempts, with a certain level of success, to recognize the subject's limbs. Pfinder explicitly employs several domain-specific assumptions to make the vision task tractable. When these assumptions break, the system degrades in specific ways. Because of the nature of Pfinder's structure, and since the model of the user is fairly weak, the system degrades gracefully and recovers in two or three frames once the assumption again holds. Pfinder expects the scene to be significantly less dynamic than the user. Although Pfinder can compensate for small or gradual changes in the scene or the lighting, it cannot compensate for large, sudden changes in the scene. If such changes occur, they are likely to be mistakenly considered part of the foreground region, and an attempt will be made to explain them in the user model.
Another limitation, related to the dynamic scene problem, is that the system expects only one user to be in the space. Multiple users do not cause problems in the low-level segmentation or blob tracking algorithms, but do cause significant difficulties for the gesture recognition system that attempts to explain the blob model as a single human figure. We aimed to deploy our own system in dynamic environments, with changing lighting conditions and background features. Pfinder also requires a substantial amount of configuration and is relatively computationally expensive. Although it shows a great deal of promise for full-body tracking, and would in principle be straightforward to port to our codec paradigm, it was not tractable to implement here.
Systems Based On Optical Flow
Optical flow is the distribution of apparent velocities of movement of brightness patterns in an image. Optical flow can arise from relative motion between objects and the viewer [6]. Consequently, optical flow can give important information about the spatial arrangement of the objects viewed and the rate of change of this arrangement [7]. Discontinuities in the optical flow can also help in segmenting images into regions that correspond to different objects [8]. It is not possible to compute optical flow independently at a single point in the image, ignoring neighbouring points, unless additional constraints are introduced [8]. This is because the velocity field at each image point has two components, while the change in image brightness at a point in the image plane due to motion yields only one constraint. Consider, for example, a patch of a pattern where brightness varies as a function of one image coordinate but not the other. Movement of the pattern in one direction alters the brightness at a particular point, but motion in the other direction yields no change; thus components of movement in the latter direction cannot be computed locally. Additional constraints must be introduced to fully determine the flow. This can be worked around using an assumption of smoothness in the flow (not always in accordance with reality). However, even under the most convenient of such assumptions, it is still computationally demanding. Optical flow has been used for gesture recognition [10], in which gestures are recognized using a rule-based technique based on characteristics of the motion blobs, such as relative motion and size.
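The single local constraint discussed above can be written out explicitly. In the standard textbook brightness-constancy formulation (a sketch, not quoted from the cited works), the image brightness $I(x, y, t)$ of a moving point is assumed constant along its motion, so

$$\frac{dI}{dt} = I_x u + I_y v + I_t = 0,$$

where $I_x$, $I_y$ and $I_t$ are the partial derivatives of brightness with respect to $x$, $y$ and $t$, and $(u, v)$ is the optical flow at that pixel. This is one linear equation in two unknowns, so only the flow component along the brightness gradient is determined locally (the aperture problem); the smoothness assumption mentioned above supplies the missing constraint.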
Multi-Scale Feature Tracking
Complex and more precise object feature tracking entails subdividing the image subject into different context-related parts. For example, work done by Bretzner [11] studies how different features show up, and can be made use of, at different levels of resolution and abstraction. Consider the representation of human hand features, which ranges from discerning just the hand shape at low resolutions to full digit identification and modelling at higher resolutions. This system also presupposes a notion of context for feature tracking, e.g. the hierarchical structure of the human hand. Discerning the hand shape and extracting finger articulation requires considerable computation; at the time of writing this is too processor-intensive to run in real time on the hardware we target. Nonetheless, it represents the proper approach to recognizing human gestures for unencumbered interaction. Human gesture recognition needs to consider the frame of reference for articulation in order to uniquely identify movement and flow from the desired physical human source (such as finger movement relative to hand position, and so on). For our purposes it at least demonstrates the considerations one has to make to differentiate arbitrary movements, or even noise, from proper human gestures.
Representation of Action
Although this report does not delve into the depths of how to interpret actions from the results of a video-processing source, we do discuss some of the topics that come to mind. Since, to this
date, no processing unit has usable intelligence to derive meaning from contextual gestures, and we argue that some questions remain open for discussion. The final report of the 1997 NSF/DARPA Perception of Action Workshop [12] addresses these basic questions and covers enough ground for a brief summary of the relevant information.

Representing actions or gestures in a Euclidean coordinate space has its limitations. We need to incorporate hierarchical descriptions, which are mostly context dependent. The granularity may or may not change with the scale at which the containing space is represented. It is a complex and non-trivial matter to determine the many parameters that come into image feature extraction. It is not up to us to assess all of these, but to provide the tools to apply one's own paradigms of action.

Most methods for machine perception take either a completely data-driven approach or an approach based on a known structure or a model representation of action. The data-driven methods try to define robust methodologies for tracking actions so that these actions can be perceived and recognized. No prior assumptions about what is being perceived are made, and the hope is that the action has a unique, inherent and implicit representation that completely defines it and can then be used for recognition. As much as we would like to believe in the "pureness" of this methodology, we have to question the lack of a true representation that it relies on for recognition. One way of addressing the lack of an explicit representation is to build a structure or a model of what needs to be perceived. A limitation of this is how to deal with events and actions that we did not have the insight to incorporate into our a priori models. Additionally, building in a whole repertoire of actions and their explicit models is by no means a trivial problem. We might as well develop a system that performs robust search over a very large space of possible solutions (it matters little whether the solutions are detailed model-based representations or just data-specific estimates based on probabilities of captured signals).

Overall, the goal is to extract sensory information that can be used to label certain events in the scene. Capturing and labelling an event may not be exactly what we need for perception of action, as an action could be a series of related events or sometimes just a single event (i.e. walking vs. pointing). This causes a major problem in how to represent actions (or how to define an action) for recognition. Additionally, it suggests the importance of time in the representation of actions. It is for this reason that some sort of spatio-temporal reasoning needs to be incorporated into our representations. Such spatio-temporal reasoning can be incorporated by using constructs that change with time, with expectations and probabilities assigned to these constructs so that we can predict the changes and estimate the actions. Perhaps the concepts of physics-based modelling can be employed to develop detailed representations of actions. The major benefit of physics-based methods would be that the variability of time can easily be incorporated into the analysis, allowing for spatio-temporal representations. Animation and mimicking of actions is a very important by-product of this method. Again, the limitation is the domain and the context.
So it seems that we are forced to deal with limited domains and devise methodologies that, for each specific domain and at various levels of detail (in both space and time), allow for a somewhat deep exploration of action interpretations. The type of representation, be it completely data-driven (looking only at signals and inferring actions using probabilistic and/or rule-based classifiers) or model-based (mapping signals onto known and modelled actions), also depends on the task at hand. It is important to note, however, that while our system produces plausible explanations for the motion, there will often be multiple interpretations for a given frame of the sequence. For example, when considering the instantaneous motion of two attached objects, such as a hand lifting a can, we cannot determine which of the two objects is generating the lifting force. While it is possible to reduce ambiguity by integrating inferences over time (such as noticing that the hand is active in earlier frames of the sequence), this is only a partial solution to the problem. In particular, as described in [13], if a behaviour (such as attachment) is observed only when the
objects interact, we will be left with uncertainty about which object (if any) is the "cause" of this interaction. This research raises several points relevant to the issues raised at that workshop. Aside from the fact that some of the processing operations are fundamentally and asymmetrically temporal in nature, these representations lend themselves to the same sorts of pattern matching and analysis techniques that are useful for static imagery (basically various sorts of template matching operations: discretized, warped, inverted, transformed, etc.) and suffer from the same limitations. It is possible to go a very long way with representations of this sort, and a lot of useful and impressive results have been obtained. However, there are certain limitations to the degree of abstraction that can be handled with analogical representations alone. There is far less uniformity of style in the more discrete representations that have been employed, and less close ties to conventional image processing. Some of these are listed below.
Spatio-temporal segmentation Probably the simplest symbolic level, this is analogous to region segmentation - separation is into regions of equivalent [moving] objects and persists through time. This is pretty close to a direct analogue representation.
Labelled moving blobs The above-mentioned segmentation forms the basis for this. Essentially, individual objects are tracked, and simple information (identity, size, speed, etc.) is carried along. This can be quite useful for some sorts of surveillance operations.
Annotated trajectory representations The labelled moving blob representation is modified to contain information about temporal events and interactions between objects, and the time line is made more symbolic. (E.g. Here's a walking man, here he sat down, opened a door, got in a car and drove off). At this point the basic data structure is generally some sort of labelled graph.
Articulated moving object The emphasis here is to fit a physical model more complex than a moving blob to the basic data, e.g. a human puppet. This is potentially useful for making certain sorts of fine gesture distinctions, or recognizing detailed activities like shaking hands or opening a box. Once the fit has been made, the same sort of labelled part and annotated trajectory descriptions can be put on top of the more complex model (e.g. here he picks up a book with his right hand, here she writes a check).
Coordinated activity representations This is the most abstract sort of representation that has been attempted to date. The idea is to represent activities that involve multiple, temporally related components, or multiple, coordinated actors. Examples might include football plays, a man assembling an engine, a construction team building a roadway, etc. Representations again tend to consist of some sort of labelled graph, but now with higher-order constraints on the global structure. This work is still quite preliminary.

The extent of this report covers two-dimensional data extracted from a tracking system. Even though our platform is capable of extracting three-dimensional data, this would require a larger amount of work, not necessarily in developing algorithms, but in setting up a stage in which to perform the tracking, testing and calibrating hardware, and so on. We turn instead to how our limited acquired information can be turned into meaningful interaction.
A simple hand gesture gives us an (x, y) coordinate, a direction and a velocity in the image. A media space is commonly viewed through a flat screen, and thus it seems natural to map the 2-D signal onto a relevant 2-D function. Alternatively, the 2-D output can be interpreted as an extra degree of freedom, by our own decree. As an example, the hand gesture can be mapped onto the rotation of a three-dimensional cube, whose rotation axis is arbitrarily set by the application.
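A minimal sketch of such a mapping is given below; the gain constant and class name are illustrative assumptions, not part of the described applications.

```java
// Minimal sketch of the cube-rotation example above: a 2-D hand displacement is
// mapped onto yaw/pitch increments about axes chosen by the application.
class GestureToRotation {
    private static final double GAIN = 0.5;   // degrees of cube rotation per pixel of hand motion (assumed)
    double yawDegrees, pitchDegrees;

    void update(double dx, double dy) {
        yawDegrees   += GAIN * dx;   // horizontal motion spins the cube about its vertical axis
        pitchDegrees += GAIN * dy;   // vertical motion tilts it about its horizontal axis
    }
}
```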
Technical Platform
Having established a theoretical basis, we concentrated on building our system. After testing the principal building blocks of the hardware interface and the Java Media Framework API, we had a grasp of the boundaries imposed on what could be attempted in order to achieve a certain level of real-time processing. We also had an overview of which algorithms to use and how to apply them, but we begin by briefly explaining why we chose Java as our software base.
Java as a developing framework for Computer Vision
There are several advantages to choosing Java as our target development language. The obvious limitation, however, is processing speed, as computational processes run on an intermediary virtual machine and not necessarily as direct calls to the underlying processing unit. Since we are not primarily concerned with highly optimized release applications but with the rapid prototyping of such, we do not make this a matter of great concern. We can also argue that high performance is achieved by incorporating support for just-in-time translation of portable byte codes into the native machine language of the local host. According to Sun, the performance of translated code is roughly equivalent to the speed of current C and C++ programs. Following the public announcement of Java in early 1995, acceptance among developers has grown very rapidly.
Semantics
Java borrows the familiar syntax of C and C++. Like C++, Java is object-oriented, but it is much simpler than C++ because Java’s designers intentionally discarded redundant language features that were present primarily to support backward compatibility with legacy code. An additional benefit of its simplicity is the small size of its run-time system. Sun reports that the basic interpreter is about 40 Kbytes, and that basic libraries and thread support add approximately 175 Kbytes [14]. The development cycle is much faster because Java supports both interpreted and just-in-time compiled implementations. During development and rapid prototyping, developers save time by using the interpreter.
Portability
Application software is more portable because the Java environment carefully specifies a machine-independent intermediate byte-code representation that can be transferred between heterogeneous network nodes and interpreted or compiled to native code on demand by the local Java run-time environment. Applications are adaptable to changing environments because code modules can be downloaded dynamically from other network nodes without necessitating a system restart.
Robustness
Application software is more robust because Java’s run-time environment provides automatic garbage collection. The Java language has been designed to eliminate the possibility of dangling pointers and memory leaks.
Networking
Networking and multi-threading are cornerstones of the Java language. When we consider that real-time computation of video data is CPU intensive, we are faced with the necessity of
dispersing threads, or even whole applications, onto different machines. There is also the matter of localization: applications may run on other machines and possibly under other operating systems. Java facilitates this task tremendously with readily available and well-functioning socket connectivity and serialization of objects.
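A small sketch of this kind of socket and serialization use follows; the result class, host name and port number are illustrative assumptions, not part of the Motion Studio API.

```java
// Sending a serialized tracking result to another machine over a TCP socket (sketch).
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.net.Socket;

class TrackingResult implements Serializable {
    final int x, y;                                   // tracked position in image coordinates
    TrackingResult(int x, int y) { this.x = x; this.y = y; }
}

class ResultSender {
    public static void main(String[] args) throws Exception {
        Socket socket = new Socket("vision-host", 9000);            // hypothetical host and port
        ObjectOutputStream out = new ObjectOutputStream(socket.getOutputStream());
        out.writeObject(new TrackingResult(160, 120));              // ship a result to the remote application
        out.flush();
        out.close();
        socket.close();
    }
}
```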
Open Resources
A multitude of very useful packages are available for Java, not necessarily part of the standard Java Development Kit (JDK). This permitted an assessment of the readily available components without the need to struggle with hardware incompatibilities and idiosyncrasies. Some of the applications to which we aimed to introduce unencumbered interaction techniques were already written in Java or had Java-friendly interfaces.
User Interfaces
As a final note on the benefits of Java as a development framework, it provides a window toolkit for building graphical user interfaces (GUIs) that is flexible for several technical reasons. The GUI becomes an important part of an application that requires continuous real-time refinement of parameters to achieve acceptable results with video capture and computer vision processing.
Practical Technology Overview
Concentrating on developing the computer vision software platform, we establish a work environment that makes the process as painless as possible. This section describes the hardware and software tools available and capable of achieving our goals. In the process we discard one or two that fail to meet requirements, despite their potential.

At present, Intel processor-based PCs running the Windows 98/NT/2000/XP operating systems are the clear market leader in both home and business computer use. Utilizing the ubiquitous PC for real-time computer vision applications is appealing for both users and developers. For the user, utilizing existing inexpensive hardware and operating systems is attractive for both economic and integration reasons. For the developer, the success of the PC has produced many useful developer tools and hardware peripherals that can be utilized for developing a real-time computer vision application. However, effectively utilizing the power of this hardware requires a great deal of low-level understanding of the hardware and low-level programming. At the same time, to keep a system portable, we must distance ourselves from hardware-specific code as much as possible without losing efficiency.

We utilized the Microsoft Windows NT/2000 operating system for our real-time computer vision application development. The main reason is that the tools for our interpretation of action, such as DIVE and KidPad, are all ported to this platform. The available solutions for interconnecting several machines were also easier, and at times only possible, on NT. The reasons for choosing Windows NT/2000 over the then more popular Windows 9x were: (1) Windows NT supports multiple processors; (2) Windows NT has a better architecture for real-time applications than Windows 9x [15]; (3) Windows NT is more robust than Windows 9x. The reason for choosing Windows NT over Linux is that Windows NT provides many software development tools that are not available on Linux, such as camera driver interface APIs. In addition, Windows NT has a larger number of hardware peripherals available (with supported drivers), particularly frame grabbers (essential for most real-time vision applications).

Capturing live video is an essential part of most real-time computer vision applications. Windows provides the Video for Windows (VFW) interface as a standard for video input. While VFW may suffice for some applications, it has some efficiency problems. Specifically,
VFW drivers perform memory copies on the captured images, instead of transferring images directly to DMA image buffers and making these buffers available to the user. The result of this inefficiency is wasted CPU cycles and dropped frames during video captures. In Windows 2000 (and Windows 9x), VFW is succeeded by WDM Video Capture [16], which alleviates many of the problems of VFW. In WDM Video Capture, images are transferred directly to a circular DMA buffer, and user interrupts are triggered when an image capture completes. WDM Video Capture uses the DirectShow interface to provide compatibility with many third-party applications.
Figure 3: Logitech low-cost webcam.
A benefit of using Windows (for those with an ongoing love-hate relationship with Microsoft products) is the vast array of video capture devices compatible with it. We nonetheless limited ourselves to web cameras (parallel- and USB-connected), as we envisioned deployment on low-tech gear and cheap, robust hardware (Figure 3). The initial development platforms were the Windows versions that were, for our purposes, the most stable (NT, 2000 and recently XP). Note, however, that the final tool releases are intended for multiple platforms; the choice nonetheless saved essential development time by sparing us system crashes and curious behaviour during the initial stages of code testing. At the least, when these did occur we could easily establish their cause.
Java Imaging Packages
The most obvious solution for implementing our imaging algorithms would seem to be the use of readily available Java imaging packages. The Java Advanced Imaging (JAI) software library from Sun Microsystems [3] is the most common of these. The problem shared by JAI and all other packages we examined was the absence of a real-time model for image processing. Basically, they are designed to process static images (usually a BufferedImage type) and not video streams. Although we could live with the time cost of extracting a snapshot of video into a manageable format, latencies in subsequent processing, or even in visual feedback of the output, would have made this prohibitive for real-time applications. We need to establish a model which handles the video stream data directly, in a format as close as possible to its native one, but at the same time accessible through a comprehensible API. This led to the set of software components which make up Motion Studio, described further in detail below.
Motion Studio – The Java Computer Vision Application and Library

Motion Studio is the name given to the set of software components intended for the development of computer vision solutions for unencumbered interaction. At its core is the direct handling and manipulation of raw video stream data and chaining of video processing coders/decoders (commonly known as codecs). It builds on the Java Media Framework (JMF) platform for handling the hardware and video streams.
The package currently includes classes and methods for:

• Creating data sources from installed and registered JMF-compatible hardware and drivers.
• Handling video formats for the capture device.
• Creating processors for handling the data source.
• Handling, programming and monitoring the processor at different stages of configuration.
• Creating image handling codecs for the processor.
• Extraction and interpretation of codec results, for either local or external use.
• Creating windows for viewing of results.
• Delivering output to other applications, such as DIVE (through JIVE) and KidPad (through sockets).
• Other basic GUIs.
JMF gives us the basic application programming interface (API) to manipulate the captured video data. A processor object creates data sources from video capture devices and data sinks for monitoring or retransmission (Appendix: Understanding JMF). Processors allow us to introduce codecs that manipulate the video stream data between these sources and sinks. The video stream is interpreted at a basic level in order to acquire the fastest performance possible; that is, no other image handling software packages were used (e.g. Java Advanced Imaging). The video streams are delivered in byte arrays, out of which the pixel colour values have to be extracted depending on the video format and colour model.
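As an illustration of this pipeline, the sketch below shows how a JMF capture device can be wired to a processor with a custom codec chain. It uses only standard JMF calls (Manager, CaptureDeviceManager, TrackControl); the MyTrackingCodec class stands in for any Motion Studio codec and is purely hypothetical.

import javax.media.*;
import javax.media.control.TrackControl;
import javax.media.format.VideoFormat;
import java.util.Vector;

public class CapturePipeline {
    public static void main(String[] args) throws Exception {
        // Find the first registered capture device that can deliver RGB video.
        Vector devices = CaptureDeviceManager.getDeviceList(new VideoFormat(VideoFormat.RGB));
        CaptureDeviceInfo device = (CaptureDeviceInfo) devices.firstElement();

        // Create a processor for the device and wait until it is configured.
        Processor processor = Manager.createProcessor(device.getLocator());
        processor.configure();
        while (processor.getState() < Processor.Configured) {
            Thread.sleep(50);   // simple polling; a ControllerListener is the cleaner option
        }

        // Attach a custom codec (a hypothetical Motion Studio-style plug-in) to the video track.
        TrackControl[] tracks = processor.getTrackControls();
        for (TrackControl track : tracks) {
            if (track.getFormat() instanceof VideoFormat) {
                track.setCodecChain(new Codec[] { new MyTrackingCodec() });
            }
        }

        // Realize and start: frames now flow through the codec chain.
        processor.realize();
        while (processor.getState() < Processor.Realized) {
            Thread.sleep(50);
        }
        processor.start();
    }
}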
Handling the Pixel Data

At the core of each codec is the data buffer, which encapsulates the pixel colour intensity information for the different channels. These may be encoded in different colour spaces, resolutions, depths, byte orderings, and so on. Motion Studio provides easy access to the raw encoded data through a pixel coordinate space. JMF, together with any low-cost capture device (e.g. a webcam), supports a limited range of video capture formats, such as 16- and 24-bit RGB. The buffer therefore needs to be interpreted into useful pixel coordinates and colour values. JMF does provide BufferToImage and ImageToBuffer methods to convert between the buffer data and the standard Java AWT image format. However, pixel coordinates still cannot be extracted without a subsequent transformation into, for example, a BufferedImage type carrying the necessary format information. This is a costly process, and a bottleneck in our imaging pipeline. The problem is that, in order to have visual results, we need to convert back into the buffer so that the processor can render the output in a window or deliver the original or modified output to the next codec in the processing pipeline. For this purpose the PixelHandler class was created, which extracts the relevant pixel information directly from the data array inside the video buffer. The PixelHandler class also provides methods for writing directly into the buffer data array (mostly for highlighting algorithm results), without having to convert between formats. Thus we eliminate the bottleneck, at least as much as we can, without having to resort to native methods of the computer system.
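The sketch below illustrates the kind of indexing the PixelHandler performs for a 24-bit RGB buffer; the actual PixelHandler API is not reproduced here, so the class and method names are placeholders. Only documented RGBFormat accessors are used (line stride, pixel stride, channel masks, flipped flag).

import javax.media.Buffer;
import javax.media.Format;
import javax.media.format.RGBFormat;
import java.awt.Dimension;

// Hypothetical stand-in for Motion Studio's PixelHandler: reads one pixel
// out of a raw 24-bit RGB javax.media.Buffer without converting to an AWT image.
public class SimplePixelReader {

    // Returns {r, g, b} for pixel (x, y), assuming byte-packed 24-bit RGB data.
    public static int[] getRGB(Buffer buffer, int x, int y) {
        RGBFormat format = (RGBFormat) buffer.getFormat();
        byte[] data = (byte[]) buffer.getData();

        Dimension size = format.getSize();
        int pixelStride = format.getPixelStride();   // bytes per pixel (3 for 24-bit RGB)
        int lineStride  = format.getLineStride();    // bytes per scan line

        // Many webcam drivers deliver the image bottom-up; honour the flipped flag.
        int row = (format.getFlipped() == Format.TRUE) ? (size.height - 1 - y) : y;
        int base = buffer.getOffset() + row * lineStride + x * pixelStride;

        // For byte-based RGB, the masks give the 1-based byte position of each channel.
        int r = data[base + format.getRedMask()   - 1] & 0xFF;
        int g = data[base + format.getGreenMask() - 1] & 0xFF;
        int b = data[base + format.getBlueMask()  - 1] & 0xFF;
        return new int[] { r, g, b };
    }
}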
Temporal Buffers

Several computer vision algorithms require image information from sequential frames over a time window (e.g. subtraction, optical flow). Motion Studio allows you to specify the size of this window and gives you access to the respective data buffers (Figure 4). In fact, each codec in the pipeline can specify its own window size and cache either the output of the original video source or the output video stream from the previous codec in the processing chain.
Figure 4 Frame buffering for historical processing (video stream in/out, buffered video frames, pixel handler in RGB space)
Codecs

The main approach of Motion Studio is to introduce and manipulate coding/decoding plug-ins (codecs) in the video-processing pipeline. Each codec encapsulates a given procedure, in our case an image handling algorithm. The input and output formats of the video data passing through each codec can be set manually; however, at each stage the output format of a preceding codec must match the input format of the next one in the pipeline. If this is not done manually, the JMF handler attempts to insert appropriate plug-ins from its own package to convert from one format to the next, with, of course, a degradation in performance. Some standard computer vision algorithms are implemented as codecs in the Motion Studio library. These provide a base set of tools for performing basic feature or gesture extraction. Some of these include:

- Subtraction: calculates the difference between subsequent images or from a static background (Figure 5).
- Colour threshold: extracts the coordinates of pixels which match a certain colour criterion (e.g. human skin).
- Blob detection: highlights connected pixels which match a certain criterion (Figure 6).
- Mask operators: apply an arbitrarily sized mask to each image through convolution (Appendix: Computer Vision), e.g. Sobel masks (Figure 7), diffusion, etc.
The input and output buffer data streaming through a codec might end up in a non-referenced memory state, in which case it is free for the VM to garbage collect in order to free resources. This process proves to be too time consuming, especially with high-rate video streams, and can even be a source of memory leaks. An improvement is achieved by swapping the memory references between input and output inside the codecs, eliminating (or at least minimizing) floating garbage. Although memory leaks are improbable in a Java VM environment, the underlying JMF package runs native optimised code, which makes such occurrences more probable.
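The following sketch shows the shape of such a codec's process() method: a frame difference against the previous frame, written into the output buffer's existing array so that no new arrays are allocated per frame, in the spirit of the reference swapping described above. It assumes a single-band, greyscale-like treatment of the raw bytes and omits the remaining Codec interface methods (getSupportedInputFormats(), open(), and so on); it is a sketch of the technique, not Motion Studio's actual Subtraction codec.

import javax.media.Buffer;
import javax.media.PlugIn;

// Sketch of a frame-difference step inside a JMF codec's process() method.
public class SubtractionStep {
    private byte[] previous;   // copy of the last frame seen

    public int process(Buffer in, Buffer out) {
        byte[] inData = (byte[]) in.getData();
        byte[] outData = (byte[]) out.getData();
        if (outData == null || outData.length < in.getLength()) {
            outData = new byte[in.getLength()];        // allocate once, then reuse
        }
        if (previous == null || previous.length != in.getLength()) {
            previous = new byte[in.getLength()];
        }

        // Absolute difference against the previous frame, byte by byte.
        int offset = in.getOffset();
        for (int i = 0; i < in.getLength(); i++) {
            int current = inData[offset + i] & 0xFF;
            int last = previous[i] & 0xFF;
            outData[i] = (byte) Math.abs(current - last);
            previous[i] = (byte) current;
        }

        // Hand the result to the next stage without creating fresh buffers.
        out.setData(outData);
        out.setFormat(in.getFormat());
        out.setLength(in.getLength());
        out.setOffset(0);
        return PlugIn.BUFFER_PROCESSED_OK;
    }
}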
Figure 5 Subtraction Codec on waving hand
Figure 6 Blob Detection Codec for human skin
Each codec, in essence, inherits a base codec in which the basic functionality for handling specific video formats and colour models is implemented. The base codec provides methods for accessing pixel information from the video buffer. It is a good idea to let the codec handle each frame buffer's format individually, rather than in some generalized way, since the video format may change, for whatever reason, at any time. Each buffer coming through the codec therefore has its pixel information handled according to that buffer's video format.
Hotspots

At times we wish to apply a certain vision algorithm to only a part of the image space. For this purpose we define hotspots (either dynamically or predefined) that set boundaries on the region of the image to which the algorithm should be applied (Figure 7). In the special case where only a particular area is of interest, this significantly reduces the computational load. It also gives us the freedom not only to apply sequentially ordered algorithms, but also to apply different algorithms to different parts of the image at each frame time, expanding the possibilities but increasing the need for bookkeeping.
Figure 7 Applying Sobel mask operator on two hotspots
Results & Connectivity

A delicate aspect of the processing thread is that it should not push any results or call methods that disturb its basic computation. In other words, it should not use any push techniques to deliver its output data. Instead, an independent external thread reads, or pulls, the data from the codec class at a suitable frequency (Figure 8). This thread then delivers the results to any listener expecting them.

Figure 8 Local application accessing results from the video processor (local listener application, results extractor, video codec processing thread)
If the listener expecting the results is not running on the same machine then a simple server gathers the results and feeds them to any client connecting to it (Figure 9). Needless to say, Motion Studio provides the toolbox and ready-made expandable modules to accomplish this.
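A minimal version of such a results server is sketched below: a thread polls the codec for its latest result and writes it to every connected socket client. The getLatestResult() accessor and its result type are hypothetical placeholders for whatever a Motion Studio codec exposes; the socket handling itself is plain java.net/java.io.

import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

// Polls a vision codec for results and broadcasts them to connected clients.
public class ResultsServer implements Runnable {

    public interface ResultSource {                 // placeholder for the codec's result accessor
        Serializable getLatestResult();
    }

    private final ResultSource source;
    private final List<ObjectOutputStream> clients = new ArrayList<ObjectOutputStream>();

    public ResultsServer(ResultSource source, int port) throws Exception {
        this.source = source;
        final ServerSocket serverSocket = new ServerSocket(port);
        // Accept clients on a separate thread so polling is never blocked.
        new Thread(new Runnable() {
            public void run() {
                while (true) {
                    try {
                        Socket client = serverSocket.accept();
                        synchronized (clients) {
                            clients.add(new ObjectOutputStream(client.getOutputStream()));
                        }
                    } catch (Exception e) {
                        return;
                    }
                }
            }
        }).start();
    }

    // Pull results at roughly 20 Hz and push them to all listeners.
    public void run() {
        while (true) {
            Serializable result = source.getLatestResult();
            if (result != null) {
                synchronized (clients) {
                    for (ObjectOutputStream out : clients) {
                        try {
                            out.writeObject(result);
                            out.flush();
                        } catch (Exception e) { /* a real system would drop broken connections */ }
                    }
                }
            }
            try { Thread.sleep(50); } catch (InterruptedException e) { return; }
        }
    }
}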
Figure 9 External application accessing results from the video processor via a socket (client listener application, socket, server/results extractor, video codec processing thread)
Running a computer vision application remotely (on another machine and at another location) in this way is useful for a number of reasons. Firstly, the video capture hardware may need to be physically present at a different location than the application making use of it for its interface. Secondly, a machine assigned to the computer vision tasks might not have the processing resources to run both the vision processing and the external application making use of it. Thirdly, one might wish to connect several applications to the same computer vision processing.
Filtering and Interpreting Output

Our image processing algorithms are not stable under all conditions. This becomes apparent when considering changes in illumination, cast shadows, uncalibrated cameras, noise, frame drops, and so on. The list is almost endless, and not all possible evils can be dealt with. Tracking an object held by a human does not produce a smooth output either: hand jerkiness, for whatever reason, will not yield the smooth result one might get from, for example, a mouse. For this purpose a filter add-on is implemented, with the same pipeline and computational logic as its image-processing counterpart. We can define a history window over a range of outputs and apply different methods of noise filtering to it. There is no apparent delay in performing the filtering, since the filtered output is updated after every entry in the history window and the computations are kept minimal. Median and average filters work efficiently enough to provide stability to an otherwise jerky output (Figure 10):

Median filter: arranging the values in the window in increasing order, we extract the middle value and use it as output (Figure 11). This is useful for removing extreme deviations from the average, but does not provide the smoothness of the average filter.

Average filter: the average of all values in the time window (Figure 12). This is the best approach to an overall smooth sequential output. However, it is not always desirable to eliminate the granularity of sharp movements or gestures; a median filter allows for better granularity at the cost of admitting a certain noise level.
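A sketch of the two filters over a sliding window of 2-D positions is given below; the window size and the Point2D representation are illustrative choices, not Motion Studio's actual types.

import java.awt.geom.Point2D;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

// Sliding-window smoothing of tracked 2-D positions.
public class PositionFilter {
    private final Deque<Point2D.Double> window = new ArrayDeque<Point2D.Double>();
    private final int size;

    public PositionFilter(int size) { this.size = size; }

    public void add(double x, double y) {
        window.addLast(new Point2D.Double(x, y));
        if (window.size() > size) window.removeFirst();
    }

    // Average filter: mean of all positions in the window.
    public Point2D.Double average() {
        if (window.isEmpty()) return null;
        double sx = 0, sy = 0;
        for (Point2D.Double p : window) { sx += p.x; sy += p.y; }
        int n = window.size();
        return new Point2D.Double(sx / n, sy / n);
    }

    // Median filter: the x and y medians are taken independently.
    public Point2D.Double median() {
        if (window.isEmpty()) return null;
        List<Double> xs = new ArrayList<Double>();
        List<Double> ys = new ArrayList<Double>();
        for (Point2D.Double p : window) { xs.add(p.x); ys.add(p.y); }
        Collections.sort(xs);
        Collections.sort(ys);
        return new Point2D.Double(xs.get(xs.size() / 2), ys.get(ys.size() / 2));
    }
}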
Figure 10 Noisy image with 100 outputs. Circles mark the average and median values at that instant
Figure 11 Median filtered path of motion
Figure 12 Average filtered path of motion
The filtering component is designed so that any kind of positional value can be inserted into the existing history window or into a new one. Thus the output of any of the filters can be fed back into the same or another filter for subsequent processing. This pipeline, although sequential in nature, is powerful enough to build up complex output filtering. Besides the methods mentioned above, the filtering add-on has additional methods for interpreting patterns of motion, calculating directional vectors over single output pairs or over the whole history range, and the like. A brief summary of the sorts of gestures we envision extracting is given in the next section.
Gesture

The task reduces to the different ways of interpreting the output results. A certain degree of abstraction occurs naturally, as we can now concentrate on the discrete values delivered by Motion Studio. We can isolate several factors and combine them to accomplish different tasks, keeping in mind that each has limits of resolution and accuracy. These notions are implemented as a result object in the different conceivable codecs. How to use them, and in what context, is up to the end-user.
Detection of Presence

This routine has essentially a binary output: whether or not there is someone or something present in front of the camera, i.e. something differing from the background. This is best done using a time difference. A suggested algorithm is the difference of two Gaussian-blurred images over time:

G(I(t2)) - G(I(t1)) => threshold => answer
Motion in Image

This is the next step from the above, where the result is the centroid of the pixels for which the time difference was greatest. The returned values are thus coordinates (or sets of coordinates) representing the largest movement in front of the camera.
Motion Flow for Behaviour Specification

The extension of the above is to extract flow, building up a map over time of the change in coordinates. For example, if a user waves a hand in front of the camera, the motion can be detected and the set of coordinates over time (with, e.g., a time frame of 2 seconds) could become a path specification for simple object behaviour.
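A compact illustration of the presence and motion-centroid routines over two greyscale frames is sketched below; the array-based frame representation, the threshold parameter and the minimum-change rule are assumptions made for the example, and the Gaussian blur suggested above is omitted for brevity.

// Frame differencing on two greyscale frames: presence test plus motion centroid.
public class MotionDetector {

    // Returns null if no significant change; otherwise the centroid {x, y} of the changed pixels.
    public static double[] detect(int[] previous, int[] current, int width, int threshold) {
        long sumX = 0, sumY = 0;
        int changed = 0;
        for (int i = 0; i < current.length; i++) {
            int diff = Math.abs(current[i] - previous[i]);
            if (diff > threshold) {
                sumX += i % width;      // column
                sumY += i / width;      // row
                changed++;
            }
        }
        // Presence: require a minimum number of changed pixels to ignore sensor noise.
        if (changed < current.length / 200) {
            return null;
        }
        return new double[] { (double) sumX / changed, (double) sumY / changed };
    }
}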
Mass of Object/Agent

This routine would return a value that represents some approximation of the size of the agent in front of the camera. Thus a crouching user would give a different result than a user holding out her arms, or than multiple persons being present.
Colour, Position and Mass

This routine would return the "mass" (size) of colour patches (perhaps pre-specified) and the position of the centroid of each colour. For example, a user could hold up a coloured paddle or circle, and the routine would return some rough parameters of its size and colour.
Shape of Object

This is a more advanced routine. It would return a polygon that roughly approximates the shape of the object in front of the camera. Building on the work above, when presence is detected a shape could be extracted based on fore/background assumptions and turned into a polygonal description of the agent. This is what we have referred to as the "cookie cutter".
Applications and Users

Motion Studio has served as the prototyping computer vision platform for projects such as KidStory [1], the Electronic Landscape Project [18], ACCORD [19], and Equator [20]. We show how Motion Studio is used to interact with external applications and how we coupled it to the KidPad and DIVE software. KidPad was the main software application in the KidStory project, and we discuss experiences from using our system with child users. Since the KidStory project our system has formed part of other research initiatives, which we briefly mention.

With the basic video image processing shell ready, we move on to exporting the data to the outside system. Our output is - as in any camera system - imprecise and limited to 2-dimensional spaces. In spite of these drawbacks we shall attempt to make use of it. The MotionInterpreter object extracts, filters, and interprets the results for subsequent listeners to make use of (Figure 13). A standard listener interface is used for external applications to connect to Motion Studio, either in the same virtual machine process, or via a socket connection on another execution thread or machine. A simple painting program was originally conceived to test interaction by tracking colour-coded objects acting as paint brushes. Later on, the interaction was applied to the existing applications DIVE and KidPad.

Figure 13 Event propagation from Motion Studio, via the MotionInterpreter, to applications (Jive/Dive, KidPad, Painter)
We interface with 3-dimensional space in Dive and 2-dimensional space in KidPad, and apply our tracking results in real-time to basic examples. As explained, the event pipeline is filtered and interpreted by the MotionInterpreter class, before calling events in DIVE or KidPad. It is up to the developer to decide at which stage to decipher meaningful output from the raw or filtered data. Implementing a subclass of MotionInterpreter allows one to specify the output format of the data encapsulated in the event the listening applications will receive. For most of our experiments these events hold data in the form of a vector, representing in some way the flow of motion in the given context.
Integration with Dive

The DIVE run-time environment consists of a set of communication processes, running on nodes distributed within a local area network (LAN) or wide area network (WAN). In addition to object and participant management, a set of visualization programs exists for different display devices, ranging from normal monitor displays to head-mounted displays. Application processes typically build their user interfaces by creating the necessary graphical objects, after which they listen to events in the world and react to them according to some control logic. Users in DIVE are represented as actor entities. An actor is a user or an automated process, representing a process-bound entity that performs actions within a world. Dynamic behaviour and interaction of objects can be achieved through JIVE, a Java package to interface with DIVE developed by the ICE lab. JIVE makes use of the Java Native Interface to provide a bridge to
the native C methods available in the DIVE environment. It is not as efficient as a direct C DIVE application, but it provides simple and fast enough access to the common DIVE calls. DIVE worlds are not static, so a thread must be run to keep track of the object to be manipulated at all times. This is particularly necessary while the world is loading, when references to objects are not immediately available even if references to hierarchically higher ones are:

DiveEntity entity = Dive.dcore.findEntityByName(name);

This way we always have a fresh reference to the object entity. The same thread then reads the Motion Studio state to get the current output (a position in our case) and interprets it as some spatial transformation in DIVE. In this way the Motion Studio process is allowed to run without interruption, and independently of the JIVE connection.
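Such a polling thread might look like the sketch below. Only Dive.dcore.findEntityByName() is taken from the JIVE call shown above; the MotionState source and the applyRotation() step are hypothetical placeholders, since the actual JIVE transformation calls are not reproduced here.

// Polls Motion Studio output and applies it to a DIVE entity via JIVE.
public class DiveUpdater implements Runnable {

    public interface MotionState {                  // placeholder for Motion Studio's output accessor
        double[] getPosition();                     // latest filtered {x, y}, or null
    }

    private final MotionState motion;
    private final String entityName;

    public DiveUpdater(MotionState motion, String entityName) {
        this.motion = motion;
        this.entityName = entityName;
    }

    public void run() {
        while (true) {
            // Re-resolve the entity every cycle: the world may still be loading.
            DiveEntity entity = Dive.dcore.findEntityByName(entityName);
            double[] position = motion.getPosition();
            if (entity != null && position != null) {
                applyRotation(entity, position);    // hypothetical mapping of 2-D motion to rotation
            }
            try { Thread.sleep(100); } catch (InterruptedException e) { return; }
        }
    }

    private void applyRotation(DiveEntity entity, double[] position) {
        // The concrete JIVE call for rotating an entity is omitted here; in our setup
        // the 2-D motion vector was converted into a rotation of the object.
    }
}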
Setup

We had a DIVE application running on a standard PC with a 3-D graphics card, and a Motion Studio application running on a Pentium II 333 MHz machine. The processor pipeline was set up in two different ways:

• A codec tracking human skin features delivers the 2-D position of the centre of mass of the matching skin pixels. As a user waves his or her hand in front of the camera, the difference in position output - a coarse motion flow vector - is broadcast to the DIVE system through JIVE. The 2-D vector data is converted into a rotation of a particular object.

• A codec that computes the difference in pixels over two or more frames roughly estimates the set of pixels in which the flow of motion is greatest. The centre of mass of this pixel set approximates the location of the main change of motion in the captured scene. The difference between successive centre-of-mass positions is then processed in the same way as before, giving an approximation of the speed and direction of motion. This final output of speed and direction of movement is delivered to DIVE as rotational input. Although this assumes that any change in the scene is a potential gesture, an event is only triggered if the range of motion is large enough and inconsistent with an otherwise static scene.
Figure 14 Position output from Motion Studio interpreted as rotational transform in DIVE
Connectivity with the Dive environment proved to be relatively straightforward. Since we did not require precision in movement or in the manipulation of virtual objects, the rudimentary interaction techniques proved satisfactory. They mostly comprised simple triggering actions, such as specifying the speed and angle of rotation of objects (Figure 14). Although our implemented techniques were not sophisticated enough to handle more intelligent gesturing, they made for a complementary object manipulation interface alongside standard keyboard and mouse navigation.
Integration with KidPad

KidPad is a collaborative story-authoring tool for children. It is a drawing application with support for multiple users, zooming, and linking, and it encourages collaborative work. KidPad was used extensively in the KidStory project. Experiments were done on introducing new, engaging interface tools to the application, from which emerged the possibility of designing interfaces with computer vision tie-ins. KidPad is a stand-alone drawing application, and we decided to connect it to Motion Studio by means of a socket. Motion Studio served as a server providing tracking and gesture information as well as basic image grabbing. KidPad connected to it as a client, listening to output either in the form of position events, which were later translated into mouse events on the drawing canvas, or as images captured by the web camera device. Output was delivered to KidPad in two forms: a KidPadObject that encapsulated position and "button" events, and image feeds.

The common setup employed a colour threshold tracking codec (Appendix: Colour Models) together with a blob detection codec (Appendix: Finding Connected Components). The first segmented out the pixels which closely matched the specific objects we had designated to be tracked; these objects would be used as paint brushes in space. The second step in the pipeline applied the blob detection algorithm, which detected conglomerations of these pixels and defined the largest blob as the closest match to the object we wished to track (Figure 15). The position of this closest match is then interpreted as relative 2-D coordinates in space. These coordinates were smoothed with an average filter and passed on to any clients listening on the socket port, in our case the KidPad drawing application. The positions from Motion Studio were mostly extracted from a 320x240 pixel coordinate space, while KidPad's default drawing area was 840x620 pixels. Any undesired sudden positional change is therefore amplified, so shaky outputs had to be filtered out. Triggering the brush in KidPad to actually release colour onto the canvas was done by hiding the object from the camera view and presenting it again. This proved too imprecise to be practical, so we resorted to implementing these events through physical touch pads on the floor.
Figure 15 Tracking an orange
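The mapping from the capture space to the KidPad canvas is a simple scaling, sketched below for the dimensions mentioned above; the class and method names are illustrative only.

// Maps a filtered tracking position from camera space (320x240) to the KidPad canvas (840x620).
public class CanvasMapper {
    public static double[] toCanvas(double x, double y) {
        double scaleX = 840.0 / 320.0;   // jitter in x is amplified roughly 2.6 times
        double scaleY = 620.0 / 240.0;   // and roughly 2.6 times in y, hence the need for filtering
        return new double[] { x * scaleX, y * scaleY };
    }
}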
We also incorporated a simple image feed into KidPad. Motion Studio thus serves as a video capture server for multiple clients and can be used as a data source for video transmitted either through Motion Studio's internal socket mechanism or through the Real-time Transport Protocol (RTP) [17] embedded in JMF. Any object sent through a socket must be serializable, that is, able to be converted into a stream of bytes that can later be deserialized back into a copy of the original object. A Java Abstract Windowing Toolkit (AWT) Image object (the default library class for image representation) is not serializable, and thus must be encoded into some other serializable form; in our case we chose the standard JPEG encoding provided by the Java platform. Motion Studio provides serialization of objects, as well as of images in JPEG format.
ObjectOutputStream out = new ObjectOutputStream(client.getOutputStream());
Image image = ((FrameGrabberCodec) codec).image;
JPEGImageEncoder encoder = JPEGCodec.createJPEGEncoder(out);
encoder.encode(makeBufferedImage(image));

This basic setup was somewhat modified and introduced into the KidStory show-and-tell sessions. In these sessions we brought in prototypes of technology created after participatory design sessions with children. The children would experiment and play with the technology and make their views known through short individual interviews and group discussions. Some experiences and observations from these sessions are discussed below.
Work with Child Users

Clearly, how well a device for human-computer interaction works can only be examined with a panel of end-users. In our case, the KidStory project, in conjunction with the Rågsved School in Stockholm, provided sessions with children aged 6-8 to test some of the prototypes and, hopefully, to provide useful feedback for future development in unencumbered interaction. Children demand clear and intuitive tools if a tool is to hold their attention and be useful, so the group provided the harsh criticism (in their cruel honesty) needed to determine the value of each tool. During the collaborative design sessions, new ideas presented themselves on what new kinds of interaction to develop for them. Most of the ideas and suggestions came from the children themselves, since the premise was for them to be head designers of storytelling and other narrative tools. Not all of these ideas were implemented, mostly because of lack of time or feasibility. Nonetheless, the sessions provided direction on how to approach development and which suggestions to discard as inapplicable.
Set-up for Drawing with a Tangible

We conceived a simple painting-in-air set-up for children to use in conjunction with KidPad. We took a Pentium 333 MHz machine along with a web camera to a story-telling session with the children of KidStory in Stockholm. Motion Studio was calibrated to track an object whose colour characteristics diverged from those of the room environment. A blob detection codec supplied 2-dimensional coordinate results, which were broadcast as events to a running KidPad application projected onto a wall (Figure 16).
Figure 16 Tracking fluffy object to draw in KidPad
Several disadvantages were immediately apparent:

• The use of the tracked drawing object was limited in range to the camera's field of view. It was awkward to keep the object within range without looking at the monitor screen of the capture source.

• It is tiresome to hold the object up in the air for longer periods.

• The position events were not very accurate, and there was some apparent lag in the delivery from the Motion Studio application running on one machine to the KidPad application running on the other.
These limitations, although obvious, did not impede some basic interaction with KidPad. Sessions were no longer than several minutes at a time, and the children changed roles frequently enough not to be tired by holding their limbs in the air or frustrated by the imprecision in the handling of the brush (Figure 17). The projected canvas served more as a graffiti surface than as a target for finely defined artistic strokes.
Figure 17 Tracking orange to draw in KidPad
User Feedback

At the end of every KidStory session some of the children were interviewed as a means of acquiring some perspective and insight on their experiences. They, as designers of narrative tools, provided feedback to us as developers and to themselves as designers in the iterative process which spanned the length of the entire project. These suggestions and criticisms were recorded and taken into consideration for subsequent design and show-and-tell sessions. Some questions concerned the use of a tangible object in front of a camera as the means of using the tools in KidPad. The first set of comments refers to the use of a brightly coloured orange to paint in the air with KidPad:

• "It would be easier to draw directly on the screen." When the children wanted to do precise drawings they preferred a more traditional approach, with the ease of being able to change colours at will, sit on their own, hide their work until completed, and so on.

• "Why can't you use the hand?" Using particular objects as brushes (because of their unique colour features) did not seem all that important to them, and was even felt to be restricting. This might have come from the fact that each child had to wait for his or her turn to use the particular object on the canvas. Nonetheless, using a particular object served as a moderator for the drawing sessions, which would not have been possible when tracking common objects or limbs.

• "It would be easier with a pen." No dispute there.

• "Instead of the orange you can pick up those tools you want to use and draw in the air. What about a red crayon?" No easy means of introducing a new object with unique colour characteristics was implemented, so a somewhat cumbersome calibration process was required before the tracking algorithm could follow a new object with practical results. It would have been a better experience to have introduced an assortment of objects that changed the characteristics of the paint brush (such as colour, roughness, size, etc.).

• "It was fun." It clearly provided a novel (for them) and lively way of drawing, although there was an overall dislike of children painting over others' work because of the lack of control.
A second, slightly different scenario involved capturing images via Motion Studio and incorporating them into KidPad as either a brush texture or a canvas background. Although this did not form part of a computer vision architecture experiment, some of the comments are listed below:

• "Quicker than a digital camera – good."
• "You should be able to use already taken photos."
• "Difficult to draw on the picture, difficult to steer the mouse."
• "Good that the photos came up straight away on the screen."
• "It's nice that you can take a photo of yourself and then write your name and tell about yourself when you were a child."
• "Very funny with the camera in KidPad, a lot better than before."
• "Difficult to draw on your photo with KidPad - e.g. glasses on yourself."
A third method of interaction was a voting experiment based on colour. A group of children each held up a coloured paper of their choosing to the camera, and the colour registering the largest area in Motion Studio decided the outcome. For example, children decided on the outcome of a story by holding up coloured papers representing the different possible paths. This particular experiment does not provide any new insight into the tracking software or the research goal of this report, but it does show how flexible the software and the approach taken are for implementing new forms of interaction at relatively low development cost.

Throughout the run of the KidStory project several other design sessions took place, and many other methods of interaction, not necessarily unencumbered, were tried out. The extent to which Motion Studio and computer-vision-based unencumbered interaction were used was very limited. It is very difficult to create a physical space where conditions for video tracking work well: it requires tweaking of the hardware and tracking algorithms to cope with the ever-changing scene, lighting conditions, and available material, and even the experimental scenario itself might change the paradigm of interaction during the sessions. This made it quite impractical, and perhaps the discipline required to engage with such technology was too much to impose on young children. Another observed behaviour was that tolerance towards imperfections in the system was not always present. This was most obvious when slight irregularities and time lags emerged in the feedback between holding an item and seeing its action on screen. Children tended to lose interest quickly when the system did not consistently show immediate and precise feedback, compared to, say, using a real brush or pencil. Although seemingly obvious, one can argue that video tracking interfaces will not become popular until they provide a clear improvement over traditional physical artefacts.
Current Research

Since the KidStory project, Motion Studio has been used as the prototyping platform for computer vision components in several other research projects. Some of these have used Motion Studio simply to interact with video capture hardware and perform simpler kinds of computation on the video or image data. A couple of examples are ACCORD, a Swedish research project in home technologies, and Equator, an interdisciplinary research initiative in the United Kingdom.
The ACCORD project made use of Motion Studio for capturing still frames along with simple motion and presence detection. This formed one of the components in a set of household technologies made available to home dwellers, the notion being to empower everyday people with the tools and mechanisms to configure and edit services in their homes with currently available technology. The exact services created by connecting different technology components together were left mostly to the users themselves and achieved through different editing paradigms. Equator is an interdisciplinary research collaboration between different institutions in the United Kingdom and continued the work pioneered in ACCORD. Motion Studio forms part of the Equip Component Toolkit (ECT) for developing computer-vision-based components. The components developed for the ECT are more generic in nature than those in ACCORD and allow for the development of more sophisticated applications (at the cost of accessibility to everyday users), perhaps including those entailing unencumbered interaction.
Future Aspects

Computer vision is in many ways still in its infancy, despite the many years of research in the field and the many advances in hardware technology. The main drawback, as previously discussed, is the lack, or the difficulty of obtaining, an autonomous contextual reference from which to extract the relevant features for tracking. Cues in the scene are not always available, and the targeted objects have a tendency to either disappear completely from the scene or change appearance. An example is the tracking and recognition of gestures of the human hand. At certain angles the hand or fingers might be obscured from the camera, only to reappear at a later location. The hand might also be lost against a background with similar colour or texture features, for instance when held in front of similarly pigmented skin, or when making a fist. Computer vision algorithms have a difficult time with these sorts of degenerations of the scene and usually require a contextual reference in order not to break down. A hierarchical model following not only the hand but the arm, and perhaps the body, gives probabilistic and more accurate results in the event of these types of problems. However, it introduces a far greater complexity in the computations required for tracking and engagement.

The use of tangible artefacts for interfacing with computers or virtual systems is at times underrated. Unencumbered interfaces acquire importance only when systems are "intelligent" enough to extract relevant and robust data from context or from consciously defined rules of engagement. Thus, at the moment, finding clever rule-based algorithmic solutions and parameter settings (for both the capture devices and the computer vision algorithms) is a process of experience and observation. A particular algorithm inside a Motion Studio codec might be considered slow and non-real-time at the moment, and we at times opted for other solutions which regrettably rendered coarser results. However, as capture devices deliver higher resolutions, better colour range and higher frame rates at the same or lower cost, and as computers increase their processing speed, these more complex and complete algorithmic solutions become feasible. By then the platform-independent Motion Studio Java package will still function as required, without the need to write or rewrite code to adhere to new hardware or underlying software changes.
Conclusion

Motion Studio delivers a software toolbox for building computer vision applications and solutions for distributed environments accessing them. Motion Studio is modular in design, so introducing more complex behaviour into the output or the codec algorithms is relatively simple and quick. It provides an accessible and flexible software library with the set of features required for rapid prototyping of computer vision solutions. We have discussed how different interpretations of action and engagement form part of models of interaction with engaging software applications; this will hopefully throw some light on what to consider when developing future applications for unencumbered interaction. Despite the limitations of Java Virtual Machine environments in terms of processing and computational speed, we have shown that Java is nonetheless a comfortable and time-saving approach to prototyping technologies. The computer vision field and its video tracking methods present a large set of solutions, each with a substantial number of possible set-up parameters. Motion Studio provides an easy way to experiment with these solutions, modify them to one's needs, and adjust the parameters required to optimise the system. The set of libraries and applications developed for this work was used in several projects, most notably the KidStory project which commissioned it. Several methods of interaction were developed with this library and introduced to the children in the project as part of their design visions. Using our system made it practical to bring to light the requirements and limitations of a full-fledged system for unencumbered interaction. Making use of Motion Studio for easy prototyping saved a substantial amount of time and allowed insight into future approaches.
Appendices

The appendix sections give more detailed information about the different components and background research mentioned in this work. They describe some of the standard concepts and algorithms in the field of computer vision, give a more detailed description of the underlying mechanisms of the Java Media Framework API, and include images from the KidStory design sessions and other technical details.
Computer Vision

Human Vision

The human vision system is the most complex of the senses and, overall, one of the most intricate and mysterious wonders of nature. Yet human vision is vague, disjointed and speculative; many inconsistencies and visual illusions demonstrate how easy it is to mislead the eye. For this reason computer scientists are not endeavouring to mirror the human vision system directly: most technological applications demand perception that is precise and accurate.
Image Representation

Still images can be expressed as a function f(x,y), where x and y are the image coordinates and the value of the function at that point is the image intensity (usually a byte value from 0 to 255). In the case of computer graphics, the images are digital, that is, images which have had a finitely spaced grid superimposed on them. The values in each section of the resulting matrix are then quantized based on the colour depth of the image. The number of segments in this grid is known as the spatial resolution of the image. Typically, the individual segments of this matrix are referred to as "pixel elements" or pixels. Using this notation, a digitized image can be expressed as the graph of its function:
Figure 18 Graph of the function of an image (the original image and the graph of its intensity function)
The sampling resolution of the image is an intrinsic consideration in the design of any imaging system. The higher the resolution, the better the image quality and the accuracy of the analysis; however, this comes at the cost of increased processing time, memory, and storage requirements.
Figure 19 The effect of sampling resolution on a signal (16x16, 32x32, 64x64 and 128x128 samples)
Convolution

Convolution is the operation of computing the weighted integral, or the weighted sum in the case of discrete functions, of one function with respect to another function that has first been reflected about the origin and then variably displaced.

Equation 1 Continuous convolution:
o(x, y) = (i * h)(x, y) = ∫∫ i(a, b) h(x − a, y − b) da db

The two functions are then said to be convolved, and the operation is written o = i*h or, equivalently, o = h*i. In the case of digital images, the functions dealt with are two-dimensional arrays of discrete values. In this case the algorithm used is discrete convolution, which is the weighted sum of one discrete function with respect to another discrete function that has first been reflected about the origin and then variably displaced. If f(m,n) and g(m,n) are two discrete functions, then

Equation 2 Discrete convolution:
(f * g)(m, n) = Σi Σj f(i, j) g(m − i, n − j)

It should be noted that reflection of the function about the origin is redundant when that function is symmetrical about the origin. For the purpose of computer imaging, the functions to be convolved are usually a mask and an image. Convolving a signal with a mask is also called filtering the signal with that mask; when referring to image filtering, the mask is also called the point-spread function of the filter. Unless otherwise noted, in this paper initial reflection about the origin is ignored, and the term convolution is used to describe the filtering of an image (f) with a filter mask (g). The mask can be thought of as the indicated function surrounded by zeros; the nonzero part of the mask is referred to as the kernel of the operator. Convolution, both continuous and discrete, exhibits several important properties. Three of these properties which are relevant to this application are: commutativity, f*g = g*f; associativity, f*(g*h) = (f*g)*h; and distributivity over addition, f*(g+h) = f*g + f*h.
Filtering

The effects of noise on an image can be reduced through filtering, that is, assigning each pixel the weighted sum of the values of its neighbours. This can be expressed through the following convolution:

Equation 3 Filtering convolution:
o(x, y) = Σ (i = −k..k) Σ (j = −k..k) g(i, j) f(x − i, y − j)

where g is the convolution mask, f is the image, and k delimits the size of the neighbourhood over which smoothing is performed. There are essentially two types of convolution masks used in image filtering. In the case of a simple blur, a filter mask of all 1s is used. In the case of a Gaussian blur, the filter mask is shaped like the Gaussian shown in the following figure. A two-dimensional discretized version of the Gaussian is:

Equation 4 Two-dimensional discrete version of the Gaussian:
G(x, y) = (1 / (2πσ²)) e^(−(x² + y²) / (2σ²))

where σ determines the width of the filter. The normalizing factor 1/(2πσ²) makes the integral of the two-dimensional Gaussian equal to one; thus filtering with the Gaussian does not change the overall intensity of the image.
Figure 20 Graph of Gaussian function intensity
The Gaussian exhibits several interesting properties that are relevant to image processing. First, as noted above, it is symmetrical. Also, when convolved with the image, it emphasizes nearby pixels over more distant ones; this property reduces smearing while still maintaining the noise-averaging behaviour. Another important property of the Gaussian is that it never crosses zero, which matters for edge detectors whose smoothing step must not introduce zero crossings of its own.
Figure 21 Effect of Gaussian smoothing (the original image and two Gaussian-blurred versions)
Segmentation

In image processing, it is often desirable to segment an image into discrete parts. Here the goal of segmentation is to break the image into two parts, one being the object to be tracked and the other the background. Since there are only two possibilities, this can be termed binarization. The method by which the image is segmented may vary; in this case, a histogram-based threshold is used. A threshold is a numerical cut-off point: values above the threshold are set to logical TRUE, and values below it are set to logical FALSE. In more complicated applications several thresholds may be used; the image will then not be binarized, but rather separated into several regions. Besides histogram-based thresholding, the image may be thresholded after it has been enhanced, perhaps using some mask such as an edge detector or a motion filtering algorithm.
Histogram

A histogram, by definition, is the frequency distribution of some function; in this case, the function is image intensity. For the purpose of segmentation, the best histograms to use are those of the individual colour channels. There is a definite reason for this: in reducing a colour image to a greyscale image, the intensities of the red, green, and blue colour channels are averaged. Thus it is quite possible that pixels that had completely different colours in the original image are given the same value after greyscaling.

Equation 5 Colour to greyscale conversion:
f(x, y) = ( r(x, y) + g(x, y) + b(x, y) ) / 3

where f(x,y) is the greyscale image and r(x,y), g(x,y), b(x,y) are the red, green and blue colour channels respectively.
Figure 22 Colour based thresholding
Figure 23 Colour based thresholding (the original image, the red channel of the image, and its binarized counterpart)
Edge Detection

When dealing with digital images, it may be useful to locate the edges of an object. To achieve this, an edge detection operation is applied. An edge in an image is an image contour across which the brightness of the image changes abruptly.
Figure 24 Edge profiles: templates for a roof edge (left) and a step edge (right).
Our primary focus will be on step edges. A step edge in an image refers to an image contour across which the underlying intensity function changes abruptly. In edge detector design, it is almost always assumed that the edges are ideal in that the intensities on their two sides are constant. The difference between the intensities on the two sides of the edge is called the edge’s contrast.
Edgel Detection

Due to the complex two-dimensional nature of digital images, most methods of edge detection are limited to the detection of short, discontinuous edge elements, or edgels. There are two primary groups of edgel detectors: difference operators and parametric-model matchers. Difference operators seek to estimate changes in the underlying intensity surface by computing weighted differences between pixels. Parametric-model matchers seek to find edges by comparing the pixel intensities within an image window with some parameterized edgel model.
Table 1 Edge detection masks
In a step edge, the underlying intensity surface has a large gradient pointing across the edge: the values on one side of the edge differ substantially from those on the other side. Operators which detect edgels based on this principle are referred to as gradient operators. The simplest gradient operator is the Roberts' cross operator. Using the two masks shown in Table 1, it calculates the orthogonal directional derivatives Δ1 and Δ2 of the image intensity along the diagonals of the mask. The magnitude of the gradient at the image point where the masks are centred is proportional to sqrt(Δ1² + Δ2²), and the orientation of the gradient with respect to the direction of Δ1 is given by arctan(Δ2/Δ1). Along the step edges of the image, the gradient of the image intensity has a large magnitude. Masks which estimate the image intensity gradient (Δx and Δy) similarly to the Roberts' cross operator, but along the horizontal and vertical axes of the image point at the centre of the mask, are as follows:

Equation 6 Horizontal and vertical difference masks:
Δx:  −1 0 1        Δy:  −1 −1 −1
     −1 0 1              0  0  0
     −1 0 1              1  1  1

Due to their small size, difference operators are inherently susceptible to image noise. As mentioned earlier, a filtering mask applied to the image will reduce the effect of noise on subsequently applied algorithms. In order to save computing time, the Sobel mask can be used. It is derived by convolving a filtering mask with a difference mask:

Equation 7 Derivation of the Sobel operator by convolving a difference mask with a filter mask:
  1                      −1 0 1
  2   *  (−1 0 1)  =     −2 0 2
  1                      −1 0 1

The mask on the left applies a simple blur (shown here without its scale factor of 1/4). Whereas separate filtering by each of the two masks on the left would require 8 additions and 1 normalization step, the Sobel operator requires only 6 additions and no normalization, because the net effect of the mask is zero. However, as noted by Prewitt, one shortcoming of using a mask whose central element is zero is that it does not take into account the image point at the centre of the mask. After the application of edge detection, a threshold must be chosen if segmentation is to occur.
Figure 25 Edge detection output (the original image, and the image after application of the Roberts' cross edge detector and thresholding)
Finding Connected Components

A component-labelling algorithm finds all connected components in an image and assigns a unique label to all points in the same component [21]. In many applications it is desirable to compute characteristics of the components (such as size, position, orientation, and bounding rectangle) while labelling them. There are two algorithms for component labelling: recursive and sequential.
Figure 26 Component labelling in image data
On a sequential processor, a recursive algorithm is very inefficient. Due to this inefficiency, the recursive algorithm is rarely used on general-purpose computers, but it is commonly used on parallel machines.
Algorithm: Recursive Connected Components
1. Scan the image to find an unlabeled 1 pixel and assign it a new label L.
2. Recursively assign the label L to all of its 1 neighbours.
3. Stop if there are no more unlabeled 1 pixels.
4. Go to step 1.
Sequential Algorithm

The sequential algorithm usually requires two passes over the image. Since it works with only two rows of the image at a time, it can be used even when the image is stored in a file and space limitations do not allow the full image to be brought into memory. The algorithm looks at the neighbourhood of a pixel and tries to assign already used labels to a 1 pixel. If two different labels occur in the neighbourhood of a pixel, an equivalence table is kept to record all labels that are equivalent; this table is used in the second pass to assign a unique label to all pixels of a component.

There are three cases of interest when scanning an image left to right and top to bottom. The algorithm looks at only two of a pixel's 4-neighbours: the ones above and to the left of it (both of which have already been seen by the algorithm). If neither of these pixels is 1, then the pixel requires a new label. If only one of the two pixels is 1 and has been assigned a label L, then the pixel is assigned L. If both pixels are 1s and have been assigned the same label L, then the new pixel is also assigned L. However, if the neighbours have been assigned different labels M and N, then the two labels have been used for the same component and must be merged; in this case the pixel is assigned one of the labels, usually the smaller one, and both labels are recorded as equivalent in the equivalence table.

The equivalence table contains the information needed to assign unique labels to each connected component. In the first scan, all labels that belong to one component are declared equivalent. In the second pass, one label from each equivalent set is selected to be assigned to all pixels of a component; usually the smallest label is chosen. After all of the connected components have been found, the equivalence table should be renumbered so that gaps in the labels are eliminated; the image is then rescanned using the equivalence table as a lookup table for renumbering the labels in the image. Area, first moments, and second moments can be calculated for each component as part of the sequential connected components algorithm. Of course, separate variables must be used to accumulate the moment information for each region; when regions are merged, the moment accumulations for the regions are simply added together.
Sequential Connected Components Algorithm using 4-connectivity
1. Scan the image left to right, top to bottom.
2. If the pixel is 1, then:
   a) If only one of its upper and left neighbours has a label, then copy that label.
   b) If both have the same label, then copy that label.
   c) If both have different labels, then copy the upper pixel's label and enter both labels in the equivalence table as equivalent labels.
   d) Otherwise assign a new label to this pixel and enter this label in the equivalence table.
3. If there are more pixels to consider, then go to step 2.
4. Find the lowest label for each equivalent set in the equivalence table.
5. Scan the picture again, replacing each label by the lowest label in its equivalent set.
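A compact two-pass implementation of this algorithm over a binary image is sketched below; the equivalence table is kept as a simple union-find array rather than the hash table described in the next section, a substitution made here for brevity.

// Two-pass connected component labelling with 4-connectivity on a binary image.
public class ConnectedComponents {

    public static int[][] label(int[][] binary) {
        int height = binary.length, width = binary[0].length;
        int[][] labels = new int[height][width];
        int[] parent = new int[height * width / 2 + 2];   // equivalence table as union-find
        int nextLabel = 1;

        // First pass: assign provisional labels and record equivalences.
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                if (binary[y][x] == 0) continue;
                int up = (y > 0) ? labels[y - 1][x] : 0;
                int left = (x > 0) ? labels[y][x - 1] : 0;
                if (up == 0 && left == 0) {
                    parent[nextLabel] = nextLabel;
                    labels[y][x] = nextLabel++;
                } else if (up != 0 && left != 0 && up != left) {
                    labels[y][x] = Math.min(find(parent, up), find(parent, left));
                    union(parent, up, left);              // merge the two equivalent labels
                } else {
                    labels[y][x] = (up != 0) ? up : left;
                }
            }
        }

        // Second pass: replace each provisional label by its set representative.
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
                if (labels[y][x] != 0) labels[y][x] = find(parent, labels[y][x]);
        return labels;
    }

    private static int find(int[] parent, int label) {
        while (parent[label] != label) label = parent[label];
        return label;
    }

    private static void union(int[] parent, int a, int b) {
        int ra = find(parent, a), rb = find(parent, b);
        if (ra != rb) parent[Math.max(ra, rb)] = Math.min(ra, rb);
    }
}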
The Equivalence Table

The equivalence table also has to be efficient, preferably working in linear time; a hash table takes care of this. When a label is created it is inserted into the hash table as both the key and the contained value. For every insertion into the equivalence table we use the left label (the higher value) as the key, and as the contained value the reference to the value referred to by using the up label (the lower value) as a key in the table. That way we always have a reference to the lowest-valued label, even if that should change in later computations.
Colour Models

The colour models most often used for image processing are the RGB, the YIQ (alias YUV), and the HSI models [22]. The YUV model was designed to take advantage of the human visual system's greater sensitivity to changes in luminance than to changes in hue or saturation. The principal advantage of the YUV model in image processing is that the luminance (Y) and the colour information (U and V) are decoupled. Keep in mind that luminance is proportional to the amount of light perceived by the eye. The importance of this decoupling is that the luminance component of an image can be processed without affecting its colour content. For instance, as opposed to the problem with the RGB model, we can apply histogram equalization to a colour image represented in YUV format simply by applying histogram equalization to its Y component; this does not affect the relative colours in the image. It is also particularly useful for tracking objects through colour information, despite variations in luminance.

Equation 8 Conversion from RGB to YUV colour space:
Y = 0.299 R + 0.587 G + 0.114 B
U = 0.596 R − 0.275 G − 0.321 B
V = 0.212 R − 0.523 G + 0.311 B
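In code, the conversion of Equation 8 for a single pixel is a direct matrix multiplication; comparing tracked colours in the U/V plane then makes the match less sensitive to lighting changes, as noted above.

// Converts one RGB pixel (0..255 per channel) to the YUV space of Equation 8.
public class ColourSpace {
    public static double[] rgbToYuv(int r, int g, int b) {
        double y = 0.299 * r + 0.587 * g + 0.114 * b;
        double u = 0.596 * r - 0.275 * g - 0.321 * b;
        double v = 0.212 * r - 0.523 * g + 0.311 * b;
        return new double[] { y, u, v };
    }
}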
Technical Issues

This chapter explains the inner workings of the parts of the Java Media Framework package used in the Motion Studio toolbox. Dealing with video streams is no trivial matter, and careful handling is required if one wants to achieve a reasonably stable tracking system. JMF provides different methods for creating such video streams and for reading from and writing to them. Several problems and shortcomings exist in the basic implementation of JMF, and workarounds had to be created. Most of the tutorial text is taken from the Java Media Framework API Guide by Sun Microsystems, together with comments on how we use the components in our system.
Understanding JMF

The Java Media Framework (JMF) provides a unified architecture and messaging protocol for managing the acquisition, processing, and delivery of time-based media data. JMF is designed to support most standard media content types, such as AIFF, AU, AVI, GSM, MIDI, MPEG, QuickTime, RMF, and WAV. By exploiting the advantages of the Java platform, JMF delivers the promise of "Write Once, Run Anywhere" to developers who want to use media such as audio and video in their Java programs. JMF provides a common cross-platform Java API for accessing underlying media frameworks. JMF implementations can leverage the capabilities of the underlying operating system, while developers can easily create portable Java programs that feature time-based media by writing to the JMF API.
Processors

Processors can also be used to present media data. A Processor is just a specialized type of Player that provides control over what processing is performed on the input media stream. A Processor supports all of the same presentation controls as a Player.
Figure 27 JMF processor model
In addition to rendering media data to presentation devices, a Processor can output media data through a DataSource so that it can be presented by another Player or Processor, further manipulated by another Processor, or delivered to some other destination, such as a file.
Presentation Controls

Additional custom Control types might be supported by a particular Player or Processor implementation to provide other control behaviours and expose custom user interface components. You access these controls through the getControls() method. For example, the CachingControl interface extends Control to provide a mechanism for displaying a download progress bar. If a Player can report its download progress, it implements this interface. To find out if a Player supports CachingControl, you can call getControl(CachingControl) or use getControls() to get a list of all the supported Controls.
Standard User Interface Components
A Player or Processor generally provides two standard user interface components, a visual component and a control-panel component. You can access these Components directly through the getVisualComponent() and getControlPanelComponent() methods. You can also implement custom user interface components, and use the event listener mechanism to determine when they need to be updated.
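A minimal sketch of wiring these components into an AWT frame, assuming the Player has already been realized (the class and helper names are illustrative only, not part of Motion Studio):

    import java.awt.BorderLayout;
    import java.awt.Component;
    import java.awt.Frame;
    import javax.media.Control;
    import javax.media.Player;

    public class PlayerUiSketch {

        /** Adds the Player's standard components to an AWT frame. */
        static void showPlayer(Player player) {
            // The Player must be Realized; otherwise these calls throw NotRealizedError.
            Component video = player.getVisualComponent();        // null for audio-only media
            Component panel = player.getControlPanelComponent();  // transport controls

            Frame frame = new Frame("JMF Player");                // frame title is arbitrary
            frame.setLayout(new BorderLayout());
            if (video != null) frame.add(video, BorderLayout.CENTER);
            if (panel != null) frame.add(panel, BorderLayout.SOUTH);

            // List whatever additional Controls this implementation exposes.
            Control[] controls = player.getControls();
            for (int i = 0; i < controls.length; i++) {
                System.out.println("control: " + controls[i].getClass().getName());
            }

            frame.pack();
            frame.setVisible(true);
        }
    }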
Controller Events
The ControllerEvents posted by a Controller such as a Player or Processor fall into three categories: change notifications, closed events, and transition events.
Change notification events such as RateChangeEvent, DurationUpdateEvent, and FormatChangeEvent indicate that some attribute of the Controller has changed, often in response to a method call. For example, a Player posts a RateChangeEvent when its rate is changed by a call to setRate().
TransitionEvents allow your program to respond to changes in a Controller object's state. A Player posts transition events whenever it moves from one state to another.
ControllerClosedEvents are posted by a Controller when it shuts down. When a Controller posts a ControllerClosedEvent, it is no longer usable. A ControllerErrorEvent is a special case of ControllerClosedEvent. You can listen for ControllerErrorEvents so that your program can respond to Controller malfunctions to minimize the impact on the user.
Figure 28 JMF Events
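A small illustrative listener, handling a few of the event types named above (this is not the listener used in Motion Studio):

    import javax.media.ControllerClosedEvent;
    import javax.media.ControllerErrorEvent;
    import javax.media.ControllerEvent;
    import javax.media.ControllerListener;
    import javax.media.EndOfMediaEvent;
    import javax.media.Player;
    import javax.media.RealizeCompleteEvent;

    /** Minimal listener sketch reacting to a few ControllerEvents. */
    public class EventSketch implements ControllerListener {

        public void controllerUpdate(ControllerEvent event) {
            if (event instanceof RealizeCompleteEvent) {
                // transition event: the Player has reached the Realized state
                System.out.println("realized");
            } else if (event instanceof EndOfMediaEvent) {
                System.out.println("end of media");
            } else if (event instanceof ControllerErrorEvent) {
                // special case of ControllerClosedEvent: the Controller malfunctioned
                System.err.println("error: " + ((ControllerErrorEvent) event).getMessage());
            } else if (event instanceof ControllerClosedEvent) {
                // the Controller is no longer usable
                System.out.println("closed");
            }
        }

        static void attach(Player player) {
            player.addControllerListener(new EventSketch());
        }
    }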
Processing
A Processor is a Player that takes a DataSource as input, performs some user-defined processing on the media data, and then outputs the processed media data.
Figure 29 JMF Processors
A Processor can send the output data to a presentation device or to a DataSource. If the data is sent to a DataSource, that DataSource can be used as the input to another Player or Processor, or as the input to a DataSink. While the processing performed by a Player is predefined by the implementor, a Processor allows the application developer to define the type of processing that is applied to the media data. This enables the application of effects, mixing, and compositing in real-time. The processing of the media data is split into several stages:
Figure 30 Processor stages
Demultiplexing is the process of parsing the input stream. If the stream contains multiple tracks, they are extracted and output separately. For example, a QuickTime file might be demultiplexed into separate audio and video tracks. Demultiplexing is performed automatically whenever the input stream contains multiplexed data.
Pre-Processing is the process of applying effect algorithms to the tracks extracted from the input stream.
Transcoding is the process of converting each track of media data from one format to another. When a data stream is converted from a compressed type to an uncompressed type, it is generally referred to as decoding. Conversely, converting from an uncompressed type to a compressed type is referred to as encoding.
Post-Processing is the process of applying effect algorithms to decoded tracks.
Multiplexing is the process of interleaving the transcoded media tracks into a single output stream. For example, separate audio and video tracks might be multiplexed into a single MPEG-1 data stream. You can specify the data type of the output stream with the Processor setOutputContentDescriptor() method.
Rendering is the process of presenting the media to the user.
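For illustration, the snippet below asks a Processor for raw (non-multiplexed) output; it assumes the Processor has already reached the Configured state and uses the content-descriptor setter exposed on javax.media.Processor in JMF 2.x (setContentDescriptor):

    import javax.media.Processor;
    import javax.media.protocol.ContentDescriptor;

    public class OutputTypeSketch {

        /** Ask the Processor for raw (non-multiplexed) output. */
        static void requestRawOutput(Processor processor) {
            // Legal only once the Processor is Configured (see Table 2).
            processor.setContentDescriptor(new ContentDescriptor(ContentDescriptor.RAW));
        }
    }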
The processing at each stage is performed by a separate processing component. These processing components are JMF plug-ins. If the Processor supports TrackControls, you can select which plug-ins you want to use to process a particular track. There are five types of JMF plug-ins:
Demultiplexer: parses media streams such as WAV, MPEG or QuickTime. If the stream is multiplexed, the separate tracks are extracted.
Effect: performs special effects processing on a track of media data.
Codec: performs data encoding and decoding.
Multiplexer: combines multiple tracks of input data into a single interleaved output stream and delivers the resulting stream as an output DataSource.
Renderer: processes the media data in a track and delivers it to a destination such as a screen or speaker.
Processor States
A Processor has two additional standby states, Configuring and Configured, which occur before the Processor enters the Realizing state.
Figure 31 Processor states
A Processor enters the Configuring state when configure() is called. While the Processor is in the Configuring state, it connects to the DataSource, demultiplexes the input stream, and accesses information about the format of the input data. The Processor moves into the Configured state when it is connected to the DataSource and the data format has been determined. When the Processor reaches the Configured state, a ConfigureCompleteEvent is posted. When realize() is called, the Processor transitions to the Realized state. Once the Processor is Realized, it is fully constructed.
While a Processor is in the Configured state, getTrackControls() can be called to get the TrackControl objects for the individual tracks in the media stream. These TrackControl objects enable you to specify the media processing operations that you want the Processor to perform. Calling realize() directly on an Unrealized Processor automatically transitions it through the Configuring and Configured states to the Realized state. When you do this, you cannot configure the processing options through the TrackControls; the default Processor settings are used. Calls to the TrackControl methods once the Processor is in the Realized state will typically fail, though some Processor implementations might support them.
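A minimal sketch of driving a Processor through these states with a blocking wait (illustrative helper; a real application would add time-outs and error handling):

    import javax.media.ConfigureCompleteEvent;
    import javax.media.ControllerEvent;
    import javax.media.ControllerListener;
    import javax.media.Processor;
    import javax.media.RealizeCompleteEvent;
    import javax.media.control.TrackControl;

    public class ProcessorStateSketch implements ControllerListener {

        private final Object stateLock = new Object();
        private boolean configured = false;
        private boolean realized = false;

        public void controllerUpdate(ControllerEvent event) {
            synchronized (stateLock) {
                if (event instanceof ConfigureCompleteEvent) configured = true;
                if (event instanceof RealizeCompleteEvent) realized = true;
                stateLock.notifyAll();
            }
        }

        /** Drive the Processor from Unrealized through Configured to Realized. */
        public void configureAndRealize(Processor processor) throws InterruptedException {
            processor.addControllerListener(this);

            processor.configure();                        // Unrealized -> Configuring
            synchronized (stateLock) {
                while (!configured) stateLock.wait();     // wait for ConfigureCompleteEvent
            }

            // Only legal from the Configured state on: choose processing per track.
            TrackControl[] tracks = processor.getTrackControls();
            System.out.println("tracks in stream: " + tracks.length);

            processor.realize();                          // Configured -> Realizing -> Realized
            synchronized (stateLock) {
                while (!realized) stateLock.wait();       // wait for RealizeCompleteEvent
            }
        }
    }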
Methods Available in Each Processor State
Since a Processor is a type of Player, the restrictions on when methods can be called on a Player also apply to Processors. Some of the Processor-specific methods also are restricted to particular states. The following table shows the restrictions that apply to a Processor. If you call a method that is illegal in the current state, the Processor throws an error or exception.
Table 2 Method restrictions for processors

Method                         | Unrealized Processor  | Configuring Processor | Configured Processor  | Realized Processor
addController                  | NotRealizedError      | NotRealizedError      | NotRealizedError      | legal
deallocate                     | legal                 | legal                 | legal                 | legal
getControlPanelComponent       | NotRealizedError      | NotRealizedError      | NotRealizedError      | legal
getControls                    | legal                 | legal                 | legal                 | legal
getDataOutput                  | NotRealizedError      | NotRealizedError      | NotRealizedError      | legal
getGainControl                 | NotRealizedError      | NotRealizedError      | NotRealizedError      | legal
getOutputContentDescriptor     | NotConfiguredError    | NotConfiguredError    | legal                 | legal
getStartLatency                | NotRealizedError      | NotRealizedError      | NotRealizedError      | legal
getSupportedContentDescriptors | legal                 | legal                 | legal                 | legal
getTimeBase                    | NotRealizedError      | NotRealizedError      | NotRealizedError      | legal
getTrackControls               | NotConfiguredError    | NotConfiguredError    | legal                 | FormatChangeException
getVisualComponent             | NotRealizedError      | NotRealizedError      | NotRealizedError      | legal
mapToTimeBase                  | ClockStoppedException | ClockStoppedException | ClockStoppedException | ClockStoppedException
realize                        | legal                 | legal                 | legal                 | legal
removeController               | NotRealizedError      | NotRealizedError      | NotRealizedError      | legal
setOutputContentDescriptor     | NotConfiguredError    | NotConfiguredError    | legal                 | FormatChangeException
setMediaTime                   | NotRealizedError      | NotRealizedError      | NotRealizedError      | legal
setRate                        | NotRealizedError      | NotRealizedError      | NotRealizedError      | legal
setStopTime                    | NotRealizedError      | NotRealizedError      | NotRealizedError      | legal
setTimeBase                    | NotRealizedError      | NotRealizedError      | NotRealizedError      | legal
syncStart                      | NotPrefetchedError    | NotPrefetchedError    | NotPrefetchedError    | NotPrefetchedError
Processing Controls
You can control what processing operations the Processor performs on a track through the TrackControl for that track. You call the Processor getTrackControls() method to get the TrackControl objects for all of the tracks in the media stream. Through a TrackControl, you can explicitly select the Effect, Codec, and Renderer plug-ins you want to use for the track. To see what options are available, you can query the PlugInManager to find out which plug-ins are installed. To control the transcoding that is performed on a track by a particular Codec, you can get the Controls associated with the track by calling the TrackControl getControls() method. This method returns the codec controls available for the track, such as BitRateControl and QualityControl. If you know the output data format that you want, you can use the setFormat() method to specify the Format and let the Processor choose an appropriate codec and renderer. Alternatively, you can specify the output format when the Processor is created by using a ProcessorModel. A ProcessorModel defines the input and output requirements for a Processor. When a ProcessorModel is passed to the appropriate Manager create method, the Manager does its best to create a Processor that meets the specified requirements.
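A sketch combining these calls, assuming the Processor is in the Configured state; the helper name and the choice of RGB output are illustrative, not part of Motion Studio:

    import java.util.Vector;
    import javax.media.PlugInManager;
    import javax.media.Processor;
    import javax.media.control.TrackControl;
    import javax.media.format.VideoFormat;

    public class TrackControlSketch {

        /** Ask for RGB on every video track and list installed codecs for it. */
        static void selectRgbVideo(Processor processor) {
            TrackControl[] tracks = processor.getTrackControls();  // Configured state required
            for (int i = 0; i < tracks.length; i++) {
                if (tracks[i].getFormat() instanceof VideoFormat) {
                    // Ask the Processor to deliver this track as uncompressed RGB.
                    tracks[i].setFormat(new VideoFormat(VideoFormat.RGB));

                    // Which installed codecs accept this track's current format?
                    Vector codecs = PlugInManager.getPlugInList(
                            tracks[i].getFormat(), null, PlugInManager.CODEC);
                    System.out.println("candidate codecs: " + codecs);
                }
            }
        }
    }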
Data Output
The getDataOutput() method returns a Processor object's output as a DataSource. This DataSource can be used as the input to another Player or Processor or as the input to a data sink. A Processor object's output DataSource can be of any type: PushDataSource, PushBufferDataSource, PullDataSource, or PullBufferDataSource. Not all Processor objects output data; a Processor can render the processed data instead of outputting it to a DataSource. A Processor that renders the media data is essentially a configurable Player.
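A sketch of feeding the output DataSource into a DataSink that writes to a file; the destination path is only a placeholder and error handling is omitted:

    import javax.media.DataSink;
    import javax.media.Manager;
    import javax.media.MediaLocator;
    import javax.media.Processor;
    import javax.media.protocol.DataSource;

    public class DataOutputSketch {

        /** Write the Processor's output to a file through a DataSink. */
        static void writeOutput(Processor processor) throws Exception {
            // Legal once the Processor is Realized (see Table 2).
            DataSource output = processor.getDataOutput();

            // The destination below is only a placeholder path.
            MediaLocator destination = new MediaLocator("file:///tmp/output.mov");
            DataSink sink = Manager.createDataSink(output, destination);

            sink.open();
            sink.start();
            processor.start();
            // ... wait for an EndOfMediaEvent, then stop and close the sink ...
        }
    }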
Capture
A multimedia capture device can act as a source for multimedia data delivery. For example, a microphone can capture raw audio input, or a digital video capture board might deliver digital video from a camera. Such capture devices are abstracted as DataSources. For example, a device that provides timely delivery of data can be represented as a PushDataSource. Any type of DataSource can be used as a capture DataSource: PushDataSource, PushBufferDataSource, PullDataSource, or PullBufferDataSource. Some devices deliver multiple data streams; for example, an audio/video conferencing board might deliver both an audio and a video stream. The corresponding DataSource can contain multiple SourceStreams that map to the data streams provided by the device.
How can individual decoded video frames be accessed from a Processor while the media is being processed? This is useful, for example, for scanning the decoded data or computing statistics for each video frame. There are basically two ways to do it: 1. get the data from the output DataSource of the Processor; 2. use a pass-through plug-in codec as a callback while individual frames are being processed. We use the second approach:
1. Build the pass-through codec. Specify the input format you want to receive and return it from the codec's getSupportedInputFormats() method.
2. Create a Processor from the input file.
3. Get the TrackControls from the Processor.
4. Set your codec on the video track: TrackControl.setCodecChain(codec_chain).
That way, your codec's process() call will be the "callback" whenever a video frame goes through your plug-in.
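A rough skeleton of such a pass-through codec (not the actual Motion Studio implementation; the class name is hypothetical): it implements javax.media.Codec, inspects each frame inside process(), and hands the buffer on unchanged.

    import javax.media.Buffer;
    import javax.media.Codec;
    import javax.media.Format;
    import javax.media.format.VideoFormat;

    /** Pass-through codec: every video frame is visible in process() before moving on. */
    public class FrameAccessCodec implements Codec {

        private Format inputFormat, outputFormat;

        public Format[] getSupportedInputFormats() {
            return new Format[] { new VideoFormat(null) };   // accept any video format
        }

        public Format[] getSupportedOutputFormats(Format input) {
            return (input == null)
                    ? new Format[] { new VideoFormat(null) }
                    : new Format[] { input };                // output format == input format
        }

        public Format setInputFormat(Format format)  { inputFormat = format;  return format; }
        public Format setOutputFormat(Format format) { outputFormat = format; return format; }

        public int process(Buffer in, Buffer out) {
            // ---- per-frame "callback": inspect in.getData() here ----

            // Swap the data between the buffers so the frame passes through untouched.
            Object data = in.getData();
            in.setData(out.getData());
            out.setData(data);

            out.setFormat(in.getFormat());
            out.setLength(in.getLength());
            out.setOffset(in.getOffset());
            out.setFlags(in.getFlags());
            out.setTimeStamp(in.getTimeStamp());
            return BUFFER_PROCESSED_OK;
        }

        // Remaining PlugIn / Controls methods.
        public String getName() { return "Frame access codec"; }
        public void open()  {}
        public void close() {}
        public void reset() {}
        public Object[] getControls() { return new Object[0]; }
        public Object getControl(String type) { return null; }
    }

Such a codec would then be attached to the video track with trackControl.setCodecChain(new Codec[] { new FrameAccessCodec() }), as in step 4 above.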
Images: KidStory Sessions
The following images document the process of developing the video interaction system alongside, and with the help of, the children in the KidStory project in Stockholm.
Figure 32 Building story machines at Rågsved School
Figure 33 Setup at KTH for KidStory tech session
Figure 34 Grabbing images into KidPad
References
[1] http://www.kidstory.org
[2] Java Media Framework™ API Guide (November 19, 1999) JMF 2.0 FCS. http://java.sun.com/products/java-media/jmf/2.0/documentation.html
[3] Java Advanced Imaging
[4] Roberts reference
[5] Wren, C., Azarbayejani, A., Darrell, T., Pentland, A. (1997) Pfinder: Real-Time Tracking of the Human Body. MIT Media Laboratory Perceptual Computing Section. http://pfinder.www.media.mit.edu/projects/pfinder
[6] Gibson, J.J. (1950) The Perception of the Visual World. Riverside Press, Cambridge; Gibson, J.J. (1966) Boston: Houghton Mifflin.
[7] Gibson, J.J. (1977) The theory of affordances. In Shaw, R. and Bransford, J. (Eds.), Perceiving, Acting and Knowing: Toward an Ecological Psychology, pages 67-82. Hillsdale, NJ: Lawrence Erlbaum Associates Publishers.
[8] Nakayama, K., Loomis, J.M. (1974) Optical velocity patterns, velocity sensitive neurons, and space perception: a hypothesis. Perception 3.
[9] Horn, Berthold K.P., Schunck, Brian G. (1980) Determining Optical Flow. Massachusetts Institute of Technology, Artificial Intelligence Laboratory, A.I. Memo No. 572.
[10] Cutler, R., Turk, M. (1998) View-based Interpretation of Real-Time Optical Flow for Gesture Recognition. Third IEEE International Conference on Automatic Face and Gesture Recognition, April 14-16, 1998, Nara, Japan.
[11] Bretzner, Lars (1999) Multi-Scale Feature Tracking and Motion Estimation. Computer Vision and Active Perception Laboratory (CVAP), Department of Numerical Analysis and Computing Science, Royal Institute of Technology, Stockholm, Sweden.
[12] Essa, Irfan, Nelson, Randal (May 20-22, 1997) Perception of Action. NSF/DARPA Workshop. Excerpts from discussion on Representation of Action. http://vismod.www.media.mit.edu/conferences/nsf-action-97
[13] Mann, R., Jepson, A. (1998) Toward the Computational Perception of Action. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 794-799, Santa Barbara, CA.
[14] Nilsen, Kelvin. Issues in the Design and Implementation of Real-Time Java. http://www.sys-con.com/java/iss1/real.htm
[15] Cutler, Ross, Davis, Larry (1999) Developing Real-Time Computer Vision Applications for Intel Pentium III based Windows NT Workstations. Microsoft Developer Network, 1999. http://msdn.microsoft.com
[16]
[17] Perkins, Colin. RTP: Audio and Video for the Internet. ISBN 0672322498
[18] http://ada.sics.se/newice/archives/cat_projects.html#000292
[19] Humble, J., Crabtree, A., Hemmings, T., Åkesson, K-P., Koleva, B., Rodden, T., Hansson, P. (2003) Playing with the Bits - User-configuration of Ubiquitous Domestic Environments. Proceedings of the Fifth Annual Conference on Ubiquitous Computing (UbiComp 2003), Seattle, Washington, USA, 12-15 October 2003. http://www.sics.se/accord
[20] http://www.equator.ac.uk
[21] Jain, Ramesh, Kasturi, Rangachar, Schunck, Brian G. (1995) Machine Vision. pp. 45-47. ISBN 0070320187
[22] Gonzalez, Rafael C., Woods, Richard E. (1993) Digital Image Processing. pp. 221-251. Addison-Wesley Publishing Company, World Student Series. ISBN 0-201-60078-1