Semantic, Interactive Manipulation of Visual Data

by Alexandre R.J. François


A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

December 2000

Copyright 2000

Alexandre R.J. François

Acknowledgments

I thank my advisor and chairman of my thesis committee, Prof. Gérard Medioni, for supporting and guiding me throughout my graduate studies at USC. Beyond the exceptional quality of his academic guidance, his most valuable gift to me was that of freedom, in all aspects of my life. I thank Prof. Max Nikias and Prof. Ulrich Neumann for making time to serve on my thesis committee, as well as Prof. Ram Nevatia, Prof. Irving Biederman and Prof. Shahram Ghandeharizadeh, who served on my qualifying exam committee. These past years, the Institute for Robotics and Intelligent Systems has become my home, and its members my friends. I could not have hoped for a better place to be and better people to share it with. They are forever part of me. My deepest, eternal gratitude goes to my parents Michèle and Ralph, and my grandparents Marie and Jean, who sacrificed so much for me. I also want to thank my sister Géraldine for coping with me all these years. Across the miles, their constant and unconditional support has made the California sun seem even brighter, and kept me going on cloudy days.


"Picture men dwelling in a sort of subterranean cavern with a long entrance open to the light on its entire width. Conceive them as having their legs and necks fettered from childhood, so that they remain in the same spot, able to look forward only, and prevented by the fetters from turning their heads. Picture further the light from a fire burning higher up and at a distance behind them, and between the fire and the prisoners and above them a road along which a low wall has been built, as the exhibitors of puppet-shows have partitions before the men themselves, above which they show the puppets." "All that I see," he said. "See also, then, men carrying past the wall implements of all kinds that rise above the wall, and human images and shapes of animals as well, wrought in stone and wood and every material.” [...] “Tell me do you think that these men would have seen anything of themselves or of one another except the shadows cast from the fire on the wall of the cave that fronted them?” [...] “And again, would not the same be true of the objects carried past them?"

Plato, Republic [20]


Contents

Acknowledgments
Contents
List of Tables
List of Figures
Abstract

Chapter 1. Introduction
1.1 Motivation and Goal
1.2 Overview of Approach and Outline
1.3 Contributions

Chapter 2. System Architecture
2.1 Introduction
   2.1.1 Related Work in Computer Graphics and Computer Vision
   2.1.2 Related Work in Multimedia
   2.1.3 Visual Programming
   2.1.4 A Unifying Approach
2.2 The Flow Scheduling Framework (FSF)
   2.2.1 Data Streams, Multithreading and Data Flow Design
   2.2.2 FSF Overview
   2.2.3 Data model
   2.2.4 Processing model
   2.2.5 From Components to Applications
   2.2.6 Implementation Issues
2.3 Interactive System Design and Implementation
   2.3.1 Modeling the System
   2.3.2 Visual Data Structures
   2.3.3 Run-Time Type Management
   2.3.4 Visual Interaction
2.4 Summary

Chapter 3. Pixel-Level Representations
3.1 Introduction
3.2 Interactive Image Feature Extraction
   3.2.1 Saliency Computation
   3.2.2 Interactive Feature Extraction
3.3 Background Modeling and Video Stream Segmentation
   3.3.1 Background Pixel Model
   3.3.2 Occluding Pixel Detection
   3.3.3 Background Model Update
   3.3.4 Results and Discussion
3.4 Image Registration with an Instrumented Pan-Tilt-Zoom Camera
   3.4.1 The Pan-Tilt-Zoom Camera Model
   3.4.2 Planar Mosaic Construction
   3.4.3 Implementation and Results
3.5 Summary

Chapter 4. Image-Centered Representations
4.1 Introduction
4.2 Choice of a geometric model: NURBS
   4.2.1 Generalities and Notations
   4.2.2 Projective Properties of NURBS
4.3 NURBS Curve Fitting
   4.3.1 Problem Statement
   4.3.2 Estimation of the number of control points
   4.3.3 Knot values and parameterization of the curve
   4.3.4 Position of the control points
4.4 Hybrid Region Models
   4.4.1 Parameterization and Deformation of a 2-D region
   4.4.2 Photometric Model
   4.4.3 Rendering and Compositing
4.5 Summary

Chapter 5. Object/Scene-Centered Representations
5.1 Introduction
5.2 Symmetries
   5.2.1 Generalities
   5.2.2 Computation
5.3 3-D Surfaces
   5.3.1 Bilateral Symmetry
   5.3.2 Inference of Relative Depth for Symmetric Points
   5.3.3 3-D Edges Computation
   5.3.4 3-D Surface Model Inference
5.4 Volumetric Primitives
   5.4.1 Generalized Cylinders
   5.4.2 Volumetric Inference
5.5 Summary

Chapter 6. Conclusion and Future Research
6.1 Summary of Contribution
6.2 Future Research

Chapter 7. References
Appendix A. Glossary

List of Tables

Table 1. Characteristics of representative Computer Graphics systems
Table 2. Characteristics of representative Computer Vision systems for Computer Graphics applications
Table 3. Characteristics of representative Computer Vision systems and libraries
Table 4. Characteristics of representative Multimedia systems


List of Figures

Figure 1. Different representations of visual information. Image regions can be identified as set of pixels, extracted and recombined (a), or be modeled as the projection of parametric surfaces that can later be deformed (b). Object contours, once extracted (c,d), can be used to infer the 3-D shape of the object (e,f), using geometrical properties of the object itself and of its projection in the images.
Figure 2. Levels of representation of visual data and their relationships for Vision and Graphics.
Figure 3. Example dataflow through a system meeting our goal.
Figure 4. Our modular system software architecture.
Figure 5. Advantage of multithreading for time-critical applications. Processes 2 and 3 are independent; Process 4 depends on both 2 and 3. With a sequential execution model, the processing lag also constrains the achievable rate (inversely proportional to the interval between the completion of the processes of two consecutive time samples). Multithreading separates the lag from the processing rate.
Figure 6. The traditional static dataflow model (a) cannot handle data streams. The straightforward extension using local buffers in processing centers (b), although functional in simple cases, introduces several fundamental problems, such as persistence of computed samples in the system and collection of samples stored in independent buffers for synchronization. Instead, we introduce volatile carriers for synchronous data, called pulses, that are flowing down the stream paths (c). The color coding defined here (green for processing, red for persistent data and blue for volatile data) is used consistently in the remainder of this dissertation.
Figure 7. Generic processing unit: in an Xcell, the information carried by an active stream (carrying active pulses) and a passive stream (carrying passive pulses) is used in a process that may result in the augmentation of the active stream and/or update of the passive stream.
Figure 8. Pulse filtering. To each process are associated input and parameter data types, in the form of substructures called filters, for which the incoming active and passive pulses structures (respectively) are searched. These filtering operations return handles, used during processing for direct access to relevant nodes.

Figure 9. Inside an Xcell: an incoming active pulse is filtered to return a handle to the process input data. The incoming passive pulse is filtered to return a handle to the latest process parameters. Using the current parameters, new data can be produced and added to the active pulse. The process parameter values can be updated on the passive loop. Once the process is completed, the augmented active pulse is transmitted down the active stream.
Figure 10. Parameter feed-back on a passive loop. Since passive and active streams are independent, and identified as inherently different in nature, feed-back "loops" can never produce interlock situations.
Figure 11. Stream routing for parallel execution. Independent cells can be traversed in arbitrary order, and their processes can be executed independently. Stream splitter cells, called Scells, transmit incoming active pulses on both active outputs simultaneously. Reversely, Wcells are similar to dynamic AND gates with time stamp consistency, and wait until they receive the same pulse on both active inputs before sending it down the active output.
Figure 12. Pulse buffering: time referential transformations require the transfer of pulses across different streams. To do so, active pulses must be stored on a passive loop before they can be reintroduced in a different active stream.
Figure 13. Streams synchronization: the pulses from both streams must be buffered in a passive loop structure after undergoing a time transform to place them in a common time referential. Synchronous pulses can be retrieved from the buffer and placed in a same active pulse, after an eventual time transform.
Figure 14. Sources and cells are modeled in the system as special node types. This allows to create and manipulate system elements consistently, as any other data in the system.
Figure 15. The system is modeled and realized in the FSF framework. The system itself is naturally modeled as a special source that holds the (dynamic) System Pulse. In order to be accessible from the system, the System Source is represented by a System Source Object (SSO), itself part of the system data and thus placed in the System Pulse. To separate it from other data, the SSO is made a component of a Group Object called Bridge.
Figure 16. The Command cell provides a processing interface to node type specific methods. It takes as input a command node, that specifies a target node instance by its signature (type and name), an action, defined for the node type (or one of its parent types) to be applied to this particular instance, and a list of parameter values. The execution of the action for all target node instances, filtered in the "document" source on the target stream, produces a set of Output nodes, added to the active stream.
Figure 17. Data organization and access in the system. Ccells have a dedicated target stream called secondary passive stream. The source connected to the secondary passive path of a Ccell can be thought of as a "document" (as opposed to the parameter source on the primary passive path). The Ccell can recursively be connected to any source whose associated source object is held by the current secondary source. This operation is similar to traversing a subdirectory structure, and allows any Ccell to access all the data in a given application space, as defined by a source.
Figure 18. Scripting support through an interpreter and a Ccell. The interpreter cell generates commands from a script and sends them to a Ccell. The Ccell outputs can be monitored by the interpreter cell to enforce command temporal order.
Figure 19. Ccell-based interaction with the FSF processing model. The information to be rendered (in the most general sense) and presented to the user is part or totality of the secondary (target data) and primary (process parameters) passive pulses data, and eventual command output nodes. User input must be collected and translated into commands that are sent to the Ccell. The set of processes used to perform these operations characterize an interaction device.
Figure 20. Example of interface using standard Windows dialogs.
Figure 21. The Interface/Device interaction model introduces the Ccell-derived Interface cell. The input/output processing is hidden in a device node held by the primary source of the interface cell. The device and the interface cell have special communication paths defined by virtual functions.
Figure 22. A more visual presentation of the same pulse data presented in figure 20 (This data is actually the application graph shown in the Console window in figure 33).
Figure 23. Interactive processing through an interface cell. Upon detection of a process triggering event (parameter or input data alteration), the Interaction cell generates an active pulse containing a copy of the current version of the Xcell's input data. This active pulse triggers the Xcell process. After completion, the active pulse, possibly augmented with the Xcell's output data, is collected by the Interface cell, which in turn updates the persistent version of the output held by the document source.

Figure 24. Saliency map computation: starting from an edge image obtained by a classical edge-detector (e.g. a gradient operator followed by thresholding and non-maxima suppression), the tensor voting mechanism produces a dense saliency tensor field.
Figure 25. Saliency information.
Figure 26. Interactive feature extraction: the user specifies a point in the neighborhood of the edge through a simple point-and-click interface. Using the energy map, the closest edge point is then found, and the corresponding edge is automatically grown from that point.
Figure 27. Some edges and junctions extracted interactively in a test image.
Figure 28. Temporal integration of background value for video stream segmentation: for each incoming new frame, each pixel value is compared to the current corresponding model distribution to decide whether it is a measurement of the background or of an occluding element. Pixels marked as occlusions are then grouped into connected components. The resulting labeled components image encodes the segmented occluded areas in the scene. Regions whose size is below a certain threshold are rejected as false positives. The distribution for each background pixel is updated using the latest observation, in order to take into account (slow) changes in the scene background.
Figure 29. The HSV color model.
Figure 30. Update of chromatic information.
Figure 31. Learned background distributions.
Figure 32. Segmentation results in an indoors environment.
Figure 33. Application graph for background segmentation, occlusion frequency and compositing, with results obtained for an outdoors environment.
Figure 34. Players and ball segmentation in a tennis broadcast sequence.
Figure 35. Segmentation result on two frames of the Norway sequence from [55].
Figure 36. The pinhole camera model and perspective projection.
Figure 37. The pan-tilt-zoom camera model.

Figure 38. Reprojection onto a reference image plane in the PTZ model.
Figure 39. Application graphs for mosaicing modules. Two integration modalities exist: spatio-temporal integration (PTZ camera) and spatial integration (camera array).
Figure 40. A mosaic built from a Sony PTZ camera using our algorithm.
Figure 41. NURBS are invariant under perspective transform.
Figure 42. NURBS curve fitting application subgraph.
Figure 43. NURBS fitting on an open curve (a) and a closed curve (b).
Figure 44. Modeling image regions with quadrangle (a) vs. NURBS surfaces (b).
Figure 45. Image region model.
Figure 46. Image region modeled with a NURBS surface defined by four NURBS curves.
Figure 47. Multiscale group relaxation for a 9x9 grid.
Figure 48. Resampling of an image region defined with a NURBS surface.
Figure 49. Extraction and deformation of an image region using the NURBS surface model.
Figure 50. Replaced image regions do not blend well in the resulting image.
Figure 51. The HSV color model.
Figure 52. Setup for interactive HSV histogram-based image transformation.
Figure 53. User interface for interactive HSV histogram-based image transformation.
Figure 54. Lighting intensity adjustment.
Figure 55. White correction transform.
Figure 56. Color adjustment and blurring make the replaced region fit better in the image.

Figure 57. (a) A solid quadrangle is perceived as an obstacle in the middle of the road; (b) The NURBS surface model, with the use of alpha blending, makes the synthetic region blend naturally with the road.
Figure 58. Inserting a synthetic surface in the image can affect the perception of the scene, by introducing perceived landmarks where there are none in the natural environment.
Figure 59. Smooth Local Symmetry definition.
Figure 60. Symmetry axes and symmetry lines computed on various parts.
Figure 61. A bilateral symmetric object, viewed from an arbitrary non-accidental viewpoint, with the skew symmetry direction indicated.
Figure 62. Computation of the z coordinate of two points symmetrical with respect to a symmetry plane making a 45 degree angle with the image plane.
Figure 63. Example of 3-D edges computation using bilateral symmetry.
Figure 64. Two examples of reconstructed 3-D surface models of bilateral symmetric objects.
Figure 65. Generalized Cylinders are defined by an axis (3-D curve), a cross-section (3-D curve) and a sweep function. The volume thus defined is obtained by sweeping the cross-section, scaled according to the sweep function, along the axis. The projection of a GC in an image produces three important curves: two limb edges, and one cross-section.
Figure 66. GC volumetric reconstruction.

Abstract

Digital production of visual content is the object of Computer Graphics (CG). A large number of powerful systems and tools are commercially available. They are often specific to a particular level of representation (e.g. still images, video sequences, 3-D models), or are targeted, either by design or de facto, at one aspect of the production process (e.g. painting, modeling, animation, rendering). In these systems, most operations are performed manually. While the CG artist should keep total control of the creative process, tedious tasks should be performed by the computer. Ideally, a CG artist involved in any step of visual content production should be able to manipulate semantic objects, not only pixels, frames or vertices.

Extracting objects from pixels is the goal of Computer Vision (CV). The emphasis in this field is put on the development of automatic techniques, resulting in systems that are specific to a given task, even to a given set of inputs, and require the setting of numerous, non-intuitive parameters. This approach is difficult to integrate in any human creative process. Meanwhile, the field has produced algorithms that could perform robustly and efficiently many tedious tasks that the CG artist performs manually today. While CV and CG historically evolved as two separate fields, we believe they must play complementary roles in a new approach to visual content production: tedious manual tasks in CG applications can be automated or facilitated, and ambiguities in CV algorithms can be resolved with high-level, intuitive human input.

We present an open, interactive system that allows the user to create, experience and manipulate visual data in a consistent, natural and intuitive way. Visual data is defined as any representation of real or virtual entities primarily experienced visually. The design of such a novel system requires a thorough examination of the new constraints introduced by the goal of bringing more power to the user in an intuitive way, so as not to interfere with the creative process, while preserving the freedom of access and control over information at any level of abstraction. Furthermore, the greatest challenge faced when integrating such a wide variety of functionalities in a useful and usable software system may lie beyond the conceptual and technical specificities of each concerned field or subfield. A central issue in the design of our system is providing universal and extensible data representation and processing, with consistent mechanisms for scheduling, planning and synchronization.

Based on these considerations, we have designed an open, modular software architecture that supports concurrent on-line processing of data streams. We have developed the data structures and processing modules necessary to build a visual environment in which a user can interactively manipulate visual data of various levels of semantic representation, either directly or through programs, and interactively build and exploit applications for automatic, on-line processing of data streams. We have developed a number of modules whose purpose is to produce, automatically or interactively, representations of visual information that allow the user to access its various semantic levels. At the pixel level, we demonstrate user-driven image feature extraction using an energy map produced by a tensor voting technique, automatic real-time segmentation of video streams produced by a fixed camera using a statistical color background model, and mosaicing using the data provided by an instrumented pan-tilt-zoom camera. For image-centered representations, we demonstrate the advantages of Non-Uniform Rational B-Splines (NURBS), present an automatic NURBS curve fitting algorithm, and present a NURBS surface-based deformable image region model. At the object/scene-centered level, we emphasize the importance of symmetries in the recovery of intrinsic 3D structures from image-centered level data in monocular views, and show how bilateral symmetry can be used to recover (textured) 3D surfaces. We also show how, in some cases, more constrained volumetric primitives can be recovered directly.

Our most valuable contribution is the design and implementation of an open, generic interactive system, into which relevant existing CV algorithms will be incorporated to complement the ones we have contributed. We also intend to use this system as a development platform for future research. Furthermore, we believe that our core architecture design is suited to naturally extending the scope of our system from visual to multimedia data.


Chapter 1. Introduction

1.1 Motivation and Goal

We sense our physical world as embedded in a (3+1)-D space (3 geometric dimensions and one temporal dimension). However, our favorite and dominant perception of this universe is through a projection into a (2+1)-D space, as a stream of images with temporal consistency. Vision is the set of mechanisms through which humans (and machines, to some extent) can interpret this (2+1)-D information in terms of entities in (3+1)-D space. Vision is a fascinating problem, because it is obviously tractable (humans solve it constantly in everyday life), yet mathematically underconstrained and amazingly difficult to solve with our scientific tools. The goal of Computer Vision (CV) is to produce semantic representations of the information contained in images or sequences of images, to be used in higher level processes (typically artificial intelligence applications). The nature of these semantic representations depends on the type of application targeted. We define visual data as any representation of real or virtual entities primarily experienced visually. Visual data include (but are by no means restricted to) images, video streams and 3D models (possibly animated).

Crafting (2+1)-D information in order to record the state of the (3+1)-D world is at least as old as known prehistoric cave paintings, and since then pictorial representations have been a medium of choice for information encoding. However, (2+1)-D recordings present two degrees of freedom in their interpretation: first, one such recording can correspond to many different (3+1)-D settings; second, it is possible to create (2+1)-D information that has no correspondence in the physical (3+1)-D world. Human creativity makes use of these degrees of freedom. Abstract paintings are an example of artificial records of unrealistic 3-D scenes, used to convey a lot more than an accurate physical scene setting. Even "realistic" paintings are seldom accurate renderings of 3-D photometry and geometry [14]. Motion pictures use carefully choreographed (2+1)-D recordings of specifically designed (3+1)-D settings to create, in the viewer, the sensation of a different, often much richer, situation than the one actually existing in the (3+1)-D world. A major difficulty in that case is to build the sets that will allow the director to create the images needed to convey his vision from a limited actual piece of physical world. Sometimes it is impossible to shoot the actual (3+1)-D scene, and it is necessary to alter the recorded (2+1)-D information. This process, called post-production, involves such delicate and tedious operations as painting on film and re-shooting several layers of film. These procedures are now increasingly replaced with their digital equivalents (digital painting and compositing), more powerful but no less labor intensive.

Nevertheless, post-production has been revolutionized by the development of computer graphics, or image synthesis. The goal of Computer Graphics (CG) is to synthesize realistic (2+1)-D information from models (of various dimensions and levels of abstraction) of the (3+1)-D world. In that sense, it is the reverse of CV. Working with models gives CG artists complete control over the entities they manipulate and thus over the content of the images they create. It also makes it possible to develop interactive simulations in which the spectator has some control over the images they experience, as in modern video games, Virtual Reality (VR) environments, and even more realistic immersive physical simulations. However, rendering photorealistic images of models is very challenging and computationally expensive. Furthermore, creating and manipulating CG-oriented models is most of the time a tedious, manual task, the complexity of which increases with the dimensionality and accuracy of the resulting model. As those models are optimized for CG, they encode the geometric and photometric properties of the entities they represent (e.g. textured triangles), rather than their semantics (e.g. functional object parts), which can make them difficult and counterintuitive to manipulate.

While CV and CG historically evolved as two separate fields from the common root of Roberts' work [89], we believe they must play complementary roles in a new approach to visual content production. Established successful CV techniques can be used to produce semantic models of visual information that facilitate both production of and interaction with visual data. For example, convergence between CV and CG has recently led to the creation of a new field, called Image-Based Rendering and Modeling (IBRM) [49]. The original idea was to use one or several photographs of a scene, which are by definition photorealistic, to produce different images of the same scene without explicit geometric representation. Various levels of quality and flexibility have been achieved using this principle, ranging from simple view interpolation [17], to arbitrary view synthesis [37], to image recomposition [104] and interactive navigation inside an image [41]. Recent studies tend to show that better results can be achieved when relevant image content information is known [92]. The main focus in the CV part of IBRM is now on the extraction of explicit information from the images, such as scene lighting information [113], camera intrinsic and extrinsic calibration parameters [12][16][61], and scene/object geometry and texture [11][15][18]. On the CG side, the major difficulty is the production of an image by combining and blending part or totality of one or several images. A successful approach adopts a layered representation of scenes [96][97].

Conversely, interaction and CG are tightly linked. It thus becomes appropriate to re-explore CV techniques that failed to perform automatically, and to develop interactive CV in the context of visual data manipulation. Most CV systems are traditionally designed to be "automatic", which usually means that they are not very robust, they are often specific to a given task, even to a given set of inputs, and they require the setting of numerous, non-intuitive parameters. In mainstream CV, the need for interaction is considered a failure. Interactive CV systems have found much better appreciation in domains such as medical image analysis, where some interaction is a low price to pay to facilitate tedious image analysis tasks, such as boundary delineation. For example, user interaction is used for initializing contour extraction tasks that use snakes [50][75][76]. Other examples of user-steered segmentation tools are Live-Wire [25] and Intelligent Scissors [72], based on graph representations of image gradients. User-directed segmentation is also used in photogrammetry for building extraction from aerial images using modeling primitives [36]. As a result, user-assisted segmentation is one of the first interactive tools to have appeared in CG applications (e.g. the Magic Wand in Adobe Photoshop [1]).

We believe it is worthwhile and necessary to bring CV and CG together in a new, unifying approach to producing and using visual data. In this context, our goal is to build an interactive system that allows the user to create, experience and manipulate visual data in a consistent, natural and intuitive way.

1.2 Overview of Approach and Outline

The general goal we describe spans a very wide area of functionalities, with corresponding technical and scientific challenges. We cannot, and will not, attempt to solve them all, but will rather analyze the nature of visual data manipulation in order to provide a unifying framework for designing and implementing automatic or interactive tools in a consistent way. The greatest challenge faced when integrating such a wide variety of functionalities in a useful and usable software system may indeed lie beyond the conceptual and technical specificities of each concerned field or subfield. A central issue in the design of our system is certainly providing universal and extensible data representation and processing, with consistent mechanisms for scheduling, planning and synchronization.

Our system should give the user total control over the data in the system. The user has the power of creation and manipulation, either through direct interactive manipulation or through programs. Interaction involves a cyclic repetition of processes in which information is presented to the user, data produced by the user in reaction to the presented information is collected, and user data is processed, possibly resulting in the modification of the information. The system should thus support collection and processing of user input, and visual presentation (image output) of any and all data. Through interactive manipulation, the user can directly modify the state of a structure. The system must therefore provide not only a visual presentation of the structure, but also a means for the user to manipulate it through a visual interface. Programs can perform automatic, on-line processing of data streams, possibly in real time, or be selectively triggered by the user (interactive processing). The result of such processes can be the creation of new data (including but not restricted to visual data) or the modification or deletion of existing data. The system must therefore support concurrent process execution and data structuring, including explicit handling of the time component (synchronization). It must also provide a way for the user to define programs, and to access input and output data as well as process parameters.

Interaction and on-line processing are by nature incompatible: humans are too slow to perform any interactive task on dynamic data. As a result, in existing approaches, they either do not coexist at all, or they are handled inconsistently. However, both aspects are related, as interaction can be seen as on-line processing of particular data streams. We believe that consistent handling of the dynamic aspect of visual data must play a central role in our approach.

In order to realize our goal, we have designed a generic, modular software architecture for interactive and on-line processing systems. It is organized into a middleware layer, whose purpose is to provide an abstraction level between the low-level services and the applications, and an application layer, which makes it possible to compose and execute applications using the available software components in a dataflow-based visual programming environment. At the core of the middleware layer is the Flow Scheduling Framework (FSF), an extensible set of classes that provide basic functionality and composition mechanisms to develop data-stream processing components in the context of the proposed system architecture. A generic data model makes it possible to encapsulate existing data formats and standards, as well as low-level service protocols and libraries, and to make them available in a system where they can inter-operate. A generic dataflow processing model supports stream control and synchronization in a concurrent processing environment. Our system is an implementation of the proposed architecture in the specific context of interactive manipulation of visual data. Using the FSF model, we have designed and developed core system components, basic visual data structures and a visual environment, an implementation of the application layer supporting interactive manipulation of visual data and interactive visual programming of on-line processes. This system is described in detail in chapter 2.

The described system is used as a development and execution environment for the manipulation of visual data. A particular category of programs of special interest in our work is concerned with the production, from existing visual media, of semantic representations of the same information, useful for interactive manipulation and/or synthesis of different visual data. We show in figure 1 various representations of visual information. For example, the image of a vase on a table is a collection of pixels of various color or intensity values, with a certain spatial relationship (the pixels are ordered). Based on this representation, it is rather difficult to perform such simple tasks as changing the color of the vase, moving the vase on the table, etc. A representation that carries the semantics of the scene could provide a 3D model of the vase and a 3D model of the table. Such a representation is certainly more convenient for manipulating the objects, and thus their appearance in images.

We can organize visual data according to the levels of semantic representation they provide, as shown in figure 2. We categorize them into three classes: pixel-level, image-centered and object/scene-centered. Pixel-level representations encode properties of image elements and their relationships, and are tightly related to the raster encoding of images. Image-centered representations encode geometric and photometric properties of structural entities and their arrangement to form images. They are independent of the raster representation of images but make sense only in the context of a projection into image space. Finally, at the highest abstraction level, object/scene-centered representations encode object geometric and photometric properties, possible part composition relationships, and object arrangement to form a scene. These representations are independent of any imaging context.
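
As a purely illustrative sketch, the following C++ fragment shows one hypothetical way these three levels could be expressed as data types. The type and field names are invented for exposition only; they are not the data structures developed in this dissertation.

```cpp
// Illustrative only: hypothetical C++ types mirroring the three semantic
// levels (pixel-level, image-centered, object/scene-centered).
#include <array>
#include <utility>
#include <vector>

// Pixel-level: tied to the raster encoding of an image.
struct RasterImage {
    int width = 0, height = 0;
    std::vector<unsigned char> rgb;  // width * height * 3 samples
};

// Image-centered: parametric entities that only make sense once projected
// into image space, e.g. a deformable region bounded by 2-D curves.
struct ImageRegion {
    std::vector<std::pair<float, float>> boundary;  // 2-D control points
    RasterImage texture;                            // photometric content
};

// Object/scene-centered: geometry and part structure independent of any
// imaging context.
struct SurfacePatch {
    std::vector<std::array<float, 3>> vertices;  // 3-D points
};

struct SceneObject {
    std::vector<SurfacePatch> parts;  // possible part composition
};
```

The point of the sketch is only that moving up the hierarchy trades raster fidelity for manipulability: the same scene content becomes easier to edit semantically, and harder to obtain automatically.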

Figure 1. Different representations of visual information. Image regions can be identified as sets of pixels, extracted and recombined (a), or be modeled as the projection of parametric surfaces that can later be deformed (b). Object contours, once extracted (c, d), can be used to infer the 3-D shape of the object (e, f), using geometrical properties of the object itself and of its projection in the images.

Building representations of any level by hand, either from images or from scratch, is arguably of a similar degree of difficulty as manipulating the pixels directly in the images. We have developed a number of modules whose purpose is to produce, automatically or interactively, representations of visual information that allow the user to access its various semantic levels.

Figure 2. Levels of representation of visual data and their relationships for Vision and Graphics.

In this dissertation, Vision application modules are restricted to monocular images or sequences of images. At the pixel level (chapter 3), we demonstrate:
• user-driven image feature extraction using an energy map produced by a tensor voting technique,
• automatic real-time segmentation of video streams produced by a fixed camera using a statistical color background model, and
• mosaicing using the data provided by an instrumented pan-tilt-zoom camera.

For image-centered representations (chapter 4), we:
• demonstrate the advantages of Non-Uniform Rational B-Splines (NURBS),
• present an automatic NURBS curve fitting algorithm, and
• present a NURBS surface-based deformable image region model.

At the object/scene-centered level (chapter 5), after emphasizing the importance of symmetries in the recovery of intrinsic 3D structures from image-centered level data in monocular views, we show:
• how bilateral symmetry can be used to recover (textured) 3D surfaces, and
• how, in some cases, more constrained volumetric primitives can be recovered directly.

1.3 Contributions

We believe our most valuable contribution is the design and implementation of an open, generic interactive system, whose general relevance extends far beyond the manipulation of visual data. The visual data processing and manipulation modules described are more focused contributions to the field of CV, either extending existing approaches or proposing novel approaches in the context of interactive vision.


Chapter 2. System Architecture

2.1 Introduction

Our goal is to build an interactive system that allows the user to create, experience and manipulate visual data in a consistent, natural and intuitive way. Such a system spans a very wide area of functionalities, with corresponding technical and scientific challenges in separate fields of computer science, such as Computer Vision (CV) and Computer Graphics (CG). An example of a conceptual dataflow diagram is presented in figure 3.

Figure 3. Example dataflow through a system meeting our goal.

The inputs and outputs are entirely open, and naturally include any visual data. Pixel-level representations are presented as the main input and output format, as they are essential for experiencing any information in the system. The user must be able to create, modify and experience data in the system, either through direct interactive manipulation or through programs, hence Human-Computer Interface issues and Visual Programming techniques are relevant to our enterprise. However, the greatest challenge faced when integrating such a wide variety of functionalities in a useful and usable software system may lie beyond the conceptual and technical specificities of each concerned field or subfield. A central issue in the design of our system architecture is certainly providing universal and extensible data representation and processing, with consistent mechanisms for scheduling, planning and synchronization.

Systems providing various combinations of the desired functionalities, with various levels of integration, have been developed in separate fields. In this section, we review representative approaches in CG, CV, multimedia and visual programming, paying particular attention to their level of integration and extensibility, the level of representation they can handle, their data manipulation and processing modalities, and, last but not least, their handling of the temporal dimension (in terms of data modeling as well as processing). The analysis confirms the strengths and adequacy of some design principles. It also reveals intra- and inter-field inconsistencies in the handling of seemingly unrelated, but fundamentally similar aspects of visual data manipulation, and thus the need for a unified approach.

2.1.1 Related Work in Computer Graphics and Computer Vision

Digital production of visual content is the object of CG applications, which can be categorized into two largely independent groups. Representative systems with their relevant characteristics are presented in table 1.

Table 1. Characteristics of representative Computer Graphics systems

| Name | Integration / Extensibility | Representation level | Manipulation / Processing | Temporal dimension |
| Photoshop (Adobe) | Application; plug-ins | Pixel-level | Interactive; off-line | N/A |
| Premiere (Adobe) | Application; plug-ins | Pixel-level, image-centered | Interactive; off-line | Off-line |
| 3D Studio (Discreet), Softimage XSI (Avid), Maya (Alias), Houdini (Side Effects) | Environment (modeling, animation, rendering); SDK, plug-ins | Object/scene-centered | Interactive; off-line/procedural | Off-line/batch |
| RenderMan (Pixar) | Specialized application (rendering) | Pixel-level, image-centered | Procedural; off-line | Implicit |
| Shake (Nothing Real) | Specialized application (compositing) | Pixel-level, image-centered | Procedural; off-line | Implicit |
| form.z (auto.des.sys) | Specialized application (curves) | Object/scene-centered | Interactive; off-line | N/A |
| ZBrush (Pixologic) | Specialized application (3-D painting) | Pixel-level, object/scene-centered | Interactive; off-line | N/A |

The first group of applications deals with the creation and processing of images and image sequences (frames and pixels to frames and pixels), and involves various combinations of pixel-level and image-centered representations. Examples include Adobe's Premiere [1] and the Media 100 system [65]. Such applications traditionally cannot handle object/scene-centered representations, although some companies have recently started offering systems that aim at combining 2-D and 3-D representations. Examples are Play's Trinity system [84] and Kaydara's FiLMBOX [52]. In video systems, interface and processing are organized around the dynamic aspect of the output data. While interface designs are not very different, each system has its own specificities, and exchange between systems is limited to import/export through format translation. The main interface features of video applications are a video monitor and a time-line. Tedious, repetitive manual tasks are also a dominant characteristic of these products, since the manipulated entities are primarily pixel-level representations. Some major systems have recently incorporated higher level tools, borrowed from CV, that attempt to reduce low-level tasks. However, all the processing is handled off-line, possibly in batch mode but most of the time manually, frame by frame. Furthermore, any dynamic dimension of the input data is ignored: when imported in such a system, an image sequence is broken down into frames that are processed or manipulated independently. The time dimension representation is implicit, and handled off-line.

The second group of applications aims at synthesizing images from object/scene-centered models, and thus encompasses modeling, animation and rendering (geometric and photometric models to frames and pixels). A wide variety of commercial systems to perform one or all of these tasks is available today. A small number of companies offer expensive, complete systems or suites of tools, with various levels of integration and homogeneity.
Among those are Alias|wavefront's Maya [2], Avid's Softimage [6], Discreet's 3D Studio [21] and Side Effects' Houdini [95]. They are usually very powerful for one specific aspect of image synthesis, for which they were originally developed. These systems define some standard import/export formats through which they can communicate. Specialized processes are developed and commercialized, often by third-party companies, as extension plug-ins for the major systems. Examples of such companies are Ultimatte Corporation [106] and Kaydara [52]. A number of smaller stand-alone systems offer a variety of either simpler or more narrowly targeted functionalities. Examples are auto.des.sys's form.z [5] and Robert McNeel & Associates' Rhino [88] for curve modeling, Pixologic's ZBrush for 3-D painting [83], Pixar's RenderMan [82] for high quality rendering, and Nothing Real's Shake [79] for compositing. Those systems are not necessarily as tightly linked to major complete systems as plug-ins, but usually allow data exchange through standard file formats. All these systems are designed to provide maximum control to the user over the manipulated objects. As a result, their use requires a tremendous amount of manual, tedious work, and optimal proficiency in one system is achieved at the price of a long, specific training that does not transfer to other systems. Attempts to reduce the amount of tedious manual operations are materialized by the development of new data models, but the interfaces remain essentially unchanged in successive versions of these applications, due to product line continuity imposed by market pressures. Almost all systems in which 3-D objects are manipulated have an interface organized around three flat projections along the axes and one 3-D view.

Attempts to utilize CV techniques to facilitate CG production have led to the release of such packages as RealViz's Image Processing Factory [87] and Geometrix's 3Scan [32] (see table 2).

Table 2. Characteristics of representative Computer Vision systems for Computer Graphics applications

| Name | Integration / Extensibility | Representation level | Manipulation / Processing | Temporal dimension |
| Image Processing Factory (RealViz) | CV application suite | All | Interactive; off-line | Off-line |
| 3Scan (Geometrix) | Application + hardware | Object-centered | Interactive; off-line | N/A |

Bringing CV and CG together in an interactive system is a natural evolution for both fields: tedious manual tasks in CG applications can be automated, and ambiguities in CV algorithms can be resolved with high-level, intuitive human input. However, applying CV algorithms to CG needs changes the constraints under which these algorithms must perform. The problems to solve in interactive vision are not the same as those addressed in traditional CV, which is concerned with designing "automatic" systems. Those are usually not very robust, they are specific to a given task, even to a given set of inputs, and they require the setting of numerous, non-intuitive parameters. Libraries implementing well-established CV algorithms, potentially useful in CG applications, are widely available (see table 3). Examples include Vista [86] and the very recent Intel Open Source Computer Vision library [45].

If in mainstream CV the need for interaction is considered a failure, some interactive CV systems have found much better appreciation in domains such as medical image analysis, where interaction is a low price to pay to facilitate tedious image analysis tasks, such as boundary delineation. For example, user interaction is used for initializing contour extraction tasks that use snakes [50][75][76]. Other examples of user-steered segmentation tools are Live-Wire [25] and Intelligent Scissors [72], based on a graph encoding of image gradients. User-directed segmentation is also used in photogrammetry for building extraction from aerial images using modeling primitives [36]. As a result, user-assisted segmentation is one of the first interactive tools to have appeared in CG applications, e.g. the Magic Wand in Adobe Photoshop [1], based on the Intelligent Scissors. However, large-scale interactive Vision systems and environments were developed for traditional computer vision problems, and are difficult to integrate with CG systems. For example, SRI International's 3-Dimensional Image Understanding System (3DIUS) [99] and DARPA/GE's Image Understanding Environment [54] are primarily concerned with image analysis. They are designed as development and execution environments, so that data manipulation is carried out mainly procedurally, and processing is performed off-line. BAE Systems' Socet Set [7] is a specialized interactive application developed to enable mapmakers and photogrammetrists to geo-position images with a very high degree of accuracy, and provides "user-friendly" image exploitation tools for both military and commercial applications. Such dedicated systems, although very useful for their specific purpose, do not provide any support for dynamic, on-line processing, and would not be easy to adapt for CG applications because their design favors application development and execution for visual data processing, rather than visual data manipulation. Even CV applications specifically targeted at the CG market, although potentially very useful and time-saving for graphics applications, are not yet well integrated in the traditional CG artist workflow.

Indeed, the convergence of CV and CG gives rise to new problems. Human-computer interaction issues must be carefully addressed in order to bring more power to the user in an intuitive way, so as not to interfere with the creative process, and to preserve the freedom of access and control of the manipulated objects. Interactive applications implementing new Human-Computer interface models are developed for specific aspects of visual data creation and manipulation [43][53], and are difficult to integrate in established commercial systems. For commercial and historical reasons, there is actually little effort to integrate all aspects of visual data manipulation in a unified, homogeneous interactive framework. Major movie production studios, large-scale consumers of CG applications, are forced to develop and implement custom in-house hardware and software architectures, the details of which are closely guarded, because of the huge investment and technological edge they represent.
Table 3. Characteristics of representative Computer Vision systems and libraries

| Name | Integration / Extensibility | Representation level | Manipulation / Processing | Temporal dimension |
| 3DIUS (SRI International) | Environment (Lisp/CLOS); open | All | Procedural; off-line | Implicit |
| IUE + Target Jr. (DARPA/GE) | Library + environment (C); open | All | Procedural; off-line | Implicit |
| Socet Set (BAE Systems) | Application | All | Interactive | Implicit |
| Vista | Library (C) | Pixel-level, image-centered | Procedural; off-line | Implicit, off-line, ad hoc |
| OpenCV library (Intel) | | | | |

An example of such a system is Industrial Light+Magic's Sabre [39]. It encompasses all aspects of digital production, from digitization of film footage to restitution of the final composites on film. It is designed to provide the graphics artists and designers with a transparent and fast interface to all the stages of the digital image pipeline. It implements hardware and software solutions for digital image storage and retrieval, object modeling, animation, scripting and compositing, in an open architecture designed to accommodate a mix of custom hardware and software modules, and third-party off-the-shelf graphics packages, for which it operates as a unifying interface. Such a system, however, is built for a specific need (digital movie production), and must accommodate a variety of heterogeneous packages with their own interfaces and processing models. Note that this very complex system is only targeted at image synthesis. It has no explicit representation of the temporal dimension other than frame sequentiality, and supports no CV-derived representations or processing.

Finally, a key deficiency, common to both CG and CV systems, is their inconsistent handling of the temporal dimensionality of visual data, both in terms of representation and processing. The reason is also historical, as the processing power necessary to perform useful computations on image streams (whether analysis or synthesis) became available and affordable only recently. Systems in which dynamic aspects of visual data are not a direct concern usually ignore them completely (time relationship handling is at best implicit). Animation systems, whose primary purpose is to specify the temporal evolution of visual data, offer a variety of approaches (e.g. interpolation of key-samples or procedural specification), using ad hoc data models, processing and access mechanisms. Similarly, except for a handful of dedicated applications (again involving ad hoc design and implementation), CV systems are designed to manipulate isolated images (possibly image sequences in batch mode), but never integrate image sequence processing explicitly.

Clearly, in order for a system to fulfill our general goal, it must implement a unifying framework that makes it possible to handle all visual data (in their spatial and temporal dimensions). We believe the only realistic way to support all existing and upcoming techniques and technologies is to design an open, modular architecture that will allow us to implement a system that will serve as a development and exploitation platform. Existing CG products, for historical and commercial reasons, show very limited application of these principles. Multimedia is a domain in which integrating multiple existing and emerging technologies and standards is a central issue, and for which dynamic, on-line handling of data streams is a major requirement.

2.1.2 Related Work in Multimedia

There have been multiple attempts to design unifying frameworks for building multimedia applications. Representative systems with their relevant characteristics are presented in table 4. De Mey and Gibbs proposed an object-oriented Multimedia Component Kit [69] that implements a framework for rapid prototyping of distributed multimedia applications. Their framework supports the development of components that encapsulate interfaces to various hardware elements, in what today would be called a middleware layer. They also proposed to use visual composition to build dataflow-based multimedia applications. More recently, Duke and Herman formalized the role of middleware in a standard called PREMO [22] (PResentation Environments for Multimedia Objects), which addresses synchronization, management of distributed resources and integration of data and processes.

Table 4. Characteristics of representative Multimedia systems

| Name | Integration / Extensibility | Representation level | Manipulation / Processing | Temporal dimension |
| Multimedia Component Kit (University of Genève) | Environment; open | Pixel-level, image-centered | Procedural; on-line | Implicit, off-line |
| Continuous Media Toolkit (Berkeley) | | | | On-line |
| VuSystem (MIT) | | | | On-line |
| Muse (Muse Technologies) | Immersive environment; SDK (closed) | Object/scene-centered | Interactive; N/A | On-line (real-time) |

However, those approaches are based on an outdated, static conception of multimedia applications, and thus do not accommodate dynamic interaction and on-line processing. New architectures have been developed to specifically address the dynamic aspects of modern multimedia applications. MIT’s VuSystem [62] implements a design based on visualization systems programming techniques applied to the development of computeintensive audio-video multimedia applications, i.e. applications that take action upon the analysis of their audio and video input. BMRC’s Continuous Media Toolkit [64] is a development environment for continuous media (audio and video) applications. Both systems implement modular dataflow architecture concepts. They are designed primarily for audio and video processing with a strong emphasis on capture and replay aspects. They do not readily support interaction and manipulation of visual data of various levels of representation, as it is not their primary purpose. Even though multimedia frameworks and architectures do not fulfill all our requirements, they certainly have design aspects that are desirable in our system, such as the use of an object-oriented, component-based extensible middleware layer. This modular approach is usually complemented with a dataflow-based visual programming model. 2.1.3

2.1.3 Visual Programming

As our system is interactive and its primary purpose is the creation and manipulation of visual data, it should provide a rich visual exploitation environment, not only for direct data manipulation but also for building applications. Dataflow-based models are particularly well suited to visual programming. For example, Khoros [4], with its visual programming environment Cantata, is a successful integrated development environment for scientific computing and visualization. Other dataflow programming tools with visual programming support include G/LabView [3] in the domain of engineering, CONDOR [51] for constraint-based dataflow programming suited to computer graphics problems, and Fabrick [44]. All these systems demonstrate the power of dataflow-based visual programming models and graphical interfaces, but they were not designed to handle on-line processing of data streams.

2.1.4 A Unifying Approach

We believe a straightforward extension of existing approaches cannot lead to a complete solution to our goal, because those approaches were developed incrementally from prior partial, heterogeneous solutions to different problems that turn out to be related. Most importantly, for historical reasons rooted in the available computing power and in the technical limitations of sensors and storage devices, the temporal dimension of data and the dynamic, on-line aspects of processing are not handled consistently, when they are handled at all. Interaction itself requires on-line data-stream processing, but it is supported in different processing models, often using ad hoc techniques and shortcuts to meet the “interactive-time” requirement with limited computing power. We have now reached the technological maturity that makes a unified approach not only realistic, but also necessary in order to provide more control and power to the user.


Figure 4. Our modular system software architecture: low-level services and libraries (local storage, libraries, interface devices); a middleware layer built around the Flow Scheduling Framework and its inheritance hierarchy of software components; an application development and execution environment in which software components are instantiated; and the applications themselves, through which data streams flow.

In order to realize our goal, we have designed a generic, modular software architecture for interactive and on-line processing systems [30], presented in figure 4. Its main components are:

• A middleware layer that provides an abstraction level between the low-level services and the applications, in the form of software components (a minimal illustration of such a component follows this list). At the core of this layer is the Flow Scheduling Framework (FSF), which implements the foundation classes from which these components are developed. A generic data model makes it possible to encapsulate existing data formats and standards, as well as low-level service protocols and libraries, so that they can inter-operate within the system. A generic dataflow processing model supports stream control and synchronization in a concurrent processing environment.
• An application layer, built on the FSF, in which applications are composed and executed from the available software components in a dataflow-based visual programming environment.
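As a purely illustrative sketch of this encapsulation idea (the FSF's actual data model is specified in section 2.2.3, and every class and function name below is hypothetical), a component might expose an existing image format through a generic, run-time-typed data node, so that other components can operate on it without compile-time knowledge of the underlying library:

#include <iostream>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical generic data node: every piece of data carried by the
// middleware exposes a type name, so components can inter-operate
// without compile-time knowledge of each other's concrete formats.
class DataNode {
public:
    virtual ~DataNode() = default;
    virtual std::string typeName() const = 0;
};

// Stand-in for a pre-existing, third-party image structure
// (e.g. a raw RGB buffer produced by some capture library).
struct RawRGBImage {
    int width = 0, height = 0;
    std::vector<unsigned char> pixels;   // width * height * 3 bytes
};

// Wrapper that makes the legacy format available as a generic node.
class ImageNode : public DataNode {
public:
    explicit ImageNode(RawRGBImage img) : image_(std::move(img)) {}
    std::string typeName() const override { return "Image"; }
    const RawRGBImage& image() const { return image_; }
private:
    RawRGBImage image_;
};

int main() {
    RawRGBImage frame;
    frame.width = 320; frame.height = 240;
    frame.pixels.resize(320 * 240 * 3, 0);

    // Once wrapped, the image travels through the system as a generic node.
    std::unique_ptr<DataNode> node = std::make_unique<ImageNode>(std::move(frame));
    std::cout << "carrying a node of type: " << node->typeName() << "\n";
}

The only design point this sketch is meant to convey is that wrapping heterogeneous formats behind a common, self-describing node type is what lets independently developed components inter-operate within one system, in the spirit of the run-time type management discussed in section 2.3.3.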

The FSF, the cornerstone of this architecture, is described in depth in section 2.2. Our system is an implementation of the proposed architecture in the specific context of interactive manipulation of visual data. It is described in section 2.3. Specifically, the basic visual data structures required are described in section 2.3.2. The visual environment is an implementation of the application layer that supports interactive manipulation of visual data, and interactive visual programming for on-line processes. It requires support for run-time type management, presented in section 2.3.3, and for visual interaction, described in section 2.3.4.

2.2 The Flow Scheduling Framework (FSF)

The FSF is an extensible set of classes that provide basic functionality and composition mechanisms to develop data-stream processing components in the context of the proposed system architecture. The FSF specifies and implements a data model and a dataflow-based processing model supporting stream control and synchronization in a concurrent processing environment.

2.2.1 Data Streams, Multithreading and Data Flow Design

Interaction and real-time processing both require fast data processing: in order to provide the user with temporally consistent and usable information in the case of interaction, and in order simply to keep up with the input data rate in the case of real-time processing. In both cases, two time-performance measures must be taken into account. One is the latency introduced by the processing; the other is the time interval between the output (or input) of two consecutive time samples, which determines the throughput (or processing rate) of the system. For example, if an application captures a live video stream and displays it on the screen, the latency is the time taken for an image to appear on the screen after it has been captured, and the interval is the inverse of the rate at which the frames are captured and/or displayed. In many cases, it is desirable to minimize both the latency and the sample interval. As shown in figure 5, in a sequential execution model these two quantities are equal. Introducing concurrent execution of processes on different time samples decouples latency from sample throughput, and can dramatically increase the throughput if sufficient processing power is available. Further reduction of the latency


can be made possible by supporting concurrent execution of independent processes affecting the same sample. Support for concurrent execution therefore appears necessary for both interaction and on-line (including real-time) processing. This view is further reinforced when considering that interaction and on-line processing operate on incompatible time scales: human reaction times are extremely slow compared to computer activity, so processes involving user intervention cannot be allowed to unnecessarily delay or suspend independent processes dealing with on-line data streams if both types must coexist in the same system. Concurrent processing, however, introduces the major burden of keeping track of data relationships, in particular dependency and synchronization information.

Figure 5. Advantage of multithreading for time-critical applications: (A) sequential execution; (B) time multithreading; (C) time and process multithreading. Processes 2 and 3 are independent; process 4 depends on both. With a sequential execution model, the processing lag also constrains the achievable rate (inversely proportional to the interval between the completion of the processing of two consecutive time samples). Multithreading separates the lag from the processing rate.
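To make the latency/interval distinction concrete, the following toy sketch (the two stages, their durations, and the use of standard C++ threads with a hand-rolled queue are invented for the example and are not the system's actual implementation) pipelines two processing stages in the spirit of figure 5b; the per-sample latency stays near the sum of the stage times (about 50 ms) while the output interval approaches that of the slowest stage (about 30 ms):

#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

using Clock = std::chrono::steady_clock;
using namespace std::chrono_literals;

// Minimal thread-safe queue connecting the two pipeline stages.
template <typename T>
class Channel {
public:
    void push(T v) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(v)); }
        cv_.notify_one();
    }
    T pop() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        T v = std::move(q_.front()); q_.pop();
        return v;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
};

struct Sample { int id; Clock::time_point captured; };

int main() {
    constexpr int kSamples = 5;
    Channel<Sample> stage1_out;

    // Stage 1 (e.g. capture plus a first process): about 30 ms per sample.
    std::thread stage1([&] {
        for (int i = 0; i < kSamples; ++i) {
            Sample s{i, Clock::now()};
            std::this_thread::sleep_for(30ms);   // simulated processing
            stage1_out.push(s);
        }
    });

    // Stage 2 (e.g. a second process plus display): about 20 ms per sample.
    auto last_done = Clock::now();
    for (int i = 0; i < kSamples; ++i) {
        Sample s = stage1_out.pop();
        std::this_thread::sleep_for(20ms);       // simulated processing
        auto done = Clock::now();
        auto latency  = std::chrono::duration_cast<std::chrono::milliseconds>(done - s.captured);
        auto interval = std::chrono::duration_cast<std::chrono::milliseconds>(done - last_done);
        last_done = done;
        std::cout << "sample " << s.id << ": latency " << latency.count()
                  << " ms, interval " << interval.count() << " ms\n";
    }
    stage1.join();
}

Running the same two stages in a single sequential loop would instead yield an interval equal to the latency, as in figure 5a.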


Dataflow processing models are particularly well suited to the types of problems we have described so far. However, traditional dataflow models were not designed to handle on-line data streams, and thus are not directly applicable to our design. An application is specified as a dependency graph of processing nodes. In the traditional approach (see figure 6a), each processing node has a fixed number of inputs and outputs, each of a given data type. When all the inputs are available for a given node, they are processed to produce the node's outputs. This processing can be triggered either manually or automatically. The production of the outputs in one node triggers or allows the processing in other nodes, and the procedure propagates through the graph until the final outputs are produced. This approach is adequate for deterministic processing of static data, and input type checking can be done at application design time. A first problem arises when the data type of an input is known at design time, but the number of objects of that type is not. This situation can be handled at the price of more complex structures. A deeper deficiency is the static model on which, ironically, dataflow programming is based: the “dataflow” is regulated off-line, implicitly. There is no support for on-line processing, time consistency or synchronization, and process parameters are not integrated in the data model, which makes any type of feedback impossible.

Extending the static dataflow model to a dynamic, data-stream model is not trivial: process inputs are no longer static, unique objects (e.g. a single image), but time samples entering the system at a given rate (e.g. the frames of a video stream). Consequently, as process completion order is not deterministic, data must be kept available in the system until all dependent processes have completed. The straightforward extension of the static model, using local buffers in the processing centers as described in figure 6b, although functional in simple cases, introduces several fundamental problems, such as the persistence of computed samples in the system and the collection of samples stored in independent buffers for data or process synchronization: if a process depends on data produced by several other processes, it should operate on synchronous inputs (i.e. data produced from the same sample), which requires, if buffers are used, searching all the buffers of all the input processes for samples with a particular time stamp. In order to avoid these problems, we introduce volatile carriers for synchronous data, called pulses, that flow down the stream paths (figure 6c). The color coding defined in figure 6 (green for processing, red for persistent data and blue for volatile data) is used consistently in the remainder of this dissertation. Furthermore, a glossary collecting definitions of the specific terms introduced in this chapter is provided in appendix A.
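A toy sketch of the contrast between per-node buffering (figure 6b) and pulses (figure 6c) follows; the Pulse structure and node functions below are invented for illustration and are not the FSF's actual interfaces. Because a pulse groups every piece of data derived from the same time sample, a node that depends on several upstream results reads them from the single pulse it receives, instead of searching separate buffers for entries with matching time stamps:

#include <iostream>
#include <map>
#include <string>

// Hypothetical pulse: a volatile carrier for all data that share one
// time stamp, i.e. that were derived from the same input sample.
struct Pulse {
    double timeStamp = 0.0;
    std::map<std::string, double> data;   // named synchronous results
};

// Each processing node augments the pulse with its own result and
// passes it on; downstream nodes never have to match time stamps.
void processB(Pulse& p) { p.data["b"] = p.data.at("a") * 2.0; }
void processC(Pulse& p) { p.data["c"] = p.data.at("a") + 1.0; }
void processD(Pulse& p) {
    // d depends on both b and c; they are guaranteed to be synchronous
    // because they travel in the same pulse.
    p.data["d"] = p.data.at("b") + p.data.at("c");
}

int main() {
    for (int t = 0; t < 3; ++t) {
        Pulse p;
        p.timeStamp = t / 30.0;                 // e.g. a 30 Hz stream
        p.data["a"] = static_cast<double>(t);   // the original sample
        processB(p);
        processC(p);
        processD(p);
        std::cout << "t=" << p.timeStamp << "  d=" << p.data.at("d") << "\n";
    }
}

In a concurrent setting, independent nodes (here B and C) could process the same pulse in parallel, and the pulse, being volatile, would simply be discarded once all dependent processes have completed.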

2.2.2 FSF Overview

Information is modeled as data streams. A stream represents one or several synchronous multimedia objects, i.e. objects expressed in the same time referential. An application is specified by a number of streams of various origins and the manipulations they undergo as they pass through processing nodes called cells. Stream operations fall into two main categories:
• Intra-stream processing: formatting, alteration and presentation. A stream can originate from local sources such as local storage or input devices, or can be received from a network. It can also be created internally as the result of the manipulation of one or several other streams.


Figure 6. (a) Static dataflow model; (b) data-stream extension with local buffering, in which each processing center keeps its own buffer of time-stamped samples (e.g. d(t0) = f[a(t0), b(t0), c(t0)]); (c) data-stream extension with pulses.
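To give a rough feel for how an application might be expressed as streams flowing through cells (again a hypothetical sketch; the cell names and interfaces below are invented, and the FSF's actual cell and stream abstractions are defined in the following sections), a stream path can be modeled as an ordered set of cells that each transform the pulses passing through it:

#include <iostream>
#include <memory>
#include <vector>

struct Pulse { double timeStamp; double value; };   // simplified pulse

// Hypothetical cell base class: one processing node on a stream path.
class Cell {
public:
    virtual ~Cell() = default;
    virtual void process(Pulse& p) = 0;
};

class ScaleCell : public Cell {                 // intra-stream alteration
public:
    explicit ScaleCell(double g) : gain_(g) {}
    void process(Pulse& p) override { p.value *= gain_; }
private:
    double gain_;
};

class PrintCell : public Cell {                 // presentation
public:
    void process(Pulse& p) override {
        std::cout << "t=" << p.timeStamp << "  value=" << p.value << "\n";
    }
};

int main() {
    // The "application" is just the ordered set of cells on the stream path.
    std::vector<std::unique_ptr<Cell>> path;
    path.push_back(std::make_unique<ScaleCell>(0.5));
    path.push_back(std::make_unique<PrintCell>());

    // Source: pulses entering the system at a fixed rate.
    for (int t = 0; t < 3; ++t) {
        Pulse p{t / 30.0, 100.0 + t};
        for (auto& cell : path) cell->process(p);   // pulse flows down the path
    }
}

Here the ScaleCell stands for an intra-stream alteration and the PrintCell for a presentation step; the same composition mechanism is what a dataflow-based visual programming environment would manipulate when assembling an application.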
