Synchronized Hypermedia Documents: a Model and its Applications
Ombretta Gaggi
Technical Report UBLCS-2003-05 March 2003
Department of Computer Science, University of Bologna, Mura Anteo Zamboni 7, 40127 Bologna (Italy)
The University of Bologna Department of Computer Science Research Technical Reports are available in gzipped PostScript format via anonymous FTP from the area ftp.cs.unibo.it:/pub/TR/UBLCS or via WWW at URL http://www.cs.unibo.it/. Plain-text abstracts organized by year are available in the directory ABSTRACTS. All local authors can be reached via e-mail at the address
[email protected]. Questions and comments should be addressed to
[email protected].
Recent Titles from the UBLCS Technical Report Series

2001-3 Nature-Inspired Search Techniques for Combinatorial Optimization Problems (Ph.D. Thesis), Rossi, C., January 2001.
2001-4 Desktop 3d Interfaces for Internet Users: Efficiency and Usability Issues (Ph.D. Thesis), Pittarello, F., January 2001.
2001-5 An Expert System for the Evaluation of EDSS in Multiple Sclerosis, Gaspari, M., Roveda, G., Scandellari, C., Stecchi, S., February 2001.
2001-6 Probabilistic Information Flow in a Process Algebra, Aldini, A., April 2001 (Revised September 2001).
2001-7 Architecting Software Systems with Process Algebras, Bernardo, M., Ciancarini, P., Donatiello, L., July 2001.
2001-8 Non-determinism in Probabilistic Timed Systems with General Distributions, Aldini, A., Bravetti, M., July 2001.
2001-9 Anthill: A Framework for the Development of Agent-Based Peer-to-Peer Systems, Babaoglu, O., Meling, H., Montresor, A., November 2001 (Revised September 2002).
2002-1 A Timed Join Calculus, Bünzli, D. C., Laneve, C., February 2002.
2002-2 A Process Algebraic Approach for the Analysis of Probabilistic Non-interference, Aldini, A., Bravetti, M., Gorrieri, R., March 2002.
2002-3 Quality of Service and Resources' Optimization in Wireless Networks with Mobile Hosts (Ph.D. Thesis), Bononi, L., March 2002.
2002-4 Specification and Analysis of Stochastic Real-Time Systems (Ph.D. Thesis), Bravetti, M., March 2002.
2002-5 QoS-Adaptive Middleware Services (Ph.D. Thesis), Ghini, V., March 2002.
2002-6 Towards a Semantic Web for Formal Mathematics (Ph.D. Thesis), Schena, I., March 2002.
2002-7 Revisiting Interactive Markov Chains, Bravetti, M., June 2002.
2002-8 User Untraceability in the Next-Generation Internet: a Proposal, Tortonesi, M., Davoli, R., August 2002.
2002-9 Towards Adaptive, Resilient and Self-Organizing Peer-to-Peer Systems, Montresor, A., Meling, H., Babaoglu, O., September 2002.
2002-10 Towards Self-Organizing, Self-Repairing and Resilient Distributed Systems, Montresor, A., Babaoglu, O., Meling, H., September 2002 (Revised November 2002).
2002-11 Messor: Load-Balancing through a Swarm of Autonomous Agents, Montresor, A., Meling, H., Babaoglu, O., September 2002.
2002-12 Johanna: Open Collaborative Technologies for Teleorganizations, Gaspari, M., Picci, L., Petrucci, A., Faglioni, G., December 2002.
2003-1 Security and Performance Analyses in Distributed Systems (Ph.D. Thesis), Aldini, A., February 2003.
2003-2 Models and Types for Wide Area Computing. The Calculus of Boxed Ambients (Ph.D. Thesis), Crafa, S., February 2003.
2003-3 MathML Formatting (Ph.D. Thesis), Padovani, L., February 2003.
Dottorato di Ricerca in Informatica, Università di Bologna, Padova, Venezia
Synchronized Hypermedia Documents: a Model and its Applications
Ombretta Gaggi
March 15, 2003
Coordinatore: Prof. Özalp Babaoğlu
Tutore: Prof. Augusto Celentano
Dreams are often contagious.
Enzo Ferrari
Abstract
Authoring, delivering and presenting hypermedia documents is a complex task, since such documents can contain a large number of components bound by multiple temporal and layout constraints. Media items are files which must be arranged on the screen and synchronized before being displayed to the user. The growth of network bandwidth encourages the use of rich media types in documents, such as animations, audio and video, opening new possibilities for improving information presentation. A distributed environment adds further complexity: documents can contain links to other documents, and the user can interact with the components inside them by pausing, stopping or resuming their playback, or by following links to other documents. In this thesis we discuss synchronization issues in hypermedia presentations composed of several continuous and non-continuous media objects. We define a model for synchronization relationships among media objects, and formally describe the behavior of a presentation in terms of the events that make an automaton evolve. We have also implemented a visual authoring environment based on the model, and we suggest a number of applications in areas related to the automatic generation of standard presentations and to multimedia information retrieval.
Acknowledgements

When I was a child, the baker next to my home used to greet me with "Hi, professor", since he was persuaded that one day I would be a professor. At that time I thought he was wrong; now I know he was right.

My very first acknowledgement is for my tutor, Prof. Augusto Celentano: without him I would not have known what a Ph.D. is. He has been helping me since the beginning, sharing my difficulties and joys. I am grateful for his patience.

I am indebted to Maria Luisa Sapino, first of all for her warm hospitality and friendship, and second for her valuable contribution to my work. I would like to thank the referees Sibel Adalı and Angela Guercio for their precious advice.

I wish to acknowledge four students who, during the work for their master theses, shared with me part of the work presented here: Mauro Scarpa, Diego Medici, Daniele Dal Mas and Alessandro De Faveri.

During these three years I have been lucky enough to have a lot of nice people around me, whom I would like to thank here. I will never forget my colleagues, Chiara, Claudio, Damiano, Matteo, Massimiliano, Moreno (whose wisdom I appreciate and welcome the most), Silvia and Valentino, and all the others who shared with me a warm and friendly office. A grateful thought goes to Gianluca Musumeci and Fabrizio Romano for their friendship and technical support, and to all the staff of the Department of Computer Science of Venice.

Thanks to mum, dad and Linda for sharing this adventure with me as well. Thanks to Marco, for following me around the world carrying my suitcases (as he puts it), and for never letting me miss his support and encouragement, which were indispensable to get me this far. A big kiss to Pushio, who made us all happier.

Venezia, March 2003.
Contents

Abstract

Acknowledgements

List of Figures

1 Introduction

2 Related work
   2.1 Modelling Multimedia Documents
   2.2 Authoring Multimedia Presentations
   2.3 Automatic Generation of Multimedia Presentations
   2.4 Multimedia Querying and Browsing
   2.5 Multimedia delivery

3 A Model for Synchronized Hypermedia Documents
   3.1 Hypermedia document structure
      3.1.1 Logical vs. physical structure
      3.1.2 Channels
   3.2 Synchronization primitives
      3.2.1 Basic concepts
      3.2.2 Basic synchronization
   3.3 User interaction
      3.3.1 Hyperlinks
   3.4 Automatic derivation of synchronization primitives from document structure
   3.5 Comparison with related work
   3.6 An example: The Maya's sun temple

4 Modelling Multimedia Presentation Behaviors
   4.1 An Automaton for Multimedia Presentations
   4.2 Analysis of the presentation automaton

5 An XML Schema for Describing Multimedia Documents
   5.1 The Layout section
   5.2 The Components section
   5.3 The Relationships section
   5.4 Constraints for a valid document

6 A Visual Authoring Environment for Prototyping Multimedia Presentations
   6.1 The Authoring Environment for Hypermedia Documents
   6.2 The User Interface
   6.3 A Timeline Style View
   6.4 An Execution Simulator
   6.5 Model translation using SMIL

7 Schema Modelling for Automatic Generation of Multimedia Presentations
   7.1 Multimedia Reporting
   7.2 Dynamics definition in multimedia reports
   7.3 An XML-based schema definition for multimedia reports
   7.4 Schema and data integration
   7.5 Handling incomplete results
   7.6 Query Execution

8 Retrieval in Multimedia Presentations
   8.1 A working example
   8.2 Building a Multimedia Query Answer
      8.2.1 Retrieving Consistent Presentation Fragments
   8.3 Discussion

9 Conclusion

A An XML schema for Multimedia Presentations using XML Schema language

B An XML schema for Multimedia Presentations using DSD language

C An XML schema for Multimedia Reports using XML Schema language

References
List of Figures

3.1 Structure of a video-centered hypermedia document
3.2 Synchronization schemas defined using primitives ⇔ and ⇒
3.3 The relation ⇓
3.4 Following hyperlinks: first case
3.5 Following hyperlinks: use of the >p and >s relationships
3.6 Synchronization schema for a lesson
3.7 Maya Sun Temple's presentation interface
3.8 Synchronization primitives for the first two modules
3.9 Synchronization primitives for module M2 (first part)
3.10 Synchronization primitives for module M2 (second part)
3.11 Structure and synchronization of module M3
4.1 Synchronization primitives for the first two modules
4.2 Automaton of the first two modules of the Maya presentation
4.3 A pause event
4.4 A natural ending presentation
4.5 Two equivalent presentations
4.6 Two non-equivalent presentations
5.1 Layout section of the Maya Sun Temple presentation
5.2 Hierarchical structure in the components section
5.3 An excerpt of the relationships section of the Maya Sun Temple presentation
5.4 A conditional definition using the DSD schema language
6.1 The synchronization of a news-on-demand cover
6.2 The XML representation of a news-on-demand cover
6.3 A timeline style view
6.4 The simulation of a news-on-demand cover
7.1 Overview of the multimedia reporting process
7.2 A simplified synchronization graph for a news-on-demand presentation
7.3 A modular news-on-demand presentation
7.4 (a) A visual template for a news report; (b) an article with placeholders and media items
7.5 XML schema for a news report
7.6 The generated presentation in XML
8.1 Screenshots from a presentation about Beethoven's life
8.2 The guide to listening of the Pastoral Symphony
8.3 The score analysis of the Pastoral Symphony
8.4 The synchronization schema for the presentation about Beethoven's life
8.5 The synchronization schema for the presentation of the guide to listening of the music work
8.6 The automaton of the guide to listening of the Pastoral Symphony
8.7 The resulting presentation
8.8 The fragments returned for text1,1 and text1,2
Chapter 1
Introduction
“...today, there is probably nothing more complicated and demanding than producing a live musical comedy with an orchestra, lighting, special effects, and an array of potentially conflicting personalities and artistic agendas. Military people may disagree and claim that producing a musical comedy is almost trivially easy compared with waging a military campaign. NASA people might claim that nothing at all can compare with building a space vehicle and launching it successfully. We live in an extremely complex civilization, and in all three cases (musical comedy, military campaign, and space mission), the problems are actually quite similar, though the stakes differ considerably. They are all concerned with making all the right things happen at all the right times, despite the large number of people and systems involved, and the complexity of the interdependencies involved in the enterprise...”

The above fragment, excerpted from [55], supports the idea that modelling, authoring, delivering and presenting hypermedia documents is also a complex task, even if at a smaller scale, since such documents can contain a large number of components bound by multiple temporal and layout constraints. A distributed environment adds further complexity. The Internet and the World Wide Web allow the delivery and presentation of complex documents, composed of several media files and streams, which must be arranged on the screen and synchronized. The growth of network bandwidth encourages the use of rich media types in documents, such as animations, audio and video, opening new possibilities for improving information presentation.
Delivering hypermedia documents over a network requires different protocols, technologies and resources to give the user a coherent presentation under varying network performance parameters. Moreover, documents can contain links to other documents, and the user can interact with the components inside them, for example by pausing or resuming their playback, or by following a link. In other words, the authors of a complex hypermedia document, too, are concerned with making all the right things happen at all the right times.

The integration of multimedia material into hypermedia documents is widely used in many applications like distance learning (e.g., ilearning.oracle.com), web advertising and e-business (e.g., www.sony.com), virtual tourism (e.g., www.360thecity.com), cultural heritage (e.g., www.palazzograssi.it), news delivery (e.g., www.cnn.com), entertainment (e.g., www.warner.com), and so on. A large number of such applications fall in the so-called "infotainment" domain, where authors enrich the information with audio files, video clips and animations in order to obtain a more engaging and entertaining effect on the users.

This thesis presents a synchronization model for designing hypermedia presentations that integrate and synchronize several media items and support user interaction. Media items are files in a distributed environment, such as the World Wide Web, which have to be coordinated and delivered to a client. Documents contain continuous and non-continuous media items. Non-continuous (or static) media are text pages and images which, once displayed on the user screen, do not evolve along time. Continuous media are video and audio files, which have their own temporal behavior. While the delivery of static documents is trivial, the delivery of continuous media items is more critical and requires specific technologies like streaming: a constant playback rate must be maintained even if the media delivery rate is not constant, and this requires additional care [65].

The model is focused on a class of applications that we call "video-centered hypermedia": one or more continuous media files (video or audio streams) are presented to the user and, as the streams play, other documents (text and images) are sent to the client browser and displayed in synchrony with them. The user can interact with a presentation by pausing and resuming it, by moving forward or backward, and by following hyperlinks that lead him/her to a different location or time in the same document, or to a different document.
At each user action the presentation must be kept coherent by resynchronizing the documents delivered.

The model addresses both temporal and spatial aspects of a web-based multimedia presentation, and arranges the media involved according to a hierarchical structure. Such a structure allows a designer to neatly organize the presentation components, and to focus on a subset of the synchronization constraints, since many of them can be automatically derived from the presentation structure. The temporal behavior is based on reactions of media items to events. Compared to timeline-based or structure-based models, this model is more flexible, since a change in an object's scheduling time does not need to be propagated to other media by explicitly redefining their behavior.

The model defined in this thesis is not intended as an execution language, as SMIL is, but as a model to design and prototype multimedia presentations [28]. It is a good trade-off between two main requirements: expressiveness and simplicity. The temporal behavior of a presentation is described with five synchronization relationships which define temporal constraints among the media components. The presentation is thus represented by a graph whose nodes are the media items involved and whose edges are the synchronization relationships. Compared to other approaches, this model is simpler; for example, Allen [2] defines seven basic temporal relations between two media objects, which can describe their mutual behavior in a multimedia document¹. When a model is too complex it can be difficult to apply to a particular presentation, and an authoring system built on it may become cumbersome to use. Conversely, a model that is too simple, even if supported by user-friendly tools, can become too restrictive to allow the specification of all aspects of a multimedia presentation. We do not claim that our model allows an author to design any hypermedia document efficiently, due to the great variety of hypermedia types and structures, mainly on the World Wide Web. However, its reference context is wide, and includes applications like self and distance education, professional training, Web advertising, cultural heritage promotion and news-on-demand.

In this thesis we will refer to some sample scenarios to illustrate the model: news-on-demand, cultural heritage promotion and e-learning. These scenarios share a common structure of multimedia presentations, based on several hierarchically organized modules.

¹ A deeper comparison between our model and the related work can be found in Chapter 3.
Each module is composed of a continuous media stream which, as it plays, causes texts, images and possibly other continuous media documents to be displayed on the user's screen. The user may adapt the pace of the presentation (e.g., a news report) to his/her needs by moving within the module or across the modules in a controlled way: by skipping some parts, playing other parts again, temporarily suspending the presentation to browse supplemental material, and so on. The system re-synchronizes (i.e., delivers to the browser) the different media in order to preserve the overall consistency.

This thesis also presents a visual authoring tool which helps the author design the spatial layout and the temporal behavior of a multimedia presentation. During development, particular attention has been paid to the user interface: the tool must support the author in his/her activity, so that few specific skills are required. The authoring environment integrates different media types into a unique presentation. Therefore we do not address editing issues related to single media items, but we assume that the author can access them in a repository.

The author can check the behavior of the presentation through an execution simulator which dynamically highlights the regions of the user screen corresponding to active media, and also animates the synchronization graph of the presentation by showing how the synchronization relations between media items are activated as reactions to time-related events. This solution enables the author to check, step by step, the behavior induced by the designed synchronization scheme.

An XML schema which translates the graph representation of the presentation into an XML document is defined, and the authoring tool is able to export an appropriate XML document for the designed presentation. The XML representation of the hypermedia document can be easily translated into a delivery language like SMIL. Moreover, it can take full advantage of XML processing tools; e.g., XML query languages can provide facilities to query the presentation structure.

This approach has many advantages. First of all, the spatial representation is completely independent of the temporal organization of the objects involved, and can be changed easily, e.g., to present hypermedia documents on different platforms (such as a hand-held device), while preserving the original goals of the author's design. Moreover, a media item can be referenced at different points of the presentation structure without redundancy by addressing its id. Thus, an author can easily reuse the structure of a document, a single media item, an entire document (by simply adjusting its layout) or a part of it. Other approaches, like SMIL [57] and the Amsterdam Hypermedia Model [31], do not separate temporal behavior from data definition, the two types of information being interleaved in the specification. This integration often generates redundancy and does not encourage media reuse, which is desirable especially for large media files.
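For illustration only, the following sketch shows the flavor of such an XML description, with layout, media definitions and synchronization relationships kept in separate sections. The element and attribute names are invented for this example; the actual schema is defined in Chapter 5.

<!-- hypothetical sketch; tag names are illustrative, not the schema of Chapter 5 -->
<presentation>
  <layout>
    <!-- channels are named playback regions on the screen -->
    <channel id="videoCh" left="0" top="0" width="320" height="240"/>
    <channel id="textCh" left="0" top="240" width="320" height="120"/>
  </layout>
  <components>
    <!-- each media item is defined once and referenced by its id -->
    <video id="v1" src="intro.mpg" channel="videoCh"/>
    <page id="t1" src="caption.html" channel="textCh"/>
  </components>
  <relationships>
    <!-- e.g., the caption is displayed as long as the video plays -->
    <playsWith first="v1" second="t1"/>
  </relationships>
</presentation>

Because the media definitions are separate from both the layout and the relationships, the same item can appear in several relationships, and the layout can be redefined without touching the temporal specification.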
document, a single media item, an entire document (by simply adjusting its layout) or a part if it. Other approaches like SMIL [57] and Amsterdam Hypermedia Model [31], do not separate temporal behavior from data definition, the two types of information being interleaved in the specification. This integration often generates redundancy and does not encourage media reutilization, which is desirable especially for large media files. We also discuss a formal description of a hypermedia presentation. Since we propose an event-based synchronization mechanism, during playback the only relevant time instants are the observable time instants, i.e., the ones in which an event occurs, like starting or stopping a media item. Indeed, these are the time instants in which something in the presentation might change as a consequence of an event. We can collect all the information which characterizes the presentation at any given time into the notion of state, which contains all the active and paused objects and their channel occupancy. Then, given a presentation, we can build an automaton which describes all its possible evolutions along time: given a state and an event the automaton deterministically computes what changes are needed in the presentation, i.e., which media items are activated, in which channels, and which items are stopped to evolve to the next state. The automaton, which is implemented by the simulator of the authoring tool, can also be used to study some properties of the corresponding presentation, e.g., if it contains loops (e.g., a soundtrack which loops during a section or during the entire presentation), or to characterize which are its master components (i.e. the objects which, in absence of user interactions, make the presentation to evolve). This thesis also discusses some applications of the proposed model. Besides presentation authoring and execution simulation, it addresses two issues: generation of standard presentations from templates, and querying presentations by content. With the term multimedia report we denote a presentation which integrates data returned by one or more queries to a multimedia repository into a multimedia document where data are coordinated and synchronized. The idea is to extend the authoring tool to give the user the possibility to define a template of a hypermedia presentation, i.e., to define the structure and the synchronization constraints among the objects without addressing object instances but only placeholders standing for items returned by a query. Then the system executes the query and instantiates the template on the retrieved media items. The XML schema has been extended in order to design multimedia reports, and an
The thesis also addresses the use of the proposed model to browse fragments of multimedia presentations returned as the result of a query. Information is globally conveyed by independent media which are archived and delivered separately, and are coordinated and synchronized during playback. If we search for information in a collection of multimedia objects, it can be contained in documents of different kinds. In such a scenario the user is not interested in the single documents which are part of the presentation, but in a complete and coherent fragment of the presentation in which the documents satisfying the query are present. Therefore a query processor must be able to identify a subset of the presentation, which must be returned to the user with proper layout and synchronization. If a hypermedia document is well modelled and one of its components is selected by a query, its activation should activate other objects according to the established synchronization model. The thesis discusses the need for a model for retrieving continuous data in a consistent way and proposes the presentation automaton as a solution to recognize the fragment to return to the user.

The thesis is organized as follows. Chapter 2 comments on the related literature. In Chapter 3, a model for the synchronized delivery of hypermedia documents is illustrated, together with the synchronization issues due to user interaction. In Chapter 4 the synchronization model is translated into the definition of an automaton which describes all the possible evolutions of a multimedia presentation and allows the study of its properties. Chapter 5 defines an XML schema designed according to the model. In Chapter 6 the authoring tool which implements the model and simulates a presentation's behavior is presented. In Chapters 7 and 8 the model is proposed as a solution to the problems of the automatic generation of multimedia reports and of the coherent browsing and retrieval of data included in hypermedia presentations. Chapter 9 presents some final remarks and introduces future work.
Chapter 2
Related work
2.1 Modelling Multimedia Documents
Media synchronization in hypermedia documents has been largely investigated. In [6], Bertino and Ferrari review a number of models for the synchronization of multimedia data, concluding that "...much research is still needed to develop multimedia models with a complete set of features able to address all the requirements of multimedia data synchronization". Since then, several proposals have been made which approach the problem in different ways and with different perspectives.

A first class of models describes the synchronization and behavior of the media components of a presentation by a set of temporal relations. Allen [2] defines a list of thirteen relationships between temporal intervals (the seven basic relations before, meets, overlaps, starts, during, finishes and equals, plus the inverses of the first six), which can be used in domains where interval durations are known, but information about their occurrence is imprecise and relative. As an example, the relation A before B means that the media object A is played before B, but it does not define how much time elapses between the two objects.

King et al. [46] present a taxonomy of possible synchronization relationships between item pairs in multimedia documents. The authors distinguish between two classes of media objects: the synchronization events, which are events that should be available in the first item of the relationship to activate the relationship itself, and the synchronization items, the second item in the relationship, which must be synchronized. A synchronization event can be temporal or conditional, according to whether it refers to a point or to a subinterval on a time line, or to a condition which may occur dynamically.
Synchronization items are divided into point and interval items. A more detailed taxonomy divides synchronization events and items into several subclasses, yielding seventy-two synchronization possibilities. Some of these relationships are of little practical use, but they are defined so that the taxonomy can serve as a reference point for authoring systems and formal models. The formal descriptions of the relations are given in Mexitl, a formal notation for specifying multimedia documents based on an interval temporal logic.

The spatial and temporal layout of media objects inside a multimedia presentation is discussed by Vazirgiannis et al. [69]. The paper defines a model based on temporal and spatial relationships between the different multimedia objects that build up a presentation, and an efficient management of queries related to these relationships during the authoring process. The authors focus on the importance of an integrated approach for the representation of all aspects of a multimedia presentation. Temporal and spatial starting points are defined, and the objects are arranged using topological and time relationships between them. The result is a formalism suitable for designing any multimedia composition.

Schnepf et al. define a model for flexible presentations, called FLIPS [64] (FLexible Interactive Presentation Synchronization). Media objects have no predefined time length, which instead can change according to user interactions. The synchronization scheme is based on two temporal relationships, called barriers and enablers. If event A is a barrier to event B, then B is delayed until event A occurs. If event A is an enabler for event B, when event A occurs event B also occurs, provided it is barrier-free. FLIPS is best oriented toward multimedia presentations, rather than network-based applications or hypermedia navigation. It does not define a structure for the presentations, which are modelled only by temporal relations among the media objects. User interaction is limited to movements to different points in the time line of the presentation. In [63] the authors present a framework designed and implemented to support the synchronization of different media elements into a presentation according to the FLIPS temporal relationships. The framework supports authors during the authoring of a presentation because it handles user interactions, such as skipping forward and backward to different points within a presentation, by re-synchronizing all the objects involved.

A second class of models uses composition to describe synchronization relationships inside a multimedia presentation. HyTime [15, 55] is a modular standard for expressing document architectures in SGML, preserving information about the scheduling and interconnections of media components.
It arranges media items in a multidimensional Finite Coordinate Space, in which axes can be established for anything that can be measured or counted, and events are n-dimensional areas. HyTime provides constructs for specifying where and when the result of following a hyperlink will be rendered, or how events scheduled in a virtual measurement domain must be "projected" into a real one.

A similar approach is taken by the Amsterdam Hypermedia Model [33, 32, 31], which combines media items and temporal relations into a hypertext document. AHM, which extends the Dexter model for hypertexts [30], divides document components into atomic, link and composite components. An atomic component contains information on a media item, its attributes, anchors and presentation requirements. A link component specifies a hyperlink between two components of the document. A composite component includes a collection of atomic, composite or link components which are grouped together, mainly to specify temporal relations among them. Synchronization inside composite components is described by synchronization arcs and offsets that establish time relationships between two components or two anchors. Media items are played into channels, whose content during playback depends on how user interactions modify the context of the presentation. The notion of context, in fact, allows authors to better specify the global behavior of the presentation during link following, by defining which media items begin (or continue) playing and which items are stopped.

SMIL [56, 57], the Synchronized Multimedia Integration Language, is a W3C standard markup language. Defined as an XML application, SMIL provides tags for presenting multimedia objects in a coordinated way. Synchronization is achieved through two tags: seq, to render two or more objects one after the other, and par, to play them at the same time. Using attributes it is possible to play segments inside the time span of an object. The tag excl is used to model some user interactions: it contains a list of child elements, of which only one may play at any given time. The screen is divided into regions in which the multimedia objects are placed. SMIL 2.0 improves interaction and timing semantics, and extends content control and layout facilities, adding new animation and transition primitives. Differently from the model presented in this thesis, SMIL does not define a reference model for the data structure, but only tags for describing media objects' behavior. Moreover, re-use of media objects is not encouraged, since information about media definitions and temporal behavior are mixed.
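As a minimal illustration of this composition style (the file names, region sizes and durations are invented for the example), the following SMIL fragment plays two scenes one after the other, each pairing a video with a caption:

<smil>
  <head>
    <layout>
      <region id="video_area" left="0" top="0" width="320" height="240"/>
      <region id="text_area" left="0" top="240" width="320" height="80"/>
    </layout>
  </head>
  <body>
    <seq>
      <!-- the video and its caption render together;
           the par ends when its last child ends -->
      <par>
        <video src="scene1.mpg" region="video_area"/>
        <text src="caption1.txt" region="text_area" dur="30s"/>
      </par>
      <!-- the second scene starts only when the first par has ended -->
      <par>
        <video src="scene2.mpg" region="video_area"/>
        <text src="caption2.txt" region="text_area" dur="30s"/>
      </par>
    </seq>
  </body>
</smil>

Note how the temporal structure (the seq and par elements) and the media definitions (the src attributes) are interleaved in the same markup, which is the source of the redundancy discussed above.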
In [62], Rutledge et al. present three case studies of standard SMIL, illustrating how it can be used for different forms of multimedia presentations. In [34], Hardman et al. discuss requirements for incorporating multimedia synchronization and hypertext links into time-based hypermedia documents. They distinguish between three types of navigation through links: a simple move along the temporal axis (rewind or forward) that does not alter the linearity of a multimedia presentation, a hypertext navigation among different presentations (or presentation sections), and navigation within and among non-linear multimedia presentations. The paper presents their modelling solutions for incorporating temporal and linking aspects using the Amsterdam Hypermedia Model and SMIL.

ZYX [10, 9, 8] is a multimedia document model with particular attention to reusing and adapting multimedia content to a given user context. The need for this model arises from the lack of sufficient support for reuse and adaptation in other existing standards such as Dynamic HTML, MHEG, SMIL and HyTime. ZYX provides primitives for the temporal, spatial and interaction modelling of a multimedia presentation. It describes a multimedia document by means of a tree: a node is a presentation element and can be bound to other presentation elements in a hierarchical way. A presentation element is a generic element of the ZYX model: it can be an atomic or a complex media element, a reference to an external media item, or a temporal, spatial or layout relationship between the elements of a multimedia presentation. A fragment is a composition of several media items, for example two objects that are to be played in parallel. The presentation layout is defined by projector variables, which are used to bind media objects to projector elements, i.e., the devices available at the client side. This design allows an element to be easily reused, since its layout can be modified by simply changing the projector variable. The adaptation to the user context is realized through switch elements, which allow the author to specify different alternatives for a part of the document.

Many other works present different ways of specifying the temporal scenario of a multimedia presentation. PREMO, the Presentation Environment for Multimedia Objects [37, 24], primarily focuses on the presentation aspects of multimedia. PREMO aims at describing a standard programming environment in a very general way, through which media presentations can be constructed, processed and rendered as part of an overall application.
It provides a minimal, common vocabulary of structures, because one of its main design requirements is to allow for extensibility. Synchronization between media items is described in an object-oriented way. PREMO objects are active: they have their own thread of control and can communicate with one another through messages, i.e., the objects' methods. The model maintains a clear conceptual separation between event-based and time-based synchronization, allowing applications to choose whichever view is more appropriate for them. The model supplies an abstraction for the virtual devices that may produce the different multimedia data of a presentation. An object type and its subtypes set up time relationships: two objects can be rendered sequentially, in parallel, or as alternatives according to the presentation state.

MHEG-5 [25] is a generic standard for encoding multimedia data. An application is represented as a set of scenes, each of which is a multimedia presentation with several media items of different types, spatially and temporally coordinated. The MHEG-5 model is object-oriented: a class is specified by its structure (i.e., the set of attributes), the events which originate from the objects, and the actions (i.e., the object methods) which perform a behavior. Only one scene is active at a time, and navigation within an application is performed in terms of transitions between scenes.

Paulo et al. [59] describe a synchronization model based on hypercharts. The hyperchart notation extends the statechart formalism in order to make it able to describe the temporal constraints and synchronization requirements of a multimedia presentation. A hyperchart contains three new constructs: a timed history, to resume an activity at the point it was previously stopped; timed transitions, to specify the temporal behavior of presentation activities whose firing depends on the state of the system; and a synchronization mechanism to specify object interrelationships. The system performs a single step at each time unit, reacting to all the external changes that happen in that time interval. Hypercharts provide mechanisms for specifying hypermedia requirements such as object duration, delay and jitter, user interaction, navigation mechanisms, and separation between structure and content.

Synchronization between the time-varying components of a multimedia document is defined also in the MPEG-4 standard [53]. In the MPEG-4 context, synchronization is bound to the multiplexing and demultiplexing of continuous media streams, through the time stamping of access units within elementary streams. Differently from the models reviewed above,
MPEG-4 synchronization is more oriented toward fine-grained synchronization than toward the event-based coordination of media.
2.2 Authoring Multimedia Presentations
Existing authoring tools propose many metaphors. The simplest is the timeline-based metaphor, used by Adobe Premiere [40]: multimedia elements are presented and organized in tracks along a time line. Macromedia Director [42] implements a theatrical metaphor in which text, audio and other media objects are cast members on a stage, and the score is a sequencer which animates the actors. These metaphors are simple and intuitive, but they are not easy to manage and maintain, since a modification to the time of an event can require adjusting the time relationships between several objects. Macromedia Authorware [41] uses a flowchart-based paradigm where media objects are placed in sequence and grouped into sub-routines, like commands in procedural programming. With this metaphor the author cannot explicitly define the time intervals ruling the multimedia presentation, which are computed according to the execution order of the media components.

Besides commercial products, many research works have designed multimedia authoring models based on different paradigms, able to create and manage temporal scenarios. CMIFed, the CWI Multimedia Interchange Format Editor [68, 31], is a presentation editing tool for hypermedia documents. CMIFed offers a powerful interface for the creation of the temporal structure of hypermedia documents, which may contain links to other documents on the Web. CMIFed gives the author flexibility by allowing the use of multiple simultaneous channels in which hypermedia objects are placed for presentation. CMIFed implements the Amsterdam Hypermedia Model document structure, providing three different "views" of a presentation: the hierarchy view, to edit the structure; the channel view, to see which channels are busy and which are not; and the player view, to preview the presentation's behavior. Unlike other systems, which use a timeline for the temporal representation, the author has to deal with a collection of events and timing constraints. CMIFed builds a directed graph of timing dependencies and uses it to render hypermedia documents in a correct way. Unlike the authoring tool presented in this thesis, the user cannot manipulate the graph to change the behavior of the presentation,
but he/she must modify the hierarchical structure of the document. In 1998, version 1.0 of SMIL was released, and the authors found that the CMIFed temporal composition has a direct equivalent in SMIL, which became the target language of the documents authored with CMIFed. GRiNS [12] is a more robust update of CMIFed which turned into a commercial product. It provides an extensive user interface for creating and maintaining SMIL (version 2.0) presentations of all sizes. GRiNS includes a network traffic emulator with which the author can understand how connectivity to media servers and different bandwidth resources may affect the presentation's behavior.

Madeus, an authoring and presentation tool for interactive multimedia documents, is presented in [44, 45]. This system provides efficient support for the specification of temporal scenarios, and fits into an architecture that allows the integration of both authoring and presentation of multimedia documents. The authors stress the difference between constraint-based and operational environments, describing the main advantages of the first approach. Madeus allows users to specify the behavior of a presentation, to organize documents in hierarchical structures and to edit a document using multiple views, and includes a plug-in mechanism to incorporate external objects. Multimedia objects are synchronized through the use of timing constraints, and the final layout is specified by spatial relations, such as align or center. The document structure, both temporal and spatial, is represented by a graph, where the horizontal axis represents time values. Graphs are used not only for authoring but also for scheduling and time-based navigation.

HyperProp [66] is an authoring and formatting system for hypermedia documents. The authors compare their proposal to other approaches, stressing the importance of the documents' logical structure. HyperProp uses composition to represent any kind of relation, both temporal (synchronization relationships) and spatial, and uses NCM, the Nested Context Model, as the conceptual model to represent the documents' structure. According to NCM, each entity has a unique identifier, an access control list and a descriptor which contains information about the object's presentation. A link is an m : n relation composed of m source points and n target points, which define the source and destination contexts of the link. HyperProp offers three different graphical views to model the logical structure of the document: the structural view, the temporal view and the spatial view. The author can use the first view to browse and edit the logical structure of the document, the temporal view to represent the objects along a timeline, and the spatial view to preview the objects' layout in the user interface. HyperProp also supports versioning and cooperative work.
Yu presents the Media Relation Graph [70] (MRG), a simple and intuitive hypermedia synchronization model, together with an implementation of the Hypermedia Presentation and Authoring System (HPAS), which is used as a testbed for MRG. A hypermedia presentation is modelled by a directed acyclic graph, where vertices represent media objects and directed edges represent the flow of time. For example, an edge from vertex v1 to vertex v2 means that the object v2 is played after the end of the object v1. HPAS is a system for presenting, integrating and managing hypermedia documents on the Web, written as a Java applet. It provides services for integrating and reusing pluggable components such as Java applets and browser plugins. Its temporal model combines both interval-based and point-based synchronization mechanisms.

In [11] Brotherton et al. describe a system suitable for the automatic generation, integration and visualization of media streams. Although this system can be applied to different domains, such as tourism or meetings, the solutions proposed are strongly oriented towards the educational domain. The authors view the teaching and learning experience as a form of multimedia authoring, and provide a method to integrate different media streams captured during a university lecture, like the video of the teacher, the slides of the lesson, and what the teacher writes on an overhead projector or a whiteboard. The paper addresses two specific problems: stream integration and stream visualization. For stream integration the authors propose to use different levels of granularity, with one particular solution for each level. For the visualization of multiple streams they provide a timeline "road map" of the lecture that points out significant events.
2.3 Automatic Generation of Multimedia Presentations
The problem of the automatic generation of synchronized multimedia presentations from data returned by queries to multimedia databases has been approached in recent years from several points of view.

In [1] Adali et al. present an algebra for creating and querying interactive multimedia presentation databases. The authors model a hypermedia presentation as a tree whose branches represent different playouts due to user interactions, and whose nodes contain the set of active objects and the spatio-temporal constraints between them.
The algebra defines operators to select and present the paths of the presentation in which the user is interested; it can thus be used to create new presentations by merging parts of existing ones.

SQL+D [4, 5] is an extension to SQL which allows users to retrieve multimedia documents as the result of querying a multimedia database. An SQL+D query specifies all the presentation's properties, from the screen layout to its temporal behavior. In addition to SELECT-FROM clauses, the user can define DISPLAY-WITH clauses to describe screen areas (called panels), in which groups of retrieved media items are placed with specified relative positions. A SHOW clause defines the temporal behavior in terms of timed sequences of displays of the returned instances.

Geurts et al. [29] present a formalism to construct multimedia documents by defining high-level semantic relations between media objects. Both the spatial layout and the temporal dynamics are described through the use of quantitative and qualitative constraints. Quantitative constraints alone are not sufficient because they address low-level issues: e.g., the author wants to state that a figure is at the left of another object, but is not interested in specifying the exact number of pixels between them. The author imposes a set of constraints, and the system generates a multimedia presentation which is presented to the user as a solution. A prototype called Cuypers [61, 67] has been developed: a transformation environment supporting the semi-automated assembly of multimedia documents according to a rich structural and semantic annotation based on XML. The annotation allows the different processing steps occurring in multimedia authoring, concerning semantic structure, constraint satisfaction and final-form presentation, to be integrated in a single execution stream.

In [3], André presents an alternative approach to the problem of the automatic generation of multimedia documents, based on concepts already developed in the context of natural language processing. The author considers the generation of multimedia presentations as a goal-directed activity. The input is a communicative goal with a set of parameters, like target audience and language, resource limitations, and so on. The planning component of the system selects a multimedia presentation structure on the basis of some communicative rules, and retrieves elementary objects like text, graphics or animations. The temporal behavior is expressed by temporal relations similar to the ones defined by Allen [2] and by metric (in)equalities.
2.4 Multimedia Querying and Browsing
Problems related to querying, browsing and presenting multimedia information have been largely investigated in the literature. We refer to [7] for an extensive survey of combined text, audio and video retrieval in news applications.

Miller et al. [52] describe an integrated query and navigation model built upon techniques from declarative query languages and navigational paradigms. The implemented system provides facilities to help non-expert users in the query writing activity. Navigation hierarchies are used to present to the user summary information instead of a simple listing of query results. Differently from other approaches, the authors do not consider timing relations between objects, and hide the querying of heterogeneous data on distributed repositories behind a unified user interface.

DelaunayMM [23] is a framework for querying and presenting multimedia data stored in distributed data repositories. The user specifies a multimedia presentation's spatial layout by arranging graphical icons into a style sheet (the schema of the resulting document), either by snapping to a grid or by explicit specification of spatial constraints. Then each icon is assigned to a query, thus combining data selection with presentation. The user-defined style sheets, combined with the answers to the queries, automatically produce a virtual document whose pages are populated by the retrieved multimedia objects. The pages are linked together in a list. DelaunayMM uses ad hoc querying capabilities to search for different types of media items in distributed databases and on the Web, and supports navigational queries which enable the browsing of document links during the querying process. However, the authors do not address any solution for specifying temporal synchronization among the objects.

TVQL, the Temporal Visual Query Language [38, 39], is a multimedia visual query language for the temporal analysis of video data. The authors consider a video of a teacher's lesson in a classroom and annotate the video to identify interesting events, such as a student question or the teacher's talk. TVQL enables the user to browse for temporal relationships between two object subsets; for example, the user can search for students speaking frequently after a teacher's talk. The system allows a user to query and browse the data, and integrates a visual query paradigm with a dynamic visual presentation of the results. Query parameters can be dynamically adjusted. The authors do not approach complex presen-
tations with heterogeneous synchronized objects, but only one video stream at a time.

Chiaramella [22] presents a model that fully integrates browsing and querying capabilities over hypermedia data. The model gives particular emphasis to structured information and combines two components: the hypermedia component and the information retrieval (IR) component. The hypermedia component contains information about both the structure and the content of a document, and is organized into two levels: the hyperindex and the hyperbase. The hyperindex contains the knowledge needed to structure and index data and hyperdocuments; the hyperbase contains all the hyperdocuments and the links between them. This integrated system allows users to compose queries which contain both kinds of information in their formulation. The IR component contains information about the model and the resolution of the queries: it is made of a model of documents, a model of queries and a matching function. Although it deals with composite documents, this paper does not consider time relationships among the atomic objects of a structured document.

GVISUAL [49, 48] is a graphical query language to formulate queries on multimedia presentations based on content information. Multimedia presentations are modelled as directed acyclic graphs. A query consists of a head, with its name and parameters, and of a body. The query body contains an iconized representation of the objects involved and a conditions box with the required conditions. An object icon contains a name and can be nested to represent a composition relationship. Two media items can be connected by an edge which represents a temporal operator. The authors aim at querying not only the information stored in the individual streams, but also the flow of information which represents the theme of a multimedia presentation.
2.5 Multimedia delivery
A survey on the issues related to multimedia document modelling, authoring, querying and presentation would be incomplete without addressing the problems related to the delivery of multimedia data over the network. Multimedia data delivery is becoming an important issue in the new trends of network technology. The World Wide Web architecture was created to deliver text documents. With the introduction of images the size of hypertext documents grew, but the increase in available network bandwidth hid this problem.
Multimedia objects are characterized by large sizes, and require high bandwidth and real-time traffic. Audio and video data must be played back continuously at the rate at which they are sampled: if data do not arrive in time, they are obsolete and thus useless. Indeed, a data delay can cause the playing process to stop, and the user may hear silence or look at a still image instead of listening to a song or viewing a movie. The TCP/IP protocol is not suitable for multimedia data, because it provides a best-effort service, guaranteeing the delivery of a message but not guaranteeing that a maximum delay is observed. The UDP/IP protocol provides a range of services that multimedia applications can use, because it does not re-send missed packets, thus reducing network traffic. However, neither protocol answers the multimedia delivery challenge, because they do not deal with the problems of balancing network traffic, guaranteeing a certain level of quality of service, and reserving resources. The design of proper real-time protocols for multimedia networking thus becomes an imperative [50].

Streaming technology sends multimedia data across the Internet in streams: instead of waiting to download the entire file, the client begins to play back the multimedia data as soon as it receives them. Media files are broken into packets with a size suitable for transmission between servers and clients. The real-time data are sent across the network and stored in a buffer at the destination; a client can play the first packet while decompressing the second and receiving the third, thus reducing the time the user has to wait for playback. RTSP, the Real Time Streaming Protocol [65], is a client-server multimedia presentation protocol that enables the controlled delivery of streamed multimedia data over IP networks. It provides remote control functionality for audio and video streams, like pause, fast forward, reverse, and absolute positioning. RTSP is an application-level protocol which cooperates with lower-level protocols to provide the transmission. RTSP allows the delivery and control of different media streams, but it does not deal with the problems of the inter-synchronization of such objects at the client side.

In [35, 36] Huang et al. propose a solution to this problem. They describe a synchronization architecture to design and control multimedia document delivery, and the timing constraints which cannot be described using traditional document structures (e.g., SGML). The model contains logical, layout and temporal structures to formally describe the re-synchronization adjustments due to user interactions and network jitter.
dynamic extended finite-state machines (DEFSMs) to describe and maintain temporal synchrony among several media streams: an "actor" DEFSM manages each single medium, and a "synchronizer" DEFSM orchestrates the whole presentation behavior. An actor captures media units from the corresponding medium and delivers them to display devices. A synchronizer handles user interactions. Based on the proposed model, a distributed multimedia document development mechanism, called MING-I [35], has been developed. MING-I proposes an interactive synchronization scheme based on the use of tokens for control, and contains four main components: the authoring component, the presentation scheduler generator, the interaction-processing agent and the execution environment base. The presentation scheduler includes a client-part scheduler and a server-part scheduler that together are responsible for the correct transmission of a multimedia document and for its re-synchronization after any user interaction. In [13, 14], Candan et al. approach the same problems by relating multimedia document download to the presentation playback. They define a model to design and play multimedia presentations based on the use of flexible temporal constraints (e.g., the start time instant of an object can vary within five seconds), from which they derive a possible presentation schedule. The set of temporal constraints of the presentation can be represented by a graph, whose nodes are temporal variables and whose edges are time constraints. The presentation schedule is computed with a shortest path algorithm: if a negative cycle exists, there are some conflicting constraints that should be removed. Based on the presentation schedule, the authors derive the retrieval schedule and the buffer resources needed to download and play media items from the network. Then, the retrieval schedule can be validated by checking all the system availability constraints.
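To make the constraint-solving step concrete, the following sketch (our illustration of the general technique, not code from [13, 14]; all names are ours) treats flexible constraints of the form t_v - t_u <= w as a difference-constraint graph and applies Bellman-Ford-style relaxation: a feasible schedule is a shortest-path solution, and a negative cycle signals conflicting constraints.

    def solve_schedule(n_vars, constraints):
        """constraints: list of (u, v, w) triples meaning t_v - t_u <= w.
        Returns start times for the n_vars temporal variables, or None
        if the constraints conflict (negative cycle)."""
        dist = [0.0] * n_vars          # a virtual source at distance 0 to every variable
        for _ in range(n_vars):        # |V| - 1 relaxation rounds (V includes the source)
            for u, v, w in constraints:
                if dist[u] + w < dist[v]:
                    dist[v] = dist[u] + w
        # one more pass: any further improvement means a negative cycle,
        # i.e. the temporal constraints are in conflict
        for u, v, w in constraints:
            if dist[u] + w < dist[v]:
                return None
        return dist

    print(solve_schedule(2, [(0, 1, 5), (1, 0, -10)]))  # None: conflicting constraints
    print(solve_schedule(2, [(0, 1, 5), (1, 0, -2)]))   # a feasible schedule, e.g. [-2.0, 0.0]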
Chapter 3
A Model for Synchronized Hypermedia Documents
This chapter presents a synchronization model for hypermedia presentations. Several media items, continuous ones like video or audio files, or non-continuous ones like text pages or images, are delivered separately and presented to the user in a coordinated way. The model is based on a set of synchronization relationships, which define the objects' behavior during the presentation's playback, and on channels in which to play each object. The model is suited for a wide range of hypermedia document types: some examples are self-instruction and distance education, professional training, Web advertising, cultural heritage promotion and news-on-demand. A virtual exhibition is analyzed as a test bed to validate the model. The issues presented in this chapter are also discussed in [27, 16].
Hypermedia documents are modelled along two directions, one describing the hierarchical structure of the document components, the other describing the presentation dynamics through synchronization primitives.
3.1 Hypermedia document structure
A hypermedia document is composed of one or more modules, which the user can access in a predefined order or through some table of contents. For example, a guided tour in a virtual museum steps through a predefined selection of the museum rooms; a multimedia course is divided into lessons, which can be indexed through a syllabus or presented according to a time schedule; a news-on-demand application delivers independent articles selected from an index; an estate selling application lets the user choose the relevant property from a catalogue, and so on.
Definition 3.1.1 A module is a collection of media items, both continuous and non-continuous, related to a common topic.

We assume that, from the presentation dynamics point of view, each module is completely autonomous; all the media presented to the user at the same time are enclosed in the same module; user interaction can move the presentation's playback to another module. We do not elaborate further on this level of access, since it is not relevant for the delivery and synchronization of the different components, and we assume a module to be the topmost level of aggregation of media components which is relevant for our discussion. We distinguish between two kinds of media objects: continuous media files and document files like texts and images. Continuous media files are video and audio recordings, usually referred to as videoclips or audioclips. Texts and images are contained in static documents, usually referred to as pages in the WWW parlance. A module contains atomic media items and compound media items, which we call composites. A composite is a part of a multimedia document which is normally played without requiring user intervention. Moving from a media item or a composite to the next one can be automatic or not, depending on the document purpose and the application context.
Definition 3.1.2 A composite is a set of mutually synchronized continuous and non-continuous media which behaves, at a high level of observation, as a single media item.

A composite may represent a standard or recurring pattern of synchronized media items, whose details can be hidden at a high level of specification. Composites can contain other composites to build up complex structures and behaviors. Therefore a composite is a kind of envelope enclosing several media items. As a whole, it starts and ends according to the same synchronization rules which hold for atomic media items. In particular, the composite ends when all the enclosed media items are no longer active¹. If we analyze the structure from a dynamic point of view, we can better refer to a terminology borrowed from the movie context.
¹ This concept will be explained and discussed in Section 3.2.1.
Figure 3.1: Structure of a video-centered hypermedia document

Definition 3.1.3 A story is a set of continuous media items which constitutes the "master" media stream of the module content.

Definition 3.1.4 Clips are the atomic continuous media streams which build up a story, and which are played continuously. Clips are divided into scenes, each of which is associated with different static documents, displayed during the scene playback.

Definition 3.1.5 A scene is a time interval in the clip's playback.

The term narration denotes the playback of the continuous components of a hypermedia document. Figure 3.1 pictorially shows this structure. Our model hierarchically divides continuous documents into stories, clips and scenes; static documents are simply referred to as pages. Since our goal is the description of the behavior of a multimedia presentation, we model as independent entities only the media items which interact with other components of the presentation, i.e., media objects which are related by some synchronization relationship. Therefore, we do not consider the internal dynamics of components which embed animations or short video and audio sequences (like, e.g., animated banners, logos and visual effects). They are treated as pages, unless the behavior of the embedded media items is subject to synchronization constraints. The model also defines two additional elements, the user interaction entity and the timer, which are particular media items used to model specific presentation behaviors.

Definition 3.1.6 A user interaction entity is an interaction widget, considered as a continuous media object, whose length is variable and defined by the user at run time.
A user interaction entity controls the synchronization induced by the user who selects it. It can be an anchor in a text page or a separate component, e.g., a graphic button or a clickable image.
Definition 3.1.7 A timer(n) is a continuous media object whose time length is defined by the author at design time, and is equal to n time units². A timer is not visible in the user interface.
3.1.1 Logical vs. physical structure
The structure illustrated in Figure 3.1 is a logical structure, which is in principle unrelated to the physical organization of the media objects. First of all, we divide media items into compound objects, i.e., modules, composites and stories, and atomic media items, i.e., clips and pages. Compound items describe the logical organization of a multimedia presentation, while a correspondence with the physical structure must be defined for the atomic components. A video file could correspond to a whole clip, to a scene within it, or to a sequence of clips that can be played continuously or as independent segments. Ideally, the correspondence between the logical and the physical organization should be hidden in an implementation layer, without influencing the model organization. In practice this cannot be done without paying a price in terms of performance and manageability, even in a local environment such as a CD-ROM based multimedia presentation. Efficient hypermedia document delivery over a network is affected by the size of the information to be transferred, even in a streaming environment, where a correct buffer sizing is crucial. For this reason it is important to organize multimedia data into well defined and reasonably sized structures that make them easy to manage and allow the reuse of the constituent components where necessary. In a distributed environment like the World Wide Web, the relationships between the logical and the physical organization of the media objects are thus of primary concern, and must be taken into account at the application design level.
² We assume that time is discrete. This assumption is discussed in Section 3.2.1.
A distinction exists between continuous and non-continuous documents: due to their large size, continuous documents have both a logical and a physical organization; static documents are atomic. In both cases, a correspondence between the logical and the physical structure exists:
• a (logical) clip is a video or audio file which is assumed to play continuously from beginning to end, unless user interaction modifies this behavior;
• a static document is a file which is displayed as a whole by the browser, either directly or indirectly through a plug-in or helper application. We do not implicitly refer to any specific document type: we use the generic term page, without entering into further detail about the possible formats (HTML, XML, PDF, etc.).
From the above assumptions it follows that one logical clip is implemented by one media file. Scenes are defined by time intervals in the clip: they must be contiguous, and they are normally played one after the other. Pages too are bound to specific contents associated with a scene. We can thus say that a scene with its associated page is the smallest amount of information that is delivered continuously as a whole, and that a clip with its associated set of pages is the minimum self-consistent and autonomous information with respect to the application domain. According to the synchronization primitives that we shall discuss in Section 3.2, playing a scene causes the corresponding page to be displayed; conversely, accessing a page via a hyperlink causes the corresponding scene to be played. We can anticipate that, coherently with the class of applications we want to model, a master-slave relationship is established between the dynamic and the static components of a hypermedia document: the dynamic components control the hypermedia document playback, and the static components are synchronized accordingly.
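To make the correspondence concrete, one possible encoding (purely illustrative: the class names, field names and file paths below are invented, not part of the model) binds each logical clip to a single media file, and each scene to a contiguous time interval with its associated page:

    from dataclasses import dataclass

    @dataclass
    class Scene:
        start: float      # seconds from the beginning of the clip's media file
        end: float
        page: str         # the static document displayed during the scene

    @dataclass
    class ClipMapping:
        media_file: str   # the single physical file implementing the logical clip
        scenes: list      # contiguous intervals: scenes[i].end == scenes[i+1].start

    c1 = ClipMapping("clips/c1.mpg",
                     [Scene(0.0, 12.5, "pages/p1.html"),
                      Scene(12.5, 30.0, "pages/p2.html")])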
3.1.2 Channels
Media objects require proper channels to be displayed or played. Definition 3.1.8 A channel is a (virtual) display or playback device like a window, a frame, an audio device or an application program able to play a media file, that can be used by one medium at a time.
Our model does not concentrate on the spatial aspects of multimedia presentations, which can be better addressed using other approaches, e.g., SMIL; its main focus is on synchronization issues. Therefore the channel definition accounts only for a minimal description of the spatial arrangement of the documents and does not enter into details about channel properties: the only relevant properties are an identifier that uniquely identifies the channel, and a type that defines the media types that can use it. Other properties of a channel will be discussed in Chapter 5, which describes an XML language for hypermedia presentations. While modules are the mechanism to hierarchically compose different media objects into a presentation, channels allow the spatial arrangement of media items; thus they are used to define the layout of a multimedia presentation. Channels are defined and associated with media objects at design time, and defaults are established consistently with the WWW environment: e.g., the main browser window is the default channel for all the media with visible properties. Several channels can be defined as areas in the main window (e.g., frames) or as separate windows, if the presentation requires the simultaneous activation of several media of the same type. For example, a lesson about language translation could require the original text and the translation to be accessible to the user at the same time, even if each of them is a conventional text document that would be displayed in the browser main window if delivered alone.

Definition 3.1.9 A channel is busy if an active media item is using it, otherwise it is free.

Free channels may however hold some content, if the deactivation of a media item does not destroy its visibility: for example, at the end of a movie playback the movie window could display the first or the last frame. Several compatible media objects may share the same channel at different times³.
³ Transition effects, like fades, between two objects which share the same channel can be handled at the application level by the implementation.
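A minimal rendering of Definitions 3.1.8-3.1.9 (a sketch of ours; the names are illustrative) keeps exactly the two relevant properties and a busy/free test:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Channel:
        ident: str        # unique identifier, e.g. "text" or "an"
        media_type: str   # the media type allowed on it, e.g. "page" or "video"

    def is_busy(channel, used_by):
        """used_by maps channel identifiers to the active medium, or None."""
        return used_by.get(channel.ident) is not None

    text = Channel("text", "page")
    print(is_busy(text, {"text": None}))    # False: the channel is free
    print(is_busy(text, {"text": "p1"}))    # True: an active media item uses it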
3.2 Synchronization primitives
A presentation is made of different components which evolve in reaction to some events, such as user interaction, or due to intrinsic properties of the components, like their time length. We consider only the events that bring some change into the set of active objects during
presentation playback. For example, window resizing is not a significant event, while stopping a video clip is. In a hypermedia document each component plays a specific role during the presentation's playback, and must be synchronized with the others to make the presentation evolve. Thus we need some relationships to model the objects' behavior and the channels' utilization; we call them synchronization primitives. We define five synchronization primitives:
• a plays with b, denoted by a ⇔ b;
• a activates b, denoted by a ⇒ b;
• a terminates b, denoted by a ⇓ b;
• a is replaced by b, denoted by a ⇝ b;
• a has priority over b with behavior α, denoted by a >α b.
Some of these relations need to be explicitly defined during the presentation's authoring; others can be automatically inferred from the hierarchical structure of the presentation and from other relationships.
3.2.1 Basic concepts
If we observe the presentation along time, it can be divided into a number of time intervals in which some conditions hold, e.g., some media are active while others are paused. If MI is the set of media components which build up a presentation and CH is the set of channels of the presentation, we can describe the presentation evolution in terms of active media and channel occupation at any given time instant. We assume that time is discrete, and marked by a variable i ∈ N which is updated by a clock. The actual time resolution is not important, as long as it allows the capture of all the events related to media execution, and the observation of the effects of temporally distinct events as distinct effects. Therefore we assume that the variable i is incremented in such a way that, if at time i an event e occurs, at time i + 1 we are able to see the presentation changes due to the event, and no other event occurs before time i + 2. Two or more events are contemporary if they occur at times denoted by the same value of the variable i.
For any media item m ∈ MI the possible events are start(m), when the user activates the object, pause(m), when m is paused, end(m), when the object ends, and stop(m), when m is forced to terminate⁴. Two functions describe the channel occupation at a given time:
• channel : MI → CH which, given a media object, returns the associated channel, and
• usedBy : CH × N → MI ∪ {_} which returns, for every channel, the media item that occupies it at the given instant i. The underscore symbol _ denotes the absence of a media item; it is the value used to identify free channels.
Then we must identify which media are currently playing, and which media are paused. This information is captured by the following definitions.

Definition 3.2.1 A media object is active at some time instant i if it occupies a channel, otherwise it is inactive.

In terms of occurred events, a media item m is active at time i = i_current if an event start(m) has already occurred at a time i_start < i_current, but no event end(m) or stop(m) has occurred since that time. The function isActive : MI × N → {True, False} can therefore be defined as follows:

isActive(m, i) = True, if an event start(m) has occurred at a time i_start < i and no event end(m) or stop(m) has occurred since that time; False otherwise.

Similarly, the function isPaused : MI × N → {True, False} is defined as:

isPaused(m, i) = True, if an event pause(m) has occurred at a time i_pause < i and no event start(m) or stop(m) has occurred since that time; False otherwise.
At any time i = i_current, the presentation state can be completely described by the functions usedBy, isActive and isPaused, which return the channel occupations, the set of active media items ({m | isActive(m, i_current) = True}) and the set of paused media items ({m | isPaused(m, i_current) = True}).
⁴ The difference between the last two events will be described in Definition 3.2.2.
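To make these definitions concrete, here is a minimal sketch (ours, not the thesis' implementation; all names are illustrative) that reconstructs isActive, isPaused and usedBy from the event history:

    FREE = None   # stands for the underscore "_" of Section 3.2.1 (a free channel)

    class PresentationState:
        """Tracks isActive, isPaused and usedBy for a set of media items."""

        def __init__(self, channel_of):
            self.channel_of = channel_of          # the function channel: MI -> CH
            self.active = set()
            self.paused = set()
            self.used_by = {ch: FREE for ch in set(channel_of.values())}

        def handle(self, event, m):
            """Update the state for one of the four events of Section 3.2.1."""
            ch = self.channel_of[m]
            if event == "start":                  # start(m): m becomes active
                self.active.add(m)
                self.paused.discard(m)
                self.used_by[ch] = m
            elif event == "pause":                # pause(m): m waits to be resumed
                self.active.discard(m)            # and releases its channel
                self.paused.add(m)                # (cf. Definition 3.3.3, case p)
                if self.used_by[ch] == m:
                    self.used_by[ch] = FREE
            elif event in ("end", "stop"):        # natural end / forced termination
                self.active.discard(m)
                self.paused.discard(m)
                if self.used_by[ch] == m:
                    self.used_by[ch] = FREE       # the channel becomes free

        def is_active(self, m):
            return m in self.active

        def is_paused(self, m):
            return m in self.paused

For instance, after handle("start", "v") a video v is active and its channel busy; after handle("end", "v") the channel is free again.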
Definition 3.2.2 A continuous media item naturally ends when it reaches its ending point during playback. A generic media object is forced to terminate when another entity (the user or another multimedia object) stops its playback or closes its channel before its natural end.

As particular cases, a user interaction entity naturally ends when the user activates it, e.g., with a mouse click, while a timer naturally ends when the time it has been set to has elapsed. We distinguish between the length and the duration of a generic media object, continuous or non-continuous. Every object, once activated, holds a channel for a predefined time span that is necessary to access the whole object content, unless the user interacts with it. We call such a time span the object length. The length is a static property, defined when the object is created, independent of the presentation of which the object is a component. For example, the length of a video clip is its "native" length, i.e., the number of frames divided by the frame rate. Static objects like text pages, once activated, hold the channel until some event removes them: the user can access the content during a time which is not bounded by the object's content. Their length is therefore potentially infinite. Composites and modules have a particular behavior, since they contain a collection of synchronized media items lasting for a certain time span. Therefore each composite (or module) has a complex inner structure but, from the presentation point of view, can be seen as an atomic media item with its own length.

Definition 3.2.3 Let M be the set of media items contained in a composite (module). The composite (module) ends at some time instant i iff M ∩ {m | isActive(m, i) = True} ≠ ∅ and M ∩ {m | isActive(m, i + 1) = True} = ∅. If the last event occurred is end(m), where m ∈ M, the composite (module) naturally ends; otherwise it is assumed to be stopped.

At run time, a user can access a media object for a time span that can be different from its length, due to the presentation design or to user interaction. For example, a video sequence can be played completely, or only partially according to a run-time decision of the user, who could stop the video playback before its end. We call this time span the object duration; it is a dynamic property that can change during object playback.
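The distinction can be illustrated with a small sketch (our own naming, not part of the model): the length is a static property computable from the object alone, while the duration is only known at run time.

    import math

    def length(item):
        """Static length of a media item, per Section 3.2.1 (illustrative)."""
        kind = item["kind"]
        if kind == "clip":                 # native length: frames / frame rate
            return item["frames"] / item["fps"]
        if kind == "timer":                # timer(n) lasts exactly n time units
            return item["n"]
        if kind == "page":                 # static documents hold the channel
            return math.inf                # until some event removes them
        if kind == "ui":                   # user interaction entity: decided by
            return None                    # the user, unknown at design time

    print(length({"kind": "clip", "frames": 1500, "fps": 25}))   # 60.0 seconds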
3.2.2 Basic synchronization
The first two synchronization primitives deal with the presentation behavior in the absence of user interaction.

Definition 3.2.4 Let a and b be two generic media objects. We define "a plays with b", written a ⇔ b, as the relation such that the activation of one of the two objects (i.e., a or b) causes the activation of the other (i.e., b or a), and the natural termination of object a causes the forced termination of object b. Each object uses a different channel.

More formally, given the time i = i_current, the relation a ⇔ b models the behavior of the presentation in the following cases:

1. if events start(a) or start(b) occur, then
Pre-conditions:
• isActive(a, i) = False
• isActive(b, i) = False
• usedBy(channel(a), i) = _ ⁵
• usedBy(channel(b), i) = _ ⁵
Post-conditions:
• isActive(a, i + 1) = True
• isActive(b, i + 1) = True
• usedBy(channel(a), i + 1) = a
• usedBy(channel(b), i + 1) = b

2. if event end(a) occurs, then
Pre-conditions:
• isActive(a, i) = True
• isActive(b, i) = True
• usedBy(channel(a), i) = a
• usedBy(channel(b), i) = b
Post-conditions:
• isActive(a, i + 1) = False
• isActive(b, i + 1) = False
• usedBy(channel(a), i + 1) = _
• usedBy(channel(b), i + 1) = _

3. otherwise, all other events e ∉ {start(a), start(b), end(a)}, or situations in which some of the pre-conditions do not hold, do not interfere with other objects but simply change the state of the object affected by that event.

This relationship describes the synchronization between two media items active at the same time. If a video clip v goes with a static page p, the relation v ⇔ p means that the video and the page start and end together (i.e., they have the same duration). The relation ⇔ also describes the synchronization between two continuous objects: e.g., a video v with a sound track st is described by the relation v ⇔ st. This relation means that v and st start at the same time, and that when v naturally ends it causes the sound track to end too, if it is not finished yet. The relation "plays with" is therefore asymmetric: object a plays the role of master with respect to the slave object b. The master object is usually the one the user can interact with. For example, a video clip played with an accompanying text is the master, because the user can pause, stop, or move inside it, causing the accompanying text to be modified accordingly. It is important to note that the "plays with" relation cannot provide fine-grain synchronization (e.g., lip-synchronization) but only coarse-grain synchronization, because it defines the media behavior only at the starting and ending points and not in between. This apparent limitation is consistent with the application domain of interest of this model.

⁵ Unless used by other media as a consequence of other relations; this is a general remark and for simplicity we shall not repeat it in the following.

Definition 3.2.5 Let a and b be two generic media objects. We define "a activates b", denoted by a ⇒ b, as the relation such that the natural termination of object a causes the beginning of the playback (or the display, for non-continuous media) of object b. Objects a and b may or may not share the same channel.

More formally, if the event end(a) occurs at time i = i_current and
Pre-conditions:
• isActive(a, i) = True
• isActive(b, i) = False
• usedBy(channel(a), i) = a
• usedBy(channel(b), i) = _
then the relation a ⇒ b models the following behavior:
Post-conditions:
• isActive(a, i + 1) = False
• isActive(b, i + 1) = True
• usedBy(channel(b), i + 1) = b
• usedBy(channel(a), i + 1) = _ (if channel(a) ≠ channel(b))

All other events e ≠ end(a) do not interfere with other objects but simply change the state of the object affected by that event.

The relationship "activates" describes two objects that play in sequence. We limit the scope of this relationship to the natural termination of an object, leaving out the forced termination caused by user intervention or other external events. The reason for doing so will be more evident after discussing the other synchronization relationships, but we can anticipate that the main reason is that we interpret the forced termination as an indication to stop the presentation or a part of it. Therefore, the scope of this action must be defined explicitly and should denote a deviation from the "normal" behavior of the presentation. With this relation we can model the situation in which a sound track st is looped to adapt its duration to the length of a video, by specifying st ⇒ st: the natural end of the sound track activates it again, using the same channel, until the video ends. Then the sound track terminates too, according to the relation v ⇔ st. If the sound track is stopped, it is not repeated. While more meaningful if applied to continuous objects, the "activates" relation (⇒) can define the sequential play of continuous and non-continuous objects, e.g., a credit page after the end of a video sequence. It should be obvious that object a in the relation a ⇒ b must be a continuous media object, because static media objects have an infinite length.
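The operational reading of Definitions 3.2.4 and 3.2.5 can be sketched as an event dispatcher built on top of the PresentationState sketch given earlier (again our own illustrative code, not the thesis' system):

    class Scheduler:
        """Applies "plays with" (a <=> b) and "activates" (a => b) to events."""

        def __init__(self, state):
            self.state = state        # a PresentationState (see the previous sketch)
            self.plays_with = []      # (a, b) pairs, a playing the master role
            self.activates = []       # (a, b) pairs

        def dispatch(self, event, m):
            self.state.handle(event, m)
            if event == "start":
                # a <=> b: the activation of one object activates the other
                for a, b in self.plays_with:
                    other = b if m == a else (a if m == b else None)
                    if other is not None and not self.state.is_active(other):
                        self.dispatch("start", other)
            elif event == "end":
                # a <=> b: the natural end of the master forces the slave to stop
                for a, b in self.plays_with:
                    if m == a and self.state.is_active(b):
                        self.dispatch("stop", b)
                # a => b: the natural end of a starts b (a => a models a loop)
                for a, b in self.activates:
                    if m == a:
                        self.dispatch("start", b)

With the relations v ⇔ st and st ⇒ st registered, dispatch("end", "st") restarts the sound track, while dispatch("end", "v") stops it, mirroring the looping example above.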
Figure 3.2: Synchronization schemas defined using primitives ⇔ and ⇒ (three schemas, (a), (b) and (c), combining clips clip1 and clip2, a composite comp and a page p1)
Simple timeline-based presentations can be described by using only “plays with” and “activates” relations; an example is illustrated in [19].
The event-based synchronization scheme turns out to be more powerful when the time length of media items is not known, as shown in Figure 3.2. In (a), page p1 is displayed as soon as the shortest clip ends; in (b), the composite comp ends when both clips are finished, so p1 waits for the end of the last clip. In both cases, the author does not know in advance which is the shortest (or the longest) clip. Figure 3.2(c) shows a more complex situation: page p1 is displayed at the end of clip clip2 and releases the channel when clip1 finishes. Depending on the durations of clip1 and clip2, page p1 remains on the user's screen for a different time interval: e.g., if length(clip1) < length(clip2), p1 is not displayed at all.
More complex dynamics, also involving user interaction, require additional relationships, which are described in the following section.
3.3 User interaction
Users have several ways of interacting with a hypermedia document: they can stop the presentation of part of the document before it ends, e.g., because they are no longer interested in it and want to skip forward. They can branch forward or backward along the time span of a continuous medium, and the document components must resume playing in a coherent and synchronized way after the branch. Users can leave the current presentation to follow a hyperlink, either temporarily, resuming the original document later, or definitely, abandoning the original document and continuing with another one. We define three synchronization primitives to model the presentation behavior under user interaction.

Definition 3.3.1 Let a and b be two generic media objects. We define "a terminates b", written a ⇓ b, as the relation such that the forced termination of object a causes the forced termination of object b. The channels occupied by the two media objects become free.

More formally, the relation a ⇓ b models the reaction to the event stop(a). If the time instant triggered by the event is i = i_current and
Pre-conditions:
• isActive(a, i) = True
• isActive(b, i) = True
• usedBy(channel(a), i) = a
• usedBy(channel(b), i) = b
then:
Post-conditions:
• isActive(a, i + 1) = False
• isActive(b, i + 1) = False
• usedBy(channel(a), i + 1) = _
• usedBy(channel(b), i + 1) = _

All other events e ≠ stop(a) do not interfere with other objects but simply change the state of the object affected by that event.

The relationship "terminates" (⇓) models situations in which the user stops a part of a presentation, and the other media must be re-synchronized accordingly. As an example, let us consider a multimedia presentation for travel agencies: a video clip (a) illustrates a town tour and a text page (b) describes the relevant landmarks, with a ⇔ b. If the user stops the video playback, the object a is terminated and releases its channel. Object b remains active, because the relationship a ⇔ b is meaningful only when a comes naturally to its ending point. Therefore the channel used by object b remains busy, leading to an inconsistent situation. Introducing the relationship a ⇓ b, the inconsistency is removed: the forced termination of object a causes the termination of object b too; the channel used by b is released and can be used by another document. Like the relationship a ⇔ b, the relation a ⇓ b is asymmetric. It is worth discussing why we have introduced in the model the relationship "terminates" (⇓) instead of extending the relationship "plays with" (⇔) to deal with any termination of an object. Had we done so, the model would not be complete, i.e., some synchronization requirements could not be described. Several other approaches, e.g., SMIL 2.0, do not in fact distinguish the natural termination from the forced stop of an object: they are the same event. In our model these two types of termination are two different events, which can lead to different evolutions of the presentation. An example will illustrate this statement. Let us consider a presentation which begins with a short introductory animation. As the animation ends, a sound track starts playing, and the presentation displays a page which asks the user to choose among different video clips to continue⁶. The sound track does not stop when the user selects a video clip, but continues in the background. Figure 3.3 pictorially shows this situation: the objects involved are the first animation intro, the sound track snd, the text page txt, a video clip vclip and a user interaction entity ui, which controls the synchronization induced by the user who selects the next video clip to play. In this example ui is obviously an anchor in the text page. According to Definition 3.1.6, it naturally ends when the user activates it, e.g., with a mouse click.
⁶ The modelling of the whole presentation requires the notion of link that will be introduced later.
Figure 3.3: The relation ⇓

The introduction starts both the page and the sound track, by virtue of the relationships intro ⇒ snd and intro ⇒ txt. Object ui is activated together with the text page (txt ⇔ ui)⁷, and ends when the user selects the video clip. If the user later wants to leave the presentation, he or she stops the video clip vclip, which is the foreground media object. It must stop the sound track, and this action cannot be described without a specific relationship between the two objects, which are otherwise unrelated. We have to set the relation vclip ⇓ snd⁸. The relationship "terminates" (⇓) also allows us to define the behavior of a presentation when a user moves forward and backward along the presentation time span. We consider a "branch" to be a movement along the time axis of the same multimedia presentation. If a user moves to another hypermedia document, we do not consider it a "branch" but the activation of a hypermedia link, which will be handled in a different way. The presentation is stopped at the point at which the user executes the branch command, and starts again at the point the user has selected as the branch target. The synchronization primitives introduced so far can handle this case. The target of a branch can be the beginning of an object (a continuous object), but it can also be any point within its length. This is possible if the user interface provides some mechanism to navigate along the time span of an object, e.g., a cursor or a VCR-style console. From the synchronization point of view, the relations are associated with the activation of a media item as a whole, so each point inside the length of the object is equivalent to its starting point. In particular, the relationship "plays with" (⇔) is associated with the activation of the master object, regardless of where it starts, so it is able to activate the associated slave object even if the master starts from a point in the middle of its duration.

⁷ In this example we ignore what happens to the text page when the video clip starts.
⁸ In the figures we use a different arrow style for the relation terminates (⇓) to improve visibility.
3.3.1 Hyperlinks
When a user follows a hyperlink to another presentation, modelling requires additional care: we must define the hyperlink source and destination, and the behavior of the objects involved in the link. The source of a hyperlink is the object which contains it, but the author can define some synchronization primitives which describe the behavior of the presentation when the user follows that link, possibly affecting other media objects. We therefore consider the whole presentation as the source of a hyperlink. More properly, all the objects active at the time of the user interaction are considered as link sources. In the same way, the destination of the link is the linked object, but the presence of some synchronization relations can involve other media items. Therefore, we consider the set of objects active after the user action to be the destination of the hyperlink⁹. We can distinguish three cases, which need to be modelled in different ways:
• following the hyperlink does not imply any change in the continuous objects of the presentation;
• following the hyperlink takes a continuous component of the hypermedia document to a paused state;
• following the hyperlink stops a continuous object of the presentation.
Let us consider the first case, taking as an example a self-learning application in which a video and some slides are displayed together. One of the slides contains a link to another text document with supplemental information. If the document is short, the user does not need to stop the video in order to read the text without losing information. This example is illustrated in Figure 3.4. Two video clips c1 and c2 are each associated with a text page (respectively p1 and p2). Page p1 contains a hyperlink to page p3. The author of the application can model the display of the linked page p3 in two ways: opening another channel, or using the same channel as p1, which must be purposely released. In the first case the link causes a new channel to be allocated, without any consequence on the other media's synchronization. In the second case, we need to introduce another synchronization primitive, as illustrated in Figure 3.4.

⁹ The reader could note that this behavior is consistent with the one defined in the Amsterdam Hypermedia Model [33].

Figure 3.4: Following hyperlinks: first case

Definition 3.3.2 Let a and b be two media objects of the same type (i.e., two continuous media or two non-continuous media) that can use the same channel. We define "a is replaced by b"¹⁰, written a ⇝ b, as the relation such that the activation of object b causes the forced termination of object a. The channel occupied by a is therefore released, to be used by object b.

¹⁰ We prefer the passive form of the relation (i.e., "a is replaced by b" instead of "b replaces a") because the notation a ⇝ b reflects from left to right the temporal order of the media items. This property is especially useful in the graphical representation, which is clearer.

More formally, if the current time is i = i_current when event start(b) occurs and
Pre-conditions:
• isActive(a, i) = True
• isActive(b, i) = False
• usedBy(channel(a), i) = a
• channel(b) = channel(a)
then:
Post-conditions:
• isActive(a, i + 1) = False
• isActive(b, i + 1) = True
• usedBy(channel(a), i + 1) = b ¹¹

All other events e ≠ start(b) do not interfere with other objects but simply change the state of the object affected by that event.

¹¹ Objects a and b use the same channel.

In Figure 3.4, the relation p1 ⇝ p3 allows page p3 to be displayed in the same window as page p1, which is therefore terminated. In the same way, p3 ⇝ p2 when, later, clip c1 ends and c2 starts playing. If page p1 contains a link to a large document or to another video clip, the user may need to pause the initial presentation in order to pay attention to the new document. Figure 3.5.a pictorially shows this case: page p1 contains a link to a new video clip (c3) which is associated with another text page (e.g., another slide) p3. The user can also decide to stop the presentation and to continue reading the new one, or the author could have designed a set of multimedia documents with this behavior in mind. Going back to the abandoned presentation would in this case be possible only by restarting it from some defined starting point. It should be clear that the user or author behavior depends on the meaning of the linked documents. In principle the synchronization model should be able to describe both cases.
Definition 3.3.3 Let a and b be two generic media objects. We define "a has priority over b with behavior α", written a >α b, as the relation such that the activation of object a (by the user, or according to the presentation's design) forces object b to release the channel it occupies, so that object a can use it if needed. The label α denotes object b's behavior once it has released the channel. It can assume only two values, p and s: if α = p, object b goes into an inactive state (i.e., it pauses), waiting to be resumed; if α = s, object b is forced to terminate (it stops), releasing the resources it uses.

More formally, if event start(a) occurs when i = i_current and
Pre-conditions:
• isActive(a, i) = False
• isActive(b, i) = True
• usedBy(channel(b), i) = b
• usedBy(channel(a), i) = _ (if channel(a) ≠ channel(b))
• isPaused(b, i) = False
then:
Post-conditions:
1. if α = s:
• isActive(a, i + 1) = True
• isActive(b, i + 1) = False
• usedBy(channel(a), i + 1) = a
• if channel(b) ≠ channel(a) then usedBy(channel(b), i + 1) = _
2. if α = p:
• isActive(a, i + 1) = True
• isPaused(b, i + 1) = True
• usedBy(channel(a), i + 1) = a
• if channel(b) ≠ channel(a) then usedBy(channel(b), i + 1) = _
All other events e ≠ start(a) do not interfere with other objects but simply change the state of the object affected by that event.

Figure 3.5: Following hyperlinks: use of >p and >s relationships

In the case illustrated by Figure 3.5.a, we draw the relation c3 >p c1 so that, when following the hyperlink, the user activates c3, which puts c1 into an inactive state. The channel used by c1 is released to be used by c3. From the relation c3 ⇔ p3 we can assume that page p3 must be displayed in p1's channel; therefore an additional relationship p1 ⇝ p3 must be added. When c3 terminates, the user can resume c1 from the point where it was suspended. Since the relation c1 ⇔ p1 is active for the whole duration of c1, page p1 is also activated, so the presentation goes back to the state it was in before the hyperlink activation. The channel that was used by p3 is freed, due to the two relationships ⇔ and ⇓ between c3 and p3. If the author decides to stop the first document before leaving it, the relationship c3 >s c1 is used instead of the relation >p, as illustrated in Figure 3.5.b. The relation p1 ⇝ p3 introduced in Figure 3.5.a is not necessary because, since c1 is forced to terminate, by the relation c1 ⇓ p1 page p1 is also forced to terminate, releasing the channel that can then be used by p3. In Figure 3.5.b it is assumed that, when c3 terminates, clip c1 starts again, as described by the relationship c3 ⇒ c1. A different behavior could be to leave to a user action the task of starting the stopped presentation again.
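Extending the earlier Scheduler sketch, the three interaction-driven primitives of Definitions 3.3.1-3.3.3 could be dispatched as follows (illustrative code of ours; alpha takes the values p and s of Definition 3.3.3):

    class InteractiveScheduler(Scheduler):
        """Adds terminates, is-replaced-by and priority to the Scheduler sketch."""

        def __init__(self, state):
            super().__init__(state)
            self.terminates = []    # (a, b): stop(a) forces stop(b)
            self.replaced_by = []   # (a, b): start(b) forces a to leave the shared channel
            self.priority = []      # (a, b, alpha): start(a) pauses (p) or stops (s) b

        def dispatch(self, event, m):
            if event == "start":
                # a ⇝ b: activating b forces the termination of a
                for a, b in self.replaced_by:
                    if m == b and self.state.is_active(a):
                        self.dispatch("stop", a)
                # a >alpha b: activating a makes b release its channel
                for a, b, alpha in self.priority:
                    if m == a and self.state.is_active(b):
                        self.dispatch("pause" if alpha == "p" else "stop", b)
            super().dispatch(event, m)
            if event == "stop":
                # a ⇓ b: a forced stop propagates to b
                for a, b in self.terminates:
                    if m == a and self.state.is_active(b):
                        self.dispatch("stop", b)

For instance, registering ("c3", "c1", "p") in priority and ("p1", "p3") in replaced_by reproduces the behavior of Figure 3.5.a.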
3.4 Automatic derivation of synchronization primitives from document structure
Modelling a hypermedia presentation requires two activities: the description of the document's structure (i.e., the media objects and the hierarchical structure in which they are inserted) and the definition of the synchronization primitives that rule the presentation's behavior. Some of these relationships need to be explicitly defined by design when structuring the hypermedia document, while others can be automatically inferred from the presentation structure. The activation of the whole hypermedia presentation is always initiated by the user, who accesses the starting point through an index. According to the structural model, the
starting point is a module or a composite; therefore its continuous component, a story or a clip, is the real starting point. If it is a story, the first clip of the story must be activated. The "plays with" relationship (⇔) can therefore be automatically inferred (as a form of inheritance) between a story and its first clip, with the story acting as the master. Similarly, the same relationship is inferred between a clip and its first scene. The "activates" relationship (⇒) can be inferred between the scenes of a clip, since a clip is played continuously (in the absence of user interaction). As a result, only the relationships between a module and a story and between the clips of a story have to be defined explicitly. Synchronization between continuous and static documents (i.e., scenes and pages) must be defined explicitly at design time, since in principle pages are not ordered: their ordering comes as a result of the scene ordering and of the scene-page relationships. The forced termination of an object is propagated through the hierarchical structure: if a user wants to stop a presentation, he or she stops the active module, which should stop all the objects included in it. The "terminates" relationship (⇓) can be inferred between a module and all the objects inside it, between a story and its clips, and between a clip and its scenes. Similarly, the same relationship is inferred between objects that play in parallel as a consequence of a "plays with" relationship (⇔): if the user stops the master object, all the "slave" objects must be terminated. More formally, the relationships that can be inferred are:
• { story_i ⇔ clip_i,1 | ∀ story_i }
• { clip_i ⇔ scene_i,1 | ∀ clip_i }
• { scene_i,j ⇒ scene_i,j+1 | ∀ clip_i, 1 ≤ j < Nscene(clip_i) }
• { Module_i ⇓ elem_i,j | ∀ Module_i, ∀ elem_i,j such that ∄ obj ⇔ elem_i,j }
• { story_i ⇓ clip_i,j | ∀ story_i, ∀ clip_i,j }
• { clip_i ⇓ scene_i,j | ∀ clip_i, ∀ scene_i,j }
• { obj_i ⇓ obj_j | ∃ obj_i ⇔ obj_j }
where Nscene(clip_i) returns the number of scenes included in clip_i and obj is a generic object of the presentation in the set MI.
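A possible mechanization of the structural rules above is sketched below (the class names Story and Clip and the function infer are ours, not part of the model; the sketch covers the story/clip/scene rules and the rule deriving ⇓ from ⇔):

    from dataclasses import dataclass

    @dataclass
    class Clip:
        name: str
        scenes: list    # ordered list of scene names

    @dataclass
    class Story:
        name: str
        clips: list     # ordered list of Clip objects

    def infer(story, explicit):
        """Derive the implicit relationships for a module containing one story.

        `explicit` is the set of (kind, a, b) triples defined by design, e.g.
        ("plays_with", "sc1", "p1"); the result is the set of inferred triples."""
        inf = set()
        inf.add(("plays_with", story.name, story.clips[0].name))  # story <=> first clip
        for c in story.clips:
            inf.add(("plays_with", c.name, c.scenes[0]))          # clip <=> first scene
            for s, t in zip(c.scenes, c.scenes[1:]):              # scenes play in sequence
                inf.add(("activates", s, t))
            for s in c.scenes:                                    # a clip stops its scenes
                inf.add(("terminates", c.name, s))
            inf.add(("terminates", story.name, c.name))           # a story stops its clips
        # every "plays with" master also terminates its slave
        for kind, a, b in set(explicit) | set(inf):
            if kind == "plays_with":
                inf.add(("terminates", a, b))
        return inf - set(explicit)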
Figure 3.6: Synchronization schema for a lesson

In order to show how inheritance is applied, we introduce an example which models a small lesson in a self-learning application. The lesson is composed of one module, and therefore of one story, composed of two video clips c1 and c2, which must be played one after the other, and of five text documents (pages) p1 through p5, which must be displayed during the clips' playback. Clip c1 is composed of two scenes, sc1 and sc2, whereas clip c2 contains three scenes, sc3, sc4 and sc5. Each scene must be associated with a different text page. Once accessed by the user (through some index), the module activates the story. A "plays with" (⇔) relationship exists between the story and its first clip c1, and between c1 and scene sc1. Both relationships are automatically inferred from the story-clip-scene hierarchical structure. The activation of scene sc1 must synchronously display the first page of the accompanying text, p1. Similarly, sc2 ⇔ p2, . . . , sc5 ⇔ p5. Finally, since the two clips of the story must be played one after the other, a relationship c1 ⇒ c2 must be explicitly defined. Figure 3.6 pictorially shows these relationships. The figure also contains the other relationships (⇓) needed to propagate the forced termination by the user from the clips (and scenes) to the static pages. These relationships are automatically inferred for all the media objects between which a relationship of the type ⇔ exists, i.e., Module ⇓ story, story ⇓ c1, c1 ⇓ sc1, c2 ⇓ sc3, sc1 ⇓ p1, sc2 ⇓ p2, sc3 ⇓ p3, sc4 ⇓ p4 and sc5 ⇓ p5. The synchronization relationships defined, noting whether they are defined at design time or inferred from other relationships, are:

1. Module ⇔ story, for the association between the module and the story, which is explicitly defined by the authors of the presentation;
2. story ⇔ c1, for the association between the story and its first clip, which is inferred from the document structure;
3. c1 ⇔ sc1, for the association between the clip and its first scene, which is inferred from relationship 2;
4. c2 ⇔ sc3, which is inferred from the document structure;
5. sc1 ⇔ p1, sc2 ⇔ p2, sc3 ⇔ p3, sc4 ⇔ p4 and sc5 ⇔ p5, for the associations between scenes and pages, which must be explicitly defined by the authors of the presentation;
6. sc1 ⇒ sc2, sc3 ⇒ sc4 and sc4 ⇒ sc5, for the scene sequencing, which can be inferred from the clip-scene organization;
7. c1 ⇒ c2, for the clip sequencing, which is defined by design;
8. Module ⇓ story, story ⇓ c1, story ⇓ c2, c1 ⇓ sc1, c1 ⇓ sc2, c2 ⇓ sc3, c2 ⇓ sc4 and c2 ⇓ sc5, which can be inferred from the document structure;
9. sc1 ⇓ p1, sc2 ⇓ p2, sc3 ⇓ p3, sc4 ⇓ p4 and sc5 ⇓ p5, for the page termination, which can be inferred from the relationships 5 between the scenes and the pages;
for a total of 26 relationships, 7 of which are defined by design, while the remaining 19 can be automatically inferred from the document structure and from the implications of other relationships.
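Under the same assumptions, the infer sketch given after the rules of Section 3.4 reproduces this count for the lesson:

    # The lesson of Figure 3.6, encoded for the `infer` sketch above.
    story = Story("story", [Clip("c1", ["sc1", "sc2"]),
                            Clip("c2", ["sc3", "sc4", "sc5"])])
    explicit = {("plays_with", "Module", "story"),    # relationship 1
                ("plays_with", "sc1", "p1"), ("plays_with", "sc2", "p2"),
                ("plays_with", "sc3", "p3"), ("plays_with", "sc4", "p4"),
                ("plays_with", "sc5", "p5"),          # relationships 5
                ("activates", "c1", "c2")}            # relationship 7
    print(len(explicit), len(infer(story, explicit)))  # 7 19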
3.5 Comparison with related work
Some works, reviewed in Chapter 2, discuss issues close to the ones approached by our model. Allen's approach [2] defines a list of thirteen possible relationships between temporal intervals, which can be used in domains where interval lengths are known. Recalling what we said in Section 3.2, the length of an object is the time span from its beginning to its natural end (Definition 3.2.2). Allen's relations can then be applied to the lengths of media items. In our perspective, instead, since the user can interact with each component of a multimedia document, the media durations (the actual time lengths of their playback) are not known in advance, so Allen's relations cannot be applied. Even if the domain is extended to intervals with unknown length, our relationships cannot be translated a
priori into the relationships defined by Allen. For example, our relationship a ⇔ b has a different meaning depending on the duration of the object b:
• if the duration of a is less than or equal to the duration of b, the corresponding Allen relation is a equal b;
• otherwise, the corresponding Allen relation is b starts a.
The actual object duration is unknown until execution, since the user can stop media playback; a correct translation therefore cannot be done a priori. The same considerations apply to the works discussed in [46] and [69]. The main differences between FLIPS [63, 64] and our model concern the system environment and the modelling of hypermedia dynamics. No structure is in fact provided, other than the one induced by the objects' mutual interrelationships. Due to the absence of a hierarchical structure, the reuse of an object, or of a time span inside its timeline, is not possible. In FLIPS, synchronization is defined between objects' states and not between the objects themselves. Using barriers and enablers, the start or end of an object cannot directly cause the start or end of another object, but can only change the state of the object at hand. For example, the beginning of an object (which corresponds to the object activation in our model) depends upon the absence of barriers, which can be set or reset by complex conditions. Our model is simpler, since a state change in an object is caused only by the user or by events associated with a related object, independently of the other presentation components. Moreover, FLIPS does not address presentation layout, but only deals with synchronization issues. The Amsterdam Hypermedia Model [33] describes the temporal behavior of hypermedia documents inside the objects' structure. Differently from our model, which defines a set of primitives which react to events, synchronization is achieved through the use of object composition and synchronization arcs, permitting the insertion of offsets into timing relationships. Thus, the authoring paradigm offered to the user is different. AHM integrates information about the structure of the document with information about its temporal behavior, which our model keeps separate. Like our model, AHM defines channels to play media items.
The main difference between SMIL 2.0 [57] and our model concerns the lack of a
reference model for the data structure in SMIL. Our model organizes media objects into a hierarchical structure, which is useful to design complex presentations. For example, it can be used to infer temporal relationships between media items (e.g., the scenes of a clip are played sequentially). Our model is not intended as an execution language, as e.g. SMIL is, but as a model to design and prototype multimedia presentations [28]. To this end, our XML language, which will be presented in Chapter 5, clearly separates spatial and temporal relations from the references to media objects, in three separate sections. A media item can be referenced several times without redundancy by addressing its id. Thus, an author can easily reuse the structure of another presentation, a single media item, an entire document or a part of it. In SMIL, instead, the two types of information are interleaved in the document, possibly generating redundancy in the case of media reutilization. Other differences between SMIL 2.0 and our model can be found in the way actions directed to end media executions are managed. Like Allen's relationships, SMIL 2.0's native features do not distinguish between the natural and the forced termination of a media item, and therefore cannot define in a simple way the effects of a user interaction on a single media component. SMIL, however, is a complete language that allows an author to design almost any multimedia presentation, even if some behaviors are difficult to describe.
3.6 An example: the Maya Sun Temple
In this section we introduce a rather complex example, in order to show how the hierarchical structure and the synchronization relations of the model are used to describe a hypermedia document. To this purpose, we analyze a multimedia presentation designed for a virtual exhibition, namely the Maya Sun Temple section of the Maya exhibition at Palazzo Grassi, Venice, Italy, which is available on the World Wide Web [58]. The presentation is a narration of the Maya cosmogony, illustrated by a tour in a 3D virtual world. The myth, engraved on the temples of Palenque (the Sun Temple), tells how the First Father, called Hun Nal Ye (A Grain of Maize), was born in 3114 BC. At
that time, the Sun did not exist and darkness reigned over everything. Hun Nal Ye built himself a house in a place called the Higher Heaven. As the narration goes on, a 3D reconstruction of the Sun Temple is presented. Clicking on a dice engraved with Maya numbers, the user can step through the building, exploring the pictures and handicrafts inside it. Text pages explain the habits and customs of the Maya civilization.

Figure 3.7: Maya Sun Temple's presentation interface (regions: exhibition title, presentation title, text page, buttons, animation)

The presentation user interface is divided into five regions (Figure 3.7). Two regions are dedicated to the exhibition and presentation titles; since the information they contain does not change during the presentation playback, we can ignore them. We shall concentrate our analysis on the animation, the text pages that are displayed on the left side of the screen, and the sound tracks that are played. A region in the lower left corner of the screen contains two buttons: one to ask for help, and the other to exit the presentation. The first button is a link to a text page which is displayed in the text pages area. The second button stops the presentation and takes the user back to the Maya exhibition's home page. In order to keep the example simple we ignore these two elements, which do not change during the whole presentation. We thus consider four channels: an for the animation, text for the text pages, sound for the sound tracks and noise for some audio effects that are played during the presentation¹².
¹² We note that the implementation mixes several virtual audio channels onto the same physical audio device.
The Maya Sun Temple presentation is divided into twenty animations, each of which is
a clip in our model's structure. An animation can be associated with one or more text documents, but some clips are played alone, i.e., they are not associated with any text document. Different sound tracks are played during the presentation, in order to make the progression of the narration more evident. According to our model's structure, the text documents are pages, while the sound tracks are clips. The presentation is organized in five modules, which contain clips and pages. We also consider an initial module, M0, which contains only an initial text page, p0, which begins to tell the cosmogony's story, and the image of a dice playing the role of a button, ui0, acting as a user interaction entity to step through the narration. The dice is placed in the animation channel an. It remains on the screen till the user clicks on it; then it naturally ends. Channel sound is initially free, while channels text and an are busy. Tables 3.1–3.3 list the elements of modules M0–M3, which are illustrated in this section. During the whole presentation, the user has five possible interactions; he or she can:
• choose a module of the presentation and start it,
• click on the current dice to step the presentation,
• close a text page,
• pause the presentation, or
• exit from the presentation.
If the user pauses the presentation, he or she pauses all the active continuous objects, i.e., the current animation (if any) and the soundtrack, which pause waiting for a re-start action. Thus no synchronization primitives are needed. In the same way, if the user closes the text pages, no other actions are performed, therefore no synchronization relationships are needed. When the user starts the presentation by clicking on the dice, he or she starts the first module M1: we set the relation M0 ⇝ M1 so that the first module can use channel text. As a consequence of M0's termination, page p0 must terminate too, so we set M0 ⇓ p0. The first animation begins by building the foundations of Hun Nal Ye's house, while the user hears wind blowing. If we call the animation's clip a1 and the sound track s1, we model
M0    Initial module.
p0    The myth engraved on the temples of Palenque recounts how the First Father. . .
ui0   (Maya number 0)
M1    Hun Nal Ye's house in Higher Heaven.
s1    Wind sound track.
a1    House foundation.
p1    At the site of the house, the mythical ancestor also set up three stones that symbolized the three vertical levels . . .
ui1   (Maya number 1)
Table 3.1: Maya Sun Temple presentation elements, Modules 0 and 1
this behavior with the relations M1 ⇔ a1 and M1 ⇔ s1, where a1 occupies the channel an and s1 the channel sound. The wind sound is a short audio file that is played continuously, hence s1 ⇒ s1.
At the end of the first animation, the presentation pauses, waiting for user interaction: a text page p1 is displayed in channel text. The relation a1 ⇒ p1 models this behavior. Wind sound track s1 continues playing, while channel an is released. At the end of each animation, the user gets control of the presentation playback: the dice engraved with Maya numbers is displayed and the user must click on it to enter the next animation. When a1 terminates, ui1 begins by virtue of the relation a1 ⇒ ui1. When the user clicks on the dice, ui1 ends: the next animation begins playing as a consequence of the relations ui1 ⇒ M2 and M2 ⇔ a2.
If the user decides to exit the module, using the button "Exit", all active objects should stop: they are p0 and ui0 for the initial module; a1 and s1 during the animation's playback; and s1, ui1 and p1 if a1 has already finished. The relations M0 ⇓ ui0, M1 ⇓ a1, M1 ⇓ s1, M1 ⇓ ui1 and s1 ⇓ p1 describe this behavior. Figure 3.8 shows the first two modules.
M2       Hun Nal Ye brings life into the world (representation of monumental art).
s2       Birds singing sound track.
n1       Noise due to the god's resurrection.
a2       Tree and stones raising.
a3       Hun Nal Ye resurrection.
a4       Hun Nal Ye appears as a beautiful young man.
a5       Life-size portraits in stone.
a6       God sculpture.
a7       Hun Nal Ye appears as a beautiful young man (2).
a8       House view.
ui2-ui8  (Maya numbers 2-8)
p2       After performing these prodigious acts, Hun Nal Ye was then the leading figure . . .
p3       Maya rulers commissioned life-size portraits in stone (stelae) . . .
p4       Maya culture was based on a religious view of the world and of life, . . .
p5       With the power to determine future events, these gods were considered by the Maya . . .
p6       Though the Maya deities were in many respects superior to man . . .
p7       Depending upon the occasion, these deities would appear in different forms . . .
pb       Blank page.

Table 3.2: Maya Sun Temple presentation elements, Module 2
M3   The sun temple.
s3   Ambient sound track.
a9   The sun temple building.
sc1  Hun Nal Ye house.
sc2  Growth of the sun temple.
p8   The earliest Pre-Columbian Maya architecture dates from the first centuries BC. . . .
ui9  (Maya number 9)

Table 3.3: Maya Sun Temple presentation elements, Module 3
In the first two modules the following relationships are defined:

M0 ⤳ M1      M0 ⇔ p0      M0 ⇔ ui0     ui0 ⇒ M1
M0 ⇓ p0      M0 ⇓ ui0     M1 ⇔ a1      M1 ⇔ s1
s1 ⇒ s1      a1 ⇒ p1      a1 ⇒ ui1     ui1 ⇒ M2
M1 ⇓ a1      M1 ⇓ s1      M1 ⇓ ui1     s1 ⇓ p1
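These rules are pure data; anticipating the formalization of Chapter 4, where every rule becomes a triple in SP × MI × MI (Definition 4.1), they could be encoded for instance as follows. This is a hypothetical Python encoding, with textual names standing in for the symbols:

    # Synchronization rules of modules M0 and M1 as (primitive, a, b) triples.
    rules = [
        ("replaced_by", "M0", "M1"), ("plays_with", "M0", "p0"),
        ("plays_with", "M0", "ui0"), ("activates",  "ui0", "M1"),
        ("terminates", "M0", "p0"),  ("terminates", "M0", "ui0"),
        ("plays_with", "M1", "a1"),  ("plays_with", "M1", "s1"),
        ("activates",  "s1", "s1"),  ("activates",  "a1", "p1"),
        ("activates",  "a1", "ui1"), ("activates",  "ui1", "M2"),
        ("terminates", "M1", "a1"),  ("terminates", "M1", "s1"),
        ("terminates", "M1", "ui1"), ("terminates", "s1", "p1"),
    ]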
The other modules are modelled in the same way: the dice represents an activation point for the next animation, or for the next module if the animation currently playing is the last one of the module. For each module Mk, the end of each animation ai activates an instance of the user interaction entity uii, which ends with a user click on it: the relationships ai ⇒ uii and uii ⇒ ai+1 model this behavior. At the end of the last animation of a module, the user leaves module Mk and begins module Mk+1. The relationship Mk ⤳ Mk+1, set for every module but the last one, says that the whole module Mk is forced to terminate because the channels used by the media contained in it must be used by the media contained in Mk+1. Therefore its clips, sound track, user interaction entities and active page must terminate, as described by the ⇓ relationships between the module and its components. Channels an, sound and text are released.
The narration of Maya cosmogony continues with the beginning of life in the world. Hun Nal Ye comes into the world as a beautiful young man, bringing with him the seeds of maize. A second module, M2, is introduced since the scenario changes: the resurrection of the god from the underworld represents the beginning of life in the world. To emphasize this event, the sound track plays a sound of singing birds. Then the presentation gives the user some information about Maya monumental art and their religious view of life and the world.

Figure 3.8: Synchronization primitives for the first two modules

Module M2 has a much more complex structure. It contains seven animations, seven pages and two sound clips. They share the same channels used by module M1, therefore M1 ⤳ M2. When the user begins M2's playback, animation clip a2 adds three stones and a tree to the god's house, and bird singing begins (sound track s2). When a2 ends, page p2 is displayed. Sound track s2 plays for the entire duration of M2, because it loops continuously. The following relationships model this situation:

M2 ⇔ a2      M2 ⇔ s2      s2 ⇒ s2      a2 ⇒ p2
M2 ⇓ a2      M2 ⇓ s2      s2 ⇓ p2
At the end of animation a2, the user clicks on the dice to start the following animation a3, which uses channel an released by a2. Page p2 remains active. As a3 begins, the user hears, in addition to sound track s2, a background noise which represents the resurrection of the god Hun Nal Ye from the underworld. If we denote this sound clip with n1, we can establish the relation a3 ⇔ n1. The soundtrack s2 continues its playback in channel sound, while n1 uses channel noise. As for animation a2, if the user stops the module's playback, both animation and noise should stop, so we set M2 ⇓ a3 and a3 ⇓ n1; the same relation has to be imposed between the module and all the animations that make up the story: M2 ⇓ a4, M2 ⇓ a5, M2 ⇓ a6, M2 ⇓ a7 and M2 ⇓ a8, as Figures 3.9 and 3.10 pictorially show. Page p2 remains on the screen until animation a5 ends, then page p3 is displayed. This is described by the relationships a5 ⇒ p3, to display the page, and p2 ⤳ p3, to free channel text in which p3 is displayed.

Figure 3.9: Synchronization primitives for module M2 (first part)

Figure 3.10: Synchronization primitives for module M2 (second part)
When animation a6, representing the god's monumental sculpture, ends, it activates page p4 by virtue of the relation a6 ⇒ p4. Page p4 contains a link to another text page, p5, which contains two hyperlinks, one back to p4 and one to p6 (see Table 3.2 for a reference to the presentation pages). Pages p4 to p7, and their hyperlinks, form a bidirectional list. Figure 3.10 shows this situation. We introduce a ⤳ relationship every time there is a link between two objects that share the same channel, to describe that the channel must release its current content. Relationships p4 ⤳ p5, p5 ⤳ p4, and the others shown in the figure, manage the use of channel text. Relationships M2 ⇓ p4 up to M2 ⇓ p7 are introduced to terminate the active page when the user stops the presentation. Animation a7 ends by showing the god's physical appearance, then a blank page pb is displayed in channel text, as described by relationship a7 ⇒ pb. A blank page must be introduced because a Web browser displays a page until new content is downloaded: when a page is terminated, the channel is released, but the page content remains visible.
The modules that follow are modelled in the same way. We present module M3 because it has a clip with two different scenes. The animation shows the sun temple for the first time. The sound changes because the user is guided into the temple, where the sound of birds singing is inappropriate. The building rises over Hun Nal Ye's house, and when the temple appears, a text page with information about Maya pre-Columbian architecture is displayed. This animation, a9, can be described as a clip made of two scenes, sc1 and sc2, one showing the god's house, and the second the growth of the temple. Figure 3.11 pictorially shows this part of module M3's structure and synchronization. The synchronization primitives introduced are the same as in the other modules, but we need to establish the relationship sc1 ⇒ sc2 so that the user does not need to activate the second scene, which plays as soon as scene sc1 ends.

Figure 3.11: Structure and synchronization of module M3
Chapter 4
Modelling Multimedia Presentation Behaviors
In this chapter we introduce a formal description of a hypermedia presentation. Since we proposed an event-based approach to describe the temporal behavior of media items, all the possible evolutions of a presentation can be described by an automaton which defines the states entered as a consequence of the events that trigger media playback. The automaton can then be used to study some properties of the presentations. The issues presented in this chapter are discussed also in [21].
In Chapter 3 we have introduced the relevant components of our model. A multimedia presentation can be modelled as the composition of independent media items, both continuous (i.e. video or audio streams) and static media components (i.e. images or text), which must be synchronized to be presented to a user. The model is able to capture all the aspects of a hypermedia presentation, i.e. the temporal relationships among different media types (media to be played in parallel or in sequence), the synchronization constraints (events that cause the begin or end of other media items to occur) and how media components are organized in the user interface (the audio and video channels which describe the layout of the presentation). The synchronization model defines a hierarchical structure of the documents, the layout of the presentation and a list of synchronization rules to describe the objects' reactions to events. The synchronization primitives are defined as temporal relationships a rel b, with rel ∈ {⇔, ⇒, ⇓, ⤳, >α}, where Dom(rel) = MI × MI and MI is the set of media items which build up a presentation.
In this chapter we introduce a more formal description of a hypermedia presentation. Since we proposed an event-based approach to describe the temporal behavior of media items, given a presentation we can build the corresponding automaton which describes all its possible evolutions along time. Thus we need to collect all the information which
characterizes the presentation at any given time into the notion of state. Given a state and an event, the automaton deterministically moves to the next state, which describes the active media and their channel occupation. The transition therefore describes which media are activated and which items are stopped during presentation playback. The automaton is implemented by the authoring tool presented in Chapter 6, which includes a simulator to test the behavior of the presentation the author is designing. The automaton can also be used to study some properties of the corresponding presentation.
4.1 An Automaton for Multimedia Presentations
The playout of a hypermedia presentation can be described in terms of the media items involved, the channels used for media playback, the events which cause media to start, pause and end, and the synchronization relationships which describe the dynamic media behavior. Such elements provide an intensional representation of the evolution of the presentation along time.

Definition 4.1 (Presentation) A hypermedia presentation is a 4-tuple P = ⟨MI, CH, E, SR⟩ where
• MI is a set of media items {m0, m1, . . . , mn};
• CH is a set of presentation channels {c0, c1, . . . , ch};
• E is a set of events {e0, e1, . . . , ek}, ei ∈ ET × MI, where ET = {start, end, pause, stop} is the set of event types;
• SR is a set of synchronization relationships {sr0, sr1, . . . , srl}, sri ∈ SP × MI × MI, and SP = {⇔, ⇒, ⇓, ⤳, >p, >s} is the set of synchronization primitives.

For clarity we shall denote event instances as e(m), where e is an event type and m a media item, and synchronization relationship instances with the symbolic notation used in Chapter 3, e.g., m1 ⇔ m2 to denote the relationship (⇔, m1, m2).
At any time instant, the hypermedia presentation is completely described by the set of media that are active at that time (paused media are considered active, since they occupy a channel) and by the corresponding channel occupation. This
relevant information is captured by the notion of state of the presentation. Before the presentation starts, no media item is active, thus all channels are free. When an event occurs, the state of the presentation changes: some items that were not active become active, some active items end, other items could be forced to stop due to some interruption.

Definition 4.1.1 (State) The state of a hypermedia presentation is a triple S = ⟨AM, FM, UC⟩, where AM is the set of active media, FM is the set of paused (frozen) media and UC is the set of pairs ⟨c, m⟩, where c is a channel and m is the media item that occupies it, as defined by the mapping isUsed : CH → MI ∪ {⊥}. (This function differs from the function isUsed of Section 3.2.2 for the absence of the variable i, which contains the current time, since we now consider only the behavior of the presentation during a particular state.)

For clarity, in the following we shall refer to the association between a channel and an active media item with the functional notation isUsed(x) = y. According to Section 3.2.2, the set of active media items in a state s can be defined as AM = {m | isActive(m, i) = True ∧ ibegins ≤ i ≤ iends}, where ibegins and iends are the values of the variable i, which contains the current time, updated by a clock, respectively when the presentation enters and exits state s. In the same way, the set of frozen media is FM = {m | pause(m, i) = True ∧ ibegins ≤ i ≤ iends}. The set of paused media items FM is a subset of the set of active media AM, since the items still occupy the resources (or part of them) used for their playback: e.g., the last frame of a video file, or the last image of a slide show, remains on the user screen. The channel can be used only by media items for which a relation >p with the paused object exists.

Remark 4.1.1 The set S of the possible states for a hypermedia presentation is finite, since both the set of media items and the set of channels are finite.

If we observe the system along time, the only relevant time instants are the observable time instants, i.e., the time instants in which an event occurs, or in which the effects of an event are assessed and perceivable. Indeed, these are the time instants in which something in the state of the presentation might change as a consequence of the occurred event. The state of a hypermedia presentation is thus a function of observable time instants. We assume that at any observable time instant at most one "master event" occurs, i.e., either a media item is activated, paused or stopped by the user, or a media item naturally ends. At the same time instant, however, other items may be activated, paused or stopped, according to the synchronization relationships characterizing the presentation.
It is important to notice that, given the set of synchronization primitives associated with the presentation, the effects of any observable event are deterministically implied by (i) the set of currently active media, (ii) the set of frozen media, (iii) the current channel occupation, and (iv) the occurred event. Thus, all the possible evolutions along time of a presentation P can be described by an automaton, defined as follows.

Definition 4.2 (Automaton) Let P = ⟨MI, CH, E, SR⟩ be any presentation. Its associated finite state automaton is the 5-tuple AUT(P) = ⟨S, E, s0, next, Final⟩, where
• S is the set of possible states for the presentation P;
• E is the set of possible event instances, in the form start(m), end(m), pause(m) and stop(m), m ∈ MI;
• s0, the initial state, is ⟨AM0, FM0, isUsed0⟩, where AM0 = ∅, FM0 = ∅, and isUsed0(c) = ⊥ for all c ∈ CH;
• the transition function next : S × E → S is the mapping that deterministically associates any state s to the state s′ into which s is transformed by an event e ∈ E;
• Final is the set of states which correspond to the end of the presentation playback. Details on the set Final will be given in Section 4.2.

While it is clear, from the above definition, what S, E, s0 and Final are, we still have to define formally the deterministic behavior of the function next. This definition requires some extra notions, which we introduce in the following. First, we need to be able to capture all the consequences that an event might have on the presentation, given the current state, according to the synchronization rules. Starting and ending events might indeed activate a cascade of simultaneous media activations or stops. The following notions of closure of an item, with respect to some category of rules, capture these effects.

Definition 4.3 ((⇔)Closure) Let a be a media item in MI. The (⇔)Closure of a is the set inductively defined as follows:
• a ∈ (⇔)Closure(a);
• for any item b ∈ MI, if b ∈ (⇔)Closure(a) and b ⇔ c ∈ SR, then c ∈ (⇔)Closure(a).

(⇔)Closure(a) captures the non-symmetry and the transitivity of the plays with relation. In particular, we shall use the set (⇔)Closure(a) to take care of the (non-symmetric) consequences of the end of a, in the presence of plays with rules.

Definition 4.4 ((⇔)1step) Let a be a media item in MI. The (⇔)1step of a is the set defined as follows:
• for any item b ∈ MI, if a ⇔ b ∈ SR or b ⇔ a ∈ SR, then b ∈ (⇔)1step(a).

(⇔)1step(a) captures the symmetry of the plays with relation. In particular, we will use the set (⇔)1step(a) to take care of the (symmetric) consequences of the start of a, in the presence of plays with rules.

Definition 4.5 ((⇓)Closure) Let a be a media item in MI. The (⇓)Closure of a is the set inductively defined as follows:
• a ∈ (⇓)Closure(a);
• for any item b ∈ MI, if b ∈ (⇓)Closure(a) and b ⇓ c ∈ SR, then c ∈ (⇓)Closure(a).

Intuitively, (⇓)Closure(a) contains all the media items that, according to the terminates relation, are required to stop if a terminates. The following two definitions take care of the hierarchical structure of the media items by relating a story with its component clips, and a clip with its component scenes.

Definition 4.6 (ComponentOf) Let a and b be media items in MI. ComponentOf(a, b) evaluates to true if and only if at least one of the following conditions holds:
• a is a story (clip) and b is a clip (resp. scene) and a ⇔ b ∈ SR;
• a is a story (clip) and b is a clip (scene) and n clips (scenes) x1 . . . xn exist such that a ⇔ x1, xi ⇒ xi+1 for all i = 1 . . . n − 1 and xn ⇒ b.
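As a minimal sketch (not part of the thesis' formal apparatus), the closures can be computed by a simple graph traversal over the rule set; the encoding of SR as (primitive, a, b) triples and the relation names are illustrative:

    PLAYS_WITH, ACTIVATES, TERMINATES = "plays_with", "activates", "terminates"

    def closure(a, prim, SR):
        """(prim)Closure(a): a plus everything reachable via prim edges."""
        result, frontier = {a}, [a]
        while frontier:
            x = frontier.pop()
            for (r, s, t) in SR:
                if r == prim and s == x and t not in result:
                    result.add(t)
                    frontier.append(t)
        return result

    def one_step(a, SR):
        """(plays_with)1step(a): symmetric one-step neighbours of a."""
        return {t for (r, s, t) in SR if r == PLAYS_WITH and s == a} | \
               {s for (r, s, t) in SR if r == PLAYS_WITH and t == a}

With the rules of the Maya example encoded as in Section 3.6, closure("M1", TERMINATES, rules) yields {"M1", "a1", "s1", "ui1", "p1"}: p1 enters the closure transitively through s1 ⇓ p1, exactly as the inductive definition prescribes.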
Definition 4.7 (IsLast) Let a and b be media items in MI. IsLast(a, b) evaluates to true if and only if both the following conditions hold:
• ComponentOf(a, b) = True;
• for any item c ∈ MI, if b ⇒ c ∈ SR, then ComponentOf(a, c) = False.

Definition 4.7 lets us identify the last clip (scene) of a story (clip). When it naturally ends, the whole story (clip) ends.

The state transformation caused by an event might be a complex action. For the sake of readability, we define some parameterized functions that group together semantically related operations caused by the occurrence of an event, in order to apply the synchronization rules. Basically, each function takes care of the effects of starting (or re-starting, i.e. starting again after a pause), stopping, or replacing a media item, both on the sets of active and frozen media, and on the function modelling the occupation of channels. Then, we shall define the state transition function algorithm.

START(x: media item; ∆+: set of media items; newUsed: channel-media mapping)
// x: media item being started
// ∆+: set of media items to be added to the set of active media
// newUsed: occupation function after the start of x
begin
  ∆+ = ∆+ ∪ {x};
  newUsed(channel(x)) = x;   // channel : MI → CH is defined in Section 3.2.2
end
RESTART(x: media item; AM, ∆−F: set of media items; oldUsed, newUsed: channel-media mapping)
// x: media item being restarted (in parallel with the items in its closure)
// AM: set of currently active media
// ∆−F: set of media items to be removed from the set of frozen media
// oldUsed: occupation function before the restart of x
// newUsed: occupation function after the restart of x and of the media displayed in parallel with it
begin
  for all y ∈ (⇔)Closure(x)
    // restart all paused media y whose channel is free or already assigned
    if (y ∈ AM) and (oldUsed(channel(y)) = y or oldUsed(channel(y)) = ⊥) then begin
      ∆−F = ∆−F ∪ {y};
      newUsed(channel(y)) = y
    end
end

REPLACE(x, y: media item; ∆+, ∆−, AM: set of media items; newUsed: channel-media mapping)
// x: media item being started
// y: media item to replace
// ∆+: set of media items to be added to the set of active media
// ∆−: set of media items to be removed from the set of active media
// AM: set of active media
// newUsed: occupation function after the start of x
begin
  ∆− = ∆− ∪ {y};
  ∆+ = ∆+ ∪ {x};
  newUsed(channel(y)) = x;
  // stop all media for which a ⇓ relation exists with y
  for all z ∈ (⇓)Closure(y)
    if z ∈ AM then begin   // stop z
      ∆− = ∆− ∪ {z};
      newUsed(channel(z)) = ⊥
    end
end

STOP(x: media item; ∆−: set of media items; newUsed: channel-media mapping)
// x: media item being stopped
// ∆−: set of media items to be removed from the set of active media
// newUsed: occupation function after the stop of x
begin
  ∆− = ∆− ∪ {x};
  newUsed(channel(x)) = ⊥;
end
ACTIVATE(x: media item; AM, ∆+, ∆−, ∆+F: set of media items; oldUsed, newUsed: channel-media mapping)
// x: media item to be activated
// AM: set of currently active media
// ∆+: set of media items to be added to the set of active media
// ∆−: set of media items to be removed from the set of active media
// ∆+F: set of media items to be added to the set of frozen media
// oldUsed: channel occupation function when ACTIVATE is called
// newUsed: channel occupation function after ACTIVATE is executed
begin
  Set = {x};   // the set of items to be activated after x's start
  while Set ≠ ∅ do begin
    pick any y from Set;
    if oldUsed(channel(y)) = ⊥ then
      // the channel of y is free, start y
      START(y, ∆+, newUsed);
    else begin
      // check if some relation exists which releases y's channel
      z = oldUsed(channel(y));
      if z ⤳ y ∈ SR or y >s z ∈ SR then
        // y replaces z in z's channel
        REPLACE(y, z, ∆+, ∆−, AM, newUsed);
      else if y >p z ∈ SR then begin
        // z pauses, then y starts
        ∆+F = ∆+F ∪ {z};
        START(y, ∆+, newUsed);
      end
      // if y's channel is used by one of its components, add y to the set of active media
      else if ComponentOf(y, z) then
        ∆+ = ∆+ ∪ {y}
      // if y is a component of z, start y
      else if ComponentOf(z, y) then
        START(y, ∆+, newUsed);
    end;
    // if y is activated then try to activate all media in (⇔)1step(y)
    if y ∈ ∆+ then
      Set = Set ∪ ((⇔)1step(y) \ ∆+)
  end
end
The first four functions are very simple. START(x, ∆+, newUsed) and STOP(x, ∆−, newUsed) are complementary functions: START adds media item x to the set of media items that will be activated together when the automaton reaches the next state; STOP removes x from the set of active objects. Function START assigns to object x the proper channel (i.e. the one returned by channel(x)), while function STOP removes this mapping. The function RESTART(x, AM, ∆−F, oldUsed, newUsed) resumes object x from the pause. It also resumes all other objects y such that y ∈ (⇔)Closure(x); this choice reflects the definition of the transition function next. The function REPLACE(x, y, ∆+, ∆−, AM, newUsed) replaces object y with object x: x and y must use the same channel, which is assigned to x; y is removed from the set of active media, together with all other objects z such that z ∈ (⇓)Closure(y), and media item x is started.
The function ACTIVATE(x, AM, ∆+, ∆−, ∆+F, oldUsed, newUsed) is more complex. It activates object x, and controls which other objects are activated as a consequence of the beginning of x, i.e. the objects which belong to (⇔)Closure(x). Step by step the function activates all media items y ∈ (⇔)1step(x) if their channel is free. Otherwise the function checks whether relationships of type ⤳ or >α exist; in that case it stops the media which used the same channel as y and starts y.
Therefore, the state transition function can be defined as follows:

Definition 4.8 (State transition function) The state transition function next : S × E → S, where S is the set of all possible states and E is the set of events, is the function that, given a state s and an event e at the observable time instant n, returns the state s′ = next(s, e) at the observable time instant n + 1, where s = ⟨AMn, FMn, isUsedn⟩, s′ = ⟨AMn+1, FMn+1, isUsedn+1⟩, AMn+1 = AMn ∪ ∆+ \ ∆−, FMn+1 = FMn ∪ ∆+F \ (∆−F ∪ ∆−), and isUsedn+1, ∆+, ∆−, ∆+F and ∆−F are defined according to the following process, in which e is the occurring event and m the media item to which the event applies:

begin
  ∆− = ∅; ∆+ = ∅; ∆+F = ∅; ∆−F = ∅;
  for all c ∈ CH do isUsedn+1(c) = isUsedn(c);

  case e = start(m):
    if m ∈ FMn then
      RESTART(m, AMn, ∆−F, isUsedn, isUsedn+1);
    if m ∉ AMn then
      ACTIVATE(m, AMn, ∆+, ∆−, ∆+F, isUsedn, isUsedn+1);
    for all x ∈ ∆+   // check if some relation >s exists between two active media
      if x >s y ∈ SR and (y ∈ AMn or y ∈ ∆+) then
        // stop all media for which a relation ⇓ exists with y
        for all z ∈ (⇓)Closure(y)
          if z ∈ AMn or z ∈ ∆+ then STOP(z, ∆−, isUsedn+1);
    for all x ∈ ∆+   // check if some relation >p exists between two active media
      if x >p y ∈ SR and (y ∈ AMn or y ∈ ∆+) then
        ∆+F = ∆+F ∪ {y};   // pause media y

  case e = end(m):   // non-active items cannot end; the same holds for stop and pause events
    x = m;
    while ∃ y ∈ MI such that ComponentOf(y, x) do
      // if x is the last component of y then consider the event end(y)
      if IsLast(y, x) then x = y else exit while;
    if x ∈ AMn then begin
      // x stops and releases its channel
      STOP(x, ∆−, isUsedn+1);
      // in relation a ⇔ b, when a ends b must be stopped
      for all y ∈ (⇔)Closure(x)
        if (y ∈ AMn) then begin
          // stop all media y which were activated by x
          STOP(y, ∆−, isUsedn+1);
          // stop all media z for which a relation ⇓ exists with a stopped media item
          for all z ∈ (⇓)Closure(y)
            if z ∈ AMn then STOP(z, ∆−, isUsedn+1);
        end;
      for all y ∈ MI such that x ⇒ y ∈ SR do
        if y ∈ FMn then
          RESTART(y, AMn, ∆−F, isUsedn, isUsedn+1);
        if y ∉ AMn then
          ACTIVATE(y, AMn, ∆+, ∆−, ∆+F, isUsedn, isUsedn+1)
    end;
  case e = pause(m):
    if m ∈ AMn then begin
      ∆+F = ∆+F ∪ {m};
      for all x ∈ (⇔)Closure(m)
        if (x ∈ AMn) then ∆+F = ∆+F ∪ {x}
    end;

  case e = stop(m):
    if m ∈ AMn then begin
      STOP(m, ∆−, isUsedn+1);
      // stop all media x for which a relation ⇓ exists with the stopped media item
      for all x ∈ (⇓)Closure(m)
        if x ∈ AMn then STOP(x, ∆−, isUsedn+1);
    end
end.

By Definition 4.2 the transition function of the automaton, next : S × E → S, is deterministic. Thus the automaton is deterministic: given a sequence of events, the automaton always reaches the same state. Given an event e and a state sn, the transition function next searches for all the synchronization rules in SR which are activated by such an event and performs the corresponding changes in state sn+1. If media item m starts, next calls the function ACTIVATE, or RESTART if the media item is paused; then it checks whether some relations >α exist with an active media item. If the item naturally ends, the transition function stops all media items x ∈ (⇔)Closure(m) and activates the items y for which a relation m ⇒ y exists. If e = pause(m), the function pauses all objects x ∈ (⇔)Closure(m). If a media object is forced to stop, the function stops all objects x ∈ (⇓)Closure(m).
As an example, we describe the automaton of an excerpt of the Maya Sun Temple presentation introduced in Section 3.6. Figure 4.1 summarizes the dynamics of the first two modules of the presentation. When the user starts the presentation by clicking on ui0 (which naturally ends, for the reason explained in Section 3.6), module M1 starts, replacing M0, which stops p0.
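Before walking through the example, the transition machinery can be condensed into a small runnable sketch. This is not the authoring tool's code: it keeps only the ⇔, ⇒ and ⇓ primitives, ignores channels, pausing, priorities and the story/clip hierarchy, propagates ⇔ only one step on activation, and uses the rule names and media identifiers of the Maya example purely for illustration:

    PW, ACT, TERM = "plays_with", "activates", "terminates"
    SR = [(PW, "M1", "a1"), (PW, "M1", "s1"),
          (ACT, "a1", "p1"), (ACT, "a1", "ui1"), (ACT, "ui1", "M2"),
          (TERM, "M1", "a1"), (TERM, "M1", "s1"),
          (TERM, "M1", "ui1"), (TERM, "s1", "p1")]

    def closure(a, prim):
        res, todo = {a}, [a]
        while todo:
            x = todo.pop()
            for r, s, t in SR:
                if r == prim and s == x and t not in res:
                    res.add(t); todo.append(t)
        return res

    def one_step(a):
        return {t for r, s, t in SR if r == PW and s == a} | \
               {s for r, s, t in SR if r == PW and t == a}

    def next_state(active, etype, m):
        """Simplified next(s, e) of Definition 4.8 (active media only)."""
        active = set(active)
        if etype == "start":
            active |= {m} | one_step(m)         # start m and its <=> partners
        elif etype == "end" and m in active:
            stopped = set()
            for y in closure(m, PW):            # end of m stops its <=> closure
                stopped |= closure(y, TERM)     # ... and their terminate closures
            active -= stopped
            for r, s, t in SR:                  # m => t: end of m activates t
                if r == ACT and s == m:
                    active |= {t} | one_step(t)
        elif etype == "stop" and m in active:
            active -= closure(m, TERM)
        return frozenset(active)

    s3 = next_state(set(), "start", "M1")       # {'M1', 'a1', 's1'}
    s4 = next_state(s3, "end", "a1")            # {'M1', 's1', 'ui1', 'p1'}
    s0 = next_state(s4, "stop", "M1")           # frozenset(): all media stopped

The three computed sets match the active-media components of states s3, s4 and the return to the initial state in the automaton discussed below.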
Figure 4.1: Synchronization primitives for the first two modules

Figure 4.2 depicts the automaton which describes the behavior of the first two modules. The automaton contains both the natural events, i.e. the natural end of a media item, and the user interactions. In particular, the user can:
• start the first module of the presentation,
• close a text page,
• pause the presentation, or
• stop the active module (and thus the presentation).
Figure 4.2 does not contain any pause event for the sake of readability. At the beginning of the presentation, the initial state is s0 = ⟨AM0, FM0, isUsed0⟩, where AM0 = FM0 = ∅ and isUsed0(c) = ⊥ for all c ∈ CH. In the natural evolution of the presentation, the user activates the initial module with the event e = start(M0). The function next(s0, start(M0)) returns the state s1 = ⟨AM1, ∅, isUsed1⟩ where AM1 = {M0, p0, ui0}, and isUsed1(channel(M0)) = M0, isUsed1(channel(p0)) = p0 and isUsed1(channel(ui0)) = ui0. Thus, the transition from s0 to s1 captures the fact that the initial module activates the page and the dice. From state s1 the user can stop the module or the page, or can click on the dice. In the first case, the function next(s1, stop(M0)) returns the initial state s0, since module M0 stops all its components. Otherwise, if the user stops the text page, only that media item is removed from the set of active media AM1. If the user clicks on the dice, the automaton reaches the state s3: the transition function activates module M1, which replaces module M0 and activates a1 and s1 (i.e. AM3 = {M1, a1, s1}). Without user interaction, the set of possible events is {end(s1), end(a1)}. In the first case the state does not change, because the soundtrack plays continuously; thus next(s3, end(s1)) = s3. In the case of the natural termination of the first animation a1, the second dice is displayed, together with page p1. The automaton reaches a new state s4 = ⟨AM4, ∅, isUsed4⟩ where AM4 = {M1, s1, ui1, p1}. Once more, if the soundtrack naturally ends, the state does not change. If the user stops page p1, p1 is removed from the set of active media items. If the user stops the module, the automaton reaches state s0, since s1, ui1, p1 ∈ (⇓)Closure(M1). When the user clicks on the second dice, module M2 begins.

Figure 4.2: Automaton of the first two modules of the Maya presentation

Figure 4.3 shows an excerpt of the automaton of the presentation, modelling its behavior if the user pauses it during animation a1. In state s6, media items M1, a1, s1 ∈ FM6, since a1, s1 ∈ (⇔)Closure(M1). If the user restarts module M1, the automaton comes back to state s3.
Figure 4.3: A pause event
4.2 Analysis of the presentation automaton
The definition of the automaton presented in the previous section can be used to study some properties of the presentation. First of all, it facilitates the analysis of the presentation behavior. For example, we can search for media items which loop continuously (e.g. the soundtrack in our previous example). If the presentation contains this type of media, the corresponding automaton contains a state sl which contains the looping media items, and a path init (sl)*, where init is an initial path from s0 to si, sl = next(si, e) and sl = next(sl, end(m)). In this case, m is the media item which loops. In our example this path is s0 s1 (s3)*, and the soundtrack is the returned item. Given a presentation P = ⟨MI, CH, E, SR⟩, the corresponding automaton AUT(P) is finite since, by Remark 4.1.1, S is finite and, by construction, E is finite.
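Assuming the automaton has been materialized as a transition table, for instance by the simulator of Chapter 6 (the encoding below is hypothetical, the thesis does not fix one), length-one loops such as the soundtrack's can be found mechanically:

    # next_table maps (state, (event_type, media)) -> state.
    def looping_media(next_table):
        """Media m with next(s, end(m)) = s for some state s (loops of length 1)."""
        return {m for (s, (etype, m)), t in next_table.items()
                if etype == "end" and t == s}

For the Maya excerpt, next_table[(s3, ("end", "s1"))] = s3, so looping_media(...) returns {"s1"}, the soundtrack; longer loops would need a cycle search over end-event edges.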
Definition 4.2.1 A presentation naturally ends if the set of final states in the corresponding automaton is Final = {sf}, where sf = ⟨∅, ∅, isUsedf⟩ with isUsedf(c) = ⊥ for all c ∈ CH, and there exists a path which leads to the final state, i.e. an ordered sequence of events e0, e1, . . . , en such that e0 = start(m0), ei = end(mj) for some medium mj which is not a user interaction entity, si+1 = next(si, ei) for all i = 0 . . . n, and sn+1 = sf. (According to Definition 3.1.6, a user interaction entity is an interaction widget displayed on the user screen, which naturally ends when the user clicks on it.) The ordered sequence of events might contain multiple instances of the same event end(mj), corresponding to different observable time instants.

This definition allows us to identify which presentations naturally end, without any user interaction, i.e. that class of presentations which, once activated, follow a path till the final state ⟨∅, ∅, isUsed0⟩. At any instant, the user can interact with them to change the followed path. Moreover, if the automaton can recognize hypermedia presentations which naturally end, it can classify the media components inside them.

Definition 4.2.2 Let P be a naturally ending presentation, and start(m0) . . . end(mj) . . . end(mk) be the sequence of labels of edges in the minimal path leading from the initial state s0 to the final state sf. We call master objects of the presentation the media objects that appear in the labels of the edges of the given path, whose timing makes the presentation evolve in its natural behavior. MO denotes the set of the master objects of a presentation.

Let us consider for example a presentation which is a lesson of a distance learning course like the one illustrated in Figure 3.6. The presentation consists of a video of the teacher explaining a number of slides, which are images (thus pages) displayed on the user screen. The video is organized into clips, which are divided into scenes. Each scene is played in parallel with a slide. Figure 4.4 depicts the automaton of the presentation, reporting only its natural evolution.

Figure 4.4: A natural ending presentation
According to Definition 4.2.1, the presentation in the figure naturally ends, since the
path start(M1) end(scene1,1) end(clip1) end(scene2,1) end(clip2) leads to state s0. We can call this path the natural evolution of the presentation. The set of media items which make the presentation evolve is then MO(P) = {M1, scene1,1, clip1, scene2,1, clip2}. Since scene1,1 and scene2,1 are components of the first and the second clip respectively, and M1 is a module, the presentation is ruled by the clips: the video stream can be considered the master stream which activates all the other objects of the presentation according to its evolution. If the user interacts with any media component, he or she diverts the presentation from its natural evolution. Moreover, if the user stops or pauses one of the two clips (and thus one of the scenes), he or she stops or pauses the whole presentation.

Definition 4.2.3 (Loop) A presentation P contains a loop, i.e. one or more media items which repeat continuously, if its corresponding automaton AUT(P) contains a path ek, . . . , en such that ei = end(mj) for some medium mj, si+1 = next(si, ei) for all i = k . . . n − 1, and sn+1 = sk. The length of the loop is (n − k) + 1.

In the Maya Sun Temple presentation described in Section 4.1, the path (s3, end(s1)), which models the soundtrack playing continuously, is a loop of length one.

Definition 4.2.4 (Reachable state) A state sk is reachable if there exists a path e0, e1, . . . , ek−1 from the initial state s0 to state sk such that e0 = start(m0), ei = end(mj) for some medium mj which is not a user interaction entity, and si+1 = next(si, ei) for all i = 0 . . . k − 1.

Definition 4.2.5 (Connected presentation) A presentation P is connected if all states s ∈ S in the corresponding automaton AUT(P) are reachable.

We call reachable a state that can be reached by user interaction or as a consequence of the natural evolution of the presentation. If a presentation is not connected, it contains some states which cannot be reached from the starting state s0. Thus the automaton is a useful tool during the authoring of the presentation since, once the interactions allowed to the user are defined, if the presentation is not connected, the user cannot access the sections of the presentation which correspond to the unreachable states.
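A reachability check in the spirit of Definitions 4.2.4-4.2.5 is a breadth-first search over the same hypothetical transition table; the restriction to non-interaction media on end events is omitted here for brevity:

    from collections import deque

    def reachable_states(next_table, s0):
        """States reachable from s0 via start/end events (cf. Definition 4.2.4)."""
        seen, queue = {s0}, deque([s0])
        while queue:
            s = queue.popleft()
            for (state, (etype, _)), target in next_table.items():
                if state == s and etype in ("start", "end") and target not in seen:
                    seen.add(target)
                    queue.append(target)
        return seen

    def is_connected(states, next_table, s0):
        """Definition 4.2.5: every state of the automaton is reachable."""
        return set(states) <= reachable_states(next_table, s0)

An authoring tool can report the states (and hence the presentation sections) outside reachable_states(...) to the author before delivery.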
Figure 4.5: Two equivalent presentations

Since the automaton of a presentation completely describes all the possible evolutions of a presentation, if two presentations share the same automaton, they have the same behavior. Then:

Definition 4.2.6 Two presentations P and P′ are equivalent if AUT(P) = AUT(P′).

By definition of the automaton, if two different presentations share the same automaton, they contain the same media objects (MIP = MIP′): if m ∈ MIP and m ∉ MIP′, there exists a state sk of AUT(P) such that m ∈ AMk and sk is not a state of AUT(P′), hence AUT(P) ≠ AUT(P′). For the same reason CHP = CHP′: if a channel c exists in only one of the two presentations, the automata are different, since their states contain different functions mapping the media objects onto the channels, their domains being different. Thus, if P ≠ P′ and MIP = MIP′, CHP = CHP′ and EP = EP′, then SRP ≠ SRP′.
Two equivalent presentations are shown in Figure 4.5, in which events of type start are not depicted due to lack of space. The only difference between presentations (a) and (b) are the relationships A ⇒ B and A ⇒ C. Since both presentations contain the relationship B ⇔ C, both objects B and C are activated when object A naturally ends. Thus the two presentations have the same behavior. Equivalent but not equal (i.e., P ≠ P′) presentations are very difficult to generate when we deal with complex presentations where the cardinality of MI is high, or simply when several media items share the same channel.
Figure 4.6: Two not equivalent presentations

For example, in the presentations described in Figure 4.5, objects A and B can share the same channel or use two different channels, and the presentations are still equivalent. Let us suppose objects A and B use two different channels. If we introduce a new object D which needs the same channel as B, and we add the relation B ⇒ D in both presentations, their behaviors change, as shown in Figure 4.6. If the user starts objects A and D in sequence, at the natural end of A in presentation (a) object B is activated, while in presentation (b) the relation A ⇒ C starts object C. In the first case object B cannot start, since its channel is still used by D; presentation (a) thus reaches state s3, where only D is playing. In the second case, presentation (b) reaches state s3, in which both objects D and C are active. Then AUT(P(a)) ≠ AUT(P(b)), and the presentations are not equivalent.
Chapter 5
An XML Schema for Describing Multimedia Documents
This chapter presents an XML schema to describe hypermedia documents modelled according to the model described in Chapter 3. The XML representation of a hypermedia presentation contains three different sections: the layout section describes the spatial arrangement of the media items, the components section contains the media involved in the presentation and the relationships section describes their temporal behavior. The complete schema for this XML language, defined using the XML Schema language, is in Appendix A. Appendix B proposes the same schema defined using the DSD language, Document Structure Description. The issues presented in this chapter are discussed also in [27].
Hierarchical structure and temporal relationships defined in our model are described in an XML schema to take full advantage of existing XML tools. The XML language [26] provides components for the construction of a language specific to a particular application domain: the actual tag set and structures must be defined, thus making XML a standardized way of exchanging structured data. Information stored in XML can be presented through various interfaces, and query languages for XML provide facilities to retrieve data from XML documents. An XML document can be seen as a collection of data stored to be machine-readable and also machine-understandable. Our main purpose is to store the hypermedia document structure and spatio-temporal information separately from the multimedia data. In most existing models this information is often mixed. For example, in SMIL 2.0 [57] spatial information is separated into the head section, but the temporal axis includes the media object declarations. Such an integrated definition does not encourage object re-use, mainly in
complex documents, where it would be especially useful. Redundancy is generated, which requires cross-checking between different document sections.
An XML source document describing a hypermedia presentation contains three kinds of specifications: the spatial layout of the document, the media involved in the presentation, and their temporal behavior. Data is thus organized in three sections: the layout section, the components section and the relationships section. This solution allows for easy reuse of media objects, since they do not contain any information about the documents they are included in. A presentation can be rendered by processing the XML file and accessing the media objects referred to from it. Moreover, queries can be executed on data content, retrieving pieces of presentations coherent with the original structure.
An XML file contains the whole information about a single presentation: the definition of its interfaces, the links to the media objects it contains, and how they are synchronized with each other. The XML file begins and ends with a tag presentation, which has an optional title attribute: a string which gives some information about the multimedia presentation described.
5.1 The Layout section
The layout section is the first section of the XML document. It contains the definition of the channels used by the presentation and the dimensions of the presentation window. Figure 5.1 shows the layout section of the presentation illustrated in Section 3.6. The an and text channels are portions of the user screen delimited by the coordinates of the upper left corner (the attributes SupX and SupY) and of the lower right corner (the
attributes InfX and InfY). The sound and noise channels are audio channels, and therefore have no visible layout. Each channel has a name that is a unique identifier.

Figure 5.1: Layout section of the Maya Sun Temple presentation
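The XML listing of Figure 5.1 did not survive the document conversion; the following is a plausible reconstruction based on the attributes just described, using the channel names of the Maya example (window size and all coordinate values are invented):

    <layout width="800" height="600">
      <channel name="an" SupX="10" SupY="10" InfX="520" InfY="430"/>
      <channel name="text" SupX="530" SupY="10" InfX="790" InfY="430"/>
      <channel name="sound"/>
      <channel name="noise"/>
    </layout>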
5.2 The Components section
The components section (Figure 5.2) contains the description of all the media objects involved in the presentation. According to our model definition, media objects are organized in a hierarchical structure. Each presentation contains at least one module, which contains continuous objects (i.e. stories or clips) or static objects (i.e. pages). A story is made of clips, which can be divided into scenes. Media objects can be organized into compound items (i.e. the composites), which contain both clips and pages; a composite can contain another composite.
Each element has a unique identifier, id, which is used to reference the object from the other sections of the document, and a type (except for modules and composites), which is one of video, audio or animation for continuous media, and text or image for the pages. The type is inherited from a story by its clips and scenes. The attribute channel is optional for stories and scenes, and is required for clips and pages. Since a story contains at least one clip, the channel can be defined in the story or in each clip of the story; a scene inherits the channel from its clip. Clips and pages always correspond to a file (they are the correspondence between the logical and the physical structure): the attribute file refers to the actual multimedia data URL. Figure 5.2 shows an excerpt of the component section of the second module of the "Maya" presentation illustrated in Figure 3.9.

Figure 5.2: Hierarchical structure in the component section
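The listing of Figure 5.2 was likewise lost in conversion; a sketch of what an excerpt of module M2's component section could look like under the rules above (all file names and values are illustrative):

    <components>
      <module id="M2">
        <clip id="s2" type="audio" channel="sound" file="birds.wav"/>
        <clip id="n1" type="audio" channel="noise" file="resurrection.wav"/>
        <clip id="a2" type="animation" channel="an" file="anim2.mpg"/>
        <clip id="a3" type="animation" channel="an" file="anim3.mpg"/>
        <page id="p2" type="text" channel="text" file="p2.html"/>
        <page id="p3" type="text" channel="text" file="p3.html"/>
        <page id="pb" type="text" channel="text" file="blank.html"/>
      </module>
    </components>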
5.3 The Relationships section
The relationships section of the XML document describes the temporal behavior of objects, by defining the list of synchronization primitives needed for the correct playback of the presentation. Each relationship type is coded with a different tag, enclosing two children tags for the two sides of the relationship; an optional attribute from_struct indicates whether the
relationship is derived from the hierarchical structure of the presentation or has been explicitly set by the author.
Figure 5.3 shows an excerpt of the relationships section of the second module of the "Maya" presentation illustrated in Figure 3.9. The relation "plays with" is coded with the tag play, having as children the tags master, denoting a continuous media object (cont_object), and slave, denoting a generic object. The relation "activates" is coded with the tag act, whose children ended and activated denote respectively the ending media object and the media object which is activated sequentially; ended contains a tag cont_object, since the relation "activates" is meaningless if applied to a static object, and activated contains a tag object. The relation "is replaced by" is coded with the tag repl, whose children tags before and after denote respectively the replaced and the replacing media objects; both children contain a tag object. The relationship "terminates" is coded with the tag stop, whose children tags first and second reference the two terminating objects; both first and second contain a tag object. The synchronization primitive "A has priority over B with behavior α" is coded with the tag link and the attribute behaviour holding the value of the parameter α (i.e. "pause"
to pause the source object, or "stop" to stop it). The object which contains the link is enclosed in a tag from, while the destination of the link is described by the tag to, both containing an object identifier. The complete schema for this XML language, defined using the XML Schema language, is in Appendix A.

Figure 5.3: An excerpt of the relationships section of the Maya Sun Temple presentation
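Since the listing of Figure 5.3 was also lost, here is a sketch of how the five tags just described encode some of module M2's rules from Section 3.6; the identifiers come from the Maya example, and everything else (including the use of the optional from_struct attribute on the first rule) is illustrative:

    <relationships>
      <play from_struct="true">
        <master><cont_object id="M2"/></master>
        <slave><object id="a2"/></slave>
      </play>
      <act>
        <ended><cont_object id="a2"/></ended>
        <activated><object id="p2"/></activated>
      </act>
      <repl>
        <before><object id="p2"/></before>
        <after><object id="p3"/></after>
      </repl>
      <stop>
        <first><object id="s2"/></first>
        <second><object id="p2"/></second>
      </stop>
      <link behaviour="pause">
        <from><object id="p4"/></from>
        <to><object id="p5"/></to>
      </link>
    </relationships>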
5.4 Constraints for a valid document
Besides the definition of the tags of the language, the schema also contains some elements to control the validity of the XML document. For example, two objects cannot have the same id, and two channels must have distinct names. The tag key is used to specify that the attribute id (or name, for the channels) is a key (i.e. it must be unique and always present) within a specified scope. The scope of the key is defined with an XPath expression specifying the values (attribute or element values) that must be unique. A set of keyref constraints is also provided: ref_channel ensures that the channel attributes of the elements of the hierarchical structure address a valid channel; ref_cont_object and ref_object check that the id attributes of the tags cont_object and object really point, respectively, to continuous and generic objects defined inside the document.
If an XML document describing a multimedia presentation can be validated with this schema, it is syntactically correct, but the presentation described can still contain some incoherency. For example, the XML schema cannot avoid the situation in which a story and one of its clips address different channels. This is obviously an error, since the player does not know which channel to use. This situation can arise because the XML Schema language does not allow one to control, with a constraint, that the attribute channel is defined only once in the hierarchical nesting of the objects, i.e. in a story, or in all its clips, or in all the scenes. So a story can address a channel and one of its clips another channel. A schema language which allows conditional definitions can avoid this kind of error. DSD [47], Document Structure Description, is a language for defining XML schemas that allows conditional definitions. Figure 5.4 shows an excerpt from the definition of a clip which assures that the attribute channel is defined only once. It checks whether this attribute is defined in the clip or in the story which contains it: in this case the clip can contain zero or more scenes, otherwise it contains at least one. A conditional definition of the scene will assure that it contains the attribute channel. A complete schema for our XML language using the DSD language can be found in Appendix B.

Figure 5.4: A conditional definition using DSD schema language
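Returning to the key mechanism mentioned above: a uniqueness key over object identifiers, together with a keyref that resolves object references, could be expressed in XML Schema roughly as follows (a sketch; the element names and XPath expressions of the real schema in Appendix A may differ):

    <xs:key name="object_id_key">
      <xs:selector xpath=".//module | .//composite | .//story | .//clip | .//scene | .//page"/>
      <xs:field xpath="@id"/>
    </xs:key>
    <xs:keyref name="ref_object" refer="object_id_key">
      <xs:selector xpath=".//object"/>
      <xs:field xpath="@id"/>
    </xs:keyref>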
Chapter 6
A Visual Authoring Environment for Prototyping Multimedia Presentations
In this chapter we describe an authoring environment which allows the author to set up and test a complex multimedia presentation by defining the synchronization relationships among media. The main component of the authoring environment is a visual editor based on a graph notation, in which the nodes are media objects involved in a multimedia presentation, and the edges are the synchronization relations between them. Several external representations can be generated: a timeline-based representation highlighting media sequencing, and an XML-based description suitable for further processing (e.g. it can be rendered by a player). An execution simulator helps the author to check the presentation behavior before delivery. The issues presented in this chapter are discussed also in [17, 28].
Authoring hypermedia presentations is a complex task, since the author must deal both with the structure and layout of the document, and with its temporal behavior. Chapter 3 addresses these problems when modelling hypermedia presentations. The task is even more difficult when dealing with interactive documents, since unanticipated user interaction can alter the correct timing relationships between media. In this situation the test of a presentation cannot rely on its execution under all possible circumstances. Rather, simulation of media-related events and user interaction should be possible, in order to test the synchronization and coordination among media besides their unattended execution. Many metaphors are proposed by existing authoring tools. The simplest is the time-based metaphor (see Section 2.2). Besides commercial products, many research works have designed multimedia authoring models based on different paradigms described in Section 2.2.
In this chapter we illustrate an authoring environment based on the synchronization
model defined in Chapter 3, which allows the author to set up and test a complex multimedia presentation by defining the synchronization relationships among media in a visual way. The main component of the authoring environment is a visual editor based on a graph notation, in which the nodes are the media objects involved in a multimedia presentation, and the edges are the synchronization relations between them, defined upon the occurrence of events like the begin and end of the media components. An execution simulator helps the author to check the presentation behavior not only during a normal continuous play, but also in the presence of user interaction like pausing and resuming a presentation, stopping media playback, or following a link in a hypermedia document. Several external representations can be generated: besides the XML language defined in Chapter 5, a timeline-based representation highlights media sequencing.
6.1 The Authoring Environment for Hypermedia Documents
The visual environment for authoring and prototyping multimedia presentations is based on the synchronization model defined in Chapter 3. Due to the event-based style of the synchronization relationships, a graph notation was a natural choice for the visual editor: media are represented as nodes and synchronization relationships are represented as typed edges. Authoring consists mainly in drawing the synchronization graph of the multimedia presentation, and all the temporal aspects of the presentation are defined by manipulating the graph. The editor also provides facilities to design the screen layout and to define the playback channels, to simulate the presentation dynamics, and to generate the XML description. The visual editor does not provide functions for editing the media components, but only for assembling the corresponding files: a wide choice of good programs for digital media manipulation is commonly available.
6.2 The User Interface
The visual editor provides two views to create and edit a multimedia presentation: one to define the channels and the layout of the presentation, and one to define the temporal behavior.
Channels can be of two types: regions, i.e., screen areas, and audio channels. The user defines the size of the multimedia presentation window, and creates the regions by drawing rectangles on it; each region has a unique name. The audio channels are defined only by assigning them a name. Colors distinguish the different regions, and are used consistently in the authoring process. Regions can be moved and resized by direct manipulation.
The temporal behavior view provides the author with a panel on which to draw the graph which describes the dynamics of the presentation in terms of synchronization among media. With respect to a time-line description, an event-based representation has some drawbacks and many advantages. Even if an event-based description may appear less intuitive, it allows a better specification of the relationships among objects, as evidenced in Chapter 2. The advantages are more noticeable during the testing of a new presentation: a time-line specification for hypermedia presentations that integrate continuous and non-continuous media with user interaction can only be an approximation of the real behavior. An event-based description is simpler to draw and to adapt, since it allows the designer to concentrate on the relationships between media-related events without anticipating the actual time length of the media items.
Each media object has some attributes: a unique name, a link to the corresponding file, the media type (text, image, video, etc.), and the channel used for its playback. It is drawn as a rectangle with an icon which identifies the type. The rectangle is colored like the associated region (hollow for audio objects) to show the channel used. Consistency between channel definition and usage is checked by the editor. Synchronization relations are established by selecting the two objects and the relation from a menu. The relation is drawn in the graph as an edge from the first to the second object, labelled with the icon of the relation type.
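The graph the author manipulates is essentially typed nodes and edges carrying the attributes just listed; a minimal hypothetical data model (the tool's real internals are not documented at this level of detail) might be:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class MediaNode:
        name: str          # unique name
        file: str          # link to the corresponding media file
        media_type: str    # "text", "image", "video", ...
        channel: str       # playback channel (region or audio channel)

    @dataclass
    class SyncEdge:
        relation: str      # "plays_with", "activates", ...
        source: MediaNode
        target: MediaNode

    @dataclass
    class SyncGraph:
        nodes: List[MediaNode] = field(default_factory=list)
        edges: List[SyncEdge] = field(default_factory=list)

        def add_relation(self, relation: str, a: MediaNode, b: MediaNode) -> None:
            # consistency between channel definition and usage is
            # checked by the editor; omitted in this sketch
            self.edges.append(SyncEdge(relation, a, b))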
Figure 6.1: The synchronization of a news-on-demand cover

Figure 6.1 shows the synchronization graph for such a presentation. Two audio channels are defined, one for the speaker's voice and one for the soundtrack, and two regions, one for the video and one for the headline. In Figure 6.1 we consider only two articles, but the synchronization schema can be extended in a straightforward way. The voice of the speaker, the video and the headline of an article are played in parallel, as described by the relationships newsi ⇔ videoi and newsi ⇔ headlinei, with i = 1, 2. As the speaker ends reading the summary of the first article, the second is activated by the relationship news1 ⇒ news2. At the end of the second article the relation news2 ⇒ menu displays the article menu. The soundtrack is repeated continuously, as modelled by the relationship music ⇒ music; when it starts, it activates the first summary (music ⇔ news1). Figure 6.2 shows the XML file of this hypermedia document, which can be used for further processing. Five other relationships are needed to model the situation in which the user stops one of the master media during its playback: the relationships newsi ⇓ videoi and newsi ⇓ headlinei, with i = 1, 2, stop the video and the headline if the user stops the corresponding audio news. The relationship menu ⇓ music forces the background soundtrack to end when the user stops (i.e., closes) the menu, since this action means that the user wants to exit the presentation.
<presentation xmlns="model.xsd">
  <layout width="500" height="400">
    <channel name="video" SupX="19" SupY="13" InfX="471" InfY="321"/>
    <channel name="headline" SupX="19" SupY="324" InfX="471" InfY="391"/>
    <channel name="voice"/>
    <channel name="sound"/>
  </layout>
  <components>
    <module id="news">
      <clip id="music" file="sound.wav" channel="sound" type="audio"/>
      <clip id="video1" file="video1.mpeg" channel="video" type="video"/>
      <clip id="news1" file="news1.wav" channel="voice" type="audio"/>
      <page id="headline1" file="headline1.txt" channel="headline" type="text"/>
      <clip id="video2" file="video2.mpeg" channel="video" type="video"/>
      <clip id="news2" file="news2.wav" channel="voice" type="audio"/>
      <page id="headline2" file="headline2.txt" channel="headline" type="text"/>
      <page id="menu" file="menu.txt" channel="video" type="image"/>
    </module>
  </components>
  <relationships>
    <play>
      <master><cont_object id="music"/></master>
      <slave><cont_object id="news1"/></slave>
      <master><cont_object id="news1"/></master>
      <slave><object id="video1"/></slave>
      <master><cont_object id="news1"/></master>
      <slave><object id="headline1"/></slave>
      <master><cont_object id="news2"/></master>
      <slave><object id="video2"/></slave>
      <master><cont_object id="news2"/></master>
      <slave><object id="headline2"/></slave>
    </play>
    <act>
      <ended><cont_object id="music"/></ended>
      <activated><cont_object id="music"/></activated>
      <ended><cont_object id="news1"/></ended>
      <activated><cont_object id="news2"/></activated>
      <ended><cont_object id="news2"/></ended>
      <activated><object id="menu"/></activated>
    </act>
    <stop>
      <first><object id="news1"/></first>
      <second><object id="video1"/></second>
      <first><object id="news1"/></first>
      <second><object id="headline1"/></second>
      <first><object id="news2"/></first>
      <second><object id="video2"/></second>
      <first><object id="news2"/></first>
      <second><object id="headline2"/></second>
      <first><object id="menu"/></first>
      <second><object id="music"/></second>
    </stop>
  </relationships>
</presentation>
Figure 6.2: The XML representation of a news-on-demand cover
6.3 A Timeline Style View
The authoring tool provides a third view, a timeline style view, which helps the author check the presentation behavior in terms of sequential and parallel execution of the media objects. However, this view cannot be used to edit the presentation, since the presentation is defined only by the synchronization relationships. The time-based view translates the synchronization graph into a tree structure which represents the parallel and sequential composition of the media, much as in a SMIL program. The media objects which play in parallel are drawn as children of a node labelled with the name of the "master" object, i.e., object a in the relationship a ⇔ b, which causes the "dependent" object to terminate if its length is longer. Objects played sequentially are drawn in a father-child relationship. Figure 6.3 shows the time diagram of the example illustrated in Figure 6.1. The tree has a root labelled with the name of the presentation. The first child is the object which starts the presentation playback. If it has children, and one of them is labelled with the same name as the father, the children play in parallel with it; otherwise they are activated at the end of the first object's playback.
Figure 6.3: A timeline style view

The execution continues until the tree has been traversed down to the leaves. The depth of a node is a measure of its position in the presentation: objects closer to the root are played before objects in a deeper position in the same branch of the tree. The tree therefore represents a family of timelines, where the exact duration of each medium is not defined but the mutual occurrences are visible. As an example, in Figure 6.3, the first node news1 has three children, news1, video1 and headline1. The media items video1 and headline1 can play until news1's natural end. This means that the tree represents six possible behaviors of the presentation, depending on the order of the natural ends of the three items (3! = 6 orderings). The reason why we speak of a family of timelines comes from the possibility of user interaction. In a multimedia document which provides user interaction the real duration of each component cannot be stated, since it depends on events external to the natural flow of the presentation, like pausing and resuming it, or jumping forward or backward. In a timeline each event occurs at a specific time, but a family of timelines is able to describe also events occurring as a consequence of other, unanticipated events. The timeline style view also provides some additional information that is not easily visible in the synchronization graph. E.g., the object which starts the presentation is the root's first child, but in the graph diagram it is simply a node which is not activated as a consequence of internal events. The tree describes which objects are reachable within the
normal flow of the playback (i.e., there is a path from the start of the presentation to the object) and which can be reached only by a user interaction (i.e., such a path does not exist and they are drawn as subtrees of the root).
6.4 An Execution Simulator
An execution simulator allows the author to check the temporal behavior of the presentation. The simulator places the media placeholders in the channels they would use in the real execution. Then, without being compelled to follow a real time scale, the author can generate all possible events, both internal (e.g., the end of an object's play) and external (e.g., a user-executed stop or a hyperlink activation), to see how the events are propagated and how the presentation evolves.

The simulator provides two interfaces for two different simulation styles. In the first style, media placeholders are used to show the presentation dynamics as perceived by the final user; the second simulation style shows which part of the graph of the presentation is currently in execution.

The simulator opens a new window which lists the media components of the presentation and displays the channels as designed by the author in the layout view. Audio channels, which do not have a layout, are represented as small rectangular areas outside of the presentation window. The author can select a medium from the list and send events to it: start, stop, pause and end, thus simulating the beginning, the forced stop, the pause or the natural end of the selected medium. Otherwise, the author can provide a file of pre-generated events, and see the presentation evolution step by step.

The simulator is a visual implementation of the formal automaton described in Chapter 4, and is easier for the common user to understand. The graph representation of the multimedia document is used to build the corresponding automaton. At each step, the simulator window contains all the information about the state of the presentation. The user selects the event to simulate and the simulator performs the next function of the automaton, displaying the new state. The mapping function isUsed defined in Section 4.1 is represented by the state of the regions in the simulator window,
the set AM of active objects contains all the objects which are using a channel (i.e., the colored regions), and the set of frozen items FM contains all the paused media objects. At the beginning of the simulation all channels are free, therefore they are drawn as empty areas. Each channel is marked with its name. When the author starts a media item, the channel it would use in the real execution is filled with the channel color, as set in the layout view, and the media item name (i.e., an identifier, the file name or a URL) is displayed in the channel area. A number of inconsistencies are checked and reported; e.g., if the channel is still busy because it was used by another media item and not released, an error message is displayed. Then the simulator checks the relationships defined by the author, looking for the ones which involve the selected media item (let us call it a):

• if a is a paused media item, it is restarted;

• for each object b such that the relation a ⇔ b exists, the simulator starts object b and fills the corresponding channel with the channel's distinctive color and object b's name;

• for each object b that object a replaces (i.e., for which the replacement relation between b and a exists), the simulator first stops object b (releasing the channel) and then activates object a;

• for each object b such that the relation a >α b exists, the simulator first stops (if α = s) or pauses (if α = p) object b and then starts the execution of object a.

A trace of the events triggered and media activated is logged for tuning and debugging purposes. It is worth noting that the simulation does not involve the actual files that will be used in the presentation, each media object being represented by a visual modification of the associated channel appearance¹.

The author can end or stop an active media item. In both cases, the simulator releases the corresponding channel by removing the channel color and replacing the object name with the channel name. Then, as for media start, it analyzes the relationships between the ended or stopped media item and the other media items. If the author ends media item a:
1. An improvement is under development, which allows displaying an image or a text, taken from the media object file, if available, in order to enhance the user's perception of how the presentation will evolve.
• for each object b such that the relation a ⇔ b exists, the simulator ends object b;

• for each object b such that the relation a ⇒ b exists, the simulator activates object b;

• for each object b such that the relation a >p b exists, the simulator resumes object b.

If the author stops a media item a, the simulator forces the termination of all objects b for which the relation a ⇓ b holds. If the author pauses a media item a, the simulator also pauses all media objects b such that a relation a ⇔ b exists. In this way the author can see at any moment which channels are busy and which media use them.

The interface provided is a simulation of the real media layout and dynamics during the presentation playback. However, understanding which part of the synchronization graph is currently executed is not simple: except for the names of the media active in the channels, no other information is provided. In order to improve the user's perception of the events and their relationships with the media, the simulation also animates the graph of the presentation: when the user activates an object, the corresponding node in the media-relationships graph is highlighted. Then the relations triggered by the activation of the object are also highlighted, and their effect is propagated to the related objects, giving the author, in a short animation, the visual perception of how media synchronization is achieved. If the author ends or stops a medium, the graph is animated by showing how this event is propagated to the related media objects.

Figure 6.4 shows a step of the simulation of the news-on-demand cover illustrated in Figure 6.1. In Figure 6.4a the presentation is playing the first news item news1, which plays together with the associated video file video1 and the title headline1². The soundtrack music plays continuously. The author simulates the end of news1 by selecting it in the list of media items (on the right in Figure 6.4a) and clicking on the button "End". The result is shown in Figure 6.4b: news2 starts as a consequence of the relationship news1 ⇒ news2, while the relationships news2 ⇔ video2 and news2 ⇔ headline2 activate video2 and headline2. The soundtrack continues playing. In the simulator windows the author can see both the result (in the lower right window of Figure 6.4b, where the media placeholders for the channels video, voice and headline are changed) and the chain of relationships activated (in the left window of Figure 6.4b). With this representation it is always clear to the
2. These objects are highlighted with thick borders.
Figure 6.4: The simulation of a news-on-demand cover
author which part of the presentation he or she is simulating, which media are active and how the end-user interface is organized.

The simulation is independent of the actual media file durations, since the author can simulate any combination of object durations by ending or stopping objects explicitly. This allows the author to simulate (therefore to model) a great variety of situations and to design the graph of a presentation even if the length of some related media is unknown.

Given a set of pre-generated events, the simulator can show the evolution of the presentation step by step. The list of events is a simple text file with the type of each event (start, stop, end or pause) and the name of the object affected by that event³ (a sketch is shown at the end of this section). The event file can be edited by the author or generated automatically for systematic testing. The main problem with this feature is the generation of event sequences which are coherent with the actual presentation execution: e.g., the user cannot stop a non-active medium, and cannot activate media out of their natural ordering unless explicit hyperlinks are provided.

Another possibility is to generate sequences of events automatically. As already stated, all the possible evolutions of a multimedia presentation can be described by means of an automaton which formally describes the presentation states entered upon the events which trigger the media playback. Each state of the automaton contains the set of active media (i.e., AM) and the corresponding channel occupation; AM is therefore also the set of media which the author can end or stop. Since the simulator allows the author to generate an event by selecting an object from the complete list of media items, we can assume that each medium can be started at any step. A coherent sequence of events can then be generated by retrieving a list of events from a finite path of the automaton.
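For instance, a pre-generated event file for the news-on-demand cover could look like the following sketch; the exact syntax accepted by the simulator is not fixed here, so the one-event-per-line format is only an assumption:

  start music
  end news1
  end news2
  stop menu

Starting music also activates news1 (music ⇔ news1), ending news1 activates news2, ending news2 displays the menu, and stopping the menu terminates the soundtrack, as prescribed by the relationships of Figure 6.1.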
6.5 Model translation using SMIL
The authoring tool generates an XML-based description which can be processed by a player implemented ad hoc [51]. Another possibility for the playback of the presentation is to translate the synchronization schema into an execution language. We discuss here the translation to SMIL 2.0, because it is a well-known international standard.

3. We are implementing the possibility to specify a time at which each event occurs.

We must note
that SMIL does not cover all the temporal constraints imposed by the five synchronization primitives defined in our model, in particular concerning user interaction. For example, as already discussed in Section 3.5, SMIL does not deal with the forced termination of an object. The most commonly used SMIL players allow the user to start, stop, pause and resume a presentation, but not the component media items alone: it is not possible to stop an object while letting the rest of the presentation evolve. Therefore, SMIL lacks the concept of forced termination of a medium. Under these assumptions, a partial translation is possible.

Each visual channel can be translated into a SMIL region, i.e., a screen area hosting media to be played, with the same spatial arrangement. Audio files do not require a specific channel in SMIL, therefore audio channels are ignored. In the same way, SMIL does not define a hierarchical structure, thus the overall synchronization must be redesigned to eliminate modules and composites.

Synchronization is achieved primarily through two tags: seq, to play two or more objects sequentially, and par, to play them in parallel. These two tags allow a natural translation of the relationships a ⇔ b and a ⇒ b, as sketched below. The relationship a ⇔ b can be translated into SMIL with the tag par, using the attribute end="id(a)". This attribute makes object b end when object a ends, by defining a synchronization point at a's end. The relationship a ⇒ b can be translated into the tag seq, since it simply models the sequential composition of different media. The replacement relationship (a is replaced by b) has no direct correspondence in SMIL; it only states that a must release the channel if b needs it. It is therefore necessary to assign to object b the same region as object a; the definition of media item a then contains the attribute end="b.begin", thus obtaining the same effect as the relationship. The relation ⇓ cannot be implemented using SMIL native features, for the reasons above.
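As a sketch of these two translations, applied to the first two articles of the news-on-demand cover (file and region names as in Figure 6.2; the end attribute is written here in SMIL 2.0 event syntax, while some players may require the older id(a) notation):

  <seq>
    <!-- newsi <=> videoi, newsi <=> headlinei: a par ended by the master -->
    <par end="news1.end">
      <audio id="news1" src="news1.wav"/>
      <video id="video1" src="video1.mpeg" region="video"/>
      <text id="headline1" src="headline1.txt" region="headline"/>
    </par>
    <!-- news1 => news2: sequential composition via seq -->
    <par end="news2.end">
      <audio id="news2" src="news2.wav"/>
      <video id="video2" src="video2.mpeg" region="video"/>
      <text id="headline2" src="headline2.txt" region="headline"/>
    </par>
  </seq>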
Some problems could arise for the relationship a >α b. SMIL models the creation of a hyperlink through the tag a href. This tag allows the use of the attribute sourcePlayState: if this attribute has the value pause, it translates the relationship a >p b; if it has the value stop, it translates the relationship a >s b. If the presentation does not contain a link to media item a, the same temporal constraint can be imposed using the tag excl and the priority classes.
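As an illustration of the excl-based alternative, the following sketch (with hypothetical media a and b, and a hypothetical clickable element startA whose activation starts a) pauses b while a plays and resumes it afterwards, approximating a >p b:

  <excl dur="indefinite">
    <priorityClass peers="pause">
      <!-- b starts immediately; when a begins, b is paused
           and resumed at a's end -->
      <video id="b" src="b.mpg" region="video" begin="0"/>
      <video id="a" src="a.mpg" region="video" begin="startA.activateEvent"/>
    </priorityClass>
  </excl>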
In this case we cannot provide a general translation; a different solution must be designed for each case.

Even if the translation of a single primitive is quite easy, we must note that a general rule for the translation of a complex presentation designed with the proposed model, involving several media items and synchronization relationships, cannot be defined. E.g., complex combinations of the synchronization relationships ⇔ and ⇒ must be translated by finding a correct nesting of the tags par and seq. This operation often requires searching, in the corresponding automaton, for the correct order of visualization of the media objects involved (thus it requires simulating the presentation execution). Sometimes, since the relationship a ⇔ b also states the forced termination of b when a ends, a correct nesting does not exist.
Chapter 7
Schema Modelling for Automatic Generation of Multimedia Presentations
A multimedia report is a multimedia presentation built on a set of data returned by one or more queries to multimedia repositories, integrated according to a schema with appropriate spatial layout and temporal synchronization, and coherently delivered to a user for browsing. This chapter discusses the problem of defining schemas for such multimedia reports. Multimedia presentations can then be automatically generated according to the schema. The XML language presented in Chapter 5 is extended to describe presentation templates. The issues presented in this chapter are also discussed in [18, 20].
Rapid progress in the technology for display, creation, storage and transfer of multimedia documents gives the user new possibilities to access and retrieve information of different kinds. Web sites offer a great variety of media such as text, images, video and audio files; multimedia documents are an effective way to present different kinds of information, since the presence of various media types gives more expressive power and more opportunities to catch the user's attention. As an example, consider the description of a scientific experiment, where an animation can convey more information than a large amount of numeric data, or a catalogue of hotels, where the presence of images and videos helps the user to better compare the different choices.

The authors of multimedia presentations must design the coordinated playback of different media and the consistent interpretation of user interactions. In many applications the information displayed comes from a data repository, and its identification and extraction require additional design care and the use of a schema for the data. Re-use of media and information for different purposes, and adaptation to user profile and history, are other
requirements asking for the design of presentations according to well-defined models and schemas. In such a scenario the automatic generation of standard multimedia presentations with data extracted from a repository is a valuable goal that allows authors to build, with limited effort, several variants of the same schema without re-designing the whole application from scratch. We aim at automatically generating multimedia presentations by developing schemas based on recurring patterns, focusing the discussion on the coordination and synchronization of continuous media. The idea is to extend both the model and the authoring tool, presented in Chapters 3 and 6, in order to model schemas for automatically building multimedia presentations. With a schema authoring system we aim at giving the author the possibility to define the layout and the behavior of the presentation, and the characteristics and attributes of the objects involved, without knowing the instances that will be used to fill the schema.

In this chapter, and in the following one, we discuss different problems relating to information retrieval for multimedia presentations stored in databases. Reporting is one of three basic access modes to a data repository, the other two being browsing and querying. Browsing means to access a data repository along a priori undefined paths, according to an estimate of relevance that the user formulates as he or she proceeds in the exploration of the repository (Chapter 8). Querying means to identify relevant information according to precise and pre-defined selection criteria based on the content of the information. In both chapters we are concerned only with the temporal and synchronization aspects of the integration of the returned media items. We therefore do not address issues related to query processing and retrieval of multimedia data; this is a problem of crucial importance, not easy to formulate formally and to solve. For example, the language (textual or pictorial) used to formulate the query belongs to a specific syntactic domain against which data representations of different types must be matched. Moreover, automatic correspondence between different types of data cannot be established safely relying only on the interpretation of the data content. Content-based image retrieval systems available today are still far from handling a concept of similarity based on the human perceived meaning of the images. Extending the semantic interpretation to several media adds an unknown amount of complexity.
In spite of such difficulties, retrieval of multimedia data can be performed to different extents with current technology. Therefore we assume that a system for retrieving multimedia information such as texts, voice talks, musical fragments, scores and images by content analysis is available, even if, for information other than text, it still exhibits a high degree of uncertainty in existing products and prototypes. An extensive survey of combined text, audio and video retrieval in news applications can be found in [7].
7.1 Multimedia Reporting
The automatic generation of multimedia presentations is based on three steps:

1. A presentation schema is defined in which the multimedia items can be placed in a coordinated way according to the desired dynamics.

2. Data is selected (retrieved) from a data repository. Each item is a media file (or part of it) with attributes that describe its contents (w.r.t. the selection parameters), its playback features, such as format, time duration and size on screen, and possibly additional information useful to relate the item with other data, such as indexes and cross-references.

3. The schema is filled with the retrieved data, the spatial and dynamic relationships are instantiated, and the presentation is played.

Steps 1 and 2 can be interchanged; in effect, knowledge about the format and the number of the returned data could suggest some changes in the schema to improve the final presentation. In some cases the automatic generation can be a first prototyping step of a more refined version, especially if the data repository is quite stable. Globally, the three steps build a continuous presentation in which data extracted from a multimedia data repository is located, connected, synchronized and coherently presented to the user. We call this activity multimedia reporting, i.e., the automatic generation of multimedia documents modelled with respect to a template, whose content is retrieved according to selection parameters [18].

Accessing data by reporting means that the retrieved data are meaningful as a collection, and the relationships among data items are perceived as relationships among
aggregations which have a meaning in the application domain. The presentation schema of the report also suggests to the user a way of reading it and adds further semantics to the data.

In the most general case multimedia reporting would require the designer to approach and solve many problems about data selection, e.g., how to coherently integrate data coming from one or several databases. In Section 7.6 we shall briefly discuss some issues about this problem; we note here that too much generality prevents satisfactory solutions and is in some way in contrast with the idea of "reporting", an activity based on standardization. Therefore we assume the following scenario for our work:

1. The presentation collects data into groups, like in a text report, such that in each group data of different types exist (e.g., video, audio, text, image), whose instances are related as in a relational table. More precisely, we assume that each group is structurally equivalent to a relational table where columns identify the media types and rows are instances. Some values can be NULL values, denoting that in some instances some media can be missing.

2. Apart from groups, "background" data exists which is associated to the presentation as a whole, or to parts of it identified by a group or a sequence of groups, such as a continuous soundtrack, a permanent title, a background image, and so on.

3. The whole presentation consists of the coordinated (e.g., sequential) playback of the groups, taking care of user actions like pause, stop, rewind, and so on.

4. No a priori constraint is put on the time properties of continuous data items, but the system should be able to coordinate the execution by synchronizing the beginning and end of the data group components.

When the operating environment evolves from text-only databases to complex and distributed multimedia repositories, reporting must be extended to face issues like media delivery and synchronization, channel management and user interaction, and cannot be effectively performed without a suitable model for describing the multimedia presentation which constitutes the report itself. As already discussed in Section 3.1.1, in the World Wide Web environment we are concerned with scenarios in which media items might be delivered independently, possibly by several servers, and must be synchronized at the user client site.
Figure 7.1: Overview of the multimedia reporting process

Then, since data instances are not known in advance, we are concerned with a data integration and synchronization model which is based on data classes and types rather than on data instances, a quite different situation with respect to traditional multimedia authoring and integration environments. Lastly, we are interested in building complex and interactive multimedia presentations. Therefore, we must define how different types of data can be presented together and how user actions must be handled in order to keep the whole presentation coherent.

Figure 7.1 summarizes the process of building a multimedia presentation by querying a multimedia repository. Query execution returns a set of multimedia tuples whose components belong to different media types. In each tuple instances are homogeneous¹, therefore the specifications for media playback are compatible over all the data instances of a specific tuple type. The presentation specifications describe, among other information, how media items are integrated; in particular, they describe how media items behave when events related to their playback happen. Master-slave relationships are established among media types, so that when the overall presentation is delivered to the user, its behavior is dictated by the dynamics and the timing of a master medium, usually a video clip or an audio track. Once the presentation is started, the master medium generates events when it starts, ends, or when the user interacts, e.g., by stopping playback. Such events propagate to other media, which in turn generate other events, determining the dynamic evolution of the whole presentation.
1. In Section 7.6 we shall argue about this assumption.
Figure 7.2: A simplified synchronization graph for a news-on-demand presentation
7.2 Dynamics definition in multimedia reports
Conceptually, defining multimedia reports is not different from building text-only reports: the author must define the structure of the report, i.e., the data layout, and the query to select and retrieve the relevant data. In the case of multimedia reporting, the data items collected have a temporal behavior, which increases the complexity of the structure definition by adding a new dimension to the task: the author must deal with synchronization problems and the temporal sequencing of objects. Moreover, even if the spatial layout definition may be trivial, this is not true for the temporal dimension.

As an example, an author could design a news-on-demand service based on a database of articles stored as related multimedia document items: video documentaries, audio and text comments, images, and so on. A multimedia report is built from the selection of the appropriate news items, presenting them as a synchronized sequence. Each article has a video story, an audio comment, and a text, which must be properly synchronized. The articles are normally played one after the other, but user interaction can modify the sequence, e.g., by skipping forward, going back, or selecting the articles from an initial menu.
Figure 7.3: A modular news-on-demand presentation

We use graphs as a visual representation for describing the temporal behavior of a multimedia presentation. The template of a multimedia report can also be defined by linking object placeholders (the nodes of the graph) with labelled edges (the temporal relations). Figure 7.2 illustrates a simple graph showing a (simplified) news-on-demand presentation made of three articles, drawn by our authoring system according to our synchronization model. The graph specifies that, while a background soundtrack is continuously repeated (when it ends, it is activated again), the three articles are played in sequence, and each article is made of a spoken narration (newsi), a video (videoi) and a text caption (captioni). The length of each article is controlled by the length of the narration, which is the master medium driving the parallel play of the other two media (the dot at the end of the line denotes the dependent medium). If the user stops the news playback, he or she stops the master medium, i.e., newsi, which also stops the video and text caption due to the relations newsi ⇓ videoi and newsi ⇓ captioni. When the last article ends, the soundtrack is replaced by a jingle and a credits screen is displayed.

In the graph of Figure 7.2 a recurring pattern is immediately perceivable: each article has the same components, i.e., a spoken comment, a video and a text caption, and the three articles have a common behavior. Figure 7.3 makes the recurrence more visible by introducing a composite item for each article, whose details can be hidden at a high level of specification. As already introduced in Chapter 3, a composite, drawn with a
thick border in order to distinguish it from atomic media items², is a kind of envelope enclosing several media items which behaves, at a high level of observation, as a compound media item, starting and ending according to the same synchronization rules which hold for atomic media items.

From the presentation schemas of Figures 7.2 and 7.3, a schema for a multimedia report which generically displays selected news in sequence can be derived straightforwardly. The schema does not define the object instances involved in the presentation, since the cardinality of the media set returned by querying the repository is unknown until execution. Therefore some nodes of the graph are placeholders for a collection of media items with the same characteristics, while other nodes denote media items which do not depend on queried data. The syntax used for drawing the graph must make evident which items build up the repeated media groups, and the schema editor must provide the author with a means to draw the structure of a report, specifying which part of the structure is a replicated group, the relations inside a group, and the relations between different instances of the replication.

The concept of composite is used to specify such repeated groups. To distinguish between report schemas and presentation schemas, a composite denoting a repeated report group is called a stencil. Figure 7.4a shows a report schema for the news-on-demand example of Figure 7.3. The stencil encloses the media placeholders which make up a repeated element (i.e., an article), specifying which events are generated and which synchronization relationships are obeyed. A stencil may also contain media items which do not depend on query results, but are simply replicated once for each tuple returned: such items are denoted with a star in the upper-right corner. In Figure 7.4b a richer article structure is shown: while retaining the synchronization between the voice comment, the video and the text of Figure 7.4a, each article instance is preceded by the article headline together with the TV channel logo and a musical tune, and is followed by a button, i.e., a user interaction entity, which allows the user to step through the news. The logo, the musical tune and the button are repeated for all the news items but do not change their content.

A stencil describes the structure and the behavior of a repeated presentation element, and is used as a template during instantiation. Relations which involve the stencil can be labelled with a value denoting which tuple of the result is affected by the relation: the first, the last, the next, or all the tuples if the relation is unlabelled.
2. The authoring system draws composites in a slightly different way, but this detail is not relevant here.
Figure 7.4: (a) A visual template for a news report; (b) an article with placeholders and media items

The iteration is subject to the following rules:

1. The first instance is executed according to the synchronization relationship labelled first (in Figure 7.4a, the plays with (⇔) relation with the soundtrack, which means that the media items enclosed in the stencil start playing with the soundtrack), and the composing media are synchronized as described by the stencil details.

2. The instance execution ends according to the synchronization schema described in the stencil, and the end is propagated out of the stencil according to the relationship which links the stencil instance to the next one (i.e., the relationship labelled next). Then the next instance is started according to the same synchronization schema; in Figure 7.4a the second narration starts with the related video and caption.

3. When the last instance of the stencil ends, the end is propagated as described by the relationship labelled last. In the example of Figure 7.4a, the soundtrack is replaced by a jingle, the credits screen is displayed, and the presentation ends.
The report is thus divided into a number of units, some of which are media items that
do not depend on retrieved data, while other units are stencils, i.e., structural templates to be iterated over the actual returned data instances. Since a stencil (as well as a composite) masks the details of the internal media items and placeholders, synchronization relationships cannot be established between media items outside the stencil and media items or placeholders inside it. The graph is a visual representation of the report template, which contains all the information about the temporal relationships of the media objects. Spatial relationships are described by the channels associated with the media items. Such a representation is used by the visual authoring system and is supported by an XML-based language that defines the data structure and relationships in a representation better suited to machine processing.
7.3 An XML-based schema definition for multimedia reports
The hierarchical structure and temporal behavior of a multimedia presentation can be described using the XML schema presented in Chapter 5. To take full advantage of XML processing tools, we extend the XML language we defined in order to describe multimedia reports. As for a hypermedia presentation, a report schema can be described by separating the definition of the spatio-temporal relationships among media objects from the information about their location. This approach is well suited to describing multimedia report schemas, since it is possible to define synchronization relationships without knowing any information about media item locations or durations. Moreover, it allows the design of report templates which can be instantiated with minimal modifications of the report data. The author defines the temporal behavior by addressing abstract media object identifiers (i.e., placeholders for actual data) rather than actual instances. The system binds the object identifiers, defined in the components section, to actual media objects after retrieving the data. The final presentation can be rendered by processing the XML file and accessing the media objects, which are located elsewhere.
If the XML specification describes a report, it is a template: page and clip definitions are placeholders for retrieved items, therefore the attribute file is not present and will be added at presentation instantiation time. Media can be defined inside a tag stencil, which represents a stencil item making up a repeated report element. In Figure 7.5, the clips soundtrack and jingle, and the page credits, refer to media objects which do not depend on the report instantiation (therefore the attribute file is defined), while the stencil section represents the thick rectangle of Figure 7.4a, which will be instantiated on the results of the query execution. Each item inside the tag stencil represents a class of objects; the items are instantiated by querying a data repository. In Figure 7.5, the clip video represents the set of videos returned for the selected news, news is the set of voice comments and caption is the set of text pages related to the same news. The attribute file is missing because it will be added during the schema and data integration phase.

In an XML template, relationships can be established between objects and stencil items. In this case, the tag ref_stencil is used. Relationships between stencil instances must be evaluated during the schema and data integration phase. The attribute num identifies which instance of the stencil is referred to: the next instance, the first or the last one. If this attribute is not present, the relationship is established within the same instance of the stencil.
7.4 Schema and data integration
Once the media objects are collected from the query results, the template is instantiated on the retrieved objects in order to generate the actual multimedia report. Figure 7.6 shows the XML description of the presentation described in Figure 7.2. The transformation of a report schema into a presentation is performed by the procedure FILL, which reads the XML file of the report (schema) and writes the resulting presentation in report. For simplicity the code assumes a correct XML schema of the report, therefore it lacks any error checking and diagnostic features.

FILL(schema, report: file, RS: mapping function)
// schema: file which contains the XML schema of the report,
// report: XML file which will contain the report after the computation,
// RS: function which returns for each stencil the set of data returned by the query
<presentation xmlns="report.xsd">
  <layout width="500" height="400">
    <channel name="video" SupX="19" SupY="13" InfX="471" InfY="321"/>
    <channel name="caption" SupX="19" SupY="324" InfX="471" InfY="391"/>
    <channel name="voice"/>
    <channel name="sound"/>
  </layout>
  <components>
    <module id="news_report">
      <clip id="soundtrack" file="sound.wav" channel="sound" type="audio"/>
      <stencil id="article">
        <clip id="video" channel="video" type="video"/>
        <clip id="news" channel="voice" type="audio"/>
        <page id="caption" channel="caption" type="text"/>
      </stencil>
      <clip id="jingle" file="jingle.wav" channel="sound" type="audio"/>
      <page id="credits" file="credits.txt" channel="video" type="image"/>
    </module>
  </components>
  <relationships>
    <play>
      <master><cont_object id="soundtrack"/></master>
      <slave><ref_stencil id="article" num="first"/></slave>
      <master><ref_stencil id="article"/></master>
      <slave><cont_object id="news"/></slave>
      <master><cont_object id="news"/></master>
      <slave><object id="video"/></slave>
      <master><cont_object id="news"/></master>
      <slave><object id="caption"/></slave>
      <master><cont_object id="jingle"/></master>
      <slave><object id="credits"/></slave>
    </play>
    <act>
      <ended><cont_object id="soundtrack"/></ended>
      <activated><cont_object id="soundtrack"/></activated>
      <ended><ref_stencil id="article"/></ended>
      <activated><ref_stencil id="article" num="next"/></activated>
      <ended><ref_stencil id="article" num="last"/></ended>
      <activated><object id="jingle"/></activated>
    </act>
    <repl>
      <before><object id="soundtrack"/></before>
      <after><object id="jingle"/></after>
    </repl>
    <stop>
      <first><object id="news"/></first>
      <second><object id="video"/></second>
      <first><object id="news"/></first>
      <second><object id="caption"/></second>
    </stop>
  </relationships>
</presentation>
Figure 7.5: XML schema for a news report
begin
  line = schema.readline();                       // first line is namespace
  line = replace(line, "report.xsd", "model.xsd");
  while line ≠ "</layout>" do                     // first section is layout
  begin                                           // copy without changes
    copy(line, report);
    line = schema.readline()
  end;
  copy(line, report);                             // copy "</layout>"
  line = schema.readline();                       // read next line
  while line ≠ "</components>" do                 // next section is components
  begin                                           // process components section
    if line.contains("stencil") then
    begin
      idstencil = attribute(line, "id");          // replicate a stencil for all data tuples
      FILLSTENCIL(schema, report, line, RS(idstencil), num, stencil)
    end
    else
      copy(line, report);                         // copy data item outside the stencil
    line = schema.readline()
  end;
  copy(line, report);                             // copy "</components>"
  line = schema.readline();                       // read next line
  while line ≠ "</relationships>" do              // last section is relationships
  begin                                           // process relationships section
    if beginRelation(line) then                   // copy or replicate the relationships
      FILLRELATIONS(schema, report, line, num, stencil)
    else
      copy(line, report);                         // copy the lines with the relation type
    line = schema.readline()
  end;
  ...                                             // copy to end of schema
end
The layout section of the XML document is not affected by report instantiation, which involves the objects and their relationships but does not modify the layout. Only the reference to the namespace needs to be modified, since the XML schema refers to the namespace defined for presentation templates, while the XML presentation addresses the one defined for multimedia presentations. The components section must be extended to address the retrieved objects. Media outside the stencils remain unchanged, while each stencil is replaced by a composite which contains the returned media. The query returns an ordered sequence of tuples, whose
elements belong to the media classes defined in the report schema by the objects inside the stencils. In our example the returned set is:

RS(idarticle) = {(videoi, newsi, captioni) | 0 < i ≤ |RS(idarticle)|}

The components section can be completed by replacing each stencil with a composite which contains the same objects. The composite is then replicated |RS(idarticle)| times, and the instances are distinguished by systematically changing the object name placeholders in the template. The attribute id is instantiated by appending a sequence number i, and the attribute file (when missing) is added to each object, referring to the actual media locations. This behavior is described by the procedure FILLSTENCIL.

FILLSTENCIL(schema, report: file, line: string, RS: set of tuples, num, stencil: mapping function)
// schema: file which contains the XML schema of the report,
// report: XML file which will contain the report after the computation,
// line: string which contains the last line read from file schema,
// RS: the set of returned media for this stencil,
// num: function returning for each stencil the number of replications,
// stencil: function returning for each element the containing stencil
begin
  elem = 0; numc = 1;
  id = attribute(line, "id");
  line = readStencil(schema, line);
  while RS ≠ ∅ do                                 // while the query returns some data
  begin
    pick next tuple from RS;
    // each stencil becomes a composite
    stencil = replace(line, "stencil", "composite");
    // append to attribute "id" a sequence number
    stencil = append(stencil, "id", numc);
    for all element in stencil do                 // insert attribute "file" if missing
      // if element depends on query results
      if attribute(element, "file") = null then
        stencil = add(element, "file", tuple(elem++));
    copy(stencil, report);
    elem = 0; numc++
  end;
  for all element in line do
    stencil(element) = id;                        // element is contained in stencil id
  stencil(id) = id;                               // id is itself a stencil
  num(id) = |RS|                                  // num(id) is the number of replications
end
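For instance, assuming the first returned tuple references the hypothetical files video1.mpeg, news1.wav and caption1.txt, the stencil article of Figure 7.5 would yield the composite:

  <composite id="article1">
    <clip id="video1" file="video1.mpeg" channel="video" type="video"/>
    <clip id="news1" file="news1.wav" channel="voice" type="audio"/>
    <page id="caption1" file="caption1.txt" channel="caption" type="text"/>
  </composite>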
The relationships section is processed similarly to the components section. Each single relationship is processed by the procedure FILLRELATIONS, which checks whether the media involved are stencils, media placeholders or actual media items. Relationships between objects outside the stencils are copied unchanged into the new file, while relationships between objects inside a stencil are replicated by the procedure SAME.

SAME(report: file, relation: string, iter: integer)
// report: file which contains the XML report,
// relation: string containing the whole relation,
// iter: number of iterations
begin
  for i = 1 to iter do
  begin
    line = append(relation, "id", i);             // append sequence number to attribute "id"
    copy(line, report)                            // copy line with correct indexes
  end
end
The management of relationships which involve stencils and other objects is a bit more complex. With reference to Figure 7.4a, a stencil can be both the origin and the end of a dynamic synchronization relation with a media item or a placeholder. The attribute num of the template definition points out which tuple of the resulting set is affected by the relationship. If num = next, the relationship must be defined between each resulting composite and its successor, as described by the procedure NEXT. If num is not present, the relationship must be replicated for every tuple of the set; otherwise it involves only the selected tuple.

NEXT(report: file, relation: string, iter: integer)
// report: file which contains the XML report,
// relation: string containing the whole relation,
// iter: number of iterations
begin
  for i = 1 to iter - 1 do                        // for each instance
  begin                                           // instantiate relation with next instance
    relation = appendN(relation, "id", i, 1);
    relation = appendN(relation, "id", i + 1, 2);
    copy(relation, report)
  end
end
FILLRELATIONS(schema, report: file, line: string, num, stencil: mapping function)
// schema: file which contains the XML schema of the report,
// report: XML file which will contain the report after the computation,
// line: string which contains the last line read from file schema,
// num: function which returns for each stencil the number of replications,
// stencil: function which returns for each element the containing stencil
begin
    relation = readRelation(schema, line);        // read the entire relation
    idA = attributeN(line, "id", 1);              // attribute "id" of the first object
    idB = attributeN(line, "id", 2);              // attribute "id" of the second object
    // replace all occurrences of "ref_stencil" with "cont_object"
    relation = replace(relation, "ref_stencil", "cont_object");
    if stencil(idA) = null then                   // A is not in a stencil
    begin
        if stencil(idB) = null then               // B is not in a stencil
            copy(relation, report)
        else                                      // B is a stencil, stencil(idB) = idB
        begin
            attrnum = attribute(relation, "num"); // remove attribute "num"
            relation = remove(relation, "num");
            case attrnum = "first":
            begin
                // append 1 to the 2nd attribute "id"
                relation = appendN(relation, "id", 1, 2);
                copy(relation, report)
            end;
            case attrnum = "last":
            begin
                // append the number of iterations to the 2nd attribute "id"
                relation = appendN(relation, "id", num(stencil(idB)), 2);
                copy(relation, report)
            end;
            // attrnum ≠ null here
        end                                       // end B is a stencil
    end;                                          // end A is not in a stencil
    if stencil(idA) ≠ null and stencil(idA) ≠ idA then
    begin                                         // A is a placeholder in a stencil
        if stencil(idB) ≠ null and stencil(idB) ≠ idA then   // B is in a stencil
            // copy the relation for all stencil instances
            SAME(report, relation, num(stencil(idA)))
        else                                      // B is a stencil, stencil(idB) = idB
        begin
            attrnum = attribute(relation, "num"); // remove attribute "num"
            relation = remove(relation, "num");
            case attrnum = "next":
                // copy the relation for all stencil instances
                NEXT(report, relation, num(stencil(idA)));
            case attrnum = null:
                // copy the relation for all stencil instances
                SAME(report, relation, num(stencil(idA)))
        end                                       // end B is a stencil
    end;                                          // end A is a placeholder in a stencil
    if stencil(idA) = idA then                    // A is a stencil
    begin
        if stencil(idB) = null then               // B is not in a stencil
        begin                                     // symmetric to A not in a stencil and B stencil
            attrnum = attribute(relation, "num"); // remove attribute "num"
            relation = remove(relation, "num");
            case attrnum = "first":
            begin
                // append 1 to the 1st attribute "id"
                relation = appendN(relation, "id", 1, 1);
                copy(relation, report)
            end;
            case attrnum = "last":
            begin
                // append the number of iterations to the 1st attribute "id"
                relation = appendN(relation, "id", num(stencil(idA)), 1);
                copy(relation, report)
            end;
            // attrnum ≠ null here
        end                                       // end B is not in a stencil
        else                                      // B is a stencil or inside a stencil
        begin
            attrnumA = attribute(relation, "num", 1);   // attribute "num" of the 1st object
            attrnumB = attribute(relation, "num", 2);   // attribute "num" of the 2nd object
            if attrnumA ≠ null and attrnumB ≠ null then
            begin                                 // instantiate attribute "id" as above
                relation = remove(relation, "num");
                case attrnumA = "first": relation = appendN(relation, "id", 1, 1);
                case attrnumA = "last":  relation = appendN(relation, "id", num(stencil(idA)), 1);
                case attrnumB = "first": relation = appendN(relation, "id", 1, 2);
                case attrnumB = "last":  relation = appendN(relation, "id", num(stencil(idB)), 2);
                copy(relation, report)
            end
            else                                  // attrnumA or attrnumB = "next", or both A and B are stencils
            begin                                 // equal to A in a stencil and B is a stencil
                attrnum = attribute(relation, "num");   // remove attribute "num"
                relation = remove(relation, "num");
                case attrnum = "next":
                    // copy the relation for all stencil instances
                    NEXT(report, relation, num(stencil(idA)));
                case attrnum = null:
                    // copy the relation for all stencil instances
                    SAME(report, relation, num(stencil(idA)))
            end                                   // end both stencils / "next"
        end                                       // end B is a stencil or in a stencil
    end
end
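Procedure SAME, invoked above, is not listed in this section; the following is a plausible Python sketch of its behavior, reusing append_n from the previous sketch and matching the uses above (the relation is replicated once per stencil instance, with the same index appended to both ids):

    def same_relation(report, relation: str, iterations: int) -> None:
        """Replicate the relation for every stencil instance (attribute num
        absent), appending the same index to both id attributes."""
        for i in range(1, iterations + 1):
            rel = append_n(relation, "id", i, 1)
            rel = append_n(rel, "id", i, 2)
            report.write(rel + "\n")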
In the example, the play relation of the soundtrack starts the execution of the first stencil instance. Since the attribute num is present with value first, the relationship must be evaluated only once, and refers to the first instance of the stencil, therefore to the first composite article1. Each stencil instance starts playing the audio file instantiated for the news item, since the attribute num is not defined in the relation play between article and news. The relationship play (⇔) of Figure 7.4a is propagated from the stencil to the voice comment. The object news plays the role of a master, since it starts the video and the caption, and stops their playback when ending. Its ending also coincides with the ending of the composite. In the generated presentation, therefore, a relation is established between the soundtrack and the first composite, as depicted in Figure 7.4a. For the other instances of the stencil, the relationships between the objects inside them are replicated. At the end of the last stencil instance, the relation act between article and jingle starts the play of a jingle. According to the num attribute of the stencil, this relation must be translated only for the last composite, thus the relationship act is instantiated between article3 and jingle. An act relationship exists between two stencil instances, specifying at both ends the stencil article. The relation must be replicated for all the tuples of the resulting set, i.e., for all the composites, since the attribute num is not defined in the first element of the relation, but assumes the value next in the second element. The relations instantiated are therefore

    articlei ⇒ articlei+1, 1 ≤ i ≤ 2

making the generated presentation play all the articles in sequence.
7.5 Handling incomplete results
As discussed in Section 7.1, the ordered set of tuples retrieved by the query

    RS(idarticle) = {(videoi, newsi, captioni) | 0 < i ≤ |RS(idarticle)|}

can contain NULL values, denoting that in some tuple some media item can be missing. This value requires attention, particularly if the missing media item is an object which rules the behavior of the whole presentation. As an example, if the media object caption2 is missing in the second tuple returned by the query, the presentation can continue its playback; simply, the channel assigned to caption2 remains empty. However, if the missing object is news2, the presentation cannot continue after news1 ends. Recalling a concept introduced in Chapter 4, we call master objects of a presentation those items which rule the behavior of the presentation, i.e., those items whose time properties define the presentation timing and advancing. The absence or unavailability of a master object stops the presentation playback. In an automatic generation framework such a behavior is not admissible; therefore master objects must be clearly identified, and their unavailability in a stencil instance must be overcome. Such a problem can be solved in two ways. First, we could modify the XML language for report templates to allow the author to define which media items are required (the master objects³) and which are optional and can therefore be missing. Otherwise, we could recognize this type of object by analyzing the synchronization relationships. Once we have identified the master objects, either by looking into the XML file or by analyzing the synchronization relationships, we can filter the set of tuples returned by a query according to the following rules:

1. if the tuple does not contain any NULL value, or the media items corresponding to the NULL values are not master objects, then the tuple is accepted;

2. if a NULL value corresponds to a master object, the tuple is discarded.
³ See Definition 4.2.2.
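A short Python sketch of this filter; the tuple encoding (a mapping from placeholder names to media items, with None for NULL) and all names are illustrative assumptions:

    def filter_tuples(tuples: list, masters: set) -> list:
        """Discard a tuple when one of its NULL (None) fields corresponds
        to a master object; accept it otherwise."""
        accepted = []
        for t in tuples:
            missing = {name for name, item in t.items() if item is None}
            if missing & masters:        # a master object is missing: discard
                continue
            accepted.append(t)
        return accepted

    rs = [{"video": "v1.mpg", "news": "n1.wav", "caption": "c1.txt"},
          {"video": "v2.mpg", "news": None,     "caption": "c2.txt"},
          {"video": "v3.mpg", "news": "n3.wav", "caption": None}]
    print(filter_tuples(rs, masters={"news"}))   # keeps the first and third tuple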
A more conservative approach could be to present to the user all the data returned by the query. In this case, the NULL value could be replaced by a timer [27], i.e., a continuous object with a constant duration. Such a solution plays all the other media objects for a certain time interval, during which the channel associated with the NULL instance remains empty, but the presentation does not stop. If the tuple contains other continuous media, the timer duration can be set by the player equal to the longest duration; otherwise a default value can be provided at design time.
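The timer duration rule can be sketched as follows; durations and the default value are hypothetical inputs that a player or the design phase would provide:

    def timer_duration(tuple_media: dict, durations: dict, default: float = 5.0) -> float:
        """Duration of the timer standing in for a NULL item: the longest
        duration among the continuous media of the tuple, or a default."""
        continuous = [durations[m] for m in tuple_media.values()
                      if m is not None and m in durations]
        return max(continuous) if continuous else default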
7.6 Query Execution
We have not discussed in this chapter the issues related to query formulation and execution. This is of course a problem of crucial importance, and we do not claim it is easy to formulate formally and to solve. However, effective solutions can be found in the database area, where models and technology for dealing with multimedia data exist. A number of questions must be answered, which however do not interfere with the schema model we have discussed here. We have assumed that data comes from one multimedia database, therefore media instances are naturally related to each other, much as in a relational table. What if several data repositories are accessed? This situation seems desirable due to the large number of available media sources. However, the problem may become hard to approach for several reasons:

1. Different data repositories can hold data items which are semantically close but very far apart in their physical properties, e.g., different in video size, in image resolution, or in audio fidelity. A coherent presentation including elements from all the repositories can be very hard or even impossible to build.

2. Different repositories can require different query languages, or queries with different parameters, due to differences in DB schemas. What about result integration?

3. We assume that different types of media can be returned. How are different elements related if they come from different databases, so that in principle only the interpretation of content can relate them? How can we "link", say, a video instance to an instance of a text related to the same article but coming from a different repository?
Current technology can help us in approaching some of the points above. Wrappers and mediators [60] can be used to approach the problem of querying and integrating several data repositories with different schemas. Semantic attributes and metadata can be used to identify relevant information in multimedia objects, e.g., according to the MPEG-7 standard [43]. In a multimedia report, however, it is plausible to assume a high degree of homogeneity in the returned data, due to the iterated nature of the media presented to the user. In approaching the automatic generation of presentations we are therefore bound to a set of constraints which make our initial assumptions realistic and effective:

1. We must be able to select coherent data, i.e., data that is semantically related and that can be put in a presentation which is recognizable by the reader as a meaningful document. This problem is present in all automatic presentation construction systems, and is assumed implicitly.

2. Data can be linked by external keys or equivalent cross-reference information which assures that we can identify related data by testing such information.

3. Data is coherent with respect to physical playback properties.

These requirements are satisfied if we have only one multimedia database. They can be guaranteed to some extent by filtering data coming from different databases using wrappers and mediators, even if it could be hard to assure the needed physical homogeneity in the resulting presentation. In this case we should assume that the report produces a presentation prototype that has to be refined by hand in its visual aspects. The range of applications compatible with these requirements is wide: news-on-demand, which we have used as a scenario in a very simplified view, is a good case, since the assumption that the same database holds news videos with associated texts and audio is realistic. Advertising is another good case, since it is plausible that a set of advertised items can each be described by a picture or a video, a spoken text, a jingle, and so on, related by well identifiable keys. In all cases the multimedia report can be completed with purely aesthetic media such as a background soundtrack, decorative frames, contour images, and so on, which can be described in the report schema or added in a subsequent refinement phase.
    <presentation xmlns="model.xsd">
      <layout width="500" height="400">
        <channel name="video" SupX="19" SupY="13" InfX="471" InfY="321"/>
        <channel name="caption" SupX="19" SupY="324" InfX="471" InfY="391"/>
        <channel name="voice"/>
        <channel name="sound"/>
      </layout>
      <components>
        <module id="news_report">
          <clip id="soundtrack" file="sound.wav" channel="sound" type="audio"/>
          <composite id="article1">
            <clip id="video1" channel="video" type="video"/>
            <clip id="news1" channel="voice" type="audio"/>
            <page id="caption1" channel="caption" type="text"/>
          </composite>
          <composite id="article2"> .... </composite>
          <composite id="article3"> .... </composite>
          <clip id="jingle" file="jingle.wav" channel="sound" type="audio"/>
          <page id="credits" file="credits.txt" channel="video" type="image"/>
        </module>
      </components>
      <relationships>
        <play>
          <master><cont_object id="soundtrack"/></master>
          <slave><cont_object id="article1" num="first"/></slave>
          <master><cont_object id="article1"/></master>
          <slave><cont_object id="news1"/></slave>
          ...
          <master><cont_object id="news1"/></master>
          <slave><object id="caption1"/></slave>
          ...
          <master><cont_object id="jingle"/></master>
          <slave><object id="credits"/></slave>
        </play>
        <act>
          <ended><cont_object id="soundtrack"/></ended>
          <activated><cont_object id="soundtrack"/></activated>
          <ended><cont_object id="article1"/></ended>
          <activated><cont_object id="article2"/></activated>
          ...
        </act>
        <stop>
          <first><object id="news1"/></first>
          <second><object id="video1"/></second>
          <first><object id="news1"/></first>
          <second><object id="caption1"/></second>
          ...
        </stop>
        <repl>
          <before><object id="soundtrack"/></before>
          <after><object id="jingle"/></after>
        </repl>
      </relationships>
    </presentation>
Figure 7.6: The generated presentation in XML
Chapter 8
Retrieval in Multimedia Presentations
In this chapter we discuss some issues about information retrieval in multimedia presentations. We propose a retrieval model able to reconstruct a coherent fragment of a presentation from the atomic components returned by the execution of queries to multimedia presentation repositories. The retrieval model is based on the automaton described in Chapter 4 and returns a new presentation with all the media related to the retrieved ones, with their original structural and synchronization relationships. The issues presented in this chapter are discussed also in [19, 21].
Retrieving information in hypermedia documents is a task which presents a number of peculiarities. First, information is contained in documents of different types; second, the user is usually interested not in any single document but in the part of the presentation in which at least one of the documents satisfies the query; and third, we consider hypermedia presentations in which at least one continuous media item plays in synchrony with other static documents. Then, the presentation being dynamic, a coherent and understandable segment must be returned to the user; i.e., the query result extent must be identified according to a context which takes into account the presentation structure. At first glance, the result of a query on a multimedia database is a set of media objects which respond to the query criteria. Each media item can be a (section of an) audio or video file, but also a non-continuous medium like a text page or an image. If the retrieved items are also part of a composite multimedia document, such as a complex presentation, they cannot be displayed or played alone: for example, if the user is looking for information about the "Gioconda" by Leonardo, a picture of a detail, without the corresponding text comment, is scarcely useful. For this reason, it is not possible, in general, to return to the user only an index of retrieved media items. We need to retrieve also any other item which is played together with
the retrieved ones, i.e., the media objects belonging to the same fragment of the original presentation. Therefore, a query system should re-build fragments of the original presentation. If two media objects belong to the same multimedia document, they can be proposed as two different results or as parts of the same one, according to their position in the presentation. The model discussed in this thesis is suitable for approaching this problem: it describes the presentation structure, for deciding when two items are part of the same section, and the temporal relationships, for displaying different results according to the author's design of the presentation. As already discussed in Chapter 7, we are concerned only with the temporal and synchronization aspects of the returned media items; we therefore do not address query specification and execution. The goal is the reconstruction of a coherent multimedia document fragment given one or more of its component media objects. We assume that a system for retrieving multimedia information exists; the user formulates a query with proper parameters which identify media items whose content is relevant, belonging to one or more presentations. The retrieval system executes the query and returns a (ranked) list of media items, possibly of different media types. Generally speaking, each returned reference is a pair ⟨Mid, P⟩ where Mid is the media item identifier, i.e., a reference to a descriptor containing a file locator or a URL, a media type and possibly other information, and P is a reference to the presentation in which it is located. If a media item belongs to several presentations, several pairs are returned with different values of P. Our goal is to re-construct, for each pair, a coherent (and complete) fragment of the presentation P to be displayed as the result of the query, identifying all the objects which belong to the same fragment of P which contains the media referenced by Mid. We want to return a time span of the original presentation, thus a complete fragment contains all the media objects which play in parallel with the retrieved media. Therefore we must:

• identify the missing media items to be supplied, and

• identify the fragment scope.

We assume that each presentation is described by a schema according to the model described in Chapter 3.
8.1 A working example
As a working example we briefly illustrate the overall structure of a multimedia CD-ROM¹ featuring Beethoven's Symphony no. 6 "Pastorale", published in Italy a few years ago [54]. The CD-ROM follows a constant structure and contains a wide set of information about the musician's life, the historical period, the masterwork itself (as an audio file), a description of its musical structure in the form of a graphical score and text comments, a criticism, in the form of written text, which accompanies the music play, and a set of audio files of related works. Figures 8.1, 8.2 and 8.3 show some screen shots from the CD-ROM. Figure 8.1 illustrates a description of Beethoven's life. It is an animated presentation (Figure 8.1(a) shows the final state) stepping through selected years of Beethoven's life. Each year is described by a short animation which draws a segment of a timeline, displays an image and starts a spoken comment. Background music plays during the whole animation. As the narration goes on, a complete timeline is built. At the end of the presentation the user can click over an image to display a text page which reproduces the comment that was given by voice about that year, as shown in Figure 8.1(b). The speaker's voice and the background music are integrated in the same audio file. According to our model, the audio track is a clip, which is divided into a sequence of scenes which interleave the speaker's voice and the background music. The animations which draw the timeline between two events in Beethoven's life are also clips, while the images displayed during the spoken comments can be considered pages, since they are static media objects. Figure 8.4 shows the synchronization schema of this part of the presentation. Since the presentation starts with an animation, even-numbered scenes correspond to voice comments and images, while odd-numbered scenes correspond to the transitions from one year to the next. The scenes play in sequence, hence scenei ⇒ scenei+1. Each audio scene plays in parallel with the related animation, which draws a timeline segment; this behavior is described by the relationship scene2i−1 ⇔ animationi. Similarly, each spoken comment goes in parallel with an image, as described by the relationship scene2i ⇔ imagei.

¹ We use a CD-ROM based example because it is a complete product featuring a complex multimedia presentation, even if the features of our model are mostly suitable for a WWW environment.
Figure 8.1: Screenshots from a presentation about Beethoven's life
Figure 8.2: The guide to listening of the Pastoral Symphony

The user can stop the presentation by clicking over a "bypass" button. The audio track is terminated, and the animation or the image displayed is terminated as well. This behavior is described by the relationships scene2i−1 ⇓ animationi and scene2i ⇓ imagei.

Figures 8.2 and 8.3 show two different but similar sections of this multimedia presentation: the analysis of the symphony from the points of view of the artistic quality and of the musical structure. In Figure 8.2 the overall structure of the work is shown and, as the music plays, a bar moves showing the position along the score. Text comments help the user to interpret the execution. In Figure 8.3 the musical score is drawn and updated as the music plays. Also in this case some text comments are displayed. The regular structure of these two sections can be described, in terms of synchronization relationships, in a way similar to the example shown above. Both sections are divided into modules with the same structure. Figure 8.5 illustrates the structure of the first module of the guide to listening. The module contains the soundtrack divided into stories (i.e., story1 and story2), clips (for the first story, c1,1 and c1,2) and scenes (for the first clip, sc1,1 and sc1,2). The soundtrack plays continuously, hence the module contains the relations:
Figure 8.3: The score analysis of the Pastoral Symphony
    M1 ⇔ story1
    storyi ⇒ storyi+1   ∀i
    storyi ⇔ ci,1   ∀i
    ci,k ⇒ ci,k+1   ∀i,k
    ci,k ⇔ sck,1   ∀i,k
    sck,h ⇒ sck,h+1   ∀k,h
The progress bar is an animation, also divided into clips and scenes (for the first clip of the animation, an1,1 and an1,2). Textual information is organized in pages. Each module corresponds to a movement of the symphony. This information is shown to the user by a header (a page in the model), title1 for the first module, which is displayed on the screen for the entire duration of the module. This behavior is modelled by the relation M1 ⇔ title1. Each movement of the masterwork is divided into sections which correspond to the stories. The relation storyi ⇔ tempoi displays the current tempo of the music. Two other pieces of information help the user to understand his or her position inside the music score: a short title is changed each time a new clip begins, and the bar progresses in synchrony with the music. The relations ci,k ⇔ infoi,k and sck,h ⇔ ank,h display each title for the duration of the corresponding clip and start the animations together with the scenes of the soundtrack.
Figure 8.4: The synchronization schema for the presentation about Beethoven's life
Figure 8.5: The synchronization schema for the presentation of the guide to listening of the music work

The relationships ank,h ⇔ textk,h complete the modelling of the presentation by displaying some text comments that help the user to interpret the music execution. If the user stops the presentation, all the objects must stop at the same moment. To model this behavior, relationships A ⇓ B can be inherited between all media items A and B for which a relation A ⇔ B exists: between the modules and all their stories, between the stories and their clips and, last, between the clips and all their scenes. For the sake of readability, Figure 8.5 does not contain these relationships.
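Since this inheritance is purely mechanical, a small Python sketch may clarify it; relations are modelled here as (source, type, target) triples, an illustrative encoding rather than the actual one:

    def inherit_stop_relations(relations: set) -> set:
        """For every A <=> B ("plays with") relation add the corresponding
        A forces-termination-of B relation, so that stopping an object
        also stops everything synchronized with it."""
        inherited = {(a, "stops", b)
                     for (a, rel, b) in relations if rel == "plays_with"}
        return relations | inherited

    rels = {("M1", "plays_with", "story1"), ("story1", "plays_with", "c1,1")}
    print(inherit_stop_relations(rels))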
Let us suppose that a collection of presentations about different symphonies and musicians is stored in a multimedia repository. Each presentation can be modelled with a
structure very similar to the one discussed above. In this scenario, we want to retrieve all the musicians in the data repository who lived in Leipzig, and for each of them we want a description of those periods or, in general, the information about their life in that town. We could find a set of voice comments, taken from the presentations of the musicians' lives, and the texts associated with them, as in Figure 8.1, plus other information chunks taken from different sections of the repository. If we want to retrieve all the music passages where strings play in a crescendo style, we could ask for such a text description in the guide to listening and in the score analysis, or perform a content analysis on the score itself in order to identify such annotations. In both cases the retrieval system will return a set of (pointers to) multimedia data instances, but the user should receive composite parts of the original presentation in order to understand the result of the query. Browsing presents similar problems: once a relevant piece of information has been identified, the user needs to enlarge the scope and access more complex information in order to understand it. The model presented in Chapter 3 allows the system to fill the gap between the retrieval of separate components and the presentation of the composite document. One could argue that in many cases the presentation is built as a single file or a set of tightly coupled files, i.e., a set of tracks, therefore a reference from the components to the time interval in which they are played is all that is needed to identify the relevant context. Even if the presentation is made of separate files integrated by a player, the execution proceeds linearly in time, therefore something similar to a time stamp could solve the problem. In many cases this is true, but if we consider, as the enclosing scenario of our discussion, a distributed environment like the World Wide Web, there are at least two issues that make this simplification unrealistic:

• the media are delivered by a server according to a composite document structure which can be distributed over several files, and is known only as the files are delivered to the client;

• a WWW document usually has links that allow the user to explore its structure in a non-linear way, and the user can also interact by backtracking or reloading the documents as they are displayed.
8.2 Building a Multimedia Query Answer
Presentation of multimedia information can rely on two properties of the model in order to build a query answer which contains all the presentation elements related to the media objects retrieved as results of the query execution: the hierarchical structure of the presentation (Chapter 3) and its corresponding automaton (Chapter 4). The static structure of the presentation can help to define the scope of the fragment to be returned to the user; the automaton, on the other hand, is used to collect all the media items that complete the fragment in which the retrieved object is located. The automaton, in fact, since it describes formally how the presentation evolves, contains in each node the information about which media play together, under all possible conditions of event triggering related to normal play and user interaction. It is therefore the candidate source of information for re-constructing consistent presentation fragments after some media items have been retrieved according to user queries. We do not consider the user interactions in the automaton, since we suppose that the query must return a fragment of the original presentation and thus of its original behavior. The user is then free to interact with the returned result, thus reaching the other states of the automaton. For this reason, the automata depicted in the figures of this chapter contain only the natural evolution of the corresponding presentations. The scope of the result, i.e., the "length" of the fragment, can be calculated in terms of the length of the path in the automaton, and requires additional attention. Our solution returns a presentation which ends when the media item selected by the query naturally ends. Other possibilities will be discussed in Section 8.3. Answer building is then performed in three steps (a sketch of the second step follows the list):

• the automaton of the presentation is built, according to the rules described in Chapter 4;

• the system selects from the automaton all the states in which the retrieved items are located;

• a new presentation, which is a fragment of the original one, is built and returned to the user as the result.
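As a minimal illustration of the second step, the automaton can be represented as a table mapping each state to its set of active media items; the representation and names below are illustrative assumptions, not the actual implementation:

    # State table of a toy automaton: state id -> set of active media items.
    active = {
        "s1": {"M1", "sc1,1", "an1,1"},
        "s2": {"M1", "sc1,2", "an1,2"},
    }

    def states_containing(active_sets: dict, media_id: str) -> set:
        """Select from the automaton all the states in which media_id is active."""
        return {s for s, items in active_sets.items() if media_id in items}

    print(states_containing(active, "sc1,2"))   # prints {'s2'}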
With reference to Figure 8.4, if the query retrieves a passage of the voice comment in
scenei, the answer to be reported to the user is at least that scene, which is dynamically linked to an imagej, which should also be displayed according to the presentation behavior. In the same way, after retrieving a text in which a crescendo of strings is described, e.g., text1,2 in Figure 8.5, it is easy to identify the corresponding part of the audio track and of the bar progress in the guide to listening, in our example sc1,2 and an1,2, because the segment of the audio track, the animation and the comment all belong to the set of active objects of the same state. In the score analysis, the text comment is related to the same segment of the audio track, but the image of the pertaining score page is related as well, and the whole presentation fragment made of the music, the score and the text can be delivered and displayed. If more than one object is selected as a retrieved item, it is easy to check whether they are part of the same section, of different sections of the same presentation, or of different presentations, avoiding duplicates. The hierarchical structure of a multimedia document helps to identify the scope of a module, of a clip or of a scene. Thanks to the same hierarchical structure, it is possible to give the user different levels of access to the resulting presentation. A first attempt is to index the scenes which build the master stream (Definition 4.2.2), which are the finest level of access in our model. The user can quickly browse them without being compelled to browse longer sections. In our example, if the user is interested in a particular moment of a musician's life, he or she wants to listen only to the scene regarding that year. This level of indexing thus gives the user the possibility to evaluate the accuracy of retrieval in terms of precision. The user could then select, according to a relevance analysis, some responses which appear more appropriate than others, and gain access to a higher level of the hierarchy, which gives a broader context for the retrieved items. In our example, the user could be interested in the whole life of one musician, say Beethoven, identified through specific information about one of the narrated episodes. The higher level is in this case the whole presentation of a specific instance in the retrieved set, together with animations and images. In all cases the presence of a synchronization model guarantees that the sections of the presentation which are displayed are coherent.
8.2.1 Retrieving Consistent Presentation Fragments
The retrieval engine is assumed to return a set R of pairs ⟨Mid, P⟩, each pair denoting a media item Mid and a reference to the presentation P. Each presentation P corresponds to an automaton AUT(P). As already said, we are interested only in the natural evolution of the presentation, so the automaton may contain only the event which starts the presentation, and the events which describe the natural end of its master components. Under this assumption, Figure 8.6 depicts the automaton of the first module of the guide to listening described in Figure 8.5. The set of consistent presentation fragments containing the retrieved media items is built according to the following procedure, which is executed for each presentation P = ⟨MI, CH, E, SR⟩ occurring in R. For the sake of readability, in the following we omit any explicit reference to P, implicitly assuming that media, states, events and synchronization relationships are related to the same presentation.

1. For each item r ∈ R returned, let Midr be the media item identifier and Sr the set of states in which Midr is active. Each state si ∈ Sr identifies the set of media items AMi which play together, and the corresponding channel assignments. Let Sf be the set of states which identify the fragments that must be returned to the user as the answer to his/her query. Initially, Sf = Sr. If Midr is active in two states s1 and s2 = next(s1, e) for some e ∈ E, we consider only state s1, i.e., Sf = Sf \ {s2}. If Midr is active in two states which are not sequential, and inactive in between, we take both states for subsequent analysis.

2. For each state si ∈ Sf identify the event which activates the retrieved media item Midr. Since media items are activated by plays with (⇔) or activate (⇒) relations, all states are entered by end events, except for state s1 which is entered by the event start(M1). In terms of the presentation synchronization relationships, state si is entered under one of two conditions:

   (a) an event start(M) has occurred, where M ∈ AMi is a module, or

   (b) a relation x ⇒ m ∈ SR holds, where m ∈ AMi, x ∈ AMj, and si = next(sj, end(x)).

In case (2a), the module itself acts as the "starting" component of the presentation fragment. In case (2b), only one media item in each state acts as the "starting"
object, the others being activated in parallel with it, or being already active by virtue of previous events. Let us call it m0.

3. If condition (2a) holds, the fragment to be returned is the module itself.

4. Otherwise, the minimum presentation fragment enclosing m0 to be returned to the user is described by a synchronization schema obtained from the original presentation schema with the following transformations:

   (a) if relation M ⇔ m0 ∉ SR, where M is the presentation module containing m0, add it to SR, and remove from SR any other relation M ⇔ mch where mch ∈ AMi and channel(m0) = channel(mch);

   (b) if ComponentOf(x, m0) and relation x ⇔ m0 ∉ SR, add it to SR, and remove from SR any other relation x ⇔ mch where mch ∈ AMi and channel(m0) = channel(mch);

   (c) for each media item m ∈ AMi, if m ∉ (⇔)Closure(m0) and the relation m ⇔ m0 ∉ SR, add relation m0 ⇔ m to SR;

   (d) iteratively remove all media items x ∉ Reach, where Reach is the set of reachable objects defined as follows:

       i. for any media item m ∈ AMi, m ∈ Reach;

       ii. for any media items x and m such that m ∈ Reach, if m ⇒ x ∈ SR and ∃k | x ∈ AMk ∧ sk ∈ Sr ∧ sj+1 = next(sj, e) ∧ i ≤ j < k, for any event e ∈ E, then x ∈ Reach²;

       iii. for any media item x, if m ⇔ x ∈ SR or x ⇔ m ∈ SR and m ∈ Reach, then x ∈ Reach.

In this way the resulting fragment contains only the media items that in the original presentation play together with the items retrieved by the query. The items which temporally precede the retrieved ones are removed, and the fragment ends when the retrieved media items are no longer active. This point deserves some comment. In step 4(d)ii the procedure retains the part of the presentation corresponding to the sequence of states si . . . sk such that sj+1 = next(sj, e), i ≤ j < k, where the retrieved media are active in all the states sj and not active in si−1 and sk+1. In terms of the presentation synchronization relations, if state sk is not a final state, an x ⇒ y relation must exist for some medium x in state sk. If this relation is removed from the presentation, and all the unreachable media items are then iteratively removed as well, the presentation stops playing when leaving state sk. This procedure also assures that if two media items mi1 and mi2 which last for the same time interval are retrieved, the same presentation fragment is returned for both. Finally, channels are preserved since they are statically associated with the media.

² Sr is defined in step 1 of this procedure.

Figure 8.6: The automaton of the guide to listening of the Pastoral Symphony

Referring to our example, let us suppose that a user's query returns a pair ⟨c1,2, P⟩ where P refers to the schema of the guide to listening presented in Section 8.1. The automaton of the presentation is illustrated in Figure 8.6. We can easily identify the set of states in which c1,2 is active: Sr = {s4, s6, s7, s9}. Since s6 = next(s4, end(sc2,1)), s7 = next(s4, end(an2,1)) and s9 = next(s6, end(an2,2)), we consider only Sf = {s4}. The set of objects which play together with c1,2 is then AM4 = {M1, title1, story1, tempo1, c1,2, info1,2, sc2,1, an2,1, text2,1}, which corresponds to the text information about the position inside the music work, the bar animation and the text comments which help to understand the music execution.
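Step 1 of the procedure, which collapses runs of consecutive states into their first state, is easy to express in Python; next_state below encodes the transitions of Figure 8.6 restricted to the example, and the names are illustrative assumptions:

    def initial_fragment_states(sr: set, next_state: dict) -> set:
        """Keep only the states of Sr that are not reached from another
        state of Sr by a single transition (step 1 of the procedure)."""
        successors = {t for (s, e), t in next_state.items() if s in sr and t in sr}
        return sr - successors

    next_state = {("s4", "end(sc2,1)"): "s6", ("s4", "end(an2,1)"): "s7",
                  ("s6", "end(an2,2)"): "s9"}
    print(initial_fragment_states({"s4", "s6", "s7", "s9"}, next_state))  # {'s4'}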
Since the synchronization rule c1,1 ⇒ c1,2 exists, c1,2 is the "starting object" m0. The synchronization relationships M1 ⇔ story1 and story1 ⇔ c1,1 are then replaced by the relations M1 ⇔ c1,2 and story1 ⇔ c1,2. The resulting fragment is shown in Figure 8.7, where the "unreachable" items have been removed (and consequently all the synchronization rules involving at least one of them have been removed as well).

Figure 8.7: The resulting presentation
8.3 Discussion
The presentation fragment built by the procedure in Section 8.2.1 stops its execution when the media returned as query results are no longer active. Let RM be the set of media items returned by the query; given an automaton AUT(P) = ⟨S, E, s0, next, Final⟩,

    Final = {si | si = next(sj, e), RM ∩ AMj ≠ ∅, RM ∩ AMi = ∅}.

We could ask whether this behavior, which is consistent with the way the presentation is transformed, is correct with respect to the user's expectation.
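The set Final is directly computable from the automaton; a sketch under the same toy representation used earlier (active maps states to their active media, next_state maps (state, event) pairs to states):

    def final_states(active: dict, next_state: dict, rm: set) -> set:
        """States entered exactly when the retrieved media RM cease to be
        active: si = next(sj, e) with RM meeting AMj but not AMi."""
        return {t for (s, e), t in next_state.items()
                if rm & active[s] and not rm & active[t]}

    active = {"s9": {"c1,2"}, "s5": set()}
    next_state = {("s9", "end(c1,2)"): "s5"}
    print(final_states(active, next_state, rm={"c1,2"}))   # {'s5'}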
Indeed, such a sharp identification of the temporal scope of the fragment could lessen the significance of the result. The presentation, once started, could instead follow its complete execution up to its end. In this case, the system returns to the user an access point and leaves the user free to stop the presentation playback when he/she wants³.
Another choice could be to identify the fragment scope based also on the presentation static structure: a meaningful fragment ends when the module which contains it ends. These solutions can help the user to better understand the context and the significance of the resulting fragment. Therefore, while it is easy to set the beginning of the relevant scope of the fragment, its end is more a matter of meaning and semantics than of structure.

A second comment concerns the hierarchical structure of the presentation in terms of modules, stories and sections. They build up a hierarchy of contexts which can give the user different levels of access to the presentation content. The retrieval model we have presented identifies only a minimal scope of the fragment of the presentation and a set of temporally related media. The design structure of the presentation can integrate this dynamic information with other information related to the identification of different "meaning scopes" in the presentation. For example, let us suppose that the user's query returns as two different results the text comments text1,1 and text1,2. Our procedure considers them separately. Since they are active in two different states, s1 and s2 (see Figure 8.8(a)), the system returns two different fragments. With reference to Figure 8.8, (b) is the fragment returned for text1,1, which corresponds to state s1, highlighted with a thick rectangle in Figure 8.8(a), and (c) is the result which contains text1,2, which corresponds to state s2 (see the dotted rectangle in the figure). Since both text1,1 and text1,2 belong to the same section, and s1 and s2 are two consecutive states, it is reasonable to return a single fragment which contains both media items, much like the fragment depicted in Figure 8.7. The choice between these two possibilities can depend on the level of access required by the user.

In the same way, we suggest that a narrower scope than the one defined here could come from computing the (⇔)Closure and (⇔)1step sets on the retrieved media items. The scope is narrower because it does not consider media which are not directly connected by synchronization relationships to the retrieved ones. E.g., a background soundtrack which starts at the beginning of the presentation and lasts up to its end does not bear a meaningful content for the presentation evolution, therefore it could be left out from the presentation of retrieval results, at least in a first browsing phase where the relevance of the returned items is evaluated by the user.
³ This behavior is obtained by removing step 4(d)ii from the procedure described in Section 8.2.1.
In the same way, elements like generic menus, banners, advertisements, sidebars, which
often surround the core information in Web documents are not only scarcely relevant, but can divert the user's attention from the primary results of information retrieval. We can cut these components off from the resulting fragment with a deeper analysis of the automaton which describes the behavior of the presentation. By Definition 4.2.2, if a presentation naturally ends, we can recognize which master objects make the presentation evolve, and which media items are played as a consequence of the progress of the presentation. Then, once the set of states in which the user is interested has been retrieved, the master objects contained in them must belong to the returned fragment. Other media items may belong to it or not according to other considerations, such as the level of access given to the user, the resources available at the client side, and so on. As an example, turning back to the result depicted in Figure 8.7, c1,2 is the master object (so are its scenes sc2,1 and sc2,2) and must be returned to the user. The objects tempo1 and title1 are not directly related to the clip or to the scene, and can therefore be omitted. The bar animations an2,1 and an2,2 and the text comments text2,1 and text2,2 can be returned or not according to other information, like, e.g., the user preferences.
Figure 8.8: The fragments returned for text1,1 and text1,2
Chapter 9
Conclusion
In this thesis we have discussed synchronization issues in hypermedia presentations composed of several continuous and non-continuous media objects. We have defined a model for synchronization relationships among media objects, implemented a visual authoring environment based on the model, and suggested a number of applications in areas related to the automatic generation of standard presentations and to multimedia information retrieval. A presentation is modelled along two directions: structure and temporal synchronization. The model structures a hypermedia document in a hierarchical way: a continuous media stream is a story, which is made of clips, divided into scenes; static documents are called pages. Atomic media can be composed into composite items in order to describe more complex synchronization schemas. Synchronization is achieved with a set of relationships among the components of a multimedia presentation. Spatial positioning is obtained through channel definitions. The model also provides inheritance of some synchronization relationships from the hierarchical structure of the presentation. The model is best suited to describe applications based on a main video or audio narration, to which static or dynamic objects are synchronized, but its reference context is wide, and includes applications like self- and distance education, professional training, Web advertising, cultural heritage promotion and news-on-demand. We have called this kind of applications "video-centered" presentations. The model is able to capture all the aspects of a hypermedia presentation: since we proposed an event-based approach to describe the temporal behavior of media items, given a presentation, all its possible evolutions along time can be formally described by an automaton. At any given time, all the information which characterizes the presentation is
collected into the notion of state, which contains the set of objects currently playing, the set of paused objects and the channel occupation. Then, given a state and an event, the automaton deterministically calculates what changes are needed, therefore which media are activated and which items are stopped in the new state. The study of the automaton of a presentation can reveal some properties of the presentation, e.g., whether the presentation naturally ends, and the role of its components, e.g., which are the master objects. We have defined an XML language for hypermedia presentations described according to the model. The language clearly separates object definitions from spatial arrangement and temporal synchronization in order to improve reusability. The thesis also proposes a visual authoring tool which allows the user to arrange the spatial layout of the document by drawing the channels of the presentation. The synchronization is achieved by the definition of a graph in which the nodes are the media components and the edges are the temporal relationships between them. The authoring tool generates two external representations: a timeline-based representation highlighting media sequencing, and an XML document. An execution simulator helps the author to check the presentation behavior before delivery. Some applications of the proposed model have been discussed. In order to relieve the author from tedious and time-consuming work, we have investigated the possibility of automating the authoring process for a particular class of multimedia presentations. A multimedia report is a multimedia presentation built on a set of data returned by one or more queries to multimedia repositories, integrated according to a schema with appropriate spatial layout and temporal synchronization, and coherently delivered to a user for browsing. Multimedia reports can be automatically generated, given a synchronization schema and the media items returned by a (set of) queries. The XML language facilitates the integration process, and has been extended to describe multimedia presentation templates. Another application of the model is in the area of retrieval of distributed multimedia presentations. In the thesis we have analyzed the model from the perspective of information retrieval and browsing, showing how it can help a user in building coherent answers to information retrieval queries on multimedia data repositories where complex presentations are stored. Compared to existing approaches, the model discussed in the thesis has some advantages and some drawbacks. We have already addressed the peculiarities of an event-based
synchronization model compared to a timeline solution, which requires additional work to carry out even a local adjustment. Moreover, the separation of the presentation aspects of a multimedia document (i.e., the layout) from the data definition (i.e., the media file location) and the temporal behavior (i.e., the synchronization relationships) encourages further processing and object reuse. A hypermedia presentation designed according to this model can be easily translated into an execution language like, e.g., SMIL. Moreover, this separation increases the portability of the presentation: for example, the layout can be simply adjusted to be played on a hand-held computer. The hierarchical structure of the document also gives some advantages to the model: as already said, it allows the author to specify only a restricted set of temporal relations among the media items (others can be automatically inferred), and it helps to provide different levels of access to the user when browsing the results of a query on multimedia presentations. Moreover, it encourages reuse of modules or sections of existing documents. Some sections of the work presented in this thesis deserve further investigation. First of all, the authoring tool provides a simulator which implements the automaton of a presentation. It shows the presentation behavior without accessing the actual media files, thus allowing the prototyping of multimedia documents. We are currently developing a player for distributed multimedia presentations which completes the authoring environment, giving the author the possibility to check problems related to the download of media items from the network (e.g., what happens if a media object is temporarily unavailable). The player can be used both as a stand-alone application and as an integrated tool during the authoring phase. In this case, the player cooperates with the simulator, animating the graph of the presentation to highlight the synchronization relationships activated by the triggered events. The authoring environment could also be extended to define multimedia report schemas to be integrated with the data returned by a query after its execution. Another topic which needs further investigation is the identification of the scope of the fragment of a presentation to be returned to the user as the answer to a query. In our discussion, it corresponds to the segment of the presentation in which the media item returned by the query is active. This solution returns a complete fragment of a presentation, thus simulating the normal playback of the document. One could ask whether all the media items returned are actually relevant to the user. Elements like generic menus, banners, advertisements,
sidebars, which often surround the core information, mainly in Web documents, can often divert the user's attention from the primary results of information retrieval and increase the resources needed. Therefore, we are investigating the possibility of narrowing the scope to cut off all the media objects which build the background of the presentation but are not neatly relevant to the query, thus increasing the precision of the retrieval system. Future work will address these issues. Moreover, we are studying the possibility of modelling distributed hypermedia presentations which can be played on mobile devices (e.g., hand-held computers) and change their behavior according to the user context.
Appendix A
An XML schema for Multimedia Presentations using XML Schema language
This schema is also available at www.dsi.unive.it/~ogaggi/xml/model.xsd.

XML Schema for hypermedia documents based on the synchronization model defined in Chapter 3, December 2001.
Appendix B
An XML schema for Multimedia Presentations using DSD language
Schema for "An XML language to describe hypermedia presentations", version 1.0, Ombretta Gaggi. This document contains an alternative implementation, using the DSD language, of the schema for the XML language defined in this thesis to describe hypermedia presentations.
Appendix C
An XML schema for Multimedia Report using XML Schema language
XML Schema for multimedia reports using our model for MM presentations. First version, July 2002. Author: Ombretta Gaggi.
References

[1] Sibel Adali, Maria Luisa Sapino, and V. S. Subrahmanian. An Algebra for Creating and Querying Multimedia Presentations. Multimedia Systems, 8(3):212–230, 2000.
[2] J. F. Allen. Maintaining Knowledge about Temporal Intervals. Comm. ACM, 26(11):832–843, November 1983.
[3] E. André. A Handbook of Natural Language Processing: Techniques and Applications for the Processing of Language as Text, chapter The Generation of Multimedia Documents, pages 305–327. Marcel Dekker Inc., 2000.
[4] C. Baral, G. Gonzalez, and A. Nandigam. SQL+D: extended display capabilities for multimedia database queries. In ACM Multimedia 1998, pages 109–114, Bristol, UK, September 1998.
[5] C. Baral, G. Gonzalez, and T. Son. A Multimedia display extension to SQL: Language and Design Architecture. In International Conference on Data Engineering, Orlando, FL, USA, February 1998.
[6] E. Bertino and E. Ferrari. Temporal Synchronization Models for Multimedia Data. IEEE Transactions on Knowledge and Data Engineering, 10(4):612–631, July/August 1998.
[7] A. Del Bimbo. Visual Information Retrieval. Morgan Kaufmann, 1999.
[8] S. Boll, W. Klas, and U. Westermann. A Comparison of Multimedia Document Models Concerning Advanced Requirements. Technical Report 99-01, Ulmer Informatik-Berichte, Department of Computer Science, University of Ulm, Germany, February 1999.
[9] Susanne Boll and Wolfgang Klas. ZYX - A Semantic Model for Multimedia Documents and Presentations. In The 8th IFIP Conference on Data Semantics (DS-8) - Semantic Issues in Multimedia Systems, Rotorua, New Zealand, January 1999.
[10] Susanne Boll and Wolfgang Klas. ZYX, a Multimedia Document Model for Reuse and Adaptation of Multimedia Content. IEEE Transactions on Knowledge and Data Engineering, DS-8 Special Issue, 13(3):361–382, May/June 2001.
[11] J.A. Brotherton, J.R. Bhalodia, and G.D. Abowd. Automated Capture, Integration, and Visualization of Multiple Media Streams. In IEEE International Conference on Multimedia Computing and Systems, pages 54–63, Austin, Texas, USA, July 1998.
[12] Dick C. A. Bulterman, Lynda Hardman, Jack Jansen, K. Sjoerd Mullender, and Lloyd Rutledge. GRiNS: A GRaphical INterface for creating and playing SMIL documents. In The 7th International World Wide Web Conference, Computer Networks and ISDN Systems, volume 30(1-7), pages 519–529, Brisbane, Australia, April 1998.
[13] K.S. Candan, B. Prabhakaran, and V.S. Subrahmanian. CHIMP: a Framework for Supporting Distributed Multimedia Document Authoring and Presentation. In Proceedings of the Fourth ACM International Conference on Multimedia, pages 329–340. ACM Press, 1996.
[14] K. S. Candan, B. Prabhakaran, and V. S. Subrahmanian. Retrieval Schedules Based on Resource Availability and Flexible Presentation Specifications. Multimedia Systems, 6(4):232–250, 1998.

[15] L. A. Carr, D. W. Barron, H. C. Davis, and W. Hall. Why Use HyTime? Electronic Publishing - Origination, Dissemination, and Design, 7(3):163–178, 1994.

[16] A. Celentano and O. Gaggi. Synchronization Model for Hypermedia Document Navigation. In ACM Symposium on Applied Computing (SAC 2000), pages 585–591, Como, Italy, March 2000.

[17] A. Celentano and O. Gaggi. Authoring and Navigating Hypermedia Documents on the WWW. In IEEE International Conference on Multimedia and Expo (ICME 2001), pages 988–991, Tokyo, Japan, August 2001.

[18] A. Celentano and O. Gaggi. Multimedia Reporting: Building Multimedia Presentations with Query Answers. In Workshop on Multimedia Information Systems (MIS 2001), pages 61–71, Capri, Italy, November 2001.

[19] A. Celentano and O. Gaggi. Querying and Browsing Multimedia Presentations. In M. Tucci, editor, Second International Workshop on Multimedia Databases and Image Communication, number 2184 in LNCS, pages 105–116, Amalfi, Italy, September 2001. Springer-Verlag.

[20] A. Celentano and O. Gaggi. Schema Modelling for Automatic Generation of Multimedia Presentations. In Fourteenth International Conference on Software Engineering and Knowledge Engineering (SEKE 2002), pages 593–600, Ischia, Italy, July 2002.

[21] A. Celentano, O. Gaggi, and M. L. Sapino. Retrieving Consistent Multimedia Presentation Fragments. In Workshop on Multimedia Information Systems (MIS 2002), pages 146–154, Tempe, Arizona, USA, November 2002.

[22] Y. Chiaramella. Browsing and Querying: Two Complementary Approaches for Multimedia Information Retrieval. In Hypertext - Information Retrieval - Multimedia '97, pages 9–26, Dortmund, Germany, September 1997.

[23] I. F. Cruz and W. T. Lucas. A Visual Approach to Multimedia Querying and Presentation. In The Fifth ACM International Conference on Multimedia '97, pages 109–120, Seattle, WA, USA, November 1997.

[24] D. J. Duke, I. Herman, T. Rist, and M. Wilson. Relating the Primitive Hierarchy of the PREMO Standard to the Standard Reference Model for Intelligent Multimedia Presentation Systems. Computer Standards & Interfaces, 18(6-7):525–535, 1997.

[25] Marica Echiffre, Claudio Marchisio, Pietro Marchisio, Paolo Panicciari, and Silvia Del Rossi. MediaTouch: A Native Authoring Tool for MHEG-5 Applications. IEEE Multimedia, 5(1):84–91, 1998.

[26] D. C. Fallside (ed.). XML Schema Part 0: Primer. http://www.w3.org/TR/xmlschema-0/, May 2001. W3C.

[27] O. Gaggi and A. Celentano. Modeling Synchronized Hypermedia Documents. Technical Report 1/2001, Department of Computer Science, Università Ca' Foscari di Venezia, Mestre (VE), Italy, January 2001. Submitted for publication to Multimedia Tools and Applications.

[28] O. Gaggi and A. Celentano. A Visual Authoring Environment for Multimedia Presentations on the World Wide Web. In IEEE International Symposium on Multimedia Software Engineering (MSE 2002), Newport Beach, California, December 2002.
[29] J. Geurts, J. van Ossenbruggen, and L. Hardman. Application-Specific Constraints for Multimedia Presentation Generation. In International Conference on Multimedia Modeling 2001 (MMM01), pages 247–266, CWI, Amsterdam, The Netherlands, November 5-7 2001.

[30] F. Halasz and M. Schwartz. The Dexter Hypertext Reference Model. Communications of the ACM, 37(2):30–39, February 1994.

[31] L. Hardman. Modelling and Authoring Hypermedia Documents. PhD thesis, CWI, University of Amsterdam, 1998.

[32] L. Hardman and D. Bulterman. Using the Amsterdam Hypermedia Model for Abstracting Presentation Behavior. In Electronic Proceedings of the ACM Workshop on Effective Abstractions in Multimedia, San Francisco, California, November 1995.

[33] L. Hardman, D. C. A. Bulterman, and G. van Rossum. The Amsterdam Hypermedia Model: Adding Time, Structure and Context to Hypertext. Communications of the ACM, 37(2):50–62, February 1994.

[34] L. Hardman, J. van Ossenbruggen, L. Rutledge, K. Sjoerd Mullender, and D. C. A. Bulterman. Do You Have the Time? Composition and Linking in Time-based Hypermedia. In ACM Conference on Hypertext and Hypermedia '99, pages 189–196, Darmstadt, Germany, February 1999.

[35] C. Huang, J. Chen, C. Lin, and C. Wang. MING I: A Distributed Interactive Multimedia Document Development Mechanism. Multimedia Systems, 6(5):316–333, 1998.

[36] C. Huang and C. Wang. Synchronization for Interactive Multimedia Presentations. IEEE Multimedia, 5(4):44–62, October/December 1998.

[37] I. Herman, N. Correia, D. A. Duce, D. J. Duke, G. J. Reynolds, and J. Van Loo. A Standard Model for Multimedia Synchronization: PREMO Synchronization Objects. Multimedia Systems, 6(2):88–101, 1998.

[38] S. Hibino and E. A. Rundensteiner. Multimedia Database Systems: Design and Implementation Strategies, chapter A Visual Multimedia Query Language for Temporal Analysis of Video Data, pages 123–159. Kluwer Academic Publishers, 1996.

[39] S. Hibino and E. A. Rundensteiner. User Interface Evaluation of a Direct Manipulation Temporal Visual Query Language. In ACM Multimedia 1997, pages 99–107, Seattle, WA, USA, November 1997.

[40] Adobe Systems Inc. Adobe Premiere 6. http://www.adobe.com/premiere.

[41] Macromedia Inc. Macromedia Authorware 6. http://www.macromedia.com/authorware.

[42] Macromedia Inc. Macromedia Director 8.5. http://www.macromedia.com/director.

[43] ISO/IEC JTC1/SC29/WG11. MPEG-7 Standard Overview, N4980. http://mpeg.telecomitalialab.com/standards/mpeg-7/mpeg-7.htm, 2002.

[44] M. Jourdan, N. Layaïda, C. Roisin, L. Sabry-Ismail, and L. Tardif. Madeus, an Authoring Environment for Interactive Multimedia Documents. In ACM Multimedia 1998, pages 267–272, Bristol, UK, September 1998.

[45] M. Jourdan, C. Roisin, and L. Tardif. A Scalable Toolkit for Designing Multimedia Authoring Environments. Multimedia Tools and Applications, 12(2-3):257–279, November 2000.
[46] P. King, H. Cameron, H. Bowman, and S. Thompson. Synchronization in Multimedia Documents. In Jacques André, editor, Electronic Publishing, Artistic Imaging, and Digital Typography, Lecture Notes in Computer Science, volume 1375, pages 355–369. Springer-Verlag, May 1998.

[47] N. Klarlund, A. Møller, and M. Schwartzbach. DSD: A Schema Language for XML. In ACM Workshop on Formal Methods in Software Practice, Portland, OR, USA, August 2000.

[48] Taekyong Lee, Lei Sheng, Nevzat Hurkan Balkir, A. Al-Hamdani, Gultekin Ozsoyoglu, and Z. Meral Ozsoyoglu. Query Processing Techniques for Multimedia Presentations. Multimedia Tools and Applications, 11(1):63–99, 2000.

[49] Taekyong Lee, Lei Sheng, Tolga Bozkaya, Nevzat Hurkan Balkir, Z. Meral Ozsoyoglu, and Gultekin Ozsoyoglu. Querying Multimedia Presentations Based on Content. IEEE Transactions on Knowledge and Data Engineering, 11(3):361–385, 1999.

[50] Chunlei Liu. Multimedia over IP: RSVP, RTP, RTCP, RTSP. http://hamsa.unl.edu/~byrav/CSCE952/ip_multimedia.pdf, December 1997.

[51] Daniele Dal Mas and Alessandro De Faveri. Progetto di Visualizzatore per Presentazioni Multimediali (Design of a Viewer for Multimedia Presentations). Master's thesis, Dipartimento di Informatica, Università Ca' Foscari, Venezia, February 2003.

[52] R. J. Miller, O. G. Tsatalos, and J. H. Williams. Integrating Hierarchical Navigation and Querying: A User Customizable Solution. In Electronic Proceedings of the ACM Workshop on Effective Abstractions in Multimedia, San Francisco, CA, USA, November 1995.

[53] MPEG ISO/IEC Joint Technical Committee. MPEG-4 Overview, ISO/IEC JTC1/SC29/WG11 N4668, March 2002.

[54] Enda Multimedia. CD-ROM Musica. Enda Srl, Milano, Italy, 1996.

[55] S. R. Newcomb, N. A. Kipp, and V. T. Newcomb. The “HyTime” Hypermedia/Time-based Document Structuring Language. Communications of the ACM, 34(11):67–83, 1991.

[56] Synchronized Multimedia Working Group of W3C. Synchronized Multimedia Integration Language (SMIL) 1.0 Specification. http://www.w3.org/TR/REC-smil/, June 1998.

[57] Synchronized Multimedia Working Group of W3C. Synchronized Multimedia Integration Language (SMIL) 2.0 Specification. http://www.w3.org/TR/smil20, August 2001.

[58] Palazzo Grassi Home Page. The Maya Exhibition. Palazzo Grassi, Venice, Italy. http://www.palazzograssi.it/eng/mostre/mostre_maya.html.

[59] F. B. Paulo, P. C. Masiero, and M. C. Ferreira de Oliveira. Hypercharts: Extended Statecharts to Support Hypermedia Specification. IEEE Transactions on Software Engineering, 25(1):33–49, January/February 1999.

[60] Project MIX. The MIX (Mediation of Information using XML) Home Page. http://www.db.ucsd.edu/Projects/MIX/.

[61] L. Rutledge, B. Bailey, J. van Ossenbruggen, L. Hardman, and J. Geurts. Generating Presentation Constraints from Rhetorical Structure. In 11th ACM Conference on Hypertext and Hypermedia, San Antonio, Texas, USA, May 30–June 3 2000.

[62] L. Rutledge, L. Hardman, and J. van Ossenbruggen. Evaluating SMIL: Three User Case Studies. In ACM Multimedia 1999, Orlando, Florida, USA, November 1999.
[63] J. Schnepf, Y. Lee, D. Du, L. Lai, and L. Kang. Building a Framework for FLexible Interactive Presentations. In Pacific Workshop on Distributed Multimedia Systems (Pacific DMS 96), Hong Kong, June 1996.

[64] James A. Schnepf, Joseph A. Konstan, and David Hung-Chang Du. Doing FLIPS: FLexible Interactive Presentation Synchronization. IEEE Journal on Selected Areas in Communications, 14(1):114–125, January 1996.

[65] H. Schulzrinne, A. Rao, and R. Lanphier. Real Time Streaming Protocol (RTSP). http://sunsite.auc.dk/RFC/rfc/rfc2326.html, April 1998. RFC 2326.

[66] L. F. G. Soares, R. F. Rodrigues, and D. C. Muchaluat Saade. Modeling, Authoring and Formatting Hypermedia Documents in the HyperProp System. Multimedia Systems, 8(2):118–134, 2000.

[67] J. van Ossenbruggen, J. Geurts, F. Cornelissen, L. Hardman, and L. Rutledge. Towards Second and Third Generation Web-based Multimedia. In The Tenth International World Wide Web Conference, pages 479–488, Hong Kong, China, May 1–5 2001.

[68] G. van Rossum, J. Jansen, K. Mullender, and D. Bulterman. CMIFed: A Presentation Environment for Portable Hypermedia Documents. In The First ACM International Conference on Multimedia, pages 183–188, Anaheim, California, August 1993.

[69] M. Vazirgiannis, Y. Theodoridis, and T. Sellis. Spatio-Temporal Composition and Indexing for Large Multimedia Applications. Multimedia Systems, 6(4):284–298, 1998.

[70] Jin Yu. A Simple, Intuitive Hypermedia Synchronization Model and its Realization in the Browser/Java Environment. In Asia Pacific Web Conference, pages 209–218, Hong Kong, September 1998.