Modelling Synchronized Hypermedia Documents

Ombretta Gaggi and Augusto Celentano
Dipartimento di Informatica, Università Ca' Foscari di Venezia
Via Torino 155, 30172 Mestre (VE), Italia
{ogaggi,auce}@dsi.unive.it

Abstract

This paper presents a synchronization model for hypermedia applications. Several media objects, continuous (like video or audio files) or non-continuous (like text pages or images), are delivered separately over a network in a Web-based environment, and presented to the user in a coordinated way. The model is based on a set of synchronization relationships, which define the objects' behavior during the presentation's playback, and on channels in which each object is played. The model is suited for a wide range of hypermedia document types: some examples are self and distance education, professional training, Web advertising, cultural heritage promotion and news-on-demand. A virtual exhibition is analyzed as a test bed to validate the model.

Keywords: Hypermedia authoring, World Wide Web, media delivery and synchronization, interactive presentation.
1 Introduction
In this paper we present a synchronization model for navigating hypermedia documents in a distributed environment like the World Wide Web. The presence of temporal information in hypermedia documents requires particular attention when modelling the composition of different elements. The model is targeted at a class of applications where one or more continuous media files (video or audio streams) are presented to the user and, as the streams play, other documents (text and images) are sent to the browser and displayed in synchrony with them. We use the term "video-centered" hypermedia to denote this kind of document. A typical example is "infotainment", where the authors enrich the information with audio, video clips or animations in order to obtain a more engaging and entertaining result. The user can interact with the presentation by pausing and resuming it, by moving forward or backward, and by following hyperlinks that lead him/her to a different place or time in the same hypermedia document
or to a different document. At each user action the presentation must be kept coherent by re-synchronizing the delivered documents. Even if the model does not efficiently cover all hypermedia document types, its reference context is wide, and includes applications like self and distance education, professional training, Web advertising, cultural heritage promotion and news-on-demand. As an example, in a Web application for news-on-demand, the speaker's voice reads the news while video clips document them with images. To highlight some details, text pages are displayed, containing summaries of the news or newspaper headlines. From time to time links for further examination of the news become available, giving the user the possibility to explore specific issues in a flexible way. Another example is the remote delivery of a lesson, composed of the video or audio recording of the teacher's lecture, together with the set of slides which are shown at the proper times during the lecture playback, and links to supplemental material. In this paper we shall refer to two scenarios to illustrate our model: distance education and cultural heritage promotion. Both scenarios share a common structure of multimedia presentations, based on several hierarchically organized modules. Each module is composed of a continuous media stream which, as it plays, causes texts, images and possibly other continuous media documents to be displayed on the user's screen. The user may adapt the pace of the presentation (e.g., a lesson) to his/her needs by moving within the module or across the modules in a controlled way: by skipping some parts, playing other parts again, temporarily suspending the presentation to browse supplemental material, and so on. The system re-synchronizes (i.e., delivers to the browser) the different media in order to preserve the overall consistency. The paper is organized as follows: Section 2 reviews the related literature. In Section 3 a model for video-centered hypermedia documents is illustrated. Section 4 describes the primitives for synchronized delivery of different media. Section 5 discusses synchronization issues due to user interaction. Section 6 applies the model to a non-trivial Web presentation of a virtual exhibition, used as a test bed. Section 7 presents some final remarks.
2 Related work
Media synchronization in hypermedia documents has been largely investigated. The Amsterdam Hypermedia Model [2, 3, 4] was the first serious attempt to combine media items with temporal relations in a hypertext document. AHM, which extends the Dexter Model [1], divides a document's components into atomic and composite. Media items are played in channels, while synchronization inside composite components is described by synchronization arcs and offsets that establish time relationships between two components or two anchors. SMIL [5, 6, 7], the Synchronized Multimedia Integration Language, is a W3C recommendation defined as an XML application. It is a very simple markup language that defines tags for presenting multimedia objects in a coordinated
way. Synchronization is achieved through two tags: seq, to render two or more objects one after the other, and par, to reproduce them at the same time. Using attributes it is possible to play segments inside the time span of an object. The screen is divided into regions in which the multimedia objects that build up a presentation can be placed. Differently from the model presented in this paper, SMIL does not define a reference model for the data structure, but only tags for describing media objects' behavior. In [8] three case studies of standard SMIL are presented, illustrating how it can be used for different forms of multimedia presentations. In [9] Hardman et al. discuss requirements for incorporating multimedia synchronization and hypertext links into time-based hypermedia documents. They distinguish between three types of navigation through links: a simple move along the temporal axis (rewind or forward) that does not alter the linearity of a multimedia presentation, a hypertext navigation among different presentations (or sections of presentations), and navigation within and among non-linear multimedia presentations. The paper presents their modelling solutions for incorporating temporal and linking aspects using the Amsterdam Hypermedia Model and SMIL. In [10] an authoring and presentation tool for interactive multimedia documents named Madeus is presented. Multimedia objects are synchronized through the use of timing constraints. Layout is specified by spatial relations, such as align or center. The document structure, both temporal and spatial, is represented by a graph, where the horizontal axis represents time values. Graphs are used not only for authoring but also for scheduling and time-based navigation. The work [11] presents a taxonomy of possible synchronization relationships between item pairs in multimedia documents. The authors classify media objects into synchronization events, which are available for synchronization, and synchronization items, which must be synchronized. Synchronization events and items are divided into subclasses, yielding seventy-two relationship types. Some of these relationships are of little practical use in actual documents, but are defined in order to use the taxonomy as a reference point for authoring systems and formal models. Spatio-temporal relationships of a multimedia presentation are discussed in [12]. The paper defines a model based on temporal and spatial relationships between the different multimedia objects that compose a presentation. Temporal and spatial starting points are defined, and the objects are arranged using topological and time relationships between them. The result is a formalism suitable for designing any multimedia composition. PREMO [13], the Presentation Environment for Multimedia Objects, primarily focuses on the presentation aspects of multimedia. PREMO aims at describing a standard programming environment in a very general way, through which media presentations can be constructed, processed and rendered as part of an overall application. The model can be divided into four components, each described in an object-oriented way. The model supplies an abstraction for the virtual devices that may produce the different multimedia data of a presentation. An object type and its subtypes set up time relationships. Two objects can
be rendered sequentially, in parallel, or alternatively, according to the presentation state. HyperProp [14] is an authoring and formatting system for hypermedia documents. The authors compare their proposal to other approaches, stressing the importance of the documents' logical structure. HyperProp uses composition to represent any kind of relation, both temporal (synchronization relationships) and spatial, and offers three different graphical views to model the logical structure of the document: the structural view, the temporal view and the spatial view. The author can use the first view to browse and edit the logical structure of the document, the temporal view to represent the objects along a timeline, and the spatial view to preview the objects' layout in the user interface. In [15, 16] Huang et al. propose a model to design and control multimedia documents and their timing constraints, which cannot be described using traditional document structures (e.g., SGML). The model contains logical, layout and temporal structures to formally describe the re-synchronization adjustments due to user interaction. The authors' approach uses dynamic extended finite-state machines (DEFSMs) to describe and maintain temporal synchrony among several media streams: an "actor" DEFSM manages each single medium and a "synchronizer" DEFSM orchestrates the whole presentation behavior. Based on the proposed model, a distributed multimedia document development mechanism, called MING-I [15], has been developed. MING-I proposes an interactive synchronization based on the use of tokens for the control scheme and contains four main components: the authoring component, the presentation scheduler generator, the interaction-processing agent and the execution environment base. The work [17] defines a model for flexible presentations, called FLIPS (FLexible Interactive Presentation Synchronization). Media objects have no predefined time length, which can instead change according to user interactions. The synchronization scheme is based on two temporal relationships, called barriers and enablers. If event A is a barrier to event B, then B is delayed until event A occurs. If event A is an enabler for event B, then when event A occurs, event B also occurs as soon as it is barrier-free. FLIPS is oriented more toward multimedia presentations than toward network based applications or hypermedia navigation. In [18] the authors present a framework designed and implemented to support the synchronization of different media elements in a presentation according to the FLIPS temporal relationships. The framework supports authors during the authoring of a presentation because it handles user interactions, such as skipping forward and backward to a different point within a presentation, by re-synchronizing all the objects involved. In [19] the authors describe a system suitable for the automatic generation, integration and visualization of media streams. Although this system can be applied to different domains, such as tourism or meetings, the solutions proposed are strongly oriented towards the educational domain. The authors view the teaching and learning experience as a form of multimedia authoring and provide a method to integrate different media streams captured during a university lecture. The paper addresses two specific problems: stream integration and stream visualization. For stream integration the authors propose to use
different levels of granularity, with one particular solution for each level. For the visualization of multiple streams they provide a timeline "road map" of the lecture that points out significant events. In [20] Media Relation Graph (MRG), a simple and intuitive hypermedia synchronization model, is presented together with an alternative implementation of the Hypermedia Presentation and Authoring System (HPAS), which is used as a testbed for MRG. A hypermedia presentation can be modelled by a directed acyclic graph, where vertices represent media objects and directed edges represent the flow of time. For example, an edge from vertex v1 to vertex v2 means that the object v2 is played after the end of the object v1. The temporal model combines both interval-based and point-based synchronization mechanisms. In [21] a model for the synchronization of a hypermedia application is described. It is based on hypercharts, a notation that extends the statechart formalism in order to make it able to describe the temporal constraints of a multimedia presentation. Hypercharts provide mechanisms for specifying hypermedia requirements such as object durations, user interaction, navigation mechanisms and the separation between structure and content. The work [22] discusses ZYX, a multimedia document model with particular attention to the reuse and adaptation of multimedia content to a given user context. The need for this model arises from the lack of sufficient support for reuse and adaptation in other existing standards such as Dynamic HTML, MHEG, SMIL and HyTime. ZYX describes a multimedia document by means of a tree: a node is a presentation element (a media object or a temporal, spatial or layout relationship), and can be bound to other presentation elements in a hierarchical way. Projector variables specify an element's layout. With this design, an element can be easily reused, and can change its layout simply by changing the projector variable.
3 A model of video-centered hypermedia documents
Hypermedia documents are modelled along two directions, one describing the hierarchical structure of the document components, the other describing the presentation dynamics through synchronization primitives. A hypermedia document is generally composed of a set of modules, which the user can access in a predefined order or through some table of contents. For example, a guided tour in a virtual museum steps through a predefined selection of the museum rooms; a multimedia course is divided into lessons, which can be indexed through a syllabus or presented according to a time schedule; a news-on-demand application delivers independent articles selected from an index; an estate selling application lets the user choose the relevant property from a catalogue, and so on. Taking distance education as an example, lessons can be grouped into higher level aggregates corresponding to matters explained in different parts of a textbook; further, the user could be constrained to follow them in a specific order.
[Figure 1: Structure of a video-centered hypermedia document]
We assume that, from the presentation dynamics point of view, each module is completely autonomous and self-contained. Therefore we do not elaborate further on this level of access, since it is not relevant for delivery and synchronization among the different components, and assume a module, whatever its definition, as the topmost level of aggregation of media components which is relevant for our discussion. Borrowing the terminology from document structuring, we call such a module a chapter. A chapter is divided into sections: a section is a multimedia document which is normally played without requiring user intervention, while moving from one section to the next can be automatic or not, depending on the document purpose and application context. Sections could be addressable from an index as chapters are, but we shall not enter into such details. A section is composed of continuous media streams and static pages of text and images. Section playback is ruled by the continuous media, with static pages delivered and displayed at specific narration points. A section is thus divided into smaller components that correspond to the time intervals between two narration points, in which one or more static documents must be delivered to the user. If we analyze the structure from a dynamic point of view, we can better refer to a terminology borrowed from the movie context. The continuous media presentation which constitutes the main chapter content is a story, which is made of a sequence of clips (the reason for this name will become clear later), each of which corresponds to a section of the chapter. Clips are divided into scenes, each of which is associated with a different static document, e.g., a text page or an image, which is displayed during the scene playback. The term narration introduced above denotes the playback of the continuous components of a hypermedia document. Figure 1 pictorially shows this structure. More complex structures with nested multimedia components will be introduced later.
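To fix ideas, the hierarchy can be sketched with a few data types. This is our own illustration (the paper prescribes no implementation, and all names are ours), anticipating Section 3.1, where scenes are defined by time intervals in the clip:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Scene:
    """Smallest unit: a time interval of a clip plus the page shown during it."""
    page: str      # URL of the static document displayed during the scene
    start: float   # narration point opening the scene (seconds into the clip)
    end: float     # next narration point, closing the scene

@dataclass
class Clip:
    """Continuous media stream ruling the playback of one section."""
    media: str                                # URL of the video or audio file
    scenes: List[Scene] = field(default_factory=list)

@dataclass
class Chapter:
    """Topmost module; its story is the ordered sequence of its sections' clips."""
    clips: List[Clip] = field(default_factory=list)
```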
3.1 Logical vs. physical structure
The structure illustrated in Figure 1 is a logical structure, which is in principle unrelated to the physical organization of the media objects. A video file could correspond to a whole clip, to a scene within it, or to a sequence of clips that can be played continuously or as independent segments. Indeed, the correspondence between the logical and the physical organization should be hidden in an implementation layer, without influencing the model organization. In practice this cannot be done without paying a price in terms of performance and manageability, even in a local environment, e.g., in a CD-ROM based multimedia presentation. Efficient hypermedia document delivery over a network is affected by the size of the information to be transferred. For this reason it is important to organize multimedia data into well defined and reasonably sized structures that make them easy to manage and allow the reuse of the constituent components where necessary. In a distributed environment like the World Wide Web the relationships between the logical and the physical organization of the media objects are thus of primary concern, and must be taken into account at the application design level. We distinguish between two kinds of media objects: continuous media files and document files like texts and images. Continuous media files are video and audio recordings, usually referred to as videoclips or audioclips. Texts and images are contained in static documents, usually referred to as pages in the WWW parlance (we do not consider text documents with embedded dynamic components like animations or video sequences; to be precise, we do not consider them "dynamic" with reference to the user's control of the whole hypermedia delivery or playback). A correspondence between the logical and the physical structure exists, based on two cases:

• a (logical) clip, i.e., the narrative content of a section, is a video or audio file which is assumed to be played continuously from beginning to end, unless user interaction modifies this behavior;

• a static document is a file which is supported and displayed as a whole by the browser, either directly or indirectly through a plug-in or helper application. In the following of the paper we implicitly refer to HTML documents using the generic term page, without entering into further detail about other formats (e.g., XML, PDF, etc.).

From the above assumptions it follows that one logical clip is implemented by one media file. Scenes are defined by time intervals in the clip: they must be contiguous and are normally played one after the other. Pages, too, are bound to specific contents associated with a scene. We can thus say that a scene with its associated page is the smallest amount of information that is delivered as a whole, and a clip and its associated set of pages (i.e., a section) are the minimum self-consistent and autonomous information with respect to the application domain.
According to the synchronization primitives that we shall discuss later, playing a scene causes the corresponding page to be displayed and conversely, accessing a page via a hyperlink causes the corresponding scene to be played. We can anticipate that, coherently with the class of applications we want to model, a master-slave relationship is established between the dynamic and the static components of a hypermedia document: the dynamic components control the hypermedia document playback, and the static components are synchronized accordingly. We shall come back to this matter later.
3.2 Channels
Media objects require proper channels to be displayed or played. Channels are basically windows or frames on the user screen (audio channels do not use windows, but must be handled in the same way). Several compatible media objects may share the same channel at different times; the model must describe how channels are reserved, used and released during hypermedia document delivery. A channel is a (virtual) display or playback device, like a window, a frame, an audio device or an application program able to play a medium, that can be used by one medium at a time. The model does not enter into details about channel properties, the only relevant properties being an identifier that uniquely identifies the channel on the user workstation, and a type that defines the media types that can use it. Channels are defined and associated with media objects at design time, and defaults are established consistently with the WWW environment: e.g., the main browser window is the default channel for all the media. New channels can be created dynamically, and channels can be instantiated in multiple instances (e.g., separate browser windows) if the presentation requires the simultaneous activation of media of the same type in a run-time dependent way. For example, a lesson about language translation could require the original text and the translation to be accessible to the user at the same time, even if each of them is a conventional text document that would be displayed in the browser main window if delivered alone. A channel is busy if an active medium is using it, otherwise it is free. Free channels may however hold some content, if the deactivation of a medium does not destroy its visibility: for example, at the end of a movie playback the movie window can display the first or the last frame.
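As an illustration only (the model deliberately abstracts from channel properties, so this sketch and its names are ours), a channel reduces to a typed resource with a unique identifier and a busy/free state:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Channel:
    """Virtual display or playback device, usable by one medium at a time."""
    ident: str                       # unique identifier on the user workstation
    media_type: str                  # e.g. "video", "audio", "page"
    occupant: Optional[str] = None   # name of the active medium, if any

    @property
    def busy(self) -> bool:
        return self.occupant is not None

    def acquire(self, medium: str) -> None:
        if self.busy:
            raise RuntimeError(f"channel {self.ident} is busy")
        self.occupant = medium

    def release(self) -> None:
        # a freed channel may still show its last content,
        # e.g. the first or last frame of a finished movie
        self.occupant = None
```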
4 Synchronization primitives
Once we have described the logical structure of a hypermedia document, we must define the relationships that determine how the document is delivered and presented to a user. As we have already seen, a presentation is made of different components which evolve in reaction to some events, such as user interaction, or due to intrinsic properties of the components, like their time length. We consider only the
events that bring some change into the set of active objects during the presentation playback. For example, window resizing is not a significant event, while stopping a video clip is. We can think of a hypermedia document as a set of objects of different kinds (video or audio streams, text pages or images), each of which plays a specific role during the presentation's playback. Each object needs to access an appropriate channel to be displayed, and some objects can share the same channel at different times. Thus we need some relationships to model the objects' behavior and the channels' utilization; we call them synchronization primitives. A hypermedia document is composed of a set of objects and relationships among them which specify their reaction to specific events. We define five synchronization primitives:

• A plays with B, denoted by A ⇔ B

• A activates B, denoted by A ⇒ B

• A is terminated with B, denoted by A ⇓ B

• A is replaced by B, denoted by A ⇌ B
• A has priority over B with behavior α, denoted by A >α B.

Some of these relations need to be explicitly defined during the presentation's authoring; others can be automatically inferred from the hierarchical structure of the presentation and from other relationships.
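As a minimal illustration of how an author might record these primitives (our own encoding; the paper defines no concrete syntax), each relation can be stored as a labelled pair of object names:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Relation:
    kind: str    # "plays_with" (<=>), "activates" (=>), "terminated_with",
                 # "replaced_by" or "priority"
    a: str       # object A (the master, for asymmetric relations)
    b: str       # object B
    alpha: Optional[str] = None   # "p" or "s", used by "priority" only

# A video v synchronized with a page p and a looping sound track st:
spec = [
    Relation("plays_with", "v", "p"),    # v <=> p
    Relation("plays_with", "v", "st"),   # v <=> st
    Relation("activates", "st", "st"),   # st => st (loop, see Section 4.1)
]
```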
4.1 Basic synchronization
Before introducing the primitive definitions, we need to specify some terms we shall use throughout this paper.

Definition 1 A media object is active when it occupies a channel, otherwise it is inactive.

Definition 2 A continuous media object naturally ends when it reaches its ending point. A generic media object is forced to terminate when another entity (the user or another multimedia object) stops its playback or closes its channel before its natural end.

We distinguish between the length and the duration of a generic media object, continuous or not. Once activated, every object occupies a channel for a predefined time span, which is necessary to access the object content, unless the user interacts with it. We call this time span the object length. The length is a static property, defined when the object is created, independent of the presentation of which the object is a component. A static object, e.g., a text page, once activated remains on the screen until some event removes it. The user can access the content during a time which is not defined by the text itself. Its length is therefore potentially infinite.
At run-time, a user can access a media object for a time span that can be different from its length, due to presentation design or to user interaction. For example, a video sequence can be played completely or partially according to a user selection defined at run-time. We call this time span the object duration; it is a dynamic property that can change during object playback. We can now introduce the definitions of the synchronization primitives.

Definition 3 Let A and B be two generic media objects. We define "A plays with B", written A ⇔ B, as the relation such that the activation of one of the two objects (e.g., A) causes the activation of the other (e.g., B), and the natural termination of object A causes the forced termination of object B. Each object uses a different channel.

The relation "plays with" is asymmetric: object A plays the role of master with respect to the slave object B. The master object is the one the user can interact with. For example, a video clip played with an accompanying text is the master, because the user can pause, stop, or move inside it, causing the accompanying text to be modified accordingly. This relationship describes the synchronization between two media active at the same time. Taking our previous example, if a video clip v goes with a static page p, the relation v ⇔ p means that the video and the page start and end together (i.e., they have the same duration). The relation ⇔ also describes the synchronization between two continuous objects: e.g., a video v with a sound track st is described by the relation v ⇔ st. This relation means that v and st start at the same time, and when v naturally ends it causes the sound track to end too, if it has not finished yet. Other more complex relations, e.g., adapting the sound track duration to the video length by looping it continuously, can be described by synchronization primitives that will be introduced later. It is important to note that the "plays with" relation cannot provide fine-grain synchronization (e.g., lip-synchronization) but only coarse-grain synchronization, because it defines the media behavior only at the starting and ending points and not in between.

Definition 4 Let A and B be two generic media objects. We define "A activates B", denoted by A ⇒ B, as the relation such that the natural termination of object A causes the beginning of the playback (or display, for non-continuous media) of object B. The objects may or may not share the same channel.

The relationship "activates" describes two objects that are played in sequence. We limit the scope of this relationship to the natural termination of an object, leaving out termination caused by user intervention or other external events. The reason for doing so will be more evident after discussing the other synchronization relationships, but we can anticipate that the main reason is that we interpret a forced termination as an indication to stop the presentation or a part of it. Therefore, the scope of this action must be defined explicitly and in a way different from the normal behavior of the presentation.
With this primitive we can model the situation suggested above, in which a sound track st is looped to adapt its duration to the length of a video, by specifying st ⇒ st. The natural end of the sound track activates it again using the same channel, until the video ends. Then the sound track is terminated too, according to the relation v ⇔ st. While more meaningful if applied to continuous objects, the ⇒ relationship can also define the sequential play of continuous and non-continuous objects, e.g., a credit page after the end of a video sequence. It should be obvious that object A in the relation A ⇒ B must be a continuous media object, because non-continuous media objects have an infinite length.
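The semantics of Definitions 3 and 4 can be sketched as a small event handler: a natural end fires ⇒ activations and forces ⇔ slaves to terminate, while a forced end fires neither. This is a sketch under our own naming, not the authors' implementation; a complete engine would also propagate ⇓ on forced terminations (Section 5):

```python
def natural_end(obj, spec, active):
    """Handle the natural termination of `obj`.

    `spec` is a list of (kind, a, b) triples and `active` the set of
    currently active objects; returns the objects activated next.
    """
    active.discard(obj)                  # obj leaves its channel
    started = []
    for kind, a, b in spec:
        if a != obj:
            continue
        if kind == "plays_with":
            # natural end of the master forces the slave to terminate
            active.discard(b)
        elif kind == "activates":
            # sequential play; b may equal a (a self-loop, as in st => st)
            active.add(b)
            started.append(b)
    return started

# Looping sound track: st restarts at each natural end, until the
# master video v naturally ends and forces st to terminate (v <=> st).
spec = [("plays_with", "v", "st"), ("activates", "st", "st")]
active = {"v", "st"}
print(natural_end("st", spec, active))   # ['st']: the track loops
print(natural_end("v", spec, active))    # []: v's end silences st
```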
5 User interaction
With the relationships defined so far we are able to model multimedia documents that, once activated, synchronize their components according to the author's design until they naturally end. We have not yet considered user interaction. Users have several ways of interacting with a hypermedia document: they can stop the presentation of part of the document before it ends, e.g., because they are no longer interested and want to skip further. They can branch forward or backward along the time span of a presentation, and the document components must resume playing in a coherent and synchronized way after the branch. Users can leave the current presentation to follow a hyperlink, either temporarily, later resuming the original document, or definitely, abandoning the original document and continuing with another one. We need to define other synchronization primitives to model the presentation behavior under user interaction.

Definition 5 Let A and B be two generic media objects. We define "A is terminated with B", written A ⇓ B, as the relation such that the forced termination of object A causes the forced termination of object B. The channels occupied by the two media objects become free.

We need the relationship ⇓ to model situations in which the user stops a part of a presentation, and the other media must be synchronized accordingly. Let us suppose that a hypermedia document is composed of two objects, A and B, bound by the relation A ⇔ B. As an example, let us consider a multimedia presentation for travel agencies: a video clip (A) illustrates a town tour and a text page (B) describes the relevant landmarks. If the user stops the video playback, the object A is terminated and releases its channel. Object B remains active, because the relationship A ⇔ B is meaningful only when A comes naturally to its ending point. Therefore the channel used by object B remains busy, a clearly inconsistent situation. Introducing the relationship A ⇓ B removes the inconsistency: the forced termination of object A causes the termination of object B too; the channel used by B is released and can be used by another document. Similarly to the relationship A ⇔ B, the relation A ⇓ B is asymmetric.
[Figure 2: The relation ⇓ among the objects intro, snd, txt, ui and vclip]
It is worth discussing why we have introduced in the model the relationship ⇓, instead of extending the relationship ⇔ to deal with the termination of an object for any reason. Had we done so, the model would not be complete, i.e., some synchronization requirements could not be described. An example will illustrate this statement. Let us consider a presentation which begins with a short introductory animation. As the animation ends, a sound track starts playing, and the presentation displays a page which asks the user to choose among different video clips to continue (the modelling of the whole presentation requires the notion of link, which will be introduced later). The sound track does not stop when the user selects a video clip, but continues in the background. Figure 2 pictorially shows this situation: the objects involved are the first animation intro, the sound track snd, the text page txt and a video clip vclip. We introduce an additional element ui, standing for user interaction, which controls the synchronization induced by the user who selects the next video clip to play. In this example it is obviously an anchor in the text page, but in other cases it could be a separate medium, e.g., a button or a clickable image. We are interested in modelling its dynamic behavior, which is defined by the user who interacts with it. Namely, we consider it a continuous media object whose length is variable and defined by the user. It naturally ends when the user activates it, e.g., with a mouse click. The introduction starts both the page and the sound track by virtue of the relationships intro ⇒ snd and intro ⇒ txt. Object ui is activated together with the text page (txt ⇔ ui; in this example we ignore what happens to the text page when the video clip starts), and ends when the user selects the video clip. If the user later wants to leave the presentation, he or she stops the video clip vclip, which is the foreground media object. It must stop the sound track, and this action cannot be described without a specific relationship between the two objects, which are otherwise unrelated. We have to set the relation vclip ⇓ snd (in the figures we use a different arrow for the relation "is terminated with" (⇓), to improve visibility).
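In the hypothetical tuple encoding of Section 4, the example of Figure 2 becomes a short relation list; without the last entry, stopping vclip would leave snd playing with no rule able to silence it (the ui ⇒ vclip entry is our reading of the click that starts the selected clip):

```python
# intro: opening animation; snd: background sound track; txt: choice page;
# ui: user-interaction element in txt; vclip: the selected video clip.
figure2 = [
    ("activates", "intro", "snd"),         # intro => snd
    ("activates", "intro", "txt"),         # intro => txt
    ("plays_with", "txt", "ui"),           # txt <=> ui
    ("activates", "ui", "vclip"),          # the user's click naturally ends ui
    ("terminated_with", "vclip", "snd"),   # vclip ⇓ snd frees the audio channel
]
```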
The relationship "is terminated with" also allows us to define the behavior of a presentation when a user moves forward and backward along the presentation time span. We consider a "branch" a movement inside the same multimedia presentation. If a user moves to another hypermedia document, we do not consider it a "branch" but the activation of a hypermedia link, which will be handled in a different way. A branch can be modelled as a movement on the time axis of the hypermedia document. The presentation is stopped at the point at which the user executes the branch command (whatever its implementation: a cursor in a timeline, a VCR-style console, etc.), and starts again at the point the user has selected as the branch target. Therefore the synchronization primitives introduced so far can handle this case. The target of a branch can be the beginning of an object (a continuous object), but it can also be any point contained in its length: the relationship ⇔ is associated with the activation of a master object as a whole, regardless of where it starts, so it is able to activate the associated slave object even if the master starts from a point in the middle of its duration. When a user follows a hyperlink to another presentation, modelling requires additional care: we must define the hyperlink source and destination, and the behavior of the objects involved in the link. We consider the whole presentation as the source of a hyperlink, and not the object in which the link is contained. More properly, all the objects active at the time of the user interaction are considered as link sources. In the same way, the destination of the link is made of all the objects that are active after the user action (the reader may note that this behavior is consistent with the one defined in the Amsterdam Hypermedia Model [2]). We can easily distinguish three cases:

• following the hyperlink does not imply any change in the continuous objects of the presentation,

• following the hyperlink takes a continuous component of the hypermedia document to a paused state, and

• following the hyperlink stops a continuous object of the presentation.

Each case needs to be modelled in a different way. Let us consider the first case, taking as an example a self-learning application in which a video and some slides are displayed together. One of the slides contains a link to another text document with supplemental information. If the document is short, the user doesn't need to stop the video in order to read the text without losing information. This example is illustrated in Figure 3. Two video clips c1 and c2 are each associated with a text page (respectively p1 and p2). Page p1 contains a hyperlink to page p3. The author of the application can model the display of the linked page p3 in two ways: opening another channel, or using the same channel as p1, which must be released purposely. In the first case the link causes a new channel to be allocated, without any consequence on the other media synchronization. In the second case, we need to introduce another synchronization primitive, as illustrated in Figure 3.
[Figure 3: Following hyperlinks: first case]
Definition 6 Let A and B be two media objects of the same type (i.e., two continuous media or two non-continuous media) that can use the same channel. We define "A is replaced by B", written A ⇌ B, as the relation such that the activation of object B causes the forced termination of object A. The channel occupied by A is released to be used by object B.

In our example, the relationship p1 ⇌ p3 allows page p3 to be displayed in the same window as page p1, which is therefore terminated. In the same way, p3 ⇌ p2 holds for the same reason, when clip c1 later ends and c2 starts playing. If page p1 contains a link to a large document or to another video clip, the user may need to pause the initial presentation in order to pay attention to the new document. Figure 4 pictorially shows this case: page p1 contains a link to a new video clip (c3) that is associated with another text page (another slide) p3. The user can also decide to stop the presentation and to continue reading the new one, or the author could have designed a set of multimedia documents with this behavior in mind. Going back to the abandoned presentation would in this case be possible only by restarting it from some defined starting point. It should be clear that the user or author behavior depends on the meaning of the linked documents. In principle the synchronization model should be able to describe both cases.

Definition 7 Let A and B be two generic media objects. We define "A has priority over B with behavior α", written A >α B, as the relation such that the activation of object A (by the user, or according to the presentation's design) forces object B to release the channel it occupies, so that object A can use it if needed. The label α denotes object B's behavior once it has released the channel. It can assume only two values, p and s. If α = p then object B goes into an inactive state (i.e., it pauses), waiting to be resumed. If α = s, object B is forced to terminate (it stops), releasing the resources it uses.
[Figure 4: Following hyperlinks: use of the >p relationship]

[Figure 5: Following hyperlinks: use of the >s relationship]
In the case illustrated by Figure 4, we draw the relation c3 >p c1 so that, when following the hyperlink, the user activates c3, which puts c1 into an inactive state. The channel used by c1 is released to be used by c3. From the relation c3 ⇔ p3 we can assume that page p3 must be displayed in p1's channel. Therefore an additional relationship p1 ⇌ p3 must be added. When c3 terminates, the user can resume c1 from the point where it was suspended. Since the relation c1 ⇔ p1 is active for the whole duration of c1, page p1 is also activated (the channel that was used by p3 is free due to the two relationships ⇔ and ⇓ between c3 and p3), so the presentation goes back to the state it was in before the hyperlink activation. If the author decides to stop the first document before leaving it, the relationship c3 >s c1 is used instead of c3 >p c1, as illustrated in Figure 5. The relation p1 ⇌ p3 introduced in Figure 4 is not necessary because, since c1 is forced to terminate, by the relation c1 ⇓ p1 page p1 is also forced to terminate, releasing the channel for p3. In Figure 5 it is assumed that when c3 terminates, clip c1 starts again, as described by the relationship c3 ⇒ c1. A different behavior could leave to a user action the task of restarting the stopped presentation.
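A sketch of the two behaviors of the priority relation, again under our own naming: α = p moves the preempted object to a paused set, α = s simply drops it (a full model would also fire its ⇓ relations):

```python
def follow_link(target, spec, active, paused):
    """Activate `target`, applying its "priority" relations.

    `spec` holds ("priority", a, b, alpha) tuples; `active` and `paused`
    are sets of object names, updated in place.
    """
    for kind, a, b, alpha in spec:
        if kind != "priority" or a != target or b not in active:
            continue
        active.discard(b)        # B releases its channel in both cases
        if alpha == "p":
            paused.add(b)        # alpha = p: B pauses, waiting to be resumed
                                 # alpha = s: B stops; nothing to remember
    active.add(target)

# Hyperlink of Figure 4: c3 >p c1, so following the link pauses c1
# (page p1 would then be replaced by p3 through p1 ⇌ p3).
spec = [("priority", "c3", "c1", "p")]
active, paused = {"c1", "p1"}, set()
follow_link("c3", spec, active, paused)
print(sorted(active), sorted(paused))    # ['c3', 'p1'] ['c1']
```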
6 Model testing: the Maya Sun Temple
In this section we introduce a complex example in order to show that the model is able to describe a wide range of different hypermedia documents. To this purpose, we analyze a multimedia presentation designed for a virtual exhibition, namely the Maya Sun Temple section of the Maya exhibition at Palazzo Grassi, Venice, Italy, which is available on the World Wide Web [26]. The presentation is a narration of the Maya cosmogony illustrated by a tour in a 3D virtual world. The myth, engraved on the temples of Palenque (the Sun Temple), tells how the First Father, called Hun Nal Ye (A Grain of Maize), was born in 3114 BC. At that time the Sun did not exist and darkness reigned over everything. Hun Nal Ye built himself a house in a place called the Higher Heaven. As the narration goes on, a 3D reconstruction of the Sun Temple is presented. By clicking on a dice engraved with Maya numbers the user can step through the building, exploring pictures and handicrafts inside it. Text pages explain the habits and customs of the Maya civilization. The presentation user interface is divided into five regions (Figure 6). Two regions are dedicated to the exhibition and presentation titles. Since the information they contain does not change during the presentation playback, we can ignore them. A region in the lower left corner of the screen contains two buttons: one to request help, and the other to exit the presentation. The first button is a link to a text page which is displayed in the text pages area. The second button stops the presentation and takes the user back to the Maya exhibition's home page.
[Figure 6: Maya Sun Temple presentation interface: exhibition title, presentation title, text page, buttons, and animation regions]
In order to keep the example simple we ignore these two elements, which do not change during the whole presentation. We shall concentrate our analysis on the animation, the text pages that are displayed in the left side of the screen, and the sound tracks that are played. We thus consider three channels: the first, called an, for the animation, the second, called text, for the text pages, and the third, called sound, for the sound tracks. The Maya Sun Temple presentation is divided into twenty animations, each of which is a clip in our model's structure. An animation can be associated with one or more text documents, but some clips are played alone, i.e., they are not associated with any text document. Different sound tracks are played during the presentation, to make the progression of the narration more evident. According to our model's structure, the text documents are pages, while the sound tracks are clips. The presentation is organized in five modules, each containing a story made of some clips and pages. We also consider an initial module, M0, which contains only an initial text page, p0, which begins to tell the cosmogony's story, and the image of a dice playing the role of a button, ui0, acting as a user interaction device to step through the narration. The dice is placed in the animation channel an. Channel sound is initially free, while channels text and an are busy. Tables 1–3 list the elements of modules M0–M3, which are illustrated in this section. As soon as the user starts the presentation (by clicking on the dice), he or she starts the first module M1: we impose M0 ⇌ M1 so that the first module can use channel text. As a consequence of M0's termination, page p0 is forced to terminate, so we impose M0 ⇓ p0. The first animation begins by building the foundations of Hun Nal Ye's house, while the user hears the wind blowing.
Table 1: Maya Sun Temple presentation elements, modules M0 and M1

M0    Initial module.
  p0     The myth engraved on the temples of Palenque recounts how the First Father. . .
  ui0    (Maya number 0)

M1    Hun Nal Ye's house in the Higher Heaven.
  s1     Wind sound track.
  a1     House foundation.
  p1     At the site of the house, the mythical ancestor also set up three stones that symbolized the three vertical levels. . .
  ui1    (Maya number 1)
If we call the animation's clip a1 and the sound track s1, we model this behavior with the relations M1 ⇔ a1 and M1 ⇔ s1, where a1 occupies the channel denoted by an and s1 the channel sound. The wind sound is a short audio file that is played continuously, so we impose s1 ⇒ s1. At the end of the first animation the presentation pauses, waiting for user interaction: a text page p1 is displayed in channel text. We model this behavior with the relationship a1 ⇒ p1. The wind sound track s1 goes on, while channel an becomes free. During the entire presentation user interaction is limited: the user cannot move inside an animation's timeline, but can only step through the animations. The user can follow a link to another text page but, as we will see later, this action has no effect on the other elements of the presentation. At the end of each animation, the user gets control of the presentation playback: a dice, with Maya numbers on its faces, is displayed; the user can click on this dice to start the next animation. When a1 terminates, ui1 begins, by virtue of the relationship a1 ⇒ ui1. When the user clicks on the dice, ui1 ends: the next animation begins playing as a consequence of the relations ui1 ⇒ M2 and M2 ⇔ a2. If the user decides to exit the module, using the button "Exit", all active objects should stop: they are a1 and s1 during the animation's playback, and s1 and p1 if a1 has already finished. The relationships M1 ⇓ a1, M1 ⇓ s1 and s1 ⇓ p1 describe this behavior. Figure 7 shows the modelling of the first two modules; fourteen relationships are imposed:

M0 ⇌ M1     ui0 ⇒ M1    a1 ⇒ p1     M1 ⇓ a1
M0 ⇔ p0     M1 ⇔ a1     a1 ⇒ ui1    M1 ⇓ s1
M0 ⇓ p0     M1 ⇔ s1     ui1 ⇒ M2    s1 ⇓ p1
M0 ⇔ ui0    s1 ⇒ s1
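For concreteness, the same fourteen relationships in the hypothetical tuple encoding sketched in Section 4 (ours, not the authors' notation):

```python
module_0_1 = [
    ("replaced_by", "M0", "M1"),       ("activates", "ui0", "M1"),
    ("plays_with", "M0", "p0"),        ("plays_with", "M1", "a1"),
    ("terminated_with", "M0", "p0"),   ("plays_with", "M1", "s1"),
    ("plays_with", "M0", "ui0"),       ("activates", "s1", "s1"),
    ("activates", "a1", "p1"),         ("terminated_with", "M1", "a1"),
    ("activates", "a1", "ui1"),        ("terminated_with", "M1", "s1"),
    ("activates", "ui1", "M2"),        ("terminated_with", "s1", "p1"),
]
assert len(module_0_1) == 14
```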
Table 2: Maya Sun Temple presentation elements, module M2

M2    Hun Nal Ye brings life into the world (representation of monumental art).
  s2     Birds singing sound track.
  n1     Noise due to the god's resurrection.
  a2     Tree and stones raising.
  a3     Hun Nal Ye's resurrection.
  a4     Hun Nal Ye appears as a beautiful young man.
  a5     Life-size portraits in stone.
  a6     God sculpture.
  a7     Hun Nal Ye appears as a beautiful young man (2).
  a8     House view.
  ui2 … ui8    (Maya numbers 2 … 8)
  p2     After performing these prodigious acts, Hun Nal Ye was then the leading figure. . .
  p3     Maya rulers commissioned life-size portraits in stone (stelae). . .
  p4     Maya culture was based on a religious view of the world and of life. . .
  p5     With the power to determine future events, these gods were considered by the Maya. . .
  p6     Though the Maya deities were in many respects superior to man. . .
  p7     Depending upon the occasion, these deities would appear in different forms. . .
  pb     Blank page.
Table 3: Maya Sun Temple presentation elements, module M3

M3    The Sun Temple.
  s3     Ambient sound track.
  a9     The Sun Temple building.
  sc1    Hun Nal Ye's house.
  sc2    Growth of the Sun Temple.
  p8     The earliest Pre-Columbian Maya architecture dates from the first centuries BC. . .
  ui9    (Maya number 9)
[Figure 7: Synchronization primitives for the first two modules]
The other modules are modelled in the same way. The dice represents an activation point for the next animation, or for the next module if the animation currently playing is the last one of the module. For each module Mk the end of each animation ai activates an instance of the user interaction entity uii, and a click on the dice naturally terminates uii: the relationships ai ⇒ uii and uii ⇒ ai+1 model this behavior. At the end of the last animation of a module, the user leaves the current module (Mk) by clicking on the dice, and begins the next one (Mk+1). The relationship Mk ⇌ Mk+1 describes that module Mk is forced to terminate, and so are its clip, sound track and active page, as a consequence of the ⇓ relationships between the module and its components. Channels an, sound and text become free. The narration of the Maya cosmogony continues with the beginning of life in the world. Hun Nal Ye comes into the world as a beautiful young man, bringing with him the seeds of maize. A second module, M2, is introduced since the scenario changes: the resurrection of the god from the underworld represents the beginning of life in the world. To emphasize this event, the sound track plays the sound of birds singing. This event provides the opportunity to give the user some information about Maya monumental art and the religious view of life and the world. Module M2 has a much more complex structure. It contains seven animations, seven pages and two sound clips that share the same channels used by module M1. Therefore M1 ⇌ M2. As the user begins M2's playback, animation clip a2 adds three stones and a tree to the god's house, and the birds' singing begins (sound track s2). When a2 ends, page p2 is displayed. Sound track s2 plays for the entire duration of M2, so it loops continuously. The following relationships, as in module M1, model this situation:

M2 ⇔ a2    M2 ⇔ s2    s2 ⇒ s2    a2 ⇒ p2    M2 ⇓ a2    M2 ⇓ s2    s2 ⇓ p2
[Figure 8: Synchronization primitives for module M2 (first part)]

[Figure 9: Synchronization primitives for module M2 (second part)]
At the end of animation a2, the user clicks on the dice to start the following animation a3. It uses channel an, which is free, since a2 has finished its playback. Page p2 remains active, so no new relationship is required. As a3 begins, the user hears, together with sound track s2, a background noise, which represents the resurrection of the god Hun Nal Ye from the underworld. If we denote this sound clip by n1, we can establish the relationship a3 ⇔ n1. As for animation a2, if the user stops the module's playback, both the animation and the noise should stop, so M2 ⇓ a3 and a3 ⇓ n1, and the same relationship has to be imposed between the module and all the animations that make up the story: M2 ⇓ a4, M2 ⇓ a5, M2 ⇓ a6, M2 ⇓ a7 and M2 ⇓ a8, as Figures 8 and 9 pictorially show. Page p2 remains on the screen until animation a5 ends, then page p3 is displayed. This is described by the relationships a5 ⇒ p3, to display the page, and p2 ⇌ p3, to free channel text in which p3 is displayed. When animation a6, representing the god's monumental sculpture, reaches its ending point, it activates page p4 by virtue of the relationship a6 ⇒ p4. Page p4 contains a link to another text page, p5, which contains two hyperlinks, one back to p4 and one to p6 (see Table 2 for a reference to the presentation pages). Pages p4 to p7, and their hyperlinks, form a bidirectional list. Figure 9 shows this situation. We introduce the relationship ⇌ every time there is a link between two objects that share the same channel, to describe that the channel must be released by its current content. The relationships p4 ⇌ p5, p5 ⇌ p4, and the others shown in the figure manage the use of channel text. The relationships M2 ⇓ p4 up to M2 ⇓ p7 are introduced to terminate the active page when the user stops the presentation. Animation a7 ends by showing the god's physical aspect; then a blank page pb is displayed in channel text, as described by the relationship a7 ⇒ pb. A blank page must be introduced because a Web browser displays a page until something else covers it. When a page is terminated the channel is freed, but the page content remains visible. The other modules are modelled in the same way. We present module M3 because it has a clip with two different scenes, a situation that has not been exemplified in the other modules. The animation shows the Sun Temple for the first time. The sound changes because the user is guided into the temple, where the sound of birds singing is inappropriate. The building rises over Hun Nal Ye's house and, when the temple appears, a text page with information about Pre-Columbian Maya architecture is displayed. This animation, a9, can be described as a clip made of two scenes, sc1 and sc2, one showing the god's house, and the second the rise of the temple. Figure 10 pictorially shows this part of module M3's structure and synchronization.
[Figure 10: Structure and synchronization of module M3]
The synchronization primitives introduced are the same as in the other modules, but we need to establish the relationship sc1 ⇒ sc2, so that the user does not need to activate the second scene, which plays as soon as scene sc1 ends.
7 Conclusion
In this paper we have presented a model for describing hypermedia documents delivered in a World Wide Web environment. A document is modelled along two directions: the document's structure and its temporal synchronization. The model structures a hypermedia document in a hierarchical way: a continuous media stream is a story, which is made of clips, divided into scenes; a clip, and the pages related to its scenes, build up a section. Synchronization is achieved with a set of relationships among the components of a multimedia presentation, while spatial positioning is obtained by the definition of channels. Our model is best suited to describe applications based on a main video or audio narration, to which static or dynamic objects are synchronized. As a verification of the model's capabilities, we have modelled a Web-based presentation, the Maya Sun Temple animation [26], a virtual exhibition built for a cultural institution in Venice. Three other works, presented in Section 2, discuss issues very close to the ones addressed by our model. The Amsterdam Hypermedia Model [2] describes the temporal behavior of hypermedia documents inside the objects' structure. Differently from our model, which defines a set of primitives, synchronization is achieved through the use of object composition and synchronization arcs, permitting the insertion of offsets into timing relationships. Like our model, AHM defines channels to play media items. The main difference between SMIL [5] and our model concerns user interaction, since SMIL does not consider it as a part of its model. Moreover, it is
not possible to stop an object (for example a soundtrack) while the remainder of the presentation continues playing, since SMIL's native features do not allow interaction with a single object, but only with the whole document: a user can only start or stop an entire presentation. SMIL does not define a reference model for the data structure. Unlike our model's channels, regions are not considered independent entities, but only screen areas hosting the media to be played. The main differences between FLIPS [17, 18] and our model concern the system environment and the modelling of the hypermedia dynamics. No structure is in fact provided, other than the one arising from the objects' mutual interrelationships. Due to the absence of a hierarchical structure, the re-use of an object, or of a time span inside its timeline, is not possible. Synchronization is defined between object states and not between the objects themselves. Using barriers and enablers, the start or end of an object cannot directly cause the start or end of another object, but can only change the state of the object at hand. For example, the beginning of an object (which corresponds to activation in our model) depends on the absence of barriers, which can be set or reset by complex conditions. In our model the start or end of an object can be caused only by the user or by events associated with another object, independently of the state of the other presentation components.
8 Acknowledgements
The authors wish to thank Mauro Scarpa for his contribution to the work.
References

[1] F. Halasz, M. Schwartz. The Dexter Hypertext Reference Model. Comm. of the ACM, 37(2), 1994.

[2] L. Hardman, D.C.A. Bulterman and G. van Rossum. The Amsterdam Hypermedia Model: adding time and context to the Dexter Model. Comm. of the ACM, 37(2), pages 50–62, 1994.

[3] L. Hardman. Using the Amsterdam Hypermedia Model for Abstracting Presentation Behavior. In Electronic Proceedings of the ACM Workshop on Effective Abstractions in Multimedia, San Francisco, CA, 4 November 1995.

[4] L. Hardman. Modelling and Authoring Hypermedia Documents. PhD Thesis, University of Amsterdam, 1998. http://www.cwi.nl/ftp/lynda/thesis.

[5] Synchronized Multimedia Working Group of W3C. Synchronized Multimedia Integration Language (SMIL) 1.0 Specification. W3C Recommendation, 15 June 1998. http://www.w3.org/TR/REC-smil.
[6] Synchronized Multimedia Working Group of W3C. Synchronized Multimedia Integration Language (SMIL) Boston Specification. W3C Working Draft, 22 June 2000. http://www.w3.org/TR/smil-boston.

[7] Synchronized Multimedia Working Group of W3C. Synchronized Multimedia Integration Language (SMIL) 2.0 Specification. W3C Working Draft, 21 September 2000. http://www.w3.org/TR/smil20.

[8] L. Rutledge, L. Hardman, J. van Ossenbruggen. Evaluating SMIL: Three User Case Studies. Proceedings of ACM Multimedia, November 1999, Orlando, Florida, USA. http://www.cwi.nl/multimedia/publications/acmmm99.ps.gz.

[9] L. Hardman, J. van Ossenbruggen, L. Rutledge, K. Sjoerd Mullender, D.C.A. Bulterman. Do You Have the Time? Composition and Linking in Time-based Hypermedia. Proceedings of ACM Hypertext '99, Darmstadt, Germany, February 1999.

[10] M. Jourdan, N. Layaïda, C. Roisin, L. Sabry-Ismaïl, L. Tardif. Madeus, an Authoring Environment for Interactive Multimedia Documents. Electronic Proceedings of the ACM Multimedia Conference, Bristol, UK, 1998.

[11] H. Cameron, P. King, H. Bowman and S. Thompson. Synchronization in Multimedia Documents. Electronic Publishing, Artistic Imaging, and Digital Typography, Lecture Notes in Computer Science 1375, Springer-Verlag, 1998.

[12] M. Vazirgiannis, Y. Theodoridis and T. Sellis. Spatio-temporal composition and indexing for large multimedia applications. Multimedia Systems, 6(4), pages 284–298, 1998.

[13] D.J. Duke, I. Herman, T. Rist and M. Wilson. Relating the primitive hierarchy of the PREMO standard to the standard reference model for intelligent multimedia presentation systems. Information Systems (INS) INS-R9704, 1997. http://www.cwi.nl/static/publications/reports/abs/INS-R9704.html.

[14] L.F.G. Soares, R.F. Rodrigues, D.C. Muchaluat Saade. Modelling, authoring and formatting hypermedia documents in the HyperProp system. Multimedia Systems, 8(2), 2000.

[15] C. Huang, J. Chen, C. Lin and C. Wang. MING I: a distributed interactive multimedia document development mechanism. Multimedia Systems, 6(5), pages 316–333, 1998.

[16] C. Huang and C. Wang. Synchronization for Interactive Multimedia Presentations. IEEE Multimedia, 5(4), pages 44–61, 1998.
[17] J. Schnepf, J.A. Konstan and D.H.C. Du. Doing FLIPS: Flexible Interactive Presentation Synchronization. IEEE Journal on Selected Areas in Communications, 8(3), pages 114–125, January 1996.

[18] J. Schnepf, Y. Lee, L. Lai, L. Kang, D.H.C. Du. Building a Framework for FLexible Interactive Presentations. Proceedings of the Pacific Workshop on Distributed Multimedia Systems, Hong Kong, June 1996.

[19] J.A. Brotherton, J.R. Bhalodia, G.D. Abowd. Automated Capture, Integration, and Visualization of Multiple Media Streams. Proceedings of the IEEE International Conference on Multimedia Computing and Systems, pages 54–63, Austin, Texas, June 28–July 1, 1998.

[20] J. Yu. A Simple, Intuitive Hypermedia Synchronization Model and its Realization in the Browser/Java Environment. Proceedings of the Asia Pacific Web Conference 1998, pages 209–218, September 1998.

[21] F.B. Paulo, P.C. Masiero, M.C. Ferreira de Oliveira. Hypercharts: Extended Statecharts to Support Hypermedia Specification. IEEE Transactions on Software Engineering, 25(1), pages 33–49, 1999.

[22] S. Boll, W. Klas. ZYX, a Multimedia Document Model for Reuse and Adaptation of Multimedia Content. To appear in Transactions on Knowledge and Data Engineering, DS–8 Special Issue, IEEE, 2000. http://www.informatik.uni-ulm.de/dbis/Cardio-OP/publications/TKDE2000.ps.gz.

[23] S. Boll, W. Klas. ZYX, a Semantic Model for Multimedia Documents and Presentations. Proceedings of the 8th IFIP Conference on Data Semantics (DS–8), January 5–8, Rotorua, New Zealand, 1999.

[24] S. Boll, W. Klas, U. Westermann. A Comparison of Multimedia Document Models Concerning Advanced Requirements. Technical Report, Ulmer Informatik-Berichte Nr. 99–01, University of Ulm, Germany, February 1999. http://www.informatik.uni-ulm.de/dbis/Cardio-OP/publications/TR-01.ps.gz.

[25] A. Celentano, O. Gaggi. Synchronization Model for Hypermedia Document Navigation. Proceedings of the 2000 ACM Symposium on Applied Computing, pages 585–591, Como, 2000.

[26] Palazzo Grassi, Venice, Italy. The Maya exhibition. http://www.palazzograssi.it/mostre/.