Chapter 4

From Ontology-based Semiosis to Computational Intelligence 1

The Future of Media Computing

Frank Nack
CWI, Amsterdam, The Netherlands

1 The term 'communal intelligence' was coined by Pierre Lévy [1994].

Abstract:

In this chapter we investigate the underlying structural requirements for media-aware knowledge spaces. We discuss the merging of media generation and annotation to facilitate the use of media-based information for diverse purposes. We pay particular attention to the description of various tools in a distributed digital production environment supporting the distinct phases of media production. We then outline the accessibility of annotated media material for the purpose of discourse in encyclopaedic spaces. The examples throughout this chapter are taken from two domains: media production, with an emphasis on news and film production, and encyclopaedic spaces as provided by domains such as theory, history, and anthropology of film.

Over the last 15 years the underlying idea of 'semantic and semiotic productivity', whether pursued in a manual, semi-automatic, or automatic way, has inspired a great deal of research in computer environments that seek to interpret, manipulate or generate visual media [Bloch 1986, Parkes 1989, Aguierre-Smith & Davenport 1992, Sack 1993, Tonomura et al. 1994, Davis 1995, Nack 1996, Brooks 1999, Lindley 2000]. Similar developments have occurred in the audio domain [Bloom 1985, Hirata 1995, Wold et al. 1996, Robertson et al. 1998, TALC 1999]. The steady infiltration of those technological advances into everyday production environments, such as non-linear video editing systems (FAST 601, Softimage DS), image editing tools (Photoshop, Illustrator, GIMP, or Maya), audio systems such as Cubase VST, environments for new media authoring, such as Director/Shockwave, Flash, Dreamweaver, and FrontPage, and WWW presentation technology, such as HTML and SMIL, has already deeply changed the social way of exchanging information. However, the technology follows the strains of traditional written communication by supporting the linear representation of an argument, resulting in a final multimedia product of context-restricted content. This is an instance of Marshall McLuhan's observation that new media technology is initially used to solve old problems. Conversely, the deeper impact of digital media is to redefine the forms of media, blur the boundaries between traditional categories like pre-production, production, and post-production, and alter the structure of information flow from producers to consumers.

What we are heading towards is a cyberspace as described by William Gibson in his novel Neuromancer [Gibson 1984]. Yet we see a perception shift from this space defined by a hybrid system of traditional media (telephone, cinema, TV, theatre, museum, books, newspapers, etc.) and digital information technology (i.e. networked and storage-intensive computers, CD-ROMs, DVD, video and camcorders, IP-telephony, Webcams, synthesisers, MIDI, DAW, etc.) to a knowledge space that facilitates new forms of creativity, knowledge exploration and social relationships, mediated through communication networks (i.e. hypertext, hypermedia, interactive games, interactive information/experience systems, virtual reality, simulations, augmented reality, groupware, and so on). Such an interactive, open and multimodal system sustains the activation of articulation powers, which in general represent parts of a semiotic continuum, where verbal, gestural, musical, iconic, graphic, or sculptural expressions form the basis of adaptive discourses. A basic requirement for such an individual-supporting but still communal space is that communication tools must make knowledge publicly available and provide means to make others aware of available sources. This requires new models that integrate in a natural way into daily activities.

A media infrastructure supporting such needs necessitates that a given component of media exists independently of its use in any given production. Making use of such an independent media item demands its processing, allowing the extraction of relationships between the signs of the audio-visual information unit and the idea it represents, according to the creator's intention, as well as the differing connotations that can be attributed to the signs, depending on the circumstances and presuppositions of the receiver at the time of perception, along with the various legitimated codes and subcodes the receiver uses as interpretational channels [Arnheim 1956, Peirce 1960, Eco 1985, Greimas 1983, Bordwell 1989]. The information buried within the established relations between the single media units therefore needs to be made available. Moreover – and this is even harder to tackle – information needs to be made accessible that is hidden in the unified structure of the single text, image, video, audio or tactile unit that
results from the composition of all its elements (for detailed descriptions of visual signification, see [Gregory 1961, Bettetini 1973, Kuleshov 1974, Metz 1974, Carroll 1980, Eisenstein 1991]).

Systems are therefore needed to manage independent media objects and representations for use in many different productions, with a potentially wide range of applications such as search, filtering of information, media understanding (surveillance, intelligent vision, smart cameras, etc.), or media conversion (speech to text, picture to speech, visual transcoding, etc.). Systems are also required for authoring representations that enable the creation of particular productions, such as narrative films, documentaries, interactive games, and news items. In other words, we not only need tools which allow people to use their creativity in ways they are already accustomed to, but, in addition, tools that utilize human actions to extract the significant syntactic, semantic and semiotic aspects of media content [Brachman & Levesque 1983], so that descriptions based on a formal language can be constructed.

In this chapter we investigate the underlying structural requirements for media-aware knowledge spaces, the merging of media generation and annotation to facilitate its use for diverse purposes, and the accessibility of such annotated media material for the purpose of discourse. The examples are taken from two domains: media production, with an emphasis on news and film production, and encyclopaedic spaces as provided by domains such as theory, history, and anthropology of film. We chose these domains because in both a great variety of media is constantly generated, manipulated, analysed, and commented on. Moreover, both facilitate the discourse between people with varying degrees of expertise. Finally, the domains are sufficiently different to show that the suggested approach is domain independent, but close enough to illustrate how cross-domain connections can be achieved.

First, however, it is useful to have a closer look at the basic structures for a knowledge space that allow the representation of audio-visual media in the form of a repository for retrieval purposes and also facilitate human creativity and discourse. We hope to show that the described structures extend the currently discussed ideas of the Semantic Web [Semantic Web 2001]. The Semantic Web is intended to bring machine-processable content to Web pages, thus being an extension of the current Web. The idea is to add ontology-based metadata to text or HTML documents to improve accessibility and provide means for reasoning about the content. From the point of view of visual media, this linguistic-centred view is the major drawback of XML-based environments, since the dynamic nature of visual media, as well as the variety of data representations and their mixes, is not recognized. However, that is precisely what we are interested in.

1. THE STRUCTURE OF A SEMANTIC AND SEMIOTIC CONTINUUM

If we look at visual material, such as single images or video, as an abstract element, we see that this material, embedded in a myriad of perceptual, cognitive and cultural codes, is subjective in its accessibility. Moreover, we know that visual signification, though based on common human content and thematic structures, provides its own realities of time and space based on patterns of juxtaposition which are interwoven in the narrative structure [Gregory 1961, Bettetini 1973, Metz 1974, Eisenstein 1991]. For the representation and use of visual media in dynamic digital environments this constructivist aspect of new media calls for more than characterizing audio-visual information on a perceptual level using objective measurements, such as those based on image or sound processing or pattern recognition [Aigrain et al. 1995, Gupta & Jain 1997, Del Bimbo 1999, Santini & Ramesh 2000, Mills et al. 2000, Johnson et al. 2000, Melucci & Orio 2000, Lemström & Tarhio 2000]. The creative reuse of material for individual purposes has a strong influence on the descriptions and annotations of the media data, whether created during the production process or added at any time later. It is important to provide semantic, episodic, and technical representation structures with the capability to change and grow over time. This also requires relations between the different types of structures, with a flexible and dynamic ability for combination. To achieve this, media annotations cannot form a monolithic document but must rather be organised as a semantic network of specialised content description schemata.

1.1 General concepts

An appropriate way to deal with these requirements is to use semantic networks. They have emerged from research in knowledge representation [Brachman & Levesque 1983, Sowa 1984, Collier 1987, Halasz 1988] as a fundamental representation that can be used for dealing with multimodal data. Semantic networks feature three significant functions:
– They provide semantic, episodic, and technical memory structures (i.e. information nodes) with the capability to change and grow, thus allowing an ongoing process of inspection and interpretation of source material.
– They facilitate dynamic use of audio-visual material using links, enabling connections from multi-layered information nodes to data on a temporal, spatial and spatio-temporal level, thus providing temporality without the disadvantage of using keywords, as keywords are replaced by a structured information representation.

– They enable semantic connections between information nodes using typed relations, thus structuring the information space on a semantic as well as syntactic level.

At this point it is useful to examine the main features, i.e. nodes, relations, and anchors, in further detail.

1.1.1 Nodes

We distinguish between three types of nodes, i.e., data nodes (D-nodes), content description nodes (CD-nodes), and conceptual annotation nodes (CA-nodes). Figure 1 provides a simplified visualization of the three node types.

Figure 1. The different node types as part of an information space constructed on the work of the Russian director S. M. Eisenstein (D-nodes for the films 'Battleship Potemkin' and 'The Untouchables' and for the fragments 'Odessa Steps' and 'Station Scene'; CD-nodes describing camera position and movement, attached via links with temporal anchors; a CA-node holding an example definition of rhythmical montage; and a 'Reference' relation between the two fragments)

A D-node represents physical audio-visual material of any media type, such as text, audio, video, 3D animation, 2D image, 3D image, and graphic. The size, duration, and technical format of the material is not restricted, nor are any limitations present with respect to the content, i.e., number of actors,
actions and objects. In Figure 1, we can see that a data node might contain a complete film, as conveyed by the nodes 'Battleship Potemkin' and 'The Untouchables', or merely a scene, as represented by the nodes named 'Fragment Odessa Steps' and 'Fragment Station Scene'. The details of the link and anchor structure between the two nodes will be described in Section 1.1.2 (Relations and anchors).

The other two types of nodes are best understood as instantiated schemata providing either denotative, semantically loaded technical characteristics of the data (CD-nodes) or connotative material (CA-nodes). A CD-node provides information about the physical level of the data, such as shape, colour, light, focus, distance (spatial interpretation) or movement. The content of such a node can be based either on natural language or on features and measurements. This type of node is predominantly generated automatically and mainly serves as the low-level basis for automatic interpretation and generation of material. In Figure 1, both fragment nodes are associated with a CD-node (in reality there would be more, but for the sake of simplicity we show just one for each). Since both provide information on lens movement, lens state, camera distance, movement, position, angle, and production date, they provide the means for automatic comparison of compositional features. Thus the relation 'Reference' between the two fragment nodes of different films can now be established automatically, though it will most likely be done semi-automatically (see later discussions of that issue in this chapter).

CA-nodes are socially determined small reticular semantic structures that allow the interpretation or combination of D-nodes to establish meaning, such as episodes in a narrative, or the identification of metaphors or analogies. In Figure 1, a description of a particular film technique is attached to the 'Fragment Odessa Steps' node. This description can be either Eisenstein's text or a formalized representation in the form of a feature grammar (see Dorai & Venkatesh [2001], Vries et al. [2000]).
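To make these structures concrete, the following is a minimal sketch of how the three node types, anchors, and typed relations could be represented; all class and field names are illustrative, not the chapter's actual schemata:

    from dataclasses import dataclass, field
    from typing import Dict, List, Optional, Tuple

    @dataclass
    class Anchor:
        """Spatio-temporal address into a D-node (times in ms, region as a bounding box)."""
        start_ms: Optional[int] = None
        end_ms: Optional[int] = None
        region: Optional[Tuple[int, int, int, int]] = None   # (x, y, width, height)

    @dataclass
    class Relation:
        """Typed, possibly bidirectional link between nodes."""
        rel_type: str                     # e.g. 'Reference', 'describes'
        target: "Node"
        bidirectional: bool = False

    @dataclass
    class Node:
        node_id: str
        relations: List[Relation] = field(default_factory=list)

    @dataclass
    class DNode(Node):
        """Physical audio-visual material of any media type."""
        media_uri: str = ""

    @dataclass
    class CDNode(Node):
        """Denotative, mostly automatically generated description of the data."""
        anchor: Optional[Anchor] = None   # where in the D-node the description applies
        features: Dict[str, str] = field(default_factory=dict)

    @dataclass
    class CANode(Node):
        """Connotative, socially determined interpretation, e.g. a montage definition."""
        text: str = ""

    # Usage: a CD-node describing camera features of the Odessa Steps fragment
    odessa = DNode("fragment_odessa_steps", media_uri="potemkin.mpg")
    camera = CDNode("odessa_camera", anchor=Anchor(start_ms=0, end_ms=120000),
                    features={"camera_distance": "close-up"})
    camera.relations.append(Relation("describes", odessa))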

1.1.2 Relations and anchors

Due to the evolving nature of knowledge spaces we have to develop flexible connection structures between the information sources (i.e. nodes) within the network. Hence, we require two sorts of connections attached to nodes:
• Spatio-temporal connections between data nodes and descriptional nodes (anchors). An anchor enables the connection from a description to data on a temporal, spatial or spatio-temporal level.
• Semantic connections between descriptional nodes (relations). Relations enable connections of structural elements within a node as
well as connections between distinct nodes. There can be up to n relations between two nodes. Relations can be either uni- or bidirectional and they are usually typed.

Figure 2 outlines the temporal and spatial aspects graphically, using Alfred Hitchcock's film 'North by Northwest' as an example. As can be seen, a particular schema can address a sequence of a larger media unit, represented through the key frame with the matchbox (sequence) in the top row of images (the film). Figure 2 also shows that a schema can at the same time provide access to particular objects in the form of a spatial description. As the matchbox is an important prop in the scene, it is marked, and the resulting shape is then detected within every single frame of the relevant sequence. Thus, the schema not only provides a particular model of an object (e.g. in a 2D or 3D description) but also its appearance within the relevant time span. In combination with other schemata, e.g. for the visual representation of a frame or shot, such as close-up or medium, we are then in a position to shape a retrieval request or perform rule-based interpretations. However, these issues will be discussed in more detail later in the chapter.









Figure 2. Temporal and spatial anchors used in describing film sequences (the TIME axis addresses a frame sequence within the film, while the SPACE axis marks the tracked object region within each frame of that sequence)
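A minimal sketch of such a combined spatio-temporal anchor, with invented names, could look like this:

    from dataclasses import dataclass
    from typing import Dict, Optional, Tuple

    @dataclass
    class SpatioTemporalAnchor:
        media_uri: str
        start_frame: int
        end_frame: int
        # per-frame bounding boxes of the tracked object (e.g. the matchbox prop)
        regions: Dict[int, Tuple[int, int, int, int]]

        def region_at(self, frame: int) -> Optional[Tuple[int, int, int, int]]:
            """Shape of the tracked object in one frame, or None outside the time span."""
            if self.start_frame <= frame <= self.end_frame:
                return self.regions.get(frame)
            return None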


Within a networked environment in which we cannot assume the rigid structure of a document, we have to identify those descriptions that provide connections to the real media with 'absolute addresses' and 'absolute times'. Every other node connected to them will merely provide reference addresses. This mechanism allows the identification of entry points into the network via the particular time representation in the description. Moreover, the decomposition of the information into small temporal or spatial units not only supports the required flexibility within growing knowledge spaces but also facilitates the streaming of media units. If we wish to provide meta-information with streamed data, we can now use just those information units which are relevant for the temporal period and of interest to the particular user exploiting the stream, thus improving the precision of information delivery to match the level of context requirements and individual needs. Detailed descriptions of anchor and relation issues can be found in [DeRose & Durand 1994, ISO HyTime 1997, ISO MHEG 1997, Hardman 1998, Auffret et al. 1999, SMIL 2001].
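The distinction between absolute and reference anchors, and the selection of description nodes for a streamed interval, can be sketched as follows (names are again illustrative):

    from dataclasses import dataclass
    from typing import List, Optional, Tuple

    @dataclass
    class TemporalAnchor:
        node_id: str
        media_uri: Optional[str] = None          # set only on absolute anchors
        start_ms: int = 0                        # absolute time, or offset into the referenced anchor
        end_ms: int = 0
        ref: Optional["TemporalAnchor"] = None   # reference anchors point at another anchor

        def resolve(self) -> Tuple[Optional[str], int, int]:
            """Return (media uri, absolute start, absolute end)."""
            if self.ref is None:
                return (self.media_uri, self.start_ms, self.end_ms)
            uri, base, _ = self.ref.resolve()
            return (uri, base + self.start_ms, base + self.end_ms)

    def metadata_for_window(anchors: List[TemporalAnchor], t0: int, t1: int) -> List[str]:
        """Select only the description nodes relevant for the streamed interval [t0, t1)."""
        hits = []
        for a in anchors:
            _, s, e = a.resolve()
            if s < t1 and e > t0:                # overlap test
                hits.append(a.node_id)
        return hits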

1.2 Problems

On the representational level, we have demonstrated that a description in the form of a semantic network (i.e. instantiated description schemata connected via relations) supports the creation of new annotations. If new information structures are required, new templates can be designed using any descriptional language [see XML Schema 2001, RDF 2001, RDF Schema 2001, DAML+OIL 2001, ISO MPEG-7 DDL 2001, and ISO MPEG-7 MDS 2001]. However, myriads of instantiated schemata performing as nodes in a network, webbed together by a large variety of relations, generate problems, mainly with respect to search and content validation.

Search: The complex structure of the semantic net does not permit easy detection of required information units. It is not difficult to detect the right entry point for search, but the traversal of the network is, compared to a hierarchical tree structure, more complex. Due to its flexibility, it is rather problematic to generate structures such as a table of contents or an index. A potential solution to this problem might be the introduction of a schema header (containing general information about the schema type, incoming and outgoing links and relations, and other organisational information), distinguished from the schema body with its specific descriptive information.

Validation: For cases in which new nodes are added, established nodes are changed, or new relations between existing nodes are introduced, we have to establish that these operations and the created documents are valid. This is computationally expensive. However, it can be achieved via partial parsing. The parser validates only a particular part of the network (e.g. a number of hubs), thereby avoiding parsing a complete network if only a tiny section is affected.

Finally, the organisation of network structures in memory is problematic. Preferably the network should be stored in an XML-based object-oriented database (see, among others, Tamino from Software AG, dbXML from the dbXML Group, the XML-enabled databases from IBM – XML Extender, Informix Internet Foundation 2000, Oracle – XML Developer's Kit, etc.). However, at the moment of writing, none of them demonstrates storage and retrieval access times that are acceptable. Thus, file systems are a better option, though they require intensive work on suitable data management environments (i.e. storage and retrieval of metadata as well as connections between data and annotations) and on validation tools.

Having described how the semiotic continuum can be organised and the problems encountered with this endeavour, we are now in a position to analyse particular domains to see how the suggested structures can be applied. We are specifically interested in illustrating how the required information can be collected (automatic versus manual generation/instantiation of schemata) and how these new methods of information gathering will change our ways of communication.

2. MEDIA PRODUCTION

Media production, such as for news, documentaries, feature films, interactive games, and virtual environments, is a complex, resource-demanding, and distributed process with the aim of providing interesting and relevant information through the composition of different audio-visual information units. Media production is traditionally arranged in three parts, i.e., preproduction, production, and postproduction. The activities associated with these phases may vary according to the type of production. For example, the methodology involved in the creation of a dramatic film, a documentary, and a news program is very different. Commercial dramatic film production is typically a highly planned and linear process, while documentary making is much more iterative, with the story structure often being very vague until well into the editing process. News production is structured for very rapid assembly of material from very diverse sources. Media information systems must be able to accommodate all these styles of
production, providing a common framework for the storage of media content and assembly of presentations. It must be emphasised that the interrelationships between different stages within the production (e.g., the influence of personal attributes on decisions or the comparison of different solutions) are complex and extremely collaborative. The nature of these decisions is important because, as feedback is collected from many people, it is the progression through the various types of reviewers that affects the nature of the work. Thus, each of the different phases of media production provides important information on technical, structural, and descriptional levels.

However, today's media production is mainly a one-time design and production process. This means that oral discussions on the choice of events, paper-based descriptions of activities on the set, production schedules, editing lists with decision descriptions, and organisational information will be lost after the production is finished. Ironically, it is this kind of cognitive content and context-based information that today's researchers and archivists try to analyse and re-construct out of the final product – usually with limited success.

Current professional IT tools exacerbate information loss during production. These applications assist in the processes of transforming ideas into scripts (e.g. a text editor, news ticker application, Dramatica, etc.), digital recording (e.g. Sony's 24p camera, Hdreel from director's friend), digital/analog editing (Media 100, Media Composer, FAST 601, df-cineHD and Hdreel from director's friend, etc.), and production management (Media-PPS, SAP R/2, SESAM, etc.). However, the tools are often based on incompatible and proprietary architectures. Hence, it is not easy to establish an automatic information flow between different tools, or to support the information flow between distinct production phases.2 The net result is little or no intrinsic compatibility across systems from different providers, and poor support for broader reuse of media content. Hence, we face the paradoxical situation that, while there are more possibilities than ever to assist in the creative development and production processes of media, we still lack environments which serve as an integrated information space for use in distributed productions, research, restructuring (e.g. by software agents), or in direct access and navigation by the audience.

2 The domain that demonstrates best what is possible today is news. In particular the technology of Avstar Systems LLC, the market leader in newsroom computing systems, shows how automatic indexing enables news journalists, editors, producers and directors to effectively search, browse and pre-edit all of their incoming videos from the desktop.

2.1 Digital production – environment and tools

The aim of a digital production, as we understand it, is to provide a distributed environment supporting the creation, manipulation, and archiving/retrieval of audio-visual material during and after its production. Such an environment requires a digital library of consistent data structures (supporting different production types, such as news programs, documentaries, interactive games, etc.), around which associated tools are grouped to support the distinct phases of the media production process. Each of the available tools forms an independent object within the framework and can be designed to assist the particular needs of a specialised user. It is essential that the tools should not place extra workload on the user – she should be able to concentrate on the creative aspects of the work in exactly the way she is familiar with. Let us have a look at the different production phases to clarify the idea.

2.1.1 Preproduction

Preproduction is concerned with determining the main ideas and logic that form the core of the production. In a plot, the intentions of the narrator must be achieved, i.e. to present the material as plausibly and succinctly as possible. The impact of this relies on articulational techniques, i.e. communication strategies between narrator and receiver. Further, the dynamics within the material must be considered, since these form the basis of the plot, i.e. thematic structures. Thus, narration is a dynamic process of interaction in a partly given social context, where 'the interaction encompasses … the communicator, the content, the audience and the situation' [Janowitz & Street, 1966, p. 209].

Writers of stories for the screen have a deeply ingrained tendency to construct stories in a fixed linear fashion. The main state-of-the-art type of tool for authors is therefore either 'script formatting software' or the more sophisticated 'storytelling support software'. Script formatting software, such as Final Draft [2001], Movie Master [2001], or Movie Magic Screenwriter [2001], is capable of performing the basic functions required to create draft scripts. Besides standard formatting functionality, this type of software supplies automatic ways to change from dialogue to description to character name to transition. Storytelling support software, such as Dramatica [Dramatica 2001], follows the approach of using a question-answer mode to construct the story. As the writer fills out the questionnaire making high-level narrative choices, Dramatica's main goal is to force the writer into making the most detailed
decisions about the story as much as possible, such that Dramatica's matches for a story structure and style come down to just one.

The problem with both approaches is that, while the result might be a good plot, the interesting semantic structures are buried in the script, and powerful natural language processing is required to retrieve them. Moreover, these systems do not scale to the new breed of interactive plots used in games. Here, choices are demanded on how a plot unfolds, preferably in a personalized manner. Thus, the deterministic and closed-world approach of a linear plot is replaced by structures where players or the audience decide how the plot moves along. Today, such stories are planned on walls covered with paper strips, on which authors scribble their ideas of complex story transitions. What is needed is a distributed environment in which the developing team not only provides the plot, but also additional clues such as mark-ups for emotional developments (just as we emphasise text with bold or italics) to facilitate machine interpretation of the emotional development of characters, not only within a particular episode but over the whole plot. We suggest, therefore, a Script Editor accessible on an interactive, networked electronic wall, as described by Streitz et al. [2001]. The suggested editor allows the presentation and organisation of episodes in a mix of graphic- and text-based levels: organisational level, plot level, episode level, and conceptual level.

Organisation level: This level provides choices regarding the type of production, i.e., a documentary, soap, or feature film. This will influence the appearance of templates and representational structures for categories such as summary, character, dialogue, setting, and media data (while this is not very important to the development of the story, it becomes interesting for later stages of the production, when the actual audio-visual material is being produced).

Plot level: Each episode of the plot is presented as a box, signifying one of the following types, distinguished by colour: introduction of a setting or character and explanation of the state of affairs (Introduction), initiating event (Conflict), emotional response or statement of goal by protagonist (Resolution), complicating emotions (Diversion), and outcome (Ending). Episodes are connected by a variety of relations, such as follows, precedes, must include, supports, opposes, conflict-resolution, motivation, justification, etc. The graphical representation of episodes characterizes, at higher levels of abstraction, some of the details captured by the episode level of the editor
(described below). The graph structure can be animated using a general speed model (time to tell the story), presenting changes of importance (size) and mood (colour) of each sequence over time. Thus the author gets a feel for the flow of the story (macroscopic browsing). Moreover, using the presented graph structure the author can experiment with different storylines by rearranging the path between sequences. Over the course of the production, the simple graphical representations of this level, as well as those of the episode level, can be replaced with more advanced graphical representations, such as animated storyboards or 3D animations – which also provide the means for introducing optical effects or animations that are used later in the postproduction phase.

Episode level: This level portrays various aspects of the characters in one particular episode, i.e., relevance for the scene, emotional state, relation to other characters. The animation can be simple, such as iconic representations of particular characters, where the size of an object refers to its importance in the episode, the position on the screen represents relations between objects, the colour represents an emotional state, etc. Viewing the animated graphical development of characters gives a better understanding of the individual development (microscopic browsing). The automated generation of these animations is based on the annotations provided by the author of the script, developed on the conceptual level.

Conceptual level: On this level the script can be edited in detail using word-processor technology, creating content that conforms to the standard formats for scripts. Since this part is text based, access to query systems for themes, genres, emotional states, etc. is possible, to support shaping of the plot. Queries might deal with determining the various themes and issues the story should illustrate, with an emphasis on locating the central problem(s) that reside in the main characters and the overall story. Based on these answers the system can then analyse the deep structure, i.e., plot, theme or character pattern. Thus, this level provides a combination of script formatting and storytelling support software, except that it also facilitates extra annotation of various structural elements, which is automatically translated into instantiations of available description schemata.

The various logical, temporal, and spatial links supplied by the work processes at the various scripting levels provide a way of efficient versioning (path lists can represent different story lines – see also Lindley & Nack
[2000]). A similar approach using layers for the early development of media presentations is described by Bailey et al. [2001]. Once structures of scripts are created, they can be made available for general or protected access, allowing authors to overcome conceptual problems (e.g. if a conflict between two characters is established but cannot be resolved, the system can search for a similar structure and show the resolution of the retrieved plot sequences), or to help in further story development (e.g. after 745 episodes of a soap, the system could extrapolate a number of possible new relations between actors, in consequence stimulating fresh ideas for the author). Finally, the script structures can be connected with organisational tools supporting processes such as casting (assigning real actors or stuntmen to characters), time and production management (which actor has to be available when, what sort of equipment has to be where and when, what sort of post-production is required, etc.), or budgeting (how expensive is a scene regarding staff, equipment, location rent, etc., what is the estimated budget for the film, how do cuts in the budget affect the film, etc.).
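To illustrate the plot level, here is a minimal sketch of episodes typed by narrative function and connected by typed relations; the class and field names are invented for the example:

    from dataclasses import dataclass, field
    from enum import Enum
    from typing import List

    class EpisodeType(Enum):           # the colour-coded box types of the plot level
        INTRODUCTION = "introduction"
        CONFLICT = "conflict"
        RESOLUTION = "resolution"
        DIVERSION = "diversion"
        ENDING = "ending"

    @dataclass
    class Episode:
        episode_id: str
        kind: EpisodeType
        importance: float = 0.5        # animated as box size
        mood: str = "neutral"          # animated as box colour

    @dataclass
    class PlotRelation:
        source: str
        target: str
        rel_type: str                  # 'follows', 'must_include', 'supports', 'opposes', ...

    @dataclass
    class Plot:
        episodes: List[Episode] = field(default_factory=list)
        relations: List[PlotRelation] = field(default_factory=list)

        def storyline(self, start: str) -> List[str]:
            """Follow 'follows' relations to extract one linear path, i.e. one story version."""
            nxt = {r.source: r.target for r in self.relations if r.rel_type == "follows"}
            order = [start]
            while order[-1] in nxt and nxt[order[-1]] not in order:   # stop at the end or on a cycle
                order.append(nxt[order[-1]])
            return order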

2.1.2 Production

The main task during production is the acquisition of media material. Usually, a production uses one version of a script as the basis for shooting. An exception is made in news, where time constraints and the unpredictability of events prevent scripting, so that similar structural information is gathered during the production itself. Compared to traditional analog media fabrication, the digital set has some advantages. Imagine that a particular scene is going to be shot. The cameraman has arranged the lighting instructions according to his notes, available through the digital Location Editor. This editor offers a link to the information space created by the Script Editor, as described earlier, extracting particular information about the realisation of shots. At the same time the director arranges a couple of dialogue changes with the script girl, which are saved as an additional version of the scene. In fact, the digital Location Editor facilitates more, i.e. storing status information about a scene: whether it is realised, not realised, or newly realised. The result is an automatic update of the overall script structure. Immediately after recording a scene, but before final storing, the video sequence can be analysed using digital playback. While reviewing the material, additional information, such as the reason for the production of each shot and its consideration for the composition of the final media unit, data concerning continuity, etc., can be included in the information space offered by the Location Editor. Such information can be gathered in different ways. The director can, for example,
use tools that support simple semantic annotations while shooting. Another option is that the script girl exploits the time during discussions to collect the information, either with equipment based on natural language processing or with graphic-based description languages as suggested by Davis [Davis 1995]. Such a distributed set environment provides a permanent, accessible, coherent relationship between the textual script, its underlying concepts, and their visual representations.

Not all of the tools exist as outlined above. Those which exist are again proprietary, such as the Ikegami camera with a 4 Gbyte hard-disc recorder and Avid technology, where markers are set while the video is recorded. However, the Mobile Group at GMD-IPSI has developed a set of customisable production tools for the domain of news [Nack & Putz, 2001], which aim to address the sketched scenario. Based on a three-month investigation at WDR (Westdeutscher Rundfunk, Germany's largest public broadcasting station) and HR (Hessischer Rundfunk, the broadcasting station of the federal state of Hesse), where they talked with and observed the work of reporters, cameramen, and editors during actual productions, the group designed and prototyped a media-aware camera that automatically stores the acquired video stream together with relevant low-level feature information in the form of an associated network of instantiated XML-Schema structures. The metadata schemata used by the camera describe image parameters, such as camera movement (pan and tilt), lens action (zoom and focus), shutter, gain, and iris position, as they contribute to the various expressive elements in film [Bordwell 1989, Eco 1985]. Moreover, information is collected about the spatial position of the camera by using a magnetic tracker. Limited information units were chosen to demonstrate the general access and archival mechanisms. A complete set of camera descriptors might be closer to the parameters suggested by the Dynamic Data Dictionary [SMPTE 1999] and the GMD Virtual Studio [Fehlis 1999]. The actual hardware components used in the demonstration environment are:
• a Polhemus FastScan tracker,
• a Sony EVI-D30 camera,
• a Videodisc Photon MPEG-1 encoder.
The camera is a working example demonstrating how large amounts of relevant metadata can be collected automatically during production. The camera illustrates the instantiation processes of annotations or spatio-temporal identification marks for audio-visual data, based on a linking mechanism using time-codes, region-codes, and scene identifiers. The advantage of this generic approach is that the expert, here the cameraman,
can concentrate on his tasks without being concerned about storage organisation or general presentation. Due to the particular requirements of the various types of production it is important to provide a larger (e.g., for feature films) or smaller (e.g., for news) annotation set. Additionally, it is important to facilitate the different desires of the members of the production team. Thus, the set of features to be annotated during filming (the team at GMD developed 18 schemata), as well as production information (date, time, location, etc.) and personal information (e.g., useful for copyrighting a scene for the cameraman and director), can be individually assigned and applied via a programmable code-card, which is installed in the camera before shooting starts.

The synchronisation (i.e., linking) between data and annotation is resolved via time codes in SMPTE notation (hh:mm:ss:msms) combined with a scene identifier. Before shooting, the camera and the handheld device exchange the scene id. Once the camera starts recording the digital video in MPEG format, the annotation algorithm polls every 20 ms for changes in the relevant image capture parameters (zoom, focus, shutter, iris, gain). In case a change is detected, a mediadevice-Description Scheme (DS) is instantiated with the start and end time of the event, the parameter type, e.g. zoom, and its descriptional value. Furthermore, the node is immediately integrated into the relevant scene-graph by providing connections to the relevant nodes (using the scene id). If a camera capture event extends over a time span longer than 20 ms, the end time will be entered after the first unsuccessful poll (the algorithm corrects the temporal delay automatically). The instantiation of a mediadevice-DS for focus might look as follows (the tag names are illustrative):

    <mediadevice>
      <!-- illustrative tag names; the URIs, time codes, and values follow the
           schemata described above -->
      <relation type="part_of"
                target="http://www.darmstadt.gmd.de/MPEG7/Tagesschau200001202000"/>
      <relation type="part_of"
                target="http://www.darmstadt.gmd.de/MPEG7/event01_inst"/>
      <time start="00/00/02/200" end="00/00/03/100"/>
      <focus state="closeup" value="734"/>
    </mediadevice>
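The polling logic itself can be sketched as follows, assuming hypothetical camera and emit_document interfaces (the 20 ms cycle and the close-on-first-unsuccessful-poll behaviour follow the description above):

    import time

    POLL_MS = 20
    PARAMS = ("zoom", "focus", "shutter", "iris", "gain")

    def annotate(camera, scene_id, emit_document):
        """Poll the image-capture parameters every 20 ms and emit one
        mediadevice-DS document per detected change event."""
        last = {p: camera.read(p) for p in PARAMS}   # camera.read() is an assumed interface
        open_events = {}                             # parameter -> (start time code, new value)
        while camera.recording:
            now = camera.timecode()                  # SMPTE time code supplied by the camera
            for p in PARAMS:
                value = camera.read(p)
                if value != last[p]:                 # change detected: open an event
                    open_events.setdefault(p, (now, value))
                    last[p] = value
                elif p in open_events:               # first poll without change: close the event
                    start, v = open_events.pop(p)
                    emit_document(scene_id, p, start, end=now, value=v)
            time.sleep(POLL_MS / 1000.0)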

In such a way the camera establishes a document network, where each change is represented in a single document (temporal actions longer than 20 ms, such as zooms, are collected in one document only).

Additionally, a handheld device was developed to allow the reporter to make real-time annotations during shooting on a basic semantic level, i.e. capturing in and out points for sound and images. The device provides a monitor for screening the recorded material of the camera in real time (the transmission rate is 13 images per second, since the reporter is only interested in gaining an idea of the framing and content). Furthermore, the handheld supplies a set of buttons
• to mark the importance of sound and images with in- and out-points;
• to provide a conceptual shot dependency via a scene id, such as 1-1, 1-2, 2-1, etc., where the first digit represents the scene and the second digit denotes the shot number.
The reporter can adjust the scene id at any time if the camera is not recording. Once the camera is active, it first identifies the current id setting of the handheld (a handshake protocol on an infrared basis is used), which provides the id for all description schemata created for the current recording. The synchronisation of the in and out points with the audio or video is achieved via time codes (provided by the camera) and the scene id. The reporter can set in and out points at any time during recording. The advantage of the described approach is that the synchronisation between the camera and an important conceptual information source is no longer based on the manual adjustment of camera time code by a human synchronisation source (i.e. the time read by the reporter from his watch), but is performed automatically. For the reporter this means that she can now concentrate on the action that needs to be observed, since only a few buttons need to be pressed, instead of scribbling time codes and related notes on paper.


The same mechanism as described between handheld and camera can of course also be established between a Location Editor, as described above, and the camera. The only difference is that the available information space is naturally much bigger when using the Location Editor. These few examples demonstrate how extra semantics can be added to audio-visual material during production without increasing the workload or introducing drastic changes in the work procedures. The next section provides ideas on how enhanced media can improve the work process and information flow during postproduction.

2.1.3 Postproduction

The aim of postproduction is twofold. The first goal is to arrange the material so that the resulting media unit (i.e., the video, film, interactive game, etc.) makes sense as a whole. Second, one must ensure that the intended theme engages the spectator both emotionally and intellectually. Thus, postproduction is where the actual moulding of the story in a particular medium takes place. It is at this stage that knowledge is required about the final media presentation, such as the final plot, the characters, the theme, the presentation style (linear, interactive), the target audience (user models, possible platforms, access rights, etc.), and the distribution channels (internet networks, broadcast and wireless, disc, cable, etc.). In a digital environment as described in Sections 2.1.1 and 2.1.2, the acquired meta-information from previous production processes provides the required knowledge. In the following section we demonstrate how the particular process of 'editing' can be supported by the gathered information. The example, a media-aware tool suite depicted in Figure 3, was originally developed at GMD [Steinmetz and Hemmje 1994] to demonstrate distributed film editing.


Figure 3. Editing suite for the creation of interactive video stories

The key parts of the interface shown in Figure 3 are the working space (main window) and the transition editor (window in the lower right corner). The other two smaller windows offer graphical access to editing, style, and relation techniques. The main working area is divided into two sections. The upper part represents the typical editing environment, i.e., the time line, on which clips and their transitions are linearly ordered, representing the final product. In this editor, the timeline represents the currently selected path of the visual object. The larger lower part offers an additional compositional space allowing visual feedback for alternative cuts, based on graphical representations similar to the one introduced earlier in the description of the Script/Location Editor. This is the actual working environment. When the editor works on a particular scene, the relevant material is supplied automatically in an ordered fashion (see the vertical line of shots at the right side of the window), based on conceptual dependencies for the particular production, such as the following:
• Groups of video material are based on scene ids and their versions.
• Short clips are more important than long ones.
• Annotations of in and out points increase the value of a shot, so it should be presented more prominently.
• A clip already accepted on the set is important, so it should be presented notably.
Since the position of a clip in a cluster represents its potential value for the scene, the editor can now decide more quickly which clips need to be used in the working space, and can thus concentrate on shaping a clip, juxtaposing it
with others and determining the transitions between clips. The advantage of the graph representation is that various versions can be produced to establish different emotional or logical patterns without losing a single version. The description of a single scene contains the chosen path (i.e., the used clips and their order), the duration of each single clip, the transitions between clips and, if desired, the same information for other created versions. Related to this editing list is the list of sound mixing. Once a scene is edited one can swap to the episodic level, checking whether the established pattern fits the overall structure of the product. The aim of the Editing Suite is to create time-based cutting lists for the different established versions of the end product. Technically, it is possible to include additional meta-data (e.g., why a certain cut or FX was chosen), though this version does not facilitate such features.

However, in a time- and equipment-constrained environment, as in news, the described Editing Suite is far too complicated. Nevertheless, even under such constrained conditions information-improved media editing is possible, as illustrated by the semi-automated News Editing Suite demonstrator developed at GMD-IPSI. This Editing Suite is a simplified version of the suite described above. The News Editing Suite provides the reporter with an instant overview of the available material by ordering scenes in the same vertical fashion as described. The reporter is now in a position to mark the relevant video clips for the newscast by pointing at them. The order of pointing indicates the sequential appearance of the clips, and the length of pointing their importance. As a visual aid the selected clips are enclosed by different shapes to highlight the role of the selected clip within the news unit, i.e., a square stands either for an introduction or a resolution (distinguished by colour), whereas a circle represents the significant message. Finally the reporter provides the overall time of the clip. Based on a simple planner the Editing Suite then performs an automated composition of the news clip, which the reporter can tune, for example, by adjusting the clips according to his voice-over, using the time line offered by the interface.

The simple planner for news composition is based on a set of 22 rules for the automated clipping of a shot and the juxtaposition of shots. To provide an example, we now describe those rules that are concerned with the rhythmical shape of a sequence. Strategy A and Strategy B reflect the fact that a viewer perceives the image in a close-up shot in a relatively short time (≈ 2-3 seconds), whereas the full perception of a long shot requires more time.3 Moreover, the composition of shots may vary in the number of subjects, number and speed of actions, and so on, which also influences the time taken to perceive the image in its entirety. Finally, the stage of a sequence in which a shot features also influences the time taken to perceive the entire image. For example, a long shot used in the motivation phase takes longer to appreciate, since the location and subjects need to be recognised, whereas in the resolution phase the same shot type may be shorter in duration, since in this case the viewer can orient himself or herself much more quickly.

3 The time values used in all the following examples of editing strategies are based on estimates provided by the editors at the WDR.

Strategy A

If camera distance of a shot = close-up then clip it to a length = 60 Frames.

Strategy B

If camera distance of a shot > medium-long and sequence.kind = realisation or resolution then clip it to a length = 136 Frames.

Both strategies indicate the need to trim a shot. However, the above strategies may cause problems, as the actions portrayed may simply require more time than suggested by the rules. The necessary steps to achieve a necessary chain of actions while still allowing trimming are to identify the start frame of the first relevant action and to detect any overlap with the second relevant action. If such an overlap exists, then it is possible to cut away the section of the shot in which the first action is performed in isolation. The same mechanism applies to the end of the shot. It is, of course, important that the established spatial and temporal continuity between the shot and its predecessor and successor remains valid. Figure 4 describes the application of edge trimming for a shot of 140 frames. The shot should portray an actor walking and then sitting, but should not be longer than 108 frames.

Figure 4. Trimming of a shot from 140 to 108 frames (the shot annotations 'eat', 'walk', 'sit', and 'talk' mark overlapping actions along the 0-140 frame scale)

Strategy C represents an example of temporal clipping as discussed immediately above, focussing on frame elimination from a single shot that sequentially portrays a number of actions.

Strategy C

If close-up < camera distance < long and sequence.action.tempform = equivalence and number of frames > as calculated in Strategy A or B and number of performed actions in the shot > 3, then verify the frame overlap of the first and second action as X, verify the overlap of the last and last-1 action as Y, cut away the pure frame numbers for the first action if X > 36, and cut away the pure frame numbers for the last action if Y > 36.
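As an illustration of how a planner might execute such rules, the sketch below implements Strategies A, B and C; the Shot structure and its fields are invented for the example, and the frame thresholds are those quoted above (the frame-count precondition of Strategy C is assumed to be checked by the caller):

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class Shot:
        camera_distance: str                  # 'close-up', 'medium', 'medium-long', 'long', ...
        frames: int
        actions: List[Tuple[str, int, int]]   # (action, start frame, end frame), in shot order

    DISTANCES = ["close-up", "medium", "medium-long", "long", "extreme-long"]

    def target_length(shot: Shot, sequence_kind: str) -> int:
        """Strategies A and B: rhythm-motivated target lengths in frames."""
        if shot.camera_distance == "close-up":                        # Strategy A
            return 60
        if (DISTANCES.index(shot.camera_distance) > DISTANCES.index("medium-long")
                and sequence_kind in ("realisation", "resolution")):  # Strategy B
            return 136
        return shot.frames                                            # no rule applies

    def edge_trim(shot: Shot, tempform: str) -> Tuple[int, int]:
        """Strategy C: cut away the head/tail where the first/last action runs in
        isolation, provided it overlaps its neighbour by more than 36 frames."""
        start, end = 0, shot.frames
        d = DISTANCES.index(shot.camera_distance)
        applies = (DISTANCES.index("close-up") < d < DISTANCES.index("long")
                   and tempform == "equivalence"
                   and len(shot.actions) > 3)
        if applies:
            first, second = shot.actions[0], shot.actions[1]
            if first[2] - second[1] > 36:     # overlap X of first and second action
                start = second[1]             # drop the frames where action 1 is alone
            prev, last = shot.actions[-2], shot.actions[-1]
            if prev[2] - last[1] > 36:        # overlap Y of last and last-1 action
                end = prev[2]                 # drop the frames where the last action is alone
        return start, end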

For a more detailed theoretical discussion of automated film editing see [Nack 1996].

Editing environments as described above not only support the traditional linear presentation of media material but also demonstrate new ways of media presentation and exploitation. Cinemas, for example, connected via satellite or fiber-optic lines to the server of their distributor, are in a position to tailor films for audiences on a macro as well as micro level (independent films, films for fringe groups, etc.). As a result, people in New York might see a different version of a film than audiences in Los Angeles, Berlin, Sydney or Tokyo, and even within a particular town different versions might be available at any time. Comparable developments can be envisioned on the personal consumer level, by providing means to purchase a special version of a film online and have it delivered in a suitable format (online, DVD, videocassette). We could even move a step further if we take current developments of digital humans into consideration (see the work at the Centre for Human Modeling and Simulation [HMS 2001] and the Digital Life consortium at the MIT Media Lab [MIT 2001]). A possible outcome might be that audiences not only request particular storylines but can choose from actors, themes and genres, and a system then generates their individualised film/animation. However, such availability of choice requires different ways of developing story ideas (more on an interactive, non-linear, object-oriented level), which will also influence the production of material – and of course will change the way of communication and discourse. It would not be astonishing if we were to see developments similar to those during the introduction of private TV, which resulted in a much larger diversity of styles people preferred to communicate in, about a wider variety of topics than public TV could ever offer. This diversity, in combination with the idea of compelling experience, introduces concepts such as arbitrariness, vagueness, and
constant progression into our notion of discourse. The following section investigates this changing view on knowledge, and the resulting use of it, using the example of encyclopaedic spaces for the domain of theory, history, and anthropology of film.

3. ENCYCLOPAEDIC SPACES

The implication of an encyclopaedic space is that the notion of a 'piece of work' vanishes and leaves space for a creative and productive cycle, a living environment allowing all sorts of continuous enhancements. If we only had the information gathered during the production of media, including its reuse and modifications, we would still lack knowledge about the intrinsic meaning of the material. If we think about an encyclopaedic space for film, we would expect investigations to follow, like most research in arts and humanities, an interpretative, associative method based on historic-cultural materials, including primary sources as well as secondary materials. The main objective is a discourse-oriented collective interpretation of questions that allows, by following the branches of interdependencies, a comparison of the most diverse theories, originally based on different perspectives [Andrew 1984]. An analysis of film addresses some of the following issues [Andrew 1984]:
• The raw material, including questions about the medium and its relations to reality, photography, and illusion, its use of time and space, colour, sound, props, actor make-up, etc.
• The methods and techniques, including questions on the creative and technological processes which shape and treat the raw material, as well as the underlying psychology or economics.
• The forms and shapes of film, containing questions about film categories, the adaptation of other art forms, genres, and audience expectations.
• The purpose and value of cinema, seeking the goal of cinema for humankind.
Given access to a digital knowledge space as outlined above, experts would be able to exchange the traditional linear manuscript-based discourse for a conceptual information space that provides the simultaneous comparison of a theoretical train of thought including the full audio-visual data. As a result, the community as a whole develops and strengthens its own knowledge and practice – or, in other words, provides through the information space its perspectives on the domain. It is this process of 'perspective making' [Boland & Tenkasi 1995] that underpins the building
of a community's identity: its basic assumptions, goals, terminology, and modes of discourse. On the other hand, the accumulated information is in most cases not only accessible to the domain community but also to other interested parties, such as the general public, either for pleasure or for educational purposes in the context of life-long learning. We should also not forget that economic reasons will stimulate such developments. The problem we are then faced with is that of 'perspective taking' [Boland & Tenkasi 1995], which refers to the process of trying to engage with another community's perspective. This can be difficult when the respective ways of knowing assume different agendas or do not match at all. The access to relevant information, and its presentation in an appropriate stylistic way (i.e. shaped rhythmically and thematically into rich information textures), requires perspective management in the form of a dynamic and adaptive generation of information presentations. This is particularly necessary if the interpretation of audio-visual material is based on other audio-visual material, e.g. showing, in an educational application, the referential character of the station scene from De Palma's 'The Untouchables' with respect to the 'Odessa Steps' scene in Eisenstein's 'Battleship Potemkin' by comparing shots next to each other (Figure 1 provided a simplified representation of this example).

As a result of the two conceptual views on the content of the semantic network, an architecture is required where
• the definitions of concepts and relations will be left to the domain experts to allow for consistent structuring of the information space on a semantic as well as syntactic level, obviously requiring a sort of editorial board;
• the encoding of these structures and their spatio-temporal identification marks for audio-visual data is generic;
• the established annotations can be used, like any other information node in the network, to guide users who explore the domain;
• the definition of particular presentation styles will be left to those responsible for environments that provide access to a knowledge space, i.e., information brokers, publishers in a broader sense, all sorts of cultural institutions, educational environments, private users, etc. These presentation styles provide the means to generate customised, context-oriented content presentation on the fly, due to their affiliation with the consistent data structures created by the domain experts.

The interesting aspects of such an architecture for our current investigation are those that involve active human participation. We begin with the discussion of an Information Space Editing Environment for the
development of information nodes and relations, and conclude with ideas regarding dynamic information presentation.
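To make the architectural requirements above concrete, the following Python sketch illustrates one possible minimal data model for such a semantic network: an expert-defined ontology constrains node concepts and relation types, and every node and link carries its author. All names are illustrative assumptions, not part of any system cited in this chapter.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

@dataclass
class Ontology:
    # Expert-defined vocabulary that keeps the space consistent.
    concepts: Set[str] = field(default_factory=set)
    relation_types: Set[str] = field(default_factory=set)

@dataclass
class Node:
    node_id: str
    concept: str   # must be drawn from the ontology
    content: str   # a description or a media locator
    author: str    # every contribution is attributable

@dataclass
class Relation:
    rel_type: str  # must be drawn from the ontology
    source: str
    target: str
    author: str

class SemanticNetwork:
    def __init__(self, ontology: Ontology):
        self.ontology = ontology
        self.nodes: Dict[str, Node] = {}
        self.relations: List[Relation] = []

    def add_node(self, node: Node) -> None:
        # Reject concepts outside the expert-defined schema.
        if node.concept not in self.ontology.concepts:
            raise ValueError(f"unknown concept: {node.concept}")
        self.nodes[node.node_id] = node

    def relate(self, rel: Relation) -> None:
        # Relations are equally ontology-controlled.
        if rel.rel_type not in self.ontology.relation_types:
            raise ValueError(f"unknown relation type: {rel.rel_type}")
        self.relations.append(rel)
```

Presentation style is deliberately absent from this model: in the architecture sketched above it belongs to those who run the access environments, not to the data structures maintained by the domain experts.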

3.1 Information Space Editing Environment (ISEE)

This environment serves the specific needs of generating semantic networks as a side effect of working with different multimedia materials, gathering information, and evaluating the materials. Through an annotation user interface based on a standard XML browser, the ISEE makes it possible to
• mark up single objects or regions in a text, 2D or 3D image, graphic, photo, video, audio, or animated 3D information unit,
• annotate these marked sections using special forms and menus,
• re-edit the marked document passage or the additional annotation nearly in real time under the same Web interface.
In this way, ‘dynamic information authoring’ can be performed. All interim findings are conserved and made available as a basis for ongoing analysis. The required tools are the following:

Ontology editor
It allows the definition of concepts and relations in the form of task-specific, controlled vocabulary/subject indexing schemata for in-depth semantic-based indexing of various media. Concepts and relations should be described as standard-based structures [Grosso et al. 1999].

Annotation editor
This permits the splitting of media content into small information units, their description on a denotative level, and the establishment of their conceptual relationships (connotative level). The description process is controlled by the defined ontology and follows a strata-oriented approach, which allows a fine-grained time- and space-oriented description of media content (a sketch of such a stratum follows below). By placing an annotation or a relation the creator leaves a mark in the network, since every placed item has an author. Examples of such environments are provided by [Auffret 1999].

Semantic network editor
It provides the means of defining rhetorical structures on information units. These structures are also ontology-controlled and are used during the presentation, in combination with user profiles and the presentation plans, for the delivery of content (see an example of discourse ontologies in Buckingham Shum & Selvin [2000]).
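A strata-oriented annotation can be thought of as a set of descriptions anchored to spatio-temporal slices of a media stream. The sketch below, under the same illustrative assumptions as before, shows how an annotation editor might record such a stratum: a time interval, an optional image region, an ontology-controlled term, a free-text note, and the author who placed it.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Stratum:
    """One layer of description over a spatio-temporal slice of a medium."""
    media_id: str
    t_start: float  # seconds into the stream
    t_end: float
    region: Optional[Tuple[int, int, int, int]]  # (x, y, w, h) in the frame, if spatial
    concept: str    # ontology-controlled term (denotative level)
    note: str       # free-text description
    author: str     # every placed item leaves a mark in the network

# A single shot can carry several overlapping strata, each describing
# a different aspect of the same footage.
shot_annotations: List[Stratum] = [
    Stratum("news_item_042", 12.0, 18.5, (410, 300, 80, 55),
            concept="object:flag", note="flag in upper right corner",
            author="annotator_03"),
    Stratum("news_item_042", 12.0, 31.0, None,
            concept="event:press_conference", note="opening statement",
            author="annotator_03"),
]
```

Connotative relations between such strata would then be expressed as ontology-controlled links in the surrounding semantic network, each again stamped with its author's id.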


Such an authoring environment is naturally domain dependent. One of the future challenges will be to allow cross-domain relationships. In our example of the theory, history, and anthropology of film, this means that features and extraction methods developed for the broadcasting domain, i.e. colour-based classifiers or movement detectors in MPEG-7, should be reusable for the interpretational level as applied in film criticism. Consider the image in Figure 5, which is taken from Bertolucci's "The Last Emperor". One of the key structural elements in this movie is a colour code, which accompanies the different stages of Pu Yi's life. For example, when he cuts his veins, he sees, for the first time in the film, red as a pure colour.

[Figure 5. A scene from Bertolucci's "The Last Emperor" (1987) and a structural representation of its interpretation by one researcher. The network shown links a low-level feature description in which colour is one parameter, a spatial-temporal descriptor, the content description ‘Attempt of suicide’, an actor description, and the concepts of birth and re-birth.]

It must be possible for a scholar to analyse the colour code in Bertolucci's "The Last Emperor" while watching the film. Thus, the menu option for colour identification allows the scholar to click on the video at run-time, resulting in a colour representation. In Figure 5 the dark node presents this representation. At the same time the system identifies the region in the frame and establishes a ‘spatial-temporal descriptor’, describing the potential region (the bloodstain) and, by means of object tracking, the time period in which it appears. The relation between the colour and the spatial-temporal descriptor, as well as the anchor to the data, is established automatically.

The scholar might like to annotate the material even further. She creates a content description for the scene in natural language, as conveyed by the node ‘Content description’ in Figure 5. Within this description the name of the actor, who in fact cannot be seen in the scene, can be linked to the already existing actor description, as demonstrated by the link between the nodes ‘Content description’ and ‘Actor description’. Every search for the actor in the film will now also retrieve this sequence. A further interpretation of the material might bring the researcher to the idea that red represents birth. Thus, the complexity of this image derives from the concept of suicide, which merges with the idea of birth into that of re-birth. By establishing a relation between the term suicide and the two concepts of birth and re-birth, taken from an ontology of psychology, the researcher can easily generate a complex interpretation in a human- and machine-readable form. Since each of the links is associated with the id of the researcher, comment nodes of other researchers can be identified without difficulty.

Extending the above example, the expert can collect a number of features, such as colour, shape, brightness, movement, transition, and so forth, into a description of a particular style, such as
• Spielberg’s composition of light, colour and framing in epic films,
• the stylistic description of film noir based on low-level features,
• the collection of style criteria for the arts, such as for chiaroscuro, impressionism, or cubism.
It is then not only possible to facilitate the automatic identification of relevant material but also to use low-level features as a basis for the comparison of styles (a simple illustration of this feature-to-concept idea is sketched below). First approaches in that direction are described by Vries et al. [2000], Windhouwer [1999], Nack et al. [2001], Schreiber et al. [2001], and Dorai & Venkatesh [2001].

To allow experts to control the quality of meta-data, tools are required that support such conceptual work without becoming hopelessly complex. First attempts in that direction are described by Pauwels & Frederix [2000]. However, the established structures will introduce a complexity that makes it hardly possible for a novice to work with them. It is therefore essential to allow for intuitive, non-linear ways of accessing information, based on supportive interface mechanisms that provide a user with the complete and relevant multimedia data according to his or her skills. The next section addresses these issues.
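Before turning to presentation, the following sketch illustrates, in deliberately simplified form, how a low-level colour feature could serve as an entry point for interpretation. The RGB test and the threshold are assumptions standing in for a real classifier such as an MPEG-7 colour descriptor; frames flagged in this way are the ones to which a researcher's concept nodes, such as ‘re-birth’, could be anchored.

```python
from typing import Iterable, List, Tuple

Pixel = Tuple[int, int, int]  # (r, g, b)
Frame = List[Pixel]           # a frame as a flat list of pixels

def red_dominance(frame: Frame) -> float:
    """Fraction of pixels in which red clearly dominates; a crude
    stand-in for a proper low-level colour descriptor."""
    if not frame:
        return 0.0
    hits = sum(1 for (r, g, b) in frame if r > 150 and r > 2 * max(g, b))
    return hits / len(frame)

def candidate_frames(frames: Iterable[Frame], threshold: float = 0.15) -> List[int]:
    """Indices of frames where 'pure red' is salient enough to warrant
    a spatial-temporal descriptor and, later, an interpretational link."""
    return [i for i, f in enumerate(frames) if red_dominance(f) >= threshold]
```

A style description, in this view, is a named bundle of such feature tests, which is why the same machinery could serve both broadcast retrieval and film criticism.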

3.2 Dynamic Presentation Environment (DPE)

The main problem we face in interactive, media-based, and data-rich environments is not the expert or semi-expert, but rather the user who has no idea about the classification of structures, the relevance of nodes, the specification of relation types, or the mechanisms that facilitate dynamic use and reuse. Most of the time this type of user does not even have an idea of the available material. What is required is an approach that facilitates trouble-free access to the information and then captures the object or objects of interest to a user during the browsing session. Consequently, the screen is understood not as a window but rather as a dynamic frame that traverses the information space developed by the domain experts, in which the importance of nodes and relation types is specified. This specification might either be assigned manually (e.g. the expert defines a coherent order of nodes) or be established automatically, for example by valuing the number of accesses. Such information can be used for the determination of access points and browsing directions (see also section 1.1.2, Relations and anchors).

By capturing the intentionality of an information need that is assumed to develop during the browsing session, i.e. the explanation of interest in a media unit through presenting, pointing at, or otherwise indicating it, we are in a position to map the position in the interface onto the position in the network, of which the interface represents a fraction [Campbell 2000, Crestani et al. 1998]. In other words, the user browses the generated interface, where selecting an information unit triggers the query generator, which transforms the new point of interest, user profile, investigation history and current context into a new request for information (a sketch of such a query generator follows the list below). The retrieval results form the basis of the new interface. A detailed description of an interface that supports such behaviour is given in [Toepper 2000].

Such a dynamic concept of the browser (investigation of node surroundings) requires a presentation generator that basically performs as a constraint-based planning system, using information retrieved from relevant content description nodes, syntactical properties of appropriate relations or conceptual annotation nodes, the user model, and the definitions provided by a design specialist or the content space owner. The required tasks of the generator are to

• analyse the retrieved material based on the user model generated over time during the browsing session,
• redesign the new presentation according to design issues such as graphic direction, scale, volume, depth, and style. The automated generation processes will thus be performed at different abstraction levels, i.e. communication devices, qualitative, quantitative and multidimensional constraints on a spatial and temporal level [Rutledge et al. 1999, Rutledge et al. 2000], and media integration (e.g. assigning filters to media objects to vary their style for closer integration with the rest of the presentation [Vries et al. 2000, Windhouwer 1999]),
• update the user model (e.g. user preferences), browsing history and the current context setting on the client side,
• provide a format that a hypermedia browser can interpret (e.g. [SMIL 2001] or [ISO-MPEG-4 2000]).
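The query-generation step named above can be made concrete with a small sketch. The scoring scheme and the data shapes are assumptions for the sake of illustration, not a description of any of the systems cited; the point is merely that point of interest, profile and history are fused into one weighted request.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class UserModel:
    interests: Dict[str, float] = field(default_factory=dict)

    def update(self, concept: str, weight: float = 1.0) -> None:
        # Ostensive evidence: every selection strengthens an interest.
        self.interests[concept] = self.interests.get(concept, 0.0) + weight

def build_query(selected: str, model: UserModel,
                history: List[str]) -> Dict[str, float]:
    """Fuse the new point of interest with the user profile and the
    recent browsing history into a weighted request for information."""
    query = dict(model.interests)
    query[selected] = query.get(selected, 0.0) + 2.0  # current focus dominates
    for past in history[-5:]:                         # recent context counts a little
        query[past] = query.get(past, 0.0) + 0.5
    return query

# Usage: a click on 'montage' while browsing refines the next retrieval.
model, history = UserModel(), ["eisenstein", "colour_code"]
model.update("eisenstein")
print(build_query("montage", model, history))
```

The retrieved material would then be handed to the constraint-based planner, which performs the analysis, redesign, update and formatting steps listed above.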

Systems that explore and develop innovative techniques for adaptive [Brusilovsky et al. 1998, De Bra & Calvi 1998] or adaptable [Rutledge et al. 1999] presentations based on the above requirements are described by [Andre et al. 1996, Andre et al. 2000, Boll et al. 2000, Kamps 1999, van Ossenbruggen 2001]. Based on such environments, users can investigate an unknown space, provided with the material and annotations most relevant for the actual moment and presentation platform, allowing a progressive experience of completing the understanding of complex concepts in procedural and participatory ways (i.e. interactive and investigative in a navigable encyclopaedic space, providing access to the full temporal and spatial means of the media items, where possible). Such an experience yields an understanding of a concept more primal and powerful than any appeal through normal text in a linear logical form.

However, to encourage users other than experts to participate rather than just consume, mechanisms for social indexing need to be provided, such as suggesting new relations between established nodes, providing extra information, or posing questions about particular aspects of the content. The problem is that this input needs to be integrated in such a way that the structural consistency is not damaged (one conceivable gate-keeping mechanism is sketched below). As a consequence, we need tools that allow the general public to contribute to the communal information space in an intuitive and easy way – which is the most difficult endeavour.
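One conceivable way to accept public input without damaging structural consistency is to keep lay contributions in a moderated queue that validates them against the expert-defined schema before they can enter the network. The sketch below is an assumption about how such a gate might look, not a description of an existing tool.

```python
from dataclasses import dataclass
from typing import List, Set

@dataclass
class Suggestion:
    rel_type: str     # proposed relation, e.g. "refers_to"
    source: str       # ids of established nodes
    target: str
    comment: str      # extra information or a question about the content
    contributor: str

class ModerationQueue:
    def __init__(self, valid_rel_types: Set[str], known_nodes: Set[str]):
        self.valid_rel_types = valid_rel_types
        self.known_nodes = known_nodes
        self.pending: List[Suggestion] = []

    def submit(self, s: Suggestion) -> bool:
        """Accept only suggestions that reuse established nodes and
        ontology-sanctioned relation types; everything else is rejected
        before it can touch the network structure."""
        ok = (s.rel_type in self.valid_rel_types
              and s.source in self.known_nodes
              and s.target in self.known_nodes)
        if ok:
            self.pending.append(s)  # awaits review by the editorial board
        return ok
```

Whether such a gate is strict enough to protect consistency, yet light enough to feel intuitive, is exactly the open question raised above.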

4. CONCLUSION

In this chapter we argued that the technology for new media to date follows the strains of traditional communication by supporting the linear representation of an argument, resulting in a final product of context-restricted content. However, we also tried to show that the emerging information society is heading towards information generation, where ideas are communicated in various forms and media, in which old and new information intermix and context is a variable concept, and where people are aware that the notion of a ‘work’ disappears, to be replaced by a living environment allowing all sorts of processes.

The focus of our argument was that the traditional linear approach to generating information, and thus meaning, is far too restrictive, as any form of information is necessarily imperfect, incomplete and preliminary, because it accompanies and documents the progress of interpretation and understanding of a concept. Consequently, we described the need for collective sets of descriptions growing over time. We suggested and described adaptive environments that serve as integrated information spaces for use in distributed media production and media-based research, and that facilitate direct access and navigation by the general public.

We are aware that the approaches described in this chapter are but a small step towards the intelligent use and reuse of media-based information. In fact, far more work is required on flexible formal annotation mechanisms and structures, but also on tools that first support human creativity in creating the best material for the required task and additionally use the creative act to extract the significant syntactic, semantic and semiotic aspects of the content description. We mentioned some of the main social and economic implications, though not at a sufficiently detailed level. We hope that some of the ideas presented in this chapter can stimulate a fruitful discussion of important issues such as Intellectual Property Rights (IPR), access rights (who will be provided with the tools for influencing the information spaces), economic models for access and distribution (support of socio-contributions, i.e. paying by providing new information, either in the form of material, such as audio, video, or text, or by suggesting new relations), and the social implications of distributed work.

The future may see the emergence of systems and environments that in some respects not only represent a communal repository, thus supporting our intellect, but also access the memory of every single user by working on primary sensory material, such as sound, colour, or light, reaching the deepest layers of the emotional memory. Such spaces will become an integral part of human culture, helping us to face the contradictions of life itself, because contradiction is an essential part of the informational nature of these spaces.

REFERENCES

Aguierre Smith, T. G., & Davenport, G. (1992). The Stratification System: A Design Environment for Random Access Video. In ACM Workshop on Networking and Operating System Support for Digital Audio and Video, San Diego, California.
Aigrain, P., Joly, P., & Longueville, V. (1995). Medium Knowledge-Based Macro-Segmentation of Video into Sequences. In M. Maybury (Ed.), IJCAI 95 Workshop on Intelligent Multimedia Information Retrieval (pp. 5-16), Montréal, August 19, 1995.
Andre, E., Muller, J., & Rist, T. (1996). WIP/PPP: Knowledge-Based Methods for Fully Automated Multimedia Authoring. London, UK, 1996.


Andre, E., Muller, J., & Rist, T. (2000). Presenting Through Performing: On the Use of Multiple Lifelike Characters in Knowledge-Based Presentation Systems. In Proceedings of the Second International Conference on Intelligent User Interfaces (IUI 2000), pp. 1-8.
Andrew, D. (1984). Concepts in Film Theory. Oxford: Oxford University Press.
Arnheim, R. (1956). Art and Visual Perception: A Psychology of the Creative Eye. London: Faber & Faber.
Auffret, G., Carrive, J., Chevet, O., Dechilly, T., Ronfard, R., & Bachimont, B. (1999). Audiovisual-based Hypermedia Authoring: Using structured representations for efficient access to AV documents. In Proceedings of the 10th ACM Conference on Hypertext and Hypermedia, pp. 169-178.
Bailey, B. P., Konstan, J. A., & Carlis, J. V. (2001). DEMAIS: Designing Multimedia Applications with Interactive Storyboards. In Proceedings of the 9th ACM International Conference on Multimedia, pp. 241-250, Ottawa, Canada, Sept. 30 - Oct. 5, 2001.
Bettetini, G. (1973). The Language and Technique of the Film. The Hague: Mouton Publishers.
Bloch, G. R. (1986). Elements d'une Machine de Montage pour l'Audio-Visuel. Ph.D. thesis, Ecole Nationale Supérieure des Télécommunications.
Bloom, P. J. (1985). High-quality digital audio in the entertainment industry: An overview of achievements and challenges. IEEE Acoustics, Speech and Signal Processing Magazine, 2, 2-25.
Boland, R. J. & Tenkasi, R. V. (1995). Perspective Making and Perspective Taking in Communities of Knowing. Organization Science, 6(4), 350-372.
Boll, S., Klas, W., & Wandel, J. (2000). A Cross-Media Adaptation Strategy for Multimedia Presentations. In ACM Multimedia '99 Proceedings, pp. 37-46, Orlando, Florida, October 30 - November 5, 1999. Addison Wesley Longman.
Bordwell, D. (1989). Making Meaning: Inference and Rhetoric in the Interpretation of Cinema. Cambridge, Massachusetts: Harvard University Press.
Brachman, R. J. & Levesque, H. J. (1983). Readings in Knowledge Representation. San Mateo, California: Morgan Kaufmann Publishers.
Brooks, K. M. (1999). Metalinear Cinematic Narrative: Theory, Process, and Tool. Ph.D. thesis, MIT.
Brusilovsky, P., Kobsa, A., & Vassileva, J. (Eds.) (1998). Adaptive Hypertext and Hypermedia. Kluwer Academic Publishers, Dordrecht, The Netherlands.
Buckingham Shum, S. & Selvin, A. (2000). Structuring Discourse for Collective Interpretation. Electronic proceedings of the Open Conference on Collective Cognition and Memory Practices, September 19-20, 2000. http://www.limsi.fr/WkG/PCD2000/indexeng.html
Campbell, I. (2000). Interactive Evaluation of the Ostensive Model Using a New Test Collection of Images with Multiple Relevance Assessments. Information Retrieval, 2(1), 87-114. Kluwer Academic Publishers, Boston.
Carroll, J. M. (1980). Toward a Structural Psychology of Cinema. The Hague: Mouton Publishers.
Collier, G. H. (1987). Thoth-II: Hypertext with Explicit Semantics. In Hypertext '87 Proceedings, pp. 269-289, Chapel Hill, North Carolina. ACM Press, November 13-15, 1987.
Crestani, F., Lalmas, M., Van Rijsbergen, C. J., & Campbell, I. (1998). "Is this document relevant? ... probably": A survey of probabilistic models in information retrieval. ACM Computing Surveys, 30(4), 528-552.
DAML+OIL (2001). http://www.daml.org


Davis, M. (1995). Media Streams: Representing Video for Retrieval and Repurposing. Ph.D. thesis, MIT.
De Bra, P. & Calvi, L. (1998). AHA: A Generic Adaptive Hypermedia System. In Proceedings of the 2nd Workshop on Adaptive Hypertext and Hypermedia, Pittsburgh, June 1998, pp. 5-17.
Del Bimbo, A. (1999). Visual Information Retrieval. Morgan Kaufmann, San Francisco, USA.
DeRose, S. & Durand, D. (1994). Making Hypermedia Work: A User's Guide to HyTime. Kluwer Academic Publishers, Boston.
Dorai, C. & Venkatesh, S. (2001). Bridging the Semantic Gap in Content Management Systems: Computational Media Aesthetics. In Proceedings of the First Conference on Computational Semiotics for Games and New Media (COSIGN 2001), pp. 94-99, Amsterdam, 10-12 September 2001.
Dramatica (2001). http://www.dramatica.com/
Eco, U. (1985). Einführung in die Semiotik. München: Wilhelm Fink Verlag.
Eisenstein, S. M. (1991). Selected Works: Towards a Theory of Montage. London: BFI Publishing.
Fehlis, H. (1999). Hybrides Trackingsystem für virtuelle Studios. Fernseh- + Kinotechnik, 53(5).
Final Draft (2001). http://www.finaldraft.com/
Gibson, W. (1986). Neuromancer. Phantasia Press, 1st Phantasia Press ed., West Bloomfield.
Gregory, J. R. (1961). Some Psychological Aspects of Motion Picture Montage. Ph.D. thesis, University of Illinois.
Greimas, A. J. (1983). Structural Semantics: An Attempt at a Method. Lincoln: University of Nebraska Press.
Grosso, W. E., Eriksson, H., Fergerson, R. W., Gennari, J. H., Tu, S. W., & Musen, M. A. (1999). Knowledge Modeling at the Millennium (The Design and Evolution of Protégé-2000). Stanford Medical Informatics (SMI), Report SMI-1999-0801, Stanford, USA.
Gupta, A. & Jain, R. (1997). Visual information retrieval. Communications of the ACM, 40, 71-79.
Halasz, F. G. (1988). Reflections on NoteCards: Seven Issues for the Next Generation of Hypermedia Systems. Communications of the ACM, 31(7).
Hardman, L. (1998). Modelling and Authoring Hypermedia Documents. Ph.D. thesis, University of Amsterdam.
Hirata, K. (1995). Towards Formalizing Jazz Piano Knowledge with a Deductive Object-Oriented Approach. In Proceedings of the IJCAI Workshop on Artificial Intelligence and Music, pp. 77-80, Montreal.
HMS (2001). http://hms.upenn.edu/
ISO HyTime (1997). ISO/IEC JTC1/SC18 WG8, W1920 rev., Hypermedia/Time-Based Structuring Language (HyTime), 2nd ed., International Organization for Standardization, Geneva, May 1997.
ISO MHEG (1997). International Standard ISO/IEC 13522-5:1997 (MHEG-5).
ISO MPEG-4 (2000). MPEG-4 Overview (V.15 - Beijing Version), ISO/IEC JTC1/SC29/WG11 N3536, Beijing, July 2000.
ISO MPEG-7 DDL (2001). Text of ISO/IEC FCD 15938-2 Information Technology - Multimedia Content Description Interface - Part 2: Description Definition Language, ISO/IEC JTC 1/SC 29/WG 11 N4288, 19/09/2001.


ISO MPEG-7 MDS (2001). Text of ISO/IEC 15938-5/FCD Information Technology - Multimedia Content Description Interface - Part 5: Multimedia Description Schemes, ISO/IEC JTC 1/SC 29/WG 11 N4242, 23/10/2001.
Janowitz, M. & Street, D. (1966). The Social Organization of Education. In P. H. Rossi & B. J. Biddle (Eds.), The New Media and Education. Chicago: Aldine.
Johnson, S. E., Jourlin, P., Spärck Jones, K., & Woodland, P. C. (2000). Audio Indexing and Retrieval of Complete Broadcast News Shows. In RIAO 2000 Conference Proceedings, Vol. 2, pp. 1163-1177, Collège de France, Paris, France, April 12-14, 2000.
Kamps, T. (1999). Diagram Design: A Constructive Theory. Springer Verlag.
Kuleshov, L. (1974). Kuleshov on Film: Writings of Lev Kuleshov. Berkeley: University of California Press.
Lemström, K. & Tarhio, J. (2000). Searching Monophonic Patterns within Polyphonic Sources. In RIAO 2000 Conference Proceedings, Vol. 2, pp. 1163-1177, Collège de France, Paris, France, April 12-14, 2000.
Lévy, P. (1994). L'intelligence collective. Pour une anthropologie du cyberspace. Édition la Découverte, Paris.
Lindley, C. (2000). A Video Annotation Methodology for Interactive Video Sequence Generation. In BCS Computer Graphics & Displays Group Conference on Digital Content Creation, Bradford, UK, 12-13 April 2000.
Lindley, C. & Nack, F. (2000). Hybrid narrative and categorical strategies for interactive and dynamic video presentation generation. The New Review of Hypermedia and Multimedia, 6, 111-146. Taylor Graham.
Melucci, M. & Orio, N. (2000). SMILE: A System for Content-based Musical Information Retrieval Environments. In RIAO 2000 Conference Proceedings, Vol. 2, pp. 1261-1279, Collège de France, Paris, France, April 12-14, 2000.
Metz, C. (1974). Film Language: A Semiotics of the Cinema. New York: Oxford University Press.
Mills, T. J., Pye, D., Hollinghurst, N. J., & Wood, K. R. (2000). AT&TV: Broadcast Television and Radio Retrieval. In RIAO 2000 Conference Proceedings, Vol. 2, pp. 1135-1144, Collège de France, Paris, France, April 12-14, 2000.
MIT (2001). http://gonzo.media.mit.edu/public/web/consortium.php?id=1
Movie Magic Screenwriter (2001). http://www.scriptthing.com/MMS2k_site.html
Movie Master (2001). Comprehensive Cinema Software, 148 Veterans Drive, Northvale, NJ 07647.
Nack, F. (1996). AUTEUR: The Application of Video Semantics and Theme Representation in Automated Video Editing. Ph.D. thesis, Lancaster University.
Nack, F. & Putz, W. (2001). Designing Annotation Before It's Needed. In Proceedings of the 9th ACM International Conference on Multimedia, pp. 251-260, Ottawa, Canada, Sept. 30 - Oct. 5, 2001.
Nack, F., Windhouwer, M., Pauwels, E., Huijberts, M., & Hardman, L. (2001). The Role of High-level and Low-level Features in Semi-automated Retrieval and Generation of Multimedia Presentations. CWI Technical Report INS-R0108, 2001.
Ossenbruggen, J. van, Cornelissen, F., Geurts, J., Rutledge, L., & Hardman, L. (2001). Towards Second and Third Generation Web-based Multimedia. In The Tenth International World Wide Web Conference, pp. 479-488, Hong Kong, May 1-5, 2001.
Parkes, A. P. (1989). An Artificial Intelligence Approach to the Conceptual Description of Videodisc Images. Ph.D. thesis, Lancaster University.
Pauwels, E. & Frederix, G. (2000). Image Segmentation by Nonparametric Clustering Based on the Kolmogorov-Smirnov Distance. In Proceedings of ECCV 2000, 6th European Conference on Computer Vision, Dublin, pp. 85-99, June 2000.


Peirce, C. S. (1960). The Collected Papers of Charles Sanders Peirce: 1. Principles of Philosophy and 2. Elements of Logic. Edited by Charles Hartshorne and Paul Weiss. Cambridge, MA: The Belknap Press of Harvard University Press.
RDF (2001). http://www.w3.org/RDF/
RDF Schema (2001). http://www.w3.org/TR/2000/CR-rdf-schema-20000327/
Robertson, J., De Quincey, A., Stapleford, T., & Wiggins, G. (1998). Real-Time Music Generation for a Virtual Environment. In Proceedings of the ECAI-98 Workshop on AI/Alife and Entertainment, August 24, 1998, Brighton.
Rutledge, L., van Ossenbruggen, J., Hardman, L., & Bulterman, D. C. A. (1999). Mix'n'Match: Exchangeable Modules of Hypermedia Style. In Proceedings of the 10th ACM Conference on Hypertext and Hypermedia, pp. 179-188, February 21-25, Darmstadt, Germany.
Rutledge, L., Davis, J., van Ossenbruggen, J., & Hardman, L. (2000). Inter-dimensional Hypermedia Communicative Devices for Rhetorical Structure. In Proceedings of the International Conference on Multimedia Modeling 2000 (MMM00), November 13-15, 2000, Nagano, Japan.
Sack, W. (1993). Coding News and Popular Culture. In The International Joint Conference on Artificial Intelligence (IJCAI-93) Workshop on Models of Teaching and Models of Learning, Chambéry, Savoie, France.
Santini, S. & Jain, R. (2000). Integrated Browsing and Querying for Image Databases. IEEE MultiMedia, July-September 2000, pp. 26-39. IEEE Computer Society.
Schreiber, A. T. G., Dubbeldam, B., Wielemaker, J., & Wielinga, B. (2001). Ontology-based Photo Annotation. IEEE Intelligent Systems, 16(3), 66-74, May/June 2001. http://www.computer.org/intelligent/ex2001/x3066abs.htm
Semantic Web (2001). http://www.semanticweb.org
SMIL (2001). http://www.w3.org/TR/2001/REC-smil20-20010807/
SMPTE (1999). Dynamic Data Dictionary Structure, 6th Draft, September 1999.
Sowa, J. F. (1984). Conceptual Structures: Information Processing in Mind and Machine. Reading, MA: Addison-Wesley Publishing Company.
Steinmetz, A. & Hemmje, M. (1994). Konzeption eines digitalen Videoeditiersystems auf Basis des internationalen Standards ISO/IEC 11172 (MPEG-1). Sankt Augustin, GMD: 134.
Streitz, N. A., Tandler, P., Müller-Tomfelde, C., & Konomi, S. (2001). Roomware: Towards the Next Generation of Human-Computer Interaction based on an Integrated Design of Real and Virtual Worlds. In J. A. Carroll (Ed.), Human-Computer Interaction in the New Millennium, pp. 553-578. Addison Wesley.
TALC (1999). http://www.de.ibm.com/ide/solutions/dmsc/
Toepper, H. (2000). Sergej Eisenstein: A documentation about his work and life. In 1st Workshop on Digital Storytelling, Darmstadt, Germany, 15-16 June 2000, pp. 59-72. http://www.zgdv.de/distel2000/
Tonomura, Y., Akutsu, A., Taniguchi, Y., & Suzuki, G. (1994). Structured Video Computing. IEEE MultiMedia, 1(3), 34-43.
Vries, A. P. de, Windhouwer, M. A., Apers, P. M. G., & Kersten, M. L. (2000). Information Access in Multimedia Databases based on Feature Models. New Generation Computing, 18(4), 323-339, October 2000.
Windhouwer, M. A., Schmidt, R. A., & Kersten, M. L. (1999). Acoi: A System for Indexing Multimedia Objects. In International Workshop on Information Integration and Web-based Applications & Services, Yogyakarta, Indonesia, November 1999.

Wold, E., Blum, T., Keislar, D., & Wheaton, J. (1996). Content-Based Classification, Search, and Retrieval of Audio. IEEE MultiMedia, 3(3), 27-36, Fall 1996.
XML (2001). http://www.w3.org/TR/REC-xml
XML Schema (2001). XML Schema Parts 0-2: Primer, Structures, Datatypes. W3C Recommendations, 2 May 2001. http://www.w3.org/TR/xmlschema-0/, http://www.w3.org/TR/xmlschema-1/, http://www.w3.org/TR/xmlschema-2/
