A Multiple-Interpretation Framework for Modelling Video Semantics

C. A. Lindley ([email protected]; fax: +61-2-9325-3101)
CSIRO Mathematical and Information Sciences
E6B, Macquarie University Campus
North Ryde, NSW, Australia, 2113

Abstract

This paper presents an approach to the modelling of video semantics based upon the concept of multiple interpretation paradigms. Four broad types or levels of paradigms are identified: the diegetic level, the level of connotation, the level of subtext, and the cinematic level. These levels are explained, and examples are used to illustrate each level and the interactions between levels. Interpretation paradigms at the different levels provide a framework for modelling video semantics for search, retrieval, browsing and synthesis. The implications of this model for system infrastructure are also considered.

Introduction

Effective retrieval of video components is a critical element of video reuse, whether it is in the context of search, browsing, or the synthesis of more specific virtual video productions. Effective retrieval relies upon a characterisation of the contents of video databases, and search for relevant video components based upon those content descriptions. Content descriptions may take a number of different forms, each facilitating different kinds of retrieval operations suitable for different tasks. In general, more effective retrieval requires more effective description and characterisation of the semantics of video content.

Most research on video content representation to date has concentrated upon very general features of video data. General techniques, such as retrieval based upon visual data features, are amenable to automated content analysis, but have limited effectiveness from a user’s perspective. To make retrieval more relevant and precise, representations of video semantics are required. This has generally been addressed by using text and catalog descriptions of video data that are generated by hand. While effective for many applications, these approaches cannot handle queries expressed in language different from that used within textual annotations, and the text is generated for specific purposes that may not reflect the particular interests of a specific user. More sophisticated approaches to video content representation have started to model and represent content using knowledge representation techniques. These techniques go further than simple text, image, or structure-based approaches. However, very limited attention has been paid to the issue of what video semantics or content amounts to.

This paper seeks to address this limitation by proposing an approach to video content representation based upon the concept of multiple interpretations. In particular, it is proposed that in addition to characterisations by data properties, low level structure, and text, it is necessary to develop and represent multiple semantic descriptions at four levels: the diegetic level, the level of connotation, the level of subtext, and the cinematic level. Current attempts to represent video semantics have started to address the diegetic level, and the cinematic level to some degree, but there has been little or no work at the connotative or subtextual levels. This paper proposes that for comprehensive content representation, a research program is needed that addresses the articulation and representation of multiple content models (or interpretation paradigms) at all of these levels, as well as analysing and representing the relationships between models within and between the different levels.

Video Content Representation

Shot Detection and Representation

Approaches to video content representation are largely determined by the basic model of video that is being addressed. Where video is understood as an undifferentiated data stream, compressed or uncompressed, content analysis and representation have concentrated upon the detection of basic structural elements of the video data. In particular, much research has addressed the detection of shot boundaries. Shots are a fundamental unit of manipulation, and many algorithms and schemes have been developed for detecting shot boundaries (see Aigrain et al, 1996). Shot boundaries can include optical effects (cuts, fades, dissolves, wipes, etc.). Shot detection requires separating various factors of image change: motion (object and camera), luminosity changes and noise, and shot change (abrupt and progressive). There are two main types of method: those resting on differences in the statistical signatures of different types of change (eg. Srinivasan et al, 1997), and those resting on explicit modelling of motion or image content (eg. Zhang et al, 1995). Many effective algorithms are now available for temporal segmentation; a minimal detector of the first type is sketched below.
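As a rough illustration of the statistical-signature family of methods, the following sketch flags abrupt cuts by comparing grey-level histogram signatures of successive frames. It is a minimal sketch only, assuming the OpenCV (cv2) library; the bin count and threshold are illustrative values rather than figures from the cited work, and gradual transitions (fades, dissolves, wipes) would require windowed comparisons rather than a single frame-pair test.

```python
# Minimal sketch of statistical-signature cut detection. Assumes the
# OpenCV (cv2) library; bin count and threshold are illustrative only.
import cv2

def grey_signature(frame, bins=64):
    """Normalised grey-level histogram of a frame."""
    grey = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    hist = cv2.calcHist([grey], [0], None, [bins], [0, 256])
    return cv2.normalize(hist, hist).flatten()

def detect_cuts(video_path, threshold=0.5):
    """Return frame indices where successive signatures differ sharply."""
    capture = cv2.VideoCapture(video_path)
    cuts, previous, index = [], None, 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        signature = grey_signature(frame)
        if previous is not None:
            # Chi-square distance spikes at abrupt shot changes; fades
            # and dissolves would need a windowed (progressive) test.
            if cv2.compareHist(previous, signature,
                               cv2.HISTCMP_CHISQR) > threshold:
                cuts.append(index)
        previous, index = signature, index + 1
    capture.release()
    return cuts
```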

Visual Feature Detection and Representation

Shots provide a convenient unit of search and retrieval. Search and retrieval within a shot database of unspecified content must be carried out using the most general features of video data. That is, content-based retrieval must be conducted without the benefit of any specific knowledge about the content or semantics of the data. Types of similarity definitions, feature extraction, and similarity measures for systems that do not assume any specific image domain or a-priori semantic knowledge of the images include (from Aigrain et al, 1996): colour similarity, texture similarity, shape similarity, spatial similarity (preliminary model-based segmentation appears to be needed for overlapping objects), and object presence analysis.

Research into the detection of camera operation features (motion, pan/tilt, and zoom) builds upon basic visual feature characterisation by providing some basic semantic information about shots. Camera operation information figures significantly in the analysis and classification of video shots, since camera operations often reflect the communication intentions of the film director (Srinivasan et al, 1997). Camera operation analysis is also useful for indexing and retrieval because it allows the segmentation of longer shots into shorter units defined by homogeneous camera work, and can help in selecting good representative images or keyframes for a video shot (Aigrain et al, 1996).

Shot detection, visual feature characterisation, and camera operation detection have the very attractive characteristic that they can be largely, and perhaps fully, automated. However, these levels of description alone provide very limited effectiveness for applications concerned with what a video stream is “about”.
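To make the colour-similarity case above concrete, the sketch below ranks database keyframes against a query keyframe by histogram intersection. This is a minimal sketch assuming OpenCV; the HSV quantisation and the intersection measure are illustrative assumptions, not prescriptions from the cited survey.

```python
# Sketch of colour-based similarity retrieval over shot keyframes.
# Assumes OpenCV; the HSV quantisation is an illustrative choice.
import cv2

def colour_signature(image, h_bins=18, s_bins=8):
    """Quantised hue/saturation histogram as a compact colour feature."""
    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [h_bins, s_bins],
                        [0, 180, 0, 256])
    return cv2.normalize(hist, hist, norm_type=cv2.NORM_L1)

def rank_by_colour(query_keyframe, shot_keyframes):
    """Rank shots (a dict of id -> keyframe image) against the query."""
    query = colour_signature(query_keyframe)
    return sorted(
        ((cv2.compareHist(query, colour_signature(frame),
                          cv2.HISTCMP_INTERSECT), shot_id)
         for shot_id, frame in shot_keyframes.items()),
        reverse=True)  # larger intersection = more similar colour
```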


Use of Media Domain Knowledge

Zhang et al (1995) show how visual and temporal domain knowledge can be used to parse and index news video programs on the basis of their visual content. The domain knowledge used concerns the spatio-temporal structure of the images that occur in a highly stylised and constrained form of video production, ie. that of a regular news program. For shot classification, news anchor shots have a generally fixed range of temporal and spatial structures, while news shots do not generally have a fixed structure, and so are identified as those that do not conform to the anchor shot model. This model is an effective one for searching within a video database when the form of the search objects is highly standardised. As with the other approaches described above, by itself it does not involve representing what a shot is “about” beyond its classification as one of several broad types of shot occurring in the stylised program.
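The anchor-shot idea can be caricatured in code: label a shot as an anchor shot when its keyframe stays close to a known anchor-desk template, and as a news shot otherwise. This is a deliberately simplified stand-in, not Zhang et al’s actual algorithm; the single template-distance test and its threshold are assumptions of the sketch.

```python
# Simplified stand-in for anchor-shot classification in news parsing.
# Zhang et al (1995) use richer spatio-temporal structure models; here
# a single template distance (an assumption of this sketch) decides.
import cv2

def grey_signature(image, bins=64):
    grey = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    hist = cv2.calcHist([grey], [0], None, [bins], [0, 256])
    return cv2.normalize(hist, hist).flatten()

def classify_shot(keyframe, anchor_template, max_distance=0.3):
    """Shots close to the anchor-desk template are labelled 'anchor'."""
    distance = cv2.compareHist(grey_signature(keyframe),
                               grey_signature(anchor_template),
                               cv2.HISTCMP_BHATTACHARYYA)
    return "anchor" if distance < max_distance else "news"
```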

Text Representations of Video Semantics

Textual models of video can include a range of documents, including scripts, synopses, and standard delivery items. Steinmetz (1996) distinguishes the following kinds of document. Production data includes all forms of data produced, gathered and used during the production of a film or video. Conceptual data includes mainly the script and storyboard, but may include first sketches, brainstorming notes, and decisions and changes made during the actual production. Administrative data includes equipment lists, personnel tables, spreadsheets, financial data, contracts, etc. Video Meta Data (also called annotations) is data about the video material, eg. scene descriptions. Production data and conceptual data can be further broken down to include a source document (eg. a novel, report, or textbook), a synopsis, an outline, a treatment, a screenplay, and a shooting script. Each of these documents represents a different kind of description of the content of a video that can be used for search and retrieval operations using general text retrieval mechanisms.

The Informedia project at CMU has shown that combined speech, text, and image analysis can provide much more information, and hence higher performance in video content analysis and abstraction, than using any one medium by itself (Aigrain et al, 1996). This is because content-based retrieval alone cannot replace the functionality of parametric (SQL) search, text, and keywords for representing the rich semantic content of visual material.

Srinivasan et al (1997) describe a video model that combines image features, camera operations, low-level video structure, and text with a query strategy that tightly couples and synchronises structural and semantic objects in the video domain. The model supports parametric search and keyword search on structural and semantic information. Two classes of objects are distinguished (sketched below):

1. structured objects, representing structural components of the video stream. These include frames, camera motions, and shots.

2. semantic objects, which model the concepts presented in the video. The forms that semantic objects can take include catalog descriptions of the contents of the video, segment descriptions, textual dialog transcripts, and shot lists.
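The two object classes can be rendered as simple data types. The field names below are this sketch’s own assumptions, not the published schema of Srinivasan et al (1997).

```python
# Sketch of the two object classes, with assumed (not published) fields.
from dataclasses import dataclass, field

@dataclass
class StructuredObject:
    """A structural component of the video stream."""
    kind: str          # "frame", "shot", or "camera_motion"
    start_frame: int
    end_frame: int

@dataclass
class SemanticObject:
    """A concept presented in the video, carried as text."""
    kind: str          # "catalog", "segment_description",
                       # "transcript", or "shot_list"
    text: str
    # structured objects this text is synchronised with
    anchors: list = field(default_factory=list)
```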


In this system, semantic (ie. text) objects are indexed by keywords or full text retrieval, shots are indexed by frame and camera motion, and frames are indexed by image features. Associating rich textual objects with the video content that they describe can have a number of advantages. Search can be conducted at various hierarchical levels of video structure, where components in the document hierarchy (defined by a synopsis, outline, treatment, screenplay, and shooting script) correspond to levels in the hierarchy of the video structure. Hierarchical structure supports logarithmic searching for components at the lowest levels. Analysis of lower level components can be used for ranked search and retrieval of components at higher levels. Finally, search can be distributed, using the document hierarchy as an indexing structure, as the sketch below illustrates. The kinds of queries that are effective at different levels, and for different user profiles and purposes, must be determined by experimentation.

While textual descriptions are a rich source of descriptive information, the question of what a video stream is “about” can equally be asked of the associated text. Moreover, there can be a large interpretive gap between a written script and its visual realisation.
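The hierarchical search strategy can be sketched directly: each document level points to finer-grained children aligned with spans of the video, and a query descends only into branches whose text matches. The bare keyword test below is an illustrative assumption; a real system would apply a ranked text-retrieval model at each level.

```python
# Sketch of hierarchical search down the document hierarchy
# (synopsis -> outline -> treatment -> screenplay -> shooting script).
from dataclasses import dataclass, field

@dataclass
class DocNode:
    level: str          # "synopsis", "outline", "treatment", ...
    text: str
    video_span: tuple   # (start_frame, end_frame) the document covers
    children: list = field(default_factory=list)

def search(node, keywords):
    """Descend only into matching branches; return the deepest spans."""
    if not any(k.lower() in node.text.lower() for k in keywords):
        return []
    hits = [span for child in node.children
            for span in search(child, keywords)]
    return hits or [node.video_span]  # fall back to this level's span
```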

Knowledge Representation Languages

Text is a convenient form for representing various kinds of semantic information. More formal languages (ie. knowledge representation languages) can be used to represent forms of video semantics that are not convenient to represent in text, or to support computational operations that informal text does not support. For example, Davis (1994) describes the Media Streams prototype for video representation and retrieval. In Media Streams, the underlying representation of video combines two distinct representations:

1. a semantically structured generalisation space of atemporal categorical descriptors. The semantic memory is a categorical/definitional representation.

2. an episodically structured relational space of temporal analogical descriptions. The episodic memory is a representation of sequences of events.

Using these representations, atemporal semantic retrieval (of icons and video segments) and temporal analogical retrieval (of video segments and sequences) are implemented using matching by analogical similarity.

Representing Video Semantics and the Conflict of Interpretations

Davis (1994) raises the issue of representing video syntax, semantics, and “common sense”, but does not develop any detailed view of what these phenomena can amount to. A theory of the meaning of cinematic systems is crucial to the project of video content representation. This section proposes preliminary elements of such a theoretical framework (expressed informally) within which the articulation of video content models may take place coherently and systematically.

Interpretation Paradigms for the Representation of Video Semantics

The film theorist Christian Metz (1974) presents a number of insights into the nature of film semiotics, ie. the study of the ways in which films signify, or are meaningful. Metz points out that the development of a unified syntax for film is impossible, due to a number of crucial differences between film on the one hand and natural language systems on the other. A language system is a system of signs used for intercommunication. Cinema is not a language system, since it lacks important characteristics of the linguistic fact. In particular, cinematic language is only partly a system, and it uses few true signs (convention makes some images into kinds of signs). The preferred form of signs is arbitrary, conventional, and codified; this is not characteristic of an image, since an image is not the indication of something else, but the pseudo-presence (or close analogy) of the thing it contains. Hence there is a film language, but not a film language system. As Metz states, "Cinematic image is primarily speech - a rich message with a poor code, or a rich text with a poor system. It is all assertion." The meanings imparted by film are primarily those imparted by its subject. Montage demonstrates the existence of a "logic of implication", thanks to which the image becomes language, and which is inseparable from the film’s narrativity. The understanding of a film precedes the conventionalisation of specific "syntactic" devices: the plot and subject make syntactic conventions understandable.

Metz suggests that the total cinematographic message could bring five main levels of codification into play:

1. perception, to the degree that it already constitutes a system of acquired intelligibility;

2. recognition and identification of visual or auditory objects appearing in a film, ie. the (culturally acquired) ability to manipulate correctly the denoted material of the film;

3. all the "symbolisms" and connotations of various kinds that attach themselves to objects (or to relationships between objects) outside of films, ie. in culture;

4. all the great narrative structures that obtain outside of films within each culture; and

5. the set of properly cinematographic systems that, in a specific type of discourse, organise the diverse elements furnished to the spectator by the four preceding instances.

From these observations we must conclude that the systematisation of the meanings manifested by filmic productions is extremely complex, requiring a general analysis and representation of the cultural codes depicted in films, in addition to the specific codifications of their depiction. Syntactic conventions that arise within filmic productions interact with the systems of meaning that they are used to codify. These manifestations are recognised in the varieties of type, genre, and style of filmic productions. That is, a partial syntax may emerge within a particular genre executed in a particular style. A comprehensive “syntax of film” must include the full range of such partial syntaxes, and that range and the styles within it are always changing and evolving. A major challenge for video content representation is therefore to address the systematic analysis and representation of the cultural codification systems represented in film, the ways they can be represented in film, and the ways in which they can be meaningful.

Interpretation Paradigms

In discussing the different dimensions of meaning that video content may have, it is useful to define an interpretation paradigm. Here we define an interpretation paradigm as a set of fundamental assumptions about the world, together with a set of beliefs following from those assumptions (analogous to Kuhn’s (1972) concept of a scientific paradigm). An interpretation paradigm is associated with an ongoing discourse of interpretation conducted by those who subscribe to it, which aims to understand phenomena by interpreting them in terms of the paradigm. The meaning of a phenomenon is its interpretation within an interpretation paradigm.
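For later reference, an interpretation paradigm might be rendered as metadata along the following lines. The schema is purely illustrative; the definition above specifies the concept, not this representation.

```python
# One possible (purely illustrative) rendering of an interpretation
# paradigm as metadata for a content-representation system.
from dataclasses import dataclass, field

@dataclass
class InterpretationParadigm:
    name: str                 # e.g. "Freudian", "continuity editing"
    assumptions: list         # fundamental assumptions about the world
    beliefs: list = field(default_factory=list)   # beliefs that follow
    discourse: list = field(default_factory=list) # texts interpreting
                                                  # phenomena under it
```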


The process of modelling and representing video semantics can now be expressed as a project of identifying, analysing, and representing the various interpretation paradigms that are or may be implicated in the process of understanding video, as well as the interrelationships between the different paradigms. This process can begin with the identification of at least four broad levels or types of paradigm involved in video semantics: the diegetic level, the level of connotation, the level of subtext, and the cinematic level.
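A minimal annotation schema for the four levels might look as follows. Each description is tagged with the paradigm under which it holds, so that conflicting readings can coexist; the field names are assumptions of this sketch, not a proposal from the literature.

```python
# Sketch of a shot annotation carrying descriptions at all four levels,
# each tagged with its interpretation paradigm (assumed field names).
from dataclasses import dataclass, field

@dataclass
class LevelDescription:
    paradigm: str   # the interpretation paradigm the reading belongs to
    text: str       # the description itself

@dataclass
class ShotAnnotation:
    shot_id: str
    diegetic: list = field(default_factory=list)     # LevelDescription
    connotative: list = field(default_factory=list)
    subtextual: list = field(default_factory=list)
    cinematic: list = field(default_factory=list)
```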

The Diegetic Level of Video Semantics

Diegesis designates the sum of a film’s denotation: the narration, the fictional space and time dimensions implied in and by the narrative, the characters, landscape, events, and other narrative elements considered in their denoted aspect (Metz, 1974). Based upon this definition, we define the diegetic meaning of video data as the four-dimensional spatiotemporal world that it posits, together with the spatiotemporal descriptions of agents, objects, actions, and events that take place within that world.

Davis (1994) considers a number of specific ontological issues in the representation of video semantics that address the diegesis of video data as a spatio-temporal world. The sequencing of shots allows the construction of many types of space (including real, artificial, and impossible spaces). For real locations it is possible to distinguish the actual location of the recording, the spatial location inferred by a viewer of an isolated shot, and the spatial location inferred when a shot is viewed in the context of a shot sequence. The virtual spaces created by videos require the use of relative three-dimensional spatial position descriptions. Video requires techniques for representing and visualising the complex structure of the actions of characters, objects, and cameras. For representing the action of bodies in space, the representation needs to support the hierarchical decomposition of units, spatially and temporally. Conventionalised body motions (walking, sitting, eating, talking, etc.) compactly represent motions that may involve multiple abstract body motions (represented according to articulations and rotations of joints). Much of the challenge of representing action is in knowing what levels of granularity are useful. Time (analogously to space) requires the representation of actual time and of both possible and impossible visually-inferred time.

As an example of the meaning of an image at the diegetic level, consider the image of a marble eagle standing upon a cross with its wings spread wide: the statue has a particular size and shape, rests at a particular location on top of the wall of a stadium, and so on.
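The hierarchical decomposition of action can be sketched as a tree of timed, positioned units; the granularity shown is an illustrative assumption.

```python
# Sketch of hierarchical action decomposition at the diegetic level: a
# conventionalised motion expands into finer-grained body motions.
from dataclasses import dataclass, field

@dataclass
class Action:
    name: str                 # "walking", "left-leg swing", ...
    start: float              # diegetic time, in seconds
    end: float
    position: tuple = (0.0, 0.0, 0.0)  # relative 3-D diegetic location
    parts: list = field(default_factory=list)  # sub-actions

walking = Action("walking", 0.0, 4.0, parts=[
    Action("left-leg swing", 0.0, 1.0),
    Action("right-leg swing", 1.0, 2.0),
    Action("left-leg swing", 2.0, 3.0),
    Action("right-leg swing", 3.0, 4.0),
])
```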

The Connotative Level of Video Semantics

The connotative level of video semantics is the level of metaphorical, analogical, and associative meaning that the denoted (ie. represented diegetic) objects and events of a video may have (corresponding to the third level of codification noted by Metz, above). The connotative level captures the cultural codes that define the culture of a social group and are considered “natural” within the group. There is some arbitrariness in the relationship between the connotative signifier and the connotative signified, since other symbolisms can always be established during the course of a film.

Returning to the example of the statue of an eagle on a cross (diegetic description), at the connotative level we might say that the image connotes the power of a particular single-party state, its history, mass rallies, explicit ideology, and fate during a particular period of history.


The Subtextual Level of Video Semantics

The level of subtext corresponds to the level of hidden and suppressed meanings of symbols and signifiers, preceding and extending the immediacy of intuitive consciousness. This is the level addressed by Paul Ricoeur’s philosophical hermeneutics (see Kearney, 1986). Based upon the model of text, philosophical hermeneutics acknowledges polysemy (multiple meanings) as a fundamental feature of all language. Philosophical hermeneutics is a universal philosophy which acknowledges that when we use language we are interpreting the world, not literally as if it possesses a single transparent meaning, but figuratively in terms of allegory, symbolism, metaphor, myth, and analogy. The project of philosophical hermeneutics is the task of deciphering the hidden meanings behind the surfaces of language, which amounts to “transferring” meaning (by interpretation) from one semantic plane to another through the linguistic agencies of metaphor, allegory, simile, metonymy, etc. Philosophical hermeneutics seeks to situate and demarcate the theoretical limits of each “hermeneutic field” (corresponding to an interpretation paradigm, as defined above), showing how each interpretation operates within a specific set of theoretical presuppositions.

A given signifier actually has an infinite number of possible interpretations. However, the human subject always understands present significations in terms of ‘other’ signs implicit in the past (archaic meanings of the subconscious) and in the future (an anticipation of new meanings). The historical transfer of meaning places us in a hermeneutic circle where each interpretation is both preceded by a semantic horizon inherited from tradition and exposed to multiple subsequent rereadings by other interpreters. This applies to the connotative level as well as the subtextual level, and means that a definitive representation of “the meaning” of video content is in principle impossible. The most that can be expected is the development and representation of a body of evolving interpretations and their interrelationships.

The subtextual level is more specifically concerned with levels of meaning that may not be immediately apparent to a reader (ie. viewer). The subtextual level includes “hermeneutic models of suspicion”, which may be Nietzschean, Marxian, Freudian, feminist, etc. In a sense, one person’s subtext may be another person’s connotation: connotation may be a matter of familiarity. However, the connotation/subtext distinction runs deeper than that. Reading subtextual codes requires special training: they correspond to Metz’s description of specialised codes that concern more restricted and specific social activities. The specialised discourse concerned with subtextual interpretation may also be seen to correspond with the fourth level of codification noted by Metz (above), that of the “great narrative structures” within a culture, since each imposes its own narrative picture upon the history and progress of culture on a large scale. The subtextual level represents a range of possible “readings” or interpretations of video content, and hence is an important level of video semantics.

Again considering the image of an eagle on a cross, at the subtextual level its meanings begin to multiply. From the interpretation paradigm of the party that used the image, it may be seen to represent the victory of the universal will to power in the creation and expansion of the party representing that will to power. But this is not “the” subtext; it is only one subtext. Other readings, within different subtextual interpretation paradigms, may read the image subtext as: the expression of anger arising from severe infantile repression, the hypostatised desire for the authority of an absent father figure, or an assertion of patriarchal power. An ideologically neutral content-based search and retrieval system must not restrict the range of possible interpretations of images. If such a system uses content representations, it must represent and support these different views of content.

The Cinematic Level of Video Semantics

The cinematic level of video semantics is concerned with the specifics of how formal film and video techniques are incorporated in the production of expressive artefacts (“a film”, or “a video”) in such a way as to achieve artistic, formal, and expressive integrity. The process is complex, partially codified within various stylistic conventions, and tightly linked to other levels of meaning. A detailed analysis of the mechanisms of this level cannot be undertaken here. However, a systematic analysis must include the study of how cinematic techniques bring together and manifest meanings at the diegetic, connotative, and subtextual levels, in ways that are specific to time-based audio-visual media. This level concerns the purely cinematographic syntagmatic types described by Metz (1974), which are concerned with relationships between the temporal sequences of filmic material and implied diegetic status and relationships. More commonplace examples of cinematic techniques (extracted from Arnheim, 1971) include shot characteristics (eg. the effect of angles, size, and relative placement on how one object is interpreted in relation to others in the diegetic space), mobile camera, shot speed (frame rate), optical effects, and principles of montage (eg. long strips for quiet rhythm, short strips for quick rhythm, climactic scenes, tumultuous action).

Interactions Between Semantic Levels

Interactions between interpretation paradigms at the different levels of meaning are highly complex. There can be no universal film syntax. However, within specific film genres and styles, very specific syntactic and semantic conventions can be identified. This makes highly stylised genres the natural starting point for the articulation of interpretation paradigms at the four levels proposed in this paper. For example, a restorative three-act structure (Dancyger and Rush, 1995) and the continuity style of editing currently constitute the dominant model for mainstream commercial feature films. The continuity style in editing yields some very specific rules for interconnecting shots, together with the kinds of diegetic meaning created by application of the rules (eg. from Arnheim, 1971, “time continuity within scenes cannot be broken”, “time continuity can be broken between different scenes occurring in different places, especially if a temporal connection between them is not important”, etc.). The continuity style emphasises the primacy of action, involving the viewer in the unfolding drama more than it invites contemplation of the viewed image as a visual artefact. Characteristics of the restorative three-act structure include having one central character, who undergoes a metamorphosis during the course of the story, and a resolution that resolves the main character’s conflict by a return to complete order by the end of the film (Dancyger and Rush, 1995). The act structure together with the continuity style invite a particular form of engagement in the story, leading the viewer through a process of identification and implicit affirmation of the “moral” of the story. The emotional impact of these films is tightly controlled in a way that leaves little time for viewers to reflect on the real value of the morality being depicted. Genres add more specific standard motifs and patterns to this basic format (Dancyger and Rush, 1995).

The example of the eagle on the cross can be used to illustrate the interactions between the different levels of semantics. At the diegetic level, the spatial object is that of a statue of an eagle standing upon a cross on the wall of a stadium. The connotation is (for example) a particular political party. Cinematic techniques and details at the diegetic and connotative levels can be used to make different statements about the political party according to the subtext that is expressed:

- showing the statue at night, illuminated by one light from the side, in a front-on shot against a dark sky, may be used to state that the party stands in a relationship of power to the viewer;

- showing the statue in daylight from above and at an oblique angle, with pigeons pecking at one another on the eagle’s head, may be used to parody the leadership of the political party, undermining their self-image of strength and nobility by suggesting a lack of intelligence, vision, and humanity.

Detailed analysis of the interactions between the diegetic, connotative, subtextual, and cinematic levels of codification for a representative example of a film or video is too complex for this paper. The groundwork for this analysis is already being conducted by film theorists. The challenges for video content representation in digital multimedia systems are to develop an appropriate architecture for representing and using these models, and then to draw from film theory and critical discourse to develop the appropriate models.

Use of Interpretation Paradigms for Content-Based Retrieval

As an example of the use of interpretation paradigms at different semantic levels, we can consider the kind of query that might be satisfied by representations at each level. We assume that a search is being conducted over a shot database, and that a query consists of a statement of the “subject” of interest. The example of the eagle on the cross is used once again:

- diegetic level: “statue of an eagle”, “statue of an eagle on a cross”
- connotative level: “symbol of the Nazi party”
- subtextual level: “manifestation of infantile repression”, “image of power”
- cinematic level: “long shot, low-key lighting”
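Under the four-level annotation sketch given earlier, such level-specific queries reduce to filtering on a single field. The substring test below is a purely illustrative stand-in for a real query language.

```python
# Sketch of level-specific retrieval over four-level shot annotations
# (see the ShotAnnotation sketch above); substring matching is an
# illustrative stand-in for a real query language.
def query_shots(annotations, level, term):
    """Return ids of shots whose `level` descriptions mention `term`."""
    term = term.lower()
    return [shot.shot_id for shot in annotations
            if any(term in d.text.lower()
                   for d in getattr(shot, level))]

# e.g. query_shots(db, "diegetic", "statue of an eagle")
#      query_shots(db, "subtextual", "infantile repression")
#      query_shots(db, "cinematic", "low-key lighting")
# A combined query would intersect hits from several levels, e.g.:
#      set(query_shots(db, "connotative", "Nazi party")) &
#      set(query_shots(db, "subtextual", "parody"))
```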

In a real system, any given query may contain terms integrating several levels; for example, a request for “parody of Nazi power” can only be satisfied by referring to combined connotative, subtextual, and cinematic semantics.

Implications for Infrastructure

Multiple models of video content at different levels of meaning can be used for describing the semantics, or constraints upon semantics, of video content in search, browsing, and video synthesis operations. These models may also be used for mapping regularities and clusters in video data, for use (for example) in analysing a user’s concept of “interesting” material. However, the development of such models will be a very challenging task. “Meaning” is a complex phenomenon: within an interpretation paradigm, meanings continue to be articulated, and in that respect the paradigm is involved in an ongoing process of self-definition; a paradigm is not a clearly defined and isolated whole, but a coalescence of ideas within a nexus of ever-changing and evolving articulations and interrelationships.

The model of interpretation paradigms is a model of continuously evolving texts, and texts about texts (about texts ... and so on). An artefact (eg. a film) is given an interpretation by a viewer, ie. it is understood. If that understanding is used as the basis for another codification (eg. a book about the film), then that codification becomes another artefact, a text, that is again made meaningful by acts of interpretation (via reading or viewing). The same applies if the text is codified in the form of a knowledge base or knowledge model: its meaning is not absolute, but arises as a result of reading or interpreting the model within an evolving cultural context.

For this reason, the problem of video content representation must be addressed by locating interpretation models within a constantly evolving information space in which the discourses of interpretation are themselves represented. The problem of content representation therefore shifts: instead of being a problem of definitively representing the content of video data, it becomes a problem of how to merge the information spaces of interpretative discourse with that of the video objects of interpretation.

Conclusion

This paper has proposed that the development of content models for video data should be conducted on at least four levels of video semantics: the diegetic level, the level of connotation, the level of subtext, and the cinematic level. The development of models at these levels represents a major challenge. Moreover, the set of models cannot be expected to be static, but must be continually evolved to reflect the evolution of general interpretative discourse, and of culture at large. To this end it is necessary to develop systems in which broad interpretative discourse can be conducted - to provide tools, languages, and an appropriate infrastructure to allow specialists to articulate interpretative models at the different levels - and then to use these models in support of video search, retrieval, browsing, and synthesis. Only within such an environment can more automated functions be integrated with changing models of semantics to provide comprehensive access to content without a priori limitations based upon unanalysed connotative and subtextual biases and predispositions.

References

Aigrain P., Zhang H. and Petkovic D. 1996. “Content-Based Representation and Retrieval of Visual Media: A State-of-the-Art Review”, Multimedia Tools and Applications 3, 179-202, Kluwer Academic Publishers, The Netherlands.

Arnheim R. 1971. Film as Art, University of California Press.

Dancyger K. and Rush J. 1995. Alternative Scriptwriting: Writing Beyond the Rules, 2nd edn., Focal Press.

Davis M. 1994. “Knowledge Representation for Video”, Proceedings of the 12th National Conference on Artificial Intelligence, AAAI, MIT Press, pp. 120-127.

Kearney R. 1986. Modern Movements in European Philosophy, Manchester University Press.

Kuhn T. S. 1972. The Structure of Scientific Revolutions, 2nd edn., The University of Chicago Press.

Metz C. 1974. Film Language: A Semiotics of the Cinema, trans. M. Taylor, The University of Chicago Press.

Srinivasan U., Gu L., Tsui K. and Simpson-Young W. G. 1997. “A Data Model to Support Content-Based Search in Digital Videos”, submitted to the Australian Computing Journal.

Steinmetz A. 1996. “DiVidEd - A Distributed Video Production System”, Proceedings of the VISUAL’96 Information Systems Conference, Melbourne, February 1996.

Zhang H. J., Tan S. Y., Smoliar S. W. and Yihong G. 1995. “Automatic Parsing and Indexing of News Video”, Multimedia Systems, ACM, Vol. 2, no. 6, pp. 256-266.
