Integrating Heterogeneous Metadata into a Distributed Multimedia Information System

Mihaela Brut, Sébastien Laborie, Ana-Maria Manzat, Florence Sèdes
IRIT - Institut de Recherche en Informatique de Toulouse, 118, route de Narbonne, 31062 Toulouse, France
{Mihaela.Brut, Sebastien.Laborie, Ana-Maria.Manzat, Florence.Sedes}@irit.fr

Abstract: Currently, many distributed multimedia systems exist, but they do not address the problem of distributed indexation. The LINDO project enables the remote, on-demand deployment of different indexing algorithms that might produce metadata in different formats. This can lead to an interoperability problem because of the differences between the various standard metadata formats in terms of structure and vocabulary. In this paper, we propose a metadata model that encapsulates the most common metadata standards, such as MPEG-7, Exif, MXF, ID3, Dublin Core, etc.

Keywords: Distributed multimedia information system, Interoperability, Multimedia metadata

1. Introduction

Nowadays, many distributed multimedia information systems are available in various domains, such as video surveillance, patient medical records, broadcast, etc. In general, such systems use identical indexation engines, which extract metadata remotely and produce them in a uniform format, in order to cope with the interoperability problem. In the LINDO project, different indexing algorithms, which might produce metadata in different formats, can be deployed remotely on demand. Consequently, many existing standards can be used, such as Dublin Core, Exif, ID3, MPEG-7, MXF, etc. Mixing these standards does not ensure interoperability, because some of them might share a number of semantically similar metadata elements which are syntactically different. Moreover, it can also be difficult for a user to query these heterogeneous metadata.

To overcome this interoperability problem, we propose in this paper a metadata model that defines an abstract representation of a multimedia document and encapsulates the information of the most common metadata standards.

The remainder of the paper is structured as follows. Section 2 presents some current distributed multimedia information systems. The LINDO project is detailed in Section 3 in order to introduce new challenges in this kind of system, such as distributed indexation. Since several metadata formats could be used together in LINDO, the interoperability problem is presented in Section 4. To overcome this problem, we propose in Section 5 a metadata model that encapsulates the most common metadata standards. Section 6 gives a brief conclusion and some perspectives.

2. Current Distributed Multimedia Information Systems

Currently, many systems are in charge of managing huge multimedia collections, and different strategies are adopted for performing multimedia indexing. Content-based analysis and classification are two traditional approaches, both based on processing the actual multimedia content. The results are often stored in a conventional numeric format, and more rarely in established multimedia metadata vocabularies.

[1] and [2] present many systems in charge of content-based image retrieval, such as SIMBA, SIMPLIcity, PictureFinder, ImageFinder, Caliph&Emir, VIPER/GIFT and Oracle Intermedia. These systems adopt the MPEG-7 standard for uniformly storing the results of the indexation process. Some commercial systems, such as QBIC, PhotoBook and WebSeek, make use of different metadata standards, while other systems (e.g., SkyFinder [3]) do not consider existing metadata vocabularies at all.

Alongside the different multimedia metadata representations provided by the indexation process, we are especially interested in the two following aspects: (1) managing distributed multimedia collections and (2) dealing with the metadata diversity in terms of vocabulary and structure. In the following, we present some recent approaches that treat these aspects.

The CANDELA [4] and CAIM (http://caim.uib.no/) projects focused on the storage and the networked delivery of distributed multimedia contents. They mainly use a single, centralized indexation engine in order to simplify the querying process. The SAPIR project (http://www.sapir.eu) addresses large-scale search in audio-visual content using peer-to-peer information retrieval. Its search engine makes use of three different indexation engines: one for images, another for text, and the last for video and audio contents. Each medium is processed by its corresponding indexation engine, which provides metadata descriptions encoded in a uniform metadata structure.


Besides, the K-Space project (http://kspace.qmul.net:8080/kspace/) and the Muscle network of excellence (www.muscle-noe.org) provide other metadata structures and vocabularies, which are semantically grounded, for the semi-automatic annotation and retrieval of multimedia contents. However, they do not take into account the majority of the multimedia features that are available in the most common existing multimedia metadata standards.

As can be noticed, each particular system operates in a restricted context: it uses a set of available indexation algorithms, generally handled by a single indexation engine. Furthermore, it converts the resulting multimedia metadata into a convenient format, which might be a standard, such as MPEG-7. Usually, the case of heterogeneous metadata is avoided, as well as the problem of heterogeneous distributed indexation engines.

3. The LINDO Project

The ITEA2 project LINDO (Large scale distributed INDexation of multimedia Objects, http://www.lindo-itea.eu) aims to develop a generic architecture in which not only the storage of multimedia documents is distributed, but also their indexation, over different storage units that are possibly heterogeneous (e.g., in terms of capacities) and geographically distant. An important issue of LINDO is the integration of different indexation engines in the system and their deployment in real time, while the system is running.

Figure 1 illustrates the general workflow of the distributed indexation processes. The descriptions below follow the arrow labels of the figure.

Figure 1: The LINDO indexing architecture

A. Initially, the central server is in charge of deploying different sets of extractors on each remote server (an extractor is also referred to as an indexing algorithm). This deployment can be based on predefined queries that might be treated by some remote servers, as proposed in [5], or on server contexts.

B. Each specific source (e.g., police vehicles) that captures multimedia contents belongs to at least one remote server. Of course, several sources may be associated with one remote server. Multimedia contents can be acquired in real time or not, and may be encoded in different formats.

C. Each multimedia content ingested by a remote server is stored in a multimedia collection. During the execution of the system, this collection evolves: it grows when contents are ingested and shrinks when contents are deleted (e.g., old multimedia documents).

D. Some extractors treat each multimedia content. An extractor identifies multimedia features, like vehicle registration plate numbers, persons, particular sounds, etc. The set of extractors applied to two different multimedia contents may differ according to the content types, formats, etc. Moreover, remote servers may contain different sets of extractors.

E. In LINDO, the extractor outputs may be encoded using different metadata standards. Consequently, mediators are developed in order to translate these outputs into an integrated metadata structure, detailed in Section 5; a sketch of such a mediator is given after this list. Thus, the querying process is simplified.

F. Each content description produced by the extractors is stored in a metadata collection. This description points to the elements of the multimedia collection in order to access the multimedia contents. As a result, when a certain multimedia content is deleted, its description is also removed. Moreover, the metadata collection may contain additional information about the remote server, and about the metadata itself, that has not been obtained by extractors, for instance contextual information (e.g., server locations).

G. In order to avoid querying all remote servers, subsets of their metadata collections are sent to a central server [6]. From all the multimedia content descriptions of a remote server's metadata collection, some general information can be extracted automatically, such as the types of multimedia contents, the time periods, the locations and the identified objects.

H. The central server is able to answer very general queries; specific ones are forwarded to the relevant remote server metadata collections.

I. During the execution of the system, the central server is able to deploy on demand some specific extractors on the remote servers. In [5], this deployment is based on the user queries.
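To illustrate step E, the sketch below shows how a mediator might lift a Dublin Core description produced by an extractor into the integrated metadata structure of Section 5. It is written in XQuery, which our instantiation already embeds (see below); all element names (MultimediaMetadata, GeneralInformation, title, author, creationDate) are illustrative assumptions rather than the official LINDO schema.

    xquery version "1.0";
    (: Hypothetical mediator for step E: lift a Dublin Core extractor
       output into the integrated structure. Element names are
       illustrative assumptions, not the official LINDO schema. :)
    declare namespace dc = "http://purl.org/dc/elements/1.1/";
    let $in := doc("extractor-output.xml")/*
    return
      <MultimediaMetadata src="{ $in/dc:identifier/text() }">
        <GeneralInformation>
          <title>{ $in/dc:title/text() }</title>
          <author>{ $in/dc:creator/text() }</author>
          <creationDate>{ $in/dc:date/text() }</creationDate>
        </GeneralInformation>
      </MultimediaMetadata>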


Currently, we have instantiated this architecture with several technologies that are able to support XML data. Oracle Berkeley DB XML (http://www.oracle.com/technology/products/berkeley-db/xml/index.html) has been chosen to store the metadata descriptions (i.e., it supports the metadata collections). The advantage of this database is that it is a native XML database and embeds an XQuery engine. For computational efficiency, each remote server is composed of several sub-servers that execute the different modules simultaneously. For instance, some extractors run on different sub-servers because running them is time- and CPU-consuming. In order to send the XML data through the network, the JMS (Java Message Service, http://java.sun.com/products/jms/overview.html) API is used. It allows sending synchronous as well as asynchronous messages. Finally, each remote server embeds some web services (e.g., answering XQuery queries, giving details about the remote server characteristics, like its geographical location) that can be invoked by external consumers, like the central server.
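For illustration, a specific query that the central server might forward to a remote server's XQuery web service could look like the following sketch; the container name and the element names are assumptions consistent with the illustrative examples in this paper.

    (: Hypothetical forwarded query: return the source URLs of all
       contents in which a given registration plate was detected.
       Container and element names are assumptions. :)
    for $m in collection("metadata.dbxml")/MultimediaMetadata
    where $m//Object[@type = "vehicle" and @property = "1234 AB 31"]
    return string($m/@src)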

As mentioned previously (i.e., arrow E of Figure 1), the extractors may produce metadata in different formats. This can lead to an interoperability problem, as presented in the next section.

4. The Metadata Interoperability Problem

Metadata express the key features of a multimedia document, providing valuable semantic information for multimedia search and retrieval operations. [7] places metadata at the centre of the multimedia document lifecycle, which makes the metadata model even more important in the management of documents. Besides, in recent years metadata have gained great interest among researchers and industry. This joint work is very productive and, as a consequence, the number and the heterogeneity of metadata formats have increased steeply (Section 4.1). This, in turn, raises an important interoperability problem (Section 4.2).

4.1. Standards and Vocabularies for Multimedia Metadata Management

Many metadata standards allow describing multimedia content (e.g., name, title, physical description), technical characteristics (e.g., color histogram, file format, resolution) and administrative information (e.g., rights management, authors). We provide below a brief overview of the most common standard metadata vocabularies.

For images:
• Exchangeable Image File Format (Exif, http://www.digicamsoft.com/exif22/exif22/html/exif22_1.htm) includes metadata related to the image data structure and characteristics, capturing information, recording offsets, etc.
• IPTC Photo Metadata (http://www.iptc.org/IPTC4XMP/) is designed to describe and administrate photographs and to provide the most relevant rights-related information.

For audio-visual contents:
• MPEG-7 (Multimedia Content Description Interface, http://www.chiariglione.org/mpeg/standards/mpeg-7/mpeg-7.htm) represents one of the biggest ISO efforts in the direction of complex media content modeling (including support for spatio-temporal descriptions), aiming to be an overall framework for describing any audio-visual content.
• MXF (Material Exchange Format, http://www.smpte-mxf.org/) supports the interchange of material for the content creation industries. MXF allows applications to know the duration of a file, which essence codecs are required, what timeline complexity is involved, and other key points enabling interchange.
• ID3 (http://www.id3.org/Developer_Information) metadata are embedded within an MP3 audio file and provide various information about a song, such as its title, artist, album and genre, as well as the involved people list, lyrics, band, ownership and recording dates.
• MusicXML (http://www.recordare.com/xml.html) provides support for universally translating and encoding common Western musical notation.

For texts:
• TEI (Text Encoding Initiative, http://www.tei-c.org/) is a standard for representing the structure of texts, widely used by libraries, museums and publishers.
• The PDF and Open Office APIs retrieve standard archive information (e.g., title, creation date).

Furthermore, other metadata standards, such as Dublin Core (http://dublincore.org), can be used to describe any multimedia content. In addition, various communities and initiatives have developed domain-specific standard vocabularies:
• LSCOM (Large-Scale Concept Ontology for Multimedia, http://www.ee.columbia.edu/ln/dvmm/lscom/) organizes more than 800 visual concepts for which extraction algorithms are known to exist.
• DICOM (Digital Imaging and Communications in Medicine, http://medical.nema.org/Dicom/) is a standard for handling, storing and transmitting information in medical imaging.
• FGDC (Federal Geographic Data Committee, 2003, http://www.fgdc.gov) for geospatial data.
• NewsML (International Press Telecommunications Council, http://www.newsml.org) for news objects.
• TV-Anytime (http://www.tv-anytime.org) for TV digital broadcasting.


As can be noticed, a wide variety of metadata standards and vocabularies has been developed. In order to describe a multimedia document, several metadata formats can potentially be used together. This can lead to an interoperability problem.

4.2. Different Aspects and Solutions for the Interoperability Problem

A major problem concerning the adoption of multimedia metadata standards inside real applications is the interoperability between the existing multimedia vocabularies. This issue concerns several aspects:

• Different metadata structures: Metadata created and enhanced by different tools and systems follow different standards and representations, which are not necessarily interoperable [8].

• Metadata overlapping: None of the multimedia metadata standards used in practice can fully describe all categories of multimedia features. Actually, a complex description requires metadata from multiple vocabularies. However, two different vocabularies may share some metadata elements while each also embeds its own specific elements. Table I illustrates such differences between metadata standards, e.g., DC and DICOM.

Table I: Comparing different metadata vocabularies

                      EXIF   DC   PDF   MXF   DICOM
  Title                X     X     X     X     -
  Software used        X     -     X     X     X
  Author               X     X     X     X     X
  Creation date        X     -     -     X     X
  Language             -     X     X     X     -
  Subject              X     -     -     X     X
  Modification date    -     -     -

• Vocabulary's inherent semantics: The inherent semantics of the information encoded in a standard XML-based metadata vocabulary are only specified within that standard's framework, based on the standard's structure and terminology [9]. For instance, it is hard to re-use MPEG-7 metadata in environments that are not based on MPEG-7, or to integrate non-MPEG metadata into an MPEG-7 application.

• Metadata synonymy: The different standards include synonyms, i.e., metadata with different syntactic forms but referring to the same semantic information. Multimedia applications should deal with these synonymies. As a solution, the W3C Media Annotations Working Group (MAWG, http://www.w3.org/2008/WebVideo/Annotations/) developed a set of mappings between various multimedia metadata formats (http://www.w3.org/2008/WebVideo/Annotations/drafts/ontology10/WD/mapping_table.html); an excerpt of such an alignment is sketched below.

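As an illustration of such an alignment, a synonym mapping can be represented by a small document in the spirit of the MAWG mapping table; the format and the pivot name below are our own assumptions, while dc:creator, the Exif Artist tag, the PDF Author entry and the ID3 TPE1 frame are actual vocabulary terms.

    <!-- Illustrative synonym alignment: one pivot element mapped to
         its standard-specific forms. The format is an assumption. -->
    <mapping pivot="author">
      <equivalent standard="DublinCore" element="dc:creator"/>
      <equivalent standard="Exif" element="Artist"/>
      <equivalent standard="PDF" element="Author"/>
      <equivalent standard="ID3" element="TPE1"/>
    </mapping>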

Consequently, in order to use multiple standards together for describing a multimedia document and to be interoperable, an application should address the above-mentioned aspects. To achieve this goal, [10] proposes several methods, one of which is to define a global structure that is common to several metadata vocabularies. Based on this method, we developed a framework that considers the most common multimedia metadata vocabularies, and thus the most common multimedia features, and encapsulates them inside a global structure. Other approaches adopt such a model, e.g., [11], which is the closest to our framework. They propose a general metadata model that describes the media contents and the temporal, spatial and hypermedia dimensions of multimedia documents. However, they do not consider existing standards and, consequently, neither the interoperability problem. Hence, we extend their framework, as presented in the next section.

5. A Generic Metadata Framework

The framework presented in [11] has two major shortcomings:
• limited document descriptors: it does not consider descriptions of the entire multimedia document;
• limited media descriptors: it defines its own vocabulary and does not consider the most common standard metadata vocabularies for describing multimedia features.

In order to address these shortcomings, our framework includes support for describing the entire document as well as the included multimedia objects. For both dimensions, we propose a general metadata structure that covers the most common metadata vocabularies. Figure 2 illustrates an overview of our model.

Figure 2: An overview of the generic metadata format

As shown in the figure, the format has two main levels. The former corresponds to the entire multimedia document (General Information) and the latter corresponds to the media elements that appear in the document (Image, Text, Video and Audio).


We made the choice of structuring our proposed format on two levels because a multimedia document can be described at two major granularities:
• by presenting general facts about the entire document, e.g., its author, its size, its creation date;
• by describing every media element of the document separately. For instance, a text element can be described according to its organization into sections, and an image according to its regions of interest.

As a representation language, we have chosen XML (http://www.w3.org/TR/xml/) due to its extensibility in vocabulary and structure. Another important reason for this choice is that almost all the multimedia standards already have an XML DTD. Furthermore, many indexation engines provide XML data as output.

Each metadata description encoded in our format corresponds to exactly one multimedia document. The association between the document and its metadata is done through the root element represented in Figure 2, i.e., the src attribute of Multimedia metadata. The value of this attribute generally refers to the URL of the multimedia document.

In order to construct our generic metadata framework and associate relevant descriptors to each component, we have studied the metadata standards themselves and determined their common element sets, as well as their differences. The mappings between standards can be done at different levels: the component level (i.e., between predefined groups of elements), the element level (i.e., between tag names), the data type level and the value level [12]. An overview of the mapping realized at the element level is presented in Table I. Thanks to this comparison of metadata standards, the most common elements were added to the General information component. Similarly, the same approach was adopted to determine the similar descriptors for each media type, i.e., Image, Text, Video and Audio.

In the following, we give more details about our framework. We begin with the general component and then present the descriptors used for each media type. We finish by focusing on the modeling of the semantic metadata.
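Before detailing each component, the following minimal instance sketches the two-level structure; the element names are assumptions drawn from Figure 2, the authoritative definition being the XML Schema published with our framework.

    <!-- Minimal illustrative instance of the two-level structure.
         Element names are assumptions drawn from Figure 2. -->
    <MultimediaMetadata src="http://example.org/report.smil">
      <GeneralInformation>
        <title>Traffic incident report</title>
        <author>John Smith</author>
        <creationDate>2009-06-15</creationDate>
      </GeneralInformation>
      <Video src="http://example.org/cam1.mpg"/>
      <Image src="http://example.org/scene.jpg"/>
    </MultimediaMetadata>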

5.1. General Information Metadata

The descriptors of the general information component can be applied to any type of media. For instance, author or title information can correspond to an image, a text, a video, an audio content or an entire multimedia presentation. Dublin Core provides a set of general descriptors that allow describing any kind of resource. However, from our comparison of metadata standards, it appears that other general elements can also be used to depict a multimedia document. For instance, as shown in Table I, the software used metadata element is not included in the Dublin Core vocabulary, but it is present in all the other presented standards, i.e., Exif, PDF, MXF and DICOM.

Naturally, as the software used metadata is a common element of many standards, it can be applied to a multimedia document description, i.e., it denotes the software used for editing the document. Consequently, these common descriptors identified from the metadata standards are added to the general information component of our metadata framework.

5.2. Video Metadata

Nowadays, the retrieval of videos by their contents is widely common on the web, e.g., give me a video of a man drinking a coffee. Many current algorithms can detect various objects and actions in videos. MPEG-7 and MXF are the standards most used to encode such information. Both of them allow annotating video segments and detailing video features. Hence, we associate the elements Sequence and Object to the video metadata component. Moreover, MPEG-7 and MXF cover many other common pieces of information, such as camera motion; we add such common information to our proposal. However, they also contain different descriptors: e.g., only MXF allows specifying GPS positions, while only MPEG-7 allows describing the texture of video frames. As these video standards try to cover the majority of descriptors, we also add their disjoint sets of vocabulary elements to our metadata framework.

5.3. Audio Metadata

Many improvements were also made in the field of audio signal processing. Many algorithms can detect different events in an audio content, e.g., cries, strong noise, specific speakers, changes of speaker, topics, etc. Similarly to the video case, the most used audio metadata standards segment the content and describe particular information of these audio fragments. Consequently, the metadata elements of an audio content are quite similar to those of the video metadata component, i.e., it contains the elements Segment and Object. The audio Segment element differs from the video Sequence element because different specific vocabularies are used to describe audio and video fragments.

5.4. Image Metadata

Many standards are able to annotate images, such as Exif, IPTC, DICOM and MPEG-7. With different levels of granularity, these standards allow describing:
• global image characteristics (e.g., resolutions, GPS locations) that, for example, may come from the Exif vocabulary;
• specific regions of interest (e.g., persons, trees, buildings, textures, histograms), thanks to the MPEG-7 standard.

Our metadata framework considers these two levels. For the global characteristics, we benefit from the general descriptors provided by the Exif vocabulary. Of course, more specific descriptors could be added to our framework, such as DICOM for describing medical information.


For regions of interest, the Region element is used in order to provide a hierarchical structure of the image contents.

5.5. Text Metadata

For text media, [11] proposes a model based on their logical structure, i.e., their organization into textual units (e.g., paragraphs, sections, chapters). In our framework, we enriched this structure by augmenting the supported vocabulary with the common sets of elements present in the metadata associated with PDF, Doc and RTF files. At the structure level, we associate with a text unit a description of its content through the Object element. For instance, a sentence can be associated with a description of its speaker or of a monument mentioned within it.

5.6. Semantic Metadata


While the previous metadata types provide support for expressing physical, technical or administrative information, the framework should also handle semantic metadata. As can be noticed in Figure 2, all the descriptors contain an Object element. Its role is to describe particular (semantic) information that can be extracted from a media item. For instance, if the monument Notre-Dame de Paris is detected in a media item, we can add supplementary information inside its corresponding Object element in order to obtain a semantically enriched description. In our metadata framework, this enrichment can be done through the type and property attributes of an object, by providing semantic descriptions directly or by referring to external semantic descriptions via URIs. Moreover, each Object element must have a unique identifier (ID) inside the corresponding XML-based metadata document. By using this identifier, an object can be related to other objects with spatio-temporal relations. This is also the case for media elements.

Our metadata framework can be used to describe any multimedia document format (e.g., SMIL, XHTML, PowerPoint presentations). Our proposal has the advantage of integrating many multimedia metadata standards into one single uniform structure, thus facilitating the querying and retrieval processes.
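For instance, the detection of Notre-Dame de Paris in a video sequence could be encoded along the following lines; the attribute values and the DBpedia URI are illustrative assumptions showing one possible use of the type and property attributes.

    <!-- Illustrative semantically enriched description: an object
         pointing to an external semantic resource via a URI. -->
    <Video src="http://example.org/paris-tour.mpg">
      <Sequence begin="00:02:10" end="00:02:45">
        <Object id="obj1" type="monument"
                property="http://dbpedia.org/resource/Notre-Dame_de_Paris"/>
      </Sequence>
    </Video>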

6. Conclusion

In this paper, we have presented a generic metadata framework for describing multimedia documents that allows encapsulating the most common multimedia metadata standards. The corresponding XML Schema and some examples are available at http://www.irit.fr/LINDO/. This metadata framework is integrated into a distributed multimedia information system, named LINDO, which handles multiple remote indexation engines.

Despite the metadata richness supported by the exposed standards and vocabularies, the meaning of metadata elements still needs semantic enhancement in order to become explicit and transparent for automatic processing. As done by the W3C Media Annotations Working Group in [13] and by [14], we plan to associate syntactic information with elements defined in an ontology in order to make their semantics explicit. After the indexation process, we also intend to inject our metadata format into the multimedia contents through the KLV (Key-Length-Value) model proposed by the SMPTE Working Group, thus keeping some metadata inside the multimedia contents themselves.

7. Acknowledgement

This work has been partially supported by the EUREKA project LINDO (ITEA2 – 06011) and by the Marie Curie project SOMIR (PIEF-GA-2009-235229).

8. References

[1] R. C. Veltkamp, M. Tanase: "Content-based Image Retrieval Systems: A Survey". Technical Report UU-CS-2000-34, Department of Computing Science, Utrecht University, 2000.
[2] H. Kosch, M. Maier: "Content-Based Image Retrieval Systems - Reviewing and Benchmarking". 9th Workshop on Multimedia Metadata (WMM), Toulouse, 2009.
[3] L. Tao, L. Yuan, J. Sun: "SkyFinder: Attribute-based Sky Image Search". 36th International Conference and Exhibition on Computer Graphics and Interactive Techniques (SIGGRAPH), New Orleans, Louisiana, 2009.
[4] P. Pietarila, U. Westermann, S. Jarvinen, J. Korva, J. Lahti, H. Lothman: "CANDELA - Storage, Analysis, and Retrieval of Video Content in Distributed Systems". International Conference on Multimedia and Expo (ICME), Amsterdam, 2005.
[5] M. Brut, S. Laborie, A.-M. Manzat, F. Sèdes: "A Framework for Automatizing and Optimizing the Selection of Indexing Algorithms". 3rd International Conference on Metadata and Semantics Research (MTSR), pp. 48-59, Milan, 2009.
[6] S. Laborie, A.-M. Manzat, F. Sèdes: "Managing and Querying Efficiently Distributed Semantic Multimedia Metadata Collections". IEEE MultiMedia, special issue on multimedia metadata and semantic management, 2009 (to appear).
[7] J. R. Smith, P. Schirling: "Metadata Standards Roundup". IEEE MultiMedia, 13(2): 84-88, 2006.
[8] V. Tzouvaras, R. Troncy, J. Z. Pan: "Multimedia Annotation Interoperability Framework". W3C Incubator Group Editor's Draft, 2007. http://www.w3.org/2005/Incubator/mmsem/XGR-interoperability/
[9] G. Stamou, J. van Ossenbruggen, J. Z. Pan, G. Schreiber: "Multimedia Annotations on the Semantic Web". IEEE MultiMedia, 13(1): 86-90, 2006.
[10] L. M. Chan, M. L. Zeng: "Metadata Interoperability and Standardization - A Study of Methodology Part I". D-Lib Magazine, 12(6), 2006.
[11] I. Amous, A. Jedidi, F. Sèdes: "A Contribution to Multimedia Document Modeling and Querying". Multimedia Tools and Applications, 25(3): 391-404, 2005.
[12] C. Timmerer, J. Jabornig, H. Hellwagner: "Delivery Context Descriptions - A Comparison and Mapping Model". 9th Workshop on Multimedia Metadata (WMM), Toulouse, 2009.
[13] W. Lee, T. Bürger, F. Sasaki, V. Malaisé, F. Stegmaier, J. Söderberg: "Ontology for Media Resource 1.0". W3C Working Draft, 2009.
[14] R. Troncy, W. Bailer, M. Hausenblas, P. Hofmair, R. Schlatte: "Enabling Multimedia Metadata Interoperability by Defining Formal Semantics of MPEG-7 Profiles". 1st International Conference on Semantics And Digital Media Technology (SAMT), LNCS 4306, pp. 41-55, Athens, 2006.
