Semantic Audio Hyperlinking: a Multimedia-Semantic Web scenario

Giovanni Tummarello1,2, Christian Morbidoni1, Paolo Puliti1, Francesco Piazza1
1. SEMEDIA group – Università Politecnica delle Marche (ITALY) http://semedia.deit.univpm.it [email protected]
2. CNR/ISTI Pisa (ITALY)

Abstract
In this paper we illustrate a preliminary exploration of a Semantic Web scenario where acoustic relationships among audio resources are extracted in a distributed way and made available using the tools of the Semantic Web. In this scenario, peers use MPEG-7 to exchange descriptions of the audio they are interested in and use these to locally extract different kinds of acoustic similarities. When very relevant similarities among audio resources are found, they are expressed using the syntax and semantics of the Semantic Web, that is, using RDF and an appropriately expressive ontology. There are a number of reasons why such a process is interesting. First of all, other agents might retrieve such indications and use them to perform browsing and exploration. Further, algorithms might exploit audio hyperlinks to make additional inferences. We demonstrate this by showing how the network of audio hyperlinks can be used to further enrich the metadata annotations of audio resources, in a way somewhat similar to the famous "PageRank" algorithm.

1. Introduction

The Semantic Web initiative is concerned with the overall vision of distributed, machine readable metadata on the Internet. To enable this scenario, standardized frameworks have been developed to express semantic relationships between resources (RDF [1]) and the ontologies describing domain classes and the relationships among them (RDFS and OWL [2]). Resources are identified by strings with a uniform syntax, URIs. These can be web accessible, such as those expressed by Uniform Resource Locators (URLs) (e.g. http://www.example.org/foo.txt), as well as external to the web domain, such as a book identified by its ISBN number (e.g. urn:isbn:0-345-33973-8 for Tolkien's "Return of the King") or the concept of "author" (dc:author). RDF "models" are composed of sets of triples relating resources using the terms defined in web-shared ontologies, and form "conceptual graphs" which can then be serialized as XML files and made available on the web (e.g. http://www.g1o.net/g1ofoaf.xml). By making these graphs accessible, a wide and largely unexplored range of scenarios becomes possible, mostly involving intelligent agents which retrieve such descriptions to fulfill user-specific tasks. How exactly these agents are going to operate, and which interesting, useful tasks they could fulfill, is still under research and, as of today, it is safe to say that such exploration is still in its preliminary stage. While there exists a great wealth of algorithms that could in theory provide "high level" annotations of multimedia material coarse enough to fit the Semantic Web scenario, little has so far been done to match these capabilities directly, thus building real world applications useful to end users. On the other hand, interest in distributed intelligent multimedia metadata applications is certainly all but new, the most notable initiative being MPEG-7 [3], an ISO standard since 2001. MPEG-7 consists of a large body of descriptive tools and metadata vocabulary, but has so far notably lacked real support and deployed applications, probably due to its excessive richness and monolithic structure. In this work, we build upon MPEG-7 Low Level Audio descriptors to create semantically typed "hyperlinks" between audio resources. These are expressed in RDF and, once made available "on the Semantic Web", can form the basis for advanced browsing of audio collections and for intelligent algorithms. In section 2 we show how these hyperlinks are generated starting from MPEG-7 descriptions; section 3 shows the result of an algorithm that uses the generated web of semantic audio hyperlinks to enrich the metadata surrounding audio resources.

2. Generating Audio Hyperlinks

We define a Semantic Audio Hyperlink as an RDF annotation which expresses and explains a strict acoustic relationship between two audio resources. To semantically explain such a relationship, the annotation makes use of an appropriate, albeit simple, ontology providing a hierarchy of similarity measures which can be used for specific browsing and reasoning purposes.
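As a minimal illustration of what such an annotation could contain, the sketch below builds the RDF triples for one hyperlink as plain Python tuples. The `ah:` namespace and the property names (`from`, `to`, `weight`) are hypothetical placeholders, not the ontology actually used in this work:

```python
# Hypothetical namespace for the audio-hyperlink ontology (illustrative only).
AH = "http://example.org/audiohyperlink#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def audio_hyperlink(source_uri, target_uri, kind, weight):
    """Return the RDF triples (subject, predicate, object) describing one
    semantic audio hyperlink between two audio resources."""
    link = "_:link1"  # blank node reifying the link
    return [
        (link, RDF_TYPE, AH + kind),         # e.g. a "SoundsLike" similarity class
        (link, AH + "from", source_uri),
        (link, AH + "to", target_uri),
        (link, AH + "weight", str(weight)),  # strength of the acoustic similarity
    ]

triples = audio_hyperlink("urn:track:a", "urn:track:b", "SoundsLike", 0.992)
```

Reifying the link as its own node lets the weight and the similarity type travel with the annotation when the graph is serialized and harvested.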

At the same time, given that any proximity evaluation algorithm will necessarily yield a number of links that grows linearly with the available audio resources, it is necessary to choose tools that offer adequate precision in the evaluation. For the purpose of this study, we developed a methodology based on MPEG-7 low level audio descriptors. As the scope of this work is to explore the audio hyperlink scenario, the algorithm we select is not meant to directly advance the state of the art in the specific signal processing/audio distance metric task. The algorithm we present, however, has very desirable properties, not least being completely based on MPEG-7, and performs this specific task well. The algorithm is implemented upon the extraction and processing framework MPEG7AudioDB (MADB), described in [4][5]. In MADB, MPEG-7 low level descriptors are either extracted from MP3s or imported from external sources. This is an important step in the overall scenario: there is no need for a peer to have all the audio tracks involved in the annotation, as the MPEG-7 metadata streams are in fact sufficient, and they could be fetched, e.g., from online services. Once a number of MPEG-7 streams have been imported, each representing an audio resource, a temporal segmentation process is performed with an algorithm inspired by [6]. It works by first constructing an Auto Similarity Matrix [7] and then using it to create a novelty function based on a specific filter that matches the expected auto-similarity profile of a feature discontinuity. This algorithm can be tuned to address a number of purposes, among which classification, summarization, identification, browsing and annotation.

Mpeg-7 Based Audio Segmentation
The first step of this procedure is to build a Similarity Matrix describing how an audio resource compares to itself, in a bi-dimensional time representation.
Given the time-dependent low level features forming a vector v(t), the element (i, j) of the similarity matrix is given by the cosine distance between v(i) and v(j). The cosine distance is defined by:

$$D(i, j) = \frac{v_i \cdot v_j}{\|v_i\| \, \|v_j\|}$$

Given the nature of the segmentation we seek, it helps to use a version averaged over a time interval. This takes the form:

$$D_w(i, j) = \frac{1}{w} \sum_{k=0}^{w-1} D(i+k, j+k)$$
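The two formulas can be sketched in a few lines of Python (a toy illustration on plain lists of feature vectors; a real implementation would operate on the MPEG-7 AudioSpectrumEnvelope frames):

```python
import math

def cosine_distance(vi, vj):
    """D(i, j): cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(vi, vj))
    norm = math.sqrt(sum(a * a for a in vi)) * math.sqrt(sum(b * b for b in vj))
    return dot / norm if norm else 0.0

def similarity_matrix(features, w=1):
    """Build D_w(i, j): the pairwise cosine distance between feature frames,
    optionally averaged along the diagonal over a window of w frames."""
    n = len(features)
    D = [[cosine_distance(features[i], features[j]) for j in range(n)]
         for i in range(n)]
    if w == 1:
        return D
    m = n - w + 1  # windowed matrix is smaller by w - 1
    return [[sum(D[i + k][j + k] for k in range(w)) / w for j in range(m)]
            for i in range(m)]
```

The windowed form smooths frame-level noise, which is what makes the later diagonal filtering respond to block structure rather than to single frames.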

A key parameter is the signal feature vector chosen to calculate the cosine distance. In our experiments we used the AudioSpectrumEnvelope (a standard Mpeg-7 LLD), which corresponds to the logarithmic spectrum (used in [8] for audio segmentation). It is important to notice, however, that using different features will lead to different kinds of audio segmentation (e.g. possibly focusing on rhythmic or timbral aspects). Given its construction, the similarity matrix will have high values in the cells corresponding to segments of audio that are maximally alike according to the selected feature, in our case segments with a similar spectrum. By definition, the matrix will also have a value of 1 on its diagonal.

Construction of a novelty function
At this point, the similarity matrix is processed by sliding a bi-dimensional filter along its diagonal. The filter, a non-causal Gaussian-tapered checkerboard matrix, is shaped to "match" and maximally highlight moments of "novelty" in the audio resource. The purpose is to highlight points in time where two mutually different, internally similar blocks of feature values meet. The peaks of the filter output are located and used as "segmentation points". As an example, Figure 1 shows the novelty function and segmentation points of a sample audio resource.
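The novelty computation can be sketched as follows, assuming Foote's Gaussian-tapered checkerboard kernel; the taper width below is an illustrative default, not the parameter used in the experiments:

```python
import math

def checkerboard_kernel(w, sigma=None):
    """2w x 2w checkerboard kernel: same-sign quadrants (coherence) are
    positive, cross quadrants negative, tapered by a radial Gaussian."""
    sigma = sigma or w / 2.0  # illustrative taper width
    K = []
    for i in range(-w, w):
        row = []
        for j in range(-w, w):
            sign = 1.0 if (i >= 0) == (j >= 0) else -1.0
            taper = math.exp(-(i * i + j * j) / (2.0 * sigma * sigma))
            row.append(sign * taper)
        K.append(row)
    return K

def novelty(S, w):
    """Slide the kernel along the diagonal of similarity matrix S."""
    K = checkerboard_kernel(w)
    n = len(S)
    out = []
    for t in range(w, n - w):
        acc = 0.0
        for i in range(-w, w):
            for j in range(-w, w):
                acc += K[i + w][j + w] * S[t + i][t + j]
        out.append(acc)
    return out

def peaks(nov, threshold):
    """Local maxima above a predetermined threshold -> segmentation points."""
    return [t for t in range(1, len(nov) - 1)
            if nov[t] > threshold and nov[t] >= nov[t - 1] and nov[t] >= nov[t + 1]]
```

On a block-diagonal similarity matrix (two internally similar, mutually different blocks) the output spikes exactly at the block boundary, which is the behavior the segmentation relies on.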

Figure 1. The novelty function (output of the Gaussian-tapered filter run along the diagonal of the similarity matrix) of the song "Tangerine" by Led Zeppelin. Peaks topping a predetermined threshold are taken as "segmentation points".

Finally, we notice that the dimension of the filter matrix is very important in determining the granularity of the segmentation. A small dimension leads to a fine-grained segmentation, while a larger one allows identifying longer segments, which are more representative for the purpose of our experiments: establishing acoustic links to propagate genre annotations. In our experiments the kernel dimension was set to 256, corresponding to 2560 ms of audio, given that the MPEG-7 low level features had been sampled at a 10 ms time-step.

Building accurate Hyperlinks
Once a segmentation of all the audio files has been performed, we use a measure of similarity among the single fragments. If very high similarities are found, an audio hyperlink is established among the corresponding (entire) audio tracks. We define a feature vector v as a combination of projections of the MPEG-7 LLDs over the fragments. Projections are defined using operators such as "mean" and "variance". The weight of the link between two segments is then evaluated according to a normalized scalar product of the corresponding feature vectors, much like what was previously done to compute the self similarity matrix. An acoustic hyperlink is detected between two audio clips when such a normalized product is higher than a set threshold T, usually for a very high value of T (e.g. 0.99).

Choosing Mpeg-7 descriptors
The set of LLDs composing the feature vector is important in determining which kind of acoustic hyperlink we want to extract. As in our tests we aimed to create links among sounds that generically "sounded alike" (e.g. same genre), we referred to the signal features used in the literature to perform genre classification. In particular we chose Mpeg-7 descriptors which have a direct relation to features considered in [9][10][11]:
• AudioSpectrumEnvelope (ASE): a low resolution version of the signal spectrum.
• AudioSpectrumCentroid (ASC): a centroid measure of the spectrum; it indicates the "brightness" of the sound.
• AudioSpectrumSpread (ASS): the standard deviation of the spectrum around its centroid. It is an inexpensive way to describe the shape of the spectrum (how much the energy is concentrated near its centroid).
• AudioPower (AP): the signal power.
• HarmonicRatio (HR): a measure of the sound's harmonic degree.

Finally, we report that while this process could in theory be performed on the entire audio tracks, the segmentation procedure greatly increases its accuracy, given the unacceptable performance of averaging the low level descriptors over the whole duration of the (musical) audio resources.

3. Metadata propagation over Audio Hyperlinks

Once the acoustic hyperlinks have been extracted, we can think of expressing them on the Semantic Web, enabling a number of interesting applications and scenarios. By means of a proper ontology, capable of describing different kinds of audio links (e.g. 'sounds like', 'same instrument playing'), one could obtain online RDF/RDFS documents which could be harvested by spiders and processed for a variety of purposes, the most obvious being rich semantic browsing of audio collections. In this section, however, we present a novel application of audio hyperlinks: metadata enhancement based on label propagation. This application responds to the following question: supposing we have a fraction of the audio resources annotated with metadata, e.g. with the genre, how do we propagate these annotations across the web of audio hyperlinks? This sort of algorithm, dubbed "relaxation labelling", has been used in the literature for a variety of purposes (see [12] for an application to web site ratings) and can be somewhat related to the PageRank algorithm by which the Google search engine rates the pages it indexes: trust is only partially a function of the page itself, and rather comes from the links across the whole index. Studied in detail by Hummel and Zucker [13], the relaxation labeling algorithm uses contextual information to find a consistent mapping between labels and nodes in a graph. An optimized approach [14] reduces the complexity by using radial projection in the update process of the support vectors, which basically define how the labeling of each node is influenced by the labels assigned to the other nodes. In our implementation, the nodes that were manually annotated do not receive support from the others, so their label can't change. Also, a further assumption is made about the compatibility coefficient rij(A,B), i.e. the support that node A with label j gives to the assignment of label i to node B. As it is not clear how the support between different genre labels could be measured, this being a somewhat subjective task, we assume that rij(A,B) is zero when i and j are the same label, while it is equal to the weight of the audio hyperlink if they correspond to different genres.
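A much simplified sketch of this kind of label propagation is given below. It clamps the pre-labelled nodes and iterates a generic weighted-support update; it is not the exact Hummel-Zucker update rule nor this paper's compatibility-coefficient convention:

```python
def propagate_labels(edges, seeds, labels, iters=50):
    """Simplified relaxation-labelling sketch: seed nodes are clamped; every
    other node repeatedly absorbs its neighbours' label scores, weighted by
    the audio-hyperlink weight, then the strongest label wins.

    edges: {(a, b): weight} (undirected); seeds: {node: label}.
    """
    nodes = set()
    for a, b in edges:
        nodes.update((a, b))
    # score[node][label]: confidence that `node` carries `label`
    score = {n: {l: 0.0 for l in labels} for n in nodes}
    for n, l in seeds.items():
        score[n][l] = 1.0
    for _ in range(iters):
        new = {n: dict(score[n]) for n in nodes}
        for (a, b), w in edges.items():
            for src, dst in ((a, b), (b, a)):
                if dst in seeds:
                    continue  # manually annotated nodes receive no support
                for l in labels:
                    new[dst][l] += w * score[src][l]
        for n in nodes:  # normalise so scores stay bounded
            if n in seeds:
                continue
            total = sum(new[n].values()) or 1.0
            score[n] = {l: s / total for l, s in new[n].items()}
    return {n: max(score[n], key=score[n].get) for n in nodes}
```

With a strong link to a "rock"-seeded node and a weak link to a "jazz"-seeded one, an unlabelled node converges to "rock", mirroring how link weight drives the propagation.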


Figure 2. Well-labeled tracks (y-axis: % of correct labellings; x-axis: % of pre-labelled tracks) using different algorithms: Radial Projection (light), Zucker-Hummel (dark); the darker line indicates the trivial result with no algorithm applied.

As a result of this procedure, annotations that originally referred only to a small subset of the database can be "propagated" to the other known resources. The experimental results, shown in Figure 2, show that this is in fact very successful, at least in the case of genre annotations; as an example, starting from 20% annotated audio resources, up to 75% correctly annotated resources can be obtained. Finally, in Figure 3 we see how the algorithm performs even better as we lower the threshold used to declare an "acoustic link" between two audio resources. This results in a higher number of "lower quality" links, which are probably less interesting to a human browser but which the algorithm can successfully exploit for the purpose of relaxation labeling.
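The threshold-based link detection governing this quality/quantity trade-off can be sketched as follows; segments are represented here by plain per-frame value lists standing in for the MPEG-7 LLD frames, and the mean/variance projections are the ones described in section 2:

```python
import math

def project(frames):
    """Project a segment's LLD frames onto per-dimension (mean, variance)."""
    n, dims = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    variances = [sum((f[d] - means[d]) ** 2 for f in frames) / n
                 for d in range(dims)]
    return means + variances

def link_weight(seg_a, seg_b):
    """Normalized scalar product of the two projected feature vectors."""
    va, vb = project(seg_a), project(seg_b)
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(y * y for y in vb))
    return dot / norm if norm else 0.0

def detect_links(segments, T=0.99):
    """Emit an acoustic hyperlink for every pair whose weight exceeds T."""
    names = sorted(segments)
    links = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            w = link_weight(segments[a], segments[b])
            if w > T:
                links.append((a, b, w))
    return links
```

Lowering `T` admits more, weaker links, which is exactly the knob varied in Figure 3.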


Figure 3. Performance of the Zucker-Hummel method (y-axis: % of correct labellings; x-axis: % of pre-labelled tracks) as the links increase in number but decrease in quality.

4. Conclusions and future work

In this work we explored Audio Hyperlinks, a scenario of multimedia annotation on the Semantic Web. We illustrated a novel procedure that feeds on heterogeneous MPEG-7 Low Level descriptions and produces indications of acoustic similarity among audio resources. We then illustrated how these can be written using the tools and formalisms of the Semantic Web initiative. For this purpose we created a demonstrative ontology and illustrated how it could be used to encode audio hyperlinks and foster interoperability and potentially "intelligent" distributed behaviors. Finally, we presented an application of audio hyperlinks which uses the resulting network of annotations, possibly retrieved online by an intelligent agent, to enhance the annotations of the database itself. This work has been performed as an enhancement of the MPEG7AudioDB [15] framework, and all the source code producing the results illustrated here is available as open source. Further work will concentrate on extracting Audio Hyperlinks of a different nature, maximally relevant to especially interesting browsing possibilities for humans.

5. Acknowledgments

Thanks and gratitude go to Michele Bartolucci and Francesco Saletti for their work on the libraries and algorithms. Thanks also go to Oreste Signore (CNR/W3C) for the support provided.

Bibliography
1: Resource Description Framework (RDF), W3C Specification, http://www.w3.org/RDF/
2: OWL Web Ontology Language Overview, W3C Recommendation, 10 February 2004
3: MPEG-7, ISO/IEC JTC1/SC29/WG11 N4031, 2001
4: G. Tummarello, C. Morbidoni, P. Puliti, A. F. Dragoni, F. Piazza, "From Multimedia to the Semantic Web using MPEG-7 and Computational Intelligence", WedelMusic 2004, Barcelona, 2004
5: H. Crysand, G. Tummarello, F. Piazza, "An MPEG7 Library for Music", 3rd MUSICNETWORK Open Workshop, Munich, 2004
6: M. Cooper, J. Foote, "Automatic music summarization via similarity analysis", Third International Symposium on Music Information Retrieval (ISMIR), pp. 81-85, Paris, September 2002
7: J. Foote, "Visualizing music and audio using self-similarity", Proceedings of ACM Multimedia '99, Orlando, FL, ACM Press, pp. 77-80, 1999
8: J. Foote, "Automatic audio segmentation using a measure of audio novelty", Proceedings of the IEEE International Conference on Multimedia and Expo, vol. I, pp. 452-455, 2000
9: G. Tzanetakis, "Musical genre classification of audio signals", IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, July 2002
10: G. Lu, T. Hankinson, "A technique towards automatic audio classification and retrieval", Fourth International Conference on Signal Processing, Beijing, October 12-16, 1998
11: E. Wold, T. Blum, D. Keislar, J. Wheaton, "Content-based classification, search and retrieval of audio", IEEE Multimedia Magazine, vol. 3, no. 3, pp. 27-36, 1996
12: M. Marchiori, "The Limits of Web Metadata, and Beyond", Proceedings of the Seventh International World Wide Web Conference (WWW7), 1998
13: R. A. Hummel, S. W. Zucker, "On the foundations of relaxation labeling processes", IEEE Transactions on Pattern Analysis and Machine Intelligence, 5(3):267-287, May 1983
14: P. Parent, S. W. Zucker, "Radial projection: An efficient update rule for relaxation labeling", IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(8):886-889, August 1989
15: G. Tummarello, C. Morbidoni, F. Piazza, MPEG-7 Audio DB
