Audio Data Model for Multi-criteria Query Formulation and Retrieval Tizeta Zewide
Solomon Atnafu
IT Services Section (UNECA) Addis Ababa, Ethiopia Tel: 251 911 401240
Department of Computer Science Addis Ababa University Tel: 251 911 406946
[email protected]
[email protected]
ABSTRACT
The amount of available audio data is increasing rapidly as a consequence of advancements in media creation, storage and compression technologies. This rapid increase imposes new demands on audio data management and retrieval. In this work, we propose an audio data model and a repository model to fulfill user requirements in retrieving audio data from large collections. The proposed audio data repository model facilitates multi-criteria query formulation and audio data retrieval, whereby audio can be queried both by its low-level and high-level features. In the proposed model, a generic audio repository model that can handle general audio is discussed, as well as a sub-repository model that can manipulate speech through its constituent units. Finally, the viability of the proposed model is demonstrated by a prototype system developed for an application in the medical domain.
Keywords
Audio data model, audio data repository model, multi-criteria audio data retrieval, Audio Data Management for Medical Application (ADMMA).

1. INTRODUCTION
Audio data is an integral part of many multimedia applications. As a result, it has become a critical component of information systems that needs to be efficiently managed. Various application areas, such as bioacoustics, medical services, the music industry, investigation services, and the movie and animation industries, require the management of large audio collections. Such a list of application areas shows the need for modeling audio data so that we can manipulate, analyze and retrieve audio from large digital collections. While data modeling is the basis for any multimedia information management system, to the best of our knowledge, no research attempt has been made towards modeling audio data. Thus, the main objective of this work is to develop a data model and a repository model for capturing audio data content and enabling multi-criteria query formulation and retrieval. To meet these objectives, identification of the audio components that need to be captured in order to represent and describe audio data is required. Furthermore, identification of the appropriate low-level and high-level audio data features that can be applied to a particular application domain is considered.
The remaining parts of this paper are organized as follows. In Section 2, a survey of related work is presented. Section 3 addresses the proposed models. In Section 4, an Audio Data Management for Medical Application (ADMMA) prototype that demonstrates the usability of our proposal is presented. Experiments and results are described in Section 5. In Section 6, conclusions and future work are given.

2. RELATED WORK
Audio can be captured using devices such as microphones or produced by program algorithms. Audio recording devices take analog signals and convert them into digital values with specific audio characteristics such as format, encoding type, number of channels, sampling rate, sample size, compression type, etc. A widely known example of digitally encoded analog audio is the CD-Audio standard. It defines a sampling rate of 44.1 kHz and a quantization of 16 bits. Such an encoding preserves all perceivable frequencies and eliminates audible quantization noise [3].

Audio Features
Audio features are the characteristics that describe the contents of an audio signal; they are derived from a segment of audio by feature extraction to form a feature vector. This feature vector representation has a fixed dimension and can be used as a fundamental building block of various types of audio analysis and information extraction algorithms, as it is used to characterize, classify and index a given audio signal [1, 4].
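To make the notion of a fixed-dimension feature vector concrete, the following is a minimal NumPy sketch (illustrative only, not the extractor used in this work) that frames a signal and summarizes two simple spectral measures into a four-dimensional vector.

```python
import numpy as np

def feature_vector(signal, sr=44100, frame_len=1024, hop=512):
    """Illustrative fixed-dimension feature vector: mean and standard deviation
    of per-frame spectral centroid and spectral flatness."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    centroids, flatness = [], []
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
    window = np.hanning(frame_len)
    for frame in frames:
        spectrum = np.abs(np.fft.rfft(frame * window)) + 1e-12
        centroids.append(np.sum(freqs * spectrum) / np.sum(spectrum))
        # Spectral flatness: geometric mean over arithmetic mean of the spectrum.
        flatness.append(np.exp(np.mean(np.log(spectrum))) / np.mean(spectrum))
    # Summarize the per-frame values into one fixed-length vector.
    return np.array([np.mean(centroids), np.std(centroids),
                     np.mean(flatness), np.std(flatness)])
```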
MPEG-7 Low-Level Descriptors (LLD)
MPEG-7 provides concepts that describe multimedia data. The audio part of the MPEG-7 standard specifically contains descriptive elements that characterize the underlying audio signal itself rather than merely “labeling” it with high-level tags [11]. MPEG-7 Audio provides structures for describing audio content, as well as high-level description tools that are more specific to a set of applications. The MPEG-7 low-level audio descriptors are of general importance in describing audio [12]. Based on these descriptors, it is often feasible to analyze the similarity between different audio files [8].
Audio Feature Selection and Classification
An audio archive may contain a broad range of sound types distinguished by different acoustic characteristics. In such cases, the major difficulty is extracting features that are capable of characterizing all sounds in the archive. Considerable studies have been carried out to find optimal features applicable to general audio and to specific audio class discrimination [9, 13]. However, these studies showed that optimal feature selection depends on the domain and the classification technique. On the other hand, various studies have shown that the MPEG-7 low-level descriptors are designed to be applicable to generic audio signals.
Audio signal classification is concerned with identifying the class to which a sound most likely belongs. Several techniques have been employed to classify an unknown sound by measuring the similarity between its feature vector and those of known sounds. The most common approaches are Gaussian Mixture Model (GMM) based methods [8, 13, 9], Hidden Markov Models (HMM) [5], nearest neighbor methods [13, 9], Neural Network (NN) variants [7], Vector Quantization (VQ) [7, 8] and Support Vector Machines (SVM) [14].
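As an illustration of the simplest of these approaches, the following sketch performs nearest-neighbor classification over feature vectors; the Euclidean metric and the class labels are assumptions of the example, not a description of the systems cited above.

```python
import numpy as np

def classify_nearest_neighbor(query_vec, labeled_vectors):
    """Assign the query to the class of its nearest labeled feature vector.
    labeled_vectors: list of (feature_vector, class_label) pairs."""
    best_label, best_dist = None, float("inf")
    for vec, label in labeled_vectors:
        dist = np.linalg.norm(np.asarray(query_vec) - np.asarray(vec))  # Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Hypothetical usage with the semantic classes discussed in this paper:
# label = classify_nearest_neighbor(query_vec, [(v1, "speech"), (v2, "music"), (v3, "esound")])
```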
Audio Retrieval
Different approaches have addressed automatic analysis of audio content, be it speech/music classification [13, 16, 29], retrieval of similar sounds [17, 6], or music genre classification [7]. Two approaches have commonly been employed to retrieve audio data. The first is Query-By-Example (QBE) retrieval, where the query is a piece of audio and retrieval is performed using a similarity measure. The other approach is to generate textual indices and then use text-based information retrieval. A QBE system for isolated sounds was developed at Muscle Fish LLC [6]. The Muscle Fish system is a pioneering work that demonstrated compelling audio retrieval by similarity over a general database. In [18], an audio retrieval system is developed for detecting specific passages of a piece of music within another audio file. The advantage of the QBE approach is that similarity is computed on features that are automatically derived from the audio signal and can therefore be applied inexpensively on a large scale. However, these systems make no pretense of attaching semantics to the queries, as they are not oriented towards audio semantics. Audio information retrieval based on high-level features is certainly a reasonable answer to the semantic drawbacks of retrieval based on low-level features. It is therefore apparent that modeling the high-level meaning of audio requires semantic content together with its low-level counterpart. A very popular means of semantic audio retrieval is to annotate the audio with text and use text-based database management systems to perform the retrieval. However, this approach has significant drawbacks when confronted with large volumes of audio data because the annotation process is time consuming and, due to human intervention, subjective. The other commonly used approach is to apply Speech Recognition (SR) techniques to extract spoken words from a sound track or to use automatic text transcription of an audio segment [2, 19]. Manual annotation is still widely used because SR systems are error prone due to problems such as Out Of Vocabulary (OOV) words, challenges related to speaker-independent and continuous speech, lack of language models, poor audio quality, etc. [32]. In addition, the application of SR systems is limited to speech.

3. MODELING AUDIO DATA
3.1 Audio Data Model
An audio data model is a set of concepts that can be used to describe the structure and content of an audio item. To design an audio model, it is essential to look into the characteristics of audio in detail to see the possible components and classifications that may affect the modeling. For example, speech can be hierarchically structured into topics, sentences and words, but the same classification cannot be applied to other categories. For instance, the structuring and classification of music is different: though it is difficult to structure music in a hierarchical fashion, it can be structured into its main components (intro, verse, chorus, bridge, etc.), and each of these can be subclassified further. These differences signify that, because different audio types require different processing and retrieval techniques, a general audio signal is first classified into one of those audio classes. In this work, the basic determinants for classification are the audio features considered and the classification model used in [10], where an audio stream is classified into speech, music and environmental sounds. Audio classification, as discussed in Section 2, is a mature research field. Using those classification techniques, general audio can be segmented and classified in terms of its semantic classes. General audio, in this context, is defined to be any audio clip with no assumptions on length, segmentation, source category or composition with other sounds.
We focus on identifying and proposing an audio data model and a repository model that can be used for generic audio signals. We use MPEG-7 features for representing the high-level and the low-level information because of the advantages that the MPEG-7 standard offers.
The proposed audio data model can be used as a generic model that can be applied in an unconstrained domain. If particular consideration is needed for a specific audio class, further refinement is necessary. The classification is made based on the signal characteristics, which are represented by MPEG-7 low-level descriptors. These features are modeled together with their metadata information. Existing metadata classifications are used in our work with little modification. The metadata classifications used in the proposed audio data model are defined below.
Creation and Production Information metadata (CPI metadata) – metadata elements related to the process of audio data creation and application-specific information, such as creators (authors, singers), title, recording date, etc. This information is usually provided by the media acquisition process.
Contextual metadata – a subjective set of information, i.e., external knowledge produced by the listener resulting from patterns, categories or associations with previously known facts. Contextual metadata is key to efficient semantics handling.
Storage metadata – information directly related to the medium holding the audio and to the characteristics of the audio encoding, such as format, encoding type, compression type, number of channels, sampling rate, sample size, etc.
In general, by integrating the above categories of metadata, it is possible to cope with audio data efficiently. Along with these descriptions, low-level descriptors can be captured so that the required piece of audio can be identified. Since MPEG-7 offers a framework for describing audio documents, covering the actions, objects and context of an audio clip, all these types of information can be handled by MPEG-7 description schemes. Some of these descriptions, such as creation, production and usage, can only be inserted manually, whereas other content descriptions can partly be extracted automatically. Basic information about the storage medium, such as encoding, number of channels, sampling rate and sample size, can be read directly from the analog-to-digital converter devices. The proposed audio data model that incorporates all the aforementioned issues is shown in Figure 1. The class of environmental sound (Esound) constitutes the audio signals that are neither speech nor music. In order to model them, apart from the low-level descriptors, domain-dependent keywords must first be extracted by domain experts. Finally, all the metadata information associated with the above audio units, together with their low-level features, is stored as MPEG-7 descriptions next to the media in the database.
Figure 1: The Proposed Audio Data Model

3.2 Audio Data Repository Model (ADRM)
A data repository model is a conceptual representation of a repository that deals with the way data is stored in a Database Management System (DBMS). This section describes how to represent audio components in a convenient way so as to store them in a DBMS. In this work, we use an Object-Relational Database (ORDB) framework because a database system that stores audio information should be able to store, manipulate and render audio in ways that differ from the techniques employed in relational systems. Besides, an audio data repository must capture the low-level feature descriptions and metadata information of audio data that are required for content-based and high-level feature based (e.g., keyword-based) audio retrieval. Based on these requirements, we propose an audio data repository model that facilitates the capture, representation and management of audio data. The proposed audio data repository model can also be considered an implementation of the audio data model proposed in the previous subsection under an ORDBMS paradigm. The repository model can be considered an extension of the image data repository model proposed by some of the authors of this paper in [15]. The proposed Audio Data Repository Model (ADRM) is a schema of five components A(ID, O, F, M, T), under an ORDB scheme, where:
ID: a unique identification that identifies an audio data item in the database table.
O: an object that stores the physical audio data either as a Binary Large Object (BLOB) or as a BFILE (an external file).
F: the set of low-level features that represent the audio object.
M: the metadata information that describes the audio data as per the classification given in the audio data model.
T: the temporal information associated with a given audio unit, consisting of the starting time of the audio unit and the duration of the audio play.
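The following Python sketch illustrates the five-component schema A(ID, O, F, M, T); in the actual repository these components would be ORDBMS object types with BLOB/BFILE storage, so the field names and types here are placeholders only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class AudioUnit:
    """ADRM schema A(ID, O, F, M, T) -- illustrative sketch, not the ORDBMS DDL."""
    id: str                                                 # ID: unique identifier in the repository
    media: Optional[bytes] = None                           # O: audio object stored inline (BLOB-style)
    media_path: Optional[str] = None                        # O: alternative BFILE-style external reference
    features: List[float] = field(default_factory=list)     # F: low-level feature vector
    metadata: Dict[str, str] = field(default_factory=dict)  # M: CPI / contextual / storage metadata
    start_time: float = 0.0                                 # T: start time of the unit (seconds)
    duration: float = 0.0                                   # T: play duration (seconds)
```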
3.3 ADRM: The Case of Speech
Speech is one subclass of audio data. Unlike other audio categories, speech can be characterized by its hierarchical structure. As per the structure given in the proposed data model, speech data can be organized into topics, sentences and words. Such a classification enables users to retrieve only the part of the speech in which they are interested. Though the proposed ADRM can be applied to any generic audio data (music, speech or environmental sound), speech data needs further refinement as it can be represented in a hierarchical fashion. Thus, we give speech extra consideration so as to show how each of the speech units can be modeled without losing the generality of the proposed repository model. With this consideration, we give formal definitions for each speech unit below, starting from the “basic” unit, the word.
Word (W) – a basic speech unit; a set of words constitutes a sentence. Although the word is not a unit of user interest in most cases, capturing the necessary information about it enables keyword-based audio retrieval and facilitates searching for a specific audio segment according to its importance to users and to the application domain. In this study, however, Word refers to keywords that are considered representative of a speech document by domain experts. A word is described by five components W(IDw, Ow, Fw, Mw, Tw), where:
IDw: a unique identification for the word.
Ow: refers to the word audio source or object.
Fw: the low-level features extracted at word level.
Mw: the alphanumeric data stored to represent the word, for instance the text that denotes the word.
Tw: the temporal information related to the word within the speech.
Sentence (S) – a unit of speech that comprises a set of words. The sentence is the “basic” speech unit, from the user’s point of view, that can convey meaningful information. It consists of five components S(IDs, Os, Fs, Ms, Ts); all the elements of S share the definitions given for W, except that these components are captured at sentence level.
Topic (Tp) – a speech unit that contains a sequence of sentences considered to express complete information regarding a particular subject. A topic also has five components Tp(IDt, Ot, Ft, Mt, Tt). Mt represents the metadata information given at topic level, as well as the identification of all sentences that form the topic. Capturing sentence information makes it possible to identify the “parent” topic encompassing the sentences.
Speech (Sp) – a collection of one or more topics, identified by five components Sp(IDsp, Osp, Fsp, Msp, Tsp). Osp refers to the source speech itself, and Fsp to the low-level representation extracted at speech level. The Msp component holds the general description of the speech, the CPI metadata, and the identification of the topics that make up the speech. The CPI metadata is inherited by all “child” units of the speech (i.e., topics, sentences and words) because it is considered common to all sub-units. Tsp holds the timing information of the speech.
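A containment-style sketch of the speech hierarchy is given below; it is illustrative only, since in the repository model the parent/child relationships are recorded through identifiers stored in the metadata component rather than nested objects, and the feature, metadata and temporal components of Sentence and Topic are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Word:                      # W(IDw, Ow, Fw, Mw, Tw)
    id: str
    media_path: str              # Ow: word audio segment
    features: List[float]        # Fw: word-level low-level features
    text: str                    # Mw: textual form of the keyword
    start: float                 # Tw: timing within the speech (seconds)
    duration: float

@dataclass
class Sentence:                  # S(IDs, Os, Fs, Ms, Ts) -- remaining components omitted
    id: str
    media_path: str
    words: List[Word] = field(default_factory=list)

@dataclass
class Topic:                     # Tp(IDt, Ot, Ft, Mt, Tt) -- remaining components omitted
    id: str
    media_path: str
    sentences: List[Sentence] = field(default_factory=list)

@dataclass
class Speech:                    # Sp(IDsp, Osp, Fsp, Msp, Tsp)
    id: str
    media_path: str
    cpi_metadata: Dict[str, str] = field(default_factory=dict)  # inherited by child units
    topics: List[Topic] = field(default_factory=list)
```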
4. ADMMA
The proposed models can be applied to any audio data management application domain. For the sake of demonstration, we have selected a medical application and developed a prototype called “Audio Data Management for Medical Application (ADMMA)” to demonstrate the usability of the models. The prototype limits itself to two classes of applications in the medical domain: heart sound identification for cardiac diagnosis (Esound category) and speech for medical image description by radiologists (speech category). It demonstrates how the proposed models can effectively be used for audio retrieval to assist physicians in differentiating between the different heart sounds: the physician provides an input heart sound and queries the system to retrieve similar sounds and, if needed, any related information. We have also demonstrated that heart sounds can be retrieved based on the high-level features by which they are described. Such systems can be convenient for diagnosing heart sounds as well as for developing teaching aids for auscultation training.
In the second case, a recorded speech of a medical image description is considered to illustrate audio retrieval techniques with high-level features, structured based on the proposed audio repository model for speech; these techniques can facilitate communication among physicians through written text and/or medical images.
ADMMA has two categories of interfaces: audio data entry and audio retrieval interfaces.
Audio Data Entry Interfaces
Based on the proposed audio model, these interfaces allow users to enter the audio source, encoding schemes, CPI and contextual metadata. For a consistent representation of all metadata information related to audio, MPEG-7 metadata is used and the IBM Multimedia Annotation Tool is employed to annotate audio metadata.
Audio Retrieval Interfaces
These interfaces are responsible for rendering query results to users. They enable users to formulate multi-criteria queries so as to retrieve audio using either its low-level or high-level features.
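A sketch of how such a multi-criteria query could be evaluated is given below: a high-level keyword filter over the metadata component followed by low-level similarity ranking over the feature component. The in-memory repository structure and parameter names are assumptions of the sketch, not the prototype's implementation.

```python
import numpy as np

def multi_criteria_query(repository, keyword=None, query_features=None, top_k=5):
    """Filter by high-level metadata, then rank by low-level similarity.
    repository: list of dicts with 'metadata' (dict of strings) and 'features' (vector)."""
    candidates = repository
    if keyword:  # high-level criterion: keyword must occur in some metadata field
        candidates = [r for r in candidates
                      if any(keyword.lower() in v.lower() for v in r["metadata"].values())]
    if query_features is not None:  # low-level criterion: Euclidean distance ranking
        candidates = sorted(
            candidates,
            key=lambda r: np.linalg.norm(np.asarray(r["features"]) - np.asarray(query_features)))
    return candidates[:top_k]
```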
5. EXPERIMENT
The audio data required to conduct the experiment is defined based on the selected application domain. Since there is no publicly available reference database of heart sounds and speech-based medical image descriptions, a custom database has been built from internet searches and live recordings. A total of 78 sounds were collected: 48 heart sounds in 6 classes (8 sounds each) and 30 speech recordings. As these sounds lack a common representation and format, preprocessing is done to convert them into a fixed target format with predefined settings. In addition, background noise is eliminated with the help of an off-the-shelf audio editing tool, Audacity. Feature extraction was then done using the MPEG-7 LLD extractor provided by the Technical University of Berlin (TUB). However, not all LLDs are expected to be relevant to a specific audio class. Thus, a preliminary experiment has been done to identify the LLDs that best suit the class of heart sounds.
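The following sketch illustrates the kind of format normalization applied before feature extraction (mono mix-down, resampling, amplitude rescaling); the target rate is an assumption, and the actual preprocessing relied on Audacity and the TUB extractor rather than this code.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def normalize_audio(samples, sr, target_sr=22050):
    """Mix down to mono, resample to a fixed rate and rescale to [-1, 1]."""
    x = np.asarray(samples, dtype=np.float64)
    if x.ndim == 2:                     # (samples, channels) -> mono
        x = x.mean(axis=1)
    if sr != target_sr:
        g = gcd(sr, target_sr)
        x = resample_poly(x, target_sr // g, sr // g)
    peak = np.max(np.abs(x))
    return (x / peak if peak > 0 else x), target_sr
```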
Content-Based Audio Retrieval
For the content-based experiment, some features are selected based on their robust identification ability and their performance in a variety of applications and studies [21]. These features include AudioSpectrumFlatness (ASF), HarmonicSpectralCentroid (HSC), HarmonicSpectralDeviation (HSD), HarmonicSpectralSpread (HSS), HarmonicSpectralVariation (HSV), AudioHarmonicity (AH), AudioSpectrumBasis (ASB) and AudioSpectrumProjection (ASP). The other MPEG-7 features are left out as they are destined to be useful for particular classes of sounds. The aforementioned features are extracted from the 48 heart sounds and stored in their feature vector form in the database, together with the source audio, in the audio repository model format. A similarity matching (using the Euclidean distance metric) is applied for QBE between an input audio and the stored audio data on the basis of their respective features. The quality of the features is measured in terms of recall/precision pairs of the top-ranked matches. In the experiment, one sample at a time was drawn from the database to serve as an example query, and the rest were considered as samples to be compared to the query. A database sample is considered correctly retrieved when it falls in the same class as the example sound; correctness is verified by domain expert evaluation.
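A sketch of this leave-one-out QBE evaluation, under the assumption of Euclidean distance over stored feature vectors and a fixed number of top-ranked matches, is given below.

```python
import numpy as np

def leave_one_out_eval(samples, top_k=5):
    """samples: list of (feature_vector, class_label). Each sample in turn serves
    as the query; the rest are ranked by Euclidean distance. A retrieved sample
    counts as correct when it shares the query's class."""
    recalls, precisions = [], []
    for i, (qvec, qlabel) in enumerate(samples):
        rest = [s for j, s in enumerate(samples) if j != i]
        ranked = sorted(rest,
                        key=lambda s: np.linalg.norm(np.asarray(s[0]) - np.asarray(qvec)))
        hits = sum(1 for _, label in ranked[:top_k] if label == qlabel)
        relevant = sum(1 for _, label in rest if label == qlabel)
        precisions.append(hits / top_k)
        recalls.append(hits / relevant if relevant else 0.0)
    return float(np.mean(recalls)), float(np.mean(precisions))
```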
The first experiment combined all the aforementioned features into a single feature vector and applied similarity matching between each category of example and sample heart sounds. However, the aggregate features gave recall and precision rates of 11.9% and 8.35%, which is very low. Since the combined features did not give promising results, further experiments were conducted using the individual features. First, similarity matching was performed based on the distances computed between the ASF values of the query sound and the sample sounds for each sound category. ASF yields relatively good results, with average recall and precision of 78.8% and 55.2%, respectively. The same procedure was then applied to the other low-level features. The computed recall and precision results are presented in Figures 2 and 3, respectively.
As indicated in Figures 2 and 3, most of the features exhibit poor results except ASF and AH. Thus, a combination of these two features was used to check whether a better result could be attained when they are used together. However, the combined feature does not show the expected identification power, yielding recall and precision rates of only 32.4% and 22.7%, respectively.
[Figure 2 plot: recall (%) for each feature (Combined, ASF, HSC, HSD, HSS, HSV, ASB, ASP, AH, ASF & AH) across the heart sound classes Gallop, Holosystolic murmur, Late Systolic murmur, Mid Systolic murmur, Normal Heart Sound, Systolic ejection murmur, and their average.]
Figure 2: Recall results for the different heart sounds
The basic objective of this experiment is to identify which MPEG-7 LLD feature(s) best suit heart sounds. Based on the above results, ASF gave good results for the specific domain and sample data; thus, it has been used as a determinant feature in identifying heart sounds. The main content-based audio retrieval experiment is conducted to identify a piece of audio using its fingerprint. The basic idea of audio fingerprinting is to identify a piece of audio content by extracting a unique signature from it. Motivated by the results of ASF (see Table 1), fingerprints are extracted and used as low-level features for representing heart sounds in content-based audio retrieval. To extract a fingerprint, the audio signal is first segmented into overlapping frames of 0.37 seconds. This results in the extraction of a 32-bit sub-fingerprint every 11.6 milliseconds, where a sub-fingerprint is a compact representation of a single frame.
Since a single sub-fingerprint does not contain sufficient data to identify an audio signal, a fingerprint block consisting of 256 subsequent sub-fingerprints is used. Finally, the fingerprints of audio signals are compared using the Hamming distance, which measures the number of bit disagreements between two vectors.
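The sketch below outlines fingerprint extraction and comparison in the spirit of [20]; the frequency range, windowing and band layout are assumptions of the sketch rather than the exact configuration used in the prototype.

```python
import numpy as np

def sub_fingerprints(signal, sr, frame_sec=0.37, hop_sec=0.0116, n_bands=33):
    """Per-frame 32-bit sub-fingerprints in the spirit of Haitsma & Kalker [20]:
    each bit encodes the sign of the band-energy difference along frequency and time."""
    frame_len = int(frame_sec * sr)
    hop = int(hop_sec * sr)
    # Logarithmically spaced band edges (300 Hz - 2000 Hz is an assumption of this sketch).
    edges = np.geomspace(300.0, 2000.0, n_bands + 1)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
    window = np.hanning(frame_len)
    energies = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        spectrum = np.abs(np.fft.rfft(signal[start:start + frame_len] * window)) ** 2
        energies.append([spectrum[(freqs >= lo) & (freqs < hi)].sum()
                         for lo, hi in zip(edges[:-1], edges[1:])])
    energies = np.array(energies)
    prints = []
    for n in range(1, len(energies)):
        # Difference of adjacent band energies, differenced again over time.
        diff = (energies[n, :-1] - energies[n, 1:]) - (energies[n - 1, :-1] - energies[n - 1, 1:])
        prints.append((diff > 0).astype(np.uint8))   # 32 bits from 33 bands
    return np.array(prints)

def bit_error_rate(block_a, block_b):
    """Fraction of disagreeing bits between two fingerprint blocks (e.g., 256 x 32 bits)."""
    return float(np.mean(block_a != block_b))
```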
[Figure 3 plot: precision (%) for each feature (Combined, ASF, HSC, HSD, HSS, HSV, ASB, ASP, AH, ASF & AH) across the heart sound classes Gallop, Holosystolic murmur, Late Systolic murmur, Mid Systolic murmur, Normal Heart Sound, Systolic ejection murmur, and their average.]
Figure 3: Precision results for the different heart sounds
Two fingerprint blocks are said to be similar if their Hamming distance, expressed as a Bit Error Rate (BER), is below a certain threshold. Studies in [20] indicate that a BER of less than 35% leads to very reliable identification, and the same threshold has been used in our experiment. The results show that average recall and precision of 83.0% and 58.2%, respectively, are attained (see Table 2). Though it is difficult to draw a general conclusion given the number of classes and samples, the results give good insight into the features that need to be considered, especially when dealing with periodic signals like heart beats.
Keyword-Based Audio Retrieval
The second group of experiments concerns keyword-based audio retrieval. Both heart sounds and speech-based image descriptions are tested. In the case of speech, recordings of image descriptions are captured and structured into their constituent units. The first step in structuring speech is to identify keywords that represent a given speech; domain experts were involved in identifying these keywords for each medical image description considered in the experiment. Then, six physicians with internet search experience were involved in providing keyword-based queries. Fifteen such queries were posed to the system to evaluate the applicability of the proposed model in helping users retrieve audio data based on semantic information (keywords). Among the 15 queries applied to the speech-based image descriptions, 12 (80%) were supported by the system. The remaining 20% of the queries failed because they used keywords that were not in the system. All 12 supported queries were answered with 100% recall and precision. On the other hand, retrieval of heart sounds was conducted by providing the system with metadata information about heart sounds, such as the category or the diseases believed to cause a given type of abnormal heart sound. All given queries were correctly answered with 100% recall and precision.
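A minimal sketch of such keyword-based lookup over expert-assigned keywords is given below; the in-memory inverted index is an illustrative stand-in for the metadata stored in the repository.

```python
from collections import defaultdict

def build_keyword_index(speech_units):
    """speech_units: iterable of (unit_id, keywords), where keywords is the set of
    expert-assigned keywords for a sentence or word unit."""
    index = defaultdict(set)
    for unit_id, keywords in speech_units:
        for kw in keywords:
            index[kw.lower()].add(unit_id)
    return index

def keyword_query(index, query_terms):
    """Return units matching all query terms; an unknown term yields an empty result,
    which corresponds to the failed queries reported above."""
    result = None
    for term in query_terms:
        matches = index.get(term.lower(), set())
        result = matches if result is None else result & matches
    return result or set()
```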
6. CONCLUSION
The amount of available audio data is increasing rapidly, and retrieving audio information is becoming more and more difficult. This increase has not only drawn more attention to audio but has also raised the requirement for efficient management of audio data. One such requirement is proper modeling of audio data. Modeling audio data makes it possible to identify the features that must be captured to facilitate queries performed on audio data. In addition, it reflects the inherent relationships that exist within a given audio item. Therefore, in this work, the low-level and high-level features that should be captured to represent an audio data item are identified and modeled. In addition, an audio repository model that reflects the intrinsic features of audio data that should be captured and stored in an ORDBMS to enable effective audio data retrieval is proposed. Since the proposed models store both the high-level and low-level features of audio data, they enable multi-criteria query formulation and retrieval, which in turn reduces the semantic gap between the two kinds of features. The proposed models and techniques are evaluated against the audio data management requirements of a medical application.
7. REFERENCES
[1] G. Tzanetakis and P. Cook. MARSYAS: a framework for audio analysis. Organized Sound 4(3), Cambridge University Press, 2000.
[2] M. Slaney. Semantic-audio retrieval. In Proc. 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 4, pages IV-4108–11, 2002.
[3] A.I. Zayed. Advances in Shannon's Sampling Theory. CRC Press, Boca Raton, pp. 157-159, 1993.
[4] G. Tzanetakis and P. Cook. Multifeature audio segmentation for browsing and annotation. In Proc. of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New York, pp. 17-20, 1999.
[5] T. Zhang and C. J. Kuo. Hierarchical system for content-based audio classification and retrieval. In Proc. International Conference on Acoustics, Speech, and Signal Processing, volume 6, pp. 3001–3004, 1998.
[6] E. Wold, T. Blum, D. Keislar, and J. Wheaton. Content-based classification, search, and retrieval of audio. IEEE Multimedia, vol. 3, pages 27-36, 1996.
[7] D. Pye. Content-based methods for the management of digital music. In Proc. International Conference on Acoustics, Speech, and Signal Processing, 2000.
[8] J. Foote. Content-based retrieval of music and audio. In Proc. SPIE, pages 138–147, 1997.
[9] M. Liu and C. Wan. A study on content-based classification and retrieval of audio database. In Proc. Int. Database Engineering and Applications Symposium (IDEAS-01), Grenoble, France; IEEE Computer Society Press, pp. 339–345, 2001.
[10] D. Mitrovic, M. Zeppelzauer and C. Breiteneder. Discrimination and Retrieval of Animal Sounds. In Proc. of the IEEE Conference on Multimedia Modeling, 2006.
[11] ISO/IEC JTC1/SC29/WG11 (MPEG). Multimedia content description interface - part 4: Audio. International Standard 15938-4, 2001.
[12] H.-G. Kim, N. Moreau and T. Sikora. MPEG-7 Audio and Beyond. Wiley, West Sussex, 2005.
[13] E. Scheirer and M. Slaney. Construction and evaluation of a robust multi-feature speech/music discriminator. In Proc. ICASSP, Munich, Germany, pp. 1331–1334, 1997.
[14] B. Whitman, D. Roy, and B. Vercoe. Learning word meanings and descriptive parameter spaces from music. In HLT-NAACL03, 2003.
[15] S. Atnafu, L. Brunie, and H. Kosch. Similarity-Based Operators and Query Optimization for Multimedia Database Systems. In Int. Database Engineering & Applications Symposium (IDEAS'01), Grenoble, France; IEEE Computer Society Press, pp. 346-355, 2001.
[16] D. Zhong and S.-F. Chang. An integrated approach for content-based video object segmentation and retrieval. IEEE Transactions on Circuits and Systems for Video Technology, vol. 9(8), 1259-1268, Dec. 1999.
[17] Muscle Fish homepage, available at: http://www.musclefish.com (consulted on July 16, 2006).
[18] C. Spevak and E. Favreau. SoundSpotter - a prototype system for content-based audio retrieval. In Proc. of the 5th Int. Conference on Digital Audio Effects (DAFx-02), Hamburg, Germany, September 26-28, 2002.
[19] W.-T. Chu, W.-H. Cheng, J. Y.-J. Hsu and J.-L. Wu. Toward semantic indexing and retrieval using hierarchical audio models. Multimedia Systems, 10(6): 570–583, 2005.
[20] J. A. Haitsma and T. Kalker. A Highly Robust Audio Fingerprinting System. In Proc. ISMIR 2002, Paris, 2002.
[21] E. Benetos, M. Kotti, C. Kotropoulos, J. Burred, G. Eisenberg, M. Haller, and T. Sikora. Comparison of Subspace Analysis-Based and Statistical Model-Based Algorithms for Musical Instrument Classification. In 2nd Workshop on Immersive Communication and Broadcast Systems (ICOB), 2005.