Serbia & Montenegro, Belgrade, November 22-24, 2005
EUROCON 2005
Audiovisual Retrieval Framework for Multimedia
Archives on Java Enabled Mobile Devices
Iftikhar Ahmad, Serkan Kiranyaz and Moncef Gabbouj

Iftikhar Ahmad is working in Nokia Corporation, Tampere, Finland (e-mail: Iftikhar.ahmad@nokia.com). Serkan Kiranyaz is a researcher in the Tampere University of Technology, Tampere, Finland-33720 (e-mail: [email protected]). Prof. Moncef Gabbouj is working in the Tampere University of Technology, Tampere, Finland-33720 (e-mail: moncef.gabouj@tut.fi).

1-4244-0049-X/05/$20.00 ©2005 IEEE

Abstract - Content-based multimedia retrieval over wireless networks is a challenging research problem. In this paper we focus on a generic client-server framework designed to perform aural and visual multimedia retrievals over wireless networks. The client is developed for Java enabled mobile devices, where a user can initiate a content-based query process to retrieve similar multimedia items, and the server is designed on a host computer to execute the query requests and stream back the retrieval results. Experiments show that the proposed scheme achieves efficiency in terms of adaptability and speed.

Keywords - aural, content-based, feature, mobile, indexing, retrieval, visual

I. INTRODUCTION

Rapidly advancing digital media technologies and compression standards, combined with today's multimedia mobile phones and the Internet, have led to the widespread exchange of multimedia information. This, however, brings the problem of handling and accessing such a massive amount of digital media. In order to overcome this problem, we are currently developing a content-based multimedia indexing and retrieval framework for mobile platforms, the so-called Mobile-MUltimedia Video Indexing and retrieval System (M-MUVIS). Content-Based Multimedia Retrieval (CBMR) addresses the problem of accessing a multimedia item that bears certain content, and it usually relies on the characterization of low-level features such as color, shape, texture, etc. It has been an active area of research for the past decade. Systems such as MUVIS [1], VisualSEEk [2] and Virage [3] are frameworks designed for indexing and retrieving images and/or audio-video files. In the area of content-based retrieval, the most common query types are visual and aural. A visual query can be performed for still images and video clips, whereas an aural query can be performed in order to retrieve similar audio and video (with audio) clips based on aural information. Nowadays video clips usually contain both audio and video tracks, which are two distinct sources from which to extract visual and aural features for a content-based query operation. So far visual information has received more attention in the area of content-based information retrieval. However, the audio can be even more important than its visual counterpart, since it is most likely to be unique and significantly stable within the entire duration of the content [4]. Therefore, the proposed framework supports both aural and visual queries, distinctively and as alternatives to each other, when they are performed over multimedia databases.

With the encouraging results of content-based image retrieval frameworks implemented over mobile platforms [5], [6], and due to the limitations imposed by text-based queries, many researchers and scientists have undertaken the challenge of meeting the content management needs of mobile device users. Content-based multimedia retrieval from mobile devices adds new challenges besides the content-based query operation. Different mobile devices come in different designs and capabilities. Moreover, they have different operating systems and input/output limitations, so it is hard to provide a generic solution that suits all devices. However, with the help of Java [7] as platform and programming language, some of these problems can be addressed. The M-MUVIS client can use command types [7] from the Mobile Information Device Profile (MIDP) [7] to map the client commands to the device buttons, so that it has the look and feel of device applications.

Recently the capabilities of mobile devices have improved significantly (faster input/output, memory capacity, processing and battery power), but they are still far behind personal computers. With existing mobile operating systems such as Symbian OS [8], Windows CE, Linux, etc., it is possible to extract the content-based features of the media items on the mobile devices [9]; however, this approach presents several limitations. For example, mobile devices have proprietary Application Programming Interfaces (APIs) for handling (i.e. accessing, processing, editing, streaming, etc.) multimedia items. Applications using such proprietary APIs will be limited to a certain set of devices or certain platforms (operating systems). Another limitation is battery power consumption and the lack of system resources such as CPU power and memory capacity. Even though a standalone CBMR system can be implemented on a mobile device, with the ever-increasing number of multimedia items it might take an infeasible amount of time to perform content-based indexing and retrieval operations. Furthermore, such a system would eventually reduce the talk and standby time of the mobile device.

Consequently, in this paper we present an efficient CBMR framework, M-MUVIS, which is designed on the basis of the contemporary MUVIS framework [1]. In order to avoid the aforementioned limitations, M-MUVIS is designed as a client-server architecture where the server runs on a computer and the client application runs on a Java enabled mobile device. The content-based query operation and media handling, streaming, etc. are performed on the server side, whereas lightweight operations such as sending a query request and User Interface (UI) capabilities are implemented on the mobile device.

The rest of the paper is organized as follows: Section II gives an overview of the M-MUVIS architecture. Section III explains the M-MUVIS database structure. Section IV describes the communication between the M-MUVIS client and server, while experimental results are presented in Section V. Finally, conclusions are drawn in Section VI.
clips are drawn as thumbnail images on the QRI. The main view of the M-MUVIS client and a sample QRI are shown in Figure 2. The M-MUVIS client settings can be seen in Figure 3. These settings are used in a content-based query operation and in QRI retrieval. Generally speaking, mobile devices support a limited number of media formats. Therefore, a dedicated media converter is provided on the server side of the proposed framework to convert alien (unsupported) media items to a mobile-friendly media format. Furthermore, the user can specify some optional parameters to optimize the performance of the media converter.
II. M-MUVIS SYSTEM ARCHITECTURE
As mentioned earlier, M-MUVIS is a client-server framework where the client application is used to initiate the content-based query operation and send it to the server, which in turn performs the content-based query operation and sends the query results back to the client. Since Java is device agnostic, the client application developed in Java is supported by a vast majority of mobile devices. The proposed client-server architecture of the M-MUVIS framework is shown in Figure 1.
Figure 2: M-MUVIS client main view and QRI on Nokia 6630
Figure 3: M-MUVIS client setting view on Nokia 6630

For an efficient query operation, an adapted Progressive Query operation [4], [6] is used, where the intermediate query results are generated and delivered to an M-MUVIS client during the execution of the query process. As mentioned earlier, mobile devices are limited in memory and the Java virtual machine consumes a significant part of it. Therefore, only a limited amount of memory is usually left for an M-MUVIS client to perform the retrieval of the media items. Memory management is automatic on the Java platform; Java applications can request the virtual machine to free some memory [7], but this is a costly operation in terms of processing power. For efficient memory usage, the M-MUVIS client does not cache the media items. The QRI is JPEG encoded and the M-MUVIS client uses a higher compression rate to further reduce the memory usage.
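The idea of delivering intermediate results during a query can be illustrated with a minimal sketch. It is not the Progressive Query of [4], [6]: it assumes that the database is scanned in fixed-size chunks (the actual operation may be time-driven), that features are plain numeric vectors, and that a Euclidean distance stands in for the feature-specific distance functions. The names `progressive_query`, `chunk_size` and `top_k` are hypothetical; the default of 12 follows the number of thumbnails drawn on the QRI.

```python
import heapq
from typing import Dict, Iterator, List, Sequence, Tuple

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Stand-in distance (assumption); real feature distances are defined in [4]."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def progressive_query(
    query: Vector,
    database: Dict[str, Vector],
    chunk_size: int,
    top_k: int = 12,
) -> Iterator[List[Tuple[float, str]]]:
    """Yield the best top_k (distance, name) pairs after each sub-query."""
    names = sorted(database)  # fixed scan order so results are reproducible
    best: List[Tuple[float, str]] = []
    for start in range(0, len(names), chunk_size):
        chunk = names[start:start + chunk_size]
        scored = [(euclidean(query, database[name]), name) for name in chunk]
        # Merge this sub-query result into the running ranking.
        best = heapq.nsmallest(top_k, best + scored)
        yield list(best)


if __name__ == "__main__":
    # Hypothetical two-dimensional features, for illustration only.
    demo_db = {
        "a.jpg": (0.0, 0.0),
        "b.jpg": (3.0, 4.0),
        "c.jpg": (1.0, 0.0),
        "d.jpg": (6.0, 8.0),
    }
    for step, ranking in enumerate(
        progressive_query((0.0, 0.0), demo_db, chunk_size=2, top_k=2), start=1
    ):
        print(step, [name for _, name in ranking])
```

Each yielded list is a complete ranking of the items seen so far, so a client could display it immediately and replace it when the next one arrives.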
Figure 1: M-MUVIS framework

There are two Servlets (web applications) [10] on the M-MUVIS server side to perform the query operation and media retrieval. The MUVIS Query Servlet (MQS) is used for the content-based query operation and the MUVIS Media Retrieval Servlet (MMRS) is used for the media retrieval. In a query operation, the similarity distances of multimedia items are calculated and ranking operations are performed afterwards, whereas in a media retrieval operation a "Query Resultant Image" (QRI) is created and streamed to the client. The first-best 12 images or key-frames of video
III. M-MUVIS DATABASE FORMATION

M-MUVIS databases are created offline by the MUVIS DbsEditor [4] application, as MUVIS and M-MUVIS share a common native database structure. Visual and aural features of the multimedia items (images and videos) are extracted offline. The MUVIS DbsEditor application is used to append audio clips, video clips and still images to, or remove them from, an M-MUVIS database. DbsEditor is also used to convert alien audio/video clips or still images to M-MUVIS supported file formats, according to the user's specifications. More information about DbsEditor can be found in [4]. Because audio/video clips are of unknown duration, feature extraction and indexing during a query operation might take an impractical amount of time; audio/video clips should therefore first be appended offline to an M-MUVIS database, upon which a query can then be performed. There is no such necessity for images; any supported digital image (internal or external to the active database) can be queried within the active database. The similarity distances are calculated by the particular distance functions, each of which is implemented in the corresponding visual/aural feature extraction module [4]. The list of supported image formats in the M-MUVIS framework is shown in Table 1.

Table 1: M-MUVIS supported image formats
client in a textual format. Upon receiving the text stream carrying the query results, the M-MUVIS client can then send a request for the QRI retrieval. The client and server use the Hypertext Transfer Protocol (HTTP) [12] for communication. Since HTTP is a stateless protocol, a session is created in MQS when the query request is received from an M-MUVIS client. Session tracking allows the M-MUVIS client to retrieve the query results. The fields of the text stream for a sample content-based aural query request sent to the M-MUVIS server for a video clip are shown in Table 4.

Table 4: A sample content-based query request of a video clip MTV3_Clip_10.avi in textual format between M-MUVIS client and server

M-MUVIS Client Query String: sn=3,qt=601,in=MTV3_Clip_10.avi,sqn=4,sqt=500,qmt=490

Field   Description
sn      Session number
qt      Query type
in      Media name
sqn     Progressive sub query number
sqt     Progressive sub query time
qmt     Query media type
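The query string of Table 4 is a comma-separated list of key=value pairs, which the following minimal sketch builds and parses. The function names are hypothetical, and the sketch assumes that a media name never contains a comma; the source does not say how such characters would be escaped, nor what the numeric codes for qt and qmt mean, so values are kept as plain strings.

```python
from typing import Dict

# Field keys and descriptions as listed in Table 4.
FIELD_DESCRIPTIONS: Dict[str, str] = {
    "sn": "Session number",
    "qt": "Query type",
    "in": "Media name",
    "sqn": "Progressive sub query number",
    "sqt": "Progressive sub query time",
    "qmt": "Query media type",
}


def parse_query_string(text: str) -> Dict[str, str]:
    """Split 'k1=v1,k2=v2,...' into a dict, keeping values as strings."""
    fields: Dict[str, str] = {}
    for pair in text.split(","):
        key, separator, value = pair.partition("=")
        key = key.strip()
        if not separator or not key:
            raise ValueError(f"malformed field: {pair!r}")
        fields[key] = value.strip()
    return fields


def build_query_string(fields: Dict[str, str]) -> str:
    """Inverse of parse_query_string; dict insertion order is preserved."""
    return ",".join(f"{key}={value}" for key, value in fields.items())


if __name__ == "__main__":
    sample = "sn=3,qt=601,in=MTV3_Clip_10.avi,sqn=4,sqt=500,qmt=490"
    for key, value in parse_query_string(sample).items():
        print(f"{FIELD_DESCRIPTIONS.get(key, 'Unknown field')}: {value}")
```

A compact textual format of this kind keeps the request small, which matters on the slower mobile networks considered in the experiments.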
Convertible Formats       JPEG, BMP, TIFF, JPEG-2K, PNG, WMF
Non-convertible Formats   PCX, GIF, PCT, TGA, EPS, PGM

The list of supported audio formats and properties is shown in Table 2.

Table 2: Audio supported format on M-MUVIS server

M-MUVIS Audio
Codecs               MP3, AAC, G721, G723, PCM
Channel numbers      Mono, Stereo
File Formats         MP3, AAC, AVI, MP4
Sampling Frequency   16, 22.050, 24, 32, 44.1 kHz; Any

The list of supported video formats and properties is shown in Table 3. MMRS converts the alien media items to the supported formats.

Table 3: Video supported format on M-MUVIS server

M-MUVIS Video
Codecs         H263+, MPEG-4, YUV 4:2:0, RGB 24
Frame Rate     1.25 fps
Frame Size     QCIF
File Formats   AVI, MP4 (3gp)

V. EXPERIMENTAL RESULTS

The sample database used in our experiments contains 476 images of different formats and 60 video clips of different durations; the total duration of all the video clips in the database is approximately 150 minutes. The M-MUVIS server runs on a PC equipped with a P4 2.99 GHz processor and 1.9 GB of RAM. Basic visual features such as YUV, HSV and RGB color histograms, the Gray Level Co-Occurrence Matrix (GLCM) as a texture feature [13], and an aural feature (MFCC) are extracted for this database. In this section we present the time statistics of different query operations over the sample database. The mean and standard deviation (SD) are computed over 50 query operations performed over the sample database. Server Query Time (SQT) is the time spent performing a query operation on the server side, whereas Client Query Time (CQT) is the entire time from sending a query request, through performing the query on the server side, to the reception of the query results in textual format.

A. Content-based Image Queries in an Image Database

Table 5 presents the basic time statistics for an image query among the images in the sample database.

Table 5: Image query time statistics within the sample image database

Devices and network     Mean (ms)   SD (ms)
6620 GPRS [15] (CQT)    5413        162
6630 3G [15] (CQT)      3760        149
9500 EDGE [15] (CQT)    5022        702
9500 WLAN (CQT)         1684        140
PC (SQT)                721         35

Due to the dynamic nature of the Internet and Java
IV. COMMUNICATION BETWEEN M-MUVIS CLIENT AND SERVER
An M-MUVIS client can initiate a similarity-based query operation by sending the content-based Query By Example (QBE) request over the Internet to the M-MUVIS server. The server performs the query operation over the active database and sends the query results to the M-MUVIS
platform, a high SD is observed. Recall that mobile networks (GPRS, 3G and EDGE [14]) have hidden latencies, which may also contribute to a higher SD.
VI. CONCLUSIONS

A unified framework for content-based multimedia retrieval from Java enabled mobile devices is presented. In the proposed framework, M-MUVIS, the user can perform different content-based query operations in a multimedia database to retrieve multimedia items similar to a query item. M-MUVIS has the following novel features.
* Any M-MUVIS client, which runs on a mobile device, can perform content-based queries within multimedia databases located in a host machine where the server
B. Content-based Image Query in a Video Database

Table 6 presents the basic time statistics for an image query among the video collection of the sample database. Note that the query image is compared with the key-frames of the entire video collection and the best matched key-frames are shown in the QRI.

Table 6: Image query time statistics within the sample video database

Devices and network   Mean (ms)   SD (ms)
6620 GPRS (CQT)       4674        1065
6630 3G (CQT)         3160        950
9500 EDGE (CQT)       3674        434
9500 WLAN (CQT)       1984        340
PC (SQT)              685         85
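Since CQT covers the whole round trip and SQT only the server-side part, their difference gives a rough estimate of the time spent outside the query itself. The sketch below applies this to the mean values of Table 6. The function name is hypothetical, and the difference is only an approximation under the assumption that means can be subtracted directly: it lumps together network transfer, client-server synchronization and client-side processing.

```python
from typing import Dict, Tuple

# Mean CQT values (ms) from Table 6, per device and network.
TABLE_6_MEAN_CQT_MS: Dict[str, int] = {
    "6620 GPRS": 4674,
    "6630 3G": 3160,
    "9500 EDGE": 3674,
    "9500 WLAN": 1984,
}
# Mean SQT (ms) on the PC server, also from Table 6.
TABLE_6_MEAN_SQT_MS = 685


def communication_overhead(cqt_ms: float, sqt_ms: float) -> Tuple[float, float]:
    """Return (overhead in ms, overhead as a fraction of CQT)."""
    if cqt_ms <= 0 or sqt_ms < 0 or sqt_ms > cqt_ms:
        raise ValueError("expected 0 <= SQT <= CQT and CQT > 0")
    overhead = cqt_ms - sqt_ms
    return overhead, overhead / cqt_ms


if __name__ == "__main__":
    for network, cqt in TABLE_6_MEAN_CQT_MS.items():
        overhead, share = communication_overhead(cqt, TABLE_6_MEAN_SQT_MS)
        print(f"{network}: {overhead} ms outside the server query ({share:.0%} of CQT)")
```

Under these assumptions, even the fastest network in Table 6 (WLAN) spends well over half of the mean CQT outside the server-side query.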
application is active. The M-MUVIS client adapts to the native look and feel of the device on which it is running and at the same time provides a uniform user experience across a range of devices.
* MMRS performs the multimedia type conversions during a query process according to the M-MUVIS client's request whenever needed.
* The amount of data exchanged between the M-MUVIS client and server is drastically reduced, which in effect reduces the overall CQT and hence achieves a higher query speed. We can foresee that CQT can be reduced further by reducing the information exchange between the M-MUVIS client and server in the content-based query operation by implementing session tracking on the server side.
C. Aural Query in a Video Database

Table 7 presents the basic time statistics of an aural query among the video collection of the sample database. The aural queries take more time than the visual queries because of the much larger number of aural key-frames compared to the visual counterpart [4].

Table 7: Aural query time statistics within the sample video database

Devices and network   Mean (ms)   SD (ms)
6620 GPRS (CQT)       17416       1282
6630 3G (CQT)         15189       1124
9500 EDGE (CQT)       27416       931
9500 WLAN (CQT)       13971       610
PC (SQT)              11125       180

A large portion of CQT is spent on the communication between the M-MUVIS client and server, and a major part of that communication time is used in M-MUVIS client-server synchronization. In order to reduce the CQT on the mobile device, the communication between the M-MUVIS client and server is reduced considerably by reducing the data exchange between them as compared with [6]. Since the QRI is a JPEG encoded image, for efficient retrieval we use a low quality factor (high compression rate) for JPEG encoding to reduce the QRI size and, in effect, the network traffic. Since most mobile devices do not support a high quality display, such higher compression rates for the QRI can conveniently be used. As a result, such a network traffic reduction not only reduces the CQT but also makes such a service feasible.

In the content-based query operation from the mobile devices, the Wireless Local Area Network (WLAN) has the shortest latencies, so a low CQT is observed over WLAN. Due to the dynamic nature of GPRS, 3G and EDGE, a higher SD is observed as compared to WLAN. In several experiments it is observed that SD is proportional to the amount of data exchanged between client and server.
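The Mean and SD columns of Tables 5 to 7 summarize repeated query timings. A minimal sketch of how such a summary could be computed is given below. The timings in the example are invented placeholders, not measurements from the experiments, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption, since the text does not state which estimator was used.

```python
import statistics
from typing import Sequence, Tuple


def timing_stats(samples_ms: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of query times in ms."""
    if len(samples_ms) < 2:
        raise ValueError("at least two timings are needed for a standard deviation")
    return statistics.mean(samples_ms), statistics.stdev(samples_ms)


if __name__ == "__main__":
    # Hypothetical CQT measurements in ms; the experiments used 50 per setting.
    placeholder_cqt_ms = [1650.0, 1720.0, 1580.0, 1810.0, 1690.0]
    mean_ms, sd_ms = timing_stats(placeholder_cqt_ms)
    print(f"mean = {mean_ms:.0f} ms, SD = {sd_ms:.0f} ms")
```

Reporting the SD alongside the mean is what makes the variability of GPRS, 3G and EDGE relative to WLAN visible in the tables.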
REFERENCES

[1] "MUVIS", http://muvis.cs.tut.fi
[2] J.R. Smith and Shih-Fu Chang, "VisualSEEk: A fully automated content-based image query system", ACM Multimedia, Boston, Nov. 1996.
[3] "Virage", http://www.virage.com
[4] S. Kiranyaz, "Advanced Techniques for Content-Based Management of Multimedia Databases", PhD Thesis, Publication 541, Tampere University of Technology.
[5] I. Ahmad, F. Alaya Cheikh, B. Cramariuc and M. Gabbouj, "Query by Image Content using NOKIA 9210 Communicator", Proc. of the Workshop on Image Analysis for Multimedia Interactive Services, WIAMIS'01, pp. 133-137, Tampere, Finland, May 2001.
[6] I. Ahmad, S. Abdullah, S. Kiranyaz, M. Gabbouj, "Progressive query technique for image retrieval on mobile devices", Content-Based Multimedia Indexing (CBMI), Riga, Latvia, Jun. 2005.
[7] James Keogh, "The Complete Reference J2ME", published by McGraw-Hill Osborne.
[8] "Symbian OS"
[9] O. Guldogan, M. Gabbouj, "Content-based image indexing and retrieval framework on symbian based mobile platform", European Signal Processing Conference, EUSIPCO 2005, Antalya, Turkey, Sep. 2005.
[10] Sing Li, Paul Houle, Mark Wilcox, Ron Phillips, Piroz Mohseni, Stefan Zeiger, Hans Bergsten, Matthew Ferris, Danny Ayers, "Professional Java Server Programming", published by Peer Information Inc., ISBN: 1861002777.
[11] "3gpp"
[12] Clinton Wong, HTTP Pocket Reference, published by O'Reilly.
[13] M. Partio, B. Cramariuc, M. Gabbouj, A. Visa, "Rock Texture Retrieval Using Gray Level Co-occurrence Matrix", in Proc. of 5th Nordic Signal Processing Symposium, October 2002.
[14] J. Lempiainen, M. Manninen, Radio Interface System Planning for GSM/GPRS/UMTS, published by Kluwer Academic.
[15] "Nokia"