IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 17, NO. 3, MARCH 2007


Editorial

Knowledge Engineering, Semantics, and Signal Processing in Audio–Visual Information Retrieval

Developing the technology able to produce accurate levels of abstraction, in order to annotate and retrieve content using queries that are natural to humans, is the breakthrough needed to narrow the gap between the low-level features that current algorithms can compute automatically and the richness of semantics in user queries. Bridging this gap is a challenge that has captured the attention of researchers in computer vision, pattern recognition, image processing, and other related fields, evidencing both the difficulty and the importance of such technology, and the fact that the problem remains unsolved. This technology offers the possibility of adding the audio–visual dimension to well-established text databases, enabling multimedia-based information retrieval. Audio–visual information retrieval is a key component of future multimedia systems, since the exponential growth of audio–visual data, along with the critical lack of tools to annotate and structure digital data, is rendering vast portions of available content useless. Clearly, content is worthless if it cannot be found and used. As a consequence, audio–visual information retrieval is destined to become pervasive in almost every aspect of daily life and a pillar of key achievements in future scientific and technological developments. Compared with text-based information retrieval, image and video retrieval is not only less advanced but also more challenging. Writing was explicitly developed to share and preserve information, while pictures and sound have traditionally been used to express humans' artistic and creative capacity; this model, however, is changing in the digital age. The shift is also revolutionizing the way people process and search information: from text-only to multimedia-based search and retrieval. This progression is a consequence of the rapid growth of consumer-oriented electronic devices such as digital cameras, camcorders, and mobile phones, along with the expansion and globalization of networking facilities.
The immediate effect of this trend is that generating digital content has become easy and cheap, while managing and structuring it to produce effective services has not. This applies to the whole range of content owners, from professional digital libraries with their terabytes of visual content to private collectors of digital pictures stored on the disks of conventional personal computers. To get closer to the vision of useful multimedia-based search and retrieval, the annotation and search technologies employed need to be efficient and use semantic concepts that are natural to the user.

Digital Object Identifier 10.1109/TCSVT.2006.890273

This Special Issue reports research work aimed at semantic-based automatic and semi-automatic processing of audio–visual content for annotation, search, and retrieval. After a thorough review process, a total of 13 papers were selected.

The first three papers address the problem of video and image summarization and classification. A. Hanjalic tackles the problem of temporal video segmentation and explores the possibilities for defining theoretical limits on the expected performance of a general parsing algorithm. Specifically, he addresses the challenge of computing the coherence of video content, which is the most critical issue determining the ability of an algorithm to parse a video automatically. A measure of coherence is introduced using the average uncertainty in extracting content-related information from data. The author argues that such a measure is more powerful in revealing the true quality of a video parsing algorithm than the classical comparison of parsing results with ground truth. The second paper, by You et al., describes a framework for human perception analysis in video understanding based on multiple visual cues. Video features that prominently influence human perception, such as motion, contrast, special scenes, and statistical rhythm, are first extracted and modeled. A perception curve that corresponds to human perception change is then constructed from these individual models. As an application of the presented perceptive analysis, a scheme for video summarization is reported and used to validate the robustness and generality of the proposed framework. The paper by Le Borgne et al. deals with the coding of natural scenes in order to extract semantic information. A new scheme to project natural scenes onto a basis in which each dimension encodes statistically independent information is presented.
The study of the resulting coding units extracted from well-chosen categories of images shows that they adapt and respond selectively to discriminant features in natural scenes. Given this basis, the authors define global and local image signatures relying on the maximal activity of filters on the input image.

1051-8215/$25.00 © 2007 IEEE

Semantic-based image analysis and categorization is crucial in visual information retrieval, and four papers address this specific problem. The paper by Athanasiadis et al. presents a framework for simultaneous image segmentation and object labeling leading to automatic image categorization. To make decisions on handling image regions, the proposed framework uses possible semantic labels, formally defined as fuzzy sets, instead of the visual features used traditionally. A visual context representation and analysis approach is presented, blending global knowledge in interpreting each object locally. Contextual information is based on a novel semantic processing methodology employing fuzzy algebra and ontological taxonomic knowledge representation. In the next paper, Djordjevic and Izquierdo introduce a system for object-based semi-automatic indexing and retrieval of natural images. Three important concepts underpin the proposed system: a new strategy to fuse different low-level content descriptions; a learning technique involving user relevance feedback; and a novel object-based model to link semantic terms and visual objects. To achieve high accuracy in the retrieval and subsequent annotation processes, several low-level image primitives are combined in a suitable multifeature space. Support vector machines are used to learn from information gathered through relevance feedback. An adaptive convolution kernel is defined to handle the proposed structured multifeature space, and the positive definite property of the introduced kernel is proven, an essential condition for uniqueness and optimality of the related convex optimization problem. In their paper, Yang et al. propose a semantic categorization method for generic home photos. The authors use a two-layered support vector machine to combine camera metadata and semantic features for multilabel classification. The two-layered classifier detects local and global photo semantics using a feed-forward approach. The first layer predicts the likelihood of predefined local photo semantics based on camera metadata and regional low-level visual features. In the second layer, one or more global photo semantics are detected based on this likelihood. The approach also exploits a concept-merging process based on a set of semantic-confidence maps in order to handle selection ambiguities on overlapping local photo regions. Next, Vallet et al. propose a method to build a dynamic representation of the semantic context of ongoing retrieval tasks, which is used to activate different subsets of user interests at runtime and aims to filter out user preferences that are out of context.

The next three papers relate to semantic-based analysis of different media types, including music, linguistic content, and contextual information. The paper by Gillet et al. focuses on music videos, which exhibit a broad range of structural and semantic relationships between the music and the video content. To identify such relationships, a two-level automatic structuring of the music and the video is achieved separately. Note onsets are detected from the music signal, along with section changes. The video stream is independently segmented to detect changes in motion activity, as well as shot boundaries. A two-level segmentation of both streams is then achieved, giving four audio–visual correlation measures. In their paper, Ramirez et al. tackle the task of identifying performers from their playing styles. The authors investigate how skilled musicians express and communicate their view of the musical and emotional content of musical pieces, and how to use this information to automatically identify performers. Deviations of parameters such as pitch, timing, and amplitude are analyzed. The approach to performer identification consists of establishing a performer-dependent mapping of internote features to a repertoire of inflections characterized by intranote features. The last regular paper of this Special Issue, by de Jong et al., analyzes the potential contribution of linguistic content and other nonimage aspects to the processing of audio–visual data. The authors summarize the various ways in which linguistic content analysis contributes to enhancing the semantic annotation of multimedia content and, as a consequence, to improving the effectiveness of conceptual media access tools.

The last three papers are Transactions Letters addressing important aspects of the Special Issue. The first, by Yu et al., focuses on the combination of principal component analysis (PCA) and linear discriminant analysis (LDA) for dimension reduction and classification. The novelty of this approach is to find an optimal combined parameter configuration; the authors also apply boosting to improve performance. Next, Lu et al. demonstrate a systematic way to analyze a binary partition tree representation of natural images for the purposes of archiving and segmentation. Within the tree structure, these problems are transformed into locating prevalent tree branches. By studying the evolution of region statistics, the authors highlight nodes which represent the boundary between salient details and provide a set of tree levels from which simplifications and segmentations can be derived. In the last paper, Tang and Lewis aim to demonstrate some of the disadvantages of datasets like the Corel set for effective auto-annotation evaluation. The authors first compare the performance of several annotation algorithms using the Corel set and find that simple near-neighbour propagation techniques perform fairly well. An annotation method based on support vector machines achieves even better results, almost as good as the best found in the literature. The authors then build a new image collection using the Yahoo Image search engine and query-by-single-word searches to automatically create a more challenging annotated set.
This Special Issue has assembled a small sample of papers originating from well-known research groups worldwide. The contributing authors were instrumental in the completion of the Special Issue, and the editors would like to thank each of them. The anonymous referees played a key role in the review and selection process, ensuring that this Special Issue includes only submissions of the highest technical quality.

EBROUL IZQUIERDO, Guest Editor
Department of Electrical Engineering
Queen Mary, University of London
London E1 4NS, U.K.


Ebroul Izquierdo (M’97–SM’02) received the Ph.D. degree in mathematics for his thesis on the numerical approximation of algebraic-differential equations from the Humboldt University, Berlin, Germany, in 1993. He is a Full Professor (chair) of multimedia and computer vision and Head of the Multimedia and Vision Group at Queen Mary, University of London. From 1990 to 1992, he was a Teaching Assistant at the Department of Applied Mathematics, Technical University Berlin. From 1993 to 1997, he was with the Heinrich-Hertz Institute for Communication Technology, Berlin, Germany, as an Associate Researcher. From 1998 to 1999, he was with the Department of Electronic Systems Engineering of the University of Essex, Essex, U.K., as a Senior Research Officer. Since 2000, he has been with the Electronic Engineering Department, Queen Mary, University of London. He has served as session chair and organizer of invited sessions at several conferences and has published over 200 technical papers including chapters in books. He coordinated the EU IST project BUSMAN on video annotation and retrieval. He is a main contributor to the IST integrated projects aceMedia and MESH on the convergence of knowledge, semantics and content for user-centred intelligent media services. He coordinates the European project Cost292 and the FP6 network of excellence on semantic inference for automatic annotation and retrieval of multimedia content, K-Space. Prof. Izquierdo is an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (TCSVT) and the EURASIP Journal on Image and Video Processing. He has served as Guest Editor of three Special Issues of the IEEE TCSVT, a Special Issue of the Journal Signal Processing: Image Communication and a Special Issue of the EURASIP Journal on Applied Signal Processing. 
He is a Chartered Engineer (U.K.), a Fellow of the Institution of Engineering and Technology (IET), Chairman of the Executive Group of the IET Visual Engineering Professional Network, a member of the British Machine Vision Association, and a member of the steering board of the Networked Audio–Visual Media Technology Platform of the European Union. He is a member of the program committee of the IEEE Conference on Information Visualization, the International Program Committee of the EURASIP & IEEE Conference on Video Processing and Multimedia Communication, and the European Workshop on Image Analysis for Multimedia Interactive Services.

Jian Zhang (S’95–M’98–SM’04) received the B.Sc. degree in electronic engineering from East China Normal University, Shanghai, China, in 1982, the M.Sc. degree in computer science from Flinders University of South Australia, Adelaide, Australia, in 1994, and the Ph.D. degree from the School of Information Technology and Electrical Engineering, Australian Defence Force Academy, University of New South Wales, Australia, in 1997. In 1997, he joined the Visual Information Processing Laboratory, Motorola Laboratories, Sydney, Australia, as a Senior Research Engineer; he later became a Principal Research Engineer and a foundation Manager of the Visual Communications Research Team. While at Motorola, he worked on diverse research projects including image processing, video coding, image segmentation, and multimedia content adaptation for wireless and broadband applications. In addition to his many research publications, he holds more than ten patents filed in the U.S., U.K., Japan, and Australia. Since 2004, he has been with National ICT Australia (NICTA), Sydney, where he is a Principal Researcher and is affiliated with the School of Computer Science and Engineering, University of New South Wales, as a Conjoint Associate Professor. He leads several major NICTA research projects in the areas of image and video processing, video surveillance, and multimedia content management. Dr. Zhang is an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY and the EURASIP Journal on Image and Video Processing. He is a member of the scientific committees of several international conferences and serves as Technical Co-Chair of the IEEE Multimedia Signal Processing Workshop (MMSP) 2008.


Thomas Sikora (M’93–SM’96) received the Dipl.-Ing. and Dr.-Ing. degrees in electrical engineering from Bremen University, Bremen, Germany, in 1985 and 1989, respectively. He is currently Professor and Director of the Communication Systems Group at the Technical University Berlin, Berlin, Germany. In 1990, he joined Siemens Ltd. and Monash University, Melbourne, Australia, as a Project Leader responsible for video compression research activities in the Australian “Universal Broadband Video Codec” consortium. Between 1994 and 2001, he was the Director of the “Interactive Media” Department of the Heinrich-Hertz-Institute (HHI) Berlin GmbH, Germany. He is a Co-Founder of 2SK Media Technologies and Vis-a-Pix GmbH, two Berlin-based start-up companies involved in research and development of audio and video signal processing and compression technology. He has been involved in international ITU and ISO standardization activities, as well as in several European research activities, for a number of years. As Chairman of the ISO-MPEG (Moving Picture Experts Group) video group, he was responsible for the development and standardization of the MPEG video coding algorithms. He also served as Chairman of the European COST 211ter video compression research group. He frequently works as an industry consultant on issues related to interactive digital audio and video. He is an appointed member of the Advisory and Supervisory Boards of a number of German companies and international research organizations. Dr. Sikora is the recipient of the 1996 German ITG Award (German Society for Information Technology). He is an Associate Editor of a number of international journals, including the IEEE Signal Processing Magazine and EURASIP Signal Processing: Image Communication. He also served as Editor-in-Chief of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY. He is a member of the ITG.

Thomas S. Huang (F’00) received the B.S. degree in electrical engineering from National Taiwan University, Taipei, Taiwan, R.O.C., and the M.S. and D.Sc. degrees in electrical engineering from the Massachusetts Institute of Technology (MIT), Cambridge. He was on the Faculty of the Department of Electrical Engineering at MIT from 1963 to 1973, and on the Faculty of the School of Electrical Engineering and Director of its Laboratory for Information and Signal Processing at Purdue University, West Lafayette, IN, from 1973 to 1980. In 1980, he joined the University of Illinois at Urbana-Champaign, where he is now the William L. Everitt Distinguished Professor of Electrical and Computer Engineering, Research Professor at the Coordinated Science Laboratory, Head of the Image Formation and Processing Group at the Beckman Institute for Advanced Science and Technology, and Co-Chair of the Institute’s major research theme, Human–Computer Intelligent Interaction. His professional interests lie in the broad area of information technology, especially the transmission and processing of multidimensional signals. He has published 20 books and over 500 papers in network theory, digital filtering, image processing, and computer vision. Dr. Huang is a member of the National Academy of Engineering, a Foreign Member of the Chinese Academies of Engineering and Sciences, and a Fellow of the International Association for Pattern Recognition and the Optical Society of America. He has received a Guggenheim Fellowship, an Alexander von Humboldt Foundation Senior U.S. Scientist Award, and a Fellowship from the Japan Society for the Promotion of Science. He received the IEEE Signal Processing Society’s Technical Achievement Award in 1987 and the Society Award in 1991. He was awarded the IEEE Third Millennium Medal in 2000. Also in 2000, he received the Honda Lifetime Achievement Award for “contributions to motion analysis.” In 2001, he received the IEEE Jack S. Kilby Medal. In 2002, he received the King-Sun Fu Prize of the International Association for Pattern Recognition and the Pan Wen-Yuan Outstanding Research Award. In 2005, he received the Okawa Prize. In 2006, he was named the Electronic Imaging Scientist of the Year by IS&T and SPIE. He is a Founding Editor of the International Journal Computer Vision, Graphics, and Image Processing, and Editor of the Springer Series in Information Sciences, published by Springer-Verlag.