ADAPTIVE LEARNING ALGORITHMS FOR SEMANTIC OBJECT EXTRACTION
Nikolaos Doulamis, Anastasios Doulamis and Stefanos Kollias
Dept. of Electrical and Computer Engineering, National Technical University of Athens, Iroon Polytechniou 9, Zografou, Athens 15773, Greece
Tel: +30 1 7722491; fax: +30 1 7722492; e-mail:
[email protected]
ABSTRACT
In this paper a novel approach is proposed for semantic video object segmentation. In particular, it is assumed that several neural network structures have been stored in a system database or memory. Each network has been trained to be appropriate for a specific application. A retrieval mechanism is then introduced which selects from the memory the network that best approximates the current environment. Since, however, the retrieved network does not exactly correspond to the current conditions, a small adaptation of its weights will be necessary in most cases. For that reason, an efficient training algorithm has been adopted based on both the former and the current network knowledge. The former corresponds to knowledge existing in the memory, while the latter is provided by the training set selection module based on user assistance and a color segmentation algorithm. Experimental results are presented that illustrate the performance of the proposed method in real-life applications.
1 INTRODUCTION
Object segmentation is an important task within the scope of the MPEG-4 and MPEG-7 standardization phase [4, 8]. The concept of video objects has been adopted in the new scheme for improving the coding efficiency and introducing multimedia capabilities. Video objects are semantic objects of arbitrary shape. They represent a meaningful entity in a digital video stream, for example a human, a chair, a building and so on [1, 11]. Although the MPEG-4/7 standards do not directly deal with object segmentation [4, 8], they require such algorithms as a pre-coding stage [1, 11]. However, although humans can solve this problem effortlessly, it remains one of the most difficult problems for computer systems, apart perhaps from the case of video game or graphics applications, where object segmentation is a priori available. Current segmentation techniques are based on color, motion or a combination of the two [7, 10]. Nevertheless, a semantic object, such as a person in a scene, generally contains regions with different color, texture and motion characteristics [9].
Neural networks can help towards the problem of object segmentation due to their nonlinear capabilities [3, 6]. However, conventional neural network structures assume stationarity of the training data, which does not hold in video/image processing applications, where dynamic changes of the operational environment usually occur. Thus, the use of an initial static training set and an initially trained network deteriorates the network performance when the operational environment differs from the training one. To improve the performance of neural networks in such applications, a retraining procedure is required to adapt the network behavior to the current conditions [2]. This procedure should include three main parts: a) a decision mechanism which determines when and what type of retraining should be applied, b) a MAP procedure which creates a training set composed of input and desired output data automatically computed from the current environment, and c) a training algorithm which efficiently adapts the network weights. In the present work, instead of retraining the network whenever its performance deteriorates [2], the algorithm first searches a system memory to retrieve the most appropriate network (weights, size and the respective training set) for the current condition. This approach improves the time efficiency of the system since, in most cases, only a small weight modification is adequate.
2 SYSTEM OVERVIEW

Figure 1 illustrates the block diagram of the proposed neural network architecture. It consists of four stages: a retrieval mechanism, a training algorithm, a training set selection module and a decision mechanism. The main objectives of these modules are briefly described next.

Retrieval Mechanism: This mechanism searches the system memory in order to find the most appropriate neural network architecture for the current environment. It is activated by the decision mechanism, while it in turn activates the training algorithm.
Figure 1: The proposed neural network architecture.

Training Algorithm: Based on the output of the retrieval mechanism, the goal of the training algorithm is to adapt the network weights to the current condition. This is accomplished by exploiting both the former and the current network knowledge. The former is provided by the retrieval mechanism, while the latter by the training set selection module.

Training Set Selection Module: This module selects the most representative data for the current environment and is activated by the decision mechanism.

Decision Mechanism: The goal of this mechanism is to determine when a new neural network should be applied. In case the network performance is found to be appropriate, the same network weights and structure are used to conduct the classification. Instead, when the network performance deteriorates, both the retrieval mechanism and the training set selection module are activated.

3 RETRIEVAL MECHANISM

When the decision mechanism detects a change of the environment, it activates the training set selection module and the retrieval mechanism. An estimation of the network performance under the current condition is required for the implementation of this mechanism. This estimation is accomplished by the training set selection module, which is described later. Let us assume that a training set, say $S_c$, consisting of pairs of feature vectors and desired outputs,

$$S_c = \{(t_i, d_i),\; i = 1, 2, \ldots, N_c\} \qquad (1)$$

is available to the system. The vector $t_i$ in (1) corresponds to a feature vector of an image block, while $N_c$ indicates the number of image blocks belonging to the estimated training set $S_c$. The $d_i$ is the desired (target) output for the input $t_i$; assuming that $p$ classes are available, i.e., $\omega_j$, $j = 1, 2, \ldots, p$, the $d_i$ has the form

$$d_i = [P^{d_i}_{\omega_1}, \ldots, P^{d_i}_{\omega_p}]^T \qquad (2)$$

where $P^{d_i}_{\omega_j}$ denotes the probability that the feature $t_i$ belongs to the $j$th class. Then, the most appropriate network in the system memory for the current environment is estimated by minimizing an error $e_k$ over all networks in the memory, $k \in S_m$:

$$\hat{k} = \arg\min_{k \in S_m} e_k \qquad (3)$$

The error $e_k$ of (3) indicates how close the output of the $k$th network in the set $S_m$ (memory) is to the current condition, represented by the set $S_c$. As a result, we conclude that

$$e_k = \sum_{i=1}^{N_c} \left(f_k(w_k, t_i) - d_i\right)^T \left(f_k(w_k, t_i) - d_i\right) \qquad (4)$$
where $f_k$ denotes the nonlinear input-output relationship of the $k$th neural network in $S_m$, depending on the type of the neural activation function and its connection graph. Based on (3) and (4), the most appropriate network, say $\hat{k}$, is retrieved from the memory, with an approximation error of the current environment $e_{\hat{k}}$. The smaller the error $e_{\hat{k}}$ is, the closer the selected network is to the current condition. In general, it is anticipated that only a small perturbation, say $\Delta w$, of the retrieved weights $w_{\hat{k}}$ is required so that the network better approximates its operational environment. In this case we can arrive at an efficient algorithm for the neural network weight adaptation.
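As a concrete illustration of the retrieval step, the following sketch (our own, not the authors' implementation) evaluates $e_k$ of (4) for each single-hidden-layer network stored in the memory over the current training set $S_c$ and returns the index $\hat{k}$ of (3); the weight layout and the sigmoid activation are assumptions made for the example.

```python
import numpy as np

def forward(weights, T):
    """Output of a one-hidden-layer sigmoid network for each row of T."""
    W1, b1, W2, b2 = weights
    H = 1.0 / (1.0 + np.exp(-(T @ W1 + b1)))      # hidden activations
    return 1.0 / (1.0 + np.exp(-(H @ W2 + b2)))   # network outputs f_k(w_k, t_i)

def retrieve(memory, T, D):
    """Return the index of the stored network minimizing e_k of (4) over S_c."""
    errors = []
    for weights in memory:                         # each entry: (W1, b1, W2, b2)
        Z = forward(weights, T)
        errors.append(np.sum((Z - D) ** 2))        # e_k = sum_i ||f_k(w_k,t_i) - d_i||^2
    return int(np.argmin(errors)), errors

# Toy usage with random data (for illustration only).
rng = np.random.default_rng(0)
T = rng.normal(size=(20, 8))                       # N_c = 20 block feature vectors
D = (rng.random((20, 1)) > 0.5).astype(float)      # desired outputs
memory = [(rng.normal(size=(8, 5)), np.zeros(5),
           rng.normal(size=(5, 1)), np.zeros(1)) for _ in range(3)]
k_hat, errs = retrieve(memory, T, D)
```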
4 TRAINING ALGORITHM
Let us, for simplicity, consider a two-class classification problem, with classes $\omega_1$, $\omega_2$. Let a feedforward neural network classifier include a single output neuron and one hidden layer consisting of $q$ neurons. Extension to classification problems and networks of higher complexity can be performed in a similar way. The small perturbation of the selected network weights can be achieved by letting the network output, for those blocks which are included in the training set $S_c$, take values identical to the estimated ones. This requires that the following equation holds:

$$z_i(w_F) = d_i \quad \text{for all data in } S_c \qquad (5)$$
where $w_F$ are the final network weights and $z_i(w_F)$ represents the respective network output when its input is the feature $t_i$, that is,

$$z_i(w_F) = f_{\hat{k}}(w_F, t_i) \qquad (6)$$
It should be mentioned that in a two-class classification problem the network output defined in (2) can be written in scalar form, since the sum of all probabilities must be equal to unity. Using the fact that only a small perturbation of the retrieved network weights $w_{\hat{k}}$ is required for obtaining the final weights $w_F$, that is,

$$w_F = w_{\hat{k}} + \Delta w \qquad (7)$$

equation (6) can be written as a linear system of equations by expanding the neural activation function into a first order Taylor series:

$$c^T = (\Delta w)^T A \qquad (8)$$

The vector $c$ expresses the difference between the retrieved network output and the target output provided by the training set $S_c$, i.e.,

$$c = [d_1, \ldots, d_{N_c}]^T - [z_1(w_{\hat{k}}), \ldots, z_{N_c}(w_{\hat{k}})]^T \qquad (9)$$

while the matrix $A$ depends on the retrieved weights $w_{\hat{k}}$. The dimension of the vector $c$ is in general smaller than the number of the unknown weights $\Delta w$.
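To illustrate how the system (8)-(9) can be assembled, the sketch below (our own illustration; the paper derives $A$ analytically from the Taylor expansion, whereas here the output gradients are approximated by finite differences) builds one column of $A$ and one entry of $c$ per training sample of $S_c$ for a single-output network with $q$ hidden neurons.

```python
import numpy as np

def net_output(w, t, n_in, q):
    """Scalar output of a one-hidden-layer sigmoid network with flattened weights w."""
    W1 = w[:n_in * q].reshape(n_in, q)
    b1 = w[n_in * q:n_in * q + q]
    W2 = w[n_in * q + q:n_in * q + 2 * q]
    b2 = w[-1]
    h = 1.0 / (1.0 + np.exp(-(t @ W1 + b1)))
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))

def build_linear_system(w_hat, T, d, n_in, q, eps=1e-6):
    """Assemble A and c of (8)-(9): one gradient column and one residual per sample."""
    cols, c = [], []
    for t_i, d_i in zip(T, d):
        z_i = net_output(w_hat, t_i, n_in, q)
        g = np.array([(net_output(w_hat + eps * e, t_i, n_in, q) - z_i) / eps
                      for e in np.eye(w_hat.size)])   # finite-difference gradient dz/dw
        cols.append(g)
        c.append(d_i - z_i)                           # eq (9): target minus retrieved output
    return np.column_stack(cols), np.array(c)         # A has one column per sample

# Toy usage: q = 5 hidden neurons, 8-dimensional block features, 3 training samples.
rng = np.random.default_rng(0)
n_in, q = 8, 5
w_hat = rng.normal(size=n_in * q + q + q + 1)         # retrieved weights, flattened
T = rng.normal(size=(3, n_in))
d = rng.random(3)
A, c = build_linear_system(w_hat, T, d, n_in, q)
```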
Thus, uniqueness is imposed by minimizing the following quantity:

$$E_S = \frac{1}{2}\sum_{i=1}^{N}\left(E_i(w_{\hat{k}}) - E_i(w_F)\right)^2 \qquad (10)$$

where $E_i(w) = \left\|f_{\hat{k}}(w, t_i) - d_i\right\|^2$ denotes the error of the network with weights $w$ on the $i$th training sample. It can be shown that equation (10) takes the form

$$E_S = \frac{1}{2}\,(\Delta w)^T K^T K\, \Delta w \qquad (11)$$

where the matrix $K$ depends on the retrieved weights $w_{\hat{k}}$. The error in (11) is a convex function of $\Delta w$ and therefore any relative minimum is a global one. Many methods can be used to compute this global minimum, such as the gradient projection method, for estimating the final network weights $w_F$.
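The constrained minimization behind (10)-(11) can be realized in several ways; the paper mentions the gradient projection method. The sketch below is our own simplified variant: gradient steps on the quadratic form $\frac{1}{2}(\Delta w)^T K^T K \Delta w$ followed by an orthogonal projection back onto the constraint set $A^T \Delta w = c$ of (8).

```python
import numpy as np

def solve_perturbation(A, c, K, steps=500, lr=1e-2):
    """Minimize 0.5 * dw^T K^T K dw subject to A^T dw = c (gradient projection sketch)."""
    proj = np.linalg.pinv(A.T)                   # used to project onto {dw : A^T dw = c}

    def project(dw):
        return dw - proj @ (A.T @ dw - c)

    dw = project(np.zeros(A.shape[0]))           # feasible starting point
    Q = K.T @ K                                  # Hessian of the quadratic form (11)
    for _ in range(steps):
        dw = project(dw - lr * (Q @ dw))         # gradient step, then projection
    return dw

# Toy usage: 3 constraints on 10 unknown weight perturbations.
rng = np.random.default_rng(1)
A = rng.normal(size=(10, 3))
c = rng.normal(size=3)
K = rng.normal(size=(6, 10))
dw = solve_perturbation(A, c, K)
print(np.allclose(A.T @ dw, c, atol=1e-6))       # the constraints of (8) are satisfied
```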
5 TRAINING SET SELECTION TASK

The purpose of this task is to select regions of the current condition belonging to the desired classes. Automatic realization of this module can be implemented if we set constraints on the type of the extracted objects (e.g., foreground or background) or the kind of application (e.g., video-conferencing). In [2], we have proposed a Maximum a Posteriori procedure for optimally selecting the training data based on a coarse approximation of the final classification. However, in multimedia applications where the type of the video objects is not a priori available, the previous technique cannot be directly applied. In this case, a cooperation between human and computer is needed. On the one hand, only a human knows the real meaning of semantic or video objects, while on the other, a computer can help a human to find the precise locations of the boundaries. This means that the semantic video object extraction is divided into two tasks. The first is a semi-automatic segmentation and the other an automatic segmentation of the remaining frames within the same scene or environment. The first phase allows the system to learn what the semantic object is, while the second stage extracts similar objects in the remaining frames via the proposed procedure. Without human assistance the retrieval mechanism cannot find the most appropriate network. On the other hand, without computer assistance, it is too difficult for humans to manually extract all video objects with pixel accuracy in all frames. When the decision mechanism detects that the network performance is inadequate, it sends a message to the user calling him/her to select some regions or points of the semantic objects. Then the system automatically extends these regions using color and motion segmentation, forming an estimated training set for the current condition. This set is used by the retrieval mechanism to select the proper network from the memory.
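As a minimal illustration of how the user-selected points might be extended into an estimated training set (the authors combine color and motion segmentation; the flood fill below uses color similarity only, with an illustrative threshold):

```python
import numpy as np
from collections import deque

def grow_region(image, seed, color_thresh=30.0):
    """Extend a user-selected seed pixel into a region of similar color (4-connectivity)."""
    h, w, _ = image.shape
    seed_color = image[seed].astype(float)
    mask = np.zeros((h, w), dtype=bool)
    mask[seed] = True
    queue = deque([seed])
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and not mask[ny, nx]:
                if np.linalg.norm(image[ny, nx].astype(float) - seed_color) < color_thresh:
                    mask[ny, nx] = True
                    queue.append((ny, nx))
    return mask   # pixels assigned to the semantic object class of S_c

# Toy usage: a dark square on a bright background, seeded inside the square.
img = np.full((64, 64, 3), 200, dtype=np.uint8)
img[16:48, 16:48] = 60
object_mask = grow_region(img, seed=(32, 32))
print(object_mask.sum())    # roughly the area of the dark square (1024 pixels)
```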
6 DECISION MECHANISM

The activation of the decision mechanism should be performed when the semantic object extraction is inadequate. In multimedia applications this is equivalent to a change of the environment. Therefore, the decision mechanism implementation can rely on a scene change detection algorithm using, for example, color or motion information. Each time a scene change occurs, the computer sends a message to the user indicating that the previous semantic object extraction is probably not valid for the current frame. It is the user's responsibility to assess the performance of the object segmentation algorithm. However, in specific types of applications a fully automatic decision mechanism can be implemented by exploiting information of the coarse approximation mask [2, 5].
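A simple color-based scene change detector of the kind mentioned above could, for example, compare normalized RGB histograms of consecutive frames; the histogram resolution and the threshold in the sketch below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def color_histogram(frame, bins=8):
    """Joint RGB histogram of a frame, normalized to sum to one."""
    hist, _ = np.histogramdd(frame.reshape(-1, 3), bins=(bins,) * 3,
                             range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def scene_changed(prev_frame, cur_frame, threshold=0.3):
    """Activate retraining when the histogram difference exceeds a threshold."""
    d = 0.5 * np.abs(color_histogram(prev_frame) - color_histogram(cur_frame)).sum()
    return d > threshold            # d lies in [0, 1]

# Toy usage: identical frames vs. a frame with very different colors.
rng = np.random.default_rng(2)
f1 = rng.integers(0, 100, size=(48, 64, 3))
f2 = f1.copy()
f3 = rng.integers(156, 256, size=(48, 64, 3))
print(scene_changed(f1, f2), scene_changed(f1, f3))   # False True
```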
7 EXPERIMENTAL RESULTS

In this section we provide experimental results of the proposed neural network architecture as far as the problem of video object segmentation is concerned. In particular, we consider that two different neural network structures have been stored in the system memory. The first one, denoted as N1, has been trained to extract humans from their background, especially in video conference applications. The second, say N2, corresponds to the classification of a ship, as depicted in Figure 3. In the first experiment, we apply the proposed neural network system to the MPEG-4 test video sequence Akiyo. Our goal is to extract Akiyo's body from the background. In this case the retrieval mechanism finds the network N1 as the most appropriate for the classification. Then, the training algorithm adapts the network weights to the current condition.
The final classification mask is illustrated in Figure 2(b). We have adopted the image block as the resolution of the classification. For clarity of presentation, when a block has been selected as a foreground one, it is included as-is in the figures, while the background blocks are presented in gray. It is observed that the scheme provides an accurate video object segmentation even for the complex background of Akiyo.

Figure 2: (a) The original image; (b) the semantic segmentation provided by the network structure.

The next experiment concerns the extraction of the ship of Figure 3(a). In this case the network N2 is retrieved and then the training algorithm takes place. Since the test image has different luminosity than the images on which the network was trained, a small adaptation of the retrieved network weights is required. The results of the segmentation are illustrated in Figure 3(b), where the video object has been extracted with a high degree of accuracy.

Figure 3: (a) The original image; (b) the semantic segmentation provided by the network structure.
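The block-level presentation used in Figures 2 and 3 (foreground blocks kept unchanged, background blocks painted gray) can be reproduced with a few lines of array manipulation; the 16x16 block size below is an assumption for illustration, not a value stated in the paper.

```python
import numpy as np

def render_block_mask(image, block_labels, block=16, gray=128):
    """Keep foreground blocks unchanged and paint background blocks gray."""
    out = image.copy()
    for by in range(block_labels.shape[0]):
        for bx in range(block_labels.shape[1]):
            if block_labels[by, bx] == 0:        # 0 = background, 1 = foreground
                out[by * block:(by + 1) * block, bx * block:(bx + 1) * block] = gray
    return out

# Toy usage: a 64x64 image whose central 2x2 blocks are classified as foreground.
img = np.random.default_rng(3).integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
labels = np.zeros((4, 4), dtype=int)
labels[1:3, 1:3] = 1
vis = render_block_mask(img, labels)
```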
8 CONCLUSIONS

It has been shown in this paper that significant improvement of the neural network performance can be achieved, especially in image recognition or classification problems, by introducing adaptive learning algorithms. Whenever a change of the environment occurs, the learning algorithm searches its memory to find a previously stored network whose output best approximates the desired one. Then a small weight adaptation is performed to modify the network weights to the current condition. Experimental results, using real images, have been provided to indicate the performance of the proposed scheme.

References

[1] L. Chiariglione, "MPEG and Multimedia Communications," IEEE Trans. Circ. and Syst. for Video Techn., vol. 7, no. 1, pp. 5-18, Feb. 1997.

[2] A. Doulamis, N. Doulamis and S. Kollias, "On-line Retrainable Neural Networks: Improving the Performance of Neural Networks in Image Classification Problems," submitted to IEEE Trans. Neural Networks.

[3] N. Doulamis, A. Doulamis, D. Kalogeras and S. Kollias, "Very Low Bit Rate Coding Using Regions of Interest," accepted for publication, IEEE Trans. Circ. and Syst. for Video Techn.
[4] ISO/IEC JTC1/SC29/WG11 N1678, MPEG-7: "Context and Objectives (v.3)," April 1997.

[5] S. Kollias, N. Doulamis and A. Doulamis, "Improving the Performance of MPEG Compatible Encoders Using On-Line Retrainable Neural Networks," Proc. of ICIP '97, Santa Barbara, CA, October 1997.

[6] S. Y. Kung, "Neural Networks for Intelligent Multimedia Processing," IEEE Signal Processing Magazine, vol. 14, no. 4, July 1997.

[7] F. Meyer and S. Beucher, "Morphological Segmentation," J. Visual Commun. and Image Representation, vol. 1, no. 1, pp. 21-46, Sept. 1990.

[8] MPEG Video Group, "MPEG-4 Video Verification Model-Version 5.0," Doc. ISO/IEC JTC1/SC29/WG11 N1469, Maceio, Nov. 1996.

[9] E. Reusens, T. Ebrahimi, C. Le Buhan, R. Castagno, V. Vaerman, L. Piron, C. de Sola Fabregas, S. Bhattacharjee, F. Bossen and M. Kunt, "Dynamic Approach to Visual Data Compression," IEEE Trans. Circ. and Syst. for Video Techn., vol. 7, no. 1, pp. 197-211, Feb. 1997.

[10] P. Salembier and M. Pardas, "Hierarchical Morphological Segmentation for Image Sequence Coding," IEEE Trans. on Image Processing, vol. 3, no. 5, pp. 639-651, Sept. 1994.

[11] T. Sikora, "The MPEG-4 Video Standard Verification Model," IEEE Trans. Circ. and Syst. for Video Techn., vol. 7, no. 1, pp. 19-31, Feb. 1997.