European Symposium on Advanced Imaging and Network Technologies, Conference on Digital Compression Technologies & Systems for Video Communications, Berlin, Germany, October 1996
Image sequence segmentation for object oriented coding

Achim Ibenthal
Philips Semiconductors, PCALH/VS, Stresemannallee 101, 22529 Hamburg, Germany

Sven Siggelkow, Rolf-Rainer Grigat
Technische Universität Hamburg-Harburg, Technische Informatik I, 21071 Hamburg, Germany
ABSTRACT

An algorithm for the segmentation of image sequences is presented, with particular attention to the requirements of object oriented coding. A fundamental requirement of such applications is the temporal stability of the segmentation. Compared to other existing approaches, it is improved here by including motion estimation in the segmentation process. Additionally, a hierarchical approach enables efficient predictive coding on the one hand and semantic data access on the other. As a direct result of using full colour information in the segmentation process, the chrominance information can be coded with extremely high compression ratios: instead of full resolution chrominance information, only a few mean chrominance vectors need to be transferred (which corresponds to a compression factor of about 1000). Additionally the object shapes must be coded, but this has to be done for greyscale images anyway.

Keywords: image sequence segmentation, motion estimation, object oriented coding, object tracking, MPEG-4
1 INTRODUCTION

In contrast to widespread coding techniques like MPEG-1 and 2, which work on rectangular blocks and produce visually noticeable effects at high compression ratios (e.g. blocking), second generation coding techniques work on the basis of objects instead of blocks. Figure 1 shows an example of an object oriented coding system. First the segmentation of the images is done. For motion compensated prediction an affine motion estimation is used. The information provided by these blocks is then used to code the objects. This paper treats the segmentation part. The image segmentation is done with respect to the human visual system, reducing visual artefacts, and an object based data access is supported. The temporal stability of the segmentation is of major importance both for an efficient predictive coding of objects and for object tracking combined with content dependent data access. For example, the speaker of a scene might be transmitted in good quality whereas the background may be coded lossily. Another advantage is the possibility to extract an object and put it into another (perhaps virtual) scene; several speakers might thus be put together in a virtual video conference.
[Block diagram of the sender: the image sequence enters the segmentation stage (with a frame memory, delay τB); the segmentation information and the images feed an affine motion estimation; the coder performs the coding of object boundary, content, and motion from the segmentation and motion information (with a modification path) and emits the coded bitstream.]
Figure 1: Example of an object oriented coding system
2 SEGMENTATION

The segmentation relies on a region growing process combined with edge detection in order to achieve global stability and high local correctness. The most important requirements are: consideration of the characteristics of the human visual system, conformity with the subjective image partitioning, and temporal stability for object based data access and for predictive coding. The last requirement is difficult to achieve when segmentation is done for each image separately, so the processing of an image takes into account the segmentation of the preceding image if no scene change has been detected. The segmentation works in two different modes: intraframe segmentation and interframe segmentation. Processing in intraframe mode is purely 2D, whereas in interframe mode the segmentation is also based on the segmentation of the preceding image. Thus the connectivity of objects is preserved along the time axis. To take advantage of the characteristics of the human visual system, not only the luminance information but also the chrominance information is used in the segmentation process, which is often neglected in the literature. Using full colour information can help in improving the adaptation to the human visual system [1].
Additionally a three-step hierarchy is built up, enabling both effective predictive coding and the required object based data access.
2.1 Intraframe Segmentation

The intraframe segmentation is based on a combination of centroid linkage region growing [2] and DRF edge detection [3]. Let $\mathbf{f}(x) = (Y(x), U(x), V(x))^T$ be the colour of the image at position $x$. The region growing process combines adjacent pixels of similar colour into regions $R_s$. For simplicity this is done by comparing the mean colour of a region,

$$\mathbf{f}(R_s) = \frac{1}{|R_s|} \sum_{x \in R_s} \mathbf{f}(x),$$

with the colour of the pixel that is to be added, using the metric

$$d(\mathbf{f}, \mathbf{g}) =
\begin{cases}
\max\limits_i \left| \left[ \begin{pmatrix} 1 \\ 1/k_Y \\ 1/k_Y \end{pmatrix} \odot (\mathbf{f} - \mathbf{g}) \right]_i \right| & \text{if } k_Y \geq 1, \\[2ex]
\max\limits_i \left| \left[ \begin{pmatrix} k_Y \\ 1 \\ 1 \end{pmatrix} \odot (\mathbf{f} - \mathbf{g}) \right]_i \right| & \text{otherwise},
\end{cases}$$

with $\odot$ representing the direct (elementwise) product. The factor $k_Y$ weights luminance ($Y$) against chrominance ($U$, $V$): $k_Y \gg 1$ means that the segmentation is mainly based on luminance information (segmentation of a greyscale image), whereas with $k_Y \ll 1$ the luminance information is nearly neglected. The image is scanned row by row, trying to assign the current pixel to one of the regions that its left, upper left, upper, and upper right neighbour pixels have already been assigned to. If a new pixel $y$ is assigned to a region $R_s$, the region's mean colour need not be recalculated over all region pixels. It is much more efficient to store the region size as well, so that the new mean colour can be computed from the old one:

$$\mathbf{f}^{\,\text{new}}(R_s) = \frac{|R_s|^{\text{old}} \, \mathbf{f}^{\,\text{old}}(R_s) + \mathbf{f}(y)}{|R_s|^{\text{old}} + 1}, \qquad |R_s|^{\text{new}} = |R_s|^{\text{old}} + 1.$$
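As an illustration, the following Python sketch implements this scan with the weighted colour metric and the incremental mean update. It is a minimal reconstruction, not the authors' implementation; the image layout (an H×W×3 YUV array), the linking threshold `d_max`, and the parameter values are assumptions.

```python
import numpy as np

def colour_distance(f, g, k_y=1.0):
    """Weighted maximum distance between two YUV colour vectors (the metric d).

    For k_y >= 1 the chrominance difference is scaled down by 1/k_y,
    otherwise the luminance difference is scaled down by k_y.
    """
    w = np.array([1.0, 1.0 / k_y, 1.0 / k_y]) if k_y >= 1 else np.array([k_y, 1.0, 1.0])
    return np.max(np.abs(w * (f - g)))

def region_growing(img, d_max=20.0, k_y=1.0):
    """Centroid linkage region growing over a (H, W, 3) YUV image.

    Scans row by row and tries to link each pixel to the region of its
    left, upper-left, upper, or upper-right neighbour; the region mean
    is updated incrementally from the stored mean and region size.
    """
    h, w, _ = img.shape
    labels = -np.ones((h, w), dtype=int)
    means, sizes = [], []                  # per-region mean colour and pixel count

    for y in range(h):
        for x in range(w):
            f = img[y, x].astype(float)
            best, best_d = -1, d_max       # link only if d < d_max
            # candidate neighbours: left, upper-left, upper, upper-right
            for dy, dx in ((0, -1), (-1, -1), (-1, 0), (-1, 1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and labels[ny, nx] >= 0:
                    s = labels[ny, nx]
                    d = colour_distance(f, means[s], k_y)
                    if d < best_d:
                        best, best_d = s, d
            if best < 0:                   # no similar neighbour: open a new region
                labels[y, x] = len(means)
                means.append(f.copy())
                sizes.append(1)
            else:                          # incremental mean update (see above)
                labels[y, x] = best
                means[best] = (sizes[best] * means[best] + f) / (sizes[best] + 1)
                sizes[best] += 1
    return labels, np.array(means), np.array(sizes)
```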
Pixels that cannot be added to a neighbouring region because $d$ is too big establish new regions. Because of this automatic generation of new regions the image is strongly oversegmented (see figure 2, left side). Therefore similar regions and regions with weak boundaries are merged. With respect to coding efficiency, small regions are also eliminated (see figure 2, right side).
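A postprocessing pass of this kind can be sketched in the same spirit. The fragment below is an illustrative assumption, not the paper's algorithm: it takes a precomputed region adjacency map, merges adjacent regions with similar mean colour via union-find, and then absorbs regions below a size threshold; the weak-boundary (gradient) criterion is omitted for brevity, and `d_merge` and `min_size` are hypothetical parameters.

```python
import numpy as np

def merge_similar_regions(labels, means, sizes, adjacency, d_merge=10.0, min_size=20):
    """Greedy merging pass over an oversegmented label image.

    `adjacency` maps each region id to the set of its neighbours. Adjacent
    regions with similar mean colour are merged; afterwards regions smaller
    than `min_size` are absorbed into their most similar neighbour.
    """
    parent = list(range(len(means)))

    def find(i):                           # union-find root with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):                       # merge j into i, combining mean and size
        i, j = find(i), find(j)
        if i == j:
            return
        means[i] = (sizes[i] * means[i] + sizes[j] * means[j]) / (sizes[i] + sizes[j])
        sizes[i] += sizes[j]
        parent[j] = i

    # 1) merge adjacent regions with similar mean colour
    for i, neighbours in adjacency.items():
        for j in neighbours:
            if np.max(np.abs(means[find(i)] - means[find(j)])) < d_merge:
                union(i, j)

    # 2) absorb too-small regions into their most similar neighbour
    for i, neighbours in adjacency.items():
        if sizes[find(i)] < min_size and neighbours:
            j = min(neighbours,
                    key=lambda n: np.max(np.abs(means[find(i)] - means[find(n)])))
            union(j, i)

    return np.vectorize(find)(labels), means, sizes
```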
2.2 Interframe Segmentation

For the segmentation of image sequences there are different possibilities. The simplest way is to treat each image separately, but this lacks temporal object connectivity. So if there is no scene change between two pictures, an interframe segmentation is done. It is based on the segmentation of the preceding image in order to find correspondences between the objects of sequential images and to gain temporal stability. Thus the efficiency of predictive coding is improved. Other published techniques just take the segmentation of the preceding image and adapt it to the new image content [4]. These techniques cannot handle fast moving small objects, or must increase the segmentation rate; but that only moves the critical motion limit and does not remove the key problem.

Figure 2: Segmentation of an image of the sequence Claire before/after merging (629 resp. 39 regions)
In our approach motion information is considered to solve this problem. The motion information is supplied by a region matching process that works the same way as blockmatching but is based on regions instead of blocks [5]. For each region $R_n$ its motion from image $I_{m-1}$ to image $I_m$ is determined by comparing the colour vectors of all pixels of the region in both images. Let $\mathbf{f}_m(x)$ denote the colour vector at position $x$ in frame $m$. Then matching is done by searching the minimum of the following expression over all variations of $v$:

$$\frac{1}{|R_n|} \sum_{x \in R_n} \left\| \mathbf{f}_{m-1}(x) - \mathbf{f}_m(x + v) \right\| \overset{!}{=} \min, \qquad v \in \mathbb{Z}^2.$$

This process has been improved by upsampling the images for subpixel accuracy. Furthermore it was sped up and stabilised by spatial and temporal motion prediction techniques, so that only a small subset of $\mathbb{Z}^2$ has to be tested.
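A minimal full-search version of this region matching could look as follows. It is a sketch under assumptions: the subpixel upsampling and the spatial/temporal candidate prediction mentioned above are left out, and `search_range` as well as the array layout are invented for illustration.

```python
import numpy as np

def region_match(prev_img, curr_img, region_pixels, search_range=7):
    """Estimate the translational motion of one region by region matching.

    Analogous to blockmatching, but the matching support is the set of
    region pixels rather than a rectangular block: for every candidate
    displacement v the mean colour difference between the region pixels
    in the previous frame and the displaced positions in the current
    frame is evaluated, and the v with the smallest error wins.
    """
    h, w, _ = curr_img.shape
    ys, xs = region_pixels                 # pixel coordinates of R_n in I_{m-1}
    best_v, best_err = (0, 0), np.inf

    for vy in range(-search_range, search_range + 1):
        for vx in range(-search_range, search_range + 1):
            ty, tx = ys + vy, xs + vx
            valid = (ty >= 0) & (ty < h) & (tx >= 0) & (tx < w)
            if not valid.any():
                continue
            diff = (prev_img[ys[valid], xs[valid]].astype(float)
                    - curr_img[ty[valid], tx[valid]].astype(float))
            err = np.mean(np.linalg.norm(diff, axis=1))   # mean colour distance
            if err < best_err:
                best_v, best_err = (vy, vx), err
    return best_v, best_err
```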
Before segmenting the actual image, the segmentation of the preceding image is transferred to it. This is done by assigning the region number $n_{m-1}(x)$, which has been assigned to a pixel $x$ in the preceding image $I_{m-1}$, to the new position $x'$ in the actual image $I_m$, according to the estimated motion $v(R_{n_{m-1}(x)})$ of the region to which $x$ belongs. If more than one region is assigned to a pixel, the region with the best prediction according to a metric $d$ (not necessarily the same as above; we only use the luminance component, as the chrominance components are subsampled in the original images) between the colour vectors is selected:

$$x' := x + v\!\left(R_{n_{m-1}(x)}\right), \qquad n_m(x') := n_{m-1}(x) \quad \text{with} \quad d\!\left(\mathbf{f}_m(x'), \mathbf{f}_{m-1}(x)\right) \overset{!}{=} \min.$$
Figure 3: Mask of pixels where the old segmentation could not be transferred, where the prediction was too bad, and where a new segmentation has to be done (union of the first and second mask)
The upper left mask of figure 3 shows the areas where the old segmentation could not be transferred by this technique. It can be seen that these areas are very small except for the new image area caused by camera motion. The upper right mask shows the areas where the prediction was too bad. The union of these areas (lower mask) has to be segmented anew. For this purpose an expansion of the old regions is done. But as new objects may have moved into the scene, or objects may have been uncovered, this might not cover the whole image. Therefore it is necessary to start a growing of new regions. This process works the same way as in intraframe mode, but is constrained to pixels that have not been assigned to an old region.

In interframe mode postprocessing has to be done as in intraframe mode, but in this case distinguishing between old and new regions: old ones can be coded predictively, so the minimum size of a region can be smaller. Additionally, merging of regions because of similar colour or a small gradient at the boundary is forbidden between old regions. Thus the old segmentation is kept in order to gain temporal stability. In addition, region boundaries have to be adapted, as the translational motion estimation cannot describe arbitrary object motion or deformation. This is done by comparing each boundary pixel with its neighbouring regions. If it fits better to another one, the boundary is shifted.
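The transfer step, including the conflict resolution by luminance prediction error, might be sketched like this. It is a simplified illustration, not the authors' code: the per-region motion table `motion` is assumed precomputed, and thresholding the prediction error into the "too bad" mask is left to the caller.

```python
import numpy as np

def transfer_segmentation(prev_labels, motion, prev_img, curr_img):
    """Project the previous segmentation into the current frame.

    Every pixel x of the previous frame is moved by the estimated motion
    v of its region; if several source pixels land on the same target
    pixel, the one whose luminance predicts the current frame best wins.
    Pixels that receive no label (-1) form the mask of areas where the
    old segmentation could not be transferred.
    """
    h, w = prev_labels.shape
    new_labels = -np.ones((h, w), dtype=int)      # -1: not transferred
    pred_err = np.full((h, w), np.inf)

    for y in range(h):
        for x in range(w):
            n = prev_labels[y, x]
            vy, vx = motion[n]                    # per-region displacement
            ty, tx = y + vy, x + vx
            if not (0 <= ty < h and 0 <= tx < w):
                continue
            # luminance-only prediction error decides label conflicts
            err = abs(float(curr_img[ty, tx, 0]) - float(prev_img[y, x, 0]))
            if err < pred_err[ty, tx]:
                new_labels[ty, tx] = n
                pred_err[ty, tx] = err
    return new_labels, pred_err
```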
[Diagram of the three segmentation layers: regions of the first layer are merged into second-layer regions (merging of regions with similar chrominance and motion); second-layer regions A and B are merged into a third-layer region (merging of regions with similar motion).]

Figure 4: Segmentation layers
2.3 Hierarchical Segmentation

In order to achieve a segmentation conforming to a subjective image partition, a hierarchical merging of regions is done. The requirements are difficult to meet since there is no information on the image semantics. Nevertheless motion information has already proven useful for semantic segmentation [6,7]. In our contribution chrominance information is used as a second semantic feature, as it is hardly influenced by shading effects and can be regarded almost as a property of the object material.
So a three layered hierarchy is built up (see figure 4): The first layer is the segmentation described above. In the second layer, regions of the first layer are merged with respect to similar chrominance and motion. The highest layer corresponds to a semantic segmentation, where regions of the second layer are merged in case of similar motion. To improve the temporal stability of the segmentation on and between these three layers, the relations of the hierarchy of the preceding image are reconstructed first when the segmentation hierarchy of the actual image is built up. For the reconstruction the duration of a particular constellation in the past is also considered: let region B in figure 4 be the face of the speaker and region A the whole person. If the connection between A and B has already existed for several frames, a higher motion difference between the face and the rest of the speaker is tolerated without separating the face from the person. On the basis of these three hierarchy levels the coding of the sequence can be done. The layers are suitable both for an efficient prediction of old objects and for supporting an object based data access up to the choice of semantic objects.
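A simplified sketch of the two merging steps follows, assuming per-region mean chrominance vectors and motion vectors have been computed; the thresholds `d_chroma` and `d_motion` are hypothetical, and the temporal reconstruction of the preceding hierarchy is not modelled.

```python
import numpy as np

def build_hierarchy(adjacency, chroma, motion, d_chroma=8.0, d_motion=1.5):
    """Derive the second and third segmentation layer from the first.

    Second layer: adjacent first-layer regions are merged when both their
    mean chrominance and their motion are similar. Third layer: regions
    are merged on similar motion alone, which yields a coarsening of the
    second layer.
    """
    n = len(adjacency)

    def merge_layer(similar):
        # greedy connected-component merging of adjacent, similar regions
        parent = list(range(n))
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i
        for i, neighbours in adjacency.items():
            for j in neighbours:
                if find(i) != find(j) and similar(i, j):
                    parent[find(j)] = find(i)
        return [find(i) for i in range(n)]

    layer2 = merge_layer(lambda i, j:
                         np.linalg.norm(chroma[i] - chroma[j]) < d_chroma and
                         np.linalg.norm(motion[i] - motion[j]) < d_motion)
    layer3 = merge_layer(lambda i, j:
                         np.linalg.norm(motion[i] - motion[j]) < d_motion)
    return layer2, layer3
```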
[Three plots of the Y, U, and V components over the image columns; dashed: original, solid: coded.]

Figure 5: Original and coded colour components of a row of the sequence Claire
3 OBJECT ORIENTED CODING OF CHROMINANCE INFORMATION

The second segmentation layer has been used for a very efficient coding of the chrominance information. The human eye is not as sensitive to chrominance information as to luminance information. This property can be exploited by coding the chrominance information in a very simple way: one UV pair for each region of the second segmentation layer, thus not weakening chrominance edges but only removing detailed information that is not noticed by the human eye. Figure 5 shows the original (dashed lines) and the coded colour components (solid lines) of a row from the image sequence Claire. Chrominance edges are even steepened compared to the original with subsampled chrominance components. For coding the luminance information a more accurate technique should be used instead, because the eye is very sensitive to errors in that component. The proposed technique for chrominance coding corresponds to a compression factor of about 1000 (compared to full resolution chrominance information). It can even be raised further by temporal prediction (the colour of objects does not change fast in most cases) and coarser quantization. Of course, not only each object's mean UV pair must be transferred; the object shapes have to be coded too in order to know where the chrominance values belong. But that has to be done for greyscale images anyway in object oriented coding, so there is nearly no extra coding effort for colour images compared to greyscale images.
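The coding scheme itself reduces to replacing each second-layer region's chrominance by its mean. The sketch below illustrates this and recomputes the order of magnitude of the stated compression factor for a QCIF frame; the byte counts are back-of-the-envelope assumptions that ignore the shape coding, which is needed for the greyscale objects anyway.

```python
import numpy as np

def code_chrominance(img_yuv, layer2_labels):
    """Replace the U and V channels by the per-region mean UV pair.

    `img_yuv` is a (H, W, 3) array, `layer2_labels` the second-layer
    segmentation. Returns the coded image and the per-region UV table
    that would actually be transmitted.
    """
    coded = img_yuv.astype(float).copy()
    uv_table = {}
    for r in np.unique(layer2_labels):
        mask = layer2_labels == r
        uv = coded[mask, 1:].mean(axis=0)      # one mean UV pair per region
        uv_table[int(r)] = uv
        coded[mask, 1:] = uv                   # flat chrominance inside the region
    return coded, uv_table

# Rough size comparison for a QCIF frame (176 x 144) with ~25 regions:
#   full-resolution chrominance: 176 * 144 * 2 bytes ~ 50688 bytes
#   coded:                       25 regions * 2 bytes ~ 50 bytes
# i.e. a factor on the order of 1000, consistent with the figure in the text.
```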
[Result images can be obtained by email from [email protected] with eos/spie-image-request as subject.]

Figure 6: Temporal segmentation of the sequence Foreman: upper left: image $I_1$ (intraframe segmented), upper right: image $I_2$ (interframe segmented with boundary adaptation), lower left: image $I_{15}$ (interframe segmented without boundary adaptation), and lower right: image $I_{15}$ (same with boundary adaptation)
4 RESULTS

The approach described above shall be illustrated by the well known sequence Foreman, here in QCIF format. At the lowest layer about 70 regions were obtained. These were reduced to approximately 25 regions at the second layer and two to nine regions at the highest layer. Figure 6 presents the results of the temporal segmentation for the first, the second and the 15th frame of the sequence: there is no big difference between the two consecutive images $I_1$ and $I_2$ (upper images). When segmentation is done without adaptation of the boundaries, these become increasingly wrong (lower left image). In the lower right image boundary adaptation has been done in each step, so the segmentation boundaries fit better to the objects. On the other hand, there is more boundary motion when doing adaptation. Most boundary motion can be seen at unimportant, weak edges. Therefore the adaptation should be done at steep boundary parts only.

Figure 7: Hierarchical segmentation of image $I_{110}$ from the sequence Foreman: upper left: lowest layer, upper right: merging of regions with similar chrominance and motion, at the bottom: merging of regions with similar motion

In figure 7 the three segmentation layers of frame $I_{110}$ are shown. The upper left image shows the lowest segmentation layer. In the upper right image, regions with similar chrominance and motion have been merged. Thus the background could already be well detected as a single object; for the proposed chrominance coding one UV pair is sufficient for nearly the whole background. The speaker itself is still divided into several regions. On the highest segmentation layer (lower image) only two regions are left, namely the speaker and the background. On the other hand it can be seen that there are some errors: there are notches in the hard hat, but these already exist on the first segmentation layer and are caused by colours that actually fit better to the background. A part of the background has been merged into the speaker due to wrong motion estimation. A more complex motion model should achieve better results. It would also improve the segmentation of the speaker in the first frames of this scene, where motion is very small, so that the speaker could not be well segmented with our translational motion model.
5 REFERENCES

[1] P. Salembier, L. Torres, F. Meyer, and Ch. Gu. Region-based video coding using mathematical morphology. Proceedings of the IEEE, 83(6):843-857, June 1995.
[2] R. M. Haralick and L. G. Shapiro. Image segmentation techniques. Computer Vision, Graphics, and Image Processing, 29:100-132, 1985.
[3] J. Shen and S. Castan. Further results on DRF method for edge detection. In 9th ICPR, page 203, Rome, 1988.
[4] M. Pardàs and P. Salembier. Time recursive segmentation of image sequences. In Signal Processing VII: Theories and Applications, pages 18-21, 1994.
[5] D.-F. Shen and S. A. Rajala. Segmentation based motion estimation and residual coding for packet video: a goal oriented approach. SPIE Visual Communications and Image Processing, 1818:825-836, 1992.
[6] M. Hötter. Objektorientierte Analyse-Synthese-Codierung basierend auf dem Modell bewegter, zweidimensionaler Objekte. PhD thesis, University of Hannover, 1992. Published as VDI-Fortschrittbericht (Reihe 10, Nr. 217), VDI-Verlag.
[7] F. Fechter. Konturgesteuerte Bildmischung durch Bewegungssegmentierung. Fernseh- und Kino-Technik, 49(11):651-660, November 1995.