Proc. 34th Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, October 29 - November 1, 2000.
Polygonal Shape Descriptors – An Efficient Solution for Image Retrieval and Object Localization

André Kaup and Jörg Heuer
Siemens Corporate Technology, Information and Communications, 81730 Munich, Germany

Invited Paper
ABSTRACT

Description and standardization of low-level visual features such as object shapes is essential for interactive multimedia access. This paper discusses a polygon-based shape descriptor providing two functionalities: shape description for image retrieval and shape coding for object localization. The descriptor is optimized in two ways: with respect to retrieval rate versus size and with respect to localization error versus size. While these two functionalities usually require two separate shape descriptors, this paper presents a single efficient representation fulfilling both requirements.
1. INTRODUCTION

Recently developed image and video coding standards such as JPEG2000 or MPEG-4 [1] have not only been optimized with respect to coding efficiency but have also added functionality to enable interactive image and video presentations. Thus they provide mechanisms that allow the user to browse through the visual content. To extend this approach to information retrieval, MPEG-7 is standardizing an exchange format for meta data [2]. The meta information can be used for information retrieval by querying or filtering, or can be delivered as additional information with the multimedia data. Thus meta information can serve several purposes, and coding of this information has to be optimized with respect to these different purposes. For instance, a shape specification can enable annotation of visual regions in an image by localization or visualization of the contour. On the other hand, the shape specification can also be used for search and retrieval based on the contour of visual objects. While both functionalities can be realized with the same shape specification, an efficient encoding has different requirements depending on the targeted application. For instance, the context based arithmetic encoder (CAE) [3] is an efficient pixel based encoder of the shape for visualization. But if it is used for matching, the complex decoding process and a contour extraction have to be executed before matching can take place. Other shape representations for localization such as runlength encoding [4] and polygon based encoding [5] behave similarly. These algorithms are specified in more detail in Section 3. Contrary to shape representations for localization, an efficient shape descriptor for retrieval is evaluated with respect to its retrieval rate. For instance, the curvature scale space descriptor described in [6,7] shows a good retrieval performance.
However, computation of this descriptor is based on a non-reversible transformation, and hence it cannot be used for localization. In this paper a polygon based descriptor proposed in [8,9] is discussed. One property of this shape specification, compared to the one presented earlier, is that it can be used both for shape based image retrieval and for region localization and visualization. For the first functionality a good retrieval rate at a small descriptor size is required, while in the latter case the descriptor should approximate the original contour with a small distortion at a low rate. While the representations for retrieval and localization can be optimized separately, a combined representation would reduce the overall size of the transmitted meta data. This paper is structured as follows: in the next section the polygon based descriptor and results of an optimization with respect to image retrieval are presented. In the third section the polygon based shape encoding is compared to the shape coding used by MPEG-4 and to one mentioned in the literature. Also an inherent shape descriptor for localization and retrieval is discussed. Finally a conclusion is drawn.
2. A POLYGON BASED SHAPE SPECIFICATION FOR RETRIEVAL

For computation of a shape descriptor based on the polygonal representation of the contour, it is desirable to approximate the original contour while preserving the perceptual appearance at a level sufficient for object recognition or retrieval. To achieve this, an appropriate approximation (or curve evolution) method was proposed in [10]. The curve evolution method achieves the task of shape simplification in a parameter-free way, i.e., the process of evolution compares the significance of vertices of the contour based on a relevance measure. Since any digital curve can be regarded as a polygon without loss of information (with possibly a large number of vertices), it is sufficient to study evolutions of polygonal shapes. The basic idea of the proposed evolution of polygons is very simple. In every evolution step, a pair of consecutive line segments s1, s2 is substituted with a single line segment joining the endpoints of s1 and s2. The key property of this evolution is the order of the substitutions. A substitution is done according to the relevance measure K given by

    K(s1, s2) = β(s1, s2) l(s1) l(s2) / (l(s1) + l(s2))    (1)
where β(s1, s2) is the turn angle at the common vertex of segments s1, s2 and l is the length function normalized with respect to the total length of the polygonal curve. The evolution algorithm assumes that vertices surrounded by segments with a high value of K(s1, s2) are important, while those with a low value are not. A cognitive motivation of this property is given in [10], where a detailed description of the discrete curve evolution can also be found. In signal terms, the algorithm simplifies the shape complexity while, unlike shape simplification using diffusion processes, it does not dislocate the remaining vertices (see Fig. 1). This is an important feature of the evolution process since the descriptor can then also be used for localization. For image retrieval it is important that the shape descriptor computed by the evolution process is robust with respect to noise. This is the case since the small segment pairs caused by noise result in small values of the relevance measure K; thus these segment pairs are removed at an early stage of the evolution process. A more formal justification of the above properties can be found in [10].
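As a sketch, the vertex deletion loop with the relevance measure (1) might look as follows. This is an illustrative reimplementation, not the reference code of [10]; the termination threshold is applied directly to K.

```python
import math

def turn_angle(p0, p1, p2):
    """Absolute turn angle at vertex p1 between segments p0->p1 and p1->p2."""
    a1 = math.atan2(p1[1] - p0[1], p1[0] - p0[0])
    a2 = math.atan2(p2[1] - p1[1], p2[0] - p1[0])
    d = (a2 - a1 + math.pi) % (2 * math.pi) - math.pi  # wrap to [-pi, pi)
    return abs(d)

def evolve(polygon, k_threshold):
    """Discrete curve evolution: repeatedly delete the least relevant vertex
    of a closed polygon until every remaining vertex exceeds the threshold."""
    pts = list(polygon)
    while len(pts) > 3:
        perimeter = sum(math.dist(pts[i], pts[(i + 1) % len(pts)])
                        for i in range(len(pts)))
        def relevance(i):
            p0, p1, p2 = pts[i - 1], pts[i], pts[(i + 1) % len(pts)]
            l1 = math.dist(p0, p1) / perimeter  # lengths normalized by total length
            l2 = math.dist(p1, p2) / perimeter
            return turn_angle(p0, p1, p2) * l1 * l2 / (l1 + l2)
        i_min = min(range(len(pts)), key=relevance)
        if relevance(i_min) >= k_threshold:
            break
        del pts[i_min]  # remaining vertices are never displaced
    return pts
```

For example, evolving a square contour with one noisy mid-edge vertex removes that vertex first, since its small turn angle yields a small K, while the corners survive.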
Fig. 1: Fine to coarse simplification of the original contour (left) by preserving the perceptual information. The vertices of the simplified contour are also vertices of the original contour.
Fig. 3: L1 norm in the tangent space for two arcs c and d.

Fig. 2: Coded representation of a segment: a segment is coded with 3 bits for the octant, the major component (here Y) represented according to its dynamic range, and the minor component (here X) represented with min(dynamic range of X, number of bits needed to code Y) bits. The dynamic range of the X and Y components is encoded in the header of the contour bitstream.
2.1 Shape descriptor encoding
Depending on the threshold of the relevance measure K used to terminate the evolution process, more or fewer contour details are preserved in the shape descriptor. This is investigated in more detail in Section 2.3. The descriptor itself is encoded using the following differential variable length encoding scheme [5,11]:
• The vertices of the processed polygon are encoded differentially, which corresponds to encoding the segments of the polygon. If localization is needed, the first vertex is also encoded in absolute values using its x and y coordinates.
• For the descriptor, a dynamic range of the coordinate values is specified.
• Each segment is coded by specifying the octant it lies in as well as the major and minor component. While the octant specifies the major and minor component and their signs, the value of the major component also specifies the dynamic range of the minor component (Fig. 2).
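To make the bit budget concrete, here is a hedged sketch of such a segment code. The octant numbering and exact bit layout are hypothetical; only the overall structure (3 octant bits, major component coded at the dynamic-range width, minor component bounded by the major) follows the scheme above.

```python
def encode_segment(dx, dy, dyn_bits):
    """Hypothetical octant-based segment code.
    3 bits select the octant (the signs of dx, dy and which axis is major);
    the major magnitude is coded with the dynamic-range width from the
    header; the minor magnitude needs at most as many bits as were used
    for the major, since |minor| <= |major| inside an octant."""
    ax, ay = abs(dx), abs(dy)
    octant = ((dx < 0) << 2) | ((dy < 0) << 1) | int(ay > ax)  # illustrative numbering
    major, minor = (ay, ax) if ay > ax else (ax, ay)
    major_bits = dyn_bits                                 # fixed by the header
    minor_bits = min(dyn_bits, max(major.bit_length(), 1))
    return octant, major, minor, 3 + major_bits + minor_bits
```

The saving comes from the last line: whenever the major component is small, the minor component is coded with fewer bits than the full dynamic range would require.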
2.2 Shape based retrieval
Matching of the polygon based shape descriptor is done by comparing convex parts. This is motivated by the observation that during the evolution process the contours of visual parts become convex arcs at a certain stage. Since this stage might not be common to all visual parts and also differs from the stage reached in the computation of the shape descriptor used for matching, a comparison of groups of convex arcs is performed. Thus the key idea is to find the right correspondence of visual parts. It is assumed that a single visual part (i.e. a convex arc) of one curve can correspond to a sequence of consecutive convex and concave arcs of the second curve. Hence the correspondence can be a one-to-many or a many-to-one mapping, but never a many-to-many mapping. This assumption is justified by the fact that a single visual part should match its noisy versions, which can be composed of sequences of consecutive convex and concave arcs, and by the fact that a visual part obtained at a higher stage of evolution should match the arc it originates from. In [8] the matching is described in detail; here only the metric of the matching between two descriptors D1, D2 representing closed polygons O1 and O2 is described. The similarity measure is based on the comparison of corresponding convex and concave arcs of the contour. The similarity measure of arcs Sa is based on the computation of the L1 norm in the tangent space representation of the arcs: a polygonal arc Cη1 is represented by a step function T(Cη1) in the tangent space (for illustration, see Fig. 3). Based on this measure, the comparison of two closed contours is computed by determining the best mapping of corresponding arcs of both contours. For simplification only one-to-many and many-to-one mappings are evaluated. In [11] a more detailed specification of the matching metric is given.
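The tangent-space step functions and their L1 comparison can be sketched as follows; the sampling-based integral is a simplification for illustration (a closed-form evaluation over the merged breakpoints would be exact):

```python
import math

def tangent_steps(arc):
    """Tangent-space (turning-function) representation of a polygonal arc:
    a list of (normalized segment length, direction angle) steps."""
    segs = list(zip(arc, arc[1:]))
    lengths = [math.dist(a, b) for a, b in segs]
    total = sum(lengths)
    angles = [math.atan2(b[1] - a[1], b[0] - a[0]) for a, b in segs]
    return [(l / total, ang) for l, ang in zip(lengths, angles)]

def l1_tangent_distance(arc_c, arc_d, samples=1000):
    """Approximate L1 norm between the two step functions by sampling
    the normalized arc-length axis."""
    def value_at(steps, t):
        acc = 0.0
        for frac, ang in steps:
            acc += frac
            if t <= acc + 1e-12:
                return ang
        return steps[-1][1]
    sc, sd = tangent_steps(arc_c), tangent_steps(arc_d)
    ts = [(i + 0.5) / samples for i in range(samples)]
    return sum(abs(value_at(sc, t) - value_at(sd, t)) for t in ts) / samples
```

For instance, a straight arc (constant direction 0) compared with a single segment at 45 degrees gives a distance of π/4, since the two step functions differ by that constant everywhere.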
2.3 Descriptor optimization
For evaluation, the MPEG-7 test data set of in total 3450 shapes was used. The test data set can be divided into three main parts with respect to the following objectives:
• Part A: robustness to scale (A1) and rotation (A2)
• Part B: performance of the similarity-based retrieval
• Part C: robustness to small non-rigid transformations due to motion
For the retrieval performance only the recall is measured, where recall is the ratio of the number of retrieved relevant shapes to the number of relevant shapes in the database. The test conditions include a specification of the size of the retrieved image set, so besides the recall no measurement of the precision is considered. To evaluate the discriminating power with respect to the descriptor size, the average size of the shape descriptor over the database is also computed. For a more detailed specification of the test conditions and the descriptor optimization with respect to retrieval see [11].
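As a small illustration, the recall used above can be computed as:

```python
def recall(retrieved, relevant):
    """Recall: fraction of the relevant shapes that appear in the retrieved
    set (precision is not part of the test conditions)."""
    relevant = set(relevant)
    return len(relevant & set(retrieved)) / len(relevant)
```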
Fig. 5: Evolution of the retrieval rate against the descriptor size.
Fig. 4: Evolution of the number of vertices according to the evolution threshold.

As discussed previously, the degree of contour polygon abstraction achieved by the evolution process also influences the descriptor size. The evolution algorithm removes points of the contour as long as the area changed in the tangent space by this simplification remains below a given threshold. By increasing this threshold, the evolution removes more vertices from the contour and thus the set of remaining vertices which have to be stored becomes smaller. In general this reduction is also related to the size of the descriptor. But since a variable length code is used to encode the segments, a reduction of the number of vertices is not necessarily related in a linear fashion to a reduction of the encoded descriptor data. In Fig. 4 the relation between the average number of vertices and the evolution threshold for the given data set can be observed. The lowest relevance threshold K for the evolution is 0.34. For this value the contours in the test set are reduced to an average of 22 vertices, which relates to an average descriptor size of 42 bytes. Even though there is no linear relation between the threshold K and the descriptor size, it is still a monotone relation: when at a relevance threshold of 1.0 the number of vertices falls to 12, the descriptor size decreases monotonically to an average of 24 bytes. To find an optimal operating point for the relevance measure, it has to be investigated how the contour abstraction affects the shape retrieval, and up to which stage contour details are significant with respect to shape recognition. Different thresholds K for the evolution algorithm have been applied to the test set in order to find the operating point which maximizes retrieval rate versus descriptor size. As mentioned above, the descriptor can also be used for localization. But the range of abstraction applied by high values of the relevance measure only targets the retrieval functionality.
Fig. 6: Size distortion diagrams of context arithmetic encoding, run length coding, and polygon based contour coding using the peak distance/rate optimized vertex allocation or the algorithm based on vertex selection at maximum distance error.
Fig. 7: Size distortion diagram for polygon encoding based on the optimized vertex selection starting from one point, starting from the vertices of the shape descriptor, and for the contour evolution of the shape descriptor.

The threshold for the relevance measure K was varied in the range 0.34 to 1, which corresponds to a range of 22 down to 12 vertices per object. Fig. 5 shows the diagram of retrieval rate vs. descriptor size. The variation in descriptor size is obtained by subsampling the vertex points, which is achieved by changing the threshold of the relevance measure for the evolution algorithm. The retrieval rate is the average value over all test queries as described in Section 3. The diagram shows that the retrieval performance increases as the descriptor detail, and with it the size, is increased, down to a threshold of K≈0.6. Increasing the detail beyond this point does not improve but even worsens the overall performance: at a low evolution threshold, small segments caused by errors in the contour segmentation are not eliminated in the evolution process. This results in a lower retrieval rate even though more details of the contour are preserved in the descriptor.
3. THE SHAPE DESCRIPTOR FOR LOCALIZATION

In the previous section a shape descriptor for retrieval was presented which is based on a polygonal approximation of the contour. This representation of the contour is composed of vertices of the original contour specifying visual parts. Hence it can also be used for localization, even though the precision is very low at a working point of K≈0.6. For localization, the precision should not be fixed to the requirements of a shape description for retrieval; rather, shape encoding of a descriptor for localization should be scalable with respect to the precision required by the application. Thus we compare in this section different scalable shape coding techniques for the purpose of localization: polygon based shape coding, runlength coding, and context based arithmetic coding as used in MPEG-4. For polygon based shape coding two cases are distinguished. First, an optimal encoding of a separate descriptor for localization: in this case we examine algorithms for vertex allocation with respect to localization only. Second, we determine the efficiency of a combined descriptor for localization and retrieval. In this approach an inherent coding of the vertices used by the shape descriptor and the vertices allocated for localization is proposed and evaluated.
The efficiency of different algorithms is compared by a rate distortion analysis of the shape representations. For the evaluation two different error measures are used: the peak distance of the contour approximation to the original contour and the relative area error. For complexity reasons the rate distortion optimization of the encoding is done using the peak distance.
3.1 Runlength Coding
Contour runlength coding can be seen as a scalable extension of the chain code [4]. The segment between two vertices is represented by an angle α and a run β in the direction of the major Cartesian component (x or y). Only eight possible directions are allowed, restricted to angles with the horizontal axis that are multiples of π/4; thus the angle is represented by three bits. The runlength is encoded by a variable length code consisting of β-1 zeros and a final "1". So every segment in the runlength encoded polygonal contour is represented by 3+β bits. Due to the quantization of the angle between segments, the runlength coding scheme cannot code every vertex of the contour: based on the position of the previous vertex, only vertices at the angles described above are reachable. A rate distortion optimized encoding of the contour has been specified in [4] under the constraint that all encoded vertices are selected from the original contour.
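One segment of this chain code can be sketched as a bit string; the direction numbering below is hypothetical, but the 3+β bit structure matches the description above:

```python
def encode_rl_segment(direction, run):
    """One runlength-coded segment: 3 bits for one of the eight pi/4
    directions (numbering here is hypothetical), then the run beta coded in
    unary as (beta - 1) zeros and a terminating '1' -> 3 + beta bits total."""
    assert 0 <= direction < 8 and run >= 1
    return format(direction, '03b') + '0' * (run - 1) + '1'
```

The unary run code makes short runs cheap and keeps the code instantaneously decodable, at the cost of long runs growing linearly in bits.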
3.2 Context Based Arithmetic Encoder
Context based arithmetic encoding is a bitmap based coding scheme which is included in the MPEG-4 standard to transmit the shape information of video objects [3]. A rectangular bounding box enclosing the shape is selected such that its horizontal and vertical dimensions are multiples of 16 pixels (macroblock size). Each block of size 16x16 pixels within this bounding rectangle is called a binary alpha block (BAB). Three types of BABs are distinguished and signaled to the decoder: transparent blocks that do not contain information about the object, opaque blocks that are located entirely inside the object, and boundary blocks that cover part of the shape as well as part of the background. For boundary blocks, a context based arithmetic encoder which exploits the spatial redundancy of the binary shape information was developed within MPEG-4. A template of 10 pixels is used to define the causal context for predicting the shape value of the current pixel. The probability table of the arithmetic encoder covers the 1024 possible contexts. With two bytes describing the symbol probability of each context, the table size is 2048 bytes. In order to increase coding efficiency as well as to allow lossy shape coding, a BAB can be subsampled by a factor of 2 or 4, resulting in a subblock of size 8x8 or 4x4 pixels, if the resulting block fulfills the accepted BAB quality. The conversion rate (CR) stores the downsampling factor. The subblock is encoded using the encoder described above. The encoder transmits the CR such that the decoder extracts the shape information and then upsamples the subblock to macroblock size. Obviously, encoding the shape using a high subsampling factor reduces the number of bits needed, but the decoded shape after upsampling may not be the same as the original shape. Hence, this subsampling is mostly used for lossy shape coding. In order to achieve smooth object boundaries after upsampling, an adaptive nonlinear filter is used. For more details please refer to [3].

Fig. 8: Size distortion diagram for the polygon encoding based on the additive vertex selection starting from one point, starting from the vertices of the shape descriptor, and for the contour evolution of the shape descriptor.
3.3 Polygonal Shape Encoding
The coding scheme for the polygon based shape descriptor for retrieval can also be used for contour encoding for localization. Contrary to the descriptor representation for retrieval, the contour approximation is now optimized for localization and hence for an optimal rate distortion behavior. This algorithm is similar to the optimized vertex selection used for runlength coding, with two exceptions: the calculation of the encoding cost of the segments is no longer separable, and the endpoints of the segments are limited not by the angle but by the dynamic range of the segment. Accordingly, the algorithm was modified with respect to a limited search range. Regarding the first point, the optimal rate distortion point has to be found by varying the dynamic range of the encoding. The computation was sped up by the assumption that the shapes have a monotone behavior of the first derivative of the cost function with respect to the dynamic range. Due to the remaining complexity of this task, this algorithm is only used to determine a reference cost distortion function. Besides the cost distortion optimized vertex allocation, a fast algorithm was also implemented: the recursive algorithm starts from the triangle, i.e. the final simplification obtained by the contour evolution presented in Section 2. Each approximating segment which exceeds a distortion threshold compared to the original contour is split into two segments. The additional vertex for this operation is the point of the original contour with the largest distance to the former segment. This operation is applied recursively until the maximum peak error is below the required threshold.
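The fast recursive splitting resembles classic Ramer-Douglas-Peucker simplification; a sketch under that assumption, collecting the indices of selected vertices:

```python
import math

def point_segment_distance(p, a, b):
    """Distance from point p to the line segment a-b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    if dx == 0 and dy == 0:
        return math.dist(p, a)
    t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))  # clamp to the segment
    return math.dist(p, (a[0] + t * dx, a[1] + t * dy))

def split_at_max_error(contour, i, j, threshold, keep):
    """Recursively add the contour point farthest from the approximating
    segment contour[i]-contour[j] while the peak error exceeds the
    threshold, collecting the selected vertex indices in `keep`."""
    if j - i < 2:
        return
    k = max(range(i + 1, j),
            key=lambda m: point_segment_distance(contour[m], contour[i], contour[j]))
    if point_segment_distance(contour[k], contour[i], contour[j]) > threshold:
        keep.add(k)
        split_at_max_error(contour, i, k, threshold, keep)
        split_at_max_error(contour, k, j, threshold, keep)
```

Each recursion level halves the span around the worst-offending point, so the peak error drops quickly and only the vertices needed for the requested precision are kept.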
3.4 Comparison of encoded shape representations for localization
Part B of the MPEG-7 test set was also used to compute the cost distortion functions for the algorithms mentioned in the previous sections. To analyze the performance of the cost distortion optimized runlength encoding, the cost distortion optimized polygon based encoding, and the CAE of MPEG-4, the resulting peak error with respect to the average size of the encoded shape representation is shown in Fig. 6. It can clearly be seen that for all test conditions the runlength encoding performs worst. For the precision needed for localization (error > 1%) the polygon based encoding performs better than the CAE encoding: at a similar error the size can be reduced by 30-60%. The polygon encoding is also scalable with a finer granularity and over a greater range than the CAE method. Only if a near lossless encoding of the shape is required does CAE perform better. A similar behavior can be observed when using the relative area error instead of the peak distance error, even though the polygon based encoding is optimized with respect to the peak error measure.
3.5 Polygonal descriptor for joint retrieval and localization
For localization most often a higher precision is required than in case of shape description for retrieval as described in Section 2.
Therefore, to save bandwidth, the data needed by the shape descriptor for retrieval can be incorporated into the descriptor for localization if this functionality is required. To analyze the effect on the localization efficiency we have compared two inherent encoding schemes:
• starting the allocation of vertices for polygonal shape encoding not with a triangle but with the vertices used by the shape descriptor,
• using the shape evolution algorithm to select the vertices for shape encoding as well.
Notice that in the second case the vertices contained in the shape descriptor are also contained in the shape specification of an earlier step of the evolution, corresponding to a lower distortion threshold. In Figs. 7 and 8 both approaches are compared against the algorithms proposed for region localization. In case of the rate distortion optimized vertex allocation, starting from the vertices of the shape descriptor does not increase the size of the shape coding with respect to the peak distance error (Fig. 7). Using the contour evolution algorithm instead increases the size by approximately 25-30%. If the relative area error is taken into account, the contour evolution mechanism performs up to 25% better. Besides the optimized vertex selection we also compared the inclusion of the shape descriptor for retrieval into the shape encoding based on the additive vertex selection. Using the absolute peak distance error measure, shape encoding for localization, shape encoding embedding the shape descriptor for retrieval, and shape encoding based on the contour evolution perform similarly (Fig. 8). With respect to the relative area error the last two methods perform slightly better. In conclusion, the shape descriptor can be embedded into the encoded shape representation of the contour for localization or visualization. This can be done with no increase in size if the absolute peak distance is considered, and even with a more compact representation when a combined measure of absolute peak distance and relative area error is taken into account. Thus the size of the descriptor is reduced to signaling which vertices of the shape representation are used for the descriptor. In the first case this can be encoded with no more bits than the number of vertices (less than 100 for a 1% error), and in the second case it does not have to be signaled at all.
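The signaling in the first case amounts to one flag per encoded vertex; a hypothetical sketch:

```python
def descriptor_flags(encoded_vertices, descriptor_vertices):
    """One flag per encoded vertex marking whether it also belongs to the
    retrieval descriptor: at most one bit per vertex of overhead."""
    marks = set(descriptor_vertices)
    return [int(v in marks) for v in encoded_vertices]
```

The decoder recovers the retrieval descriptor by keeping exactly the flagged vertices of the localization polygon, so no coordinates are transmitted twice.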
4. CONCLUSIONS

In this paper we have tackled the problem of combining the retrieval and localization functionality of a shape descriptor in one representation. Using the example of a polygon based shape descriptor, we have proposed an efficient coding scheme which can be used for retrieval as well as for localization purposes. To optimize this visual descriptor with respect to its size, we investigated the retrieval rate versus size behavior of the descriptor. It was also shown that polygon based shape encoding outperforms shape encoding mechanisms such as CAE and runlength coding with respect to localization. Finally, two approaches to embed the shape descriptor into the encoded shape representation for localization were discussed. Both approaches show that there is nearly no increase in the shape representation size when the shape descriptor is included by an inherent encoding. The approach based on the contour evolution computed for the shape descriptor performs even better if a combination of the peak distance error and the area error is considered. Hence the extra cost is reduced to an optional indexing of the vertices used for retrieval.
5. ACKNOWLEDGEMENT The authors would like to thank Francesc Sanahuja for implementing the shape descriptors and for plotting the comparative figures.
6. REFERENCES

[1] R. Koenen, "MPEG-4 Multimedia for our time", IEEE Spectrum, pp. 26-33, Feb. 1999.
[2] MPEG-7: Context, Objectives and Technical Roadmap, V.12, ISO/IEC JTC1/SC29/WG11/N2861, http://www.cselt.stet.it/mpeg/, July 1999.
[3] A. Katsaggelos, L. Kondi, F. Meier, J. Ostermann, G. Schuster, "MPEG-4 and rate-distortion based shape coding techniques", Proceedings of the IEEE, pp. 1126-1154, June 1998.
[4] G. M. Schuster, A. K. Katsaggelos, "An Optimal Polygonal Boundary Encoding Scheme in the Rate Distortion Sense", IEEE Trans. Image Processing, Vol. 7, No. 1, pp. 13-26, Jan. 1998.
[5] K. J. O'Connell, "Object-Adaptive Vertex-Based Shape Coding Method", IEEE Trans. Circuits and Systems for Video Technology, Vol. 7, No. 1, pp. 251-255, Feb. 1997.
[6] F. Mokhtarian and A. K. Mackworth, "A theory of multiscale, curvature-based shape representation for planar curves", IEEE Trans. PAMI, Vol. 14, No. 8, pp. 789-805, Aug. 1992.
[7] F. Mokhtarian, S. Abbasi, and J. Kittler, "Efficient and robust retrieval by shape content through curvature scale space", Image DataBases and Multi-Media Search, pp. 51-58, World Scientific Publishing, Singapore, 1997.
[8] L. J. Latecki and R. Lakämper, "Contour-based shape similarity", Proc. 3rd Int. Conf. on Visual Information Systems, Amsterdam, pp. 617-624, June 1999.
[9] J. Heuer, A. Kaup, U. Eckhardt, L. J. Latecki, R. Lakämper, "Results of Polygon based Contour Shape descriptor according to CE1", ISO/IEC JTC1/SC29/WG11/M5906, Noordwijkerhout, March 2000.
[10] L. J. Latecki and R. Lakämper, "Polygon Evolution by Vertex Deletion", Proc. 2nd Int. Conf. on Scale-Space Theories in Computer Vision, Corfu, Greece, Springer-Verlag, pp. 398-409, Sept. 1999.
[11] J. Heuer, F. Sanahuja, A. Kaup, "Visual feature discrimination versus compression rate for polygon shape descriptors", Proc. Internet Multimedia Management Systems, Boston, SPIE Vol. 4210, Nov. 2000.