SHAPE REPRESENTATION BY SPATIAL PARTITIONING FOR CONTENT BASED RETRIEVAL APPLICATIONS
S. Berretti, G. D'Amico and A. Del Bimbo
Dipartimento Sistemi e Informatica, Università di Firenze, via S. Marta 3, 50139 Firenze, Italy
ABSTRACT
Shape representations for classification or retrieval purposes have been extensively investigated, but only a few methods have tried to represent shapes as extended entities without reducing them to their boundary profiles or to synthetic geometric descriptors. In this paper, an original solution for shape representation is proposed which relies on a modelling technique originally developed to express directional spatial relationships between extended spatial entities. The representation selects a discrete set of points with respect to which a relationship matrix is computed, accounting for the spatial distribution of shape pixels. This is accomplished at different levels of resolution by a tree-based structural representation. Properties of the representation and a measure of shape similarity are discussed. Efficiency and effectiveness of the proposed solution have also been assessed in the context of content based retrieval applications, through an experimental evaluation on a shape collection.
1. INTRODUCTION
Shape is one of the visual features that has attracted most of the attention of researchers. In particular, in retrieval by visual similarity, the shape of image objects provides a powerful visual cue for similarity matching [6]. However, despite the fact that some forms of standardization for shape descriptors have recently been proposed, such as those defined in the MPEG-7 standard [2], research is still far from producing consolidated results. This is mainly due to the complex nature of the shape perception process, which hinders the identification of descriptors applicable in general contexts and still makes the definition of effective and efficient shape representations and matching techniques a challenging research problem.
In general, 2-D shape descriptions can be divided into two broad categories: contour-based and region-based. Starting from the observation that related, but not identical, shapes can be transformed one into the other by deforming their boundary profile, several approaches have addressed the shape description and matching problem by regarding shapes as 1-D contours. Most of these approaches assume that object boundaries can be modelled through a continuous function [4]. In this way, the visual information associated with the spatial distribution of the pixels enclosed by the object contour, which is two-dimensional by nature, is reduced to the one-dimensional information of the object boundary. In so doing, relevant visual properties of objects can be lost: for example, with these representations, the external boundary is constrained to be a continuous function, regardless of whether the interior of the object has holes or a non-uniform density. Similarly,
0-7803-8603-5/04/$20.00 ©2004 IEEE.
non-connected objects cannot be conveniently represented. Despite these limitations, several contour-based approaches have been proposed and employed for classification and retrieval purposes, following either global [4] or local solutions [5]. Region-based approaches use the entire shape region to extract a meaningful description, which is more suitable when objects have similar spatial distributions of pixels. In this class of methods, descriptors take into account the actual spatial occupancy of pixels, usually by using geometric properties of the shape [3]. However, in order to reduce the amount of data required by the representation, synthetic descriptors are commonly employed. These yield a gain in the efficiency of the representation, but at the cost of a loss of resolution. Moreover, region-based methods are intrinsically global, in the sense that they condense in the descriptors the overall information associated with the spatial distribution of pixels, without capturing the visual information which depends on individual parts of a shape. As a consequence, partial match between shapes is usually not supported. Indeed, little attention has been paid to the development of unified shape descriptors capable of combining into one single representation the main properties of both contour- and region-based approaches. Such descriptors should include spatial information accounting for both the salience of the shape contour and the extension of the region, including its spatial distribution and density, independently of its connectivity. This should be obtained through a representation with a sufficiently compact format, possibly at different levels of resolution. In this paper, we develop a modelling technique specifically tailored to represent and compare the shape of extended spatial entities in the application context of retrieval by visual similarity.
The proposed approach builds on the weighted walkthroughs (WW) operator [1], which was originally proposed to express directional spatial relationships between extended spatial entities. In particular, properties of weighted walkthroughs enable quantitative representation and comparison of the shapes of extended pixel sets, by taking into account the overall distribution of their individual pixels. Properties of the model are expounded to motivate and assess a quantitative metric of similarity. The rest of the paper is organized in four sections. In Sect.2, the WW model for expressing the spatial relationships between extended spatial entities is generalized to the case of relationships between points and extended spatial entities. Building on this model, the derivation of a shape representation is expounded in Sect.3. A similarity measure for shape comparison is introduced in Sect.4. Finally, experimental results on a shape benchmark database are reported in Sect.5.
2. RELATIONSHIPS BETWEEN POINTS AND EXTENDED SPATIAL ENTITIES
In a Cartesian reference system, a point $p$ with coordinates $(x_p, y_p)$ partitions the plane into four quadrants (upper-left, upper-right, lower-left and lower-right), which can be encoded by an index pair $\langle i, j \rangle$, with $i$ and $j$ taking values $\pm 1$. In this perspective, the directional relationship between the point $p$ and an extended set $A$ can be represented by the number of pixels of $A$ that are located in each of the four quadrants, as shown in Fig.1(a). This results in four weights $w_{i,j}(p,A)$ that can be computed with an integral measure on the set of pixels of $A$:
$$w_{i,j}(p,A) = \frac{1}{|A|} \int_{A} C_i(x - x_p)\, C_j(y - y_p)\, dx\, dy \qquad (1)$$
where $C_{-1}(\cdot)$ and $C_{1}(\cdot)$ denote the characteristic functions of the negative and positive real semi-axes $(-\infty, 0)$ and $(0, +\infty)$, respectively; $|A|$ denotes the area of entity $A$ and acts as a normalization factor. The four coefficients $w_{i,j}(p,A)$ resulting from the application of Eq.(1) are then organized in a $2 \times 2$ matrix:
$$w(p,A) = \begin{bmatrix} w_{-1,1} & w_{1,1} \\ w_{-1,-1} & w_{1,-1} \end{bmatrix} \qquad (2)$$
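As an illustration of Eqs.(1) and (2) in the discrete case, the following minimal sketch counts the pixels falling in each quadrant and normalizes by the pixel count. It assumes that the entity is given as a list of pixel-centre coordinates in a Cartesian system with the y axis pointing upwards; the function name `ww_matrix` and the convention that pixels lying exactly on a partitioning line are assigned to the right or upper side are assumptions made here for illustration, not part of the model.

```python
import numpy as np

def ww_matrix(pixels, p):
    """Discrete counterpart of Eq.(1)-(2) for a point p = (xp, yp).

    pixels: sequence of (x, y) pixel-centre coordinates of the entity.
    Returns the 2x2 matrix [[w_-1,1, w_1,1], [w_-1,-1, w_1,-1]].
    Assumption: ties (x == xp or y == yp) go to the right / upper side,
    so that the four weights always sum to 1.
    """
    pixels = np.asarray(pixels, dtype=float)
    right = pixels[:, 0] >= p[0]   # i = +1
    upper = pixels[:, 1] >= p[1]   # j = +1
    counts = np.array([
        [np.sum(~right & upper), np.sum(right & upper)],
        [np.sum(~right & ~upper), np.sum(right & ~upper)],
    ], dtype=float)
    return counts / len(pixels)    # |A| acts as normalization factor

# Toy entity: a 4x4 block of unit pixels with centres at 0.5, 1.5, 2.5, 3.5.
xs, ys = np.meshgrid(np.arange(4) + 0.5, np.arange(4) + 0.5)
square = np.column_stack([xs.ravel(), ys.ravel()])

w_center = ww_matrix(square, (2.0, 2.0))    # point in the center of mass
w_outside = ww_matrix(square, (5.0, 5.0))   # point outside, upper-right of the block
```

With the point in the center of the block the four weights are equal, while for the external point the whole mass falls in the lower-left quadrant.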
The values $w_{i,j}$ are dimensionless non-negative numbers subject to two bounds, which descend directly from the geometric constraints on the set of pixels of entity $A$: each weight takes values in the interval $[0, 1]$, and the sum of the four weights equals 1.
3. SHAPE PARTITIONING
The model of Eq.(1) can be applied to represent the relationship between a generic point $p$ of the plane and a spatial entity $A$. Each point $p$, independently of its location with respect to entity $A$, induces a different partitioning: the relationship between $p$ and $A$ can thus be represented by specific values of the coefficients of Eq.(2). In this way, the relationship matrix can be used to express the basic shape information related to the spatial distribution of an entity. This situation is represented in Fig.1(b), which shows the relationship matrices for two different points $p_1$ and $p_2$, located in the center of mass of the spatial entity and outside of its boundary, respectively.
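[Fig. 1. Panel (a): point $p = (x_p, y_p)$ and entity $A$, with the quadrant weights $w_{-1,1}$, $w_{1,1}$, $w_{-1,-1}$, $w_{1,-1}$ (example values 0.1, 0.35, 0.15, 0.4). Panel (b): points $p_1$ and $p_2$, with $w(p_1,A) = \begin{bmatrix} 0.3 & 0.25 \\ 0.2 & 0.25 \end{bmatrix}$ and $w(p_2,A) = \begin{bmatrix} 0.05 & 0.0 \\ 0.85 & 0.1 \end{bmatrix}$.]
3.1. Incremental Representation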
As depicted in Fig.1(b), a matrix computed on an individual point does not provide a sufficiently informative description of the shape. However, by introducing additional points and regarding them as parts of a unique descriptor, it is possible to substantially increase the effectiveness of the representation. According to this basic observation, an incremental representation is derived based on the relationships between a set of reference points and a spatial entity. The resulting shape description is able to capture salient shape information of a spatial entity related both to its boundary and to the two-dimensional spatial distribution of its pixels. The representation is constructed on multiple levels of resolution through an incremental approach. At level 0 (root level), an entity is partitioned into four quadrants with respect to a single point $p_0^0$. As a first instance, this can be selected coincident with the entity center of mass. Subsequent levels $k$ can be constructed by locating additional points $p_i^k$. In principle, there is no predefined policy for point selection. However, we found that selecting points on a straight line passing through the shape center of mass has some practical advantages. Accordingly, in the following discussion we assume that points lie on a sampling line passing through $p_0^0$. Independently of the particular sampling line, points are inserted in the representation according to a binary generation scheme. In this way, each additional level $k$ introduces $2^k$ new points $\{p_i^k \mid i = 0, \ldots, 2^k - 1\}$, as shown in Fig.2. In so doing, a point $p_i^k$ on the sampling line at level $k$ partitions into two parts of equal length the segment comprised between two adjacent points available at the previous level $k-1$. The specific criterion used for point selection on the sampling line can affect the properties and effectiveness of the resulting representation.
However, considerations related to the form of the descriptor and to its measure of similarity do not change. Hence, without loss of generality, in the following we focus on a specific policy. In particular, we used a normalization criterion which considers, for each entity, a circle of area equivalent to that of the entity itself. In practice, the center of the circle is made coincident with the center of mass of the entity, and its radius is set so that the circle area equals that of the entity. In this way, points at different levels of the representation are selected on the circle diameter with slope of $45^\circ$, by recursively partitioning the intervals induced by points at the previous level. According to this, $p_0^0$ coincides with the center of mass of the entity; $p_0^1$ and $p_1^1$ divide the two semi-diameters of the circle into two equal parts; at level two, four new points $\{p_i^2 \mid i = 0, \ldots, 3\}$ are located in the middle of each of the four sub-segments originated at level 1. Following this recursive approach, a sequence of partitioning points of arbitrary length can be generated.
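The selection policy just described can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `sampling_points` is hypothetical, the slope of the diameter is exposed as a parameter, and the closed form used for the position of point $i$ at level $k$ (the midpoint of the $i$-th of $2^k$ equal sub-segments of the diameter) is our reading of the recursive bisection, not a formula given in the text.

```python
import math

def sampling_points(area, center, levels, slope_deg=45.0):
    """Return one list of (x, y) points per level; level k holds 2**k points.

    area:   area of the entity (the circle is given the same area)
    center: center of mass of the entity, used as circle center
    """
    radius = math.sqrt(area / math.pi)          # circle of equivalent area
    ux = math.cos(math.radians(slope_deg))      # unit vector along the diameter
    uy = math.sin(math.radians(slope_deg))
    cx, cy = center
    points = []
    for k in range(levels):
        level = []
        for i in range(2 ** k):
            # Signed offset along the diameter, in [-radius, radius]:
            # midpoint of the i-th of 2**k equal sub-segments.
            t = radius * ((2 * i + 1) / 2 ** k - 1.0)
            level.append((cx + t * ux, cy + t * uy))
        points.append(level)
    return points

# Unit-radius circle centered at the origin, three levels (0, 1 and 2).
pts = sampling_points(area=math.pi, center=(0.0, 0.0), levels=3)
```

For the unit circle, level 0 yields the center, level 1 yields the two points at distance 0.5 from it, and level 2 yields four points at distances 0.75 and 0.25 on either side.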
Fig. 1. (a) Point $p$ partitions the plane and the entity $A$ into four parts. The relationship matrix accounts for the number of pixels of $A$ comprised in the quadrants. (b) Two partitioning points for entity $A$. The $2 \times 2$ matrix $w(p_i, A)$, associated with point $p_i$, represents the fraction of pixels of $A$ included in each of the quadrants induced by the point $p_i$ itself.
The incremental representation, which inserts $2^k$ points at each new level $k$, naturally fits the structure of a binary tree. In fact, a correspondence can be established between points $\{p_i^k \mid i = 0, \ldots, 2^k - 1\}$ at level $k$ of the representation and nodes at level $k$ of a binary tree. In turn, each point is associated with the $2 \times 2$ matrix of weighting coefficients $w(p_i^k, A)$, which indicates the fraction of entity pixels located in the four quadrants of the partitioning induced by the point itself. This equivalence is shown in Fig.2. In practice, to ease tree manipulation and comparison, the binary tree can be cast to a vectorial representation by observing that its particular structure is that of a heap. By construction, a direct mapping exists between a heap and a vector. This vector univocally associates the two children of any internal node with vector elements at predefined positions (and, conversely, vector elements with tree nodes). Denoting by $n$ the index of a vector element, with the root stored at $n = 0$, the mapping rules between tree nodes and indexes of vector elements are as follows:
$$\mathrm{root} = 0, \qquad \mathrm{left}[n] = 2n + 1, \qquad \mathrm{right}[n] = 2n + 2, \qquad \mathrm{parent}[n] = \lfloor (n - 1)/2 \rfloor$$
According to this, the left and right child of node $n$ are directly accessible as vector elements of index $2n+1$ and $2n+2$.
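The heap-to-vector mapping can be sketched in a few lines. The helper names (`left`, `right`, `parent`, `node_index`, `first_levels`) are hypothetical and introduced only for illustration; the index arithmetic is the standard zero-based array layout of a complete binary tree, which we assume is the one intended here.

```python
def left(n):
    """Vector index of the left child of node n."""
    return 2 * n + 1

def right(n):
    """Vector index of the right child of node n."""
    return 2 * n + 2

def parent(n):
    """Vector index of the parent of node n (the root has none)."""
    if n == 0:
        raise ValueError("the root has no parent")
    return (n - 1) // 2

def node_index(level, i):
    """Vector index of the i-th node (i = 0 .. 2**level - 1) of a level."""
    return 2 ** level - 1 + i

def first_levels(vector, levels):
    """Keep only the components of the first `levels` levels of the tree."""
    return vector[: 2 ** levels - 1]

# Three levels stored level by level: 1 + 2 + 4 = 7 entries.
labels = [f"w(p_{i}^{k})" for k in range(3) for i in range(2 ** k)]
children_of_first_level1_node = (labels[left(1)], labels[right(1)])
```

Because levels are stored contiguously, truncating the vector with `first_levels` is all that is needed to compare two shapes at a coarser resolution.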
Fig. 2. The binary tree of coefficients $w(p_i^k, A)$ associated with the entity $A$. Three levels of the representation (level 0, level 1 and level 2) are shown.
This vectorial representation allows efficient comparison between entity representations. In fact, representations for different entities can be compared at different levels of detail by limiting the evaluation of their distance function to the first levels of the hierarchy (i.e., only to the first vector components). The representation can handle non-connected shapes and is invariant to shape translation and scaling, but not to shape rotation. However, rotation invariance can be easily obtained by referring the space partitioning to a reference system local to the shape (the main axis of inertia of the shape is a possible choice). Finally, it is worth noting that a shape representation with a finite number of points is not univocal. This means that different shapes can exist which share the same representation, at least for the first levels of the tree. However, given an $\epsilon > 0$, it can be proved that it is possible to extend the number of levels up to a limit which guarantees that shapes can have the same representation only if they differ by less than $\epsilon$.
4. SHAPE SIMILARITY MEASURE
The distance between the shape representations of two entities $A$ and $B$, both extended to $N$ levels, is evaluated as:
$$\mathcal{D}(A,B) = \frac{1}{2^N - 1} \sum_{k=0}^{N-1} \sum_{i=0}^{2^k - 1} D\big(w(p_i^k, A),\, w(p_i^k, B)\big) \qquad (3)$$
where $D(w(p_i^k, A), w(p_i^k, B))$ is the distance computed on corresponding points of the two shapes; $w(p_i^k, A)$ and $w(p_i^k, B)$ are the $2 \times 2$ matrices of shape coefficients computed on homologous points of the representations for entities $A$ and $B$, respectively. In this way, $\mathcal{D}(A,B)$ sums up the distances $D(w(p_i^k, A), w(p_i^k, B))$ computed on every node $\{p_i^k \mid i = 0, \ldots, 2^k - 1\}$ comprised in homologous levels $k$ of the hierarchy; the sum is then normalized with respect to the overall number of points of the representation (i.e., nodes of the hierarchy).
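The normalized sum of Eq.(3) can be sketched as follows, assuming that each shape is stored as a vector of per-node coefficient tuples in heap order, so that the first $2^N - 1$ entries cover the first $N$ levels. The function name `shape_distance` is hypothetical, and the per-node distance is passed in as a function; the `half_l1` distance used in the example is only a placeholder chosen for illustration and is not the node distance adopted in this paper.

```python
def shape_distance(tree_a, tree_b, levels, node_distance):
    """Average node distance over the first `levels` levels (Eq.(3)).

    tree_a, tree_b: per-node coefficients of the two shapes, in heap order.
    node_distance:  function returning the distance between two node entries.
    """
    n_nodes = 2 ** levels - 1                 # nodes in levels 0 .. levels-1
    if len(tree_a) < n_nodes or len(tree_b) < n_nodes:
        raise ValueError("representation has fewer levels than requested")
    total = sum(node_distance(wa, wb)
                for wa, wb in zip(tree_a[:n_nodes], tree_b[:n_nodes]))
    return total / n_nodes

def half_l1(wa, wb):
    """Placeholder node distance (assumption): half the L1 difference."""
    return 0.5 * sum(abs(a - b) for a, b in zip(wa, wb))

# Two toy representations with two levels (3 nodes), matrices flattened.
uniform = (0.25, 0.25, 0.25, 0.25)
tree_a = [uniform, uniform, uniform]
tree_b = [uniform, (0.5, 0.5, 0.0, 0.0), (0.0, 0.0, 0.5, 0.5)]

d_level_1 = shape_distance(tree_a, tree_b, 1, half_l1)   # root only
d_level_2 = shape_distance(tree_a, tree_b, 2, half_l1)   # root and level 1
```

In this toy case the two shapes are indistinguishable at the root and differ only once the second level is included, which is the behaviour that comparison at different levels of detail is meant to expose.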
Individual distance components $D(w(p_i^k, A), w(p_i^k, B))$ are then derived in the following way (for simplicity of notation, we indicate $w(p_i^k, A)$ as $w$ and $w(p_i^k, B)$ as $w'$):
$$D(w, w') = \lambda_H\, d_H(w, w') + \lambda_V\, d_V(w, w') + \lambda_D\, d_D(w, w') \qquad (4)$$
where $\lambda_H$, $\lambda_V$ and $\lambda_D$ are non-negative numbers with sum equal to 1, and $d_H$, $d_V$, $d_D$ are the following distance components. $d_H = |(w_{1,1} + w_{1,-1}) - (w'_{1,1} + w'_{1,-1})|$ evaluates the difference in the horizontal displacement in the two spatial arrangements of pixels captured by $w$ and $w'$. In fact, the sum $(w_{1,1} + w_{1,-1})$ evaluates the fraction of pixels of $A$ that are on the right-hand side of a generic observation point $p_i^k$. This equals 1 iff all the pixels of $A$ are on the right of $p_i^k$, it equals 0 iff all the pixels of $A$ are on the left of $p_i^k$, and it ranges with continuity between 1 and 0 in the intermediate cases. $d_V = |(w_{-1,1} + w_{1,1}) - (w'_{-1,1} + w'_{1,1})|$, similarly to $d_H$ for the horizontal alignment, evaluates the difference in the vertical displacement in the two spatial arrangements of pixels captured by $w$ and $w'$. $d_D = |(w_{-1,-1} + w_{1,1}) - (w'_{-1,-1} + w'_{1,1})|$ evaluates the difference in the alignment along the diagonal of the Cartesian reference system in the two spatial arrangements of pixels captured by $w$ and $w'$. In fact, the sum $(w_{-1,-1} + w_{1,1})$ evaluates the fraction of pixels of $A$ that are aligned along the diagonal with respect to the observation point $p_i^k$ of the shape representation.
5. EXPERIMENTAL EVALUATION
Experiments have been conducted on a benchmark database with 1098 shapes of marine animals, derived from the MPEG-7 evaluation set. Results are provided for both the computational efficiency and the perceptual effectiveness of the shape representation, by considering tasks of retrieval by visual similarity. For the computational efficiency, we considered the ability of the representation to condense the main part of the shape information in the first levels of the resulting tree. This allows the use of fewer coefficients during shape comparison, thus enabling a more efficient similarity computation. To this end, we used every shape in the database as a query against all the others, at different levels of the shape representation.
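Since every comparison in these experiments rests on the node distance of Eq.(4), a small numeric sketch of that distance may be useful. The function name `node_distance` and the choice of equal weights $\lambda_H = \lambda_V = \lambda_D = 1/3$ are assumptions made for illustration (the text only requires non-negative weights summing to 1). The two matrices of Fig.1(b) are reused purely as numeric inputs.

```python
def node_distance(w, v, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Node distance of Eq.(4) between two 2x2 matrices laid out as in Eq.(2).

    w, v: ((upper_left, upper_right), (lower_left, lower_right)),
          i.e. ((w_-1,1, w_1,1), (w_-1,-1, w_1,-1)).
    weights: (lambda_H, lambda_V, lambda_D); equal weights are an assumption.
    """
    (w_ul, w_ur), (w_ll, w_lr) = w
    (v_ul, v_ur), (v_ll, v_lr) = v
    d_h = abs((w_ur + w_lr) - (v_ur + v_lr))   # mass on the right-hand side
    d_v = abs((w_ul + w_ur) - (v_ul + v_ur))   # mass in the upper half
    d_d = abs((w_ll + w_ur) - (v_ll + v_ur))   # mass along the diagonal
    lam_h, lam_v, lam_d = weights
    return lam_h * d_h + lam_v * d_v + lam_d * d_d

# Matrices reported in Fig.1(b) for points p1 and p2.
w_p1 = ((0.3, 0.25), (0.2, 0.25))
w_p2 = ((0.05, 0.0), (0.85, 0.1))

d_equal = node_distance(w_p1, w_p2)                          # equal weights
d_horizontal = node_distance(w_p1, w_p2, weights=(1, 0, 0))  # d_H only
```

Here the three components are 0.4, 0.5 and 0.4, so the equally weighted distance is their mean, about 0.433.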
The result list obtained by resolving the query with the representation at the maximum level is taken as the reference (target) list. We assume that the best result corresponds to the case in which the shapes in a result list, obtained using an intermediate level $k$ of the representation, have the same rank that they have in the reference list, that is: $rank_{ref}(j) = rank_k(\sigma(j))$, $j = 1, \ldots, n$, where $n$ is the size of both lists and $\sigma : \{1, \ldots, n\} \rightarrow \{1, \ldots, n\}$ is the mapping function which associates the index $j$ of a generic shape in the reference list with the index $\sigma(j)$ of the same shape in the other list. Thus, a measure of the rank error for level $k$ can be expressed through the following function:
$$E_k(j) = \frac{1}{N_q} \sum_{q=1}^{N_q} \big| rank_{ref}(j) - rank_k(\sigma(j)) \big|, \qquad j = 1, \ldots, n$$
that is, the average value of the rank disparity computed on $N_q$ queries. The lower the value of $E_k(j)$, the closer the ranking obtained at level $k$ is to that of the reference level. Results of the comparison between levels 3, 4 and 5 against level 6, averaged on the entire set of database shapes, are shown
Fig. 3. Three cluster prototypes, (a), (b) and (c), and their corresponding values of recall for levels 4, 5 and 6 of the representation. The value on the vertical axis (cumulated similarity) sums up the similarity of the shapes included in the retrieval set, whose dimension (rank) is shown on the horizontal axis. The size of the retrieval sets is the minimum needed to comprise every image of the corresponding cluster (5, 24 and 18 are the sizes for clusters represented by the shapes in (a), (b) and (c), respectively).
in Fig.4(a) for the 20 best-ranked shapes in the retrieval set. The horizontal axis represents the shape ranking according to the reference level, while the vertical axis reports, for each shape, the corresponding rank error in the retrieval list computed using a lower number of representation levels. It can be seen that the error is lower for the first shapes in the ranking, which are the most relevant to the query. Plots in Fig.4(b) represent the recall curves for levels 3, 4 and 5 of the representation (note that, for the reference level 6, the recall is always one). In this figure, values of recall are computed on result sets of different size, which is indicated on the horizontal axis. It can be observed that even the representation with only three levels has a recall quite close to that of the representation with the maximum accuracy level.
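The rank error used for Fig.4(a) can be sketched as follows. This is a minimal illustration under stated assumptions: each result list is represented as a list of shape identifiers ordered by decreasing similarity, both lists of a query contain the same shapes, the disparity is taken in absolute value, and the function name `rank_errors` is hypothetical.

```python
def rank_errors(reference_lists, level_lists):
    """Average rank disparity per reference rank j, over all queries.

    reference_lists[q]: result list of query q at the reference level.
    level_lists[q]:     result list of the same query at a lower level.
    Returns a list whose j-th entry is the mean of |j - sigma(j)| over queries,
    where sigma(j) is the position in the lower-level list of the shape
    ranked j in the reference list (positions are zero-based here).
    """
    n = len(reference_lists[0])
    totals = [0.0] * n
    for ref, other in zip(reference_lists, level_lists):
        position = {shape: rank for rank, shape in enumerate(other)}
        for j, shape in enumerate(ref):
            totals[j] += abs(j - position[shape])
    return [t / len(reference_lists) for t in totals]

# Two toy queries: the first swaps the second and third result, the second agrees.
reference = [["a", "b", "c"], ["x", "y", "z"]]
lower_level = [["a", "c", "b"], ["x", "y", "z"]]
errors = rank_errors(reference, lower_level)
```

Only the disparity between positions matters, so using zero-based or one-based ranks gives the same error values.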
Fig. 4. (a) Average ranking error (vertical axis) against the reference rank (horizontal axis), obtained by comparing the rank of the reference list (computed using 6 levels of the representation) with those computed using 3, 4 or 5 levels. (b) Recall curves for levels 3, 4 and 5.
To evaluate the capability of the shape representation and similarity measure to compare objects based on their prototypical shape, we made a preliminary visual clustering of the images, resulting in a set of 87 classes. For each of these classes, we chose as prototype the object whose total distance to the other members of its class was minimum. Then, we computed the similarity between each class prototype and each of the remaining shapes in the database. Fig.3 summarizes the results of the evaluation for three sample class prototypes. Reference shapes used as queries are reported on the left, while the plots on the right report the curves of recall and precision obtained by resolving the query at different levels, according to the measure of similarity of Eq.(4). The horizontal axis is the dimension of the retrieval set and the vertical axis is the sum of the values of similarity for the shapes that are included in the retrieval set. In doing so, the retrieval set is extended to the size necessary to encompass every shape in the class of the prototype used as query. In this representation, a perfect classification would result in a strictly increasing curve, which is obtained when all the shapes comprised in the query class are added to the retrieval set before any other shape in the database. In fact, a zero similarity value is assumed between a class prototype and shapes belonging to any other class. Any misclassification is highlighted by a zero-slope segment of the curve, which derives from the 'anticipated' retrieval of a shape of a different class. For the class of Fig.3(a), four of its five shapes are ranked before any other database shape (independently of the representation level used in shape comparison), while the retrieval set has to be extended up to rank 11 to encompass the fifth shape of the class. Similar considerations apply to the cases represented in Fig.3(b)-(c). In (b), 20 out of 24 class shapes are correctly ranked in the first 20 positions of the retrieval list; then three shapes of other classes are retrieved before the remaining four shapes of class (b). It is interesting to note that, in this case, levels 4, 5 and 6 exhibit the same behavior. In (c), the three levels perform in a slightly different way, with level six giving better results. Finally, we considered a shape classification task in which the similarity between each shape in the database and each of the class prototypes is computed. A shape is correctly classified if its distance to the prototype of the class to which it belongs is lower than its distance to any other cluster prototype. The average classification ratio computed by using level six of the representation was equal to 93%.
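The prototype selection and the nearest-prototype classification described above can be sketched as follows. The sketch is generic in the distance function; the function names are hypothetical, and the toy example uses plain numbers with an absolute difference as stand-ins for shapes and for the shape distance, which is an assumption made only to keep the example self-contained.

```python
def class_prototype(members, distance):
    """Member whose total distance to the other members of its class is minimum."""
    return min(members, key=lambda m: sum(distance(m, o) for o in members))

def classify(shape, prototypes, distance):
    """Label of the prototype closest to `shape`."""
    return min(prototypes, key=lambda label: distance(shape, prototypes[label]))

def classification_ratio(classes, distance):
    """Fraction of shapes that are closest to the prototype of their own class."""
    prototypes = {label: class_prototype(members, distance)
                  for label, members in classes.items()}
    total = sum(len(members) for members in classes.values())
    correct = sum(classify(shape, prototypes, distance) == label
                  for label, members in classes.items() for shape in members)
    return correct / total

# Toy stand-in data: numbers instead of shapes, absolute difference as distance.
toy_classes = {"a": [1.0, 1.2, 2.0], "b": [10.0, 11.0, 10.4]}
toy_distance = lambda s, t: abs(s - t)

proto_a = class_prototype(toy_classes["a"], toy_distance)
proto_b = class_prototype(toy_classes["b"], toy_distance)
ratio = classification_ratio(toy_classes, toy_distance)
```

In an actual evaluation, the entries of each class would be shape representations and `toy_distance` would be replaced by the shape distance computed at the chosen level.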
6. REFERENCES
[1] S. Berretti, A. Del Bimbo and E. Vicario, "Weighted Walkthroughs between Extended Entities for Retrieval by Spatial Arrangement," IEEE Trans. on Multimedia, vol. 5, no. 1, pp. 52–70, March 2003.
[2] M. Bober, "MPEG-7 Visual Shape Descriptors," IEEE Trans. on Circuits and Systems for Video Technology, vol. 11, no. 6, pp. 716–719, June 2001.
[3] S. Loncaric, "A Survey of Shape Analysis Techniques," Pattern Recognition, vol. 31, no. 8, pp. 983–1001, August 1998.
[4] F. Mokhtarian, "Silhouette-Based Isolated Object Recognition through Curvature Scale Space," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 17, no. 5, pp. 539–544, May 1995.
[5] E. Petrakis, A. Diplaros and E. Milios, "Matching and Retrieval of Distorted and Occluded Shapes Using Dynamic Programming," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 24, no. 11, pp. 1501–1516, November 2002.
[6] A.W. Smeulders, M. Worring, S. Santini, A. Gupta and R. Jain, "Content-Based Image Retrieval at the End of the Early Years," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 22, no. 12, pp. 1349–1380, December 2000.