USING CONTOUR INFORMATION AND SEGMENTATION FOR OBJECT REGISTRATION, MODELING AND RETRIEVAL

by Tomasz Adamek, M.Sc.

A thesis submitted in partial fulfilment of the requirements for the degree of Ph.D in Electronic Engineering

Supervisor: Dr. Noel E. O'Connor

School of Electronic Engineering, Dublin City University, June 2006


DECLARATION

I hereby certify that this material, which I now submit for assessment on the programme of study leading to the award of Ph.D in Electronic Engineering, is entirely my own work and has not been taken from the work of others save and to the extent that such work has been cited and acknowledged within the text of my work.

Signed: ______ Tomasz Adamek

ACKNOWLEDGEMENTS

I wish to extend my heartfelt gratitude to the many people who have enabled and supported the work behind this thesis. Special thanks go to my supervisor, Dr. Noel E. O'Connor, for his valuable guidance and expertise as well as for his kind advice, encouragement, and constant support. Similar thanks are due to Dr. Noel Murphy and Professor Alan Smeaton for their individual advice, support and suggestions at different stages of the work.

I would like to thank all the colleagues at the Centre for Digital Video Processing for the pleasant work atmosphere. I'm indebted to many members of the group for the interesting and fruitful discussions, especially fellow PhD students Orla Duffner, Sorin Sav, and Ciarán Ó Conaire, and also Dr. Hervé Le Borgne.

I would like to thank Christan Ferran Bennström and Dr. Josep R. Casas from the Universitat Politècnica de Catalunya for the interesting discussions which motivated many experiments presented in this thesis. Christan Ferran Bennström devoted a significant amount of his time to the creation of the image collection used in this thesis.

Different parts of the research work published in this thesis were financially supported by the Informatics Research Initiative of Enterprise Ireland, by Science Foundation Ireland under grant 03/IN.3/I361, and by the European Commission under contract FP6-001765 aceMedia.

Dr. Miroslaw Bober of the Visual Information Laboratory of Mitsubishi Electric Information Technology Center Europe is acknowledged for generating and providing the CSS results. The Centre for Vision, Speech, and Signal Processing of the University of Surrey is acknowledged for making the database of shapes of marine creatures available, and the Center for Intelligent Information Retrieval from the University of Massachusetts is acknowledged for providing the annotated collection of George Washington's manuscripts.
On a personal note, special thanks go to all my friends, especially to the Polish community at Dublin City University. I wish to extend my gratitude to my parents Henryka and Julian (for life and upbringing) and my aunt Alicja and uncle Boleslaw (for always warmly welcoming me under their roof). Finally, I would like to thank Ms. Cristina Blanco (for her constant support and encouragement, and for reminding me that there are more important things in this life than being called "Doctor"; also for providing photos of herself, which significantly improve the aesthetic value of this thesis).

ABSTRACT

USING CONTOUR INFORMATION AND SEGMENTATION FOR OBJECT REGISTRATION, MODELING AND RETRIEVAL

by Tomasz Adamek, M.Sc.

This thesis considers different aspects of the utilization of contour information and syntactic and semantic image segmentation for object registration, modeling and retrieval in the context of content-based indexing and retrieval in large collections of images. Target applications include retrieval in collections of closed silhouettes, holistic word recognition in handwritten historical manuscripts and shape registration. Also, the thesis explores the feasibility of contour-based syntactic features for improving the correspondence of the output of bottom-up segmentation to semantic objects present in the scene and discusses the feasibility of different strategies for image analysis utilizing contour information, e.g. segmentation driven by visual features versus segmentation driven by shape models or semi-automatic segmentation in selected application scenarios.

There are three contributions in this thesis. The first contribution considers structure analysis based on the shape and spatial configuration of image regions (so-called syntactic visual features) and their utilization for automatic image segmentation. The second contribution is the study of novel shape features, matching algorithms and similarity measures. Various applications of the proposed solutions are presented throughout the thesis, providing the basis for the third contribution, which is a discussion of the feasibility of different recognition strategies utilizing contour information. In each case, the performance and generality of the proposed approach has been analyzed based on extensive rigorous experimentation using as large as possible test collections.

LIST OF PUBLICATIONS

T. Adamek, N. E. O'Connor, and A. F. Smeaton, "Word matching using single closed contours for indexing handwritten historical documents," accepted for publication in Special Issue on Analysis of Historical Documents, Int'l Journal on Document Analysis and Recognition, 2006.

T. Adamek and N. E. O'Connor, "Interactive object contour extraction for shape modeling," in Proc. 1st Int'l Workshop on Shapes and Semantics, Tokyo, June 2006.

N. E. O'Connor, E. Cooke, H. Le Borgne, M. Blighe, and T. Adamek, "The AceToolbox: low-level audiovisual feature extraction for retrieval and classification," in Proc. 2nd IEE European Workshop on the Integration of Knowledge, Semantic and Digital Media Technologies, London, U.K., Dec. 2005.

A. F. Smeaton, H. Le Borgne, N. E. O'Connor, T. Adamek, O. Smyth, and S. De Burca, "Coherent segmentation of video into syntactic regions," in Proc. 9th Irish Machine Vision and Image Processing Conf. (IMVIP'05), Belfast, Northern Ireland, Aug. 2005.

T. Adamek, N. E. O'Connor, G. Jones, and N. Murphy, "An integrated approach for object shape registration and modeling," in Proc. 28th Annual Int'l ACM SIGIR Conference (SIGIR'05), Workshop on Multimedia Information Retrieval, Salvador, Brazil, Aug. 2005.

T. Adamek, N. E. O'Connor, and N. Murphy, "Multi-scale representation and optimal matching of non-rigid shapes," in Proc. 4th Int'l Workshop on Content-Based Multimedia Indexing (CBMI'05), Riga, Latvia, June 2005.

T. Adamek, N. E. O'Connor, and N. Murphy, "Region-based segmentation of images using syntactic visual features," in Proc. 6th Int'l Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS'05), Montreux, Switzerland, Apr. 2005.

T. Adamek and N. E. O'Connor, "A multi-scale representation method for non-rigid shapes with a single closed contour," IEEE Trans. Circuits Syst. Video Technol., special issue on Audio and Video Analysis for Multimedia Interactive Services, no. 5, May 2004.

T. Adamek and N. E. O'Connor, "Efficient contour-based shape representation and matching," in Proc. 5th ACM SIGMM Int'l Workshop on Multimedia Information Retrieval (MIR'03), Berkeley, CA, Nov. 2003.

N. E. O'Connor, S. Sav, T. Adamek, V. Mezaris, I. Kompatsiaris, T. Y. Lui, E. Izquierdo, C. F. Bennstrom, and J. R. Casas, "Region and object segmentation algorithms in the QIMERA segmentation platform," in Proc. Int'l Workshop on Content-Based Multimedia Indexing (CBMI'03), Rennes, France, Sep. 2003.

N. E. O'Connor, T. Adamek, S. Sav, N. Murphy, and S. Marlow, "QIMERA: A software platform for video object segmentation and tracking," in Proc. 4th Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS'03), London, U.K., Apr. 2003.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

1 INTRODUCTION
  1.1 The New "Digital Age"
  1.2 Challenges Brought by the Explosion of Cheap Digital Media
  1.3 Retrieval by Content
  1.4 The Importance of Image Segmentation and Shape Analysis in CBIR
    1.4.1 Image Segmentation
    1.4.2 Shape Matching
  1.5 Objectives of this Thesis
  1.6 The Main Contributions
  1.7 Thesis Overview

2 IMAGE SEGMENTATION, APPLICATIONS AND EVALUATION: A REVIEW
  2.1 Introduction
    2.1.1 Image Segmentation
    2.1.2 Taxonomy
    2.1.3 Chapter Structure
  2.2 Relevance to Content-Based Image Retrieval
    2.2.1 CBIR Systems Utilizing Segmentation
    2.2.2 Automatic Annotation of Image Regions
  2.3 Grouping Cues
    2.3.1 Colour
    2.3.2 Texture
    2.3.3 General Geometric Cues
    2.3.4 Application Specific Models
    2.3.5 User Interactions
  2.4 Selected Methods for Bottom-up Automatic Segmentation: A Review
    2.4.1 Clustering
    2.4.2 Mathematical Morphology
    2.4.3 Graph Theoretic Algorithms
  2.5 Selected Methods for Top-down Segmentation: A Review
    2.5.1 Semi-Automatic Segmentation
    2.5.2 Shape Driven Segmentation
  2.6 Evaluation
    2.6.1 Evaluation Strategies
    2.6.2 Relative Evaluation Methods
    2.6.3 Adopted Evaluation Method
    2.6.4 Ground-Truth Creation
  2.7 Discussion
  2.8 Conclusion

3 SEGMENTATION USING SYNTACTIC VISUAL FEATURES
  3.1 Introduction
    3.1.1 Motivation
    3.1.2 Chapter Structure
  3.2 RSST Segmentation Framework
  3.3 Overview of the Proposed Approach
  3.4 Methodology Used for Training and Testing
    3.4.1 Training and Test Corpus
    3.4.2 Development Methodology
  3.5 Colour Homogeneity
    3.5.1 Colour Spaces
    3.5.2 Optimized Colour Homogeneity Criterion
    3.5.3 Extended Colour Representation
  3.6 Syntactic Visual Features
    3.6.1 Adjacency
    3.6.2 Regularity (low complexity)
  3.7 Integration of Evidence from Different Sources
    3.7.1 Belief Theory
    3.7.2 Dempster-Shafer's Theory: Background
    3.7.3 Application of Dempster-Shafer's Theory to the Region Merging Problem
    3.7.4 Designing Belief Structures for each Source of Information
    3.7.5 Evaluation of the New Framework for Combining Multiple Features
  3.8 Stopping Criterion
    3.8.1 Stopping Criterion Based on PSNR
    3.8.2 A New Stopping Criterion
  3.9 Future Work
  3.10 Discussion
  3.11 Semi-Automatic Segmentation
    3.11.1 Motivation
    3.11.2 User Interaction
    3.11.3 Implementation Details
    3.11.4 Results
    3.11.5 Discussion
  3.12 Conclusion

4 SHAPE REPRESENTATION AND MATCHING: A REVIEW
  4.1 Introduction
    4.1.1 Shape Analysis
    4.1.2 2D Shapes Versus a 3D World
    4.1.3 Chapter Structure
  4.2 Important Aspects of Retrieval by Shape
    4.2.1 Computational Shape Analysis
    4.2.2 Shape Representation and Description
    4.2.3 Shape Matching and Similarity Estimation
  4.3 Taxonomy of Shape Representations
    4.3.1 Taxonomy
    4.3.2 Contour-Based Versus Region-Based Techniques
    4.3.3 Transform Domain Versus Spatial Domain Representations
    4.3.4 Shape Modeling
  4.4 The Most Characteristic Methods Utilizing Shape Information Embedded Into a Low Dimensional Vector Space
    4.4.1 Simple Global Geometric Shape Features
    4.4.2 Fourier Transform
    4.4.3 Moments
  4.5 The Most Characteristic Curve Matching Techniques
    4.5.1 Medial Axis Transform
    4.5.2 Dense Matching
    4.5.3 Proximity Matching
    4.5.4 Spread Primitive Matching
    4.5.5 Syntactical Matching
  4.6 Conclusion

5 MULTI-SCALE REPRESENTATION AND MATCHING OF CLOSED CONTOURS
  5.1 Introduction
    5.1.1 Motivation
    5.1.2 Chapter Structure
  5.2 Multi-scale Convexity Concavity Representation
    5.2.1 Definition
    5.2.2 Properties of MCC representation
  5.3 Matching and Dissimilarity Measure
    5.3.1 Optimal Correspondence Between Contour Points
    5.3.2 Distances Between Contour Points
    5.3.3 Dynamic Programming Formulation
    5.3.4 The Complete Matching Algorithm
    5.3.5 Shape Dissimilarity
    5.3.6 Matching Examples
  5.4 Results
    5.4.1 Retrieval of Marine Creatures
    5.4.2 Simulation of MPEG-7 Core Experiment
  5.5 Computational Complexity
  5.6 Alternative Representation and Matching Optimization
    5.6.1 An Optimization Framework to Facilitate Evaluation
    5.6.2 Reduction of Redundancy Between Contour Point Features
    5.6.3 Optimization Methodology
    5.6.4 Optimization Criterion
    5.6.5 Optimization Results
  5.7 Relation to Existing Methods
  5.8 Discussion and Future Work
    5.8.1 Alternative Approaches to Contour Evolution and Convexity/Concavity Measurement
    5.8.2 Different Quantization Schemes for α
    5.8.3 Redundancy Between Contour Points
    5.8.4 Limitations of the Optimization Framework
  5.9 Conclusion

6 CASE STUDY 1: WORD MATCHING IN HANDWRITTEN DOCUMENTS
  6.1 Introduction
    6.1.1 Motivation
    6.1.2 Previous Work
    6.1.3 Chapter Structure
  6.2 Contour Extraction
    6.2.1 Binarization
    6.2.2 Position Estimation
    6.2.3 Connected Component Labelling
    6.2.4 Connecting Disconnected Letters
    6.2.5 Contour Tracing
  6.3 Contour Matching
  6.4 Experiments
    6.4.1 Overall Results
    6.4.2 Towards Fast Recognition
    6.4.3 Comparison with CSS
  6.5 Future Work
  6.6 Conclusion

7 CASE STUDY 2: SHAPE REGISTRATION AND MODELING
  7.1 Introduction
    7.1.1 Motivation
    7.1.2 Previous Work
    7.1.3 Chapter Structure
  7.2 Point Distribution Model
    7.2.1 Pairwise Alignment
    7.2.2 Generalized Procrustes Analysis
    7.2.3 Modeling Shape Variation
  7.3 Semi-Automatic Segmentation
  7.4 Landmark Identification
    7.4.1 Pair-wise Matching
    7.4.2 Correspondence for a Set of Curves
    7.4.3 Modeling Symmetrical Shapes
  7.5 Results
    7.5.1 Cross-sections of Pork Carcasses
    7.5.2 Head & Shoulders
    7.5.3 Synthetic Example
  7.6 Discussion
  7.7 Conclusion

8 GENERAL CONCLUSIONS AND FUTURE WORK
  8.1 General Discussion
  8.2 Thesis Summary & Research Contributions
  8.3 Future Work
    8.3.1 Image Segmentation
    8.3.2 Shape Matching
  8.4 Final Word

A THE HUMAN VISUAL SYSTEM AND OBJECT RECOGNITION
  A.1 Gestalt Principles of Visual Organization Of Perception
  A.2 Structural-Description vs. Image-Based Models
    A.2.1 Reconstruction of the Outside World from the Retinal Image
    A.2.2 Exemplar-Based Multiple-Views Mechanism
    A.2.3 Which Model Explains All Facts of Human Visual Perception?

B IMAGE COLLECTION USED IN CHAPTER 3

BIBLIOGRAPHY

LIST OF TABLES

3.1 Combination of two BBA using the Dempster's combination rule.
3.2 Parameters Tl, Tc, Tr, and α for colour homogeneity measures BBA.
3.3 Discounting factor α_adj.
3.4 Discounting factor α_cpx.
3.5 Results of cross-validation for different combinations of features.
3.6 Results of cross-validation of selected merging criteria used with the PSNR-based stopping criterion.
3.7 Results of cross-validation of selected merging criteria with the proposed stopping criterion.
5.1 Average retrieval rates for different configurations of the contour matching algorithm.
6.1 Effect of pruning of word pairs based on contour complexity.
6.2 Effect of pruning of word pairs based on number of descenders.
6.3 Effect of pruning of word pairs based on number of ascenders.
6.4 WER (excluding OOV words) for the discussed methods.

LIST OF FIGURES

2.1 Main segmentation strategies.
2.2 Weighting functions used for evaluation of segmentation.
2.3 Tool for manual creation of ground-truth segmentation.
3.1 Influence of parameter q_avg on the spatial segmentation error.
3.2 Example of ADCS representation.
3.3 Influence of parameter q_ext on the spatial segmentation error.
3.4 Segmentations obtained by different colour homogeneity criteria (the required number of regions is known).
3.5 Over-segmentation of occluded background caused by favouring of merging of adjacent regions.
3.6 General form of BBA functions.
3.7 Selected segmentation results obtained by merging criteria integrating colour homogeneity with different combinations of syntactic features.
3.8 Segmentations obtained by merging costs integrating geometric measures with two different colour homogeneity criteria: C_avg and C_ext.
3.9 Influence of T_psnr on the average spatial segmentation error.
3.10 Stopping criterion based on accumulated merging cost (C_cum).
3.11 Influence of T_cum on the average spatial segmentation error.
3.12 Results of fully automatic segmentation.
3.13 Upwards label propagation.
3.14 Downwards label propagation.
3.15 Example illustrating limitations of the solution for labelling of the entire image proposed by Salembier and Garrido.
3.16 Results of semi-automatic segmentation.
4.1 Problems addressed by shape analysis.
4.2 Shape representation taxonomy.
4.3 Real part of ART basis functions.
4.4 Taxonomy of curve matching in spatial domain.
4.5 CSS extraction.
4.6 Radical change in a CSS image caused by a minor change in shape.
4.7 Matching approach proposed by Petrakis et al.
5.1 Contour displacement.
5.2 Extraction of MCC representation.
5.3 Influence of non-rigid deformations on MCC representation.
5.4 Examples of MCC representation.
5.5 Matching.
5.6 Matching examples.
5.7 Retrieval of marine creatures.
5.8 Examples from MPEG-7 collection (part B of "CE-Shape-1").
5.9 Retrieval rates for all classes from the MPEG-7 collection.
5.10 Precision versus Recall for the proposed and the CSS approach.
5.11 Distances between shapes computed using the proposed approach.
5.12 MCC-DCT representation.
5.13 Optimization algorithm.
5.14 Optimal combinations of parameters p_i.
5.15 Class retrieval rates for different configurations of the approach.
5.16 Average retrieval ratios vs. number of DCT coefficients.
6.1 Manuscript sample.
6.2 Binarization using dynamic threshold.
6.3 Position estimation.
6.4 Extraction of word contours.
6.5 Matching words.
6.6 Selection strategy for starting point.
6.7 Word Error Rate as a function of allowed path deviation Tj.
6.8 Identification of descenders and ascenders.
7.1 Multiple assignments of points.
7.2 Modeling of the pork carcasses.
7.3 Modeling of the head & shoulders.
7.4 Modeling of the "bump".
B.1 Image collection used in chapter 3.

Chapter 1

INTRODUCTION

This chapter briefly discusses the main motivations for the research carried out by the author and reported in this thesis. It also briefly outlines the main objectives of the work, together with the author's contributions and an overview of the structure of the thesis.

1.1 The New "Digital Age"

In recent years, computer technology has transformed almost every aspect of our lives and culture. The advent and popularization of the personal computer, in conjunction with the continuous evolution of digital networks and the emergence of multimedia technology, particularly the proliferation of cheap digital media acquisition tools, have enabled a worldwide trend towards a new "digital age". The new models of content production, distribution and consumption have resulted in fast growth in the amount of digital material available. However, the wealth of information available nowadays has also induced a major problem of information overload. In other words, the new "digital age" challenges us to develop the ability to find helpful and useful information in a vast sea of otherwise useless information.

The need for efficient tools to organize, manipulate, search, filter and browse through the huge amounts of digital information is evident from the spectacular success of text-based Web search engines such as Yahoo or Google [1].

1.2 Challenges Brought by the Explosion of Cheap Digital Media

The emergence of multimedia technology, particularly the explosion of cheap digital media acquisition tools such as scanners and digital cameras, as well as the rapid reduction in storage costs, has resulted in accelerated growth of digital audiovisual media collections, both proprietary and freely available on the World Wide Web. Applications in manufacturing, medicine, entertainment, education, etc. all make use of immense amounts of audiovisual data. As the amount of information available in visual form continuously increases, the necessity for the development of human-centered tools for the effective processing, storing, managing, browsing and retrieval of audiovisual data becomes evident.

A large amount of visual material is available in the form of still images, either as photographs representing real scenes or as graphics containing non-real images (sketched by a human or synthesized by computer). Images can be stored in local or distributed databases (e.g. the World Wide Web), and either embedded in documents or available as stand-alone objects. The application areas most often listed in the literature where the use, and retrieval in particular, of images nowadays plays a crucial role are: law and crime prevention, medicine, fashion and graphic design, publishing, electronic commerce, architectural and engineering design, and historical research. A detailed study of image users and the possible uses of images was provided by Eakins and Graham in [2]. Due to the volume of visual material in the form of images there is a clear need for image search engines, for both private and professional use.

1.3 Retrieval by Content

Currently, techniques for image retrieval (and retrieval of visual media generally) are among the most eagerly sought multimedia technologies, with potentially great commercial significance. However, visual information systems are radically different from conventional information systems, and retrieval of data in such systems requires addressing many novel issues before they can truly become a fertile area for innovative products. Specifically, retrieval based on the text usually associated with visual content is no longer sufficient. In fact, in many cases this is not even possible: often there is no textual information associated with images, and manual keyword annotation is labor intensive. Therefore, retrieval by content has recently been suggested as an alternative retrieval model for audiovisual media. In Content-Based Image Retrieval (CBIR) systems, images are indexed (and subsequently browsed and retrieved) based on features extracted directly from their representation rather than by any associated text.

Building modern systems for content-based retrieval of visual data requires consideration of several key issues such as system design, feature extraction, similarity measures, indexing structures, semantic analysis, inferring content abstraction from low-level features, designing user interfaces, querying models and relevance feedback [3]. Notably, CBIR systems require indices to order the data. However, given a visual signal, there is little information readily available for indexing. Therefore, in recent years a significant research effort has been dedicated to the problem of extracting visual descriptors which can then be exploited for content-based browsing and retrieval. The descriptors may be visual features such as colour, texture, edge direction, shape, scene layout, spatial relationships, or semantic primitives. They can also be global (one feature describes the whole image) or local (each feature describes an object or a part of the image). Another crucial component of CBIR systems is establishing similarity between visual entities (e.g. between the query and the target).

In order to allow inter-operability between devices and applications attempting to solve various aspects of the query-by-content problem, as well as to provide a common terminology and test material, the Moving Picture Experts Group (MPEG) introduced a standard known as MPEG-7 [4, 5]. MPEG-7, formally called the Multimedia Content Description Interface, is a standard for describing multimedia content, facilitating sophisticated management, browsing, searching, indexing, filtering, and accessing of that content. Unlike previous ISO standards, it is intended to provide representations of audiovisual (AV) information that aim beyond compression (such as the MPEG-1 and MPEG-2 standards) or even object-based functionalities (such as supported by the MPEG-4 standard), and to support some degree of interpretation of the AV content. MPEG-7 offers a comprehensive set of standardized audiovisual description tools by specifying four types of normative elements: Descriptors, Description Schemes, a Description Definition Language (DDL), and coding schemes. For a more detailed description of the standard the reader can refer to [4, 5].

This thesis contributes to two key research areas enabling some of the above challenges to be addressed: image segmentation and shape matching. The remainder of this chapter briefly discusses the main motivation for the research carried out by the author by discussing the role of image segmentation and shape matching in CBIR. This is followed by an outline of the main objectives of the work together with the author's contributions and an overview of the structure of the thesis.

1.4 The Importance of Image Segmentation and Shape Analysis in CBIR

This section briefly discusses the role of image segmentation and shape matching technologies in addressing some of the major challenges which must be faced when building CBIR systems.

1.4.1 Image Segmentation

Image segmentation refers to a very broad set of techniques for image partitioning. Typically, the goal is to partition an image using a chosen criterion so that the created partition reflects the structure of the scene at a level suitable for a given application, e.g. homogeneous regions or semantic objects. In other words, the term segmentation can be used to refer to any process that allows certain knowledge about the structure of the scene to be discovered, e.g. the shapes of the objects present. Therefore, the problem of partitioning an image into a set of homogeneous regions or semantic entities is a fundamental enabling technology for understanding scene structure and identifying relevant objects. Identification of areas of an image that correspond to important regions, e.g. semantic objects, is often considered to be the first step of many object-based applications. The problem of segmentation is fundamental to many challenges encountered in the area of computer vision. This thesis focuses primarily on region- and object-based segmentation of images in the context of visual content indexing and retrieval applications.

Many low-level descriptors, including the ones defined by the MPEG-7 standard, can be used to describe entire images as well as parts of images (e.g. image regions, which ideally would correspond to real objects present in the scene). Utilization of local descriptors, representing images at region or object granularity, allows indexing and retrieval in a manner closer to human perception and scene understanding. Image segmentation is a vital tool for the extraction of such localized descriptors of the image content. In other words, the ability to partition the image into regions corresponding to objects present in the scene (or at least to parts of objects that are homogeneous according to certain criteria) is central to the extraction of local low-level features (e.g. colour, texture, shape) where each feature partially or completely describes an object. However, MPEG-7 does not specify normative tools for image segmentation, as this is not necessary for inter-operability. Nevertheless, in practice, image segmentation tools are integral parts of many feature extraction toolboxes - see for example [6]. The author believes that in many cases the success of MPEG-7, both commercially and as a tool facilitating research, will depend on the ability to easily partition visual content into regions meaningful in a given application scenario.

It is commonly known that even limited capabilities for filtering AV material based on small sets of particularly important semantic objects/concepts can greatly improve the performance of CBIR systems, e.g. the user may wish to browse only images containing a certain type of object (e.g. faces) or scene (e.g. indoor/outdoor) - see for example the Fischlar system [7]. Often, such functionalities require automatic detection, extraction and recognition of semantically meaningful entities from visual data. Recently it has been shown that textual labels can be assigned (inferred) to homogeneous image regions automatically during an off-line training/indexing process [8, 9]. Such labels can then be used for indexing and subsequently for textual querying. Clearly, the ability to automatically segment images based on colour and texture homogeneity criteria is a crucial underpinning technology in such a scenario.

Finally, semi-automatic (often also referred to as supervised) segmentation technologies are becoming sufficiently mature for integration with CBIR, opening up the interesting possibility of object-based rather than image-based queries. Surprisingly, to the best of the author's knowledge, so far only a few studies have considered such a possibility [10]. Semi-automatic tools also enable rapid semi-automatic annotation of images, where labels are manually assigned to image parts [11]. Other interesting approaches to CBIR utilizing image segmentation are discussed in the next chapter (section 2.2).

1.4.2 Shape Matching

Shape plays an important role in allowing humans to recognize and classify objects. In fact, shapes often represent the abstractions (archetypes) of objects belonging to the same semantic class. Not surprisingly, some user surveys regarding the cognitive aspects of image retrieval indicate that users are more interested in retrieval by shape than by colour or texture [12]. Therefore shape analysis can play an important role in addressing some of the key problems which must be faced when building CBIR systems, such as detection of semantic objects in unseen images, establishing similarity between objects, and extracting indices with which to index the data.

Shape matching (in this case often referred to as template matching) plays an important role in the detection and segmentation of semantic objects in unseen images. Once prototypical objects are extracted, shape analysis can make explicit some important information about the object's shape, which can then be exploited to recognize that object or distinguish it from other objects. Estimation of similarity between the shapes of objects (e.g. between the query and the target objects) is a key technology enabling querying by example. However, estimation of shape similarity can also be indirectly important for textual querying of images, e.g. text recognition in scanned documents using Optical Character Recognition (OCR). Clearly the shape of characters and words is crucial for mapping them into text. One such application is discussed in chapter 6 of this thesis, which considers contour matching for word recognition in historically significant handwritten documents.

Not surprisingly, among the several low-level visual features it defines to characterize AV content, MPEG-7 includes two standardized descriptors specifically for 2D shape: a contour-based descriptor relying on the concept of Curvature Scale Space (CSS) [13, 14, 15, 16, 17] and a region-based descriptor based on a 2D complex transform defined with polar coordinates on the unit disk, called the Angular Radial Transform (ART) [17] - both are discussed in detail in chapter 4. To facilitate comparison between different methods for shape-based retrieval, a complete evaluation methodology and common test collections were specified. This methodology, among others, is adopted for the extensive evaluation of the method proposed by the author in chapter 5. Also, weaknesses of the standardized contour-based descriptor (CSS) are identified and discussed based on several application examples. In particular, it will be shown that the method for estimating similarity between shapes proposed in this thesis produces better results than the CSS-based method in two applications: retrieval in collections of closed silhouettes and holistic word recognition in handwritten historical manuscripts.
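To make the region-based descriptor concrete, the sketch below computes magnitudes of ART coefficients for a binary shape mask, using the basis V_nm(rho, theta) = (1/2pi) exp(j m theta) R_n(rho) with R_0(rho) = 1 and R_n(rho) = 2 cos(pi n rho). This is a simplified, grid-sampled illustration, not the normative MPEG-7 extraction procedure; the function name and parameter defaults are this sketch's own.

```python
import numpy as np

def art_coefficients(mask, n_max=3, m_max=12):
    """Magnitudes of Angular Radial Transform coefficients of a binary mask.

    Simplified sketch: the ART basis is sampled on the pixel grid mapped
    onto the unit disk centred on the image, and F_nm is approximated by
    a plain sum over foreground pixels.
    """
    h, w = mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    # polar coordinates on the unit disk
    rho = np.sqrt(((ys - cy) / (h / 2.0)) ** 2 + ((xs - cx) / (w / 2.0)) ** 2)
    theta = np.arctan2(ys - cy, xs - cx)
    inside = (rho <= 1.0) & (mask > 0)
    coeffs = np.zeros((n_max, m_max))
    for n in range(n_max):
        # radial basis: R_0 = 1, R_n = 2 cos(pi n rho) for n > 0
        radial = np.ones_like(rho) if n == 0 else 2.0 * np.cos(np.pi * n * rho)
        for m in range(m_max):
            # conjugate basis V*_nm evaluated on the grid
            basis = np.exp(-1j * m * theta) * radial / (2.0 * np.pi)
            coeffs[n, m] = np.abs(np.sum(basis[inside]))
    # normalise by F_00 for scale invariance
    return coeffs / coeffs[0, 0]
```

Because only magnitudes are kept, the representation is rotation invariant; for a rotationally symmetric shape the coefficients with angular order m > 0 vanish (up to discretization error).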

In summary, image segmentation and shape matching are key technologies essential for providing the content-based functionalities required by future multimedia applications. Not surprisingly, therefore, they have been hot research topics for a large number of groups around the world. Contributions to both of these key research areas are made and reported in this thesis, together with a discussion of several applications of the proposed solutions.

1.5 Objectives of this Thesis

This thesis considers different aspects of the utilization of contour information for image segmentation, similarity estimation between objects, and object registration and modeling, in the context of content-based retrieval in large collections of images.

The first objective of this thesis is to place the research in context and provide justification for the investigation carried out by the author. This is achieved by presenting various applications of the proposed solutions throughout the thesis and demonstrating that image segmentation and shape matching are key technologies enabling the content-based functionalities required by future multimedia applications.

The second objective is to outline the current state of the art in image segmentation and shape matching by reviewing the most characteristic categories of techniques in both areas and briefly discussing the most interesting and promising techniques from each category. This thesis does not attempt to cover all aspects of image segmentation and shape analysis, nor does it represent a detailed literature review of the vast amount of work published in these fields. Rather, it restricts itself to a discussion of these technologies in the context of CBIR.

The third, and most important, objective is to investigate new approaches to solving the problems related to image segmentation, contour matching and similarity estimation, and to discuss their practicability in various applications. One of the main goals is to explore the feasibility of utilizing the spatial configuration of image regions and structural analysis of their contours (so-called "syntactic visual features") for improving the usefulness of the output of bottom-up image segmentation, e.g. for feature extraction, by creating a more meaningful segmentation of the image corresponding more closely to the semantic objects present in the scene and improving the perceptual quality of the segmentation. This can be achieved by investigating new ways of utilizing evidence from multiple sources of information, each with its own accuracy and reliability. Another goal is to investigate selected issues in shape analysis, in particular contour representation, matching, and estimation of similarity between silhouettes. The aim is also to demonstrate the usefulness of the proposed approach in a variety of applications.

In each case, the performance and generality of the proposed techniques must be analyzed based on rigorous experimentation using large test collections. Moreover, presentation of selected applications of the proposed solutions should provide a suitable basis for a discussion of the practicability of different recognition strategies utilizing contour information, e.g. bottom-up (automatic segmentation followed by recognition) vs. top-down (segmentation driven by shape models or semi-automatic).

The final objective of this thesis is to indicate directions for further research, namely to consider possibilities for further improvement of the proposed solutions and to discuss the prospects of using them as a basis for addressing various other challenges in the field.

1.6 The Main Contributions

There are three main contributions in this thesis. The first contribution is a feasibility study of utilizing the spatial configuration of regions and structural analysis of their contours (so-called syntactic visual features [18]) for improving the correspondence of fully automatic image segmentation to the semantic objects present in the scene and for improving the perceptual quality of the segmentation. Several extensions to the well-known Recursive Shortest Spanning Tree (RSST) algorithm [19] are proposed, among which the most important is a novel framework for the integration of evidence from multiple sources of information, each with its own accuracy and reliability. The new integration framework is based on the Theory of Belief (BeT) [20, 21, 22, 23]. The "strength" of the evidence provided by the geometric properties of regions and their spatial configurations is assessed and compared with the evidence provided solely by the colour homogeneity criterion. Other contributions in this area include a new colour model and colour homogeneity criteria, practical solutions to structure analysis based on the shape and spatial configuration of image regions, and a new simple stopping criterion aimed at producing partitions containing the most salient objects present in the scene. Additionally, the application of the automatic segmentation method to the problem of semi-automatic segmentation is also discussed. An easy-to-use and intuitive semi-automatic segmentation tool utilizing the automatic approach is described in detail. This example demonstrates that syntactic features can also be useful in supervised scenarios, where they can facilitate more intuitive user interactions, e.g. by allowing the segmentation process to make more "intelligent" decisions whenever the information provided by the user is not sufficient (or is ambiguous).

The second contribution is the proposal of a novel, rich multi-scale representation for non-rigid shapes with a single closed contour and a study of its properties in the context of efficient silhouette matching and similarity estimation. An initial approach for optimization of the matching in order to achieve higher performance in discriminating between shape classes is also proposed. The efficiency of the proposed silhouette matching approach is demonstrated in a variety of applications. In particular, the retrieval results are compared to those of the Curvature Scale Space (CSS) [13, 14, 15] approach (adopted by the MPEG-7 standard [16, 17]) and it is shown that the proposed scheme performs better than CSS in two applications: retrieval in collections of closed silhouettes and holistic word recognition in handwritten historical manuscripts. In fact, it is shown that the proposed contour-based approach outperforms all techniques for word matching proposed in the literature when tested on a set of 20 pages from the George Washington collection at the Library of Congress [24, 25]. Additionally, it is shown that the matching approach can be extended with some success to the problem of matching multiple contours, thereby facilitating the establishment of dense correspondence between multiple silhouettes, which can then be employed for contour registration.

Various applications of the proposed techniques are presented throughout the thesis, providing the basis for the third contribution, which is a discussion of the feasibility of different recognition strategies utilizing contour information.
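The core operation of belief-theoretic evidence integration can be illustrated with Dempster's classical rule of combination. The sketch below is a generic textbook illustration over a hypothetical two-hypothesis frame ("merge"/"keep" for a pair of regions, with mass on the whole frame expressing ignorance); it is not the specific framework developed in chapter 3.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions whose focal elements are frozensets.

    Dempster's rule: m(A) is proportional to the sum over all B, C with
    B & C == A of m1(B) * m2(C), renormalised by the conflict mass K
    (the total mass the product assigns to the empty intersection).
    """
    combined = {}
    conflict = 0.0
    for (b, mb), (c, mc) in product(m1.items(), m2.items()):
        a = b & c
        if a:
            combined[a] = combined.get(a, 0.0) + mb * mc
        else:
            conflict += mb * mc
    if conflict >= 1.0:
        raise ValueError("totally conflicting evidence")
    return {a: v / (1.0 - conflict) for a, v in combined.items()}
```

Mass assigned to the whole frame lets an unreliable cue abstain rather than vote, which is the behaviour needed when combining sources of differing accuracy and reliability.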

1.7 Thesis Overview

The first chapters of the thesis consider segmentation of images of real-world scenes, while the later chapters focus on estimation of similarities between objects' silhouettes. The final chapters present case studies demonstrating the usefulness of the proposed solutions in selected applications.

Chapter 2 presents the state of the art in image segmentation in the CBIR context. The chapter discusses the relevance of image segmentation in the context of content-based image retrieval, outlines the main categories of approaches, briefly describes the grouping cues most commonly utilized in segmentation, and reviews the most characteristic approaches to automatic and semi-automatic image segmentation. The chapter also briefly discusses evaluation methodologies for the assessment of segmentation quality proposed thus far in the literature and describes in some detail the evaluation method adopted throughout this thesis.

Chapter 3 explores the feasibility of utilizing the spatial configuration of regions and structural analysis of their contours (so-called syntactic visual features) for improving the correspondence of the output of bottom-up segmentation by region merging to the semantic objects present in the scene and improving its perceptual quality. A new framework for utilizing evidence from multiple sources of information, each with its own accuracy and reliability, for meaningful region merging is proposed. Two practical measures for analyzing the spatial configuration of image regions and the structure of their contours are proposed, together with several improvements to the colour representations used during the region merging process and a stopping criterion aimed at producing partitions containing the most salient objects present in the scene. Finally, the chapter demonstrates the utilization of the approach in a semi-automatic scenario.

Chapter 4 outlines the most important issues related to shape analysis in the context of content-based image retrieval. It focuses on the major challenges related to 2D shape representation, matching and similarity estimation, and gives an overview of selected techniques available in the literature in order to provide a context for the research carried out in the remainder of the thesis, particularly in chapter 5.

Chapter 5 looks at the development of a new representation and matching technique devised by the author for non-rigid shapes with a single closed contour. An initial approach for optimization of the matching in order to achieve higher performance in discriminating between shape classes is also presented. The efficiency of the proposed approaches is demonstrated on two collections and the retrieval results are compared to those of the method based on Curvature Scale Space (CSS) [13, 14, 15], which has been adopted by the MPEG-7 standard [16, 17].

Chapters 6 and 7 present two case studies, both utilizing the techniques proposed in the earlier chapters. Chapter 6 examines the application of the silhouette matching method proposed in chapter 5 to holistic word recognition in historically significant handwritten manuscripts. Chapter 7 discusses the utilization of both the silhouette matching method presented in chapter 5 and the semi-automatic segmentation approach discussed in chapter 3 in an integrated tool allowing intuitive extraction and registration of contour examples from a set of images for the construction of statistical shape models.

The final chapter summarizes the thesis by reviewing the achieved research objectives and recalling the main conclusions drawn throughout the thesis. It also indicates directions for further research by discussing possibilities for further improvement of the proposed techniques and looks at future prospects in using them to address related challenges in the field.

Chapter 2

IMAGE SEGMENTATION, APPLICATIONS AND EVALUATION: A REVIEW

This chapter discusses the importance of image segmentation in the context of content-based image retrieval, outlines the main categories of approaches, and reviews the most characteristic segmentation methods. The subject of objective evaluation of segmentation quality is also discussed in some detail, as it has received relatively little attention in the literature, especially when compared to the large number of publications on the topic of image segmentation itself.

2.1 Introduction

2.1.1 Image Segmentation

The term image segmentation¹ refers to a broad class of processes for partitioning an image into disjoint connected regions which are homogeneous with respect to certain criteria such as low-level features (e.g. colour or texture), general prior knowledge about the world (e.g. boundary smoothness), high-level knowledge (e.g. semantic models), or even user interactions.

The problem of image segmentation is a fundamental enabling technology for understanding scene structure and identifying relevant objects. Identification of areas of an image that correspond to homogeneous and/or important regions, e.g. semantic objects, is often considered to be the first step of many object-based applications such as content-based image indexing and retrieval (e.g. using the recently introduced MPEG-7 standard [26, 4, 5]) or region-of-interest coding using the JPEG2000 standard [27].

[Figure 2.1: Main segmentation strategies. Data driven (bottom-up): colour homogeneity, texture homogeneity, geometric homogeneity. Model driven (top-down): models (e.g. shape templates), semi-automatic (supervised).]

This thesis focuses primarily on region- and object-based segmentation of images in the context of visual content indexing and retrieval applications.

¹Often exchanged with the term perceptual grouping.

2.1.2

T ax o n o m y

A plethora of approaches to im age segm entation have been proposed in the literature [28, 29, 30, 31, 32, 33, 34]. Exhaustive surveys can be found in [35, 36, 37, 38, 39].

The techniques available in the literature can be classified

according to m any different criteria. M any authors divide the methods into tw o distinct groups: (i) region-based approaches relying on the homogeneity of spatially localized features (e.g. colour or texture) and (ii) boundary-based m ethods using mostly gradient inform ation to locate object boundaries. Others classify methods according to the mathem atical m ethodology they employ, e.g. (i) variational, (ii) statistical, (iii) graph-based, and (iv) morphological. A nother im portant criterion, adopted for discussion in the rem inder of this thesis, is the inform ation level used for grouping. According to this criterion the m ethods can be broadly classified as either visual features-driven (bottom-up) or model-driven(top-down) [18] - see Figure 2.1. A pproaches from the first category attem pt to infer m eaningful entities (ideally objects) solely from analysis of visual features, e.g. colour or texture homogeneity. A pproaches from the second category attem pt to segm ent image into objects w hose visual features match best the visual features of the m odels used. It should be noted that the term bottomup is also often used to refer to m ethods m erging iteratively pixels into more com plex entities (e.g. [19]) and the term top-down often refers to region splitting techniques (e.g. [40]) or to approaches w here segm entation of the down-sampled im age is used to initialize the segm entation of the im age at finer resolution (e.g. [30]). H owever in this thesis the terms bottom-up an d top-down are mainly used to refer to the level of inform ation used and not necessarily to the order in w hich images at different scales are analyzed. Alternatively, approaches can also be classified as either object-based (resulting in sem antic objects) or region-based (producing regions w ith homogeneous colour a n d /o r texture).

2. IMAGE SEGMENTATION, APPLICATIONS AND EVALUATION: A REVIEW

Clearly, there is a close relationship between the last two criteria, i.e. typically semantically meaningful objects can be produced only by making use of object models, prior knowledge about a particular application, or user guidance. The approaches driven by visual features typically rely on a certain homogeneity criterion, e.g. colour or texture. They are attractive in many applications as they do not require models of individual semantic objects or user interaction. However, they can rarely be used to extract complete semantic objects, as the problem of visual features-driven object-based segmentation is under-constrained in many applications. In contrast, use of model-driven approaches, although requiring utilization of prior knowledge about the objects to be extracted, leads to well-defined detection problems.

Typically, in the case of general content, the visual features-driven approaches can only partition the input image into homogeneous regions, although in specific application scenarios they can also produce objects, while the model-driven approaches are used mainly to extract the important semantic objects from the irrelevant structures present in the scene. In the case of systems carrying out shape analysis, the segmentation algorithms typically aim at object-based segmentation in order to extract/recover the shape of the relevant objects present in the scene. However, the two approaches (visual features-driven and model-driven) are not mutually exclusive and could/should both be used for segmentation and object extraction. For example, in [41] bottom-up segmentation was used to obtain a single partitioning of the image and also to build its hierarchical representation in the form of a Binary Partition Tree (BPT). The BPT was then used to guide the search for the optimum match between a reference contour and the contours of the partition, leading to object segmentation and detection.

2.1.3 Chapter Structure

The remainder of this chapter is organized as follows: the next section discusses the relevance of image segmentation in the context of Content-Based Image Retrieval (CBIR), focusing primarily on reviewing selected CBIR systems making use of segmentation and describing some recently emerged approaches to image retrieval utilizing automatic segmentation. Section 2.3 briefly discusses cues which can be used for image segmentation. Selected bottom-up and top-down approaches are reviewed in sections 2.4 and 2.5 respectively. Evaluation methods are discussed briefly in section 2.6. A short discussion of the prospects of image segmentation in the context of CBIR is given in section 2.7 and conclusions are formulated in section 2.8.

2.2 Relevance to Content-Based Image Retrieval

Utilizing segmentation in CBIR systems allows representation of images at the level of homogeneous regions, which in an ideal case correspond to semantic objects, or at least to significant parts of the objects. Such region or object granularity allows utilization of local features for indexing and retrieval in a manner closer to human perception and scene understanding. State of the art CBIR systems make use of image segmentation to extract a separate set of indexing features for each region (or object) [42, 43, 29, 44]. According to [45], the systems utilizing segmentation can be divided into two categories depending on the strategy used for matching regions from the query and the target image: (i) individual region matching - where a single region selected by the user from the query image is matched to all regions in the collection [46, 43], or alternatively retrieval results obtained by using several queries with individual regions are merged [29], and (ii) frame region matching - where information on all regions composing the images is used [47, 45]. Recently, the possibility of another category of systems utilizing image segmentation emerged when it was shown that keywords can be associated with image regions automatically during an off-line training/indexing process [8]. Such keywords can then be used for indexing and subsequently for textual querying. The remainder of this section aims at highlighting the importance of image segmentation in CBIR by reviewing selected approaches utilizing some form of image partitioning.
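To make the individual region matching strategy concrete, the following sketch ranks all regions in a collection against the descriptor of a single query region. The descriptors and identifiers are hypothetical; no reviewed system is claimed to use exactly this form.

```python
import math

def match_region(query_desc, db_regions, top_k=3):
    """Rank database regions by Euclidean distance in feature space.

    query_desc: feature vector of the single region picked by the user.
    db_regions: list of (image_id, region_id, descriptor) tuples.
    Returns the top_k closest regions - an 'individual region matching'
    strategy, as opposed to whole-frame matching.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    ranked = sorted(db_regions, key=lambda r: dist(query_desc, r[2]))
    return ranked[:top_k]
```

A frame region matching system would instead aggregate such distances over all regions of each image before ranking the images themselves.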

2.2.1 CBIR Systems Utilizing Segmentation

Recent years have witnessed an explosion of image retrieval systems, developed as platforms facilitating research or even as commercial products. Extensive reviews of the major CBIR systems can be found in [2, 48, 3]. This section briefly overviews those systems utilizing some form of image segmentation.

One of the earliest CBIR systems, and also one of the first to provide limited object-based functionalities, is the Query By Image Content (QBIC) system developed by the IBM Almaden Research Center [49, 50, 11]. In this approach simple low-level features (colour, texture and simple shape descriptors) are extracted from entire images or from objects semi-automatically segmented in the database population step. The segmentation is based on the active contours technique, which aligns the approximate object outline provided by the user with the closest image edges. Each extracted object can be annotated with a textual description. All images are also represented by a reduced binary map of edge points to allow retrieval based on rough sketches. The system allows combination of text-based keyword queries with visual queries, i.e. queries-by-example, by rough sketches and by selected colour and texture patterns.

One of the first image retrieval systems utilizing automatic image segmentation is Picasso, developed by the Visual Information Processing Lab, University of Florence [51, 52]. At the core of the system is a pyramidal colour segmentation of images (i.e. at the lowest level, each region consists of only one pixel, while the top of the pyramid corresponds to the entire image) obtained by iterative merging of adjacent regions. Each level of the pyramid corresponds to a resolution level of the segmentation. The above representation is stored in the form of a multilayered graph in which nodes representing regions are characterized by colour (a binary 187-dimensional colour vector), the position of the region centroid, size, and shape (elongation and the major axis orientation). Additionally, during database population the objects of interest in each image are bounded with their minimum enclosing rectangle and a Canny edge detector is used within each of these rectangular areas to extract edges. The system allows querying by colour regions (by drawing a region with a specific colour or by tracing the contour of some relevant object in an example image), by texture, or by shape (drawn sketches).

Netra, developed by the Department of Electrical and Computer Engineering, University of California, Santa Barbara, is another CBIR system utilizing image segmentation [53, 46]. In this system images are automatically segmented into regions of homogeneous colour using an edge flow segmentation technique [54]. Each region is characterized by colour (represented by a colour codebook of 256 colours), texture (represented by a feature vector containing the normalized mean and standard deviation of a series of Gabor wavelet transforms), shape (represented by three different feature vectors: curvature at each point on the contour, centroid distance function, and Fourier descriptors) and spatial location.

The query is formulated by selecting one of the regions from the query image or alternatively, if the example image is not available, directly by specifying the colour and spatial location.

iPure is a CBIR system developed by the IBM India Research Lab, New Delhi [42]. In this system images are segmented into regions with homogeneous colour using the Mean-Shift algorithm [55, 56, 33]. Each region is represented by colour (average colour in CIE LUV space), texture (coefficients of the Wold decomposition of the image viewed as a random field), and shape (size, orientation axes, and Fourier descriptors). Additionally, the spatial layout is characterized by the centroid, the minimum bounding box, and contiguity. The query is formulated by selecting one or more regions from the example image.

Istorama is an image retrieval system developed by the Informatics and Telematics Institute, Centre for Research and Technology Hellas, Greece [43]. In this system all images are automatically segmented by a variant of the K-Means algorithm called KMCC [30]. Each region is characterized by its colour, size and location. Similarly to the Blobworld system, the user formulates the query by selecting a single region from the query image. Additionally, the user may stress or diminish the importance of a specific feature by adjusting the weights associated with each feature.

Blobworld is a CBIR system developed at the Computer Science Division, University of California, Berkeley [29]. In this approach images are segmented into regions with uniform colour and texture (the so-called blobs) by a variant of the Expectation Maximization (EM) algorithm. Each region is characterized by its colour (represented by a histogram of the colour coordinates in the CIE LAB colour space), texture (characterized by mean contrast and anisotropy over the region), and simple global shape descriptors (size, eccentricity, and orientation). The query-by-example is formulated by selecting one or more regions from one of the query images, followed by specification of the importance of the blob itself and also of its colour, texture, location, and shape.

An interesting methodology for image retrieval was presented in [44]. In this approach images are automatically segmented by a variant of the K-Means algorithm called KMCC [30] and the resulting regions are represented by low-level features such as colour, position, size and shape. The main innovation of this methodology is the automatic association of these descriptors with qualitative intermediate-level descriptors, which form a simple vocabulary termed an object ontology. The object ontology allows the qualitative definition of high-level concepts and their relations in a human-centered fashion and therefore facilitates querying using semantically meaningful concepts. Applicability to generic collections without requiring manual definition of the correspondences between regions and relevant identifiers is ensured by the simplicity of the employed ontology. Utilization of the object ontology allows narrowing down the search to a set of potentially relevant images. Finally, a relevance feedback mechanism, based on Support Vector Machines and using the low-level descriptors, is employed for further refinement of the retrieval results.

2.2.2 Automatic Annotation of Image Regions

Some user studies suggest that in practice queries based on global low-level features (e.g. colour histograms, texture) are surprisingly rare while at the same time text associated with images is particularly useful [57]. Interestingly, a number of works [8, 9] have shown that keywords can be associated with image regions automatically during an off-line training/indexing process. Such labels can then be used for indexing and subsequently for textual querying.

One of the most promising approaches so far to predicting words from segmented images was described in [9]. In this approach a statistical model links words and image data, without explicit encoding of the correspondence between words and regions. This model combines the aspect model with a soft clustering model. It is assumed that images and co-occurring words are generated by nodes arranged in a tree structure. The joint probability of words and image regions is modelled as being generated by a collection of nodes, each of which has a probability distribution over words and regions. The region probability distributions are Gaussians over feature vectors and the word probabilities are provided by a simple frequency table. A region's features imply a probability of being generated from each node. These probabilities are then used to weight the nodes for word emission. Parameters for the conditional probabilities linking words and regions are estimated from the word-region co-occurrence data using an Expectation-Maximization algorithm. While the approach was shown to label only some regions reasonably well (e.g. sky, water, snow, people, fish, planes) it certainly represents an extremely interesting approach to object recognition.

In [58] three classes of image segmentation algorithms (Blobworld [29], Normalized Cuts [28] and Mean-Shift [33]) were evaluated in the context of the above annotation approach, i.e. in terms of word prediction performance. It was found that the Normalized Cuts approach provides the best support of all three methods for word prediction, closely followed by the Mean-Shift segmenter.

2.3 Grouping Cues

This section looks at possible features (grouping cues) which are relevant to the image segmentation problem.

2.3.1 Colour

Colour is the primary low-level feature in practically all image segmentation techniques which can be found in the literature [29, 30, 31, 59, 18]. One of the main issues related to colour is the choice of an appropriate colour space. Typically it is advantageous that the colour space used guarantees a low correlation among components and is perceptually uniform, i.e. the numerical distance is proportional to the perceived colour difference. Due to the above requirements, many recent publications advocate utilization of the CIE LUV colour space and its improved version CIE LAB, which are approximately perceptually uniform for similar colours [28, 29, 30].
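For illustration, a minimal sRGB to CIE LAB conversion using the standard D65 formulas (not taken from any of the cited systems) shows how an approximately perceptual colour difference can be computed:

```python
import math

# D65 reference white used by the sRGB standard
_XN, _YN, _ZN = 0.95047, 1.0, 1.08883

def srgb_to_lab(r, g, b):
    """Convert an sRGB colour (components in [0, 1]) to CIE LAB.

    In LAB, Euclidean distance approximates perceived colour
    difference for similar colours, which is why segmentation
    methods often cluster in this space rather than in RGB.
    """
    # 1. undo the sRGB gamma to obtain linear light
    def lin(c):
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = lin(r), lin(g), lin(b)
    # 2. linear RGB -> CIE XYZ (sRGB primaries, D65 white point)
    x = 0.4124 * r + 0.3576 * g + 0.1805 * b
    y = 0.2126 * r + 0.7152 * g + 0.0722 * b
    z = 0.0193 * r + 0.1192 * g + 0.9505 * b
    # 3. XYZ -> LAB via the cube-root compression function
    def f(t):
        return t ** (1 / 3) if t > (6 / 29) ** 3 else t / (3 * (6 / 29) ** 2) + 4 / 29
    fx, fy, fz = f(x / _XN), f(y / _YN), f(z / _ZN)
    return 116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)

def delta_e(lab1, lab2):
    """Euclidean distance in LAB - a simple perceptual difference."""
    return math.dist(lab1, lab2)
```

White maps to (L, a, b) close to (100, 0, 0) and black to L = 0, so the distance between two nearby colours in this space tracks their perceived difference far better than raw RGB distance.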

2.3.2 Texture

Texture is arguably the second most important low-level feature after colour in the image partitioning problem. Several approaches to texture characterization have been shown to be useful for segmentation [30], including multi-orientation filter banks [60] and the second-moment matrix [61, 29].

However, utilization of texture in image segmentation should not be considered simply as a thoughtless extension of the colour feature. There are several issues which must be addressed when making use of texture information in order to achieve satisfactory results for natural images, which often depict both textured and untextured objects. An interesting discussion on the utilization of texture information for image segmentation can be found in [60], where the authors discuss issues such as adaptive scale selection, i.e. the size of the local window used to compute the texture descriptor - see also [29] - and the fact (rarely discussed in the literature) that contours may constitute a serious problem for many texture analysis methods, i.e. filters tuned to various orientations and spatial frequencies, which are often used for texture analysis, produce large responses along the direction of the edge. Finally, it should be noted that colour and texture often provide conflicting evidence (e.g. consider a zebra-like pattern) which must be addressed in the final decision step of the segmentation process. Currently, the most common approach to alleviating such conflicts is conditional filtering of the colour components, i.e. textured areas are smoothed in order to ensure their colour homogeneity (e.g. a zebra becomes a grey horse (colour feature) with stripes (texture feature)) [29, 30].
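A multi-orientation filter bank of the kind cited above can be sketched with a real-valued (cosine-phase) Gabor kernel; the parameter values below are illustrative only, not those of [60].

```python
import math

def gabor_kernel(size, theta, wavelength, sigma, gamma=0.5):
    """Sample a real (cosine-phase) Gabor filter on a size x size grid.

    theta is the filter orientation in radians; a bank is built by
    sampling several orientations and wavelengths.  Convolving an
    image with such a bank yields per-pixel texture responses - and,
    as noted above, strong responses along object contours too.
    """
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # rotate coordinates into the filter's frame
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            g = math.exp(-(xr * xr + gamma * gamma * yr * yr) / (2 * sigma * sigma))
            row.append(g * math.cos(2 * math.pi * xr / wavelength))
        kernel.append(row)
    return kernel

# a small bank: 4 orientations at one scale
bank = [gabor_kernel(7, k * math.pi / 4, 4.0, 2.0) for k in range(4)]
```

The per-pixel texture descriptor is then the vector of filter responses, optionally computed over an adaptively chosen window as discussed in [60].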

2.3.3 General Geometric Cues

Many researchers have attempted to incorporate general geometric prior knowledge about the world [62, 63, 64, 65, 66, 67, 18, 68] into the segmentation process. The main idea is to use generic geometric cues which allow segmentation of images into meaningful entities without building object-specific models. One example are methods based on the deformable model paradigm of Active Contours, which incorporate prior knowledge about a contour's smoothness and resistance to deformations [64, 65, 66]. Other methods attempt to perform Perceptual Grouping, which is typically defined as a process that organizes image entities (typically edges or regions) into higher-level structures based on some generic geometric perceptual measures. The majority of such approaches are directly motivated by theories from the field of Cognitive Psychology regarding human perception and object recognition², such as the principles of Gestalt theory and the principle of common-cause³.

Geometric perceptual measures commonly used for perceptual grouping (and potentially for image segmentation) are proximity, continuity, closure, compactness, edge parallelism, collinearity, and cocircularity (see examples in a special section on Perceptual Organization in Computer Vision in [67]). Another example of a set of geometric perceptual measures potentially useful for segmentation are the Syntactic Visual Features, and in particular a Quasi-Inclusion Criterion, advocated by Ferran and Casas [18] for the creation of Binary Partition Trees (BPT). Syntactic visual features include complex homogeneity (e.g. shape homogeneity), compactness, regularity, inclusion and symmetry.

² A brief discussion of the main models of the human visual system and object recognition can be found in appendix A.
³ Based on the idea that many relations in visual information have a low probability of occurring at random [69].



Although utilization of geometric properties and/or structural relationships for segmentation is intuitive, integration of such information with other grouping cues (e.g. colour and texture) is a non-trivial task. In many approaches proposed so far such integration is performed in an ad-hoc manner [62, 63, 18, 68]. Moreover, despite the fact that the generality of many of the geometric cues is arguable, very little work has been done to quantitatively demonstrate their usefulness on large collections, i.e. typically only a handful of segmentation examples is shown.

2.3.4 Application Specific Models

In some applications it is possible to use a prior model of what is expected in the image, further reducing ambiguities and leading to fully automatic techniques for object-based segmentation, detection and recognition. In other words, the partitioning process is guided by models of the specific objects to be segmented. Intuitively, such models should allow modeling of common real-world deformations and variations within object classes. Models commonly used for segmentation and object recognition are shape models [70, 71, 72, 73, 74, 75, 76] and models of colour (or intensity) variations designed for synthesizing new images of the modelled object (e.g. Active Appearance Models (AAMs) [77, 78] or Eigenfaces [79]).

2.3.5 User Interactions

Semi-automatic or supervised approaches allow a user, through his/her interactions, to define what objects are to be segmented within an image [32]. The result of the interaction provides clues as to the nature of the user's requirements, which are then used by an automatic process to segment the required object. The user should have the possibility of obtaining exactly the segmentation he/she desires in terms of content and accuracy. The amount of required interaction should not be excessive and should be easily and quickly performed. Also, it is important that the system responds to the user's interactions in real-time, ideally instantaneously. Rapid responses are crucial for cases when interactions need to be applied iteratively in order to refine the segmentation result, e.g. for complex objects or objects with adjacent parts exhibiting a certain degree of homogeneity. The strategies for user interaction can be grouped into three classes [80]: feature-based [81, 82, 83], contour-based [84, 34], and region-based [85].


2.4 Selected Methods for Bottom-up Automatic Segmentation: A Review

This section describes selected methods for bottom-up automatic segmentation of heterogeneous still images, typically colour, without utilizing any a priori knowledge about the scene depicted in the image.

As already explained, the literature in the area of automatic image segmentation is enormous. Different methodologies have been used in the past to tackle the problem of segmentation, such as: (i) histogram thresholding [86, 87], (ii) clustering [88, 89, 90, 30, 29, 91, 56, 33], (iii) mathematical morphology [92, 93], (iv) graph theoretic algorithms [19, 94, 59, 28, 95], (v) statistical methods [96, 97, 98], (vi) edge based segmentation [99, 100], (vii) neural networks [101] and many others. This section discusses three categories of approaches which arguably are the most significant in the context of content-based image retrieval for general content and which are also relevant to the work presented in chapter 3 of this thesis.

2.4.1 Clustering

Clustering methods are among the earliest approaches to image segmentation [90]. Traditional clustering approaches typically ignore spatial information and operate solely in the feature space (usually colour or texture). The feature samples computed for each pixel are treated as vectors in a multi-dimensional space and the objective is to group them into compact, well-separated clusters using a specific distance measure. Once the clustering is completed, cluster labels are mapped onto the image plane to produce the final regions. The most widely used methods from this category are the K-Means algorithm [88] and its fuzzy equivalent Fuzzy C-Means [89]. The main drawbacks of the above approaches are: (i) they require a priori knowledge of the number of clusters, (ii) they produce disconnected regions, and (iii) they assume approximately spherical clusters in the feature space.

To alleviate the first two drawbacks, the K-Means-with-Connectivity-Constraint (KMCC) algorithm was proposed in [30]. In this approach, the segmentation is performed in the combined intensity-texture-position feature space in order to produce connected regions that correspond to real-life objects. The initial number of clusters is estimated by applying a variant of the MaxiMin algorithm to a set of non-overlapping image blocks. The problem of disconnected regions is alleviated by integrating K-Means with a component labelling procedure, i.e. at each iteration the initial clusters (possibly disconnected) are broken down into the minimum number of connected regions using a recursive four-connectivity component labelling algorithm, and the centres of the new regions are recalculated.
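The baseline K-Means step that KMCC builds upon can be sketched as follows. This is a generic implementation, not the KMCC algorithm itself; the connectivity constraint is only indicated in the comments.

```python
import random

def kmeans(points, k, iters=20, init=None, seed=0):
    """Plain K-Means clustering in feature space (e.g. per-pixel colours).

    Illustrates the classic drawbacks noted above: k must be given,
    nothing constrains spatial connectivity, and clusters are assumed
    roughly spherical.  KMCC would additionally split spatially
    disconnected clusters into connected regions after each iteration
    and recompute the centres of the resulting regions.
    """
    rng = random.Random(seed)
    centres = list(init) if init is not None else rng.sample(points, k)
    for _ in range(iters):
        # assignment step: each point joins its nearest centre
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centres[c])))
            clusters[i].append(p)
        # update step: recompute each centre as the cluster mean
        for i, cl in enumerate(clusters):
            if cl:
                centres[i] = tuple(sum(dim) / len(cl) for dim in zip(*cl))
    return centres
```

In a segmentation setting, `points` would be the per-pixel feature vectors and the resulting labels would be mapped back onto the image plane, which is exactly where disconnected regions can appear.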


Other approaches from this category are based on parametric models of feature distributions. The best-known approach from this category is Blobworld [29]. In this approach, the joint distribution of colour, texture, and position features is modelled with a mixture of Gaussians. The parameters of this model are then estimated using the Expectation-Maximization (EM) algorithm and the final segmentation is defined based on the pixels' cluster memberships, i.e. pixels are assigned labels from the clusters for which they attain the highest likelihood. The number of mixture components (K) required by the EM algorithm is found by repeating the clustering for a selected range of K, where the initial parameters for each EM process are found from K predefined square windows in the image. The final partitioning is selected from the set of K solutions according to the Minimum Description Length (MDL) principle.

Since the same simple parametric shape for all clusters cannot handle the complexity of real feature spaces, and mixture models also require the number of clusters as a parameter, density estimation-based non-parametric methods have recently gained in popularity. In the approaches from this category, cluster centres are automatically identified based on the assumption that dense regions in the feature space correspond to local maxima (modes) in the probability density function. The clusters associated with each mode are delineated based on the local structure of the feature space [91]. One of the most powerful techniques allowing detection of modes of the density is the Mean-Shift procedure, recently rediscovered in [55, 56, 33] after being forgotten for more than twenty years. In this procedure, kernels (windows) initiated at each pixel iteratively move in the direction of the maximum increase in the joint density (i.e. toward regions where the majority of feature points reside) until they reach local maxima (peaks) of the density (i.e. stationary points in the estimated density), leading to the identification of the modes of the density (cluster centres). Note that flat plateaus in the density may cause the procedure to stop, therefore modes which are closer than the width of the kernel need to be grouped together. Each pixel is then associated with a significant mode of the density located in its neighbourhood. Although the technique was originally designed for image segmentation by analysis of the colour feature space only [55], more recently it was also successfully applied to the joint spatial-range domain [33]. An interesting property of the approach is that it successfully creates large gradient regions which are typically over-segmented by other clustering methods. Its state of the art performance has encouraged many researchers to utilize the approach in various applications such as image retrieval [42], object-based video coding for MPEG-4 [102] and shape detection and recognition [103].
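The core of the Mean-Shift mode-seeking procedure can be illustrated with a one-dimensional, flat-kernel sketch (a simplification of the cited formulations, which operate on multi-dimensional feature vectors and, in [33], in the joint spatial-range domain):

```python
def mean_shift_mode(points, start, bandwidth, iters=100):
    """Flat-kernel mean shift on a 1-D feature sample.

    The window centre is repeatedly moved to the mean of the points
    falling inside it, i.e. in the direction of maximum increase of
    the local density, until it settles on a mode.
    """
    x = start
    for _ in range(iters):
        inside = [p for p in points if abs(p - x) <= bandwidth]
        if not inside:
            break
        new_x = sum(inside) / len(inside)
        if abs(new_x - x) < 1e-9:   # stationary point of the density
            break
        x = new_x
    return x
```

Running this from every pixel's feature value and grouping the modes that end up closer than the bandwidth yields the cluster centres described above.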



2.4.2 Mathematical Morphology

One of the most prominent methods from this category is the watershed transform [92], which associates a region with each minimum of the image gradient. The application of the watershed transform typically involves three stages: (i) pre-processing - the image is filtered in order to reduce the presence of noise, (ii) marker extraction - starting points for the flooding process are extracted, e.g. small homogeneous regions, and (iii) watershed transform - the topographic surface of a gradient image is flooded by sources located at the markers. As the markers grow, dams are erected to separate lakes produced by different sources. The final segmentation is defined by the set of dams produced during the flooding process. To reduce the over-segmentation due to the many local minima caused by noise, multi-scale gradient operators have been proposed in [93].
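The flooding stage can be sketched as a marker-driven priority flood on a toy gradient image. This is a simplification: real watershed implementations handle plateaus and explicit dam construction more carefully.

```python
import heapq

def watershed(gradient, markers):
    """Marker-driven flooding of a gradient image (a list of rows).

    markers: dict mapping (row, col) -> label; these are the flooding
    sources.  Pixels are popped in order of increasing gradient value,
    so each basin grows from its marker up the topographic surface
    until it meets a neighbouring basin.
    """
    rows, cols = len(gradient), len(gradient[0])
    labels = dict(markers)
    heap = [(gradient[r][c], r, c) for (r, c) in markers]
    heapq.heapify(heap)
    while heap:
        _, r, c = heapq.heappop(heap)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in labels:
                labels[(nr, nc)] = labels[(r, c)]   # flood from this basin
                heapq.heappush(heap, (gradient[nr][nc], nr, nc))
    return labels
```

On a one-row "gradient" with a ridge in the middle and a marker at each end, the two basins grow from the low ends and meet at the ridge, which is where a dam would be erected.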

2.4.3 Graph Theoretic Algorithms

In Graph Theoretic approaches, images are represented as weighted graphs, where nodes correspond to pixels or regions and the edge weights encode information relevant to the segmentation, e.g. pairwise homogeneity or edge strength [19, 59, 28, 104]. The segmentation is obtained by partitioning (cutting) the graph so that a criterion is optimized. Several approaches from this category utilize Region Adjacency Graphs (RAGs), where nodes represent regions while edges contain region pairwise dissimilarities [105, 59]. The pairwise dissimilarities are computed using a distance between the regions' colour features. RAGs can be simplified by successive mergings of neighboring regions; however, these operations need to be guided by global information to avoid erroneous solutions. An interesting representation that emerged from the RAG structure was the Shortest Spanning Tree proposed by Morris et al. in [19] and applied to image segmentation and edge detection problems. The approach was later extended to colour images in [94]. A fast version of the Recursive Shortest Spanning Tree (RSST) was proposed in [106, 95]. An equivalent region growing algorithm was studied extensively in the work of Salembier [59] and used for the creation of Binary Partition Trees (BPT), a hierarchical structure that was shown to be useful for filtering, retrieval, segmentation and coding [38, 107].

Although the above approaches are typically computationally efficient, the merging criteria are based on local properties of the graph, which may lead to suboptimal solutions when extracting the global impression of a scene. To alleviate this problem, a new global graph-theoretic criterion for measuring the goodness of an image partition, called the Normalized Cut, was proposed in [28]. In this approach, pixels are represented as a weighted undirected graph where all nodes in the graph are connected and the weight of each edge is a function of the pairwise similarity both in spatial and feature space. The graph is partitioned into two disjoint sets by minimizing the value of the normalized cut, defined as the cut cost (total weight of the removed edges) as a fraction of the total edge connections to all nodes in the graph. The minimization of this criterion can be performed efficiently based on a generalized eigenvalue problem, i.e. the eigenvectors are used to partition the image into two disjoint regions and, if desired, the process is continued recursively. In recent years the technique has attracted considerable attention in the CBIR community due to its state of the art performance - see for example [58].
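Returning to the merging strategies discussed at the start of this subsection, the greedy pairwise region merging underlying RSST and BPT construction can be sketched as follows. This is a naive O(n²)-per-merge illustration with scalar region features, not the fast algorithm of [106, 95].

```python
def merge_regions(features, adjacency, n_final):
    """Greedy pairwise merging in the spirit of RSST / BPT building.

    features: dict region_id -> (feature_sum, pixel_count).
    adjacency: iterable of (a, b) pairs of neighbouring region ids.
    The two most similar adjacent regions (closest mean feature) are
    merged repeatedly until n_final regions remain; the sequence of
    merges defines a binary tree over the initial regions.
    """
    regions = dict(features)
    adj = {tuple(sorted(p)) for p in adjacency}
    def mean(r):
        s, n = regions[r]
        return s / n
    while len(regions) > n_final and adj:
        # the locally best merge: closest pair of adjacent region means
        a, b = min(adj, key=lambda p: abs(mean(p[0]) - mean(p[1])))
        sa, na = regions.pop(a)
        sb, nb = regions.pop(b)
        merged = a                       # reuse a's id for the merged region
        regions[merged] = (sa + sb, na + nb)
        # rewire adjacency: anything touching a or b now touches merged
        new_adj = set()
        for p, q in adj:
            p = merged if p in (a, b) else p
            q = merged if q in (a, b) else q
            if p != q:
                new_adj.add(tuple(sorted((p, q))))
        adj = new_adj
    return regions
```

Because each merge decision looks only at local pairwise dissimilarities, the sketch also makes plain why such methods can reach suboptimal global partitions, which is precisely the motivation given above for the Normalized Cut criterion.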

2.5 Selected Methods for Top-down Segmentation: A Review

It is common knowledge that fully automatic image segmentation based solely on low-level processing techniques often fails, not only due to the complexity of the real world, e.g. noise, illumination, reflections, shadows and colour variability within objects etc., but also due to the fact that the interpretation of the scene is typically context dependent. These limitations of the bottom-up approaches can be alleviated by the utilization of prior knowledge of the world or user interactions to additionally constrain the partitioning problem. As already explained in section 2.3, the prior knowledge can be general, e.g. boundary smoothness, or application specific, e.g. models of expected semantic objects. This section briefly reviews selected approaches to supervised segmentation and to segmentation and detection based on shape models.

2.5.1 Semi-Automatic Segmentation

Various strategies for user interaction for the extraction of accurate semantic objects from images have been proposed in the past. They can be grouped into three classes [80]: feature-based [81, 82, 83], contour-based [84, 34], and region-based [85]. Each category of interactions requires a different approach to the automatic process performing the segmentation.

Feature-based Interactions

Feature-based approaches allow the user to draw on the scene, typically by placing markers in the form of single point seeds or by dragging a mouse over the image and creating a set of labelled scribbles for objects to be segmented. Typically, tw o or m ore different m arkers(or scribbles) can have the same label, indicating different p arts of the sam e object in the image. Probably the best know n exam ple of an approach guided by such interactions is Seeded Region Growing (SRG) technique [81]. This m ethod is controlled by a small 23

2. IMAGE SEGMENTATION, APPLICATIONS AND EVALUATION: A REVIEW

number of pixels, known as seeds, chosen by the user from K sets. The process evolves inductively from the seeds, labelling pixels which border at least one of the labelled regions. In each step, a single unlabelled pixel with the colour most similar to the mean colour of one of the adjacent labelled sets is added to that set. The process is repeated until all pixels are allocated. The approach works well in noisy images but is sensitive to seed initialization (the size and position of the seeds may affect the initial average colour and consequently the subsequent iterations) and in some cases results in jagged boundaries. The approach also seems unsuitable for cases where the region of interest consists of more than one homogeneous subregion, and [81] states that the method may not be applicable to highly textured images.

Another approach to partitioning the scene into objects based on scribbles was proposed by Chalom and Bove in [82] and further improved by O'Connor [32]. Both approaches [82, 32] model the probability distribution function (PDF) of each feature (colour, texture and location) parametrically as a sum of Gaussian PDFs. The segmentation is performed using an Expectation Maximization (EM) algorithm. The main drawbacks of the above methods are the additional requirement of an estimated number of modes for each object and the relatively large number of user interactions required.

An elegant and efficient solution to image partitioning based on scribbles was described in [38], where the input image is automatically pre-segmented into regions and represented as a Binary Partition Tree (BPT) to allow rapid responses to users' actions (an approach based on morphological flooding performed using a pre-computed tree representation was also presented in [83]). The tree structure is used to encode similarities between regions pre-computed during the automatic segmentation.
This structure is then used to rapidly propagate the labels from scribbles provided by the user to large areas (regions) of the image. Recently, a very interesting approach was proposed in [108], where the globally optimal segmentation is found using the Graph Cuts approach. The user's scribbles provide hard constraints for the segmentation, while additional soft constraints incorporate both boundary and region information, i.e. the cost function is defined in terms of boundary and region properties of the segments. The relative importance of the boundary properties term and the region properties term can be controlled by the user. Interestingly, the method is valid for N-D images, e.g. 3-D images or even video sequences treated as a single 3-D volume.
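The SRG procedure described above can be sketched as follows. This is a minimal single-channel rendering; the priority-queue bookkeeping and tie-breaking details are choices of this sketch, not prescribed by [81].

```python
import heapq
import numpy as np

def seeded_region_growing(img, seeds):
    """Toy Seeded Region Growing: seeds maps label -> list of (row, col).
    Unlabelled border pixels are absorbed one at a time by the adjacent
    region whose current mean intensity they match best."""
    labels = np.zeros(img.shape, int)          # 0 = unlabelled
    sums, counts, heap = {}, {}, []
    for lab, pts in seeds.items():
        sums[lab], counts[lab] = 0.0, 0
        for (r, c) in pts:
            labels[r, c] = lab
            sums[lab] += img[r, c]; counts[lab] += 1

    def push_neighbours(r, c, lab):
        mean = sums[lab] / counts[lab]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if (0 <= rr < img.shape[0] and 0 <= cc < img.shape[1]
                    and labels[rr, cc] == 0):
                heapq.heappush(heap, (abs(img[rr, cc] - mean), rr, cc, lab))

    for lab, pts in seeds.items():
        for (r, c) in pts:
            push_neighbours(r, c, lab)
    while heap:
        _, r, c, lab = heapq.heappop(heap)
        if labels[r, c]:                        # already allocated
            continue
        labels[r, c] = lab
        sums[lab] += img[r, c]; counts[lab] += 1
        push_neighbours(r, c, lab)
    return labels
```

On a toy image whose left half is dark and right half is bright, two single-pixel seeds are enough to recover the two halves, which illustrates both the strength (few interactions) and the sensitivity to seed placement noted above.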

Contour-based Interactions

Contour-based approaches typically require initialization of the contour model close to the object's borders or placing a seed point in the interior of the region. In some applications however, e.g. medical image analysis [109], the rough


position of the object is known a priori and the initialization of the contour can be achieved automatically. Typically, the segmentation is based on the deformable model paradigm of Active Contours (often called Snakes and also referred to as Variational Methods), first proposed in the seminal work of Kass et al. [64]. In these approaches, the models incorporate prior knowledge about a contour's smoothness and resistance to deformation. A regularized estimate of a contour is obtained by applying external forces that attract the model to edges present in the image. It has been shown in [110] that the model can be made less sensitive to the initialization by adding an additional "internal" force imitating inflation of the contour.

The first active contour models were implemented using parametric representations of the evolving curves [64]. One of the main advances in the area was the introduction of Geodesic Active Contours [111, 112], which are parameterization-free. In these approaches the segmentation involves solving an energy-based minimization problem by the computation of geodesics or minimal distance curves [112]. The evolving contour is embedded as the zero level set of a higher dimensional surface (typically a signed distance function). The entire surface is then evolved, minimizing a metric defined by the curvature and the image gradient. Although more computationally expensive than the parametric approaches, the implicit implementation allows convenient automatic handling of changes of contour topology, i.e. splitting and merging, allowing simultaneous detection of several objects. Geodesic Active Contours are also less sensitive to the initialization than their parametric counterparts, requiring only that the initial curve be placed completely inside or outside of the object boundaries.
Recently, motivated primarily by the (region-based) Mumford-Shah functional [113], the active contour formulation has been extended to deal with multiple cues [65] (e.g. by including region-based terms), leading to Active Regions models which further increase robustness to the initialization and noise. For example, in [114] supervised texture segmentation was obtained by guiding the curve evolution by both (i) a boundary-based (geometric) term and (ii) region-based (statistical) terms. Finally, it should be noted that the variational approaches can also be used for fully automatic segmentation. For example, in the approach proposed in [65], the initial region seeds are scattered randomly over the entire image, followed by curve evolution aided via Bayes/Minimum Description Length (MDL) criteria. In [66] an unsupervised textured image segmentation method based on geodesic active contours is proposed where the regions and the parameters of the distributions are updated simultaneously, thus dynamically learning the model of each region.
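The level-set embedding used by these implicit methods can be illustrated with a toy example: a circle represented as the zero level set of a signed distance function, with one explicit step of motion in the normal direction standing in, as a deliberately simplified surrogate, for the full geodesic flow (the grid size, radius and speed are arbitrary choices of this sketch).

```python
import numpy as np

# Embed a circle of radius 10 as the zero level set of a signed distance
# function phi (negative inside, positive outside) -- the representation
# used by geodesic active contours.
n = 64
y, x = np.mgrid[0:n, 0:n]
phi = np.hypot(x - n / 2, y - n / 2) - 10.0

inside = phi < 0                    # current region (interior of the contour)

# One explicit step of motion in the normal direction, phi_t = -c |grad phi|,
# a toy surrogate for the curvature- and image-driven geodesic flow.
gy, gx = np.gradient(phi)
phi_next = phi - 0.5 * np.hypot(gx, gy)   # speed c = 0.5 -> contour expands
```

Because the contour is carried implicitly by `phi`, a later split or merge of the zero level set needs no special handling, which is exactly the topological flexibility noted above.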


2.5.2 Shape Driven Segmentation

Shape Models

While methods based on active contours incorporate a priori knowledge which is relatively general, they typically require careful initialization of the model close to the object's borders or placing a seed point in the interior of the region. In some applications it is possible to incorporate much more specific models, further reducing ambiguities and leading to fully automatic techniques for object-based segmentation, detection and recognition. This section focuses on shape models and their applications.

Shape models can be divided into two categories: (i) special-purpose deformable templates carefully constructed ("hand-crafted") for a particular problem [115] and (ii) templates derived semi-automatically via statistical analysis of training examples [116, 117]. Arguably the best known statistical shape model is the Point Distribution Model (PDM) (also called Active Shape Models (ASMs) or "Smart Snakes") proposed by Cootes and Taylor [70, 117]. The model is derived by examining the statistics of the coordinates of labelled points over the training set via the Karhunen-Loeve transform. The model gives the average position of the points and a description of the main modes of variation. The approach was later extended to models capable of synthesizing new images of the object of interest, the so-called Active Appearance Models (AAMs) [78]. Other statistical approaches to shape modeling include the methods proposed by Pentland and Sclaroff [116] (where the contours of prototype objects are represented using Finite Element methods and variability is described in terms of vibrational modes), Lades et al. [118] (where shape and grey level information is modelled using an elastic mesh and Gabor jets), Mardia et al. [119] and Grenander et al. [120] (the use of which in image segmentation, however, appears difficult [78]).
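The core of the PDM construction can be sketched as PCA (the Karhunen-Loeve transform) on aligned, flattened landmark coordinates. The function names, the variance threshold, and the omission of the alignment/normalization steps of [70, 117] are all simplifications of this sketch.

```python
import numpy as np

def train_pdm(shapes, var_kept=0.98):
    """Toy Point Distribution Model: shapes is (N, 2k) -- N training shapes,
    each with k landmarks flattened to (x1, y1, ..., xk, yk), assumed
    already aligned.  PCA gives the mean shape and main modes of variation."""
    mean = shapes.mean(axis=0)
    X = shapes - mean
    cov = X.T @ X / (len(shapes) - 1)
    vals, vecs = np.linalg.eigh(cov)            # ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]      # descending
    t = np.searchsorted(np.cumsum(vals) / vals.sum(), var_kept) + 1
    return mean, vecs[:, :t], vals[:t]          # mean, modes P, variances

def synthesize(mean, P, b):
    """New shape instance x = mean + P b, for mode weights b."""
    return mean + P @ b
```

For a training set of circles of varying radius, a single mode suffices to explain the variation, and projecting a training shape onto the model reconstructs it exactly; varying `b` within a few standard deviations generates plausible new instances, which is how model instances are produced for fitting.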

Utilization of Statistical Shape Models for Image Segmentation

There are several different approaches which can be taken to make use of statistical shape models in the segmentation process [70, 117, 72, 73, 74, 75, 76, 71]. However, all involve an optimization step finding the set of parameters which best match the model to the image (or to a hypothetical group of regions from the image). The most straightforward approach is to fit different instances of the expected object to the image. For example, in the approach proposed by Cootes et al. [70, 117], different sets of parameters are used to generate instances of the


shape from the PDM model, which are then projected into the image. The set of parameters which minimizes a measure of the cost of fitting the instance to the image is chosen as the best interpretation of the object. Note that the above use of statistical models shares some similarities with approaches matching a single shape to an unseen image - see for example [121].

Recently, a considerable number of approaches have been proposed for the integration of statistical shape models into the variational segmentation process [72, 73, 74, 75, 76], where the evolved contour is constrained not only by the smoothness and the fidelity to the segmented image but also by the compatibility with the expected shape. The main difficulty in such approaches is the need to account for possible pose transformations between the model and the object present in the segmented image. An example of such an approach was proposed in [72], where shape information is incorporated into the evolution process of Geodesic Active Contours. An initial curve is embedded as the zero level set of a higher dimensional surface (in this case a signed distance function). The training set consists of a set of surfaces and the shape model is built over the distribution of surfaces, similar to the approach from [70] for constructing PDMs. At each step of the evolution, the maximum a posteriori (MAP) position and shape of the object is estimated based on the model and the image gradient. Finally, an active contour is evolved based on image gradients and curvature, and towards a maximum a posteriori estimate of shape and pose.

Finally, statistical shape models can be used to aid the region grouping process. One such approach was proposed by Sclaroff and Liu [71].

The method consists of two stages: (i) over-segmentation using the mean-shift colour region segmentation algorithm and (ii) region merging via grouping and hypothesis selection guided by deformable shape models. In the second stage, various combinations of candidate adjacent region groupings are tested. The candidate regions are selected based on colour characteristics. The number of merging hypotheses is limited by evaluating edge strengths between the initial regions. Each grouping hypothesis is matched with the model by deforming the model in order to minimize a cost function consisting of three terms: (i) a colour compatibility term, (ii) an area overlap between model and grouping hypothesis, and (iii) a deformation term. The final partitioning of the whole image is obtained by enforcing global consistency, i.e. a global cost function consisting of two terms, (i) the sum of the costs of the individual grouping hypotheses and (ii) the number of groupings, is minimized using the Highest Confidence First (HCF) algorithm. The system was shown to perform well on images containing fruits, leaves, fish, and blood cells. Although the technique can only segment objects consisting of one colour, utilization of the shape model allowed correct segmentation in the presence of shadows, poor lighting conditions and object overlapping.


2.6 Evaluation

Meaningful assessment of performance is crucial for the development of any algorithm. However, the subject of objective evaluation of segmentation quality has received relatively little attention in the literature [122, 123] compared to the large number of publications on the topic of image segmentation itself. This state of affairs is surprising, since a consistent and objective evaluation criterion is clearly crucial for the comparison of algorithms and could help answer fundamental questions regarding the feasibility of incorporating segmentation in the context of a particular application.

The remainder of this chapter discusses the difficulties that must be faced in comparing segmentation results, briefly describes evaluation methods proposed in the literature, and explains the methodology adopted by the author to evaluate the approach discussed in the next chapter. For the sake of brevity, the discussion focuses on the evaluation of region-based image segmentation; however, many issues presented here are relevant to other segmentation scenarios.

2.6.1 Evaluation Strategies

The current practice in the field of region-based segmentation is to justify new advances by providing a set of qualitative results, i.e. by presenting selected segmentation examples. Although such an approach may be sufficient to illustrate the effects of low-level processing, e.g. edge detection, it does not provide a satisfactory demonstration of the generality of segmentation methods. This is especially important in the context of CBIR systems operating with general content.

Another alternative is subjective ad-hoc assessment by a representative group of human viewers. Subjectivity should be minimized by respecting strict evaluation conditions, e.g. [123] suggests following the recommendations for video quality evaluation developed by the ITU [124, 125]. However, such a process requires a significant number of human evaluators and is very time-consuming. Because of its high cost, such evaluation is extremely rare. The author is familiar with only one such study, presented in [98], where subjective experiments provided the basis for a better understanding of human perception of segmentation artifacts.

From another point of view, segmentation is almost never the final objective in any application. Typically (e.g. in CBIR), it is just one step towards scene understanding. In fact, the main difficulty in segmentation evaluation lies in establishing a criterion adequate to a given application [123].

Therefore, it should be possible to perform a meaningful objective assessment of a grouping


algorithm by measuring the overall performance of the system in which it is integrated. However, in the case of CBIR systems, such an approach is often impractical due to its complexity. Usually the performance of the system depends on combinations of parameters of the system and of the segmentation algorithm itself. Therefore, it is difficult to guarantee optimal evaluation conditions. Moreover, in recent years it has become apparent that the key element of such systems is user interaction, and relevance feedback [10] in particular. Unfortunately, it is practically impossible to take such factors into account in extensive evaluation experiments such as those required for the development of a new segmentation approach.

Since the development carried out in the next chapter requires multiple series of extensive evaluation experiments, all three of the above strategies had to be discarded. Instead, objective automatic evaluation methods based on measures capable of adequate assessment of segmentation quality are investigated.

There are two main classes of automatic assessment methods: standalone and relative. The first is typically used only in cases where no reference segmentation is available. It exploits available knowledge about the expected properties of desirable segmentations for a particular application. The main difficulty lies in establishing an evaluation measure suitable for the type of content and application addressed. Since the results of such evaluation do not necessarily correspond to human perception of segmentation quality [126, 127], this scenario is not considered in this thesis. However, it should be noted that despite the above limitation, the problem of standalone evaluation of segmentation quality is central to the problem of segmentation itself. As a matter of fact, there is a certain analogy between the features proposed for standalone evaluation in [126] and the geometric properties of regions proposed for incorporation within the segmentation framework in the next chapter.

Relative evaluation methods are based on similarity measures used to quantify differences between the evaluated segmentation and a reference segmentation, often referred to as ground-truth. This type of evaluation is significantly more reliable than the standalone approach; however, large collections of test data with a ground-truth are rarely available. It should also be recognized that relying on a single reference segmentation mask for each image may be questionable, especially in the context of CBIR.
The quality of a partitioning is query and user dependent, and possible queries are typically not known prior to segmentation. A ground-truth collection consisting of a single mask for each image does not take such ambiguities in the partitioning of the scene into account. This issue is discussed in more detail in section 2.6.4, which describes the creation of the ground-truth for the collection utilized in all experiments presented in the next chapter. Despite the above limitation, relative evaluation methods represent the best trade-off between reliability of evaluation (value of the information provided)


and the practicability of the extensive evaluation experiments required for the next chapter. Therefore, the discussion presented in the following sections is limited to relative evaluation methods.

2.6.2 Relative Evaluation Methods

One of the main difficulties in objective evaluation is finding an adequate measure of segmentation quality aligned with human perception of segmentation quality - see the discussions in [123] and in [98]. Since none of the methods proposed in the literature is currently well established in the research community, this section gives a brief review of the available approaches most relevant to the experiments in this thesis and provides arguments for the adopted solution. For a more comprehensive review of existing methods the reader should refer to [127, 98].

One of the earliest sets of evaluation methods suitable for video segmentation was proposed in [128, 129]. They evaluate both spatial and temporal accuracies for scenes composed of two objects (foreground and background). In [128] spatial accuracy is measured in terms of misclassified pixels. Weighting of such misclassified pixels according to their distance to the reference object was introduced in [129]. A more advanced methodology was proposed in [123], where a comprehensive set of relevant features to be compared was identified and objective quality metrics combining these features were suggested. The proposed classes of spatial features included: (i) shape fidelity - computed similarly to the spatial accuracy measure from [129]; (ii) geometrical similarity - based on similarities of simple geometrical features like size, position, elongation and compactness; (iii) edge content - based on similarities between the edge contents of the reference and estimated objects; and (iv) statistical data similarity, e.g. similarity of the brightness and "redness" values of the reference and estimated objects. Although the above list of features is quite comprehensive, the usefulness of combining all the different measures into one single evaluation criterion is questionable. Combining various features using ad-hoc weights seems impractical, since in many cases analysis of such combined results may still require manual inspection in order to fully understand which features contributed to the combined error (and in what ways). In this thesis, simpler, but easier to interpret, measures are preferred.

A very interesting recent study of the impact of topology changes, in particular added regions and holes, was presented in [98]. Subjective tests were carried out to measure the annoyance of these artifacts when presented alone or in combination.

The experiments indicated that, independently of content, annoyance values corresponding to added regions quickly reach a saturation level for larger sizes. They also confirmed the common hypothesis that holes contribute more annoyance than added regions of the same size. Another important output of these experiments was the demonstration that various types of topological errors contribute differently to the overall annoyance. Moreover, the proportions in which they contribute to the annoyance are content dependent. Therefore, more studies have to be carried out on this subject before the combination of multiple errors and such differentiations can be meaningfully incorporated into automatic evaluation metrics for general content.

A relatively simpler method, yet one well aligned with human intuition, was proposed in [127]. It was specifically designed for the evaluation of image segmentation in the context of CBIR systems. It is primarily based on the approach from [129] for the evaluation of spatial accuracy, but extended to still image segmentation where both the evaluated segmentation and the ground-truth mask might contain several regions. Special consideration was devoted not only to the accuracy of boundary localization but also to over- and under-segmentation. To tackle such issues, the evaluation starts by establishing exclusive correspondences between regions from the reference and evaluated masks. Then, three different types of errors are taken into account: (i) accuracy errors for the associated pairs of regions from the reference and evaluated masks, (ii) errors due to under-segmentation, computed based on non-associated regions from the reference mask, and finally (iii) errors due to over-segmentation, computed from non-associated regions from the evaluated mask. Although the method is quite simple, a set of convincing evaluation results was provided in [127], indicating that the method correlates well with subjective evaluation. The above method is adopted in this thesis due to the small number of parameters involved and its ease of interpretation.

A formal description of the method and justifications for the parameter values chosen for all experiments are provided in section 2.6.3. Finally, it should be noted that humans are very sensitive to errors corrupting objects with high semantic value. All the above studies regarding evaluation acknowledge the importance of an object's relevancy; e.g. such a relevance measure could indicate the likelihood that a particular object will be further manipulated and used [130]. However, due to the difficulties with providing such relevancy information within the reference segmentation, it is currently rarely taken into account during evaluation.

2.6.3 Adopted Evaluation Method

This section gives the formal description of the evaluation method proposed in [127] and discusses the author's implementation of this approach.


Figure 2.2: Functions used for weighting the contributions of misclassified pixels based on their distance from the reference region to which they belong: f_1(·, ·) is used for false negatives (pixels belonging to r_q but not in s_k), whereas f_2(·, ·) is used for false positives (pixels assigned to s_k but not in r_q) [129].

The Original Evaluation Approach from [127]

Let S = {s_1, s_2, ..., s_K} be the segmentation mask to be evaluated, comprising K regions s_k, and let R = {r_1, ..., r_Q} be the reference mask, comprising Q reference regions. Let a region also be defined as a set of pixels p. Prior to the calculation of segmentation errors, the correspondence between reference and evaluated regions is established. The pairing strategy involves exclusive associations of regions r_q from the reference mask with regions s_k from the evaluated mask on the basis of region overlap. In particular, maximization of |r_q ∩ s_k| was suggested in [127]. It should be noted that, due to over- and under-segmentation of certain parts of the image, not all regions from the evaluated and the reference mask are associated.

Once correspondences are established, the spatial errors can be computed. Let A = {(r_q, s_k)} denote the set of region pairs identified using the pairing procedure, and let N_R, N_S denote the sets of non-associated regions of masks R and S respectively. For every pair (r_q, s_k) from set A the spatial error E_q is evaluated using the criterion from [129]:

E_q = Σ_{p ∈ r_q, p ∉ s_k} f_1(p, r_q) + Σ_{p ∈ s_k, p ∉ r_q} f_2(p, r_q)

where f_1(·, ·) weights the false negatives and f_2(·, ·) weights the false positives according to their distance from the reference region, as illustrated in Figure 2.2. (In the author's implementation, both weighting functions are replaced by the unit function.) During pairing, an additional constraint |r_q ∩ s_k| / |r_q ∪ s_k| > 0.5 is imposed to prevent associations between regions with only marginal overlap. In other words, the evaluated and the reference regions can be paired only if their intersection is at least half the size of their union. This restriction results in a stricter evaluation criterion; however, it was found to correspond better to human perception of the segmentation quality.
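The pairing rule and the unit-weighted spatial error can be sketched as follows, using binary NumPy masks. The greedy association order and the normalisation of the error by the reference-region area are assumptions of this sketch; the exact formulation is given in [127, 129].

```python
import numpy as np

def pair_regions(ref_masks, eval_masks):
    """Associate each reference region with at most one evaluated region
    by maximizing |r ∩ s|, subject to the additional constraint
    |r ∩ s| / |r ∪ s| > 0.5 described in the text.  Greedy over the
    reference regions (a simplification of the exclusive pairing)."""
    pairs, used = [], set()
    for qi, r in enumerate(ref_masks):
        best, best_ov = None, 0
        for ki, s in enumerate(eval_masks):
            if ki in used:
                continue
            ov = np.logical_and(r, s).sum()
            un = np.logical_or(r, s).sum()
            if ov > best_ov and un and ov / un > 0.5:
                best, best_ov = ki, ov
        if best is not None:
            pairs.append((qi, best))
            used.add(best)
    return pairs

def spatial_error(r, s):
    """Unit-weighted spatial error for an associated pair: false negatives
    (in r, not in s) plus false positives (in s, not in r), here
    normalised by the reference-region area (an assumption)."""
    fn = np.logical_and(r, ~s).sum()
    fp = np.logical_and(s, ~r).sum()
    return (fn + fp) / r.sum()
```

Regions left unpaired by `pair_regions` would then feed the under-segmentation errors (unmatched reference regions) and over-segmentation errors (unmatched evaluated regions) described above.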

2.6.4 Ground-Truth Creation

Although a handful of methods allowing satisfactory relative quality evaluation of segmentation results have been proposed in the past, practically no image collections with a ground-truth were publicly available at the time of carrying out the research presented in this thesis. To the best of the author's knowledge, no study using a ground-truth with a significant number of images of real scenes partitioned into homogeneous regions is currently available in the literature. For example, in [123] mainly MPEG-4 sequences with reference masks containing objects rather than homogeneous regions were used. Three MPEG sequences and one from the European IST project Art.live were used in [98]. Finally, the evaluation method from [127] was tested with synthetic images created using the reference textures of the VisTex database [131] and only seven, rather simple, images from the Corel gallery [132].

Since none of the above collections was sufficient or suitable for the experiments reported in the next chapter, the author committed significant effort to a collaboration with a colleague working in another university to create an appropriate ground-truth for a small collection of images. The collection contains 100 images from the Corel gallery [132] and 20 photographs from private collections taken using various digital cameras (including mobile phone cameras). The reference masks were created manually in a collaborative effort using a tool developed by the author. A screenshot of the tool is shown in Figure 2.3. Typical creation of a new reference region involves choosing a label for the new region, drawing its boundary on the original image and then filling its interior.

To ensure high accuracy of the masks, the area around the cursor is automatically zoomed. Furthermore, the mask can be edited in a Boundary Mode, in which only the region boundaries superimposed on the original image are displayed, allowing accurate localization of the borders on the original image and fine corrections. Additionally, the tool allows the definition of objects by grouping their constituent homogeneous regions into semantic objects, in the hope that such semantic information may be useful in the author's future research in this field.

The typical time required to manually partition an image was between 5-10 minutes, depending on the user's drawing skills and the complexity of the scene. Although this time seems acceptable, in practice the manual segmentation of several images proved tiresome. This provided very strong motivation for tools allowing semi-automatic segmentation. An example of such a tool is discussed in the next chapter.

Figure 2.3: GUI of the tool for manual creation of ground-truth segmentation.

The biggest difficulty in the creation of the ground-truth proved to be the specification of rules ensuring that manual annotators provide consistent segmentation of images into homogeneous regions. In fact, to the best of the author's knowledge, no formal description of the expected output of region-based image segmentation useful in the context of CBIR has been proposed in the literature. For example, Mezaris et al. simply claim that their segmentation algorithm should

"produce connected regions that correspond to the real-life objects shown in the image" and "provide fast segmentation with high perceptual segmentation quality" in order to be useful in the context of content-based image retrieval [30]. In the case of the ground-truth described here, four annotators were provided with a description of the desired segmentation rather than with a formal definition. The properties of the desired segmentation were specified as follows:

• The image should be divided into connected regions, as large as possible, with homogeneous colour or texture;

• Lighting effects, e.g. shadows, reflections, etc., should be ignored.

Unfortunately, the above description leaves annotators with the problem of deciding what terms like "colour homogeneity" and "large regions" mean in the context of a particular image.

In fact, the lack of a formal definition of the desired results represents a major challenge for the utilization of bottom-up segmentation in applications involving general content.

However, the main objective of the next chapter is to evaluate the potential impact of geometrical information in improving automatic segmentation and to investigate the problem of integrating evidence from multiple sources of information in a region merging process. Despite possible ground-truth inconsistencies, relative evaluation can still provide useful information on how well particular segmentation results correlate with the majority of the manually created masks. In other words, such a ground-truth can be used to obtain comparative results, e.g. to show that a particular method produces results closer to the majority of masks in the ground-truth. Nevertheless, it clearly cannot answer the question of how useful a given tool is in the context of CBIR systems or any other application. The author is also conscious that the size of the dataset in terms of the number of images seems far from optimal. However, it should also be noted that the ground-truth contains 1089 reference regions, which should still ensure statistically relevant results. Although the evaluation results obtained using the above collection should be interpreted with a certain reserve, the author believes that the approach used represents a significant departure from solely presenting selected qualitative results, which is currently common practice in the literature. The full dataset together with the ground-truth segmentation is presented in appendix B.

Finally, it should also be recognized that very recently the problem of the assessment of image and video segmentation has started to attract deserved attention within the research community, resulting in several articles devoted solely to this problem [133, 134, 135, 136, 137, 138].

2.7 Discussion

Unfortunately, in many applications, image segmentation has proven to be very challenging and has remained to a large extent an unsolved problem, despite being the subject of study for several decades. Although it is well recognized that in certain scenarios utilizing segmentation can be very successful, many researchers consider reliable unsupervised general-purpose image segmentation a hopeless goal. For instance, it is commonly accepted that the content of interest is a user-dependent quality and will therefore always depend on some degree of user interaction. Consequently, the problem of inferring content of interest from generic visual data without any human intervention is ill defined. This thesis attempts to objectively discuss the prospects of utilizing image segmentation in the context of CBIR systems, focusing primarily on the above-mentioned issues. Many of the above aspects of image segmentation are discussed throughout the thesis in the context of several applications. Word recognition in handwritten historical documents, presented in chapter 6, utilizes a simple segmentation-by-thresholding approach. Chapter 3 focuses primarily on the feasibility of utilizing contour-based syntactic features for improving the correspondence of the output of automatic segmentation of images representing real world scenes to the semantic objects present in these scenes, but also discusses some aspects of the semi-automatic segmentation approach proposed originally in [38, 107].

2.8 Conclusion

This chapter outlined the most important aspects of image segmentation in the context of content-based image retrieval. It aimed at discussing the main


categories of approaches and giving an overview of a small selection of the most commonly used techniques and/or those particularly relevant to the research carried out by the author in chapters 3 and 7. The subject of objective evaluation of segmentation quality is also discussed in some detail, providing a basis for the quantitative experiments carried out by the author in chapter 3.


CHAPTER 3

REGION-BASED SEGMENTATION OF IMAGES USING SYNTACTIC VISUAL FEATURES

THIS chapter presents a new method for segmentation of images into large regions that reflect the real world objects present in the scene. It explores the feasibility of utilizing the spatial configuration of regions and their geometric properties (the so-called syntactic features [18]) for improving the correspondence of image segmentations produced by the well-known Recursive Shortest Spanning Tree (RSST) algorithm [19] to semantic objects present in the scene, and therefore increasing its usefulness in various applications.

The main challenge is to assess the "strength" of the additional evidence for region merging provided by the spatial configurations of regions and their geometric properties, and to compare it with the evidence provided solely by the colour homogeneity criterion. Several extensions to the RSST algorithm [19] are proposed, among which the most important is a novel framework for integration of evidence from multiple sources of information within the region merging process. The framework is based on the Theory of Belief (BeT) [23] and allows integration of sources providing evidence with different accuracy and reliability. Other extensions include a new colour model and colour homogeneity criteria, practical solutions to the analysis of regions' geometric properties and their spatial configuration, and a new simple stopping criterion aimed at producing partitions containing the most salient objects present in the scene.

Extensive experimentation is used to assess the generality of the proposed modifications. It is demonstrated that syntactic features can provide an additional basis for region merging criteria which prevents the formation of regions spanning more than one semantic object, thereby significantly improving the perceptual



quality of the segmentation. The second part of this chapter describes a fast, easy-to-use and intuitive semi-automatic segmentation tool utilizing the automatic algorithm proposed in the first part.

3.1 Introduction

3.1.1 Motivation

The problem of partitioning an image into a set of homogeneous regions or semantic entities is a fundamental enabling technology for understanding scene structure and identifying relevant objects. Identification of areas of an image or a video sequence that correspond to important regions is the first step of many object-based applications. Unfortunately, the problem of segmentation remains to a large extent unsolved, despite being the subject of study for several decades. Although it is well recognized that in certain scenarios utilizing segmentation can be very successful, many researchers consider reliable unsupervised general-purpose image segmentation a hopeless goal. The research in this chapter was motivated by the author's belief that in a given application (e.g. CBIR) all possible sources of grouping evidence should be investigated and the quality of segmentation objectively evaluated in extensive experiments. The author shares the opinion presented in [29] that segmentation, while imperfect, is an essential first step for scene understanding and that a combined architecture for segmentation and recognition is needed. For example, sensible integration of segmentation features in an image indexing context should lead to more efficient object-based indexing solutions [139].

As described in chapter 2, a large number of approaches to image or video segmentation have been proposed in the literature [140, 38, 107, 28, 33, 29, 30, 31, 18]. They can be broadly classified as either Bottom-Up (visual feature-based) or Top-Down (model-based) approaches. This chapter focuses on automatic Bottom-Up segmentation, which does not require developing models of individual objects, and is potentially useful in the context of CBIR systems. Many existing approaches aim to create large regions using (relatively simple) homogeneity criteria based on colour, texture or motion. However, the application of such approaches is often limited as they fail to create meaningful partitions due to either the complexity of the scene or difficult lighting conditions.

In [18], Ferran and Casas propose a new source of evidence for region merging which they term Syntactic Features. Syntactic features could be potentially important to a wide range of segmentation applications. They represent geometric properties of regions and their spatial configurations. Examples of such features include homogeneity, compactness, regularity, inclusion or symmetry.


They advocate the use of the above features in bottom-up approaches as a way of partitioning images into more meaningful entities without assuming any application-dependent semantic models. In [18], they carried out a proof of concept of this idea using a quasi-inclusion criterion. Other approaches attempting to integrate geometric properties of regions include [64, 62, 65, 66, 67, 18, 68] - see also section 2.3.3 for a brief overview. Although the idea of using geometric properties within a region merging process to create more meaningful partitions is not new, the approaches currently available in the literature integrate the geometric features in a rather ad-hoc manner [62, 18, 68], typically developed to suit only particular geometric properties and not easily extensible to other features. Also, to the best of the author's knowledge, currently there is no objective evaluation available in the literature to support the usefulness of any such features.

Syntactic features can serve as an important source of information that helps to close the gap between low-level features and a semantic interpretation of the scene. Of course, they probably do not provide evidence "strong enough" to reliably group regions with very different colour, texture and motion into complex semantic objects. However, it can be argued that they can significantly improve the performance of automatic bottom-up segmentation techniques, bringing them closer to the semantic level by allowing the creation of large meaningful regions and enhancing their usefulness in CBIR systems. In other words, the creation of larger, more meaningful regions should facilitate the extraction of more meaningful local features useful for content-based indexing and retrieval. Furthermore, syntactic and semantic sources of information are not mutually exclusive, and if available they could, and probably should, be used together to complement the segmentation procedure. Syntactic features can be useful even in supervised scenarios, where they can facilitate more intuitive user interactions, e.g. by allowing the segmentation process to make more "intelligent" decisions whenever the information provided by the user is not sufficient.

This chapter focuses on the integration of syntactic features into the image segmentation process and proposes a practical solution for fully automatic, robust and fast segmentation that is potentially useful for object-based multimedia applications. The objective is to automatically segment images into large regions generally corresponding to objects or parts of objects, whilst avoiding the creation of regions spanning more than one semantic object. The potential of incorporating syntactic features into the segmentation process to meet this objective and allow satisfactory results with challenging real world images is evaluated by extensive experimentation.



3.1.2 Chapter Structure

The remainder of this chapter is organized as follows: the next section describes in detail the Recursive Shortest Spanning Tree (RSST) algorithm. Then, an overview of the proposed segmentation approach is presented in section 3.3. Next, the training and evaluation methodologies adopted throughout the chapter are described in section 3.4. Section 3.5 discusses and evaluates several improvements to a region's colour representation and the colour homogeneity criteria used during the region merging process. Section 3.6 discusses selected syntactic features, proposes practical solutions to the analysis of a region's geometric properties and their spatial configuration, and discusses the major difficulties related to the integration of syntactic features within a region merging framework. The novel framework for integration of evidence from multiple sources of information within the RSST algorithm, based on the Theory of Belief (BeT), is proposed in section 3.7. A new simple stopping criterion aimed at producing partitions containing the most salient objects present in the scene is proposed and evaluated in section 3.8. Future research is discussed in section 3.9 and the properties of the final complete approach are discussed in section 3.10. Section 3.11 describes a fast, easy-to-use and intuitive semi-automatic segmentation tool utilizing the automatic algorithm proposed in the first part of the chapter. Conclusions are formulated in section 3.12.

3.2 RSST Segmentation Framework

The Recursive Shortest Spanning Tree (RSST) algorithm is selected for the incorporation of syntactic features as it provides a convenient framework for the integration of a region's geometric properties, and also due to its speed. The RSST algorithm was first proposed by Morris et al. in [19] for image segmentation and edge detection problems. It was later extended to colour images in [94]. It was also shown to be useful for image coding [63]. One of its applications is object segmentation, extraction and tracking [85]. The COST 211quat project adopted RSST as the grouping method in the Analysis Model (AM) [140]. In [141] RSST was utilized for video-object segmentation based on an affine motion model. The most recent work regarding fast implementation of the algorithm can be found in [106, 95]. Similar region merging algorithms were studied extensively in the work of Salembier and Garrido for the creation of Binary Partition Trees (BPT) [107].

The RSST segmentation starts by mapping the input image into a weighted graph [19], where the regions (initially pixels) form the nodes of the graph and the links between neighboring regions represent the merging cost, computed according to a selected homogeneity criterion. Merging is performed iteratively. At


each iteration the two regions connected by the least cost link are merged. Merging two regions involves creating a joint representation for the new region and updating its links with its neighbors. The process continues until a certain stopping condition is fulfilled, e.g. the desired number of regions is reached or the value of the least link cost exceeds a predefined threshold.

The homogeneity criterion used to define the merging cost and the stopping criterion play a key role in obtaining the desired segmentation¹. Since at each iteration only the two regions connected by the least cost link are merged, the merging order is exclusively controlled by the function used to compute the merging cost. Let r_i and r_j be two neighboring regions. The merging cost is based solely on a simple colour homogeneity criterion defined as [142, 140, 85]:

C_orig(i, j) = ||c_i − c_j||_2 · (a_i · a_j) / (a_i + a_j)    (3.1)

where c_i and c_j are the mean colours of r_i and r_j respectively, a_i and a_j denote the region sizes, and || · ||_2 denotes the ℓ2 norm. It should be noted that the second factor encourages the creation of large regions² by decreasing the cost of merging for small regions. Alternative merging criteria based on colour can be found in [19, 143, 41, 18]. Various stopping criteria have been used in the past, including the desired number of regions [140], thresholding of the least link cost [19, 144] and a minimum acceptable value of PSNR between the original and segmented image [41] (reconstructed using mean region intensities).
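To make the mechanics concrete, the merging loop described above can be sketched as follows. This is an illustrative Python sketch, not the thesis implementation: region and edge structures are simplified, and a lazy priority queue with a union-find structure stands in for whatever data structures an optimized RSST would use.

```python
import heapq

def rsst_merge(regions, edges, target_regions):
    """Greedy RSST-style merging: repeatedly merge the two regions joined
    by the least-cost link until `target_regions` regions remain.

    regions: dict id -> (mean_colour tuple, size)
    edges:   iterable of (i, j) neighbour pairs
    """
    parent = {i: i for i in regions}

    def find(i):                          # union-find root with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def cost(i, j):                       # merging cost of eq. (3.1)
        (ci, ai), (cj, aj) = regions[i], regions[j]
        d = sum((a - b) ** 2 for a, b in zip(ci, cj)) ** 0.5
        return d * ai * aj / (ai + aj)

    heap = [(cost(i, j), i, j) for i, j in edges]
    heapq.heapify(heap)
    n = len(regions)
    while n > target_regions and heap:
        c, i, j = heapq.heappop(heap)
        ri, rj = find(i), find(j)
        if ri == rj:                      # endpoints already merged together
            continue
        fresh = cost(ri, rj)
        if c != fresh:                    # stale cost: re-queue and retry
            heapq.heappush(heap, (fresh, ri, rj))
            continue
        (ci, ai), (cj, aj) = regions[ri], regions[rj]
        # joint representation: size-weighted mean colour, summed sizes
        regions[ri] = (tuple((a * ai + b * aj) / (ai + aj)
                             for a, b in zip(ci, cj)), ai + aj)
        parent[rj] = ri
        n -= 1
    return {i: find(i) for i in parent}   # final label per initial region
```

The lazy invalidation trick (re-queueing stale costs) avoids rescanning all links after each merge; the optimized implementations cited above achieve the algorithm's reported speed with more careful data structures.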

However, none of the above criteria can produce large regions and at the same time ensure that the merging process will stop before merging important semantic boundaries. The RSST algorithm is fast and capable of producing a meaningful over-segmentation (typically more than 100 regions) useful in many applications - see for example [140]. However, the above colour homogeneity criterion is insufficient to produce meaningful results in applications where further simplification of the scene is required, i.e. it is incapable of producing large regions with a high probability of corresponding to semantic notions [38]. Very often the produced regions do not represent the true semantic structure of the scene. One of the reasons for this poor performance is the utilization of an identical simple homogeneity criterion at all stages of the merging process. In other words, the same merging criterion is used at the level of pixels and at stages where regions contain thousands of pixels.

Summarizing, the RSST is selected for the incorporation of syntactic visual features due to the following reasons:

¹The stopping criterion may not be required in certain applications, e.g. revealing the hierarchical structure of the scene by creating a Binary Partition Tree (BPT) [107].
²In fact, the factor encourages the creation of regions similar in size.


• It is based solely on colour homogeneity and, when used to produce segmentations with a small number of regions, often creates regions with geometric properties which are rare in the real world. Utilization of regions' geometric properties promises an intuitive remedy to this problem,
• It provides convenient access to geometric properties of every region at each iteration,
• It is capable of producing meaningful over-segmentation with well localized region contours,
• It is extremely fast, especially when compared with other graph theoretic algorithms, i.e. it takes less than 3 seconds on a standard PC (Pentium III 600 MHz) to segment an image of CIF size (352x288),
• It is easy to implement.

3.3 Overview of the Proposed Approach

To overcome the limitations of the original RSST algorithm, both the merging cost measure and the stopping criterion are modified and extended, e.g. by an additional set of features. Additional geometric properties are integrated in the RSST merging framework by employing new measures for calculating the cost of merging two neighboring regions. The new features control the merging order and provide a basis for a reliable stopping criterion. The new merging order is based on evidence provided by different features (colour and geometric properties), fused using an integration framework based on the Theory of Belief (BeT) [22, 23]. The impact of the above modifications on segmentation quality is demonstrated by rigorous experimentation using the evaluation methodology described in section 2.6.3.

The segmentation process is divided into two stages. The initial partition, required for structure analysis, is obtained by the RSST algorithm with the original merging criterion implemented as in [140], since it is capable of producing good results when regions are uniform and small. This stage ensures the low computational cost of the overall algorithm and also avoids the application of syntactic features to small regions with meaningless shape. This initial stage is forced to stop when a predefined number of regions is reached (100 for all experiments presented in this chapter). Then, in the second stage, the region representation is extended by the region boundary and, depending on the variant of the approach, by a new colour model. Subsequently, homogeneity criteria are re-defined based on colour and syntactic features and the merging process continues until a certain stopping criterion is fulfilled. The new merging order is based on evidence provided by different features (colour and geometric properties) fused using the


proposed integration framework. It should be stressed that all extensions proposed in the following sections mainly concern the second stage of merging, when regions already have meaningful shape.

3.4 Methodology Used for Training and Testing

All solutions proposed in this chapter are tuned and evaluated using the spatial accuracy error measure and the collection of images with ground-truth discussed in sections 2.6.3 and 2.6.4 respectively.

3.4.1 Training and Test Corpus

Recall from section 2.6.3 that the collection of manually segmented images contains 100 images from the Corel dataset and 20 images selected from various sources, such as keyframes from well-known MPEG-4 test sequences, a digital camera, and a mobile phone camera, segmented by four annotators. The full dataset together with the ground-truth segmentation is presented in appendix B. The images from the Corel dataset are typically well composed and contain objects with distinctive colour, while the images from other sources are often taken in difficult lighting conditions and contain objects with similar colour. Although the images from the Corel collection appear somewhat easier to segment due to the fact that different objects have distinctive colour, they also present the ultimate challenge to any approach utilizing features other than colour. In other words, the aim is to integrate the additional information, e.g. regions' geometric properties, in such a way that the segmentation of the images with difficult lighting conditions is improved, while at the same time the good performance in the case of images containing objects with distinctive uniform colour is retained.

Taking into account the relatively small number of images available, the proposed extensions are evaluated by two-fold cross-validation. The dataset is divided into two subsets (let us denote them as subsets A and B), each containing 60 images, i.e. 50 images from the Corel dataset and 10 from the other sources. Each time one of them is used for parameter tuning, the other subset is used as the test set. The final result is computed as the average error from the two trials. Also, all images showing selected segmentation results are generated during the cross-validation. However, for clarity of presentation, whenever a particular parameter is discussed its influence on segmentation results is demonstrated on the entire image collection, e.g. all plots are generated using the entire collection.



3.4.2 Development Methodology

For clarity, the various merging costs and stopping criteria are developed and tested separately. First, new merging costs controlling the merging order are proposed in sections 3.5, 3.6, and 3.7. They are evaluated by stopping the region merging process based on the number of regions available from the ground-truth and assessing the resulting segmentation mask in terms of spatial accuracy error. This helps to simplify the integration of various features within the merging cost and the preliminary evaluation of their generality and usefulness, and at the same time to avoid the complex influence of a stopping criterion on the results. Once useful features and their meaningful integration for controlling the region merging order are identified, various stopping criteria are discussed in section 3.8.

3.5 Colour Homogeneity

Colour homogeneity is a basic criterion for partitioning in practically all image segmentation approaches which can be found in the literature [29, 30, 31, 59, 18]. This section discusses some modifications to the original colour homogeneity criterion used within the RSST algorithm [140]. The main aim is to ensure that the most appropriate colour homogeneity criterion is employed before integrating other, potentially less influential, features.

3.5.1 Colour Spaces

Many recent publications in the area of colour segmentation advocate utilization of the so-called perceptually uniform colour spaces: CIE LUV and its improved version CIE LAB [28, 29, 30]. In the remainder of this chapter, colour is represented using the CIE LUV colour space, which is often considered a good trade-off between the approximation of perceptual uniformity and the cost of conversion from the commonly used RGB or YUV colour spaces.

3.5.2 Optimized Colour Homogeneity Criterion

Despite its simple form, the merging cost measure defined in equation 3.1 realizes relatively complex merging rules. It gives merging preference to small regions, even if the colour is significantly different, by "punishing" the merging of large regions. This illustrates the implicit assumption that all homogeneous regions (or objects) present in the scene should be of equal size. In other words, when segmenting an image with uniform colour, at each iteration all regions will have approximately equal size.


Figure 3.1: Influence of parameter q_avg on the spatial segmentation error.

Intuitively, the strong preference for merging small regions, even where the colour differences are significant, is one of the main reasons for the poor performance of the RSST algorithm in the final merging stages. Moreover, it is common knowledge that even uniform colour spaces do not guarantee that distances between very different colours agree with the human perception of uniformity. For example, let us imagine two links between pairs of regions, both with equal costs according to formula 3.1, but the first one between two large regions with quite similar colour and the second one between two smaller regions with more distinct colour. We would like to modify equation 3.1 so that the contribution of the colour distance to the overall merging cost is greater than the contribution induced by the region sizes, resulting in a smaller merging cost for the first link compared to the second. It is proposed here that a more suitable balance between the colour distance and the size-dependent scaling factor can be obtained using a new merging cost defined as follows:

C_avg(i, j) = ||c_i − c_j||_2^(q_avg) · (a_i · a_j) / (a_i + a_j)    (3.2)

where the parameter q_avg is found experimentally using the training collection and evaluated later on the test collection. Figure 3.1 shows the evaluation results for the approach with the modified merging cost measure C_avg utilized in the second merging stage, for different values of the parameter q_avg, obtained for the entire image collection using the optimal number of regions as the stopping criterion. It can be seen that values of q_avg greater than 4.0 lead to a significant decrease of the average spatial accuracy error. Interestingly, for q_avg = 4.0 the above modification is somewhat similar to the merging criterion proposed recently in [144] for segmentation of


luminance images³. In our case the lowest segmentation error is obtained for q_avg = 6, but the difference may be specific to the collection, evaluation method or stopping criterion used. The significant improvement from this trivial modification is also confirmed by cross-validation, where the average spatial segmentation error is reduced from 1.00 to 0.64. It should also be stressed that the improvement is consistent over the majority of image classes and was confirmed by manual inspection of the results. To provide as illustrative results as possible, an equal number of examples showing cases where the modification leads to a significant improvement (the first 5 rows) and examples where the modification resulted in a larger spatial segmentation error than the original RSST approach (the last five rows) are shown in Figure 3.4 (3rd and 4th column). It can be seen that, unlike the original approach, the modified version of the merging criterion allows the creation of large regions, which easily explains the significant reduction of the average spatial segmentation error.

3.5.3 Extended Colour Representation

In the initial merging stage, average values of colour components are sufficient for representing a region's colour. However, as segmentation progresses and regions grow, average values may become unsuitable for effective characterization of their colour. To ensure that this simplification does not compromise the merging order, an extended alternative colour model is proposed and evaluated. A fine and compact representation of a region's colour, motivated by the so-called Adaptive Distribution of Colour Shades (ADCS) proposed in [31], is used during the second merging stage. In this model, each region contains a list of colour/population⁴ pairs which can more precisely represent its complex colour variations. For example, imagine an extreme case where a chessboard-like object is initially pre-segmented into white and black squares. The ADCS representation of the region representing the whole chessboard is capable of accurately representing its colour, e.g. by two colour/population pairs, one for the black and one for the white colour, while an average colour simply would not allow distinguishing it from a completely grey object.

The ADCS is utilized in the proposed approach in the following way. After the initial segmentation, each region's colour representation (average values of colour components) is converted into the ADCS representation. Such conversion

³The solution proposed in [144] was motivated as being the geometrical mean of the criterion from equation 3.1 (when applied to the luminance component only) and a criterion computed as the squared difference between regions' mean luminance values.
⁴Where population refers to the ratio between the number of pixels with such colour and the total size of the region.



Figure 3.2: Example of ADCS representation (panels: "Initial Partitioning" and "Final").

is trivial, as at this stage a region's ADCS list contains only one colour/population pair representing its average colour and size. Subsequently, in the second merging stage, whenever two regions are merged, the joint representation is created by concatenating the lists from both regions. In other words, at the end of the segmentation process each region's colour is represented by an ADCS list of colour/population pairs, each corresponding to one of its composing initial regions and representing its average colour and the ratio between its size and the size of the final region. This extended colour representation has the advantage of low requirements both for storage and for updating the merging cost. Also, no prior assumption needs to be made about the underlying colour distribution. Figure 3.2 shows an example of the extended colour representation for a large face region.

It has been shown in [31] that two such ADCS representations can be efficiently compared using the Quadratic Distance [145]. Therefore, the colour difference between regions r_i and r_j is calculated as the quadratic distance d²_quad(I, J) between their corresponding colour distributions, characterized by the two ADCS representations I = (c_1^I, p_1^I), ..., (c_v^I, p_v^I), ..., (c_V^I, p_V^I) and J = (c_1^J, p_1^J), ..., (c_w^J, p_w^J), ..., (c_W^J, p_W^J), consisting of V and W colour/population pairs respectively:

d²_quad(I, J) = I^T A_I I + J^T A_J J − 2 I^T A_IJ J    (3.3)

where the matrices A_I, A_J and A_IJ express colour similarities between, respectively, I's colour pairs with themselves, those of J with themselves, and those of I with those of J. The colour similarities are computed based on the Euclidean distance, e.g. the matrix A_IJ = [a_{c_v^I c_w^J}] is calculated as:

a_{c_v^I c_w^J} = 1 − d_{c_v^I c_w^J} / d_max    (3.4)

where d_{c_v^I c_w^J} is the Euclidean distance between the two colours c_v^I and c_w^J represented in the LUV colour space and d_max denotes the maximum of this distance in the




Using the adjacency measure as a good indication for merging may result in a preference for merging regions for which the common border is jagged. It should also be mentioned that a measure of so-called quasi-inclusion for improving the order of region merging, based on the concept of the convex hull, was proposed in [18]. Quasi-inclusion may play a very similar role to the adjacency measure defined here. Since quasi-inclusion should not be influenced by contour jaggedness, it could be very interesting to compare their performances. As already discussed, the total inclusion of a small region inside another should provide strong evidence supporting merging of the two regions. In the proposed


Figure 3.5: Over-segmentation of occluded background caused by favouring merging of adjacent regions [68].

approach, merging of included regions is encouraged by incorporation of the adjacency measure, and merging of small regions is favoured by the size-dependent factor in the colour homogeneity measures. The author's observations indicate that merging of small included regions is sufficiently encouraged by these two measures. However, favouring the creation of compact (and simple) regions may also discourage the formation of large regions corresponding to occluded background; see for example Figure 3.5. It is clear that in this particular case the adjacency information should be ignored due to the strong colour similarity. Examples like the above provide strong motivation for research into an advanced way of fusing the information which would allow modelling of doubt and taking into account the reliability of different sources of evidence.

3.6.2 Regularity (low complexity)

Further evidence for region merging is provided by the fact that the shape complexity of most objects tends to be rather low. Both global and local shape complexities can be explored in this regard. However, for the reasons described below, only global shape complexity is used in this work. The term "complexity", although often used in the context of shape analysis, is not well defined. Probably the most advanced work towards a better understanding of the perception of shape complexity can be found in [146], where the complexity measure is based on entropy measures of global distance and local angle, and a measure of shape randomness⁵. To maintain the simplicity of the set of features tested here, a more straightforward measure of complexity is tested instead. Depending on the performance of this simple approach, incorporation of more advanced measures of complexity may be part of the author's future research in this field.

Global Contour Complexity

Intuitively, more complex shapes have a longer perimeter in relation to their area compared to simple shapes. Therefore the global shape complexity x_i of a region r_i can be measured as the ratio between its perimeter length k_i and the square root

⁵ Unfortunately, the final measure requires regularization parameters set in an ad-hoc manner.



of its area a_i. In fact this ratio is often used in the literature as a measure of compactness, and it also depends to a certain degree on the local contour complexity.

The average change of global shape complexity for a pair of neighbouring regions r_i and r_j caused by their merging is measured in the proposed scheme as:

C_cpx(i, j) = x_ij / [ (a_i·x_i + a_j·x_j) / (a_i + a_j) ]   (3.7)

where x_ij denotes the shape complexity of a hypothetical region formed by merging r_i and r_j. The denominator in the above formula can be seen as an estimate of the average complexity of both regions before merging. Low values of C_cpx(i, j) provide evidence for merging the two regions, e.g. values below 1.0 indicate that the complexity of the joint region is lower than the average of the complexities of the two original regions. However, it should also be noted that there is a certain overlap with the adjacency measure: the shape of the merged region r_ij is more likely to be simpler than the shapes of the constituent regions r_i and r_j if they share a longer common border.
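As an illustration, the global complexity measure and the change-of-complexity ratio of equation 3.7 can be sketched as follows. The area-weighted form of the denominator follows the text's description of it as an estimate of the average complexity of both regions before merging, and is an assumption to that extent; function names are illustrative.

```python
import math

def global_complexity(perimeter, area):
    # x_i = k_i / sqrt(a_i): circle-like regions score lowest
    return perimeter / math.sqrt(area)

def c_cpx(perim_i, area_i, perim_j, area_j, perim_ij):
    """Change in global shape complexity caused by merging r_i and r_j
    (a sketch in the spirit of equation 3.7).  `perim_ij` is the perimeter
    of the hypothetical merged region; its area is the sum of the two
    constituent areas.  Values below 1.0 support merging.
    """
    x_i = global_complexity(perim_i, area_i)
    x_j = global_complexity(perim_j, area_j)
    x_ij = global_complexity(perim_ij, area_i + area_j)
    avg_before = (area_i * x_i + area_j * x_j) / (area_i + area_j)
    return x_ij / avg_before
```

For example, merging the two halves of a unit square (each a 1 × 0.5 rectangle with perimeter 3) into the full square (perimeter 4) yields a ratio below 1.0, i.e. evidence for merging.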

Local Contour Complexity

Objects with very jagged contours are rather rare in the "real world". Moreover, the RSST algorithm tends to produce jagged contours at the border separating regions which should be merged [38, 62]. Therefore excessive jaggedness could be used for detection of such "artificial" borders and provide evidence for merging them. However, contours with high-curvature corners are also created at the meeting point of two different convex objects interpenetrating each other. This phenomenon is often referred to in the literature as the Principle of Transversality [147]. An attempt to distinguish such salient corners created by interpenetrating objects from the corners belonging to a jagged boundary artificially created by the merging algorithm, and to utilize such perceptual information within RSST-based merging, was presented in [62, 63]. Disappointingly, the performance of the approach was demonstrated only on a single image (Lenna).

In the proposed approach, detection of jaggedness of the common border using two different methods (one based on the Moravec operator [148] and another based on contour curvature) and explicit utilization of such information within the proposed framework led initially to plausible results for many images (see the author's initial approach in [68]); however, extensive experiments also revealed that it can cause some erroneous merges. Therefore several problems have to be addressed before the local complexity of regions' contours can be explicitly utilized for reliable region merging: (i) the choice of the correct scale at which the local complexity is measured; (ii) it is not clear whether isolated corners with high curvature or a measure of the overall jaggedness of the common border provides more useful evidence for merging; (iii) a reliable way of distinguishing artificially created jagged borders from high-curvature points resulting from interpenetrating objects. These problems, together with the fact that both measures C_adj and C_cpx proposed in this thesis already capture well the jaggedness of common borders, motivated the decision not to use the local complexity of contours explicitly in this work.

3.7 Integration of Evidence from Different Sources

Meaningful integration of evidence from different sources into a single merging cost has proven to be a non-trivial task. Intuitively, colour homogeneity should provide much more reliable evidence than any other geometric properties, i.e. the new features described in the previous sections should offer only auxiliary evidence. Therefore it is necessary to take into account the reliability of the different sources of information when integrating them within the region merging algorithm. Moreover, certain geometric properties in some cases cannot be measured and therefore cannot provide any evidence supporting the region merging hypothesis; e.g. changes in shape complexity caused by merging two regions are meaningless if the hypothetical region created by the merging shares a significant part of its contour with the image border, as explained later in section 3.7.4. In other words, the integration framework should be able to combine evidence for merging or not from various sources, but should also be able to accept that certain measurements may not be precise (doubtful) or may even be "unknown" in some cases. Both considerations provide strong motivation for adopting Belief Theory [20, 21], which offers a convenient model for handling imprecise and uncertain information, and particularly its revised version proposed recently by Smets [22, 23], called the Transferable Belief Model (TBM), as the main framework for integration of multiple features within the proposed region merging algorithm.

3.7.1 Belief Theory

Traditionally, uncertain data was dealt with using probability theory, which in some cases has proven to be inadequate [149]. More recently, other models have been proposed for handling imprecise knowledge, e.g. the Theory of Fuzzy Sets [150] and Possibility Theory [151], or uncertain information, e.g. Belief Theory [20, 21, 22, 23]. This chapter examines the possibility of using Belief Theory to combine evidence provided by colour homogeneity and the syntactic features


for improving the performance of the RSST-based segmentation.

Belief Theory (BeT) (or the Theory of Belief Functions) was introduced by Dempster [20] and Shafer [21] and later revised by Smets [22, 23], who proposed the Transferable Belief Model (TBM). BeT is adopted in this work as it represents a powerful tool for combining measures of evidence. The main features of Dempster-Shafer's theory are:

• Convenient management of uncertainties by explicitly modelling doubt;
• The ability to take into account the reliability of information sources;
• Distinct representation of ignorance, imprecision and conflicting information;
• Convenient incorporation of statistical and/or expert knowledge.

In other words, BeT appears more general and more flexible than the commonly used probability model and allows representation of states of knowledge and opinions that probability models cannot [152]. Typically, Dempster-Shafer's theory is used for information fusion in decision making problems, e.g. medical diagnosis or recognition/classification [153, 154]. Here, it is used to compute a merging cost utilizing evidence provided by multiple sources such as colour homogeneity and the syntactic features proposed in the earlier sections.

3.7.2 Dempster-Shafer's Theory: Background

Frame of Discernment

Assume a finite set of N exclusive and exhaustive hypotheses, called a frame of discernment:

Ω = {H_1, ..., H_N}   (3.8)

Basic Belief Assignment

Also, let the power set 2^Ω of Ω be composed of the 2^N propositions A of Ω:

2^Ω = {∅, {H_1}, ..., {H_N}, {H_1 ∪ H_2}, {H_1 ∪ H_3}, ..., Ω}   (3.9)

Finally, let us define a set of U independent evidence sources S_1, ..., S_U. A piece of evidence brought by a source S_u on a proposition A is modelled by the belief structure m_u, called the Basic Belief Assignment (BBA), defined as:

m_u : 2^Ω → [0, 1]   (3.10)

with m_u(∅) = 0 and Σ_{A⊆Ω} m_u(A) = 1. In other words, a model m_u associates a belief mass with each of the propositions of 2^Ω for each value of the evidence from S_u.

Combining Information Sources

Evidence theory provides a convenient framework for fusing or combining several sources of evidence. A commonly used operator is the orthogonal sum, also called Dempster's rule of combination [22, 155]. According to this operator, which is commutative and associative, the combined belief structure m_⊕ is defined by:

m_⊕ = m_1 ⊕ m_2 ⊕ ... ⊕ m_U   (3.11)

where for two sources of information S_u and S_v the combined belief structure m_⊕ is defined as:

m_⊕(A) = (1 / (1 − K)) Σ_{B∩C=A} m_u(B) · m_v(C)   (3.12)

where K can be interpreted as a measure of the conflict between the sources to be combined (m_⊕(∅)) and is defined as:

K = Σ_{B∩C=∅} m_u(B) · m_v(C)   (3.13)

Although, due to the assumption Σ_{A⊆Ω} m_⊕(A) = 1, the normalization of m_⊕(A) by (1 − K) seems natural, it has also been criticized by some authors [156]. It has been shown in [155] that the above normalization corresponds to assimilation of the closed-world postulate (the evidence may support only propositions from Ω), while combining sources without normalization corresponds to assimilation of the open-world assumption (the evidence might support propositions from outside Ω). The utilization of the normalization is further discussed in the next section in the context of the region merging problem.
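As a sketch, Dempster's rule of combination (equations 3.11 and 3.13) for two BBAs can be implemented generically as below; representing propositions as frozensets of hypothesis labels and BBAs as dicts is an illustrative assumption, not the thesis implementation.

```python
from itertools import product

def dempster_combine(m_u, m_v):
    """Dempster's rule of combination for two BBAs.

    BBAs are dicts mapping propositions (frozensets of hypotheses) to
    belief masses.  Mass products whose conjunction is empty form the
    conflict K; the closed-world normalisation divides the remaining
    masses by (1 - K).
    """
    combined = {}
    conflict = 0.0
    for (b, mass_b), (c, mass_c) in product(m_u.items(), m_v.items()):
        inter = b & c
        if inter:
            combined[inter] = combined.get(inter, 0.0) + mass_b * mass_c
        else:
            conflict += mass_b * mass_c  # contributes to K
    if conflict >= 1.0:
        raise ValueError("total conflict: sources fully contradict each other")
    return {a: m / (1.0 - conflict) for a, m in combined.items()}
```

The division by (1 − K) implements the closed-world normalisation discussed above; dropping it would correspond to the open-world variant, with the conflict mass kept on ∅.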

Taking Into Account Source Reliability

When a source S_u is considered not completely reliable, the confidence in this source can be attenuated by a factor α_u ∈ [0, 1]. This procedure is often referred to as discounting [21]. The new attenuated structure m_u^α is then defined as:

m_u^α(A) = α_u · m_u(A)   ∀A ≠ Ω   (3.14)

and

m_u^α(Ω) = 1 − α_u + α_u · m_u(Ω)   (3.15)
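Discounting (equations 3.14 and 3.15) then amounts to a few lines; the dict-based BBA encoding with frozenset propositions is again an illustrative assumption.

```python
def discount(m, alpha, frame):
    """Discount a BBA by reliability factor alpha (equations 3.14-3.15).

    Masses on all propositions other than the whole frame are scaled by
    alpha, and the removed mass is transferred to the frame itself
    (i.e. to doubt).
    """
    out = {a: alpha * mass for a, mass in m.items() if a != frame}
    out[frame] = 1.0 - alpha + alpha * m.get(frame, 0.0)
    return out
```

A fully reliable source (α_u = 1) is left unchanged, while α_u = 0 moves all mass to the frame, i.e. total doubt.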


3.7.3 Application of Dempster-Shafer's Theory to the Region Merging Problem

This section describes the major aspects of the application of BeT to the problem of integrating multiple features within the region merging process: (i) definition of the frame of discernment, (ii) the general form of the belief structure used to model the way a piece of evidence is brought by a source on a proposition, (iii) taking into account source reliability, (iv) combination of beliefs from multiple sources and (v) a new merging cost formula based on the combination of beliefs.

Frame of Discernment

Considering the region merging problem, let the frame of discernment Ω represent a set of two hypotheses, namely MERGE and DONTMERGE, which are exhaustive and exclusive:

Ω = {MERGE, DONTMERGE}   (3.16)

Since in the case of the region merging problem the closed-world postulate appears more intuitive than the open-world postulate⁶, the closed-world assumption, and consequently the normalization in equation 3.12, are adopted in the proposed framework.

Basic Belief Assignment

The power set 2^Ω of Ω is composed of 4 propositions A:

2^Ω = {∅, {MERGE}, {DONTMERGE}, {MERGE ∪ DONTMERGE}}   (3.17)

The proposition {MERGE ∪ DONTMERGE} is also referred to in this thesis as doubt ({DOUBT}). Let us define a set of U independent⁷ evidence sources S_1, ..., S_U, where S_1 always denotes one of the colour homogeneity criteria and S_2, ..., S_U denote a set of other properties associated with a link between two regions. A piece of evidence (measurement) C_u brought by a source S_u on a proposition A is modelled by the belief structure m_u formally defined by formula 3.10. In other words, for each measure C_u (e.g. a colour homogeneity or a syntactic feature) the model m_u maps each value of C_u into a belief on a proposition (singleton or composed hypothesis) of 2^Ω.

⁶ In the proposed framework the evidence should support propositions only from Ω.
⁷ The assumption of independence of sources is further discussed in section 3.7.5 and section 3.9.



Figure 3.6: General form of BBA functions (DOUBT = MERGE ∪ DONTMERGE): (a) BBA for a reliable source of information; (b) discounted BBA for a source of limited reliability, i.e. increased doubt.

Theoretically, the belief structures (BBA) could be obtained based on either statistical or expert knowledge. However, automatic derivation of BBA based solely on training data is a nontrivial problem and is currently investigated by several researchers [153]. Therefore, in this work the models are based on both expert knowledge and statistical knowledge obtained from the training collection, similar to the methodologies proposed in [154] for training a static posture recognition system and in [152] for training facial emotion classifiers. The general form of such models is shown in Figure 3.6a. Let us denote the neutral value of measure C_u as T_u^0, i.e. the value of C_u which provides evidence neither for merging nor for not merging. Therefore, in the proposed belief structure, for a value of measure C_u equal to T_u^0 the entire mass of belief is associated with DOUBT (MERGE ∪ DONTMERGE), while the masses of belief associated with the propositions MERGE and DONTMERGE are equal to zero. As the value of C_u decreases (increases) from T_u^0 towards T_u^L (T_u^H), the mass of belief associated with DOUBT decreases while the mass of belief associated with MERGE (DONTMERGE) increases. Finally, for values of C_u below (above) T_u^L (T_u^H) the entire mass of belief is associated with MERGE (DONTMERGE) and the mass of belief associated with DOUBT is equal to zero.
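A piecewise-linear BBA of the general shape shown in Figure 3.6a can be sketched as follows. The exact functional form used in the thesis is defined by the trained thresholds, so this trapezoidal realization and its function name are assumptions made for illustration.

```python
def bba_from_measure(c, t_low, t_neutral, t_high):
    """Map a measure value c to belief masses (MERGE, DONTMERGE, DOUBT).

    c <= t_low      : all mass on MERGE
    c == t_neutral  : all mass on DOUBT
    c >= t_high     : all mass on DONTMERGE
    with linear transitions in between (the general form of Figure 3.6a).
    """
    if c <= t_low:
        merge = 1.0
    elif c < t_neutral:
        merge = (t_neutral - c) / (t_neutral - t_low)
    else:
        merge = 0.0
    if c >= t_high:
        dont = 1.0
    elif c > t_neutral:
        dont = (c - t_neutral) / (t_high - t_neutral)
    else:
        dont = 0.0
    return merge, dont, 1.0 - merge - dont
```

For example, with thresholds (0.7, 1.1, 1.4) a measure value at the neutral point 1.1 puts all mass on DOUBT, while a value of 0.9 splits the mass evenly between MERGE and DOUBT.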

Taking Into Account Source Reliability

Figure 3.6b shows an example of the BBA from Figure 3.6a discounted by a factor α_u = 0.3. It should be observed that discounting simply transfers belief from the propositions MERGE and DONTMERGE to DOUBT (MERGE ∪ DONTMERGE). Clearly, the difficulty then lies in the correct choice of the thresholds T_u^L, T_u^0, T_u^H and the discounting factor α_u for each source of information. All the above parameters are estimated using the training collection, as discussed later in section 3.7.4.

Combining Information Sources

Although the general form of the belief structures (BBA) is rather intuitive, the result of combining such BBA for two or more information sources using Dempster's combination rule from equations 3.11 and 3.12 may not be immediately apparent. Therefore, let us consider an example where two belief structures m_u and m_v are combined using the above rule. According to equation 3.12 the combined belief structure m_⊕ is computed as follows:

m_⊕(MERGE) = (1/(1−K)) ( m_u(MERGE)·m_v(MERGE) + m_u(MERGE)·m_v(DOUBT) + m_u(DOUBT)·m_v(MERGE) )   (3.18)

m_⊕(DONTMERGE) = (1/(1−K)) ( m_u(DONTMERGE)·m_v(DONTMERGE) + m_u(DONTMERGE)·m_v(DOUBT) + m_u(DOUBT)·m_v(DONTMERGE) )   (3.19)

m_⊕(DOUBT) = (1/(1−K)) ( m_u(DOUBT)·m_v(DOUBT) )   (3.20)

K = m_u(MERGE)·m_v(DONTMERGE) + m_u(DONTMERGE)·m_v(MERGE)   (3.21)

Merging Cost

In the proposed approach the new cost of merging two neighbouring regions r_i and r_j based on evidence from one or a combination of sources is defined using an empirically derived formula:

C_total(i, j) = m_⊕(DONTMERGE) − m_⊕(MERGE)   (3.22)
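The closed-form combination for this two-hypothesis frame (equations 3.18-3.21) and the resulting merging cost (equation 3.22) can be sketched as below; encoding each BBA as a (merge, dontmerge, doubt) tuple is an illustrative assumption.

```python
def merging_cost(m_u, m_v):
    """Combine two BBAs over {MERGE, DONTMERGE} and return the merging
    cost of equation 3.22.  Each BBA is a (merge, dontmerge, doubt)
    tuple summing to one; the combination follows equations 3.18-3.21.
    """
    mu_m, mu_d, mu_q = m_u
    mv_m, mv_d, mv_q = m_v
    k = mu_m * mv_d + mu_d * mv_m                      # conflict (eq. 3.21)
    norm = 1.0 - k
    if norm <= 0.0:
        raise ValueError("total conflict between the two sources")
    merge = (mu_m * mv_m + mu_m * mv_q + mu_q * mv_m) / norm   # eq. 3.18
    dont = (mu_d * mv_d + mu_d * mv_q + mu_q * mv_d) / norm    # eq. 3.19
    # eq. 3.20: doubt = mu_q * mv_q / norm (not needed for the cost)
    return dont - merge                                # eq. 3.22
```

A source that supports merging combined with a fully doubtful source yields a negative (low) cost, while a purely doubtful pair yields a cost of zero, consistent with doubt providing no evidence either way.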


Table 3.1: Combination of two BBA using Dempster's combination rule. Each table entry indicates the hypothesis supported by a certain conjunction of belief masses from the individual BBA.

m_v \ m_u | MERGE | DOUBT | DONTMERGE
MERGE | MERGE | MERGE | ∅ (conflict) |
DOUBT | MERGE | DOUBT | DONTMERGE |
DONTMERGE | ∅ (conflict) | DONTMERGE | DONTMERGE |

[...] 0.5 to avoid the influence of meaningless examples⁸. The values of the parameters T_cpx^L, T_cpx^0, and T_cpx^H found using the entire image collection are 0.7, 1.1, and 1.4 respectively. These values appear sensible since, if we recall the definition of C_cpx (equation 3.7), values of C_cpx below 1.0 indicate that the complexity of the joint region is lower than the average complexity of the two

⁸ Examples where the hypothetical region r_ij shares a significant part of its contour with the image border.


Table 3.5: Results of cross-validation for different combinations of features. Error columns give the average spatial segmentation error on collections A and B, and their average.

Integrated Features | Measures | Error (A) | Error (B) | Avg.
original colour homogeneity criterion [140] (colour modelled by mean values of colour components) | C_orig (eq. 3.1) | 0.95 | 1.05 | 1.00
modified colour homogeneity (colour modelled by mean values of colour components; modified balance between factors depending on colour difference and region size) | C_avg (eq. 3.2) | 0.60 | 0.68 | 0.64
modified colour homogeneity and adjacency measure | C_avg, C_adj (eq. 3.2, 3.6) | 0.53 | 0.65 | 0.59
modified colour homogeneity and measure of changes of global shape complexity | C_avg, C_cpx (eq. 3.2, 3.7) | 0.49 | 0.58 | 0.54
modified colour homogeneity, adjacency measure, and measure of changes of global shape complexity (complex dependency between adjacency and changes of global shape complexity) | C_avg, C_adj, C_cpx (eq. 3.2, 3.6, 3.7) | 0.48 | 0.64 | 0.56
colour homogeneity based on the extended colour representation (extended colour representation and quadratic distance) | C_ext (eq. 3.5) | 0.59 | 0.67 | 0.63
colour homogeneity based on the extended colour representation and adjacency measure | C_ext, C_adj (eq. 3.5, 3.6) | 0.54 | 0.57 | 0.56
colour homogeneity based on the extended colour representation and measure of changes of global shape complexity | C_ext, C_cpx (eq. 3.5, 3.7) | 0.50 | 0.56 | 0.53
colour homogeneity based on the extended colour representation, adjacency measure, and measure of changes of global shape complexity (complex dependency between adjacency and changes of global shape complexity) | C_ext, C_adj, C_cpx (eq. 3.5, 3.6, 3.7) | 0.48 | 0.54 | 0.51

original regions, whereas values above 1.0 indicate an increase in shape complexity if the regions are merged. The estimated values of the discounting factor α_cpx for use in combination with both colour homogeneity measures and the adjacency measure are shown in Table 3.4.

3.7.5 Evaluation of the New Framework for Combining Multiple Features

This section presents the results of cross-validation obtained by different combinations of features (sources of information). It should be recalled from section 3.4 that only the merging order is evaluated here, by using the number of regions available from the ground truth as the "optimal stopping criterion". Fully automatic stopping criteria are discussed and evaluated in the next section. Table 3.5 shows the performances obtained by different combinations of features during cross-validation. The first column contains a brief characterization of each feature combination. The second column recalls the formulas used to


compute each homogeneity measure. The performances in terms of average spatial segmentation error obtained during cross-validation for subsets A and B are shown in the third and fourth columns respectively. Column 5 contains the average error from the two trials. The results for the original RSST algorithm are shown in the first row for reference. All the remaining rows contain results obtained by utilization of a single feature or multiple features using the proposed integration framework.

All the extensions discussed in previous sections led to some improvement in segmentation quality. The improvements obtained during cross-validation are consistent with the improvements promised during training, indicating good generalization. The best result was obtained for the approach integrating colour homogeneity based on the extended colour representation together with the adjacency measure and the measure of shape complexity change, which almost halved the average spatial segmentation error compared to the original RSST approach. Also, manual inspection of the results confirms that the proposed extensions significantly improve the correspondence of the segmentation to semantic notions compared to the original RSST approach.

Although the integration of both geometric properties with the colour homogeneity based on the extended colour representation further reduces the spatial accuracy errors, the improvement is not as significant as the one obtained by the relatively simple modifications of the colour homogeneity criteria discussed in section 3.5. Integration of the syntactic features reduced the spatial segmentation error by only around 15% compared to the best colour homogeneity criterion used alone (C_ext). Clearly, this may be attributed to the fact that the results obtained by the proposed modifications of the colour homogeneity criteria on the tested collection are close to saturation, and subsequent improvements are much harder to obtain. Manual inspection of the results indicates that the improvement obtained by integration of syntactic features is generally more pronounced in cases of difficult lighting conditions, e.g. shadows across a human face, textured areas, and areas containing many small regions with distinctive colour included inside larger objects. Such images constitute only part of the collection used for evaluation.
The images from the Corel collection, which constitute most of the collection used here, are probably one of the most challenging image sets publicly available for evaluating the performance of the syntactic features, since many of them can be successfully segmented using solely colour homogeneity criteria. Therefore an important property of the proposed approach is that the introduction of syntactic features improves the results wherever possible without degrading the performance in cases where colour homogeneity alone is sufficient. Nevertheless, the above experiments confirm that syntactic features, although useful, can provide only auxiliary evidence to colour information. Also, integration of both geometric properties, adjacency and shape complexity,



resulted in a smaller improvement than could be expected from the improvements caused by each of them individually. In fact, the performance obtained by integrating only one of the two syntactic features is comparable to that obtained by integration of both of them. This result is confirmed by manual inspection. It can be explained by the complex relation (dependencies) between these features already discussed in section 3.6. For example, it is possible that utilization of one feature may affect the geometric properties of regions created at each iteration in a way which affects the role of the other feature. Moreover, formulas 3.11 and 3.12, used for the combination of sources, are derived based on the assumption that the sources are independent. Combination of correlated sources is currently under investigation by several researchers [157], and potential developments in this area could clearly benefit the proposed approach.

Representative examples of segmented images produced by the approach based solely on the extended colour representation and by its various combinations integrating the syntactic features are shown in Figure 3.7. As previously, the top five examples show cases where the syntactic features result in a lower spatial segmentation error than the approach based solely on the extended colour representation, whereas the bottom five examples show cases where the syntactic features increase the error. In fact, close manual inspection revealed that the measure of changes of global shape complexity alone often produces more meaningful results than both syntactic features together, except in cases where the initial partition contained erroneous regions with complex shape, causing further erroneous merges attempting to reduce the shape complexity, or in cases where some objects were over-segmented due to the fact that the regions constituting them already had circular or elliptical shapes (e.g. a butterfly with relatively small but circularly shaped spots with distinctive colour).

Moreover, close manual inspection of the results revealed that, although segmentations obtained using the extended colour representation are often close to those obtained using the modified criterion based on regions' average colour, in some cases utilization of the extended representation leads to significantly better partitions.
It should also be noted that the colour homogeneity criterion based on a region's average colour (C_avg) does not integrate well with the adjacency measure, i.e. this combination results in a very small improvement. In fact, integration of C_avg with both syntactic features resulted in worse performance than when it was integrated solely with the measure of change in global shape complexity (C_cpx). This can be explained by the fact that the utilization of the adjacency measure causes many small regions with irregular shape, or semi-included regions with relatively different colour, to be merged at early iterations. Subsequently, the average colour may




Figure 3.7: Segmentations obtained by merging criteria integrating colour homogeneity based on the extended colour representation with different combinations of syntactic features (the required number of regions is known).




Figure 3.8: Segmentations obtained by merging costs integrating the measure of adjacency and changes in global shape complexity with two different colour homogeneity criteria: the modified colour homogeneity based on regions' average colour (C_avg) and the colour homogeneity based on the extended colour representation (C_ext). The required number of regions is known.



not be sufficient to represent the colour of such regions, resulting in some erroneous merges in subsequent iterations. The above example illustrates the most disappointing limitation of the RSST algorithm, which is that it does not explore the solution space well. In other words, in extreme cases more plausible merges performed at early iterations may lead to erroneous merges at later stages. Representative examples of segmented images produced by merging costs integrating the geometric measures with the two colour homogeneity criteria, C_avg and C_ext, are shown in Figure 3.8. As previously, the top five examples show cases where the extended colour representation results in a lower spatial segmentation error than the approach based on the modified criterion using regions' average colour, whereas the bottom five examples show cases where the extended colour representation increases the error.

Manual inspection of the results revealed that evaluation of merging criteria using partitions with the number of regions equal to the number of regions from the ground truth may not fully illustrate the effects of the proposed extensions. For example, a modification resulting in a more meaningful partition but producing a significant number of small regions with distinctive colour has a serious disadvantage when evaluated using the proposed methodology. In other words, the high number of small regions may lead to significant under-segmentation of the remaining parts of the image in order to reduce the overall number of regions to the number of regions from the ground truth. A more meaningful comparison of merging criteria could be achieved by analyzing all partitions produced during the merging process and taking into account only the partition with minimum spatial segmentation error.

3.8 Stopping Criterion

Segmentation of scenes into semantic entities based solely on low-level features is an ill-posed problem, and building systems relying on such single partitions should be avoided whenever possible. Instead, the segmentation should produce a hierarchical representation of the image, e.g. a BPT. However, currently many applications, including the CBIR systems discussed in section 2.2, can utilize only a single partition of the scene. Moreover, even in scenarios where the segmentation is used to produce a hierarchical representation of the image, it is often necessary, e.g. for reasons of efficiency, to identify a single partition within such a representation most likely to contain a meaningful segmentation; see for example [41]. In particular, a stopping criterion could be used to select a range of the most likely meaningful partitions, or at least to identify the top nodes in such hierarchical representations, which are often meaningless, i.e. correspond to merging different semantic entities.


This section discusses the feasibility of producing such single partitions aligned with the semantically meaningful entities present in the scene by stopping the region merging process. It is assumed that potential applications can benefit from single partitions despite their limited accuracy. First, a very simple but common stopping criterion based on the value of the Peak Signal to Noise Ratio (PSNR) between the original and segmented image (reconstructed using mean region colour) [41] is discussed. This is followed by a proposal of a new approach to stopping the merging process.

3.8.1 Stopping Criterion Based on PSNR

In this case a single partitioning of the image is achieved based on the value of the PSNR between the original and the segmented image (reconstructed using mean region colour) - see for example [41]. First, let us assume that all merges performed during the region merging process are stored in a Binary Partition Tree (BPT) together with the value of the PSNR at each merging iteration. A single partition is then obtained from the BPT by deactivating nodes following the merging sequence until the value of the PSNR falls below a pre-defined threshold. The threshold is typically chosen in an ad-hoc manner to suit a particular application. In this work, only the luminance information is used in order to limit the computational complexity. Formally, the PSNR for an M x N image is computed as:

PSNR = 20 \log_{10} \frac{MAX_L}{\sqrt{\frac{1}{MN} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} \left( L(i,j) - K(i,j) \right)^2}}        (3.28)

where L is the original image (luminance information only) and K is the segmented image reconstructed using mean region intensities. MAXL is the maximum luminance value. The main drawback of such an approach is that PSNR values do not correctly reflect the similarity between the original image and the segmented one. This is particularly apparent for textured regions or regions with gradual changes of colour. For example, it is impossible to choose a single threshold which can result in a meaningful segmentation of both images containing objects consisting of heavily textured regions and images containing semantically different objects which have similar colour, e.g. compare the image from Figure 3.10a with the image from Figure 3.10b. Here we take advantage of the evaluation methodology used in previous sections to demonstrate the influence of different values of the threshold (Tpsnr) on the spatial accuracy of the segmentation and compare the approach with the case when the optimal number of regions is used. The stopping method is


evaluated in combination with several selected region merging criteria described in the previous sections. Figure 3.9 shows the segmentation quality obtained for the entire collection by six homogeneity criteria for different values of the threshold Tpsnr used to stop the merging process. The results of cross-validation are shown in Table 3.6. Not surprisingly, the stopping criterion based on PSNR obtained significantly higher spatial segmentation error compared to the "manual" stopping criterion for all merging criteria except the original RSST algorithm. This is confirmed by visual inspection of the segmentation masks - see the 3rd column of Figure 3.12. Interestingly, Figure 3.9 suggests that different merging criteria may be suitable for different stopping conditions, e.g. over- or under-segmentation. For example, although the original homogeneity criterion (Corig) performs relatively well in cases of under-segmentation (Tpsnr below 44 dB forces the creation of large regions), it is significantly outperformed by the new merging criteria when over-segmentation is required. Although for the majority of the evaluated merging criteria the lowest spatial segmentation error was obtained for Tpsnr close to 44 dB, manual inspection of the results revealed that for many images this value results in strong under-segmentation. This can be explained by the fact that many of the images in the collection contain a small number of objects, often textured, for which the above value of Tpsnr produces results close to optimal. Interestingly, the lowest spatial segmentation error was obtained by the merging criterion integrating the colour homogeneity based on the extended colour representation and the measure of changes of shape complexity alone.
The above merging criterion not only obtained the best performance for the optimal value of Tpsnr = 44 dB, but also outperformed the other configurations for higher values of the threshold, suggesting that it may generalize better to other collections. The above results suggest that further improvement could be achieved if the merging criteria were optimized simultaneously with the stopping criterion. Also, to fully utilize the improvements to the merging criteria proposed in previous sections, a new stopping criterion is required.
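The PSNR computation of eq. (3.28) and the threshold-based stop can be sketched as below. This is a minimal illustration, not the thesis implementation: `reconstructions` is assumed to hold the region-mean reconstructions ordered along the merging sequence (fine to coarse), and the function names are hypothetical.

```python
import numpy as np

def psnr(original, reconstructed, max_l=255.0):
    """PSNR (eq. 3.28) between the original luminance image L and the
    image K reconstructed from mean region intensities."""
    mse = np.mean((np.asarray(original, float) - np.asarray(reconstructed, float)) ** 2)
    if mse == 0.0:
        return float("inf")          # identical images
    return 20.0 * np.log10(max_l / np.sqrt(mse))

def stop_by_psnr(original, reconstructions, t_psnr=44.0):
    """Follow the merging sequence and return the index of the last
    reconstruction whose PSNR is still above the threshold Tpsnr."""
    last = 0
    for i, k in enumerate(reconstructions):
        if psnr(original, k) < t_psnr:
            break
        last = i
    return last
```

In a BPT this index corresponds to how far along the merging sequence nodes are deactivated before the PSNR drops below the threshold.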

3.8.2 A New Stopping Criterion

As an alternative to the PSNR-based solution, the merging process could be stopped by comparing the cost of merging at each iteration with a predefined threshold. However, the author's experiments (results omitted for brevity) indicate that choosing a single threshold in such cases results either in over-segmentation of some scenes, e.g. those with difficult lighting conditions, or under-segmentation of others, e.g. those containing objects with similar colour.


Figure 3.9: Influence of Tpsnr on the average spatial segmentation error.

This section discusses a new approach to obtaining a single partition of images in which, instead of evaluating the goodness of segmentation at a single iteration only, the decision is made based on the evolution of the merging cost accumulated during the complete merging process, i.e. starting from the level of pixels and finishing with the single region corresponding to the entire image. As in the previous approach, the selected partition is obtained from the BPT built during the merging process by deactivating nodes following the merging sequence. First, let us define an accumulated merging cost measure Ccum, which measures the total cost of all merges performed to produce t regions, as:

C_{cum}(t) = \begin{cases} \sum_{n=t}^{N_i - 1} C_{mrg}(n) & \text{if } 1 \le t < N_i \\ 0 & \text{otherwise} \end{cases}        (3.29)

where Cmrg(n) denotes the cost of merging a single pair of regions, reducing the number of regions from n + 1 to n, computed using a given merging criterion.

N_i denotes the number of regions in the initial partition, i.e. the partition produced in the first merging stage. Only the costs of merges performed during the second stage contribute to the value of Ccum. Figure 3.10 shows the measure Ccum plotted for each iteration t of the merging process for three images, each presenting a different segmentation challenge. The basic idea behind the stopping criterion is to find the number of regions t_s which partitions the curve Ccum(t) into two segments in such a way that, with decreasing values of t, the values of Ccum(t) within the segment [1, t_s] increase significantly faster than within the segment [t_s, N_i]. Also, it should be possible to bias the result towards under- or over-segmentation. In the proposed approach, t_s is found using the method proposed in [158] for bi-level thresholding, designed to cope with uni-modal distributions (histogram


Table 3.6: Results of cross-validation of selected merging criteria used with the PSNR-based stopping criterion.

Integrated Features                                          Measures                          Avg. Spatial Segm. Error
                                                                                              Coll. A   Coll. B   Avg.
original colour homogeneity criterion [140]                  Corig (eq. 3.1)                   0.84      0.84      0.84
  (colour modelled by mean values of colour components)
modified colour homogeneity (colour modelled by mean         Cavg (eq. 3.2)                    0.93      0.88      0.91
  values of colour components; modified balance between
  the factors depending on colour difference and region
  size)
colour homogeneity based on the extended colour              Cext (eq. 3.5)                    0.90      0.76      0.83
  representation (extended colour representation and
  quadratic distance)
extended colour representation and the adjacency             Cext (eq. 3.5), Cadj (eq. 3.6)    0.82      0.77      0.80
  measure
extended colour representation and the measure of            Cext (eq. 3.5), Ccpx (eq. 3.7)    0.74      0.78      0.76
  changes of global shape complexity
extended colour representation, adjacency measure, and       Cext (eq. 3.5), Cadj (eq. 3.6),   0.79      0.78      0.79
  measure of changes of global shape complexity (complex     Ccpx (eq. 3.7)
  dependency between adjacency and changes of global
  shape complexity)

based thresholding). The algorithm is based on the assumption that the main peak of the uni-modal distribution has a detectable corner at its base which corresponds to a suitable threshold point. The approach has been found suitable for various problems requiring thresholding, such as edge detection, image difference, optic flow, texture difference images, polygonal approximations of curves, and even parameter selection for the split-and-merge image segmentation algorithm [40]. Here, the approach is used to determine the stopping point t_s based on the accumulated merging cost measure Ccum. Let us assume a hypothetical reference accumulated merging cost measure Cref having the form of a line passing through the points (1, Ccum(1)) and (Tcum, Ccum(Tcum)), where Tcum is a parameter whose role is explained later. The threshold point t_s is selected to maximize the perpendicular distance between the reference line and the point (t_s, Ccum(t_s)) - see the examples in Figure 3.10. It can be shown that this step is equivalent to an application of a single step of the standard recursive subdivision method for determining the polygonal approximation of a curve [159]. The parameter Tcum can be used to bias the stopping criterion towards under- or over-segmentation, i.e. smaller values of Tcum bias the stopping criterion towards under-segmentation while larger values tend to lead to over-segmentation. However, it should be stressed that it is the shape of Ccum(t) which plays the dominant role in selecting t_s.
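The selection of t_s described above can be sketched as a maximum perpendicular distance search against the reference line. This is a minimal sketch, assuming Ccum is available as a list indexed by the number of regions t (index 0 unused); the function name is hypothetical.

```python
import math

def find_stop_point(c_cum, t_cum):
    """Return t_s in [1, Tcum] maximizing the perpendicular distance from
    (t, Ccum(t)) to the line through (1, Ccum(1)) and (Tcum, Ccum(Tcum)),
    i.e. one step of recursive subdivision for polygonal approximation."""
    x1, y1 = 1.0, c_cum[1]
    x2, y2 = float(t_cum), c_cum[t_cum]
    # implicit form a*x + b*y + c = 0 of the reference line
    a, b = y2 - y1, x1 - x2
    c = x2 * y1 - x1 * y2
    norm = math.hypot(a, b)
    best_t, best_d = 1, -1.0
    for t in range(1, t_cum + 1):
        d = abs(a * t + b * c_cum[t] + c) / norm
        if d > best_d:
            best_t, best_d = t, d
    return best_t
```

On a curve with an obvious corner, e.g. Ccum(t) = 10, 4, 1, 0.5, 0 for t = 1..5 with Tcum = 5, the corner at t_s = 3 is selected.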


Figure 3.10: Stopping criterion based on the accumulated merging cost (Ccum) for Tcum = 70. The merging order is computed using the extended colour representation together with syntactic features. Values of Ccum are re-scaled to the range [0, Tcum] for visualization purposes.
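The accumulated cost measure of eq. (3.29) can be sketched as follows; `merge_costs` is an assumed mapping from n to Cmrg(n), the cost of the second-stage merge reducing the partition from n + 1 to n regions.

```python
def cumulative_cost(merge_costs, n_initial):
    """Ccum(t) of eq. (3.29): total cost of all second-stage merges
    performed to reach t regions from the initial N_i regions."""
    c_cum = {n_initial: 0.0}          # no second-stage merges yet at t = N_i
    total = 0.0
    for t in range(n_initial - 1, 0, -1):
        total += merge_costs[t]       # cost of going from t+1 to t regions
        c_cum[t] = total
    return c_cum
```

The resulting curve, evaluated for every t from N_i down to 1, is the input to the corner-detection step that selects t_s.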

Figure 3.11 shows the average spatial segmentation error obtained for the entire image collection by six merging criteria for different values of the threshold Tcum. Table 3.7 shows the results of cross-validation. Clearly, the proposed stopping criterion significantly outperforms the PSNR-based stopping criterion when used with any of the new merging criteria. It should be noted that the best combination of merging criterion (the extended colour representation combined with syntactic features) together with the new stopping criterion leads to significantly lower average spatial segmentation error than the original RSST approach even with the "manual" stopping criterion based on the number of regions in the ground-truth mask. However, it can be seen from Figure 3.11 that the new stopping criterion is not suitable for use with the merging criterion from the original RSST approach. This suggests a dependency of the new stopping criterion on the balance between the regions' colour differences and the size of the regions, i.e. the values of q_avg and q_ext discussed in section 3.5.

Figure 3.11: Influence of Tcum on the average spatial segmentation error.

Although the merging criterion integrating the syntactic features obtained the best performance, its quantitative results are comparable with those of the merging criteria based solely on colour proposed in section 3.5. This could suggest that a stopping criterion which can fully utilize the benefits of syntactic features has yet to be found. However, further manual inspection of the results for values of Tcum different from the ones producing the minimum spatial segmentation error reveals that in fact the syntactic features have a strong positive impact on the results, especially for values of Tcum ~ 400. This is especially true for the merging criterion integrating the measure Ccpx, which greatly limits the number of regions spanning more than one object. The reason this configuration obtained slightly worse results in the quantitative evaluation than the configuration based solely on the colour homogeneity criterion for Tcum ~ 400 is that in some cases it also over-segments large regions. However, practically all cases of such over-segmentation are meaningful, i.e. there is a visual difference between the created regions.

Figure 3.12 shows selected segmentation results obtained by the original RSST algorithm with the PSNR-based stopping criterion (3rd column) and segmentations produced by the best of the proposed approaches: the merging criterion integrating the extended colour representation combined with syntactic features, together with the new stopping criterion. As in the previous cases, the first five rows show examples where the new approach improves the results compared to the original RSST, and the last five rows show examples where the proposed approach obtained higher spatial segmentation error than the original RSST approach. However, it should be stressed that the latter cases are very rare and in fact even in such cases the partitions appear intuitive.
Moreover, it should be stressed that all segmentations for the proposed methods were produced with parameter values found in the training process which, due to the evaluation criterion chosen and the collection used, tends to produce under-segmentation. As already explained, for the merging criteria integrating the syntactic features, subjectively better results can be obtained with higher values of Tcum.


Table 3.7: Results of cross-validation of selected merging criteria with the proposed stopping criterion.

Integrated Features                                          Measures                          Avg. Spatial Segm. Error
                                                                                              Coll. A   Coll. B   Avg.
original colour homogeneity criterion [140]                  Corig (eq. 3.1)                   0.71      0.91      0.81
  (colour modelled by mean values of colour components)
modified colour homogeneity (colour modelled by mean         Cavg (eq. 3.2)                    0.60      0.76      0.68
  values of colour components; modified balance between
  the factors depending on colour difference and region
  size)
colour homogeneity based on the extended colour              Cext (eq. 3.5)                    0.60      0.74      0.67
  representation (extended colour representation and
  quadratic distance)
extended colour representation and the adjacency             Cext (eq. 3.5), Cadj (eq. 3.6)    0.67      0.66      0.67
  measure
extended colour representation and the measure of            Cext (eq. 3.5), Ccpx (eq. 3.7)    0.61      0.67      0.64
  changes of global shape complexity
extended colour representation, adjacency measure, and       Cext (eq. 3.5), Cadj (eq. 3.6),   0.61      0.64      0.63
  measure of changes of global shape complexity (complex     Ccpx (eq. 3.7)
  dependency between adjacency and changes of global
  shape complexity)
For comparison, the last column of Figure 3.12 shows the results of the Blobworld segmentation algorithm⁹. The Blobworld algorithm has been extensively tested and produced very satisfactory results in the past [29]. Although in many cases the proposed stopping criterion does not produce a perfect segmentation, i.e. one ideally aligned with the manual segmentation, it successfully identifies the most visually salient regions in almost all evaluated images. This presents a tremendous opportunity for utilizing such salient regions in CBIR systems. It should be stressed that the method performs well on images presenting very different challenges with a fixed value of the parameter Tcum and, unlike the stopping criterion based on PSNR, very erroneous segmentations are extremely rare.

3.9 Future Work

RSST has the advantage of speed and simplicity compared to many other approaches and proved to be convenient for the integration of multiple features. However, it also has one inherent limitation, which is that it does not exhaustively explore the solution space. Future work could progress in two main directions. One research route could consider solving some of the limitations of the current

⁹Source code obtained from http://elib.cs.berkeley.edu/src/blobworld/


Figure 3.12: Selected segmentation results obtained by the fully automatic versions of the proposed extensions. Columns (left to right): image, reference mask, original RSST with the PSNR stopping criterion, the proposed approach (Cext, Cadj, Ccpx) with the proposed stopping criterion, and Blobworld.



approach, and the second one could focus on the possibilities of adapting the solutions proposed in this work to other segmentation frameworks, especially ones promising a better exploration of the solution space. The new framework for integration of evidence from multiple sources, each with its own accuracy and reliability, into the region merging process opens many possibilities for the integration of new features such as texture, additional geometric properties, or, in the case of video, even motion information. In particular, possible research routes could include:

• Utilization of Additional Geometric Features - The discussed list of geometric properties potentially useful in controlling the merging order is by no means complete and could be further extended by, for example, symmetry analysis, shape homogeneity (e.g. similarity between parts of a region's contours), continuity and collinearity [67] (see also section 2.3.3 for other possible generic geometric measures).

• Utilization of Multi-scale Colour and Texture Features - The colour homogeneity could be further extended by integration of texture homogeneity, e.g. by adopting the approach from [30], and by incorporation of the so-called Boundary Melting approach which favours merging of regions with a low magnitude of colour gradient along their common boundary [105]. In fact, utilization of features computed from a pixel's local neighbourhood, e.g. texture or even multi-scale colour information, could provide a remedy to the non-optimal exploration of the solution space in the initial merging stages. In other words, colour or texture information from a pixel's neighbourhood could help avoid some erroneous merges caused by noise and the inability of RSST to find the globally optimal solution. However, utilization of texture or multi-scale colour information could also affect contour localization [31] and increase the computational complexity.
• Analysis of Feature Dependencies - The dependencies between geometric features could be further analyzed using techniques like Principal Component Analysis (PCA). Also, the most recent findings in the area of generalization of TeB to dependent sources of information, e.g. [157], should be considered for extending the proposed integration framework. Furthermore, optimization of all parameters of the merging criteria simultaneously with the optimization of the stopping criterion could lead to a more significant improvement in segmentation quality.

• Utilization of an Alternative Technique in the Pre-Segmentation Stage - Using an alternative segmentation method to produce the initial partition could improve its quality and in consequence also the quality of the final segmentation. In fact, the proposed approach, i.e. the extensions proposed for the second merging stage, could be utilized as a post-processing step to any segmentation algorithm.



• Training and Testing of the Approach with Other Collections - As has already been explained in section 2.6.4, the size of the data set used for training and testing in this thesis is not optimal. Therefore the findings of this chapter should be confirmed with much larger and more heterogeneous image collections. One such collection that has recently been made public is the Berkeley Segmentation Dataset [160]. The collection contains 300 images, each segmented by 2-6 annotators. The segmentation masks contain closed contours of each region. Adapting the collection to the evaluation methodology used in this thesis would require conversion of the masks from the contour representation into regions. Also, it could be extremely interesting to evaluate the approach using the original boundary detection and segmentation benchmark proposed for the Berkeley collection, which evaluates the region boundaries in terms of precision and recall [161].

• Automatic Assessment of the Segmentation Quality - Although, as shown in this thesis, automatic image segmentation can be quite successful for many types of images, it should also be acknowledged that many images cannot be meaningfully segmented in an automatic fashion. This presents a significant challenge in utilizing such results in CBIR systems. One solution would be to accompany the segmentation results with a confidence measure indicating how likely it is that they contain a meaningful segmentation, which could then be taken into account by the CBIR system. Such a confidence measure could be derived based on the measure Ccum proposed in this chapter.

3.10 Discussion

Extensive evaluation confirmed that the proposed modifications significantly improve the quality of the segmentation produced by the RSST algorithm. Consistent improvement in spatial accuracy is observed through all classes of tested images, confirming the general nature of the proposed solutions. In particular, the most promising of the fully automatic approaches proposed here reduces the average spatial segmentation error by around 20% compared to the original RSST approach with the stopping criterion based on PSNR. The proposed method, in contrast to many existing approaches, including the Blobworld algorithm [29], produces accurate region boundaries. Perhaps the most important property of the proposed algorithm is that it works relatively well on images from different collections, such as standard test images, professional databases, digital photographs and video keyframes from broadcast TV, all using a fixed set of parameter values. This, along with the fact that it takes less than 3 seconds on a Pentium III 600 MHz to segment an image of CIF size (352x288)¹⁰, makes the proposed algorithm very attractive for object-based applications such as content-based image retrieval in large collections or definition of interest regions for content-based coding of still images in the context of JPEG2000 [27, 30], for example.

Although the introduction of syntactic features reduced the spatial accuracy errors, the improvement is smaller than the improvement obtained by the modification of the colour homogeneity criterion. Therefore, the experiments confirmed that syntactic features, although useful, can provide only evidence auxiliary to colour information. Moreover, the analysis of several syntactic features revealed major challenges in integrating them into a segmentation framework, arising from strong overlaps between various geometrical properties.

The experiments also confirmed that automatic bottom-up segmentation of images with general content is a nontrivial problem, and therefore utilization of a single partitioning should be avoided whenever possible. Instead, bottom-up segmentation should rather be used to discover the hierarchical structure of the scene, e.g. for the creation of a Binary Partition Tree (BPT) which can be used for efficient representation of the structure of the scene [107]. Creation of a BPT has been advocated as an important pre-processing step in many applications, including region-based compression and region-based feature extraction in the context of MPEG-7 description of content. One such application of the BPT is shown in section 3.11, where the BPT produced by the automatic approach proposed in the previous sections is used to ensure fast responses of a semi-automatic segmentation tool.

Finally, it should be stressed that a significant effort has been devoted in this work to objective evaluation of the proposed modifications. To the best of the author's knowledge this is the most comprehensive evaluation of the bottom-up region merging approach currently available in the literature. However, despite such significant evaluation effort, the author acknowledges that some of the results are on the verge of statistical significance, e.g. the integration of the syntactic features versus the modified merging criteria. Many approaches to image segmentation proposed in the past have been accompanied by only a handful of examples. Clearly, the research carried out by the author indicates that developing a general solution to image segmentation is an elusive task, and the generality of any improvement demonstrated only on a handful of examples should be approached with appropriate reserve.

¹⁰Blobworld, using mostly Matlab code, requires over 30 minutes for this task.

3.11 Semi-Automatic Segmentation

3.11.1 Motivation

The problem of partitioning an image into a set of semantic entities is a fundamental enabling technology for understanding scene structure and identifying


relevant objects. Unfortunately, as already discussed in this chapter, fully automatic object-based segmentation remains to a large extent an unsolved problem, despite being the subject of study for several decades. For a certain class of multimedia applications it is unavoidable to include user input in the segmentation process in order to allow semantic meaning to be defined [162]. In the case of automatic techniques, control over the desired partitioning is often performed through a trial-and-error-based adjustment of parameters. Clearly, to enable easy segmentation, even by an unskilled user, another form of control which is readily conceptualized is needed. This requirement has led to the development of supervised or semi-automatic approaches where a user, by his/her interactions,

defines what objects are to be segmented within an image [32]. The result of interaction provides clues as to the nature of the user's requirements, which are then used by an automatic process to segment the required object. The typical requirements for semi-automatic approaches, together with a review of the most characteristic approaches, have already been provided in chapter 2 (section 2.5.1).

Many researchers, including the author, believe that in many applications the shape of semantic objects in images can only be accurately recovered by a supervised approach. This makes the problem of semi-automatic segmentation and its viability extremely relevant in the context of this thesis. For example, the shape matching technique presented in chapter 5 assumes that the contours are already extracted. In applications where an automatic partitioning of scenes into semantic entities (bottom-up approach) is impossible, including the user in the segmentation process might provide a feasible alternative, especially if the collection is small, or in the case of video, where the object segmented in one image can be successfully tracked over several frames.

As already discussed in chapter 2, another viable alternative to object segmentation and subsequent recognition is a top-down approach where a model of an object is created off-line and subsequently used for recognition in unseen images. Even in this scenario typical shape models require tens of contour examples, therefore providing additional motivation for the research on the semi-automatic segmentation approach carried out in the remainder of this chapter. Specifically, the tool described in this section has proven to be particularly useful in the context of the construction of statistical shape models as described in chapter 7, where it is used for the extraction of accurate contour examples from a set of images. The solution presented in the remainder of this chapter takes full advantage of the automatic approach and all the proposed advances described in the first part of this chapter.

3.11.2 User Interaction

Various strategies for user interaction for the extraction of accurate objects have already been discussed in section 2.5.1. The author believes that one of the most


intuitive schemes for user interaction is the feature-based approach, where the user is allowed to draw on the scene, typically by placing single-point labelled markers in the form of seeds or by dragging a mouse over the image and creating a set of scribbles for the objects to be segmented. Two or more different markers (or scribbles) can have the same label, indicating different parts of the same object in the image.

An elegant and efficient solution to image partitioning based on scribbles was described in [38], where the input image is automatically pre-segmented into regions and represented as a Binary Partition Tree (BPT) to allow rapid responses to the user's actions. The tree structure is used to encode similarities between regions pre-computed during the automatic segmentation. This structure is then used to rapidly propagate the labels from scribbles provided by the user to large areas (regions) of the image. This solution was adapted by the author as it promised the high speed efficiency crucial in applications involving user interaction. It also conveniently allowed taking advantage of the developments, presented in the first part of this chapter, of incorporating syntactic features into automatic segmentation.

Since the primary goal of the tool developed by the author is the extraction of a single closed contour from a set of images (see chapter 7), the allowed interaction is restricted to only two objects, referred to in the remainder of this chapter as foreground and background. Both objects can be made up of an arbitrary number of regions, homogeneous according to certain criteria. Only two sets of disconnected scribbles are required: one for the foreground and one for the background. The author's observations indicate that the above restriction greatly simplifies user interaction and the subsequent interpretation of the results. In particular, on a standard PC, foreground and background can be associated with the left and right mouse buttons respectively. The scribbles can have a complex shape and cover large areas, or can be as small as one pixel.

3.11.3 Implementation Details

The basic idea of the approach can be found originally in [38]11 and some implementation details were also provided in [107]. This section aims at providing a deeper illustration of the central implementation issues and discusses some choices made by the author. In particular, it provides an alternative and simpler solution to bottom-up label propagation, demonstrates some flaws of the strategy suggested in [107] for filling of the entire image space, and presents a simple solution to the above problem.

11 [38] discusses various applications of the BPT representation in the context of the MPEG-4 and MPEG-7 standards.


Creation of BPT

In the author's implementation, the BPT is created using the automatic segmentation approach described in the first part of the chapter. The syntactic approach was preferred over the original RSST algorithm because, as was demonstrated in section 3.7.5, in most cases it results in a more intuitive/meaningful merging order. In other words, the incorporation of the syntactic features in the pre-segmentation stage leads to the creation of more meaningful partitions, which is especially important at the higher levels of the BPT which are close to the root. One can speculate that this should improve the intuitiveness and efficiency of user interactions, e.g. by allowing the segmentation process to make more "intelligent" decisions in areas of the scene where there is no explicit information provided by the user's interactions. To ensure that every part of the image can be split, if required by the user, the complete BPT (starting from the level of pixels) is used.

Propagation of Labels - Basic Idea

In the approach described in [38], once the image is pre-segmented and the BPT is created, the user can start interactions by adding "drags" to the original image. Each time a new drag is added the automatic process updates the segmentation mask. The basic idea behind this process is to assign regions coincident with a mouse drag to that object. However, a region can be assigned to an object only if it coincides with mouse drags with only one label (only one object) - otherwise, a conflict occurs. The BPT is used to select the largest possible regions (areas) without such a conflict (containing scribbles for only one object). Conflicts are identified by propagating labels of the mouse drags from leaves (pixels) towards the root of the BPT (whole image).

The algorithm works in three main stages [107]: (i) assignment of labels from scribbles to leaves (pixels), (ii) upwards (bottom-up) propagation of node labels to their parents (remaining non-leaf nodes) if there is no conflict, and (iii) top-down propagation of labels so that, in each zone of influence, child nodes have the same label as their parents. Note that, as a result of the above process, a certain number of nodes may remain without labels, judged as being "too different" with respect to the regions defined by the scribbles [38].

The most interesting of the above three stages is the bottom-up propagation. In the implementation suggested in [107] the propagation proceeds from the lowest levels12 of the BPT to the root. For each node from the current level, its associated label is propagated to its parent only if none of the sibling's descendants have been assigned a different label. When a node has two children with two different

12Level corresponds to the location of the node within the hierarchy (the root is at the top level, its children are one level below, etc.).

Figure 3.13: Label propagation, i.e. identification of the largest possible regions without conflict: (a) - Initial BPT. Labels (F), (B) and (U) correspond to foreground, background and unknown object respectively. (b) - BPT after label propagation; thicker lines indicate the upward label propagation. Label (C) denotes conflict. Nodes surrounded by dotted squares (RZI) will be selected for the construction of the initial mask. The node marked with the star is an example of an area (zone) without a label.

labels, a conflict is detected. This results in the creation of zones of influence (subtrees) for each label, formed by nodes without conflict - see the example in Figure 3.13. The highest nodes in such zones, referred to in the remainder of this chapter as roots of zones of influence (RZI), can be identified by searching for non-conflict nodes whose parent is marked with the label indicating conflict. Assigning labels to all pixels in each zone, assuming that its RZI is labelled with a known label, requires top-down propagation starting from the given RZI. The above procedure leads to fast classification of large regions even if the provided scribbles consist of only a few pixels.
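The RZI search rule just described (a non-conflict node whose parent carries the conflict label) can be restated in a few lines of code. The sketch below is illustrative only; the node structure and the toy tree mirroring Figure 3.13b are invented, not taken from the thesis implementation:

```python
# Hedged sketch of RZI identification after label propagation.
# Labels: "F" foreground, "B" background, "U" unknown, "C" conflict.
class Node:
    def __init__(self, label, parent=None):
        self.label, self.parent = label, parent

def find_rzis(nodes):
    # An RZI is the highest non-conflict node in a zone of influence:
    # it carries no conflict, but its parent does.
    return [n for n in nodes
            if n.label != "C"
            and n.parent is not None and n.parent.label == "C"]

# Toy BPT: root in conflict; one F zone, one B zone, one unlabelled U zone.
root = Node("C")
a, b = Node("F", root), Node("C", root)
c, d = Node("B", b), Node("U", b)
rzis = find_rzis([root, a, b, c, d])
print(sorted(n.label for n in rzis))  # ['B', 'F', 'U']
```

Note that a zone without a known label (the starred node in Figure 3.13b) still has a root; such unassigned RZIs are exactly the regions handled by the filling procedure discussed later.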

Author's Implementation of Label Propagation

The algorithm for bottom-up propagation proposed in [107] visits all nodes of the BPT (except leaves), proceeding level by level, and updates their labels. Although this algorithm may help in understanding the overall approach, it can be significantly simplified, thereby reducing computational complexity. The main inconvenience in the implementation of the original approach is caused by the requirement to organize nodes into levels. Moreover, the approach does not take advantage of the fact that it is not necessary to visit all nodes, as conflicts can occur only on paths between labelled leaves (marked directly by the scribbles)


and the root. Below is the complete algorithm for bottom-up label propagation, developed by the author, which rectifies the above weaknesses.

The procedure begins by initialization of all leaves (pixels) with one of three labels: Foreground (F), Background (B) or Unknown (U), according to the drags made by the user. All remaining nodes are initialized as Unknown (U) - see the example in Figure 3.13a. The propagation proceeds upwards along the paths between labelled leaves (pixels marked by a mouse drag) and the root. Note that leaves are processed sequentially, i.e. for a given labelled leaf the complete path to the root is updated before processing the next leaf. Labels between a given labelled leaf and the root are updated as follows: for a given node (or initially a leaf) n, if its parent node is labelled as unknown (U), the label of the current node is assigned to the parent and then the procedure is repeated for the parent. If the parent node has already been marked (by propagation from another scribbled leaf (pixel)) with the same label as its child, then the propagation stops (for this leaf). If a conflict between the labels of the current node and its parent is detected, then the parent is marked with the label Conflict (C). Note that since in such a case the propagation towards the root continues, labels (F) or (B) assigned by propagation from another scribbled leaf processed before will be overwritten with the label (C). Of course, the conflict propagation will stop if the parent is already marked with (C) (by propagation from other leaves). This makes the final result (the result after processing all leaves) independent of the order in which the leaves are processed.

Below is the pseudocode of the propagation procedure for a single leaf:

    INPUT:     Leaf l with a known label (F or B)
    VARIABLES: r - root;  n - current node;  p - parent of n
    OUTPUT:    Updated labels on the path from l to r

    1. START: n ← l;  p ← n.parent;
    2. WHILE ( p ≠ r  AND  n.label ≠ p.label )
       {
         IF ( p.label = U )  THEN  p.label ← n.label;
         ELSE                      p.label ← C;
         n ← p;  p ← n.parent;
       }

The above algorithm is not only more straightforward to implement than the one proposed in [107], but also leads to a reduction in computational complexity. Since the propagation is performed only between scribbled leaves and the root, and typically the number of scribbled pixels is much smaller than the total number of pixels in the image, only a small percentage of all nodes of the BPT have to be updated. For instance, in the most demanding example from Figure 3.16 (CIF size) only 2% of all nodes had to be updated. This corresponds to a reduction in execution time for the bottom-up propagation


from 15 ms to 1 ms on an Intel Pentium III 600 MHz when compared with the author's implementation of the original algorithm visiting all nodes. Although both approaches lead to real-time execution on a modern PC, this may become an important advantage when computational resources are limited, e.g. on handheld devices. An example of such propagation can be seen in Figure 3.14b, which demonstrates an attempt at extracting the girl's face from the cluttered background.

Although the above improvements seem very encouraging, the bottom-up propagation is only one of the two main steps in the labelling algorithm. Moreover, it should be remembered that the computational cost of the proposed solution depends on the number of scribbled pixels, and in extreme cases, e.g. the whole image covered by scribbles, the computational cost can equal or even exceed the cost of the original approach from [107]. However, taking into account that such situations should not occur in practice, the proposed solution is still an attractive alternative.
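The single-leaf propagation procedure above can be restated as executable code. The sketch below is a minimal illustration under stated assumptions: a simple node structure with parent pointers and a toy BPT mirroring Figure 3.13 are invented for the example, and this is not the thesis implementation itself:

```python
# Hedged sketch of the author's single-leaf bottom-up label propagation.
F, B, U, C = "F", "B", "U", "C"  # foreground, background, unknown, conflict

class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.label = U

def propagate_leaf(leaf, root):
    """Propagate a scribbled leaf's label (F or B) towards the root.

    Stops early when the parent already carries the same label, so the
    cost is proportional to the number of scribbled pixels rather than
    the total number of BPT nodes.
    """
    n, p = leaf, leaf.parent
    while p is not root and n.label != p.label:
        if p.label == U:
            p.label = n.label      # extend the zone of influence upwards
        else:
            p.label = C            # two different labels meet: conflict
        n, p = p, p.parent

# Toy BPT: root with two children; the left child has an F leaf and a
# B leaf, the right child has a single unlabelled leaf.
root = Node()
left, right = Node(root), Node(root)
leaf_f, leaf_b, leaf_u = Node(left), Node(left), Node(right)
leaf_f.label, leaf_b.label = F, B

propagate_leaf(leaf_f, root)   # left subtree labelled F so far
propagate_leaf(leaf_b, root)   # conflict detected at the common parent
print(left.label)   # C  -> its children become the RZIs
print(right.label)  # U  -> an unlabelled zone (starred node in Fig. 3.13)
```

Running the two propagations in the opposite order yields the same labels, illustrating the order-independence argued above.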

Labelling the Entire Image

Unfortunately, the BPT-based labelling procedure does not guarantee that all parts of the image will be classified as part of a known object. Some of the subtrees may not contain any leaf with a known label ((F) or (B)) - see the node marked with a star in Figure 3.13b. [38] concludes that, since there is insufficient information regarding the unlabelled regions, this is in fact an attractive feature of the approach, and the unlabelled regions can be classified by the user defining additional markers. It also suggests that filling the entire image space can be useful only for specific applications.

The author's observations indicate that the above conclusion would hold only if the hierarchy (similarities) of regions represented by the BPT always represented the true (or at least expected by the user) semantic structure of the scene. In practice, the tree is constructed with limited homogeneity criteria which are often not sufficient to produce regions with a high probability of corresponding to semantic notions [38]. The main advantage of the BPT is that it should represent only the most useful mergings of the regions [107]. However, in practice some of the important mergings may be missing. For example, Figure 3.14c shows an attempt to partition the complex scene from Figure 3.14a into a TV set and the surrounding background. According to the similarities stored in the BPT, the region corresponding to the girl's hair is more similar to the region corresponding to the centre of the TV set than the region corresponding to the bottom of the TV set.
The hair and the centre of the TV set were merged first, and the bottom of the TV set was then merged into the already combined hair-TV set region. Since the hair and the centre of the TV set were marked with different labels, the region corresponding to the bottom of the TV set could not be labelled. In other words, the only similarity information stored in the BPT for this region was with a region labelled with conflict (the region representing the hair and the centre of the TV set combined). Similarly, many other regions from the example remained unlabelled due to conflicts occurring at lower levels of the tree.

Figure 3.14: Examples of semi-automatic segmentation. Labels (F), (B), and (U) are represented by black, white, and grey colours respectively. (a) - The original image. (b) - Face extraction using only label propagation in the BPT. (c) - Attempt at partitioning the scene from (a) into the TV set and background using only label propagation in the BPT. (d) - Segmentation based on the interactions from example (c) with filling of the entire image.

Therefore, even if filling the entire image space is important only for specific applications, in practice it may be advantageous in order to also correct mistakes caused by the above limitations of the BPT. This could not only reduce the amount of additional interaction required, but would also simplify the presentation and, in consequence, the interpretation of the results (especially for an inexperienced user). Of course, if the filling procedure does not perform according to the user's wishes, additional interactions might still be required, but they should never exceed the effort required without such a procedure.

[107] suggests a solution for filling the entire space, again utilizing the BPT. For a given unassigned RZI, the algorithm starts from its sibling and scans all the descendants until one region is found that is, at the same time, a neighbor of the RZI and labelled with a known label. If several regions fulfill this criterion, one can be chosen arbitrarily or based on specific rules. Clearly, the use of the above approach for correcting mistakes caused by the limitations of the BPT discussed earlier may be problematic since it is based on the same BPT. Moreover, it relies on the false assumption that there is at least one labelled descendant of the RZI's sibling which is at the same time a neighbor of the RZI.

Figure 3.15: Example illustrating limitations of the solution for labelling of the entire image proposed in [107]. The example shows a case where there is no labelled descendant of a sibling of an unassigned RZI (region 4) which is at the same time a neighbor of the RZI: (a) - Initial image partitioning; region 4 does not have any pixel adjacent to the labelled regions 1 and 2. (b) - BPT.
An example illustrating one such case, where this assumption does not hold, is depicted in Figure 3.15. Since the above assumption is not always true, the filling procedure would have to be executed iteratively, i.e. start from the nodes for which the assumption holds and continue until the entire image is labelled. In other words, utilization of an iterative labelling approach seems unavoidable in order to guarantee labelling of the entire image without exceptions.

To avoid such arbitrary decisions and exceptional cases, an alternative solution was adopted by the author. The filling is performed by an iterative merging process similar in concept to the region growing strategy in [81]. However, the algorithm operates on the RZIs created by the approach described in section 3.11.3 instead of pixels, which makes the computational cost of the filling very low. The procedure requires prior top-down propagation of not only the labels of the RZIs but also the identifiers of all zones. The unlabelled nodes (regions) are iteratively assigned to the competing objects formed by the already classified regions. Simple non-parametric models of the foreground (O_F) and background (O_B) objects comprise the RZIs with (F) and (B) labels respectively. Classification of an unassigned RZI region is performed by comparing its similarities to adjacent regions from O_F and O_B. The order of labelling is based on the confidence with which the labels can be assigned. At each iteration, only one region, the one with the highest confidence, is assigned to an appropriate object. Then the representation of the extended object is updated and the process continues until all regions are labelled. In the current implementation, the distance between an unclassified region (RZI) R_i and an object O is computed as the shortest distance between R_i and one of the regions already assigned to O and adjacent to R_i:

    D(R_i, O) = min_{R_j ∈ O} d_r(R_i, R_j)    (3.30)

where d_r(R_i, R_j) is computed as the Euclidean distance between the average colours, represented in LUV space, of the two adjacent regions R_i and R_j. If the two regions do not have at least one pair of adjacent pixels, d_r(R_i, R_j) is set to ∞. Note that adjacency is not required for regions assigned in the first stage by label propagation, therefore disconnected objects are still possible. The confidence C_F with which an unclassified region R_i can be assigned to the foreground object is computed using the following formula:

    C_F(R_i) = D(R_i, O_B) / (D(R_i, O_F) + D(R_i, O_B))
                   if (D(R_i, O_F) ≠ 0 ∨ D(R_i, O_B) ≠ 0) ∧ (D(R_i, O_F) ≠ ∞ ∨ D(R_i, O_B) ≠ ∞)
    C_F(R_i) = 0.5     otherwise                                         (3.31)

C_F has values in the range [0, 1]. Values of C_F above (below) 0.5 indicate that R_i should be assigned to the foreground (background) object. Of course, the confidence C_B can be defined analogously, leading to the trivial relation C_B(R_i) = 1 - C_F(R_i). Moreover, the confidence C with which R_i can be classified as foreground or background can be defined as C(R_i) = max{C_F(R_i), C_B(R_i)}. At each iteration, the values of C_F, C_B and C are computed for all unlabelled regions and only the one region R_max which can be classified with the highest confidence, R_max = arg max_{R_i ∈ O_U} {C(R_i)}, is assigned to an appropriate object: foreground if C_F(R_max) > 0.5, or background otherwise13. Then, the confidence values are updated for all remaining unlabelled regions adjacent to R_max according to the new foreground/background models, and the process is repeated iteratively until all regions are classified. An example of the filling of the entire space for the previously considered example is shown in Figure 3.14d. Note that the procedure utilizes all possible similarities between adjacent regions, not only the ones stored in the BPT.

13O_U denotes the set of all roots of unlabelled subtrees.

Summarizing this section, the most significant differences between the approach presented in [38] and the solution described in this thesis are the incorporation of syntactic features for the extraction of the BPT, a propagation algorithm optimized to provide maximum speed, and an alternative method for filling the entire space. The final approach combines the speed efficiency of the solution described in [38] with the capabilities of Seeded Region Growing [81] to fill the entire space.
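The confidence-driven filling of equations (3.30) and (3.31) can be sketched as follows. All region identifiers, colour values, and the adjacency map below are invented for illustration; additionally, the cases of (3.31) where exactly one distance is infinite are handled explicitly (returning the limit values 1 and 0) to avoid undefined ∞/∞ arithmetic:

```python
# Hedged sketch of the iterative RZI filling (eqs. 3.30-3.31); toy data.
import math

def D(i, obj, colours, adj):
    # Eq. 3.30: shortest colour distance to an adjacent region of object obj
    ds = [math.dist(colours[i], colours[j]) for j in obj if j in adj[i]]
    return min(ds) if ds else math.inf

def CF(i, OF, OB, colours, adj):
    # Eq. 3.31, with the one-sided infinite cases spelled out as limits
    df, db = D(i, OF, colours, adj), D(i, OB, colours, adj)
    if (df == 0 and db == 0) or (df == math.inf and db == math.inf):
        return 0.5
    if df == math.inf:
        return 0.0      # no adjacent foreground region: background
    if db == math.inf:
        return 1.0      # no adjacent background region: foreground
    return db / (df + db)

def fill(OF, OB, unlabelled, colours, adj):
    OF, OB, unlabelled = set(OF), set(OB), set(unlabelled)
    while unlabelled:
        # pick the region classifiable with the highest C = max(CF, 1 - CF)
        best = max(unlabelled,
                   key=lambda i: max(CF(i, OF, OB, colours, adj),
                                     1 - CF(i, OF, OB, colours, adj)))
        (OF if CF(best, OF, OB, colours, adj) > 0.5 else OB).add(best)
        unlabelled.remove(best)
    return OF, OB

# Four regions in a chain 0-2-3-1: 0 is a foreground RZI, 1 a background
# RZI; 2 is coloured like 0, 3 like 1 (average "colours" as 3-tuples).
colours = {0: (80, 10, 10), 1: (5, 5, 5), 2: (70, 12, 8), 3: (10, 4, 6)}
adj = {0: {2}, 1: {3}, 2: {0, 3}, 3: {1, 2}}
OF, OB = fill({0}, {1}, {2, 3}, colours, adj)
print(sorted(OF), sorted(OB))  # [0, 2] [1, 3]
```

Recomputing the distances against the growing object models at each iteration is what lets regions with no initially labelled neighbour (such as region 4 in Figure 3.15) eventually be classified.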

3.11.4 Results

An example of user interactions and the corresponding results is shown in Figure 3.16. For easier interpretation of the results, the user is presented with a view of the final segmentation mask and a view of the foreground object. Additionally, the foreground and background objects are brightened and darkened respectively in the original image.

The above approach has proven to be effective and practical. When testing the automatic approach, the pre-segmentation of a CIF (352x288) image takes under 3 seconds on a standard PC with a Pentium III 600 MHz processor. This extra time is negligible14 taking into account the benefit of practically instantaneous responses of the tool to subsequent user interactions. Rapid responses ensured a good user experience in cases where the interaction has to be applied iteratively in order to refine the segmentation result. In other words, the approach is in fact iterative in nature due to the speed with which the mask can be created following the interactions.

14In most cases it is below the time acceptable for opening an image before presenting it to the user.

For comparison, the EM-based framework [32] required more extensive interactions to extract the Foreman from Figure 3.16f and produced subjectively poorer quality results. Also the approach based on morphological flooding [83] seems to produce worse quality results for this example. Strict experimental evaluation of the time required to segment an image by both non-skilled and experienced users is outside the scope of this thesis. However, it is safe to claim that the typical time spent by a user obtaining a required segmentation result is between 5 and 10 seconds. Clearly, the time required depends on user experience, complexity and size of the object under consideration and in

Figure 3.16: Results of semi-automatic segmentation.


some cases on the artifacts introduced by compression of the input image.

3.11.5 Discussion

The described tool has proven to be particularly useful in the context of the construction of statistical shape models described in chapter 7, where it is used for the extraction of accurate contour examples from a set of images. The restriction of the approach to only two objects was motivated by the fact that it leads to very intuitive user interactions in the many application scenarios where there is only one dominant foreground object present in the scene. However, it should be stressed that, unlike in the case of other approaches, e.g. see [108], adaptation of the presented approach to multiple objects is trivial, i.e. it would mainly require changes to the GUI.

The above semi-automatic segmentation approach can be used not only for off-line extraction of object contours, e.g. for the construction of statistical shape models, but could also be particularly suited to the fast formulation of object-based queries in object-oriented retrieval-by-example systems, like the one proposed in [10]. The author believes that fast and easy-to-use semi-automatic segmentation tools will play a crucial role in the development of future retrieval-by-example systems. The above approach is particularly suited to such applications as it typically leads to good segmentation results within a few seconds of interaction.

Although the initial segmentation can be obtained very quickly, the author's observations indicate that the tool loses its advantage over other interaction strategies (for example the contour adjustment described in [80]) in cases requiring very fine corrections of boundaries, e.g. whenever objects have very similar colour. In such cases scribbles have to be used to define both the internal and external contours separating the objects. The most sensible solution seems to be the integration of various interaction schemes within one tool, allowing the user to freely select the most convenient way of interacting at different stages during the segmentation process, as for example in the tool described in [80]. Moreover, whilst the scribble-based interaction model is intrinsically suited to iterative refinement of the segmentation mask by adding new "drags", it is not equally convenient when the user wants to rectify his/her mistakes caused by a scribble extending into the wrong object. In the case of the tool implemented by the author, this can be achieved only by covering the mistaken parts of the scribble with the correct type of scribble or by restarting the whole interaction process. The user experience could be further improved by extending the functionality of the GUI to provide means of undoing the last interactions and also selectively deleting whole scribbles. The subjective feedback from a number of inexperienced users was very


positive. However, the effectiveness of the tool should be evaluated by rigorous usability tests. Such a rigorous evaluation is outside the scope of this thesis and will be a part of the author's future research in this area. It would be particularly interesting to quantify the influence of the syntactic extension to automatic segmentation proposed in this chapter on the user's impression, in terms of the number of interactions and accuracy. Also, the tool should be compared to other state of the art approaches - see section 2.5.1. Finally, the feasibility of porting the approach to state of the art hardware devices allowing new ways of user interaction, e.g. a touch-sensitive table, could also be investigated. It would be particularly appealing to compare the tool run on a standard PC with a mouse against tools run on such futuristic devices.

3.12 Conclusion

Several extensions to the well known RSST algorithm [19] are proposed, allowing segmentation of images into large regions that reflect the real world objects present in the scene. The most important is a novel framework for the integration of evidence from multiple sources of information, each with different accuracy and reliability, within the region merging process. Other extensions include a new colour model and colour homogeneity criteria, practical solutions to the analysis of regions' geometric properties and their spatial configuration, and a new simple stopping criterion aiming at producing partitions containing the most salient objects present in the scene. All the above extensions and their evaluation can be seen as a feasibility study of utilizing the spatial configuration of regions and their geometric properties (the so-called syntactic features [18]) for improving the quality of image segmentation produced by the RSST algorithm, thereby increasing its usefulness to various applications. The "strength" of the evidence for region merging provided by the spatial configurations of regions and their geometric properties is assessed and compared with the evidence provided solely by the colour homogeneity criterion.

To the best of the author's knowledge, this chapter presents the first comprehensive and practical solution for the integration of multiple features, including geometric properties, into a region merging process, and demonstrates its performance in extensive experiments. It is shown that syntactic features can provide an additional basis for region merging criteria which prevents the formation of regions spanning more than one semantic object, thereby improving the quality of the segmentation. The experiments demonstrate that the proposed solutions are generic in nature and allow satisfactory segmentation of real world images without any adjustment of the algorithm's parameters.

The result of the study is a new computationally efficient method for fully automatic image segmentation. The technique can produce both a single partitioning of the image containing the most salient objects present in the scene and a hierarchical representation of the scene in the form of a Binary Partition Tree. The chapter also describes the utilization of the output of the proposed automatic approach for fast and intuitive semi-automatic segmentation.


Chapter 4

SHAPE REPRESENTATION AND MATCHING: A REVIEW

Shape analysis is an essential research topic in many areas. It plays an important role in systems for object recognition, matching, registration, and analysis. Unlike any other visual feature, e.g. colour or texture, the shape of an object can usually reveal its identity. Therefore, shape information could become a particularly important and powerful feature when used in object-based similarity search and retrieval. Not surprisingly, users are often more interested in retrieval by shape than by colour and texture [12]. However, retrieval by shape is still considered to be one of the most difficult aspects of content-based image search.

This chapter aims at outlining the most important issues related to shape analysis in the context of content-based image retrieval. In particular, it focuses on the major challenges related to 2D shape representation, matching and similarity estimation, and gives an overview of selected techniques available in the literature in order to provide a context for the research carried out in the remainder of this thesis, particularly in chapter 5.

4.1 Introduction

4.1.1 Shape Analysis

Although challenging, shape analysis is an essential research subject in many areas including medicine, biology, physics, agriculture, computer vision, image understanding, document analysis and many others [163,17]. Countless vital visual tasks in the fields of computer vision and image understanding include some form of shape analysis.

Intuitively, an object's shape often carries the most valuable information about the objects present in the analyzed visual content. Frequently, a vision system can deduce important facts about the visual scene purely by exploiting certain information about the shapes of the objects present. Shape is clearly an important cue for object recognition, since humans can often identify semantic objects solely on the basis of their shapes or composition of primitives. This distinguishes shape from other elementary visual features (e.g. colour or texture), which usually do not reveal an object's identity.

Shape analysis, and representation and similarity estimation in particular, are key technologies for indexing and retrieving images based on their visual content. However, retrieval by shape is also considered to be one of the most difficult aspects of content-based image search. Despite three decades of intensive research there are many unsolved problems related to shape-based object detection and recognition, similarity estimation between shapes, and the utilization of shape information in content-based image retrieval. The most important of these are discussed in the remainder of this chapter.

4.1.2 2D Shapes Versus a 3D World

Although the real world is three-dimensional, reconstruction of 3D scenes from 2D views is often impossible. Moreover, despite recent advances in computer technology, dealing with 3D models is still computationally expensive, frequently to a prohibitive degree [163]. On the other hand, there is very strong empirical evidence from the field of cognitive science that a substantial part of the information present in an object view is carried by its 2D outline shape [164,165], which often conveys enough information to allow recognition of the original object. The authors of [164] demonstrated experimentally the usefulness of silhouette representations for a variety of visual tasks, ranging from basic-level categorization to finding the best view of an object. Often, 2D silhouettes represent the abstractions (archetypes) of complex 3D objects belonging to the same class. For this reason, shape analysis, and the analysis of 2D silhouettes in particular, are considered important components of many systems performing object recognition. Contour-based matching methods can be very effective in applications where high intra-class shape variability is expected, due either to deformations of the object (rigid or non-rigid, e.g. resulting from object flexibility) or to perspective deformations [16]. Moreover, dealing with 2D silhouettes is cheap from a computational point of view, which is crucial when working with large collections.

Therefore, although the real world is three-dimensional, the issue of two-dimensional shape analysis is still of vital importance [163]. Methods for comparing 2D shapes are valuable building blocks in applications requiring object recognition, and object-based retrieval from visual collections has been the focus of much research. For these reasons this thesis is primarily concerned with closed planar contours, particularly with aspects related to representation, matching and similarity estimation in cases where the similarity between the two curves is weak.

Figure 4.1: Typical problems addressed by shape analysis.

4.1.3 Chapter Structure

The next section outlines the most important aspects related to retrieval based on shape. Then, section 4.3 discusses a taxonomy of shape representations. Section 4.4 describes in some detail selected methods utilizing shape information embedded into a low dimensional vector space (feature vectors). The most characteristic curve matching methods, which establish correspondence between local shape elements prior to similarity estimation, are discussed in section 4.5. The review focuses on methods particularly relevant to the research carried out by the author in chapter 5.

4.2 Important Aspects of Retrieval by Shape

4.2.1 Computational Shape Analysis

In general, computational shape analysis and recognition requires addressing several problems which, according to [163], can be broadly divided into three classes: shape pre-processing, shape transformations and shape classification (see Figure 4.1). Of these, the most important aspects of shape analysis in the context of retrieval by shape are: (i) shape pre-processing, especially detection and registration, (ii) representation and description, (iii) matching, including establishing correspondence between shapes and estimation of shape similarity, and (iv) organization of shapes into search structures [163].

Pre-processing is concerned with separating the object of interest from other non-relevant elements of the scene. One of the most challenging, and (in many applications) critical, of all pre-processing tasks is shape detection, which is typically achieved by image segmentation. Due to its importance, selected issues related to image segmentation are outlined in chapter 2 and then further investigated in chapter 3. Shape registration is discussed in chapter 7.

The discussion in this chapter focuses on selected aspects of shape transformations and shape classification, particularly on shape representation, establishing pairwise correspondence between shapes, and similarity estimation.

4.2.2 Shape Representation and Description

One of the most important issues related to shape-based retrieval is the extraction of information from shapes which is relevant for a given application. According to [163] this task can be divided into two closely related problems: (i) mapping of the initial shape representation (typically an ordered chain of points) into a representation more appropriate to a specific application, and (ii) extraction of relevant shape features (often referred to as descriptors) for subsequent classification, recognition or analysis of physical properties. In fact, descriptors are often extracted not directly from the initial sequence of points but from some more appropriate representation of the contour. The terms shape representation and shape descriptor are frequently interchanged in the literature; however, typically the term shape descriptor is used to indicate a compact representation which does not necessarily allow reconstruction of the original shape.

Although the choice of a good representation is task specific, the most common criteria which should be considered when selecting a representation (or description) are [166,147,167,17]:

1. Scope - This defines the class of shapes that can be described by the method;

2. Invariance to rigid transformations (invariance to changes of pose) - Rigid transformations do not change the shape of the object, therefore it is often desirable to use representations which are invariant to these transformations;

3. Robustness to deformations (stability and sensitivity) - This describes how sensitive a shape representation (or descriptor) is to changes in shape;

4. Uniqueness - This describes whether a one-to-one mapping exists between shapes and shape descriptors; the uniqueness of the representation determines the discriminatory power of the whole matching technique;

5. Accessibility (memory and speed efficiency) - This describes how easy it is to compute a shape representation or descriptor in terms of memory requirements and computational cost;

6. Scalability - This means that a representation contains information about the shape at different levels of detail;

7. Local support - Only a representation computed using local information can provide the ability to recognize that the shape of a segment of a curve is the same as the shape of another curve segment;

8. Biological or psychophysical plausibility;

9. Capability to reconstruct the original shape.

4.2.3 Shape Matching and Similarity Estimation

Shape matching refers to a broad set of methods for comparing shapes and is typically used for object recognition [147]. In fact, shape matching may consist of three distinct processes, typically performed consecutively: (i) finding a transformation of one shape into another (often referred to as shape alignment), (ii) establishing correspondence between elements of the two shape representations, and (iii) measuring the resemblance between shapes. The complexity of each of these processes depends primarily on the shape transformations and deformations expected in a given application and on the representation used in a given approach. For example, finding a rigid transformation of one shape into another may not be necessary if the poses of the shapes are known or if the representation used is translation, scale and rotation invariant.

In the context of content-based image retrieval, one of the main objectives of shape matching is estimation of similarity between two shapes. Measuring the resemblance between shapes is a key tool in the interpretation of images, in determining an object's identity ("categorization") and in clustering. Analogously to shape representation, the desired properties of the similarity measure are application dependent; however, it is possible to identify the most common properties appearing in the literature [166,168,169,147,170,171,167,17]:

1. Correspondence to human perception/judgment of similarity - Intuitively, this is the most important characteristic of a similarity measure. However, it should also be noted that the perception of similarity is typically context and user dependent;

2. Respect for metric properties - Many authors suggest that a similarity measure should respect metric properties [168], despite the fact that it is unclear whether the function underlying the human notion of shape similarity obeys metric properties [170,171]. The main advantage of obeying metric properties is the fact that many computationally efficient methods for finding nearest neighbors rely on the metric properties of the utilized comparison function;

3. Invariance to transformations - It is typically necessary that the similarity measure is invariant under a chosen group of transformations, e.g. pose transformations or affine transformations;

4. Robustness to certain types of deformations - When studying various techniques for shape analysis it is important to remember that it is not difficult to compare very similar shapes. The difficulties arise when one has to measure the degree of similarity between two shapes which are significantly different [169,170];

5. Continuity - Some research suggests that shape similarity is a continuous phenomenon, i.e. similarity gradually deteriorates as one shape progressively deforms with respect to the other [168]. Although this property appears intuitive, its formal relevance to human perception is still not obvious. [171] simply uses the term continuity to refer to robustness to various deformations: (i) perturbations, (ii) small cracks, (iii) blur, and (iv) noise and occlusions;

6. In [168] it was suggested that many "small" shape deformations should affect the similarity measure less than one "large" deformation of equal total magnitude, and that different deformations should contribute differently to the similarity measure, e.g. the articulation of parts should contribute less than variations in the shape and size of portions of objects.

It should be noted that despite such a relatively exhaustive list of desired properties for similarity measures, identified over three decades of intensive research, nobody has really been able to answer the basic question: "what qualifies similarity of shape?". Given that the above list contains several contradictory properties, it should be apparent that it is impossible to develop a technique suitable for every kind of problem. Not all methods are suitable for every kind of application, i.e. the choice depends on the shape properties, e.g. intra- and inter-class variability, and the particular application, e.g. acceptable computational cost or presence of noise [147]. The choice of the proper shape representation and matching method is often a compromise between retrieval effectiveness and computational complexity. The speed criterion is vital for CBIR systems, since typically the user expects a response within a few seconds of submitting the query. Speed requirements often limit the number of techniques which can be used in CBIR systems. In particular, matching should have low computational cost as it is typically expected to be performed on-line. Descriptors should also be extracted efficiently, especially in the case of large image databases; however, more tolerance is allowed here as feature extraction can typically be performed off-line.

Partial Matching in Large Collections of Shapes

Many recently proposed matching approaches can deal explicitly with various degrees of occlusion [172,3]. Some can even match both open and closed curves [170,167]. Clearly, such properties can be very useful in many applications. Partial matching is typically considered more challenging than global matching due to the need to utilize appropriate local features and the higher computational cost of matching. However, in the case of shape-based retrieval in large collections, the use of techniques performing partial matching has much more serious implications. In a large collection it is very likely that for a given query many shapes can be found which share many very similar parts but which are globally dissimilar. The main question here is how to "punish" the occlusions in the final similarity measure. The danger is that such partially similar shapes could be ranked as being more similar to the query than other, intuitively more similar, shapes. An example of such a situation can be found in [167] (page 1511), where the authors comment on their results in the following way: "Answer 10 looks dissimilar to the query. However, a closer look reveals that this shape matches the upper part of the query." Therefore, the treatment of occlusions is application dependent. Also, there are many applications that require matching of closed curves and where occlusions simply contribute to the dissimilarity between shapes. This thesis focuses on such cases.

4.3 Taxonomy of Shape Representations

As can be seen from the above discussion, there are many important aspects of shape matching to be considered when choosing an approach suitable for a particular application. The large dimensionality of the problem and the vast number of methods available in the literature make a consistent and comprehensive overview of the state of the art in shape matching difficult. This section discusses the taxonomy of shape representations most commonly used in the literature to classify shape matching techniques. This taxonomy provides a starting point for the review of the most characteristic shape matching techniques presented in sections 4.4 and 4.5.

4.3.1 Taxonomy [173,163]

Various taxonomies for shape representation methods have been proposed in the past [173,163]. Figure 4.2 shows the subdivision of shape representations proposed in [163]. The basic criterion for characterizing shape representations (and also matching) is whether the method is contour-based (also known as boundary- or outline-based) or region-based (considering compositions of 2D regions and using their internal details). According to [163], the contour-based approaches can be further subdivided into three classes: (i) parametric contours - silhouettes represented as parametric curves by ordered sequences, e.g. contour points, (ii) sets of unordered contour points, and (iii) curve approximations - boundaries represented by a sequence of geometric primitives fitted to the contour. The region-based techniques can be subdivided into: (i) region decomposition - regions represented as a set of simple primitives, (ii) bounding regions, and (iii) internal features, e.g. regions represented by their skeletons. According to this taxonomy, a third basic class of shape representations, referred to as transform domain techniques, can be identified. However, it should be noted that this category contains both contour- and region-based methods, since many transforms can be applied to both 1D and 2D signals. The transform domain category can be subdivided into linear and non-linear, and the linear category can be further subdivided into single-scale and multi-scale representations.

Figure 4.2: Shape representation taxonomy according to [163].

In [173] a somewhat different subdivision of representations was proposed. The two basic categories, contour-based and region-based, are further divided into spatial domain and transform domain sub-categories. The subdivision is made on the basis of whether the result of the analysis is numeric or non-numeric. For instance, scalar transform techniques produce numbers (scalars or vectors) as results, e.g. the Fourier Transform [169,174,175] or moments [176,177,178,179,180]. In contrast, a space-domain technique may produce another image, e.g. the Medial Axis Transform (MAT) [181,182].

4.3.2 Contour-Based Versus Region-Based Techniques

The contour- and region-based methods offer two different notions of similarity, i.e. objects having similar spatial distributions may have very different outline contours and vice versa. Therefore the suitability of methods from these categories is strongly application dependent.

The use of objects' silhouettes in CBIR systems has strong motivation from studies of the human visual system. The field of cognitive science provides strong empirical evidence that the human visual system pays great attention to abrupt transitions in information, i.e. it focuses on edges and tends to ignore uniform regions [183,164,165,163]. Some studies even suggest that the 2D outline shape alone conveys enough information to recognize the original 3D object [164,165]. On the other hand, in some applications, e.g. when dealing with logos and trademarks, it may be more convenient to represent shapes as compositions of 2D regions with their internal details (e.g. holes) rather than using their silhouettes. Also, region-based representations are somewhat less sensitive to non-uniform boundary noise or occlusions than the contour-based approaches. However, many region-based techniques can deal only with rigid transformations, since the elastic matching of 2D regions or even sets of points often has prohibitive computational complexity.

This thesis is primarily concerned with closed planar contours - particularly with matching and comparison in cases where the similarity between the two curves is weak.

4.3.3 Transform Domain Versus Spatial Domain Representations

Irrespective of the chosen subdivision convention, it should be noted that representations from the transform domain category are embedded into a low dimensional vector space (feature vectors), implying very straightforward similarity measures, e.g. the Euclidean distance between feature vectors. Methods utilizing representations from the spatial domain categories, in contrast, first have to establish correspondence between elements from both shapes; the final similarity is then computed by accumulating local distances between the corresponding elements. In the case of methods from the transform domain category, similarity estimation typically has very low computational complexity, which makes these approaches particularly attractive in CBIR applications where the similarities between the query and the shapes from the searched collection have to be computed on-line. However, identifying a set of measures which can completely and unambiguously describe shapes and at the same time is robust to elastic deformations has proven elusive [170]. In many applications, especially when elastic deformations or significant occlusions are expected, only methods from the spatial domain category can be used, due to their utilization of local features. The price to pay for such flexibility is a much higher computational cost of similarity estimation compared to the methods utilizing feature vectors. The subdivision into transform domain and spatial domain categories is the primary classification used for the review of the selected matching approaches presented in sections 4.4 and 4.5.

Closely related to shape representation is the issue of extracting relevant information (shape descriptors) from representations in order to characterize an object. Shape description methods produce descriptor vectors (also known as feature vectors) from a given shape representation. Choosing a set of features from a shape representation to characterize an object often represents a challenging problem, and the choice of features depends strongly on the specific problem. The following four categories of description methods were identified in [163]: (i) global measurements (see also section 4.4.1), e.g. area, perimeter, number of holes etc., (ii) signal processing (transforms) - where typically only a subset of the complete representation from the new domain is used as a feature vector, e.g. Fourier Descriptors as discussed in section 4.4.2, (iii) decomposition into primitives, and (iv) description through data structures, e.g. binary trees.
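The contrast between the two families can be illustrated with a minimal sketch (illustrative only; the function names and the use of dynamic time warping over 1D signatures are the author's-eye-view assumptions, not a specific method from the cited literature): a transform domain method compares fixed-length feature vectors directly, whereas a spatial domain method must first establish correspondence between the elements of the two representations and then accumulate the local distances along that alignment.

```python
import math

def euclidean(u, v):
    """Transform domain style: direct distance between fixed-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dtw_distance(a, b):
    """Spatial domain style: establish correspondence between elements of two
    1D shape signatures via dynamic time warping, then accumulate the local
    distances between corresponding elements along the optimal alignment."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])        # local element-to-element distance
            D[i][j] = cost + min(D[i - 1][j],      # a[i-1] matched to several b's
                                 D[i][j - 1],      # b[j-1] matched to several a's
                                 D[i - 1][j - 1])  # one-to-one correspondence
    return D[n][m]
```

The vector comparison costs O(n) while the alignment costs O(nm), which is exactly the "price to pay" mentioned above; in return, the alignment absorbs local shifts of the signature that the direct distance would penalize.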

4.3.4 Shape Modeling

In the past, object detection and recognition was performed solely using a single reference shape, or several instances of shapes belonging to a given class used individually. However, as already explained in section 2.5.2, more than a decade ago Cootes and Taylor introduced one of the more influential ideas within the image analysis community, the so-called Active Shape Model (ASM) or Smart Snakes [70]: deformable models with global shape constraints learned from observations gathered from multiple examples of shapes of the same class. Since then many new approaches to shape modeling have been proposed [116,118,120]. Such models (often also referred to as statistical shape models) are widely applicable to many areas of image analysis including: analysis of medical images [184,185], industrial inspection tools, modeling of faces, and identification of the model object in unseen images, e.g. hands and walking people [117,186]. Clearly, some aspects of statistical shape modeling and the utilization of such models for object detection and recognition are closely related to the problems of shape representation and matching discussed in this chapter. However, for clarity of presentation, this chapter focuses primarily on pairwise shape matching. Some issues related to statistical shape models and their utilization in top-down image segmentation were already discussed in chapter 2 (section 2.5.2). The construction of an ASM is described in some detail in chapter 7.

4.4 The Most Characteristic Methods Utilizing Shape Information Embedded Into a Low Dimensional Vector Space

This section describes three common techniques for measuring the similarity between shapes based on shape representations embedded into a low dimensional vector space (feature vectors). In such a case, the matching process reduces to a very straightforward similarity estimation, e.g. using the Euclidean distance between feature vectors.


4.4.1 Simple Global Geometric Shape Features

Often, a coarse estimation of shape similarity, and/or shape classification, can be obtained based on a comparison of simple scalar measures derived from the global shape. Some simple global shape features are perimeter, area, aspect ratio, elongation, eccentricity, circularity, rectangularity, complexity, symmetry, sharpness, number of holes and number of corners. Although these features are typically easy to compute, allow fast comparison between shapes and are generally applicable to a wide class of shapes [187], they also often result in many false positives [175], which is unacceptable in many applications. As a matter of fact, the majority of CBIR systems dealing with general content currently available in the literature utilize only simple global shape features [11,52,29,44] (see also section 2.2.1). For example, recent systems such as the SCHEMA Reference System [188] and the system described in [10] use simple global shape descriptors which are a subset of the contour-based descriptor standardized in MPEG-7 (intended to complement the CSS-based descriptor discussed in section 4.5.5).
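As a concrete illustration, a few of these global features can be computed directly from a polygonal boundary. The sketch below is illustrative only (the function name is hypothetical): it derives perimeter, area via the shoelace formula, and circularity 4πA/P², which equals 1 for a perfect circle and decreases for less compact shapes.

```python
import math

def global_shape_features(polygon):
    """Compute a few simple global shape features from a closed polygon,
    given as an ordered list of (x, y) vertices."""
    n = len(polygon)
    # Perimeter: sum of edge lengths around the closed boundary.
    perimeter = sum(math.dist(polygon[i], polygon[(i + 1) % n]) for i in range(n))
    # Shoelace formula for the enclosed area.
    area = abs(sum(polygon[i][0] * polygon[(i + 1) % n][1]
                   - polygon[(i + 1) % n][0] * polygon[i][1]
                   for i in range(n))) / 2.0
    # Circularity (compactness): 4*pi*A/P^2, equal to 1.0 only for a circle.
    circularity = 4.0 * math.pi * area / perimeter ** 2
    return {"perimeter": perimeter, "area": area, "circularity": circularity}
```

For a unit square this yields circularity π/4 ≈ 0.785, while a densely sampled circle approaches 1.0; comparing such scalars is exactly the kind of coarse, fast pre-filtering described above.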

4.4.2 Fourier Transform

Very often, analyzed shapes do not have holes or internal markings, and the associated boundaries can be conveniently represented by a single closed curve parameterized by arc length. In many such cases a useful representation of the boundary can be obtained using the 1D Fourier Transform [169,174,175,173]. The Fourier Descriptors (FD) represent silhouettes in the frequency domain as complex coefficients of the Fourier series expansion of some function derived from the boundary (parameterized boundary signatures).

The extraction of FD usually starts by obtaining a 1D representation of the contour (often referred to as a shape signature), which is derived from the coordinate chain, e.g. curvature, centroidal distance (radius) or complex coordinate functions [169]. Then, complex Fourier coefficients are computed by performing a Discrete Fourier Transform (DFT) of the shape signature. Typically, only a subset of all coefficients is used for classification purposes. Translation invariance of the FD is achieved by using signature functions which are invariant to translation. Scale normalization is typically obtained by dividing all coefficients by the magnitude of the coefficient corresponding to the size of the shape, e.g. the zeroth coefficient in the case of shape radii or the first coefficient in the case of complex coordinate functions [169]. Rotation invariance is usually achieved by taking only the magnitude information from the Fourier coefficients and ignoring the phase. The Fourier transformation captures the general features of a shape by converting the sensitive direct representation into the frequency domain, where the lower frequency coefficients contain information about the general shape and the higher frequency coefficients contain information about smaller details. The similarity between objects' silhouettes is typically measured using the Euclidean distance between their normalized feature vectors.

Methods utilizing Fourier descriptors are easy to implement and are based on the well developed theory of Fourier analysis [147]. The description is reasonably robust to noise and geometric transformations. Moreover, similarity estimation is very straightforward and has low computational cost, which is crucial for allowing real-time querying in large collections. The major limitation of the Fourier descriptors is their rather poor performance in similarity retrieval [3,17] when dealing with elastic deformations or non-uniform noise. In other words, the similarity measure based on FD often does not correspond to human judgements of similarity. This can be explained by the fact that the frequency terms have no apparent equivalent in human vision and the fact that Fourier coefficients do not provide local spatial information.
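The extraction pipeline just described can be sketched in a few lines. This is a toy implementation under stated assumptions (centroid-distance signature, direct DFT, hypothetical function names), not the exact variant used by any of the cited works: the centroid-distance signature gives translation invariance, dividing by the zeroth coefficient gives scale invariance, and keeping only coefficient magnitudes discards phase for rotation invariance.

```python
import cmath
import math

def fourier_descriptor(contour, n_coeffs=8):
    """Toy Fourier Descriptor of a closed contour given as ordered (x, y) points."""
    n = len(contour)
    cx = sum(p[0] for p in contour) / n
    cy = sum(p[1] for p in contour) / n
    # 1D shape signature: centroidal distance (translation invariant).
    sig = [math.hypot(x - cx, y - cy) for x, y in contour]
    # DFT of the signature; keep magnitudes only, ignoring phase.
    mags = []
    for k in range(n_coeffs + 1):
        c = sum(sig[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(c))
    # Scale normalization: divide by the zeroth (size-related) coefficient.
    return [m / mags[0] for m in mags[1:]]

def fd_distance(fd1, fd2):
    """Similarity estimation: Euclidean distance between feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(fd1, fd2)))
```

For a circle the signature is constant, so all normalized coefficients are near zero regardless of the circle's radius or position, while an ellipse puts energy into the second harmonic and so yields a non-zero distance to the circle's descriptor.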

4.4.3 Moments

Moments or functions of moments are among the most established and frequently used shape descriptors, mainly due to the fact that they can capture global information about the shape image and do not require closed boundaries. They are particularly attractive in applications dealing with complex shapes that contain holes or several disjoint regions, e.g. logos and trademarks. They are typically computed from whole 2D regions (gray-level or binarized images); however, they can also be extracted solely from shape boundaries. The most classical type of moments are regular moments, which for digital images are defined as:

$$m_{pq} = \sum_{x} \sum_{y} x^p y^q f(x, y) \qquad (4.1)$$

where $m_{pq}$ is the $(p,q)$-moment of an image function $f(x,y)$. Translation invariance can be achieved by using central moments. The infinite sequence of moments, $p, q = 0, 1, \ldots$, uniquely determines the shape, and vice versa. Typically, high-order moments, which are usually very noisy, are discarded and only a limited number of low-order moments are included in the feature vector and consequently used for similarity estimation and/or classification [171,179].

Although central moments were shown to be useful, they are only invariant to translation. A more general form of invariance is given by the seven rotation, translation, and scale invariant moment characteristics introduced by Hu [176] and often referred to as moment invariants. These are seven nonlinear functions defined on regular moments, derived using the theory of algebraic invariants in rectangular coordinates. Although attractive due to their more general form of invariance, typically only six features (moment invariants) are utilized, since generating a bigger number is not a trivial task [177]. This significantly limits their discrimination power.

Figure 4.3: Real part of ART 36 basis functions (twelve angular and three radial). Imaginary parts are similar except for a quadrature phase difference [3].

The physical meaning of moments (or moment invariants) is not clear, except in some low-order cases. The approaches based on them tend to return too many false positives [175], which is a result of their low discriminatory power. In fact, moments are generally sensitive to digitization error, camera non-linearity, non-ideal camera position and minor shape deformations, particularly non-linear deformations [178]. Another disadvantage of the methods relying on moments is their high computational cost of extraction, since features are computed using the entire region, including interior pixels. Moreover, as explained in [177], the definition of regular moments has the form of a projection of the function $f(x,y)$ onto the monomial $x^p y^q$, a basis set which is not orthogonal. This implies a certain degree of redundancy between the $m_{pq}$ and makes the recovery of the original image from these moments quite expensive.

To overcome the above problems, orthogonal moments were proposed, among which the Zernike Moments [177] and the moments resulting from the Angular Radial Transform (ART) [17] are by far the most popular. The remainder of this section focuses on the ART as it has been adopted as the region-based shape descriptor in the MPEG-7 standard. In fact, both methods are quite similar, with the main difference lying in the basis functions used: Zernike Moments are derived from Zernike polynomials while ART moments are based on cosine functions. The ART-based descriptor is computed by decomposing the shape image into orthogonal 2D basis functions.

T h e ART is an o r th o g o n a l u n ita ry transform

w h o s e c o m p le x b a sis fu n c tio n s are d e fin e d o n a u n it d isk b y a c o m p le te se t o f o r th o n o r m a l s in u s o id a l fu n c tio n s e x p r e sse d in p o lar coo rd in a tes. T he ART b a sis fu n c tio n are se p a r a b le a lo n g th e a n g u la r a n d radial d irectio n s - se e F igu re 4.3. 109
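For concreteness, the definition in (4.1) and its translation-invariant central variant can be sketched in a few lines of Python (a minimal illustration on a discrete image, not an implementation from the cited literature):

```python
import numpy as np

def raw_moment(f, p, q):
    """m_pq = sum over x and y of x^p * y^q * f(x, y) -- equation (4.1)."""
    h, w = f.shape
    ys, xs = np.mgrid[0:h, 0:w]
    return float((xs ** p * ys ** q * f).sum())

def central_moment(f, p, q):
    """mu_pq: the same sum taken about the centroid, which makes the
    result invariant to translation of the shape within the image."""
    m00 = raw_moment(f, 0, 0)
    xbar = raw_moment(f, 1, 0) / m00
    ybar = raw_moment(f, 0, 1) / m00
    h, w = f.shape
    ys, xs = np.mgrid[0:h, 0:w]
    return float(((xs - xbar) ** p * (ys - ybar) ** q * f).sum())
```

Hu's moment invariants discussed above are nonlinear combinations of such central moments, further normalized to obtain scale and rotation invariance.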


Prior to the extraction of ART coefficients, the shape is normalized w ith respect to translation and size by "inscribing" into the unit disk. Rotation invariance can be achieved by taking only the magnitudes of the ART coefficients. ART gives a compact and efficient way to express pixel distribution within a 2D object region. The details regarding the format of the descriptor defined by the MPEG-7 standard can be found in [5, 17]. Similarity between shapes is calculated by summing the absolute differences between all descriptor elements (size-normalized ART coefficients) [5]. As pointed out in [17], region-based techniques like ART are generally less sensitive than contour-based methods to extreme distortions due to scaling, but at a price of significantly worse performance in similarity-based retrieval. This can be explained by the fact that there is no evidence for any analogy between ART and the human visual system. Also, the ART coefficients, like all other region-based descriptors, are sensitive to elastic (non-linear) deformations and the initial normalization, e.g. outliers may affect the process of "inscribing" the shape into the unit disk.

4.5 The Most Characteristic Curve Matching Techniques

As already explained in section 4.2.3, shape similarity can be estimated based on shape representations embedded into a low dimensional vector space (feature vectors) or based on distances between local features associated with corresponding elements from the two shapes. This section gives an overview of shape matching techniques which utilize representations from the spatial domain category supporting local features. The review focuses on various representations from this category, methods used for solving the correspondence problem between elements of two representations, and potential applications of the discussed techniques.

Methods utilizing representations from the spatial domain category first have to establish correspondence between elements from both representations, taking into account shape variations and pose transformations; the final similarity is then computed by accumulating local distances between the associated elements. Due to the utilization of local features and advanced matching procedures, such methods can often deal with significant elastic deformations or occlusions. The price to pay for such flexibility is a much higher computational cost of matching compared to methods utilizing descriptors embedded in low dimensional feature vectors.

There are many important aspects of shape matching based on local dissimilarities of elements from both representations, such as: (i) identification of basic shape elements (shape segmentation), (ii) choice of local features (attributes) associated with each shape element, (iii) quantification of local and global similarities, and (iv) establishing the correspondence between elements from both representations.

Figure 4.4: Taxonomy of curve matching in the spatial domain.

For clarity, this review is primarily based on the classification of matching techniques described in [170] - see Figure 4.4. Many authors [170, 189] make a primary distinction between the majority of silhouette matching techniques, which focus on the curve itself, and approaches based on the curve's medial axis. Techniques matching curve elements are then further categorized into either dense matching or feature matching (sometimes also referred to as transient event matching). [170] further divides techniques from the latter category into three groups: (i) proximity matching, (ii) spread primitive matching, and (iii) syntactic matching. It should also be noted that some matching methods discussed in the following sections are applicable, or can be extended, to segmentation and identification of objects in unseen images. This task is closely related to the so-called top-down segmentation problem discussed in section 2.5.

4.5.1 Medial Axis Transform

A representation that has proven to be relevant in human vision is the Medial Axis Transform (MAT), proposed originally by Blum [181, 182]. MAT produces a skeleton and a width value at each point on the skeleton (the so-called quench function) [190, 147]. Medial axes are readily computed from contours and provide a topological description of object boundaries which is relatively stable over changes in viewpoint. This approach led Sebastian et al. [191] to attempt to capture the structure of the shape in the graph structure of the skeleton. The similarity between shapes is computed based on the "edit" distance between shock graphs. Sebastian et al. claim that their approach based on shock graphs is efficient and robust to various transformations including articulation, deformation of parts and occlusion. The method appears to offer a graded similarity measure and may be combined with appropriate indexing methods. However, matching graphs involves solving an NP-complete isomorphism problem which is computationally very expensive. Therefore, currently, two contradictory views regarding the usefulness of graph-based approaches can be found in the literature. While [191] claims efficient matching using graph-based approaches and superiority over curve matching, others [170, 192] reported sensitivity of the graph representation to occlusions and small contour changes, and also prohibitive matching complexity. Given the above difficulties with graph representations, a natural alternative is to represent shapes using their curve elements. The following sections discuss the main categories of methods measuring similarity directly using elements from both curves.
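The skeleton-plus-quench-function idea can be illustrated with a deliberately naive sketch (brute-force distance computation and a local-maximum ridge test; practical systems use fast distance transforms and thinning algorithms instead):

```python
import numpy as np

def medial_axis_sketch(mask):
    """Toy Medial Axis Transform: brute-force distance transform plus a
    local-maximum (ridge) test. Returns the skeleton mask and the quench
    function (radius of the maximal inscribed disk at each skeleton point)."""
    h, w = mask.shape
    fg = np.argwhere(mask)
    bg = np.argwhere(~mask)
    # distance from each foreground pixel to the nearest background pixel
    dist = np.zeros((h, w))
    for y, x in fg:
        dist[y, x] = np.sqrt(((bg - (y, x)) ** 2).sum(axis=1)).min()
    # ridge test: a pixel lies on the medial axis if no 8-neighbour is larger
    skel = np.zeros((h, w), dtype=bool)
    for y, x in fg:
        nb = dist[max(y - 1, 0):y + 2, max(x - 1, 0):x + 2]
        if dist[y, x] >= nb.max():
            skel[y, x] = True
    return skel, np.where(skel, dist, 0.0)
```

For an elongated rectangle the ridge runs along the center line, and the quench value there equals the half-width of the shape, which is exactly the "width value at each point on the skeleton" mentioned above.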

4.5.2 Dense Matching

Dense matching is one of the broadest categories of methods measuring similarity directly using elements from both curves. The curves are typically represented using an ordered sequence of equidistant contour points, where the number of points is chosen to satisfy a predefined representation error. Dense matching methods perform a dense mapping between the two dense curve representations. The task is often formulated as a parameterization problem with some cost function to be minimized. The final dissimilarity between contours is obtained by summing the cost of local deformations. A common approach is to define the cost function as an "elastic energy" needed to transform one curve into the other [168, 193, 194], where the "elastic energy" is defined in terms of contour curvature. Another common approach is to use the so-called turning function [195] instead of curvature. Although many of the approaches from this category allow "elastic matching", e.g. by utilizing Dynamic Programming (DP) for finding the optimal mapping between the curves, some assume only rigid deformations [195].

The main advantages of the methods from this category are that they are rather easy to implement and neatly facilitate continuous measures of similarity. However, the author is not aware of any rigorous quantitative evaluation of the retrieval capabilities of any method from this category using a collection containing more than 10 curves. This recent lack of interest in the methods from this category can probably be explained by the common belief that they have high computational complexity compared to feature (event) matching methods [170], and that the methods utilizing curvature are inherently size variant [170].

After analyzing many matching techniques, the author tends to disagree with some of the above opinions. Although the dense matching techniques deal with dense representations, the algorithms which can be used for establishing the correspondences between curve elements are typically much simpler and have a much lower order of complexity than the methods able to deal with sparse representations. Effectively, only a few feature matching methods may compete in terms of speed with the dense matching methods, and only at the cost of utilizing very sparse and therefore often ambiguous representations. Moreover, new exact pruning methods for Dynamic Time Warping (DTW), which is often utilized by dense matching approaches, were recently proposed in [196], making the dense matching methods even more attractive in terms of fast search in large collections. In addition, in the case of closed contours, size invariance can often be achieved by size normalization performed prior to the matching, even in cases of modest occlusions.

However, the existing dense matching methods do have a serious drawback: they operate at only one scale, e.g. in [168] the curvature is computed after smoothing the contour by some fixed amount to reduce the influence of noise. It should be noted that the contour matching method proposed in chapter 5 of this thesis can be categorized as a dense matching approach which, unlike other methods from this category, utilizes a rich multi-scale representation.
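The core of such elastic dense matching can be illustrated with a textbook dynamic-programming alignment over two sequences of local contour features (a generic DTW sketch, not the algorithm of any particular cited method; for closed contours one would additionally minimize over cyclic shifts of one sequence):

```python
import numpy as np

def dtw_contour_distance(a, b):
    """Minimal Dynamic Time Warping between two open sequences of local
    contour features (e.g. curvature or turning-function samples); the
    accumulated cost of the optimal monotone alignment is the dissimilarity."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])        # local deformation cost
            D[i, j] = cost + min(D[i - 1, j],      # stretch b
                                 D[i, j - 1],      # stretch a
                                 D[i - 1, j - 1])  # one-to-one match
    return D[n, m]
```

The quadratic table fill is the source of the "high complexity" reputation discussed above, and it is exactly the structure that the exact pruning methods of [196] accelerate.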

4.5.3 Proximity Matching

Proximity methods match two sets of unordered key points (feature points) by searching simultaneously for the pose alignment and correspondence such that the distances between corresponding key points of each shape are minimized [197, 198, 199, 200]. These methods are attractive in applications where the objects can be assumed to be rigid and the order of points is unavailable. Many techniques from this category can be easily extended to the problem of measuring the similarity between a model and a portion of the image. As a consequence, they can be applied to object detection in cluttered scenes without the need for object segmentation [198, 199]. Unfortunately, as is also commonly known, a proximity measure is inadequate for weakly similar boundaries, i.e. treating curves as sets of points ignores more qualitative "structural" information [170].

One of the most typical examples of proximity matching methods is the Hausdorff Distance [198, 201, 171], which is a measure of resemblance between two arbitrary sets of geometric points superimposed on one another. The Hausdorff distance between two finite point sets A = {a1, a2, ..., ap} and B = {b1, b2, ..., bq} is defined as: H(A, B) = max{h(A, B), h(B, A)}, where the function h(A, B), often called the Directed Hausdorff Distance, is computed by finding for each point of A the distance to the nearest point of B and taking the largest of such distances as h(A, B). The Hausdorff distance is sensitive to noise since it can be influenced by a single outlier. A similar measure without this drawback is the Partial Hausdorff Distance [171]. An important property of the Hausdorff distance is that it can be meaningfully applied to sets of different sizes, i.e. where there is no one-to-one correspondence between all points. The directed partial Hausdorff distance can even be applied for partial matching [198, 201]. However, as in the case of all proximity matching techniques, computing the Hausdorff distance under pose transformations is computationally very expensive. An elegant alternative from this class was proposed by Shapiro and Brady [197] and later extended by Sclaroff and Pentland [200]. In these approaches points are mapped to an intrinsic invariant coordinate frame. Due to their significance in shape registration these approaches are further discussed in section 7.1.2.
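The definitions above translate directly into code; a minimal NumPy version (brute-force pairwise distances, adequate only for small point sets):

```python
import numpy as np

def directed_hausdorff(A, B):
    """h(A, B): for each point of A take the distance to its nearest point
    of B, then return the largest of these distances."""
    # pairwise Euclidean distances between the two point sets
    d = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2))
    return d.min(axis=1).max()

def hausdorff(A, B):
    """H(A, B) = max{h(A, B), h(B, A)} as defined above."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))
```

Note how a single outlier added to either set drags the distance up, which is exactly the noise sensitivity that the Partial Hausdorff Distance is designed to avoid.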

4.5.4 Spread Primitive Matching

Spread primitive matching refers to problems where the analyzed curves are divided into shape elements, or primitives (e.g. straight lines), but the information regarding the order of such primitives is unavailable. Many approaches from this category seek the largest set of matches between elements from both shapes inducing a compatible coordinate transformation (pose). The methods from this class are relatively robust to noise. However, similarly to proximity matching techniques, they are also inadequate for weakly similar boundaries, i.e. shapes which underwent non-linear ("elastic") transformations. A common method for finding an object's pose in the abovementioned way is the Generalized Hough Transform (GHT) [202]. In this technique, each pair of model and image primitives (such as edges or vertices) defines a range of possible rigid transformations from a model to an image (the pose of the model with respect to the image). GHT works by accumulating independent pieces of evidence, coming from each such pair of primitives, for possible coordinate transformations (pose) in a quantized space of transformation parameters. Alternative approaches from this category represent object primitives as nodes in a graph, with the connecting arcs representing their relations. The properties of the primitives are then encoded as node attributes and the matching problem is formulated as attributed relational graph matching [203].
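As a minimal illustration of the evidence-accumulation idea (restricted here to pure translation with point primitives; the full GHT votes in a higher-dimensional quantized pose space and typically uses richer primitives such as oriented edges):

```python
from collections import Counter

def ght_translation(model_pts, image_pts, q=1.0):
    """Toy Generalized Hough Transform restricted to translation: every
    (model point, image point) pair votes for the displacement that would
    map one onto the other; the most-voted cell of the quantized
    transformation space is the estimated pose."""
    votes = Counter()
    for mx, my in model_pts:
        for ix, iy in image_pts:
            dx, dy = ix - mx, iy - my
            votes[(round(dx / q), round(dy / q))] += 1  # quantized accumulator cell
    (bx, by), n = votes.most_common(1)[0]
    return (bx * q, by * q), n
```

The correct displacement collects one vote from every consistent model-image pair, so it wins even in the presence of clutter points that each contribute only scattered single votes.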

4.5.5 Syntactical Matching

In applications where the ordering information of curve primitives is available, the so-called syntactical representations can be used [170], where a curve is represented by an ordered list of shape elements having attributes like length, orientation, bending angle, etc.

Some early approaches from this category relied on pose invariant attributes of the shape elements to ensure invariance of the whole matching algorithm to geometrical transformations [204, 172]. However, they completely ignored the problem of resolution (or scale), e.g. smoothing may result in an entirely different list of curve primitives. More recently, the resolution problem has been addressed in two distinctive ways. Some approaches utilize a cascade of representations at different scales [13, 14, 189, 3]. In other words, they attempt to match and quantify similarity based on matching transient events occurring at different scales of the multi-scale representations. The most characteristic methods from this category are based on the Curvature Scale Space (CSS), introduced first by Witkin [205]. Other approaches, inspired by string comparison algorithms, emulate smoothing during the matching process [170, 167] instead of using pre-computed multi-scale representations. They use edit operations (merging, substitution, deletion and insertion) to transform one string into the other. The operators are designed to handle noise and resolution changes.

The remainder of this section describes the most representative methods from both categories, since these are the methods used to benchmark the author's approach. The discussion focuses on potential drawbacks of syntactical matching and aims at providing context for the research carried out by the author and reported in the next chapter.

Curvature Scale Space (CSS)

Scale-space filtering for 1D functions was originally introduced by Witkin [205] and used in several approaches for contour matching [166, 13, 14, 15, 206], modeling [189], and even robust corner detection [207] and fast snake propagation [17]. The Curvature Scale Space (CSS) image was first proposed as a representation for planar curves by Mokhtarian and Mackworth [166]. In this approach, a parametric representation of the curve is convolved with a Gaussian function with increasing standard deviation σ. The CSS image is created by tracking the position of inflection points, identified as zero curvature crossing points, across increasing scales and along the contour. Assume the parameterized contour C(u) = (x(u), y(u)), where u is a variable measured along the contour from a starting point and x(u) and y(u) are periodic. C(u) is convolved with a Gaussian kernel φ_σ of width σ:

X_σ(u) = x(u) ⊗ φ_σ(u) = ∫ x(t) φ_σ(u − t) dt,  and the same for y(u),  where

φ_σ(u) = (1 / (σ√(2π))) e^(−u²/2σ²).    (4.2)

Figure 4.5: Extraction of the CSS image for a fish contour [207].

Curvature at different levels of smoothing is computed as:

κ(u) = (X′_σ(u) Y″_σ(u) − X″_σ(u) Y′_σ(u)) / ((X′_σ(u))² + (Y′_σ(u))²)^(3/2),    (4.3)

where X′_σ, Y′_σ and X″_σ, Y″_σ are respectively the first and second derivatives of X_σ and Y_σ.

The extraction of a CSS image for a fish example from [207] is illustrated in Figure 4.5. At a given scale, the position of inflection points along the contour can be determined by locating curvature zero-crossings. With increasing values of σ, the resulting contour becomes smoother. The curvature zero-crossings continuously move along the contour until two such zero-crossings meet and annihilate. The number of zero-crossings of the curvature decreases until finally the contour is convex and the curvature is positive. The CSS image of a curve is in fact a binary image computed by recording these positions of the curvature zero-crossings in the u–σ plane for continuously increasing σ. The size of each of the arch-shaped contours in the CSS image, often referred to as lobes or peaks, represents the depth and the size of the corresponding feature on the original contour.

The CSS images are invariant under translation. An object's rotation or a change in the starting point of the contour results in a circular shift of its CSS image, which can be taken into account during the matching. Scale invariance can be obtained by normalizing CSS images according to the size of the original curves. Noise may result in some small lobes at the lowest scales of the CSS images, but the major ones are usually unaffected. The above properties make CSS images very attractive for the extraction of compact descriptors potentially useful for efficient estimation of similarity in the context of CBIR. Therefore, not surprisingly, several approaches used the CSS images for that purpose [166, 13, 14, 189]. By far the most commonly used matching technique utilizing CSS images was proposed by Mokhtarian and Mackworth [13, 14, 15, 206].
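A compact sketch of this extraction process (FFT-based circular Gaussian smoothing plus periodic central differences for the curvature of equation (4.3); the scale sampling and kernel details are illustrative assumptions, not the implementation of the cited works):

```python
import numpy as np

def css_image(x, y, sigmas):
    """Sketch of CSS image extraction for a closed contour sampled at n
    points: at each scale sigma, smooth x(u) and y(u) with a circular
    Gaussian, evaluate the curvature of equation (4.3), and record the
    positions of the curvature zero-crossings."""
    n = len(x)
    idx = np.arange(n)
    t = np.minimum(idx, n - idx).astype(float)   # circular distance to sample 0
    css = np.zeros((len(sigmas), n), dtype=bool)

    def deriv(f):                                # periodic central difference
        return (np.roll(f, -1) - np.roll(f, 1)) / 2.0

    for i, s in enumerate(sigmas):
        g = np.exp(-t ** 2 / (2.0 * s ** 2))     # wrapped Gaussian kernel
        g /= g.sum()
        G = np.fft.fft(g)
        xs = np.real(np.fft.ifft(np.fft.fft(x) * G))  # circular convolution
        ys = np.real(np.fft.ifft(np.fft.fft(y) * G))
        x1, y1 = deriv(xs), deriv(ys)
        x2, y2 = deriv(x1), deriv(y1)
        kappa = (x1 * y2 - x2 * y1) / (x1 ** 2 + y1 ** 2) ** 1.5
        css[i] = np.sign(kappa) * np.sign(np.roll(kappa, -1)) < 0
    return css
```

A convex contour such as a circle produces an empty CSS image, while a contour with concavities produces zero-crossings that disappear once σ is large enough to smooth the contour convex, which is the annihilation behavior described above.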


Their approach relies on a very compact descriptor containing only the positions of annihilation of inflection point pairs in the u–σ plane, i.e. the positions of the zero-crossing point maxima [13, 14, 15, 206]. In the example from Figure 4.5 they can be located at the tops of the lobes. The (u, σ) coordinates of these maxima express the position of a concavity or convexity along the contour and its significance. Since small lobes are produced by noise, only the most significant ones have to be retained, e.g. those not smaller than 10% of the highest lobe. Similarity estimation between two curves requires solving a relatively simple correspondence problem between their two sets of maxima. To ensure invariance to rotation and change of starting point, the matching algorithm has to search for the optimal horizontal translation between the two sets of maxima. Mokhtarian and Mackworth propose a fast coarse-to-fine matching strategy. Detailed descriptions of different versions of the matching algorithm can be found in [13, 14, 15, 206]. A description of the most recent and effective variation of the technique can be found in [17].

The above shape descriptor is very compact, allows fast matching, and extensive tests using general shape collections [208] revealed that the method is quite robust with respect to noise, scale and orientation changes of objects². Because of these advantages the CSS-based shape descriptor has been included as the contour-based shape descriptor in the ISO/IEC MPEG-7 standard. A detailed description of the MPEG-7 shape descriptors and the most recent matching algorithms can be found in [17].

Despite its extensive advantages, the above descriptor has one serious drawback, which is its potentially high degree of ambiguity, allowing only coarse comparison between shapes [209]. Specifically, the positions of zero-crossing point maxima for very deep and sharp concavities and for very long shallow concavities may be identical. Moreover, the convex segments of the curve are represented only implicitly, by assuming that every concavity must be surrounded by two convexities. As such, it is impossible to use the CSS image, and therefore any descriptors extracted from such an image, to distinguish between totally convex curves (e.g. circles, squares, triangles). This last drawback can be somewhat mitigated by including global descriptors, e.g. eccentricity and circularity.

Finally, some problems with the robustness of CSS images were reported by Ueda and Suzuki in [189]. Figure 4.6 shows an example from [189] where small shape changes result in drastic structural changes in the corresponding CSS image. Clearly, the above limitation could affect the coarse-to-fine matching of CSS maxima proposed by Mokhtarian and Mackworth. Surprisingly, the most recent publications regarding CSS images suggest their stability - see for example [17].

Figure 4.6: Example of radical change in a CSS image caused by a minor change in shape [189].

Since the coarse-to-fine strategy may not be adequate for matching of heavily deformed shapes, Ueda and Suzuki proposed a rather complex approach, where inflection points are matched from the lowest scale to the highest [189]. Such a fine-to-coarse convex/concave structure matching algorithm was utilized for automatic acquisition of shape models by generalizing the multi-scale convex/concave structure of a class of shapes [189].

²The capabilities of the method can be checked at "www.surrey.uk".

Figure 4.7: Matching approach proposed in [167]: (a) geometric quantities for representing a segment; (b) Dynamic Programming (DP) table.

Matching and Retrieval of Contours by Petrakis et al. [167]

An advanced syntactical contour matching algorithm that emulates smoothing during the matching process rather than using a pre-computed multi-scale representation was recently proposed in [167]. The main feature of the approach is that it operates implicitly at multiple scales by allowing matching of merged sequences of consecutive small segments in one shape with larger segments of another shape, while being invariant to translation, scale, orientation, and starting point selection. The approach can handle both open and closed curves and can deal with noise, shape distortions, and possibly occlusions. In this approach, the original curves are segmented into convex and concave segments based on inflection points, identified using an approximation with local cubic B-splines. Local dissimilarities between segments are computed based on three geometric quantities (segment invariant attributes) defined at the inflection points: rotation angle, length and area - see the example in Figure 4.7a. The method operates by allowing matching of merged sequences of segments from one shape with larger segments in the other shape. The explicit computation of the scale-space representation is avoided due to the introduction of a mechanism for computing the attributes of the merged segments.

Correspondences between similar parts of the two shapes are found using a Dynamic Programming (DP) algorithm. The matching starts by building a DP table, where rows and columns correspond to inflection points from both shapes (Figure 4.7b). Starting at a cell in the bottom row and proceeding upwards and to the right, the table is filled with the cost of the partial match containing the segments between the inflection points examined so far. Because convex segments cannot match concave ones, only about half the cells are assigned cost values, in a checkerboard pattern. Merges, where a sequence of segments of one shape matches a single segment of the other shape, introduce "jumps" in the traversal of the DP table. Reaching the top row implies a complete match.

The authors of the above approach claim that through merging at different levels of detail their algorithm avoids the costly smoothing operations of the scale-space based approaches. However, this leads to the need for searching for the best match at multiple scales during the matching process, implying extremely high matching complexity, e.g. in the case of optimal matching O(M³N²), where M and N denote the number of segments in each shape. For instance, a sequential search of a database of 1,100 shapes of marine life species [14] takes several minutes on a standard PC (Pentium 1000MHz) [167].

It was shown in [167] that the method outperforms simple techniques such as the one based on moments, but produces comparable results to the CSS-based methods (which typically have much lower matching cost) when tested on the database of 1,100 closed contours of marine life species [14].
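The flavor of DP matching with merges can be conveyed by a greatly simplified sketch (segments reduced to a single length attribute, no convex/concave typing, no cyclic starting-point search; this illustrates only the merge idea, not the algorithm of [167]):

```python
def syntactic_match(a, b, max_merge=3):
    """DP over segment sequences a and b (lists of segment lengths).
    A run of up to max_merge consecutive segments of one curve may be
    merged (lengths summed) and matched against a single segment of the
    other curve; the jumps over k cells mirror the merges in the DP
    table of Figure 4.7b."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if D[i][j] == INF:
                continue
            for k in range(1, max_merge + 1):
                if i + k <= n and j + 1 <= m:    # merge k segments of a vs one of b
                    c = D[i][j] + abs(sum(a[i:i + k]) - b[j])
                    D[i + k][j + 1] = min(D[i + k][j + 1], c)
                if i + 1 <= n and j + k <= m:    # merge k segments of b vs one of a
                    c = D[i][j] + abs(a[i] - sum(b[j:j + k]))
                    D[i + 1][j + k] = min(D[i + 1][j + k], c)
    return D[n][m]
```

Even in this toy form the nested search over merge lengths hints at why the full algorithm, which also optimizes over attributes and starting points, reaches O(M³N²) complexity.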
Interestingly, when tested on smooth shapes, such as those in the GESTURES dataset [210], the method performed slightly better for large answer sets and slightly worse for small answer sets than the method based on Fourier Descriptors. Moreover, it is not clear from [167] how meaningful a comparison would be between two totally convex curves, or between weakly similar curves where some segments may not have an intuitive correspondence. Finally, although simpler than the technique proposed in [189] or in [170], it still appears quite complicated and nontrivial to implement, making comparison with other methods difficult. An interesting syntactical matching algorithm, sharing many commonalities with the approach by Petrakis et al. [167], was also proposed by Gdalyahu and Weinshall [170].


4.6 Conclusion

This chapter outlined the most important aspects of retrieval based on shape and discussed the major challenges related to 2D shape representation, matching and similarity estimation. It aimed at giving an overview of selected techniques available in the literature in order to provide a context for the research carried out in the remainder of this thesis, particularly in chapter 5. However, since a complete overview of the vast number of existing shape analysis methods would be impractical, this chapter focused only on a small selection of the most commonly used and/or those particularly relevant to the research carried out by the author in chapters 5, 6 and 7. A much larger selection of methods has been described by Loncaric in [147], where he identified over thirty shape description and matching methods covering a broad diversity of approaches and providing a large set of useful references. The state of the art in shape matching research was also discussed by Veltkamp [190, 171]. A very comprehensive discussion of various issues related to shape analysis can be found in [163]. A description of the methods adopted by the MPEG-7 standard, the selection process (including descriptions of various methods considered during standardization) and an interesting set of applications of contour analysis can be found in [17].

Shape analysis, and shape matching in particular, is an important research subject in the area of CBIR. However, despite three decades of intensive research there are many unsolved aspects of similarity estimation between shapes. Many of the shape matching approaches proposed in the literature are not suited for use in CBIR systems due to factors like the high computational cost of similarity estimation, lack of robustness to various deformations, and most importantly the inability to provide similarity estimation aligned with human judgment of similarity in a given context.
Methods where shape information is embedded into a low dimensional vector space (feature vectors) typically lack robustness against non-rigid deformations. Early dense matching techniques suffer from the inability to utilize multi-scale information and, due to the recent popularity of syntactical approaches, they were never tested with large test collections to fully assess their usefulness and potential in CBIR. The most recent syntactical approaches, although motivated by studies of human perception and dealing with compact structural information, led to the utilization of complex matching algorithms which are often difficult to implement, and also to ad-hoc solutions for resolving ambiguities between compared shapes, resulting in similarities poorly aligned with human perception of similarity. The above situation provided the motivation for the research reported in the next chapter.

Chapter 5

MULTI-SCALE REPRESENTATION AND MATCHING OF NON-RIGID SHAPES WITH A SINGLE CLOSED CONTOUR

This chapter presents a new method for efficiently matching non-rigid shapes represented with a single closed contour and for subsequently quantifying dissimilarity between them in a way aligned with human intuition. The approach is primarily intended for indexing and retrieval of objects in large collections based on their 2D shape.

In this new approach, termed the Multi-Scale Convexity Concavity (MCC) representation [211], contour convexities and concavities at different scale levels are represented using a two-dimensional matrix. The representation can be visualized as a 2D surface, where "hills" and "valleys" represent contour convexities and concavities that are present at different scales. It is shown that such a rich representation facilitates reliable matching between two shapes and subsequent quantification of differences between them. The optimal correspondence between two shape representations is achieved using dynamic programming and a dissimilarity measure is defined based on this matching. The algorithm is very effective and invariant to several kinds of transformations including some articulations and modest occlusions. The retrieval performance of the approach is illustrated using several collections including the MPEG-7 shape dataset, which is one of the most complete shape databases currently available. Extensive experiments demonstrate that the proposed method is well suited for object indexing and retrieval in large databases. Furthermore, the representation can be used as a starting point to obtain more compact descriptors.

5. MULTI-SCALE REPRESENTATION AND MATCHING OF CLOSED CONTOURS

Based on this assumption, the second part of this chapter discusses several improvements of the initial approach. An alternative and more compact form of the MCC representation [211] is proposed, along with an optimization framework designed to further improve matching performance and to facilitate the comparison of different versions of the approach by ensuring optimal conditions, and thereby a "level playing field", for the methods being compared. Attractive properties of the new approach are that it outperforms existing methods in several applications, is robust to various transformations and deformations, has low matching complexity (compared to fine contour matching techniques [170, 167]) and is relatively easy to implement.

5.1 Introduction

5.1.1 Motivation

The crucial problem in contour matching is the choice of features used to quantify differences between contour segments. Biological research has shown that contour curvature is one important cue exploited by the human visual system in this regard [212]. However, curvature is a local measure and is therefore sensitive to noise and local deformations. Estimating curvature involves calculating contour derivatives, which is difficult for a contour represented in digital form. In fact, the important structures in a shape generally occur at different spatial scales. Therefore, the most powerful shape analysis techniques are those based on multi-scale representations [213, 206, 209, 163]. Whilst multi-scale representations are very useful in analyzing contours, they are also highly redundant; this is the price to be paid in order to make explicit important structural information. Approaches to removing this redundancy are again motivated by studies of human perception. Many researchers have recognized the importance of transient events in human visual perception [212, 205]. In [212] it was demonstrated that corners and high-curvature points concentrate more information than straight lines or arcs. Numerous successful shape analysis techniques were motivated by these observations, and various multi-scale contour representations proved to be very useful for detecting such transient events [205, 189, 206, 163, 209]. Methods based on transient events are cognitively motivated, compact and allow fast matching. Their main drawback is that they are generally highly ambiguous. As discussed in the previous chapter, the Curvature Scale-Space (CSS) method [206] (arguably the most popular approach) cannot distinguish between very deep and sharp concavities and very long shallow concavities.
In fact, multi-scale structure analysis or event detection introduces an additional processing layer to obtain a more meaningful or more compact representation, but this additionally complicates the matching process.


Given these and other observations in chapter 4, one could conclude that a simpler and more robust technique for matching and quantifying the similarity between shapes could be obtained by directly using a rich (dense) shape representation extracted by recording certain shape properties computed during a contour evolution. This idea is evaluated in this chapter by introducing a rich (dense) shape description method termed the Multi-scale Convexity Concavity (MCC) representation, where for each contour point, information about curvature (the amount of convexity/concavity) at different scale levels is represented using a two-dimensional matrix. A Dynamic Time Warping (DTW) technique is used to find an optimal refined alignment along the contours, upon which the dissimilarity measure is defined. Attractive properties of the new approach are that it outperforms existing methods, is robust to various transformations and deformations, has low matching complexity (compared to other fine contour matching techniques [170, 167]) and is easy to implement. It should also be noted that MCC extraction has very low computational cost compared to other similar approaches [206, 163] and is therefore more suitable for on-line contour-based recognition, where storage requirements usually do not play the key role. The proposed multi-scale approach is an alternative to approaches attempting to match events occurring in the scale space. The main motivation for this work was provided by the drawbacks of the scale-space technique (CSS) for planar curves proposed by Mokhtarian and Mackworth in [13]. The innovation of the proposed approach is that the multi-scale representation is not used to detect important transient events to obtain a compact shape descriptor; instead, the amount of convexity/concavity is used directly for quantifying the differences between shapes.
It is demonstrated throughout the chapter that using such a rich representation for quantifying differences between shapes can lead to a significant improvement in retrieval effectiveness. However, the author does not mean to imply that transient events do not contain significant information about the structure of the contour, but rather that using a rich representation can ensure that none of the important information is unnecessarily discarded, allowing simpler and more robust matching applicable to all types of closed contours. Although such a rich descriptor provides excellent retrieval performance, the inherent redundancy between scale levels implies high storage requirements. Therefore, an alternative representation, termed MCC-DCT, that reduces this redundancy by de-correlating the information from different scale levels and allows selective discarding of unimportant information is proposed. Also, the study of how the weights, with which information from different scale levels is incorporated into the dissimilarity measure, influence the overall retrieval performance led to the development of an optimization framework to determine optimal proportions between the information represented by different contour point features according to a chosen criterion. This framework is presented


in the context of optimizing the new shape representation so that it can be compared both to non-optimized versions of the approach and to the original MCC approach which, in itself, is difficult to optimize. It is demonstrated that the average retrieval performance can be further improved by using such a framework. Furthermore, this framework could be used to tailor the shape representation to a specific application.

5.1.2 Chapter Structure

The remainder of this chapter is organized as follows: The next section describes the multi-scale shape representation in detail and discusses its properties. Section 5.3 presents an optimal solution for matching two MCC representations and provides a definition of the dissimilarity measure. Section 5.4 presents retrieval results, including a comparison with the CSS approach in the author's own simulation of the "CE-Shape" MPEG-7 core experiment. The computational complexity of the proposed approach is discussed in section 5.5. An alternative version of the shape representation and one possible parameter optimization approach are described in section 5.6. Section 5.7 discusses the relationship between the proposed approach and other curve matching techniques, followed by a discussion of prospects for further improvements and an outline of possible directions for future research in this area in section 5.8. Finally, conclusions are formulated in section 5.9.

5.2 Multi-scale Convexity Concavity Representation

5.2.1 Definition

Since usage of compact shape descriptors leads to problems with generalization and a trade-off between robustness to deformations and lack of discriminatory power, a new rich multi-scale shape representation, termed the Multi-scale Convexity Concavity (MCC) representation, which stores information about the amount of convexity/concavity at different scale levels for each contour point, is proposed. The representation takes the form of a 2D matrix where the columns correspond to contour points (contour parameter u) and the rows correspond to the different scale levels σ. Position (u, σ) contains information about the degree of convexity or concavity of the u-th contour point at scale level σ. The simplified boundary contours at different scale levels can be obtained via a curve evolution process similar to that used in [13] to extract CSS images. Assume a size-normalized closed contour C represented by N contour points and parameterized by arc-length u: C(u) = (x(u), y(u)), where u ∈ (0, N). The coordinate functions of C are convolved with a Gaussian kernel φσ of width σ ∈ {1, 2, ..., σmax}.
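The curve evolution step above can be sketched in a few lines. The following is a minimal illustration, not the thesis's implementation: the signed-curvature-style measure used for the matrix entries is an assumed stand-in for the thesis's exact convexity/concavity measure, and the function name is hypothetical.

```python
import numpy as np

def mcc_matrix(x, y, sigma_max=10):
    """Illustrative MCC-style matrix: rows = scale levels, columns = contour points.

    x, y: coordinates of a closed contour sampled at N equally spaced
    arc-length positions.  The per-entry measure (a signed-curvature proxy)
    is a stand-in for the measure defined in the text.
    """
    N = len(x)
    K = np.zeros((sigma_max, N))
    for s in range(1, sigma_max + 1):
        # Gaussian kernel of width sigma = s, applied circularly by tiling
        # the closed contour and keeping the middle copy.
        t = np.arange(-3 * s, 3 * s + 1)
        g = np.exp(-t ** 2 / (2.0 * s ** 2))
        g /= g.sum()
        xs = np.convolve(np.tile(x, 3), g, mode="same")[N:2 * N]
        ys = np.convolve(np.tile(y, 3), g, mode="same")[N:2 * N]
        # Signed curvature proxy: cross product of first and second differences.
        dx, dy = np.gradient(xs), np.gradient(ys)
        ddx, ddy = np.gradient(dx), np.gradient(dy)
        K[s - 1] = dx * ddy - dy * ddx
    return K
```

Each row of the returned matrix corresponds to one level of the contour evolution, so "hills" and "valleys" of the 2D surface it defines correspond to convexities and concavities surviving at that scale.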


   ... Dtot(i + N - 1, N - 1); END IF; END FOR
5. IF checking of the mirror transformation is required THEN
   • FLIP the columns of the distance table;
   • REPEAT step 4;
   END IF;

The above algorithm finds the minimum cost Dmin of traversing the distance table for N pairs of starting points (all possible circular shifts) and, if necessary in a given application, can take the mirror transformation into account. It should be stressed that the most computationally expensive step, the calculation of pairwise distances between contour points, is performed only once and does not have to be repeated when evaluating a mirror transformation or a shift.

¹For clarity, operations related to tracing of the optimal path, needed whenever the objective is also to establish the best correspondence between contour points, are omitted from the pseudo-code.
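The structure of this search can be sketched directly: the pairwise distance table is built once, then a plain dynamic-programming pass is repeated for each circular shift of one contour (and, if required, for its mirrored version). This naive version re-runs a full DP pass per shift and is therefore O(N³); it only illustrates the idea, whereas the pseudo-code above evaluates all shifts more efficiently over an extended table. All names are illustrative.

```python
import numpy as np

def min_matching_cost(FA, FB, check_mirror=False):
    """Minimum DP cost over all N circular shifts of contour B.

    FA, FB: (N, d) arrays of per-contour-point features (e.g. MCC columns).
    The pairwise distance table is computed only once and merely re-indexed
    for each shift; the mirror case just flips its columns.
    """
    N = FA.shape[0]
    dist = np.linalg.norm(FA[:, None, :] - FB[None, :, :], axis=2)
    tables = [dist, dist[:, ::-1]] if check_mirror else [dist]
    best = np.inf
    for table in tables:
        for shift in range(N):
            d = np.roll(table, -shift, axis=1)  # re-index B's starting point
            D = np.full((N, N), np.inf)
            D[0, 0] = d[0, 0]
            for i in range(1, N):
                D[i, 0] = D[i - 1, 0] + d[i, 0]
                D[0, i] = D[0, i - 1] + d[0, i]
            for i in range(1, N):
                for j in range(1, N):
                    D[i, j] = d[i, j] + min(D[i - 1, j - 1],
                                            D[i - 1, j], D[i, j - 1])
            best = min(best, D[-1, -1])
    return best
```

Identical contours, or one contour matched against a circularly shifted copy of itself, yield a cost of zero, since some shift aligns the zero diagonal of the distance table.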


5.3.5 Shape Dissimilarity

The final dissimilarity between two contours is calculated by normalizing the cost of the optimal path (Dmin) found by the matching algorithm by the number of contour points N used and by a crude estimate of the complexities of both shapes computed from their MCC representations:

D(A, B) = Dmin / ( N · (χA + χB) )          (5.9)

Complexities χA and χB are estimated as the averages of the dynamic ranges of the concavity/convexity measures over all scales:

χA = (1/σmax) Σ_{σ=1..σmax} [ max_u{KA(u, σ)} - min_u{KA(u, σ)} ]          (5.10)

and similarly for χB. The introduction of the second normalization is motivated by the author's observation that humans are generally more sensitive to contour deformations when the complexity of the contour is lower. Thus, dissimilarities between low-complexity contours should be additionally penalized. It should be stressed that there is no extra penalty for the stretching and shrinking of contour parts during the matching process. Rather, a globally optimal match between two contours is found using a rich multi-scale feature for each contour point, and the dissimilarity is based solely on the distances between corresponding points. This is contrary to the majority of other contour matching approaches based on dynamic programming, where different penalty factors for stretching and shrinking, usually chosen in an ad-hoc fashion [170], drive the matching process.
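The normalization of Eqs. (5.9) and (5.10) is straightforward to express in code; a minimal sketch, assuming the MCC matrix stores scale levels in rows and contour points in columns (function names are illustrative):

```python
import numpy as np

def complexity(K):
    """Eq. (5.10): average dynamic range of the convexity/concavity
    measure over all scale levels (rows of the MCC matrix K)."""
    return np.mean(K.max(axis=1) - K.min(axis=1))

def dissimilarity(d_min, n_points, KA, KB):
    """Eq. (5.9): the optimal path cost normalized by the number of contour
    points and by the complexities of both shapes, so that deformations of
    low-complexity contours are penalized more."""
    return d_min / (n_points * (complexity(KA) + complexity(KB)))
```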

5.3.6 Matching Examples

Examples of typical matching are shown in Figure 5.6. Figure 5.6a shows the matching of two shapes from the MPEG-7 dataset: stef09 and stef13. It can be seen that, despite significant non-rigid deformations between these shapes, the optimal path deviates only minimally from the straight diagonal line (roughly part {6} of the shapes), resulting in intuitive correspondences between contour points. This example confirms that the approach exhibits robustness to non-rigid deformations. Figure 5.6b shows an example of matching in the presence of outliers (or occlusions), where shape horse14 from the MPEG-7 dataset is matched with its modified version representing the horse with a rider. The fragment of the optimal path corresponding to the matching of unaltered parts of the modified shape to the original shape is practically straight, indicating that


Figure 5.6: Matching examples: (a) matching of two shapes, stef09 and stef13, from the MPEG-7 collection; (b) matching of horse14 from the MPEG-7 collection and its manually modified version, to illustrate matching in the presence of outliers or occlusions; (c) matching of stef09 with horse14.

the representation of these parts remained unaffected even in the presence of the significant outlier. The deviation of the path from the straight diagonal line (segments labelled {1} and {2}) is related to the matching of the rider from one contour to the horse's back from the other contour. The cost of matching the rider to the back of the horse is used to quantify the dissimilarities between the two shapes. Matching two very dissimilar shapes is illustrated in Figure 5.6c. Despite significant differences between the shapes, the matching algorithm is able to find an intuitive match; the best match is found for a mirrored version of shape horse14. Of course, even such a globally optimal matching should result in a large dissimilarity measure between these shapes. The above examples illustrate the capability of the matching algorithm to compensate for small deviations in the positions of the convexities/concavities along the contour, and the fact that an intuitive correspondence can be found even in the presence of significant outliers.

5.4 Results

This section demonstrates the usefulness of the proposed method for shape-based retrieval in large datasets. All presented experiments are performed using MCC representations with 100 equally spaced contour points and scale levels from 1 to 10.

5.4.1 Retrieval of Marine Creatures

Illustrative retrieval results obtained using the proposed approach on a database of 1100 shapes of marine creatures [206] are shown in Figure 5.7. Although there is no ground-truth classification associated with this collection, it can be safely stated that the retrieval results correspond closely to human perception of similarity. Moreover, the results compare favourably with results obtained by


[Figure content not recoverable from the source: retrieval results for the marine creature shapes (Figure 5.7) and subsequent figure panels showing examples of handwritten word images.]
Figure 6.4: Intermediate results of the contour extraction process (columns in order from left to right): input gray-scale image, binarization, localization of the main body of lower-case letters and removal of margin line residua, connected components algorithm (gray levels) and best connecting links (red lines), binary mask, word contour.


6. CASE STUDY 1: WORD MATCHING IN HANDWRITTEN DOCUMENTS


Figure 6.5: Matching words. Finding the optimal match between two contour representations corresponds to finding the lowest-cost diagonal path through the table containing pairwise distances between contour point features from both words. The optimal path has to begin and end at the cell corresponding to the alignment of the two starting contour points under consideration. Various strategies can be employed for pairing starting points. Deviation of the path from a straight diagonal compensates for elastic deformations between contours, i.e. stretching and shrinking of parts.

the need for skew and slant angle normalization. Furthermore, elastic contour matching based on a Dynamic Programming (DP) technique should provide a flexible way to compensate for inter-character and intra-character spacing variations. For the purposes of this chapter, the MCC-DCT version of the contour representation and optimized matching is used. The matching procedure was optimized using the shape collection from the MPEG-7 Core Experiment "CE-Shape-1" (part B) [215]. It should be noted that this dataset contains general shapes and does not contain any examples of word contours. An example of matching word contours using the above method is shown in Figure 6.5.

6.4 Experiments

6.4.1 Overall Results

The approach was tested using a set of 20 pages from the George Washington collection at the Library of Congress. These contain 4,856 word occurrences of 1,187 unique words. The collection was accurately segmented into words using the algorithm described in [24]. Ground-truth annotations for the word images were created manually as in [25]. The contours for all words were extracted according to the procedure described


in section 6.2. Then, for each word contour, the MCC-DCT descriptor with 100 equally spaced contour points, each with a feature vector consisting of 10 DCT coefficients, was extracted. [...]

Effect of pruning of word pairs based on ...:

  threshold               0      1      2      ∞
  pruned/total pairs [%]  50%    9%     2%     0%
  WER (w/o OOV words)     0.180  0.164  0.165  0.165

Effect of pruning of word pairs based on the number of ... (threshold τasc):

  τasc                    0      1      2      ∞
  pruned/total pairs [%]  73%    30%    9%     0%
  WER (w/o OOV words)     0.204  0.164  0.165  0.165
In the last experiment all three pruning rules were combined:

Rule 4: Two word contours Ci and Cj are similar only if Rules 1, 2 and 3 are satisfied.

Incorporating the above rule and setting the three thresholds (tx = 0.3, the second threshold to 0, and τasc = 1) led to discarding 85% of the total matches while maintaining a low average WER of 0.183 (excluding OOV words).
In summary, the relatively minor adaptation of the matching algorithm and the incorporation of the pruning technique could reduce the average time required to recognize a single word to 0.4s (for the particular size of the annotated collection) whilst maintaining a low average WER. This corresponds roughly to a 70-fold increase in speed compared to the original approach.
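The compaction step behind the MCC-DCT descriptor used above (retaining only a few low-order DCT coefficients of each contour point's scale profile) can be sketched as follows. The explicit orthonormal DCT-II construction and the function names are illustrative assumptions, not the thesis's implementation:

```python
import numpy as np

def dct2_basis(n):
    """Orthonormal DCT-II basis matrix (n x n)."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    C[0] /= np.sqrt(2.0)   # first basis vector gets the 1/sqrt(2) factor
    return C

def mcc_to_dct(K, n_coeffs=10):
    """De-correlate each contour point's scale profile (a column of the
    MCC matrix K) with a 1-D DCT along the scale axis and keep only the
    first n_coeffs coefficients, which carry most of the energy."""
    coeffs = dct2_basis(K.shape[0]) @ K
    return coeffs[:n_coeffs]
```

Because the transform is orthonormal, dropping the high-order coefficients discards the least-correlated residual detail while the low-order coefficients summarize each point's behaviour across scales.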

6.4.3 Comparison with CSS

In the past, different shape matching techniques (typically not requiring parameterization of word contours) have been tested for their potential capabilities to match words. For example, [227] demonstrated that the Shape Context approach [121], which is currently the best classifier for handwritten digits, performs poorly in terms of recognition precision and speed for word matching. The procedure for extraction of a single closed contour for each word presented in section 6.2 opens up the possibility of utilizing matching techniques requiring an ordering of contour pixels. In this section, the performance of the CSS approach [13], which was adopted as the contour-based shape descriptor of the ISO/IEC MPEG-7 standard and discussed in section 4.5.5, is evaluated in this context.


Table 6.4: WER (excluding OOV words) for the discussed methods.

  method                                WER (w/o OOV words)
  profile-based features and HMM [25]   0.349
  MCC-DCT                               0.174
  MCC-DCT with pruning                  0.183
  CSS [13]                              0.350

In the experiment, the CSS extraction and matching was performed using the MPEG-7 eXperimentation Model (XM software v5.6) [219]. To facilitate comparison of all three approaches, their results are collected in Table 6.4.

Adopting the CSS technique for recognition of words from the George Washington collection resulted in a relatively high average WER of 0.350 (excluding OOV words). Such poor performance can be explained by two major drawbacks of the CSS technique: ambiguity with regard to concave segments and an inability to represent convex segments. This result confirms the need for a rich shape descriptor, as discussed in chapter 5.

6.5 Future work

It is acknowledged that the contour extraction procedure has been tested with only one dataset. Therefore, more tests should be performed with various handwriting styles, with particular emphasis on the generality of the rules used for connecting word components and on the effects of manuscript quality degradation. Also, it would be interesting to evaluate the different binarization methods proposed recently (at the time of writing this thesis) in [242] against the variant of Niblack's method described in section 6.2.1. An obvious improvement to further reduce WER would be the optimization of the similarity measure using the framework proposed in chapter 5 in conjunction with a suitable training collection of handwritten words. Furthermore, currently all contours from the database are re-scaled to a predefined size prior to the extraction of the MCC representations. In the case of word matching, the x-height should be sufficient for size alignment; replacing the size invariance property with x-height invariance could further improve discrimination between words. The incorporation of additional features in order to capture the "inner" structure of a word and its dots, e.g. the size and centroid of the inner holes [232], should also be investigated. The number of contour matchings necessary to classify a single word could be further reduced by developing an appropriate indexing strategy for a training dictionary. One interesting challenge would be the augmentation of training dictionaries by synthesizing new word contours based on initial training sets to reduce the WER


caused by out-of-vocabulary words. Finally, it should be stressed that word matching is just a first step in word recognition. Incorporating a statistical language model as a post-process could further improve the recognition accuracy of the system, as was shown in [25]. In the future, this algorithm could serve as a starting point for building a complete indexing system for large collections of handwritten historical documents, such as those used for experimentation purposes here, but also illuminated Gaelic manuscripts. Furthermore, in the latter case, a specific challenge of differentiating multiple scribes within a single document based on characterizing the shapes of words could be investigated. Finally, a feasibility study should be carried out on applying the matching algorithm to on-line handwriting recognition, as is required in handheld devices.

6.6 Conclusion

In this chapter, the application of the contour matching technique described in chapter 5 to word recognition in historical manuscripts has been proposed and tested. Extensive experimentation conducted using the George Washington collection shows that systems based on word contour matching can significantly outperform existing techniques. Specifically, it has been shown that on a set of twenty pages of good quality, the average recognition accuracy was 83%. Adaptation of the contour matching to the specific task of word recognition, in order to improve performance and reduce computational load, together with a simple pruning technique, was also discussed. Moreover, a preliminary investigation of potential simplifications allows speculation that additional development would further improve the performance.


Chapter 7

CASE STUDY 2: AN INTEGRATED APPROACH FOR OBJECT SHAPE REGISTRATION AND MODELING

In this chapter, an integrated approach/tool for fast and efficient construction

of statistical shape models will be proposed that is a potentially useful tool in Content-based Image Retrieval (CBIR). The tool allows intuitive extraction of accurate contour examples from a set of images using the semi-automatic segmentation approach described in chapter 3. A set of labelled points (landmarks) is identified automatically on the set of examples, thereby allowing statistical modeling of the objects' shape. The chapter discusses the feasibility of utilizing the method for pairwise correspondence proposed in chapter 5 for automatic landmark identification in sets of contours. The landmarks are used to train statistical shape models known as Point Distribution Models (PDM) [70]. Qualitative results are presented for 3 classes of shape which exhibit various types of non-rigid deformation.

7.1 Introduction

7.1.1 Motivation

Object-based retrieval requires important underpinning technology for both object extraction (i.e. segmentation) and indexing in terms of features such as shape, colour and texture. To date, segmentation and indexing have usually been performed independently. For instance, the matching algorithm discussed in chapter 5 allows effective retrieval by shape assuming that objects have already been segmented and their binary masks are available. From another point of


7. CASE STUDY 2: SHAPE REGISTRATION AND MODELING

view, the fully automatic segmentation approach described in chapter 3 utilizes only image data (a bottom-up approach) without using any prior information about the objects present in the scene. In this chapter, an integrated generic approach to shape registration and subsequent generation of shape models for particular classes of objects (e.g. human head and shoulders), useful for top-down segmentation and recognition purposes, is presented. This chapter focuses on the issues of model generation and not on their subsequent use in a retrieval context. More than a decade ago, Cootes and Taylor introduced one of the more influential ideas within the image analysis community, the so-called Active Shape Models (ASMs) or "Smart Snakes" [70]. Active Shape Models are deformable models with global shape constraints learned through observations. Objects are represented by a set of labelled points, each placed on a particular part of the object. By examining the statistics of the positions of the labelled points, a Point Distribution Model (PDM) is derived. The model gives the average position of the points and a description of the main modes of variation found in the training set. The statistical analysis of shapes is widely applicable to many areas of image analysis, including the analysis of medical images [184, 185], industrial inspection tools, and the modeling of faces, hands and walking people [117, 186]. The use of PDMs to automatically identify examples of the model object in unseen images, and the relation of PDMs to other forms of flexible templates, have been presented in [117]. For detailed studies on PDMs the reader can refer to the vast amount of literature available [248, 249, 184, 78, 185, 250]. Statistical shape modeling methods are based on examining the statistics of the coordinates of the labelled points over the training set. The only requirement is a labelled set of examples upon which to train the model. Correspondence is often established by manual annotation, which is subjective and labor-intensive, and for improved accuracy intra- and inter-annotator variability studies are required. The goal of this chapter is to develop the algorithms necessary for fast and efficient construction of PDMs, eliminating the burden of manual landmarking.
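The statistics referred to above (the mean point positions plus the main modes of variation) amount to a principal component analysis of the aligned landmark vectors. The following is a minimal sketch with illustrative names, not the thesis's implementation:

```python
import numpy as np

def build_pdm(X):
    """Point Distribution Model from aligned landmark vectors.

    X: (m, 2n) matrix with one row per training shape,
       each row laid out as [x1..xn, y1..yn] (already Procrustes-aligned).
    Returns (mean shape, modes P, variances), so that new shapes are
    generated as x = mean + P @ b for a vector b of mode weights.
    """
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)      # sample covariance of landmarks
    variances, P = np.linalg.eigh(cov)
    order = np.argsort(variances)[::-1]       # largest-variance mode first
    return mean, P[:, order], variances[order]
```

In practice only the few modes accounting for most of the total variance are retained, which is what gives the model its global shape constraint.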

The proposed approach is based on two key technologies developed in previous chapters: the semi-automatic segmentation approach for fast extraction of examples of closed contours from a set of images discussed in chapter 3, and a new method for automatic identification of landmarks on the set of examples based on the matching technique described in chapter 5. The typical scenario would be a situation where the user collects several images containing instances of the object under consideration and constructs its statistical shape model using as small a number of interactions as possible. The process would involve the extraction of several object contours by the supervised segmentation. The tool would then automatically place a set of landmarks on each contour and build the required statistical shape model.



7.1.2 Previous Work

In the past, semi-automatic landmarking systems have been developed where the image containing a new example is searched using the model built from the current set of examples and the result is edited before adding the new example to the training set. Although such approaches considerably reduce the required effort, many researchers have recognized the need for fully automatic landmark placement. The aim is to place points to build a model which best captures the shape variation and which has minimal representation error [78]. The methods for finding correspondences across sets of shapes can be classified into two classes [78]: pair-wise methods [251, 252, 253], which perform a sequence of individual pairwise correspondences optimizing a pairwise function, and group-wise methods [254, 250, 255, 256], which explicitly aim to optimize a function defined on the whole set of shapes. An elegant solution was proposed in [257], where salient image features such as corners and edges were used to calculate a Gaussian-weighted proximity matrix measuring the distance between each feature. A Singular Value Decomposition (SVD) is performed on the matrix to establish a correspondence measure for the features, effectively finding the minimum least-squared distance between each feature. However, this approach is unable to handle rotations of more than 15 degrees and is generally unstable [78]. This method was further extended in [197] and [200]. Although the above extensions provide good results on certain shapes, stability problems were reported in cases where the two shapes are similar, as well as an inability to deal with loops in the boundary [251]. Other methods [252, 253] assume that contours (usually closed) have already been segmented and use curvature information to select landmark points. However, they perform well only if the corresponding points lie on boundaries that have the same curvature. Hill et al. proposed an alternative method for determining non-rigid correspondence between pairs of closed boundaries [251] based on generating sparse polygonal approximations for each shape. The landmarks are further improved by an iterative refinement step. An interesting system for automated 2D shape model design by registering similar shapes, clustering examples and discarding outliers was described in [258]. Since the above pair-wise methods may not find the best global solution, group-wise methods were recently proposed. In [254], landmarks are placed on sets of closed curves using direct optimization. The approach is based on a measure of the compactness and specificity of a model which is a function of the landmark positions on the training set. Although in some cases the method produces results better than manual annotation, this is a large, nonlinear optimization problem and the algorithm does not always converge. Recently, Davies et al. [259]

proposed an objective function for group-wise correspondence of shapes based on the Minimum Description Length (MDL) principle. The landmarks are chosen


so the data is represented as efficiently as possible. This approach was further extended in [255] by adding curvature measures. A steepest-descent version of MDL shape matching was proposed in [256]. Currently, methods based on the MDL principle are seen by many as the state-of-the-art solution: they are fully automatic and in many cases provide more meaningful models than manual annotations. However, the quality of the model relies directly on the choice of the objective function [259, 256] and on the optimization method, which has to contend with many local minima. Often, convergence is slow and scales poorly with the number of examples. The duration of one model-evaluation-tuning loop can take hours or even days. Also, implementation and tuning of the MDL framework requires a lot of experimentation and parameter tweaking [260]. For a more detailed review of methods for finding correspondences across contour sets one may refer to [250, 78, 261].

7.1.3 Chapter Structure

The remainder of this chapter is organized as follows. The next section gives a short introduction to the field of statistical shape analysis by describing alignment of sets of points using Procrustes Analysis and modeling of shape variation using PDMs. The proposed landmarking framework is described in detail in section 7.4, and section 7.5 presents selected qualitative results. Possible alternative approaches and directions for future research in this area are discussed in section 7.6. Finally, conclusions are formulated in section 7.7.

7.2 Point Distribution Model

This section aims to give a brief introduction to the field of statistical shape analysis by describing two basic techniques: alignment of sets of points using Procrustes Analysis, and modeling of shape variation using PDMs.

7.2.1 Pairwise Alignment

According to the definition from [248], shape is all the geometrical information that remains when location, scale and rotational effects are filtered out from an object. Therefore, equivalent points from different shapes can be compared only after establishing a reference pose (position, scale and rotation) to which all shapes are aligned. A commonly used procedure for aligning corresponding point sets is Procrustes Analysis. Procrustes analysis requires one-to-one correspondence of points, often called landmarks¹, i.e. points which identify a salient feature on an object

¹Synonyms for landmarks include homologous points, nodes, vertices, anchor points, markers,



and which are present on every example of the class. Each planar shape is characterized by a set of n landmarks (common for the whole set of examples) and its vector representation is denoted as x = [x_1, x_2, \ldots, x_n, y_1, y_2, \ldots, y_n]^T. Procrustes analysis is based on minimization of the Procrustes distance between two shapes, x_1 and x_2, which is simply the sum of the squared point distances:

D(x_1, x_2) = \sum_{i=1}^{n} \left[ (x_{1,i} - x_{2,i})^2 + (y_{1,i} - y_{2,i})^2 \right]    (7.1)

The alignment of two shapes x_1 and x_2 involves three steps:
1. Align the positions of the two centroids,
2. Re-scale each shape to have equal size,
3. Align orientation by rotation.
The centroid is defined as the centre of mass of the physical system consisting of unit masses at each landmark:

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i    (7.2)

Size normalization is performed using a scale metric called centroid size:

S(x) = \sqrt{ \sum_{i=1}^{n} \left[ (x_i - \bar{x})^2 + (y_i - \bar{y})^2 \right] }    (7.3)

Orientation alignment is achieved by maximizing the correlation between the two sets of landmarks. The shape x_1 is optimally superimposed upon x_2 by performing a Singular Value Decomposition (SVD) of the matrix x_1^T x_2 = U \Sigma V^T and using V U^T as the rotation matrix.
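The three alignment steps and the SVD-based rotation can be sketched with numpy. This is a minimal illustration, not the thesis implementation; point sets are n×2 arrays whose rows are already in correspondence, and the rotation is applied in the row-vector convention (equivalent to the VU^T of the text applied to column vectors):

```python
import numpy as np

def procrustes_align(x1, x2):
    """Superimpose x1 onto x2: translate, scale, then rotate (orthogonal Procrustes)."""
    # 1. align centroids (equation 7.2): translate both shapes to the origin
    x1 = x1 - x1.mean(axis=0)
    x2 = x2 - x2.mean(axis=0)
    # 2. re-scale to unit centroid size (equation 7.3)
    x1 = x1 / np.linalg.norm(x1)
    x2 = x2 / np.linalg.norm(x2)
    # 3. optimal rotation from the SVD of x1^T x2
    U, _, Vt = np.linalg.svd(x1.T @ x2)
    R = U @ Vt
    if np.linalg.det(R) < 0:  # guard against reflections
        U[:, -1] *= -1
        R = U @ Vt
    return x1 @ R, x2

def procrustes_distance(x1, x2):
    """Sum of squared point distances (equation 7.1) after alignment."""
    a, b = procrustes_align(x1, x2)
    return float(np.sum((a - b) ** 2))
```

Aligning any similarity-transformed copy of a shape onto the original this way yields a Procrustes distance of (numerically) zero.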

7.2.2 Generalized Procrustes Analysis

The alignment of a set of shapes is typically performed in an iterative² manner [184]:

model points and key points. ²An analytic solution can be found in [248].



1. Choose an initial estimate of the mean shape (e.g. the first shape in the set),
2. Align all shapes to the mean shape,
3. Re-calculate the estimate of the mean from the aligned shapes,
4. If the mean has changed, return to step 2.
The most frequently used estimate of the mean shape is the Procrustes mean shape:

\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i    (7.4)

where N denotes the number of shapes. It is important to fix the size and orientation at each iteration by normalization in order to avoid any shrinking or drifting of the mean shape. However, not all sets require orientation adjustment during alignment of the examples to the mean (in step 2).
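The iterative scheme above can be sketched as follows. This is a hedged numpy sketch assuming point correspondences are already given; the helper names are illustrative, not from the thesis, and reflection handling is omitted for brevity:

```python
import numpy as np

def _normalize(x):
    """Centre a shape (n x 2 array) and scale it to unit centroid size."""
    x = x - x.mean(axis=0)
    return x / np.linalg.norm(x)

def _rotate_onto(x, ref):
    """Optimal rotation of x onto ref (orthogonal Procrustes, section 7.2.1)."""
    U, _, Vt = np.linalg.svd(x.T @ ref)
    return x @ (U @ Vt)

def generalized_procrustes(shapes, tol=1e-8, max_iter=100):
    shapes = [_normalize(s) for s in shapes]
    mean = shapes[0]                       # 1. initial estimate of the mean
    for _ in range(max_iter):
        aligned = [_rotate_onto(s, mean) for s in shapes]  # 2. align to mean
        new_mean = _normalize(np.mean(aligned, axis=0))    # 3. re-estimate, re-normalize
        if np.linalg.norm(new_mean - mean) < tol:          # 4. converged?
            return aligned, new_mean
        mean = new_mean
    return aligned, mean
```

Re-normalizing the mean at every iteration is exactly the fix against shrinking and drifting mentioned in the text.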

In many applications the orientation of examples is already fixed within the scene, e.g. the "Head & Shoulders" example from section 7.5.2. The author's observations indicate that, in such cases, correcting the orientation of examples in order to minimize the Procrustes distance (equation 7.1) may result in additional artificial (non-intuitive) modes of variation. Therefore, for flexibility, the tool developed in this chapter allows skipping the orientation alignment. In practice, it might be useful to allow the use of alternative alignment methods.

7.2.3 Modeling Shape Variation

Once a set of aligned shapes is available, the mean shape and variability can be found. In this work, statistical shape models known as Point Distribution Models (PDM) [70] are used. Consider the case of N planar shapes consisting of n points, covering a certain class of shapes and all aligned into a common frame of reference. Training PDMs relies upon exploiting the inter-landmark correlation in order to reduce dimensionality. It involves calculating the mean of the aligned examples \bar{x} (according to equation 7.4) and the deviation from the mean of each aligned example, \delta x_i = x_i - \bar{x}, and then computing the eigensystem of the 2n \times 2n covariance matrix of the deviations:

S = \frac{1}{N} \sum_{i=1}^{N} \delta x_i \, \delta x_i^T    (7.5)

The eigenvectors p_k of S are sorted by decreasing eigenvalue, \lambda_k \geq \lambda_{k+1}. This results in an ordered


basis where each component is ranked according to variance. Modifying one component at a time gives the principal modes of variation. Any shape in the training set can be approximated using the mean shape and a weighted sum of the deviations obtained from the first t modes:

x = \bar{x} + P b    (7.6)

where P = (p_1, p_2, \ldots, p_t) is the 2n \times t matrix of the first t eigenvectors and b = (b_1, b_2, \ldots, b_t)^T is a t-element vector of weights (shape parameters), one for each eigenvector. New examples of shapes can be generated by varying the parameters b within suitable limits, so that the new shapes are similar to those in the training set. t is typically chosen as the smallest number of modes such that the sum of their variances explains a sufficiently large proportion of the total variance of all the variables. The modes are often comparable to those a human would select to design a parameterized model. However, as explained in [78], they may not always separate shape variation in an obvious manner since they are derived directly from the statistics of a training set.
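Training a PDM and generating new shapes from it can be sketched as below. This is a hedged numpy illustration with invented function names (`train_pdm`, `generate`); shape vectors are rows [x_1..x_n, y_1..y_n] as above, already aligned:

```python
import numpy as np

def train_pdm(shapes, var_frac=0.98):
    """Return the mean, the first t eigenvectors P and their eigenvalues (eq. 7.5)."""
    X = np.asarray(shapes, dtype=float)       # N x 2n matrix of aligned shape vectors
    mean = X.mean(axis=0)
    D = X - mean                              # deviations dx_i as rows
    S = D.T @ D / len(X)                      # 2n x 2n covariance matrix
    vals, vecs = np.linalg.eigh(S)            # eigenvalues in ascending order
    vals = np.clip(vals[::-1], 0.0, None)     # descending; clamp round-off negatives
    vecs = vecs[:, ::-1]
    # smallest t whose cumulative variance reaches var_frac of the total
    t = int(np.searchsorted(np.cumsum(vals) / vals.sum(), var_frac)) + 1
    return mean, vecs[:, :t], vals[:t]

def generate(mean, P, b):
    """New shape from shape parameters b (equation 7.6): x = mean + P b."""
    return mean + P @ b
```

Projecting a training shape onto the modes, b = P^T (x - \bar{x}), and regenerating it with `generate` reconstructs the shape whenever its variation is covered by the retained t modes.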

7.3 Semi-Automatic Segmentation

Since a typical model requires tens of examples, special consideration is devoted to ensuring easy and intuitive user interaction in the segmentation process, so that generation of these examples is as simple as possible. Contour examples are extracted using the semi-automatic segmentation approach discussed in chapter 3. This approach has proven to be extremely practical and efficient, allowing rapid segmentation while providing the flexibility necessary to obtain high accuracy in the extracted examples.

7.4 Landmark Identification

The strategy used in the framework for automatic identification of landmarks on a set of contours is similar to the scheme proposed by Fleute and Lavalee for landmarking sets of closed 3D surfaces [262]. Each training example is initially matched to a single reference template and a mean is built from these matched examples. Then, iteratively, each example is matched to the current mean and the procedure is repeated until convergence.


7.4.1 Pairwise Matching

The above framework relies upon the ability to match pairs of shapes (in order to match each shape with the mean) and to measure the quality of the match (in order to decide the initial reference shape). In other words, the mean cannot be built and subsequently updated if there is no dense correspondence between the contour points of the whole set. For this process to be successful, an accurate, robust method for establishing pairwise correspondence is required. It is also required that the pairwise matching is invariant with respect to pose (translation, scale and rotation). Furthermore, the method has to perform well in cases where the two boundaries represent different examples from the same class of objects and a nonrigid transformation is required to map one boundary onto the other. This chapter studies the usefulness of the pairwise shape matching technique presented in chapter 5 for the above landmarking framework. The method is particularly attractive for this application since, as demonstrated in chapters 5 and 6, it performs well in cases of elastic deformations and where the similarity between curves is weak [211]. For the purpose of the landmarking framework, the MCC-DCT version [211, 243] of the contour representation is used.

The matching procedure was optimized using the shape collection from the MPEG-7 Core Experiment "CE-Shape-1" (part B) [215], as discussed in section 5.6. This choice of configuration was based on the assumption that its better recognition performance on collections containing general shapes results from more intuitive matching. It should be noted that the matching algorithm requires the two contours to be represented by an equal number of equidistant points. This leads to an additional re-sampling step in the landmarking algorithm. The extension of the matching algorithm to non-uniformly sampled contours is straightforward and will be part of the author's future work.

Also, the landmarking scheme requires dense correspondence (ideally between every pixel). To allow such fine matching without increasing the computational cost, faster versions of the matching algorithm could be utilized and tested, e.g. based on the strategy for fast selection of starting points [170].
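The equidistant re-sampling that the matcher requires can be sketched as follows. This is an illustrative sketch, not the thesis implementation; the contour is an m×2 point array and spacing is made uniform in arc length along the closed polygon:

```python
import numpy as np

def resample_closed(contour, n):
    """Resample a closed contour (m x 2 array) to n points equidistant in arc length."""
    pts = np.vstack([contour, contour[:1]])          # close the loop
    seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])      # cumulative arc length per vertex
    targets = np.linspace(0.0, s[-1], n, endpoint=False)
    x = np.interp(targets, s, pts[:, 0])             # linear interpolation along edges
    y = np.interp(targets, s, pts[:, 1])
    return np.stack([x, y], axis=1)
```

For example, resampling a unit square given by its four corners to 8 points places one sample every half edge-length along the perimeter.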

7.4.2 Correspondence for a Set of Curves

Before the landmarking process begins, all examples are down-sampled and represented by a fixed number of equidistant contour points (300 in all experiments in this chapter). Then, the position and size of the contours are normalized according to equations 7.2 and 7.3. Next, an initial reference contour is selected from the set in such a way that the total value of the dissimilarities between the reference and all other examples is minimized. This ensures that the initial reference is chosen close to the centre of the modelled shape space. Subsequently, each example is matched to the reference using the pairwise matching method discussed in section 7.4.1. This provides a common basis for the identification of points from all examples. In other words, a contour point in any of the examples is identified by the point from the reference contour to which it is best matched. This allows optimal rotation of the examples to superimpose with the reference (as explained in section 7.2.1) and finally construction of the mean contour according to equation 7.4. If the distance between the current reference and the newly estimated mean (computed using equation 7.1 and normalized by the centroid size of the reference) exceeds a predefined threshold, the mean is taken as the new reference and the procedure is repeated; otherwise the procedure stops. As a result, a dense correspondence between each example and the mean is obtained. Using these dense correspondences, the landmarks placed on the mean shape can be projected back to each example. The main steps of the proposed framework are outlined below:
1. Sample uniformly all examples,
2. Normalize all examples: (a) translate the centroid to position (0,0); (b) re-scale to unit size,
3. Select the initial reference contour,
4. Find correspondences between the examples and the reference,
5. Rotate each example to superimpose with the reference,
6. Create a mean contour according to equation 7.4,
7. Take the mean as the new reference,
8. If the difference between the old and the new reference exceeds a predefined threshold, go to step 4,
9. Generate landmarks on the reference contour,
10. Project landmarks back to each example.
To avoid any drifting of the mean shape, its orientation is adjusted at each iteration by optimal superposition (minimization of the Procrustes distance) with the initial reference.
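The ten steps can be outlined in code as follows. This is a structural sketch only: `match` is a stand-in for the chapter 5 pairwise matcher and is assumed to return, for a contour and the reference, the indices of the contour points matched to each reference point together with a dissimilarity score; rotation alignment and the mean re-sampling step are omitted for brevity:

```python
import numpy as np

def landmark_set(contours, match, tol=1e-3, max_iter=20):
    n = len(contours)
    # step 3: initial reference minimizes total dissimilarity to all
    # other examples (O(N^2) pairwise comparisons)
    total = np.zeros(n)
    for i in range(n):
        total[i] = sum(match(contours[j], contours[i])[1]
                       for j in range(n) if j != i)
    ref = contours[int(np.argmin(total))]
    for _ in range(max_iter):
        # step 4: for each example, its points matched to the reference points
        matched = [c[match(c, ref)[0]] for c in contours]
        new_ref = np.mean(matched, axis=0)        # step 6: mean contour
        if np.linalg.norm(new_ref - ref) < tol:   # step 8: mean stopped moving
            ref = new_ref
            break
        ref = new_ref                             # step 7: mean becomes reference
    # step 10: `matched` holds each example's points in reference (landmark) order
    return ref, matched
```

With a trivial identity matcher on pre-aligned contours the loop converges in one pass; in the real framework, `match` compensates for the nonrigid deformation between each example and the mean.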
Additionally, the mean contour is re-sampled before a new iteration begins, to ensure equidistance of its contour points. This additional step could be eliminated by adapting the pairwise matching algorithm to match non-uniformly sampled contours. The objective of step 4 is to identify, on each example, the positions corresponding best to the contour points of the reference mean. However, the matching algorithm must compensate for nonrigid transformations as well as for small differences in the positions of contour points caused by sampling. The matching is symmetrical, which means that both contours (an example and the reference



Figure 7.1: Updating of the mean and landmark propagation in cases of multiple assignments of points.

mean) are treated equally. Therefore, a single contour point from an example can be matched with two reference points from the mean, and vice versa. The first case, depicted in Figure 7.1a, does not require any special treatment. The position of the single point from the considered example can be used to update the positions of both matched points from the reference, e.g. to construct the new mean. Also, during stage 8, if both reference points from the mean are selected to become landmarks, the position of the single point can be used for both of them without causing any conflict. In contrast, the second case requires special consideration, since the single landmark (L1) cannot be used for two different points from the considered example. Therefore, only the centre of the segment connecting both points is used to update the position of the matched reference point and can potentially become a landmark; see Figure 7.1b. The number of contour points should be considerably large compared to the required precision in the localization of the final landmarks. For maximum precision, the number of contour points should be comparable with the average number of pixels in each example. The set of landmarks can be generated automatically on the mean using any sensible method, for example the approach based on detection of Critical Points presented in [263]. In the current implementation, all points from the mean were selected as landmarks. However, choosing the extrema of the convexity/concavity measure from the MCC representation as major landmarks (due to the high matching reliability at those points) and placing the minor landmarks equally between them could further improve the specificity of the final PDMs. The selection of the initial reference contour requires computation of pairwise dissimilarities between all N examples, implying O(N²) complexity.
Since a coarse measure of the dissimilarity is sufficient for this selection, the computational load of this step can be reduced by using a reduced number of contour points.


7.4.3 Modeling Symmetrical Shapes

Often one would like to model shapes exhibiting mirror symmetry, e.g. a face or a human head & shoulders. In such cases, extending the training set with mirrored versions of the boundaries usually leads to more intuitive modes of variation, especially when there is originally a discrepancy between the numbers of examples from each view (e.g. more faces looking to the left than to the right) and the number of examples is small. Clearly, such an approach does not require any modification of the proposed landmarking method; however, it implies the additional cost of matching these additional examples with the mean. An alternative strategy requires adaptation of steps 4-6 of the proposed framework. In this case, each iteration begins by matching only the original set to the mean. Based on this match, a new mean and its mirrored version are created. Then, both versions of the mean (original and mirrored) are matched and their correspondence is used to establish correspondences between the mirrored set of contours and the mean obtained from the original set. Finally, the new symmetrical version of the mean, including the original and mirrored sets, is created. This approach avoids the need for individual matching of mirrored examples, explicitly ensures that each mirrored boundary is matched with the reference in exactly the same way as its original version³, and further improves convergence.
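Extending the training set with mirrored examples can be sketched as below (an illustrative sketch; reversing the point order after the flip keeps the traversal direction of the boundary consistent):

```python
import numpy as np

def mirror_augment(contours):
    """Extend a training set with left-right mirrored versions of each contour."""
    mirrored = []
    for c in contours:
        m = c * np.array([-1.0, 1.0])   # flip about the vertical axis
        mirrored.append(m[::-1].copy()) # reverse point order to keep orientation
    return list(contours) + mirrored
```

The augmented set can be fed unchanged into the landmarking framework, at the cost of matching twice as many examples per iteration, which is what motivates the alternative strategy described above.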

7.5 Results

This section presents qualitative results of training PDMs for three classes of shapes exhibiting various types of nonrigid deformation. In all three cases, the supervised segmentation of the set took less than 5 minutes. The landmarking process converged after 2-5 iterations in all cases, typically requiring less than 30 seconds on a standard PC.

7.5.1 Cross-sections of Pork Carcasses

The data in this experiment consists of 14 gray scale images (768×576 pixels each) [77]. An example image is shown in Figure 7.2a. The results are comparable with the manual annotations presented in [77]. It can be observed from Figure 7.2b that the model is compact (the first three eigenvectors explain more than 77% of all variation) and the qualitative results, shown in Figure 7.2c, indicate good specificity⁴.

³With the precision limited by the contour sampling.
⁴A specific model should only generate instances of the object class that are similar to those in the training set.



Figure 7.2: Pork carcasses: (a) - Example image of pork carcass cross-section, (b) - Eigenvalues, (c) - Deformation using 1st, 2nd and 3rd principal modes.

7.5.2 Head & Shoulders

The data in this experiment consists of 19 colour images of CIF size. Example images (with extracted object boundaries) are shown in Figure 7.3a. The training set was extended using mirrored versions of the examples, as described in section 7.4.3. The final model is compact (the first three eigenvectors explain more than 76% of all variation) and the qualitative results shown in Figure 7.3c indicate good specificity.

7.5.3 Synthetic Example

In this experiment, synthetic data designed to "stress test" the model, in order to illustrate some potential limitations of the proposed approach, is used. The training set, shown in Figure 7.4, is the synthetic "bump" example introduced and discussed originally in [250]. The examples were generated from a synthetic object that exhibits a single mode of shape variation, where the "bump" moves along the top of the box. Therefore, it should be possible to describe the variation with a single parameter. Qualitative results for the proposed approach are shown in Figure 7.4. The algorithm failed to find "ideal" (the most intuitive) correspondences for the



Figure 7.3: Head & Shoulders: (a) - Examples of contours extracted using the user-driven segmentation tool for the head & shoulders model, (b) - Eigenvalues, (c) - Deformation using 1st, 2nd and 3rd principal modes.

set, as some of the "bump" end-points lie on boundaries with concavity while others correspond to convex parts. The matching algorithm could only warp the parts with different curvature, attempting to find the optimal global correspondence. The poor localization of the end-points of the "bump" for some examples compromised the specificity of the model, as shown in Figure 7.4b. It should be noted, however, that the mean has the desired shape and the first eigenvector still describes almost 98% of the total model variation. This example suggests that making the approach fully functional (applicable to all classes of shape deformations) would require allowing manual corrections of some landmarks and/or further optimization, e.g. using an MDL-based approach.

7.6 Discussion

The main limitation of the approach is related to the fact that the matching relies on the assumption that the corresponding points lie on boundaries that have


Figure 7.4: "Bump": (a) - The "Bump" training set, (b) - Deformation using 1st, 2nd and 3rd principal modes.

similar values of curvature (convexity/concavity). As shown in section 7.5.3, this may compromise the compactness and specificity of the final model. One solution would be to allow the user to correct the positions of misplaced landmarks. Alternatively, an MDL-based method could be used to further refine the correspondences, as these approaches do not rely on pairing any boundary features. In other words, the proposed method could provide initial correspondences which could then be refined using an MDL-based technique. Such a hybrid approach could reduce the sensitivity of the MDL method to the search parameters and reduce the time required for convergence. Also, neither MDL methods nor the proposed method are suitable for objects represented by multiple open/closed boundaries, e.g. facial expressions. Addressing this problem will be part of the author's future research. One approach would be to integrate a method based on Shape Context [121] or its recently proposed extension with a figural continuity constraint [264]. The extension of the proposed approach to open curves would require extension of the user interface to allow extraction of open curves (e.g. by marking end-points of the closed curves) and a relatively straightforward modification of the pairwise matching algorithm. In fact, the matching problem could be simplified by assuming that the end-points of the curves correspond. Summarizing, the tool would have to integrate multiple modules allowing different types of interaction and generated models, for full flexibility in different


applications.

7.7 Conclusion

The proposed approach allows rapid creation of statistical shape models for classes of objects represented by single closed boundaries. The tool allows intuitive semi-automatic segmentation of object examples from a set of images, automatically identifies correspondences for the extracted set of contours, and creates Point Distribution Models. The proposed landmarking method is fast and does not require any tweaking of parameters. In all presented examples, it took less than 5 minutes for an inexperienced user to segment the examples and create a final model.


Chapter 8

GENERAL CONCLUSIONS AND FUTURE WORK

This final chapter summarizes the thesis by reviewing the achieved research objectives and recalling the main conclusions drawn throughout this thesis. It also indicates directions for further research by discussing possibilities for further improvement of the proposed techniques as well as their use in related challenges in the field.

8.1 General Discussion

In recent years, content based retrieval of multimedia data has become a very active research area, resulting in initiatives such as a dedicated ISO standard for the description of multimedia content named MPEG-7, large scale evaluation activities, e.g. TrecVid, and a large number of large-scale academic and industrial CBIR systems. However, it is safe to say that current CBIR systems are in their infancy and their performance when dealing with general content is rather poor. Moreover, the majority of currently available systems can deal only with collections which are reasonably homogeneous. Clearly, providing content based functionalities for heterogeneous collections, such as the content of the World Wide Web, is much more demanding. Therefore, one can expect the problem of content based retrieval of multimedia data to remain a challenging research area for many years to come. Although image or video content is clearly much more versatile than text, its interpretation rarely has a unique meaning. Such ambiguity implies considerable difficulties in finding optimal ways to abstract images or video using visual descriptors. In recent years, it has become clear that in order to provide content based functionalities, both low-level visual features and high-level semantic features



should be integrated. However, both the optimal choice of low-level features and the capturing of semantic concepts based on automatically extracted low-level features are context, user, and query dependent. Furthermore, it is also not clear how to intertwine contextual information into the numerical similarity measures used by computers. Currently, the problem of extraction of semantically meaningful entities from visual data can be successfully solved only by techniques tuned to a very narrow application scope, which are not easily extensible to other application contexts. In other words, current image processing techniques are unable to automatically extract semantically meaningful entities from general visual content. Although the capabilities of existing CBIR systems and their underpinning technologies seem quite limited given the typical variability of multimedia content, they certainly help to better understand the problem of retrieval by content, which hopefully will lead to new advances in the area. It is essential that future CBIR systems further utilize contextual information and knowledge of the world. It should also be accepted that interpretation of the content depends strongly on the user and his/her query, and that the development objective for future CBIR systems should not be a further reduction of the user's role but rather embracing it by providing better querying models and relevance feedback mechanisms. Many of the major obstacles to the development of efficient CBIR systems may be solved by future intelligent acquisition devices. In many applications, new devices could provide valuable contextual information, e.g. the time and location of capturing the data. Also, the problem of image segmentation and extraction of semantic entities would be simplified if new acquisition devices could provide additional modalities, e.g. depth or temperature. However, it should be noted that even in such cases many of the fundamental problems of image segmentation would remain essentially the same, e.g. fusion of information from multiple sources or the choice between bottom-up and top-down processing. Besides, even in the case where all semantically meaningful entities are successfully extracted, the problem of measuring similarities between such entities and integrating contextual information into numerical measures remains to be solved.

8.2 Thesis Summary & Research Contributions

The content analysis work reported in this thesis has focused on developments in the areas of image segmentation and shape matching, in the context of the content based functionalities required by future multimedia applications. The


limitations of current techniques from both of these areas have been evaluated and the techniques extended. The strengths and weaknesses of the proposed solutions have been demonstrated in the context of selected applications. The author hopes that this contribution leads to a better understanding and perhaps further advances in the field of content analysis. The first chapters of the thesis consider segmentation of images of real world scenes, while the later chapters focus on estimation of similarities between objects' silhouettes. The final chapters present case studies demonstrating the usefulness of the proposed solutions in selected applications. The research areas of image segmentation and shape analysis are discussed in the context of content-based image retrieval, and the motivations for the investigation carried out by the author are outlined. The current state of the art in image segmentation and shape matching is illustrated by reviewing the most characteristic categories of techniques in both areas of research and briefly discussing the most interesting and promising approaches from each category. The feasibility study of utilizing the spatial configuration of regions and structural analysis of their contours (so-called syntactic visual features [18]) for improving the quality of fully automatic image segmentation performed via region merging is successfully carried out in chapter 3. Several extensions to the well-known Recursive Shortest Spanning Tree (RSST) algorithm [19] are proposed, among which the most important is a novel framework for the integration of evidence from multiple sources of information, each with its own accuracy and reliability, e.g. colour homogeneity and a region's geometric properties, with the RSST region merging process. The new integration framework is based on the Theory of Belief (BeT) [20, 21, 22, 23]. The "strength" of the evidence provided by the geometric properties of regions and their spatial configurations is assessed and compared with the evidence provided solely by the colour homogeneity criterion through extensive experimentation. Other contributions in this area include a new colour model and colour homogeneity criteria, practical solutions to structure analysis based on the shape and spatial configuration of image regions, and a new simple stopping criterion aimed at producing partitions containing the most salient objects present in the scene. An application of the automatic segmentation method to the problem of semi-automatic segmentation is also discussed. An easy to use and intuitive semi-automatic segmentation tool utilizing the automatic approach is described in detail. This example demonstrates that syntactic features can also be useful in supervised scenarios, where they can facilitate more intuitive user interaction, e.g. by allowing the segmentation process to make more "intelligent" decisions whenever the information provided by the user is not sufficient. Utilization of the approach in an integrated tool allowing intuitive extraction and registration of contour examples from a set of images for the construction of



statistical shape models is discussed in chapter 7.

Selected issues of shape analysis, in particular contour representation, matching, and estimation of similarity between silhouettes, are investigated in chapter 5. A novel, rich multi-scale representation for non-rigid shapes with a single closed contour is presented, together with a study of its properties in the context of silhouette matching and similarity estimation. An initial approach to optimizing the matching in order to achieve higher performance in discriminating between shape classes is also proposed.

The efficiency of the proposed approach is demonstrated in a variety of applications. In particular, the retrieval results are compared to those of the Curvature Scale Space (CSS) technique [13, 14, 15] (adopted by the MPEG-7 standard [16, 17]) and it is shown that the proposed scheme performs better than CSS in two applications: retrieval in collections of closed silhouettes and holistic word recognition in handwritten historical manuscripts. In fact, it is shown that the proposed contour-based approach outperforms all word matching techniques proposed in the literature when tested on a set of 20 pages from the George Washington collection at the Library of Congress [24, 25]. Additionally, it is shown that the matching approach can be extended with some success to the problem of matching multiple contours, thereby facilitating the establishment of dense correspondences between multiple silhouettes which can then be employed for contour registration.

Through the presentation of various applications of the proposed solutions it is demonstrated that all contributions have the potential to provide key technologies enabling content-based functionalities in future multimedia applications. In each case, the performance and generality of the proposed techniques is analyzed based on extensive and rigorous experimentation using test collections as large as possible.
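The general flavour of such a multi-scale contour representation can be illustrated with a small sketch: a closed polygonal contour is smoothed at several scales and a turning angle is recorded at every vertex of every smoothed version, giving one descriptor row per scale. Both the moving-average smoothing and the turning-angle measure are simplifying assumptions chosen for illustration; they are not the exact representation proposed in this thesis.

```python
import math

def smooth_closed(points, window):
    """Circular moving average of a closed contour (window must be odd)."""
    n, half = len(points), window // 2
    out = []
    for i in range(n):
        xs = sum(points[(i + k) % n][0] for k in range(-half, half + 1))
        ys = sum(points[(i + k) % n][1] for k in range(-half, half + 1))
        m = 2 * half + 1
        out.append((xs / m, ys / m))
    return out

def turn_angles(points):
    """Signed turning angle at each vertex: a crude curvature proxy."""
    n, angles = len(points), []
    for i in range(n):
        x0, y0 = points[i - 1]
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        d = math.atan2(y2 - y1, x2 - x1) - math.atan2(y1 - y0, x1 - x0)
        while d > math.pi:
            d -= 2 * math.pi
        while d < -math.pi:
            d += 2 * math.pi
        angles.append(d)
    return angles

def multiscale_signature(points, windows=(1, 3, 5)):
    """One row of turning angles per smoothing scale."""
    return [turn_angles(smooth_closed(points, w)) for w in windows]

# Toy contour: a regular octagon traversed counter-clockwise.
octagon = [(math.cos(2 * math.pi * k / 8), math.sin(2 * math.pi * k / 8))
           for k in range(8)]
sig = multiscale_signature(octagon)
```

For the octagon every turning angle is pi/4 at every scale, and each row sums to 2*pi, as expected for any simple closed convex polygon traversed counter-clockwise.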

8.3 Future Work

This section revisits the discussion about the role of image segmentation and shape matching in the context of content-based retrieval and proposes a number of research problems which, according to the author, are worth investigating to fully/further exploit the potential of the approaches proposed in this thesis. It should be noted that incremental improvements to all proposed techniques have already been discussed in previous chapters. This section focuses on applications of the proposed approaches which have not been sufficiently investigated in this thesis.


8.3.1 Image Segmentation

Although the problem of partitioning an image into a set of homogenous regions or semantic entities is typically considered a fundamental enabling technology for understanding scene structure and identifying relevant objects, its role in the context of content-based retrieval has recently been questioned. For example, despite recent advances in the area of region-based image segmentation (including the approach proposed in this thesis), to the best of the author's knowledge no one has reported quantitative results demonstrating that utilizing local features representing homogenous regions leads to better retrieval capabilities in AV material with general content than using features representing image blocks or even entire images.

To better illustrate the problem, let us consider CBIR systems for visual data with general content. In order to allow reasonable response time to the query, the computation of visual descriptors and the creation of the index should be done off-line. However, the selection of the optimal features is often both context and query dependent. For example, extraction of local features, where ideally each feature describes a semantic object or at least an important part of an object, requires image segmentation. However, automatic bottom-up segmentation is often under-constrained as well as context and query dependent due to the inherent ambiguity of the visual content. An alternative, top-down solution to image segmentation requires utilization of domain-specific models of the semantic entities to be extracted. However, the relevance of the objects is again query dependent and it is currently impossible to build such models for all semantic objects that could possibly appear in the scene. Also, even in cases where the semantic objects can be automatically extracted, the problem of estimating similarity between the query and the target image remains to be solved. In fact, similarity estimation appears to be much more difficult when utilizing local features than when using simple global descriptors. For example, some researchers have reported that, in the case of CBIR systems allowing query by example, replacement of global features with local features describing individual homogenous regions leads to retrieval improvement only in some very specific cases (e.g. scenes with a single dominant semantic object with homogenous colour) while decreasing performance in others (e.g. scenes defined by a composition of multiple visually distinctive objects). Also, the vast majority of approaches to scene classification into semantic classes, e.g. indoor/outdoor or cityscape/landscape, use features representing predefined blocks rather than homogenous regions.

Therefore the author firmly believes that to fully assess the role of image segmentation in the content-based retrieval problem, further research effort should be devoted to investigating new ways of utilizing the results of image


segmentation in CBIR systems. One possibility would be a departure from representing the results of segmentation as a single segmentation mask for each image, with a view to utilizing more flexible representations able to capture the structure of the scene and allowing utilization of contextual information, e.g. the Binary Partition Tree (BPT) advocated in [107]. Employment of such structures, or more generally of visual features extracted based on multiple partitioning hypotheses, for detection of high-level concepts or scene classification should also be considered. Also, new flexible querying models allowing both image-based and object-based queries by example, and new ways of establishing similarity between the query and the target suitable for such models, should be investigated.

Finally, detection and extraction of new classes of semantic entities potentially important in many content-based applications should be investigated. For example, it is currently widely recognized that detection and recognition of faces present in the visual content can provide very valuable information for content-based retrieval. Although many powerful approaches to face detection have been proposed [265, 266], the problem of face recognition in general content remains to be solved [252, 79, 115]. However, additional information potentially valuable for content-based functionalities in many applications could be provided by detection and extraction of a person's head and shoulders. The presence of the shoulders should be useful for validating face detections. Also, the extracted head and shoulders should provide more reliable evidence for a person's identification than the face alone in many applications, e.g. identification of a player's team in the close-ups present in most sports videos.
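As an illustration of the kind of hierarchical representation discussed above, the sketch below builds a binary partition tree over a 1-D toy signal by greedily merging the most similar pair of adjacent regions; leaves are pixels and the root covers the whole signal. The region model (mean value) and the merging criterion (absolute mean difference) are illustrative placeholders, not the criteria used in [107] or in this thesis.

```python
class Node:
    """A BPT node: a region, plus the two children it was merged from."""
    def __init__(self, pixels, left=None, right=None):
        self.pixels = pixels          # pixel indices covered by this region
        self.left, self.right = left, right

def region_mean(values, pixels):
    return sum(values[p] for p in pixels) / len(pixels)

def build_bpt(values):
    """Greedy BPT over a 1-D signal: repeatedly merge the most similar
    pair of adjacent regions until a single root region remains."""
    regions = [Node([i]) for i in range(len(values))]
    while len(regions) > 1:
        best_i, best_d = 0, float("inf")
        for i in range(len(regions) - 1):
            d = abs(region_mean(values, regions[i].pixels) -
                    region_mean(values, regions[i + 1].pixels))
            if d < best_d:
                best_i, best_d = i, d
        a, b = regions[best_i], regions[best_i + 1]
        regions[best_i:best_i + 2] = [Node(a.pixels + b.pixels, a, b)]
    return regions[0]   # root represents the whole image

# Two dark pixels followed by two bright ones: the final merge should
# join the dark half with the bright half.
root = build_bpt([0.0, 0.1, 0.9, 1.0])
```

Cutting such a tree at different levels yields multiple partitioning hypotheses of the same scene, which is precisely what makes it attractive for the flexible querying models discussed above.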

8.3.2 Shape Matching

Clearly, shape analysis, and shape matching in particular, is very important in the context of CBIR; however, it is also a challenging task and many problems in this area remain to be solved. Shape similarity depends highly on the context, e.g. different applications may require the employment of different notions of shape similarity. Until now, most shape matching techniques have been proposed in the context of a very specific application and tested on relatively homogenous collections. Incorporation of contextual information into shape similarity measures is currently in its infancy [3] and certainly deserves greater attention. Also, combining the output of a number of methods for establishing similarity between shapes could open up interesting possibilities, e.g. increased retrieval performance or improved efficiency by using coarse-to-fine strategies.

Although establishing shape similarity is a challenging task, it can be performed with reasonable success, as has been shown in this thesis. Currently, the main


problem in the way of fully exploiting shape information in CBIR systems appears to be the limited ability to extract semantically meaningful entities. One means of extracting such objects is the utilization of shape matching techniques which can match a template shape directly with an object present in an unseen image, or the employment of image segmentation driven by shape models, e.g. the PDM models discussed in chapter 7. Such techniques have been successfully applied to very specific problems and their application to more general content certainly requires more research. The author acknowledges that such techniques, unlike the methods investigated in this thesis, should employ shape templates consisting of multiple disconnected closed and open contours, or use models not only of the outline of the object but also of its internal parts, in order to be applicable to a wider class of problems.

Finally, the preliminary investigation of the utilization of the contour matching approach proposed in this thesis for word spotting in handwritten documents, carried out in chapter 6, provided very encouraging results and should be pursued further. Particularly interesting is the possibility of matching the trajectory of the pen instead of the external outlines of the words. This should be easily achieved in the case of on-line handwriting recognition, where the trajectory of the pen is readily available, but should also be possible in off-line recognition scenarios, where the pen movements, or at least an approximation of them consistent across many instances of the same word, could be recreated from the word's images.
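The coarse-to-fine combination of similarity measures suggested in this section can be sketched as follows: a cheap global statistic prunes the collection before a more expensive descriptor comparison ranks the survivors. Both distance functions and the toy descriptors below are hypothetical placeholders, chosen only to make the two-stage structure concrete.

```python
def coarse_distance(a, b):
    """Cheap filter: compare a single global statistic (here, the mean)."""
    return abs(sum(a) / len(a) - sum(b) / len(b))

def fine_distance(a, b):
    """Expensive matcher: point-wise comparison of the full descriptors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def retrieve(query, collection, keep=2):
    """Coarse-to-fine retrieval: prune with the cheap measure,
    then rank the survivors with the expensive one."""
    pruned = sorted(collection,
                    key=lambda item: coarse_distance(query, item[1]))[:keep]
    return sorted(pruned, key=lambda item: fine_distance(query, item[1]))

# Toy collection of (name, shape descriptor) pairs.
collection = [
    ("square",   [1.0, 1.0, 1.0, 1.0]),
    ("triangle", [1.5, 0.5, 1.5, 0.5]),
    ("blob",     [3.0, 3.0, 3.0, 3.0]),
]
ranked = retrieve([1.0, 1.1, 0.9, 1.0], collection)
```

Here "blob" is discarded by the cheap filter despite never being examined by the expensive matcher, which is the source of the efficiency gain; the fine stage then separates "square" from "triangle", which the coarse statistic alone cannot distinguish.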

8.4 Final Word

The author believes that the challenges outlined above will provide ample opportunity for interesting research for many years to come. Extending the state of the art in these areas will significantly enhance existing CBIR applications whilst enabling other, as yet unknown, future applications.


Appendix A

THE HUMAN VISUAL SYSTEM AND OBJECT RECOGNITION

This appendix briefly discusses selected theories from the field of Cognitive Psychology regarding human perception and object recognition which have influenced Computer Vision and Image Understanding and, in consequence, also systems for indexing and retrieval of visual content. Although none of the abstract models presented in this appendix is explicitly implemented in this thesis, several theories, concepts and suggestions proposed in the field of Cognitive Psychology provide context and justification for the choices made by the author and can serve as an explanation of the results obtained throughout the thesis.

A.1 Gestalt Principles of Visual Organization of Perception

Gestalt is a psychology term which means "unified whole" and refers to theories of visual perception developed in the 1920s by German psychologists (Max Wertheimer (1880-1943), Wolfgang Köhler (1887-1967) and Kurt Koffka (1886-1941)) attempting to describe how humans tend to organize visual elements into groups or unified wholes. The Gestalt psychologists outlined several principles of perceptual organization on the basis of their use in identifying ambiguous patterns. They advocated that humans, when confronted by any abstract or symbolic visual image, tend to separate a dominant shape ("figure") from the "background" (or "ground") using several principles for identifying ambiguous patterns, which include: (i) Proximity (things closer together are associated), (ii) Similarity (things sharing visual characteristics such as shape, size, colour, texture, value or orientation are associated), (iii) Continuity (preference for continuous figures), (iv) Closure (interpretations which produce "closed" rather



than "open" figures are favored), (v) Smallness (the smaller of two overlapping areas tends to be seen as the figure while the larger is regarded as background), (vi) Surroundedness (areas seen as surrounded by others tend to be perceived as figures), and (vii) Symmetry (symmetrical areas tend to be seen as figures). The above principles serve the overarching principle of Prägnanz, which states that the simplest and most stable interpretations are favored.

The original Gestalt principles were formulated based on two-dimensional abstract or symbolic pictures and their application to complex real-world scenes is rather difficult. Moreover, the Gestalt laws simply provide descriptions of phenomena, not an explanation of them. Nevertheless, the pioneering suggestion that humans may be predisposed towards interpreting ambiguous images in one way rather than another has important implications for designing modern visual systems. Mimicking such natural predispositions in an automatic bottom-up approach to scene segmentation could help in dealing with the inherent ambiguities of such an approach. This suggestion provides the basis for the investigation of the usefulness of the so-called syntactic features in the bottom-up segmentation process.

A.2 Structural-Description vs. Image-Based Models

A.2.1 Reconstruction of the Outside World from the Retinal Image

Computational models of recognition proposed in the past were based largely on the work of Marr (1982) [267], who popularized the view that the goal of vision is to reconstruct the 3D scene by producing a conclusive representation of the outside world from the retinal image. He suggested that human visual processing passes through a series of stages, each corresponding to a different representation: (i) the retinal image, (ii) a "primal sketch" explicitly representing lines, edges and junctions, (iii) a "2½D sketch" including information about surface slant and depth, (iv) a "3D model representation" of objects built up hierarchically from simple primitives such as cylinders, and (v) a "space frame centred on the observer" where the 3D models reside.

Based on Marr's type of bottom-up computational approach, Biederman (1987) proposed a theory known as Recognition by Components (RBC) [268, 183], according to which objects are represented in terms of their simple parts, 3D volumetric primitives called geons (geometric ions), e.g. bricks, bent cylinders, and wedges. Geons are defined by viewpoint-invariant properties of 2D images [183] such as smooth continuation, co-termination, parallelism and symmetry. Since these properties, often referred to as "non-accidental", represent properties of the real world, they are sufficient to define 3D component representations (with no need for a 2½D sketch). Objects are defined as relationships between geons. In other


words, according to Biederman, object recognition only requires detecting edges, arranging them into geons based on non-accidental properties, and classifying objects based on geon types and relationships. Other sources of information (e.g. colour, texture, etc.) are relatively unimportant [268]. Biederman argues that a geon representation is necessary since representing objects directly in terms of local image features would require a different representation for each slightly altered orientation of the object or for each arrangement of occluding contour. Moreover, since accurate geon activation requires only categorical classification of edge characteristics, the RBC theory explains well the human ability to rapidly recognize objects.

Although the above paradigm is attractive because it decomposes the problem of image analysis into independent and manageable components, the reconstruction is difficult to achieve computationally. For example, the problems of edge detection and contour extraction are under-constrained due to the inherent ambiguity present in real scenes. Therefore, hierarchical reconstruction may lead to irreversible propagation of errors from one processing stage to another. Moreover, many researchers (e.g. Tarr & Bülthoff, 1998) suggest that the RBC theory only explains well coarse discriminations between object types (e.g. a cup vs. a car) but not within an object type (e.g. one car vs. another).

Despite the difficulties with computational implementation of bottom-up processing, Biederman's work provides extremely strong empirical evidence suggesting that, in the case of humans, priming is largely visual, rather than conceptual or lexical [268, 183], thereby de-emphasizing the role of context. Biederman argues that the extraordinary speed of object recognition in the absence of any context disagrees with a central role for top-down information.
This is a potentially important implication for designing CBIR systems, where off-line bottom-up processing is often unavoidable to ensure fast responses to user queries. In other words, Biederman's work provides hope that if the components performing off-line processing in CBIR systems could imitate, to a certain degree, the human ability to resolve the inherent ambiguities present in complex real-world scenes, the results of such processing could efficiently serve indexing and retrieval of visual content. This provides justification for the research carried out in chapter 3 investigating alternative sources of information for improving the perceptual quality of the output of bottom-up segmentation.

A.2.2 Exemplar-Based Multiple-Views Mechanism

An alternative to the RBC paradigm was suggested by Tarr & Bülthoff, who advocate an exemplar-based multiple-views mechanism as an important component of both exemplar-specific and categorical recognition [269]. In the multiple-views approach, object representations are collections of views that depict the appearance of objects from specific viewpoints. They argue that image-based models containing multiple views explain well recent behavioral results, e.g. recognition performance is viewpoint dependent because both recognition time and accuracy vary with the degree of mismatch between the percept and the target view. A variety of different mechanisms for generalizing from unfamiliar to familiar views have been proposed, including mental rotation (Tarr & Pinker, 1989), view interpolation (Poggio & Edelman, 1990), and linear combinations of views (Ullman & Basri, 1991). Tarr & Bülthoff also suggest that there must be some mechanism for measuring the perceptual similarity (within some domain) between an input image and known objects and between an input image and known views of that same object.

Tarr and his collaborators also suggest that a substantial part of the information present in an object view is carried by outline shape [164, 165] and often advocate the need for the computation of silhouette similarity. [164] demonstrates the utility of silhouette representations for a variety of visual tasks, ranging from basic-level categorization to finding the best view of an object. They argue that using only silhouettes or boundary contour information, a large number of exemplars can be separated into stable perceptual categories corresponding closely to those formed by human perceivers. This suggestion, coming from cognitive science, provides very strong motivation for the utilization of silhouettes for classification and recognition in computer vision, and provides justification for the research carried out by the author in several chapters of this thesis.

A.2.3 Which Model Explains All Facts of Human Visual Perception?

Currently many researchers, including advocates of both approaches presented above, Biederman [183] and Tarr [269], seem to recognize that no existing single theory can explain all facets of human visual perception. In fact, there seems to be a general consensus that the various approaches (including RBC and the exemplar-based multiple-views mechanism) are not mutually exclusive but simply address different aspects of object recognition under different conditions, determined for example by a particular task, context, familiarity, and visual similarity. Indeed, the presence of several recognition mechanisms may explain the difficulties in developing a single "universal" computational model able to perform general object detection and recognition. So far, computer vision is able to successfully address only very specific problems.


Appendix B

IMAGE COLLECTION USED IN CHAPTER 3

This appendix presents the collection of images and ground-truth segmentations developed by the author in collaboration with a colleague working in another university. The collection contains 100 images from the Corel gallery [132] and 20 images from various sources, such as keyframes from well known MPEG-4 test sequences, a digital camera, and a mobile phone camera, segmented by four annotators.

200

201

B. IMAGE COLLECTION USED IN CHAPTER 3

B. IMAGE COLLECTION USED IN CHAPTER 3

JET

.T f c - *

*

* ■

B I

.

■ •m

- *

MISCELLANEOUS

1

:

i

r ¿

I I I

À

S

.

h

i

a

'

'

n

â



t

: >

h

r

f

A

i

f

I

t

Û

203

I

k

,

M

B. IMAGE COLLECTION USED IN CHAPTER 3

MOUNTAIN

ROSES

: C

k

i

1

TIGER

ü jL .

Figure B.l: Image collection used in chapter 3. 204

I S S

BIBLIOGRAPHY

[1] S. Brin and L. Page, "The anatomy of a large-scale hypertextual web search engine," in Proc. 7th Int'l WWW Conf. (WWW'98), Brisbane, Australia, 1998.

[2] J. P. Eakins and M. E. Graham, "Content-based image retrieval: A report to the JISC technology application programme," Institute for Image Data Research, University of Northumbria at Newcastle, UK, www.unn.ac.uk/iidr/report.html, Technical Report, 1999.

[3] F. A. Cheikh, "MUVIS: A system for content-based image retrieval," Ph.D. dissertation, Tampere University of Technology, 2004.

[4] B. S. Manjunath, P. Salembier, and T. Sikora, Introduction to MPEG-7: Multimedia Content Description Language. New York: John Wiley & Sons Ltd., ISBN: 0-471-48678-7, 2002.

[5] J. M. Martinez, "MPEG-7 overview," ISO/IEC JTC1/SC29/WG11, Tech. Rep. N6828, Mar. 2003.

[6] N. O'Connor, E. Cooke, H. Le Borgne, M. Blighe, and T. Adamek, "The AceToolbox: Low-level audiovisual feature extraction for retrieval and classification," in Proc. 2nd IEE European Workshop on the Integration of Knowledge, Semantic and Digital Media Technologies, London, U.K., Dec. 2005.

[7] A. Smeaton, H. Lee, and K. McDonald, "Experiences of creating four video library collections with the Fischlar system," Journal of Digital Libraries: Special Issue on Digital Libraries as Experienced by the Editors of the Journal, vol. 4, no. 1, pp. 42-44, 2004.

[8] P. Duygulu, K. Barnard, N. de Freitas, and D. Forsyth, "Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary," in Proc. IEEE Int'l Conf. on Computer Vision, 2002, pp. 97-112.

[9] K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, and M. I. Jordan, "Matching words and pictures," Journal of Machine Learning Research, vol. 3, pp. 1107-1135, 2003.

[10] S. Sav, H. Lee, A. F. Smeaton, and N. E. O'Connor, "Using segmented objects in ostensive video shot retrieval," in Proc. 3rd Int'l Workshop on Adaptive Multimedia Retrieval, July 2005.

[11] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkhani, J. Hafner, D. Lee, D. Petkovic, D. Steele, and P. Yanker, "Query by image and video content: The QBIC system," IEEE Computer Magazine, vol. 28, no. 9, pp. 23-32, Sept. 1995.


[12] S. Lambert, E. de Leau, and L. Vuurpijl, "Using pen-based outlines for object-based annotation and image-based queries," in Proc. 3rd Int'l Conf. on Visual Information and Information Systems (VISUAL'99), Amsterdam, The Netherlands, ser. LNCS 1614, D. Huijsmans and A. Smeulders, Eds. Springer, June 1999, pp. 585-592.

[13] F. Mokhtarian and A. K. Mackworth, "A theory of multiscale, curvature-based shape representation for planar curves," IEEE Trans. Pattern Anal. Machine Intell., vol. 14, no. 8, pp. 789-805, Aug. 1992.

[14] F. Mokhtarian, "Silhouette-based isolated object recognition through curvature scale space," IEEE Trans. Pattern Anal. and Machine Intell., vol. 17, no. 5, pp. 539-544, May 1995.

[15] F. Mokhtarian, S. Abbasi, and J. Kittler, "Robust and efficient shape indexing through curvature scale space," in British Machine Vision Conf. (BMVC'96), 1996, pp. 53-62.

[16] M. Bober, "MPEG-7 visual shape descriptors," IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 6, pp. 716-719, 2001.

[17] F. Mokhtarian and M. Bober, Curvature Scale Space Representation: Theory, Applications, and MPEG-7 Standardization, ser. ISBN 1-4020-1233-0. Kluwer Academic Publishers, 2003, vol. 25.

[18] C. F. Bennstrom and J. R. Casas, "Binary-partition-tree creation using a quasi-inclusion criterion," in Proc. 8th Int'l Conf. on Information Visualization (IV'04), IEEE Computer Society Press, London, UK, 2004.

[19] O. Morris, M. Lee, and A. Constantinides, "Graph theory for image analysis: an approach based on the shortest spanning tree," in IEE Proceedings, vol. 133, Apr. 1986, pp. 146-152.

[20] A. Dempster, "Upper and lower probabilities induced by multivalued mapping," Annals of Mathematical Statistics, vol. 38, pp. 325-339, 1967.

[21] G. Shafer, "A mathematical theory of evidence," Princeton Univ. Press, Princeton, New Jersey, 1976.

[22] P. Smets and R. Kennes, "The transferable belief model," Artificial Intelligence, vol. 66, no. 2, pp. 191-234, 1994.

[23] P. Smets, "The normative representation of quantified beliefs by belief functions," Artificial Intelligence, vol. 92, pp. 229-242, 1997.

[24] R. Manmatha and N. Srimal, "Scale space technique for word segmentation in handwritten manuscripts," in Proc. 2nd Int'l Conf. on Scale-Space Theories in Computer Vision, Corfu, Greece, Sept. 1999, pp. 22-33.

[25] V. Lavrenko, T. Rath, and R. Manmatha, "Holistic word recognition for handwritten historical documents," in Proc. Document Image Analysis for Libraries (DIAL'04), 2004, pp. 278-287.

[26] MPEG Requirements Group, "MPEG-7 context and objectives," Doc. ISO/MPEG, MPEG Atlantic City Meeting, Tech. Rep. 2460, Oct. 1998.

[27] D. Santa-Cruz and T. Ebrahimi, "An analytical study of JPEG 2000 functionalities," in Proc. IEEE Int'l Conf. on Image Processing, vol. 2, 2000, pp. 49-52.


[28] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Trans. Pattern Anal. and Machine Intell., vol. 22, no. 8, pp. 888-905, Aug. 2000.

[29] C. Carson, S. Belongie, H. Greenspan, and J. Malik, "Blobworld: Color- and texture-based image segmentation using EM and its application to image querying and classification," IEEE Trans. Pattern Anal. and Machine Intell., vol. 24, no. 8, pp. 1026-1037, Aug. 2002.

[30] V. Mezaris, I. Kompatsiaris, and M. G. Strintzis, "Still image segmentation tools for object-based multimedia applications," Int'l Journal of Pattern Recognition and Artificial Intell., vol. 18, no. 4, pp. 701-725, June 2004.

[31] J. Fauqueur and N. Boujemaa, "Region-based image retrieval: Fast coarse segmentation and fine color description," Journal of Visual Languages and Computing, special issue on Visual Information Systems, vol. 15, pp. 69-95, 2004.

[32] N. E. O'Connor, "Video object segmentation for future multimedia applications," Ph.D. dissertation, Dublin City University, School of Electronic Engineering, 1998.

[33] D. Comaniciu and P. Meer, "Mean shift: A robust approach toward feature space analysis," IEEE Trans. Pattern Anal. and Machine Intell., vol. 24, no. 5, 2002.

[34] S. Sun, D. R. Haynor, and Y. Kim, "Semiautomatic video object segmentation using vsnakes," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 1, pp. 75-82, Jan. 2003.

[35] N. R. Pal and S. K. Pal, "A review on image segmentation techniques," Pattern Recognition, vol. 26, no. 9, pp. 1277-1294, 1993.

[36] W. Skarbek and A. Koschan, "Colour image segmentation - a survey," Technical University of Berlin, Department of Computer Science, Germany, Technical Report 94-32, Oct. 1994.

[37] Special Issue, "Special issue on image and video processing," IEEE Trans. Circuits Syst. Video Technol., vol. 8, no. 5, pp. 19-31, Sept. 1998.

[38] P. Salembier and F. Marqués, "Region-based representations of image and video," IEEE Trans. Circuits Syst. Video Technol., vol. 9, no. 8, pp. 1149-1167, Dec. 1999.

[39] H. Cheng, X. H. Jiang, Y. Sun, and J. L. Wang, "Color image segmentation: Advances & prospects," Pattern Recognition, vol. 34, no. 12, pp. 2259-2281, 2001.

[40] S. Horowitz and T. Pavlidis, "Picture segmentation by a tree traversal algorithm," J. Assoc. Comput. Mach., vol. 23, no. 2, pp. 368-388, 1976.

[41] O. Salerno, M. Pardas, V. Vilaplana, and F. Marqués, "Object recognition based on binary partition trees," in Proc. Int'l Conf. on Image Processing (ICIP'04), vol. 2, Oct. 2004, pp. 929-932.

[42] G. Aggarwal, S. Ghosal, and P. Dubey, "Efficient query modification for image retrieval," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR'00), vol. II, June 2000, pp. 255-261.

[43] I. Kompatsiaris, E. Triantafillou, and M. G. Strintzis, "Region-based colour image indexing and retrieval," in Proc. Int'l Conf. on Image Processing (ICIP'01), Thessaloniki, Greece, vol. 1, 2001, pp. 658-661.


[44] V. Mezaris, I. Kompatsiaris, and M. G. Strintzis, "Region-based image retrieval using an object ontology and relevance feedback," EURASIP Journal on Applied Signal Processing, no. 6, pp. 886-901, June 2004.

[45] F. Souvannavong, B. Merialdo, and B. Huet, "Region-based video content indexing and retrieval," in Proc. 4th Int'l Workshop on Content-Based Multimedia Indexing (CBMI'05), 2005.

[46] W.-Y. Ma and B. S. Manjunath, "NeTra: A toolbox for navigating large image databases," Multimedia Systems, vol. 7, no. 3, pp. 184-198, 1999.

[47] F. Jing, M. Li, H.-J. Zhang, and B. Zhang, "An effective region-based image retrieval framework," in Proc. the ACM Int'l Conf. on Multimedia, 2002.

[48] R. C. Veltkamp and M. Tanase, "Content-based image retrieval systems: A survey," Department of Computing Science, Utrecht University, Technical Report UU-CS-2000-34, Oct. 2000.

[49] IBM, "QBIC project," http://wwwqbic.almaden.ibm.com.

[50] R. Barber, W. Niblack, M. Flickner, E. Glasman, D. Petkovic, P. Yanker, C. Faloutsos, and G. Taubin, "The QBIC project: Querying images by content using colour, texture, and shape," in Proc. SPIE Conf. on Storage and Retrieval for Image and Video Databases, San Jose, CA, USA, Feb. 1993, pp. 173-187.

[51] A. Del Bimbo, M. Mugnaini, P. Pala, and F. Turco, "Picasso: Visual querying by color perceptive regions," in Proc. 2nd Int'l Conf. on Visual Information Systems, San Diego, Dec. 1997, pp. 125-131.

[52] P. Pala and S. Santini, "Image retrieval by shape and texture," Pattern Recognition, vol. 32, no. 3, pp. 517-527, 1999.

[53] W.-Y. Ma, "NeTra: A toolbox for navigating large image databases," Ph.D. dissertation, Dept. of Electrical and Computer Engineering, University of California at Santa Barbara, June 1997.

[54] W. Ma and B. Manjunath, "Edge flow: a framework of boundary detection and image segmentation," in Proc. IEEE Int'l Conf. on Computer Vision and Pattern Recognition, Puerto Rico, June 1997.

[55] D. Comaniciu and P. Meer, "Robust analysis of feature spaces: Colour image segmentation," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR'97), June 1997, pp. 750-755.

[56] E. Choi and P. Hall, "Data sharpening as a prelude to density estimation," Biometrika, vol. 86, 1999.

[57] M. Markkula and E. Sormunen, "End-user searching challenges indexing practices in the digital newspaper photo archive," Information Retrieval, vol. 1, pp. 259-285, 2000.

[58] K. Barnard, P. Duygulu, R. Guru, P. Gabbur, and D. Forsyth, "The effects of segmentation and feature choice in a translation model of object recognition," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR'03), 2003.

[59] P. Salembier and L. Garrido, "Binary partition tree as an efficient representation for filtering, segmentation, and information retrieval," in Proc. IEEE Int'l Conf. on Image Processing (ICIP'98), Chicago (IL), USA, Oct. 1998.


[60] J. Malik, S. Belongie, J. Shi, and T. Leung, "Textons, contours and regions: Cue integration in image segmentation," in Proc. IEEE Int'l Conf. on Computer Vision, Corfu, Greece, Sept. 1999. [61] J. Garding and T. Lindeberg, "Direct computation of shape cues using scaleadapted spatial derivative operators," Int'l J. Computer Vision, vol. 17, no. 2, pp. 163-191, Feb. 1996. [62] Y. Zeng and A. G. Constantinides, "Perceptual saliency weighted segmentation algorithm," in Proc. 7th Int'l Conf. on A n Image Processing And Its Applications, July 1999. [63] Y. Zeng, "Perceptual segmentation algorithm and its application to image cod­ ing," in Proc. Int'l Conf. on Image Processing, 1999, pp. 820-824. [64] M. Kass, A. Witkin, and D. Terzopoulos, "Snakes: Active contour models," Int'l Journal of Computer Vision, vol. 1, pp. 312-331,1988. [65] S. C. Zhu and A. Yuille, "Region competition: Unifying snakes, region growing and bayes/M DL for multiband image segmentation," IEEE Trans. Pattern Anal, and Machine Intell., vol. 18, no. 9, pp. 884-900,1996.

[66] M. Rousson, T. Brox, and R. Deriche, "Active unsupervised texture segmentation on a diffusion-based space," in Proc. Int'l Conf. on Computer Vision & Pattern Recognition, 2003.
[67] "Special section on perceptual organization in computer vision," IEEE Trans. Pattern Anal. and Machine Intell., vol. 25, no. 4, Apr. 2003.
[68] T. Adamek, N. E. O'Connor, and N. Murphy, "Region-based segmentation of images using syntactic visual features," in Proc. 6th Int'l Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS'05), Montreux, Switzerland, Apr. 2005.
[69] E. Engbers and A. Smeulders, "Design considerations for generic grouping in vision," IEEE Trans. Pattern Anal. and Machine Intell., vol. 25, no. 4, pp. 445-457, Apr. 2003.
[70] T. F. Cootes and C. J. Taylor, "Active shape models - smart snakes," in Proc. British Machine Vision Conf. (BMVC'92), Springer Verlag, 1992, p. 266.
[71] S. Sclaroff and L. Liu, "Deformable shape detection and description via model-based region grouping," IEEE Trans. Pattern Anal. Machine Intell., vol. 23, no. 5, 2001.
[72] M. E. Leventon, W. L. Grimson, and O. Faugeras, "Statistical shape influence in geodesic active contours," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR'00), vol. I, 2000, pp. 316-323.
[73] A. Tsai, A. Yezzi, W. Wells, C. Tempany, D. Tucker, A. Fan, W. Grimson, and A. Willsky, "Curve evolution technique for image segmentation," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR'01), vol. I, 2001, pp. 463-468.
[74] Y. Chen, S. Thiruvenkadam, H. Tagare, F. Huang, and D. Wilson, "On the incorporation of shape priors into geometric active contours," in Proc. VLSM'01, 2001, pp. 145-152.


[75] D. Cremers, T. Kohlberger, and C. Schnorr, "Nonlinear shape statistics in Mumford-Shah based segmentation," in Proc. European Conf. on Computer Vision (ECCV'02), vol. II, 2002, pp. 93-108.

[76] M. Rousson and N. Paragios, "Shape priors for level set representations," in Proc. European Conf. on Computer Vision (ECCV'02, LNCS 2351), 2002, pp. 78-92.
[77] M. B. Stegmann, "Active appearance models: Theory, extensions and cases," Master's thesis, Informatics and Mathematical Modelling, Technical University of Denmark, Lyngby, http://www.imm.dtu.dk/~aam/, 2000.
[78] T. Cootes and C. J. Taylor, "Statistical models of appearance for computer vision," University of Manchester, http://www.isbe.man.ac.uk/~bim, Tech. Rep., Mar. 2004.
[79] M. Turk and A. Pentland, "Eigenfaces for recognition," Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71-86, 1991.
[80] B. Marcotegui, P. Correia, F. Marqués, R. Mech, R. Rosa, M. Wollborn, and F. Zanoguera, "A video object generation tool allowing friendly user interaction," in Proc. IEEE Int'l Conf. on Image Processing (ICIP'99), Kobe, Japan, 1999.
[81] R. Adams and L. Bischof, "Seeded region growing," IEEE Trans. Pattern Anal. and Machine Intell., vol. 16, no. 6, pp. 641-647, June 1994.

[82] E. Chalom and V. Bove, "Segmentation of an image sequence using multi-dimensional image attributes," in Proc. IEEE Int'l Conf. on Image Processing (ICIP'96), Lausanne, Switzerland, vol. 2, Sept. 1996, pp. 525-528.

[83] J. Cichosz and F. Meyer, "Morphological multiscale image segmentation," in Proc. Int'l Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS'97), Louvain-la-Neuve, Belgium, June 1997, pp. 161-166.

[84] C. Gu and M.-C. Lee, "Semiautomatic segmentation and tracking of semantic video objects," IEEE Trans. Circuits Syst. Video Technol., vol. 8, no. 5, pp. 572-584, Sept. 1998.
[85] S. Cooray, N. E. O'Connor, S. Marlow, N. Murphy, and T. Curran, "Semi-automatic video object segmentation using recursive shortest spanning tree and binary partition tree," in Proc. 3rd Int'l Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS'01), Tampere, Finland, 2001.

[86] N. Otsu, "A threshold selection method from grey level histograms," IEEE Trans. on Systems, Man and Cybernetics, vol. 9, no. 1, pp. 62-66, 1979.
[87] M. Cheriet, J. N. Said, and C. Y. Suen, "A recursive thresholding technique for image segmentation," IEEE Trans. on Image Processing, vol. 7, no. 6, pp. 918-920, 1998.
[88] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proc. 5th Berkeley Symp. on Math. Stat. and Prob., vol. 1, 1967, pp. 281-296.
[89] J. C. Bezdek, Pattern Recognition With Fuzzy Objective Function Algorithms. New York: Plenum Press, 1981.
[90] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: A review," ACM Comput. Surveys, vol. 31, no. 3, pp. 264-323, 1999.


[91] M. Herbin, N. Bonnet, and P. Vautrot, "A clustering method based on the estimation of the probability density function and on the skeleton by influence zones," Pattern Recognition Letters, vol. 17, pp. 1141-1150, 1996.
[92] L. Vincent and P. Soille, "Watersheds in digital spaces: An efficient algorithm based on immersion simulations," IEEE Trans. Pattern Anal. and Machine Intell., vol. 13, no. 6, pp. 583-598, June 1991.
[93] D. Wang, "A multi-scale gradient algorithm for image segmentation using watersheds," Pattern Recognition, vol. 30, no. 12, pp. 2043-2052, 1997.
[94] T. Vlachos and A. G. Constantinides, "A graph-theoretic approach to color image segmentation and contour classification," in Proc. 4th Int'l Conf. Image Processing and Its Applications, Maastricht, The Netherlands, Apr. 1992.
[95] S. H. Kwok, A. G. Constantinides, and W.-C. Siu, "An efficient recursive shortest spanning tree algorithm using linking properties," IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 6, pp. 852-863, June 2004.

[96] S. Geman and D. Geman, "Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images," IEEE Trans. Pattern Anal. and Machine Intell., vol. 6, no. 6, pp. 721-741, 1984.
[97] Z. Tu and S. Zhu, "Image segmentation by data-driven Markov chain Monte Carlo," IEEE Trans. Pattern Anal. and Machine Intell., vol. 24, no. 5, pp. 657-673, 2002.
[98] E. D. Gelasca, T. Ebrahimi, M. C. Q. Farias, and S. K. Mitra, "Impact of topology changes in video segmentation evaluation," in Proc. 5th Int'l Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS 2004), Lisboa, Portugal, Apr. 2004.
[99] W. Perkins, "Area segmentation of images using edge points," IEEE Trans. Pattern Anal. and Machine Intell., vol. 2, no. 1, pp. 8-15, 1980.

[100] J. Prager, "Extracting and labelling boundary segments in natural scenes," IEEE Trans. Pattern Anal. and Machine Intell., vol. 2, no. 1, pp. 16-27, 1980.
[101] N. Papamarkos, C. Strouthopoulos, and I. Andreadis, "Multithresholding of colour and gray-level images through a neural network technique," Image and Vision Computing, vol. 18, pp. 213-222, 2000.
[102] J. Guo, J. Kim, and C. Kuo, "Fast and accurate moving object extraction technique for MPEG-4 object based video coding," in Proc. SPIE Visual Comm. and Image Processing, 1999, pp. 1210-1221.
[103] L. Liu and S. Sclaroff, "Deformable shape detection and description via model-based region grouping," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR'99).

[104] R. Nock and F. Nielsen, "Statistical region merging," IEEE Trans. Pattern Anal. and Machine Intell., vol. 26, no. 11, pp. 1452-1458, Nov. 2004.

[105] M. Sonka, V. Hlavac, and R. Boyle, Image Processing, Analysis, and Machine Vision. London, UK: Chapman and Hall, 1995.



[106] S. H. Kwok and A. G. Constantinides, "A fast recursive shortest spanning tree for image segmentation and edge detection," IEEE Trans. Image Processing, vol. 6, no. 2, pp. 328-332, Feb. 1997.
[107] P. Salembier and L. Garrido, "Binary partition tree as an efficient representation for image processing, segmentation, and information retrieval," IEEE Trans. on Image Processing, vol. 9, no. 4, pp. 561-576, Apr. 2000.
[108] Y. Y. Boykov and M.-P. Jolly, "Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images," in Proc. Int'l Conf. on Computer Vision, Vancouver, Canada, vol. I, July 2001.
[109] T. McInerney and D. Terzopoulos, "Deformable models in medical image analysis: A survey," Medical Image Analysis, vol. 1, no. 2, pp. 91-108, 1996.
[110] L. Cohen, "On active contour models and balloons," CVGIP: Image Understanding, vol. 53, no. 2, 1991.
[111] V. Caselles, R. Kimmel, and G. Sapiro, "Geodesic active contours," Int'l Journal of Computer Vision (IJCV), vol. 22, no. 1, pp. 61-79, 1997.
[112] J. Sethian, Level Set Methods. Cambridge University Press, 1996.

[113] D. Mumford and J. Shah, "Optimal approximations by piecewise smooth functions and associated variational problems," Communications on Pure and Applied Mathematics, vol. 42, no. 5, pp. 577-685, 1989.
[114] N. Paragios and R. Deriche, "Geodesic active contours and level-set methods for supervised texture segmentation," Int'l Journal of Computer Vision, 2002.
[115] A. L. Yuille, D. S. Cohen, and P. Hallinan, "Feature extraction from faces using deformable templates," Int'l Journal of Computer Vision, vol. 8, no. 2, pp. 99-112, 1992.
[116] A. P. Pentland and S. Sclaroff, "Closed-form solutions for physically based modelling and recognition," IEEE Trans. Pattern Anal. Machine Intell., vol. 13, no. 7, pp. 715-729, 1991.
[117] T. F. Cootes, C. J. Taylor, D. Cooper, and J. Graham, "Active shape models - their training and application," Computer Vision and Image Understanding, vol. 61, no. 1, pp. 38-59, Jan. 1995.
[118] M. Lades, J. Vorbruggen, J. Buhmann, J. Lange, C. von der Malsburg, R. Wurtz, and W. Konen, "Distortion invariant object recognition in the dynamic link architecture," IEEE Trans. on Computers, vol. 42, pp. 300-311, 1993.
[119] K. V. Mardia, J. T. Kent, and A. N. Walder, "Statistical shape models in image analysis," in Proc. Computer Science and Statistics: 23rd INTERFACE Symposium, E. Keramidas, editor, Interface Foundation, Fairfax Station, 1991, pp. 550-557.
[120] U. Grenander and M. Miller, "Representations of knowledge in complex systems," Journal of the Royal Statistical Society, vol. 56, pp. 549-603, 1994.
[121] S. Belongie, J. Malik, and J. Puzicha, "Shape matching and object recognition using shape contexts," IEEE Trans. Pattern Anal. and Machine Intell., vol. 24, no. 4, pp. 509-522, Apr. 2002.
[122] Y. J. Zhang, "A survey on evaluation methods for image segmentation," Pattern Recognition, vol. 29, no. 8, pp. 1335-1346, Aug. 1996.



[123] P. Correia and F. Pereira, "Objective evaluation of relative segmentation quality," in Proc. Int'l Conf. on Image Processing (ICIP'00), Vancouver, Canada, Sept. 2000, pp. 308-311.
[124] ITU-R, "Methodology for the subjective assessment of the quality of television pictures," Recommendation BT.500-7, 1995.
[125] ITU-T, "Subjective video quality assessment methods for multimedia applications," Recommendation P.910, Aug. 1996.
[126] P. Correia and F. Pereira, "Standalone objective evaluation of segmentation quality," in Proc. 3rd Int'l Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS'01), Tampere, Finland, May 2001.
[127] V. Mezaris, I. Kompatsiaris, and M. Strintzis, "Still image objective segmentation evaluation using ground truth," in Proc. 5th COST 276 Workshop (2003), Berlin, 2003, pp. 9-14.
[128] M. Wollborn and R. Mech, "Refined procedure for objective evaluation of video object generation algorithms," in Doc. ISO/IEC/JTC1/SC29/WG11 M3448, 43rd MPEG Meeting, Tokyo, Japan, Mar. 1998.
[129] P. Villegas, X. Marichal, and A. Salcedo, "Objective evaluation of segmentation masks in video sequences," in Proc. Workshop on Image Analysis For Multimedia Interactive Services (WIAMIS'99), Berlin, May 1999, pp. 85-88.
[130] P. Correia and F. Pereira, "Estimation of video object's relevance," in Proc. European Conf. on Signal Processing (EUSIPCO'2000), Tampere, Finland, Sept. 2000.
[131] MIT Vision Texture (VisTex) database, http://www-white.media.mit.edu/vismod/imagery/VisionTexture/vistex.html.
[132] Corel stock photo library, Corel Corp., Ontario, Canada.
[133] M. Wirth, M. Fraschini, M. Masek, and M. Bruynooghe, "Performance evaluation in image processing," EURASIP Journal on Applied Signal Processing, vol. 2006, Article ID 45742, 3 pages, 2006.
[134] S. Chabrier, B. Emile, C. Rosenberger, and H. Laurent, "Unsupervised performance evaluation of image segmentation," EURASIP Journal on Applied Signal Processing, vol. 2006, Article ID 96306, 12 pages, 2006.
[135] P. Correia and F. Pereira, "Video object relevance metrics for overall segmentation quality evaluation," EURASIP Journal on Applied Signal Processing, vol. 2006, Article ID 82195, 11 pages, 2006.
[136] X. Jiang, C. Marti, C. Irniger, and H. Bunke, "Distance measures for image segmentation evaluation," EURASIP Journal on Applied Signal Processing, vol. 2006, Article ID 35909, 10 pages, 2006.
[137] R. Piroddi and T. Vlachos, "A method for single-stimulus quality assessment of segmented video," EURASIP Journal on Applied Signal Processing, vol. 2006, Article ID 39482, 22 pages, 2006.
[138] R. Usamentiaga, D. F. Garcia, C. Lopez, and D. Gonzalez, "A method for assessment of segmentation success considering uncertainty in the edge positions," EURASIP Journal on Applied Signal Processing, vol. 2006, Article ID 21746, 12 pages, 2006.



[139] T. Y. Lui and E. Izquierdo, "Scalable object-based image retrieval," in Proc. IEEE Int'l Conf. on Image Processing (ICIP'03), Barcelona, Spain, Sept. 2003.
[140] A. Alatan, L. Onural, M. Wollborn, R. Mech, E. Tuncel, and T. Sikora, "Image sequence analysis for emerging interactive multimedia services - the European COST 211 Framework," IEEE Trans. Circuits Syst. Video Technol., vol. 8, no. 7, pp. 802-813, Nov. 1998.
[141] E. Tuncel and L. Onural, "Utilization of the recursive shortest spanning tree algorithm for video-object segmentation by 2-D affine motion modeling," IEEE Trans. Circuits Syst. Video Technol., vol. 10, no. 5, pp. 776-781, Aug. 2000.

[142] J. H. Ward, "Hierarchical grouping to optimize an objective function," J. American Stat. Assoc., vol. 58, pp. 236-245, 1963.
[143] L. Garrido, P. Salembier, and D. Garcia, "Extensive operators in partition lattices for image sequence analysis," EURASIP Signal Processing, vol. 66, no. 2, pp. 157-180, Apr. 1998.
[144] T. Brox, D. Farin, and P. de With, "Multi-stage region merging for image segmentation," in Proc. 22nd Symposium on Information and Communication Theory in the Benelux, Enschede, The Netherlands, May 2001.

[145] J. Hafner, H. Sawhney, W. Equitz, M. Flickner, and W. Niblack, "Efficient color histogram indexing for quadratic form distance functions," IEEE Trans. Pattern Anal. and Machine Intell., vol. 17, no. 7, pp. 729-736, July 1995.
[146] Y. Chen and H. Sundaram, "Estimating the complexity of 2D shapes," in Proc. Int'l Workshop on Multimedia Signal Processing (MMSP'05), May 2005.
[147] S. Loncaric, "A survey of shape analysis techniques," Pattern Recognition, vol. 31, no. 5, pp. 983-1001, 1998.
[148] H. Moravec, "Towards automatic visual obstacle avoidance," in Proc. 5th Int'l Joint Conf. on Artificial Intelligence, Carnegie-Mellon University, Pittsburgh, PA, Aug. 1977.
[149] J. Bezdek, "Fuzziness vs. probability - the n-th round," IEEE Trans. on Fuzzy Systems, vol. 2, no. 1, pp. 1-42, 1994.

[150] L. Zadeh, "Fuzzy sets as a basis for a theory of possibility," Fuzzy Sets and Systems, vol. 1, pp. 3-28, 1978.
[151] D. Dubois, J. Lang, and H. Prade, "Automated reasoning using possibilistic logic: Semantics, belief revision, and variable certainty weights," IEEE Trans. on Knowledge and Data Engineering, vol. 6, pp. 64-71, 1994.

[152] Z. Hammal, A. Caplier, and M. Rombaut, "A fusion process based on belief theory for classification of facial basic emotions," in Proc. 8th Int'l Conf. on Information Fusion, Philadelphia, USA, July 2005.
[153] A. Al-Ani and M. Deriche, "A new technique for combining multiple classifiers using the Dempster-Shafer theory of evidence," Journal of Artificial Intelligence Research, vol. 17, pp. 333-361, 2002.



[154] V. Girondel, A. Caplier, and L. Bonnaud, "A belief theory-based static posture recognition system for real-time video surveillance applications," in Proc. IEEE Int'l Conf. on Advanced Video and Signal based Surveillance (AVSS'05), Como, Italy, Sept. 2005.
[155] P. Smets, E. Mamdani, D. Dubois, and H. Prade, Non-Standard Logics for Automated Reasoning, ISBN 0126495203. Academic Press, Harcourt Brace Jovanovich Publisher, 1988.
[156] L. Zadeh, "A mathematical theory of evidence (book review) - a critique of the normalization factor in Dempster's rule of combination," AI Magazine, vol. 5, no. 3, pp. 81-83, 1984.
[157] M. E. Cattaneo, "Combining belief functions issued from dependent sources," in Proc. 3rd Int'l Symposium on Imprecise Probabilities and Their Applications (ISIPTA'03), Lugano, Switzerland, 2003.
[158] P. Rosin, "Unimodal thresholding," Pattern Recognition, vol. 34, no. 11, pp. 2083-2096, Nov. 2001.
[159] U. Ramer, "An iterative procedure for the polygonal approximation of plane curves," Computer, Graphics and Image Processing, vol. 1, pp. 244-256, 1972.
[160] D. Martin, C. Fowlkes, D. Tal, and J. Malik, "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics," in Proc. Int'l Conf. on Computer Vision, 2001.
[161] "Berkeley segmentation and boundary detection benchmark and dataset," http://www.cs.berkeley.edu/projects/vision/grouping/segbench, 2003.
[162] P. Correia and F. Pereira, "Classification of video segmentation application scenarios," IEEE Trans. Circuits Syst. Video Technol., special issue on Audio and Video Analysis for Multimedia Interactive Services, vol. 14, no. 5, pp. 735-741, May 2004.
[163] L. F. Costa and R. M. Cesar, Shape Analysis and Classification: Theory and Practice, ISBN 0-8493-3493-4. CRC Press LLC, 2000.
[164] F. Cutzu and M. J. Tarr, "The representation of three-dimensional object similarity in human vision," in SPIE Proc. from Electronic Imaging: Human Vision and Electronic Imaging II, 1997, pp. 460-471.
[165] W. G. Hayward, "Effects of outline shape in object recognition," Journal of Experimental Psychology: Human Perception and Performance, vol. 24, no. 2, pp. 427-440, 1998.
[166] F. Mokhtarian and A. K. Mackworth, "Scale-based description and recognition of planar curves and two dimensional shapes," IEEE Trans. Pattern Anal. Machine Intell., vol. 8, no. 1, pp. 34-43, Jan. 1986.
[167] E. G. M. Petrakis, A. Diplaros, and E. Milios, "Matching and retrieval of distorted and occluded shapes using dynamic programming," IEEE Trans. Pattern Anal. Machine Intell., vol. 24, pp. 1501-1516, 2002.
[168] R. Basri, L. Costa, D. Geiger, and D. Jacobs, "Determining the similarity of deformable shapes," in Proc. IEEE Workshop on Physics Based Modeling in Computer Vision, 1995, pp. 135-143.


[169] H. Kauppinen, T. Seppanen, and M. Pietikainen, "An experimental comparison of autoregressive and Fourier-based descriptors in 2D shape classification," IEEE Trans. Pattern Anal. and Machine Intell., vol. 17, pp. 201-207, 1995.
[170] Y. Gdalyahu and D. Weinshall, "Flexible syntactic matching of curves and its application to automatic hierarchical classification of silhouettes," IEEE Trans. Pattern Anal. Machine Intell., vol. 21, no. 12, pp. 1312-1328, Dec. 1999.

[171] R. Veltkamp, "Shape matching: Similarity measures and algorithms," Utrecht University, Technical Report UU-CS-2001-03, 2001.
[172] N. Ansari and E. J. Delp, "Partial shape recognition: A landmark-based approach," IEEE Trans. Pattern Anal. and Machine Intell., vol. 12, no. 5, pp. 470-483, 1990.
[173] M. Safar, C. Shahabi, and X. Sun, "Image retrieval by shape: A comparative study," in IEEE Int'l Conf. on Multimedia and Expo (ICME'00), NY, USA, Aug. 2000, pp. 141-144.
[174] Y. Rui, A. C. She, and T. S. Huang, "Modified Fourier descriptors for shape representation - a practical approach," in 1st Int'l Workshop on Image Databases and MultiMedia Search, Amsterdam, The Netherlands, Aug. 1996.
[175] Y. Rui, A. C. She, and T. S. Huang, "A modified Fourier descriptor for shape matching in MARS," Image Databases and Knowledge Engineering (ed. S. K. Chang), World Scientific Publishing House in Singapore, vol. 8, pp. 165-180, 1998.

[176] M.-K. Hu, "Visual pattern recognition by moment invariants," IRE Trans. on Information Theory, vol. 8, 1962.

[177] A. Khotanzad and Y. Hong, "Invariant image recognition by Zernike moments," IEEE Trans. Pattern Anal. and Machine Intell., vol. 12, no. 5, pp. 489-497, May 1990.
[178] D. Shen and H. Horace, "Discriminative wavelet shape descriptors for recognition of 2D patterns," Pattern Recognition, vol. 32, no. 2, pp. 151-156, 1999.
[179] B. Zitova and J. Flusser, "Landmark recognition using invariant features," Pattern Recognition Letters, vol. 20, pp. 541-547, 1999.
[180] J. Flusser, "Refined moment calculation using image block representation," IEEE Trans. on Image Processing, vol. 9, no. 11, pp. 1977-1978, Nov. 2000.
[181] H. Blum, "A transformation for extracting new descriptors of shape," Models for Perception of Speech and Visual Forms, MIT Press, pp. 362-380, 1967.
[182] H. Blum, "Biological shape and visual science (part 1)," Journal of Theoretical Biology, no. 38, pp. 205-287, 1973.
[183] I. Biederman, Visual Object Recognition, in An Invitation to Cognitive Science (S. F. Kosslyn and D. N. Osherson (Eds.)), Visual Cognition. MIT Press, 1995, ch. 4, pp. 121-165.
[184] F. L. Bookstein, "Landmark methods for forms without landmarks: Localizing group differences in outline shape," Medical Image Analysis, vol. 1, no. 3, pp. 225-244, 1997.
[185] A. Neumann and C. Lorenz, "Statistical shape model based segmentation of medical images," Computerized Medical Imaging and Graphics, vol. 22, no. 2, pp. 133-143, 1998.



[186] J. Marchant and C. Onyango, "Fitting grey level point distribution models to animals in scenes," Image and Vision Computing, vol. 13, no. 2, pp. 3-12, Feb. 1995.
[187] J. Iivarinen, M. Peura, J. Sarela, and A. Visa, "Comparison of combined shape descriptors for irregular objects," in Proc. 8th British Machine Vision Conf. (BMVC'97), Sept. 1997, pp. 430-439.
[188] V. Mezaris, H. Doulaverakis, S. Herrmann, B. Lehane, N. O'Connor, I. Kompatsiaris, W. Stechele, and M. Strintzis, "An extensible modular common reference system for content-based information retrieval: the SCHEMA reference system," in Proc. 4th Int'l Workshop on Content-Based Multimedia Indexing (CBMI'05), Riga, Latvia, June 2005.

[189] N. Ueda and S. Suzuki, "Learning visual models from shape contours using multiscale convex/concave structure matching," IEEE Trans. Pattern Anal. Machine Intell., vol. 15, no. 4, pp. 337-352, 1993.
[190] R. Veltkamp and M. Hagedoorn, "State of the art in shape matching," Utrecht University, Technical Report UU-CS-1999-27, 1999.
[191] T. Sebastian, P. Klein, and B. Kimia, "Recognition of shapes by editing shock graphs," in Proc. 8th Int'l Conf. on Computer Vision, Vancouver, 2001, pp. 755-762.
[192] L. J. Latecki and R. Lakämper, "Contour-based shape similarity," in Proc. Int'l Conf. on Visual Information Systems, Amsterdam, vol. LNCS 1617, June 1999, pp. 617-624.
[193] I. Cohen, N. Ayache, and P. Sulger, "Tracking points on deformable objects using curvature information," in Proc. 2nd European Conf. on Computer Vision (ECCV'92), 1992, pp. 458-466.
[194] T. B. Sebastian, P. N. Klein, and B. B. Kimia, "On aligning curves," IEEE Trans. Pattern Anal. Machine Intell., vol. 25, no. 1, pp. 116-125, Jan. 2003.
[195] E. M. Arkin, L. P. Chew, D. P. Huttenlocher, K. Kedem, and J. S. B. Mitchell, "An efficiently computable metric for comparing polygonal shapes," IEEE Trans. Pattern Anal. and Machine Intell., vol. 13, no. 3, pp. 209-216, Mar. 1991.
[196] E. Keogh, "Exact indexing of dynamic time warping," in Proc. 28th Int'l Conf. on Very Large Data Bases, Hong Kong, 2002, pp. 406-417.
[197] L. S. Shapiro and J. M. Brady, "A modal approach to feature-based correspondence," in Proc. 2nd British Machine Vision Conf. (BMVC'91), Springer-Verlag, Sept. 1991, pp. 78-85.
[198] D. P. Huttenlocher, G. A. Klanderman, and W. Rucklidge, "Comparing images using the Hausdorff distance," IEEE Trans. Pattern Anal. Machine Intell., vol. 15, no. 9, pp. 850-863, 1993.
[199] S. Umeyama, "Parameterized point pattern matching and its application to recognition of object families," IEEE Trans. Pattern Anal. Machine Intell., vol. 15, no. 2, pp. 136-144, Feb. 1993.
[200] S. Sclaroff and A. P. Pentland, "Modal matching for correspondence and recognition," IEEE Trans. Pattern Anal. Machine Intell., vol. 17, no. 6, pp. 545-561, 1995.
[201] X. Yi and O. I. Camps, "Line-based recognition using a multidimensional Hausdorff distance," IEEE Trans. Pattern Anal. and Machine Intell., vol. 21, no. 9, pp. 901-916, Sept. 1999.



[202] W. E. L. Grimson and D. P. Huttenlocher, "On the sensitivity of the Hough transform for object recognition," IEEE Trans. Pattern Anal. and Machine Intell., vol. 12, no. 3, pp. 255-274, Mar. 1990.
[203] W. J. Christmas, J. Kittler, and M. Petrou, "Structural matching in computer vision using probabilistic relaxation," IEEE Trans. Pattern Anal. and Machine Intell., vol. 17, no. 8, pp. 749-764, Aug. 1995.
[204] J. W. Gorman, O. R. Mitchell, and F. P. Kuhl, "Partial shape recognition using dynamic programming," IEEE Trans. Pattern Anal. and Machine Intell., vol. 10, no. 2, pp. 257-266, 1988.
[205] A. Witkin, "Scale-space filtering," in Proc. 8th Int'l Joint Conf. on Artificial Intell. (IJCAI'83), 1983, pp. 1019-1022.
[206] F. Mokhtarian, S. Abbasi, and J. Kittler, "Efficient and robust retrieval by shape content through curvature scale space," in Proc. Int'l Workshop on Image Database and Multimedia Search, Amsterdam, The Netherlands, http://www.ee.surrey.ac.uk/Research/VSSP/imagedb/demo.html, 1996, pp. 35-42.
[207] F. Mokhtarian and R. Suomela, "Robust image corner detection through curvature scale space," IEEE Trans. Pattern Anal. and Machine Intell., vol. 20, no. 12, pp. 1376-1381, Dec. 1998.
[208] L. J. Latecki, R. Lakämper, and U. Eckhardt, "Shape descriptors for non-rigid shapes with a single closed contour," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR'00), 2000, pp. 424-429.
[209] A. Quddus, F. A. Cheikh, and M. Gabbouj, "Wavelet-based multi-level object retrieval in contour images," in Proc. the Very Low Bit rate Video Coding (VLBV'99) workshop, Kyoto, Japan, Oct. 1999, pp. 43-46.
[210] E. Milios and E. Petrakis, "Shape retrieval based on dynamic programming," IEEE Trans. on Image Processing, vol. 9, no. 1, pp. 141-147, 2000.
[211] T. Adamek and N. E. O'Connor, "A multi-scale representation method for non-rigid shapes with a single closed contour," IEEE Trans. Circuits Syst. Video Technol., special issue on Audio and Video Analysis for Multimedia Interactive Services, no. 5, May 2004.
[212] F. Attneave, "Some informational aspects of visual perception," Psychological Review, vol. 61, pp. 183-193, 1954.

[213] L. Davis, "Understanding shapes: Angles and sides," IEEE Trans. on Computers, vol. 26, no. 3, pp. 236-242, Mar. 1977.
[214] C. M. Myers, L. R. Rabiner, and A. E. Rosenberg, "Performance tradeoffs in dynamic time warping algorithms for isolated word recognition," IEEE Trans. Acoust., Speech, Signal Processing, vol. 28, no. 12, 1980.
[215] S. Jeannin and M. Bober, "Description of core experiments for MPEG-7 motion/shape," MPEG-7, ISO/IEC/JTC1/SC29/WG11/MPEG99/N2690, Seoul, Mar. 1999.
[216] M. Bober and W. Price, "Report on results of the CE-5 on contour-based shape," MPEG-7, ISO/IEC/JTC1/SC29/WG11/MPEG00/M6293, Beijing, July 2000.



[217] S. Belongie, J. Malik, and J. Puzicha, "Matching shapes," in Proc. 8th IEEE Int'l Conf. on Computer Vision, July 2001, pp. 452-463.
[218] L. Latecki and R. Lakämper, "Shape similarity measure based on correspondence of visual parts," IEEE Trans. Pattern Anal. and Machine Intell., pp. 1185-1190, Oct. 2000.
[219] "ISO/IEC MPEG-7 experimentation model (XM software)," www.lis.ei.tum.de/research/bv/topics/mmdb/e_mpeg7.html.
[220] C. Van Rijsbergen, Information Retrieval, 2nd ed. Dept. of Computer Science, Univ. of Glasgow, 1979.
[221] T. B. Sebastian, P. N. Klein, and B. B. Kimia, "Shock-based indexing into large shape databases," Lecture Notes in Computer Science, vol. 2352, pp. 731-746, 2002.
[222] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling, Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, Cambridge, 1986.
[223] J. Nelder and R. Mead, "A simplex method for function minimization," Computer Journal, no. 7, pp. 308-313, 1965.
[224] T. M. Rath, R. Manmatha, and V. Lavrenko, "A search engine for historical manuscript images," in Proc. SIGIR Conf. (SIGIR'04), July 2004.
[225] C. I. Tomai, B. Zhang, and V. Govindaraju, "Transcript mapping for historic handwritten document images," in Proc. 8th Int'l Workshop on Frontiers in Handwriting Recognition, Aug. 2002, pp. 413-418.

[226] T. Rath and R. Manmatha, "Features for word spotting in historical manuscripts," in Proc. Int'l Conf. on Document Analysis and Recognition (ICDAR'03), vol. 1, 2003, pp. 218-222.
[227] T. Rath and R. Manmatha, "Word image matching using dynamic time warping," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR'03), vol. 2, 2003, pp. 521-527.
[228] S. Marinai and A. D. (eds.), "Document analysis systems VI," in Proc. 6th Int'l Workshop DAS'04, Florence, Italy. Verlag, LNCS 3163, Sept. 2004.
[229] J. He and A. Downton, "User-assisted OCR enhancement for digital archive construction," in Proc. Int'l Conf. on Visual Information Engineering, Apr. 2005.
[230] S. M. Harding, W. B. Croft, and C. Weir, "Probabilistic retrieval of OCR degraded text using n-grams," in Proc. 1st European Conf. on Research and Advanced Technology for Digital Libraries, Pisa, Italy, Sept. 1997, pp. 345-359.

[231] C. C. Tappert, C. Y. Suen, and T. Wakahara, "The state of the art in on-line handwriting recognition," IEEE Trans. Pattern Anal. Machine Intell., vol. 12, no. 8, pp. 787-808, Aug. 1990.
[232] D. Cheng and H. Yan, "Recognition of handwritten digits based on contour information," Pattern Recognition, vol. 31, no. 3, pp. 235-255, 1998.
[233] J. Favata, "Character model word recognition," in 5th Int'l Workshop on Frontiers in Handwriting Recognition, Essex, England, Sept. 1996, pp. 437-440.



[234] S. Madhvanath and V. Govindaraju, "The role of holistic paradigms in handwritten word recognition," IEEE Trans. Pattern Anal. Machine Intell., vol. 23, no. 2, pp. 149-164, Feb. 2001.
[235] R. F. Farag, "Word-level recognition of cursive script," IEEE Trans. Comput., vol. C-28, no. 2, pp. 172-175, Feb. 1979.
[236] G. Dzuba, A. Filatov, D. Gershuny, and I. Kil, "Handwritten word recognition - the approach proved by practice," in Proc. 6th Int'l Workshop on Frontiers in Handwriting Recognition, 1998, pp. 99-111.

[237] S. Madhvanath, G. Kim, and V. Govindaraju, "Chaincode contour processing for handwritten word recognition," IEEE Trans. Pattern Anal. Machine Intell., vol. 21, no. 9, pp. 928-932, Sept. 1999.
[238] S. Madhvanath, E. Kleinberg, and V. Govindaraju, "Holistic verification of handwritten phrases," IEEE Trans. Pattern Anal. Machine Intell., vol. 21, no. 12, pp. 1344-1356, Dec. 1999.
[239] W. Niblack, An Introduction to Digital Image Processing. Englewood Cliffs, NJ: Prentice Hall, 1986.
[240] J. Sauvola, T. Seppanen, S. Haapakoski, and M. Pietikainen, "Adaptive document binarization," in Proc. Int'l Conf. on Document Analysis and Recognition, vol. 1, 1997, pp. 147-152.
[241] L. Vincent, "Morphological grayscale reconstruction in image analysis: Applications and efficient algorithms," IEEE Trans. Image Processing, vol. 2, no. 2, pp. 176-201, Apr. 1993.
[242] J. He, Q. Do, A. Downton, and J. Kim, "A comparison of binarization methods for historical archive documents," in Proc. 7th Int'l Conf. on Document Analysis and Recognition (ICDAR'2005), Seoul, Korea, Aug. 2005.

[243] T. Adamek, N. E. O'Connor, and N. Murphy, "Multi-scale representation and optimal matching of non-rigid shapes," in Proc. 4th Int'l Workshop on Content-Based Multimedia Indexing (CBMI'05), Riga, Latvia, June 2005. [244] R. Manmatha and W. B. Croft, "Word spotting: Indexing handwritten archives," in Intelligent Multi-media Information Retrieval Collection, M. Maybury (ed.), AAAI/MIT Press, 1997.

[245] S. Kane, A. Lehman, and E. Partridge, "Indexing George Washington's handwritten manuscripts," Center for Intelligent Information Retrieval, University of Massachusetts Amherst, Tech. Rep. MM-34, 2001. [246] T. Adamek and N. O'Connor, "Efficient contour-based shape representation and matching," in Proc. 5th ACM SIGMM Int'l Workshop on Multimedia Information Retrieval (MIR'03), Berkeley, CA, Nov. 2003.

[247] L. Huang and M. Wang, "Efficient shape matching through model-based shape recognition," Pattern Recognition, vol. 29, no. 2, pp. 207-215, 1996. [248] I. L. Dryden and K. V. Mardia, Statistical Shape Analysis. John Wiley & Sons, 1998. [249] C. Goodall, "Procrustes methods in the statistical analysis of shape," Jour. Royal Statistical Society, Series B, vol. 53, pp. 285-339, 1991.


[250] R. H. Davies, "Learning shape: Optimal models for analysing shape variability," Ph.D. dissertation, University of Manchester, 2002. [251] A. Hill and C. J. Taylor, "A method of non-rigid correspondence for automatic landmark identification," in Proc. 7th British Machine Vision Conf. (BMVC'96), BMVA Press, Sept. 1996, pp. 323-332.

[252] A. Benayoun, N. Ayache, and I. Cohen, "Human face recognition: From views to models - from models to views," in Proc. Int'l Conf. on Pattern Recognition, Sept. 1994, pp. 225-243. [253] C. Kambhamettu and D. Goldgof, "Point correspondence recovery in non-rigid motion," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR'92), Sept. 1992, pp. 222-227. [254] A. C. W. Kotcheff and C. J. Taylor, "Automatic construction of eigenshape models by direct optimisation," Medical Image Analysis, vol. 2, no. 4, pp. 303-314, Sept. 1998. [255] H. Thodberg and H. Olafsdottir, "Adding curvature to minimum description length shape models," in Proc. 14th British Machine Vision Conf. (BMVC'03), BMVA Press, vol. 2, 2003, pp. 251-260.

[256] A. Ericsson and K. Astrom, "Minimizing the description length using steepest descent," in Proc. 14th British Machine Vision Conf. (BMVC'03), BMVA Press, vol. 2, 2003, pp. 93-102. [257] G. Scott and H. Longuet-Higgins, "An algorithm for associating the features of two images," Phil. Trans. Roy. Soc. London, vol. B 244, pp. 21-26, 1991. [258] N. Duta, A. Jain, and M. Dubuisson-Jolly, "Automatic construction of 2d shape models," IEEE Trans. Pattern Anal. Machine Intell., vol. 23, no. 5, pp. 433-446, 2001. [259] R. Davies, T. Cootes, C. Twining, and C. Taylor, "An information theoretic approach to statistical shape modelling," in Proc. 12th British Machine Vision Conf. (BMVC'01), BMVA Press, Sept. 2001, pp. 3-11.

[260] J. Hladuvka, "Establishing point correspondence on training set boundaries," VRVis, Tech. Rep. 2003-040, 2003. [261] T. Cootes, "Timeline of developments in algorithms for finding correspondences across sets of shapes and images," University of Manchester, http://www.isbe.man.ac.uk/~bim, Tech. Rep., June 2004. [262] M. Fleute and S. Lavallee, "Building a complete surface model from sparse data using statistical shape models: Application to computer assisted knee surgery system," in Proc. Int'l Conf. on Medical Image Computing and Computer Assisted Intervention, 1998, pp. 879-887. [263] P. Zhu and P. Chirlian, "On critical point detection of digital shapes," IEEE Trans. Pattern Anal. Machine Intell., vol. 17, no. 8, pp. 737-748, Aug. 1995. [264] A. Thayananthan, B. Stenger, P. Torr, and R. Cipolla, "Shape context and chamfer matching in cluttered scenes," in Proc. Conf. on Computer Vision and Pattern Recognition, Madison, USA, June 2003.

[265] E. Hjelmås and B. K. Low, "Face detection: A survey," Computer Vision and Image Understanding, vol. 83, pp. 236-274, 2001.


[266] M. H. Yang, D. J. Kriegman, and N. Ahuja, "Detecting faces in images: A survey," IEEE Trans. Pattern Anal. Machine Intell., vol. 24, no. 1, pp. 34-58, Jan. 2002. [267] D. Marr, Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. San Francisco: Freeman, 1982. [268] I. Biederman, "Recognition-by-components: A theory of human image understanding," Psychological Review, vol. 94, no. 2, pp. 115-147, 1987. [269] M. J. Tarr and H. H. Bülthoff, "Image-based object recognition in man, monkey, and machine," in Special Issue on Object Recognition in Man, Monkey, and Machine, M. J. Tarr and H. H. Bülthoff (Eds.), Cognition, vol. 67, pp. 1-20, 1998.
