Real 3D objects introduces several difficulties, namely the presence of noise in the feature data, the ignorance of the correspondence between the features of ...
H y p e r B F Networks for real object recognition 1
R. Brunelli1, T. Poggio31 Istituto per la Ricerca Scientific a e Tecnologica 138050 Povo, Trento, I T A L Y 2 Artificial Intelligence Laboratory Massachusetts Institute of Technology Cambridge, Massachusetts 02139, USA
Abstract E v e n i f represented i n a w a y w h i c h i s i n v a r i a n t t o i l l u m i n a t i o n c o n d i t i o n s , a 3 D o b j e c t gives rise t o a n i n f i n i t e n u m b e r o f 2 D v i e w s , d e p e n d i n g o n i t s pose. I t has been r e c e n t l y s h o w n ([13]) t h a t it is possible to synthesize a m o d u l e t h a t can recognize a specific 3D o b j e c t f r o m a n y v i e w p o i n t , by using a new technique of learning f r o m e x a m p l e s , w h i c h are, i n t h i s case, a s m a l l set o f 2 D v i e w s o f t h e o b j e c t . I n t h i s p a p e r w e e x t e n d t h e t e c h n i q u e , a ) t o deal w i t h real o b j e c t s ( i s o l a t e d paper clips) t h a t suffer f r o m noise a n d occlusions a n d b ) t o e x p l o i t negat i v e e x a m p l e s d u r i n g t h e l e a r n i n g phase. W e also c o m p a r e d i f f e r e n t versions o f t h e m u l t i layer n e t w o r k s c o r r e s p o n d i n g t o our t e c h n i q u e a m o n g themselves a n d w i t h a s t a n d a r d Nearest N e i g h b o r classifier. T h e s i m p l e s t v e r s i o n , w h i c h is a R a d i a l Basis F u n c t i o n s n e t w o r k , perf o r m s less w e l l t h a n a Nearest N e i g h b o r classifier. T h e m o r e p o w e r f u l versions, t r a i n e d w i t h p o s i t i v e a n d n e g a t i v e e x a m p l e s , p e r f o r m significantly better. O u r results, w h i c h m a y have i n t e r e s t i n g i m p l i c a t i o n s for c o m p u t e r vision despite t h e r e l a t i v e s i m p l i c i t y o f t h e task s t u d i e d , are especially i n t e r e s t i n g for u n d e r s t a n d i n g the process o f o b j e c t r e c o g n i t i o n i n b i o l o g i c a l v i sion.
1
Introduction
Shape-based v i s u a l r e c o g n i t i o n o f 3 D o b j e c t s m a y b e solved b y f i r s t h y p o t h e s i z i n g t h e v i e w p o i n t (e.g., u s i n g i n f o r m a t i o n o n f e a t u r e correspondences b e t w e e n the i m age a n d a 3 D m o d e l ) , t h e n c o m p u t i n g t h e appearance o f t h e m o d e l o f t h e o b j e c t t o b e recognized f r o m t h a t v i e w p o i n t a n d c o m p a r i n g i t w i t h t h e a c t u a l i m a g e ( [ 6 ; 20; 9 ; 1 1 ; 2 1 ] ) . M o s t r e c o g n i t i o n schemes d e v e l o p e d i n c o m p u t e r v i s i o n over t h e l a s t few years e m p l o y 3 D m o d e l s o f o b j e c t s . A u t o m a t i c l e a r n i n g o f 3 D m o d e l s , however, i s i n i t s e l f a d i f f i c u l t p r o b l e m t h a t has n o t been m u c h
1278
Vision
addressed i n t h e past a n d w h i c h presents d i f f i c u l t i e s , esp e c i a l l y for a n y t h e o r y t h a t w a n t s t o account f o r h u m a n ability in visual recognition. R e c e n t l y , r e c o g n i t i o n schemes have been suggested t h a t , r e l y i n g o n a set o f 2 D views o f t h e o b j e c t i n s t e a d of a 3D m o d e l ([2; 5; 13]), offer a n a t u r a l s o l u t i o n to t h e problem of model acquisition. In particular, Poggio and E d e l m a n ([13]) have a r g u e d t h a t for each o b j e c t there exists a s m o o t h f u n c t i o n m a p p i n g a n y p e r s p e c t i v e v i e w i n t o a " s t a n d a r d " view of the object and t h a t this m u l t i v a r i a t e f u n c t i o n m a y b e a p p r o x i m a t e v e l y synthesized f r o m a s m a l l n u m b e r of views of t h e o b j e c t . Such a funct i o n w o u l d b e o b j e c t specific, w i t h d i f f e r e n t f u n c t i o n s c o r r e s p o n d i n g t o d i f f e r e n t 3 D o b j e c t s . Since synthesizing an a p p r o x i m a t i o n to a f u n c t i o n f r o m a small number of sparse d a t a - t h e v i e w s - can be considered as l e a r n i n g a n i n p u t - o u t p u t m a p p i n g f r o m a set o f examples ([14; 15]), P o g g i o a n d E d e l m a n used a scheme for t h e a p p r o x i m a t i o n o f s m o o t h f u n c t i o n s w h i c h i s e q u i v a l e n t t o a class of multilayer networks called Regularization Networks a n d , i n t h e i r m o r e general f o r m , H y p e r B a s i s f u n c t i o n s . For each 3D o b j e c t t h e r e exist a s m a l l n e t w o r k , w h i c h is " l e a r n e d " d i r e c t l y f r o m a s m a l l set of p e r s p e c t i v e views o f the o b j e c t . T h e y d e m o n s t r a t e d t h e successful p e r f o r mance of such a scheme u s i n g c o m p u t e r s i m u l a t e d 3D wireframe objects similar to paperclips. Their experim e n t s assumed t h a t t h e o b j e c t h a d been i s o l a t e d f r o m t h e b a c k g r o u n d a n d t h a t features (such a s t h e specific corners or angles b e t w e e n t h e segments) h a d been ext r a c t e d a n d m a t c h e d t o t h e c o r r e s p o n d i n g features o f t h e m o d e l v i e w s . F u r t h e r m o r e , t h e i r d a t a were noisefree a n d w i t h o u t a n y occlusions. I n t h i s paper w e e x t e n d t h e i r t e c h n i q u e a n d e x p e r i m e n t s t o m o r e realistic s i t u a t i o n s . O u r u l t i m a t e goal i s t o i m p l e m e n t a s y s t e m for t h e r e c o g n i t i o n o f h u m a n faces b y a p p l y i n g t h e H y p e r B F t e c h n i q u e t o v i e w vectors c o m p u t e d f r o m t h e i m a g e b y e x t r a c t i n g f e a t u r e s such a s t h e p o s i t i o n o f t h e eyes a n d m o u t h a n d t h e color o f t h e hairs. R e a l 3 D o b j e c t s i n t r o d u c e s several d i f f i c u l t i e s , n a m e l y t h e presence o f noise i n t h e f e a t u r e d a t a , t h e ignorance o f t h e correspondence b e t w e e n t h e f e a t u r e s o f d i f f e r e n t
v i e w s o f t h e same o b j e c t a n d f i n a l l y t h e necessity t o use i n c o m p l e t e f e a t u r e v e c t o r s (due t o t h e presence o f o c c l u sions a n d / o r t o t h e i n a b i l i t y t o recover c o r r e c t l y some o f t h e o b j e c t s f e a t u r e s ) . I t seems reasonable t o l i m i t , a t least i n a f i r s t s t e p , a l l o f these d i f f i c u l t i e s t o t h e recogn i t i o n phase: t h e l e a r n i n g phase i s s u p e r v i s e d a n d uses " g o o d " example views, where the problem of correspondence has been r e m o v e d a n d t h e noise r e d u c e d . T h e m a i n result of the paper is t h a t the H y p e r b f t e c h n i q u e , s u i t a b l y m o d i f i e d , c a n d e a l successfully w i t h t h e p r o b l e m s o f noise, occlusions a n d m i s s i n g c o r r e s p o n dences, a t least f o r t h e s i m p l e 3 D o b j e c t s w e consider here. O n e o f t h e m o s t u s e f u l a n d i n t e r e s t i n g o f o u r extensions o f t h e t e c h i q u e i s t h e use o f n e g a t i v e e x a m p l e s i n t h e t r a i n i n g , t h a t i s i n t h e m o d e l a c q u i s i t i o n , phase. T h e p l a n o f t h e p a p e r i s a s f o l l o w s . T h e f i r s t section gives a b r i e f r e v i e w o f t h e s i m p l e R B F t e c h n i q u e . W e t h e n describe t h e e x p e r i m e n t s a n d c o m p a r e t h e p e r f o r m a n c e o f d i f f e r e n t v e r s i o n o f the a l g o r i t h m ( i n c l u d i n g p e r f o r m a n c e of a s t a n d a r d Nearest N e i g h b o r classifier). T h e m o r e general H y p e r B F n e t w o r k i s t h e n i n t r o d u c e d and characterised in terms of experimental performance.
2
R a d i a l Basis F u n c t i o n s
R a d i a l Basis F u n c t i o n s can be r e g a r d e d as a special case of R e g u l a r i z a t i o n N e t w o r k s i n t r o d u c e d in [14] as a general a p p r o x i m a t i o n t e c h n i q u e t h a t can b e used i n p r o b lems o f l e a r n i n g f r o m e x a m p l e s . A scalar f u n c t i o n c a n be a p p r o x i m a t e d , g i v e n i t s value o n a sparse set o f p o i n t s { x i } , b y a n e x p a n s i o n i n r a d i a l functions: (1)
3
E x p e r i m e n t a l setup
T h e o b j e c t s used i n t h e e x p e r i m e n t s were five paper clips, r a n d o m l y generated o f t h e same l e n g t h , rendered t h r o u g h r a y - t r a c i n g techniques (see F i g . 1). T h e s m a l l n u m b e r o f o b j e c t s used m u s t b e t a k e n i n t o account w h e n t h e e x p e r i m e n t a l results are c o n s i d e r e d . T h e s m a l l difference i n p e r f o r m a n c e a m o n g t h e different techniques i s e x p e c t e d t o increase i f t h e n u m b e r o f objects t o b e classified increases. T h e c l i p was f i r s t isolated f r o m the b a c k g r o u n d ([1]) a n d t h e r e s u l t i n g b i n a r y image was skeletonised ( [ 1 8 ] ) , g i v i n g essentially a l i n e d r a w i n g . A p o l y g o n a l a p p r o x i m a t i o n r o u t i n e i d e n t i f i e d l i n e segments w h i c h were t h e n used t o r e c o n s t r u c t t h e c l i p u s i n g a n A * search ( [ 1 ] ) . T h e same a l g o r i t h m s were a p p l i e d to a real c l i p (images were t a k e n w i t h a C C D c a m e r a ) a n d p r o v e n t o p e r f o r m e q u a l l y w e l l . T h e r e c o n s t r u c t e d clips were used to s i m u l a t e d i f f e r e n t degrees of occlusions by ass i g n i n g t o each c l i p a f i n i t e r a d i u s ( t h e bigger t h e r a d i u s the h i g h e r t h e o c c l u s i o n p e r c e n t a g e ) . For each c l i p a set of 80 views were a v a i l a b l e a m o n g w h i c h t h e l e a r n i n g a n d t e s t i n g sets c o u l d b e chosen. T h e a t t i t u d e was r e s t r i c t e d t o one o c t a n t o f the v i e w i n g sphere. E a c h c l i p , h a v i n g six vertices, was described as a v e c t o r of t w e l v e c o o r d i n a t e s , r e p r e s e n t i n g the p o s i t i o n o n t h e i m a g e p l a n e o f t h e vertices (expressed i n the b a r y c e n t r i c c o o r d i n a t e s y s t e m i d e n t i f i e d b y t h e vertices t o remove t r a n s l a t i o n a l d e p e n d e n c y ) . T h e use o f the (x,y) c o o r d i n a t e s of t h e vertices on t h e i m a g e plane i s one o f t h e m a n y possible choices o f features. For i n stance, t h e angles b e t w e e n the c l i p segments c o u l d have been used ([13]). B r o a d l y s p e a k i n g , every f e a t u r e t h a t can b e cast i n t o n u m e r i c a l f o r m a n d t h a t m a p s s m o o t h l y i n t o t h e o u t p u t o f the n e t w o r k , m a y b e used i n t h e recogn i t i o n process.
where represents t h e usual E u c l i d e a n n o r m . T h e c o m p u t a t i o n of t h e coefficients c i rests on t h e i n v e r t i b i l ity of m a t r i x w h i c h has been p r o v e d (see M i c c h e l l i [12]) for f u n c t i o n s such as:
(2) (3) I t i s possible t o use fewer r a d i a l f u n c t i o n s t h a n exa m p l e s , i.e. d a t a p o i n t s . T h e r e s u l t i n g o v e r c o n s t r a i n e d s y s t e m c a n be solved in a least square w a y u n d e r t h e c o n d i t i o n s o f M i c h e l l i ' s t h e o r e m a n d proves t o b e useful w h e n m a n y e x a m p l e s are a v a i l a b l e . P o g g i o a n d G i r o s i ( [ 1 4 ; 15]) have s h o w n t h a t t h e R B F t e c h n i q u e i s a special case o f t h e r e g u l a r i z a t i o n a p proach to the a p p r o x i m a t i o n of m u l t i v a r i a t e functions. I n t h e r e g u l a r i z a t i o n a p p r o a c h one seeks t h e a p p r o x i m a t i n g f u n c t i o n w h i c h i s closest t o t h e d a t a a n d s m o o t h e s t , according to an appropriate criterion. T h e R B F techn i q u e d e s c r i b e d a b o v e , w h i c h i s t h e s i m p l e s t version o f t h e H y p e r B F scheme d e s c r i b e d l a t e r , was used i n t h e experiments described in the next section.
F i g u r e 1: A real c l i p ( l e f t ) a n d a r a y t r a c e d c l i p ( r i g h t )
4
Experiments
T h e p u r p o s e o f t h e e x p e r i m e n t s was t o test t h e recognit i o n p e r f o r m a n c e o f several s t r a t e g i e s . For each r e n d e r e d c l i p , f r o m a set of 80 v i e w s , a l e a r n i n g set of g i v e n card i n a l i t y a n d a t e s t i n g set of f i x e d c a r d i n a l i t y (10 views) were e x t r a c t e d . T h e p r o b l e m o f f e a t u r e correspondence was c o n s t r a i n e d b y t h e n a t u r e o f t h e o b j e c t s . T h e clips c o u l d be present in t h e l e a r n i n g set in one of the t w o natural orientation (the reconstruction algorithm did not f i x t h i s a m b i g u i t y ) . I n t h e r e c o g n i t i o n phase each c l i p was used i n t h e t w o ways a t t e m p t i n g r e c o g n i t i o n a n d
Brunelli a n d Poggio
1279
choosing t h e b e s t r e s u l t . T h e o r d e r o f t h e vertices was assumed t o b e c o r r e c t b u t f o r t h i s o r d e r i n g a m b i g u i t y . T h e o u t p u t r e p r e s e n t a t i o n used f o r r e c o g n i t i o n was t h e characteristic f u n c t i o n of the clips:
(4) w h e r e P(Ci) represents t h e space s p a n n e d by t h e perspective p r o j e c t i o n s o f t h e i - t h c l i p . T h e synthesis o f t h e c h a r a c t e r i s t i c f u n c t i o n f r o m t h e l e a r n i n g set i s realised by w h a t w i l l be r e f e r r e d to as RBF module. T h e response o f a m o d u l e t o a n i n p u t v e c t o r ( a c l i p t o b e classified) i s t h e value o f t h e R B F e x p a n s i o n a t t h e g i v e n p o i n t , c l i p p e d t o t h e i n t e r v a l [ 0 , 1 ] . T h e c l i p i s t h e n assigned t o t h e m o d u l e w i t h t h e h i g h e s t response. Several g r o u p s o f e x p e r i m e n t s have been p e r f o r m e d : • using only positive examples and complete vertex information; • using only positive examples and occluded clips; • using positive a n d negative examples w i t h complete vertex information. P o s i t i v e e x a m p l e s are views o f t h e c o r r e c t c l i p a n d n e g a t i v e e x a m p l e s are views of o t h e r c l i p s ( t h e r e f o r e inc o r r e c t clips for t h e m o d u l e u n d e r t r a i n i n g ) . In the first g r o u p of experiments the number of centers i n t h e R B F expansions m a t c h e d t h a t o f t h e available p o s i t i v e e x a m p l e s for t h e g i v e n c l i p . T h e p e r f o r m a n c e reflects h o w w e l l t h e R B F scheme w o r k s w i t h noise i n t h e f e a t u r e v e c t o r a n d i n t h e specific t a s k . A s w e m e n t i o n e d b e f o r e , r e a l 3 D o b j e c t s e x h i b i t occlusions ( t o p o l o g i c a l absence o f f e a t u r e s ) , u n r e c o v e r e d features (defect i v e f e a t u r e e x t r a c t i o n ) a n d correspondence p r o b l e m s . T h e f i r s t t w o c h a r a c t e r i s t i c s r e q u i r e t h e use o f i n c o m plete f e a t u r e v e c t o r s . I n t h e " v a n i l l a ' ' version o f R B F described i n s e c t i o n 2 , t h e r a d i a l basis f u n c t i o n s m u s t b e c o m p l e t e l y specified w i t h t h e i r p a r a m e t e r s before t r a i n i n g . For G a u s s i a n r a d i a l basis f u n c t i o n s , t h e value for m u s t b e chosen. L e t b e t h e l e a r n i n g set. T o each example example
a nearest n e i g h b o r can be associated, i.e. an such t h a t
(5) Since t h e t h e o r y ([14]) a l l o w s o n l y f o r a g l o b a l value o f ( a n d n o t for a d i f f e r e n t one f o r each u n i t ) , t h e average nearest n e i g h b o r d i s t a n c e was u s e d :
(6) w i t h n t h e n u m b e r o f e x a m p l e s i n t h e l e a r n i n g set. I f t h e l e a r n i n g set spans u n i f o r m l y t h e space w h e r e t h e f u n c t i o n i s d e f i n e d , n o p r o b l e m s arise. I f t h i s i s n o t t h e
1280
Vision
case o v e r / u n d e r g e n e r a l i s a t i o n m a y b e e x p e c t e d i n t h i s s i m p l e scheme. D u r i n g t h e c l a s s i f i c a t i o n t a s k , i n c o m p l e t e f e a t u r e vectors m a y b e presented t o the R B F m o d u l e . T w o strategies were i n v e s t i g a t e d . T h e f i r s t one i s p r o b a b l y t h e s i m p l e s t : a l l possible c o m b i n a t i o n o f t h e a v a i l a b l e d a t a are t e s t e d . I t s m a j o r d r a w b a c k i s t h a t t h e n u m b e r o f such c o m b i n a t i o n s g r o w s q u i t e r a p i d l y w i t h t h e n u m b e r o f t h e a v a i l a b l e f e a t u r e s . T h e second s t r a t e g y i s d i r e c t l y r e l a t e d to t h e i d e a of characteristic views of a t h r e e d i m e n s i o n a l o b j e c t (see [7]). I t c a n b e s h o w n t h a t t h e space o f t h e possible p e r s p e c t i v e v i e w s o f a t h r e e d i m e n s i o n a l o b j e c t can b e p a r t i t i o n e d i n t o subspaces w i t h t h e following properties: • a l l t h e v i e w s in a single subspace are t o p o l o g i c a l l y equivalent (their projected edge-structure exhibit a j u n c t i o n line i d e n t i t y ) ; • every v i e w in a single subspace can be t r a n s f o r m e d i n t o a n o t h e r v i e w o f t h e same subspace w i t h a linear t r a n s f o r m a t i o n ( i n homogeneous c o o r d i n a t e s ) . A n equivalence r e l a t i o n can t h e n b e d e f i n e d a n d t h e q u o t i e n t space c o n s i d e r e d . T h e elements o f t h e q u o t i e n t space are called characteristic views. T h e topological equivalence o f the views i n t h e same subspace i m p l i e s t h a t each v i e w has t h e same n u m b e r o f features (such as vertices a n d faces), t h e reverse b e i n g n o t necessarily t r u e . I t i s t h e n n a t u r a l t o assign a n R B F m o d u l e t o each characteristic view.
F i g u r e 2: A s c h e m a t i c v i e w of t h e modified characteristic view a p p r o a c h L e t u s focus o n G a u s s i a n R B F s . I f a s u f f i c i e n t l y dense l e a r n i n g set i s a v a i l a b l e , t h e value r e t u r n e d b y t h e R B F a p p r o x i m a t i o n q u i c k l y decays w i t h t h e d i s t a n c e f r o m t h e l e a r n i n g set. I t i s reasonable t o consider t h e examples i n t h e l e a r n i n g set a s w i r e f r a m e o b j e c t s , s o t h a t t h e r e are n o occlusions ( s o m e w h a t l i k e C A D m o d e l s ) . W e can t h e n e s t i m a t e t h e value o f o n t h i s virtual set. The l e a r n i n g set c a n b e s u b d i v i d e d i n t o subsets whose m e m bers share t h e same n u m b e r o f v i s i b l e (real) features. O f course, a n e q u a l n u m b e r o f features i s necessary b u t n o t sufficient t o share t h e same c h a r a c t e r i s t i c v i e w . T h e d i a g r a m of F i g . 2 shows a possible scenario a n d h o w it can b e m a p p e d i n t o d i f f e r e n t R B F m o d u l e s , one f o r each d i -
m e n s i o n a l i t y ( i n s t e a d o f one f o r each c h a r a c t e r i s t i c v i e w ) of the learning examples. T h e use o f a r e d u c e d d i m e n s i o n a l i t y i m p l i e s t h e use o f a modified As is d i r e c t l y r e l a t e d to a d i s t a n c e , t h e f o l l o w i n g t r a n s f o r m a t i o n r u l e was a d o p t e d :
where is the value at dimensionality j. Whenever a f e a t u r e v e c t o r w i t h a d i m e n s i o n a l i t y t h a t does n o t m a t c h a n y o f t h e a v a i l a b l e e x a m p l e s i s presented t o t h e recogn i t i o n m o d u l e , the first strategy can be employed. We m u s t , h o w e v e r , solve t h e p r o b l e m o f h o w t o c o m p u t e t h e d i s t a n c e b e t w e e n p o i n t s f o r w h i c h some c o o r d i n a t e s m a y n o t b e a v a i l a b l e . L e t u s define t h e m e t r i c i n t h e f o l l o w i n g way: (8) where
M is a s y m m e t r i c , p o s i t i v e d e f i n i t e m a t r i x in c a l l e d t h e metric matrix. If t h e s t a n d a r d euc l i d e a n m e t r i c is used a n d t h e l e a r n i n g set is c o m p l e t e the f o l l o w i n g m e t r i c c a n b e used ( [ 1 0 ] ) :
(9) where is t h e K r o n e c k e r s y m b o l , gi — 1 if t h e it h c o o r d i n a t e i s a v a i l a b l e i n t h e p o i n t t o b e classified a n d gi — 0 o t h e r w i s e . A m o r e useful m e t r i c can also be d e f i n e d ([10]) whose d i a g o n a l e l e m e n t s t a k e i n t o acc o u n t t h e n u m b e r o f m i s s i n g c o o r d i n a t e s . I f w e choose t o preserve t h e trace o f t h e ( d i a g o n a l ) m e t r i c m a t r i x t h e elements o f the m e t r i c m a t r i x are g i v e n b y : (10) T h i s corresponds t o w o r k i n g i n a r e d u c e d d i m e n s i o n a l i t y R B F m o d u l e w i t h a n effective chosen a c c o r d i n g t o the equation 7. T h e resulting m i x e d strategy proved to b e q u i t e effective, especially w i t h h i g h occlusions p e r centages. T h e use o f m u l t i p l e R B F m o d u l e s w i t h t h e mixed strategy and a weighted metric provides u n i f o r m results, s o t h a t t h e answer o f t h e d i f f e r e n t m o d u l e s can b e c o m p a r e d t o d e t e r m i n e t h e best answer. T o gauge t h e p e r f o r m a n c e o f t h e R B F scheme, a Nearest N e i g h b o r c l a s s i f i c a t i o n scheme was used on t h e same d a t a sets. Several a d d i t i o n a l e x p e r i m e n t s n o t r e p o r t e d here have also been p e r f o r m e d ( [ 3 ] ) . For each e x p e r i m e n t t w o q u a n t i t i e s are s h o w n i n t h e figures: M I N / M A X : t h e m i n i m u m e r r o r m a d e b y the m o d u l e on a w r o n g clip divided by the m a x i m u m error on a right clip. W h e n MIN/MAX > 1 t h e m o d u l e c a n a v o i d false a l a r m t h r o u g h t h e i n t r o d u c t i o n o f a t h r e s h o l d (sec [4]). T h e average M I N / M A X value over t h e d i f f e r e n t m o d u l e s i s r e p o r t e d i n the g r a p h s .
R E C O G N I T I O N : i s c o m p u t e d a s s i g n i n g each c l i p t o the module w i t h the smallest error, i g n o r i n g any information from M I N / M A X .
5
P e r f o r m a n c e analysis
T h e c o m p a r i s o n o f " v a n i l l a " R B F a n d Nearest N e i g h b o r c l a s s i f i c a t i o n seems t o f a v o r t h e l a t t e r (see F i g . 3). T h i s can b e e x p l a i n e d b y some recent t h e o r e t i c a l results. I t can be s h o w n ( [ 2 ] , see also [4]) t h a t u n d e r o r t o g r a p h i c p r o j e c t i o n s , t h e v i e w of a g i v e n o b j e c t as d e n n e d earlier, spans a 6 - d i m e n s i o n a l subspace. If t h e viewer is at a reasonable d i s t a n c e f r o m t h e o b j e c t t h e r e is o n l y a negligeable difference b e t w e e n p e r s p e c t i v e a n d o r t o g r a p h i c p r o j e c t i o n . T h i s i m p l y t h a t t h e views o f each c l i p a l m o s t certainly span non-intersecting 6-manifolds, embedded in t h e R12 r e p r e s e n t a t i o n space. In these e x p e r i m e n t s , t h e s t a n d a r d e u c l i d e a n m e t r i c s was used. T h e use o f Gaussians effectively corresponds to s e t t i n g a v o l u m e " a r o u n d " t h e 6 - m a n i f o l d . T h i s v o l u m e i s bigger w h e n the n u m b e r o f e x a m p l e s i s s m a l l e r ( l a r g e r i n t e r samples distances) a n d t h i s c a n e x p l a i n t h e i n f e r i o r p e r f o r m a n c e o f t h e R B F r e c o g n i t i o n scheme, w h i c h i s overgeneralizi n g . T h e average M I N / M A X e r r o r i s b e t t e r f o r R B F even i f t h e w o r s t case ( w o r s t M I N / M A X f i g u r e o f t h e r e c o g n i t i o n m o d u l e s ) i s u s u a l l y s l i g h t l y b e t t e r for N N . I t i s i n t e r e s t i n g t o c o m p a r e t h e p e r f o r m a n c e o f exh a u s t i v e m a t c h i n g a n d m u l t i p l e m o d u l e s m a t c h i n g ([3]). W h e r e a s a t l o w o c c l u s i o n percentages t h e p e r f o r m a n c e o f t h e t w o strategies i s v e r y s i m i l a r , t h e m u l t i p l e m o d u l e s a p p r o a c h p e r f o r m s d e f i n i t e l y b e t t e r w i t h a large n u m b e r of occlusions. T h e use of a w e i g h t e d m e t r i c increases a l l t h e p e r f o r m a n c e measures for b o t h strategies. T h i s i m p r o v e m e n t i s due t o t h e f a c t t h a t w h e n some features are m i s s i n g we r e q u i r e a progressively closer m a t c h f o r those w e have: t h i s a l l o w s t h e r e d u c t i o n o f false a l a r m s and of overgeneralization. A n o t h e r possible o u t p u t r e p r e s e n t a t i o n , w h i c h w e d i d n o t use because it can be p r o v e d to y i e l d an i n f e r i o r perf o r m a n c e in t h i s case ( [ 3 ] ) , is t h a t of a p r o t o t y p e v i e w ([13]): every v i e w of a given c l i p is m a p p e d i n t o a part i c u l a r v i e w of t h a t c l i p , c a l l e d i t s prototype view ( t h i s m a p p i n g is realized t h r o u g h a v e c t o r v a l u e d f u n c t i o n ) . T h e c l i p i s t h e n assigned t o t h e m o d u l e e x h i b i t i n g t h e smallest distance f r o m t h e o u t p u t o f t h e R B F e x p a n s i o n to its prototype view. T h e differences i n p e r f o r m a n c e u s i n g o n l y p o s i t i v e exa m p l e s , p o s i t i v e e x a m p l e s w i t h i n f o r m a t i o n f r o m negat i v e e x a m p l e s ( R B F centers o n l y o n p o s i t i v e examples) a n d p o s i t i v e a n d n e g a t i v e examples (centers o n b o t h posi t i v e a n d n e g a t i v e e x a m p l e s ) have been t e s t e d u s i n g r a d i a l G a u s s i a n f u n c t i o n s (see F i g . 4 ) . D u e t o t h e decayto-zero of Gaussians n e g a t i v e e x a m p l e s are effective for the v a n i l l a R B F scheme o n l y w h e n t h e n u m b e r o f e x a m ples is s m a l l .
Brunelli a n d Poggio
1281
6
H y p e r Basis F u n c t i o n s t e c h n i q u e
T h e " v a n i l l a " R B F t e c h n i q u e p r o v i d e s a basis for c o m p a r i s o n w i t h t h e extensions o f t h e R e g u l a r i s a t i o n network technique described by T. Poggio a n d F. Girosi in [14]. F r o m a m o r e general f o r m u l a t i o n o f t h e v a r i a t i o n a l problem of regularization they derive the following app r o x i m a t i o n scheme, i n s t e a d o f e q u a t i o n ( 1 ) :
(11) where the parameters t a , t h a t we call "centers," and the coefficients c a are u n k n o w n , a n d are in general m u c h fewer t h a n t h e d a t a p o i n t s The term p(x)
F i g u r e 3 : C o m p a r i s o n b e t w e e n R B F w i t h c o m p l e t e vertex i n f o r m a t i o n ( A D ) and the m u l t i p l e modules strategy ( M I ) a t d i f f e r e n t o c c l u s i o n percentages.
is a p o l y n o m i a l t h a t o f t e n c a n be n e g l e c t e d , t h o u g h it usually consists o f t h e c o n s t a n t a n d linear t e r m s . T h e n o r m is a weighted norm
(12)
F i g u r e 4 : R e c o g n i t i o n rates u s i n g i n f o r m a t i o n f r o m nega t i v e e x a m p l e s : P C P N I uses as centers o n l y the posit i v e examples b u t takes a d v a n t a g e o f t h e negative e x a m ples i n f o r m a t i o n ( P C P N I ( n ) uses o n l y p o s i t i v e centers to compute P N C P N I uses a s centers all o f t h e p o s i t i v e a n d n e g a t i v e e x a m p l e s ; P C P I does n o t use negative examples a n d i s p r o v i d e d for c o m p a r i s o n .
1262
Vision
where W is an u n k n o w n square m a t r i x a n d t h e supers c r i p t T i n d i c a t e s t h e transpose. In the s i m p l e case of d i a g o n a l W t h e d i a g o n a l elements W i assign a specific w e i g h t t o each i n p u t c o o r d i n a t e , d e t e r m i n i n g i n fact the u n i t s o f measure a n d t h e i m p o r t a n c e o f each f e a t u r e . In t h i s f o r m u l a t i o n t h e l e a r n i n g stage is used to estim a t e n o t o n l y the coefficients o f t h e R B F e x p a n s i o n , b u t also the m e t r i c ( p r o b l e m d i p e n d e n t d i m e n s i o n a l i t y r e d u c t i o n ) a n d t h e p o s i t i o n o f the centers ( o p t i m a l examples s e l e c t i o n ) . T h i s m o r e general o p t i m i z a t i o n p r o b l e m c a n n o t b e solved t h r o u g h m a t r i x i n v e r s i o n a n d m e t h o d s l i k e grad i e n t descent ([17]) o r r a n d o m m i n i m i z a t i o n techniques ( [ 8 ; 19]) m u s t be used (as t h e p o s s i b i l i t y of g e t t i n g t r a p p e d i n p o o r local m i n i m a i s s i g n i f i c a n t w e o p t e d for the l a t t e r a p p r o a c h ) . T h e p o s s i b i l i t y o f m o v i n g t h e e x p a n s i o n centers, i n i t i a l l y l o c a t e d on some of the l e a r n i n g e x a m p l e s , is useful i n the presence o f noise i n the available d a t a and for i m p r o v i n g the representativity o f t h e center. T h e a d j u s t a b l e m e t r i c i s even m o r e i m p o r t a n t , since i t allows to set a m o r e specific d i s s i m i l a r i t y measure b e t w e e n new views a n d centers t h a n e u c i i d e a n m e t r i c . One o f t h e m o s t i n t e r e s t i n g results i s t h a t i n t h e task o f o b j e c t r e c o g n i t i o n , a n d i f a n a p p r o a c h s i m i l a r t o ours is used, an a d j u s t a b l e m e t r i c requires necessarily negat i v e examples. T h e a r g u m e n t , s u b s t a n t i a t e d b y e x p e r i m e n t s , rests o n the t r i v i a l o b s e r v a t i o n t h a t i f o n l y posit i v e examples were used, i t w o u l d b e possible t o o b t a i n a n o p t i m a l ( i n t e r m s o f q u a d r a t i c e r r o r o n t h e examples) a p p r o x i m a t i o n to the characteristic function of an object b y m a p p i n g every possible i n p u t v i e w t o the value 1 , t h a t is by choosing W = 0. T h i s is o b v i o u s l y a p o o r choice for t h e task o f o b j e c t d i s c r i m i n a t i o n . T h e use o f negative examples i s therefore essential. T h e a d d e d c o m p l e x i t y i s s m a l l since t h e c o m p u t a t i o n a l c o m p l e x i t y o f t h e R B F
e x p a n s i o n d e p e n d s l i n e a r l y o n t h e n u m b e r o f examples ( f o r a f i x e d n u m b e r o f c e n t e r s ) . T h i s means t h a t i f m i n i m i z a t i o n t e c h n i q u e s such a s g r a d i e n t descent o r r a n d o m m i n i m i s a t i o n are used, t h e increase i n t h e c o m p u t a t i o n t i m e i s e x p e c t e d t o b e l i n e a r . I n t h e case o f o b j e c t recogn i t i o n , n e g a t i v e e x a m p l e s a l l o w t o r e m e d y i n a n elegant w a y w h a t i s t h e m a i n p r o b l e m i n m o s t l e a r n i n g tasks: t h e need of a large n u m b e r of e x a m p l e s . Use of a large number of positive examples w o u l d make the whole app r o a c h o f t h i s p a p e r q u i t e m o o t : each o b j e c t w o u l d b e represented essentially as a l o o k - u p t a b l e of v e r y m a n y of i t s views w i t h a l l t h e c o r r e s p o n d i n g c o m p l e x i t y o f storage a n d a c q u i s i t i o n . O u r p r o p o s a l , s u p p o r t e d b y o u r exp e r i m e n t s , is, i n s t e a d , t o use a s m a l l n u m b e r o f p o s i t i v e views a n d a large n u m b e r o f n e g a t i v e e x a m p l e s , w h i c h are easily a v a i l a b l e i f t h e d a t a basis i n c l u d e s m a n y o b jects . T h e p l o t s o f F i g . 5 gives some i n f o r m a t i o n o n how t h e g e n e r a l i z a t i o n proceeds w i t h t h e n u m b e r o f p o s i t i v e a n d n e g a t i v e e x a m p l e s a v a i l a b l e as w e l l as a c o m p a r i s o n w i t h t h e nearest n e i g h b o r results o n t h e same d a t a sets. R B F ( N ) refers t o a one center H y p e r B F e x p a n s i o n w i t h m o v a b l e coefficient, center a n d d i a g o n a l m e t r i c ( N i s the number of positive examples). T h e e x p e r i m e n t a l results a l l o w u s t o r a n k the different R B F approaches t o g e t h e r w i t h t h e Nearest N e i g h b o r classification scheme p r o v i d i n g a useful gauge. T h e R B F approaches can be s o r t e d , by i n c r e a s i n g p e r f o r m a n c e , in the following way: 1 . " v a n i l l a " R B F : use o f o n l y p o s i t i v e examples w i t h as m a n y centers as a v a i l a b l e examples 2 . R B F w i t h n e g a t i v e e x a m p l e s : o n l y t h e p o s i t i v e examples are used as centers b u t the i n f o r m a t i o n f r o m the n e g a t i v e examples is used 3. Nearest N e i g h b o r : t h e p e r f o r m a n c e is nearly equal t o t h a t o f R B F w i t h n e g a t i v e examples 4. H y p e r B F w i t h diagonal t i o n m u s t b e used)
metric
(negative i n f o r m a -
5 . H y p e r B F w i t h c o m p l e t e m e t r i c : again negative i n -
euclidean m e t r i c . A c h a r a c t e r i s t i c v i e w r e p r e s e n t a t i o n has been c o m p a r e d to an e x h a u s t i v e search for t h e best m a t c h i n t h e case o f self-occluded clips. Especially i n t e r e s t i n g is t h e use of n e g a t i v e examples w h i c h y i e l d some i m p r o v e m e n t i n t h e v a n i l l a R B F a p p r o a c h a n d are c r i t i c a l l y i m p o r t a n t for the m o r e general H y p e r B F scheme, w h i l e r e d u c i n g c o n s i d e r a b l y t h e c o m p l e x i t y o f the technique in terms of representation and model acquisition. T h e H y p e r B F g e n e r a l i z a t i o n o f t h e basic t e c h n i q u e , w i t h t h e i n t r o d u c t i o n o f m o v a b l e e x p a n s i o n centers and the synthesis of a t a s k d e p e n d e n t m e t r i c , p r o v e d to be successful in o b t a i n i n g a c o m p a c t r e p r e s e n t a t i o n of t h e objects t h a t appears to w o r k well in the - a d m i t t e d l y still v e r y l i m i t e d - r e c o g n i t i o n tasks d e s c r i b e d here.
References [1]
D. H. B a l l a r d a n d C. M. B r o w n .
Computer Vision.
P r e n t i c e H a l l , E n g l e w o o d C l i f f s , N J , 1982. [2] R. B a s r i a n d S. U l l m a n . R e c o g n i t i o n by linear c o m b i n a t i o n s o f m o d e l s . T e c h n i c a l r e p o r t , T h e Weizm a n n I n s t i t u t e o f Science, 1989. [3] R . B r u n e l l i a n d T . P o g g i o . Use o f r b f i n real o b j e c t r e c o g n i t i o n . T e c h n i c a l R e p o r t 9011-09, I . R . S . T , 1990. [4] S . E d e l m a n a n d T . P o g g i o . B r i n g i n g the g r a n d m o t h e r back i n t o t h e p i c t u r e : a m e m o r y - b a s e d v i e w of object recognition. Technical Report A . L Memo N o . 1 1 8 1 , Massachusetts I n s t i t u t e o f Technology, 1990. [5] S. E d e l m a n a n d D. W e i n s h a l l . A self-organizing m u l t i p l e - v i e w representation of 3d objects. Techn i c a l R e p o r t A . L M e m o N o . 1146, Massachusetts I n s t i t u t e o f T e c h n o l o g y , 1989. [6] M . A . Fischler a n d R . C . Bolles. R a n d o m sample consensus: a p a r a d i g m for m o d e l f i t t i n g w i t h app l i c a t i o n s t o i m a g e analysis a n d a u t o m a t e d c a r t o g raphy. Communications of the ACM, 24:381-395, 1981.
f o r m a t i o n m u s t b e used.
7
Conclusions
W e have a p p l i e d r e c e n t l y p r o p o s e d n e t w o r k s for l e a r n i n g f r o m e x a m p l e s ( t h e v a n i l l a R B F version a s w e l l a s t h e m o r e p o w e r f u l H y p e r B F scheme) t o t h e p r o b l e m o f 3 D o b j e c t r e c o g n i t i o n . E x p e r i m e n t a l results o n r e c o g n i t i o n r a t e have been o b t a i n e d for w i r e f r a m e o b j e c t s ( p a p e r clips represented as p o l y l i n e s , hence w i t h no occlusions) a n d for m o r e realistic o b j e c t s ( p a p e r clips represented as a set of c i l i n d e r s of v a r y i n g r a d i i , hence e x h i b i t i n g d i f f e r e n t percentages o f o c c l u s i o n s ) . W e have e x t e n d e d t h e R B F t e c h n i q u e i n o r d e r t o cope w i t h t h e p r o b l e m o f feature occlusion t h r o u g h the i n t r o d u c t i o n of a modified
[7] H. F r e e m a n a n d I. C h a k r a v a r t y . T h e use of chara c t e r i s t i c views i n the r e c o g n i t i o n o f three d i m e n sional o b j e c t s . In E. S. G e l s e m a a n d L. N. K a n a l , e d i t o r s , Pattern Recognition In Practice, pages 2 7 7 288. N o r t h - H o l l a n d , 1980. [8] F. G i r o s i a n d B. C a p r i l e . A n o n d e t e r m i n i s t i c m i n i mization a l g o r i t h m . Technical Report A . I . Memo N o . 1254, Massachusetts I n s t i t u t e o f Technology, 1990. [9]
D. P. H u t t e n l o c h e r a n d S. U l l m a n . In Proc. 1st Int. Conf.
Computer
Vision,
pages
102-111,
Washing-
t o n , D . C . , 1987.
Brunelli a n d Poggio
1283
[10]
A. K. J a i n a n d R. C. D u b e s . tering Data.
[11]
Algorithms for Clus-
P r e n t i c e H a l l , 1988.
D. G. L o w e . Perceptual organization and visual recognition. K l u w e r A c a d e m i c P u b l i s h e r s , 1986.
[12] C . A . M i c c h e l l i . Interpolation of scattered data: D i s t a n c e m a t r i c e s a n d c o n d i t i o n a l l y p o s i t i v e defin i t e f u n c t i o n s . Constr. Approx, 2 : 1 1 - 2 2 , 1986. [13] T. P o g g i o a n d S. E d e l m a n . A n e t w o r k t h a t learns to recognize t h r e e - d i m e n s i o n a l o b j e c t s . Nature, 3 4 3 ( 6 2 2 5 ) : l - 3 , 1990. [14] T . P o g g i o a n d F . G i r o s i . A t h e o r y o f n e t w o r k s for approximation and learning. Technical Report A . I . M e m o N o . 1140, M a s s a c h u s e t t s I n s t i t u t e o f T e c h n o l o g y , 1989. [15] T . P o g g i o a n d F . G i r o s i . N e t w o r k s for a p p r o x i m a t i o n a n d l e a r n i n g . In Proc. of the IEEE, Vol. 78, pages 1 4 8 1 - 1 4 9 7 , 1990. [16] T . P o g g i o a n d F . G i r o s i . R e g u l a r i z a t i o n a l g o r i t h m s for l e a r n i n g t h a t are e q u i v a l e n t t o m u l t i l a y e r netw o r k s . Science, 2 4 7 : 9 7 8 - 9 8 2 , 1990. [17] W. H. Press, B. P. F l a n n e r y , S. A. T e u k o l s k y , a n d W. T. V e t t e r l i n g . Numerical Recipes in C. Camb r i d g e U n i v e r s i t y Press, 1988. [18] S . S u z u k i a n d K . A b e . B i n a r y p i c t u r e t h i n n i n g b y an i t e r a t i v e p a r a l l e l t w o - s u b c y c l e o p e r a t i o n . Pattern Recognition, 2 0 ( 3 ) : 2 9 7 - 3 0 7 , 1987. [19] G . T e c c h i o l l i a n d R . B r u n e l l i . O n r a n d o m m i n i m i z a tion of quadratic forms. Technical Report in prep., I . R . S . T , 1990. [20] D . W . T h o m p s o n a n d J . L . M u n d y . Threedimensional model m a t c h i n g f r o m an unconstrained viewpoint. In Proceedings of the IEEE Conference on Robotics and Automation, pages 2 0 8 - 2 2 0 , 1987. [21] S. U l l m a n . A l i g n i n g p i c t o r i a l d e s c r i p t i o n s : an approach to object recognition. Cognition, 3 2 : 1 9 3 254, 1989.
1284
Vision
F i g u r e 5 : L E F T : r e c o g n i t i o n rates u s i n g G a u s s i a n H y p e r B F as a f u n c t i o n of t h e percentage of n e g a t i v e exa m p l e i n t h e l e a r n i n g set. R I G H T : c o m p a r i s o n between complete metric (C) and diagonal metric (D) ( b o t h w i t h one m o v a b l e c e n t e r ) . Abscissas represent t h e n u m b e r o f p o s i t i v e e x a m p l e used. A l l t h e p o s i t i v e e x a m p l e s o f every o t h e r o b j e c t were used as n e g a t i v e e x a m p l e s .