Dynamic Texture Recognition by Spatio-temporal Multiresolution Histograms

Jihong Pei (Modern Education Tech Center), Jianjun Huang (School of Electronic & Information Engineering), Shenzhen University, Guangdong, 518060, China. [email protected], [email protected]

Abstract

Chin-Hwee Peh and Loong-Fah Cheong proposed a representation of both the spatial and temporal aspects of texture [2]. The direct use of motion for object recognition was realized computationally by Nelson and Polana in their qualitative analysis of temporal texture [3]: statistical features based on the magnitudes and directions of flow vectors were calculated to recognize different types of temporal texture. Their study highlighted the computational possibility of using low-level motion features for recognition. However, as a pioneering work, it did not fully address the intriguing spatio-temporal relationships among the moving parts of objects [2], and the distinctive spatial features of the moving objects were also insufficiently exploited. Bouthemy and Fablet [4] statistically analyzed the temporal distribution of appropriate local motion-based measures to perform global motion characterization of video shots. They enhanced the temporal descriptive power by analyzing motion over an extended sequence rather than just over two frames, but the spatial characteristics of the moving objects were ignored. Bar-Joseph [5] used multi-resolution analysis (MRA) tree merging for the synthesis and merging of 2D textures and extended the idea to temporal textures; for 2D textures, new MRA trees were constructed by merging MRA trees obtained from the input. Payam and Gianfranco [6] assumed that the sequences of images were realizations of second-order stationary stochastic processes (the covariance is finite and shift-invariant), and set out to classify and recognize not individual realizations but the statistical models that generate them. Optical flow histograms [7-9] or flow variance [10, 11] are usually employed to characterize global motion or activity. However, it can be difficult to quantify or describe some motion types, especially those that are indeterminate in temporal extent.


Dynamic textures are sequences of images of moving scenes that exhibit certain stationarity properties in time, for example sea waves, smoke, foliage and whirlwinds. This work proposes a novel characterization of dynamic textures for the problem of recognizing them: a spatio-temporal multiresolution histogram based on velocity and acceleration fields. The spatio-temporal multiresolution histogram has many desirable properties, including simple computation, spatial efficiency, robustness to noise and the ability to encode spatio-temporal dynamic information, so it can reliably capture and represent the motion properties of different image sequences. The velocity and acceleration fields of image sequences at different spatio-temporal resolutions are accurately estimated by the structure tensor method. We also describe a simple matching algorithm based on the multiresolution histogram that measures the difference between two sequences.

1. Introduction

Dynamic textures are usually understood as multidimensional stochastic processes exhibiting some stationarity over time, for example sea waves, smoke, foliage and whirlwinds. There are two main research directions for dynamic (or time-varying) textures: representation and recognition, and modeling, synthesis and editing. The former is mainly applied in the fields of pattern recognition and vision understanding, the latter in the field of computer graphics. In this paper we are interested in the spatio-temporal statistical properties of such visual scenes for the purpose of recognition. Geometric, photometric and dynamic properties are used for object and texture recognition and for image and video retrieval. Saisan dealt with the problem of recognizing a sequence of images based upon a joint photometric and dynamic model [1].


Zongqing Lu, Weixin Xie, School of Electronic Engineering, Xidian University, Xi'an, 710071, China. [email protected]

This work is sponsored by the Shenzhen Science & Technology Project, Grant No. 200338.

Proceedings of the IEEE Workshop on Motion and Video Computing (WACV/MOTION’05) 0-7695-2271-8/05 $ 20.00 IEEE


In our work we perform a spatio-temporal multiresolution decomposition of video sequences and compute optical flow fields. We adopt a similar idea (multiresolution histograms) from [12] to complete the multiresolution representation of video sequences. The remaining sections are organized as follows. Section 2 outlines the motion fields. Section 3 describes the spatio-temporal multiresolution histograms based upon velocity and acceleration fields, and also gives a matching measurement. Section 4 describes the experiments conducted and discusses their results, and Section 5 presents conclusions and future research directions.

2. Motion vector field

Motion information is a crucial cue for visual perception, motion classification and action recognition. Optical flow [13] provides a 2D estimate of apparent velocities based on the pixel intensity values across a group of adjacent frames. Unfortunately, optical flow cannot be estimated from image intensities alone unless an additional constraint is imposed (e.g., smoothness). Such constraints are either difficult to implement in practice or do not hold over the entire image. Numerous methods of optical flow estimation have been developed over the last 20 years. Recently, new optical flow computation schemes, such as the tensor-based method, were proposed based on the total least squares (TLS) approach; the tensor-based method constructs a 3D structure tensor and performs an eigen-space analysis. Motion vectors are then estimated by thresholding the eigenvalues, which also provides confidence measurements on the discontinuities of the motion field. The three-dimensional (3D) structure tensor technique [14,15,16] has been shown to have very low systematic error and noise sensitivity, and to yield very good results. Our goal is to design a general framework that provides a global statistical characterization of motion content based on the optical flow field.

Under the commonly used assumption that the image brightness $f(x,t)$ at point $x = [x, y]^T$ and time $t$ changes only due to motion, we have

$f_x\,dx + f_y\,dy + f_t\,dt = 0$   (1)

The optical flow is $v(x,t) = v(x,y,t) = [v_x, v_y]^T$. Letting $\nabla f = [f_x, f_y, f_t]^T = [\partial f/\partial x, \partial f/\partial y, \partial f/\partial t]^T$ and $\bar{v} = [v_x, v_y, 1]^T$, the optical flow constraint equation is

$\nabla f^T \bar{v} = 0$   (2)

This constraint provides only one linear equation with two unknowns (the components of the local velocity vector), so additional constraints are needed to determine the velocity locally (the aperture problem). Equation (2) does enable us to determine the normal velocity directly:

$v_{normal} = \frac{-f_t}{\sqrt{f_x^2 + f_y^2}}$

Normal flow, being the amount of pixel movement along the image intensity gradient, naturally describes the temporal aspect of an object in dynamic motion [2]. $\bar{v}$ is the 3D flow in $x$-$t$ space; it can be solved by minimizing the following energy function in the least-squares sense:

$E(x) = \int_D w(x - x') \left[\nabla f^T \bar{v}\right]^2 d^2x'$   (3)

where $w(\cdot)$ is a 2D window function that defines the neighborhood $D$ around $x$ [14]. Define the mean operation $\langle \cdot \rangle$ as

$\langle f \rangle = \int_D w(x - x') f\, d^2x'$   (4)

The energy function in (3) can then be expressed as

$E(x) = \langle [\nabla f^T \bar{v}]^2 \rangle = \bar{v}^T \langle (\nabla f)(\nabla f)^T \rangle \bar{v}$   (5)

So the structure tensor is $T = \langle (\nabla f)(\nabla f)^T \rangle$; here we do not adopt an affine model. Figure 1 shows a fire sequence and the corresponding velocity field.
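The estimation above can be illustrated with a short sketch: build the averaged structure tensor of a small space-time patch and read the flow off the eigenvector of the smallest eigenvalue. The function name is ours, and the flat average over the whole patch stands in for the window $w(\cdot)$ of equation (3), so treat this as a minimal sketch rather than the paper's implementation.

```python
import numpy as np

def flow_from_patch(patch):
    """Estimate one velocity vector from a (T, H, W) patch by eigen-analysis
    of the averaged 3D structure tensor T = <(grad f)(grad f)^T>, Eq. (2)-(5).
    A sketch: the window w(.) is a flat average over the whole patch."""
    f = patch.astype(np.float64)
    ft, fy, fx = np.gradient(f)                        # f_t, f_y, f_x by finite differences
    g = np.stack([fx.ravel(), fy.ravel(), ft.ravel()]) # 3 x N stacked gradients
    T = g @ g.T / g.shape[1]                           # averaged structure tensor
    lam, V = np.linalg.eigh(T)                         # ascending: lam1 <= lam2 <= lam3
    v1 = V[:, 0]                                       # eigenvector of smallest eigenvalue
    if abs(v1[2]) < 1e-12:
        return None, lam                               # degenerate / aperture case
    v = v1[:2] / v1[2]                                 # optical flow [vx, vy]
    return v, lam
```

For a pattern translating by one pixel per frame along x, the recovered flow is close to (1, 0), and the smallest eigenvalue is near zero, matching the eigenvalue case analysis that follows.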


Figure 1. Left: image sequence; right: velocity field.

The structure tensor $T = \langle (\nabla f)(\nabla f)^T \rangle$ has three eigenvalues, ordered $0 \le \lambda_1 \le \lambda_2 \le \lambda_3$, with corresponding eigenvectors $v_1, v_2, v_3$. They satisfy [14] $\|(\nabla f)^T v_k\|^2 = \lambda_k \|v_k\|^2$. Four cases often need to be analyzed:
1. $\lambda_{1,2,3} = 0$: in this case $(\nabla f)^T v_k = 0$ for $k = 1, 2, 3$; since $v_1, v_2, v_3$ form an orthogonal basis of the $x$-$t$ space, $\nabla f$ has to be zero, i.e., the brightness distribution is constant in the neighborhood $D$. No motion can be detected.
2. $\lambda_{1,2} = 0$, $\lambda_3 > 0$: the aperture problem; only the normal flow can be computed: $v_{normal} = -f_t / \sqrt{f_x^2 + f_y^2}$.

3. $\lambda_{1,2,3} > 0$: the brightness varies in all directions, so no coherent motion can be detected in this case.
4. $\lambda_1 = 0$, $\lambda_{2,3} > 0$: with $v_1 = [v_{1,1}, v_{1,2}, v_{1,3}]^T$, the optical flow is $v = \frac{1}{v_{1,3}}[v_{1,1}, v_{1,2}]^T$.
As the definition of confidence we use the corner measure of [30]:

$c = \left(\frac{\lambda_3 - \lambda_1}{\lambda_3 + \lambda_1}\right)^2 - \left(\frac{\lambda_3 - \lambda_2}{\lambda_3 + \lambda_2}\right)^2$   (6)
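A small helper for the confidence of equation (6). Since the printed formula is garbled in the source, the corner-measure form below (total coherency minus edge coherency, following the measures in [30]) is our reconstruction and should be treated as an assumption:

```python
import numpy as np

def corner_confidence(lam):
    """Confidence c of Eq. (6) from ascending eigenvalues (l1, l2, l3) of the
    structure tensor: total coherency minus edge coherency (corner measure,
    after [30]). The exact algebraic form is our reconstruction."""
    l1, l2, l3 = lam
    eps = 1e-12                                   # guards division by zero
    total = ((l3 - l1) / (l3 + l1 + eps)) ** 2    # any structure vs. none
    edge = ((l3 - l2) / (l3 + l2 + eps)) ** 2     # pure-edge (aperture) part
    return total - edge
```

The measure is near 1 for a clean translational motion ($\lambda_1 = 0$, $\lambda_2 \approx \lambda_3 > 0$) and near 0 both for the aperture case ($\lambda_{1,2} = 0$) and for structureless regions.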

Velocity information alone may not represent the dynamic properties of temporal texture sufficiently, so acceleration (the rate of change of velocity) should also be considered. Acceleration is

$a(x,t) = \frac{\partial v(x,t)}{\partial t}$

In real applications we compute the acceleration as

$a(x,n) = v(x,n+1) - v(x,n)$   (7)

as shown in figure 2.
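Equation (7) is a simple frame difference of flow fields; a one-line sketch (the array layout is our assumption):

```python
import numpy as np

def acceleration_field(v_seq):
    """Frame-to-frame acceleration a(x, n) = v(x, n+1) - v(x, n), Eq. (7).
    v_seq: per-frame flow fields of shape (N, H, W, 2)."""
    return v_seq[1:] - v_seq[:-1]
```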


Figure 2. Acceleration field

work, various recognition systems [17,18] based on histograms were developed. Histograms are now an important tool for the retrieval of images and video from multimedia databases [19-22]. Among the reasons for their importance: they can be computed easily and efficiently, they achieve significant data reduction, and they are robust to noise and local image transformations. However, a single histogram is not adequate, since it does not capture spatial information. To discriminate between images with identical or similar histograms, several features have been suggested that extend plain histograms. Some algorithms use local intensity histograms rather than global ones. Local histograms have been combined with explicit image coordinates [23]; another representation that combines image scale with the histogram is the locally orderless histogram suggested by Griffin [24] as well as Koenderink and Van Doorn [25]. Kadir and Brady [26] compute local histograms over regions of varying size. Local histograms are often tied to the hard problem of region segmentation; another limitation is that they do not represent image structure. The extension of traditional histograms to multiresolution histograms combining intensity and spatial information was introduced in [27], which overcomes some shortcomings of single or local histograms. The dominant types of multiresolution decompositions have been constructed with derivative filters as well as orientation- and frequency-selective filters such as differences of Gaussians, Gabor filters, wavelets and steerable filters. We extend spatial multiresolution decomposition to spatio-temporal multiresolution decomposition by incorporating an additional decomposition along the temporal axis. Let a video sequence be $f(x,t)$, $x = [x, y]^T$.
To decrease the video sequence resolution, we use a spatio-temporal Gaussian filter:

$g_{l_x,l_t}(x,t) = \frac{1}{(2\pi)^{3/2}\, l_x^2 \sigma_x^2\, l_t \sigma_t} \exp\!\left(-\frac{x x^T}{2 l_x^2 \sigma_x^2}\right) \exp\!\left(-\frac{t^2}{2 l_t^2 \sigma_t^2}\right)$

where $\sigma_x$ is the standard deviation of the filter [28] in the

Similarly to $v(x,t)$ and $a(x,t)$, we define $C(x,t)$, the field of confidence measures, where the confidence is defined in equation (6).

3. Spatio-temporal multiresolution histogram

Histograms have been widely used to represent, analyze and characterize images. One of the initial applications of histograms was the work of Swain and Ballard on the identification of 3D objects. Following that

spatial domain, $\sigma_t$ is the standard deviation in the temporal domain, and $(l_x, l_t)$ is the spatio-temporal resolution ($l_x$: spatial resolution, $l_t$: temporal resolution). The filtered image sequence is $f * g_{l_x,l_t}$; we then resample $f * g_{l_x,l_t}$ along the spatial and temporal axes to get a lower spatio-temporal resolution sequence, as shown in figure 3.
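One level of this decomposition can be sketched as separable Gaussian smoothing followed by resampling. The kernel truncation, the nearest-frame temporal resampling and all names are our choices; the subsampling factors (2 spatial, 1.5 temporal) are taken from the experiment settings in Section 4.

```python
import numpy as np

def smooth_axis(f, sigma, axis):
    """Convolve f with a normalized 1D Gaussian (std sigma) along one axis."""
    r = max(1, int(3 * sigma))                 # truncate at ~3 sigma
    x = np.arange(-r, r + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    k /= k.sum()
    return np.apply_along_axis(lambda a: np.convolve(a, k, mode='same'), axis, f)

def downscale(f, sigma_x=1.5, sigma_t=1.0, step_x=2, step_t=1.5):
    """One pyramid level: separable spatio-temporal Gaussian smoothing
    (g_{lx,lt}), then resampling along t, y, x. f has shape (T, H, W)."""
    g = smooth_axis(f, sigma_t, 0)             # temporal axis
    g = smooth_axis(g, sigma_x, 1)             # y
    g = smooth_axis(g, sigma_x, 2)             # x
    t_idx = np.arange(0, f.shape[0], step_t).astype(int)  # nearest-frame pick
    return g[t_idx][:, ::step_x, ::step_x]
```

Applying `downscale` repeatedly yields the sequences $f_{l_x,l_t}$ at the successive resolution levels.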


$d_{v,n} = \{\, x \mid n \le \mathrm{angle}[v_{l_x,l_t}(x,t)] < n+1 \,\}$, $-180 \le n < 180$
$d_{a,n} = \{\, x \mid n \le \mathrm{angle}[a_{l_x,l_t}(x,t)] < n+1 \,\}$, $-180 \le n < 180$

$\alpha(\cdot)$ and $\beta(\cdot)$ are weighting functions.

Figure 3. Left: the original sequence; right: the low-resolution sequence, whose spatial and temporal scales are half those of the original.

We compute histograms of image sequences at different spatio-temporal resolutions; in the following we describe how the histogram is defined. Note that our spatio-temporal multiresolution histogram, like the multiresolution histogram of an image [12], is a video representation, because the spatio-temporal multiresolution decomposition is applied to the video. It differs from representations where a multiresolution decomposition is applied exclusively to the histogram. $f_{l_x,l_t}(x,t)$ is the video sequence at spatio-temporal resolution $(l_x, l_t)$, resampled from $f * g_{l_x,l_t}$. $v_{l_x,l_t}(x,t)$, $a_{l_x,l_t}(x,t)$ and $C_{l_x,l_t}(x,t)$ are the corresponding

velocity, acceleration and confidence measure (equation 6) of the video sequence $f_{l_x,l_t}(x,t)$. We regard the orientations of velocity and acceleration as important cues to the dynamic properties of different image sequences. Orientations of velocity and acceleration lie in the range $[-179, 180]$ degrees, so the total number of orientation levels is 360. In the velocity and acceleration fields, the orientation histograms are:

$h_l^v = [h_l^v(-179), \ldots, h_l^v(180)]$
$h_l^a = [h_l^a(-179), \ldots, h_l^a(180)]$

The subscript $l = (l_x, l_t)$ denotes the spatio-temporal resolution level. For simplicity, $x$ and $t$ represent discrete points sampled along the spatial and temporal axes respectively.

$h_l^v(n) = \dfrac{\sum_{t=1}^{T} \sum_{x \in d_{v,n}} \alpha[C_{l_x,l_t}(x,t)]\, \beta(\|v_{l_x,l_t}(x,t)\|)}{\sum_{t=1}^{T} \sum_{x \in D} \alpha[C_{l_x,l_t}(x,t)]\, \beta(\|v_{l_x,l_t}(x,t)\|)}$   (8)

$h_l^a(n) = \dfrac{\sum_{t=1}^{T} \sum_{x \in d_{a,n}} \alpha[C_{l_x,l_t}(x,t)]\, \beta(\|a_{l_x,l_t}(x,t)\|)}{\sum_{t=1}^{T} \sum_{x \in D} \alpha[C_{l_x,l_t}(x,t)]\, \beta(\|a_{l_x,l_t}(x,t)\|)}$   (9)

$-180 \le n < 180$. $D$ is the whole spatial region.
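Equations (8) and (9) are confidence- and magnitude-weighted orientation histograms; both can be computed by the same routine. A sketch, assuming the default weightings are illustrative ($\alpha$ here passes the confidence through, $\beta = \exp$ as in the experiments of Section 4):

```python
import numpy as np

def orientation_histogram(field, conf, alpha=lambda c: c, beta=np.exp):
    """Weighted orientation histogram of Eq. (8)/(9): 360 one-degree bins
    over [-180, 180), normalized to sum to 1. field: (..., 2) flow or
    acceleration vectors; conf: matching confidence values (Eq. (6))."""
    ang = np.degrees(np.arctan2(field[..., 1], field[..., 0]))  # (-180, 180]
    bins = np.clip(np.floor(ang).astype(int) + 180, 0, 359)     # bin index n + 180
    w = alpha(conf) * beta(np.linalg.norm(field, axis=-1))      # per-pixel weight
    h = np.bincount(bins.ravel(), weights=w.ravel(), minlength=360)
    return h / h.sum()
```

The denominator of (8)/(9) corresponds to the final normalization by the total weight over $D$ and all $T$ frames.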

Figure 4. First column: histograms of velocity; second column: histograms of acceleration. From top to bottom the spatio-temporal resolution of the corresponding video is reduced; all bins range over $[-179, 180]$ degrees.

The multiresolution histogram is a vector $Z = [Z_1, Z_2, \ldots, Z_m]$, where

$Z_i = [h_{l(i)}^v, h_{l(i)}^a] = [h_{l(i)}^v(-179), \ldots, h_{l(i)}^v(180), h_{l(i)}^a(-179), \ldots, h_{l(i)}^a(180)]$

$h_{l(i)}^v$ is a histogram of the velocity field ($h_{l(i)}^a$, of the acceleration field); the resolution of the corresponding video sequence is $(l_x(i), l_t(i))$, $i = 1, 2, \ldots, m$, and the total number of resolution levels is $m$. $T$ is the length of the sequence. Figure 4 shows the histograms $h^v(g_l * f)$ and $h^a(g_l * f)$ at three different spatio-temporal resolutions. To compare different sequences, various histogram similarity measures could be used; we adopt the measurement defined in [29]. Let $H_a$ and $H_b$ denote two histograms with $B$ bins, and


$C_1 = \sqrt{N_{H_b}/N_{H_a}}$, $C_2 = \sqrt{N_{H_a}/N_{H_b}} = 1/C_1$, where $N_{H_a} = \sum_{j=1}^{B} H_a(j)$ and $N_{H_b} = \sum_{j=1}^{B} H_b(j)$.

The L2 distance measure is proportional to the accumulated squared bin differences:

$d_{L2}(H_a, H_b) = \sum_{j=1}^{B} \left[C_1 H_a(j) - C_2 H_b(j)\right]^2$

$d_{L2}$ is normalized by $2 N_{H_a} N_{H_b}$ to the range $[0, 1]$.

In this paper we adopt the L2 distance measure to compute the difference between two histogram vectors.
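The normalized L2 measure can be sketched as follows. Because the source equations are garbled, the square-root form of $C_1, C_2$ is our reading (it is the scaling under which the normalization by $2 N_{H_a} N_{H_b}$ bounds the distance by 1), so treat it as an assumption:

```python
import numpy as np

def hist_l2_distance(ha, hb):
    """Normalized L2 distance between two histogram vectors, after [29]:
    C1, C2 equalize total mass, and dividing by 2*N_a*N_b maps the result
    into [0, 1] (0 for identical shapes, 1 for disjoint histograms)."""
    ha = np.asarray(ha, dtype=float)
    hb = np.asarray(hb, dtype=float)
    na, nb = ha.sum(), hb.sum()
    c1 = np.sqrt(nb / na)
    c2 = 1.0 / c1
    d = np.sum((c1 * ha - c2 * hb) ** 2)
    return d / (2.0 * na * nb)
```

For two completely disjoint one-bin histograms the measure attains 1 regardless of their total masses, which is the property the normalization is meant to provide.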

4. Experiments

Figure 5. Video sequences: a: fountain, b: fountain1, c: fire, d: tree, e: water, f: water1, g: water2.

The matching performance of the spatio-temporal multiresolution histogram was tested on seven real sequences: fire, tree, fountain, fountain1, water, water1 and water2 (figure 5), which include similar sequences with different dynamics, such as the three water sequences and the two fountain sequences. In equations (8) and (9) the weighting functions are $\alpha(\cdot) = 1$ and $\beta(\cdot) = \exp(\cdot)$; all sequences are $320 \times 280 \times 64$; the total number of resolution levels $m$ is 4; the deviations of the spatio-temporal Gaussian filter $g_{l_x,l_t}(x,t)$ are $\sigma_x = 1.5$ in the spatial axes and $\sigma_t = 1.0$ in the temporal axis; $l_x(i) = (1.7)^{i-1}$ and $l_t(i) = (1.4)^{i-1}$; the subsampling factors are 2 in the spatial axes and 1.5 in the temporal axis. The matching result is shown in figure 6. The simulation results suggest significant separation among different categories of dynamic textures.
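The comparison behind figure 6 amounts to evaluating the chosen distance for every pair of sequence feature vectors; a generic sketch (function names are ours):

```python
import numpy as np

def pairwise_distances(features, dist):
    """Symmetric distance matrix between the multiresolution histogram
    vectors of several sequences, as used for the matching results."""
    n = len(features)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = dist(features[i], features[j])
    return D
```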

Figure 6. The differences between each pair of sequences. Water, water1 and water2 are similar sequences, so the distances among them are small, as are those between fountain and fountain1.

5. Conclusion and future work


The main contribution of this paper lies in the introduction of the notion of the spatio-temporal multiresolution histogram of dynamic texture and its use for recognition. The spatio-temporal multiresolution histograms (of velocity and acceleration orientation) can reliably capture and represent the motion properties of different image sequences, while retaining simplicity, efficiency and robustness. The confidence and magnitude of the velocity in the optical flow are taken into account. Future research should find a way to determine $(l_x(i), l_t(i))$, $i = 1, 2, \ldots, m$, adaptively according to the image sequence, and should devise more wide-ranging methods to obtain other descriptive features from the extended spatio-temporal textures. In this paper the


distance between histogram vectors is the L2 distance measure, which is a relatively crude form; the Tsallis generalized entropies and differences of histograms [12] may be adopted instead. In further research, spatial texture information should also be used, not only motion information.

6. References
[1] P. Saisan, G. Doretto, Y. N. Wu, and S. Soatto, in Proc. IEEE Computer Society Conf. on Computer Vision and Pattern Recognition (CVPR 2001), vol. 2, 2001, ISSN 1063-6919. [2] Chin-Hwee Peh and Loong-Fah Cheong, "Synergizing spatial and temporal texture," IEEE Trans. on Image Processing, 11(10), Oct. 2002, pp. 1179–1191. [3] R. C. Nelson and R. Polana, "Qualitative recognition of motion using temporal texture," CVGIP: Image Understanding, 56(1), 1992, pp. 78–89. [4] P. Bouthemy and R. Fablet, "Motion characterization from temporal cooccurrences of local motion-based measures for video indexing," in Proc. Int. Conf. Pattern Recognition, Brisbane, Australia, Aug. 1998, pp. 905–908. [5] Z. Bar-Joseph, R. El-Yaniv, D. Lischinski, and M. Werman, "Texture mixing and texture movie synthesis using statistical learning," IEEE Transactions on Visualization and Computer Graphics, 7(2), 2001, pp. 120–135. [6] Gianfranco Doretto, "Dynamic Texture Modeling," Master's thesis, University of California, Los Angeles, 2002. [7] E. Ardizzone and M. L. Cascia, "Video indexing using optical flow field," in Proc. IEEE Int. Conf. Image Processing, Lausanne, Switzerland, Sept. 1996, pp. 831–834. [8] Y. Deng and B. S. Manjunath, "Content-based search of video using color, texture and motion," in Proc. IEEE Int. Conf. Image Processing, vol. 2, Santa Barbara, CA, 1997, pp. 534–537. [9] A. K. Jain, A. Vailaya, and W. Xiong, "Query by video clip," Multimedia Systems, 7(5), 1999, pp. 369–384. [10] N. Vasconcelos and A. Lippman, "Toward semantically meaningful feature spaces for the characterization of video content," in Proc. IEEE Conf. Image Processing, Santa Barbara, CA, Oct. 1997, pp. 78–89. [11] V. Vinod, "Activity based video shot retrieval and ranking," in Proc. IEEE Conf. Pattern Recognition, Brisbane, Australia, 1998, pp. 682–684. [12] E. Hadjidemetriou, M. D. Grossberg, and S. K. Nayar,
"Multiresolution histograms and their use for recognition," IEEE Trans. on Pattern Analysis and Machine Intelligence, 26(7), July 2004, pp. 831–847. [13] B. Horn and B. Schunck, "Determining optical flow," Artificial Intelligence, vol. 17, nos. 1-3, 1981, pp. 185–203. [14] Haiying Liu, Rama Chellappa, and Azriel Rosenfeld, "Accurate dense optical flow estimation using adaptive structure tensors and a parametric model," IEEE Trans. on Image Processing, 12(10), Oct. 2003, pp. 1170–1180. [15] G. Farnebäck, "Very high accuracy velocity estimation using orientation tensors, parametric motion, and simultaneous segmentation of the motion field," in Proceedings of ICCV'01, vol. I, Vancouver, Canada, July 2001. [16] G. H. Granlund and H. Knutsson, Signal Processing for Computer Vision, Kluwer Academic Publishers, 1995.

[17] B. Funt and G. Finlayson, "Color constant color indexing," IEEE Trans. on Pattern Analysis and Machine Intelligence, 17(5), May 1995, pp. 522–529. [18] M. Stricker and M. Orengo, "Similarity of color images," in Proc. of SPIE Conference on Storage and Retrieval for Image and Video Databases III, vol. 2420, 1995, pp. 381–392. [19] J. Bach, C. Fuller, A. Gupta, A. Hampapur, B. Horowitz, R. Humphrey, R. Jain, and C. Shu, "The Virage image search engine: An open framework for image management," in SPIE Conference on Storage and Retrieval for Image and Video Databases IV, vol. 2670, 1996, pp. 76–87. [20] W. Niblack, "The QBIC project: Querying images by content using color, texture, and shape," in Proc. of SPIE Conference on Storage and Retrieval for Image and Video Databases, vol. 1908, 1993, pp. 173–187. [21] H. Zhang, A. Kankanhali, and S. Smoliar, "Automatic partitioning of full-motion video," Multimedia Systems, 1, 1993, pp. 10–28. [22] H. Zhang, C. Low, W. Smoliar, and J. Wu, "Video parsing, retrieval and browsing: An integrated and content-based solution," ACM Multimedia, 1995, pp. 15–24. [23] W. Hsu, T. Chua, and H. Pung, "An integrated color-spatial approach to content-based image retrieval," in Proc. of ACM Multimedia, 1995, pp. 305–313. [24] L. Griffin, "Scale-imprecision space," Image and Vision Computing, 15, 1997, pp. 369–398. [25] J. Koenderink and A. J. V. Doorn, "The structure of locally orderless images," International Journal of Computer Vision, 31(2-3), 1999, pp. 159–168. [26] T. Kadir and M. Brady, "Saliency, scale and image description," International Journal of Computer Vision, 45(2), 2001, pp. 83–105. [27] E. Hadjidemetriou, M. Grossberg, and S. Nayar, "Spatial information in multiresolution histograms," in Proc. of the Computer Vision and Pattern Recognition Conference, vol. 1, 2001, pp. 702–709. [28] J. Koenderink, "The structure of images," Biological Cybernetics, 50, 1984, pp. 363–370. [29] A. Müfit Ferman, A.
Murat Tekalp, and Rajiv Mehrotra, "Robust color histogram descriptors for video segment retrieval and identification," IEEE Trans. on Image Processing, 11, May 2002, pp. 479–508. [30] H. Haussecker and H. Spies, Handbook of Computer Vision and Applications, Volume 2: Signal Processing and Pattern Recognition. New York: Academic, ch. 13, 1999.
