VIDEO INDEXING USING OPTICAL FLOW FIELD

Edoardo Ardizzone and Marco La Cascia
Dipartimento di Ingegneria Elettrica, University of Palermo, Palermo, Italy
E-mail: {ardizzone, [email protected]}

ABSTRACT

The increasing development of advanced multimedia applications requires new technologies for organizing and retrieving databases of digital video by content. Several content-based features (color, texture, motion, etc.) are needed to perform reliable content-based retrieval. In this paper we present a method for automatic motion-based video indexing and retrieval. A prototype system has been developed to prove the validity of our approach. Our system automatically splits a video into a sequence of shots, extracts a few representative frames (called r-frames) from each shot, and computes some motion-based features related to the optical flow field. Motion-based queries are then performed either in a qualitative or a quantitative way. Results obtained with our system show that motion-based queries can play a central role in content-based video retrieval.

1. INTRODUCTION

Ideally, queries put to a video database should refer to the content of stored videos, and results should contain only the few videos matching the query. To this aim, image and image sequence contents must be described and adequately coded. In recent years a lot of work has been done on color, texture, structural and semantic indexing [1]. Motion-based video indexing has been addressed only by a few researchers. In this paper we present a completely automatic method to characterize videos based on motion; our approach is domain independent and relies on the descriptive power of the optical flow field. Our method is a low-level one and detects only automatically computable features; no qualitative features are detected. However, the user can specify some common queries (zoom, pan, . . . ) in a qualitative way. This work was partially supported by MURST 40% and 60%.

The rest of the paper is organized as follows. Section 2 addresses related work. Section 3 describes the proposed method for indexing and querying. Section 4 presents experimental results on a relatively large database. Finally, Section 5 contains some final considerations about our approach and its possible extensions.

2. RELATED WORK

In recent years several content-based retrieval systems [2, 3, 4, 5] have been developed. These systems differ in terms of the video features extracted, the degree of automation reached in feature extraction, and the level of domain independence. However, only a few researchers have addressed the task of indexing and retrieving videos by motion content. In the QBIC system [3], a content-based retrieval system treating both images and videos, videos are split into shots [6] (image sequences presenting continuous action which appears to come from a single operation of the camera); each shot is characterized via the camera motion and the object motion. Video queries are then specified in terms of either object motion or camera motion. From each shot one (or more) representative frame is also extracted and characterized using color, texture, etc. Akutsu et al. [7] proposed a similar method that works by first segmenting the video into shots. The shots are then indexed based on the type of camera motion. The dominant color in the shot is also used to index the video. The motion-based video indexing system proposed in the rest of this paper allows for more general queries, and all the motion features used are automatically extracted. This paper focuses on query by motion to prove the importance of motion in content-based retrieval. No color or texture information was used in the querying step, even though results obtained by integrating color, texture and motion would probably be better than those obtained using motion information alone.

3. THE PROPOSED METHOD

Our method is based on a prior video segmentation to extract shots. Several techniques exist to perform such a task [6, 8]; we used the neural-based method proposed by Ardizzone et al. [8]. This method is not able to detect shot boundaries in the presence of editing effects (fade, dissolve, etc.) but was chosen because it is very fast and reliable in the other cases. Once the shots are extracted, the system extracts the r-frames. The technique used is based on heuristics and is very effective despite its simplicity: if the shot is shorter than one second, only one frame (the middle one) is chosen as r-frame; if the shot is longer than one second, one frame per second is chosen as r-frame. A number of tests showed that such a simple technique is sufficient to completely describe a shot. The r-frames are characterized via their optical flow field; this computation involves the adjacent frames. To compute the flow field we used the Nagel [9] technique as implemented in [10]. Essentially this is a gradient-based technique that uses second-order derivatives to measure optical flow. The basic measurements are integrated using a global smoothness constraint. The problem is formulated as the minimization of the functional:

\iint \left[ (\nabla I^T \mathbf{v} + I_t)^2 + \frac{\alpha^2}{\|\nabla I\|^2 + 2\delta} \left( (u_x I_y - u_y I_x)^2 + (v_x I_y - v_y I_x)^2 + \delta (u_x^2 + u_y^2 + v_x^2 + v_y^2) \right) \right] dx\, dy \qquad (1)

where I(x, t) is the intensity of the pixel at location x = (x, y)^T and time t, v = (u, v)^T is the optical flow, I_t, I_x, I_y, u_x, u_y, v_x, v_y are the partial derivatives of I, u, v, ∇I = (I_x(x, t), I_y(x, t))^T, and α, δ are numerical constants. With the use of Gauss-Seidel iterations, the solution may be expressed in an iterative form. Details on this technique and its implementation can be found in [10]. In our implementation the flow field is then truncated and normalized (values of M(x, y) ≥ 10 are set to 10; then M(x, y) → M(x, y)/10, where M(x, y) is the flow vector magnitude expressed in pixels) so that all the flow vectors of all the flow fields lie in the range (0-1). This technique was preferred to others because it is a good compromise between computational cost and precision of the flow field, and it leads to a dense flow.
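As an illustration, the following C fragment sketches the truncation and normalization step just described; this is a minimal sketch under our own naming assumptions, not the original implementation. The phase of each vector is also stored, since it is needed for the direction histogram introduced below.

#include <math.h>

/* Truncate and normalize the flow field: clip magnitudes at 10 pixels,
   rescale to (0-1), and keep the phase of each flow vector.            */
void normalize_flow(const float *u, const float *v,
                    float *mag, float *phase, int npix)
{
    for (int i = 0; i < npix; i++) {
        float m = sqrtf(u[i] * u[i] + v[i] * v[i]); /* magnitude in pixels     */
        if (m >= 10.0f)
            m = 10.0f;                              /* values >= 10 set to 10  */
        mag[i]   = m / 10.0f;                       /* normalize to (0-1)      */
        phase[i] = atan2f(v[i], u[i]);              /* direction of the vector */
    }
}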

A general scheme of our approach is shown in Figure 1.

3.1. Motion based features

Once the optical flow field is computed in the above-mentioned manner, we need a technique to code its content so as to allow content-based queries. The first step of our method consists of spatially splitting the flow field into four equal regions; for each region we can then compute some motion-based features. The splitting was performed to preserve spatially related information that is not captured by the computed features. The features we used are the following:

\rho = \frac{1}{L_x L_y} \sum_{x=0}^{L_x-1} \sum_{y=0}^{L_y-1} M(x, y) \qquad (2)

H(\theta) = \frac{1}{L_x L_y} \sum_{x=0}^{L_x-1} \sum_{y=0}^{L_y-1} M(x, y)\, \Delta(\Phi(x, y), \theta, \delta\theta) \qquad (3)

where M(x, y) and Φ(x, y) are respectively the magnitude and the phase of the normalized flow field vector at (x, y), L_x and L_y are respectively the frame width and height expressed in pixels, and Δ is the function:

\Delta(\phi, \theta, \delta\theta) = \begin{cases} 1 & \text{when } \left(\theta - \frac{\delta\theta}{2}\right) \le \phi < \left(\theta + \frac{\delta\theta}{2}\right) \\ 0 & \text{otherwise} \end{cases} \qquad (4)

In our implementation we used δθ = 10°.
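To make the feature computation concrete, here is a small C sketch of equations (2)-(4) for a single quadrant. The function name, the 36-bin histogram layout (one 10° bin per value of θ), and the assumption that the phase is given in degrees in (−180°, 180°) are ours, not part of the original implementation.

#define NBINS 36   /* 360 / delta_theta with delta_theta = 10 degrees */

/* Average magnitude (eq. 2) and magnitude-weighted direction histogram
   (eq. 3) of one quadrant of size lx x ly; M holds the normalized
   magnitudes, Phi the phases in degrees (-180 .. 180).                 */
void quadrant_features(const float *M, const float *Phi, int lx, int ly,
                       float *rho, float H[NBINS])
{
    int npix = lx * ly;

    *rho = 0.0f;
    for (int b = 0; b < NBINS; b++)
        H[b] = 0.0f;

    for (int i = 0; i < npix; i++) {
        *rho += M[i];
        int b = (int)((Phi[i] + 180.0f) / 10.0f);   /* Delta() of eq. (4) as binning */
        if (b >= NBINS)
            b = NBINS - 1;                          /* guard the case Phi == 180     */
        H[b] += M[i];
    }

    *rho /= npix;                                   /* eq. (2) */
    for (int b = 0; b < NBINS; b++)
        H[b] /= npix;                               /* eq. (3) */
}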

3.2. Motion based querying

Once features are computed and stored, the user can specify queries based on the motion content of the video. Queries may be qualitative or quantitative. Quantitative queries are directly based on the low-level features computed as above. For each quadrant the user can specify the motion intensity ρ_i and the dominant direction θ_i. The user can also leave the motion parameters unspecified in one or more quadrants; in this case the query is performed using only the information of the quadrants whose motion parameters were specified. Motion intensity should be specified by the user in the range (0-1), and the dominant direction should be in the range (−180°, 180°). Queries are performed by first selecting the 4n best matching r-frames, where n is the number of r-frames the user wants returned as result, using as similarity function the sum of the absolute differences between the ρ_i specified by the user and the corresponding values of the considered r-frame.
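This first selection step can be sketched as follows (illustrative C code under our own data-structure assumptions): the distance between the query and an r-frame is the sum of absolute differences of the motion intensities ρ_i, restricted to the quadrants the user actually specified, and the 4n r-frames with the smallest distance are kept.

#include <math.h>

typedef struct {
    float rho[4];        /* requested motion intensity per quadrant, in (0-1)  */
    float theta[4];      /* requested dominant direction per quadrant, degrees */
    int   specified[4];  /* 1 if the user gave values for this quadrant        */
} MotionQuery;

/* Sum of absolute intensity differences over the specified quadrants. */
float intensity_distance(const MotionQuery *q, const float rframe_rho[4])
{
    float d = 0.0f;
    for (int i = 0; i < 4; i++)
        if (q->specified[i])
            d += fabsf(q->rho[i] - rframe_rho[i]);
    return d;   /* the 4n r-frames with the smallest d are kept */
}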

Among these 4n r-frames, the n that best match the user query are then selected using the following similarity function:

S_k = \sum_i \sum_{\theta = \theta_i - W}^{\theta_i + W} A_i\, e^{-B \left( \frac{\theta - \theta_i}{W} \right)^2} H_i(\theta), \qquad k = 1, 2, \ldots, m \qquad (5)

where the first sum is over all the quadrants whose motion parameters were specified by the user, the θ_i's and ρ_i's are the user-specified values, H_i is the periodic extension of the direction histogram (3) of the corresponding quadrant of the k-th r-frame, B is a constant, and A_i satisfies the relation:

\sum_{\theta = \theta_i - W}^{\theta_i + W} A_i\, e^{-B \left( \frac{\theta - \theta_i}{W} \right)^2} = \rho_i \qquad (6)

In our implementation we used W = 20° and B = 1.5. To specify a qualitative query the user has to specify the type and the intensity of the motion. The type of motion is one of the following: ZOOM-IN, ZOOM-OUT, PAN-LEFT, PAN-RIGHT, UP, DOWN. When performing the query, our system converts the qualitative label into a set of four directions θ_i, one per quadrant. In particular, we used the conversion shown in Table 1. The query is then processed as a quantitative query. This kind of conversion, despite its simplicity, seems to be very effective.
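A possible implementation of the second-pass score of equations (5) and (6) is sketched below in illustrative C, not the authors' code: the Gaussian window of half-width W = 20° around each requested direction θ_i is scaled by A_i so that its total weight equals ρ_i (eq. 6) and is then correlated with the periodic direction histogram of the r-frame (eq. 5). We assume the sum over θ runs at the 10° resolution of the histogram bins.

#include <math.h>

#define NBINS 36        /* 10-degree bins covering -180 .. 180 degrees */
#define W     20.0f     /* window half-width, degrees                  */
#define B     1.5f      /* Gaussian shape constant                     */

/* Periodic lookup of a 10-degree-bin histogram at angle theta (degrees). */
static float hist_at(const float H[NBINS], float theta)
{
    while (theta < -180.0f) theta += 360.0f;
    while (theta >= 180.0f) theta -= 360.0f;
    int b = (int)((theta + 180.0f) / 10.0f);
    if (b >= NBINS) b = NBINS - 1;
    return H[b];
}

/* Score S_k of one r-frame: H[i] is the direction histogram of quadrant i;
   spec[i], theta[i] and rho[i] describe the (possibly partial) query.     */
float rframe_score(const float H[4][NBINS], const int spec[4],
                   const float theta[4], const float rho[4])
{
    const int nsteps = (int)(W / 10.0f);    /* window sampled every 10 degrees */
    float S = 0.0f;

    for (int i = 0; i < 4; i++) {
        if (!spec[i])
            continue;

        /* eq. (6): choose A_i so that the window weights sum to rho_i. */
        float wsum = 0.0f;
        for (int k = -nsteps; k <= nsteps; k++) {
            float d = (10.0f * k) / W;
            wsum += expf(-B * d * d);
        }
        float A = rho[i] / wsum;

        /* eq. (5): weighted correlation with the periodic histogram. */
        for (int k = -nsteps; k <= nsteps; k++) {
            float d = (10.0f * k) / W;
            S += A * expf(-B * d * d) * hist_at(H[i], theta[i] + 10.0f * k);
        }
    }
    return S;   /* the n r-frames with the largest S are returned */
}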

4. EXPERIMENTAL RESULTS

In the following, a few sample queries are reported that show the validity of our approach. These queries were performed on our WWW demo database, which contains about one hundred r-frames acquired from TV. In Figures 2 and 3 the results of queries by motion intensity are shown; note how motion intensity can discriminate video sequences. In Figure 4 the results of a qualitative query (pan-left with high motion intensity) are shown. Queries based on qualitative camera motion and quantitative motion intensity lead to a simple way of querying a video database. Experimental results showed that motion is a good starting point for querying by video content. In all the tests we performed, the results were consistent.

5. CONCLUSIONS

In this paper we proposed a method that allows the user to perform effective motion-based querying by video content.

Our method uses only motion information for describing and searching video content. Integrating this method with color or texture features may lead to the realization of an effective automatic content-based video retrieval system. Better results may certainly be obtained by also using object motion descriptors. However, to perform such a description a segmentation step is needed to locate the moving objects and discriminate them from the background, and robust automatic segmentation has proved to be a very difficult task even on static images (in a general-purpose environment). Other areas of future work include optimized database access and a more powerful and friendly user interface. Our method was developed and tested in C on a DEC AXP 3000 workstation; the videos were acquired via a Power Macintosh 7100/66AV. A WWW online demo is available on the Internet at the URL: http://wwwcsai.diepa.unipa.it/research

6. REFERENCES

[1] V.N. Gudivada and V.V. Raghavan. Content-based image retrieval systems. IEEE Computer, Sept. 1995.
[2] M. La Cascia and E. Ardizzone. JACOB: Just a content-based query system for video databases. In Proc. of ICASSP, Atlanta, May 7-10, 1996.
[3] M. Flickner et al. Query by image and video content: the QBIC system. IEEE Computer, Sept. 1995.
[4] V.E. Ogle and M. Stonebraker. Chabot: Retrieval from a relational database of images. IEEE Computer, Sept. 1995.
[5] A. Pentland, R.W. Picard, and S. Sclaroff. Photobook: Tools for content-based manipulation of image databases. In Proc. of SPIE: Storage and Retrieval for Image and Video Databases II, San Jose, February 6-10, 1994.
[6] A. Hampapur, T. Weymouth, and R. Jain. Digital video segmentation. In ACM Multimedia '94 Proceedings, ACM Press, 1994.
[7] A. Akutsu, Y. Tonomura, H. Hashimoto, and Y. Ohba. Video indexing using motion vectors. In Proc. of SPIE: Visual Communication and Image Processing, 1992.
[8] E. Ardizzone, G.A.M. Gioiello, M. La Cascia, and D. Molinelli. A real-time neural approach to scene cut detection. In Proc. of IS&T/SPIE Storage & Retrieval for Image and Video Databases IV, San Jose, January 28 - February 2, 1996.
[9] H.H. Nagel. On the estimation of optical flow: relations between different approaches and some new results. Artificial Intelligence, 33, 1987.
[10] J.L. Barron, D.J. Fleet, and S.S. Beauchemin. Performance of optical flow techniques. Int. Journal of Computer Vision, 12:43, 1994.

Motion type   θ_0     θ_1     θ_2     θ_3
ZOOM-IN       135°    45°     −135°   −45°
ZOOM-OUT      −135°   −45°    135°    45°
PAN-LEFT      0°      0°      0°      0°
PAN-RIGHT     180°    180°    180°    180°
UP            −90°    −90°    −90°    −90°
DOWN          90°     90°     90°     90°

Table 1: Qualitative-to-quantitative conversion table.

Figure 1: General scheme of our approach to automatic video indexing (block diagram: video → video segmentation → shots → r-frames extraction → r-frames → optical flow computation → motion-features extraction → motion-based features).

Figure 2: Two best matching r-frames for a query on "low motion". The r-frames are reported with the adjacent frame to show the instantaneous motion.

Figure 3: Two best matching r-frames for a query on "high motion". The r-frames are reported with the adjacent frame to show the instantaneous motion.

Figure 4: Two best matching r-frames for the query "PAN-LEFT with high motion intensity". The r-frames are reported with the adjacent frame to show the instantaneous motion.