National Yunlin University of Science and Technology
Graduate School of Computer Science and Information Engineering

Master's Thesis

Semantic Scene Detection System for Baseball Videos Based on the MPEG-7 Specification
(架構在 MPEG-7 規格下之棒球運動影片場景偵測系統)

Student: Lien-Hung Tung
Advisor: Dr. Yin-Fu Huang

A Thesis Submitted to the Graduate School of Computer Science and Information Engineering, National Yunlin University of Science and Technology, in Partial Fulfillment of the Requirements for the Degree of Master of Science in Information Engineering

June 2009
Douliu, Yunlin, Taiwan, Republic of China

Semantic Scene Detection System for Baseball Videos Based on the MPEG-7 Specification

Student: Lien-Hung Tung
Advisor: Dr. Yin-Fu Huang

Graduate School of Computer Science and Information Engineering, National Yunlin University of Science and Technology

Abstract (Chinese)

In this thesis, we propose a content-based multimedia analysis/retrieval system based mainly on the MPEG-7 specification. The system performs high-level analysis of video content, such as detecting semantically meaningful scenes. Here, we predefine eight semantic scene types for baseball videos. First, a shot boundary detection method based on the scalable color histogram is used to effectively segment a video into shots. Next, various visual features, including field color features, skin color features, and camera motion information, are extracted from the video and used to analyze the semantics of each shot. According to the visual characteristics of different scenes, we design a two-stage classification strategy for scene detection. Finally, experimental results show that the proposed framework achieves a precision of 81% and a recall of 84% in identifying these eight semantic scene types.

Keywords: content-based video retrieval, baseball videos, semantic scene detection, support vector machine, MPEG-7


Semantic Scene Detection System for Baseball Videos Based on the MPEG-7 Specification

Student: Lien-Hung Tung

Advisor: Dr. Yin-Fu Huang

Graduate School of Computer Science and Information Engineering, National Yunlin University of Science and Technology

Abstract

In this thesis, we proposed a content-based multimedia analysis/retrieval system based mainly on the MPEG-7 specification, capable of high-level content analysis such as semantic scene detection for baseball broadcast videos. Here, eight semantic scene classes were predefined for baseball videos. First, an effective shot boundary detection scheme based on the scalable color histogram was proposed to segment a video into shots. Then, various visual features, including field color features, skin color features, and camera motion information, were extracted to analyze the semantics of each shot. According to the visual properties of different scenes, we developed a two-stage classification strategy for the semantic scene detection. Finally, the experimental results showed that the proposed framework identifies the eight semantic baseball scenes with a precision of 81% and a recall of 84%.

Keywords: CBVR, baseball broadcast video, semantic scene detection, SVM, MPEG-7






Acknowledgements

For my master's career at National Yunlin University of Science and Technology, I first and most sincerely thank my advisor, Professor Yin-Fu Huang, who constantly offered ideas and guidance in my research, instilled in me the proactive attitude and methods that research requires, and provided us with a laboratory where we could learn and grow freely; I have benefited greatly from these years. Next, I thank Professor Chu-Hsing Lin (林祝興) of the Graduate Institute of Computer Science, Tunghai University, and Professor Kuen-Fang Jea (賈坤芳) of the Graduate Institute of Computer Science, National Chung Hsing University, for taking time out of their busy schedules to attend my oral examination and to offer corrections and many suggestions on this thesis.

During my graduate studies, I also thank my labmates; through our exchanges I learned much more, and we could discuss problems together whenever they arose. To everyone who has helped me or cared about me, I offer my heartfelt thanks.

In addition, I thank my family for their meticulous care and concern, and for their financial and emotional support, which allowed me to concentrate on my research; I wish to share this honor and joy with you.

Finally, I thank my girlfriend, who, during the most stressful year of my master's program, quietly supported and accompanied me, gave me suggestions on my research, and brought me countless laughs and joy.


Contents
Abstract (Chinese)
Abstract (English)
Acknowledgements
Contents
List of Tables
List of Figures
1 Introduction
2 System Overviews
2.1 System Architecture
2.2 MPEG-7 Specification
2.3 Gaussian Mixture Model (GMM)
2.4 Support Vector Machine (SVM)
3 Video Analysis
3.1 Shot Boundary Detection
3.2 Motion Feature Extraction
3.2.1 Motion Vector Estimation
3.2.2 Camera Motion Classification
3.3 Key-frame Extraction
4 Visual Feature Extraction
4.1 Field Color Feature Extraction
4.2 Skin Color Feature Extraction
5 Semantic Scene Detection
5.1 Pre-classification by SVMs
5.2 Post-classification for Semantic Scenes
6 MPEG-7 Annotation and Video Content Access
7 Experimental Results
7.1 Shot Boundary Detection
7.2 Pre-classification by SVMs
7.3 Post-classification
8 Conclusions
References


List of Tables
Table 1 Structure of the MPEG-7 standard
Table 2 Eight orientations of a background sub-MVF
Table 3 Orientation sequences of camera motion templates
Table 4 Performances on shot boundary detection
Table 5 Performances on classification stage 1
Table 6 Performances on semantic scene detection


List of Figures
Fig. 1 Example of baseball game progress
Fig. 2 System architecture
Fig. 3 SVM hyper-plane
Fig. 4 HSV color space with the 16x4x4 quantization
Fig. 5 Block matching technique
Fig. 6 Search window
Fig. 7 Example of MVFs and sub-MVFs
Fig. 8 Camera motion templates
Fig. 9 Procedure of field color feature extraction
Fig. 10 Example of field color features in different scene types
Fig. 11 Field feature histograms extracted from a pitching scene
Fig. 12 Example of skin color features in different scene types
Fig. 13 Example of skin region detection
Fig. 14 Decision tree of the two-stage classification
Fig. 15 Architecture of the semantic analysis/retrieval
Fig. 16 Interface of the video analysis: (a) main window, (b) control panel
Fig. 17 Interface of the content-based multimedia retrieval system


1 Introduction

Over the past decade, multimedia technologies have developed rapidly. With the increasing amount of audio-visual information, people prefer to actively access the information they are interested in. Therefore, video indexing and retrieval are strongly required for video searching and summarization. Such indexes are typically constructed by human experts who manually assign a limited number of keywords to videos, but this is a costly and time-consuming task. Hence, content-based video retrieval (CBVR) has been studied extensively to manipulate videos efficiently based on their contents.

Recently, much research has been devoted to the content analysis of various kinds of sports videos such as soccer, baseball, and tennis [6, 27, 28]. To understand video contents, a system has to derive the semantics from video data using techniques from artificial intelligence, computer vision, and pattern recognition. With such processing of sports videos, people can search and retrieve interesting contents conveniently and effectively within a huge collection of videos. Motivated by these observations, we applied the concepts of CBVR to baseball broadcast videos and developed an efficient shot-based baseball scene detection system.

In semantic scene detection [8, 10, 22, 25], various kinds of video scenes can be identified by extracting the semantics of successively segmented shots. However, the major challenge of video content analysis is how to bridge the gap between low-level features and high-level semantics. In [25], the authors integrated segmentation and classification approaches to label soccer videos into semantic units. In [8], baseball scenes were classified using the maximum entropy method, in which image, audio, and text features were utilized. In [10], the authors proposed a new scene classification technique for baseball videos based on spatial and temporal features. In [22], a shot-based baseball scene classification was proposed in which four scenes were detected using low-level features. In this thesis, we extended and improved the methods in [10, 22] by further considering mid-level features for more scene classes. The extracted mid-level features, such as field color features, skin color features, and camera motion information, are expected to act as an effective link between low-level video processing and high-level video content analysis. Furthermore, a semantic scene classifier was designed based on a two-stage classification. In stage 1, we roughly classified shots into several basic types using field color features with SVMs. Then, in stage 2, the mid-level features were utilized as the criteria to classify them further. Finally, we integrated the video contents with the MPEG-7 specification, i.e., a set of independent or related descriptors (D) and description schemes (DS). MPEG-7 was proposed to standardize media access methods; it provides automatic techniques to access a video based on its contents through video indexing and retrieval, summarization, and understanding.

The remainder of the thesis is organized as follows. In Section 2, we give a brief overview of the proposed semantic scene detection system. Section 3 presents the video analysis and feature extraction techniques, including the shot boundary detection, camera motion extraction, and key-frame extraction schemes. Then, based on each extracted key-frame, Section 4 introduces the extraction techniques for mid-level features such as field color features and skin color features. In Section 5, an effective two-stage classification strategy is proposed to detect and classify semantic scenes in baseball videos. In Section 6, we present a content-based multimedia analysis/retrieval system to characterize baseball videos by integrating the MPEG-7 descriptions. The experimental results showing the accuracy of the shot boundary detection and scene classification are presented in Section 7. Finally, we draw conclusions in Section 8.

2 System Overviews

In this section, we describe the architecture of the MPEG-7 compliant baseball scene detection system, the MPEG-7 specification, the Gaussian Mixture Model (GMM), and the Support Vector Machine (SVM).

2.1 System Architecture

An important observation about baseball sports videos is that they feature a well-defined structure. A video is composed of many successive scenes, each of which can be represented by a shot. Fig. 1 illustrates the structure of a baseball game progress. Baseball scenes such as pitching, field, player, and close-up can thus be easily detected using a shot-based method.

Fig. 1 Example of baseball game progress

Besides, since a baseball video is long, annotating its contents benefits from segmenting the video into smaller units. Although the growing amount of video data is increasingly difficult to handle, employing the MPEG-7 specification is a feasible approach to describing and managing multimedia data [16, 18]. Here, we propose a content-based multimedia analysis/retrieval system in order to characterize video sequences completely for video indexing and retrieval. The architecture and the corresponding modules of the proposed system are illustrated in Fig. 2.


Fig. 2 System architecture

The system consists of three parts: 1) video analysis and low-level feature extraction, 2) mid-level feature extraction and semantic scene detection, and 3) MPEG-7 annotation and video content access. First, a video is segmented into shots. In the shot detection, scalable colors (i.e., an MPEG-7 defined feature) are computed and saved as the low-level feature used in the histogram-based boundary detection. Furthermore, for each segmented shot, the motion feature is also extracted to acquire the camera motion information. According to the camera motion information, key-frames are selected for each video shot using a key-frame extraction strategy. Second, several visual features within the selected key-frames, such as the field color distribution, skin color distribution, and camera motion information, are extracted using a predefined color model trained with a Gaussian Mixture Model (GMM). The extracted visual features are utilized for semantic scene detection. In total, eight semantic scenes in a baseball video are defined and classified, namely "pitching", "infield-hitting", "outfield-hitting", "fielding", "player", "running", "close-up", and "others". Based on the visual properties of different scenes, we propose a two-stage classification strategy for the semantic scene detection. According to the field color distribution of key-frames, we first roughly classify the semantic scenes of shots by applying a one-against-one SVM. Then, the skin color distribution and the camera motion information are used to further classify the remaining semantic scenes. Finally, we integrate the MPEG-7 description and the video content for video indexing and retrieval.

2.2 MPEG-7 Specification

MPEG-7 [11-14], proposed by the Moving Picture Experts Group (MPEG), offers an essential solution to the challenges of video content management. It provides a rich set of tools for describing multimedia data and defines a standard way to do so. MPEG-7 targets applications in which content can be stored (online or offline) or streamed, and it operates in both real-time and non-real-time environments. Besides, MPEG-7 is called a standard specification since it provides a generic framework facilitating the exchange and reuse of multimedia contents across different application domains. The MPEG-7 standard is composed of seven main parts, as described in Table 1. In MPEG-7, "data" means multimedia information, regardless of storage and display, and a "feature" is a distinctive characteristic of the data that signifies something to somebody. In general, the description tools defined in MPEG-7 include three major elements: Descriptors, Description Schemes, and the Description Definition Language.
1. Descriptors: a representation of a feature. A descriptor defines the syntax and the semantics of the feature representation.
2. Description Schemes: a description scheme specifies the structure and semantics of the relationships between its components, which may be both descriptors and description schemes.
3. Description Definition Language (DDL): a language that allows the creation of new description schemes and, possibly, new descriptors. The extension and modification of existing description schemes are also allowed.


Table 1 Structure of the MPEG-7 standard

Part-1 (Systems): specifies the tools for preparing descriptions for efficient transport and storage, compressing descriptions, and allowing synchronization between contents and descriptions.
Part-2 (Description Definition Language): specifies the language for defining the standard set of description tools (DSs, Ds, and data types) and for defining new description tools.
Part-3 (Visual): specifies the description tools pertaining to visual contents.
Part-4 (Audio): specifies the description tools pertaining to audio contents.
Part-5 (Generic Entities and Multimedia Description Schemes, MDS): specifies the generic description tools pertaining to multimedia, including audio and visual contents.
Part-6 (Reference Software): provides a software implementation of the standard.
Part-7 (Conformance Testing): specifies the guidelines and procedures for testing conformance of implementations of the standard.

The visual description tools are defined by their syntax in the DDL and the semantics associated with the syntactic elements. They enable descriptions of visual features of visual material such as color, texture, shape, motion, and faces. However, not all visual descriptors were utilized in our system; only the Scalable Color Descriptor (SCD) in the color category and the Camera Motion Descriptor (CMD) in the motion category were employed. A detailed explanation of all visual descriptors can be found in the standard document [13]. Here, only the descriptors we use are explained as follows.


1. Scalable Color Descriptor: specifies a color histogram in the HSV color space, uniformly quantized into 256 bins, so that image or video matching can be easily achieved using a histogram-based matching approach.
2. Camera Motion Descriptor: supports the following well-known basic camera operations: still, panning (horizontal rotation), tilting (vertical rotation), and zooming (change of the focal length).

2.3 Gaussian Mixture Model (GMM)

The Gaussian mixture model (GMM) is an efficient method for precisely describing sample clustering in a feature space. Through training, the proper parameters can be obtained to fit the target statistics, and a GMM can approximate an arbitrary smooth density distribution. Mixture models are a type of probability density model comprising a number of component functions, usually Gaussians. A mixture of N Gaussians is expressed as:

$$P(x \mid \mu, \Sigma) = \sum_{i=1}^{N} \alpha_i\, G(x; \mu_i, \Sigma_i) = \sum_{i=1}^{N} \alpha_i \cdot \frac{1}{\sqrt{(2\pi)^d \lvert \Sigma_i \rvert}} \exp\!\left[-\frac{1}{2}(x-\mu_i)^{T} \Sigma_i^{-1} (x-\mu_i)\right]$$

where $d$ is the dimension of the input feature vector $x$, $\alpha_i$ is the mixing parameter satisfying $\sum_{i=1}^{N} \alpha_i = 1$, and $G(x; \mu_i, \Sigma_i)$ is the probability density function of the i-th Gaussian component. The GMM contains the following adjustable parameters: $\alpha_i$, $\mu_i$, and $\Sigma_i$ $(i = 1, \ldots, N)$. The usual method of parameter estimation in a GMM is the Expectation-Maximization (EM) algorithm [3, 4], a well-established maximum likelihood algorithm for fitting a mixture model to a set of training data. However, EM requires the number of components N to be selected a priori.
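As a concrete illustration, the sketch below fits a GMM with EM to labeled color samples and thresholds the resulting likelihood, roughly the way a predefined color model such as the one in Section 4 could be trained. This is a minimal sketch, not the thesis implementation: the placeholder data, the component count N = 4, and the decision threshold are all assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder data standing in for hand-labeled (H, S, V) pixel samples;
# a real color model would be fit on pixels from labeled field regions.
rng = np.random.default_rng(0)
color_samples = rng.random((5000, 3))

# Fit an N-component GMM with EM; note that N must be chosen a priori.
gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)
gmm.fit(color_samples)

# score_samples returns log P(x | mu, Sigma); thresholding it decides
# whether a pixel belongs to the modeled color class (threshold assumed).
log_p = gmm.score_samples(rng.random((10, 3)))
belongs_to_class = log_p > -3.0
```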


2.4 Support Vector Machine (SVM)

Support Vector Machines (SVMs) were first investigated by Vapnik in 1992 [1]. An SVM solves linearly inseparable problems by non-linearly mapping vectors from a low-dimensional space into a higher-dimensional feature space and constructing an optimal hyper-plane in that space. Therefore, SVMs achieve high performance in data classification. A classification task usually involves training and testing data consisting of data instances. Each instance in the training set contains one "target value" (i.e., a class label) and several "attributes" (i.e., features). The goal of SVM is to produce a model which can predict the target values of the testing instances given only their attributes. The concept of SVM can be briefly described as follows. Given a training set of instance-label pairs $(x_i, y_i)$, $i = 1, \ldots, l$, where $x_i \in R^n$ and $y_i \in \{1, -1\}$, SVM requires the solution of the following optimization problem:

$$\min_{w,b,\xi}\ \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad y_i (w^T \phi(x_i) + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0$$

Here, the training vectors $x_i$ are mapped into a higher-dimensional space by the function $\phi$. Then, SVM finds a linear separating hyper-plane with the maximal margin in that space, as illustrated in Fig. 3.

Fig. 3 SVM hyper-plane


For the optimization problem, $C > 0$ is the penalty parameter of the error term. Furthermore, $K(x_i, x_j) \equiv \phi(x_i)^T \phi(x_j)$ is called the kernel function. The most important consideration in employing SVM is selecting a suitable kernel function. In most applications, basic kernel functions fall into four categories:

1. Linear: $K(x_i, x_j) = x_i^T x_j$
2. Polynomial: $K(x_i, x_j) = (\gamma x_i^T x_j + r)^d$, $\gamma > 0$
3. Radial Basis Function (RBF): $K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$, $\gamma > 0$
4. Sigmoid: $K(x_i, x_j) = \tanh(\gamma x_i^T x_j + r)$

Here, $\gamma$, $r$, and $d$ are kernel parameters.
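To make the later pre-classification step (Section 5.1) concrete, the sketch below trains an RBF-kernel SVM on feature vectors; scikit-learn's SVC decomposes multi-class problems into one-against-one binary classifiers internally, matching the strategy named in the thesis. The feature dimensionality, class count, and hyper-parameters are illustrative assumptions, not values from the thesis.

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder data standing in for field-color feature vectors extracted
# from key-frames, with coarse shot-type labels (dimensions assumed).
rng = np.random.default_rng(0)
X_train = rng.random((200, 48))
y_train = rng.integers(0, 4, size=200)

# RBF-kernel SVM; for more than two classes, SVC trains one classifier
# per class pair (one-against-one) and votes among them.
clf = SVC(kernel="rbf", C=1.0, gamma="scale", decision_function_shape="ovo")
clf.fit(X_train, y_train)

coarse_label = clf.predict(rng.random((1, 48)))
```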

3 Video Analysis

Generally, a video is composed of a series of shots, and a shot is composed of many consecutive frames. First, we apply a shot boundary detection scheme to segment a video into separate shots, which are the basic units for the subsequent operations. Then, we execute a camera motion classification scheme after extracting the motion information from each shot. Finally, also based on the extracted motion information, a key-frame extraction strategy is proposed. In this section, we elaborate on the feature extraction and video analysis techniques, including the shot boundary detection, the camera motion classification, and the key-frame extraction.

3.1 Shot Boundary Detection

A shot is a sequence of frames generated during a continuous camera operation, and it represents a continuous action in time and space. In general, there are two types of transitions defining a boundary between shots: abrupt transitions, also referred to as cuts, and gradual transitions such as fades and dissolves. Nevertheless, most transitions in a sports video are abrupt. Therefore, we focus only on abrupt shot boundary detection in the thesis. In the literature, plenty of research has been done on shot boundary detection, among which the histogram-based approach is the most popular since it achieves satisfactory performance [7, 9, 18, 31]. The histogram-based approach compares the color/intensity histograms of two consecutive frames and identifies a shot boundary if their difference exceeds a certain threshold. In our approach, we extend this method to detect shot boundaries using color histograms certified by the MPEG-7 standard [13]. First, a video is transformed into many frames which can be viewed as still images, and one of the MPEG-7 color descriptors (i.e., the Scalable Color Descriptor) is used as a low-level feature. Then, the abrupt boundary detection is done simply by computing the distance between the two sets of features extracted from adjacent frames. As mentioned above, the SCD is a color histogram in the HSV color space. Hence, we first normalize the RGB color image to the range [0, 1], and then transform it from RGB to HSV. The transformation formulas are expressed briefly as follows:

$$H = \cos^{-1}\left\{\frac{\frac{1}{2}\left[(R-G)+(R-B)\right]}{\sqrt{(R-G)^2+(R-B)(G-B)}}\right\}$$

$$S = \frac{\max(R,G,B)-\min(R,G,B)}{\max(R,G,B)}$$

$$V = \frac{\max(R,G,B)}{255}$$

After the transformation, the Hue component takes values in [0°, 360°], the Saturation component takes values in [0, 1], and the Value component also takes values in [0, 1]. According to the tables provided in the MPEG-7 normative part, the triple of color components (H, S, V) is uniformly quantized with 16 bins in H and 4 bins each in S and V (i.e., 256 bins in total). In other words, H is divided equally into 16 parts, S into 4 parts, and V into 4 parts as well. Fig. 4 shows the quantization of the HSV color space. The number of bins is in a scalable representation, and possible values are 16, 32, 64, 128, and 256.

Fig. 4 HSV color space with the 16x4x4 quantization

Therefore, the normalized color histogram in bin $c$ of frame $f_i$ is represented as

$$Norm(Hist_i(c)) = Hist_i(c) \cdot \frac{1}{\sum_{j=1}^{256} Hist_i(j)}$$

and $CHS$ represents the similarity of $Hist_i$ and $Hist_{i-1}$ of two consecutive frames $f_i$ and $f_{i-1}$, respectively. Using the Bhattacharyya coefficient [5] to compute the histogram similarity, $CHS$ is expressed as follows:

$$CHS = \sum_{c=1}^{\text{number of bins}} \sqrt{Norm(Hist_i(c)) \cdot Norm(Hist_{i-1}(c))}$$

The CHS is used as a measure for the shot boundary. The CHS value lies in the range [0, 1], and the larger the CHS value, the more similar the two consecutive frames are. Given a fixed threshold on the CHS value, if the CHS value is less than the threshold, a shot boundary is detected since an abrupt transition occurs between the two consecutive frames.
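A minimal sketch of this boundary test, assuming frames arrive as HSV arrays with H in [0, 360) and S, V in [0, 1]; the 0.7 threshold is an assumed value, since the fixed threshold is not specified here.

```python
import numpy as np

def hsv_histogram_256(h, s, v):
    """Build the 16x4x4 = 256-bin histogram over an HSV frame.

    h, s, v are 2-D arrays: h in [0, 360), s and v in [0, 1].
    """
    h_bin = np.minimum((h / 22.5).astype(int), 15)  # 360 / 16 = 22.5 deg/bin
    s_bin = np.minimum((s * 4).astype(int), 3)
    v_bin = np.minimum((v * 4).astype(int), 3)
    bins = h_bin * 16 + s_bin * 4 + v_bin
    return np.bincount(bins.ravel(), minlength=256).astype(float)

def chs(hist_a, hist_b):
    """Bhattacharyya coefficient between two normalized 256-bin histograms."""
    a = hist_a / hist_a.sum()
    b = hist_b / hist_b.sum()
    return float(np.sum(np.sqrt(a * b)))

def is_abrupt_boundary(hist_prev, hist_cur, threshold=0.7):
    # Low similarity between consecutive frames signals a cut.
    return chs(hist_prev, hist_cur) < threshold
```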


3.2 Motion Feature Extraction

A shot can be characterized by a particular camera motion, which may be single or mixed. The purpose of extracting camera motion information is to acquire semantic meaning so that the intention of a video shot can be inferred. Thus, camera motion characterization plays an important role in content-based video representation and indexing [19, 20], especially for sports videos. The MPEG-7 multimedia content description standard [13] has adopted several descriptors to characterize various aspects of camera motion; the camera motion descriptor specifies a set of basic camera operations such as panning and zooming. In the thesis, we consider the effective characterization of camera motion an essential feature for detecting the semantic scenes of baseball sports videos.

3.2.1 Motion Vector Estimation

To extract motion information, motion vector fields (MVFs) are estimated from raw video data. As we know, motion vectors are stored in MPEG bit streams. However, MPEG motion vectors do not always correspond to the true information we need, since the estimation is carried out to minimize prediction errors in the compression process. Therefore, it is necessary to preprocess motion vectors in order to enhance reliability. Here, the method proposed by Zahariadis et al. [29] was applied to estimate reliable motion vectors. For computing block-based motion vectors, the Mean Absolute Difference (MAD) is defined to measure the difference between the macro-block $C(k, l)$ of the current frame and the macro-block $R(k+x, l+y)$ of the reference frame. It is illustrated in Fig. 5 and written as follows:

$$MAD_{C(k,l)}(x,y) = \frac{1}{MN} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} \left| C(k+i,\, l+j) - R(k+x+i,\, l+y+j) \right|$$

where $(k+x, l+y)$ is a valid macro-block coordinate. The simplest search is the exhaustive search (ES), where all possible displacements are evaluated within a specific range. Thus, the ES computes the MAD at all $(2\rho+1) \times (2\rho+1)$ locations of the search window in order to find the best motion vector. The search window is illustrated in Fig. 6.

Fig. 5 Block matching technique

Fig. 6 Search window (with ρ = 7, i.e., displacements of -7 to +7 in each direction)
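The following sketch implements the exhaustive block-matching search over a (2ρ+1) x (2ρ+1) window as just described; the 16x16 macro-block size and ρ = 7 follow the figure, while the (row, column) frame layout is an assumption.

```python
import numpy as np

def mad(cur_block, ref_block):
    """Mean Absolute Difference between two equally sized macro-blocks."""
    return np.mean(np.abs(cur_block.astype(np.int32) - ref_block.astype(np.int32)))

def exhaustive_search(cur, ref, k, l, M=16, N=16, rho=7):
    """Return the (x, y) displacement minimizing MAD for macro-block C(k, l)."""
    block = cur[k:k + M, l:l + N]
    best_cost, best_mv = np.inf, (0, 0)
    for x in range(-rho, rho + 1):
        for y in range(-rho, rho + 1):
            kk, ll = k + x, l + y
            # Skip displacements that fall outside the reference frame.
            if kk < 0 or ll < 0 or kk + M > ref.shape[0] or ll + N > ref.shape[1]:
                continue
            cost = mad(block, ref[kk:kk + M, ll:ll + N])
            if cost < best_cost:
                best_cost, best_mv = cost, (x, y)
    return best_mv
```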

3.2.2 Camera Motion Classification

After deriving the MVFs from the raw video data, we can divide them into several non-overlapping regions (sub-MVFs), i.e., either background regions or object regions [19]. In the example shown in Fig. 7, the MVF on the left-hand side is divided into five sub-MVFs on the right-hand side, i.e., an object region in the dark area and four background regions (LT, RT, LB, RB) located at the corners.

Fig. 7 Example of MVFs and sub-MVFs

In general, the background regions capture most camera operations, so we reserve the four corners for the camera motion analysis. As a result, performing the camera motion analysis only on the background sub-MVFs avoids heavy computation and increases the effectiveness of camera motion information extraction. First, the method presented by Lee and Hayes [19] was extended to estimate the camera motion magnitude. For a frame and its motion vectors, a background sub-MVF with $N_{mb}$ macro-blocks in the n-th frame of a video is denoted as $[s, n]$, where $s \in \{LT, LB, RT, RB\}$. Then, its magnitude can be expressed as:

$$M[s,n] = \frac{1}{N_{mb}} \sqrt{\left(\sum_{k=1}^{N_{mb}} mvx_k[s,n]\right)^2 + \left(\sum_{k=1}^{N_{mb}} mvy_k[s,n]\right)^2}$$

where $mvx_k$ and $mvy_k$ are the horizontal and vertical components of the k-th macro-block of the background sub-MVF $[s, n]$. A camera movement occurs only when the motion magnitude exceeds a threshold. An aggregation function that determines whether a camera movement happens is defined as:

$$D_{bg}[n] = \sum_{s \in \{LT, LB, RT, RB\}} MV[s,n]$$

$$MV[s,n] = \begin{cases} 1, & \text{if } M[s,n] > \tau_{mag} \\ 0, & \text{otherwise} \end{cases}$$

where $D_{bg}[n]$ is the aggregation function for the camera movement over the four background sub-MVFs. If $D_{bg}[n]$ is greater than or equal to a given threshold $\tau_d$, a camera movement occurs; otherwise the camera motion is still. For the camera motion, we define six basic templates, as shown in Fig. 8, which are used to classify camera motions in the subsequent process.

Fig. 8 Camera motion templates: Zoom (In), Zoom (Out), Pan (Left), Pan (Right), Tilt (Up), Tilt (Down)

The orientation of a background sub-MVF whose $M[s,n]$ exceeds the given threshold $\tau_{mag}$ is used to determine the moving direction of the camera. For each such $M[s,n]$, the orientation angle, denoted as $\theta[s,n]$, is computed and quantized into one of eight orientations, as shown in Table 2:

$$\theta[s,n] = \tan^{-1}\left[\frac{\sum_{k=1}^{N_{mb}} mvy_k[s,n]}{\sum_{k=1}^{N_{mb}} mvx_k[s,n]}\right]$$
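A compact sketch of the magnitude, aggregation, and orientation computations above. The thresholds tau_mag and tau_d are assumed values, and np.arctan2 is used in place of a plain arctangent so that the angle lands in the correct quadrant for the 8-way orientation binning.

```python
import numpy as np

def magnitude(mvx, mvy):
    """M[s, n] for one background sub-MVF from its per-block vector components."""
    n_mb = len(mvx)
    return np.sqrt(np.sum(mvx) ** 2 + np.sum(mvy) ** 2) / n_mb

def orientation_bin(mvx, mvy):
    """Quantize theta[s, n] into one of eight 45-degree orientations (0..7)."""
    theta = np.degrees(np.arctan2(np.sum(mvy), np.sum(mvx))) % 360.0
    return int(((theta + 22.5) % 360.0) // 45.0)

def camera_moves(sub_mvfs, tau_mag=1.0, tau_d=2):
    """Evaluate D_bg[n] over the four background sub-MVFs (LT, LB, RT, RB)."""
    d_bg = sum(1 for mvx, mvy in sub_mvfs if magnitude(mvx, mvy) > tau_mag)
    return d_bg >= tau_d
```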


Table 2 Eight orientations of a background sub-MVF

Orientation 1 (→)
Angle -5