International Conference on Computer and Knowledge Engineering (ICCKE)

A Two-Stage Shot Boundary Detection Framework in the Presence of Fast Movements: Application to Soccer Videos

Farshad Bayat

Mohammad-Shahram Moin

Department of Electrical and Computer Engineering, Qazvin Branch, Azad University, Qazvin, Iran

Department of Electrical and Computer Engineering, Qazvin Branch, Azad University, Qazvin, Iran

Abstract—This paper addresses shot boundary detection in soccer videos in the presence of fast camera and/or player movements. The proposed approach is based on modified dissimilarity features, derived from the intensity histogram of the pixels and from image texture, combined with a fuzzy C-Mean clustering method. The singular value decomposition (SVD) is used to map the features into a refined feature space, which considerably simplifies the detection process. As a key element of the approach, several preprocessing techniques are proposed to cope with the effects of fast movements in the videos. Furthermore, a two-stage defuzzifier is introduced in the detection step to increase precision. Finally, the proposed method is applied to a large dataset, which demonstrates its effectiveness and performance.

Keywords-Shot boundary Detection; SVD; Fuzzy C-Mean.

I. INTRODUCTION

In recent years, the broadcasting of sports videos such as soccer matches has increased rapidly, and the review and classification of these videos has therefore become a crucial issue. As a result, much research has been reported in the field of machine vision on classifying, summarizing and analyzing sports videos [1-5].

The first step in video processing is dividing the frames into a set of shots. A shot is a sequence of frames captured by a camera between its start and stop times, which are known as shot boundaries [1]. There is a strong semantic relation between the frames in one shot. Shots are therefore considered the fundamental units for organizing video content, and shot boundary detection (SBD) plays a crucial role in video content analysis. Several challenges need to be considered in the SBD problem. The first is fast camera and/or player movement, which causes discontinuities in the feature signal and leads to false detections. The second is variation of image brightness, which affects the color information and makes SBD difficult. The third is the difficulty of the thresholding procedure in the detection stage.

In [6, 7] a threshold-based shot boundary detection approach is proposed that compares the similarity of pixels in two successive frames. When the calculated similarity value is less than a predefined threshold, a shot boundary is declared. Another approach is proposed in [12] using spatiotemporal and compressed-domain features. To this aim, the difference of color histograms is used as the first feature while the corresponding motion vectors are used as the second feature. Finally, a linear SVM is used to classify the frames. In [4] shot boundary detection is achieved using image edges combined with a windowing procedure, applying a threshold to the edge changes between two successive frames. In order to speed up the shot boundary detection process, a segmentation approach is used in [86], where the video is divided into sets of 20 frames.

In the present paper, taking all three of these challenges into account, shot boundary detection in soccer videos in the presence of fast camera and/or player movements is addressed using modified dissimilarity features. For this purpose, the brightness histogram of the pixels and the image texture are used as features, combined with a fuzzy C-Mean clustering approach, to separate shot boundary frames from non-shot boundary frames. The data are also refined using the SVD technique to simplify the detection procedure. Finally, a two-stage defuzzifier is introduced in the detection step to increase its precision.

II. MATHEMATICAL MODEL FOR SBD

In this section, the general mathematical model for SBD is presented, on which the new framework is based. The SBD process consists of three main steps: visual content representation, continuous signal construction, and continuous values clustering [9].

A. Visual Content Representation

Visual content representation can be formulated as a mapping from the image space $Q$ into the feature space $F$. Let $V_t \in F$ denote the feature of the $t$-th frame $I_t \in Q$. The main problem is then to find an appropriate feature, defined by the following map $\phi$:

$$\phi : I_t \in Q \rightarrow V_t \in F \tag{1}$$

Note that a suitable feature for SBD should have two characteristics: invariability and sensitivity [10].
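As a reading aid, the three steps of this model can be viewed as three composable maps. The sketch below is only an illustration under stated assumptions: `represent`, `continuity` and `decide` are hypothetical placeholders (mean intensity, absolute difference, fixed threshold) with the smallest possible neighborhoods, and they are not the features or the classifier proposed in this paper.

```python
import numpy as np

def represent(frame):
    """Step 1, frame -> feature (placeholder: mean intensity)."""
    return np.array([frame.mean()])

def continuity(v_t, v_next):
    """Step 2, pair of neighbouring features -> continuity value (placeholder: L1 distance)."""
    return float(np.abs(v_t - v_next).sum())

def decide(signal, t, threshold):
    """Step 3, continuity value -> boundary decision (placeholder: fixed threshold)."""
    return signal[t] > threshold

def detect_boundaries(frames, threshold=0.5):
    features = [represent(f) for f in frames]
    signal = [continuity(features[t], features[t + 1])
              for t in range(len(features) - 1)]
    return [t for t in range(len(signal)) if decide(signal, t, threshold)]

# Three dark frames followed by three bright frames: one cut after frame 2.
frames = [np.zeros((4, 4))] * 3 + [np.ones((4, 4))] * 3
boundaries = detect_boundaries(frames)
```

Keeping the three steps as separate functions mirrors the model: each step can be replaced (richer features, wider neighborhoods, a clustering-based decision) without touching the others.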

B. Continuous Signal Construction

The most widely used approaches to the SBD problem are based on finding significant discontinuities in the visual content of the video stream. Let $S$ denote the continuous value space and let $S_t$ denote the content continuity between $V_t$ and $V_{t+1}$. The continuous signal is defined by the following map:

$$\theta : A_t^d \in F^{2 \times d} \rightarrow S_t \in S \tag{2}$$

where $A_t^d = (V_{t-d+1}, \ldots, V_t, V_{t+1}, \ldots, V_{t+d})$ and $d$ is the neighborhood radius representing the content continuity between $V_t$ and $V_{t+1}$ [11].

C. Continuous Values Clustering

Decision making is the final step. To this aim, we need a map $\psi$ from the continuous signal into the decision space $W$, whose elements are "shot boundary" and "non-shot boundary". If $w \in W$ is a decision, the clustering problem is formulated as:

$$\psi : B_t^r \in S^{2 \times r + 1} \rightarrow w \in W \tag{3}$$

where $B_t^r = (S_{t-r}, \ldots, S_t, S_{t+1}, \ldots, S_{t+r})$ and $r$ is the neighborhood radius associated with $S_t$ and $S_{t+1}$.

III. PROPOSED FRAMEWORK FOR SBD

Based on the mathematical model presented above, the proposed framework for SBD is illustrated in Fig. 1.

Figure 1.  General framework of proposed approach

A. Visual Content Representation

Two features, the color histogram in the HSV model and the local binary pattern operator, are used to extract visual content from the images. The color histogram is a suitable feature for representing a video frame: it contains global information about the image and is therefore robust to movements within the frame. One of its disadvantages is that it is sensitive to brightness, which can change the color information. To reduce this influence, the HSV color model is used. In this model, H, S and V refer to hue, saturation and brightness value, respectively. In order to reduce the effect of brightness and obtain correct color information, we quantize the image as 8:12:18. For a given $N$ frames, the color histogram based feature is calculated as:

$$A_{1728 \times N} = [a_1 \,|\, a_2 \,|\, \ldots \,|\, a_N] \tag{4}$$

Texture information is the next feature, which is extracted using the local binary pattern (LBP) operator. This operator produces a binary code from the differences between the gray levels of pixels. If the gray level of a neighboring pixel is higher than that of the central pixel, the operator produces 1, otherwise 0 (see Fig. 2). The resulting codes form a histogram, which is a measure of the image texture.

Figure 2.  Example of a local binary pattern operator [8].

Since the middle part of the video frame is usually the most important part, instead of processing the whole image we partition it into 9 sections. The partitioning follows the golden section spatial composition rule [18], in which the image is divided as 3:5:3 along its width and height (see Fig. 3).

Figure 3.  Golden region (upper), output after applying LBP (lower).

After applying the LBP operator, the resulting image is divided into 3 rows and 14 columns. Consequently, the image is divided into 42 parts, and the mean and variance are calculated for each part:

$$B_{42 \times N} = [b_1 \,|\, b_2 \,|\, \ldots \,|\, b_N] \tag{5}$$

where $N$ is the number of frames and $b_i$ is the texture feature of frame $i$, obtained as:

$$b_{ik} = \mu_{ik} \,[\ldots]\, \sigma_{ik}^{2}, \qquad k = 1, \ldots, 42 \tag{6}$$

where $i$ and $k$ denote the frame number and image region index, respectively, and $\mu$ and $\sigma^2$ are the mean and variance, respectively.
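The two feature extractors of this subsection can be sketched in NumPy as follows. This is a minimal sketch under several assumptions that the text does not fix: the input is already an HSV image with all three channels scaled to [0, 1]; the 8:12:18 quantization is assigned to H, S and V in that order; the LBP is the basic 8-neighbour variant on a 3x3 window; and the golden-section cropping is omitted. All function names are hypothetical. The per-block mean and variance are returned separately, since the way Eq. (6) combines them is not reproduced here.

```python
import numpy as np

H_BINS, S_BINS, V_BINS = 8, 12, 18  # assumed H:S:V assignment; 8 * 12 * 18 = 1728

def hsv_histogram(hsv):
    """Normalized 1728-bin joint histogram of an HSV image scaled to [0, 1]."""
    bins = np.array([H_BINS, S_BINS, V_BINS])
    idx = np.minimum((hsv.reshape(-1, 3) * bins).astype(int), bins - 1)
    flat = (idx[:, 0] * S_BINS + idx[:, 1]) * V_BINS + idx[:, 2]
    hist = np.bincount(flat, minlength=H_BINS * S_BINS * V_BINS)
    return hist / hist.sum()

def lbp_image(gray):
    """Basic LBP code per interior pixel: bit is 1 where a neighbour exceeds the centre."""
    center = gray[1:-1, 1:-1]
    codes = np.zeros(center.shape, dtype=np.uint8)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = gray[1 + dy:gray.shape[0] - 1 + dy,
                         1 + dx:gray.shape[1] - 1 + dx]
        codes |= (neighbour > center).astype(np.uint8) << np.uint8(bit)
    return codes

def block_stats(codes, rows=3, cols=14):
    """Mean and variance of the LBP image on a rows x cols grid (42 blocks)."""
    means, variances = [], []
    for band in np.array_split(codes.astype(float), rows, axis=0):
        for block in np.array_split(band, cols, axis=1):
            means.append(block.mean())
            variances.append(block.var())
    return np.array(means), np.array(variances)

# Synthetic frame, used only to exercise the functions.
rng = np.random.default_rng(0)
hsv = rng.random((32, 58, 3))
hist = hsv_histogram(hsv)
codes = lbp_image(hsv[..., 2])
means, variances = block_stats(codes)
```

Stacking `hist` for every frame column by column gives a matrix of the shape used in Eq. (4), and the 42 block statistics per frame give the columns of Eq. (5).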



B. Continuous Signal Construction

After the feature extraction step, a continuous signal is constructed from the visual features. This signal is a mapping from the feature space into a new space that has the properties of continuity, low noise and error, and high separation capability. In linear algebra, the SVD is a useful matrix factorization with many applications in signal processing. For a given $M \times N$ matrix $S$, the SVD is given as:

$$S = U \Sigma V^T \tag{7}$$

where $U$ is an $M \times M$ matrix whose columns are eigenvectors of $S S^T$ and $V$ is an $N \times N$ matrix whose columns are eigenvectors of $S^T S$. Furthermore, $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$ is a diagonal matrix where $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_r \ge 0$ and $r = \min(M, N)$. The values $\{\sigma_1, \sigma_2, \ldots, \sigma_r\}$ are called the singular values of $S$. For a large matrix $S$, the singular values $\sigma_i$ carry the important information of the matrix, and keeping the first $k$ singular values (i.e. $\sigma_i$, $i = 1, \ldots, k$), where $k < M$, is equivalent to keeping the most important elements of $S$. For a given parameter $k$, using the diagonal matrix $\Sigma_k$ built from the first $k$ singular values, the compact form of matrix $S$ is constructed as $U_{M \times r} \Sigma_k V_{r \times N}^T$. Then one obtains:

$$F_{M \times N} \approx U_{M \times r} \begin{bmatrix} \Sigma_k & 0 \\ 0 & 0 \end{bmatrix} V_{r \times N}^T = U_{M \times r} \left[ \Sigma_{r \times r} V_{r \times N}^T \right], \qquad \Sigma_k = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_k, \sigma_{k+1} = 0, \ldots, \sigma_r = 0) \tag{8}$$

Then, the new feature space for each frame is obtained as $[v_1^T \Sigma_r, \ldots, v_N^T \Sigma_r]$. Based on exhaustive simulations and experimental results, the parameter $k$ is chosen equal to 50 for the color histogram and 8 for the LBP. Note that the parameter $k$ depends on the number of frames used for feature extraction; here the frame length is selected as $N = 3000$. The cosine similarity measure is used to obtain the similarity signal of two frames:

$$f_i = \cos(f_i', f_{i+1}') = \frac{f_i' \cdot f_{i+1}'}{\|f_i'\|_2 \cdot \|f_{i+1}'\|_2} \tag{9}$$

The dissimilarity signals (i.e. $1 - f_i$) for the color histogram and the LBP are shown in Fig. 4, where red diamonds mark the shot boundaries in the database.

Figure 4.  Features of color histogram and LBP.

The next challenge is to moderate the effect of fast camera and player movements, which increase the signal power. This is a critical issue in sports video processing and may lead to several undesirable results. To overcome this problem, we first apply a difference operator:

$$\mathrm{Feature}'(i) = [1, -1] * \mathrm{Feature}(i) \tag{10}$$

where $*$ denotes the convolution operator. An example is shown in Fig. 5 before and after applying this operator.

Figure 5.  Example of applying difference operator.

In order to remove the negative parts of the signal, a maximum filter is used, which reduces the noise level as well. This parametric filter is a window that slides over the signal and sets to zero all values except the maximum:

$$F'(i) = \begin{cases} F'(i), & i = \arg\max_{i \in w} F'(i) \\ 0, & i \ne \arg\max_{i \in w} F'(i) \end{cases}, \qquad i \in w \tag{11}$$

The output signal after applying the maximum filter is shown in Fig. 6.

Figure 6.  The result of applying maximum filter.

IV. SHOT BOUNDARY DETECTION

The key idea of shot boundary detection is to exploit the similarity of the frames within a shot and the sudden change in visual features during the transition from one shot to another. For this purpose, after feature extraction, normalization and some preprocessing, we use a Fuzzy C-Mean (FCM) clustering approach to classify frames as shot boundary or non-shot boundary frames.
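The signal construction of Section III-B that precedes clustering (SVD projection, cosine dissimilarity, difference operator, maximum filter) can be sketched as follows. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, the maximum filter is interpreted here as a centred sliding window that keeps a sample only when it is positive and the largest in its window, and the matrix sizes and `k` are toy values rather than the paper's `k = 50` and `k = 8` with `N = 3000`.

```python
import numpy as np

def svd_project(features, k):
    """Per-frame coordinates v_j^T Sigma restricted to the k largest singular values."""
    _, sigma, vt = np.linalg.svd(features, full_matrices=False)
    return vt[:k].T * sigma[:k]  # shape: (number of frames, k)

def cosine_dissimilarity(frames):
    """1 - cosine similarity between consecutive rows, as in Eq. (9)."""
    a, b = frames[:-1], frames[1:]
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return 1.0 - np.sum(a * b, axis=1) / np.maximum(norms, 1e-12)

def difference(signal):
    """Convolution with [1, -1]: x[i] - x[i-1], with the first sample kept as is."""
    return np.convolve(signal, [1.0, -1.0])[:len(signal)]

def window_maximum_filter(signal, width):
    """Zero every sample that is not a positive maximum of its centred window."""
    out = np.zeros_like(signal)
    half = width // 2
    for i, value in enumerate(signal):
        window = signal[max(0, i - half):i + half + 1]
        if value > 0 and value == window.max():
            out[i] = value
    return out

# Toy feature matrix (20 features x 60 frames): two shots of 30 frames each.
rng = np.random.default_rng(0)
shot_a = np.tile(rng.random((20, 1)), (1, 30))
shot_b = np.tile(rng.random((20, 1)), (1, 30))
features = np.hstack([shot_a, shot_b]) + 0.01 * rng.random((20, 60))

projected = svd_project(features, k=5)
signal = cosine_dissimilarity(projected)
peaks = window_maximum_filter(difference(signal), width=9)
boundary = int(np.argmax(peaks))  # index of the transition between frames 29 and 30
```

The difference operator turns an isolated jump into a positive spike followed by a negative one, and the window filter then keeps only the positive spike, which is why the two steps are applied in this order.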



First, the features are normalized to the range [0, 1] and then the feature space is defined as:

$$F_D = \{ f_i = (F_1'(i), F_2'(i)) \mid 1 \le i \le N \} \tag{12}$$

where $(F_1'(i), F_2'(i))$ is the feature pair of the $i$-th frame. The feature vectors $F_D$ for 10000 frames are shown in Fig. 7.

Figure 7.  Normalized feature space of color histogram and LBP.

To obtain the shot boundaries we use the FCM approach. The output of FCM is a two-row matrix that defines the membership value of each frame in each cluster:

$$M_{f2} = \left\{ U \in V_{2N} \;\middle|\; u_{it} \in [0, 1];\ \sum_{i=1}^{2} u_{it} = 1;\ 0 < \sum_{t=1}^{N} u_{it} < N;\ 1 \le i \le 2,\ 1 \le t \le N \right\} \tag{13}$$

Experimental results showed that shot boundaries are associated with high membership degrees in the interval [0.75, 1], while frames with membership degrees in the interval (0.2, 0.75) are hard to decide and need further processing. To this aim, a defuzzifier is defined as follows:

$$\mu_t = \begin{cases} 1, & u_{1t} \in [T_{SB}, 1] \\ 1, & u_{1t} \in [T_{NSB}, T_{SB}) \ \&\ LM(F_D(t)) = 1 \\ 0, & \text{otherwise} \end{cases} \tag{14}$$

where $T_{SB}$ and $T_{NSB}$ are the thresholds associated with shot boundary and non-shot boundary, respectively. These thresholds are adjusted using the ROC diagram. $LM(F_D(t))$ denotes the local maximum and $F_D(t)$ is obtained as:

$$F_D(t) = F_1'(t) \times F_2'(t) \tag{15}$$

The local maximum is calculated by comparing each frame with its neighbors and applying a threshold to reduce noise effects:

$$LM(F_D(t)) = \begin{cases} 1, & F_D(t) > F_D(t-1) \ \&\ F_D(t) > F_D(t+1) \ \&\ F_D(t) \ge T_{min} \\ 0, & \text{otherwise} \end{cases} \tag{16}$$

The results of applying the proposed approach are shown in Fig. 8, where features are indicated by blue diamonds, shot boundaries obtained from the database are indicated by red diamonds, and frames determined by the proposed approach are marked by red circles. Furthermore, the frames identified in the second step are indicated by red triangles, and those determined to be shot boundaries are marked by black squares.

Figure 8.  Results of applying FCM algorithm.

V. EXPERIMENTAL RESULTS

In order to analyze the accuracy, performance and robustness of the proposed approach, the FUM-BSVD dataset [21] is used, as described in Table I.

TABLE I.  FUM-BSVD SOCCER VIDEO DATASET DESCRIPTIONS.

Number  Game                    Length (hh:mm:ss)
1       Germany vs. England     01:33:22
2       Greece vs. Argentina    01:34:34
3       Slovakia vs. Italy      01:40:07
4       Germany vs. Argentina   01:32:17
5       Germany vs. Spain       01:34:31
6       Germany vs. Uruguay     01:35:17

Optimal parameters are found using the ROC diagram, which is commonly used for binary classification algorithms. The horizontal and vertical axes of the ROC diagram show the wrong and correct detections, respectively. The sensitivity (SE) and specificity (SP) are defined as:

$$SE = \frac{TP}{TP + FN}, \qquad SP = \frac{TN}{FP + TN} \tag{17}$$

where $TP$ and $TN$ are the numbers of correctly detected and correctly rejected cases, respectively, and $FP$ and $FN$ are the numbers of falsely detected and missed cases, respectively.

The optimal parameter values are found by plotting the ROC diagram for different thresholds in the range [0, 1]. The ROC diagram, obtained by varying the threshold, is plotted in Fig. 9 for $T_{NSB}$ and $T_{SB}$. An optimal threshold should lead to few false detections and high accuracy at the same time. In the ROC diagram, the optimal point is the one closest to the point (0, 1). Accordingly, the optimal values are $T_{NSB} = 0.2$ and $T_{SB} = 0.75$.

Finally, the results of applying the proposed approach to the dataset, which consists of about 855000 frames and 4433 shots, are shown in Table II. According to the results, the proposed approach performs well, with 96% recall and 93% precision on average. For a better assessment, the performance of the proposed method is compared with the alternatives proposed in [18] and [20]. The comparison results are given in Table III.
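The clustering and two-stage decision of Section IV, Eqs. (13) to (16), can be sketched as follows with the thresholds 0.75 and 0.2 reported in the paper. This is a minimal sketch under assumptions: the FCM below is the textbook algorithm with fuzzifier `m = 2` and a seeded random initialization, the boundary cluster is taken to be the one whose centre lies farther from the origin, and `t_min = 0.1` is a hypothetical value because the paper does not report it. All function names are illustrative.

```python
import numpy as np

def fuzzy_c_means(points, clusters=2, m=2.0, iterations=100, seed=0):
    """Textbook FCM; returns a (clusters x N) membership matrix and the centres."""
    rng = np.random.default_rng(seed)
    u = rng.random((clusters, len(points)))
    u /= u.sum(axis=0)  # each column sums to 1, as required by Eq. (13)
    for _ in range(iterations):
        w = u ** m
        centres = (w @ points) / w.sum(axis=1, keepdims=True)
        dist = np.linalg.norm(points[None, :, :] - centres[:, None, :], axis=2)
        inv = np.maximum(dist, 1e-12) ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=0)
    return u, centres

def local_maximum(fd, t_min):
    """Eq. (16): strictly above both neighbours and at least t_min."""
    lm = np.zeros(len(fd), dtype=bool)
    lm[1:-1] = (fd[1:-1] > fd[:-2]) & (fd[1:-1] > fd[2:]) & (fd[1:-1] >= t_min)
    return lm

def defuzzify(u_boundary, fd, t_sb=0.75, t_nsb=0.2, t_min=0.1):
    """Eq. (14): accept confident frames, then ambiguous frames that are local maxima."""
    confident = u_boundary >= t_sb
    ambiguous = (u_boundary >= t_nsb) & (u_boundary < t_sb)
    return confident | (ambiguous & local_maximum(fd, t_min))

# Toy normalized feature pairs: a strong peak at frame 2 and a weaker one at frame 5.
f1 = np.array([0.02, 0.03, 0.90, 0.02, 0.05, 0.45, 0.04, 0.01])
f2 = np.array([0.01, 0.04, 0.95, 0.03, 0.02, 0.50, 0.03, 0.02])
points = np.column_stack([f1, f2])

u, centres = fuzzy_c_means(points)
boundary_row = int(np.argmax(np.linalg.norm(centres, axis=1)))
decisions = defuzzify(u[boundary_row], f1 * f2)  # f1 * f2 plays the role of Eq. (15)
```

The second stage only matters for frames whose membership falls between the two thresholds; frames above `t_sb` are accepted and frames below `t_nsb` are rejected without consulting the local-maximum test.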

REFERENCES

[1] C. H. Yeo, Y. W. Zhu, Q. B. Sun, and S. F. Chang, “A framework for sub-window shot detection,” in Proc. Int. Multimedia Modelling Conf., Jan. 2005, pp. 84–91.
[2] A. Bagheri-Khaligh, R. Raziperchikolaei, and M. Ebrahimi, “A new method for shot classification in soccer sports video based on SVM classifier,” Image Analysis and Interpretation (SSIAI), 2012 IEEE Southwest Symposium on, 22-24 April 2012, pp. 109–112.
[3] M. Xu, J. Wang, M. A. Hasan, X. He, C. Xu, H. Lu, and J. S. Jin, “Using context saliency for movie shot classification,” Cyberworlds, 2008 International Conference on, 22-24 Sept. 2008, pp. 157–162.
[4] A. Ekin and A. M. Tekalp, “Automatic soccer video analysis and summarization,” IEEE Transactions on Image Processing, vol. 12, no. 7, pp. 796–807, 2003.
[5] Z. Cernekova, I. Pitas, and C. Nikou, “Information theory-based shot cut/fade detection and video summarization,” IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 1, pp. 82–90, Jan. 2006.
[6] X. Wu, P. C. Yuan, C. Liu, and J. Huang, “Shot boundary detection: an information saliency approach,” in Proc. Congr. Image Signal Process., 2008, vol. 2, pp. 808–812.
[7] J. S. Boreczky and L. D. Wilcox, “A hidden Markov model framework for video segmentation using audio and image features,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 1998, vol. 6, pp. 3741–3744.
[8] K. Matsumoto, M. Naito, K. Hoashi, and F. Sugaya, “SVM-based shot boundary detection with a novel feature,” in Proc. IEEE Int. Conf. Multimedia Expo., Jul. 2006, pp. 1837–1840.
[9] L. Varela, R. P. Amaral, and T. G. Amaral, “Shot segmentation algorithm for online video event detection,” Intelligent Engineering Systems (INES), 2012 IEEE 16th International Conference on, 13-15 June 2012, pp. 213–216.
[10] S. Porter, Video Segmentation and Indexing using Motion Estimation, PhD thesis, University of Bristol, Department of Computer Science, 2004.
[11] A. Ekin and A. M. Tekalp, “Shot type classification by dominant color for sports video segmentation and summarization,” in Proc. of ICASSP, vol. 3, pp. 173–176, 2003.
[12] L. Xie, S.-F. Chang, et al., “Structure analysis of soccer video with hidden Markov models,” in Proc. of IEEE ICASSP, 4:4096–4099, 2002.
[13] F. Bayat, M. S. Moin, F. Bayat, and M. Mokhtari, “A new framework for goal detection based on semantic events detection in soccer video,” 8th Iranian Conference on Machine Vision and Image Processing, 2013, submitted.
[14] S. Liao, M. W. K. Law, and A. C. S. Chung, “Dominant local binary patterns for texture classification,” IEEE Trans. on Image Processing, vol. 18, no. 5, pp. 1107–1118, 2009.
[15] C. H. Yeo, Y. W. Zhu, Q. B. Sun, and S. F. Chang, “A framework for sub-window shot detection,” in Proc. Int. Multimedia Modelling Conf., Jan. 2005, pp. 84–91.
[16] A. Hanjalic, “Content-based analysis of digital video,” Boston: Kluwer Academic Publishers, 2004.
[17] J. Yuan, H. Wang, L. Xiao, W. Zheng, J. Li, F. Lin, and B. Zhang, “A formal study of shot boundary detection,” IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 2, pp. 168–186, Feb. 2007.
[18] A. Ekin, A. M. Tekalp, and R. Mehrotra, “Automatic soccer video analysis and summarization,” IEEE Transactions on Image Processing, 12(7), 2003, pp. 796–807.
[19] M. Tavassolipour, M. Karimian, and S. Kasaei, “Event detection and summarization in soccer videos using Bayesian network and copula,” IEEE Transactions on Circuits & Systems for Video Technology, 23(2), 2013.
[20] Z. Lu and Y. Shi, “Fast video shot boundary detection based on SVD and pattern matching,” Image Processing, IEEE Transactions on, vol. 22, no. 12, pp. 5136–5145, Dec. 2013.
[21] V. Kiani and H. R. Pourreza, “An effective slow-motion detection approach for compressed soccer videos,” ISRN Machine Vision, vol. 2012, Article ID 959508, 8 pages, 2012.

Figure 9.  ROC diagram for obtaining optimal threshold (vertical axis: Sensitivity (TP Ratio); horizontal axis: 1-Specificity (FP Ratio); curves: TSB Threshold and TNSB Threshold).
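The selection rule described in the text, choosing the threshold whose ROC point lies closest to (0, 1), can be sketched as follows. This is a minimal sketch with hypothetical function names and made-up scores; it uses the sensitivity and specificity definitions of Eq. (17) and assumes that both classes are present in the ground truth, so that neither denominator is zero.

```python
import numpy as np

def sensitivity_specificity(predicted, actual):
    """SE = TP / (TP + FN) and SP = TN / (FP + TN) for boolean arrays."""
    tp = np.sum(predicted & actual)
    tn = np.sum(~predicted & ~actual)
    fp = np.sum(predicted & ~actual)
    fn = np.sum(~predicted & actual)
    return tp / (tp + fn), tn / (fp + tn)

def best_threshold(scores, actual, thresholds):
    """First threshold whose ROC point (1 - SP, SE) is closest to (0, 1)."""
    best, best_dist = None, np.inf
    for t in thresholds:
        se, sp = sensitivity_specificity(scores >= t, actual)
        dist = np.hypot(1.0 - sp, 1.0 - se)
        if dist < best_dist:
            best, best_dist = t, dist
    return best

# Made-up membership scores and ground truth, only to exercise the rule.
scores = np.array([0.05, 0.10, 0.15, 0.32, 0.80, 0.90, 0.12, 0.85])
actual = np.array([False, False, False, False, True, True, False, True])
chosen = best_threshold(scores, actual, np.linspace(0.0, 1.0, 21))
```

In this toy example every threshold between the largest negative score and the smallest positive score separates the classes perfectly, and the strict comparison keeps the first such value on the grid.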

TABLE II.  RESULTS OF APPLYING THE PROPOSED APPROACH.

Match No.  Total Frame  Total Shots  Recall  Precision  F1
1          140054       852          0.935   0.935      0.935
2          141873       636          0.965   0.96       0.962
3          150177       814          0.965   0.945      0.954
4          138447       937          0.965   0.885      0.923
5          141779       589          0.965   0.925      0.945
6          142944       605          0.975   0.935      0.955
Total      855274       4433         0.962   0.931      0.946

TABLE III.  COMPARISON RESULTS OF PROPOSED APPROACH WITH METHODS IN [18] AND [20].

Method       Precision  Recall  F1
Ekin [18]    91.7%      97.3%   94%
Lu [20]      86.4%      84.5%   85%
Our method   93.1%      96.2%   95%
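As a consistency check on Table II, the F1 column can be recomputed from the precision and recall columns. The sketch below assumes the usual definition of F1 as the harmonic mean of precision and recall, which the paper does not state explicitly; the values are copied from Table II.

```python
# (precision, recall, reported F1) per match, copied from Table II.
table_ii = {
    "1": (0.935, 0.935, 0.935),
    "2": (0.960, 0.965, 0.962),
    "3": (0.945, 0.965, 0.954),
    "4": (0.885, 0.965, 0.923),
    "5": (0.925, 0.965, 0.945),
    "6": (0.935, 0.975, 0.955),
    "Total": (0.931, 0.962, 0.946),
}

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)

deviations = {
    match: abs(f1_score(p, r) - reported)
    for match, (p, r, reported) in table_ii.items()
}
largest_deviation = max(deviations.values())
```

Under this assumption every reported F1 value agrees with the recomputed one to within 0.001, which is the rounding resolution of the table.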

VI. CONCLUSION

In this paper, shot boundary detection in soccer videos is addressed with regard to three critical challenges: fast camera and/or player movements, variation of image brightness, and the difficulty of the thresholding procedure in the detection stage. To this aim, the brightness histogram of the pixels and the image texture are used as features, combined with a fuzzy C-Mean clustering approach, to separate shot boundary frames from non-shot boundary frames. The data are also refined using the SVD technique to simplify the detection procedure. Finally, extensive simulations on a large dataset show that the proposed method achieves acceptable performance in both recall and precision.