Ball Hit Detection in Table Tennis Games Based on Audio Analysis

Bin Zhang, Weibei Dou
Department of Electronics Engineering, Tsinghua University, Beijing, China, 100084
[email protected], [email protected]

Liming Chen
LIRIS CNRS UMR 5205, Ecole Centrale de Lyon, France
[email protected]

Abstract

As a bearer of high-level semantics, the audio signal is used more and more in content-based multimedia retrieval. In this paper, we investigate ball hit detection for sports games and propose a novel approach to detect ball hits. By employing Energy Peak Detection (EPD) and Mel Frequency Cepstral Coefficient-based (MFCC-based) Refinement (MBR), high precision (91%) and adequate recall (73%) of ball hit detection are achieved with low computational complexity and an easy training process. The proposed algorithm can be applied in audio content-based highlight detection systems and provides valuable information for the semantic understanding of sports games.

1. Introduction

Content-based information retrieval of multimedia is very desirable, since the amount of multimedia data has increased dramatically in recent years. As a bearer of high-level semantics, the audio signal has been recognized as a key component in content-based multimedia information retrieval. Audio content-based highlight detection for sports games has developed rapidly [1, 2, 5, 7, 8, 9], not only because of its relatively low computational complexity compared with video processing, but also because some important events, say, ball hits, can be easily traced through audio analysis. In racket sports games, such as tennis and table tennis, the ball hit is a crucial audio keyword in highlight detection. More specifically, by analyzing the location, number, and frequency of ball hits, the location of the highlights as well as the fierceness of the game can be inferred. With the information of ball hits, and with the help of other audio information, a rally can be easily detected [8]. Nevertheless, it is not easy to detect ball hits accurately. Unlike other sounds, such as background noise, the commentator's speech, and the audience's cheering, a ball hit is a typical transient signal and very short in duration. Therefore, some long-term audio features [1], which are extracted on a scale of up to one second and are very effective in detecting speech and cheering, cannot be applied to ball hit detection.

A lot of work has been done on ball hit detection. Rui et al. [5] proposed a ball hit detection approach for TV baseball games based on directional template matching. Sub-band energy envelopes were used as audio features, and the probability of being a ball hit was modeled using the Mahalanobis distance of a data sequence from the templates. Rodet et al. [4] proposed a Short-time Fourier Transform-based (STFT-based) approach to detect fast attack transients, in which the maximum energy of the current frame and the average energies of the former and latter frames were compared to detect a peak in each frequency bin; the results in the different bins were then aggregated to locate an attack. Hsu [2] proposed a golf impact detection algorithm using audio clues, modeling impact sounds with a unimodal multivariate Gaussian over MFCC features and achieving a success rate of around 70%. The Mahalanobis distance between testing samples and training samples was calculated as the similarity function, and some morphological operations were then applied as a filter to further refine the detection results. Xiong et al. [9] proposed an approach based on MPEG-7 audio features and Entropic Prior Hidden Markov Models (EP-HMM) to detect golf ball hits. Although acceptable results were obtained through these approaches, they have some underlying flaws. Basically, these approaches can be classified into three categories: (1) temporal feature-based detectors, such as [5]; (2) spectral feature-based detectors, such as [2]; (3) temporal and spectral feature-based detectors, such as [4, 9]. In general, the performance of the first two categories is limited, because the information used in detection is incomplete. The third category should be the best option; however, the existing approaches are not optimized.
First of all, in [4] only the audio features in adjacent frames are compared for classification, which is limited and cannot reveal the panoramic temporal variation of audio features during a ball hit; more frames should be taken into account to locate the ball hits more precisely. Second, whereas ball hits represent only a small portion of a sports game, mostly less than 10%, the computational complexity of the existing approaches is high. To make the final decision, every frame of the audio stream has to be processed by a complex classifier, which is very time consuming and unnecessary. The computational complexity can therefore be reduced greatly if we properly choose the frames that are to be processed.

To overcome the drawbacks of the existing systems, a temporal and spectral feature-based approach with a hierarchical structure is proposed in this paper to detect ball hits in table tennis games. Unlike the existing systems, we fully exploit the temporal and spectral characteristics of ball hits. With a hierarchical structure made up of two cascaded blocks, Energy Peak Detection (EPD) and MFCC-based Refinement (MBR), the proposed system provides high accuracy in ball hit detection while keeping the computational complexity low. Based on such a ball hit detector, each second of the video stream is labeled as "ball hit" or "non ball hit." A direct application of our ball hit detector is sports game highlight detection, which can usefully rely on the analysis of ball hits within the sports games.

0-7695-2521-0/06/$20.00 (c) 2006 IEEE
2. System framework

The proposed system is composed of two blocks, Energy Peak Detection (EPD) and MFCC-based Refinement (MBR) (Fig. 1).

    Audio Data → Block 1: Energy Peak Detection (EPD) → Raw Ball Hits
               → Block 2: MFCC-based Refinement (MBR) → Refined Ball Hits

Figure 1. The framework of the proposed ball hit detection system

Raw ball hits are detected at a coarse level by block 1. This block uses a computationally cheap algorithm that retains most of the real ball hits, but also some mistakenly detected attack sounds. In block 2, we further analyze the MFCC of these raw ball hits, from which the refined ball hits are obtained. Compared with block 1 (EPD), block 2 (MBR) is more time consuming. With the help of EPD, however, the amount of data to be processed by MBR is greatly reduced, and hence so is the overall computational complexity. High accuracy is guaranteed by MBR using spectrum-based pattern recognition techniques.

2.1. Block 1: energy peak detection

By analyzing the Short-time Energy (STE) [10] of the audio, a peak can be observed when a ball hit occurs. In table tennis games, for instance, the duration of a ball hit is typically 10 to 20 ms. To describe the energy variation during a ball hit, we therefore have to extract the STE every 10 ms. We also have to include up to about 100 ms to provide a panorama of the complete ball hit. Template matching [5] would have been a good choice; however, it does not work well when a ball hit is polluted by other sounds, such as the commentator's speech and the audience's cheering. In that case, the STE envelope is distorted and no longer matches the template. To increase the robustness of the algorithm against background sounds, a new feature, the peak index, is proposed in this paper.

As prerequisites, we introduce the time scales of feature extraction used in this paper. The STE is extracted every 10 ms, which defines a frame. The final decision of our system is made every 1 second, which defines a decision window, or window for short. In the meantime, we compute the mean of the STE every 100 ms, which defines a macro frame. We denote the STE as E[n], n = 1, 2, ..., where the discrete time index n increases by one every 10 ms. Consequently, the mean STE of macro frame n is

$$\bar{E}[n] = \frac{1}{10}\sum_{m=0}^{9} E[10n - m], \quad n = 1, 2, \ldots \tag{1}$$

The peak index is then defined as

$$I[n] = E[n] - \bar{E}\left[\lceil n/10 \rceil\right], \quad n = 1, 2, \ldots, \tag{2}$$
where ⌈·⌉ denotes rounding towards infinity. The choice of 100 ms as the length of a macro frame is motivated as follows. The peak index should remain high when a ball hit is polluted by speech, and it should be low when a sudden, short burst of speech occurs. Hence, the macro frame should be short enough that the energy of a speech signal remains stable within it; meanwhile, it must be much longer than a frame in order to sharpen the energy peak of a ball hit. By analyzing the sports games, we find that the duration of an instantaneous speech sound is equal to or greater than approximately 100 ms in most cases. Therefore, 100 ms-long macro frames enable us to suppress the interference of speech as much as possible while preserving a sharp peak when a ball hit occurs.

The peak index provides a simple and effective description of an energy peak, so the main advantage of this algorithm is its low computational complexity. Moreover, since only the difference between a ball hit's energy and the mean energy of a macro frame is taken into account, it is not very sensitive to background sounds, which boost the energy of a ball hit and the mean energy at the same time. To obtain the raw ball hits, an empirical threshold has to be chosen. This threshold is not very critical, however, because a second block still refines the decision. In general, we can select a relatively small threshold, so that most real ball hits are kept.
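As an illustrative sketch (not the authors' code), the peak index of Eqs. (1) and (2) can be computed over a sequence of 10 ms short-time energies as follows; the toy energy array is an assumption:

```python
import numpy as np

def peak_index(ste):
    """Peak index I[n] = E[n] - Ebar[ceil(n/10)], given short-time
    energies E with one value per 10 ms frame (Eqs. 1-2)."""
    n_macro = len(ste) // 10                 # complete 100 ms macro frames
    ste = np.asarray(ste, dtype=float)[:n_macro * 10]
    # mean STE of each macro frame (Eq. 1)
    ebar = ste.reshape(n_macro, 10).mean(axis=1)
    n = np.arange(1, len(ste) + 1)           # 1-based frame index
    macro = np.ceil(n / 10).astype(int) - 1  # 0-based macro frame index
    return ste - ebar[macro]                 # Eq. 2

# toy example: flat background energy with one sharp 10 ms burst
e = np.ones(50)
e[23] = 9.0                                  # burst at 0-based frame index 23
i = peak_index(e)
print(int(np.argmax(i)))                     # → 23 (frame with largest peak index)
```

Thresholding `i` then yields the raw ball hit candidates; the burst stands out because its macro frame's mean is only mildly raised by the single loud frame.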
2.2. Block 2: MFCC-based refinement

Using EPD, many impulsive sounds can be detected; however, not all of them are ball hits. To extract ball hits from these raw ball hit candidates, an MFCC-based classification approach is proposed. Among the spectrum-based audio features, MFCC is the most commonly used feature in the literature, not only for its success in Automatic Speech Recognition (ASR), but also for its ability to represent the audio spectrum in a compact parametric form regardless of the audio type. MFCC is defined as follows [3]:

$$C_n = \sum_{k=1}^{K} \left[\log S(k)\right] \cos\left(n\left(k - \frac{1}{2}\right)\frac{\pi}{K}\right), \quad n = 1, 2, \ldots, L, \tag{3}$$

where L is the desired length of the cepstrum, typically 12 in ASR applications, S(k) is the output power of the k-th mel scale filter, and K is the number of filters in the filter bank. Moreover, a logarithmic energy item is added to the feature vector as the first element C_0, forming a 13-dimensional feature vector. In our work, C_0 is excluded from the proposed MBR, because it only describes the energy, rather than the shape of the spectrum.

To decide which candidates are real ball hits based on MFCC, one can calculate the similarity between every candidate and the training samples. In [2], this similarity was estimated through the Mahalanobis distance. The Mahalanobis distance (statistical distance) between two points x = (x_1, ..., x_p)^T and y = (y_1, ..., y_p)^T in the p-dimensional feature space R^p is defined as

$$d(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)}, \tag{4}$$
where Σ is the covariance matrix. By setting a threshold d_th on this distance, we can build a classifier that assigns the candidates to two classes: ball hit (d ≤ d_th) and non ball hit (d > d_th). Alternatively, a C-Support Vector Classifier (C-SVC) with a Radial Basis Function (RBF) kernel [6] can be employed, which also classifies the candidates into ball hit and non ball hit, and proves to be more accurate.

Both the Mahalanobis distance threshold and the parameters of the C-SVC must be obtained through training, for which one should provide two sequences: a sequence of pure ball hits and a sequence of other sounds (non ball hits). The latter is easy to find, while the former, pure ball hits, is relatively difficult to obtain, because a ball hit is extremely short, and manually extracting dozens of ball hits from the games is very tedious. To overcome this, a training algorithm is proposed in this paper. Only a sequence of audio data containing tens of clear ball hits with low background noise is needed. Since there is little pollution and interference, the ball hits can be located
with satisfying accuracy by merely employing EPD. The MFCCs of these hits, denoted C_hit(n), n = 1, 2, ..., N_hit, are then extracted and form the training set of pure ball hits. Meanwhile, the MFCCs of the sequence of other sounds, C_other(n), n = 1, 2, ..., N_other, are extracted, too. Here, N_hit and N_other are the numbers of training samples in the two training sets, pure ball hits and other sounds, respectively. The Mahalanobis distance threshold is calculated through

$$d_{th} = \frac{1}{2}\left(\bar{d}_{hit} + \bar{d}_{other}\right), \tag{5}$$

where

$$\bar{d}_{hit} = \frac{1}{N_{hit}} \sum_{n=1}^{N_{hit}} d\left(C_{hit}(n), \bar{C}_{hit}\right), \tag{6}$$

$$\bar{d}_{other} = \frac{1}{N_{other}} \sum_{n=1}^{N_{other}} d\left(C_{other}(n), \bar{C}_{hit}\right), \tag{7}$$

are the mean distances of the two training sets to the arithmetic center \(\bar{C}_{hit}\) of the training set of pure ball hits, respectively.

With the training procedure presented above, however, we are only able to collect tens of samples of ball hits, which is somewhat insufficient for typical classifier training. As a result, the dimension of the feature space may have to be reduced to compensate for the insufficient training data. Related experimental results on the MFCC dimension are presented in Fig. 2.
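The MFCC computation of Eq. (3) and the threshold training of Eqs. (4)-(7) can be sketched in numpy as follows (an illustrative sketch, not the authors' implementation; the flat filter-bank powers and the random MFCC training sets are placeholders):

```python
import numpy as np

def mfcc_from_filterbank(S, L=12):
    """Eq. (3): C_n = sum_{k=1..K} [log S(k)] cos(n (k - 1/2) pi / K)."""
    S = np.asarray(S, dtype=float)
    K = len(S)
    k = np.arange(1, K + 1)
    return np.array([np.sum(np.log(S) * np.cos(n * (k - 0.5) * np.pi / K))
                     for n in range(1, L + 1)])

def mahalanobis(x, y, cov_inv):
    """Eq. (4): statistical distance between two feature vectors."""
    d = x - y
    return float(np.sqrt(d @ cov_inv @ d))

def train_threshold(C_hit, C_other):
    """Eqs. (5)-(7): both mean distances are taken to the arithmetic
    center of the pure ball hit training set."""
    center = C_hit.mean(axis=0)                       # C-bar_hit
    cov_inv = np.linalg.inv(np.cov(C_hit, rowvar=False))
    d_hit = np.mean([mahalanobis(c, center, cov_inv) for c in C_hit])
    d_other = np.mean([mahalanobis(c, center, cov_inv) for c in C_other])
    return 0.5 * (d_hit + d_other)                    # Eq. (5)

# sanity check: a flat spectrum has log S = 0, so every coefficient vanishes
assert np.allclose(mfcc_from_filterbank(np.ones(24)), 0.0)

# placeholder 3-dimensional MFCC training sets
rng = np.random.default_rng(1)
C_hit = rng.normal(0.0, 1.0, size=(40, 3))
C_other = rng.normal(4.0, 1.0, size=(60, 3))
d_th = train_threshold(C_hit, C_other)
# a candidate is labeled "ball hit" iff its distance to the center is <= d_th
```

A C-SVC with an RBF kernel would replace the final distance threshold with a trained decision function over the same two training sets.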
3. Experimental results

The experiments are carried out on a table tennis match. It is neither necessary nor possible to find the location of every ball hit precisely; instead, knowing whether a ball hit occurs in each second already gives us plenty of information for understanding the match. Therefore, the decision rule is set as follows: if the number of detected ball hits in a decision window is greater than two, the window is classified as ball hit; otherwise, it is classified as non ball hit. The training sequences, which are extracted from this match, are composed of a sequence of ball hits with a duration of ten seconds and a sequence of other sounds with a duration of one minute. Background noise, speech, and cheering are included in the latter, in order to cover as many circumstances as possible. The testing sequence is the whole match, with a duration of 33 minutes. Each second is manually labeled as ball hit or non ball hit; altogether, 188 decision windows (seconds) are labeled as ball hit.

The conventional precision-recall metric is used for evaluation, defined as follows [1]:

$$\text{Precision} = \frac{\text{Number of correctly detected windows}}{\text{Number of detected windows}}, \tag{8}$$

$$\text{Recall} = \frac{\text{Number of correctly detected windows}}{\text{Number of manually labeled windows}}. \tag{9}$$

The testing results of the different approaches are shown in Tab. 1, which reveals that EPD + MBR (C-SVC) performs best, with fairly satisfying accuracy. The experiment also proves the efficiency of the algorithm: EPD detects about 3500 raw ball hits, so block 2 only has to process about 3500 MFCC samples, rather than the 100 samples/s × 33 min × 60 s/min ≈ 2 × 10^5 MFCC samples required without EPD. This is an enormous reduction of the computational cost.
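The counting in Eqs. (8)-(9) can be sketched over boolean per-window decisions (illustrative arrays, not the paper's data):

```python
import numpy as np

def precision_recall(detected, labeled):
    """Eqs. (8)-(9) over per-second boolean decision windows."""
    detected = np.asarray(detected, dtype=bool)
    labeled = np.asarray(labeled, dtype=bool)
    correct = np.sum(detected & labeled)   # correctly detected windows
    return correct / np.sum(detected), correct / np.sum(labeled)

# toy example: 6 windows; 4 detected, 3 of them correct; 4 labeled
p, r = precision_recall([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 1, 0])
print(p, r)  # → 0.75 0.75
```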
Table 1. Comparison of different approaches

    Approach                            Precision   Recall
    EPD only                               28%       84%
    EPD + MBR (Mahalanobis distance)       71%       66%
    EPD + MBR (C-SVC)                      91%       73%

To select the optimal MFCC feature space dimension, the overall performance of each dimension is tested and compared. The testing results are shown in Fig. 2, from which it can be seen that the 3-dimensional feature space performs best.
[Plot: Precision and Recall (%) versus Feature Space Dimension, 0 to 12]
Figure 2. Performance versus the dimension of MFCC feature space
4. Conclusion A novel approach to detect ball hits for table tennis games by audio analysis has been proposed in this paper. By employing Energy Peak Detection (EPD) to get the raw ball hits, and MFCC-based Refinement (MBR) to obtain refined ball hit detection, the proposed approach has achieved
high precision and an adequate recall rate, while keeping the computational complexity low. The experimental results have shown 91% precision and 73% recall, which is very encouraging. In the future, an audio content-based highlight detection system for table tennis games will be constructed, with the proposed algorithm embedded as a fundamental module. By classifying each window of the video stream into basic sound classes (ball hit, background sound, speech, and cheering), the content of the sports game will be recognized. Final highlights will then be located by means of a Hidden Markov Model (HMM) [7], which will model the game based on these basic sound classes.
5. Acknowledgement

This work has been supported by the Programme de Recherches Avancées de Coopérations Franco-Chinoises (PRA SI04-02). We thank our former colleague Tian Xia for the great inspiration that led to EPD.
References

[1] H. Harb and L. Chen. Highlights detection in sports videos based on audio analysis. In Proc. of CBMI 2003, 2003.
[2] W. H.-M. Hsu. Golf impact detection with audio clues. Speech and Audio Processing and Recognition Project, http://www.ee.columbia.edu/~winston/courses/speech audio/project/E6820PrjRpt.pdf, 2002.
[3] L. Rabiner and B.-H. Juang. Fundamentals of Speech Recognition. PTR Prentice-Hall, Inc., 1993.
[4] X. Rodet and F. Jaillet. Detection and modeling of fast attack transients. http://mediatheque.ircam.fr/articles/textes/Rodet01a.pdf, 2001.
[5] Y. Rui, A. Gupta, and A. Acero. Automatically extracting highlights for TV baseball programs. In Eighth ACM International Conference on Multimedia, pages 105-115, 2000.
[6] B. Scholkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. The MIT Press, 2002.
[7] J. Wang, C. Xu, E. Chng, and Q. Tian. Sports highlight detection from keyword sequences using HMM. In Proc. of ICME 2004, volume 1, pages 599-602, 2004.
[8] L. Xing, Q. Ye, W. Zhang, Q. Huang, and H. Yu. A scheme for racquet sports video analysis with the combination of audio-visual information. In Proc. of SPIE, volume 5960, pages 259-267, 2005.
[9] Z. Xiong, R. Radhakrishnan, A. Divakaran, and T. S. Huang. Audio event detection based highlights extraction from baseball, golf and soccer games in a unified framework. In Proc. of ICASSP 2003, volume 5, pages 632-635, 2003.
[10] T. Zhang and C.-C. J. Kuo. Audio content analysis for online audiovisual data segmentation and classification. IEEE Transactions on Speech and Audio Processing, 9(4):441-457, 2001.