J Sign Process Syst (2010) 61:75–83 DOI 10.1007/s11265-008-0314-3
From Low-Level Features to Semantic Classes: Spatial and Temporal Descriptors for Video Indexing

Markos Zampoglou & Theophilos Papadimitriou & Konstantinos I. Diamantaras
Received: 17 February 2008 / Revised: 3 October 2008 / Accepted: 27 October 2008 / Published online: 26 November 2008
© 2008 Springer Science + Business Media, LLC. Manufactured in The United States
Abstract As the quantity of publicly available multimedia material grows, automatic indexing becomes increasingly important for accessing multimedia databases. In this paper, a novel set of low-level descriptors is presented for content-based video classification. Concerning temporal features, we use a modified PMES descriptor for the spatial distribution of local motion and a Dominant Direction Histogram we have developed to represent the temporal distribution of camera motion. Concerning color, we present the Weighted Color Histogram we have designed in order to model color distribution. The histogram models the H parameter of the HSV color space, and we combine it with weighted means for the S and V parameters. For the selection of key frames from which to extract the spatial descriptors, we use a modified version of a simple, efficient method. We then evaluate our descriptor set on a database of video shots resulting from the temporal segmentation of the archive of a real-world TV station. Results demonstrate that our approach can achieve high success rates on a wide range of semantic classes.
M. Zampoglou (*)
Department of Applied Informatics, University of Macedonia, Thessaloniki 54006, Greece
e-mail: [email protected]

T. Papadimitriou
Department of International Economic Relations and Development, Democritus University of Thrace, Komotini 69100, Greece
e-mail: [email protected]

K. I. Diamantaras
Department of Informatics, TEI of Thessaloniki, Thessaloniki 57400, Greece
e-mail: [email protected]
Keywords Video indexing · Color descriptors · Motion descriptors · Content-based video retrieval
1 Introduction

Content-based multimedia indexing and retrieval is a field of rising interest, with a wide range of applications. More and more multimedia databases appear, and the amount of publicly available multimedia content constantly increases. As a result, locating a multimedia item within such a database by sequential search or manual indexing becomes increasingly difficult, which is why multimedia indexing and retrieval are attracting growing attention from the research community. Research on content-based image indexing and retrieval has advanced significantly in recent years [1, 2]. The prospect of extending this approach to video, however, has only recently begun to be explored, and many issues remain to be solved.

In video indexing, our material is a video sequence. A video sequence is a series of consecutive video shots, separated by camera transitions such as fades or cuts. Thus, a shot is a piece of digital video containing a continuous, uninterrupted take from a single camera. Much like content-based image indexing, video indexing is essentially based on the processing of a set of features extracted from a video sequence. Because of the structure of a video sequence, feature extraction proceeds in two steps: (1) temporal segmentation, which is the separation of the sequence into shots and can be done either automatically ([3, 4]) or, as in our case, manually, and (2) key frame extraction (Fig. 1).
Figure 1 The typical feature extraction process for digital video sequences: temporal segmentation of the video sequence into video shots, extraction of temporal information (motion, inter-frame relations) from the shots, key frame extraction, and extraction of spatial information (colour, texture, shape, objects) from the key frames.
Following temporal segmentation, the first set of features is extracted. From the video shots, we can extract features describing the temporal information contained; these can either be motion descriptors or other descriptors of the inter-frame relations. Following temporal segmentation and temporal information extraction, the next step is to select a small number of frames (often one) from a shot. These selected frames, called key frames, are chosen to be the most representative frames of the shot according to some criterion. From these frames, descriptors are then extracted to represent the spatial information contained, such as colour, texture or shape. The descriptors used to model spatial information are essentially the same as those used for image indexing and retrieval, since a frame is nothing more than an image; the difference lies in our option to choose more than one key frame to represent that aspect of the shot information.

In this paper, we focus on video shot indexing by combining temporal and spatial descriptors. We present our approach to modelling camera and object motion, as well as descriptors for the color information. The rest of the paper is organized as follows: Section 2 contains a review of previous work related to our research, Section 3 presents the descriptors we have chosen for our classification scheme, as well as the classifier used (an overview appears in Fig. 2), Section 4 describes the experimental implementation details, and Section 5 presents the classification results.

2 Related Work
Motion Vector Fields are the most common form of raw motion information, and a number of approaches exist where statistical descriptors are extracted from the vector fields [5, 6]. If we aim at a higher-level description of the motion information in a video shot, however, a clear distinction can be made between the motion vectors: if we assume that the majority of pixels in each frame belong to the background, then the majority of motion vectors should be the result of camera motion. This is often referred to as dominant or global motion. The rest of the motion vectors, assumed to result from objects moving in the foreground, are referred to as local or residual motion. Of course, the dominant motion in a scene could be zero, suggesting a static camera; this case can easily be detected when most vector magnitudes in a frame are zero. A simple 2D model is enough to estimate the dominant motion [7, 8]. The residual motion can then be estimated by compensating for the global motion over the complete vector field and focusing on the vectors that still have significantly large magnitudes. In that case, the residual motion expresses the motion patterns of objects in the foreground, and statistical features of that motion over the shot can then be used as descriptors for classification of the videos, combined with the initial information provided by the 2D model.

One step beyond the simple distinction between dominant and local motion is to also estimate the actual camera operations that caused the dominant motion patterns [9, 10]. After estimating a model for camera operations, the camera motion can be compensated for, and the residual motion estimated. The residual motion, corresponding to object motion, can then either be modelled as low-level information [9], or, after object segmentation [10, 11], object trajectories can be estimated and used as descriptors [12, 13].
Figure 2 An overview of our approach: motion estimation is performed on the video shot, followed by PMES and DDH extraction; five frames are selected from the shot, one of which is chosen for WCH and S and V means extraction; all descriptors are combined into a feature vector used for classification.
Concerning color descriptors, the most widely used one is the Color Histogram [14]. Several approaches exist concerning the number of histogram bins, as well as the color space model, the most commonly used representation being the Hue-Saturation-Value (HSV) color space [15]. The greatest disadvantage of the color histogram is the fact that it completely ignores the spatial structure of color in the image. Of the approaches proposed to deal with this issue, the most popular one is the Color Correlogram, where the relative distance between pixels of all colors is taken into account besides the actual pixel color [16]. A special subcase of the Correlogram is the Autocorrelogram, where only the relative distance between pixels of the same color is taken into account [17].
3 Feature Set and Classification Scheme

3.1 Local Motion Descriptor

Recently, Ma and Zhang [18] proposed a descriptor for the local motion of a video shot, called the Perceived Motion Energy Spectrum (PMES). It is a compact, fixed-length descriptor that can be calculated at a very low computational cost. The first step is to calculate the mixture energy of each macroblock, which is the motion energy resulting from both camera and local motion. In this case, the term energy corresponds to the magnitude of the motion vectors. The mixture energy of a block (i, j), denoted MixEn_{i,j}, is calculated by averaging over all the vector magnitudes of that block through time, after a trimming process. Trimming takes place after sorting all the magnitude values for a particular macroblock through time and throwing out the extremes. The percentage of values to be left out is specified through a trimming parameter α:

MixEn_{i,j} = \frac{1}{M - 2\lfloor \alpha M \rfloor} \sum_{n = \lfloor \alpha M \rfloor + 1}^{M - \lfloor \alpha M \rfloor} Mag_{i,j}(n)    (1)
where M is the total number of magnitudes (in our case, the number of frames selected for motion vector extraction from a shot), n is the frame index, α is the trimming parameter (0 ≤ α ≤ 0.5), Mag_{i,j}(n) is the magnitude of the motion vector of block (i, j) in the nth frame, and \lfloor \alpha M \rfloor denotes the largest integer smaller than or equal to αM. The mixture energy is then normalized to [0, 1], giving \overline{MixEn}_{i,j}. The mixture energy describes the overall result of local and camera motion in the vector magnitudes. This information, however, is significantly limited in its ability to contribute to video classification, since taking the mean magnitude of both local and global motions conveys very little useful information. Following extraction of the mixture energy, the second step aims at eradicating the effect of camera motion and leaving only local motion for modeling. To this end, the motion vector angle is treated as a stochastic variable. For each block, we form the angle distribution through time. We then calculate the angle entropy for that block, and treat it as a measure of the angle variation. The hypothesis behind this is that camera motions are generally consistent throughout the shot. In this sense, blocks with no local motion will only have the motion vectors that result from camera motion, and the angle entropy of these vectors' angle distributions will be low. On the contrary, when blocks have, at some point in the shot, moving objects passing through them, the angle will probably change at that point, increasing the resulting entropy. Before calculating the distribution, we have to quantize vector angles into m orientations. We then form the angle histogram by counting, over time, the number of vector angles for a particular block (i, j) that fall into each particular orientation bin. Subsequently, for each histogram bin t, the corresponding probability p(t) is calculated by:

p(t) = \frac{AH_{i,j}(t)}{\sum_{k=1}^{m} AH_{i,j}(k)}    (2)

where AH_{i,j}(t) denotes the histogram value for block (i, j) and bin t. The angle entropy of that block can then be found by

AngEn_{i,j} = -\sum_{t=1}^{m} p(t) \log p(t)    (3)

We then normalize the angle entropy by dividing by the maximum possible entropy value, log m, and end up with the Global Motion Ratio (GMR):

GMR_{i,j} = \frac{AngEn_{i,j}}{\log m}    (4)

The final step is to calculate the perceived (or local) motion energy for that macroblock. This is achieved by multiplying the Global Motion Ratio with the normalized Mixture Energy of that macroblock:

PMES_{i,j} = GMR_{i,j} \cdot \overline{MixEn}_{i,j}    (5)
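To make the above concrete, here is a minimal Python/NumPy sketch of the PMES computation (Eqs. 1-5); the function name, the array layout and the max-normalization of the mixture energy are our own assumptions, not specified in the paper.

```python
import numpy as np

def compute_pmes(vx, vy, alpha=0.1, m_orient=8):
    """Sketch of the PMES computation (Eqs. 1-5).

    vx, vy: arrays of shape (M, rows, cols) holding the motion vector
            components of each macroblock over the M sampled frames.
    alpha: trimming parameter (0 <= alpha <= 0.5).
    m_orient: number of quantized vector orientations."""
    M = vx.shape[0]
    mag = np.hypot(vx, vy)

    # Eq. 1: trimmed temporal mean of the vector magnitudes per block.
    k = int(np.floor(alpha * M))
    sorted_mag = np.sort(mag, axis=0)
    trimmed = sorted_mag[k:M - k] if M - 2 * k > 0 else sorted_mag
    mix_en = trimmed.mean(axis=0)
    # Normalization to [0, 1]; max-normalization is our assumption here.
    mix_en_norm = mix_en / mix_en.max() if mix_en.max() > 0 else mix_en

    # Quantize vector angles into m_orient orientations.
    angles = np.arctan2(vy, vx)
    bins = np.round(angles / (2 * np.pi / m_orient)).astype(int) % m_orient

    rows, cols = mix_en.shape
    pmes = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            hist = np.bincount(bins[:, i, j], minlength=m_orient).astype(float)
            p = hist / hist.sum()                    # Eq. 2
            nz = p > 0
            ang_en = -(p[nz] * np.log(p[nz])).sum()  # Eq. 3: angle entropy
            gmr = ang_en / np.log(m_orient)          # Eq. 4: Global Motion Ratio
            pmes[i, j] = gmr * mix_en_norm[i, j]     # Eq. 5
    return pmes
```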
After the process has been repeated for every macroblock, we end up with a number of features equal to the number of macroblocks in a frame, which together form a descriptor of the intensity and spatial distribution of local motions in the shot. In our previous research, we demonstrated the PMES measure's descriptive potential by using it to achieve classification based solely on the local motion patterns [19]. However, the number of classes which can be separated based solely on their local motion distribution is small. To broaden the applicability of our approach, we have further developed descriptors for global motion as well as color. An issue that appears with the addition of further descriptors is that the dimensionality of the problem increases. The PMES descriptor consists of a number of
features equal to the number of macroblocks in a frame, which is large enough by itself, and with the addition of further features the dimensionality of the problem might become so high as to lead to overfitting and poor generalization. A simple way to reduce the size of the PMES measure is to average over neighborhoods of blocks. A moving object will most often pass through a series of blocks, increasing the PMES values of all of them. In this sense, the PMES descriptor changes smoothly from block to block, and the reduction to a coarser representation will not radically affect its representation capabilities.

3.2 Global Motion Descriptor

While local motion is extremely important in characterizing a shot, and we have demonstrated the PMES measure's ability to serve as a descriptor for binary classifiers, we should not leave out camera motion: it is clear that different types of shots have different global motion patterns, and thus camera motion can also serve as a descriptor. One major issue is the fact that the camera motion does not have to be consistent throughout the shot. We have defined the shot boundaries on camera transitions, but have not applied any restrictions to camera motions; as long as it is the same take, we cannot control the camera operations. It is thus possible for a shot to contain multiple different camera motions. An efficient descriptor would have to represent all motions that appear in a shot, while maintaining a fixed length so that it can be used with a binary classifier. With these requirements in mind, we developed the Dominant Direction Histogram (DDH) [20].

In order to calculate the Dominant Direction Histogram of a shot, we first quantize all vector angles into 8 directions, and add an extra category for zero-magnitude vectors. Then, for each frame from which we have extracted a vector field, we locate the dominant motion vector direction. Under the common assumption that most pixels belong to the background, this should represent the camera motion for the frame. We then form the dominant direction histogram for the whole shot by counting the number of times each direction has appeared as dominant. However, the number of classes in the histogram should be smaller than 9: in the vast majority of cases, it does not make any difference whether the dominant direction is leftward or rightward, as much as that it is horizontal. The same applies to upward and downward motions, as well as to the diagonal directions. The extra detail very rarely offers any actual information, and could easily lead to overfitting. Thus the final bins in the histogram are four: Horizontal, Vertical (Translation or Rotation), Diagonal (merging all four possible diagonal directions, each of which can result from a mixture of the four camera operations mentioned above), and Static.
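As an illustration, the following sketch (function and threshold names are our own) builds the four-bin DDH from per-frame motion vector fields. It reads the dominant direction of a frame as its most frequent quantized direction, and applies the unit-sum normalization discussed in the next paragraph.

```python
import numpy as np

def dominant_direction_histogram(vx, vy, zero_thresh=0.5):
    """Sketch of DDH extraction from per-frame motion vector fields.

    vx, vy: arrays of shape (M, rows, cols) with the motion vector
            components of the M frames sampled from the shot.
    zero_thresh: magnitude below which a vector counts as zero
                 (an assumption; the paper gives no explicit value).
    Returns the 4-bin histogram [Horizontal, Vertical, Diagonal, Static]."""
    ddh = np.zeros(4)
    for t in range(vx.shape[0]):
        mag = np.hypot(vx[t], vy[t])
        ang = np.arctan2(vy[t], vx[t])
        # Quantize into 8 directions (0 = right, 2 = up, 4 = left, 6 = down)
        # plus an extra category (8) for zero-magnitude vectors.
        dirs = np.round(ang / (np.pi / 4)).astype(int) % 8
        dirs = np.where(mag < zero_thresh, 8, dirs)
        # Take the most frequent category in the frame as the dominant one
        # (our reading of "dominant motion vector direction").
        dominant = int(np.bincount(dirs.ravel(), minlength=9).argmax())
        if dominant == 8:
            ddh[3] += 1            # Static
        elif dominant in (0, 4):
            ddh[0] += 1            # Horizontal (leftward or rightward)
        elif dominant in (2, 6):
            ddh[1] += 1            # Vertical (upward or downward)
        else:
            ddh[2] += 1            # Diagonal (all four diagonal directions)
    return ddh / ddh.sum()         # normalize so all DDHs sum to one
```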
A final issue before the DDH becomes applicable is that we have not restricted ourselves to fixed-length shots, and, as a result, varying numbers of frames have been used for motion vector estimation. Consequently, the DD Histograms as described above will have varying sums. This can easily be solved through normalization, after which all DDHs are made to sum to one.

3.3 Color Descriptors

Following the local and global motion descriptors, the system also needs spatial descriptors for the classes where motion is inadequate to achieve separability. We turned to color, since it is the most distinctive spatial characteristic and the one most deeply explored in the past. As described above, the color histogram and its variations are by far the most popular way of describing a frame's color. Its main disadvantage, and the reason for introducing the color correlogram, is its inability to take the spatial distribution of color into account. However, the color correlogram suffers from high dimensionality, which, in our case, could lead to poor generalization performance. To include spatial distribution information, we used a Weighted Color Histogram (WCH), tailored to the demands of our particular application. In a Weighted Color Histogram, each pixel location in the frame is assigned a weight, and each histogram bin contains the sum of weights of the pixels of the corresponding color. Our weighting scheme separates the frame into three zones along the vertical axis (Upper, Middle and Lower), while placing emphasis on the centre of the horizontal axis. The underlying assumption is that, in TV footage, the camera operator always places objects of interest near the centre of the frame. However, there are many occasions where a clear distinction can be made between the upper and lower parts of the frame, such as in sports or landscape shots, and this is the motivation for the separation along the vertical axis. It should be noted that, since we are unable to define strict boundaries for the regions, we chose smooth weighting functions (Fig. 3). An 8-bin histogram of the H parameter from the HSV color space is formed for each of the three regions. The S and V parameters are represented in a far more compact way, namely as weighted means. We use a weight function derived from the same principle as for the H histograms: the centre of the frame should be distinguished from the rest. However, the compactness of the descriptor allows us to take the outer regions into account as well, without any significant increase in dimensionality. We thus end up with four descriptors for S and V, namely Sinner, Souter, Vinner and Vouter, resulting from the weight functions of Fig. 4. Their smoothness, once again, guarantees the robustness of the descriptors.
Figure 3 The three Gaussians used as weight functions: upper, middle and lower.
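A possible implementation of the WCH and of the four weighted S and V means is sketched below in Python. OpenCV is assumed for the HSV conversion, and the Gaussian centres and widths, as well as the choice of the outer weight as the complement of the inner Gaussian, are illustrative guesses since the paper does not specify them.

```python
import numpy as np
import cv2

def weighted_color_features(frame_bgr, h_bins=8):
    """Sketch of the WCH (three 8-bin H histograms) plus the four weighted
    S and V means, for a uint8 BGR frame. The Gaussian widths and centres
    are illustrative guesses; the paper does not specify them."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV).astype(float)
    h = hsv[:, :, 0] / 180.0           # OpenCV hue range is [0, 180)
    s = hsv[:, :, 1] / 255.0
    v = hsv[:, :, 2] / 255.0
    rows, cols = h.shape
    y = np.arange(rows)[:, None] / (rows - 1)       # 0 (top) .. 1 (bottom)
    x = np.arange(cols)[None, :] / (cols - 1)       # 0 (left) .. 1 (right)

    # Smooth emphasis on the centre of the horizontal axis.
    centre = np.exp(-(x - 0.5) ** 2 / (2 * 0.25 ** 2)) * np.ones((rows, 1))
    # Three smooth vertical zones (upper, middle, lower), as in Fig. 3.
    zones = [np.exp(-(y - c) ** 2 / (2 * 0.2 ** 2)) * np.ones((1, cols))
             for c in (0.15, 0.5, 0.85)]

    features = []
    bins = np.minimum((h * h_bins).astype(int), h_bins - 1)
    for zone in zones:
        w = zone * centre
        hist = np.bincount(bins.ravel(), weights=w.ravel(), minlength=h_bins)
        features.extend(hist / hist.sum())           # weighted 8-bin H histogram

    # Inner/outer weights for the S and V means (Fig. 4); treating the
    # outer weight as the complement of the inner Gaussian is our guess.
    inner = np.exp(-((x - 0.5) ** 2 + (y - 0.5) ** 2) / (2 * 0.2 ** 2))
    outer = 1.0 - inner
    for channel in (s, v):
        for w in (inner, outer):
            features.append((channel * w).sum() / w.sum())   # S/V inner, outer
    return np.array(features)                         # 3 * 8 + 4 = 28 values
```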
A final issue concerning the color features is key frame extraction. A video shot is a sequence of frames, any one of which could be used for the calculation of the spatial descriptors. Extracting descriptors from all of them would lead to a huge feature vector and too much redundancy, given the continuous nature of the shots. A fast, simple and efficient approach to the problem is proposed by Ferman et al. [21]. To choose a frame's histogram, we estimate the sum of absolute differences between each frame's histogram and the histograms of all other frames, and then choose the frame whose histogram minimizes this quantity:

Key_H = \arg\min_{H_k} \left\{ \sum_{l \neq k} \| H_l - H_k \| \right\}    (6)
where H_n is the histogram of the nth frame, and the histogram difference is defined as the sum of absolute differences of the corresponding bins. Ferman et al. state that this approach is computationally costly. This is true if we take all the frames of a shot into account. However, a shot is a continuous take and, as such, has a certain degree of continuity, which means that each frame differs only slightly from the next. In our modification [20], we take only five frames from each shot, evenly distributed in time. This greatly reduces the computational cost and makes the method feasible without reducing its efficiency.

Figure 4 The two Gaussians used for estimating the S and V means: inner and outer.
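Restricted to the five sampled frames, the selection rule of Eq. 6 can be written very compactly; the sketch below assumes each sampled frame is already represented by its color feature vector (or histogram).

```python
import numpy as np

def select_key_frame(frame_features):
    """Pick the frame whose color features have the smallest summed L1
    distance to those of the other sampled frames (Eq. 6).

    frame_features: array of shape (n_frames, n_features), e.g. the color
    descriptors of the five sampled frames. Returns the chosen index."""
    F = np.asarray(frame_features, dtype=float)
    # Pairwise sums of absolute differences between all frame descriptors;
    # the zero self-distance does not affect the argmin.
    dists = np.abs(F[:, None, :] - F[None, :, :]).sum(axis=2)
    return int(dists.sum(axis=1).argmin())
```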
Ferman et al. apply their method to a frame's color histogram. In our case, we apply it to the whole color feature vector, consisting of the three local H Weighted Color Histograms, as well as the four S and V means. We thus end up with the color descriptors of a single frame, which have demonstrated the greatest similarity to the descriptors of all other frames and can thus be claimed to represent them most accurately.

3.4 Classifier

Our aim was binary classification, and to this end, we turned to Support Vector Machines (SVMs). SVMs have become an extremely popular classifier in recent years, for a number of reasons. Besides being extremely fast, they have good generalization capabilities with a very strong theoretical foundation [22]. Also, the SVM training problem has a unique, globally optimal solution. Finally, being kernel-based, they have the ability to map a given feature space into a higher-dimensional space at a very small extra computational burden. For these reasons, SVMs have seen successful application in many different fields, multimedia classification included. However, in most cases, SVMs have been used as a tool for relevance feedback [23, 24]: after an initial search is performed, the user is asked to select the most relevant results returned, and an SVM is then trained on-line using these results to bring back exactly the types of documents the user had in mind.
Figure 5 Top row: one image from each class: Team Sports, Soccer, Speaker, Newscast, and Interview. Bottom row: five images not belonging to any class.
In our research, we use a different approach. Since the classification that we have in mind will consist of a predetermined number of classes, we use the SVM as a classifier to decide which shots belong to each class, within a stationary framework. Thus, for each class, an SVM is trained on an initial set of examples that belong to that class (positive examples) as well as examples that do not (negative examples), and its efficiency in classifying future examples is evaluated on a test set of both positive and negative examples.
4 Implementation

4.1 The Video Database

We were offered a part of the archives of the local TV station Omega TV (Ωμέγα TV), Thessaloniki, Greece. The archive was manually cut into 1,074 shots of 720×576 pixels, with content ranging from head-and-shoulders videos such as newscasts or speeches to sports or concerts. The shot length ranged from 9 to 512 frames, with a mean length of 178.7 frames. We resorted to manual segmentation in order to concentrate on the feature selection and extraction process; the fact that the video shots and the class distinctions came from the real world meant that we could directly evaluate our feature set's ability to separate between classes with immediate applicability in mind (Fig. 5).

4.2 Feature Extraction

In order to calculate the modified PMES and the DDH features, we needed the shots' motion fields. To this end, we applied a block-matching technique but, instead of estimating the motion vectors over consecutive frames, we imposed a temporal distance of 7 frames over which the vectors were estimated. The reason for this is that this
approach is significantly more robust to noise: assuming erroneous vector estimations remain of generally the same small size regardless of the temporal distance, an increased temporal distance makes the correct estimations larger and more distinctive. We used a sparse vector field of 11×9 macroblocks, each corresponding to a 64×64 pixel block. This gave a PMES descriptor of 99 features, which was then reduced to 30 by averaging over neighborhoods of 2×2 macroblocks. The same motion vectors were used to extract the DDH feature. The color features described in Section 3.3 were also extracted, namely the WCH plus the four means Sinner, Souter, Vinner and Vouter. For their extraction, five frames were selected from each shot, namely the first, the last and three more evenly distributed through time, and the frame whose color features minimized the sum of absolute differences from the others was chosen. The final feature vector consisted of 30 features for the PMES, 4 for the DDH, 3×8=24 for the WCH and four for the S and V means, summing up to 62 features (Fig. 6).

Figure 6 The final feature vector, its components and their lengths: PMES (30), DDH (4), WCH (3×8) and S, V means (4).

4.3 Classification

As an SVM application, we used Thorsten Joachims' SVMlight implementation, since it is very fast and particularly user-friendly [25]. The input vectors were 62 features long, plus a label that defined a shot as a positive or negative example for the class being trained at the time. Our aim is a complete classification scheme which, given a predefined set of classes, will be able to put each given video into the appropriate category. The best approach to this, given
the TV station's needs, is a hierarchical tree, ranging from generic classes to very narrow and usually show-specific ones. This means that in some cases certain classes will be subgroups of others, while other classes may mix elements from more than one high-level (generic) class. In our tests, given a limited amount of material, we applied classification to the classes that had enough examples and made sense with our current feature set. The classes we finally decided to test our feature set on were five:

- Team Sports, which contained basketball, soccer, water polo and volleyball, consisting of 250 shots. This is the class we successfully applied the PMES feature on in [19].
- Soccer, being the sub-class of Team Sports with the most examples, consisting of 174 shots.
- Speaker, consisting of all the shots where a person was speaking in front of the camera, be it a newscast or an interview from the street, under the sole assumptions that the person took up at least 15% of the frame and was generally placed near the centre of the frame, mostly looking towards the camera. There were 157 shots fulfilling the prerequisites.
- Newscast, which was a subgroup of Speaker, containing all the news shots, consisting of 16 shots, and
- Interview, which contained all the shots from a particular talk-show program, consisting of 19 shots. In this class, a number of shots also belonged to Speaker, but others did not.
The five classes cover a wide range, from broad ones with a general semantic interpretation, such as Team Sports or Speaker, to the station-specific Newscast and Interview, where, due to their specialization (e.g. all the Interview shots were taken in the same TV studio, with the same background), classification is easier but higher success rates are demanded. Representative frames from shots of all classes, as well as some examples that did not belong to any class, appear in Fig. 5.

After tests with linear, polynomial, and radial basis kernels, we concluded that the linear one achieved much better generalization, since the more complex ones tended to overfit the data. One issue to be dealt with was the fact that SVM classification is sensitive to the relative number of positive and negative examples, in the sense that, if one side is significantly smaller than the
other, the classification shifts in favor of the more numerous side. This meant that, in the classes with few positive examples, false negative rates were increased. To deal with this swiftly and efficiently, we simply added 3 instances of each positive example into the training set.
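The experiments used SVMlight; purely as an illustration of the training setup described above (linear kernel, each positive example replicated three extra times), an equivalent sketch with scikit-learn might look as follows. Variable names and the data loading are placeholders.

```python
import numpy as np
from sklearn.svm import SVC

def train_class_svm(X, y, extra_positive_copies=3):
    """Train a binary linear SVM for a single semantic class.

    X: (n_shots, 62) feature vectors (PMES, DDH, WCH, S/V means).
    y: +1 for shots belonging to the class, -1 otherwise.
    Positive examples are duplicated to counter class imbalance,
    mirroring the simple scheme described in the text."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    pos = y == 1
    X_train = np.vstack([X] + [X[pos]] * extra_positive_copies)
    y_train = np.concatenate([y] + [y[pos]] * extra_positive_copies)
    clf = SVC(kernel="linear")        # linear kernel, as in the paper
    clf.fit(X_train, y_train)
    return clf
```

A class-weighting option (e.g. scikit-learn's class_weight) would be an alternative way to counter the imbalance, but the duplication above mirrors the scheme used in the paper.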
5 Results

The classification results appear in Table 1. We used four measures to evaluate the performance of our features. The initial classification results with respect to a certain class can be split into four groups: True Positives (shots belonging to the class classified as such), True Negatives (shots not belonging to the class classified as such), False Positives (shots not belonging to the class but classified as belonging to it) and False Negatives (shots belonging to the class but classified as not belonging to it). From these values, a number of measures can be derived. Precision is the number of True Positives as a percentage of the total number of shots classified as positives:

Precision = \frac{TP}{TP + FP}    (7)

Recall is the number of True Positives as a percentage of the total number of shots that should have been classified as positives:

Recall = \frac{TP}{TP + FN}    (8)

As a measure of the overall success of a feature set, we used the harmonic mean of Precision and Recall, with Recall weighted twice as much as Precision:

F_2 = \frac{3 \cdot Precision \cdot Recall}{2 \cdot Precision + Recall}    (9)
In our case, Recall is considerably more important than Precision, since, when querying the station's archives, our main concern is that as many of the relevant shots as possible are returned. Of course, we also want to keep irrelevant shots out of the way, and in this sense we are interested in keeping Precision as high as possible.
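For concreteness, the measures can be computed as below; plugging the Team Sports precision and recall from Table 1 into Eq. 9 reproduces the reported Overall value.

```python
def precision(tp, fp):
    return tp / (tp + fp)                     # Eq. 7

def recall(tp, fn):
    return tp / (tp + fn)                     # Eq. 8

def overall(p, r):
    return 3 * p * r / (2 * p + r)            # Eq. 9, recall weighted twice

# Team Sports row of Table 1: Precision 67%, Recall 79.6%
print(round(overall(67.0, 79.6), 1))          # 74.9, matching the Overall1 column
```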
Table 1 The classification results for the five classes.

Class         Precision (%)   Recall (%)   Overall1 (%)   Overall2 (%)   Overall3 (%)
Team sports   67              79.6         74.9           72.9           60.8
Soccer        78.8            89.7         85.7           84.1           79.3
Speaker       68.3            88.9         80.7           73.6           51.3
Newscast      100             100          100            78.7           96
Interview     75              94.7         87.1           83.3           80.3
For comparison, we present the overall results for two other descriptor sets: Overall2 consists of our motion descriptors combined with traditional color histograms, while Overall3 consists solely of color histograms. The results demonstrate that the overall performance of the descriptor set presented here is quite high, and in every case superior to the simpler approaches. The more generic classes, Team Sports and Speaker, do show lower success rates than the narrower ones, but that was to be expected. It is worth noting that Newscast, because of the similar patterns that appear in all its shots, defines a very small area in our feature space, leading to perfect classification.
6 Conclusions and Future Work

We have demonstrated the ability of our set of low-level descriptors to achieve good classification results on a wide range of semantic classes, from specific to broad ones. The fact that the data came from a real-world database makes the results even more significant. This does not mean, however, that the results at this point are perfect. Our work will continue on improving the descriptors presented above, as well as on adding descriptors for other aspects of the shots. Spatial descriptors for texture and shape are the immediate next step in extending our scheme to more classes and improving the success rates for the existing ones, until we are able to present a full classification scheme containing all the semantic classes necessary for the indexing of a general video archive.
References 1. Datta, R., Li, J., Wang, J. Z. (2005). Content-based image retrieval: approaches and trends of the new age. Proceedings of the 7th International Workshop on Multimedia Information Retrieval, in conjunction with ACM International Conference on Multimedia, pp. 253–262. 2. Smeulders, A. W. M., Worring, M., Santini, S., Gupta, A., & Jain, R. (2000). Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 1349–1380. doi:10.1109/34.895972. 3. Koprinska, I., & Carrato, S. (2001). Temporal video segmentation: a survey. Signal Processing: Image Communication, 8, 477–500. doi:10.1016/S0923-5965(00)00011-4. 4. Nagasaka, A., Tanaka, Y. (1991) Automatic video indexing and full-video search for object appearances. Proceedings of the IFIP TC2/WG 2.6 Second Working Conference on Visual Database Systems II, pp 113–127. 5. Ardizzone, E., Gatani, L., La Cascia, M., Lo Re, G., Ortolani, M. (2006) Advances in Multimedia Modelling. Springer Berlin, chapter “A P2P Architecture for Multimedia Content Retrieval,” pp. 462–474. 6. Chen, J. F., Liao, H. Y. M., Lin, C. W. (2005) Knowledge-Based Intelligent Information and Engineering Systems. Springer Berlin/ Heidelberg, chapter “Fast Video Retrieval via the Statistics of Motion Within the Regions-of-Interest”.
J Sign Process Syst (2010) 61:75–83 7. Fablet, R., Bouthemy, P., & Pérez, P. (2002). Non-parametric motion characterization using causal probabilistic models for video indexing and retrieval. IEEE Transactions on Image Processing, 11, 393–407. doi:10.1109/TIP.2002.999674. 8. Fablet, R., & Bouthemy, P. (2000). Statistical motion-based object indexing using optic flow field. IEEE International Conference on Pattern Recognition, 4, 287–290. 9. Piriou, G., Bouthemy, P., & Yao, J. F. (2006). Recognition of dynamic video contents with global probabilistic models of visual motion. IEEE Transactions on Image Processing, 15, 3418–3431. 10. Shih, H. C., Huang, C. L. (2003). Image analysis and interpretation for semantics categorization in baseball video. IEEE International Conference on Information Technology: Coding and Computing [Computers and Communications], pp 379–383. 11. Ferman, A. M., Tekalp, A. M., & Mehrotra, R. (1998). Effective content representation for video. IEEE International Conference on Image Processing, 3, 521–525. 12. Jeannin, S., & Divakaran, A. (2001). MPEG-7 visual motion descriptors. IEEE Transactions on Circuits and Systems for Video Technology, 11, 720–724. doi:10.1109/76.927428. 13. Chia-Han, L., & Chen, A. L. P. (2001). Processing concept queries with object motions in video databases. IEEE International Conference on Image Processing, 2, 641–644. 14. Zhen-Hua Zhang, Yong Quan, Wen-Hui Li, Wu Guo (2006). A new content-based image retrieval. Machine Learning and Cybernetics, IEEE International Conference on, pp 4013–4018. 15. Sural, S., Quian, G., & Pramanik, S. (2002). Segmentation and Histogram Generation Using the HSV Color Space for Image Retrieval. Proceedings. International Conference on Image Processing, 2, 589–592. 16. Rautiainen, M., & Doermann, D. (2002). Temporal Color Correlograms for Video Retrieval. Proceedings, International Conference on Pattern Recognition, 2, 589–592. 17. Williams, A., & Yoon, P. (2007). Content-based image retrieval using joint correlograms. Multimedia Tools and Application, 34, 239–248. doi:10.1007/s11042-006-0087-2. 18. Yu-Fei, Ma, & Hong-Jiang, Zhang (2001). A new perceived motion based shot content representation. IEEE International Conference on Image Processing, 3, 426–429. 19. Zampoglou, M., Papadimitriou, T., Diamantaras, K. I. (2007). Support Vector Machines Content-Based Video Retrieval Based Solely on Motion Information. Proc. 17th Int. Workshop on Machine Learning for Signal Processing, IEEE, Thessaloniki, Greece, pp 176–180. 20. Zampoglou, M., Papadimitriou, T., Diamantaras, K. I. (2008). Integrating Motion and Color for Content-Based Video Classification. 2008 IAPR Workshop on Cognitive Information Processing, Santorini, Greece. 21. Ferman, A. M., Tekalp, A. M., & Mehrotra, R. (2002). Robust Color Histogram Descriptors for Video Segment Retrieval and Identification. IEEE Transactions on Image Processing, 11, 497– 508. doi:10.1109/TIP.2002.1006397. 22. Cristianini, N., Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines. Cambridge University Press. 23. Zhang, L., Fuzong Lin, Bo Zhang (2001). Support vector machine learning for image retrieval. International Conference on Image Processing, pp 721–724. 24. Mezaris, V., Kompatsiaris, I., Boulgouris, N. V., & Strintzis, M. G. (2004). Real-time compressed-domain spatiotemporal segmentation and ontologies for video indexing and retrieval. IEEE Transactions on Circuits and Systems for Video Technology, 14, 606–621. doi:10.1109/TCSVT.2004.826768. 25. Joachims, T. 
(1999). Making large-scale SVM learning practical. In Schölkopf, B., Burges, C., & Smola, A. (Eds.), Advances in Kernel Methods - Support Vector Learning (pp. 169–184). MIT Press.
Markos Zampoglou was born in Thessaloniki, Greece, in 1981. He received the diploma degree in Applied Informatics from the University of Macedonia, Thessaloniki, in 2004 and a Master's degree in Artificial Intelligence from the University of Edinburgh. He is currently a Ph.D. student at the Department of Applied Informatics, University of Macedonia. His research interests include digital signal and image processing, machine learning and data analysis.

Theophilos Papadimitriou was born in Thessaloniki, Greece, in 1972. He received the diploma degree in Mathematics from the Aristotle University of Thessaloniki, Greece, and the D.E.A. A.R.A.V.I.S. (Automatique, Robotique, Algorithmique, Vision, Image, Signale) degree from the University of Nice-Sophia Antipolis, France, both in 1996, and the Ph.D. degree in electrical engineering from the Aristotle University of Thessaloniki in 2000. In 2001, he joined the Department of International Economic Relations and Development, Democritus University of Thrace, Komotini, Greece, where he served as a lecturer (2002-2008). He currently holds the position of Assistant Professor in the same department. Dr. Papadimitriou has served as a reviewer for various publications and as a scientific committee member for conferences and workshops. In 2007 he was a member of the organizing committee of the IEEE Workshop on Machine Learning for Signal Processing held in Thessaloniki, Greece. His current research interests include digital signal and image processing, data analysis, and neural networks.

Konstantinos I. Diamantaras was born in Athens, Greece, in 1965. He received his Diploma in Electrical Engineering from the National Technical University of Athens in 1987 and the Ph.D. degree, also in electrical engineering, from Princeton University, Princeton, NJ, in 1992. Subsequently, he joined Siemens Corp. Research, Princeton, as a Post-Doctoral Researcher, and in 1995, he worked as a researcher with the Department of Electrical and Computer Engineering, Aristotle University of Thessaloniki, Thessaloniki, Greece. Since 1998, he has been with the Department of Informatics, Technological Education Institute of Thessaloniki, where he currently holds the position of Associate Professor and Chairman. His research interests include signal processing, neural networks, image processing, and VLSI array processing. Since 1997, he has been serving as editor for the Journal of VLSI Signal Processing (Springer). Dr. Diamantaras served as Associate Editor for the IEEE Transactions on Neural Networks from 1999 to 2000. In 1997, he was co-recipient of the IEEE Best Paper Award in the area of Neural Networks for Signal Processing. He is the author of the book Principal Component Neural Networks: Theory and Applications, co-authored with S. Y. Kung (New York: Wiley, 1996). He is currently a member of the IEEE Machine Learning for Signal Processing (MLSP) Technical Committee and the IEEE Signal Processing Theory and Methods (SPTM) TC. He has been a member of the organizing committee of ICIP-2001 and a technical committee member for various international signal processing and neural networks conferences. He is a member of the Technical Chamber of Greece.