Using Score Distributions for Query-time Fusion in Multimedia Retrieval

Peter Wilkins, Paul Ferguson and Alan F. Smeaton
Centre for Digital Video Processing & Adaptive Information Cluster, Dublin City University, Ireland
{pwilkins, pferguson, asmeaton}@computing.dcu.ie
ABSTRACT
In this paper we present the results of our work on the analysis of multi-modal data for video Information Retrieval, where we exploit the properties of this data for query-time, automatic generation of weights for multi-modal data fusion. Through empirical testing we have observed that for a given topic, a high-performing feature, that is one which achieves high relevance, will have a different distribution of document scores when compared against those that do not perform as well. These observations form the basis for our initial fusion model, which generates weights based on these properties without the need for prior training. Our model can be used not only to combine feature data, but also to combine the results of multiple example query images and apply weights to these. Our analysis and experiments were conducted on the TRECVid 2004 and 2005 collections, making use of multiple MPEG-7 low-level features and automatic speech recognition (ASR) transcripts. Results from our model achieve performance on a par with that of ‘oracle’ determined weights, and demonstrate the applicability of our model whilst advancing the case for further investigation of score distributions.
Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Retrieval Models

General Terms
Measurement, Experimentation

Keywords
multimedia fusion, automatic weight determination
1. MOTIVATION
Video Information Retrieval (IR) by its nature requires the combination of multiple modalities to create a response
to a user's information need. These modalities range from low-level visual data which describes the colours that compose an individual keyframe, to motion characteristics of a shot, to the extracted speech uttered within the video. Each piece of this information puzzle could potentially contribute something meaningful to the final retrieval result, and the problem faced by an information system is how to combine these disparate pieces of information into a coherent retrieval response. Often an information system, in an attempt to improve the quality of its results, will employ weights in an effort to boost those features which should contribute more to a final retrieval ranking, and to reduce the impact of features which contribute more noise. However, whilst using weights to merge different types of data is prudent, the process of generating these weights, and having those weights accurately reflect the quality of the underlying data, is far from trivial. Many approaches have been devised to accomplish this aim. These vary from iterative manual selection of weights based on training data, to relevance feedback approaches, to the use of support vector machines (SVM) to learn the weights to apply in particular scenarios. Each of these approaches requires either that some form of training data be available for the collection, or that multiple iterations of the same query be executed. In the case of training data, there needs to be an accompanying set of example queries which can be used to tune the system to handle the range of queries that the information system may receive. Some form of annotation or relevance assessment is also required to measure the impact these approaches have on result quality. Alternatively, feedback approaches rely on multiple iterations of the same query and on an assumption that the top rankings of a feature result list will contribute towards an improvement in result quality. These contrast with our approach, which utilizes the distribution of the scores of documents returned by a feature as a measure of the quality of that feature when compared against other features for that query. As such, our approach is a one-pass weight generation scheme that is query-dependent and does not rely on the presence of training data or relevance assessments. The organization of the remainder of this paper is as follows. Section 2 describes related work in the domain of score analysis and its uses in broader information retrieval. Section 3 details our analysis of the properties of score distributions of feature data used for video IR. This section will outline our empirical observations of this data and put forward our working hypothesis on the correlation of these
observations to retrieval performance. Section 4 outlines our proposed weight generation model (and sub-variants), as well as providing a complete description of our fusion framework. It includes a description of the stages at which we fuse data, as well as discussing the normalization and combination strategies that we employ. Section 5 is our experimental section, where we conduct fully automatic retrieval experiments against the 2004 and 2005 TRECVid collections, varying the automatic weight generation techniques, normalization, and combination strategies. Finally, we close with Section 6, where we examine the outcomes of our experiments, detail our continuing work and theorize on how these techniques could be used to complement, rather than replace, existing weight generation approaches.
2. RELATED WORK
There has been much work done in the area of text retrieval, and in particular web retrieval, with regard to the combination of different sources of retrieval evidence. This combination may take results from multiple search engines, as is the case for meta-search engines, or from within a single search engine architecture where there may be ranked results generated from multiple representations of the same document, such as document text, titles, anchor text and linkage structure. Each of these may produce a quite different ranking of documents in response to a query and, as with combining any sources of information, the goal is to gather all these sources together and use them to produce a more confident final result. One way of dealing with the combination of these multiple sources of information is to explicitly model this in the system. For example, the INQUERY system is based on a probabilistic model designed to combine evidence from multiple representations of documents [21]. In [9], Graves and Lalmas used the same inference network model to capture structural context in video retrieval, though they did not combine scores from different retrieval elements in the way we do here. An alternative approach to modeling multiple representations is to develop a system that can combine multiple results that are potentially generated using different retrieval models [8, 3]. As with multimedia retrieval, and in particular content-based image retrieval (CBIR), there are many different representations of the data involved, e.g. at a low level it is the colours, shapes, textures etc. that characterise an image, and an accurate combination of these features should provide a more precise representation of an image as a whole. Many current CBIR systems combine these low-level image features using approaches such as CombSUM [8], as discussed in Section 4.5. An alternative technique is to combine multiple results based on the distribution of scores within each of the results being combined [12]. Here Manmatha et al. found that the frequency distribution of scores of non-relevant documents follows an exponential distribution, whereas that of relevant documents follows a Gaussian distribution. They then showed how these distributions can be used to map the scores to probabilities, which can be used to combine multiple results. They found that this worked as well as many of the other available combination techniques, such as those discussed in [8]. Relevance Feedback (RF) has long been used in information retrieval as a mechanism for improving upon initial search results. The RF process attempts to better
capture the information need of the user through an iterative process of system feedback and query refinement. RF was initially introduced in [18] for document retrieval and was later introduced into image retrieval [2, 13, 17]. Much work has since been done in this area and RF remains an active area of research in image retrieval, due to the difficulty of interpreting the semantic meaning of images by the search system alone. As discussed by Rui and Huang in [16], there are essentially two components requiring relevance feedback in image retrieval: “One is an appropriate transformation that maps the original visual feature space into a space that better models user desired high-level concepts . . . the other important component is the ‘ideal’ query in the users mind”. Not all RF explicitly needs user interaction, and this first component in particular can be “trained” using queries along with a corresponding set of relevant images. However, this training process can be time-consuming, runs the risk of becoming over-trained on the specific training set, and may only be of use on a particular image collection. Our approach aims to better combine image features by using the score distributions of those features to estimate the optimal way to combine them, using approaches such as those in [8] as well as others discussed in Section 4.5, in a way that requires no training through user relevance feedback.
3. PROPERTIES OF MULTI-MODAL DATA
Our work investigating automatic fusion techniques has made use of collections of TRECVid [20] data. TRECVid is an annual workshop run by NIST which aims to promote and evaluate content-based retrieval of digital video data. TRECVid provides not only collections of digital video data, but also sets of retrieval queries and relevance assessments for these, which allows us to evaluate our overall retrieval techniques. A TRECVid collection consists of digital video, a common shot-boundary reference, keyframes extracted from these shot boundaries and automatic speech recognition (ASR) transcripts of dialogue within the collection, in addition to a set of queries and relevance judgments. For our investigation we made use of the keyframes and the ASR. Keyframes were indexed using MPEG-7 visual features, whilst ASR was indexed by a standard text search engine. In this section we will first discuss the techniques used to create our low-level data, both visual and textual. Then we present our analysis of the properties of this data when retrieved, and demonstrate the correlation to performance.
3.1 Visual Features
For our visual feature extraction, we make use of three MPEG-7 [4] visual features, which were extracted using the aceToolBox, developed as part of our participation in aceMedia [1]. The following description of the toolbox and visual features is referenced from our earlier TRECVid work [5]: In this first version of the toolbox, colour feature grouping is performed by Recursive Shortest Spanning Tree (RSST). The original RSST algorithm is a relatively simple and fast region-growing method. It starts from pixel level and iteratively merges regions (two regions per iteration) according to the distance calculated using colour features and region size. The process stops when the desired number of regions is obtained. For our experiments we processed images using the following three descriptors.
• A Local Colour Descriptor (Colour Layout - CLD) is a compact and resolution-invariant representation of colour in an image. The colour information of an image is partitioned into 64 (8x8) blocks and then the representative colour of each block is determined by using the average colour in each block.

• A Global Colour Descriptor (Scalable Colour - SCD) measures colour distribution over an entire image. It is defined in the hue-saturation-value (HSV) colour space and produces a 256-bin colour histogram, normalised, non-linearly mapped into a four-bit integer value, and then encoded by a Haar transform to obtain a 32-bin histogram.

• An Edge Histogram Descriptor (EHD) is designed to capture the spatial distribution of edges by dividing the image into 4x4 subimages (16 non-overlapping blocks); edges are then categorized into 5 types (0◦, 45◦, 90◦, 135◦ and ‘non-directional’) in each block. The output is a 5-bin histogram for each block, giving a total of 5x16 = 80 histogram bins.

More details on these descriptors can be found in [11]. Video shots, when retrieved using each visual feature, were each ranked using the Euclidean distance metric, as sketched below.
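The exact distance computation inside the aceToolBox indexes is not given in this paper; the following minimal sketch simply illustrates ranking shots by Euclidean distance between descriptor vectors. Function and variable names are illustrative only.

```python
import math

def euclidean_distance(a, b):
    # Euclidean distance between two descriptor vectors (e.g. 80-bin EHD histograms).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_shots(query_vector, shot_vectors):
    # shot_vectors: dict of shot_id -> descriptor vector.
    # Smaller distance means a better match, so sort ascending.
    scored = [(shot_id, euclidean_distance(query_vector, vec))
              for shot_id, vec in shot_vectors.items()]
    return sorted(scored, key=lambda item: item[1])
```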
3.2 Textual Features
The other feature we used in our experiments was the ASR transcript provided with each video. We provided a standard text search facility over this data, using our in-house search engine Físréal [7]. Given a text query, the search engine will return a ranked list of shots which contain that text, with the ranking performed using BM25 term weighting [15]. However, the text associated with a shot does not always describe the visual content of that shot, particularly in TV news data, as the anchor may be discussing events other than what is currently on-screen. To address this we implemented a weighted windowing scheme where, for a given matched shot, the preceding and following two shots were re-weighted with a degraded weight and added back into the result list. This gives us an element of context for each shot, as its neighborhood of adjacent shots is also considered part of the shot's ASR description.
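A minimal sketch of this windowing idea follows. The paper does not state the exact degradation factor nor how neighbour contributions are merged with existing scores, so the decay value and the use of max() below are assumptions for illustration.

```python
def expand_with_neighbours(text_hits, shot_order, window=2, decay=0.5):
    # text_hits:  dict of shot_id -> BM25 score for shots whose ASR matched the query.
    # shot_order: list of shot_ids in temporal order for one video.
    # decay:      assumed degradation factor per shot of distance (not specified in the paper).
    position = {shot: i for i, shot in enumerate(shot_order)}
    expanded = dict(text_hits)
    for shot, score in text_hits.items():
        pos = position[shot]
        for offset in range(1, window + 1):
            for neighbour_pos in (pos - offset, pos + offset):
                if 0 <= neighbour_pos < len(shot_order):
                    neighbour = shot_order[neighbour_pos]
                    degraded = score * (decay ** offset)
                    # Keep the strongest evidence seen for each neighbouring shot.
                    expanded[neighbour] = max(expanded.get(neighbour, 0.0), degraded)
    return expanded
```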
3.3 Analysis of Properties of Feature Data
Our analysis of feature data centered on examining the distribution of scores within a single result list, and contrasting these distributions against those of the other features for that query. In work similar to that of Manmatha et al. [12], we use these distributions as an indication of the quality of the feature for the current query. We note at this stage that, because we want to cross-compare features, all results are normalized to the range [0:1]. Our initial hypothesis was formed after examination of the results for a single feature and how these relate to retrieval performance. We observed that for a feature which performed well, its top-ranked documents underwent a more dramatic change in score than those of features which performed poorly. For example, Figure 1 plots the averaged score distribution (over 24 topics) of each feature we used, truncated to the top 1000 results.
[Figure 1: Feature Results '04 — averaged normalized score plotted against rank (top 1,000 results) for the colour layout, colour structure, edge and text features.]
We observe from this graph that the text feature undergoes the largest initial change, followed by the edge feature, with the two colour features approximately equivalent. Table 1 presents the performance of each of these features on the TRECVid 2004 collection.

Name             MAP     P@5     P@10    Recall
Text             0.0523  0.1739  0.1304  23%
Edge             0.0190  0.0696  0.0652  14%
Colour Layout    0.0072  0.0348  0.0217  0.9%
Colour Struct.   0.0019  0.0087  0.0109  0.9%

Table 1: 2004 Feature Results
A correlation appears to exist between the features which exhibit the greatest degree of change and the features which perform the best. In this case we can observe that the text feature has undergone the greatest amount of change in score early in the ranking. For example, at document rank 100 its score is 0.29, down from 1.0. Edge was the second-best performer, and at the same rank position scored 0.33. Colour layout and colour structure both scored 0.4 at rank position 100. Our hypothesis as to why this correlation occurs comes back to an observation about the differences between the scores, rather than the scores themselves. A feature which exhibits this rapid change at the beginning can be thought of as having found many diverse yet interesting results, whereas a feature with a long series of documents whose scores decrease linearly can be thought of as having found many documents that are not very dissimilar from one another, and is therefore not offering much new information. However, as noted earlier, the data just presented is an average over the 24 topics for 2004, and for individual topics certain features will outperform others. For example, Figure 2 plots score distributions for TRECVid 2004 topic 135, displaying the distribution for all features over the top 1000 results. From this graph we see that the edge feature undergoes the greatest change, followed by text and colour structure, with colour layout the most stable. Table 2 contains the relevance figures for this topic.
[Figure 2: Topic 135 — normalized score plotted against rank (top 1,000 results) for the colour layout, colour structure, edge and text features.]

Name             MAP     P@5   P@10   Recall
Text             0.1611  0.6   0.3    85%
Edge             0.3214  0.6   0.4    66%
Colour Layout    0.0154  0.0   0.0    40%
Colour Struct.   0.0032  0.0   0.0    22%

Table 2: Topic 0135 Feature Results

Two observations emerge from these results. Firstly, the rapid decline in the score of the edge feature correlates with the relevance figures showing edge to be the best performer. Secondly, however, we can equally observe that on a pure performance ranking text far outperforms the colour structure feature, yet on the plot these two features track very closely and it would be difficult to infer from the raw plot alone which was the better-performing feature. As previously, this plot is over the top 1000 results, normalized for each list. Figure 3 plots the same data, but over the entire collection.

[Figure 3: Topic 135, Whole Collection — normalized score plotted against rank over the full collection (approximately 33,000 shots) for all four features.]

As can be seen, the data distributions appear to change when viewed at the collection level. In this case the feature which exhibits the most initial change is text, followed clearly by edge. From this examination we get a more sparse space in which to plot our data, which makes determining which features have the greatest change easier. However, in this case it would appear from the plot that text is the best performer, followed by edge, yet we know from the relevance data that this is reversed. It did, however, clearly discriminate the two best-performing features from the others. What makes the analysis more difficult is that whilst the visual features span the range of the collection, the text feature results for this topic only account for 27% of the collection. This same property can be seen in the averaged results for each feature on the 2005 collection. Table 3 shows the relevance assessments for our data.

Name             MAP     P@5     P@10    Recall
Text             0.0194  0.0875  0.0896  9%
Edge             0.0558  0.175   0.18    10%
Colour Layout    0.0113  0.075   0.075   7%
Colour Struct.   0.0061  0.067   0.05    7%

Table 3: 2005 Feature Results

[Figure 4: 2005 Feature Results, top 1000 — averaged normalized score plotted against rank (top 1,000 results) for all four features.]

Figure 4 shows score distributions for the top 1000 retrieved shots and illustrates that the edge feature undergoes the greatest initial change, followed by text, which correlates with our relevance figures. Figure 5 shows the distributions when mapped over the entire collection, and, similarly to topic 135, we observe that while text has the greater initial change, it is over a shorter range. Also, among the visual features alone, edge can be seen to have the greatest initial difference, and this correlates with the relevance data. What these issues highlight is that there is merit in utilizing either a fixed-range examination of the data, or a relative, percentage-based examination. Our attempts at creating a fusion model explore both these possibilities (Section 4.2). Furthermore, it should be noted that any model would also need to account for the fact that these differences in distribution, despite normalization, can be artifacts of the manner in which the raw feature was generated and ranked. Any comparison of these distributions will need to take these factors into consideration.
4. AUTOMATIC FUSION FRAMEWORK
[Figure 5: 2005 Feature Results, collection — averaged normalized score plotted against rank over the entire collection (approximately 45,000 shots) for the colour structure, colour layout, edge and text features.]
4.1 Approach
Through experimentation we have discovered that when fusing together different rankings into a combined, fused ranking, apart from the raw individual rankings themselves, the fusion result can be greatly influenced by:

• The normalization strategy employed before combination;
• The relative weights supplied or generated for each feature;
• The methodology used to apply these weights to the features;
• The stage at which individual feature data is merged.

In this paper we examine techniques for query-time automatic weight generation, normalization and weight application. Based upon earlier experimentation we have defined our possible merger points, and these are illustrated in Figure 6. Working from the query down, we have found that the first point for merging should be the combination of the visual feature data for each example query image, producing one result list for that image. The second stage is the combination of each example query image's result list into one result list that represents the results of the set of example query images. The final stage is the combination of the image results with the text results to produce the final ranking. A sketch of this staging is given at the end of this section.

[Figure 6: Fusion Stages — for each example query image (Image A, Image B), raw index data is read to produce colour and edge result lists, which are normalized, weighted and combined into a single-image result list; the single-image lists are then combined into an all-image result list, which is finally combined with the text result list to give the final result list. At each combination point the results are normalized, weights are determined and applied, and the lists are combined.]

We have experimented previously with first combining low-level data of the same type (i.e. combining all the colour feature results into one list and all the edge results into another, and then combining these); however, we have found that this approach degrades performance. Similarly, we have found that combining text with each single example query image's results (e.g. for two example images we would have two result lists, so adding the text results here would mean combining three lists) also degrades performance. Furthermore, we have found that when combining result lists, the longest possible list should be used, rather than truncated lists. In our earlier work it had been common to truncate each stage within result fusion to 1,000 results; however, as can be seen in Figure 7, for visual feature data the majority of relevant shots occur after this 1,000-result cutoff. An early truncation of raw result lists to a size of 1,000 will therefore damage performance, as many relevant documents would be excluded from the ranking. In all our work, however, the final result is truncated to the top 1,000 results for evaluation purposes.

[Figure 7: Rank positions of relevant shots for TRECVid '04 — the number of relevant shots found at rank cutoffs of 25, 50, 100, 200, 500, 1,000, 2,000, 5,000, 10,000 and 33,367 for the colour structure, colour layout, edge and text features.]
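To make the staging concrete, the sketch below chains weighted CombSUM combinations in the order just described (per-image feature fusion, then cross-image fusion, then fusion with text). It assumes result lists are dictionaries mapping shot_id to an already-normalized score and that the weights have been produced by the method of Section 4.2; all function and parameter names are illustrative rather than taken from our implementation.

```python
def combsum(result_lists, weights):
    # Weighted CombSUM [8]: sum each shot's weighted, normalized scores across lists.
    fused = {}
    for scores, weight in zip(result_lists, weights):
        for shot_id, score in scores.items():
            fused[shot_id] = fused.get(shot_id, 0.0) + weight * score
    return fused

def fuse_query(per_image_feature_lists, text_list,
               feature_weights, image_weights, modality_weights):
    # Stage 1: for each example image, fuse its visual feature result lists.
    per_image = [combsum(feature_lists, feature_weights[i])
                 for i, feature_lists in enumerate(per_image_feature_lists)]
    # Stage 2: fuse the per-image lists into one visual result list.
    all_images = combsum(per_image, image_weights)
    # Stage 3: fuse the visual result list with the text result list.
    final = combsum([all_images, text_list], modality_weights)
    # Only the final list is truncated, to the top 1,000 shots, for evaluation.
    return sorted(final.items(), key=lambda kv: kv[1], reverse=True)[:1000]
```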
4.2 Weight Determination
The model for the automatic determination of relative weights for rankings to be combined was based upon an examination of the distribution of scores resulting from our feature inputs, as described in Section 3. Our hypothesis is that if we examine, at query time, the shape of the curve of the score distribution, then we are able to make approximate inferences as to which feature(s) would be the best-performing for the given query, in terms of ability to retrieve relevant shots, and weight these features accordingly. Our model for relative weight determination is based upon an examination of the differences between the scores of adjacent results in one list, contrasted with the differences between the scores of adjacent results in another result list.
We refer to this as the Mean Average Distance (MAD), and formally this is given by:

MAD = \frac{\sum_{n=1}^{N-1} \left( score(n) - score(n+1) \right)}{N - 1} \qquad (1)

For example, if shot ‘A’ has a normalized score of 0.85 and shot ‘B’ has a normalized score of 0.80, then the difference we measure between the two is 0.05. We can then sum these differences for each result list to provide an indication as to how close or far apart the shots are. However, a direct comparison of these differences will not in itself yield much useful information, as the differences could be accounted for by the ranking metric or even the nature of the distribution of the raw feature data. Therefore, to achieve a score that is comparable between lists, we define a ratio which measures MAD within a top subset of a result list against that of a larger subset of the same result list. The resulting score we refer to as a Similarity Cluster (SC), and it can be defined as:

SC = \frac{MAD(subset)}{MAD(larger\;set)} \qquad (2)
These formulae might appear similar to the standard deviation; however, the standard deviation describes the degree of compactness around a mean, whereas in our equations we are comparing the actual mean values. The final weight for a feature ranking is calculated as the relative proportion of that feature's SC score against the sum of all SC scores:

Feature\;Weight = \frac{Feature\;SC\;Score}{\sum All\;SC\;Scores} \qquad (3)

For instance, if we have two visual features, colour and edge, which have SC scores of 8 and 14 respectively, then the weight for colour is given as:

Colour\;Weight = \frac{Colour_{SC}}{Colour_{SC} + Edge_{SC}} \qquad (4)

which, given the above example, gives a weight of 0.36 to colour and a weight of 0.64 to edge.
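A minimal sketch of equations (1)-(3) follows; it assumes each result list is already normalized and sorted by descending score, and the function names are ours for illustration.

```python
def mean_average_distance(scores):
    # Equation (1): mean difference between adjacent scores in a descending list.
    if len(scores) < 2:
        return 0.0
    total = sum(scores[n] - scores[n + 1] for n in range(len(scores) - 1))
    return total / (len(scores) - 1)

def similarity_cluster(scores, top_size, larger_size):
    # Equation (2): MAD over the top subset relative to MAD over a larger subset.
    return mean_average_distance(scores[:top_size]) / mean_average_distance(scores[:larger_size])

def feature_weights(sc_scores):
    # Equation (3): each feature's SC score as a proportion of the sum of all SC scores.
    total = sum(sc_scores.values())
    return {feature: sc / total for feature, sc in sc_scores.items()}

# The worked example from the text: SC scores of 8 (colour) and 14 (edge)
# yield weights of roughly 0.36 and 0.64.
print(feature_weights({"colour": 8.0, "edge": 14.0}))
```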
4.2.1 Subset size determination
What we have not covered up to this point is the strategy to employ in selecting the subset sizes to be used in the calculation of SC (equation 2). Based upon empirical testing, we had in our earlier work [22] defined fixed-size subsets for use in the calculation of SC. These were 25 for the top subset and 1,000 for the larger subset. Our initial investigations of this technique dealt with visual data only, where we could be assured that result sets would all be of equal length. However, as noted in Section 3, with the move to incorporate additional visual features as well as text results, that proposition no longer holds true. As such we have defined two mechanisms for dynamically selecting the size of the subsets to use. The first approach selects subsets based upon a percentage of the result set, where two percentages are defined: one for the size of the top subset, and a second for the larger set. For example, if the percentages defined are 10% for the top subset and 90% for the larger set, and we are merging two result lists of size 33,000 and 10,000, then the first list will use subsets of 3,300 and 29,700 for its SC bounds, whilst the second will use 1,000 and 9,000. The advantage of this approach is that, as it is not a fixed boundary but a relative one, we should achieve SC values that reflect the distributions of score differences in a result set more accurately than would have been achieved with a fixed boundary. The second approach is similar to the previous one, except that instead of a percentage we base our cutoffs on a Z-score value. A Z-score measures the distance of a point from the mean in terms of the number of standard deviations; for instance, a Z-score of 2.0 (or -2.0) means that the point with that score is two standard deviations away from the mean. We calculate our Z-scores on the normalized scores of a result list. The distribution of these scores can be observed in Figure 5. Whilst this data does not adhere to a classic Gaussian distribution, if we were instead to graph the differences in scores between the shots, then we would get a shape closer to a Gaussian distribution. A sketch of both subset-selection strategies is given below.
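The percentage-based selection is straightforward; for the Z-score variant this paper does not spell out the exact cutoff rule, so the sketch below shows one plausible reading in which the top subset consists of the results whose normalized score lies more than a chosen number of standard deviations above the list mean. Both the threshold value and the helper names are assumptions.

```python
from statistics import mean, pstdev

def subset_sizes_by_percentage(list_length, top_pct=0.10, larger_pct=0.90):
    # Relative-percentage cutoffs: subset sizes scale with the result list length.
    return max(2, int(list_length * top_pct)), max(2, int(list_length * larger_pct))

def top_subset_size_by_zscore(scores, threshold=2.0):
    # Assumed Z-score rule: count the results whose normalized score is more than
    # `threshold` standard deviations above the mean score of the list.
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0.0:
        return len(scores)
    return sum(1 for s in scores if (s - mu) / sigma > threshold)
```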
4.3 Query Image Weight Determination
Another issue which presents itself within video IR is how to appropriately weight different parts of the query. It is known that if multiple example query images are provided as part of a query, some will perform better than others. A naive approach is to weight all images equally; however, this means that performance can suffer. Our approach to remedy this situation is to apply our automatic weight generation techniques not just to feature combinations, but to image-image and text-image combinations as well. As the output from each of these processes is identical (e.g. feature combination produces a combination of result lists, just as image-image combination does), our techniques can be applied to these lists in order to aid performance.
4.4 Normalization
Normalization plays a key part when attempting to combine multiple result lists based on score. As different results may have either different raw score distributions or be the product of alternative ranking metrics, a naive combination of these scores could bias a result list. Because of this, result lists need to be normalized first. In our experiments we implemented two forms of normalization: Min-Max normalization and relative normalization. Min-Max normalization is given by the formula:

NormScore(x) = \frac{Score_x - Score_{min}}{Score_{max} - Score_{min}} \qquad (5)

which will distribute all scores into the range [0:1]. Relative normalization is given by the formula:

NormScore(x) = \frac{Score_x}{\sum_{X} Score_X} \qquad (6)

and will alter the scores such that all scores sum to 1.
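Both normalizations are simple to express; a minimal sketch over a dictionary of shot scores follows (helper names are ours).

```python
def min_max_normalize(scores):
    # Equation (5): map all scores into the range [0, 1].
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {shot: 0.0 for shot in scores}
    return {shot: (s - lo) / (hi - lo) for shot, s in scores.items()}

def relative_normalize(scores):
    # Equation (6): scale scores so that they sum to 1.
    total = sum(scores.values())
    return {shot: s / total for shot, s in scores.items()}
```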
4.5 Combination of Rankings Using Dempster-Shafer Theory
The application of weights to a result list, such that these lists can be subsequently merged, has been investigated in two different ways. The first of these is to apply the weight
equally to the result list, and to combine the result lists making use of CombSUM [8]. The second approach to weight application is to make use of the Dempster-Shafer Combination of Evidence framework. Dempster-Shafer's Theory of Evidence is a formal framework for the combination of independent sources of evidence. The theory was originally proposed by Dempster [6] and extended by Shafer [19]. Its main appeal is its explicit representation of ignorance in the combination of evidence, which is expressed by Dempster's combination rule, and in this way it extends classical probability theory. In this section we outline Dempster-Shafer's Theory of Evidence and its application in combining image feature scores.

The frame of discernment Θ represents the set of all possible elements {θ_1, θ_2, ..., θ_n} in which we are interested. The power set of Θ, denoted 2^Θ, contains all possible propositions. Each of these propositions is assigned a probability mass function, denoted m, where

\sum_{A \subseteq \Theta} m(A) = 1 \qquad (7)

Here A is any element of 2^Θ and m(A) is the amount of the total belief committed exactly to A. The uncommitted belief m(Θ) is a measure of the probability mass that remains unassigned. This is the measure that is used to model “ignorance” or, conversely, the “confidence” in the evidence; it is said to be its uncertainty and is defined as follows:

m(\Theta) = 1 - \sum_{A \subset \Theta} m(A) \qquad (8)

If A ⊆ Θ and m(A) > 0, then A is called a focal element. The function m(A) measures the amount of belief that is exactly committed to A, not the total belief that is committed to A. Each mass m(A) supports any proposition B that is implied by A. Therefore the belief that a proposition A is true is gained by adding all the masses m(B) allocated to propositions B that imply A. This degree of belief is defined as follows:

Bel(A) = \sum_{B \subseteq A} m(B) \qquad (9)

Two bodies of evidence within the same frame of discernment (provided they are independent) can be combined using Dempster's Combination Rule. If m_1, m_2 are the two probability mass functions of the two bodies of evidence defined in the frame of discernment Θ that we wish to combine, the probability mass function m defines a new body of evidence in the same frame of discernment Θ:

m(A) = m_1 \oplus m_2(A) = \frac{\sum_{B \cap C = A} m_1(B)\, m_2(C)}{1 - \sum_{B \cap C = \emptyset} m_1(B)\, m_2(C)}, \quad A, B, C \subseteq \Theta \qquad (10)

This returns a measure of agreement between the two bodies of evidence. As we are applying Dempster-Shafer's Theory of Evidence to combine various content-based image features, we now outline how it is applied specifically to this area, as well as the computational savings that can be made. When dealing with a fixed collection of images where we are concerned with returning the most relevant images associated with a text and image query, our frame of discernment Θ is defined as {im_1, im_2, ..., im_n}, where im represents an image in the collection. Each of the features is used to calculate the similarity between the query image and all other images in the collection, or between the query text and the text associated with a shot. We must normalise each of these feature scores so that they fulfil the probability mass function, as defined by equation 7. For this we must also know the uncertainty value m(Θ) associated with each feature, so that the scores for the feature are normalised to 1 − m(Θ), to satisfy equation 8.

For example, if we wish to combine the results of the edge histogram (m_eh(im)) and colour layout (m_cl(im)) features, we can take the weights given by the Mean Average Distance weight determination process (as discussed earlier) as the level of confidence we have in each feature, or formally 1 − m(Θ). If this, for example, gives weights of 0.4 and 0.6 for edge histogram and colour layout respectively, we then normalise the m_eh(im) values so that they sum to 0.4 and the m_cl(im) values so that they sum to 0.6, as this corresponds to the level of confidence we are assigning to each feature. To then combine the features m_eh(im) and m_cl(im) into a combined score for an image, we use Dempster's Combination Rule, as defined in equation 10. It is possible to simplify this equation, as it currently takes account of all elements in the set 2^Θ. However, as we have non-zero basic probability assignments only for the singleton subsets of Θ, i.e. each of the images {im_1, im_2, ..., im_n}, as well as for the uncertainty in the body of evidence m(Θ), we can reduce the complexity of the combination, in a manner similar to [10] and [14]. Equation 10 can then be re-written as:

m_{eh,cl}(\{im_i\}) = \frac{m_{eh}(\{im_i\})\, m_{cl}(\{im_i\}) + m_{eh}(\{im_i\})\, m_{cl}(\Theta) + m_{eh}(\Theta)\, m_{cl}(\{im_i\})}{1 - \sum_{\{im_k\} \cap \{im_j\} = \emptyset} m_{eh}(\{im_k\})\, m_{cl}(\{im_j\})} \qquad (11)

Since we are only interested in the ranking produced from this combination, and the denominator of equation 11 is independent of {im_i}, this formula can be further simplified to:

m_{eh,cl}(im) \propto m_{eh}(im)\, m_{cl}(im) + m_{eh}(im)\, m_{cl}(\Theta) + m_{eh}(\Theta)\, m_{cl}(im) \qquad (12)

We can then use the above equation to find the agreement between two independent features for the same image in the collection. In order to yield a combined list of relevant images based on two features we must calculate the combined score for each image in the collection using this equation. Each of the scores in this list corresponds to a probability of relevance between that image and the query image the feature was generated from. These probabilities now sum to less than 1, corresponding to the level of disagreement between the features being combined. This in itself can be of use, as low levels of disagreement generally correspond to features that have ranked the same images similarly, whereas when combining many features together for certain query images there may be a high level of disagreement, indicating that the features do not agree on which images should be ranked highly; this generally coincides with poorer performance.
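A minimal sketch of this combination for two features follows. It assumes each feature's scores are first scaled so that they sum to that feature's confidence 1 − m(Θ) (the MAD-derived weight), and it ranks images by the proportional form of equation (12); the helper names are illustrative.

```python
def scale_to_confidence(scores, confidence):
    # Scale a feature's scores so that they sum to its confidence 1 - m(Theta),
    # leaving m(Theta) as the uncommitted (uncertain) mass.
    total = sum(scores.values())
    return {img: confidence * s / total for img, s in scores.items()}

def dempster_shafer_rank(scores_a, scores_b, confidence_a, confidence_b):
    # Equation (12): m_ab(im) is proportional to
    # m_a(im)*m_b(im) + m_a(im)*m_b(Theta) + m_a(Theta)*m_b(im).
    m_a = scale_to_confidence(scores_a, confidence_a)
    m_b = scale_to_confidence(scores_b, confidence_b)
    theta_a = 1.0 - confidence_a  # uncommitted mass of feature a
    theta_b = 1.0 - confidence_b  # uncommitted mass of feature b
    combined = {}
    for img in set(m_a) | set(m_b):
        a, b = m_a.get(img, 0.0), m_b.get(img, 0.0)
        combined[img] = a * b + a * theta_b + theta_a * b
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
```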
5. EXPERIMENTS
Our experiments were performed on the TRECVid 2004 and 2005 collections of TV news video data. The TRECVid 2004 collection consisted of 70 hours of news video footage from CNN and ABC broadcasters. Extracted from this were 33,367 keyframes which represent each shot. TRECVid 2004
had 24 queries, represented by 145 example query images (either provided or extracted from the video) and a text description of each query. TRECVid 2005 consisted of 80 hours of news video footage from several broadcasters including CNN, NBC, CCTV (Chinese) and other Arabic-language sources. Again 24 queries were provided, represented by 227 example query images and text descriptions of each query. ASR of the full video for both collections was also provided. We performed a fully automatic search on both collections using the full set of queries for each. For each query we used every available keyframe as well as the text description to form the final query that was given to the system. For each collection we performed iterative manual training to determine the best set of collection weights. This run is known as the ‘oracle’ run, as it was obtained by training on the relevance data for that collection, and it is the best result we could obtain for that collection using collection-wide weights. We note again that none of our fully automatic runs benefited from any user intervention, prior training, or any information as to the quality of the features or query images that were used. The following abbreviations were used to compose part of the run name:
• MAD - Standard application of MAD (Mean Average Distance between scores) with static window sizes (25 and 1,000).
• MAD-% - Application of MAD using the relative percentage approach to determine window size based on result list length.
• MAD-Z - Application of MAD using Z-scores to determine window size, based on deviation from the mean score.
• MM - Use of Min-Max normalization.
• std - Use of relative normalization.
• CS - Use of CombSUM.
• DS - Use of Dempster-Shafer.

Each run name adheres to the following semantic description: [MAD type (static, percent, Z-score)]-[Normalization used for weight generation (MM or std)] [Combination type used for query-level combination (CS or DS)]-[Combination type used for feature-level combination (CS or DS)]-[Normalization used when applying weights to scores (MM or std)]. For example, the run name MAD-%-MM CS-CS-std means that the run used the MAD percentage approach for determining the ratios for the calculation of SC, that results were normalized using Min-Max normalization for weight generation, that feature results were normalized using relative normalization before weights were applied, and that at both the feature and query level combination was performed using CombSUM. Each run executed the entire series of queries, using serialized indexes, in less than 30 seconds.
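For readers decoding the run names in Tables 4 and 5, a small hypothetical helper illustrating the naming scheme is shown below.

```python
def parse_run_name(run_name):
    # Decode a run name such as "MAD-%-MM CS-CS-std" into its components.
    weight_part, combination_part = run_name.split(" ")
    weight_bits = weight_part.split("-")          # e.g. ['MAD', '%', 'MM']
    mad_type = "-".join(weight_bits[:-1])         # 'MAD', 'MAD-%' or 'MAD-Z'
    weight_generation_norm = weight_bits[-1]      # 'MM' or 'std'
    query_combo, feature_combo, score_norm = combination_part.split("-")
    return {
        "mad_type": mad_type,
        "weight_generation_norm": weight_generation_norm,
        "query_level_combination": query_combo,      # 'CS' or 'DS'
        "feature_level_combination": feature_combo,  # 'CS' or 'DS'
        "weight_application_norm": score_norm,       # 'MM' or 'std'
    }

print(parse_run_name("MAD-%-MM CS-CS-std"))
```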
The results on TRECVid 2004 and 2005 are presented in Tables 4 and 5. As we can see, our best runs achieved results that were comparable to the best scores achieved using oracle weights in all metrics measured. In particular, it would appear that the use of CombSUM is most effective for the combination of weighted result lists. The relatively poor performance of Dempster-Shafer can be attributed to the fact that the weights generated were not atomic measurements, but dependent on the values of other weights. This means that if, for instance, the three visual features were determined to have performed equally well, they would each have been assigned weights of 33%. However, from the perspective of Dempster-Shafer those weights would have indicated a high degree of uncertainty as to the quality of the feature. We can also observe that measures which incorporated the range of the result list into the calculation of SC (such as MAD-%) performed slightly better than static approaches.

Name                     MAP     P@5     P@10    Recall
Oracle weights 2004      0.0774  0.2261  0.1870  500
MAD-%-MM CS-CS-std       0.0753  0.2261  0.1957  514
MAD-MM CS-CS-std         0.0748  0.2261  0.1913  498
MAD-Z-MM CS-CS-std       0.0727  0.2087  0.2     497
MAD-MM CS-CS-MM          0.0673  0.235   0.1913  492
MAD-std CS-CS-MM         0.0673  0.235   0.1913  492
MAD-%-std CS-CS-std      0.0643  0.1739  0.1522  488
MAD-std CS-CS-std        0.0639  0.1652  0.1435  491
MAD-Z-MM CS-CS-MM        0.0639  0.1739  0.1652  460
MAD-%-MM CS-CS-MM        0.0626  0.2     0.1783  488
MAD-%-std CS-CS-MM       0.0626  0.2     0.1783  488
MAD-Z-MM DS-CS-MM        0.0581  0.1913  0.1522  434
MAD-MM DS-CS-MM          0.051   0.1739  0.1348  455
MAD-std DS-CS-MM         0.051   0.1739  0.1348  455
MAD-%-MM DS-CS-MM        0.0473  0.1478  0.1304  450
MAD-%-std DS-CS-MM       0.0473  0.1478  0.1304  450
MAD-Z-MM DS-CS-std       0.0466  0.113   0.1087  481
MAD-MM CS-DS-std         0.0353  0.1043  0.0783  367
MAD-%-MM CS-DS-std       0.0353  0.1043  0.0739  365
MAD-MM CS-DS-MM          0.0352  0.1043  0.0783  367
MAD-std CS-DS-MM         0.0352  0.1043  0.0783  367
MAD-%-MM CS-DS-MM        0.035   0.1043  0.0826  366
MAD-%-std CS-DS-MM       0.035   0.1043  0.0826  366
MAD-%-std CS-DS-std      0.0293  0.0609  0.0652  359
MAD-std CS-DS-std        0.0292  0.0609  0.0652  358
MAD-MM DS-CS-std         0.0285  0.087   0.0696  444
MAD-%-MM DS-CS-std       0.0263  0.087   0.0739  445
MAD-%-std DS-CS-std      0.0227  0.0522  0.0609  347
MAD-std DS-CS-std        0.0215  0.0435  0.0435  366
MAD-MM DS-DS-MM          0.0168  0.0435  0.0478  169
MAD-MM DS-DS-std         0.0168  0.0435  0.0478  169
MAD-%-MM DS-DS-MM        0.0168  0.0435  0.0478  169
MAD-%-MM DS-DS-std       0.0168  0.0435  0.0478  169
MAD-%-std DS-DS-MM       0.0168  0.0435  0.0478  169
MAD-std DS-DS-MM         0.0168  0.0435  0.0478  169
MAD-%-std DS-DS-std      0.0167  0.0435  0.0478  169
MAD-std DS-DS-std        0.0167  0.0435  0.0478  169

Table 4: 2004 Results
As an aside, we briefly experimented with taking our top-performing runs for 2004 and 2005 and removing the weakest feature from the feature set (colour structure). We expected that we would see an increase in performance, as we would have reduced the amount of noise that the system needed to deal with. To our surprise, these runs actually performed worse. This leads us to observe that despite the poor signal-to-noise ratio of the colour structure feature, our model was able to take the information present within the feature and use it to boost the runs' performance.

Name                     MAP     P@5     P@10    Recall
Oracle weights 2005      0.0844  0.2917  0.2792  1116
MAD-%-MM CS-CS-std       0.0847  0.2417  0.2333  1157
MAD-MM CS-CS-std         0.0835  0.292   0.2458  1140
MAD-%-MM CS-CS-MM        0.0736  0.2583  0.2417  1153
MAD-%-std CS-CS-MM       0.0736  0.2583  0.2417  1153
MAD-Z-MM CS-CS-std       0.0726  0.1917  0.175   1089
MAD-MM CS-CS-MM          0.0673  0.25    0.258   1132
MAD-std CS-CS-MM         0.0673  0.25    0.258   1132
MAD-Z-MM CS-CS-MM        0.0548  0.2417  0.1958  1121
MAD-Z-MM DS-CS-std       0.0498  0.225   0.1792  1014
MAD-MM DS-CS-std         0.0477  0.1833  0.175   936
MAD-%-MM DS-CS-std       0.0472  0.1833  0.1667  954
MAD-MM DS-CS-MM          0.0416  0.2583  0.1917  922
MAD-std DS-CS-MM         0.0416  0.2583  0.1917  922
MAD-%-MM DS-CS-MM        0.0395  0.2417  0.1792  929
MAD-%-std DS-CS-MM       0.0395  0.2417  0.1792  929
MAD-%-std DS-CS-std      0.0366  0.1333  0.1292  901
MAD-std DS-CS-std        0.0359  0.1333  0.1208  871
MAD-%-std CS-CS-std      0.0345  0.15    0.1583  1041
MAD-std CS-CS-std        0.0343  0.1583  0.1708  1035
MAD-MM CS-DS-std         0.0268  0.1     0.1042  461
MAD-MM CS-DS-MM          0.0266  0.1     0.1042  461
MAD-std CS-DS-MM         0.0266  0.1     0.1042  461
MAD-%-MM CS-DS-MM        0.0263  0.1     0.1042  459
MAD-%-std CS-DS-MM       0.0263  0.1     0.1042  459
MAD-%-MM CS-DS-std       0.0261  0.1     0.1042  461
MAD-Z-MM DS-CS-MM        0.0254  0.1083  0.1125  833
MAD-%-std CS-DS-std      0.023   0.0583  0.0708  464
MAD-std CS-DS-std        0.0229  0.0583  0.0708  464
MAD-MM DS-DS-MM          0.0033  0.0167  0.0083  173
MAD-MM DS-DS-std         0.0033  0.0167  0.0083  173
MAD-%-MM DS-DS-MM        0.0033  0.0167  0.0083  173
MAD-%-MM DS-DS-std       0.0033  0.0167  0.0083  173
MAD-%-std DS-DS-MM       0.0033  0.0167  0.0083  173
MAD-%-std DS-DS-std      0.0033  0.0167  0.0083  173
MAD-std DS-DS-MM         0.0033  0.0167  0.0083  173
MAD-std DS-DS-std        0.0033  0.0167  0.0083  173

Table 5: 2005 Results
6. CONCLUSIONS AND FUTURE WORK
In this paper we explored two major concepts. Firstly, that an examination of the distribution of scores can reveal correlations between those results which undergo a rapid initial change in score and those results which perform well with regard to relevance. Secondly, we presented an initial model to take advantage of these correlations and to automatically generate weights for a retrieval system, without giving that system any prior training or outside knowledge of the collection. We found that this model could achieve performance on a par with collection-wide oracle weights derived from training on the test collection. We believe that we have demonstrated through empirical testing the potential for the analysis of score distributions to
be exploited to assist in video IR. Our initial model, whilst basic, achieves respectable performance, particularly when the variability in the quality of the features used is considered. We believe that whilst this model is competitive in itself, it does not need to be used independently of other weight generation techniques that involve training or iterative queries. Our approach could complement such techniques by offering a final layer of result interpretation to assist in weight refinement. As has been previously discussed, our experiments and the formulation of our model were empirically based. Given that we have observed correlations between score distribution and relevance, the development of a theoretical model that is able to describe these phenomena would be of great benefit. This would lead to further refinements of the weight generation model, and provide greater confidence in its applicability. The use of high-level semantic features as a pre-filtering step into this algorithm would also be of interest. Our investigations in this area would center on whether this filtering would make a difference to the properties we have observed, and whether the filtering should be performed before or after weight generation. We would also look to investigate alternative weight generation techniques that, whilst leveraging the distribution information, are able to generate atomic weights, such that they can be used as appropriate input to any Dempster-Shafer framework.
7. ACKNOWLEDGMENTS
The research leading to this paper was partly supported by the European Commission under contract FP6-027026 (K-Space) and by Science Foundation Ireland under grant number 03/IN.3/I361. We are grateful to the aceMedia project (FP6-001765) which provided us with output from the aceToolBox image analysis toolkit.
8. REFERENCES
[1] The AceMedia Project, available at http://www.acemedia.org.
[2] Learning of personal visual impressions for image database systems. In Proceedings of the Second International Conference on Document Analysis and Recognition, pages 547–552, 1993.
[3] B. Bartell, G. Cotrell, and R. Belew. Automatic combination of multiple ranked retrieval systems. In 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 173–181, 1994.
[4] S.-F. Chang, T. Sikora, and A. Puri. Overview of the MPEG-7 standard. IEEE Transactions on Circuits and Systems for Video Technology, 11(6):688–695, June 2001.
[5] E. Cooke, P. Ferguson, G. Gaughan, C. Gurrin, G. J. F. Jones, H. L. Borgne, H. Lee, S. Marlow, K. McDonald, M. McHugh, N. Murphy, N. E. O'Connor, N. O'Hare, S. Rothwell, A. F. Smeaton, and P. Wilkins. TRECVID 2004 experiments in Dublin City University. In Proceedings of TRECVID 2004, November 2004.
[6] A. P. Dempster. A generalization of Bayesian inference. Journal of the Royal Statistical Society, 30:205–247, 1968.
[7] P. Ferguson, C. Gurrin, P. Wilkins, and A. F. Smeaton. Físréal: A low cost terabyte search engine. In Proceedings of the European Conference on IR, March 2005.
[8] E. A. Fox and J. A. Shaw. Combination of multiple searches. In Proceedings of the 2nd Text REtrieval Conference, 1994.
[9] A. Graves and M. Lalmas. Video retrieval using an MPEG-7 based inference network. In SIGIR '02: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 339–346, New York, NY, USA, 2002. ACM Press.
[10] J. M. Jose, J. Furner, and D. J. Harper. Spatial querying for image retrieval: A user oriented evaluation. In ACM SIGIR, pages 232–240, 1998.
[11] B. S. Manjunath, J. R. Ohm, V. Vasudevan, and A. Yamada. Color and texture description. IEEE Transactions on Circuits and Systems for Video Technology, June 2001.
[12] R. Manmatha, T. Rath, and F. Feng. Modeling score distributions for combining the outputs of search engines. In Research and Development in Information Retrieval, pages 267–275, 2001.
[13] R. W. Picard, T. P. Minka, and M. Szummer. Modeling user subjectivity in image libraries. In IEEE International Conference on Image Processing, 1996.
[14] V. Plachouras and I. Ounis. Dempster-Shafer theory for a query-biased combination of evidence on the web. Information Retrieval, 8(2):197–218, April 2005.
[15] S. E. Robertson and S. Walker. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In SIGIR '94: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 232–241. Springer-Verlag New York, Inc., 1994.
[16] Y. Rui and T. Huang. Optimizing learning in image retrieval. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pages 236–245, 2000.
[17] Y. Rui, T. S. Huang, M. Ortega, and S. Mehrotra. Relevance feedback: A power tool for interactive content-based image retrieval. IEEE Transactions on Circuits and Systems for Video Technology, 8(5):644–655, 1998.
[18] G. Salton. Automatic Text Processing. Addison-Wesley, 1989.
[19] G. Shafer. A Mathematical Theory of Evidence. Princeton University Press, 1976.
[20] A. F. Smeaton. Large Scale Evaluations of Multimedia Information Retrieval: The TRECVid Experience. Volume 3568, pages 11–17. Springer, 2005.
[21] H. Turtle and W. Croft. Evaluation of an inference network-based retrieval model. ACM Transactions on Information Systems, 9(3):187–222, 1991.
[22] P. Wilkins, P. Ferguson, C. Gurrin, and A. F. Smeaton. Automatic determination of feature weights for multi-feature CBIR. In ECIR 2006 - European Conference on Information Retrieval, Lalmas, M. et al. (Eds.), LNCS 3936, pages 527–530. Springer, 2006.