Spatio-Temporal Query for Multimedia Databases
Sujal S Wattamwar
Hiranmay Ghosh
TCS Innovation Labs Delhi, Tata Consultancy Services Ltd. 249 D&E Udyog Vihar Phase IV, Gurgaon, Haryana. 122016. INDIA
{sujal.wattamwar | hiranmay.ghosh}@tcs.com

ABSTRACT
Complex media events are often characterized by spatio-temporal relations between their constituent media objects. A multimedia query language should support the specification of such relations for semantic retrieval. In this paper, we propose a method for formally defining 3D spatio-temporal relations between elementary media objects. To support soft decision making in multimedia data interpretation and to provide graded ranking, we define these relations using a set of fuzzy membership functions. It is possible to define fuzzy 3D extensions of Allen's relations as well as arbitrary new relations using our method. This method can be incorporated into upcoming multimedia query languages, such as MP7QF.
Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Query Formulation

General Terms
Human Factors, Theory

Keywords
Multimedia database, Multimedia Query Language, Fuzzy Logic, Spatio-temporal relation
1. INTRODUCTION
The semantics of media data is often derived from the interaction of media objects in space and time. For example, a “goal scored” event in a soccer match can be characterized by the ball being within the goal-box, followed by a cheer. In this example, assuming that a “ball”, a “goal-box” and a “cheer” can be recognized by audio-visual recognition algorithms, the “goal scored” event can be recognized by analyzing their spatial and temporal relations, namely within and followed by. Recognition of such media events in a media instance requires a query mechanism that can process such spatio-temporal relations across the constituent media objects. The relations need to be specified with a certain degree of uncertainty to account for inherent differences between event instances and differences in the viewpoints of observers. In this paper, we propose a scheme for expressing uncertain spatio-temporal relations in a multimedia query language, which can lead to media semantics.
In recent years, the MPEG-7 multimedia content description scheme [10] has been widely adopted for multimedia databases. Though MPEG-7 is based on XML specifications, standard XML query
languages, such as XQuery, do not adequately support queries relating to multimedia data. This has motivated research into querying MPEG-7 multimedia content descriptions. Doller et al. [3] have proposed MP7QF, a query format for MPEG-7 compliant multimedia content descriptions. The spatial and temporal query specification in MP7QF is based on MMDOC-QL [8], which enables querying for multimedia objects at specific places or times in a media stream. However, it is not possible to specify the spatial or temporal relations between two media objects or events with such a scheme.
Papadias [9] has proposed a scheme for specifying temporal relations between two events using an efficient binary encoding. This scheme enables formal specification of the semantics of temporal relations and reasoning with them, and it readily extends to any other 1D space interval as well. A 3D (x, y, t) relation between two multimedia events can then be represented by a 3-tuple of 1D interval relations along the three axes. However, the relations are crisply defined and are not amenable to uncertain processing, which is a prime requirement for establishing multimedia semantics. While that paper proposes a scheme for approximate spatio-temporal queries based on reasoning with the closeness of the relations, the boundaries are still crisp and do not provide for a graded ranking of the search results. Karthik et al. [7] have proposed the use of fuzzy membership values in place of the binary intersection values to achieve graded ranking. They introduce a containment relationship between media events, which is required over and above the 3-tuple of 1D relations to unambiguously distinguish between certain spatio-temporal relations when some of the events are concave in nature. However, they do not specify the derivation of the fuzzy membership functions or their use in query processing.
Our main contribution in this paper is a scheme to formally specify arbitrary spatio-temporal relations between two objects in a media stream, and an illustration of their use in semantic query processing in multimedia databases. We extend the work presented in [7] to define the semantics of the fuzzy interval membership functions and to define spatio-temporal relations on top of these functions. We show that all of Allen's relations [1] can be specified in their uncertain form using our scheme and that the crisp forms of these relations can be realized as special cases. Moreover, it is possible for users to formally define arbitrary interval relations by defining different fuzzy interval membership functions and combining the membership values in different ways in 3D. We illustrate the application of these fuzzy spatio-temporal relations in semantic query processing with some illustrative query examples and results.
The rest of the paper is organized as follows. Section 2 discusses multimedia ontology and spatio-temporal relations. Section 3 explains the MPEG-7 multimedia content description scheme. We present an overview of the MPEG-7 Query Format and propose an extension to support spatio-temporal relations in Section 4. Section 5 details our specification of the spatio-temporal relations. Section 6 gives illustrative
examples of the use of the spatio-temporal relations, and Section 7 concludes the paper.

Figure 1. Media properties of concepts and their relations
2. MULTIMEDIA ONTOLOGY AND SPATIO-TEMPORAL RELATIONS
Ghosh et al. [4] have proposed the Multimedia Web Ontology Language (M-OWL) for semantic web applications involving multimedia data. The ontology scheme supports reasoning with the expected media properties of concepts for their recognition. For example, a medieval Indian monument can be recognized by observing its minarets, domes and arches, characterized by their observable shape properties as illustrated in Figures 1(b) and (c). Note that there are inherent differences between the monuments, and not all of them have all the expected components. Thus, concept recognition using these visible features calls for reasoning with uncertainty. An evidential reasoning scheme with belief propagation in a Bayesian Network is built around M-OWL to support such uncertainties in observation. Karthik et al. [7] extend the media property specification in M-OWL with spatial and temporal relations; e.g., the dome should be above the arches and should appear between the minarets in a visual depiction of a medieval monument, as illustrated in Figure 1(c). Harit et al. [5] propose a method to compute the belief value of a concept based on the belief values of the detected media objects and the belief value of a spatio-temporal relation between them. Formal representation of spatio-temporal relations and computation of the belief value for a spatio-temporal relation in a specific media instance remained an unsolved part of the puzzle.
3. MPEG-7 MULTIMEDIA CONTENT DESCRIPTION SCHEME
MPEG-7 is a Multimedia Content Description Interface; it is a standard for describing multimedia content. Audio-visual (AV) content can be described using the MPEG-7 Multimedia Description Schemes (MDSs). The Description Schemes (DSs) provide a standardized way of describing, in XML, the important concepts related to AV content description and content management. Various kinds of information, such as temporal, spatial and textual descriptions, are described using the respective DSs. A video is described using the Video DS. The video is then divided into various segments depending on visual features, for example the color feature, and each of these segments is described using the VideoSegment DS. It describes various attributes of the particular video segment such as media time, media duration and textual annotation. The media time is described using the MediaTime DS, and the TextAnnotation DS describes the textual annotation of the video segment. Each video segment may have some interesting media objects present in it, which can be marked and described using the SpatialDecomposition DS. It describes information such as the shape of the marker, textual annotation and visual features. We assume that the co-ordinates of all the objects in a particular video segment remain the same throughout that segment. A sample video (http://www.open-video.org/details.php?videoid=346), depicting the achievements of NASA, is shown in Figure 2, and a section of an MPEG-7 file describing its media contents is shown in Figure 3.
Figure 2. A video showing the achievements of NASA

This file describes the video, which has seven video segments with ids VS1 … VS7. VS3 has the media time point T00:00:08, described by the MediaTimePoint DS, and the MediaDuration DS gives the duration of the segment. This video segment has one object, called a spatial object, which is described by the SpatialDecomposition DS with id SS1. SS1 is marked by a bounding box and is described by a Poly DS with the co-ordinates of the top-left and bottom-right corners as (29, 10) and (224, 145) respectively. The object is described by the textual annotation "Satellite", and the annotation is structured so that it answers "what" about the marked object. In this way, the MediaTimePoint DS and the Poly DS give the time and space locations of the media object. All the remaining segments are described in a similar way.
[Figure 3 listing omitted: an MPEG-7 XML fragment describing video segment VS3, with MediaTimePoint T00:00:08, MediaDuration PT0M3S, a SpatialDecomposition SS1 whose Poly has the corner co-ordinates 29 10 and 224 145, and the textual annotation "Satellite".]
Figure 3. A sample MPEG-7 description scheme
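To make the description above concrete, the following is a minimal sketch of how the time and space locators of such an object could be read programmatically. It assumes a simplified, namespace-free XML layout built from the DS names mentioned in the text; a real MPEG-7 document carries namespaces and deeper nesting, so the element paths used here are illustrative assumptions only.

import xml.etree.ElementTree as ET

# A simplified, namespace-free fragment modelled on the description of Figure 3
# (the element names follow the DSs named in the text; real MPEG-7 nests them more deeply).
SAMPLE = """
<VideoSegment id="VS3">
  <MediaTime>
    <MediaTimePoint>T00:00:08</MediaTimePoint>
    <MediaDuration>PT0M3S</MediaDuration>
  </MediaTime>
  <SpatialDecomposition id="SS1">
    <Poly><Coords>29 10 224 145</Coords></Poly>
    <TextAnnotation>Satellite</TextAnnotation>
  </SpatialDecomposition>
</VideoSegment>
"""

def read_segment(xml_text):
    seg = ET.fromstring(xml_text)
    coords = [int(c) for c in seg.findtext("SpatialDecomposition/Poly/Coords").split()]
    return {
        "id": seg.get("id"),
        "time": seg.findtext("MediaTime/MediaTimePoint"),
        "duration": seg.findtext("MediaTime/MediaDuration"),
        "bbox": tuple(coords),                 # top-left and bottom-right corners
        "label": seg.findtext("SpatialDecomposition/TextAnnotation"),
    }

print(read_segment(SAMPLE))
# {'id': 'VS3', 'time': 'T00:00:08', 'duration': 'PT0M3S', 'bbox': (29, 10, 224, 145), 'label': 'Satellite'}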
4. MULTIMEDIA QUERY LANGUAGE
Doller et al. [3] propose MP7QF as a query language for MPEG-7 compliant multimedia content descriptions. The query language supports a variety of query operations, including text-based, media-example-based, media-feature-based and media-location (space and time) based queries. It is possible to construct a combined search by combining searches with Boolean, comparison and arithmetic operators, and to create arbitrarily complex queries by using the operators to combine other combined-search specifications to any depth of nesting. Figure 4 depicts the grammar of the query language in EBNF. Note that the SpatialQuery and TemporalQuery specifications in MP7QF are based on MMDOC-QL and enable a user to query a specific location (space or time) in a multimedia stream. However, they do not allow specification of the relative positions of two media events.
We propose to extend the Operator class in MP7QF to include a SpatioTemporalOperator that can establish spatio-temporal relations between two media events. These relations are extensions of Allen's 1D interval relations and operate on media events that can be queried using operations such as TextQuery, QueryByExampleMedia and QueryByFeatureRange. Allen's relations are extended to 3D (x, y, t) by using a 3-tuple of 1D interval relations and a containment relation. While Allen's relations are crisp and assign a binary value (TRUE or FALSE) to any relation with respect to two events, our relations are defined with a set of fuzzy interval intersection functions, resulting in a fuzzy membership value in the range [0, 1]. This is necessary to account for the uncertainties in event instances and observations in the multimedia domain. It is possible for users to redefine these relations (formally change their interpretation) and to define new relations by combining
the membership values in different ways and changing the membership functions. The extensions to the grammar are shown in Figure 5. SpatioTemporalRelationDescription is a description of a spatio-temporal relation between two events in a multimedia presentation and is detailed in the following section.

Query              ≔ Input | Output;
Input              ≔ RsPresentation, QueryCondition, SortBy;
QueryCondition     ≔ {MediaResource}, {Feature}, QueryExpression;
QueryExpression    ≔ SingleSearch | CombinedSearch;
SingleSearch       ≔ Operation;
CombinedSearch     ≔ Operator;
Operation          ≔ BrowsingQuery | TextQuery | QueryByExampleMedia, …
Operator           ≔ BooleanOperator | CompareOperator | ArithmeticOperator;
BooleanOperator    ≔ “AND”, Search, Search, {Search} | …
Search             ≔ SingleSearch | CombinedSearch;
CompareOperator    ≔ {“EQUAL_TO” | “GREATER_THAN” | …
ArithmeticOperator ≔ {“SUM” | “DIVIDE” | …
Figure 4. Grammar of MP7QF [3]

Operator               ∷= BooleanOperator | CompareOperator | ArithmeticOperator | SpatioTemporalOperator;
SpatioTemporalOperator ∷= (SpatioTemporalRelationDescription);
Figure 5. Proposed extension to MP7QF Grammar
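As an illustration of where the proposed operator sits in a query, the sketch below models the extended Operator alternatives in Python; the class and field names are our own illustrative choices and are not part of the MP7QF or MPEG-7 specifications.

from dataclasses import dataclass

@dataclass
class Search:
    kind: str      # e.g. "TextQuery", "QueryByExampleMedia", "QueryByFeatureRange"
    args: dict

@dataclass
class SpatioTemporalOperator:
    relation: str  # e.g. "between", "follows", or a user-defined relation name
    primary: Search
    secondary: Search

query = SpatioTemporalOperator(
    relation="between",
    primary=Search("TextQuery", {"annotation": "Minaret_pair"}),
    secondary=Search("TextQuery", {"annotation": "Dome"}),
)
print(query.relation, query.primary.args["annotation"], query.secondary.args["annotation"])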
5. SPECIFYING SPATIO-TEMPORAL RELATIONS
5.1 1D interval relations
Allen [1] modeled an event as an interval in time and established temporal relations between events by comparing their start and end times. He identified 13 distinct temporal relations which exhaustively represent all possible combinations of the relative start and end times of the events. These relations are summarized in Figure 6, where B represents the primary interval and A is the secondary interval whose extent is related to that of B. Though Allen defined these relations with respect to time, they readily generalize to any spatial dimension as well.
Figure 6. Allen's relations

Papadias [9] proposed a binary string encoding for the representation of these 1D interval relations, which results in a compact formal representation of the relations and enables reasoning with them. An event is represented as a closed interval [a, b] in 1D, where 'a' denotes the start time and 'b' the end time, satisfying -∞ < a < b < +∞. Five regions of interest are identified with respect to the interval [a, b]: (-∞, a), [a, a], (a, b), [b, b], and (b, +∞).
The relationship between a primary interval [a, b] and a secondary interval [c, d] can be uniquely determined by considering the five empty or non-empty intersections of [c, d] with each of the five aforementioned regions, denoted by the five binary variables t, u, v, w, x respectively. For example, Figure 7 depicts two 1D intervals, [a, b] (the primary interval) and [c, d] (the secondary interval). The variable "t" represents the truth value of the intersection of the interval [c, d] with the region (-∞, a) of the interval [a, b], and here its value is TRUE. The temporal relations between the two events are defined as binary 5-tuples (Rtuvwx: t, u, v, w, x ∈ {0, 1}), where "0" indicates an empty intersection and "1" a non-empty one; the secondary interval [c, d] in Figure 7 is thus related to the primary interval [a, b] by the corresponding relation Rtuvwx. This scheme can unambiguously represent all 13 of Allen's relations, e.g. precedes ≔ R10000, meets ≔ R11000, overlaps ≔ R11100, finished_by ≔ R11110, and so on.
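The encoding is straightforward to realize in code. The sketch below computes the 5-bit code of a secondary interval against a primary one, following the region definitions above; only four of the thirteen relation codes are listed in the lookup table, for brevity.

def crisp_code(primary, secondary):
    """Return the binary string R_tuvwx of a secondary interval [c, d]
    with respect to a primary interval [a, b] (Papadias' encoding)."""
    a, b = primary
    c, d = secondary
    regions = [(float("-inf"), a), (a, a), (a, b), (b, b), (b, float("inf"))]
    bits = []
    for lo, hi in regions:
        if lo == hi:                          # degenerate region [a, a] or [b, b]
            hit = c <= lo <= d
        else:                                 # open region: non-empty overlap with [c, d]
            hit = max(lo, c) < min(hi, d)
        bits.append("1" if hit else "0")
    return "".join(bits)

ALLEN = {"10000": "precedes", "11000": "meets", "11100": "overlaps", "11110": "finished_by"}

code = crisp_code((5, 10), (1, 5))            # secondary ends exactly where the primary starts
print(code, ALLEN.get(code))                  # 11000 meets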
Figure 7. Five regions of interest

The number of relations that can be defined by this method can be extended by increasing the number of regions of interest. For example, Papadias et al. define a neighborhood of the primary event up to a distance δ from its endpoints, as shown in Figure 8, to obtain nine regions of interest: (-∞, a-δ), [a-δ, a-δ], (a-δ, a), [a, a], (a, a+δ), [a+δ, a+δ], (a+δ, b), [b, b], (b, +∞). It is now possible to define relations such as precedes_near ≔ Rxx1000000 and precedes_far ≔ R1x0000000, where x represents a don't-care value. In Figure 8, the events [c, d] and [e, f] are related to [a, b] as R111000000 (precedes_near) and R100000000 (precedes_far) respectively, a distinction that is not possible with five regions of interest. The problem with this method is that the definitions of the regions are arbitrary and a crisp distinction occurs across each boundary. For example, if ε is an arbitrarily small number, a secondary event A will be considered near to a primary event B when their endpoints are at a distance δ - ε, but far when their endpoints are at a distance δ + ε. Thus, in Figure 8, the event [c, d] is considered near to [a, b] while the event [e, f] is considered far from it, though there is an insignificant difference between the distances of their endpoints 'd' and 'f' from the starting point 'a' of the primary event [a, b].
5.2 Fuzzy 1D relations
To overcome this shortcoming and to provide a graded ranking of the relationship values, Karthik et al. [7] proposed an extension of this binary encoding to fuzzy interval membership functions. The binary variables t, u, v, w, x are replaced with fuzzy membership values that represent the degree of intersection of the secondary event with the respective regions of interest of the primary event. The values of these variables are derived from fuzzy membership functions associated with the regions. However, that work does not specify how the fuzzy membership values are computed or how they are used to compute a graded value for a temporal relation. In the following paragraphs, we propose a formal way of defining the fuzzy membership functions, computing the membership values, and computing a graded value for 1D interval relations.
As in [7], we model the variables t, u, v, w, and x as real numbers denoting the degree of intersection of a secondary interval with the regions of interest (-∞, a), [a, a], (a, b), [b, b], and (b, +∞) respectively of the primary interval [a, b]. The values of these variables range over [0, 1], where '1' means the maximum degree of intersection and '0' the minimum. These values are computed for a pair of events as follows. We define five fuzzy membership functions T, U, V, W and X as the membership functions for the five regions of interest (-∞, a), [a, a], (a, b), [b, b], and (b, +∞) respectively. A set of possible functions is illustrated graphically in Figure 9, where the interval [a, b] is a primary interval with endpoints 'a' and 'b'. The functions T, U, V, W, X can be defined in a piecewise linear manner with respect to the start and end points of the primary event. For this purpose, we normalize the interval [a, b] to a unit interval, with the value of 'a' mapped to 0 and that of 'b' mapped to 1. Following this scheme, the functions shown in Figure 9 have been defined as:

T: < (-∞ 1), (-0.15 1), (0.1 0), (+∞ 0)>
U: < (-∞ 0), (-0.25 0), (-0.1 1), (0.1 1), (0.25 0), (+∞ 0)>
V: < (-∞ 0), (-0.15 0), (0.15 1), (0.85 1), (1.15 0), (+∞ 0)>
W: < (-∞ 0), (0.75 0), (0.9 1), (1.1 1), (1.25 0), (+∞ 0)>
X: < (-∞ 0), (0.9 0), (1.15 1), (+∞ 1)>
where the tuple (x y) means "at the point x in the 1D space, the value of the membership function is y", and the value of the function changes linearly between the points specified in two successive tuples. The value of each of the variables t, u, v, w, and x is determined by the maximum value of the corresponding function over the intersection of the secondary event with the corresponding region of interest. Referring to Figure 9, the membership values for the intervals [c, d], [e, f], [g, h], [i, j] with respect to the interval [a, b] are given in Table 1.
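The following is a minimal sketch of this computation, assuming the piecewise-linear functions T, U, V, W, X defined above and one plausible reading of the rule just stated: the primary interval is normalized to [0, 1] and each variable is taken as the maximum of its function over the normalized secondary interval (the maximum of a piecewise-linear function is attained at an endpoint or a breakpoint). The intervals used at the end are arbitrary illustrations, not those of Figure 9.

BIG = 1e9   # stand-in for +/- infinity in the breakpoint lists

T = [(-BIG, 1), (-0.15, 1), (0.1, 0), (BIG, 0)]
U = [(-BIG, 0), (-0.25, 0), (-0.1, 1), (0.1, 1), (0.25, 0), (BIG, 0)]
V = [(-BIG, 0), (-0.15, 0), (0.15, 1), (0.85, 1), (1.15, 0), (BIG, 0)]
W = [(-BIG, 0), (0.75, 0), (0.9, 1), (1.1, 1), (1.25, 0), (BIG, 0)]
X = [(-BIG, 0), (0.9, 0), (1.15, 1), (BIG, 1)]

def piecewise(points, x):
    """Linear interpolation over a list of (position, value) breakpoints."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 if x1 == x0 else y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return 0.0

def max_on_interval(points, lo, hi):
    """Maximum of a piecewise-linear function on [lo, hi]."""
    candidates = [lo, hi] + [x for x, _ in points if lo < x < hi]
    return max(piecewise(points, x) for x in candidates)

def memberships(primary, secondary):
    """Fuzzy values (t, u, v, w, x) of a secondary interval w.r.t. a primary interval."""
    a, b = primary
    lo, hi = [(s - a) / (b - a) for s in secondary]   # normalize: a -> 0, b -> 1
    return tuple(max_on_interval(f, lo, hi) for f in (T, U, V, W, X))

print([round(m, 2) for m in memberships((10.0, 20.0), (8.0, 11.0))])   # [1.0, 1.0, 0.83, 0.0, 0.0]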
Figure 8. Nine regions of interest
A 1D interval relation can be defined as a function R(t, u, v, w, x) of these membership values. We use the following fuzzy operators [6] to combine them:
• Union: the union of two fuzzy sets λ and µ of a set X, denoted by λ ⋃ µ, is a fuzzy subset of X defined as (λ ⋃ µ)(x) = max{λ(x), µ(x)}, ∀ x ∈ X.
• Intersection: the intersection of two fuzzy sets λ and µ of a set X, denoted by λ ⋂ µ, is a fuzzy subset of X defined as (λ ⋂ µ)(x) = min{λ(x), µ(x)}, ∀ x ∈ X.
• Complement: the complement of a fuzzy set µ of a set X is denoted by µc and is defined as µc(x) = 1 - µ(x), ∀ x ∈ X.
Note that the interval [g, h] is found to be more equal to [a, b] than [i, j], and [e, f] is found to meet [a, b] better than [c, d], using this method; the results conform to common intuition. The membership value of a relation for a pair of events in a media stream can be interpreted as the belief that the relation holds in that specific context. The belief function is a continuous function with values in the range [0, 1]. Thus, it provides a graded ranking function and overcomes the limitation of sharp ad-hoc boundaries and binary ranking. This belief value can be used for semantic interpretation of the media [5]. The crisp Allen's relations are special cases of the fuzzy relations obtained when the membership functions are appropriate step/impulse functions:

T: < (-∞ 1), (0 1), (0 0), (+∞ 0)>
U: < (-∞ 0), (0 0), (0 1), (0 0), (+∞ 0)>
V: < (-∞ 0), (0 0), (0 1), (1 1), (1 0), (+∞ 0)>
W: < (-∞ 0), (1 0), (1 1), (1 0), (+∞ 0)>
X: < (-∞ 0), (1 0), (1 1), (+∞ 1)>
These functions are illustrated graphically in Figure 10.

Figure 9. Fuzzy membership functions

Table 1. Membership values for the events in Figure 9 with respect to the interval [a, b]

Interval   t     u     v     w     x
[a, b]     0.5   1     1     1     0.5
[c, d]     1     0.4   0     0     0
[e, f]     1     1     0.2   0     0
[g, h]     0.4   1     1     0.9   0.2
[i, j]     0     0.8   1     1     1
Following Papadias' scheme, we interpret Allen's relations as intersections of the variables and their complements. We can define the thirteen Allen relations in their fuzzy form as intersections of the variables t, u, v, w and x or their complements, e.g.

equals ([a, b], [c, d]) = R01110 = tc ⋂ u ⋂ v ⋂ w ⋂ xc = min(1 - t, u, v, w, 1 - x)
meets ([a, b], [c, d]) = R11000 = t ⋂ u ⋂ vc ⋂ wc ⋂ xc = min(t, u, 1 - v, 1 - w, 1 - x)

While the above definitions preserve the precise semantics of Allen's relations, albeit in their fuzzy form, it is possible to define them in a simpler way in many practical situations, e.g. we may redefine the equals and meets relations as

equals ([a, b], [c, d]) = min(D:u, D:w) = min(T:u, T:w)
meets ([a, b], [c, d]) = min(D:u, D:vc) = min(T:u, T:vc),

where 'D' is the dimension onto which the projections of the intervals are taken; in this case D = T (the time dimension). Note that these redefinitions may not preserve the semantics of Allen's relations and can be considered as new definitions by a user. The use of the membership values of the regions of interest to compute the membership values of the relations equals and meets, for some of the events in Figure 9, is shown in Table 2; the simplified definitions of the relations are used in this table.
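As a small check of these definitions, the sketch below evaluates both the full and the simplified fuzzy forms of equals and meets on membership tuples taken from Table 1; the simplified forms reproduce the values of Table 2 below, while the full forms are stricter for these inputs.

def equals_full(t, u, v, w, x):
    return min(1 - t, u, v, w, 1 - x)       # R01110

def meets_full(t, u, v, w, x):
    return min(t, u, 1 - v, 1 - w, 1 - x)   # R11000

def equals_simple(t, u, v, w, x):
    return min(u, w)                        # simplified definition

def meets_simple(t, u, v, w, x):
    return min(u, 1 - v)                    # simplified definition

gh = (0.4, 1.0, 1.0, 0.9, 0.2)              # [g, h] w.r.t. [a, b], from Table 1
cd = (1.0, 0.4, 0.0, 0.0, 0.0)              # [c, d] w.r.t. [a, b], from Table 1

print(equals_simple(*gh), meets_simple(*cd))          # 0.9 0.4, as in Table 2
print(round(equals_full(*gh), 2), meets_full(*cd))    # 0.6 0.4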
Figure 10. Step functions

It is possible to define new 1D relations by defining different membership functions T, U, V, W, X and by combining the values of t, u, v, w, x in different ways.

Table 2. Fuzzy membership values for the equals and meets relations with respect to the interval [a, b]

Relation                   Fuzzy Membership Value
equals ([g, h], [a, b])    min{1, 0.9} = 0.9
equals ([i, j], [a, b])    min{0.8, 1} = 0.8
meets ([c, d], [a, b])     min{0.4, 1 - 0} = 0.4
meets ([e, f], [a, b])     min{1, 1 - 0.2} = 0.8
For example, we can define the relations near and far for a secondary interval [w, z] with respect to a primary interval [x, y] as:

near ([x, y], [w, z]) = (x:t ⋂ x:u ⋂ x:vc) ⋃ (x:x ⋂ x:w ⋂ x:vc) = max(min(x:t, x:u), min(x:x, x:w))
far ([x, y], [w, z]) = (t:t ⋂ t:uc) ⋃ (t:wc ⋂ t:x) = max(min(t:t, t:1-u), min(t:1-w, t:x))
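These simplified max/min forms translate directly into code, as in the sketch below; the two membership tuples are purely hypothetical illustrations (an interval ending just before the primary one yields high t and u under relaxed membership functions, while one ending well before it yields a low u).

def near(t, u, v, w, x):
    return max(min(t, u), min(x, w))          # simplified form of near given above

def far(t, u, v, w, x):
    return max(min(t, 1 - u), min(1 - w, x))  # simplified form of far given above

just_before = (1.0, 0.8, 0.0, 0.0, 0.0)       # hypothetical memberships: ends close to the primary's start
well_before = (1.0, 0.1, 0.0, 0.0, 0.0)       # hypothetical memberships: ends far before the primary

print(near(*just_before), round(far(*just_before), 2))   # 0.8 0.2
print(near(*well_before), far(*well_before))             # 0.1 0.9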
5.3 Extension to 3D relations
While Allen's relations were originally defined for temporal intervals, they readily generalize to any spatial dimension as well. In general, a D-dimensional relation can be defined as a D-tuple of 1D projections. An event in a multimedia document is confined to a 3D space: time and two orthogonal space dimensions, e.g. X and Y. Papadias has defined 3D relations between two events in space and time with a 3-tuple of relations,
each representing the projection of an event on a 1D time or space dimension, i.e. <CT, XP, YP>, where CT, XP and YP are the binary strings representing the 1D interval relations for the projections of the events on the time, X and Y axes respectively. It should be noted that not all of the elements in the tuple are mandatory for defining a relation. For example, we can define a relation top-left by using the XP and YP values only as:

top-left ≔ top and left
top ≔ y:wc ⋂ y:x, and
left ≔ x:t ⋂ x:uc

Thus, top-left ≔ (y:wc ⋂ y:x) ⋂ (x:t ⋂ x:uc).

Karthik et al. redefine the 3D relations in terms of the corresponding fuzzy intersection membership functions. Further, they argue that this representation cannot unambiguously represent all distinct event relations when some of the events may be concave. For example, the event B in Figure 11(a) is contained in the event A, while it is not in Figure 11(b). A projection-based definition of events cannot distinguish between the relations in such situations. In order to distinguish between the relations depicted in Figures 11(a) and (b), Karthik et al. proposed the introduction of a containment relation, which defines the overlap between two events in multi-dimensional space. The 3D relation between two events can now be unambiguously represented by a 4-tuple <CT, XP, YP, CS>, where CS represents the containment relation.
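As an illustration of how such a composed relation is evaluated, the sketch below computes top-left from the membership tuples of the X and Y projections; the two tuples are hypothetical values, chosen so that the secondary object lies largely in the "top" region of the Y projection and the "left" region of the X projection according to the formulas above.

def top(y_memb):
    t, u, v, w, x = y_memb
    return min(1 - w, x)                      # y:wc ⋂ y:x

def left(x_memb):
    t, u, v, w, x = x_memb
    return min(t, 1 - u)                      # x:t ⋂ x:uc

def top_left(x_memb, y_memb):
    return min(top(y_memb), left(x_memb))

x_proj = (0.9, 0.2, 0.0, 0.0, 0.0)            # hypothetical memberships of the X projections
y_proj = (0.0, 0.0, 0.1, 0.3, 0.8)            # hypothetical memberships of the Y projections

print(round(top_left(x_proj, y_proj), 2))     # 0.7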
Figure 11. Ambiguity in the contained-in relation

In order to define the containment relations between two events A and B, we note that there are four distinct possibilities: A is not completely included in B, B is not completely included in A, A and B have some intersection, and A and B have some regular intersection. These four possibilities are represented by four containment propositions:

p := A – (A ⋂ B) ≠ ϕ
q := B – (A ⋂ B) ≠ ϕ
r := A ⋂ B ≠ ϕ
s := A ⋂* B ≠ ϕ,

where ⋂* represents regularized intersection [2]. We can define six distinct containment relations Rpqrs between a primary event B and a secondary event A as:

outside ≔ R1100
inside ≔ R0111
touching ≔ R1110
contains ≔ R1011
overlaps ≔ R1111
skirting ≔ R0110

Figure 12. Containment Relations

Figure 12 depicts these containment relations. Like the other intersection variables, the variables p, q, r and s have continuous values in the range [0, 1] and represent the fuzzy membership values of the four containment propositions for a secondary event with respect to the primary event. Thus, the containment relation can be defined as a four-tuple of fuzzy membership functions and is denoted by CS.

In order to define the corresponding fuzzy membership functions P, Q, R and S, we use the normalized values |A – (A ⋂ B)| / |A|, |B – (A ⋂ B)| / |B|, |A ⋂ B| / min(|A|, |B|) and |A ⋂* B| / min(|A|, |B|) as their respective domains. Thus, each of these functions is defined over the domain [0, 1]. A set of possible functions can be defined in a piecewise linear manner as P, Q, R, S ≔ < (0 0), (0.8 0), (1 1)>; these functions are graphically depicted in Figure 13. The fuzzy membership values of the containment relations can then be computed from the values of p, q, r and s as outside (R1100) = min(p, q, 1-r, 1-s), contains (R1011) = min(p, 1-q, r, s), and so on. The use of these containment membership functions is illustrated in Section 6.

Figure 13. Containment membership functions
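For axis-aligned bounding boxes such as those produced by the SpatialDecomposition DS, the containment machinery can be sketched as below. Two assumptions are made for simplicity: the regularized intersection A ⋂* B is approximated by the area of the box overlap (adequate for convex boxes), and the fuzzy functions P, Q, R and S are the piecewise-linear <(0 0), (0.8 0), (1 1)> given above.

def area(box):
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def overlap(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return area((max(ax1, bx1), max(ay1, by1), min(ax2, bx2), min(ay2, by2)))

def ramp(x):
    """The membership function <(0 0), (0.8 0), (1 1)> used for P, Q, R and S."""
    return 0.0 if x <= 0.8 else min(1.0, (x - 0.8) / 0.2)

def containment(a, b):
    """Fuzzy containment variables (p, q, r, s) of a secondary box a w.r.t. a primary box b."""
    inter = overlap(a, b)
    p = ramp((area(a) - inter) / area(a))
    q = ramp((area(b) - inter) / area(b))
    r = ramp(inter / min(area(a), area(b)))
    s = r                 # box approximation: A ⋂* B coincides with A ⋂ B
    return p, q, r, s

def outside(p, q, r, s):
    return min(p, q, 1 - r, 1 - s)            # R1100

def contains(p, q, r, s):
    return min(p, 1 - q, r, s)                # R1011

A = (0, 0, 10, 10)                            # hypothetical secondary event
B = (20, 0, 24, 4)                            # hypothetical primary event, disjoint from A
print(outside(*containment(A, B)), contains(*containment(A, B)))   # 1.0 0.0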
6. ILLUSTRATIVE EXAMPLES
We demonstrate the capabilities of the proposed uncertain relations through a few illustrative use-case examples.
Example 1: In this example, we retrieve the monument TajMahal from an image database. The domain ontology tells us that the monument is characterized by a dome between a pair of minarets, as depicted in Figure 14. Let us assume that we have detectors for the shapes of these structural components and that they have been annotated a-priori in the MPEG-7 content description. The query engine needs to determine whether the dome is between the minaret pair.
Figure 14. Spatial relation between dome and minaret pair
Figures 15(a) and (b) depict two example images of the TajMahal from different viewpoints, with different spatial alignments of the dome and the minaret pair. In the following paragraphs, we show how the query can be formulated incorporating spatial relations and how the two images will be retrieved with different ranks. (Recognizing the TajMahal from the spatial relation between the dome and the minarets alone is rather a naïve approach and should be complemented with other visual and contextual specifications; we use this example for the purpose of illustration only.)
We treat the minaret pair as the primary object and the dome as the secondary object, and define the relation between (B, A) as follows: the projection of A on the X-axis should be within the bounds of the projection of B, and A should be outside B as far as the containment relation goes.
between ≔ X:R00100(B, A) AND C:R1100(B, A)
        ≔ (x:v ⋂ x:tc ⋂ x:xc) ⋂ (c:p ⋂ c:q ⋂ c:rc ⋂ c:sc)
        ≔ min(min(x:v, x:1-t, x:1-x), min(c:p, c:q, c:1-r, c:1-s))

This relation can be presented as shown in Figure 17.
Figure 15. TajMahal: two example images, (a) and (b)

[Figure 16 listing omitted: an MP7QF query combining two Feature descriptions, carrying the textual annotations "Dome" and "Minaret_pair", under a SpatialQueryType specifying the between relation.]
Figure 16. A query specification for between (Minaret_pair, Dome) in MP7QF

The query can be formulated as: retrieve the images where the relation between holds between two media objects that are identified as "dome" and "minaret pair" respectively. It is expressed in MP7QF as shown in Figure 16. In this specification, the Feature DS is used to describe a media object. The TextAnnotation DS describes the textual annotation of the media objects that have been recognized by some shape matching algorithm. The SpatialQueryType is used to specify the actual spatial relation; its retrieval attribute specifies the type of spatial relation between the media objects.
Figure 17. Definition of the relation between

We have extended the MPEG-7 query language to include the operators min and max, which find the minimum and maximum of the specified values. The membership values for the two events A and B in Figures 15(a) and (b), as computed by the above expression, are given in Table 3.

Table 3. Membership values for the intersection functions in X and C for the images in Figures 15(a) and (b)

Image          X: t   X: v   X: x   C: p   C: q   C: r   C: s   between
Figure 15(a)   0      1      0      1      1      0      0      1
Figure 15(b)   0.4    1      0      0.9    0.9    0.2    0.15   0.6
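The evaluation of between from these membership values can be sketched in a few lines; the tuples below correspond to the two rows of Table 3, with u and w (which do not take part in the relation) set to 0.

def between(x_memb, c_memb):
    t, u, v, w, x = x_memb                    # memberships of the X projections
    p, q, r, s = c_memb                       # containment memberships
    return min(min(v, 1 - t, 1 - x), min(p, q, 1 - r, 1 - s))

img_a = ((0.0, 0.0, 1.0, 0.0, 0.0), (1.0, 1.0, 0.0, 0.0))      # Figure 15(a)
img_b = ((0.4, 0.0, 1.0, 0.0, 0.0), (0.9, 0.9, 0.2, 0.15))     # Figure 15(b)

print(between(*img_a), round(between(*img_b), 2))              # 1.0 0.6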
From the two results, we observe that in the first image the Dome lies well within the X-extent of the Minaret_pair and is completely outside it as far as containment goes, while in the second image a part of the Dome intersects with a part of the Minaret_pair, so the conditions are satisfied only partially. With the fuzzy definition of the spatio-temporal relation we can retrieve both image instances, though the first image assumes a higher rank.
Example 2: In this example, we explore a surveillance video to detect whether one person follows another. The action "follow" cannot be crisply specified; a soft definition using fuzzy spatio-temporal relations can be used to detect the security incident with a certain degree of belief. We assume that the activity is confined to the X and time dimensions. A person B is said to follow A when B is near A (in X), behind A in time, and does not collide with A. The last condition reflects the fact that if persons A and B collide with each other, it can mean that they are interacting (e.g. shaking hands), in which case one person is not following the other, i.e.
follows (A, B) ≔ X:near and T:behind and C:outside

We define the spatial relation near (in the X dimension) as:

near ≔ X:R11000 ⋃ X:R00100 ⋃ X:R00011
     ≔ max(min(X:1-t, X:u), X:v, min(X:w, X:1-x)),

which can be approximated to max(X:u, X:v, X:w).
We have relaxed the functions X:U and X:W in the regions (-∞ a) and (b +∞) respectively with respect to the event A, and define the membership functions U, V and W as:

X:U: < (-∞ 0), (-4 0), (-0.3 1), (0.15 1), (0.2 0), (+∞ 0)>
X:V: < (-∞ 0), (-0.3 0), (0.15 1), (0.85 1), (1.3 0), (+∞ 0)>
X:W: < (-∞ 0), (0.8 0), (0.9 1), (1.1 1), (5 0), (+∞ 0)>
To define the temporal relation behind, we note that the followed person enters and exits the scene before the following person, as shown by the four snapshots in Figure 18. Thus, we can define the temporal relation behind as

behind ≔ T:R00111 = min(T:1-t, T:1-u, T:v, T:w, T:x)

The outside relation is defined as

outside ≔ min(p, q, 1-r, 1-s)

In this example,

follows (A, B) ≔ min(max(0, 0, 0.7), min(1-0, 1-0, 1, 1, 1), min(1, 1, 1-0, 1-0)) ≔ 0.7

Note that the required membership values are calculated from the locations of the bounding boxes of the media objects.
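The composition used in this example can be written down directly, as in the sketch below; the three membership tuples are the ones that appear in the worked computation above (near is used in its approximated max form).

def near_x(t, u, v, w, x):
    return max(u, v, w)                       # approximated form of near

def behind_t(t, u, v, w, x):
    return min(1 - t, 1 - u, v, w, x)         # T:R00111

def outside_c(p, q, r, s):
    return min(p, q, 1 - r, 1 - s)

def follows(x_memb, t_memb, c_memb):
    return min(near_x(*x_memb), behind_t(*t_memb), outside_c(*c_memb))

x_memb = (0.0, 0.0, 0.0, 0.7, 0.0)            # B stays close behind A along X
t_memb = (0.0, 0.0, 1.0, 1.0, 1.0)            # B enters and leaves the scene after A
c_memb = (1.0, 1.0, 0.0, 0.0)                 # their bounding boxes never overlap

print(follows(x_memb, t_memb, c_memb))        # 0.7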
Figure 18. B follows A

Thus, we conclude that the scene depicted in Figure 18 conveys that person B follows person A with a confidence level of 0.7.

7. CONCLUSION
We have proposed a formal scheme for defining spatio-temporal relations between events and for computing the belief value for the materialization of a relation between events in a media stream. The belief value lies in the continuous range [0, 1], in contrast to the binary values of earlier works, enabling graded ranking of retrieval results. We propose that the MPEG-7 Query Language be extended to incorporate this scheme. The belief value can be used for semantic interpretation of events in a media stream, as in [5]. The proposed scheme can be applied in various real-life domains such as digital media libraries, surveillance solutions and air traffic control. An implementation of this scheme in a visual surveillance system is currently underway at our lab.
8. REFERENCES
[1] Allen, J. F., "Maintaining Knowledge about Temporal Intervals", Comm. ACM 26(11), 1983.
[2] Arbab, F., "Set models and Boolean operations for solids and assemblies", IEEE Computer Graphics and Applications, vol. 10, no. 6, pp. 76-86, 1990.
[3] Doller, M., Renner, K., Kosch, H., Wolf, I., Gruhne, M., "Introduction of an MPEG-7 Query Language", 2nd International Conference on Digital Information Management (ICDIM), pp. 92-97, 2007.
[4] Ghosh, H., Chaudhury, S., Kashyap, K. and Maiti, B., "Ontology Specification and Integration for Multimedia Applications", in Ontologies: A Handbook of Principles, Concepts and Applications in Information Systems, Springer, 2007.
[5] Harit, G., Chaudhury, S., Ghosh, H., "Using Multimedia Ontology for Generating Conceptual Annotations and Hyperlinks in Video Collections", IEEE International Conference on Web Intelligence, pp. 211-217, 2006.
[6] Kandasamy, W. B. V., "Smarandache Fuzzy Algebra", American Research Press, Rehoboth, 2003.
[7] Karthik, T., Chaudhury, S. and Ghosh, H., "Specifying Spatio-Temporal Relations in Multimedia Ontologies", International Conference on Pattern Recognition and Machine Intelligence (PReMI), Kolkata, Dec 2005.
[8] Liu, P., Chakraborty, A., and Hsu, L. H., "A Logic Approach for MPEG-7 XML Document Queries", in Proceedings of Extreme Markup Languages, Montreal, Canada, 2001.
[9] Papadias, D., "Approximate spatio-temporal retrieval", ACM Transactions on Information Systems, vol. 19, pp. 53-96, January 2001.
[10] Salembier, P., Smith, J. R., "MPEG-7 Multimedia Description Schemes", IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, no. 6, June 2001.