Logical Image Modelling and Retrieval

CARLO MEGHINI
Istituto di Elaborazione dell'Informazione, Consiglio Nazionale delle Ricerche, Via S. Maria, 46 - I-56126 Pisa, Italy
email:
[email protected]
A logical model of images is presented, as a theoretical foundation of information systems allowing the retrieval of images on the basis of their form (typically via visual queries) and content. The proposed model offers a multiple representation of images, extending along three dimensions: form, content and abstraction. At each dimension, a first-order logic is postulated as a representation and reasoning tool, and one specific logic is introduced for the first of these dimensions, the one dealing with the form of images. A query on images is modelled along the same three dimensions and by means of the same logics. This allows us to model the image retrieval process by the relation of classical logical implication, i.e. an image is retrieved if it logically implies the query.
1. INTRODUCTION
A new generation of information systems is emerging, capable of storing and retrieving objects extending along several media dimensions, among which the visual dimension plays a primary role. Nowadays, due to the impressive achievements in workstation and network technologies, large image repositories are being constructed, maintained and accessed by specialized users. And it is foreseen that in the medium term these systems will serve the information needs of an increasing community of users with less and less computer expertise. The aim of this study is to give the logical foundations of image retrieval systems, that is, systems that allow storing and retrieving images on the basis of their form and content. There are two basic reasons that motivate our effort. First, an information system is essentially a logical agent, which answers questions based on what it has previously been told. In order to perform this task in a way that is consistent with the users' expectations, the system must follow laws which are logical in nature. Databases typically follow computationally simple logics, as performance is paramount for them. On the contrary, the logic of a deductive database or a knowledge-based system allows for more sophisticated inferences, as deductive abilities are the main concern of these systems. For almost 10 years it has been argued that an information retrieval system, i.e. a system for the storage and the content-based retrieval of textual documents, also obeys a logic [1]. Since then, an increasing number of researchers have subscribed to this view, and a variety of approaches to the classical problem of information retrieval have been proposed which find their foundation in some logic. The present study goes one step further in this direction by presenting image retrieval as logical inference. The question naturally arises whether images lend themselves to logical reasoning, in the same way sentences do. This leads to the second motivation of our study. From our point of view, images are a communication medium and therefore objects of a linguistic nature. As such, images can be paralleled to sentences of natural languages or to
descriptions of formal languages for conceptual modelling. In particular, an image has a syntactical structure, given by the spots of colour that constitute it. These coloured spots can be likened to the alphabet of a written language: they can make up meaningful expressions which, similar to the sentences of written language, denote worlds, according to the classical notion of meaning as truth conditions. This paper is structured in two main parts. The first part deals with the modelling of images. It starts with a mathematical reformulation of the ordinary notion of image (section 2), on the basis of which a first-order logic for the representation of images is developed in section 3. The connection between the sentences of this logic and images is established in section 4, where semantics of the former are given in terms of the latter. The inferences licensed by our logic are discussed in section 5, and their use in querying an image base is illustrated in section 6. The first part closes with a comparison of the proposed model with other image models developed in different areas (section 7). In the second part of the paper, the modelling of the content of images and the notion of image abstraction are introduced. The nature of image content and its role in image retrieval are discussed in sections 8 and 9. Finally, section 10 presents an extension of the image representation taking into account alternative image forms.

2. IMAGE MODELS
To develop a mathematical definition of images, we start by identifying the properties that intuitively characterize them. First, images live in a discrete space. In fact, although an image containing a reduced-scale copy of itself would have a virtually infinitely small granularity, images of this sort cannot be reproduced in practice, since no physical device has infinite accuracy. Second, images have structure. In particular:

1. an image occupies a 2-dimensional (2-D) region¹, which is connected, that is, it does not consist of separate parts;
2. the region of an image is partitioned into connected subregions, called spots, such that: (a) each spot has a unique colour, and (b) no two contiguous spots have the same colour.

If we set the origin of a fixed 2-D Cartesian space on the bottom-left corner of the minimum bounding rectangle of an image, that image can be represented as a partition of a connected region of the fixed space, with a function that assigns a colour to each member of such partition and to its subsets. An image can thus be viewed as a triple, an image model, given by:

⟨A, Π, f⟩

where:

• A, the region, is a finite and connected subset of ω², where ω is the set of natural numbers;
• Π, the spots, is a partition of A;
• f is a partial function from the powerset of A to a predefined set of colours L, such that:
  – for each X ∈ Π, f(X) ∈ L is the colour of X; and
  – for every Y ⊆ X, f(Y) = f(X);
  f is undefined everywhere else.
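To fix intuitions, an image model can be rendered as a small data structure. The sketch below (in Python) is only an illustration of the triple ⟨A, Π, f⟩ under an encoding of our own choosing: pixels are pairs of naturals, regions are frozensets of pixels, colours are plain strings, and the partial function f is derived from the spot colouring.

```python
from dataclasses import dataclass

Pixel = tuple[int, int]        # a point of the discrete space
Region = frozenset[Pixel]      # a finite set of such points

@dataclass
class ImageModel:
    """A sketch of an image model <A, Pi, f>."""
    region: Region                    # A, the region
    spots: dict[Region, str]          # Pi with its colouring: each spot X mapped to f(X)

    def colour(self, x: Region) -> str | None:
        """The partial function f: defined on every non-empty subset of a spot
        (with that spot's colour), undefined (None) everywhere else."""
        for spot, col in self.spots.items():
            if x and x <= spot:
                return col
        return None

# A 2x2 white square with a single spot:
sq = frozenset({(0, 0), (0, 1), (1, 0), (1, 1)})
white_square = ImageModel(region=sq, spots={sq: "white"})
assert white_square.colour(frozenset({(0, 0)})) == "white"
```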
The notion of image model captures all the aspects of images outlined in points 1 and 2a above, but fails to account for the restriction expressed in point 2b. As an undesired consequence, many image models can represent the same image. To see why, consider the image i given by a white square of size k. If we let [m] denote the set of the first m natural numbers, our intended model for i is the one having [k]² as region and just one spot, namely [k]² itself, whose colour is white. But in fact any model ⟨[k]², Π, f⟩ represents i in the sense of points 1 and 2a, as long as Π is a partition of [k]² and f is valued white wherever defined. The difference between these image models and the intended model is that the latter satisfies point 2b while none of the former does. In order to capture formally the requirement expressed by point 2b, an ordering relation between image models is introduced. An image model M = ⟨A, Π, f⟩ is said to be finer than an image model M′ = ⟨A′, Π′, f′⟩, in symbols M ⊑ M′, if and only if:

1. A ⊆ A′;
2. for all X ∈ Π there exists X′ ∈ Π′ such that X ⊆ X′ and f(X) = f′(X′).
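Read operationally, the two conditions above amount to the following check, again a sketch under the hypothetical encoding introduced with the ImageModel class (a region plus a spot-to-colour dictionary).

```python
Pixel = tuple[int, int]
Region = frozenset[Pixel]

def is_finer(m: "ImageModel", m2: "ImageModel") -> bool:
    """M is finer than M' iff A is contained in A' and every spot of M
    is contained in a spot of M' carrying the same colour."""
    if not m.region <= m2.region:                       # condition 1
        return False
    return all(                                         # condition 2
        any(x <= x2 and col == col2 for x2, col2 in m2.spots.items())
        for x, col in m.spots.items()
    )
```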
M is said to be strictly finer than M′, in symbols M ⊏ M′, if and only if M ⊑ M′ and M ≠ M′. Intuitively, if M is strictly finer than M′ then the two models represent the same image (they visually look the same), but Π contains two regions X and Y which are both contained in one region Z in Π′, that is X ∪ Y ⊆ Z, and which have the same colour as Z, that is f(X) = f(Y) = f′(Z). Now, let us consider the image model M₁ which is equal to M except that in M₁'s partition X and Y are replaced by X ∪ Y.

¹ Although any reproduction of an image has a non-zero thickness.
It is not difficult to see that M₁ and M represent the same image, but M ⊏ M₁. If we apply the same procedure to M₁ and to the resulting model, and so on until it is no longer possible, we obtain a chain of models:

M ⊏ M₁ ⊏ … ⊏ M⊤

where M⊤ is maximal with respect to the ⊑ relation, that is no image model Mᵢ other than M⊤ exists such that M⊤ is finer than Mᵢ. M⊤ is our intended model for the image represented by all the models in the chain. If, for a given image, we let ℳ stand for the set of models that represent it, the intended model of that image can be mathematically characterized as the maximum of the lattice ⟨ℳ, ⊑⟩, above denoted as M⊤. In fact, it can be proved that M⊤ is the only model in ℳ satisfying all the points above, and hence our tacitly assumed 'image model'.
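The intended model can be computed by repeatedly merging contiguous spots of the same colour until point 2b holds. The sketch below is ours and approximates contiguity by 4-adjacency of pixels; it is not meant as the paper's algorithm, only as one way of reaching the maximal model of the chain.

```python
Pixel = tuple[int, int]
Region = frozenset[Pixel]

def adjacent(a: Region, b: Region) -> bool:
    """Two regions are contiguous if some pixel of one is a 4-neighbour of a pixel of the other."""
    return any(abs(x1 - x2) + abs(y1 - y2) == 1 for (x1, y1) in a for (x2, y2) in b)

def intended_model(spots: dict[Region, str]) -> dict[Region, str]:
    """Merge contiguous spots of equal colour until no merge applies;
    each merge strictly coarsens the model, so the loop terminates."""
    spots = dict(spots)
    merged = True
    while merged:
        merged = False
        for x in list(spots):
            for y in list(spots):
                if x is not y and x in spots and y in spots and \
                   spots[x] == spots[y] and adjacent(x, y):
                    colour = spots.pop(x)
                    spots.pop(y)
                    spots[x | y] = colour
                    merged = True
    return spots
```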
An image universe U(m, n) is a triple ⟨[m] × [n], L, M(m, n)⟩, where [m] × [n] is the size of the universe, L is a set of colours, and:

M(m, n) = {⟨A, Π, f⟩ | A ⊆ [m] × [n] and f : 2^A → L}.

An image universe is thus determined by a rectangular area of a 2-D discrete space, a set of colours and all the images that can be composed with these two ingredients, which is to say the set of the image models whose region fits into the size of the universe. Image universes are finite objects such that:

M(m, n) ⊆ M(m′, n′) iff m ≤ m′ and n ≤ n′.

We will consider as semantic domain the universe U given by ⟨ω², L, M⟩, where M = ⋃ᵢ∈ω M(i, i). In the following, we will tacitly deal only with regions, colours and image models in U.

3. IMAGE DESCRIPTIONS
Having precisely defined the concept of image, we now present a many-sorted first-order language for image description. The language in question, which we call I, has three sorts, corresponding to three kinds of entities:

• r, the sort of regions; the constant symbols from this sort will be capital letters drawn from the beginning of the alphabet (metasymbols r and rₖ); ∃_r will be used as the existential quantifier ranging on regions;
• c, the sort of colours; the constant symbols from this sort will be English names for colours (metasymbol c), while ∃_c will be the existential quantifier;
• i, the sort of images; the constant symbols (and their metasymbols) from this sort will be small letters drawn from the middle of the alphabet; ∃_i is the existential quantifier for images.
For each sort, the alphabet of I includes countably many constant symbols and variables. The latter will be small letters from the end of the alphabet with no additional sort
information, deducible from the associated quantifiers. The only other symbol that we need in I is the 3-place predicate symbol I, of sort ⟨i, r, c⟩, which names the association between images, regions and colours necessary to describe the former. For instance, the expression:

I(i, A, blue)

means that region A of image i is blue. Image names are part of the language because images may be referenced from within objects of a different kind. As the language has, for the moment, no function symbols, its terms are just constant symbols or variables, whose sorts give the sorts of the corresponding terms. The atomic formulas are instances of the predicate symbol I, i.e. expressions of the form I(t₁, t₂, t₃), where each tᵢ is a term of the proper sort. The well-formed formulas of I are the smallest set containing the atomic formulas and the formulas:

1. ¬φ;
2. φ ∨ ψ;
3. (∃_r x)φ;
4. (∃_i x)φ;
5. (∃_c x)φ;

where φ and ψ are well-formed formulas. As customary, well-formed formulas made up from all the interesting connectives and the three universal quantifiers ∀_r, ∀_i and ∀_c can be considered part of the language as abbreviations of primitive expressions. We also recall that any occurrence of a variable x in a formula is said to be bound if the formula has the form (Qx)φ, where Q is a quantifier. A variable is free in a formula if at least one of its occurrences is not bound. Finally, a sentence is a well-formed formula in which there are no free variables. Among the sentences of I, we will focus our attention on those that refer to one image, as our present goal is to define a language for describing single images. This means that we are willing to admit only two kinds of sentences:

• named descriptions, that is sentences in which the first term of any instance of the predicate symbol I is the same image name; and
• unnamed descriptions, which are of the form (∃_i x)φ, where φ is an open formula whose only free variable is x, which occurs only as the first term of any instance of the predicate symbol I.

For instance,

I(i, A, blue) → I(i, B, green)
is a named description saying that in image i either A is not blue or B is green, whereas:

(∃_i x)(I(x, A, blue) ∨ (∀_r y) I(x, y, green))

is an unnamed description asserting that there is some image in which A is blue or which is entirely green. Image descriptions can be precisely defined as follows. A named description is either the atomic sentence I(i, t₁, t₂), where i is an image name and t₁ and t₂ are terms of the proper sort, or is one of ¬φ, φ ∨ ψ, (∃_c x)φ, or (∃_r x)φ, where φ and ψ are named descriptions. Similarly, an unnamed description is given by (∃_i x)φ, where φ is a free unnamed description, that is either an expression of the form I(x, t₁, t₂), or one of ¬φ, φ ∨ ψ, (∃_c y)φ, or (∃_r y)φ, where φ and ψ are free unnamed descriptions. In practice, a free unnamed description is a named description with a variable in place of the image name, and an unnamed description is obtained simply by prefixing the appropriate existential quantification to it. Two image descriptions are name consistent if either one of them is unnamed, but not both, or if they have the same image name. An image representation is a set of pairwise name consistent descriptions, and it is said to be named if it contains at least one named description, otherwise it is unnamed. An image base is a set of image representations.
4. THE SEMANTICS OF IMAGE DESCRIPTIONS

Image descriptions represent images, in a way to be specified by defining the appropriate mapping from the former to the latter. The first step in this direction is the definition of a function mapping the constant symbols of each sort of I into the proper individuals of the semantic universe U. Following the ordinary intuitive notion of image, we will let the mapping of the colour and region names be interpretation independent, that is one for all interpretations, thus giving to these kinds of names the character of rigid designators². In fact, when in everyday speech we say 'blue' we mean a specific colour and not something that may denote blue in one image and white in another. The same applies to spots, since we have fixed a Cartesian reference system for all images. The interpretation of image names is less straightforward, because to be coherent with our development, we must let image names name image models, which are also the interpretation structures of I. However, having within the language names for the language's interpretations, although unusual, is harmless in our case, because those names are only used to make simple assertions about spots and colours in images. So we will have image names rigidly designate image models, and we will spell out in a moment the obvious condition that must be satisfied by an image base in order to be consistent with such designation. Given the universe U and the language I, a denotation function for I is a one-to-one mapping associating:

• the constant symbols of sort r and the countably many finite subsets of ω²;
• the constant symbols of sort c and L;
• the constant symbols of sort i and M.

² Technically, the term 'rigid designator' is used in the context of a possible world semantics to indicate a constant symbol which denotes the same object in all worlds. Since an image model is a world and the interpretation of constant symbols is fixed in all image models, the usage of this term here does not seem entirely inappropriate.
From the countability of the finite subsets of ω² and the finiteness of each image universe M(i, i), the countability of M follows. Out of the denotation functions we factor one function, d, and call it the image denotation function. An image structure for I is a pair ⟨M, d⟩, where M is an image model. Notice that the function d is the same for all image structures. This means that image structures will differ in image models, but a name denotes the same thing in all structures. A formal semantics for image descriptions can now be given in terms of image structures. Let M be an image model ⟨A, Π, f⟩, φ and ψ image descriptions, and φ[x/c] the same as φ except that all the occurrences of the variable x are replaced by occurrences of the constant c. An image structure ⟨M, d⟩ is said to support an image description φ if and only if ⟨M, d⟩ ⊨ φ, where the relation ⊨ is inductively defined as follows:

1. ⟨M, d⟩ ⊨ I(i, r, c) iff f(d(r)) = d(c);
2. ⟨M, d⟩ ⊨ ¬φ iff ⟨M, d⟩ ⊭ φ;
3. ⟨M, d⟩ ⊨ φ ∨ ψ iff either ⟨M, d⟩ ⊨ φ or ⟨M, d⟩ ⊨ ψ;
4. ⟨M, d⟩ ⊨ (∃_r x)φ iff for some constant symbol c of sort r, d(c) ∈ dom(f) and ⟨M, d⟩ ⊨ φ[x/c];
5. ⟨M, d⟩ ⊨ (∃_i x)φ iff for some constant symbol c of sort i, ⟨M, d⟩ ⊨ φ[x/c];
6. ⟨M, d⟩ ⊨ (∃_c x)φ iff for some constant symbol c of sort c, ⟨M, d⟩ ⊨ φ[x/c].
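Clauses 1–6 translate directly into a recursive evaluator. The sketch below assumes the hypothetical encodings introduced earlier (the ImageModel class and the Formula constructors), a dictionary d standing in for the denotation function, and a listing of the constant symbols of each sort for the quantifier clauses; as before, it is illustrative only.

```python
def supports(model: "ImageModel", d: dict, phi: "Formula",
             constants: dict[str, list[str]]) -> bool:
    """<M, d> |= phi, following clauses 1-6. `constants` lists the constant
    symbols of each sort ("r", "c", "i") that the quantifiers range over."""
    if isinstance(phi, Atom):                              # clause 1: the image term is ignored
        return model.colour(d[phi.region]) == d[phi.colour]
    if isinstance(phi, Not):                               # clause 2
        return not supports(model, d, phi.arg, constants)
    if isinstance(phi, Or):                                # clause 3
        return supports(model, d, phi.left, constants) or \
               supports(model, d, phi.right, constants)
    if isinstance(phi, Exists):                            # clauses 4-6
        for c in constants[phi.sort]:
            if phi.sort == "r" and model.colour(d[c]) is None:
                continue                                   # clause 4: d(c) must lie in dom(f)
            if supports(model, d, substitute(phi.body, phi.var, c), constants):
                return True
        return False
    raise TypeError(f"unknown formula: {phi!r}")

def substitute(phi: "Formula", var: str, const: str) -> "Formula":
    """phi with every occurrence of the variable replaced by the constant."""
    if isinstance(phi, Atom):
        repl = lambda t: const if t == var else t
        return Atom(repl(phi.image), repl(phi.region), repl(phi.colour))
    if isinstance(phi, Not):
        return Not(substitute(phi.arg, var, const))
    if isinstance(phi, Or):
        return Or(substitute(phi.left, var, const), substitute(phi.right, var, const))
    if isinstance(phi, Exists):
        if phi.var == var:                                 # shadowed by the inner quantifier
            return phi
        return Exists(phi.sort, phi.var, substitute(phi.body, var, const))
```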
In words, the first statement above says that an image model M supports an atomic image description I(i, r, c) if in M the region designated by r (i.e. d(r)) is of the colour designated by c (i.e. d(c)). Notice that no condition is expressed on the relationship between M and the image model mentioned in the statement (d(i)). This is because I(i, r, c) is supported by many image models, and selecting one of these (the obvious candidate is d(i)) would make our support relation unnecessarily restrictive. Statements 2–6 simply extend the same notion in the obvious way to non-atomic image descriptions, taking into account the sorted nature of the language. Notice that in statement 4 the membership of d(c) in dom(f) is required to ensure that the quantifier ∃_r ranges on the spots of the image model M and on their subsets, rather than on the countably many finite subsets of ω². Without this proviso, the sentence

(∃_r x)¬I(i, x, c)

would be a tautology, due to the finiteness of image models. It is not difficult to see that image structures are strict relatives (actually, special cases) of first-order structures for I, and that the support relation is essentially the Tarskian definition of truth applied to image descriptions. This justifies the following terminology. An image structure is said to support an image representation δ if it supports each description in the representation. An image representation δ logically implies an image description φ if and only if every image structure that supports δ also supports φ. Following the official notation, we write

δ ⊨ φ

to signify that δ logically implies φ. We can now state the relation between image representations and image models that accounts for a sane naming of the latter. Intuitively, what we want is that the image model rigidly designated by the image name i be among the models supporting any image representation named i. In other words, we want to rule out the situation in which an image representation named i is not supported by the image it is supposed to represent. An image base is said to be faithful if, for each image representation δ named i in the image base:

d(i) ∈ {M | ⟨M, d⟩ ⊨ δ}.

Since the choice of the names for image models is arbitrary, in practice an image base is faithful if and only if it satisfies the following two rules:

1. it does not contain a contradictory image representation (because such a representation is supported by no structure);
2. two image representations δ₁ and δ₂ in it have the same name if and only if δ₁ ∧ δ₂ is satisfiable, that is, there exists an image model M such that ⟨M, d⟩ ⊨ δ₁ and ⟨M, d⟩ ⊨ δ₂.
5. INFERENCES ON IMAGE REPRESENTATIONS
We are now in a position to discuss what kind of inferences the logic defined above gives us, i.e. what image descriptions are logically implied by a given image representation. First, we have the inferences which represent trivial cases of the logical implication relation:

1. an image representation logically implies itself:

{φ₁, φ₂, …, φₙ} ⊨ φ₁ ∧ … ∧ φₙ;

this is a simple consequence of the definition of implication: a model supports a set of descriptions if it supports each member of the set, hence the members' conjunction;

2. a contradictory image representation, having no models, logically implies any image description; for instance:

{I(i, A, blue), ¬I(i, A, blue)} ⊨ φ

where φ is any image description;

3. a tautological image description, being supported by any model, is implied by any image representation; for instance:

δ ⊨ I(i, A, blue) → I(i, A, blue)

where δ is any image representation.
The second inference is of no interest because contradictory image representations are not allowed in faithful image bases. The first and third account for a 'principled' behaviour of the logical implication relation, which will be exploited in querying an image base (see next section). These inferences are all valid in the first-order predicate
calculus and, by interpreting in a certain way the sentences that occur in them, we have simply rephrased the classical notion of validity in terms of image representations, thus capturing the concept of an image being 'more specified' or 'less vague' than another. This concept can be illustrated in a more direct way by considering a simple image representation such as:

δ = {I(i, A, white), I(i, B, blue), I(i, C, green)}.

An image description vaguer than δ is one with fewer spots, such as:

I(i, A, white) ∧ I(i, C, green)

or one that also denotes other images, such as:

I(i, B, blue) ∨ φ

where φ is any description which is name consistent with I(i, B, blue). It is not difficult to see that both these descriptions are logically implied by δ. In the former case, the implication relation can be visualized, since both the left- and the right-hand sides of the implication can be pictorially represented. An inference that captures more of the nature of images is the one which allows us to infer from an image representation the descriptions having equal or smaller spots with the same colour. The proof of the correctness of this implication relies on the fact that if an image model M assigns the colour c₁ to the spot r₁ then, by definition of the f function, it also assigns c₁ to any subset of r₁; hence if M supports a representation asserting (no matter how explicitly) that r₁ has colour c₁, it will implicitly support any description asserting c₁ for a subset of r₁. This inference scheme, which can be called 'proper', is exemplified by the following implication:

{I(i, A, c)} ⊨ I(i, B, c)

for all constants A and B such that d(B) ⊆ d(A). This inference is clearly not valid in FOPC, as one can easily imagine an interpretation of the symbols occurring in it that falsifies the logical implication.
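For representations given as sets of ground atoms whose region constants have already been replaced by the regions they denote, the 'proper' inference reduces to a containment test, as in the following sketch (the tuple encoding of atoms is ours).

```python
Pixel = tuple[int, int]
Region = frozenset[Pixel]
GroundAtom = tuple[str, Region, str]     # (image name, denoted region, colour)

def implies_proper(representation: set[GroundAtom], query: GroundAtom) -> bool:
    """{..., I(i, A, c), ...} implies I(i, B, c) whenever d(B) is a subset of d(A):
    look for an asserted atom about the same image, with the same colour,
    whose region contains the queried one."""
    q_img, q_region, q_colour = query
    return any(img == q_img and colour == q_colour and q_region <= region
               for img, region, colour in representation)
```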
6. QUERYING AN IMAGE BASE

The concept of image representation has been introduced to provide the logical foundation for the modelling of images in image retrieval systems. Likewise, the logical implication relation illustrated in the previous section is to play the same foundational role for retrieval based on the form of images, as opposed to retrieval based on image content, which will be dealt with in a later section. To this end, we let a query be any unnamed image description φ, and we let an image δ be retrieved in response to it if and only if:

δ ⊨ φ.

We would like at this point to draw attention to the fact that neither images nor queries need be specified by the user as sentences of I. Image representations can be automatically constructed upon acquisition of images in a manner that is entirely transparent to the user. Queries on images can be obtained in the same automatic and transparent way from a higher-level input provided by the user, such as a drawing
produced with an appropriate editor (as suggested in [2]) or even an image itself. Indeed, even the internal representation of images and queries need not be bound to the syntax of I: all that our approach prescribes is that the system behave in conformity with the logic from the viewpoint of an external observer. We now have all the ingredients of a retrieval model, for we have a language for representing images and queries, and a relation capturing the notion of relevance of the former to the latter. Our model clearly subscribes to the logical view of information retrieval [1], as it interprets retrieval as logical inference. The question naturally arises how adequately this model can support the form-based retrieval of images. Given the foundational nature of our study, the proposed image retrieval model is intentionally essential, as our aim is to show the basis of logical image modelling. From a practical point of view, the model must be understood as a backbone, to be extended in appropriate ways to deal with the many technical peculiarities of image retrieval. To show how this can be done, we briefly address two important extensions, concerning: (1) a more flexible representation of regions and colours, and (2) the uncertainty of image retrieval.
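Before turning to these extensions, the retrieval relation itself can be pictured as a filter over the image base, with the implication test left abstract: entails below is a placeholder for whatever decision procedure is adopted, and the whole fragment is a hypothetical illustration rather than an actual system interface.

```python
from typing import Callable, Iterable

def retrieve(image_base: Iterable[object], query: object,
             entails: Callable[[object, object], bool]) -> list[object]:
    """Return the representations delta in the image base such that delta implies the query."""
    return [delta for delta in image_base if entails(delta, query)]

# e.g., with ground-atom representations and a single-atom query (see section 6.3):
# hits = retrieve(base, ("i", frozenset({(0, 0)}), "blue"), implies_proper)
```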
6.1. Region and colour representation

The reason for assuming that region and colour names are rigid designators, i.e. purely denotative objects with no connotation, was purely rhetorical: we did not want the basic features of our language to be hidden by a sophisticated referential apparatus for denoting sets of pairs or elaborate chromatic tonalities. However, nothing prevents the terms of our language from providing the machinery needed for a less cryptic representation of regions and colours. For the former, the language of elementary mathematics can be imported into I, along with the classical operators for modelling transformations such as scaling, rotating and translating 2-D regions. These are especially useful in expressing queries, as they allow us to denote in a concise and mathematically elegant way a number of different image models. For instance, we can use a function of two arguments, [x, y] in infix notation, to denote the interval having x and y as minimum and maximum, respectively. In this way, the unnamed description:

I(x, [x₁, x₁+1] × [x₂, x₂+2], y)

where all the variables are existentially quantified, is supported by all image models containing a rectangular region of size 1 × 2 aligned with the axes of the fixed Cartesian space. By letting a 3-place function symbol, here written ρ(r, a_x, a_y), denote the rotation of region r by angles a_x and a_y, the unnamed description:

I(x, ρ([x₁, x₁+1] × [x₂, x₂+2], y₁, y₂), blue)

is implied by the representations of images having a blue 1 × 2 rectangle, however rotated. As far as the representation of colours is concerned, the constant representing a certain colour can be the brightness
of the gray level. For a more sophisticated representation, the language can be extended with a 3-place function giving the brightness values of three wavelengths: red, blue and green [3]. When rigid designators are replaced by structural descriptions, the proper syntactical and semantical arrangements must be made to our retrieval model. From the syntactical point of view, I has to be extended with the sorts and the function symbols needed to represent the newly introduced operators and their arguments. From the semantical point of view, these new operators must be associated with the corresponding operations, whereas the role of the image denotation function d is played by the function giving the semantics of these descriptions. It is important to realize that the region description language impacts on the performance of retrieval, which requires the determination of the region denoted by a description, i.e. d(r).
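To make the role of d concrete for structured region terms, the following sketch computes the region denoted by an axis-aligned interval product and by a translation operator; the constructors are hypothetical and merely mirror the kind of operators discussed above.

```python
Pixel = tuple[int, int]
Region = frozenset[Pixel]

def rectangle(x1: int, x2: int, y1: int, y2: int) -> Region:
    """Denotation of the term [x1, x2] x [y1, y2]: all points of the
    discrete space falling inside the rectangle (bounds included)."""
    return frozenset((x, y) for x in range(x1, x2 + 1) for y in range(y1, y2 + 1))

def translate(r: Region, dx: int, dy: int) -> Region:
    """Denotation of a translation operator applied to a region term."""
    return frozenset((x + dx, y + dy) for x, y in r)

# The 1 x 2 rectangle of the example, anchored at x1 = 3, x2 = 5:
r = rectangle(3, 4, 5, 7)
```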
6.2. Uncertainty

While it is expected that the image representations stored in an image base be perfectly vivid, it is likely that the users of the image base would appreciate the possibility of expressing uncertainty in their queries. As we are dealing with form-based retrieval, this uncertainty will concern the boundaries of regions or the tonality of colours. There are at least two ways of tackling this problem in information retrieval: the exact and the non-exact match approach. In the exact match approach, a document (an image representation in our case) either matches a query or it does not. The implication relation of our model falls within this approach and gives the possibility of expressing uncertainty by means of the logical apparatus of the language. As is well known, negation, disjunction and existential quantification are powerful uncertainty operators, which the user can exploit in posing queries to an image base. To alleviate the query specification task, special function symbols may be introduced which make fuzzy the boundaries of a region or the brightness of a colour. For instance, the unnamed description:
(∃_i x) I(x, fuzzy([x₁, x₁+1] × [x₂, x₂+2], y₁, y₂), blue)

where fuzzy is one such function symbol, can be taken to denote all the image models containing a fuzzy blue 1 × 2 rectangle with a tolerance of y₁ in the x coordinates and y₂ in the y coordinates. As such it would be equivalent to the query:

(∃_i x) ⋁ᵢ I(x, Aᵢ, blue)

where each Aᵢ is one of the variations of the original rectangle. In the non-exact match approaches, a function, usually valued in the [0, 1] interval, is used to measure the degree of match between the document and the query. A wide range of methods has been proposed to obtain this measure. We believe that the principled way of developing this kind of approach is to adopt an underlying theory of uncertainty, such as probability theory, and couple the logic with this theory, so obtaining a well-founded retrieval model, as suggested by [1]. Our proposed retrieval model lends itself to this kind of extension, being endowed with a formal semantics.

6.3. Computational adequacy

From the computational point of view, our retrieval model would seem to suffer from a major drawback, since deciding logical implication between two arbitrary sentences of a (non-monadic) first-order language is a problem known to be unsolvable. There is no evidence that the restriction to named and unnamed descriptions makes things any easier. Therefore, it seems that if we allow arbitrary image representations in an image base, we run into a serious problem. This is, however, not a problem of our model: it is a general fact about the expressive power of a representation scheme that if we want an information system to handle information as incomplete as that expressible in a first-order language, we must accept the price of very poor system performance. And there is a clear correlation between the degree of incompleteness allowed by the representation language and the complexity of the related decision problem [4]. Since we cast our image representation and reasoning in the general context of mathematical logic, we inherit the inherent intractability of the decision problem. However, the fact that the relational data model can be viewed as a first-order language to express and reason about facts [5] has not stopped the (indeed massive) use of that model in real applications. The question is, of course, how much of the expressive power of first-order languages is used in the application. In this respect, the relational data model is very strict, as it allows the users to 'tell' the database system only ground atomic sentences (i.e. tuples). Likewise, in an image repository we would expect to find 'true' image representations, directly (and very likely automatically) derived from images. Each representation would then be a set of ground atomic formulas {φ₁, φ₂, …, φₙ}, such that:
φₖ = I(iₖ, rₖ, cₖ),  1 ≤ k ≤ n.
Representations of this kind can be qualified as vivid [6], as they are isomorphic to their denoted objects, i.e. image models. Correspondingly, the decision problem becomes equivalent to the problem of evaluating the truth of an arbitrary quantified Boolean expression (the query φ) in a given interpretation (the image representation δ). This problem is PSPACE-complete [7], hence decidable although, at present, at an exponential cost. But, again, the complexity of query evaluation depends on how complex the query itself is. In the simplest case, a query is the conjunction of ground atomic sentences describing an image or fragments of it,
such as the φₖ above. Computing whether a vivid image representation implies this kind of image description amounts to checking whether each φₖ occurs in the image, a problem that requires a linear number of region matches. This is the base case of query processing, as it does not involve any quantifier or sentential connective besides conjunction. The occurrence of either of these in the query adds one level to the complexity of query evaluation, leading, in the worst case, to an exponential explosion.
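The base case just described can be sketched by reusing the containment test given for the 'proper' inference in section 5: a conjunctive ground query succeeds iff every conjunct is matched by some asserted atom, at the cost of one pass over the representation per conjunct.

```python
def evaluate_conjunctive(representation: set["GroundAtom"],
                         conjuncts: list["GroundAtom"]) -> bool:
    """Check a conjunction of ground atoms against a vivid representation:
    a linear number of region matches (len(conjuncts) * len(representation))."""
    return all(implies_proper(representation, atom) for atom in conjuncts)
```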
7. RELATION TO PREVIOUS WORK

It is worth pausing at this point to relate the image model introduced above to other proposals sharing, to a reasonable extent, the same goal. As the number of such proposals is too large to permit an exhaustive discussion, we will restrict ourselves to some of the best known image models, coming from a fairly wide range of different fields. The assumption that images can be regarded as sentences in a natural language goes back at least 30 years, and has been the basis of the activity of researchers investigating the automatic generation of images (for a survey, see [8]). Chomsky's generative grammars and related concepts have been used in this field to capture the basic production mechanisms of image languages. In artificial intelligence, image models are investigated in order to understand and reproduce human problem solving methods which are deemed to be based on image inspection and manipulation. These operations can be performed on scene reproductions as in a vision system or on mental reconstructions as in imagery [9]. In fact, it is argued [10] that imagery and vision have parallel purposes and essentially differ in the image source (human memory in the former case, the external world in the latter). Mental imagery finds its theoretical foundations in psychology and cognitive science, where the primary role of images in certain human inference processes (such as reasoning [11] and discourse understanding [12]) has been postulated and experimentally tested. The role of images in our mental processes is still controversial, and not all the defenders of this role have been precise about image representation (this is not necessarily a limit of their proposals). Those who have gone as far as proposing an image representation scheme have resorted to various formalisms, ranging from generative array grammars [8] to array theories [9], depending on the purpose of the formalization and the formalizer's style. Despite the fact that none (with the exception of [13], but see section 10), to the best of our knowledge, has used a mathematical logic as an image model, we can observe that the underlying ontology of our logic, consisting of regions and colours, is consistent with that of these other image models. The same applies to image retrieval systems [14]. The proposers of such systems have mostly bypassed the problem of providing a formalization of the underlying models, and have focused on methods for efficiently storing and retrieving images and assessing some sort of similarity between
them. While the great variety of such methods constitutes an enormous resource for implementing image retrieval systems, the lack of a formal model, stating precisely what sort of retrieval is offered, is a major deficiency of these systems. The logical form of our model is due, as argued in the Introduction, to the view of images as information bearers that underlies any image information system. However, we believe that the formal development we have presented can be used to capture other image inferences, besides those needed for retrieval, and thus it is relevant to any attempt aiming at capturing aspects of reasoning on images.

8. MODELLING IMAGE CONTENTS
It has been argued [15] that a multimedia information retrieval system must give its users the possibility to address the content of multimedia documents in specifying retrieval requests. In fact, much of the way in which we conceptualize images relies on their content, so it is to be expected that users would like to express their information needs concerning images in terms of their contents. In addition, past experience with text retrieval systems reveals that the disappointing performance of these systems in terms of retrieval effectiveness largely depends on the lack of an adequate representation of the content of textual data [16], and the issue in image retrieval appears to be even more crucial. Figure 1 introduces the elements of the following discussion and the relationships between them. The top of the figure is a graphical re-edition of what we have been presenting in the previous parts of this paper. Images have as content scenes from the real world, taken at a given point in time, and are modelled by image representations. Images are understood by humans through a process of interpretation, which produces a mental reconstruction of the images' contents, named hereafter content reconstruction. This reconstruction (depicted as an oval and called a 'representation' in Figure 1) is all we have of the original content of the image. It may vary from interpreter to interpreter and depends on the context in which the interpretation is carried out, including its goal and use. The cognitive modelling of image interpretation is beyond the scope of an image retrieval model, which must instead provide a language for representing the result of that process, i.e. content reconstructions. As both the producer and the user of content reconstructions are humans, this language, albeit formal, must have a close resemblance to natural language. The most suitable candidate for it is first-order predicate logic (FOPC), which is the paradigmatic formal language for describing and reasoning about the facts of everyday life. In Figure 1, σ is thus a sentence of a FOPC denoting, according to the standard Tarskian semantics, the reconstructed image content. To obtain a specific language, we need only specify a lexicon, i.e. a list of the constant, function and predicate symbols of the language, along with an associated meaning
FIGURE 1. An image and its content.
for each of them. In other words, we must make an ontological commitment. Further considerations, for instance of a computational nature, may suggest limitations to the syntax of content reconstructions, leading to the adoption of a proper subset of FOPC, more amenable to automatic treatment. A similar pattern has led us to formulate a retrieval model for textual documents based on a terminological logic [17]. The pair ⟨δ, σ⟩ is the representation of the structure and content of an image. Given the relativism of interpretation, we should introduce the more articulated structure:

⟨δ, ⟨u₁, σ₁⟩, …, ⟨uₙ, σₙ⟩⟩

taking into account a number of user classes u₁, …, uₙ, each providing an interpretation, based on a possibly different lexicon. Each σᵢ represents the meaning of δ in a classical sense, in that it spells out the conditions under which δ is true. In fact, the relation between δ and σᵢ is one of translation. If we grant to the language of images the status of a natural language, then a collection of pairs ⟨δ, σᵢ⟩ (for some user class uᵢ) comes very close to being a theory of truth for image descriptions in the sense outlined by Tarski [18]. In fact, each such pair can be seen as an instance of the sentence:

x is true if and only if p

where x is the name of a sentence (or a sentence of the object language, that is an image representation δ) and p is that sentence (or the translation of x in the metalanguage, that is σ). A theory of truth is for Tarski a theory that entails one sentence like the above for each sentence of the object language. It is interesting to note that Davidson argues that such a theory can do the job of a theory of meaning for the object language [19]. The extension of image retrieval to the content of images is obvious. We must see a query as consisting of two parts: the structural query, given by an unnamed image description φ, and the content query, given by a sentence γᵢ of the language used to describe the contents of images by the user class uᵢ. An image ⟨δ, …, ⟨uᵢ, σᵢ⟩, …⟩ is retrieved in response to this query if and only if:

δ ⊨ φ and σᵢ ⊨ γᵢ.

We have already argued about the logical adequacy of this approach to the representation of content. As for its computational adequacy, the same considerations made in section 6 apply.
9. DESCRIBING IMAGE CONTENTS
Having decided on a style for content reconstructions of images, and fixed the role that they play in an image retrieval model, we can try to take a closer look at them. The question we are trying to answer here is 'What is an image about?' and the answer is all too obvious: like any (true) sentence of natural language, an image is about facts. It is also important to notice that, in our view, a content reconstruction σᵢ is not a representation of the fact(s) to which an image would correspond, for there are no such facts [20]. σᵢ is simply the sentence by which an interpreter would describe the content of the image for the class of users uᵢ. σᵢ will thus very likely contain contingent truths³, such as

Boy(Francesco)

rather than necessary truths, such as

(∀x)(Boy(x) → Boy(x)).

This would seem to be the end of any abstract speculation on the structure of content reconstructions of images, as any further step in this direction would require knowledge of the kind of questions the users are going to ask the system, so that ad hoc reconstructions could be devised. This is the database solution to the information problem, and it is not applicable in the information retrieval domain, because no knowledge is assumed about the needs of the users, nor is any homogeneity postulated among the objects of our subject matter, that is, images. Despite these scarcely encouraging premises, we will try to draw attention to the entities which we deem typical of a content reconstruction, and to the conditions that the recognition of these entities in an image poses on the structure of content reconstructions. The first kind of entities are the individuals of the domain of interpretation that are recognized in the image. The recognition of individuals is a fundamental part of image interpretation, and is accounted for in content reconstructions by singular terms. For example, the recognition of an individual named 'Francesco' in an image would be expressed by the sentence:
(∃x)(x = Francesco)

as a part of that image's content reconstruction. Here Francesco is an individual constant which denotes an individual of the interpretation domain. The second kind of entities are the relevant properties of the recognized objects, expressed via predicate symbols. The content reconstruction of the above example could, for instance, be augmented as follows:
(∃x)(x = Francesco ∧ Smiling(Francesco) ∧ Boy(Francesco) ∧ (∃y)(Girl(y) ∧ Sibling(Francesco, y)))

to state that Francesco is a boy, that he is smiling and that he is with his sister, whose name is not known. Among these properties, we deem particularly important for image retrieval those expressing spatial relationships between objects. We believe that a proper description of these properties can be given by including in the language for content reconstruction a spatial logic, such as the one presented in [21]. Typically, such a logic would allow the expression of topological facts such as one object being on top of, behind or inside another object. The third and last kind of entities that we want to point out are events. The nature and role of events in the analysis of natural language is controversial. Typically events are considered as a species of facts, as 'both offer themselves as what sentences—some sentence at least—refer to or are about' [20, p. 132]. The position that events are a kind of facts creates a number of problems that can be solved by assuming an ontology of events, that is, by 'introducing events as entities about which an indefinite number of things can be said'. For instance,

(∃z)(Hug(z) ∧ Hugging(z, Francesco, y) ∧ In(z, kitchen))

could be conjoined to the above content reconstruction in order to add that the image in question shows a hug (the event) between the two mentioned actors, and that it is taking place in a kitchen.

³ More properly, we should say that a content reconstruction reports the beliefs of the interpreter.
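Purely as an illustration, a content reconstruction of the kind just described could be stored as a set of ground facts, a vivid simplification of the full FOPC sentence, against which simple conjunctive content queries can be checked. Predicate and constant names below come from the running example; everything else, including the flattening of quantified variables into constants, is our own assumption.

```python
Fact = tuple[str, tuple[str, ...]]      # (predicate, arguments)

# A vivid rendering of the content reconstruction of the example image:
sigma: set[Fact] = {
    ("Boy", ("Francesco",)),
    ("Smiling", ("Francesco",)),
    ("Girl", ("y1",)),                   # the unnamed sister, as a Skolem-like constant
    ("Sibling", ("Francesco", "y1")),
    ("Hug", ("e1",)),                    # the hug event
    ("Hugging", ("e1", "Francesco", "y1")),
    ("In", ("e1", "kitchen")),
}

def satisfies(reconstruction: set[Fact], content_query: set[Fact]) -> bool:
    """A conjunctive, ground content query is satisfied if all its facts
    are present (no real FOPC reasoning is attempted here)."""
    return content_query <= reconstruction

# 'Images showing a smiling boy', with the individual already named:
assert satisfies(sigma, {("Boy", ("Francesco",)), ("Smiling", ("Francesco",))})
```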
10. IMAGE ABSTRACTIONS

An image description, as defined in section 3, can be considered as a full representation of an image, giving all the visual details that the image contains. In many image storage and retrieval systems for practical applications [22], most of these details are suppressed to retain only the visual information which is of interest for the application at hand. This move is motivated either by efficiency considerations or by the fact that the suppressed details are of no use. The resulting abstracted image representation, which we may call an image abstraction, typically consists of a sequence of features, extracted by an expert computer-aided process and used in place of the real image. The question that we want to address here is how image abstractions can be accommodated in our model, so that they can be retrieved in response to queries. Notice that this functionality, although not typical of image processing systems, is in fact made possible by the results produced by these systems. The distinction to be made is whether the abstraction concerns the form of the image or its content. For instance, a representation based on concepts such as Table, Chair, Sideboard and the like [23] is clearly of the latter kind, even though it is derived by matching an image against a predefined set of shapes, hence on a purely formal basis. Such a representation is, from our model's point of view, a content reconstruction and can be put to use in retrieval by formulating it in logical terms and including it among the σᵢ. In the above example, we would need a language having among its predicate symbols Table, Chair and Sideboard. There may also be abstractions concerning the form of an image, hence based on visual concepts such as graphical primitives. A notable example of this kind of abstraction is given in [13], where a first-order language is proposed to represent map images. This language provides predicate symbols denoting map elements, such as chain, region, closed and others, so that its sentences denote a class of images, namely maps. This latter kind of image abstraction, which we call formal abstraction, can be accommodated in our model, and therefore reconciled with the rest of it, by observing that each formal abstraction stands to the image description in I as a content reconstruction stands to the scene depicted by an image. A formal abstraction focuses only on certain details, and thus provides a view of the whole image, in the same way as a content reconstruction provides a view of the scene depicted by an image.
For instance, a map can be represented in I in an objective way, by means of spots and colours, but it can also be represented more abstractly in the above mentioned language, which is an image language, but evidently a more abstract one than I because it does not give any information about colours. Also, this language has less expressive power than I because it can describe only maps. We can call a first-order language an image language if its semantics is given by image models. Then, if δ is an image representation in I, an abstraction of δ is any set α of sentences of any image language I′ such that:

M ⊨ δ implies M ⊨ α, for all image models M.

In other words, α is a different way of saying what δ says, because it belongs to a different image language but any model denoted by δ is also denoted by α. In addition, α is more generic than δ, because it may also denote models which do not support δ. In case the implication holds in both directions, that is

M ⊨ δ if and only if M ⊨ α, for all image models M,

then α is a translation of δ. This leads us to a more complete representation:
⟨δ, ⟨u₁, α₁, σ₁⟩, …, ⟨uₙ, αₙ, σₙ⟩⟩

where δ is the image representation in I and each triple ⟨uᵢ, αᵢ, σᵢ⟩ gives, for the user class uᵢ, the formal image abstraction and the content reconstruction. For some i, either αᵢ or σᵢ may be missing, but not both. A query is now a triplet: the structural query φ, the content query γᵢ, and the formal abstraction query βᵢ, which is a sentence of the language of formal abstractions of user class uᵢ. An image ⟨δ, …, ⟨uᵢ, αᵢ, σᵢ⟩, …⟩ is retrieved in response to this query if and only if:

δ ⊨ φ, σᵢ ⊨ γᵢ, and αᵢ ⊨ βᵢ.
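Putting the three dimensions together, an entry of the image base and the three-part retrieval test can be pictured as follows. The entailment checks are left abstract, since each dimension comes with its own logic and decision procedure, and all names in the sketch are ours.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class ImageEntry:
    """<delta, <u_1, alpha_1, sigma_1>, ..., <u_n, alpha_n, sigma_n>>."""
    delta: Any                                       # form: the image representation in I
    views: dict[str, tuple[Any, Any]] = field(default_factory=dict)
    # user class -> (formal abstraction alpha_i, content reconstruction sigma_i)

def retrieved(entry: ImageEntry, user: str,
              structural_q: Any, content_q: Any, abstraction_q: Any,
              entails: Callable[[Any, Any], bool]) -> bool:
    """All three implications must hold; a single placeholder `entails`
    stands in for the three different decision procedures."""
    alpha, sigma = entry.views[user]
    return (entails(entry.delta, structural_q)
            and entails(sigma, content_q)
            and entails(alpha, abstraction_q))
```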
11. CONCLUSIONS
The work presented in this paper is part of a project aiming at developing a model for large multimedia document bases, supporting the form- and content-based retrieval of multimedia documents. Our approach to this problem combines the logical approach to the retrieval of information [1] and the conceptual modelling approach to information systems [24]. In particular, from the former we adopt the view that the retrieval of documents is to be viewed as a logical inference, possibly of a probabilistic nature. From the latter, we adopt the view that the construction of a complex information system requires an explicit representation of knowledge about the system's application domain. We have argued in favour of this view in [15] and developed MIRTL [17], a probabilistic extension of which is presented in [25]. In this paper, a logical foundation of image modelling and retrieval has been outlined, to serve, among other things, as the basis for a further extension to MIRTL, enabling it to model documents containing images.
ACKNOWLEDGEMENTS
We thank the members of the MIRO Working Group (ESPRIT Basic Research Action Programme no. 6576) for the stimulating discussion following a presentation of an early stage of the present work.
REFERENCES

[1] van Rijsbergen, C. J. (1986) A new theoretical framework for information retrieval. In Proceedings of SIGIR-86, 9th ACM Conference on Research and Development in Information Retrieval, Pisa, Italy, pp. 194–200.
[2] Hirata, K. and Kato, T. (1992) Query by visual example. In Proceedings of EDBT'92, 3rd International Conference on Extending Database Technology, Vienna, pp. 56–71.
[3] Ballard, D. H. and Brown, C. M. (1982) Computer Vision. Prentice Hall, Englewood Cliffs, NJ.
[4] Levesque, H. (1988) Logic and the complexity of reasoning. J. Philos. Logic, 17, 355–389.
[5] Reiter, R. (1984) Towards a logical reconstruction of relational database theory. In Brodie, M. L., Mylopoulos, J. and Schmidt, J. W. (eds), On Conceptual Modelling, pp. 191–233. Springer Verlag, New York.
[6] Etherington, D., Borgida, A., Brachman, R. and Kautz, H. (1989) Vivid knowledge and tractable reasoning: preliminary report. In Proceedings of IJCAI-89, 10th International Joint Conference on Artificial Intelligence, Detroit, MI, pp. 1146–1152.
[7] Garey, M. and Johnson, D. (1979) Computers and Intractability. A Guide to the Theory of NP-Completeness. Freeman, New York.
[8] Rosenfeld, A. and Siromoney, R. (1993) Picture languages – a survey. Languages of Design, 1, 229–245.
[9] Glasgow, J. (1993) The imagery debate revisited: a computational perspective. Computational Intelligence, 9, 309–333.
[10] Kosslyn, S. (1987) Seeing and imagining in the cerebral hemispheres: a computational approach. Psychol. Rev., 94, 148–175.
[11] Lindsay, R. (1988) Images and inference. Cognition, 29, 229–250.
[12] Johnson-Laird, P. (1983) Mental Models. Cambridge University Press, Cambridge, MA.
[13] Reiter, R. and Mackworth, A. (1989) A logical framework for depiction and image interpretation. Artificial Intelligence, 41, 125–155.
[14] Gudivada, V. and Raghavan, V. (eds) (1995) IEEE Computer, 28, issue 9.
[15] Meghini, C., Rabitti, F. and Thanos, C. (1991) Conceptual modelling of multimedia documents. IEEE Comp., 24, 23–30.
[16] van Rijsbergen, C. J. (1986) A non-classical logic for information retrieval. Comp. J., 29, 481–485.
[17] Meghini, C., Sebastiani, F., Straccia, U. and Thanos, C. (1993) A model of information retrieval based on a terminological logic. In Proceedings of SIGIR-93, 16th ACM Conference on Research and Development in Information Retrieval, Pittsburgh, PA, pp. 298–307.
[18] Tarski, A. (1983) The concept of truth in formalized languages. In Corcoran, J. (ed.), Logic, Semantics, Metamathematics, pp. 152–278. Hackett Publishing Company, Indianapolis, IN.
[19] Davidson, D. (1991) Truth and meaning. In Inquiries into Truth and Interpretation, pp. 17–36. Clarendon Press, Oxford.
[20] Randell, D., Zhan, C. and Cohn, A. (1992) A spatial logic based on regions and connection. In Proceedings of KR-92, 3rd International Conference on Knowledge Representation and Reasoning, revised version.
[21] Davidson, D. (1985) The logical form of action sentences. In Essays on Actions and Events, pp. 105–148. Clarendon Press, Oxford.
[22] Jamberdino, A. and Niblack, W. (eds) (1992) Image Storage and Retrieval Systems. Proc. SPIE, 1662, SPIE.
[23] Rabitti, F. and Savino, P. (1991) Image analysis for semantic image databases. Technical Report B4-35, IEI-CNR, Pisa, Italy.
[24] Brodie, M. L., Mylopoulos, J. and Schmidt, J. W. (eds) (1984) On Conceptual Modelling. Springer Verlag, New York.
[25] Sebastiani, F. (1994) A probabilistic terminological logic for modelling information retrieval. In Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval, Dublin, Ireland.