The “El Niño” Image Database System

Simone Santini and Ramesh Jain
Department of Electrical and Computer Engineering
University of California, San Diego
ssantini,[email protected]

Abstract

This paper presents the main features of the image database El Niño. The main characteristic of El Niño is its search model which, rejecting the idea of querying image databases, proposes an approach based on a mix of browsing and querying that we call exploration. This paper presents the interface and query models of El Niño, as well as some of the architectural issues deriving from our query model.

1. Introduction

This paper presents the image database system El Niño. El Niño is the collective name of a group of search engines, interface tools, and communication and integration modules for the management of image repositories. The system was developed as an exploratory testbed for the distribution of image databases over geographically remote systems, and for the study of new models of interaction between users and visual information management systems. The user interaction model is the most distinctive characteristic of El Niño. Logically, the user interaction model and the user interface that we designed derive from a series of observations about the nature of the search process:





In very general terms, the user searches the database looking for images that have a certain “meaning” to him. This fact has been used in the past to justify modeling the interaction with an image database after the “query-answer” process of traditional databases. This modeling choice ignores that the “meaning” of an image is not as well specified as the meaning of a record in a traditional database, in which the semantics of a particular record is a simple function of its syntax and of the semantics of its atomic constituents. This is no longer true in image databases [7].

Usual interaction models assume that the user has a specific target image in mind, and queries the database looking for that specific image. This is not an appropriate model for all applications. A number of analyses of user behavior revealed that, in a significant number of cases, users have no clear idea of what kind of image they are looking for, and prefer to mix search and browsing in a way that helps them decide which pictures are appropriate [4]. We call this mode of interaction, in which search and browsing are tightly intertwined, exploration.

These observations led to the design specifications of the interface of El Niño, which we present in Section 2. We begin the description of the system with the interface because the requirements of the interface condition many of the other system components.

2. Interface and Query Specification

The interface of El Niño is based on the principle that the user should be aware of the overall organization of the database, and of the consequences of her actions. The first point implies that the user should be aware of the current interpretation of difference and similarity that the database is using. The second point entails the rejection of the counterintuitive selection of low-level similarity criteria in favor of an approach based on the selection and placement of positive and negative examples. We call an interface that combines these two characteristics an exploratory interface.

An exploratory interface implements the first point by means of a display space. In practical systems, a display space is either a two-dimensional surface or a three-dimensional volume on which the system places the answers to the user’s queries (Fig. 1). Unlike traditional browsers, in a display space the position of the images matters: the database shows the images in positions that reflect, as well as possible, their mutual distances in the feature space. In other words, the distance between two images I and J in the display space reflects as well as possible the distance between I and J according to the dissimilarity measure currently in use.

Figure 1. Two examples of display spaces: two-dimensional (A) and three-dimensional (B). In both spaces, the distance between images reflects as well as possible the distance with respect to the current similarity measure used by the database.

More formally, let F ⊂ R^n be the feature space, and f(x_I, x_J; α) the distance between the feature vector x_I, corresponding to image I, and the feature vector x_J, corresponding to image J. The distance measure depends on the q-dimensional parameter vector α, which determines what kind of similarity criterion we are actually using (see Section 4). In other words, the parameters α encode the current query. In the usual case, the parameters α are set manually by the user. El Niño, as we will see, follows a different route.

The display space is determined by a projection operator that, in the case of a two-dimensional display space, can be written as π : F → S ⊂ R². The distance in the display space is the usual Euclidean distance. If ξ = π(x; β) are the coordinates of the projection of a feature vector x, then

    e(\xi_I, \xi_J) = \left[ \sum_k (\xi_I^k - \xi_J^k)^2 \right]^{1/2}    (1)

The projection π is assumed to depend on m parameters forming a vector β. These parameters determine the projection of the images into the display space. They are set so as to minimize

    F(\beta) = \sum_{I,J} \left[ f(x_I, x_J; \alpha) - e(\pi(x_I; \beta), \pi(x_J; \beta)) \right]^2    (2)

The minimization is done using standard optimization methods, and the resulting projection operator π is used to place the images in the display space.

Note that, in the formal model, all images in the database are projected into the display space. In practice this is impractical for a number of reasons. First, it would lead to an intolerably large optimization problem for the minimization of F(β) in (2). Second, displaying tens or hundreds of thousands of images in a limited display space would result in such a cluttered display that it would be of no informative value. In practice, given the similarity criterion f(·, ·; α), we query the database and determine the M images with the smallest distance (in our case M ≈ 100). Only these images participate in the optimization (2), and only these images are displayed.

The second characteristic of our interface is that the user does not manipulate the similarity measure (that is, the vector α) directly. Rather, the user moves the images around in the display space, grouping them according to their perceived similarity in the context of the current query. This manipulation provides the necessary dependency on the context of the query: two images might be considered very similar in the context of one query and different in the context of another. The manipulation of images in the display leads the database to the creation of a similarity measure that satisfies the relations imposed by the user. Rather than the user trying to understand the properties of the similarity measures used by the database, the database should use the user’s categorization to develop new similarity measures.

Mathematically, the user selects a subset A of the images in the database, and places them on the interface so that, for I, J ∈ A, the distance d_IJ between I and J can be determined. The vector α can in this case be determined by solving an optimization problem similar to (2):

    F(\alpha) = \sum_{I,J \in A} \left[ f(x_I, x_J; \alpha) - d_{IJ} \right]^2    (3)
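The minimization in (2) is essentially a metric multidimensional scaling (stress) problem. The sketch below, with made-up feature vectors and a simple weighted-Euclidean stand-in for f, illustrates the idea in numpy; it is an illustration of the principle, not the system's actual optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 20 images with 8-dimensional feature vectors and a
# uniform parameter vector alpha (all invented for illustration).
X = rng.random((20, 8))
alpha = np.ones(8)

def f(xi, xj):
    # Weighted Euclidean feature distance, a stand-in for f(x_I, x_J; alpha).
    return np.sqrt(np.sum(alpha * (xi - xj) ** 2))

n = len(X)
D = np.array([[f(X[i], X[j]) for j in range(n)] for i in range(n)])  # targets

def display_dist(P):
    # Pairwise Euclidean distances in the display space, eq. (1).
    diff = P[:, None, :] - P[None, :, :]
    return np.sqrt((diff ** 2).sum(-1) + 1e-12), diff

def stress(P):
    # F of eq. (2), parametrizing the projection directly by the positions P.
    e, _ = display_dist(P)
    return ((e - D) ** 2).sum()

P = rng.random((n, 2))            # initial 2-D placement
initial = stress(P)
for _ in range(1000):
    e, diff = display_dist(P)
    grad = 4.0 * (((e - D) / e)[:, :, None] * diff).sum(axis=1)  # dF/dP
    P -= 0.005 * grad             # gradient-descent step on the placement
final = stress(P)
```

After the loop, the screen-space distances between the 20 points approximate the feature-space distances in D, which is exactly what the display space requires.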

Note that in this case the optimization is restricted to the images manipulated by the user, and that the distances in the display space are now the known term, while the distances in the feature space are the unknowns. In many cases of practical interest the set A is quite small, and the optimization problem (3) is underconstrained. We can include additional constraints in order to reach a sensible answer. The simplest constraint is based on the idea of modifying α as little as possible compatibly with the satisfaction of the constraints. If α_0 is the value of α before the optimization process begins, this leads to optimizations of the type

    F(\alpha) = \sum_{I,J \in A} \left[ f(x_I, x_J; \alpha) - d_{IJ} \right]^2 + w \|\alpha - \alpha_0\|^2    (4)

where w is a positive weight. Other supplementary conditions are based on the deviation from the natural distance in the feature space, as considered in [5].

A user interaction using a direct manipulation interface is shown schematically in Fig. 2. In Fig. 2.A the database proposes a certain distribution of images (represented as colored rectangles) to the user. The distribution of the images reflects the current similarity criterion of the database: for instance, the green image is considered very similar to the orange one, and the brown to the purple. In Fig. 2.B the user moves some images around to reflect his own interpretation of the relevant similarities. The result is shown in Fig. 2.C. According to the user, the red and green images are quite similar to each other, and the brown image is quite different from them. The images that the user has placed form the anchors for the determination of the new similarity criterion. Given the new positions of the anchors, the database reorganizes its similarity assessment, and returns with the configuration of Fig. 2.D. The red and the green images are in this case considered quite similar (although the green image is not exactly in its intended position), and the brown quite different.

Note that the result is not a simple rearrangement of the images in the interface. For practical reasons, an interface can’t present more than a small fraction of the images in the database; typically, we display the 100-300 images most relevant to the query. The reorganization consequent to the user interaction therefore involves the whole database: some images will disappear from the display (the purple image in Fig. 2.A), and some will appear (the yellow, gray, and cyan images in Fig. 2.D).

In order to support a direct manipulation interface, the database must accommodate very general similarity measures, and must automatically determine the similarity measure based on the anchors and the concepts formed by the user. Any a priori limitation of the generality of the similarity measures will be reflected in a corresponding limitation in the ontology of the image meanings: certain meanings will not emerge simply because the database can’t implement the similarity measures that induce them. As a simple example, it is impossible to define the meaning “round” in a database that uses only color histograms.

Figure 2. Schematic description of an interaction using a direct manipulation interface.

Figure 3. The overall architecture of El Niño. [Diagram: a local interface connects through a mediator stub to a mediator; the mediator combines, through operator nodes, the results of several engines, each attached to a database over a local or remote connection.]
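Under a simple diagonal-weight distance model (an assumption made here for illustration; the paper allows far more general metrics), the regularized fit of (4) to the user-placed anchors can be sketched as a small gradient descent on α. Anchor features and user-assigned distances below are invented.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical anchors: 5 user-placed images with 8-dim features, and the
# pairwise distances d_IJ read off their positions in the display space.
XA = rng.random((5, 8))
pairs = [(i, j) for i in range(5) for j in range(i + 1, 5)]
d_user = {p: rng.random() for p in pairs}

alpha0 = np.ones(8)   # value of alpha before the interaction
w = 0.1               # regularization weight of eq. (4)

def f(xi, xj, alpha):
    # Diagonal-weight distance, an illustrative stand-in for f(x_I, x_J; alpha).
    return np.sqrt(np.sum(alpha * (xi - xj) ** 2) + 1e-12)

def objective(alpha):
    # Eq. (4): anchor residuals plus a penalty for straying from alpha0.
    res = sum((f(XA[i], XA[j], alpha) - d_user[(i, j)]) ** 2 for i, j in pairs)
    return res + w * np.sum((alpha - alpha0) ** 2)

alpha = alpha0.copy()
before = objective(alpha)
for _ in range(300):
    grad = 2.0 * w * (alpha - alpha0)
    for i, j in pairs:
        fij = f(XA[i], XA[j], alpha)
        grad += (fij - d_user[(i, j)]) * ((XA[i] - XA[j]) ** 2) / fij
    alpha = np.maximum(alpha - 0.05 * grad, 0.0)  # keep weights non-negative
after = objective(alpha)
```

The new α then defines the similarity criterion used to re-query the database and re-project the top-M answers into the display space.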

3. Architecture

El Niño is a collection of search engines connected to a mediator, which communicates with the user via the user interface. Fig. 3 shows the overall architecture of the system. We assume a remote user who runs an interface on his local machine. In the current incarnation of the system, this interface is a locally running Java application; we are planning to convert it into a Java applet that can run inside the user’s browser. The user interface communicates with a mediator: the interface sends requests to the mediator, and receives answers back from it, using a simple ad hoc communication protocol. The mediator receives the queries, dispatches them to the appropriate search engines (which can be in remote locations), integrates the results using the attached operators, and sends the results back to the interface. The engines used in El Niño can be the most diverse, and the possibility of accessing a number of different engines in a consistent and transparent way is instrumental to our program: we need at the same time the general ontology of unrestricted visual search and the semantic constraints that specialized engines can give to restricted applications. In more detail, the components of the architecture work as follows.

Mediator. The mediator receives queries from the interfaces. An El Niño query is a graph that contains two types of nodes: engines and operators. Engine nodes specify queries that must be answered by some engine attached to the mediator, and operator nodes describe how the results of the queries should be combined.

Engines. El Niño supports a wide variety of engines. Engines must implement a standard API in order to communicate with the mediator. The engines used by El Niño can be based on the most diverse features: general visual features, specialized visual features, metadata, and more. All the engines are required to return a similarity measure in the interval [0, 1].

Operators. Operators combine the answers from the various engines. Mathematically, an operator is a function f : [0, 1]^n → [0, 1], where n is the arity of the operator.

Communication. There are two levels of communication in the system: the user interface needs to communicate with the mediator, and the mediator needs to communicate with the engines.

4. Search Engines

The interface of El Niño requires special support from the query engine, and the reason why we presented the interface first is that it justifies the choices that we made for the query engine. Psychological considerations led us to reject Minkowski distances as a model of similarity in favor of more complex Riemann metrics [6]. Similar considerations also suggest that the similarity metric is sensitive to the context of the query; in particular, the presence of the query deforms the perceptual space. Moreover, if the similarity criterion must emerge from the interaction with the user, we can’t accept ontologically restrictive features: any reasonably limited choice of a feature set will sooner or later break, namely when a user query reveals regularities that escape the restricted ontology of the features. In order to guarantee the generality of the features, we impose a reconstruction constraint: the feature set of an image must be general enough to allow (at least in principle) the reconstruction of the complete image.

The feature space is derived from a multiresolution decomposition of the image generated by a discrete subgroup of a suitable transformation group. If G is a group of transformations acting freely on a space X, and g ∈ G, then a representation of G on L²(X) is a homomorphism

    \Pi : G \to \mathcal{U}(L^2(X))    (5)

The irreducible unitary representation of G on L²(X) is defined as

    \Pi(g)(f) : x \mapsto \sqrt{\frac{dm(g^{-1}x)}{dm(x)}}\, f(g^{-1}x)    (6)

where m is a measure on X. This representation can be used to generate a wavelet transform. Starting from a mother wavelet ψ ∈ L²(X), we define ψ_g = Π(g)(ψ). The wavelet transform of a function f ∈ L²(X) is then defined as

    T_f(g) = \langle f, \psi_g \rangle    (7)

Note that the wavelet transform of f is defined on G. In the case of images, we have X = R². It is possible to extend the same definition to discrete groups [5]; in the discrete case, the transform is a set of coefficients indexed by the elements of the group: T_f(g). If the images are in color, it is possible to extend the definition of the inner product ⟨·, ·⟩ to color spaces. In this case, the coefficients themselves are colors: T_f(g) ∈ C, where C is a metric color manifold embedded in R³.

In our engine we use two types of groups: the affine group and the Weyl-Heisenberg group. The first generates a multiscale decomposition of the image, the second a transform similar to (but more general than) the Gabor decomposition [3].

A transform is in general composed of an intolerably large number of coefficients. For instance, the affine decomposition of an image of size 128 × 128 generates about 21,000 coefficients. Fortunately this representation is highly redundant and, for database applications, we don’t need the complete reconstruction of the image. A coefficient is represented by two quantities: the group element g, which defines the “position” of the coefficient in the transform, and the value of the coefficient, which is a color in C. In other words, a coefficient can be represented as a point in G × C. If the group G is endowed with a metric (which is true for all the groups we use), this induces a metric in the space G × C: given two coefficients T_f(g₁) and T_f(g₂), it is possible to measure the distance d(T_f(g₁), T_f(g₂)). With this distance we can apply vector quantization, and represent all the coefficients with a greatly reduced “codebook.” In our system, we typically use between 100 and 200 coefficients. For instance, we can use the affine group to generate the transform.
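As a toy illustration of the representation just described (group-indexed coefficients compressed by vector quantization), the sketch below quantizes invented (x, y, s, r, g, b) coefficients with plain k-means; the real system quantizes with respect to the metric on G × C rather than this naive coordinate scaling.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy coefficients: each is (x, y, s, r, g, b), an affine-group element
# (position and scale) paired with a color value. Purely illustrative data.
coeffs = np.column_stack([
    rng.uniform(0, 128, (2000, 2)),   # spatial coordinates x, y
    rng.uniform(1, 32, (2000, 1)),    # scale s
    rng.random((2000, 3)),            # color in a normalized RGB cube
])

def kmeans(points, k, iters=20):
    # Plain k-means as a stand-in for vector quantization of the transform.
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for c in range(k):
            mask = labels == c
            if mask.any():
                centers[c] = points[mask].mean(axis=0)
    return centers, labels

# ~150 codewords, in the 100-200 range the paper reports using.
codebook, labels = kmeans(coeffs, k=150)
```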

An element of the affine group is determined by three parameters: the two spatial coordinates x and y, and the scale parameter s. A pixel of an image is then an element of the six-dimensional space G × C, where G is the affine transformation group. We call this the image space, and an image is a set of elements in this space. Both the transformation group and the color space can be endowed with a natural metric, which makes it possible to define the distance between two coefficients and, from it, the distance between two sets of coefficients (i.e., between two images) in a natural way. The choice of the transformation group is determined by invariance considerations [5]: it is easy to define similarity measures that are invariant with respect to any of the transformations of the group.

El Niño generates the similarity criteria automatically by looking at the positive and negative examples chosen by the user. As an illustration, Fig. 4.a shows a number of geodesics (lines of minimal distance) of a hypothetical two-dimensional image space with no samples selected. The image space is Euclidean, and the geodesics are straight lines. If the user selects a number of images with a concentration of features around the point (0.4, 0.4), the image space is distorted, and its geodesics become as in Fig. 4.b. If the user selects a set of images with two concentrations of features (around (0.2, 0.2) and (0.8, 0.8)), the geodesics become those of Fig. 4.c. The geometry of the image space implicitly defines categorization in El Niño: the samples collected by the user form the context from which conceptualization emerges. These categories do not rely on a predefined ontology, as is the case for simple schemes based on weighting distances computed on predefined features.

Figure 4. Geodesics in the query space (see text).

5. Similarity Algebra

The unification of different similarity engines requires the definition of operators to put together their results. Formally, the operators take two distance measures defined by two engines, and transform them into a new distance measure resulting from their combination; the operators thus define an algebra in the space of distance functions. Equivalently, the combination operators can be considered as acting on the Riemann geometries of the feature spaces: given two Riemann tensors g₁ and g₂, which define two different geometries of the feature space, an operator creates the tensor g = o(g₁, g₂) defined on the same feature space.

These operators act on the distance functions defined by the single engines. All the distance functions are defined on F × F and take values in [0, 1]. For an engine based on SQL or regular expressions, the distance takes values in {0, 1}; other than this restriction, engines based on matching can be treated just like engines that measure similarity.

We try to make our algebras look like Boolean logic as much as possible, since the use of logic connectives is intuitive and well established. So, in general, we define three operators, which we call and (∧), or (∨), and negation (¬). In the general case, however, we have to sacrifice some logically reasonable relations in order to enforce distributivity, which plays an important role in query optimization.

Formally, in El Niño a distance is a function in L²[F × F; R⁺], which is a Hilbert space, and the operators form an algebra on this space. On this Hilbert space, we define two operators: and (∧), and or (∨). The and operator, for instance, has the signature

    \wedge : L^2[F \times F; R^+] \times L^2[F \times F; R^+] \to L^2[F \times F; R^+]

and similarly for the or operator. In order to allow query optimization, it is important that the two operations be distributive: if d1, d2, d3 are distance measures, then we should have

    d_1 \wedge (d_2 \vee d_3) = (d_1 \wedge d_2) \vee (d_1 \wedge d_3)

and

    d_1 \vee (d_2 \wedge d_3) = (d_1 \vee d_2) \wedge (d_1 \vee d_3)

This requirement forces us to make some compromises about the negation operator. In particular, the only functions that satisfy the De Morgan theorems and the involutive property of negation (¬¬x = x) are

    d_1 \wedge d_2 = \max(d_1, d_2)
    d_1 \vee d_2 = \min(d_1, d_2)
    \neg d = 1 - d

(The use of max and min is inverted with respect to the norm established, for instance, in fuzzy logic; the reason is that we are dealing with distances rather than similarities.) The use of max and min has the disadvantage that, for any two values d1 and d2, the result depends on only one of the two values. Consider the following example: we have three images A, B, and C, and two distance measures d1 and d2. The


distance between A and the other two images, with respect to the two distance measures, is given in the following table:

               d1      d2
    d(A, B)    0.01    0.9
    d(A, C)    0.9     0.9

The distance between A and C is very large with respect to both criteria, while the distance between A and B is small with respect to the first criterion. Reasonably, one would expect the combined distance between A and B to be smaller than the combined distance between A and C. However, in the case of the max operator we have

    d_{d_1 \wedge d_2}(A, B) = d_{d_1 \wedge d_2}(A, C) = 0.9    (8)

Fagin [2] proposes a series of norms and conorms that satisfy the usual logic properties, but are not distributive [1]. In El Niño, we use the following class of pseudo-logic operators, which form a distributive algebra. Let f : [0, 1] → R be a monotonically increasing function. Then we define

    d_1 \wedge d_2 = f^{-1}(f(d_1) + f(d_2))
    d_1 \vee d_2 = f^{-1}(f(d_1) f(d_2))    (9)

It is easy to prove that these operations form a distributive algebra and therefore allow query optimization. Unfortunately, we have to give up some of the power of logic predicates.

5.1. Integration of Metadata

Some information about images is naturally expressed in the form of text. Cultural information (in the wide sense of the term) is more conveniently expressed in structured or semi-structured form, rather than being extracted from the image data. Even if we could develop robust algorithms for face identification and recognition, the fact that a particular face belongs to Jack Kerouac, and the peculiar relation that the face may bear with a picture of a road stretching to the horizon, are cultural facts that are conveniently expressed with metadata.

Linguistic queries are integrated with visual queries using the same algebraic operators introduced above. Consider a query involving metadata, like “give me images containing the label Gnat.” The set of images satisfying this query is a crisp set: we can assign to every image in the database a value s(Q, I) ∈ {0, 1}, with s(Q, I) = 1 for images that contain the label “Gnat,” and s(Q, I) = 0 for images that don’t. This identification provides the necessary unifying principle for perceptual and textual queries.

6. Conclusion

El Niño is an attempt to overcome the limitations of traditional search engines and, more specifically, to find an alternative to the well known query-by-example paradigm. The functional specifications of El Niño are based on an analysis of user interaction with traditional image repositories, and on an analysis of user perception of the meaning of an image and of its dependence on the context of the query. These specifications led us to the design of a direct manipulation interface in which the user operates on the database’s similarity measure by placing images in positions that reflect their mutual similarity in the current context. The use of this kind of interface requires that the database engine be able to adapt its similarity measure to the indications of the user. We gave a rough sketch of the design of such a similarity engine, and of a theory of a distributive algebra that can be used to merge the results of two different engines. Finally, we presented a distributed architecture for the integration of remote engines into a single system.

References

[1] D. Dubois and H. Prade. A review of fuzzy set aggregation connectives. Information Sciences, 36:85-121, 1985.

[2] R. Fagin. Combining fuzzy information from multiple systems. In Proceedings of the 15th ACM Symposium on Principles of Database Systems, Montreal, 1996.

[3] C. Kalisa and B. Torrésani. N-dimensional affine Weyl-Heisenberg wavelets. Annales de l’Institut Henri Poincaré, Physique théorique, 59(2):201-236, 1993.

[4] M. Markkula and E. Sormunen. Searching for photos: journalists’ practices in pictorial IR. In The Challenge of Image Retrieval: Papers Presented at a Workshop on Image Retrieval, University of Northumbria at Newcastle, February 1998.

[5] S. Santini. Explorations in Image Databases. PhD thesis, University of California, San Diego, January 1998.

[6] S. Santini and R. Jain. Similarity matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1995 (submitted).

[7] S. Santini and R. Jain. User interfaces for emergent semantics in image databases. In Proceedings of the 8th IFIP Working Conference on Database Semantics (DS-8), Rotorua, New Zealand, January 1999.
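As a closing illustration of the similarity algebra of Section 5, the pseudo-logic operators of (9) are straightforward to implement. The generator f(x) = x/(1 − x) used below is just one convenient choice (any monotonically increasing f with a range closed under sums and products works); by construction the law d1 ∨ (d2 ∧ d3) = (d1 ∨ d2) ∧ (d1 ∨ d3) then holds identically, which the snippet checks numerically.

```python
# Pseudo-logic operators of eq. (9), with the illustrative generator
# f: [0, 1) -> [0, inf), f(x) = x / (1 - x), and its inverse.
def f(x):
    return x / (1.0 - x)

def f_inv(y):
    return y / (1.0 + y)

def d_and(d1, d2):
    # d1 ^ d2 = f^-1(f(d1) + f(d2)); always >= max(d1, d2), which is
    # consistent with "and" increasing distance (decreasing similarity).
    return f_inv(f(d1) + f(d2))

def d_or(d1, d2):
    # d1 v d2 = f^-1(f(d1) * f(d2))
    return f_inv(f(d1) * f(d2))

d1, d2, d3 = 0.3, 0.6, 0.8
lhs = d_or(d1, d_and(d2, d3))              # d1 v (d2 ^ d3)
rhs = d_and(d_or(d1, d2), d_or(d1, d3))    # (d1 v d2) ^ (d1 v d3)
```

A crisp metadata predicate, which yields distances in {0, 1} as in Section 5.1, plugs into the same operators unchanged.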

