Interfaces for Emergent Semantics in Multimedia Databases

Simone Santini and Ramesh Jain
Visual Computing Laboratory, University of California, San Diego, La Jolla, CA
ABSTRACT
In this paper we introduce our approach to multimedia database interfaces. Although we deal mainly with image databases, most of the ideas we present can be generalized to other types of data. We argue that, when dealing with complex data, such as images, the problem of access must be redefined along different lines than for text databases. In multimedia databases, the semantics of the data is imprecise, and depends in part on the user's interpretation. This observation made us consider the development of interfaces in which the user explores the database rather than querying it. In this paper we give a brief justification of our position and present the exploratory interface that we have developed for our image database El Niño.

Keywords: human machine interfaces, image databases, emergent semantics, interaction
1. INTRODUCTION
Giving everybody access to large repositories of images is an interesting and important problem per se, as well as a viable start to sharpen our tools and tackle the more challenging problem of giving access to general multimedia data repositories. Much work has been done on several aspects of image databases, from indexing to feature extraction,1-3 to integration of several information sources.4 All this work makes some assumptions on the nature of the search process and of the information that is contained in images:
The user knows what he or she is looking for, and it is just a matter of finding the right tools for requirement specification and for semantic retrieval.

The semantics of an image can be (at least in principle) adequately characterized by referring to the image data or to some meta-data associated with the image.

In this paper we challenge these assumptions, and show that the problem of image retrieval from an image repository is qualitatively different from the problem of extracting information from a traditional database. In particular, not only does the user not know exactly what he is looking for, but the semantic content of an image will change depending on the current status and goals of the search. Assuming this point of view, we can derive several consequences of importance for the design of image databases. One consequence is that the interface between man and machine must be expanded and that its role will be much more important than in traditional databases. We argue that the semantics of an image must be determined in the particular context in which the user asks the query, and that it is not a property of a single image but of the relation between different images. In other words, the semantics of an image is revealed by looking at the distribution of images in the database, and at the position of a particular image in this distribution. It follows from this observation that the meaning of an image is an emergent property, deriving from the interaction between the user and the database.

The paper is organized as follows: in Sect. 2 we consider some general facts about problem solving activities and, more specifically, about solving the problem of assigning a meaning to an image. In the same section we introduce the decision cycle model of problem solving. In Sect. 3 we propose our model of direct manipulation interface. In Sect. 4 we make the model more precise by defining it as a series of operators on three different spaces. In Sect. 5 we illustrate the idea by showing the interface of our database system El Niño. Conclusions are in Sect. 6.
Send correspondence to: Simone Santini. E-mail: [email protected]
Figure 1. A Modigliani portrait placed in a context that suggests "Painting."
2. THE DECISION CYCLE
The interaction between a user and a traditional database follows a very simple scheme: the user asks a query, and the database returns the most appropriate answer. This simple model assumes that the meaning is a "thing" that is inside the record and that a query can filter out the records that contain the right "thing." This is a simplistic view that we must reject if we want to understand the role of the user and that of the human-machine interface in the design of a multimedia database. This view derives from another simplistic assumption, namely that the meaning of an image is completely determined by the "objects" in the image and therefore can be symbolically encoded. In reality, the interpretation of an image is sometimes revealed if, instead of considering the image alone, we place it in the context of the other images in the database; in other terms, if we look not at single images, but at a whole system of oppositions and differences between images.

Consider the images of Fig. 1. The image at the center is a Modigliani portrait and, placed in the context of other 20th century paintings (some of which are portraits and some of which are not), it suggests the notion of "Painting." On the other hand, if we take the same image and place it in the context of Fig. 2, the context suggests the meaning "Face."

We consider the interaction between a user and a database as a form of human/environment interaction. During this interaction, the user has a goal, which is to assign the right meaning to the right image and, at the same time, to focus on the set of images that possess a certain meaning of interest. The interface between the user and the database should facilitate this problem solving activity. In order to make things more precise, and to derive indications about the desirable properties of an interface, we will briefly describe the decision cycle model,5,6 as exemplified in Fig. 3. The agent-environment-agent loop is elaborated as a seven-stage process:

1. Form a goal (the environment state that is to be achieved).
2. Translate the goal into an intention to do some specific action that ought to achieve the goal.
3. Translate the intention into a more detailed set of commands: a plan for manipulating the interface.
4. Execute the plan.
5. Perceive the state of the interface.
6. Interpret the perception in light of expectations.
7. Evaluate or compare the results to the intentions and the goal.
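The seven stages can be read as an iterative loop that terminates when the evaluation of the perceived state matches the goal. The following toy Python sketch makes this reading concrete; all class and method names here are ours, not part of the model, and the "environment" is reduced to a single number for illustration.

```python
class ToyAgent:
    """A minimal agent running the seven-stage decision cycle
    against a toy environment whose state is one integer."""
    def __init__(self, target):
        self.target = target                     # 1. form a goal: reach `target`
    def run(self, interface):
        while True:
            gap = self.target - interface.state  # 2. intention: close the gap
            plan = 1 if gap > 0 else -1          # 3. plan: a single command
            interface.execute(plan)              # 4. execute the plan
            state = interface.state              # 5. perceive the interface
            done = (state == self.target)        # 6-7. interpret and evaluate
            if done:
                return state

class ToyInterface:
    """The manipulable environment: holds a state, applies commands."""
    def __init__(self):
        self.state = 0
    def execute(self, delta):
        self.state += delta
```

The loop converges when the environment matches the goal, mirroring the continuous, iterative nature of the interaction that the model emphasizes.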
Figure 2. A Modigliani portrait placed in a context that suggests "Face."
Figure 3. A schematic representation of the decision cycle model of human-machine interaction.
Figure 4. Schematic description of an interaction using a direct manipulation interface.

This model of human-machine interaction has a number of drawbacks,6 but it is sufficient for our purposes since it highlights the continuous nature of the interaction: the problem solving activity is composed of a number of iterative steps, during which the environment is manipulated and brought to a state that represents the solution of the agent's problem. The model also shows (steps 5 and 6) the importance for the agent of having a complete picture of the status of the environment.
3. DIRECT MANIPULATION
The decision cycle can give us indications on the design of an interface for the discovery of semantics:

1. The user should have a global view of the placement of images in the database. The global view (along the lines of Figs. 1 and 2) is necessary to give the user an indication of what meaning the database is currently attributing to a specific image. As we have seen, semantics cannot rest on a single image.

2. The user should be able to manipulate the environment in a simple and intuitive way. In many current interfaces, the user is given knobs or cursors corresponding to some database-defined quantities. For instance, the database may employ several feature extractors, and the user may be asked to choose their relative importance. This is a highly unintuitive interface and a poor use of the user's abilities. It is not immediately obvious how asking for "more texture," as opposed to "less color," will change the database response.

Based on these principles, we replaced the query-answer model of interaction with a direct manipulation one. In our model, the database gives the user information about the status of the whole database, rather than just about a few images that satisfy the query. Whenever possible, the user manipulates the image space directly by moving images around, rather than manipulating weights or some other quantity related to the similarity measure currently used by the database. The manipulation of images in the display leads the database to the creation of a similarity measure that satisfies the relations imposed by the user.

A user interaction using a direct manipulation interface is shown schematically in Fig. 4. In Fig. 4.A the database proposes a certain distribution of images (represented schematically as colored rectangles) to the user. The distribution of the images reflects the current similarity interpretation of the database: for instance, the square is considered very similar to the double square, and the circle to the star.
In Fig. 4.B the user moves some images around to reflect his own interpretation of the relevant similarities. The result is shown in Fig. 4.C: according to the user, the triangle and the square are quite similar to each other, and the circle is quite different from them. The images that the user has placed form the anchors for the determination of the new similarity criterion. Consequently, the database redefines its similarity measure, and returns with the configuration of Fig. 4.D. The triangle and the square are in this case considered quite similar (although the square image has been moved from its intended position), and the circle quite different.

Note that the result is not a simple rearrangement of the images in the interface. For practical reasons, an interface cannot present more than a small fraction of the images in the database; typically, we display the 100-300 images most relevant to the query. The reorganization consequent to the user interaction, however, involves the whole database. Some images will disappear from the display (the star in Fig. 4.A), and some will appear (the gray squares in Fig. 4.D).

Figure 5. Interaction involving the creation of concepts.

A slightly different operation on the same interface is the definition of visual concepts. In the context of direct manipulation, the term concept has a more restricted scope than in common usage: a visual concept is simply a set of images that, for the purpose of the current application, can be considered as equivalent or almost equivalent. An interaction involving visual concepts looks like that in Fig. 5. Looking at the display of Fig. 5.A, the user still decides to consider the square and the triangle as close to each other but, in addition, she decides that they have enough semantic relevance in the current context to deserve a special status as a concept. The user opens a concept box and drags the images inside the box. The box is then used as an icon to replace the images in the display space.

A concept works much like a cluster of images that are kept very close in the display space, but with some important distinctions. Ancillary information can be attached to the concept box as meta-data. So, if the user of a museum database creates a concept called "medieval crucifixion," the words "medieval" and "crucifixion" can be used to replace the actual images in a query. This mechanism gives a way of integrating visual and non-visual queries: if a user looks for medieval paintings representing the crucifixion, she can simply type in the words; the corresponding visual concept will be retrieved from memory, placed in the visual display, and used as a visual query.

This approach requires a different and more sophisticated organization of the database in several respects:

1. The requirement for a contextual presentation that the user can manipulate calls for a formal definition of one or more display spaces, in which the images, although placed in a high-dimensional feature space, are presented in a relation that resembles that established by the query. This is especially relevant since, in the direct manipulation interface, the display space is not merely an output of the system but an input device as well: the reorganization of the images done by the user takes place in the display space.

2. The database must accommodate arbitrary (or almost arbitrary) similarity measures, and must automatically determine the similarity measure based on the anchors and the concepts formed by the user.

The reader should refer to Refs. 7 and 8 for more details.
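As an illustration of how a similarity measure might be derived from the user's anchors, the following Python sketch fits nonnegative per-feature weights so that the weighted feature-space distances between anchor images match the distances the user imposed in the display space. This least-squares scheme is our own minimal example, not the actual mechanism of El Niño's engine (see Refs. 7, 8); all function names are hypothetical.

```python
import numpy as np

def learn_weights(features, anchors, display_pos):
    """Fit per-feature weights w >= 0 so that the weighted squared
    feature-space distance between every pair of anchor images matches
    the squared distance the user imposed in the display space.

    features:    (n_images, n_feat) array of feature vectors
    anchors:     indices of the images the user placed as anchors
    display_pos: (n_anchors, 2) display positions chosen by the user
    """
    rows, targets = [], []
    for a in range(len(anchors)):
        for b in range(a + 1, len(anchors)):
            i, j = anchors[a], anchors[b]
            rows.append((features[i] - features[j]) ** 2)
            targets.append(np.sum((display_pos[a] - display_pos[b]) ** 2))
    w, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return np.clip(w, 0.0, None)   # keep the induced metric nonnegative

def weighted_distance(w, x, y):
    """Distance under the learned diagonal metric."""
    return np.sqrt(np.sum(w * (x - y) ** 2))
```

Images the user did not move then get re-ranked under `weighted_distance`, which is how a manipulation of a handful of anchors can reorganize the whole database.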
4. INTERFACE OPERATORS
Formally, an interface can be defined as a number of operators that work on three spaces: the feature space, the query space, and the display space. The feature space is the space in which images are described. For an n-dimensional feature vector, this space is a subset of R^n. Contrary to the usual assumption, however, we assume that the space has no intrinsic geometry. A query operator endows the space with a geometry, and transforms it into the query space. In the query space, the distance of an image from the origin of the coordinate system represents its significance for the query. The determination of the geometry of the query space is in general quite complicated, and is beyond the scope of this paper. The query can be seen as an operator which transforms the feature space F into the metric query space Q. Once the feature space has been transformed into the metric query space, other operations are possible, such as:
Distance. Given a feature set x_i, return its distance from the query.

Select by Distance. Return all feature sets that are closer to the query than a given distance.

k-Nearest Neighbors. Return the k images closest to the query.
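With a concrete metric standing in for the query-dependent one (here simply Euclidean, for illustration), the three operations can be sketched as:

```python
import numpy as np

def distance(q, x):
    """Distance of feature set x from the query q (Euclidean stand-in
    for whatever metric the query operator has induced)."""
    return np.linalg.norm(x - q)

def select_by_distance(q, X, r):
    """All feature sets closer to the query than distance r."""
    return [x for x in X if distance(q, x) < r]

def k_nearest(q, X, k):
    """The k feature sets closest to the query."""
    return sorted(X, key=lambda x: distance(q, x))[:k]
```

In the query space the query sits at the origin, so q would be the zero vector there; the functions above keep q explicit only for readability.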
4.1. The Display Space
The display operator projects image x_i on the screen position X_α, α = 1, 2, in a way that preserves as much as possible the mutual distances between images. A configuration of the display space is obtained by applying the display operator to the whole query space. With these definitions, we can describe the operators that the user has available to manipulate the display space.
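The paper does not commit to a particular projection algorithm; classical multidimensional scaling (MDS) is one standard way to compute such a distance-preserving two-dimensional configuration, sketched here with numpy as an assumption about how a display operator could be realized.

```python
import numpy as np

def display_configuration(D):
    """Classical MDS: given the matrix D of pairwise query-space
    distances, return 2-D screen coordinates whose mutual distances
    approximate D as well as possible (in the least-squares sense)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered squared distances
    vals, vecs = np.linalg.eigh(B)        # eigh returns ascending eigenvalues
    order = np.argsort(vals)[::-1][:2]    # keep the two largest
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0))
```

When the query-space distances happen to be realizable in the plane, this projection reproduces them exactly; otherwise it yields the best rank-2 approximation.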
The Place Operator. The place operator moves an image from one position of the display space to another, and attaches a label to the image to "glue" it to its new position.

Visual Concept Creation. A visual concept is a set of images that, conceptually, occupy the same position in the display space and are characterized by a set of labels. Formally, we include in the set the keywords associated to the concept as well as the identifiers of the images that are included in the concept.

Visual Concept Placement. The insertion of a concept in a position Z of the display space.
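These operators suggest a simple data structure for a visual concept: a set of keywords plus a set of image identifiers, tied to a position Z in the display space. The following Python sketch is our own illustration (all names and the museum example identifiers are hypothetical), including the keyword lookup that lets a typed query retrieve a stored concept.

```python
from dataclasses import dataclass, field

@dataclass
class VisualConcept:
    """Keywords and image identifiers sharing one display position Z."""
    keywords: set = field(default_factory=set)
    image_ids: set = field(default_factory=set)
    position: tuple = (0.0, 0.0)   # the position Z in the display space

def place_concept(display, concept, Z):
    """Visual concept placement: insert the concept at position Z,
    indexed by its (order-independent) keyword set."""
    concept.position = Z
    display[frozenset(concept.keywords)] = concept

def find_concept(display, words):
    """Retrieve a stored concept by keywords, so that a textual query
    can be replaced by the corresponding visual concept."""
    return display.get(frozenset(words))
```

A museum user's "medieval crucifixion" concept would then be created as `VisualConcept({"medieval", "crucifixion"}, {...})` and later retrieved by typing either word order.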
5. THE INTERFACE AT WORK
We have used these principles in the design of the interface for our database system El Niño. As we mentioned in the previous section, the interface that we described requires the support of a suitable engine and data model. In particular, the engine must be able to:

understand the placement of images in the display space;

create a similarity criterion "on the fly" based on the placement of samples in the display space.

The engine that we use in El Niño satisfies these requirements using a purely geometric approach. The feature space is generated with a multi-resolution decomposition of the image. Depending on the transformation group that generates the decomposition, the space can be embedded in different manifolds. If the transformation is generated by the two-dimensional affine group, then the space has dimensions x, y, and scale, in addition to the three color dimensions R, G, B. In this case the feature space is diffeomorphic to R^6. In other applications, we generate the transform using the phase space of the Weyl-Heisenberg group,7 obtaining transformation kernels which are a generalization of the Gabor filters.9 In this case, in addition to the six dimensions above we have the direction of the filters, and the feature space is diffeomorphic to the cylinder R^6 x S^1. An image is represented as a set of coefficients in this six- (or seven-) dimensional space. The raw feature space of El Niño is the space of such sets of coefficients. Each image is represented by a set of about 30,000 coefficients. In order to reduce the memory occupation of each image and to make the distance computations more efficient, we create a view in this space by a vector quantization operation that reduces an image to a number of coefficients between 50 and 100 (depending on the particular implementation of El Niño).
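El Niño's actual quantizer is described in Ref. 7; as a generic illustration of the idea, a k-means vector quantizer reduces a large set of transform coefficients to a small codebook of representative coefficients. The sketch below (our own, with hypothetical parameter choices) operates on an array with one row per coefficient and one column per dimension of the feature space.

```python
import numpy as np

def quantize(coeffs, n_codes=64, n_iter=20, seed=0):
    """Reduce a set of transform coefficients to n_codes representative
    coefficients by plain k-means vector quantization."""
    rng = np.random.default_rng(seed)
    # initialize the codebook with randomly chosen coefficients
    codebook = coeffs[rng.choice(len(coeffs), n_codes, replace=False)]
    for _ in range(n_iter):
        # assign every coefficient to its nearest code vector
        d = np.linalg.norm(coeffs[:, None, :] - codebook[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each code vector to the centroid of its cell
        for k in range(n_codes):
            members = coeffs[labels == k]
            if len(members):
                codebook[k] = members.mean(axis=0)
    return codebook
```

Run on the roughly 30,000 six- or seven-dimensional coefficients of one image with `n_codes` between 50 and 100, this yields the compact per-image view described above.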
Figure 6.
Figure 7.

The query space is created by endowing the feature space with a metric. One of the characteristics of El Niño is that the metric is not a simple Minkowski metric, but a more general Riemannian metric. This fact allows us to create endless similarity criteria based on the query choices of the user. The description of the engine of El Niño goes beyond the scope of this paper, and is available elsewhere.7

The following example illustrates the way our interface works. In Fig. 6 the user sees a browser with a sample of images from the database, and selects a few "interesting" images in the theme of astronomy. We are somewhat hiding here the problem of how we "ignite" the search, that is, how we make this first crucial selection of interesting images; we will return briefly to this problem in the conclusions. The commonality between these images constitutes the similarity criterion which is applied to answer the first query. The result of this query is in Fig. 7. Images are placed in the interface in positions that reflect their mutual similarity. The user selects a few more images in the two concept boxes in the bottom right corner of the interface, shifts the position of other images, and submits another query, whose result is in Fig. 8.

Figure 8.

After a number of iterations, the situation in the interface is that of Fig. 9. Note that, during the interaction, the interest has changed (or maybe it has been refined) from general astronomical images to images of planets. This kind of shift or refinement of interest is a very common occurrence in image searches, and is made possible by our interface and search engine.

Figure 9.
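The role of the metric in this section can be illustrated by contrasting a fixed Minkowski distance with a quadratic-form distance, the simplest case of a metric that can be re-derived for every query. The matrix G below is a stand-in for whatever the engine computes from the user's interaction, not El Niño's actual Riemannian machinery (Ref. 7).

```python
import numpy as np

def minkowski(x, y, p=2):
    """Fixed Minkowski distance: the geometry never changes."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def quadratic_form(x, y, G):
    """Distance induced by a positive definite matrix G; choosing a
    different G for every query yields a different similarity
    criterion on the same feature space."""
    d = x - y
    return float(np.sqrt(d @ G @ d))
```

With G equal to the identity the two coincide; stretching one axis of G makes differences along that feature dimension count more, which is exactly the kind of per-query reshaping of similarity that the example above relies on.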
6. CONCLUSIONS
In this paper we have introduced the idea that semantics is not a property of a single image, but of the interaction of the user with the database. We have also introduced the related idea that semantics is not inherent in a single image, but is revealed in the configuration of the totality of images in a database. We have introduced the decision cycle model of interaction as a possible way to engage in a continuous interaction with a database and let the emergent semantics of the images become manifest. The adoption of this interaction model poses some requirements on the database interface and on the very organization of the database engine. This paper was concerned mainly with the interface organization.
We have introduced a model called the direct manipulation interface with the necessary characteristics to support the decision cycle interaction. Finally, we have presented an implementation of this model of interface in our system El Niño.

This is our first interface for emergent semantics. Experience with its use has already generated some ideas for improvements, and will certainly generate more in the future. In particular, in some cases it can be useful to generate more than one display space and give the user the possibility to define relations between them (e.g., place the same concepts in all display spaces and let them change position in all display spaces as a consequence of a reorganization of a single space).10 We are currently working on the formalization of some of these ideas and on their introduction into the next version of our interface.

Another issue that we are considering is the "ignition" of the search process: it is necessary somehow to generate a first set of partially satisfactory images from which the visual search can start. One possibility that we are exploring is partial labeling. It is generally acknowledged that labeling a database is too time consuming to be practical and that, in any case, labels can capture only part of the semantics of an image. It is possible, however, to label only a part of the database, and do the textual search on that part only. The results, although unsatisfactory, will in general be good enough to start the visual search process.
REFERENCES
1. S. F. Chang and J. R. Smith, "Extracting multi-dimensional signal features for content-based visual query," in SPIE Symposium on Communications and Signal Processing, 1995.
2. D. Forsyth and M. Fleck, "Finding people and animals by guided assembly," in Proceedings of the 1997 IEEE International Conference on Image Processing, Santa Barbara, pp. III-5–III-8, 1997.
3. C. Faloutsos, R. Barber, M. Flickner, J. Hafner, et al., "Efficient and effective querying by image content," Journal of Intelligent Information Systems: Integrating Artificial Intelligence and Database Technologies 3(3-4), pp. 231–262, 1994.
4. R. Fagin, "Combining fuzzy information from multiple systems," in Proceedings of the 15th ACM Symposium on Principles of Database Systems, Montreal, 1996.
5. D. A. Norman and S. W. Draper (eds.), User Centered System Design: New Perspectives on Human-Computer Interaction, L. Erlbaum Associates, Hillsdale, N.J., 1986.
6. D. Kirsh, "Interactivity and multimedia interfaces," Instructional Science 25(2), pp. 79–96, 1997.
7. S. Santini, Explorations in Image Databases. PhD thesis, University of California, San Diego, January 1998.
8. S. Santini and R. Jain, "Similarity is a geometer," Multimedia Tools and Applications 5, November 1997. Available at http://www-cse.ucsd.edu/users/ssantini.
9. C. Kalisa and B. Torresani, "N-dimensional affine Weyl-Heisenberg wavelets," Annales de l'Institut Henri Poincaré, Physique Théorique 59(2), pp. 201–236, 1993.
10. M. Livny, R. Ramakrishnan, G. Chen, D. Donjerkovic, S. Lawande, J. Myllymaki, and K. Wenger, "DEVise: integrated querying and visual exploration of large datasets," in Proceedings of the 1997 ACM SIGMOD Conference, pp. 301–312, 1997.