A review on knowledge-based computer vision

Sandro Rama Fiorini, Mara Abel
Institute of Informatics — UFRGS — Porto Alegre, Brazil
{srfiorini, marabel}@inf.ufrgs.br

August 2010

Abstract

In this study we review the area of Knowledge-Based Computer Vision (KBCV), focusing on its knowledge representation aspects, starting from the year 2000. We introduce the topic by reviewing what researchers during the 1990’s saw as the future path to be taken and how this influenced recent contributions. We then proceed to analyse the contributions in visual knowledge representation. We divide the analysis into a what part and a how part. In the what part, we discuss the reviewed contributions in the light of the different types of knowledge necessary for KBCV. In the how part, we discuss contributions on how this knowledge can be represented and structured, regarding representation formalism, model structure and symbol grounding. We also review full approaches to KBCV concerning their application domain. We conclude by exploring some alternatives for future research.
1 Introduction
The quest for creating intelligent vision systems still lacks a proper happy ending. Meanwhile, the motivation gets stronger every year. The fall in prices of digital devices for image and video capture has made possible their use in all sorts of domains, from recreational photography to surveillance. The consequence is that a huge amount of raw data is being gathered without an efficient way to interpret and query it. As an example of this phenomenon, the effectiveness of the public camera surveillance system in Britain has recently been put under question given operational difficulties (Owen Bowcott, 2008; Gill and Spriggs, 2005); some of them are related to the huge amount of visual information that needs to be monitored live by officers in monitoring teams (in some cases, a single operator may have to monitor up to a hundred cameras (Gill and Spriggs, 2005)). Related initiatives recognize the need for better software tools for helping officers to identify relevant events (Mark Hosenball, 2010). Nowadays, a great part of the research effort is being put into creating means of analysing and interpreting the visual raw data generated in such domains. This is a hard problem and it has been approached by researchers in many different ways. This review presents a survey of knowledge-based computer vision techniques for tackling it. In particular, we review what has been done in terms of visual knowledge representation in the last ten years.
Today, the motivation for Knowledge-Based Computer Vision (KBCV) is still present. While there is no exact account of how visual knowledge is represented in the human brain (Patterson et al., 2007), the traditional conception is that the visual cognitive process that supports visual interpretation is, at least in part, symbolic (Goldberg et al., 2006; Auer et al., 2005). In a recent survey, David Vernon et al. (Vernon et al., 2007) did a very comprehensive review of the existing proposals of cognitive systems. Vernon separates approaches into two basic kinds. The cognitivist approaches are those where the knowledge embedded in the cognitive system is represented symbolically. The emergent approaches are those where the knowledge emerges from the dynamics of simple operations encoded in the system, as in connectionist and enactive approaches. According to Vernon, “the arguments in favour of dynamical systems and enactive systems are compelling but the current capabilities of cognitivist systems are actually more advanced” (Vernon et al., 2007). Also, Bonarini et al. (Bonarini et al., 2007) argue that abstracting symbolic models from raw sensor data helps in noise filtering and in keeping the sensorial coherence during sensor fusion. The typical architecture of a KBCV system is structured in two different subsystems: low-level and high-level processing (Figure 1). At low-level processing, the raw data is processed by signal processing algorithms, usually generating feature descriptions, sets of symbolic descriptors which summarize characteristics of the data in a quantitative way. Examples of low-level processing algorithms are segmentation, colour detection, texture detection, noise reduction, occlusion detection and so forth. As Draper et al. write (Draper et al., 1996), these algorithms were the main focus of research in computer vision for a long time and have come a long way in terms of performance.
Figure 1: Typical architecture of KBCV.

On the other hand, high-level processing is related to the interpretation of and reasoning with visual data. It is usually built on top of the low-level processing algorithms, taking feature descriptors as input and generating abstract, qualitative descriptions of the content of the visual data. We call these content descriptions. Ideally, the high-level processing can also act on the
low-level processing, adjusting its parameters to improve their performance based on the generated content descriptions, creating a kind of feedback loop between the two levels (usually called bottom-up and top-down reasoning). High-level processing has been implemented using various forms of symbolic artificial intelligence, from rule-based to probabilistic systems. These systems usually employ some sort of a priori knowledge about the visual interpretation to be made. The first approaches combining symbolic AI and visual data processing date from the end of the 1970’s and beginning of the 1980’s. The first sizeable efforts in this direction combined expert systems with low-level image processing. Examples are VISIONS/Schema (Hanson and Riseman, 1978; Draper et al., 1989) and SIGMA (Matsuyama, 1987). For instance, the SPAM system (McKeown et al., 1985) was able to extract basic feature descriptions from images and use them in a rule-based reasoning system for visual interpretation. In an interesting survey about these systems, Crevier and Lepage (Crevier and Lepage, 1997) organize the previous contributions as components of a hypothetical image understanding shell, i.e., as an adaptation of the expert system shell idea. In 1996, the authors could see that an evolution was taking place, from problem- and domain-specific image understanding systems to more generic approaches. In their conclusion, Crevier and Lepage state some issues to be addressed in order to achieve that. First, the interfaces of image processing algorithms should be standardized, including the representation of the results produced by them, i.e., regions, segments and so on. This would allow easier integration of the many existing low-level processing modules. Secondly, while suggesting that standardization of high-level representations is difficult, they point out that a distinction should be made between declarative knowledge about image content and knowledge about the procedures to extract it. The third and last point is about the methods for knowledge entry, or knowledge acquisition by the system, asserting that “the true power of image understanding systems will not unfold until the advent of automated learning techniques, both for object description and recognition strategies” (Crevier and Lepage, 1997). To some extent, these issues concern the knowledge engineering aspect of the problem. This has also been pointed out by the proponents of the VISIONS/Schema system in (Draper et al., 1996). While reviewing the problems faced during the development of VISIONS/Schema, the authors argued that a better definition of control knowledge, i.e., the knowledge about the interpretation procedure, is necessary. However, part of the problem lies in representing a priori knowledge:

[...] Intraclass variation would lead to failures when new instances of an object class were encountered that did not fit the previous model. The VISIONS/Schema System, for example, modelled houses as having windows and windows as having shutters. The first time a picture of a house without shutters was encountered, our knowledge about houses had to be loosened, leading to the unfortunate possibility of more false matches.

The paper also points to the difficulties in acquiring knowledge for complex knowledge-based vision, given the high granularity usually required to represent both object and control knowledge. Furthermore, some of the problems arise from the fact that the technique of acquiring knowledge from an expert
does not work in computer vision. Instead, according to the authors, “vision researchers have concentrated on making knowledge easier to declare” (Draper et al., 1996), like high-level languages to describe objects (McKeown et al., 1989). The authors also argue (perhaps the most relevant assertion of the paper) that the control problem in knowledge-based vision lacks a proper underlying scientific theory upon which it could be formalized. In our opinion, a good theory formalizing control presupposes the existence of a good underlying theory for visual knowledge representation. The issues pointed out by these authors summarize and give the overall direction of many efforts in visual knowledge representation in the following years. As in the early years, many of the recent advances in KBCV have originated from further advances in AI. In this survey, we review some of the approaches to knowledge representation published since the year 2000. We divide our review of the aspects of visual knowledge representation into two parts. First we address the what part of the problem; that is, what information is to be represented in terms of representation constructs. Then, we address the how part of the problem, where we give attention to the representation formalisms used in KBCV and the usual model architectures encountered. We also discuss symbol grounding aspects. Next we review KBCV approaches according to the nature of the visual content to be interpreted, i.e., image, video or robotic vision. Our focus is to present each reviewed approach as a unit, commenting on application domain and so on. Finally, we draw some conclusions and suggest some future steps that should be taken.
2 Visual knowledge representation
Naturally, knowledge representation plays a central role in KBCV. The challenge is to appropriately represent the visual knowledge necessary for solving a given task in scene interpretation, like object recognition, spatial reasoning and so on. As emphasized above, early surveys in the area have suggested that more attention should be given to the standardization of visual knowledge models. Recent research followed these suggestions, proposing formalisms and conceptual models for that purpose. The contributions can be separated into two different areas: (a) conceptual modelling of visual knowledge, defining what to represent in terms of modelling constructs; and (b) representation formalisms and structuring, defining how to represent conceptual models of visual knowledge.
2.1 What to represent: conceptual modelling
The question of what to represent concerns the conceptual model necessary to enable visual interpretation in KBCV. The literature presents different sets of modelling constructs for representing different aspects of the problem, which we call visual subdomains. Examples of visual subdomains are signal features, visual features, events, context and so on. Concerning the recent trends in conceptual modelling in KBCV, perhaps one of the most pervasive concepts is the notion of ontologies, as we can find frequent references to it in the reviewed work. The most accepted definition of
ontology in Computer Science is that of Gruber (Gruber, 1995), which states that an ontology is a “formal specification of a shared conceptualization”. In other words, an ontology is the materialization in a document, using a formal representation language (e.g. first-order logic), of the knowledge (conceptualization) about a given domain shared by a community. In KBCV, the notion of ontologies helped in organizing and making explicit the declarative knowledge about scene interpretation, particularly in stabilizing a common terminology for each of the visual subdomains. In the remainder of this section, we review the recent contributions of the literature in each subdomain.
2.1.1 Signal features
In general, signal features are low-level entities generated by signal (image or video) processing algorithms, like pixels, regions and their properties. There is a set of common constructs for representing signal features which is pervasive across almost all studied approaches (e.g.: pixel, region, colour features, and so on), even if explicit modelling of signal features is only present in some of them (e.g. (Hudelot et al., 2005; Simou et al., 2005; Papadopoulos et al., 2007a)). A general characteristic of signal features is their quantitative nature, where numeric representations are used (contrasting with the qualitative nature of visual primitives, presented next). The most prominent low-level construct is generally referred to as a blob. It is an arbitrary set of signal data (e.g. (Chella et al., 2001, 1997; Bennett et al., 2008; Gonzàlez et al., 2009; Town, 2006)), usually the result of segmentation algorithms, further characterized by attributes like predominant pixel colour, size, position, and so on.
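As a purely illustrative sketch (not taken from any of the reviewed systems), a blob and its quantitative attributes could be recorded as a simple data structure; the names used below (Blob, dominant_colour, etc.) are hypothetical:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Blob:
    """Hypothetical record for a segmented region (a 'blob') and its signal features."""
    pixels: int                                 # number of pixels in the region
    bounding_box: Tuple[int, int, int, int]     # (x_min, y_min, x_max, y_max)
    centroid: Tuple[float, float]               # image coordinates of the region centre
    dominant_colour: Tuple[int, int, int]       # mean RGB value, purely quantitative

# A blob carries only quantitative descriptors; qualitative interpretation
# (e.g. calling its colour "blue") is left to higher levels (see visual primitives below).
region = Blob(pixels=5120, bounding_box=(40, 60, 120, 140),
              centroid=(80.0, 100.0), dominant_colour=(30, 60, 200))
```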
Figure 2: Image processing ontology fragment (Hudelot, 2005).
The use of simple blobs is correlated with ad hoc approaches for modelling signal features. On the other hand, some approaches propose more refined definitions of signal features. In Project Orion, Thonnat et al. (Maillot and Thonnat, 2008; Hudelot et al., 2005; Hudelot, 2005) organized types of signal features in an image processing ontology, also including knowledge about the types of processing algorithms used to produce them. The constructs are organized as a taxonomy of concepts, which is depicted in Figure 2. They are divided into two groups. The image data concepts represent aspects of the image data itself. They are further divided into image entities concepts, for representing image data structures (e.g.: regions, edges and region graphs), and image features concepts, for representing calculated image features (e.g.: measures of size, position, shape, colour and texture). The second group is composed of image processing functionality concepts, which aim to represent kinds of processing algorithms according to their intention (e.g.: segmentation, object extraction, etc.). Similar work has been done in the MPEG-7 standard. Among other things, it proposes standards for defining meta-data about audio-visual data. In particular, the MPEG-7 Visual Standard (Sikora, 2001) defines constructs for describing signal features of visual data, like colour, texture, shape, motion and so on. In (Simou et al., 2005), these constructs have been formalized as a taxonomy in the Visual Description Ontology, aimed at supporting reasoning over extracted visual data (see (García and Celma, 2005) for a similar approach). Figure 3 shows the part of the hierarchy containing the colour descriptors. It has been used to support further scene interpretation in (Papadopoulos et al., 2007a,b).
Figure 3: Part of the Visual Description Ontology presented in (Simou et al., 2005)
2.1.2 Visual Primitives
Visual primitives are descriptors used to create generic, domain-independent content descriptions of visual data. In particular, they are intended for describing common visual features of the objects depicted in visual data. Examples are primitives describing shapes, colours, textures and so on. The importance of visual primitives is that they abstract quantitative details about visual data
without necessarily committing to a certain domain. For instance, one can describe the colour of a given object by using a certain code of the HSV colour space, or by using a more abstract term, like “blue”. In that sense, visual primitives are qualitative descriptors. There are several proposals of descriptors that match our definition of visual primitives (Town, 2006; Shanahan and Randell, 2004; Thirde et al., 2006; Shapiro and Ismail, 2003; Fusier et al., 2007; Georis et al., 2006; Brémond et al., 2004; Hudelot et al., 2005; Hudelot, 2005; Maillot et al., 2004; Papadopoulos et al., 2007b; Bloehdorn et al., 2005). Some visual primitives, like colour, size and shape, are ubiquitous. However, there is great variation in the way they are represented. In some approaches, e.g., (Town, 2006; Shanahan and Randell, 2004; Thirde et al., 2006; Shapiro and Ismail, 2003), visual primitives for describing common visual features, like colour, size and shape, are presented in an ad hoc way, i.e., with little or no structuring or further formal definition. Brémond et al. (Fusier et al., 2007; Georis et al., 2006; Brémond et al., 2004) present a conceptual model for modelling objects and events in dynamic scenes. In terms of visual primitives, it defines that concepts representing physical objects are characterized by three different kinds of visual attributes: position-based (e.g.: speed, direction, trajectory); global appearance (e.g.: height, width, global colour); and local appearance (e.g.: silhouette, posture, sub-part colour). In related work by Thonnat et al. (Hudelot et al., 2005; Hudelot, 2005; Maillot et al., 2004), an ontology of visual primitives is proposed. The visual concept ontology is a “terminological ontology which can be defined as a common vocabulary used by humans to visually describe real world objects and scenes” (Hudelot et al., 2005). It is divided into three parts. The spatial concepts are primitives for describing the spatial extension of objects, including notions of shape, size and location (Figure 4). The colour concepts are primitives for representing colours regarding their hue, lightness and saturation. The texture concepts are primitives for representing texture (e.g.: granulated, oriented and uniform). The visual concept ontology also includes spatial relations, which are discussed in section 2.1.3. The Multimedia Ontology is an interesting approach of Papadopoulos et al. (Papadopoulos et al., 2007b; Bloehdorn et al., 2005). The main primitive of the ontology is the Multimedia Information Object, an extension of the Information Object design pattern of the DOLCE core ontology (Gangemi et al., 2005, 2002). In this design pattern (depicted in Figure 5 (Papadopoulos et al., 2007b)), the Information Object (IO) represents any piece of multimedia data, like parts of images, video, audio or text. The design pattern also takes into account essential relations involving information objects. A given IO has a format (e.g. JPEG, GIF, UTF-8, etc.); is realized by a multimedia file; is about some domain concept; and can be interpreted by some agent (e.g. a segmentation algorithm). It also conveys the representation of the different modalities in which a domain concept can be represented; this defines the content of the IO itself. For representing visual information, the IO is extended to a Visual Information Object, which is further detailed with specific visual features.
This IO design pattern is useful for organizing knowledge about information itself, putting together all related concepts of information realization, format, content and consumption.
Coradeschi et al. recently employed DOLCE concepts (Loutfi et al., 2008) to extend their symbol anchoring framework for robots (Coradeschi and Saffiotti, 2000, 2003). In the DOLCE ontology, entities can be characterized by qualities, which intuitively capture the notion of “attribute” of a certain concept. Among the types of qualities are size, smell and colour. DOLCE also makes a distinction between a quality and its qualia, the actual values a quality can assume (e.g.: colour names, or terms denoting size categories). In Coradeschi’s work, qualities represent basic entities that can be perceived and measured by agents (Loutfi et al., 2008). This is realized by representing relations from perceivable domain concepts to sets of qualia. For instance, coloured domain concepts are mapped to the qualia set for colour. These quale values are then anchored in a perceptual system. It seems that the notion of qualities and qualia improved the organization of symbol anchoring. Finally, it is worth mentioning the noise terms employed in (Shanahan and Randell, 2004) in order to “explain away” interpretations of the raw visual data.

Figure 4: Taxonomy of visual primitives related to shapes, proposed in (Hudelot et al., 2005).
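To make the relation between qualities, qualia and measurements concrete, the following is a minimal, hypothetical sketch (not the mechanism of (Loutfi et al., 2008)) in which the quality colour has a set of qualia, each grounded in an interval of hue values computed from the raw measurement:

```python
import colorsys

# Hypothetical qualia set for the quality "colour": qualitative terms a
# high-level model may use, each grounded in a hue interval (in degrees).
COLOUR_QUALIA = {
    "red":    (345.0, 15.0),
    "yellow": (45.0, 75.0),
    "green":  (75.0, 165.0),
    "blue":   (165.0, 255.0),
}

def colour_quale(rgb):
    """Map a quantitative RGB measurement to a qualitative colour term (a quale)."""
    r, g, b = (c / 255.0 for c in rgb)
    hue = colorsys.rgb_to_hsv(r, g, b)[0] * 360.0
    for term, (lo, hi) in COLOUR_QUALIA.items():
        if lo <= hi and lo <= hue < hi:
            return term
        if lo > hi and (hue >= lo or hue < hi):   # interval wrapping around 360 degrees
            return term
    return "unknown"

print(colour_quale((30, 60, 200)))   # -> "blue"
```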
2.1.3 Spatial relations
Some of the reviewed approaches include some sort of explicit representation of spatial relations. Spatial relations can be seen as defining a kind of context that can be exploited in visual interpretation; for instance, the knowledge that a cup is usually on a table. As in (Hudelot et al., 2005; Hudelot, 2005), we separate spatial relations into groups conveying information about orientation, distance and topology. Proposals for distance and orientation (projective) relations are given in (Hudelot et al., 2005; Hartz and Neumann, 2007; Loutfi et al., 2008).
Figure 5: Multimedia Information Object (MMIO) Design Pattern (Papadopoulos et al., 2007b).

As mentioned in (Hudelot et al., 2005), both kinds of relations need a primary object, a reference object and a frame of reference. The frame of reference is needed in order to calculate the relation between the primary and the reference object. There are three types of reference frames: (a) deictic, where the coordinate system related to the reference frame is given by the observer; (b) intrinsic, where the reference frame has its origin in the reference object; and (c) extrinsic, where the frame of reference is a coordinate system independent of any object in the scene or the observer. The frame of reference for orientation and distance relations is assumed to be deictic in (Loutfi et al., 2008). In (Hudelot et al., 2005), orientation relations between an object and its subparts are assumed to be intrinsic, and deictic between other objects (nothing is said about distance relations). In terms of terminology to represent orientation relations, some approaches (Hudelot et al., 2005; Hartz and Neumann, 2007) employ image-based terms (e.g.: right, left, above, above-left, and so on). In (Loutfi et al., 2008), the terms reflect the use of the system in robotics: left, right, in-front, behind. The representation of topological relations is largely based on the RCC-8 theory proposed by Cohn et al. (Cohn et al., 1997; Randell et al., 1992). RCC-8 describes a calculus for qualitative reasoning about regions of space. It simplifies earlier proposals (e.g. (Clarke, 1981)) by eliminating the distinction between open, semi-open and closed regions, which simplifies the axiomatization and improves computability. Put in simpler terms, the theory is built over the primitive dyadic relation C(a, b), denoting “a connects to b”. Based on it, eight basic topological relations are proposed (see Figure 6): DC(a, b) (“a is disconnected from b”); EC(a, b) (“a is externally connected with b”); PO(a, b) (“a partially overlaps b”); TPP(a, b) (“a is a tangential proper part of b”); NTPP(a, b) (“a is a non-tangential proper part of b”); and EQ(a, b) (“a is identical to b”). The relations TPPi(a, b) and NTPPi(a, b) are the inverse relations of TPP(a, b) and NTPP(a, b), respectively. These relations are used for representing and reasoning about topological relations in (Hudelot et al., 2005, 2008) and in Qualitative Spatial Reasoning-based approaches (Cohn et al., 2003).
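As an illustration of how the RCC-8 relations defined above could be computed, the following sketch classifies the relation between two regions represented as sets of pixel coordinates; it is not taken from the reviewed systems, and the discrete approximation of tangency is our own simplification:

```python
def neighbours(p):
    x, y = p
    return {(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)}

def externally_touches(a, b):
    # True if some pixel of a is 4-adjacent to some pixel of b.
    return any(n in b for p in a for n in neighbours(p))

def rcc8(a, b):
    """Classify the RCC-8 relation between two regions given as sets of pixel coordinates."""
    if not a & b:
        return "EC" if externally_touches(a, b) else "DC"
    if a == b:
        return "EQ"
    if a < b:   # a is a proper part of b
        # tangential if some neighbour of a pixel of a lies outside b (a reaches b's boundary)
        tangential = any(n not in b for p in a for n in neighbours(p))
        return "TPP" if tangential else "NTPP"
    if b < a:
        tangential = any(n not in a for p in b for n in neighbours(p))
        return "TPPi" if tangential else "NTPPi"
    return "PO"

square = {(x, y) for x in range(5) for y in range(5)}
inner = {(x, y) for x in range(1, 4) for y in range(1, 4)}
print(rcc8(inner, square))                                   # -> "NTPP"
print(rcc8(square, {(x + 5, y) for x, y in square}))         # -> "EC"
```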
Figure 6: Illustrations of the eight topological relations of RCC-8 (Cohn et al., 1997).

2.1.4 Domain representation
It is not surprising that KBCV systems employ domain models to represent knowledge about visual content. However, the literature provides a multitude of ways in which domain models can be structured in KBCV systems, let alone in how they relate to other models, which is discussed further down. Here, we review some proposals of representation primitives specific to modelling domain knowledge in KBCV. Usually, KBCV approaches model domain knowledge using the representation ontology embedded in the representation formalism of choice (Hudelot et al., 2005; Chella et al., 2001; Mylonas et al., 2008; Zlatoff et al., 2004), leaving the specifics of scene interpretation knowledge to other parts of the system. However, some approaches give special attention to domain representation by forcing the modelling into a specific ontological structure. Both (Brémond et al., 2004) and (Loutfi et al., 2008) propose some sort of upper-level ontology to organize the domain model. In particular, in (Loutfi et al., 2008) the DOLCE (Gangemi et al., 2002) upper-level ontology is used; since the KBCV approach is restricted to interpreting physical entities, all domain concepts must be subsumed by the DOLCE concept physical object. Other proposals provide specific representation primitives. Town (Town, 2006) proposes the SemanticCategory primitive for representing categories of things (e.g.: people, vehicles, animals), and the VisualCategory for representing categories of what he calls “stuff” (e.g.: water, skin, cloud). In an interesting approach, Bonarini et al. (Bonarini et al., 2007) propose that concepts (in the domain) are described by sets of properties; these properties can be of two types: substantial and accidental. Substantial properties are those that do not change value over time, characterizing the essence of the objects described by the class; thus, they can be used for object recognition. On the other hand, accidental properties change over time and cannot be used for object recognition, but for other activities, like object tracking. In some approaches, special attention is given to mereological aspects. In (Neumann and Moller, 2008; Hartz and Neumann, 2007), a construct called aggregate is proposed; it consists of a “set of parts tied together to form a concept and satisfying certain constraints” (Neumann and Moller, 2008). This is later used to build visual interpretations starting from primitive image evidence and moving to more complex objects.
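A minimal sketch of the distinction between substantial and accidental properties, with hypothetical concept and property names (not taken from (Bonarini et al., 2007)), could look as follows; recognition relies only on the substantial part:

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class ConceptModel:
    """Hypothetical domain concept described by two kinds of properties."""
    name: str
    substantial: Dict[str, str] = field(default_factory=dict)   # stable over time; usable for recognition
    accidental: Dict[str, str] = field(default_factory=dict)    # time-varying; usable for tracking

def recognises(observation: Dict[str, str], concept: ConceptModel) -> bool:
    """Recognition checks only substantial properties; accidental ones are ignored."""
    return all(observation.get(k) == v for k, v in concept.substantial.items())

ball = ConceptModel(
    name="ball",
    substantial={"shape": "spherical", "colour": "orange"},  # identifies the object class
    accidental={"position": "unknown", "speed": "unknown"},  # updated frame by frame while tracking
)
print(recognises({"shape": "spherical", "colour": "orange", "position": "left"}, ball))  # -> True
```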
2.2 How to represent: representation formalism and model architecture
After reviewing what kind of information is to be represented in visual knowledge models, we proceed to review how this information is in fact represented. There are two dominant aspects: which representation formalisms are suitable for knowledge representation (and reasoning) in KBCV; and what the overall architecture of visual knowledge models is. Also, there is a third aspect that is somewhat orthogonal to both: the symbol grounding problem. So, we start with it.
2.2.1 The symbol grounding problem
In AI, the symbol grounding problem, as put by Harnad (Harnad, 1990), ultimately consists in the problem of grounding the symbols of a symbolic system in entities that turn out to be also symbols, which also need to be grounded, and so on. It is ultimately a question of substantiating the meaning of symbols in something other than symbols. The grounding problem has a fundamental role in KBCV systems: by describing high-level symbols in terms of sensory input, KBCV systems are trying to give a solution to the symbol grounding problem. The implementation proposals for symbol grounding can be divided into those in which the grounding is defined completely a priori and those in which it is defined by learning methods. Nevertheless, all of them include some way of representing uncertainty. In terms of a priori approaches, it is possible to mention the work of Coradeschi et al. (Coradeschi and Saffiotti, 2000, 2003; Coradeschi et al., 2001) on symbol anchoring. Symbol anchoring is defined as a subset of the grounding problem, concerned with the grounding of symbols representing physical objects only. They define a formal framework composed of a symbol system P of high-level predicates; a perceptual system Π of percepts; a set Φ of percept attributes; a set D(Φ) of all attribute domains; and a symbol grounding relation g ⊆ P × Φ × D(Φ). The implicit notion is that percepts are composed of sets of attributes. So, the grounding relation maps high-level predicates to percepts containing the necessary attributes defined by the relation. Percepts are generated by low-level processing algorithms. The matching between high-level predicates and percepts can be complete or partial. Fuzzy sets are also used to discretize low-level information. A priori grounding relations are also used in (Hudelot et al., 2005), where domain and visual primitives are mapped to signal descriptors through fuzzy sets. On the other hand, learning methods seem to be dominant. They are usually employed when the definition of an a priori grounding is too complex. Also, they help to reduce the high dimensionality of signal descriptions. These have a number of variations. Probabilistic methods are used (Kreutzmann et al., 2009; Town, 2006) when the grounding links need to be induced from training datasets. Some hybrid approaches employ support vector machines (SVM), as in (Papadopoulos et al., 2007a; Maillot et al., 2004).
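A minimal sketch of such an a priori grounding relation g, with hypothetical predicates, attributes and admissible ranges (the cited frameworks are considerably richer, using fuzzy and partial matching), could be:

```python
# Sketch of g ⊆ P × Φ × D(Φ): each high-level predicate is associated with
# percept attributes and admissible value ranges. All names and thresholds
# below are hypothetical.
GROUNDING = {
    "red-cup": {"hue": (345.0, 360.0), "height_cm": (5.0, 15.0)},
    "door":    {"height_cm": (180.0, 220.0), "aspect_ratio": (0.3, 0.6)},
}

def matches(predicate, percept):
    """Complete match: every attribute required by the predicate lies in its admissible range."""
    required = GROUNDING[predicate]
    return all(attr in percept and lo <= percept[attr] <= hi
               for attr, (lo, hi) in required.items())

percept = {"hue": 350.0, "height_cm": 9.0, "x": 1.2, "y": 0.4}   # produced by low-level processing
print(matches("red-cup", percept))   # -> True
print(matches("door", percept))      # -> False
```

A partial match, or fuzzy membership degrees instead of crisp intervals, could be accommodated in the same structure, which is in line with the use of fuzzy sets mentioned above.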
2.2.2 Representation formalism
Research on representation formalisms (or languages) specific to KBCV is not so intense. Usually, the formalism is only used as a passive tool for
expressing the knowledge used in interpretation. Some approaches employ known languages, such as LOOM (as in (Loutfi et al., 2008)), KL-ONE (as in (Chella et al., 2001)), languages directly derived from first-order logic (as in (Shanahan, 2002; Bonarini et al., 2007)), and others (Brémond et al., 2004; Georis et al., 2006; Hudelot, 2005). On the other hand, there is a recent push in the research on description logics (DL) specific to the context of KBCV. In a series of papers, Bernd Neumann et al. (Neumann and Moller, 2008; Peraldi et al., 2009) discuss the use of DL for image and multimedia interpretation. The rationale is to formalize visual knowledge in such a way that it is possible to use the standard reasoning engines of DL to implement interpretation (taking advantage of their formal properties of decidability and so forth). A very interesting review of these efforts is done by Dasiopoulou and Kompatsiaris (Dasiopoulou and Kompatsiaris, 2010). They conclude that the open-world assumption of DL semantics matches the incomplete nature of visual interpretation, but that this is not being properly exploited. Also, they state that abductive reasoning and imprecision should be better exploited in DL systems for KBCV.
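As a purely illustrative example (not taken from the cited papers), a scene aggregate could be axiomatized in DL roughly as

\[ \mathit{Balcony} \sqsubseteq \exists\, \mathit{hasPart}.\mathit{Railing} \sqcap \exists\, \mathit{hasPart}.\mathit{Door} \sqcap \exists\, \mathit{isPartOf}.\mathit{Facade} \]

so that, given ABox assertions for a detected railing and a detected door, an abductive reasoner can hypothesize a Balcony individual that explains both observations; the concept and role names here are hypothetical.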
2.2.3 Model architecture
In general, the reviewed visual knowledge models are modularized, partially to reflect the software architecture of the interpretation and reasoning algorithms. In general, models are specialized in representing a specific visual subdomain, e.g., signal features, visual primitives, domain model, context, and so forth. Also, the organization is further correlated with the representation of symbol grounding. The approaches can be divided into roughly two groups regarding the number of levels in the architecture. Two-level architectures resemble the general KBCV architecture shown in Figure 1; they are composed of a low-level module, where signal features are the dominant entities, and a high-level module, where domain, visual and other constructs are represented together (Figure 7a). One example is described in (Arens et al., 2008), where the knowledge is separated into quantitative and conceptual reasoning modules, with an auxiliary natural language module. In (Coradeschi and Saffiotti, 2003), the architecture is divided into perceptual and symbolic modules. The domain knowledge and visual knowledge are represented together in the same module; these are then grounded directly into the perceptual system. In three-level architectures (Figure 7b), there exists an intermediate level for representing knowledge which is neither low-level nor domain specific. It usually contains knowledge in terms of generic visual primitives and spatial relations. This intermediate level is used to bridge the semantic gap between domain and signal models. A good example of this approach is the one of Project Orion (Hudelot et al., 2005), depicted in Figure 8. The semantic level, describing domain concepts, is grounded in the image level (represented by the image processing ontology described earlier) through the visual level (represented by the visual concept ontology also described earlier). This improves the level of independence between the domain model and the low level. Chella et al. (Chella et al., 2001, 1997) also propose a three-level architecture based on conceptual spaces (Gärdenfors, 2000).
Figure 7: Two-level (a) and three-level (b) model architectures.
Figure 8: Three-level architecture of Project Orion (Hudelot et al., 2005).
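To make the decoupling provided by the intermediate level concrete, the following is a minimal, hypothetical sketch (not the mechanism of any reviewed system): the domain model is stated only in terms of visual primitives, and only the intermediate level knows how those primitives are grounded in signal features:

```python
# Sketch of a three-level model (Figure 7b). All concept names, primitives and
# thresholds are hypothetical.
DOMAIN_TO_VISUAL = {
    "sea":  {"colour": "blue", "texture": "uniform"},
    "lawn": {"colour": "green", "texture": "granulated"},
}

VISUAL_TO_SIGNAL = {
    ("colour", "blue"):        lambda blob: 165.0 <= blob["hue"] < 255.0,
    ("colour", "green"):       lambda blob: 75.0 <= blob["hue"] < 165.0,
    ("texture", "uniform"):    lambda blob: blob["colour_variance"] < 10.0,
    ("texture", "granulated"): lambda blob: blob["colour_variance"] >= 10.0,
}

def interpret(blob):
    """Return the domain concepts whose visual description the blob satisfies."""
    return [concept for concept, description in DOMAIN_TO_VISUAL.items()
            if all(VISUAL_TO_SIGNAL[(primitive, value)](blob)
                   for primitive, value in description.items())]

print(interpret({"hue": 200.0, "colour_variance": 4.2}))   # -> ['sea']
```

Note that the domain model never mentions hue values or variances; replacing the low-level grounding does not require touching the domain level, which is the kind of independence discussed above.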
High-level symbolic concepts are mapped to low-level descriptors through generic visual shape descriptors called knoxels. On the other hand, in the architecture proposed in (Papadopoulos et al., 2007b), the intermediate level is represented by the Information Object described previously. It maps domain concepts to signal descriptors composed of MPEG-7 visual descriptors. Finally, the architecture proposed in (Neumann and Moller, 2008) is composed of a three-level core, where the intermediate level grounds domain concepts using Geometrical Scene Description representations (Neumann, 1989). In the next section we present the reviewed research efforts as complete approaches to KBCV, focusing on their application domains.
3 Approaches to KBCV and their application domains
KBCV is usually employed when a given task involves a description of the content of a given set of visual data. It is possible to separate this kind of task into two categories based on its ultimate goal: (a) retrieval tasks, where the goal is to query and retrieve parts of visual datasets; and (b) interpretation tasks, where the goal is to extract meaningful descriptions of visual datasets. While both kinds are strongly related (a detailed description of a given visual dataset can be used to support queries over it), the literature seems to be divided into these two topics.
3.1 Retrieval Task
In the case of retrieval tasks, there is a whole area devoted to studying them. In content-based retrieval of visual media, the system input is generally a visual dataset (images and/or videos) and a query over this dataset. The expected output is a subset of this visual dataset, containing the content described by the query. The focus here is on the visual data: the user wants the visual data itself, rather than a (semantic) description of it. This requirement generally relaxes the need for strong semantic description and grounding of the visual data, i.e., it is significantly more important to know which objects are present in the scene than where they are located and how they relate. Since the focus of this survey is on content rather than data, we do not review these techniques (even though some approaches have been mentioned in previous sections). For recent surveys on the subject, the reader can refer to (Datta et al., 2008; Liu et al., 2007).
3.2 Interpretation task
In interpretation tasks, the focus is on extracting precise content descriptions for further use (e.g. in diagnosis or planning tasks), rather than on the raw visual data. This translates into stronger requirements in knowledge modelling. We can further divide visual interpretation approaches according to how dynamic the content is. In image interpretation, the visual content is static; in video interpretation, the visual content is dynamic and needs to be tracked over time.
3.2.1 Image interpretation
Some of the studied approaches are mainly focused on extracting content from scenes which do not change over time, i.e., image interpretation. Usually, the emphasis of these systems is on the extraction of descriptions of the objects contained in the depicted scene, making knowledge models about object types, their features and their spatial relations a relevant aspect. The work of Neumann et al. (Bohlken and Neumann, 2009; Neumann and Moller, 2008; Hartz and Neumann, 2007) in the project eTRIMS (Hotz and Neumann, 2010) aims to create descriptions of terrestrial images of building facades. The objects to be interpreted are window arrays, balconies and entrances, together with their constituent parts (e.g. windows, doors, etc.). The mereological aspect plays a central role in the representation. The implementation is based on the SCENIC scene interpretation system (Terzic et al., 2007). Recently, probabilistic techniques have been incorporated into the approach (Kreutzmann et al., 2009). The work of Thonnat et al. in Project Orion (Maillot et al., 2004; Hudelot et al., 2005; Maillot and Thonnat, 2008) proposed a framework of ontologies for representing image content at three abstraction levels (called semantic levels). The approach also represents the low-level processing algorithms inside the knowledge model, in order to allow better control of the reasoning process. The approach has been applied in the domain of biology (recognizing common diseases in flowers (Hudelot, 2005)) and in image indexing (Maillot, 2005). The approach of Chella et al. shown in (Chella et al., 2001, 1997) proposed an implementation of the conceptual spaces idea of Gärdenfors (Gärdenfors, 2000). The system coordinates an intermediate representation between raw image data and a symbolic representation. The intermediate representation is a conceptual space, where the basic primitives are superquadric shapes. These are used to ground high-level symbolic knowledge about the domain of interpretation. The approach has been extended to convey visual interpretation in dynamic scenarios, with applications in robotics (Chella et al., 2003).
3.2.2 Video interpretation
Many of the recent KBCV approaches are in the video interpretation domain. There is a faint distinction between approaches that apply KBCV to robotic vision and others that apply it to the interpretation of video feeds. The predominant subject in robotics-related approaches to KBCV is how to extract image objects and track them over time. Coradeschi et al., in a series of papers (Coradeschi and Saffiotti, 2000, 2003; Loutfi et al., 2008), propose a framework for grounding symbols in robot sensors. The framework is also capable of performing sensor data fusion among sensor modalities. A similar approach is presented in (Bonarini et al., 2007). Some aspects of visual perception with application in robotics have also been investigated in (Shanahan and Randell, 2004). In video interpretation, the majority of the reviewed approaches were applied in the monitoring domain, where event and behaviour extraction are relevant. In (Bennett et al., 2008), high-level reasoning is used to help the tracking of objects. In (Town, 2006), a visual knowledge model is used in conjunction with a Bayesian network in order to interpret events from surveillance videos.
Figure 9: Tracking and event recognition in AVITRACK (Thirde et al., 2006).

The same approach is used for image indexing. In the AVITRACK project (Thirde et al., 2006; Brémond et al., 2004), ontologies describing objects and events support event interpretation in an airport monitoring scenario (Figure 9). A low-level processing module does the tracking and object recognition, while the high-level processing uses the ontologies to infer events and activities. A related approach has also been tested in bank surveillance and is described in (Georis et al., 2006). In (Arens et al., 2008; Haag and Nagel, 2000), an approach for traffic scene monitoring is presented. Its focus is on the recognition of events. The approach is based on situation graph trees. The system can also generate simple natural language descriptions of image events. In (Gonzàlez et al., 2009), a similar approach is employed for human behaviour extraction.
4 Conclusion
In this survey, we reviewed recent contributions to the KBCV area, focusing on knowledge representation issues. Our most prominent conclusion is that the area still lacks sophistication in conceptual modelling. Advances have been made with the use of ontologies, better separating declarative knowledge from procedural and control knowledge about vision. However, there is a tendency to use ontologies as “improved” data structures, limiting their power when it comes to actually representing a complete picture of the semantics of content descriptions. Indeed, we believe that this scenario could be improved by incorporating notions of formal ontology (Guarino, 1998) into the construction of these knowledge models. As shown in the review above, some works are going in this direction by incorporating parts of the DOLCE core ontology. As a second issue, it is possible to see that a purely symbolic approach to knowledge representation could be a limiting factor for the appearance of more complex KBCV systems. This can be perceived when one considers the symbol grounding problem (among others, like the frame problem (Vernon et al., 2007)). We think more research effort should be put into interfacing raw-data and high-level representations. In particular, we think that the Conceptual Spaces approach of Gärdenfors (Gärdenfors, 2000) is a good starting point, having produced some good results (Chella et al., 2003).
5 Acknowledgements
This work has been supported by CNPq (Brazilian National Research Council).
References

Arens, M., Gerber, R., Nagel, H.-H., Jan. 2008. Conceptual representations between video signals and natural language descriptions. Image and Vision Computing 26 (1), 53–66.

Auer, P., Bloch, I., Buxton, H., Courtney, P., Dickinson, S., Fisher, B., Granlund, G., Kropatsch, W., Metta, G., Neumann, B., Pinz, A., Sandini, G., Sommer, G., Vernon, D., Billard, A., Boettcher, P., Christensen, H., Crookell, A., Eberst, C., Förstner, W., Hlavac, V., Leonardis, A., Nagel, H.-H., Niemann, H., Pirri, F., Schiele, B., Tsotsos, J., Vincze, M., Bischof, H., Bülthoff, H., Cohn, T., Crowley, J., Eklundh, J.-O., Gilby, J., Kittler, J., Little, J., Nebel, B., Paletta, L., Sagerer, G., Simpson, R., Thonnat, M., Aug. 2005. A research roadmap of cognitive vision. Tech. Rep. 5.0, ECVision: The European Research Network for Cognitive Computer Vision Systems.

Bennett, B., Magee, D. R., Cohn, A. G., Hogg, D. C., Jan. 2008. Enhanced tracking and recognition of moving objects by reasoning about spatiotemporal continuity. Image and Vision Computing 26 (1), 67–81.

Bloehdorn, S., Petridis, K., Saathoff, C., Simou, N., Tzouvaras, V., Avrithis, Y., Handschuh, S., Kompatsiaris, Y., Staab, S., Strintzis, M. G., 2005. Semantic annotation of images and videos for multimedia analysis. In: The Semantic Web: Research and Applications. No. 3532 in LNCS. pp. 592–607.

Bohlken, W., Neumann, B., 2009. Generation of rules from ontologies for high-level scene interpretation. In: Governatori, G., Hall, J., Paschke, A. (Eds.), Rule Interchange and Applications. Vol. 5858 of LNCS. Springer Berlin, Heidelberg, pp. 93–107.

Bonarini, A., Matteucci, M., Restelli, M., 2007. Problems and solutions for anchoring in multi-robot applications. J. Intell. Fuzzy Syst. 18 (3), 245–254.

Brémond, F., Maillot, N., Thonnat, M., Vu, V.-T., 2004. Ontologies for video events. Rapport de recherche 5189, INRIA, Sophia-Antipolis, France.

Chella, A., Frixione, M., Gaglio, S., Jan. 1997. A cognitive architecture for artificial vision. Artificial Intelligence 89 (1-2), 73–111.

Chella, A., Frixione, M., Gaglio, S., 2001. Conceptual spaces for computer vision representations. Artificial Intelligence Review 16 (2), 137–152.

Chella, A., Frixione, M., Gaglio, S., 2003. Anchoring symbols to conceptual spaces: the case of dynamic scenarios. Robotics and Autonomous Systems 43 (2-3), 175–188.

Clarke, B. L., 1981. A calculus of individuals based on ”Connection”. Notre Dame Journal of Formal Logic 22, 204–219.
Cohn, A. G., Bennett, B., Gooday, J., Gotts, N. M., Oct. 1997. Qualitative spatial representation and reasoning with the region connection calculus. GeoInformatica 1 (3), 275–316.

Cohn, A. G., Magee, D., Galata, A., Hogg, D., Hazarika, S., 2003. Towards an architecture for cognitive vision using qualitative spatio-temporal representations and abduction. In: Spatial Cognition III. No. 2685 in Lecture Notes in Computer Science. Springer, Berlin, pp. 246–262.

Coradeschi, S., Driankov, D., Karlsson, L., Saffiotti, A., 2001. Fuzzy anchoring. In: The 10th IEEE International Conference on Fuzzy Systems. Vol. 1. pp. 111–114.

Coradeschi, S., Saffiotti, A., 2000. Anchoring symbols to sensor data: Preliminary report. In: Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence. AAAI Press / The MIT Press, pp. 129–135.

Coradeschi, S., Saffiotti, A., May 2003. An introduction to the anchoring problem. Robotics and Autonomous Systems 43 (2-3), 85–96.

Crevier, D., Lepage, R., 1997. Knowledge-based image understanding systems: A survey. Computer Vision and Image Understanding 67 (2), 161–185.

Dasiopoulou, S., Kompatsiaris, I., 2010. Trends and issues in description logics frameworks for image interpretation. In: Artificial Intelligence: Theories, Models and Applications. pp. 61–70.

Datta, R., Joshi, D., Li, J., Wang, J. Z., 2008. Image retrieval: Ideas, influences, and trends of the new age. ACM Comput. Surv. 40 (2), 1–60.

Draper, B., Hanson, A., Riseman, E., 1996. Knowledge-directed vision: control, learning, and integration. Proceedings of the IEEE 84 (11), 1625–1637.

Draper, B. A., Collins, R. T., Brolio, J., Hanson, A. R., Riseman, E. M., Jan. 1989. The schema system. International Journal of Computer Vision 2 (3), 209–250.

Fusier, F., Valentin, V., Brémond, F., Thonnat, M., Borg, M., Thirde, D., Ferryman, J., 2007. Video understanding for complex activity recognition. Machine Vision and Applications 18 (3), 167–188.

Gangemi, A., Borgo, S., Catenacci, C., Lehmann, J., 2005. Task taxonomies for knowledge content. Deliverable EU FP6 - D07, Laboratory for Applied Ontology (ISTC-CNR), Trento, Italy.

Gangemi, A., Guarino, N., Masolo, C., Oltramari, A., Schneider, L., 2002. Sweetening ontologies with DOLCE. In: Gómez-Pérez, A., Benjamins, V. R. (Eds.), Knowledge Engineering and Knowledge Management: Ontologies and the Semantic Web. No. 2473 in LNCS. pp. 223–233.

García, R., Celma, O., 2005. Semantic integration and retrieval of multimedia metadata. In: Proc. of 5th International Workshop on Knowledge Markup and Semantic Annotation (SemAnnot ’05). Vol. 185. CEUR-WS, Galway, Ireland, pp. 69–80.
Georis, B., Maziere, M., Brémond, F., 2006. Evaluation and knowledge representation formalisms to improve video understanding. In: Computer Vision Systems, International Conference on. IEEE Computer Society, Los Alamitos, CA, USA, p. 27.

Gill, M., Spriggs, A., 2005. Assessing the impact of CCTV. Tech. Rep. 292, Home Office Research, UK.

Goldberg, R. F., Perfetti, C. A., Schneider, W., 2006. Perceptual knowledge retrieval activates sensory brain regions. Journal of Neuroscience 26 (18), 4917.

Gonzàlez, J., Rowe, D., Varona, J., Xavier Roca, F., Sep. 2009. Understanding dynamic scenes based on human sequence evaluation. Image and Vision Computing 27 (10), 1433–1444.

Gärdenfors, P., 2000. Conceptual Spaces: The Geometry of Thought. The MIT Press, Cambridge, Massachusetts.

Gruber, T. R., 1995. Toward principles for the design of ontologies used for knowledge sharing. Int. J. Hum.-Comput. Stud. 43 (5-6), 907–928.

Guarino, N., 1998. Formal ontology and information systems. In: Formal Ontology in Information Systems. IOS Press, Amsterdam, pp. 3–15.

Haag, M., Nagel, H. H., Jan. 2000. Incremental recognition of traffic situations from video image sequences. Image and Vision Computing 18 (2), 137–153.

Hanson, A., Riseman, E., 1978. VISIONS: a computer system for interpreting scenes. Computer Vision Systems.

Harnad, S., 1990. The symbol grounding problem. Phys. D 42 (1-3), 335–346.

Hartz, J., Neumann, B., 2007. Learning a knowledge base of ontological concepts for high-level scene interpretation. Cincinnati (Ohio, USA).

Hotz, L., Neumann, B., Apr. 2010. Learning and recognizing structures in façade scenes (eTRIMS) — a retrospective. KI - Künstliche Intelligenz 24 (1), 63–68.

Hudelot, C., 2005. Towards a cognitive vision platform for semantic image interpretation; application to the recognition of biological organisms. Thesis (PhD in computer science), Université de Nice Sophia Antipolis.

Hudelot, C., Atif, J., Bloch, I., Aug. 2008. Fuzzy spatial relation ontology for image interpretation. Fuzzy Sets and Systems 159 (15), 1929–1951.

Hudelot, C., Maillot, N., Thonnat, M., 2005. Symbol grounding for semantic image interpretation: From image data to semantics. In: Tenth IEEE International Conference on Computer Vision, 2005. IEEE, Los Alamitos, USA, p. 1875.

Kreutzmann, A., Terzić, K., Neumann, B., 2009. Context-aware classification for incremental scene interpretation. In: Proceedings of the Workshop on Use of Context in Vision Processing - UCVP ’09. Boston, Massachusetts, pp. 1–6.
Liu, Y., Zhang, D., Lu, G., Ma, W.-Y., Jan. 2007. A survey of content-based image retrieval with high-level semantics. Pattern Recognition 40 (1), 262–282.

Loutfi, A., Coradeschi, S., Daoutis, M., Melchert, J., 2008. Using knowledge representation for perceptual anchoring in a robotic system. International Journal on Artificial Intelligence Tools 17 (5), 925–944.

Maillot, N., 2005. Ontology based object learning and recognition. Thesis (PhD in computer science), École Doctorale STIC, Université de Nice.

Maillot, N., Thonnat, M., Boucher, A., Dec. 2004. Towards ontology-based cognitive vision. Machine Vision and Applications 16 (1), 33–40.

Maillot, N. E., Thonnat, M., Jan. 2008. Ontology based complex object recognition. Image and Vision Computing 26 (1), 102–113.

Mark Hosenball, May 2010. NYPD developing CCTV camera system that will be better than London’s.

Matsuyama, T., 1987. Knowledge-based aerial image understanding systems and expert systems for image processing. IEEE Transactions on Geoscience and Remote Sensing GE-25 (3), 305–316.

McKeown, D. M., Harvey, W. A., McDermott, J., 1985. Rule-based interpretation of aerial imagery. Pattern Analysis and Machine Intelligence, IEEE Transactions on PAMI-7 (5), 570–585.

McKeown, J., Harvey, W. A., Wixson, L. E., Apr. 1989. Automating knowledge acquisition for aerial image interpretation. Computer Vision, Graphics, and Image Processing 46 (1), 37–81.

Mylonas, P., Athanasiadis, T., Wallace, M., Avrithis, Y., Kollias, S., 2008. Semantic representation of multimedia content: Knowledge representation and semantic indexing. Multimedia Tools and Applications 39 (3), 293–327.

Neumann, B., 1989. Natural language description of time-varying scenes. In: Semantic Structures: Advances in Natural Language Processing. D. Waltz, London, UK, pp. 167–206.

Neumann, B., Moller, R., Jan. 2008. On scene interpretation with description logics. Image and Vision Computing 26 (1), 82–101.

Owen Bowcott, May 2008. CCTV boom has failed to slash crime, say police. http://www.guardian.co.uk/uk/2008/may/06/ukcrime1.

Papadopoulos, G., Mezaris, V., Kompatsiaris, I., Strintzis, M., 2007a. Combining global and local information for knowledge-assisted image analysis and classification. Eurasip Journal on Advances in Signal Processing 2007, 45842.

Papadopoulos, G., Mezaris, V., Kompatsiaris, I., Strintzis, M., 2007b. Ontology-driven semantic video analysis using visual information objects. In: Semantic Multimedia. pp. 56–69.
Patterson, K., Nestor, P. J., Rogers, T. T., 2007. Where do you know what you know? The representation of semantic knowledge in the human brain. Nature Reviews Neuroscience 8 (12), 976–987.

Peraldi, S. E., Kaya, A., Möller, R., 2009. Formalizing multimedia interpretation based on abduction over description logic ABoxes. In: Proceedings of the 22nd International Workshop on Description Logics (DL2009). CEUR Workshop Proceedings (Vol. 477). Oxford, UK.

Randell, D. A., Cui, Z., Cohn, A. G., 1992. A spatial logic based on regions and connection. In: 3rd International Conference on Knowledge Representation and Reasoning. San Mateo, pp. 165–176.

Shanahan, M., 2002. A logical account of perception incorporating feedback and expectation. In: Principles of Knowledge Representation and Reasoning: Proceedings of the International Conference (KR2002).

Shanahan, M., Randell, D., 2004. A logic-based formulation of active visual perception. In: Principles of Knowledge Representation and Reasoning: Proceedings of the Ninth International Conference (KR2004). Menlo Park, Calif., pp. 64–72.

Shapiro, S. C., Ismail, H. O., May 2003. Anchoring in a grounded layered architecture with integrated reasoning. Robotics and Autonomous Systems 43 (2-3), 97–108.

Sikora, T., 2001. The MPEG-7 visual standard for content description - an overview. Circuits and Systems for Video Technology, IEEE Transactions on 11 (6), 696–702.

Simou, N., Tzouvaras, V., Avrithis, Y., Stamou, G., Kollias, S., 2005. A visual descriptor ontology for multimedia reasoning. In: Proc. of WIAMIS ’05.

Terzic, K., Hotz, L., Neumann, B., 2007. Division of work during behaviour recognition - the SCENIC approach. In: Proceedings of the Workshop on Behaviour Monitoring and Interpretation BMI’07. Vol. 296 of CEUR-WS. CEUR-WS, Osnabrück, Germany, pp. 144–159.

Thirde, D., Borg, M., Ferryman, J., Fusier, F., Valentin, V., Brémond, F., Thonnat, M., 2006. A real-time scene understanding system for airport apron monitoring. In: Computer Vision Systems, International Conference on. IEEE Computer Society, Los Alamitos, CA, USA, p. 26.

Town, C., 2006. Ontological inference for image and video analysis. Machine Vision and Applications 17 (2), 94–115.

Vernon, D., Metta, G., Sandini, G., 2007. A survey of artificial cognitive systems: Implications for the autonomous development of mental capabilities in computational agents. Evolutionary Computation, IEEE Transactions on 11 (2), 151–180.

Zlatoff, N., Tellez, B., Baskurt, A., 2004. Image understanding and scene models: a generic framework integrating domain knowledge and gestalt theory. In: 2004 International Conference on Image Processing (ICIP ’04). Singapore, pp. 2355–2358.