Intel Serv Robotics DOI 10.1007/s11370-012-0117-z
SPECIAL ISSUE
Towards concept anchoring for cognitive robots
Marios Daoutis · Silvia Coradeschi · Amy Loutfi
Received: 23 April 2012 / Accepted: 4 September 2012 © Springer-Verlag 2012
Abstract We present a model for anchoring categorical conceptual information which originates from physical perception and the web. The model is an extension of the anchoring framework, which is used to create and maintain over time semantically grounded sensor information. Using the augmented anchoring framework, which employs complex symbolic knowledge from a commonsense knowledge base, we attempt to ground and integrate symbolic and perceptual data that are available on the web. We introduce conceptual anchors, which are representations of general, concrete conceptual terms. We show in an example scenario how conceptual anchors can be coherently integrated with perceptual anchors and commonsense information for the acquisition of novel concepts.

Keywords Anchoring · Categorical perception · Near sets · Knowledge representation · Commonsense information
M. Daoutis (B) · S. Coradeschi · A. Loutfi
Cognitive Robotics Lab, Department of Science and Technology, Örebro University, 70281 Örebro, Sweden
e-mail: [email protected]

S. Coradeschi
e-mail: [email protected]

A. Loutfi
e-mail: [email protected]

1 Introduction

Autonomous cognitive robots are often envisioned to perform difficult tasks in real environments or to interact effortlessly with humans. This view represents one aspect of the long-term goal of artificial intelligence and robotics: to create artificial cognitive systems that match and eventually exceed
human level competence in a multitude of tasks. As artificial systems progress steadily from sensor-based reactivity to cognitive computational models that allow them to perceive, learn, reason and interact, we have to deal with increasingly difficult challenges concerning information representation. In the present work we study how perceptual information can be associated with semantic information towards the grounding of concepts in cognitive robots. In this context we require open-ended systems in which domain knowledge can be dynamically acquired or defined a priori. The World Wide Web is a possible source of knowledge that satisfies both requirements: it is available on demand, and it is continuously updated, or rather evolved, through contributions from human users. Indeed, recent trends have begun to consider the web as an alternative source for acquiring such information [13]. Notable examples include the Semantic Robot Vision Challenge,¹ an effort towards robots searching the environment for objects learned from images retrieved from the web, as well as the work by Tenorth et al. [31], who use the web to enable the natural language processing and execution of instructions for everyday manipulation tasks. By adding another information source to the intelligent agent, we must model both its physical perception and the perception of the information available on the web. We therefore have to deal with multiple heterogeneous perceptual systems, where information processing is used to transfer knowledge from the world into the robot's world model. In the world model, knowledge should be represented in a way that can be further processed by other cognitive processes (such as reasoning or planning). In turn, the representation should be capable of accommodating not only the perceptual information (physical and non-physical) acquired by the robot, but also grounded symbolic
terms which describe and relate abstract conceptual knowledge. Therefore, if we consider that a cognitive robot has a "symbolic space" where different interrelated concepts are represented and manipulated, we implicitly require a solution to the grounding problem, defined as the problem of making intrinsic to artificial systems the symbols and interpretations they manipulate [15]. In applied robotics this problem is studied in the context of Perceptual Anchoring, defined as the process of establishing a connection between perceptual data and symbolic data which refer to one physical object. Since its first definition [6], anchoring has been augmented to support the integration of non-trivial symbolic systems which include common-sense knowledge such as Cyc [9], while operating between multiple perceptual agents [10]. One related instance of anchoring addresses the anchoring of discrete concepts using conceptual spaces, which give a geometric treatment of concepts and knowledge representation [5]. Here we present an initial effort to integrate knowledge coming from a non-physical environment, such as the web, into the anchoring process. Towards this integration we focus on the perceptual–symbolic correspondence so as to establish the link between perceptual and conceptual knowledge, as the ability to combine conceptual structure with material structure is considered to be a key cognitive strategy [16]. We introduce an augmentation of the anchoring framework which approaches the creation and maintenance of correspondences between symbols and perceptual data that refer to one particular concept² (instead of a physical object). Conceptual Anchoring is a hybrid cognitive model for representing and processing information at different levels of abstraction, using both sub-symbolic and symbolic components. Acquired data from the web are unified into dynamic computable representations that we name conceptual anchors. The conceptual anchors regard concepts that might not necessarily exist in the agent's physical environment, but which the agent eventually needs to process at some point during its operation. These structures can be thought of as "mental" representations or, after Harnad, as the combined iconic, categorical and symbolic representation of a concept. Through the systematic construction of the conceptual anchoring space we enable the agent to (a) express concepts in terms of their associated percepts and symbolic knowledge and (b) expand its physical perceptual knowledge by modeling and using a priori information, so as to learn new perceptual instances (physical objects) from conceptual information.

¹ http://www.semantic-robot-vision-challenge.org/.
² We use the term concept to refer to concrete and not abstract concepts, since abstract concepts have no physical referents while concrete concepts are available to the senses. Concrete concepts in turn can be either general terms referring to groups, or specific terms referring to individuals.
In the following, Sect. 2 presents the conceptual anchoring model in detail. Sections 3 to 6 go through the details behind the semantic knowledge base, the conceptual information acquisition component and the aspects of physical perception. In Sect. 7 we illustrate an extended example scenario where a cognitive agent uses the conceptual anchoring module to learn and recognize a novel complex concept using a priori knowledge and information from the web. Finally, in Sect. 8 we present our conclusions and summary.
2 Conceptual anchoring framework

Intelligent robots need a large amount of knowledge for a specific task, knowledge which is often hard to provide even for simple problems. Undoubtedly, there are similarities between the data (both perceptual and semantic) processed by robots and the data available on the web, such as formally expressed semantic knowledge or media resources such as images, video or sound. Quite often, these on-line resources are expressed in formats compatible with (or even identical to) those of the data our robots process (e.g. JPEG images or RDF/OWL files). For example, we can extract features from Google or Flickr images to produce visual percepts, using shape, color or texture features. In the context of anchoring, the percepts produced from on-line resources are used to construct conceptual perceptual signatures, which are then grounded to their corresponding semantic descriptions and tangible semantic knowledge, in a similar fashion as in perceptual anchoring [10,9]. The multi-modal data structure holding the perceptual data, semantic descriptions and a unique identifier about one conceptual object is called a conceptual anchor ca(t). Since conceptual percepts do not necessarily correspond to physical entities in the agent's environment, the conceptual anchor does not correspond to a specific physical object (perceptual instance), but rather to the generalized conceptual appearance of, and implicit knowledge about, the general concept. For instance, the conceptual anchor of the concept "Cup" links the multiple visual percepts of cup images and their features with the corresponding visually grounded concepts, such as the colors or shapes of the cups, and also with the tangible semantic knowledge about cups, as follows: Cup is a specialization of DrinkingVessel, inheriting all the implicit relations stating that it is also an Artifact, a SpatiallyDisjointObject, a Container, etc. Our hypothesis is based on the following argument: if (a) the perceptual anchor materializes a concept into a specific perceptual instance (physical object) in the agent's environment and (b) the conceptual anchor materializes the concept itself in the conceptual anchoring space, then the conceptual anchor of a particular concept should contain
– (in the best case) all the perceptual instances representing the concept, therefore including the perceptual instance of the actual physical object, or
– (in the worst case) at least the perceptual instance of the real object.

As an analogy with how a person would anchor an object and a concept, we would say that the perceptual anchor refers to the perceptual instance of "our favorite red cup in the kitchen", while the conceptual anchor of "cups" points to the collection of all the cups that we know of, for example when another person talks to us about cups. So, perceptual anchors are considered instances of conceptual anchors, and as such they inherit all the semantic content about the concept, complementing it with additional physically grounded properties (such as relative size, color, topological region, location, spatial relations, etc.) of the physical object. Since both conceptual and perceptual representations are expressed as anchors, the current anchoring functionalities are able to handle both representations in the same way with regard to matching. Conceptual Anchoring is an addition to the knowledge-based anchoring architecture [10] (Fig. 1) and mainly (a) provides an interface for acquiring perceptual and semantic knowledge from the web, while (b) maintaining the conceptual anchors in the anchoring space, which are used to store the conceptual models. We use the conceptual anchors, which can be thought of as "meta-anchors", as perceptual templates to guide the detection and learning of existing physical objects (perceptual instances) in the environment, without having previously trained the robot to recognize each instance individually.

Fig. 1 Overview of the conceptual anchoring framework
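To make the structure of a conceptual anchor more concrete, the following sketch outlines in Python how such a data structure could be organized. It is a minimal illustration under our own assumptions: the class and field names (ConceptualAnchor, percepts, semantic_description, instances) are invented for this example and are not the framework's actual implementation.

import time
import uuid

class ConceptualAnchor:
    """Sketch of a conceptual anchor ca(t): it bundles percepts extracted from
    web resources, grounded semantic descriptions from the KB, and links to the
    perceptual anchors (instances) of the concept."""

    def __init__(self, concept_symbol):
        self.uuid = str(uuid.uuid4())          # unique identifier in the anchoring space
        self.symbol = concept_symbol           # e.g. "Cup"
        self.percepts = {}                     # modality -> list of feature vectors (from web images)
        self.semantic_description = set()      # grounded concepts, e.g. {"DrinkingVessel", "Container"}
        self.instances = []                    # perceptual anchors of physical objects of this concept
        self.timestamp = time.time()

    def add_web_percept(self, modality, feature_vector):
        """Store a percept (e.g. color/shape/texture features of a Flickr image)."""
        self.percepts.setdefault(modality, []).append(feature_vector)
        self.timestamp = time.time()

# Hypothetical usage: a conceptual anchor for "Cup" built from web data.
cup = ConceptualAnchor("Cup")
cup.add_web_percept("vision", [0.12, 0.80, 0.33])   # e.g. a summarized color feature vector
cup.semantic_description.update({"DrinkingVessel", "Artifact", "Container"})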
2.1 Model

The main role of the proposed model is to systematically create and maintain in time the correspondences between symbols and perceptual data that refer to the same concept and, implicitly, the perceptual instances (individuals) of the same concept. Given a symbol system and a perceptual system, an anchor α(t), indexed by time, is the data structure that links the two systems. In our case we have one symbol system and two perceptual systems: the physical one and the one attached to the web. A perceptual anchor links the symbol system with the physical perceptual system and refers to one particular physical object (perceptual instance), while the conceptual anchor links the same symbol system with the perceptual system of the web and refers to one particular concept, representing a collection of perceptual anchors. The relations between concepts and perceptual instances are defined using a measure of similarity between concepts and instances, and between anchors in general, whether conceptual or perceptual. We use Near sets to model the measure of similarity between concepts and instances and between concepts themselves. Near sets is a framework introduced by Peters and Ramanna [24,25] for solving perception-based problems by defining the resemblance between sets. Near set theory is a generalization of rough sets and emerged from the idea that two or more rough sets can share objects with matching descriptions if they both contain objects belonging to the same class, and are therefore considered near. Nearness is defined as closely corresponding to, or resembling, an original, and near sets model measures of the nearness of sets based on similarities between classes contained in coverings of disjoint sets. In this context the term similarity means resemblance between two concepts, or between a concept and an instance, where almost equal patterns are found in the compared items.
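As a minimal illustration of the nearness idea, the sketch below checks whether two collections of objects are "near" in the sense of sharing objects with matching descriptions. It is a simplification we introduce for intuition only: descriptions are reduced to sets of symbolic features, and the helper names are hypothetical.

def descriptions(objects):
    """Map each object to its (simplified) symbolic description."""
    return {name: frozenset(feats) for name, feats in objects.items()}

def are_near(set_a, set_b):
    """True if some object in set_a has the same description as some object in set_b."""
    descr_a = set(descriptions(set_a).values())
    descr_b = set(descriptions(set_b).values())
    return len(descr_a & descr_b) > 0

# Hypothetical example: objects described by color and shape symbols.
kitchen = {"cup_1": ["red", "cylindrical"], "plate_1": ["white", "flat"]}
web_set = {"img_042": ["red", "cylindrical"], "img_043": ["blue", "conical"]}
print(are_near(kitchen, web_set))   # True: cup_1 and img_042 share a description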
2.1.1 Perceptual system

We have a universe which consists of a physical environment E_r and a conceptual environment E_c. In E_r there exist physical objects (perceptual instances) and in E_c concepts, respectively. For simplicity we will treat concepts and instances as their union and refer to them as objects in general, so we have a set of objects O defined as the non-empty finite set O = C ∪ I, where C = (c_1, c_2, . . . , c_c) are concepts and I = (i_1, i_2, . . . , i_i) are perceptual instances. Each object must have some quantifiable features (observable properties) that can be measured, so there exists a real-valued feature function f : O → R which represents some feature of an object. All feature functions belong to the set F of the available feature functions of the perceptual system. We consider a perceptual modality β to be a subset of F where all the feature functions belonging to β concern one particular perceptual modality. Examples of modalities are the visual, olfactory, spatial or temporal modalities, and modalities can be overlapping sets of feature functions from F. B is the set of all perceptual modalities of the perceptual system, modeled as B = (β_1, β_2, . . . , β_n) ⊆ F, where n is the total number of modalities. For instance, β_1 can be the modality of vision, which concerns the feature functions β_1 = (f_1, f_2, f_3) representing color, texture and shape features, respectively.

Perceptual System. We can then define a perceptual system ⟨O, B, F⟩, which can be of either physical or conceptual nature, and which consists of the non-empty finite sets of objects O, perceptual modalities B and available feature functions F. Each perceptual system continuously produces real-valued, structured collections of measurements (feature vectors) assumed to originate from the same object. These collections of measurements are called percepts and are denoted π_β(x). The percept π_β(x) of modality β is a partial observation of an object x belonging to O, so that π_β(x) = (π_β^{f_1}(x), π_β^{f_2}(x), . . . , π_β^{f_i}(x)), with i being the number of feature functions in β and each π_β^{f_i}(x) : f_i(x) → R. Percepts can be thought of as combined real-valued feature vectors over the collection of feature functions belonging to one particular modality.

Perceptual Signature. The collection of percepts from all the modalities of the perceptual system with respect to an object x is called the perceptual signature of x and is defined as π_B(x) = {π_{β_1}(x), π_{β_2}(x), . . . , π_{β_n}(x)}. Each π_β(x) corresponds to a percept, or combined feature vector, regarding a modality β ∈ B.

Perceptual Distance. The perceptual distance between the perceptual signatures of two objects x and y is defined as

d_π = ‖π_B(y) − π_B(x)‖ = √( ‖π_{β_1}(y) − π_{β_1}(x)‖² + ‖π_{β_2}(y) − π_{β_2}(x)‖² + · · · + ‖π_{β_n}(y) − π_{β_n}(x)‖² ),    (1)

where each ‖π_{β_n}(y) − π_{β_n}(x)‖² is defined using the l2 norm for calculating the distance between two anchors with respect to the same modality. Note that we are only able to compare percepts of the same modality, such as pairs of visual percepts or pairs of spatial percepts.
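To illustrate how percepts, signatures and the perceptual distance of Eq. (1) fit together computationally, the following sketch computes d_π for two objects over their shared modalities. It is a simplified illustration under our own assumptions: percepts are plain lists of floats and the function names are ours, not the framework's.

import math

def l2(u, v):
    """Euclidean (l2) distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def perceptual_distance(sig_x, sig_y):
    """Eq. (1) sketch: combine per-modality l2 distances between two perceptual signatures.
    Signatures are dicts mapping a modality name to its percept (feature vector);
    only modalities present in both signatures are compared."""
    shared = set(sig_x) & set(sig_y)
    return math.sqrt(sum(l2(sig_x[b], sig_y[b]) ** 2 for b in shared))

# Hypothetical signatures with a visual and a spatial modality.
sig_cup_image = {"vision": [0.9, 0.1, 0.4], "spatial": [1.0, 2.0]}
sig_cup_object = {"vision": [0.8, 0.2, 0.4], "spatial": [1.1, 2.1]}
print(perceptual_distance(sig_cup_image, sig_cup_object))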
where each (πβn (y) − πβn (x))2 is defined as the l2 norm for calculating the distance between two anchors regarding the same modality. Note here that we are only able to compare percepts with same modalities, such as pairs of visual percepts or pairs of spatial percepts. 2.1.2 Semantic system We define the semantic system as a knowledge representation and reasoning System (KR&R), which includes a knowledge base KB with inference capabilities. In the KB there exists a collection of abstract hierarchically structured knowledge, expressed in some form of formal logic, containing the set of concepts, instances, relations and rules between concepts mainly in the domain of commonsense knowledge (more details in Sect. 4). We define this knowledge hierarchy as an ontology . Semantic Grounding Relations We define as semantic grounding g, the process where every percept πβ (x) in the perceptual signature πB (x) is mapped to a set of concepts or instances ωβ (x) ∈ which is the corresponding knowledge fragment of the percept. So, gβ ⊆ × β × R. Semantic grounding relations are twofold processes. Initially, they ground the different percepts to their mapped symbols, while then they associate and validate the sets of symbols against the concepts, instances: direct relations and generalizations from the ontology that represent the modality and the grounded knowledge. The duality of the grounding relations (symbolic and semantic) is necessary to ensure that not only we ground the percept to the correct symbol, but also that this symbol corresponds to the correct concept in the knowledge base. In a situation where we are grounding the color from the visual modality of one object, in the first step of the color grounding relation, the HSV values map to the symbol red, but during the second phase the semantic grounding relation finds the exact instance (i.e. dark-red_3) in the knowledge base, that is a particular tone of red, which may also be the color of other anchors/instances in the KB. Semantic Descriptions We define the semantic description of an object x the collection of all semantically grounded concepts from . σB (x) = (ωβ1 (x), ωβ2 (x), . . . , ωβn (x)). Semantic distance The semantic distance between two semantic descriptions of two anchors x and y is defined as dσ = σB (y) − σB (x) = (ωβ1 (y) − ωβ1 (x)) + (ωβ2 (y) − ωβ2 (x)) (2) + · · · + (ωβn (y) − ωβn (x)) where each (ωβn (y) − ωβn (x))2 is defined as the semantic distance metric between the concepts of the two anchors with respect to the same modality. Semantic distance models how far are two concepts with respect to their semantic content. For example, this can be achieved by defining topological
Intel Serv Robotics
Fig. 2 Example of one anchor with the two components. The perceptual signatures on the left and the corresponding knowledge fragments (semantic descriptions) on the right. It is indexed by time and by a unique identifier
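The following sketch illustrates the two-step grounding of a color percept and the simplified 0/1 semantic distance described above. The toy ontology dictionary, symbol names and thresholds are invented for illustration and are not the knowledge base used in this work.

# Toy ontology: concept -> set of directly related concepts (via isa/genls).
ONTOLOGY = {
    "dark-yellow": {"yellow"},
    "yellow": {"color"},
    "red": {"color"},
}

def ground_color(hsv):
    """Step 1 of grounding: map raw HSV values to a color symbol (toy thresholds)."""
    hue = hsv[0]
    return "red" if hue < 30 or hue > 330 else "yellow"

def concept_distance(a, b):
    """Simplified semantic distance: 0 if equal or directly related, 1 otherwise."""
    if a == b or b in ONTOLOGY.get(a, set()) or a in ONTOLOGY.get(b, set()):
        return 0
    return 1

symbol = ground_color((15, 0.8, 0.7))            # -> "red"
print(symbol, concept_distance("dark-yellow", "yellow"), concept_distance("yellow", "red"))
# prints: red 0 1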
2.2 Anchoring

An anchor α(x, t), indexed by time, is the data structure that links one perceptual system ⟨O, B, F⟩ with the symbol system KB (Fig. 2). An anchor about an object x from O at time t can thus be defined as a partial function from time to triples:

α(x, t) = {uuid, π_B(x, t), σ_B(x)}.    (3)
So at every time step t, an anchor α(t) is the data structure that holds three elements: the perceptual signature π_B(x, t); the semantic description σ_B, i.e. the corresponding knowledge about the perceptual signature; and the unique identifier uuid meant to denote this anchor in the anchoring space, regarding one object from O that the agent is aware of. An anchor α is grounded at time t if it contains the percepts perceived at
t and the updated descriptions. If the object is not observable at t, and the anchor is therefore ungrounded, then no percept is stored in the anchor, but the semantic description still provides the best available estimate since the last observation. Anchoring is the process whereby anchors are produced and maintained in time for every object in O which the agent can perceive. The perceptual systems continuously produce features and percepts assumed to originate from the same object x. The perceptual signatures are grounded, during the semantic grounding step of each perceptual system, to the collection of hierarchically structured semantic knowledge so as to form the semantic descriptions of x. We can then describe the anchoring space A_space of an agent Ag as a multi-modal space onto which heterogeneous items of information are mapped and which represents the combined past and present perceptual and conceptual state of the agent, described in terms of the perceptual signatures and their semantic descriptions. We should mention here that grounded anchors represent currently perceived objects, while ungrounded anchors are not deleted from the anchoring space: since they are ungrounded they are considered past perceptual experience, and their content is therefore kept in the anchoring space as well as in the KB.
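A compressed sketch of the anchor maintenance just described is given below. The update policy and names are simplifications we introduce for illustration, not the actual implementation of the framework.

class Anchor:
    """Sketch of an anchor alpha(x, t): uuid + perceptual signature + semantic description."""
    def __init__(self, uid, signature, description, t):
        self.uuid = uid
        self.signature = signature        # dict: modality -> percept (or None when ungrounded)
        self.description = description    # grounded concepts, kept even when ungrounded
        self.last_seen = t
        self.grounded = signature is not None

def update_anchor(anchor, percept_signature, description, t):
    """Ground the anchor with fresh percepts, or mark it ungrounded but keep its knowledge."""
    if percept_signature is not None:
        anchor.signature = percept_signature
        anchor.description = description
        anchor.last_seen = t
        anchor.grounded = True
    else:
        # Object not observable at t: keep the last semantic description as best estimate.
        anchor.grounded = False
    return anchor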
2.3 Anchor perceptual indiscernibility relation

We have an intelligent agent Ag which has one or more perceptual systems P_Ag = (⟨O, B, F⟩_1, ⟨O, B, F⟩_2, . . .), a semantic system KB and an anchoring system A with the anchoring space A_space.
Let A be ⟨P_Ag, KB, A_space⟩ and let α(x), α(y) be two anchors from A_space regarding two objects x and y from O.

Anchor Indiscernibility Relation. The anchor indiscernibility relation for one modality β is defined as

∼_β = {(x, y) ∈ A_space × A_space : ‖α_β(y) − α_β(x)‖_2 = 0},    (4)

where ‖α_β(y) − α_β(x)‖_2 represents the distance between the two anchors. Since an anchor is a composite structure with two components which have separate distance measures, from (1), (2) and (3) we can obtain

∼_β = {(x, y) ∈ A_space × A_space : ‖d_π* + d_σ*‖_2 = 0},    (5)

where d* denotes that the corresponding distance may or may not be present, as for example in the case where we are matching against an anchor that does not have a perceptual signature part and we are therefore not able to compute the distance to the perceptual signature of another anchor. Using the anchor indiscernibility relation, objects with matching signatures and descriptions can be grouped to form classes called elementary sets, defined by

C/∼_β = {o ∈ O | o ∼_β c, ∀ c ∈ C/∼_β}.    (6)
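A small sketch of how the indiscernibility relation of Eq. (5) could be used to group anchors into elementary sets follows. Missing distance components (the starred d* terms) are simply skipped, and all names are illustrative assumptions rather than the framework's implementation.

def anchor_distance(a, b, perceptual_distance=None, semantic_distance=None):
    """Eq. (5) sketch: combine whichever distance components are available.
    A component is skipped (treated as absent) when one anchor lacks that part."""
    total = 0.0
    if perceptual_distance is not None:
        total += perceptual_distance(a, b)
    if semantic_distance is not None:
        total += semantic_distance(a, b)
    return total

def elementary_sets(anchors, distance):
    """Group anchors into classes of mutually indiscernible anchors (distance == 0)."""
    classes = []
    for a in anchors:
        for cls in classes:
            if all(distance(a, member) == 0 for member in cls):
                cls.append(a)
                break
        else:
            classes.append([a])
    return classes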
Anchor Tolerance Relation. Most of the time we want to facilitate the observation of associations in the anchoring space. We can introduce a tolerance value ε ∈ R so as to obtain the anchor tolerance relation ≅_{β,ε}, or simply

≅_β = {(x, y) ∈ A_space × A_space : ‖α(y) − α(x)‖_2 ≤ ε}.
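To close, a minimal sketch of the tolerance relation follows, assuming a distance function such as the anchor_distance helper from the previous sketch and an illustrative ε; anchors are associated when their distance stays within the tolerance instead of being exactly zero.

EPSILON = 0.1   # illustrative tolerance value, not a value used in the paper

def tolerant(a, b, distance, eps=EPSILON):
    """Anchor tolerance relation sketch: anchors x and y are associated if their
    combined distance is at most eps, rather than strictly indiscernible (== 0)."""
    return distance(a, b) <= eps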