Integrating Subsymbolic and Symbolic Processing in Artificial Vision

Authors: Edoardo Ardizzone, Antonio Chella, Marcello Frixione and Salvatore Gaglio.

Edoardo Ardizzone and Salvatore Gaglio are with: DIE - Dipartimento di Ingegneria Elettrica, Università di Palermo, Viale delle Scienze, 90128 Palermo, Italy. Tel.: +39.91.595735 Fax: +39.91.488452

Antonio Chella is with: DIE - Dipartimento di Ingegneria Elettrica, Università di Palermo, Palermo, Italy, and CRES - Centro per la Ricerca Elettronica in Sicilia, Monreale (Palermo), Italy

Marcello Frixione is with: Dipartimento di Filosofia, Università di Genova, Genova, Italy, and DIST - Dipartimento di Informatica, Sistemistica e Telematica, Università di Genova, Genova, Italy

Please address future correspondence to: Dr. Edoardo Ardizzone, Dipartimento di Ingegneria Elettrica, Università di Palermo, Viale delle Scienze, 90128 Palermo, Italy


Key Words: Artificial Vision, Symbolic Reasoning, Subsymbolic Processing, Connectionist Paradigms, Analog Models, Neural Networks.

Abstract
We address the integration of symbolic and subsymbolic processing within a hybrid model of visual perception, intended for an autonomous intelligent system. No hypotheses are made about the adequacy of this model as a model of human vision: the proposed model is currently under development for a robot system. We propose an associative mapping mechanism that relates the constructs of the symbolic representation to a geometric representation of the observed scene. The symbolic representation is expressed in terms of a formalism provided with compositional structure. The geometric representation is obtained by making use of a geometric modelling system based on superquadrics. We describe a possible realization of the mapping mechanism by means of a feed-forward neural network architecture based on the backpropagation rule, and present some results of a partial implementation.

1. Introduction
The design and realization of intelligent autonomous systems able to operate in unstructured environments is one of the most significant goals of applied artificial intelligence and of related fields, such as advanced robotics. The ability to interact autonomously with the external world requires an adequate development of perceptive capabilities and of internal representations of the environment, so as to support the drawing of inferences and the decision-making capabilities of the system. Of fundamental importance in this field are the development of external-world modelling techniques, the fusion of data from different sensory modalities and the treatment of incomplete and/or uncertain information. These problems require the integration of "high level" capabilities (reasoning, planning, inferential activities) with "low level" capabilities (perception, motor control). To this aim it appears promising to combine classic symbolic techniques of knowledge representation and reasoning with the subsymbolic computational techniques typical of connectionist models. Another appealing feature of such hybrid models lies in the possibility of facing in a "natural" way some high-level problems that are too hard to solve within an entirely symbolic paradigm (e.g. uncertainty and incomplete information, prototypical representation of concepts, reasoning with analog models). In this paper a hybrid model of visual perception is proposed, intended to provide object recognition capabilities to an autonomous intelligent system acting in an unstructured environment. The conceptual description of scenes and the inferential activities at the symbolic level are grounded in the perceptual mechanisms by means of a subsymbolic mapping device of a connectionist kind. Visual perception is modelled as a process in which information and knowledge are represented and processed at different levels of abstraction: from the lowest one, directly related to features of perceived images, to the highest levels, where the knowledge about the perceived objects is of a symbolic kind, through an analog representation level where the geometric features of the scene are explicitly treated. Here and in the rest of the paper, by the term analog representation we mean a representation in which information may be measured by processing continuous-valued entities, e.g. the geometric parameters of shapes. In previous papers [Gaglio, Spinelli et al. 1984, Ardizzone, Gaglio et al. 1989a], a general model of this kind has been proposed in which the above-mentioned levels coexist and interact. Previous implementations [Ardizzone, Palazzo et al. 1989b] showed a good behaviour in object recognition of simple scenes made up of chairs, tables and other pieces of furniture. In that model, the geometric representation of the perceived scene, built on 3-D information extracted from sensory data, is related to the conceptual representation in the framework of fuzzy logic [Zadeh 1988], to take into account the uncertainty, similarity and approximation that are typical of the reasoning capabilities of intelligent systems. However, this approach requires an explicit representation of fuzziness and therefore a precise definition of its parameters, which generally may be obtained only by somewhat arbitrary choices. A subsymbolic approach based on connectionist architectures [Smolensky 1988], like the one presented in this paper, may represent an attempt to overcome this problem, owing to properties of such architectures such as associativity, adaptive learning capabilities and intrinsic parallelism.
Moreover, the subsymbolic approach allows the integration of different knowledge representation levels via suitable mapping mechanisms, thus avoiding the need for an explicit control structure for the entire perceptual system, like those typical of artificial intelligence (e.g. blackboard structures, see [Hayes-Roth 1988]), which would imply a precise definition of the interaction mechanisms between levels. The proposed model must not be considered a model of human vision: no hypotheses are made about this point, and the model refers to an abstract autonomous intelligent system currently under development, in which other components are devoted to the reasoning activities necessary for planning actions, controlling input sensors, coordinating motor activities, and so on. It is therefore a task of these higher-level components to use the information acquired through the perceptual system to create expectations, or to form contexts in which the perceptual system's performance may be verified and, if necessary, modified by repeating the perceptual process once some relevant parameters have been adequately changed (e.g. the placement of sensors, the detail level of the 3-D information extracted from sensory data, the characteristics of the resulting geometric representation, etc.). In the following section, the general structure of the proposed model of visual perception is presented. In section 3, we discuss a specific associative mapping mechanism, constituting the subsymbolic level closest to the symbolic level, and in section 4 we describe a neural network architecture able to realize it. Finally, in section 5 we present some results of a partial implementation of this associative mechanism.

2. The global structure of the proposed visual perception model
FIG. 1 shows the proposed model for visual perception. Level A) consists of input data from sensors connected to the external world. Level B) consists of an analog representation of the perceived scene; this level is made up of two components. The first one (B1) consists of a 3-D volumetric reconstruction of the observed scene. The second one (B2) is the representation of the reconstructed scene in terms of a geometric model. Level C) is a symbolic representation of the considered domain (the perceived scene and the related general knowledge). ****** FIGURE 1 NEAR HERE ****** The introduction of intermediate levels permits us to overcome the difficulties arising when a direct mapping between the sensory inputs and the symbolic representation is attempted. Moreover, the analog intermediate level plays the role of a mental model, i.e. of an analog internal representation of the domain, which can be important in carrying out some types of inference [Johnson-Laird 1983]. The analog model allows us to specify the entities and the relations of a geometric nature that define the domain of the symbolic representation. In other words, this level may be considered as the simulation structure at the logic level, in the sense of the FOL system [Weyhrauch 1980]. The presence of two components (B1 and B2) at the analog level may be justified as follows. B1) consists of a volumetric reconstruction of the perceived scene, built starting from sensory data, for example from multiple views of the scene. It is therefore a discrete representation of the spatial occupancy of the objects present in the scene. This rich set of raw and unstructured data does not permit a straightforward interpretation, so a direct mapping between this level and the symbolic representation is hard to obtain. On the contrary, the geometric model of level B2) is an analytically coded representation of the same data: primitives of the adopted geometric modelling system are employed as they are, or modified by deformations and/or combined by Boolean operations, to best fit the various scene parts. In such a way a compression and grouping of the information present at the volumetric level is obtained. In the current version of the model, the sensory data of level A) are 2-D digital images (two-dimensional arrays of pixels) representing one or more views of the observed scene. The volumetric reconstruction at level B1) is a spatial-array (voxel) representation (see FIG. 2). At present, it is obtained from the data of level A) by applying classical volume intersection techniques to infinite generalized cylinders grown from the silhouettes of different scene views [Ardizzone, Gaglio et al. 1989a]. Some considerable drawbacks of this technique are (see also [Chien and Aggarwal 1986]): the possible lack of precision in the volumetric reconstruction, depending on the objects' characteristics; the necessity of knowing exactly the geometric relations among the different views; and the need for a great number of views in the case of very complex objects. A possible alternative lies in the application of different techniques (e.g. based on early vision algorithms [Marr 1982] such as shape from shading, stereo vision, optical flow, and so on) to the sensory data, in order to extract from them the relevant 3-D features of the objects present in the observed scene (e.g. by needle diagrams, boundary webs, etc.). 3-D information of this kind is not of a volumetric type, but nevertheless allows for the building of a geometric model of the scene [Horn 1986]. We are currently working on approaches of this kind, both applying conventional algorithms and trying high-parallelism architectures and neural networks [Grossberg 1988].
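As a toy illustration of the volume intersection idea (not the authors' implementation: just two orthographic views, a boolean voxel grid, and NumPy broadcasting):

```python
import numpy as np

def volume_intersection(silhouette_xy, silhouette_xz):
    """Carve a voxel grid from two orthographic silhouettes.

    Each silhouette is a 2-D boolean array; a voxel survives only if its
    projection falls inside every silhouette (classical volume
    intersection, here with just two views for brevity)."""
    nx, ny = silhouette_xy.shape
    _, nz = silhouette_xz.shape
    voxels = np.ones((nx, ny, nz), dtype=bool)
    # Extrude each silhouette into an infinite generalized cylinder
    # along its viewing direction, and intersect the cylinders.
    voxels &= silhouette_xy[:, :, None]   # cylinder along z
    voxels &= silhouette_xz[:, None, :]   # cylinder along y
    return voxels

# A 4x4x4 example: a 2x2 square in both views yields a 2x2x2 cube.
sil_xy = np.zeros((4, 4), dtype=bool); sil_xy[1:3, 1:3] = True
sil_xz = np.zeros((4, 4), dtype=bool); sil_xz[1:3, 1:3] = True
print(volume_intersection(sil_xy, sil_xz).sum())  # → 8
```

With only two views the carved volume badly overestimates concave objects, which is exactly the lack of precision noted above; adding views tightens the intersection.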
****** FIGURES 2 and 3 NEAR HERE ****** In this version of the model, B2) is built by utilizing a solid modelling system based on geometric primitives known as superquadrics (FIG. 3). Superquadrics are mathematical shapes based on the parametric form of quadric surfaces, in which each trigonometric function is raised to one of two real exponents [Barr 1981]. A superquadric is thus completely defined by its centre, its orientation, its "sides" (length, width and breadth) and the above-mentioned real exponents, ε1 and ε2, called form parameters, since a change in these exponents affects the shape of the superquadric's surface. Owing to the possibility of easily controlling their form, superquadrics have been proposed [Pentland 1986] as basic shapes (or primitives) suitable for representing the "building blocks" of real objects and scenes. A solid modelling system of the CSG (Constructive Solid Geometry) type [Requicha 1980], where complex objects are constructed in an iterative manner by Boolean combinations and deformations of superquadrics, has been proposed in [Ardizzone, Gaglio et al. 1989a]. The Boolean combinations are the classical operations of intersection, union and difference, regularized in order to assure model consistency. The deformation operations, purposely defined to increase system flexibility, are stretching, elongation, bending and twisting (for a more complete description of this topic, see [Ardizzone, Gaglio et al. 1989a]). In the geometric model at level B2), objects are therefore described by deformation operations and Boolean combinations of the superquadrics that most suitably approximate the shapes of the various elements. In summary, at this level the scene is represented by a set of superquadric parameters (see FIG. 4). ****** FIGURE 4 NEAR HERE ****** As regards the mapping between B1) and B2), the volumetric representation built at level B1) is segmented to obtain its main "parts". Each part is approximated by means of the best-fitting superquadric [Ardizzone, Gaglio et al. 1989a]. The approximation is carried out by applying a two-step algorithm to each part. First, the centre and the orientation of the principal axes of the part under consideration are calculated by determining the point and the unit vectors with respect to which all the products of inertia are zero. Once the centre and the principal axes are known, the computation of the sides (length a1, width a2 and breadth a3) of the part, which coincide with those of the approximating superquadric, is immediate. Then, the form parameters ε1 and ε2 that best match the squareness of the object part are obtained by minimizing a function relating the squareness features of the part to those of the primitive.
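The second step can be sketched as follows; this is a minimal illustration, not the authors' code. It assumes the centre and axes have already been recovered (step one), uses Barr's inside-outside function for an axis-aligned superquadric, and fits ε1 and ε2 by the same kind of random search the paper mentions, with an error that simply measures how far sampled surface points fall from the F = 1 surface:

```python
import random

def inside_outside(p, sides, e1, e2):
    """Superquadric inside-outside function (Barr's form): < 1 inside,
    1 on the surface, > 1 outside.  `sides` = (a1, a2, a3); centre at
    the origin, axes aligned with the reference frame, as after the
    inertia-based first step."""
    x, y, z = p
    a1, a2, a3 = sides
    return ((abs(x / a1) ** (2 / e2) + abs(y / a2) ** (2 / e2)) ** (e2 / e1)
            + abs(z / a3) ** (2 / e1))

def fit_form_parameters(points, sides, trials=2000, seed=0):
    """Random search for (eps1, eps2) minimising the squared deviation
    of F from 1 over sampled surface points: a toy stand-in for the
    paper's random optimisation step."""
    rng = random.Random(seed)
    best = (None, None, float("inf"))
    for _ in range(trials):
        e1, e2 = rng.uniform(0.1, 2.0), rng.uniform(0.1, 2.0)
        err = sum((inside_outside(p, sides, e1, e2) - 1.0) ** 2 for p in points)
        if err < best[2]:
            best = (e1, e2, err)
    return best

# Points on a unit sphere, whose exact fit is eps1 = eps2 = 1.
r = 3 ** -0.5
sphere_pts = [(r, r, r), (0.6, 0.8, 0.0), (0.0, 0.6, 0.8), (0.8, 0.0, 0.6)]
e1, e2, err = fit_form_parameters(sphere_pts, (1.0, 1.0, 1.0))
print(f"eps1={e1:.2f} eps2={e2:.2f} err={err:.4f}")
```

Random search is crude but, as the paper notes for its own error function, makes no continuity assumption about the objective.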
Since the centre, orientation and sides of the primitive are now known quantities, this function depends only on the form parameters ε1 and ε2, and has a minimum corresponding to the values of ε1 and ε2 that define the superquadric best fitting the given part. This function cannot in general be considered continuous, and therefore a random optimization method is currently used [Ardizzone, Palazzo et al. 1989b]. It has to be stressed that the best-fitting procedure works in this case on pieces of volume whose position, orientation and size are known from the volumetric reconstruction and the subsequent segmentation. The approximation of each part therefore requires an optimization procedure working in a two-parameter space (ε1 and ε2). At level C), knowledge about the scene is represented in terms of a symbolic formalism with compositional structure. The level of symbolic representation includes a component for the structured definition of the concepts (terminological component), and a set of individual constants denoting the objects in the scene (assertional component). From a logical point of view, the concepts and the structures of the terminological component define the predicates describing the assertional constants. For the terminological component, a semantic network representation system of the SI-Net (Structured Inheritance Semantic Network) type has been adopted. SI-Nets are a class of semantic net formalisms, whose best known examples are the KL-ONE [Brachman and Schmoltze 1985] and KRYPTON [Brachman, Pigman Gilbert et al. 1985] systems. The elements making up an SI-Net are the concepts. Each concept is represented by a node and its related structures, i.e. the arcs linking it to the other constructs of the semantic network. The nodes are organized in a hierarchical taxonomy, such that less general concepts inherit the characteristics of the more general concepts higher up in the taxonomy (principle of structured inheritance). Concepts are characterized by expressing their relations with the other concepts in the net by means of links called generic roles. A generic role is a construct that expresses potential relations between the instances of the concepts it connects; roles are analogous to the slots of a frame. The concept in which a role terminates is a restriction on the type of the role's fillers. From a logical viewpoint, generic concepts correspond to one-argument predicates, and roles to two-argument predicates. In a KL-ONE net the taxonomy is organized by means of subsumption (or superconcept, or "is a") links; each concept inherits all the roles and related structures from its immediate superconcepts. The mapping between B) and C) individuates and names the objects present in the scene, and classifies them in terms of symbolic-level concepts. Such a mapping is implemented via an associative mechanism based on a neural network architecture.
As a result, both the individuation and the classification mechanisms (attribution of an instance to a class, or identification of a relationship between instances) are implemented in an adaptive manner, via a learning process based on examples. This specific part of the model will be analyzed in greater detail in the rest of the paper.
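The terminological machinery described above (concepts, subsumption links, roles with value restrictions, structured inheritance) can be illustrated with a toy structure; the classes and names here are ours, not the paper's network:

```python
class Concept:
    """Toy KL-ONE-style concept node: a name, a set of immediate
    superconcepts, and locally defined generic roles (role name ->
    value restriction, i.e. the concept the role's fillers must
    belong to)."""
    def __init__(self, name, supers=(), roles=None):
        self.name = name
        self.supers = list(supers)
        self.local_roles = roles or {}

    def roles(self):
        # Structured inheritance: collect roles from all superconcepts,
        # letting local definitions override inherited ones.
        inherited = {}
        for s in self.supers:
            inherited.update(s.roles())
        inherited.update(self.local_roles)
        return inherited

    def subsumed_by(self, other):
        # "is a" test via the superconcept links.
        return self is other or any(s.subsumed_by(other) for s in self.supers)

furniture = Concept("Furniture")
leg       = Concept("Leg", supers=[Concept("Part")])
chair     = Concept("Chair", supers=[furniture], roles={"leg": leg})
armchair  = Concept("Armchair", supers=[chair])

print(armchair.subsumed_by(furniture))   # → True
print(armchair.roles()["leg"].name)      # → Leg
```

Note that Armchair never declares a leg role: it inherits it from Chair, which is the structured-inheritance principle in miniature.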

3. The mapping between the geometric model and the symbolic representation
The aim of this part of the model is the identification and classification of the objects present in a scene described at the geometric level by a set of relevant parameters. If one considers the large variety of real-world objects that correspond to a concept, it becomes immediately clear that this is not simply a problem of attributing a group of parameters to a class. It is rather a question of providing a quantitative measure of the similarity of a given object to the prototype defining a class at the conceptual level.

A prototype may be intended as the most representative member of a certain class or category. Psychological experiments [Rosch et al. 1976] show that categorisations of the real world are not arbitrary but highly determined: in taxonomies of real objects, basic categories carry the most information and are thus the most differentiated from one another. Basic categories are also the preferred level of reference (see also [Rosch 1978]): children learn them first, and they are recognized faster. In Rosch's experiments [Rosch et al. 1976] on taxonomies of common concrete nouns in English, for example, basic objects are the most inclusive categories whose members present many common attributes, have similar shapes and may be identified from the average shapes of the members of the class. However, the information about an object may be somewhat limited, as in the case of an object occluded by another object, or when the information about the scene is heavily affected by noise due to the characteristics of the scene, the type of data acquisition, and so on. Severe problems of approximation, uncertainty and similarity therefore have to be taken into account when applying categorisation to computer vision models, in order to give a robot system the capacity of interacting with objects in the real world. The relationships between objects and conceptual models can be investigated in the framework of fuzzy logic [Zadeh 1988], attributing a geometric meaning to concepts through the definition of suitable membership functions. It has been shown in [Ardizzone, Gaglio et al. 1989a] that it is possible to associate to a concept a prototype analytically defined by the parameters characterizing the geometric primitives making it up. A quantitative measure of the similarity between the geometric representation of a real-world object and the corresponding prototype may then be expressed in terms of a membership function.
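As a concrete, entirely illustrative example of such a membership function, one could score a measured superquadric parameter vector against a class prototype with a Gaussian-style function; the prototype values and widths below are invented for illustration, which is precisely the kind of arbitrary choice discussed next:

```python
import math

def membership(params, prototype, widths):
    """Gaussian-style membership: how closely a measured parameter
    vector matches a class prototype. Returns 1.0 for an exact match
    and decays towards 0 with distance; `widths` encodes the (here
    hand-picked) tolerance per parameter."""
    d2 = sum(((p - q) / w) ** 2 for p, q, w in zip(params, prototype, widths))
    return math.exp(-d2 / 2)

# Hypothetical prototype "table board": a flat box with sides roughly
# (0.6, 0.4, 0.02) and nearly cubic form parameters (eps1 = eps2 = 0.1).
proto  = (0.6, 0.4, 0.02, 0.1, 0.1)
widths = (0.2, 0.2, 0.01, 0.1, 0.1)
print(round(membership((0.6, 0.4, 0.02, 0.1, 0.1), proto, widths), 3))  # → 1.0
print(membership((0.2, 0.2, 0.3, 1.0, 1.0), proto, widths) < 0.01)      # → True
```

Every number in `proto` and `widths` is an a priori design decision, which is exactly the weakness of the explicit-fuzziness approach that the connectionist alternative below is meant to avoid.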
This approach, however, requires a precise analytic definition of the concepts intended as the prototypes defining the classes. Furthermore, an explicit representation of fuzziness is necessary, i.e. a precise definition of the membership functions, whose shape, being dependent upon the specific perceptive system, is in general difficult to determine, or is the result of arbitrary choices of the uncertainty parameters. An associative mechanism of a connectionist type can be used to face the problem in a more satisfactory way. First, it makes it possible to avoid exhaustive descriptions of the prototypes at the symbolic level: it is the associative mechanism itself that takes charge of the prototype representation, learned during a training phase based on examples. Second, the measure of similarity between the prototype and a real object (which, when applying fuzzy logic concepts, must be defined a priori and explicitly by means of a membership function) is implicit in the behaviour of the network and is determined during the learning phases. The adequacy of the behaviour of the network can be directly verified on the basis of its performance, and possible adjustments may be introduced during further learning phases. ****** FIGURE 5 NEAR HERE ****** The general scheme of the mapping between the geometric model and the symbolic representation is shown in FIG. 5. The associative mechanism is realized via a connectionist architecture. The neural network interacts, via its input/output units, both with the symbolic level (terminological and assertional components) and with the geometric model. At the geometric level, the external units are connected to the parameters defining the geometric primitives. At the symbolic level, the external units are connected to the nodes of the semantic network representing the terminological knowledge; similarly, other external units are connected to the symbolic constants of the assertional component making up the names of the objects in the scene. ****** FIGURE 6 NEAR HERE ****** An example of a simple semantic network is shown in FIG. 6, with reference to a scene where objects such as tables, chairs, balls or their constituent components are present. Tables and chairs are combinations of more than one superquadric; simple objects (e.g. a ball) may be represented by single superquadrics. The mechanism of selection of attention allows the associative mechanism to concentrate on single primitives or groups of primitives in the scene. In FIG. 5, A) includes the units related to the geometric representation of the scene; B) includes the neural network units corresponding to the constants defining the objects in the scene; C) comprises the neural network units corresponding to the conceptual description present in the semantic network (units in C) correspond to concepts and roles in the semantic network); finally, D) includes the units implementing the selection-of-attention mechanism.
Each D) unit maps a primitive representing a portion of the scene; the activation of a set of these units realizes the selection of attention on the matching primitives in A). When some D) units are activated, selecting the attention on the corresponding primitives, the associative mechanism attempts the identification and classification of the corresponding objects. If the identification is successful, the units mapping the object names in B) are activated; simultaneously, the units in C) mapping the conceptual description that classifies the selected portion of the scene are activated. Furthermore, it is possible to use the units in B) and C) as input units: e.g., activating a concept or an assertional constant, the associative mechanism will attempt to identify the corresponding object(s) by selecting the attention on the corresponding superquadrics. Analogously, the activation of the units corresponding to a role will cause the activation of the pairs of constants in B) and of the corresponding superquadrics in A). The external activation of terminological and assertional units can be interpreted as "suggestions" given to the network about the presence in the scene of an object of a certain class, or having a certain name. When the associative mechanism is put into action, the network tries to provide a consistent interpretation of all the inputs, taking into account the received suggestions. As mentioned above, the interesting features of this paradigm are that the prototypes of the objects are acquired by the network in an adaptive manner during the learning phase (no precise external description is needed) and that the fuzziness need not be explicitly predefined, but emerges in a natural way from the associative model. The system does not infer via explicitly defined membership functions, but via the activation distribution of the network units: the activation level of the external units interacting with the symbolic level represents, in fact, a parametrization of the certainty of the classification and naming.

4. The network architecture
The neural network architecture chosen to implement the general scheme of FIG. 5 is an auto-associative architecture [Rumelhart and McClelland 1986]. An auto-associator can recognize an input pattern even if it is incomplete, incorrect or noisy, and produce the same pattern, denoised and completed as well as possible, as output. During the learning phase, therefore, the network must modify the weights of the connections among units according to an error-correcting rule, in order to operate in the desired way during the working phase. In particular, during the learning phase the network creates, in an adaptive manner, an internal coding of the patterns constituting the training set. In the working phase, when an input pattern is shown to the neural network, the network generates the output pattern according to this internal coding. The network is therefore capable of generalizing the acquired knowledge: it has the ability to recognize a known pattern even if degraded by noise or by missing portions, and to process a new pattern by interpolating on the basis of the known patterns. An architecture of this kind has been preferred to the more common hetero-associative memories because the scheme of FIG. 5 does not exhibit a distinction between input and output: it must be possible to freely clamp the state of some units and to let the network suitably complete the activation pattern, on the basis of the knowledge acquired during the learning phase. The associative mechanism must therefore be able, owing to its own internal model, to face possible discrepancies or deficiencies of information in the input pattern. As far as the implementation of two-argument relations (in our case, the roles of the semantic net) is concerned, since it is necessary to discriminate between the first and the second argument of a relation, the units related to the selection of attention, to the generic concepts and to the assertional constants have been duplicated. With reference to the simple scenes treated in the previous section, if the superquadrics corresponding to a chair are activated as the first argument of a relation, and simultaneously the role corresponding to the relation back is activated, the associative mechanism must activate, via the related selection-of-attention units, the second argument of the same relation, i.e. the superquadrics representing the chair back. Good candidates for the neural network implementation of the auto-associative memory are Boltzmann machines and the backpropagation architecture. The Boltzmann machine architecture seems more appropriate for auto-associative tasks, because the architecture by itself makes no distinction between input and output units. The well-known drawback of Boltzmann machines, however, is the slowness of the learning process, which makes their software simulation on sequential machines prohibitive even for a modest number of connections. Research is currently devoted to the development of simpler and quicker learning algorithms [Galland and Hinton 1989] and to the hardware implementation of such architectures. For these reasons, we chose to direct our efforts towards a feed-forward architecture based on the backpropagation rule. ****** FIGURE 7 NEAR HERE ****** The general structure of this network is shown in FIG. 7.
The network is made up of two identical layers of external units (one for the input and one for the output) and one layer of hidden units. Each unit receives as input the outputs of all (and only) the units of the previous layer, and feeds its output to all (and only) the units of the following layer. The external units are subdivided into two groups, representing the geometric level and the symbolic level respectively. Units of the first group are in turn subdivided into two blocks: the first block is made up of clusters of units, each cluster representing the geometric parameters of one superquadric; the second block is made up of units allowing the selection of attention, each unit of this block selecting the attention on one superquadric.

Units of the second group are likewise subdivided into two blocks: units of the first block are related to the assertional constants (A-Box), while units of the second block are related to the generic concepts (T-Box). As shown in FIG. 7, the use of a feed-forward network to implement an auto-associative memory requires the duplication of the external units. This is justified by the sharp distinction existing between the input units and the output units of such architectures. The duplication of the external units makes the architecture heavier, but allows the learning phase to be simple and efficient. The learning phase is based on a parallel implementation of the well-known gradient descent technique [Rumelhart, Hinton et al. 1986]: the weights of the network connections are adjusted so as to minimise a smooth error function, defined as the summed squared differences between the actual and the desired network outputs. In order to explore the characteristics of the associative mechanisms between the geometric and conceptual representations, a simple feed-forward network based on backpropagation has been implemented, as a preliminary step towards the complete realization of the architecture described above. The implementation has been developed on a SUN3/Unix workstation as a C program, making use of callable routines of the Rochester Connectionist Simulator V4.2 [Goddard, Lynne et al. 1989].
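A minimal sketch of such a backprop-trained auto-associator (our simplification: tiny abstract binary patterns, one hidden layer, plain batch gradient descent on the summed squared error, no momentum term) showing pattern completion from a degraded input:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Training patterns: each row is one external activation pattern
# (standing in for geometric parameters plus symbolic units).
patterns = np.array([[1, 1, 0, 0, 1, 0],
                     [0, 0, 1, 1, 0, 1]], dtype=float)

n_in, n_hid = patterns.shape[1], 3
W1 = rng.normal(0, 0.5, (n_in, n_hid))
W2 = rng.normal(0, 0.5, (n_hid, n_in))

for _ in range(5000):                 # gradient descent on SSE
    h = sigmoid(patterns @ W1)
    out = sigmoid(h @ W2)
    err = out - patterns
    d_out = err * out * (1 - out)     # backpropagated deltas
    d_hid = (d_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * (h.T @ d_out)
    W1 -= 0.5 * (patterns.T @ d_hid)

# Completion: clamp a degraded version of pattern 0 on the input and
# read the reconstructed pattern off the output layer.
degraded = np.array([1, 1, 0, 0, 0, 0], dtype=float)   # one unit lost
restored = sigmoid(sigmoid(degraded @ W1) @ W2)
print(np.round(restored).astype(int))
```

The duplication of external units is visible here: `patterns` feeds the input copy, while the targets are the same vectors on the distinct output copy, exactly the arrangement FIG. 7 describes.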

5. A partial implementation of the mapping mechanism
FIG. 8 shows a schematic representation of the neural network architecture implementing a simplified version of the geometric/symbolic mapping. Layer A) is made up of the input units receiving the information coming from the geometric level (i.e. the superquadric parameters). Each cluster a_1, a_2, . . ., a_n represents a block of input units coding the parameters of a single superquadric, together with the related unit for the selection of attention (s_1, . . . , s_n, shown in gray in FIG. 8). Besides the selection-of-attention unit, each cluster a_1, . . . , a_n is made up of eight units: three representing the coordinates of the centre of the superquadric, three representing the lengths of its sides (the principal axes are assumed parallel to the axes of the general reference system), and two representing the form parameters ε1 and ε2. We assume that the sequence of clusters a_1, . . . , a_n is determined by the distance of the corresponding superquadrics from the origin of the reference system: cluster a_1 corresponds to the superquadric nearest to the origin, and so on. Units s_1, ..., s_n in layer A) implement the mechanism of selection of attention (see sect. 4). These units allow the selection of the one or more superquadrics to be considered when the analysis is limited to a portion of the scene.
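The layer-A coding just described can be sketched as follows; the slot count and the zero-padding of empty clusters are our assumptions, not the paper's:

```python
import math

def encode_scene(superquadrics, attended, n_slots=4):
    """Build the layer-A activation vector: clusters of eight units per
    superquadric (centre x, y, z; sides a1, a2, a3; form parameters
    eps1, eps2), ordered by distance of the centre from the origin,
    followed by one selection-of-attention unit per slot.  `attended`
    holds slot indices (after ordering) to focus on."""
    ordered = sorted(superquadrics, key=lambda q: math.dist(q[:3], (0, 0, 0)))
    vec = []
    for i in range(n_slots):
        # Empty slots are padded with zeros (our assumption).
        vec.extend(ordered[i] if i < len(ordered) else (0.0,) * 8)
    vec.extend(1.0 if i in attended else 0.0 for i in range(n_slots))
    return vec

# Two superquadrics, a far one and a near one; attention on slot 0.
far  = (2.0, 2.0, 0.0, 0.5, 0.5, 0.5, 1.0, 1.0)
near = (0.1, 0.0, 0.0, 1.0, 1.0, 1.0, 0.1, 0.1)
v = encode_scene([far, near], attended={0})
print(len(v))   # → 36  (4 slots * 8 units + 4 attention units)
print(v[0])     # → 0.1 (the nearest superquadric fills cluster a_1)
```

The distance-based ordering gives the network a canonical input arrangement, so the same scene always produces the same activation vector regardless of the order in which the superquadrics were fitted.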


****** FIGURE 8 NEAR HERE ****** Layer B) includes the output units of the neural network. In the current implementation the coding is local, i.e. these units directly match the nodes of the semantic network at the symbolic level. The output units make available their activation level, which may be considered a measure of the certainty of the recognition of the object. This avoids the external attribution of uncertainty parameters, as happens when using fuzzy logic. At present, the generic concepts are connected to the associative mapping mechanism, while a complete interface between the neural network and the roles is still missing: part_of is the only relation the mapping device is able to recognize. This is provisionally obtained via a mechanism whose generalization to other kinds of 2-argument predicates is currently in progress. During the learning phase, the network is presented with the superquadric parameters corresponding to different objects, such as chairs, tables, etc., and simultaneously the corresponding units at the terminological level are activated. During this phase the network adaptively adjusts the weights of its internal links according to the standard backpropagation rule with the momentum term. For the sake of simplicity, the classes of objects (chairs, tables, stools, etc.) chosen for the training are homogeneous and relatively close to the prototype (chairs with four legs, tables with a rectangular board, and so on). Under these hypotheses, the experiments showed that very acceptable results may be obtained using a network without hidden units. This simple kind of network, in spite of its well-known computational limits, gives good results on the simple chosen training set, while more complex networks give no better outcomes.
This is probably due to the following facts: i) superquadrics represent geometrical entities with continuity, in the sense that small changes in the represented objects correspond to small changes in their parameters; ii) the training set is homogeneous. Together, i) and ii) imply that such classes are linearly separable [Minsky and Papert 1969]. Point i) is a further confirmation of our choice of superquadrics as representation primitives (in a voxel representation, for example, small changes in the object shape or position could imply strong changes in the geometric model). Point ii) is in fact a simplification adopted as a provisional assumption: the choice of more complex training sets would imply that the classes of objects are no longer linearly separable, thus requiring one or more layers of hidden units.

In order to carry out experiments with the architecture, a large test set made up of 125 patterns has been built; each pattern is a set of superquadrics representing a piece of furniture (chairs, tables and stools), variously positioned and scaled. The training sets for the networks have been built as follows. First, a basic set of a few individual objects is built, each object being described as a boolean union of superquadrics. Second, this set is greatly enlarged by including translated, scaled, noise-corrupted and attention-selected versions of the objects, making use of a purposely written Pattern Description Language (PDL) based on Unix LEX and YACC. FIG. 9 shows an example table; the related superquadric parameters are reported in tab. I.

****** FIG. 9 NEAR HERE ******

****** TAB. I NEAR HERE ******

Six tests have been carried out; during each test the network has been trained with a training set made up of a number of patterns randomly chosen within the test set. The learning time for each test has been fixed to 1000 epochs, with intermediate checks performed after 10, 50, 100, 200, 300 and 500 epochs. The tests have been carried out by presenting the whole test set to the network and calculating the APSS (Average Pattern Sum of Squared errors). The APSS is the sum over the output units of the squared differences between output and teach values, divided by 2, and averaged over the test set:

APSS = (1 / (2 * Npat)) Σ_p Σ_j (t_jp - o_jp)^2

where Npat is the number of patterns belonging to the test set, and t_jp and o_jp are respectively the teach value and the output value of unit j when pattern p is presented to the network; the first sum is over the patterns belonging to the test set and the second sum is over the set of output units.

****** TAB. II NEAR HERE ******

The sizes of the six training sets, along with the corresponding APSS after 1000 training epochs, are reported in tab. II. It should be noted that the best result is obtained on training set #2, using only 20% of the test set. When the training set is larger, the phenomenon of overlearning becomes evident: the network becomes too specialized on the learned patterns, showing poor generalization on the unknown ones. Note also that no generalization is performed after learning on training set #6, which is the same as the test set: this explains its low APSS value.

****** FIG. 10 AND 11 NEAR HERE ******

FIG. 10 shows the behaviour of the network during the training phase when training set #2, made up of 25 patterns, is presented. It is worth noting that after about 400 training epochs with high APSS values, the network approaches convergence. The choice of the number of training epochs is critical: it has been found that a smaller number does not allow the network to generalize, while a larger number causes overlearning problems [Tesauro and Sejnowski 1989]. FIG. 11 shows the behaviour of the network during the training phase when training set #4, made up of 75 patterns, is presented. After about 200 training epochs the APSS reaches its minimum value. Carrying on the training phase, the APSS tends to rise, showing the typical overlearning behaviour.

****** FIG. 12 AND 13 NEAR HERE ******

FIG. 12 shows a screen output of the simulator; each unit is represented by a circle whose size is proportional to the unit activation. White circles stand for positive activations, while black circles stand for null activations. The input units and the selection of attention units are shown at the bottom of the figure: the first eight units of each row correspond to the superquadric parameters, while the last unit is the selection of attention unit. The output units are shown at the top of the figure. In particular, this figure shows the activation of the output units corresponding to an input pattern made up of six superquadrics describing a table; it should be noted that the most active output unit is, as expected, the one related to the class of tables. When the number of superquadrics describing the input object is smaller than the maximum (7 in the current implementation), the remaining units are fed a random input, as in the last row of FIG. 12.
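The APSS measure defined above can be computed with a few lines of code; this is a sketch of the formula only (the function and variable names are ours):

```python
def apss(targets, outputs):
    """Average Pattern Sum of Squared errors over a test set.

    targets, outputs : lists of per-pattern lists, one value per output unit.
    Implements APSS = (1 / (2 * Npat)) * sum_p sum_j (t_jp - o_jp)^2.
    """
    npat = len(targets)
    total = sum((t - o) ** 2
                for tp, op in zip(targets, outputs)
                for t, o in zip(tp, op))
    return total / (2.0 * npat)

# Two patterns, two output units each:
print(apss([[1.0, 0.0], [0.0, 1.0]],
           [[0.8, 0.1], [0.2, 0.7]]))
```

A perfect network yields APSS = 0; the tabulated values of tab. II are on the order of a few hundredths, reported there multiplied by 1000.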
Another interesting result is related to the nature of the adopted training set, in which stools and tables differ only by a scale factor. When the pattern of a stool whose size continuously increases is presented to the network, the activation of the output unit corresponding to the stool decreases, while simultaneously the activation of the output unit corresponding to the table increases. The mechanism of selection of attention, as previously described, allows the attention to be selected in order to classify a particular group of superquadrics. As an example, FIG. 13 shows the activation pattern when an example representing a chair is shown to the network. The attention unit related to the superquadric of one of the legs is simultaneously activated by the user (in FIG. 13, the second attention unit from the top). Note that the most active output unit is, as expected, the one representing the class of legs.
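Operationally, selecting the attention amounts to raising the s unit of the chosen cluster in an already-encoded input vector. The nine-units-per-cluster layout below follows FIG. 8; the function name and vector handling are our own illustrative assumptions:

```python
CLUSTER_SIZE = 9  # eight superquadric parameters plus one attention unit

def select_attention(input_vec, cluster_index):
    """Return a copy of the input vector with the attention unit s_i set.

    The attention unit is the last unit of each 9-unit cluster; raising it
    asks the network to classify the corresponding superquadric (e.g. a leg).
    """
    out = list(input_vec)
    out[cluster_index * CLUSTER_SIZE + CLUSTER_SIZE - 1] = 1.0
    return out

# A scene encoded as 7 clusters, then attending to the second superquadric:
vec = [0.0] * (7 * CLUSTER_SIZE)
sel = select_attention(vec, 1)
```

Feeding `sel` instead of `vec` to the network reproduces the situation of FIG. 13, where the output activations describe the attended part rather than the whole object.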

6. Conclusions

Integrating "high level" inferential activities with "low level" perceptual capabilities is a crucial, unsolved topic in the design and realization of intelligent autonomous systems. Hybrid models in which symbolic and subsymbolic paradigms coexist appear to be promising, since they offer a more "natural" way to face such problems as uncertainty, incomplete information, prototype representation, analog modelling, and so on. A subsymbolic mapping mechanism of a connectionist kind, intended for a hybrid model of visual perception, has been proposed in this paper, and a possible implementation as a feed-forward neural architecture has been presented. The proposed model allows for the classification and naming of objects present in a scene. At present, the model lacks an interaction between the symbolic inferences and the recognition of objects and relations in the scene. The mapping mechanism, as it has been realized, in fact incorporates a kind of "radical Gestaltian assumption": whether an object is an instance of a concept connected to the associative mechanism is decided exclusively on the basis of the global shape of the object itself. The relations of the object with its parts and with other objects in the scene play no role in its classification. For example, the classification of an object as a whole cannot be affected by the fact that its subparts can in turn be recognized by the associative mechanism, even if this piece of information is relevant on the basis of the terminological knowledge in the semantic net. This heavy restriction has been accepted only as a temporary assumption, and will be discarded in future developments. Future versions of the model will in fact allow for a greater interaction between the symbolic level and the associative mechanism, enabling the information related to roles and relations to take part in the classification of objects represented at the geometric level.
Furthermore, this kind of development will give the model the capability to form contexts and expectations that can properly guide and tune the processing of the underlying subsymbolic levels, acting as a sort of top-down feedback. For example, the hypotheses generated by the geometric/symbolic mapping would involve a revision of the geometric segmentation generated by the lower-level mapping mechanism, so modifying the superquadric parameters that represent the inputs to the geometric/symbolic mapping mechanism itself.

Acknowledgements The authors would like to thank Dr. Giuseppe Spinelli for his very useful comments and suggestions, and Mr. Francesco Callari for his implementation work on the described neural architecture. This work has been partially supported by MURST 40% Special Project M.I.R.A. (Metodologie Informatiche per la Robotica Avanzata).

References

Ackley, D.H., Hinton, G.E., Sejnowski, T.J. 1985 A learning algorithm for Boltzmann Machines, Cognitive Science, 9, 147-169.

Ardizzone, E., Gaglio, S., Sorbello, F. 1989 Geometric and Conceptual Knowledge Representation within a Generative Model of Visual Perception, Journal of Intelligent and Robotic Systems, 2, 381-409.

Ardizzone, E., Palazzo, M.A., Sorbello, F. 1989b Computer reconstruction and description of 3-D objects, Proc. of the Seventeenth IASTED Int. Symp. on Simulation and Modelling, Lugano, Switzerland, June 19-22, 1989, 139-145.

Barr, A.H. 1981 Superquadrics and Angle-Preserving Transformations, IEEE Computer Graphics and Applications, 1, 11-23.

Brachman, R.J., Pigman Gilbert, V., Levesque, H. 1985 An essential hybrid reasoning system: knowledge and symbol level accounts of KRYPTON, Proc. of the Ninth IJCAI, Los Angeles, USA, 532-539.

Brachman, R.J., Schmolze, J. 1985 An overview of the KL-ONE knowledge representation system, Cognitive Science, 9, 171-216.

Chien, C.H., Aggarwal, J.K. 1986 Identification of 3D objects from multiple silhouettes using quadtrees/octrees, Computer Vision, Graphics and Image Processing, 36, 256-273.

Gaglio, S., Spinelli, G., Tagliasco, V. 1984 Visual Perception: an Outline of a Generative Theory of Information Flow Organization, Theoretical Linguistics, 11, 21-43.

Galland, C.C., Hinton, G.E. 1989 Deterministic Boltzmann learning in networks with asymmetric connectivity, Department of Computer Science, University of Toronto, Toronto, Canada, Tech. Rep. CRG-TR-89-6.

Goddard, N.H., Lynne, K.J., Mintz, T., Bukys, L. 1989 Rochester Connectionist Simulator, Computer Science Department, University of Rochester, Rochester, NY, USA, Tech. Rep. 233 (revised).

Grossberg, S. (Ed.) 1989 Neural networks and natural intelligence, Cambridge, MA, USA, MIT Press.

Hayes-Roth, B. 1985 A blackboard architecture for control, Artificial Intelligence, 26, 251-321.

Hinton, G.E. 1989 Connectionist Learning Procedures, Artificial Intelligence, 40, 185-234.

Hinton, G.E., Sejnowski, T.J. 1986 Learning and Relearning in Boltzmann Machines, in: Parallel Distributed Processing, Vol. 1, edited by Rumelhart, D.E., McClelland, J.L., Cambridge, MA, USA, MIT Press.

Hofstadter, D.R. 1979 Gödel, Escher, Bach: an Eternal Golden Braid, New York, USA, Basic Books Inc.

Horn, B.K.P. 1986 Robot Vision, Cambridge, MA, USA, MIT Press.

Johnson-Laird, P. 1983 Mental Models, Cambridge, UK, Cambridge University Press.

Marr, D. 1982 Vision, San Francisco, USA, Freeman and Co.

Minsky, M., Papert, S. 1969 Perceptrons, Cambridge, MA, USA, MIT Press.

Pentland, A.P. 1986 Perceptual Organization and the Representation of Natural Form, Artificial Intelligence, 28, 293-331.

Requicha, A.A.G. 1980 Representations for rigid solids: theory, methods and systems, ACM Computing Surveys, 12, 4, 437-464.

Rosch, E. et al. 1976 Basic Objects in Natural Categories, Cognitive Psychology, 8 (3), 382-439.

Rosch, E. 1978 Principles of Categorization, in: Cognition and Categorization, edited by Rosch, E. and Lloyd, B., Hillsdale, NJ, USA, Erlbaum.

Rumelhart, D.E., McClelland, J.L. (Eds.) 1986 Parallel Distributed Processing, Vol. 1, Cambridge, MA, USA, MIT Press.

Rumelhart, D.E., Hinton, G.E., Williams, R.J. 1986 Learning internal representations by error propagation, in: Parallel Distributed Processing, Vol. 1, edited by Rumelhart, D.E., McClelland, J.L., Cambridge, MA, USA, MIT Press.

Smolensky, P. 1986 Information Processing in Dynamical Systems: Foundations of Harmony Theory, in: Parallel Distributed Processing, Vol. 1, edited by Rumelhart, D.E., McClelland, J.L., Cambridge, MA, USA, MIT Press.

Smolensky, P. 1988 On the proper treatment of connectionism, Behavioral and Brain Sciences, 11, 1.

Tesauro, G., Sejnowski, T.J. 1989 A parallel network that learns to play backgammon, Artificial Intelligence, 39, 357-390.

Weyhrauch, R.W. 1980 Prolegomena to a theory of mechanized formal reasoning, Artificial Intelligence, 13 (1-2), 133-170.

Zadeh, L.A. 1988 Fuzzy logic, IEEE Computer, April, 83-93.

FIGURE CAPTIONS

FIG. 1. The proposed model for robot visual perception, aimed at object recognition tasks. Level A) consists of the input data from sensors connected to the external world. Level B) consists of the analog representation of the perceived scene: (B1) is a 3-D volumetric reconstruction of the observed scene; (B2) is the representation of the reconstructed scene in terms of a geometric model. Level C) is a symbolic representation of the considered domain.

FIG. 2. A voxel representation of the volumetric reconstruction of a chair. This is obtained by applying classical volume intersection techniques to infinite generalized cylinders grown from silhouettes extracted from 2-D digital images representing views of the observed scene.

FIG. 3. Examples of superquadrics for different values of the form parameters. Superquadrics are mathematical shapes based on the parametric form of quadric surfaces, in which each trigonometric function is raised to one of two real exponents ε_1 and ε_2, called form parameters since a change in these exponents affects the shape of the superquadric surface.

FIG. 4. A geometric representation of the chair of FIG. 2. The chair is described by Boolean combinations of the superquadrics that best approximate the shapes of its various elements.

FIG. 5. A schematic view of the mapping mechanism between the geometric and symbolic representations of the perceived scene. A) includes the units related to the geometric representation of the scene; B) includes the neural network units corresponding to the constants defining the objects in the scene; C) comprises the neural network units corresponding to the conceptual descriptions present in the semantic network; D) includes the units implementing the selection of attention mechanism.

FIG. 6. An example of a simple semantic network. The network refers to a scene where objects such as tables, chairs, or their constituting components are present. Tables and chairs are combinations of more than one superquadric; simple objects may be represented by single superquadrics.

FIG. 7. The general structure of the neural network architecture. The network is made up of two identical layers of external units and one layer of hidden units. The external units are subdivided into two groups, respectively representing the geometric level and the symbolic level. The units of the first group are in turn subdivided into two blocks: the first block is made up of clusters of units, each cluster representing the geometric parameters of one superquadric; the second block is made up of units allowing the selection of attention, each unit of this block selecting the attention on one superquadric. The units of the second group are also subdivided into two blocks: the units of the first block are related to assertional constants (A-Box), while the units of the second block are related to generic concepts (T-Box).

FIG. 8. The schematic representation of the backpropagation architecture partially implementing the described link. Each cluster a_1, a_2, ..., a_n represents a block of input units coding the parameters of a single superquadric. Units s_1, ..., s_n in layer A) implement the mechanism of selection of attention; these units allow the selection of one or more superquadrics to be considered when the analysis is limited to a portion of the scene. Layer B) includes the output units of the neural network. The output units make available their activation level, which may be considered a measure of the certainty of the recognition of the object.

FIG. 9. An example object the architectures have been trained to classify and/or recognize. The corresponding superquadric parameters may be found in tab. I.

FIG. 10. APSS measure related to the classification task for the backpropagation architecture, showing the ability of the network to generalize over the 125-pattern test set when trained on a subset of 25 patterns.

FIG. 11. APSS measure related to the classification task for the backpropagation architecture, showing the ability of the network to generalize over the 125-pattern test set when trained on a subset of 75 patterns. It is worth noting that the APSS reaches a minimum after 200 training epochs and shows an overlearning phenomenon when learning is carried on.


FIG. 12. A screen hardcopy of the simulator. Each unit is represented by a circle whose size is proportional to the unit activation. White circles stand for positive activations, while black circles stand for null activations. The input units and the selection of attention units are shown at the bottom of the figure: the first eight units of each row correspond to the superquadric parameters, while the last unit is the selection of attention unit. The output units are shown at the top of the figure. The input activation pattern is related to a table and no attention units are activated. The most active output unit is, as expected, the one related to the class of tables.

FIG. 13. A screen hardcopy of the simulator (see also FIG. 12). The input activation pattern is related to a chair, and the second attention unit, related to a chair leg, is activated at the same time. The most active output unit is, as expected, the one representing the class of legs.

FIG. 1 (block diagram: level A, input sensory data -> mapping mechanism -> level B, analog model comprising B1, 3-D reconstruction, and B2, geometric model -> mapping mechanism -> level C, symbolic representation)

FIG. 2

FIG. 3

FIG. 4

FIG. 5 (schematic: D, selection of attention, and A, geometric model, feed the associative mapping mechanism, which is linked to B, symbolic level, assertional component, and C, symbolic level, terminological component)

FIG. 6 (semantic network: THING subsumes COMPLEX OBJECT and SIMPLE OBJECT; nodes include FURNITURE, TABLE, CHAIR, STOOL, SEAT, BOARD, PARALLELEPIPED and CYLINDER, connected by roles such as support, plane, back and board)

FIG. 7 (layered network: input layer with superquadric parameters and selection of attention units at the geometric level and A-Box and T-Box units at the symbolic level; hidden layer; identical output layer)

FIG. 8 (backpropagation architecture: input clusters a_1, ..., a_n with attention units s_1, ..., s_n in layer A; output units Chair, Stool, Table and Leg in layer B)

FIG. 9

FIG. 10 (plot: APSS vs learning epochs, 0-1000, trained on 25 patterns)

FIG. 11 (plot: APSS vs learning epochs, 0-1000, trained on 75 patterns)

FIG. 12

FIG. 13

The superquadric parameters of the table reported in FIG. 9.

              x     y     z     a     b     c   ε_1   ε_2
Leg 1        27    90    65     7     7    65   500     1
Leg 2        90    27    65     7     7    65   500     1
Leg 3       153    90    65     7     7    65   500     1
Leg 4        80   180    65     7     7    65   500     1
Inf. plane   90    90   140    70    70    10   500     1
Sup. plane   90    90   153    90    90     3   500     1

Tab. I

The sizes of the six training sets along with the corresponding APSS after 1000 training epochs.

Training set #   Size   APSS (x 0.001)
      1           10        59.15
      2           25        38.51
      3           50        44.33
      4           75        49.17
      5          100        45.90
      6          125        19.94

Tab. II
