STUDYING THE HUMAN VISUAL CORTEX FOR IMPROVING PREHENSION CAPABILITIES IN ROBOTICS

Eris Chinellato, Robotic Intelligence Lab, Universitat Jaume I, [email protected]

Yiannis Demiris, Intelligent Systems and Networks Group, Imperial College London, [email protected]

ABSTRACT

Although other primates have grasping skills, human beings have evolved theirs to the extent that a large fraction of our brain is involved in grasping actions. Recent neuroscience findings allow us to outline a model of vision-based grasp planning that differs from previous ones in that it is the first to rest mainly, if not exclusively, on human physiology. The main theory on which our proposal is based is that of the two streams of the human visual cortex [1]. Although they evolved for different purposes, the ventral stream being dedicated to perceptual vision and the dorsal stream to action-oriented vision, they need to collaborate in order to allow proper interaction of human beings with the world. Our framework has been conceived to be applied to a robotic setup, and the design of the different brain areas has been performed taking into account not only biological plausibility, but also practical issues related to engineering constraints.

KEYWORDS
Connectionist Models, Cognitive Processes, Computational Neuroscience, Robotic Grasping, Visual Streams

1 Introduction

Computational modeling of human capabilities is performed for two main purposes: to validate physiological hypotheses and to solve complex engineering problems. Indeed, AI was born with the idea of emulating our capabilities in problem solving. Artificial sensorimotor skills have not developed nearly as fast as reasoning ones, due to engineering constraints, but also to the basic differences between reasoning in humans and in artificial machines. Computational neuroscience is trying to bridge this gap by providing artificial agents with abilities more closely inspired by natural mechanisms, in order to achieve, in the medium to long term, better efficiency and more intelligent behaviors. We present here a work aimed at mimicking the coordination between the sensory, associative and motor cortices of the human brain in vision-based grasping actions. Such integration is obtained by coordinating the two visual streams of the human cortex, the action-oriented dorsal stream and the perception-oriented ventral stream [1]. The model is not only biologically plausible, but also carefully designed to be easily applied to a robotic setup.

Angel P. del Pobil, Robotic Intelligence Lab, Universitat Jaume I, [email protected]

Figure 1. Human brain, left lateral view

Previous models of vision-based grasping [2, 3, 4] were so far built mainly, when not exclusively, on monkey data, and represent a generic "primate brain". Recent neuropsychological and neuroimaging research has shed new light on how visuomotor coordination is organized and performed in the human brain, and thanks to the current availability of human data we can now develop a model of vision-based grasping based mainly, if not yet entirely, on human physiology. We next describe the main findings on which our model is based. Previous models are then reviewed, and our proposal is finally detailed and justified.

2 Findings

The following description of the neuroscientific concepts relevant to our research is organized in a logical sequence from stimulus perception to action execution. Refer to Figure 1 to locate the brain regions cited in the text.

2.1 The visual cortex and the two cortical streams of visual elaboration

The retina is the visual receptor of the human body. It sends the visual information it gathers to the lateral geniculate nucleus (LGN) of the thalamus, which forwards it almost entirely to the primary visual cortex (V1) in the occipital lobe. V1 is organized in a retinotopic manner, respecting the topological distribution of stimuli on the retina. It detects basic visual features, such as colors, bars or edges and their orientation, and depth through binocular disparity. Visual area V2 receives most of V1's output and projects mainly to V3 and V4. V2 is again retinotopic, has receptive fields larger than V1's, and combines V1 features in order to perform moderately complex visual tasks, such as detecting spatial frequencies and textures or separating foreground from background. Visual area V3, still retinotopic, has even larger receptive fields and the ability to detect complex features. V3 is the place where the data stream begins to split into two main directions: a dorsal one, towards the posterior parietal cortex (PPC), and a ventral one, towards the inferior temporal (IT) cortex. Visual information flowing through the ventral "what" pathway passes from V2/V3 through V4 to the lateral occipital (LO) complex, dedicated to object composition and recognition [5]. Along the dorsal stream, through region V3a, object-related visual information reaches area AIP in the intraparietal sulcus, which is concerned with analyzing visual features in order to organize grasping actions. We now describe in more detail the main regions and functions of the two pathways.

2.2 Dorsal stream

The posterior parietal cortex (PPC) is widely recognized as the main associative area of our brain dedicated to the coordination between sensory information and motor response. The intraparietal sulcus (IPS) separates the superior and inferior lobes of the PPC. Several areas within and close to the IPS are dedicated to visuomotor transformations. We are especially interested in two of them, its most anterior and posterior sections, whose characteristics are described below. For a more thorough review, nomenclature issues and a detailed bibliography please refer to [6] and [7].

Caudal intraparietal sulcus - cIPS. The most posterior part of the IPS (also called CIP, pIPS or PI) seems to be dedicated to 3D shape and orientation processing. cIPS receives projections from visual area V3a and is also active during visually guided grasping. Shikata and colleagues [8] showed that findings on monkeys are very likely to apply also to humans.

Anterior intraparietal sulcus - AIP. AIP (sometimes called hAIP, h = human) is located at the junction between the anterior IPS and the inferior postcentral sulcus (PCS), and it is the most important area involved in the planning and monitoring of grasping actions. The number and quality of recent studies on this area (see e.g. [9]) allow us to paint a detailed picture of its features. First of all, AIP is not explicitly involved in spatial analysis that is not related to action: e.g., it is not active during perceptual size discrimination, nor for 2D pictures. Grasping of familiar objects, and especially tools, seems to rely less on AIP, as cognitive cues and previous knowledge are likely to be used to infer the object size when AIP is impaired [10]. Different AIP neurons are tuned to different objects, and to different views of the same object (thus, probably, to different grips). Moreover, AIP is preferentially activated during grasping with precision grips compared with full-hand power grips, suggesting a fundamental role in the fine calibration of finger positioning, as required in precision grip tasks. Although AIP keeps active from object observation to the end of movement execution, some AIP neurons are selective for one of the following grasping sub-phases: set, preshape, enclose, hold, ungrasp. AIP neurons can be classified into subpopulations according to their preferential response: visuomotor neurons respond during grasping actions when both object and hand are visible; motor-dominant neurons respond during grasping in the dark and hand manipulation when vision is unavailable; visual-dominant neurons respond to the simple visual presentation of 3D objects, even though no action is required.

2.3 Ventral stream

In the ventral pathway there is an area, called LO (or LOC, the Lateral Occipital Complex), which seems to play a role in grasping actions, sending information about object recognition to action-oriented areas. LO receives high-level visual input from V4 and integrates visual elements that share similar attributes of orientation, color or depth into objects, extracting them from the background [5]. Whilst LO activates whenever an object is visible (compared to scrambled images), including during reaching and grasping hand movements towards visible objects, it shows no differential activation for grasping compared to reaching, suggesting that the visually driven specification of the movements required for grasping is mediated entirely in AIP. Together with other ventral stream areas (mainly the fusiform gyrus), LO shows fMRI adaptation when the subject views the object from different viewpoints, whilst no such adaptation is found in the dorsal stream areas AIP or cIPS.

2.4 Role of the streams and their interaction

The underlying idea of the original two-streams theory [1] is that visual information has direct control over action in the dorsal stream, without any intervening mental representations; neural activity in the dorsal stream does not reflect the representation of objects or events, but rather the direct transformation of visual information into the coordinates required for action. Nonetheless, the pathways are not completely dissociated, and the ventral stream is normally involved in grasping actions, probably helping the dorsal areas in the grip selection process through semantic knowledge and memories of past events [11]. As explained above, AIP is the cortical region in which visual information is used to code an appropriate grasping configuration for a target object, and the detailed parameters of the selected action are completely determined by processing in the dorsal stream. Nevertheless, action selection is aided by visual processing in the ventral stream. Considering different possible acting conditions, whilst for tools and well-known objects parietal grasp selection is probably driven top-down by semantic information [9], some simple grasping movements may be made without the influence of visual context or any top-down visual knowledge [11]. In support of this view, Sugio et al. [10] showed that different brain areas activate depending on the familiarity with the object, confirming that AIP elaboration is less critical if the object is well known, and especially if it has a handle; in these cases the ventral stream does most of the job and the action is mainly memory-driven. Psychophysical experiments of delayed grasping under different conditions on normal subjects support the idea that memory-guided grasping relies on the processing of stored information coming from the perception-based system in the ventral stream [12].

An explanation of the above findings is that the contribution of the ventral stream to the generation/selection of the final grip is modulated by the degree of recognition of the target object achieved by LO and by the quality of the previous experience. Higher confidence in the object recognition task results in a stronger influence of past grasping experiences, whilst a more uncertain recognition leads to a more exploratory behavior, giving more importance to the actual observation. The relation between the two streams in our model builds on this assumption.

Several other areas are involved in the preparation and/or execution of vision-based grasping actions. We only briefly describe the ventral premotor cortex, noting that the primary motor and somatosensory cortices, basal ganglia, prefrontal cortex and cerebellum are all deeply involved in the correct planning and execution of grasping actions. The ventral premotor cortex (PMv) is a key area in the preparation and execution of reaching and grasping actions. This region is still poorly characterized in humans. In monkeys, many electrophysiological studies have shown that it is composed of F4 and F5, the reaching and grasping areas that connect the posterior parietal cortex with the primary motor cortex, translating abstract action representations into motor commands. F5 is directly connected to AIP, and it also codes for the type of grip, as most F5 neurons are selective for one of: precision grip (predominant), finger prehension, or whole-hand prehension. Even more than AIP, F5 neurons code for segments of the grasping action, thus constituting a vocabulary of motor prototypes to be selected and composed into the final action.
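As a purely illustrative aside (not part of the neuroscientific findings), the modulation assumption stated above can be made concrete as a confidence-weighted blend of memory-driven and observation-driven evidence. All names and the linear mixing rule in this minimal sketch are our own assumptions:

```python
# Illustrative sketch only: ventral-stream recognition confidence modulates
# how much past grasping experience (memory-driven) biases the grip
# preference relative to the current observation (dorsal, action-driven).

def grip_preference(observation_score: float,
                    experience_score: float,
                    recognition_confidence: float) -> float:
    """Blend on-line and memory-based evidence for a candidate grip.

    recognition_confidence in [0, 1]: 0 = unfamiliar object (exploratory,
    rely on the current observation), 1 = well-known object (memory-driven).
    """
    c = max(0.0, min(1.0, recognition_confidence))
    return (1.0 - c) * observation_score + c * experience_score


if __name__ == "__main__":
    # Well-known object: past experience dominates the preference.
    print(grip_preference(0.4, 0.9, recognition_confidence=0.8))  # 0.80
    # Novel object: the current visual analysis dominates.
    print(grip_preference(0.4, 0.9, recognition_confidence=0.1))  # 0.45
```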

3 Previous models

Computational models of the human visual system are widely available, and research on object recognition continues to involve a large part of the computer vision community. Nevertheless, few resources have been dedicated to exploring the mechanisms underlying the functioning of the action-related visual cortex, and the issue of the integration between the contributions of the two visual pathways is nearly unexplored.

In 1998, Fagg and Arbib proposed the most complete attempt to model the sensorimotor mechanisms of vision-based grasping in primates, the FARS (Fagg-Arbib-Rizzolatti-Sakata) model [2]. This model focuses on the interaction between AIP and F5, and is especially oriented to the action-execution part of the process. The model is implemented using biologically inspired neural networks and includes a large number of different brain areas (mainly inspired by monkey physiology), but only areas AIP, F5 and the primary motor cortex F1 (corresponding to human M1) are modeled in detail. For AIP, the distinction between visual, visuomotor and motor neurons is taken into account.

Rizzolatti and Luppino [3] suggested in 2001 that the FARS model should be modified, as the hypothesis that action selection is achieved in an F5-AIP loop, using information coming from the ventral stream and the prefrontal cortex, does not seem consistent with more recent findings. According to their view, AIP is the site of action selection, as it receives direct input from the ventral stream and the prefrontal cortex, whilst F5 does not. AIP would send the coordinates of only the selected grip to F5, and the action would remain potential until a release signal is received.

The model proposed by Lebedev and Wise in 2002 [4] distinguishes between a vision-for-action (VFA) and a vision-for-perception (VFP) system, which can be roughly identified with the dorsal and ventral streams. According to the authors, the VFA system tries to shape the target action so as to best fit the current sensory situation, and is assisted by the VFP system, which in turn biases the selection towards stored, recognizable patterns.

4 Modeling the two-stream interaction

Since our focus is on the generation of grips through the use of visual information, and not on the perceptual process, we start our description from the point at which visual areas V2 and V3 send their output to higher-order cortical visual areas. In the early dorsal areas (V3a/cIPS), possible grasping zones on the object surface are extracted. The candidate contact zones are combined with proprioceptive data and an implicit grip taxonomy in AIP to generate candidate grips. Meanwhile, the spatially invariant visual areas of the ventral stream (V4 and then LO) proceed with the task of identifying the object, and thus access memorized properties which may be useful for the upcoming action. On-line perceptual data regarding the geometry and position of the target object, and perceptual information on relevant object properties, are integrated in an efficient and robust way, in order to plan, execute and evaluate an appropriate grasping action. A detailed description of how the model works is given below. Figure 2 shows the graphical schema of the whole model. From this point on, neuroscientific concepts will be mixed with computational aspects and issues related to the robotic application.

Figure 2. Model Framework
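Read as an engineering blueprint, the data flow just outlined (and summarized in Figure 2) could be organized as in the following sketch. Module boundaries, type names and fields are our illustrative assumptions, not a specification of the model itself:

```python
# Sketch of the model's overall data flow (names, types and interfaces are
# our illustrative assumptions; they roughly mirror the blocks of Figure 2).

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class GraspableFeature:                # produced by early dorsal areas (V3a/cIPS)
    kind: str                          # 'flat', 'small', 'cylindrical', 'spherical'
    orientation: float                 # feature orientation (radians, illustrative)
    curvatures: Tuple[float, float]    # curvature along two directions

@dataclass
class ObjectKnowledge:                 # produced by ventral areas (V4/LO)
    identity: str
    confidence: float                  # reliability of the recognition
    remembered_grips: List[str] = field(default_factory=list)

def dorsal_features(stereo_view) -> List[GraspableFeature]:
    """Extract action-relevant surface features (placeholder)."""
    return []

def ventral_recognition(view) -> ObjectKnowledge:
    """Classify the object and retrieve stored properties (placeholder)."""
    return ObjectKnowledge(identity="unknown", confidence=0.0)

def aip_grasp_plan(features, knowledge, hand_state) -> dict:
    """Generate candidate grips from features and hand state, then select one,
    biased by ventral-stream knowledge (placeholder)."""
    return {"grip": None, "initial_force": 0.0}
```

Each placeholder corresponds to one of the processing stages detailed in the subsections below.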

4.1 Early dorsal stream

In order to elaborate a proper action on an external target, two main pieces of information are necessary: one is the position and shape of the object, the other is the proprioceptive data regarding the effector, i.e., the acting hand.

Proprioception. The somatosensory cortex is the source of the proprioceptive information about the current position and state of the hand and arm. The state of the hand includes its orientation and the opening of each finger, since to perform a more efficient action one may want to minimize the number and amplitude of joint movements. For a common robotic hand the complexity is much reduced, but the problem of efficiency becomes critical, as reducing joint movements results in a faster and more energy-efficient action.

Position and geometry of the target object. Region V3a in the dorsal stream is organized retinotopically, so that the topological organization of the observed scene and the relative positions of object features are directly represented. Object distances are estimated through stereopsis and convergence (the latter being probably more influential in the dorsal stream [11]). The posterior parietal cortex does not construct any model or representation of the object, but rather extracts properties of visual features that are suitable for a potential action. Area cIPS is likely to elaborate the input it receives from V3/V3a in order to produce features that are possibly suitable for grasping purposes. These features can be classified into four categories: 1) surfaces that are large and flat enough to place one or more fingertips in an opposition grip; 2) small features suitable for precision grips (for a robotic hand, small features are much easier to grasp if they have a clear major axis, e.g. a pen vs. a marble); 3) cylindrical zones which allow an enveloping grip; 4) spherical features which allow a power grip. More irregular shapes are assigned to the third (elongated features) or fourth (roughly round objects) group. For each extracted feature, its orientation [8] and its curvature in two directions (the direction of maximum curvature and its normal) also have to be detected.
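As a hedged illustration of how the four feature categories above might be assigned from simple surface measurements, consider the following sketch; the SurfacePatch fields and all thresholds are hypothetical choices of ours, not values from the model:

```python
# Illustrative assignment of a surface patch to one of the four graspable-
# feature classes described above. Field names and thresholds are hypothetical.

from dataclasses import dataclass

@dataclass
class SurfacePatch:
    area: float           # visible patch area (cm^2, illustrative units)
    elongation: float     # major-axis / minor-axis ratio
    max_curvature: float  # curvature along the principal direction (1/cm)
    min_curvature: float  # curvature along the normal direction (1/cm)

def categorize(p: SurfacePatch,
               flat_area: float = 4.0,
               flat_curv: float = 0.05,
               elong_ratio: float = 2.0) -> int:
    """Return 1 flat opposition surface, 2 small precision feature,
    3 cylindrical (enveloping grip), 4 spherical (power grip)."""
    if abs(p.max_curvature) < flat_curv and p.area >= flat_area:
        return 1   # large, flat surface: fingertip opposition grip
    if p.area < flat_area and p.elongation >= elong_ratio:
        return 2   # small feature with a clear major axis (pen-like): precision grip
    if abs(p.min_curvature) < flat_curv:
        return 3   # curved along one direction only (elongated): enveloping grip
    return 4       # curved in both directions (roughly round): power grip
```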

4.2 Ventral stream

In the ventral stream, visual information is related to cognitive aspects aimed at object recognition. V4 codes at the same time the shape, color and texture of features, which are then composed in LO to form more complex shapes recognizable as objects [5]. The inputs to the recognition step will thus be information on the shape, color and texture of the object. Identifying the object translates into a classification task: categorizing the target into one of a set of known object classes. The outputs will thus be the object identity and its composition, which in turn allow its weight distribution and surface roughness to be estimated; this is valuable information when planning the action. Recognition is not a true/false process, and a reliability index of the extracted information needs to be provided, as the classification may be more or less certain: if it is considered very unreliable, more importance will be given to the on-line visual information gathered by the dorsal stream. Besides the recovery of memorized object properties, recognition gives access to memories of previous grasping experiences. These can be used to associate the object with basic natural grips and to recall the outcome of past actions on that object; this information is used to bias the grasp selection. The knowledge level about a specific object identity can be used to modulate the influence of the ventral stream on the overall selection process. The knowledge level may be expressed by the classification likelihood, i.e. the probability of having actually found that specific object identity within the whole set of known objects, and by the number of previous encounters with the same object identity.
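A minimal sketch, under our own assumptions about its functional form, of how such a scalar knowledge level could be derived from the classification likelihood and the number of previous encounters:

```python
# Illustrative scalar "knowledge level" combining classification likelihood
# and the number of previous encounters. The saturating form is an assumption.

import math

def knowledge_level(class_likelihood: float,
                    encounters: int,
                    saturation: float = 5.0) -> float:
    """Return a value in [0, 1]: how strongly the ventral stream should bias
    grasp selection. class_likelihood is the probability assigned to the
    winning object identity; encounters counts previous graspings of it."""
    familiarity = 1.0 - math.exp(-encounters / saturation)
    return max(0.0, min(1.0, class_likelihood)) * familiarity
```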

4.3 AIP and the grasp planning process

The grasp planning process can be performed through a series of neural networks which, as a first step, match features to generate grips (AIP visual-dominant neurons), then make the grips compete under the influence of the data coming from the ventral stream (AIP visuomotor neurons), and finally select one preferred grip that is defined in detail through a loop connecting AIP with F5 (motor-dominant neurons). Together with the selected grip, an initial grasping force is also provided by the grasp planning process. Indeed, AIP seems to be responsible also for determining the initial force of grips [11], as this is closely related to the geometry of the grasping configuration and the nature of the target object. Moreover, this initial force depends on the confidence we have in our choice: if we do not know whether an object is heavy or light, we may decide to begin with a lighter grip in order to avoid crushing it.
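The dependence of the initial force on confidence can be illustrated with a small sketch; the linear blend and all parameter names are assumptions made for illustration, not the model's actual rule:

```python
# Illustrative choice of the initial grip force: with low confidence in the
# estimated object weight we start with a lighter, cautious grip.

def initial_grasp_force(estimated_weight: float,
                        confidence: float,
                        safety_gain: float = 1.2,
                        cautious_force: float = 0.5) -> float:
    """Return an initial grip force (arbitrary units)."""
    confident_force = safety_gain * estimated_weight
    c = max(0.0, min(1.0, confidence))
    # Low confidence pulls the force towards the light, cautious value.
    return c * confident_force + (1.0 - c) * min(cautious_force, confident_force)
```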

4.3.1 Grip taxonomy

After infants have learnt to grasp, we can suppose that a basic grip taxonomy is maintained implicitly in the PPC, as we know AIP neurons code for different grips. For a practical application, this taxonomy must be a trade-off between human and robotic grasping capabilities [13]. The list of basic grips in the taxonomy corresponds to the above list of graspable features: 1) all-finger precision grip (the thumb opposing all other fingers); 2) thumb-index precision grip (a special case of the previous one for small features); 3) cylindrical grip (all fingers parallel, enveloping the object); 4) full-hand spherical power grip (also useful for complex shapes for which no other grip is considered appropriate).

4.3.2 Grip generation

Grasping configurations are computed by matching the taxonomy with the possible graspable features, whilst proprioceptive data constrain the grasping direction, reducing the search space. The whole population of grips is obtained by starting from the relative position of hand and object and checking how each grip in the taxonomy could match the target features. A check within the set of possible grasping features will tell whether there are zones on the object surface suitable for a specific solution (e.g. a side whole-hand precision grip needs two fairly parallel surfaces, one with space for one finger and the other for more fingers, and the orientation of both these features should be roughly vertical). To optimize the action efficiency, the hand should grasp the object in a direction close to that connecting feature and hand in the starting position, avoiding moving around the object or rotating the wrist joint excessively.

4.3.3 Grip selection

The selection process among the generated grips consists in a competition between candidates that is biased by information from the ventral stream and by a quality assessment. The dominant factors in the selection are the estimated movement cost, the outcome of previous grasping experiences, the importance given to such experiences (depending on the knowledge level), and the reliability estimation given by visual criteria. In a previous work [14] we defined a number of criteria for the reliability assessment of planar grips, and used them to predict the outcome of future robotic grasping actions. Some of those criteria can be adapted to the three-dimensional case, maintaining their plausibility and usefulness. The criteria that refer to visual aspects and are used during this first selection phase are: the Grasping margin criterion, which tries to minimize the risk of placing the fingers on edges or other unsuitable object parts; the Curvature criterion, which considers concave surfaces more reliable than convex ones; and the Center of mass criterion, designed to reduce the effect of gravitational and inertial torques, especially if the object is heavy. If the recognition confidence is high, more importance can be given to some criteria with respect to others. If the knowledge level about the object is low, the criteria weighting should be more homogeneous, to reflect the uncertainty of the situation. It is information from the ventral stream that modulates the importance of each criterion.
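A minimal sketch of this criterion-weighted competition, in which a low knowledge level flattens the weights towards uniform; the criterion names, score conventions and interpolation scheme are illustrative assumptions:

```python
# Illustrative criterion-weighted grip selection. Each candidate carries
# per-criterion scores in [0, 1] (higher is better); with a low knowledge
# level the weights are flattened towards uniform (exploratory selection).

from typing import Dict, List

def select_grip(candidates: List[Dict[str, float]],
                base_weights: Dict[str, float],
                knowledge_level: float) -> Dict[str, float]:
    """Return the candidate with the highest weighted criteria score."""
    uniform = 1.0 / len(base_weights)
    k = max(0.0, min(1.0, knowledge_level))
    weights = {c: k * w + (1.0 - k) * uniform for c, w in base_weights.items()}

    def score(grip: Dict[str, float]) -> float:
        return sum(weights[c] * grip.get(c, 0.0) for c in weights)

    return max(candidates, key=score)


if __name__ == "__main__":
    grips = [
        {"grasping_margin": 0.9, "curvature": 0.8, "center_of_mass": 0.8,
         "movement_cost": 0.8, "experience": 0.1},
        {"grasping_margin": 0.5, "curvature": 0.5, "center_of_mass": 0.5,
         "movement_cost": 0.5, "experience": 0.95},
    ]
    weights = {"grasping_margin": 0.125, "curvature": 0.125, "center_of_mass": 0.125,
               "movement_cost": 0.125, "experience": 0.5}
    print(select_grip(grips, weights, knowledge_level=0.9))  # familiar: second grip wins
    print(select_grip(grips, weights, knowledge_level=0.1))  # unfamiliar: first grip wins
```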

4.3.4 Action planning and execution

Once the general grip configuration and the target feature have been selected, the action needs to be planned in detail. AIP is likely to play an active role in this, as fMRI research showed that it keeps active throughout the action. The exact position in which the fingers will touch the object is defined in our framework through an inverse-model/forward-model loop, in which the evaluation of a set of motor criteria drives the steps towards the final position. The goal state of each iteration is given by the union of two factors: the approach of the hand to the final position planned by the visuomotor neurons of AIP, and the bias given by motor criteria towards a more stable configuration. Following a simulation scheme similar to [15], the inverse model compares the goal state of each finger of the hand with its actual position and generates a motor command suitable for approaching the goal. The forward model evaluates the outcome of the current motor command and guides the following step so as to keep improving the quality of the ongoing situation as estimated by the motor criteria. The most important motor criteria [14] are the Finger aperture criterion, according to which an average finger aperture is preferred as it allows for larger contact surfaces and better contacts, and the Distribution of forces and contact points criterion, which considers more stable a grip with a regular and equilibrated force distribution.

Neuroscience supports this final part of the model. The ventral premotor cortex of the human brain has often been considered to make large use of inverse models for motor programming purposes [16]. The cerebellum is believed to be the place in which forward models are computed, and evidence for projections from the cerebellum to AIP supports our scheme [17]. Something similar to our visual and motor criteria seems to be managed in the basal ganglia: in particular, the substantia nigra pars reticulata (which has been found to project to AIP) seems to be responsible for motor-related reward assignment [17].
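The loop described above can be sketched, under strong simplifying assumptions, as follows; the proportional inverse model, the identity forward model and the acceptance rule are toy stand-ins for the mechanisms discussed in the text:

```python
# Toy sketch of the inverse-model/forward-model refinement loop: the "inverse
# model" is a proportional command towards the goal posture, the "forward
# model" is a trivial prediction of its outcome, and a motor-quality function
# (finger aperture, force distribution, ...) accepts only non-degrading steps.

import numpy as np

def refine_finger_positions(goal, current, motor_quality,
                            steps: int = 20, gain: float = 0.3):
    """Iteratively move finger contact points towards the goal configuration."""
    state = np.asarray(current, dtype=float)
    goal = np.asarray(goal, dtype=float)
    for _ in range(steps):
        command = gain * (goal - state)    # inverse model: command towards the goal
        predicted = state + command        # forward model: predicted outcome
        if motor_quality(predicted) >= motor_quality(state):
            state = predicted              # keep only steps that do not degrade quality
    return state


if __name__ == "__main__":
    # Hypothetical quality term preferring an average finger aperture of 4 cm.
    quality = lambda s: -abs(float(np.mean(s)) - 4.0)
    print(refine_finger_positions(goal=[3.5, 4.0, 4.5], current=[6.0, 6.0, 6.0],
                                  motor_quality=quality))
```

In a real setup the forward model would be a learned predictor of hand-object contact rather than an identity map, and the quality term would combine the motor criteria named above.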

4.4 After the execution

The actual outcome of the action execution is used to update the memory of previous grasping experiences for future reference. The reward system (the criteria) can also be updated, for example by changing the criteria weights to improve their usefulness in certain conditions.
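One possible (purely illustrative) way to perform such an update is to nudge the weight of each criterion according to how well it predicted the actual outcome:

```python
# Illustrative post-execution update: increase the weight of criteria whose
# predictions matched the actual outcome, decrease the others, renormalize.
# The update rule is our own assumption, not the authors' method.

from typing import Dict

def update_criteria_weights(weights: Dict[str, float],
                            predicted_scores: Dict[str, float],
                            success: bool,
                            learning_rate: float = 0.05) -> Dict[str, float]:
    outcome = 1.0 if success else 0.0
    updated = {}
    for criterion, w in weights.items():
        error = abs(predicted_scores.get(criterion, 0.5) - outcome)
        updated[criterion] = max(1e-6, w + learning_rate * (0.5 - error))
    total = sum(updated.values())
    return {c: w / total for c, w in updated.items()}
```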

5 Conclusion

Thanks to recent neuroscience findings, we have been able to design the outline of a model of the brain mechanisms on which vision-based grasp planning in humans relies. Our model pays special attention to the interaction between the two streams of the human visual cortex. We have included in the model the brain areas that are most relevant to the grasping task, but its modularity allows for easy extensions. The model has been conceived to be applied to a robotic setup, and the design of the different brain areas has been carried out taking into account not only biological plausibility, but also practical issues related to engineering constraints. At the same time, we plan to experimentally validate neuroscience findings by testing them on a robotic system. New issues will possibly arise, and they may inspire new neurophysiological experiments.

Acknowledgment

This project has been partly funded by Fundació Caixa Castelló-Bancaixa (P1-1B2005-28), Generalitat Valenciana (GV05/137), and the Spanish Ministry of Science and Technology (FPI program). A special thanks to Professor Giacomo Pedone.

References

[1] M. A. Goodale and A. D. Milner. Separate visual pathways for perception and action. Trends Neurosci, 15(1):20-25, Jan 1992.
[2] A. H. Fagg and M. A. Arbib. Modeling parietal-premotor interactions in primate control of grasping. Neural Networks, 11(7-8):1277-1303, Oct 1998.
[3] G. Rizzolatti and G. Luppino. The cortical motor system. Neuron, 31(6):889-901, Sep 2001.
[4] M. A. Lebedev and S. P. Wise. Insights into seeing and grasping: Distinguishing the neural correlates of perception and action. Behavioral and Cognitive Neuroscience Reviews, 1(2):108-129, 2002.
[5] K. Grill-Spector. The neural basis of object perception. Curr Opin Neurobiol, 13(2):159-166, Apr 2003.
[6] J. C. Culham, C. Cavina Pratesi, and A. Singhal. The role of parietal cortex in visuomotor control: What have we learned from neuroimaging? Neuropsychologia, Dec 2005.
[7] U. Castiello. The neuroscience of grasping. Nat Rev Neurosci, 6(9):726-736, Sep 2005.
[8] E. Shikata, F. Hamzei, V. Glauche, M. Koch, C. Weiller, F. Binkofski, and C. Büchel. Functional properties and interaction of the anterior and posterior intraparietal areas in humans. Eur J Neurosci, 17(5):1105-1110, Mar 2003.
[9] S. H. Frey, D. Vinton, R. Norlund, and S. T. Grafton. Cortical topography of human anterior intraparietal cortex active during visually guided grasping. Brain Res Cogn Brain Res, 23(2-3):397-405, May 2005.
[10] T. Sugio, K. Ogawa, and T. Inui. Neural correlates of semantic effects on grasping familiar objects. Neuroreport, 14(18):2297-2301, Dec 2003.
[11] M. Goodale and D. Milner. Sight Unseen. Oxford University Press, 2004.
[12] A. Singhal, E. Chinellato, J. C. Culham, and M. A. Goodale. Dual-task interference is greater in memory-guided grasping than visually guided grasping. In Vision Sciences Society 5th Annual Meeting, Sarasota, Florida, May 2005.
[13] M. R. Cutkosky and R. D. Howe. Human grasp choice and robotic grasp analysis. In S. T. Venkataraman and T. Iberall, editors, Dextrous Robot Hands, chapter 1, pages 5-31. Springer-Verlag, 1990.
[14] E. Chinellato, A. Morales, R. B. Fisher, and A. P. del Pobil. Visual quality measures for characterizing planar robot grasps. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 35(1):30-41, 2005.
[15] Y. Demiris and G. Hayes. Imitation as a dual-route process featuring predictive and learning components: a biologically plausible computational model. In K. Dautenhahn and C. Nehaniv, editors, Imitation in Animals and Artifacts, chapter 13, pages 327-361. MIT Press, 2002.
[16] R. C. Miall. Connecting mirror neurons and forward models. Neuroreport, 14(16):1-3, 2003.
[17] D. M. Clower, R. P. Dum, and P. L. Strick. Basal ganglia and cerebellar inputs to 'AIP'. Cereb Cortex, 15(7):913-920, Jul 2005.