Towards Human-Robot Gesture Recognition using Point-Based Medialness

Frederic Fol Leymarie
Computing, Goldsmiths, Uni. of London
[email protected]

Prashant Aparajeya
Computing, Goldsmiths, Uni. of London
[email protected]

Daniel Berio
artist
[email protected]
Abstract—We have recently introduced the idea of using a perception-based medial point description of a deforming or moving 2D object as a framework for a part-based shape representation which can then be efficiently used in dynamic pattern recognition tasks. The first step is one of fuzzy medialness measurements of 2D segmented objects from intensity images, which emphasises shape information characteristic of an object's parts (such as the main concavities and folds along a contour). We distinguish interior from exterior shape description. An interior dominant point is a well-localised peak value in the medialness representation. An exterior dominant point is evaluated by identifying a region of concavity subtended by a minimum angular support. We also identify main convexities as locally dominant outline curvature points at the end of important thick branches of medialness. We thus bring together some of the main psychologically-motivated point descriptors of a shape in one framework. We demonstrate in this short paper our exploration of its use in studying and characterising different gestures, including those performed by a human artist creating graffiti (street) art. Our goal is to develop learning by example of shape-based gestures, first to duplicate an artist's expertise via robotic platforms, and later to engage in human-robot collaborations in artistic creativity.

Index Terms—2D shape analysis, medialness, dominant points, codons, graffiti, Sigma-Lognormal Model, gesture transfer.
I. INTRODUCTION

We consider the 2D shape representation and its associated gesture characterisation problems in computer vision and robotics. To perform a gesture identification task efficiently (and ultimately classification, recognition, learning), we propose to use a point-based medialness shape descriptor we have recently introduced [10] that can robustly capture the important structural information related to a deforming and moving form, in particular in the context of humans and robots performing gesture-driven tasks. This work is part of a larger project on human perception and computer vision applied to artistic creativity and how it can impact social robotics, by developing new modes of communication and collaboration between robots and humans [20]. Shape representation has been addressed in many ways by computer scientists, including: by contour analysis [2], using Blum's medial axis transform and its related 2D shock graph [5], [18], combining contour and skeletal medial graphs [9], [1], or using the related inner distance [12]. Closest to our approach from a computational point of view are the contour enclosure-based symmetries [4] and the medial point transform [22]; however, in these earlier works no attempt is made to isolate
Fig. 1. Top row: Left: Original shot of an open hand gesture; Right: corresponding segmented and binarised (figure-ground) image. Middle row: Left: Classical internal medial-axis approximation; Middle: external medial-axis; Right: 2D shock graph; Bottom row: Left: interior medialness map; Middle: recovered concave (blue dots) and convex (red dots) points; Right: final dominant point set (where interior medial points are in green) obtained via our method.
and use dominant points. Other classical approaches also emphasise either boundary information (e.g. Fourier, wavelet, or scale-space analyses of closed contours) or interior information (e.g. primitive retro-fitting or approximation). Most of these (medial or not) approaches do not explicitly tackle deformations and articulated movements [13], while we do. Our representation, derived from the earlier proposal by Kovács et al. [7], maps the whole shape information onto a small number of points we call "dominant". Contrary to classical medial-based representations, ours is not overly sensitive to small boundary deformations and furthermore gives a high response in those regions where the object has high curvature with large boundary support and in the vicinity of joints (between well-delineated parts, such as the limbs of an animal). We augment the medial dominant points with main contour points indicating significant convex and concave features, thus bringing together with our notion of medialness the main 2D point-based shape systems proposed over the years in the fields of cognitive psychology and computer vision: the so-called "codons" denoting contour parts [17] and high-curvature convexities often used in scale-space analyses [13] (Fig. 1).
Fig. 2. (a) A tag made with a marker. (b) Replica of the tag generated with the Sigma-Lognormal Model by specifying way-points (in orange). (c) Plot of the lognormal set used in (b). (d) Illustration of the artist performing a graffiti tag and its transfer to a plotter.
II. MOTIVATIONS & BACKGROUND

Our interest in characterising gestures emerges from a long-term project to understand human artistic performance and creativity in the arts. Our work has thus far been mainly focused on sketching and drawing face portraits [21], and on street art in the form of graffiti writing. While initially we worked on purely algorithmic and software systems [19], it became clear to us that only by embedding our systems in biomimetic robotics platforms would we be able to make significant further progress in understanding these human activities. If one's goal is to better understand and mimic biological systems, including the human artist at work, for which behaviour is a function not only of their neural or computational system, but also of their morphological capacities and of their interaction with a surrounding physical environment, then robotics platforms designed to reproduce the physical constraints of the human body offer a good compromise between algorithmic simulation (via software systems) and purely mechanical contraptions [14], [21]. The initial project, named AIkon (the Artistic Ikonograph: www.aikon-gold.com), was focused on face portraits and modeled various aspects of the drawing act, from the observation of the face using computer vision techniques, to the placement of strokes to create appropriate shading based on morphological operations. In more recent efforts the drawing process is carried out by a robotic arm, and its influence on drawing style is also explored [21]. The resulting drawings are strikingly similar to the original sketches by the artist Patrick Tresset. Our current project, named AutoGraff (Automata and Graffiti: www.doc.gold.ac.uk/autograff/), considers the signatures characteristic of graffiti art, commonly referred to as tags [3]. The practice of writing tags is usually assimilated through years of observation and practice. With experience, gestures are learned and embodied to the point of becoming “second
nature” to the writer, leading to a drawing process that is performed almost automatically. With experience the hand will draw more efficient lines, which ultimately result in more harmonious curves and evoke a strong sense of dynamism in the tag. We have observed that the stroke gestures used when writing tags assume an asymmetric 'bell'-shaped velocity curve (Fig. 2); this observation drew our attention to Réjean Plamondon's Sigma-Lognormal Model of handwriting gestures [15], which is commonly used in handwriting synthesis and recognition. The Sigma-Lognormal function models handwriting trajectories described by a set of way-points that define the directions of a vectorial sum in time of "stroke" functions. Each stroke is described by (i) a time window, (ii) a velocity profile defined by a lognormal function, and (iii) initial and final directions, interpolated according to a sigmoid function (erf). Smooth trajectories are created by overlapping the time windows of successive strokes; a greater time overlap will generate smoother trajectories. Plamondon et al. also show that with increasing experience the rapid movements made by an adult when writing tend towards lognormality [16], i.e. the velocity of the trajectories can be precisely described by a sum of lognormal curves. Experimental results achieved during the initial phases of our research have shown that the application of the Sigma-Lognormal Model indeed permits the generation of curved trajectories bearing a strong resemblance to those seen in graffiti art, and further that are similar to those executed by an experienced drawing hand. In order to recognise strokes and their associated gestures, we need to develop a concise yet robust representation, adequate for computer vision methods, which can both inform our analysis of the human performance and drive a robotic platform to reproduce or learn by example how to create a similar artistic piece.
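To make the asymmetric 'bell'-shaped speed curve concrete, a single stroke's lognormal velocity profile can be sketched as follows (a minimal sketch after Plamondon's formulation; the specific parameter values and time grid below are illustrative assumptions, not fitted values):

```python
import numpy as np

def lognormal_velocity(t, t0, mu, sigma, D):
    """Speed profile of one SLM stroke: a lognormal bell, asymmetric in time
    (fast rise, slow decay). D scales the profile so the stroke covers a
    total distance of approximately D."""
    v = np.zeros_like(t, dtype=float)
    m = t > t0                        # the lognormal is defined only for t > t0
    x = t[m] - t0
    v[m] = (D / (sigma * x * np.sqrt(2.0 * np.pi))) * \
           np.exp(-(np.log(x) - mu) ** 2 / (2.0 * sigma ** 2))
    return v

# example: one stroke of magnitude D = 1 starting at t0 = 0
t = np.linspace(0.0, 3.0, 3000)
v = lognormal_velocity(t, t0=0.0, mu=-1.0, sigma=0.3, D=1.0)
# the area under the speed curve approximates the stroke magnitude D
assert abs(v.sum() * (t[1] - t[0]) - 1.0) < 1e-2
```

Summing several such time-shifted profiles (one per stroke, with overlapping windows) yields the smooth compound velocity of a full trajectory.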
We present in the rest of this paper our proposed representation and its algorithmic implementation in a typical computer vision loop, followed by some early results in characterising typical human body movements and gestures as well as some graffiti drawing gestures.

III. MEDIALNESS MEASURES & DOMINANT POINTS

We summarise here the method of extraction of dominant points we recently introduced [10]. Mathematically, following Kovács et al. [7], the medialness of a point in the image space is defined as the containment of the longest segment of boundary (information, edges) falling into the annulus of thickness parametrised by the tolerance value (ε) and with interior radius taken as the minimum radial distance of the point from the boundary (Fig. 3). On completion of medialness measurements, each pixel in the transformed image space holds local shape information (of accumulated medialness). Assuming figure-ground separation, thickness variations, bulges and necks of an object are captured via the interior medialness measurement, while concavities and joints are defined via the exterior medialness measurement. In the work of Kovács et al. it is shown that humans are most sensitive to a small number of localised areas of
Fig. 4. Medialness and tolerance: variation in Dε is illustrated for increasing values of ε = 2, 8, 16.
Fig. 3. Adapted from [7]. The Dε function for a simple shape, defined as a sum of boundary segments falling inside the annulus neighbourhood of thickness ε (thick boundary segments within the grey ring) centred around the circle (with centre p). M(p) is taken as the minimum radial distance from point p to the nearest contour point.
medialness [7]. Our equivalent (extended) notion is defined as dominant points and can be applied to any object, animated or not. Dominant points are constrained to be a relatively small number of points of high medialness, obtained by filtering out the less informative, redundant and noisy data from the initial medialness image space. The recovery of internal and external dominant points as well as convex points is detailed below (§III-B); together they are the representative feature points of the shape. The computational scheme can then be applied to a set of frames in a movement sequence.

A. Medialness Measure

A medial point is defined by computing the Dε function based on an equidistant metric (to boundary segments). The Dε value at any locus in space represents the degree to which this position is associated with a percentage of bounding contour pixels of the object within a tolerance of value ε (after Kovács et al. [7]; Fig. 3). Formally, Dε is defined as:

Dε(p) = (1/T) ∫_{|p−b(t)| ≤ M(p)+ε} db,

for any point p = [xp, yp], with the vector b(t) = [x(t), y(t)] describing the 2D bounding contour B of the object, and the normalising factor T = ∫_{b∈B} db. The metric M(p) is taken as the smallest distance between p and the bounding contour: M(p) = min_{0≤t≤1} |p − b(t)|. In Fig. 4 the
medialness measurement is performed on a wide open hand, showing the variation in the internal medialness measurement (different values of the Dε function) when augmenting the tolerance value (ε), which reflects a smoothing effect: as ε increases, smaller symmetries are discarded in favour of large-scale ones (external measurements are described later).

B. Dominant Points Extraction

Each point in the transformed image space has some local shape information in the form of medialness for a globally selected tolerance ε. Among these points, some have, cumulatively, enough information to represent parts or ultimately the whole object.
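To make the Dε measure concrete, here is a minimal discretised sketch: for each pixel p we find the nearest-contour distance M(p) and count the fraction of contour samples falling inside the annulus [M(p), M(p)+ε]. The function name, pixel grid and brute-force distance computation are our choices for illustration, not the paper's implementation:

```python
import numpy as np

def medialness_map(contour, shape, eps):
    """D_eps medialness (after Kovacs et al.): fraction of contour samples
    within the annulus [M(p), M(p)+eps] at each pixel p of the grid."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    # distance from every pixel to every contour sample (fine for small grids)
    d = np.linalg.norm(pts[:, None, :] - contour[None, :, :], axis=2)
    M = d.min(axis=1)                         # nearest-contour distance M(p)
    inside = d <= (M[:, None] + eps)          # samples inside the annulus
    D = inside.sum(axis=1) / len(contour)     # normalised by contour length T
    return D.reshape(h, w)

# toy example: a circular contour of radius 10 centred at (16, 16)
t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
contour = np.stack([16 + 10 * np.cos(t), 16 + 10 * np.sin(t)], axis=1)
D = medialness_map(contour, (32, 32), eps=3.0)
# the centre (a perfect medial point) scores far higher than an off-centre point
assert D[16, 16] > D[16, 25]
```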
Fig. 5. The 3 successive steps in isolating internal dominant points: (i) medialness representation of (the interior of) an arm+torso (in a drawing pose); (ii) top-hat transform; and (iii) internal dominant points shown as enlarged black dots.
1) Internal Dominant Points: Medialness increases with "whiteness" in our transformed images (which also proves useful for visualisation). To select points of internal dominance, a "white" top-hat transform is applied, resulting in a series of bright white areas. This transform is defined as the difference of an input function (here an image of medialness measures as a grey-level 2D function) with the morphological opening of this function by a flat structural element (a disk parametrised by its radius as a function of ε). Opening is a set operator on functions which "removes" small objects from the foreground of an image, placing them in the background (augmenting the local function set values) [23]. This filtering is followed by a thresholding to discard remaining areas of relatively low medialness significance. Fig. 5 (middle) shows the result obtained after applying the white top-hat transform on the medialness image of Fig. 5 (left). The output of the top-hat transform still requires further processing to isolate the most dominant points amongst the remaining selected medialness points, which tend to form clusters. To do so, a flat circular structuring element of radius ε/2 (but of at least 2 pixels in width) is applied over the top-hat image, such that within the scope of the structural element, only the locus which maximises medialness is selected. We further impose that no remaining points of locally maximised medialness are too close; this is currently implemented by imposing a minimum distance of 2ε between any pair of selected points. We have found that in practice this is sufficient to avoid clustering interior dominant points (an example of the result of this filtering is shown in Fig. 5 (right)).

2) External Dominant Points: When an object can be deformed or is articulated, salient concavities can be identified in association with those deforming or moving areas (such as the joints and limbs of a robot or human body).
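The internal dominant point pipeline described above (white top-hat, relative thresholding, then minimum-distance pruning of local maxima) can be sketched as follows. The square structuring element, the relative threshold value and the greedy strongest-first pruning are our simplifications of the paper's disk-based scheme:

```python
import numpy as np
from scipy import ndimage as ndi

def internal_dominant_points(medialness, eps, thresh=0.5):
    """Select internal dominant points from a medialness image: white top-hat
    to keep narrow bright peaks, threshold away weak responses, then keep
    local maxima that are at least 2*eps apart (strongest first)."""
    r = max(2, int(eps / 2))                        # element radius ~ eps/2
    tophat = ndi.white_tophat(medialness, size=(2 * r + 1, 2 * r + 1))
    tophat[tophat < thresh * tophat.max()] = 0.0    # relative threshold
    # local-maximum test within the element's footprint
    maxima = (tophat == ndi.maximum_filter(tophat, size=2 * r + 1)) & (tophat > 0)
    ys, xs = np.nonzero(maxima)
    order = np.argsort(-tophat[ys, xs])             # strongest peaks first
    kept = []
    for y, x in zip(ys[order], xs[order]):          # greedy 2*eps pruning
        if all((y - ky) ** 2 + (x - kx) ** 2 >= (2 * eps) ** 2 for ky, kx in kept):
            kept.append((int(y), int(x)))
    return kept

# toy example: two blurred peaks yield two well-separated dominant points
img = np.zeros((40, 40))
img[10, 10] = img[30, 30] = 1.0
img = ndi.gaussian_filter(img, sigma=2.0)
pts = internal_dominant_points(img, eps=4.0, thresh=0.3)
```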
The location of an external dominant point can be made invariant to this deformation/articulation only up to a certain extent. For example, if the location of an external dominant point is initially relatively far away from the corresponding contour segment, a slight change in the boundary shape near the movable part (such as an arm or finger movement) can considerably change the position of that associated dominant point (Fig. 6). On the other hand, if a point is located very close to the contour, it can easily be due to noise or small perturbations of the boundary.

Fig. 6. External medialness processing on a hand with a moving thumb. The articulated movement of the thumb changes the location and orientation of the associated external dominant point. If located far from the contour, it then proves difficult to retrieve a (shape-based) match with the modified form. At the top, for the zoomed-in areas, blue arrows show the local support for concavity while the green arrow indicates the direction of flow of medialness (away from the concavity).

Fig. 7. Illustration of interior medialness processing on a hand to identify significant convexities in the vicinity of the ends of a medialness trace.

In practice, to resolve these issues, we define a candidate external dominant point as a "local concavity" if it falls under a threshold angular region, with a constraint on the length of support which itself depends on the tolerance value (ε). The value of the threshold (θout) is tunable but is always less than π, which permits control of the angular limit of the concave region. A locus whose concavity is larger than θout is considered a flat point. In our experiments we tuned the value of θout from 5π/6 to 8π/9. In association, we define an external circular region (of radius a function of ε) centred at each concavity containing candidate external dominant points. Each such region may provide only one representative dominant point, as a function of the maximum containment of boundary points inside the associated annulus (of medialness), corresponding to (our definition of) the maximum length of support. Finally, we position the representative dominant point near and outside the contour at a fixed distance.

3) Convex Points: Our final shape feature is a set of convex points, where an object has a sharp local internal bending that signals a blob-like part or significant internal curvature structure (i.e. a peak in curvature with large boundary support). The goal is to represent an entire protruding sub-structure using one or a few boundary points. Such protrusions are known to be important contributors in characterising shape [17], [9]. The process of extraction of
convex points is very similar to the extraction of concave loci (Fig. 7), one difference being the value of the threshold angle (θin), where π < θin ≤ 2π. In our experiments we have found useful values to be in the range 5π/4 to 4π/3. Convex and concave points are complementary to each other and have been used in the "codon" theory of shape description: a codon is delimited by a pair of negative curvature extrema denoting concavities and a middle representative positive maximum of curvature denoting a convexity [17]. In our case we associate these two sets with the extremities of the (extended) medial axis of H. Blum: end points of interior branches correspond to centres of positive extrema of curvature, and end points of exterior branches are mapped to negative extrema of curvature of the boundary. The repositioning of these extrema near the boundary is akin to the end points of the PISA (Process Inferring Symmetry Axis) representation of M. Leyton [11]. Together, the three sets (concave, convex and interior dominant) form a rich point-based description of medialness allowing us to efficiently address applications with articulated movement in real image data.

C. Articulation

Anatomically, an animal's articulated movement depends on the point of connection between two bone elements of a skeleton. Our results show that concave points (representative exterior dominant points) have good potential to indicate and help trace such articulations in time, unless the shape is highly deformed or flattened. For usual movements, such as walking, jogging and gesticulating, these feature points remain present and identifiable in association with an underlying bone junction and hence can provide a practical signature for it; examples of this property are given in Fig. 8 and Fig. 10.

IV. GESTURE TRANSFER

We present here briefly our approach to transferring gesture commands to robotic systems using our proposed shape-based feature point sets. We use the Sigma-Lognormal Model (SLM)
Fig. 8. Two samples from an articulated set of Vitruvian men (after L. da Vinci, 1490) illustrating some features of our perception-based selection of dominant points.
of Plamondon et al. for this purpose, which depends on the pre-identification of way-points indicative of a spatial scaffold for gestures [15], [16]. The SLM describes a hand trajectory as a vectorial sum in time of "stroke" functions. Each stroke function is defined by a set of parameters Si = {t0, σ, µ, θ1, θ2, D} where: t0 is the initial time (in seconds) of the stroke, σ is the log of the response time of the stroke, µ is the log of the delay of the stroke, θ1 is the initial direction of the stroke, θ2 is the final direction of the stroke, and D is the magnitude of the stroke. By overlapping in time a set of stroke functions it is possible to generate complex trajectories similar to the ones seen in handwriting gestures (Fig. 2(b,c)). A greater time overlap will create smoother trajectories. To simplify the user definition of an SLM trajectory, we can define a set of way-points W (or virtual targets) for the trajectory and then compute the θ1, θ2 and D parameters from the direction and magnitude of the vector between each pair of consecutive points Wi and Wi+1. Given a stroke curvature parameter θd we compute θ1 and θ2 as follows: d = Wi+1 − Wi, θ = arctan(dy/dx), then D = |d|, θ1 = θ + θd and θ2 = θ − θd. The calculation of θ1 and θ2 from a base direction θ and a curvature parameter θd guarantees that the generated trajectory follows the given way-points and allows control of the curvedness of each stroke by varying the θd parameter. By using this re-parameterization it becomes easier to fit the SLM trajectory to a set of input points, which can be the result of a gesture recognition system (such as our medialness-based feature points). One of the main difficulties in fitting SLM parameters to an input trajectory is finding the optimal number of strokes (or way-points) for its generation. In fact there can potentially be an infinite number of stroke combinations that achieve the same trajectory.
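The way-point re-parameterisation above can be sketched as follows. The dictionary layout and the fixed timing values (t0 offsets, µ, σ) are illustrative placeholders of our own, not the model's fitted values; only the D, θ1, θ2 computation follows the text:

```python
import numpy as np

def strokes_from_waypoints(W, theta_d=0.4, dt=0.1, sigma=0.3, mu=-1.0):
    """For each consecutive way-point pair (W_i, W_{i+1}): stroke magnitude
    D = |d| and base direction theta = atan2(dy, dx), then
    theta1 = theta + theta_d and theta2 = theta - theta_d.
    Stroke onsets t0 are staggered by dt so successive strokes overlap."""
    W = np.asarray(W, dtype=float)
    strokes = []
    for i in range(len(W) - 1):
        d = W[i + 1] - W[i]
        theta = float(np.arctan2(d[1], d[0]))
        strokes.append({
            "t0": i * dt,                 # overlapping time windows
            "sigma": sigma, "mu": mu,     # lognormal timing (placeholders)
            "theta1": theta + theta_d,    # initial direction
            "theta2": theta - theta_d,    # final direction
            "D": float(np.hypot(d[0], d[1])),
        })
    return strokes

# example: an L-shaped pair of strokes through three way-points
S = strokes_from_waypoints([(0, 0), (1, 0), (1, 1)], theta_d=0.4)
assert len(S) == 2 and abs(S[0]["D"] - 1.0) < 1e-9
```

Varying theta_d per stroke then bends each segment away from the straight chord while still passing through the way-points.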
To reduce the search-space and simplify the process, we assume that, given an input set of n points I, the number of way-points corresponds to the number of points of maximum curvature (within a user-specified threshold) plus the initial and final points I1 and In. To compute the curvature, and subsequently the error in the optimization process, we generate a linearly subdivided version of the input poly-line S(I, ns). We are
Fig. 9. (a) The input trajectory provided by the medialness method. (b) The input trajectory smoothed with centripetal Catmull-Rom interpolation and with the curvature extrema identified (red circles). (c) Fitted trajectory over the smoothed input trajectory and the fitted way-points (in orange).
currently experimenting with different methods for creating smoother input curves that aid in finding curvature extrema, such as interpolating through the input points with a centripetal Catmull-Rom spline [24] (Fig. 9). Once a set of potential way-points has been identified, we use a stochastic optimisation method to evaluate the parameters for each stroke function that will best fit the input trajectory. Currently we are using Simulated Annealing (SA) for this purpose [8]. The search begins by creating an initial SLM trajectory along the way-points we have computed by finding the curvature extrema of the input trajectory. During the SA process, random moves in search-space are made by taking a random stroke S of the current hypothesized trajectory and offsetting its parameters randomly. The random moves are made on the positions of the way-points and by varying the θd parameter. In addition, each random move has the following constraints: (i) the initial and final way-points always correspond to the initial and final points of the input trajectory; (ii) if the number of curvature extrema of the trajectory resulting from the hypothesis, K(H), and the number of curvature extrema of the input trajectory, K(I), match, then each way-point W is offset by (K(I) − K(H)) × δ, where 0 < δ < 1.5 is a random value, such that values δ > 1 permit extrapolation, which may help the search. The hypotheses are evaluated with an objective function that currently linearly subdivides the generated and input trajectories into n equidistant points and returns an error proportional to the sum of the squared distances between each pair of corresponding points. This system converges in practice, and it is then possible to use the generated path to drive a Cartesian drawing machine that will plot the gesture (Fig. 2(d)).

V. CONCLUSION

We have presented early results in applying our proposed point-based medialness representation to support marker-less gesture recognition from image sequences.
Medialness here refers first to the model proposed by Kovács et al. in terms of "hot spots" along the thick trace of medial measures for the interior of a biological form (in movement or in a static pose) [6]. We described here a possible algorithmic method to extract such hot spots as dominant points of interior medialness. We extend this representation by augmenting it with notions of concave and convex points located near the outline and based
Fig. 10. Some examples of tracking three gestures of a hand and (bottom row) of the artist performing a graffiti tracing.
on local medialness measures. Note that we have applied the Kovács et al. medialness measure also to the exterior of a form, in order to characterise concavity (a further extension of their model) and to capture a part-based representation of shape (combining it with the codon and PISA models). Note also that all our parameters prior to the SLM computation are computed with respect to the tolerance level in medialness, ε, which is currently globally set to a single value. One area of future experimentation will be to evaluate the use of multi-scale medialness (e.g. refer to Fig. 4) versus a locally varying ε (akin to adaptive/anisotropic smoothing). In Fig. 2(d) we illustrate our current work on tracking persistent dominant points over time in order to identify useful way-points which can then be fed into our robotic platforms under the SLM of Plamondon et al. [16] to generate gesture traces akin to those performed by the human artist. Our current graffiti-writing robots are simple Cartesian platforms, and we plan to experiment with articulated arms as well as with visual feedback, in continuation of our previous work done in the context of face portrait systems [21].

ACKNOWLEDGEMENT

We thank the EU project CEEDs (FP7-FET-ICT-258749) which supports in part the PhD studies of P. Aparajeya.

REFERENCES

[1] X. Bai, W. Liu, and Z. Tu. Integrating contour & skeleton for shape classification. In ICCV Workshops, pages 360–7, 2009.
[2] L. Chen, R. Feris, and M. Turk. Efficient partial shape matching using the Smith-Waterman algorithm. In CVPR Workshops, pages 1–6, 2008.
[3] L. Gottlieb. Graffiti Art Styles. McFarland, 2008.
[4] M. Kelly and M. D. Levine. Annular symmetry operators: A method for locating and describing objects. In ICCV, pages 1016–21, 1995.
[5] B.B. Kimia. On the role of medial geometry in human vision. Journal of Physiology – Paris, 97(2):155–90, 2003.
[6] I. Kovács. "Hot spots" and dynamic coordination. In Dynamic Coord. in the Brain, pages 215–28. MIT Press, 2010.
[7] I. Kovács, Á. Fehér, and B. Julesz. Medial-point description of shape. Vision Research, 38(15):2323–33, 1998.
[8] P. Van Laarhoven and E. Aarts. Simulated Annealing: Theory and Applications. Springer, 1987.
[9] F. Leymarie and M. D. Levine. Simulating the grassfire transform. IEEE PAMI, 14(1):56–75, 1992.
[10] F. Fol Leymarie et al. Point-based medialness for movement computing. In ACM Proc. of MOCO, pages 31–6, Paris, France, 2014.
[11] M. Leyton. A process-grammar for shape. Artificial Intelligence, 34(2):213–247, 1988.
[12] H. Ling and D. Jacobs. Shape classification using the inner-distance. IEEE PAMI, 29(2):286–99, 2007.
[13] Y. Mingqiang et al. A survey of shape feature extract. tech. In Pattern Rec. Tech. & Applic., pages 43–90. InTech, 2008.
[14] R. Pfeifer et al. The challenges ahead for bio-inspired 'soft' robotics. Communications of the ACM, 55(11):76–87, 2012.
[15] R. Plamondon and M. Djioua. A multi-level representation paradigm for handwriting stroke generation. Human Movement Science, 25(4–5):586–607, 2006.
[16] R. Plamondon et al. The lognormal handwriter: Learning, performing and declining. Frontiers in Psychology, 4(945), 2013.
[17] W. Richards and D. D. Hoffman. Codon constraints on closed 2D shapes. CVGIP, 31(3):265–81, 1985.
[18] T.B. Sebastian, P.N. Klein, and B.B. Kimia. Recognition of shapes by editing their shock graphs. IEEE PAMI, 26(5):550–71, 2004.
[19] P. Tresset and F. Fol Leymarie. Generative portrait sketching. In Proc. of 11th VSMM, pages 739–48, Ghent, Belgium, 2005.
[20] P. Tresset and F. Fol Leymarie. Human robot interaction and drawing. Bulletin de l'AFIA (Association Française pour l'Intelligence Artificielle), (78):44–9, October 2012.
[21] P. Tresset and F. Fol Leymarie.
Portrait drawing by Paul the robot. Computers & Graphics, 37(5):348–63, August 2013.
[22] G.J. van Tonder and Y. Ejima. Flexible computation of shape symmetries within the max. disk paradigm. IEEE SMC, Part B, 33(3):535–40, 2003.
[23] L. Vincent. Morphological grayscale reconstruction in image analysis. IEEE Trans. on Image Processing, 2(2):176–201, 1993.
[24] C. Yuksel et al. Parameterization and applications of Catmull-Rom curves. Computer Aided Design, 43(7):747–55, 2011.