COVER FEATURE
A Language for Human Action

Gutemberg Guerra-Filho and Yiannis Aloimonos, University of Maryland, College Park
An empirically demonstrated human activity language provides sensory-motor-grounded representations for understanding human actions. A linguistic framework allows the analysis and synthesis of these actions.
Human-centered computing (HCC) involves conforming computer technology to humans while naturally achieving human-machine interaction. In a human-centered system, the interaction focuses on human requirements, capabilities, and limitations. These anthropocentric systems also focus on the consideration of human sensory-motor skills in a wide range of activities. This ensures that the interface between artificial agents and human users accounts for perception and action in a novel interaction paradigm. In turn, this leads to behavior understanding through cognitive models that allow content description and, ultimately, the integration of real and virtual worlds.

We seek to provide a flexible representation, proposed here as human activity language (HAL), for compactly modeling hundreds of human actions. This structure—organized in terms of syntax, morphology, and kinetology—has the flexibility required to model numerous behaviors using a language's parsing and generation aspects. Further work requires exploring the morphological organization of a praxicon—a lexicon of human motion—to discover more structure in a HAL.

As Figure 1 shows, we can envision expressing actions in at least three languages: sensory, motor, and natural. The sensory language lets us perceive and understand actions, the motor language lets us produce actions, and the natural language lets us communicate about actions. Our work focuses on building a language that maps to the lower-level sensory and motor languages and to the higher-level natural language. By modeling
actions as a language in each space, we can formulate many interesting problems as translation problems, such as

• video annotation to create descriptions of activity,
• natural-language-driven character animation,
• training robots by imitation using video, and
• controlling robots with natural language.
These problems lie at the core of human-centered computing.
CONCEPTUAL REPRESENTATIONS
One important aspect of HCC is the need for computers to be able to share a conceptual system with humans. Concepts are the elementary units of reason and linguistic meaning. Many researchers hold the philosophical position that all concepts are symbolic and abstract and therefore should be implemented outside the sensory-motor system. On this view, the meaning of a concept amounts to the content of a symbolic expression: a definition of the concept in a logical calculus.

An alternative approach holds that concepts are grounded in sensory-motor representations. This sensory-motor intelligence considers sensors and motors in the shaping of hidden cognitive mechanisms and in the incorporation of knowledge. A variety of studies in many disciplines—such as neurophysiology, psychophysics, and cognitive linguistics—suggest that the human sensory-motor system is indeed deeply involved in concept representations.
Linguistic framework
The functionality of Broca's region in the brain1 and the mirror neuron theory2 suggest that perception and action share the same symbolic structure, which provides common ground for sensory-motor tasks such as recognition and motor planning, along with higher-level activities. Further, spoken language and visible movement use a similar cognitive substrate based on the embodiment of grammatical processing. Some research shows that language is grounded in the motor system,3 which implies the possibility of using a linguistic framework for a grounded representation.

Computers in an HCC environment could become more powerful if they incorporated models of human actions. Thus, we investigate the involvement of sensory-motor intelligence in concept description and, more specifically, the structure in the space of human actions. In the sensory-motor intelligence domain, our scope centers on the representation level of human activity. Our sensory-motor linguistic framework contributes to the modeling of human actions. An artificial cognitive system with sensory-motor representations can learn skills through imitation, better interact with humans, and understand human activities. This understanding includes reasoning and the association of meaning with concrete concepts. Closing the semantic gap will involve grounding concepts in sensory-motor information.
Closing the gap
This process might start from video, where objects are detected, recognized, and categorized. At this level, human body parts become features extracted from visual input, allowing the capture of human movement. We seek information on human actions that correspond to observable voluntary meaningful movement. Motion-capture data is processed to discover structure in this space. This input contains the essential 3D specification of human movement necessary for mapping visual and motor spaces.

We hypothesize that there exists a language—in a formal sense—that describes all human action, and show how this language can be obtained using empirical data. We introduce a human activity language that consists of kinetology, morphology, and syntax. Kinetology—the phonology of human movement—finds basic primitives for human motion (segmentation) and associates them with symbols (symbolization). Kinetology provides a symbolic representation for human movement that allows synthesis, analysis, and symbolic manipulation. Kinetology also has applications for compressing, decompressing, and indexing motion data.

Figure 1. Three language spaces for human action. Courtesy of Abhijit Ogale.

The morphology of a human action relates to the essential parts of the movement and its structure. A single action morpheme represents the least amount of movement with a purposeful goal. To learn action morphemes, we present a grammatical inference methodology and introduce a parallel learning algorithm to induce a grammar system representing a single action.

The syntax of human activities involves the construction of sentences using action morphemes. A sentence can range from a single action morpheme, or nuclear syntax, to a sequence of sets of morphemes. A single morpheme decomposes into analogs of lexical categories: nouns, adjectives, verbs, and adverbs. The sets of morphemes represent simultaneous actions (parallel syntax), and a sequence of sets of morphemes relates to the concatenation of activities (sequential syntax).

We use a motion-capture database to validate our linguistic framework. Our database contains around 200 different actions that correspond to concrete verbs associated with observable voluntary meaningful movement. These actions are not limited to any specific domain. Instead, the database includes actions of several types: manipulative (prehension and dexterity), nonlocomotor, locomotor, and interaction.
Motor primitives
Researchers in various disciplines have touched on the idea of primitives in human movement; such primitives provide the first step toward a language. Indeed, recent work provides evidence that voluntary actions are made from simpler elements that connect to each other either serially or in parallel.4,5 To some scientists, motor primitives basically amount to motor schemas or control modules,6,7 and they can be specific to a task. Their basic feature is that appropriate transformations can derive many different movements from a limited number of primitives, and these movements can be combined using a well-defined set of rules to form more complex actions.8

At the behavioral level, motor primitives have been examined for human and monkey grasping as well as
for object manipulation. Prehension—such as lifting a full cup—consists of reaching, orienting the hand, and grasping. Even though these three actions can be combined in many ways, they are executed as a unified coordinated complex act.9

More recently, muscle synergies in a cat during a postural task have been identified using the non-negative matrix factorization technique.10 Because several synergies were assumed to act on a given muscle, the total activation of that muscle is the sum of the activations due to all the synergies.

Of particular importance here is the work of Andrea d'Avella and Emilio Bizzi.11 They employed a similar approach to extract amplitudes and timing relationships among muscle activations during more natural behaviors in intact animals. A combination of synergies that were shared across behaviors and those that were specific to particular behaviors captured the invariance across the entire observed data set. These results support a modular organization of the motor controller and indicate that the motor output of these modules is combined to control a large set of behaviors.

In a monkey cortex, electrical microstimulation in the primary motor and premotor cortex causes complex movements involving many joints and even several body parts. These actions closely resemble gestures in the monkey's natural repertoire. Microstimulation at each site causes the arm to move to a specific final posture. Thus there appears to be evidence for a cortical map of joint angles, or a cortical representation of limb or body postures. There also appears to be growing evidence of cortical coding not only of kinematic and dynamic variables but also of more global features such as a segment's geometrical shape or the order of the segments within a sequence.12

In our linguistic framework, we initially aim to find movement primitives as basic atoms. Kanav Kahol and colleagues13 used the local minimum in total body force to detect segment boundaries. Atsushi Nakazawa and colleagues14 measured similarities of motion segments according to a dynamic programming distance, then clustered these with a nearest-neighbor algorithm.

Grammatical inference
Here we pose the morphology of human activity as a grammatical inference problem. Grammatical inference concerns the induction of the grammar of a language from a set of labeled sentences. The inference consists of learning a set of rules for generating the valid strings that belong to the language. The target grammar usually belongs to the Chomsky hierarchy of formal grammars. There are several methods for learning regular grammars, context-free grammars (CFGs), and stochastic variations.15 Current approaches16-18 account only for sequential learning and not for the parallel learning we introduce.

Figure 2. A context-free grammar (CFG) component of a parallel synchronous grammar system (PSGS), shown as a binary tree forest.

We define sequential learning as a technique that infers the structure of a single sequence of symbols A. This structure corresponds to a forest of binary trees, as shown in Figure 2, where each node in a tree is associated with a CFG rule in a normal form. Initially, the sequential learning algorithm computes the number of occurrences of each distinct digram in the string A; a digram is a pair of adjacent symbols. A new grammar rule Nc is created for the digram with the current maximum frequency, and the algorithm replaces each occurrence of that digram in the string A with the created nonterminal Nc. The entire procedure is repeated until no digram occurs more than once. As an example, the set of rules inferred for the CFG displayed in Figure 2 is {N1 → AB, N2 → CD, N3 → EF, N4 → BN1, N5 → N2N3, N6 → N5G, N7 → N6N4}. A sequential learning algorithm keeps merging adjacent root nodes into single rules, and, consequently, overgeneralization occurs when "unrelated" rules are generalized.
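To make the procedure concrete, here is a minimal Python sketch of a digram-substitution learner in the spirit of the sequential approaches cited above. The function name, input encoding, and tie-breaking are our own illustration, not the authors' implementation; a full version would also keep the binary-tree bookkeeping that Figure 2 depicts.

```python
from collections import Counter

def sequential_learning(symbols):
    """Digram-substitution induction: repeatedly replace the most
    frequent digram (pair of adjacent symbols) with a fresh
    nonterminal until no digram occurs more than once, recording
    one normal-form CFG rule per replacement."""
    rules, next_id, s = {}, 1, list(symbols)
    while True:
        digrams = Counter(zip(s, s[1:]))
        if not digrams or digrams.most_common(1)[0][1] < 2:
            break  # no digram repeats; the grammar is complete
        pair = digrams.most_common(1)[0][0]
        nt = f"N{next_id}"
        next_id += 1
        rules[nt] = pair
        out, i = [], 0  # replace non-overlapping occurrences left to right
        while i < len(s):
            if i + 1 < len(s) and (s[i], s[i + 1]) == pair:
                out.append(nt)
                i += 2
            else:
                out.append(s[i])
                i += 1
        s = out
    return rules, s

rules, roots = sequential_learning("ABABCDEFGCDEF")
# rules == {'N1': ('A', 'B'), 'N2': ('C', 'D'),
#           'N3': ('N2', 'E'), 'N4': ('N3', 'F')}
```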
KINETOLOGY
As an initial step in our linguistic framework, we sought to find basic primitives for human movement. These motion primitives are analogous to phonemes in spoken language. Just as phonemes are units of phonic origin, motion primitives are units of kinetic origin, which we refer to as kinetemes. In this sense, kinetology studies movement systems as the foundation for a kinetic language. In addition to a geometric representation for human movement, a kinetological system consists of segmentation, symbolization, and principles.

Segmentation
Automatic segmentation decomposes action sequences into movement primitives. These primitives are meaningful atomic elements with characteristic properties that, within a segment, stay constant. To segment human movement, we consider each joint actuator independently. Each joint angle appears as a one-dimensional function over time, as Figure 3 shows. We associate an actuator with a joint angle that specifies the actuator's original 3D motion according to a geometric representation, as Figure 3a shows. The segmentation process assigns one state to each instant of the movement for a given actuator. Contiguous instants assigned to the same state belong to the same segment.
We define a state according to the signs of the derivatives of a joint-angle function, as Figure 3b shows. In our segmentation, we use velocity (the first derivative) and acceleration (the second derivative). This leads to a four-state system: positive velocity/positive acceleration (blue), positive velocity/negative acceleration (green), negative velocity/positive acceleration (yellow), and negative velocity/negative acceleration (red). A kinetological system can be defined in more complex ways that consider higher-order derivatives or in simpler ways. A simpler system could use only the first derivative; in that case, we would have only two states: positive velocity (blue) and negative velocity (red).

The representation has a qualitative aspect—the state of each segment—and a quantitative aspect corresponding to the time length and angular displacement of each segment. The qualitative aspect is depicted with the colors already specified, while the quantitative aspect is represented by the line segment's length (for duration) and thickness (for angular displacement), as Figure 3b shows.
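A minimal sketch of this four-state segmentation, assuming finite-difference derivative estimates and a fixed sampling rate (both our assumptions), which emits each segment's qualitative state together with its quantitative duration and angular displacement:

```python
import numpy as np

def segment_joint_angle(theta, dt=1.0 / 120.0):
    """Assign one of four states to each frame from the signs of the
    first (velocity) and second (acceleration) derivatives, then merge
    contiguous frames with the same state into segments described by
    (state, duration, angular displacement)."""
    v = np.gradient(theta, dt)  # velocity estimate
    a = np.gradient(v, dt)      # acceleration estimate
    # 0: +v/+a (blue), 1: +v/-a (green), 2: -v/+a (yellow), 3: -v/-a (red)
    states = np.where(v >= 0, 0, 2) + np.where(a >= 0, 0, 1)
    segments, start = [], 0
    for i in range(1, len(states) + 1):
        if i == len(states) or states[i] != states[start]:
            segments.append((int(states[start]),
                             (i - start) * dt,                     # duration
                             float(theta[i - 1] - theta[start])))  # displacement
            start = i
    return segments

angles = np.sin(np.linspace(0, 2 * np.pi, 240))  # synthetic joint-angle curve
print(segment_joint_angle(angles))
```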
Symbolization
The second aspect of our kinetological system, symbolization, amounts to classifying motion segments such that each class contains variations of the same motion. This way, each segment is associated with a symbol representing the cluster that contains motion primitives with a similar spatiotemporal structure, as Figure 3c shows. Clustering—using an appropriate similarity distance for segments with the same atomic state—offers a simple way to perform symbolization.

Symbolization results in a set of strings for the whole-body motion that defines a structure, an actiongram, shown in Figure 3d. An actiongram A has n strings A1, …, An. Each string Ai corresponds to an actuator and contains a possibly different number mi of symbols. Each symbol Ai(j) is associated with a segment: its symbol, time period, and angular displacement.
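One simple realization of such a clustering, assuming a greedy scheme over (duration, displacement) features with an arbitrary relative tolerance; the actual similarity distance is a design choice the text leaves open:

```python
def symbolize(segments, tol=0.25):
    """Greedy clustering: within each state, a segment joins the first
    existing cluster whose centroid lies within a relative tolerance in
    (duration, displacement); otherwise it founds a new cluster.
    Returns one symbol per segment, e.g. 'S2_3' = state 2, cluster 3."""
    clusters = {}  # state -> list of (duration, displacement) centroids
    symbols = []
    for state, dur, disp in segments:
        cents = clusters.setdefault(state, [])
        for k, (cd, cx) in enumerate(cents):
            if (abs(dur - cd) <= tol * cd
                    and abs(disp - cx) <= tol * max(abs(cx), 1e-9)):
                symbols.append(f"S{state}_{k}")
                break
        else:  # no cluster matched: open a new one
            cents.append((dur, disp))
            symbols.append(f"S{state}_{len(cents) - 1}")
    return symbols
```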
Principles
In addition to sensory-motor primitives, we suggest five kinetological properties for evaluating our approach or any other: compactness, view-invariance, reproducibility, selectivity, and reconstructivity. Figure 4 illustrates some of these principles.

The compactness principle relates to describing a human activity with the least number of atoms. We achieve compactness through segmentation, which reduces the representation's number of parameters. We implemented our segmentation approach as a compression method for motion data, tested its compression efficiency on several different actions, and recorded a median compressed size of 3.698 percent of the original file size. We achieved the best compression for actions with smooth movement. Further compression could be achieved through symbolization.
Figure 3. Segmenting human movement. Each joint angle appears as a one-dimensional function over time: (a) geometric representation, (b) segmentation, (c) symbolization, and (d) actiongram.
An action representation should be based on primitives robust to variations in the image-formation process. The view-invariance principle regards the effect of projecting a 3D representation of human movement into a 2D representation according to a vision system. A view-invariant representation provides the same 2D projected description of an intrinsically 3D action captured from different viewpoints. The view-invariance evaluation requires a 2D-projected version of the initial representative function according to varying viewpoints. We use a circular surrounding configuration of viewpoints. As Figure 4a shows, a view-invariance graph shows the state associated with the movement at each time instant (horizontal axis) and for each viewpoint in the configuration (vertical axis). For any joint and any action in our database, the graph demonstrates a high view-invariance measure for our segmentation process, with the only exceptions at segment borders and at two degenerate viewpoints.
Figure 4. Kinetological properties. (a) A view-invariance graph of the knee flexion/extension angle; (b) reproducibility of pelvic obliquity during gait; (c) selectivity shown as different representations for three distinct actions—(i) walk, (ii) run, and (iii) jump; and (d) reconstructivity, in which the top line shows the original motion sequence and the bottom line shows the decompressed sequence.
The reproducibility principle requires an action to have the same description even when considering a different performance of that action. A kinetological system is reproducible when we associate the same symbolic representation with the same action performed on different occasions (intrapersonal) or by different subjects (interpersonal). To evaluate the reproducibility of our kinetological system, we used human gait data for 16 subjects, covering males and females at several ages. We computed a reproducibility measure for each joint angle, which gauges the fraction of the most representative symbolic description among all descriptions for the 16 individuals. The reproducibility measure is high for the joint angles that play a primary role in an action, as Figure 4b shows for a walking action. The identification of an action's intrinsic and essential variables is a byproduct of a kinetological system's reproducibility requirement.

The selectivity principle concerns the ability to discern between distinct actions. In terms of representation, this principle requires a different structure to represent different actions. We compare our representations of several different actions and verify whether their structures are dissimilar. We demonstrate the selectivity property using a set of actions performed by the same individual and consider four joint angles: left and right hip flexion/extension, and left and right knee flexion/extension, as Figure 4c shows. Different structures clearly represent the different actions.

We associate the reconstructivity principle with the ability to reconstruct the original movement signal, up to an approximation factor, from a compact representation. We propose a reconstruction method based on a novel interpolation algorithm that considers the kinetological structure. We consider one segment at a time and concentrate on the state transitions between consecutive segments. Based on a transition, we determine constraints on the derivatives at border points. Each possible sequence of three segments corresponds to two equations associated with the first and second derivatives at the border points of the center segment.

A polynomial provides a simple model for the joint-angle function during a segment. The least-degree polynomial satisfying all the constraints is a fourth-degree polynomial, so the reconstruction process needs to find only five parameters. The polynomial is partially determined by the two equations associated with the particular sequence of kinetemes, along with two more equations using the joint-angle values at the two border points. We obtain these values from the time length and angular displacement of each segment. The last free variable can be determined using some criterion such as jerk (third-derivative) minimization. We implemented this reconstruction scheme as the decompression method for motion data, which Figure 4d shows. The average error over our motion database was 0.823 degrees.
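The following sketch shows one way to realize this per-segment fit with NumPy: four linear constraints pin down the quartic up to one degree of freedom, and a small KKT system resolves that freedom by minimizing integrated squared jerk. The specific boundary constraints below (zero velocity at the start, zero acceleration at the end) are illustrative assumptions; the text derives the actual pair from the states of the neighboring kinetemes.

```python
import numpy as np

def reconstruct_segment(theta0, dtheta, T):
    """Fit p(t) = c0 + c1 t + ... + c4 t^4 on [0, T] subject to four
    linear constraints, then spend the remaining degree of freedom
    minimizing the integral of p'''(t)^2 dt (squared jerk)."""
    A = np.array([
        [1, 0, 0,    0,        0],          # p(0)   = theta0
        [1, T, T**2, T**3,     T**4],       # p(T)   = theta0 + dtheta
        [0, 1, 0,    0,        0],          # p'(0)  = 0 (illustrative)
        [0, 0, 2,    6 * T,    12 * T**2],  # p''(T) = 0 (illustrative)
    ], dtype=float)
    b = np.array([theta0, theta0 + dtheta, 0.0, 0.0])
    # Quadratic form of the jerk integral: p'''(t) = 6 c3 + 24 c4 t,
    # so the integral over [0, T] is 36 c3^2 T + 144 c3 c4 T^2 + 192 c4^2 T^3.
    Q = np.zeros((5, 5))
    Q[3, 3] = 36 * T
    Q[3, 4] = Q[4, 3] = 72 * T**2
    Q[4, 4] = 192 * T**3
    # KKT system for: minimize c^T Q c subject to A c = b.
    K = np.block([[2 * Q, A.T], [A, np.zeros((4, 4))]])
    rhs = np.concatenate([np.zeros(5), b])
    return np.linalg.solve(K, rhs)[:5]  # coefficients, lowest order first

coeffs = reconstruct_segment(theta0=0.3, dtheta=0.4, T=0.5)
```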
MORPHOLOGY
Morphology explores the structure of words, their constituent parts, and how these parts aggregate. In the context of a human activity language, morphology involves the structure of each action and the organization of a praxicon in terms of common subparts. Our methodology determines the morphology of each action in a praxicon, then finds that praxicon's organization.

We define a human action morpheme as the set of essential actuators intrinsically involved in the action, the synchronization among these actuators, and the corresponding motion patterns in terms of kinetemes. The morphemes are the essential parts of human actions. Because the derived motion patterns are sequences of kinetemes, we call the inference of morphemes morpho-kinetology. This part of morphology aims to select a subset of the motion that projects the whole action only onto the essential actuators and their motion patterns.

Morpho-kinetology
Essential actuators are responsible for achieving an action's intended result. Because they are strongly constrained, only these meaningful actuators will have consistent motion patterns in different performances of the same action. To learn the morphology of a human action, our input is an actiongram associated with several repeated performances of this action.

Formally, the morpheme consists of a set I representing the essential actuators for the action and, for each i ∈ I, a substring pi corresponding to the motion pattern the actuator i performs during the action. Because our input is a set of concurrent strings, we pose this problem as the grammatical inference of a grammar system modeling the human activity such that each component grammar corresponds to an actuator.

A parallel communicating grammar system (PCGS) consists of several grammar components working simultaneously in synchronization. The component grammars rewrite their own sentential forms in parallel and communicate by exchanging their current sentential forms among each other. We propose a novel grammar system, a parallel synchronous grammar system (PSGS), where strings generated by components are not shared through communication steps. The formal model suggested is based on a PCGS with rule synchronization.19 We model the synchronization among rules in different components as a set of rule tuples with, possibly, one rule for each component, where rules in a tuple are derived simultaneously. A PSGS consists of a set of CFGs related by synchronized rules. This grammar models a system with a set A of different strings Ai occurring at the same time: an actiongram. Each string Ai corresponds to the language of a component grammar Gi modeling an actuator.

We propose parallel learning to concurrently infer a grammar system as the structure of all strings A1, …, An in the actiongram A. Our parallel learning algorithm executes the sequential learning within each string Ai independently. A new nonterminal is associated with the time interval equal to the union of the time intervals of both symbols in the corresponding digram. In parallel learning, nodes are merged only if the new rule is synchronized with other rules in different CFG components of the grammar system. This way, we avoid overgeneralization because synchronization guarantees a relationship between the merged rules.

The algorithm checks a nonterminal for possible synchronized rules with nonterminals in the CFGs of other strings, as Figure 5 shows. Synchronization between two nonterminals of different CFGs requires that they have an intersecting time period in the different strings generated by their respective CFGs. The algorithm issues a synchronization that relates two nonterminals in different CFGs if there is a one-to-one mapping of their occurrences in the associated strings. Further, any two mapped occurrences must correspond to intersecting time periods.

Figure 5. Two context-free grammar components—corresponding to hip and knee flexion/extension—related by the synchronized rules of a parallel synchronous grammar system (PSGS).

To identify the essential actuators and the corresponding motion patterns, the nonterminals associated with no synchronization rules are discarded from the component CFGs. The set I of essential actuators is identified according to the set of CFGs with a considerable number of synchronized rules. For each actuator i ∈ I, the associated motion pattern pi is generated by the nonterminal in Gi whose occurrences cover the most time of the motion's duration.

Given an actiongram of a human activity, parallel learning selects a subset of the actiongram that projects the whole action only onto the action's intrinsic joint angles and motion patterns. This process was performed on each action in our motion database, and we automatically identified the morphemes: the essential actuators that participate in each action, the associated motion patterns, and their synchronization with movement in other joints.
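A sketch of the synchronization test between two nonterminals of different component grammars, assuming occurrences are available as (start, end) time intervals and that mapping them in temporal order is adequate (both our assumptions):

```python
def synchronized(occs_p, occs_q):
    """Decide whether two nonterminals from different CFG components
    can be related by a synchronized rule: their occurrence lists must
    admit a one-to-one mapping in which every mapped pair of
    occurrences has intersecting time periods."""
    if len(occs_p) != len(occs_q):
        return False  # no one-to-one mapping exists
    for (s1, e1), (s2, e2) in zip(sorted(occs_p), sorted(occs_q)):
        if max(s1, s2) >= min(e1, e2):  # empty intersection
            return False
    return True

# Hip and knee nonterminals that always co-occur in time are synchronized.
print(synchronized([(0.0, 0.4), (1.0, 1.5)], [(0.1, 0.5), (1.1, 1.4)]))  # True
```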
Morpho-syntax
We can learn further structure for the morphemes we have inferred for each action in a praxicon. This structure arises from the ordering, intersection, and repeated occurrences of kinetemes in motion patterns for the same actuator in different actions. We refer to this additional structure as morpho-syntax.

One method to infer more structure considers a single actuator i at a time. We denote by pia the motion pattern for actuator i and action a, such that i ∈ Ia, where Ia is the set of essential actuators for action a. Basically, all motion patterns pia for actuator i in different actions are described as sequences of kinetemes. These sequences altogether can be generated by a single context-free grammar that represents a more compact and efficient structure: a morphological grammar.

Initially, we perform the symbolization process on the segments associated with kinetemes in motion patterns pia for all actions. This way, segments of different actions
might become associated with the same symbol—the same kineteme or motor primitive can be found in different actions. With regard to actuator i, this symbolization results in a set of symbols that represents a unified alphabet of kinetemes for all actions in the praxicon. The motion patterns for actuator i in all actions are rewritten according to this unified alphabet. In our experiments with a total of 30 evaluated actuators, the maximum size of such an alphabet was 31 kinetemes, while the median size was 17 kinetemes.

Overlapping kinetemes in joint-angle space are considered different units, without regard to their angular intersection. To overcome this lack of structure, we subdivided the original kinetemes according to their intersections with other kinetemes in joint-angle space. In this space, a kineteme ranges from an initial angle to a final one. The border angles correspond to points where the kinetemes are subdivided. Therefore, the intervals that these angles delimit correspond to new kinetemes. The number of new kinetemes is at most twice the original number. These new kinetemes represent subparts of the original kinetemes in the motion patterns. Thus, an original kineteme becomes its sequence of subparts, and a sequence of subpart symbols replaces every instance of an original kineteme symbol in a motion pattern.

The inference of the CFG that generates motion patterns for actuator i in all actions involves applying sequential learning to a string that concatenates all these motion patterns: pia1 · pia2 · … · piak, where · denotes the concatenation operation and k is the number of actions a such that actuator i ∈ Ia. However, the counting of digram occurrences does not consider digrams whose symbols lie at the border of two consecutive patterns. For example, the set of two motion patterns BAC and DACD is concatenated as BACDACD, but the first occurrence of digram CD is not considered. This prevents the ordering of the concatenation of motion patterns from affecting the inferred grammar.
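A sketch of this border-aware digram counting (the pattern strings and counting interface are our own framing):

```python
def count_digrams(patterns):
    """Count digrams across a praxicon's motion patterns for one
    actuator, skipping pairs that straddle the border between two
    consecutive patterns, so concatenation order cannot influence
    the inferred morphological grammar."""
    counts = {}
    for p in patterns:              # each p: one kineteme sequence
        for x, y in zip(p, p[1:]):  # within-pattern digrams only
            counts[(x, y)] = counts.get((x, y), 0) + 1
    return counts

# BAC and DACD concatenate as BACDACD, but the border digram CD
# contributes nothing; only the CD inside DACD is counted.
print(count_digrams(["BAC", "DACD"]))  # ('C', 'D') counted once
```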
consecutive patterns. For example, a set of two motion ment of this string encodes the inclusion of a particular patterns BAC and DACD is concatenated as BACDACD, joint actuator in this set. but the first occurrence of digram CD is not considered. We have extracted a set of about 200 binary strings This prevents the ordering of concatenation of motion from the morphemes of our motion database. These reppatterns from affecting the inferred grammar. resent the HAL nouns in the most basic level for each The morphological grammar induced for a single actu- action. Using a noun praxicon, we can infer a grammar ator is an additional structure that compactly represents that resembles the body model’s topology. High-level all possible motion patterns for this actuator, as Figure nouns correspond to nonterminals in this grammar, 6 shows. The grammar can generate any movement in associated with body parts such as lower limb, hand, the praxicon and aid the analysis of an unknown and and head. possibly novel movement. Further, based on the new An action’s initial posture is analogous to an adjeckinetemes, the grammar explicitly tive that further describes the active considers the intersections between joints in the sentence. An action’s original kinetemes. This leads to an initial pose is retrieved from a morThe human activity important aspect of morpho-syntax: pheme as the initial joint angle of language’s syntax uses the discovery of common motions in the first kineteme in the motion different actions. The morpho-synpattern of each essential actuator. action morphemes tactic process is applied to obtain morHigher-level adjectives represent to construct sentences. phological grammars for each usual initial postures such as sit, actuator in the articulated human stand, and lie. body model. A motion verb represents the changes each active joint actuator experiences during SYNTAX the action execution. We obtain the human activity verbs The human activity language’s syntax uses action mor- from the motion patterns in the action morphemes. phemes to construct sentences. A sentence consists of a A HAL adverb models the variation in the execution group of entities. In this sense, a sentence can be as com- of each motion segment in a verb. The adverb modifies the pact as a single action morpheme or as complex as a verb to generalize the motion. For example, one instance sequence of morpheme sets. The sets of morphemes rep- of a “reach with your hand” action corresponds to a morresent simultaneous actions and a sequence of move- pheme that models the movement required to touch ments is related to the causal concatenation of activities something at a specific location. To generalize this action This way, our intention is to identify which entities con- to any location, we represent the motion of a segment in stitute a single morpheme sentence (nuclear syntax) and a space with a reduced dimensionality. Each dimension to study the mechanisms of composing sets of mor- in this reduced space represents a parameter, such as speed phemes (parallel syntax) and of connecting these sets and force, that models the variability of an action. into sequences (sequential syntax).
Parallel syntax
The organization of human movement is simultaneous and sequential: the nuclear syntax expands to parallel and sequential syntax. The parallel syntax concerns activities performed simultaneously. For example, a person might walk and wave at the same time, but could not whistle and chew gum simultaneously. Parallel syntax addresses the possible ways to combine different action morphemes into a set of morphemes that can be performed simultaneously.

Basically, the main constraint imposed by parallel syntax involves the essential actuators. To merge two action morphemes for actions a1 and a2 into a parallel set of morphemes, their sets of essential actuators must have an empty intersection: the two action morphemes cannot share any essential actuator. This rule can be implemented as a constraint matrix C. For each pair of actions a1 and a2 in a praxicon, if Ia1 ∩ Ia2 = ∅, the matrix entry C(a1, a2) is true; otherwise, the matrix entry is false. The constraint matrix explicitly stores which pairs of morphemes could be merged as simultaneous activities. More sophisticated inferences could also be performed using this structure. For example, transforming this matrix into a graph, cliques correspond to groups of action morphemes that can be executed at the same time.
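A sketch of the constraint matrix and the clique-style query it supports, with hypothetical action and actuator names:

```python
from itertools import combinations

def constraint_matrix(essential):  # essential: dict action -> set of actuators
    """C[a1][a2] is True when the two morphemes share no essential
    actuator and can therefore be merged into a parallel set."""
    return {a1: {a2: essential[a1].isdisjoint(essential[a2])
                 for a2 in essential} for a1 in essential}

def parallel_sets(essential, k):
    """Enumerate k-subsets of actions that are pairwise compatible,
    i.e., cliques of size k in the compatibility graph."""
    C = constraint_matrix(essential)
    return [g for g in combinations(essential, k)
            if all(C[a][b] for a, b in combinations(g, 2))]

acts = {"walk": {"l_hip", "r_hip", "l_knee", "r_knee"},
        "wave": {"r_shoulder", "r_elbow"},
        "point": {"r_shoulder", "r_elbow", "r_wrist"}}
print(parallel_sets(acts, 2))  # [('walk', 'wave'), ('walk', 'point')]
```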
Sequential syntax
The cause-and-effect rule is physically consistent and embeds the ordering concept of syntax. The body pose must experience the motion cause, and the effect leads to a posture in the next sentence. Sequential syntax concerns the concatenation of actions or, more formally, the connection of sets of action morphemes (from parallel syntax) to form sequences of movement.

Consider a single actuator i: if i belongs to the sets Ia1 and Ia2 of essential actuators for two action morphemes, a1 and a2, respectively, the sequential concatenation of these two morphemes is feasible only if a transition from one motion pattern, pa1, to the other, pa2, exists. Such a transition can be obtained from the morphological grammar Gi of actuator i. Any nonterminals or terminals in Gi shared by both motion patterns pa1 and pa2 give rise to a possible transition. Consequently, the two morphemes a1 and a2 have a feasible concatenation with respect to actuator i. This way, two sets of action morphemes can be sequentially connected only if they have a feasible concatenation with respect to all actuators contained in their sets of essential actuators.

Figure 7 displays the motion patterns of two action morphemes and their respective morphological grammar entries. The two patterns share kinetemes and, consequently, a transition exists between the two morphemes.

Figure 7. Possible transitions between the motion patterns of two action morphemes and their respective morphological grammar entries. Because the two patterns share kinetemes, a transition exists between the two morphemes.
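A sketch of this feasibility test; representing patterns as kineteme sequences and checking only the actuators essential to both morphemes are our simplifications of the rule in the text:

```python
def feasible_transition(p1, p2):
    """With respect to one actuator, morphemes a1 and a2 concatenate
    feasibly if their motion patterns share a symbol of the actuator's
    morphological grammar (here approximated by shared kinetemes)."""
    return bool(set(p1) & set(p2))

def connectable(m1, m2):
    """m1, m2: dicts mapping essential actuator -> kineteme sequence.
    Require a feasible transition for every actuator essential to
    both morphemes."""
    return all(feasible_transition(m1[i], m2[i]) for i in m1.keys() & m2.keys())

walk = {"l_knee": ["k1", "k2", "k3"], "r_knee": ["k4", "k5"]}
kick = {"l_knee": ["k3", "k6"], "r_knee": ["k5", "k7"]}
print(connectable(walk, kick))  # True: both knees share a kineteme
```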
In scientific discussions about perception and reasoning in cognitive systems, the debate on signals versus symbols has persisted. Specifically, where do signals end and symbols begin? Signals and symbols acquire their meaning depending on the operations we apply to them. Starting from motion-capture measurements—signals—we extract symbols as early as possible, then use a symbolic framework to learn a first language of human movement. Following the framework of modern linguistics,20 we studied the kinetology, morphology, and syntax of this new language. Like modern-day archaeologists working from a papyrus containing a series of actiongrams, we used computer science techniques to decipher the underlying language by discovering the structure in each row and among rows.

Applications of the human activity language in various areas are now providing a novel viewpoint. The problem of visual surveillance and video analysis becomes one of translation from image representations using HAL—that is, action understanding involves motor representations at a higher or abstract level, which the language provides. Most importantly, HAL addresses a fundamental research issue in cognitive science and artificial intelligence: the mechanisms for combining sensory-motor information and concepts for understanding others and communicating the actor's own intentions.

Using HAL offers an empirical approach for discovering the languages of action and vision and their correspondences with natural language by collecting body movement measurements, visual object information, and associated linguistic descriptions from interacting human subjects. Using such data, we can imagine that in the near future the community will move toward the creation of the praxicon, a computational resource that associates the lexicon of words and concepts with the corresponding motoric and visual representations, enriched with information on co-occurrence patterns among the concepts for forming higher-level complex concepts. This praxicon will bring us closer to understanding human thought while significantly enhancing software that acquires meaning. It also offers support for the new idea of achieving artificial intelligence by measuring, structuring, parsing, and analyzing human behavior. ■
Acknowledgment
We thank the National Science Foundation and DARPA (DTO) for their support.
References
1. N. Nishitani et al., "Broca's Region: From Action to Language," Physiology, vol. 20, 2005, pp. 60-69.
2. V. Gallese et al., "Action Recognition in the Premotor Cortex," Brain, vol. 119, no. 2, 1996, pp. 593-609.
3. A. Glenberg and M. Kaschak, "Grounding Language in Action," Psychonomic Bull. & Rev., vol. 9, no. 3, 2002, pp. 558-565.
4. T. Flash and B. Hochner, "Motor Primitives in Vertebrates and Invertebrates," Current Opinion in Neurobiology, vol. 15, 2005, pp. 660-666.
5. F. Mussa-Ivaldi and E. Bizzi, "Motor Learning through the Combination of Primitives," Philosophical Trans. Royal Soc. London B—Biological Sciences, vol. 355, no. 1404, 2000, pp. 1755-1769.
6. M. Arbib, "Schema Theory," The Encyclopedia of Artificial Intelligence, S. Shapiro, ed., Wiley Interscience, 1992, pp. 1427-1443.
7. S. Schaal et al., "Computational Approaches to Motor Learning by Imitation," Philosophical Trans. Royal Soc. London B—Biological Sciences, vol. 358, 2003, pp. 537-547.
8. D. Del Vecchio et al., "Decomposition of Human Motion into Dynamics-Based Primitives with Application to Drawing Tasks," Automatica, vol. 39, 2003, pp. 2085-2098.
9. M. Jeannerod, "Object Oriented Action," Insights into the Reach to Grasp Movement, K. Bennett and U. Castiello, eds., Elsevier and North-Holland, 1994, pp. 3-15.
10. L. Ting and J. Macpherson, "A Limited Set of Muscle Synergies for Force Control during a Postural Task," J. Neurophysiol., vol. 93, 2005, pp. 609-613.
11. A. d'Avella and E. Bizzi, "Shared and Specific Muscle Synergies in Natural Motor Behaviors," Proc. Nat'l Academy of Sciences, vol. 102, 2005, pp. 3076-3081.
12. B. Averbeck et al., "Neural Activity in Prefrontal Cortex during Copying Geometrical Shapes," Experimental Brain Research, vol. 150, 2003, pp. 127-141.
13. K. Kahol, P. Tripathi, and S. Panchanathan, "Automated Gesture Segmentation from Dance Sequences," Proc. IEEE Int'l Conf. Automatic Face and Gesture Recognition, IEEE Press, 2004, pp. 883-888.
14. A. Nakazawa et al., "Imitating Human Dance Motions through Motion Structure Analysis," Proc. IEEE/RSJ Int'l Conf. Intelligent Robots and Systems, IEEE Press, 2002, pp. 2539-2544.
15. R. Parekh and V. Honavar, "Grammar Inference, Automata Induction, and Language Acquisition," The Handbook of Natural Language Processing, R. Dale, H. Moisl, and H. Somers, eds., Marcel Dekker, 2000, pp. 727-764.
16. C. Nevill-Manning and I. Witten, "Identifying Hierarchical Structure in Sequences: A Linear-Time Algorithm," J. Artificial Intelligence Research, vol. 7, 1997, pp. 67-82.
17. Z. Solan et al., "Unsupervised Learning of Natural Languages," Proc. Nat'l Academy of Sciences, vol. 102, no. 33, 2005, pp. 11629-11634.
18. J. Wolff, "Learning Syntax and Meanings through Optimization and Distributional Analysis," Categories and Processes in Language Acquisition, Y. Levy, I. Schlesinger, and M. Braine, eds., Lawrence Erlbaum, 1988, pp. 179-215.
19. G. Păun, "On the Synchronization in Parallel Communicating Grammar Systems," Acta Informatica, vol. 30, no. 4, 1993, pp. 351-367.
20. R. Jackendoff, The Architecture of the Language Faculty, MIT Press, 1997.
Gutemberg Guerra-Filho is a PhD candidate in the Computer Science Department at the University of Maryland, College Park. His research interests include sensory-motor intelligence, computer vision, and graphics. He received an MS in computer science from the University of Maryland and the State University of Campinas. Upon graduation, Guerra-Filho will join the faculty of the Computer Science and Engineering Department at the University of Texas at Arlington as an assistant professor; www.cs.umd.edu/~guerra.
Yiannis Aloimonos is a professor in the Department of Computer Science at the University of Maryland at College Park and the director of the Computer Vision Laboratory at the Institute for Advanced Computer Studies. He is also affiliated with the Cognitive Science Program. His research interests include computational vision, action, and cognition. Aloimonos received a PhD in computer science from the University of Rochester; www.cfar.umd.edu/~yiannis.