Conceptual grounding in simulation studies of language acquisition*
Peter F. Dominey
Institut des Sciences Cognitives, France
In order to understand the evolutionary pathway to the capability for language, we must first clearly understand the functional capabilities that the child brings to the task of language acquisition. Behavioral studies provide insight into infants' ability to extract statistical and distributional structure directly from the auditory signal, and their capabilities to construct relations between this structure and the structure extracted from perceptual systems. At the interface of these two processes lies a conceptual scene representation that can be accessed by both, and that importantly provides a means for the two systems to constructively interact. Recent studies have begun to make progress in simulating infants' capabilities to extract statistical structure (e.g. word segmentation and lexical categorization) directly from the speech sound sequence. The current research examines how this structure interacts with perceptual structure at the level of the conceptualized scene. In particular we demonstrate how the grounding of words and sentences in conceptualized visual scenes permits the system to construct the appropriate relations between words and their referents, and sentences and theirs (structured conceptualizations of scenes representing agents, objects and actions) in the initial phases of acquisition of syntactic structure. These studies simulate behavioral observations of the trajectory of infants' linguistic acquisition of concrete nouns, followed by concrete verbs and then more abstract nouns and verbs, in parallel with the development of first simple and then more complex syntactic structures. The relevance of these results to infant language acquisition behavior will be discussed. While this research yields interesting new results in characterizing the grounding of language in conceptualized scenes, it also identifies serious limitations of the current methods, which will be discussed along with the associated future extensions.

*This work was supported in part by grants from the ACI Cognitique (Paris), the ACI Integrative and Computational Neuroscience (Paris), and the Human Frontiers Science Program. I thank Emmanuel Dupoux and Luc Steels for invaluable comments on the manuscript.

Evolution of Communication 4:1 (2000), 57–85. ISSN 1387–5337 © 2000 John Benjamins Publishing Company
1. Introduction
Language provides a function for binding linear sequences of symbols onto multidimensional representations of events in the perceived world, such that these sequences can then be used by the speaker to communicate these representations (and thus the events), to be decoded by the listener. From this perspective, language and its evolution and acquisition cannot be fully considered in the absence of the effects of the external "to be communicated" world on the language user. In this sense, in the current research language processing is grounded not in a physical world, but rather in the internal "conceptual" representation of that world. The goal is to develop a system that is capable of simulating aspects of human language acquisition in the context of conceptual grounding.

From the perspective of the evolution of language, this is a fruitful and perhaps necessary step in the research program, if one considers that in order to understand the evolutionary pathway to the capability for language, we must first clearly understand the functional capabilities that the child brings to the task of language acquisition. Indeed, it is also likely that the developmental progression that we will examine in the child may provide insight into the intermediate steps along the evolutionary path to language. In particular, a parallel potentially exists in the initial reliance on reduced syntactic forms and fixed word order that is progressively followed by the use of more richly structured varieties of syntactic forms.

The following sections will first identify the target linguistic behaviors for the simulations. The behaviors in question are based on the capabilities of infants at around 18 months of age to utilize word meaning and subject–verb–object order to correctly associate sentences with their appropriate visual scenes. In addition, it will be crucial for us to identify the initial cognitive processing capabilities that the infants bring to the language acquisition task. These will be characterized in terms of their language-related capacities, such as word segmentation and lexical categorization of function and content words, and in terms of non-linguistic capabilities, including knowledge of naive physics. Given this background we will then describe the language processing model, its inputs, processing and outputs, and then proceed with a description of simulation results that highlight the importance of the interaction between conceptual and verbal representations in language acquisition. We will conclude with a discussion of the strengths and weaknesses of this approach, and future research directions.

Figure 1. The conceptual representation provides a common ground in which information from perceptual and speech processing can contribute to the construction of a representation in a common format.
2. Identification of the target behavior

Language comprehension can be functionally defined as the ability to construct conceptual representations (equivalent to those that can be constructed from perceptual inputs) from speech input. This notion is illustrated in Figure 1, in which a conceptual representation is shown to receive inputs both from speech/language processing and from non-linguistic perceptual processing.

Human infants display some stereotypic performance characteristics in early language acquisition, and it is these points that are of interest in the current study. The first point of interest (Hirsh-Pasek & Golinkoff 1993) is that at around 18 months of age, children can use knowledge of S-V-O word order to associate sentences like "John pushed Bill" vs. "Bill pushed John" with the appropriate visual scenes. In these selective looking tasks, while children hear such a sentence, they are simultaneously presented with two video images, one corresponding to the correct scene and one to a controlled incorrect scene. Preferential looking towards the correct scene indicates that the sentence was correctly understood. Performance of this task demonstrates the capability to construct conceptual representations based on visual or verbal input, and to determine whether two conceptual representations match. This implies that children can construct mappings between individual words and their referents (word-to-world mappings) and mappings between sentences and structured conceptualizations of scenes in the world (sentence-to-world mappings). By 24 months of age, children demonstrate a clear capability to use the canonical SVO ordering (John hit the ball.), and they enter the phase in which they will master non-canonical forms such as the passive OVS (The ball was hit by John.) and relativized OSV (It was the ball that John hit.) forms. This implies a more sophisticated use of syntactic structure in assigning open class words to their proper thematic roles. See Hirsh-Pasek & Golinkoff (1996) for theoretical and experimental investigation of these developmental phases.

The second aspect of infant language capabilities of interest is that the knowledge of syntactic structure that children have acquired should not be applicable only to a fixed set of terms, but should of course generalize to new words and sentences, an aspect that is crucial for learning new words. If children know that the verb of a sentence follows the first noun, then they should be able to use this information to their advantage when exposed to new verbs. Thus, in the sentence "John gorped the ball," the child should know that gorp is a verb, and she can use this knowledge to associate "gorp" with the action that she is currently observing (Hirsh-Pasek & Golinkoff 1996).

In addition to the behavior described above, we are also interested in a third type of behavior related to conceptual grounding. Studies of language acquisition in English have revealed that nouns are acquired earlier than verbs (see discussions in Gillette et al. 1999). Recent experiments indicate that the noun/verb distinction is confounded with a measure of concreteness or imaginability, and that it is indeed this factor that predicts the learnability of words more accurately than their lexical category (Gillette et al. 1999). That is, the noun "table" is more concrete than the verb "think," since one is substantially more "observable" than the other. However, the verb "push" is more concrete than the noun "soul." Thus, in this theory, concrete verbs may be learned prior to abstract nouns. This reasoning is in contrast with theories that explain these variations in word learning in terms of progressive development of the cognitive representational system, which initially favors the simpler representations required for nouns vs. verbs (see Gillette et al. 1999 for a review).

One current perspective on these observations is that the initial development of word-to-world mappings for concrete nouns can proceed using somewhat elementary associative mechanisms, providing a foundation or scaffolding for later extracting the meaning of verbs and their syntactic structure in a process referred to as "semantic bootstrapping." In this sense, this first-pass "asyntactic" analysis of concrete noun meaning grounds the syntactic learning process (Gillette et al. 1999). In return, knowledge of syntactic structure can streamline the acquisition of semantic knowledge in "syntactic bootstrapping." In this context, the current goal is to develop a system in which the acquisition of structural regularities in speech input is grounded in structural regularities encoded in a conceptual representation of external perceived events. This conceptual grounding will provide the basis for acquisition of word-to-world and sentence-to-world regularities, and should provide insight into developmental trajectories and issues of semantic and syntactic bootstrapping.
3. Initial conditions

Before embarking on this exercise, we must clearly establish the initial pre-wired processing capabilities available to the language learner. From the linguistic perspective, it has been established that by the 16–19 month period that concerns us, children are capable of segmenting words from continuous speech. Already at 8 months, infants are capable of using statistical regularities of artificially generated sound sequences to detect the boundaries of words after only 2 minutes of exposure (Saffran et al. 1996), and by the time they reach the 16–19 month period, they have a well developed word segmentation capability (see Pinker 1987, Jusczyk 1997 for reviews).

In addition to this early segmentation capability, it also appears that infants are able to use auditory cues in the speech signal in order to discriminate between the major lexical categories of closed class vs. open class words (Shi et al. 1999). This early discrimination between the closed class of grammatical morphemes and function words (the, a, to, by, from, etc.) vs. the open class of nouns, verbs, adjectives, adverbs, etc. relies not only on acoustic form differences, but also on different statistical distributions and additional cues to which the infants are sensitive, and forms a crucial component of the initiation of syntactic processing (see Morgan & Demuth 1996 for an extensive treatment of this issue). The result of this early discrimination capacity applied to the child's target language will then be expressed in adulthood. Indeed, in adults, extensive data from event-related potentials, brain imaging and psycholinguistic studies indicate that these lexical categories are treated by distinct and dissociated neurophysiological processing streams (e.g. Friederici 1985; Osterhout 1997; Pulvermüller 1995; Brown, Hagoort, & ter Keurs 1999). So word segmentation and the capability to discriminate open vs. closed class lexical categories are capabilities that can be considered in place at or before 16 months of age.

If language acquisition is the learning of a mapping between sentences and meanings, then the infant must also have some pre-linguistic capacity for representing meaning. From this non-linguistic perspective, already at 6 months of age, children are capable of processing causal events with agents, objects and actions, and of using these "naive physics" representations to understand simple action scenarios that involve goal-directed reaching for objects (Woodward 1998). Similarly, infants in this same age range display rather sophisticated knowledge of the physical properties of objects that allows them to "parse" and understand dynamic scenes with multiple objects (Carey & Xu 2001). This implies the existence of conceptual representations that can be instantiated by non-linguistic (e.g. visual) perceptual input prior to the development of language (as illustrated in Figure 1). These conceptual representations will form the framework upon which the mapping between linguistic and conceptual structure can be built. This approach does not exclude the possibility that the conceptual representation capability will become more sophisticated in parallel with linguistic development (see Bowerman & Levinson 2001 for a survey of the issue). It does require, however, that at least a primitive conceptualization capability that can deal with multiple-agent events exists in a prelinguistic state, a position that may still be open to debate by formal linguists (see Crain & Lillo-Martin 1999 for the formal linguistics perspective).

Identification of the preceding initial conditions attempted to define constraints on the system itself, but we must also take into account the constraints on the environment in which the system is to learn. Having identified requirements on the initial conceptual and linguistic processing capabilities, we must now have some assurance that the environment is such that the scenes to be conceptualized and the accompanying sentences have some relation. Specifically, to some extent we must assume that language input describes ongoing events that the infant can perceive (see Hirsh-Pasek & Golinkoff 1996, Pinker 1987). Having identified these initial state processing capabilities, we can now describe the model that will embody these conditions.
4. Model description

A major assumption that drives the architecture of the model is that sentences are a form of self-identifying data structure. That is, by relative combinations of word order conventions, explicit grammatical marking (by function words such as "by," "to," "from," or grammatical morphemes), and their various combinations across languages, a given sentence in any language contains the information necessary to perform correct sentence-to-world mapping. Thus, an architecture that is sensitive to word order and grammatical marking should provide the basis for a language-independent language acquisition system.

The model architecture is presented in Figure 2. Sentence-scene pairs are presented to the model to allow learning of word meaning and syntactic structure. Sentence input consists of successive words, and "visual" input consists of a conceptualized scene representation paired with the sentence. Based on these inputs, the system should learn word-to-world mappings in the Word-to-World matrix (i.e. the meanings of individual words, their mapping to appropriate scene items), and sentence-to-world mappings in the Sentence-to-World matrix (i.e. the thematic structure of syntactic forms).
Figure 2. Model architecture. The numbered components are: 1. Speech Input; 2. Lexical Analyzer (separating closed class words from open class words); 3. Visual Input; 4. Scene Analysis/Parser; 5. Conceptual Scene Representation (Action, Agent, Object roles); 6. Lexical-to-Conceptual Transformation, comprising the Function Vector, Open Class Array, Fxn-Map, Word-to-World and Sentence-to-World components. Words in the input sentences are coded as activation of a single element in a 25-element vector. The Open Class Array is an array of these vectors for the open class words. The Function Vector encodes a concatenation of the successively presented function words. Word-to-World encodes the mapping of word vectors to conceptualized scene item vectors that in turn correspond to entities in the external world. Sentence-to-World encodes the mapping of constituents of the Open Class Array onto roles in the conceptualized scene. See text for details.
The interface between linguistic input and events in the external world takes place at the level of the conceptualized scene. Internal conceptualized scene representations consist of agent, object, recipient and action roles that are filled based on the current visual input. The representation is flexible, so that some scenes may have only an agent and action; others an agent, action and object; and others an agent, action, object and recipient.

Section 2 described studies indicating that the concreteness or imaginability of a word will likely influence the acquisition of its meaning. Indeed, concreteness is related both to the word and to its meaning or referent in the world. Concreteness is defined in terms of the ease with which something can be imagined or observed. Thus, subjects reliably rate "push" and "table" as more concrete than "see" and "idea" (Gillette et al. 1999). To simulate concreteness for a given word-scene item pair, each scene item is given a concreteness factor that determines the scope of the set of possible referents that co-occur with this item in the conceptual scene representation. Specifically, as concreteness decreases, the size of the set of referent conceptualized scene items increases, so that the corresponding word will be associated with an increased number of scene items, only one of which is intended by the utterance, thus increasing the probability that the word is associated with the wrong world-referent or scene item.

4.1 Input

For a given trial, the input is a matched sentence-scene pair.

4.1.1 Sentence

Verbal input consists of successively presented words that are organized into sentences, e.g. "John threw the ball" or "John gave the ball to Mary." Each word is represented as the activation of one element in a 25-element vector. Sentences can be made up of content and function words, and this primitive lexical categorization is built into the model, so that these two lexical categories are treated in separate processing streams (see Figure 2). A sample of the sentence types used in the following simulations is presented in Table 1.

Table 1. Sample sentence types
1. Active: "The boy pushed the truck."
2. Passive: "The truck was pushed by the boy."
3. Relative: "It was the truck that the boy pushed."
4. Dative: "The student gave the book to the librarian."
5. Dative passive: "The book was given to the librarian by the student."

4.1.2 Scene

Visual input consists of a structured array of scene items that fill in the agent, action, object and recipient values of the conceptualized scene. Each scene item is represented as the activation of one element in a 25-element vector. Associated with each individual scene item is a concreteness parameter that indicates the degree of variability of that element's representation in the conceptualized scene. Specifically, as concreteness decreases for a given scene item, the number of additional scene items that are co-represented in the scene vector increases. This parameter is central to the study: it is this variability that makes learning difficult, and that will require more optimized "bootstrapping" strategies to overcome.
4.2 Processing

The following paragraphs describe the different levels of learning that occur in the model, with appropriate links to the associated behavior in infants.

Word-to-world learning: In the initial learning phases, the association between a word and its corresponding scene item is learned by a simple associative memory, and is stored in the Word-to-World matrix (Eqn 1). This exploits a form of cross-situational learning, in which the correct word-scene item associations emerge as those that remain common across multiple sentence-scene situations (Siskind 1996). In the initial configuration specified in Eqn 1, as each new open class word in the sentence is processed, learning operates simply by associating every word with every element in the current scene. In this manner the system can extract the cross-situational regularity that a given word will co-occur more often with the real-world object to which it refers than with other objects. However, this learning is not optimal, and it is later supplemented by a more effective method referred to as syntactic bootstrapping, below.

(1)  Word-to-World(i,j) = Word-to-World(i,j) + OCA(k)(i) * CSA(m)(j) * LRSem

In Eqn 1, index k = 1 to 6, corresponding to the maximum number of words in the open class array (OCA). Index m = 1 to 6, corresponding to the maximum number of elements in the conceptual scene array (CSA). Indices i and j = 1 to 25, corresponding to the word and scene item vector sizes, respectively. LRSem is a learning rate parameter.
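A minimal sketch of the Eqn 1 update, continuing the illustrative numpy encoding above (the learning rate value is an arbitrary assumption): every open class word is associated with every current scene item via an outer product, so that correct pairings accumulate across situations:

LR_SEM = 0.1  # illustrative learning rate
word_to_world = np.zeros((V, V))

def learn_word_to_world(oca, csa, w2w, lr=LR_SEM):
    # Eqn 1: associate each word vector (row k of the OCA) with each
    # scene-item vector (row m of the CSA); correct word-referent pairs
    # recur across sentence-scene pairs and so come to dominate.
    for k in range(MAX_ITEMS):
        for m in range(MAX_ITEMS):
            w2w += lr * np.outer(oca[k], csa[m])
    return w2w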
Sentence-to-world learning: This section first addresses sentence-to-world learning in which fixed word order (e.g. Subject–Verb–Object) maps directly onto thematic structure (e.g. Agent–Action–Object). We then generalize this to situations involving variable word orders and grammatical markings.

As the sentence is processed word by word, the open class words that are relevant for thematic structure, i.e. nouns and verbs, are placed in the Open Class Array. Adjectives and adverbs should thus remain bound with their respective nouns and verbs. This corresponds to a constituent representation in which a constituent such as "the big brown dog" would be coded as a single noun constituent in the Open Class Array. It is important to note that the relative position of a constituent in the Open Class Array does not correspond to the numerical order of the corresponding word in a sentence. There is no notion of counting words here; rather, what is important is their relative ordering. Thus, the Open Class Array contains the ordered list of nouns and verbs in the order in which they appeared in the sentence. In parallel, the Conceptualized Scene encodes the corresponding agent, object and action (and possibly recipient) for a given event.

Now for the learning. The learning consists in establishing the link between the relative ordering of constituents in the Open Class Array and the corresponding thematic roles in the Conceptualized Scene. Establishing these links relies on the acquired knowledge of word meaning encoded in the Word-to-World mapping, a form of semantic bootstrapping that proceeds in the following manner. We first establish the "predicted referents" for the words in the open class array by decoding each word with the learned word-referent associations encoded in the Word-to-World mapping (Eqn 2). This results in the predicted referents array (PRA).
(2)  PRA(k)(j) = \sum_{i=1}^{n} OCA(k)(i) * Word-to-World(i,j)

In Eqn 2, index k = 1 to 6, corresponding to the maximum number of scene items in the predicted referents array (PRA). Indices i and j = 1 to 25, corresponding to the word and scene item vector sizes, respectively. Note that the PRA preserves the "word order" of the open class array, with each of the words decoded into its corresponding scene referent.
Now the contents of the conceptualized scene array (CSA) and the predicted referents array (PRA) can be compared in order to establish the link between word order in the PRA and thematic roles in the CSA. This correspondence is encoded in Sentence-to-World (Eqn 3).

(3)  Sentence-to-World(m,k) = Sentence-to-World(m,k) + \sum_{i=1}^{n} PRA(k)(i) * CSA(m)(i) * LRSyn

In Eqn 3, index m = 1 to 6, corresponding to the maximum number of elements in the conceptual scene array (CSA). Index k = 1 to 6, corresponding to the maximum number of words in the predicted referents array (PRA). Index i = 1 to 25, corresponding to the word and scene item vector sizes. LRSyn is a learning rate parameter. The resulting association between the relative constituent ordering (e.g. Subject, Verb, Object) and the thematic structure of the scene (e.g. Agent, Action, Object) is encoded in the Sentence-to-World matrix.

In summary, the Predicted Referents Array is a matrix in which each of the 6 rows is a 25-element vector that represents the scene item for the corresponding word in the Open Class Array. Likewise, the Conceptualized Scene Array is a matrix in which each of the 6 rows is a 25-element vector that can represent a scene item. Thus a 6×6 matrix, Sentence-to-World, describes the transformation from constituent ordering in the Predicted Referents Array to thematic structure in the Conceptualized Scene. In this context, knowledge of the syntactic structure that maps sentence structure onto conceptual structure relies on knowledge of word meaning.
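As a sketch of this semantic bootstrapping step (same illustrative conventions as above): Eqn 2 decodes each word into its predicted referent, and Eqn 3 accumulates the correlation between referent order and thematic roles:

LR_SYN = 0.1  # illustrative learning rate
sentence_to_world = np.zeros((MAX_ITEMS, MAX_ITEMS))

def predicted_referents(oca, w2w):
    # Eqn 2: decode each open class word into its predicted scene item,
    # preserving the word order of the Open Class Array.
    return oca @ w2w                        # shape (MAX_ITEMS, V)

def learn_sentence_to_world(pra, csa, s2w, lr=LR_SYN):
    # Eqn 3: link constituent order (rows of the PRA) to thematic roles
    # (rows of the CSA); s2w[m, k] accumulates sum_i PRA[k,i] * CSA[m,i].
    s2w += lr * (csa @ pra.T)
    return s2w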
Syntactic bootstrapping: We have just seen that knowledge of word meaning is crucial for developing initial syntactic knowledge. In the opposite sense, knowledge of syntactic structure can be extremely useful in learning the meaning of words. Recall that in the initial Word-to-World learning, a given word is associated with all of the elements in the current scene. While effective, this is not optimal, as a word is associated with many scene items that are not in fact its referent. This issue can be resolved by using knowledge of syntactic structure to identify the appropriate referent for a given word, as indicated by the knowledge of word order encoded in the Sentence-to-World mapping. This yields the updated Eqn 1.2.

(1.2)  Word-to-World(i,j) = Word-to-World(i,j) + OCA(k)(i) * CSA(m)(j) * LRSem * Sentence-to-World(m,k)

In Eqn 1.2, index k = 1 to 6, corresponding to the maximum number of words in the open class array (OCA). Index m = 1 to 6, corresponding to the maximum number of elements in the conceptual scene array (CSA). Indices i and j = 1 to 25, corresponding to the word and scene item vector sizes, respectively.
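A sketch of the Eqn 1.2 update under the same illustrative conventions; the learned Sentence-to-World term gates each association, crediting a word mainly to the scene item in the role its sentence position predicts:

def learn_word_to_world_refined(oca, csa, w2w, s2w, lr=LR_SEM):
    # Eqn 1.2: weight each word-item association by Sentence-to-World(m, k),
    # so word k is linked mostly to the item filling its predicted role m.
    for k in range(MAX_ITEMS):
        for m in range(MAX_ITEMS):
            w2w += lr * s2w[m, k] * np.outer(oca[k], csa[m])
    return w2w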
For example, in SVO languages, open class words that appear at the beginning of a sentence will tend to be the agent of the action. When new words are encountered, this knowledge can greatly reduce the learning problem, allowing the infant to directly associate the new word with its correct referent. Indeed, it is this ability to generalize knowledge of syntactic structure to new words and sentences that is part of the power of a syntactic system.

Coping with variable word orders: The description above suggests that the Sentence-to-World array will encode only a single mapping, resulting in a highly reduced syntactic structure for the learned language. Here we address this issue. Observe that the sentences "John pushed Bill" and "It was Bill that John pushed" yield different Open Class Arrays, (John, pushed, Bill) and (Bill, John, pushed) respectively. Both of these Open Class Arrays are to be linked to the same Conceptualized Scene, each in its respective manner. This suggests the need for two distinct Sentence-to-World mappings for the two syntactic forms. Pursuing this further, we see that what differentiates these two sentences is their respective configurations of functional projections (function words). We will thus extend the single Sentence-to-World mapping to a set of distinct Sentence-to-World mappings, each mediated by a specific configuration of function words.

In the model, the configuration of function words in a given sentence is represented in the Function Vector. The Function Vector is constructed by concatenating the successive function word vectors in the order in which they appear in the sentence. The concatenation is performed in a circular buffer, so that the resulting Function Vector preserves information about the identity of the successive function words and their relative order. Thus, to account for the different Sentence-to-World mappings associated with different syntactic structures (encoded by different Function Vectors), the system learns to associate different Sentence-to-World mappings with different Function Vectors. To implement this mechanism, as each new sentence is processed, we first calculate the specific Sentence-to-World mapping for that sentence (Eqn 4).
(4)  Sentence-to-World-Current(m,k) = \sum_{i=1}^{n} PRA(k)(i) * CSA(m)(i)

In Eqn 4, index m = 1 to 6, corresponding to the maximum number of elements in the conceptual scene array (CSA). Index k = 1 to 6, corresponding to the maximum number of words in the predicted referents array (PRA). Index i = 1 to 25, corresponding to the word and scene item vector sizes. Thus, Sentence-to-World-Current encodes the correspondence between word order (which is preserved in the PRA, Eqn 2) and thematic roles in the CSA. Note that the quality of Sentence-to-World-Current will depend on the quality of the acquired word meanings as reflected in the PRA. Thus again, syntactic learning requires a minimum baseline of semantic knowledge.

Given the Sentence-to-World-Current mapping for the current sentence, we can now associate it with the corresponding function word configuration for that sentence, as expressed in the Function Vector (Eqn 5). In Eqn 5, note that Sentence-to-World-Current has been linearized from 2 to 1 dimensions to make the matrix multiplication more transparent; index i thus varies from 1 to 36, corresponding to the 6×6 dimensions of Sentence-to-World-Current, while index j = 1 to 25, corresponding to the dimension of the Function Vector.

(5)  Fxn-Map(i,j) = Fxn-Map(i,j) + Sentence-to-World-Current(i) * FunctionVector(j) * LRSyn
Finally, once this learning has occurred, for new sentences we can extract the Sentence-to-World mapping from the learned Fxn-Map by using the Function Vector as an index into this associative memory, as illustrated in Eqn 6. In Eqn 6, again to simplify the matrix multiplication, Sentence-to-World has been linearized to one dimension, based on the original 6×6 matrix; thus index i = 1 to 36, and index j = 1 to 25, corresponding to the dimension of the Function Vector.

(6)  Sentence-to-World(i) = \sum_{j=1}^{n} Fxn-Map(i,j) * FunctionVector(j)
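The following sketch covers Eqns 4–6 under the same illustrative conventions (here the Function Vector is assumed to be a fixed-length 25-element vector, per the Eqn 6 index note, rather than the circular-buffer concatenation described above):

FV_LEN = 25   # Function Vector dimension, per the Eqn 6 index note
fxn_map = np.zeros((MAX_ITEMS * MAX_ITEMS, FV_LEN))

def s2w_current(pra, csa):
    # Eqn 4: the sentence-specific order-to-role correspondence,
    # computed on-line from the predicted referents and the scene.
    return csa @ pra.T                      # shape (MAX_ITEMS, MAX_ITEMS)

def learn_fxn_map(fmap, function_vector, s2w_cur, lr=LR_SYN):
    # Eqn 5: associate the function word configuration with the
    # linearized sentence-specific Sentence-to-World mapping.
    fmap += lr * np.outer(s2w_cur.ravel(), function_vector)
    return fmap

def retrieve_s2w(fmap, function_vector):
    # Eqn 6: use the Function Vector as an index into the associative
    # memory to recover the mapping for this syntactic form.
    return (fmap @ function_vector).reshape(MAX_ITEMS, MAX_ITEMS)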
Thus, different syntactic structures (active, passive, relative, etc.) can be accommodated by learning the appropriate Sentence-to-World mapping for each one, based on the grammatical markings encoded in the Function Vector. Semantic bootstrapping is required for this learning, as revealed by the reliance on the predicted referents array in Eqn 4. Likewise, the system benefits from syntactic bootstrapping, as the quality of the semantic knowledge is improved by the application of syntactic knowledge in Eqn 1.2.
4.3 Behavior and performance

In the selective looking tasks used to evaluate infants' language comprehension (Hirsh-Pasek & Golinkoff 1996), experimenters measure the ability of the child to match a given sentence with the corresponding scene. In practice this is realized by making two scenes simultaneously available, one that is correct with respect to the sentence and one that is not, and measuring the relative time that the child looks at each of the two. We thus evaluate performance by using the Word-to-World and Sentence-to-World knowledge to construct, for a given input sentence, the "predicted scene." That is, the model constructs an internal representation of the scene that should correspond to the input sentence. This is achieved by first converting the Open Class Array into its corresponding scene items in the Predicted Referents Array, as specified in Eqn 2. The referents are then re-ordered into the proper scene representation via application of the Sentence-to-World transformation, as described in Eqn 7.

(7)  PSA(m)(i) = \sum_{k=1}^{n} PRA(k)(i) * Sentence-to-World(m,k)

In Eqn 7, index i = 1 to 25, corresponding to the size of the scene and word vectors. Indices m and k = 1 to 6, corresponding to the dimensions of the predicted scene array and the predicted referents array, respectively. When learning has proceeded correctly, the predicted scene array (PSA) contents should match those of the conceptualized scene array (CSA) that is derived directly from the visual input to the model. The match is characterized in Eqn 8.

(8)  Performance(m)(i) = PSA(m)(i) * CSA(m)(i)

In Eqn 8, index m = 1 to 6, corresponding to the number of scene items. Index i is a value between 1 and 25 that corresponds, for each scene vector, to the element with maximum activation. Thus, the Performance array describes the extent to which the predicted scene corresponds to the actual scene that has been described.
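A sketch of this evaluation step (same illustrative conventions as above): the predicted scene is constructed from the sentence alone and then compared, role by role, against the scene that was actually presented:

def predicted_scene(oca, w2w, s2w):
    # Eqn 7: decode words into referents (Eqn 2), then re-order the
    # referents from sentence order into thematic-role order.
    pra = oca @ w2w
    return s2w @ pra                        # PSA[m] = sum_k s2w[m,k] * PRA[k]

def performance(psa, csa):
    # Eqn 8: compare predicted and actual scene; here summarized as
    # agreement of the maximally activated item in each scene vector.
    return (psa.argmax(axis=1) == csa.argmax(axis=1)).mean()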
5. Simulation results

We now report simulation results that address four aspects of conceptual grounding in language acquisition. First, we examine the effects of conceptual grounding in the context of syntactic and semantic bootstrapping in the face of noisy inputs, i.e. non-concrete or abstract words and their referents. The objective of this work was to study the effects of conceptual grounding and concreteness on learning word- and sentence-to-world mappings, and the effects of bootstrapping on this learning. An a priori goal was to determine whether the profile of progressively delayed learning for more abstract words could be explained in this setting. Second, we wanted to examine the relative contributions of lexical category vs. concreteness in determining whether words would be learned early or late. Third, we wanted to verify that these observations were consistent with a system capable of learning multiple syntactic structures as coded by the respective Sentence-to-World = FunctionVector × Fxn-Map relations. Finally, we wanted to demonstrate that this knowledge of different syntactic structures could be exploited to quickly acquire the meaning of new verbs.

5.1 Experiment 1: Effects of concreteness and conceptual grounding on learning rate

This experiment has two purposes: first, to determine whether, as suggested by Gillette et al. (1999), the learning rate for different words is related to their concreteness, and second, to establish the influence of conceptual grounding on this effect. We thus developed a training set of sentence-scene pairs in which there was a distribution of different concreteness values for the word-scene item pairs.

The learning strategy is the following. In the initial phase, word-to-world mappings are learned by the simple cross-situational associative strategy (Eqn 1). After a short period in this configuration, both syntactic and semantic bootstrapping are activated (Eqns 1.2, 2 & 3). In this synergistic bootstrapping condition, semantic knowledge allows the system to determine the syntactic mapping between relative constituent order in the Open Class Array and thematic structure in the Conceptualized Scene (Eqns 2 & 3). Likewise, in the other direction, this syntactic knowledge allows a more refined acquisition of semantic knowledge (Eqn 1.2). That is, instead of blindly associating a given open class element with all of the current scene items, the system can exploit syntactic knowledge to associate it only with the appropriate scene item.

Figure 3 displays the change in strength of the Word-to-World mappings for a set of words with progressively decreasing concreteness values. Two simulation conditions are displayed in the figure: in one condition (A) the synergistic bootstrapping (Eqn 1.2) is employed, and in the other (B) only the simple associative learning (Eqn 1) is used. In both cases, the training was performed using repeated exposure to a set of 27 SVO sentences, for a total of 996 sentence presentations during the time-course displayed in Figure 3.
Figure 3. Word-to-World learning strength as a function of concreteness and the presence of synergistic bootstrapping. A. Learning curves with bootstrapping. B. Learning curves without bootstrapping. In each panel, the vertical axis encodes learning strength (.9 to 1), the horizontal axis is time, and the concreteness is indicated. Arrows in the first three panels of A indicate that as concreteness is reduced, maximum learning is progressively delayed. B demonstrates that without bootstrapping, learning cannot occur.
The bootstrapping condition corresponds to the set of curves in the left panel (A) that increase progressively, while the no-bootstrapping condition corresponds to the flat curves in the right panel (B). Thus the first observation is that in conditions with varying concreteness, learning cannot take place without the synergistic grounding between syntactic and semantic relations. The second noteworthy observation is that the increase in learning strength over time for these different words is not uniform and is completely predicted by the concreteness parameter.
5.2 Experiment 2: Concreteness, lexical categories and learning over time

Thus it appears that concreteness is a crucial factor in determining the relative learning rate. In order to determine the relative importance of lexical category and concreteness, we measured the Word-to-World mapping strength in conditions that contrasted lexical category (nouns vs. verbs) with concreteness (concrete, intermediate, abstract) and learning period (early, middle, late). The experimental conditions used the same sentences as in the previous experiment. Early, middle and late time periods corresponded to exposure to 101, 500 and 898 sentences, respectively. The performance values, expressed as the strength of correct word-to-world mappings, are presented in Figure 4. There we see a general trend for improvement in all conditions over time that is more effective as concreteness increases, and that appears to be relatively independent of lexical category.

These observations were supported by the results of a repeated measures ANOVA in which the dependent variable was strength of learning, and the independent variables were period, lexical category and concreteness. There was a significant main effect for period (F(2,3) = 14627, p = 1×10⁻⁶) and for concreteness (F(2,6) = 13617, p = 1×10⁻⁶), but not for lexical category (F(1,3) = 58, p > 0.001). There was a significant interaction between period and concreteness (F(4,6) = 2754, p = 1×10⁻⁶), indicating that concreteness significantly influences the effects of progressive exposure on performance. Of particular interest in this respect, there was no significant interaction between lexical category and period (F(2,3) = 5.74, p = 0.094), indicating that the improvement over successive periods is independent of lexical category.

Figure 4. Effects of lexical category and concreteness on the progressive improvement of word-to-world mapping strength (interaction of lexical category × period × concreteness; concrete, intermediate and abstract conditions plotted over the early, middle and late periods for nouns and verbs). For both nouns and verbs, learning improves with time, and is most effective for concrete elements, independent of their lexical category.
5.3 Experiment 3: Variable word order and function words

Implicit in the current architecture is the hypothesis that different syntactic forms will be uniquely indicated by their respective configurations of function words (and/or grammatical morphology, which is equivalent in theory). In this case, the Function Vector can uniquely associate each distinct syntactic form with its appropriate Sentence-to-World mapping via the Sentence-to-World = FunctionVector × Fxn-Map relation described above in Eqn 6. In testing this mechanism, we were concerned with four distinct syntactic structures, as exemplified here:

1. Active: John washed Mary.
2. Passive: Mary was washed by John.
3. Dative: Mary gave the apple to John.
4. Dative passive: The apple was given to John by Mary.
The training sentences were uniformly distributed between the four sentence types, for an exposure to 2000 sentences in training. It is noteworthy that the model achieves error-free performance with this small amount of training. Figure 5 displays a snapshot of the state of the system after successful training on these sentence types. Figure 5.1 corresponds to the sentence "Tarzan introduced John to Jane," and Figure 5.2 corresponds to the sentence "John was seen by Mary." In both figures, Panel A illustrates the Open Class Array with five 25-element vectors. Panel B encodes the Sentence-to-World mapping. This corresponds to the transformation from constituent order in the sentence to thematic structure in the conceptualized scene (Eqn 6). The first four columns correspond to action, agent, object and recipient, reflecting the fixed structure of the Conceptual Scene Array. The rows correspond to the ordered constituents in the Open Class Array. Note that for the two sentences in 5.1 and 5.2 these mappings are different, reflecting the different syntactic structures of the two sentences.

Figure 5. Snapshot of model activity after learning multiple sentence types. Figure 5.1 (top) corresponds to the sentence "Tarzan introduced John to Jane", and Figure 5.2 (bottom) corresponds to the sentence "John was seen by Mary." Panels in each: A. Open Class Array; B. Sentence-to-World; C. Word-to-World; D. Conceptual Scene Array; E. Predicted Scene Array; F. Performance; G. Function Vector; H. Sentence-to-World-Current; I. Function Mapping. See text for details.
For the active, 3-argument verb sentence in 5.1, the Sentence-to-World array indicates that the action (column 1) corresponds to the second element in the Open Class Array (OCA), the agent (column 2) corresponds to the first OCA element, the object corresponds to OCA element 3, and the recipient to OCA element 4. It is left as an exercise for the reader to decode panel B for the passive, 2-argument verb sentence in 5.2.

Panel C encodes the Word-to-World mapping that maps open class elements onto their conceptual representations (Eqn 1.2). Thus the transformation Open Class Array × Word-to-World converts the ordered open class words into the ordered set of scene referents in the Predicted Referents Array (Eqn 2). The subsequent transformation Predicted Referents Array × Sentence-to-World (Eqn 7) maps the scene referents from sentence order into conceptual scene order to form the Predicted Scene Array. In other words, the transformation Open Class Array × Word-to-World × Sentence-to-World takes us from the input sentence to an internal representation, i.e. the Predicted Scene Array (PSA) in panel E. This can be compared with the actual input scene in panel D. Performance is measured as the intersection between the predicted and actual conceptual scene (Eqn 8), displayed in panel F.

In panel G we see the corresponding Function Vector that is formed by concatenating successive function word representations. As described earlier, once the Word-to-World mappings are well learned, we can construct on-line the Sentence-to-World-Current mapping for the current sentence (Eqn 4). Panel H thus displays the correlation matrix between the words in the Open Class Array (once decoded into their scene items by the Word-to-World matrix) and the scene items in the predicted scene. That is, panel H corresponds to the Sentence-to-World-Current mapping that was discovered on-line for this sentence by semantic bootstrapping. This correspondence matrix in H is then associated with the contents of the Function Vector (Eqn 5), in the matrix of connections in Fxn-Map, displayed in panel I. Once this learning has occurred, the appropriate Sentence-to-World matrix for a given sentence can be retrieved based on the function words in that sentence via the Sentence-to-World = FunctionVector × Fxn-Map relation (Eqn 6). This allows the learning of new word meanings by syntactic bootstrapping, based on the configuration of function words in a sentence. That is, a sentence that contains new words can be presented, and based on the configuration of function words, the corresponding Function Vector will extract the associated Sentence-to-World mapping from the Fxn-Map matrix. This allows generalization of syntactic knowledge to new words and sentences.
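Tying the panels together, here is a sketch of the full comprehension path for the sentence of Figure 5.2, using the illustrative functions defined in the sketches of Section 4 (the word and function word indices are hypothetical, and the Function Vector construction is simplified relative to the circular buffer described in the text):

# "John was seen by Mary": open class constituents (John, seen, Mary).
oca = np.zeros((MAX_ITEMS, V))
for row, word in enumerate([3, 4, 5]):      # hypothetical word indices
    oca[row] = one_hot(word)
fv = np.zeros(FV_LEN)
fv[[10, 11]] = 1.0                          # hypothetical slots for "was", "by"

s2w = retrieve_s2w(fxn_map, fv)             # panel B: mapping for this form
psa = s2w @ (oca @ word_to_world)           # panels C-E: words -> referents -> roles
# After training, row 0 of psa should hold the action "seen",
# row 1 the agent "Mary", and row 2 the object "John".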
5.4 Experiment 4: Generalization and syntactic bootstrapping

Indeed, as mentioned above, one of the great strengths of a syntactic system is that the acquired rules are not only appropriate for processing previously learned sentences; they can also be applied to new sentences in an open-ended generalization. In this case, the use of syntactic knowledge should aid the acquisition of new word meanings, by allowing children to identify the thematic role of a new word based on its configuration within a sentence. Furthermore, as syntactic knowledge is progressively acquired, we should see that new word meanings can be acquired with increasing efficacy.

Figure 6. Effects of degree of syntax acquisition on rate of new verb learning (F(4,12) = 27858.91).
For children learning English, it is a highly reliable observation that during the early stages of learning there is an asymmetrical vocabulary distribution that heavily favors nouns over verbs (Bates et al. 1995). A recent paper by Gillette et al. (1999) reviews two theoretical explanations, and experimental data that may tip the balance. One position holds that nouns are learned first because they are conceptually simpler (they describe single objects) than verbs (which describe relations between multiple objects), thus placing the burden of explanation on a conceptual system that is initially not prepared for representing verbs. An alternative position explains the noun-verb gradient by two related mechanisms. The first has to do with the concreteness of word-to-world mappings. Concrete nouns (and particularly those that are among the most frequent in child vocabularies) correspond to concrete, easily observed objects in the child's environment. This high observability contributes to the ease of early acquisition based on simple associative word-to-world mapping. According to this view, the noun preference in early vocabularies is more an effect of concreteness than of lexical category per se. Indeed, highly concrete verbs are present in the earliest vocabularies. Once this initial vocabulary of concrete terms has been established, it provides the grounded basis for the subsequent acquisition of more abstract terms, including verbs. The presence of known nouns in the context of an unknown verb provides the required scaffolding of a clause-level syntax that allows even an abstract verb to be correctly associated with the appropriate aspect of the scene.

In this framework, then, the noun-verb discontinuity is more related to concreteness than to lexical category, and the development of a concrete set of nouns provides a syntactic-thematic mapping that allows the subsequent acquisition of more abstract verbs. In other words, there is no need to explain late verbs in terms of the lack of an appropriate conceptual representation at the outset. What is missing is the syntactic representation that must first be built from the scaffolding of concrete nouns (Gillette et al. 1999). The simulation results in Experiments 1 and 2 support this explanation of vocabulary development. In particular, Experiment 2 revealed that differences in learning over time were related to concreteness, and that lexical category per se for nouns and verbs did not influence learning rate independent of concreteness.

A prediction that issues both from the theoretical position defended by Gleitman and colleagues (Gillette et al. 1999), and supported by the model, is that for human languages in which verbs are associated with more concrete aspects of scenes than are nouns, one should observe an increase in the proportion of verbs in initial vocabularies. Mandarin appears to be such a language, and indeed, recent evidence from studies of Mandarin children indicates that this prediction is borne out in the data (Snedeker & Li 2000).

6.3 Learning multiple syntactic structures, and cross-linguistic issues

An important characteristic of human language acquisition and adult processing is that there are typically multiple possible syntactic forms or mappings from constituent structures to thematic roles in conceptualized scenes. That is, the sentences "John washed Mary" and "Mary was washed by John" do not undergo the same mapping operation in order to find their correspondence with the conceptualized scene Washed(John, Mary). It is clear that the syntactic difference between these two sentence types must somehow be marked so that the parser can take these differences into account. In English, the differences are marked by function words and/or grammatical morphemes. In addition to this variability in the form of syntactic structures, there is also variability in the argument structure of the verb; we can thus have intransitive frames like "John ate," transitive frames like "John ate the cake," and ditransitive frames like "John gave the cake to Mary." More generally, languages vary in their reliance on word order regularities, and in their use of closed class grammatical function words and morphemes to indicate thematic roles. A learning system should be able to accommodate both of these types of variability within and across languages.

To handle the variability in syntactic structure, we thus adopted a strategy in which multiple Sentence-to-World mappings could be established, each associated with a particular syntactic form that is uniquely identified by a specific configuration of function words. This method proved effective in these initial experiments. It is noteworthy that what is prewired into the model is the ability to construct these mappings, while the mappings themselves are learned in a language-specific manner. In order to accommodate the variability in the argument structure of verbs, and the corresponding variability in the number of elements in the conceptualized scene, we employ a representation that over-estimates the anticipated need. That is, both the open class array and the conceptualized scene can contain up to five elements, which is sufficient for the intransitive, transitive and ditransitive cases. As constructed, the model seamlessly passes from verbs with one, to two, to three arguments. The key is that for a given sentence type, the appropriate Sentence-to-World mapping has been established, and that it will thus be used.

Thus, to the extent that the input data reliably reflect some structural regularity of the target language, the model can cope with variability in the syntactic structures it can accommodate, both in terms of the number of elements in the verb argument structure, and in terms of syntactic transformations as indicated by function words.
We cannot leave this point, however, without admitting that this is a far cry from the extreme flexibility that is seen in adult language, with hierarchical structures, embedded clauses, etc. This is clear. What is also clear is that at 18–24 months the language capability of children is also a far cry from that of the adult, and that during this period they traverse a language processing capability that is highly reduced with respect to the adult, similar to that of the current model. Despite this weakness, this intermediate state of language processing provides the basis for the development of the adult language capability. It is this developmental trajectory that we intend to follow.

6.4 Relevance to evolution

In the introduction it was stated that knowledge of how children progress in their grasp of language can provide insight into the human evolution of this capability. This could be misinterpreted as a claim that there should have been pre-humans that could understand but not speak language, and then a mutant able only to pronounce single words, and so on; that is, that evolution would have been a direct mapping between developmental and evolutionary stages. This is clearly not the intended claim. What is intended in the comparison is the following. Before children master complex syntax, with conditional non-present tenses and hierarchical embedding, they must first pass through a functionally reduced developmental stage. We can thus imagine that early languages employed a highly simplified syntax that was relevant for talking about the "here and now," and that then became successively richer and more complex, similar to the progressive development of syntactic complexity in the child. Above all, from the child we can observe what comes first, and the order in which successive components follow, providing useful clues concerning the successive introduction of these elements in the evolution of language.

6.5 Future directions

The current implementation was put in place as an initial proof of concept for a language acquisition system based on conceptual grounding and structural mapping between the sentence and the scene. While we are encouraged by the results presented above, it is clear that there are at least three potentially serious scaling limitations that must be addressed.
The first limitation is one of pure size. The system described here is limited to a vocabulary of 25 words, and thus cannot accurately describe acquisition performance beyond this limit (though this is in the right ballpark for the age period we are concerned with). Preliminary experiments on a model scaled to 100 words indicate that learning time and accuracy are not significantly affected, and future work will continue to address this issue of scaling in size. In this context, the simple associative learning mechanism for word meaning, even when vastly improved by syntactic bootstrapping (Eqn 1.2), still retains ample room for improvement, such as the employment of the more sophisticated cross-situational techniques developed by Siskind (1996).

A second limitation is related to the structure of the input sentences. In the earliest implementation of the model, relative constituent order (e.g. SVO, SOV, etc.) was tightly linked to absolute position in the sentence. Thus, the language input had no determiners, adjectives or adverbs, and the model could not treat sentences that deviated from this skeletal structure of bare nouns, bare verbs and function words. A solution to this problem is to treat constituents such as "the big red bear" so that they are assigned as a unit to a given element of the Open Class Array. A first approximation would require augmenting the open vs. closed class categorization of the input to include the finer distinctions of determiners, adjectives and adverbs, and representing constituents like "the big red bear" as single units. The most primitive method to accomplish this is simply to strip out and ignore determiners, adjectives and adverbs; this was taken as a first "rapid prototype" approach, with the clear understanding that the final solution will be more subtle.

A third and related limitation is the inability to process ad hoc structural hierarchies in sentences such as "The block that Mary gave to John hit the ball," which require embedding of relative phrases. Indeed, this type of sentence opens up important reference issues, as "block" now refers both to the object of give and to the agent of hit. In addition to potential supplemental parsing requirements, this implies the need for a discourse-level working memory that keeps track of what has been happening to objects so that they can be referenced. Interestingly, though, the embedded phrases have the same kind of structure as the main phrases, and we have thus started investigating a form of recursive implementation of the existing system.
Recognizing that it would be more conservative in the evolutionary sense to make the existing system recursive, rather than to start again from scratch, we simply treat the relative clause as being associated with an additional conceptualized scene. Thus the sentence "The block that Mary gave to John hit the ball" is now mapped by two distinct Sentence-to-World mappings onto the two conceptualized scenes Gave(Mary, Block, John) and Hit(Block, Ball). In this configuration, the scene item "block" forms the binding that establishes a hierarchical relation between the two scenes. This method for treating relative phrases greatly enriches the permutations of syntactic structures, as it can apply to the agent, object or recipient in active and passive 2- and 3-argument verb sentences.
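As a data-layout illustration of this decomposition (a sketch only, with hypothetical item indices; not the original implementation), the relative and main clauses each receive their own clause-level analysis, bound by the shared item:

# "The block that Mary gave to John hit the ball" -> two clause-level analyses.
# Hypothetical indices: block=6, Mary=7, gave=8, John=9, hit=10, ball=11.
relative_clause = {"open_class": [6, 7, 8, 9],  # block, Mary, gave, John
                   "scene": {"action": 8, "agent": 7, "object": 6, "recipient": 9}}
main_clause = {"open_class": [6, 10, 11],       # block, hit, ball
               "scene": {"action": 10, "agent": 6, "object": 11}}
# The shared item (block = 6) is the object of Gave(Mary, Block, John) and
# the agent of Hit(Block, Ball), binding the two scenes hierarchically.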
In conclusion, while this research represents a modest start, there remains much to be done. The hope is that this type of demonstration can provide a concrete basis for the evaluation and evolution of the current underlying theories.

References

Bates, E., Dale, P. S., & Thal, D. (1995). Individual differences and their implications for theories of language development. In P. Fletcher & B. MacWhinney (Eds.), Handbook of child language (pp. 96–151). Oxford: Basil Blackwell.
Bowerman, M., & Levinson, S. C. (2001). Language acquisition and conceptual development. Cambridge: Cambridge University Press.
Brown, C. M., Hagoort, P., & ter Keurs, M. (1999). Electrophysiological signatures of visual lexical processing: Open- and closed-class words. Journal of Cognitive Neuroscience, 11(3), 261–281.
Carey, S., & Xu, F. (2001). Infants' knowledge of objects: Beyond object files and object tracking. Cognition, 80, 179–213.
Crain, S., & Lillo-Martin, D. (1999). An introduction to linguistic theory and language acquisition. Malden, MA: Blackwell.
Dominey, P. F. (2001). A model of learning syntactic comprehension for natural and artificial grammars. In E. Witruk, A. D. Friederici & T. Lachmann (Eds.), Basic mechanisms of language and language disorders. Dordrecht: Kluwer Academic Publishers. In press.
Dominey, P. F., & Ramus, F. (2000). Neural network processing of natural language: I. Sensitivity to serial, temporal and abstract structure of language in the infant. Language and Cognitive Processes, 15(1), 87–127.
Friederici, A. D. (1985). Levels of processing and vocabulary types: Evidence from on-line comprehension in normals and agrammatics. Cognition, 19, 133–166.
Gillette, J., Gleitman, H., Gleitman, L., & Lederer, A. (1999). Human simulation of vocabulary learning. Cognition, 73, 135–176.
Hirsh-Pasek, K., & Golinkoff, R. M. (1996). The origins of grammar: Evidence from early language comprehension. Cambridge, MA: MIT Press.
Jusczyk, P. W. (1997). The discovery of spoken language. Cambridge, MA: MIT Press.
Marcus, G. F., Vijayan, S., Bandi Rao, S., & Vishton, P. M. (1999). Rule learning by seven-month-old infants. Science, 283(5398), 77–80.
Morgan, J. L., & Demuth, K. (1996). Signal to syntax: Bootstrapping from speech to grammar in early acquisition. Mahwah, NJ: Lawrence Erlbaum.
Nazzi, T., Bertoncini, J., & Mehler, J. (1998). Language discrimination by newborns: Towards an understanding of the role of rhythm. Journal of Experimental Psychology: Human Perception and Performance, 24(3), 1–11.
Osterhout, L. (1997). On the brain response to syntactic anomalies: Manipulations of word position and word class reveal individual differences. Brain and Language, 59, 494–522.
Pinker, S. (1987). The bootstrapping problem in language acquisition. In B. MacWhinney (Ed.), Mechanisms of language acquisition. Hillsdale, NJ: Lawrence Erlbaum.
Pulvermüller, F. (1995). Agrammatism: Behavioral description and neurobiological explanation. Journal of Cognitive Neuroscience, 7(2), 165–181.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Shi, R., Werker, J. F., & Morgan, J. L. (1999). Newborn infants' sensitivity to perceptual cues to lexical and grammatical words. Cognition, 72(2), B11–B21.
Siskind, J. M. (1996). A computational study of cross-situational techniques for learning word-to-meaning mappings. Cognition, 61, 39–91.
Snedeker, J., & Li, P. (2000). The limits of observation: Can the situations in which words occur account for cross-linguistic variation in vocabulary composition? Paper presented at the Seventh International Symposium on Chinese Languages and Linguistics.
Woodward, A. L. (1998). Infants selectively encode the goal of an actor's reach. Cognition, 69, 1–34.
Author's address

Peter F. Dominey
Institut des Sciences Cognitives, CNRS UPR 9075
67, Boulevard Pinel
F-69675 Bron Cedex
France
[email protected]