Cognitive Science 33 (2009) 709–738 Copyright 2009 Cognitive Science Society, Inc. All rights reserved. ISSN: 0364-0213 print / 1551-6709 online DOI: 10.1111/j.1551-6709.2009.01026.x
Labels as Features (Not Names) for Infant Categorization: A Neurocomputational Approach

Valentina Gliozzi,a Julien Mayor,b Jon-Fan Hu,b Kim Plunkettb

aDepartment of Computer Science, University of Torino
bDepartment of Experimental Psychology, University of Oxford
Received 19 March 2008; received in revised form 20 September 2008; accepted 2 October 2008
Abstract

A substantial body of experimental evidence has demonstrated that labels have an impact on infant categorization processes. Yet little is known regarding the nature of the mechanisms by which this effect is achieved. We distinguish between two competing accounts: supervised name-based categorization and unsupervised feature-based categorization. We describe a neurocomputational model of infant visual categorization, based on self-organizing maps, that implements the unsupervised feature-based approach. The model successfully reproduces experiments demonstrating the impact of labeling on infant visual categorization reported in Plunkett, Hu, and Cohen (2008). It mimics infant behavior in both the familiarization and testing phases of the procedure, using a training regime that involves only single presentations of each stimulus and using just 24 participant networks per experiment. The model predicts that the observed behavior in infants is due to a transient form of learning that might lead to the emergence of hierarchically organized categorical structure and that the impact of labels on categorization is influenced by the perceived similarity and the sequence in which the objects are presented. The results suggest that early in development, say before 12 months old, labels need not act as invitations to form categories nor highlight the commonalities between objects, but they may play a more mundane but nevertheless powerful role as additional features that are processed in the same fashion as other features that characterize objects and object categories.

Keywords: Self-organizing maps, Connectionist modeling, Categorization, Lexical development, Hierarchical structure, Word learning, Unsupervised learning
Correspondence should be sent to Kim Plunkett, Department of Experimental Psychology, University of Oxford, South Parks Road, Oxford OX1 3UD, UK. E-mail: [email protected]
1. Introduction

Words help us think. For adults, words have meaning and pragmatic force, and they can be combined to convey complex thoughts and performative functions. For young infants, some words may have pragmatic force but they lack meaning and combinatorial power. The meaningful use of words and the realization of their grammatical productivity are achievements not realized until late infancy or early childhood. Despite the absence of semantic and syntactic representations in young infants, there is ample evidence that words have an impact on infant conceptual systems.

In an extensive series of studies, Waxman and colleagues have offered support for the view that labels have an impact on category formation in young infants (Waxman, 1999, 2003; Waxman & Booth, 2003; Waxman & Markow, 1995). Using a novelty preference procedure, infants were familiarized with a series of objects taken from the same category (such as animals) and then given a choice between two novel objects, one of which was from the familiarized category. During familiarization, the objects were accompanied by a novel label such as ''dax'' or a neutral carrier phrase such as ''Look at this.'' Preference for the novel out-of-category object is taken as evidence that infants treat the novel within-category object as belonging to the same category as the objects presented during familiarization. Infants showed a preference for the out-of-category object when familiarized with the novel label, but not with the neutral carrier phrase. Waxman and colleagues interpret this finding as demonstrating that ''labels facilitate categorization,'' that labels ''act as invitations to form categories,'' and that labels ''highlight the commonalities between objects.'' Their findings suggest that the effects are specific to the consistent use of labels that could be words in the infant's language.
Tones and buzzers do not achieve the same effect (Fulkerson & Waxman, 2007) and the same label needs to be used consistently throughout familiarization. Using different labels does not work (Waxman & Braun, 2005). These findings have been reported for infants well before their first birthday, indicating that labels have an impact on infant categorization before they produce their first words and before they have acquired a substantial receptive vocabulary.

Using a related experimental procedure, Plunkett, Hu, and Cohen (2008) have shown that labeling contingencies can influence the number of categories formed by 10-month-old infants. In a series of five experiments, Plunkett et al. (2008) demonstrated that the manner in which 10-month-old infants perceive the visual similarity of objects depends upon whether infants are familiarized with the objects in the presence or absence of labels, and upon the labeling contingencies during familiarization, that is, how the labels correlate with category membership. In two experiments, they replicated Younger's (1985) results on infant visual categorization—studies in which labels were not presented. During a familiarization phase, 10-month-old infants were exposed to a set of eight visual stimuli, shown in Fig. 1. These stimuli are cartoon drawings of animal-like objects defined along four dimensions, corresponding to the length of the neck and legs, to the ears' orientation, and to the size of the tail. Under the so-called Broad Condition (used in Experiment 1), infants were exposed to cartoons whose feature-values freely vary, and which can be organized in a single large cluster centered
Fig. 1. Cartoons shown to infants during familiarization in Plunkett et al. (2008). The number under each figure represents the feature value for each of the four dimensions defining the cartoon, that is, length of the legs and neck, size of the tail, and separation of the ears, respectively.
on a prototype. Under the so-called Narrow Condition (used in Experiment 2), infants were shown cartoons that are variants of two prototypes and can be grouped in two distinct clusters. In other words, under the Broad Condition, infants were invited to form a single category, whereas under the Narrow Condition they were invited to form two distinct categories. Once the familiarization phase was completed, category formation was assessed by measuring infant looking time when tested with two different kinds of cartoon drawings: the ''average'' drawing (the central tendency of all the drawings presented during familiarization—3333) and one of the ''modal'' drawings (using only the extreme values of the features presented during familiarization, and closer to one of the two prototypes used in the Narrow Condition than to the other—1111 or 5555). Following the well-established novelty preference procedure (Fantz, 1964; Roder, Bushnell & Sasseville, 2000; Wetherford & Cohen, 1973), looking time is taken to be a measure of infant surprise at the new stimuli. If infants look longer at the modal cartoons than at the average one, this is taken as a sign that they have formed a single category; in contrast, if they look longer at the average stimulus than at the modal ones, this is taken as a sign that they have formed two distinct categories. The results showed that when familiarized with the Broad Condition stimuli, infants formed a single category containing all the stimuli, whereas when familiarized with the Narrow Condition stimuli, they formed two distinct categories.

In order to assess the impact of labels on categorization, Plunkett et al. (2008) considered whether labels affected the categorization of the visual stimuli. To this end, three more experiments were conducted. In all three experiments, infants were exposed to the same visual stimuli as the ones used in Experiment 2, namely the visual stimuli from the Narrow Condition.
In contrast to Experiment 2, the visual stimuli were accompanied by the acoustic labels ‘‘rif’’ or ‘‘dax’’ during the familiarization phase of the experiment. In Experiment 3, the two labels were presented, one for each visual category; in Experiment 4, the two labels
were randomly associated with the visual stimuli; in Experiment 5, a single label was presented with all the images. As in Experiments 1 and 2, category formation was assessed by measuring the looking time of infants when tested with modal and average stimuli in silence. In Experiment 3, infants formed two categories, demonstrating that labels did not disrupt visual categorization when they correlated with the visual categories. In Experiment 4, however, in which labels were not correlated with the visual categories, no category was formed, as indexed by a lack of preference for either the modal or average stimuli. This experiment indicates that a decorrelation between labels and visual inputs has a disruptive effect on the process of category formation. Finally, in Experiment 5 a single category was formed, as indexed by a preference for the modal stimuli during testing—an analogous result to Experiment 1 where the different Broad Condition visual stimuli were used.

These results highlight the impact of labels on category formation: The visual stimuli without labels led infants to form two categories (as in Experiment 2), whereas when accompanied with a single label they gave rise to a single category. When accompanied by two labels, infants formed two categories or no categories, depending on the correlation between the labels and category membership. In order to produce this pattern of responding, infants must have tracked the cross-modal statistics that characterized individual label–object pairings and the feature-value correlations within the visual categories. That infants must have computed these cross-modal statistics is apparent from the contrasting results of Experiments 3 and 4 in which identical auditory and visual stimuli were used, yet infants formed two categories when the labels were correlated with category membership and no categories (as indexed by their lack of novelty preferences) when the labels were uncorrelated.
By what means do labels achieve this effect on infant categorization? Demonstrating that infants are sensitive to statistical correlations between labels and category instances does not reveal the nature of the mechanisms that are responsible for computing these statistics.

One possibility, suggested by Waxman and colleagues, is that labels facilitate categorization because they act as invitations to form categories by highlighting the commonalities between objects. On this view, labels play a supervisory role through their one-to-many associations with objects. Labels impact the process of categorization because multiple objects are given the same name and objects that are given the same name belong to the same category. Categories formed in this manner often contain members that share other attributes, but they will always share the same name. A label can function as the name for the category and may even be understood or produced by the infant to refer to members of the category. Stimuli that do not count as names, such as tones and buzzers, cannot invite infants to form categories and will not have meaning. Let us call this approach the supervised name-based account of category formation. Theories of lexical development, such as Clark's (1973) semantic feature hypothesis, attribute a similar role to labels in the development of word meaning.

An alternative approach assumes that labels are additional feature-values that enter into the statistical computations performed by infants during the process of category formation. On this view, labels are nonsupervisory, that is, they have the same status as other features
and are handled in the same manner and as part of the same statistical computation as other features. Like other features, they may vary in their salience and thereby have a greater or lesser impact on the outcome of computations. Note that redundant features do not help discriminate between categories. Contrastive features are the most informative sources in category formation. On this view, labels that do not vary contrastively across sets of objects will be redundant and fail to contribute to category formation. Let us call this approach the unsupervised feature-based account of category formation. Theories of infant categorization, such as Younger and Cohen’s (1986) account of the perception of correlations among attributes, ascribe a similar role to features. Evidence as to whether labels fulfill a supervisory or nonsupervisory role for infant categorization is scant. The categorization studies reported by Waxman and colleagues involve just a single category of objects and, consequently, do not evaluate whether infants have learnt an association between the familiarization label and the object category. It is, therefore, unclear whether the categorization effect described in these studies is caused by the supervisory role of the label being associated with category members, or other factors, such as heightened attention due to the presence of a salient auditory stimulus. The results of Experiment 5 in Plunkett et al. (2008) offer some support for a name-based account. Recall that infants treat the Narrow Condition stimuli as members of a single category when they are accompanied by the same label during familiarization, whereas they are grouped into two categories in the absence of any labels. On the feature-based account, the label is redundant and should be ignored. The result of Experiment 5 suggests that the label is playing a supervisory role and perhaps acting as the name of the category. 
In contrast, Hu (2008) reports evidence consistent with a feature-based account in a follow-up study of Experiment 3 in Plunkett et al. (2008): After infants had been familiarized with Narrow Condition stimuli and two correlated labels and tested using the standard novelty preference procedure, infants were given an inter-modal preferential looking task (Golinkoff, Hirsh-Pasek, Cauley & Gordon, 1987) in which novel but typical instances of the two categories (1111 and 5555) were displayed side by side and each of the labels was played to the infants. If infants had learned the names for the two categories, then we would expect them to orient preferentially to the category instance associated with the appropriate label (Pruden et al., 2006; Schafer, 2005). However, infants failed to demonstrate any preference upon hearing either label. Insofar as a null result can be interpreted as evidence, we may cautiously conclude that infants demonstrated no evidence of learning the names for the categories they had formed during the familiarization phase of the experiment, even though labels clearly had an impact on the category formation process.1

One interpretation of this finding is that labels fulfilled a nonsupervisory role in Experiment 3, acting simply as additional features that entered into the statistical computations leading to category formation. But why should labels play a nonsupervisory role in Experiment 3 but take on a supervisory capacity in Experiment 5? One possible answer to this question may lie in the constancy of the labeling events. Infants can readily identify that labels are used in a contrastive fashion in Experiment 3 and may engage a different learning strategy than when the same label is used. However, this explanation suggests a somewhat capricious infant, ungrounded in the labeling contingencies of the real world—labels are rarely repeated in the
fashion of Experiment 5. How would the infant know which learning strategy to adopt in this situation?

Our goal in this paper is to offer a unified, mechanistic account of the full pattern of results described by Plunkett et al. (2008). We achieve this using a neurocomputational model that mimics infant behavior during both familiarization and testing phases in the experiments. The model is a direct attempt to evaluate the plausibility of the unsupervised feature-based account of infant categorization described above. The model is based on self-organizing maps (SOMs), proposed by Kohonen (1982, 1984). We have adopted this type of neurocomputational architecture for a variety of reasons. First, Kohonen maps implement a biologically plausible approach to human information processing (Kohonen, 1993) that has been applied to a wide range of physiological and cognitive phenomena (for an extensive review, see Miikkulainen, Bednar, Choe, & Sirosh, 2005). Although our particular implementation is targeted at a high level of cognitive abstraction, we can be confident that the network architecture and learning algorithms exploited in the model can also be implemented at a physiological level of information processing. Second, self-organizing maps do not require corrective feedback for learning. They are pure unsupervised learning systems and are therefore ideal tools for investigating an unsupervised, feature-based account of infant visual categorization. Finally, Kohonen maps offer a mechanistic framework for exploring developmental phenomena. Although the dataset we aim to model is not developmental in character—we consider only performance in 10-month-old infants—we demonstrate that these self-organizing maps have developmental implications, both for predicting trajectories of development and for identifying the appropriate conditions for successful mastery of a task.
However, before describing the model in detail, we further motivate our particular choice of architecture by reviewing previous modeling endeavors that have applied network architectures to word learning and infant categorization.
2. Network models of word learning and infant categorization

There are very few neural network models that investigate the impact of labels on visual categorization. Plunkett, Sinha, Møller, and Strandsby (1992) proposed a connectionist model of vocabulary learning that learns to associate image representations and labels. The network architecture was an auto-encoder consisting of two partially merging subnetworks: a visual and a linguistic subnetwork. When presented with a label–object pair, the network was trained following a three-phase training regime that aimed at capturing an attention-switching process. The network first concentrates on the image, then on the label, and then on the ensemble. The performance of the network was then evaluated by analyzing the network's ability to produce the correct label, when only an image was presented (production), and to produce the correct image, when only a label was presented (comprehension). The results suggested that language might have an impact on the speed with which image representations are clustered. Notice that the impact of language on visual categorization in this model is weaker than the one described in Plunkett et al. (2008): The labels do not change the final outcome of the categorization process, only the efficiency with which it is achieved.
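The three-phase attention-switching regime can be illustrated with a toy linear auto-associator. The layer sizes, learning rate, masking scheme, and step counts below are our own illustrative assumptions, not details of Plunkett et al.'s (1992) implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_img, d_lab, d_hid = 4, 4, 3  # illustrative layer sizes
W1 = rng.normal(scale=0.1, size=(d_hid, d_img + d_lab))  # encoder weights
W2 = rng.normal(scale=0.1, size=(d_img + d_lab, d_hid))  # decoder weights

def train_step(x, lr=0.1):
    """One gradient step toward reproducing x at the output (auto-association)."""
    global W1, W2
    h = W1 @ x        # compressed hidden representation
    err = W2 @ h - x  # reconstruction error at the output layer
    W2 -= lr * np.outer(err, h)
    W1 -= lr * np.outer(W2.T @ err, x)
    return float(np.sum(err ** 2))

image = rng.random(d_img)               # illustrative visual vector
label = np.array([0.0, 0.0, 0.7, 0.7])  # illustrative label vector
pair = np.concatenate([image, label])
img_only = np.concatenate([image, np.zeros(d_lab)])
lab_only = np.concatenate([np.zeros(d_img), label])

# Phase 1: attend to the image; phase 2: to the label; phase 3: to the ensemble.
schedule = 50 * [img_only] + 50 * [lab_only] + 50 * [pair]
losses = [train_step(x) for x in schedule]
```

The attention switch is implemented here simply by zeroing the unattended half of the input; the reconstruction error falls within each phase as the weights adapt.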
The role of language in visual categorization is also considered by Mirolli and Parisi (2005), who proposed a neural network model similar to the one proposed by Plunkett et al. (1992). Mirolli and Parisi (2005) introduced a metric to determine how well the network categorizes objects. This metric compares the similarity of the hidden unit representations corresponding to objects in the same category (the more similar, the better) with the similarity of the hidden unit representations corresponding to objects belonging to different categories (the more distant, the better). Their results show how labels can improve the quality of visual categorization when visual and linguistic stimuli are presented together to the network. However, the authors do not assess the impact of labels on visual categorization in comparison with the network performance in the absence of labels. Again, the hypothesis put forward by Mirolli and Parisi (2005) is weaker than the one addressed in the current work, which attempts to show how labels not only improve visual categorization but can also change it. Note that both of these models involve corrective feedback derived from a ‘‘teacher’’ signal that informs the learning algorithm how to adapt the connections in the network. In both models, corrective feedback regarding the appropriate use of the labels is employed. In effect, they are implementations of a supervised, name-based approach to infant categorization and are inappropriate for evaluating the unsupervised, feature-based approach advocated in the current work. These earlier models are also based on traditional backpropagation networks, which are widely recognized as biologically implausible (e.g., Crick, 1989). The same drawback applies to the models belonging to the second approach that we now consider. 
Mareschal and French (2000) proposed a feed-forward, backpropagation network to simulate Younger's (1985) experimental results (corresponding to Experiments 1 and 2 in Plunkett et al., 2008). Their network was an auto-encoder containing a single hidden layer that learned to reproduce the received inputs at the output layer. In order to do this, the network learned how to encode inputs at the hidden layer in a compact way: Having fewer units than the input and output layers, the hidden layer served the role of an information bottleneck. In an auto-encoder, the input signal also serves as the target signal used to train the network. In this way, the auto-encoder architecture attempts to bypass the objection of the implausibility of error-driven learning in contexts such as visual categorization in which there is clearly no teaching signal. The model aimed at simulating the relation between ''sustained attention, encoding, and representation construction'' of infants in Younger's experiment.

The authors established a correlation between network output error and looking time: Longer looking time corresponds to higher output error in the network. Mareschal and French (2000) justify the established correlation as follows: When infants look at a new visual stimulus, they build an internal representation that they compare to the stimulus and adjust this representation until there is a correspondence. In their model, the adjustment of the internal representations is explicitly described as an iterative process of encoding the visual input into an internal representation and then assessing the obtained representation against the continuing perceptual input. As long as there is a mismatch, the infant continues to fixate the stimulus and to update its internal representation. Therefore, the higher the initial mismatch between the internal representation and the stimulus (i.e., the higher the error), the longer will be the looking time.
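Under this linking hypothesis, the test-phase logic can be sketched directly: whichever test stimulus is reconstructed worst from the familiarized representations is predicted to attract the longer look. The `prototypes` stand-in below replaces the auto-encoder with stored cluster centres, and all feature values are illustrative:

```python
import numpy as np

def reconstruction_error(stimulus, prototypes):
    """Distance from the stimulus to the best-matching stored representation
    (a stand-in for the auto-encoder's output error)."""
    return min(float(np.sum((stimulus - p) ** 2)) for p in prototypes)

# Test stimuli on the 1-5 feature scale, normalized to [0, 1].
average = np.array([3, 3, 3, 3]) / 5.0
modal = np.array([1, 1, 1, 1]) / 5.0

# After Narrow familiarization, representations cluster around two prototypes,
# so the average stimulus is the more novel one.
narrow_memory = [np.array([1, 1, 1, 1]) / 5.0, np.array([5, 5, 5, 5]) / 5.0]
# After Broad familiarization, a single cluster centres on the average,
# so the modal stimulus is the more novel one.
broad_memory = [np.array([3, 3, 3, 3]) / 5.0]

prefers_average_after_narrow = (reconstruction_error(average, narrow_memory)
                                > reconstruction_error(modal, narrow_memory))
prefers_modal_after_broad = (reconstruction_error(modal, broad_memory)
                             > reconstruction_error(average, broad_memory))
```

Both flags come out true, reproducing the qualitative looking-time pattern described above for the two familiarization conditions.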
Although the model provides a mechanistic account of
category learning by successfully replicating some of Younger's (1985) experimental results, a difficulty with this account lies in the way in which the model learns to categorize during familiarization. The network is trained in batch mode. Learning occurs only after all of the visual stimuli have been presented to the network, as only at this point are the weights updated. In contrast, infants learn continuously about the input as they do not know when the training sequence will end. Furthermore, as in all models based on backpropagation, the whole set of inputs is presented to the network hundreds of times before learning is complete. These are unrealistic features of the model—infants learn after just eight presentations of the visual stimuli.

Westermann and Mareschal (2004) proposed an improved version of Mareschal and French's (2000) auto-encoder model to simulate Younger's results on infant categorization. Their focus is rather different from Mareschal and French (2000) as they modeled the shift from featural to relational processing of the visual input that occurs at an early stage in the development of many cognitive abilities (Cohen, 1998; Younger & Cohen, 1986). They proposed the ''Representational Acuity'' hypothesis, according to which the capacity of infants to use the correlation among features when categorizing is related to their capacity for discriminating single features. Their model uses a radial-basis-function auto-encoder architecture with Gaussian hidden units, whose number is larger than that of the input units. The progressive attunement to the detection of specific features is represented in the model by the shrinking of the hidden units' receptive fields. Learning takes place by adjusting the weights between hidden units and output units. In contrast to Mareschal and French (2000), these weights are adjusted in a pattern-update manner, that is, learning takes place after each presentation of a visual stimulus.
However, for the network to categorize successfully, the set of visual stimuli must be presented to the network hundreds of times. Again, this does not reflect the experimental schedule with infants.2 In spite of these limitations, the model successfully replicates Younger's (1985) Experiment 2. Performance of the model in replicating Experiment 1 of Younger (1985) is not reported.

Note that these infant categorization models do not involve processing of labels. Categorization in the models involves visual stimuli alone. It would not be difficult to extend these models to include labels along the lines of the auto-encoder networks described by Plunkett et al. (1992) and Mirolli and Parisi (2005). However, such models provide a supervisory target label for which corrective feedback is supplied and, therefore, would correspond to an implementation of the name-based approach to infant categorization and not constitute an appropriate test of the feature-based approach advocated in the current work. We do not consider this class of models further.

Gureckis and Love (2004) propose an alternative model to account for the results of the experiments on visual categorization presented in Younger and Cohen (1986). They describe a network architecture called SUSTAIN which has been applied successfully to a range of adult categorization problems. Their model is trained in an unsupervised fashion with Younger and Cohen's (1986) infant stimuli and it succeeds in mimicking the development of categorization abilities with just a single presentation of each training stimulus. SUSTAIN resembles the model we use to simulate the Plunkett et al. (2008) data, so it is worth considering in a little more detail. SUSTAIN is an incremental network. At the
beginning of training, the network contains only a single unit, whose internal representation is set equal to the first input. Further units are recruited when the network receives an input that is different enough from previous inputs, as determined by a threshold parameter. The internal representation of each newly allocated unit coincides with the input that motivated its introduction. For inputs that are similar enough to previous inputs, a principle of competitive learning is applied: The winning unit is the one closest to the input, and it updates its internal representation to become the average of all items assigned to it so far. This procedure enables SUSTAIN to categorize input patterns based on their similarity: The winning unit identifies the category to which the current input belongs.

Gureckis and Love (2004) modeled the shift from featural to relational processing of the visual input in Younger and Cohen (1986) by shrinking the threshold parameter. Using smaller thresholds enables SUSTAIN to build new categories that are sensitive to the featural correlations in the training set. Gureckis and Love (2004) also report that SUSTAIN can simulate the results of the Younger (1985) study which correspond to the first two experiments reported by Plunkett et al. (2008). It is unclear, however, whether SUSTAIN could simulate the three labeling conditions described by Plunkett et al. (2008) while maintaining the same threshold parameter across all experimental conditions. We shall return to this issue later when considering unsupervised similarity-based approaches in general.

For the moment, we highlight several important differences between SUSTAIN and Kohonen maps. First, Kohonen maps have a topographical structure: Categories that are similar to each other are close together on the map. SUSTAIN does not have this structure. Topographical structure permits assessment of between-category similarity.
Second, SUSTAIN achieves final performance levels after a single exposure to the complete training set. Repeated exposure to the same training set will not alter its response characteristics. This is not true of infants, and it is not true of Kohonen maps. Finally, SUSTAIN is entirely algorithmic: There is no evidence that the principles underlying the model can be implemented by biologically plausible mechanisms. The model we propose is based on self-organizing maps, which can be implemented using biologically plausible mechanisms.

Li, Zhao, and MacWhinney (2007) have proposed a model based on self-organizing maps called DEVLEX to account for some well-known phenomena of early vocabulary acquisition. Their model consists of three standard Kohonen maps with a standard learning function. A phonological map processed nearly 600 words from different parts of speech (nouns, verbs, adjectives, etc.) that are typically found in the early mental lexicon. Each word was presented to the model as a phonological feature vector that uniquely identified that word among all other words in the vocabulary. A semantic map processed individual word meanings, where word meanings were defined in terms of co-occurrence constraints on words. Finally, a phonetic output map was trained to produce a sequence of output phonemes corresponding to word production. Inter-map connections from the input map to the semantic map and from the semantic map to the output sequence map were trained using Hebbian learning. Li et al.'s model accounts for the vocabulary spurt (the rapid acceleration of vocabulary learning after 18 months of age), for word frequency and length effects and for individual differences in early word learning, as well as for developmental plasticity (by which children with early brain injury can acquire linguistic abilities within the normal
range). This approach demonstrates the utility of Kohonen maps for simulating aspects of early lexical development.

Mayor and Plunkett (2008) have extended this framework to a model that learns to associate labels with categories of objects. Mayor and Plunkett trained an auditory self-organizing map with 100 object labels, each label pronounced in different ways corresponding to variation observed across speakers. Similarly, a visual self-organizing map was trained with 100 object categories, each category having multiple variants based on a prototype. Once the visual and auditory maps had learned to categorize the labels and objects appropriately, individual label–object pairs were presented to the model, producing a pattern of activity across the auditory and visual maps. Hebbian learning was then used to build connections between the most highly active neurons in each map, thereby establishing label–object associations. Mayor and Plunkett showed that learning of a single label–object association is adequate to establish extension of the label to all members of the object category, fulfilling the so-called taxonomic constraint (Markman, 1990). They also showed that the model identified object category members in a categorical fashion, that is, sharp discontinuities in object identification were only observed close to category boundaries. Again, this model highlights the utility of Kohonen maps for capturing the generalization properties of words and the development of categorical perception.

However, the configurations of the architectures used by Li et al. (2007) and Mayor and Plunkett (2008) do not allow words to influence the manner in which meanings or object categories are organized, because word meanings and object representations are represented on separate maps from the one used for word form representation. In the remainder of this paper, we describe a further adaptation of the SOM framework that permits object categories to be directly influenced by labels.
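The Hebbian cross-map linking step described above for Mayor and Plunkett's (2008) model can be sketched as follows; the map sizes, winner positions, and learning rate are illustrative assumptions, not values from their simulations:

```python
import numpy as np

n_auditory, n_visual = 25, 25             # illustrative map sizes (flattened)
assoc = np.zeros((n_auditory, n_visual))  # cross-map association weights

def hebbian_update(aud_act, vis_act, lr=0.1):
    """Strengthen links in proportion to co-activity on the two maps."""
    assoc[:] += lr * np.outer(aud_act, vis_act)

# One label-object pairing: activity concentrated on each map's winning unit
# (positions 7 and 12 are arbitrary illustrative winners).
aud_act = np.zeros(n_auditory)
vis_act = np.zeros(n_visual)
aud_act[7] = 1.0
vis_act[12] = 1.0
hebbian_update(aud_act, vis_act)

# The label's winner is now most strongly linked to the object's winner, so
# hearing the label most activates the visual region coding that category.
best_visual = int(np.argmax(assoc[7]))
```

Because every member of an object category activates the same region of the visual map, a link built from a single pairing generalizes to the whole category, which is the basis of the taxonomic extension effect described above.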
We aim to demonstrate that this type of architecture can capture the pattern of findings reported by Plunkett et al. (2008). The model is entirely unsupervised and can be considered an implementation of the feature-based approach to understanding how labels influence infant categorization.
3. The model

We propose a psychologically plausible neural network architecture that replicates the five experiments described in Plunkett et al. (2008). The architecture consists of a single self-organizing map that receives as input both the visual and the acoustic stimuli, as shown in Fig. 2. As we will see, the acoustic part of the map is ignored when replicating Experiments 1 and 2, in which there are no labels. The acoustic part is also ignored when testing the map's performance, as labels do not play any role in this phase.

3.1. Encoding of the stimuli

Following Mareschal and French (2000), the visual and acoustic stimuli are encoded by means of value vectors. For both kinds of stimuli, we use input vectors with four dimensions. In the case of visual stimuli, each value in the input vector corresponds to
Fig. 2. The architecture: Both visual and acoustic input feed to the same map.
a feature in the cartoons presented to infants (the length of the legs and neck, tail width, and ear separation). Each feature is first measured and then divided by the maximal value the feature can take. By so doing, we assume that each feature plays a similar role in the categorization task. The actual vectors used in the Narrow and Broad Conditions are listed in Table 1.

Table 1
Feature value vectors presented in the Narrow and in the Broad Conditions, after Mareschal and French (2000). See text for details

      Narrow Condition                Broad Condition
Legs   Neck   Tail   Ears       Legs   Neck   Tail   Ears
 .27   1.00    .80    .33        .27   1.00    .22   1.00
 .27    .81   1.00    .33        .27    .23   1.00   1.00
 .45    .81   1.00    .11        .45    .81    .41    .78
 .45   1.00    .80    .11        .45    .42    .80    .78
 .82    .42    .22   1.00        .82    .42    .80    .33
 .82    .23    .41   1.00        .82    .81    .41    .33
1.00    .23    .41    .78       1.00    .23   1.00    .11
1.00    .42    .22    .78       1.00   1.00    .22    .11

The stimuli in the Narrow Condition can be naturally divided into two categories. To the first category belong the stimuli having small values on the first and fourth vector elements (the legs and ears, respectively) and high values on the second and third vector elements (the neck and tail, respectively), whereas to the second category belong the stimuli that have the opposite configuration. This configuration parallels the natural clustering into two categories of the images in the Narrow Condition, as described in Plunkett et al. (2008). In contrast, in the Broad Condition vector values vary freely; for example, a broad range of neck lengths combines with a broad range of leg lengths. For the acoustic stimuli, each label is represented by a vector of dimension four: either (.0, .0, .7, .7) or (.7, .7, .0, .0). The fact that the visual and acoustic vectors have the same dimension reflects our assumption that these two components of the visual-acoustic stimuli
may be equally important. We can increase or decrease the importance of the components by modifying the vectors' dimension and value range. As with the infants, we assess the performance of the model by measuring its response to the two modal stimuli and to the average stimulus, represented as the vectors (.27, 1.00, 1.00, .11), (1.00, .23, .22, 1.00), and (.64, .62, .61, .56), respectively. The measured response is the quantization error, an index of novelty, which will be defined in the next section.

3.1.1. The Kohonen algorithm

During the training phase, the map is presented with a set of input vectors that it learns to categorize. With each map unit u is associated a weight vector w_u of the same dimension as the input vectors. This weight vector can be thought of as the unit's internal representation of the world. The weight vectors are initialized to random values. The learning algorithm proceeds as follows: Input vectors are presented to the network in random order. After each presentation of an input vector I, its best matching unit (for short, BMU) i is identified. This is the unit whose internal representation (weight vector) is closest to the input vector itself (in Euclidean distance). Once the BMU i is identified, its weights—and the weights of its neighboring units—are adjusted toward the input vector, according to the equation:

    w_u(t + 1) = w_u(t) + N(u, i, t) α(t) (I(t) − w_u(t))        (1)

where w_u(t + 1) and w_u(t) are the weight vectors associated with unit u at time t + 1 and t, respectively, and I(t) is the input vector presented to the network at time t. N(u, i, t) is the Gaussian neighborhood function of the distance ‖r_u − r_i‖ between unit u and the BMU i of input I at time t:

    N(u, i, t) = exp(−‖r_u − r_i‖² / (2σ²(t)))

where σ(t) shrinks linearly over time from σ(1) = 1.2 to σ(8) = .8. The neighborhood function centered on a BMU i at time t determines the radius of the effects of a change in the internal representation of the neighboring units. α(t) is the learning rate at time t. The learning rate function determines the magnitude of change of the units' internal representations toward the incoming stimulus. In our model, as in standard self-organizing maps, the value of the neighborhood function shrinks over time, and the learning rate decreases over time. The discrepancy between an input I(t) and the weight vector associated with its BMU i is called the "quantization error" of the network with respect to I(t) (we denote this error by Q_e(t) = ‖I(t) − w_i(t)‖). The principle underlying the self-organizing map algorithm is therefore the following: After the map has been presented with a given input I, the weights of its BMU i, and of its surrounding units, are adjusted toward I itself. Therefore, each time an input similar to I is presented to the map, its BMU is likely to be in the same region as the BMU for I. As a result, similar inputs activate units in the same region of the map. In contrast to simple competitive learning, in self-organizing maps there is a topological organization: Similar inputs are categorized in adjacent regions of the map.
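The update rule of Eq. (1) and the Gaussian neighborhood can be sketched in a few lines of Python. This is an illustrative simplification, not the authors' code: we use a square 5 × 5 grid rather than the model's hexagonal grid, and we pass the learning rate in as a constant.

```python
import numpy as np

def som_step(weights, positions, x, sigma, alpha):
    """One self-organizing-map update step (cf. Eq. 1): find the best
    matching unit (BMU), then move every unit toward the input x,
    weighted by a Gaussian neighborhood centered on the BMU."""
    # BMU: the unit whose weight vector is closest (Euclidean) to x.
    dists = np.linalg.norm(weights - x, axis=1)
    bmu = int(np.argmin(dists))
    qe = float(dists[bmu])  # quantization error for this input
    # Gaussian neighborhood over the grid positions r_u, r_i.
    grid_d2 = np.sum((positions - positions[bmu]) ** 2, axis=1)
    n = np.exp(-grid_d2 / (2.0 * sigma ** 2))
    # w_u(t+1) = w_u(t) + N(u,i,t) * alpha(t) * (I(t) - w_u(t))
    weights = weights + n[:, None] * alpha * (x - weights)
    return weights, bmu, qe
```

After one step, the BMU's weight vector has moved a fraction alpha of the way toward the input, so the quantization error for a repeated presentation of the same input would be smaller.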
3.2. Specific assumptions

For a standard self-organizing map to learn to categorize a set of inputs, the inputs need to be presented to the network more than once. However, this does not correspond to the schedule of the experiments described in Plunkett et al. (2008): Infants are presented with each stimulus only once. In order to achieve psychological plausibility, each input needs to be presented only once to the network. The learning rate is therefore adapted from standard SOM algorithms, where the learning rate is normally a monotonically decaying function of time. We adapt two aspects of learning in the model. First, we make the learning rate depend on attention: The learning rate is higher when the input stimulus is novel. This corresponds to the observed phenomenon that infant attention is higher when infants are exposed to novel stimuli. The learning rate in our model depends on the Euclidean distance between the weight vector being updated and the input vector that produces this update: the higher the distance, the higher the learning rate; the lower the distance, the lower the learning rate. Typically, the first time an input I from a given category is presented to the network, its distance to the weight vector of its BMU i will be high (as nothing similar has been learnt by the network yet), hence the learning rate will be high, and i's weights will be updated so that they become close to I. A subsequent input I′ that belongs to the same category as I will likely also have i as its BMU. Furthermore, I′'s distance from i's weight vector will not be very high (as both I′ and i's weight vector are similar to I). Consequently, i's weight vector (its "internal representation") will be updated toward I′ with a low learning rate, and it will become the average of the inputs I and I′ (and, more generally, of all inputs of the same category for which it is the BMU).
In contrast, if a new input is markedly different from previous inputs, the network will recruit a new BMU. As the learning rate in such a condition is high, the new BMU becomes the center of the new category. Note that this attention mechanism leads to behavior that is very similar to the algorithmic implementation of Gureckis and Love (2004). Second, the learning rate is a function of the total cognitive load. An increased cognitive load makes distances between attributes (in the feature space) less significant: Infants are more sensitive to subtle differences when they have less information to process. Cognitive load is implemented in the model by requiring that the learning rate be negatively correlated with the dimensionality of the input vectors. As a consequence, for the same Euclidean distance, the values of the learning rate function are higher when visual inputs are considered alone and lower when visual inputs are considered together with acoustic inputs, because the input vectors have lower dimensionality in the former case and higher dimensionality in the latter. Combining these two contributing effects, we define the learning rate as a simple sigmoid function of the quantization error: The learning rate saturates to 1 for large quantization errors and is low for small quantization errors. Our learning rate function has the form of a standard sigmoid function
    α(t) = 1 / (1 + exp(−(Q_e(t) − b) / a))

where Q_e(t) is the quantization error resulting from the presentation of input I(t), b is a scaling factor, and a scales the learning rate according to the inputs' dimensionality. We report the results with b = .4 and a = .5. Over the range .3 < b < .5 and .05 < a < 1.0, more than 50% of the combinations of parameters achieve statistical significance in all five experiments.³ As a result of the incorporation of attention and cognitive load into the learning rate, the inputs need to be presented to the network only once, similar to the experimental procedure of all five experiments described in Plunkett et al. (2008), thereby contributing to the psychological plausibility of the model. The self-organizing map contains 25 units organized in a hexagonal grid. Each unit is associated with a weight vector of dimension eight: the first four values of the weight vector are weights from the visual inputs, the last four are weights from the acoustic inputs. The same map is used to simulate all five behavioral experiments. In Experiments 1 and 2, in which the map receives only visual inputs, the acoustic portion of these vectors is ignored. The weight vectors associated with each unit of the map are initialized to small random values. These values are uniformly distributed between the maximal and minimal values of the corresponding components of the input vectors, scaled by a factor of .3. We will see later that the initialization of the weight vectors has an impact on the number of subsequent categories formed, similar to the effect of a threshold in Gureckis and Love (2004). The network is trained by presenting input patterns in random order. Weight update is performed using the update rule described in Eq. (1), with the learning rate α(t) a function of the quantization error, as described above.
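The sigmoid learning rate can be sketched as follows. The precise placement of a and b in the exponent is our reading of the formula, chosen to be consistent with the stated behavior (low rate for small quantization errors, saturating toward 1 for large ones) and the reported parameter values.

```python
import math

B = 0.4  # scaling factor b, value reported in the text
A = 0.5  # a: scales the rate with the inputs' dimensionality

def learning_rate(qe, a=A, b=B):
    """Sigmoid learning rate as a function of the quantization error:
    low for small errors (familiar inputs), saturating toward 1 for
    large errors (novel inputs). The exponent form is one plausible
    reading of the formula, not a verbatim reproduction."""
    return 1.0 / (1.0 + math.exp(-(qe - b) / a))
```

A larger a flattens the sigmoid, lowering the rate for a given quantization error, which is how a higher input dimensionality (greater cognitive load) would damp learning.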
During the test phase, the modal and average stimuli are presented to the map, the acoustic inputs being ignored, and the map's performance is assessed by measuring the quantization error associated with the test stimuli. We use the quantization error Q_e as an analog of infant looking times. Being an index of the discrepancy between the network's internal representation and the input stimuli, quantization error is a natural candidate for mimicking the infants' looking behavior in response to novelty. A small quantization error in response to a given input indicates that the network has already been presented with a very similar input; the input is not novel to the network. Analogously, infant looking times are smaller for non-novel items. In contrast, a very novel input would be associated with a high quantization error in the network, and with longer looking times in infants. More precisely, we define the "network looking time" (NLT) as a monotonic function of the quantization error, NLT = 4·Q_e^0.5, where the pre-factor 4 scales to the experimental trial duration and the exponent 0.5 has been determined by a regression analysis on the results. In the test phase, we assume that if the NLT corresponding to the modal visual stimuli is systematically higher than the NLT corresponding to the average visual stimulus, then the network exhibits a novelty preference for the modal stimuli and has, therefore, formed a single category, centered on the average stimulus. In contrast, if the NLT corresponding to the
average stimulus is systematically higher than that corresponding to the modal stimuli, we conclude that the network has formed two distinct categories, centered around the two modal stimuli. As the test stimuli involve only visual stimuli, the acoustic pathways in the network are ignored during testing. All results and statistical analyses refer to the performance of 24 independent networks per experiment. This corresponds to the number of infants tested in the original experiments.
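The looking-time analog and the test-phase decision rule can be sketched as below. The function names are ours, and the "systematically higher" criterion, which is assessed statistically over the 24 networks, is compressed here into a simple comparison for illustration.

```python
import math

def network_looking_time(qe):
    """NLT = 4 * Qe^0.5: the pre-factor 4 scales to the trial
    duration; the exponent .5 was fitted by regression (per the text)."""
    return 4.0 * math.sqrt(qe)

def category_outcome(nlt_modal_1, nlt_modal_2, nlt_average):
    """Novelty-preference logic at test (a sketch): longer looking at
    the modal stimuli implies one category centered on the average;
    longer looking at the average implies two categories."""
    mean_modal = (nlt_modal_1 + nlt_modal_2) / 2.0
    if mean_modal > nlt_average:
        return "one category"
    if nlt_average > mean_modal:
        return "two categories"
    return "no preference"
```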
4. Results

First, NLTs are compared to the infant looking times reported in Plunkett et al. (2008) for the testing phase of each experiment. Second, we analyze the natural clustering characteristics of the stimuli used in the five experiments to highlight the similarity structure of the stimuli. The results of the simulations and the analysis of the similarity structure in the stimuli show that infants and the model group objects together differently from the clustering analysis, indicating that the order of presentation of the stimuli in the training phase impacts the simulation results over and above their global similarity. In a third analysis, we show that the model mimics infants' looking times in the familiarization phase of the experiments and thereby confirm the impact of the serial presentation of the familiarization stimuli on categorization.

4.1. Results of the testing phase

In order to replicate the five experiments described in Plunkett et al. (2008), we train the architecture by presenting the relevant input stimuli once, in random order. In Experiments 3–5, which involve acoustic as well as visual stimuli, we consider the whole architecture. In contrast, in the simulations of Experiments 1 and 2, in which only the visual stimuli are relevant, we consider only the visual portion of the architecture. This is done by ignoring the weights connecting the map to the acoustic stimuli.
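The similarity structure of the familiarization stimuli mentioned above can be inspected directly from the feature vectors in Table 1; a minimal sketch (variable names are ours; values transcribed from Table 1 and Section 3.1):

```python
import numpy as np

# Feature vectors from Table 1 (order: legs, neck, tail, ears).
NARROW = np.array([
    [.27, 1.00, .80, .33], [.27, .81, 1.00, .33],
    [.45, .81, 1.00, .11], [.45, 1.00, .80, .11],
    [.82, .42, .22, 1.00], [.82, .23, .41, 1.00],
    [1.00, .23, .41, .78], [1.00, .42, .22, .78],
])
BROAD = np.array([
    [.27, 1.00, .22, 1.00], [.27, .23, 1.00, 1.00],
    [.45, .81, .41, .78], [.45, .42, .80, .78],
    [.82, .42, .80, .33], [.82, .81, .41, .33],
    [1.00, .23, 1.00, .11], [1.00, 1.00, .22, .11],
])
AVERAGE = np.array([.64, .62, .61, .56])  # average test stimulus

def mean_pairwise(a, b):
    """Mean Euclidean distance over all pairs drawn from a and b
    (includes zero self-pairs; fine for a rough comparison)."""
    return float(np.mean([np.linalg.norm(x - y) for x in a for y in b]))

# In the Narrow Condition the first four and last four stimuli form two
# tight clusters: within-cluster distances are much smaller than
# between-cluster distances. In the Broad Condition no such split exists.
within = (mean_pairwise(NARROW[:4], NARROW[:4])
          + mean_pairwise(NARROW[4:], NARROW[4:])) / 2
between = mean_pairwise(NARROW[:4], NARROW[4:])
```

Note that both conditions share the same per-feature averages, so the average test stimulus is the mean of either familiarization set.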
Table 2
Network looking time during testing

Experiment   Network LT average:modal   Ratio LT (SD)   t-test (2-tailed)   p-value
Exp. 1       2.62:3.93                  40.0% (.7)      t(23) = -28.7       <.0001
Exp. 2       2.72:2.10                  56.5% (1.8)     t(23) = 6.0         <.0001
Exp. 3       2.83:2.01                  58.6% (.9)      t(23) = 18.5        <.0001
Exp. 4       2.24:2.24                  50.1% (3.1)     t(23) = 1.1         .30
Exp. 5       1.86:2.11                  46.8% (2.4)     t(23) = -3.4        <.0001

Network looking times at test (Table 2) show the expected preference in each experiment, replicating the infant findings. In addition, Plunkett et al. (2008) report that infant looking times decrease during the familiarization phase. In order to identify whether the model could capture these effects in the familiarization phase, we ran a 5 × 2 mixed-model ANOVA with the factors Experiment (five levels) and Block (two levels). We found a main effect of Experiment (F[4,115] = 126.3, p < .001) and of Block (F[1,115] = 1271, p < .001), replicating the infant findings. We also found an interaction effect between Experiment and Block (F[4,115] = 19.04, p < .001) not reported by Plunkett et al. (2008). However, simple comparisons of Block yielded a highly significant effect of longer NLTs in Block 1 than Block 2 in each experiment.
Table 3
Infant and network looking time (SD) during familiarization

                   Infants                                   Networks
Exp.   Mean          Block 1       Block 2       Mean         Block 1      Block 2
1      5.23 (1.24)   5.81 (1.56)   4.64 (1.30)   5.53 (.23)   6.19 (.12)   4.88 (.16)
2      3.97 (1.58)   4.34 (1.79)   3.61 (1.67)   4.82 (.28)   5.89 (.08)   3.76 (.25)
3      6.87 (1.69)   7.30 (1.74)   6.45 (2.05)   5.44 (.36)   6.85 (.03)   4.04 (.04)
4      7.24 (1.08)   7.44 (1.31)   7.04 (1.37)   5.32 (.31)   6.52 (.10)   4.13 (.04)
5      7.05 (1.45)   7.39 (1.23)   6.72 (1.60)   6.29 (.23)   7.09 (.10)   5.50 (.12)
Fig. 5. Network performance during training. During familiarization, network looking time in Experiment 1 is longer than in Experiment 2. Looking times for Experiments 3–5 are longer than in Experiment 2. These results are in agreement with the infants’ data. See text for explanations of the two effects.
5. Discussion

5.1. Network behavior during testing

We have implemented a neural network model of the impact of labels on visual categorization. In five sets of simulations, we have reproduced the pattern of looking preferences described in Plunkett et al.'s (2008) laboratory experiments with infants, showing that labels can affect the way in which visual stimuli are categorized. In the laboratory experiments, infant categorization was assessed by measuring looking time at test stimuli, using the well-established novelty preference procedure. In the simulations, we assessed network category formation by measuring looking time at test stimuli, where NLT is measured as a function of the quantization error—an index of the degree to which input patterns fit the organization of the stimulus space discovered by the self-organizing map. For all five experiments, the model's looking preferences mimic those of infants. Hence, when shown a set of visual inputs from the so-called Broad Condition, the network looks longer at the modal visual stimuli than the average stimulus, indicating that it has formed a single category. In contrast, when shown visual inputs from the Narrow Condition during familiarization, the network looks longer at the average stimulus at test, indicating that it has formed two categories. These results mimic those of the infants in Younger's (1985) study and offer an alternative implementation of the same infant categorization effects to those provided by earlier models (Mareschal & French, 2000; Westermann & Mareschal, 2004). An important advantage of the current implementation is that it requires only a single presentation of the familiarization stimuli, and it makes use of a psychologically plausible, unsupervised learning algorithm that does not require an error signal to achieve successful performance.
The pattern of categorization changes when the visual stimuli are accompanied by labels: Visual stimuli from the Narrow Condition produce two distinct categories just in case the labels are correlated with visual category membership. When labels are randomly associated with the visual stimuli, the networks show no systematic preference for either the average or modal stimuli, indicating a disruption of the category formation process. Finally, the networks demonstrate a novelty preference for the modal stimuli when the accompanying label is the same for all familiarization stimuli, indicating the formation of a single category as in the Broad Condition (Experiment 1). In the model, as with infants, labels have an impact on the perceptual similarity of the training objects such that two categories, one category, or no categories⁵ are formed depending on the labeling contingencies present during familiarization. In this self-organizing model, the labels are presented as additional features alongside the visual attributes of the objects. Therefore, the auditory and visual information has the same status for the learning process. In particular, the labels do not have the status of names that act as invitations to form categories by highlighting the commonalities between objects. The correlations between labels and objects affect the performance of the model in the same way that the correlations between the visual features of the objects affect the outcome. We interpret this finding as support for the unsupervised, feature-based approach to infant visual categorization. It is important to emphasize that we are not suggesting that label–object associations cannot be formed by infants during the first year of life.
Findings with neonates (Slater, Quinn, Brown, & Hayes, 1999), 6-month-olds (Tincoff & Jusczyk, 1999), and 10-month-olds (Pruden et al., 2006) all demonstrate that young infants are quite capable of forming and retaining these cross-modal associations. Nor do we suggest that labels cannot act as invitations to form categories by highlighting the commonalities between objects. However, the evidence that labels impact infant categorization does not warrant the assumption that labels achieve this effect because they are treated as names for the objects and categories concerned. Our model shows how a feature-based approach to infant categorization can adequately explain the findings. We suggest that name-based categorization awaits infant mastery of the referential function of words later in the second year of life (Nazzi & Bertoncini, 2003). The model's ability to mimic infant looking times in testing is particularly noteworthy given that the physical circumstances for evaluating infant preferences are quite different from those of the model: Infants always see two pictures during testing, whereas the model is only ever exposed to one picture at a time—testing in the model is effectively just a continuation of the training phase but with learning turned off. Network preferences in the Broad Condition (Experiment 1) reflect the fact that the average stimulus (3333) has close proximity to its BMU in the map, whereas the modal stimuli (1111 and 5555) do not fit their BMUs so closely. This asymmetry derives from the self-organizing capacity of the map to represent a single category centered on the central tendency of the familiarization stimuli, namely 3333. The opposite pattern of network preferences in the Narrow Condition (Experiment 2) derives from the map's representation of two categories for which the modal stimuli are good exemplars (they are close to the BMUs for their respective categories), whereas the average stimulus
3333 is distant from its own BMU and from those of the extracted categories. Network preferences in Experiments 3–5 are explained in an analogous fashion: The labeling contingencies during familiarization impact the organizational structure of the map, affecting the proximity of the average and modal stimuli to their BMUs. The success of the model in replicating infant looking preferences during testing adds credence to the assumption that infant preferences are an outcome of on-line processing of the statistical regularities inherent across the set of familiarization stimuli (Plunkett et al., 2008; Younger, 1985). Importantly, we demonstrated that the capacity of the network to capture all five experimental outcomes depends not only on the global statistical regularities of the training set but also on the sequential character of the training regime. The ability of the model to mimic infant looking preferences based on a sequence of one-trial learning events in Experiment 5 is worthy of further examination. Recall that Experiment 5 presented the Narrow Condition visual stimuli together with a single common label during familiarization. From a statistical learning perspective, this label is entirely redundant during familiarization, as it is constant across all trials. In general, neural network models exploit the variability between input stimuli to organize their patterns of connectivity and internal representations. One might have predicted, therefore, that the networks in Experiment 5 would ignore the label, as it seems to provide no distinguishing information across trials. Nevertheless, we have demonstrated that a single common label has a dramatic impact on the network's classification of the visual stimuli, causing it to treat them as a single category rather than two. However, the intuition that the label is redundant in this experiment is not entirely misleading.
In fact, when the networks in Experiment 5 are trained with standard parameters for multiple epochs, they learn to segregate the visual stimuli into two categories rather than one, thereby ignoring the label. In other words, the formation of the single category is a transient effect. This finding has several important implications: First, it predicts that infants should show similar transient effects, such that if they were continuously trained on the label–object contingencies of Experiment 5, infants would eventually form two visual categories. Note that unsupervised learning systems, such as SUSTAIN (Gureckis & Love, 2004), which do not have a topographical organization, would not make this prediction because learning is complete after a single sweep through the training data. Second, the transition from a single- to a two-category representation in the model implies that the label changes its status from being associated with a single visual category of objects to being associated with two discriminable visual categories of objects. This suggests that the model has the potential to represent a hierarchy of categorical organization as a result of the introduction of a common label for members of the hierarchy. However, the organizing capacity of the label is obtained at a price: The model suggests that initially the label obliterates categorical distinctions and that it is only through further experience and internal reorganization that a hierarchical structure is achieved. Nevertheless, these transitions are emergent properties of a self-organizing system, which do not require explicit instruction or feedback. The model also predicts that the demonstrated impact of labels on categorization in 10-month-old infants does not represent an endpoint of learning but rather is a step en route to the development of
a more structured system—perhaps a system that underpins the organization of the mental lexicon itself.

5.2. Network behavior during training

A particularly salient feature of the model is its capacity to mimic infant looking times during familiarization. It achieves this in a number of ways: First, NLT is higher in the first block of familiarization trials than in the final block. Plunkett et al. (2008) also report Block effects during the familiarization phase of all their experiments: Infants look longer during the first three familiarization trials than the final three trials. This finding is also readily accommodated within the framework for predicting looking times outlined above: For networks, the quantization error during the first three trials will be high because none of the input stimuli are likely to align well with their BMUs. However, map structure will have begun to emerge after five familiarization trials, and hence quantization error will be lower during the final block of training. Second, infant and network looking times during training are higher in Experiment 1 than in Experiment 2. Furthermore, infant and network looking times are higher in Experiments 3–5 than in Experiment 2. The model's higher looking times during familiarization in Experiment 1 than in Experiment 2 can be understood in terms of the distance between the feature vectors used to represent the line drawings: For each stimulus presented to the network, we calculated the distance to the closest stimulus previously presented during its familiarization sequence. Recall that the distance between input vectors corresponds to the degree of visual similarity of the input figures. Greater distances between consecutive stimuli will yield higher looking times. The average of these distances is significantly higher in Experiment 1 (M = .73; SD = .08) than in Experiment 2 (M = .6; SD = .1) (t(46) = 5.19, p < .0001) when each new input is compared to at least the two previous ones.
NLT at each incoming stimulus during training depends, therefore, on the sequence of the stimuli received up to that point, and not just on the distance from the most recent one. Can infant looking time during familiarization be explained in the same way, by assuming that infants compare each new drawing to (at least the two) previously presented drawings? Our model predicts a positive answer to this question. We reanalyzed the familiarization data for Experiments 1 and 2 reported in Plunkett et al. (2008) by calculating the average Euclidean distance between the eight familiarization stimuli shown to each of the infants. As with the networks, the average Euclidean distance was higher in Experiment 1 (M = .45; SD = .04) than in Experiment 2 (M = .38; SD = .09) (t[46] = 3.29, p < .005).⁶ Note, however, that the average Euclidean distance between the eight familiarization stimuli for each infant depends on the order in which they are presented. We calculated the correlation between infant looking time during familiarization and the total Euclidean distance for the particular sequence of stimuli to which they were exposed and found a significant positive correlation (r = .347, p < .05). This finding suggests that infants, like the networks, are responding to some measure (in our case, Euclidean distance) of the perceptual similarity between consecutive familiarization stimuli and that this sensitivity is a primary determinant of their looking times. This perspective predicts that it should
be possible to manipulate infant looking times during familiarization by careful selection of the sequences of stimuli to which they are exposed from the Broad or Narrow conditions. Sequences with low cumulative Euclidean distance should elicit the shortest looking times. We are currently running infant experiments to test this prediction and evaluate its impact on the category formation process. The difference in looking times during familiarization between Experiment 2 and Experiments 3–5 in the infants can be explained by the increased cognitive load deriving from the presence of labels in the last three experiments. This pattern in the network can be explained in a similar way with simple linear algebra: In Experiments 3–5, the dimension of the input vectors is higher than in Experiment 2. As a consequence, the quantization error (and therefore the NLT) is higher in Experiments 3–5 than in Experiment 2. In this context, input vector length corresponds directly to infant cognitive load. By implication, network representations that result in decreasing the effective length of the input stimuli should contribute to a lowering of cognitive load and, in infants, a lowering of looking times. Finally, we note that there is a mismatch between the model and infant behavior when comparing looking time during familiarization in Experiment 1 and Experiments 3–5. Infants look longer during familiarization in Experiments 3–5 than in Experiment 1. The model does not reproduce this result. Furthermore, infants do not show any reliable differences in looking time among Experiments 3–5, whereas the model does. We hope to understand and remedy these residual mismatches in future work.

5.3. Developmental interpretation of the model

All the results reported in this paper concern performance of the model at a snapshot in time after a single exposure to each of the stimuli in the training set.
The training regime is designed to mimic the circumstances of infants at just 10 months old after a single exposure during familiarization. It will be recalled, however, that before training of the model commenced, the weight vectors associated with each unit of the map were initialized to small random values uniformly distributed between the maximal and minimal values of the corresponding components of the input vectors. It turns out that this weight initialization procedure is critical to successful performance. In the model, the units' internal representations (or weights) directly reflect the network's experience with previous inputs. In Fig. 6, we describe how weight initialization, which mimics different levels of exposure to stimuli not explicitly presented to the network, can impact the number of categories formed from the same input dataset. For clarity, the process is depicted in one dimension. Fig. 6A depicts the network's behavior during the presentation of two similar inputs I1 and I2, with weights initialized at low values (a distribution centered on 1), thereby mimicking a very young infant. From top to bottom: When input I1 is presented, its BMU (the unit closest to the input) is moved toward the input (solid arrow) and the BMU's neighbor is moved slightly toward the BMU value (dashed arrow). When an input I2, similar to I1, is presented to the network, the closest unit is recruited as its BMU. If it corresponds to the same
Fig. 6. The role of weight initialization in the number of formed categories. In the left panel, weights (x) are initialized at low values (around 1). After the successive presentation of inputs I1 and I2 (o), the same unit is recruited as the best matching unit (BMU) for both inputs (see text for a full explanation). In the right panel, weights are initialized at higher values (around 2), thereby mimicking the role of previous exposure to stimuli. Two different units are recruited as BMUs for inputs I1 and I2. Together, this suggests that different weight initialization (different previous experience) can lead to different patterns of categorization.
BMU as for the first input (criterion d > r in Fig. 6A), both inputs will share the same BMU: The network forms a single category containing both inputs. In contrast, Fig. 6B depicts the network's behavior when weights are initialized to higher values (a distribution centered on 2), thereby mimicking an infant with a greater amount of exposure to the environment prior to the experiments. As in Fig. 6A, inputs I1 and I2 are presented to the network. Now, after the presentation of input I1, one unit is closer to the new input I2 than the BMU for input I1 (criterion d < r in Fig. 6B). Two different BMUs are associated with inputs I1 and I2, indicating that the network has formed two categories from the inputs. This illustration of the impact of weight initialization on the number of clusters formed emphasizes the role of experience in categorization experiments. Moreover, it suggests that no maturational mechanism, such as increased working memory, improved perceptual acuity, or hippocampal maturation, as suggested in Gureckis and Love (2004), is needed to explain developmental aspects of infant categorization such as those described in Younger and Cohen (1986).

5.4. General considerations

We have shown that this general approach to modeling the impact of labels on infant visual categorization offers an alternative perspective for understanding the nature of the mechanisms that underlie infant looking preferences in novelty preference tasks. The model shows that labels can change the way in which visual stimuli are categorized when the labels are treated as additional features rather than names for the
visual stimuli. The model also makes some novel empirical predictions for infant categorization. For example, it predicts that the category formation observed in Experiment 5 is a transient effect and that infants should learn to ignore the label after repeated exposure. It also predicts that categorization effects will depend on the perceptual similarity and the sequence of the stimuli presented during familiarization, as well as on the age of the infant. All of these predictions constitute avenues for further evaluation of the model.

The model makes a number of specific assumptions that are critical to its performance. For example, we assume that the learning rate in our model is affected by stimulus novelty and cognitive load. These are plausible assumptions, but their validity must be evaluated in light of the model's capacity to generate correct novel empirical predictions. A particularly important architectural assumption is that the visual and auditory vectors have the same length. This assumption is tantamount to the claim that visual and auditory stimuli are equivalent in status (perceptual similarity, cognitive load) for determining the outcome of the categorization process. The claim that auditory and visual stimuli have equivalent status in cross-modal auditory-visual processing is not uncontroversial. For example, a number of adult studies have argued for visual dominance effects (e.g., Colavita, 1974; Koppen & Spence, 2007), whereas some infant studies have argued for auditory dominance effects (Robinson & Sloutsky, 2004; Sloutsky & Robinson, 2008). Our choice of equivalent status reflects the finding with infants that both auditory and visual stimuli can affect the categorization process (Plunkett et al., 2008; Waxman & Markow, 1995).
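The equivalence assumption can be made concrete with a minimal sketch. In the fragment below (invented feature values and dimensions; `stimulus`, `label_1`, and the other names are hypothetical, not part of the published model), a label is simply a block of extra input features of the same length as the visual vector, so it enters the Euclidean similarity computation on an equal footing with the visual features:

```python
import numpy as np

# Minimal sketch of the labels-as-features assumption: each stimulus
# is the concatenation of a visual vector and an acoustic (label)
# vector of equal length, so both modalities enter the Euclidean
# similarity computation on an equal footing. All values and helper
# names below are invented for illustration.
rng = np.random.default_rng(0)

def stimulus(visual, label):
    """Concatenate visual and acoustic features into one input vector."""
    return np.concatenate([visual, label])

n = 8  # equal visual and acoustic dimensionality (equivalence assumption)
exemplar_a = rng.normal(0.3, 0.05, n)               # a visual exemplar
exemplar_b = exemplar_a + rng.normal(0.0, 0.05, n)  # a similar exemplar
label_1 = np.zeros(n); label_1[0] = 1.0             # two distinct labels
label_2 = np.zeros(n); label_2[1] = 1.0

same_label = np.linalg.norm(stimulus(exemplar_a, label_1)
                            - stimulus(exemplar_b, label_1))
diff_label = np.linalg.norm(stimulus(exemplar_a, label_1)
                            - stimulus(exemplar_b, label_2))

# A shared label leaves only the visual difference; distinct labels add
# a constant cross-modal distance, pushing the two stimuli apart.
print(same_label < diff_label)  # prints "True"
```

In this framing, a shared label pulls exemplars together and distinct labels push them apart purely through the distance metric, with no special status for the label features.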
Nevertheless, it is clear that the equivalence assumption needs to be relaxed under specific testing conditions, and a computational approach needs to be flexible enough to handle these various conditions. We believe that the modeling framework adopted here possesses this flexibility. For example, a simple adaptation of the present architecture can deal with different amounts of visual information while maintaining the relative strength of the labels. We ran a series of simulations on an architecture consisting of three maps: a visual map, an acoustic map, and a global map that receives as input the activations of the first two maps. This alternative architecture reproduced all five experiments reported by Plunkett et al. (2008). Also taking advantage of this flexibility, we implemented a more realistic encoding of the acoustic inputs in which every phoneme is encoded by six values that fully describe its articulatory features, following Plunkett and Marchman (1991). Again, we found no qualitative difference in behavior relative to the simpler architecture. In a similar fashion, it would be possible to model scenarios with this architecture that capture both auditory and visual dominance effects.

The model highlights the utility of an unsupervised feature-based approach to infant categorization in which labels do not have the status of names but can be considered additional features that enter into the statistical computations performed by infants during categorization. Of course, labels can also be names. Mayor and Plunkett (2008) have attempted to demonstrate how self-organizing maps can be used to model word learning in the context of joint attentional events (also see Li et al., 2007). However, these word learning models require extra machinery and learning algorithms to achieve successful performance. The laboratory experiments described by Plunkett et al. (2008) do not involve joint attentional
events. They can be considered cross-modal equivalents of the unimodal statistical learning experiments with infants described by Saffran, Aslin, and Newport (1996). Building mental representations from statistical correlations amongst stimuli may involve a different configuration of neural systems than that required for word learning or name-based categorization.

Is there any role for a supervised name-based approach to infant categorization? We know that young infants are able to demonstrate an appreciation of familiar label–object associations (Tincoff & Jusczyk, 1999) and are able to learn novel label–object associations (Pruden et al., 2006). It is unclear whether these early labels should be considered names, as they may not be referential in character and may be of limited use for generalization. Irrespective of the referential nature and generalizability of early words, the formation of label–object associations might impact categorization in a manner that goes beyond the statistical role ascribed to labels in the unsupervised feature-based approach. However, this may be difficult to demonstrate. Even studies of concept formation in adults are equivocal about the mechanism by which labels impact the categorization process. For example, Yamauchi and Markman (2000) demonstrated that labels function in much the same fashion as ordinary perceptual features in a straightforward classification task, but that labels differ from other perceptual features when subjects are required to make inferences: Inferences based on feature information are governed by perceptual similarity, whereas inferences based on knowledge of labels are more categorical. Nevertheless, Yamauchi and Markman (2000) acknowledge that "category labels can be viewed as reliable pointers to systematic knowledge structures that may provide a basis for making predictions about unknown features" (p.
793), suggesting that it may be the predictive reliability of labels, rather than their status as names, that drives the asymmetry between linguistic and nonlinguistic features in adult categorization experiments. Lupyan, Rakison, and McClelland (2007) have also shown that labels can facilitate categorization even when the labels are redundant to task success. Furthermore, they showed that categorization performance correlated positively with verification performance (akin to Yamauchi and Markman's [2000] classification task), indicating that subjects who were best at identifying the names of training items also performed best at categorizing them. This result can be interpreted as evidence for name-based categorization. However, the authors note that their findings are compatible with "a general account that naming a category causes items within that category to cohere because the name serves as a reliable cue to class membership" (p. 1082). In other words, even these adult findings are compatible with an unsupervised feature-based account of categorization, but one in which labels have the status of particularly reliable predictors, presumably as a result of a lifetime of use.

Of course, separating the contributions of name-based and feature-based approaches to categorization is complicated by the fact that the former involves the latter: Names are reliably correlated with category membership. Should we therefore abandon any attempt to resolve this issue? We think not. Names refer. Features do not refer. One might reasonably suppose that the cognitive mechanisms that underpin the referential use of labels are separable from the mechanisms that exploit labels as perceptual features. The identification of these mechanisms and perhaps their neural
instantiation promise to provide insights into the manner by which labels impact the process of categorization in both infants and adults. In adults, the distinction between implicit and explicit memory can be treated either as the result of the operation of separable neuroanatomical systems or as functional dissociations within a single system (Berry, Shanks, & Henson, 2008). We suspect that the distinction between unsupervised, feature-based categorization and supervised, name-based categorization in infancy may benefit from similar theoretical analysis.
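The weight-initialization mechanism depicted in Fig. 6 (Section 5.3) can also be sketched in one dimension. The toy version below uses invented weights, inputs, and learning rate rather than the parameters of the reported model; it shows how the same pair of similar inputs is assigned to one or two best matching units depending only on where the weights start:

```python
import numpy as np

# Hypothetical one-dimensional sketch of the mechanism in Fig. 6:
# the same pair of similar inputs recruits one or two best matching
# units (BMUs) depending only on the initial weights. Weights,
# inputs, and learning rate are invented for illustration.

def bmus(weights, inputs, lr=0.5):
    """Present each input once and return the index of its BMU."""
    w = np.array(weights, dtype=float)
    recruited = []
    for x in inputs:
        i = int(np.argmin(np.abs(w - x)))  # best matching unit
        w[i] += lr * (x - w[i])            # move BMU toward the input
        if i > 0:                          # move one neighbor slightly
            w[i - 1] += 0.1 * lr * (w[i] - w[i - 1])  # toward the BMU value
        recruited.append(i)
    return recruited

inputs = [1.4, 1.6]  # two similar inputs, as in Fig. 6

# Low initial weights ("very young infant"): both inputs share one BMU.
young = bmus([0.9, 1.0, 1.1], inputs)
# Weights spread near the inputs ("more prior exposure"): two BMUs.
older = bmus([1.35, 1.65, 2.0], inputs)

print(len(set(young)), len(set(older)))  # prints "1 2"
```

With low initial weights the unit dragged toward I1 remains the closest unit to I2, so one category forms; with weights spread by prior exposure, a different unit captures I2, yielding two categories.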
Notes

1. A variety of other reasons may have led to this result. For example, the intervening novelty preference test may have confused these young infants and disrupted their memory for the object–label associations. Alternatively, the infants may have failed to generalize the labels to category members that were not experienced during familiarization. Note, however, that infants of a similar age have shown that they can acquire novel object–label associations in the laboratory (Pruden, Hirsh-Pasek, Golinkoff, & Hennon, 2006).
2. One might suppose that a single infant trial corresponds to multiple iterations of training in the model. However, for this to reflect the training regime of the infant, the model would need to be iterated many times on the same trial before proceeding to the next trial. Such a training regime can lead to catastrophic interference when using a pattern update training schedule (French, 1999; McCloskey & Cohen, 1989), so that the model would fail to learn about all the stimuli.
3. Because Experiments 1 and 2 are extremely robust, statistical significance is obtained over an even wider range of values in more than 95% of the configurations tested.
4. Note that a scaling factor of 7 is used during the familiarization phase to reflect the fact that infants see the pictures for 10 s, instead of 6 s during testing. Moreover, the total looking time is split between two images in the testing phase, whereas only one image is shown at a time during the familiarization phase.
5. Strictly speaking, multiple small categories are created in Experiment 4.
6. These measures are in nonnormalized feature space.
References

Berry, C. J., Shanks, D. R., & Henson, R. N. A. (2008). A unitary signal-detection model of implicit and explicit memory. Trends in Cognitive Sciences, 12, 367–373.
Clark, E. V. (1973). What's in a word? On the child's acquisition of semantics in his first language. In T. E. Moore (Ed.), Cognitive development and the acquisition of language (pp. 65–109). New York: Academic Press.
Cohen, L. (1998). An information-processing approach to infant perception and cognition. In F. Simion & G. Butterworth (Eds.), The development of sensory, motor, and cognitive capacities in early infancy (pp. 277–300). East Sussex, England: Psychology Press.
Colavita, F. B. (1974). Human sensory dominance. Perception and Psychophysics, 16, 409–412.
Crick, F. (1989). The recent excitement about neural networks. Nature, 337, 129–132.
Fantz, R. L. (1964). Visual experience in infants: Decreased attention to familiar patterns relative to novel ones. Science, 146, 668–670.
French, R. (1999). Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3, 128–135.
Fulkerson, A. L., & Waxman, S. R. (2007). Words (but not tones) facilitate object categorization: Evidence from 6- and 12-month-olds. Cognition, 105, 218–228.
Golinkoff, R., Hirsh-Pasek, K., Cauley, K., & Gordon, L. (1987). The eyes have it: Lexical and syntactic comprehension in a new paradigm. Journal of Child Language, 14, 23–46.
Gureckis, T. M., & Love, B. C. (2004). Common mechanisms in infant and adult category learning. Infancy, 5(2), 173–198.
Hu, J. F. (2008). The impact of labelling on categorisation processes in infancy. PhD thesis, Department of Experimental Psychology, Oxford University.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43(1), 59–69.
Kohonen, T. (1984). Self-organization and associative memory. Berlin: Springer.
Kohonen, T. (1993). Physiological interpretation of the self-organizing map algorithm. Neural Networks, 6(7), 895–905.
Koppen, C., & Spence, C. (2007). Spatial coincidence modulates the Colavita visual dominance effect. Neuroscience Letters, 417, 107–111.
Li, P., Zhao, X., & MacWhinney, B. (2007). Dynamic self-organization and lexical development. Cognitive Science, 31, 581–612.
Lupyan, G., Rakison, D. H., & McClelland, J. L. (2007). Language is not just for talking: Redundant labels facilitate learning of novel categories. Psychological Science, 18(12), 1077–1083.
Mareschal, D., & French, R. (2000). Mechanisms of categorization in infancy. Infancy, 1, 59–76.
Markman, E. M. (1990). Constraints children place on word meanings. Cognitive Science, 14, 57–77.
Mayor, J., & Plunkett, K. (2008). Learning to associate object categories and label categories: A self-organising model. In B. C. Love, K. McRae, & V. M. Sloutsky (Eds.), Proceedings of the 30th annual conference of the Cognitive Science Society (pp. 697–702). Austin, TX: Cognitive Science Society.
McCloskey, M., & Cohen, N. (1989). Catastrophic interference in connectionist networks: The sequential learning problem. In G. Bower (Ed.), The psychology of learning and motivation (Vol. 23, pp. 109–165). New York: Academic Press.
Miikkulainen, R., Bednar, J. A., Choe, Y., & Sirosh, J. (2005). Computational maps in the visual cortex. New York: Springer.
Mirolli, M., & Parisi, D. (2005). Language as an aid to categorization: A neural network model of early language acquisition. In A. Cangelosi et al. (Eds.), Proceedings of the 9th neural computation and psychology workshop (pp. 97–106). Singapore: World Scientific.
Nazzi, T., & Bertoncini, J. (2003). Before and after the vocabulary spurt: Two modes of word acquisition? Developmental Science, 6(2), 136–142.
Plunkett, K., & Marchman, V. (1991). U-shaped learning and frequency effects in a multi-layered perceptron. Cognition, 38, 43–102.
Plunkett, K., Sinha, C., Møller, M. F., & Strandsby, O. (1992). Symbol grounding or the emergence of symbols? Vocabulary growth in children and a connectionist net. Connection Science, 4, 59–76.
Plunkett, K., Hu, J.-F., & Cohen, L. (2008). Labels can override perceptual categories in early infancy. Cognition, 106, 665–681.
Pruden, S. M., Hirsh-Pasek, K., Golinkoff, R. M., & Hennon, E. A. (2006). The birth of words: Ten-month-olds learn words through perceptual salience. Child Development, 77, 266–280.
Robinson, C. W., & Sloutsky, V. M. (2004). Auditory dominance and its change in the course of development. Child Development, 75(5), 1387–1401.
Roder, B. J., Bushnell, E. W., & Sasseville, A. M. (2000). Infants' preferences for familiarity and novelty during the course of visual processing. Infancy, 1, 491–507.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Schafer, G. (2005). Infants can learn decontextualized words before their first birthday. Child Development, 76(1), 87–96.
Slater, A., Quinn, P., Brown, E., & Hayes, R. (1999). Intermodal perception at birth: Intersensory redundancy guides newborn infants' learning of arbitrary auditory-visual pairings. Developmental Science, 2, 333–338.
Sloutsky, V. M., & Robinson, C. W. (2008). The role of words and sounds in infants' visual processing. Cognitive Science, 32(2), 342–365.
Sneath, P. H. (1957). The application of computers to taxonomy. Journal of General Microbiology, 17(1), 201–226.
Tincoff, R., & Jusczyk, P. W. (1999). Some beginnings of word comprehension in 6-month-olds. Psychological Science, 10(2), 172–175.
Waxman, S. R. (1999). Specifying the scope of 13-month-olds' expectations for novel words. Cognition, 70, 35–50.
Waxman, S. (2003). Links between object categorization and naming: Origins and emergence in human infants. In D. H. Rakison & L. M. Oakes (Eds.), Early category and concept development: Making sense of the blooming, buzzing confusion (pp. 213–241). London: Oxford University Press.
Waxman, S., & Booth, A. (2003). The origins and evolution of links between word learning and conceptual organization: New evidence from 11-month-olds. Developmental Science, 6, 128–135.
Waxman, S. R., & Braun, I. E. (2005). Consistent (but not variable) names as invitations to form object categories: New evidence from 12-month-old infants. Cognition, 95, B59–B68.
Waxman, S., & Markow, D. B. (1995). Words as invitations to form categories: Evidence from 12- to 13-month-old infants. Cognitive Psychology, 29, 257–302.
Westermann, G., & Mareschal, D. (2004). From parts to wholes: Mechanisms of development in infant visual object processing. Infancy, 5(2), 131–151.
Wetherford, M., & Cohen, L. B. (1973). Developmental changes in infant visual preferences for novelty and familiarity. Child Development, 44, 416–424.
Yamauchi, T., & Markman, A. B. (2000). Inference using categories. Journal of Experimental Psychology: Learning, Memory, and Cognition, 26(3), 776–795.
Younger, B. (1985). The segregation of items into categories by ten-month-old infants. Child Development, 56, 1574–1583.
Younger, B. A., & Cohen, L. B. (1986). Developmental change in infants' perception of correlation among attributes. Child Development, 57, 803–815.