LANGUAGE LEARNING IN A NEURAL NETWORK

Christian Balkenius
Lund University Cognitive Science
Kungshuset, Lundagård
S-223 50 Lund, Sweden
E-mail: [email protected]

1. INTRODUCTION

How can a neural system learn to interpret and produce a simple sentence like "Sam hits John"? Such a neural system would reveal things about neural mechanisms and representations, and would be of importance to our conception of what constitutes linguistic ability. Despite the simplicity of a sentence like "Sam hits John", none of the current connectionist models can handle it in a way that explains how it is accomplished in a biological system. This is because they ignore the fact that even very simple sentences have 'real' constituent structure, which cannot be dealt with in a simple pattern transformation model [1, 2]. This paper aims at briefly describing what neural mechanisms are needed for linguistic applications, and at showing how they can be used to predict a few general principles of the structure and learning of natural language.

2. SOME BASIC PROPERTIES OF NEURAL SYSTEMS

What are the properties of a neural system, N? Let us first assume that N can be described by a set of interacting modules that together constitute the whole of N. A module is what we get if we take any subset of the equations describing N and study them in isolation. This means that what I call a module does not necessarily correspond to any physical part of the network, even if it could, in principle. At each time the module is in a particular state. This state can, in theory, be described by all the measurements we could make in the module at one time, but usually only a few of those measurements are actually relevant to the function of the module. This means that the states can be found in a space of dimensionality equal to the number of possible measurements in a module. The same is true for the whole network, the state of which is described by the states of all the modules taken together. Now, what properties can we find in a module?

2.1 Constraint Satisfying Modules

A property of modules much studied in the neural network literature is their ability to find what are sometimes called attractor states [3, 4, 5, 6, 7, 8]. Those are states to which the module tends to drift when it does not receive any external signals. This can be compared to a ball rolling down a hill until it finds the place at the bottom of a valley. Here the initial position of the ball corresponds to the initial state, and the bottom of the valley to the attractor state.

Fig 1. The ball is initially dropped in the state s1, but as a result of the shape of the landscape, it rolls down into a valley that corresponds to state s2.
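As an illustration, here is a minimal sketch of such attractor dynamics, assuming a Hopfield-style network [6] with a single stored pattern; the paper does not commit to any particular network, so the names and parameters below are illustrative only.

    import numpy as np

    def settle(state, W, max_steps=50):
        # Asynchronous updates can only lower the energy -0.5 * s.W.s, so the
        # state 'rolls downhill' until it reaches a fixed point: an attractor.
        s = state.copy()
        for _ in range(max_steps):
            prev = s.copy()
            for i in np.random.permutation(len(s)):
                s[i] = 1.0 if W[i] @ s >= 0 else -1.0
            if np.array_equal(s, prev):   # the bottom of a valley
                break
        return s

    # Store one pattern (say, "Sam with two arms") as a valley.
    sam = np.array([1, -1, 1, 1, -1, 1, 1, -1], dtype=float)
    W = np.outer(sam, sam) - np.eye(len(sam))

    # Drop the ball nearby ("Sam with an arm obscured") and let it roll.
    occluded = sam.copy()
    occluded[3] = -1.0
    print(settle(occluded, W))   # settles back into the valley for Sam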

The landscape through which the ball rolls corresponds to what I call the configuration of the module. This is described by the configurational state of the module, which is used to select a state transition function f that gives the actual path the ball takes through the landscape. Depending on where the ball is initially dropped, it will roll along different paths and find different valleys. But there is more to it than this. If the ball is dropped where it cannot find any valley, it will make a hole in the ground and start to form a new valley. If balls are dropped close to each other, they will make a deep, thin valley. If they are dropped at a somewhat greater distance from one another, we will get a less narrow valley that attracts balls dropped within a larger area. Unless the balls are dropped entirely at random, this will eventually lead to a landscape that contains useful information about the whole history of the module. The changes in the landscape correspond to changes in the configurational state of the module and are governed by a function g that is selected by the current state. This is the mechanism behind concept formation.

In modules of the above-mentioned kind, the landscape can be thought of as a set of constraints on the possible paths. The ball rolling down the valley can be considered to look for the nearest state that best satisfies those constraints. This would correspond to realizing that we see Sam even though his arm is partly obscured by a tree. The ball falls down at a place that represents a person very much like Sam, but without one of his arms. Since our previous experience of Sam has shaped a valley for him, not far away in the landscape, the ball will roll down into it and reveal to us that this is indeed Sam, despite his strange appearance. Here the landscape codes semantic constraints (Sam has two arms) and the valleys correspond to semantically consistent states (perceiving Sam as having two arms).

2.2 A 'Formal' Description

We can now define a module, M, as a tuple ⟨S, F, C, G⟩, where S is the state space that corresponds to all the different places in the landscape and F is a set of state transition functions that describe the possible paths through the landscape. C is the set of all possible configurations of the module, and thus corresponds to all possible landscapes. How the landscape is changed is governed by G, which is a set of functions that change the configuration as a result of the ball rolling through its corresponding landscape.

This gives us two interacting subsystems in each module. First, we have the system ⟨S, F⟩ that governs the fast changes in the module, i.e. the transient neural activity. Secondly, we have the system ⟨C, G⟩ that controls the slower changes, which correspond not only to a changing landscape, but also to all learning in the system. By changing the behaviour of the functions in the two sets F and G, it is possible to describe a large set of different neural mechanisms.
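A minimal sketch of this definition in code, assuming states are vectors and the configuration is a weight matrix; the particular choices of f and g below (a relaxation step and a Hebbian-style update) are illustrative assumptions, not the paper's implementation.

    import numpy as np
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Module:
        state: np.ndarray    # a point in the state space S
        config: np.ndarray   # a configurational state in C (the landscape)
        f: Callable          # selected from F: fast dynamics (the rolling ball)
        g: Callable          # selected from G: slow dynamics (the changing landscape)

        def step(self):
            # Fast subsystem <S, F>: transient neural activity.
            self.state = self.f(self.state, self.config)
            # Slow subsystem <C, G>: learning; the landscape is reshaped
            # where the ball has been.
            self.config = self.g(self.state, self.config)

    # Illustrative dynamics: f relaxes the state along the landscape's pull,
    # g digs a valley at the current state, so repeated visits deepen it.
    f = lambda s, W: np.tanh(W @ s + s)
    g = lambda s, W: W + 0.01 * np.outer(s, s)

    m = Module(state=np.random.randn(4) * 0.1, config=np.zeros((4, 4)), f=f, g=g)
    for _ in range(20):
        m.step()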

2.3 Composition of Modules

We can construct new modules by connecting two modules to each other. This can be described by applying a constructor function, Ω_C, to two modules, M and M', which gives us a new larger module, M'', that includes both M and M', but which has properties that do not necessarily correspond to those of M and M' taken together. As the two modules interact in a way described by C, the landscape that corresponds to M Ω_C M' can look very different from what one would first expect.

Fig 2. The left picture shows the landscape of the larger module M'' as it would look if it were composed of two modules M and M' that did not interact at all. Each of the smaller modules has two valleys, which gives us a larger module with four valleys. The right picture shows another case, where C describes massive interaction among the smaller modules. Here we get three valleys in the larger module that do not relate in any simple way to the valleys in M and M'. The composition has moved the valleys in each of the smaller modules. This can also be viewed as if each of M and M' pushes the ball in the landscape of the other module in different directions depending on its own state.
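A minimal sketch of such a composition, assuming the coupling C is a pair of cross-weight matrices through which each module's state pushes the ball in the other module's landscape (all names are illustrative):

    import numpy as np

    def coupled_step(s, s2, W, W2, C12, C21, dt=0.1):
        # One relaxation step of M'' = M Ω_C M'. Each module rolls in its own
        # landscape (W, W2) while being pushed by the other through the coupling.
        new_s = np.tanh(s + dt * (W @ s + C21 @ s2))     # M' pushes the ball in M
        new_s2 = np.tanh(s2 + dt * (W2 @ s2 + C12 @ s))  # M pushes the ball in M'
        return new_s, new_s2

    s, s2 = np.ones(3) * 0.1, np.ones(2) * 0.1
    W, W2 = np.zeros((3, 3)), np.zeros((2, 2))
    C12, C21 = np.zeros((2, 3)), np.zeros((3, 2))
    s, s2 = coupled_step(s, s2, W, W2, C12, C21)

With C12 = C21 = 0 the composite landscape is simply the product of the two smaller ones (two valleys each giving four, as in the left picture of Fig 2); with nonzero coupling the valleys move, as in the right picture.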

2.4 Chunking Modules

Chunking modules can be of two related types. The first kind of chunking module has the property of being sensitive to sequences of state transitions in another module. This means, to continue the metaphorical description, that the place where the ball is dropped depends on the path of the rolling ball in the module to which it is connected. The result is that each valley in a chunking module represents a path in the other module. This implies that a chunking module can be used to find and learn the usual paths taken in another module. In general, a chunking module is only sensitive to a path of finite length. Thus, we need hierarchies of chunking modules to recognize longer sequences. This is the basic mechanism used for the recognition of sequences.

In the second kind of chunking module, each valley also corresponds to a path in the other module, but the causal relation is the opposite. When the ball is dropped in its landscape, it causes the ball in the module with which it interacts to roll along a certain path. The exact path is determined by the interaction between the path along which the ball is pushed and the topology of the landscape. The ball will, for instance, fall down into valleys on its way. This can in turn trigger activity in other places of the network. This is the basic mechanism for performance in the network. By combining the two types of chunking modules, it is possible to get a module that can learn both to recognize and to produce sequences of states in another module.
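A minimal sketch of the first, recognizing kind of chunking module, assuming states are discrete symbols and a 'valley' is dug a little deeper each time its path is observed (the class and its names are illustrative):

    from collections import Counter, deque

    class ChunkingModule:
        # Watches another module and forms a valley for each path
        # (sequence of states) of a fixed, finite length.
        def __init__(self, path_length=3):
            self.window = deque(maxlen=path_length)
            self.valleys = Counter()              # path -> depth

        def observe(self, state):
            self.window.append(state)
            if len(self.window) == self.window.maxlen:
                self.valleys[tuple(self.window)] += 1   # repeated paths dig deeper

        def recognize(self):
            # Report the current path only if it is an established valley.
            path = tuple(self.window)
            return path if self.valleys[path] > 1 else None

    ch = ChunkingModule()
    for s in ["Sam", "hits", "John", "Sam", "hits", "John"]:
        ch.observe(s)
    print(ch.recognize())   # ('Sam', 'hits', 'John')

Recognizing longer sequences would stack such modules in a hierarchy, each one chunking the recognized paths of the module below it.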

Figure 3. Top: the two modules CH and CH' can learn to produce sequential states in M. Bottom: the path taken by the ball (black) in M is determined by the interaction between the learned path (grey) and the topology of the landscape in M.

3. LANGUAGE

As one of my aims is to explain what constitutes linguistic structure from a neural point of view, it is not possible to assume, initially, very much about linguistic structure. The minimal description of a language is that it is a system with a set of items that can be combined to form sequences of such items. It is not necessary for my purposes to decide on what the basic items are. In the rest of this paper I will assume that the smallest items are words. This is obviously too great a simplification. If we wanted a complete model, the items should more accurately be described as phonemes, or maybe the harmonic spectrum of the spoken language at each moment. It will be easy to see how the model described below could be extended to incorporate a more exact definition of the individual items. It should also become apparent that this would not change any of the properties described here.

4. THE LANGUAGE LEARNING SYSTEM

Let us now consider the language learning system. This is a module, ℒ, that contains four channels for communication with the rest of the neural system. Each of the channels can be considered to be connected to a module whose only function is to hold a state that corresponds to the message sent through the channel. Two of the channels govern linguistic input and output. The state in the output module corresponds to the uttered word, and the state in the input module corresponds to the perceived word. The other two channels govern what I call contextual input and output. They hold states that correspond to the non-linguistic environment. In this paper, the contextual input is considered to be a direct function of perception, but in general it can represent any non-linguistic information, such as emotional states, drive representations, etc. The contextual output is a state that communicates information from ℒ to the rest of the network N and makes it possible for N to use the information received by ℒ.

Figure 4. The language module, ℒ, is a part of the whole network N. It interacts with N through four channels: linguistic input and output on the language side, and contextual input and output connected to the environment.

4.1 THE CONTEXTUAL REPRESENTATION - THE CONTINUITY HYPOTHESIS

If two neural systems are going to communicate with each other, how can they exchange information? There are basically two ways in which this can be done. Either the two systems have information about a common code that is used to communicate, or they send information in a code that can be used by the other system although it lacks information about it. The latter is the case when the code is similar, in some way, to what it tries to represent. This type of code has some interesting properties. First, since the intrinsic properties of the code include information about the properties of what it represents, it is possible for the system to classify the code instead of what it represents. It can in fact relate to the code as if it were the thing it represents. Second, a single system does not have to know with which other system it communicates. We can connect it to any other system and then let it work on the code it receives, without it ever knowing that the code represents something. It can even change its categories if the code should change as a result of damage to the other system. This makes a system with this type of representation very robust. Third, if the system sends the same kind of code it receives to represent the same thing, then the other system will immediately understand the message.

One criterion for this kind of representation can loosely be stated as follows. Two states, s and t in S, are recognized to be similar by the neural system if they are represented by states that are close in some subspace of S in some module M. This is the continuity hypothesis of neural states. I will not dwell here on the reasons for believing that it holds for neural systems, but will go on to show its consequences for the language module. We can immediately recognize that this kind of representation is an absolute requirement for a constraint satisfying module. When the ball rolls down into the closest valley, it should of course be a valley that corresponds to the closest semantically consistent state; thus the representation has to be of this type, or else we cannot suppose that the closest valley has anything to do with the closest semantically consistent state. Another consequence is that we can immediately exclude all representations using arbitrary symbols, as they, by definition, are not similar to what they represent. The most important consequence, however, is that the states of the contextual input and output modules must be similar to represent similar contents. This implies that the modules for contextual input and output must, in fact, be the same. If they were different, their state spaces would be disjoint, and then there could never be a common subspace where the states could be close. The same must be true for the linguistic representation. This permits us to view ℒ as containing one contextual module, M_C, and one linguistic input/output module, M_L.

Figure 5. Interaction between the two modules M_L and M_C of ℒ.
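A minimal sketch of the continuity hypothesis, assuming contents are coded as feature vectors and 'recognized as similar' means being close in some subspace (the features and the threshold are illustrative):

    import numpy as np

    def similar(s, t, subspace, threshold=0.5):
        # s and t count as similar if their projections onto the given
        # subspace (a list of dimension indices) are close to each other.
        return np.linalg.norm(s[subspace] - t[subspace]) < threshold

    # Two states with similar contents lie close together in the subspace
    # that codes the shared properties ...
    sam_two_arms = np.array([0.9, 0.8, 0.1, 0.7])
    sam_one_arm = np.array([0.9, 0.8, 0.1, 0.2])   # only the 'arm' feature differs
    print(similar(sam_two_arms, sam_one_arm, subspace=[0, 1, 2]))   # True

    # ... whereas arbitrary symbol codes carry no such metric information:
    # two codes can stand for similar contents without being close at all.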

5. THE LEARNING SITUATION

The language learning situation can be described by the two modules that connect ℒ with the rest of the network. It is assumed that the contextual representation is sometimes related to the linguistic input. This is of course the minimal requirement we can impose on the learning situation in order for ℒ to learn the meaning of the linguistic input. A class of linguistic sequences is said to be learned by ℒ when three conditions are met.

• In response to a sequence of states in the module for linguistic input, ℒ can construct a sequence of contextual states that corresponds to the semantic content of the linguistic input.

• In response to a sequence of contextual states, ℒ can produce a sequence of states in the module for linguistic output that corresponds to parts or all of the semantic content of those contextual states.

• ℒ is able to generalize the linguistic constructions to instances of the class that it has not previously experienced, i.e. ℒ can produce and understand sentences never experienced before.

The first two conditions may at first look like an attempt to formulate a behavioural approach to language learning, but this is not at all the case, as behaviourism strongly denies one of the points that I am trying to make, namely that what is in between input and output does indeed matter to the description of linguistic ability.

5.1 Learning a Word

The simplest linguistic input we can give to ℒ is a single word. It is necessary for ℒ to sometimes receive each word together with its corresponding contextual representation. The problem here is that ℒ cannot know when this is the case. It is not even possible for ℒ to know what parts of the contextual state correspond to the particular word. Here the neural mechanisms described above can be used to make the necessary inferences concerning the meaning of a word. Assume that the contextual module can be described by a landscape that forms valleys corresponding to the concepts it forms to categorize properties of the environment. The valleys in the linguistic module correspond to the words it can recognize. To understand a word, ℒ has to be able to change the state of M_C in response to a state change in M_L. To produce a word, the state changes must go the other way: a changed state in M_C results in a changed state in M_L. If we look at the whole of ℒ, the valleys in it will be formed to correspond to the associations between words and semantic content. Depending on the states of M_C and M_L, the ball will fall down in different places, and depending on which way it rolls we will get production, understanding, learning, or a combination of them. This gives us an interactional model of one-word learning (figure 5), where the learning of new concepts in M_C makes it easier to understand new words. At the same time, the linguistic input to M_L can influence the formation of new concepts in M_C. If the contextual and linguistic input are compatible with respect to their semantic contents, they push the ball most strongly in a direction that corresponds to their shared content. This will eventually result in a valley that corresponds to this particular semantic information. The linguistic input has in effect pushed the concept formation in a particular direction. This can also be viewed as the linguistic input directing the attention of M_C in a particular direction. Thus, the formation of concepts and the learning of language can be considered to be highly dependent on each other.
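A minimal sketch of this interaction, assuming the association between M_L and M_C is a Hebbian cross-coupling that is strengthened whenever a word and a contextual state co-occur (the class and feature names are illustrative):

    import numpy as np

    class WordLearner:
        # Associates states of M_L (words) with states of M_C (concepts).
        def __init__(self, n_l, n_c):
            self.A = np.zeros((n_c, n_l))     # coupling between the two landscapes

        def co_occur(self, word, context, lr=0.1):
            # Word and context push the ball in the same direction: strengthen
            # the association between the co-active features (concept formation
            # and word learning influence each other through this coupling).
            self.A += lr * np.outer(context, word)

        def understand(self, word):
            return self.A @ word              # the word pushes the ball in M_C

        def produce(self, context):
            return self.A.T @ context         # the context pushes the ball in M_L

    wl = WordLearner(n_l=3, n_c=4)
    w_sam = np.array([1.0, 0.0, 0.0])         # state of M_L for "Sam"
    c_sam = np.array([0.0, 1.0, 1.0, 0.0])    # contextual state for Sam
    for _ in range(3):
        wl.co_occur(w_sam, c_sam)
    print(wl.understand(w_sam))   # points toward the contextual state for Sam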

5.2 The Problem of Asynchronicity

If we accept the continuity hypothesis, then only a limited amount of information can be represented at one time in M_C. In particular, observations of different entities which share some properties cannot be represented simultaneously. If they were, the network would either lack the ability to recognize the similarities or be unable to decide to which of the items the shared property belongs. Suppose, for instance, that we are to represent seeing Sam and John simultaneously. If both boys are represented at the same time, we will get the same representation as if we saw a person that was a mix between the two of them. To see why this is the case, consider the kinds of representations that are possible under the continuity hypothesis. Properties they both share, such as hair colour, must be represented in the same module. If they have different hair colours, this presupposes that the module is in two different states simultaneously, which is impossible. If they have the same hair colour, we could in theory represent that property, but we obviously cannot hope for all people to share all properties. The only solution is to represent each boy at a separate time in M_C. However, since parts of the two boys can share properties, we could conclude that we can only represent smaller parts of the boys. But this argument could be repeated ad infinitum. The result would then be that we can represent almost nothing in M_C at one time. But this is only true if we want to represent each of the boys in every conceivable detail all the time, which of course is not necessary.

Whatever the detail of representation, it will be necessary to change the state of M_C to represent different parts of the perceivable environment. This implies that our chances of having the right contextual representation in M_C when a word that corresponds to it is heard are very limited. For learning to be effective, we would like the representation in M_C to be synchronized with the corresponding word presented to M_L. This can be done by connecting M_C to another module, M_I, that quickly forms valleys for each of the states the ball has recently visited in M_C. Those valleys can later push the balls in other modules towards states that have been recently visited. As soon as ℒ has formed any hypothesis about what a word means, M_I will influence the pattern in M_C in order to find the most recent state related to that word. This will speed up word learning and concept formation considerably. This may be interpreted as if the interaction between M_C and M_I temporarily widens the valleys that the ball visits in M_C.

Figure 6. The added module M_I speeds up learning by acting as an intermediate memory.
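A minimal sketch of such an intermediate memory, assuming it keeps a decaying trace of recently visited M_C states and, given a hypothesis about a word's meaning, pushes the ball back towards the most compatible recent state (all names illustrative):

    import numpy as np

    class IntermediateMemory:
        # Quickly forms shallow valleys for recently visited states and lets
        # them decay again; used to re-find a recent state in M_C when the
        # corresponding word arrives later (the asynchronicity problem).
        def __init__(self, decay=0.8):
            self.traces = []                  # list of (state, strength) pairs
            self.decay = decay

        def visit(self, state):
            self.traces = [(s, w * self.decay) for s, w in self.traces]
            self.traces.append((np.asarray(state, dtype=float), 1.0))

        def recall(self, hypothesis):
            # The recent state most compatible with the hypothesis, weighted
            # by recency, is the one the ball in M_C is pushed towards.
            h = np.asarray(hypothesis, dtype=float)
            return max(self.traces, key=lambda t: t[1] * float(h @ t[0]))[0]

    mi = IntermediateMemory()
    mi.visit([1.0, 0.0, 0.0])     # saw Sam a while ago
    mi.visit([0.0, 1.0, 0.0])     # then saw something else
    print(mi.recall([1.0, 0.0, 0.0]))   # the word for Sam re-finds Sam's state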

ℒ was extended with M_I to speed up learning, but as a result we have also got a system for intermediate memory. This makes it possible to explain anaphora. Take the word "he", for instance. This can be seen as pushing the ball in M_C towards a valley for male persons. This will cause the ball in M_I to fall down into a valley that contains information about a male person. If we have mentioned Sam, or even seen him recently, then we will have a valley in M_I for Sam. This will cause the ball in M_C to be pushed toward the valley that corresponds to Sam, and thus recall the same state in M_C as if we had used the word "Sam". The state in M_C is the result of applying the semantic constraints coded by "he" to the different states coded in M_I, together with the constraints on semantic consistency coded in M_C. The resulting state is the one that most strongly satisfies those constraints.
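Continuing the sketch above, anaphora resolution can be illustrated as picking the stored state that best satisfies the constraints coded by the pronoun (the feature dimensions are invented for the example):

    import numpy as np

    # Assumed feature dimensions: [male, person, Sam-identity, dog-identity]
    he = np.array([1.0, 1.0, 0.0, 0.0])   # semantic constraints coded by "he"

    recent_states = {
        "Sam": np.array([1.0, 1.0, 1.0, 0.0]),   # recently mentioned or seen
        "Rex": np.array([0.0, 0.0, 0.0, 1.0]),   # a recently seen dog
    }

    # The state in M_I that most strongly satisfies the constraints wins,
    # so "he" recalls the same state as the word "Sam" would have.
    referent = max(recent_states, key=lambda k: float(he @ recent_states[k]))
    print(referent)   # Sam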

5.3 Implicit and Explicit Relations

I discussed above how it is possible to represent Sam to some specificity as long as none of the properties describing him have to share a common space in any module. The kind of relations that can be described in this way are what I call implicit relations. They are characterized by not requiring any explicit notion of what the relation is. If we combine the state representing Sam with a state representing dark hair, then what we get is a representation of a dark-haired Sam. If Sam is in fact dark-haired, and N knows it, then the state representing dark hair, DARK•HAIR, is a substate of the state SAM representing Sam. A substate of s is a part of s in a subspace of S, in this case a subspace representing hair colour. Here it is not necessary to specify the relation between the two states, as there can only be one interpretation, namely that Sam is dark-haired.

Many representations are not of this simple implicit form and demand a more complicated representation. A common example of such a relation is one between two similar objects. It differs from an implicit relation in two important ways. Firstly, it is more general, i.e. it can be any relation between any items, and secondly, the whole relation cannot be represented in a single module at one time if the involved items are similar. This results from the fact that an explicit representation must be specified by three states: one represents the relation and the other two represent the items involved in the relation. When the two items are similar, as is the case if we want to represent a relation like "Sam hits John", they must be represented by more than two states in ℒ. The two states needed in this case are one that represents the first half of the relation, i.e. "Sam hits", and one that represents the second half, i.e. "John is being hit". Each of them can be considered to be composed of three states in different submodules of M_C. "Sam hits" could be composed of three states for Sam, hit, and something like the role 'actor'. "John is being hit" could be composed of states for John, hit, and a role like 'patient'. This is an explicit relation, since there is a substate that corresponds to the actual relation, in this case 'hit'. It is not necessary for ℒ to work with any special type of semantic roles. Actually, it is not necessary to ever decide on any semantic roles. It is possible for ℒ to use the representation as long as the representation shows some similarity between similar roles. This is closely connected to the problem of finding primitive semantic features. In the model presented here this problem does not exist, as any semantic primitives are adequate as long as they contain sufficient information and satisfy the constraints set by the continuity hypothesis.

With the mechanisms described so far, what kinds of language can ℒ handle? ℒ can, in principle, take any connected set of substates in M_C and associate it with any word in M_L. To understand a sentence like "Sam hits John", the words must contain information about the roles of the different participants. This would be the case in a case-marking language, where the case markers contain information about the role. But the neural mechanisms constrain the case system in some important ways. The case marker must be closer to the word it is connected to than to any other for ℒ to be able to easily roll the ball into a valley that corresponds to the correct role for the participant. It is also easier for ℒ to construct the correct states for a simple three-word sentence if the verb is presented in first or second position. It can more easily construct the two resulting states for "Sam-act hits John-pat" if it can complete the first state, SAM•ACT•HIT, before it has to start with the other one, JOHN•PAT•HIT. If the verb is in final position, ℒ has to go back and fill in the missing information in the first state. This means that it has to find the previous state again, something which is more easily accomplished if the verb contains some information about the first state. This is the case if the verb is congruent with the subject in some way. Congruence, in general, helps the language module to build the appropriate states in response to linguistic input. It is also interesting to note that the sentences that are possible to produce from a single state are similar to the sentences produced by children in their two-word stage [9]. Longer sentences make use of the qualitatively more complicated mechanisms.

5.4 Word Order

Not all languages have case markers. Instead, word order is used to signify the roles of the different participants. For this we must extend ℒ with a module sensitive to order.

This module, M_O, consists of a hierarchy of chunking modules. It interacts with M_C to find different orderings of the contextual information. It is also able to produce different orderings of the linguistic output depending on the different states in M_C. In a way similar to the interaction between M_C and M_L, M_O can influence the concept formation in M_C in a direction that makes it easier for ℒ to learn to use word order information in the linguistic input. This interaction can lead the semantic roles developed in M_C to be more syntactically oriented, which will help the generalization of syntactic constructions over a broader spectrum of sentences.

To understand the sentence "Sam hits John", ℒ can now proceed as follows. When presented with the word "Sam", the ball falls down into the valley in M_C that corresponds to Sam, as before. When "hits" is received, the ball in M_C jumps to the valley representing SAM•HIT. The order module, M_O, recognizes this as a temporal sequence of PERSON and VERB, which are substates of SAM and HIT. It uses this to push the ball towards a valley representing SAM•HIT•ACT. This is the first state we wanted the sentence to produce, and it is now temporarily stored in M_I. Now, ℒ receives the word "John", which causes the ball in M_C to jump from the valley it is in to a valley representing John involved in hitting, JOHN•HIT. Again, the ordering module can fill in the missing information by pushing the ball in the direction of the valley for PAT. This new state can now be stored in M_I. This completes the processing of the sentence.

Figure 7. The path taken in a hypothetical state space in response to the sentence "Sam hits John": from sSAM via sSAM•HIT to sSAM•HIT•ACT, then via sJOHN•HIT to sJOHN•HIT•PAT. The three selected dimensions correspond to the entire state of some submodules of M_C.
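A minimal sketch of this walk-through as plain bookkeeping (not the network dynamics), assuming a subject-verb-object sentence and a hand-coded verb class; everything here is illustrative:

    def parse(words, verbs={"hits"}):
        # Build the two explicit-relation states for "Sam hits John":
        # SAM•HIT•ACT and JOHN•HIT•PAT, stored one after the other in M_I.
        states = []        # completed states, as stored in M_I
        current = set()    # the ball's current position in M_C
        verb = None
        for w in words:
            if w in verbs:
                verb = "HIT"                       # the valley for the verb
                current |= {verb, "ACT"}           # M_O saw PERSON then VERB: add 'actor'
            elif verb is None:
                current = {w.upper()}              # the ball falls into SAM
            else:
                states.append(frozenset(current))  # SAM•HIT•ACT goes to M_I
                current = {w.upper(), verb, "PAT"} # ball jumps to JOHN•HIT; M_O adds 'patient'
        states.append(frozenset(current))
        return states

    print(parse("Sam hits John".split()))
    # [frozenset({'SAM', 'HIT', 'ACT'}), frozenset({'JOHN', 'HIT', 'PAT'})]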

The process above works even if not all words are known to ℒ. If it has previously experienced the scene described, it has the corresponding states stored in M_I. This, together with an understanding of the rest of the sentence, makes it possible for ℒ to form a hypothesis about what the meaning of the unknown word is. The relevance of word order can be learned in a similar way.

6. CONCLUSION

The model described above has been implemented as a computer program that simulates the neural systems that are needed to implement the different modules. It has shown a few things about neural mechanisms and linguistic ability:

• A neural system can work with real syntactic constituent structure without the detour of a symbol-manipulating automaton of the usual kind.

• Keeping the model at a neural level makes it possible to use mechanisms for finding semantically consistent states to form hypotheses about syntactic constructions. This makes the learning of a simple grammar possible, and I can see no reason why the mechanisms described here could not be extended to account for a more complicated grammar.

• It is possible to predict some general word order constraints on natural language as a result of the neural mechanisms.

• The use of anaphora can be understood as referring to states in the intermediate memory.

• Congruence is not redundant, but helps the language module to construct the appropriate states.

• Language learning and concept formation are highly dependent on each other.

• The language learning of the model shows some similarities to language acquisition in children, such as the one- and two-word stages. The sentences produced at these stages have approximately the same semantic content as the sentences observed in data about language acquisition.

• Syntax is not independent of semantics. The complexity of language should not be considered to be only syntactic, since the underlying semantic structure changes the complexity of the syntax in a way that is not observable on the surface.

REFERENCES

[1] Fodor, J. A. and Pylyshyn, Z. W. "Connectionism and cognitive architecture: a critical analysis", Cognition, 28, 1988, 3-71.
[2] Touretzky, D. "Connectionism and PP attachment", in D. Touretzky, G. Hinton, and T. Sejnowski (eds), Proceedings of the 1988 Connectionist Models Summer School, Morgan Kaufmann, 1988.
[3] Ackley, D. H., Hinton, G. E. and Sejnowski, T. J. "A learning algorithm for Boltzmann machines", Cognitive Science, 9, 1985, 147-169.
[4] Grossberg, S. Studies of Mind and Brain, D. Reidel, Dordrecht, Holland, 1982.
[5] Grossberg, S. The Adaptive Brain, vols. I & II, North-Holland, Amsterdam, 1987.
[6] Hopfield, J. J. "Neural networks and physical systems with emergent collective computational abilities", Proceedings of the National Academy of Sciences, 79, 1982, 2554-2558.
[7] Kohonen, T. Self-Organization and Associative Memory, Springer-Verlag, Berlin, 1988.
[8] Rumelhart, D. E. and McClelland, J. L. Parallel Distributed Processing, vols. I & II, MIT Press, Cambridge, Massachusetts, 1987.
[9] de Villiers, J. G. and de Villiers, P. A. Language Acquisition, Harvard University Press, Cambridge, Massachusetts, 1978.