The Lexicon and the Boundaries of Compositionality

Alessandro Lenci
University of Pisa, Dept. of Linguistics
Via Santa Maria 36, 56100 Pisa, Italy
[email protected]

September 27, 2005
Keywords: compositionality, polysemy, lexical semantics, context
1 Introduction
In its simplest and most informal formulation, the principle of compositionality expresses an incontrovertible truth of natural language interpretation, i.e. that the meaning of a complex expression can be derived from the meanings of its parts and the way these parts are syntactically related. The fact that our linguistic system is compositional is nicely confirmed by the capacity of speakers to use a restricted set of linguistic resources to produce and understand a potentially unlimited number of sentences: “this would be impossible were we not able to distinguish parts in the thoughts corresponding to the parts of a sentence” (Frege 1923). The problem is then how to turn the intuitive statement of compositionality above into a formal constraint on the human language computational system.

The lexicon raises some important issues that call into question the compositional nature of natural language interpretation. As Janssen (1997: 420) claims, “compositionality requires that words in isolation have a meaning and that from these meanings, the meaning of a compound can be built.” In other terms, the principle of compositionality seems to presuppose that word meaning is strictly context-free, that is to say that “a lexical item must make approximately the same semantic contribution to each expression in which it occurs” (Fodor and Pylyshyn 1988). Prima facie, this requirement sharply contrasts with another incontrovertible fact about natural language, namely that lexical meanings are highly context-sensitive. Word meaning actually undergoes continuous modification and modulation in new contexts, and speakers have the ability to “adapt” the meaning of lexical items to fit a specific context and communicative situation (Pustejovsky 1998).

A possible way to reconcile meaning variation in context with the requirements imposed by the compositional interpretation of complex expressions is to address the former in terms of the notion of homonymy, i.e. the conflation of two symbolic entities under the same form. The fact that in the two sentences The director opened the bank and The water
overflowed the bank the word bank occurs with different meanings can be easily explained by the fact that this word is a clear case of homonymy. Compositionality is salvaged by simply assuming that these sentences should actually be rewritten as The director opened the bank1 and The water overflowed the bank2, with bank1 and bank2 two different lexemes, each associated with a specific and unique meaning. Generalizing this solution to every case of meaning variation entails adopting what Pustejovsky (1995) terms the sense enumeration model of the lexicon. This consists in assuming that for every word w with senses s1, ..., sn, the lexicon contains n different lexical entries w1, ..., wn, each with a univocal meaning. Thus, the fact that the same word might have different meanings depending on the context could just be deemed apparent, the truth being that we have multiple lexemes that happen to share the same form.

However well this solution saves compositionality, it cannot be satisfactory. First of all, the different senses of a word are only rarely distinct and well-distinguishable conceptual units, as in the case of bank. Much more commonly, words have multiple meanings that are deeply interwoven and can even be simultaneously activated in the same context. This case is known as polysemy, a widespread and pervasive feature of the organization of the lexicon. Polysemy actually concerns a term like bank too. For instance, the sense of bank2 above is itself a constellation of different related meanings: the bank-as-institution needs to be distinguished from the bank-as-building and the bank-as-people-working-in-it, and yet these three senses are clearly related in a way in which the bank of the river Thames and the Bank of England are not.

Secondly, polysemy and sense creation in context are systematic. Homonymy is just the accidental conflation of two different symbolic units under the same form, but polysemy is not accidental. As is well known, there are regular polysemous clusters of senses that are shared by whole classes and paradigms of words. For instance, school and church can also refer either to an institution or to a building, exactly like bank, and these sense alternations are preserved across various languages. This shows that polysemy is a systematic property of the human lexicon. As Blutner (2004) claims, any model of compositionality must also be able to account for the systematicity of linguistic competence, i.e. the fact that the ability to understand and produce some expression is intrinsically connected to the speakers' ability to understand and produce other expressions that are semantically related. Unless this constraint is satisfied, compositionality risks becoming just a vacuous principle, deprived of any real explanatory power.

Last but not least, meaning context-dependency is the direct product of an open-ended capacity of speakers to use words in different situations with new meanings, i.e. it is part of speakers' natural language creativity. Therefore, sense-enumeration approaches to the lexicon are wrong precisely because word senses, exactly like sentences, cannot be enumerated.

Thus, a paradox is almost at the door. As we said above, the principle of compositionality is a key ingredient in explaining that part of human language creativity that concerns the possibility of producing an unlimited number of sentences by combining elementary linguistic expressions.
However, the principle of compositionality seems to require that meanings have a fixed and context-free interpretation. This in turn makes it hard to explain another part of human language creativity, the one that concerns the lexicon itself and that consists in the speakers' ability to create new word senses in context.
Reconciling these two horns of the dilemma is one of the biggest challenges for any theory of human language: how to account for meaning compositionality without giving up on modelling word meaning variation, that is to say, how to reconcile sentence-level creativity with word-level creativity.
2 Enriching the lexicon with context
A major tenet of contemporary research into the structure and function of language is that the core semantic properties of the lexicon depend on the way meaning is represented. Semantic representations are in fact regarded as playing a key role in explaining central lexical phenomena such as meaning multiplicity (e.g. ambiguity, polysemy, etc.), lexical inferences, semantic similarity judgements, and meaning composition. The key problem is the form and organization of semantic representations.

Consistently with the cognitive paradigm that regards mind and language as symbolic combinatorial systems, a dominant approach represents word meaning in terms of symbolic structures. Lexical competence is thus modelled as a structured formal system of conceptual symbols onto which lexical terms are projected. Using a term nowadays very popular in knowledge representation and computer science, such a conceptual symbol system may be regarded as an ontology (cf. Saint-Dizier and Viegas 1995, Vossen 2003). Generally speaking, an ontology defines the set of concepts relevant to the description and organization of a certain domain of knowledge, together with the set of relations and axioms that define its architecture (Gruber 1995). Similarly, we can conceive of lexical modelling as the task of managing semantic knowledge in terms of a suitable repertoire of semantic types or categories. To emphasize the specific problems raised by ontologies for word meaning representation, we can refer to them as lexical ontologies. Different senses of the same word thus correspond to different elements of the ontology, while its architecture provides an explicit representation of the organization of the lexical space, i.e. of how lexical meanings interact and relate to each other.

A basic design principle of lexical ontologies assumes that word information content is internally structured, and that it is this structure that determines (and constrains) the way meanings combine to form complex expressions. The internal structure of lexical items, as well as their combinatorial properties, is articulated in terms of functions and arguments. A major divide thus exists between predicative lexical items that project an argument structure (prototypically verbs, but also relational and event nouns, adjectives, etc.) and those that act as semantic fillers of predicate arguments. In this case, the lexical ontology includes conceptual symbols with typed variables, each representing a semantic argument. The type of the variable, itself a conceptual symbol, specifies its selectional restrictions, while semantic roles (e.g. agent, patient, etc.) eventually identify the arguments in terms of their role in the event expressed by the predicate. Within this general architecture, compositional semantic representations are typically modelled as highly complex recursive symbolic structures, obtained by applying the functional complexes expressed by predicative lexical items to their arguments according to the syntactic structure of the sentence.
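The function-argument regime just described can be made concrete with a minimal sketch. The following Python fragment is only a schematic rendering of unidirectional functional application with selectional restrictions, not the formalism of any particular theory; the Entity and make_predicate names and the type labels are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Entity:
    """An argument lexeme, reduced to a name and a semantic type."""
    name: str
    semantic_type: str   # e.g. "institution", "human"

def make_predicate(verb: str, selects: str) -> Callable[[Entity], str]:
    """A predicate is a function over entities; `selects` encodes the
    selectional restriction on its single argument."""
    def apply(arg: Entity) -> str:
        if arg.semantic_type != selects:
            raise TypeError(f"{verb} selects a {selects}, "
                            f"got {arg.semantic_type}")
        return f"{verb}({arg.name})"
    return apply

founded = make_predicate("found", selects="institution")
school = Entity("school", "institution")
print(founded(school))   # found(school): the type restriction is satisfied
```

Note that the predicate is the only “active” element here: it imposes a type restriction that its argument passively satisfies. It is precisely this asymmetry that the Generative Lexicon, discussed below, calls into question.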
The distinction between functions and arguments as a primary articulation of the lexical space plays a key role in formal implementations of the principle of compositionality. At the same time, the functional structure of semantic representations also raises the issue of how predicative and argument lexical items interact. In fact, the way meaning composition is represented in most semantic theories takes the form of a strictly unidirectional process of functional application. The strict dichotomy in the lexicon between words acting as semantic functions and those acting as arguments therefore comes to correspond to the opposition between “active” concepts that set the number and types of items with which they can combine, and non-predicative “passive” items that contribute to conceptual composition merely by filling the argument slots and satisfying their type restrictions (Pustejovsky 1995: 39).

However, this model of lexical composition is notoriously unsatisfactory when dealing with polysemy and lexical creativity, which heavily affect the way words distribute with each other. Consider for instance school: a school can be built or destroyed (like a house or a car), founded (like a bank or an institution), begin or finish (like summer or a vacation), go on holiday or win a championship (like a group of friends or a football team), be boring or interesting (like a book or a story), etc. Since the predicates that can be applied to school are maximally orthogonal with respect to the semantic types of their arguments, explaining the distribution of school in terms of the conceptual categories that it expresses would entail multiplying the number of its senses.

This problem is addressed in Pustejovsky's theory of the Generative Lexicon (GL) by proposing alternative ways of conceiving both the internal structure of concepts and the way concept composition is modelled (cf. Pustejovsky 1995). One of the main features of GL is the abandonment of the twofold assumption that compositionality has to be implemented as a unidirectional functional application process, and that the lexicon is rigidly partitioned into active, selection-imposing functional elements and passive, selection-satisfying argument lexemes. GL assumes that even those lexical items that superficially behave as arguments may actually have a highly complex internal predicative nature that drives their linguistic distribution. This generalized predicative structure of lexical items is formally implemented through the four-dimensional Qualia Structure. Together with the Argument Structure, the Event Structure and the Lexical Inheritance Structure, the qualia structure provides lexemes with a complex multi-layered representation of their content, thereby making “all lexical items as relational to a certain degree” (Pustejovsky 1995: 76). The qualia structure appears in the semantic representation of all the major lexical categories, although its most direct and clear application is in the analysis of nominal concepts. Notoriously, a prominent role in GL is assigned to the agentive and telic qualia: in nominal categories they encode (proto)typical events in which the entities represented by these categories are involved. The qualia structure is thus intended as a lexicalization of relevant chunks of the contexts of a word, which thereby enter into its information content.
According to this view, the fact that a book can be read as part of its typical function, or written as its mode of creation, or similarly the fact that the typical function of a violinist is to play and that of a knife is to cut, are part of the information content lexicalized in the concepts expressed respectively by book, violinist and knife. This in turn means that lexical concepts do not only include properties that can be modelled as monadic features (e.g. shape,
color, dimension, etc.), but also information referring to the events and situations in which these entities participate, which therefore needs to be represented in terms of polyadic predicates. Crucially, in GL the pieces of contextual knowledge encoded in the qualia structure “provide the jumping off point for operations of semantic reconstruction and type change” (Pustejovsky 1995: 77) and are therefore the basis for explaining a wide array of creative and dynamic aspects of the lexicon. In fact, systematic lexical polysemy is treated dynamically as the result of an on-line process of lexically controlled sense creation. Senses are generated in context as a result of the internal semantic structure of lexical items. It is mainly the information in the qualia structure that drives (and, at the same time, constrains) metonymic reconstructions, sense extensions, adjectival polysemy, etc. These phenomena are modelled through the introduction of more complex modalities of lexical composition, such as co-composition, type coercion, selective binding, etc. When two lexical items are combined (e.g. a verb and its object, or a noun and an adjectival modifier), their qualia structures interact in a complex way and generate context-specific interpretations. Since lexical items come with highly articulated internal predicative structures, their composition consists of elaborate processes of merging the information encoded in their various representation layers, and in the qualia roles in particular.

The notion of qualia structure as a key component in modelling the lexicon, its dynamics and semantic composition raises important issues concerning its proper status and definition. A first salient aspect of qualia is that they act as structuring dimensions of word content. In fact, qualia are not themselves meaning components, and in this respect they radically differ from semantic features or primitive conceptual functions. Rather, they are intended to provide a multidimensional, structured partition of the space of properties constituting lexical concepts. The second crucial aspect is that the qualia structure is grounded in the assumption that it is possible to select a subset of contextual knowledge about an entity as constitutive of its concept. This point is not uncontroversial, and it obviously raises the crucial question of how to model the qualia structure so as to properly establish this delicate border. For instance, the prominence assigned to the telic and agentive roles in the theory implies that these two dimensions are major structuring dimensions of the concept space. Pustejovsky (2001) proposes an ontology of nominal semantic types in which the main partition is between natural categories (e.g. dog) and functional categories (e.g. knife), with the latter characterized by telic and agentive information, which is instead lacking in natural types. This is surely consistent with the great bulk of psycholinguistic and cognitive evidence supporting a different representation and organization of artifactual and natural categories, with functional information playing a key role in identifying the former. On the other hand, this in turn raises the issue of how we characterize the telic dimension, since the simple definition as “purpose and function of an entity” is necessarily too loose and abstract. Qualia are in fact designed as those aspects of word content necessary to capture its lexical polymorphism, e.g. polysemous alternations, metonymic reconstructions, etc.
The point is that a “metaphysical” definition of the qualia may not be perfectly in line with their contribution to the explanation of lexical dynamics. A certain entity may in fact have a proper function at the metaphysical level, without it necessarily being the case that this function is relevant to explaining its linguistic behavior, exactly because “it is just
such functions that don't make it over to the lexicon from the metaphysical scheme” (Asher and Pustejovsky 2000: 16).

GL raises the important issue of the relationship between a word's lexical content and the context, with the qualia acting as a sort of interface and filter between the two. The standard view models this relationship by assigning the context an essentially selectional role: a word has a given number of senses, and the context decides and selects the appropriate one. The main tenet of GL is that this model is inadequate because it ends up regarding the various senses of a word as all equally distinct, thereby producing their unwarranted multiplication. In GL, the relationship between lexicon and context is somehow inverted. It is an inherent property of the qualia dimensions to act as a selective and structuring filter on contextual knowledge, singling out those aspects that will enter into the constitution of the concept. However, as Jayez (2001) argues, it is controversial whether this “context-selectivity” of lexical items can really be reduced to the qualia roles, at least as they are currently defined, because they are at the same time too loose and too rigid to capture the multidimensionality of concepts.
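To fix intuitions about the representational machinery discussed in this section, here is a minimal sketch of how a GL-style lexical entry with its qualia structure might be rendered as a data structure. The four role names come from Pustejovsky (1995) and the book example from the discussion above; the Python encoding itself (QualiaStructure, LexicalEntry) is purely illustrative.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QualiaStructure:
    """The four qualia roles of Pustejovsky (1995), reduced here to
    simple predicate labels for the sake of illustration."""
    formal: str                   # what kind of entity it is
    constitutive: Optional[str]   # what it is made of / its parts
    telic: Optional[str]          # its purpose or function
    agentive: Optional[str]       # how it comes into being

@dataclass
class LexicalEntry:
    lemma: str
    qualia: QualiaStructure

# 'book' as discussed in the text: its telic role is reading,
# its agentive role is writing.
book = LexicalEntry(
    lemma="book",
    qualia=QualiaStructure(
        formal="physical_object . information",   # a "dot type" in GL terms
        constitutive="pages",
        telic="read(agent, book)",
        agentive="write(agent, book)",
    ),
)

# A natural type like 'dog' lacks telic and agentive information,
# reflecting Pustejovsky's (2001) natural vs. functional divide.
dog = LexicalEntry(
    lemma="dog",
    qualia=QualiaStructure(formal="animal", constitutive=None,
                           telic=None, agentive=None),
)
```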
3 Contextualizing the lexicon
Lexical ontologies are grounded on the assumption that modelling meaning essentially amounts to representing human lexical competence as fundamentally independent of the way words are used in context. This theoretical stance is reminiscent of the traditional competence vs. performance opposition, typical of the generative paradigm in linguistics. In syntactic theory, the opposition essentially takes the form of an irreducible dichotomy between what we know about the sentences of a language (i.e. its grammar) and what we do with them, i.e. how we use them in concrete communicative scenarios. In turn, this entails that grammatical descriptions are independent of any use distribution: a grammar represents exactly those aspects of language that are supposed to hold true in the mind of an idealized speaker, no matter how language is actually used. In other terms, the representation of such knowledge must be abstracted away from any particular system of use (Townsend and Bever 2001: 37). Similarly, most symbolic approaches to lexical representations seem to assume a parallel dichotomy between word content and word usage in context. A lexical ontology of conceptual symbols intends to be a representation of what is “true” about word meaning, i.e. a representation of what we know about the information content of a word, irrespective of its behavior in context. The competence vs. performance dichotomy, however, leaves open the issue of how the system of rules and principles defining human linguistic competence is actually used. The same holds of lexical ontologies, which do not lend themselves to modelling lexical dynamics (and, more generally, the way lexical meanings are put to use in real contexts) in a natural way. One major problem concerns the extent to which ontologies are able to account for the conceptual space defined by a word, in the face of the ubiquitous problem of lexical polysemy. An interesting vantage point from which we can observe the complex interplay between compositionality and semantic polysemy is provided by the notion of lexical substitutability.
As is well known, one of the main consequences of the principle of compositionality is that if the meaning of a complex expression P(x1, ..., xn) is compositionally derived from the meanings of its parts, then if we substitute xi with a term yi of roughly similar meaning, the meaning of the complex expression does not change (or at most changes proportionally to the degree of semantic similarity between xi and yi). Substitutability of semantically similar items is in fact one of the major properties distinguishing compositional structures from non-compositional, idiomatic ones. So for instance if we say This school was founded in 1815, since the interpretation of this sentence is compositionally built from the meaning of its parts, we can freely replace school with terms like bank, university, company or newspaper, and still obtain a well-formed and semantically similar sentence. The reason is exactly that bank, school, university, company and newspaper are to a certain extent semantically similar, e.g. because they all belong to the same ontological class Institution. Now, let us focus on some uses of the word school, which we also briefly discussed in section 2:
a school can ...      LIKE
be destroyed          a house, a car, etc.
be founded            a bank, a university, etc.
finish                summer, movies, etc.
go on holiday         a group of friends, my family, etc.
win a championship    a football team, etc.
be boring             a book, a movie, etc.

Table 1: Different semantic similarity spaces of school
These examples show that school can be compositionally combined with various verbal predicates expressing very different types of events. The syntactic relation holding between school and the predicate can also vary, e.g. active subject with go on holiday, passive subject (i.e. deep semantic object) with be founded. The point is that each of the predicates above defines a different semantic similarity space for the noun school: the terms that can be substituted for school as it occurs with be destroyed are very different from those that are substitutable for the same item as it occurs with go on holiday or be boring. Sometimes these semantic spaces overlap with one another to a certain extent, but they are nevertheless quite distinct. The fact that school belongs to different semantic similarity spaces depending on its compositional environment is just another way of looking at the polysemous nature of this lexical item.

There is an immediate and rather conservative theory of these facts, which simply amounts to assuming that each semantic similarity space is generated by a different meaning of school. Within the lexical ontology paradigm we saw above, this would be equivalent to stating that our ontology must include multiple symbolic conceptual representations of school, to be specifically targeted and selected by a given predicate. These representations would also be shared, to a certain extent and degree of abstraction, by the other members of the semantic similarity spaces, thereby allowing their substitutability
in the proper contextual conditions. However, at this point we know that, although viable, this theory is surely not a satisfactory one, exactly because it is a “non-explanation”. A much more “progressive” theory could be provided by GL. As we saw above, GL represents an interesting attempt to explain semantic polymorphism at the level of word interaction in context, rather than as part of abstract lexical representations. This solution strongly relies on two facts:

1. the boundaries between lexical representations and context are smoothed, by assuming that the qualia structure also encodes important information on prototypical events or situations in which entities occur;

2. the notion of lexical composition is enriched by expanding the traditional operation of function-argument application, which is made sensitive to the qualia structure information within lexical representations.

Actually, a GL-style analysis of the behavior of school could run as follows: there is just a unique semantic representation for the various occurrences of school above, but one with a very complex internal structure, eventually to be cast in terms of qualia structure, “dot-typing” (cf. Pustejovsky 1995, 2001), etc. This internal structure includes a rich array of (highly conventionalized, prototypical) contextual knowledge about schools, what we do in them, who goes to school and when, etc. Part of this information is also shared by other items (possibly at different levels of abstraction), which can then “live” in the same semantic similarity space as school. The polysemous behavior stems from the way the verbal predicates compose with school: semantic operations such as co-composition, coercion or selective binding, operating on the rich representational structure of school, allow predicates to combine with this lexical item and to “highlight” different bits of its internal information. The main progress gained by adopting a GL-style explanation is that no sense multiplication needs to be assumed, i.e. polysemous similarity spaces are generated, instead of being presupposed as part of the lexical entry of school.

On the other hand, the cost of this explanation resides heavily in the rich internal representational structure which is necessary for the generative compositional operations to work properly, and which is by and large stipulated in the theory. As we said in the previous section, every GL-style theory of a lexicon fragment should actually be accompanied by clear and explicit identity criteria for the qualia structure and for all the other representational devices needed to carry out the appropriate context-driven sense generation processes. If a clear theory of contextually enriched lexical representations is lacking, any GL-style analysis runs the risk of circularity: the stipulation of multiple senses of a lexical item is avoided by stipulating a rich internal representation for that lexeme.

As remarked in GL theory, a key ingredient of a satisfactory explanation of the interplay between lexical polysemy and compositionality lies in assuming that different semantic spaces are generated “on line” by predicates during the composition process. The problem then resides in the shape and form of the semantic representations that give rise to the generation of alternative similarity spaces. A possible way to go beyond the
limits of GL-style analyses is to “push to the limit” the very idea of a context-sensitive lexicon, by pursuing semantic representations conceived as dynamic entities that are able to adjust and modify their similarity spaces depending on the context in which they appear. Psychological research provides interesting support for this type of view. For instance, there is striking empirical evidence that human similarity judgments are necessarily dependent on the context and on the perspectives under which they are expressed (cf. Barsalou 1982, Medin et al. 1993). However, standard lexical representations cast in terms of ontologies of conceptual symbols are not the most natural paradigm with which to implement such a model of the lexicon. In lexical ontologies, word meanings are typically never shaped or changed depending on the context of usage (cf. the competence vs. performance distinction mentioned above). Conversely, there is a radically different approach to word meaning in which a word's information content is assumed to be inherently rooted in its contexts of use. In this model, hinging on the so-called distributional hypothesis of lexical meaning, an alternative view of semantic representations emerges, which allows us to approximate the idea of a context-sensitive lexicon as a way to account for semantic composition while leaving room for semantic changes and meaning shifts.
4 Distributional models of meaning
Perhaps the most radical departure from symbolic approaches to meaning representation is the Wittgensteinian assumption that lexical knowledge is just a reflection of language usage. Over the last few years, in step with recent advances in understanding language acquisition through computational and robotic models, this assumption has spawned a number of often competing models of machine language learning aimed at investigating the relationship between meaning and context. Since Harris (1968), distributional information about words in context has been taken to play an important role in explaining several aspects of the language ability. The role of distributional information in developing representations of word meaning is now widely acknowledged in the literature. The so-called distributional hypothesis has been used to explain various aspects of human language processing, such as lexical priming (Lund and Burgess 1996), synonym selection (Landauer and Dumais 1997), retrieval in analogical reasoning (Ramscar and Yarlett 2002) and judgements of semantic similarity (McDonald and Ramscar 2001). It has also been employed for a wide range of natural language processing tasks, including word sense and syntactic disambiguation, document classification, identification of translation equivalents, information retrieval, automatic thesaurus and ontology construction and updating, language model smoothing, etc. (see Manning and Schütze 1999).

Under this hypothesis, learning the meaning of a word is thought to be based, at least in part, on the speakers' exposure to the word in its linguistic contexts of use. If this is the case, it should then be possible, at least in principle, to automatically acquire the meaning properties affecting the distributional behaviour of a word by inspecting a sufficiently large number of its contexts of use, as reflected for instance in natural language texts. This set of context-sensitive properties provides us with a corpus-based characterization
of the possible meaning facets of a lexical unit. From a more cognitive perspective, an investigation into the process of inducing word meaning properties directly from how words co-occur in context is likely to shed light on the way humans exploit contextual information to learn word uses.

A qualification is in order at this point. The word context is used here to mean only the narrow linguistic context, i.e. the words uttered/written before and after the word in question. This is clearly a big limitation with respect to a wider and more general sense of context as pragmatic or situated context, including any non-linguistic information available to the speaker through the particular situation where a sentence is uttered. However, the restriction is not totally unreasonable, as it is certainly true that literate persons acquire many new words through reading, where nothing but a linguistic context is available (Miller and Charles 1991). While text is of only limited help in the acquisition of the perceptual correlates of word meanings, it is nonetheless the most natural place to look when one wants to bootstrap semantic restrictions on the grammatical usage of a word. It is simply not enough to know that a verb like eat selects a noun phrase as a direct object. In fact, the probability of filling the object position of eat is not evenly distributed over all nouns: “eat an apple” is far more plausible than “eat a suggestion”. There is a strong presumption that knowledge of the semantic preferences of a verb over the set of its possible arguments is part and parcel of a speaker's linguistic competence. These preferences have been shown to play an active role in human language parsing, particularly at early stages of associative surface syntax or ‘pseudo-syntax’ (Townsend and Bever 2001). The distributional hypothesis promises to bootstrap semantic knowledge directly related to this level of linguistic expertise. It is thus reasonable to turn to textual evidence as the most accessible crossroads where cognitive, epistemological and language-engineering perspectives on meaning bootstrapping can possibly meet.
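The contrast between “eat an apple” and “eat a suggestion” can be given a simple quantitative reading: the selectional preferences of a verb can be approximated as a conditional probability distribution over the nouns filling a given argument slot, estimated from corpus counts. The sketch below illustrates the idea with invented counts and a hypothetical object_probability helper; real systems would of course use large parsed corpora and smoothing.

```python
from collections import Counter

# Toy counts of (verb, object-noun) pairs, as one might extract
# from a parsed corpus; the numbers here are invented.
verb_object_counts = Counter({
    ("eat", "apple"): 120,
    ("eat", "bread"): 95,
    ("eat", "suggestion"): 0,
    ("make", "suggestion"): 80,
})

def object_probability(verb: str, noun: str) -> float:
    """P(noun | verb, object slot): relative frequency of the noun
    among all direct objects of the verb."""
    total = sum(c for (v, _), c in verb_object_counts.items() if v == verb)
    if total == 0:
        return 0.0
    return verb_object_counts[(verb, noun)] / total

print(object_probability("eat", "apple"))       # high
print(object_probability("eat", "suggestion"))  # zero in this toy corpus
```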
4.1 Context-based lexical representations
Under the distributional approach to word meaning, measuring the semantic similarity between any two words is equivalent to measuring the degree of overlap between their sets of linguistic contexts. This is to say that two words that tend to be selected by similar linguistic contexts are closer in the semantic similarity space than two words that are not as distributionally similar. Most commonly, the behavior of a target content word A is represented as a multidimensional vector of co-occurrence frequency distributions. Each vector dimension says how often A is seen in the company of another word in a text corpus (often divided by A's overall frequency, to normalise differences in token frequency between target words). One can conceive of a vector dimension as a distinctive feature of the target A. Generally speaking, the extent to which a given feature f characterises A is proportional to the frequency value f takes for A. If two target words A and B present close values on the same vector dimensions dj and dk, then A and B are said to be similar relative to dj and dk. Of course, overall similarity between target words must be computed over the entire set of dimensions making up their vector representation. Intuitively, the greater the number of dimensions for which A and B present close values, the higher the similarity between
them. This presupposes the definition of a distance function D, associating a scalar to any pair of target words. Similarity is then defined as an inverse function of D, whose values range between 0 (no similarity) and 1 (maximum similarity): for example, sim(x, y) = 1/(1 + D(x, y)). Several different D(x, y) can be used for this purpose, perhaps the most common being the Euclidean distance. Given any two real-valued vector representations, we can compute their Euclidean distance as the norm of their difference, i.e. the square root of the sum of the quadratic differences of their dimension values:

|\vec{x} - \vec{y}| = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}    (1)
This reflects the assumption that topological proximity in the space of target words is indicative of closeness in semantic similarity space. In a more direct way, the dot product of two normalised vectors, i.e. the cosine of the angle between them, can be interpreted as a correlation coefficient, saying how similar the two vectors are. If the two vectors are geometrically aligned on the same line, the angle between them is 0 and its cosine is 1. Conversely, if the two vectors are independent, their angle is close to 90° and its cosine close to 0.

If words can be represented in terms of their distribution in contexts, then what is a set of contexts like? Actually, the notion of linguistic context can be defined in a number of ways. In the literature, the last ten years have witnessed a prominent shift of focus from a rather loose notion of context to a linguistically more constrained one. Under the first, looser notion, a set of contexts of a word w consists of a bag of all content words co-occurring with w within a text window of fixed size. Thus, given any two words wi and wj, their similarity can be measured by looking at the intersection between their respective word bags. Linguistic information about morpho-syntactic categories and syntactic relations is not taken into account in the characterization of the word contexts. By contrast, the second interpretation requires that specific linguistic information be used to define word contexts. This is to say that a set of contexts is not simply a bag of co-occurring words, but a list of linguistically annotated words, each of which is assigned a morpho-syntactic label and stands in a specific syntactic relation to w. Accordingly, the similarity between wi and wj is given by the intersection between their lists of linguistically annotated words. A word belongs to the intersection if and only if it is found to independently co-occur with wi and wj under the same linguistic (i.e. morpho-syntactic and functional) interpretation. Different pieces of linguistic information can be taken into account to characterize syntactic contexts. Another important element of variation concerns the type of syntactic relations to be included in the context representation of a word. It goes without saying that the choice of a particular type of context representation may have significant repercussions on the typology and quality of the resulting distributional semantic representations.

Context-vector representations of words usually provide the input to computational methods that are used to group semantically similar words. Clustering, Multidimensional Scaling and Self-Organizing Maps are just some of the many techniques that can be used
to draw topological pictures of the semantic similarity spaces of words, as determined by their distributional properties. We can imagine distributionally defined lexical concepts arranged in what Gärdenfors (2000) calls conceptual spaces. The latter are defined by a certain set of quality dimensions that impose a topological structure on the stream of our experiential data. In Gärdenfors' classical example, the structure of the “color space” is defined by three dimensions: hue, brightness, and saturation. The meaning of each color term is then identified with a three-dimensional vector of values locating its position on each of the space axes. The conceptual space model predicts that semantically similar color terms will be located closer together in the “color space”. In much the same way, the distributional hypothesis suggests that we can use n-dimensional frequency vectors to define a lexical semantic space derived from the average use of words in context. These represent classes of semantically similar words as clouds of n-dimensional points, which occur close together in the semantic space relative to some prominent perspectives.
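As a concrete illustration of the representations and measures introduced in this section, the following sketch builds bag-of-words co-occurrence vectors from a toy corpus and compares target words both with the similarity measure sim(x, y) = 1/(1 + D(x, y)) based on the Euclidean distance of equation (1) and with the cosine. The toy corpus, the window size and the function names are arbitrary choices made for the example.

```python
import math
from collections import Counter, defaultdict

corpus = ("the school was founded in 1815 . the bank was founded in 1820 . "
          "the summer finished early .").split()

def cooccurrence_vectors(tokens, window=2):
    """Bag-of-words context vectors: vectors[w][c] counts how often c
    occurs within `window` tokens of w."""
    vectors = defaultdict(Counter)
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                vectors[w][tokens[j]] += 1
    return vectors

def euclidean_similarity(u, v):
    """sim(x, y) = 1 / (1 + D(x, y)), with D the Euclidean distance
    of equation (1); missing dimensions count as 0."""
    dims = set(u) | set(v)
    d = math.sqrt(sum((u[k] - v[k]) ** 2 for k in dims))
    return 1.0 / (1.0 + d)

def cosine_similarity(u, v):
    """Dot product of the two vectors divided by their norms."""
    dims = set(u) | set(v)
    dot = sum(u[k] * v[k] for k in dims)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

vecs = cooccurrence_vectors(corpus)
print(euclidean_similarity(vecs["school"], vecs["bank"]))    # relatively high
print(cosine_similarity(vecs["school"], vecs["summer"]))     # lower
```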
5 Perspectives and semantic spaces
In the distributional approach, lexical representations can be regarded as regions in a semantic space whose dimensions are provided by the different contexts of use of a word. This type of distributional representation can provide the basis for the construction of context-sensitive similarity spaces, parallel to those shown in Table 1. As we saw in section 3, when a lexical item wi, say school, composes with a predicate Pi, the latter acts as a sort of “perspectivizing” factor, determining different spaces of words semantically similar to wi and substitutable for it within the context of Pi. In fact, different predicates can activate different aspects of word meaning, based on varying dimensions of semantic similarity, which depend, in turn, on the goals and functions that words happen to serve in the events or situations expressed by the predicates. In other terms, if the meaning of wi is inherently context-sensitive, when we compositionally build the interpretation of a complex expression Pi(wi), we have to take into account that the predicate Pi also acts as a contextual dimension modulating the meaning of wi. This actually corresponds to the idea that word polysemy (at least to a certain extent) is created “on the fly” during compositional processes, rather than being presupposed by them.

As an illustrative example, we can look at the semantic similarity spaces built by CLASS (Allegrini et al. 2000), a distributionally based machine learning method that estimates entropic similarity scores between two nouns on the basis of type frequency distributions. In CLASS, nouns are represented as regions in the space of events expressed by verbal predicates. In fact, the events in which objects are involved provide the structuring dimensions for representing the semantics of nouns. The structure of the event space is determined by the type of event in which nouns occur, and by their grammatical roles (e.g. subject, object, etc.). CLASS is trained on an Example Base (EB) of functionally annotated verb–noun pairs instantiating a wide range of syntactic relations:

• verb–object, e.g. (fondare, scuola, obj) ‘found, school’;

• verb–subject, e.g. (iniziare, scuola, subj) ‘begin, school’;
• verb–prepositional complement, e.g. (andare, scuola, a pp) ‘go, school, to pp’.

The EB contains 43,000 pair types, automatically extracted from different knowledge sources: dictionaries, both bilingual and monolingual, and a corpus of financial newspapers. Let us then consider the Italian polysemous word consiglio (meaning both ‘council’ and ‘advice’). Table 2 shows the 10 topmost words distributionally similar to consiglio when the word is taken out of context. Words are ranked by decreasing values of S(imilarity) score (in exponential notation). This similarity chain highlights the composite nature of relevant word associations and their strong correlation with both senses of consiglio. Words marked in bold refer to (different facets of) the ‘council’ meaning. The other word associations are clearly related to the ‘advice’ meaning. As the two senses are fairly orthogonal, the overall ranking in Table 2 no longer reflects a single similarity gradient, and arbitrarily collapses multidimensional information onto a single axis.

similarity chain                    s score
programma ‘program’                 1.34695e-05
parlamento ‘parliament’             6.44995e-06
ministero ‘ministry’                6.42011e-06
comunicazione ‘communication’       5.9057e-06
misura ‘measure’                    3.45036e-06
riunione ‘meeting’                  2.61985e-06
presidente ‘president’              2.24574e-06
convinzione ‘conviction’            2.14368e-06
assemblea ‘assembly’                1.91805e-06
persona ‘person’                    1.71717e-06

Table 2: The topmost similar words of consiglio ‘council/advice’

Things dramatically change when we use distributional evidence relative to a particular context. Let us assume that the event expressed by a specific verb defines a background situation where the entities denoted by the verb arguments perform various functions/roles. We can then consider the different verbs with which consiglio occurs in a corpus (possibly with different functional roles) and use a dynamic, context-sensitive notion of semantic similarity to acquire word associations relative to the situation defined by each verb (Allegrini et al. 2003).

similarity chain                    s score
parlamento ‘parliament’             0.0019607800
assemblea ‘assembly’                0.0005764880
riunione ‘meeting’                  0.0004528990
persona ‘person’                    0.0003300620

Table 3: The topmost similar words of consiglio relative to the convocare event
similarity chain                    s score
Bundesbank                          0.0014769100
stato ‘state’                       0.0011207700
governo ‘government’                0.0010024300
operatore ‘operator’                0.0008785130
autorità ‘authority’                0.0007340040
persona ‘person’                    0.0006611840
amministratore ‘administrator’      0.0006184290
borsa ‘stock exchange’              0.0003935720
organizzazione ‘organization’       0.0003113330

Table 4: The topmost similar words of consiglio relative to the decidere event

similarity chain                    s score
assemblea ‘assembly’                0.0060112700
azionista ‘stockholder’             0.0006839950
governo ‘government’                0.0005555560
operatore ‘operator’                0.0008785130
autorità ‘authority’                0.0007340040
persona ‘person’                    0.0006611840
amministratore ‘administrator’      0.0006184290
borsa ‘stock exchange’              0.0003935720
organizzazione ‘organization’       0.0003113330

Table 5: The topmost similar words of consiglio relative to the deliberare event

Take for example the case where consiglio is the object of convocare (‘convene’). The list of its topmost similar words is reported in Table 3. Note that all associations identified here are connected with the ‘council’ meaning, as all associations triggered by other senses and/or meaning facets of consiglio are effectively filtered out by the context. More interestingly, a context-oriented notion of word similarity is instrumental for exploring the multifarious meaning facets of the ‘council’ sense of consiglio. Tables 4 and 5 illustrate the similarity chains associated with consiglio in two different contexts: as the subject of decidere (‘decide’) and as the subject of deliberare (‘decree’). In spite of the close semantic relatedness of the two verbs, the resulting word associations are remarkably different. Table 4 groups decision-making entities, ranging from financial and governmental institutions to individuals. On the other hand, Table 5 lists assembly-like entities, where decision-making is the result of a collaborative and somewhat institutionalized process. A slight change in perspective prompts two considerably different lists of semantic associates.

Quite uncontroversially, the word consiglio can be regarded as a case of homonymy, as is also proven by its alternative translations in English. However, even for standard cases of polysemy, distributional representations like the ones obtained through CLASS are able to give rise to context-sensitive similarity spaces. Table 6 reports the topmost similar words to scuola ‘school’ with respect to three different compositional contexts: subject of iniziare ‘begin’, direct object of costruire ‘build’, and subject of decidere ‘decide’.
scuola ‘school’

iniziare ‘begin’, subj           costruire ‘build’, obj        decidere ‘decide’, subj
corso ‘course’       0.00366     sistema ‘system’   0.00294    comitato ‘committee’       0.00271
riunione ‘meeting’   0.00198     casa ‘house’       0.00169    commissione ‘commission’   0.00165
stagione ‘season’    0.00168     strada ‘road’      0.00008    operatore ‘operator’       0.00026
lavoro ‘job’         0.00133                                   persona ‘person’           0.00026

Table 6: Similarity spaces of scuola ‘school’ in the context of three different predicates

These verbal contexts express three markedly different event types: iniziare ‘begin’ refers to a specific stage in the temporal unfolding of an event or situation; costruire ‘build’ is a creation verb, related to the physical constitution of an entity; and finally, decidere ‘decide’ refers to a particular volitional act of a rational agent. Such events capture very orthogonal semantic dimensions of scuola ‘school’, which are quite explicitly reflected in the similarity spaces in Table 6. CLASS is therefore able to simulate the situation depicted in Table 1. When scuola ‘school’ combines with the verb iniziare ‘begin’ as its subject, the predicate triggers a similarity space populated with entities whose temporal dimension is highly prominent (e.g. stagione ‘season’ or riunione ‘meeting’); conversely, when the same noun appears as the object of costruire ‘build’, it is the physical dimension that comes into focus, so that the similarity space shifts towards nouns in which this semantic aspect is particularly salient, as in the case of casa ‘house’ or strada ‘road’. Finally, its composition with a verb like decidere ‘decide’ in the role of subject moves scuola ‘school’ towards similarity associations with human or human-like entities (cf. the collective noun commissione ‘commission’).

The essential fact to note is that in all cases the semantic representation associated with scuola ‘school’ is always the same, i.e. it is the same n-dimensional event space built by CLASS on the basis of a certain corpus-driven EB. The shifts in the similarity spaces highlight the multifaceted nature of the semantic content of a noun like scuola ‘school’, which is constitutive of its polysemous behavior. However, polysemy is now generated “on line” by the very same compositional process that determines the combination of this noun with different types of predicates.
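The context-sensitive similarity chains of Tables 3-6 can be approximated, in a very simplified form, by a count-based analogue of CLASS: represent each noun as a frequency distribution over (verb, grammatical relation) dimensions and, given a context, rank only the nouns attested in that context by their distributional similarity to the target. CLASS itself computes entropic similarity scores; the profiles, counts and similar_in_context function below are invented to convey the idea, not to reproduce the method.

```python
import math
from collections import Counter

# Noun profiles over (verb, grammatical-relation) dimensions, in the
# spirit of the CLASS Example Base; the counts are invented.
profiles = {
    "scuola":   Counter({("iniziare", "subj"): 30, ("costruire", "obj"): 12,
                          ("decidere", "subj"): 8}),
    "stagione": Counter({("iniziare", "subj"): 50}),
    "casa":     Counter({("costruire", "obj"): 60}),
    "comitato": Counter({("decidere", "subj"): 40}),
}

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dims = set(u) | set(v)
    dot = sum(u[k] * v[k] for k in dims)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def similar_in_context(target, context, k=3):
    """Rank the nouns attested in the given (verb, relation) context by
    their global distributional similarity to `target`: a crude analogue
    of the context-sensitive similarity chains in Tables 3-6."""
    candidates = [n for n, prof in profiles.items()
                  if n != target and prof[context] > 0]
    ranked = sorted(candidates,
                    key=lambda n: cosine(profiles[target], profiles[n]),
                    reverse=True)
    return ranked[:k]

# The context selects the candidate space, so the same noun gets
# different similarity chains in different compositional environments.
print(similar_in_context("scuola", ("iniziare", "subj")))   # ['stagione']
print(similar_in_context("scuola", ("costruire", "obj")))   # ['casa']
```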
6 Conclusions
Sentence-level creativity and word-level semantic creativity can actually be reconciled once we adopt a more sophisticated model of the lexicon. This type of solution is strongly pursued within GL theory, in which enriching lexical representations with theory-constrained aspects of contextual knowledge is assumed to be the essential condition for turning compositional processes into real generative devices of meaning multiplicity. Although the idea of overcoming context-free lexical representations is surely a step forward with respect to more conservative views of word meaning, its actual implementation in GL still leaves many open issues. Representational devices like the qualia structure are important proposals for a more sophisticated architecture of the lexicon, but they need to be further explored and constrained before they can reach fully satisfactory explanatory power.
The same holds true of the battery of extended semantic operations that are needed in GL to generate sense creativity in context. Promising theoretical perspectives are offered by distributional models of lexical representations. Notwithstanding their many differences from GL, distributional models share with it some core assumptions. In particular, they push the very idea of context-sensitive lexical representations to its extreme consequences. In distributional models, the relationship between lexical items and their contexts of use is somehow inverted, with the latter directly entering into the constitution of the former. According to this approach, it is not only sense extensions that are generated in context, but lexical representations themselves, which emerge out of the distribution of contexts of use. Assuming distributional semantic representations allows us to take into account the effects of polysemy in compositional processes, such as the fact that polysemous items are associated with multiple similarity spaces whose shape and content can dramatically vary depending on the linguistic context of composition.

Distributional models have become very popular nowadays, especially in natural language processing, given the possibilities they offer to bootstrap semantic representations directly from corpus data. However, they are not without problems. Lenci et al. (2005) provide a critical discussion of various types of distributional models of meaning, which in many cases still have to prove their effective capacity to provide in-depth analyses of word meanings. Yet, notwithstanding their limits, these models represent an important probe for exploring lexical dynamics. In particular, distributional models seem to suggest that the possibility for language to tolerate compositional processes side by side with sense creation processes is rooted in the fact that the boundaries between lexicon and context are much smoother and more articulated than is often assumed. Distributional models of the lexicon are able to simulate the dynamics occurring between lexicon and context, and in this way they can provide important insights into the interplay between lexical variation and sentence creativity, allowing us to further explore the boundaries of compositionality in natural language.
7 References

Allegrini, P., Montemagni, S. and Pirrelli, V. (2000), “Learning word clusters from data types”, in Proceedings of Coling 2000, Saarbrücken: 8-14.

Allegrini, P., Montemagni, S. and Pirrelli, V. (2003), “Example-based automatic induction of semantic classes through entropic scores”, Linguistica Computazionale, XVI-XVII.

Asher, N. and Pustejovsky, J. (2000), “The metaphysics of words in context”, ms., Brandeis University.

Barsalou, L. W. (1982), “Context-independent and context-dependent information in concepts”, Memory and Cognition, X: 82-93.

Blutner, R. (2004), “Pragmatics and the lexicon”, in Horn, L. R. and Ward, G. (eds.), Handbook of Pragmatics, Oxford, Blackwell.

Fodor, J. A. and Pylyshyn, Z. W. (1988), “Connectionism and cognitive architecture: a critical analysis”, Cognition, XXVIII: 3-71.
Frege, G. (1923), “Logische Untersuchungen. Dritter Teil: Gedankengefüge”, Beiträge zur Philosophie des Deutschen Idealismus, vol. III: 36-51. Translated as “Compound Thoughts”, in Logical Investigations, Oxford, Blackwell, 1977: 55-78.

Gärdenfors, P. (2000), Conceptual Spaces, Cambridge, MA, MIT Press.

Gruber, T. R. (1995), “Toward principles for the design of ontologies used for knowledge sharing”, International Journal of Human and Computer Studies, XLIII: 907-928.

Harris, Z. S. (1968), Mathematical Structures of Language, New York, Wiley.

Janssen, T. M. V. (1997), “Compositionality”, in van Benthem, J. and ter Meulen, A. (eds.), Handbook of Logic and Language, Amsterdam, Elsevier Science: 417-473.

Jayez, J. (2001), “Underspecification, context selection, and generativity”, in Bouillon, P. and Busa, F. (eds.), The Language of Word Meaning, Cambridge, Cambridge University Press: 124-148.

Landauer, T. K. and Dumais, S. T. (1997), “A solution to Plato's problem: the Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge”, Psychological Review, CIV: 211-240.

Lund, K. and Burgess, C. (1996), “Producing high-dimensional semantic spaces from lexical co-occurrence”, Behavior Research Methods, Instruments, and Computers, XXVIII: 203-208.

Manning, C. D. and Schütze, H. (1999), Foundations of Statistical Natural Language Processing, Cambridge, MA, MIT Press.

McDonald, S. and Ramscar, M. J. A. (2001), “Testing the distributional hypothesis: the influence of context on judgements of semantic similarity”, in Proceedings of the 23rd Annual Conference of the Cognitive Science Society.

Medin, D. L., Goldstone, R. L. and Gentner, D. (1993), “Respects for similarity”, Psychological Review, C: 254-278.

Miller, G. A. and Charles, W. G. (1991), “Contextual correlates of semantic similarity”, Language and Cognitive Processes, VI: 1-28.

Pustejovsky, J. (1995), The Generative Lexicon, Cambridge, MA, MIT Press.

Pustejovsky, J. (1998), “Generativity and explanation in semantics: a reply to Fodor and Lepore”, Linguistic Inquiry, XXIX: 289-311.

Pustejovsky, J. (2001), “Type construction and the logic of concepts”, in Bouillon, P. and Busa, F. (eds.), The Language of Word Meaning, Cambridge, Cambridge University Press: 91-123.

Ramscar, M. J. A. and Yarlett, D. G. (2002), “Semantic grounding in analogical processing: an environmental approach”, Cognitive Science, XVII: 41-71.

Saint-Dizier, P. and Viegas, E. (1995), “An introduction to lexical semantics from a linguistic and a psycholinguistic perspective”, in Saint-Dizier, P. and Viegas, E. (eds.), Computational Lexical Semantics, Cambridge, Cambridge University Press: 1-29.

Townsend, D. J. and Bever, T. G. (2001), Sentence Comprehension: The Integration of Habits and Rules, Cambridge, MA, MIT Press.

Vossen, P. (2003), “Ontologies”, in Mitkov, R. (ed.), The Oxford Handbook of Computational Linguistics, Oxford, Oxford University Press.