Extracting Semantic Information from Corpora Using Dependency Relations Sebastian Padó
Master of Science School of Cognitive Science Division of Informatics University of Edinburgh 2002
Abstract

Semantic space models are a representation formalism for lexical semantics which represents words as vectors in a high-dimensional space. The distances between word vectors can be interpreted as a measure of the semantic similarity of the words. Since semantic space models are constructed automatically from corpora, the semantic representations of words they provide are empirically grounded. This has made them popular in psychology for modelling behavioural data, for example priming, and in Information Retrieval, where richer word representations can be helpful. However, the "bag of words"-style word co-occurrence statistics that are traditionally employed for the construction of semantic space models are deficient from the standpoint of theoretical linguistics. This thesis examines the resulting shortcomings of semantic space models and proposes to replace the representation of words in terms of their word context by a representation of words in terms of their syntactic context. The new, highly parametrisable construction framework, which is based on dependency grammar, is implemented in the form of the DEPENDENCYVECTORS system.

Models produced with the DEPENDENCYVECTORS system are evaluated against traditional models in three tasks. The first task, which tests the new models' cognitive adequacy, shows that DEPENDENCYVECTORS models capture direct, but not mediated, priming; the best existing models have been reported to capture both phenomena. The second task, which investigates the encoding of different lexical relations, shows that DEPENDENCYVECTORS models can capture differences between lexical relations, which traditional models cannot. The last task, the TOEFL synonymy task, shows that the performance of DEPENDENCYVECTORS models in synonymy identification is roughly on par with that of the best existing traditional models.
In summary, DEPENDENCYVECTORS models cannot yet model all the behavioural data which traditional models can, but this may become possible once the right set of parameters has been identified. The main insight, however, is that the ability to distinguish between different lexical relations makes DEPENDENCYVECTORS models applicable to linguistic tasks which require access to structured knowledge – for example, query extension or synonymy identification.
Acknowledgements

Firstly, I would especially like to thank Marco and Matthijs for their friendship and support, and all the other MScs for the memorable days at Buccleuch Place. On the professional side, thanks go to Scott McDonald and Thomas Landauer for kindly providing me with the data and tools I required. My supervisor Mirella Lapata is definitely due special acknowledgement for answering all my questions, giving me feedback on my ideas and generally making me think harder. I am further indebted to the DAAD (Deutscher Akademischer Austauschdienst) and the Studienstiftung des Deutschen Volkes for their financial support. Finally, thanks to the Edinburgh weather for not distracting me from my work. Ulrike – you know what you mean to me.
Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.
(Sebastian Padó)
Contents
1 Introduction
    1.1 The Meaning of Meaning
    1.2 The Distributional Hypothesis
    1.3 Semantic Space Models
    1.4 Aim and Structure of this Thesis

2 Previous Work on Semantic Space Models
    2.1 The Origins of Semantic Space Models
    2.2 Extraction: Co-occurrence Statistics
        2.2.1 Document space models
        2.2.2 Word space models
    2.3 Representation: Vector Spaces
    2.4 Parametrisation
        2.4.1 Extraction parameters
        2.4.2 Representation parameters
    2.5 Use of Semantic Space Models
        2.5.1 Psychological modelling
        2.5.2 Linguistic tasks

3 Beyond Co-occurrence
    3.1 Linguistic Criticism of Semantic Space Models
        3.1.1 Problems of co-occurrence statistics
        3.1.2 Problems of model interpretation
    3.2 Previous Work on Improvements
        3.2.1 Dependency grammar
        3.2.2 Further strategy of this thesis
    3.3 Vector Extraction with Dependency Grammar: A Formal Framework
        3.3.1 Input representation
        3.3.2 From parser output to informative dependencies
        3.3.3 Semantic space definitions
        3.3.4 Context evaluation

4 Realisation
    4.1 Dependency Parsing: MINIPAR
    4.2 Implementation: The DEPENDENCYVECTORS System
        4.2.1 Architecture
        4.2.2 Extraction of basis elements
        4.2.3 Construction of the semantic space
        4.2.4 Parameter representation
        4.2.5 Parameter values
        4.2.6 A necessary optimisation
    4.3 Evaluation: Distance Computation

5 Evaluation
    5.1 Evaluation Scenario
    5.2 Parameter Exploration
        5.2.1 Parameters for traditional models
        5.2.2 Parameters for DEPENDENCYVECTORS models
    5.3 Experiment 1: Mediated Priming
        5.3.1 Method
        5.3.2 Results
        5.3.3 Dimensionality
        5.3.4 Discussion
    5.4 Experiment 2: Encoding of Relations
        5.4.1 Method
        5.4.2 Results
        5.4.3 Dimensionality
        5.4.4 Discussion
    5.5 Experiment 3: The TOEFL Synonymy Task
        5.5.1 Method
        5.5.2 Results
        5.5.3 Dimensionality
        5.5.4 Discussion

6 Conclusions
    6.1 Contributions and Limitations
    6.2 Cognitive Adequacy of Dependency Models
    6.3 Linguistic Applications of Dependency Models

A Users' Guide to DEPENDENCYVECTORS
B Experimental Materials
C Parameter Settings
D Experimental Results

Bibliography
Chapter 1

Introduction

SUMMARY. In the first Chapter, I introduce two competing theories from philosophy of language which result in opposite theories of lexical semantics. I explain how one of them, the usage-based theory, provides the background to build relational models of word meaning, so-called semantic space models. I describe the aim of this thesis, which is to improve semantic spaces by incorporating syntactic information in the construction process.
1.1 The Meaning of Meaning

Lexical semantics, broadly speaking, is the study of the meaning of words or, more generally, of lexical items. Meaning, as a prerequisite for reasoning and communication, has been the subject of investigation of a whole range of sciences, including linguistics, psychology and philosophy. Yet, as a result of the diverging methodologies employed by different disciplines, several different approaches to lexical meaning have developed. One of the apparent problems of lexical semantics is that the meaning of lexical items is not easy to isolate. In communication, words are typically used within a sentential, conversational and situational context. Empirical science, which relies on well-controlled experimental conditions in which only a small number of factors are allowed to vary, must seek to break down this complexity, for it cannot hope to grasp
this scenario in its full complexity. The necessary simplifications, however, are very much a matter of the theoretical framework one endorses, which is mainly provided by the philosophy of language. In that domain, two main schools of thought can be discerned whose general ideas about the nature of language have an impact on their views on lexical semantics.

The first school takes a logic-based approach. Analogous to the work of cognitive philosophers on the apparent productivity of thought (e.g. the "language of thought" hypothesis by Fodor (1975, 1987)), this tradition tried to capture the regularity of language by phrasing semantics in the framework of model theory (Tarski 1954). Model-theoretic semantics in this context means that the meaning of a sentence is embodied in its truth conditions: To know the meaning of a sentence is to know what state the world must be in for the sentence to be true. The central mechanism that is used to derive the truth conditions is the notion of compositionality, which dates back to the 19th century (Frege 1892). In modern terms, it can be stated as follows: The meaning of a complex expression is a function of the meaning of its parts and the way they are syntactically combined.

The compositionality principle has important consequences for lexical semantics. In order to provide basic terms for the composition process, the meaning of words must be stated in logical terms. This view is exemplified by the works of Montague (Montague 1974), which after 25 years still form the backbone of most semantic theories in computational linguistics.

This logic-based approach was not universally endorsed. On the contrary, it has been subject to a number of criticisms, of which I want to emphasize two important ones:

Compositionality – Is the meaning of a sentence really always derivable in the compositional manner outlined above?
Strict compositionality requires that the meaning of an expression be completely determined by its syntactic structure and the meaning of its atomic parts. The abundance of non-literal usage of expressions in language, and the often decisive influence of context on meaning, which cannot be captured by compositionality, seem to suggest that purely syntax-driven analysis is a convention rather than an absolute principle.
Logic as Lexical Representation Language – Basically the same criticism applies at the level of lexical representations. A logic-based representation cannot provide a rich account of lexical knowledge. In this view, the meaning of a word has only two components: its semantic type, which determines with which other words it can be combined, and its denotation in a model, which contributes to the truth value of expressions in which the word participates. Is this static definition of a word's meaning not too rigid to be appropriate? Ubiquitous phenomena like metonymy and vagueness seem to indicate that word meaning is dynamic and determined by use.

Put differently, most critics doubted that sentence meaning can be produced by a process that is insensitive to the meaning of words, and that word meaning is characterisable without reference to the context of usage. The latter observation forms the basis of the opposing school of usage-based semantics. Perhaps the first metaphor for this idea was put forward by Wittgenstein in his famous language game (Wittgenstein 1953). Using the example of the word game, he pointed out that all attempts to define its meaning are doomed: One cannot hope to define the meaning of a word in terms of features that all objects described by this word must have, because for every such property that comes to mind (like the number of players, the existence of a winner, etc.), there will be a game which does not exhibit it. Coming back to the empirical problem of meaning identification stated above, Wittgenstein supported the view that the complexity of language use is not a by-product, but rather a central characteristic of the way information is transmitted through language. Words themselves do not, and cannot, have meaning by themselves. Instead, in every situation in which words are used, their meaning is negotiated anew between speaker and hearer.
Of course, this negotiation must be grounded in some common knowledge: understanding is only possible if the hearer is able to infer the speaker's intention in using a particular word. In this framework, knowing the meaning of a word is to know the contexts in which it can be successfully used.

An interesting linguistic side-note was only recently brought up by Karlgren and Sahlgren (2001). In their opinion, model-theoretic (or, in their terminology, nomenclaturistic) approaches inject meaning into language from outside and thus have to abandon all hope that semantic structure can ever be gathered from corpora the way syntactic structure can; this possibility, on the other hand, falls naturally out of the usage-based paradigm.
1.2 The Distributional Hypothesis

The rest of this thesis will assume a usage-based framework as its philosophical underpinning. But what consequences does this philosophical insight have in linguistic terms? This question was answered by Harris, the first to formulate the so-called distributional hypothesis. Assuming language to be a linear sequence of semantically arbitrary elements, Harris (1968, p. 12) stated:

The meaning of entities [...] is related to the restriction of combinations of these entities relative to other entities.

This is to say that combinational restrictions can be viewed as semantic constraints that govern the distribution of entities in language. It is by virtue of distributional similarity that entities have similar meaning. The merit of this view is that it grounds the philosophical notion of use in the linguistic notion of context: If this is true, then the semantic similarity between words can be determined by measuring the words' contextual similarity. This is certainly an attractive notion, making the usage-based paradigm accessible to empirical evaluation.

Harris himself did not empirically test the validity of the distributional hypothesis. It was more than twenty years later that Miller and Charles (1991) finally conducted a study to verify the hypothesised correlation between semantic and contextual similarity. In a preliminary experiment, they asked subjects to provide semantic similarity ratings for pairs of nouns and established that intuitions about semantic similarity agreed across subjects. The actual experiment used a sentential sorting paradigm: For every pair of nouns, they extracted sentential contexts for both nouns from a corpus. After removing the noun, the subjects' task was to complete the context with what they thought was the original word (see Figure 1.1).
The result was that word pairs which were judged more similar occurred in contexts that were harder to distinguish: there was indeed a linear correlation between semantic and contextual similarity.

[Figure 1.1: Sentential contexts for the pair car – automobile from Miller and Charles (1991): "I found a parking place half a block away, sat in the ___ and waited." / "The bombs are as harmless as an ___ in a garage."]

However, Miller and Charles also found one outlier, the pair slave – monk, which reliably elicited significantly more similar contexts than it should have. As all other items were from the domain of non-living objects, they speculated (p. 22) that

[...] the relation between semantic similarity and contextual similarity is different in different semantic fields.
This result is evidently incompatible with the idea of a strict correlation between contextual and semantic similarity, which they called the strong distributional hypothesis. Instead, Miller and Charles proposed the weak contextual hypothesis (p. 9):

The similarity of the contextual representations of two words contributes to the semantic similarity of those words.
Miller and Charles concluded that contextual similarity was the dominating factor in determining semantic similarity, but that the influence of other factors cannot be ignored. This seems intuitively plausible – such factors include, among others, register level, situational context and world knowledge. Still, the correlation between contextual and semantic similarity seems to be strong enough to provide a handle on operationalising word meaning: Word similarity can be measured by the empirical process of context comparison. This result was recently re-confirmed (McDonald and Ramscar 2001). (Miller and Charles do not quantify the correlation because of the outlier mentioned above.)
1.3 Semantic Space Models

It was exactly this idea of measuring semantic similarity via contextual similarity that was taken up at the end of the 1980s by psychologists and led to the development of semantic space models. Semantic space models are an operationalisation of the distributional hypothesis. They realise a simple idea: words are represented by their contexts, which are obtained from large corpora. Words with similar contextual representations are also semantically similar. Two aspects of semantic space models are especially worth considering:
Extraction: The construction of semantic space models is traditionally based on counts extracted from a corpus by co-occurrence statistics: The representation of a target word consists of the frequencies with which other words (the so-called context words) occur in its neighbourhood. In practice, this is usually done by means of a window of a certain size that is passed over the corpus, increasing the counts for the context words around the target word in the window.
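The window-based counting just described can be sketched in a few lines of Python. This is a toy illustration under assumed inputs (the corpus, window size and example words are invented), not the extraction code used in this thesis:

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window_size=2):
    """For each target word, count how often every context word
    appears within `window_size` positions of it."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, target in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # the target itself is not its own context
                counts[target][tokens[j]] += 1
    return counts

# Invented toy corpus
corpus = "the astronaut boarded the shuttle before the shuttle left".split()
counts = cooccurrence_counts(corpus, window_size=2)
print(dict(counts["shuttle"]))
```

Passing a symmetric two-word window over the toy corpus, the two occurrences of shuttle together contribute three co-occurrences with the, one with boarded, one with before, and one with left.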
Representation: Given this kind of construction, the most natural format to represent the co-occurrence information of a target word is a vector, and the matrix of the representations of all target words forms a high-dimensional vector space whose dimensions are the context words. The distance between vectors can be interpreted as measuring the contextual similarity or dissimilarity between the words represented by the vectors. Figure 1.2 contains a simplified example, showing clusters of words grouping around space that correspond to the different usages of that word.

Semantic space models capture a wide range of psychological data, such as semantic priming, and have also been applied to tasks from computational linguistics, like word sense disambiguation (see Chapter 2 for details). By virtue of being able to

(This section is not intended as an exhaustive discussion of semantic space models, but as an introduction to the main concepts of the paradigm. Also, the intuition for semantic space models I offer here covers primarily the word space family, as opposed to the document space family. The next chapter contains an exhaustive discussion of semantic space models; my reasons for preferring word space models in this thesis are discussed in Section 2.2.)
[Figure 1.2: A simplified two-dimensional example of a semantic space, adapted from the Projekt Deutscher Wortschatz website (2002). Words related to different usages of space – astronaut, shuttle and NASA; storage, address and disk; leased, feet and square – form separate clusters around it.]
model these data, they give additional support to the distributional hypothesis.

Word representations in semantic spaces deviate in a number of points from traditional logic-based representations of word meaning (Karlgren and Sahlgren 2001; Sahlgren 2002):

Relationality: Word meanings are only defined in relation to other word meanings: In semantic spaces, absolute coordinates have no meaning; only distances between points do. In contrast, logic-based representations specify the meaning of every word separately and have to make use of additional mechanisms like meaning postulates to specify relationships between words.

Broad coverage: Vector representations for words can be produced automatically on a large scale, because unsupervised model construction from raw corpora is possible. As large amounts of raw text data are available, this is a very cheap process.

Adaptivity: Meanings are not fixed, but vary according to the corpus. This allows the representation of domain-specific usage.
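The distance interpretation of word vectors is commonly operationalised as the cosine of the angle between them. The following sketch uses invented three-dimensional toy vectors (real spaces have hundreds or thousands of dimensions, and the particular distance measure is a model parameter):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two co-occurrence vectors:
    near 1.0 for similar directions, near 0.0 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented toy counts over the context words (NASA, disk, garage)
space_vec = [10, 1, 0]   # "space" as used in astronautics texts
shuttle   = [8, 0, 1]
disk      = [0, 9, 2]

print(cosine_similarity(space_vec, shuttle))  # high: similar contexts
print(cosine_similarity(space_vec, disk))     # low: dissimilar contexts
```

Note that only the relative positions of the vectors matter, which is exactly the relationality property described above: the absolute coordinates carry no meaning by themselves.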
1.4 Aim and Structure of this Thesis

Even though semantic space models seem to have many interesting linguistic features, the focus of their use has always been in cognitive science and information retrieval,
but not so much in linguistics itself. Most linguists have regarded semantic space models with caution. One reason is the general incompatibility of the usage-based framework with mainstream contemporary linguistics. Another reason is that relational theories of word meaning are only of limited interest to most linguists. The central source of uneasiness, however, lies in the construction methodology: the central simplifying assumption of co-occurrence statistics is that context can be satisfactorily characterised as a bag of words with no structure apart from linear order. This runs counter to linguistic intuition because it means that hardly any linguistic information apart from frequencies goes into the construction of a space model.

My aim in this thesis is to examine the effect of enriching semantic space models with linguistic knowledge. More concretely, I will propose to re-define context in terms of syntactic structure, while keeping the representational format of semantic spaces. These enriched models will then be analysed, and their performance will be compared to traditional models.

Chapter 2 describes traditional semantic space models and their merits and shortcomings in more detail. Chapter 3 discusses why their construction is linguistically unattractive and presents the formalisation of an alternative framework based on dependency grammar. Chapter 4 presents the DEPENDENCYVECTORS system I implemented based on this formalisation. Chapter 5 contains the experiments I conducted to analyse DEPENDENCYVECTORS spaces and compare their performance with traditional semantic space models. Finally, Chapter 6 offers some discussion and conclusions.
Chapter 2

Previous Work on Semantic Space Models

SUMMARY. This Chapter describes how semantic space models were introduced in psychology to answer the need for empirically grounded concept representations. The dichotomy introduced in the first Chapter between extraction and representation is discussed in Section 2.2 (co-occurrence statistics) and Section 2.3 (semantic space formalisation). Parametrisation is noted as an important issue in model construction. The last part of the Chapter contains a summary of the usage of semantic space models.
2.1 The Origins of Semantic Space Models

After behaviourism had been the dominant paradigm in psychology for about half a century, the 1950s and 1960s saw an increasing movement towards a new paradigm. More and more psychologists dismissed the idea that the analysis of pure input-output mappings could reveal all interesting facts about human behaviour. Instead, they revived the examination of "interior" mental processes, which had been abandoned at the beginning of the century. This paradigm shift was called the cognitive revolution, and it meant renewed interest in formal theories that could account for the regularities and irregularities of cognitive processes. During the 1970s, a central task of the new field of cognitive psychology was the search for an account of human similarity judgements. The apparent complexity of experimental data defied all simple explanations, and it became clear that any sound theory of similarity ratings would have to be based on a suitable representation formalism for concepts.

Yet this turned out to be a rather difficult problem. From a theoretical perspective, a very attractive approach might have been to model concepts after their mental representations, since these would hopefully be intrinsically able to explain similarity ratings. But this strategy approaches the problem from the wrong direction: Up to the present day, there is neither philosophical agreement on the form of mental representations, nor has neuroscience yet been able to provide insights which can advance understanding of such high-level representational issues. Under these circumstances, it seems advisable to keep psychological models of concepts neutral with respect to their mental representation.

Instead, psychologists tried to tackle the problem by using a representation formalism whose computational properties were well-known, namely feature sets. This kind of representation naturally gives rise to feature-based theories of similarity like Tversky's account (Tversky 1977). Let a and b be the concepts under study, and A and B their respective sets of features. Then, according to Tversky, the similarity s between a and b must be a function of the intersection and difference sets of the two feature sets:

s(a, b) = F(A ∩ B, A − B, B − A)

When these models could not explain the full complexity of the experimental data, this was attributed to their mode of construction, which was based heavily on hand-coding of representations. In a next step, researchers set out to determine the feature sets based on lists of the most salient features collected from a large number of subjects.
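Tversky's schema can be made concrete by choosing a specific function F. The sketch below uses one common instantiation, the so-called ratio model, with invented toy feature sets; both the weights and the features are illustrative assumptions, not data from the studies cited here:

```python
def tversky_similarity(A, B, alpha=0.5, beta=0.5):
    """Ratio-model instance of Tversky's s(a, b) = F(A ∩ B, A − B, B − A):
    shared features raise similarity, distinctive features lower it."""
    common = len(A & B)   # |A ∩ B|
    a_only = len(A - B)   # |A − B|
    b_only = len(B - A)   # |B − A|
    return common / (common + alpha * a_only + beta * b_only)

# Hypothetical hand-coded feature sets
robin = {"has_wings", "flies", "lays_eggs", "small"}
penguin = {"has_wings", "swims", "lays_eggs", "large"}

print(tversky_similarity(robin, penguin))
```

With two shared and two distinctive features on each side, the toy pair comes out at similarity 0.5; asymmetric weights (alpha ≠ beta) make the measure direction-dependent, one of the properties Tversky used to explain asymmetries in human similarity judgements.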
When this still did not improve the situation, attention was drawn to what turned out to be the real problem, namely a tacit assumption of feature set accounts: Just as their counterpart in linguistics, denotational semantics, these theories presupposed that there exists one designated set of features which can explain all similarity ratings in all possible contexts. In reality, however, context and functional considerations play a vital role in the outcome of similarity judgements, as a detailed study of the use of concept representations emphasized (Murphy and Medin 1985). For example, in the context of pet, cats and fish might be judged more similar than in a more general context like animal. Analogously, two items may be judged similar if attention is directed to some function both items can perform, even though they are otherwise very different. For example, a screwdriver and a coin can both be used to tighten a screw. If context is important, it is not surprising that subjects were not able to characterise concepts completely when given a neutral context.

This characterisation of the similarity rating process called for usage-based concept representations, and it is not by accident that Murphy and Medin characterised the process as dynamic, the same term that Wittgenstein used to describe language use. When the apparent parallelism between language philosophy, modelling the use of words, and cognitive psychology, modelling the use of concepts, was observed, it became clear that the tools language philosophy used for empirical grounding, notably the distributional hypothesis, could achieve the same for concepts and produce empirically grounded concept representations. Psychologists could now assume that differences in the meaning of concepts would show up as distributional differences in corpora. This is exactly the approach taken by proponents of semantic space models at the end of the 1980s.
2.2 Extraction: Co-occurrence Statistics

The first step in the construction of a semantic space model is the extraction of context information from the corpus for every word that is to be represented in the space. For this task, traditional semantic space models use co-occurrence statistics. At the heart of this method lies the assumption that the context of a word can be treated as a bag of words, without having to pay attention to any – linguistically speaking – "deeper" structure like syntax or discourse.

There are probably two reasons why co-occurrence became the dominant methodology for the construction of semantic space models. Firstly, it constitutes a working approximation to linguistic knowledge while avoiding the disputes that must follow the adoption of a specific linguistic framework over all others. Secondly, co-occurrence
statistics are straightforward and cheap to compute, which makes large-scale automatic computer extraction possible. The range of semantic space models that have been constructed so far can be divided into two families, document space models and word space models. They differ in how they define the context of a given word.
2.2.1 Document space models

Historically, document space models are older than word space models. The first paper about semantic space models, Dumais, Furnas, Landauer, Deerwester, and Harshman (1988), pioneered the field by introducing Latent Semantic Analysis (LSA), which belongs to the document space family. Another study, and probably the best known one, is Landauer and Dumais (1997).

Document space models take advantage of the paragraph structure of text, constructing a words-by-documents matrix from corpora. In other words, the context is a block of text which is not further analysed, and a word is represented by the frequencies with which it occurs in the paragraphs of the corpus. If similar words tend to occur in similar environments, as the distributional hypothesis states, this should suffice to produce similar vectors for similar words.

However, in reality this level of analysis is problematic because the form of paragraphs is influenced by pragmatic considerations. One well-known example of a pragmatic effect is the one sense per discourse constraint (Gale, Church, and Yarowsky 1992). According to this constraint, polysemous words are typically used exclusively in one sense within one unit of discourse, usually a paragraph. Conversely, it is possible to argue that within one unit of discourse, concepts will be expressed preferentially with one word, because only in this way is the contrastive use of semantically similar words for related, but not identical, concepts possible. To the degree that this is true, semantically similar words will tend not to co-occur in the same paragraph, and simple frequency counting will assign them very different representations which do not adequately mirror their semantic similarity.

LSA and other models therefore typically use principal components analysis (PCA), a very common technique in machine learning for dimensionality reduction.
In mathematical terms, PCA calculates the set of eigenvectors and eigenvalues of a matrix. The eigenvectors can be understood as an alternative coordinate system, marking the axes of highest variance. The eigenvectors with the largest eigenvalues form a new, lower-dimensional coordinate system in which the original vectors can be re-represented while conserving the distances between them as well as possible. Apart from the dimensionality reduction, document space models use PCA because it is thought to reveal underlying correlations between words: the computation of the re-represented vectors involves not only information about a word's own occurrences across paragraphs, but also the distributional information of all other words. Of course, the dimensions of the new coordinate system no longer represent paragraphs.

Recent models by Sahlgren et al. (Karlgren and Sahlgren 2001; Sahlgren 2001) use random indexing to modify the document space extraction method. Instead of having matrix rows represent paragraphs, they represent each paragraph by a high-dimensional random index vector, adding the index vector to a target word's vector every time the target word occurs in the paragraph. This appears to produce the same results as simple counting plus PCA, but avoids constructing huge words-by-documents matrices.
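The reduction step can be sketched with a truncated singular value decomposition, the computational core of PCA-style reduction, applied to a toy words-by-documents matrix. The words and counts below are invented for illustration:

```python
import numpy as np

# Toy words-by-documents count matrix: rows are words, columns are paragraphs.
# All words and counts are invented for illustration.
M = np.array([
    [2, 3, 0, 0],   # "car"
    [1, 2, 0, 1],   # "automobile" (distribution similar to "car")
    [0, 0, 3, 2],   # "fruit"
], dtype=float)

# Truncated SVD keeps the k directions of highest variance and
# re-represents every word in a k-dimensional space.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
reduced = U[:, :k] * s[:k]   # word vectors in the reduced space

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# Words with similar distributions stay close after the reduction,
# words with disjoint distributions do not.
print(cosine(reduced[0], reduced[1]))   # car vs. automobile: high
print(cosine(reduced[0], reduced[2]))   # car vs. fruit: near zero
```

The dimensions of the reduced space no longer correspond to individual paragraphs, exactly as described above.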
2.2.2 Word space models

The first example of a word space model is Schütze (1993), who represented words by the frequency of four-grams in their context. However, at that time it was only thought of as a method for producing input representations for neural networks from corpora. Probably the best-known full-blown word space model, the Hyperspace Analogue to Language (HAL), followed three years later (Lund and Burgess 1996). Word space models avoid the two-step extraction process that document space models perform. They construct a words-by-words matrix which defines target words directly in terms of context words. To extract the counts for the matrix cells, a context window is passed over the corpus (see Figure 2.1). One fixed position in that window is occupied by the target word, depending on the shape of the context window. The other words within the window make up the target word's context, and the corresponding entries of the target word's vector are incremented. In this manner, the size of the context window determines how large the context of other words is that contributes to the representation of a target word.
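The counting procedure can be sketched as follows, assuming the asymmetric window of Figure 2.1 (three words to the left and one to the right of the target); the toy sentence is invented:

```python
from collections import defaultdict

def count_cooccurrences(tokens, left=3, right=1):
    """Build words-by-words counts from an asymmetric context window:
    `left` words before and `right` words after each target word."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, target in enumerate(tokens):
        window = tokens[max(0, i - left):i] + tokens[i + 1:i + 1 + right]
        for context_word in window:
            counts[target][context_word] += 1
    return counts

tokens = "the cat sat on the mat".split()
counts = count_cooccurrences(tokens)
print(dict(counts["on"]))   # {'the': 2, 'cat': 1, 'sat': 1}
```

Sliding the window over a full corpus in this way fills the words-by-words matrix one row at a time.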
Figure 2.1: An asymmetric co-occurrence window with a size of three words to the left and one word to the right of the target word.
In this thesis, I will be focusing on word space models. The two families being comparable in their actual performance (see Section 2.5), the most important criterion for my decision was the coarse-grained level of analysis of document space models mentioned above, namely paragraphs. This results in an inability to incorporate linguistic analyses, which are usually available only up to the sentence level. A second consideration was the superior transparency of word space models. Their dimensions correspond to context words, and changing a parameter has, for the most part, an obvious impact on the resulting model (see Section 3.1 for an example of a word space matrix). Document space models, on the other hand, generally rely on PCA and do not offer an intuitive reading, because their dimensions have only mathematical significance. Additionally, PCA has turned out not to be very robust: evaluation results for LSA showed that model performance relied strongly on the number of dimensions of the resulting space. While optimal performance was obtained for spaces with around 300 dimensions, choosing other dimensionalities could result in performance at chance level.
2.3 Representation: Vector Spaces

The interpretation of the matrices resulting from co-occurrence statistics as a vector space has so far only been implicit. Indeed, the vectors which represent the target words and which form the matrices' columns can be thought of as spanning a vector space. The advantage of a vector-based interpretation is that it suggests a number of parameters for the representational level, which have recently been formalised (Lowe 2001). This formalisation of the vector space is independent of the extraction process, though. Lowe formalises a Semantic Space as a four-tuple ⟨A, B, S, M⟩, where A is a lexical association function, which can be used to control for frequency effects. B is the set of basis elements, containing the dimensions of the semantic space; in word space models it is equivalent to the set of context words. S is the similarity measure, which maps every pair of vectors onto the distance between them, and finally M is a mapping to a lower-dimensional space. The use of PCA in document space models can be modelled with M. Since I will concentrate on word space, where any dimensionality reduction can be performed in advance by choosing an appropriate B, I will ignore M for the rest of this thesis.
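As an illustration, the four components of the tuple can be sketched as plain functions. The basis words and frequency counts are invented; A is the identity association, S the cosine, and M the identity mapping:

```python
import math

# Lowe's four-tuple <A, B, S, M>, sketched as plain functions.
# B: basis elements (context words); A: lexical association (identity here);
# S: similarity measure (cosine); M: mapping to a lower-dimensional space
# (the identity, as in word space). All words and counts are invented.

B = ["run", "bark", "purr"]

def A(freqs):
    """Identity association: raw co-occurrence frequencies over the basis."""
    return [freqs.get(b, 0) for b in B]

def S(x, y):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def M(v):
    """No dimensionality reduction in word space."""
    return v

dog = M(A({"run": 4, "bark": 6}))
cat = M(A({"run": 3, "purr": 5}))
print(S(dog, cat))   # similarity driven by the shared "run" dimension
```

Changing any one component, for instance swapping the identity association for a frequency-corrected one, leaves the other three untouched, which is the point of the formalisation.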
2.4 Parametrisation

Semantic space models are parametrised, and their optimisation typically involves extensive parameter tuning, which is difficult in several respects. It is costly, because most parameter changes make a complete re-run of the extraction process necessary. Additionally, many parameters interact. This results in a huge search space for optimisation, because it is not possible to optimise single parameters independently.
2.4.1 Extraction parameters

In the context of word space models, extraction parameters determine the size and shape of the context window. Levy and Bullinaria (2001) have investigated this part of the parameter space by studying the impact of tuning these parameters on two performance figures. The first was a measure of reliability, namely the mean distance between all vectors divided by the distance between related words. The second was the performance of the model in a classification task. The outcome was that both variables showed optimal values for a symmetrical window with a size between 5 and 10. Patel, Bullinaria, and Levy (1998) were also able to show that vector reliability increases monotonically with corpus size. While a recent study (McDonald 2000) verified the latter result, it found that an asymmetrical window, taking only the context to the left into account, resulted in slightly better performance.
2.4.2 Representation parameters

The most important parameter for the representational level of semantic spaces is probably the choice of B, the set of basis elements or dimensions of the space. How to choose these elements, and how many to choose, was the subject of (among other studies) Patel, Bullinaria, and Levy (1998) and McDonald (2000). There seems to be general agreement that basis elements can be chosen by virtue of their frequency. Lowe (2001) points out that the real reason for good performance is high variance, because it equals high informativity, but that frequency is strongly correlated with variance and is therefore a sufficient indicator. However, opinions diverge on two points. Firstly, from which set basis elements should be picked: Patel, Bullinaria, and Levy (1998) obtained better results for the set of all context words, while McDonald (2000) found the set of content words more advantageous. And secondly, whether there is an optimal dimensionality: Patel, Bullinaria, and Levy (1998) report that ever higher dimensionalities lead to better performance, while McDonald (2000) argues that performance reaches its maximum at around 500 dimensions.

The next parameter, the lexical association function A, determines how the matrix cell K_ij is calculated from the observed frequency of the co-occurrence of the target word t_i and the context word b_j, in symbols f̂(t_i, b_j). The simplest function is just the identity: K_ij = f̂(t_i, b_j). However, frequency effects may make more sophisticated lexical association functions necessary. In short, the problem is that due to the distributional properties of language, words occurring with similar frequencies will be judged more similar than they actually are (for details see Lowe (2001)). To counteract this frequency bias, the log-likelihood function (Dunning 1993) and the log-odds ratio (Lowe and McDonald 2000) have been used.
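Dunning's log-likelihood statistic can be sketched as follows for a single target-context pair, computed from the usual 2x2 contingency table of their occurrences; the counts in the example are invented:

```python
import math

def log_likelihood(f_tc, f_t, f_c, n):
    """Dunning-style log-likelihood ratio G2 for a target/context pair,
    computed from the 2x2 contingency table of their (co-)occurrences.
    f_tc: joint count, f_t and f_c: marginal counts, n: corpus size."""
    observed = [f_tc, f_t - f_tc, f_c - f_tc, n - f_t - f_c + f_tc]
    expected = [f_t * f_c / n, f_t * (n - f_c) / n,
                (n - f_t) * f_c / n, (n - f_t) * (n - f_c) / n]
    # Terms with an observed count of zero contribute nothing.
    return 2 * sum(o * math.log(o / e)
                   for o, e in zip(observed, expected) if o > 0)

# A pair co-occurring far more often than chance scores high; a pair
# co-occurring exactly as often as chance predicts scores zero.
print(log_likelihood(50, 100, 100, 10000))   # strongly associated
print(log_likelihood(1, 100, 100, 10000))    # 0.0: exactly chance level
```

Used as a lexical association function, such a score replaces the raw frequency in each matrix cell, damping the frequency bias described above.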
In experiments (McDonald 2000), the application of these functions improved the coefficient of determination (the amount of human judgement variance accounted for by modelled semantic distance) from r² = 0.2 to r² = 0.3.

The last parameter is the choice of the similarity measure S. Most researchers prefer the term distance measure, but distance and similarity can simply be thought of as inverse to one another, i.e. s = 1/d. The most widely used measures found in the literature are listed in Figure 2.2, together with their value for a comparison of identical vectors and their value range.

Name              Measure                                                            Identity  Range
L1                L1(x, y) = Σ_i |x_i − y_i|                                         0         [0; +∞[
Euclidean         euc(x, y) = √(Σ_i (x_i − y_i)²)                                    0         [0; +∞[
Cosine            cos θ = (Σ_i x_i y_i) / (√(Σ_i x_i²) √(Σ_i y_i²))                  1         [0; 1]
Jaccard           jac(x, y) = |{i : x_i > 0 ∧ y_i > 0}| / |{i : x_i > 0 ∨ y_i > 0}|  1         [0; 1]
KL divergence     D(x‖y) = Σ_i x_i log(x_i / y_i)                                    0         ]−∞; +∞[
Skew divergence   s_α(x, y) = Σ_i x_i log(x_i / (α x_i + (1 − α) y_i))               0         ]−∞; +∞[

Figure 2.2: Different distance measures, their value for a comparison of identical vectors and their range.

Historically, the first distance measures were geometric in nature, reflecting the interpretation of the vectors as points in a space. The first two measures in the figure are members of the Minkowski family of distance metrics (Schiffman, Reynolds, and Young 1981), namely the L1 or city block distance and the Euclidean distance. The Euclidean distance is the familiar geometric distance between two points in space. The L1 distance computes the distance under the assumption that movement is only possible parallel to the dimensions. An obvious problem with these distances is that different words occur with different frequencies. Even if two words appear in virtually identical context distributions, their vectors will have different lengths and will be judged more dissimilar than they actually are. The cosine distance tries to compensate for this bias by normalising the distance by the lengths of the vectors. In other words, it only compares the directions of the vectors. The cosine distance is probably the most widely used distance measure both for word and document space models (Landauer and Dumais 1997; McDonald 2000).

Distance measures are not limited to the geometrical domain. In recent years, measures from information theory have received more attention. The generalised Jaccard distance, for example, treats the vectors as feature structures and determines their overlap (Grefenstette 1994; Curran and Moens 2002). Still another perspective is introduced by the Kullback-Leibler (KL) divergence or relative entropy, which determines the distance between two probability mass functions, into which vectors can be transformed by normalising their length to 1. The KL divergence has the problem of being undefined if any element of the second vector is zero, because log(x_i/0) is undefined. A recent evaluation of different distance functions (Lee 1999; Lee 2001) suggested an improved version of the Kullback-Leibler divergence, the skew divergence, which is defined for arbitrary vectors. The skew divergence yielded better results than traditional distance measures for another linguistic task, namely estimating the probability of unseen n-grams.
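The information-theoretic measures, and the way the skew divergence repairs the zero problem, can be sketched as follows (the example vectors are invented; α = 0.99 matches the setting used for the skew divergence later in this thesis):

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))

def kl_divergence(p, q):
    # Undefined if some q_i = 0 while the corresponding p_i > 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def skew_divergence(p, q, alpha=0.99):
    # The denominator mixes p (weight alpha) with q, so it is positive
    # wherever p_i > 0: the measure is defined for arbitrary vectors.
    return sum(pi * math.log(pi / (alpha * pi + (1 - alpha) * qi))
               for pi, qi in zip(p, q) if pi > 0)

def normalise(v):
    """Turn a count vector into a probability mass function."""
    total = sum(v)
    return [x / total for x in v]

p = normalise([3, 1, 0])   # p has a zero entry
q = normalise([2, 1, 1])

print(cosine(p, q))
print(kl_divergence(p, q))    # defined: q is nonzero everywhere
print(skew_divergence(q, p))  # defined, although kl_divergence(q, p) is not
```

Calling `kl_divergence(q, p)` here would divide by zero, which is exactly the failure mode the skew divergence avoids.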
2.5 Use of Semantic Space Models

In Chapter 1, it was mentioned that semantic space models can model a whole range of psychological data that could loosely be called semantic similarity-related, and have also been shown to be useful for a number of information retrieval tasks. This section gives an overview of the most important results.
2.5.1 Psychological modelling

In Section 2.1, I mentioned that one of the first incentives for constructing semantic spaces was the modelling of human similarity judgements. A modern replication of this task can be found in McDonald (2000). Using materials from Miller and Charles (1991), he calculated the correlation between mean rated similarity and cosine distance in the semantic space for 19 target word pairs and obtained a correlation of r = 0.65, p

… > COMP > ADJ (relations to the left of the divide are core, the others non-core)

Figure 4.6: The obliqueness hierarchy according to Keenan and Comrie (1977)
quent words of the BNC which occurred on informative paths were used for the construction of the matrix. This avoids sparse data problems for the resulting vectors, which could occur if one used a path equivalence relation under which, for example, both the last word and some dependency relation had to be identical for two paths to be equivalent. In addition, the word-based path equivalence relation retains intuitively analysable dimensions of the semantic space, because every dimension corresponds to one context word.
Path valuation function: Here, I constructed four path valuation functions, as shown in Table 4.3, in order to be able to judge the influence of two different factors on the importance of paths. The simplest path valuation function, plain, assigns a constant value to every path. length applies a more advanced scheme: the value of a path decreases with its length. This gives more weight to shorter paths, which correspond to more direct relationships. An alternative counting scheme is based on the obliqueness hierarchy (Keenan and Comrie 1977). The obliqueness hierarchy, which is motivated by cross-linguistic analyses of noun phrase accessibility, claims that there is a universal hierarchy of grammatical relations, some of which are more important than others (see Figure 4.6). The subject is supposed to be the most important grammatical function, followed by the direct object, the oblique object and so forth. In the DEPENDENCY VECTORS context, it can be hypothesised that paths containing more important grammatical relations should have higher values. This idea is realised in the path valuation function oblique, which assigns weights to paths depending on the most important dependency relation they contain (for the present purpose, I assume that grammatical relations and dependency relations are equivalent). Finally, oblength combines length and oblique by dividing the obliqueness value of a path by its length.
                 − obliqueness   + obliqueness
    − length     plain           oblique
    + length     length          oblength

Table 4.3: Configuration of path valuation functions
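The four valuation schemes can be sketched as follows. The numeric obliqueness weights and the example path are invented for illustration; the thesis only requires that more important (more core) relations receive higher values:

```python
# The numeric weights below are invented for illustration; all that matters
# is that more important (more core) relations receive higher values.
OBLIQUENESS = {"subj": 4.0, "obj": 3.0, "iobj": 2.0, "comp": 1.0, "adj": 0.5}

# A path is modelled as a list of (dependency relation, word) steps.

def v_plain(path):
    """Constant value for every path."""
    return 1.0

def v_length(path):
    """Shorter paths (more direct relationships) get more weight."""
    return 1.0 / len(path)

def v_oblique(path):
    """Weight of the most important relation on the path."""
    return max(OBLIQUENESS[rel] for rel, _ in path)

def v_oblength(path):
    """Obliqueness value of the path divided by its length."""
    return v_oblique(path) / len(path)

path = [("subj", "dog"), ("adj", "big")]   # a hypothetical two-edge path
print(v_plain(path), v_length(path), v_oblique(path), v_oblength(path))
```

For the hypothetical path above, plain ignores both factors, length rewards its shortness, oblique rewards the subject relation it contains, and oblength combines the two.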
4.2.6 A necessary optimisation

Despite the attempt to alleviate the efficiency problem by the two-pass strategy, it turned out that the implementation described above is still computationally infeasible. The problem is the time complexity of ExtractBasisElements, which is quadratic in the size of the corpus for a word-based equivalence relation (probably the space complexity is too high as well, because the size of the set of basis elements is linear in the size of the corpus). For a corpus the size of the BNC and on present computers, this is infeasible in practice. However, for the case of a word-based equivalence function, one can take advantage of the fact that words are distributed according to Zipf's law (Zipf 1949), and that consequently, basis elements will be Zipf-distributed as well. According to Zipf, words show a hyperbolic distribution, where a small group of words occurs with very high frequencies, while the majority of words occur very infrequently. This implies that the high-frequency group of words, which accounts for the majority of all word tokens, will turn up reliably even in small samples. If one is interested only in the most frequent basis elements, it should therefore be possible to limit the size of the set of basis elements.

Therefore, I implemented a size-limited version of ExtractBasisElements (see Figure 4.7). The only difference to the algorithm proposed above is as follows: if the addition of a new basis element makes the set of basis elements grow beyond a fixed maximal size n, then the set is sorted according to frequency, and the r percent least frequent elements of the set are removed. The parameter r is called the cut-off ratio. n and r were added as command line options to DEPENDENCY VECTORS (see Figure 4.1). In order to test whether this optimisation actually yields desirable results, I performed some preliminary experiments in which I tested the impact of different settings on the reliability and the efficiency of ExtractBasisElements.
B := ∅
while there are paragraphs in the parsed corpus do
    read paragraph and construct parse graph
    for all words w in the parse graph do
        for all paths φ in the local context C(w) do
            found := false
            for all b ∈ keys(B) do
                if b ∼ φ then
                    found := true
                    increment B(b) by v(φ)
            if found ≠ true then
                if |B| ≥ n then
                    remove the least frequent r · n elements of B
                B(φ) := v(φ)
sort B according to frequency
for all b ∈ keys(B) do
    return b and B(b)

Figure 4.7: Enhanced Algorithm 1 (ExtractBasisElements): Construct size-limited hash table B of basis element frequencies
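A minimal sketch of the size-limited counting idea, with a toy Zipf-like stream standing in for dependency paths and a constant path value of 1; the stream, max_size, and cut-off ratio are invented for illustration:

```python
from collections import Counter

def extract_basis_elements(paths, max_size, cutoff_ratio):
    """Size-limited frequency counting: whenever a new element would make
    the table exceed max_size, first purge the cutoff_ratio * max_size
    least frequent entries. Every path counts 1 here (v(phi) = 1)."""
    counts = Counter()
    for path in paths:
        if path not in counts and len(counts) >= max_size:
            n_purge = int(cutoff_ratio * max_size)
            for key, _ in counts.most_common()[:-n_purge - 1:-1]:
                del counts[key]
        counts[path] += 1
    return counts

# Zipf-like stream: three items recur throughout, eleven occur once each.
stream = []
for rare in "defghijklmn":
    stream += ["a", "b", "c", rare]

counts = extract_basis_elements(stream, max_size=6, cutoff_ratio=0.5)
# The frequent items survive every purge; most rare items do not.
print(sorted(counts))   # ['a', 'b', 'c', 'm', 'n']
```

Because the frequent items keep reappearing, their counts are always well above those of the one-off items at purge time, which is exactly the Zipfian argument made above.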
Reliability: I ran ExtractBasisElements on a 20 MB sample from the BNC with a very high maximal size (50,000), the minimum context specification (see Appendix C) and the word-based equivalence relation (see Section 3.3.3). I found that the set did not exceed the maximal size once, the total number of basis elements being 35,327. Consequently, this set of basis elements could serve as a "gold standard". In practice, one will usually want to obtain a fixed number n of basis elements and is interested in how much higher the maximal size of the set must be chosen for them to be reliable. Therefore, I measured the overlap between the most frequent n basis elements of the gold standard and of the sets of basis elements obtained under different conditions. In a first experiment, I varied the maximal size between 1,000 and 25,000 with a step size of 1,000 while leaving the cut-off ratio constant at 0.5. Calculating the overlap for n = 1000 and n = 5000 yields Figure 4.8. It shows that for the most frequent 1,000 elements, a limit of 4,000 basis elements already yields a reliability above 95%, and above 10,000 basis elements, the limited set is for practical purposes indistinguishable from the gold standard. Convergence is slower for the 5,000 most frequent basis elements, but a maximal size of 15,000 attains the 90% overlap mark. In a second experiment, I varied the cut-off ratio between 12% and 90% with a step size of 3% while keeping the maximal size at 10,000. The results in Figure 4.9 show that very high cut-off ratios have a detrimental effect on basis element reliability, but that the effect is negligible for moderate cut-off ratios. For n = 5000, the difference in reliability between a cut-off ratio of 12% and one of 51% is less than 7%.

Complexity: I also tested whether the limitation of the set of basis elements actually yielded an improvement in complexity. For this purpose, I measured the time taken by the two experiments reported in the last paragraph. Figure 4.10 shows that the runtime indeed increases roughly linearly with the maximal size of the set. According to Figure 4.11, there is a roughly linear inverse relationship between the runtime and the cut-off ratio.

Conclusion: These experiments show that the optimisation of ExtractBasisElements indeed works. They show that extremely low maximal sizes and extremely high cut-off ratios damage reliability. However, a relatively small overhead is necessary to
Figure 4.8: Reliability of first 1,000 and 5,000 basis elements for different maximal sizes (purging ratio constant at 0.5)

Figure 4.9: Reliability of first 1,000 and 5,000 basis elements for different purging ratios (maximal size constant at 10,000)
Figure 4.10: Runtime of the experiment from Figure 4.8

Figure 4.11: Runtime of the experiment from Figure 4.9
achieve a good reliability. To achieve 90% reliability for the most frequent n basis elements, it is sufficient to set the maximal size to 4n for a cut-off ratio of 0.5. A lower cut-off ratio can achieve the same reliability even with a smaller maximal size. Not surprisingly, the parameter settings which lead to higher reliability also lead to longer runtime. However, the runtime results indicate that the bulk of the runtime is consumed by the comparison of paths with all existing basis elements. Under the now familiar assumption that the number of basis elements grows linearly in the size of the corpus, the new time complexity of ExtractBasisElements is O(|C| · |D|). In practice, ExtractBasisElements needed between 50 and 200 hours of runtime to cover the BNC for a maximal size of n = 10000 and a cut-off ratio of r = 0.5, depending on the context specification. (All experiments were performed on one processor of a Sun Enterprise 450, a 480 MHz 4-processor machine with 4 GB of memory.) ExtractSpace took approximately between 3 and 6 hours to create a semantic space from the BNC for around 200 target words, around 1000 basis elements, and different context specifications. It should be remembered, though, that all complexity results presented in this section depend on the use of the word-based equivalence relation. Should an equivalence relation be used that drastically alters the distribution of basis elements, the reliability and complexity of the implementation must be investigated anew.

4.3 Evaluation: Distance Computation

In order to evaluate the semantic spaces which the DEPENDENCY VECTORS system produced, I needed a simple tool to compute distances between vectors. Scott McDonald provided me with a perl script called semdist.pl which, given a file with target word pairs and a file with the words' vectors, computes the Euclidean and cosine distances between the elements of the pairs. I extended the script to calculate some more of the distance measures which were examined in Lee (1999) and called the result mysemdist.pl. Currently, the output specifies all the distances from Figure 2.2. The α parameter of the skew divergence was set to α = 0.99, the setting with which Lee obtained the best performance for most conditions in her experiments.
Chapter 5

Evaluation

SUMMARY. This Chapter presents an evaluation of DEPENDENCY VECTORS models in the form of a comparison against a state-of-the-art traditional model in three tasks: the first experiment assesses the models' cognitive adequacy, namely their ability to model mediated priming data; the second experiment examines the analysability of the models in terms of linguistic concepts; and the third experiment tests the models' performance in a linguistic application, the TOEFL synonymy task.

5.1 Evaluation Scenario

In Section 3.1, the shortcomings of semantic space models were discussed, highlighting two problematic aspects: the problems of co-occurrence statistics and the problems of model interpretation. The framework for vector extraction with dependency grammar and its implementation as the DEPENDENCY VECTORS system addressed the first problem by providing a linguistically founded methodology. However, this change raises an immediate concern: it is necessary to show that DEPENDENCY VECTORS models are still cognitively adequate in the sense that they can model behavioural data as effectively as traditional models. This question forms the basis for Experiment 1, which is a replication of experiment 2 from Lowe and McDonald (2000). In that study, traditional semantic space models were used to model mediated priming (see Section 2.5.1). Mediated priming, which is
a subtle effect, is suitable for a detailed assessment of the cognitive adequacy of the DEPENDENCY VECTORS models. Experiment 2 then examines the impact of the new methodology on the problems of model interpretation mentioned above. To this end, it uses the experimental materials from Hodgson (1991). These materials encode different lexical relations separately, and are therefore suitable for investigating the encoding of linguistic information in the model. Finally, Experiment 3 tests the performance of DEPENDENCY VECTORS models on a linguistic task, the TOEFL synonymy task, to assess the concrete performance of DEPENDENCY VECTORS models in a realistic application. For comparison, all three experiments were replicated using a vanilla traditional word space model. This model was constructed with a program which was kindly provided by Scott McDonald and which was used to obtain the traditional models reported in McDonald and Lowe (1998), Lowe and McDonald (2000), and McDonald (2000).

5.2 Parameter Exploration

The performance of both traditional and DEPENDENCY VECTORS models depends critically on their parameters. However, the parameter space is huge, and for reasons of time and resources, its exploration was limited. The strategy adopted is described in the following subsections.
5.2.1 Parameters for traditional models

For the vanilla traditional semantic space model, I used a combination of the parameter values which were found to be optimal for word space models in different studies (Patel, Bullinaria, and Levy 1998; McDonald 2000). I assume a symmetric window of size 10 (5 in each direction from the target word) and the 500 most frequent content words in the BNC as basis elements. Previous studies indicate that models resulting from the use of these parameters perform satisfactorily on a variety of tasks, even though optimal performance on any specific task probably requires different parameters. Since this means that there is only one experimental condition for the traditional semantic space model, there is no obvious way of showing its results as graphs; instead, I will always give the results for the traditional model at the beginning of every Results section.
5.2.2 Parameters for DEPENDENCY VECTORS models

For the DEPENDENCY VECTORS models, I used the parameters described in Section 4.2.5, namely four different context specifications and four different path valuation functions. I took only one path equivalence function into consideration, namely the word-based one from Example 6 in Section 3.3.3. Since every model requires one context specification and one path valuation function, there are fourteen different complete parametrisations of the models: for the minimum context specification, which only considers paths of length one, the two length-sensitive valuation functions collapse with their non-length-sensitive counterparts. For every experiment, these 14 parametrisations form the 14 experimental conditions. All experimental conditions are presented with the identical experimental material, and their respective performance is recorded. For brevity's sake, I will refer to the 14 experimental conditions only by their numbers; Table 5.1 lists the parametrisation for every condition.

Since 14 experimental conditions are already a considerable number, I chose to split each experiment into a main study and a dimensionality study. The main study investigated the performance of models with a fixed dimensionality of 1000 under the 14 different experimental conditions, that is, parametrisations, using the most frequent basis elements determined by ExtractBasisElements for the given condition. The dimensionality study used the model which had shown the best performance in the main study and varied its dimensionality while keeping all other parameters constant. This procedure can be justified by the observation that for traditional models, the effects of dimensionality are largely independent of the values of other parameters (Levy and Bullinaria 2001). I assumed that this also holds for DEPENDENCY VECTORS models.
For each experimental condition in the main study and for each dimensionality in the dimensionality study, I computed the five distance measures which are supported by mysemdist.pl (see Section 4.3). However, plots and discussion will only refer to the best-performing distance measures, which will be named for each experiment. A complete version of all results can be found in Appendix D. Plots for the main study of an experiment will show results for the 14 experimental conditions. The graphs will be split into four parts which correspond to the four context specifications. Each part shows the development across the different path valuation functions (compare Table 5.1).

      context specification   path valuation
  1   minimum                 plain
  2   minimum                 oblique
  3   medium                  plain
  4   medium                  length
  5   medium                  oblique
  6   medium                  oblength
  7   wide                    plain
  8   wide                    length
  9   wide                    oblique
 10   wide                    oblength
 11   rich                    plain
 12   rich                    length
 13   rich                    oblique
 14   rich                    oblength

Table 5.1: The fourteen experimental conditions
5.3 Experiment 1: Mediated Priming

Semantic priming, which was introduced in Section 2.5.1, is described here again for the reader's convenience. Direct priming is an effect where the transient presentation of a prime word facilitates a following cognitive task involving a related target word. For example, the presentation of tiger reduces subjects' reaction time, compared with a neutral case, when they have to decide whether lion is a word or a nonce. Mediated priming extends this paradigm by not only allowing directly related words as primes, but also indirectly related ones, like stripes, which is only related to lion by means of the intermediate concept tiger. In experiments, priming with mediated primes also resulted in significantly reduced reaction times, but the reduction was lower than for direct priming. The importance of mediated priming is that it is a test case for different theories of lexical access (the spreading activation versus compound cue discussion; for a review see Lowe (2000)).

On the modelling side, the experimental materials used for priming can be used to test the cognitive adequacy of models. A semantic space models priming if the primes are significantly closer in the space to their targets than unrelated words. It additionally models mediated priming if the direct primes are closer to their targets than the mediated primes, which are in turn closer than unrelated words. Because it is possible for a model to capture direct priming but fail to account for mediated priming, mediated priming materials can serve as a graded indicator of cognitive adequacy. Lowe and McDonald (2000) have been able to model mediated priming with a traditional semantic space model, which is generally taken as proof of the very high cognitive adequacy of these models. This experiment replicates the second experiment
from their paper in order to assess the cognitive adequacy of DEPENDENCY VECTORS models.
5.3.1 Method

The materials from Balota and Lorch (1986) contain 48 target words, each of which possesses a related and a mediated prime, for example lion – tiger – stripes. Unrelated primes were constructed by picking a related prime from a random different target. In addition to the complete materials, I prepared a slightly abridged version by removing all triads which contained words with a frequency of less than once in a million words. This affected only one triad, the one which contained duckling (87 occurrences in the BNC), leaving 47 triads in the abridged version. Appendix B contains the materials; discarded items are marked. I performed the experiment for both versions of the materials.

For every experimental condition of the DEPENDENCY VECTORS models, and for the traditional semantic space model used for comparison, 1000-dimensional vectors for all involved words were computed, along with distances for all target-direct prime, target-mediated prime and target-unrelated prime pairs. To analyse the data, I calculated analyses of variance (ANOVA; see Hinton (1995)) with the distance as the dependent variable. The resulting F score is a measure of the total spread divided by the spread within distributions: the higher F is, the more distinct the distributions are from one another. Following Lowe and McDonald (2000), I first performed an overall ANOVA over all three types of pair relationships (target-direct prime, target-mediated prime and target-unrelated word), which gives a general idea of how well the model models the data. Then, I performed two pairwise ANOVAs: the first (target-direct prime versus target-unrelated word) examined the size of the direct priming effect, while the second (target-mediated prime versus target-unrelated word) examined the mediated priming effect.
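The analysis can be sketched with a hand-rolled one-way ANOVA F score on invented distance values; a library routine such as scipy.stats.f_oneway would normally be used, and the numbers below are not the experimental results:

```python
def anova_f(groups):
    """One-way ANOVA F score: between-group mean square divided by
    within-group mean square. Higher F = more distinct distributions."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g)
                    for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Invented distances for illustration: direct primes closest to the target,
# mediated primes in between, unrelated words farthest away.
direct    = [0.20, 0.25, 0.22, 0.18]
mediated  = [0.35, 0.40, 0.38, 0.33]
unrelated = [0.55, 0.60, 0.52, 0.58]

print(anova_f([direct, mediated, unrelated]))   # overall ANOVA
print(anova_f([direct, unrelated]))             # direct priming effect
print(anova_f([mediated, unrelated]))           # mediated priming effect
```

On data following the mediated priming pattern, the pairwise F score for the direct contrast exceeds the one for the mediated contrast, mirroring the graded effect described above.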
5.3.2 Results

Overall ANOVA: The complete version of the experimental materials resulted in slightly higher F scores, but I will concentrate on the results obtained with the abridged version for ease of comparison with Lowe and McDonald (2000). The complete results are listed in Appendix D. For the traditional semantic space model I used as a baseline, the maximal F score was F=9.384 p