From Corpus to Lexicon: from Contexts to Semantic Features

Ronan Pichon and Pascale Sébillot
IRISA, Campus de Beaulieu, 35042 Rennes cedex, France
email: [email protected], [email protected]
tel: 33 2 99 84 74 50, 33 2 99 84 73 17
fax: 33 2 99 84 71 71
1. Introduction
Several natural language applications need domain-dependent semantic lexicons. However, such lexicons are not available for every domain, and their construction must therefore be automated as much as possible. In this paper, we present and experiment with a corpus-based methodology to automatically build a componential representation of meaning in a lexicon (Rastier 1996), without any preliminary lexical resource. We first present componential semantics and motivate our choice; we then deduce a methodology to build lexicons following the principles of this semantics; finally, we describe its implementation through a precise explanation of three experiments that we have conducted in order to automatically obtain semantic lexicons from a corpus analysis, and we discuss the results.

2. Componential semantics
Componential semantics (CS) (Rastier 1996, Hjelmslev 1961) is a linguistic theory that emphasises the relations between word meanings in the lexicon; another of its claims is that these relations are highly dependent on observations made on a corpus. We successively present these two main points of the theory in this section, and end with a conclusion about the interest of CS for automatically building semantic lexicons from corpora.

In CS, a word meaning in a lexicon is defined by the lexicon itself, that is, by its comparisons with the other meanings within the lexicon; therefore, there is no definition that is specific to one word. More precisely, a meaning is defined in the lexicon by its differences from the other meanings. These differences are represented by semantic features or semes, the basic and smallest CS meaning components, which can be defined as what distinguishes one meaning from the others. Semantic features can also be used to form semantic classes, which correspond to groups of words that share semantic features and that can be exchanged in some contexts; the semantic features express what characterises the meanings of the elements of a semantic class, compared with the rest of the lexicon. The elements of a semantic class possess two kinds of semantic features: generic ones, which correspond to the contexts in which these elements can be exchanged (Hjelmslev 1961, Pottier 1992), and specific ones, which correspond to the contexts in which they cannot. Therefore, CS, which is close to Cruse's semantics (Cruse 1986), is fundamentally a lexical semantics (it is also an interpretative semantics (Rastier 1996)), because both classes and features are represented in the lexicon; the main difference between CS and formal, psychological or cognitive semantics is that it does not use any extra-linguistic information to describe meanings. It is also textual, because the existence and description of classes and features are based on text analysis.

Consequently, CS is a lexical relation-based semantic theory, as we have said, and Rastier claims that the relations can be observed on a corpus of the studied domain. This point is the second key of CS. Therefore, in order to build a lexicon from an automatic analysis of a corpus, we must be able to precisely recognise the contexts that characterise lexical meaning relations. Several works have pointed out the difficulty of formalising context (Brézillon 1996, Hirst 1997). Moreover, the similarities and differences that we seek are defined between signifieds (meanings), whereas the extraction of relations from texts deals with signifiers (words). However, Rastier (1996), but also Harris (Harris et al. 1989), propose some guidelines for obtaining relevant contexts. In CS, in which a word sense in a text is said to be fully determined by the whole text that surrounds it, two kinds of contexts are fundamental:
• the topic of the current text unit in which the studied word occurs,
• the word neighbourhood of the occurrence.
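As a purely illustrative data model (the names are ours, not the authors'), a word occurrence to be interpreted can be thought of as carrying both kinds of context at once:

```python
from dataclasses import dataclass

@dataclass
class TypedOccurrence:
    """One word occurrence, typed by the two kinds of context used in CS."""
    word: str                  # lemma of the occurrence
    topic: str                 # topic of the surrounding text unit (the paragraph)
    neighbourhood: list[str]   # content words observed near the occurrence

# e.g. TypedOccurrence("bus", "transport", ["passenger", "city", "line"])
```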
Topic can be defined as what the subject of the text segment is (what it is talking about). In CS, the presence of a topic can be characterised by the co-occurrence, in a part of a text, of a few words that are typical of this topic. In more formal terms, a topic is in fact indicated by the conjunction (co-occurrence) of semantic features; these semantic features are borne or activated by words, which are the elements that effectively occur in the texts. Moreover, knowledge of the topic constrains the semantic classes that can be used to interpret the words of a text.

Word neighbourhood has often been used in syntactic (Briscoe et al. 1997) or semantic (Resnik 1993, Grefenstette 1993, Riloff 1996) lexical information acquisition work. Here, it is used both as a parameter to build classes of words, as in those works, and to judge the similarities and dissimilarities between words within a given topic, in terms of the contexts that the elements of a semantic class do or do not share. For example, two words like bus and car can generally be exchanged in some contexts, because they are both vehicles; this fact leads to the building of a semantic class to which they both belong. But some contexts (for example, those denoting the fact that about fifty persons can travel together in a bus, but not in a car) distinguish the two words and are a hint to exhibit a specific semantic feature that differentiates them from each other.

This description of CS allows us to explain the two main reasons why we have chosen to develop an automatic corpus-based methodology to build a componential representation of meaning in lexicons. The first one is the interest of this semantics itself. In CS, meaning is clearly domain-dependent and domain-adaptable, and this is the problem we want to tackle. Although CS has already led to text understanding (Tanguy 1996) and knowledge engineering (Bachimont 1996) applications, the automatic extraction of a CS-based lexicon has, to our knowledge, led to very few implemented works. H. Assadi mentions a possibility of automating the development of such a lexicon (Assadi 1998), but only for nominal terms of a restricted domain. The second one is that a corpus and a computer are the key tools that make this task possible and lead to lexicons large enough to prove the validity of CS. The semantic features that structure the lexicon are language elements, i.e. they are present in texts, and can therefore be extracted from a corpus. Moreover, their existence is exhibited through the observation of similarities and differences in language uses within the corpus, which can easily be detected with a computer (if they are clearly defined). We now describe the methodology that we propose in order to automatically build a CS representation of meaning in a lexicon without any preliminary lexical resource.

3. The methodology
In this section, we present the corpus-based methodology that we have developed and implemented to build CS-based meaning representations in lexicons. We also give a few elements about the data analysis methods that we have used to implement this methodology, and end with a presentation of the corpus that we have chosen for the implementation of the methodology and of the pre-treatment that it has undergone.

3.1 Description
As we want to build semantic lexicons with no prior semantic knowledge at all, we have decided to develop a methodology based on principles directly inspired by the interpretative mechanisms formalised by Rastier in CS. As we have said, these mechanisms are highly context-dependent and, more precisely, the interpretative process consists of two tasks. First, we have to determine the topics of the different text segments; this knowledge of the topic is used conjointly with word neighbourhood for the interpretation of word occurrences, and then, with lexicon building in mind, to determine lexical relations between meanings within the lexicon.

Concerning topic extraction from a corpus, we have used a data analysis method based on the distribution of words throughout all the paragraphs of the texts, in order to automatically build sets of cue words that are characteristic of the main topics of the corpus. As we have already mentioned, the co-occurrence of some of the keywords of a topic in a text segment can then be used to assert that this segment belongs to the topic. The first experiment that we describe in the next section is dedicated to this topic extraction; we show that it is possible to detect, in a corpus of texts of a homogeneous domain, the main topics which characterise it, that is, to automatically determine the sets of cue words that characterise the 10-30 most important topics globally present in a corpus, and that correctly describe its content.

As semantic class content is bound to the knowledge of the topic that the text is dealing with, we must then segment the corpus into different parts, corresponding to the different topics that have been detected. For each sub-corpus, the second part of the methodology is the following: within each topic, we extract word neighbourhoods in terms of windows of 5 words before and 5 words after each studied word. Contexts built up from these two points (the knowledge of the topic and the windows) make it possible to type the word occurrences in the corpus. We treat these typed occurrences with the help of a data analysis method to discover classes of occurrences that share similar contexts, in order to build semantic classes. The second experiment in the next section concerns this second part of the methodology.

The knowledge of the topics, of the semantic classes within a given topic, and of the neighbourhood contexts of the studied word occurrences is finally used, in the third and last part of our methodology, to automatically detect generic and specific semantic features, that is, to extract sequences of discriminating contexts in order to make the semantic features of the lexical representation explicit. We use intersections and set differences between lists of context words for this task. This part of the work is quite original because, to our knowledge, very few works examine the contexts used to form semantic classes in order to detect fine-grained distinctions between the elements of a class, and none of them has really tried to automate this part of the task. This step towards semantic features is the last experiment that we describe in the following section.
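To fix ideas on this third step, here is a hedged sketch (the naming is ours; the paper gives no code) of how generic and specific feature candidates can be derived by set intersection and set difference over the context words of the members of a class:

```python
def feature_candidates(contexts: dict[str, set[str]]):
    """Given each class member's set of context words, return the shared
    (generic) context words and, per member, its distinctive (specific) ones."""
    members = list(contexts)
    # generic candidate: context words common to every member of the class
    generic = set.intersection(*(contexts[m] for m in members))
    # specific candidates: context words a member does not share with any other
    specific = {m: contexts[m] - set().union(*(contexts[o] for o in members if o != m))
                for m in members}
    return generic, specific

# Toy example in the spirit of the bus/car discussion of section 2:
ctx = {"bus": {"drive", "road", "passenger", "fifty"},
       "car": {"drive", "road", "passenger", "garage"}}
shared, distinctive = feature_candidates(ctx)
# shared -> {"drive", "road", "passenger"}; distinctive["bus"] -> {"fifty"}
```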
Before presenting its implementation, let us give a few indications about the data analysis methods that we use in the first two steps of the methodology, and that we have mentioned in its description. Even if the two techniques that we apply differ slightly between the two parts of the work, they are both hierarchical classification methods; therefore, we just give here the main ideas of this kind of clustering technique. More precise elements will be given for each of them in the description of the two experiments in the next section.

3.2 Hierarchical classification
Hierarchical classification is a data analysis method which aims at structuring a set of data into a tree through successive clusterings of its elements. At the beginning of the method, the data consist of a set of elements, each associated with an attribute vector that characterises it (for example, the +/-5 word window context of a word in our second experiment). In order to judge the similarity between two elements of the data set at the start of the classification process, a similarity or proximity measure between the attribute vectors has to be defined; different classification methods can therefore differ in the kind of proximity measure that is chosen. The classification, or clustering, then proceeds in the following way. At the beginning, each element (and its attribute vector) is put in its own class. At each step of the classification, the two most similar classes (containing one or more elements) are merged to build a new one. The criterion for choosing the pair of classes to merge is based on the proximity measure: the two nearest classes, according to this measure, are selected. The classification goes on until there is only one cluster left. Therefore, if the number of elements of the data set is n, n-1 steps are needed to complete the classification. The result is a classification tree that describes how the clustering has been done; each node corresponds to a class, and each leaf to an element of the original set of data.
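The sketch below illustrates this bottom-up procedure with a pluggable similarity measure. It is a generic illustration (single linkage between classes is our assumption, one common choice), not the LLA measure itself:

```python
def agglomerative_cluster(vectors: dict, similarity) -> list:
    """Bottom-up hierarchical clustering: start with one class per element and
    repeatedly merge the two most similar classes until one cluster remains.
    Returns the merge history, i.e. the internal nodes of the classification tree."""
    classes = [[label] for label in vectors]  # one singleton class per element
    history = []
    while len(classes) > 1:
        best = None
        for i in range(len(classes)):
            for j in range(i + 1, len(classes)):
                # class-to-class proximity: best element pair (single linkage)
                s = max(similarity(vectors[a], vectors[b])
                        for a in classes[i] for b in classes[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        s, i, j = best
        history.append((classes[i], classes[j], s))  # record the new node
        classes = [c for k, c in enumerate(classes)
                   if k not in (i, j)] + [classes[i] + classes[j]]
    return history  # n elements yield exactly n-1 merges, as noted above
```

Any proximity measure can be passed in as `similarity`, for instance the normalised scalar product used in our second experiment (see section 4.2).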
3.3 The corpus and its pre-treatment
We end this section with a quick presentation of the corpus that we have chosen for the implementation of the methodology, and of the treatment that it has undergone. The choice of the corpus is guided by the fact that it must be homogeneous enough to be characterised by the approximately 10 to 30 representative topics that we want to automatically extract, and that it can be split into thematically homogeneous units. The first point can be achieved with two kinds of corpora: a domain-specific one, or a general language one which deals with a limited number of subjects. We have chosen a corpus of the latter type. The second condition is simply met if we have a corpus in which each paragraph deals with at most one subject (we do not care, here, about the fact that a topic may be split over several paragraphs of a text). The corpus that we have chosen is a large set of articles from the French newspaper Le Monde diplomatique. It is written in general French and tackles a relatively limited number of subjects that are globally all related to politics: diplomatic relations, Third World development, political economy, etc. The entire corpus contains 7.8 million words. This corpus has undergone a pre-treatment which consists of part-of-speech (POS) tagging, for which we have used the Multext tools for French, that is, the MtSeg tokenizer and the MtLex lemmatizer-tagger developed at Aix-en-Provence University, and the Tatoo tool developed at ISSCO to disambiguate the POS tags. We can now present the implementation of the methodology and its results.

4. Implementation and results
This section is divided into three parts, corresponding to the three successive steps of the methodology that we have defined to develop a CS representation of meaning in a lexicon. The first part, topic extraction, is fully developed and tested. The second part, semantic class building, is also fully developed and is based on some well-known principles of this domain. The third one, the automatic detection of semantic features that show similarities and dissimilarities between words, which is a fully original study, is still work in progress, but the first results that we present here are very promising.

4.1 Topic extraction
As meaning in a CS-based lexicon is highly dependent on contexts observed in a corpus, our first step is to determine the main topics of our corpus. These topics are used as part of the contextual information on which our meaning representation is based.

4.1.1 Definitions
As we have mentioned earlier, the topics that we want to automatically extract from a corpus must satisfy the two following conditions: first of all, they should not be too numerous and must be coarse-grained enough to constitute a useful first step in our interpretative process. Secondly, they should cover most parts of the corpus, that is, they must really be characteristic elements of the corpus, and most (if not all) text segments should be liable to be assigned to one of these topics. The topics that we must discover can also be defined by two features:
1) they are characterised by sets of keywords that (also) have to be automatically detected on the corpus, and the co-occurrence of some of the keywords of a same set in a text segment indicates the topic of the text unit;
2) they are associated with text segments, and, here, paragraphs are the text unit that we consider.
Consequently, the work that we present here is quite different from the studies which tackle the problem of topic detection in terms of segmentation of texts into several parts concerning different topics. These works, which can be grouped under the name discourse segmentation, deal with a general modelling of discourse (Grosz et al. 1986) or only try to partition texts or discourse into units corresponding to their subtopic structure, with the help of linguistic hints (Litman et al. 1995) or notions such as lexical cohesion (Hearst 1994, Ferret 1998). Concerning the former, our aim is quite different, because we do not try to develop such a complex model of discourse; for example, we do not deal with the intentions of speakers or anything like that. Even if we are closer to the more knowledge-poor approach of some works of the second group, our objectives are still different. These works try to partition texts, that is, to determine the beginnings and ends of their topics. Our aim is, when we study a part of a text, to be able to recognise the topic it belongs to, with the help of keyword co-occurrences. Moreover, these studies presuppose a linearity of the topics in the texts, and we do not.

4.1.2 Method
In order to automatically detect sets of (key)words that characterise the different main topics of our corpus, we have used a sound knowledge-poor hierarchical classification method: likelihood linkage analysis (LLA) (Lerman 1991). For our experiment, we have worked on a 1 million word part of the 7.8 million word Le Monde diplomatique corpus, which corresponds to 9500 paragraphs and 200 articles. In this sub-corpus, we have selected all the nouns that appear more than 8 times (because the elements of the set of typical words for a topic must be frequent enough to be used to determine the topic of a given paragraph), and we have only suppressed month names and a few acronyms; we have obtained 165 nouns, which globally appear in 8570 of the paragraphs, which means that we do cover a wide part of our sub-corpus.
Each noun is associated with its lemma and the list of the numbers of the paragraphs in which it occurs, that is, its observed distribution across the paragraphs of the sub-corpus. This information corresponds to the initial set of data and attribute vectors for the hierarchical classification.
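For illustration, such an attribute vector can be read as a relative distribution over the paragraphs. The following minimal sketch is our own simplification, giving only the intuition, not the exact LLA input:

```python
from collections import Counter

def distribution_vector(paragraph_ids: list[int], n_paragraphs: int) -> list[float]:
    """Turn the list of paragraph numbers in which a lemma occurs into a
    relative distribution over all paragraphs of the sub-corpus."""
    counts = Counter(paragraph_ids)
    total = len(paragraph_ids)
    return [counts[p] / total for p in range(n_paragraphs)]

# e.g. a noun seen in paragraphs 0 and 2 of a 4-paragraph sub-corpus:
# distribution_vector([0, 2], 4) -> [0.5, 0.0, 0.5, 0.0]
```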
The proximity measure of the LLA clustering is based on the relative distribution of the words across the paragraphs; hence, with LLA, we obtain sets consisting of the words whose distributions are the most similar. One of the advantages of LLA is that it also provides a discriminating factor to facilitate the validation of the different sets that are built. More precisely, at each step of the clustering, a measure of the quality of the new partition is calculated. This measure is based on the distance between the elements of the same class. The variation of the quality of the partition between two steps can consequently be evaluated. Each local maximum indicates a significant node in the classification, that is, a node which points out a good class. This is a help for the reading of the results. Among all the sets of words, and because they consist of words whose co-occurrence indicates the current topic of any text segment, we only consider the sets containing at least three and no more than fifteen words as possible valid sets.

4.1.3 Results
The 80 sets that we have obtained (without taking the LLA discriminating factor into account) have been evaluated by five native French speakers. The evaluation process was the following: for each set, the speaker was asked whether he judged it correct (that is, homogeneous enough) to be associated with one topic. If the answer was positive, he was then asked to name the corresponding topic. A set has been judged valid if no more than one human judge has rejected it. Among the 80 proposed sets, the judges have validated 27 sets with 78% agreement. If we only consider the 45 sets among the 80 that were indicated as more interesting by the LLA discriminating factor, 21 of them were validated. Here are a few examples of the sets that we have automatically obtained:
the press = {journal, journaliste, presse} (newspaper, journalist, press)
territory = {autorité, territoire, région} (jurisdiction, territory/area, region)
organisations (of institutions) = {international, communauté, organisation, nation, développement} (international, community, organisation, nation, development).
And here is an example of a rejected set: {français, foi, culture, démocratie, système} (French, faith, culture, democracy, system).
We have also evaluated the quality of our results with the help of an existing index of the corpus. 92% of the validated sets are present in the index, but these sets only represent 20% of all the topics of the index. So, we seem to have very good precision, but rather poor recall. But if we evaluate the coverage of the topics that we have discovered, we find that they represent 60% of the corpus segments. This allows us to say that our method has extracted the main topics of the corpus, even if it has not extracted all the different topics.

4.1.4 Usage
For these sets to be useful, we must be able to associate any text segment of the general 7.8 million word corpus with one of them. To achieve this goal, we assign a topic to a given segment if at least two elements of its set of typical keywords are present in the observed text unit, as sketched below. After this task, we are able to automatically extract, from the 7.8 million word corpus, a specific sub-corpus for each topic. We can also determine the current topic of a text unit, that is, useful contextual information, for most word tokens.
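This assignment rule admits a direct sketch (ours; note that the paper does not say how a segment matching several topics is handled, so the version below simply returns the first match):

```python
def assign_topic(segment_lemmas: set[str], topics: dict[str, set[str]]):
    """Assign a topic to a text segment if at least two of the topic's
    typical keywords occur in it; return None when no topic qualifies."""
    for name, keywords in topics.items():
        if len(keywords & segment_lemmas) >= 2:
            return name
    return None

topics = {"the press": {"journal", "journaliste", "presse"}}
assign_topic({"presse", "journaliste", "loi"}, topics)  # -> "the press"
```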
4.2 Semantic Classes
Our aim in this part is no longer to determine contextual information, but to proceed toward a lexical representation of meaning using contextual information. In this section, we present a method to automatically build sets of words that are supposed to form semantic classes. As this part of our work is not original (cf. for example Grefenstette 1993, Riloff 1996 and Wilks et al. 1996), we give here a rather quick presentation of its principles.

4.2.1 Definition
According to componential semantics, a semantic class is a set of words which share meaningful contexts in the language. This is what we want to obtain through the use of a hierarchical classification method on words, according to the contextual information they are associated with in the corpus. However, as a semantic class is only valid within one topic, the work that we present here has to be repeated on every sub-corpus that has been extracted from the original corpus.

4.2.2 Method
The classification method that we have used in order to obtain semantic classes is a rather simple one. With each noun (respectively verb or adjective) that is present in a sub-corpus, we have associated all the nouns, verbs and adjectives that appear in a +/-5 word window around its occurrences. These neighbourhood elements form its attribute vector. The similarity measure that is used during the clustering process is a normalised scalar product between two such vectors. As the classification proceeds, we keep in memory the contextual neighbourhood (the list of nouns, verbs and adjectives) that is associated with each word and with each class (this is something that LLA cannot do).
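A hedged sketch of this step follows, assuming the token stream has already been POS-filtered to nouns, verbs and adjectives and lemmatised; the exact normalisation of the scalar product is not specified in the paper, the cosine being the usual reading:

```python
from collections import Counter
from math import sqrt

def window_vector(tokens: list[str], target: str, size: int = 5) -> Counter:
    """Count the content words seen in a +/-'size' word window around every
    occurrence of 'target'; the counts form its attribute vector."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            for neighbour in tokens[max(0, i - size):i] + tokens[i + 1:i + 1 + size]:
                vec[neighbour] += 1
    return vec

def normalised_scalar_product(u: Counter, v: Counter) -> float:
    """Scalar product of two attribute vectors, normalised by their norms."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0
```

These two functions supply, respectively, the attribute vectors and the `similarity` argument for the generic clustering sketch of section 3.2.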
4.2.3 Results
As these semantic classes are just one step toward semantic features in our methodology, we have not yet carried out a full-scale analysis of the quality of the classification, and we use the classes that we judge valid to work on the last point of our implementation: semantic feature extraction.

4.3 Towards semantic features
In this part, we use the knowledge of the topics of the corpus, the semantic classes and the neighbourhood contexts in order to extract semantic features. This final task aims at structuring a CS-based lexicon of word meaning.

4.3.1 Definition
The semantic features that we want to extract are of two types:
1) generic semantic features: each semantic class has to be associated with one generic semantic feature, which expresses what makes the class a valid semantic class, and is a sort of tag of the class. This generic semantic feature has the form of a word or a phrase. For example, the class {scalpel, bistoury} can be characterised by the generic semantic feature /surgical instrument to cut flesh/.
2) specific semantic features: in each semantic class, specific semantic features are attached to the elements and explain how the meaning of one element differs from the other meanings in the class. For example, in the class {scalpel, bistoury}, /for the dead/ can be attached to scalpel, and /for the living/ to bistoury.
As we have mentioned before, the automatic extraction of semantic features is truly original work, and it is still in progress. We present here the method that we have developed and the first preliminary results that we have obtained.

4.3.2 Method
In order to extract such semantic features, we use the neighbourhood context associated with a semantic class by our previous classification method. The contextual elements which are common to all the members of a class are used to represent a generic semantic feature. With a specific semantic feature of a member of the class, we associate the contextual elements which are specific to it (versus the other members of the same class). The associations are made automatically by computing intersections and differences of the sets formed by the neighbourhood contexts (as sketched at the end of section 3.1). The extracted sets of contextual elements then have to be interpreted by hand (by an expert); this interpretation consists in detecting semantic regularities within these sets, that is, sequences of words which share similar semantic features. The automation of this last task is, as we mention in the conclusion, one of our future works. We now present a few preliminary results. More examples can be found in the specific study of this experiment presented in (Pichon et al. 1999).

4.3.3 Results
The first kind of results that we have obtained concerns the study of the neighbourhoods of different words that are quite similar, and that could be candidates to form a semantic class, within a given topic extracted from the corpus. The goal of this study of the semantic features within a semantic class is to point out what brings these words together, and also what differentiates them from each other. We first list the neighbourhood context elements that are common to all the considered words of the semantic class in the topic. We then present the sets of elements that are specific to each member of the class, and discuss their possible interpretations.

Topic: territory. Class: official authorities.
Considered members of the class: {pouvoir, autorité, gouvernement} (power, authority, government)
Common contexts: nouveau, politique, président (new, political, president)
Specific contexts:
pouvoir (power): état, local, soviétique, année, exécutif, parti, prise, public, économique, place, arrivée, central (state, local, soviet, year, executive, party, seizure, public, economic, place, arrival, central)
autorité (authority): Pékin, place, preuve, région, transfert, chinois, état, territoire, gouvernement, israélien, palestinien, local (Beijing, place, proof, region, assignment, Chinese, state, territory, government, Israeli, Palestinian, local)
gouvernement (government): fédéral, occidental, français, ministre, régional, union, formation, central, national, israélien (federal, western, French, minister, regional, union, group, central, national, Israeli)
In this semantic class within the territory topic, we notice that very few neighbourhood elements are common to the three words. However, the words that correspond to the geographical or institutional area on which autorité, pouvoir or gouvernement exerts its influence are quite numerous, even if they are different for each of the three words: {local, central} for pouvoir, {region, territory, local} for autorité, and {federal, regional, union, central, national} for gouvernement.
Among the differences, we see that autorité is closely bound to local, whereas pouvoir and gouvernement are bound to central; this may imply that autorité is subordinated to a gouvernement or a pouvoir. Moreover, the specific co-occurrence of {federal, national} on the one hand, and {minister, union, group} on the other, indicates that gouvernement exerts its influence within a well-defined and structured framework, whereas pouvoir and autorité imply a more informal one. Finally, autorité is closely bound to the notion of territory: {region, territory, local}, whereas pouvoir exerts its authority on something different: {public, economic, executive} and seems to be a more changeable entity: {place, seizure, arrival, year} than gouvernement and autorité.

The second kind of results concerns the study of the different neighbourhood contexts of the same word in various topics. The aim of this work is to study the variations of meaning between topics.

Word: militaire (military). Topics: territory, negotiations.
Common contexts: force, États-Unis, américain, grand, puissance, économique, intervention, politique, présence, aide (strength, United States, American, great, power, economic, intervention, political, presence, help)
Specific contexts:
territory: moyen, opération, massif, occupation, régime, russe, victoire, base (means, operation, massive, occupation, regime, Russian, victory, base)
negotiations: action, ordre, effort, pays, atlantique, Europe, OTAN, dépense, responsable, organisation (action, order, effort, country, Atlantic, Europe, NATO, expense, responsible, organisation)
These contexts help us to differentiate the meanings of militaire in the two topics. While the words associated with the territory topic clearly indicate a warlike meaning for militaire: {means, operation, massive, occupation, victory}, the emphasis in the negotiations topic is put on militaire as something structured or associated with something structured: {organisation, expense, responsible, country, Atlantic, Europe, NATO}. Consequently, if we manually extract sequences among the contextual elements associated with the words that we want to study, we see that it is possible to make explicit the semantic features bound to a meaning by naming the sequences. The name given to such a sequence is in fact the precise semantic feature we are looking for. If we generalise the application of this method to all the words that we study, we can really build a complete CS meaning representation lexicon.

5. Conclusion and future works
In this paper, we have described the methodology that we have elaborated in order to proceed from a corpus to a CS-based lexical representation of the meanings of the words present in that corpus. Moreover, the experiments that we have presented demonstrate the feasibility and validity of our methodology, as they allow us to extract semantic features. However, the first results that we have obtained must still be improved to be completely satisfying. This improvement depends on the methods used at each step of the methodology. The first step, topic detection, can be considered the soundest part. It has already been cross-validated through the use of other types of data analysis methods (latent semantic analysis and correspondence analysis) (Morin 1999), and the results, in terms of topic detection and of the contents of the sets of typical keywords, are similar.
The automatic semantic classification that we have presented in subsection 4.2 can (and will) be improved. This might be done by using richer information as neighbourhood contextual elements, such as the position of the words in the neighbourhood. We expect that this will improve the quality of the classes, but also facilitate the interpretation of the contextual elements for semantic feature extraction. Consequently, the interpretation process described in 4.3 will also have to be refined. This last step is the part which requires the largest amount of work. The two main research directions are:
- the automation of sequence extraction from the neighbourhood contextual elements. This can be done by combining data analysis methods and the information supplied by the semantic classes built at the previous step;
- the automation of the comparison of the extracted sequences, in order to guarantee consistency in the lexicon that we want to build. This point is also necessary to be sure that the obtained semantic features are real meaning primitives.
Making and testing these improvements to the different steps of our methodology will allow us to make explicit how to extract a complete CS-based lexicon from a corpus.

6. References
Assadi, Houssem. 1998 "Construction d'ontologies à partir de textes techniques – Application aux systèmes documentaires". PhD thesis. Paris 6 University
Bachimont, Bruno. 1996 "Herméneutique matérielle et Artéfacture : des machines qui pensent aux machines qui donnent à penser". PhD thesis. École polytechnique
Briscoe, Ted, and Carroll, John. 1997 "Automatic Extraction of Subcategorisation from Corpora" in: 5th ACL Conference on Applied Natural Language Processing. Washington. USA
Brézillon, Patrick. 1996 "Context in Human-Machine Problem Solving: a Survey". Technical Report 96/29. LAFORIA. Paris 6 University. October
Cruse, David A. 1986 Lexical Semantics. Cambridge Textbooks in Linguistics. Cambridge University Press
Ferret, Olivier. 1998 "Une segmentation thématique fondée sur la cohésion lexicale" in: TALN'98 (Traitement Automatique des Langues Naturelles). Paris. France
Grefenstette, Gregory. 1993 "Evaluation Techniques for Automatic Semantic Extraction: Comparing Syntactic and Window Based Approaches" in: Workshop on Acquisition of Lexical Knowledge from Text. SIGLEX/ACL. Columbus. USA
Grosz, Barbara J., and Sidner, Candace L. 1986 "Attention, Intentions and the Structure of Discourse". Computational Linguistics 12: 175-204
Harris, Zellig, Gottfried, Michael, Ryckman, Thomas, Mattick (Jr), Paul, Daladier, Anne, Harris, Tzee N., and Harris, Suzanna. 1989 The Form of Information in Science: Analysis of an Immunology Sublanguage. Dordrecht and Boston: Kluwer Academic Publishers
Hearst, Marti A. 1994 "Multi-paragraph segmentation of expository text" in: 32nd Annual Meeting of the Association for Computational Linguistics. Las Cruces. NM. USA
Hirst, Graeme. 1997 "Context as a Spurious Concept" in: AAAI Fall Symposium on Context in Knowledge Representation and Natural Language. Cambridge. MA. USA
Hjelmslev, Louis. 1961 Prolegomena to a Theory of Language. University of Wisconsin Press
Lerman, Israël-César. 1991 "Foundations in the Likelihood Linkage Analysis Classification Method". Applied Stochastic Models and Data Analysis 7: 69-76
Litman, Diane J., and Passonneau, Rebecca J. 1995 "Combining Multiple Knowledge Sources for Discourse Segmentation" in: 33rd Annual Meeting of the Association for Computational Linguistics. Cambridge. MA. USA
Morin, Annie. 1999 "Latent Semantic Analysis and Correspondence Analysis for Thematic Exploration in Texts" in: ASMDA99 (Applied Stochastic Models and Data Analysis). Lisbon. Portugal
Pichon, Ronan, and Sébillot, Pascale. 1999 "Différencier les sens des mots à l'aide du thème et du contexte de leurs occurrences : une expérience" in: TALN'99 (Traitement Automatique des Langues Naturelles). Cargèse. France
Pottier, Bernard. 1992 Sémantique Générale. Presses Universitaires de France
Rastier, François. 1996 Sémantique Interprétative. Presses Universitaires de France
Resnik, Philip. 1993 "Selection and Information: a Class-Based Approach to Lexical Relationships". PhD thesis. University of Pennsylvania
Riloff, Ellen. 1996 "Automatically Generating Extraction Patterns from Untagged Text" in: 13th National Conference on Artificial Intelligence (AAAI 96). Portland. OR. USA
Tanguy, Ludovic. 1996 "Traitement Automatique de la langue naturelle et interprétation : contribution à l'élaboration d'un modèle informatique de la sémantique interprétative". PhD thesis. École nationale supérieure des télécommunications de Bretagne – Rennes 1 University
Wilks, Yorick, Slator, Brian, and Guthrie, Louise. 1996 Electric Words: Dictionaries, Computers, and Meanings. Cambridge, MA: MIT Press (A Bradford Book)