The AAAI 2017 Spring Symposium on Computational Construction Grammar and Natural Language Understanding Technical Report SS-17-02

Frame-Based Knowledge Representation Using Large Specialized Corpora

Daphnée Azoulay
Observatoire de linguistique Sens-Texte (OLST)
Université de Montréal
[email protected]

Abstract

This paper explores the benefits of building large specialized corpora automatically for creating semantic frames in specialized fields of knowledge, using term co-occurrences. We develop and evaluate a method to automatically compile large specialized corpora. Our evaluation serves two purposes: 1) verify whether our automatic corpus compilation method returns large specialized corpora of the same quality as a small corpus built with human-selected documents; 2) see how the occurrence frequency of predicative terms in a specialized corpus influences the extraction of frame element realizations, using distributional co-occurrences. This will allow us to estimate the optimal size of a specialized corpus for capturing linguistic and knowledge patterns in the form of part-of-speech patterns, in accordance with the criteria of the Frame Semantics framework. As our goal is to automate the creation of semantic frames using distributional analysis, building large, high-quality specialized corpora is crucial, since it provides the groundwork for all future steps.

1. Introduction

A sublanguage, also called language for specific purposes (LSP), is a type of language that essentially has a pragmatic and communicative function inside a specific domain of knowledge. It is commonly defined as being objective, systematic and exact, and these characteristics have several linguistic consequences: texts are mainly characterized by impersonal, logical and precise descriptions (Crystal 1997). Restricting automatic language processing to sublanguages seems much more realistic than trying to provide knowledge for an entire language (Grishman and Kittredge 1986). The vast number of specialized documents now available on the Web offers an unprecedented opportunity to easily gather a vast amount of domain-specific texts to which we can apply statistical analysis to map the consensual knowledge that resides among experts.

The intricate link between knowledge and language has notably been described by Wittgenstein (1953), who sought meanings in language constructions and saw in natural language grammar the scope and limits of thought and knowledge. As a knowledge representation structure that connects the conceptual and linguistic levels, Frame Semantics (Fillmore 1982, Fillmore and Baker 2010) is a particularly attractive model, allowing us to describe the predicative units that evoke processes and events - like verbs and predicative nouns - whose semantics cannot be captured completely with traditional ontologies and taxonomies (L'Homme et al. 2014). Recently, Bernier-Colborne and L'Homme (2015) reported promising results in gathering lexical units that evoke the same frame, using a distributional semantic method. Their project intends to enrich a resource called the Framed DiCoEnviro (http://olst.ling.umontreal.ca/dicoenviro/framed/index.php, in development), which uses FrameNet as a reference and is based on a terminology resource pertaining to the environment, the DiCoEnviro (http://olst.ling.umontreal.ca/cgi-bin/dicoenviro/search_enviro.cgi).

In this paper, we focus on the automatic construction of large, high-quality specialized corpora from which we analyze distributional co-occurrences and their potential for the automatic creation of semantic frames. We first developed a tool to automatically compile very large specialized corpora. In section 2, we describe the tool and assess its validity by comparing a corpus built entirely automatically to one built with documents selected by a human. We then evaluate the impact of term frequency on the quality of the distributional information, based on FrameNet's criteria. The future work section proposes a distributional analysis of co-occurrences intended to reduce the number of syntactic structures to analyze.


Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.



We suggest first identifying the part-of-speech patterns (or syntactic n-grams) surrounding each verbal term of the corpus, and then applying semantic role labeling with a window-based approach only to the most frequent part-of-speech patterns that are common to all the verbal terms in the corpus. Our ultimate goal is to build semantic frames for specialized domains automatically using distributional analysis, but we first need to focus on the step of building corpora designed for that specific task. We think that the corpus compilation process should be integrated into the language processing pipeline, as the first phase of a continuum, so that the resulting corpus is customized to its purpose. There has been some research on the automatic construction of Web corpora but, to our knowledge, none of the corpora produced by the proposed tools has been evaluated for the specific task of extracting specialized knowledge with distributional analysis, which is the focus of this article.

"Corpus-based representation is predicated on the idea that large corpora of domain models can be leveraged to create novel tools in service of knowledge representation. The key advantage of corpus-based representation is that it avoids the need for careful logical design of a single comprehensive ontology." (Halevy and Madhavan 2003: 1571)

Although distributional methods are often used in corpus linguistics, they are rarely applied to sublanguages (Périnet and Hamon 2014), one reason being that specialized corpora are very often much smaller than general ones. Banko and Brill (2001) mention that "the effectiveness of resolving lexical ambiguity grows monotonically with corpus size up to the size of at least one billion words." With the vast number of documents available today on the Web, we can consider automatically building specialized corpora of very large sizes to which we can then apply statistical analysis. At the same time, we need a method that guarantees objectivity in the compilation process, with a total absence of human intervention. As Evert (2006) mentions, non-randomness in the selection of corpus texts is an important problem in corpus linguistics, mainly because it is contradictory to apply statistical analysis to data that were consciously selected. After presenting previous work on the automatic compilation of domain-specific corpora, we show a method to build large specialized corpora automatically, and we then assess the impact of corpus size on the quality of the distributional co-occurrences.

2. Automated compilation of large specialized corpora

Specialized corpora contain documents that are specific to a field of knowledge or to a domain. But a question quickly arises: what is a "domain"? It is a central topic in cognitive linguistics. Langacker states that "domains are necessarily cognitive entities: mental experiences, representational spaces, concepts, or conceptual complexes" and that a domain is a "context for the characterization of a semantic unit" (Langacker 1987: 147). He gives the example of the concept knuckle, which "presupposes" finger, which depends on hand, arm, body, and then space. This classification represents a meronomic conceptual network and shows that a concept "presupposes" another. Based on this perspective, we define a domain as a topic, concept or field of knowledge that is part of a larger one and that can be subdivided into smaller ones (in which case we would call it a "subdomain"). For example, the environment includes the concept of the greenhouse effect, which in turn includes the topic of greenhouse gases. Delimitations are subjective, and domains usually depend on larger ones. Whichever concept we wish to describe, using a large amount of data on the same topic, i.e. specialized corpora, will provide a terminological density and syntactic patterns that we could hardly find in general corpora.

2.1 Related work

WebBootCaT is an online version of BootCaT (Baroni and Bernardini 2004) integrated into Sketch Engine (Kilgarriff et al. 2004). The tool uses an iterative algorithm to build a corpus from the Web semi-automatically. In the first iteration, it takes as input a list of keywords provided by the user. These terms are combined randomly and used as queries to Google. HTML pages found by Google are returned to the user, who has the option to keep or delete them. The pages are compiled into a corpus that can be expanded over as many iterations as necessary. The method implies that the user has some prior knowledge of the domain for which he wants to build a corpus. As Gatto (2014) points out, given the crucial importance of the keywords in the constitution of a corpus that is representative of a specialized field, using a semi-automatic tool like WebBootCaT requires from the user a significant degree of subjectivity, comparable to building the corpus manually.

Another platform, called TerminoWeb (Barrière and Agbago 2006), gathers documents containing notional information on a given topic. Like WebBootCaT, the tool first extracts the most frequent terms of a selected text and combines them randomly to be used as keywords for queries to the search engine. The system then evaluates each of the returned documents and assigns a score based on the presence of pre-encoded linguistic patterns (e.g. "is a kind of", "is composed of", etc.), which typically contain definitional elements, or what Meyer (2001) calls "knowledge patterns".



As mentioned earlier, we are interested in finding knowledge patterns, but instead of using pre-encoded and fixed linguistic patterns, we wish to discover these linguistic patterns in a more abstract form, namely part-of-speech patterns, independently of their lexical realizations. We therefore do not wish to limit our collection of documents to those containing fixed and specific lexical patterns; instead, we are looking for as many linguistic realizations as possible inside a very large variety of documents.

The most recent crawler that we have found is called Babouk (de Groc 2011). It is a "focused crawler" that requires a list of terms to find texts on the Web for a desired domain. The first steps of the tool are the same as those of WebBootCaT and TerminoWeb: terms are extracted from the pages found in the first iteration and are used as keywords to find new documents. Babouk uses a Random Walk algorithm that assigns a weight to the keywords and documents collected at each iteration. The weight reflects the relative importance of each of these elements and serves to select the most relevant documents, as well as to choose the best seeds for the next requests. One of Babouk's objectives is to retrieve the maximum number of relevant documents while downloading the minimum number of pages. It was not designed to gather very large specialized corpora but rather to get high-quality web pages.


2.2 Corpus compilation tool

In this paper, we propose a very simple, unsupervised and iterative method to bootstrap a vast number of domain-specific documents from the Web with the Google search engine. The tool has three main properties:

1. It prevents semantic drift induced by ambiguous requests. By simply adding the name of the domain in quotation marks beside two automatically extracted keywords, we prevent the topic from shifting away from the expected one when keywords are polysemous (see Curran et al. 2007).

2. It needs no human intervention in the choice of seeds or documents. The only words that the user enters in the Google search engine are the ones corresponding to the name of the domain. Seeds are then extracted automatically from the first ten retrieved documents, which initialize the iterative process. Also, because the terms serving as keywords for finding documents are automatically extracted and sent directly to the Google search engine without human intervention, the process guarantees the objectivity required when looking for statistical regularities in corpora (Evert 2006).

3. It collects enough quality data that noise is reduced and linguistic and knowledge patterns emerge. By selecting only PDF files, it is more likely that the retrieved documents will correspond to scholarly and peer-reviewed publications. Also, by expanding the size of the specialized corpora to millions and, hopefully, billions of words, we expect that the noise produced by lower-quality documents will decrease and that the quality of the distributional information will improve.

The tool operates as follows:

1. The user enters in Google the name of the targeted domain in quotation marks (e.g. "endangered species"). The name of the domain is used as keywords by the Google search engine. The queries are restricted to the desired language, to PDF files and to the needed usage rights in Google's Advanced Parameters;

2. The tool selects the first ten PDF documents returned by Google, from which it forms a sample corpus for the next steps;

3. The corpus is converted into text using Apache Tika (http://tika.apache.org/1.3) and the text is tagged morpho-syntactically and lemmatized with TreeTagger (Schmid 1994);

4. The two most frequent terms tagged as nouns and never used in prior iterations are extracted and used as keywords for a new query, placed just beside the name of the targeted domain in quotation marks (see the example in Figure 1). Again, the queries are restricted to the desired language, to PDF files and to the needed usage rights in Google's Advanced Parameters;



Figure 1: Example of a query submitted to Google

5. All new documents found by Google are retrieved;


6. Steps 3 to 5 are repeated until no new documents are returned by the search engine.
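To make the procedure above concrete, the following Python sketch illustrates the bootstrapping loop of steps 1 to 6. It is not the original implementation: search_pdf_urls (a PDF-restricted search wrapper), pdf_to_text (e.g. a call to Apache Tika) and pos_tag (e.g. a TreeTagger wrapper) are hypothetical helpers that would have to be supplied.

from collections import Counter

def compile_corpus(domain, search_pdf_urls, pdf_to_text, pos_tag, max_iterations=50):
    """Iteratively bootstrap a specialized corpus for `domain` (sketch).

    search_pdf_urls(query) -> list of PDF URLs (hypothetical search wrapper)
    pdf_to_text(url)       -> plain text of the document (e.g. via Apache Tika)
    pos_tag(text)          -> list of (token, pos, lemma) tuples (e.g. via TreeTagger)
    """
    seen_urls, used_keywords, corpus = set(), set(), []
    query = '"{}"'.format(domain)              # step 1: domain name in quotation marks
    for _ in range(max_iterations):
        new_urls = [u for u in search_pdf_urls(query) if u not in seen_urls]
        if not new_urls:                       # step 6: stop when nothing new is returned
            break
        seen_urls.update(new_urls)
        texts = [pdf_to_text(u) for u in new_urls]   # steps 2-3 / 5: fetch and convert
        corpus.extend(texts)
        # step 4: the two most frequent, previously unused noun lemmas become keywords
        # (here any tag starting with "N" is treated as a noun, an assumption)
        noun_counts = Counter(
            lemma.lower()
            for text in texts
            for _, pos, lemma in pos_tag(text)
            if pos.startswith("N") and lemma.lower() not in used_keywords
        )
        keywords = [w for w, _ in noun_counts.most_common(2)]
        if not keywords:
            break
        used_keywords.update(keywords)
        query = '"{}" {}'.format(domain, " ".join(keywords))  # keep the domain name in the query
    return corpus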

2.3 Evaluation of the tool

We will now compare the terminological density (the proportion of words that are specific to a domain) of the corpus compiled with the automatic method to the terminological density of a corpus built with documents carefully selected by a human, considering that terminological density is usually linked to the degree of specialization of a text. We used a term extractor called TermoStat (Drouin 2003), which compares a specialized corpus to a general corpus and ranks words according to their specificity. Words high on the list are considered terms, which means that they are specific to a specialized subject field (Sager 1990). TermoStat could process a maximum of about 40M tokens.

We used the automatic tool to compile a corpus in French on the topic of endangered species. We stopped the compilation process when the corpus reached 40M tokens. Another corpus on the same topic was compiled manually by a master's student in terminology; it contained about 500k tokens. We then extracted the most specific words of both corpora with TermoStat and compared their terminological density. For the evaluation, we used a specialized dictionary on the environment, the DiCoEnviro, as a gold standard to determine whether the extracted words should be considered terms. We manually evaluated the first 150 term candidates returned by TermoStat and gave a positive score to the ones that had an entry in the DiCoEnviro. We were surprised to obtain a precision of 60% for both corpora. This suggests that our automatic compilation method allows us to build specialized corpora with the same precision, in terms of terminological density, as a corpus built with human-selected documents. In addition to terminological density, the tool seems to give us the large thematic coverage that results from the vast number of documents collected. At the same time, we obtain specialized corpora large enough to search for linguistic patterns using distributional analysis.

We calculated the terminological density for four other automatically built corpora: an English corpus of 25M words on the topic of endangered species, an English and a French corpus of about 40M words on the topic of transportation electrification, and an English corpus of about 40M words on the topic of the greenhouse effect. For these four corpora, we obtained a terminological density varying between 65% and 87% over the first 150 term candidates returned by TermoStat (see Table 1).

Domain                            Language    Terminological density
endangered species                French      60%
endangered species                English     65%
transportation electrification    French      66%
transportation electrification    English     69%
greenhouse effect                 English     87%

Table 1: Terminological density in the automatically built corpora
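As an illustration, the precision computation described above can be sketched as follows in Python; termostat_candidates and dicoenviro_entries are hypothetical placeholders for the ranked TermoStat output and the set of DiCoEnviro entries, which are not distributed with this paper.

def terminological_density(ranked_candidates, gold_terms, k=150):
    """Precision of the first k term candidates against a gold-standard dictionary.

    ranked_candidates: term candidates ranked by specificity (e.g. TermoStat output)
    gold_terms: set of terms that have an entry in the reference dictionary (e.g. DiCoEnviro)
    """
    top_k = ranked_candidates[:k]
    hits = sum(1 for term in top_k if term.lower() in gold_terms)
    return hits / len(top_k)

# Hypothetical usage: 90 of the first 150 candidates have a dictionary entry -> 0.60
# density = terminological_density(termostat_candidates, dicoenviro_entries)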


3. Impact of term frequency

Once we knew that we could build large, high-quality specialized corpora with our automatic method, we needed to assess the impact of corpus size on the quality of the distributional information. We expanded the corpus on the greenhouse effect until it reached a size of about 148M tokens, using the automatic method described above. The memory constraints of our computer would not allow us to build a larger corpus, even though many more documents could have been retrieved from Google. We then extracted twelve predicative terms to evaluate how their occurrence frequency, in corpora of different sizes, influenced the quality of their co-occurrences.

3.1 Corpus compilation and pre-processing

We collected 10,800 English documents, which corresponded to 48 iterations and to about 148M tokens. The corpus was tagged morpho-syntactically and lemmatized with TreeTagger (Schmid 1994), and characters were lowercased. As the compilation was done iteratively, we kept each portion of the corpus as it grew in order to evaluate the effect of corpus size on the quality of the distributional information. We retained five different corpus sizes; the smallest corpus contained 1.5M tokens and the largest about 148M tokens.
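The sketch below shows, under simplifying assumptions, one way to retain cumulative corpus snapshots of increasing sizes while the corpus grows; it is an illustration of the idea, not the code actually used for the experiments, and the target sizes are only examples.

def corpus_snapshots(documents, sizes_in_tokens):
    """Return cumulative corpus slices of (roughly) the requested token sizes.

    documents: list of token lists, in the order in which they were collected
    sizes_in_tokens: increasing target sizes, e.g. [1_500_000, 12_000_000, 148_000_000]
    """
    snapshots, current, total = [], [], 0
    targets = iter(sorted(sizes_in_tokens))
    target = next(targets, None)
    for doc in documents:
        current.extend(t.lower() for t in doc)   # lowercase as in the pre-processing step
        total += len(doc)
        while target is not None and total >= target:
            snapshots.append(list(current))       # freeze a copy once a target size is reached
            target = next(targets, None)
    return snapshots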

3.2 Term extraction

From the smallest corpus, we extracted term candidates with TermoStat. We then selected 12 predicative terms - 3 nouns, 3 verbs and 3 adjectives - at different levels of occurrence frequency (high, medium and low; see Table 2) so that we could track the evolution of their co-occurrences as their frequency increased. Because of Zipf's law, we expected that many terms would occur rarely and that we would need a very large corpus to capture a semantic representation for them (Pomikálek et al. 2009). None of the selected terms whose co-occurrences we would analyze served as keywords during the compilation of the corpus, which means that their frequency was not influenced by their use as keywords.

We computed the frequency of each predicative term in each corpus size. It showed that the relative frequency of the majority of the terms was stable whatever the corpus size. The relative frequency of 30 terms was also computed in the 5 corpus sizes, showing basically the same stability, except for the most frequent terms (see Figure 2). This suggests that in future work, we should probably implement the tracking of the most stable terms (and of course of the most frequent terms) in order to capture at least some basic knowledge of domains.




Predicative term          Nb of occurrences
High frequency terms
  reduce                  1680
  natural                 1586
  reduction               1307
  affect                  1020
Medium frequency terms
  radiative               750
  emit                    581
  agricultural            417
  pollution               360
  conservation            326
Low frequency terms
  mitigate                110
  technological           97
  degradation             40

Table 2: Predicative terms for different levels of frequency in the smallest corpus

[Figure 2: Relative frequency of 30 terms in 5 corpus sizes. The y-axis shows relative frequency; the x-axis shows corpus size in millions of tokens (1.6, 12, 35, 95, 148). Terms plotted: reduce, emit, technological, evaporate, consume, vary, natural, agricultural, degradation, deplete, detect, depend, reduction, pollution, modulate, sequester, melt, management, affect, conservation, inundate, harvest, assess, industrial, radiative, mitigate, pollute, migrate, protect, residential.]


We then created 6 categories corresponding to 6 different levels of occurrence frequency of the predicative terms: less than 500 occurrences, and approximately 8k, 27k, 65k, 127k and 170k occurrences. When a term reached one of these 6 levels, 50 co-occurrences were computed from the corresponding corpus (the corpus in which the term had this level of occurrence frequency).
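A possible way to implement this level tracking is sketched below. The function and variable names are illustrative assumptions, and the "less than 500 occurrences" category is treated here simply as the lowest threshold.

FREQUENCY_LEVELS = [500, 8_000, 27_000, 65_000, 127_000, 170_000]

def first_snapshot_per_level(term, snapshot_counts, levels=FREQUENCY_LEVELS):
    """For each frequency level, return the index of the first corpus snapshot
    in which `term` reaches that many occurrences (or None if it never does).

    snapshot_counts: list of collections.Counter objects, one per corpus size,
                     ordered from the smallest to the largest snapshot.
    """
    result = {}
    for level in levels:
        result[level] = next(
            (i for i, counts in enumerate(snapshot_counts) if counts[term] >= level),
            None,
        )
    return result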

3.3 Frame-based evaluation of co-occurrences

In Frame Semantics, a frame is a schematic representation of a situation or event defined through a specific set of semantic roles and predicates. Semantic roles correspond to the arguments of the predicates and are called "frame elements" (FEs). They are sometimes obligatory (core FEs) and sometimes optional (non-core FEs). Their lexical realizations are said to "evoke" the background knowledge corresponding to the frame and to profile its structure. Because the FEs form the argument structure of the predicates, their realizations are often positioned syntactically very close to it. In the context of this work, we extracted the 50 most frequent co-occurrences of 12 predicative terms to see how the frequency of these terms in different corpora impacts the quality of their co-occurrences.

For the extraction of the co-occurrences, we used a window-based approach that identifies the words surrounding a target within a span. A symmetric window of size 1 was chosen, and a stop list of 960 words and characters was defined, which excluded from the list of co-occurrences function words and recurrent typos resulting from the conversion of PDFs. We used simple log-likelihood (Evert 2007) as an association measure. The variable E corresponds to the expected frequency of co-occurrence, obtained by multiplying the individual frequency of each word (f1, f2) and the span size k, and by dividing the product by the total number of tokens in the corpus N:

E = (f1 × f2 × k) / N

The program then gives a score to each co-occurrence by comparing the observed co-occurrence frequency (O) in the specialized corpus with the expected random co-occurrence frequency (E):

Simple-ll = 2 × (O × log(O/E) - (O - E))

If O is much greater than E, there is a particular attraction between f1 and f2.

The co-occurrences were evaluated for different levels of occurrence frequency. They were attributed a positive score of 1 if they corresponded to a linguistic unit that could either be included in the same semantic frame as the target (i.e. the realization of a core FE, or a paradigmatic relation such as a synonym) or if the target itself was a core FE of the co-occurrence. We chose to give the same score to these different relations as they will be equally important for the future steps, when automatically identifying the semantic roles, discovering neighbor frames, and associating lexical units with each frame. We gave a score of 0.5 to the non-core FEs and a score of 0 to the typos and to the co-occurrences that had no semantic link with the target (see Table 3). At this stage, we did not try to define the names of the frames or to label the semantic roles of the extracted units.
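A minimal Python sketch of the window-based extraction and the simple-ll scoring just described is given below. It assumes the corpus is already tokenized, lemmatized and lowercased, and that the 960-item stop list is available as a set; it illustrates the formulas above rather than reproducing the original program.

import math
from collections import Counter

def cooccurrence_scores(tokens, target, stop_list, window=1):
    """Score the co-occurrences of `target` within a symmetric window of size `window`
    using the simple log-likelihood measure (Evert 2007)."""
    n = len(tokens)
    freq = Counter(tokens)                      # individual frequencies f1, f2
    observed = Counter()                        # observed co-occurrence counts O
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        for j in range(max(0, i - window), min(n, i + window + 1)):
            neighbour = tokens[j]
            if j != i and neighbour not in stop_list:
                observed[neighbour] += 1
    span = 2 * window                           # k: number of positions in the window
    scores = {}
    for word, o in observed.items():
        e = freq[target] * freq[word] * span / n          # E = (f1 * f2 * k) / N
        scores[word] = 2 * (o * math.log(o / e) - (o - e))  # simple-ll
    return scores

# Hypothetical usage: keep the 50 strongest co-occurrences of a target term
# top_50 = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:50]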


Type of semantic link with the target        Score
Nearest neighbor                             1
Core FE realization                          1
Target is a core FE of the co-occurrence     1
Non-core FE realization                      0.5
Error due to text conversion                 0
No semantic link                             0

Table 3: Criteria for the evaluation of co-occurrences

3.4 Results

Following the evaluation criteria listed in Table 3, each of the 50 co-occurrences of the 12 predicative terms (Table 2), at 5 different levels of occurrence frequency, was given a score. Table 4 shows that the frequency of a target dramatically influences the score of its co-occurrences, as the average score increases from 39% to 93%. Table 5 shows a sample of the 50 co-occurrences of the term management, which occurred 170k times in the 148M-word corpus.

Occurrence frequency of the target    Average score of the co-occurrences (out of 50)
less than 500                         39%
≈ 8 000                               60%
≈ 27 000                              76%
≈ 65 000                              83%
≈ 127 000                             83%
≈ 170 000                             93%

Table 4: Impact of occurrence frequency on co-occurrences

Co-occurrence is a core FE of "management":
waste (65844.73), resource (30683.39), risk (25598.10), environmental (24436.46), forest (18948.31), system (18500.96), manure (14516.56), water (13421.86), quality (13123.97), land (8859.81), disaster (7490.42), zone (6295.06)

"management" is a core FE of the co-occurrence:
plan (42105.91), practice (40318.59), strategy (12768.01), district (6856.01), program (4582.13), option (4053.57), action (3884.79), decision (3429.23), agency (2161.90), approach (2087.72), policy (1813.83), regime (1737.55)

Co-occurrence is a non-core FE of "management":
adaptive (4989.41), sustainable (4047.32), integrated (3197.47)

Table 5: Sample of the co-occurrences of management in the 148M-word corpus (simple-ll scores in parentheses)

In the largest corpus of 148M tokens, the most frequent terms occurred about 170k times, while the rarest terms in the same corpus reached a maximum of about 10k occurrences. Considering that the relative frequency of the terms remains quite stable whatever the corpus size, the rarest terms would need roughly 17 times more data to reach the highest frequency level; we can therefore estimate that, in order to get an approximate precision of 90% for all the terms in a specialized corpus, we would need a minimum of 2.5 billion words (148,000,000 × 17 ≈ 2,500,000,000), although it is not certain that constantly adding more documents would ultimately yield more information. We need to compile more documents to validate this estimate, and we must also evaluate the co-occurrences of more terms to confirm our assessment. The part of speech of the predicative terms serving as target words for the evaluation might also influence the results, and this parameter might need to be considered as well. However, the results suggest that occurrence frequency dramatically affects the quality of the distributional information and that a precision as high as 93% could be attained by increasing corpus size.

It is worth mentioning that, due to the memory constraints of the computer, we could not increase the corpus size to billions of words. Still, we know that many more documents were available on the Web, although their number will of course vary depending on the domain and its delimitation. Pomikálek et al. (2009) mention the problem of duplicate and near-duplicate documents that can be found in large corpora and that can bias the distributional results. They suggest detecting duplicate 10-word strings and removing them. This simple algorithm could be added to our method.

4. Future work: representing knowledge with distributional analysis and part-of-speech patterns

Distributional semantic models are very often used to capture co-occurrence patterns in large corpora, following the idea that the distribution of words correlates with the construction of meaning (Sahlgren 2008). Ellis (2002) even states that the input frequencies of linguistic patterns influence language learning at all levels. It is very likely that the repetition of co-occurrences shapes the mind unconsciously and that the process happens extremely fast (Gries and Ellis 2015). This usage-based theory of meaning construction has been put into practice, e.g. by Roth and Lapata (2015), who developed a FrameNet-based semantic role labeling (SRL) system that takes into account the distributional representation of words in their discourse context. Their tool, which is an adaptation of pre-existing SRL systems based on dependency parsing, showed that it was possible to reduce labeling errors by implementing new features to adapt meaning representations to context and to document-specific properties.



We would like to take this idea of adapting the semantic representation of words to their context further by performing distributional analysis only on domain-specific corpora. We suspect that linguistic and knowledge patterns will emerge more easily within the scope and limits of a domain, and that this narrowing of the language should considerably help to resolve the word ambiguities induced by the heterogeneous nature of general corpora. It might also be true that in domain-specific languages, as opposed to the general language, the repetition of fixed structures is frequent and that the search for these structures would give us access to conceptual knowledge.

"An important property of sublanguage structures is that they are discoverable by the application of fixed procedures of finding the regularities of word combination in a field and not on the basis of subjective judgments or of semantic properties that lie beyond the capacity of computers." (Harris and Mattick 1988: 80)

Therefore, we believe that instead of adding new features to pre-existing SRL systems, we should first try to simplify their task by limiting the number of syntactic structures they have to analyze. As previously mentioned, sublanguages tend to have more regulated syntactic structures and tend to be concise and precise, in keeping with their informative purpose. Harris introduced the notion of "kernel sentences", which are simple declarative constructions that contain only one verb (e.g. NV, NVV, NVPN, etc.).

"By combining the kernel sentence structures we have discovered, along with the associated modifiers and quantifiers, and possible connectives to other kernels, we can create a set of canonical structures for capturing the information in the sublanguage text." (Grishman 2001)

Our hypothesis is that by limiting the number of syntactic structures to analyze to the most frequent ones in a specialized corpus, we might reduce the complexity of the SRL annotation task. We could then describe most of the concepts of a domain by describing the lexico-syntactic properties of most terms in the corpus. Once we have compiled a specialized corpus of about 2 billion words, the next steps would be the following:

1. We extract the verbal terms of the corpus (we select verbs because of their predicative nature);

2. For all these verbs, we identify the most frequent part-of-speech patterns (POSP) surrounding them, using a symmetric window of size 1 to capture syntactic n-grams (with a stop list to exclude the function words);

3. We select only the POSP common to all or to the majority of these verbs;

4. To avoid errors caused by incorrect parsing, we do not use parse trees. We label only the specific POSP that were identified in the previous step, according to the criteria listed in Table 6: the adjacent words of each verb are labeled according to their POS and to their position (left or right) relative to it;

5. We integrate new criteria for the labeling of animate and inanimate entities, as well as criteria to detect temporal and location modifiers. A sketch of the labeling step of Table 6 is given after the table.

POS and position relative to the verb    Label
adjacent adverb                          non-core frame element
noun on the right                        core frame element
noun on the left                         core frame element
relational adjective                     core frame element
adjacent verb                            neighbor frame
qualitative adjective                    no semantic link

Table 6: POS labeling criteria
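As an illustration of step 4 and of the criteria in Table 6, the Python sketch below labels the immediate neighbours of a verb from coarse POS tags. The tag names (NOUN, ADV, VERB, ADJ_REL, ADJ_QUAL) are assumptions, not an existing tag set; in particular, distinguishing relational from qualitative adjectives requires lexical resources that are outside this sketch.

# Mapping of (coarse POS, position relative to the verb) to a Table 6 label.
LABELS = {
    ("ADV", "left"): "non-core frame element",
    ("ADV", "right"): "non-core frame element",
    ("NOUN", "right"): "core frame element",
    ("NOUN", "left"): "core frame element",
    ("ADJ_REL", "left"): "core frame element",
    ("ADJ_REL", "right"): "core frame element",
    ("VERB", "left"): "neighbor frame",
    ("VERB", "right"): "neighbor frame",
    ("ADJ_QUAL", "left"): None,   # no semantic link
    ("ADJ_QUAL", "right"): None,
}

def label_neighbours(tagged_sentence, verb_index):
    """Label the words adjacent to a verb.

    tagged_sentence: list of (token, coarse_pos) pairs
    verb_index: position of the verb in the sentence
    Unknown POS/position combinations are also returned as None (no label).
    """
    labels = {}
    for offset, side in ((-1, "left"), (1, "right")):
        j = verb_index + offset
        if 0 <= j < len(tagged_sentence):
            token, pos = tagged_sentence[j]
            labels[token] = LABELS.get((pos, side))
    return labels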

5. Conclusion

We stressed the importance of using sublanguages to represent any type of concept or field of knowledge. Our experiment showed that it is possible to automatically build large specialized corpora of good quality and that an ideal size would be approximately 2.5 billion words. We then suggested labeling only the part-of-speech patterns common to all the verbal terms in the corpus in order to reduce the number of complex structures to annotate. We will of course need to test the labeling criteria on a large corpus and to develop new criteria for the application of semantic role labeling. By using distributional analysis and frame semantics to represent specialized knowledge, we intend both to emulate human cognition and to build data structures useful for computer programs. Our approach is driven by the wish to develop a domain-independent method that can then be applied to any domain-specific corpus. But before trying to abstract any type of knowledge from natural language, we strongly believe that the step of building large, high-quality specialized corpora can absolutely not be neglected.



Acknowledgements

Special thanks to Gabriel Bernier-Colborne, who contributed significantly to the realization of this project by providing the Python programs that were used for the experiments. I also wish to thank Marie-Claude L'Homme for introducing me to the field of terminology and (in alphabetical order) Marco Baroni, Vincent Claveau, Fabrizio Gotti and Philippe Langlais for their very constructive comments.

References

Banko, M. and Brill, E. 2001. Scaling to very very large corpora for natural language disambiguation. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics and the 10th Conference of the European Chapter of the Association for Computational Linguistics, 26-33. Toulouse, France.
Baroni, M. and Bernardini, S. 2004. BootCaT: Bootstrapping corpora and terms from the Web. In Proceedings of the 4th International Conference on Language Resources and Evaluation, 1313-1316. Lisbon, Portugal.
Barrière, C. and Agbago, A. 2006. TerminoWeb: A software environment for term study in rich contexts. In Proceedings of the International Conference on Terminology, Standardization and Technology Transfer, 103-113. Beijing, China.
Bernier-Colborne, G. and L'Homme, M.-C. 2015. Using a distributional neighbourhood graph to enrich semantic frames in the field of the environment. In Proceedings of TIA, 9-16. Granada, Spain.
Crystal, D. 1997. The Cambridge Encyclopedia of Language. Cambridge: Cambridge University Press.
Curran, J. R., Murphy, T. and Scholz, B. 2007. Minimising semantic drift with mutual exclusion bootstrapping. In Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics, 172-180. Melbourne, Australia.
de Groc, C. 2011. Babouk: Focused web crawling for corpus compilation and automatic terminology extraction. In Proceedings of the IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology, 22-27. Lyon, France.
Drouin, P. 2003. Term extraction using non-technical corpora as a point of leverage. Terminology 9(1): 99-117.
Ellis, N. C. 2002. Frequency effects in language processing: A review with implications for theories of implicit and explicit language acquisition. Studies in Second Language Acquisition 24(2): 143-188.
Evert, S. 2006. How random is a corpus? The library metaphor. Zeitschrift für Anglistik und Amerikanistik 54(2): 177-190.
Evert, S. 2007. Corpora and collocations (extended manuscript). In Lüdeling, A. and Kytö, M. eds. Corpus Linguistics: An International Handbook, vol. 2. Berlin/New York: Walter de Gruyter.
Fillmore, C. J. 1982. Frame Semantics. In Linguistics in the Morning Calm, 111-137. Seoul: Hanshin Publishing Co.
Fillmore, C. J. and Baker, C. 2010. A frames approach to semantic analysis. In Heine, B. and Narrog, H. eds. The Oxford Handbook of Linguistic Analysis. Oxford: Oxford University Press.
Gatto, M. 2014. The Web as Corpus: Theory and Practice. London/New York: Bloomsbury.
Gries, S. T. and Ellis, N. C. 2015. Statistical measures for usage-based linguistics. Language Learning 65: 228.
Grishman, R. 2001. Adaptive information extraction and sublanguage analysis. In Proceedings of the Workshop on Adaptive Text Extraction and Mining at the Seventeenth International Joint Conference on Artificial Intelligence. Seattle, USA.
Grishman, R. and Kittredge, R. eds. 1986. Analyzing Language in Restricted Domains: Sublanguage Description and Processing. Lawrence Erlbaum Associates, New Jersey, USA.
Halevy, A. Y. and Madhavan, J. 2003. Corpus-based knowledge representation. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, 1567-1572. Acapulco, Mexico.
Harris, Z. S. and Mattick Jr., P. 1988. Science sublanguages and the prospects for a global language of science. The Annals of the American Academy of Political and Social Science 495: 73-83.
Harris, Z. 1989. The Form of Information in Science: Analysis of an Immunology Sublanguage. Dordrecht: Reidel.
Kilgarriff, A., Rychlý, P., Smrž, P. and Tugwell, D. 2004. The Sketch Engine. In Proceedings of the Ninth EURALEX International Congress, 105-116. Lorient, France.
Langacker, R. W. 1987. Foundations of Cognitive Grammar. Vol. I: Theoretical Prerequisites. Stanford, CA: Stanford University Press.
L'Homme, M.-C., Robichaud, B. and Subirats, C. 2014. Discovering frames in specialized domains. In Proceedings of Language Resources and Evaluation (LREC). Reykjavik, Iceland.
Meyer, I. 2001. Extracting knowledge-rich contexts for terminography. In Bourigault, D., Jacquemin, C. and L'Homme, M.-C. eds. Recent Advances in Computational Terminology, 279-302. Amsterdam/Philadelphia: John Benjamins.
Périnet, A. and Hamon, T. 2014. Analyse et proposition de paramètres distributionnels adaptés aux corpus de spécialité. In 12es Journées internationales d'Analyse statistique des Données Textuelles, 507-518. Paris, France.
Pomikálek, J., Rychlý, P. and Kilgarriff, A. 2009. Scaling to billion-plus word corpora. Advances in Computational Linguistics 41: 3-13.
Roth, M. and Lapata, M. 2015. Context-aware frame-semantic role labeling. Transactions of the Association for Computational Linguistics 3: 449-460.
Sager, J. C. 1990. A Practical Course in Terminology Processing. Amsterdam/Philadelphia: John Benjamins.
Sahlgren, M. 2008. The distributional hypothesis. Italian Journal of Linguistics 20(1): 33-54.
Schmid, H. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the Conference on New Methods in Language Processing, 44-49. Manchester, UK.
Wittgenstein, L. 1953. Philosophical Investigations. Oxford: Blackwell.

