Using Wikipedia to Validate Term Candidates for the Mexican Basic Scientific Vocabulary

Luis Adrián Cabrera-Diego1, Gerardo Sierra1, Jorge Vivaldi2, María Pozzi3

1 Instituto de Ingeniería, Universidad Nacional Autónoma de México, Torre de Ingeniería, Basamento, Av. Universidad 3000, Mexico City, 04510 Mexico
{lcabrerad, gsierram}@iingen.unam.mx
2 Institut Universitari de Lingüística Aplicada, UPF, Roc Boronat 138, 08018 Barcelona, Spain
[email protected]
3 El Colegio de México, Camino al Ajusco 20, Pedregal de Santa Teresa, 10740 Mexico City, Mexico
[email protected]
Abstract. Terms are usually defined as lexical units that designate concepts of a thematically restricted domain. Their detection is useful for a number of purposes, such as building (terminological) dictionaries, text indexing, automatic translation, improving automatic summarisation systems and, in general, any task that involves a domain-specific component. In spite of these numerous applications, term recognition remains a bottleneck for them. Since the 1980s, a considerable amount of research has been conducted on this topic, with limited results. One of the reasons for this is the lack of semantic knowledge implemented in such systems; semantic resources are scarce and difficult to manage. For some time now, large knowledge repositories have been publicly available, opening new opportunities for terminology recognition. This paper presents a new method that uses the free encyclopaedia Wikipedia to obtain the basic scientific vocabulary of Mexican Spanish from a corpus of school textbooks corresponding to high school education. The proposed method has been successfully applied to such Spanish texts to obtain the basic terminology of the domain of mathematics.

Keywords: automatic term recognition, Wikipedia, public knowledge repositories usage, corpus
1 Introduction
Special languages are subsets of a language with special linguistic characteristics. They are used in the different sciences and, in turn, contain the scientific vocabulary (the set of specialised terms). This vocabulary may have many different applications; in our case, we are interested in using it for the production of dictionaries and as a multi-purpose linguistic resource for research. For this, we first need to identify and obtain the basic scientific vocabulary that should ideally be known by the average Mexican speaker at the end of high school.
In order to identify this vocabulary it was necessary to set up a corpus containing selected textbooks for each science and basic school level in Mexico; this corpus is the Corpus of Basic Scientific Texts in Mexican Spanish (COCIEM).

As shown in [2] and [8], there are several methods to automatically extract terms from a corpus. Some of these methods are based on linguistic knowledge, like Ecode [11]. Others are based on statistical measures, such as ANA [5], while yet others combine linguistic knowledge and statistical methods, such as TermoStat [3]. However, these term extractors do not make use of semantic knowledge. Notable exceptions are MetaMap [1] and YATE [12], among others. Since most of the above-mentioned tools are oriented towards the processing of specific domains, we could not directly use any of the existing systems, so some adjustments were necessary to reach our target.

To achieve our goal, we first extracted term candidates (TC) by means of a statistical process. We then validated the list of TC using the method developed by [14], which looks for each term in Wikipedia (WP, www.wikipedia.org). For evaluation purposes, domain experts assessed the full list of TC. For the purpose of this paper, we consider only maths terms occurring in textbooks corresponding to the last three years of school (High School), because this field is well defined in WP, where it can be represented by a single WP category (Mathematics).

In this paper we first introduce the analysed corpus and the exhaustive extraction of TC. We then present the term validation process, followed by the overall results and an evaluation of the methodology using precision and recall. Finally, we discuss the results, some issues we found and some future tasks.
2 Methodology
Fig. 1. Diagram of the project’s methodology
The basic idea of our approach for identifying the target vocabulary was, firstly, to obtain a list of TC included in COCIEM. Secondly, this set of TC was, on the one hand, validated using the domain categories of WP and, on the other, manually evaluated by domain specialists. Finally, both results were compared and jointly assessed. The full proposed methodology is shown in Fig. 1. In the following subsections, each part of the system is described in detail.
2.1 Corpus
COCIEM consists of the scientific textbooks (subjects dealing with the humanities and the social sciences were excluded) used by the greatest number of students in Mexico. They include theoretical texts, laboratory practice books and exercise books that correspond to the current curricula for each school year. The aim was to include a truly representative set of textbooks of all levels of pre-university education. Specifically, COCIEM consists of 92 textbooks (3,671,391 words) classified into three different levels of education and divided in turn into scientific subjects. The levels and subjects of COCIEM are:
─ Elementary School (15 books: 301K tokens): natural sciences (topics related to biology, ecology, physics, health education and anatomy) and mathematics.
─ Junior High (50 books: 2,097K tokens): biology, mathematics, environmental education, physics and chemistry.
─ High School (27 books: 1,273K tokens): biology, mathematics, health education, chemistry, physics and ecology.
2.2 Automatic Identification of Term Candidates
WordSmith Tools 5.0 [10] was used for identifying and extracting the list of TC. This tool was chosen because it is fast, easy to use and offers many sub-tools for processing text documents (such as the creation of word lists and keyword lists). An indexed list of words, which records the position of every word, was produced for each science and level. In the case of High School maths, three different lists were produced, one per school grade (corresponding to algebra, analytic geometry and calculus). Once we had all the indexed word lists, we applied the clustering operation of WordSmith Tools to each of them. This process groups the words, from unigrams to pentagrams, in the order in which they occur in the text, which allowed us to find syntagmatic terms and phraseological units. Once the lists of n-grams for each science were produced, the KeyWords tool was used to extract term candidates by means of the log-likelihood method [4], with a P-value ≤ 0.1. The P-value indicates whether the result obtained can be attributed to chance; its value ranges from 0 to 1. The log-likelihood (LL) method compares two corpora: the corpus from which we want to extract keywords and a reference corpus. The method first calculates the expected values (E) for each corpus (i) and then calculates LL:
E_i(t) = N_i \frac{\sum_i F_i}{\sum_i N_i}    (1)

LL = 2 \sum_i F_i \ln \frac{F_i}{E_i}    (2)
where N is the total number of words in each corpus and F is the frequency of a given word in each corpus. Even though LL always has a positive value, the KeyWords tool marks with a negative value the words that occur more frequently in the reference corpus, and vice versa. We used this method because it allows users to analyse smaller volumes of text and to compare occurrences of both common and not-so-common phenomena. In this process, it was decided that the minimum number of occurrences needed for a candidate to be considered would be 5. This figure was chosen experimentally by balancing the number of TC against the time required to validate the resulting list. Since LL needs a reference corpus, it was decided to create a subcorpus for each word list produced. This subcorpus was a list of words from a subject of the same education level but from an "opposite" subject; this opposition was chosen subjectively. For example, the list of maths words was compared to the list of biology words. The result consists of lists with the keywords of each subject; TCs are the n-grams with positive keyness values only. Once we obtained the lists of TC for each science, these had to be cleaned up. This operation consisted in the elimination of n-grams beginning or ending with stop words (articles, prepositions, pronouns, etc.). This operation, carried out by in-house software, reduced the number of TC considerably (by about 70%). At the end of the process the number of TC was 3,560 for Elementary School, 5,743 for Junior High and 2,481 for High School.
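As an illustration, the following sketch (ours, not part of WordSmith Tools) reproduces the keyness computation of equations (1) and (2) and the boundary filtering described above; the stop-word list and thresholds shown are illustrative assumptions.

```python
import math

def log_likelihood(freq_study, n_study, freq_ref, n_ref):
    """Two-corpus log-likelihood (keyness), equations (1) and (2).
    Positive if the item is over-represented in the study corpus,
    negative if it is over-represented in the reference corpus."""
    # Expected frequencies, equation (1)
    total_freq = freq_study + freq_ref
    total_n = n_study + n_ref
    e_study = n_study * total_freq / total_n
    e_ref = n_ref * total_freq / total_n
    # Log-likelihood, equation (2); 0 * ln(0) is taken as 0
    ll = 0.0
    for f, e in ((freq_study, e_study), (freq_ref, e_ref)):
        if f > 0:
            ll += f * math.log(f / e)
    ll *= 2
    # Sign convention described for the KeyWords tool
    return ll if freq_study / n_study >= freq_ref / n_ref else -ll

# Illustrative stop words used to discard n-grams with function words
# at their boundaries (the actual Spanish list was much larger).
STOP_WORDS = {"el", "la", "de", "en", "y", "que", "un", "una"}

def keep_candidate(ngram, freq, keyness, min_freq=5):
    """Filter applied to the n-gram list: minimum frequency,
    positive keyness and no stop word at either boundary."""
    tokens = ngram.split()
    return (freq >= min_freq
            and keyness > 0
            and tokens[0] not in STOP_WORDS
            and tokens[-1] not in STOP_WORDS)
```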
2.3 Validation of Term Candidates
One of the reasons for the lack of good results in the automatic term recognition field is that, as mentioned in the Introduction, most tools do not use semantic information to validate their results. The reasons lie in the scarcity of such resources and in the difficulty of managing this kind of information. The exceptions (see Section 1) either take advantage of the semantic information provided by a large monolingual resource for a single domain (UMLS for medicine) or try to adapt already existing general resources to a few domains (EWN). A promising alternative is the use of encyclopaedias as knowledge sources, and the obvious choice is WP, a free multilingual resource with high coverage in many domains. WP is by far the largest encyclopaedia in existence, with more than 3.5 million articles in its English version, to which thousands of volunteers have contributed. WP has grown exponentially since its creation. There are versions in more than 200 languages, although their coverage is very uneven. As shown in Fig. 2, for any given language, WP is organised into two connected graphs: the category graph and the page graph. On the one hand, the category graph is organised as a taxonomy; each category may be connected to an arbitrary number of super/sub categories, and such connections may often be considered hyperonym/hyponym links. On the other hand, articles are linked among themselves, forming a directed graph. Both graphs are connected because every article is assigned to one or more WP categories (through "category links") in such a way that
categories can be seen as classes linked to the pages belonging to them; see [15] for an interesting analysis of both graphs. This bi-graph structure of WP is far from error-proof. Category links do not always denote the category to which the article belongs. For example, the category "Geometric Shapes" is correctly linked to pages like "Ellipse" or "Cube", but also to "Generatrix" (a geometric element that generates a geometric shape). A similar problem occurs with links between categories, since these do not always denote a relation of hyperonymy/hyponymy, so the structure shown on the left of Fig. 2 is not a real taxonomy. Due to its encyclopaedic nature, some categories in WP are used for structuring the database (e.g. "scientists by country", "mathematics timelines", etc.) while others are used for monitoring its status (e.g. "All articles lacking sources", "Articles to be split", etc.). It therefore becomes rather difficult, just by navigating through the structure, to discover which entry belongs to which domain.
Fig. 2. Wikipedia's internal organisation: the category graph, the page graph and auxiliary data (redirection, disambiguation and interwiki tables)
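To make the structure of Fig. 2 concrete, here is a minimal and purely hypothetical data model for the two graphs and the links that connect them; the real data come from the WP database dump described below.

```python
from dataclasses import dataclass, field

@dataclass
class Category:
    """Node of the category graph; links to super-categories are the
    (often, but not always, hyperonymic) category-to-category links."""
    name: str
    super_categories: list[str] = field(default_factory=list)

@dataclass
class Page:
    """Node of the page graph; every page is assigned to one or more
    categories, which is what connects the two graphs."""
    title: str
    categories: list[str] = field(default_factory=list)
    links_to: list[str] = field(default_factory=list)

# Hypothetical fragment of the Spanish WP around the maths domain
categories = {
    "Matemáticas": Category("Matemáticas", ["Ciencia"]),
    "Geometría": Category("Geometría", ["Matemáticas"]),
    "Figuras geométricas": Category("Figuras geométricas", ["Geometría"]),
}
pages = {
    "Elipse": Page("Elipse", ["Figuras geométricas"], ["Cono", "Foco"]),
}
```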
In spite of the above-mentioned difficulties, WP has been extensively used for a number of NLP tasks, such as lexical and conceptual information extraction, building or enriching ontologies, deriving domain-specific thesauri and semantic tagging, among others. See [7] or [6] for excellent surveys of this resource and its applications to NLP. Extracting information from WP can be done in several ways: i) using a Web crawler and an HTML parser; ii) using an API for online access; and iii) using a database obtained from WP dumps. The work presented in this paper uses a database created from the WP dumps, as described in [16], dating from May 2009. The full procedure for terminology validation used in this approach is similar to the ones used by YATE and EWN (see [12]). It starts by defining the domain of interest as one or more WP categories. We call such categories domain borders, under the assumption that all their subcategories belong, to some degree, to the domain. Usually such a domain border coincides with the domain name (for example, "Economics" or "Chemistry"), but sometimes it may be necessary to use more than one WP category to define a domain (e.g. "Computer Science" requires both the "Computer Science" and "Electronics" categories). From this point, for every TC, the procedure consists in: i) finding a WP page corresponding to the TC; ii) finding all WP categories associated with that page; iii) exploring WP recursively, following all super-category links found in the previous step, until the domain border (or the WP top) is reached; and iv) sorting the list of TC according to their domain coefficient (DC). We used the information collected during this exploration to define a DC for every TC as a way to calculate its termhood (that is, the degree to which the TC belongs to the domain). The DC is conveniently
weighted according to the way in which the TC is found in WP. The basic calculation of the DC is based on the number of paths to the top and is obtained by applying the following formula:
DCnc (t )
NPdomain(t ) NPtotal (t )
(3)
where NPdomain(t) is the number of paths to the domain category and NPtotal(t) is the number of paths to the top. Therefore, the value of this DC ranges from 0 (none of the paths to the top go through the domain border nodes) to 1 (all the paths to the top go through the domain border nodes). In a similar way, two more DC calculations were defined: one based on the number of single steps necessary to reach the top or the domain border, and another (DClmc) based on the average path length to reach the top or the domain border (see [14]). Fig. 3 shows the portion of the Spanish WP category graph (and the DCs) corresponding to the ambiguous term "variable dependiente" (dependent variable), for which only some paths to the top go through the domain borders, while Fig. 4 shows the unambiguous term "derivada" (derivative), for which all the paths to the top go through the domain borders. The ambiguity status, as well as the graph itself, may change across languages and/or across different WP releases for the same language.
Fig. 3. Wikipedia graph for the term "variable dependiente" (dependent variable)
Fig. 4. Wikipedia graph for the term “derivada” (derivative)
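The path-based DC of equation (3) can be sketched as follows. This is a simplified illustration, not the implementation of [14]: the category fragment is hypothetical and the real procedure additionally weighs the way in which the TC was found in WP.

```python
# Hypothetical fragment of the category graph: each category is mapped
# to its super-categories. "TOP" stands for the root of the category
# graph and "Matemáticas" acts as the domain border. (A complete
# implementation must also guard against cycles, which do occur in WP.)
SUPER_CATEGORIES = {
    "Cálculo diferencial": ["Matemáticas"],
    "Matemáticas": ["TOP"],
    "Magnitudes físicas": ["Física"],
    "Física": ["TOP"],
}
DOMAIN_BORDERS = {"Matemáticas"}

def count_paths(category, to_domain):
    """Count upward paths from `category` that stop at a domain border
    (to_domain=True) or that reach either the border or the top
    (to_domain=False)."""
    if category in DOMAIN_BORDERS:
        return 1                       # the border stops the exploration
    if category == "TOP":
        return 0 if to_domain else 1   # top reached without crossing the border
    return sum(count_paths(c, to_domain)
               for c in SUPER_CATEGORIES.get(category, []))

def domain_coefficient(page_categories):
    """DC(t) = NPdomain(t) / NPtotal(t); -1 means the TC has no WP page."""
    if not page_categories:
        return -1
    np_domain = sum(count_paths(c, True) for c in page_categories)
    np_total = sum(count_paths(c, False) for c in page_categories)
    return np_domain / np_total if np_total else 0

# "derivada" is categorised under differential calculus only, so every
# path to the top crosses the maths border: DC = 1.0
print(domain_coefficient(["Cálculo diferencial"]))                        # 1.0
# an ambiguous TC also categorised under physics gets DC = 0.5
print(domain_coefficient(["Cálculo diferencial", "Magnitudes físicas"]))  # 0.5
```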
3 Results
DC values may be divided into four groups: i) DC(t) = 1; ii) 1 > DC(t) > 0; iii) DC(t) = 0; and iv) DC(t) = -1. The first group corresponds to TCs that clearly belong to the domain (for example "addition", "algebra", etc.). The second group corresponds to TCs that also belong to other domains (such as "formula", "exercise" or "commutative property"); in general, the higher the value, the stronger the relation to the domain. The third group corresponds to TCs that, according to WP, do not belong to the domain (e.g. "activity", "art"), and the last group indicates that the TC was not found in WP (such as "slices" or "secondary"). The results have been evaluated using the standard measures of precision and recall. For this purpose, the resulting TCs were validated by a team of engineering students (because of their deep domain knowledge) and linguists (because of their good understanding of terminology). It should be noted that the evaluation has been carried out over the list of TCs resulting from the extraction process (see Section 2.2) rather than over all the terms actually occurring in the text. This was unavoidable given the size of the documents under consideration (almost 500K words). Therefore, domain terms with a negative keyness value and/or with a frequency lower than the chosen threshold are not present in the evaluation list. See
[13] for a discussion of measures and issues in the evaluation of term extraction systems. The results of the evaluation of our system are shown in Fig. 5, split into four precision vs. recall plots: the first three correspond to each pattern (noun, noun-adjective and noun-preposition-noun) and the last one results from merging all the patterns. Taking into consideration the results published for other term extractors (such as those mentioned in the Introduction), we consider our results to be excellent, as shown in the four plots. In all cases, our system reaches 100% precision with relatively high values of recall (at least 20%). The differences among the DC calculations are minimal, leaving aside the one-word units, which by their own nature tend to be more ambiguous than multi-word units; disambiguation is complicated and is required much more frequently for one-word units than for any other sequence.
Fig. 5. Precision-recall curves for the three DC variants (CDwp_nc, CDwp_lc and CDwp_lmc) and the patterns: a) noun, b) noun-adjective, c) noun-prep-noun, d) all patterns
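For readers who wish to reproduce this kind of evaluation, the following sketch (not the code used in this work) shows how precision-recall points can be obtained by sweeping a cut-off down the DC-ranked candidate list against an expert-validated gold standard; the data are purely illustrative.

```python
def precision_recall_curve(ranked_candidates, gold_terms):
    """ranked_candidates: list of (term, dc) sorted by decreasing DC.
    gold_terms: set of candidates accepted by the domain experts.
    Returns one (precision, recall) point per cut-off position."""
    points = []
    true_positives = 0
    for i, (term, _dc) in enumerate(ranked_candidates, start=1):
        if term in gold_terms:
            true_positives += 1
        precision = true_positives / i
        recall = true_positives / len(gold_terms)
        points.append((precision, recall))
    return points

# Illustrative data only
ranked = [("derivada", 1.0), ("álgebra", 1.0), ("actividad", 0.0)]
gold = {"derivada", "álgebra"}
for p, r in precision_recall_curve(ranked, gold):
    print(f"precision={p:.2f} recall={r:.2f}")
```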
4 Limitations and Issues
The procedure followed presents some issues. The first one is related to the lack of lemmatization; this task was left to the redirection mechanism built into WP itself.
See for example the TCs "diameter" and "diameters": through the redirection table, both refer to the same article in WP. It may happen, however, that such a link is missing, resulting in an unknown TC. Another issue concerns the process chosen to find candidates, which does not always yield the right term (for example, the candidate is Newton-Raphson instead of the actual term Método de Newton-Raphson). Some issues also arise from WP itself. The relational database obtained from the snapshot is not perfect: following a given link we may reach a page other than the expected one, and some pages have no category registered at all (e.g. hipérbola, ecuación lineal). Finally, as with any manual task, some actual terms are missing from the validated list, causing false errors.
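As a sketch of how the redirection mechanism stands in for lemmatization, assuming simple dictionary tables built from the dump (the table names and entries are hypothetical): when a variant lacks a redirect entry, the missed-link case described above, the TC remains unresolved.

```python
# Hypothetical tables extracted from the WP dump
REDIRECTS = {"diámetros": "Diámetro"}   # plural redirected to the lemma
PAGES = {"Diámetro": ["Geometría"]}     # page title -> its categories

def resolve_page(term_candidate):
    """Return the categories of the page for a TC, following at most
    one redirection; None means the TC was not found in WP (DC = -1)."""
    title = term_candidate.capitalize()
    title = REDIRECTS.get(term_candidate, REDIRECTS.get(title, title))
    return PAGES.get(title)

print(resolve_page("diámetros"))   # ['Geometría'] via the redirect
print(resolve_page("rebanadas"))   # None: unknown TC
```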
5 Conclusions and Future Work
The experiment to extract the basic scientific vocabulary of Mexican Spanish allowed us to obtain a significant segment of the maths vocabulary. In spite of the difficulties and inconsistencies found in WP, the experiment allowed us to verify that this resource can be useful to validate term candidates in a given domain. This has been possible by defining a domain as a set of WP categories and a domain coefficient with respect to that domain. These results are better than those published for similar tools (see [2] and [13] for examples). The reason for this may be the specialisation level of the documents analysed, as they cannot be considered highly specialised.1

In the future we plan to perform an automatic term recognition process for all scientific fields included in the corpus. This will allow us to compare different educational levels as well as different domains. We still have to find a better way to determine the categories to be used to validate terms, in accordance with the subject and the school level. We also need to improve the way terms are extracted from the textbooks, in order to reduce the minimum number of occurrences and the quantity of wrong n-grams. As regards the TC lists, we still have to find better ways to validate them, use another method to create them, or even perform a manual evaluation of the texts. The results also show that we need to improve the treatment given to the disambiguation pages of WP, as well as to find a solution for those terms that are not present as page or category names but are included in the text of articles.
Acknowledgments. This research has received support from CONACyT (Mexico) for project No. 000000000058923 and from the Science and Education Ministry (Spain) for the RicoTerm project (HUM2007-65966-C02-01/FILO).

1 According to [9], such documents may be considered as expert-to-novice communication. These texts usually give good term explanations to ensure the reader's understanding.
References
1. Aronson, A., Lang, F.: An overview of MetaMap: historical perspective and recent advances. JAMIA 17, 229--236 (2010).
2. Cabré, M.T., Estopà, R., Vivaldi, J.: Automatic term detection. A review of current systems. Recent Advances in Computational Terminology 2, 53--87 (2001).
3. Drouin, P.: Term extraction using non-technical corpora as a point of leverage. Terminology 9(1), 99--115 (2003).
4. Dunning, T.: Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19(1), 61--74 (1993).
5. Enguehard, C., Pantera, L.: Automatic Natural Acquisition of a Terminology. Journal of Quantitative Linguistics 2(1), 27--32 (1994).
6. Gabrilovich, E., Markovitch, S.: Wikipedia-based Semantic Interpretation for Natural Language Processing. Journal of Artificial Intelligence Research 34, 443--498 (2009).
7. Medelyan, O., Milne, D., Legg, C., Witten, I.: Mining meaning from Wikipedia. International Journal of Human-Computer Studies 67(9), 716--754 (2009).
8. Pazienza, M.T., Pennacchiotti, M., Zanzotto, F.M.: Terminology Extraction: An Analysis of Linguistic and Statistical Approaches. Studies in Fuzziness and Soft Computing 185, 255--279 (2005).
9. Pearson, J.: Terms in Context. John Benjamins Publishing, Amsterdam (1998).
10. Scott, M.: WordSmith Tools. Oxford University Press, Oxford (1996).
11. Sierra, G., Alarcón, R., Aguilar, C., Bach, C.: Definitional verbal patterns for semantic relation extraction. Terminology 14(1), 74--98 (2008).
12. Vivaldi, J.: Extracción de candidatos a término mediante combinación de estrategias heterogéneas. PhD Thesis, Universitat Politècnica de Catalunya (2001).
13. Vivaldi, J., Rodríguez, H.: Evaluation of terms and term extraction systems: A practical approach. Terminology 13(2), 225--248 (2007).
14. Vivaldi, J., Rodríguez, H.: Using Wikipedia for term extraction in the biomedical domain: first experiences. Procesamiento del Lenguaje Natural 45, 251--254 (2010).
15. Zesch, T., Gurevych, I.: Analysis of the Wikipedia Category Graph for NLP Applications. In: Proceedings of the TextGraphs-2 Workshop (NAACL-HLT 2007), pp. 1--8 (2007).
16. Zesch, T., Müller, C., Gurevych, I.: Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary. In: LREC 2008, pp. 1646--1652. European Language Resources Association (ELRA), Marrakech (2008).