Expert Systems with Applications 40 (2013) 2530–2540


Term extraction from sparse, ungrammatical domain-specific documents

Ashwin Ittoo a,*, Gosse Bouma b

a Department of Operations, Faculty of Economics and Business, University of Groningen, The Netherlands
b Computational Linguistics (Information Science), Faculty of Arts, University of Groningen, The Netherlands

Keywords: Term extraction; Natural language processing; Text mining; Business intelligence; Product development-customer service

Abstract

Existing term extraction systems have predominantly targeted large and well-written document collections, which provide reliable statistical and linguistic evidence to support term extraction. In this article, we address the term extraction challenges posed by sparse, ungrammatical texts with domain-specific contents, such as customer complaint emails and engineers' repair notes. To this end, we present ExtTerm, a novel term extraction system. Specifically, as our core innovations, we accurately detect rare (low frequency) terms, overcoming the issue of data sparsity. These rare terms may denote critical events, but they are often missed by extant TE systems. ExtTerm also precisely detects multi-word terms of arbitrary length, e.g. terms with more than 2 words. This is achieved by exploiting fundamental theoretical notions underlying term formation, and by developing a technique to compute the collocation strength between any number of words. Thus, we address the limitation of existing TE systems, which are primarily designed to identify terms with 2 words. Furthermore, we show that open-domain (general) resources, such as Wikipedia, can be exploited to support domain-specific term extraction, and can thus compensate for the unavailability of domain-specific knowledge resources. Our experimental evaluations reveal that ExtTerm outperforms a state-of-the-art baseline in extracting terms from a domain-specific, sparse and ungrammatical real-life text collection.

© 2012 Elsevier Ltd. All rights reserved.

1. Introduction

Recent years have witnessed a proliferation of unstructured text data. According to several studies (Blumberg & Atre, 2003; Russom, 2007), unstructured texts, in the form of customer complaint emails or engineers' repair notes (e.g. job-sheets), constitute an overwhelming 80% of all corporate data. Buried within these massive amounts of corporate texts are meaningful information nuggets, such as pertinent domain-specific terms. For example, the term "proximity sensor", appearing in engineers' repair notes, indicates that the product component2 "proximity sensor" experienced a malfunction and had to be serviced by an engineer. Efficiently and effectively (accurately) detecting such terms in large repositories of texts is crucial to a wide range of corporate activities. For example, in Product Development-Customer Service (PD-CS) organizations, terms that designate products are useful in "cost of non-quality" analyses for determining which products contribute most significantly to maintenance costs.

* Corresponding author. Address: Nettelbosje 2, 9747 AE Groningen, The Netherlands. Tel.: +31 (0)50 363 3853. E-mail addresses: [email protected] (A. Ittoo), [email protected] (G. Bouma).
1 Address: Oude Kijk in 't Jatstraat 26, 9712 EK Groningen, The Netherlands.
2 We will use "product" to denote both actual products and their components/parts.
http://dx.doi.org/10.1016/j.eswa.2012.10.067

Terms also provide valuable information on which products fail recurrently and require frequent servicing. This information can subsequently be exploited to improve the product development process, resulting in better quality products and more satisfied customers.

In Natural Language Processing (NLP), several techniques exist for extracting terms from large collections of general and bio-medical texts (Ahmad, Davies, Fulford, & Rogers, 1994; Chung, 2003; Dagan & Church, 1994; Frantzi & Ananiadou, 1999; Frantzi, Ananiadou, & Mima, 2000; Uchimoto, Sekine, Murata, Ozaku, & Isahara, 2001). These techniques often rely on external knowledge resources, such as ontologies, which benefits their accuracy (Hiekata, Yamato, & Tsujimoto, 2010; Zhang, Yoshida, & Tang, 2009). However, most extant term extraction (TE) algorithms are inadequate for the challenges posed by domain-specific texts, such as those in corporate domains like PD-CS. A major challenge is the sparse nature of these texts, which offer little reliable statistical evidence and thus severely compromise the algorithms' performance. This difficulty is further compounded by the lack of comprehensive domain-specific knowledge resources, e.g. corporate ontologies, which are difficult to create and to maintain (Auger & Barriere, 2010; Blohm & Cimiano, 2007; Lapata & Lascarides, 2003; Maynard & Ananiadou, 2000; Pecina & Schlesinger, 2006). Another challenge is the detection of multi-word terms, especially those comprising more than 2 words, such as "frequency convertor control board". In addition, there is the issue of ambiguous


constructs, such as "device is regulating switch" ("the device is the regulating switch" vs. "the device is regulating the switch"), and that of incoherent phrases, such as "customer helpdesk collimator shutter". We will elaborate on these difficulties in Section 2.4.

In response to the above challenges and to the growing need of organizations to extract terms from corporate texts, we present ExtTerm, a novel framework for domain-specific TE. Our core contributions are as follows:

- ExtTerm accurately detects terms from sparse, domain-specific text collections that do not offer reliable statistical evidence, thereby overcoming the issue of data sparsity.
- We extract arbitrarily long terms, including those containing more than 2 words.
- Our term extraction approach is unsupervised, eschewing the need for domain-specific knowledge resources (e.g. ontologies), which are not always available and are expensive to construct manually. Instead, we only rely on a readily-available resource, viz. Wikipedia, as a knowledge base. This also shows that readily-available resources, such as Wikipedia, despite their general contents, can still be exploited for domain-specific TE, and can thus compensate for the lack of domain-specific resources.
- ExtTerm accurately discriminates between valid terms and other ambiguous and incoherent expressions. Many of the latter expressions tend to exhibit some of the core properties of terms, and are thus incorrectly extracted by most existing term extraction systems.

In our experiments, we evaluate the performance of ExtTerm over a real-life, domain-specific text collection, which was provided by our industrial partners (multi-national corporations based in The Netherlands). The results reveal that ExtTerm achieves a very high accuracy level in domain-specific TE, and even outperforms the state-of-the-art algorithm of Frantzi et al. (2000).

This article is organized as follows. In Section 2, we present some basic notions associated with terms, describe existing work on automatic term extraction, and highlight the challenges posed by domain-specific, sparse, and informally-written texts. In Section 3, we develop our proposed methodology for term extraction from domain-specific documents. Experimental evaluations and performance comparisons against baselines are given in Section 4. We conclude and discuss areas of future work in Section 5. In the remainder of this article, we refer to multi-word terms, e.g. "proximity sensor", as complex terms, and to single-word terms, e.g. "footswitch", as simple terms. We will also use "text collection", "texts" and "corpus" (plural: "corpora") interchangeably.

2. Preliminaries and related work

2.1. (Domain-specific) terms vs. (general) words

Terms are formally defined as lexical manifestations of domain-specific concepts. Thus, unlike general words, such as "Friday meeting", terms convey a particular meaning in a given domain. For instance, the term "frequency converter control box" designates a specific product in the domain of Product Development-Customer Service (PD-CS) organizations. At the surface level, however, terms and general words are indistinguishable from each other, since they are both realized as strings. Furthermore, as shown in past studies in terminology, terms tend to adopt all the word formation rules of a language (Sager, 1998). Two properties that enable us to discriminate between terms



and general words are unithood and termhood.

2.2. Properties of terms: unithood and termhood

2.2.1. Unithood

Unithood determines whether an expression is well-formed and behaves as a coherent, atomic linguistic unit (Pazienza, Pennacchiotti, & Zanzotto, 2005). A well-formed expression adheres to a certain syntactic structure, such as that of a noun phrase, e.g. "frequency converter control box". An expression is a coherent and atomic unit if its individual words tend to co-occur (together) more often than spuriously, i.e. the words are strongly collocated. For example, "frequency converter control box" and "proximity sensor" are coherent units, while "customer amplification" is not. By definition, the unithood property is applicable only to complex terms. Simple terms always have perfect unithood since they consist of only one word (Kageura & Umino, 1996).

2.2.2. Termhood

Termhood determines whether an expression is representative of a domain. For example, "frequency converter control box" is representative of the PD-CS domain, while "Friday meeting" is not.

2.3. Automatic term extraction approaches

Natural Language Processing (NLP) techniques for term extraction (TE) are based on three main approaches, namely linguistic, statistical, and hybrid, as briefly described below. For a more comprehensive review of these techniques, we refer the reader to the work of Pazienza et al. (2005).

2.3.1. Linguistic techniques

Linguistic techniques operate on the premise that, since terms denote specific concepts, they can only be realized by certain word types (Wright & Budin, 1997). Linguistic techniques are often implemented as Part-of-Speech (POS) filters, such as that of Justeson and Katz (1995), which accepts, as terms, any noun sequences containing optional adjectives and/or prepositions. We will revisit this filter in our experiments (Section 4). Since linguistic techniques rely on syntactic structure, they identify terms according to the unithood property.

2.3.2. Statistical techniques

Statistical techniques, such as those of Pecina and Schlesinger (2006), estimate the unithood of 2-word expressions, e.g. "proximity sensor", by computing the collocation strength between the word pairs. This is achieved using well-known Lexical Association Measures (LAM), such as mutual information (Church & Hanks, 1990). Petrovic, Snajder, and Basic (2010) extended several traditional LAMs for estimating the unithood of expressions with at most 4 words. Termhood, in turn, is statistically determined based on the observation that highly frequent expressions in a domain-specific corpus are likely to denote relevant terms. Popular statistical techniques for termhood estimation include the "term frequency-inverse document frequency" (tf-idf) (Salton, 1991) and the C-value (Frantzi et al., 2000). Another termhood estimation technique is corpus comparison, in which a domain-specific corpus is compared against a collection of general texts. Expressions that are more likely in the domain-specific corpus are then treated as domain-specific terms. Several corpus comparison techniques for term extraction are mentioned in the literature, such as those of Ahmad et al. (1994), Chung (2003), Drouin (2003), Rayson and Garside (2000) and Romero, Moreo, Castro, and Zurita (2012).
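To make the notion of collocation strength concrete, the following is a minimal sketch of the mutual information measure of Church and Hanks (1990) for a 2-word expression. The counts are invented for illustration; they are not taken from the corpora used in this article.

```python
import math

def mi(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """Mutual information of a word pair (Church & Hanks, 1990):
    the log-ratio of the observed co-occurrence probability to the
    probability expected under independence. n is the corpus size."""
    return math.log((f_xy / n) / ((f_x / n) * (f_y / n)))

# "proximity sensor": the two words nearly always appear together,
# so the pair scores as a strong collocation (coherent unit).
print(mi(f_xy=40, f_x=60, f_y=80, n=1_000_000))   # large positive score
# "customer amplification": frequent words, a single chance co-occurrence.
print(mi(f_xy=1, f_x=500, f_y=900, n=1_000_000))  # much weaker score
```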


2.3.3. Hybrid techniques

Hybrid approaches combine a linguistic component, for identifying syntactically well-formed expressions, and a statistical component, for weighting the identified expressions. A state-of-the-art hybrid system is that of Frantzi et al. (2000). Its linguistic filter is similar to that of Justeson and Katz (1995). Its statistical component relies on the C-value algorithm (Frantzi et al., 2000) to compute the termhood of expressions and to select term candidates. We will employ this technique as a baseline in our experiments (Section 4).

2.4. Domain-specific term extraction challenges

In this section, we elaborate on the shortcomings of extant TE techniques, which make them inadequate for term extraction from domain-specific texts, such as the repair notes of engineers.

2.4.1. Challenge 1: silence

Domain-specific texts are sparse, and do not provide sufficient statistical evidence to facilitate the detection of terms. For example, most existing TE algorithms are unable to identify rare terms, such as those denoting product failures that are mentioned only once in an entire corpus. This phenomenon, whereby infrequent but important terms are rejected or missed, is known as silence, and affects the recall of TE systems (Auger & Barriere, 2010; Blohm & Cimiano, 2007; Lapata & Lascarides, 2003; Maynard & Ananiadou, 2000; Pecina & Schlesinger, 2006).

2.4.2. Challenge 2: absence of knowledge resources

To facilitate the identification of terms in general (open-domain) or bio-medical documents, several TE algorithms (Aubin & Hamon, 2006; Hoste, Vanopstal, & Lefever, 2007; Xu, Kurz, Piskorski, & Schmeier, 2002; Zhang et al., 2009) rely on external knowledge resources, such as WordNet (Fellbaum, 1998), GermaNet (Hamp & Feldweg, 1995), and the Gene Ontology (Ashburner et al., 2000). However, such knowledge resources are scarce in most specialized disciplines, particularly in corporate domains like PD-CS. Their creation is also considered too tedious and time-consuming to be viable. The unavailability of (domain-specific) knowledge resources compounds the difficulties of domain-specific TE. One alternative is to exploit other readily available resources. Wikipedia, in particular, represents a promising alternative, and has received significant attention from the NLP community in recent years (Medelyan, Milne, Legg, & Witten, 2009). However, resources such as Wikipedia deal with general-purpose topics, and may not be appropriate to support domain-specific term extraction.

2.4.3. Challenge 3: complex terms

Most current TE techniques are designed to extract 2-word terms, such as "proximity sensor". However, they fail to identify longer terms, such as "frequency converter control box", which are more common and account for 85% of all terms in domain-specific corpora (Nakagawa & Mori, 1998).

2.4.4. Challenge 4: informal/ungrammatical language (noise)

Domain-specific corpora are often expressed in a terse language, giving rise to ambiguity. For example, the phrase "device is cooling fan" can be interpreted correctly as "the device is the cooling fan" or incorrectly as "the device is cooling the fan". Under the latter interpretation, most linguistic filters will identify only the noun "fan" as a term; they miss "cooling" since it is taken to be a progressive verb. Another common type of linguistic incoherence is the omission of punctuation symbols, such as periods (".") to indicate the end of sentences.

For example, omitting the period between "customer helpdesk" and "collimator shutter" yields the construct "customer helpdesk collimator shutter". Such unintelligible phrases tend to exhibit the syntactic structure of valid terms, such as noun phrases, and they may even occur more frequently than valid terms. Consequently, they are often incorrectly selected as valid terms by most TE systems.

It is worth noting that Romero et al. (2012) also propose a technique to extract terms from support (FAQ) documents, which exhibit similar characteristics (sparse, ungrammatical) to the texts we deal with in this article. However, they do not directly address challenges such as the effects of noise and silence, or the detection of complex terms with more than 2 words. Another related technique is that of Hiekata et al. (2010) for detecting terms from shipyard reports. However, they rely on the existence of an ontology, which may not be available in all domains. Our approach for overcoming the aforementioned difficulties in domain-specific term extraction is described next.

3. ExtTerm framework for term extraction

The overall architecture of our proposed ExtTerm framework for term extraction is shown in Fig. 1. Blank arrows represent inputs and outputs; filled arrows depict the processing performed by the various phases of ExtTerm. These phases will be described in the next sections, and illustrated using actual text snippets from a real-life domain-specific corpus.

3.1. Document pre-processing

Document pre-processing converts the input (domain-specific) documents into a format more amenable to the subsequent stages of ExtTerm. It involves two basic operations: Data Cleaning and Linguistic Pre-processing.

3.1.1. Data Cleaning

During Data Cleaning, we discard all extraneous contents, which can impede the accurate discovery of meaningful terms from the text collection. Our cleaning mechanisms are implemented as regular-expression wrappers, which detect and discard entities that are unlikely to be content-bearing. Examples include undesirable symbols (e.g. "^"), meta-data (e.g. email headers), and other "noisy" strings that domain experts considered redundant (e.g. "external remarks"). We also normalize punctuation by keeping only one instance of a punctuation symbol (e.g. "?") when it appears in a sequence (e.g. "???"). Sample cleaning activities performed by ExtTerm, the text snippets on which they are applied, and the transformed, cleaned texts are listed in Table 1.

3.1.2. Linguistic Pre-processing

In Linguistic Pre-processing, the Part-of-Speech (POS) tags of the tokens (e.g. words) in the cleaned texts are obtained using the Stanford POS-tagger (Toutanova, Klein, Manning, & Singer, 2003). Table 2 depicts two text segments and the POS of their individual tokens (delimited with the pipe "|" character). We use "N" as the POS-tag for all classes of nouns (e.g. plural, singular, proper), "P" for prepositions, and "A" for adjectives. All other POS-tags are as originally defined in the Penn Treebank (Marcus, Marcinkiewicz, & Santorini, 1993). For example, "VBG" refers to a verb in the progressive tense. Despite its robustness, POS-tagging is not always error-free, especially when applied to informally-written domain-specific texts. In particular, ambiguous constructs, in which "is/are" immediately precedes a gerund noun, represent a common source of POS-tagging errors. For example, as shown in Table 2, most POS-taggers fail to determine that the gerund "regulating" in "problem is regulating switch" corresponds to a noun (N). Instead, they almost always incorrectly tag "regulating" as a progressive verb (VBG), having as syntactic subject and object the nouns "problem" and "switch" respectively, as in "(the) problem is regulating (the) switch". This happens because such subject-verb-object triples closely resemble the standard language models on which the POS-taggers are trained. (We appended "*" to "regulating|VBG" in Table 2 to indicate a POS-tagging error.) The pre-processed documents are next fed to the Linguistic Filtering phase of ExtTerm, described below.
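As an illustration of the Data Cleaning step, the snippet below sketches regular-expression wrappers of the kind described in Section 3.1.1. The concrete patterns and sample strings are our own assumptions, chosen to mirror the examples in Table 1; they are not the actual rules used in ExtTerm.

```python
import re

def clean(text: str) -> str:
    """Illustrative regex wrappers in the spirit of ExtTerm's Data Cleaning
    step; the patterns below are assumptions, not the authors' actual rules."""
    # Drop timestamp/meta-data prefixes such as "080107 1650: External Remarks ->"
    text = re.sub(r"\b\d{6}\s+\d{4}:\s*External Remarks\s*->\s*", "", text)
    # Remove stray symbols that are unlikely to be content-bearing
    text = re.sub(r"\^+", "", text)
    # Normalize punctuation: collapse runs like "???" to a single "?"
    text = re.sub(r"([?!.])\1+", r"\1", text)
    # Drop enclosing brackets but keep their content
    text = re.sub(r"\((.*?)\)", r"\1", text)
    return re.sub(r"\s{2,}", " ", text).strip()

print(clean("080107 1650: External Remarks -> Reference monitor is flickering???"))
# -> "Reference monitor is flickering?"
```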

Fig. 1. ExtTerm architecture.

Table 1. Sample cleaning activities.
- Discarding irrelevant identifiers, symbols, and strings:
  Original: ...080107 1650: External Remarks -> Reference monitor is flickering...
  Cleaned: ...reference monitor is flickering...
- Dropping enclosing brackets:
  Original: ...they have had the intermittent problems (after patch upgrade)...
  Cleaned: ...they have had the intermittent problems after patch upgrade...

Table 2. POS-tagged texts.
- Before: Reference monitor is flickering
  After linguistic processing: Reference|N monitor|N is|VBZ flickering|VBG
- Before: Problem is regulating switch
  After linguistic processing: Problem|N is|VBZ regulating|VBG* switch|N

3.2. Linguistic Filtering

Incorrect POS-tagging affects the performance of linguistic filters, which compromises the overall TE accuracy. For example, conventional linguistic filters (Section 2.3), such as those of Justeson and Katz (1995) and Frantzi et al. (2000), will reject "regulating" (Table 2) since it was incorrectly POS-tagged as a verb. As a result, they will only detect the noun "switch" as a term, instead of "regulating switch". Possible solutions to this issue include manually correcting all ambiguous phrases in the corpus or re-training existing POS-taggers on our domain-specific texts. However, both of these alternatives are too tedious and error-prone.

3.2.1. Filter_1: dealing with POS-tagging errors

As a brute-force solution to overcome this difficulty, ExtTerm defines a linguistic filter that accepts, as term candidates, any sequences of nouns (N), preceded by an optional progressive verb (VBG) and an adjective (A). This filter, which we call Filter_1, is depicted in the regular expression shown in Eq. (1). The symbols "?" and "+" are regular-expression cardinality operators, respectively indicating that their operands are optional or that they occur at least once.

Filter_1 = A? VBG? N+    (1)

3.2.2. Filter_2: complex terms

To overcome the difficulties in detecting complex terms, such as those consisting of any number of nouns interspersed with adjectives and/or prepositions, e.g. "rear battery cabinet coaxial cable", we rely on the underlying principle of term formation. According to this principle, complex terms are formed by recursively juxtaposing base terms with adjectives and/or nouns and by inserting prepositional modifiers. A base term is a fundamental terminological unit, consisting of a noun pair or of an adjective and a noun (Daille, Habert, Jacquemin, & Royauté, 1996). Consequently, we define our filter for detecting complex terms, Filter_2, as shown in Eq. (2). The regular-expression operator "*" indicates that its operand can occur 0 or more times; "P" denotes a preposition.

Filter_2 = A? N* P? A? N+    (2)

Combining the regular expressions of Eqs. (1) and (2), and factoring their common elements, yields the POS pattern employed by ExtTerm's linguistic filter, depicted in Eq. (3).

Candidate_Term = A? VBG? N* P? A? N+    (3)

This filter maximizes the balance between precision and recall by being strict about the order and cardinality of the filter elements (i.e. POS-tags), and by allowing progressive verbs to be part of terms. The set of term candidates selected by the filter (Eq. (3)) is then fed to the Relevant Term Selection phase.
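For concreteness, the filter of Eq. (3) can be realized as a regular expression over the sequence of POS tags, assuming tokens have already been tagged with the conventions of Section 3.1.2 ("N" for nouns, "A" for adjectives, "P" for prepositions, Penn tags such as "VBG" otherwise). The representation and helper below are our own sketch, not the authors' implementation.

```python
import re

# Eq. (3): Candidate_Term = A? VBG? N* P? A? N+, matched over a string of
# space-separated POS tags (one tag per token, single-letter N/A/P assumed).
CANDIDATE = re.compile(r"(A )?(VBG )?(N )*(P )?(A )?(N ?)+")

def candidates(tagged: list[tuple[str, str]]) -> list[str]:
    """Return token spans whose POS sequence matches the Eq. (3) pattern."""
    tags = " ".join(tag for _, tag in tagged) + " "
    words = [w for w, _ in tagged]
    spans = []
    for m in CANDIDATE.finditer(tags):
        matched = m.group(0).strip()
        if not matched:
            continue
        start = tags[: m.start()].count(" ")       # tokens before the match
        end = start + matched.count(" ") + 1       # tokens inside the match
        spans.append(" ".join(words[start:end]))
    return spans

# "problem is regulating switch", POS-tagged as in Table 2: the filter
# admits "regulating switch" despite the VBG mis-tagging of "regulating".
print(candidates([("problem", "N"), ("is", "VBZ"),
                  ("regulating", "VBG"), ("switch", "N")]))
# -> ['problem', 'regulating switch']
```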


3.3. Relevant Term Selection

The Relevant Term Selection (RTS) phase estimates the termhood of the previously identified term candidates based on their relative probability of occurring in the domain-specific corpus and in another general (open-domain) corpus. A candidate is likely to qualify as a term if it is more probable in the domain-specific texts than in the general corpus. We refer to the latter as the normative corpus.

3.3.1. Wikipedia as a normative corpus

To ensure the optimal accuracy of the corpus comparison technique, it is essential that the normative corpus contrasts significantly with the domain-specific one. Specifically, the normative corpus should be representative of the universal usage of a language, differing from the more specialized/technical language of the domain-specific texts. It should also be much larger than the domain-specific corpus. Based on these considerations, we employ the English Wikipedia collection as the normative corpus. Another motivation for relying on Wikipedia is that it has been applied successfully in many different NLP tasks (Medelyan et al., 2009). It is worth noting that this stage of ExtTerm is similar to the approach of Romero et al. (2012), which also involves the use of Wikipedia for domain-specific TE.

Our corpus comparison procedure for calculating the termhood of candidates is listed in the pseudo-code below. The variable cand represents a candidate selected from the Linguistic Filtering phase. Its frequency in the domain-specific corpus, DS, is given by fDS. The variable fNC denotes the candidate's corresponding frequency in the normative corpus, NC, i.e. Wikipedia.

Procedure termhood_score(cand)
1. relevant_candidates = {} // set of identified relevant expressions
2. pDS = fDS / |DS|
3. pNC = fNC / |NC|
4. termH = pDS / pNC
5. if termH >= t then
6.   add (cand, termH) to relevant_candidates

We start by initializing an empty set (line 1) to harvest the candidates deemed relevant to the domain-specific corpus. For a given candidate, cand, we estimate its probability of occurring in the domain-specific corpus (line 2) and in the Wikipedia normative corpus (line 3). We then compute its termhood score, termH, as the ratio of its probabilities across these two corpora (line 4). In this way, candidates that are likelier to appear in the domain-specific corpus are allocated higher termhood scores. A candidate is considered relevant if its score satisfies a termhood threshold t (lines 5-6). These relevant candidates are fed to the Term Ranking phase (discussed in Section 3.4). Choosing an appropriate threshold, t, is based on performance considerations, as will be described in Section 4.4.

3.3.2. Sample output

The output of the RTS phase is a set of (simple and complex) candidates that are relevant to our domain-specific corpus. Table 3 shows sample candidates, their occurrence frequency in the domain-specific corpus, fDS, and their termhood, as calculated by the termhood_score procedure. All scores are normalized to lie in the range 0-500. Candidates appearing only in the domain-specific corpus and not in Wikipedia are awarded the maximum score of 500.

Table 3. Candidates and termhood scores from Relevant Term Selection.
Candidate | fDS | Termhood
Filament control board replacement kit | 2 | 500
C-arm backplane rotation sensor | 108 | 500
Customer helpdesk collimator shutter | 94 | 500
Radiation shield | 58 | 479.06
Baton Rouge | 245 | 52.28

The main benefit of our corpus comparison approach is that it facilitates the detection of rare terms, alleviating the issue of silence due to data sparsity (Section 2.4). For example, as illustrated in Table 3, "filament control board replacement kit" is awarded the maximum termhood score even though it occurs only twice in the domain-specific corpus. This is because such a highly specific term, despite its low frequency, is more probable in the domain-specific texts than in the normative corpus. The results also show that ExtTerm overcomes the issue of noise. For example, despite its high frequency in the domain-specific texts, the candidate "Baton Rouge" is awarded the lowest termhood score, indicative of its irrelevance to our PD-CS domain. This is because such a general expression is also very likely to be found in the normative corpus of general texts. It should be noted that some incoherent constructs, e.g. "customer helpdesk collimator shutter", were awarded the maximum termhood score since they were found only in the domain-specific texts and not in the Wikipedia normative corpus. These invalid candidates will be dealt with in the Term Ranking phase.
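A compact Python rendering of this scoring step is sketched below. The cap at the maximum score of 500 for candidates unseen in the normative corpus mirrors Table 3; the exact normalization of scores into the 0-500 range is not spelled out in the text, so that part is an assumption.

```python
def termhood_score(f_ds: int, f_nc: int, ds_size: int, nc_size: int,
                   max_score: float = 500.0) -> float:
    """Corpus-comparison termhood (Section 3.3.1): ratio of a candidate's
    probability in the domain-specific corpus (DS) to its probability in
    the normative corpus (NC). Capping at max_score is assumed from Table 3."""
    p_ds = f_ds / ds_size
    if f_nc == 0:                  # found only in the domain-specific texts
        return max_score
    p_nc = f_nc / nc_size
    return min(p_ds / p_nc, max_score)

# A rare but highly specific candidate still receives the maximum score:
print(termhood_score(f_ds=2, f_nc=0, ds_size=1_952_739, nc_size=500_000_000))
# -> 500.0
```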

3.4. Term Ranking

In the Term Ranking (TR) phase, we determine which of the previously identified relevant candidates are atomic, coherent linguistic units by computing their unithood scores. In the remainder of this section, we refer to a candidate containing n words as an n-candidate. For instance, "radiation protector shield arm", which is made up of four words, is a 4-candidate.

3.4.1. Unithood for 2-candidates

The unithood scores of 2-candidates, such as "tube cover", can be straightforwardly calculated using a traditional lexical association measure (LAM). In ExtTerm, we use the cube mutual information (MI3) measure proposed by Daille (1994). Given a 2-candidate, cand = "x y" (e.g. cand = "tube cover"), MI3 estimates its unithood as

unithood_score(cand) = log( (f(x,y)/N)^3 / ( (f(x)/N) * (f(y)/N) ) )    (4)

In this equation, f(x,y) is the co-occurrence frequency of the pair "x y", f(x) is the (individual) frequency of the word x, and N is the total number of words in the corpus. Compared to the basic mutual information (MI) measure of Church and Hanks (1990), MI3 takes the cube of the joint probability of the events, as shown in the numerator of Eq. (4). This strategy overcomes the shortcoming of the basic MI, which tends to overemphasize rare events (Daille, 1994; Guinaudeau, Gravier, & Sebillot, 2011). Our choice of MI3 was also motivated by our preliminary experiments and by other research efforts, such as Vivaldi and Rodriguez (2001), which demonstrated that MI3 outperformed other LAMs, like log-likelihood, for term extraction.
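Eq. (4) translates directly into code. The toy counts below are invented; they illustrate the rare-event behavior just described: basic MI ranks a hapax pair of hapax words above a frequent, cohesive pair, whereas MI3 restores the sensible order.

```python
import math

def mi(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """Basic mutual information (Church & Hanks, 1990)."""
    return math.log((f_xy / n) / ((f_x / n) * (f_y / n)))

def mi3(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """Cube mutual information (Daille, 1994), Eq. (4): cubing the joint
    probability damps basic MI's bias toward rare events."""
    return math.log(((f_xy / n) ** 3) / ((f_x / n) * (f_y / n)))

n = 1_000_000  # corpus size in tokens (toy value)
# Frequent cohesive pair vs. a pair of words each seen exactly once:
print(mi(40, 60, 80, n), mi(1, 1, 1, n))    # ~9.0 vs ~13.8: MI prefers the hapax
print(mi3(40, 60, 80, n), mi3(1, 1, 1, n))  # ~-11.2 vs ~-13.8: MI3 does not
```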


3.4.2. Unithood for n-candidates, n > 2

To overcome the limitations of traditional LAMs, which operate only on 2-candidates, we propose a novel technique for accurately estimating the unithood score of n-candidates, n > 2. Our technique is hinged upon the mechanisms of term formation, as described in Daille, Habert, Jacquemin, and Royauté (1996). The central idea lies in reformulating the unithood of an n-candidate (n > 2) in terms of the unithood of its sub-expressions, containing at least 2 and at most (n - 1) words. As examples, the sub-expressions of two 4-candidates, cand = "xray tube window cover" and candX = "customer helpdesk collimator shutter", are depicted in Table 4. Note that cand denotes a valid term, while candX does not. Two observations can be made from Table 4.

1. All sub-expressions of the term cand are atomic units, and are likely to correspond to terms themselves. Conversely, with the exception of "collimator shutter" and "customer helpdesk", all sub-expressions of candX are invalid.
2. It is possible to reconstruct an n-candidate from its sub-expressions, since each sub-expression that contains (n - 1) words is nested in a longer sub-expression of size n.

Based on these observations, our technique for computing the unithood score of any n-candidate, n > 2, is listed in the pseudo-code below.

Procedure unithood_score()
1. n = 2
2. while (n <= max_length) // maximum number of words found in candidates
3.   setCand_n = {all candidates containing n words}
4.   if (n == 2) // 2-candidates
5.     for each 2-candidate cand in setCand_2
6.       score = MI3(cand)
7.       add (cand, score) to scoreHash_2
8.   else // longer candidates, n > 2
9.     for each cand in setCand_n
10.      subExprSet = generate_sub-expressions(cand)
11.      for each subExpr with m = 2...(n - 1) words in subExprSet
12.        score += retrieve_score(subExpr) from scoreHash_m
13.      score /= n
14.      add (cand, score) to scoreHash_n
15.  n++
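The following is a Python sketch of this procedure, under two assumptions: sub-expressions are generated as order-preserving word combinations (matching Table 4, which lists "tube cover" but not "cover tube" for cand), and an MI3 scorer for 2-candidates, such as the one sketched in Section 3.4.1, is supplied by the caller. All names are illustrative.

```python
from itertools import combinations

def unithood_scores(cands_by_len, mi3_score):
    """Sketch of the unithood_score procedure (Section 3.4.2).
    cands_by_len maps n -> list of n-word candidates as word tuples;
    mi3_score(pair) returns the MI3 unithood of a 2-word tuple."""
    score = {}                                    # scoreHash_m for every size m
    for cand in cands_by_len.get(2, []):          # lines 4-7: MI3 for 2-candidates
        score[cand] = mi3_score(cand)
    n = 3
    while n in cands_by_len:                      # lines 8-15: longer candidates
        for cand in cands_by_len[n]:
            s = 0.0
            # order-preserving sub-expressions of size 2..(n-1), as in Table 4
            for m in range(2, n):
                for sub in combinations(cand, m):
                    s += score.get(sub, 0.0)      # line 12: accumulate sub-scores
            score[cand] = s / n                   # line 13: normalize by length
        n += 1
    return score

scores = unithood_scores(
    {2: [("tube", "cover"), ("xray", "tube")],
     3: [("xray", "tube", "cover")]},
    mi3_score=lambda pair: 10.0)                  # stub LAM, for illustration
print(round(scores[("xray", "tube", "cover")], 2))  # (10 + 0 + 10) / 3 = 6.67
```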

Table 4. Sub-expressions generated from 4-candidates.

cand = "xray tube window cover"
  Size 2: "xray tube", "xray window", "xray cover", "tube window", "tube cover", "window cover"
  Size 3: "xray tube window", "tube window cover", "xray window cover", "xray tube cover"

candX = "customer helpdesk collimator shutter"
  Size 2: "customer helpdesk", "customer collimator", "customer shutter", "helpdesk collimator", "helpdesk shutter", "collimator shutter"
  Size 3: "customer helpdesk collimator", "helpdesk collimator shutter", "customer helpdesk shutter", "customer collimator shutter"

Table 5. Candidates extracted by Term Ranking.
Candidate | Unithood | Length
Fluoroscopy | 500 | 1
Filament control board replacement kit | 234.03 | 5
Fluid injector indication lamp control knob assembly | 203.4 | 7
Xray image frequency convertor control board | 170.01 | 6
Frequency control relay | 162.54 | 3
Radiation protector shield arm | 152.01 | 4
Collimator shutter | 61.44 | 2
Customer helpdesk collimator shutter* | 15.36 | 4

Note: * denotes invalid candidates, which are unlikely to represent terms.

We start by computing the unithood scores of 2-candidates (e.g. "tube cover") using MI3 (lines 1-7). The 2-candidates and their scores are indexed in a look-up table. In each subsequent iteration (lines 8-15), we process candidates containing an additional word: for example, 3-candidates after 2-candidates and, in general, n-candidates after (n - 1)-candidates. This strategy is based upon Observation 2. For any n-candidate, we generate its sub-expressions by generating all possible word combinations (line 10), as shown in Table 4. Considering all the possible word combinations enables us to deal with word order variations. For example, we can capture "tube cover" from the expression cand = "xray tube window cover", even though the words "tube" and "cover" are not adjacent in the expression. After decomposing an n-candidate, we accumulate the unithood scores of its sub-expressions (line 12), which were calculated earlier. Thus, the unithood of a candidate is formulated as a function of the unithood of its sub-expressions. This ensures that valid candidates will be awarded higher unithood scores, since their sub-expressions are also likely to be valid terms (and have large scores), as noted in Observation 1. Conversely, invalid candidates will be awarded much lower unithood scores. Next, the (accumulated) unithood score of each candidate is normalized by the candidate's length (line 13). Normalization is required to penalize longer candidates since:

- They can be decomposed into a larger number of sub-expressions, and thus may be unfairly awarded higher scores than shorter expressions.
- The probability that a candidate designates a term decreases as its length increases (Piao, Rayson, Archer, & McEnery, 2005).

Finally, we index the n-candidate and its unithood score in a look-up table (line 14). It can then be retrieved for calculating the scores of longer candidates, e.g. with (n + 1) words, in future iterations.

3.4.3. Sample output

Sample candidates extracted by the TR phase are presented in Table 5. Similar to the termhood scores, we normalize the unithood scores to lie in the range 0-500. In addition, since simple candidates, consisting of one word, are assumed to have perfect unithood (Kageura & Umino, 1996), they are awarded the maximum score. As illustrated in Table 5, ExtTerm successfully addresses the difficulties in detecting complex terms, including those containing more than 2 words, such as "filament control board replacement kit" (5 words). It also overcomes the issue of noise, awarding much higher scores to valid candidates, e.g. "filament control board replacement kit", than to invalid ones, e.g. "customer helpdesk collimator shutter". The latter candidates can then be easily detected and discarded.


4. Experimental evaluation

In this section, we evaluate the performance of our proposed ExtTerm framework. We start in Section 4.1 by briefly describing the corpora employed in our experiments. Then, the results obtained in the different phases of ExtTerm are discussed in Sections 4.2-4.6. In Section 4.7, we compare the performance of ExtTerm against that of a state-of-the-art baseline, namely the algorithm of Frantzi et al. (2000), which was described earlier (Section 2.3). This algorithm has been found to achieve relatively high accuracy in term extraction (TE) across different domains (Hliaoutakis, Zervanou, & Petrakis, 2009; Milios, Zhang, He, & Dong, 2003; Zhang, Iria, Brewster, & Ciravegna, 2008). It is therefore one of the most commonly used baselines for gauging the performance of other TE systems. In Section 4.8, we present additional experiments to evaluate ExtTerm's accuracy in extracting terms of varying lengths.

4.1. Corpora

4.1.1. Domain-specific corpus

The corpus that we targeted for the extraction of domain-specific terms is a collection of 32,545 documents generated in the business/corporate domain of Product Development-Customer Service (PD-CS). The documents describe customer complaints and the ensuing repair actions performed by service engineers on high-end, professional electro-mechanical equipment. The corpus was compiled over a period of five years, spanning 2005 to 2009, and the documents were collected from customer-call centers and engineers' repair notes. The texts were expressed in English.

4.1.2. Wikipedia normative corpus

As the normative corpus for termhood computation (Section 3.3), we relied on the English Wikipedia collection (downloaded from http://ilps.science.uva.nl/WikiXML/). Some basic statistics on our domain-specific and normative corpora are presented in Table 6.

4.2. Linguistic Filtering

We evaluated ExtTerm's linguistic filter, filterExtTerm, presented in Eq. (3), by comparing it against the baseline's filter (Frantzi et al., 2000), filterBaseline, depicted in Eq. (5).

Filter_Baseline = ((A|N)+ | ((A|N)* (N P)?) (A|N)*) N    (5)

Our experimental results revealed that the filter with the broadest coverage was filterExtTerm, which admitted nearly twice as many term candidates as filterBaseline (85,342 vs. 48,984). As we will show in Section 4.7, the larger number of candidates selected by filterExtTerm is beneficial to ExtTerm's recall, while the valid candidates rejected by filterBaseline compromise the baseline's recall. Next, we inspected the respective outputs of both filters, as illustrated in Table 7. A value "Y" indicates that a filter (column) successfully extracted a candidate (row); "N" indicates unsuccessful extraction. Partially correct extractions (e.g. term fragments) are marked with "*". Three main observations can be made from these results. First, filterExtTerm successfully extracted complex candidates, such as "coaxial cable for rear battery cabinet". Conversely, filterBaseline could only capture fragments, e.g. "coaxial cable for rear battery". Second, unlike filterBaseline, filterExtTerm was not susceptible to POS-tagging errors. It detected candidates, such as "regulating switch", in which gerund nouns (e.g. "regulating") had been incorrectly POS-tagged as progressive verbs (Section 3.2). Third, both filters


Table 6. Corpora statistics.
Corpus | Number of words | Number of documents
Domain-specific corpus | 1,952,739 | 32,545
Wikipedia normative corpus | 500 million (approx.) | 5,100,000 (approx.)

Table 7. Comparing candidates output by linguistic filters.
Candidate | FilterExtTerm | FilterBaseline
Complex candidates:
Coaxial cable for rear battery cabinet | Y | N (coaxial cable for rear battery)*
Flat screen monitor wireless adaptor | Y | N (flat screen monitor)*
POS errors:
Limiting amplifier | Y | N (circuit)*
Regulating switch | Y | N (switch)*
Invalid candidates:
Tear box helpdesk cable sleeve | Y | Y
Customer helpdesk collimator shutter | Y | Y

identified some incorrect candidates, which were realized as noun phrases, such as "customer helpdesk collimator shutter", and which satisfied the filters' respective admission criteria. Experiments to discard these invalid candidates will be described in Term Ranking (Section 4.5).

4.3. Relevant Term Selection

In the Relevant Term Selection (RTS) phase of ExtTerm, we estimated the termhood scores of the candidates selected by filterExtTerm. This was achieved using the corpus comparison approach, which calculated the candidates' relative probability across the domain-specific corpus and the Wikipedia normative corpus, as described in Section 3.3. Sample candidates extracted by the RTS phase, their occurrence probabilities in our domain-specific (PDS) and normative (PNC) corpora, and their termhood scores are depicted in Table 8. Candidates that appeared only in the domain-specific texts, but not in the normative corpus, were awarded the maximum score of 500.

Table 8. Candidates output from the Relevant Term Selection phase.
Candidate | PDS | PNC | Termhood = PDS/PNC
Relevant and likely:
Angioplasty contrast adjustment buzzer button | 0.0017 | 0 | 500
C-arm backplane rotation sensor | 0.0010 | 0 | 500
Relevant but sparse:
Radiation protector shield arm | 0.0000015 | 0 | 500
Coaxial cable for rear battery cabinet | 0.0000010 | 0 | 500
Invalid:
Tear box helpdesk cable sleeve | 0.0016 | 0 | 500
Customer helpdesk collimator shutter | 0.0011 | 0 | 500
Likelier in specialized corpus:
Sensor button | 0.00035 | 0.0000011 | 308.17
Camera shutter | 0.00020 | 0.0000016 | 123.89
Irrelevant but likely:
Airport | 0.00062 | 0.000012 | 50.85
Day | 0.00044 | 0.000014 | 31.68

These results can be broadly classified into five categories. Candidates in the first category ("Relevant and Likely") occurred frequently in our domain-specific texts, but not in the normative corpus. They were highly specific to our domain, e.g. "c-arm backplane rotation sensor", and were awarded the maximum termhood score. Candidates in the second category ("Relevant but Sparse") occurred sparsely in the domain-specific texts; for example, "radiation protector shield arm" had a frequency of only 3. These rare candidates are responsible for the issue of silence, and most existing TE systems cannot detect them, which is detrimental to their recall (Section 2.4). However, as can be seen from the results, ExtTerm awarded high termhood scores to these candidates and successfully detected their occurrences, alleviating the issue of silence. Candidates in the third category ("Invalid"), e.g. "customer helpdesk collimator shutter", were also assigned maximum termhood scores since they occurred only in the domain-specific corpus. However, they are incoherent word sequences, and will be dealt with during the Term Ranking phase (Section 4.5). Candidates in the fourth category ("Likelier in Specialized Corpus") were likelier to occur in the domain-specific texts than in the normative corpus. These candidates were awarded relatively high termhood scores, indicative of their relevancy. Candidates in the last category ("Irrelevant but Likely") were irrelevant candidates that occurred frequently in the domain-specific texts. Such candidates are responsible for the issue of noise, and are often incorrectly extracted as relevant terms by existing TE techniques, compromising precision (Section 2.4). However, as shown in Table 8, they were awarded very low termhood scores by ExtTerm, facilitating their detection and subsequent rejection, and alleviating the issue of noise.

4.4. Selecting a termhood threshold

As discussed earlier (Section 3.3), only those relevant candidates whose termhood scores satisfied a threshold t were selected and input to the subsequent Term Ranking stage of ExtTerm. To find the optimal balance between precision and recall (i.e. to obtain the "largest set of valid terms"), we varied the threshold t across six different termhood scores: 10, 100, 200, 300, 400, and 500. These values were chosen since around 90% of the candidates had termhood scores in the range 10-500. For each of these six threshold values t, we harvested six different sets of candidates from the RTS phase, such that the candidates' termhood scores were larger than t. Then, for each candidate set, we measured the precision and recall of the Term Ranking phase (Section 4.5), and chose, as threshold, the value of t which optimized the precision-recall balance. As will be described in Section 4.6, this balance was maximized at t = 200.

4.5. Term Ranking

The Term Ranking (TR) phase computed the unithood scores of the candidates selected from the previous RTS stage, for each of the six threshold values, i.e. t = 10, 100, 200, 300, 400, and 500. As output, TR generated six separate ranked lists of candidates (one per threshold t), sorted according to their unithood scores. An example ranked list is illustrated in Table 9.

Table 9. Sample ranked list from the Term Ranking phase.
Candidate | Unithood score
Footswitch | 500.00
Filament control board replacement kit | 234.03
C-arm backplane rotation sensor | 230.63
Fluid injector indication lamp control knob assembly | 203.40
Coaxial cable for rear battery cabinet | 200.70
Tube window cover | 189.72
Xray image frequency convertor control board | 170.01
Radiation protector shield arm | 152.01
Wireless repeater keyboard front cover | 102.12
Angioplasty contrast adjustment buzzer button | 98.35
Cable control box | 92.34
Circuit controller | 89.12
... | ...
Customer helpdesk collimator shutter | 15.36
Tear box helpdesk cable sleeve | 10.78

As described before (Section 3.4), these results show that ExtTerm successfully addresses the challenge of extracting terms containing more than 2 words. It also overcomes the issue of noise by awarding low unithood scores to invalid candidates, such as "customer helpdesk collimator shutter", which are then easily identified and discarded. In the next section, we measure ExtTerm's performance using the well-known metrics of precision and recall, and we describe how the termhood threshold t (Section 4.4) is selected.

4.6. Performance evaluation and threshold selection

For each of the six threshold values t (Section 4.4), we evaluated the corresponding ranked lists of candidates extracted by ExtTerm (Section 4.5), as depicted in Table 9. Since evaluating entire ranked lists is tedious, past TE studies have focused on the top-n terms. Previous research has also shown that evaluations over a sample of size n are comparable to evaluations over the entire population (Evert & Krenn, 2005). In our experiments, we evaluated the top-1000 candidates (n = 1000) in each of the six ranked lists (giving a total of 6000 candidates). To ensure the accuracy of our evaluations, we relied on two human judges (annotators), who were well-versed in the domain. Following Zhang et al. (2008), candidates that both annotators deemed correct were marked as true_positives; candidates deemed incorrect by both annotators were false_positives. The precision scores of the candidates extracted at the different threshold values t were then estimated using Eq. (6), as shown in Table 10 (2nd column). To mitigate the effect of coincidental agreements between the two annotators, we computed the level of inter-annotator agreement using the kappa coefficient (Cohen, 1960). Our calculated kappa values, which hovered around 0.68-0.72, were indicative of a strong level of inter-annotator agreement, since kappa values of 0.7 are considered desirable.

To calculate recall, the annotators first created a gold-standard of 1000 known terms, which were manually selected from a subset of our domain-specific corpus (sub-corpus). No additional constraints, for example pertaining to term length, were imposed on the gold-standard terms. The sub-corpus was then analyzed by ExtTerm. Terms extracted by ExtTerm that were also found in the gold-standard were true_positives; gold-standard terms that ExtTerm failed to extract were false_negatives. The recall scores for the different threshold values t were then computed using Eq. (7), as shown in Table 10 (3rd column).

Table 10. ExtTerm's performance scores at different threshold values.
Threshold t | Precision | Recall | F1
10 | 0.79 | 0.96 | 0.86
100 | 0.84 | 0.90 | 0.87
200 | 0.87 | 0.89 | 0.88
300 | 0.89 | 0.84 | 0.87
400 | 0.87 | 0.81 | 0.84
500 | 0.87 | 0.80 | 0.84

Precision = true_positives / (true_positives + false_positives)    (6)

Recall = true_positives / (true_positives + false_negatives)    (7)

To obtain a single performance value, we determined the F-score (van Rijsbergen, 1979). Since we are interested in balancing precision (P) and recall (R), we set the weighting factor, beta, of the F-score to 1. The F-score is then referred to as the F1-score, and is computed using Eq. (8). The results are depicted in Table 10 (4th column).

F1 = 2PR / (P + R)    (8)
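These metrics, and the threshold sweep of Section 4.4, are straightforward to express in code. The sketch below recomputes the F1 column of Table 10 from its published precision and recall columns and picks the F1-maximizing threshold; small discrepancies with the printed F1 values can arise because the published precision and recall are already rounded.

```python
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)                          # Eq. (6)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)                          # Eq. (7)

def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r)                     # Eq. (8), beta = 1

# (Precision, recall) per threshold t, copied from Table 10:
table10 = {10: (0.79, 0.96), 100: (0.84, 0.90), 200: (0.87, 0.89),
           300: (0.89, 0.84), 400: (0.87, 0.81), 500: (0.87, 0.80)}
f1_by_t = {t: round(f1(p, r), 2) for t, (p, r) in table10.items()}
print(f1_by_t)                           # F1 peaks at t = 200 (0.88)
print(max(f1_by_t, key=f1_by_t.get))     # -> 200
```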

The highest F1 score of 0.88 was obtained at t = 200, indicating that the candidate set harvested during RTS (Section 4.3) at t = 200 contained the largest number of relevant terms. Thus, in our experiments, we used t = 200 as the threshold for the termhood score (Sections 3.3 and 4.4).

4.7. Benchmarking against baseline

We manually inspected the top-1000 candidates extracted by the baseline (Frantzi et al., 2000) from our domain-specific corpus. The baseline's precision, recall and F1 scores, together with those of ExtTerm, are compared in Table 11. As can be seen from the results, ExtTerm outperformed the baseline in extracting terms more accurately from our sparse, informally-written domain-specific corpus.

4.7.1. Lower precision of baseline

We observed that the baseline extracted many irrelevant but frequent candidates, and thus suffered from the issue of noise, which compromised its precision. As discussed before, ExtTerm's corpus comparison approach during the RTS phase (Sections 3.3 and 4.3) enabled it to accurately detect and reject these frequent, irrelevant candidates based on their low termhood scores. In addition, its TR phase (Sections 3.4 and 4.5), which formulates the unithood of a candidate as a function of the unithood of its sub-expressions, also mitigated noise by assigning low scores to invalid candidates.

4.7.2. Lower recall of baseline

The baseline also failed to detect a large number of valid terms which occurred sparsely in the domain-specific corpus. Thus, it suffered from the issue of silence, which was detrimental to its recall. ExtTerm, on the other hand, overcame this difficulty during the RTS phase (Sections 3.3 and 4.3) by selecting candidates with high termhood scores, regardless of their absolute frequencies. The baseline's recall was further compromised by its linguistic filter, filterBaseline, which failed to detect complex terms and terms with POS-tagging errors (Section 4.2).

4.8. Influence of term length on performance

Our previous results have shown that ExtTerm successfully extracted terms regardless of their length. In this section, we performed additional experiments to investigate the effect of term length on ExtTerm's performance. We divided the top-1000 candidates output by ExtTerm (at t = 200) and by the baseline into five different groups according to their lengths, as depicted in Table 12. Single-word terms were excluded. We then measured the precision scores achieved by ExtTerm and by the baseline for terms in each length group. The results are presented in Table 12 (2nd and 3rd columns respectively). Both ExtTerm and the baseline achieved comparable performance for terms that contained up to 4 words. However, the baseline's performance degraded for longer terms, and worsened as the term length increased. This was due to the difficulty of distinguishing valid (complex) terms, e.g. "filament control board replacement kit", from invalid word sequences, e.g. "customer helpdesk collimator shutter". The performance of ExtTerm, on the other hand, remained relatively constant across the different term lengths, suggesting that it was equally precise in extracting longer complex terms (e.g. containing 5 or 6 words) and shorter ones (e.g. containing 2 or 3 words).

Table 11. Performance of ExtTerm vs. baseline.
System | Precision | Recall | F1
Baseline | 0.71 | 0.84 | 0.77
ExtTerm | 0.87 | 0.90 | 0.88

Table 12. Effect of term length on performance.
Length | ExtTerm precision | Baseline precision
2 | 0.87 | 0.86
3 | 0.87 | 0.86
4 | 0.88 | 0.86
5 | 0.87 | 0.70
>= 6 | 0.88 | 0.65

5. Discussion and conclusion

Most term extraction (TE) techniques developed to date have predominantly focused on large, well-written corpora, such as newspaper and bio-medical texts. These texts provide reliable linguistic and statistical evidence, which facilitates the detection of terms. Furthermore, existing TE techniques often rely on readily-available knowledge resources, such as ontologies, to support their term extraction process, leading to substantial performance gains. However, the desiderata of large, well-written corpora and readily-available knowledge resources are rarely met in many corporate environments. In the domain of Product Development-Customer Service (PD-CS), for example, repair notes created by engineers tend to be sparse and ungrammatical (informally-written), which makes it hard to accurately detect terms in their contents. This difficulty is further compounded by the lack of readily-available, domain-specific knowledge resources. Consequently, traditional TE techniques exhibit several shortcomings and face a number of challenges in extracting terms from these types of domain-specific texts. As a result, their performance is severely compromised.

In this article, we addressed these difficulties by presenting ExtTerm, a novel framework for term extraction from sparse, ungrammatical domain-specific documents. Our contributions to the TE literature and main innovations are as follows. Unlike existing techniques, ExtTerm overcomes the issue of data sparsity by accurately detecting rare terms, even those appearing with very low frequency in a corpus. Thus, it does not suffer from the issue of silence, which is beneficial to its recall. ExtTerm also precisely rejects irrelevant expressions even if they appear frequently in the corpus, mitigating the issue of noise and improving its precision. Furthermore, we present a technique, hinged upon the theoretical notion of term formation, for detecting arbitrarily long terms, including those containing more than 2 words. In addition, we show that open-domain (general) knowledge resources, such as Wikipedia, can be exploited to support term extraction from specific domains.


The main benefit of relying on such resources is that they are readily available, comprehensive (large) and accurate. Therefore, they constitute an attractive alternative to compensate for the lack of domain-specific resources such as ontologies.

5.1. Application

The terms extracted by ExtTerm from a domain-specific corpus, for example engineers' repair notes, can be further inspected by domain experts and employed for various types of Business Intelligence activities. Our actual ExtTerm implementation is currently being used by product data analysts in PD-CS organizations (large multi-nationals in The Netherlands, manufacturing high-end electronic devices) to extract terms from engineers' repair notes and customer complaint messages. These terms denote products (sub-assemblies, subsystems) that experience malfunctions most frequently. They are used as the basis for corporate activities, such as "cost of non-quality analysis", which involves estimating the contribution of the various products to the maintenance/repair costs. This information is reported to management in quarterly reports, and to the Product Development Process (PDP) so that remedial measures can be devised and implemented to improve product quality.

5.2. Future work

A significant amount of domain-specific texts (customer complaints, repair documents) is being generated and collected in languages other than English, such as Dutch and French. There has been some work on TE from corpora in languages besides English. However, most of these techniques exhibit the limitations described earlier; that is, they have traditionally been applied to well-written and large corpora. Consequently, they fail to adequately deal with the additional challenges posed by domain-specific texts, such as sparsity, ungrammatical language, and the detection of complex terms with more than 2 words. Our future work will therefore focus mainly on multi-lingual term extraction from domain-specific documents. As a start, we plan to use the Wikipedia collection in a language L, e.g. Dutch, to identify relevant terms and filter out irrelevant ones from a domain-specific corpus which is also expressed in L. This is similar to our Relevant Term Selection phase (Section 3.3) and to the work of Romero et al. (2012). Designing a linguistic filter and developing a multi-word term extraction algorithm are more involved, as they require some knowledge of the linguistic patterns adopted by terms in the language L. In addition, we envisage using the terms extracted by ExtTerm to support higher-level, downstream applications, such as text classification and text clustering. We expect that these terms will constitute a concise set of high-quality features, which can substantially improve classification or clustering accuracy. Furthermore, we will leverage the terms discovered by ExtTerm for automatically learning domain-specific ontologies from (domain-specific) documents. Terms, as designators of domain-specific concepts, constitute the fundamental building blocks of ontologies. We will then develop techniques for automatically learning the semantic relationships between the terms, such as "is-a" (hypernymy), in the ontology.

Acknowledgements

This work is being carried out as part of the project "Merging of Incoherent Field Feedback Data into Prioritized Design Information (DataFusion)", sponsored by the Dutch Ministry of Economic Affairs, Agriculture and Innovation under the IOP-IPCR program.

5.2. Future work

A significant amount of domain-specific text (customer complaints, repair documents) is being generated and collected in languages other than English, such as Dutch and French. There has been some work on TE from corpora in languages besides English, but most of these techniques exhibit the limitations described earlier: they have traditionally been applied to large, well-written corpora. Consequently, they fail to adequately address the additional challenges posed by domain-specific texts, namely sparsity, ungrammatical language, and the detection of complex terms with more than 2 words. Our future work will therefore focus mainly on multi-lingual term extraction from domain-specific documents. As a start, we plan to use the Wikipedia collection in a language L, e.g. Dutch, to identify relevant terms and filter out irrelevant ones in a domain-specific corpus that is also expressed in L. This is similar to our Relevant Term Selection phase (Section 3.3) and to the work of Romero et al. (2012); a minimal sketch of this title-based filtering idea is given at the end of this section. Designing a linguistic filter and developing a multi-word term extraction algorithm are more involved, as they require some knowledge of the linguistic patterns adopted by terms in the language L.

In addition, we envisage using the terms extracted by ExtTerm to support higher-level, downstream applications, such as text classification and text clustering. We expect these terms to constitute a concise set of high-quality features, which can substantially improve classification or clustering accuracy. Furthermore, we will leverage the terms discovered by ExtTerm to automatically learn domain-specific ontologies from (domain-specific) documents. Terms, as designators of domain-specific concepts, constitute the fundamental building blocks of ontologies. We will then develop techniques for automatically learning the semantic relationships between the terms, such as ''is-a'' (hypernymy), in the ontology.
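As a rough illustration of the title-based filtering idea, the sketch below keeps only those candidate terms that match, after light normalization, a page title from a Wikipedia dump in the target language. The titles, the candidates, and the normalization step are illustrative assumptions; this is not the exact procedure of our Relevant Term Selection phase.

```python
# Hypothetical excerpt of page titles from a Wikipedia dump in the target
# language (English here for readability; a Dutch dump works the same way).
wikipedia_titles = {
    "Proximity sensor",
    "Collimator",
    "Shutter (photography)",
}

def normalize(title):
    # Drop disambiguation suffixes such as "(photography)", then lowercase.
    return title.split("(")[0].strip().lower()

title_index = {normalize(t) for t in wikipedia_titles}

# Candidate terms from a domain-specific corpus (illustrative only).
candidates = ["proximity sensor", "asap friday", "collimator"]

relevant = [c for c in candidates if normalize(c) in title_index]
print(relevant)  # ['proximity sensor', 'collimator']
```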

Acknowledgements

This work is being carried out as part of the project ''Merging of Incoherent Field Feedback Data into Prioritized Design Information (DataFusion)'', sponsored by the Dutch Ministry of Economic Affairs, Agriculture and Innovation under the IOP-IPCR program.

References

Ahmad, K., Davies, A., Fulford, H., & Rogers, M. (1994). What is a term? The semi-automatic extraction of terms from text. Translation Studies: An Interdiscipline, 267–278.
Ashburner, M., Ball, C., Blake, J., Botstein, D., Butler, H., & Cherry, J. (2000). Gene Ontology: Tool for the unification of biology. Nature Genetics, 25(1).
Aubin, S., & Hamon, T. (2006). Improving term extraction with terminological resources. In T. Salakoski (Ed.), Lecture Notes in Artificial Intelligence (pp. 380–387). Springer-Verlag.
Auger, A., & Barriere, C. (2010). Probing semantic relations: Exploration and identification in specialized texts. John Benjamins Pub Co.
Blohm, S., & Cimiano, P. (2007). Using the web to reduce data sparseness in pattern-based information extraction. Knowledge Discovery in Databases: PKDD.
Blumberg, R., & Atre, S. (2003). The problem with unstructured data. DM Review, 13, 42–49.
Chung, T. (2003). A corpus comparison approach for terminology extraction. Terminology, 9(2), 221–246.
Church, K., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), 22–29.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.
Dagan, I., & Church, K. (1994). Termight: Identifying and translating technical terminology. In Proceedings of the fourth conference on applied natural language processing.
Daille, B. (1994). Approche mixte pour l'extraction de terminologie: Statistique lexicale et filtres linguistiques. Universite Paris VII.
Daille, B., Habert, B., Jacquemin, C., & Royauté, J. (1996). Empirical observation of term variations and principles for their description. Terminology, 3(2), 197–257.
Drouin, P. (2003). Term extraction using non-technical corpora as a point of leverage. Terminology, 9(1), 99–115.
Evert, S., & Krenn, B. (2005). Using small random samples for the manual evaluation of statistical association measures. Computer Speech & Language, 19(4), 450–466.
Fellbaum, C. (1998). WordNet: An electronic lexical database. The MIT Press.
Frantzi, K., & Ananiadou, S. (1999). The C-value/NC-value domain independent method for multi-word term extraction. Journal of Natural Language Processing, 6(3), 145–179.
Frantzi, K., Ananiadou, S., & Mima, H. (2000). Automatic recognition of multi-word terms: The C-value/NC-value method. International Journal on Digital Libraries, 3(2), 115–130.
Guinaudeau, C., Gravier, G., & Sebillot, P. (2011). Enhancing lexical cohesion measure with confidence measures, semantic relations and language model interpolation for multimedia spoken content topic segmentation. Computer Speech & Language.
Hamp, B., & Feldweg, H. (1995). GermaNet: A lexical-semantic net for German. In Proceedings of the ACL workshop on automatic information extraction and building of lexical semantic resources for NLP applications.
Hiekata, K., Yamato, H., & Tsujimoto, S. (2010). Ontology based knowledge extraction for shipyard fabrication workshop reports. Expert Systems with Applications, 37(11), 7380–7386.
Hliaoutakis, A., Zervanou, K., & Petrakis, E. (2009). The AMTEx approach in the medical document indexing and retrieval application. Data & Knowledge Engineering, 68(3), 380–392.
Hoste, V., Vanopstal, K., & Lefever, E. (2007). The automatic detection of scientific terms in patient information. In Proceedings of RANLP.
Justeson, J., & Katz, S. (1995). Technical terminology: Some linguistic properties and an algorithm for identification in text. Natural Language Engineering, 1(1), 9–27.
Kageura, K., & Umino, B. (1996). Methods of automatic term recognition: A review. Terminology, 3(2), 259–289.
Lapata, M., & Lascarides, A. (2003). Detecting novel compounds: The role of distributional evidence. In Proceedings of the tenth conference of the European chapter of the Association for Computational Linguistics.
Marcus, M., Marcinkiewicz, M., & Santorini, B. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313–330.
Maynard, D., & Ananiadou, S. (2000). Identifying terms by their family and friends. In Proceedings of the 18th conference on computational linguistics.
Medelyan, O., Milne, D., Legg, C., & Witten, I. (2009). Mining meaning from Wikipedia. International Journal of Human–Computer Studies, 67(9), 716–754.
Milios, E., Zhang, Y., He, B., & Dong, L. (2003). Automatic term extraction and document similarity in special text corpora. In Proceedings of the sixth conference of the Pacific Association for Computational Linguistics.
Nakagawa, H., & Mori, T. (1998). Nested collocation and compound noun for term recognition. In Proceedings of the first workshop on computational terminology COMPUTERM (pp. 64–70).
Pazienza, M., Pennacchiotti, M., & Zanzotto, F. (2005). Terminology extraction: An analysis of linguistic and statistical approaches. Knowledge Mining, 255–279.
Pecina, P., & Schlesinger, P. (2006). Combining association measures for collocation extraction. In Proceedings of the COLING/ACL main conference poster sessions.
Petrovic, S., Snajder, J., & Basic, B. (2010). Extending lexical association measures for collocation extraction. Computer Speech & Language, 24(2), 383–394.


Piao, S., Rayson, P., Archer, D., & McEnery, T. (2005). Comparing and combining a semantic tagger and a statistical tool for MWE extraction. Computer Speech & Language, 19(4), 378–397.
Rayson, P., & Garside, R. (2000). Comparing corpora using frequency profiling. In Proceedings of the workshop on comparing corpora, Association for Computational Linguistics.
Romero, M., Moreo, A., Castro, J., & Zurita, J. (2012). Using Wikipedia concepts and frequency in language to extract key terms from support documents. Expert Systems with Applications, 39(18).
Russom, P. (2007). BI search and text analytics: TDWI best practices report.
Sager, J. (1998). In search of a foundation: Towards a theory of the term. Terminology, 5(1), 41–57.
Salton, G. (1991). Developments in automatic text retrieval. Science, 253(5023), 974.
Toutanova, K., Klein, D., Manning, C., & Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (Vol. 1).

Uchimoto, K., Sekine, S., Murata, M., Ozaku, H., & Isahara, H. (2001). Term recognition using corpora from different fields. Terminology, 6(2), 233–256.
van Rijsbergen, C. (1979). Information retrieval. London: Butterworths.
Vivaldi, J., & Rodriguez, H. (2001). Improving term extraction by combining different techniques. Terminology, 7(1), 31–48.
Wright, S., & Budin, G. (1997). Handbook of terminology management: Basic aspects of terminology management. John Benjamins Pub Co.
Xu, F., Kurz, D., Piskorski, J., & Schmeier, S. (2002). Term extraction and mining of term relations from unrestricted texts in the financial domain. Business Information Systems. Springer.
Zhang, Z., Iria, J., Brewster, C., & Ciravegna, F. (2008). A comparative evaluation of term recognition algorithms. In Proceedings of the sixth international conference on language resources and evaluation (LREC 2008).
Zhang, W., Yoshida, T., & Tang, X. (2009). Using ontology to improve precision of terminology extraction from documents. Expert Systems with Applications, 36(5), 9333–9339.
