Appl Intell (2011) 34: 311–327 DOI 10.1007/s10489-009-0197-4
Automatic extraction of acronym definitions from the Web
David Sánchez · David Isern
Published online: 30 September 2009 © Springer Science+Business Media, LLC 2009
Abstract Acronyms are widely used to abbreviate and stress important concepts. Discovering the definitions associated with an acronym is an important matter for supporting language processing and knowledge-related tasks such as information retrieval, ontology mapping or question answering. Acronyms represent a very dynamic and unbounded topic that is constantly evolving. Manual attempts to compose a global-scale dictionary of acronym-definition pairs require an overwhelming amount of work and give limited results. Addressing these shortcomings, this paper presents an automatic and unsupervised methodology to generate acronyms and extract their potential definitions from the Web. The method has been designed to minimise the set of constraints, offering a domain-independent and, partially, language-independent solution, and to exploit the Web in order to create large and general acronym-definition sets. Results have been manually evaluated against the largest manually built acronym repository, Acronym Finder. The evaluation shows that the proposed approach is able to improve the coverage of manual attempts while maintaining high precision.

Keywords Acronyms · Information extraction · Web mining
D. Sánchez · D. Isern
Department of Computer Science and Mathematics, Intelligent Technologies for Advanced Knowledge Acquisition (ITAKA) Research Group, University Rovira i Virgili, Tarragona, Catalonia, Spain
D. Sánchez, e-mail: [email protected]
D. Isern, e-mail: [email protected]
1 Introduction

Acronyms are textual forms used to refer to relevant concepts or entities [14]. Human languages are very prone to the creation of acronyms in order to (i) stress the importance of entities, (ii) avoid redundancy by omitting an entity's long form and (iii) offer an alternative way of referring to the same entity which is easier to remember. Some characteristics of acronyms are:
• They are very dynamic. New acronyms are defined every day for almost every possible domain of knowledge. This is especially evident in domains such as biomedicine [15, 39].
• They are highly polysemic. Acronyms are composed of a short combination of alpha-numeric characters (commonly from 2 to 6 characters). Consequently, the number of possible combinations is limited and biased towards the simpler forms. Some short combinations of letters may correspond to dozens of possible entities (e.g., ABC stands for 253 different entities, according to Acronym Finder¹).
• They have a very diverse degree of generality. Some acronym-definition pairs are very common (e.g., USA—United States of America) but others are rare and not referred to outside the information source in which they are defined (e.g., USA—Unique Settable Attributes).

Formally, an acronym may correspond to one or more definitions from which several participating characters are used to construct the acronym. Identifying equivalent acronym-definition pairs is a crucial task in natural language processing and information retrieval [18, 42].

¹ Web site: http://www.acronymfinder.com [last access: 01/09/2009].
Ontology Population and Question Answering are other areas in which acronym handling can improve language understanding [48]. However, due to the previously introduced characteristics, it is very difficult to construct a general and up-to-date repository of acronym-definition pairs [45]. From the manual point of view, there have been some ambitious attempts to provide global and reliable acronym dictionaries, such as Acronym Finder or The Internet Acronym Server.² They offer a valuable source of knowledge at the cost of a huge amount of human work. For example, Acronym Finder, which is the world's largest dictionary of acronyms, includes more than 750,000 human-edited definitions; the total effort required to compile this set is estimated at more than 6,000 hours,³ a task performed over the last 11 years. Moreover, manual knowledge acquisition constitutes a bottleneck which limits the coverage, as will be shown in the evaluation section. Consequently, automated methodologies may aid the composition of those repositories.

Recently, there have been approaches focused on the automatic identification of acronym-definition pairs within text [30-32]. As will be shown in the next section, some of them require a certain amount of supervision or training and/or the introduction of restrictive constraint sets to associate acronyms with definitions. Also, most of them have a limited scope (e.g., acronyms and definitions are retrieved from a unique input document) and they are language-dependent due to the linguistic analyses performed over the text.

This paper presents a novel method to, at first, generate valid acronyms and, then, retrieve feasible definitions. The proposal uses the Web as source and handles these tasks automatically and in an unsupervised way, creating large acronym-definition sets from scratch. It is based on general extraction patterns and a reduced filtering constraint set, configuring a domain-independent approach. As no language-dependent analyses are introduced, cross-language results can also be retrieved. However, the proposal is limited to languages written with Latin characters (e.g., English, Spanish, Italian, etc.). Finally, it exploits the Web information distribution to estimate the results' reliability, which can be used to filter the results and potentially improve the precision. Results have been evaluated against the largest manually built repository in order to check their accuracy and to show the benefits that an automatic approach can bring in extending manually composed repositories.

The rest of the paper is organized as follows. Section 2 describes previous works in the area of acronym-definition discovery. Section 3 analyses the characteristics of the acronym-definition identification problem.

² Web site: http://acronyms.silmaril.ie/cgi-bin/uncgi/acronyms [last access: 01/09/2009].
³ Source: http://blog.acronymfinder.com/2008/04/acronym-finder-rolls-over-600000.html [14/05/09].
Section 4 describes the proposed methodology in detail. Section 5 presents the evaluation, showing and discussing the obtained results. The last section gives some conclusions and proposes some lines of future work.
2 Related works

In recent years, there have been several attempts at dealing with the acronym identification task (i.e., given an input text document, acronyms are identified and associated with a definition). Taghva and Gilbreth [38] use a word window to retrieve definition candidates from the acronym's surroundings. The window length is a function of the number of letters in the acronym. Those letters should also match the words in the definition. An additional linguistic analysis is performed over the definition to detect which word types (i.e., stop words, hyphenated and normal) contribute to the acronym definition. The Longest Common Subsequence algorithm [23] is then used to select the definition. Larkey et al. [28] use a similar approach, identifying acronyms through capitalizations and windows of 20 words near the acronym. A linguistic analysis is applied to detect the presence of meaningless words and to check those which can contribute acronym letters. Additionally, simple patterns, such as parentheses or constructions indicating equivalences, are used to check the suitability of acronym definitions. However, non-parenthetical patterns introduce constraints into the analysis which may hamper the recall. Park and Byrd [32] propose a combination of pattern-based abbreviation rules with text markers and cue words (e.g., "or", "short", "stand") to detect acronym-definition pairs. Zahariev [48] presents a model for acronym identification based on strong constraints but allowing different languages. Yeates [44] uses special characters such as commas, parentheses and periods to extract acronym-definition candidates. Acronyms are detected using capitalization heuristics and the definition is associated by evaluating several rules, such as the matching of letters, without considering English stop words. Adar [1] introduces some basic scoring rules to rank the suitability of acronyms and their definitions according to the number of matching letters or the presence of parentheses. Liu and Friedman [29] use collocation measures and parenthetical expressions to associate definitions with acronyms. However, their approach cannot recognise expanded forms occurring only once in the corpus.

There are also some approaches which use supervised models. Chang and Schütze [8] use logistic regression as a learning algorithm, which employs features to describe the association between the acronym letters and the definition.
Nadeau and Turney [30] also present a supervised system which uses machine learning with weak constraints to improve the recall. Dannells [14] presents a supervised machine learning algorithm (Memory-Based Learning). Internally, acronym-definition pairs are described by a vector of syntactic features, and the algorithm is trained to classify them.

The presented approaches are meant to detect acronym candidates within a text (which represents a bounded context), trying to associate a suitable definition extracted from the same document. Very few approaches have been developed to create a large-scale dictionary, as in the present work. Okazaki and Ananiadou [31] introduce a method to compose an acronym-definition repository from a large text collection. Their approach focuses on terms appearing frequently in the proximity of an acronym, measuring the likelihood of such terms being the definitions of the acronym. However, the repository scope depends highly on the manually composed input corpus. Recently, Yoon et al. [45] have used a set of definitions to create possible acronyms through an automatic acronym generation algorithm (for the Korean language). The authors check the suitability of acronym candidates by estimating the probability of appearance of a given acronym-definition pair in a set of Web resources.

Most of the presented methods deal with English-written resources, with very few of them being multi-language [48]. Some authors have dealt with other languages. For instance, Hisamitsu and Niwa [22] analyse Japanese-written articles using measures of co-occurrence between inner and outer parenthetical expressions, and Yoon et al. [45] deal with Korean Web resources.

Research on acronyms has been applied in relevant domains such as Genetics and Medicine [33, 46]. Schwartz and Hearst [37] also work over biomedical text, identifying acronyms with a set of basic restrictions and searching their surroundings for the shortest definition which matches all the acronym letters.

From the presented works, several conclusions can be extracted:
• Most approaches use a set of patterns to extract definition candidates for a given acronym and some heuristics (mainly constraint sets) to evaluate their suitability. Capitalization heuristics, letter matching and parenthetical rules are the most common and effective ones [30].
• Most of the approaches are applied over English-written resources. Some of them employ a certain degree of linguistic analysis (e.g., detection of stop words), which hampers their applicability to other languages.
• In unsupervised (mainly rule-based) approaches, the results' quality depends highly on the set of constraints used to filter acronym definition candidates. As stated in [30],
the use of strong constraints over a reduced corpus results in high precision but compromises the final recall.
• Most of the approaches use a document or a set of documents as source. This is a way of contextualizing the search towards a certain domain. However, it introduces limitations because, if no explicit match for the defined patterns appears within the text, no results will be obtained.

As very few approaches aim to detect wide and general definition sets for a given acronym (which is the goal of the present approach), the computation of the degree of generality or reliability of a definition for a certain acronym has not been necessary in previous works.
3 Acronym-definition analysis

Acronyms are sequences of alpha-numeric characters, usually capitalized, even though some intra-word non-capitalized characters may also appear. Examining the literature, many authors establish a minimum length of 2–3 characters and a maximum of 9–10 [28, 32, 38], even though normal acronyms rarely have more than 5 letters. From the generative point of view, they are created by aligning and extracting some parts from a definition composed of several words. As introduced in [45], according to the generation rules, acronyms can be classified into three types:
• Character-based acronyms are generated by the typically ordered combination of characters from the definition. Most acronyms for Latin-based languages are composed in this manner (e.g., BBC stands for British Broadcasting Corporation).
• Syllable-based acronyms are composed by joining some syllables from the definition (e.g., University of PENNsylvania is abbreviated as UPENN).
• A combination of the above types (e.g., RADAR mixes syllables and initial characters of RAdio Detection And Ranging).

These are the types of acronyms considered in the proposed approach. Following the generative rules, all the characters in the acronym should appear in the same order in the corresponding definition. This multi-language and domain-independent rule is very powerful and one of the bases of the acronym-definition discovery heuristic.

There are, however, exceptions. For example, in fields such as Genetics and Medicine, it is possible to find acronyms such as E2, which stands for "estradiol-17 beta" and does not follow the presented rules. There are also situations in which the expansion represents a translation (e.g., in German "Vereinte Nationen (UN)", which corresponds to United Nations; also numbers, such as 2, can be translated as two, bi, duo, etc.).
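The ordering rule reduces to a simple ordered-subsequence test. The following is a minimal Python sketch of it (our own illustration, with our own naming, not the authors' implementation):

```python
def follows_generation_rule(acronym: str, definition: str) -> bool:
    """Check that every acronym character appears in the definition
    in the same order (case-insensitive), as the generative rule requires."""
    remaining = definition.lower()
    for ch in acronym.lower():
        pos = remaining.find(ch)
        if pos == -1:
            return False          # character missing or out of order
        remaining = remaining[pos + 1:]
    return True

# Examples from the text:
assert follows_generation_rule("BBC", "British Broadcasting Corporation")
assert follows_generation_rule("RADAR", "RAdio Detection And Ranging")
assert not follows_generation_rule("E2", "estradiol-17 beta")  # exception noted above
```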
Table 1 List of language-independent acronym-definition association patterns

| Pattern | Example |
|---|---|
| Acronym (definition) | NBC (National Broadcasting Company) |
| Definition (acronym) | British Broadcasting Corporation (BBC) |
| Acronym -definition- | CNN -Cable News Network- |
| Definition -acronym- | Home Box Office -HBO- |
Those situations can be tackled using language-dependent text processing tools (such as translators) or weaker constraints, as in [39]. These rare cases will not be considered in this proposal, as they would require much more analysis and background knowledge, hampering the performance of the system as a general solution.

3.1 Pattern-based analysis

Different strategies can be followed to decide if a definition is a valid expansion of an acronym but, as a general requirement, acronyms and definitions must be adjacent. Definition identification is based on patterns that describe different syntactic situations. Those patterns can be divided according to their language dependency into:
• Multi-language. These are the most effective patterns, mainly based on the use of parenthetical expressions [30]. More concretely, patterns such as acronym (definition) and definition (acronym) are the most used ones. Languages based on the Latin character set follow these rules.
• Language-dependent. This type of pattern involves the use of additional words (e.g., "also known as", "aka", "short of", "stand", etc.), resulting in a regular expression which is dependent on the particular language (typically English).

In order to avoid language-dependent constraints, only the first type of pattern is employed in the proposal. Only punctuation symbols are used (parentheses and dashes), as shown in Table 1.
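For illustration, the patterns of Table 1 can be approximated with regular expressions over plain text. This is a sketch under our own naming and simplifying assumptions (e.g., a rough acronym shape), not the exact expressions used in the system:

```python
import re

# Pattern pairs from Table 1: "ACRONYM (definition)" / "definition (ACRONYM)",
# and the same forms delimited by dashes instead of parentheses.
ACRO = r"[A-Z][A-Z0-9]{1,8}"   # rough acronym shape (see Sect. 3)
DEF = r"[^()\-]{3,}?"          # any run of text without the delimiters

ACRONYM_FIRST = re.compile(rf"\b({ACRO})\s*[(\-]\s*({DEF})\s*[)\-]")
DEFINITION_FIRST = re.compile(rf"({DEF})\s*[(\-]\s*({ACRO})\s*[)\-]")

text = "The report cites CNN -Cable News Network- and the BBC."
print(ACRONYM_FIRST.findall(text))   # [('CNN', 'Cable News Network')]
```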
3.2 The Web as a learning corpus

As shown, some approaches focus on the detection of acronym-definition pairs within a unique input text. Considering the typical patterns exploited to associate acronyms and definitions, the use of a unique source implies that only a reduced number of pattern matches can be retrieved. In order to obtain good quality results, those approaches design extraction and selection rules from two points of view. On the one hand, in order to maximise the amount of definitions, weak constraint sets can be specified; this may impact the precision, or force the use of additional analyses like the supervised approach presented in [14]. On the other hand, in order to obtain a high precision, strong constraint sets may be employed but, in this case, recall may suffer. In addition, there exist very common acronyms, such as "USA", which will often appear without being defined. Consequently, using a limited corpus may compromise the algorithm's performance if an explicit acronym-definition match is expected.

However, one may maintain the precision of a shallow analysis based on relatively strong constraints and minimise the data sparseness problems (i.e., improve the recall) of an unsupervised pattern-based approach by using a wider repository like the Web. The Web is the biggest repository of information available [3], with more than 100 billion web resources indexed by Google. The amount and heterogeneity of information available on the Web [24] are very adequate for developing methods which aim to provide global and up-to-date results, like the proposed approach.

In addition, the Web exhibits another important characteristic: high redundancy. This redundancy can be used for developing reliable shallow linguistic analytic techniques [3, 18], and for evaluating the relevance of words [13]. In our case, data redundancy implies that the same information (acronyms and their corresponding definitions) may appear many times in many different textual forms, simpler or more complex, explicit or implicit. Focusing on the simpler cases eases the extraction process.

Finally, publicly available Web search engines are very effective as massive Web Information Retrieval tools [16]. In our case, Web search engines can be exploited as domain-independent information retrieval tools to obtain resources to analyse, forcing the appearance of specific textual patterns by means of search queries.
4 Methodology

The proposed methodology is divided into three main stages (see the pseudocode in Fig. 1): acronym generation, definition retrieval, and definition reliability estimation and filtering.

4.1 Acronym generation

The GenerateAcronyms function (line 4 of Fig. 1) creates acronym candidates using the generation rules presented in Sect. 3 through the combination of Latin letters (A–Z) and numbers (0–9) of a given length or range (from minLetters to maxLetters). Candidates composed only of numbers are discarded because they are not acronyms. Strings with special characters (e.g., ".", "-") are not considered because these characters are not supported by Web search engines (e.g., R.A.D.A.R. leads to the same results as RADAR, both forms being valid acronyms for RAdio Detection And Ranging).
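A minimal sketch of such a generator (our own code mirroring this description; the actual pseudocode in Fig. 1 may differ):

```python
from itertools import product

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"   # Latin letters and digits

def generate_acronyms(min_letters: int, max_letters: int):
    """Yield every candidate acronym in the given length range,
    skipping purely numeric combinations (they are not acronyms)."""
    for length in range(min_letters, max_letters + 1):
        for combo in product(ALPHABET, repeat=length):
            candidate = "".join(combo)
            if not candidate.isdigit():
                yield candidate

# e.g., the 17,576 purely alphabetic 3-character candidates of Sect. 5.2:
three_letter = [c for c in generate_acronyms(3, 3) if c.isalpha()]
assert len(three_letter) == 26 ** 3
```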
Fig. 1 General acronym-definition discovery algorithm in pseudocode
4.2 Definition retrieval

The RetrieveSnippets function (line 34 of Fig. 1) queries each acronym candidate in a Web search engine. Unfortunately, Web search engines do not distinguish the punctuation symbols included in the acronym-definition patterns and, consequently, the appearance of an exact match cannot be forced. For instance, the query "AAA (*)" returns the same set of results as "AAA". The obtained corpus of texts is analysed to extract definition candidates. However, considering how rarely several acronym-pattern matches are found within a unique document, the complete online analysis of each Web site may result in a low learning-to-effort ratio. In addition, as each website typically refers to the acronym in one concrete sense, all candidate definitions co-occurring in it would belong to the same acronym sense (i.e., the same definition). The fact that words tend to exhibit only one sense in a given discourse or document was demonstrated by Yarowsky [43] on a large corpus (37,232 examples) with a very high precision (around 99%). For those reasons, we opted to analyse only the Web abstracts (snippets) provided by the web search engine, which present the two or three lines in which the query match appears. In this manner, with only one query, it is possible to retrieve up to 200 different web snippets representing the same number of web documents. Two advantages arise from this approach: (i) the minimization of the number of web accesses and (ii) the maximization of the corpus heterogeneity, increasing the diversity of the information sources and, consequently, of the acronym senses covered in polysemic cases.

The ExtractNewDefinitions function (line 36 of Fig. 1) parses the snippet set to find matches of the acronym-definition patterns. The patterns introduced in Sect. 3.1, involving only punctuation symbols, are used in order to retrieve definitions for different Latin-based languages. No language-dependent analysis (e.g., stop word detection, POS tagging, chunking, etc.) is performed, in order to avoid language-dependent restrictions.

The ValidateDefinitions function (line 38 of Fig. 1) filters the list of definition candidates to select only those which fulfil the set of rules shown in Table 2. Those rules have been selected considering the characteristics of acronym-definition association (i.e., letter participation, as introduced in Sect. 3) and to be general and valid for different languages. The goal is to minimise the set of candidates which will be evaluated in the final stage without heavily compromising the recall.
Table 2 Acronym definition filtering rules

| Rule | Description |
|---|---|
| Rule1 | All acronym characters must appear in the definition |
| Rule2 | Acronym characters must appear in the same order as in the definition |
| Rule3 | Definition must begin with the same letter as the acronym |
| Rule4 | Definition maximum length is n ∗ 10, where n is the number of acronym characters |
| Rule5 | Definition must have at least one more character than the acronym |
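A sketch of the five rules (our own code; it reuses the ordered-subsequence test shown in Sect. 3):

```python
def validate_definition(acronym: str, definition: str) -> bool:
    """Apply the filtering rules of Table 2 to a candidate definition."""
    # Rules 1 and 2: every acronym character appears in the definition, in order
    if not follows_generation_rule(acronym, definition):
        return False
    # Rule 3: the definition begins with the same letter as the acronym
    if definition[:1].lower() != acronym[:1].lower():
        return False
    # Rule 4: the definition is at most n * 10 characters long
    if len(definition) > 10 * len(acronym):
        return False
    # Rule 5: the definition is at least one character longer than the acronym
    return len(definition) > len(acronym)
```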
The strongest constraints are established by the first two rules, which are meant to discover typical acronym constructions (as introduced in Sect. 3). Rule 3 focuses on detecting the beginning of the definition in patterns like definition (acronym) or definition -acronym- without relying on stop word analysis. This heuristic may, in some situations, cause the loss of meaningless words belonging to the definition (e.g., determiners). For those patterns, the end of the definition is detected using the punctuation symbol ('(' or '-'), i.e., the longest possible definition is taken. Rules 4 and 5 are introduced to minimise the possibility of retrieving malformed candidates (e.g., missing definition terms or arbitrarily concatenated text). Those cases are very common when dealing with snippets, as web abstracts may omit significant parts of a sentence. In comparison, other authors (working on corpora with a limited scope) introduce stronger rules in order to improve the precision at the cost of a lower recall (like "only the first letter of the definition can participate" [38] or "only the first three letters of the definition can participate" [44]).

Spelling variations found through the definition set (e.g., united nations, United Nations) are also treated in a language-independent way: definitions with variations in capitalization or punctuation symbols are considered equivalent. No language-dependent stemming algorithms are applied. If definitions sharing the same root words are found (e.g., Volkswagen AG and Volkswagen), all the forms are stored independently.

4.2.1 Adaptive corpus analysis

How big should the set of web snippets be in order to acquire a relevant set of acronym definitions? On the one hand, due to the automatic and naive nature of the acronym generation process, there will be many occasions in which a combination of characters has never been used as an abbreviation, resulting in unproductive analyses. On the other hand, some character combinations may result in highly polysemic acronyms for which even a thousand resources may not be enough to discover some definitions. In order to set the appropriate size of the web corpus as a function of the acronym, an adaptive algorithm dynamically increases its size according to the learning throughput. At first, a batch of NUMBER_WEBS_PER_ITERATION snippets (e.g., the maximum allowed by the web search engine) is analysed. As a result, a number of new definitions is extracted and selected. If it surpasses a MINIMUM_DEFINITIONS threshold (e.g., at least 1), the analysis continues by retrieving the next set of NUMBER_WEBS_PER_ITERATION snippets, controlled by the webOffset variable. The process continues until the number of results for an iteration does not fulfil the MINIMUM_DEFINITIONS threshold. In this manner, invalid acronym candidates are rejected early, as no definitions are found for them, whereas highly productive ones result in a more extended analysis.
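A sketch of this adaptive loop follows; retrieve_snippets and extract_valid_definitions are hypothetical placeholders for the search engine access and the pattern matching plus filtering described above:

```python
NUMBER_WEBS_PER_ITERATION = 200   # e.g., the per-query maximum of the engine
MINIMUM_DEFINITIONS = 1

def adaptive_analysis(query: str, definitions: set) -> None:
    """Fetch snippet batches while every batch still yields new definitions."""
    web_offset = 0
    while True:
        snippets = retrieve_snippets(query, web_offset, NUMBER_WEBS_PER_ITERATION)
        new_definitions = extract_valid_definitions(query, snippets) - definitions
        definitions.update(new_definitions)
        # Stop when an iteration is unproductive (or no more snippets are served)
        if len(new_definitions) < MINIMUM_DEFINITIONS or not snippets:
            return
        web_offset += NUMBER_WEBS_PER_ITERATION
```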
4.2.2 Query expansion

Even though a web search engine may return millions of results per query, considering that users rarely explore more than a few dozen results, it only gives access to the first 1000 resources. This poses a problem for highly productive acronyms if the resulting set of web resources is not big or heterogeneous enough to cover most of the possible definitions available on the Web. This problem is aggravated by the ranking algorithms of Web search engines (e.g., PageRank [21]) because they introduce a relevance-based bias. The consequence is that the first web sites cover the most common acronym definitions, whereas rarer ones remain hidden in not directly accessible web resources.

In order to overcome those limitations, query expansion algorithms exploit web search engine variability [18] to enhance retrieval by reformulating the initial query [9]. There exist different query expansion approaches according to the information exploited to extend the query:
• Thesaurus-based techniques [26, 27] use semantically related words (e.g., synonyms or hyponyms of the query terms) from a dictionary like WordNet [41].
• Co-occurrence-based techniques employ terms highly co-occurring with the initial query retrieved from a corpus (e.g., entire documents [34] or lexical affinity relationships [5]), resulting in an increase of retrieval precision [25].
• Relevance feedback techniques analyse the documents retrieved from the initial query in order to extract related information, in a supervised [7] or unsupervised fashion [27, 47].
• Brute-force techniques [17] recursively construct queries from an initial one by adding new terms from a repository of common words until the amount of results is below the maximum number of accessible resources. Varying the set of words, it is possible to coax a search engine into returning most of the resources.

In our case, queries for acronym definition discovery are very different from those seeking information related to a searched concept. Semantically related terms such as synonyms are not applicable to acronyms, and the introduction of terms related to already retrieved definitions would produce a negative effect (i.e., we aim to widen the search to unexplored acronym senses, not to bias it towards already considered ones). So, the first three types of query expansion are not directly applicable. Brute-force techniques may help to widen the corpus without introducing bias but, considering the potential amount of acronyms to analyse, the overhead introduced by the enormous number of required queries would compromise the scalability of the approach.

So, instead of using a general-purpose expansion algorithm, an adaptive approach was designed. The algorithm iteratively reformulates the query with two aims:
(i) To avoid the retrieval of resources covering definitions already retrieved. The definition set is avoided in further queries by iteratively adding an exclusion restriction to the acronym using the "-" or "NOT" query operators (lines 26 and 27 in Fig. 1). Even though long queries may not be supported by some search engines (e.g., the Google web interface supports up to 32 terms), this problem has not been observed when accessing the search engine via API (Google API) or when using other search engines (like MSN Live!).
(ii) To expand the search by including terms which may potentially belong to new acronym definitions (i.e., words with one or several participating letters) in order to increase the results' variance. This relies on the examination of the retrieved definitions. It has been observed that words with participating letters appear several times through the definition set. For example, for the "URV" acronym, the "U" stands in many occasions for adjectives such as "universal", "unified" or "uniform". On the other hand, the "V" commonly corresponds to the nouns "Vehicle" or "Value". This uniformity gives a clue that it is likely to discover new definitions involving repeated definition terms, so the search is expanded by adding them as seeds for further queries (line 24 of Fig. 1). The ExtractMostRepeatedUntreatedWord function (line 48 of Fig. 1) iteratively selects, after analysing the snippet set of the previous query, a new term appearing several times in the definition set. As an alternative to this process, the use of a thesaurus from which to extract words with potential participating letters was also considered, but the amount of queries resulting from word combinations would be overwhelming. Instead, the proposal recursively exploits acquired definitions as feedback to expand the search.

As a result of this iterative expansion algorithm, the search engine will provide a maximum of 1000 new resources to analyse for each multi-appeared term.
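For example, the reformulated queries of Table 3 can be composed by seeding a multi-appeared term and excluding the definitions already acquired (a sketch with our own naming; the exact exclusion operator syntax depends on the search engine):

```python
def build_expanded_query(acronym, seed_term=None, excluded_definitions=()):
    """Compose a query seeking new senses of the acronym: include one
    multi-appeared seed word and exclude already acquired definitions."""
    parts = [f'"{acronym}"']
    if seed_term:
        parts.append(f'"{seed_term}"')
    parts.extend(f'-"{d}"' for d in excluded_definitions)   # exclusion operator
    return " ".join(parts)

# build_expanded_query("URV", "Vehicle", ["Universitat Rovira i Virgili"])
# -> '"URV" "Vehicle" -"Universitat Rovira i Virgili"'
```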
Table 3 Queries performed for the URV acronym. From left to right: queried acronym, query term extracted from multi-appeared words in the definition set, number of excluded definitions per query, search offset for the web query, accumulated number of analysed snippets, number of obtained definitions and fulfilment of the learning threshold

| Query | Included term | #Excluded definitions | Search offset | #Analysed snippets | #Obtained definitions | Threshold fulfilled |
|---|---|---|---|---|---|---|
| "URV" | – | – | 0 | 0 | 4 | True |
| "URV" | – | – | 200 | 200 | 5 | True |
| "URV" | – | – | 400 | 400 | 6 | True |
| "URV" | – | – | 600 | 600 | 7 | True |
| "URV" | – | – | 800 | 800 | 8 | True |
| "URV" | "Virgili" | 8 | 0 | 1000 | 13 | True |
| "URV" | "Virgili" | 8 | 200 | 1200 | 14 | True |
| "URV" | "Virgili" | 8 | 400 | 1400 | 15 | True |
| "URV" | "Virgili" | 8 | 600 | 1600 | 15 | False |
| "URV" | "University" | 15 | 0 | 1800 | 16 | True |
| "URV" | "University" | 15 | 200 | 2000 | 16 | False |
| "URV" | "Universitat" | 16 | 0 | 2200 | 16 | False |
| "URV" | "Tarragona" | 16 | 0 | 2400 | 16 | False |
| "URV" | "Value" | 16 | 0 | 2600 | 17 | True |
| "URV" | "Value" | 16 | 200 | 2800 | 18 | True |
| "URV" | "Value" | 16 | 400 | 3000 | 18 | False |
| "URV" | "Vehicle" | 18 | 0 | 3200 | 24 | True |
| "URV" | "Vehicle" | 18 | 200 | 3400 | 25 | True |
| "URV" | "Vehicle" | 18 | 400 | 3600 | 26 | True |
| "URV" | "Vehicle" | 18 | 600 | 3800 | 26 | False |
| "URV" | "Unit" | 26 | 0 | 4000 | 28 | True |
| "URV" | "Unit" | 26 | 200 | 4200 | 29 | True |
| "URV" | "Unit" | 26 | 400 | 4400 | 29 | False |
| "URV" | "Underwater" | 29 | 0 | 4600 | 30 | True |
| "URV" | "Underwater" | 29 | 200 | 4800 | 30 | False |
| "URV" | "Underwater" | 29 | 400 | 5000 | 31 | True |
| "URV" | "Underwater" | 29 | 600 | 5200 | 31 | False |
| "URV" | "Urban" | 31 | 0 | 5400 | 32 | True |
| "URV" | "Urban" | 31 | 200 | 5600 | 33 | True |
| "URV" | "Urban" | 31 | 400 | 5800 | 33 | False |
| "URV" | "Unmanned" | 33 | 0 | 6000 | 33 | False |
So, each analysis iteration is fed with the newly acquired definitions (line 40 of Fig. 1). The process ends when all multi-appeared terms have been used to create new queries (line 14 of Fig. 1) and the adaptive analysis of web resources has been executed for each one.

4.2.3 An example

The behaviour of the adaptive corpus analysis and query expansion algorithms for the URV acronym is presented in Table 3. In that case, the analysis is iteratively expanded up to 6000 web resources. Initially, analysing only the first directly accessible 1000 snippets, the algorithm finds 13 definitions. After several query expansions, it is able to discover up to 33 definitions.
4.3 Definition reliability estimation and filtering

The set of obtained definitions has been extracted from individual observations. Even though the definitions are apparently correct (as they fulfil the filtering rules), no clue about their accuracy or reliability with respect to the acronym is provided. Some problems may affect the set of definitions due to the lack of an extended linguistic analysis over the text, such as word combinations fulfilling the definition rules by pure chance or the presence of misspelled or incomplete definitions. In order to tackle these errors, an additional filtering step estimates the reliability of each definition by exploiting the Web's information distribution (lines 54 and below of Fig. 1). It is based on the amount of acronym-definition co-occurrence at Web scale [11].
4.3.1 Web-scale statistics

In order to statistically assess the degree of relatedness between two words from their co-occurrence, one can consider term collocation functions of the form (1):

c_k(a, b) = \frac{p(ab)^k}{p(a)\,p(b)}    (1)

where p(a) is the probability that the word a occurs within the text, and p(ab) is the probability of co-occurrence of the words a and b. From this formula, one can define the Symmetric Conditional Probability (SCP) [19] as c_2 and the Point-wise Mutual Information (PMI) [10] as \log_2 c_1.

The problem is that the computation of co-occurrence measures over an enormous repository like the Web is not practical. However, Web Information Retrieval tools can be a valuable help. In fact, it has been demonstrated that the probabilities of terms indexed by a web search engine, conceived as the page counts returned by the search engine divided by the number of indexed pages, approximate the relative frequencies of those terms as actually used in society [11]. Taking this premise into consideration, Turney [40] adapted PMI to approximate term probabilities from web search hit counts (web-scale statistics). He defined a score (2) to compute the collocation between an initial word (problem) and a related candidate concept (choice):

Score(choice, problem) = \frac{hits(problem\ AND\ choice)}{hits(choice)}    (2)
This measure is very similar to the original PMI (\log_2 c_1) but, since it seeks a comparative score among a set of choices, it drops the \log_2 and the p(problem) factor of the denominator, as the latter has the same value for all choices.

4.3.2 Estimating definition reliability

The estimation of definition reliability exploits Web-scale acronym-definition co-occurrence from two points of view.

First, the absolute co-occurrence value gives an idea of the acronym-definition pair's generality, allowing correct forms to be distinguished from misspelled ones (which are much rarer in comparison). The ComputeDefinitionWebOccurrences function (line 54 of Fig. 1) constructs, for each acronym definition, web queries to evaluate the absolute co-occurrence using the introduced extraction patterns. As web search engines do not distinguish punctuation symbols, only two different queries can be constructed: "acronym definition" and "definition acronym". Note the use of double quotes (" ") to force the immediate adjacency of the terms. Adding the individual hit counts of both queries (3), it is possible to retrieve the amount of explicit acronym-definition co-occurrence in the Web. If this value does not surpass the MINIMUM_COOCCURRENCES threshold (line 55 of Fig. 1), the definition is likely to be misspelled or erroneous and it is discarded. In our tests, this constant has a value of 1.

Cooccur_i(acronym, definition_i) = hits("acronym\ definition_i") + hits("definition_i\ acronym")    (3)
Then, taking Turney's score as a basis, the ComputeWebScore function (line 58 of Fig. 1) normalizes the absolute co-occurrence value by dividing it by the number of appearances of the definition alone, computing a conditional probability. The number of hits of the acronym can be eliminated from the denominator as it has the same value for the whole definition set. The result of this score (4) gives a robust estimation of the percentage of observations in which the acronym-definition pair explicitly appears within the definition's scope. Consequently, the higher the value, the more evidence of the association's reliability.

Score_i(acronym, definition_i) = \frac{Cooccur_i(acronym, definition_i)}{hits("definition_i")}    (4)
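Expressions (3) and (4) translate directly into code. In the sketch below, hits is a hypothetical wrapper around the search engine's page-count API:

```python
MINIMUM_COOCCURRENCES = 1

def definition_web_score(acronym: str, definition: str):
    """Return the Web score (4) of a definition, or None when the absolute
    co-occurrence (3) does not reach the minimum threshold."""
    cooccurrences = (hits(f'"{acronym} {definition}"')
                     + hits(f'"{definition} {acronym}"'))
    if cooccurrences < MINIMUM_COOCCURRENCES:
        return None   # likely misspelled or erroneous; discard
    return cooccurrences / hits(f'"{definition}"')
```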
On the one hand, an estimation of the reliability of a definition is valuable information that may aid the final user in better understanding the results, for instance by observing which are the most common definitions of an acronym. On the other hand, it may be used to further filter the list of definitions, omitting those for which the score value is below a certain threshold. The use of statistical selection assessors is very common in unsupervised approaches working over noisy environments [12, 18, 36] to filter potentially non-related terms, improving the precision as a result. In our case, it is interesting to test whether the use of simple rules is enough to filter most of the incorrect candidates or whether an additional statistical assessor can help to improve the precision while maintaining the coverage. This will be tested in the evaluation section.

4.3.3 An example

Table 4 lists some definitions for the acronym URV. This example shows the presence of multi-lingual definitions, such as Unidade Real de Valor (Portuguese), Universitat Rovira i Virgili (Catalan) and Union Reiseversicherung AG (German). Also, a misspelled item, unmanned reconaissance vehicle, is rejected according to the absolute co-occurrence value. Alternative lexicalizations of the same definition were also obtained, such as Universitat Rovira Virgili and Universitat Rovira i Virgili. Finally, translations to several languages can be found, such as Universitat Rovira i Virgili (Catalan),
Table 4 Examples of definitions for the acronym URV sorted by Web score (4). In italics, an example of a misspelled definition, rejected according to the total co-occurrence value (3)

| Definition | Co-occurrence | Score |
|---|---|---|
| UVGI Rating Value | 56 | 0.767 |
| Unidade Real de Valor | 3080 | 0.704 |
| Urban Regional Very Large | 22 | 0.431 |
| Underwater Roving Vehicle | 3 | 0.375 |
| *Unmanned reconaissance vehicle* | 1 | 0.25 |
| Uniform Resource Visualization | 28 | 0.193 |
| Unit Review Visit | 4 | 0.153 |
| Unit Readiness Validation | 6 | 0.076 |
| Ultimate Robotic Vehicle | 20 | 0.071 |
| Unmanned Research Vehicle | 7 | 0.059 |
| Unit Reference Value | 8 | 0.057 |
| United Recreational Vehicles LLC | 4 | 0.055 |
| Urban Regeneration Vehicle | 11 | 0.055 |
| Union Reiseversicherung AG | 262 | 0.052 |
| Universitat Rovira i Virgili | 5990 | 0.045 |
| Urban Recreational Vehicle | 6 | 0.038 |
| Universitat Rovira Virgili | 24 | 0.036 |
| University Rovira i Virgili | 207 | 0.033 |
| Upper Range Value | 184 | 0.029 |
| Universidad Rovira i Virgili | 693 | 0.026 |
Universidad Rovira i Virgili (Spanish) and University Rovira i Virgili (English).
5 Evaluation

The evaluation of automatic learning procedures which deal with highly dynamic environments like acronyms and unbounded corpora like the Web is a challenging task. Fortunately, there exist general manually composed acronym-definition repositories, the mentioned Acronym Finder being the biggest one. Acronym Finder provides a generality-ranked set of definitions for a given acronym which can stand as a baseline to compare and evaluate automatically obtained results. Even so, being hand-made, it presents coverage limitations, as will be noted during the evaluation.

In this section, the design of the evaluation procedure is presented, describing the criteria, metrics and results for several tests. As the extraction and selection of acronym definitions is based on common patterns and rules used by previous approaches (summarised in Sect. 2), special care is put into evaluating the improvements brought by the two aspects which differentiate the proposal from previous ones: (i) the exploitation of the Web by means of the adaptive query expansion algorithm and (ii) the web-based score used to estimate the definitions' reliability.
Considering the amount of possible acronyms and definitions to evaluate and the bottleneck of a manual evaluation, partial (randomly selected) sample sets have been considered. In the end, more than 1800 acronym-definition pairs have been checked. Compared to evaluations performed by other authors, our set is considerably bigger (specifically, 166 pairs were evaluated in [31], 168 in [30], 861 in [14], and 815 in [45]). All tests have been performed under the same conditions, using the Google Search API and the algorithm parameters mentioned in the explanation (MINIMUM_DEFINITIONS = 1, MINIMUM_COOCCURRENCES = 1 and the maximum number of snippets supported by Google per query for the NUMBER_WEBS_PER_ITERATION constant).

5.1 Evaluation measures

The results' quality has been evaluated by means of the typical measures used in Information Retrieval: precision, recall and F-measure. Precision measures the percentage of correctly extracted definitions in relation to the complete extracted set (5). Due to the coverage limitations of Acronym Finder (i.e., many correctly extracted definitions are not included in it), the correctness of each definition is manually assessed by a human expert.

Precision = \frac{\#correct\ definitions}{\#total\ definitions}    (5)
Recall shows how many of the existing definitions have been extracted with respect to the baseline set provided by Acronym Finder (6).

Recall = \frac{\#Acronym\ Finder\ definitions\ extracted}{\#Acronym\ Finder\ definitions}    (6)
F-measure provides the weighted harmonic mean of precision and recall (7).

F\text{-}Measure = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}    (7)
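As a direct transcription of (5)-(7) (our own sketch):

```python
def evaluate(correct: int, extracted: int, af_found: int, af_total: int):
    """Precision (5), recall (6) against the Acronym Finder baseline, F-measure (7)."""
    precision = correct / extracted
    recall = af_found / af_total
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```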
5.2 Evaluation of highly polysemic acronyms

The first tests cover three-letter acronyms. They constitute an especially problematic set because, on the one hand, the amount of available definitions for such letter combinations can be overwhelming for manually constructed repositories. On the other hand, due to their shortness, they are very polysemic, with dozens of possible definitions per acronym. So, they are interesting in order to show the performance of the approach under the most adverse conditions.

The combination of 3 Latin non-numeric characters constitutes a candidate set of 17,576 possible acronyms. After the algorithm is executed over those
Table 5 Evaluation results for 20 three-letter acronyms against Acronym Finder

| Acronym | #Definitions AcroFinder | #Retrieved definitions | #Non-English definitions | Precision | Recall | F-measure |
|---|---|---|---|---|---|---|
| ABG | 19 | 115 | 50 (43.4%) | 94% | 52.6% | 67.4% |
| CNL | 16 | 87 | 24 (27.6%) | 91% | 50% | 64.5% |
| ETN | 10 | 57 | 13 (22.8%) | 89.4% | 70% | 78.5% |
| IQC | 8 | 28 | 4 (14.2%) | 92.8% | 75% | 82.9% |
| IQL | 5 | 13 | 5 (38.4%) | 92.3% | 80% | 85.7% |
| KMP | 9 | 90 | 66 (73.3%) | 86.6% | 77.7% | 81.9% |
| LEF | 15 | 111 | 29 (26.1%) | 95.5% | 60% | 73.7% |
| NIO | 7 | 46 | 29 (63%) | 86.9% | 71.4% | 78.4% |
| NLE | 14 | 38 | 4 (10.5%) | 94.7% | 57.1% | 71.2% |
| NRF | 20 | 111 | 26 (23.4%) | 96.4% | 80% | 87.4% |
| OLT | 20 | 110 | 33 (30%) | 95.4% | 75% | 84% |
| RBN | 9 | 54 | 24 (44%) | 92.6% | 66.6% | 77.5% |
| SFE | 19 | 177 | 39 (22%) | 92.6% | 63.1% | 75% |
| TWF | 13 | 87 | 34 (39.1%) | 93.1% | 69.2% | 79.5% |
| TWI | 13 | 101 | 39 (39.6%) | 85.1% | 69.2% | 76.3% |
| VDC | 17 | 134 | 39 (29.1%) | 95.5% | 70.6% | 81.2% |
| VSW | 9 | 30 | 5 (16.6%) | 90% | 66.6% | 76.5% |
| WME | 13 | 97 | 31 (31.2%) | 89.6% | 69.2% | 78.1% |
| WRP | 14 | 151 | 44 (29.1%) | 96.7% | 71.4% | 82.1% |
| WSN | 13 | 50 | 14 (28%) | 96% | 53.8% | 68.9% |
acronyms, a list of at least one definition was retrieved for 70% of the candidates. This indicates that combinations of three characters constitute an especially productive acronym set.

In order to perform the manual evaluation, we executed the algorithm for a random set of 20 acronyms with at least 5 available definitions in Acronym Finder; the aim is to analyse polysemic cases. The results' accuracy was manually evaluated and compared against the Acronym Finder sets. The total number of acronym-definition pairs manually evaluated was 1687. We have also counted the number of non-English definitions to show the capability of the system to retrieve multi-language results. This includes results in other languages and those definitions with cross-language terms (mainly Named Entities). The results are presented in Table 5.

First, it can be seen that, on average, 32% of the results correspond to non-English terms. The main languages in which definitions are expressed are Latin-based ones such as Italian, Portuguese or Spanish, even though other western-European languages such as German also appear frequently. Some letter combinations are more prone to English definitions, such as those starting with 'W'. Acronyms with letters that are rarer in English, such as 'K' (e.g., KMP), return a higher percentage of non-English definitions (73.3%).

From the evaluation measures, it can be observed that the precision is high and consistent across the evaluated cases (between 85% and 96%).
This high accuracy shows the effectiveness of the patterns used to extract candidates and of the rules employed to filter them. Even though it deals with a different corpus, this precision is higher than that of previous works attempting to compose large-scale acronym-definition sets (like [31], in which a precision of 78% is reported) and quite similar to previous state-of-the-art works (above 90% in most cases) dealing with unique domain documents, such as [30]. Generalizing the problem to a Web scale, this shows that the quality of the results is maintained even when using a much bigger, unstructured, noisier and apparently unreliable corpus.

Regarding the recall, it is lower and more variable, even though maintained within a usable range (between 50% and 78%). In order to study the causes of this situation, other indicators can be analysed. First, the absolute number of definitions automatically retrieved is much higher (almost one order of magnitude) than the list presented by Acronym Finder. Considering that the definitions have been validated by an expert, the coverage limitations of manually constructed repositories can be noticed.

Next, we analysed against the Web the Acronym Finder definitions which the system was not able to discover. First, it was found that the low recall was not caused by the set of selection rules, as most of the missing definitions fulfil them. In order to analyse other causes, each non-retrieved definition was queried in conjunction with the acronym in the web search engine to estimate the number of available Web documents from which an explicit acronym-definition match can be extracted.
Table 6 Analysis of non-retrieved Acronym Finder definitions. From left to right: number of missing definitions with 0, 1 to 9, and 10 or more web hits, and percentage of missing definitions located in the last two quartiles of the ranked Acronym Finder definition list

| Acronym | #Missing defs with 0 hits | #Missing defs with hits < 10 | #Missing defs with hits ≥ 10 | % Missing defs in 3rd and 4th quartiles |
|---|---|---|---|---|
| ABG | 4 (44.4%) | 5 (55.5%) | 0 (0%) | 60% |
| CNL | 2 (25%) | 4 (50%) | 2 (25%) | 75% |
| ETN | 1 (33.3%) | 1 (33.3%) | 1 (33.3%) | 100% |
| IQC | 0 (0%) | 1 (50%) | 1 (50%) | 50% |
| IQL | 0 (0%) | 1 (100%) | 0 (0%) | 100% |
| KMP | 0 (0%) | 1 (50%) | 1 (50%) | 50% |
| LEF | 1 (16.6%) | 5 (83.3%) | 0 (0%) | 62.5% |
| NIO | 0 (0%) | 1 (50%) | 1 (50%) | 100% |
| NLE | 2 (33.3%) | 3 (50%) | 1 (16.6%) | 83.3% |
| NRF | 0 (0%) | 3 (75%) | 1 (25%) | 100% |
| OLT | 1 (20%) | 2 (40%) | 2 (40%) | 50% |
| RBN | 0 (0%) | 2 (66.6%) | 1 (33.3%) | 100% |
| SFE | 3 (42.8%) | 4 (57.1%) | 0 (0%) | 100% |
| TWF | 2 (50%) | 1 (25%) | 1 (25%) | 75% |
| TWI | 0 (0%) | 2 (50%) | 2 (50%) | 75% |
| VDC | 2 (40%) | 2 (40%) | 1 (20%) | 55.5% |
| VSW | 1 (33.3%) | 1 (33.3%) | 1 (33.3%) | 100% |
| WME | 1 (25%) | 2 (50%) | 1 (25%) | 66.6% |
| WRP | 1 (25%) | 3 (75%) | 0 (0%) | 100% |
| WSN | 0 (0%) | 4 (66.6%) | 2 (33.3%) | 83% |
As a result (see Table 6), it was found that only 25% of the queries returned more than 10 results. From the remaining 75%, a significant 20% of the definitions returned zero hits. So, one can observe that the missing definitions correspond mainly to rare definitions with a very low (even non-existent) amount of Web occurrences (at least as indexed by the web search engine).

Considering that Acronym Finder presents definitions sorted by relevance according to their common use, we also evaluated the missing results in relation to their position in that ranked list. In order to measure this, the percentage of missing definitions with lower ratings (third and fourth quartiles) was calculated. As a result, on average, 79.3% of the missing definitions corresponded to the less relevant ones according to Acronym Finder (in all cases, the percentage is equal to or higher than 50%, as shown in Table 6). This also shows that the recall problems are associated with the rarest definitions.

Recall limitations have also been observed in previous unsupervised works attempting to construct acronym dictionaries (such as [45], with a maximum recall of 70.9%). So, data sparseness may appear even when using the Web as a learning corpus. Considering that the method completely relies on Google's IR recall, many pages belonging to the so-called deep Web [6] are not retrieved. In fact, it is estimated that the deep Web is several orders of magnitude larger than the surface Web.
In an ideal case, missing terms with one or more hits could be retrieved by the proposed approach by means of a more relaxed corpus analysis which seeks more resources (e.g., less constrained finalisation rules) and further expands web queries (e.g., introducing new terms). However, considering the problem size, the scalability of the approach could be compromised by the number of web accesses and search engine queries required to evaluate, in the worst case, the full set of Web resources available for a given acronym. Analysing the missing definitions individually, we also found that some of the non-retrieved definitions do not follow the generation rules presented in Sect. 3. As mentioned, those particularly problematic cases are very difficult to identify [39] and require new heuristics which may compromise the algorithm's generality.

5.3 Query expansion evaluation

We also tested the influence of the query expansion algorithm described in Sect. 4.2.2. The results obtained when analysing the static list of web resources presented by the search engine when querying the acronym (i.e., no query expansion, only 1000 web sites available) were compared against those obtained by the adaptive analysis presented in Sect. 4.2.2. The objective is to demonstrate the necessity and usefulness of the incremental query expansion algorithm in order to obtain results with good coverage. The results of this experiment are shown in Table 7.
Table 7 Evaluation of the results with and without applying the query expansion (QE) algorithm

| Acronym | #Definitions (with QE) | #Definitions (no QE) | Precision (with QE) | Precision (no QE) | Recall (with QE) | Recall (no QE) | F-measure (with QE) | F-measure (no QE) |
|---|---|---|---|---|---|---|---|---|
| ABG | 115 | 9 | 94% | 88.8% | 52.6% | 5.7% | 67.4% | 10.7% |
| CNL | 87 | 17 | 91% | 88.2% | 50% | 12.5% | 64.5% | 21.9% |
| ETN | 57 | 12 | 89.4% | 100% | 70% | 30% | 78.5% | 46.1% |
| IQC | 28 | 5 | 92.8% | 100% | 75% | 25% | 82.9% | 40% |
| IQL | 13 | 8 | 92.3% | 100% | 80% | 40% | 85.7% | 57.1% |
| KMP | 90 | 9 | 86.6% | 77.7% | 77.7% | 44.4% | 81.9% | 56.6% |
| LEF | 111 | 6 | 95.5% | 83.3% | 60% | 6.7% | 73.7% | 12.3% |
| NIO | 46 | 6 | 87% | 83.3% | 71.4% | 28.6% | 78.4% | 42.5% |
| NLE | 38 | 13 | 94.7% | 92.3% | 57.1% | 14.2% | 71.2% | 24.6% |
| NRF | 111 | 20 | 96.4% | 95% | 80% | 30% | 87.4% | 45.6% |
| OLT | 110 | 12 | 95.4% | 83.3% | 75% | 10% | 84% | 17.8% |
| RBN | 54 | 12 | 92.6% | 91.6% | 66.6% | 22.2% | 77.5% | 35.7% |
| SFE | 177 | 14 | 92.6% | 92.8% | 63.1% | 21% | 75% | 34.2% |
| TWF | 87 | 5 | 93.1% | 100% | 69.2% | 7.7% | 79.5% | 14.3% |
| TWI | 101 | 8 | 85.1% | 100% | 69.2% | 30.8% | 76.3% | 47.1% |
| VDC | 134 | 9 | 95.5% | 77.7% | 70.6% | 11.8% | 81.2% | 20.4% |
| VSW | 30 | 9 | 90% | 100% | 66.6% | 33.3% | 76.5% | 49.9% |
| WME | 97 | 12 | 89.6% | 91.6% | 69.2% | 30.7% | 78.1% | 46% |
| WRP | 151 | 15 | 96.7% | 100% | 71.4% | 28.5% | 82.1% | 44.3% |
| WSN | 50 | 9 | 96% | 88.8% | 53.8% | 23% | 68.9% | 36.5% |
In all the tested cases, the 1000 directly accessible web resources are not enough to obtain a representative set of definitions. Of the average of 101 definitions per acronym retrieved by means of the query expansion algorithm, only an average of 11 is obtained from the first 1000 resources. This results in a much lower recall, with an average of only 22.8% compared to the 67.4% obtained after the initial query is expanded. In both cases, precisions are very similar (91.72% vs. 92.31%), with a higher variability for the fixed set due to the lower amount of results. In conclusion, the F-measure without query expansion is less than half the one obtained by the proposed approach (35.18% against 77.5%).

5.4 Web-based reliability evaluation

Next, we evaluated the quality of the definition reliability estimation. As mentioned in Sect. 4.3.2, the Web-based score can be taken into consideration to further filter the results and improve the precision. In order to test this, we checked the distribution of the mistakes in the list of definitions sorted according to the computed reliability score. Table 8 summarises the obtained results with and without the last quartile, where the apparently less reliable definitions are located.

Several conclusions can be drawn. First, it can be observed that, on average, 51.8% of the total mistakes are located in the fourth quartile and, in all cases, the percentage
is equal to or higher than 25%. These results suggest that the Web-based score approximates the definitions' reliability by rating erroneous definitions with a low value, which can be used as a filter to improve the precision. As expected, the average precision rises from 92.3% to 94.8% when excluding the elements of the last quartile. The recall value is identical in most of the cases but, when the last quartile contains valid definitions, the value is lower. Considering the reduced amount of definitions available in Acronym Finder (10-20 per acronym), this fact significantly affects the final performance (lower F-measure). Nevertheless, in most cases, results are slightly better due to the improvement in selection accuracy.

5.5 Evaluating acronyms with low polysemy

In contrast to the short acronyms considered up to this point, for longer forms the number of definitions is significantly lower (e.g., AA stands for 266 definitions, AAA for 162, AAAA for 31, AAAAA for 5 and AAAAAA for 1, according to Acronym Finder). Unambiguous cases can be easily solved, as the queried acronym has very few senses, resulting in a high Web-IR precision [16]. These cases are evaluated in this section.

We took another random set of 20 acronyms with 4 letters for which Acronym Finder provides a minimum of 1 definition and a maximum of 5. In total, 159 acronym-definition pairs have
Table 8 Evaluation of definition reliability including and omitting the last quartile of definitions

| Acronym | %Mistakes in 4th quartile | Precision (with 4th quartile) | Precision (without 4th quartile) | Recall (with 4th quartile) | Recall (without 4th quartile) | F-Measure (with 4th quartile) | F-Measure (without 4th quartile) |
|---|---|---|---|---|---|---|---|
| ABG | 71.4% | 94% | 97.7% | 52.6% | 47.4% | 67.4% | 63.8% |
| CNL | 50% | 91% | 93.8% | 50% | 50% | 64.5% | 65.2% |
| ETN | 33% | 89.4% | 90.7% | 70% | 70% | 78.5% | 79% |
| IQC | 50% | 92.8% | 95.2% | 75% | 62.5% | 82.9% | 75.5% |
| IQL | 100% | 92.3% | 100% | 80% | 80% | 85.7% | 88.9% |
| KMP | 25% | 86.6% | 86.7% | 77.7% | 66.7% | 81.9% | 75.4% |
| LEF | 40% | 95.5% | 96.4% | 60% | 46.6% | 73.7% | 62.8% |
| NIO | 50% | 86.9% | 91.4% | 71.4% | 71.4% | 78.4% | 80.2% |
| NLE | 50% | 94.7% | 96.5% | 57.1% | 57.1% | 71.2% | 71.7% |
| NRF | 75% | 96.4% | 98.8% | 80% | 75% | 87.4% | 82.3% |
| OLT | 40% | 95.4% | 96.4% | 75% | 70% | 84% | 81.1% |
| RBN | 50% | 92.6% | 95% | 66.6% | 55.5% | 77.5% | 70% |
| SFE | 38.5% | 92.6% | 93.9% | 63.1% | 63.1% | 75% | 75.5% |
| TWF | 50% | 93.1% | 95.4% | 69.2% | 69.2% | 79.5% | 80.2% |
| TWI | 26.7% | 85.1% | 85.5% | 69.2% | 69.2% | 76.3% | 76.5% |
| VDC | 50% | 95.5% | 97% | 70.6% | 64.7% | 81.2% | 77.6% |
| VSW | 66.6% | 90% | 95.4% | 66.6% | 66.6% | 76.5% | 78.4% |
| WME | 60% | 89.6% | 94.5% | 69.2% | 69.2% | 78.1% | 79.9% |
| WRP | 60% | 96.7% | 98.2% | 71.4% | 71.4% | 82.1% | 82.7% |
| WSN | 50% | 96% | 97.3% | 53.8% | 53.8% | 68.9% | 69.3% |
been manually evaluated. The results are summarised in Table 9. In most cases, the system discovers a reduced amount of definitions, especially when a unique one exists in Acronym Finder. Recall is maximum in most situations, with only one case in which the definition set was not discovered (SHID). Precision follows the same tendency observed in the previous tests, with a high accuracy (94% on average).
6 Conclusions and further work In this paper, a novel approach to compile general and largescale acronym-definition sets is introduced. Considering that most of the previous attempts dealing with acronyms are only focused in the contextualized detection and discovery of acronyms and definitions in a document, the proposed approach can contribute by offering a more general solution. Specifically, being automatic and unsupervised, it may aid in the development of manually composed repositories such as Acronym Finder, improving the recall and maintaining results up-to-date (through continuous automatic executions). Even though the approach relies on the same principles as previous attempts (summarised in Sect. 2) with respect to the use of patterns and rules to extract and filter acronym definitions, several aspects differentiate it from those works:
• It is adapted to the Web environment, exploiting general purpose Web search engines in order to incrementally retrieve Web resources to analyse, minimising (but not completely eliminating) data sparseness. On the contrary, most of the previous works are applied over reduced and predefined corpora with a very limited or domain-dependent coverage. In fact, very few attempts have been made at compiling large acronym-definition sets (as shown in Sect. 2).
• Considering the unfeasibility (due to scalability problems) and impossibility (due to Web search indexing limitations) of a complete Web corpus analysis for a given acronym, an adaptive and incremental analysis based on the expansion of search queries according to the already acquired definitions is proposed. The algorithm shows its effectiveness in expanding the search to initially hidden resources, which improves recall.
• The generality of the approach relies on the use of general, domain-independent and multi-language patterns and selection rules. The limitations of pattern-based approaches are compensated by the high redundancy of Web information, which provides the same information in different textual forms. On the contrary, as introduced in Sect. 2, many approaches are language- or domain-dependent due to the language-dependent patterns employed or the use of linguistic analyses.
• The designed Web-based reliability assessor has proved to be a valid estimator of the suitability of a definition for a given acronym (a sketch of the kind of statistic involved is given after this list). Web-based statistical analyses have been extensively used in Information Extraction (e.g., discovery of relevant terms [18]) and Knowledge Acquisition tasks (e.g., Ontology Learning [36]) but, as far as we know, they have not been applied to estimate the degree of acronym-definition association.
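The exact score is defined earlier in the paper; purely to illustrate the kind of hit-count statistic such an assessor relies on (a sketch under our own assumptions, with `web_hits` as a hypothetical stub for any search engine's result-count API), a definition can be rated by the proportion of its Web occurrences that co-occur with the acronym:

```python
# Illustrative sketch only, not the paper's exact formula: estimate the
# acronym-definition association from search-engine hit counts.

def web_hits(query: str) -> int:
    """Hypothetical stub: return the result count a Web search engine reports."""
    raise NotImplementedError("plug in a search engine's hit-count API")

def reliability(acronym: str, definition: str) -> float:
    """Fraction of the definition's Web occurrences that also mention the acronym."""
    together = web_hits(f'"{definition}" "{acronym}"')
    alone = web_hits(f'"{definition}"')
    return together / alone if alone else 0.0
```

A low value flags definitions that rarely co-occur with the acronym on the Web, which is how the last-quartile filtering evaluated above discards erroneous candidates.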
The proposed approach offers accurate results (after the manual evaluation of more than 1800 acronym-definition pairs) with a reasonable level of coverage in comparison to manually built repositories. The number of results is an order of magnitude bigger than that of manual attempts, which shows the usefulness of the proposal. This fact also shows the value of the Web as a learning corpus [35], which has also been demonstrated by other authors in the fields of question answering [4], machine translation [20] and ontology enrichment [2].

As a future line of research, we will try to refine the query expansion algorithm in order to further extend the analysed corpus. Additional Web search operators and new terms can be employed to create queries retrieving new resources; a hypothetical sketch of this kind of refinement is given below. In addition, several Web search engines (e.g., Google, AltaVista, MSNLive!) could be combined in order to compose a more complete and heterogeneous corpus to analyse. The final objective will be to overcome the detected coverage issues. Other long-term research lines may include the detection of the definition language using automatic language recognisers, or the automatic clustering of domain-related definitions according to, for example, predefined categories.
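As an illustration of the kind of refinement meant here (a hypothetical sketch under our own assumptions, not the paper's expansion algorithm), each new query can exclude salient words from the definitions already acquired, pushing the search engine towards resources that define other senses of the acronym:

```python
# Hypothetical sketch of incremental query expansion: each round excludes
# frequent words of the definitions found so far, so the next Web search
# tends to surface resources defining other senses of the acronym.

from collections import Counter

def next_query(acronym, found_definitions, max_excluded=3):
    """Build the next search query from the definitions acquired so far."""
    words = Counter(w.lower()
                    for d in found_definitions
                    for w in d.split()
                    if len(w) > 3)  # skip stopword-sized tokens
    excluded = [w for w, _ in words.most_common(max_excluded)]
    return " ".join([f'"{acronym}"'] + [f"-{w}" for w in excluded])

print(next_query("USA", ["United States of America", "United Space Alliance"]))
# e.g. "USA" -united -states -america
```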
Acknowledgements The authors would like to acknowledge the feedback of Dr. Antonio Moreno. The work is partially supported by the Universitat Rovira i Virgili (2009AIRE-04) and the DAMASK project (Data mining algorithms with semantic knowledge, TIN2009-11005).

References

1. Adar E (2002) S-RAD: a simple and robust abbreviation dictionary. HP Laboratories
2. Agirre E, Ansa O, Hovy E, Martínez D (2000) Enriching very large ontologies using the WWW. In: Proc of workshop on ontology construction of the European conference on AI, ECAI 2000, Berlin, pp 73–77
3. Brill E (2003) Processing natural language without natural language processing. In: Gelbukh A (ed) Proc of 4th international conference on computational linguistics and intelligent text processing, CICLing 2003, Mexico City, Mexico. Springer, Berlin/Heidelberg, pp 360–369
4. Brill E, Lin J, Banko M, Dumais S (2001) Data-intensive question answering. In: Voorhees EM, Harman DK (eds) Proc of tenth text retrieval conference, TREC 2001. Department of Commerce, National Institute of Standards and Technology, Gaithersburg, Maryland, US, pp 393–400
5. Carmel D, Farchi E, Petruschka Y, Soffer A (2002) Automatic query refinement using lexical affinities with maximal information gain. In: Beaulieu M, Baeza-Yates R, Myaeng SH, Järvelin K (eds) Proc of 25th annual international ACM SIGIR conference on research and development in information retrieval, SIGIR 02, Tampere, Finland. ACM, pp 283–290
6. Castells P (2003) Sistemas interactivos y colaborativos en la Web. In: Bravo C, Redondo MA (eds) La web semántica. Ediciones de la Universidad de Castilla-La Mancha, pp 195–212
7. Chang C-H, Hsu C-C (1998) Integrating query expansion and conceptual relevance feedback for personalized web information retrieval. Comput Netw ISDN Syst 30:621–623
8. Chang JT, Schütze H (2006) Abbreviations in biomedical text. In: Ananiadou S, McNaught J (eds) Text mining for biology and biomedicine. Artech House, Norwood, pp 99–119
9. Chirita P-A, Firan CS, Nejdl W (2007) Personalized query expansion for the Web. In: Clarke CLA, Fuhr N, Kando N, Kraaij W, de Vries AP (eds) Proc of 30th annual international ACM SIGIR conference on research and development in information retrieval, SIGIR 07. ACM, Amsterdam, pp 7–14
10. Church KW, Gale W, Hanks P, Hindle D (1991) Using statistics in lexical analysis. In: Zernik U (ed) Lexical acquisition: exploiting on-line resources to build a lexicon. Lawrence Erlbaum Associates, New Jersey, pp 115–164
11. Cilibrasi RL, Vitányi PMB (2006) The Google similarity distance. IEEE Trans Knowl Data Eng 19:370–383
12. Cimiano P, Staab S (2004) Learning by Googling. SIGKDD Explor 6:24–33
13. Ciravegna F, Dingli A, Guthrie D, Wilks Y (2003) Integrating information to bootstrap information extraction from Web sites. In: Kambhampati S, Knoblock CA (eds) Proc of IJCAI workshop on information integration on the Web, IIWeb 2003. IJCAI Press, Acapulco, pp 9–14
14. Dannélls D (2006) Automatic acronym recognition. In: Proc of 11th conference of the European chapter of the association for computational linguistics, EACL 2006. The Association for Computer Linguistics, Trento, pp 167–170
15. Dimililer N, Varoğlu E, Altınçay H (2009) Classifier subset selection for biomedical named entity recognition. Appl Intell. doi:10.1007/s10489-008-0124-0 (to appear)
16. Dujmovic J, Bai H (2006) Evaluation and comparison of search engines using the LSP method. Comput Sci Inf Syst 3:711–722
17. Etzioni O, Cafarella M, Downey D, Kok S, Popescu A, Shaked T, Soderland S, Weld DS (2004) Web-scale information extraction in KnowItAll. In: Proc of 13th international World Wide Web conference, WWW 2004. ACM Press, New York, pp 100–110
18. Etzioni O, Cafarella M, Downey D, Popescu A-M, Shaked T, Soderland S, Weld DS, Yates A (2005) Unsupervised named-entity extraction from the Web: an experimental study. Artif Intell 165:91–134
19. Ferreira da Silva J, Lopes GP (1999) A local maxima method and a fair dispersion normalization for extracting multi-word units from corpora. In: Proc of sixth meeting on mathematics of language, MOL6. Association for Computational Linguistics, Orlando, pp 369–381
20. Grefenstette G (1999) The World Wide Web as a resource for example-based machine translation tasks. In: Proc of twenty-first international conference on translating and the computer. Aslib Press, London
21. Henzinger MR (2008) PageRank algorithm. In: Kao M-Y (ed) Encyclopedia of algorithms. Springer, New York
22. Hisamitsu T, Niwa Y (2001) Extracting useful terms from parenthetical expressions by combining simple rules and statistical measures: a comparative evaluation of bigram statistics. In: Bourigault D, Christian J, L'Homme M-C (eds) Recent advances in computational terminology. Benjamins, Amsterdam, pp 209–224
23. Hunt JW, Szymanski TG (1977) A fast algorithm for computing longest common subsequences. Commun ACM 20:350–353
24. Kilgarriff A, Grefenstette G (2003) Introduction to the special issue on the Web as corpus. Comput Linguist 29:333–347
25. Kim M-C, Choi K-S (1999) A comparison of collocation-based similarity measures in query expansion. Inf Process Manag 35:19–30
26. Kim S-B, Seo H-C, Rim H-C (2004) Information retrieval using word senses: root sense tagging approach. In: Järvelin K, Allan J, Bruza P, Sanderson M (eds) Proc of 27th annual international ACM SIGIR conference on research and development in information retrieval, SIGIR 04. ACM, Sheffield, pp 258–265
27. Lam-Adesina AM, Jones GJF (2001) Applying summarization techniques for term selection in relevance feedback. In: Kraft DH, Croft WB, Harper DJ, Zobel J (eds) Proc of 24th annual international ACM SIGIR conference on research and development in information retrieval, SIGIR 01. ACM, New Orleans, pp 1–9
28. Larkey L, Ogilvie P, Price A, Tamilio B (2000) Acrophile: an automated acronym extractor and server. In: Proc of 5th ACM conference on digital libraries. Association for Computing Machinery, San Antonio, pp 205–214
29. Liu H, Friedman C (2003) Mining terminological knowledge in large biomedical corpora. In: Altman RB, Dunker AK, Hunter L, Klein TE (eds) Proc of 8th Pacific symposium on biocomputing, PSB 2003. PSB Association, Lihue, pp 415–426
30. Nadeau D, Turney PD (2005) A supervised learning approach to acronym identification. In: Kégl B, Lapalme G (eds) Proc of 18th conference of the Canadian society for computational studies of intelligence, Canadian AI 2005. Springer, Berlin/Heidelberg, pp 319–329
31. Okazaki N, Ananiadou S (2006) A term recognition approach to acronym recognition. In: Proc of international committee on computational linguistics and the association for computational linguistics, COLING-ACL 2006. Association for Computational Linguistics, Sydney, pp 643–650
32. Park Y, Byrd RJ (2001) Hybrid text mining for finding abbreviations and their definitions. In: Lee L, Harman D (eds) Proc of conference on empirical methods in natural language processing, EMNLP 2001. Intelligent Information Systems Institute, Pittsburgh, pp 126–133
33. Pustejovsky J, Castaño J, Cochran B, Kotecki M, Morrell M (2001) Automatic extraction of acronym-meaning pairs from MEDLINE databases. In: Patel V, Rogers R, Haux R (eds) Proc of 10th triennial congress of the international medical informatics association, MEDINFO 2001. IOS Press, London, pp 371–375
34. Qiu Y, Frei H-P (1993) Concept based query expansion. In: Korfhage R, Rasmussen E, Willett P (eds) Proc of 16th annual international ACM SIGIR conference on research and development in information retrieval, SIGIR 93. ACM, Pittsburgh, pp 160–169
35. Resnik P, Smith N (2003) The Web as a parallel corpus. Comput Linguist 29:349–380
36. Sánchez D, Moreno A (2008) Pattern-based automatic taxonomy learning from the Web. AI Commun 21:27–48
37. Schwartz A, Hearst M (2003) A simple algorithm for identifying abbreviation definitions in biomedical texts. In: Altman RB, Dunker AK, Hunter L, Klein TE (eds) Proc of 8th Pacific symposium on biocomputing, PSB 2003. PSB Association, Lihue, pp 451–462
38. Taghva K, Gilbreth J (1999) Recognizing acronyms and their definitions. Int J Document Anal Recognit 1:191–198
39. Torii M, Hu Z-Z, Song M, Wu CH, Liu H (2006) A comparison study on algorithms of detecting long forms for short forms in biomedical text. BMC Bioinform 8:S5
40. Turney PD (2001) Mining the Web for synonyms: PMI-IR versus LSA on TOEFL. In: Raedt LD, Flach P (eds) Proc of 12th European conference on machine learning, ECML 2001, Freiburg, Germany. Springer, Berlin/Heidelberg, pp 491–499
41. Fellbaum C (ed) (1998) WordNet: an electronic lexical database. MIT Press, Cambridge
42. Xiao L, Wissmann D, Brown M, Jablonski S (2004) Information extraction from the Web: system and techniques. Appl Intell 21:195–224
43. Yarowsky D (1995) Unsupervised word-sense disambiguation rivaling supervised methods. In: Uszkoreit H (ed) Proc of 33rd annual meeting of the association for computational linguistics. Association for Computational Linguistics, Cambridge, pp 189–196
44. Yeates S (1999) Automatic extraction of acronyms from text. In: Yeates S (ed) Proc of third New Zealand computer science research students' conference. University of Waikato, Te Kohinga Marama Marae, Hamilton, New Zealand, pp 117–124
45. Yoon Y-C, Park S-Y, Song Y-I, Rim H-C, Rhee D-W (2008) Automatic acronym dictionary construction based on acronym generation types. IEICE Trans Inform Syst E91-D:1584–1587
46. Yu H, Hripcsak G, Friedman C (2002) Mapping abbreviations to full forms in biomedical articles. J Am Med Inform Assoc 9:262–272
47. Yu S, Cai D, Wen J-R, Ma W-Y (2003) Improving pseudo-relevance feedback in web information retrieval using web page segmentation. In: Hencsey G, White B, Robin Chen Y-F, Kovács L, Lawrence S (eds) Proc of 12th international conference on World Wide Web, WWW 03, Budapest. ACM, New York, pp 11–18
48. Zahariev M (2004) A (acronyms). PhD thesis, Simon Fraser University
David Sánchez is a Lecturer at the University Rovira i Virgili's Computer Science and Mathematics Department. He received a PhD in Artificial Intelligence from UPC (Technical University of Catalonia) in 2008. He is a member of the ITAKA research group (Intelligent Technologies for Advanced Knowledge Acquisition). His research interests are intelligent agents, ontology learning and the Semantic Web. He has been involved in several research projects (national and European), and has published several papers and conference contributions.

David Isern is a post-doctoral researcher at the University Rovira i Virgili's Department of Computer Science and Mathematics. He is also an associate professor at the Open University of Catalonia. He received his PhD in Artificial Intelligence (2009) and an MSc (2005) from the Technical University of Catalonia. His research interests are intelligent software agents, distributed systems, user preference management, and ontologies, especially applied to healthcare and information retrieval systems. He has been involved in several research projects (national and European), and has published several papers and conference contributions.