
Appl Intell (2011) 34: 311–327 DOI 10.1007/s10489-009-0197-4

Automatic extraction of acronym definitions from the Web

David Sánchez · David Isern

Published online: 30 September 2009 © Springer Science+Business Media, LLC 2009

Abstract Acronyms are widely used to abbreviate and stress important concepts. The discovery of the definitions associated with an acronym is an important matter in order to support language processing and knowledge-related tasks such as information retrieval, ontology mapping or question answering. Acronyms represent a very dynamic and unbounded topic that is constantly evolving. Manual attempts to compose a global-scale dictionary of acronym-definition pairs result in an overwhelming amount of work and limited results. Addressing these shortcomings, this paper presents an automatic and unsupervised methodology to generate acronyms and extract their potential definitions from the Web. The method has been designed to minimise the set of constraints, offering a domain and -partially- language independent solution, and to exploit the Web in order to create large and general acronym-definition sets. Results have been manually evaluated against the largest manually built acronym repository, Acronym Finder. The evaluation shows that the proposed approach is able to improve the coverage of manual attempts while maintaining a high precision.

Keywords Acronyms · Information extraction · Web mining

D. Sánchez () · D. Isern
Department of Computer Science and Mathematics, Intelligent Technologies for Advanced Knowledge Acquisition (ITAKA) Research Group, University Rovira i Virgili, Tarragona, Catalonia, Spain
e-mail: [email protected]
D. Isern
e-mail: [email protected]

1 Introduction

Acronyms are textual forms used to refer to relevant concepts or entities [14]. Human languages are very prone to the creation of acronyms in order to (i) stress the importance of entities, (ii) avoid redundancy by omitting an entity's long form and (iii) offer an alternative way of referring to the same entity which is easier to remember. Some characteristics of acronyms are:
• They are very dynamic. New acronyms are defined every day for almost every possible domain of knowledge. This is especially evident in domains such as biomedicine [15, 39].
• They are highly polysemic. Acronyms are composed of a short combination of alpha-numeric characters (i.e., commonly from 2 to 6 characters). Consequently, the amount of possible combinations is limited and biased towards the simpler forms. Some short combinations of letters may correspond to dozens of possible entities (e.g., ABC stands for 253 different entities, according to Acronym Finder1).
• They have a very diverse degree of generality. Some acronym-definition pairs are very common (e.g., USA - United States of America) but others are rare and not referred to outside the information source in which they are defined (e.g., USA - Unique Settable Attributes).
Formally, an acronym may correspond to one or more definitions from which several participating characters are used to construct the acronym. Identifying equivalent acronym-definition pairs is a crucial task in natural language processing and information retrieval [18, 42]. Ontology Population and Question Answering are other areas in which acronym handling can improve language understanding [48]. However, due to the previously introduced characteristics, it is very difficult to construct a general and up-to-date acronym-definition repository [45].
From the manual point of view, there have been some ambitious attempts to provide global and reliable acronym dictionaries such as Acronym Finder or The Internet Acronym Server.2 They offer a valuable source of knowledge at the cost of a huge amount of human work. For example, Acronym Finder, which is the world's largest dictionary of acronyms, includes more than 750,000 human-edited definitions; the total effort required to compile this set is estimated at more than 6,000 hours,3 a task performed during the last 11 years. Moreover, manual knowledge acquisition imposes a bottleneck which limits the coverage, as will be shown in the evaluation section. Consequently, automated methodologies may aid the composition of those repositories.
Recently, there have been approaches focused on the automatic identification of acronym-definition pairs within text [30–32]. As will be shown in the next section, some of them require a certain amount of supervision or training and/or introduce restrictive constraint sets to associate acronyms with definitions. Also, most of them have a limited scope (e.g., acronyms and definitions are retrieved from a unique input document) and they are language-dependent due to the linguistic analyses performed over the text.
This paper presents a novel method to, first, generate valid acronyms and, then, retrieve feasible definitions. The proposal uses the Web as source and handles these tasks automatically and in an unsupervised way, creating large acronym-definition sets from scratch. It is based on general extraction patterns and a reduced filtering constraint set, configuring a domain-independent approach. As no language-dependent analyses are introduced, cross-language results can also be retrieved. However, the proposal is limited to languages written with Latin characters (e.g., English, Spanish, Italian, etc.). Finally, it exploits the Web information distribution to estimate the reliability of the results, which can be used to filter them and potentially improve the precision.
Results have been evaluated against the largest manually built repository in order to check their accuracy and to show the benefits that an automatic approach can bring in extending manually composed repositories.
The rest of the paper is organized as follows. Section 2 describes previous works in the area of acronym-definition discovery. Section 3 analyses the characteristics of the acronym-definition identification problem. Section 4 describes in detail the proposed methodology. Section 5 presents the evaluation, showing the obtained results and discussing them. The last section gives some conclusions and proposes some lines of future work.

1 Web site: http://www.acronymfinder.com [last access: 01/09/2009].
2 Web site: http://acronyms.silmaril.ie/cgi-bin/uncgi/acronyms [last access: 01/09/2009].
3 Source: http://blog.acronymfinder.com/2008/04/acronym-finder-rolls-over-600000.html [14/05/09].

2 Related works

In recent years, there have been some attempts at dealing with the acronym identification task (i.e., given an input text document, acronyms are identified and associated with a definition).
Taghva and Gilbreth [38] use a word window to retrieve definition candidates from the acronym surroundings. The window length is a function of the number of letters in the acronym. Those letters should also match the words in the definition. An additional linguistic analysis is performed over the definition to detect which word types (i.e., stop words, hyphenated and normal) contribute to the acronym definition. The Longest Common Subsequence algorithm [23] is then used to select the definition. Larkey et al. [28] use a similar approach, identifying acronyms through capitalizations and windows of 20 words near the acronym. A linguistic analysis is applied to detect the presence of meaningless words and to check those which can contribute to acronym letters. Additionally, simple patterns, such as parentheses or constructions indicating equivalences, are used to check the suitability of acronym definitions. However, non-parenthetical patterns introduce constraints to the analysis which may hamper the recall. Park and Byrd [32] propose a combination of pattern-based abbreviation rules with text markers and cue words (e.g., "or", "short,", "stand") to detect acronym-definition pairs. Zahariev [48] presents a model for acronym identification based on strong constraints but allowing different languages. Yeates [44] uses special characters such as commas, parentheses and periods to extract acronym-definition candidates. Acronyms are detected using capitalization heuristics and the definition is associated by evaluating several rules such as the matching of letters, without considering English stop words. Adar [1] introduces some basic scoring rules to rank the suitability of acronyms and their definitions according to the number of matching letters or the presence of parentheses. Liu and Friedman [29] use collocation measures and parenthetical expressions to associate definitions with acronyms. However, their approach cannot recognise expanded forms occurring only once in the corpus.
There are also some approaches which use supervised models. Chang and Schütze [8] use logistic regression as a learning algorithm, which employs features to describe the association between the acronym letters and the definition.


Nadeau and Turney [30] also present a supervised system which uses machine learning involving weak constraints to improve the recall. Dannells [14] presents a supervised machine learning algorithm (Memory Based Learning). Internally, acronym-definition pairs are described by a vector of syntactic features, and the algorithm is trained to classify them.
The presented approaches are meant to detect acronym candidates within the text (which represents a bounded context), trying to associate a suitable definition extracted from the same document. Very few approaches have been developed to create a large scale dictionary, as in the present work. Okazaki and Ananiadou [31] introduce a method to compose an acronym-definition repository from a large text collection. Their approach focuses on terms appearing frequently in the proximity of an acronym and measures the likelihood of such terms being the definitions of acronyms. However, the repository scope depends highly on the manually composed input corpus. Recently, Yoon et al. [45] have used a set of definitions to create possible acronyms through an automatic acronym generation algorithm (for the Korean language). The authors check the suitability of acronym candidates by estimating the probability of appearance of the given acronym-definition pair in a set of Web resources.
Most of the presented methods deal with English-written resources, with very few of them being multi-language [48]. Some authors have dealt with other languages. For instance, Hisamitsu and Niwa [22] analyse Japanese-written articles using measures of co-occurrence between inner and outer parenthetical expressions, and Yoon et al. [45] deal with Korean Web resources. Research on acronyms has been applied in some relevant domains such as Genetics and Medicine [33, 46]. Schwartz and Hearst [37] also work over biomedical text, identifying acronyms with a set of basic restrictions and searching their surroundings for the shortest definition which matches all the acronym letters.
From the presented works, several conclusions can be extracted:
• Most approaches use a set of patterns to extract definition candidates for a given acronym and some heuristics (mainly constraint sets) to evaluate their suitability. The use of capitalization heuristics, letter matching and parenthetical rules are the most common and effective ones [30].
• Most of the approaches are applied over English-written resources. Some of them employ a certain degree of linguistic analysis (e.g., detection of stop words) which hampers their applicability to other languages.
• In unsupervised (mainly rule-based) approaches, the quality of the results depends highly on the set of constraints used to filter acronym definition candidates. As stated in [30], the use of strong constraints over a reduced corpus results in high precision but compromises the final recall.
• Most of the approaches use a document or a set of documents as source. This is a way of contextualizing the search towards a certain domain. However, this introduces limitations because, if no explicit matching for the defined patterns appears within the text, no results will be obtained.
As very few approaches aim to detect wide and general definition sets for a given acronym (which is the goal of the present approach), the computation of the degree of generality or reliability of a definition for a certain acronym has not been necessary in previous works.

3 Acronym-definition analysis

Acronyms are sequences of alpha-numeric characters, usually capitalized, even though some intra-word non-capitalized characters may also appear. Examining the literature, many authors establish a minimum length of 2–3 characters and a maximum of 9–10 [28, 32, 38], even though normal acronyms rarely have more than 5 letters. From the generative point of view, they are created by aligning and extracting some parts of a definition composed of several words. As introduced in [45], according to the generation rules, acronyms can be classified into three types:
• Character-based acronyms are generated by the typically ordered combination of characters from the definition. Most acronyms for Latin-based languages are composed in this manner (e.g., BBC stands for British Broadcasting Corporation).
• Syllable-based acronyms are composed by joining some syllables of the definition (e.g., University of PENNsylvania is abbreviated as UPENN).
• A combination of the above types (e.g., RADAR mixes syllables and initial characters of RAdio Detection And Ranging).
These are the types of acronyms which will be considered in the proposed approach. Following the generative rules, all the characters in the acronym should appear in the same order in the corresponding definition. This multi-language and domain-independent rule is very powerful and one of the bases of the acronym-definition discovery heuristic.
There are, however, exceptions. For example, in fields such as Genetics and Medicine, it is possible to find acronyms such as E2, which stands for "estradiol-17 beta" and does not follow the presented rules. There are also situations in which the expansion represents a translation (e.g., in German "Vereinte Nationen (UN)", which corresponds to United Nations; also numbers, such as 2, can be translated as two, bi, duo, etc.). Those situations can be tackled using language-dependent text processing tools (such as translators) or weaker constraints such as in [39]. Those rare cases will not be considered in this proposal as they would require much more analysis and background knowledge, hampering the performance of the system as a general solution.



Table 1 List of language-independent acronym-definition association patterns

Pattern                  Example
Acronym (definition)     NBC (National Broadcasting Company)
Definition (acronym)     British Broadcasting Corporation (BBC)
Acronym -definition-     CNN -Cable News Network-
Definition -acronym-     Home Box Office -HBO-

3.1 Pattern-based analysis

Different strategies can be followed to decide whether a definition is a valid expansion of an acronym but, as a general statement, it is required that acronyms and definitions be adjacent. Definition identification is based on patterns that describe different syntactic situations. Those patterns can be divided, according to their language dependency, into:
• Multi-language. They are the most effective patterns and are mainly based on the use of parenthetical expressions [30]. More concretely, patterns such as acronym (definition) and definition (acronym) are the most used ones. Languages based on the Latin character set follow these rules.
• Language-dependent. This type of pattern involves the use of additional words (e.g., "also known as", "aka", "short of", "stand", etc.), resulting in a regular expression which is dependent on the particular language (typically English).
In order to avoid language-dependent constraints, only the first type of patterns will be employed in the proposal. Only punctuation symbols will be used (parentheses and dashes), as shown in Table 1.

3.2 The Web as a learning corpus

As shown, some approaches are focused on the detection of acronym-definition pairs within a unique input text. Considering the typical patterns exploited to associate acronyms and definitions, the use of a unique source implies that a reduced amount of pattern matchings can be retrieved. In order to obtain good quality results, those approaches use extraction and selection rules from two points of view. On the one hand, in order to maximise the amount of definitions, weak constraint sets can be specified. This may impact the precision, or force the use of additional analyses like the supervised approach presented in [14]. On the other hand, in order to obtain a high precision, strong constraint sets may be employed but, in this case, recall may suffer.

In addition, there exist very common acronyms such as "USA" which will often appear without being defined. Consequently, using a limited corpus may compromise the algorithm's performance if an explicit acronym-definition match is expected. However, one may maintain the precision of a shallow analysis based on relatively strong constraints and minimise the data sparseness problems (i.e., improve the recall) of an unsupervised pattern-based approach by using a wider repository like the Web.
The Web is the biggest repository of information available [3], with more than 100 billion web resources indexed by Google. The amount and heterogeneity of the information available on the Web [24] make it very suitable for methods which aim to provide global and up-to-date results, like the proposed approach. In addition, the Web exhibits another important characteristic: high redundancy. This redundancy can be used for developing reliable shallow linguistic analysis techniques [3, 18], and for evaluating the relevance of words [13]. In our case, data redundancy implies that the same information -acronyms and their corresponding definitions- may appear many times in many different textual forms -simpler or more complex, explicit or implicit-. Focusing on the simpler cases eases the extraction process. Finally, publicly available Web search engines are very effective as massive Web Information Retrieval tools [16]. In our case, Web search engines can be exploited as domain-independent information retrieval tools to obtain resources to analyse, forcing the appearance of specific textual patterns by means of search queries.

4 Methodology

The proposed methodology is divided into three main stages (see the pseudocode in Fig. 1): acronym generation, definition retrieval, and definition reliability estimation and filtering.

4.1 Acronym generation

The GenerateAcronyms function (line 4 of Fig. 1) creates acronym candidates using the generation rules presented in Sect. 3 through the combination of Latin letters (A–Z) and numbers (0–9) of a given length or range (from minLetters to maxLetters). Candidates composed only of numbers are avoided because they are not acronyms. Strings with special characters (e.g., ".", "-") are not considered because these characters are not supported by Web search engines (e.g., R.A.D.A.R. will lead to the same results as RADAR, both forms being valid acronyms for RAdio Detection And Ranging).
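As an illustration, a minimal Python sketch of this generation step follows; it assumes the minLetters/maxLetters bounds of Fig. 1, and the function name and defaults are illustrative rather than the authors' actual implementation.

```python
# Sketch of the candidate generation step of Sect. 4.1 (illustrative only).
from itertools import product
from string import ascii_uppercase, digits

def generate_acronyms(min_letters=2, max_letters=3):
    """Yield every A-Z/0-9 combination of the given lengths,
    skipping purely numeric strings, which are not acronyms."""
    alphabet = ascii_uppercase + digits
    for length in range(min_letters, max_letters + 1):
        for chars in product(alphabet, repeat=length):
            candidate = "".join(chars)
            if not candidate.isdigit():  # e.g., "123" is discarded
                yield candidate
```

For length 3, this enumeration yields 36^3 - 10^3 = 45656 candidates, of which the 17576 purely alphabetic ones form the set analysed in Sect. 5.2.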


Fig. 1 General acronym-definition discovery algorithm in pseudocode



4.2 Definition retrieval

The RetrieveSnippets function (line 34 of Fig. 1) queries each acronym candidate in a Web search engine. Unfortunately, Web search engines do not distinguish the punctuation symbols included in the acronym-definition patterns and, consequently, the appearance of the exact matching cannot be forced. For instance, the query "AAA (*)" returns the same set of results as "AAA". The obtained corpus of texts is analysed to extract definition candidates. However, considering the low probability of finding several acronym-pattern matches within a unique document, the complete online analysis of each Web site may result in a low learning-effort ratio. In addition, as each website typically refers to the acronym in a concrete sense, all candidate definitions that may co-occur would belong to the same acronym sense (i.e., the same definition). The fact that words tend to exhibit only one sense in a given discourse or document was demonstrated by Yarowsky [43] on a large corpus (37,232 examples) with a very high precision (around 99%). For those reasons, we opted to analyse only the Web abstracts -snippets- provided by the web search engine, which present the two or three lines in which the query matching appears. In this manner, with only one query, it is possible to retrieve up to 200 different web snippets representing the same number of web documents. Two advantages arise from this approach: (i) the minimization of the number of web accesses and (ii) the maximization of the corpus heterogeneity, increasing the diversity of the information sources and, consequently, the acronym senses in polysemic cases.
The ExtractNewDefinitions function (line 36 of Fig. 1) parses the snippet set to find matches for acronym-definition patterns. The patterns introduced in Sect. 3.1, involving only punctuation symbols, are used in order to retrieve definitions for different -Latin-based- languages. No language-dependent analysis (e.g., stop word detection, POS tagging, chunking, etc.) is performed in order to avoid language-dependent restrictions.
The ValidateDefinitions function (line 38 of Fig. 1) filters the list of definition candidates to select only those which fulfil the set of rules shown in Table 2. Those have been selected considering the characteristics of acronym-definition association (i.e., letter participation as introduced in Sect. 3) and to be general and valid for different languages. The goal is to minimise the set of candidates which will be evaluated in the final stage without heavily compromising the recall. The strongest constraints are established by the first 2 rules, which are meant to discover typical acronym constructions (as introduced in Sect. 3). Rule 3 is focused on the detection of the beginning of the definition in patterns like definition (acronym) or definition -acronym- without relying on stop-word analysis.

Table 2 Acronym definition filtering rules

Rule    Description
Rule1   All acronym characters must appear in the definition
Rule2   Acronym characters must appear in the same order as in the definition
Rule3   Definition must begin with the same letter as the acronym
Rule4   Definition maximum length is n * 10, where n is the number of acronym characters
Rule5   Definition must have at least one more character than the acronym

This heuristic may produce, in some situations, the loss of meaningless words belonging to the definition (e.g., determiners). For those patterns, the end of the definition is detected using the punctuation symbol ('(' or '-'), i.e., the longest possible definition is taken. Rules 4 and 5 are introduced to minimise the possibility of retrieving malformed candidates (e.g., missing definition terms or arbitrarily concatenated text). Those cases are very common when dealing with snippets, as web abstracts' text may omit significant parts of the sentence. Compared to previous approaches, other authors (working on corpora with limited scope) introduce stronger rules in order to improve the precision at the cost of a lower recall (like "only the first letter of the definition can participate" [38] or "first three letters of the definition can participate" [44]).
Spelling variations found through the definition set (e.g., united nations, United Nations) are also treated in a language-independent way: definitions with variations in capitalization or punctuation symbols are considered equivalent. No language-dependent stemming algorithms are applied. If definitions sharing the same root words are found (e.g., Volkswagen AG and Volkswagen), all the forms are stored independently.
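To make the extraction and filtering concrete, here is a minimal sketch assuming snippets are plain strings; the regular expressions and function names are illustrative assumptions, not the authors' code.

```python
import re

# Parenthetical patterns of Table 1: "definition (ACRONYM)" and
# "ACRONYM (definition)"; the dash variants are handled analogously.
DEF_THEN_ACRO = re.compile(r'([^()]{3,100})\(\s*([A-Z0-9]{2,9})\s*\)')
ACRO_THEN_DEF = re.compile(r'\b([A-Z0-9]{2,9})\s*\(([^()]{3,100})\)')

def is_valid(acronym: str, definition: str) -> bool:
    """Apply the filtering rules of Table 2."""
    d = definition.strip().lower()
    a = acronym.lower()
    # Rule 3: the definition must begin with the acronym's first letter.
    if not d.startswith(a[0]):
        return False
    # Rules 1-2: every acronym character appears in the definition,
    # in the same order (a subsequence check).
    pos = 0
    for ch in a:
        pos = d.find(ch, pos)
        if pos == -1:
            return False
        pos += 1
    # Rule 4: at most n * 10 characters; Rule 5: longer than the acronym.
    return len(acronym) < len(d) <= len(acronym) * 10
```

For example, is_valid("URV", "Universitat Rovira i Virgili") holds, whereas a chance match such as "very rapid unit" fails Rules 2 and 3.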


4.2.1 Adaptive corpus analysis

How big should the set of web snippets be in order to acquire a relevant set of acronym definitions? On the one hand, due to the automatic and naive nature of the acronym generation process, there will be many occasions in which a combination of characters has never been used as an abbreviation, resulting in unproductive analyses. On the other hand, some character combinations may result in highly polysemic acronyms for which even a thousand resources may not be enough to discover some definitions. In order to set the appropriate size of the web corpus as a function of the acronym, an adaptive algorithm dynamically increases its size according to the learning throughput.
At first, a set of NUMBER_WEBS_PER_ITERATION snippets (e.g., the maximum allowed by the web search engine) is analysed. As a result, a number of new definitions is extracted and selected. If it surpasses a MINIMUM_DEFINITIONS threshold (e.g., at least 1), the analysis continues by retrieving the next set of NUMBER_WEBS_PER_ITERATION snippets, controlled by the webOffset variable. The process continues until the number of results for an iteration does not fulfil the MINIMUM_DEFINITIONS threshold. In this manner, invalid acronym candidates are rejected early as no definitions are found, whereas highly productive ones result in a more extended analysis.
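The following sketch condenses this adaptive loop; it assumes a retrieve_snippets(query, offset, count) wrapper around the search engine and an extract function combining the pattern matching and validation above, and is a simplification of Fig. 1 rather than a verbatim transcription.

```python
NUMBER_WEBS_PER_ITERATION = 200  # e.g., maximum snippets per request
MINIMUM_DEFINITIONS = 1

def adaptive_analysis(acronym, query, definitions,
                      retrieve_snippets, extract_definitions):
    """Fetch snippet batches while each batch still yields at least
    MINIMUM_DEFINITIONS new validated definitions."""
    web_offset = 0
    while True:
        snippets = retrieve_snippets(query, web_offset,
                                     NUMBER_WEBS_PER_ITERATION)
        new_defs = extract_definitions(acronym, snippets) - definitions
        definitions |= new_defs
        if not snippets or len(new_defs) < MINIMUM_DEFINITIONS:
            break  # unproductive iteration: stop early
        web_offset += NUMBER_WEBS_PER_ITERATION
    return definitions
```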


4.2.2 Query expansion

Even though a web search engine may return millions of results per query, considering that users rarely explore more than a few dozen results, it only indexes the first 1000 resources. This poses a problem for highly productive acronyms if the resulting set of web resources is not big or heterogeneous enough to cover most of the possible definitions available on the Web. This problem is aggravated by the ranking algorithms of Web search engines (e.g., PageRank [21]) because they introduce a relevance-based bias. The consequence is that the first web sites will cover the most common acronym definitions, and rarer ones will remain hidden in not directly accessible web resources.
In order to overcome those limitations, query expansion algorithms exploit web search engine variability [18] to enhance retrieval by reformulating the initial query [9]. There exist different query expansion approaches according to the information exploited to extend the query:
• Thesaurus-based techniques [26, 27] use semantically related words (e.g., synonyms or hyponyms of the query terms) from a dictionary like WordNet [41].
• Co-occurrence-based techniques employ terms highly co-occurring with the initial query retrieved from a corpus (e.g., entire documents [34] or lexical affinity relationships [5]), resulting in an increase of retrieval precision [25].
• Relevance feedback techniques analyse the documents retrieved by the initial query in order to extract related information, in a supervised [7] or unsupervised fashion [27, 47].
• Brute-force techniques [17] recursively construct queries from an initial one by adding new terms from a repository of common words until the amount of results is below the maximum number of indexed resources. By varying the set of words, it is possible to coax a search engine into returning most of the resources.
In our case, queries for acronym definition discovery are very different from those seeking information related to a searched concept. Semantically related terms such as synonyms are not applicable to acronyms, and the introduction of terms related to already retrieved definitions would produce a negative effect (i.e., we aim to widen the search to unexplored acronym senses, not to bias it towards already considered ones). So, the first three types of query expansion are not directly applicable. Brute-force techniques may help to widen the corpus without introducing bias but, considering the potential amount of acronyms to analyse, the overhead introduced by the enormous number of required queries would compromise the scalability of the approach.
So, instead of using a general purpose expansion algorithm, an adaptive approach was designed. The algorithm iteratively reformulates the query focusing on two aims:
(i) To avoid the retrieval of resources covering definitions already retrieved. The definition set is excluded from further queries by iteratively adding an exclusion restriction to the acronym using the "-" or "NOT" query operators (lines 26 and 27 in Fig. 1). Even though long queries may not be supported by some search engines (e.g., the Google web interface supports up to 32 terms), this problem has not been observed when accessing the search engine via API (Google API) or when using other search engines (like MSN Live!).
(ii) To expand the search by including terms which may potentially belong to new acronym definitions (i.e., words with one or several participating letters) in order to increase the variance of the results. This relies on the examination of the retrieved definitions. It has been observed that words with participating letters appear several times through the definition set. For example, for the "URV" acronym, the "U" stands in many occasions for adjectives such as "universal", "unified" or "uniform". On the other hand, the "V" commonly corresponds to the nouns "Vehicle" or "Value". This uniformity gives a clue that it is likely to discover new definitions involving repeated definition term(s), expanding the search by adding them as seeds for further queries (line 24 of Fig. 1). The ExtractMostRepeatedUntreatedWord function (line 48 of Fig. 1) iteratively selects, after analysing the snippet set of the previous query, a new term appearing several times in the definition set.
As an alternative to this process, it was also considered to use a thesaurus from which to extract words with potential participating letters, but the amount of queries resulting from word combinations would be overwhelming. Instead, the proposal recursively exploits acquired definitions as feedback to expand the search. As a result of this iterative expansion algorithm, the search engine will provide a maximum of 1000 new resources to analyse for each multi-appeared term. So, each analysis iteration is fed with the newly acquired definitions (line 40 of Fig. 1). The process ends when all multi-appeared terms have been used to create new queries (line 14 of Fig. 1) and the adaptive analysis of web resources has been executed for each one.
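A sketch of how such an expanded query might be composed is shown below; the operator syntax and helper names are assumptions for illustration, not the exact strings built by the authors' system.

```python
from collections import Counter

def build_query(acronym, seed_term, known_definitions):
    """Exclude already-known definitions and add a repeated
    definition word as a seed for unexplored senses."""
    # e.g., "URV" "Vehicle" -"Unit Review Visit" -"Upper Range Value" ...
    exclusions = " ".join(f'-"{d}"' for d in known_definitions)
    seed = f' "{seed_term}"' if seed_term else ""
    return f'"{acronym}"{seed} {exclusions}'.strip()

def most_repeated_untreated_word(definitions, treated):
    """Counterpart of ExtractMostRepeatedUntreatedWord: pick the next
    multi-appeared definition word not yet used as a seed."""
    counts = Counter(w for d in definitions for w in d.split())
    for word, count in counts.most_common():
        if count > 1 and word not in treated:
            return word
    return None  # all multi-appeared terms treated: expansion ends
```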

Table 3 Queries performed for the URV acronym. From left to right: queried acronym, query terms extracted from multi-appeared words in the definition set, number of excluded definitions per query, search offset for the web query, accumulated number of analysed snippets, number of obtained definitions and fulfilment of the learning threshold

Query   Included term   #Excluded definitions   Search offset   #Analysed snippets   #Obtained definitions   Threshold fulfilled
"URV"                                           0               0                    4                       True
"URV"                                           200             200                  5                       True
"URV"                                           400             400                  6                       True
"URV"                                           600             600                  7                       True
"URV"                                           800             800                  8                       True
"URV"   "Virgili"       8                       0               1000                 13                      True
"URV"   "Virgili"       8                       200             1200                 14                      True
"URV"   "Virgili"       8                       400             1400                 15                      True
"URV"   "Virgili"       8                       600             1600                 15                      False
"URV"   "University"    15                      0               1800                 16                      True
"URV"   "University"    15                      200             2000                 16                      False
"URV"   "Universitat"   16                      0               2200                 16                      False
"URV"   "Tarragona"     16                      0               2400                 16                      False
"URV"   "Value"         16                      0               2600                 17                      True
"URV"   "Value"         16                      200             2800                 18                      True
"URV"   "Value"         16                      400             3000                 18                      False
"URV"   "Vehicle"       18                      0               3200                 24                      True
"URV"   "Vehicle"       18                      200             3400                 25                      True
"URV"   "Vehicle"       18                      400             3600                 26                      True
"URV"   "Vehicle"       18                      600             3800                 26                      False
"URV"   "Unit"          26                      0               4000                 28                      True
"URV"   "Unit"          26                      200             4200                 29                      True
"URV"   "Unit"          26                      400             4400                 29                      False
"URV"   "Underwater"    29                      0               4600                 30                      True
"URV"   "Underwater"    29                      200             4800                 30                      False
"URV"   "Underwater"    29                      400             5000                 31                      True
"URV"   "Underwater"    29                      600             5200                 31                      False
"URV"   "Urban"         31                      0               5400                 32                      True
"URV"   "Urban"         31                      200             5600                 33                      True
"URV"   "Urban"         31                      400             5800                 33                      False
"URV"   "Unmanned"      33                      0               6000                 33                      False

4.2.3 An example

The behaviour of the adaptive corpus analysis and query expansion algorithms for the URV acronym is presented in Table 3. In that case, the analysis is iteratively expanded up to 6000 web resources. Initially, analysing only the first directly accessible 1000 snippets, the algorithm finds 13 definitions. After several query expansions, it is able to discover up to 33 definitions.

4.3 Definition reliability estimation and filtering

The set of obtained definitions has been extracted from individual observations. Even though they are apparently correct (as they fulfil the definition filtering rules), no clue about their accuracy or reliability with respect to the acronym is provided. Some problems may affect the set of definitions due to the lack of an extended linguistic analysis over the text, such as word combinations fulfilling the definition rules by pure chance or the presence of misspelled or incomplete definitions. In order to tackle these errors, an additional filtering step estimates the reliability of each definition by exploiting the Web's information distribution (lines 54 and below of Fig. 1). It is based on the amount of acronym-definition co-occurrence at a web scale [11].



4.3.1 Web-scale statistics

In order to statistically assess the degree of relatedness between words from their co-occurrence, one can consider term collocation functions of the form

c_k(a, b) = \frac{p(ab)^k}{p(a)\,p(b)}    (1)

where p(a) is the probability that the word a occurs within the text, and p(ab) is the probability of co-occurrence of the words a and b. From this formula, one can define the Symmetric Conditional Probability (SCP) [19] as c_2 and the Point-wise Mutual Information (PMI) [10] as \log_2 c_1. The problem is that the computation of co-occurrence measures from an enormous repository like the Web is not practical. However, Web Information Retrieval tools can be a valuable help. In fact, it has been demonstrated that the probabilities of terms indexed by a web search engine, conceived as the frequencies of page counts returned by the search engine divided by the number of indexed pages, approximate the actual relative frequencies of those terms as used in society [11]. Taking this premise into consideration, Turney [40] adapted PMI to approximate term probabilities from web search hit counts (web-scale statistics). He defined a score (2) to compute the collocation between an initial word (problem) and a related candidate concept (choice):

Score(choice, problem) = \frac{hits(problem \; AND \; choice)}{hits(choice)}    (2)

This measure is very similar to the original PMI (\log_2 c_1) but, since it seeks a comparative score among a set of choices, it drops \log_2 and p(problem) in the denominator because the latter has the same value for all choices.

4.3.2 Estimating definition reliability

The estimation of definition reliability exploits Web-scale acronym-definition co-occurrence from two points of view.
First, the absolute co-occurrence value may give an idea of the acronym-definition generality, allowing correct forms to be distinguished from misspelled ones (which will be much rarer in comparison). The ComputeDefinitionWebOccurrences function (line 54 of Fig. 1) constructs, for each acronym definition, a web query to evaluate the absolute co-occurrence using the introduced extraction patterns. As web search engines do not distinguish punctuation symbols, only two different queries can be constructed: "acronym definition" and "definition acronym". Note the use of the double quotes (" ") to force the immediate adjacency of terms. Adding the individual hit counts of the two queries (3), it is possible to retrieve the amount of explicit acronym-definition co-occurrence in the Web. If this does not surpass the MINIMUM_COOCCURRENCES threshold (line 55 of Fig. 1), the definition is likely to be misspelled or erroneous and it will be discarded. In our tests, this constant has a value of 1.

Cooccur_i(acronym, definition_i) = hits(\text{``}acronym \; definition_i\text{''}) + hits(\text{``}definition_i \; acronym\text{''})    (3)

Then, taking Turney's score as the base, the ComputeWebScore function (line 58 of Fig. 1) normalizes the absolute co-occurrence value by dividing it by the number of appearances of the definition alone, computing a conditioned probability. The number of hits of the acronym can be eliminated from the denominator as it has the same value for the whole definition set. The result of this score (4) gives a robust estimation of the percentage of observations in which the acronym-definition pair explicitly appears in the definition scope. Consequently, the higher the value, the more evidence of the reliability of the association.

Score_i(acronym, definition_i) = \frac{Cooccur_i(acronym, definition_i)}{hits(\text{``}definition_i\text{''})}    (4)
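A minimal sketch of this filtering step follows, assuming a hits(query) helper that returns the search engine's page count for a query; the function names are illustrative.

```python
MINIMUM_COOCCURRENCES = 1

def cooccurrence(hits, acronym, definition):
    # Eq. (3): explicit adjacency in both directions, forced by quotes.
    return (hits(f'"{acronym} {definition}"')
            + hits(f'"{definition} {acronym}"'))

def web_score(hits, acronym, definition):
    # Eq. (4): co-occurrence conditioned on the definition alone.
    cooc = cooccurrence(hits, acronym, definition)
    if cooc < MINIMUM_COOCCURRENCES:
        return None  # likely misspelled or erroneous: discard
    return cooc / hits(f'"{definition}"')
```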

On the one hand, an estimation of the reliability of a definition is valuable information that may help the final user to better understand the results, for instance to observe which are the most common definitions of an acronym. On the other hand, it may be used to further filter the list of definitions, omitting those for which the score value is below a certain threshold. The use of statistical selection assessors is very common in unsupervised approaches working over noisy environments [12, 18, 36] to filter potentially non-related terms, improving the precision as a result. In our case, it is interesting to test whether the use of simple rules is enough to filter most of the incorrect candidates or whether an additional statistical assessor can help to improve the precision while maintaining the coverage. This will be tested in the evaluation section.

4.3.3 An example

Table 4 lists some definitions for the acronym URV. This example shows the presence of multi-lingual definitions, such as Unidade Real de Valor -Portuguese-, Universitat Rovira i Virgili -Catalan- and Union Reiseversicherung AG -German-. Also, a misspelled item -unmanned reconaissance vehicle- is rejected according to the absolute co-occurrence value. Alternative lexicalizations of the same definition were also obtained, such as Universitat Rovira Virgili and Universitat Rovira i Virgili.



Table 4 Examples of definitions for the acronym URV sorted by Web score (4). In italics, an example of a misspelled definition, rejected according to the total co-occurrence value (3)

Definition                         Co-occurrence   Score
UVGI Rating Value                  56              0.767
Unidade Real de Valor              3080            0.704
Urban Regional Very Large          22              0.431
Underwater Roving Vehicle          3               0.375
Unmanned reconaissance vehicle     1               0.25
Uniform Resource Visualization     28              0.193
Unit Review Visit                  4               0.153
Unit Readiness Validation          6               0.076
Ultimate Robotic Vehicle           20              0.071
Unmanned Research Vehicle          7               0.059
Unit Reference Value               8               0.057
United Recreational Vehicles LLC   4               0.055
Urban Regeneration Vehicle         11              0.055
Union Reiseversicherung AG         262             0.052
Universitat Rovira i Virgili       5990            0.045
Urban Recreational Vehicle         6               0.038
Universitat Rovira Virgili         24              0.036
University Rovira i Virgili        207             0.033
Upper Range Value                  184             0.029
Universidad Rovira i Virgili       693             0.026

Finally, translations into several languages can be found, such as Universitat Rovira i Virgili -Catalan-, Universidad Rovira i Virgili -Spanish- and University Rovira i Virgili -English-.

5 Evaluation

The evaluation of automatic learning procedures which deal with highly dynamic environments like acronyms and unbounded corpora like the Web is a challenging task. Fortunately, there exist general manually composed acronym-definition repositories, the biggest one being the mentioned Acronym Finder. Acronym Finder provides a generality-based ranked set of definitions for a given acronym which can stand as a baseline to compare and evaluate automatically obtained results. Even so, being hand-made, it presents coverage limitations, as will be noted during the evaluation.
In this section, the design of the evaluation procedure is presented, describing the criteria, metrics and results of several tests. As the extraction and selection of acronym definitions is based on common patterns and rules used by previous approaches (summarised in Sect. 2), special care will be taken in evaluating the improvements brought by the two aspects which differentiate the proposal from previous ones: (i) the exploitation of the Web by means of the adaptive query expansion algorithm and (ii) the web-based score used to estimate the reliability of the definitions.

Considering the amount of possible acronyms and definitions to evaluate and the bottleneck of a manual evaluation, partial (randomly selected) sample sets have been considered. In the end, more than 1800 acronym-definition pairs have been checked. Compared to evaluations performed by other authors, our set is considerably bigger (specifically, 166 pairs were evaluated in [31], 168 in [30], 861 in [14], and 815 in [45]).
All tests have been performed under the same conditions, using the Google Search API and the algorithm parameters mentioned in the explanation (MINIMUM_DEFINITIONS = 1, MINIMUM_COOCCURRENCES = 1 and the maximum number of snippets supported by Google per query for the NUMBER_WEBS_PER_ITERATION constant).

5.1 Evaluation measures

The quality of the results has been evaluated by means of the typical measures used in Information Retrieval: precision, recall and F-measure.
Precision measures the percentage of correctly extracted definitions in relation to the complete extracted set (5). Due to the coverage limitations of Acronym Finder (i.e., many correctly extracted definitions are not considered there), the correctness of each definition is manually assessed by a human expert.

Precision = \frac{\#correct \; definitions}{\#total \; definitions}    (5)

Recall shows how many of the existing definitions have been extracted with respect to the baseline set provided by Acronym Finder (6).

Recall = \frac{\#Acronym \; Finder \; definitions \; extracted}{\#Acronym \; Finder \; definitions}    (6)

F-measure provides the weighted harmonic mean of precision and recall (7).

F\text{-}Measure = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}    (7)
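In code, the three measures reduce to simple set arithmetic; the variable names below (correct, extracted, baseline) are illustrative.

```python
def precision(correct, extracted):
    return len(correct) / len(extracted)               # Eq. (5)

def recall(extracted, baseline):
    return len(baseline & extracted) / len(baseline)   # Eq. (6)

def f_measure(p, r):
    return 2 * p * r / (p + r)                         # Eq. (7)
```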

5.2 Evaluation of highly polysemic acronyms

The first tests cover acronyms of three letters. They constitute an especially problematic set because, on the one hand, the amount of available definitions for such letter combinations can be overwhelming for manually constructed repositories. On the other hand, due to their shortness, they are very polysemic, with dozens of possible definitions per acronym. So, they are interesting in order to show the performance of the approach under the most adverse conditions.
The combination of 3 Latin non-numeric characters constitutes a candidate set of 17576 possible acronyms.

Table 5 Evaluation results for 20 three-letter acronyms against Acronym Finder

Acronym   #Definitions AcroFinder   #Retrieved definitions   #Non-English definitions   Precision   Recall   F-measure
ABG       19                        115                      50 (43.4%)                 94%         52.6%    67.4%
CNL       16                        87                       24 (27.6%)                 91%         50%      64.5%
ETN       10                        57                       13 (22.8%)                 89.4%       70%      78.5%
IQC       8                         28                       4 (14.2%)                  92.8%       75%      82.9%
IQL       5                         13                       5 (38.4%)                  92.3%       80%      85.7%
KMP       9                         90                       66 (73.3%)                 86.6%       77.7%    81.9%
LEF       15                        111                      29 (26.1%)                 95.5%       60%      73.7%
NIO       7                         46                       29 (63%)                   86.9%       71.4%    78.4%
NLE       14                        38                       4 (10.5%)                  94.7%       57.1%    71.2%
NRF       20                        111                      26 (23.4%)                 96.4%       80%      87.4%
OLT       20                        110                      33 (30%)                   95.4%       75%      84%
RBN       9                         54                       24 (44%)                   92.6%       66.6%    77.5%
SFE       19                        177                      39 (22%)                   92.6%       63.1%    75%
TWF       13                        87                       34 (39.1%)                 93.1%       69.2%    79.5%
TWI       13                        101                      39 (39.6%)                 85.1%       69.2%    76.3%
VDC       17                        134                      39 (29.1%)                 95.5%       70.6%    81.2%
VSW       9                         30                       5 (16.6%)                  90%         66.6%    76.5%
WME       13                        97                       31 (31.2%)                 89.6%       69.2%    78.1%
WRP       14                        151                      44 (29.1%)                 96.7%       71.4%    82.1%
WSN       13                        50                       14 (28%)                   96%         53.8%    68.9%

After the algorithm was executed over those acronyms, a list of at least one definition was retrieved for 70% of the candidates. This indicates that combinations of three characters constitute an especially productive acronym set.
In order to perform the manual evaluation, we executed the algorithm for a random set of 20 acronyms with at least 5 available definitions in Acronym Finder; the aim is to analyse polysemic cases. The accuracy of the results was manually evaluated and they were compared against the Acronym Finder sets. The total number of acronym-definition pairs manually evaluated was 1687. We have also counted the amount of non-English definitions to show the capability of the system to retrieve multi-language results. This includes results in other languages and those definitions with cross-language terms (mainly Named Entities). The results are presented in Table 5.
First, it can be seen that, on average, 32% of the results correspond to non-English terms. The main languages in which definitions are expressed are Latin-based ones such as Italian, Portuguese or Spanish, even though western European languages such as German also appear frequently. Some letter combinations are more prone to English definitions, such as those starting with 'W'. Acronyms with letters that are rarer -with respect to English- such as 'K' (e.g., KMP) return a higher percentage of non-English definitions (73.3%).
From the evaluation measures, it can be observed that the precision is high and consistent through the evaluated cases (85–96%). This high accuracy shows the effectiveness of the patterns used to extract candidates and the rules employed to filter them.

Even though it deals with a different corpus, this precision is higher than that of previous works attempting to compose large scale acronym-definition sets (like [31], in which a precision of 78% is reported) and quite similar to previous state-of-the-art works (above 90% in most cases) dealing with unique domain documents, such as [30]. Generalizing this problem to a Web scale, it shows that the quality of the results is maintained even when using a much bigger, unstructured, noisier and apparently unreliable corpus.
Regarding the recall, it is lower and more variable, even though maintained in a usable range (50–78%). In order to study the causes of this situation, other indicators can be analysed. First, the absolute number of definitions automatically retrieved is much higher (almost one order of magnitude) than the list presented by Acronym Finder. Considering that the definitions have been validated by an expert, the coverage limitations of manually constructed repositories can be noticed. Next, we analysed against the Web the Acronym Finder definitions which the system was not able to discover. First, it was found that the low recall was not caused by the set of selection rules, as most of the missing definitions fulfil them. In order to analyse other causes, each non-retrieved definition was queried in conjunction with the acronym in the web search engine to estimate the number of available Web documents from which an explicit acronym-definition matching can be extracted.

Table 6 Analysis of non-retrieved Acronym Finder definitions. From left to right, number of missing definitions with 0, 1 to 9 and 10+ web hits and percentage of mistakes located in the last two quartiles of the ranked Acronym Finder definition list

Acronym   #Missing defs with 0 hits   #Missing defs with hits < 10   #Missing defs with hits ≥ 10   % Missing defs in 3rd and 4th quartiles
ABG       4 (44.4%)                   5 (55.5%)                      0 (0%)                         60%
CNL       2 (25%)                     4 (50%)                        2 (25%)                        75%
ETN       1 (33.3%)                   1 (33.3%)                      1 (33.3%)                      100%
IQC       0 (0%)                      1 (50%)                        1 (50%)                        50%
IQL       0 (0%)                      1 (100%)                       0 (0%)                         100%
KMP       0 (0%)                      1 (50%)                        1 (50%)                        50%
LEF       1 (16.6%)                   5 (83.3%)                      0 (0%)                         62.5%
NIO       0 (0%)                      1 (50%)                        1 (50%)                        100%
NLE       2 (33.3%)                   3 (50%)                        1 (16.6%)                      83.3%
NRF       0 (0%)                      3 (75%)                        1 (25%)                        100%
OLT       1 (20%)                     2 (40%)                        2 (40%)                        50%
RBN       0 (0%)                      2 (66.6%)                      1 (33.3%)                      100%
SFE       3 (42.8%)                   4 (57.1%)                      0 (0%)                         100%
TWF       2 (50%)                     1 (25%)                        1 (25%)                        75%
TWI       0 (0%)                      2 (50%)                        2 (50%)                        75%
VDC       2 (40%)                     2 (40%)                        1 (20%)                        55.5%
VSW       1 (33.3%)                   1 (33.3%)                      1 (33.3%)                      100%
WME       1 (25%)                     2 (50%)                        1 (25%)                        66.6%
WRP       1 (25%)                     3 (75%)                        0 (0%)                         100%
WSN       0 (0%)                      4 (66.6%)                      2 (33.3%)                      83%

As a result (see Table 6), it was found that only 25% of the queries returned more than 10 results. Of the remaining 75%, a significant 20% of the definitions returned zero hits. So, one can observe that missing definitions correspond mainly to rare definitions with a very low (even non-existent) amount of Web occurrences (at least as indexed by the web search engine).
Considering that Acronym Finder presents definitions sorted by relevance according to their common use, we also evaluated the missing results in relation to their position in that ranked list. In order to measure this, the percentage of missing definitions with lower ranks (third and fourth quartiles) was calculated. As a result, on average, 79.3% of the missing definitions corresponded to the less relevant ones according to Acronym Finder (in all cases, the percentage is equal to or higher than 50%, as shown in Table 6). This also shows that recall problems are associated with the rarest definitions.
Recall limitations have also been observed in previous unsupervised works attempting to construct acronym dictionaries (such as [45], with a maximum recall of 70.9%). So, data sparseness may appear even when using the Web as a learning corpus. Considering that the method completely relies on Google's IR recall, many pages belonging to the so-called deep Web [6] are not retrieved. In fact, it is estimated that the deep Web is several orders of magnitude larger than the surface Web. In an ideal case, missing terms with one or more hits could be retrieved by the proposed approach by means of a more relaxed corpus analysis which seeks more resources (e.g., less constrained finalisation rules) and further expands web queries (e.g., by introducing new terms).

However, considering the problem size, the scalability of the approach could be compromised by the number of web accesses and search engine queries required to evaluate, in the worst case, the full set of Web resources available for a given acronym.
Analysing the missing definitions individually, we also found that some of the non-retrieved definitions do not follow the generation rules presented in Sect. 3. As mentioned, those particularly problematic cases are very difficult to identify [39] and require new heuristics which may compromise the algorithm's generality.

5.3 Query expansion evaluation

We also tested the influence of the query expansion algorithm described in Sect. 4.2.2. The results obtained when analysing the static list of web resources presented by the search engine when querying the acronym (i.e., no query expansion, only 1000 web sites available) were compared against those obtained by the adaptive analysis presented in Sect. 4.2.2. The objective is to demonstrate the necessity and usefulness of the incremental query expansion algorithm in order to obtain results with good coverage. The results of this experiment are shown in Table 7.



Table 7 Evaluation of the results with and without applying the query expansion (QE) algorithm

Acronym   #Definitions (with QE)   #Definitions (no QE)   Precision (with QE)   Precision (no QE)   Recall (with QE)   Recall (no QE)   F-measure (with QE)   F-measure (no QE)
ABG       115                      9                      94%                   88.8%               52.6%              5.7%             67.4%                 10.7%
CNL       87                       17                     91%                   88.2%               50%                12.5%            64.5%                 21.9%
ETN       57                       12                     89.4%                 100%                70%                30%              78.5%                 46.1%
IQC       28                       5                      92.8%                 100%                75%                25%              82.9%                 40%
IQL       13                       8                      92.3%                 100%                80%                40%              85.7%                 57.1%
KMP       90                       9                      86.6%                 77.7%               77.7%              44.4%            81.9%                 56.6%
LEF       111                      6                      95.5%                 83.3%               60%                6.7%             73.7%                 12.3%
NIO       46                       6                      87%                   83.3%               71.4%              28.6%            78.4%                 42.5%
NLE       38                       13                     94.7%                 92.3%               57.1%              14.2%            71.2%                 24.6%
NRF       111                      20                     96.4%                 95%                 80%                30%              87.4%                 45.6%
OLT       110                      12                     95.4%                 83.3%               75%                10%              84%                   17.8%
RBN       54                       12                     92.6%                 91.6%               66.6%              22.2%            77.5%                 35.7%
SFE       177                      14                     92.6%                 92.8%               63.1%              21%              75%                   34.2%
TWF       87                       5                      93.1%                 100%                69.2%              7.7%             79.5%                 14.3%
TWI       101                      8                      85.1%                 100%                69.2%              30.8%            76.3%                 47.1%
VDC       134                      9                      95.5%                 77.7%               70.6%              11.8%            81.2%                 20.4%
VSW       30                       9                      90%                   100%                66.6%              33.3%            76.5%                 49.9%
WME       97                       12                     89.6%                 91.6%               69.2%              30.7%            78.1%                 46%
WRP       151                      15                     96.7%                 100%                71.4%              28.5%            82.1%                 44.3%
WSN       50                       9                      96%                   88.8%               53.8%              23%              68.9%                 36.5%

In all the tested cases, the 1000 directly accessible web resources are not enough to obtain a representative set of definitions. Of the average of 101 definitions per acronym retrieved by means of the query expansion algorithm, only an average of 11 is obtained from the first 1000 resources. This results in a much lower recall, with an average of only 22.8% compared to the 67.4% obtained after the initial query is expanded. In both cases, precisions are very similar (91.72% vs. 92.31%), with a higher variability for the fixed set due to the lower amount of results. As a conclusion, the F-Measure shows a value that is less than half the one obtained by the proposed approach (35.18% against 77.5%).

5.4 Web-based reliability evaluation

Next, we evaluated the quality of the definition reliability estimation. As mentioned in Sect. 4.3.2, the Web-based score can be taken into consideration to further filter the results and improve the precision. In order to test it, the distribution of the mistakes in the list of definitions, sorted according to the computed reliability score, was checked. Table 8 summarises the obtained results with and without the last quartile, where the apparently less reliable definitions are located.
Several conclusions can be drawn. First, it can be observed that, on average, 51.8% of the total mistakes are located in the fourth quartile and, in all cases, the percentage is equal to or higher than 25%.

These results suggest that the Web-based score approximates the reliability of a definition by rating erroneous definitions with a low value, which can be used as a filter to improve the precision. As expected, the average precision rises from 92.3% to 94.8% when excluding the elements of the last quartile. The recall value is identical in most of the cases but, when the last quartile contains valid definitions, the value is lower. Considering the reduced amount of definitions available in Acronym Finder (10–20), this fact significantly affects the final performance (lower F-Measure). Even so, in most cases, results are slightly better due to the improvement in selection accuracy.

5.5 Evaluating acronyms with low polysemy

In contrast to the short acronyms considered up to this point, for longer forms the number of definitions is significantly lower (e.g., AA stands for 266 definitions, AAA for 162, AAAA for 31, AAAAA for 5 and AAAAAA for 1, according to Acronym Finder). Unambiguous cases can be easily solved as the queried acronym has very few senses, resulting in a high Web-IR precision [16]. These cases are evaluated in this section.
We took another random set of 20 four-letter acronyms for which Acronym Finder provides a minimum of 1 definition and a maximum of 5.



Table 8 Evaluation of definition reliability including and omitting the last quartile of definitions

Acronym   %Mistakes in 4th quartile   Precision (with 4th quartile)   Precision (without 4th quartile)   Recall (with 4th quartile)   Recall (without 4th quartile)   F-Measure (with 4th quartile)   F-Measure (without 4th quartile)
ABG       71.4%                       94%                             97.7%                              52.6%                        47.4%                           67.4%                           63.8%
CNL       50%                         91%                             93.8%                              50%                          50%                             64.5%                           65.2%
ETN       33%                         89.4%                           90.7%                              70%                          70%                             78.5%                           79%
IQC       50%                         92.8%                           95.2%                              75%                          62.5%                           82.9%                           75.5%
IQL       100%                        92.3%                           100%                               80%                          80%                             85.7%                           88.9%
KMP       25%                         86.6%                           86.7%                              77.7%                        66.7%                           81.9%                           75.4%
LEF       40%                         95.5%                           96.4%                              60%                          46.6%                           73.7%                           62.8%
NIO       50%                         86.9%                           91.4%                              71.4%                        71.4%                           78.4%                           80.2%
NLE       50%                         94.7%                           96.5%                              57.1%                        57.1%                           71.2%                           71.7%
NRF       75%                         96.4%                           98.8%                              80%                          75%                             87.4%                           82.3%
OLT       40%                         95.4%                           96.4%                              75%                          70%                             84%                             81.1%
RBN       50%                         92.6%                           95%                                66.6%                        55.5%                           77.5%                           70%
SFE       38.5%                       92.6%                           93.9%                              63.1%                        63.1%                           75%                             75.5%
TWF       50%                         93.1%                           95.4%                              69.2%                        69.2%                           79.5%                           80.2%
TWI       26.7%                       85.1%                           85.5%                              69.2%                        69.2%                           76.3%                           76.5%
VDC       50%                         95.5%                           97%                                70.6%                        64.7%                           81.2%                           77.6%
VSW       66.6%                       90%                             95.4%                              66.6%                        66.6%                           76.5%                           78.4%
WME       60%                         89.6%                           94.5%                              69.2%                        69.2%                           78.1%                           79.9%
WRP       60%                         96.7%                           98.2%                              71.4%                        71.4%                           82.1%                           82.7%
WSN       50%                         96%                             97.3%                              53.8%                        53.8%                           68.9%                           69.3%

In total, 159 acronym-definition pairs were manually evaluated. The results are summarised in Table 9.

Table 9 Evaluation results for 20 four-letter acronyms against Acronym Finder

| Acronym | #Definitions in Acronym Finder | #Retrieved definitions | Precision | Recall | F-Measure |
|---------|--------------------------------|------------------------|-----------|--------|-----------|
| BEHI | 1 | 1 | 100% | 100% | 100% |
| CAEU | 1 | 3 | 100% | 100% | 100% |
| CIJE | 1 | 5 | 100% | 100% | 100% |
| CNIA | 3 | 18 | 100% | 66.6% | 79.92% |
| CMKD | 1 | 1 | 100% | 100% | 100% |
| CUHI | 1 | 4 | 100% | 100% | 100% |
| DNIS | 2 | 11 | 90.1% | 100% | 94.79% |
| GTA4 | 1 | 1 | 100% | 100% | 100% |
| LMEA | 1 | 9 | 100% | 100% | 100% |
| LMES | 4 | 18 | 100% | 75% | 85.7% |
| MUNS | 3 | 7 | 100% | 66.6% | 80% |
| NWIA | 1 | 4 | 100% | 100% | 100% |
| NWUA | 1 | 1 | 100% | 100% | 100% |
| SHID | 1 | 1 | 100% | 0% | 0% |
| SLIA | 2 | 16 | 100% | 100% | 100% |
| SLIG | 1 | 14 | 92.8% | 100% | 96.26% |
| SMEI | 4 | 38 | 97.36% | 100% | 98.66% |
| WIEA | 1 | 3 | 100% | 100% | 100% |
| WMIE | 1 | 2 | 50% | 100% | 66.6% |
| XMEA | 1 | 2 | 50% | 100% | 66.6% |

In most cases, the system discovers a reduced number of definitions, especially when a unique one exists in Acronym Finder. Recall is at its maximum in most situations, with only one case (SHID) in which the definition set was not discovered. Precision follows the same tendency observed in previous tests, with a high accuracy (94% on average).
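For reference, the figures in Tables 8 and 9 follow the standard definitions of the three metrics. The following Python sketch (illustrative only) computes them from raw counts, under the assumption that precision is measured over the manually judged extractions and recall against the definitions listed in Acronym Finder:

```python
def prf(valid_retrieved: int, total_retrieved: int,
        gold_found: int, gold_total: int) -> tuple:
    """Precision over the retrieved definitions, recall against a gold
    repository (here, Acronym Finder), and their harmonic mean."""
    precision = valid_retrieved / total_retrieved if total_retrieved else 0.0
    recall = gold_found / gold_total if gold_total else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

# Mirrors the CNIA row of Table 9 (up to rounding): 18 retrieved
# definitions, all judged valid, covering 2 of the 3 listed senses.
p, r, f = prf(18, 18, 2, 3)
print(f"P={p:.1%}  R={r:.1%}  F={f:.1%}")  # P=100.0%  R=66.7%  F=80.0%
```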

6 Conclusions and further work

In this paper, a novel approach for compiling general and large-scale acronym-definition sets is introduced. Whereas most previous attempts dealing with acronyms focus only on the contextualised detection and discovery of acronyms and definitions within a single document, the proposed approach contributes a more general solution. Specifically, being automatic and unsupervised, it may aid in the development of manually composed repositories such as Acronym Finder, improving recall and keeping results up-to-date (through continuous automatic executions). Even though the approach relies on the same principles as previous attempts (summarised in Sect. 2) with respect to the use of patterns and rules to extract and filter acronym definitions, several aspects differentiate it from those works:

• It is adapted to the Web environment, exploiting general-purpose Web search engines to incrementally retrieve the Web resources to analyse, minimising (but not completely eliminating) data sparseness. In contrast, most previous works are applied over reduced and predefined corpora with a very limited or domain-dependent coverage. In fact, very few attempts have been made at compiling large acronym-definition sets (as shown in Sect. 2).
• Considering the unfeasibility (due to scalability problems) and impossibility (due to Web search indexing limitations) of a complete Web corpus analysis for a given acronym, an adaptive and incremental analysis based on the expansion of search queries according to the already acquired definitions is proposed (a sketch of this idea is given after this list). The algorithm proves effective in expanding the search to initially hidden resources, which improves recall.
• The generality of the approach lies in the use of general, domain-independent and multi-language patterns and selection rules. The limitations of pattern-based approaches are compensated by the high redundancy of Web information, which provides the same information in different textual forms. In contrast, as introduced in Sect. 2, many approaches are language- or domain-dependent due to the language-dependent patterns employed or the use of linguistic analyses.


• The designed Web-based reliability assessor has proved to be a valid estimation of the suitability of a definition for a given acronym. Web-based statistical analyses have been extensively used in Information Extraction (e.g., the discovery of relevant terms [18]) and Knowledge Acquisition tasks (e.g., Ontology Learning [36]) but, as far as we know, they have not been applied to estimate the degree of acronym-definition association.
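As an illustration of the incremental expansion idea raised in the second point above, the following self-contained Python sketch negates the definitions already harvested in the next query, so that the engine surfaces resources covering other, still hidden senses. It is a hypothetical simplification: the toy corpus, the parenthetical extraction pattern and the use of the exclusion (-"...") operator stand in for the system's actual patterns, queries and search engine interface:

```python
import re

# Toy corpus standing in for the Web; the snippets define an acronym with
# a parenthetical pattern (hypothetical senses, for illustration only).
CORPUS = [
    "The patient's ABG (arterial blood gas) values were normal.",
    "Our ABG (autonomous business group) reported strong results.",
]

def fake_search(query):
    """Stand-in for a Web search engine that honours the -"..." operator."""
    negated = re.findall(r'-"([^"]+)"', query)
    return [s for s in CORPUS
            if not any(n.lower() in s.lower() for n in negated)]

def extract(acronym, snippets):
    """Toy extractor: matches the ACRONYM (definition) pattern."""
    pat = re.compile(rf'{acronym}\s*\(([^)]+)\)', re.IGNORECASE)
    return {m.group(1).lower() for s in snippets for m in pat.finditer(s)}

def expand_queries(acronym, search, extract_defs, rounds=5):
    """After each round, negate the definitions found so far so that the
    next query surfaces resources covering other senses of the acronym."""
    found = set()
    for _ in range(rounds):
        negations = " ".join(f'-"{d}"' for d in sorted(found))
        snippets = search(f'"{acronym}" {negations}'.strip())
        new = extract_defs(acronym, snippets) - found
        if not new:  # no hidden senses surfaced; stop early
            break
        found |= new
    return found

print(expand_queries("ABG", fake_search, extract))
```

Against a live engine, each expansion round retrieves previously hidden resources, which matches the recall gains reported in the evaluation above.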

The proposed approach offers accurate results (after the manual evaluation of more than 1800 acronym-definition pairs) with a reasonable level of coverage in comparison to manually built repositories. The number of results is an order of magnitude larger than that of manual attempts, which shows the usefulness of the proposal. This fact also demonstrates the value of the Web as a learning corpus [35], as other authors have shown in the fields of question answering [4], machine translation [20] and ontology enrichment [2].

As a future line of research, we will try to refine the query expansion algorithm in order to extend the analysed corpus even further. Additional Web search operators and new terms can be employed to create queries that retrieve new resources. In addition, several Web search engines (e.g., Google, AltaVista, MSN Live!) could be combined to compose a more complete and heterogeneous corpus to analyse. The final objective will be to overcome the detected coverage issues. Other long-term research lines may include the detection of the definition language using automatic language recognisers, or the automatic clustering of domain-related definitions according to, for example, predefined categories.

Acknowledgements The authors would like to acknowledge the feedback of Dr. Antonio Moreno. This work is partially supported by the Universitat Rovira i Virgili (2009AIRE-04) and the DAMASK project (Data mining algorithms with semantic knowledge, TIN2009-11005).

References

1. Adar E (2002) S-RAD: a simple and robust abbreviation dictionary. HP Laboratories
2. Agirre E, Ansa O, Hovy E, Martínez D (2000) Enriching very large ontologies using the WWW. In: Proc of workshop on ontology construction of the European conference on AI, ECAI, Berlin, pp 73-77
3. Brill E (2003) Processing natural language without natural language processing. In: Gelbukh A (ed) Proc of 4th international conference on computational linguistics and intelligent text processing, CICLing 2003, Mexico City, Mexico. Springer, Berlin/Heidelberg, pp 360-369
4. Brill E, Lin J, Banko M, Dumais S (2001) Data-intensive question answering. In: Voorhees EM, Harman DK (eds) Proc of tenth text retrieval conference, TREC 2001. Department of Commerce, National Institute of Standards and Technology, Gaithersburg, Maryland, US, pp 393-400
5. Carmel D, Farchi E, Petruschka Y, Soffer A (2002) Automatic query refinement using lexical affinities with maximal information gain. In: Beaulieu M, Baeza-Yates R, Myaeng SH, Järvelin K (eds) Proc of 25th annual international ACM SIGIR conference on research and development in information retrieval, SIGIR 02, Tampere, Finland, pp 283-290
6. Castells P (2003) Sistemas interactivos y colaborativos en la Web. In: Bravo C, Redondo MA (eds) La web semántica. Ediciones de la Universidad de Castilla-La Mancha, pp 195-212
7. Chang C-H, Hsu C-C (1998) Integrating query expansion and conceptual relevance feedback for personalized web information retrieval. Comput Netw ISDN Syst 30:621-623
8. Chang JT, Schütze H (2006) Abbreviations in biomedical text. In: Ananiadou S, McNaught J (eds) Text mining for biology and biomedicine. Artech House, Norwood, pp 99-119
9. Chirita P-A, Firan CS, Nejdl W (2007) Personalized query expansion for the Web. In: Clarke CLA, Fuhr N, Kando N, Kraaij W, de Vries AP (eds) Proc of 30th annual international ACM SIGIR conference on research and development in information retrieval, SIGIR 07. ACM, Amsterdam, pp 7-14
10. Church KW, Gale W, Hanks P, Hindle D (1991) Using statistics in lexical analysis. In: Zernik U (ed) Lexical acquisition: exploiting on-line resources to build a lexicon. Lawrence Erlbaum Associates, New Jersey, pp 115-164
11. Cilibrasi RL, Vitányi PMB (2006) The Google similarity distance. IEEE Trans Knowl Data Eng 19:370-383
12. Cimiano P, Staab S (2004) Learning by Googling. SIGKDD Explor 6:24-33
13. Ciravegna F, Dingli A, Guthrie D, Wilks Y (2003) Integrating information to bootstrap information extraction from Web sites. In: Kambhampati S, Knoblock CA (eds) Proc of IJCAI workshop on information integration on the Web, IIWeb 2003. IJCAI Press, Acapulco, pp 9-14
14. Dannélls D (2006) Automatic acronym recognition. In: Proc of 11th conference of the European chapter of the association for computational linguistics, EACL 2006. The Association for Computer Linguistics, Trento, pp 167-170
15. Dimililer N, Varoğlu E, Altınçay H (2009) Classifier subset selection for biomedical named entity recognition. Appl Intell. doi:10.1007/s10489-008-0124-0, to appear
16. Dujmovic J, Bai H (2006) Evaluation and comparison of search engines using the LSP method. Comput Sci Inf Syst 3:711-722
17. Etzioni O, Cafarella M, Downey D, Kok S, Popescu A, Shaked T, Soderland S, Weld DS (2004) Web-scale information extraction in KnowItAll. In: Proc of 13th international World Wide Web conference, WWW 2004. ACM Press, New York, pp 100-110
18. Etzioni O, Cafarella M, Downey D, Popescu A-M, Shaked T, Soderland S, Weld DS, Yates A (2005) Unsupervised named-entity extraction from the Web: an experimental study. Artif Intell 165:91-134
19. Ferreira da Silva J, Lopes GP (1999) A local maxima method and a fair dispersion normalization for extracting multi-word units from corpora. In: Proc of sixth meeting on mathematics of language, MOL6. Association for Computational Linguistics, Orlando, pp 369-381
20. Grefenstette G (1999) The World Wide Web as a resource for example-based machine translation tasks. In: Proc of twenty-first international conference on translating and the computer. Aslib Press, London
21. Henzinger MR (2008) PageRank algorithm. In: Kao M-Y (ed) Encyclopedia of algorithms. Springer, New York
22. Hisamitsu T, Niwa Y (2001) Extracting useful terms from parenthetical expressions by combining simple rules and statistical measures: a comparative evaluation of bigram statistics. In: Bourigault D, Christian J, L'Homme M-C (eds) Recent advances in computational terminology. Benjamins, Amsterdam, pp 209-224
23. Hunt JW, Szymanski TG (1977) A fast algorithm for computing longest common subsequences. Commun ACM 20:350-353
24. Kilgarriff A, Grefenstette G (2003) Introduction to the special issue on the Web as corpus. Comput Linguist 29:333-347
25. Kim M-C, Choi K-S (1999) A comparison of collocation-based similarity measures in query expansion. Inf Process Manag 35:19-30
26. Kim S-B, Seo H-C, Rim H-C (2004) Information retrieval using word senses: root sense tagging approach. In: Järvelin K, Allan J, Bruza P, Sanderson M (eds) Proc of 27th annual international ACM SIGIR conference on research and development in information retrieval, SIGIR 04. ACM, Sheffield, pp 258-265
27. Lam-Adesina AM, Jones GJF (2001) Applying summarization techniques for term selection in relevance feedback. In: Kraft DH, Croft WB, Harper DJ, Zobel J (eds) Proc of 24th annual international ACM SIGIR conference on research and development in information retrieval, SIGIR 01. ACM, New Orleans, pp 1-9
28. Larkey L, Ogilvie P, Price A, Tamilio B (2000) Acrophile: an automated acronym extractor and server. In: Proc of 5th ACM conference on digital libraries. Association for Computing Machinery, San Antonio, pp 205-214
29. Liu H, Friedman C (2003) Mining terminological knowledge in large biomedical corpora. In: Altman RB, Dunker AK, Hunter L, Klein TE (eds) Proc of 8th Pacific symposium on biocomputing, PSB 2003. PSB Association, Lihue, pp 415-426
30. Nadeau D, Turney PD (2005) A supervised learning approach to acronym identification. In: Kégl B, Lapalme G (eds) Proc of 18th conference of the Canadian society for computational studies of intelligence, Canadian AI 2005. Springer, Berlin/Heidelberg, pp 319-329
31. Okazaki N, Ananiadou S (2006) A term recognition approach to acronym recognition. In: Proc of the international committee on computational linguistics and the association for computational linguistics, COLING-ACL 2006. Association for Computational Linguistics, Sydney, pp 643-650
32. Park Y, Byrd RJ (2001) Hybrid text mining for finding abbreviations and their definitions. In: Lee L, Harman D (eds) Proc of conference on empirical methods in natural language processing, EMNLP 2001. Intelligent Information Systems Institute, Pittsburgh, pp 126-133
33. Pustejovsky J, Castaño J, Cochran B, Kotecki M, Morrell M (2001) Automatic extraction of acronym-meaning pairs from MEDLINE databases. In: Patel V, Rogers R, Haux R (eds) Proc of 10th triennial congress of the international medical informatics association, MEDINFO 2001. IOS Press, London, pp 371-375
34. Qiu Y, Frei H-P (1993) Concept based query expansion. In: Korfhage R, Rasmussen E, Willett P (eds) Proc of 16th annual international ACM SIGIR conference on research and development in information retrieval, SIGIR 93. ACM, Pittsburgh, pp 160-169
35. Resnik P, Smith N (2003) The Web as a parallel corpus. Comput Linguist 29:349-380
36. Sánchez D, Moreno A (2008) Pattern-based automatic taxonomy learning from the Web. AI Commun 21:27-48
37. Schwartz A, Hearst M (2003) A simple algorithm for identifying abbreviation definitions in biomedical texts. In: Altman RB, Dunker AK, Hunter L, Klein TE (eds) Proc of 8th Pacific symposium on biocomputing, PSB 2003. PSB Association, Lihue, pp 451-462
38. Taghva K, Gilbreth J (1999) Recognizing acronyms and their definitions. Int J Doc Anal Recognit 1:191-198
39. Torii M, Hu Z-Z, Song M, Wu CH, Liu H (2006) A comparison study on algorithms of detecting long forms for short forms in biomedical text. BMC Bioinform 8:S5
40. Turney PD (2001) Mining the Web for synonyms: PMI-IR versus LSA on TOEFL. In: Raedt LD, Flach P (eds) Proc of 12th European conference on machine learning, ECML 2001, Freiburg, Germany. Springer, Berlin/Heidelberg, pp 491-499
41. WordNet (1998) WordNet: an electronic lexical database. MIT Press, Cambridge
42. Xiao L, Wissmann D, Brown M, Jablonski S (2004) Information extraction from the Web: system and techniques. Appl Intell 21:195-224
43. Yarowsky D (1995) Unsupervised word-sense disambiguation rivaling supervised methods. In: Uszkoreit H (ed) Proc of 33rd annual meeting of the association for computational linguistics. Association for Computational Linguistics, Cambridge, pp 189-196
44. Yeates S (1999) Automatic extraction of acronyms from text. In: Yeates S (ed) Proc of third New Zealand computer science research students' conference. University of Waikato, Te Kohinga Marama Marae, Hamilton, New Zealand, pp 117-124
45. Yoon Y-C, Park S-Y, Song Y-I, Rim H-C, Rhee D-W (2008) Automatic acronym dictionary construction based on acronym generation types. IEICE Trans Inf Syst E91-D:1584-1587
46. Yu H, Hripcsak G, Friedman C (2002) Mapping abbreviations to full forms in biomedical articles. J Am Med Inform Assoc 9:262-272
47. Yu S, Cai D, Wen J-R, Ma W-Y (2003) Improving pseudo-relevance feedback in web information retrieval using web page segmentation. In: Hencsey G, White B, Robin Chen Y-F, Kovács L, Lawrence S (eds) Proc of 12th international conference on World Wide Web, WWW 03, Budapest. ACM, New York, pp 11-18
48. Zahariev M (1991) Faculty of Control Systems and Computers, Polytechnic Institute of Bucharest; Simon Fraser University, Bucharest, Romania

David Sánchez is a Lecturer at the Department of Computer Science and Mathematics of the University Rovira i Virgili. He received a PhD in Artificial Intelligence from UPC (Technical University of Catalonia) in 2008. He is a member of the ITAKA research group (Intelligent Techniques for Advanced Knowledge Acquisition). His research interests are intelligent agents, ontology learning and the Semantic Web. He has been involved in several national and European research projects, and has published several papers and conference contributions.

David Isern is a post-doctoral researcher at the Department of Computer Science and Mathematics of the University Rovira i Virgili, and an associate professor at the Open University of Catalonia. He received his PhD in Artificial Intelligence (2009) and an MSc (2005) from the Technical University of Catalonia. His research interests are intelligent software agents, distributed systems, the management of users' preferences, and ontologies, especially applied to healthcare and information retrieval systems. He has been involved in several national and European research projects, and has published several papers and conference contributions.