International Conference on Computer Systems and Technologies - CompSysTech’11
Enhancing Automatic Term Recognition Algorithms with HTML Tags Processing

Milan Lučanský, Marián Šimko, Mária Bieliková

Abstract: We focus on mining relevant information from web pages. Unlike plain text documents, web pages contain another source of potentially relevant information: easily processable mark-up. We propose an approach to keyword extraction that enhances Automatic Term Recognition (ATR) algorithms, originally intended for processing plain text documents, with an analysis of the HTML tags present in a document. We distinguish tags that have semantic potential. We present the results of an experiment conducted on a set of Wikipedia pages, which shows that the enhancement yields better results than using ATR algorithms alone.

Key words: Keyword Extraction, Lightweight Semantics, HTML Tag, ATR, Term
INTRODUCTION
With the rise of the keyword search paradigm, keywords have become crucial for information retrieval, especially in large hypertext spaces. Keywords form a basis for various types of semantic representations, e.g., they are utilized in the field of ontology engineering. The role of keywords is especially important in the emerging social web, which leverages keywords in the form of ubiquitous tags to support sharing and organizing knowledge. As a result, keywords are used not only to describe content; they are even used in user modelling for adaptive web-based systems to represent the context [2, 3].

The vast amount of online data makes it impossible to assign keywords manually. Various approaches to automatic term recognition (ATR) have evolved [1, 4, 6, 9, 10, 13]. ATR algorithms are widely used to obtain relevant terms from large text corpora. However, the application of ATR algorithms in hyperspaces is not as well explored as it is for large corpora of offline data (e.g., from the medical domain). On the Web, ATR could benefit from features specific to the web environment, such as web mark-up and website structure. It has already been shown that some HTML tags can flag semantic content [7].

The aim of our research is to explore and evaluate the possibilities of processing web mark-up and structure to improve automatic keyword extraction. In this paper we propose an approach that enhances ATR with HTML mark-up processing, and we present the results of an experiment on keyword extraction from a set of Wikipedia pages.

The rest of the paper is structured as follows. In section 2 we discuss related work. In section 3 we present our approach to keyword extraction. Then we describe the experiment we conducted and discuss the results (section 4). In section 5 we sum up our contribution and discuss future work.

RELATED WORK
ATR algorithms are often used to retrieve keywords from documents in vast document collections. Following the measure used to rank candidate terms, we can divide ATR algorithms into two groups [8, 14]: termhood algorithms and unithood algorithms. Termhood algorithms try to determine the degree of domain specificity of a linguistic unit (a candidate term) and are based on term frequency in a corpus, e.g., by introducing the probability of occurrence for every candidate term (Residual Inverse Document Frequency, RIDF) [10] or by assuming that a candidate term will occur more often in a domain-specific collection of documents than in the rest of a corpus (Weirdness) [1].
In contrast to termhood algorithms, unithood algorithms measure the strength of collocation within terms, e.g., by investigating the mutual probability of occurrence of the words in a candidate term (Pointwise Mutual Information) [4]. There are also approaches that combine both of the aforementioned types of measures, e.g., the C-value [6], GlossEx [9] and TermEx [13] algorithms.
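As an illustration of the unithood idea, the following is a minimal sketch of Pointwise Mutual Information for a two-word candidate term; the corpus counts are hypothetical, not taken from any of the cited works:

```python
import math

def pmi(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """Pointwise mutual information of a two-word collocation.

    count_xy       -- occurrences of the bigram "x y" in the corpus
    count_x/count_y -- occurrences of the individual words
    total          -- total number of tokens, used to estimate probabilities
    """
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts for the candidate term "mutual information":
# a strongly positive PMI indicates the words co-occur far more often
# than chance, i.e., high unithood.
print(pmi(count_xy=150, count_x=900, count_y=1200, total=1_000_000))
```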
The majority of the presented approaches to keyword extraction leverage the domain specificity of a term or consider its occurrence probability. Processing of domain-specific corpora is supported by a background corpus consisting of documents from heterogeneous sources. The precision of ATR algorithms differs according to the corpus used [14]. State-of-the-art approaches use ATR algorithms to extract keywords, typically from textual document collections, based on their linguistic and/or statistical features.

In addition to textual content, web pages contain relatively easily processable mark-up data that can be leveraged to improve the results of ATR algorithms in a web-based environment. In [5] the authors used HTML mark-up to automatically build topic maps for an arbitrarily chosen set of web pages. The authors of [11] are concerned with automatic annotation of HTML documents with semantic labels; they use HTML pages generated from templates and containing rich semantic data, with observations made on pages of The New York Times. Mark-up tags are not the only source of semantic information: anchor text also describes the content of the page it refers to. In [7], the author presents a study showing that 61 % of pages contain anchor text that is semantically meaningful and describes the content of the page it refers to. The content of some specific HTML tags is also used by search engines to identify keywords and to improve search results [12]. Although mark-up processing constitutes a promising source of improvement, we are not aware of works exploring its impact on ATR algorithms.

ATR-BASED KEYWORD EXTRACTION ENHANCED BY HTML TAG SEMANTICS
Keyword extraction using an ATR algorithm is domain specific: with different collections it yields different results. The extraction of keywords from web pages is even more specific, because web pages typically cover topics from various domains, and such diversity of topics usually translates into worse results. Keyword extraction from web pages could, however, benefit from other sources of information that the web offers. The challenge is to consider web mark-up and to make use of the emergent semantics that HTML tags represent. The advantages of ATR algorithms, such as the ability to extract the most relevant single- and multi-word terms by processing plain text, make them appropriate for textual content in the web environment.

In our approach we combine ATR algorithms with the processing of HTML tags. We aim to enhance the way the final weight of a candidate term is computed. We introduce a TagRel coefficient that modifies the weight of a term obtained by an ATR algorithm according to the relevance of the HTML tag enclosing the term. Our method for keyword extraction consists of the following steps:
1. Web structure preprocessing.
2. Term extraction.
3. Keyword selection.

In the first step we analyze the link structure of the examined web pages. When examining a particular page, we focus on the anchors of links pointing to that page from the rest of the pages, in order to extract terms describing the page. We either crawl the pages (in the case of a closed corpus such as a website) or leverage the existing indices of search engines (in the case of the open web, e.g., by using the 'link:' operator provided by Google).

Having analyzed the web structure, in the second step we extract terms (i.e., candidate keywords) and compute weights reflecting their significance for a document. The weight obtained by an ATR algorithm is modified by the TagRel bound to a tag enclosing the term:
\[ w_i' = w_i \times \mathrm{TagRel}_T \qquad (1) \]
where wi’ is improved weight of a term i, wi is weight of a term i obtained by a ATR algorithm and TagRelT represents relevance of a tag T that encloses term i (including nested tags). TagRel varies among different HTML tags. We consider tag importance as an indicator of how much a tag is important for a page. TagRel is also normalized with respect to the number of links present in a collection: TagRelT = 1 + (rT × h)
(2)
where r_T is the importance of an HTML tag T, expressed as a number from the interval [0, 1], and h is a handicap factor. The importance r_T denotes the probability that a term enclosed by tag T becomes a keyword for a given page. The factor h correlates negatively with the number of links in the collection; it can be viewed as a normalization factor for tag relevance. The result of keyword extraction depends on proper r_T values for the selected tags and on the h factor, which can also be a subject for further parameterization.

The parameter r_T is inspired by a user study [12] in which 72 search engine optimization experts participated in a survey on the most important elements that comprise search engine rankings. We derived our parameter from the on-page (keyword-specific) ranking factors. If there are several ranking factors for estimating the importance of an HTML tag (as in the case of headings), the r_T parameter is computed as a weighted average of the importances of all ranking factors:

\[ r_T = \frac{\sum_{i=1}^{k} e_i \, r_{iT}}{\sum_{i=1}^{k} e_i} \qquad (3) \]
where r_T is the average importance of a specific HTML tag T, e_i is the weight of the i-th ranking factor in the overall importance r_T, r_iT is the importance of the i-th ranking factor for the HTML tag T, and k is the number of different ranking factors applied to the specific HTML tag T.

To clarify the computation, we describe the computation of TagRel for the heading tags (h1 to h6). We consider three different ranking factors for the heading tags known from search result optimization [12]:
- A keyword used anywhere in the h1 tag has importance 49 %.
- A keyword used as the first word in the h1 tag has importance 45 %.
- A keyword used in the other heading tags (h2 to h6) has importance 35 %.

All ranking factors have weight e_i = 1, so the overall importance r is computed as follows:

\[ r = \frac{e_1 r_1 + e_2 r_2 + e_3 r_3}{e_1 + e_2 + e_3} = \frac{1 \cdot 0.49 + 1 \cdot 0.45 + 1 \cdot 0.35}{3} = 0.43 \qquad (4) \]
Let us consider the handicap factor to be h = 0.01 in this example (details on the handicap factor are given in the following section). The final TagRel is computed as follows:

\[ \mathrm{TagRel}_{h1\text{-}h6} = 1 + (r_{h1\text{-}h6} \times h) = 1 + (0.43 \times 0.01) = 1.0043 \qquad (5) \]
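For concreteness, the computation in equations (3) to (5) can be reproduced in a few lines of code; the ranking-factor weights and the handicap value follow the worked example above, while the function and variable names are our own:

```python
def tag_importance(factors):
    """Weighted average of ranking-factor importances, equation (3).

    factors -- list of (weight e_i, importance r_iT) pairs
    """
    return sum(e * r for e, r in factors) / sum(e for e, _ in factors)

def tag_rel(r_t: float, h: float) -> float:
    """Tag relevance, equation (2): TagRel_T = 1 + r_T * h."""
    return 1.0 + r_t * h

# Worked example for the heading tags h1-h6, equations (4) and (5):
heading_factors = [(1, 0.49), (1, 0.45), (1, 0.35)]
r_heading = tag_importance(heading_factors)   # 0.43
print(tag_rel(r_heading, h=0.01))             # 1.0043
```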
In the third step of the proposed method we select the most relevant terms to become the keywords of a page. We select the top k % of terms, where k depends on the overall number of terms (candidate keywords).
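The following is a minimal sketch of steps 2 and 3, under one possible reading of equation (1) in which a term's ATR weight is multiplied once per relevant enclosing tag; all names and input values are illustrative, not taken from the experiment:

```python
def modify_weights(atr_weights, term_tags, tag_rels):
    """Equation (1): multiply a term's ATR weight by the TagRel of each
    relevant tag that encloses the term."""
    improved = dict(atr_weights)
    for term, tags in term_tags.items():
        if term in improved:
            for tag in tags:
                improved[term] *= tag_rels.get(tag, 1.0)
    return improved

def select_keywords(weights, top_percent):
    """Step 3: keep the top k % of candidate terms by weight."""
    k = max(1, round(len(weights) * top_percent / 100))
    ranked = sorted(weights, key=weights.get, reverse=True)
    return ranked[:k]

# Illustrative input: ATR weights and the tags each term occurred in.
atr_weights = {"walrus": 4.2, "tusk": 2.9, "arctic": 1.7, "the": 0.1}
term_tags = {"walrus": ["title", "h1", "a"], "tusk": ["h2"]}
tag_rels = {"title": 1.0066, "a": 1.0047, "h1": 1.0043, "h2": 1.0043}

improved = modify_weights(atr_weights, term_tags, tag_rels)
print(select_keywords(improved, top_percent=25))  # ['walrus']
```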
EXPERIMENTAL EVALUATION
In order to evaluate the proposed approach, we conducted an experiment on a small set of Wikipedia pages. The implementation of the weight computation is based on the JATR Java library¹. We selected five ATR algorithms it contains: C-value, GlossEx, TermEx, TF-IDF and Weirdness, because each is based on a different approach to keyword extraction. We wanted to find out whether the proposed method is able to improve the results of the algorithms regardless of the approach they are based on.

We selected articles about seven animals (ant, cat, cow, leopard, penguin, swallow and walrus). We crawled the web pages from the English version of Wikipedia and extracted the plain text, which was processed by the ATR algorithms to produce weighted keywords. From the HTML source code we extracted the text of the title tag and of all headings. We also used the Wikipedia link library² to discover pages linking to the articles about the seven animals; from those linking pages we extracted the text inside the anchor tags.
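For illustration, this tag extraction can be done along the following lines; this is a sketch using the BeautifulSoup library, and the function names and example URLs are ours, not part of the original experimental setup:

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

def page_tag_texts(url: str) -> dict:
    """Text of the title tag and of all heading tags on a page."""
    soup = BeautifulSoup(urlopen(url).read(), "html.parser")
    return {
        "title": soup.title.get_text(strip=True) if soup.title else "",
        "headings": [h.get_text(strip=True)
                     for h in soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])],
    }

def anchor_texts_pointing_to(linking_url: str, target: str) -> list:
    """Anchor texts on a linking page whose links point at the target article."""
    soup = BeautifulSoup(urlopen(linking_url).read(), "html.parser")
    return [a.get_text(strip=True)
            for a in soup.find_all("a", href=True)
            if target in a["href"]]

print(page_tag_texts("https://en.wikipedia.org/wiki/Walrus")["title"])
print(anchor_texts_pointing_to("https://en.wikipedia.org/wiki/Pinniped",
                               "/wiki/Walrus"))
```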
TagRel for all three tags was computed using equation (2). We obtained the following tag relevance values: TagRel for the title tag was 1.0066, TagRel for the anchor tag was 1.0047, and TagRel for the heading tags was 1.0043. The handicap factor used in the evaluation was h = 0.01, as in example (5). We set the value experimentally with regard to the number of articles containing links to the examined pages (ranging from 372 for the walrus article to 1054 for the ant article); the more such articles there are, the smaller h should be. The overlap between the terms extracted from anchor texts and the candidate terms extracted from a page was so large that, without the handicap, a term's weight would simply overflow after many multiplications.

After computing TagRel for all three tags, we computed the term weights using the five ATR algorithms. We then computed modified weights for the terms contained within the examined tags according to equation (1). We thus obtained two sets of candidate terms. From both, the original keyword candidates and the improved keyword candidates, we selected the top 1 %, which was presented to 15 respondents. The respondents had to choose which keywords were relevant to the articles they were extracted from. The respondents did not know whether a proposed keyword had an improved weight or which ATR algorithm produced it. There were two possible answers: "Yes, the proposed keyword is relevant to the article" and "No, the proposed keyword is not relevant to the article".

We gathered the respondents' opinions on the extracted keywords. Conventional metrics like precision and recall could not be used, because we had no "gold standard" for our collection of extracted keywords. Therefore, we introduce a measure of relevance, which should be interpreted as the degree of a keyword's relatedness to the article from which it was extracted. For every extracted keyword we compute the relevance as follows:

\[ R(t) = \frac{\mathrm{Cnt}_y(t)}{\mathrm{Cnt}_y(t) + \mathrm{Cnt}_n(t) + \mathrm{Cnt}_e(t)} \qquad (6) \]
where R(t) is the relevance of keyword t, Cnt_y(t) is the number of respondents who marked keyword t as relevant to the article it was extracted from, Cnt_n(t) is the number of respondents who marked keyword t as not relevant to that article, and Cnt_e(t) is the number of respondents who did not judge keyword t.
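A direct transcription of equation (6), with hypothetical vote counts for illustration:

```python
def relevance(yes: int, no: int, empty: int) -> float:
    """Equation (6): fraction of all respondents who judged the keyword
    relevant; respondents who skipped the keyword count against it."""
    return yes / (yes + no + empty)

# Hypothetical votes from 15 respondents for one extracted keyword:
print(relevance(yes=9, no=4, empty=2))  # 0.6

# Average relevance of an ATR algorithm over its evaluated keywords:
votes = [(9, 4, 2), (12, 3, 0), (5, 7, 3)]
avg = sum(relevance(*v) for v in votes) / len(votes)
print(round(avg, 3))
```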
¹ The JATR library contains implementations of the following algorithms: TF, TF-IDF, Weirdness, GlossEx, TermEx, C-value. It is available at: http://www.dcs.shef.ac.uk/~ziqizhang/resources/tools/jatr_v1.0.zip
² http://users.on.net/~henry/home/wikipedia.htm
The final relevance for every ATR algorithm was computed as the arithmetic average of the relevancies of the keywords evaluated by users. Users evaluated from 22 to 88 keywords per article, depending on the article's length. The relevance of the ATR algorithms before and after modification by TagRel, together with the average improvement of the ATR algorithms, is shown in Table 1. For three algorithms (GlossEx, TermEx, Weirdness) we observed an improvement of the average relevance by more than 30 %.

Table 1. Average relevancies of keywords extracted by the selected ATR algorithms, without and with TagRel, per article (Ant, Cat, Cow, Leopard, Penguin, Swallow, Walrus). The average improvements were: C-value 0.172, GlossEx 0.313, TermEx 0.338, TF-IDF 0.223, Weirdness 0.333.
The results show that the modification of the ATR algorithms using TagRel improves the relevance of the extracted keywords. The relevance was never worse when TagRel was applied. The analysis of the results revealed that the anchor tag was the most significant tag in changing the original weights: the number of potential keywords acquired from anchor tags was larger than the number acquired from the title or heading tags. This leads us to the idea of improving the parameterization of TagRel according to variables characterizing the document set, e.g., the number of HTML tags extracted or the count of words inside a particular tag.

CONCLUSIONS AND FUTURE WORK
The aim of our work is to acquire more relevant keywords from heterogeneous domains such as the World Wide Web by incorporating ATR algorithms, typically used for domain-specific "offline" collections. In this paper we presented a method for keyword extraction from websites that enhances state-of-the-art ATR algorithms by considering the semantic potential of the title tag, the heading tags and the anchor text of external links. In an experiment we showed that modifying the extracted term weights with the introduced TagRel achieves promising improvements: the relevance of the extracted keywords increased by over 30 % for three of the ATR algorithms used. This result indicates that keyword extraction in a web-based environment can benefit from HTML tag processing. We found that the most important factor changing the weights of keywords extracted by ATR algorithms is the number of external links.

In our ongoing research we tackle TagRel parameterization via tag importance and the handicap factor. We believe the value of TagRel for a specific HTML tag could also depend on the number of words inside the tag or on the number of occurrences of a particular tag in the examined corpora. It should be noted that the experimental domain of Wikipedia has links and page structure maintained to a much better standard than other pages on the Web. We plan to extend the experiment to a random set of web pages selected from the "wild web" to find out how suitable our combined approach is for keyword extraction in any domain.

Acknowledgements. This work was partially supported by the grants VG1/0675/11, KEGA 345-032STU-4/2010 and APVV-0208-10, and it is a partial result of the Research &
Development Operational Programme project "Research of methods for acquisition, analysis and personalized conveying of information and knowledge" (ITMS 26240220039), co-funded by the ERDF.

REFERENCES
[1] Ahmad, K., Gillam, L., Tostevin, L. University of Surrey participation in TREC 8: Weirdness indexing for logical document extrapolation and retrieval (WILDER). In Text Retrieval Conference, TREC 1999 (1999).
[2] Barla, M., Bieliková, M. Ordinary Web Pages as a Source for Metadata Acquisition for Open Corpus User Modeling. In White, B., Isaías, P., Andone, D. (Eds.): WWW/Internet 2010, IADIS Press, pp. 227–233 (2010).
[3] Barla, M. Towards Social-based User Modeling and Personalization. Information Sciences and Technologies Bulletin of the ACM Slovakia, Vol. 3, No. 1, pp. 52–60 (2011).
[4] Church, K. W., Hanks, P. Word association norms, mutual information, and lexicography. Computational Linguistics, MIT Press, 16(1), pp. 22–29 (1991).
[5] Dicheva, D., Dichev, C. Helping Courseware Authors to Build Ontologies: the Case of TM4L. In Proc. of the Conf. on Artificial Intelligence in Education: Building Technology Rich Learning Contexts That Work, IOS Press, pp. 77–84 (2007).
[6] Frantzi, K. T., Ananiadou, S., Mima, H. Automatic recognition of multi-word terms: the C-value/NC-value method. Int. Journal on Digital Libraries, 3(2), Springer, pp. 115–130 (2000).
[7] Hodgson, J. Do HTML Tags Flag Semantic Content? IEEE Internet Computing, 5(1), pp. 20–25 (2001).
[8] Knoth, P., Schmidt, M., Smrž, P., Zdráhal, Z. Towards a Framework for Comparing Automatic Term Recognition Methods. In Znalosti 2009, pp. 83–94 (2009).
[9] Kozakov, L., Park, Y., Fin, T., Drissi, Y., Doganata, Y., Cofino, T. Glossary extraction and utilization in the information search and delivery system for IBM Technical Support. IBM Systems Journal, IBM Corp., 43(3), pp. 546–563 (2004).
[10] Manning, C. D., Schütze, H. Foundations of Statistical Natural Language Processing. MIT Press (1999).
[11] Mukherjee, S., Yang, G., Ramakrishnan, I. V. Automatic Annotation of Content-Rich HTML Documents: Structural and Semantic Analysis. In The Semantic Web – ISWC 2003, pp. 533–549 (2003).
[12] SEOmoz. Search engine ranking factors [online; accessed 2010-03-31] (2009). Available at: http://www.seomoz.org/article/search-rankingfactors#overview
[13] Sclano, F., Velardi, P. TermExtractor: a Web Application to Learn the Shared Terminology of Emergent Web Communities. In Enterprise Interoperability II, pp. 287–290 (2007).
[14] Zhang, Z., Iria, J., Brewster, Ch., Ciravegna, F. A Comparative Evaluation of Term Recognition Algorithms. In Proc. of the 6th Int. Conf. on Language Resources and Evaluation, LREC 2008 (2008).

ABOUT THE AUTHORS
Bc. Milan Lučanský, Faculty of Informatics and Information Technologies, Slovak University of Technology, Ilkovičova 3, Bratislava, Slovakia, E-mail: [email protected]
Ing. Marián Šimko, Faculty of Informatics and Information Technologies, Slovak University of Technology, Ilkovičova 3, Bratislava, Slovakia, E-mail: [email protected]
Prof. Mária Bieliková, Faculty of Informatics and Information Technologies, Slovak University of Technology, Ilkovičova 3, Bratislava, Slovakia, E-mail: [email protected]