Machine Learning for Text Indexing: Concept Extraction, Keyword Extraction and Tag Recommendation
submitted by Master of Science Hendri Murfi from Jakarta, Indonesia
Dissertation approved by Faculty IV - Electrical Engineering and Computer Science of Technische Universität Berlin for the award of the academic degree Doktor der Naturwissenschaften (Dr. rer. nat.)
Doctoral committee: Chair: Prof. Dr. rer. nat. Volker Markl. Reviewers: Prof. Dr. rer. nat. Klaus Obermayer, Prof. Dr.-Ing. Sahin Albayrak. Date of the scientific defense: 31 August 2010
Berlin 2010 D 83
Abstract

Full-text indexing has some drawbacks, mainly semantic issues such as synonymy and polysemy, that have led people to consider alternative approaches for improving its performance. The alternative approaches include latent semantic indexing, keyword indexing, social indexing (web 2.0) and linked data-based indexing (semantic web). The aim of this dissertation is to investigate the applications of machine learning methods for these alternative approaches. The application areas are concept extraction, keyword extraction and tag recommendation.

Firstly, we propose a new learning method called the two-level learning hierarchy (TLLH) to extract concepts from tagged textual contents. This learning method processes the two available textual sources, i.e. the user-created tags and the textual contents, separately. At the lower level, concepts and concept-document relationships are discovered by a non-negative matrix factorization (NMF) algorithm based on the user-created tags. Having these relationships, the concepts are populated by terms existing in the textual contents at the higher level. We expect this method to be successful because the hidden document structures are discovered based on tags collectively created by users who understand the semantic content of the documents. Another advantage is that the NMF algorithm operates on a more compact and cleaner data representation. Furthermore, concept extraction from the textual contents is handled by a non-negative least squares (NNLS) algorithm, which is much more efficient than the NMF algorithm. Moreover, the TLLH approach may have richer vocabularies because it can combine vocabularies from the user-created tags and the textual contents. Therefore, this approach is not only more reliable but also more efficient than the standard one-level learning hierarchy (OLLH), which extracts concepts only from the textual contents.

Next, we apply the extracted concepts to a keyword extraction method. In other words, we propose a new keyword extraction method called concept-based keyword extraction (CBKE). Its basic idea is that a term in a document is important if the term is associated with important concepts of the document and is itself important in the document. Flexibility regarding the characteristics of the learning data is one of the advantages of the method: it can operate on learning data either with or without manually assigned keywords.

Finally, we apply our proposed CBKE method to content-based tag recommendation in folksonomy. The results show that the tag recommendations achieve competitive performance in the ECML PKDD Discovery Challenge 2009.
Zusammenfassung

Due to several drawbacks, above all semantic issues such as synonymy and polysemy, various approaches are being considered to improve the performance of full-text indexing. The alternative approaches include latent semantic indexing, keyword indexing, social indexing (Web 2.0) and linked data-based indexing (Semantic Web). The aim of this dissertation is to investigate machine learning methods for these alternative approaches. The application areas are concept extraction, keyword extraction and tag recommendation.

First, a new learning method is presented that extracts concepts from textual contents accompanied by user-created tags. The learning consists of two levels, which process the two kinds of textual sources separately. At the lower level, concepts and concept-document relationships are discovered from the user-created tags by non-negative matrix factorization (NMF). Based on these relationships, the concepts are populated with words from the textual contents at the higher level. This method is expected to be successful because the hidden document structures are based on tags created by users who understand the semantic contents of the documents. A further advantage of this approach is that NMF leads to a compact and clean document representation. Moreover, concept extraction from the textual contents by the method of non-negative least squares (NNLS) is much more efficient than by NMF. This two-level learning hierarchy (TLLH) is therefore not only more reliable but also more efficient than the one-level learning hierarchy (OLLH), which extracts concepts only from the textual contents. In addition, the method can have a richer vocabulary, because terms from the user-created tags are combined with those of the textual contents.

Next, we apply the extracted concepts to keyword extraction. In other words, we present a new keyword extraction method called concept-based keyword extraction (CBKE). The basic idea of the method is that a term of a document is important if it is associated with important concepts of the document and is itself important for the document. Flexibility with respect to the characteristics of the learning data is one advantage of the method: it can operate on training data either with or without manually assigned keywords.

Finally, CBKE is applied to content-based tag recommendation in folksonomies. The results show that the tag recommendations achieve competitive performance in the ECML PKDD Discovery Challenge 2009.
Acknowledgment

I would like to thank my adviser Prof. Klaus Obermayer for his great support and for giving me the opportunity to conduct my research in the multidisciplinary environment of the Neural Information Processing (NI) group of Technische Universität Berlin. He gave me the opportunity to work in the DFG-funded Advance Learning Framework (ALF) project, which provided the basis for my research. I would also like to thank Prof. Sahin Albayrak from the DAI-Labor of Technische Universität Berlin for his support for my scholarship.

Also, I would like to thank my colleague and roommate, Nicolas Neubauer, for the valuable discussions and fruitful collaboration. His suggestions enriched and improved the quality of this work, and I learned a lot from him. I would also like to thank André Paus, who introduced me to the ALF project. Other members of the project, Dr.-Ing. Dragan Milosevic and Christian Schee, gave much valuable feedback on my work throughout the project period. Thanks also to all the NI group members for their cooperation, help and understanding.

Special thanks go to my parents for their constant support and encouragement throughout my studies. I would also like to express my special thanks to my wife, Munaya Fauziah, who has always been beside me throughout my studies. I would like to express my gratitude to the German Academic Exchange Service (DAAD) for funding the years of my research and thus providing me with the opportunity to finish my doctoral studies in Germany. Finally, many other people have given great support and help relating to my study or my stay in Berlin; a big thank you to all of them, even though I do not mention them here by name.
Contents

Abstract
Zusammenfassung
Acknowledgment
Contents

1 Introduction

2 Machine Learning for Text Indexing
  2.1 Information Retrieval
    2.1.1 Text Indexing
    2.1.2 Text Ranking
    2.1.3 Semantic Issues
  2.2 Concept Extraction
    2.2.1 Latent Semantic Analysis
    2.2.2 Latent Semantic Indexing
  2.3 Keyword Extraction
    2.3.1 Candidate Selection
    2.3.2 Filtering
    2.3.3 Keyword Indexing
  2.4 Tag Recommendation
    2.4.1 Folksonomy
    2.4.2 Social Indexing Limitations
    2.4.3 Purposes
    2.4.4 Methods
    2.4.5 Social Linked Data-based Indexing

3 NMF-Based Soft Clustering for Optimizing Concept Indexing
  3.1 Introduction
  3.2 NMF-Based Concept Extraction
    3.2.1 NMF Formulation
    3.2.2 NMF Interpretation
    3.2.3 NMF Algorithms
  3.3 NMF-Based Concept Indexing Methods
    3.3.1 Standard Method
    3.3.2 Cluster-Based Method
  3.4 Soft Clustering-Based Concept Indexing Method
  3.5 Relevance Feedback
  3.6 Experiments
  3.7 Summary

4 A Two-Level Learning Hierarchy Method for Concept Extraction
  4.1 Introduction
  4.2 One-Level Learning Hierarchy
  4.3 Two-Level Learning Hierarchy
    4.3.1 TLLH Algorithm
    4.3.2 NNLS Algorithms
    4.3.3 Advantages
  4.4 Applications

5 Concept-Based Keyword Extraction
  5.1 Introduction
  5.2 CBKE Algorithms
    5.2.1 Basic Idea
    5.2.2 One-Level Learning Hierarchy
    5.2.3 Two-Level Learning Hierarchy
  5.3 Advantages
  5.4 Applications

6 An Application to Tag Recommendation in Folksonomy
  6.1 Introduction
  6.2 Hybrid Tag Recommendation
  6.3 Experiment
    6.3.1 Test Collection
    6.3.2 Measures
    6.3.3 Software
    6.3.4 Settings
  6.4 Results
    6.4.1 Single-Concept vs. Multi-Concept Method
    6.4.2 One-Level vs. Two-Level Learning Hierarchy
    6.4.3 CBKE vs. Standard Keyword Extraction Methods
    6.4.4 Hybrid vs. Individual Components
    6.4.5 Performance in ECML PKDD Discovery Challenge 2009
  6.5 Summary

List of Figures
List of Tables
Bibliography
1 Introduction

Information retrieval is the finding of material of an unstructured nature that satisfies an information need, usually in the form of a query, from within large collections [52, 100]. The material can be either textual content, such as documents and web pages, or multimedia content, such as images, videos and music. Text retrieval is a critical area of study today, as it is the fundamental basis of all Internet search engines. Generally, text retrieval consists of two main stages: text indexing and text ranking.

Today most search systems that deal with textual collections use full-text indexing, i.e., indexing that uses mostly the terms occurring in the document collection. However, this indexing method inherits problems that exist as a result of the necessarily imperfect, yet natural and evolving, process of creating semantic relationships between words and their referents. Two of these problems are polysemy and synonymy. Synonymy, or multiple words having the same or closely related meaning, presents a greater problem because people often use a diversity of terms to describe the same objects. A survey conducted by Furnas et al. supports this [20]: it found that any two people use the same term for a single well-known object less than 20% of the time. Due to this problem, some relevant documents may not be retrieved. Polysemy is the same term having more than one meaning. A word is judged to be polysemous if it has many senses whose meanings are related. Thus, polysemy dilutes query results by returning related but potentially inapplicable items, leading to the retrieval of irrelevant documents.

These drawbacks have prompted researchers to consider approaches to improve the performance of the full-text indexing method. The approaches in contemporary use involve either the exploitation of the semantic contents of documents or the selection of a small set of terms as keywords of documents, which are expected to describe the documents and enable users to search more precisely. Alternative approaches include latent semantic indexing [17, 7, 6], keyword indexing [41, 58], social indexing (web 2.0) [56, 51, 84], and linked data-based indexing (semantic web) [3, 81, 8, 9]. The aim of this dissertation is to investigate the applications of machine learning methods for the alternative approaches. The application areas involved are concept extraction, keyword extraction and tag recommendation.

Concept Extraction. Concept extraction is an activity that results in the extraction of concepts from textual contents. The concepts provide powerful insights into the meaning, provenance and similarity of textual contents. In textual contents, a concept is typically associated with a word. However, concept extraction is a very difficult problem because the mappings of words to concepts are often ambiguous. Typically, each word in a given language relates to several possible concepts. Humans use context to disambiguate the various meanings of a given piece of text, where available. However, machine translation systems cannot easily infer context. Therefore, many researchers are trying to address the concept extraction problem. We consider an approach assuming that a set of related terms may describe specific concepts; the concept extraction method then becomes a method to find the related terms from document collections. There are several proposed methods for this task, for example, clustering [29, 47, 48, 72] and latent semantic analysis (LSA) [17, 7, 6]. Using singular value decomposition (SVD) to reduce the dimension of the term-document matrix, LSA exploits relationships between terms and hidden concepts in documents. See Chapter 2.2 for more details.

Keyword Extraction. Keyword extraction is the task of automatically selecting a small set of important, topical terms within the content of a document. That the keywords are extracted means that the selected terms are present in the document [41]. This is different from keyword generation, in which the candidate keywords may contain terms from outside the document. Machine learning is one of the proposed methods for this task. The common machine learning method used for keyword extraction is supervised learning, in which prediction models are constructed from documents with known keywords based on some features of the terms in the documents [90, 58]. See Chapter 2.3 for more details.

Tag Recommendation. Tag recommendation is a service of a social indexing system, also known as a collaborative tagging system, that assists users in storing or sharing resources by automatically recommending an appropriate set of tags for the resources. Aggregating the tags of many users creates a structure called a folksonomy, whose usual model is a 3-partite, 3-uniform hypergraph, where the nodes are users, tags and resources [40]. The service is a mediated suggestion system. As such, it does not apply the recommended tags automatically, but rather suggests a set of appropriate tags and allows the user to select the tags from the set they find most appropriate. The main goal of the service is to direct users towards consistency of the recommended tags. Moreover, tag recommendation can serve many purposes, such as consolidating vocabulary across the users, giving a second opinion as to what a resource is about and, importantly, increasing the success of searching because of the consistency [35]. See Chapter 2.4 for more details.

The main contributions of this dissertation relating to the above application areas are:

• Optimizing concept indexing using NMF-based soft clustering. Due to a semantic interpretation problem, non-negative matrix factorization (NMF) is used to extract concepts rather than SVD. Several ways exist to use these concepts to index documents. The standard approach consists of indexing the documents by all relevant concepts and returning documents of concepts related to a given query (multi-concept approach). This approach can be accelerated by indexing documents by their most similar concept and only returning documents associated with the concept a given query is most similar to (single-concept approach). The single-concept approach tremendously decreases the number of evaluated documents, thus speeding up retrieval. Since not all ignored documents are in fact irrelevant, however, retrieval performance is decreased as well. We show that by extending this single-concept approach using several significant concepts (soft clustering), we can improve the performance of the single-concept method while still reducing the number of considered documents to a reasonable size. See Chapter 3 for more details.
4
Introduction
• Proposing a new learning scheme for concept extraction, referred to as the two-level learning hierarchy method. This method aims to incorporate user-created tags, which are valuable information about the documents, in a proper way into the learning process. At the lower level, concepts and concept-document relationships are discovered using the user-created tags. Having these relationships, the concepts are populated by terms existing in the textual contents of the documents at the higher level. See Chapter 4 for more details.

• Introducing a keyword extraction method called the concept-based keyword extraction method. The basic idea is that a term in a document is important if it is associated with important concepts in the document and is itself important to the document. Moreover, we apply our proposed methods, i.e. the soft clustering method and the two-level learning hierarchy method, to implement this keyword extraction method. See Chapter 5 for more details.

• Applying and evaluating the proposed methods on content-based tag recommendations in folksonomy. In addition to user-created tags, textual contents linked to resources are considered as sources of candidate tags to improve tag recommender performance in folksonomy. Therefore, a hybrid tag recommender combining the two tag sources is a common approach in tag recommendation. The purpose of applying the user-created tags is to direct the standardization and consistency of the supplied tags, while the use of tags extracted from the textual contents is intended in particular as a means of overcoming the cold start problem. See Chapter 6 for more details.

The structure of this dissertation is outlined as follows:

• Chapter 2 provides the basics of the investigated methods and the application areas used in the subsequent chapters.

• In Chapter 3 we present our optimized NMF-based concept indexing approach. We first describe the NMF-based concept extraction method, including its formulation and interpretation. Next, some existing NMF-based concept indexing methods are outlined, followed by our proposed optimized indexing method. Implementation and performance improvement issues are also discussed. Finally, we simulate the performance of the indexing methods using standard test collections.
• Our proposed two-level learning hierarchy for concept extraction is described in Chapter 4. First, we describe the standard NMF-based concept extraction, called the one-level learning hierarchy approach. Following this are our proposed two-level learning hierarchy approach and its advantages. We also review some algorithms to implement the proposed approach and its applications.

• A keyword extraction method, which ranks the terms used in a document by their occurrences in the concepts of the document in addition to the standard weighting functions, is discussed in Chapter 5. In this chapter, the common keyword extraction algorithms, which are supervised methods, are first reviewed. Next, some aspects of our proposed method are introduced, that is, its basic ideas and its algorithms. We outline some advantages of the proposed method in the following section. Finally, some applications are reviewed in the last section.

• Chapter 6 describes an application of our concept-based keyword extraction to content-based tag recommendations in folksonomy. The other proposed approaches are also investigated in this application area to show their improvement over the baseline methods. The performance of the proposed tag recommendation in the ECML PKDD Discovery Challenge 2009 is presented in the last section.
2 Machine Learning for Text Indexing

Due to some drawbacks, mainly semantic issues such as synonymy and polysemy, various approaches have been considered for improving the performance of full-text indexing. Alternatives include latent semantic indexing, keyword indexing, social indexing (web 2.0) and linked data-based indexing (semantic web). In this chapter, we introduce the basics of full-text indexing and review the applications of machine learning methods for the alternative approaches. These application areas are concept extraction, keyword extraction and tag recommendation.
2.1 Information Retrieval
Information retrieval is finding material of an unstructured nature that satisfies an information need, usually expressed as a query, from within large collections [52, 100]. The material can be either textual content, such as documents and web pages, or multimedia content, such as images, videos and music. Information retrieval on text collections is also called text retrieval. In other words, text retrieval is a branch of information retrieval where the information is stored primarily in the form of text. Text retrieval is a critical area of study today, since it is the fundamental basis of all Internet search engines. Generally, text retrieval consists of two main stages: text indexing and text ranking. Both are discussed below.
2.1.1 Text Indexing
The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query. Without an index, the search engine would have to scan every document in the corpus, requiring considerable time and computing power. During indexing, documents are prepared for use by a search engine. This means processing the raw text collection into an easily accessible representation of texts, i.e. a term vector or bag of words. This transformation, from a document into the data representation, is the essence of document indexing. Transforming a document into indexed form includes: markup or format removal, tokenization, filtering, stemming and weighting. If there is no markup or formatting, as is frequently the case in databases that merely store text files and raw data, then this transformation involves only tokenization, filtering, stemming and weighting. On the Web, however, all five steps are used, especially since textual contents are created in different formats.

Tokenization. The purpose of tokenization is to break down a document into tokens (terms), consisting of small units of meaningful text. Tokenization can be done by simply using white spaces or some non-alphanumeric characters as delimiters. During this phase, all remaining text is parsed, lowercased and all punctuation removed. Strange alphanumeric characters are normally removed during tokenization. In some fields of application, e.g. in biomedical information retrieval, more sophisticated tokenization mechanisms are needed [36].

Filtering. Filtering is the process of deciding which terms should be used to represent the documents so that these can be used for:
1. Describing the document's content.
2. Discriminating the document from the other documents in the collection.

A term that will be effective in separating the relevant documents from the non-relevant ones is likely to be a term that appears in a small number of documents. This means that highly frequent terms are poor discriminators. For this reason, stopwords - words that are so common that they have no discriminating ability - are removed from text streams. The stopword library can be either generic or specific: a generic library is applied to all collections, while specific libraries are created for given collections. The occurrence threshold of words for generating a stopword library usually depends on the collection.

Stemming. In most cases, morphological variants of words have similar semantic interpretations and can be considered as equivalent for the purpose of text retrieval. For this reason, a number of so-called stemming algorithms, or stemmers, have been developed, which attempt to reduce a word to its stem or root form. For example, study, studies, studying and studied may all be represented as stud. A number of stemming algorithms have been developed so far, such as the Lovins stemmer [49], the Porter stemmer [69] and the Snowball stemmer [70]. Stemmers also reduce the dictionary size, that is, the number of distinct terms needed for representing textual contents in the form of term vectors. A smaller dictionary results in savings of storage space and processing time.

Weighting. To determine the importance of terms in a document, a weighting function is introduced. Each term is typically assigned a numerical score, usually incorporating features of the document and the overall document collection. Some well-known weighting functions in information retrieval are described in the following [52, 100]. Let a document d be a vector d = (d_1, d_2, ..., d_n), where d_j denotes the weighting of the j-th term in d and n is the total number of terms in the vocabulary. Let tf_j be the term frequency of term j in d, df_j the document frequency of term j, N the number of documents in the collection, dl the document length, avdl the average document length across the collection, and k_1 and b free parameters. The weighting of term j is calculated by d_j = f_j(d), where f_j(d) denotes a weighting function of the j-th term in d, e.g.:

1. The boolean weighting function. This determines whether a term exists or not in the document:

   f_j(d) = \begin{cases} 1 & \text{if term } j \text{ exists} \\ 0 & \text{if term } j \text{ does not exist} \end{cases}   (2.1)

2. The term frequency weighting function. This counts the number of times each term occurs in the document:

   f_j(d) = tf_j, \quad \forall j   (2.2)

3. The term frequency - inverse document frequency (TF-IDF) weighting function. This computes, for each term, the product of the term frequency and the inverse document frequency:

   f_j(d) = \frac{tf_j}{df_j}, \quad \forall j   (2.3)

4. The probabilistic BM25 weighting function [74]. For ad-hoc retrieval, and ignoring any repetition of terms in the query, this function can be simplified to:

   f_j(d) = \log\frac{N - df_j + 0.5}{df_j + 0.5} \cdot \frac{(k_1 + 1)\, tf_j}{k_1\left((1 - b) + b\,\frac{dl}{avdl}\right) + tf_j}, \quad \forall j   (2.4)
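As an illustration, the following Python sketch implements the four weighting functions of Equations 2.1-2.4 directly. The function names and the default parameter values (k_1 = 1.2, b = 0.75 are common BM25 choices, not values prescribed by this chapter) are assumptions for illustration only.

```python
import math

def boolean_weight(tf):
    """Boolean weighting, Eq. (2.1): 1 if the term occurs, else 0."""
    return 1.0 if tf > 0 else 0.0

def tf_weight(tf):
    """Term frequency weighting, Eq. (2.2)."""
    return float(tf)

def tfidf_weight(tf, df):
    """TF-IDF weighting, Eq. (2.3): term frequency over document frequency."""
    return tf / df

def bm25_weight(tf, df, N, dl, avdl, k1=1.2, b=0.75):
    """Simplified BM25 weighting, Eq. (2.4)."""
    idf = math.log((N - df + 0.5) / (df + 0.5))
    return idf * ((k1 + 1) * tf) / (k1 * ((1 - b) + b * dl / avdl) + tf)
```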
2.1.2 Text Ranking
The second stage of text retrieval is text ranking. The ranking is based on a score which shows how relevant a document is to a given query. In order to score a document against a query, most ranking functions define a term weighting function, which exploits term frequency and inverse document frequency, as well as other factors such as the document's length and collection statistics. The document score is then obtained by adding the document term weights of the terms matching the query. Let d be a document belonging to a collection, d_j the weighting of term j of document d, q a query vector and q_j the weighting of term j of query q. The document score is

   S(q, d) = \sum_j d_j \cdot q_j   (2.5)
This general dot product form covers many different ranking functions, including vector space models such as cosine [78], probabilistic models such as Okapi BM25 [74] and language models [68], etc. Moreover, document length normalization is usually used to improve the performance of the ranking methods [83].
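A minimal sketch of this dot-product scoring, assuming documents and queries are given as term-to-weight mappings; the document names and toy weights are invented for illustration:

```python
def score(doc_weights, query_weights):
    """Dot-product document score, Eq. (2.5): S(q, d) = sum_j d_j * q_j.
    Both arguments map term -> weight; only shared terms contribute."""
    return sum(w * query_weights[t] for t, w in doc_weights.items()
               if t in query_weights)

# Rank a tiny collection against a query.
docs = {"d1": {"nmf": 2.0, "concept": 1.0}, "d2": {"svd": 1.5, "concept": 0.5}}
query = {"concept": 1.0, "nmf": 1.0}
ranking = sorted(docs, key=lambda d: score(docs[d], query), reverse=True)
print(ranking)  # ['d1', 'd2']
```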
Popularity score. Besides the content score above, another method for scoring a document is the popularity score. This is the primary component of PageRank, Google's patented ranking system proposed by Google founders Larry Page and Sergey Brin [11]. The underlying idea is that a page is important if it is pointed to by other important pages; that is, the importance of a page is determined by summing the scores of all pages that point to it. If a page points to several other pages, its score is distributed proportionately.
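A small power-iteration sketch of this idea (not Google's actual implementation); the damping factor and the toy link graph are illustrative assumptions:

```python
import numpy as np

def pagerank(adj, damping=0.85, iters=50):
    """Power-iteration PageRank: a page's score is the damped sum of the
    scores of the pages linking to it, each distributed over its out-links."""
    n = adj.shape[0]
    out = adj.sum(axis=1, keepdims=True)
    out[out == 0] = 1                  # avoid division by zero for sink pages
    M = (adj / out).T                  # column-stochastic transition matrix
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - damping) / n + damping * M @ r
    return r

# Three pages: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
A = np.array([[0, 1, 1], [0, 0, 1], [1, 0, 0]], dtype=float)
print(pagerank(A))
```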
2.1.3 Semantic Issues
Today most search systems that deal with document collections use the full-text indexing method, that is, indexing that uses mostly the terms in the document collection (see Chapter 2.1.1). However, this indexing method has inherent problems arising from the necessarily imperfect, yet natural and evolving, process of creating semantic relations between words and their referents. Two of these problems are polysemy and synonymy.

Synonymy. Synonymy, multiple words having the same or closely related meaning, presents a greater problem because people often use a diversity of terms to describe the same objects. Synonymy is a significant problem because it is impossible to know how many items "out there" one would have liked one's query to have retrieved, but did not. For example, Furnas et al. [20] have shown that any two persons are likely to use the same term for a single well-known object less than 20% of the time. Due to this problem, some relevant documents may not be retrieved.

Polysemy. Polysemy is the same term having more than one meaning. A word is judged to be polysemous if it has many ("poly") senses ("semy") whose meanings are related. For example, a "window" may refer to a hole in the wall, or to the pane of glass that resides within it. Thus, polysemy dilutes query results by returning related but potentially inapplicable items.

Due to the drawbacks of the full-text indexing method, many turn to other approaches to improve text indexing performance. The main existing approaches either exploit the semantics of documents or select a small set of terms as document keywords, which are expected to describe the documents and enable users to search more precisely. These approaches include latent semantic indexing [17, 7, 6], keyword indexing [41, 58], social indexing (web 2.0) [56, 51, 84], and linked data-based indexing (semantic web) [3, 81, 8, 9].
In the following sections, we examine the applications of machine learning methods for the alternative approaches. The application areas are concept extraction [17, 7, 6], keyword extraction [19, 90, 33, 93] and tag recommendation [35, 86].
2.2 Concept Extraction
Concept extraction is an activity that results in the extraction of concepts from textual contents. The concepts provide powerful insights into the meaning, provenance and similarity of textual contents. However, it is a very difficult problem, due to the lack of a clear definition. In textual contents, a concept is typically associated with a word, and each word in a given language typically relates to several possible concepts. Humans use context to disambiguate the various meanings of a given piece of text, where available. However, machine translation systems cannot easily infer context. Therefore, many researchers are trying to address the concept extraction problem. We consider one approach, which assumes that a set of related terms may describe specific concepts. The concept extraction method thus becomes a method for finding the related terms. The common machine learning methods for this task rely on unsupervised learning, in which concepts are extracted from a training document collection without depending on the labels of the documents. These have been used to develop concept extraction methods, e.g. clustering [29, 47, 48, 72] and latent semantic analysis [17, 7, 6].
2.2.1 Latent Semantic Analysis
Latent semantic analysis (LSA) is one of the methods that can be used to extract concepts from a document collection. Using singular value decomposition (SVD) to reduce the dimension of the term-by-document matrix, the authors claim that the latent semantic relationship between documents and terms is exploited. This method transforms the document space, whose basis is a set of terms, into a new document space whose basis is a set of patterns of terms called concepts.

Let V be an m × n term-by-document matrix whose columns are document vectors. SVD decomposes V into three matrices:

   V = X Y Z^T   (2.6)

where X is the m × m orthogonal matrix whose columns define the left singular vectors of V, Z is the n × n orthogonal matrix whose columns define the right singular vectors of V, and Y is the m × n diagonal matrix containing the singular values \sigma_1 \ge \sigma_2 \ge ... \ge \sigma_{\min(m,n)} of V in order along its diagonal. This factorization exists for any matrix V, and methods for computing the SVD of both dense [24] and sparse [67] matrices are well documented. Recalling that the rank r of the matrix V is the number of non-zero diagonal elements of Y, the first r columns of X form a basis for the column space. Since a rank-k approximation to V, where k \le r, can be constructed by ignoring (or setting equal to zero) all but the first k rows of Y, we can define a rank-k approximation \tilde{V} to the matrix V by setting all but the k largest singular values of V equal to zero:

   \tilde{V} = \tilde{X} \tilde{Y} \tilde{Z}^T   (2.7)

where \tilde{X} and \tilde{Z} comprise the first k columns of X and Z, and \tilde{Y} is the k × k diagonal matrix containing the k largest singular values of Y. According to [7, 6], the error in approximating V by \tilde{V} is given by

   \|V - \tilde{V}\|_F = \min_{rank(B) \le k} \|V - B\|_F = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_r^2}   (2.8)

In other words, the error in approximating the original term-by-document matrix V by \tilde{V} is determined by the discarded singular values (\sigma_{k+1}, \sigma_{k+2}, \cdots, \sigma_r). Therefore, this approximation is the closest rank-k approximation to V. From the concept extraction point of view, each column of X corresponds to a set of related terms called a concept.
2.2.2 Latent Semantic Indexing
We can use these concepts to index each document in V_k as

   X_k^T V_k = X_k^T (X_k Y_k Z_k^T) = Y_k Z_k^T   (2.9)

In other words, the columns of Z scaled by the singular values of Y are the representations of the documents of V relative to the concepts in X. Moreover, the normalized distance (the cosine similarity) between the query vector q and the n documents of V_k can be calculated by

   \cos\theta_j = \frac{s_j^T (X_k^T q)}{\|s_j\|_2 \, \|X_k^T q\|_2}, \quad j = 1, 2, ..., n   (2.10)

where s_j = Y_k Z_k^T e_j is the scaled document vector in Equation 2.9 and e_j is the j-th canonical vector of dimension n (i.e. the j-th column of the n × n identity matrix I_n). Here, we also use X_k^T to transform q before the similarity score can be calculated.
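A sketch of this query scoring, reusing the matrices Xk, Yk, Zk from the SVD sketch above; the helper name lsi_scores is ours, not from the literature:

```python
import numpy as np

def lsi_scores(Xk, Yk, Zk, q):
    """Cosine similarities of Eq. (2.10) between a query and all documents,
    both represented in the k-dimensional concept space."""
    S = Yk @ Zk.T            # scaled document vectors s_j as columns, Eq. (2.9)
    qk = Xk.T @ q            # query transformed into the concept space
    norms = np.linalg.norm(S, axis=0) * np.linalg.norm(qk)
    return (S.T @ qk) / norms

# q is a term vector over the same vocabulary as V, e.g.:
# q = np.array([1.0, 0.0, 1.0]); print(lsi_scores(Xk, Yk, Zk, q))
```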
2.3 Keyword Extraction
Keyword extraction is the task of automatically selecting a small set of important, topical terms within the content of a document. That the keywords are extracted means that the selected terms are present in the document [41]. This contrasts with keyword generation, in which candidate keywords may contain terms from outside the document. A keyword may consist of one or several terms; keywords consisting of several terms are also known as keyphrases. In general, the task of automatically extracting keywords can be divided into two stages:

1. Selecting candidate terms in the document
2. Filtering the candidates, keeping the most significant ones to serve as keywords and rejecting those that are inappropriate
2.3.1 Candidate Selection
There are several proposed methods for selecting candidate terms from textual content, including:

1. The standard tokenization method. The purpose of tokenization is to break down a document into tokens (terms), which are small units of meaningful text. Tokenization can be done by simply using white spaces or some non-alphanumeric characters as delimiters.

2. The n-gram approach [90, 19, 33]. This approach selects candidate terms by extracting n-grams (phrases), where n equals one (uni), two (bi) or three (tri). The terms are converted to lowercase, and all terms beginning or ending with a stopword are removed. The remaining candidate terms are stemmed (a minimal sketch of this approach follows this list).

3. Linguistically oriented methods [33]. These use natural language processing (NLP) methods such as an NP-chunker and part-of-speech (PoS) tagging to extract candidate terms from the documents.
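The sketch below illustrates the n-gram candidate selection of item 2, with a tiny illustrative stopword list and without the stemming step:

```python
STOPWORDS = {"the", "a", "of", "in", "and", "to"}   # tiny illustrative list

def ngram_candidates(text, max_n=3):
    """Extract uni-, bi- and tri-gram candidates, lowercased, dropping any
    n-gram that begins or ends with a stopword."""
    tokens = [t for t in text.lower().split() if t.isalnum()]
    candidates = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] not in STOPWORDS and gram[-1] not in STOPWORDS:
                candidates.add(" ".join(gram))
    return candidates

print(ngram_candidates("keyword extraction of a document"))
# {'keyword', 'extraction', 'document', 'keyword extraction'}
```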
2.3.2 Filtering
Once the candidate terms have been extracted, the problem lies in how to restrict their number and keep only the ones that are most relevant. Statistical methods based on, e.g., frequency and co-occurrence are simple and practical methods for determining which candidate terms are keywords [15, 57, 2, 66]. These methods rank the candidate terms according to their scores, and the candidates with the highest scores are selected as keywords. Another filtering method is machine learning, where the ranking function is defined by a statistical model derived from training data [19, 93, 90, 33, 58].

Building the Model. The common machine learning methods for keyword extraction involve supervised learning, in which prediction models are constructed from documents with known keywords. A set of features of the candidate terms is used as input to generate an output classifying candidate terms as either keywords or non-keywords. Therefore, this keyword extraction can be viewed as a binary classification problem. In machine learning, a feature vector is called an example. Examples corresponding to manually assigned keywords are assigned the class positive, while all other examples are assigned the class negative. Various kinds of supervised learning can be used to solve this problem, such as the C4.5 decision tree [71] by Turney [90], rule induction by Hulth [33], and Naive Bayes [37] and SVM [80] by Medelyan et al. [58].

Feature Selection. The quality of the models constructed by supervised learning depends strongly on the selected features. Therefore, the next problem is how to define features that discriminate the candidate terms that are appropriate keywords from those that are inappropriate. The standard features are three domain-independent features proposed by Frank et al. [19], i.e. term frequency (TF), inverse document frequency (IDF) and position of first occurrence. The position of first occurrence of a term is calculated as its distance in words from the beginning of the document, normalized by the document's word count. Other proposed features extend these standard features, such as the PoS tag or tags assigned to a candidate term [33], the length of candidate terms [58], or the term degree in the thesaurus graph structure if the keywords are controlled by a thesaurus [58].

Extracting Keywords from New Documents. To extract keywords from a new document, its candidates and their feature values must first be determined. Then, the model built during training is applied to determine an overall score for each candidate keyword. Finally, the candidates with the greatest scores are selected as keywords. As an example, we consider the filtering method used in [58] with the following specifications: the method uses just two features, i.e. TF-IDF and position of first occurrence (PFO), and the model is constructed using the Naive Bayes method [37]. The probabilities that a candidate is a keyword or not are:

   P[yes] = \frac{Y + 1}{Y + N + 2} \, P_{TF-IDF}[TF-IDF \mid yes] \, P_{PFO}[PFO \mid yes]   (2.11)

   P[no] = \frac{N + 1}{Y + N + 2} \, P_{TF-IDF}[TF-IDF \mid no] \, P_{PFO}[PFO \mid no]   (2.12)

where Y is the number of positive instances in the training data, N is the number of negative instances in the training data, P_{TF-IDF}[] is the probability distribution function computed from the training data for the TF-IDF feature, and P_{PFO}[] is the analogous function for the position-of-first-occurrence feature. The overall probability that the candidate is a keyword is then:

   p = \frac{P[yes]}{P[yes] + P[no]}   (2.13)
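The three equations reduce to a few lines of code. In the sketch below, the class-conditional probabilities are assumed to have been evaluated beforehand from the training distributions; the function name and the example numbers are illustrative assumptions:

```python
def keyword_probability(p_tfidf_yes, p_pfo_yes, p_tfidf_no, p_pfo_no, Y, N):
    """Naive Bayes filtering of Eqs. (2.11)-(2.13). The four p_* arguments
    are the class-conditional densities of the TF-IDF and first-occurrence
    features, evaluated at the candidate's feature values."""
    p_yes = (Y + 1) / (Y + N + 2) * p_tfidf_yes * p_pfo_yes   # Eq. (2.11)
    p_no  = (N + 1) / (Y + N + 2) * p_tfidf_no  * p_pfo_no    # Eq. (2.12)
    return p_yes / (p_yes + p_no)                             # Eq. (2.13)

# A candidate whose features are 4x more likely under "keyword" than "not",
# with 50 positive and 950 negative training instances:
print(keyword_probability(0.4, 0.5, 0.1, 0.5, Y=50, N=950))
```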
2.3.3 Keyword Indexing
The keywords extracted from a document can be used directly to index the document. For controlled indexing, however, the candidates are first mapped to terms in a controlled vocabulary, and then their properties are analyzed to identify the most significant terms. For example, Tiun et al. map candidates from Web pages to categories in the Yahoo! Directory via synonyms obtained from WordNet [87]. Aronson et al. index medical documents with MeSH terms by mapping the candidates in the documents to concepts in the medical UMLS thesaurus [1]. Golub indexes documents with a vocabulary consisting of 800 descriptors and 22,000 non-descriptors [25].
2.4 Tag Recommendation
The phenomenon of web 2.0, which refers to the second generation of internet-based services emphasizing online collaboration and sharing among users, has led to the development of several tools which have succeeded in making this task more attractive to a broader audience [63]. One of them is social tagging [56, 51, 84]. Social tagging, also known as social indexing, allows ordinary users to collaboratively assign keywords, or tags, to resources. In contrast to traditional subject indexing, the tags are generated not only by experts, but also by creators and consumers of the resources. Usually, freely chosen tags are used instead of a controlled vocabulary. When creating the tags, users normally have a specific objective, such as sharing or labeling a resource so they can find it later.

Tagging systems can be distinguished according to what kind of resources they support. Wikis and weblogs are cooperative Web publishing tools. Flickr (http://flickr.com) allows the sharing of photos [92], and Last.fm (http://last.fm) the sharing of music listening habits. del.icio.us (http://del.icio.us), Furl (http://furl.net), reddit (http://reddit.com) and Digg (http://digg.net) are for bookmark sharing [28]. Technorati (http://technorati.com) allows weblog authors to tag their articles. Connotea (http://connotea.org), CiteULike (http://citeulike.org) and LibraryThing (http://librarything.com) allow users to manage and share bibliographic metadata on the web. BibSonomy (http://bibsonomy.org) allows users to share bookmarks and BibTeX-based publication entries simultaneously [31] (Figure 2.1).

Figure 2.1: BibSonomy - a social bookmark and publication sharing system
2.4.1 Folksonomy
Pulling all the user-created tags together in an automated way creates a folksonomy (Figure 2.2). A folksonomy, a blend of folk and taxonomy, is a user-generated taxonomy that emerges from social tagging. In a folksonomy, the relationships between tags are inferred from their usage patterns; the related tags are determined programmatically by the system. There are no formal relationships in a folksonomy, other than perhaps a "degree of relatedness", while other taxonomy systems define parent-child, or broader and narrower, relationships between terms or between the concepts referred to by terms.

Figure 2.2: Folksonomy [40]

The usual model of a folksonomy is a 3-partite, 3-uniform hypergraph, where the nodes are users, tags and resources [40]:

Definition 2.1. Let U, T, and R be finite sets, whose elements are called users, tags and resources. A folksonomy is a graph G = (V, E), where V = U ∪ T ∪ R is the set of nodes, and E = {{u, t, r} | (u, t, r) ∈ Y} is the set of hyperedges.

A formal definition of the folksonomy is given in [32]:

Definition 2.2. A folksonomy is a tuple F := (U, T, R, Y), where U, T, and R are finite sets whose elements are called users, tags and resources, and Y is a ternary relation between them, i.e., Y ⊆ U × T × R, called tag assignments.

Folksonomic tagging is intended to make the resources easier to discover and recover over time. Discovery enables users to find new content of interest shared by other users. This social indexing promises good index quality because it is done by human beings, who understand the content of the resource, as opposed to software, which algorithmically attempts to determine the meaning of a resource. Moreover, it operates with the collective human intelligence of the users as an index extractor. Recovery enables a user to recall content that was discovered before. It should be easier because the tags are both originated by, and familiar to, their primary users. As a result, social tagging has created a renewed level of interest in manual indexing [91], whose costs are reduced by social interaction among users.
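Definition 2.2 maps directly onto a simple data structure. The following sketch models the tag assignments Y as a set of triples; the users, tags and resources are invented for illustration:

```python
from collections import namedtuple

# A folksonomy F = (U, T, R, Y) per Definition 2.2: the tag assignments Y
# are a set of (user, tag, resource) triples; U, T and R are projections.
Assignment = namedtuple("Assignment", ["user", "tag", "resource"])

Y = {
    Assignment("alice", "nmf", "paper-1"),
    Assignment("alice", "clustering", "paper-1"),
    Assignment("bob", "nmf", "paper-2"),
}

U = {a.user for a in Y}
T = {a.tag for a in Y}
R = {a.resource for a in Y}

# Tags co-assigned with "nmf" hint at the "degree of relatedness" inferred
# from usage patterns.
related = {a.tag for a in Y if a.resource in
           {b.resource for b in Y if b.tag == "nmf"}} - {"nmf"}
print(related)  # {'clustering'}
```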
2.4.2 Social Indexing Limitations
A number of web sites now feature social tagging as one of their services and are rapidly gaining popularity. Besides covering a wide range of resources and communities that meet users' needs, they are also easy to use. Users do not need special knowledge to use them; they just put resources into the systems and freely assign tags that describe the resources from their own perspective. In that sense, one of the greatest strengths of social tagging systems, the fact that no predefined vocabulary is assumed, leads to a number of limitations and weaknesses concerning the use of the tags to retrieve content. Golder et al. [23] identify three major problems with current tagging systems:

1. Polysemy
2. Synonymy
3. Level variation

The first two are inherent problems due to the necessarily imperfect, yet natural and evolving, process of creating semantic relations between words and their referents. The third problem refers to the phenomenon of users tagging content at differing levels of abstraction, e.g. car and BMW. Other problems concern word forms: nouns in singular, nouns in plural, abbreviations and misspelled words. The lack of consistency among users in choosing tags for similar resources makes it impossible for a user to retrieve all the desired resources unless he/she knows all the possible variants of the tags that may have been used.
2.4.3 Purposes
A social tagging system usually has a service that assists users in the tagging process by automatically recommending an appropriate set of tags (Figure 2.3). The service is a mediated suggestion system; that is, it does not apply the recommended tags automatically, but rather suggests a set of appropriate tags and allows the user to select the tags they find appropriate.

Figure 2.3: Tag recommender service in delicious - a social bookmarking system

Moreover, tag recommendation can serve many purposes, such as [35]:

1. Directing users towards consistency of the tags
2. Consolidating the vocabulary across the users
3. Giving a second opinion as to what a resource is about
4. Increasing the success of searching because of the consistency
2.4.4 Methods
Most Popular Tags. In practice, the standard tag recommender in a folksonomy is a service that recommends the most popular tags used for either a particular resource or the whole system. Tag popularity is sometimes represented in the form of a tag cloud, where more popular tags are depicted in a larger font size than the other tags in the cloud (Figure 2.4).

Figure 2.4: Tag popularity in the form of a tag cloud from the Last.fm system

AutoTag [59] suggests tags for weblogs based on the tags associated with other similar weblogs in a given collection. The method uses information retrieval methods [52, 100] to find the most similar weblogs and then aggregates all the tags in these weblogs for ranking (Figure 2.5). Another related work that analyzes blogs and gives tag suggestions is TagAssist [85]. TagAssist improves on AutoTag by improving the quality of the suggested tags. This is done by introducing tag compression, aiming to reduce tags to their root form (stemming) [69], and case evaluation to filter and rank tag suggestions. Several metrics are used to evaluate the relative usefulness of tags, e.g.:

1. Frequency - the number of times a tag appears as an associated tag in the top results
2. Text occurrence - whether a tag appears in the new post
3. Tag count - the number of times a tag has been used in the corpus
4. Rank - the relative rank of the blog that contained the post that was assigned a tag
5. Cluster - whether any of the candidate tags are members of topically related clusters
Figure 2.5: AutoTag - tag recommendations in weblogs

Collaborative Recommendation. Jaeschke et al. [35] use an adaptation of collaborative filtering [79] for tag recommendation in folksonomies. The adaptation lies in reducing the three-dimensional folksonomy to a two-dimensional projection in order to apply the traditional collaborative filtering method. The projection preserves the user information and creates a user-tag matrix based on the occurrence or non-occurrence of tags with the users. Therefore, the k-neighborhood of a user can be computed by considering the tags as objects. For a given tag-by-user matrix X, a given user u, a given resource r, and integers k and n, the set T(u, r) of n recommended tags is calculated by:

   T(u, r) = \operatorname{argmax}^n_{t \in T} \sum_{v \in N_u^k} sim(X_u, X_v) \, \delta(v, t, r)   (2.14)

where N_u^k is the set of the k nearest neighbors of u in X, and \delta(v, t, r) = 1 if (v, t, r) \in folksonomy and 0 otherwise.

Multi-Label Classification. Multi-label classification is concerned with learning from a set of examples that are associated with a set of labels Y ⊆ L, where L is a set of disjoint existing labels [88]. Katakis et al. have cast tag recommendation as a multi-label classification problem, where the feature vector is a term weighting vector of the textual contents and the labels are the tags existing in the training document collection [38]. In order to reduce the dimensionality of the problem, they retain only terms and tags with a minimum frequency of appearance. They use the Binary Relevance (BR) method, which treats the prediction of each label as an independent binary classification problem. It learns one binary classifier for each label:

   C_l : D \rightarrow \{l, \neg l\}, \quad \forall l \in L   (2.15)

where D is a document collection. The BR method transforms the original data set into |L| data sets D_l that contain all examples of the original data set, labeled as l if the labels of the original example contained l and as \neg l otherwise. The base learner used with BR was a Naive Bayes classifier [37]. For the classification of a new document d, the method combines the outputs of the |L| classifiers:

   C_{BR}(d) = \bigcup_{l \in L} C_l(d)   (2.16)
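The collaborative scheme of Equation 2.14 can be sketched as follows, assuming the folksonomy is given as a set of (user, tag, resource) triples and X as a tag-by-user matrix with cosine similarity; the function and its signature are ours, not Jaeschke et al.'s implementation:

```python
import numpy as np

def recommend_tags(X, user_idx, folksonomy, u, r, k=2, n=3):
    """Eq. (2.14): score each tag t by summing sim(X_u, X_v) over the
    k nearest neighbors v of user u that assigned t to resource r."""
    def sim(a, b):
        va, vb = X[:, user_idx[a]], X[:, user_idx[b]]
        return va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb) + 1e-12)

    neighbors = sorted((v for v in user_idx if v != u),
                       key=lambda v: sim(u, v), reverse=True)[:k]
    scores = {}
    for v in neighbors:
        for (vu, t, vr) in folksonomy:
            if vu == v and vr == r:          # delta(v, t, r) = 1
                scores[t] = scores.get(t, 0.0) + sim(u, v)
    return sorted(scores, key=scores.get, reverse=True)[:n]
```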
Content-based methods. All of the above tag recommendation systems are based on the tags created by users and saved in the system. While AutoTag [59], TagAssist [85] and the multi-label classification method [38] depend on the content of the document, no new tag is suggested from the textual content. The drawback of these user-created-tag-based methods is that they often do not work well for new resources, that is, resources that have not yet been tagged in the system (the cold start problem). To overcome this problem, people have begun to consider other sources of candidate tags. One of them is the textual content associated with each resource that is available in the system. For example, Xu et al. [99] suggest tags auto-generated via content-based and context-based analysis. This not only solves the cold start problem, but also increases the tag quality for objects that are less popular. Tatu et al. [86] use natural language processing tools to extract important concepts (nouns, adjectives and named entities) from the textual contents. Some lexico-semantic resources, e.g. WordNet, are also used to stem the concepts and link synonyms. They conclude that understanding the textual contents improves the quality of the tag recommendations.
2.4.5 Social Linked Data-based Indexing
Tags extracted from a folksonomy can be recommended directly to users to index documents, as is mostly done by current collaborative tagging systems such as del.icio.us or BibSonomy. For controlled indexing, however, the candidates are first mapped to terms in a controlled vocabulary, and then their properties are analyzed to identify the most significant terms. For example, Faviki (http://www.faviki.com) is a social bookmarking system that allows tagging of bookmarks with Wikipedia-based identifiers to prevent ambiguities. A number of tools such as Zemanta (http://www.zemanta.com) and Sindice (http://www.sindice.com) allow users to tag content that they contribute to popular Web 2.0 services, such as del.icio.us and Flickr, using Linked Data identifiers such as those provided by DBpedia (http://www.dbpedia.org).
3 NMF-Based Soft Clustering for Optimizing Concept Indexing

Non-negative matrix factorization (NMF) transforms a document space into a latent semantic space which has concepts instead of terms as a basis. Several ways exist to use these concepts to index documents. The standard approach consists of indexing the documents by all relevant concepts and returning documents of concepts related to a given query (multi-concept approach). This approach can be accelerated by indexing documents by their most similar concept and only returning documents associated with the concept a given query is most similar to (single-concept approach). This approach tremendously decreases the number of evaluated documents, thus speeding up retrieval. Since not all ignored documents are in fact irrelevant, however, retrieval performance is decreased as well. We show that by extending this single-concept approach using several significant concepts (soft clustering), we can improve the performance of the single-concept approach while still reducing the number of considered documents to a reasonable size.
3.1 Introduction
Using singular value decomposition (SVD) to reduce the dimension of the term-by-document matrix, latent semantic analysis (LSA) exploits relationships between terms and hidden concepts in documents. However, this method has a disadvantage: what we would really like to say is that a concept is mostly concerned with some subset of terms, but the negative values produced by SVD make such a semantic interpretation difficult. To circumvent this problem, a new method which maintains the non-negative structure of the original documents has been proposed. The method uses non-negative matrix factorization (NMF) [43, 44] rather than SVD to extract the concepts.

Generally, two indexing methods are used with NMF: the standard method, which uses all relevant concepts of a document to index the document (multi-concept approach) [89], and a cluster-based method, which uses only the most relevant concept to index the document (single-concept approach) [82]. The cluster-based method only indexes and ranks documents which belong to the most relevant concept for a given query, achieving much better response times. Assuming that a document may contain several of the existing concepts, in this chapter we improve the cluster-based method by assigning the document to several significant concepts (soft clustering). This NMF-based soft clustering can improve the precision of the cluster-based method to become comparable to the standard method, while the number of documents to be ranked remains much smaller. Our simulations show that the optimized cluster-based method handles only about 4% of the documents for comparable precision. This means the approach decreases response time without much loss of retrieval quality.

In the following sections, we first describe NMF-based concept extraction, including its formulation and interpretation. Next, some existing NMF-based concept indexing methods are outlined. Our proposed optimized indexing method is presented in the subsequent section. Implementation and performance improvement issues are also discussed. Finally, we simulate the performance of the indexing methods using standard test collections.
3.2 NMF-Based Concept Extraction
Using non-negative matrix factorization (NMF) to reduce the dimension of the term-by-document matrix, the latent semantic relationship between documents and terms is exploited. This method transforms the document space, whose basis is a set of terms, into a new document space whose basis is a set of patterns of terms called concepts.

Figure 3.1: NMF Interpretation
3.2.1 NMF Formulation
Given a non-negative m × n term-by-document matrix V whose columns are document vectors, and a positive integer k < min(m, n), the NMF problem is to find a non-negative m × k matrix W and a non-negative k × n matrix H that minimize the following constrained optimization problem:

   \min_{W,H} f(W, H) \equiv \frac{1}{2} \|V - WH\|_F^2 \quad \text{subject to} \quad W_{ia} \ge 0, \; H_{aj} \ge 0, \; \forall i, a, j   (3.1)

where \|\cdot\|_F is the Frobenius norm [5]. An appropriate decision on the value of k is critical in practice, but the choice of k is very often problem dependent. In most cases, however, k is chosen such that k \ll \min(m, n).
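As an illustration of Problem 3.1, the following sketch applies multiplicative update rules in the style of [43, 44]; it is a generic NMF solver under these assumptions, not necessarily the exact algorithm evaluated later in this chapter:

```python
import numpy as np

def nmf(V, k, iters=200, eps=1e-9):
    """Multiplicative-update NMF for Eq. (3.1): alternately rescale W and H
    so that ||V - WH||_F^2 decreases while both factors stay non-negative."""
    m, n = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Columns of W are concepts (sets of related terms); H holds the
# concept-document relationships.
```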