Document Recommendation Using Data Compression

Procedia - Social and Behavioral Sciences 27 (2011) 150 – 159

Pacific Association For Computational Linguistics (PACLING 2011)

Document recommendation using data compression

Takafumi Suzuki a,*, Shin Hasegawa b, Takayuki Hamamoto c, Akiko Aizawa d

a Toyo University, Faculty of Sociology, 5-28-20 Hakusan, Bunkyo-ku, Tokyo, Japan
b Sony, 1-7-1 Konan, Minato-ku, Tokyo, Japan
c Tokyo University of Science, Faculty of Engineering, 1-14-6 Kudan-kita, Chiyoda-ku, Tokyo, Japan
d National Institute of Informatics, Digital Content and Media Sciences Research Division, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, Japan

Abstract

We propose a new method of content-based document recommendation using data compression. Previous studies mainly used bags of words to calculate the similarity between the profile and target documents, but users in fact focus on units larger than words when searching for information in documents. To take this point into consideration, we propose a method of document recommendation using data compression. Experimental results using Japanese newspaper corpora showed that (a) data compression performed better than the bag-of-words method, especially when the number of topics was large; (b) our new method outperformed the previous data compression method; and (c) a combination of data compression and bag-of-words can also improve performance. We conclude that our method better captures users' profiles and thus contributes to making a better document recommendation system.

© 2011 Published by Elsevier Ltd. Selection and/or peer-review under responsibility of the PACLING 2011 Organizing Committee.

Keywords: Data compression; document recommendation; LZ78; PRDC; document classification.


doi:10.1016/j.sbspro.2011.10.593


1. Introduction

Information technology is progressing at an extraordinary speed, and various new tools such as blogs and Twitter are available online. These new technologies show promise as means of making better use of a large amount of varied information, but in practice they tend to bury users in a mountain of information. It is therefore very important to select useful information from the large amount available, and recommendation systems are currently attracting attention from researchers and commercial interests. Recommendation systems can basically be classified into two types: content-based filtering and collaborative filtering. In content-based filtering, which we focus on in this study, the document recommendation algorithm plays an especially important role because document data is one of the best resources for estimating users' preferences.

In document recommendation, we need to calculate the similarity between profile and target documents. Many previous studies used the bag-of-words method for this task [1, 2, 3, 4], but recent studies have shown that users in fact focus on larger textual units when searching for information [5, 6]. Thus, we feel that methods focusing on units larger than words will be important for developing a better document recommendation system. Moreover, bag-of-words methods cannot return robust results when various topics are included in the profile, since the word-vector matrix becomes very large and sparse in such a case. Although semantic approaches [23] can be used for these tasks, they rely on words and fixed expressions to calculate similarity. As fragmented and ungrammatical texts such as blog posts and tweets are becoming increasingly important, alternative, non-semantic approaches that capture larger (or smaller) units will be useful.

Against this background, we propose a new method of document recommendation using data compression. Our method, based on the LZ78 algorithm, uses partially matching character sequences when calculating the similarity between profile and target documents. Thus, it can capture units larger than words in the profile and should be able to deal with a large number of topics in the profile. A previous data compression method using LZ78 uses each individual profile document as input, and thus has two problems: it requires considerable computational effort as the number of documents in the profile increases, and it cannot capture the characteristics of all documents in the profile. To solve these problems, we propose to use a united profile document as the input. Moreover, we suppose that a combination of compression and bag-of-words can inherit the good points of both methods, so we also examine the performance of such a combination. The compression-based method is target-independent; thus it can be used on various data such as music, literature, and images.

The rest of this paper is organized as follows. We describe our method in Section 2 and our experimental design in Section 3. We discuss the results of the experiment in Section 4 and review related work in Section 5. Finally, we present our conclusions in Section 6.

2. Proposed method


First, we explain LZ78, the compression algorithm we use, and then three methods of document recommendation using data compression, namely, (a) PRDC, (b) PRDCUD, and (c) the combined method of PRDCUD and bag-of-words.

2.1 LZ78

LZ78 is an adaptive dictionary technique proposed by Ziv and Lempel [7]. In LZ78, the input is coded as a double <i, c>, with i being an index corresponding to the dictionary entry that was the longest match to the input, and c being the code for the character in the input following the matched portion. The index value 0 is used when there is no match. This double then becomes the newest entry in the dictionary. Thus, each new entry in the dictionary is one new symbol concatenated with an existing dictionary entry [8]. We use LZ78 to make a dictionary from a profile document because it keeps the profile as a dictionary and reuses it; thus, it can calculate similarities efficiently. Other compression methods such as NCD and the compression improvement coefficient (see Section 5) have to measure the similarity against each document in the profile and conduct compression experiments many times; thus they are not suited to our purpose. Compressing a target document with a dictionary enables us to calculate the approximate similarity between the profile and the target document.
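To make the mechanism concrete, here is a minimal Python sketch (our own illustration, not the authors' code): the first function builds an LZ78-style phrase dictionary from a profile document, and the second greedily re-encodes a target document against a fixed, pre-built dictionary, counting how many <index, character> pairs are emitted. Phrases are stored as plain strings for readability.

```python
def lz78_dictionary(text):
    """Build an LZ78-style phrase dictionary from a profile document.
    Entry 0 is the empty match; every new entry extends an existing one
    by a single character, as in LZ78."""
    phrases = {"": 0}
    current = ""
    for ch in text:
        candidate = current + ch
        if candidate in phrases:
            current = candidate                # keep extending the match
        else:
            phrases[candidate] = len(phrases)  # new entry = match + next char
            current = ""                       # start a new match
    return phrases


def lz78_encode_with(dictionary, text):
    """Greedily re-encode `text` against a fixed dictionary (not updated,
    as in PRDC below) and return the number of <index, character> pairs
    emitted, a rough proxy for the compressed length."""
    pairs = 0
    current = ""
    for ch in text:
        candidate = current + ch
        if candidate in dictionary:
            current = candidate
        else:
            pairs += 1                         # emit <index(current), ch>
            current = ""
    if current:
        pairs += 1                             # flush the final match
    return pairs
```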

2.2 Pattern representation scheme using data compression (PRDC)

Here, we explain the pattern representation scheme using data compression (PRDC) proposed by Watanabe et al. [9]. PRDC is a scheme that can be used for classification and discrimination of various types of data. In PRDC, input data is transformed into a compression-ratio vector, and this vector is used for calculating the similarity between the input data and target data. The similarity between a document x and a document di included in the profile G is given by the following formula:

sim(x, di) = lcom(x, ei) / lin(x)

where lcom(x, ei) is the length of the output document compressed by the dictionary ei and lin(x) is the length of the input (original) document. The LZ78 algorithm is used to make the dictionary ei from di, and the dictionary ei is not rebuilt once it is made. We propose to use PRDC for document recommendation. We define the similarity between a document x and the profile G as follows:

sim(x, G) = min_{di in G} sim(x, di)

This means we use the minimum of sim(x, di) as sim(x, G). We make a dictionary ei from each document di in the profile G and calculate the compression ratio using each dictionary.
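Under these conventions, a hedged sketch of PRDC-style recommendation looks as follows; it reuses the hypothetical lz78_dictionary and lz78_encode_with helpers from Section 2.1, with the pair count standing in for the true compressed length lcom.

```python
def prdc_similarity(x, profile_docs):
    """PRDC sketch: build a fixed dictionary e_i from each profile document
    d_i, compress the target x with it, and use the compression ratio
    lcom(x, e_i) / lin(x) as sim(x, d_i); sim(x, G) is the minimum over
    the profile (a smaller ratio means a closer match)."""
    ratios = []
    for d in profile_docs:
        dictionary = lz78_dictionary(d)          # e_i from d_i
        pairs = lz78_encode_with(dictionary, x)  # compress x with e_i
        ratios.append(pairs / max(len(x), 1))    # lcom / lin (proxy)
    return min(ratios)                           # sim(x, G)
```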

2.3 Pattern representation using data compression with a united document (PRDCUD)

PRDC has two problems: as the number of documents in the profile increases, the number of dictionaries also increases, and it is not efficient to conduct a compression experiment with each dictionary. Moreover, the characteristics of individual documents in the profile can be captured by each dictionary, but the characteristics of all documents in the profile cannot be captured. To solve these problems, we propose a new method, namely, pattern representation using data compression with a united document (PRDCUD). Our basic idea is to use the united document D of all the documents di in the profile G, instead of each di, when making a dictionary E. We use LZ78 to make the dictionary, as in the case of PRDC. Thus, similarity is defined as follows:

sim(x, G) = lcom(x, E) / lin(x)

where lcom(x, E) is the length of an output document compressed by the dictionary E, and lin(x) is the length of the input (original) document. PRDCUD has the following advantages over PRDC: (a) PRDCUD conducts only one compression experiment to calculate the similarity between G and x; thus it requires less computational effort; (b) all the documents in the profile are used for making the dictionary, so the characteristics of various topics are summarized in it; (c) compression of a document with a similar topic improves because longer partially matching character sequences from the united document are available when applying LZ78; (d) when a new document is added to the profile, we can update the dictionary without having to reconstruct it, because the document can be compressed sequentially [11].

2.4 Combined method

The combined method uses PRDCUD and the bag-of-words method. Bag-of-words focuses on the unit of a word in the document and is effective for simple topic classifications, for example, when the number of topics is small [11]. We expect that an appropriate combination of these two methods will be more robust than either one by itself because it exploits words as well as longer units in the profile. We simply combine the two similarity measures by taking their product. Thus, the similarity of the combined method is as follows:

sim_comb(x, G) = sim_PRDCUD(x, G) · sim_bow(VD, Vx)

where VD represents the word vector of the united document D of all documents di in the profile G, and Vx represents the word vector of the document x. We simply weight the words using tf_i · idf_i as follows:

Vx = (tf_1 · idf_1, ..., tf_K · idf_K),   idf_i = log(N / df_i)

where N represents the number of documents in the profile G, K represents the number of different word types in the profile G, tf_i represents the frequency of word i in the document x, and df_i represents the number of documents in the profile G that include word i.
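Continuing the sketch, PRDCUD and the combined method might be implemented roughly as below. The bag-of-words term is realized here as the cosine between the tf-idf vectors VD and Vx, which is our assumption, since the paper only states that the two similarity measures are multiplied; helper names again follow the hypothetical LZ78 sketch above.

```python
import math
from collections import Counter


def prdcud_similarity(x, profile_docs):
    """PRDCUD sketch: build one dictionary E from the united profile
    document D and compute lcom(x, E) / lin(x)."""
    united = "".join(profile_docs)           # united document D
    dictionary = lz78_dictionary(united)     # single dictionary E
    pairs = lz78_encode_with(dictionary, x)
    return pairs / max(len(x), 1)


def tfidf_vector(tokens, df, n_docs):
    """tf-idf weights for one document; df maps a word to its document
    frequency in the profile, n_docs is N.  Words unseen in the profile
    are dropped in this sketch."""
    tf = Counter(tokens)
    return {w: tf[w] * math.log(n_docs / df[w]) for w in tf if w in df}


def cosine(u, v):
    """Cosine similarity between two sparse tf-idf vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def combined_similarity(x_text, x_tokens, profile_texts, profile_tokens, df):
    """Combined method sketch: product of the PRDCUD term and the
    bag-of-words (tf-idf cosine) term between V_D and V_x."""
    v_d = tfidf_vector([t for doc in profile_tokens for t in doc],
                       df, len(profile_texts))
    v_x = tfidf_vector(x_tokens, df, len(profile_texts))
    return prdcud_similarity(x_text, profile_texts) * cosine(v_d, v_x)
```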


3. Experimental setup

3.1 Data

We used the Mainichi Newspaper corpora, which include 102,545 articles in Japanese. Each article was allocated to one of 20 topics, and each topic includes 40 to 100 articles. To allocate topics, we applied LDA [12] using GibbsLDA++ (gibbslda.sourceforge.net) and manually checked the result. The number of topics included in our data was unknown, so we tested from 100 to 300 latent topics in increments of 50 and eventually selected T = 250 latent topics, which returned the best results. We selected alpha = 50000/T as the usual setting and updated the topics 2000 times for each experiment. We obtained the topics and articles shown in Table 1. For the experiments, we selected nouns representing content, i.e., noun-common, noun-proper, noun-verbal, and noun-adjective-base, based on the part-of-speech tags assigned by ChaSen (chasen-legacy.sourceforge.jp). In so doing, we removed function words and symbols and extracted the content-related patterns appropriate for our purpose (cf. [13]).
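As an illustration of this filtering step, the following sketch keeps only the four content-noun categories; it assumes ChaSen-style output with one morpheme per line, tab-separated fields, and the IPADIC part-of-speech tag in the fourth field. The field layout and tag names are our assumptions rather than details given in the paper.

```python
# IPADIC-style tags for noun-common, noun-proper, noun-verbal, noun-adjective-base.
CONTENT_POS = ("名詞-一般", "名詞-固有名詞", "名詞-サ変接続", "名詞-形容動詞語幹")


def content_nouns(chasen_lines):
    """Extract surface forms of content-bearing nouns from ChaSen-style output,
    skipping function words, symbols, and EOS markers."""
    nouns = []
    for line in chasen_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4 and fields[3].startswith(CONTENT_POS):
            nouns.append(fields[0])     # surface form
    return nouns
```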

3.2 Profile making and evaluation methods

We made the profile G by randomly selecting ten articles from m topics that had been randomly selected from the list of 20 topics. We varied the number of topics m and conducted the experiments. We limited ourselves to 1000 combinations when selecting topics from the list of 20 topics, so that the number of combinations would not become too large. We selected 300 documents as the evaluation data set: 30 were relevant documents, i.e., documents whose topics were in the profile, and the other 270 were irrelevant documents, i.e., documents allocated to other topics. For example, if we selected two topics for making a profile, the 30 relevant documents consisted of two sets (one per topic) of 15 articles, while the 270 irrelevant documents consisted of 18 sets (one per topic) of 15 articles.
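A small sketch of one trial of this protocol follows; names and the exact sampling details are assumptions (for example, the paper does not state whether the ten profile articles are drawn in total or per topic; ten in total is assumed here).

```python
import random


def build_trial(articles_by_topic, m, seed=None):
    """One experimental trial: choose m profile topics, draw ten profile
    articles from them, then build an evaluation set of 30 relevant and
    270 irrelevant documents (300 in total)."""
    rng = random.Random(seed)
    topics = list(articles_by_topic)
    profile_topics = rng.sample(topics, m)
    other_topics = [t for t in topics if t not in profile_topics]

    profile_pool = [a for t in profile_topics for a in articles_by_topic[t]]
    profile = rng.sample(profile_pool, 10)            # profile G

    relevant_pool = [a for a in profile_pool if a not in profile]
    relevant = rng.sample(relevant_pool, 30)          # topics shared with G
    irrelevant_pool = [a for t in other_topics for a in articles_by_topic[t]]
    irrelevant = rng.sample(irrelevant_pool, 270)     # remaining topics
    return profile, relevant, irrelevant
```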


We calculated the similarity between G and every document x in the 300 articles and sorted the documents in decreasing order of similarity to G. We used the following F1 measure as the evaluation measure:

F1(r) = 2 · P(r) · R(r) / (P(r) + R(r)),   P(r) = RD<r / r,   R(r) = RD<r / RD

where r represents the rank, RD represents the number of relevant documents, RD<r represents the number of relevant documents with rank less than r, and P(r) and R(r) are precision and recall at rank r. We calculated the maximum value of F1 while varying the rank r. We performed the experiments ten times and selected the mean value as the final performance measure. A high F1 value means that we obtain more relevant documents similar to the profile and can recommend the documents that individual users would find most interesting.

4. Results and discussion

We conducted experiments using bag-of-words, PRDC, PRDCUD, and the combined method. We used bag-of-words as the baseline because our compression-based methods use units longer (or shorter) than words, whereas bag-of-words is the simple, typical method that uses the word unit. Thus, we believe that such a comparison is the best way to discuss the effectiveness of our methods in this preliminary study. Table 2 summarizes the mean F1 values together with the number of topics in relevant documents. The following discussion focuses on the results of (a) PRDC and PRDCUD, (b) bag-of-words and PRDCUD, and (c) bag-of-words, PRDCUD, and the combined method.
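Before turning to the individual comparisons, here is a minimal sketch of the maximum-F1 computation described in Section 3.2; it assumes the 300 evaluation documents have already been scored and sorted in decreasing order of similarity to the profile, with relevant documents marked 1.

```python
def max_f1(labels):
    """Maximum F1 over all ranks r, given relevance labels (1 = relevant,
    0 = irrelevant) sorted by decreasing similarity to the profile."""
    total_relevant = sum(labels)
    best, hits = 0.0, 0
    for r, rel in enumerate(labels, start=1):
        hits += rel                          # relevant documents in the top r
        precision = hits / r
        recall = hits / total_relevant
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Hypothetical usage with 30 relevant documents among 300 evaluation documents:
# ranked = sorted(evaluation_set, key=lambda doc: similarity(G, doc), reverse=True)
# print(max_f1([1 if doc.topic in profile_topics else 0 for doc in ranked]))
```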




PRDC vs. PRDCUD

Table 2 shows that PRDCUD performed better than PRDC in all cases. This indicates that by using a united profile document instead of the individual documents in the profile, we can better capture the character of the profile and obtain better document recommendations.

Bag-of-words vs. PRDCUD

Table 2 shows that bag-of-words performed better when the number of topics was smaller (1 or 3), but PRDCUD performed better when the number of topics was larger (5 or 7). In other words, bag-of-words became worse as the number of topics increased, whereas PRDCUD became better. As mentioned in Section 1, the performance of bag-of-words is affected by the dimension of the word vector; it seems that the increase in the dimension of the word vectors and the sparseness of the feature matrix caused its performance to deteriorate. In contrast, PRDCUD used the patterns of partially matching characters in the dictionary when calculating the similarity between the profile and target documents, and it captured the character of the profile even when various topics were included in it; thus it was more robust to a larger number of topics. In practice, the number of topics is usually large, so we think that PRDCUD would be of practical use in a document recommendation system.

Bag-of-words vs. PRDCUD vs. combined method

Table 2 shows that the combined method performed better than bag-of-words when the number of topics was large (3, 5, 7), while it performed slightly worse when the number of topics was 1. It performed better than PRDCUD when the number of topics was small (1, 3, 5) and slightly worse when the number of topics was 7. This indicates that the combined method has intermediate characteristics between those of bag-of-words and PRDCUD and that it can perform better than either in many cases. We conclude that an appropriate combination of the two methods could improve performance in practice.

5. Related work

The application of compression algorithms to text classification has been studied in the field of computational stylistics [14]. For example, Ishihara and Sato [15] proposed using the normalized compression distance (NCD) for authorship attribution in Japanese novels. NCD was proposed by Cilibrasi and Vitanyi [16], and it has been used in various other fields such as bioinformatics [17], music and literature [16], image processing [16, 18], and masquerader detection [19]. The NCD of two information sources a and b is defined as follows:

NCD(a, b) = ( C(ab) − min(C(a), C(b)) ) / max(C(a), C(b))


where C(a) and C(b) represent statistics obtained from compressing a and b respectively, and C(ab) represents the statistic obtained from compressing the concatenation of a and b. The quantity C(ab) − min(C(a), C(b)) captures the information contained in one source but not in the other, and it does not exceed C(a) or C(b); it is therefore normalized by max(C(a), C(b)). In this way, similarity can be calculated even when the difference between the two sources is large. NCD worked well for authorship attribution of Japanese novels. However, it faces two problems: (a) a higher compression rate leads to higher performance [20], but there is a trade-off between compression rate and compression time, and (b) compression experiments have to be conducted four times; thus it requires more computational effort than our methods.

Agata [21] proposed the compression improvement coefficient as a method for identifying similar texts. He used a ZIP-like algorithm and measured the difference between the compression rate of each information source and that of the combined information source. The compression improvement coefficient CIC is defined as follows.

where La and Lb represent the original text lengths, and La+b represents the length of the combined text of a and b. LZa and LZb represent the code lengths after compression, and LZa+b and LZb+a represent the code lengths of the combined texts in the two concatenation orders. The effect of the concatenation order can be removed by computing both LZa+b and LZb+a. The following revised version, CIC', is used when the sizes of the original data vary.

This method performed well for authorship attribution in Japanese novels; however, to calculate CIC', the compression experiment has to be conducted four times, so it requires more computational effort than our method. All of these studies concern stylistic text classification, whereas we used a compression-based method to classify topics in order to build better document recommendation systems.

6. Conclusion

We described a new method of document recommendation using data compression. Experimental results using Japanese newspaper corpora showed that (a) PRDCUD outperformed the previous PRDC method, (b) PRDCUD was more robust than bag-of-words when the number of topics was large, and (c) the combination of bag-of-words and PRDCUD had intermediate characteristics between its component methods. Our experiments demonstrated the effectiveness of PRDCUD and of the combination for document recommendation. We focused on the topics of the text, but the compression-based method can also be used in computational stylistics. Thus, we will examine the use of this method for both stylistic text classification and topic classification on various data sets. Furthermore, in this study we used a simple product of the bag-of-words and PRDCUD similarity measures for the combined method; we believe there should be a more


systematic way of making such a combination. In future work, we will investigate this point by using, for example, Markov chain Monte Carlo [22].

Acknowledgements

This work was supported by Grant-in-Aid for Young Scientists (B) 23700288 from the Ministry of Education, Culture, Sports, Science and Technology, Japan, and by a grant for joint research from the National Institute of Informatics. Earlier versions of this study were presented at the 2009 and 2010 annual meetings of the Institute of Electronics, Information and Communication Engineers and at the 2009 annual conference of the Japan Society for Artificial Intelligence. We would like to thank the participants of these meetings for their helpful comments.

References

[1] R. J. Mooney and L. Roy, Content-based book recommending using learning for text categorization, Proc. ACM ICDM, 2000; 195-204.
[2] D. R. Liu, C. H. Lai, and C. W. Huang, Document recommendation for knowledge sharing in personal folder environments, The Journal of Systems & Software, 2008; 81(8): 1377-88.
[3] K. Sugiyama, K. Hatano, and M. Yoshikawa, Adaptive web search based on user profile constructed without any effort from users, Proc. ACM WWW, 2004; 675-84.
[4] Y. Sawai and K. Yamamoto, Estimating level of public interest for documents, Journal of Natural Language Processing, 2008; 15(2): 101-36.
[5] G. Buscher, A. Dengel, and L. van Elst, Query expansion using gaze-based feedback on the subdocument level, Proc. ACM SIGIR, 2008; 387-94.
[6] G. Buscher, A. Dengel, L. van Elst, and F. Mittag, Generating and using gaze-based document annotations, Proc. CHI, 2008; 3045-50.
[7] J. Ziv and A. Lempel, Compression of individual sequences via variable-rate coding, IEEE Trans. Information Theory, 1978; 24(5): 530-6.
[8] K. Sayood, Introduction to Data Compression (3rd ed.), Morgan Kaufmann, San Francisco, CA, 2006.
[9] T. Watanabe, K. Sugawara, and H. Sugihara, A new pattern representation scheme using data compression, IEEE Trans. Pattern Analysis and Machine Intelligence, 2002; 24(5): 579-90.
[10] H. Kimura, T. Watanabe, H. Koga, and Z. Nuo, A new document retrieval method using LZ78 compression function, IPSJ SIG Tech. Report, 2006-F1-84/2006-NL-175, 2006; 65-70.
[11] S. Hasegawa, A. Aizawa, and T. Hamamoto, Unconscious similarity of topics in personalization, Proc. JSAI, 2009; 3I2-4.
[12] D. M. Blei, A. Y. Ng, and M. I. Jordan, Latent Dirichlet allocation, The Journal of Machine Learning Research, 2003; 3: 993-1022.
[13] S. Helmer, Measuring the structural similarity of semistructured documents using entropy, Proc. VLDB, 2007; 1022-32.
[14] E. Stamatatos, A survey of modern authorship attribution methods, Journal of the American Society for Information Science and Technology, 2009; 60(3): 538-56.
[15] M. Ishihara and S. Sato, Classification of Japanese novels according to authors by normalized compression distance and validity of compression programs, Journal of Information Processing Society in Japan, 2008; 49(12): 4016-24.
[16] R. Cilibrasi and P. M. B. Vitanyi, Clustering by compression, IEEE Trans. Information Theory, 2005; 51(4): 1523-45.
[17] M. Nykter, O. Yli-Harja, and I. Shmulevich, Normalized compression distance for gene expression analysis, IEEE GENSIPS, 2005.
[18] M. Kramm, Image cluster compression using partitioned iterated function systems and efficient inter-image similarity features, Proc. IEEE SITIS, 2007; 989-96.
[19] M. Bertacchini and C. E. Benitez, NCD based masquerader detection using enriched command lines, Proc. CIBSI, 2007; 329-38.
[20] M. Cebrián, M. Alfonseca, and A. Ortega, Common pitfalls using the normalized compression distance: what to watch out for in a compressor, Communications in Information and Systems, 2005; 15(4): 367-84.
[21] T. Agata, Authorship attribution by data compression program, Library and Information Science, 2005; 54: 1-18.
[22] D. Mochihashi and E. Sumita, Bayesian unsupervised word segmentation with nested Pitman-Yor language modeling, Proc. ACL-IJCNLP, 2009; 100-8.
[23] G. Tsatsaronis, I. Varlamis, and K. Nørvåg, SemanticRank: ranking keywords and sentences using semantic graphs, Proc. COLING, 2010; 1074-82.
