2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology

Using Wavelets to Classify Documents

Geraldo Xexéo1,2, Jano de Souza1,2, Patrícia F. Castro1, Wallace A. Pinheiro1,3
1 Programa de Engenharia de Sistemas e Computação, COPPE/UFRJ
2 Departamento de Ciência da Computação, IM/UFRJ
3 Instituto Militar de Engenharia, Exército Brasileiro
[xexeo,jano,patfiuza,awallace]@cos.ufrj.br

Abstract

Currently, Fourier and discrete cosine transforms are used to classify documents. This article proposes a new strategy that uses wavelets in the representation and reduction of text data. Wavelets have been extensively used for dimensionality reduction in the field of signal processing. In this work, we show that a text document, after being subjected to a simple process of reorganization of its terms, can be treated like a signal and analyzed with signal processing tools. We demonstrate that this new representation is able to describe the most relevant features of documents in a compact form, and that this new perspective improves the performance of the classification algorithm.

1. Introduction

The text classification task can be summed up as the identification of characteristics that indicate the category or subject of a document. Machine learning algorithms are able to extract patterns from training examples. These patterns are represented in terms of parameters related to the characteristics of the analyzed domain, and they can be used to classify new cases. Nevertheless, this characteristic creation process can produce a quantity of data that affects the performance of the learning method. In many cases, the resulting set contains thousands of elements. Thus, a second characteristic selection step, which reduces the original data set and is known as dimensionality reduction, is essential. As described in Li[1], wavelets can be useful in this process: first, because classification methods can be applied both to the wavelet domain of the original data and to selected dimensions of this domain; second, because the multiresolution property can be incorporated, making the process easier to perform.

According to Lima[2], the idea behind any compression (reduction) scheme is to remove the correlation from the data. Correlated data are characterized by the fact that, given part of the data, we can fill in the missing part. The idea is to represent the data in a different mathematical basis, in the hope that this new representation will reveal the correlation. By this we mean that, in the new basis, most coefficients are very small, so reduction is achieved by setting the coefficients below a certain threshold to zero. In these cases, the information is characterized by a small number of coefficients. In practice, there is a demand for a transform that:
• is independent of the data;
• can be calculated by a fast algorithm (linear or linear-logarithmic);
• is able to remove the correlation from a generic large set of data.
A wavelet basis is perfectly applicable to this circumstance, allowing fast processing that is independent of the data and a compact representation of a huge and generic data set.

This paper proposes a representation of documents based on the reorganization of their terms according to an analysis of the correlation among them. This reorganization, which preserves the relevant characteristics of each document, facilitates the exploitation of the main features of the wavelet transform and its multiresolution property. The goal is to show that this new representation is able to describe the most relevant features of the documents and also increases the performance of the classification algorithm. Besides this introductory section, this article presents some related work in section 2. In section 3, we show a brief introduction to wavelets. Section 4 describes our proposal for applying these concepts to the text classification task. Section 5 presents the experiment, describing the methods, the tests, the results and the analysis. Finally, in section 6, we present our conclusion and ideas for further work.


2. Related Works

In recent years, some interesting works involving wavelets and text have been proposed. Miller[3] applies the wavelet transform to text data in a work focused on text visualization. Flesca[4] uses signal-processing tools to compare semi-structured text documents (XML). Park[5] and Silva[6] propose the wavelet transform for creating indexes in information retrieval systems. Wavelets have also been frequently employed in data classification. Aggarwal[7] uses wavelets for string classification, Kiema[8] and Castelli[9] describe their use in image classification, and Bergstra[10] in music classification.

3. Foundation

In this section, we present a brief introduction to the wavelet transform through the description of the fundamental concepts necessary for the understanding of this work. More detailed texts on the subject can be found in Mallat[12], Daubechies[13, 14] and Graps[15].

3.1. Wavelets

Graps[15] defines wavelets as "mathematical functions that cut up the data (signal) into different frequency components, and then study each component with a resolution matched to its scale." The wavelet transform uses a prototype function called the mother wavelet to represent the original data (signal) in the wavelet domain with multiple levels of detail. This concept is called multi-resolution. The mother wavelet is compressed or dilated to match high- or low-frequency signals, respectively. It can be shown [16] that the mother wavelet forms a basis for the L2(R) space and can approximate any signal that has finite energy (information). In the discrete case (the case we are interested in, because of the nature of our data), the wavelet can operate over any square-summable signal. From the point of view of signal processing, this transformation can be obtained by the successive application of low-pass and high-pass filters, forming a filter bank. This view is very important for using wavelets, because it gives us a way to implement the wavelet transform. The low-pass filter has the effect of smoothing the signal, while the high-pass one has the effect of extracting the details. There are several types of wavelets, but in this work we limit ourselves to working with the Haar wavelet, because it is simple to implement, very fast, and has all the features that we need (multiresolution and orthogonality). Figure 1 depicts a scheme for a filter bank: g[n] is a low-pass filter and h[n] represents the high-pass filter. The number '2' inside the circle represents a decimation by two, that is, at each level of the filter we can throw away half of the points of the original signal.

Figure 1. Filter bank used to obtain the wavelet transform

Each level of the filter bank is composed of a low-pass and a high-pass filter. The low-pass filter output is transferred to the next level of the filter bank, while the output of the high-pass filter (the wavelet coefficients) is stored. In the transition between levels, we can reduce the number of points of the signal by half. The low-pass filter output is frequently interpreted as an approximation of the original signal, and the high-pass filter output as the error between the original signal and the new approximation. In the Haar case, the approximation is simply the average of two consecutive points, while the error is the difference between these points. Figure 2 shows schematically how the original signal can be decomposed into wavelet coefficients (we present the non-normalized version of this tree to simplify the calculation). The left branches are the outputs of the low-pass filter, while the right ones are the wavelet coefficients (high-pass filter outputs).

Figure 2. Decomposition tree scheme used to evaluate wavelet coefficients

The tree operation in the Haar case is very simple. In the example shown in Figure 2, for the signal (7, 5, 1, 9), the first left branch is formed by 6 = (7+5)/2 and 5 = (1+9)/2, which are the averages of the values taken two by two. The first right branch is formed by -1 = (5-7)/2 and 4 = (9-1)/2, which are the differences (errors) between two consecutive values. Applying this procedure recursively, we obtain the tree. The inverse operation, which recovers the original signal, is easy, starting at the last level of the tree. Figure 3 shows the reconstruction tree. The leaves of the tree form the original signal. Basically, we follow the nodes, adding the values if we choose the left branch and subtracting the values if we choose the right branch. For example, to find the value 5, we start from the value 5.5, then we subtract the value (-0.5), obtaining the value 6. Finally, we add the value (-1).

Figure 3. Reconstruction tree for wavelet transform
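To make the tree operations concrete, the following sketch implements the non-normalized Haar decomposition and reconstruction described above, reproducing the (7, 5, 1, 9) example. The coefficient layout, with the overall average first and the finest details last, is our convention; the paper does not fix one.

```python
# A minimal sketch of the non-normalized Haar transform of
# Figures 2 and 3. The output layout [overall average, coarsest
# detail, ..., finest details] is an assumption of this sketch.

def haar_decompose(signal):
    """Decompose a signal of length 2^n into Haar coefficients."""
    coeffs = []
    approx = list(signal)
    while len(approx) > 1:
        averages = [(approx[i] + approx[i + 1]) / 2
                    for i in range(0, len(approx), 2)]
        details = [(approx[i + 1] - approx[i]) / 2
                   for i in range(0, len(approx), 2)]
        coeffs = details + coeffs   # finer details stay to the right
        approx = averages
    return approx + coeffs          # [overall average, details...]

def haar_reconstruct(coeffs):
    """Invert haar_decompose, walking the reconstruction tree of Figure 3."""
    approx = coeffs[:1]
    pos = 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(approx)]
        next_approx = []
        for a, d in zip(approx, details):
            # one child subtracts the detail, the other adds it
            next_approx += [a - d, a + d]
        approx = next_approx
        pos += len(details)
    return approx

print(haar_decompose([7, 5, 1, 9]))              # [5.5, -0.5, -1.0, 4.0]
print(haar_reconstruct([5.5, -0.5, -1.0, 4.0]))  # [7.0, 5.0, 1.0, 9.0]
```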

4. The Proposal

We present in this work the use of the signal-processing framework to deal with text documents, in the same way as described in Silva[6]. The main feature of this proposal is the definition of a Document Codification Function (DCF) that is able to transform the document into a signal. This model can be split into three steps. First, we define a Term Codification Function (TCF), which associates an identification number with each document term. In the second step, these values are used by the DCF to rearrange the terms of the document. Finally, once the documents are represented as signals through the processing performed by the DCF, we apply the wavelet framework to decompose them. The new representation of the document, now in the wavelet domain, can be used at many different resolutions.

4.1. Documents as Signals

A real physical signal, like an electric signal, is characterized by a high correlation between two adjacent points. We define the TCF to give this characteristic to our document representation. In order to do so, we attribute a code to each term, so that terms with a high correlation are near each other. We start with the term-document matrix. We evaluate the correlation between all the terms and then order them based on their correlation coefficients, starting with the most common term. As a result, the first term is the one with the highest average entry on the corresponding row; the next term is always the one corresponding to the column with the highest entry of the previous one. Example 1 illustrates this idea.

Example 1: This example shows how to obtain the TCF. Suppose we have the collection defined in Figure 4. The term-document matrix, initially used, would be the same used in a typical vector space model (VSM) [17]. In this case, we use the TF-IDF (Term Frequency-Inverse Document Frequency) heuristic, as defined in equation (1), to evaluate the term weights. The resulting matrix is represented in Figure 5.

TFIDF(t, d) = \frac{freq(t, d)}{\max_{x \in d} freq(x, d)} \times \log \frac{|D|}{|\{x \in D \mid t \in x\}|}     (1)

D1 = (A, B, A, B, A)
D2 = (A, A, A, A, A)
D3 = (B, B, B, B, B)
D4 = (A, C, B, C, A)
D5 = (A, A, B, B, B)

Figure 4. A hypothetical collection of documents

             D1     D2     D3     D4     D5
M_vsm = A [ 0.096  0.096  0      0.096  0.064 ]
        B [ 0.064  0      0.096  0.048  0.096 ]
        C [ 0      0      0      0.698  0     ]

Figure 5. The term-document matrix

Now, to define the correlation between terms, we can use equation (2) to evaluate the term-to-term correlation:

M_{vsm} \times M_{vsm}^{T} = M_{tt}     (2)

where M_vsm is the term-document matrix, M^T_vsm is the transpose of M_vsm, and M_tt is the term-to-term correlation matrix. For our example, the normalized matrix M_tt is shown in Figure 6.

              A      B      C
M_tt = A [ 1      0.677  0.725 ]
       B [ 0.611  1      0.638 ]
       C [ 1      0.666  1     ]

Figure 6. Normalized term-to-term correlation matrix used to evaluate the TCF

Continuing our analogy with a physical signal, we can interpret the matrix of Figure 6 as the binding energy between terms. That is, the term A, for example, is more related to term C than to term B.
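As a cross-check of Example 1, here is a minimal sketch of the TCF computation. Equation (1) reproduces the weights of Figure 5 once a base-10 logarithm is assumed; since the paper does not spell out the normalization behind Figure 6, the ordering step reuses the printed matrix instead of deriving it.

```python
import numpy as np

# Sketch of Example 1. The base-10 logarithm in tfidf() is our
# inference from the printed values of Figure 5; the normalized
# matrix of Figure 6 is taken as given.

docs = [list("ABABA"), list("AAAAA"), list("BBBBB"),
        list("ACBCA"), list("AABBB")]
terms = ["A", "B", "C"]

def tfidf(term, doc):
    tf = doc.count(term) / max(doc.count(x) for x in set(doc))
    n_with_term = sum(1 for d in docs if term in d)
    return tf * np.log10(len(docs) / n_with_term)

M_vsm = np.array([[tfidf(t, d) for d in docs] for t in terms])
M_tt_raw = M_vsm @ M_vsm.T          # equation (2), unnormalized

# Normalized term-to-term matrix as printed in Figure 6.
M_tt = np.array([[1.0,   0.677, 0.725],
                 [0.611, 1.0,   0.638],
                 [1.0,   0.666, 1.0  ]])

# Greedy TCF ordering: start from the term with the lowest average
# TF-IDF weight, then repeatedly hop to the most strongly bound
# term that has not been visited yet.
order = [int(np.argmin(M_vsm.mean(axis=1)))]
while len(order) < len(terms):
    row = M_tt[order[-1]].copy()
    row[order] = -np.inf            # never revisit a term
    order.append(int(np.argmax(row)))

tcf = {terms[i]: rank + 1 for rank, i in enumerate(order)}
print(tcf)   # {'B': 1, 'C': 2, 'A': 3}, matching Table 1
```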


Based on it, we can establish a partial order for the terms. We propose starting with term B, because it has the lowest average weight according to the TF-IDF heuristic, which means that, on average, term B is the weakest term in most documents. Following the binding energy, we would start with B, pass to C, and finally arrive at A. We propose to use this order as the TCF. Table 1 presents the TCF for each term.

Table 1. The application of the TCF to the terms of Example 1

Term   TCF(Term)
A      3
B      1
C      2

After we have defined the TCF, we can use it to construct the DCF. In our case, it is simply a kind of term histogram, but the term order on the x-axis is determined by the TCF. The term-document matrix obtained is very similar to the one used in the vector space model, but its column space reflects the association between terms. Example 2 shows how it works.

Example 2: Continuing with our previous example, we can now define the reordered term-document matrix. We start with the matrix in the document domain (Figure 7).

             D1     D2     D3     D4     D5
M_sig = B [ 0.064  0      0.096  0.048  0.096 ]
        C [ 0      0      0      0.698  0     ]
        A [ 0.096  0.096  0      0.096  0.064 ]

Figure 7. Term-document matrix after the application of the TCF

This matrix (Figure 7) is similar to the original matrix (Figure 5), but its interpretation is quite different. In the original matrix, each column is a vector spanned in the term space, while in the new matrix each column is a sampled signal. As a signal, it can be operated on with the techniques described in section 3, and we can use the algorithm presented there to convert each column of M_sig to a new column in the wavelet domain.

              D1      D2      D3      D4      D5
M_wav = [  0.128   0.096   0.048   0.469   0.112 ]
        [ -0.064  -0.096   0.048   0.277  -0.016 ]
        [  0.045   0       0.068  -0.460   0.068 ]

Figure 8. Documents in the wavelet domain
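The pipeline of Example 2 can be sketched as below, reusing haar_decompose from the sketch in section 3.1. The reflection rule (mirroring the tail of the signal up to a power-of-two length) is our reading of the edge-effect remark that follows, and the numbers track the worked W(d1) example rather than the scaling of the printed matrices.

```python
import numpy as np

# Sketch of the DCF pipeline of Example 2. The reflection step and
# the truncation convention (coarsest coefficients first) are
# assumptions of this sketch, not fixed by the paper.

tcf_order = ["B", "C", "A"]          # from Table 1
M_vsm_rows = {"A": [0.096, 0.096, 0, 0.096, 0.064],
              "B": [0.064, 0, 0.096, 0.048, 0.096],
              "C": [0, 0, 0, 0.698, 0]}

M_sig = np.array([M_vsm_rows[t] for t in tcf_order])  # Figure 7

def to_wavelet(column):
    signal = list(column)
    target = 1 << (len(signal) - 1).bit_length()      # next power of 2
    signal += signal[::-1][:target - len(signal)]     # reflect the tail
    return haar_decompose(signal)

# For D1 this gives d1 = [0.064, 0, 0.096, 0.096] and
# W(d1) = [0.064, 0.032, -0.032, 0], as in the text below.
M_wav = np.array([to_wavelet(col) for col in M_sig.T]).T

# Multiresolution reduction: keeping only the first rows of M_wav
# yields the coarser representations of Figures 9 and 10.
M_wav_half = M_wav[: M_wav.shape[0] // 2]   # half resolution
M_wav_quarter = M_wav[:1]                   # coarsest approximation
```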

The M_wav matrix (Figure 8) is the result of the wavelet transform applied to each column vector of M_sig. We reflect the signal with respect to the y-axis to obtain an even function and avoid the edge effect that is very common when a fast variation occurs. The document D1, for instance, was converted to d1 = [0.064, 0, 0.096, 0.096], and the wavelet transform obtained was W(d1) = [0.064, 0.032, -0.032, 0]; the last coefficient is zero. Because M_wav is in the wavelet domain, it represents the original signal (document) at different resolutions. The matrix of Figure 8, for example, could generate the following two matrices (Figure 9 and Figure 10).

                  D1      D2      D3      D4      D5
M_wav,1/2 = [  0.128   0.096   0.048   0.469   0.112 ]
            [ -0.064  -0.096   0.048   0.277  -0.016 ]

Figure 9. Wavelet transform matrix with half resolution

M_wav,1/4 = [ 0.128  0.096  0.048  0.469  0.112 ]

Figure 10. Wavelet transform matrix with quarter resolution

These two matrices (Figure 9 and Figure 10) are representations of the original documents at different resolutions (it is enough to drop the higher-resolution coefficients). Of course, the lower the dimension of the column space of the M_wav matrix, the poorer the precision of the representation. The matrix in Figure 10, for instance, is the worst approximation of the documents that we can obtain in the wavelet domain: it contains a single dimension in its column space.

4.2. Examples of Training

Vectors representing each document in the collection are created from the wavelet matrix. It is only necessary to add a category indicator, used by the classification algorithm to distinguish among the training instances. Since the generation algorithm of the Haar transform needs a data input of 2^n elements, we decided to represent the documents through the 2,048 most representative terms of the collection. In this way, each vector is formed by the 2,048 most representative terms of the document collection and a category indicator.

5. Experiment and Results

As in most trials for the evaluation of classifiers, we have chosen the Reuters-21578 data set (available at http://www.daviddlewis.com/resources/testcollections/reuters21578/) for our tests.

The standard ModApte split, defined within this set [18], was used to select examples for training and testing. Procedures for stopword removal and stemming, using Porter's algorithm [19], were applied to the documents before the application of the TCF and DCF functions described in section 4. After that, all the remaining steps of the wavelet transform were applied to the document set. We performed 10 experiments with this document set, exploring the multiresolution property as a method of term reduction (one resolution per experiment). The k-Nearest Neighbours (KNN) classifier was chosen; as described in [20], it is a classifier whose learning is based on analogy. The choice of this classification method takes into consideration its similarity-based approach and its well-known simplicity and efficiency in handling large volumes of data. We believe that the properties of wavelets can be well exploited by classification methods based on this methodology. In a similar way as described in [11], we varied the k value between the minimum value 1 and the maximum value 90. For each execution of the algorithm, we calculated the average precision and recall values for each class, at each resolution and for each k value. The results found in each of these situations are presented in Figures 11 and 12 and in Table 2.
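The experimental loop described above can be sketched as follows. The use of scikit-learn's KNeighborsClassifier, the macro-averaging of per-class precision and recall, and the assumption that the coarsest wavelet coefficients occupy the first columns are all ours; the paper does not specify an implementation.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score, recall_score

# A minimal sketch of the evaluation loop, assuming X_train / X_test
# are the 2048-coefficient wavelet vectors of section 4 and
# y_train / y_test are the ModApte category labels. Variable names
# and library choice are assumptions of this sketch.

def evaluate(X_train, y_train, X_test, y_test):
    results = {}
    for n_terms in (2048, 1024, 512, 256, 128, 64, 32, 16, 8, 4):
        # Multiresolution reduction: keep only the coarsest n_terms
        # wavelet coefficients of each document vector.
        Xtr, Xte = X_train[:, :n_terms], X_test[:, :n_terms]
        for k in (1, 15, 30, 45, 60, 75, 90):
            knn = KNeighborsClassifier(n_neighbors=k).fit(Xtr, y_train)
            pred = knn.predict(Xte)
            results[(n_terms, k)] = (
                precision_score(y_test, pred, average="macro",
                                zero_division=0),
                recall_score(y_test, pred, average="macro",
                             zero_division=0),
            )
    return results
```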

[Figure 11 plots average precision (0-100%) against the number of terms (2048, 512, 128, 32, 8) for k = 1, 15, 30, 45, 60, 75 and 90.]

Figure 11. Average Precision

[Figure 12 plots average recall (0-100%) against the number of terms (2048, 512, 128, 32, 8) for k = 1, 15, 30, 45, 60, 75 and 90.]

Figure 12. Average Recall

Table 2. Average measures of performance (precision and recall) of the classifier as a function of k and the number of terms (resolution).

k   Measure     2048    1024    512     256     128     64      32      16      8       4
1   Precision   49.1%   55.1%   55.3%   56.0%   65.3%   63.2%   60.3%   40.2%   40.0%   20.0%
1   Recall      35.1%   50.1%   59.9%   68.2%   63.3%   66.4%   64.1%   41.2%   35.0%   30.0%
15  Precision   68.2%   71.1%   73.4%   74.1%   75.0%   69.5%   48.5%   37.5%   35.6%   42.0%
15  Recall      58.1%   63.2%   66.3%   72.5%   72.5%   72.6%   55.6%   56.2%   32.0%   21.0%
30  Precision   76.5%   76.9%   81.5%   83.6%   79.8%   76.3%   65.2%   54.1%   25.2%   30.0%
30  Recall      65.0%   65.2%   71.5%   80.2%   75.5%   72.3%   65.2%   50.2%   39.8%   25.0%
45  Precision   79.9%   81.1%   82.6%   81.6%   81.1%   80.9%   76.3%   51.9%   27.8%   36.0%
45  Recall      68.9%   75.3%   76.4%   79.9%   80.2%   81.5%   76.3%   52.1%   39.6%   21.0%
60  Precision   81.5%   81.9%   84.1%   85.0%   86.0%   84.1%   76.5%   65.2%   36.8%   41.0%
60  Recall      72.1%   82.2%   83.3%   85.3%   86.2%   85.3%   75.3%   63.3%   40.2%   30.0%
75  Precision   82.9%   84.6%   87.6%   89.0%   89.2%   88.0%   75.7%   65.7%   36.5%   32.0%
75  Recall      75.0%   85.6%   87.2%   87.5%   87.5%   84.7%   75.3%   62.3%   35.2%   22.0%
90  Precision   83.0%   84.7%   84.9%   85.0%   86.5%   87.2%   76.5%   67.2%   44.1%   25.0%
90  Recall      76.0%   84.2%   84.6%   85.3%   85.3%   87.0%   70.2%   68.8%   32.2%   29.0%

5.1. Analysis of the Results

The increase in the k value represents an increase in the rigor of the classifier. The goal of using the wavelet multiresolution property is to search for a document representation made of increasingly significant words or terms in a minimal amount. We tried to show through the tests that the application of this property can increase the quality of the generated classifiers, even when the classifiers are generated using rigid configuration parameters.

The graphics in Figures 11 and 12 and in Table 2 demonstrate this. The classifier performance displays an increasing improvement up to an intermediate resolution level for all of the evaluation criteria adopted. The decrease of these values after a certain resolution was expected, since we supposed there would be a loss of information at the lowest resolutions. From the graphics, we observe that the best indicators of classifier quality appear between resolution levels 6 and 8 (64 to 256 terms) and are independent of the k value.

In a similar work, Pryczek[11] presents the results achieved by the application of the Fourier transform in the representation of documents for classification. His trial also makes use of the KNN algorithm, and the performance analysis of the classifier is based on the variation of k. In addition, four different formulae are applied to the assessment of similarity among the documents. Under the most rigid parameter k (k = 90), his results show very good precision values, namely 0.9108, 0.8899, 0.8939 and 0.9231, one for each of the equations used. However, this gain in precision resulted in a loss of recall, whose final figures are 0.5120, 0.5210, 0.4563 and 0.4205. Since the goal of any classification task is always to maximize both precision and recall, there is a clear improvement in the performance of the classifier when dealing with documents based on the representation proposed here.

6. Conclusion

The reordering of document terms based on the analysis of the correlation among them allows us to create a representation for these documents that maintains their unique features and is still able to exploit the key features of the wavelet transform. This conversion allows the use of wavelets and their multiresolution property to improve the representation of these documents. This improvement in the representation is reflected in a better performance of the chosen classifier. It is possible that this strategy could be used by other classification methods, whether based on similarity or not. So our next step is to extend the tests presented here to evaluate the performance of classifiers based on Support Vector Machines (SVM), whose classification also occurs through the analysis of the similarity between training examples. After that, we intend to extend our evaluation to classifiers that employ other learning methodologies as well. We also intend to run tests with other types of wavelets.

7. References

[1] Li T., Li Q., Zhu S., Ogihara M. A survey on wavelet applications in data mining. SIGKDD Explorations. 2005.
[2] Lima P.C. Wavelets: an introduction. Matemática Universitária, no. 33, pages 13-44. Brazilian Society of Mathematics. 2002.
[3] Miller N., Wong P., Brewster M., Foote H. TOPIC ISLANDS - a wavelet-based text visualization system. Proceedings of IEEE Visualization, pages 189-196. 1998.
[4] Flesca S., Manco G., Masciari E., Pontieri L., Pugliese A. Fast detection of XML structural similarity. IEEE Transactions on Knowledge and Data Engineering, volume 17, issue 2, pages 160-175. 2005.
[5] Park L., Ramamohanarao K., Palaniswami M. A novel document retrieval method using the discrete wavelet transform. ACM Transactions on Information Systems, volume 23, issue 3, pages 267-298. 2005.
[6] Silva R. MSBRI: Signal Model for Information Retrieval. Master's Thesis. COPPE/UFRJ. 2007.
[7] Aggarwal C., Bradley P. On the use of wavelet decomposition for string classification. Data Mining and Knowledge Discovery, volume 10, issue 2, pages 117-139. 2005.
[8] Kiema J.B. Wavelet compression and data fusion: an investigation into the automatic classification of urban environments using color photography and laser scanning data. 15th International Conference on Pattern Recognition (ICPR'00), volume 3, page 3089. 2000.
[9] Castelli V., Kontoyiannis I. Wavelet-based classification: theoretical analysis. Technical Report RC20475, IBM Watson Research Center. 1996.
[10] Bergstra J., Casagrande N., Eck D. Genre classification: a timbre- and rhythm-based multiresolution approach. MIREX genre classification contest. 2005.
[11] Pryczek M., Szczepaniak P.S. On textual documents classification using Fourier domain scoring. IEEE/WIC/ACM International Conference on Web Intelligence, pages 773-777. 2006.
[12] Mallat S. A Wavelet Tour of Signal Processing. Academic Press. 1999.
[13] Daubechies I. The wavelet transform, time-frequency localization and signal analysis. IEEE Transactions on Information Theory, vol. 36, no. 5, pages 961-1005. 1990.
[14] Daubechies I. Ten Lectures on Wavelets. CBMS-NSF Regional Conference Series in Applied Mathematics, vol. 61. SIAM, Philadelphia, PA. 1992.
[15] Graps A. An introduction to wavelets. IEEE Computational Science and Engineering, volume 2, issue 2, pages 50-61. 1995.
[16] Young R.K. Wavelet Theory and Its Applications. Kluwer Academic Publishers. 1998.
[17] Baeza-Yates R., Ribeiro-Neto B. Modern Information Retrieval. New York: ACM Press/Addison Wesley. 1999.
[18] Debole F., Sebastiani F. An analysis of the relative hardness of Reuters-21578 subsets. Journal of the American Society for Information Science and Technology, vol. 56, pages 584-596. 2005.
[19] Porter M.F. An algorithm for suffix stripping. Program, 14, pages 130-137. 1980.
[20] Mitchell T. Machine Learning. McGraw-Hill. 1997.
