2004 ACM Symposium on Applied Computing
An Optimized Approach for KNN Text Categorization using P-trees
Imad Rahal and William Perrizo
Computer Science Department, North Dakota State University
IACC 258, Fargo, ND, USA
001-701-231-7248
{imad.rahal, william.perrizo}@ndsu.nodak.edu
ABSTRACT
The importance of text mining stems from the availability of huge volumes of text databases holding a wealth of valuable information that needs to be mined. Text categorization is the process of assigning categories or labels to documents based entirely on their contents. Formally, it can be viewed as a mapping from the document space into a set of predefined class labels (also known as subjects or categories): F: D → {C1, C2, …, Cn}, where F is the mapping function, D is the document space, and {C1, C2, …, Cn} is the set of class labels. Given an unlabeled document d, we need to find its class label, Ci, using the mapping function F, where F(d) = Ci. In this paper, we propose an optimized k-Nearest Neighbors (KNN) classifier that uses intervalization and the P-tree technology to achieve a high degree of accuracy, space utilization, and time efficiency. As new samples arrive, the classifier finds the k nearest neighbors to the new sample from the training space without a single database scan.

Categories and Subject Descriptors
I.5.4 [Pattern Recognition]: Applications – Text Processing. I.5.2 [Pattern Recognition]: Design Methodologies – Classifier design and evaluation. E.1 [Data Structures]: Trees.

General Terms
Algorithms, Management, Performance, Design.

Keywords
Text categorization, P-trees, Intervalization, k-Nearest Neighbors (KNN).

1. INTRODUCTION
Nowadays, a great deal of the literature in most domains is available in text format. Document collections (also called text or document databases in the literature) are usually characterized by being very dynamic in size. They contain documents from various sources such as news articles, research publications, digital libraries, emails, and web pages. The worldwide advent of the Internet is perhaps one of the main reasons for the rapid growth in the sizes of these collections.

In the term space model [6][7], a document is represented as a vector in the term space, where terms are used as features or dimensions. The data structure resulting from representing all the documents in a given collection as term vectors is referred to as a document-by-term matrix. Given that the term space has thousands of dimensions, most current text-mining algorithms fail to scale up. This very high dimensionality of the term space is an idiosyncrasy of text mining and must be addressed carefully in any text-mining application. Within the term space model, many different representations exist. On one extreme, there is the binary representation, in which a document is represented as a binary vector where a 1 bit in slot i implies the existence of the corresponding term ti in the document, and a 0 bit implies its absence. This model is fast and efficient to implement but clearly lacks the degree of accuracy needed, because most of the semantics are lost. On the other extreme, there is the frequency representation, where a document is represented as a frequency vector [6][7]. Many types of frequency measures exist: term frequency (TF), term frequency by inverse document frequency (TFxIDF), normalized TF, and the like. This representation is obviously more accurate than the binary one but is not as easy and efficient to implement. Text preprocessing such as stemming, case folding, and stop lists can be exploited prior to the weighting process for efficiency purposes.
In this paper, we present a new model for representing text data based on the idea of intervalizing (also known as discretizing) the data into a set of predefined intervals. We propose an optimized KNN algorithm that exploits this model. The algorithm is accurate as well as space- and time-efficient because it is based on the P-tree technology.
Patents are pending for the P-tree technology. This work was partially supported by GSA grant ACT# K96130308.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SAC ‘04, March 14-17, 2004, Nicosia, Cyprus Copyright 2004 ACM 1-58113-812-1/03/04… $5.00.
The rest of this paper is organized as follows: Section 2 gives an introduction to the P-tree technology. Section 3 discusses the data-management steps required for applying the text-categorization algorithm, which, in turn, is discussed in Section 4. Section 5 gives a performance-analysis study. Finally, Section 6 concludes the paper by highlighting the achievements of our work and pointing out future directions in this area.
2. THE P-TREE TECHNOLOGY
The basic data structure exploited in the P-tree technology [1] is the Predicate Count Tree (PC-tree), or simply the P-tree. Formally, P-trees are tree-like data structures that store numeric relational data (i.e., numeric data in relational format) in column-wise, bit-compressed format. Each attribute is split into bits (i.e., each attribute value is represented by its binary equivalent), all bits in each bit position are grouped together across all tuples, and each bit group is represented by a P-tree. P-trees provide a lot of information and are structured to facilitate data mining processes.

Figure 1. Relational numeric data converted to binary format with the first three bit groups in Attribute 1 highlighted.
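As an illustrative sketch (not the paper's implementation), the vertical bit-slicing step can be expressed as follows; the function name `bit_groups` is ours.

```python
# Hypothetical sketch of vertical bit-slicing: each 8-bit attribute
# value is split into bits, and the bits at each position are grouped
# together across all tuples (one bit group per position).

def bit_groups(column, width=8):
    """Return `width` bit groups for one attribute column.

    groups[p][t] is bit p (most significant first) of tuple t's value.
    """
    return [[(value >> (width - 1 - p)) & 1 for value in column]
            for p in range(width)]

column = [5, 2, 7, 2]                # one attribute, four tuples
groups = bit_groups(column)
print(len(groups))                   # 8 bit groups for an 8-bit attribute
print(groups[7])                     # least-significant bits: [1, 0, 1, 0]
```

With three such attributes, as in Figure 1, this step would yield 24 bit groups in total, each of which is then compressed into a P-tree.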
After representing each numeric attribute value in binary, we store all bits for each position separately; in other words, we group together the bit values at bit position x of each attribute across all tuples t. Figure 1 shows a relational table with three attributes and four tuples transformed from numeric to binary, and highlights the bits in the first three bit groups of the first attribute, Attribute 1; each of these bit groups forms a P-tree. Since each attribute value in our table is made up of 8 bits, 24 bit groups are generated in total, with each attribute generating 8 bit groups.

Figure 2 shows a group of 16 bits transformed into a P-tree after being divided into quadrants (i.e., subgroups of 4 bits). Each such tree is called a basic P-tree. In the lower part of Figure 2, 7 is the total number of "1" bits in the whole bit group shown in the upper part; 4, 2, 1, and 0 are the numbers of "1"s in the 1st, 2nd, 3rd, and 4th quadrants of the bit group, respectively. Since the first quadrant (the node denoted by 4 on the second level of the tree) is made up entirely of "1" bits (we call it a pure-1 quadrant), no sub-trees for it are needed. Similarly, quadrants made up entirely of "0" bits (such as the node denoted by 0 on the second level) are called pure-0 quadrants and have no sub-trees. This is, in fact, how compression is achieved [1]. Non-pure quadrants, such as the nodes 2 and 1 on the second level, are recursively partitioned further into four quadrants, with a node for each quadrant. The recursive partitioning of a node stops when it becomes pure-1 or pure-0 (eventually a node is composed of a single bit only and is trivially pure, being made up entirely of either "1" bits or "0" bits).
Figure 2. A 16-bit group converted to a P-tree.
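The quadrant decomposition of Figure 2 can be sketched over a flat list of bits as follows; this is our own minimal illustration, not the patented implementation, and the name `build_ptree` is ours.

```python
# Hypothetical sketch of basic P-tree construction by recursive
# quadrant partitioning. Pure quadrants get no sub-trees, which is
# where the compression comes from.

def build_ptree(bits):
    """Return (count_of_1s, children); children is None for pure nodes."""
    count = sum(bits)
    if count == 0 or count == len(bits) or len(bits) == 1:
        return (count, None)          # pure-0 or pure-1: stop recursing
    quarter = len(bits) // 4
    children = [build_ptree(bits[i * quarter:(i + 1) * quarter])
                for i in range(4)]
    return (count, children)

# A 16-bit group with quadrant counts 4, 2, 1, 0 and root count 7,
# matching the shape described for Figure 2.
bits = [1, 1, 1, 1,   1, 0, 1, 0,   1, 0, 0, 0,   0, 0, 0, 0]
root = build_ptree(bits)
print(root[0])                        # root count: 7
print([c[0] for c in root[1]])        # second-level counts: [4, 2, 1, 0]
```

Note that the pure-1 first quadrant and the pure-0 fourth quadrant carry no children, exactly as the text describes.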
3. DATA MANAGEMENT
3.1 Intervalization
Viewing the different text representations discussed in the introduction as a concept hierarchy, with the binary representation at one extreme and the exact-frequencies representation at the other, we suggest working somewhere along this hierarchy by using intervals. This lets us deal with a finite number of possible values, thus approaching the speed of the binary representation, while still differentiating among term frequencies in different documents at a much finer level than the binary representation, thus approaching the accuracy of the exact-frequencies representation.

Given a document-by-term matrix represented using the aforementioned TFxIDF measure, we aim to intervalize the data. To do this, we first normalize the original weighted term-frequency values into values between 0 and 1 (any other range would do); this eliminates the problems resulting from differences in document sizes. After normalization, all term values in document vectors lie in the range [0, 1], and the intervalization phase starts. First, we decide on the number of intervals and specify a range for each interval. Then we replace the term values of document vectors by their corresponding intervals, so that values are drawn from a finite set. For example, we can use a four-interval value logic: I0=[0,0], I1=(0,0.1], I2=(0.1,0.2] and I3=(0.2,1], where "(" and ")" denote exclusive bounds and "[" and "]" inclusive ones. The optimal number of intervals and their ranges depend on the type of the documents and their domain; further discussion of these parameters is environment dependent and outside the scope of this paper. Domain experts and experimentation can assist in this regard.
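The two steps above, normalization followed by interval assignment, can be sketched as follows. The paper does not specify a normalization method, so division by the document's maximum weight is our assumption, and the function names are ours.

```python
# Sketch of intervalization using the paper's example four-interval
# value logic: I0=[0,0], I1=(0,0.1], I2=(0.1,0.2], I3=(0.2,1].

def normalize(vector):
    """Scale a TFxIDF vector into [0, 1]; max-scaling is an assumption."""
    top = max(vector)
    return [v / top if top else 0.0 for v in vector]

def intervalize(value):
    """Map a normalized weight to one of the intervals I0..I3."""
    if value == 0:
        return "I0"                  # I0 = [0, 0]: term absent
    if value <= 0.1:
        return "I1"                  # I1 = (0, 0.1]
    if value <= 0.2:
        return "I2"                  # I2 = (0.1, 0.2]
    return "I3"                      # I3 = (0.2, 1]

doc = [0.0, 1.2, 6.0, 3.0]           # raw TFxIDF weights for four terms
print([intervalize(v) for v in normalize(doc)])
```

After this step every document vector is drawn from the finite alphabet {I0, I1, I2, I3}, which is what makes the bit-level P-tree encoding practical.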
P-tree algebra includes operations such as AND, OR, NOT (or complement), and ROOTCOUNT (a count of the number of "1"s in the tree). Details of these operations can be found in [1]. The latest benchmark on P-tree ANDing has shown a speed of 6 milliseconds for ANDing two P-trees representing bit groups of 16 million bits each. Speed and compression aspects of P-trees are discussed in greater detail in [1]; [2], [4] and [5] give some applications exploiting the P-tree technology. Once we have represented our data using P-trees, no scans of the database are needed to perform text categorization, as we shall demonstrate later. This is, in fact, one of the most important aspects of the P-tree technology.
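For intuition only, the algebra can be sketched over uncompressed bit groups; real P-trees evaluate these operations directly on the compressed quadrant trees, but the results are the same.

```python
# Sketch of P-tree algebra semantics over flat bit groups (our own
# illustration; the actual operations run on compressed trees).

def p_and(a, b):
    return [x & y for x, y in zip(a, b)]

def p_or(a, b):
    return [x | y for x, y in zip(a, b)]

def p_not(a):
    return [1 - x for x in a]          # complement

def root_count(a):
    """ROOTCOUNT: the number of "1"s represented by the tree."""
    return sum(a)

a = [1, 1, 0, 0, 1, 0, 1, 1]
b = [1, 0, 0, 1, 1, 0, 0, 1]
print(root_count(p_and(a, b)))         # tuples where both bits are 1
```

A ROOTCOUNT over an ANDed tree answers "how many tuples satisfy both predicates" without touching the underlying records, which is why classification needs no database scan.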
2 Formerly known as the Peano Count Tree.
3 It is worth noting that good P-tree compression is achieved when the data is very sparse (which increases the chances of long runs of "0" bits) or very dense (which increases the chances of long runs of "1" bits).
After normalization, each interval is defined over a range of values. The ranges are chosen in an ordered, consecutive manner so that the corresponding intervals preserve the same intrinsic order among themselves. Consider the example interval set used previously in this section: I0=[0,0], I1=(0,0.1], I2=(0.1,0.2] and I3=(0.2,1]. We know that [0,0] < (0,0.1]