Text classification using Lattice Machine

Hui Wang¹ and Nguyen Hung Son²

¹ School of Information and Software Engineering, University of Ulster, Newtownabbey, BT37 0QB, N. Ireland
[email protected]
² Institute of Mathematics, Warsaw University, Banacha 2 Str., 02-095 Warsaw, Poland
[email protected]
Abstract. A novel approach to supervised learning, called Lattice Machine, was proposed in [5]. In the Lattice Machine, it was assumed that data are structured as relations. In this paper we investigate the application of the Lattice Machine to text classification, where textual data are unstructured. We represent a set of textual documents as a collection of Boolean feature vectors, where each vector corresponds to one document and each entry indicates whether a particular term appears in the document. This is a common representation of textual documents. We show that, using this representation, the Lattice Machine's operations are simply set-theoretic operations. In particular, the lattice sum operation is simply set intersection and the ordering relation is simply set inclusion. Experiments show that the Lattice Machine, under this configuration, is quite competitive with state-of-the-art learning algorithms for text classification.
1 Introduction

The Lattice Machine, proposed in [5], is a general framework for supervised learning. Two of its components can be specified to suit different situations. In [5] the Lattice Machine was presented to work on structured data, in the form of database relations. In this paper we re-configure the Lattice Machine so that it works on unstructured textual data. We will show that, adopting the common Boolean feature vector representation of textual documents, two of the Lattice Machine's components, the ordering relation and the sum operation, are simply set inclusion and set intersection respectively. This approach will be validated by experiments on benchmark (textual) datasets, and it will also be compared with similar approaches.
In the rest of the paper, we first present a brief review of the Lattice Machine. Then we specify the Lattice Machine for use with unstructured textual data. We present a detailed example to illustrate the configured Lattice Machine, along with experimental results. Finally we summarise and conclude the paper.
2 A brief review of the Lattice Machine

Given a dataset represented as a database relation, a tuple is a vector of attribute values, a hyper tuple is a vector of sets of attribute values, and a hyper relation is a set of hyper tuples. The notion of hyper tuple is a generalisation of database tuple. Examples of a simple relation and a hyper relation are shown in Tables 1 and 2 respectively. It has been shown [5] that an elegant structure is implied among hyper tuples: the collection of all hyper tuples in a given domain is a semilattice under the following ordering relation:

hyper tuple1 ≤ hyper tuple2 ⟺def hyper tuple1(A) ⊆ hyper tuple2(A) for all A ∈ U,

where U is the set of attributes. We call this the domain lattice. In a domain lattice, a labelled dataset (training data) corresponds to a labelling of the lattice. For example, the following labelled dataset corresponds to the labelled lattice in Figure 1:

E: ⟨Large, Red, Triangle, +⟩
G: ⟨Large, Blue, Circle, +⟩
J: ⟨Small, Blue, Triangle, −⟩
L: ⟨Small, Green, Rectangular, −⟩

Given a labelled lattice, a straightforward representation of the labelling is by sets, each set consisting of all data units with the same label, and a simple classification is by enumeration: if a new data unit is the same as one element in the set representation, then it is classified by the label of this element; otherwise, it is marked as unknown. This is in fact the idea of rote learning. Clearly there is no generalisation in this kind of learning. The nearest neighbour method goes one step further. In its basic form, the representation is also by sets, but the classification is based on some distance measure. There is generalisation in this kind of learning, but distance measures can be troublesome in cases where discrete attributes are involved. The Lattice Machine takes a different approach to this problem. It represents the labelling by a subset of the elements in the lattice, usually much smaller than the number of labelled elements. These elements have the property that all elements below them have the same label or are unlabelled, and we call them equilabelled. Classification is governed by the following simple rule: given an equilabelled element, any element below it (if any) will have the same label as the equilabelled element.
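To make the ordering relation and the sum operation on hyper tuples concrete, here is a minimal Python sketch. It assumes hyper tuples are represented as dicts from attribute names to sets of values; the function names are illustrative, not from [5].

```python
# Minimal sketch (not the authors' code): hyper tuples as dicts mapping
# attribute names to sets of values.

def hyper_le(t1, t2):
    """t1 <= t2 iff t1(A) is a subset of t2(A) for every attribute A."""
    return all(t1[a] <= t2[a] for a in t1)   # set <= set is the subset test

def hyper_sum(t1, t2):
    """Lattice sum (least upper bound): attribute-wise union of the value sets."""
    return {a: t1[a] | t2[a] for a in t1}

t0 = {"A1": {"a"}, "A2": {1}}
t3 = {"A1": {"b"}, "A2": {1}}
print(hyper_sum(t0, t3))                 # {'A1': {'a', 'b'}, 'A2': {1}}
print(hyper_le(t0, hyper_sum(t0, t3)))   # True: a tuple always lies below its sum
```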
The hyper relation in Table 2 is in fact the set of equilabelled elements found by the Lattice Machine from the relation in Table 1. The core of the Lattice Machine is the following algorithm [5]. Let D be a labelled dataset. The labelling implies a natural partition of the dataset into classes {M_0, M_1, ..., M_n}. Given the set M_i of labelled elements for class i, the algorithm finds a set H_i of equilabelled maximal elements (maximal in the sense that the sum of any pair of them is no longer equilabelled). The algorithm is based on the lattice sum operation (+) for finding the unique least upper bound of a set of elements.

Let E be the set of all (possible) equilabelled elements in a labelled lattice.

1. C_1 =def M_i.
2. C_{k+1} =def the set of maximal elements of [↓(C_k + M_i)] ∩ E, where C_k + M_i denotes {c + m : c ∈ C_k, m ∈ M_i}.

Note that, for a lattice L and e ∈ L, ↓e = {y ∈ L : y ≤ e}. It has been shown [5] that there is some n such that C_n = C_{n+1}, and therefore C_n = C_r for all r ≥ n. It has also been proved that C_n = H_i. This H_i is called the interior of class M_i.

ID   A1   A2   Class
t0   a    1    0
t1   a    2    1
t2   a    3    1
t3   b    1    0
t4   b    2    1
t5   b    3    1
t6   c    2    0
t7   c    3    0
Table 1. A simple (database) relation.

ID   A1       A2       Class
t0   {a, b}   {1}      0
t1   {a, b}   {2, 3}   1
t2   {c}      {2, 3}   0
Table 2. A hyper relation obtained from the relation in Table 1 by the Lattice Machine.
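As an illustration of this iteration, the following sketch (my own, simplified) computes the interiors for the relation in Table 1, representing hyper tuples as pairs of frozensets. For simplicity, the candidate set at each step is restricted to the sums of C_k with M_i rather than the full down-set ↓(C_k + M_i) ∩ E used in [5]; on this small example the result reproduces Table 2.

```python
# Simplified, illustrative sketch of the interior-finding iteration (not the
# authors' implementation).  Hyper tuples are pairs of frozensets of values.

def h_sum(t1, t2):
    """Lattice sum: attribute-wise union of the value sets."""
    return tuple(a | b for a, b in zip(t1, t2))

def h_le(t1, t2):
    """t1 <= t2 iff every value set of t1 is contained in the corresponding set of t2."""
    return all(a <= b for a, b in zip(t1, t2))

def maximal(elems):
    """Elements of 'elems' that are not strictly below another element."""
    return {x for x in elems if not any(x != y and h_le(x, y) for y in elems)}

def interior(M_i, other_classes):
    """Iterate C_{k+1} = maximal equilabelled sums of C_k with M_i until a fixpoint.
    A candidate is equilabelled if no tuple of another class lies below it."""
    equilabelled = lambda h: not any(h_le(t, h) for t in other_classes)
    C = maximal(set(M_i))
    while True:
        sums = {h_sum(c, m) for c in C for m in M_i}
        C_next = maximal({h for h in sums if equilabelled(h)})
        if C_next == C:
            return C_next
        C = C_next

def tup(a1, a2):
    """Lift a simple tuple of Table 1 to a hyper tuple."""
    return (frozenset([a1]), frozenset([a2]))

class0 = [tup('a', 1), tup('b', 1), tup('c', 2), tup('c', 3)]   # class 0 of Table 1
class1 = [tup('a', 2), tup('a', 3), tup('b', 2), tup('b', 3)]   # class 1 of Table 1

print(interior(class0, class1))   # corresponds to the class-0 rows of Table 2
print(interior(class1, class0))   # corresponds to the class-1 row of Table 2
```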
[Figure 1, a diagram of the domain lattice with top element 1, bottom element 0 and labelled elements A to L, is not reproduced in this text version.]
Fig. 1. A labelled lattice.
3 Lattice Machine for text classification

The Lattice Machine is a general framework for supervised learning. Its basic elements can be re-configured to suit different learning problems. In this section we show how to re-configure the Lattice Machine for the task of text classification.
3.1 Representation

The Lattice Machine was originally proposed to work on structured data. Text documents, however, are unstructured. A common representation of a text document is as a Boolean feature vector. For example, one could construct a feature for each word appearing in the corpus (the set of documents). A corpus with m documents and n words would thus be represented as an m × n matrix. This representation in fact turns unstructured datasets into structured ones. However, it is inefficient in its use of space, since even a moderately sized corpus usually contains many different words. The Lattice Machine has a natural solution to this problem. Given the two documents of the same class in Table 3, we first represent them by Boolean feature vectors as in Table 4. Merging the two vectors by the lattice sum operation (either set union or set intersection, depending on the ordering relation), we get the new vectors in Table 5. Translating the new vectors back into word lists, we get Table 6. It is clear that the lattice sum operation is equivalent to taking the documents as sets of words and applying the corresponding set operation (union or intersection). Therefore, for the task of text classification, we represent text documents as sets (lists) of words.
England, Ireland, Poland, Scotland, Wales   Europe
Czech, Ireland, Poland, Slovia              Europe
Table 3. Two documents of the same class.

Czech   England   Ireland   Poland   Scotland   Slovia   Wales   Class
0       1         1         1        1          0        1       Europe
1       0         1         1        0          1        0       Europe
Table 4. A Boolean feature vector representation of the documents in Table 3.

Czech   England   Ireland   Poland   Scotland   Slovia   Wales   Class
1       1         1         1        1          1        1       Europe
0       0         1         1        0          0        0       Europe
Table 5. The sum of the two tuples in Table 4. The first tuple is obtained by set union, and the second tuple by set intersection. Whether these sums can be accepted depends on the data in other classes.

Czech, England, Ireland, Poland, Scotland, Slovia, Wales   Europe
Ireland, Poland                                            Europe
Table 6. Set representation of the results merged by the lattice sum operation.
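The two representations, and the effect of merging, can be reproduced with ordinary Python sets. This is only a sketch; the variable names are mine, not from the paper.

```python
# Sketch of the representations in Tables 3-6 (variable names are illustrative).
doc1 = {"England", "Ireland", "Poland", "Scotland", "Wales"}   # class: Europe
doc2 = {"Czech", "Ireland", "Poland", "Slovia"}                # class: Europe

# Boolean feature vectors over the corpus vocabulary (Table 4).
vocab = sorted(doc1 | doc2)
vec1 = [int(w in doc1) for w in vocab]
vec2 = [int(w in doc2) for w in vocab]

# Merging by the lattice sum: union or intersection, depending on the ordering.
sum_by_union = doc1 | doc2          # Tables 5/6, first tuple
sum_by_intersection = doc1 & doc2   # Tables 5/6, second tuple: {'Ireland', 'Poland'}
```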
3.2 The ordering relation

In the original Lattice Machine the ordering relation is as follows:

hyper tuple1 ≤ hyper tuple2 ⟺def hyper tuple1(A) ⊆ hyper tuple2(A) for all attributes A.

In fact the set inclusion can be changed from ⊆ to ⊇, but the sum operation then has to be changed accordingly. In the context of text classification, if the documents are relatively large, so that different documents have a large number of common elements, it is more efficient to use ⊇ rather than ⊆ to define the ordering relation. This amounts to classifying a document using a subset of the words in the document, either in a fixed order or in any order. If the documents are relatively small, however, it is more effective to use ⊆ instead. This amounts to classifying a document according to whether it is a subset of an already classified document. In what follows, our presentation is mainly for ⊇. Where a result does not apply to ⊆, we will clearly spell it out, and an alternative solution will be provided accordingly.

Since we choose to represent a document by its set of words, the ordering relation can therefore be specified as follows:

doc1 ≤ doc2 ⟺def set_of_words(doc1) ⊇ set_of_words(doc2),

where set_of_words(doc_i) is the set of words in doc_i.
3.3 The sum operation

Now that documents are represented by sets (word lists) and the ordering relation is (inverse) set inclusion, the sum operation (denoted by +) is simply set intersection. For example, the sum of the two documents in Table 3 is their intersection as sets, which results in a new document (i.e., set) {Ireland, Poland}.
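A minimal sketch of this configuration (my own naming), with the classification rule of Section 2 specialised to word sets:

```python
# Sketch of the text-classification configuration (names are illustrative).

def below(doc1, doc2):
    """doc1 <= doc2 in the domain lattice iff the words of doc1 include those of
    doc2 (inverse set inclusion)."""
    return doc1 >= doc2

def lattice_sum(doc1, doc2):
    """Least upper bound under the ordering above: set intersection."""
    return doc1 & doc2

def classify(doc, interiors):
    """Labels of the interior elements the document lies below, i.e. those whose
    word set is contained in the document."""
    return {label for words, label in interiors if below(doc, words)}

table3 = [frozenset({"England", "Ireland", "Poland", "Scotland", "Wales"}),
          frozenset({"Czech", "Ireland", "Poland", "Slovia"})]
print(lattice_sum(*table3))                                      # {'Ireland', 'Poland'}
print(classify({"Ireland", "Poland", "Wales"},
               [(frozenset({"Ireland", "Poland"}), "Europe")]))  # {'Europe'}
```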
4 An example

Now we use a sample of the MEMO data, shown in Table 7, to illustrate our solution to the text classification problem. This dataset was used in [2]. It has 11 classes, but we use only samples from two of them. Each line corresponds to one document and is the list of word stems in the document (stemmed with the Porter stemming algorithm [3, 6]). Since the documents have been processed, they may no longer be readable to humans; the same holds for the interiors to be found later. For the "commun" class, we sum all pairs of documents using + and we get the hyper tuples in Table 8. The set of maximal equilabelled hyper tuples is shown in Table 9. Applying the same operation to this table, it is clear that the result is the same as the table itself. Therefore this is the interior of the "commun" class. The same is done for the "comput" class, and the interior for this class is also shown in Table 9.
[1]  97 doc code r area 02 18 n america by state lewen                 commun
[2]  97 j code text area 02 18 n america 2 memo by state megibow       commun
[3]  97 doc at t answer machin code 10 15 command for 1343 dave diehl  commun
[4]  97 doc at t answer machin control code 10 15 1545 r frisbi        commun
[5]  97 doc 11 at t answer machin 1339 control code text michael choi  commun
[6]  97 at code text r 02 18 2 hard drive rev sutherland               comput
[7]  97 code text r 02 18 2 xt hard drive rev sutherland               comput
[8]  97 code text 02 18 ibm irq and pc diagnost w taucher              comput
[9]  97 doc j at command 04 09 modem karam                             comput
[10] 97 doc j code 02 18 7 error system mac 20k simpson                comput
Table 7. A sample of the MEMO data.

4.1 Discussion

Applying the classification rule from Section 2, it can easily be verified that the interiors correctly classify all the documents in Table 7. However, the generalisation ability may not be very high, as the dataset is very small compared with the original dataset. The generalisation ability can be seen from the experimental results presented in the next section. Further processing can be done to simplify the interiors obtained by the Lattice Machine. For example, since "97" appears in all interiors for both classes, it can be safely removed without affecting the classification performance. Taking this one step further, we can probably remove those words that appear in almost all interiors.
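One way to realise the simplification suggested above, as a rough sketch (the interiors below are copied from Table 9, and the removal criterion is my own illustration):

```python
# Rough sketch: drop any term that occurs in every interior (here only "97"),
# since such a term cannot discriminate between the classes.
interiors = {
    "commun": [{"97", "code", "area", "02", "18", "n", "america", "by", "state"},
               {"97", "doc", "code", "r"},
               {"97", "doc", "at", "t", "answer", "machin", "code"}],
    "comput": [{"97", "code", "text", "r", "02", "18", "2", "hard", "drive", "rev", "sutherland"},
               {"97", "code", "text", "02", "18", "ibm", "irq", "and", "pc", "diagnost", "w", "taucher"},
               {"97", "doc", "j"}],
}
common = set.intersection(*(h for hs in interiors.values() for h in hs))   # {'97'}
simplified = {cls: [h - common for h in hs] for cls, hs in interiors.items()}
```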
5 Experimental results

To evaluate the Lattice Machine as a tool for text classification, we applied it to nine benchmark problems which were used in [2]. Table 10 lists the problems and gives a summary description of each domain, the number of classes and number of terms found in each problem, and the number of training and testing examples used in the experiment. On smaller problems 10-fold cross-validation was used, and on larger problems a single holdout set of the specified size was used. The experimental results are listed in Table 11. As a comparison, we cite from [2] experimental results on the same set of problems using two state-of-the-art learning algorithms, C4.5 [4] and RIPPER [1].
6 Conclusion

The Lattice Machine is a general framework for supervised learning. Its basic components can be re-configured to suit different learning requirements. Previously the Lattice Machine worked on structured data (i.e., feature vector based). In this paper the Lattice Machine is re-configured so that it can be used for unstructured textual data. Under this re-configuration, the sum operation is simply set intersection, and the ordering relation is simply set inclusion. The experiments show that the Lattice Machine under this configuration is quite competitive with state-of-the-art methods for text classification.
[1+1]   97 doc code r area 02 18 n america by state lewen                 commun
[1+2]   97 code area 02 18 n america by state                             commun
[1+3]   97 doc code                                                       commun
[1+4]   97 doc code r                                                     commun
[1+5]   97 doc code                                                       commun
[2+2]   97 j code text area 02 18 n america 2 memo by state megibow       commun
[2+3]   97 code                                                           commun
[2+4]   97 code                                                           commun
[2+5]   97 code text                                                      commun
[3+3]   97 doc at t answer machin code 10 15 command for 1343 dave diehl  commun
[3+4]   97 doc at t answer machin code 10 15                              commun
[3+5]   97 doc at t answer machin code                                    commun
[4+4]   97 doc at t answer machin control code 10 15 1545 r frisbi        commun
[4+5]   97 doc at t answer machin control code                            commun
[5+5]   97 doc 11 at t answer machin 1339 control code text michael choi  commun
[6+6]   97 at code text r 02 18 2 hard drive rev sutherland               comput
[6+7]   97 code text r 02 18 2 hard drive rev sutherland                  comput
[6+8]   97 code text 02 18                                                comput
[6+9]   97 at                                                             comput
[6+10]  97 code 02 18                                                     comput
[7+7]   97 code text r 02 18 2 xt hard drive rev sutherland               comput
[7+8]   97 code text 02 18                                                comput
[7+9]   97                                                                comput
[7+10]  97 code 02 18                                                     comput
[8+8]   97 code text 02 18 ibm irq and pc diagnost w taucher              comput
[8+9]   97                                                                comput
[8+10]  97 code 02 18                                                     comput
[9+9]   97 doc j at command 04 09 modem karam                             comput
[9+10]  97 doc j                                                          comput
[10+10] 97 doc j code 02 18 7 error system mac 20k simpson                comput
Table 8. The sum of all pairs of tuples in Table 7.

[1+2]   97 code area 02 18 n america by state                 commun
[1+4]   97 doc code r                                         commun
[3+5]   97 doc at t answer machin code                        commun
[6+7]   97 code text r 02 18 2 hard drive rev sutherland      comput
[8+8]   97 code text 02 18 ibm irq and pc diagnost w taucher  comput
[9+10]  97 doc j                                              comput
Table 9. Maximal equilabelled hyper tuples (interiors) obtained from Table 8.
Datasets  #Train  #Test  #Classes  #Terms  Text-valued field                 Label
memos     334     10cv   11        1014    document title                    category
cdroms    798     10cv   6         1133    CD-Rom game name                  category
birdcom   914     10cv   22        674     common name of bird               phylogenic order
birdsci   914     10cv   22        1738    common + scientific name of bird  phylogenic order
hcoarse   1875    600    126       2098    company name                      industry (coarse grain)
hfine     1875    600    228       2098    company name                      industry (fine grain)
books     3501    1800   63        7019    book title                        subject heading
species   3119    1600   6         7231    animal name                       phylum
netvet    3596    2000   14        5460    URL title                         category
Table 10. Description of benchmark problems.

Dataset   default  RIPPER  C4.5  LM/Text
memos     19.8     50.9    57.5  59.8
cdroms    26.4     38.3    39.2  40.0
birdcom   42.8     88.8    79.6  90.4
birdsci   42.8     91.0    83.3  92.3
hcoarse   11.3     28.0    30.2  20.2
hfine     4.4      16.5    17.2  13.8
books     5.7      42.3    52.2  45.3
species   51.8     90.6    89.4  89.4
netvet    22.4     67.1    68.8  67.8
Average            57.1    57.5  57.7
Table 11. Experimental results of the Lattice Machine, along with those of C4.5 and RIPPER cited from [2].
References

1. W. W. Cohen. Fast effective rule induction. In Machine Learning: Proceedings of the Twelfth International Conference. Morgan Kaufmann, 1995.
2. W. W. Cohen and H. Hirsh. Joins that generalize: Text classification using WHIRL. In Proc. KDD-98, New York, 1998. http://www.research.att.com/~wcohen/.
3. M. Porter. An algorithm for suffix stripping. Program, 14(3):130-137, 1980.
4. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, 1993.
5. H. Wang, I. Düntsch, and D. Bell. Data reduction based on hyper relations. In Proceedings of KDD-98, New York, pages 349-353, 1998. http://www.infj.ulst.ac.uk/~cbcj23/latmach.html.
6. J. Xu and W. B. Croft. Corpus-based stemming using co-occurrence of word variants. ACM TOIS, 16(1):61-81, Jan. 1998.