Dimension Reduction for Text Data Representation Based on Cluster Structure Preserving Projection

Haesun Park, Moongu Jeon, and Peg Howland

Technical Report TR 01-013
Department of Computer Science and Engineering, University of Minnesota
4-192 EECS Building, 200 Union Street SE, Minneapolis, MN 55455-0159, USA

March 5, 2001
Abstract

In today's vector space information retrieval systems, dimension reduction is imperative for efficiently manipulating the massive quantity of data. To be useful, this lower dimensional representation must be a good approximation of the full document set. To that end, we adapt and extend the discriminant analysis projection used in pattern recognition. This projection preserves cluster structure by maximizing the scatter between clusters while minimizing the scatter within clusters. A limitation of discriminant analysis is that one of its scatter matrices must be nonsingular, which restricts its application to document sets in which the number of terms does not exceed the number of documents. We show that by using the generalized singular value decomposition (GSVD), we can achieve the same goal regardless of the relative dimensions of our data. We also show that, for $k$ clusters, the generalized right singular vectors that correspond to the $k - 1$ largest generalized singular values are all we need to compute the optimal transformation to the reduced dimension. In addition, applying the GSVD allows us to avoid the explicit formation of the scatter matrices in favor of working directly with the data matrix, thus improving the numerical properties of the approach. Finally, we present experimental results that confirm the effectiveness of our approach.

The work of all three authors was supported in part by National Science Foundation grant CCR-9901992.

Haesun Park, Dept. of Computer Science and Engineering, Univ. of Minnesota, Minneapolis, MN 55455, U.S.A. ([email protected])
Moongu Jeon, Dept. of Computer Science and Engineering, Univ. of Minnesota, Minneapolis, MN 55455, U.S.A. ([email protected])
Peg Howland, Dept. of Computer Science and Engineering, Univ. of Minnesota, Minneapolis, MN 55455, U.S.A. ([email protected])
1 Introduction

The vector space information retrieval system, originated by G. Salton [12, 13], represents documents as vectors in a vector space. The document set comprises an $m \times n$ term-document matrix $A$, in which each column represents a document, and each entry $a_{ij}$ represents the weighted frequency of term $i$ in document $j$. A major benefit of this representation is that methods developed in linear algebra for use in other application areas can be applied to document retrieval as well [1].

Modern document sets are huge [3], so we need to find a lower dimensional representation of the data. To achieve higher efficiency in manipulating the data, it is often necessary to reduce the dimension severely. Since this may result in loss of information, we seek a representation in the lower dimensional space that best approximates the full term-document matrix [7, 11].

The specific method we present in this paper is based on the discriminant analysis projection used in pattern recognition [4, 14]. Its goal is to find the mapping that transforms each column of $A$ into a column in the lower dimensional space, while preserving the cluster structure of the full data matrix. This is accomplished by forming scatter matrices from $A$, the traces of which provide measures of the quality of the cluster relationship. After defining the optimization criterion in terms of these scatter matrices, the problem can be expressed as a generalized eigenvalue problem.

As we explain in the next section, the discriminant analysis approach can only be applied in the case where $m \le n$, i.e., when the number of terms does not exceed the number of documents. By recasting the generalized eigenvalue problem in terms of a related generalized singular value problem, we circumvent this restriction on the relative dimensions of $A$, thus extending the applicability within information retrieval. At the same time, we improve the numerical properties of the approach by working with $A$ directly rather than forming the scatter matrices explicitly. Our algorithm follows the generalized singular value decomposition (GSVD) [2, 5, 15] as formulated by Paige and Saunders [10]. For a data matrix with $k$ clusters, we limit our computation to the generalized right singular vectors that correspond to the $k - 1$ largest generalized singular values. In this way, our algorithm remains computationally simple while achieving its goal of preserving cluster structure. Experimental results demonstrating its effectiveness are described in the last section of the paper.
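As a small, purely illustrative sketch of the vector space representation described above (not taken from the report; the toy documents and the raw-count weighting are assumptions), a term-document matrix can be built as follows.

```python
import numpy as np

# Toy corpus; the documents and vocabulary here are purely illustrative.
docs = ["heart attack risk", "colon cancer risk", "tooth decay", "heart disease risk"]
terms = sorted({w for d in docs for w in d.split()})

# m x n term-document matrix: rows are terms, columns are documents,
# and each entry a_ij is a (here unweighted) frequency of term i in document j.
A = np.array([[d.split().count(t) for d in docs] for t in terms], dtype=float)

print(terms)   # row labels (terms)
print(A)       # columns are the document vectors
```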
2 Dimension Reduction Based on Discriminant Analysis

Given a term-document matrix $A \in \mathbb{R}^{m \times n}$, the general problem we consider is to find a linear transformation $G^T \in \mathbb{R}^{l \times m}$ that maps each column $a_i$, $1 \le i \le n$, of $A$ in the $m$-dimensional space to a column $y_i$ in the $l$-dimensional space:

$$G^T : a_i \in \mathbb{R}^{m \times 1} \to y_i \in \mathbb{R}^{l \times 1}. \qquad (1)$$
Rather than looking for the mapping that achieves this explicitly, one may rephrase this as an approximation problem where the given matrix $A$ has to be decomposed into two matrices $B$ and $Y$ as

$$A \approx BY, \qquad (2)$$

where both $B \in \mathbb{R}^{m \times l}$ with $\mathrm{rank}(B) = l$ and $Y \in \mathbb{R}^{l \times n}$ with $\mathrm{rank}(Y) = l$ are to be found. The problem (2) can be interpreted as a minimization problem [2]

$$\min_{B,\,Y} \|BY - A\|_F.$$

When the matrix $B$ is given, the solution satisfies the normal equation

$$B^T B Y = B^T A,$$

and therefore

$$Y = (B^T B)^{-1} B^T A. \qquad (3)$$

Since (3) gives a relationship between $A$ and $Y$, we can directly define the linear transformation $G^T$ as

$$G^T = (B^T B)^{-1} B^T \in \mathbb{R}^{l \times m}.$$

If the matrix $B$ has orthonormal columns, then since $B^T B = I$, we have $G = B$.
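For concreteness, here is a minimal numpy sketch (an illustration with assumed random data, not code from the report) of the fixed-$B$ subproblem: it solves $\min_Y \|BY - A\|_F$ with a least squares solver rather than forming $(B^T B)^{-1}$ explicitly, and checks the orthonormal-columns special case $Y = B^T A$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, l = 50, 100, 7             # illustrative sizes
A = rng.standard_normal((m, n))  # stand-in for a term-document matrix
B = rng.standard_normal((m, l))  # a given rank-l factor

# For fixed B, the minimizer of ||BY - A||_F satisfies B^T B Y = B^T A,
# i.e., Y = (B^T B)^{-1} B^T A; lstsq solves this column-wise and more stably.
Y, *_ = np.linalg.lstsq(B, A, rcond=None)

# When B has orthonormal columns, G = B and Y = B^T A.
Q, _ = np.linalg.qr(B)
print(np.allclose(Q.T @ A, np.linalg.lstsq(Q, A, rcond=None)[0]))
```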
Now our goal is to find a linear transformation so that the clustering result in the full dimensional space is preserved in the reduced dimensional space, assuming that the given data are already clustered. For this purpose, we first need to formulate a measure of cluster validity. To have high cluster validity, a specific clustering result must have a tight within-cluster relationship while the between-cluster relationship has to be remote. To define this measure, in discriminant analysis [4, 14], within-cluster, between-cluster, and mixture scatter matrices are defined as follows.

For simplicity of discussion, we will assume that the given data matrix $A \in \mathbb{R}^{m \times n}$ is clustered into $k$ clusters as

$$A = [E_1\ E_2\ \cdots\ E_k], \quad \text{where} \quad E_i \in \mathbb{R}^{m \times n_i} \quad \text{and} \quad \sum_{i=1}^{k} n_i = n.$$

Let $J_i$ denote the set of column indices that belong to cluster $i$. The centroid $c^{(i)}$ of each cluster $E_i$ is computed by taking the average of the columns in $E_i$, i.e.,

$$c^{(i)} = \frac{1}{n_i} E_i e^{(i)}, \quad \text{where} \quad e^{(i)} = (1, \ldots, 1)^T \in \mathbb{R}^{n_i \times 1},$$

and the grand centroid is

$$c = \frac{1}{n} A e, \quad \text{where} \quad e = (1, \ldots, 1)^T \in \mathbb{R}^{n \times 1}.$$
Then the within-cluster scatter matrix $S_w$ is defined as

$$S_w = \sum_{i=1}^{k} \sum_{j \in J_i} (a_j - c^{(i)})(a_j - c^{(i)})^T,$$

and the between-cluster scatter matrix $S_b$ is defined as

$$S_b = \sum_{i=1}^{k} \sum_{j \in J_i} (c^{(i)} - c)(c^{(i)} - c)^T = \sum_{i=1}^{k} n_i (c^{(i)} - c)(c^{(i)} - c)^T.$$

Finally, the mixture scatter matrix is defined as

$$S_m = \sum_{i=1}^{n} (a_i - c)(a_i - c)^T,$$

and it can be shown [6] that the scatter matrices have the relationship

$$S_m = S_w + S_b. \qquad (4)$$
Defining the matrices

$$H_w = [E_1 - c^{(1)} e^{(1)T},\ E_2 - c^{(2)} e^{(2)T},\ \ldots,\ E_k - c^{(k)} e^{(k)T}] \in \mathbb{R}^{m \times n}, \qquad (5)$$

$$H_b = [\sqrt{n_1}\,(c^{(1)} - c),\ \sqrt{n_2}\,(c^{(2)} - c),\ \ldots,\ \sqrt{n_k}\,(c^{(k)} - c)] \in \mathbb{R}^{m \times k}, \qquad (6)$$

and

$$H_m = [a_1 - c,\ \ldots,\ a_n - c] = A - c e^T \in \mathbb{R}^{m \times n}, \qquad (7)$$

the scatter matrices can be expressed as

$$S_w = H_w H_w^T, \qquad S_b = H_b H_b^T, \qquad \text{and} \qquad S_m = H_m H_m^T. \qquad (8)$$
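The following sketch (illustrative only; the random data and cluster labels are assumptions) forms $H_w$, $H_b$, and $H_m$ as in Eqns. (5)-(7) and verifies the identities (8) and (4) numerically.

```python
import numpy as np

def scatter_factors(A, labels):
    """Form H_w (m x n), H_b (m x k), H_m (m x n) as in Eqns. (5)-(7).
    A: m x n data matrix with columns as documents; labels: length-n cluster ids."""
    labels = np.asarray(labels)
    m, n = A.shape
    c = A.mean(axis=1, keepdims=True)              # grand centroid
    classes = np.unique(labels)
    Hw = np.empty_like(A, dtype=float)
    Hb = np.empty((m, len(classes)))
    for i, lab in enumerate(classes):
        Ei = A[:, labels == lab]
        ci = Ei.mean(axis=1, keepdims=True)        # cluster centroid c^(i)
        Hw[:, labels == lab] = Ei - ci             # E_i - c^(i) e^(i)T
        Hb[:, i] = np.sqrt(Ei.shape[1]) * (ci - c).ravel()
    Hm = A - c                                     # A - c e^T
    return Hw, Hb, Hm

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 100))
labels = rng.integers(0, 7, size=100)
Hw, Hb, Hm = scatter_factors(A, labels)
Sw, Sb, Sm = Hw @ Hw.T, Hb @ Hb.T, Hm @ Hm.T       # Eqn. (8)
print(np.allclose(Sm, Sw + Sb))                    # Eqn. (4)
```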
Note that another way to define $H_b$ is

$$H_b = [(c^{(1)} - c)\,e^{(1)T},\ (c^{(2)} - c)\,e^{(2)T},\ \ldots,\ (c^{(k)} - c)\,e^{(k)T}] \in \mathbb{R}^{m \times n},$$

but using the lower dimensional form in (6) reduces the storage requirements and computational complexity of our algorithm.

Now, $\mathrm{trace}(S_w)$, which is

$$\mathrm{trace}(S_w) = \sum_{i=1}^{k} \sum_{j \in J_i} (a_j - c^{(i)})^T (a_j - c^{(i)}) = \sum_{i=1}^{k} \sum_{j \in J_i} \|a_j - c^{(i)}\|_2^2, \qquad (9)$$

provides a measure of the closeness of the columns within their clusters over all $k$ clusters, and $\mathrm{trace}(S_b)$, which is

$$\mathrm{trace}(S_b) = \sum_{i=1}^{k} \sum_{j \in J_i} (c^{(i)} - c)^T (c^{(i)} - c) = \sum_{i=1}^{k} \sum_{j \in J_i} \|c^{(i)} - c\|_2^2, \qquad (10)$$

provides a measure of the distance between clusters. When items within each cluster are located tightly around its own cluster centroid, $\mathrm{trace}(S_w)$ will have a small value. On the other hand, when the between-cluster relationship is remote, and hence the centroids of the clusters are remote, $\mathrm{trace}(S_b)$ will have a large value. Using the values $\mathrm{trace}(S_w)$ and $\mathrm{trace}(S_b)$, together with the relationship in Eqn. (4), the cluster quality can be measured.

A number of different criteria for measuring cluster quality have been proposed. In general, when $\mathrm{trace}(S_b)$ is large while $\mathrm{trace}(S_w)$ is small, or $\mathrm{trace}(S_m)$ is large while $\mathrm{trace}(S_w)$ is small, we can expect that the clusters of different classes are well separated while the items within each cluster are tightly related, and therefore the cluster quality is high. An optimal clustering would maximize $\mathrm{trace}(S_b)$ and minimize $\mathrm{trace}(S_w)$. However, finding a clustering with such qualities is difficult, since it involves simultaneous maximization and minimization. There are several alternative criteria that involve the three scatter matrices, such as maximization of

$$J_1 = \mathrm{trace}(S_w^{-1} S_b) \qquad (11)$$

or

$$J_2 = \mathrm{trace}(S_w^{-1} S_m). \qquad (12)$$
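A small numpy sketch (again with assumed data, not from the report) of the trace measures (9) and (10), computed as sums of squared norms without forming the scatter matrices; the final quotient anticipates the cluster-quality measure used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((50, 100))        # illustrative clustered data
labels = rng.integers(0, 7, size=100)
c = A.mean(axis=1, keepdims=True)         # grand centroid

trace_Sw = trace_Sb = 0.0
for lab in np.unique(labels):
    Ei = A[:, labels == lab]
    ci = Ei.mean(axis=1, keepdims=True)
    trace_Sw += np.sum((Ei - ci) ** 2)                 # Eqn. (9)
    trace_Sb += Ei.shape[1] * np.sum((ci - c) ** 2)    # Eqn. (10)

# Large trace(S_b) with small trace(S_w) indicates high cluster quality.
print(trace_Sw, trace_Sb, trace_Sb / trace_Sw)
```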
Note that both of the above criteria require that $S_w$ be nonsingular, or equivalently, that $H_w$ have full row rank. In the lower dimensional space obtained from the linear transformation $G^T$, the within-cluster, between-cluster, and mixture scatter matrices become

$$S_w^Y = \sum_{i=1}^{k} \sum_{j \in J_i} (G^T a_j - G^T c^{(i)})(G^T a_j - G^T c^{(i)})^T = G^T S_w G,$$

$$S_b^Y = \sum_{i=1}^{k} \sum_{j \in J_i} (G^T c^{(i)} - G^T c)(G^T c^{(i)} - G^T c)^T = G^T S_b G,$$

$$S_m^Y = \sum_{i=1}^{n} (G^T a_i - G^T c)(G^T a_i - G^T c)^T = G^T S_m G,$$

where the superscript $Y$ denotes values in the $l$-dimensional space. Utilizing the above measures for cluster quality and the relation (3), given clustered data with $k$ clusters, the linear transformation $G^T$ that would best approximate this cluster structure in the reduced space can be found. Note that in the reduced space, the two criteria (11) and (12) become maximization of

$$J_1(G) = \mathrm{trace}\big((G^T S_w G)^{-1} (G^T S_b G)\big)$$

or

$$J_2(G) = \mathrm{trace}\big((G^T S_w G)^{-1} (G^T S_m G)\big).$$

For now, we will focus our discussion on the criterion $J_1$. The linear transformation $G^T$ which maximizes $J_1$ satisfies

$$\frac{\partial J_1(G)}{\partial G} = 0. \qquad (13)$$

It can be shown that

$$\frac{\partial J_1(G)}{\partial G} = -2 S_w G (G^T S_w G)^{-1} (G^T S_b G)(G^T S_w G)^{-1} + 2 S_b G (G^T S_w G)^{-1}, \qquad (14)$$

and from (13) we have

$$S_w^{-1} S_b G = G (G^T S_w G)^{-1} (G^T S_b G) = G \big((S_w^Y)^{-1} S_b^Y\big). \qquad (15)$$

Note that since $S_w$ is assumed to be nonsingular in $J_1$ and $J_2$, and $S_w = H_w H_w^T$, $S_w$ is symmetric positive definite. This means that $H_w \in \mathbb{R}^{m \times n}$ must satisfy the condition $m \le n$ for $S_w$ to be nonsingular. According to results for generalized eigenvalue problems [5], for the symmetric matrix $S_b^Y \in \mathbb{R}^{l \times l}$ and the symmetric positive definite matrix $S_w^Y \in \mathbb{R}^{l \times l}$, there exists a nonsingular matrix $Z \in \mathbb{R}^{l \times l}$ such that

$$Z^T S_b^Y Z = \Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_l) \qquad \text{and} \qquad Z^T S_w^Y Z = I_l.$$

Therefore,

$$(S_w^Y)^{-1} S_b^Y = Z \Lambda Z^{-1}, \qquad (16)$$

and Eqn. (15) becomes

$$S_w^{-1} S_b\, GZ = GZ\, \Lambda, \qquad (17)$$

where $GZ \in \mathbb{R}^{m \times l}$. Note that the transformation $G$ is not unique, in the sense that $J_1(G) = J_1(GW)$ for any nonsingular matrix $W \in \mathbb{R}^{l \times l}$, since

$$J_1(GW) = \mathrm{trace}\big((W^T G^T S_w G W)^{-1} (W^T G^T S_b G W)\big) = \mathrm{trace}\big(W^{-1} (G^T S_w G)^{-1} W^{-T} W^T (G^T S_b G) W\big) = \mathrm{trace}\big((G^T S_w G)^{-1} (G^T S_b G) W W^{-1}\big) = J_1(G).$$

The matrix $GZ$ that satisfies Eqn. (17), and accordingly Eqn. (13), consists of $l$ eigenvectors of $S_w^{-1} S_b$. However, $S_w^{-1} S_b$ is an $m \times m$ matrix, so we must choose $l$ out of $m$ eigenvalues, and their corresponding eigenvectors, for the solution of (17). Note that

$$J_1(G) = J_1(GZ) = \mathrm{trace}\big(((GZ)^T S_w (GZ))^{-1} ((GZ)^T S_b (GZ))\big) = \mathrm{trace}\big(((GZ)^T S_w (GZ))^{-1} ((GZ)^T S_w (GZ)) \Lambda\big) = \mathrm{trace}(\Lambda),$$

where the second to last equality follows from (17). If we choose the eigenvectors that correspond to the $l$ largest eigenvalues of $S_w^{-1} S_b$, then $\mathrm{trace}(\Lambda) = \lambda_1 + \cdots + \lambda_l$, and hence $J_1(G)$, will be maximized.

The question then becomes what value we should choose for $l$. To answer, we need to know $\mathrm{rank}(S_w^{-1} S_b)$. Since $S_b = H_b H_b^T$ and $H_b$ is as shown in Eqn. (6), assuming that the centroids are linearly independent, $\mathrm{rank}(H_b) = k - 1$. Accordingly, $\mathrm{rank}(S_b) = k - 1$, and $\mathrm{rank}(S_w^{-1} S_b) = k - 1$. This shows that only the largest $k - 1$ eigenvalues of $S_w^{-1} S_b$ are nonzero. When $l \ge k - 1$,

$$\mathrm{trace}(S_w^{-1} S_b) = \lambda_1 + \cdots + \lambda_{k-1} = J_1(G) = \mathrm{trace}\big((S_w^Y)^{-1} S_b^Y\big)$$

whenever $G \in \mathbb{R}^{m \times l}$ consists of $l$ eigenvectors of $S_w^{-1} S_b$ corresponding to the $l$ largest eigenvalues. Therefore, if we choose $l = k - 1$, there is no loss of cluster quality upon dimension reduction.
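The following sketch (not from the report; it assumes scipy is available and uses simulated data with $m \le n$) carries out the classical discriminant analysis step just described: it forms $S_w$ and $S_b$, solves the symmetric-definite generalized eigenvalue problem $S_b x = \lambda S_w x$, and takes the eigenvectors of the $k - 1$ largest eigenvalues as $G$.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(3)
m, n, k = 50, 200, 7
labels = rng.integers(0, k, size=n)
# Simulated clustered data: cluster means plus noise (m <= n, so S_w is nonsingular).
means = rng.standard_normal((m, k))
A = means[:, labels] + 0.5 * rng.standard_normal((m, n))

c = A.mean(axis=1, keepdims=True)
Sw = np.zeros((m, m)); Sb = np.zeros((m, m))
for lab in range(k):
    Ei = A[:, labels == lab]
    ci = Ei.mean(axis=1, keepdims=True)
    Sw += (Ei - ci) @ (Ei - ci).T
    Sb += Ei.shape[1] * (ci - c) @ (ci - c).T

# Generalized symmetric-definite eigenproblem S_b x = lambda S_w x;
# eigh returns eigenvalues in ascending order, so take the last k-1 eigenvectors.
lam, X = eigh(Sb, Sw)
G = X[:, -(k - 1):]

# J_1 is preserved: trace((G^T S_w G)^{-1}(G^T S_b G)) equals the sum of the k-1 largest eigenvalues.
J1_reduced = np.trace(np.linalg.solve(G.T @ Sw @ G, G.T @ Sb @ G))
print(J1_reduced, lam[-(k - 1):].sum())
```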
Now, a limitation of the criterion $J_1(G)$ in many applications, especially text processing in information retrieval, is that the matrix $S_w$ has to be nonsingular. For $S_w$ to be nonsingular, we can only allow the case $m \le n$, since $S_w$ is the product of an $m \times n$ matrix, $H_w$, and an $n \times m$ matrix, $H_w^T$. In other words, the number of terms cannot exceed the number of documents, which is a severe restriction. For that reason, we rewrite the eigenvalue problem (17) as a generalized eigenvalue problem

$$S_b x_i = \lambda_i S_w x_i \qquad \text{or} \qquad H_b H_b^T x_i = \lambda_i H_w H_w^T x_i \qquad \text{for } 1 \le i \le m. \qquad (18)$$

We seek a solution which does not require $S_w$ to be positive definite, and which can be found without explicitly forming $S_b$ and $S_w$ from $H_b$ and $H_w$, respectively. Multiplying (18) by $x_i^T$, we see that $\|H_b^T x_i\|_2^2 = \lambda_i \|H_w^T x_i\|_2^2$, so we let $\lambda_i = \alpha_i^2 / \beta_i^2$. ($\lambda_i$ will be infinite when $\beta_i = 0$, as we discuss later.) The problem becomes

$$\beta_i^2 H_b H_b^T x_i = \alpha_i^2 H_w H_w^T x_i \qquad \text{for } 1 \le i \le m. \qquad (19)$$

This has the form of a generalized singular value problem [5, 10, 15], which can be solved using the GSVD, as described in the next section.
3 Generalized Singular Value Decomposition

The following theorem introduces the GSVD as it was originally defined by Van Loan [15].

THEOREM 1. Suppose two matrices $K_A \in \mathbb{R}^{m \times n}$ with $m \ge n$ and $K_B \in \mathbb{R}^{p \times n}$ are given. Then there exist orthogonal matrices $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{p \times p}$ and a nonsingular matrix $X \in \mathbb{R}^{n \times n}$ such that

$$U^T K_A X = \mathrm{diag}(\alpha_1, \ldots, \alpha_n), \qquad V^T K_B X = \mathrm{diag}(\beta_1, \ldots, \beta_q),$$

where $q = \min(p, n)$, $\alpha_i \ge 0$ for $1 \le i \le n$, and $\beta_i \ge 0$ for $1 \le i \le q$.

This formulation cannot be applied to the matrix pair $H_b^T$ and $H_w^T$ when the dimensions of $H_b^T$ do not satisfy the assumed restrictions. Paige and Saunders [10] developed a more general formulation which can be defined for any two matrices with the same number of columns. They state theirs as follows.
THEOREM 2. Suppose two matrices $K_A \in \mathbb{R}^{m \times n}$ and $K_B \in \mathbb{R}^{p \times n}$ are given. Then for

$$C = \begin{pmatrix} K_A \\ K_B \end{pmatrix} \qquad \text{and} \qquad t = \mathrm{rank}(C),$$

there exist orthogonal matrices $U \in \mathbb{R}^{m \times m}$, $V \in \mathbb{R}^{p \times p}$, $W \in \mathbb{R}^{t \times t}$, and $Q \in \mathbb{R}^{n \times n}$ such that

$$U^T K_A Q = \Sigma_A\,(W^T R \;\; 0) \qquad \text{and} \qquad V^T K_B Q = \Sigma_B\,(W^T R \;\; 0),$$

where the blocks $W^T R$ and $0$ have $t$ and $n - t$ columns, respectively,

$$\Sigma_A = \begin{pmatrix} I_A \\ D_A \\ 0_A \end{pmatrix} \in \mathbb{R}^{m \times t}, \qquad \Sigma_B = \begin{pmatrix} 0_B \\ D_B \\ I_B \end{pmatrix} \in \mathbb{R}^{p \times t},$$

and $R \in \mathbb{R}^{t \times t}$ is nonsingular with singular values equal to the nonzero singular values of $C$. The matrices $I_A \in \mathbb{R}^{r \times r}$ and $I_B \in \mathbb{R}^{(t-r-s) \times (t-r-s)}$ are identity matrices, where the values of $r$ and $s$ depend on the data, $0_A \in \mathbb{R}^{(m-r-s) \times (t-r-s)}$ and $0_B \in \mathbb{R}^{(p-t+r) \times r}$ are zero matrices with possibly no rows or no columns, and

$$D_A = \mathrm{diag}(\alpha_{r+1}, \ldots, \alpha_{r+s}) \qquad \text{and} \qquad D_B = \mathrm{diag}(\beta_{r+1}, \ldots, \beta_{r+s})$$

satisfy

$$1 > \alpha_{r+1} \ge \cdots \ge \alpha_{r+s} > 0, \qquad 0 < \beta_{r+1} \le \cdots \le \beta_{r+s} < 1, \qquad (20)$$

and

$$\alpha_i^2 + \beta_i^2 = 1 \qquad \text{for } i = r+1, \ldots, r+s.$$
Paige and Saunders give a constructive proof of Theorem 2, which starts with the complete orthogonal decomposition [5, 2, 9] of $C$, or

$$P^T C Q = \begin{pmatrix} R & 0 \\ 0 & 0 \end{pmatrix}, \qquad (21)$$

where $P$ and $Q$ are orthogonal and $R$ is nonsingular with order equal to the rank of $C$. The construction proceeds by exploiting the SVDs of submatrices of $P$. They also relate their decomposition to Van Loan's by writing

$$U^T K_A X = (\Sigma_A \;\; 0) \qquad \text{and} \qquad V^T K_B X = (\Sigma_B \;\; 0), \qquad (22)$$

where

$$X = Q \begin{pmatrix} R^{-1} W & 0 \\ 0 & I \end{pmatrix} \in \mathbb{R}^{n \times n}.$$

From the form in (22) we see that

$$K_A = U (\Sigma_A \;\; 0) X^{-1} \qquad \text{and} \qquad K_B = V (\Sigma_B \;\; 0) X^{-1},$$

which imply that

$$K_A^T K_A = X^{-T} \begin{pmatrix} \Sigma_A^T \Sigma_A & 0 \\ 0 & 0 \end{pmatrix} X^{-1} \qquad \text{and} \qquad K_B^T K_B = X^{-T} \begin{pmatrix} \Sigma_B^T \Sigma_B & 0 \\ 0 & 0 \end{pmatrix} X^{-1}.$$

Defining

$$\alpha_i = 1,\ \beta_i = 0 \quad \text{for } i = 1, \ldots, r \qquad \text{and} \qquad \alpha_i = 0,\ \beta_i = 1 \quad \text{for } i = r+s+1, \ldots, t,$$

we have, for $1 \le i \le t$,

$$\beta_i^2 K_A^T K_A x_i = \alpha_i^2 K_B^T K_B x_i, \qquad (23)$$

where $x_i$ represents the $i$th column of $X$. For the remaining $n - t$ columns of $X$, both $K_A^T K_A x_i$ and $K_B^T K_B x_i$ are zero, so (23) is satisfied for arbitrary values of $\alpha_i$ and $\beta_i$ when $t + 1 \le i \le n$. We conclude that the columns of $X$, called the generalized right singular vectors, provide the solution to the generalized singular value problem for the matrix pair $K_A$ and $K_B$.
4 Application of the GSVD to Dimension Reduction

Recall that for our $m \times n$ term-document matrix $A$, when $m \le n$ and the scatter matrix $S_w$ is nonsingular, discriminant analysis can be applied. However, one drawback of discriminant analysis is that both $S_w = H_w H_w^T$ and $S_b = H_b H_b^T$ must be explicitly formed. Since forming these cross-product matrices can cause a severe loss of information [5, page 239, Example 5.3.2], using the GSVD, which works directly with $H_w$ and $H_b$, avoids a potential numerical problem.
Algorithm 1 DiscGSVD
Given a data matrix $A \in \mathbb{R}^{m \times n}$ with $k$ clusters, it computes the columns of the optimal transformation $G \in \mathbb{R}^{m \times (k-1)}$, which exactly preserves the cluster structure in the reduced dimensional space, and the $(k-1)$-dimensional representation $Y$ of $A$.
1. Compute $H_b$ and $H_w$ from $A$ according to Eqns. (6) and (5).
2. Compute the complete orthogonal decomposition of $C = \begin{pmatrix} H_b^T \\ H_w^T \end{pmatrix}$, which is $P^T C Q = \begin{pmatrix} R & 0 \\ 0 & 0 \end{pmatrix}$.
3. Set $t = \mathrm{rank}(C)$.
4. Compute $W$ from the SVD of $P(1{:}k,\, 1{:}t)$: $\ U^T P(1{:}k,\, 1{:}t)\, W = \Sigma_A$.
5. Compute the first $k - 1$ columns of $X = Q \begin{pmatrix} R^{-1}W & 0 \\ 0 & I \end{pmatrix}$, and set them as $G$.
6. $Y = G^T A$.
In terms of our problem (19), we once again choose the $x_i$'s which correspond to the $k - 1$ largest $\lambda_i$'s, where $\lambda_i = \alpha_i^2 / \beta_i^2$. Since the algorithm orders the singular value pairs as in (20), the generalized singular values, or the $\alpha_i / \beta_i$ quotients, are in nonincreasing order. Therefore, the first $k - 1$ columns of $X$ are all we need to compute. (Under our assumption that $S_w$ is nonsingular, (19) shows that the $\beta_i$'s are nonzero.)

Our algorithm first computes the matrices $H_b$ and $H_w$ from the term-document matrix $A$. We then solve for a very limited portion of the GSVD of the matrix $C = \begin{pmatrix} H_b^T \\ H_w^T \end{pmatrix}$. This solution is accomplished by following the construction in the proof of Theorem 2. Since all we need to compute are the columns of $X$, the major steps are limited to the complete orthogonal decomposition of $C$, which produces orthogonal matrices $P$ and $Q$ and a nonsingular matrix $R$, and a singular value decomposition of a submatrix of $P$. The steps are summarized in Algorithm DiscGSVD.
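Below is a hedged numpy sketch of Algorithm DiscGSVD under our reading of the construction (not the authors' code): the complete orthogonal decomposition of $C$ is realized by an SVD, so that $R$ in (21) is diagonal, and the rank tolerance, simulated data, and sizes are assumptions.

```python
import numpy as np

def disc_gsvd(A, labels):
    """Sketch of Algorithm DiscGSVD: reduce A (m x n, columns = documents)
    to k-1 dimensions via the GSVD of the pair (H_b^T, H_w^T)."""
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    m, n = A.shape
    classes = np.unique(labels)
    k = len(classes)
    c = A.mean(axis=1, keepdims=True)              # grand centroid

    # Step 1: H_b (m x k) and H_w (m x n) as in Eqns. (6) and (5).
    Hb = np.empty((m, k))
    Hw = np.empty((m, n))
    for i, lab in enumerate(classes):
        Ei = A[:, labels == lab]
        ci = Ei.mean(axis=1, keepdims=True)
        Hb[:, i] = np.sqrt(Ei.shape[1]) * (ci - c).ravel()
        Hw[:, labels == lab] = Ei - ci

    # Steps 2-3: complete orthogonal decomposition of C = [H_b^T; H_w^T],
    # realized here by an SVD (so R = diag(s[:t]) in Eqn. (21)), and t = rank(C).
    C = np.vstack([Hb.T, Hw.T])                    # (k + n) x m
    P, s, Qt = np.linalg.svd(C, full_matrices=False)
    t = int(np.sum(s > s[0] * max(C.shape) * np.finfo(float).eps))

    # Step 4: SVD of the leading k x t block of P gives W (and the alpha_i's).
    _, alpha, Wt = np.linalg.svd(P[:k, :t], full_matrices=True)

    # Step 5: first k-1 columns of X = Q [R^{-1} W, 0; 0, I]; the singular values
    # are ordered so these correspond to the largest generalized singular values.
    X_t = Qt.T[:, :t] @ (Wt.T / s[:t, None])
    G = X_t[:, :k - 1]

    # Step 6: the (k-1)-dimensional representation.
    Y = G.T @ A
    return G, Y

# Usage on simulated data with m > n, where S_w is singular and classical
# discriminant analysis is not defined (sizes and noise level are assumptions).
rng = np.random.default_rng(4)
m, n, k = 300, 100, 5
labels = rng.integers(0, k, size=n)
A = rng.standard_normal((m, k))[:, labels] + 0.3 * rng.standard_normal((m, n))
G, Y = disc_gsvd(A, labels)
print(G.shape, Y.shape)                            # (300, 4) (4, 100)
```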
For the case when $m > n$, the scatter matrix $S_w$ is singular. Hence, we cannot even define the $J_1$ criterion, and discriminant analysis fails. From Eqn. (19), we see that generalized right singular vectors that lie in the null space of $S_w$ will have corresponding $\beta_i$'s equal to zero. This results in infinite generalized singular values. We observed in our experiments that even in this case our Algorithm DiscGSVD worked very well, thus extending its applicability beyond that of the original discriminant analysis.

Algorithm 2 Centroid
Given a data matrix $A \in \mathbb{R}^{m \times n}$ with $k$ clusters, it computes a $k$-dimensional representation $Y$ of $A$.
1. Compute the centroid $c^{(i)}$ of the $i$th cluster, $1 \le i \le k$.
2. Set $B = [c^{(1)}\ c^{(2)}\ \cdots\ c^{(k)}]$.
3. Solve $\min_Y \|BY - A\|_F$.

Algorithm 3 CentroidQR
Given a data matrix $A \in \mathbb{R}^{m \times n}$ with $k$ clusters, it computes a $k$-dimensional representation $Y$ of $A$.
1. Compute the centroid $c^{(i)}$ of the $i$th cluster, $1 \le i \le k$.
2. Set $B = [c^{(1)}\ c^{(2)}\ \cdots\ c^{(k)}]$.
3. Compute the reduced QR decomposition of $B$, which is $B = QR$.
4. Solve $\min_Y \|QY - A\|_F$ (in fact, $Y = Q^T A$).
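For comparison, a brief sketch of Algorithms 2 and 3 (illustrative only, with assumed data): Centroid solves the least squares problem with the centroid matrix $B$, while CentroidQR orthogonalizes $B$ first so that $Y = Q^T A$.

```python
import numpy as np

def centroid_methods(A, labels):
    """Sketch of Algorithm 2 (Centroid) and Algorithm 3 (CentroidQR)."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # Steps 1-2: matrix of cluster centroids B = [c^(1) ... c^(k)].
    B = np.column_stack([A[:, labels == lab].mean(axis=1) for lab in classes])
    # Centroid: solve min_Y ||BY - A||_F.
    Y_centroid, *_ = np.linalg.lstsq(B, A, rcond=None)
    # CentroidQR: reduced QR of B, then Y = Q^T A solves min_Y ||QY - A||_F.
    Q, _ = np.linalg.qr(B)
    Y_qr = Q.T @ A
    return Y_centroid, Y_qr

rng = np.random.default_rng(5)
A = rng.standard_normal((3219, 100))   # m > n, as in Test II
labels = rng.integers(0, 5, size=100)
Yc, Yq = centroid_methods(A, labels)
print(Yc.shape, Yq.shape)              # both (5, 100)
```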
5 Experimental Results

We tested our proposed algorithm on two different data types. In the first data type, the column dimension is higher than the row dimension, which can be handled by the original discriminant analysis scheme, assuming that the within-cluster scatter matrix is nonsingular. In the second data type, the row dimension is higher than the column dimension, which yields a singular within-cluster scatter matrix for the original discriminant analysis, but can be solved effectively by our improved discriminant analysis (DiscGSVD).
Algorithm 4 Centroid-based Classification
Given a data set $A$ with $k$ clusters and $k$ corresponding centroids $c^{(i)}$, $1 \le i \le k$, it finds the index $j$ of the cluster to which a new vector $q$ belongs.
Find the index $j$ such that $\mathrm{sim}(q, c^{(i)})$, $1 \le i \le k$, is minimum (or maximum), where $\mathrm{sim}(q, c^{(i)})$ is the similarity measure between $q$ and $c^{(i)}$. (For example, with the $L_2$ norm, $\mathrm{sim}(q, c^{(i)}) = \|q - c^{(i)}\|_2$ and the minimizing index is taken; with the cosine measure, $\mathrm{sim}(q, c^{(i)}) = \cos(q, c^{(i)}) = \dfrac{q^T c^{(i)}}{\|q\|_2\,\|c^{(i)}\|_2}$ and the maximizing index is taken.)
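A sketch of Algorithm 4 (an assumed implementation, with toy centroids): a vector $q$ is assigned to the cluster whose centroid is nearest in the $L_2$ norm or most similar under the cosine measure.

```python
import numpy as np

def classify(q, centroids, measure="cosine"):
    """Centroid-based classification (Algorithm 4).
    centroids: columns are the k cluster centroids c^(1), ..., c^(k)."""
    q = np.asarray(q, dtype=float).ravel()
    if measure == "L2":
        # Minimize sim(q, c^(i)) = ||q - c^(i)||_2.
        scores = np.linalg.norm(centroids - q[:, None], axis=0)
        return int(np.argmin(scores))
    # Maximize sim(q, c^(i)) = q^T c^(i) / (||q||_2 ||c^(i)||_2).
    scores = (centroids.T @ q) / (np.linalg.norm(centroids, axis=0) * np.linalg.norm(q))
    return int(np.argmax(scores))

# Toy usage with assumed centroids in a reduced space.
centroids = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]).T   # 2 x 3: three centroids
print(classify([0.9, 0.2], centroids, "L2"), classify([0.9, 0.2], centroids, "cosine"))
```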
Test I: In this experiment we use clustered data of the first data type. It is artificially generated by an algorithm adapted from [6, Appendix H]. We compare classification results in the full dimensional space and the reduced dimensional space using Algorithm DiscGSVD and other dimension reduction methods we have developed [7, 11], namely Algorithms Centroid and CentroidQR. Table 1 shows the dimensions of the data and the classification results using the centroid-based classification algorithm (see Algorithm 4). The data consist of 7 classes comprising 100 documents of dimension 50. DiscGSVD reduces the original dimension 50 to 6, which is $k - 1$, where $k$ is the number of classes. The other methods reduce it to 7, which is $k$. We can see that the DiscGSVD method has a lower misclassification rate than our other reduction methods. As shown in our previous work [7], CentroidQR gives the same result as classification in the full dimensional space. We noticed the same trend in many tests with different data sizes.

Test II: In this test, a higher dimensional data set is tested for classification after dimension reduction by DiscGSVD, Centroid, and CentroidQR. This data set consists of 5 categories, all drawn from abstracts in the MEDLINE database (http://www.ncbi.nlm.nih.gov/PubMed). Each category has 20 documents, and the total number of terms is 3219 (see Table 2) after preprocessing with stopping and stemming algorithms [8]. For this type of data matrix, the original discriminant analysis breaks down, since the within-cluster scatter matrix is singular. Our improved method does not need to compute scatter matrices, and it also avoids the singularity problem.
Table 1: Misclassification rate with the cosine measure

Reduction method     | trace(S_m^Y) | trace(S_b^Y) | trace(S_w^Y) | trace(S_b^Y)/trace(S_w^Y) | Mis. rate | Misclassified items
Full (50 × 100)      | 5282         | 696.2        | 4586         | 0.152                     | 0.14      | 21, 25, 30, 50, 54, 58, 67, 74, 85, 86, 92, 95, 96, 100
DiscGSVD (6 × 100)   | 6.00         | 4.09         | 1.90         | 2.15                      | 0.06      | 18, 21, 67, 70, 73, 92
Centroid (7 × 100)   | 173.9        | 85.0         | 88.9         | 0.956                     | 0.13      | 1, 30, 50, 54, 58, 67, 74, 85, 86, 92, 95, 96, 100
CentroidQR (7 × 100) | 1390         | 696.2        | 694.2        | 1.00                      | 0.14      | 21, 25, 30, 50, 54, 58, 67, 74, 85, 86, 92, 95, 96, 100
By Algorithm DiscGSVD the dimension 3219 is dramatically reduced to 4, which is one less than the number of classes. The other methods reduce it to the number of classes, 5. Also, in this test, two different similarity measures are compared in classification. Table 3 shows classification results using the $L_2$ norm and cosine similarity measures. As in the results of Test I, DiscGSVD produces the lowest misclassification rate.

Traces of scatter matrices are also shown in Tables 1 and 3 for Test I and Test II, respectively. Because the $J_1$ criterion is not defined in the Test II case, we computed the quotient $\mathrm{trace}(S_b^Y)/\mathrm{trace}(S_w^Y)$ as a measure of how well we maximized $\mathrm{trace}(S_b^Y)$ while minimizing $\mathrm{trace}(S_w^Y)$. It is interesting to observe a couple of facts. One is that the value of this quotient for DiscGSVD is larger than that for any other reduction method, and the other is that $\mathrm{trace}(S_b)$ in the full dimensional space and $\mathrm{trace}(S_b^Y)$ using CentroidQR are identical. The first fact demonstrates the optimization of cluster quality by the DiscGSVD algorithm, and the second appears to be related to the order preserving property [7] of the CentroidQR algorithm.

Our experimental results confirm that the $J_1$ criterion effectively optimizes classification in the reduced dimensional space, and our DiscGSVD extends its applicability to cases which the original discriminant analysis cannot handle. With slight modifications, the GSVD can also be used to maximize the $J_2$ criterion (12).
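As an illustration of how this quotient can be computed (assumed code; the projection $G$ below is an arbitrary placeholder rather than one produced by a particular reduction method), the reduced-space traces follow from $S^Y = G^T S G$.

```python
import numpy as np

def reduced_trace_quotient(A, labels, G):
    """Compute trace(S_b^Y), trace(S_w^Y) and their quotient for Y = G^T A."""
    labels = np.asarray(labels)
    Y = G.T @ A                                   # reduced representation
    c = Y.mean(axis=1, keepdims=True)
    tb = tw = 0.0
    for lab in np.unique(labels):
        Yi = Y[:, labels == lab]
        ci = Yi.mean(axis=1, keepdims=True)
        tw += np.sum((Yi - ci) ** 2)              # trace(G^T S_w G)
        tb += Yi.shape[1] * np.sum((ci - c) ** 2) # trace(G^T S_b G)
    return tb, tw, tb / tw

rng = np.random.default_rng(6)
A = rng.standard_normal((50, 100))
labels = rng.integers(0, 7, size=100)
G = np.linalg.qr(rng.standard_normal((50, 6)))[0]   # placeholder orthonormal projection
print(reduced_trace_quotient(A, labels, G))
```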
Table 2: MEDLINE data set

Class | Category     | No. of documents
1     | heart attack | 20
2     | colon cancer | 20
3     | diabetes     | 20
4     | oral cancer  | 20
5     | tooth decay  | 20

Dimension of the term-document matrix: 3219 × 100
Table 3: Misclassification rate with the L2 norm and cosine measures

Reduction method       | trace(S_m^Y) | trace(S_b^Y) | trace(S_w^Y) | trace(S_b^Y)/trace(S_w^Y) | Measure | Mis. rate | Misclassified items
Full (3219 × 100)      | 41286        | 4418         | 36868        | 0.12                      | L2      | 0.06      | 26, 28, 40, 58, 85, 86
                       |              |              |              |                           | Cosine  | 0.03      | 26, 73, 86
DiscGSVD (4 × 100)     | 4.00         | 3.95         | 0.05         | 79                        | L2      | 0.01      | 68
                       |              |              |              |                           | Cosine  | 0.01      | 86
Centroid (5 × 100)     | 124.38       | 80.00        | 44.38        | 1.80                      | L2      | 0.03      | 26, 73, 86
                       |              |              |              |                           | Cosine  | 0.03      | 26, 73, 86
CentroidQR (5 × 100)   | 7089         | 4418         | 2672         | 1.65                      | L2      | 0.06      | 26, 28, 40, 58, 85, 86
                       |              |              |              |                           | Cosine  | 0.03      | 26, 73, 86
References

[1] M.W. Berry, S.T. Dumais, and G.W. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37:573-595, 1995.
[2] Å. Björck. Numerical Methods for Least Squares Problems. SIAM, Philadelphia, 1996.
[3] W.B. Frakes and R. Baeza-Yates. Information Retrieval: Data Structures and Algorithms. Prentice Hall PTR, 1992.
[4] K. Fukunaga. Introduction to Statistical Pattern Recognition, 2nd ed. Academic Press, 1990.
[5] G.H. Golub and C.F. Van Loan. Matrix Computations, 3rd ed. Johns Hopkins University Press, Baltimore, 1996.
[6] A.K. Jain and R.C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
[7] M. Jeon, H. Park, and J.B. Rosen. Dimension reduction based on centroids and least squares for efficient processing of text data. In Proceedings of the SIAM Conference on Data Mining, Chicago, April 2001, to appear.
[8] G. Kowalski. Information Retrieval Systems: Theory and Implementation. Kluwer Academic Publishers, 1997.
[9] C.L. Lawson and R.J. Hanson. Solving Least Squares Problems. SIAM, Philadelphia, 1995.
[10] C.C. Paige and M.A. Saunders. Towards a generalized singular value decomposition. SIAM J. Numer. Anal., 18:398-405, 1981.
[11] H. Park, M. Jeon, and J.B. Rosen. Representation of high dimensional text data in a lower dimensional space in information retrieval. SIAM Journal on Matrix Analysis and Applications, submitted.
[12] G. Salton. The SMART Retrieval System. Prentice Hall, 1971.
[13] G. Salton and M.J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
[14] S. Theodoridis and K. Koutroumbas. Pattern Recognition. Academic Press, 1999.
[15] C.F. Van Loan. Generalizing the singular value decomposition. SIAM J. Numer. Anal., 13:76-83, 1976.