INFORMATION RETRIEVAL SYSTEM IN BAHASA INDONESIA USING LATENT SEMANTIC INDEXING AND SEMI-DISCRETE DECOMPOSITION
Yeni Herdiyeni
Computer Science Department, Faculty of Mathematics and Natural Sciences, Bogor Agricultural University (IPB), Bogor – Indonesia
[email protected]

Zainal A. Hasibuan
Computer Science Department, Faculty of Computer Science, University of Indonesia, Depok 16424 – Indonesia
[email protected]

June 2003

Abstract
The focus of this paper is exploring the use of Latent Semantic Indexing (LSI) and Semi-Discrete Matrix Decomposition (SDD) in an information retrieval system for Bahasa Indonesia. The method takes advantage of the implicit higher-order structure in the association of terms with documents ("semantic structure") in order to improve the detection of relevant documents on the basis of terms found in queries in the Indonesian language. LSI is a promising enhancement to the vector space model of information retrieval; it uses statistically derived relationships between documents, instead of individual words, for retrieval. The particular technique used is Semi-Discrete Matrix Decomposition (SDD), based on Kolda's research [5], which requires significantly less storage and is faster at query processing than the Singular Value Decomposition (SVD). Using Kolda and O'Leary's SDDPACK software [7], an implementation of SDD-based LSI was built in Visual Basic 6.0, Matlab 6.5 and Visual C++ and tested on a collection of student research documents at the Computer Science Department of IPB. The results compare SDD performance using stemmed and non-stemmed terms.
1. Introduction

Typically, information is retrieved by literally matching terms in documents with those of a query. However, lexical matching methods can be inaccurate when used to match a user's query. Since there are usually many ways to express a given concept (synonymy), the literal terms in a user's query may not match those of relevant documents. In addition, most terms have multiple meanings (polysemy), so terms in a user's query will literally match terms in irrelevant documents. A better approach would allow users to retrieve information on the basis of a conceptual topic or the meaning of a document.

Latent Semantic Indexing (LSI) [3] tries to overcome the problems of lexical matching by using statistically derived conceptual indices instead of individual words for retrieval. LSI assumes that there is some underlying or latent structure in word usage that is partially obscured by variability in word choice. The latent semantic structure starts with a matrix of terms by documents. This matrix is then analyzed by Singular Value Decomposition (SVD) to derive the latent semantic structure model. In an information retrieval system, SVD can be viewed as a technique for deriving a set of uncorrelated indexing variables or factors; each term and document is represented by its vector of factor values. Note that by virtue of the dimension reduction, it is possible for documents with somewhat different profiles of term usage to be mapped into the same vector of factor values.

Disadvantages of LSI include the large amount of storage required for the SVD representation. Retrieval efficiency may not be as good as in traditional information retrieval (IR), since LSI needs to compare the query against every document in the collection (as opposed to using an inverted index, which only needs to examine documents that contain the query terms). Another criticism of LSI is that the SVD method is designed for normally distributed data, but a term-by-document matrix (even if weighted) from a document collection may not be normally distributed [9]. It has been suggested that a dimensionality reduction based on the Poisson distribution would provide a better approximation of the original term-by-document matrix. The storage requirements for the three matrices of the SVD can be much greater than for the original term-document matrix, as the singular vector matrices are usually dense [2].

The Semi-Discrete Decomposition (SDD) is a method of reducing the storage for a matrix. SDD has been used in image compression, where it has achieved 10-to-1 compression without degrading image quality [7]. Kolda and O'Leary [6] found that, for equal query times, the SDD produced precision rates similar to the SVD with only one-tenth of the storage (using the Medline test collection). However, the decomposition requires more time to compute, and requires a higher dimension (k) than the SVD. The main goal of this study is to evaluate the effectiveness of stemming in Bahasa Indonesia using LSI and SDD to build the term-document matrix and query vector.
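To make the mechanics concrete, the following is a minimal sketch of SVD-based LSI on a toy term-document matrix (NumPy; the data, variable names and the choice k = 2 are ours for illustration, not from the paper):

```python
import numpy as np

# Toy term-document matrix: rows are terms, columns are documents;
# entry A[i, j] is the frequency of term i in document j.
A = np.array([[2, 0, 1, 0],
              [0, 1, 0, 2],
              [1, 1, 0, 0],
              [0, 0, 2, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) Vt

k = 2                                              # keep k latent factors
docs_k = (np.diag(s[:k]) @ Vt[:k, :]).T            # documents in LSI space

q = np.array([1, 0, 1, 0], dtype=float)            # query as a term vector
q_k = np.diag(1.0 / s[:k]) @ U[:, :k].T @ q        # fold query into LSI space

# Rank documents by cosine similarity in the reduced space.
sims = docs_k @ q_k / (np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k))
print(np.argsort(-sims))                           # document ids, best first
```

Documents that share latent factors with the query can rank highly even when they contain none of its literal terms, which is exactly the synonymy effect described above.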
2. Previous Work

Many researchers have evaluated the use of Semi-Discrete Matrix Decomposition, but there are few studies that evaluate its use for Indonesian text retrieval. This study examines an end-to-end LSI engine using a Semi-Discrete Decomposition for Indonesian text retrieval. The engine is evaluated using a collection of students' research documents at the Computer Science Department, IPB. Another contribution of this study is a document preprocessing method that parses Indonesian text into the term-document matrix used for SDD in LSI.
3. The Theoretical Foundation

Document Preprocessing

In order to identify which terms should be used to index a document collection, these terms need to be identified and stored.

Parsing

Parsing is one of the most overlooked parts of most information retrieval systems. Many systems describe their proprietary technique designed to find the "perfect documents" and focus entirely on how to find relevant documents. Parsing refers to the process of identifying tokens in a stream of text. For the string "the big dog jumped up the hill" we can agree that the tokens are "the", "big", "dog", "jumped", "up", and "hill". This paper examines how to parse Indonesian text. The document parser is used to accept text prior to indexing, and the query parser is used to identify tokens prior to executing the query.

Stop Lists

During the automatic indexing of documents, candidate index terms are usually compared against a 'stop list', which is a list of very common words (such as "sebuah", "adalah", "kecuali", etc.). These terms are removed because they appear too frequently to be useful as index terms (they probably appear in every document). The advantages of using a stop list are that less storage space is required in the term index and that the high-frequency terms are removed from both the query and the term index, resulting in faster retrieval. The disadvantage is that search phrases might require words from the stop list.

Stemming

Another option in document preprocessing for IR is stemming. Stemming involves collapsing morphological variants of the same lexical item into a single root. For example, "proses", "memproses" and "pemrosesan" all have the root "proses". The advantage of stemming is that a query on the keyword "pemrosesan" will be stemmed to "proses" before the keyword index is searched and will also retrieve documents that use the keyword "memproses". The disadvantage is that stemming can conflate terms which stem to the same root but are not related to the query. In this study we used a stemming algorithm for Bahasa Indonesia [8].

Some notation used in this study: M is the measure (size) of a term, W is the word (term), L is the length of the term, C is a consonant, V is a vowel, V(x) means the x-th letter is a vowel, and C(x) means the x-th letter is a consonant. The functions used in the stemming algorithm are:

1. Valid(x): checks the validity of a token as input.
2. ReduceRep(x): handles a word x containing the repetition mark (-):
   a. If the word begins with a bound morpheme (such as a, adi, antar, anti, ekstra, inter, nir, pan, para, pasca, pra, supra, swa, trans, tuna, ultra), the repetition mark is removed (for example, trans-Sumatra → transSumatra).
   b. If the word before the (-) has measure greater than 1, or the word after the (-) is the same as the word before it, the word before the (-) is used (for example, hak-hak → hak, lalu-lalang → lalu).
   c. Otherwise, the repetition mark is kept.
3. ValidDblConsonant(x): checks the validity of the double consonant that starts x. In Bahasa Indonesia the consonant clusters that can start a term are limited to pl, bl, kl, gl, fl, sl, pr, br, tr, dr, kr, gr, fr, sr, ps, sw, kw, sp, sm, sn, sk, pt, ts, st, ng, ny, str, spr, skr, and skl [1].
4. AdjustHead(x) and AdjustTail(x): change the beginning and end of word x to handle affixation problems.
5. Right(x,y): returns the rightmost y characters of x.
6. Left(x,y): returns the leftmost y characters of x.
7. Mid(x,y,z): returns z characters of word x starting at position y.
8. StripPrefix(x,y): stems the prefix of x at position y, with the constraints:
   a. If position y contains (-), strip the first y characters.
   b. If V(y), or C(y) and V(y+1), or ValidDblConsonant(Mid(x,y)), strip the first (y−1) characters.
   c. Otherwise, do not stem x.
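As a rough illustration, the string helpers might look as follows in Python (a sketch under our own assumptions; the original system was written in Visual Basic, where Left, Right and Mid are built-ins, and reduce_rep here only approximates rule 2):

```python
BOUND_MORPHEMES = {"a", "adi", "antar", "anti", "ekstra", "inter", "nir", "pan",
                   "para", "pasca", "pra", "supra", "swa", "trans", "tuna", "ultra"}

def right(x: str, y: int) -> str:
    """Rightmost y characters of x (VB-style Right)."""
    return x[-y:] if y > 0 else ""

def left(x: str, y: int) -> str:
    """Leftmost y characters of x (VB-style Left)."""
    return x[:y]

def mid(x: str, y: int, z: int) -> str:
    """z characters of x starting at 1-based position y (VB-style Mid)."""
    return x[y - 1:y - 1 + z]

def reduce_rep(x: str) -> str:
    """Approximate ReduceRep: collapse a reduplicated form."""
    if "-" not in x:
        return x
    head, _, tail = x.partition("-")
    if head in BOUND_MORPHEMES:
        return head + tail        # e.g. trans-Sumatra -> transSumatra
    if len(head) > 1 or head == tail:
        return head               # e.g. hak-hak -> hak, lalu-lalang -> lalu
    return x                      # otherwise keep the repetition mark
```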
The rules for stemming are (an empty right-hand side means the affix is simply removed):

1. PreS1:
   Se (M>1) →
2. PreS2:
   Mem (M>1 AND (b* OR p* OR f*)) →
   Mem (M>1) → m
   Meng (M>1 AND (g* OR h* OR kh*)) →
   Meny (M>1 AND V*) → s
   Men (M>1 AND V*) → n
   Men (M>1) →
   Me (M>1) → StripPrefix(W,3)
   Di (M>1) → StripPrefix(W,3)
3. PreS3:
   Ber (M>1) →
   Be (M>1 AND Cer*) →
4. PreS4:
   Pem (M>1 AND b*) →
   Peng (M>1 AND (g* OR h* OR kh*)) →
   Peny (M>1 AND V*) → s
   Pen (M>1) →
   Per (M>1 AND C*) → StripPrefix(W,3)
   Pe (M>1) → StripPrefix(W,3)
   Ter (M>1) →
   Te (M>1 AND Cer*) →
5. SufS1:
   (seni or budi) man →
   (M>1) wan →
   (M>1) wati →
6. SufS2:
   (L>4) -kan →
   (M>1) kan →
7. SufS3:
   (L>3) -an →
   (M>1) an →
8. SufS4:
   (M>1 AND NOT(*i) AND (*ng OR NOT(*CC))) i →
9. FSufS1:
   (M>1) sionis → si
   (L>5) -isme → is
   (M>0) isme → is
   (M>1) itas →
   (M>1) asi →
   (M>1 AND *c) si → t
   (M>1) or →
   (M>1) er →
10. FSufS2:
    (M>1) if →
    (M>1) ik →
    (M>1) is →
11. FSufS3:
    (M>1) at →
    (M>1) wi →
    (M>1) wiah →
    (M>1) iah →
    (M>1) al →
12. FSufS4:
    (M>0 AND *V) v → f
    (M>0 AND *V) pt → p
    (M>0 AND *V) kt → k
    (M>0 AND *V) nt → n
13. ConS:
    Ke (M>1) an →
    Ke(tahu) i →
14. ParS:
    (M>1) -kah →
    (M>1) -lah →
15. ProS:
    (M>1) -ku →
    (M>1) -mu →
    (M>1) ku →
    (M>1) mu →
    (M>1) -nya →
    (M>1) nya →
    ku (M>1) →
    kau (M>1) →
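To show how such a rule table drives stemming, here is a hedged sketch that applies a small subset of the rules in order (Python; the rule subset, the regexes, and the minimum-length stand-in for the M > 1 measure check are our simplifications, not the authors' code):

```python
import re

# A few of the prefix/suffix rules above as (pattern, replacement) pairs.
PREFIX_RULES = [
    (re.compile(r"^meny(?=[aeiou])"), "s"),  # PreS2: meny + vowel -> s...
    (re.compile(r"^mem(?=[bpf])"), ""),      # PreS2: mem before b/p/f
    (re.compile(r"^meng(?=[gh])"), ""),      # PreS2: meng before g/h
    (re.compile(r"^me"), ""),                # PreS2: generic me-
    (re.compile(r"^ber"), ""),               # PreS3
    (re.compile(r"^di"), ""),                # PreS2
    (re.compile(r"^ter"), ""),               # PreS4
    (re.compile(r"^se"), ""),                # PreS1
]
SUFFIX_RULES = [
    (re.compile(r"(kan|an|i|nya|lah|kah|ku|mu)$"), ""),  # SufS*/ConS/ParS/ProS
]

def stem(word: str, min_stem: int = 3) -> str:
    """One-pass toy stemmer: accept a rule's result only if the remaining
    stem is long enough (a crude stand-in for the M > 1 condition)."""
    w = word.lower()
    for rules in (PREFIX_RULES, SUFFIX_RULES):
        for pat, rep in rules:
            cand = pat.sub(rep, w, count=1)
            if cand != w and len(cand) >= min_stem:
                w = cand
                break
    return w

print(stem("memproses"), stem("diproses"), stem("prosesnya"))
# -> proses proses proses
```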
The Vector Space Method

The SDD-based LSI method is an extension of the vector space method, which we describe in this section.

Creating the Term-Document Matrix

We begin with a collection of textual documents. We determine a list of keywords or terms by: (1) creating a list of all words that appear in the documents; (2) removing words void of semantic content such as "dari" and "karena" (using a stop-word list for Bahasa Indonesia); (3) stemming; (4) further trimming the list by removing words that appear in only one document. The remaining words are the terms, which we number from 1 to m. We then create the m × n term-document matrix A = [a_ij], where a_ij represents the weight of term i in document j. A term weight has three components: local, global and normalization [10]. We let
$$a_{ij} = g_i \, t_{ij} \, d_j$$

where $t_{ij}$ is the local component (based on information in the j-th document only), $g_i$ is the global component (based on information about the use of the i-th term throughout the collection), and $d_j$ is the normalization component, specifying whether or not the columns (i.e., the documents) are normalized. References [5] and [10] contain more comprehensive lists of weighting formulas. Note that the function $F(f_{ik})$ returns 1 if $f_{ik} > 0$ and returns 0 if $f_{ik} = 0$.

Local Term Weight Formulas

- No local weight (symbol x): $1$
- Term frequency (symbol t): $f_{ij}$
- Binary (symbol b): $F(f_{ij})$
- Log weighting (symbol l): $\log(f_{ij} + 1)$

Global Term Weight Formulas

- No global weight (symbol x): $1$
- Inverse document frequency (symbol f): $\log\left( \dfrac{n}{\sum_{k=1}^{n} F(f_{ik})} \right)$
- Probabilistic inverse (symbol p): $\log\left( \dfrac{n - \sum_{k=1}^{n} F(f_{ik})}{\sum_{k=1}^{n} F(f_{ik})} \right)$

Normalization Formulas

- No normalization (symbol x): $d_j = 1$
- Normalized (symbol n): $d_j = \left( \sqrt{\sum_{k=1}^{m} (g_k \, t_{kj})^2} \right)^{-1}$
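A sketch of how these components combine into the weighted matrix, using the log local weight (l), inverse document frequency (f) and column normalization (n) from the tables above (NumPy; the helper name and toy data are ours):

```python
import numpy as np

def weight_matrix(F: np.ndarray) -> np.ndarray:
    """Build A[i, j] = g_i * t_ij * d_j from a raw m x n frequency matrix F,
    using the 'lfn' combination: log local weight, IDF global weight,
    and unit-length document columns."""
    m, n = F.shape
    chi = (F > 0).astype(float)            # F(f_ik): 1 if term i occurs in doc k
    T = np.log(F + 1.0)                    # local:  t_ij = log(f_ij + 1)
    df = np.maximum(chi.sum(axis=1), 1.0)  # number of docs containing term i
    g = np.log(n / df)                     # global: g_i = log(n / df_i)
    A = g[:, None] * T
    norms = np.linalg.norm(A, axis=0)      # normalization: d_j = 1 / ||column j||
    return A / np.maximum(norms, 1e-12)

F = np.array([[2, 0, 1],
              [0, 3, 0],
              [1, 1, 1]])
print(weight_matrix(F).round(3))
```

Note that with IDF a term appearing in every document gets $g_i = 0$, which is consistent with removing stop words.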
IR System Evaluation
Because of the definition of relevance, and because user queries are inherently vague, the retrieved documents are not exact answers to the query. The ranked documents are an approximation to the user's query, with the documents at the top more likely to be relevant than the documents further down.

Average Precision

One standard evaluation measure is interpolated precision at 11 levels of recall. Let 0%, 10%, ..., 100% be the 11 standard recall levels, and let P(r) be the precision when the fraction r of the relevant documents has been retrieved. To compare recall-precision averages we need to normalize this result, which is done by ceiling interpolation:

$$P(r_j) = \max_{r_j \le r \le r_{j+1}} P(r)$$

with $r_j$, $j \in \{0, 1, 2, \ldots, 10\}$, referring to the j-th standard recall level. The recall-precision average above applies to one query. For a set of queries Q of size |Q|, the average precision at recall level r is:

$$\bar{P}(r) = \frac{\sum_{q \in Q} P_q(r)}{|Q|}$$

where, for each $q \in Q$, $P_q(r)$ is the interpolated precision of query q at recall level r.

Interpolated Average Precision and Non-Interpolated Average Precision

Interpolated average precision and non-interpolated average precision are often used as single-value summaries. For all queries $q \in Q$, let $P_q(r)$ be the interpolated precision at recall level r. The interpolated average precision over the (n + 1) standard recall levels is then:

$$\frac{\sum_{q \in Q} \sum_{r=0}^{n} P_q(r)}{(n+1)\,|Q|}$$

For all queries $q \in Q$, let $REL_q$ be the relevant documents for query q and $RET_q$ the retrieved documents. Non-interpolated average precision averages, over $|REL_q|$, the precision observed at the rank of each relevant retrieved document:

$$\frac{1}{|Q|} \sum_{q \in Q} \frac{1}{|REL_q|} \sum_{d \in REL_q \cap RET_q} P_q(d)$$

where $P_q(d)$ is the precision at the rank position of document d.
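As an illustration, the 11-point interpolated precision and its average can be computed from a single ranked list as follows (a sketch; the ranking and relevance judgments are invented):

```python
def eleven_point_precision(ranked: list, relevant: set) -> list:
    """Interpolated precision P(r_j) at recall levels 0.0, 0.1, ..., 1.0,
    taking P(r_j) as the maximum precision at any recall >= r_j."""
    precisions, recalls, hits = [], [], 0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
            recalls.append(hits / len(relevant))
    return [max((p for p, r in zip(precisions, recalls) if r >= j / 10),
                default=0.0) for j in range(11)]

# Hypothetical ranked list of document ids; documents 1, 3 and 7 are relevant.
pts = eleven_point_precision([3, 5, 1, 9, 7], {1, 3, 7})
print(pts)                     # precision at the 11 standard recall levels
print(sum(pts) / len(pts))     # interpolated average precision, one query
```

Averaging the same quantity over all queries in Q gives the interpolated average precision defined above.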
Semi-Discrete Decomposition
A semi-discrete decomposition (SDD) approximates a matrix as a weighted sum of outer products formed by vectors whose entries are constrained to the set S = {−1, 0, 1}. SDD is used by [6] for latent semantic indexing (LSI) in information retrieval. An SDD of an m × n matrix A is a decomposition of the form

$$A_k = \begin{bmatrix} x_1 & x_2 & \cdots & x_k \end{bmatrix} \begin{bmatrix} d_1 & 0 & \cdots & 0 \\ 0 & d_2 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & d_k \end{bmatrix} \begin{bmatrix} y_1^T \\ y_2^T \\ \vdots \\ y_k^T \end{bmatrix} = \sum_{i=1}^{k} d_i \, x_i \, y_i^T$$

Here each $x_i$ is an m-vector with entries from the set S = {−1, 0, 1}, each $y_i$ is an n-vector with entries from S, and each $d_i$ is a positive scalar. This is called a k-term SDD.
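Because every entry of $x_i$ and $y_i$ lies in {−1, 0, 1}, a k-term SDD can be stored as two small integer matrices plus k floating-point weights, which is the source of the storage savings over the dense SVD factors. A minimal sketch of the representation (NumPy; the numbers are arbitrary):

```python
import numpy as np

# Compact storage: X (m x k) and Y (n x k) hold {-1, 0, 1} entries as int8
# (in principle 2 bits each would suffice); d holds the k positive weights.
X = np.array([[1, 0], [1, -1], [0, 1], [-1, 0], [0, 1]], dtype=np.int8)
Y = np.array([[1, 1], [0, -1], [1, 0], [-1, 1]], dtype=np.int8)
d = np.array([2.5, 0.75])

# Reconstruction: A_k = sum_i d_i * x_i * y_i^T = X diag(d) Y^T.
Ak = (X.astype(float) * d) @ Y.T.astype(float)
print(Ak)          # the 5 x 4 two-term approximation
```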
An SDD approximation can be formed iteratively via a greedy algorithm. Let $A_k$ denote the k-term approximation ($A_0 = 0$), and let $R_k$ be the residual at the k-th step, that is, $R_k = A - A_{k-1}$. The optimal choice of the next triplet $(d_k, x_k, y_k)$ is the solution of the subproblem

$$\min F_k(d, x, y) \equiv \| R_k - d\,x\,y^T \|_F^2 \quad \text{s.t. } x \in S^m,\; y \in S^n,\; d > 0.$$

This is a mixed integer programming problem. We can reformulate it as a pure integer programming problem by eliminating d. Temporarily dropping the subscript k for convenience, we have

$$F(d, x, y) = \sum_{i=1}^{m} \sum_{j=1}^{n} (r_{ij} - d\,x_i y_j)^2 = \|R\|_F^2 - 2\,d\,x^T R\,y + d^2 \|x\|_2^2 \,\|y\|_2^2.$$

At the optimal solution,

$$\frac{\partial F}{\partial d} = -2\,x^T R\,y + 2\,d\,\|x\|_2^2 \,\|y\|_2^2 = 0,$$

so the optimal value, $d^*$, of d is given by

$$d^* = \frac{x^T R\,y}{\|x\|_2^2 \,\|y\|_2^2}.$$

Plugging $d^*$ into F we get

$$F(d^*, x, y) = \|R\|_F^2 - 2\left( \frac{x^T R\,y}{\|x\|_2^2 \,\|y\|_2^2} \right) x^T R\,y + \left( \frac{x^T R\,y}{\|x\|_2^2 \,\|y\|_2^2} \right)^2 \|x\|_2^2 \,\|y\|_2^2 = \|R\|_F^2 - \frac{(x^T R\,y)^2}{\|x\|_2^2 \,\|y\|_2^2}.$$

Thus the mixed integer programming problem above is equivalent to

$$\max \tilde{F}_k(x, y) \equiv \frac{(x^T R_k\,y)^2}{\|x\|_2^2 \,\|y\|_2^2} \quad \text{s.t. } x \in S^m,\; y \in S^n.$$
This is an integer programming problem with $3^{m+n}$ feasible points. When both m and n are small, we can enumerate the feasible points and compute each function value to determine the maximizer. However, as m and/or n grows, the cost of this approach grows exponentially. Rather than trying to solve the problem exactly, we use an alternating algorithm to generate an approximate solution: we begin by fixing y and solving for x, we then fix that x and solve for y, we then fix that y and solve for x, and so on. When x or y is fixed, the problem can be solved directly. Suppose that y is fixed. Then we must solve

$$\max_{x \in S^m} \frac{(x^T s)^2}{\|x\|_2^2}, \quad \text{where } s = \frac{R\,y}{\|y\|_2^2} \text{ is fixed.}$$

Sort the elements of s so that

$$|s_{i_1}| \ge |s_{i_2}| \ge \cdots \ge |s_{i_m}|.$$

If we knew that x had exactly J nonzero entries, it is clear that the solution would be given by

$$x_{i_j} = \begin{cases} \operatorname{sign}(s_{i_j}), & \text{if } 1 \le j \le J \\ 0, & \text{if } J < j \le m. \end{cases}$$
Hence, the O'Leary-Peleg algorithm for finding an SDD approximation of rank $k_{\max}$ to an m × n matrix A is:

1. Let $R_1 = A$.
2. Outer iteration ($k = 1, 2, \ldots, k_{\max}$):
   a. Choose a starting vector y such that $R_k y \ne 0$.
   b. Inner iteration ($i = 1, 2, \ldots, i_{\max}$):
      i. Fix y and let x solve $\max_{x \in S^m} \dfrac{(x^T R_k\,y)^2}{\|x\|_2^2}$.
      ii. Fix x and let y solve $\max_{y \in S^n} \dfrac{(y^T R_k^T\,x)^2}{\|y\|_2^2}$.
   c. End inner iteration.
   d. Let $x_k = x$, $y_k = y$, and $d_k = \dfrac{x_k^T R_k\,y_k}{\|x_k\|_2^2 \,\|y_k\|_2^2}$.
   e. Let $R_{k+1} = R_k - d_k\,x_k\,y_k^T$.
3. End outer iteration.
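The following is a compact NumPy sketch of this greedy procedure, including the sort-based solution of the inner subproblem (our own rendering of the algorithm as described above, not the SDDPACK code; degenerate cases such as a zero residual are not handled):

```python
import numpy as np

def solve_subproblem(s: np.ndarray) -> np.ndarray:
    """Maximize (x^T s)^2 / ||x||_2^2 over x in {-1, 0, 1}^m: sort |s| and
    try every prefix length J of the sorted entries."""
    order = np.argsort(-np.abs(s))
    best_val, best_J, run = -1.0, 1, 0.0
    for J in range(1, len(s) + 1):
        run += abs(s[order[J - 1]])
        if run * run / J > best_val:        # ((sum of top-J |s_i|)^2) / J
            best_val, best_J = run * run / J, J
    x = np.zeros(len(s))
    x[order[:best_J]] = np.sign(s[order[:best_J]])
    return x

def sdd(A: np.ndarray, kmax: int, inner: int = 10):
    """Greedy k-term SDD: returns (d, X, Y) with A ~ X @ diag(d) @ Y.T."""
    R = A.astype(float).copy()
    d, X, Y = [], [], []
    for _ in range(kmax):
        y = np.zeros(A.shape[1])
        y[np.argmax(np.abs(R).sum(axis=0))] = 1.0   # start vector, R y != 0
        for _ in range(inner):
            x = solve_subproblem(R @ y)              # fix y, solve for x
            y = solve_subproblem(R.T @ x)            # fix x, solve for y
        dk = (x @ R @ y) / ((x @ x) * (y @ y))       # optimal weight d_k
        R -= dk * np.outer(x, y)                     # deflate the residual
        d.append(dk); X.append(x); Y.append(y)
    return np.array(d), np.array(X).T, np.array(Y).T

A = np.random.default_rng(0).random((8, 6))
d, X, Y = sdd(A, kmax=4)
print(np.linalg.norm(A - X @ np.diag(d) @ Y.T) / np.linalg.norm(A))
```

Each outer pass adds one triplet $(d_k, x_k, y_k)$ and subtracts its contribution from the residual, so the relative error printed at the end decreases as kmax grows.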
4. The Experiment

We conducted experiments to investigate the effectiveness of the parsing and stemming process in LSI using SDD. In this study, we use a small collection of 107 student research documents from the Computer Science Department, IPB. An automatic index was built using the stemmer program, and the term-document matrix was then decomposed using the SDD function. For the experiment we built the stemmer program in Visual Basic 6.0, the SDD functions in Matlab 6.5, and the query program in C++. The output of the stemmer program is a weighted term-document matrix, which becomes the input to the SDD function. Applying the SDD function with matrix rank k yields the matrices $D_k$, $X_k$ and $Y_k$, which in turn are the input to the query program. The query program stems the query text and finds the similar documents in the document collection. Document similarity was computed using the cosine coefficient.
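A sketch of this matching step, scoring a query against the documents directly from the SDD factors (folding the query through the approximation $A_k = X_k D_k Y_k^T$ is our reading by analogy with SVD-based LSI; the paper does not spell out its exact formula):

```python
import numpy as np

def query_scores(q: np.ndarray, X: np.ndarray, d: np.ndarray,
                 Y: np.ndarray) -> np.ndarray:
    """Cosine similarity between query term vector q (length m) and each
    column of A_k = X @ diag(d) @ Y.T, without forming A_k explicitly."""
    scores = ((q @ X) * d) @ Y.T          # q^T A_k: one value per document
    XD = X * d                            # m x k
    G = XD.T @ XD                         # k x k Gram matrix of X diag(d)
    col_norms = np.sqrt(np.einsum("jk,kl,jl->j", Y, G, Y))
    return scores / np.maximum(col_norms * np.linalg.norm(q), 1e-12)

# Usage with the factors d, X, Y from the SDD sketch above and a weighted
# query vector q (hypothetical names):
# top20 = np.argsort(-query_scores(q, X, d, Y))[:20]
```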
The main topic of the document collection is computer science. We evaluate the effect of stemming in indexing and querying using 6 different queries, with matrix rank k ∈ {10, 20, 30, 40, 50, 54, 58}, on two kinds of index terms: stemmed and non-stemmed. The effectiveness of Indonesian text retrieval was measured by average precision, interpolated average precision and non-interpolated average precision for the top-20 documents in the ranked document list. Only a small number of documents (20 out of the 107-document collection) was evaluated because we are particularly interested in measuring the effectiveness of the SDD technique for quick retrieval, where the user does not want to spend time checking the relevance of too many retrieved documents.

5. The Experiment Results
The experiment results of document collection retrieval using stemmed and non-stemmed terms are displayed in Figure 1 (average precision), Figure 2 (interpolated average precision), and Figure 3 (non-interpolated average precision). The results indicate that average precision using the SDD technique with stemmed terms is better than with non-stemmed terms for k ≤ 50, and the average precision reaches 95% for k = 40. Overall, stemmed terms give a slight increase over all queries; this might be due to the chosen matrix rank (k). For example, at the values of k with good average precision (k = 40 and k = 50), the increases are 10% and 5%, respectively. In addition, the stemming process reduced the number of index terms from 1940 to 1112, which results in a smaller term-document matrix and faster term index searching.
Figure 1. Average precision for stemmed and non-stemmed terms at rank k.

Figure 2. Interpolated average precision for stemmed and non-stemmed terms at rank k.
Figure 3. Non-interpolated average precision for stemmed and non-stemmed terms at rank k.
6. Conclusion
In this paper, we applied a technique for building a latent semantic indexing system using Semi-Discrete Matrix Decomposition for Bahasa Indonesia information retrieval. The paper covered document preprocessing (term extraction, stop lists and stemming for Indonesian texts), term-document matrix construction and weighting, and query processing. The results indicate that average precision using the SDD technique with stemmed terms is better than with non-stemmed terms for k ≤ 50, and the average precision reaches 95% for k = 40. Overall, stemmed terms give a slight increase over all queries; this might be due to the chosen matrix rank (k). In addition, the stemming process reduced the number of index terms from 1940 to 1112, which results in a smaller term-document matrix and faster term index searching. Open questions remain: in particular, how the stemming algorithm would behave on a more extensive document collection, and how these algorithms affect average precision as the document collection changes.

7. References
[1] Alwi, H., S. Dardjowidjojo, H. Lapoliwa and A. M. Moeliono (1998). Tata Bahasa Baku Bahasa Indonesia, 3rd edn, Balai Pustaka, Jakarta.
[2] Berry, M. and Browne, M. (1999). Understanding Search Engines: Mathematical Modeling and Text Retrieval, Society for Industrial and Applied Mathematics.
[3] Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W. and Harshman, R. A. (1990). Indexing by latent semantic analysis, Journal of the American Society for Information Science 41(6): 391–407.
[4] Frakes, W. B. (1992). Stemming algorithms, in W. B. Frakes and R. Baeza-Yates (eds), Information Retrieval: Data Structures and Algorithms, Prentice Hall, Englewood Cliffs, NJ, pp. 131–160.
[5] Kolda, T. G. (1997). Limited-Memory Matrix Methods with Applications, PhD thesis, University of Maryland at College Park, Applied Mathematics Program. URL: citeseer.nj.nec.com/115586.html
[6] Kolda, T. G. and O'Leary, D. P. (1998). A semidiscrete matrix decomposition for latent semantic indexing in information retrieval, ACM Transactions on Information Systems 16(4): 322–346.
[7] Kolda, T. G. and O'Leary, D. P. (2000). Algorithm 805: Computation and uses of the semidiscrete matrix decomposition, ACM Transactions on Mathematical Software 26(3): 415–435. URL: http://doi.acm.org/10.1145/358407.358424
[8] Ridha, A. (2002). Pengindeksan Otomatis dengan Istilah Tunggal untuk Dokumen Berbahasa Indonesia, undergraduate thesis, Computer Science Department, IPB.
[9] Rosario, B. Latent Semantic Indexing: An Overview, Technical Report INFOSYS 240 Spring Paper, University of California, Berkeley. URL: http://www.sims.berkeley.edu/rosario/project/LSI.pdf
[10] Salton, G. and Buckley, C. (1997). Term-weighting approaches in automatic text retrieval, in K. Sparck Jones and P. Willett (eds), Readings in Information Retrieval, Morgan Kaufmann Publishers, Inc.