Effective summarization method of text documents

Rasim Alguliev, Ramiz Aliguliyev
Institute of Information Technology, Azerbaijan National Academy of Sciences, Baku, Azerbaijan
[email protected], [email protected]

Abstract

This work is devoted to the problem of classifying text documents through summarization. There are various approaches to the classification of text documents. Most classification methods, which rely on the Vector Space Model, analyze separate words in the documents. To increase the accuracy of document classification it is necessary to take into account more informative features of the documents in question. For this purpose, a summarization method serving as a preprocessing step in document classification is suggested in this work. During summarization, the method takes into account the weight of each sentence in the document. The essence of the suggested method is the preliminary identification of every sentence in the document with a characteristic vector of the words that appear in the document, and the calculation of a relevance score for each sentence. The relevance score of a sentence is determined by comparing it, using the cosine measure, with all the other sentences in the document and with the document title. Prior to applying this method, the set of informative features is defined, and the weight of each word in a sentence is calculated with account of those features. The weights of the features influencing the relevance of words are determined using genetic algorithms.
1. Introduction

Of all the kinds of information accumulated on the WWW, text data is, as a rule, of the greatest interest. Every day hundreds of new text documents appear on the Internet, adding to the already enormous amount of accessible text information. At the same time, text search is not limited to finding a relevant web page on the Internet; it is used in many important applications of the contemporary world, such as classification of text objects, creation of automated reference systems, etc. In such cases the problem of mining text documents arises sharply. There are various methods of data mining [1], and classification is one of them. Classification consists of breaking a sample of text documents down into non-overlapping groups of documents, with the aim of ensuring maximal "proximity" (similarity) between the documents of each group, corresponding to a certain topic, and maximal difference between the groups [10].

There are various approaches to the classification of text documents [1, 22]. Most classification methods rely on the Vector Space Model and analyze separate words in the documents [17, 18, 19]. The Vector Space Model represents documents as characteristic vectors of the words that appear in the whole collection of documents in question. Each characteristic vector contains the weights of the words (usually the number of occurrences of a word) appearing in the collection. Similarity between documents is measured with one of the similarity measures, such as the cosine measure, the Euclidean measure or the Jaccard measure.
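As a brief illustration of the Vector Space Model representation and of the similarity measures just mentioned, the following sketch may help; it is our own illustrative code (all names are ours), not part of the paper.

```python
# Illustrative sketch: documents as term-count vectors in the Vector Space Model,
# compared with the cosine, Euclidean and Jaccard measures.
import math
from collections import Counter

def term_vector(text, vocabulary):
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocabulary]

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def jaccard(x, y):
    # set-based Jaccard over the vocabulary positions present in each document
    sx = {i for i, a in enumerate(x) if a}
    sy = {i for i, b in enumerate(y) if b}
    return len(sx & sy) / len(sx | sy) if (sx | sy) else 0.0

docs = ["text mining of web documents", "classification of text documents"]
vocab = sorted({w for d in docs for w in d.lower().split()})
v1, v2 = term_vector(docs[0], vocab), term_vector(docs[1], vocab)
print(cosine(v1, v2), euclidean(v1, v2), jaccard(v1, v2))
```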
To attain a higher level of accuracy in document classification it is necessary to take into account more informative features of the documents. For this purpose, for instance, in work [11] the weights of HTML tags, which affect the efficiency of information retrieval, are determined using genetic algorithms. In work [4], document classification is carried out at the level of separate words, but unlike classical works, the relevance of each word is defined in relation to its informative features: the occurrence of a word in the title, the emphasis of a word by italic or bold fonts or by underlining, and the position of a word on the page. A DIG (Document Index Graph) algorithm, based on graph theory and taking into account phrases and their weights, was suggested in work [8]; here the term "phrase" means a sequence of words, not the grammatical structure of a sentence. The GIS (Generalized Instance Set) algorithm suggested in work [13] combines the k-nearest neighbors method with a linear classifier.

Over recent years a text summarization technique [20], called "preprocessing in classification", has been widely used in the classification of text documents. The summarization technique is used for the extraction of important contexts [2], sentences [6, 7, 12, 21], and paragraphs [16, 9]. The effect of the context, which holds information extracted from the content of all documents linked with a given page, was studied in work [2] in relation to the summarization of web pages. The TRM (Text Relationship Map) technique was suggested for the extraction of important paragraphs from texts in work [16], and in work [9] the paragraphs are clustered by the k-means method before the important ones are extracted. In work [12] the relevance score of a sentence is calculated through a combination of two techniques for summarization purposes. According to the first technique, words are ranked by category according to their χ2 statistics values, and then the importance of a sentence is defined by the TF*IDF (Term Frequency, Inverse Document Frequency) formula with account of the word categories. As per the second technique, the importance of a sentence is defined by the measure of similarity between the sentence and the title. In work [6], the relevance score of a sentence is calculated for the summarization of newspaper articles by a weighted combination of statistical and linguistic features. In work [21] four techniques are suggested for increasing the accuracy of summarization of web pages; the relevance score of a sentence is calculated by each of the suggested techniques, and the resulting relevance score of the sentence is the sum of those scores. In work [7] two generic text summarization methods are proposed that create text summaries by ranking and extracting sentences from the original documents. The first method uses standard information retrieval methods to rank sentence relevance, while the second method uses the latent semantic analysis technique to identify semantically important sentences for summary creation. Both methods strive to select sentences that are highly ranked and different from each other.

Note right away that the last-cited work has several shortcomings. Firstly, the calculation of the vectors of weighted words does not take into account the informative features of those words. Secondly, the relevance score is defined between the sentence and the document as a whole, which cannot produce an accurate score of the relevance of the sentence within the document; note that work [21] has the same shortcoming. And finally, this method is not efficient from the computational point of view, since the vector of weighted words of the document and the relevance score of a sentence are calculated anew at each iteration.

This work is dedicated to the classification of text documents through a summarization technique. We suggest determining the relevance score of each word in the document in relation to its informative features prior to summarization. After that, the relevance score of each sentence is calculated in relation to all other sentences and to the title of the document. Finally, a sentence is included in the summary in accordance with its relevance score. The results of summarization and classification are evaluated in this work as well.
2. Informative feature selection and definition of word weights

Despite the criticism levelled at methods that analyze separate words, it is clear that words are the main carriers of information. Therefore, for classification purposes it is first necessary to determine the weight of each word in the document. It is known that words may occur in documents in various forms of writing. These forms of writing provide additional, yet substantial, information about the importance of words, and these informative features have to be taken into account in order to increase the accuracy of classification. In our view, such informative features are: words emphasized by bold, italic or underlined fonts; words typed or written in upper case; and the size of the font applied.

We introduce the following notation:
$N(w,d)$ - total number of words $w$ in document $d$;
$N(s,d)$ - total number of sentences $s$ in document $d$;
$N^{E}(w,d)$ - total number of emphasized words $w$ in document $d$;
$N^{U}(w,d)$ - total number of words $w$ in document $d$ typed in upper case;
$N^{L}(w,d)$ - total number of words $w$ in document $d$ typed in large-size fonts;
$N^{S}(w,d)$ - total number of words $w$ in document $d$ typed in small-size fonts;
$N^{R}(w,d)$ - total number of words $w$ in document $d$ typed in regular-size fonts;
$N^{E}(w_j,s_i)$ - number of occurrences of the $j$th emphasized word $w_j$ in the $i$th sentence $s_i$, $j=1,\dots,N^{E}(w,d)$;
$N^{U}(w_j,s_i)$ - number of occurrences of the $j$th word $w_j$, typed in upper case, in the $i$th sentence $s_i$, $j=1,\dots,N^{U}(w,d)$;
$N^{L}(w_j,s_i)$ - number of occurrences of the $j$th word $w_j$, typed in large-size font, in the $i$th sentence $s_i$, $j=1,\dots,N^{L}(w,d)$;
$N^{S}(w_j,s_i)$ - number of occurrences of the $j$th word $w_j$, typed in small-size font, in the $i$th sentence $s_i$, $j=1,\dots,N^{S}(w,d)$;
$N^{R}(w_j,s_i)$ - number of occurrences of the $j$th word $w_j$, typed in regular-size font, in the $i$th sentence $s_i$, $j=1,\dots,N^{R}(w,d)$, $i=1,\dots,N(s,d)$.

Here words typed in regular fonts are understood as those not possessing any of the informative features indicated above. With this notation, we define the frequency functions.

Frequency function of occurrences of the $j$th emphasized word $w_j$ in the $i$th sentence $s_i$:
$$f_{ij}^{E} = \frac{N^{E}(w_j,s_i)}{N^{E}(w,d)}, \quad j=1,\dots,N^{E}(w,d); \qquad (1)$$

frequency function of occurrences of the $j$th word $w_j$, typed in upper case, in the $i$th sentence $s_i$:
$$f_{ij}^{U} = \frac{N^{U}(w_j,s_i)}{N^{U}(w,d)}, \quad j=1,\dots,N^{U}(w,d); \qquad (2)$$

frequency function of occurrences of the $j$th word $w_j$, typed in large-size fonts, in the $i$th sentence $s_i$:
$$f_{ij}^{L} = \frac{N^{L}(w_j,s_i)}{N^{L}(w,d)}, \quad j=1,\dots,N^{L}(w,d); \qquad (3)$$

frequency function of occurrences of the $j$th word $w_j$, typed in small-size fonts, in the $i$th sentence $s_i$:
$$f_{ij}^{S} = \frac{N^{S}(w_j,s_i)}{N^{S}(w,d)}, \quad j=1,\dots,N^{S}(w,d); \qquad (4)$$

frequency function of occurrences of the $j$th word $w_j$, typed in regular fonts, in the $i$th sentence $s_i$:
$$f_{ij}^{R} = \frac{N^{R}(w_j,s_i)}{N^{R}(w,d)}, \quad j=1,\dots,N^{R}(w,d). \qquad (5)$$

We now pass to the definition of the weight of each word in a sentence according to its features. The weight of the $j$th word $w_j$ depends on the frequency of its occurrence in the concrete $i$th sentence $s_i$ and in the document $d$. For this purpose we use the TF*IDF formula, which assigns the highest weights to words of medium frequency and insignificant weights both to commonly used words and to those used only occasionally.
Thus, the weight of each word is defined with account of its feature by the following formulas.

Weight of the $j$th emphasized word $w_j$ in the $i$th sentence $s_i$ of document $d$:
$$\omega_{ij}^{E} = f_{ij}^{E} \log_2 \frac{N(s,d)}{N^{E}(s,d,w_j)}; \qquad (6)$$

weight of the $j$th word $w_j$, typed in upper case, in the $i$th sentence $s_i$ of document $d$:
$$\omega_{ij}^{U} = f_{ij}^{U} \log_2 \frac{N(s,d)}{N^{U}(s,d,w_j)}; \qquad (7)$$

weight of the $j$th word $w_j$, typed in large-size font, in the $i$th sentence $s_i$ of document $d$:
$$\omega_{ij}^{L} = f_{ij}^{L} \log_2 \frac{N(s,d)}{N^{L}(s,d,w_j)}; \qquad (8)$$

weight of the $j$th word $w_j$, typed in small-size font, in the $i$th sentence $s_i$ of document $d$:
$$\omega_{ij}^{S} = f_{ij}^{S} \log_2 \frac{N(s,d)}{N^{S}(s,d,w_j)}; \qquad (9)$$

weight of the $j$th word $w_j$, typed in regular-size font, in the $i$th sentence $s_i$ of document $d$:
$$\omega_{ij}^{R} = f_{ij}^{R} \log_2 \frac{N(s,d)}{N^{R}(s,d,w_j)}. \qquad (10)$$

Here the following notation is used:
$N^{E}(s,d,w_j)$ - number of sentences $s$ in document $d$ in which the $j$th emphasized word $w_j$ appears;
$N^{U}(s,d,w_j)$ - number of sentences $s$ in document $d$ in which the $j$th word $w_j$, typed in upper case, appears;
$N^{L}(s,d,w_j)$ - number of sentences $s$ in document $d$ in which the $j$th word $w_j$, typed in large-size fonts, appears;
$N^{S}(s,d,w_j)$ - number of sentences $s$ in document $d$ in which the $j$th word $w_j$, typed in small-size fonts, appears;
$N^{R}(s,d,w_j)$ - number of sentences $s$ in document $d$ in which the $j$th word $w_j$, typed in regular-size font, appears.
So, the weight of each $j$th word $w_j$ in the $i$th sentence $s_i$ of document $d$ has been defined with account of its features. Now let us define the generalized weight of each $j$th word $w_j$ in the $i$th sentence $s_i$ of document $d$ by the following formula:
$$\omega_{ij} = \alpha_1 \omega_{ij}^{E} + \alpha_2 \omega_{ij}^{U} + \alpha_3 \omega_{ij}^{L} + \alpha_4 \omega_{ij}^{S} + \alpha_5 \omega_{ij}^{R}, \qquad (11)$$
where the feature weights $\alpha_k \in [0,1]$, $k=1,2,3,4,5$, satisfy the condition
$$\sum_{k=1}^{5} \alpha_k = 1. \qquad (12)$$
Selection of the feature weights $\alpha_k$ $(k=1,2,3,4,5)$ is carried out with the aid of a genetic algorithm [5, 15, 3] on the training set.
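As an illustration only (our own sketch, not the authors' implementation), the following fragment computes the feature-dependent frequencies (1)-(5), the TF*IDF weights (6)-(10) and the generalized weights (11), under the assumption that every sentence is already tokenized into (word, feature) pairs; the α values are simply fixed here, whereas the paper selects them with a genetic algorithm subject to (12).

```python
# Sketch of Eqs. (1)-(12); assumes each sentence is a list of (word, feature) tokens,
# where feature is 'E' (emphasized), 'U' (upper case), 'L' (large font),
# 'S' (small font) or 'R' (regular).
import math
from collections import Counter

FEATURES = ('E', 'U', 'L', 'S', 'R')

def word_weights(sentences, alpha):
    """Return, for every sentence i, a dict word -> generalized weight (Eq. 11)."""
    n_sent = len(sentences)                                    # N(s, d)
    total = Counter()                                          # N^F(w, d)
    sent_freq = {f: Counter() for f in FEATURES}               # N^F(s, d, w_j)
    per_sent = [{f: Counter() for f in FEATURES} for _ in sentences]  # N^F(w_j, s_i)

    for i, sentence in enumerate(sentences):
        for word, feat in sentence:
            word = word.lower()
            total[feat] += 1
            per_sent[i][feat][word] += 1
        for feat in FEATURES:
            for word in per_sent[i][feat]:
                sent_freq[feat][word] += 1

    weights = []
    for i in range(n_sent):
        w_i = Counter()
        for feat in FEATURES:
            for word, n in per_sent[i][feat].items():
                freq = n / total[feat]                          # Eqs. (1)-(5)
                idf = math.log2(n_sent / sent_freq[feat][word]) # Eqs. (6)-(10)
                w_i[word] += alpha[feat] * freq * idf           # Eq. (11)
        weights.append(dict(w_i))
    return weights

# The alpha values sum to one (Eq. 12); the paper tunes them with a genetic algorithm.
alpha = {'E': 0.3, 'U': 0.2, 'L': 0.2, 'S': 0.1, 'R': 0.2}
sentences = [[('Text', 'U'), ('mining', 'E'), ('is', 'R'), ('useful', 'R')],
             [('Mining', 'R'), ('of', 'R'), ('text', 'R'), ('documents', 'E')]]
print(word_weights(sentences, alpha))
```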
3. Document summarization

In document summarization the main objective is the automatic selection of text passages (here the passages may be phrases, sentences or paragraphs) that adequately mirror the meaning of the document. In this section we suggest a method, called "sentence by sentence", for extracting meaningful sentences from the text into the summary based on the definition of a relevance score for each sentence. The essence of the suggested method is the comparison of each sentence with the rest of the sentences and with the title, and the calculation of a proximity measure between them.
3.1. Definition of the relevance score of a sentence

Before calculating the proximity measure between sentences, each sentence $s_i$ is first associated with the characteristic vector $\mathbf{s}_i = (\omega_{i1},\dots,\omega_{iN(w,d)})$ of the words that appear in the document $d$, and then the cosine measure is applied:
$$sim(s_i,s_j) = \cos(\mathbf{s}_i,\mathbf{s}_j) = \frac{\sum_{k=1}^{N(w,d)} \omega_{ik}\,\omega_{jk}}{\sqrt{\sum_{k=1}^{N(w,d)} \omega_{ik}^{2}}\,\sqrt{\sum_{k=1}^{N(w,d)} \omega_{jk}^{2}}}, \quad i,j=1,\dots,N(s,d). \qquad (13)$$

Further, let us calculate the aggregate proximity measure between the $i$th sentence and the rest of the sentences in the text:
$$sim_{\Sigma}(s_i) = \sum_{j=1,\, j \ne i}^{N(s,d)} sim(s_i,s_j). \qquad (14)$$

On the other hand, it is known that the title is one of the information carriers in a text. Therefore, to take into account the contribution of the title to summarization, let us define the proximity measure between the title and the $i$th sentence:
$$sim_{title}(s_i) = sim(\mathbf{s}_i,\mathbf{T}), \quad i=1,\dots,N(s,d), \qquad (15)$$
where $\mathbf{T}$ is the characteristic vector of words corresponding to the title. As the sentences and the title contribute differently to the summarization process, they enter the final relevance score of the $i$th sentence with corresponding weights:
$$score(s_i) = \beta_1\, sim_{\Sigma}(s_i) + \beta_2\, sim_{title}(s_i), \qquad (16)$$
where $\beta_1, \beta_2 \in [0,1]$ and $\beta_1 + \beta_2 = 1$. These weights are also determined using genetic algorithms [5, 15, 3].

The next step is the inclusion of sentences into the summary according to their relevance scores. Prior to inclusion, the sentences are ranked in descending order by their relevance scores. After ranking, the sentences are included into the summary starting with the highest relevance score, provided that the relevance scores of all sentences included in the summary are above the threshold set for this purpose, $score(s_i) > \theta$ $(i=1,\dots,N(s,d))$. This process continues until the compression ratio $rate_{comp}$ reaches the limit set. In work [14] it was shown that a summary is considered acceptable if the compression ratio stays within the interval $[0.05, 0.3]$:
$$rate_{comp} = \frac{len_{summary}}{len_{document}}, \qquad (17)$$
where $len_{summary}$ and $len_{document}$ are the lengths (numbers of words) of the summary and of the document, respectively.
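A minimal sketch of how the scoring and extraction of Eqs. (13)-(17) fit together is given below; it is our own code with assumed helper names, the characteristic vectors are taken to be dictionaries mapping words to the weights of Section 2, and β1, β2 and the threshold θ are fixed example values (the paper selects the β weights with a genetic algorithm).

```python
# Sketch of Eqs. (13)-(17): score each sentence against the other sentences and
# the title, rank the scores, and extract a summary under a compression limit.
import math

def cosine(u, v):
    # u, v: characteristic vectors as dicts word -> weight (Eq. 13)
    dot = sum(w * v.get(word, 0.0) for word, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def score_sentences(vectors, title_vector, beta1=0.6, beta2=0.4):
    scores = []
    for i, v in enumerate(vectors):
        sim_sum = sum(cosine(v, u) for j, u in enumerate(vectors) if j != i)  # Eq. (14)
        sim_title = cosine(v, title_vector)                                   # Eq. (15)
        scores.append(beta1 * sim_sum + beta2 * sim_title)                    # Eq. (16)
    return scores

def extract_summary(sentences, scores, theta=0.1, max_ratio=0.3):
    doc_len = sum(len(s.split()) for s in sentences)
    ranked = sorted(range(len(sentences)), key=scores.__getitem__, reverse=True)
    chosen, summary_len = [], 0
    for i in ranked:
        if scores[i] <= theta:                                  # threshold on the score
            break
        new_len = summary_len + len(sentences[i].split())
        if new_len / doc_len > max_ratio:                       # compression ratio, Eq. (17)
            break
        chosen.append(i)
        summary_len = new_len
    return [sentences[i] for i in sorted(chosen)]
```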
3.2. Evaluation of summarization

To evaluate the results of summarization we use the $F_1$-measure. Let $N_{document}^{rel}$ be the number of relevant sentences in the document, $N_{summary}^{rel}$ the number of relevant sentences in the summary, $N_{summary}$ the number of sentences in the summary, $P_{summary}$ the precision and $R_{summary}$ the recall. Then:
$$P_{summary} = \frac{N_{summary}^{rel}}{N_{summary}}, \qquad (18)$$
$$R_{summary} = \frac{N_{summary}^{rel}}{N_{document}^{rel}}, \qquad (19)$$
$$F_1^{summary} = \frac{2\,P_{summary}\,R_{summary}}{P_{summary} + R_{summary}}. \qquad (20)$$
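For completeness, a direct transcription of Eqs. (18)-(20) follows (the function and argument names are illustrative, not from the paper).

```python
# Eqs. (18)-(20): precision, recall and F1 of a summary.
def summary_f1(relevant_in_summary, summary_size, relevant_in_document):
    precision = relevant_in_summary / summary_size               # Eq. (18)
    recall = relevant_in_summary / relevant_in_document          # Eq. (19)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)         # Eq. (20)

print(summary_f1(3, 5, 4))   # precision 0.6, recall 0.75 -> F1 = 2/3
```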
4. Classification

The task of document classification consists of automatically assigning a new document to some class known to the system. There are numerous methods for solving the document classification task. The k-nearest neighbors method is one of the fast and effective methods for classification.

4.1. Method of k-nearest neighbors

The k-nearest neighbors method is used in many classification problems to identify the class to which a document pertains. The method operates on a set of documents that have already been classified. For each new document $d_{N+1}$ coming into the system, the $k$ documents nearest to it that have already been allocated to one of the classes are identified. When using the k-nearest neighbors method for document classification, the researcher has to resolve the problem of selecting a metric for identifying the proximity of objects.

If the Euclidean measure is used:
$$dist(d_i,d_j) = \sqrt{\sum_{l=1}^{NV} (\omega_{il} - \omega_{jl})^2}, \quad i,j=1,\dots,N, \qquad (21)$$
then for each class $c_k$ the sum of distances between the new document $d_{N+1}$ and each of the $k$ documents attributed earlier to this class is calculated:
$$dist_{\Sigma}(d_{N+1},c_k) = \sum_{d' \in O_k(d_{N+1}) \cap D_k} dist(d_{N+1},d') = \sum_{i \in I_k(d_{N+1})} dist(d_{N+1},d_i)\, e_{ik}, \qquad (22)$$
where $NV$ is the total number of words in the set of documents $D = (d_1,d_2,\dots,d_N)$; $O_k(d_{N+1})$ and $I_k(d_{N+1})$ are the elements and the indices, respectively, of the $k$ nearest neighbors of the document $d_{N+1}$; $D_k$ is the set of documents already attributed to class $c_k$; and
$$e_{ik} = \begin{cases} 1, & \text{if } d_i \in c_k, \\ 0, & \text{otherwise.} \end{cases}$$
The new document $d_{N+1}$ is included into the class $c_k$ with the minimal value of $dist_{\Sigma}(d_{N+1},c_k)$.

If the cosine measure is used, then the k-nearest neighbors classifier assigns a relevance score to each candidate class $c_k$ by the formula:
$$score(d_{N+1},c_k) = \sum_{d' \in O_k(d_{N+1}) \cap D_k} \cos(d_{N+1},d') = \sum_{i \in I_k(d_{N+1})} \cos(d_{N+1},d_i)\, e_{ik}, \qquad (23)$$
and the document $d_{N+1}$ then pertains to the class $c_k$ for which the value $score(d_{N+1},c_k)$ is maximal.
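A compact sketch of this classification step, Eqs. (21)-(23), is shown below (our own code; the document vectors, labels and the value of k are assumed inputs).

```python
# Sketch of Eqs. (21)-(23): k-nearest neighbors over document vectors, with either
# the Euclidean distance (class with minimal aggregate distance wins) or the
# cosine measure (class with maximal aggregate similarity wins).
import math
from collections import defaultdict

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))   # Eq. (21)

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def knn_classify(new_doc, docs, labels, k=5, measure='cosine'):
    sim = cosine if measure == 'cosine' else euclidean
    # indices of the k nearest neighbors: largest cosine or smallest distance
    neighbors = sorted(range(len(docs)), key=lambda i: sim(new_doc, docs[i]),
                       reverse=(measure == 'cosine'))[:k]
    per_class = defaultdict(float)
    for i in neighbors:
        per_class[labels[i]] += sim(new_doc, docs[i])            # Eqs. (22), (23)
    best = max if measure == 'cosine' else min
    return best(per_class, key=per_class.get)
```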
4.2. Reliability and evaluation of classification

Let $N$ be the total number of documents, of which $M$ documents are considered "relevant". Further, let $n$ be the number of documents classified into class $c_k$ $(k=1,\dots,K)$, and let $P_m(N,M,n)$ be the probability that exactly $m$ of the classified documents are "relevant". Then the probability $P_m(N,M,n)$ can be calculated by the formula of the hypergeometric distribution:
$$P_m(N,M,n) = \frac{C_M^m\, C_{N-M}^{n-m}}{C_N^n}, \qquad (24)$$
where $C_a^b = \frac{a!}{b!\,(a-b)!}$ is the binomial coefficient.

The hypergeometric distribution is associated with selection without replacement; namely, formula (24) gives the probability of getting exactly $m$ "relevant" documents in a random sample of $n$ documents drawn from a general population of $N$ documents, among which $M$ are "relevant" and $N-M$ are "irrelevant". Strictly speaking, the probability (24) is determined only for
$$\max(0,\, M+n-N) \le m \le \min(n,\, M). \qquad (25)$$
However, definition (24) may be used for all $m \ge 0$, since it is possible to set $C_a^b = 0$ for $b > a$; the equality $P_m(N,M,n) = 0$ should then be understood as the impossibility of obtaining $m$ "relevant" documents in the sample. The sum of the values $P_m(N,M,n)$ over the whole sample space is equal to one. If we denote $p = \frac{M}{N}$, then formula (24) can be rewritten in another form:
$$P_m(N,M,n) = C_n^m\, \frac{A_{Np}^m\, A_{Nq}^{n-m}}{A_N^n}, \qquad (26)$$
where $A_a^b = C_a^b\, b!$ and $p + q = 1$. If $p$ is constant and $N \to \infty$, then the binomial approximation holds:
$$P_m(N,M,n) \sim C_n^m\, p^m q^{n-m}. \qquad (27)$$
The mean of the hypergeometric distribution does not depend on $N$ and coincides with the mean $np$ of the corresponding binomial distribution. The variance of the hypergeometric distribution, $\sigma^2 = npq\,\frac{N-n}{N-1}$, does not exceed the variance $\sigma^2 = npq$ of the binomial law.

Let $N_{rel}(N,M,n)$ be the expected number of "relevant" documents in class $c_k$. Then
$$N_{rel}(N,M,n) = \sum_{m=0}^{M} m\, P_m(N,M,n). \qquad (28)$$
If we take into account the relation $C_M^m = \frac{M}{m}\, C_{M-1}^{m-1}$, then we have
$$N_{rel}(N,M,n) = \frac{M}{C_N^n} \sum_{m=0}^{M} C_{M-1}^{m-1}\, C_{N-M}^{n-m}. \qquad (29)$$
From the equality
$$\sum_{m=0}^{M} C_{M-1}^{m-1}\, C_{N-M}^{n-m} = C_{N-1}^{n-1} \qquad (30)$$
it follows that
$$N_{rel}(N,M,n) = \frac{M}{C_N^n}\, C_{N-1}^{n-1} \qquad (31)$$
and, hence,
$$N_{rel}(N,M,n) = \frac{Mn}{N}. \qquad (32)$$
After that we can calculate the precision and recall of classification:
$$P_{class} = \frac{N_{rel}(N,M,n)}{n} = \frac{M}{N}, \qquad (33)$$
$$R_{class} = \frac{N_{rel}(N,M,n)}{M} = \frac{n}{N}, \qquad (34)$$
$$F_1^{class} = \frac{2\,P_{class}\,R_{class}}{P_{class} + R_{class}}. \qquad (35)$$
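The following small sketch (our code; math.comb requires Python 3.8 or later) evaluates the hypergeometric probability (24) and the expected class precision, recall and F1 of Eqs. (32)-(35).

```python
# Eq. (24): hypergeometric probability; Eqs. (32)-(35): expected class scores.
from math import comb

def hypergeometric(m, N, M, n):
    # probability of exactly m relevant documents in a sample of n out of N (M relevant)
    return comb(M, m) * comb(N - M, n - m) / comb(N, n)

def class_scores(N, M, n):
    expected_relevant = M * n / N                         # Eq. (32)
    precision = expected_relevant / n                     # Eq. (33), equals M / N
    recall = expected_relevant / M                        # Eq. (34), equals n / N
    f1 = 2 * precision * recall / (precision + recall)    # Eq. (35)
    return precision, recall, f1

print(hypergeometric(2, N=100, M=10, n=5))
print(class_scores(N=100, M=10, n=5))
```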
5. Conclusion

An enormous amount of the information we have to deal with has the form of text. Books, magazines, manuals, web pages, e-mail messages and regular letters are all examples of how important text information is for people. With the development of information technology and telecommunications, the volume of text information is constantly increasing. Therefore a need arises for means of processing this information and of facilitating access to it with account of concrete applications. Means for the automatic classification of text documents can play this role. The problem of automatic text classification is of interest to many specialists all over the world. Unlike other types of classification, processing texts raises problems that make it difficult to apply the methods most used in this area. The specifics of this task lie in the fact that the number of features used in classification is high, while the features themselves change insignificantly.

With account of all of the above, a method of summarization of text documents that takes into account the informative features of words and the importance of sentences has been suggested in this work. The text summarization method presented in this article consists of the following steps:
1. Informative features of words are selected which, in our opinion, affect the accuracy of the summarization results and thus the classification results;
2. Using genetic algorithms, the weights of the informative features influencing the relevance of words are calculated;
3. The aggregate similarity measure between each sentence and the rest of the sentences is calculated by the cosine measure;
4. The similarity measure between each sentence and the title is calculated;
5. A weighted relevance score is defined for each sentence;
6. The relevance scores are ranked;
7. Starting with the highest score, the sentences whose relevance score is higher than the threshold value set are included in the summary, and the process continues until the compression ratio satisfies the limit set in advance.

Finally, the results of summarization and classification are evaluated against various measures of efficiency.
6. References

[1] S. Chakrabarti, "Data mining for hypertext: a tutorial survey", ACM SIGKDD Explorations, vol. 1, no. 2, January 2000, pp. 1-11.
[2] J.-Y. Delort, B. Bouchon-Meunier, and M. Rifqi, "Enhanced web document summarization using hyperlinks", Proceedings of the 14th ACM Conference on Hypertext and Hypermedia, Nottingham, United Kingdom, August 26-30, 2003, pp. 208-215.
[3] A.E. Eiben and J.E. Smith, Introduction to Evolutionary Computing, Berlin, Springer-Verlag, 2003.
[4] V. Fresno and A. Ribeiro, "An analytical approach to concept extraction in HTML environments", Journal of Intelligent Information Systems (Special Issue on Web Content Mining), vol. 22, no. 3, May 2004, pp. 215-235.
[5] D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison Wesley, 1989.
[6] J. Goldstein, M. Kantrowitz, V. Mittal, and J. Carbonell, "Summarizing text documents: sentence selection and evaluation metrics", Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'99), Berkeley, USA, August 15-19, 1999, pp. 121-128.
[7] Y. Gong and X. Liu, "Generic text summarization using relevance measure and latent semantic analysis", Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'01), New Orleans, Louisiana, USA, September 9-12, 2001, pp. 19-25.
[8] K.M. Hammouda and M.S. Kamel, "Efficient phrase-based document indexing for web document clustering", IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 10, October 2004, pp. 1279-1296.
[9] P. Hu, T. He, D. Ji, and M. Wang, "A study of Chinese text summarization using adaptive clustering of paragraphs", Proceedings of the 4th International Conference on Computer and Information Technology (CIT'04), Wuhan, China, IEEE Computer Society, September 14-16, 2004, pp. 1159-1164.
[10] A.K. Jain, M.N. Murty, and P.J. Flynn, "Data clustering: a review", ACM Computing Surveys, vol. 31, no. 3, September 1999, pp. 264-323.
[11] S. Kim and B.T. Zhang, "Genetic mining of HTML structures for effective web-document retrieval", Applied Intelligence, vol. 18, no. 3, May-June 2003, pp. 243-256.
[12] Y. Ko, J. Park, and J. Seo, "Improving text categorization using the importance of sentences", Information Processing and Management, vol. 40, no. 1, January 2004, pp. 65-79.
[13] W. Lam and Y. Han, "Automatic textual document categorization based on generalized instance sets and a metamodel", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 5, May 2003, pp. 628-633.
[14] I. Mani and M.T. Maybury, Advances in Automated Text Summarization, Cambridge, MIT Press, 1999.
[15] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, Berlin, Springer-Verlag, 1996.
[16] M. Mitra, A. Singhal, and C. Buckley, "Automatic text summarization by paragraph extraction", Proceedings of the ACL'97/EACL'97 Workshop on Intelligent Scalable Text Summarization, Madrid, Spain, July 7-12, 1997, pp. 39-46.
[17] G. Salton, A. Wong, and C.S. Yang, "A vector space model for automatic indexing", Communications of the ACM, vol. 18, no. 11, November 1975, pp. 613-620.
[18] G. Salton and M.J. McGill, Introduction to Modern Information Retrieval, New York, USA, McGraw-Hill, 1986.
[19] G. Salton, Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer, Boston, USA, Addison Wesley, 1989.
[20] G. Salton, A. Singhal, M. Mitra, and C. Buckley, "Automatic text structuring and summarization", Information Processing and Management, vol. 33, no. 2, March 1997, pp. 193-207.
[21] D. Shen, Z. Chen, Q. Yang, H.J. Zeng, B. Zhang, Y. Lu, and W.Y. Ma, "Web-page classification through summarization", Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'04), Sheffield, United Kingdom, July 25-29, 2004, pp. 242-249.
[22] V.O. Tolcheev, "Models and methods of the classification of text information", Information Technologies, no. 5, May 2004, pp. 6-14 (in Russian).