ISSN 0146-4116, Automatic Control and Computer Sciences, 2007, Vol. 41, No. 3, pp. 132–140. © Allerton Press, Inc., 2007. Original Russian Text © R.M. Alguliev, R.M. Alyguliev, 2007, published in Avtomatika i Vychislitel’naya Tekhnika, 2007, No. 3, pp. 21–32.

Summarization of Text-based Documents with a Determination of Latent Topical Sections and Information-Rich Sentences

R. M. Alguliev and R. M. Alyguliev

Institute of Information Technologies, National Academy of Sciences of Azerbaijan, ul. F. Agaev 9, Baku, AZ-1141, Azerbaijan
e-mail: [email protected], [email protected]

Received October 2, 2006; in final form, October 30, 2006

Abstract—A method is proposed for use in summarization of text-based documents. By means of the method it is possible to discover latent topical sections and information-rich sentences. The underlying basis of the method, clustering of sentences, is formulated mathematically in the form of a problem of integer quadratic programming. An algorithm that makes it possible to determine the optimal number of clusters with specified precision is developed. The synthesis of a neural network is described for the purpose of solving the problem of integer quadratic programming.

DOI: 10.3103/S0146411607030030

Key words: summarization, clustering, optimal number of clusters, information-rich sentence, neural networks

1. INTRODUCTION

The rate and scale of propagation of information have grown sharply with the growth of the World Wide Web. Of all the different types of information collected on the World Wide Web, text-based documents are generally of the greatest interest. Despite their simplicity, texts are the most important carrier of information and, obviously, will continue to function in this role for a long time to come. The overwhelming majority of scientific articles, documentation, on-line news, etc., possess a text format, which is responsible for the interest in problems of text processing and retrieval. The Internet's keyword-based retrieval mechanisms generate hundreds and even thousands of documents, which tends to confuse the user. With the growth in the number of available text-based documents, ordinary information-retrieval technologies are no longer satisfactory as means of finding relevant information. Therefore, there has arisen a need for a new technology that could assist the user in filtering the enormous quantity of information and in rapidly identifying the most relevant documents. Providing the user with annotations (summaries) of documents could significantly simplify the problem of identifying required documents. Text-based retrieval and summarization are two technologies which complement each other. The objective of the task of automatic summarization is to extract the information-rich fragments (contexts, sentences, paragraphs) from a document that reflect its content [1].

A large number of studies on the problem of summarization of text-based documents have been published in the scientific literature in recent years. For example, the TRM (Text Relationship Map) was proposed in [2, 3] as a means of extracting significant paragraphs. The underlying idea of the method is to represent a text in the form of a graph whose vertices are paragraphs. Each paragraph is identified by a weighted vector of words, and a measure of similarity between paragraphs, determined by a scalar product, is calculated. If the measure of similarity is greater than some given threshold, the corresponding vertices are linked. The criterion for inclusion of a paragraph in a summary is determined by the number of edges linking it to other paragraphs. In [3] four types of criteria for selecting a paragraph were proposed: bushy path, depth-first path, segmented bushy path, and augmented segmented bushy path.

Most studies have been concerned with determining a relevancy score of a sentence [4–10] for the purpose of including the sentence in a summary. The relevancy score proposed in [4] is determined by a weighted combination of the sentence's local and global characteristics. The local characteristic of a sentence is determined by Luhn's method [11], where the weight of a word is determined not by the TF*IDF formula (Term Frequency*Inverse Document Frequency), but rather by the formula TL*TF (Term Length*Term Frequency). The underlying idea of the TL*TF method [12] is based on the fact that words that occur frequently tend to be short. Such words do not describe the basic topic of a document, i.e., they are stop words. Conversely, words which are used rarely tend to be long. The advantage gained with the use of TL*TF for weighting words derives from the fact that this method does not require any external resources and utilizes only information found within the document. The global characteristic of a sentence is determined by the TRM method. Two approaches focused on the solution of this problem are proposed in [5]: MCBA (Modified Corpus-Based Approach) and LSA + TRM (Latent Semantic Analysis + Text Relationship Map). The first approach is trainable and takes into account certain features, such as the position of the sentence in the paragraph, positive and negative key words, and the centrality of the sentence in the document and its similarity to the heading. The second approach creates a semantic matrix of the document by means of Latent Semantic Analysis and then, using this semantic representation, constructs a semantic TRM. A combination of statistical and linguistic features is proposed in [6] as a way of determining the relevancy score of a sentence in newspaper articles. The statistical feature is determined by standard methods of information retrieval, while the linguistic feature is derived from an analysis of the summaries of newspaper articles.

Text summaries may be query-relevant or generic. A query-relevant summary represents the content of a document associated with the retrieval query. The creation of a query-relevant summary essentially involves a process of reconstructing query-relevant sentences (fragments) from a document, i.e., it possesses a strong analogy to the process of retrieving texts. Therefore, a query-relevant summary is often achieved through the application of the technology of information retrieval, and a large number of methods of summarization of texts belong to this category. A query-relevant summary is useful in responding to such questions as: "Is this document relevant to the user's question?" And if it is relevant, "What part(s) of the document are relevant?" A query-relevant summary does not fully encompass the content of a document and, consequently, is not appropriate for a brief survey of the content of a document. To respond to questions as to the particular category to which a given document belongs and what its key words are, a generic summary must be created and presented to the user. On the other hand, using generic summarization it is possible to support broad coverage of the content of the document. Two methods of generic summarization are proposed in [7] on the basis of these considerations. The first method, which is used to rank sentences relative to the relevancy score, utilizes standard methods of information retrieval. The relevancy score of a sentence is determined by the scalar product of the weighted vectors of the document and the sentence. In this method the principal focus is on minimization of redundancy in the summary, ignoring the broad scope of the document content. This follows from the fact that once a sentence with the highest value of the relevancy score has been selected, it is eliminated from the document. After a sentence has been eliminated, the vector of weighted words of the document is computed anew, where words that are contained in the eliminated sentence are not present in the definition of this vector. The second method, which utilizes LSA, identifies semantically significant sentences in order to create a summary. Summarization of Web pages is the subject of [9], a study in which the relevancy score of each sentence is computed using four known methods, with the final relevancy score set equal to the sum of these four scores.

In [8, 10], the paragraphs and sentences are first clustered before an attempt is made to determine which are the information-rich sentences. The method proposed in [8] essentially consists of three phases. In the first phase a weighted vector of paragraphs is created; in the second phase a clustering of the paragraphs (partitioning into topical sections) is performed, and a new algorithm for determining the number of clusters based on minimization of some objective function is proposed. Finally, in the third phase sentences are extracted from each topical section for inclusion in the summary. An analysis of the objective function shows that such an approach to determining the number of clusters and the number of selected information-rich sentences (a single sentence is taken from each cluster) cannot encompass the principal content of a document. These shortcomings are all related to the fact that there is no clear definition of the number of clusters. In [10], clustering of sentences is realized by means of a hierarchical clustering algorithm which, from the computational point of view, is more complicated than the k-means algorithm.

In the present article a new method of clustering of sentences designed to ensure minimal redundancy in a summary and the maximally possible degree of coverage of the document content is proposed. The method is based on the solution of a problem of integer quadratic programming with Boolean variables. Since the solution of a problem of integer programming involves considerable computational difficulties, we propose to use neural networks with feedback. The determination of the number of clusters is one of the hard problems of cluster analysis. Therefore, in the present study an algorithm for stepwise determination of the number of clusters will also be proposed. Following clustering, in order to avoid redundancy in a summary, the information-rich sentences and the number of such sentences will be determined in each cluster, i.e., in each topical section.


2. MATHEMATICAL MODEL OF PROBLEM OF SENTENCE CLUSTERING

In the process of intelligent data retrieval (data mining) clustering proves to be among the most useful approaches for discovering natural groups in a data set. Traditional algorithms, such as the k-means algorithm, hierarchical clustering, the GEM (Gaussian expectation-maximization) algorithm, and others are usually used to solve a clustering problem [13]. The k-means algorithm is among the more widely used of these algorithms because it is mathematically well defined. The formulation of the k-means algorithm as a problem of mathematical programming was proposed in [14]. In [15] the k-means algorithm was formulated in terms of nonsmooth and nonconvex optimization. In our treatment of the clustering problem in the present section we will not apply traditional methods, but instead use a technique proposed in [16].

Suppose that document d consists of m sentences. We represent it in the form of a set of sentences d = (s_1, s_2, …, s_m). The problem of clustering is to partition the set d = (s_1, s_2, …, s_m) into disjoint clusters C = (C_1, C_2, …, C_q), q ≥ 2. The objective is to assure maximal proximity between sentences of the same cluster corresponding to a previously defined semantic topic and to assure maximal difference between the clusters. A definition of the concept of proximity is given below.

Before undertaking a formulation of the method by means of a vector space model, let us represent each sentence in the form of a weighted vector s_i = (w_i1, w_i2, …, w_in) of the words that occur in the document, where n is the number of words in document d. The weight w_ij of word j depends on the frequency with which it occurs in a concrete sentence i and in the entire set of sentences (in the document), as determined by the formula TF*IDF:

$$w_{ij} = f_{ij} \log_2\left(\frac{m}{m_j}\right), \quad i = 1, \dots, m; \quad j = 1, \dots, n, \qquad (1)$$

where m_j is the number of sentences in which word j is present. The function f_ij of the frequency of occurrence of word j in sentence i is calculated as follows:

$$f_{ij} = \frac{n_{ij}}{len(s_i)}, \qquad (2)$$

where n_ij is the number of occurrences of word j in sentence i and len(s_i) is the length of sentence s_i. To avoid any bias induced by the length (number of words) of a sentence, the function f_ij is normalized relative to sentence length. Euclidean distance is the metric used most often in determining the proximity d_ip between sentences s_i and s_p:

$$d_{ip} = \sqrt{\sum_{j=1}^{n} (w_{ij} - w_{pj})^2}, \quad i, p = 1, \dots, m. \qquad (3)$$
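To make formulas (1)–(3) concrete, here is a minimal Python sketch (ours, not the authors'); it assumes each sentence has already been tokenized into a list of lower-case words:

```python
import math

def sentence_vectors(sentences):
    """Weighted vectors per (1)-(2): w_ij = f_ij * log2(m / m_j),
    where f_ij = n_ij / len(s_i) and m_j is the number of
    sentences containing word j."""
    m = len(sentences)
    vocab = sorted({w for s in sentences for w in s})
    m_j = {w: sum(1 for s in sentences if w in s) for w in vocab}
    vectors = []
    for s in sentences:
        vec = [s.count(w) / len(s) * math.log2(m / m_j[w]) for w in vocab]
        vectors.append(vec)
    return vectors

def distance(u, v):
    """Euclidean proximity (3) between two sentence vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# A toy four-sentence "document"
doc = [["clustering", "groups", "similar", "sentences"],
       ["sentences", "are", "grouped", "by", "clustering"],
       ["a", "neural", "network", "solves", "the", "problem"],
       ["the", "network", "minimizes", "an", "energy", "function"]]
vecs = sentence_vectors(doc)
print(round(distance(vecs[0], vecs[1]), 3))
```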

Proximity of the sentences in a cluster and remoteness of sentences belonging to distinct clusters should be understood as asserting, first, that the total sum of the distances between sentences in the same cluster must be minimal, and, on the other hand, that the total distance between sentences belonging to distinct clusters must be maximal. In line with this reasoning, we define the sum S_k of distances d_ip between sentences s_i and s_p in cluster C_k thus:

$$S_k = \frac{1}{2} \sum_{s_i \in C_k} \sum_{s_p \in C_k} d_{ip}, \quad k = 1, \dots, q; \quad i, p = 1, \dots, m. \qquad (4)$$

Summing over k, we obtain the overall sum of the distances between sentences in all the clusters C_k, k = 1, …, q:

$$\sum_{k=1}^{q} S_k = \frac{1}{2} \sum_{k=1}^{q} \sum_{s_i \in C_k} \sum_{s_p \in C_k} d_{ip}, \quad i, p = 1, \dots, m. \qquad (5)$$

Now let us determine the sum S_kl of the distances d_ip between sentences s_i and s_p belonging to distinct clusters C_k and C_l (k ≠ l):

$$S_{kl} = \frac{1}{2} \sum_{s_i \in C_k} \sum_{s_p \in C_l} d_{ip}, \quad k, l = 1, \dots, q; \quad i, p = 1, \dots, m. \qquad (6)$$

Summing over k and l (k ≠ l), we obtain the overall sum of the distances between sentences belonging to distinct clusters:

$$\sum_{k=1}^{q} \sum_{\substack{l=1 \\ l \neq k}}^{q} S_{kl} = \frac{1}{2} \sum_{k=1}^{q} \sum_{\substack{l=1 \\ l \neq k}}^{q} \sum_{s_i \in C_k} \sum_{s_p \in C_l} d_{ip}, \quad i, p = 1, \dots, m. \qquad (7)$$

Using formulas (5) and (7), we formulate the problem of clustering as follows:

$$\sum_{k=1}^{q} \sum_{s_i \in C_k} \sum_{s_p \in C_k} d_{ip} - \sum_{k=1}^{q} \sum_{\substack{l=1 \\ l \neq k}}^{q} \sum_{s_i \in C_k} \sum_{s_p \in C_l} d_{ip} \rightarrow \min. \qquad (8)$$

We next introduce the Boolean variable x_ik, equal to 1 if sentence s_i belongs to cluster C_k and to 0 otherwise:

$$x_{ik} = \begin{cases} 1, & \text{if } s_i \in C_k, \\ 0, & \text{if } s_i \notin C_k, \end{cases} \quad i = 1, \dots, m; \quad k = 1, \dots, q. \qquad (9)$$

With this notation, formula (8) assumes the following form:

$$\sum_{k=1}^{q} \sum_{i=1}^{m} \sum_{p=1}^{m} d_{ip} x_{ik} x_{pk} - \sum_{i=1}^{m} \sum_{k=1}^{q} \sum_{p=1}^{m} \sum_{\substack{l=1 \\ l \neq k}}^{q} d_{ip} x_{ik} x_{pl} \rightarrow \min. \qquad (10)$$

We next introduce the notation $a_{ikpl} = d_{ip} e_{kl}$, where

$$e_{kl} = \begin{cases} 1, & \text{if } k = l, \\ -1, & \text{if } k \neq l. \end{cases}$$

We rewrite problem (10) in the compact form

$$\sum_{i=1}^{m} \sum_{k=1}^{q} \sum_{p=1}^{m} \sum_{l=1}^{q} a_{ikpl} x_{ik} x_{pl} \rightarrow \min. \qquad (11)$$

From the proposition C_k ∩ C_l = ∅ (k ≠ l) it follows that the following condition must hold:

$$\sum_{k=1}^{q} x_{ik} = 1, \quad i = 1, \dots, m. \qquad (12)$$

On the other hand, we will assume that each cluster must contain at least a single sentence:

$$\sum_{i=1}^{m} x_{ik} \geq 1, \quad k = 1, \dots, q, \qquad (13)$$

where

$$x_{ik} \in \{0, 1\} \quad \text{for any } i, k. \qquad (14)$$

Thus, the problem of clustering of sentences has been reduced to the integer quadratic programming problem with Boolean variables (11)–(14).
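For illustration only (the paper itself solves the problem with a neural network, described in the next section), the objective (11) and constraints (12)–(14) can be checked by brute force on toy instances; `objective` and `brute_force_clustering` below are our own hypothetical helpers, not part of the authors' method:

```python
import itertools
import math

def objective(d, x, q):
    """Objective (11): sum over i,k,p,l of a_ikpl * x_ik * x_pl,
    where a_ikpl = d_ip * e_kl and e_kl = 1 if k == l else -1."""
    m = len(d)
    total = 0.0
    for i in range(m):
        for k in range(q):
            for p in range(m):
                for l in range(q):
                    e_kl = 1.0 if k == l else -1.0
                    total += d[i][p] * e_kl * x[i][k] * x[p][l]
    return total

def brute_force_clustering(d, q):
    """Enumerate all assignments satisfying (12)-(14): each sentence in
    exactly one cluster (by construction) and no cluster empty."""
    m = len(d)
    best, best_x = math.inf, None
    for labels in itertools.product(range(q), repeat=m):
        if len(set(labels)) < q:          # constraint (13): no empty cluster
            continue
        x = [[1 if labels[i] == k else 0 for k in range(q)] for i in range(m)]
        val = objective(d, x, q)
        if val < best:
            best, best_x = val, x
    return best, best_x

# Tiny symmetric distance matrix for m = 4 sentences, q = 2 clusters
d = [[0, 1, 9, 8],
     [1, 0, 9, 9],
     [9, 9, 0, 1],
     [8, 9, 1, 0]]
print(brute_force_clustering(d, 2))
```

Exhaustive enumeration is of course exponential in m; it serves only to make the formulation tangible and to validate other solvers on small inputs.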


3. SYNTHESIS OF NEURAL NETWORK FOR AN INTEGER QUADRATIC PROGRAMMING PROBLEM

Problem (11)–(14) belongs to the class of problems of combinatorial optimization. Many such problems are NP-complete, and the process of solving them involves insurmountable time expenditures. Special methods and algorithms characterized by polynomial complexity have been developed to attack these types of problems. However, the algorithms developed up to now find solutions that are acceptable in terms of quality and time expenditure only for problems of low dimension. Since there are no persuasive arguments in favor of the existence of algorithms that produce a solution in acceptable time, these problems are classified as NP-complete. Therefore, it seems best to employ neural networks when attempting to solve these types of problems; in fact, neural networks have been shown to be effective in the solution of problems of combinatorial optimization [16, 17]. Through the use of neural networks with feedback it becomes possible to substantially reduce the time it takes to solve such problems (and, to an even greater extent, NP-complete problems). It is clear that in the general case neural networks do not guarantee the global optimality of a solution of problem (11)–(14). However, in actual practice it is often necessary to find one or more local minima within a definite time, and in this case the use of neural networks proves to be very effective. With this in mind, as well as for the purpose of assuring that an optimized approach to the problem of clustering can be applied under ordinary conditions, we propose a neural network-based solution of problem (11)–(14).

In order to synthesize a neural network for the solution of an optimization problem, we construct a triple of the form {N, W, B}, where N is the set of neurons of the network; W, the matrix of synaptic relations; and B, the vector of external shifts. In the general case, the problem of synthesis consists in determining all the components of the given triple, i.e., the form and number of neurons, the values of the external shifts, and the structure of the relationship matrix and the values of its elements. It is assumed that the type and model of the dynamics of the neural-like elements are given. Therefore, the problem of network synthesis involves determining the structure of the network, the relationship matrix W, and the vector of shifts B, all of which conform to the intended utilization of the particular network. The process of synthesizing a neural network for the solution of an optimization problem consists of the following stages.

Stage 1. Neural-network interpretation of the problem. To arrive at a neural-network interpretation of an optimization problem we consider a network of binary neurons that forms a matrix Y = ||y_ik|| of dimension m × q. With each Boolean variable x_ik we associate the output signal y_ik of the ik-th neuron. In this matrix an excited state of neuron y_ik = 1 corresponds to the situation in which sentence i belongs to cluster k.

Stage 2. Construction of the energy function of the network. We will construct the energy function of the network in the form of a sum, the individual terms of which constitute convex functions that assume minimal values in states of the network that satisfy the constraints under consideration and minimize the objective function.
Proceeding on the basis of the foregoing, a term that ensures minimization of problem (11) may be constructed in the following form:

$$E_0 = -\frac{\lambda_0}{2} \sum_{i=1}^{m} \sum_{k=1}^{q} \sum_{p=1}^{m} \sum_{l=1}^{q} a_{ikpl} y_{ik} y_{pl}, \qquad (15)$$

and terms that ensure that constraints (12)–(14) are satisfied may be constructed in the form

$$E_1 = \frac{\lambda_1}{2} \sum_{i=1}^{m} \sum_{k=1}^{q} y_{ik}(1 - y_{ik}) + \frac{\lambda_2}{2} \sum_{i=1}^{m} \left( \sum_{k=1}^{q} y_{ik} - 1 \right)^2 + \frac{\lambda_3}{2} \left( \sum_{i=1}^{m} \sum_{k=1}^{q} y_{ik} - m \right)^2 + \frac{\lambda_4}{2} \sum_{k=1}^{q} \varphi^2\left( \sum_{i=1}^{m} y_{ik} - 1 \right), \qquad (16)$$
where λ0, λ1, λ2, λ3, λ4 are positive constants and φ(z) = z − |z| is a function that possesses the property φ²(z) = 2zφ(z). The first term in (16) corresponds to the binarity of the variables (14); the second term corresponds to constraint (12), under which each row of matrix Y contains exactly one unit; the third term expresses the fact that there are precisely m units in matrix Y; and, finally, the last term corresponds to constraint (13). Hence, it follows that once constraints (12)–(14) are satisfied, (16) assumes its minimal value, equal to zero.
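As a sanity check, the energy (15)–(16) can be written out directly in executable form. The sketch below is ours, with the λ coefficients passed in as free parameters; on any state matrix y that encodes a feasible clustering, the penalty part E1 evaluates to zero, as stated above:

```python
def phi(z):
    """phi(z) = z - |z|: zero for z >= 0 and 2z for z < 0."""
    return z - abs(z)

def energy(d, y, lams):
    """Energy E = E0 + E1 from (15)-(16) for a binary m x q state matrix y."""
    l0, l1, l2, l3, l4 = lams
    m, q = len(y), len(y[0])
    # E0, eq. (15): -(l0/2) * sum_{ikpl} a_ikpl y_ik y_pl, a_ikpl = d_ip e_kl
    e0 = 0.0
    for i in range(m):
        for k in range(q):
            for p in range(m):
                for l in range(q):
                    e_kl = 1.0 if k == l else -1.0
                    e0 += d[i][p] * e_kl * y[i][k] * y[p][l]
    e0 *= -l0 / 2
    # E1, eq. (16): binarity, one cluster per sentence, m units in total,
    # and at least one sentence per cluster (via phi)
    e1 = (l1 / 2) * sum(y[i][k] * (1 - y[i][k])
                        for i in range(m) for k in range(q))
    e1 += (l2 / 2) * sum((sum(y[i]) - 1) ** 2 for i in range(m))
    e1 += (l3 / 2) * (sum(map(sum, y)) - m) ** 2
    e1 += (l4 / 2) * sum(phi(sum(y[i][k] for i in range(m)) - 1) ** 2
                         for k in range(q))
    return e0 + e1
```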


Following summing of the two expressions (15) and (16) and simple algebra, we obtain the following form of the energy function of the neural network:

$$E = E_0 + E_1 = -\frac{\lambda_0}{2} \sum_{i=1}^{m} \sum_{k=1}^{q} \sum_{p=1}^{m} \sum_{l=1}^{q} a_{ikpl} y_{ik} y_{pl} - \frac{\lambda_1}{2} \sum_{i=1}^{m} \sum_{k=1}^{q} \sum_{p=1}^{m} \sum_{l=1}^{q} \delta_{ip} \delta_{kl} y_{ik} y_{pl} + \frac{\lambda_1}{2} \sum_{i=1}^{m} \sum_{k=1}^{q} y_{ik} + \frac{\lambda_2}{2} \sum_{i=1}^{m} \sum_{k=1}^{q} \sum_{p=1}^{m} \sum_{l=1}^{q} \delta_{ip} y_{ik} y_{pl} - \lambda_2 \sum_{i=1}^{m} \sum_{k=1}^{q} y_{ik} + \frac{\lambda_3}{2} \sum_{i=1}^{m} \sum_{k=1}^{q} \sum_{p=1}^{m} \sum_{l=1}^{q} y_{ik} y_{pl} - m\lambda_3 \sum_{i=1}^{m} \sum_{k=1}^{q} y_{ik} + \frac{\lambda_2}{2} m + \frac{\lambda_3}{2} m^2 + \lambda_4 \sum_{k=1}^{q} \left( \sum_{i=1}^{m} y_{ik} - 1 \right) \varphi\left( \sum_{i=1}^{m} y_{ik} - 1 \right), \qquad (17)$$

where δ_ip is the Kronecker symbol.
Stage 3. Determination of the network parameters. The third stage consists in direct determination of the parameters of the neural network, i.e., the matrix of synaptic relationships W and the vector of external shifts B, through comparison of the energy function E that has been constructed with its canonical form E_c, constructed in the following way:

$$E_c = -\frac{1}{2} \sum_{i=1}^{m} \sum_{k=1}^{q} \sum_{p=1}^{m} \sum_{l=1}^{q} w_{ikpl} y_{ik} y_{pl} + \sum_{i=1}^{m} \sum_{k=1}^{q} b_{ik} y_{ik}. \qquad (18)$$

Comparing the two expressions (17) and (18) and equating their linear and quadratic components, we find the parameters of the neural network thus:

$$\begin{cases} w_{ikpl} = \lambda_0 a_{ikpl} + \lambda_1 \delta_{ip} \delta_{kl} - \lambda_2 \delta_{ip} - \lambda_3, \\ b_{ik} = \dfrac{\lambda_1}{2} - \lambda_2 - m\lambda_3 + \lambda_4, \end{cases} \qquad (19)$$

where i, p = 1, …, m; k, l = 1, …, q. Note that in determining the parameters (19), terms that are independent of the state y_ik of the neural network are ignored. Thus, a neural network whose parameters are determined to within constant coefficients has been constructed. The question of determining the coefficients λ0, λ1, λ2, λ3, λ4 requires a separate investigation.
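The parameters (19) translate straightforwardly into code. The following sketch is our illustration, not the authors' implementation: it fills in W and B and simulates the network by random asynchronous threshold updates derived from the canonical form (18); the λ values are arbitrary placeholders, since, as noted above, their choice requires separate investigation:

```python
import random

def network_parameters(d, q, lams):
    """Synaptic weights W and external shifts B from eq. (19)."""
    l0, l1, l2, l3, l4 = lams
    m = len(d)
    def delta(a, b):
        return 1.0 if a == b else 0.0
    W = [[[[l0 * d[i][p] * (1.0 if k == l else -1.0)
            + l1 * delta(i, p) * delta(k, l)
            - l2 * delta(i, p) - l3
            for l in range(q)]
           for p in range(m)]
          for k in range(q)]
         for i in range(m)]                     # W[i][k][p][l] = w_ikpl
    b = l1 / 2 - l2 - m * l3 + l4               # b_ik is the same for all i, k
    B = [[b for _ in range(q)] for _ in range(m)]
    return W, B

def run_network(d, q, lams, steps=5000, seed=0):
    """Asynchronous updates of the binary neurons y_ik: for the canonical
    energy (18), a neuron fires when sum_pl w_ikpl y_pl >= b_ik."""
    W, B = network_parameters(d, q, lams)
    m = len(d)
    rng = random.Random(seed)
    y = [[rng.randint(0, 1) for _ in range(q)] for _ in range(m)]
    for _ in range(steps):
        i, k = rng.randrange(m), rng.randrange(q)
        net = sum(W[i][k][p][l] * y[p][l]
                  for p in range(m) for l in range(q))
        y[i][k] = 1 if net >= B[i][k] else 0
    return y

# Example (arbitrary λ values): y = run_network(d, 2, (1.0, 2.0, 2.0, 0.5, 2.0))
```

As the paper stresses, such a network converges to a local minimum of the energy; rerunning from several random initial states is a common practical remedy.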

4. STEPWISE ALGORITHM FOR DETERMINING NUMBER OF CLUSTERS

Note that the choice of an optimal number of clusters is an important stage in cluster analysis [13, 18]. It is difficult to determine a priori how many clusters a given set consists of. The following strategy will therefore be followed: beginning with a sufficiently small q, the number of clusters is increased in steps until some termination criterion is satisfied. From the point of view of optimization, this means that if the solution of the optimization problem (11)–(14) is not satisfactory, problem (11)–(14) must be considered with q + 1 clusters, and so on, i.e., problem (11)–(14) must be solved repeatedly with different values of q. Below, we present an algorithm for stepwise calculation of the number of clusters. We introduce the function F(x) defined by the relationship

$$F(x) = \frac{\displaystyle \sum_{k=1}^{q} \sum_{i=1}^{m} \sum_{p=1}^{m} d_{ip} x_{ik} x_{pk}}{\displaystyle \sum_{i=1}^{m} \sum_{k=1}^{q} \sum_{p=1}^{m} \sum_{\substack{l=1 \\ l \neq k}}^{q} d_{ip} x_{ik} x_{pl}}, \qquad (20)$$

where the numerator corresponds to the first term in formula (10) and the denominator to the second term.

Step 1. Specify the tolerance ε > 0. Set k = 2 and solve problem (11)–(14). Let F_2 be the value of the function F(x) corresponding to the solution of (11)–(14).

Step 2. Set k = k + 1 and solve problem (11)–(14). Let F_{k+1} be the value of the function F(x) corresponding to the solution of (11)–(14).

Step 3. If (F_k − F_{k+1})/F_2 < ε, k ≥ 2, halt the algorithm; otherwise go to Step 2.

It is easily proved that for all k the condition F_k ≥ F_{k+1} > 0 is satisfied. Thus, a decreasing sequence {F_k} with F_k > 0 for all k is obtained. Consequently, after a certain number k* of iterations, the halting criterion in Step 3 is satisfied.

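Steps 1–3 amount to the following loop (our sketch; `solve_and_score` is a hypothetical helper that solves problem (11)–(14) for a given number of clusters and returns the value of F(x) from (20) at its solution; which q to report at the halt is left open in the paper, and here we return the last value before the improvement became negligible):

```python
def optimal_cluster_count(d, solve_and_score, eps=0.05, q_max=10):
    """Stepwise determination of the number of clusters (Steps 1-3)."""
    k = 2
    F2 = F_prev = solve_and_score(d, k)       # Step 1: start with two clusters
    while k < q_max:
        k += 1                                # Step 2: add one more cluster
        F_next = solve_and_score(d, k)
        if (F_prev - F_next) / F2 < eps:      # Step 3: relative decrease < eps
            return k - 1                      # improvement became negligible
        F_prev = F_next
    return k
```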

5. DETERMINATION OF INFORMATION-RICH SENTENCES

The next step after clustering is to decide which are the information-rich sentences in each cluster. The "information richness" of a sentence is defined by a measure of proximity calculated between the given sentence and the corresponding cluster centroid, i.e., the smaller the Euclidean distance between a sentence and the corresponding cluster centroid, the greater the information richness of the particular sentence. Before sentences are included in a summary, they are ranked in increasing order of their measure of proximity to the corresponding cluster centroid.

Most text-based documents usually consist of several topics. Some topics are described by many sentences and, consequently, form the basic content of the document. Other topics may be referred to only briefly in order to complete the principal topic. Consequently, the number of sentences in the different clusters will be different, and the number of sentences selected from each cluster will also differ. Using such an approach it is possible to encompass to the maximally possible degree the principal content of the document and to avoid redundancy. In the general case, the number of sentences included in the summary depends on the compression coefficient. The compression coefficient α_cmpr is defined as the ratio of the length of the summary to the length of the document:

$$\alpha_{cmpr} = \frac{len(summ)}{len(doc)}, \qquad (21)$$

and is an important coefficient influencing the quality of a summary; here len(summ) and len(doc) are the length of the summary and the length of the document, respectively. With a low value of the compression coefficient, the summary will be shorter and the bulk of the information will be lost; with a high value the summary will be voluminous, though it will contain insignificant sentences. In [1] it is shown that if the compression coefficient lies within the interval [0.05, 0.3], the result of summarization will be acceptable. In view of the foregoing discussion, we may determine the number N_k of information-rich sentences selected from each cluster k by the following formula:

$$N_k = \left| \frac{len(C_k)\,\alpha_{cmpr}}{len_{avg}} \right|, \quad k = 1, \dots, q, \qquad (22)$$

where len(C_k) is the length of cluster C_k, len_avg = len(doc)/m is the mean length of the sentences in the document, and |a| is the integral part of the number a.
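Combining (22) with the centroid-proximity ranking gives a selection routine along the following lines (our sketch; all names are hypothetical, and `vectors` are the weighted sentence vectors of Section 2):

```python
import math

def select_summary(clusters, vectors, sent_len, alpha_cmpr):
    """Pick the N_k sentences closest to each cluster centroid, eq. (22)."""
    len_avg = sum(sent_len) / len(sent_len)    # mean sentence length
    chosen = []
    for members in clusters:                   # members: sentence indices
        dims = len(vectors[0])
        centroid = [sum(vectors[i][j] for i in members) / len(members)
                    for j in range(dims)]
        cluster_len = sum(sent_len[i] for i in members)
        n_k = int(cluster_len * alpha_cmpr / len_avg)   # integral part
        ranked = sorted(members,
                        key=lambda i: math.dist(vectors[i], centroid))
        chosen.extend(ranked[:n_k])            # the most information-rich
    return sorted(chosen)                      # restore document order
```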

We will use the criterion F_1 to assess the result of summarization. Let N_d^rel be the number of relevant sentences in the document; N_s^rel, the number of relevant sentences in the summary; N_s, the number of sentences in the summary; P, precision; and R, completeness. Then

$$P = \frac{N_s^{rel}}{N_s}, \qquad (23)$$

$$R = \frac{N_s^{rel}}{N_d^{rel}}, \qquad (24)$$

$$F_1 = \frac{2PR}{P + R}. \qquad (25)$$
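In code, measures (23)–(25) amount to a few lines (our sketch):

```python
def f1_score(n_rel_summary, n_summary, n_rel_document):
    """Precision (23), completeness (24), and the F1 criterion (25)."""
    p = n_rel_summary / n_summary          # P = N_s^rel / N_s
    r = n_rel_summary / n_rel_document     # R = N_s^rel / N_d^rel
    return 2 * p * r / (p + r)

# Example: 6 of 8 summary sentences are relevant, out of 10 relevant overall
print(round(f1_score(6, 8, 10), 3))        # P = 0.75, R = 0.6, F1 ≈ 0.667
```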


6. CONCLUSION

The objective of automatic summarization is to extract from a text several generalizing fragments (sentences, paragraphs, contexts) that reflect the content of the document. The present article has been focused on the problem of generic summarization of text-based documents, the essence of which is the following:

• It is known that the result of clustering is assessed from the point of view of homogeneity (points of one cluster must be close to each other) and heterogeneity (points of different clusters must be distant from each other). Most of the clustering methods used in [8, 10] support either homogeneity or heterogeneity of clusters. For example, the operating principle of the k-means algorithm is based on assuring the homogeneity of clusters. The method of clustering proposed in the present article not only supports maximal proximity of points within clusters (homogeneity), but also guarantees that points grouped into distinct clusters will be maximally distant from each other (heterogeneity). The mathematical realization of the method relies on a problem of integer quadratic programming. To reduce the solution time, a neural network implementation of the integer quadratic programming problem is presented.

• If a document consists of several topical sections, it becomes difficult to create a summary that encompasses all the topical sections of the document. Thus, one of the hard problems in summarization is to determine the topical sections in a document, a step that is directly related to that of determining the number of clusters. In traditional algorithms the number of clusters is basically specified in advance, which is far from always possible. In [8] an attempt was made to determine the number of clusters, though an analysis shows that this cannot guarantee optimality of the number of clusters thus found. To solve this problem we have proposed a new algorithm by means of which the number of clusters, i.e., the number of latent topical sections in the document, is found optimally. The advantage of the algorithm is that the process of determining the number of clusters is associated directly with an objective function that assures precise clustering.

• The final step in summarization is that of determining a criterion for deciding on the degree of "information richness" (informativeness) of sentences and the number of sentences. To achieve broad coverage of the content of a document and avoid redundancy, the present study proposed an information-richness criterion and an algorithm for determining the number of sentences for inclusion in the summary. The number of sentences selected from each cluster is controlled by a previously specified parameter α_cmpr.

REFERENCES

1. Mani, I. and Maybury, M.T., Advances in Automated Text Summarization, Cambridge: MIT Press, 1999.
2. Salton, G., Singhal, A., Mitra, M., and Buckley, C., Automated Text Structuring and Summarization, Inf. Process. Manage., 1997, vol. 33, no. 2, pp. 193–207.
3. Mitra, M., Singhal, A., and Buckley, C., Automatic Text Summarization by Paragraph Extraction, Proc. ACL'97/EACL'97 Workshop on Intelligent Scalable Text Summarization, Madrid, July 7–12, 1997, pp. 39–46.
4. Kruengkrai, C. and Jaruskulchai, C., Generic Text Summarization Using Local and Global Properties of Sentences, Proc. IEEE/WIC Intern. Conf. on Web Intelligence (WI'03), Halifax, Canada, October 13–17, 2003, pp. 201–206.
5. Yeh, J.-Y., Ke, H.-R., Yang, W.-P., and Meng, I.-H., Text Summarization Using a Trainable Summarizer and Latent Semantic Analysis, Inf. Process. Manage., 2005, vol. 41, no. 1, pp. 75–95.
6. Goldstein, J., Kantrowitz, M., Mittal, V., and Carbonell, J., Summarization of Text Documents: Sentence Selection and Evaluation Metrics, Proc. 22nd Annual Intern. ACM SIGIR Conf. on Res. Develop. in Information Retrieval (SIGIR'99), Berkeley, USA, August 15–19, 1999, pp. 121–128.
7. Gong, Y. and Liu, X., Generic Text Summarization Using Relevance Measure and Latent Semantic Analysis, Proc. 24th Annual Intern. ACM SIGIR Conf. on Res. Develop. in Information Retrieval, New Orleans, USA, 2001, pp. 19–25.
8. Hu, P., He, T., Ji, D., and Wang, M., A Study of Chinese Text Summarization Using Adaptive Clustering of Paragraphs, Proc. 4th Intern. Conf. on Computers and Information Technology (CIT'04), Wuhan, China, September 14–16, 2004, pp. 1159–1164.
9. Shen, D., Chen, Z., Yang, Q., Zeng, H.J., Zhang, B., Lu, Y., and Ma, W.Y., Web-Page Classification Through Summarization, Proc. 27th Annual Intern. ACM SIGIR Conf. on Res. Develop. in Information Retrieval, Sheffield, UK, July 25–29, 2004, pp. 242–249.
10. Delort, J.-Y., Bouchon-Meunier, B., and Rifqi, M., Enhanced Web Document Summarization Using Hyperlinks, Proc. 14th ACM Conf. on Hypertext and Hypermedia, Nottingham, UK, August 26–30, 2003, pp. 208–215.
11. Luhn, H.P., The Automatic Creation of Literature Abstracts, IBM J. Res. Develop., 1958, vol. 2, no. 2, pp. 159–165.


12. Banko, M., Mittal, V., Kantrowitz, M., and Goldstein, J., Generating Extraction-Based Summaries from Hand-Written Summaries by Aligning Text Spans, Proc. 14th Conf. of the Pacific Assoc. for Computational Linguistics (PACLING'99), Waterloo, Canada, August 25–28, 1999, pp. 36–40.
13. Grabmeier, J. and Rudolph, A., Techniques of Cluster Algorithms in Data Mining, Data Mining and Knowledge Discovery, 2002, vol. 6, no. 4, pp. 303–360.
14. Bradley, P.S., Fayyad, U.M., and Mangasarian, O.L., Mathematical Programming for Data Mining: Formulations and Challenges, INFORMS J. Comput., 1999, vol. 11, no. 3, pp. 217–238.
15. Bagirov, A.M., Ferguson, B., Ivkovic, S., Saunders, G., and Yearwood, J., New Algorithms for Multi-class Cancer Diagnosis Using Tumor Gene Expression Signatures, Bioinformatics, 2003, vol. 19, no. 14, pp. 1800–1807.
16. Alguliev, R.M., Alyguliev, R.M., and Alekperov, R.K., An Approach to Optimal Assignment of Tasks in a Distributed System, Avtom. Vychisl. Tekh., 2004, no. 5, pp. 55–61.
17. Neyromatematika. Kniga 6. Uchebnoe posobie dlya vuzov (Neuro-Mathematics. Book 6. A Textbook for Post-Secondary Educational Institutions), Galushkin, A.I., Ed., Moscow: IPRZhR, 2002.
18. Kim, D.-W., Lee, K.H., and Lee, D., On Cluster Validity Index for Estimation of the Optimal Number of Fuzzy Clusters, Pattern Recognition, 2004, vol. 37, no. 10, pp. 2009–2025.
