Expert Systems with Applications 40 (2013) 1675–1689


Multiple documents summarization based on evolutionary optimization algorithm

Rasim M. Alguliev, Ramiz M. Aliguliyev *, Nijat R. Isazade

Institute of Information Technology of Azerbaijan National Academy of Sciences, 9, B. Vahabzade Street, Baku AZ1141, Azerbaijan

Keywords: Multi-document summarization; Diversity; Content coverage; Optimization model; Differential evolution algorithm; Self-adaptive crossover

Abstract

This paper proposes an optimization-based model for generic document summarization. The model generates a summary by extracting salient sentences from documents. The approach uses the sentence-to-document collection, the summary-to-document collection and the sentence-to-sentence relations to select salient sentences from a given document collection and to reduce redundancy in the summary. An improved differential evolution algorithm has been created to solve the optimization problem. The algorithm can adjust the crossover rate adaptively according to the fitness of individuals. We implemented the proposed model on the multi-document summarization task. Experiments have been performed on the DUC2002 and DUC2004 data sets. The experimental results provide strong evidence that the proposed optimization-based approach is a viable method for document summarization.

1. Introduction

Interest in text mining started with the advent of on-line publishing, the increased impact of the Internet and the rapid development of electronic government (e-government). With the exponential growth of information and communication technologies, a huge number of electronic documents have become available online. This explosion of electronic documents has made it difficult for users to extract useful information from them. While the Internet has increased access to text collections on a variety of topics, consumers now face a considerable amount of redundancy in the texts that they encounter online. Because of this large amount of information, users leave many relevant and interesting documents unread. Thus, now more than ever, consumers need access to robust text summarization systems, which can effectively condense information found in several documents into a short, readable synopsis, or summary (Harabagiu & Lacatusu, 2010; Yang & Wang, 2008).

The text mining approach is feasible and powerful for e-government digital archives. Digital archives have been built up at almost every level of the e-government hierarchy. Digital archives in the domain of e-government involve various medium formats, such as video, audio and scanned documents. In fact, governmental documents are the most important production of e-government, and they contain the majority of the information on government affairs. The text mining approach described in Dong, Yu, and Jiang (2009) targets the text in scanned documents. The mined knowledge helps a lot in policymaking, emergency decision support, and government routines for civil servants. The successful application of the system to archives testifies to the correctness and soundness of this approach.

Text summarization is a good way to condense a large amount of information into a concise form by selecting the most important information and discarding the redundant information. According to Mani and Maybury (1999), automatic text summarization takes a partially structured source text from multiple texts written about the same topic, extracts information content from it, and presents the most important content to the user in a manner sensitive to the user's needs. Nowadays, without browsing a large volume of documents, search engines such as Google, Yahoo!, AltaVista, and others provide users with clusters of documents they are interested in and briefly present a summary of each document, which facilitates the task of finding the desired documents (Boydell & Smyth, 2010; Shen, Sun, Li, Yang, & Chen, 2007; Song, Choi, Park, & Ding, 2011; Yang & Wang, 2008). Boydell and Smyth (2010) focus on the role of snippets in collaborative web search and describe a technique for summarizing search results that harnesses the collaborative search behavior of communities of like-minded searchers to produce snippets that are more focused on the preferences of the searchers. They go on to show how this so-called social summarization technique can generate summaries that are significantly better adapted to searcher preferences, and they describe a novel personalized search interface that combines result recommendation with social summarization.

Depending on the number of documents, summarization techniques can be classified into two classes: single-document and multi-document (Fattah & Ren, 2009; Zajic, Dorr, & Lin, 2008). Single-document summarization can only condense one document into a shorter representation, whereas multi-document summarization can condense a set of documents into one summary. Multi-document summarization can be considered an extension of single-document summarization; it is used for precisely describing the information contained in a cluster of documents and for facilitating users' understanding of the document cluster. Since it combines and integrates the information across documents, it performs knowledge synthesis and knowledge discovery, and can be used for knowledge acquisition (Zajic et al., 2008). In addition to single-document summarization, which was studied first in this field, researchers have started to work on multi-document summarization, whose goal is to generate a summary from multiple documents. The multi-document summarization task has turned out to be much more complex than summarizing a single document, even a very large one. This difficulty arises from the inevitable thematic diversity within a large set of documents. A multi-document summary can be used to concisely describe the information contained in a cluster of documents and to facilitate users' understanding of the document cluster.

2. Related work

Multi-document summarization has been widely studied recently. Researchers all over the world working on multi-document summarization are exploring different directions to find the methods that provide the best results (Tao, Zhou, Lam, & Guan, 2008; Wan, 2008; Wang, Li, Zhu, & Ding, 2008; Wang, Zhu, Li, Chi, & Gong, 2011; Wang, Li, Zhu, & Ding, 2009). In general, document summarization can be divided into extractive summarization and abstractive summarization. Extractive summarization produces summaries by choosing a subset of the sentences in the original document(s). This contrasts with abstractive summarization, where the information in the text is rephrased. An extract summary consists of sentences extracted from the document, while an abstract summary employs words and phrases not appearing in the original document (Mani & Maybury, 1999). Extractive summarization is a simple but robust method for text summarization; it involves assigning saliency scores to some textual units of the documents and extracting those with the highest scores. Abstraction can be described as reading and understanding the text to recognize its content, which is then compiled in a concise text. In general, an abstract can be described as a summary comprising concepts and ideas taken from the source, which are then reinterpreted and presented in a different form, whilst an extract is a summary consisting of units of text taken from the source and presented verbatim (Kutlu, Cigir, & Cicekli, 2010). Although an abstractive summary could be more concise, it requires deep natural language processing techniques. Thus, an extractive summary is more feasible and has become the standard in document summarization. In this paper, we focus on extractive multi-document summarization. The most widely used extractive summarization methods are as follows.

Summaries can be generic or query-focused (Dunlavy, O'Leary, Conroy, & Schlesinger, 2007; Gong & Liu, 2001; Ouyang, Li, Li, & Lu, 2011; Wan, 2008). A query-focused summary presents the information that is most relevant to the given queries, while a generic summary gives an overall sense of the document's content. As compared to generic summarization, which must contain the core information central to the source documents, the main goal of query-focused multi-document summarization is to create from the documents a summary that can answer the need for information expressed in the topic or explain the topic. Zhao, Wu, and Huang (2009) propose a query expansion algorithm used in the graph-based ranking approach for query-focused multi-document summarization. This algorithm makes use of both sentence-to-sentence relationships and sentence-to-word relationships to select expansion words from the documents. By this method, the expansion words satisfy both information richness and query relevance.

The problem of using topic representations for multi-document summarization has received considerable attention recently. Several topic representations have been employed for producing informative and coherent summaries. The work presented in Harabagiu and Lacatusu (2010) has two main goals. First, it introduces two novel topic representations that leverage sets of automatically generated topic themes for multi-document summarization, and it shows how these new topic representations can be integrated into a state-of-the-art multi-document summarization system. Second, it presents eight different methods of generating multi-document summaries.

Up to now, various extraction-based techniques have been proposed for generic multi-document summarization. In order to implement extractive summarization, sentence extraction techniques are utilized to identify the most important sentences, which can express the overall understanding of a given document. The centroid-based method, MEAD, is one of the popular extractive summarization methods (Radev, Jing, Stys, & Tam, 2004). MEAD uses information from the centroids of the clusters to select sentences that are most likely to be relevant to the cluster topic. Gong and Liu (2001) proposed a method using latent semantic analysis (LSA) to select highly ranked sentences for summarization. Other methods include NMF-based topic specification (Lee, Park, Ahn, & Kim, 2009; Wang et al., 2008, 2009) and CRF-based summarization (Shen et al., 2007). In the CRF (conditional random fields) framework, the input document is first converted into a sequence of sentences, and each sentence is then evaluated by the CRF to represent its importance. Wang et al. (2008) proposed a framework based on sentence-level semantic analysis and symmetric NMF (non-negative matrix factorization). Wang, Li, and Ding (2010) proposed the weighted feature subset non-negative matrix factorization (WFS-NMF), an unsupervised approach that simultaneously clusters data points and selects important features, in which different data points are assigned different weights indicating their importance. They applied the proposed approach to document clustering, summarization, and visualization. Recently, Wang and Li (2012) proposed a novel weighted consensus summarization method to combine the results from different summarization methods, in which the relative contribution of an individual method to the consensus is determined by its agreement with the other members of the summarization systems.

Graph-based ranking algorithms such as PageRank (Brin & Page, 1998) and HITS (Kleinberg, 1999) have also been used in generic multi-document summarization. The major concerns in graph-based summarization research include how to model the documents using a text graph and how to transform existing web page ranking algorithms into variations that can accommodate various summarization requirements (Wenjie, Furu, Qin, & Yanxiang, 2008). A similarity graph is produced for the sentences in the document collection. In the graph, each node represents a sentence. The edges between nodes measure the cosine similarity between the respective pair of sentences, where each sentence is represented as a vector of term-specific weights. An algorithm called LexRank (Erkan & Radev, 2004), adapted from PageRank, was applied to calculate sentence significance, which was then used as the criterion to rank and select summary sentences. In Chali, Hasan, and Joty (2011), the authors extensively study the impact of syntactic and semantic information in measuring similarity between sentences in the random walk framework for answering complex questions. They apply tree kernel functions and the Extended String Subsequence Kernel (ESSK) to include syntactic and semantic information. Ordering extracted sentences into a coherent summary is a non-trivial task. Bollegala, Okazaki, and Ishizuka (2010) presented a bottom-up approach to arrange sentences extracted for multi-document summarization.

The proposed sentence ordering method uses four criteria: chronology, topical-relatedness, precedence, and succession. Each criterion expresses the strength and direction of the association between two text segments. A support vector machine (SVM) is utilized to integrate the four criteria. Using the trained SVM, they agglomeratively clustered the extracted sentences to produce a total ordering.

Furthermore, the summarization problem can be specified as supervised or unsupervised (Fattah & Ren, 2009; Mani & Maybury, 1999; Riedhammer, Favre, & Hakkani-Tür, 2010). A supervised system learns how to extract sentences given example documents and the respective summaries. Supervised summarization is regarded as a two-class classification problem at the sentence level, where the sentences in the summary are positive samples and the non-summary sentences are negative samples (Chali & Hasan, 2011; Song et al., 2011). For supervised learning techniques, a huge amount of annotated or labeled data is required as a precondition. Some widely used classification techniques, e.g. neural networks (Fattah & Ren, 2009) and support vector machines (Ouyang et al., 2011), are applied to achieve sentence classification. These techniques characterize each sentence according to a set of predefined features, such as summary-to-text pairs. An unsupervised method generates a summary while only accessing the target documents. Unsupervised approaches are very enticing for extractive document summarization, as they do not depend on extensive manually annotated training data. They can thus be applied to any newly observed data without prior adjustments. Unsupervised extractive summarization uses heuristic rules to select the most informative sentences into the summary directly (Fattah & Ren, 2009). Since clustering is the most important unsupervised categorization method, the application of clustering to extractive summarization problems appears to be appropriate and natural (Alguliev & Aliguliyev, 2008; Aliguliyev, 2009a, 2009b, 2010; Dunlavy et al., 2007; Wang et al., 2008).

Redundancy is one of the important issues in multi-document summarization. To remove redundancy, some systems select the top-most sentences first, measure the similarity of the next candidate textual unit (sentence or paragraph) to the previously selected ones, and retain it only if it contains enough new (dissimilar) information (Sarkar, 2010). Many approaches to reduce redundancy, such as maximal marginal relevance (MMR), have been reported in the literature. Carbonell and Goldstein (1998) introduced the MMR approach, which was used for reducing redundancy while maintaining query relevance in document re-ranking and text summarization. The MMR model was used by Li and Croft (2008) in their work on novelty detection. Unlike MMR, which uses a greedy approach to sentence selection and redundancy removal, the clustering-based approaches control redundancy in the final summary by clustering sentences to identify themes of common information and selecting one or two representative sentences from each cluster into the final summary (Alguliev & Aliguliyev, 2008; Aliguliyev, 2009a, 2010; Wang et al., 2011). The work of Sarkar (2010) presents a sentence-compression-based summarization technique that uses a number of local and global sentence-trimming rules to improve the performance of an extractive multi-document summarization system.

Binwahlan, Salim, and Suanmali (2010) introduced a hybrid model based on fuzzy logic, swarm intelligence and diversity selection for the text summarization problem. The purpose of employing swarm intelligence for producing the text-feature weights was to deal with the text features fairly, according to their importance. The weights suggested by swarm intelligence were used to adjust the text-feature scores, which played an important role in differentiating between more and less important features. In the fuzzy logic component, the trapezoidal membership function was used for fuzzifying the crisp numerical values of the text features. The feature values were adjusted using the weights obtained in the training of the particle swarm optimization.

Selecting the best summary is a global optimization problem, in comparison with simply pursuing the local greedy approximation as in the procedure of selecting the best sentences (Huang, He, Wei, & Li, 2010). Filatova and Hatzivassiloglou (2004) represented each sentence with a set of conceptual units and formalized extractive summarization as a maximum coverage problem that aims at covering as many conceptual units as possible by selecting some sentences. In McDonald (2007), the summarization task was defined as a global inference problem, which attempted to optimize three properties jointly, i.e., relevance, redundancy, and length. The scoring function defined there was similar to MMR. Takamura and Okumura (2009a) represented text summarization as a maximum coverage problem with knapsack constraint (MCKP). One of the advantages of this representation is that MCKP can directly model whether each concept in the given documents is covered by the summary or not, and can dispense with rather counter-intuitive approaches such as giving a penalty to each pair of two similar sentences. A novel multi-document summarization model based on the budgeted median problem was proposed in Takamura and Okumura (2009b). The proposed method is somewhat similar to methods based on sentence clustering (Wang et al., 2011) in the sense that both methods generate sets of sentences. However, there is a big difference between these two methods. While the methods based on sentence clustering generate sets of similar sentences, the proposed method attempts to generate sentence sets, each of which has one selected sentence and contains sentences entailed by the selected sentence. Wang et al. (2009) propose a Bayesian sentence-based topic model (BSTM) for multi-document summarization by making use of both the term-document and term-sentence associations. It models the probability distributions of selecting sentences given topics and provides a principled way for the summarization task.

Document summarization, especially multi-document summarization, is in essence a multi-objective optimization problem. The potential of optimization-based document summarization models has not been well explored to date. It requires the simultaneous optimization of more than one objective function. A good summary, as a whole, is expected to be one with extensive coverage of the focuses presented in the documents, minimum redundancy, and smooth connection among sentences. Huang et al. (2010) consider four objectives, i.e., information coverage, information significance, information redundancy, and text cohesion. Ma and Wan (2010) propose three new models based on the optimization of an information-theoretic measure: distortion. The p-median model treats the optimization as a p-median problem and conveys as much information between the whole summary and the original documents as possible. The facility location model adds features to the p-median model, and the linear representation model departs from the idea of clustering and modifies the representation method. Linear representation is proved to be effective both theoretically and experimentally.

Motivated by recent progress in optimization-based document summarization (Alguliev, Aliguliyev, Hajirahimova, & Mehdiyev, 2011; Alguliev, Aliguliyev, & Mehdiyev, 2011), in this paper we propose a novel document summarization model which simultaneously considers content coverage and redundancy. This approach can directly discover key sentences in the given collection and cover the main content of the original source(s). Our model can reduce redundancy in the summary. In this paper, a self-adaptive differential evolution (DE) algorithm is created to solve the optimization problem. The performance of the proposed approach is tested on the standard DUC2002 and DUC2004 data sets, and the results show that our method improves the summary performance in comparison with some newly proposed or commonly used summarization approaches.


The rest of this paper is organized as follows. Our generic document summarization model is presented in Section 3. This model is formulated as an optimization problem. In Section 4, we briefly introduce the DE algorithm. In addition, some improved variants of DE are reviewed. An improved version of DE is presented in detail in Section 5. Comprehensive experimental results are shown in Section 6. Finally, Section 7 concludes the paper.

3. Formulation of sentence selection problem

3.1. Problem statement

Sentence extractive summarization algorithms often rely on the measurement of two important aspects: relevance (selected sentences should be important) and non-redundancy (duplicated content should be avoided). These two aspects are usually addressed by computing separate scores and deciding for the best candidates regarding some relevance-redundancy trade-off (Riedhammer et al., 2010). Under this extractive summarization framework, sentences are first evaluated and ranked according to certain criteria and measures, and then the most significant ones are extracted from the original documents to generate a summary automatically. As stated in Huang et al. (2010), each of the sentences included in the summary should be individually important; however, this does not guarantee that they collectively produce the best summary. For example, if the selected sentences overlap a lot with each other, such a summary is definitely not desired.

In this section, we present our approach towards all three aspects of summarization, namely: (1) content coverage, the summary should contain salient sentences that cover the main content of the documents; (2) diversity, the summary should not contain multiple sentences that convey the same information; and (3) length, the summary should be bounded in length. Optimizing all three properties jointly is a challenging task and is an example of a global summarization problem. That is why the inclusion of relevant textual units relies not only on properties of the units themselves, but also on properties of every other textual unit in the summary (Yang & Wang, 2008).

3.2. Similarity measure

Similarity measures play an important role in text-related research, such as natural language processing, information retrieval, and text mining (Aliguliyev, 2010; Chali et al., 2011; Song et al., 2011). In order to process the documents, we must formalize sentences. Most text organizing approaches adopt the vector space model (VSM) to represent a sentence, that is to say, each unique term in the vocabulary represents one dimension of the feature vector space. Given a document collection D = {d_1, d_2, ..., d_N}, where N is the number of documents, we represent the collection as the set of all sentences from all the documents in D, S = {s_1, s_2, ..., s_n}, where n is the total number of sentences in the collection and s_i denotes the ith sentence in D. Let T = {t_1, t_2, ..., t_m} represent all the distinct terms occurring in the document collection D, where m is the number of terms. Using the VSM, each sentence s_i is represented using these terms as a vector in m-dimensional space, s_i = [w_{i1}, ..., w_{im}], where each component reflects the weight of the corresponding term. The term-specific weights in the sentence vectors are products of local (tf) and global (isf) weights. The model is known as the term frequency-inverse sentence frequency (TF-ISF) scheme:

w_{ik} = tf_{ik} \cdot isf_k,   (1)

where tf_{ik} is the number of occurrences of term t_k in sentence s_i, isf_k = \log(n / n_k), and n_k is the number of sentences containing term t_k.

Once each sentence vector s_i is represented in this way, a cosine measure is utilized to calculate the similarity between pairs of sentences. Suppose sentence vector s_i is represented by s_i = [w_{i1}, ..., w_{im}] and sentence vector s_j is represented by s_j = [w_{j1}, ..., w_{jm}]; the cosine similarity between vectors s_i and s_j is defined by:

sim(s_i, s_j) = \frac{\sum_{k=1}^{m} w_{ik} w_{jk}}{\sqrt{\sum_{k=1}^{m} w_{ik}^2} \cdot \sqrt{\sum_{k=1}^{m} w_{jk}^2}},   i, j = 1, ..., n.   (2)

3.3. Mathematical formulation of problem

The goal of text summarization is to cover as many conceptual sentences as possible using only a small number of sentences. In our study, we attempt to find a subset of the sentences D = {s_1, s_2, ..., s_n} that covers the main content of the document collection while reducing the redundancy in the summary. In other words, the goal is to find a subset S ⊆ D that covers as many conceptual sentences as possible, under the condition that the summary length must be at most L (cardinality constraint). In the following, we introduce a model for that purpose. In our model, both content coverage and diversity are taken into consideration. If we let S ⊆ D be a summary, then the similarity between the set of sentences D and the summary S is sim(D, S), which we would like to maximize. Let X be a binary vector such that, for every element of it, x_i = 1 if sentence s_i is selected to be included in the summary, and x_i = 0 otherwise. The objective of text summarization is to find X which maximizes the following objective function:

f(X) = \frac{f_{cover}(X)}{f_{diver}(X)}.   (3)

The function f_{cover}(X) provides the covering of the main content of the document collection, while the function f_{diver}(X) provides a high diversity in the summary. Notice that high diversity provides low redundancy in the summary. The coverage function f_{cover}(X) we define as follows:

f_{cover}(X) = sim(O, O^S) \cdot \sum_{i=1}^{n} sim(O, s_i) x_i,   (4)

where O and O^S denote the mean vectors of the collection D = {s_1, s_2, ..., s_n} and the summary S, respectively. The kth coordinate o_k of the mean vector O is calculated as:

o_k = \frac{1}{n} \sum_{i=1}^{n} w_{ik},   k = 1, ..., m.   (5)

The kth coordinate o_k^S of the mean vector O^S we define as:

o_k^S = \frac{1}{n_S} \sum_{s_i \in S} w_{ik},   k = 1, ..., m,   (6)

where n_S denotes the number of sentences in the summary S. From Radev et al. (2004) we know that the centre of the document collection reflects its main content. Thus, in Eq. (4), the first multiplier aims to evaluate the importance of the summary as a whole, and the second multiplier aims to evaluate the importance of each sentence in the summary, by measuring their similarity to the centre O of the document collection D. Therefore, this function not only estimates the content coverage of the summary as a whole, it also estimates the content coverage of each sentence of the summary. Thus, a higher value of f_{cover}(X) corresponds to higher content coverage of the summary.
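To make the TF-ISF weighting (1), the cosine similarity (2) and the coverage term (4)-(6) concrete, the following minimal sketch shows one possible implementation; the function names (tf_isf_matrix, cosine, f_cover), the use of NumPy, and the assumption that x is a 0/1 NumPy array with at least one selected sentence are our own illustrative choices, not part of the original system.

```python
import numpy as np

def tf_isf_matrix(sentences, vocabulary):
    """Build the n x m TF-ISF weight matrix of Eq. (1).

    `sentences` is a list of token lists; `vocabulary` a list of distinct terms.
    """
    n, m = len(sentences), len(vocabulary)
    index = {t: k for k, t in enumerate(vocabulary)}
    tf = np.zeros((n, m))
    for i, sent in enumerate(sentences):
        for tok in sent:
            if tok in index:
                tf[i, index[tok]] += 1.0
    # isf_k = log(n / n_k), where n_k is the number of sentences containing term t_k
    n_k = np.count_nonzero(tf, axis=0)
    isf = np.log(n / np.maximum(n_k, 1))
    return tf * isf                      # w_ik = tf_ik * isf_k

def cosine(u, v):
    """Cosine similarity of Eq. (2); returns 0 for zero vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v) / denom if denom > 0 else 0.0

def f_cover(W, x):
    """Coverage objective of Eq. (4) for a binary selection vector x (x.sum() >= 1)."""
    O = W.mean(axis=0)                   # centre of the collection, Eq. (5)
    O_S = W[x == 1].mean(axis=0)         # centre of the candidate summary, Eq. (6)
    return cosine(O, O_S) * sum(cosine(O, W[i]) * x[i] for i in range(len(x)))
```

The n x n matrix of pairwise sentence similarities needed by the diversity term defined next can be precomputed once from the same weight matrix W.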


The diversity function f_{diver}(X) we define as follows:

f_{diver}(X) = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} sim(s_i, s_j) x_i x_j.   (7)

A lower value of f_{diver}(X) corresponds to lower overlap in content between the sentences s_i and s_j, i.e. a lower value of function (7) provides high diversity in the summary. By inserting the functions (4) and (7) into (3), we obtain the following mathematical formulation of the problem:

maximize f(X) = \frac{sim(O, O^S) \cdot \sum_{i=1}^{n} sim(O, s_i) x_i}{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} sim(s_i, s_j) x_i x_j}   (8)

subject to

\sum_{i=1}^{n} l_i x_i \le L,   (9)

x_i \in \{0, 1\},   (10)

where L is the length of the summary and l_i is the length of sentence s_i. The summary and sentence lengths are measured in the number of words or bytes. Note that the constraint (9) implies that the length constraint on a summary cannot be violated. The integrality constraint on x_i (10) is automatically satisfied in the problem above.
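Building on the previous sketch (it reuses f_cover), the snippet below scores a candidate selection under the objective (8); the length constraint (9) is exposed as a separate check, since the algorithm described in Section 5 handles infeasible candidates through a dedicated constraint-handling step. S_sim denotes the precomputed matrix of pairwise similarities from Eq. (2); the names and the small guard constants are illustrative assumptions.

```python
import numpy as np

def f_diver(S_sim, x):
    """Diversity term of Eq. (7): summed pairwise similarity of the selected sentences."""
    idx = np.flatnonzero(x)
    return sum(S_sim[idx[a], idx[b]]
               for a in range(len(idx)) for b in range(a + 1, len(idx)))

def feasible(lengths, x, L):
    """Length constraint of Eq. (9)."""
    return float(np.dot(lengths, x)) <= L

def objective(W, S_sim, x):
    """Objective (8): coverage divided by diversity, guarded for degenerate selections."""
    if int(np.sum(x)) < 2:
        return 0.0
    return f_cover(W, x) / max(f_diver(S_sim, x), 1e-12)
```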

4. Basics of differential evolution algorithm

Differential evolution (DE) was proposed by Storn and Price (1997). DE has gradually become more popular and has been successfully applied to many optimization problems due to its strong global search ability, attracting widespread attention among scholars. DE is a simple yet efficient evolutionary algorithm, which has been widely applied to continuous optimization problems (Price, Storn, & Lampinen, 2005). Therefore, in our study the optimization problem (8)-(10) was solved using a DE algorithm.

The DE algorithm is a population-based algorithm, like genetic algorithms, using three operators: crossover, mutation and selection. The main difference in constructing better solutions is that genetic algorithms rely on crossover while DE relies on the mutation operation. This main operation is based on the differences of randomly sampled pairs of solutions in the population. The algorithm uses the mutation operation as a search mechanism and the selection operation to direct the search toward the prospective regions in the search space. The basic idea of the DE scheme is to generate a new trial vector. When mutation is implemented, several differential vectors obtained from the differences of several randomly chosen parameter vectors are added to the target vector to generate a mutant vector. Then, a trial vector is produced by crossover, recombining the obtained mutant vector with the target vector. Finally, if the trial vector yields a better fitness value than the target vector, the target vector is replaced with the trial vector. The main steps of the basic DE algorithm are described below (Chakraborty, 2008; Das & Suganthan, 2011; Lu, Zhou, Qin, Li, & Zhang, 2010).

4.1. Encoding of the individuals and population initialization

DE starts the search with an initial population of individuals randomly sampled from the decision space. The original DE (Storn & Price, 1997) uses a real-coded representation. The pth individual vector of the population at generation t has n components:

U_p(t) = [u_{p,1}(t), ..., u_{p,n}(t)],   (11)

where u_{p,i}(t) is the ith decision variable of the pth individual in the population, i = 1, ..., n; p = 1, ..., P, where P is the population size. In the initialization procedure, P solutions are created at random to initialize the population. At the very beginning of a DE run, problem-independent variables are initialized in their feasible numerical range. Therefore, if the ith variable of the given problem has its lower and upper bounds as u_i^{min} and u_i^{max}, respectively, then the ith component of the pth population member U_p(t) may be initialized as:

u_{p,i}(0) = u_i^{min} + (u_i^{max} - u_i^{min}) \cdot rand_{p,i},   (12)

where rand_{p,i} is a uniform random number between 0 and 1, instantiated independently for each component i ∈ {1, ..., n} of the pth vector U_p(t).
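A one-line sketch of the initialization rule (12), assuming the bounds u_i^min = -5 and u_i^max = 5 that the paper later adopts in Section 5.5; the helper name is ours.

```python
import numpy as np

def initialize_population(P, n, u_min=-5.0, u_max=5.0, rng=None):
    """Eq. (12): u_p,i(0) = u_min + (u_max - u_min) * rand, drawn independently per component."""
    rng = rng or np.random.default_rng()
    return u_min + (u_max - u_min) * rng.random((P, n))
```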

4.2. Mutation

In contrast to most evolutionary algorithms, in which the mutation operator is defined by a probability function, DE is based on a mutation operator which adds an amount obtained from the difference of several randomly chosen individuals of the current population. Some individuals in the population, called the target vectors, are chosen to carry out the mutation operation and generate new individuals called the mutant vectors. For each target vector U_p(t), three vectors U_{p1}(t), U_{p2}(t) and U_{p3}(t) are randomly selected from the same generation, such that the indices p, p1, p2 and p3 are distinct. Then, a mutant vector V_p(t) = [v_{p,1}(t), ..., v_{p,n}(t)] is generated by adding the weighted difference of two vectors to the third:

V_p(t) = U_{p1}(t) + F \cdot (U_{p2}(t) - U_{p3}(t)),   (13)

where F is called the mutation factor, which controls the amplification of the differential variation (U_{p2}(t) - U_{p3}(t)). The usual choice for F is a number between 0.4 and 1.0 (Das & Suganthan, 2011; Storn & Price, 1997).

Many mutation schemes have been proposed for the DE algorithm (Cai, Gong, Ling, & Zhang, 2011; Pan, Suganthan, Wang, Gao, & Mallipeddi, 2011) that use different learning strategies and/or recombination operations in the reproduction stage. In order to distinguish among these schemes, the notation "DE/a/b/c" is used, where "DE" denotes Differential Evolution; "a" specifies the vector to be mutated (which can be a random vector or the best vector); "b" is the number of difference vectors used; and "c" denotes the crossover scheme, binomial or exponential. Using this notation, the DE strategy described in Eq. (13) can be denoted as DE/rand/1/bin. Other well-known schemes are DE/best/1/bin, DE/rand/2/bin, DE/best/2/bin (Storn & Price, 1997; Mallipeldi, Suganthan, Pan, & Tasgetiren, 2011; Qin, Huang, & Suganthan, 2009), and DEGD (DE with Generalized Differential) (Ali, 2011), which can be implemented by (14)-(17), respectively:

V_p(t) = U_{best}(t) + F \cdot (U_{p1}(t) - U_{p2}(t)),   (14)

V_p(t) = U_{p1}(t) + F \cdot (U_{p2}(t) - U_{p3}(t)) + F \cdot (U_{p4}(t) - U_{p5}(t)),   (15)

V_p(t) = U_{best}(t) + F \cdot (U_{p1}(t) - U_{p2}(t)) + F \cdot (U_{p3}(t) - U_{p4}(t)),   (16)

V_p(t) = U_{p1}(t) + F_1 \cdot U_{p2}(t) - F_2 \cdot U_{p3}(t),   (17)

where U_{best}(t) represents the best individual in the current generation, and the indices p1, p2, p3, p4, and p5 are mutually exclusive integers randomly chosen from the set {1, ..., P}, which are also different from the index p. After mutation, all the components of the mutant vector are checked for violation of the boundary constraints. If the ith component v_{p,i}(t) of the mutant vector V_p(t) violates the boundary constraint (12), it is reflected back from the violated boundary as follows (Kukkonen & Lampinen, 2006):

v_{p,i}(t) = \begin{cases} 2u_i^{min} - v_{p,i}(t), & \text{if } v_{p,i}(t) < u_i^{min}, \\ 2u_i^{max} - v_{p,i}(t), & \text{if } v_{p,i}(t) > u_i^{max}, \\ v_{p,i}(t), & \text{otherwise}. \end{cases}   (18)

4.3. Crossover

In order to increase the diversity of the perturbed parameter vectors, a crossover operator is introduced. The target vector U_p(t) is mixed with the mutant vector V_p(t) to produce a trial vector Z_p(t) = [z_{p,1}(t), ..., z_{p,n}(t)]. It is created from the elements of the target vector U_p(t) and the elements of the mutant vector V_p(t) (Das & Suganthan, 2011; Storn & Price, 1997):

z_{p,i}(t) = \begin{cases} v_{p,i}(t), & \text{if } rand_{p,i} \le CR \text{ or } i = k, \\ u_{p,i}(t), & \text{otherwise}, \end{cases}   (19)

where, as before, rand_{p,i} is a uniformly distributed random number lying between 0 and 1, which is called anew for each ith component of the pth parameter vector; CR ∈ [0, 1] is the crossover constant which controls the recombination of the target vector and the mutant vector to generate the trial vector; and k ∈ {1, ..., n} is a randomly chosen index which ensures that at least one element of the trial vector is taken from the mutant vector, since otherwise no new vector would be produced and the population would not evolve. The crossover rate CR is the probability of mixing between trial and target vectors.

4.4. Selection

To keep the population size constant over subsequent generations, the selection process is carried out to determine which of the trial and the target vectors will survive in the next generation, i.e., at time t + 1. The target vector U_p(t) is compared with the trial vector Z_p(t) in terms of the objective function value, and the better one survives into the next generation:

U_p(t+1) = \begin{cases} Z_p(t), & \text{if } f(Z_p(t)) \ge f(U_p(t)), \\ U_p(t), & \text{otherwise}, \end{cases}   (20)

where f(\cdot) is the objective function (8) to be maximized. Therefore, if the trial vector yields a better or equal value of the fitness function, it replaces its target vector in the next generation; otherwise the target is retained in the population. Hence, the population either gets better in terms of the fitness function or remains constant, but never deteriorates. Note that in Eq. (20) the target vector is replaced by the trial vector even if both yield the same value of the objective function; this feature enables DE vectors to move over flat fitness landscapes with generations.

4.5. Stopping criterion

Mutation, crossover and selection operations continue until some stopping criterion is reached. The stopping criterion can be defined in a few ways: (1) by a fixed number of iterations t_max, with a suitably large value of t_max depending upon the complexity of the objective function; (2) when the best fitness of the population does not change appreciably over successive iterations; (3) by a specified CPU time limit; or (4) by attaining a pre-specified objective function value (Das & Suganthan, 2011). According to our previous successful experience (Aliguliyev, 2009a), in this paper we use the first one as the termination criterion, i.e., the algorithm terminates when the maximum number of generations t_max is reached.

5. An improved DE algorithm

In this section, a modified DE algorithm is created to solve the optimization problem (8)-(10) as follows.

5.1. Modified mutation

In the modified DE, an approach has been introduced during mutation to push the trial vector quickly towards the global optimum. The mutation process deals with three vectors, two of which represent the global best (the best vector evaluated up to the current generation), U_{gbest}(t), and the local best, U_{best}(t), respectively. Thus, for each target vector U_p(t) our modification creates the mutant vector V_p(t) by the following scheme:

V_p(t) = U_p(t) + (1 - F(t)) \cdot (U_{gbest}(t) - U_{p1}(t)) + F(t) \cdot (U_{best}(t) - U_{p1}(t)),   (21)

where p ≠ p1 and in each generation the mutation factor is computed as:

F(t) = \exp(-2t / t_{max}).   (22)

Note that as the number of generations increases, the value of F decreases in the range between 1 and 0. Fig. 1 depicts the variation of the mutation factor (F) versus the current generation (t), where t_max = 1000. When the modified mutation is used, in the initial stage U_{best}(t) contributes more to the mutant vector than in the later stage. As the contribution of U_{best}(t) to the mutant vector decreases with generations, the contribution of U_{gbest}(t) increases.

Fig. 1. Mutation factor (F) versus current generation (t).
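The following sketch, again with illustrative names, implements the modified mutation (21) with the decaying factor (22) and reflects out-of-range components back into the box as in Eq. (18); gbest and best are the global-best and generation-best real-coded vectors, and p1 is assumed to differ from p.

```python
import numpy as np

def mutation_factor(t, t_max):
    """Eq. (22): F(t) decays from 1 towards 0 as the generations progress."""
    return np.exp(-2.0 * t / t_max)

def modified_mutant(pop, p, p1, gbest, best, t, t_max, u_min=-5.0, u_max=5.0):
    """Eq. (21): mutant built from the target, the global best and the generation best."""
    F = mutation_factor(t, t_max)
    v = pop[p] + (1.0 - F) * (gbest - pop[p1]) + F * (best - pop[p1])
    # Reflect components that leave [u_min, u_max] back into the box, Eq. (18)
    v = np.where(v < u_min, 2.0 * u_min - v, v)
    v = np.where(v > u_max, 2.0 * u_max - v, v)
    return v
```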


5.2. Self-adaptive crossover

The crossover constant CR, which is usually set to a fixed value in (0, 1) or changed dynamically within (0, 1), is generally one of the key factors affecting DE's performance (Das & Suganthan, 2011; Price et al., 2005). Choosing a suitable value of CR is difficult for DE and is usually problem-dependent. From Eq. (19) we know that more elements will be contributed to the trial vector by the mutant vector when a large value is chosen for CR; in this case, local search around U_p(t) is facilitated by DE to explore the promising area, and the algorithm may not explore sufficiently beyond the locally promising area. On the contrary, the elements of the trial vector will mainly be provided by the target vector when CR takes a small value; in this case, global search is implemented by DE to keep the diversity of the population, and the algorithm might miss the promising area where the global optimal solution exists (Lu et al., 2010).

In this section, a self-adaptive dynamic control mechanism for choosing a suitable value of CR during the evolutionary progress is presented. To calculate the crossover rate for the pth vector in the tth iteration, denoted by CR_p(t), first the relative distance RD_p(t) between the target vector U_p(t) and the best individual U_{best}(t) is defined by:

RD_p(t) = \frac{f(U_{best}(t)) - f(U_p(t))}{f(U_{best}(t)) - f(U_{worst}(t))},   (23)

where U_{worst}(t) represents the worst individual of the population at the tth iteration, f(U_{worst}(t)) = \min_{p \in \{1, ..., P\}} \{f(U_p(t))\}. It can be concluded that a big RD_p(t) means that the target vector U_p(t) is far away from the best individual U_{best}(t) and needs a strong global exploration, therefore a large crossover rate. On the other hand, a small RD_p(t) means that the vector U_p(t) is close to the best individual U_{best}(t) and needs a strong local exploitation, therefore a small crossover rate. Hence, the value of the crossover rate for every target vector in the tth iteration is dynamically changed with the following formula:

CR_p(t) = \frac{2 \tanh(2 \cdot RD_p(t))}{1 + \tanh(2 \cdot RD_p(t))},   (24)

where tanh(z) is the hyperbolic tangent function:

\tanh(z) = \frac{\exp(2z) - 1}{\exp(2z) + 1}.   (25)

Under this definition, it can be concluded that CR_p(t) ∈ [0, 1). Fig. 2 depicts the variation of the crossover rate (CR) versus the relative distance (RD). According to Eqs. (23)-(25), during the search the individuals get different values of RD, and then of the crossover rate, depending on their fitness. While the fitness of an individual is far from the fitness of the best individual, RD for this individual has a big value, and the value of the crossover rate will be large, resulting in strong global search abilities to locate the promising search areas. Meanwhile, when the fitness of an individual gets near that of the best individual, RD for this individual has a small value and the crossover rate will be set small, depending on the distance of its fitness to the best individual's fitness, to facilitate a finer local exploration and so accelerate convergence.

Fig. 2. Crossover rate (CR) versus relative distance (RD).

5.3. Binarization

DE is a real-valued algorithm in its original version; therefore, when DE is used in the proposed sentence selection model, a binary version of DE should be adopted. The binary DE algorithm was introduced by Pampara, Engelbrecht, and Franken (2006) to allow the DE algorithm to operate in binary problem spaces. In this work, a simple binary DE is adopted. The major difference between binary DE and the continuous version is that the components of the individuals are defined in terms of probabilities that a bit will change to one. The component of an individual vector is used as a probability to determine whether a bit will be in the one state or the zero state. So a mapping is introduced to map all real values of individuals to the range [0, 1]. The mathematical description is given as follows (Pampara et al., 2006):

u_{p,i}(t+1) = \begin{cases} 1, & \text{if } rand_{p,i} < sigm(u_{p,i}(t+1)), \\ 0, & \text{otherwise}, \end{cases}   (26)

where rand_{p,i} is a random number selected from a uniform distribution over [0, 1], which is called anew for each ith component of the pth parameter vector, and sigm(z) is the sigmoid function:

sigm(z) = \frac{1}{1 + \exp(-z)}.   (27)

Using this transformation from the real-coded representation (11) and (12), we obtain the binary-coded representation u_{p,i}(t) ∈ {0, 1}, where u_{p,i}(t) = 1 indicates that the ith sentence will be selected for inclusion in the summary; otherwise, the ith sentence will not be selected. For example, the individual U_p(t) = [1, 0, 0, 1, 1] represents a candidate solution in which the first, fourth and fifth sentences are selected to be included in the summary, i.e., x_1 = x_4 = x_5 = 1 and x_2 = x_3 = 0.
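A short sketch of the self-adaptive crossover rate (23)-(24) and the sigmoid binarization (26)-(27); the helper names are ours, and f_best and f_worst denote the maximum and minimum fitness values in the current population.

```python
import numpy as np

def crossover_rate(f_p, f_best, f_worst):
    """Eqs. (23)-(24): CR grows with the relative distance from the best individual."""
    spread = f_best - f_worst
    rd = (f_best - f_p) / spread if spread > 0 else 0.0
    return 2.0 * np.tanh(2.0 * rd) / (1.0 + np.tanh(2.0 * rd))

def binarize(u, rng=None):
    """Eqs. (26)-(27): map a real-coded vector to a 0/1 sentence-selection vector."""
    rng = rng or np.random.default_rng()
    prob = 1.0 / (1.0 + np.exp(-u))          # sigm(z)
    return (rng.random(u.shape) < prob).astype(int)
```

Because RD_p(t) lies in [0, 1], the resulting CR_p(t) stays in [0, 1), matching the discussion above.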

5.4. Constraint handling

After population initialization, mutation, crossover and binarization have been implemented, the newly generated solution may not satisfy the constraint (9). The most popular constraint handling strategy at present is the penalty method, which often uses a penalty function to convert a constrained problem into an unconstrained one. This strategy is very convenient for handling constraints in evolutionary algorithms, since it punishes infeasible solutions during the selection procedure to ensure that the feasible ones are favored. However, this strategy has some drawbacks, and the main one is the requirement of multiple runs for the fine tuning of penalty factors, which increases the computational time and degrades the efficiency of the algorithm. In order to overcome the drawbacks of the penalty method and handle the constraints of the problem effectively, the following heuristic procedure is applied to all P solutions in the population to resolve the constraint (9). The constraint is handled by using a suitable fitness function which depends on the current population. Solutions in a population are assigned fitness so that feasible solutions are emphasized more than infeasible solutions. The following three criteria are satisfied during the handling process (Mezura-Montes, 2009; Woldesenbet, Yen, & Tessema, 2009), as sketched after Section 5.5:

- Any feasible solution wins over any infeasible solution.
- Two feasible solutions are compared only on the basis of their objective function values.
- Two infeasible solutions are compared on the basis of the amount of constraint violation.

5.5. Parameter settings

In the proposed adaptive DE the population size, P = 50, is kept constant throughout the evolution process. The value of P can be increased to obtain the global solution with higher probability. However, the higher the value of P, the higher the number of fitness evaluations. The maximum number of generations was set to t_max = 1000. The lower and upper bounds of the variables were set to u_i^{min} = -5 and u_i^{max} = 5 for all i ∈ {1, ..., n}. These are heuristic choices.
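A minimal sketch of the feasibility-rule comparison described in Section 5.4, using the amount of violation of the length constraint (9); this is our reading of the three criteria, not the authors' code, and the helper names are illustrative.

```python
def violation(lengths, x, L):
    """Amount by which a candidate summary exceeds the length budget of Eq. (9)."""
    return max(0.0, sum(l * xi for l, xi in zip(lengths, x)) - L)

def better(x_a, f_a, x_b, f_b, lengths, L):
    """Feasibility rules: feasible beats infeasible; otherwise compare fitness or violation."""
    v_a, v_b = violation(lengths, x_a, L), violation(lengths, x_b, L)
    if v_a == 0 and v_b == 0:
        return f_a >= f_b          # both feasible: higher objective (8) wins
    if v_a == 0 or v_b == 0:
        return v_a == 0            # a feasible solution wins over an infeasible one
    return v_a <= v_b              # both infeasible: smaller violation wins
```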


5.6. Runtime complexity analysis

Runtime complexity analysis of population-based stochastic search techniques such as DE is a critical issue in its own right (Mallipeldi et al., 2011; Pan et al., 2011). According to Das and Suganthan (2011), the average runtime of a standard DE algorithm can be affected by several factors, such as the number of terms, the number of sentences, the population size, the fitness computation and the number of generations. The time complexity of the proposed method can be analyzed step by step as follows:

1. Initialization of the population needs O(P · n) time, where P and n indicate the population size and the length of each vector in the DE, respectively.
2. Since the mutation and crossover operations are performed at the component level for each DE vector, mutation and crossover require O(P · n) time each.
3. The time complexity of selection is O(P).
4. Checking the boundary constraints (18) requires O(P · n) time.
5. Binarization of the DE requires O(P · n) time.
6. Fitness computation is composed of three steps:
   - The complexity of computing the similarity of n sentences to the centre of the document collection is O(P · n · m).
   - The total complexity of updating the centres (of the summaries) is O(P · m).
   - The complexity of computing the similarity between n sentences is O(P · m · n^2).
   Therefore, the fitness evaluation has total complexity O(P · m · n^2).

Summing up the above complexities, the total time complexity becomes O(P · m · n^2) per generation. For a maximum of t_max generations the total complexity becomes O(P · m · n^2 · t_max).

5.7. Framework of the improved DE algorithm

The pseudo-code of the proposed improved DE algorithm can be summarized as follows:

Step 1 (Initial population). Using the rule (12), create an initial population.
Step 2 (Binarization). Transform real-coded individuals to binary-coded individuals using Eq. (26).
Step 3 (Evaluate initial population). Calculate the fitness value of each individual in the population based on Eq. (8).
Step 4 (Select best and worst vectors). Select the individual with the current best solution.
Step 5 (Create mutant vector). Using Eq. (21), create the mutant vector.
Step 6 (Check the boundary constraints). Using Eq. (18), check the boundary constraints (12) for the components of the mutant vector.
Step 7 (Generate trial vector). Using the crossover operators, Eqs. (19) and (24), generate the trial vector of the target vector.
Step 8 (Binarization). Transform the real-coded trial vector to a binary-coded trial vector using Eq. (26).
Step 9 (Constraint handling). Handle the constraint (9) using the strategy described in Section 5.4.
Step 10 (Selection). If a trial vector is better than the target vector, then replace the target vector by the trial vector in the next generation.
Step 11 (Stopping criterion). Go to Step 12 if the maximum number t_max of generations is reached; otherwise go to Step 2.
Step 12 (Output). Output the summary obtained by the best vector U_best(t) as the final solution at the maximum number of generations and stop.
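Putting the pieces together, the outline below mirrors Steps 1-12 using the helpers from the earlier sketches (initialize_population, binarize, objective, feasible, violation, better, modified_mutant, crossover_rate); it is an illustrative sketch under those assumptions rather than the authors' reference implementation, and simplifications such as how the global best is tracked are ours.

```python
import numpy as np

def summarize(W, S_sim, lengths, L, P=50, t_max=1000, rng=None):
    """Illustrative outline of Steps 1-12 of the improved DE summarizer."""
    rng = rng or np.random.default_rng()
    n = W.shape[0]
    pop = initialize_population(P, n, rng=rng)                          # Step 1, Eq. (12)
    bits = np.array([binarize(u, rng) for u in pop])                    # Step 2, Eq. (26)
    fit = np.array([objective(W, S_sim, x) for x in bits])              # Step 3, Eq. (8)
    gbest, gbest_fit = pop[fit.argmax()].copy(), float(fit.max())
    for t in range(t_max):                                              # Step 11
        best, worst = int(fit.argmax()), int(fit.argmin())              # Step 4
        for p in range(P):
            p1 = int(rng.choice([q for q in range(P) if q != p]))       # p1 != p
            v = modified_mutant(pop, p, p1, gbest, pop[best], t, t_max) # Steps 5-6, Eqs. (21), (18)
            cr = crossover_rate(fit[p], fit[best], fit[worst])          # Eq. (24)
            mask = rng.random(n) <= cr
            mask[rng.integers(n)] = True       # at least one component from the mutant, Eq. (19)
            z = np.where(mask, v, pop[p])                               # Step 7
            zb = binarize(z, rng)                                       # Step 8, Eq. (26)
            fz = objective(W, S_sim, zb)
            if better(zb, fz, bits[p], fit[p], lengths, L):             # Steps 9-10
                pop[p], bits[p], fit[p] = z, zb, fz
        if fit.max() > gbest_fit:                                       # track the global best
            gbest, gbest_fit = pop[fit.argmax()].copy(), float(fit.max())
    feas = [p for p in range(P) if feasible(lengths, bits[p], L)]       # Step 12
    return bits[max(feas, key=lambda p: fit[p])] if feas else bits[int(fit.argmax())]
```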

6. Evaluation

6.1. Experimental data

We conducted experiments on the DUC2002 and DUC2004 data sets, both of which are open benchmark data sets from the Document Understanding Conference (DUC) (http://duc.nist.gov) for generic automatic summarization evaluation. For each document set, four human-generated summaries are provided for the target length. Table 1 gives a brief description of the data sets. All the documents were segmented into sentences using a script distributed by DUC. We used the stoplist from ftp://ftp.cs.cornell.edu/pub/smart/english.stop for document preprocessing. The stoplist contains about 600 common words. In our experiments, stopwords were removed and the remaining words were stemmed using Porter's scheme (http://www.tartarus.org/martin/PorterStemmer/).

6.2. Evaluation metrics

To evaluate our method we use the ROUGE metrics (Lin, 2004), which were adopted by DUC as the official evaluation metrics for text summarization. They include five measures, which automatically determine the quality of a machine-generated summary by comparing it to ideal summaries created by humans: ROUGE-N, ROUGE-L, ROUGE-W, ROUGE-S and ROUGE-SU. These measures evaluate the quality of the summarization by counting the number of overlapping units, such as N-grams, word sequences and word pairs, between the candidate summary and the reference summary.

ROUGE-N measures the N-gram recall between a candidate summary and a set of reference summaries. ROUGE-N is computed as follows:

ROUGE-N = \frac{\sum_{S \in Summ_{ref}} \sum_{N\text{-gram} \in S} Count_{match}(N\text{-gram})}{\sum_{S \in Summ_{ref}} \sum_{N\text{-gram} \in S} Count(N\text{-gram})},   (28)

where N stands for the length of the N-gram, Count_{match}(N-gram) is the maximum number of N-grams co-occurring in the candidate summary and the set of reference summaries, and Count(N-gram) is the number of N-grams in the reference summaries.

ROUGE-L uses the longest common subsequence (LCS) metric in order to evaluate summaries. Each sentence is viewed as a sequence of words, and the LCS between the automatic summary and the reference summary is identified. ROUGE-L is computed as the ratio between the length of the LCS and the length of the reference summary:

P_{LCS}(R, S) = \frac{LCS(R, S)}{|S|},
R_{LCS}(R, S) = \frac{LCS(R, S)}{|R|},
F_{LCS}(R, S) = \frac{(1 + \beta^2) P_{LCS}(R, S) R_{LCS}(R, S)}{\beta^2 P_{LCS}(R, S) + R_{LCS}(R, S)},   (29)

where |R| and |S| are the lengths of the reference summary R and the candidate summary S, respectively, LCS(R, S) is the length of an LCS of R and S, P_{LCS}(R, S) is the precision of LCS(R, S), R_{LCS}(R, S) is the recall of LCS(R, S), and \beta = P_{LCS}(R, S) / R_{LCS}(R, S).

ROUGE-W (Weighted Longest Common Subsequence) is an improvement of the basic LCS method. It favors LCSs with consecutive matches.

ROUGE-S (Skip-Bigram Co-Occurrence Statistics) measures the overlap ratio of skip-bigrams between a candidate summary and a set of reference summaries. A skip-bigram is any pair of words in their sentence order, allowing for arbitrary gaps. ROUGE-S with maximum skip distance N is called ROUGE-SN. ROUGE-S is calculated identically to ROUGE-2, except that skip-bigrams are defined as subsequences rather than by the regular definition of bigrams as substrings. Skip-bigram co-occurrence statistics measure the similarity of a pair of summaries based on how many skip-bigrams they have in common:

P_{SKIP2}(R, S) = \frac{SKIP2(R, S)}{C(|S|, 2)},
R_{SKIP2}(R, S) = \frac{SKIP2(R, S)}{C(|R|, 2)},
F_{SKIP2}(R, S) = \frac{(1 + \beta^2) P_{SKIP2}(R, S) R_{SKIP2}(R, S)}{\beta^2 P_{SKIP2}(R, S) + R_{SKIP2}(R, S)},   (30)

where SKIP2(R, S) is the number of skip-bigram matches between R and S, \beta is the relative importance of P_{SKIP2}(R, S) and R_{SKIP2}(R, S), with P_{SKIP2}(R, S) being the precision of SKIP2(R, S) and R_{SKIP2}(R, S) the recall of SKIP2(R, S), and C(·, ·) is the combination function. One potential problem for ROUGE-S is that it does not give any credit to a candidate sentence if the sentence does not have any word pair co-occurring with its references. ROUGE-SU is an extension of ROUGE-S that solves this problem by adding the unigram as a counting unit; it is a weighted average between ROUGE-S and ROUGE-1.

Table 1. Description of the data sets.

                                        DUC2002      DUC2004
Number of clusters                      59           50
Number of documents in each cluster     10           10
Number of documents                     567          500
Number of terms                         19,464       16,037
Data source                             TREC         TDT
Summary length                          200 words    665 bytes
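For concreteness, a small sketch of the N-gram recall of Eq. (28) for one candidate summary against a set of reference summaries; the naive whitespace tokenization and the function name are illustrative only.

```python
from collections import Counter

def rouge_n(candidate, references, n=2):
    """ROUGE-N recall, Eq. (28): clipped N-gram matches over reference N-gram counts."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand = ngrams(candidate)
    matched = total = 0
    for ref in references:
        ref_counts = ngrams(ref)
        total += sum(ref_counts.values())
        matched += sum(min(cnt, cand[g]) for g, cnt in ref_counts.items())
    return matched / total if total else 0.0
```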


6.3. Evaluation results

The first experiment compares our method with several other methods. They are either newly proposed or commonly used summarization technologies, which represent the current state of development in the summarization literature.

- DUCbest: the scores of the best participating systems in DUC.
- Random: a baseline algorithm that produces summaries through random sentence selection.
- NMF (Lee et al., 2009): performs NMF on the terms-by-sentences matrix and then ranks the sentences by their weighted scores.
- FGB (Wang et al., 2011): translates the clustering-summarization problem into minimizing the Kullback–Leibler divergence between the given documents and the model-reconstructed terms. The minimization process results in two matrices, which represent the probabilities of the documents and sentences given the clusters (topics). The document clusters are generated by assigning each document to the topic with the highest probability, and the summary is formed from the sentences with high probability in each topic. The algorithm for estimating the FGB model is similar to the NMF algorithms (Lee et al., 2009).
- BSTM (Wang et al., 2009): explicitly models the probability distributions of selecting sentences given topics and provides a principled way to perform the summarization task. BSTM is similar to FGB since both are based on a sentence-based topic model; the difference is that the document-topic allocation matrix is marginalized out in BSTM, which increases the stability of the estimation of the sentence-topic parameters. The BSTM model is also related to the 3-factor NMF model.
- LexRank (Erkan & Radev, 2004): defines sentence salience based on graph-based centrality scoring of sentences. Constructing the similarity graph of sentences provides a better view of important sentences compared to the centroid approach, which is prone to over-generalization of the information in a document cluster. Erkan & Radev have introduced three different methods for computing centrality in similarity graphs.
- LSA (Gong & Liu, 2001): first creates a term-sentence matrix, where each column represents the weighted term-frequency vector of a sentence in the set of documents. Then singular value decomposition (SVD) is applied to the matrix to derive the latent semantic structure. The sentences with the greatest combined weights across all the important topics are included in the summary (a simplified sketch of this selection step is given after this list).
- Centroid (Radev et al., 2004): the centroid-based method is the algorithm used in the MEAD system. The score of each sentence is a linear combination of weights computed from three features: (1) centroid-based weight; (2) sentence position; (3) first-sentence similarity.
- MCKP (Takamura & Okumura, 2009a): represents the text summarization task as a maximum coverage problem with a knapsack constraint. We should note that MCKP is different from a knapsack problem; it merely has a constraint of knapsack form. One advantage of this representation is that MCKP can directly model whether each concept in the given documents is covered by the summary or not, and can dispense with rather counter-intuitive approaches such as penalizing each pair of similar sentences.
- WFS-NMF (Wang et al., 2010): extends the NMF model and provides a good framework for weighting different terms and documents.
- WCS (Wang & Li, 2012): the weighted consensus scheme combines diverse summaries from single summarization methods.
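As an illustration of the SVD-based selection used by the LSA baseline, the sketch below picks one sentence per latent topic from a term-sentence matrix. It is a simplified reading of Gong and Liu's procedure under our own naming, not the implementation that was evaluated here.

```python
import numpy as np


def lsa_select(term_sentence_matrix, summary_size):
    # Columns of term_sentence_matrix are weighted term-frequency vectors of sentences.
    # SVD exposes the latent topics; for each of the leading right singular vectors
    # we take the sentence with the largest (absolute) weight that is not yet chosen.
    _, _, vt = np.linalg.svd(term_sentence_matrix, full_matrices=False)
    chosen = []
    for topic in vt:                                   # topics ordered by singular value
        for idx in np.argsort(-np.abs(topic)):
            if idx not in chosen:
                chosen.append(int(idx))
                break
        if len(chosen) == summary_size:
            break
    return chosen
```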

We first run our method on the DUC2002 data set. In what follows we denote our method by OCDsum, where 'O', 'C', and 'D' stand for optimization, coverage and diversity, respectively. By OCDsum-SaDE we refer to our method where the self-adaptive DE is applied to solve the optimization problem, and by OCDsum-DE we refer to our method where the original DE is applied. For the following experiments we report the average recalls of ROUGE-1, ROUGE-2, ROUGE-L and ROUGE-SU, since they are highly correlated with human judgments. Table 2 shows the ROUGE-1, ROUGE-2, ROUGE-L and ROUGE-SU values and rankings of each method over the DUC2002 data set; the number in parentheses in each table slot shows the ranking of each method for the given metric, with rank (1) marking the best performing method. From Table 2 we can see that OCDsum-SaDE performs better than the other methods in terms of ROUGE-1, ROUGE-2, ROUGE-L and ROUGE-SU, with the exception of WFS-NMF: among the compared methods WFS-NMF shows the best performance on all ROUGE metrics, and the ROUGE scores of OCDsum-SaDE are only inferior to those of WFS-NMF. Random shows the worst result. We further compared our results with the other methods on the DUC2004 data set; Table 3 reveals the summary results. We see that in terms of all ROUGE metrics our method outperforms the DUCbest. In terms of the ROUGE-SU metric, OCDsum-SaDE achieves the best result, 0.1367. In terms of the ROUGE-2 and ROUGE-L metrics, the best performance (0.1121 and 0.3960) is achieved by WFS-NMF. Note that all of the results reported in Tables 2 and 3 are averaged over 20 runs. For better demonstration of the results, Figs. 3 and 4 visually illustrate the comparison. Note that we subtracted the ROUGE scores of the worst method, Random, from all the methods and added 0.01 in these figures, so that the differences can be observed more clearly.
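The adjustment used for Figs. 3-5 is just a shift of every score by the score of Random plus a small offset; a trivial sketch (with example values taken from Table 2) is:

```python
def adjust_for_plot(scores, baseline="Random", offset=0.01):
    # Subtract the worst method's score and add a small offset so that the
    # differences between the remaining methods stand out in the bar charts.
    base = scores[baseline]
    return {name: round(value - base + offset, 4) for name, value in scores.items()}


rouge1_duc2002 = {"WFS-NMF": 0.4994, "OCDsum-SaDE": 0.4990, "DUCbest": 0.4987, "Random": 0.3878}
print(adjust_for_plot(rouge1_duc2002))   # Random itself maps to the offset 0.01
```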


Table 2
ROUGE values for methods obtained with the DUC2002 data set.

Methods        ROUGE-1       ROUGE-2       ROUGE-L       ROUGE-SU
WFS-NMF        0.4994 (1)    0.2582 (1)    0.4893 (1)    0.2874 (1)
OCDsum-SaDE    0.4990 (2)    0.2548 (2)    0.4708 (2)    0.2855 (2)
DUCbest        0.4987 (3)    0.2523 (3)    0.4680 (4)    0.2841 (4)
MCKP           0.4938 (4)    0.2511 (4)    0.4694 (3)    0.2855 (3)
WCS            0.4933 (5)    0.2484 (5)    0.4628 (5)    0.2789 (5)
BSTM           0.4881 (6)    0.2457 (6)    0.4552 (6)    0.2702 (6)
FGB            0.4851 (7)    0.2410 (7)    0.4508 (7)    0.2686 (7)
LexRank        0.4796 (8)    0.2295 (8)    0.4433 (8)    0.2620 (8)
Centroid       0.4538 (9)    0.1918 (9)    0.4324 (9)    0.2363 (9)
NMF            0.4459 (10)   0.1628 (10)   0.4151 (10)   0.2169 (10)
LSA            0.4308 (11)   0.1502 (11)   0.4051 (11)   0.2023 (11)
Random         0.3878 (12)   0.1196 (12)   0.3771 (12)   0.1852 (12)

Table 3
ROUGE values for methods obtained with the DUC2004 data set.

Methods        ROUGE-1       ROUGE-2       ROUGE-L       ROUGE-SU
WFS-NMF        0.3933 (3)    0.1121 (1)    0.3960 (1)    0.1354 (2)
OCDsum-SaDE    0.3954 (2)    0.0969 (2)    0.3927 (2)    0.1367 (1)
DUCbest        0.3822 (7)    0.0922 (5)    0.3869 (6)    0.1323 (5)
MCKP           0.3864 (6)    0.0924 (4)    0.3892 (4)    0.1333 (4)
WCS            0.3987 (1)    0.0961 (3)    0.3893 (3)    0.1353 (3)
BSTM           0.3907 (4)    0.0901 (6)    0.3880 (5)    0.1322 (6)
FGB            0.3872 (5)    0.0812 (8)    0.3842 (7)    0.1296 (8)
LexRank        0.3784 (8)    0.0857 (7)    0.3753 (8)    0.1310 (7)
Centroid       0.3673 (10)   0.0738 (9)    0.3618 (10)   0.1251 (10)
NMF            0.3675 (9)    0.0726 (10)   0.3675 (9)    0.1292 (9)
LSA            0.3415 (11)   0.0654 (11)   0.3497 (11)   0.1195 (11)
Random         0.3227 (12)   0.0639 (12)   0.3488 (12)   0.1197 (12)

Fig. 3. Comparison of the methods on DUC2002 (adjusted ROUGE-1, ROUGE-2, ROUGE-L and ROUGE-SU scores for each method).

Table 4 shows the average ROUGE scores over both data sets, DUC2002 and DUC2004, obtained by averaging the values in Tables 2 and 3. From Table 4 we can see that in terms of ROUGE-1 the method OCDsum-SaDE shows the best result. In terms of the ROUGE-2, ROUGE-L and ROUGE-SU metrics it concedes only to WFS-NMF. Fig. 5 visually illustrates the overall comparison of the methods.

6.4. Detailed comparison and analysis

Compared with the ROUGE values of the other summarization systems, the method OCDsum-SaDE achieves significant improvement. In order to show its improvements directly, we adopt the relative enhancement, calculated as (b - a) × 100/a when b is compared to a. Tables 5–7 demonstrate the improvements of OCDsum-SaDE on all ROUGE scores.
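For example, the relative enhancement of OCDsum-SaDE over the DUCbest on DUC2002 in terms of ROUGE-1 is (0.4990 - 0.4987) × 100/0.4987 ≈ +0.06%, the value reported in Table 5. A one-line helper makes the convention explicit:

```python
def relative_enhancement(b, a):
    # Improvement of b over a in percent; positive means b outperforms a.
    return (b - a) * 100.0 / a


print(round(relative_enhancement(0.4990, 0.4987), 2))   # 0.06 (OCDsum-SaDE vs. DUCbest, ROUGE-1, DUC2002)
```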


Fig. 4. Comparison of the methods on DUC2004 (adjusted ROUGE-1, ROUGE-2, ROUGE-L and ROUGE-SU scores for each method).

Table 4
Average ROUGE values for methods obtained on all data sets.

Methods        ROUGE-1       ROUGE-2       ROUGE-L       ROUGE-SU
WFS-NMF        0.4464 (2)    0.1852 (1)    0.4427 (1)    0.2114 (1)
OCDsum-SaDE    0.4473 (1)    0.1759 (2)    0.4318 (2)    0.2111 (2)
DUCbest        0.4405 (4)    0.1723 (3)    0.4275 (4)    0.2082 (4)
MCKP           0.4401 (5)    0.1718 (5)    0.4293 (3)    0.2094 (3)
WCS            0.4460 (3)    0.1723 (4)    0.4261 (5)    0.2071 (5)
BSTM           0.4394 (6)    0.1679 (6)    0.4216 (6)    0.2012 (6)
FGB            0.4362 (7)    0.1611 (7)    0.4175 (7)    0.1991 (7)
LexRank        0.4290 (8)    0.1576 (8)    0.4093 (8)    0.1965 (8)
Centroid       0.4106 (9)    0.1328 (9)    0.3971 (9)    0.1807 (9)
NMF            0.4067 (10)   0.1177 (10)   0.3913 (10)   0.1731 (10)
LSA            0.3862 (11)   0.1078 (11)   0.3774 (11)   0.1609 (11)
Random         0.3553 (12)   0.0918 (12)   0.3630 (12)   0.1525 (12)

Table 7 demonstrates the overall comparison results of the summarization systems. In Tables 5–7, "+" means that OCDsum-SaDE outperforms the compared method and "-" means the opposite. From Table 7 we see that, in the overall comparison with the DUCbest, the method OCDsum-SaDE improves the performance by 1.53%, 2.09%, 1.01% and 1.39% in terms of the ROUGE-1, ROUGE-2, ROUGE-L and ROUGE-SU metrics, respectively. In addition, from the results reported in these tables we have the following observations:

- On the ROUGE-1 metric, the method OCDsum-SaDE shows a better result than WFS-NMF.
- Among the methods, in general WFS-NMF achieves the highest ROUGE-2, ROUGE-L and ROUGE-SU scores. This observation demonstrates that the sentence feature selection is effective and that the weights on the document side help the sentence weighting process.
- Random has the worst performance, as expected.
- The results of MCKP, FGB, BSTM and LexRank are better than those of LSA, NMF and Centroid.
- LSA and NMF are both factorization-based techniques that extract the semantic structure and hidden topics in the documents and select the sentences representing each topic as the summary. However, with nonnegativity constraints, NMF provides better results than LSA.

- The Centroid system outperforms the clustering-based summarization method NMF. This is because the Centroid system takes into account positional value and first-sentence overlap, which are not used in NMF.
- The BSTM model outperforms FGB since the document-topic allocation is marginalized out in BSTM, and the marginalization increases the stability of the estimation of the sentence-topic parameters.
- LexRank outperforms Centroid. This is because LexRank applies graph analysis and takes the influence of other sentences into consideration, which provides a better view of the relationships between the sentences. In addition, LexRank ranks the sentences using eigenvector centrality, which implicitly accounts for information subsumption among all sentences.
- The Centroid method includes in the summary the sentences with the highest similarities to all the other sentences in the documents. This is good since these sentences deliver the majority of the information contained in the documents; however, the redundancy needs to be further removed, and the subtopics in the documents are hard to detect.
- FGB performs better than LexRank. This is because the FGB model makes use of both the term-document and the term-sentence matrices.
- Because WFS-NMF considers the importance of different documents instead of treating them equally, it achieves the best performance on both data sets.


Fig. 5. Overall comparison of the methods (adjusted ROUGE-1, ROUGE-2, ROUGE-L and ROUGE-SU scores averaged over DUC2002 and DUC2004).

Table 5
Comparison of OCDsum-SaDE with other methods on DUC2002.

               Improvement of OCDsum-SaDE (%)
Methods        ROUGE-1    ROUGE-2    ROUGE-L    ROUGE-SU
WFS-NMF        -0.08      -1.32      -3.78      -0.66
DUCbest        +0.06      +0.99      +0.60      +0.49
MCKP           +1.05      +1.47      +0.30      0.00
WCS            +1.16      +2.58      +1.73      +2.37
BSTM           +2.23      +3.70      +3.43      +5.66
FGB            +2.87      +5.73      +4.44      +6.29
LexRank        +4.05      +11.02     +6.20      +8.97
Centroid       +9.96      +32.85     +8.88      +20.82
NMF            +11.91     +56.51     +13.42     +31.63
LSA            +15.83     +69.64     +16.22     +41.13
Random         +28.60     +110.95    +24.11     +53.40

Table 6
Comparison of OCDsum-SaDE with other methods on DUC2004.

               Improvement of OCDsum-SaDE (%)
Methods        ROUGE-1    ROUGE-2    ROUGE-L    ROUGE-SU
WFS-NMF        +0.53      -13.56     -0.83      +0.96
DUCbest        +3.45      +5.10      +1.50      +3.33
MCKP           +2.33      +4.87      +0.90      +2.55
WCS            -0.83      +0.83      +0.87      +1.03
BSTM           +1.20      +7.55      +1.21      +3.40
FGB            +2.12      +19.33     +2.21      +5.48
LexRank        +4.49      +13.07     +4.64      +4.35
Centroid       +7.65      +31.30     +8.54      +9.27
NMF            +7.59      +33.47     +6.86      +5.80
LSA            +15.78     +48.17     +12.30     +14.39
Random         +22.53     +51.64     +12.59     +14.20

- WFS-NMF extends the NMF model and provides a good framework for weighting different terms and documents. Hence, it outperforms the NMF-related methods, i.e., the NMF, FGB and BSTM methods, on both data sets. Meanwhile, this algorithm can discover important term features. This observation demonstrates that the sentence feature selection is effective and that the weights on the document side help the sentence weighting process.
- The ROUGE scores of OCDsum-SaDE are higher than those of the DUCbest on DUC2004 and comparable to those of the DUCbest on DUC2002. Note that the good results of the DUCbest team come from the fact that they perform deeper natural language processing techniques to resolve pronouns and other anaphoric expressions, which we do not use for the data preprocessing.
- The results of the optimization-based methods, MCKP and OCDsum-SaDE, are better than those of the Centroid method and LexRank. This shows that the traditional selection methods in a cluster are not good enough and that an optimization approach is a better choice.

We also note that the good performance of WFS-NMF benefits from its weighting schemes for sentence features (or sentence samples). In addition, the good performance of the DUCbest benefits from preprocessing the data using deep natural language analysis, which is not applied in the other methods, in particular in our method. Although we could spend more effort on the preprocessing or language-processing step, our goal here is to demonstrate the effectiveness of formalizing the document summarization problem as an optimization problem, and hence we do not utilize advanced NLP techniques for preprocessing. The experimental results demonstrate that the proposed optimization-based method leads to competitive performance.

Table 7
Overall comparison of OCDsum-SaDE with other methods.

               Improvement of OCDsum-SaDE (%)
Methods        ROUGE-1    ROUGE-2    ROUGE-L    ROUGE-SU
WFS-NMF        +0.19      -5.02      -2.46      -0.14
DUCbest        +1.53      +2.09      +1.01      +1.39
MCKP           +1.61      +2.39      +0.57      +0.81
WCS            +0.27      +2.09      +1.34      +1.93
BSTM           +1.78      +4.73      +2.41      +4.92
FGB            +2.53      +9.16      +3.41      +6.03
LexRank        +4.24      +11.58     +5.48      +7.43
Centroid       +8.93      +32.42     +8.73      +16.82
NMF            +9.96      +49.41     +10.34     +21.99
LSA            +15.81     +63.13     +14.40     +31.20
Random         +25.88     +91.66     +18.96     +38.47

6.5. Efficiency

The computational efficiency of the algorithm is an important factor. Our evaluations are performed with Delphi 7 on a server running Windows Vista with two Dual-Core Intel Xeon 4 GHz CPUs and 4 GB of memory. Table 8 shows the comparison in terms of the time spent by each method; the last column shows the rank of each method by its running time. From the experimental results we clearly observe that (1) the WFS-NMF method performs the slowest; (2) the methods MCKP, NMF, FGB, BSTM and Centroid spend almost equal time; and (3) our method takes third place and spends almost the same time as WCS.

Table 8
Comparison of the methods on time spent.

Methods        DUC2002 (min)    DUC2004 (min)    Rank
LexRank        17.2             16.1             1
LSA            21.3             20.2             2
OCDsum-SaDE    24.9             19.7             3
WCS            27.3             22.5             4
NMF            32.4             30.9             5
Centroid       32.7             28.8             6
FGB            33.9             31.7             7
BSTM           36.7             33.1             8
MCKP           37.4             33.2             9
WFS-NMF        41.4             37.3             10

6.6. Comparison of OCDsum-SaDE with OCDsum-DE

This section demonstrates the feasibility of the self-adaptive DE (OCDsum-SaDE) based document summarization. The results are compared to those obtained with the original DE (OCDsum-DE). In the self-adaptive DE the parameter CR is determined using Eq. (20); in contrast, in the original DE the parameter CR does not change during the search process and is equal to CR = 0.7. In the original DE algorithm the mutation strategy of Eq. (19) is utilized, where the mutation factor was set to F = 0.6. To perform a fair comparison, the same computational effort is used for both the original DE and the self-adaptive DE. That is, the maximum generation, the population size and the searching range of the parameters in the original DE are the same as those in the self-adaptive DE. In addition, the random number generator is initialized with the same seed values. The number of runs, the population size and the maximum number of iterations (generations) are set to 20, 50, and 1000, respectively. Tables 9 and 10 show the worst, mean, best and standard deviation ("Stdv.") of the ROUGE results over the 20 runs for each algorithm, original DE and self-adaptive DE. From Tables 9 and 10 it is obvious that the worst results obtained by the self-adaptive DE are even better than the mean results obtained by the original DE.
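For readers unfamiliar with DE, the sketch below shows one generation of the classic DE/rand/1/bin scheme with the fixed settings used for the original DE baseline (F = 0.6, CR = 0.7). It is a generic illustration under an assumed maximization objective: the exact mutation strategy of Eq. (19) and the self-adaptive crossover rule of Eq. (20) are defined earlier in the paper and are only indicated by comments here.

```python
import numpy as np


def de_generation(population, fitness_fn, F=0.6, CR=0.7, rng=None):
    # One generation of DE/rand/1/bin with fixed F and CR (the OCDsum-DE setting).
    # In OCDsum-SaDE the crossover rate CR would be recomputed for every individual
    # from its fitness (Eq. (20)) instead of being held constant.
    rng = rng or np.random.default_rng()
    NP, D = population.shape
    fitness = np.array([fitness_fn(x) for x in population])
    new_population = population.copy()
    for i in range(NP):
        r1, r2, r3 = rng.choice([j for j in range(NP) if j != i], size=3, replace=False)
        mutant = population[r1] + F * (population[r2] - population[r3])   # rand/1 mutation
        mask = rng.random(D) < CR
        mask[rng.integers(D)] = True                   # guarantee at least one mutant component
        trial = np.where(mask, mutant, population[i])  # binomial crossover
        if fitness_fn(trial) >= fitness[i]:            # greedy selection (maximization assumed)
            new_population[i] = trial
    return new_population
```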

Table 9
Comparison of OCDsum-SaDE with OCDsum-DE on DUC2002.

           Algorithm      Worst     Mean      Best      Stdv.
ROUGE-1    OCDsum-SaDE    0.4972    0.4990    0.5039    1.35e-04
           OCDsum-DE      0.4908    0.4943    0.4997    1.84e-04
ROUGE-2    OCDsum-SaDE    0.2524    0.2548    0.2575    1.32e-04
           OCDsum-DE      0.2487    0.2512    0.2543    1.71e-04
ROUGE-L    OCDsum-SaDE    0.4687    0.4708    0.4734    1.21e-04
           OCDsum-DE      0.4662    0.4673    0.4722    1.62e-04
ROUGE-SU   OCDsum-SaDE    0.2834    0.2855    0.2873    1.39e-04
           OCDsum-DE      0.2787    0.2821    0.2849    1.72e-04

Table 10
Comparison of OCDsum-SaDE with OCDsum-DE on DUC2004.

           Algorithm      Worst     Mean      Best      Stdv.
ROUGE-1    OCDsum-SaDE    0.3918    0.3954    0.3987    1.05e-04
           OCDsum-DE      0.3874    0.3909    0.3961    1.19e-04
ROUGE-2    OCDsum-SaDE    0.0943    0.0969    0.0996    1.07e-04
           OCDsum-DE      0.0891    0.0926    0.0949    1.24e-04
ROUGE-L    OCDsum-SaDE    0.3896    0.3927    0.3953    1.12e-04
           OCDsum-DE      0.3853    0.3889    0.3932    1.32e-04
ROUGE-SU   OCDsum-SaDE    0.1348    0.1367    0.1394    1.21e-04
           OCDsum-DE      0.1309    0.1331    0.1376    1.48e-04

6.7. Statistical significance test

In order to statistically compare the performance of OCDsum-SaDE with the other summarization methods, we use a non-parametric statistical significance test, Wilcoxon's matched-pairs signed-rank test (Hollander & Wolfe, 1999), to determine the significance of our results. The statistical significance test has been conducted at the 5% significance level on the summarization results. Eleven groups, corresponding to the eleven methods 1. OCDsum-SaDE, 2. WFS-NMF, 3. MCKP, 4. WCS, 5. BSTM, 6. FGB, 7. LexRank, 8. Centroid, 9. NMF, 10. LSA and 11. Random, have been created for each data set. Two groups are compared at a time, one corresponding to the OCDsum-SaDE method and the other corresponding to some other method considered in this paper. Each group consists of the ROUGE-1, ROUGE-2, ROUGE-L and ROUGE-SU scores for the data sets produced by 20 consecutive runs of the corresponding method. To establish that the results of OCDsum-SaDE are statistically significant, Table 11 reports the P-values produced by Wilcoxon's matched-pairs signed-rank test for the comparison of two groups (one group corresponding to OCDsum-SaDE and another group corresponding to some other algorithm) at a time (http://www.graphpad.com/). As the null hypothesis, it is assumed that there are no significant differences between the median values of the two groups, whereas the alternative hypothesis is that there is a significant difference in the median values of the two groups. It is clear from Table 11 that the P-values are much less than 0.05 (the 5% significance level). For example, the Wilcoxon's matched-pairs signed-rank test between the algorithms OCDsum-SaDE and WFS-NMF for DUC2002 and DUC2004 provides a P-value of 0.0035 and 0.0026 (ROUGE-1), respectively, which is

Table 11
P-values produced by Wilcoxon's matched-pairs signed rank test by comparing OCDsum-SaDE with other methods.

Comparing medians of ROUGE-1 metric of OCDsum-SaDE with other methods
Data set     WFS-NMF    Centroid    NMF       LSA    Random
DUC2002      0.0035     0.0011      0.0007
DUC2004      0.0026