Using Sentence-Selection Heuristics to Rank Text Segments in TXTRACTOR
Daniel McDonald and Hsinchun Chen
Artificial Intelligence Lab, Management Information Systems Department, University of Arizona, Tucson, AZ 85721, USA
520-621-2748
{dmm, hchen}@eller.arizona.edu

ABSTRACT
TXTRACTOR is a tool that uses established sentence-selection heuristics to rank text segments, producing summaries that contain a user-defined number of sentences. The purpose of identifying text segments is to maximize topic diversity, which is an adaptation of the Maximal Marginal Relevance criterion used by Carbonell and Goldstein [5]. Sentence-selection heuristics are then used to rank the segments. We hypothesize that ranking text segments via traditional sentence-selection heuristics produces a balanced summary with more useful information than one produced by using segmentation alone. The proposed summary is created in a three-step process: 1) sentence evaluation, 2) segment identification, and 3) segment ranking. As the required length of the summary changes, low-ranking segments can be dropped from (or higher-ranking segments added to) the summary. We compare the output of TXTRACTOR to the output of a segmentation tool based on the TextTiling algorithm to validate the approach.

Categories and Subject Descriptors
I.2.7 Natural Language Processing - Language parsing and understanding, Text analysis

General Terms: Algorithms

Keywords
Text summarization, text segmentation, Information Retrieval, text extraction

1. INTRODUCTION
1.1 Digital Libraries
Automatic text summarization offers potential benefits to the operation and design of digital libraries. As digital libraries grow in size, so does the user's need for information filtering tools. Indicative text summarization systems support the user in deciding which documents to view in their totality and which to ignore. Some summarization techniques use measures of query relevance to tailor the summary to a specific query [22] [5]. Providing tools for users to sift through query results can potentially ease the burden of information overload. Using document summaries can also potentially improve the results of queries on digital libraries. Relevance feedback methods usually select terms from entire documents in order to expand queries; Lam-Adesina and Jones found query expansion using document summaries to be considerably more effective than query expansion using full documents [13]. Other summarization research explores the processing of summaries instead of full documents in information retrieval tasks [18, 21]. Using summaries instead of full documents in a digital library has the potential to speed query processing and facilitate greater post-retrieval analysis, again potentially easing the burden of information overload.

1.2 Background
Approaches to text summarization vary greatly. A distinction is frequently made between summaries produced by text extraction and those produced by text abstraction. Text extraction is widely used [10], utilizing sentences from a document to create a summary; early examples of summarization techniques utilized text extraction [16]. Text abstraction programs, on the other hand, produce grammatical sentences that summarize a document's concepts, and the concepts in an abstract are often thought of as having been compressed. While the formation of an abstract may better fit the idea of a summary, its creation involves greater complexity and difficulty [10]. Producing abstracts usually involves several stages, such as topic fusion and text generation, that are not required for text extracts. Recent summarization research has largely focused on text extraction, with renewed interest in sentence-selection summarization methods in particular [17]. An extracted summary remains closer to the original document, by using sentences from the text, thus limiting the bias that might otherwise appear in a summary [16]. TXTRACTOR continues this trend by utilizing text extraction methods to produce summaries. The goals of text summarizers can be categorized by their intent, focus, and coverage [7]. Intent refers to the potential use of the summary. Firmin and Chrzanowski divide a summary's intent into three main categories: indicative, informative, and evaluative. Indicative summaries give an indication of the central topic of the original text or enough information to judge the text's relevancy.
Informative summaries can serve as substitutes for the full documents, and evaluative summaries express the point of view of the author on a given topic. Focus refers to the summary's scope, whether generic or query-relevant: a generic summary is based on the original text, while a query-relevant summary is based on a topic selected by the user. Finally, coverage refers to the number of documents that contribute to the summary, whether the summary is based on a single document or multiple documents. TXTRACTOR uses a text extraction approach to produce summaries that are categorized as indicative, generic, and based on single documents.
2. RELATED RESEARCH
TXTRACTOR is most strongly related to the research by Carbonell and Goldstein [5] that strives to reduce the redundancy of information in a query-focused summary. Carbonell and Goldstein introduce the concept of Maximal Marginal Relevance (MMR), where each sentence is ranked based on a combination of a relevance measure and a diversity measure. The consideration of diversity in TXTRACTOR is achieved by segmenting a document using the TextTiling algorithm [9]. Sentences coming from different text segments are considered adequately diverse, and all text segments must be represented in a summary before additional sentences from an already represented segment can be added. Nomoto [18] and Radev [19] also present different ways to implement diversity calculations for summary creation. Unlike the summarization work done by Carbonell and Goldstein, however, TXTRACTOR is not query-focused; it uses sentence-selection heuristics, instead of query relevance, to rank a document's sentences.

2.1 Sentence Selection
Much research has been done on techniques to identify sentences that effectively summarize a document. Luhn in 1958 first utilized word-frequency-based rules to identify sentences for summaries [16]. Edmundson (1969) added three rules in addition to word frequencies for selecting sentences to extract: cue phrases (e.g., "significant," "impossible," "hardly"), title and heading words, and sentence location (words starting a paragraph were more heavily weighted) [6]. The ideas behind these older approaches are still referenced in modern text extraction research. Sentence-selection methods assist in finding the salient sentences in a document; by salient, we mean sentences a user would include in a summary. There has been much review of sentence-selection methods in research. Teufel and Moens found the use of cue phrases to be the best individual method [24]. Kupiec et al., on the other hand, found the position-based method to be the best [12]. Regarding the combination of sentence-selection heuristics, research conducted by Kupiec, Pedersen, and Chen found that the best mix of extraction methods included position, cue phrase, and sentence length. Aone et al. tested several variations of tf*idf and the use or suppression of proper names in their system DimSum [3]. Goldstein et al. found that summary sentences had 90 percent more proper nouns per sentence [8]. When deciding which combination of extraction methods to use in TXTRACTOR, we assume each method is independent and its impact can be aggregated into the total sentence score. As we conduct additional summarization experimentation, we will further refine our use of sentence-selection methods and add additional promising methods. Despite the usefulness of sentence extraction methods in finding salient sentences, they cannot alone produce the highest-quality extracts. Sentence-selection techniques are often domain dependent. For example, the words "Abstract" and "in conclusion" are more likely to appear in scientific literature than in newspaper articles [10]. Position-based methods are also domain-dependent: the first sentence in a paragraph contains the topic sentence in some domains, whereas it is the last sentence elsewhere. Combined with other techniques, however, these extraction methods can still contribute to the quality of a summary.

2.2 Document Segmentation
Document segmentation is an Information Retrieval (IR) approach to summarization. Narrowing the scope from a collection of documents, the IR approach views a single document as a collection of words and phrases from which topic boundaries must be identified [10]. Recent research in this field, particularly the TextTiling algorithm [9], suggests that a document's topic boundaries can be identified with a fair amount of success. Once a document's segments have been identified, sentences from within the segments are typically extracted using word-based rules in order to turn a document's segments into a summary. Breaking a document into segments identifies the document's topic boundaries, and segmentation is an effective way to make sure that a document's topics are adequately represented in a summary. The IR approach to extraction does have some weaknesses. Having a word-level focus "prevents researchers from employing reasoning at the non-word level" [10]. While the IR technique successfully segments single documents into topic areas [9], the selection of sentences to extract from within those topic areas could be improved by using many different heuristics, both word-based and those that utilize language knowledge. In addition, once a document is segmented, there is no way to know which of the segments is the most salient to the overall document. Some mechanism is required to rank segments so that the most pertinent topic information either gets extra coverage in the summary or is covered first in the summary. Ranking segments also addresses a practical problem: when the required number of sentences in a summary is less than the number of identified segments, there must be an intelligent way to decide which segments will not be covered. A possible solution would be to force a document to have a number of segments that matches the number of sentences allowed in the summary. Presetting the number of acceptable topic areas, however, seems to defeat the process of true segment identification; segment boundaries would seem arbitrary if there were a limit on their number. The process of finding a document's topic areas is separate from that of selecting representative sentences to appear in the summary. A ranking of the segments, however, would allow a summary to grow and shrink while extracting sentences from as many of the highest-ranked topic areas as possible. While this approach would not be suited to an informative summary, ranking segments and controlling the number of sentences in a summary are acceptable for an indicative summary.

2.3 Combination Proposal
TXTRACTOR attempts to capture the benefits of sentence-selection summarization and document segmentation while overcoming many of their deficiencies. The document segmentation algorithm identifies the document's main topic areas, while sentence-selection heuristics identify the salient sentences of a summary. The topic areas are used as the foundation for the summary, and the salient sentences are used as a compass guiding the inclusion of certain topic areas. Document segmentation provides a thorough domain-independent analysis of the entire document, created in a bottom-up manner. Sentence-selection heuristics provide saliency information in a structured
top-down manner. In addition, we have included many sentence-selection techniques in order to reduce the domain-dependency effect of any one heuristic. We hypothesize that ranking a document's segments on the basis of their containing one or many of the document's salient sentences will produce summaries that are more information-rich than those produced by the segmentation-only approach.
3. TXTRACTOR IMPLEMENTATION
TXTRACTOR is a text-extraction summarizer written in Java. Its major components include sentence-selection rules and a segmentation algorithm. The summarization process takes place in three main steps: 1) sentence evaluation, 2) segmentation or topic boundary identification, and 3) segment ranking and extraction.

3.1 Sentence Evaluation
The summarization process begins by parsing the sentences of the original text using a program that recognizes 60 abbreviations and various punctuation anomalies. The original order of the sentences is preserved so they can be added to the summary in the intended order. Once the sentences are identified, TXTRACTOR begins ranking each sentence. We use five sentence-selection heuristics to evaluate the document's sentences; each of the following ranking methods contributes to the score of each sentence.
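For illustration, abbreviation-aware sentence splitting along these lines is one way to read the description above. This is a minimal sketch, not TXTRACTOR's actual parser; the ABBREVIATIONS set is a hypothetical stand-in for the 60 abbreviations the parser recognizes.

    import java.util.*;

    public class SentenceSplitter {
        // Hypothetical stand-in for the 60 abbreviations the parser recognizes.
        private static final Set<String> ABBREVIATIONS = new HashSet<>(Arrays.asList(
            "Dr.", "Mr.", "Mrs.", "Ms.", "Ph.D.", "e.g.", "i.e.", "etc."));

        public static List<String> split(String text) {
            List<String> sentences = new ArrayList<>();
            StringBuilder current = new StringBuilder();
            for (String token : text.split("\\s+")) {
                current.append(token).append(' ');
                boolean endsSentence = token.endsWith(".") || token.endsWith("?") || token.endsWith("!");
                // Do not break after a known abbreviation such as "Dr." or "e.g."
                if (endsSentence && !ABBREVIATIONS.contains(token)) {
                    sentences.add(current.toString().trim());
                    current.setLength(0);
                }
            }
            if (current.length() > 0) sentences.add(current.toString().trim());
            return sentences;  // original document order is preserved
        }
    }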
3.1.1 Presence of cue phrases
Currently, each sentence is checked for the existence of ten different cue phrases (e.g., "in summary," "in conclusion," "in short," "therefore"). Cue phrases are words that signal to a reader that the author is going to summarize his or her idea or subject. The cue phrases are loaded from a text file so that additional words can easily be added as more experimentation is done in this area.
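A minimal sketch of this check follows; the class is illustrative rather than the system's actual code, and the 10-point bonus merely echoes the value visible in Figure 1.

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.List;

    public class CuePhraseRule {
        private final List<String> cuePhrases;  // loaded from a text file, one phrase per line

        public CuePhraseRule(String path) throws IOException {
            cuePhrases = Files.readAllLines(Paths.get(path));
        }

        // Returns a score bonus if the sentence contains any cue phrase.
        public int score(String sentence) {
            String lower = sentence.toLowerCase();
            for (String phrase : cuePhrases) {
                if (lower.contains(phrase.toLowerCase())) return 10;  // e.g., "thus" adds 10 points in Figure 1
            }
            return 0;
        }
    }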
3.1.2 Proper nouns in the sentence
A TXTRACTOR-generated summary is meant to provide enough information for a user to decide whether he or she wants to read the original document in its entirety. Important to this decision is the existence of certain proper names and places. Currently, TXTRACTOR simply reads each sentence and counts the capitalized words, not including the opening word of the sentence. This is meant as a temporary implementation until a full entity-extraction algorithm can be implemented. The total number of capitalized words in each sentence is then averaged over the number of words in the sentence, so shorter sentences are not penalized for having fewer proper nouns than longer sentences. The average number of proper nouns is then normalized and added to the sentence's score.
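As a sketch, the capitalized-word count described above reduces to a few lines (illustrative code, not the system's implementation):

    public class ProperNounRule {
        // Fraction of words (excluding the opening word) that are capitalized.
        public double score(String sentence) {
            String[] words = sentence.trim().split("\\s+");
            if (words.length < 2) return 0.0;
            int capitalized = 0;
            for (int i = 1; i < words.length; i++) {  // skip the opening word
                if (Character.isUpperCase(words[i].charAt(0))) capitalized++;
            }
            return (double) capitalized / words.length;  // averaged over sentence length
        }
    }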
3.1.3 TF*IDF
Tf*idf measures how frequently the words in a sentence occur relative to their occurrence in the entire document; sentences that have document words in common are scored higher. To calculate tf*idf, the occurrence of every word in a sentence and the word's total occurrences in the document are totaled. Before the terms are totaled, however, each word is made lower-case and stemmed using the Porter stemmer. The Porter stemmer is one of the most widely used stemming algorithms [11] and can be thought of as a lexicon-free stemmer because it uses cascaded rewrite rules that can be run very quickly and do not require the use of a lexicon. Stemming is performed so that words with the same stem but different affixes may be treated as the same word when calculating the frequency of a particular term. The tf*idf calculation is computed and then averaged over the length of the sentence; thus, a high tf*idf score for a sentence is normalized for sentence length. The resulting score is then added to the sentence's score.
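A minimal sketch of this length-normalized score follows. The suffix-stripping stand-in below is deliberately crude; the real system uses the full Porter rule cascade, which is too long to reproduce here.

    import java.util.*;

    public class TfIdfRule {
        // Crude stand-in for the Porter stemmer: lower-case and strip a few common suffixes.
        static String stem(String word) {
            String w = word.toLowerCase();
            for (String suffix : new String[]{"ing", "ed", "es", "s"}) {
                if (w.endsWith(suffix) && w.length() > suffix.length() + 2)
                    return w.substring(0, w.length() - suffix.length());
            }
            return w;
        }

        // Sum of document-wide frequencies of the sentence's stems, averaged over sentence length.
        public double score(String sentence, Map<String, Integer> documentTermCounts) {
            String[] words = sentence.toLowerCase().split("\\W+");
            double total = 0;
            for (String word : words) {
                total += documentTermCounts.getOrDefault(stem(word), 0);
            }
            return total / Math.max(1, words.length);  // normalize for sentence length
        }
    }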
3.1.4 Sentence position in a paragraph
As the sentences are extracted from the original document, new lines and carriage returns signal the beginning of new paragraphs. The beginning sentence of a document and the beginning sentence of a paragraph are given additional weight due to their greater summarizing potential.
3.1.5 Sentence length
The length of a sentence can provide clues as to its usefulness in a summary [12] [3]. Before adding the sentence-length heuristic, we tried to achieve the same effect by simply not averaging tf*idf scores over the number of words in the sentence; longer sentences would naturally score higher because they contain more non-stop-word terms. This approach overly weighted long sentences, to the point where scores from the tf*idf equation would overpower the scores from other areas, while normalizing that score would mute the value of a concentrated sentence with many document-wide terms. To solve this problem, we made sentence length its own rule: the length of a sentence is calculated and its impact added to the sentence's weight.

Because each of the five sentence-selection rules is calculated differently, each score had to be normalized so that the impact of each rule would be comparable. For example, the impact of an extra proper noun in a sentence is not the same as that of a sentence occurring first in a paragraph or of a sentence being very long. The current normalization factor for each heuristic was determined through experimentation. TXTRACTOR has a configuration option that allows the user to adjust the impact of each sentence-selection heuristic without having to recompile the program. Because each extraction heuristic was normalized, a user can change the weighting of a particular heuristic and immediately judge its impact on the summary; including the configuration capability facilitates experimentation with different heuristic weights. In addition, while the summary-generation logic of TXTRACTOR was designed to be reasonably domain-independent, a user can still change the weighting given to different sentence-selection rules through the configuration option, thus customizing the summarizer to different domains and uses. For example, a user may want to see as many proper nouns as possible in the summary; increasing the weight of the proper-nouns rule will cause sentences with proper nouns to move to the top of the sentence ranking and thus appear more often in the summary.

Once each sentence is scored based on the above five heuristics, the sentences are ranked according to their summarizing value. Unlike segmentation-only approaches, sentences are ranked against other sentences from the entire document, not only those within the same topic area. An example of sentence scoring is shown in Figure 1. The three sentences listed are the top three sentences extracted from a document entitled "May the Source Be With You" from Wired Magazine [15]; a nearly complete copy of the article is found in Figure 4. Each sentence in Figure 1 comes from a different topic segment. The highest-scoring sentence greatly benefits from being the first sentence in the article (+30); this position heuristic has been shown to be the most effective of all sentence-selection heuristics [12]. All three sentences begin a paragraph (+20). The second sentence contains the cue phrase "thus," adding 10 points, which allows it to outrank the third sentence despite the third sentence having the highest tf*idf and sentence-length values of the three.
(topic: 0) (sentence 0) (score: 95) "The laws protecting software code are stifling creativity, destroying knowledge, and betraying the public trust." (First document sentence: +30, first sentence of paragraph: +20, proper nouns: +0, tf*idf: +34, sentence length: +11 = score of 95)

(topic: 5) (sentence 52) (score: 85) "Thus, I would dramatically reduce the safeguards for software from the ordinary term of 95 years to an initial term of 5 years, renewable once." (First sentence of paragraph: +20, cue phrase "thus": +10, proper nouns: +0, tf*idf: +41, sentence length: +14 = score of 85)

(topic: 1) (sentence 23) (score: 85) "Finally, while control is needed, and perfectly warranted, our bias should be clear up front: Monopolies are not justified by theory; they should be permitted only when justified by facts." (First sentence of paragraph: +20, proper nouns: +1, tf*idf: +45, sentence length: +18 = score of 85)

Figure 1 - the weighting of individual sentences
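Putting the five heuristics together, per-sentence scoring reduces to a weighted sum. The sketch below is illustrative only: the method signature and weight names are ours, and the default weights merely echo the bonuses visible in Figure 1, not TXTRACTOR's actual normalization factors (which were set experimentally and are user-adjustable through the configuration option).

    public class SentenceScorer {
        // Illustrative weights echoing Figure 1; the real factors were determined
        // through experimentation and can be adjusted without recompiling.
        double firstInDocument = 30, firstInParagraph = 20, cuePhraseWeight = 10;
        double properNounWeight = 1, tfIdfWeight = 1, lengthWeight = 1;

        public double score(boolean docStart, boolean paraStart, boolean hasCuePhrase,
                            double properNounScore, double tfIdfScore, double lengthScore) {
            double total = 0;
            if (docStart)     total += firstInDocument;   // position: first sentence of the document
            if (paraStart)    total += firstInParagraph;  // position: opens a paragraph
            if (hasCuePhrase) total += cuePhraseWeight;   // e.g., "thus", "in conclusion"
            total += properNounWeight * properNounScore;  // normalized proper-noun density
            total += tfIdfWeight * tfIdfScore;            // length-normalized tf*idf
            total += lengthWeight * lengthScore;          // sentence-length rule
            return total;  // sentences are then ranked document-wide by this value
        }
    }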
3.2 Segmentation
The segmentation algorithm used is based on the TextTiling algorithm developed by Marti Hearst [9]. The TextTiling algorithm analyzes a document and determines where the topic boundaries are located; a topic boundary can be thought of as the point at which the author of the document changes subjects or themes. The first step in the TextTiling algorithm is to divide the text into token-sequences, removing any words that appear on the stop list. We have used a token-sequence length of 20 and the same stop-word list used by Hearst in TextTiling. Token-sequences are then combined to form blocks, and blocks are compared using a similarity algorithm. The comparison between blocks functions like a sliding window: the first block contains the first token-sequence plus the k token-sequences before it, and the second block contains the second token-sequence and the k token-sequences after it. The value of k used in our summarizer is 10, again the same value used by Hearst in TextTiling. The blocks are then compared using an algorithm that returns the similarity as a percentage, derived from the number of times the same terms appear in the two blocks being compared. The Jaccard coefficient is used for the similarity equation, which differs slightly from the normalized inner-product equation used by Hearst; we did not consider the impact of using different similarity equations to be significant. The Jaccard coefficient is as follows:

S_{i,j} = \frac{\sum_{k=1}^{L} w_{ik} w_{jk}}{\sum_{k=1}^{L} w_{ik}^{2} + \sum_{k=1}^{L} w_{jk}^{2} - \sum_{k=1}^{L} w_{ik} w_{jk}}

where S_{i,j} is the similarity between the two blocks of grouped token-sequences i and j, w_{ik} is the number of occurrences of term k in block i, and L is the total number of terms. Once the topic boundaries have been identified, TXTRACTOR assigns each sentence to a document segment. After all sentences have been given weights and assigned to segments, segment ranking and sentence extraction can operate.
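For concreteness, a sketch of this block comparison follows. Block construction and stop-word removal are omitted, and maps of term counts stand in for the w vectors; the method is a direct transcription of the equation above.

    import java.util.*;

    public class BlockSimilarity {
        // Extended Jaccard coefficient over term-count vectors, matching the equation:
        // sum(wi*wj) / (sum(wi^2) + sum(wj^2) - sum(wi*wj)).
        public static double jaccard(Map<String, Integer> blockI, Map<String, Integer> blockJ) {
            double dot = 0, normI = 0, normJ = 0;
            for (int w : blockI.values()) normI += (double) w * w;
            for (int w : blockJ.values()) normJ += (double) w * w;
            for (Map.Entry<String, Integer> e : blockI.entrySet()) {
                dot += (double) e.getValue() * blockJ.getOrDefault(e.getKey(), 0);
            }
            double denominator = normI + normJ - dot;
            return denominator == 0 ? 0 : dot / denominator;  // a dip in similarity suggests a topic boundary
        }
    }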
3.3 Segment Ranking
Once a document is segmented into its main topic areas, TXTRACTOR ranks the document segments based on the scores given to sentences by the sentence-selection heuristics. An example of the segment ranking is shown in Figure 2. High-ranking sentences are added to the summary first. Two sentences from the same segment are not included in the summary (regardless of their ranking) until a sentence from each segment has been included. Once all segments are represented in the summary, the process starts over, adding one sentence from each segment. Remaining ranked sentences are added by segment until the summary-length requirement is met. Once the length requirement is met, the sentences are sorted by the order in which they appeared in the original document and displayed on the screen. Figure 3 shows pseudo code for the segment-ranking routine. Document segmentation, therefore, provides the topic structure for a document within which sentence selection can be utilized to identify the salient topic areas. It is of practical advantage to rank segments so that a user can easily change the desired length of the summary while the ranking routine identifies which segments, represented by their sentences, to add to or drop from the summary.

[Figure 2, reconstructed from the original layout, pairs a text segmentation holding ranked sentences with the resulting sentence ordering: a two-sentence summary would include the sentences ranked 1 & 3, while a five-sentence summary would include the sentences ranked 1, 2, 15, 3, & 4.]

Figure 2 - text segmentation and sentence-selection combined
    Rank segments(array of ranked sentences)
        while (summary length not achieved)
            for (each ranked sentence in array)
                if (sentence segment not already used)
                    if (summary length achieved) break
                    add sentence to summary
                end if
                else
                    add to temp array for recursive call
                end else
            end for
            Rank segments(temp array)
        end while
    end Rank segments

    for (all summary sentences)
        rank sentences by original document order
    end for

Figure 3 - Pseudo code for segment ranking
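A runnable sketch of this routine under our reading of the pseudocode is shown below. The Sentence record and its fields are illustrative, and the recursion is expressed as a loop over the deferred sentences, which has the same effect. This reproduces Figure 2's two-sentence case if the sentence ranked 2 falls in the same segment as the sentence ranked 1.

    import java.util.*;

    public class SegmentRanker {
        // Illustrative sentence record: document position, segment id, heuristic score.
        record Sentence(int position, int segment, double score, String text) {}

        public static List<Sentence> summarize(List<Sentence> sentences, int summaryLength) {
            List<Sentence> ranked = new ArrayList<>(sentences);
            ranked.sort(Comparator.comparingDouble((Sentence s) -> s.score()).reversed());

            List<Sentence> summary = new ArrayList<>();
            List<Sentence> remaining = ranked;
            while (summary.size() < summaryLength && !remaining.isEmpty()) {
                Set<Integer> usedSegments = new HashSet<>();
                List<Sentence> deferred = new ArrayList<>();  // kept for the next round
                for (Sentence s : remaining) {
                    if (summary.size() >= summaryLength) break;
                    if (usedSegments.add(s.segment())) {
                        summary.add(s);      // first sentence from this segment in this round
                    } else {
                        deferred.add(s);     // segment already represented; retry next round
                    }
                }
                remaining = deferred;
            }
            // Restore original document order for display.
            summary.sort(Comparator.comparingInt(Sentence::position));
            return summary;
        }
    }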
4. PRELIMINARY TESTING
As a preliminary test of the performance of the TXTRACTOR summarizer, subjects compared summaries produced by segmentation alone with summaries produced by TXTRACTOR. A length limit of five sentences was imposed on all the summaries.
4.1 Segmented Summaries
The summaries produced by the segmentation-only approach used the same segmenting code as that used in TXTRACTOR. After segments were identified, every word in each segment, except for those on the stop list, was scored based on the tf*idf equation. The two highest-ranking terms in the segment were identified, along with the first occurrence of each term in the segment. The sentence(s) containing those first occurrences were then added to the summary; each segment produced one or two sentences, depending on whether one sentence contained both of the top two keywords. The same procedure was carried out for every segment until there was at least one sentence from every segment in the summary. In cases where there were more than five sentences, only the first five were included in the summary for comparison purposes.
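A sketch of this baseline selection step for a single segment follows, assuming per-term scores have already been computed; the helper names and data shapes are ours, not the baseline's actual code.

    import java.util.*;

    public class SegmentationOnlyBaseline {
        // For one segment: sentences are given as lists of (stemmed, stop-word-filtered) terms.
        // Returns the indices of the sentence(s) holding the first occurrence of each of the
        // segment's two highest-scoring terms; a single index if one sentence holds both.
        public static Set<Integer> pickSentences(List<List<String>> sentences,
                                                 Map<String, Double> termScore) {
            List<String> topTerms = termScore.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(2)
                .map(Map.Entry::getKey)
                .toList();

            Set<Integer> picked = new LinkedHashSet<>();
            for (String term : topTerms) {
                for (int i = 0; i < sentences.size(); i++) {
                    if (sentences.get(i).contains(term)) {  // first occurrence in the segment
                        picked.add(i);
                        break;
                    }
                }
            }
            return picked;  // one or two sentence indices per segment
        }
    }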
4.2 TXTRACTOR Configuration
The configuration settings of the document-segmenting algorithm were kept constant between the TXTRACTOR system and the segmentation-only system. Token-sequences of 20 words were used, and 10 token-sequences were added together to form a block for the similarity comparisons. Blocks were allowed to cross sentence boundaries, and no stemming or noun phrasing was applied in identifying the document segments. Stemming, however, was used in calculating tf*idf in the sentence-selection portion of TXTRACTOR. The segmenting code was allowed to determine how many topic areas the document had instead of being forced to generate boundaries for a predetermined number of topics.
4.3 An Example Document
While not included in the summaries evaluated in the user studies, the article in Figure 4 is a good example of the differences between the TXTRACTOR summaries (referenced by "TXT#" in the figure) and the summaries generated by the segmentation-only approach (referenced by "SEG#" in the figure) [15]. Large asterisks and segment numbers highlight the breaks in the document segments. The first sentence selected by TXTRACTOR is the first sentence in the document, despite its having a 10-point lower tf*idf score than the first segmentation sentence; the first sentence, however, is a very good summarizing sentence. The segmentation approach then selects a second sentence from the first topic area. The two summaries then select the same sentence from segment two to add to their summaries. Later, TXTRACTOR skips over the third topic area, while the segmentation algorithm adds its final two sentences from that topic area; a segmentation summary tries to include sentences from every segment. TXTRACTOR had ranked the two sentences added to the segmentation summary as 50th and 62nd, respectively: the sentences had low scores for sentence length and somewhat low scores for tf*idf. Sentences three and four of the TXTRACTOR summary score highly due to the included cue phrase "thus". The final sentence selected by TXTRACTOR is not rated in the top five best sentences (it is sixth), but because two sentences in the top five come from the first topic area, room in the summary is preserved for a segment not already represented. Thus, the ranking routine includes a sentence from the seventh segment in the summary instead of duplicating sentences in a segment.
4.4 Document Selection
Five documents were deliberately selected from different subject domains, ranging from psychology and sports to arts and science. The subjects of the documents were varied in order to see whether the TXTRACTOR approach had limitations in certain subject domains. Effort was also made to vary the length of the documents: the word counts ranged from 537 up to 13,293. Documents of different lengths were selected so that varied numbers of segments would be created. By including long articles, we hoped to get preliminary clues as to which summarizer prioritized a document's segments best and whether prioritizing segments led to improved summaries. In this experiment, we did not ask the subjects to judge the cohesiveness of the summaries; we tried to focus the user on the information content of the summary, not its cohesiveness. We selected the following five documents to be summarized:
1. Turning Snooping Into Art, by Noah Shachtman, 773 words [23]
2. No. 4 Virginia suffers first loss of season, Game Day Recap, 537 words [20]
3. Nanotech Fine Tuning, by Mark K. Anderson, 654 words [2]
4. Ann Landers, by Ann Landers, 650 words [14]
5. A Primer on Narcissism, by Sam Vaknin, Ph.D., 13,293 words [25]

4.5 Experiment Participants
Five subjects were chosen to compare the TXTRACTOR summary with the summary generated by the segmentation-only approach. All subjects were above the age of 20 and were either completing or had already obtained a bachelor's degree. The subjects were emailed the ten summaries, grouped by original document. Participants were directed to choose the summary that provided the most pertinent information and seemed to be the most useful in general.
1**[ TXT1 The laws protecting software code are stifling creativity, destroying knowledge, and betraying the public trust.
Legal heavy Lawrence Lessig argues it's time to bust the copyright monopoly. In the early 1970s, RCA was experimenting with a new technology for distributing film on magnetic tape - what we would come to call video. SEG1 Researchers were keen not only to find a means for reproducing celluloid with high fidelity but also to discover a way to control the use of the technology. Their aim was a method that could restrict the use of a film distributed on video, allowing the studio to maximize the film's return from distribution. The technology eventually chosen was relatively simple. A video would play once, and when finished, the cassette would lock into place. If a customer wanted to play the tape again, she would have to return it to the video store and have it unlocked.…. They were horrified. They would "never," Feely reported, permit their content to be distributed in that form, because the content - however clever the self-locking tape was - was still insufficiently controlled. How could they know, one of the Disney execs asked Feely, "how many people were going to be sitting there watching" a film? What's to stop someone else from coming in and watching for free? SEG2 We live in a world with "free" content, and this freedom is not an imperfection. We listen to the radio without paying for the songs we hear; we hear friends humming tunes that they have not licensed. We tell jokes that reference movie plots without the permission of the directors. We read our children books, borrowed from a library, without paying the original copyright holder for the performance rights. The fact that content at a particular time may be free tells us nothing about whether using that content is theft. Similarly, in arguing for increasing content owners' control over content users, it's not sufficient to say "They didn't pay for this use." Second, the reason perfect control has not been our tradition's aim is that creation always involves building upon something else. There is no art that
]**2**[
doesn't reuse. And there will be less art if every reuse is taxed by the appropriator. Monopoly controls have been the exception in free societies; they have been the rule in closed societies. TXT2 SEG3 Finally, while control is needed, and perfectly warranted, our bias should be clear up front: Monopolies are not justified by theory; they should be permitted only when justified by facts. If there is no solid basis for extending a certain monopoly protection, then we should not extend that protection. This does not mean that every copyright must prove its value initially. That would be a far too cumbersome system of control. But it does mean that every system or category of copyright or patent should prove its worth. Before the monopoly should be permitted, there must be reason to believe it will do some good - for society, and not just for monopoly holders.
]**3**[
One example of this expansion of control is in the realm of software. SEG4 Like authors and publishers, coders (or more likely, the companies they work for) enjoy decades of copyright protection. Yet the public gets very little in return. The current term of protection for software is the life of an author plus 70 years, or, if it's work-for-hire, a total of 95 years. This is a bastardization of the Constitution's requirement that copyright be for "limited times." By the time Apple's Macintosh operating system finally falls into the public domain, there will be no machine that could possibly run it. The term of copyright for software is effectively unlimited. Worse, the copyright system safeguards software without creating any new knowledge in return. When the system protects Hemingway, we at SEG5
]**4**[
least get to see how Hemingway writes. We get to learn about his style and the tricks he uses to make his work succeed. We can see this because it is the nature of creative writing that the writing is public. There is no such thing as language that conveys meaning while not simultaneously transmitting its words. Software is different: Software gets compiled, and the compiled code is essentially unreadable; but in order to copyright software, the author need not reveal the source code. TXT3 Thus, while the English department gets to analyze Virginia Woolf's novels to train its students in better writing, the computer science department doesn't get to examine Apple's operating system to train its students in better coding.
]**5**[
The harm that comes from this system of protecting creativity is greater than the loss experienced by computer science education. While the creative works from the 16th century can still be accessed and used by others, the data in some software programs from the 1990s is already inaccessible. Once a company that produces a certain product goes out of business, it has no simple way to uncover how its product encoded data. The code is thus lost, and the software is inaccessible. Knowledge has been destroyed. Copyright law doesn't require the release of source code because it is believed that software would become unprotectable. The open source movement might throw that view into doubt, but even if one believes it, the remedy (no source code) is worse than the disease. There are plenty of ways for software to be secured without the safeguards of law. Copy-protection systems, for example, give the copyright holder plenty of control over how and when the software is copied.
]**6**[
If society is to give software producers more protection than they would otherwise take, then we should get something in return. And one thing we could get would be access to the source code after the copyright expires. TXT4 Thus, I would dramatically reduce the safeguards for software - from the ordinary term of 95 years to an initial term of 5 years, renewable once. And I would extend that government-backed protection only if the author submitted a duplicate of the source code to be held in escrow while the work was protected. Once the copyright expired, that escrowed version would be publicly available from the copyright office. Most programmers should like this change. No code lives for 10 years, and getting access to the source code of even orphaned software projects would
]**7**[
benefit all. More important, it would unlock the knowledge built into this protected code for others to build upon as they see fit. Software would thus be like every other creative work - open for others to see and to learn from. There are other ways that the government could help free up resources for innovation. … One context in particular where this could do some good is in orphaned software. Companies often decide that the costs of developing or maintaining software outweigh the benefits. They therefore "orphan" the software by neither selling it nor supporting it. They have little reason, however, to make the software's source code available to others. The code simply disappears, and the products become useless. Software gets 95 years of copyright protection. By the time the Mac OS finally falls into the public domain, no machine will be able to run it. TXT5 But if Congress created an incentive for these companies to donate their code to a conservancy, then others could build on the earlier work and produce updated or altered versions. This in turn could improve the software available by preserving the knowledge that was built into the original code. Orphans could be adopted by others who saw their special benefit.
]**8**[
The problems with software are just examples of the problems found generally with creativity. Our trend in copyright law has been to enclose as much as we can; the consequence of this enclosure is a stifling of creativity and innovation. If the Internet teaches us anything, it is that great value comes from leaving core resources in a commons, where they're free for people to build upon as they see fit. An Innovation Commons was the essence - the core - of the Internet. We are now corrupting this core, and this corruption will in turn destroy the opportunity for creativity that the Internet built.
]**
Figure 4 - original text showing topics and sentences extracted via TXTRACTOR and a segmentation-only approach
4.6 Results
Despite the small size of the experiment, we were able to observe some encouraging responses. Of the 25 comparisons made between summaries, the TXTRACTOR summary was preferred 14 times, the segmentation-only summary was preferred 8 times, and 3 times the summaries were judged to be more or less the same. The subjects therefore preferred the TXTRACTOR summaries 7:4 over the summaries generated by segmentation only. After submitting their responses, the subjects were told which summarizer produced which summary. The participants then volunteered explanations for why they chose the summaries they did. A common sentiment was that the TXTRACTOR summary contained more information, but the sentences sometimes did not flow well. When the sentences flowed well, as judged by the participants, the TXTRACTOR-produced summary was usually preferred. An interesting note is that even though we had not instructed the subjects to assess the readability of the summary, users did not ignore the summary's cohesiveness. It seems, even with indicative summaries, that poor readability can distract a subject from information content.

5. CONCLUSION & FUTURE DIRECTION
Based on our tests, the TXTRACTOR summarizer outperformed the summarizer based solely on segmentation. The hypothesis that ranking segments through the use of established sentence-selection heuristics leads to better text-extracted summaries appears to be promising. There is much that can be done, however, to improve the performance of the summarizer. Future improvements to TXTRACTOR include implementing the local salience method of cohesion analysis [4]. The local salience method is based on the assumption that relevant words and phrases are revealed by a "combination of grammatical, syntactic, and contextual parameters". The original document is parsed to identify each sentence's subjects and predicates, and different weights are then given to sentences based on the part of speech containing the term being analyzed. Experimentation will be conducted on how many parts of speech to parse out of each sentence. Additional research is needed to tune the weights of the sentence-selection methods being used; much research that has been done in this area could be incorporated into our work. In addition, analyzing the discourse context of the sentences should help improve the cohesiveness of the summaries. We are currently planning to conduct more complete experiments and user studies on our combined segmentation and sentence-selection approach to summarization. We are looking to test our summarization approach on a larger scale, similar to that done at the May 1998 SUMMAC conference [1]. Finally, we would like to implement and test TXTRACTOR in different digital library domains such as medical libraries and web pages.

6. ACKNOWLEDGMENTS
We would like to express our gratitude to the NSF Digital Library Initiative-2, "High-performance Digital Library Systems: From Information Retrieval to Knowledge Management," IIS-9817473, April 1999 - March 2002. We would also like to thank William Oliver for his implementation of the TextTiling algorithm and Karina McDonald for her feedback on the summaries.

7. REFERENCES
[1] in TIPSTER Text Phase III 18-Month Workshop, (Fairfax, VA, 1998).
[2] Anderson, M.K. Nanotech Fine Tuning. http://www.wired.com/news/technology/0,1282,494472,00.html.
[3] Aone, C., Okurowski, M.E., Gorlinsky, J. and Larsen, B. A Trainable Summarizer with Knowledge Acquired from Robust NLP Techniques. in Maybury, M.T. ed. Advances in Automatic Text Summarization, The MIT Press, Cambridge, 1999, 71-80.
[4] Boguraev, B. and Kennedy, C., Salience-based Content Characterization of Text Documents. in Proceedings of the Workshop on Intelligent Scalable Text Summarization at the ACL/EACL Conference, (Madrid, Spain, 1997), 2-9.
[5] Carbonell, J. and Goldstein, J., The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. in SIGIR, (Melbourne, Australia, 1998), 335-336.
[6] Edmundson, H.P. New Methods in Automatic Extracting. in Maybury, M.T. ed. Advances in Automatic Text Summarization, The MIT Press, Cambridge, 1969, 23-42.
[7] Firmin, T. and Chrzanowski, M.J. An Evaluation of Automatic Text Summarization Systems. in Maybury, M.T. ed. Advances in Automatic Text Summarization, The MIT Press, Cambridge, 1999.
[8] Goldstein, J., Kantrowitz, M., Mittal, V. and Carbonell, J., Summarizing Text Documents: Sentence Selection and Evaluation Metrics. in 22nd International Conference on Research and Development in Information Retrieval, (1999).
[9] Hearst, M.A. Segmenting Text into Multi-Paragraph Subtopic Passages. Computational Linguistics, 23(1), 33-64.
[10] Hovy, E. and Lin, C.-Y. Automated Text Summarization in SUMMARIST. in Maybury, M.T. ed. Advances in Automatic Text Summarization, The MIT Press, Cambridge, 1999, 81-94.
[11] Jurafsky, D. and Martin, J.H. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, Upper Saddle River, 2000.
[12] Kupiec, J., Pedersen, J. and Chen, F., A Trainable Document Summarizer. in Proceedings of the 18th ACM-SIGIR Conference, (1995), 68-73.
[13] Lam-Adesina, A.M. and Jones, G.J.F., Applying Summarization Techniques for Term Selection in Relevance Feedback. in SIGIR, (New Orleans, Louisiana, USA, 2001), 1-9.
[14] Landers, A. Ann Landers. http://www.washingtonpost.com/wp-dyn/articles/A62823-2002Jan4.html.
[15] Lessig, L. May the Source Be With You. Wired Magazine, 9.12 (December 2001). http://www.wired.com/wired/archive/9.12/lessig.html.
[16] Luhn, H.P. The Automatic Creation of Literature Abstracts. in Maybury, M.T. ed. Advances in Automatic Text Summarization, The MIT Press, Cambridge, 1958, 15-22.
[17] Mani, I. and Maybury, M.T. (eds.). Advances in Automatic Text Summarization. The MIT Press, Cambridge, 1999.
[18] Nomoto, T. and Matsumoto, Y., A New Approach to Unsupervised Text Summarization. in SIGIR, (New Orleans, LA, USA, 2001), 26-34.
[19] Radev, D.R., Jing, H. and Budzikowska, M., Centroid-based summarization of multiple documents: sentence extraction, utility-based evaluation, and user studies. in ANLP/NAACL Workshop on Summarization, (Seattle, WA, 2000).
[20] Recap. No. 4 Virginia suffers first loss of season. http://sports.espn.go.com/ncaa/mbasketball/recap?gameId=220050189, 2002.
[21] Sakai, T. and Jones, K.S., Generic Summaries for Indexing in Information Retrieval. in SIGIR, (New Orleans, Louisiana, USA, 2001), 190-198.
[22] Sanderson, M., Accurate user directed summarization from existing tools. in Conference on Information and Knowledge Management, (Bethesda, MD, USA, 1998), 45-51.
[23] Shachtman, N. Turning Snooping Into Art. http://www.wired.com/news/culture/0,1284,49439,00.html.
[24] Teufel, S. and Moens, M., Sentence Extraction as a Classification Task. in Workshop on Intelligent Scalable Text Summarization, ACL/EACL Conference, (Madrid, Spain, 1997), 58-65.
[25] Vaknin, S. A Primer on Narcissism. http://www.mentalhelp.net/poc/view_doc.php/type/doc/id/419.