NowOnWeb: News Search and Summarization
Javier Parapar, José M. Casanova, and Álvaro Barreiro
IRLab, Department of Computer Science, University of A Coruña, Campus de Elviña s/n, 15071, A Coruña, Spain
{javierparapar,jcasanova,barreiro}@udc.es
http://www.dc.fi.udc.es/irlab
1 Introduction
In recent years the number of news sites available on the Internet has grown enormously, overwhelming the manual techniques used to cover all this information. In this situation, Information Retrieval (IR) strategies have proved a successful solution. In this context we present a system that manages news articles from different online sources and provides search, redundancy detection and summarization. We call it NowOnWeb¹; it is based on our previous research and solutions in the IR field. In this paper our aim is to introduce the general architecture of our news system and to examine in more depth the three main research problems for which we provide solutions.
2 System Architecture
[Figure 1: architecture diagram. Labels recoverable from the figure: Crawler (Nutch) over "My company" intranet and WWW sources; Indexer (Lucene); NOW Index Manager (incremental index composition, incremental daily index with temporal window); NOW Extractor (news content extraction); retrieval model; NOW Grouping (redundancy, news grouping); NOW Summarizer (summary generation); user query over the document collection; relevant news summaries returned to the user.]
Fig. 1. System architecture overview
¹ A version with Spanish and international news is available at http://irlab.dc.fi.udc.es
Figure 1 is a general overview of our system architecture. It shows the main components of the application and how they interact.
3 Research Issues
NowOnWeb offers new solutions to the three main problems of an IR news system: news recognition and extraction, redundancy detection, and summary generation. We introduce each problem and our solution below.

News Recognition and Extraction: Previous works [1, 2] provide generic methods for automated data extraction, but they are computationally expensive. We developed a computationally efficient alternative based on a few heuristics about the structure of the news body, title and images, with complexity linear in the document length.

Redundancy Detection: Our aim here was to spare the user the burden of reading repeated news. As a solution we created a new algorithm that computes redundancy using the cosine distance [3] over a previously ranked collection of relevant news items represented as vectors.

Summary Generation: For this problem we extract the sentences of an article that are relevant [4] to the user query. The selected sentences are joined to form a coherent summary no longer than 30% of the original size [5].

Acknowledgements. The work reported here was cofunded by the "Secretaría de Estado de Universidades e Investigación" and FEDER funds under the project MEC TIN2005-08521-C02-02, and by "Xunta de Galicia" under project PGIDIT06PXIC10501PN. The authors also want to acknowledge the additional support of the "Galician Network of NLP&IR" (2006/23).
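As an illustration of the redundancy detection and summary generation steps described in Section 3, the following is a minimal sketch and not the NowOnWeb implementation. It assumes raw term-frequency vectors, a hypothetical similarity threshold of 0.8, a size budget measured in characters, and a naive regular-expression sentence splitter; the function names (`filter_redundant`, `summarize`) and all parameter values are ours, chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_redundant(ranked_docs, threshold=0.8):
    """Walk a relevance-ranked list and keep a document only if it is
    not too similar to any document already kept (threshold is assumed)."""
    kept = []
    for doc in ranked_docs:
        vec = Counter(tokenize(doc))
        if all(cosine(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((doc, vec))
    return [doc for doc, _ in kept]


def summarize(article, query, ratio=0.3):
    """Pick the sentences most similar to the query, within a budget of
    `ratio` times the article length in characters, in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", article) if s.strip()]
    query_vec = Counter(tokenize(query))
    scores = [cosine(Counter(tokenize(s)), query_vec) for s in sentences]
    budget = ratio * len(article)
    chosen, used = [], 0
    # Visit sentences from most to least query-relevant.
    for i in sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True):
        if scores[i] == 0:
            continue  # skip sentences that share no term with the query
        cost = len(sentences[i]) + (1 if chosen else 0)  # +1 for the joining space
        if used + cost <= budget:
            chosen.append(i)
            used += cost
    return " ".join(sentences[i] for i in sorted(chosen))


docs = [
    "Obama wins the election",
    "Obama wins the election today",
    "Stock markets fall sharply",
]
unique_docs = filter_redundant(docs)

article = (
    "The election results were announced on Monday. "
    "The weather was sunny in the capital. "
    "Many voters waited for hours. "
    "Analysts expect markets to react. "
    "Officials praised the turnout across regions."
)
summary = summarize(article, "election results")
```

In this toy run the second headline is dropped as a near-duplicate of the first, and the summary keeps only the sentence that mentions the query terms. A real system would likely use weighted vectors (for example TF-IDF) and a tuned threshold; the text above does not specify either.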
References
1. D. C. Reis, P. B. Golgher, A. S. Silva and A. F. Laender: Automatic web news extraction using tree edit distance. WWW '04: Proceedings of the 13th international conference on World Wide Web (2004) 502–511
2. Valter Crescenzi and Giansalvatore Mecca: Automatic information extraction from large websites. Journal of the ACM 51(5) (2004) 731–779
3. Yi Zhang, Jamie Callan and Thomas Minka: Novelty and redundancy detection in adaptive filtering. SIGIR '02: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval (2002) 81–88
4. James Allan, Courtney Wade and Alvaro Bolivar: Retrieval and novelty detection at the sentence level. SIGIR '03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval (2003) 314–321
5. Eduard Hovy: Text Summarization. In R. Mitkov (Ed.): The Oxford Handbook of Computational Linguistics, chapter 32 (2005) 583–598