Web Page Summarization for Handheld Devices: A Natural Language Approach

Hassan Alam, Rachmat Hartono, Aman Kumar, Fuad Rahman§, Yuliya Tarnikova and Che Wilcox
BCL Technologies Inc.
[email protected]

§ Corresponding author. The authors gratefully acknowledge the support of a Small Business Innovation Research (SBIR) award under the supervision of the US Army Communications-Electronics Command (CECOM).

Abstract

Summarization of web pages is a very interesting topic from both academic and commercial points of view. Academically, it is challenging to create a summary of a document (e.g. a web page) that is highly structured and has multimedia components in it. From the commercial point of view, it is advantageous to summarize web pages so that they can be viewed on small-display devices such as PDAs and cell phones. Summarization not only makes web browsing and navigation easier, it also makes browsing faster, as complete web pages need not be downloaded before viewing. In this paper, a novel combination of natural language and non-natural language based summarization techniques is used to automatically generate an intelligent re-authored display of web pages in real time.
1. Introduction

Document summarization is a discipline that has attracted considerable academic attention for many years. The document summarization community has its roots primarily in computational linguistics and information retrieval. Historically, this research has focused on summarizing "flat" documents, i.e. documents with no structure that are composed entirely of textual information. However, the emergence of HTML/XML and related hyperlink-based documents changed the way people regard documents. It also revolutionized the way documents are created and shared, especially on the World Wide Web. Moreover, a variety of multimedia objects, such as graphics, audio, video and Flash, are often embedded in "web" pages. In addition, the abundant and sometimes strict structures that are now routinely imposed on documents have made the task of document summarization very difficult. This short discussion demonstrates why web document summarization is treated differently from the summarization of other types of documents. Realizing this, many researchers have turned their attention to solving the problem of web document summarization. In this paper, a novel approach to summarizing web pages based on structural analysis, segmentation and natural language processing is presented.
2. Background

The research on web document summarization can be broadly separated into two parts: approaches that explicitly use natural language processing (NLP) techniques based on computational linguistics [1,2], and approaches that use non-NLP techniques [3,4]. NLP-based approaches to generic document summarization can be subdivided into three categories: statistical methods [5,6], knowledge-based methods [7] and methods employing data mining techniques [8]. A comprehensive bibliography can be found in [9].
3. Web page summarization for handheld devices

Web browsing on a handheld device can be a very frustrating task. The principal reason for this is the limited display area available on these devices. Typically, a PDA (such as a Palm V) has a resolution of 160×160 pixels, whereas a cell phone might offer as little as 120×80. PocketPCs have much higher resolution, but their price point is still prohibitive. Using these devices to browse or to find information on the web is tiring. On top of that, since a page is downloaded as a whole before viewing, browsing can take up significant time. Summarization is a very attractive solution in these cases.
3.1 Web page data structure

If HTML is used to compose a web page, the data (the "content") is arranged using an HTML data structure. Depending on which version of HTML is used, small variations in the tag set are possible. The content is extracted into a local data structure that mirrors the HTML data structure.
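As a minimal sketch of this extraction step (the paper does not specify an implementation, so Python and its standard html.parser module are assumed here, and the Node class is purely illustrative), a local tree can be built that mirrors the nesting of HTML tags and stores the text and attributes found at each node:

    from html.parser import HTMLParser

    # Tags that never have a closing tag, so the builder should not descend into them.
    VOID_TAGS = {'br', 'img', 'hr', 'input', 'meta', 'link', 'area', 'base',
                 'col', 'embed', 'source', 'wbr'}

    class Node:
        def __init__(self, tag, attrs=None, parent=None):
            self.tag = tag                 # HTML tag name, or 'root' for the top of the tree
            self.attrs = dict(attrs or [])
            self.parent = parent
            self.children = []             # child Nodes in document order
            self.text = []                 # text fragments stored directly under this node

    class TreeBuilder(HTMLParser):
        """Mirrors the nesting of HTML tags into a local tree of Node objects."""
        def __init__(self):
            super().__init__()
            self.root = Node('root')
            self.current = self.root

        def handle_starttag(self, tag, attrs):
            node = Node(tag, attrs, parent=self.current)
            self.current.children.append(node)
            if tag not in VOID_TAGS:       # only descend into tags that will be closed
                self.current = node

        def handle_endtag(self, tag):
            if self.current.parent is not None:   # climb back up; tolerant of unbalanced HTML
                self.current = self.current.parent

        def handle_data(self, data):
            if data.strip():
                self.current.text.append(data.strip())

    def build_tree(html_source):
        builder = TreeBuilder()
        builder.feed(html_source)
        return builder.root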
3.2 Content analysis

Content analysis aims to decompose a web document based on the structure extracted from the tree hierarchy. The goal of this stage is to analyze the type of content in each node and pre-select a processing style for that node. Scanning the tree from right to left and bottom to top, for each lowest-level node (where the content is stored in the data structure), the number of non-link words, linked words and "form"-related keywords is calculated. Based on these attributes, the content type and a content weight (which reflects the "importance" of the content) are computed. These are used to select the processing mode for each node so that the re-authored content can make the best use of the display capability of the device.
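A sketch of this counting step might look as follows, building on the Node tree above. The exact content types and the weighting formula used by the system are not given in the paper, so the classification and weight below are illustrative assumptions only:

    FORM_TAGS = {'form', 'input', 'select', 'option', 'textarea', 'button'}

    def count_attributes(node):
        """Return (non_link_words, linked_words, form_items) for the subtree rooted at node."""
        own_words = sum(len(fragment.split()) for fragment in node.text)
        non_link = 0 if node.tag == 'a' else own_words
        linked = own_words if node.tag == 'a' else 0
        forms = 1 if node.tag in FORM_TAGS else 0
        for child in node.children:
            n, l, f = count_attributes(child)
            if node.tag == 'a':            # words below an anchor count as linked text
                l, n = l + n, 0
            non_link, linked, forms = non_link + n, linked + l, forms + f
        return non_link, linked, forms

    def content_type_and_weight(node):
        """Illustrative content type and weight; the paper's exact scheme is not given."""
        non_link, linked, forms = count_attributes(node)
        if forms:
            ctype = 'form'
        elif non_link + linked == 0:
            ctype = 'other'
        elif linked > non_link:
            ctype = 'link'
        else:
            ctype = 'text'
        weight = non_link + 0.5 * linked   # assumption: plain text weighted above link text
        return ctype, weight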
3.3 Content processing for re-authoring

The aim of this research is to re-author the content in the format that is best suited to the target display device. The primary parameters of concern are the display area, resolution, graphics capability, display memory and color display capability. Each node with content can be processed in one of three modes: verbatim, transcode and summarize.

3.3.1 Verbatim. In verbatim mode, the content is not processed any further and is displayed on the target device 'as-is'. This is often appropriate when the content is small in size or when the content is a well-formed list. The original tags are replaced with device-specific tags so that the display on the target device remains the same as in the original rendering.

3.3.2 Transcode. In transcode mode, the content is reflowed, keeping in mind the display area of the device and its memory capability. Transcoding replaces HTML tags with suitable device-specific tags (HDML, WML, etc.) without changing any content. However, it does change the way the content is displayed and viewed.
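To illustrate tag substitution of this kind, the sketch below rewrites HTML tags in a fragment using a lookup table while leaving the text untouched. The HTML_TO_WML mapping is a made-up example; the paper does not list the actual tag tables used for HDML or WML:

    import re

    # Assumption: a purely illustrative HTML-to-WML tag mapping.
    HTML_TO_WML = {
        'h1': 'big', 'h2': 'big', 'b': 'b', 'i': 'i',
        'div': 'p', 'span': 'p', 'center': 'p',
    }

    def transcode(html_fragment):
        """Swap HTML tags for device-specific tags without changing the content."""
        def swap(match):
            slash, tag = match.group(1), match.group(2).lower()
            target = HTML_TO_WML.get(tag, 'p')   # default to a paragraph tag
            return '<%s%s>' % (slash, target)
        return re.sub(r'<(/?)(\w+)[^>]*>', swap, html_fragment)

    # Example: transcode('<div><h1>News</h1>...</div>') -> '<p><big>News</big>...</p>'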
3.3.3 Summarize. In summarize mode, the content is summarized using NLP and non-NLP techniques. Details of this summarization process are provided later.
3.4 Node merging and segmenting

The nodes, as discussed so far, may not be logical enough to be treated as "viewing blocks". Sometimes the nodes are too small or too large, or they are placed in the neighborhood of similar nodes. Where a node neighbors other nodes with a similar type of content, the nodes are merged. Where the content is too large, it is split into smaller coherent segments.
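A rough sketch of this merge-and-split pass is given below. The block representation and the MIN_WORDS/MAX_WORDS thresholds are illustrative assumptions, not the system's actual values:

    MIN_WORDS, MAX_WORDS = 10, 300     # illustrative thresholds

    def merge_and_segment(blocks):
        """blocks: list of (content_type, text) pairs in reading order."""
        merged = []
        for ctype, text in blocks:
            if merged and (merged[-1][0] == ctype or len(text.split()) < MIN_WORDS):
                prev_type, prev_text = merged[-1]      # same type, or too small to stand alone
                merged[-1] = (prev_type, prev_text + ' ' + text)
            else:
                merged.append((ctype, text))
        segmented = []
        for ctype, text in merged:
            if len(text.split()) <= MAX_WORDS:
                segmented.append((ctype, text))
                continue
            chunk, count = [], 0                       # split oversized blocks at sentence ends
            for sentence in text.split('. '):
                chunk.append(sentence)
                count += len(sentence.split())
                if count >= MAX_WORDS:
                    segmented.append((ctype, '. '.join(chunk)))
                    chunk, count = [], 0
            if chunk:
                segmented.append((ctype, '. '.join(chunk)))
        return segmented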
3.5 Representing the complete web page

Once merging and segmentation are completed, it is possible to recreate the original web page by combining these merged and segmented nodes. The result is a combination of content that is shown verbatim, transcoded or summarized. In verbatim mode, the complete content is already shown on the top page, but where a node is transcoded or summarized, the content is shown in two levels. Depending on the processing mode that is automatically selected, the first level contains either labels (short phrases) describing the main theme of the content, or a short summary. In either case, each of these labels or summaries is linked to the complete content by a hyperlink.

3.5.1 "To summarize or not to summarize". Not all segments of a web page are good candidates for NLP summarization. For example, a bulleted list or a list of links is not a good candidate. The decision is made by looking at the "link versus text" patterns in the content. Initially, each sentence is categorized as "Image", "Text", "Link" or "Others". Then section headings and titles are detected. Based on this primary classification, blocks of similar contiguous content (blocks of text, images, links or others) are determined. If, at the end of this merging process, a block still contains multiple blocks of different types, it is separated (forced segmentation) into multiple blocks. The merging also takes care not to create blocks that are too small and not to separate small links or images from related text (for example, an image might be used as a prefix or as an alignment aid). For each block that is broadly of type "Text", an NLP summary is created, while for blocks of type "Link", "Image" or "Others", non-NLP labels are generated.

3.5.2 Creating a label. For creating a label, non-NLP techniques are adequate. Visual clues are used to detect the most important segment of the content, the 'label'. These visual clues include font size, boldness, underlining, italics, heading weight, phrase size (number of words) and link properties. The label is linked to the rest of the content by a hyperlink; the full content is simply transcoded.
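The following sketch shows one way the summarize-or-label decision and the visual-clue scoring could be realized. The attribute names and the scoring weights are illustrative assumptions; the paper does not give the actual weighting of the visual clues:

    def processing_for(block_type):
        """Only predominantly textual blocks are candidates for NLP summarization (Section 3.5.1)."""
        return 'nlp_summary' if block_type == 'Text' else 'label'

    def label_score(phrase):
        """Score a candidate phrase by simple visual-prominence clues (illustrative weights)."""
        score = float(phrase.get('font_size', 0))
        score += 2.0 if phrase.get('bold') else 0.0
        score += 1.0 if phrase.get('underline') else 0.0
        score += 1.0 if phrase.get('italic') else 0.0
        score += 3.0 * phrase.get('heading_weight', 0)
        score -= 0.1 * max(0, len(phrase['text'].split()) - 6)   # prefer short phrases
        return score

    def create_label(phrases):
        """Pick the most visually prominent phrase of the block as its label."""
        return max(phrases, key=label_score)['text']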
3.5.3 Creating a summary

NLP techniques are employed to create short summaries of the content. Each summary is then linked to the rest of the content via a hyperlink. Lexical chains are used to create the summary. 'Cohesion' is a way of connecting different parts of a text into a single theme. A lexical chain is a list of semantically related words, constructed by the use of co-reference, ellipsis and conjunction; it captures the relationship between words that tend to co-occur in the same lexical context. An example is the relationship between the words "students" and "class" in the sentence: "The students are in class". For every sentence in the node (the "content"), all nouns are extracted using a part-of-speech tagger (the Brill POS tagger [10]). Then all possible synonym sets that each noun could be part of are determined. For every synonym set, a lexical chain is created by utilizing a list of words related to these nouns by WordNet relations [11]. Once the lexical chains are created, a score for each chain is calculated using the following scoring criterion:

Score = ChainSize × HomogeneityIndex

where:

ChainSize = Σ over all chain entries ch(i) in the text of w(ch(i)), representing how large the chain is, with each member contributing according to how related it is;

w(ch(i)) = relation(ch(i)) / (1 + distance(ch(i)));

relation(ch(i)) = 1 if ch(i) is a synonym, 0.7 if ch(i) is an antonym, and 0.4 if ch(i) is a hypernym, holonym or hyponym;

distance(ch(i)) = the number of intermediate nodes in the hypernym graph for hypernyms and hyponyms, and 0 otherwise;

HomogeneityIndex = 1.5 − (Σ over all distinct chain entries ch(i) in the text of w(ch(i))) / ChainSize, representing how diverse the members of the chain are.

To ensure that there are no duplicate chains and that no two chains overlap, only the lexical chain with the highest score is kept for every word and the rest are discarded. Of the remaining chains, "strong chains" are determined by applying the following criterion:

Score >= AverageScore + 0.5 × StandardDeviation

While generating the summary, sentences containing strong chains are cumulatively added until no sentence with a 'strong' chain is left. Each sentence is rated as follows:

(Σ over all chains passing through the sentence, with ch an entry in the chain that comes from the sentence, of w(ch) × Score + 2 × Σ over all chains starting in the sentence of w(ch) × Score) / sentence length

The final summary is formed by adding sentences starting with the highest rating, until no sentence is left or the length of the summary reaches the target length. The target length is often related to the length of the original content, but it can also be set empirically by the user depending on the display area of the device being used.
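As a rough sketch of how these scoring criteria might be implemented: chain construction itself (POS tagging and the WordNet lookup) is assumed to have been done already, and each chain entry is assumed to be a dictionary with 'word', 'relation', 'distance' and 'sentence' keys. This data layout is illustrative, not the system's actual structure:

    import statistics

    RELATION_WEIGHT = {'synonym': 1.0, 'antonym': 0.7,
                       'hypernym': 0.4, 'holonym': 0.4, 'hyponym': 0.4}

    def w(entry):
        return RELATION_WEIGHT[entry['relation']] / (1.0 + entry['distance'])

    def chain_score(chain):
        chain_size = sum(w(e) for e in chain)                 # how large the chain is
        distinct = {e['word']: e for e in chain}.values()     # one entry per distinct word
        homogeneity = 1.5 - sum(w(e) for e in distinct) / chain_size
        return chain_size * homogeneity

    def strong_chains(chains):
        """Keep chains scoring at least the mean plus half a standard deviation."""
        if not chains:
            return []
        scores = [chain_score(c) for c in chains]
        threshold = statistics.mean(scores) + 0.5 * statistics.pstdev(scores)
        return [c for c, s in zip(chains, scores) if s >= threshold]

    def sentence_rating(sentence_index, sentence_length, chains):
        """Chains passing through the sentence count once; chains starting in it count twice more."""
        total = 0.0
        for chain in chains:
            score = chain_score(chain)
            in_sentence = [e for e in chain if e['sentence'] == sentence_index]
            if not in_sentence:
                continue
            total += sum(w(e) for e in in_sentence) * score
            if min(e['sentence'] for e in chain) == sentence_index:   # chain starts here
                total += 2.0 * sum(w(e) for e in in_sentence) * score
        return total / sentence_length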
4. Example

Figure 1 shows a web site related to US Congressional Documents and Debates. Figure 2 shows how the system segments it into various "blocks" of information. As can be seen, the page has a collection of images, links and textual content. Some of these blocks are the result of merging smaller blocks derived from the HTML data structure.

Figure 1: http://memory.loc.gov/ammem/amlaw/lawhome.html
Figure 2: Segmentation

Figure 3 shows the overall re-authored display of the web site as seen on a Palm V PDA. The first entry is an example of the verbatim mode of re-authoring, and the prefix "[Links]" indicates that the content in this entry is a link or a collection of links. The second entry has the prefix "[Image]", indicating that the content has image(s) embedded in it. The third entry has the prefix "[Navi]", an abbreviation for "navigation", indicating that the content is primarily a collection of links for navigating the site. The fourth entry is the main body of the web page and is prefixed "[Story]"; here the content has been summarized using NLP techniques. The fifth and sixth entries are again links. So in this example, only the fourth entry has been summarized using NLP techniques, and the rest of the entries have been generated using non-NLP techniques. The decision of whether to summarize using NLP techniques or to use non-NLP approaches is made automatically by the system. Figure 4 shows the detailed content of the second entry in the re-authored display. The display mode is transcode; note that the image has been resized to fit the display area. Figure 5 shows the content of the third entry, which is made up of navigation links. Figure 6 shows part of the detailed content that was summarized using NLP techniques. As can be seen from Figure 2, this content is quite long and must be scrolled to view in its entirety. Figure 7 and Figure 8 show details of the fifth and sixth entries respectively.
Figure 3: Overall re-authored display of the web site

Figure 4: Expansion of the second entry

Figure 5: Expansion of the navigation content

Figure 6: Full text of the summarized content

Figure 7: Details of the fifth entry

Figure 8: Details of the sixth entry

5. Discussion

Previous attempts at creating summaries of web pages concentrated on one approach only, either NLP techniques or non-NLP techniques. The reasoning for this is simple. It is easy either to treat a document as completely flat by collapsing the multi-level structure into a single level, or to use the multi-level structure directly to create a multi-level summary with non-NLP techniques based primarily on visual clues. Both approaches have their advantages and disadvantages. The advantage of an NLP-based approach is that it generates a cleaner output of logically related sentences, which makes more sense to the reader; the disadvantage is that it can only be applied to a flat document whose content is entirely textual. In addition, its processing requirements are demanding for a handheld device. The non-NLP approaches, on the other hand, are advantageous in the sense that they are fast and capture the structure of the original document. The disadvantage is that the entries in the Table of Contents (TOC) approach [12] are crudely generated, based on visual clues, and may therefore become unintelligible and sometimes misleading. The proposed approach achieves a compromise by using both approaches simultaneously. By adopting this approach, the overall summarization of the web pages retains the original multi-level structure and preserves the association of multimedia objects and images embedded in the text, while at the same time producing high-quality, intelligent natural language summaries of the textual components of the web pages. Moreover, since only a fraction of the web content is actually textual, summarization is faster because only the segments that are good candidates for natural language summarization are processed. In this sense, the approach tries to exploit the best of both worlds.

6. Further Work

The research reported here is very much a work in progress, and some issues with this approach are yet to be solved. For example, no JavaScript interpreter is integrated within the current system, although we have started working on one separately. The same is true for image maps, which are collages of embedded images with links. Currently, the system is unable to process these parts of web pages. We are also working on a geometry-based parser for creating a visual representation of the web pages as part of the summarization process. Two issues directly related to this are the association of "related blocks" of content and the special treatment of specific constructs such as lists.

7. Conclusion

This paper has presented a novel approach to web page summarization using a combination of NLP and non-NLP techniques. It has been shown that such an approach produces high-quality, intelligent summaries of web pages, allowing fast and efficient web browsing on small-display handheld devices such as PDAs and cell phones.

References
[1] Berger, A. and Mittal, V. "OCELOT: A System for Summarizing Web Pages". Research and Development in Information Retrieval, pages 144-151, 2000.
[2] Buyukkokten, O., Garcia-Molina, H., and Paepcke, A. "Seeing the Whole in Parts: Text Summarization for Web Browsing on Handheld Devices". Proc. of the Tenth Int. World-Wide Web Conference, 2001.
[3] H. Alam, R. Hartono and A. Rahman. "Extraction and Management of Content from HTML Documents". Book chapter in "Web Document Analysis: Challenges and Opportunities". World Scientific Series in Machine Perception and Artificial Intelligence, 2002. In press.
[4] A. Rahman, H. Alam, R. Hartono and K. Ariyoshi. "Automatic Summarization of Web Content to Smaller Display Devices". 6th Int. Conf. on Document Analysis and Recognition, ICDAR01, pages 1064-1068, 2001.
[5] Knight, K., and Marcu, D. "Statistics-Based Summarization - Step One: Sentence Compression". AAAI/IAAI, pages 703-710, 2000.
[6] Witbrock, M., and Mittal, V. "Ultra-Summarization: A Statistical Approach to Generating Highly Condensed Non-Extractive Summaries". Research and Development in Information Retrieval, pages 315-316, 1999.
[7] McKeown, K., Barzilay, R., Evans, D., Hatzivassiloglou, V., Schiffman, B., and Teufel, S. "Columbia Multi-Document Summarization: Approach and Evaluation". Proc. of the Workshop on Text Summarization, ACM SIGIR Conference, 2001.
[8] McKeown, K., Klavans, J., Hatzivassiloglou, V., Barzilay, R., and Eskin, E. "Towards Multidocument Summarization by Reformulation: Progress and Prospects". AAAI/IAAI, pages 453-460, 1999.
[9] Columbia University Summarization Resources (http://www.cs.columbia.edu/~hjing/summarization.html) and Okumura-Lab Resources (http://capella.kuee.kyoto-u.ac.jp/index_e.html).
[10] Brill, E. "A Simple Rule-based Part of Speech Tagger". Proc. of the 3rd Conference on Applied Natural Language Processing, 1992.
[11] WordNet - A lexical database for the English language. http://www.cogsci.princeton.edu/~wn/.
[12] A. Rahman and H. Alam. "Challenges in Web Document Summarization: Some Myths and Reality". Proc. Document Recognition and Retrieval IX, Electronic Imaging Conference, SPIE 4670-27, 2002.