Automatic summarization of web content to smaller ... - CiteSeerX

2 downloads 0 Views 405KB Size Report
Phone) and EZweb (KDDI) formats. Live demonstration will be organized for more elaborate understanding of the system during the presentation of the paper.
Automatic Summarization of Web ConUent to Smaller Displlay Devices A. F. R. Rahman, H. Alam,R. Hartono and K. Ariyoshi

Document Analysis and Recognition Team (DART) BCL Computers Inc. 990 Linden Drive, Suite # 203, Santa Clara, CA 95050, USA. Tel: + I 408 557 5279, Fax: +I 408 249 4046, Email: [email protected]

Abstract

2. Aim of this work Web documents usually have complicated layouts and the overall information content can be huge. All these documents are designed for viewing in large screen devices, such as a computer monitor. In recent times, a large number of small screen portable devices, such as Personal Digital Assistants (PDAs) and cellular phones, have been made available f o r mobile browsing. Viewing a web page originally written for large screen devices using these very small screen devices can be extremely cumbersome. This paper discusses this issue of small viewing form factor of electronics devices from the perspective of web browsing and proposes an approach to automatically summarize and transform web documents into a meaningful, readable and above all, browsable format.

1. Introduction Recently there is a proliferation of communication and computing devices with smaller display devices, such as cellular phones and Personal Digital Assistants (PDAs). The conversion of standard web pages to be used with these new devices is a challenge, as the amount of content within each of these pages is too large to be adequately displayed in these smaller display areas. Any web document (HTMUXML) is multi-layered and multi-directional. In order to analyze these documents, the most important thing is to analyze its structure. Each such document is written with a structure in mind, which when incorporated with context, produces the direction into which the reader is guided. This observation shows that extracting the structure of a web document can help in analyzing such documents successfully.

0-7695-1263-11011$10.000 2001 IEEE

1064

This work is concerned with the following:

1. Structural Analysis: Analysis of the structure of a web document 2. Decomposition: Decomposing a web document based on the extracted structure

3.

Contextual Analysis: Once decomposed into constituent sub-document!;, analyze each document for its context.

4.

Summarization (Labeling): This contextual analysis of each sub-document produces a summarization, which can be expressed as a sentence or sub-sentence (a label) indicating the content of this sub-document.

5.

Table of Content (TOC): Since each of these subdocuments are summarized with an “intelligent” summary, these can be put together as a summary of the whole document, giving rise to a Table of Content (TOC). Each entry into this TOC points to specific sub-documents within each document.

6. Order of TOC The order in which the TOC is extracted depends on the “natural” order of the subdocuments extracted from the main document. However, this “natural” order is often misleading as the main “interesting” or “important” message of the document can be lost in the TOC. So it is important to analyze the content of each sub-document and display the TOC by re-ordering them based on their relative “importance”.

3 Overview of the Process In the following discussion, each page of a web document (web page) is treated as a document. Whenever

a new link is followed to go to a new page, the whole process of (1) to (6) requires to be repeated, as in Figure1. So selecting an entry within that TOC that points to another page generates another set of TOC. So the cycle is (A)->(B)->(A) etc.

Input : Source of a site

Generate

Figure-]: Method of web page summarization.

If one of the items from the TOC that points to a sub-document within that page is called. Then the link points directly to that sub-document and no TOC generation is required. For example, (A) is processed to produce (C) via (B), then when any item from (C) is called, it calls (B), so that the cycle is now (A)->(B)>(C)->(B) etc. (Figure-2).

I

Generate TOC

I

(C)

Generate Summary

(B)

Figure-2: Continuity of summarization in a hypertext con text.

So, in this context, the following can be visualized: The Input: The input to this system can be any web document.

The Output: Each page of the web document is decomposed using structural analysis, a summary (label) is generated and the TOC displayed after proper reordering.

1065

4 Description of the Process The normal web pages are also very difficult to read on devices other than a full sized monitor because of their complicated layout. When viewing these pages in devices with smaller viewing screen, it is required to summarize and redisplay such pages so as to make the task of viewing them more practical. The proposed system here achieves precisely this. The target devices for this implementation are the handheld devices such as cell phones and PDAs. One way of solving this problem is to analyze the structure of these pages in order to create a multi dimensional representation of the two dimensional layout of the web page. This leads to a multiple level abstraction of the content of a specific page. At the highest level of abstraction, a summary of the whole page, properly labeled is displayed. As the abstraction levels are traversed, more and more detail about the page is revealed, and at the lowest level, the complete content of the original page is accessible. This reasoning leads to the formation of a layered tree structure, where each level of the tree corresponds to a level of abstraction of the Page.

In general, the content of a page is separated into constituent ‘objects’, the definition of these objects depends on the specific types of classes of elements they encompass [I]. Each such object, therefore, demonstrates specific properties that distinguish them from each other. These properties also provide subtle clues as to their importance within the whole page. Each of these objects is then analyzed in turn and each is given a specific label describing the essence of that object [2]. Part of this analysis exploits stochastic grammar, which is a contextfree grammar defining syntactic structures by means of a system of rules dealing with specific probabilities and associated rankings that correlate with the application probabilities of the rules. Stochastic grammar, initially developed for study of linguistics, provides a very useful tool to analyze and understand content of these objects, which is essential in determining the relative importance of an object within the whole document. This research direction producing the current technology is a natural progression from previous techniques [3], [4], [SI. These deal principally with generalized document analysis and understanding techniques, and specifically for tasks such as automatic classification, detection and extraction of embedded tabular, graphical and textual data. Once these ‘objects’ are detected and extracted, it is possible to design appropriate algorithms to deal with problems separately in a way so that each solution can be locally optimized.

This is a much better way of solving a complicated document processing task, as it is almost impossible to find a single solution that will produce satisfactory performance on a wide variety of tasks. The different implementations of these localized solutions exploit physical layout, geometry, and different visual cues to produce customized solution to specific problems.

member of the TOC represents several segments within the page. Selecting any of these links will enable the user to go to the more detailed content associated with that TOC. For example, selecting the link “Solutions for Acrobat PDF users” will lead the user to the display shown in Figure-5.

This same reasoning has been applied to web based document analysis (web pages). Although the algorithms described in the relevant patens have earlier been applied to ‘static’ documents, the rules are generic enough to be transported to analysis and understanding of ‘dynamic’ documents. The problem here is not only to summarize and separate the content into objects, but also to reconstruct the document in a way so that the original essence of the document is preserved. In order to do that, web content need to be analyzed, the document deconstructed and then reconstructed, based on logical and visual cues, producing a representation that enables the user of a cell phone or other PDAs to grasp the total experience of web surfing on their devices. The overall system uses multiple rule-based experts in a horizontal combination scheme[6,7,8]. Not only it is totally device neutral and automatic, it also guaranties that the total information content of the web is preserved and available for browsing. It does not edit or reformat the web content in any way, it just enhances the surfing experience on devices where till now it was extremely unpleasant. This technology can work with any HTML or XLM document (static or live web pages) and produce compatible output for WAP or iMode download. Serverklient dialog parallels what takes place in an HTML download, i.e. the server attempts to determine the client’s display characteristics (aspect ration, size, font type and size, color/grayscale capabilities) and formats the downloaded page accordingly.

5 Results The performance of the system is best described on a real life application. Figure-3 shows the first page of the web page www.bcl-computers.com. As is clearly seen, this web site has a complicated multi-column layout. The content in presented in multiple segments with an implied relationship between these segments. For example, a story segment might be followed by a segment providing additional links to similar stories. The system analyses the layout and segments within the page and produces a summarized output, as shown in Figure-4. This is the total table of content (TOC). Each

1066

Figure-3 A sample web page: yww.bcl-computers.com

In the same fashion, it is possible to select the link “Liven up” from Figure-6 and arrive at the display presented in Figure-7. Again , this representat ion has links that take the user out of the confinement of the original page (i.e. links with the prefix “=>”) and links that are generated from graphical links from the original page (i.e. [IMGI).

the demands of your organizations. Find our more >> +

Page Root Figure-5: More detailed content The most interesting aspect of this summarization is that since the content is in a multi-dimensional space, the navigation within the representation can be twofold:

In the same way story contents are summarized, side-bars and navigation links are also summarized. For example, Figure4 shows the summarized top bar from It is interesting to the page www.hcl-co!!!puters.co.uk. note that the segment with the longest list of navigational links

Following links that lead to other content within that page. For example, selecting “Solutions for Acrobat PDF users” in Figure-4 takes the user to Figure-5. But the user is still within the same original web page.

+

+

Jobs@BCL + Res ellers + Clients

Following links that lead to other web pages. In general, following links within a web page brings the user to the other pages. This can also be true for this summarized representation. This signifies that selecting a specific link will take the user out of the confinement of the current original page and fetch a new page. Also there has to be some way to navigate between the various levels of abstraction. The symbol “>>” is used for that purpose. For example selecting the symbol “>>” beside the link “Solutions for Acrobat PDF users” in Figure-4 will take the user to the output of Figure-6. This actually represents summary of the relevant section on the original page, with links to stories such as “Solutions for Acrobat PDF users” and “Advanced Search” and “Free upgrade” etc. If the user selects the link “Solutions for Acrobat PDF users” now from this display, the system will show the output of Figure-5. Therefore, there are two ways of navigating to content, either directly, or .indirectly via other associated links.

1067

Press Releases

+

Contact Us Acrobat Plugms + I&igellixi PDF to HTML

+

Jade PDF e : ~ ~ ~ t i ~ n Figure-7: More detailed content.

Press Keleases Drake PDF to RTF Freebird PDF to graphcs -

+

+

+

I

1

Figure-8: The summarized top bar.

+

+

+

+

+

+

+

Press Releases Job sa? CL Resellers Clients Contact u s Acrobat Plums Magellan PDF to l3”ML Jade PDF extraction

1.8 References [I] H. Alam, A. F. R. Rahman, P. Lawrence, R. Hartono, K. Ariyoshi. Viewing Web pages on small form factor

devices, U.S. Patent Application pending, 60/191,329. I

[2] H. Alam, A. F. R. Rahman, P. Lawrence, R. Hartono, K. Ariyoshi. Automatic summarkzation and display of web content in various display devices, U.S. Patent Application pending, 60/232,648.

[3] H. Alam. Processor Based Method For Extracting Tables From Printed Documents. U S Patent 5,737,442. [4] H. Alam. Processor Based Method For Extracting Tablets From Printed Documents. US Patent 5,956,422.

Figure-9: The expanded side bar following from Figure 8. (i.e.“Press Release”) have been summarized, where as the other smaller segments with fewer number of links were not summarized. This is done in order to maximize the usage of the display area without introducing additional (and Often unnecessary) links. If the link of “Press Release” is selected, the display of Figure-9 is obtained, which in turn is identical to Figure-8. This now expands the summary Of the links and provides additional and full content.

[5] H. Alam. Networked Fax Routing Via Email U S Patent 6,104,500.

[61 A‘ F’ R‘ Rahman and M‘ ’‘ Fairhurst’ Introducing new multiple expert decision combination topologies: A case study using recognition of handwritten In Proc, 4h Int. Conf. On Document and ICDAR~~, 2, pages 886-891, Ulm, Germany, 1997. [7] A. F. R. Rahman and M. C. Fairhurst, “Multiple expert classification: A new methodology for parallel decision fusion”. Int. Jour. Of Document Analysis and Recognition, 3(1):40-55, 2000. [8] A. F. R. Rahman and M. C. Fairhurst, “Enhancing consensus in multiple expert decision fusion”. IEE Proc. on Vision, Image and Signal Processing, 147(1):39-46, 2000.

6 Supported Devices The proposed system works in automatically summarizing live web content on the fly to fit smaller screen devices, such as PDAs and cellular phones with web capability. At the present time, the system supports all PDAs using an HTML 3.2 browser and also cellular phones using WAP, iMode (NTT DoCoMo), J-Sky (JPhone) and EZweb (KDDI) formats. Live demonstration will be organized for more elaborate understanding of the system during the presentation of the paper.

7 Conclusion This paper has presented a system to automatically summarize web documents. This is essential to achieve the maximum potential from the smaller screen devices by making the content understandable, more manageable and useful, thereby making the concept of “mobile browsing” one step closer to reality.

1068

Suggest Documents