HTML Page Transcoding Technique for Mobile Devices - wseas.us

HEESOOK SHIN, SOOSUN CHO, DONGWOO LEE, PYEONG-SOO MAH
Computer Software Laboratory, ETRI
161 Gajung-Dong, Yuseong-Gu, Daejeon, KOREA

Abstract: - We propose a new HTML page transcoding technique for hand-held devices connected to the wireless Internet, so that web pages originally designed for desktop computers on wired networks can be displayed successfully. Our approach is based on the observation that most web pages are composed of visibly separated blocks with similar contents, and that this visual separation is produced by structural tags. We apply a block grouping algorithm that extracts blocks with similar appearance by using a content unit and semi-semantic information about the page layout; that information is extracted by a structural tag analysis algorithm. We also employ an index generation mechanism and a block arrangement method to produce a suitably presented document and an efficient browsing interface. Because our approach analyzes HTML pages in terms of minimum units of content blocks and semi-semantic layout information, a given complex HTML page can be automatically converted into an HTML page that displays properly on a small screen, with better display quality than other existing automatic transcoding approaches. We have designed the transcoding system and implemented a prototype, and we verified the performance of our approach by analyzing the transcoded results of real web pages.

Key-Words: - HTML page, document transcoding, hand-held devices, mobile web

1 Introduction
When mobile users browse web pages on hand-held devices with small screens, the pages often cannot be displayed properly, because most existing web pages on the wired Internet are built for desktop computers. To solve this problem and to supply mobile devices with suitable web pages, much research has been carried out, yielding automatic or non-automatic transcoding functions in the web server, in the client terminal, or in an intermediate transcoding server such as a proxy. However, most existing approaches use text-level summary, partial extraction/elision, or page partition methods aimed at mobile devices with low computing power. Therefore, for current web pages that express much information in complex structures, these transcoding methods have little effect on small-screen devices. Furthermore, because mobile users increasingly want to access the rich multimedia mobile Internet from hand-held devices, and because high-performance mobile terminals that include an HTML browser are increasingly common, new HTML page transcoding techniques are required[1][2][3].

In this paper, we propose a transcoding approach that converts an original HTML page into a content-lossless HTML page that can be properly displayed on a small screen and gives high-quality views of web pages with complex structure. This technique defines minimum units of content blocks, uses semi-semantic information from the structural tags that express visual separation, and performs transcoding through a suitable reconfiguration method and an index generation method.

The rest of this paper is organized as follows. In section 2 we present related studies, and in section 3 we describe our new approach in detail. The system design and implementation are given in section 4, followed by experimental results in section 5. We conclude with a summary and future work in section 6.

2 Related Studies
Transcoding techniques depend on the performance of mobile devices. Therefore, we categorize the types of devices that connect to the mobile web, and then discuss the transcoding approaches for each type[2]. The first type is the Notebook series. This kind of terminal is similar to the existing desktop PC. Because it has a resolution of 800*600 or more, it can display web pages from the wired Internet without any transcoding.

The second type is the Cellular Phone series. This kind of terminal appeared as mobile communication companies began to provide mobile web services through cellular phones with extended functions. Because of its limited display capacity, such as 20 text lines and 90*60 resolution, transcoded pages are presented in markup languages such as WML, HDML, and cHTML. Many parts of the original web pages therefore need to be modified, because most transcoded results consist of a summary or partial data in text form. For this kind of terminal, manual transcoding approaches are more useful than automatic ones.

The third type is the Hand-Held series, such as the Palm-Size PC, Hand-Held PC, and PDA. This kind of terminal has an HTTP/HTML browser and generally has 320*240 resolution on a 3-5 inch screen. Recently, as screen size and resolution have increased, many terminals support even 640*480 resolution. However, screen size remains strictly limited by mobility, which is the defining characteristic of mobile terminals. In addition, because the display performance of desktop PCs keeps growing, a relative gap in presentation capability remains when web pages from the wired Internet are presented on mobile devices.

Our research has focused on automatic transcoding methods for hand-held devices that carry a browser supporting HTML/HTTP. For those devices, existing related research can be classified as follows, depending on where the transcoding process occurs.
• Web server side transcoding: Together with automatic transcoding techniques, it supplies transcoding tools that accompany simple manual processes. (IBM) WebSphere Transcoding Publisher[15] is a representative transcoding system, and it uses an annotation-based transcoding approach[7]. An advantage of server side transcoding is that it achieves a correct conversion, but the number of web pages offered by such a transcoding service is very limited compared with the huge quantity of existing web pages.
• Client side transcoding: The transcoding mechanism (SmartView[11], Pad++[9], etc.) runs on client devices that receive the original web pages from web servers. It has the advantage of customization, but it uses network resources inefficiently and requires high computing power in the client terminal.
• Proxy side transcoding: It can automatically offer transcoded pages, derived from most existing web pages, to many kinds of terminals[13][14].

Digestor[4] and WebAlchemist[5] are representative proxy-based automatic transcoding systems. Other work, such as Top Gun Wingman[10] and WebExpress[12], performs transcoding in the proxy and in the client browser together. The existing transcoding approaches above are based on summary, partial extraction/elision, or page partition with index page creation. As a result, they tend neither to convey today's complicated web pages accurately nor to reflect the intention of the original page designer. Although WebAlchemist[5] uses semi-semantic information with several heuristic transcoding algorithms proposed by Digestor[4] to improve the quality of transcoded pages for complex web pages, it is still limited because its basic idea rests on partial extraction and page division. Therefore, we suggest a new transcoding approach for complex web pages, to support mobile users who want to view visually rich web pages on high-performance hand-held devices with small screens.

3 HTML Page Transcoding Technique
Most web pages use visually separated presentation to distinguish different contents and to disambiguate their meaning, as in Fig.1.

Fig.1 Visually Separated Expression of Web Page (figure labels: Component, Component Block)

We can extract semi-semantic information about the page layout by analyzing the structural tags that produce such visual separation. This compensates for the imprecise conversion that appears in existing syntactic transcoding approaches[5]. Through this process, we define a minimum unit of contents as a Component (in Fig.1) and form groups, named Component Blocks, depending on the client's screen size. The Components in each Component Block are similar in presentation and arrangement. This process is carried out by two algorithms: the Layout-Forming Tag Analysis Algorithm and the Component Grouping Algorithm.

In the example of Fig.2, the analysis yields Components (1)… (16) and the semi-semantic layout information for each Component (in Fig.3). It extracts Component Blocks such as (1-3), (4-7), (8), (9-14) and so on, and the extracted Component Blocks are then classified as INDEX or BODY type according to the characteristics of the contents included in each Block. Each Block is either expressed in an indexing format or is reallocated in element, as in Fig.2. Through these processes, we obtain efficient browsing without left-right scrolling, a more natural presentation, and easier recognition of complex web pages.

Fig.2 Before (top) and After (bottom) of Transcoding

Our transcoding technique can be summarized by the following five characteristics.
• Web page analysis based on visual separation
• Definition of a minimum unit for transcoding
• Use of semi-semantic information of page layout
• Grouping and rearrangement of Components
• Re-expression of specific blocks in indexing format

3.1 Layout-Forming Tag Analysis Algorithm
In this algorithm, tags such as <TABLE>, <TR>, <TD>, and <IMG> are mainly analyzed, and a specific element is defined as the Component that is used as the minimum unit of contents during our transcoding process. Through this analysis, we extract semi-semantic structural information by using the parameters in Table 1. Fig.3 shows examples of structural information and the partial HTML code of Fig.2, and Fig.4 represents the Layout-Forming Tag Analysis Algorithm.

Table 1. Parameters of Structural Information

Parameter     Description
CompNum       Component ID. A general Component has (sequence num, 0, 0). An inclusive Component, which includes a <TABLE> tag, has (0, first figure of first child's CompNum, first figure of last child's CompNum).
RowNum        Row number in total layout
ColNum        Column number in total layout
TableDepth    Number of ancestor <TABLE> tags
Width         Re-calculated width value in pixels

Fig.3 Partial HTML code and Structural Information of Fig.2

Repeat {
    extract next tag node;
    If (the tag == <TABLE>) {
        If (TableDepth > threshold) {
            define Width of all elements inside <TABLE>;
        } Else {
            increase TableDepth;
            define Width of <TABLE> element;
        }
    } Else if (the tag == <TR>) {
        increase RowNum;
        define Width of <TR> element;
    } Else if (the tag == <TD>) {
        increase ColNum;
        define CompNum according to whether it has <TABLE> tag or not;
        define Width of <TD> element;
    } Else if (the tag == <IMG>) {
        convert image format;
        set NewWidth of image;
        if (<MAP> is used) modify COORDS of <AREA>;
    } Else if (other tags)
        process for other tags;
} Until (end-of-tags of HTML)

Fig.4 Layout-Forming Tag Analysis Algorithm
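
The counting part of the tag analysis can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that the layout-forming tags are the HTML table, row, and cell tags, it treats every cell as one Component, it resets the column counter at each new row, and it omits the Width calculation and image handling entirely. The names Component, LayoutAnalyzer, and analyze are hypothetical.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Component:
    """Hypothetical record holding a subset of the Table 1 parameters."""
    comp_num: int      # sequence number of the cell (simplified CompNum)
    row_num: int       # RowNum: rows opened so far in the whole page
    col_num: int       # ColNum: position of the cell within its row
    table_depth: int   # TableDepth: number of enclosing table tags
    text: str = ""


class LayoutAnalyzer(HTMLParser):
    """Walks the tag stream once and records one Component per table cell."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0
        self.row_num = 0
        self.col_num = 0
        self.components = []
        self._open_cells = []  # cells whose end tag has not been seen yet

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == "table":
            self.table_depth += 1
        elif tag == "tr":
            self.row_num += 1
            self.col_num = 0  # assumption: columns are counted per row
        elif tag == "td":
            self.col_num += 1
            cell = Component(len(self.components) + 1, self.row_num,
                             self.col_num, self.table_depth)
            self.components.append(cell)
            self._open_cells.append(cell)

    def handle_endtag(self, tag):
        if tag == "table":
            self.table_depth -= 1
        elif tag == "td" and self._open_cells:
            self._open_cells.pop()

    def handle_data(self, data):
        # Attach text to the innermost open cell only.
        if self._open_cells and data.strip():
            self._open_cells[-1].text += data.strip()


def analyze(html):
    """Return the list of Components found in an HTML string."""
    parser = LayoutAnalyzer()
    parser.feed(html)
    parser.close()
    return parser.components
```

Calling analyze on a page string returns the Components in document order, so a later grouping step can iterate over them without re-parsing the page.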

3.2 Component Grouping Algorithm
This algorithm creates Component Blocks by grouping Components according to the total sum of the Components' widths and the Component type. All tags included in a Component Block are wrapped by a single tag and inserted as a sibling node of the ancestor element of the Component Block. If the initial width value of a Component is below MAX_WIDTH, the grouping process continues. However, if a Component's width exceeds MAX_WIDTH, the width of its images, text lines, etc. is forcibly adjusted, because the Component holds only data values without any interior structure. Fig.5 expresses the Component Grouping Algorithm.
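
The width-based part of this grouping can be sketched in Python as follows. This is only an illustration under stated assumptions: the value of MAX_WIDTH, the function name group_components, and the representation of a Component as an (id, width) pair are all hypothetical, and the sketch ignores the Component type, which the algorithm described above also takes into account.

```python
MAX_WIDTH = 240  # assumed screen width in pixels, not a value from the paper


def group_components(components, max_width=MAX_WIDTH):
    """Group (comp_id, width) pairs into blocks no wider than max_width.

    A component wider than max_width is clamped to max_width, mirroring
    the forced width adjustment described in the text.
    """
    blocks = []
    current = []
    current_width = 0
    for comp_id, width in components:
        width = min(width, max_width)  # forced adjustment of oversize items
        # Close the current block when adding this component would overflow.
        if current and current_width + width > max_width:
            blocks.append(current)
            current, current_width = [], 0
        current.append((comp_id, width))
        current_width += width
    if current:
        blocks.append(current)
    return blocks
```

With a 240-pixel limit, two 100-pixel Components share one block, while a 300-pixel Component is clamped to 240 pixels and occupies a block of its own.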

3.3 Classification of Component Block and Index Generation
A Component Block is classified as INDEX or BODY type according to the patterns of the contents it includes. An INDEX Block is re-expressed by creating script file and element on the top portion of the new HTML, and a BODY Block is reallocated inside element. This classification is carried out by measuring the parameters of the comparing patterns listed in Table 2 and comparing them with the expected patterns. The process is performed during the Component Grouping step and the Component Block Classification step.

Table 2. Parameters for Comparing Pattern

Parameter     Expected Pattern
TextLength    Similarly repeated pattern; limited short length.
ImageWidth    Similarly repeated pattern; limited narrow width.
LinkNumber    Almost all text contents have link information; similarity of position of linked document and file name.
RowNum        Small number; special block that is arranged on the top portion of the original web page.
ColNum        Maximum or minimum number; special block that is arranged on the left or right side of the original web page.

Repeat {
    extract next Component node;
    If (Component type == INDEX) {
        insert into INDEX Component Block;
    } Else {
        If (Component width