SMARTCOMP 2014
Automatic Page Scrolling for Mobile Web Search

Mostafa Alli and Ling Feng
Department of Computer Science and Technology, Tsinghua University, Beijing, China
Emails: [email protected], [email protected]
Abstract—Nowadays mobile phones are widely used in our daily life: as a camera, radio, music player, and even an Internet browser. As most Web pages were originally designed for desktop computers with large screens, viewing them on smaller displays involves a good deal of horizontal and vertical page scrolling. To reduce the mobile Web search fatigue caused by repeated scrolling, we investigate the automatic Web page scrolling problem based on two observations. First, the different parts of a Web page are not equally important to an end user, who is often interested in only a certain part of the page. Second, because text entry on mobile phones is harder than on desktop computers, users usually prefer to search the Web just once and get the needed answer. In contrast to existing efforts on page layout modification and content splitting for easy page navigation on mobile displays, we present a simple yet effective approach of automatic page scrolling for mobile Web search that keeps the original Web page content intact and hence prevents any loss of information. We work with the Document Object Model (DOM) of the page clicked by the user and compute the relevance of each paragraph of the Web page based on the tf*idf (term frequency * inverse document frequency) values of the user's search keywords occurring in that paragraph; the focus of the browser is then automatically scrolled to the most relevant paragraph. Our user study shows that the proposed approach achieves 96.47% scrolling accuracy with one search keyword and 94.78% with multiple search keywords, while the time spent computing the most relevant part does not vary much with the number of search keywords. Users can save up to 1.5 s in searching and finding the needed information compared to the best case of our user study.
I. INTRODUCTION

The rapid development of wireless Internet and mobile computing technologies has made mobile Web search more and more popular. While a number of popular Web sites offer miniaturized versions for mobile browsers, the majority of today's Web developers still target desktop computers with large screens, whose pages are too large to fit the relatively small screens of mobile devices. The resulting unnecessary scrolling within the content of a page makes mobile search time-consuming and strenuous. In fact, the different parts of a Web page are not equally important to an end user, who is often interested in only a certain portion of the page. In addition, mobile phones have longer page loading times and higher data transfer costs than wire-connected desktop computers. Gomez, Inc. [1] reported that around 60 percent of mobile users expect a page to load on their phone within 3 seconds or less, 74 percent of them are willing to wait 5 seconds or less for the page to be
978-1-4799-5711-8/14/$31.00 ©2014 IEEE
opened, 78-80 percent of users try at most twice to open a page that initially failed to open, and 57 percent of users will not retry a site with which they had a bad experience. Meanwhile, 38 percent stated that the biggest problem in their Web navigation on mobile phones is slow page loading, while 15 percent complained about bad formatting that makes pages hard to read. We can therefore conclude that mobile users are impatient and wish to get what they want as quickly as possible. If we can let mobile users reach the desired information more easily and quickly, their search time and cost can be reduced substantially.

In the literature, many efforts have been made towards Web page layout modification and content splitting for the sake of easy navigation on mobile displays. In contrast, this study aims at an automatic page scrolling solution for mobile Web search, that is, bringing users to the most desirable part of a searched page while keeping the original page unchanged. This is based on the fact that many mobile users work with the hard-to-use text-entry facility of mobile phones rather than a desktop computer keyboard, and thus prefer to search the Web just once and get the needed answer [2]. Hence, adjusting and adapting the whole Web page just for one search might not always be necessary. Automatic page scrolling to the most relevant part constitutes another efficient yet effective solution, not only reducing users' search time but also losing no page information. To address the issue, two questions need to be resolved.

Question 1: How to logically decompose a Web page into smaller parts? At a fine-grained granularity, we decompose a Web page into paragraphs, where a figure or a table is treated as a paragraph. We work with the Document Object Model (DOM) tree rather than the raw HTML markup of each Web page.
DOM is a cross-platform, language-independent convention for defining the logical structure of HTML, XHTML, and XML documents. The nodes of every Web page are organized in a DOM tree structure, with the topmost node named the "Document object". In the DOM tree of a page, the leaf sibling nodes (of the same parent node) together correspond to a page paragraph. A figure with a figure label or a table is also treated as a paragraph.

Question 2: How to compute the relevance degree of each part in response to the mobile user's one or more search keywords? A study by Fu et al. [3] showed that whether a search result is accurate or not, users would feel more confident by guessing
the page's usefulness through keyword occurrence information. This fact leads us to consider the occurrence of search keywords in different parts of a page. By treating a keyword's occurrence within a part as tf (term frequency) and its occurrence across the other parts of the page as idf (inverse document frequency), we can apply the tf-idf principle to compute the relevance degree of each part of the page for our automatic page scrolling. We conducted a user study to verify the efficiency and effectiveness of our page scrolling approach.

The remainder of the paper is organized as follows. We review related work in Section II. The framework for automatic page scrolling is detailed in Section III. We evaluate the performance of the framework in Section IV and conclude the paper in Section V.

II. RELATED WORK

Efforts on displaying Web pages on small screens fall into four categories, namely, Web page layout transformation, Web page segmentation, Web page summarization, and identification of important page regions.

A. Transformation of a Web Page Layout for Small Displays

For an easy-to-view display for the mobile phone user, Opera [4] transformed the Web page layout into a vertically long form. Chen et al. [5] and Masuda et al. [6] converted table-formatted information by extracting item names from the table and displaying each column along with the extracted item names. Buchanan and Jones et al. [7], [8] generated a menu-like page for the mobile phone based on the site map of the Web page. Recently, a Web design approach called Responsive Web Design (RWD) was presented, aiming to provide an easy reading and navigation experience with a minimum of resizing, panning, and scrolling across a wide range of devices from mobile phones to desktop computers [9]. A Web site designed with RWD can thus adapt its layout to the viewing environment by using fluid, proportion-based grids, flexible images, etc.
These efforts require Web designers to create page contents separate or different from the traditional ones for desktop computers.

B. Segmentation of a Web Page into Smaller Pieces

1) Page Segmentation based on DOM Trees: Kuppusamy and Aghila [10], [11] proposed breaking the whole page into smaller blocks based on the page's DOM tree representation. They grouped the DOM tree nodes of each page into two categories: block-level nodes and non-block-level nodes. The first group contains nodes with child nodes, while the second contains nodes without any child. For each node, the text density of the contents is analyzed; if it exceeds a certain threshold, the node is viewed as an individual segment, otherwise it is merged with the closest block-level node. Madaan et al. [12] designed a query interface specifically for a medical encyclopedia, combining object- and entity-based search to exploit the semantic connections between different page segments. Their algorithm used visual cues and heuristics to build a hierarchical structure
of the page, enabling a query interface for in-depth querying. Together, these give a language-independent page segmentation and let the user query an area of a page rather than the whole page. Liu et al. [13] proposed a page segmentation algorithm based on the Gomory-Hu tree in a planar graph. The graph vertices represent leaf nodes in the page's DOM tree, and the edges represent the relationships between the vertices. Three steps are involved in building the graph: 1) downloading the HTML of the page, 2) creating the DOM tree, and 3) selecting nodes from it using a selection policy. After these three steps, the vertices of the graph are gathered, and vertices that are physical neighbors in the page are linked by edges. Finally, the authors applied the Gomory-Hu algorithm to this graph to cluster the Web page. Kalaivani and Rajkumar [14], [15] and Kang et al. [16] introduced a method of Web page segmentation that recognizes repetitive tag patterns, called key patterns, in the DOM tree structure of a page. According to their algorithms (called Repetition-based Page Segmentation (REPS) and Reappearance Layout based Web page Segmentation (RLSE)), the key patterns in the DOM tree are detected by producing subsequences of length m from a tag pattern of length n, where m > 1; by this definition, the longest length of a subsequence is n/2. From the subsequences, repeated tag patterns can be found. The authors then built page blocks based on these tags by creating virtual nodes as parents. Finally, a page is segmented into blocks produced by repetitive HTML tag patterns.

2) Page Segmentation based on Page Layouts: Aruljothi et al. [17] proposed a page segmentation method based on tags that represent the layout of the page. They first extracted the tags from the page, then performed spectral clustering by finding each extracted HTML tag's path.
Next, either a "reappearance-based segmentation" or a "layout-based segmentation" is applied to segment a Web page; the former is based on the pattern of tags that appear in sequence. The authors evaluated the approach based on accuracy, bandwidth, and memory usage. Hattori et al. [18] divided a Web page into small segments on the basis of both content distance (expressing the strength of connections between content elements in the Web page based on the structural depth of HTML tags) and page layout information. Song et al. [19] built block importance models for Web pages. They used the VIPS algorithm to partition a Web page into semantic blocks with a hierarchical structure. Spatial features (such as position and size) and content features (such as the number of images and links) were then extracted to construct a feature vector for each block. Based on these features, learning algorithms such as SVMs and neural networks were applied to train various block importance models. Wu et al. [20] investigated the XHTML page layout and VIPS (Vision-based Page Segmentation) and introduced a new algorithm called BGBPS (Block Gathering Based Page Segmentation), which uses page tags such as <div> and <table> to capture the page layout. They then grouped
tags into two main categories, namely, "gathering nodes" and "non-gathering nodes". Gathering nodes are those used to form a mobile Web page, such as <p>, <td>, <form>, etc.; non-gathering nodes are those with other tags. The authors then applied several rules to both categories to perform page segmentation. Akpnar and Yesilada [21] compared pure DOM-based page segmentation with a combination of DOM and visual rendering of a page. They found that using both DOM and VIPS delivers more "convenient" Web page segmentation. However, VIPS has its own limitations. To address them, the authors extended the general algorithm with additional abilities and properties, such as more visual cues for detecting separate contents, like margin, padding, and floating attributes. These attributes are then employed for separate block detection, since they are usually used to separate different contents with titles and empty spaces. Sanoja and Gançarski [22], [23], [24] proposed a hybrid segmentation tool that combines VIPS and geometric models. They called this hybrid model Block-o-Matic; its aim is to reduce the cost of segmentation evaluation by providing manual segmentation capability to the assessors. However, their main focus was the geometric properties of the page rather than the text content. Four rules are applied to recognize different blocks in each page, taking the inline and line-break elements of HTML pages into account.

3) Page Segmentation based on Semantic Contents: Chen et al. [25], [26] proposed splitting a Web page into smaller, logically related units that fit the small screen of a mobile phone. To do this, they first detected the high-level content blocks of a page, containing location and size information for headers, footers, sidebars, and the body. From each content block, they identified explicit separators to determine where to split the blocks.
Some implicit separators were also detected to help split the blocks further. Xie et al. [27] employed a block importance model to assign importance values to different segments of a Web page and displayed the result pages from highest priority to lowest. Xiao et al. [28] performed vision-based Web page segmentation; the transformed page sets form a multi-level slicing-tree, in which an internal node consists of a thumbnail image with hyperlinks and a leaf node is a block from the original Web page. Kovačević et al. [29] also presented an M-tree which collects the coordinates of HTML tags and used this tree to mimic the visual rendering behavior of IE. Baluja [30] cast the Web page segmentation problem into a machine learning framework based on entropy reduction and decision tree learning. The SmartView of Milic-Frayling and Sommerer [31] used a page-splitting technique to group the elements of a Web page and present them together while allowing zooming into individual elements. De Bruijn et al. [32] proposed the use of Rapid Serial Visual Presentation (RSVP) to provide a rich set of navigational information for Web browsing. It uses WML cards and decks to segment a Web page, with different interfaces for switching between the cards and images. The CMo system, developed by Borodin et al. [33], [34], [35],
allowed users to see and navigate between fragments of a Web page. On following a link, CMo captures the context of the link using a simple topic-boundary detection technique. It uses this context to identify relevant information in the next page with the help of a Support Vector Machine, and displays the most relevant fragment of the Web page.

C. Web Page Summarization and Key Phrase Extraction for Thumbnail View

Leoncini et al. [36] proposed a Web mining framework for summarization and segmentation of a page. The framework proceeds in the following steps. First, the textual content is extracted from the page; second, by spotting sentence terminators, the text is separated into words and sentences. Stop-word removal is the next step, followed by applying a semantic network to the tokens (words) to extract a group of concepts. These concepts are then grouped into homogeneous groups called domains. The framework processes these domains for the topics the user has indicated, and the output of this step is used for sentence ranking to select the portion of the Web page that deals with the main topic. Yang and Wang [37] presented a fractal-theory-based mathematical way of summarizing a Web page into a tree structure, displaying the summary on handheld devices through cards in WML. Users may browse the selected summary by clicking the anchor links from the highest abstraction level to the lowest. Based on the sentence weights computed by the summarization technique, sentences are displayed in different font sizes to enlarge the focus of interest and diminish less significant sentences. Fractal views are utilized to filter out less important nodes in the document structure. Buyukkokten et al. [38], [39], [40] represented a Web page as a short accordion summary by selecting important keywords or sentences. The user can then drill down to discover relevant parts of the page.
If desired, keywords can be highlighted and exposed automatically. Using text summarization and visualization, Björk et al. [41] provided a thumbnailed overview of the contents of a Web page. Lam and Baudisch [42] also proposed a summary thumbnail view of a page, making the text fragments of the thumbnail readable. Jones et al. [43] used key phrases extracted from the document as search result surrogates on small displays. After clicking each result surrogate (whether based on the title or key phrases), the user can view the page with either key phrase emphasis or sentence emphasis, which highlights the key phrases, the sentences, or even the whole paragraph.

D. Important Region Identification and Noise Removal

Yin and Lee [44] modeled a Web page as a graph and applied a link-analysis method similar to Google's PageRank to compute an importance value for each basic element of an HTML DOM tree. This allows extracting only the important parts of Web pages for delivery to mobile devices.
Noise elimination from a Web page has also been studied. Gupta et al. [45] implemented an advertisement remover by maintaining a list of advertiser hosts and calculating the ratio of the number of words within links to those in plain text. Htwe and Kham [46] used the DOM and neural networks together for data extraction by removing noise from a page: they build a DOM tree and, applying a threshold learned by training a neural network on a data set, prune the deeper parts of the tree, which are regarded as noise.

III. AN AUTOMATIC PAGE SCROLLING FRAMEWORK
Fig. 1. An automatic page scrolling framework for mobile Web search
Fig. 1 illustrates our automatic page scrolling framework for mobile Web search. After a mobile user types one or several keywords into the browser, we extract these keywords from the URL of the result page returned by the browser. The Web page the user then clicks is the one to be automatically scrolled on the mobile screen, in the following two steps.

Step 1. The Result Page Decomposition module is responsible for logically fragmenting the whole Web page into several smaller parts. To do this, we apply the HTML parser tagsoup.jar to construct the DOM tree of the page from the page's HTML source code. The leaf sibling nodes of the same parent node in the DOM tree form one part, which usually corresponds to a paragraph of the page. A figure with a figure label or a table is also treated as a paragraph. By post-order traversal of the DOM tree, we obtain the page paragraphs {g1, g2, ..., gn}. That is, each time we reach a leaf node, we check if the current leaf node has no more siblings,
Fig. 3. The original Web page
and if yes, this group of siblings of the same parent node is treated as a page paragraph.

Step 2. The Paragraph Relevancy Computation module computes the relevancy degree of each paragraph with respect to the user's search keywords and returns the most relevant paragraph to scroll to. We tackle this problem by estimating the likelihood of the search keywords occurring in every part of the page and then picking the most likely part as the scrolling destination. Assume the page is decomposed into n parts ⟨part_1, part_2, ..., part_n⟩. According to Bayes' theorem, which expresses relations between two or more events through conditional probabilities, we have

P(part_i | word) = P(part_i) × P(word | part_i) / P(word)    (1)

Fig. 2. Scrolled Web page on a mobile screen
where part_i (1 ≤ i ≤ n) has a prior probability P(part_i), P(part_i | word) is part_i's posterior probability given word, and P(word | part_i) is the conditional probability of word being seen in part_i.

In information retrieval, a word's tf-idf (term frequency-inverse document frequency) is often used as a weighting factor, statistically reflecting how important the word is to a document in a collection or corpus. The tf-idf value increases proportionally with the number of times the word appears in the document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others. Here, we apply the classic tf*idf principle: for each part_i and each stemmed root word, we compute the tf value tf(part_i, word) as the number of occurrences of word divided by the total number of words in the page, and idf(part_i, word) as the inverse of the number of parts in which word appears. In this way, we obtain mean_i = µ_i and variance_i = σ_i² of the words' tf*idf values in part_i, by averaging all the tf*idf values of words in part_i and their squared deviations from the mean. P(word | part_i) in Equation 1 can then be approximately measured through a normal distribution N(µ_i, σ_i²):

P(word | part_i) = (1 / √(2πσ_i²)) × exp( −(word.tf*idf − µ_i)² / (2σ_i²) )    (2)
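As a concrete illustration, the statistics behind Equation (2) can be sketched in a few lines of Python. The per-part tf*idf weighting used here (within-part term frequency times a logarithmic inverse part frequency) is an assumption for illustration only, not necessarily the authors' exact formula; the function names are likewise hypothetical.

```python
import math
from collections import Counter

def tfidf_weights(parts):
    """Per-part tf*idf weights. Assumed weighting for illustration:
    tf = word count / part length, idf = log(#parts / #parts containing word)."""
    n = len(parts)
    bags = [Counter(p.lower().split()) for p in parts]
    df = Counter()
    for bag in bags:
        df.update(bag.keys())
    return [{w: (c / sum(bag.values())) * math.log(n / df[w])
             for w, c in bag.items()}
            for bag in bags]

def word_likelihood(weights_i, word):
    """P(word | part_i) as in Equation (2): the density of N(mu_i, sigma_i^2),
    fitted to the part's tf*idf values, evaluated at the word's tf*idf.
    A word absent from the part is given likelihood 0 (an assumption)."""
    if word not in weights_i:
        return 0.0
    vals = list(weights_i.values())
    mu = sum(vals) / len(vals)
    var = sum((v - mu) ** 2 for v in vals) / len(vals) or 1e-9  # guard sigma = 0
    x = weights_i[word]
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
```

For a page split into three parts, `word_likelihood(tfidf_weights(parts)[i], "cats")` is nonzero only for the parts in which "cats" actually occurs, mirroring the keyword-occurrence intuition behind the model.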
In Equation 1, since the denominator P(word) is independent of part_i and P(part_i) is the same for all parts, the likelihood that a search word appears in part_i, P(word | part_i), dominates the posterior probability P(part_i | word). Given a mobile user's search request containing m keywords {s_1, ..., s_m}, for each part_i of the page we calculate the average likelihood L(part_i, s_1, ..., s_m) = (P(part_i | s_1) + ... + P(part_i | s_m)) / m, and pick the part_k with the highest likelihood value (arg max_k L(part_k, s_1, ..., s_m)). Fig. 2 shows the scrolled result from the original page (Fig. 3) under the search keyword "Computer engineer".

IV. PERFORMANCE EVALUATION

We conducted three experiments, using a computer with a 1.83 GHz CPU and 1 GB of RAM as the server, to examine the efficiency (i.e., the computing time and accuracy with which the system identifies the most relevant part of the searched page) and effectiveness (i.e., the user's search time with and without page scrolling) of our Web page scrolling approach. Here, the computing time is counted as the search time with page scrolling. Assume ⟨r_1, r_2, ..., r_m⟩ is the sequence of paragraphs relevant to a user's search request, and r_k (1 ≤ k ≤ m) is the most relevant paragraph computed by the approach. We measure the page scrolling accuracy according to the distance of the returned paragraph r_k from the first relevant paragraph r_1, i.e., Accuracy = (m − k + 1)/m. We conducted the same experiments with 4 different phones: iPhone 4, Huawei T895, Oppo 520, and Motorola MB865. We did not consider the accuracy result in our user study because, as long as each user finds the desired information, the accuracy is always 100%.
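The selection rule and the accuracy metric described above can be sketched as follows. This is a minimal illustration: the likelihood numbers are made up, and in the real system each row would come from Equations (1)-(2).

```python
def best_part(likelihoods):
    """arg max_k L(part_k, s_1..s_m): pick the part whose average keyword
    likelihood is highest. `likelihoods` has one row per part, holding one
    P(part | s_j) value per search keyword."""
    avg = [sum(row) / len(row) for row in likelihoods]
    return max(range(len(avg)), key=avg.__getitem__)

def scrolling_accuracy(m, k):
    """Accuracy = (m - k + 1) / m, where r_1..r_m are the paragraphs relevant
    to the request and r_k (1-based) is the paragraph the method scrolled to."""
    return (m - k + 1) / m

# Three parts scored against two keywords (illustrative numbers):
print(best_part([[0.1, 0.2], [0.7, 0.4], [0.3, 0.3]]))  # -> 1
print(scrolling_accuracy(m=4, k=1))  # -> 1.0 (scrolled to the first relevant paragraph)
print(scrolling_accuracy(m=4, k=3))  # -> 0.5
```

Note that accuracy degrades linearly with how far past the first relevant paragraph the method lands, and equals 1 whenever the first relevant paragraph is returned.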
TABLE IV
STATISTICS OF COMPUTING TIME (S) AND ACCURACY (%) WITH ONE SEARCH KEYWORD

Measure                        Compute Time (s)   Compute Accuracy (%)
Mean                           4.541              96.4706
Std. Error of Mean             .5566              1.95373
95% CI for Mean (lower)        3.361              92.3289
95% CI for Mean (upper)        5.721              100.6123
Median                         3.900              100.0000
Variance                       5.266              64.890
Standard Deviation             2.2948             8.05541
Minimum                        2                  75.0
Maximum                        11                 100
TABLE V
ONE KEYWORD'S SEARCH TIME (S) WITHOUT PAGE SCROLLING ON 4 MOBILE PHONES

                  iPhone    Huawei    Oppo       MB865
Mean              6.0167    10.2857   21.7780    9.8200
N                 15        14        15         15
Std. Deviation    4.39259   5.32669   18.01356   5.79275
A. Experiment 1: Mobile Search With One Keyword

From the results presented in Tables I and IV, we can see that it usually takes around 4.5 s to identify the relevant part of a page with a single search keyword. As the standard deviation is low, most values are close to the average time. With a mean accuracy of 96.47%, the page scrolling method achieves good accuracy for this kind of keyword. According to the report [1], users expect to see a page within 5 s at most; in our evaluation, we therefore examined how well automatic scrolling and non-scrolling satisfy this expectation. In the first experiment, the iPhone's mean is the lowest among the four phones; nevertheless, page scrolling saves at least 1.5 s of search time.

B. Experiment 2: Mobile Search With Two Keywords

As shown in Tables III and VII, it takes 0.67 s more on average to derive the relevant paragraph of a Web page given two keywords rather than one. The lower mean accuracy shows that accuracy decreases as well. The Motorola phone performs best with a mean of 6.56 s, which is still higher than the 5.18 s mean with page scrolling, as shown in Table VI.

C. Experiment 3: Mobile Search With Three or More Keywords

As expected, the accuracy drops by nearly another 10%, though the time consumed to find the relevant information
TABLE VI
TWO KEYWORDS' SEARCH TIME (S) WITHOUT PAGE SCROLLING ON 4 MOBILE PHONES

                  iPhone    Huawei    Oppo       MB865
Mean              10.8824   8.0000    17.8122    6.5667
N                 17        17        18         18
Std. Deviation    5.04830   6.41531   14.69538   4.87852
TABLE I
PERFORMANCE WITH ONE SEARCH KEYWORD

                                                                            Compute           Non-scrolling search time (s)
Keyword      Page URL                                                       Time (s)  Acc (%)  iPhone  Huawei  Oppo    Motorola
photography  www.photography.com/                                           4.2       100      4.1     N/A     19.90   9.3
sport        www.discoverhongkong.com/eng/events/sports.html                9.3       87.5     11      10      18.51   8.6
Olympics     www.hotelolympic.com/                                          5.7       100      9       18      13.64   7.5
Merlin       www.sentosa.com.sg/en/attractions/imbiah-lookout/the-merlion/  10.9      100      7       13      74      29
green        www.enb.gov.hk/en/Green HK/index.html                          3.1       100      3       9       09.46   4.5
hadoop       hadoop.apache.org/                                             3.2       100      9       6       10.65   6
cartoons     www.glasbergen.com/                                            3.3       100      1.15    13      12.12   5.8
Sabah        www.sabahtourism.com/                                          3.2       68.42    12      15      40.93   7.8
rainbow      rainbowsystem.com/                                             5.1       100      3       11      11.82   12
Samsung      drive.seagate.com/content/samsung-en-us                        2.9       100      4       21      35.98   13
laptop       www.eboxgz.com/                                                3.1       100      14      3       09.06   10
flood        www.fema.gov/hazard/flood/index.shtm                           2.2       100      N/A     6       N/A     11
bags         www.thebaglady.tv/                                             4.8       80       2       9       26.24   7
beach        en.wikipedia.org/wiki/Beaches of Hong Kong                     3.7       100      2       7       26.79   8
art          www.hongkongartfair.com/                                       4.7       100      9       3       17.57   7.8
TABLE II
PERFORMANCE WITH THREE OR MORE SEARCH KEYWORDS

                                                                                                         Compute           Non-scrolling search time (s)
Keyword                           Page URL                                                               Time (s)  Acc (%)  iPhone  Huawei  Oppo    Motorola
history of Singapore              www.newasia-singapore.com/travel information/introduction/
                                  brief history of singapore 200705304.html                              8         100      5       8       10.10   4
green house effects               www.ucar.edu/learn/1 3 1.htm                                           3         66       5       4       16.48   2.8
chocolate chip cookies            www.joyofbaking.com/ChocolateChipCookies.html                          8.1       87.5     7       3.5     18.85   6
do alien exist                    www.msnbc.msn.com/id/28148553/ns/technology and science-space/t/
                                  six-frontiers-alien-life/#.QtS HVJtiPE                                 6         100      5       5.5     N/A     5
who is Sir Stamford Raffles       www.britannica.com/topic/489451/Sir-Stamford-Raffles                   5.9       100      9       17      14.21   6
ways to fix shelf                 www.ehow.co.uk/how 6609528 fix-floating-shelf.html                     4.5       100      5       9       23.01   4
how to pass an exam               www.explainthatstuff.com/howtopassexams.html                           5.8       68.42    N/A     6       15.17   5
ways to gain confidence           www.essortment.com/gain-confidence-life-54988.html                     6.1       72.72    14      9.5     14.04   8
most expensive city of the world  www.en.wikipedia.org/wiki/List of most expensive cities for expatria   10.3      75       7       10.5    30.78   11
TABLE VII
STATISTICS OF COMPUTING TIME (S) AND ACCURACY (%) WITH TWO SEARCH KEYWORDS

Measure                        Compute Time (s)   Compute Accuracy (%)
Mean                           5.180              95.5220
Std. Error of Mean             .3810              1.81119
95% CI for Mean (lower)        4.382              91.7311
95% CI for Mean (upper)        5.978              99.3129
Median                         4.950              100.00000
Variance                       2.904              65.608
Standard Deviation             1.7041             8.09988
Minimum                        3                  75.0
Maximum                        9                  100

TABLE VIII
STATISTICS OF COMPUTING TIME (S) AND ACCURACY (%) WITH THREE OR MORE SEARCH KEYWORDS

Measure                        Compute Time (s)   Compute Accuracy (%)
Mean                           5.349              94.0443
Std. Error of Mean             .3627              4.68967
95% CI for Mean (lower)        4.619              91.0111
95% CI for Mean (upper)        6.079              97.0774
Median                         4.800              100.00000
Variance                       6.182              106.722
Standard Deviation             2.4863             10.33064
Minimum                        2                  66.0
Maximum                        15                 100

changes little. This shows that the computing time does not depend much on the number of search keywords, whereas accuracy decreases roughly linearly with it. This may be because of the different types of Web sites that each group of keywords leads users to: the more words a search query contains, the richer the page the user tends to choose as the result. The Motorola phone again performs best in Table IX, yet the scrolling method still saves time
TABLE IX
THREE OR MORE KEYWORDS' SEARCH TIME (S) WITHOUT PAGE SCROLLING ON 4 MOBILE PHONES

                  iPhone    Huawei    Oppo      MB865
Mean              7.1250    8.1111    18.3714   5.7556
N                 8         9         7         9
Std. Deviation    3.13676   4.13656   6.77346   2.46734
by an average of 0.4 s.

TABLE III
PERFORMANCE WITH TWO SEARCH KEYWORDS

                                                                                           Compute           Non-scrolling search time (s)
Keyword                   Page URL                                                         Time (s)  Acc (%)  iPhone  Huawei  Oppo    Motorola
java book                 www.techbooksforfree.com/java.shtml                              4.2       100      24      4.5     07.71   5.5
cooking pizza             step-by-step-cook.co.uk/mains/pizza/                             4.8       100      10      5       16.37   12
adventure racing          www.arworldseries.com/                                           4.2       100      10      5       16.37   12
pyramid of Giza           www.sevenwondersworld.com/wonders of world giza pyramid.html     3.1       81.25    8       7       71.07   11.5
Mount Kinabalu            www.climbmtkinabalu.com                                          5         100      8       4.5     10.79   2.2
when telephone invented?  www.inventors.about.com/od/bstartinventors/a/telephone.htm       3         80       10      6       20.04   6.8
what is origami           www.library.thinkquest.org/5402/history.html                     4         100      10      2       20.03   7.6
successful study          www.adprima.com/studyout.htm                                     4.8       100      4       3.5     07.14   2.4
earn money                www.freebyte.com/makemoney/                                      6.6       100      45      N/A     20.43   2.7
study abroad              www.vistawide.com/studyabroad/why study abroad.htm               3.1       100      12      6.5     08.04   3
how to fishing            www.takemefishing.org/fishing/fishopedia/how-to-fish             2.9       100      12      N/A     08.14   12
healthy diet              www.wellnessletter.com/ucberkeley/foundations/
                          13-keys-to-a-healthy-diet/                                       8         94.73    13      12      18.01   3.5
how to drive a car        www.wikihow.com/Drive-a-Car-With-an-Automatic-Transmission       7         85.71    9       4.5     13.57   2.5
angry birds               www.angrybirdsriogame.com/                                       4.9       100      N/A     25      12.93   20
public speaking           www.stresscure.com/jobstress/speak.html                          6.1       100      8       3.5     07.76   2.5
color codes               www.computerhope.com/htmcolor.htm                                6.8       100      7       5       24.50   2.2
HTML codes                www.htmlcodetutorial.com/                                        5         100      11      6       08.89   5.6
china Airlines            www.airlinequality.com/Forum/china.htm                           9.1       93.75    18      19      27.53   6.2

V. CONCLUSION AND FUTURE WORK

Although the screens of today's mobile phones have grown larger, the content of Web pages has at the same time grown richer, with numerous types of advertisements and charts, making it hard for mobile users to search for and get the desired information. To address this issue, we presented an automatic page scrolling method for mobile Web search. Our performance study shows that users can save up to 1.5 s in getting the needed information compared to no scrolling. Further improvement of the relevant-paragraph determination is needed, taking users' search context into account.

ACKNOWLEDGEMENT

The work is supported by the National Natural Science Foundation of China (61373022 and 61073004).

REFERENCES

[1] "What mobile users want," Gomez Inc., Tech. Rep., July 2011.
[2] M. Kamvar, M. Kellar, R. Patel, and Y. Xu, "Computers and iphones and mobile phones, oh my!: A logs-based comparison of search users on different devices," in Proc. of WWW, 2009, pp. 801–810.
[3] W. Fu, T. Kannampallil, and R. Kang, "Facilitating exploratory search by model-based navigational cues," in Proc. of IUI, 2010, pp. 199–208.
[4] Opera Software ASA, "Small screen rendering," http://www.opera.com/products/mobile/smallscreen/.
[5] Y. Chen, W. Ma, and H. Zhang, "Improving web browsing on small devices based on table classification," in Proc. of WWW, 2003.
[6] H. Masuda, S. Tsukamoto, S. Yasutomi, and H. Nakagawa, "Recognition of HTML table structure," in Proc. of IJCNLP, 2004.
[7] G. Buchanan, S. Farrant, M. Jones, and H. Thimbleby, "Improving mobile internet usability," in Proc. of WWW, 2001.
[8] M. Jones, G. Buchanan, and H. Thimbleby, "Sorting out searching on small screen devices," in Proc. of HCI, 2002.
[9] E. Marcotte, "Responsive web design," A List Apart, May 2010.
[10] K. Kuppusamy and G. Aghila, "A personalized web page content filtering model based on segmentation," International Journal of Information Sciences and Techniques (IJIST), vol. 2, no. 1, pp. 41–51, January 2012.
[11] ——, "Caseper: An efficient model for personalized web page change detection based on segmentation," Journal of King Saud University - Computer and Information Sciences, vol. 26, no. 1, pp. 19–27, January 2014.
[12] A. Madaan, W. Chu, and S. Bhalla, "Vishue: Web page segmentation for an improved query interface for medlineplus medical encyclopedia," in Proc. of the 7th Intl. Conf. on Databases in Networked Information Systems, 2011, pp. 89–108.
[13] X. Liu, H. Lin, and Y. Tian, "Segmenting webpage with gomory-hu tree based clustering," Journal of Software, vol. 6, no. 12, pp. 2421–2425, December 2011.
[14] Kalaivani and Rajkumar, "Reappearance layout based web page segmentation for small screen devices," International Journal of Computer Applications, vol. 49, no. 20, 2012.
[15] ——, "Dynamic web page segmentation based on detecting reappearance and layout of tag patterns for small screen devices," in IEEE conference proceedings, 2012, pp. 508–513.
[16] J. Kang, J. Yang, and J. Choi, "Repetition-based web page segmentation by detecting tag patterns for small-screen devices," IEEE Transactions on Consumer Electronics, vol. 56, no. 2, pp. 980–986, 2010.
[17] S. Aruljothi, S. Sivaranjani, and S. Sivakumari, "Web page segmentation for small screen devices using tag path clustering approach," International Journal on Computer Science and Engineering, vol. 5, no. 7, pp. 617–624, 2013.
[18] G. Hattori, K. Hoashi, K. Matsumoto, and F. Sugaya, "Robust web page segmentation for mobile terminal using content-distances and page layout information," in Proc. of WWW, 2007, pp. 361–370.
[19] R. Song, H. Liu, J. Wen, and W. Ma, "Learning block importance models for web pages," in Proc. of WWW, 2004, pp. 203–211.
[20] L. Wu, N. Y. He, and Y. Ke, "A block gathering based on mobile web page segmentation algorithm," in Proceedings of the IEEE 10th International Conference on Trust, Security and Privacy in Computing and Communications, 2011, pp. 1425–1430.
[21] M. E. Akpinar and Y. Yesilada, "Vision based page segmentation algorithm: Extended and perceived success," in Current Trends in Web Engineering - ICWE 2013 International Workshops, Revised Selected Papers, 2013, pp. 238–252.
[22] A. Sanoja and S. Gançarski, "Block-o-matic: a web page segmentation tool and its evaluation," demo presented at BDA 2013, vol. 1, pp. 1–5, November 2013.
[23] ——, "Yet another hybrid segmentation tool," in Proceedings of the 9th International Conference on Preservation of Digital Objects, 2012.
[24] ——, "Block-o-matic: A web page segmentation framework," in International Conference on Multimedia Computing and Systems, 2014, pp. 595–600.
[25] Y. Chen, X. Xie, W. Ma, and H. Zhang, "Adapting web pages for small-screen devices," IEEE Internet Computing, vol. 9, no. 1, pp. 50–56, 2005.
[26] Y. Chen, W. Ma, and H. Zhang, "Detecting web page structure for adaptive viewing on small form factor," in Proc. of WWW, 2003, pp. 225–233.
[27] X. Xie, G. Miao, R. Song, J. Wen, and W. Ma, "Efficient browsing of web search results on mobile devices based on block importance model," in Proc. of PERCOM, 2005, pp. 17–26.
[28] X. Xiao, Q. Luo, D. Hong, and H. Fu, "Slicing*-tree based web page transformation for small displays," in Proc. of CIKM, 2005, pp. 303–304.
[29] M. Kovačević, M. Diligenti, M. Gori, M. Maggini, and V. Milutinović, "Recognition of common areas in a web page using visual information: a possible application in a page classification," in Proc. of ICDM, 2002.
[30] S. Baluja, "Browsing on small screens: Recasting web-page segmentation into an efficient machine learning framework," in Proc. of WWW, 2006, pp. 33–42.
[31] N. Milic-Frayling and R. Sommerer, "SmartView: Flexible viewing of web page contents," in Proc. of WWW, 2002.
[32] O. D. Bruijn, R. Spence, and M. Chong, "RSVP browser: Web browsing on small screen devices," Personal and Ubiquitous Computing, vol. 6, pp. 245–252, September 2002.
[33] Y. Borodin, J. Mahmud, and I. Ramakrishnan, "Context browsing with mobiles - when less is more," in Proc. of MobiSys, 2007, pp. 3–5.
[34] J. Mahmud, Y. Borodin, I. Ramakrishnan, and D. Das, "Combating information overload in non-visual web access using context," 2007, pp. 341–344.
[35] Y. Borodin, J. Mahmud, and I. Ramakrishnan, "CSurf: A context-driven non-visual web-browser," in Proc. of WWW, 2007, pp. 31–40.
[36] A. Leoncini, F. Sangiacomo, P. Gastaldo, and R. Zunino, "A semantic-based framework for summarization and page segmentation in web mining," InTech, 2012, pp. 75–100, DOI: 10.5772/51178.
[37] C. Yang and F. Wang, "Fractal summarization for mobile devices to access large documents on the web," in Proc. of WWW, 2003, pp. 215–224.
[38] O. Buyukkokten, H. Garcia-Molina, and A. Paepcke, "Accordion summarization for end-game browsing on pdas and cellular phones," in Proc. of CHI, 2001, pp. 213–220.
[39] ——, "Seeing the whole in parts: Text summarization for web browsing on handheld devices," in Proc. of WWW, 2001, pp. 652–662.
[40] O. Buyukkokten, H. Garcia-Molina, A. Paepcke, and T. Winograd, "Power browser: Efficient web browsing for pdas," in Proc. of CHI, 2000, pp. 430–437.
[41] S. Björk, L. Holmquist, J. Redström, I. Bretan, R. Danielsson, and J. Karlgren, "WEST: A web browser for small terminals," in Proc. of UIST, 1999, pp. 187–196.
[42] H. Lam and P. Baudisch, "Summary thumbnails: Readable overviews for small screen web browsers," in Proc. of CHI, 2005, pp. 689–690.
[43] S. Jones, M. Jones, and S. Deo, "Using keyphrases as search result surrogates on small screen devices," Personal and Ubiquitous Computing, vol. 8, no. 1, February 2004.
[44] X. Yin and W. Lee, "Using link analysis to improve layout on mobile devices," in Proc. of WWW, 2004.
[45] S. Gupta, G. Kaiser, D. Neistadt, and P. Grim, "DOM-based content extraction of HTML documents," in Proc. of WWW, 2003, pp. 207–214.
[46] T. Htwe and N. Kham, "Extracting data region in web page by removing noise using DOM and neural network," in Proc. of Information and Financial Engineering, 2011, pp. 123–128.