
Area-Efficient Visualization of Web Data 



Vishal Anand, Keith Hansen, Radu Jianu and Adrian Rusu



Department of Computer Science State University of New York College at Brockport, Brockport, NY 14420 USA Email: [email protected] 

Department of Computer Science Rowan University, Glassboro, NJ 08028 USA Emails: [email protected], [email protected] 

Department of Computer Science Politehnica University of Timisoara, 1900 Timisoara, Romania Email: [email protected]

Abstract— With the explosion of the Internet the World Wide Web today has become an infinite source of information. Hence, it is important that one be able to categorize, understand and be able to “view” this data efficiently. In this paper we propose a new system for visualizing Web data efficiently. We assume that data on the Web can be categorized, for example depending on context or on the basis of a Web user’s visit-patterns to Web sites. We group this data in the form of a tree structure. Our system uses a novel area-efficient tree-visualization algorithm to visualize and present these large data sets in as small an area as possible, which is of importance due to the large amount of data that may have to be represented. We present a few exemplary situations in which our system can be used.

I. INTRODUCTION

The explosion of the Internet and the World Wide Web (WWW) is beyond any imagination. The Internet today is an infinite source of information consisting of a large amount of data typically presented as a collection of Web pages. To understand and be able to manage this information, data on the Web is usually categorized based on application, user requirement, and context. It is important that designers and users of Web sites (i.e., users of this data) be able to track their usage patterns and also recognize the structure in which Web data is organized. For example, this makes Web surfing, Web searching, and tracking user navigation patterns simpler. This is especially important for e-commerce Web site designers so that they design their Web sites appropriately and efficiently, and present the user with the required data fast and in an enticing manner.

In this work we present a system that can represent Web site structure, usage patterns, etc. as a tree. Our system uses a novel area-efficient tree drawing algorithm to display this information to Web users and Web site operators. Since the amount of information that may have to be represented can be huge, it is imperative that these visualizations (i.e. drawings) be presented in as small an area as possible. By being able to represent this large amount of data in an aesthetically pleasing

manner in a small area, our system gives Web users a better understanding of the information on the Web.

A. Background and terminology

In this section we present some of the definitions and background related to this work. A simple Jordan curve is a curve that does not cross itself. A drawing $\Gamma$ of a tree $T$ maps each node of $T$ to a distinct point in the plane, and each edge $(u, v)$ of $T$ to a simple Jordan curve with endpoints $u$ and $v$. $\Gamma$ is a straight-line drawing (see Figure 1(a)) if each edge is drawn as a single line-segment. $\Gamma$ is a polyline drawing (see Figure 1(b)) if each edge is drawn as a connected sequence of one or more line-segments, where the meeting point of consecutive line-segments is called a bend. $\Gamma$ is an orthogonal drawing (see Figure 1(c)) if each edge is drawn as a chain of alternating horizontal and vertical segments. $\Gamma$ is a planar drawing if edges do not intersect each other in the drawing (for example, the drawings (a), (b), and (c) in Figure 1 are planar drawings, and the drawing (d) is a non-planar drawing). $\Gamma$ is an upward drawing (see Figure 1(a,b)) if the parent is always assigned either the same or a higher $y$-coordinate than its children. $\Gamma$ is a grid drawing if all the nodes and edge-bends have integer coordinates.

In this paper, we concentrate on grid drawings. So, we will assume that the plane is covered by horizontal and vertical channels with unit distance between two consecutive channels. The meeting point of a horizontal and a vertical channel is called a grid-point. Let $R$ be a rectangle with sides parallel to the $x$- and $y$-axes, respectively. $R$ is the enclosing rectangle of $\Gamma$ if it is the smallest rectangle that covers the entire drawing. The area of a grid drawing is defined as the number of grid points contained in its enclosing rectangle. The aspect ratio of a grid drawing is defined as the ratio of the length of the longest side to the length of the shortest side of its enclosing rectangle. We denote by $T[v]$ the subtree of $T$ rooted at node $v$. $T[v]$ consists of $v$ and all the descendants of $v$. $\Gamma$ has the subtree-separation property [9] if, for any two node-disjoint subtrees $T[u]$ and $T[v]$ of $T$, the enclosing rectangles of the drawings of $T[u]$ and $T[v]$ do not overlap with each other.
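The definitions of enclosing rectangle, area, and aspect ratio, together with the rectangle test behind the subtree-separation property, can be made concrete with a short sketch. The Python code below is illustrative only: the function names are hypothetical, side lengths are assumed to be counted in grid points (so a one-row drawing still has a finite aspect ratio), and two enclosing rectangles are assumed to overlap as soon as they share a grid point.

```python
from typing import Iterable, Tuple

Point = Tuple[int, int]
Rect = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive


def enclosing_rectangle(points: Iterable[Point]) -> Rect:
    """Smallest axis-parallel rectangle covering all given grid points."""
    pts = list(points)
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    return min(xs), min(ys), max(xs), max(ys)


def side_lengths(rect: Rect) -> Tuple[int, int]:
    """Number of grid columns and rows spanned (assumption: counted in grid points)."""
    x0, y0, x1, y1 = rect
    return x1 - x0 + 1, y1 - y0 + 1


def grid_area(points: Iterable[Point]) -> int:
    """Area = number of grid points inside the enclosing rectangle."""
    cols, rows = side_lengths(enclosing_rectangle(points))
    return cols * rows


def aspect_ratio(points: Iterable[Point]) -> float:
    """Longest side divided by shortest side of the enclosing rectangle."""
    cols, rows = side_lengths(enclosing_rectangle(points))
    return max(cols, rows) / min(cols, rows)


def rectangles_overlap(a: Rect, b: Rect) -> bool:
    """True if two inclusive rectangles share at least one grid point."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1


# A root at (1, 1) with two children at (0, 0) and (2, 0).
drawing = [(1, 1), (0, 0), (2, 0)]
print(grid_area(drawing), aspect_ratio(drawing))
```

A drawing would satisfy the subtree-separation property under these assumptions when `rectangles_overlap` is false for the enclosing rectangles of every pair of node-disjoint subtrees.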

Fig. 1. Various kinds of drawings of the same tree: (a) straight-line, (b) polyline, (c) orthogonal, (d) non-planar. Also note that the drawings shown in (a) and (b) are upward drawings, whereas the drawings shown in (c) and (d) are not. The root of the tree is shown as a shaded circle, whereas other nodes are shown as black circles.

Planar drawings are normally easier to understand than non-planar drawings, i.e. drawings with edge-crossings. It is natural to draw each edge of a graph as a straight line between its end-vertices. Straight-line drawings are considered more aesthetically pleasing than polyline drawings. Experimental studies of the human perception of graph drawings have concluded that minimizing the number of bends increases the understandability of drawings of graphs [11], [12], [13]. Ideally, the drawings should have no edge crossings, i.e. they should be planar drawings, and should have no edge-bends, i.e. they should be straight-line drawings. The computer screen can be viewed as a grid of pixels placed at integer coordinates. It is therefore natural to consider grid drawings. Furthermore, we cannot meaningfully discuss the area of non-grid drawings (i.e. drawings that have the nodes placed at real coordinates), since, by placing the nodes closer together or farther apart, such a drawing can be scaled down or up by any value. Grid drawings guarantee at least unit distance separation between the nodes of the tree, and the integer coordinates of the nodes and edge-bends allow the drawings to be displayed on a (large-enough) grid-based display surface, such as a computer screen, without any distortions due to truncation and round-off errors.

Focus+context [10] is a style in which part of the information is presented in detail (the focus) while the rest is still available, but at a smaller size (the context). The subtree-separation property allows for a focus+context style rendering of the drawing, so that if the tree has too many nodes to fit in the given drawing area, then the subtrees closer to the focus can be shown in detail, whereas those further away from the focus can be contracted and simply shown as filled-in rectangles. Giving users control over the aspect ratio of a drawing allows them to display the drawing on different kinds of display surfaces with different aspect ratios. Finally, it is important to minimize the area of a drawing, so that the users can display a tree in as small a drawing area as possible. In addition, drawings with small area can be drawn with greater resolution on a fixed-size page. The optimal use of screen space is achieved by minimizing the area of the drawing and by providing a user-controlled aspect ratio.

B. Organization of the paper

The rest of the paper is organized as follows. After motivating the need for area-efficient representation of Web data and providing some background and terminology in this section, we present our system in detail in Section II. We present and make references to related work throughout the paper as and when deemed necessary. In Section III we present some example applications of our system. We note that, due to limited space, this list is in no way comprehensive and is intended only to give the readers a taste of the numerous applications of our system. Finally, we conclude this paper with a summary of its major contributions and some of our future work in Section IV.

II. OUR SYSTEM

We now introduce our graph display system. In this work we investigate the problem of constructing area-efficient visualizations of Web data. We organize the Web data into a hierarchical structure, and then display this structure in an aesthetically pleasing manner. The drawings produced by our system have the following characteristics: no overlapping node images, planar (no two edges intersect), straight-line (bend-free), subtree-separation property, user-defined aspect ratio, and the smallest area possible. Figure 2(a) shows a drawing of a complete binary tree with 63 nodes constructed by our system, with an aspect ratio equal to 1. Figure 2(b) shows a drawing of a complete binary tree with 63 nodes, constructed by our system, with an aspect ratio equal to 0.28.

A long-standing, fundamental question has been whether, given a binary tree $T$, we can construct a planar straight-line grid drawing of $T$ within an optimal linear area. Recently, the result in [4] has answered this question in the affirmative.

Fig. 2. Drawings of the complete binary tree with 63 nodes constructed by our system: (a) aspect ratio 1; (b) aspect ratio 0.28.

Moreover, the drawing produced by the algorithm of [4] allows for a user-defined aspect ratio. Furthermore, the drawing also exhibits the subtree-separation property. However, trees with degree greater than 3 appear quite commonly in practical applications. Hence, an important natural question arises: can the result of [4] be generalized to higher-degree trees? In [6], the authors give a partial answer to this question, by giving an algorithm that constructs a planar straight-line grid drawing of an $n$-node tree with a very large degree (more than is needed for most practical applications), namely $O(n^{\delta})$, where $0 \le \delta < 1/2$ is any constant, in linear area.

Even though the algorithms provided in [4] and [6] are optimal in the worst case, they are not suitable for practical use. The main problem is that the constant $c$ hidden in the "Oh" notation for area is quite large (for example, $c$ can be as high as 3,900). One may argue that $c$ is really a worst-case bound, and the algorithms might perform better in practice. However, the problem is that given a tree $T$ with $n$ nodes, the algorithms will always pre-allocate a rectangle $R$ with size exactly equal to $cn$, and draw $T$ within $R$. Thus, the area of $R$ is always equal to the worst-case area, and correspondingly, the drawing also has a large area. In [5], the authors make several practical improvements to the algorithm in [4], which make it very suitable for practical use. Their experiments show that it constructs area-efficient drawings in practice, with area at most 8 times the number of nodes for complete binary trees, and at most 10 times the number of nodes for randomly-generated binary trees. The result in [5] can be extended to general trees with similar practical results. Moreover, the algorithms constructed in [4], [6], and [5] are time-efficient, with a running time of $O(n \log n)$.

Let $T$ be a binary tree with link node $u^*$. Let $n$ be the number of nodes in $T$. Let $A$ and $\epsilon$ be two numbers, where $\epsilon$ is a constant, such that $0 < \epsilon < 1$ and $n^{-\epsilon} \le A \le n^{\epsilon}$. $A$ is called the desirable aspect ratio for $T$. The tree drawing algorithm of [5], which is the underlying algorithm for our system, takes $\epsilon$, $A$, and $T$ as input, and constructs an area-efficient planar straight-line grid drawing $\Gamma$ of $T$. Remark: The drawing $\Gamma$ may not have an aspect ratio exactly equal to $A$, but it will be very close to $A$.

In order to work, the algorithm of [5] needs to know the tree $T$ in advance. Since it would be unrealistic to draw the entire structure of the Web for each user, we need a way of deciding when to stop. By using clustering techniques similar to those introduced in [3], we could determine in advance the structure of the site that the user is visiting. By setting a threshold on the number of nodes in the tree (i.e. the number of Web-links), and by giving the user control over this threshold, we let the user choose the amount of information he/she needs. If the user decides that he/she needs more information than what was initially requested, the user will have the option of deriving more information, starting from any leaf in the displayed tree. Therefore, the user will have full control over the information that will be provided to him/her. The parameters that the user will be asked to enter into the system are $\epsilon$, $A$, and $n$, where $0 < \epsilon < 1$, $n$ is the number of Web-links that will be displayed, and $n^{-\epsilon} \le A \le n^{\epsilon}$.

When the user places the mouse cursor over a node, its label will be displayed in a special position on the screen (see Figure 3(a)).
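The three user-supplied parameters are tied together by the range condition on the desirable aspect ratio. As a minimal sketch, assuming the constraints are $0 < \epsilon < 1$ and $n^{-\epsilon} \le A \le n^{\epsilon}$, a front end could check the input before invoking the drawing algorithm. The function name `check_parameters` and its error messages are hypothetical and are not part of the system described here.

```python
from typing import Tuple


def check_parameters(epsilon: float, aspect_ratio: float, n: int) -> Tuple[float, float]:
    """Validate (epsilon, A, n) and return the admissible range for A.

    Assumed constraints: 0 < epsilon < 1 and n**(-epsilon) <= A <= n**epsilon,
    where n is the number of Web-links (tree nodes) to display.
    """
    if n < 1:
        raise ValueError("n must be a positive number of nodes")
    if not 0 < epsilon < 1:
        raise ValueError("epsilon must lie strictly between 0 and 1")
    lowest, highest = n ** -epsilon, n ** epsilon
    if not lowest <= aspect_ratio <= highest:
        raise ValueError(
            f"aspect ratio must lie in [{lowest:.3f}, {highest:.3f}] for n={n}"
        )
    return lowest, highest


# The two aspect ratios quoted for the 63-node tree both pass with epsilon = 0.5.
for a in (1.0, 0.28):
    print(a, check_parameters(0.5, a, 63))
```

With $n = 63$ and $\epsilon = 0.5$ the admissible range is roughly $[0.126, 7.937]$, so a requested aspect ratio of 10 would be rejected under these assumptions.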
We chose to display the label in a predefined position so that the label does not overlap with the existing drawing and, at the same time, the user has easy and fast access to this information. Our system also gives the user the option of displaying more than one label at a time, if he/she needs this information. For example, in Figure 3(b), the user has decided to display three labels at the same time. The drawing has shifted, so that the labels do not overlap existing parts. Our system first calculates the length of the label to ensure that the label is displayed fully, while at the same time valuable screen space is not wasted. The user can make any number of labels appear and disappear by direct manipulation, i.e. by clicking on the corresponding nodes. For example, in Figure 3(c), the user has decided to hide one of the existing labels.

Fig. 3. Placement of labels on a drawing of the complete binary tree with 63 nodes. (a) The label corresponding to the node pointed to by the user appears in the top-right corner of the screen. (b) The user has decided to look at three labels. (c) The user has removed one of the displayed labels.

III. APPLICATIONS

In this section, we present three representative applications of our system related to Web mining.

A. Web Graph Visualization

Graphs are suitable for World Wide Web navigation. Nodes in the graph can be used to represent URLs, and edges between nodes represent links between URLs. We can look at the entire cyberspace as one huge and dynamically growing graph. However, it is impossible to display this huge graph on the computer screen. The Web graph has recently been used to model the link structure of the Web. The study of such graphs can yield valuable insights into Web algorithms for browsing, searching, and discovery of Web communities. Our system can be applied to efficiently visualize the Web graph.

So far, most research has focused on site mapping methods [8], [7], in an attempt to find an effective way of constructing a structured geometrical map for a single Web site (a local map). This can guide the user through only a very limited region of cyberspace, and does not help the user in his/her overall journey through cyberspace. In [3], the authors present a system for visualizing Web graphs, which incrementally calculates and maintains the visualization of a small subset of cyberspace on-line, corresponding to the change in the user's focus. When the user clicks on a node of the displayed graph, the links that are clustered under that node are displayed. For example, Figure 4(a) shows the result after the user clicks on four nodes in order.

Fig. 4. (a) The behavior of the visualization system in [3] after the user clicks on four nodes in order. (b) The behavior of the system after the user clicks on another node.

As can be seen in Figure 4(b), when the user clicks on a node, the drawings of the links that are clustered under it may overlap a part of the existing drawing. In this case, the layout algorithm of [3] automatically makes the overlapping parts of the drawing invisible. Since this operation is performed automatically, the user does not have any option but to accept the result. The user does not have the option of knowing if the page that is being visited has been visited before. This may confuse the user, as his/her mental map is not preserved.

It is also unclear what will happen if the node that the user clicks has clustered under it a node that is already in the drawing. In Figure 4(b), for example, if a node that is already displayed is clustered under the clicked node, it is not clear how that node will be represented, i.e., will it appear again, or will there be an edge back to its earlier occurrence (we note that since this is the Web graph, such a situation may occur)? A natural assumption is that the node will be displayed again as a node clustered under the clicked node. The drawing layout will therefore be structured as a tree. Since the user does not have the option of knowing if a page has already been visited, it is very possible that the user might end up navigating data in cycles, oblivious to the occurrence of cycles. Note that such cycles can also occur in our system. However, because our system typically displays vastly more data to the users, there is a high probability that the users will detect the occurrence of cycles.

Furthermore, the edges in the drawings of [3] have bends and intersect, i.e., the drawings are neither planar nor straight-line, which makes them difficult to follow if a large number of nodes need to be displayed at the same time. Placing all the labels on the drawing occupies a lot of space. Instead, we place labels only on the nodes that the user has an interest in remembering. In this way, we can display more useful information on the screen. In addition, the approach in [3] does not give the user any information about what to expect, i.e. the user discovers the information step by step.
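The idea of showing the Web graph as a tree, bounded by a user-chosen number of nodes and with already-seen pages drawn again rather than linked back, can be sketched as a breadth-first unfolding. The Python code below is an illustration under stated assumptions, not the system's implementation: the link structure is given as an in-memory dictionary, the names `TreeNode` and `unfold_to_tree` are hypothetical, and the `is_repeat` flag (which also stops expansion below a repeated page) is an illustrative addition for making cycles visible.

```python
from collections import deque
from typing import Dict, List, NamedTuple, Optional


class TreeNode(NamedTuple):
    node_id: int
    url: str
    parent_id: Optional[int]
    is_repeat: bool  # True if this URL already appears elsewhere in the tree


def unfold_to_tree(links: Dict[str, List[str]], root: str, max_nodes: int) -> List[TreeNode]:
    """Unfold a link graph into a tree, breadth-first, with at most max_nodes nodes."""
    nodes = [TreeNode(0, root, None, False)]
    seen = {root}
    queue = deque([0])
    while queue and len(nodes) < max_nodes:
        parent_id = queue.popleft()
        for url in links.get(nodes[parent_id].url, []):
            if len(nodes) >= max_nodes:
                break
            repeat = url in seen
            node = TreeNode(len(nodes), url, parent_id, repeat)
            nodes.append(node)
            if not repeat:
                # Only first occurrences are expanded, so a cycle cannot unfold forever.
                seen.add(url)
                queue.append(node.node_id)
    return nodes


# Hypothetical four-page site with a link back to the start page.
site = {"a": ["b", "c"], "b": ["a", "d"], "c": ["d"], "d": []}
for node in unfold_to_tree(site, "a", max_nodes=10):
    print(node)
```

Because every node has exactly one parent, the result is a tree that could be handed to a tree-drawing algorithm; raising `max_nodes` corresponds to the user asking for more information.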







B. Web Tree Visualization

Although data structures such as graphs may be most appropriate to represent the data on the Web, most of the data on the Web can be organized into simpler data structures such as trees, which are also easier to analyze. As an example, see "The Tree of Life Web Project" [2], a collaborative Web project that traces evolutionary history, providing information about the diversity of organisms on Earth, their history, and characteristics. The project consists of information organized into more than 2,600 Web pages, where each page contains information about one group of organisms; for example, the Fungi page contains information about fungi. Individual Tree of Life pages are linked one to another in the form of the evolutionary tree that connects all organisms, with the pages branching off from a group's page being about subgroups; for example, the links from the page on frogs lead to pages on individual families of frogs, and eventually to some individual species of frogs. Once again, our system can be used to draw and visualize such a large database of information as a tree efficiently in a small area. Note that the advantage of organizing data as a tree is that, due to the inherent nature of the organization of the data, no cycles can occur. Thus our system can be used without difficulty for the visualization of data organized as a tree.

C. Web Log Visualization

The Internet today is an infinite source of information consisting of a collection of Web pages or sites characterized by randomness and fluidity. These pages are randomly organized, with no master catalog or index. Pages change over time, becoming obsolete, changing content, appearing and disappearing. Chasing these pages and their contents can become a formidable task. On the other hand, a good understanding of Web user visit patterns (e.g. site and content usage) is essential for effective site design by Web site operators to maximize the performance of their Web sites, especially in e-commerce businesses. For example, measuring the rate at which users view product information, add products to their shopping carts, make purchases, and compare with products on other Web sites can potentially be used to design Web sites that enhance the productivity of electronic commerce businesses. In particular, HTTP server log files provide Web site operators with substantial details regarding the visitors to their sites. Hence, effectively interpreting this data, e.g. through software packages that summarize and analyze it by providing histograms, pie charts, and other aggregate views, is an important research topic today.
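One common way to turn a visitor's click sequence from a server log into forward paths is the "maximal forward reference" idea of [17]: a path is emitted whenever the visitor goes back to a page already on the current path. The sketch below is a simplified illustration of that technique, assuming the log has already been split into one page sequence per visitor; the function names are hypothetical, and the prefix tree built at the end is just one possible way to organize the resulting paths as a tree.

```python
from typing import Dict, List


def maximal_forward_references(clicks: List[str]) -> List[List[str]]:
    """Split one visitor's click sequence into maximal forward paths."""
    path: List[str] = []
    results: List[List[str]] = []
    moving_forward = False
    for page in clicks:
        if page in path:
            # Backward move: the forward path built so far is maximal.
            if moving_forward:
                results.append(list(path))
            del path[path.index(page) + 1:]  # backtrack to the revisited page
            moving_forward = False
        else:
            path.append(page)
            moving_forward = True
    if moving_forward:
        results.append(list(path))
    return results


def paths_to_tree(paths: List[List[str]]) -> Dict[str, dict]:
    """Merge paths that share a prefix into a nested-dictionary tree."""
    root: Dict[str, dict] = {}
    for path in paths:
        node = root
        for page in path:
            node = node.setdefault(page, {})
    return root


def count_nodes(tree: Dict[str, dict]) -> int:
    return sum(1 + count_nodes(child) for child in tree.values())


session = list("ABCDCBEGHGWAOUOV")
paths = maximal_forward_references(session)
print(["".join(p) for p in paths])  # ['ABCD', 'ABEGH', 'ABEGW', 'AOU', 'AOV']
print(count_nodes(paths_to_tree(paths)))  # 11
```

The merged tree has one node per distinct path prefix, so paths that begin with the same pages share their common part when drawn.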
In particular, reports such as example visits and document trails, which describe the traversal path patterns that have been taken through sites, provide user-level information that can predict where the users are going and what they are seeking, and help in the construction and maintenance of Web servers that can tailor their design to satisfy users' needs. Traversal path pattern mining is based on the availability of traversal paths that can be obtained from Web logs. A typical traversal path pattern of a user is the longest consecutive sequence of Web pages visited (for example, by using links) by the user without revisiting some previously visited page in the sequence. Efficient algorithms for the mining of paths, such as FullScan, SelectiveScan, the Ukkonen algorithm, Sorting Based Suffix Tree Miner (SbSfx Miner), and Hashing Based Suffix Tree Miner (HbSfx Miner), have been developed by various researchers [14], [15], [16], [17], [18].

Figure 5 shows different visualizations of a traversal path pattern that is organized as a binary tree with 31 nodes. Note that since the tree cannot be visualized in a single page by using the algorithm of [1], the user has to scroll the screen (see Figure 5(a) and (b)). On the other hand, our system, using the algorithm of [5], easily allows the visualization of the whole tree (see Figure 5(c)). Data structures such as linked lists may be used to store the results of these mining algorithms in an organized manner. To effectively use this data, it is important that Web server operators and designers be able to easily understand, for example, the relationships between different pages. Hence efficient visualization of this data is of utmost importance. Our system ensures that the traversal path patterns of users (which may be obtained from Web logs) can be drawn in the smallest possible area in the form of a tree. Since our system's underlying algorithm is area-efficient and produces aesthetically pleasing drawings, it is often possible for all of the traversal path patterns of a user (or of many different users) to be visualized by an operator without having to scroll and move between pages, thus enabling the designer to maintain continuity and logical associations between a large number of pages.

Fig. 5. Two different visualizations of the same traversal path patterns in Web logs, consisting of a complete binary tree with 31 nodes. The drawings in (a) and (b) have been constructed by the algorithm of [1] and represent the same tree, which could not be visualized in a single page. The drawing in (c) has been constructed by the algorithm of [5], which is the underlying algorithm for our system. Note that this allows the visualization of the whole tree in a single page.

IV. CONCLUSION AND FUTURE WORK

In this paper we have presented a system for efficient visualization and representation of large amounts of Web data, not only in an aesthetically pleasing manner but also in as small an area as possible. This is especially important to Web users and Web site operators, as the amount of data that may have to be represented can be very large, and it allows them to navigate and design Web sites to suit user needs more efficiently. Our system is based on a novel algorithm that can draw trees in the smallest possible area. At this stage, our algorithm is only implemented for binary trees. We are working towards extending our implementation to general trees. The next step would be to test our algorithm with real data, in each of its applications. Also, we plan to experiment with different thresholds, to see if a certain threshold will be more valuable for the user, so that when the user is not sure which threshold to set, the system can make an appropriate choice for the user. We also plan to extend our system so that each node in the Web tree is linked to a URL, for example to support the ShowPage mode feature [3]. The user will then have the option of displaying the detailed Web page content associated with a node in the Web graph.

REFERENCES

[1] E. Reingold, J. Tilford. Tidier drawings of trees. IEEE Transactions on Software Engineering, 7(2):223-228, 1981.
[2] http://tolweb.org/tree/phylogeny.html
[3] X. Huang, W. Lai. Web graph clustering for displays and navigation of cyberspace. In A. Scime (ed.), Web Mining: Applications and Techniques, Idea Group Inc., 2005. (to appear)
[4] A. Garg, A. Rusu. Straight-line drawings of binary trees with linear area and arbitrary aspect ratio. Proceedings 10th International Symposium on Graph Drawing, volume 2528 of Lecture Notes Comput. Sci., pages 320-331, Springer, 2002.
[5] A. Garg, A. Rusu. A more practical algorithm for drawing binary trees in linear area with arbitrary aspect ratio. Proceedings 11th International Symposium on Graph Drawing, to appear.
[6] A. Garg, A. Rusu. Straight-line drawings of general trees with linear area and arbitrary aspect ratio. Proceedings 2003 International Conference on Computational Science and Its Applications, volume 2669 of Lecture Notes Comput. Sci., pages 876-885, Springer, 2003.
[7] Y.S. Maarek, I.Z.B. Shaul. WebCutter: A system for dynamic and tailorable site mapping. Proceedings 6th International WWW Conference, 713-722, 1997.
[8] Y. Chen, K. Eleftherios. WebCiao: A Website Visualization and Tracking System. Proceedings of WebNet 97, Toronto, Canada, October, 1997.
[9] T. Chan, M. Goodrich, S. Rao Kosaraju, R. Tamassia. Optimizing area and aspect ratio in straight-line orthogonal tree drawings. Computational Geometry: Theory and Applications, 23:153-162, 2002.
[10] M. Sarkar, M.H. Brown. Graphical fisheye views. Commun. ACM, 37(12):73-84, 1994.
[11] H.C. Purchase, R.F. Cohen, M.I. James. An experimental study of the basis for graph drawing algorithms. ACM J. Experim. Algorithmics, 2(4), 1997.
[12] H.C. Purchase. Which aesthetic has the greatest effect on human understanding? Proc. Graph Drawing '97, volume 1353 of Lecture Notes Comput. Sci., pages 248-261, Springer-Verlag, 1997.
[13] R. Tamassia, G. Di Battista, C. Batini. Automatic graph drawing and readability of diagrams. IEEE Trans. Syst. Man Cybern., SMC-18(1):61-79, 1988.
[14] Z. Chen, R.H. Fowler, A. Fu. Linear time algorithms for finding maximal forward references. Proc. of the IEEE Intl. Conf. on Info. Tech.: Coding and Computing, 160-164, 2003.
[15] Z. Chen, R.H. Fowler, A. Fu, C. Wang. Fast construction of generalized suffix trees over a very large alphabet. Proc. of the Ninth Intl. Computing and Combinatorics Conference, Lecture Notes in Computer Science LNCS 2697, 284-293, 2003.
[16] Z. Chen, R.H. Fowler, A. Fu, C. Wang. Linear and sublinear time algorithms for mining frequent traversal path patterns from very large Web logs. Proc. of the Seventh Intl. Database Engineering and Applications Symposium, 2003.
[17] M.S. Chen, J.S. Park, P.S. Yu. Efficient data mining for path traversal patterns. IEEE Transactions on Knowledge and Data Engineering, 10(2):209-221, 1998.
[18] E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249-260, 1995.