Using Access Information in the Dynamic Visualisation of Web Sites Bryan Wong and Gary Marsden CS00-18-00
Collaborative Visual Computing Laboratory Department of Computer Science University of Cape Town Private Bag, RONDEBOSCH 7701 South Africa e-mail:
[email protected],
[email protected]
Abstract With the rapid growth of eCommerce, more attention is being placed on measuring and evaluating Web site usage. The most common form of investigating site usage today is Web server log file analysis. However, products that offer log file analysis services tend to suffer from drawbacks such as static 2D displays and a lack of integration with site layout information. The purpose of this research is to develop a better visualisation of the statistics produced by Web server logs. The visualisation will overcome the limitations of current log analysis products by incorporating the underlying structure of the Web site. Keywords: World Wide Web, Log File Analysis, Information Visualisation
1 Introduction With the continuing growth of the Internet, an increasing number of organisations are incorporating the Web into their business activities. Whether a site hosts an online library, a government service or electronic commerce, investigating user visiting patterns is vital for evaluating that site's effectiveness. Understanding these patterns also aids in redesigning sites to improve their usability, thereby increasing their chances of attracting more users. As a result, there is a definite interest in analysing and evaluating Web site usage.

One method of measuring Web site usage is to examine the activity logs generated by Web servers. Accordingly, a number of products are available today which analyse server log statistics and produce summarised reports on their findings. While such reports are certainly useful, they are limited in that they generally take the form of static graphics such as tables, bar charts and pie charts. In addition, they bear no relation to the actual structure of the Web site itself. This is significant, as intimate knowledge of a site's structure is vital in correctly analysing log files.

The aim of this research is to develop an interactive visualisation of the statistics produced by Web server logs. The underlying structure of the Web site will be incorporated into the visualisation. In this manner the visualisation will address some of the limitations of current log analysis products, and thereby provide greater insight into the usage of a Web site.

The next section provides background information on log statistics as well as other means of determining usage information. Section 3 describes previous research efforts in this area, while section 4 gives a brief overview of this project. Finally, section 5 discusses possible future work.
2 Determining Usage Data for a Web Site Since the interest in the measurement and evaluation of Web site usage arose, a number of different mechanisms for obtaining site usage data have been developed. This section outlines a few of these.
2.1 Log File Analysis At present the most common method of measuring Web site usage is to analyse Web server log files. Log files are large text files generated by Web servers. They contain records of the activity that took place between a Web server and users' browsers during a particular time period. Log file analysis consists of parsing these files to extract useful information and then summarising it in reports.

2.1.1 Web Server Activity Logs
When a user visits a Web site, a connection is established between the Web server on which the site resides and the client browser of the user. Each communication between the browser and server, such as a request for a page, results in an entry being added to the server's log, recording the transaction. The data stored in a log file varies depending on the type of server being used and the log file formats that it supports. However, most log file formats share some common data, such as the address of the computer requesting the file and the date and time of the request.
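As a rough illustration of the shared fields described above, the sketch below parses an entry in the NCSA Common Log Format, one widely supported log format. The sample line and the field names are illustrative, not drawn from any particular server's log.

```python
import re
from datetime import datetime

# Regex for the NCSA Common Log Format: host, identity, user, timestamp,
# request line, status code and response size. Field names are illustrative.
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_entry(line):
    """Parse one Common Log Format entry into a dictionary, or None."""
    m = CLF_PATTERN.match(line)
    if m is None:
        return None
    entry = m.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

# A hypothetical log line for demonstration.
line = '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
print(parse_entry(line)["path"])  # /index.html
```

Other formats, such as the W3C extended format, carry additional fields (referrer, user agent), but the parsing approach is the same.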
2.1.2 Information that can be Inferred from Log Files
Log files contain a rich set of data that, when compiled and combined in various ways, can provide statistics describing the usage of a site. Statistics that can be derived directly from log files include:

- the number of requests made (commonly referred to as hits),
- the number of requests by type of file, such as HTML documents, JPG images, etc.,
- the distinct IP addresses served and the number of requests each made,
- the number of requests by domain suffix (derived from IP addresses),
- the number of requests for specific files or directories,
- the number of requests by HTTP status code (successful, failed, redirected, informational),
- the number and size of files successfully served,
- the URLs of the referring pages from which users came,
- the browser type and version making the requests, and
- the totals and averages for a specific time period.
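Several of the statistics listed above are simple aggregations over parsed entries. The sketch below illustrates this with a few counters; the dictionary keys and sample entries are illustrative assumptions.

```python
from collections import Counter

def summarise(entries):
    """Compute a few of the statistics listed above from parsed log entries.
    Each entry is assumed to be a dict with 'host', 'path' and 'status' keys."""
    hits = len(entries)                                    # total requests ("hits")
    by_host = Counter(e["host"] for e in entries)          # requests per IP address
    by_type = Counter(e["path"].rsplit(".", 1)[-1] for e in entries)  # by file type
    by_status = Counter(e["status"] for e in entries)      # by HTTP status code
    return {"hits": hits, "by_host": by_host,
            "by_type": by_type, "by_status": by_status}

# Hypothetical pre-parsed entries for demonstration.
entries = [
    {"host": "10.0.0.1", "path": "/index.html", "status": 200},
    {"host": "10.0.0.1", "path": "/logo.jpg", "status": 200},
    {"host": "10.0.0.2", "path": "/index.html", "status": 404},
]
stats = summarise(entries)
print(stats["hits"], stats["by_type"]["html"])  # 3 2
```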
2.1.3 Disadvantages of Log Files
While log file analysis does provide some measure of site usage, it suffers from several major flaws [8][9][14]. These are summarised as follows.

Since the advent and increasing use of caching, log files may no longer accurately report the true amount of activity for a Web site. This is because requests for a page that has been stored in a cache are not recorded in the server's log, as no request ever reaches the actual server.

Another drawback of log files is that they record data about files transferred between server and client, not information about the people visiting the Web site. This means that certain usage data is not logged, while other data that is logged is inherently incomplete. Data not captured in log files includes individuals' identities, the sites users visited after leaving a particular site, and any qualitative data such as user motivation for viewing a site and reactions to site content. Inherently incomplete data includes the number of requests and all other statistics based on that figure; this information is incomplete due to local and regional caching.

In addition, many commercial log analysis products employ complex heuristics in order to make educated guesses about information that is excluded from log files. However, not all of the inferences drawn in this manner are sound. One unsound inference is the notion that user sessions can be isolated and counted. Many log analyser products calculate user sessions by tracing requests received from a particular IP address until a sufficient period of inactivity suggests that the session has ended. This calculation rests on two unsound assumptions: first, that a host corresponds to an individual, and second, that the individual would not pause (to perform some other task) while within a site. As such, the many statistics provided by log analyser products that are based on user sessions are also unreliable. These include average page views per session, average session length, average length of a page view, top entry and exit pages, single-access pages and top paths through a site.
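To make the session heuristic concrete, the sketch below groups requests per host using an inactivity timeout. It deliberately embodies the two unsound assumptions just described; the 30-minute timeout is a common but arbitrary choice, and the data is hypothetical.

```python
from datetime import datetime, timedelta

def sessionise(requests, timeout_minutes=30):
    """Group (host, timestamp) pairs into 'sessions' per host, starting a
    new session after `timeout_minutes` of inactivity. This mirrors the
    unsound heuristic described above: it assumes one host equals one user
    and that users never pause mid-visit."""
    sessions = {}
    for host, ts in sorted(requests, key=lambda r: r[1]):
        host_sessions = sessions.setdefault(host, [])
        if host_sessions and ts - host_sessions[-1][-1] <= timedelta(minutes=timeout_minutes):
            host_sessions[-1].append(ts)   # continue the current session
        else:
            host_sessions.append([ts])     # inactivity exceeded: "new session"
    return sessions

reqs = [
    ("10.0.0.1", datetime(2000, 10, 10, 13, 0)),
    ("10.0.0.1", datetime(2000, 10, 10, 13, 10)),
    ("10.0.0.1", datetime(2000, 10, 10, 15, 0)),  # > 30 min gap
]
print(len(sessionise(reqs)["10.0.0.1"]))  # 2
```

A proxy serving many people, or one person pausing for lunch, would each distort these counts, which is exactly the unreliability the text describes.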
2.2 Other Approaches Methods other than log analysis are also employed to measure site usage. Each of these has its own benefits and weaknesses.

2.2.1 Qualitative Methods
These include qualitative methods of data collection, ranging from guest books and feedback forms to user surveys and focus groups. The advantage of these forms of data collection is that they capture information such as user opinions on site content, navigation and look-and-feel, as well as user satisfaction and motivations. A disadvantage of these methods is that many users are unwilling to participate in surveys or to fill out forms.

2.2.2 Using Information Scent
Approaches concerned more with predicting Web site usage than with displaying current usage information have also been proposed [7]. One such approach makes use of information scent, the “imperfect, subjective perception of the value, cost or access path of information sources obtained from proximal cues, such as Web links, or icons representing the content sources”. However, predicting user destinations through a site based on scent relies on being able to accurately predict users' information needs, which is no simple task.

2.2.3 Human Browsing
Another approach involves companies, such as Vividence [1], that employ people to browse Web sites. For a certain fee, these organisations gather and provide their employees' feedback concerning a site. While such an approach can provide different types of information than log file analysis, it also faces certain drawbacks. For instance, people who are browsing a Web site as part of their job may have very different interests and motivations from those who are browsing for leisure, business or information gathering.

2.2.4 Software Agents
An emerging approach is to employ software agents as surrogate users to traverse a Web site and determine usage information. Systems such as WebCriteria SiteProfile [5] use a browsing agent to traverse a Web site using a modified GOMS model [16], recording download times and other data. The problem with these systems is that they are limited to metrics such as load times and amounts of content versus hyperlink structure. Any system attempting to provide more information would have to show that its software agent has browsing patterns similar to a human's.
2.3 Method Chosen for this Project Log analysis was chosen as the initial approach for this project for several reasons. Firstly, log analysis is by far the most widely used method of determining usage information today, with over fifty commercial and freeware products offering analysis of log files. Secondly, since every Web server produces log files, data from such a source was readily and easily available. Thirdly, while log file analysis does suffer from various weaknesses, as discussed in section 2.1.3, improvements are being made. One such improvement is the introduction of cookies. Although cookies present problems of their own (such as privacy issues and users who refuse to accept them), they do increase the accuracy of user tracking, overcoming some of the complications caused by caching. Cookies also permit sounder estimates of user sessions. Finally, new proposals are being put forward to make usage data more reliable. Examples include hit-metering [11] and user sampling [14]:

Hit-metering proposes a new HTTP header, called “Meter”, which would enable proxy-caches to report usage and referral information to the origin server. Additional extensions could also permit the origin server to limit the number of times a proxy-cache returns a document before requesting a fresh copy.

User sampling involves continuously sampling a random set of users, identified using cookies. Once a user has been identified as part of the sample, caching is defeated for all subsequent requests by that user during the sampling period.
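One possible way to realise user sampling is sketched below: hashing a cookie identifier yields a stable pseudo-random sample assignment, and sampled users receive cache-defeating response headers. The hashing scheme, the 5% rate and the header values are illustrative assumptions on my part, not details of the proposal in [14].

```python
import hashlib

def in_sample(user_id, sample_rate=0.05):
    """Decide whether a cookie-identified user falls into the random sample.
    Hashing the cookie ID gives a stable, pseudo-random assignment, so the
    same user is sampled consistently across all of their requests."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32   # uniform in [0, 1)
    return bucket < sample_rate

def response_headers(user_id):
    """For sampled users, return headers that defeat caching so that every
    request by that user reaches the origin server and is logged."""
    if in_sample(user_id):
        return {"Cache-Control": "no-cache, no-store"}
    return {}

print(response_headers("cookie-1234"))
```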
3 Previous Work Previous work on log analysis can be categorised into one of two classes: products and research projects.
3.1 Products Products such as Webtrends Log Analyzer [6], NetTracker Enterprise [4] and Funnel Web Professional [2], to name a few, parse log files in order to produce output reports, generally presented as tables, histograms and pie charts. While the types of reports vary from product to product, they generally share certain attributes:

- Static 2D reports – output reports are commonly displayed as HTML pages without interactivity.
- Lack of drill-down capabilities – reports focus on aggregations, with limited (if any) support for direct examination of data related to individual page requests.
- Relative lack of flexibility – while most products have reports that are configurable to an extent, general customisation is limited.
- Lack of integration of site layout – no information regarding site layout is presented.

More recently, Microsoft’s Site Server [3] augmented the traditional types of reports with a 2D hyperbolic tree visualisation of a network of Web sites. The nodes of the tree, which represent pages, are colour-coded to reflect intensity of use.
3.2 Research Efforts One of the earliest attempts to visualise the statistics produced by web log analysis was that of Pitkow and Bharat [15]. Their tool, called WebViz, displays a web site as a directed graph, with the nodes of the graph representing separate documents and the links representing the hyperlinks between the documents. The colours and widths of the nodes and links could be scaled according to the recency or frequency with which a particular document or link was accessed. A weakness of this system is its susceptibility to clutter as the site being displayed increases in size. Hochheiser and Shneiderman [10] developed a visualisation of web log data using Spotfire. Their system consisted of a 2D matrix-like representation with one variable, such as URL requested, on the y-axis and another, such as time of request, on the x-axis. The presence and size of circles at the relevant points show correspondences between the two variables. A system that visualises the usage of a web site using information scent is that of Chi et al [7]. The two metaphors used are the disk tree, or disk map, and the dome tree. At the centre of the disk tree is the root node of a site, and successive levels of the tree are mapped to new rings expanding from the centre. A dome tree is essentially a disk tree in 3D, i.e. a disk tree mapped onto a 3D parabola.
4 Project Overview The material presented here is an overview of intended future work as well as an account of initial efforts.
4.1 Contribution Although the layout of a Web site is an important consideration when investigating that site’s usage [9], surprisingly little work has been conducted on integrating site structure and site usage statistics. While previous efforts each have their own merits, we believe that our system contains important differences. The early WebViz effort by Pitkow and Bharat [15] concentrated on only two aspects of usage statistics, namely recency and frequency of access. We plan to display most, if not all, of the statistics offered by current log analysis products. Pitkow and Bharat mention the discrepancy between the underlying topology of a Web site (likened to a cyclic graph) and the hierarchical file system on which that site resides. Hochheiser and Shneiderman’s [10] outline window, which displays an overview of a site, shows the hierarchical directory structure where the files are located, rather than the logical topology. In addition, no information beyond the directory structure is encoded in the outline window, so one is unable to determine areas of possible interest from viewing the overview of the site. Our system will provide an overview of the logical layout of the site, as opposed to the directory structure. Items such as number of hits and most popular paths will also be embedded in the overview display, allowing the user to identify regions of interest. Chi et al’s [7] work on the disk tree and dome tree metaphors involves the display of information scent, rather than log analysis statistics. While it is possible that scent will one day replace access statistics as the preferred metric of site usage, the use of log statistics is much more widespread at present. Finally, previous projects have concentrated on displaying the entire site layout and showing statistics produced by a particular element of the site in relation to the statistics of the entire site.
This is not always desirable or even useful. Thus an approach will be taken that shows data generated by a particular page in relation to only those pages that are relevant to it.
4.2 Approach We are currently developing an initial metaphor. After refinement, a period of prototyping will take place, both to evaluate the metaphor and to encourage possible amendments and improvements. Further effort will then be expended on other possible metaphors. Once the initial system has been implemented, a series of user tests will be conducted. Issues to be addressed by the system include:

- visualising the statistics produced by server logs while incorporating the site structure,
- visualising the structure of a (potentially) huge site,
- allowing a user to narrow their attention to a particular sub-branch of interest while still maintaining awareness of that sub-branch’s overall context,
- emphasising the insights inherent in the actual structure of the site, and
- making use of interaction in such a way as to improve the user’s understanding of the data being visualised.
4.3 Initial Metaphor The initial metaphor consists of two views, which are to be used in conjunction with each other. The first view displays the overall site layout, while the second represents a zoomed-in portion of the site. Note that these views are still being designed.

4.3.1 Site Layout View
The site layout view (Figure 1) is the initial view that the user sees and from which the zoomed view is accessed. It is designed to provide the user with the overall context of the subsection of the site currently shown in the zoomed view, as well as an idea of the layout of the entire site. The metaphor chosen for this view is that of a Windows-style directory tree listing. The home page of the site, along with those pages immediately accessible from it, are displayed as the first two levels of a tree. The user is able to expand the tree in order to view the pages on further levels. The page icons representing the individual pages are colour-coded on a grey scale according to a user-defined statistic (such as total hits recorded). In addition, the page titles themselves are shaded on a blue scale according to whether that page belongs to one of the top paths through the site. The titles are preceded by a number (or multiple numbers if the page belongs to multiple paths) indicating where exactly in that path the page appears.
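The tree structure and the grey-scale colour-coding described above can be sketched as follows. Deriving the hierarchy from URL path segments, and the particular shading formula, are assumptions for illustration, not the system's actual implementation.

```python
def build_tree(paths):
    """Build a nested dict representing the site hierarchy from URL paths.
    Treating the site as a tree is the same simplification the layout view makes."""
    root = {}
    for path in paths:
        node = root
        for part in path.strip("/").split("/"):
            node = node.setdefault(part, {})
    return root

def grey_shade(hits, max_hits):
    """Map a hit count to a 0-255 grey level (darker = more hits), as a simple
    stand-in for the colour-coding of page icons described above."""
    return 255 - round(255 * hits / max_hits) if max_hits else 255

tree = build_tree(["/products/a.html", "/products/b.html", "/about.html"])
print(tree)
print(grey_shade(40, 100))  # 153
```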
4.3.2 Zoomed View
The zoomed view (Figure 2) is activated by making a page in the site layout view active, which is achieved by dragging the required page into the active box. That page then appears in the zoomed view, along with the accompanying pages below it in the site tree. The number of accompanying levels of the tree that appear is determined by the user's settings. Individual statistics for each page are then displayed as bars appearing on the corresponding page. The heights of the bars indicate the magnitude of the statistics they represent. Once again, the statistic bars that appear are user-defined.

Figure 1: The Site Layout View.
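The mapping from statistic values to bar heights might be computed as below. Normalising against the largest value on display is an illustrative choice; the actual scaling used by the system is not specified in the text.

```python
def bar_heights(values, max_height=100):
    """Scale a page's statistic values to bar heights (e.g. in pixels), so
    that the largest value always uses the full height available."""
    peak = max(values) if values else 0
    return [round(max_height * v / peak) if peak else 0 for v in values]

# Hypothetical statistics for one page, e.g. hits, referrals, errors.
print(bar_heights([10, 40, 20]))  # [25, 100, 50]
```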
Figure 2: The Zoomed View.
4.3.3 Discussion of Metaphors
The choice of a directory listing as a metaphor implies that the site structure is displayed as a hierarchical tree, even though Web sites are closer in structure to directed cyclic graphs. This was deliberate, as it is not uncommon for designers and Web users to conceptualise a site hierarchically as a tree. Thus users who might be uncomfortable with a strange new metaphor representing their site should immediately recognise it when represented as a tree.

The directory listing metaphor has many attributes that make it attractive. First, it is able to represent a very large site, although the larger the site, the more scrolling will be required. Secondly, no matter what the magnitude of the site, the metaphor always occupies a fixed amount of screen real estate. In addition, it is not subject to potential cluttering, as the spacing between individual nodes (pages) remains constant. Finally, most users should be familiar with viewing their file directories using such a metaphor.

The site layout view does, however, suffer from certain drawbacks. These include its limited ability to draw the user's attention to interesting data in particular parts of the site, although the shading of the page icons to represent a single variable does alleviate this somewhat. In addition, the site layout view and the zoomed view are both very poor at displaying paths that return up the tree or travel to pages on the same level. Metaphors must therefore be found which can overcome these problems.
4.4 Possible Future Metaphors Although only preliminary work has been conducted on other metaphors to date, it is envisaged that the next metaphors to be developed will be adaptations of work carried out in visualising Web sites, in particular dome trees [7] and hyperbolic trees [13]. These new metaphors are not intended to displace the current metaphor, but to complement it.
5 Conclusions and Future Work Aside from developing additional metaphors, much more development of the current metaphor must take place. Additions to the current metaphor include introducing path information in the zoomed view, as well as displaying average statistics for the sub-branch currently being viewed. Furthermore, work is planned that will enable the user to view an animation of the variations in the site statistics over a particular time interval. A system will be developed to interactively visualise the usage of a Web site. The aim of the system is to aid site designers in understanding their users' access patterns in order to improve the design of the site. The usage data will be obtained from server activity log statistics, as this is the most common method of obtaining usage data at present. The structure of the site will be incorporated into the visualisation. It is hoped that this will result in improved insight into the access statistics produced by log files. Once the system has been implemented, user tests will be conducted to evaluate the effectiveness of the metaphors used.
References [1] Vividence. http://www.vividence.com. [2] Funnel Web Professional. http://www.activeconcepts.com/prod.html. [3] Microsoft Site Server 3.0. http://www.microsoft.com/siteserver/. [4] NetTracker Enterprise. http://www.sane.com/products/NetTracker. [5] WebCriteria SiteProfile. http://www.webcriteria.com. [6] Webtrends Log Analyzer. http://www.webtrends.com.
[7] E. H. Chi, P. Pirolli, and J. Pitkow. The scent of a site: A system for analyzing and predicting information scent, usage, and usability of a web site. In Proceedings of CHI 2000, pages 161–168, The Hague, The Netherlands, 2000. [8] J. Goldberg. On interpreting access statistics. http://www.cranfield.ac.uk/docs/stats. [9] S. Haigh and J. Megarity. Measuring web site usage: Log file analysis, 1998.
[10] H. Hochheiser and B. Shneiderman. Using interactive visualizations of WWW log data to characterize access patterns and inform site design. ftp://ftp.cs.umd.edu/pub/hcil/Reports-Abstracts-Bibliography/pdf/99-30.pdf. [11] J. Mogul and P. Leach. Simple hit-metering and usage-limiting for HTTP, 1997. RFC 2227. [12] S. Mukherjea, J. Foley, and S. Hudson. Visualizing complex hypermedia networks through multiple hierarchical views. In Proceedings of ACM SIGCHI ’95, May 1995. [13] T. Munzner. Exploring large graphs in 3D hyperbolic space. IEEE Computer Graphics and Applications, 18(4):18–23, July/August 1998. [14] J. Pitkow. In search of reliable usage data on the WWW. In Proceedings of the 6th International World Wide Web Conference, Santa Clara, CA, April 1997. [15] J. Pitkow and K. Bharat. WebViz: A tool for World-Wide Web access log visualization. In Proceedings of the First International World-Wide Web Conference, Geneva, Switzerland, May 1994. [16] S. K. Card, T. P. Moran, and A. Newell. The Psychology of Human-Computer Interaction. Lawrence Erlbaum, 1983.