Multiform Glyph Based Web Search Result Visualization

4 downloads 145 Views 1MB Size Report
Keywords--- visual web mining, data mining, information visualization, web usage patterns, radial tree layout, user testing. 1. Introduction. Web site design ...
Visual Web Mining of Organizational Web Sites C Oosthuizen, J Wesson, C Cilliers Nelson Mandela Metropolitan University Port Elizabeth, SOUTH AFRICA {Craig.Oosthuizen, Janet.Wesson, Charmain.Cillers}@nmmu.ac.za Abstract Existing web usage mining (WUM) tools do not indicate which data mining algorithms are used or provide effective graphical visualizations of the results obtained. WUM techniques can be used to determine typical navigation patterns in an organizational web site. The process of combining WUM and information visualization techniques in order to discover useful information about web usage patterns is called visual web mining. The goal of this paper is to discuss the development of a visual web mining prototype, called WebPatterns, which allows the user to effectively visualize web usage patterns. Keywords--- visual web mining, data mining, information visualization, web usage patterns, radial tree layout, user testing.

1. Introduction Web site design should incorporate human-computer interaction principles to help visitors find information. By comparing the typical navigation patterns with the site usage expected by the site designer, the structure (information architecture) of the site can be evaluated and suggestions given for its improvement [1]. An organizational web site can be described as a site which has a high level of content. The Nelson Mandela Metropolitan University (NMMU) Computer Science and Information Systems Departmental web site is an example of such a web site. Web usage mining (WUM) techniques can be used to analyze the large amounts of web usage data collected by the web server in order to determine typical navigation patterns. This paper discusses the goals of WUM and the steps involved. A brief description of WUM is given as well as the data to be analyzed. The data mining algorithms used are discussed and appropriate

Proceedings of the Information Visualization (IV’06) 0-7695-2602-0/06 $20.00 © 2006

IEEE

visualization techniques proposed for each algorithm. Based on these visualization techniques, the design and evaluation of an interactive prototype for visual web mining is presented.

2. Web Usage Mining The purpose of WUM is to reveal the knowledge hidden in the log files of a web server [8]. WUM can be broken down into three main phases, namely preprocessing, pattern discovery and pattern analysis (Figure 1). In the first phase, log files are preprocessed in order to retain only the appropriate information. In the second phase, various methods are used to identify interesting patterns. These patterns are then stored so that they can be analyzed in the third phase of WUM, to determine which interesting patterns. Preprocessing involves identifying different sessions, users, pageviews and clickstreams [7]. Part of the preprocessing phase includes cleaning the server log to eliminate all of the irrelevant items [4]. The log file also needs to be parsed into data fields. Pageview identification determines which page file requests form part of the same pageview and what content was provided. Pattern discovery relies on various statistical methods and data mining algorithms to detect interesting patterns [8]. Quantitative statistical methods are the easiest to apply and allow one to determine information such as the frequency of visits to a page, average path length and the average view time of a page. Some of the more useful and appropriate data mining algorithms used for pattern discovery are sequential patterns, association rules and cluster analysis. Pattern analysis involves the evaluation of each of the patterns identified in the pattern discovery phase and deriving conclusions from them.

Figure 1. Web Usage Mining Process [5]

3. Web Site Structure Although there are many tools available to aid WUM, this form of web site analysis has shown limited success. This has been attributed to the fact that the need to understand the web site’s content and structure is often overlooked [5]. The structure of the web site is required to identify potentially interesting navigation patterns. This enables the evaluator to interpret each navigation pattern in terms of the specific web site content. The structure of a web site is created by the hyperlinks between various pageviews. These hyperlinks connect the individual web pages and provide a means for the user to navigate through the site. This path should allow the user to access the content they are looking for in the fewest possible mouse clicks. The ‘three-click’ rule states that any page within a site should be no more than three clicks from the homepage [10]. Although this rule is not an official standard, it can serve as a guideline and assist in developing a successful navigation structure for a web site. The structure of the site and preprocessing the content are interrelated tasks [5]. During a user’s navigation session, all activity on the web site is recorded in a log file by the web server. The log file does not differentiate between various sessions from different users. It is simply a text file which captures each access by a user on a particular day. The fields that were identified as necessary for the analysis of web usage patterns are: date, time, IP address, URL, user name and user agent.

4. Web Usage Mining Algorithms In order to extract any usage patterns from the log file, various data mining techniques need to be applied to the data. Data mining techniques are divided into two categories, namely supervised and unsupervised techniques [2]. In unsupervised techniques, there is no particular reason or goal for creating the model.

Proceedings of the Information Visualization (IV’06) 0-7695-2602-0/06 $20.00 © 2006

IEEE

Supervised techniques are generally used for prediction. Unsupervised techniques include association rules, sequence analysis and cluster analysis, of which association rules and sequence analysis are discussed below.

4.1 Association Rules The purpose of association rule mining is to discover relationships between items found in a database of transactions [2]. In terms of WUM, a transaction is described as a collection of time-ordered web page accesses or items, with an item being a single page access. An association would then imply that certain web pages within a web site are frequently accessed in different user sessions. Association rules could also be used to discover relationships between various usage paths, which indicate that certain paths are followed frequently in different sessions. Association rule mining is a two step process. The first step is to find all frequent itemsets (sets of web page accesses) which occur at least as frequently as a predetermined minimum support (frequency). The second step is to generate strong association rules from the frequent itemsets which satisfy both the minimum support and minimum confidence [5], where confidence is the percentage of items which satisfy all the conditions in the association rule.

4.2 Sequence Analysis Sequence analysis is the process of determining the longest time ordered paths that satisfy a user specified minimum frequency. In the case of WUM, sequences are the web usage paths. Since the log files represent a user’s interaction on a web site, the goal is to discover patterns in the form of sequences [3]. A user supports a sequence s if s is contained in a session for this user. The problem involved with mining sequential patterns is to find the maximal sequences

among all sequences that have a certain user-specified minimum support. Sequences satisfying the minimum support are called large sequences.

5. Visualizing Web Usage Mining Results Information visualization (IV) is the process of creating visual interfaces to help users understand and navigate through complex information spaces [7]. The data that is visualized in WUM is web usage data, which consists of sets of URLs and web usage paths (Section 4). Another factor that has to be considered is the importance of showing the structure of the web site being analyzed, and not only the usage data. A problem with visualizing web site structures is that they are not necessarily hierarchical in nature. Due to the many hyperlinks that may exist, the structure may represent a connected graph rather than a tree. One way in which this problem is addressed is by looking at the usage data when drawing the structure diagram. For each web page within the web site, only one link leading into it is displayed. This is the link that is followed with the highest frequency. By doing this, all the web pages are displayed but the connected graph is reduced to a hierarchical structure. A popular way to visualize the structure of hierarchically organized sites employs a tree metaphor [4, 7]. The trees are frequently laid out radially, with the root page in the middle and pages at various depths positioned in circles around the root node. The radial tree layout utilizes screen real estate more efficiently than a vertical layout. Straightforward implementations of tree layout algorithms work well for relatively small sites with tens to hundreds of pages. Most organizational web sites can be classified as relatively small so this technique was chosen to visualize the static web site structure.

5.1 Visualizing Association Since the purpose of association rules is to show that two or more web pages or web paths are accessed together frequently within a user session, the visualization technique needs to be able to show all the associated web pages at once. A disk tree or radial tree can be used to represent hierarchical information as shown in Figure 2. The primary node or home page is located in the centre of the graph. Each successive descendant falls on concentric rings spanning out from the centre. On each of these rings, the nodes are allocated space according to how many leaf nodes fall under it [7]. Nodes with a large number of descendants are allocated more space than those with fewer descendants. The advantage of the disk tree is that the area in which the tree is displayed is used

Proceedings of the Information Visualization (IV’06) 0-7695-2602-0/06 $20.00 © 2006

IEEE

more efficiently than a hierarchical tree. This allows a larger tree with more levels to be displayed in a smaller space. In order to show the association between various page accesses or usage paths, visual cues can be used. For instance, pages or paths which are associated can be coloured similarly. Three associations are shown in Figure 2, labelled A, B and C. Each association is coloured differently. These indicate that the web pages within each association are accessed frequently together.

Figure 2. Visualizing Association using a Radial Tree

5.2 Visualizing Sequences When visualizing user navigation sequences it is also essential to show the structure of the web site. Radial trees were therefore also selected as an appropriate visualization technique. Navigation sequences, however, not only indicate which pages were accessed but also include the order in which they were visited. This means that the start and end points need to be represented using directed edges. Navigation sequences can be visualized using radial trees by colouring all the pages visited in a specific sequence using the same colour and then using directed arrows to show the order in which they were visited. Figure 3 indicates how a radial tree can be used to visualize user navigation sequences. An example of two separate sequences is shown in Figure 3, the first sequence showing a path from 0 to 1 to 2, and the second showing a path from 0 to 3 to 4. These two sequences represent the most popular paths accessed by the users.

Figure 3. Visualizing Sequences using Radial Trees

6. User Interface Design An iterative design approach was followed to design the user interface (UI) of the prototype. Conceptual model extraction was used to obtain feedback from users and improve the UI design. The UI was designed to enable the user to view the web usage patterns over a specified period and consisted of three coordinated views (Figure 4). The three views

displayed are the graphical view on the left which shows the visualization of the WUM results; the textual view at the bottom which provides the WUM results in textual format; and the filtering view on the right which allows the user to select what data should be analyzed by the WUM algorithms. The graphical view illustrated in Figure 4 uses a radial tree to display the structure of the web site being analyzed as well as the results of one of the WUM algorithms. The graphical view displayed is dependent on which WUM algorithm is selected. For example, the currently selected view in Figure 4 shows the most frequent paths followed by the users. There are five buttons below the filtering menu which allow the user to switch between any of the WUM algorithms. The user can also filter the data according to specific criteria using the filtering menu. The ‘Period’ option allows the user to specify the date and time range to be used. The ‘Page Selection’ option allows the user to specify how much of the web site is displayed in the graphical view. At the bottom of the filtering menu is an ‘Apply’ button which applies the desired filtering to the data, invokes the selected WUM algorithm and refreshes the graphical and textual views. Any of the three graphical views can be selected by clicking on the desired view. The selected view is then displayed as is shown in Figure 4.

Figure 4. UI Design of WUM prototype (Frequent Paths)

Proceedings of the Information Visualization (IV’06) 0-7695-2602-0/06 $20.00 © 2006

IEEE

7.1 Implementing Association

7. Implementation The WebPatterns prototype was implemented using a three-tier architecture, comprising a Presentation Layer in which the visual pattern analysis is provided, an Application Layer in which the pattern discovery is implemented, and a Data Layer in which the preprocessed clickstream data is stored (Figure 1). The Presentation Layer was implemented using C# and FlowChart.NET for the radial tree component and runs in a Windows XP environment. FlowChart.NET also supports the typical interaction techniques used for interactive visualization tools, such as zooming, panning, filtering and details-on-demand. The Application Layer was implemented as separate web services with the WUM algorithms written in Java and running on a Sun Server. The Data Layer was implemented as an Oracle database which was also installed on the Sun Server. The following sections discuss the implementation of the interfaces for the association rules and sequence analysis algorithms (Sections 5.1 and 5.2).

The results of the association rule mining are displayed using colour to indicate the associated pages on the radial tree. The initial design in Figure 4 did not enable a user to select a specific page and show all other pages associated with the selected page. The prototype was therefore implemented to show a default view of all the pages associated with the home page using a default threshold (frequency) and enable users to select a specific page (Figure 5). The graphical view is divided into four areas to show the different number of associations. The first area shows the pages associated with the selected page with a frequency above the given threshold; the second area shows the two pages associated with the selected page; and so on. The frequency of the associations is shown on the radial tree using a rainbow colour legend.

Figure 5. Association Paths in WebPatterns

Proceedings of the Information Visualization (IV’06) 0-7695-2602-0/06 $20.00 © 2006

IEEE

The user can select to view the next four associations or the previous four (if applicable) using the buttons labelled accordingly. The user can also select an alternative page from the URL list, change the threshold and view the pages associated with the selected page. For example, as shown in Figure 5, the pages associated with the home page (page 0) are pages 1, 3, 5 and 9 with frequencies of 85%, 82%, 65% and 27% respectively. Pages 1 and 3 are both coloured red since their frequencies fall in the same range (80-100%). In the second graphical view, page 1 is associated with pages 3, 5 and 9 as indicated by the three different colours.

7.2 Implementing Sequence Analysis The results of the sequence analysis algorithm are displayed as a number of frequent paths on the radial tree. The initial design did not allow for a node (page) being part of more than one path (Figure 4). An alternative technique had to be developed to allow multiple paths to contain the same nodes. Instead of

colouring the nodes in a path using a given colour, multiple edges are drawn in colour between these nodes using Bezier curves to indicate the different paths and only the target nodes are coloured (Figure 6). The default view displays the five most frequent paths for the entire web site above the given threshold (Figure 6). Colour is used to indicate the paths, the target URLs and the frequency. The user can navigate to see the next five most frequent paths or the previous five (if applicable) using the buttons labelled accordingly. The user can also select one or more target pages from the URL list and select to view only the most frequent paths to the selected pages (Figure 6). For example, as shown in Figure 6, the most frequently accessed path to page 11 was 0ĺ3ĺ11ĺ15ĺ11 with a frequency of 61%, which indicates that most of the users navigated to page 15 before returning to page 11. This may indicate problems with the information architecture of the web site or the design of page 11.

Figure 6. Frequent Paths (Show All) in WebPatterns

Proceedings of the Information Visualization (IV’06) 0-7695-2602-0/06 $20.00 © 2006

IEEE

8. Evaluation User testing was conducted in order to evaluate the usefulness of WebPatterns. Three metrics were identified for the evaluation of WebPatterns, namely effectiveness, satisfaction and usefulness. The usefulness of WebPatterns was determined as the combination of the perceived ease of use and the perceived usefulness of WebPatterns [6]. A framework for evaluating visual mining tools was adapted from Margescu et al. [9]. This framework recognises quality of use as a key concept to evaluate WUM tools. Quality of use includes quality of interaction, quality of visualization and quality of information. A post-test questionnaire was then compiled based on these measures. The task list was created in such a way that the participants were not only required to determine the results of the WUM algorithms, but also give their interpretation of the results. The results of this questionnaire were then used to determine the quality of use of WebPatterns. Eight participants were selected for the evaluation of WebPatterns. These participants were selected based on the user profile of the target user population. Eighty seven percent (87.5%) of the participants had more than three years experience in web management. All participants except one were able to complete all the tasks correctly. The only participant who could not complete all the tasks, achieved a 76.92% overall task completion rate. An analysis of the post-test questionnaire completed by the participants revealed very high ratings (mean > 4.4 out of a maximum of 5). Since all the ratings were 4.4 or higher, it was concluded that the participants were very satisfied with the quality of use of WebPatterns. The participants were asked if they would use WebPatterns to assist in similar tasks in future. All the participants indicated that they would like to use WebPatterns in the future to complement existing WUM tools.

Conclusions Web usage mining (WUM) can allow web developers to analyze the information architecture of organizational web sites. Existing tools fail, however, to illustrate the algorithms being used and effectively visualize the results. Two WUM algorithms were selected based on the type of data collected by the web server, namely association rules and sequence analysis. A radial tree was selected as the most appropriate

Proceedings of the Information Visualization (IV’06) 0-7695-2602-0/06 $20.00 © 2006

IEEE

visualization technique for both these algorithms. An iterative design approach was used to develop the UI of the WUM prototype. This process resulted in the implementation of a WUM prototype, called WebPatterns, which can be used to interactively explore the web usage patterns of an organizational web site. The results of the evaluation clearly show that the participants found WebPatterns easy to use and were highly satisfied with the quality of information and visualizations provided.

References [1] BERENDT, B. and SPILIOPOULOU, M. (2000): Analysis of Navigation Behaviour in Web Sites integrating Multiple Information Systems. The VLDB Journal 9:56-75. [2] BERSON, A., SMITH, S. and THEARLING, K. (2004): An Overview of Data Mining Techniques. In Building Data Mining Application for CRM. [3] BÜCHNER, A.G., BAUMGARTEN, M., ANAND, S.S., MULVENNA, M.D. and HUGHES, J.G. (1999): Navigation Pattern Discovery from Internet Data. In Proceedings WEBKDD '99, San Diego, CA. August 1999. [4] CHEN, J., SUN, L., ZAIANE, O.R. and GOEBEL, R. (2004): Visualizing and Discovering Web Navigational Patterns. Seventh International Workshop on the Web and Databases (WebDB 2004), June 17-18. Paris, France. [5] COOLEY, R. (2003): The use of web structure and content to identify subjectively interesting web usage patterns. ACM Transactions on Internet Technology (TOIT) 3(2): pp 93116. [6] DAVIS, F.D. (1989): Perceived Usefulness, perceived ease of use and user acceptance. MIS Quarterly 13(3): pp 319340. [7] EICK, S.G. (2001): Visualizing Online Activity. Communications of the ACM 44(8): pp 45-50. [8] EIRINAKI, M. and VAZIARGIANNIS, M. (2003): Web mining for web personalization. ACM Transactions on Internet Technology (TOIT) 3(1): pp 1-27. February. [9] MARGHESCU, D., RAJANEN, M. and BACK, B. (2004): Evaluating the Quality of use of Visual Data-Mining Tools. In Proceedings 11th European Conference on Information Technology Evaluation, Amsterdam, Netherlands. pp 239-250. 11-12 November. [10] MIGHTYMEDIA (2004): Website Design. http://www.website-design-101.co.uk (Last Accessed: 28 November 2005: Last Accessed: 28 November 2005, Last updated: [11] USABILITY BY DESIGN (2004): Likert Scale.