Integrated Techniques and Tools for Web Mining, User Profiling and Benchmarking Analysis

Giovanni Ballocca, Roberto Politi

Giancarlo Ruffo, Rossano Schifanella

CSP S.c.a.r.l. Via Livorno, 60 – 10144, Torino. Italy [email protected] [email protected]

Dip. di Informatica, Università di Torino Corso Svizzera, 185 – 10149, Torino. Italy [email protected] [email protected]

Abstract

The World Wide Web is one of the most used interfaces for accessing remote data and commercial and non-commercial services, and the number of actors involved in these transactions is growing very quickly. Everyone using the Web experiences how the connection to a popular web site can be very slow during rush hours, and it is well known that web users tend to leave a site if the wait for a page to be served exceeds a given value. Performance and service quality attributes have therefore gained enormous relevance in service design and deployment, which has led to the development of the Web benchmarking tools now widely available on the market. One of the most common criticisms of this approach is that the synthetic workload produced by web stressing tools is far from realistic. Moreover, Web sites need to be analysed to discover commercial rules and user profiles, and models must be extracted from log files and monitored data. This paper presents a benchmarking methodology based on the integrated usage of web mining techniques and standard web monitoring and assessment tools.

Keywords: benchmarking, stressing tool, capacity planning, workload characterization.

1. Introduction

Since 1995, web servers have dramatically increased the number of connections served per second. Throughput strongly depends on network capacity. However, network capacity is improving faster than server capacity, and enhancements of the network infrastructure (gigabit wide-area networks, ISDN, xDSL lines, cable modems, optical fibres, and so on) reduce network latency. Latency, measured by RTT (Round Trip Time), is one of the key elements of Web performance and depends on both network conditions and web server capacity. A high error rate is the most immediate perception of web server unreliability. Server errors arise for two major reasons: bad implementation or inadequate capacity. Avoiding and/or detecting implementation errors is easier than providing adequate capacity. If a server has low capacity (e.g., a real-time streaming server able to serve only 200 requests per second), many clients receive a "connection refused" message: the service is not available and clients are lost. Since the main bottleneck is at the server side, the expected workload is difficult to define (see [1] for a survey on workload characterization problems). The continuing evolution of the web framework, from a simple client/server architecture to a complex distributed system, has had a series of consequences: a client request can be fulfilled by many servers; routing strategies can be defined at different levels (the client may originate requests to the primary server as well as to one of its mirrors); DNS servers can resolve the address at different hierarchical levels; and Web switches are often used to dispatch requests across a pool of servers to provide load balancing [2]. Moreover, workload characterization deals with clients, servers and proxies [3]. In this context, HTTP request/response messages between clients and servers are also heavily influenced by cache activity. Different monitoring strategies can be chosen to describe web workload in order to capture the desired metrics (we only need information traced at the server side). To evaluate web server performance, both open-source and commercial web stressing tools are widely available (e.g., OpenSTA [4] and LoadRunner [5]). Both can be used to perform load and stress tests on a replicated web site. During the testing period, monitoring agents collect values for a set of parameters representing system resources (CPU and RAM utilization rates, disk I/O accesses, network traffic, and so on). The generated workload is usually based on a predefined user session, which is replicated many times by a number of Virtual Users (VUs) (see Section 3). In the rest of the paper we present a methodology to evaluate the overall performance of a web site and so avoid "lack of planning" problems. The core of this methodology is the usage of a stressing tool that imposes a realistic synthetic workload on the web site under test. In Section 2, we present the general capacity planning and web mining framework; the adopted analysis workflow is described in Section 3. The tools, which we share with the community, are presented in Section 4. In Section 5, a case study shows how the proposed integration of tools and methodologies can be carried out in practice. We review related work on web mining, workload characterization and web performance in Section 6. In the conclusion, we outline future directions of this research.

2. Capacity planning and web analysis framework

One of the most important steps in capacity planning is performance prediction: the goal is to estimate performance measures of the web farm under test for a given set of parameters (e.g., response time, throughput, CPU and RAM utilization, number of disk I/O accesses, and so on). There are two approaches to predicting performance: a benchmarking suite can be used to perform load and stress tests, and/or a performance model can be applied. A performance model [31] predicts the performance of a web system as a function of the system description and workload parameters. There are simulation models [49] and analytical models (e.g., the analytical queueing network model proposed in [31]). Both kinds of model output response times, throughput and resource utilization, which are analysed in order to plan an adequate capacity for the web system. Benchmarking suites are widely used in industry: experts prefer to measure system responses directly and believe that no model can really substitute for the real architecture. Current stressing tools (see Section 3) replicate a synthetic session made of a sequence of object requests. On the other hand, performance models can take a workload characterization model as input. Many methodologies, techniques and tools have been developed to support capacity planning, performance prediction and understanding of user behaviour. Three main approaches can be readily identified:
- absolute benchmarking (to provide benchmarking data for architecture comparison);
- application benchmarking;
- data mining derived techniques (mainly used to support user profiling and web site customization).
The benchmarking approach involves performing load, stability and stress tests using workload generators that provide a synthetic workload: web stressing tools use scripts describing user sessions to simulate so-called "virtual users" browsing the web site.
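To give a feeling for the analytical approach, the following sketch (ours, not from the paper) applies the simplest single-queue approximation: mean response time grows sharply as server utilization approaches 100%, which is exactly the saturation behaviour the benchmarks try to expose.

```python
# Illustrative M/M/1-style estimate (an assumption for exposition, not the
# model of [31]): R = S / (1 - U), with utilization U = arrival_rate * S.

def predict_response_time(arrival_rate: float, service_time: float) -> float:
    """Estimate mean response time for a single-queue server."""
    utilization = arrival_rate * service_time
    if utilization >= 1.0:
        raise ValueError("server is saturated: utilization >= 100%")
    return service_time / (1.0 - utilization)

# A server taking 5 ms per request, under 150 requests/s (U = 75%):
r = predict_response_time(arrival_rate=150.0, service_time=0.005)
```

Even this toy formula shows why capacity planning matters: at 75% utilization the response time is already four times the bare service time.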
Four metrics are used to evaluate web site performance (requests served per second, throughput in bytes per second, round-trip time, errors), and many parameters can be measured to detect system bottlenecks (CPU and RAM utilization, disk I/O accesses, network traffic). The absolute benchmarking approach differs from application benchmarking in the workload model used to test the architecture: since the scope is to provide comparable results for different platforms (hardware, operating system, supporting middleware), the generated workload will always be the same, provided it is scalable enough to cope with systems ranging from single-processor machines to entire web farms. Conversely, the aim of application benchmarking is to highlight bottlenecks and points of failure, and the workload should be generated accordingly. One of the most common objections to the approaches developed is that the synthetic workload produced by web stressing tools is far from realistic. Three main approaches can be identified for the characterization of the request stream:
- trace based: the characteristics of the Web workload are based on pre-recorded trace logs (e.g., LoadRunner [5], OpenSTA [4], Httperf [51], Geist [52]);
- file list based: the tool provides a list of Web objects with their access frequencies. During workload generation, the next object to be retrieved is chosen on the basis of its access frequency (e.g., the SPECweb suite [6], TPC-W [50], WebStone [53], WebBench [54]);
- analytical distribution driven: the Web workload characteristics are specified by means of mathematical distributions (e.g., SURGE [28]).
In general, stressing clients replicate a (set of) artificial session(s) or navigational patterns. Analytical models are outside the scope of this work and, therefore, we stick to:
- randomly generated user sessions;
- manually generated user sessions;
- user sessions extracted from log files.
The relevant drawbacks of these approaches are, respectively, the impossibility of generating a realistic workload, subjectivity, and lack of scalability. An acceptable solution is to provide every single Virtual User with a profile; the VU then acts according to its profile, as described in Section 3. Log files are used to extract (a few) user profiles, not (many) user sessions. We strongly encourage this approach because it is more realistic, not subjective, and more scalable.
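The "file list based" strategy above can be sketched in a few lines (names and object lists are illustrative, not taken from any of the cited tools): the next object to request is drawn at random, weighted by its observed access frequency.

```python
# Sketch of file-list-based workload generation: pick the next Web object
# according to its access frequency. Objects and frequencies are invented.
import random

def next_object(objects, frequencies, rng=random):
    """Draw the next object to request, weighted by access frequency."""
    total = sum(frequencies)
    weights = [f / total for f in frequencies]
    return rng.choices(objects, weights=weights, k=1)[0]

# The home page is requested 70% of the time in this invented file list:
obj = next_object(["/index.html", "/search", "/cart"], [70, 20, 10])
```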
Web user behaviour characterization provides an interesting perspective on the workload imposed on a web site and can be used to address crucial points such as load balancing, content caching, or data distribution and replication. In particular, Web usage mining is devoted to the investigation of how people accessing web pages use them and of their navigational behaviour. It involves the automatic discovery of user access patterns from one or more Web servers, and is based on the analysis of secondary data describing interactions between users and the Web. Web usage data include data recorded in web server access logs, proxy server logs, browser logs, user profiles, registration data, user sessions and transactions, cookies, user queries, bookmark data, mouse clicks and scrolls, and any other data deriving from the aforementioned interactions. We propose the usage of tools and techniques for log file analysis and web mining to implement a more realistic workload characterization, and thus to improve existing benchmarking tools. The approach discussed suggests the usage of the Customer Behavior Model Graph (CBMG, originally proposed for the workload characterization of e-commerce sites) as a modelling tool to derive a realistic workload from log file analysis. Scripts can be generated to describe the behaviour of each Virtual User. These scripts can be used to emulate a browser, sending a server sequences of GET (or POST) requests for pages and embedded objects. Between successive requests, the virtual user can be configured to wait a given interval of time (the client think time). Web servers use log files to record an entry for every single request they serve. As the complexity of the web site or application increases, simple statistics give no meaningful hints on how the web site is being used. Moreover, the log files of popular web sites may grow by several hundred megabytes per day, making analysis tasks awkward.
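The virtual-user behaviour described above (a scripted sequence of requests separated by a client think time) can be sketched as follows; `fetch` is a stand-in for a real HTTP client call, not part of any tool named in this paper.

```python
# Minimal virtual-user sketch: replay a scripted session of page requests,
# pausing for the client think time between successive requests.
import time

def replay_session(session, fetch, think_time=1.0):
    """Issue each request in the scripted session and collect responses."""
    responses = []
    for url in session:
        responses.append(fetch(url))   # e.g., an HTTP GET for the page
        time.sleep(think_time)         # emulate the user's think time
    return responses
```

A stressing tool runs many such loops concurrently, one per virtual user.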
Techniques for mining information and knowledge from large databases are described in [14, 36]. Web mining refers to the application of such techniques to web data repositories, so as to enhance the analytical capabilities of the known statistical tools.

The overall framework is composed of different modules (Fig. 1), which are involved in a sort of collaborative workflow: tools and systems adopted for mining decision-support rules from data are not seamlessly integrated with the stressing tools used for performance analysis; moreover, packages used to monitor and report on the conditions of the system under test are not interfaced with the other tools. Finally, workload characterization models are sometimes fundamental to understanding the nature of traffic, resource usage and user behaviour; these models are not considered at all in many currently used web analysis tools, compromising the completeness of the analysis itself. Such integration and completeness of the overall capacity planning framework becomes possible if the CBMG (see next section) is adopted as the workload characterization model. Such a model can be extracted from log files; it can be used to create user profiles and to derive some interesting metrics (e.g., average number of visits, average session length, and so on); and, as discussed in [33, 55, 56], it can be used to generate realistic trace-based traffic. In Sections 3 and 4, we describe a continuous analysis workflow based on the CBMG, where all the different modules are tightly integrated.

Figure 1: Test and analysis framework: conceptual modules and tools.

3. A continuous workflow for analyzing web performance and user profiles

The Customer Behaviour Model Graph

The CBMG (Customer Behaviour Model Graph) [25] is a state transition graph proposed for the workload characterization of e-commerce sites. It is our opinion that such a model has a wider field of application, and that it can also be used to generate realistic workload for every kind of site. We use a CBMG for each group of users (or group of sessions) with similar navigational patterns.

Another important feature of CBMGs is that they can be automatically extracted from web log files, even though our tool for creating CBMGs also uses other information that is not contained in standard log files and is retrieved from elsewhere (see Section 4).

Figure 2: CBMGs representing two typical users of the given service: (a) an occasional visitor, (b) a registered user

Nodes in a CBMG correspond to states in the session. A state can be viewed as a collection of semantically related web pages (e.g., in the e-commerce case, browsing, searching, paying, and so on). Such states must be defined by the analyst, because sites can have different purposes, and a functional categorization of the different pages must be performed by a human expert. The arcs in the CBMG correspond to transitions between states. As in Markov chains, probabilities are associated with transitions (see Figure 2). Formally, a CBMG is represented by a pair (P, Z) of n × n matrices, where n is the number of considered states, P = [pij] contains the transition probabilities between states, and Z = [zij] represents the average server-side think times between successive requests (i.e., in a given user session, the average time interval elapsed between the completion of a request by the server and the arrival of a new one). A CBMG can be created by means of four steps:
- Merging and filtering: a Web site can be composed of several servers. All the log files must be merged together, and the entries that do not affect the behaviour of a user (e.g., requests for embedded objects such as images, video clips, and so on) must be filtered out.
- Getting sessions: log files contain request-oriented information; the aim of this phase is to transform the data into a session-oriented format.
- Transforming sessions: let us assume that n is the number of different states. For each session S, we create a point XS = (CS, WS), where CS = [cij] is an n × n matrix of transition counts between states i and j, and WS = [wij] is an n × n matrix of accumulated server-side think times.
- CBMG clustering: using a K-means clustering algorithm, all the XS points are transformed into a set of CBMGs, each of which collects sessions with a similar navigational pattern.
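The "Transforming sessions" step above can be sketched as follows (state indices and the session encoding are our illustrative choices): a session, seen as a sequence of visited states with their think times, becomes a point XS = (CS, WS) of transition-count and accumulated-think-time matrices.

```python
# Sketch of session transformation: build the (C_S, W_S) point for one
# session. A session is a list of (state_index, think_time) observations,
# with states numbered 0..n-1 (an encoding we assume for illustration).

def session_to_point(session, n):
    counts = [[0] * n for _ in range(n)]
    times = [[0.0] * n for _ in range(n)]
    for (i, _), (j, think) in zip(session, session[1:]):
        counts[i][j] += 1        # one more i -> j transition observed
        times[i][j] += think     # accumulate server-side think time
    return counts, times

# Example: Entry(0) -> Browse(1) -> Browse(1) -> Exit(2)
C, W = session_to_point([(0, 0.0), (1, 2.0), (1, 1.5), (2, 0.5)], n=3)
```

The resulting points are then fed to the K-means clustering step to group sessions with similar navigational patterns.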
The CBMG can be used to easily calculate some interesting metrics such as the average number of visits, the average session length and also the resource usage by a session [25].
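As a hedged sketch of how such metrics fall out of the model (our formulation, not the derivation of [25]): taking Entry as state 0 and Exit as absorbing, the expected number of visits V[j] to each state satisfies V[j] = 1 if j is Entry plus the flow arriving from the other states, which a simple fixed-point iteration can solve.

```python
# Illustrative visit-count computation from a CBMG transition matrix P:
# V[j] = 1{j == entry} + sum_i V[i] * P[i][j], solved by iteration.

def average_visits(P, entry=0, iterations=1000):
    n = len(P)
    V = [0.0] * n
    for _ in range(iterations):
        V = [(1.0 if j == entry else 0.0)
             + sum(V[i] * P[i][j] for i in range(n))
             for j in range(n)]
    return V

# Invented 3-state CBMG: Entry -> Browse always; Browse loops with
# probability 0.5, otherwise goes to Exit (absorbing, no outgoing arcs).
P = [[0.0, 1.0, 0.0],
     [0.0, 0.5, 0.5],
     [0.0, 0.0, 0.0]]
V = average_visits(P)   # expected visits per session: Entry 1, Browse 2, Exit 1
```

Summing the visit counts (weighted by per-state service demands) gives the average session length and resource usage mentioned above.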

WALTy: Web Application Load-based Testing tool

WALTy (Web Application Load-based Testing tool) [55, 56] includes a set of tools that allows the performance analysis of web applications by means of a scalable what-if analysis on the test bed. It is composed of two main modules: (1) the CBMGBuilder module, intended to generate a set of CBMGs from log files, where each CBMG represents a user profile; and (2) the CBMG2Session module, which emulates the behaviour of the virtual users (by way of the previously calculated CBMGs) using a load testing component based on the httperf tool [51]. WALTy is implemented in Java and integrated with the modified version of httperf through the Java Native Interface architecture. The tool is distributed under the GPL licence (http://security.di.unito.it/software/WALTy). In the next sections, we describe each module in more detail.

From log files to CBMGs

In this section we focus on the first component of WALTy: the creation of CBMGs from input data. Because web logs are intrinsically hit-oriented, the session identification phase is a central topic, given that a Customer Behaviour Model Graph is a session-based representation of a user's navigational pattern. In general, server-side logs include information such as the client IP address (possibly a proxy), user ID (if authentication is needed), time/date, request, and so on. This type of information can be incomplete and not entirely reliable; therefore, it should be integrated by means of packet sniffers and, where available, application server log files. Depending on the data actually available for the analysis, typical problems arising in user behaviour reconstruction include: multiple server sessions associated with a single client IP address (as in the case of users accessing a site through a proxy); multiple IP addresses associated with multiple server sessions (the so-called mega-proxy problem); a user accessing the web from several machines; and a user using several user agents to access the Web. Assuming that the user has been identified, the associated click-stream has to be divided into sessions. The first relevant problem is the identification of session termination. Other relevant issues are the need to access application and content server information, and the integration with proxy server logs relative to accesses to cached resources. Even though a wide range of techniques have been introduced to identify a web session (e.g., cookies, user authentication, or URL rewriting), none of these proposals is actually a standard. Moreover, logging procedures and formats are not uniformly defined.
In such a context, we define a methodology for session identification that should be scalable, so as to handle any new data type without recompiling the core of the application, and that should integrate many different input formats in a common framework. The proposed methodology is based on the following steps:
- Data model definition: an abstract data model is defined. This includes any specific information necessary for the correct execution of the implemented application; obviously, it is strongly dependent on the particular domain of interest. In this context, we defined the CBMG Input Format (CIF), which is made of the following fields: sessionID, timestamp, request, executionTime. The sessionID is an alphanumeric string that uniquely identifies a session. This information is not present in every log file format (e.g., it is absent in the Common Log Format), but if the source is a servlet filter, the session can be identified reliably. The timestamp gives the moment when the request was received by the web server, while the request field shows the resource asked for. Finally, the executionTime estimates the time that the server spends to accomplish the request.
- Kernel-Module scheme implementation: let the kernel be a core module implementing an abstract service, such as the conversion of a generic input type to our data model. When a user has to deal with a new instance of the service, he can simply implement a module and add it to the kernel.

Figure 3: Kernel-Module paradigm

Figure 3 shows how the Kernel class manages the abstract method myMethod. It uses the translate method declared in the InterfaceModule interface. The translate method, however, is implemented in different modules (e.g., ImplModule and ImplModule2), according to the requirements of the specific application. These classes are defined as plug-ins for the application. A plug-in performs the simple task of transforming the format of a generic source file into the data model previously defined (e.g., CIF).
- Plug-in implementation: when a new file format is encountered or a new session identification technique is introduced, a new plug-in must be implemented to handle the conversion to the defined data model. This plug-in is added to the kernel module as previously discussed. This process permits simple and practical management of several session identification mechanisms, shifting the format conversion problem to the implemented plug-ins.
In order to build a CBMG, we must perform other fundamental steps:
- States definition: a state is a collection of static or dynamic pages showing similar functionalities. A difficult task is mapping a physical resource onto a logical state. CBMGBuilder defines a state by way of a set of rules, each of which is composed of three main components: (1) a directory, (2) a resource, and (3) an extension. Simple regular expressions can be adopted as well.
- Embedded objects definition: the embedded objects should be filtered out during the preprocessing phase. WALTy allows a definition of the embedded objects via a set of rules using the same syntax adopted for the states definition.
- Parameters specification: we must specify the following parameters:
  - Number of clusters: each cluster of CBMGs models a generic user profile. The analyst can configure the desired number of profiles (i.e., clusters).
  - Session Time Limit (STL): two successive requests must be less than the STL apart to be labelled with the same sessionID.
  - Session Entry Limit: this value indicates the minimum number of requests in a session.
An important thing to highlight is that all these configuration parameters must be set by an analyst who has carefully studied the structure and the aim of the site under examination.
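The Kernel-Module scheme can be sketched as follows. This is an illustrative Python rendering of the pattern, not WALTy's actual Java API: class names, the tab-separated example format, and the dictionary-based CIF record are all our assumptions.

```python
# Illustrative Kernel-Module sketch: the kernel exposes an abstract
# translate() service; each plug-in converts one log format into a CIF
# record (sessionID, timestamp, request, executionTime).
from abc import ABC, abstractmethod

class LogModule(ABC):
    @abstractmethod
    def translate(self, line: str) -> dict:
        """Convert one raw log line into a CIF record."""

class Kernel:
    def __init__(self):
        self.modules = {}
    def register(self, fmt: str, module: LogModule):
        self.modules[fmt] = module            # plug-in added to the kernel
    def to_cif(self, fmt: str, line: str) -> dict:
        return self.modules[fmt].translate(line)

class TabSeparatedModule(LogModule):
    """Hypothetical plug-in for an invented tab-separated log format."""
    def translate(self, line: str) -> dict:
        sid, ts, req, exec_time = line.split("\t")
        return {"sessionID": sid, "timestamp": int(ts),
                "request": req, "executionTime": float(exec_time)}

kernel = Kernel()
kernel.register("tsv", TabSeparatedModule())
rec = kernel.to_cif("tsv", "s42\t1099000000\tGET /index.html\t0.12")
```

Supporting a new log format or session identification technique then amounts to registering one more `LogModule` subclass, with no change to the kernel.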

Generating Synthetic Web Traffic from CBMGs

Once a CBMG has been generated for each user profile, the next step in the emulation process is traffic generation. As in the general trace-based framework previously described, traffic is generated by means of a sequence of HTTP requests, with a think time between two successive requests. This section describes how to generate such a sequence from CBMGs. Let us suppose that the clustering phase returned m profiles {Φi}, where i = 1, ..., m. Each profile is a CBMG defined as a pair (P, Z) of n × n matrices, where n is the number of states (see Section 3).

Observe that each profile Φi corresponds to a set of sessions {Si1, Si2, ..., Sip}. Let us indicate the cardinality of this set of sessions with |Φi|. Moreover, let us define the representativeness of profile Φi as the value

    ρ(Φi) = |Φi| / Σj=1..m |Φj|,

which is the ratio of the number of sessions corresponding to profile Φi to the total number of sessions. In order to generate traffic to the system under test, the profiles Φi are used to properly define the navigational behaviour of virtual users.
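The representativeness values and the virtual-user allocation they induce can be sketched directly (cluster sizes below are invented for illustration):

```python
# Sketch of the representativeness computation: rho(phi_i) is the fraction
# of all sessions falling into profile phi_i; a test of N virtual users
# then assigns round(N * rho(phi_i)) of them to profile i.

def representativeness(cluster_sizes):
    total = sum(cluster_sizes)
    return [size / total for size in cluster_sizes]

def allocate_virtual_users(cluster_sizes, n_users):
    return [round(n_users * rho) for rho in representativeness(cluster_sizes)]

# Three profiles holding 600, 300 and 100 sessions; a 50-VU test:
vus = allocate_virtual_users([600, 300, 100], 50)   # -> [30, 15, 5]
```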

Figure 4: Stressing framework based on WALTy clients

In Figure 4, a set of different clients is used to run several clusters of virtual users, defined by means of different profiles. Observe that the value ρ(Φi) is very important to our analysis, because it gives a way to calculate a representative number of virtual users running with the same profile. For example, if we want to start a test made of N virtual users accessing the server, we can parallelize the stressing clients' jobs as follows: client i, which emulates sessions with profile Φi, runs N · ρ(Φi) virtual users, with i = 1, ..., m. WALTy allows further scalability: in fact, we can perform a fine-grained test by changing the relative profile percentages, e.g., we can run experiments answering questions like "what happens when the number of users with profile Φ3 grows with respect to the other classes of users?". Moreover, WALTy allows the analyst to perform a what-if analysis at the transition level, changing values in the matrices P and Z. For example, the analyst may be interested in the consequences of an alteration of navigational behaviour: if a new link is planned to be published on the home page, a different navigation pattern for the occasional visitor is reasonably expectable.

Algorithm: CBMGtoSession
input : a profile Φi = (P, Z), and resource set L
output: a session

begin
    Session ← ∅;
    State ← Entry;
    while State ≠ Exit do
        Page ← SelectPage(State, LState);
        PageProperties ← SetPageProperties(Page);
        NextState ← SelectNextState(State, P);
        ThinkTime ← EstimateThinkTime(State, NextState, Z);
        Request ← CreateRequest(Page, PageProperties, ThinkTime);
        Session ← Session ∪ {Request};
        State ← NextState;
    end
    return Session;
end

Algorithm 1: a procedure that takes a CBMG as input to generate a session trace.

Finally, Algorithm 1 [55] describes how a session can be generated from a CBMG. This procedure takes as input parameters a CBMG profile and the set L = {L2, ..., Ln−1}, where Li is the list of objects (e.g., HTML files, cgi-bin scripts, ...) belonging to the i-th state. The cbmg2session procedure returns a session made of a sequence of httperf requests. The generation takes advantage of the following functions:
- SelectPage(S, LS): given a state S and the corresponding list of objects LS, this procedure selects an object (i.e., a page) that belongs to LS with a simple ranking criterion: pages that are frequently accessed are more likely to be selected. In other words, it is not a random choice but a popularity-driven page selection.
- SetPageProperties(Page): in order to create a well-formed httperf request, WALTy associates the given page with the following set of properties:
  - Method: the HTTP method (GET, HEAD, POST) should be selected, in order to properly send the request for the given page.
  - Data: in the case of a POST method, a byte sequence to be sent to the server is allocated.
  - Parameters list: if the selected resource is a dynamic page (e.g., a PHP script, a JSP page, ...) needing a list of input parameters, WALTy appends to the request a sequence of (name=value) items. This sequence starts with a question mark "?", and the items are separated by colons ":". These parameters should be previously defined by the analyst by means of a simple menu.
- SelectNextState(S, P): returns the next state to be visited during this session. The current state S is given as input to the procedure together with the matrix P. The random selection is weighted by means of the transition probabilities contained in P.
- EstimateThinkTime(S, N, Z): a think time should also be defined in the httperf request. This value is extracted from the matrix Z, i.e., it is the server-side think time corresponding to the transition from the "old" state S to the next state N.
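A hedged Python rendering of Algorithm 1 follows. It simplifies several details (the popularity-driven page choice becomes a weighted draw, Entry and Exit are fixed to states 0 and n−1, Entry is assumed to have no pages, and the returned trace is a plain list rather than httperf requests), so it should be read as a sketch of the control flow, not WALTy's implementation.

```python
# Illustrative cbmg2session sketch: walk the CBMG from Entry to Exit,
# recording one (page, think_time) request per intermediate state visited.
import random

def cbmg_to_session(P, Z, pages, rng=random):
    """P, Z: n x n transition-probability and think-time matrices.
    pages[s]: list of (url, popularity) pairs for intermediate state s."""
    n = len(P)
    entry, exit_state = 0, n - 1
    session, state = [], entry
    while state != exit_state:
        # SelectNextState: weighted by the transition probabilities in P.
        next_state = rng.choices(range(n), weights=P[state], k=1)[0]
        if state != entry:                       # Entry has no page list
            urls, pops = zip(*pages[state])
            # SelectPage: popularity-driven (not uniform) page choice.
            page = rng.choices(urls, weights=pops, k=1)[0]
            think = Z[state][next_state]         # EstimateThinkTime
            session.append((page, think))        # CreateRequest
        state = next_state
    return session

# Invented 3-state profile: Entry -> Browse -> Exit, one Browse page.
P = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
Z = [[0.0, 0.0, 0.0], [0.0, 0.0, 2.0], [0.0, 0.0, 0.0]]
trace = cbmg_to_session(P, Z, {1: [("/browse.html", 1.0)]})
```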
We believe that these generated traces are more realistic than randomly selected sessions, because they are modelled on the observed traffic of the system under test. During benchmarking, the site under test should be monitored as well, and the monitored metrics should finally be reported in an understandable format. The final observations help the analyst to take decisions, find bottlenecks, plan an adequate system capacity, and improve the usability of the hypertext system.

4. Tools and systems integration

In the previous sections we described WALTy and how synthetic traffic can be generated in order to monitor the performance of a system under test. The following pictures show some screenshots of the software.

(a) CBMG generation process wizard

(b) Parameters specification

(c) Graphical representation of a CBMG and some useful metrics

(d) Some graphs about CBMG’s generation process statistics

The images above show the main features provided by the CBMGBuilder module, e.g., the friendly interface that guides the analyst through the various steps of the configuration phase, and the graphical representation of the results, such as the CBMGs created, the derived metrics, and the statistics about the generation process. Moreover, the pictures below show the different phases of the CBMG2Session module. WALTy provides a graphical interface for configuring the HTTP-related parameters, a clear window that shows the synthetic traffic generation task, and a complete set of graphs and reports that improve the comprehension of the results.

(e) Httperf front-end

(f) HTTP generator

(g) Report of statistics collected during the stress test

(h) Graphs of concurrent connections, bytes sent, response time and other useful metrics.

The last tool developed allows the online monitoring of several server machines. It is made up of three parts:
- A server (written in Perl) that, using the SNMP protocol, monitors some physical (CPU and memory usage, network and disk activity) and logical (TCP, HTTP and FTP connections) parameters of the System Under Test, logging the results to files.
- A client (written in Java) listening for the SNMP traps sent by the server when one or more thresholds are exceeded. Events can be notified by on-screen messages, sound alerts, or email (Linux only), and are always recorded to file and on screen.
- A web interface (written in Perl) from which it is possible to view real-time monitoring graphs and logs of exceeded thresholds, and to configure some parameters (alarm thresholds, the email address to which alarms are sent).
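The threshold check at the heart of the monitoring server above can be sketched as follows. This is an illustrative fragment only: the parameter names and threshold values are invented, and the actual SNMP sampling (which in the tool is done in Perl) is assumed to have already filled the `samples` dictionary.

```python
# Illustrative threshold-alert check: compare each sampled parameter of the
# System Under Test against its configured alarm threshold.

def check_thresholds(samples, thresholds):
    """samples, thresholds: dicts keyed by parameter name (e.g. 'cpu').
    Returns the names of the parameters whose threshold is exceeded."""
    return [name for name, value in samples.items()
            if value > thresholds.get(name, float("inf"))]

# Invented sample: CPU at 95% against a 90% alarm threshold.
alarms = check_thresholds(
    {"cpu": 0.95, "mem": 0.40, "tcp_conns": 130},
    {"cpu": 0.90, "mem": 0.80, "tcp_conns": 200},
)   # -> ["cpu"]
```

In the real tool, each name returned here would trigger an SNMP trap toward the listening client.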

All the software packages have been written in portable programming languages, to allow their use on "any" operating system. They have been tested under Windows and Linux, but they can (perhaps with minimal modifications) be used on other operating systems. In future versions we will try to achieve better integration and homogenization of the tools for simpler usage, while still leaving the user full control over the benchmark analysis process.

5. A case study: scaling a public administration farm

Here we report a case study of a public administration in which we applied our integrated methodology. The given task was the benchmarking and capacity analysis of a web farm before an important public event (with a forecast traffic increase ranging from 200% to 400%).

Since the web farm is a production environment, a major constraint was to limit potentially disruptive effects of the test. First, we analysed the log files for a sample time interval to obtain statistical data related to user accesses and/or malfunctions of the web site. From this analysis, a regular access pattern on a macro scale can be readily observed (figure below).

Hourly Report (One day)

Hourly Report (One week)

Then we used the WALTy tool we developed to extract a set of CBMGs and to reproduce a realistic synthetic workload by means of the httperf-based component. The number of virtual users was increased incrementally, starting from a value equivalent to the real one and growing to more than 1200% of it. Since the web site under test is actually a collection of dynamic web applications, the usage of the CBMG allows the reconstruction of real navigation behaviour by means of a set of user profiles. As a consequence, bottlenecks can be found when a realistic traffic emulation based on such profiles is performed on the site. Tests lasted one week and ran for two hours, between 4 and 6 am, every day, to keep any possible disservice as short as possible (employees only work from 8 am to 5 pm on weekdays). Furthermore, we tried to minimize the perception of poor performance by any connected users, and we avoided the need for a replicated architecture (with all the problems it entails). During the tests, we monitored (with the tools described in the previous section) some physical parameters of the servers, to identify which components were candidate bottlenecks (causing high end-user response times). We obtained the following results:

The number of served pages stabilizes when about 50 users are simultaneously connected to the server (equivalent to 250% of the load in normal conditions): this means that, from this point on, some pages are served more slowly or are not served at all. In the second diagram we notice that the number of pages served in more than 5 seconds starts increasing precisely at 50 simultaneous users. The following graphs show the usage of the physical resources of the web server: the only parameter that reaches a value close to 100% is the outgoing network bandwidth, so we can deduce that this will be the first bottleneck (reached when the traffic grows to about 10 times the current value).
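The CBMG-driven traffic generation described above can be sketched as a random walk over the graph's transition probabilities: each virtual user starts at the Entry state and follows transitions until it reaches Exit. The states and probabilities below are invented for the example; a real CBMG would be extracted from the server logs (e.g., by WALTy):

```python
import random

# Illustrative CBMG: per-state transition probabilities (each row sums to 1).
CBMG = {
    "Entry":  {"Home": 1.0},
    "Home":   {"Search": 0.5, "Browse": 0.3, "Exit": 0.2},
    "Search": {"Browse": 0.6, "Home": 0.2, "Exit": 0.2},
    "Browse": {"Browse": 0.4, "Home": 0.3, "Exit": 0.3},
}

def generate_session(cbmg, rng=random.random):
    """Random walk over the CBMG from Entry until the Exit state,
    returning the sequence of visited states (one synthetic session)."""
    state, session = "Entry", []
    while state != "Exit":
        r, acc = rng(), 0.0
        for nxt, p in cbmg[state].items():
            acc += p
            if r < acc:
                break
        state = nxt  # falls back to the last state if rounding leaves acc < 1
        if state != "Exit":
            session.append(state)
    return session

random.seed(7)
print(generate_session(CBMG))
```

Each generated session is then mapped to the corresponding sequence of HTTP requests (with think times between them) and replayed against the server.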

For this work we used many tools (mostly open source) and techniques to achieve a result that could also have been reached using a single, high-cost software product. In the authors' opinion, this approach is more flexible, allowing in-depth control of the workflow and, therefore, more precise results.

6. Related Works

In the literature, workload characterization has often been used to generate synthetic workloads [47, 50, 28] for benchmarking purposes. In the particular case of web workloads, much research has been conducted to understand the nature of web traffic and its influence on server performance [26, 27, 28]. A set of invariants has been detected, i.e., phenomena common to many web servers and the related network traffic; a list of the most important invariants can be found in [27]. The reference benchmarking tools using a characterized web workload are SPECweb99 [6] (an evolution of the SPECweb96 package) and SURGE [28]. Both systems perform a sequence of requests to the web server under test, respecting some distributional models. The main difference is that the SPECweb family workload is based on independent HTTP operations per second (not related to specific user sessions), while SURGE traffic is generated in terms of (virtual) user sessions. In this perspective, SURGE is similar to other stressing tools such as OpenSTA or Mercury's LoadRunner, because it emulates web sessions alternating requests for web files with idle times. An early taxonomy of web mining was proposed in [9]. This term generally refers to three distinct activities. Web structure mining is the process of extracting information from the topology of the Web (in particular, the links between pages); as an example, the diameter of the Web has been calculated in [7]. Web structure mining also has applications in categorizing web pages: in [37], a method is described to discover authoritative sites (authorities) for given subjects, and overview sites (hubs) pointing to the authorities. Web content mining is the process of extracting useful information from web sites and the pages they are composed of. One of the main challenges is the definition of what web content is.
Web content is composed of a plurality of data types: text, images, audio, video, metadata and hyperlinks; multimedia data mining has thus become a specific instance of web content mining. As the greatest percentage of web content is unstructured text, great relevance is given to knowledge discovery in text [38]. Reviews of web search engines can be found in [21, 8]. Web usage mining is devoted to investigating how people access and use web pages, and their navigational behaviour. It involves the automatic discovery of user access patterns from one or more web servers, and is based on the analysis of secondary data describing the interactions between users and the Web. Web usage data include data recorded in web server access logs, proxy server logs, browser logs, user profiles, registration data, user sessions and transactions, cookies, user queries, bookmark data, mouse clicks and scrolls, and any other data deriving from the aforementioned interactions [10]. As can easily be understood, the boundaries between these three categories are blurred [10, 39]. Nonetheless, as our main concern is workload characterization, in this paper we concentrate on usage mining as a means to track the behavioural patterns of users surfing a web site. Web usage analysis relates to the development of techniques for discovering and/or predicting the behaviour of users interacting with the Web. The data to be analysed are logged by the user agent (e.g., the web browser) or on the server side (web server, proxy server, application server). Data logged at different locations represent different navigation patterns: client-side data describe, in general, single user-multi site navigation; server-side logs describe multi user-single site interaction; and proxy server logs describe multi user-multi site interaction. Widely adopted log file formats are described in [34, 35].
Server-side logged data include the client IP address (of the machine originating the request, possibly a proxy), user ID (if authentication is required), time/date, request (URI, method and protocol), status (action performed by the server), bytes transferred, referrer, and user agent (operating system and browser used by the client). As this information is incomplete (request parameters are hidden when the POST method is used) and not entirely reliable, it should be integrated using packet sniffers and, where available, application server log files. Client-side data collection has been implemented using remote agents (Java applets, JavaScript code embedded in HTML) or ad-hoc browsers [29]. This method has two major drawbacks: the need for the user's collaboration and for custom code distribution. Proxy server data describe, on one side, accesses to cached pages and, on the other, accesses to sites from actual clients, seen as a single anonymous entity by the web server.
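The server-side fields listed above correspond to the Combined Log Format, and extracting them is a one-regex job. A minimal sketch, with an invented sample line:

```python
import re

# Combined Log Format: IP, identd, user, [time], "request", status, bytes,
# "referrer", "user agent".
COMBINED_RE = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_entry(line):
    """Return the log fields as a dict, or None for malformed lines."""
    m = COMBINED_RE.match(line)
    return m.groupdict() if m else None

entry = parse_entry(
    '10.0.0.1 - alice [12/Mar/2004:09:15:31 +0100] '
    '"GET /index.html HTTP/1.1" 200 1043 '
    '"http://example.org/" "Mozilla/4.0"'
)
print(entry["status"], entry["request"])  # 200 GET /index.html HTTP/1.1
```

Parsed entries are the raw material for all the preprocessing steps discussed next (user identification, sessionization, and so on).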

Web mining techniques are analysed in [11], where three different phases are identified: pre-processing, pattern discovery and pattern analysis. Data abstractions may be built representing, for example, users (single actors using a browser to access files served by a web server), page views (the set of files served to the browser in response to a user action such as a mouse click), click-streams (a sequential series of page view requests), user sessions (the click-stream of a single user across the WWW), and server sessions (the set of page views of a user session on a single web site). [15] provides the definitions of the terms relevant to web usage analysis. Particular stress is given to data preparation and pre-processing: as said before, data may be incomplete (especially when client-side logs are unavailable), leading to difficulties in user identification and to the impossibility of detecting the termination of a user session (a default timeout of 30 minutes is assumed, following [16]). Depending on the data actually available for the analysis, typical problems arising in user behaviour reconstruction include: multiple server sessions associated with a single client IP address (as in the case of users accessing a site through a proxy); multiple IP addresses associated with a single server session (the so-called mega-proxy problem); a user accessing the Web from several machines; and a user using several user agents to access the Web. Assuming that the user has been identified, the associated click-stream has to be divided into user sessions. As anticipated, the first relevant problem is the identification of the session termination. Other relevant issues are the need to access application and content server information, and the integration with proxy server logs relative to accesses to cached resources.
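Sessionization with the 30-minute inactivity timeout can be sketched as follows (the click-stream data are invented for the example, and timestamps are plain seconds for simplicity):

```python
SESSION_TIMEOUT = 30 * 60  # seconds; the 30-minute default cited from [16]

def sessionize(clickstream, timeout=SESSION_TIMEOUT):
    """Split one user's (timestamp, url) click-stream, assumed sorted by
    time, into sessions: a gap longer than `timeout` starts a new session."""
    sessions, current, last_t = [], [], None
    for t, url in clickstream:
        if last_t is not None and t - last_t > timeout:
            sessions.append(current)
            current = []
        current.append((t, url))
        last_t = t
    if current:
        sessions.append(current)
    return sessions

# A gap of 3880 s (> 1800 s) between /search and /home splits the stream.
clicks = [(0, "/home"), (120, "/search"), (4000, "/home"), (4100, "/faq")]
print(len(sessionize(clicks)))  # 2
```

This heuristic is applied per identified user, which is why the user identification problems listed above must be resolved first.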
Interwoven with usage preprocessing is content preprocessing: page views may be classified or clustered depending on their intended use, and the results of this process used to constrain the discovered usage patterns. Data preprocessing is followed by the pattern discovery process. The techniques adopted for this step strictly depend on the aim of the analysis; the available methods draw upon several fields, such as statistics, data mining, machine learning, and pattern recognition. Statistical analysis is generally applied to discover information such as the most accessed pages or the average length of a navigation path through a web site. [40] discusses the use of association rule generation to find correlations between pages most often referenced together in a server session, with a support value exceeding a given threshold; the results may find application in developing marketing strategies for e-business sites, as well as in providing hints for restructuring a web site. Clustering is used to group together items having similar characteristics: here, clustering may be used to group users exhibiting similar navigation behaviour (usage clusters) or pages having related content (page clusters). In the first case the information is again relevant for marketing purposes while, in the second, it might be used by search engines. Classification techniques are often used to associate navigation behaviours with groups of users (or profiles). [41] discusses the application of sequential pattern discovery techniques to identify sets of items followed by further items in a time-ordered sequence: this is relevant for marketing purposes, e.g. for placing advertisements along the navigation paths of certain users. Dependency modelling has the goal of developing a model representing significant dependencies among the various variables on the Web (for instance, modelling the stages a user goes through during a visit to an on-line store).
This is useful not only for predicting user behaviour but also for predicting web resource consumption. The last step is pattern analysis, whose aim is to exclude uninteresting patterns and rules from further analysis. The most common techniques require the use of relational databases; [24] describes modelling the data as a data cube in order to perform OLAP operations. According to [10], applications of web usage mining may be classified into two categories: those dedicated to understanding user behaviour and tailoring (personalizing) the site accordingly

[17, 18, 19], and those dedicated to an architectural (impersonal, related to the site topology) improvement of the web site's effectiveness [20]. The personalization of the web user interface depending on the user browsing the site has been addressed through the application of artificial intelligence techniques combined with the use of user access patterns [18, 19]. Making dynamic recommendations to users depending on their profiles, as well as on their navigation profiles, is relevant for e-business applications; [42] presents a knowledge discovery process for extracting marketing intelligence from web data. Moreover, the analysis of web site usage may give important hints for site redesign, so as to enhance accessibility and attractiveness. Web usage mining may also provide insight into web traffic behaviour, allowing the development of adequate policies for content caching and distribution, load balancing, network transmission, and security management. [12] describes an approach to reconstructing user navigation patterns from the analysis of logged data: this is relevant both for customizing the content presented to the user and for improving the site's structure. Two techniques are identified: mapping the data to relational tables and then applying data mining algorithms, or directly applying analysis algorithms to the logged data. [43] presents an interesting method, based on data mining techniques, for modelling disk I/O as well as network or web traffic (in general, everything that can be described by bursty, self-similar sequences). Another important application field of web usage mining is web performance. In this domain, too, intelligent systems are needed in order to identify, characterize and understand user behaviours. In fact, users' navigation activity determines network traffic and, as a consequence, influences web server performance.
Server resources are concurrently accessed and consumed, and performance metrics must be continuously tuned in order to keep services available and reliable. In [22, 23], pre-fetching is proposed in order to improve web latency. Of course, this technique is useful, and does not waste bandwidth, only if the pre-fetched object or page is the target of the user's next request; pre-fetching must therefore be subordinated to the prediction of the user's browsing activity. Browsing strategies need to be classified: in [29], client-side traces are analysed and users are categorized according to the sizes of common request sequences. Pre-fetching from the server side is considered in [22]. In [23], time-series analysis and digital signal processing are used to model web users: during a session, when a resulting threshold is reached, pre-fetching starts. However, as pointed out in [30], the number of clicks per session exhibits strong regularities: the observed distribution was shown to be inverse Gaussian, and the authors discuss how this limits pre-fetching strategies. The goal of improving web performance can be reached only if WWW workload is well understood (see [3] for a survey of proposed WWW characterizations). The nature of web traffic has been deeply studied and analysed: a first set of invariants was presented in [27], and [26] discusses the self-similar nature of web traffic. A workload characterization can also be used to create a synthetic workload [28] in order to benchmark a site: a monitoring agent checks the resource usage metrics during the test, searching for system and network bottlenecks. In [31], two important models characterizing user sessions are introduced: the Customer Behaviour Model Graph (CBMG) and the Customer Visit Model (CVM). Unlike traditional workload characterizations, these models are designed specifically for e-commerce sites, where user sessions are represented in terms of navigational patterns.
Moreover, these models can be obtained from the HTTP logs available on the server side. A CVM is a more compact model than the CBMG (which we presented in Section 4): it represents sessions as a collection of session vectors, one per session. Each value in a session vector is the number of times that one of the different functions (i.e., states in a CBMG) was invoked during the session. In [32], fractal clustering is used to find similar patterns in a collection of CVMs. Such

workload characterizations fit the nature of web traffic well, enabling performance analysis and, at the same time, observations about users' common behaviours.
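The CVM representation described above reduces each session to a vector of per-function invocation counts. A minimal sketch, with the function (state) names and the sample session invented for illustration:

```python
def session_vector(session, functions):
    """CVM session vector: how many times each site function (a CBMG
    state) was invoked during one session."""
    return [session.count(f) for f in functions]

FUNCTIONS = ["Home", "Search", "Browse", "Pay"]  # illustrative CBMG states
session = ["Home", "Search", "Browse", "Browse", "Home"]
print(session_vector(session, FUNCTIONS))  # [2, 1, 2, 0]
```

A collection of such vectors, one per session, is exactly the input on which clustering techniques like the fractal clustering of [32] operate.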

7. Conclusions

A workload characterization based on Customer Behaviour Model Graphs has been proposed to improve existing web stressing tools: using a CBMG model, it is possible to produce a realistic workload, instead of stressing the web site with a single, randomly selected session replicated over and over. An open source software package has been implemented, allowing the analyst to produce CBMGs from log files and stressing scripts from CBMGs. The overall integrated methodology has been applied to a relevant case study to test its effectiveness. A study of secure web server (HTTPS) workloads is planned, in order to generalize our results to a wider and more complex environment.

8. Acknowledgments

This paper was realized at WTLAB, a consortium between the Department of Computer Science of the University of Turin and CSP. The authors would also like to thank Roberto Borri, Matteo Sereno, and Francesco Bergadano for their helpful comments and suggestions.

9. References

[1] M. Calzarossa, G. Serazzi. "Workload Characterization: A Survey". Proc. of the IEEE, 81(8):1136-1150, 1993.
[2] V. Cardellini, M. Colajanni, P. S. Yu. "Dynamic load balancing on Web server systems". IEEE Internet Computing, 3(3):28-39, May-June 1999.
[3] J. E. Pitkow. "Summary of WWW Characterization". The Web Journal, 2000.
[4] OpenSTA (Open System Testing Architecture), http://www.opensta.org
[5] Mercury Interactive LoadRunner, http://www.mercuryinteractive.com/products/loadrunner
[6] SPECweb99 Benchmark, http://www.specbench.org/web99/
[7] R. Albert, H. Jeong, A.-L. Barabási. "Diameter of the World-Wide Web". Nature, 401:130-131, 1999.
[8] H. Vernon Leighton, J. Srivastava. "Precision Among WWW Search Services (Search Engines): Alta Vista, Excite, Hotbot, Infoseek, Lycos". http://www.winona.msus.edu/is-f/libraryf/webind2/webind2.htm, 1997.
[9] R. Cooley, B. Mobasher, J. Srivastava. "Web Mining: Information and Pattern Discovery on the World Wide Web". Department of Computer Science, University of Minnesota, Minneapolis, MN 55455, USA, 1997.
[10] R. Kosala, H. Blockeel. "Web Mining Research: A Survey". SIGKDD Explorations, 2(1):1-15, 2000.
[11] J. Srivastava, R. Cooley, M. Deshpande, P.-N. Tan. "Web usage mining: Discovery and applications of usage patterns from web data". SIGKDD Explorations, 1(2), 2000.
[12] J. Borges, M. Levene. "Data Mining of User Navigation Patterns". In Web Usage Analysis and User Profiling, pp. 92-111, LNCS 1836, Springer-Verlag, 2000.
[13] B. Mobasher, H. Dai, T. Luo, M. Nakagawa. "Discovery and Evaluation of Aggregate Usage Profiles for Web Personalization". Data Mining and Knowledge Discovery, 6(1):61-82, January 2002.
[14] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth. "From Data Mining to Knowledge Discovery: An Overview". In Proc. of ACM KDD, 1994.
[15] Web Characterization Activity, http://www.w3.org/WCA/
[16] L. Catledge, J. Pitkow. "Characterizing browsing strategies in the World Wide Web". In Proc. of the 3rd International WWW Conference (Darmstadt, Germany, April 1995), Elsevier, pp. 1065-1073.
[17] P. Langley. "User modelling in adaptive interfaces". In Proc. of the Seventh International Conference on User Modelling, pp. 357-370, 1999.
[18] M. Perkowitz, O. Etzioni. "Adaptive web sites: an AI challenge". In Proc. of the 15th Int. Joint Conf. on AI, pp. 16-23, 1997.
[19] M. Perkowitz, O. Etzioni. "Adaptive Sites: Automatically Learning From User Access Patterns". In Proc. of the 6th International World Wide Web Conference, poster no. 722, 1997.
[20] M. Spiliopoulou, C. Pohle, L. C. Faulstich. "Improving the effectiveness of a Web site with Web usage mining". In Proc. of the Workshop on Web Usage Analysis and User Profiling (WebKDD99), San Diego, August 1999.
[21] H. Vernon Leighton, J. Srivastava. "Precision Among WWW Search Services (Search Engines): Alta Vista, Excite, Hotbot, Infoseek, Lycos". http://www.winona.msus.edu/is-f/libraryf/webind2/webind2.htm, 1997.
[22] V. N. Padmanabhan, J. C. Mogul. "Using Predictive Prefetching to Improve World Wide Web Latency". Computer Communication Review, 26, July 1996.
[23] C. R. Cunha, C. F. B. Jaccoud. "Determining WWW User's Next Access and its Application to Pre-fetching". In Proc. of the Intern. Symp. on Computers and Communications '97, Alexandria, Egypt, 1-3 July 1997.
[24] O. R. Zaïane, M. Xin, J. Han. "Discovering Web Access Patterns and Trends by Applying OLAP and Data Mining Technology on Web Logs". In Proc. of the Advances in Digital Libraries Conf. (ADL'98), Santa Barbara, CA, April 1998.
[25] D. A. Menascé, V. A. F. Almeida, R. Fonseca, M. A. Mendes. "A Methodology for Workload Characterization of E-Commerce Sites". In Proc. of the ACM Conf. on E-Commerce, Denver, CO, Nov. 1999.
[26] M. E. Crovella, A. Bestavros. "Self-Similarity in World Wide Web Traffic: Evidence and Possible Causes". In Proc. of ACM SIGMETRICS, 1996.
[27] M. F. Arlitt, C. L. Williamson. "Web Server Workload Characterization: The Search for Invariants". In Proc. of ACM SIGMETRICS, 1996.
[28] P. Barford, M. Crovella. "Generating Representative Web Workloads for Network and Server Performance Evaluation". In Proc. of ACM SIGMETRICS, 1998.
[29] L. D. Catledge, J. E. Pitkow. "Characterizing browsing strategies in the World Wide Web". Computer Networks and ISDN Systems, 26(6):1065-1073, 1995.
[30] B. Huberman, P. Pirolli, J. Pitkow, R. Lukose. "Strong regularities in WWW surfing". Science, 280, 1998.
[31] D. A. Menascé, V. A. F. Almeida. Scaling for E-Business: Technologies, Models, Performance and Capacity Planning. Prentice Hall, NJ, May 2000.
[32] D. Menascé, B. Abrahão, D. Barbará, V. Almeida, F. Ribeiro. "Fractal Characterization of Web Workloads". In Proc. of the World Wide Web Conference, 2002.
[33] G. Ballocca, R. Politi, G. Ruffo, V. Russo. "Benchmarking a Site with Realistic Workload". In Proc. of the 5th Workshop on Workload Characterization, Austin, TX, November. IEEE Press.
[34] W3C httpd logging configuration, http://www.w3.org/Daemon/User/Config/Logging.html
[35] Extended Log File Format, W3C Working Draft WD-logfile-960323, http://www.w3.org/pub/WWW/TR/WD-logfile-960323.html
[36] M. S. Chen, J. Han, P. S. Yu. "Data Mining: An Overview from a Database Perspective". IEEE Transactions on Knowledge and Data Engineering, 8(6):866-883, 1996.
[37] D. Gibson, J. M. Kleinberg, P. Raghavan. "Inferring web communities from link topology". In Proc. of ACM HyperText, 1998.
[38] M. Hearst. "Untangling text data mining". In Proc. of ACL'99: the 37th Annual Meeting of the Association for Computational Linguistics, 1999.
[39] R. W. Cooley. "Web usage mining: discovery and application of interesting patterns from web data". PhD thesis, University of Minnesota, USA, 2000.
[40] R. Agrawal, R. Srikant. "Fast algorithms for mining association rules". In Proc. of VLDB-94, 1994.
[41] R. Agrawal, R. Srikant. "Mining Sequential Patterns". In Proc. of the Eleventh IEEE International Conference on Data Engineering, IEEE Computer Society Press, 1995.
[42] A. G. Buechner, S. S. Anand, M. D. Mulvenna, J. G. Hughes. "Discovering Internet Marketing Intelligence through Web Log Mining". ACM SIGMOD Record, 27(4), 1999.
[43] M. Wang, T. M. Madhyastha, N. H. Chan, S. Papadimitriou, C. Faloutsos. "Data Mining Meets Performance Evaluation: Fast Algorithms for Modeling Bursty Traffic". In Proc. of ICDE, 2002.
[44] V. Almeida, A. Bestavros, M. Crovella, A. de Oliveira. "Characterizing reference locality in the WWW". In Proc. of the 1996 Intern. Conference on Parallel and Distributed Information Systems (PDIS'96), pp. 92-103, December 1996.
[45] Analog, http://www.analog.cx
[46] P. Barford, M. E. Crovella. "Critical Path Analysis of TCP Transactions". IEEE/ACM Trans. on Networking, 9(3):238-248, June 2001.
[47] L. Bertolotti, M. Calzarossa. "Workload Characterization of Mail Servers". In Proc. of SPECTS'2000, Vancouver, Canada, July 16-20, 2000.
[48] D. Ferrari, G. Serazzi, A. Zeigner. Measurement and Tuning of Computer Systems. Prentice-Hall, 1983.
[49] M. H. MacDougall. Simulating Computer Systems: Techniques and Tools. MIT Press, Cambridge, MA, 1987.
[50] D. A. Menascé. "TPC-W: a benchmark for E-commerce". IEEE Internet Computing, May/June 2002.
[51] D. Mosberger, T. Jin. "httperf: A Tool for Measuring Web Server Performance". In Proc. of the First Workshop on Internet Server Performance, pp. 59-67, ACM, 1998.
[52] K. Kant, V. Tewari, R. Iyer. "Geist: A generator for e-commerce internet server traffic".
[53] WebStone, http://www.mindcraft.com/webstone/
[54] WebBench, http://www.veritest.com/benchmarks/webbench/webbench.asp
[55] G. Ruffo, R. Schifanella, M. Sereno, R. Politi. "WALTy: A User Behavior Tailored Tool for Evaluating Web Application Performance". In Proc. of the 3rd IEEE International Symposium on Network Computing and Applications (IEEE NCA), Cambridge, MA, USA, 30 August - 1 September 2004.
[56] G. Ruffo, R. Schifanella, M. Sereno, R. Politi. "WALTy: a Tool for Evaluating Web Application Performance". In Proc. of the 1st International Conference on Quantitative Evaluation of Systems (QEST), Enschede, NL, 27-30 September 2004.
