Client-Side Monitoring for Web Mining

Kurt Fenstermacher and Mark Ginsburg
MIS Department, Eller College of Management, University of Arizona

Abstract

"Garbage in, garbage out" is a well-known phrase in computer analysis, and one that comes to mind when mining Web data to draw conclusions about Web users. The challenge is that data analysts wish to infer patterns of client-side behavior from server-side data. However, because only a fraction of the user's actions ever reaches the Web server, analysts must rely on incomplete data. In this paper, we propose a client-side monitoring system that is unobtrusive and supports flexible data collection. Moreover, the proposed framework encompasses client-side applications beyond the Web browser. Expanding monitoring beyond the browser to incorporate standard office productivity tools enables analysts to derive a much richer and more accurate picture of user behavior on the Web.
Imagine a library where the librarians know not only which books have been checked out, but also which books patrons have pulled from the shelves. For each book pulled from the shelf, the library staff knows whether the reader simply skimmed the text and returned it to the shelf, or carefully read two of the chapters. If the copier were used to copy pages from a journal, the staff would know which issues and which pages were copied and how they were used. With such information, the staff could better judge how the collection was being used, which areas could be improved and more. In short, the staff could help people become better library users. Such monitoring is not possible in the world of printed texts, but it is possible in the online world. In cyberspace, many people access information through the World Wide Web, making the Web a candidate for improved monitoring. Unfortunately, just as many libraries only know which works are checked out, very little information is available about how people use the Web. Today, Web server logs are the primary source of data for mining. Server logs are thus the base from which analysts draw inferences about user behavior. Although the problems of server log analysis are well known, there has not been extensive work exploring user monitoring at the client. In this paper, we summarize the shortcomings of server-side data, and propose a framework for client-side Web monitoring. Monitoring at the Web browser level enables much richer data collection, supporting better analysis. Although client-side monitoring schemes have been proposed before, our proposed framework is extensible to other client applications. With broader monitoring on the client side, analysts can view Web usage within the overall client context. In the following section, we survey current server-side analysis, describing the collection of the data and its shortcomings for inferring Web user behavior. After summarizing the current state of the art, we outline the goals of a client-based monitoring framework. With the goals specified, we describe a framework that achieves the goals for users who work with common applications on the Microsoft Windows platform. After presenting a framework for client-side data gathering, we discuss some novel analyses that draw on the richer knowledge of the client. We focus on the potential impact of integrating client-side monitoring of Web access with other client-side applications. Finally, we conclude by describing future directions for the framework.
1. Mining the Web Today

On most computer desktops, the Web browser is the gateway to information. Although other applications might be used more often and for longer, the Web browser is a key element in studying and improving information access. Improving today's access requires understanding how people interact with the Web, which in turn requires mining Web interaction data to infer new knowledge. Web mining has been a subject of much study, for example (Kimbrough, 2000), especially in electronic commerce applications (Mobasher, Cooley, & Srivastava, 2000; Mobasher, Dai, Luo, & Nakagawa, 2000; Padmanabhan, 2001; Sen, Padmanabhan, Tuzhilin, White, & Stein, 1998; Vranica, 2001), as electronic retailers aim to target their pitch ever more narrowly. However, Web mining has been problematic due to the poor and incomplete data analyzed. Typically, data is drawn from Web server logs, which have several shortcomings as an information source. In this section, we first describe site-centric data sources that record information on the Web server as visitors arrive at the site. Next, we discuss techniques that combine site-centric server data with third-party ancillary data for better customer profiling, which shifts the focus from the site to the site's users. We then discuss methods, such as clustering, nearest neighbor and other predictive techniques, that pre-process server data to provide better personalization. Following the thread from site-centric to server-based, user-centric analysis, we conclude with recent work that shifts user-centric analysis from the server to the client.

1.1. History of Server-Side Techniques

The widespread use of the Web among computer users makes Web analysis a potentially rich source of data for studying user behavior and designing information access. Because the Web protocol (Hypertext Transfer Protocol, or HTTP) is designed for efficiency and scalability, it offers little information in the course of a Web transaction (December & Ginsburg, 1996). This basic information includes the client's network address, page requested, time of request and size of page requested, but not, for example, the login name of the user. As the information is transmitted from the client to server, it is usually written to a log file to create a record of server accesses. The absence of user information is aggravated by the statelessness of the Web protocol: no memory is kept between client page requests (December & Ginsburg, 1996). Hence, without additional programming, each connection is established anew and inferences of consumer identities across time become tenuous. Early works identified both the usefulness and the shortcomings of Web server logs to marketers (Sen et al., 1998): the readily available data in Web logfiles satisfies a significant part of the behavior segmentation information needs of marketers, while most of the demographic, geographic, psychographic and benefit segmentation information needs require information to be elicited from the visitors.
Web server logs paint a distorted picture not only because some data never reaches the server, but also because data that is reported is itself incorrect. Conflated data arises because Web server logs record a client’s Internet Protocol (IP) address. Due to the shortage of unique IP addresses, most organizations use the Dynamic Host Configuration Protocol (DHCP) to manage a pool of IP addresses. Addresses are assigned to computers as needed, and thus the same address might be assigned to two (or more) users in sequence. Moreover, the same user might receive different IP addresses on successive days, or even during the course of a day.
Some of these shortcomings are addressed through the use of proxy servers. Proxy servers act as links between Web clients and servers, by responding to requests from clients as if they were Web servers themselves. Proxy servers are installed by many organizations, including Internet Service Providers (ISPs), for many reasons, including more efficient bandwidth utilization, faster response times and security. With respect to Web mining, proxy servers represent an improvement over Web server data because they consolidate the Web traffic from an entire group or organization. Thus, the proxy server log records all the page accesses of a group, regardless of the destination Web server. Because a typical Web user will visit many servers in a session, the upstream data available from a proxy server can lead to new insights into user behavior. Although proxy servers offer a richer dataset to those who mine them, they also aggravate the problems of those who work with Web server logs. Because the proxy server interposes itself between the client and Web server, the Web server sees a single proxy server in the place of many clients. For example, AOL's extensive use of proxy servers prevents Web servers from determining which AOL member is accessing a particular page. Many companies use proxy servers for security reasons, and thus all company employees appear to a Web server as the same proxy server. Content caching is another problem in capturing accurate data for later mining. Because most Web pages change infrequently, Web browsers streamline page access by storing a copy of Web pages on the user's computer in a local cache. Later, if a user requests a cached page, the browser simply displays the cached version. (The use of a browser's Back button is a common instance because the just-viewed page is generally in the browser's cache.) For Web data mining, the problem is that cached page accesses are not recorded by the server, since the browser simply presents its own copy without accessing the server. In an extreme case, a user might retrieve two pages from a server then spend hours moving back and forth between the two pages. The Web server would record this as one page access followed shortly by another, but successive switches between the two pages would not be recorded. Proxy servers add to the challenge because proxies cache pages as well; thus a browser may request an updated version of a page, and the proxy will intercept the request and return its own cached page — again preventing the Web server from recording the request. In short, Web server logs suffer from several important shortcomings: network configuration issues cloud server data, users' Web browsing experiences are fragmented across servers, and lack of user context prevents analysts from examining user behavior pre- and post-access.

1.2. Cookies

Due to the problems in using network addresses to identify users, Web developers have looked for means to associate users with Web access. The challenge is that the Web itself is stateless — meaning that there is no connection among successive page accesses by the same user. In 1995, Netscape Communications introduced the "cookie" as a mechanism to add state to Web client-server sessions. (The original cookie specification has been codified as Internet Engineering Task Force RFC 2965 (Kristol, 2000).) With cookies, Web servers can store short strings of information at the client. For example, a database key or user login name is a common choice for cookie data.
When the user returns to the same site, the previously written "cookie" information is automatically sent to the server with each page access. On operating systems that
support multiple user profiles, cookies can lessen the problems of using network addresses to represent users by associating Web traffic with a returning user, rather than simply an IP number. Often, however, computers shared by many use a single login, defeating cookies. Other people turn off the cookie feature in their browsers in response to privacy concerns (Schwartz, 2001).

1.3. Server-Side Programming to Augment Basic Server-Side Data

Although cookies improve the quality of server-side data, incorporating cookies still maintains a site-centric perspective. Data must still be analyzed site-by-site, meaning that cross-site behavior cannot be observed. Moreover, user actions on the client are only visible through the Web browser, preventing analysts from observing patterns among several applications. Web browsers can be augmented with either server-side programs or embedded scripts to provide additional information to Web servers. LaMacchia (1996) modified the freely available Excite search engine to build an active search agent in his "Internet Fish" system. Ginsburg (1998) and Ginsburg and Kambil (1999) modified the same search engine to provide readership timings of documents within a full-text search framework in the "Annotate" system. In the latter work, a trimodal Gaussian distribution of readership timings was found, as shown in Figure 1. The readership timing distributions shown in Figure 1 lead us to infer three types of document readers:

1. Very short browse time (µ = 3 minutes) – The document is uninteresting and is immediately dismissed; the next document is chosen from the search results or the session is exited.
2. Medium browse time (µ = 10 minutes) – The document is interesting enough to read for more than a few, but less than about 20, minutes.
3. Long browse time (µ = 65 minutes) – The user is idle, having either left her computer or turned to other tasks at the computer.

Figure 1 — Counts of reading time (minutes)

The conclusions above are inferences, and while reasonable, they cannot be verified without contextual information. With client-side monitoring that stretches across all the commonly used applications, however, analysts could resolve the ambiguity in long browse times. In a cross-application monitoring environment, analysts could determine whether such users are still active on their PC or have gone completely idle.
1.4. "User-Centric" Surf Mining

In addition to the technical obstacles that limit the accuracy and richness of server logs, server-side data suffers from the absence of data on user context. As Padmanabhan, Zheng, and
Kimbrough (2001) point out, Web server data is inherently incomplete. Although many users simply browse the Web to pass the time, others access the Web for specific purposes. For example, a user sending an email about an upcoming restaurant reservation might wish to attach a map to the email, and so would use her browser to look up the restaurant's address and then submit it to a map site to generate the map. With server-side log analysis, the link among the email, the address lookup and the map generation is lost. By expanding client-side analysis to monitor applications outside the browser, data mining can reveal patterns of user behavior around Web access, and not simply within it. This field has been very active in both academia and industry. Researchers are wrestling with incomplete data in large quantities; a major focus is preprocessing to support better analysis and integration (Mobasher, Dai, Luo, & Nakagawa, 2000). Another focus is user session tagging. For example, Kimbrough et al. (2000) provide a framework of usage metrics and analyze 30,000 users' surfing behavior in an online travel-booking domain. To do the study, they used user-centric click-stream data provided by a leading market data vendor. Damiani, Oliboni, Quintarelli, and Tanca (2001) present a new user modeling technique based on a temporal graph-based model for semi-structured information in order to formalize the user trajectory, but no specific software techniques are suggested. Edelstein (2001) discusses the challenges of clickstream analysis: consolidating raw data from Web server access logs across multiple servers and turning them into usable records. Another problem is that logical servers may be multiple physical servers for load-balancing or geographical reasons. The logs are difficult to clean in some cases; it is not always easy to remove extraneous items, to identify users and sessions, or to identify transactions. All of these issues make it hard to integrate this data with other data sources.

1.5. Generic Agents to Help Knowledge Work

The work presented by Schnurr and Staab (1999) is not directed specifically at any software architecture. The authors propose an abstract agent to help integrate the semantics of semi-structured documents (e.g., XML) and business processes, in order to support the "desk work" of the end user. The goal is to provide a proactive reasoning agent and improve knowledge management. A similar discussion is set forth by Abecker, Bernardi, and Sintek (1999). The authors discuss the need to actively support the user in a search task. They propose the principle that the organizational memory system should actively offer interesting knowledge. They also observe that applications that interweave data, formal knowledge, and informal representations all have embedded knowledge; thus there is a need for a 'conjoint view and usage of all these representations' (Abecker et al., 1999). The most ambitious of their ideas is that normal work should not be disturbed (the principle of unobtrusiveness): the system should observe the user doing tasks, and then 'automatically gather and store interesting facts and information in an unobtrusive manner' (Abecker et al., 1999). Their implementation is called KnowMore, and features workflow support and heterogeneous information retrieval support via homogeneous knowledge descriptions.
1.6. Improving Personalization and Recommender Systems

One of the problems facing personalization and recommender system designers is incomplete and inaccurate data about consumer preferences and transaction histories. As Padmanabhan et al. (2001) discuss, personalization based on Web server log analysis alone is likely to be faulty. To
personalize effectively or to provide robust recommendations, a richer data source is needed. Adomavicius and Tuzhilin (2001) propose a multi-dimensional "hypercube" and OLAP techniques to provide a richer recommendation framework (they do not define the exact functional form). However, they do not describe a software architecture for the data collection (how to populate the hypercube). The next section discusses the start of a new generation in data collection: the client-side monitoring system.

1.7. Client-Side Monitoring Today

The "next generation" of monitoring recognizes the inherent limitations of server-side inferences, data mining, data filling, and data manipulation. Instead, recent research has turned to software augmentation on the client to give a more granular view of session activities. Such client data can be usefully merged with server data or third-party ancillary data for a more complete user profile. The tip of the iceberg is a paper on personalized shopping (Jörding & Michel, 2001); the paper presents a system, TELLIM, that generates custom multimedia presentations at runtime, matching the display to customer appeal. To do this, the system monitors client interactions, for example the start of an audio or video player. The system then deduces whether the end user is interested in various presentation elements. After a learning algorithm is applied, only preferred presentation elements are provided going forward, and the suppressed items are shown only as links. The technology uses a combination of dynamic HTML (DHTML) and a Java applet for event monitoring on the client.
2. Goals for Client-Side Monitoring

Given the strengths and weaknesses of server-side monitoring, we have outlined four goals for a client-side monitoring framework: monitoring should not interfere with user activities, data collection should be specific to individual users, developers should easily be able to specify what data is collected and how, and, finally, the framework should be extensible to incorporate monitoring of other client-side applications.

Non-interference with user activities is already true of server-side logging, but the shift to the client side makes unobtrusiveness a key element. If users are required to perform additional actions to support monitoring, they will seek ways to lessen the burden of doing so, and will likely distort the results of the monitoring. Thus, the goal becomes to monitor actions in the background, as users work with Web client applications. Unobtrusiveness must be balanced against the need to gather data, however. At one extreme, client-side monitoring could require that users comment on every action they take, including links they examine but choose not to follow, returns to previous pages and all other actions. Although such monitoring would generate rich data (at least for a short period), it is clearly untenable.

One of the key weaknesses of server-side monitoring is that user identities are generally lost to the Web server. (If a proxy server is employed, then the proxy server will usually have access to user data, but the proxy will certainly mask user identities when accessing the requested Web server.) By recording user behavior on the client, it is possible to tie actions to users much more tightly. For example, on the client side, the user's login name and profile information are available, in addition to the network address information sent to the server. The use of proxy
servers, a significant problem for Web server log mining, is not a problem with client-side monitoring. Tightly coupling user identity and user actions, although it aids data mining, raises issues of user privacy. With so much data available at the client, recording and mining the data might reveal extensive personal information. In an organizational setting, this might be appropriate, as companies might wish to know how employees are using company resources. In other settings, however, such extensive data collection could be problematic. In such cases, cryptographic techniques can be applied to preserve the integrity of the data while breaking the link between the data and a known person.

The breadth of data available on the client side poses another problem for those designing client-side solutions: what data should be collected? At the client, every mouse movement, every keystroke, every click and more are available for logging and analysis. The problem of what to collect is not simply a question of granularity, but also of focus. For example, within a Web session, should the system record link hovers, where a user moves the mouse over a hyperlink but does not click? Or should only followed hyperlinks be noted? Which actions to log depends on what the analyst wishes to infer, of course. The range of candidate data has an important implication for monitoring framework design: the framework must allow framework users to easily specify what data is needed.

Finally, as discussed in the previous section, users access the Web as part of a larger context. Thus, the ideal monitoring framework should be extensible to encompass additional applications; ideally, office productivity applications would be included. A complete client-side monitoring solution could log user actions consistently across many applications. Integrated logging could then support new inferences about user behavior. For example, a user writing a memo might search the Web for background information, copying useful information into the word processor she is using to write the memo. Without knowing what information has been copied and pasted, there is no way to distinguish among all the pages the user has browsed. Once the word processor is incorporated into the framework, however, mining the client-side data could reveal which Web sources are most often copied and pasted into documents. The implication is then that the sources most frequently copied offer the most relevant information for a particular user.

In summary, an ideal monitoring framework would address several goals: non-interference, close association between logged data and users, flexible data collection and the ability to integrate monitoring with other client-side applications. In the next section, we propose a framework (which we have already begun to use) that meets the outlined goals.
3. Framework for User Monitoring

Ideally, a common Web monitoring framework could be developed for use on many different platforms. Unfortunately, the tight integration required between a monitoring framework and user applications precludes a single framework. The widespread adoption of Microsoft Windows, however, means that a Windows solution addresses the needs of many framework users. Although the previous discussion of client-side monitoring applies to all platforms, the implementation we present is based on Microsoft Windows. In addition, the proposed framework is currently built around Internet Explorer, but could be extended to other browsers. Current implementations of the proposed framework combine Microsoft's Component Object Model
(COM) technology (Box, 1998; Microsoft Corporation, 1995), Internet Explorer (IE) and Python (an open-source programming language) (Python Language Website, 2001). Briefly, the framework uses COM to capture events fired by Internet Explorer and log these events on the client for later transmission to a server. In the following sections, we explain the framework more fully.

In addition to substantial market share, Microsoft Windows also offers the technology infrastructure needed to support a monitoring framework in its Component Object Model (COM). Most Windows applications expose some, if not all, of their functionality through COM interfaces. In other words, most of what users can do through an application's user interface, programs can do through COM. On Windows, COM provides the needed infrastructure both to allow programs to control applications (automation, in COM terminology) and to receive event notifications from applications. One of the key reasons for using COM is that automation and event notification happen in the background; users can continue to use applications as they normally do, while developers can use COM to control the applications through a second channel. Because the COM standard does not specify what properties or events to expose, there is no consistency among monitored applications. Microsoft offers similar support across Office suite applications within a release, but because each software vendor is free to expose what it chooses, researchers interested in monitoring information flows across many applications must often be creative. As shown in Figure 2, even Office applications support different events and properties — Microsoft Word uses documents while Microsoft Excel uses worksheets and workbooks, for example. In our own work, we are interested in tracking information flows within and among applications. Often users will search for information on the Web through a Web browser, and then copy portions of the text of a relevant Web page (or simply its URL). By monitoring the pages they visit in Internet Explorer, as well as users' use of the Windows clipboard, we can study how people create documents and what sources are consulted in the creation of those documents. Of course, the same monitoring techniques can be used to gain a different perspective on Web usage alone by recording information not generally available through server-side data collection.
3.1. Monitoring Internet Explorer

As Microsoft's Web browser, Internet Explorer (IE) offers excellent support for control and event handling through COM. Thus, if the users of interest use IE as their browser, the framework can gather extensive information from browsing sessions. While a complete listing of IE events and properties is given in Tables 1 and 2 in the appendix, the monitored version of IE that we have implemented uses the three events and two properties shown in Figure 2. The key event, NavigateComplete2, notifies us each time a user visits a new page in IE; the notification includes the URL itself. The NavigateComplete2 event fires not only each time the browser accesses a Web server, but also each time a proxy server intercepts the request, on every press of the Back or Forward buttons, and every time the user clicks the Refresh button. Thus, NavigateComplete2 offers a much more complete picture of user browsing behavior than is possible with server-side data collection. Listing 1 in the Appendix shows a brief Python program that uses the NavigateComplete2 event to print out a Web page's title after the page has loaded.
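In the same spirit as Listing 1, the sketch below shows how such an event handler can be wired up with the pywin32 package; the class name, log format, and message loop are our own illustrative choices rather than the framework's actual code.

# Minimal sketch: capture Internet Explorer navigation events through COM.
# Assumes the pywin32 (win32com) package; names and logging are illustrative.
import time
import pythoncom
import win32com.client

class BrowserEvents:
    """Handlers for DWebBrowserEvents2 events fired by Internet Explorer."""
    def OnNavigateComplete2(self, pDisp, URL):
        # Fires on every completed navigation, including Back/Forward presses
        # and cached pages that never reach the Web server.
        print(time.strftime("%H:%M:%S"), "visited", URL)

    def OnStatusTextChange(self, Text):
        # Fires when the status bar changes, e.g. when the mouse hovers a link.
        print(time.strftime("%H:%M:%S"), "status", Text)

    def OnQuit(self):
        print("browser closed")

ie = win32com.client.DispatchWithEvents("InternetExplorer.Application", BrowserEvents)
ie.Visible = True
while True:                      # pump COM messages so the events are delivered
    pythoncom.PumpWaitingMessages()
    time.sleep(0.1)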
Figure 2 — Some actions and properties exposed through Microsoft's Component Object Model.
  Microsoft Internet Explorer 6: monitored user actions NavigateComplete2, StatusTextChange, Quit; monitored properties Document title, Visited URLs, Status text.
  Microsoft Word XP: monitored user actions DocumentChange, WindowSelectionChange, Quit; monitored properties Document path, Document name.
  Microsoft Excel XP: monitored user actions NewWorkbook, SheetChange, SheetSelectionChange, WorkbookBeforeClose, WorkbookOpen; monitored properties Document path, Document name.
  Windows Clipboard: monitored user actions Copy, Cut; monitored properties Clipboard contents (multiple formats).
  Other, unmonitored applications.

Our monitored version of Internet Explorer also tracks changes in the status text that IE displays. (In Internet Explorer, the status text is shown in the lower left-hand corner of the window frame.) Although client-side scripts can change the status text, IE itself changes the status text when the mouse pointer moves over a link on a Web page — the status text is updated
to show the URL embedded in the link. In addition, the status text is changed again as the mouse pointer moves off the link. By comparing the timestamps of the successive status text changes, analysts can infer the duration of link hovers, perhaps giving a clue as to which links users considered following. Although the monitored version of Internet Explorer we have implemented captures much useful information, it does not currently record an important component of Web behavior: filling out Web forms. The NavigateComplete2 event is not helpful for capturing form data because it fires only after the page has loaded — one can capture the URL of the page containing the form and the page that results from the form-handling script, but not the form data itself. Internet Explorer does offer analysts access to form data, however, through the BeforeNavigate2 event. As its name implies, this event fires just before a new page is requested, and the event information includes (among other things) the form data (if using the POST method — when using the GET method, the data is encoded in the URL). In addition to events, Internet Explorer also exposes properties through COM. The properties defined are listed in Table 2, and are in addition to any information that is available through events (such as the form data mentioned above). Although most of the properties are details regarding window size and placement, some of the properties offer important content information. For example, the Document property (Microsoft Corporation 2001) corresponds to the World Wide Web Consortium's (W3C) Document Object Model (DOM) representation
(World Wide Web Consortium 2000) of the current document. The W3C DOM offers extensive information useful for analysis, including the document’s title, last modified date, links and the structure of the HTML page. In addition to the information reported by Internet Explorer itself, the monitoring framework has access to data associated with the current Windows session: the computer’s network (IP) address, the current time and time zone, the current user’s name and more. Thus, each Web page loaded can be recorded and stamped with the current time to support post hoc analysis of event logs.
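As a rough sketch of how these two access paths might be tapped through pywin32, the handler below reads POST form data from BeforeNavigate2 and DOM properties once a page has loaded; the handler class and print statements are illustrative assumptions, not the framework's actual code.

# Sketch: capture form POST data and DOM metadata from Internet Explorer.
# Assumes pywin32; the message pump shown earlier is omitted for brevity.
import win32com.client

class FormAndDocumentEvents:
    def OnBeforeNavigate2(self, pDisp, URL, Flags, TargetFrameName,
                          PostData, Headers, Cancel):
        # Fires just before a request is issued; PostData carries form data
        # submitted with POST (GET data is already encoded in the URL).
        if PostData:
            print("form submission to", URL, "data:", PostData)

    def OnNavigateComplete2(self, pDisp, URL):
        # After loading, the W3C DOM is reachable through the Document
        # property: title, last-modified date, links and page structure.
        doc = pDisp.Document
        print("loaded", URL, "title:", doc.title, "modified:", doc.lastModified)

ie = win32com.client.DispatchWithEvents("InternetExplorer.Application",
                                        FormAndDocumentEvents)
ie.Visible = True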
Link mining

Because Internet Explorer (IE) exposes the document object model (DOM) of the current Web page, the complete page content is available. IE's DOM implementation includes some notable collections as well, including all the links on a page, all the IMG (image) objects and all the named anchors, among others. With the complete DOM, mining the log files makes it possible to compare link hovers (perhaps suggesting consideration of the link) to followed links, to identify distinguishing features of followed links. Path analysis could reveal that a user considered two links, followed one, then used the Back button to return to the original page and followed the second link, then continued on, suggesting that the user was "fooled" by the link descriptions of the original page into selecting the less relevant link. Once post-processing has identified which links the user has considered, an analyst can compare and contrast three groupings: links not considered, links considered but not followed and links followed. Identifying patterns among the three groups could aid page designers in reformatting pages to improve access (or boost sales on an e-commerce site). A second-order analysis might incorporate the referring site or page, thus suggesting that different pages could be dynamically generated based on the referring site. Link mining could also be the basis for new visualizations. Currently, many graphical renderings of the Web depict it as static, relying on the fixed links embedded in pages to demonstrate the link relationship among pages. With the above link analysis, analysts can view the Web not as it is constructed, but instead as it is used. Visualizations could depict the trajectories through the Web, and help analysts identify patterns such as links followed only to have users return to the linking page to follow another considered link. In addition to extending the potential gains from Web mining, the framework described here can be an integral part of other applications as well.
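To make the post-processing concrete, the sketch below pairs successive status-text changes from a client log to estimate hover durations and then separates links considered from links followed; the (timestamp, event, value) record format is an assumption for illustration.

# Sketch: derive link hovers and followed links from a client-side event log.
# The (timestamp, event, value) record format is assumed for illustration.
def analyze_hovers(log):
    hovers = {}        # URL -> seconds the mouse rested on a link to that URL
    followed = set()   # URLs the user actually navigated to
    hover_url, hover_start = None, None
    for timestamp, event, value in log:
        if event == "status" and value.startswith("http"):
            hover_url, hover_start = value, timestamp        # mouse entered a link
        elif event == "status" and hover_url is not None:
            hovers[hover_url] = hovers.get(hover_url, 0) + (timestamp - hover_start)
            hover_url, hover_start = None, None              # mouse left the link
        elif event == "visited":
            followed.add(value)
    considered_only = {url: t for url, t in hovers.items() if url not in followed}
    return considered_only, followed

log = [(0.0, "status", "http://example.com/a"), (2.5, "status", ""),
       (3.0, "status", "http://example.com/b"), (4.0, "status", ""),
       (4.2, "visited", "http://example.com/b")]
print(analyze_hovers(log))   # /a was considered but not followed; /b was followed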
Process-oriented knowledge delivery

The key notion of process-oriented knowledge delivery is that by associating information sources with tasks and processes, and monitoring user actions to infer broader processes, the system can offer task-specific assistance to users. Process-oriented knowledge delivery is addressed in (Fenstermacher, 1999), but the resulting system in that case was tailored to the particular task at hand (reference library services in a consulting environment). More recent development efforts have focused on building a framework that allows developers to define processes, and annotate them with metadata, that work with off-the-shelf applications. Two projects that adopt the process-oriented perspective offer assistance with Web-based restaurant
reservations and online public library access; both are currently under development by a research team jointly led by the authors. The restaurant reservation system is primarily Web-based, and thus the application of the client-side framework in use is very similar to that presented in this paper. Briefly, the system monitors user actions looking for Web access to known restaurant recommendation sites, such as Zagat.com or the restaurant section of a CitySearch city guide. After a user's visit to a review site is noted, the system is able to recommend review sites with similar coverage. After the user selects a particular restaurant, the system uses the browser's document object model representation to parse the page, extracting relevant information such as the restaurant's name, address and possibly phone number. The system will then offer to aid the user by verifying the phone number, retrieving a map and directions to the restaurant, and sending an email notifying others of the restaurant, time and date once a reservation is made. The restaurant reservation assistant relies on the browser's NavigateComplete2 event to infer when a user is engaged in a reservation process. When the reservation process is read in at startup, a table of review sites and their URLs is built. As the user browses Web sites (generating NavigateComplete2 events), the URL passed to the event handler is used to index into the table, looking for a site that matches a known restaurant review site. If a match is found, the assistant uses COM's automation capabilities to continue the process on the user's behalf. Although work is only beginning on the virtual public library system, the architecture is similar to that of other process-oriented retrieval systems. In a joint project with the Tucson-Pima Public Library, the goal is to build an assistant that acts as a reference librarian for the most common online information requests. After identifying the information sought, an online patron would answer questions, as in a live reference interview. The interview results and original information request would then be used to aid the patron in an online search, while access would continue to be monitored to offer context-sensitive help.
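A minimal sketch of the table lookup just described follows; the site table, URL parsing, and follow-up action are illustrative placeholders rather than the assistant's actual process definitions.

# Sketch: match visited URLs against a table of known restaurant review sites.
# Table contents and the follow-up action are illustrative placeholders.
from urllib.parse import urlparse

REVIEW_SITES = {
    "www.zagat.com": "Zagat restaurant reviews",
    "www.citysearch.com": "CitySearch city guide",
}

def on_navigate_complete(url):
    """Called from the NavigateComplete2 handler with the loaded URL."""
    host = urlparse(url).netloc.lower()
    site = REVIEW_SITES.get(host)
    if site is not None:
        # A known review site was visited: the assistant can now suggest
        # similar sites, or parse the page's DOM for name, address and phone.
        print("Visit to", site, "detected; offering reservation assistance")

on_navigate_complete("http://www.zagat.com/restaurants/tucson")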
3.2. Monitoring Microsoft Word

Because documents are a common endpoint of information-gathering tasks, monitoring a user's word processing application is important to understand the flow of information as well as Web usage. For example, although monitoring the Web browser will reveal which sites a user visited (as well as links considered and other information, as described in the above section on monitoring Internet Explorer), it does not tell site designers how the information is being used. Web site designers (or Web researchers) might wish to know which text from a site's Web pages is being incorporated into end-user documents, or which URLs are being pasted into documents. The current version of the framework captures the information listed in Figure 2: events to detect when a user opens or creates a document or simply switches among several documents being edited (DocumentChange), when the user quits Word (Quit) or, perhaps most importantly, when the selected text in the active document changes (WindowSelectionChange). The last event is critical because a user pasting text into a document triggers the event. Since users are more likely to copy large blocks of text using the clipboard than by retyping, recognizing when a user has pasted text (and recording the pasted text) is important to support analysis of Web usage. Because Word also provides the document name and path, logging of the aforementioned events can be associated with particular documents over time.
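A minimal sketch of the Word side of the framework, again assuming pywin32; the class, the guard against an empty document list, and the logging are illustrative.

# Sketch: monitor Microsoft Word document and selection changes through COM.
# Assumes pywin32; names and logging are illustrative.
import win32com.client

class WordEvents:
    def OnDocumentChange(self):
        # Fires when the user creates, opens, or switches between documents.
        if self.Documents.Count:
            doc = self.ActiveDocument
            print("active document:", doc.Name, "in", doc.Path)

    def OnWindowSelectionChange(self, Sel):
        # Fires whenever the selection changes, including after a paste.
        print("selection now:", repr(Sel.Text[:80]))

    def OnQuit(self):
        print("Word closed")

word = win32com.client.DispatchWithEvents("Word.Application", WordEvents)
word.Visible = True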
Compositional analysis

Another advantage of our monitoring framework is the ability to pull apart a new document version to determine which sections were taken directly from consulted information sources, for example Web pages. We can also find lesser indications of interest, for example a Web page link copied into a Word document's footnote but none of its corresponding content. Document composition analysis, previously not feasible, has wide-ranging implications in enterprise knowledge management. Now we can quickly learn, among such common document genres as daily economic reports, strategy visions, or ad-hoc memoranda, what information sources feed into the creation process. Analyses can be conducted on a genre-by-genre basis as well as a job-function by job-function basis for improved understanding of enterprise document workflows. The analysis will also be valuable to enterprise librarians as a metric to gauge the usefulness of subscriptions, such as Gartner and Seybold reports. To accomplish document composition analysis, we store material copied to the clipboard in a key-value data structure. The key in this case is the hash digest of the copied text, and the value is the text itself. When material from the clipboard is pasted into a document, we store the pasted text in a similar data structure. Thus, when the next version of the document is ready, we need only convert the document into Rich Text Format or ASCII (a format suitable for string scanning) and then compare the document text to the material pasted in the steps leading up to the new version.
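The copy-tracking scheme might look roughly like the sketch below; the data structures, hash function, and plain-text comparison are illustrative assumptions.

# Sketch: index copied text by hash digest, then test a later document
# version for pasted material. Structures and matching are illustrative.
import hashlib

copied = {}   # hash digest of copied text -> the copied text itself

def record_copy(text):
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
    copied[digest] = text

def composition_report(document_text):
    """Return the copied fragments that reappear in a document version."""
    return [text for text in copied.values() if text in document_text]

record_copy("client-side monitoring supports flexible data collection")
draft = ("In this memo we note that client-side monitoring supports "
         "flexible data collection, as discussed on the project Web page.")
print(composition_report(draft))   # the copied fragment is reported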
3.3. Monitoring Microsoft Excel

Although Excel's spreadsheet capabilities do not play a prominent role in our current research, we have instrumented Excel to detect user actions similar to those captured in Microsoft Word. For Microsoft Excel, our current implementation records when users create or open workbooks (the Excel files that most Excel users are familiar with) as well as when they create or switch among worksheets (the tabbed sheets that are grouped together inside Excel workbooks). Like Word, Excel supports notification of a selection change (SheetSelectionChange), which notifies the framework when an Excel user has pasted an item into a worksheet.
3.4. Windows Clipboard: The Link Among Applications

One of the critical challenges in implementing a framework for cross-application monitoring is to capture the actions of users not only within common applications, but across applications as well. Microsoft Windows offers many different mechanisms for applications to communicate with one another, including the Component Object Model (COM), object linking and embedding
(OLE) and many others as well. However, many Windows users turn to copy and paste to transfer text and graphics from one application to another. As discussed above, both Word and Excel support notification of paste events, when a user inserts an item from the Windows clipboard. However, such notification does not itself report the pasted data or the source of the data (the application in which the cut or copy was performed). Moreover, there are thousands of Windows applications, many more than we could hope to instrument, even given the ease of implementing the current framework. As the linchpin of cross-application interaction, however, integrating the Windows clipboard into the monitoring framework enables us to record cuts or copies from otherwise unmonitored applications, as shown in Figure 3. The current clipboard monitoring mechanism enables us to detect when any Windows application cuts or copies data to the clipboard. The current implementation also records the name of the application from which the data was drawn. In addition, the clipboard monitor can retrieve data in multiple formats. Although most users think of the Windows clipboard as storing only a single item, in fact the clipboard can store the same item in multiple formats. For example, after a user copies an image from a Web page, some Web browsers will store both the image and its alternative text on the clipboard. In applications that offer a Paste Special… item on the Edit menu, users can see this choice and select whether they wish to paste a bitmap or unformatted text.
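As a rough sketch of what such a clipboard monitor could look like with pywin32 (the actual implementation may rely on clipboard-viewer notifications rather than polling), the loop below records new clipboard text together with the foreground application.

# Sketch: poll the Windows clipboard and record new text with the foreground
# (source) application. Assumes pywin32; polling and logging are illustrative.
import time
import win32clipboard
import win32gui

def read_clipboard_text():
    win32clipboard.OpenClipboard()
    try:
        if win32clipboard.IsClipboardFormatAvailable(win32clipboard.CF_UNICODETEXT):
            return win32clipboard.GetClipboardData(win32clipboard.CF_UNICODETEXT)
        return None                       # non-text data (bitmap, files, ...)
    finally:
        win32clipboard.CloseClipboard()

last_seen = win32clipboard.GetClipboardSequenceNumber()
while True:
    current = win32clipboard.GetClipboardSequenceNumber()
    if current != last_seen:              # the clipboard contents changed
        last_seen = current
        source = win32gui.GetWindowText(win32gui.GetForegroundWindow())
        text = read_clipboard_text()
        print("copy/cut in", source, ":", repr(text)[:80] if text else "(non-text)")
    time.sleep(0.5)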
3.5. Putting the Pieces Together: Integrated Application Monitoring

One of the arguments for client-side monitoring is the opportunity to monitor applications beyond the Web browser. For most users, Web access is in service of a larger task. A student is looking for background information for a paper, or a salesperson needs to make a restaurant reservation following a client meeting, for example. With Web-only monitoring solutions, user actions are seen through blinders and the connection between Web access and other applications is lost. An important goal for a Web monitoring framework is thus extensibility, enabling developers employing the framework to monitor other applications and share monitoring data across applications.

Figure 3 — Cross-application usage in editing a document. The depicted sequence: edit article in word processor; switch from word processor to Web browser; click on ResearchIndex bookmark; search/browse ResearchIndex page; select Web page item (text/graphic); copy item to clipboard; switch from Web browser to word processor; paste clipboard item (item transfer from browser to word processor); return to Web browser.

The proposed framework, using COM to monitor applications through exposed events, can support any application with a COM interface. Extending the framework reveals one of the weaknesses of COM, however. Although the COM specification states how applications should interact with one another, it does not specify what applications should say to one another. Thus, there is no uniform set of interfaces that are supported by all applications, or even categories of applications. One word processor might decide to expose a DocumentBeforePrint event while another word processor might expose the same event using the name BeforeDocumentPrint — or might choose not to expose such an event at all. Nevertheless, some applications
(particularly Microsoft Office suite applications) do offer extensive access to their capabilities. Integrated monitoring of applications is more complex for two reasons. First, data management is more difficult as data is being generated by multiple applications. The second, and more challenging, task is managing inter-application linking. For Web monitoring to benefit from better integration, developers must track the relationships among multiple user applications. For example, suppose that while writing a journal article a researcher shifts from a word processor to a Web browser to use the ResearchIndex Web site (http://citeseer.nj.nec.com/) to research a particular citation. After searching ResearchIndex and browsing the results, the researcher might copy several citations and paste them into her article. A monitoring system that included both the researcher's word processor and Web browser would observe an action sequence similar to that shown in Figure 3. After observing the sequence, automated analysis of monitoring logs might reasonably conclude that the browser is being used to search for information relevant to the content of the edited document by noting the pattern (a minimal detection sketch appears at the end of this section):
1. User edits a document in the word processor
2. User switches from the word processor to the Web browser
3. User copies data from the browser window to the Windows clipboard
4. User pastes the clipboard content into the same document originally being edited in the word processor
In this case, the key to the pattern is the shift from one application to another and back to the first, strengthened by the copy-and-paste relationship between the two. By examining the text section that was being edited at the time, the connection could be narrowed further. Integrating monitoring across applications perhaps raises as many questions as it answers. Inferring connections among applications or tasks will generally be an educated guess. The framework might also be valuable for the dual of monitoring, i.e., taking action on behalf of the user based on real-time pattern analysis. In other research, we are exploring the use of the larger monitoring framework to index into a library of known process descriptions, which incorporate information about applications and how they are used in the process. Once the process monitor recognizes a process, the same COM-based framework enables the system to drive other applications. For example, after a user selects a restaurant at a review site, the system can pull the address from the Web page and submit it to a Web mapping service, without requiring user intervention. Regardless of whether the goal is simply to monitor Web browser actions, or to integrate across applications, developers must choose an implementation language for building COM-based systems; the alternatives and our selection for prototype construction are discussed in the next section.
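The detection sketch referenced above might look like the following; the event vocabulary and the simple in-order matcher are illustrative assumptions rather than the framework's actual log schema.

# Sketch: detect the edit -> switch -> copy -> paste pattern in a merged
# cross-application event log. Event names are assumed for illustration.
PATTERN = ["edit_document", "switch_to_browser", "copy_from_browser",
           "paste_into_document"]

def find_research_episodes(events):
    """Return positions where the four steps occur in order, with other
    events allowed in between."""
    episodes, position = [], 0
    for index, event in enumerate(events):
        if event == PATTERN[position]:
            position += 1
            if position == len(PATTERN):
                episodes.append(index)   # pattern completed at this event
                position = 0
        elif event == PATTERN[0]:
            position = 1                 # a new edit restarts the match
    return episodes

log = ["edit_document", "switch_to_browser", "browse_page",
       "copy_from_browser", "switch_to_word_processor", "paste_into_document"]
print(find_research_episodes(log))   # -> [5]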
Improving Workflow Management Support with Client-Side Monitoring

In workflow systems, there is a tension between highly structured workflow (with a strong, underlying model) and the need for model revision via synchronous and asynchronous group interaction. Building flexibility into workflow processes is the theme of many recent works, for example (Antunes, 1995; Ginsburg, 2001; Bernstein, 2000). Client-side monitoring allows us to go further in designing an effective and flexible workflow management system (WFMS). Let us consider the general problem of a workgroup that consists of several participants. This workgroup has been assigned a pre-defined task (like the processing of a document), which needs to
be done in a structured manner. Participants working on this task are called actors. These actors use typical productivity tools and artifacts, such as documents, Web pages, spreadsheets, and so on, to complete the task. The typical order in which the task is done is that a first set of actors performs some knowledge discovery, processes the input with the knowledge gathered, and passes the output generated to the next actor in the workflow. There is an opaque "black box" that links successive versions — our work peers inside the black box. In the workflow literature, a Petri Net model is used (Weitz, 1998) which represents documents as tokens that follow a sequential path. The token passes through different stages (or checkpoints), which are labeled V1 through V6. To advance the token, knowledge discovery takes place within the black box. With the help of additional resources, the actors (the agents who move the document forward through successive versions) perform several tasks on the token between two stages or checkpoints. Thus, we can say that each actor performs some knowledge discovery by taking a version (a token) as input, and then passes the generated output (again a token) to the next actor(s) in the flow for additional processing. This chain continues until the task is completed. We would like to zoom in on the steps between successive versions. To unravel the mystery of the black box and to shed light on the process of knowledge discovery, let us analyze this black box in more detail. This black box contains several artifacts like Web pages, Word documents, Excel spreadsheets, PowerPoint slides, and databases. These artifacts are utilized by the actors while they perform knowledge discovery and process the input (token). If by any means we are able to track the activities of the actors (what artifacts they used, how much time they spent on each, and what operations they performed on each), it would be a big step towards unraveling the mystery of this black box. The actors' 'canonical awareness' (Brown and Duguid, 1991) can be exposed by client monitoring and is very helpful in workflow management. Client monitoring does, in fact, give us information about how, when, why and what artifacts were utilized in knowledge discovery. We can therefore document them, analyze them and store them in the knowledge bank of the organization so that when a similar task is assigned to a workgroup in the future, this awareness would prevent the group from reinventing the wheel. Our section on Future Developments outlines adaptive user interface techniques to allow for synchronous interface changes when a member of a group discovers a new and useful information asset; we can deduce usefulness from the client-side actions (e.g., large sections copied and pasted from the new source). The deductions can become a powerful addition to a WFMS to better inform group participants, who may be in physical proximity or geographically distributed, during the various document preparation phases. Knowledge trails in the absence of coordination and awareness are purely ad hoc, with no carry-over from one session to the next. Thus, awareness helps the actors to perform informed artifact discovery. Therefore, new artifacts found should be recorded for re-use, as was suggested long ago in Vannevar Bush's landmark article on the "Memex", his hypothetical hypermedia computer (Bush, 1945). Client-side monitoring is ideal for this purpose. Our proposal aims to create an infrastructure to capture this awareness.
With the Python client monitoring software in place, we can funnel client logs to a server for collation, pre-processing, and analysis to do pattern discovery on the event-logs.
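As a rough sketch of that funneling step, the snippet below buffers event records locally and ships them to a collection endpoint; the record fields and the server URL are placeholders, not the system's actual protocol.

# Sketch: buffer client-side events and ship them to an analysis server.
# Record fields and the server URL are illustrative placeholders.
import getpass
import json
import time
import urllib.request

buffer = []

def log_event(application, event, detail):
    buffer.append({"time": time.time(), "user": getpass.getuser(),
                   "application": application, "event": event, "detail": detail})

def flush_to_server(url="http://analysis.example.org/collect"):
    if not buffer:
        return
    payload = json.dumps(buffer).encode("utf-8")
    request = urllib.request.Request(url, data=payload,
                                     headers={"Content-Type": "application/json"})
    urllib.request.urlopen(request)   # collation and mining happen server-side
    buffer.clear()

log_event("Internet Explorer", "visited", "http://www.example.org/")
flush_to_server()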
4. Future Development

Although Python simplifies development work with Microsoft's COM specification, monitoring user events requires some programming skill. Moreover, even those with prior programming experience must investigate the COM events, methods and properties exposed by a
particular application. To ease the burden of applying the framework, we are investigating the development of a toolkit that would enable non-programmers to select the applications and events they wish to monitor and choose from a menu of actions to take in each case. For example, a toolkit user could select Microsoft's Internet Explorer application, select the "Page loaded" event and ask that the event and associated information be logged to a file. The toolkit would then generate a NavigateComplete2 event handler "behind the scenes" (a configuration sketch appears at the end of this section). This is especially useful for non-technical enterprise librarians or system evaluators who want a graphical control panel to "activate" monitors on an application-by-application basis, with subsequent tailored log-file writing and possible automated collation of a group task. Later extensions to the toolkit could support client-side data analysis, based on the most promising analytic techniques. Our future work focuses on four major stages. In the first stage, we provision a toolkit to provide a simple mechanism to activate and configure client-side monitoring. In parallel with this stage, we incorporate more client-side productivity tools (such as PowerPoint and common email clients) into the monitoring framework to increase the range of the offering. In the second stage, we establish automated communication channels between the individual or workgroup client PCs and an analysis server, as mentioned in the previous discussion of workflow support systems. In the third stage, we perform pattern analysis on the collated client activity logs on the analysis server and provide a means for the server to feed discoveries back to the client interfaces as users interact with the Web and the other productivity tools. Thus, adaptive user interfaces become the means to signal an individual between sessions, or a workgroup between members, as knowledge tasks are undertaken. Via such dynamic interfaces, we can provision "workspace awareness", which is the focus of recent CSCW research (Biuk-Aghai & Hawryszkiewycz, 1999; Gutwin & Greenberg, 1998). A fourth stage involves refining our composition analysis on documents created by individuals or groups within the enterprise. We plan to represent graphically the information sources that have gone into a given document version and to map document compositions in a field setting, considering document genres and workgroups. Interesting patterns should be revealed in a longitudinal study using both visual and textual compositional analysis techniques. Taken as a whole, the future work seeks to implement the client-monitoring framework with as broad a reach as possible and as simple a configuration interface as possible. As server-side analysis and adaptive client interfaces are implemented, it will become possible to evaluate this framework in a wide variety of field settings. We see document composition analysis as having high potential in enterprise knowledge management, since documents are such an important knowledge asset of the firm.
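To make the toolkit idea concrete (as referenced above), it might translate a declarative selection such as the one below into COM event handlers behind the scenes; the configuration format and the generated handler are assumptions about a design still under investigation.

# Sketch: a declarative monitoring configuration that a toolkit could turn
# into COM event handlers behind the scenes. The format is an assumption.
MONITOR_CONFIG = [
    {"application": "InternetExplorer.Application",
     "event": "NavigateComplete2",       # presented to users as "Page loaded"
     "action": "log_to_file",
     "logfile": "ie_events.log"},
]

def make_logging_handler(logfile):
    def handler(*event_args):
        with open(logfile, "a") as log:
            log.write(repr(event_args) + "\n")
    return handler

# For each entry, the toolkit would bind a generated handler such as this one
# to the chosen COM event (e.g., OnNavigateComplete2 in an events class).
handlers = {entry["event"]: make_logging_handler(entry["logfile"])
            for entry in MONITOR_CONFIG if entry["action"] == "log_to_file"}
handlers["NavigateComplete2"]("ie-dispatch", "http://www.example.org/")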
5. Conclusions

The quality of inference in Web mining today is constrained by a dearth of rich and reliable data. Much of the data mined is read from Web server logs, or proxy server logs. Unfortunately, the parsimonious nature of network transactions on the Web means that little information about the client is transferred to the server. However, by shifting the focus from the server to the client, data can be collected at the source. The much richer data available from the client will enable new observations about user behavior. For online retailers, better data might mean more browsers converted to buyers, or more successful cross-selling of related items. For organizations, client-side information could identify which information sources are most
valuable — not because such sources return the most results or are the most frequently searched, but because those sources contribute information that is used elsewhere. As more library services are offered online, librarians can more effectively target library resources to the areas needed most. Client-side Web monitoring represents an important advance over server-side techniques. More importantly, however, client-side monitoring offers the potential for an integrated analysis of Web usage in a broader usage context. Client-side Web monitoring offers the potential to change our understanding of Web usage. With a better understanding of the most popular tool for online information access, designers can craft systems that work with people to find the information they need when they need it. By identifying sources that people not only skim, but also incorporate into other work, monitoring systems can rank the usefulness of sources. Client-side monitoring — not only of Web usage but of all applications — offers the potential to change how users interact with their computers. The quality of the recommendations of personalized e-commerce systems depends on the quality of the available data. From the business perspective, client-side data enables better recommendations, resulting in more sales and more satisfied customers. With individualized information about application usage, user interfaces can adapt to users. Any application could be developed to recognize information needs that might arise in the course of using the application. For example, word processors could incorporate an Insert Citation button that would search for relevant citations given the surrounding text. Information search and retrieval functions could coordinate across sessions and across users, and discover new commonalities among people based upon the richer client-side data. Our framework empowers users to define their own applications, rather than relying on the static features dictated by application programmers. Client-side monitoring and adaptive interface design transform the personal computer from a collection of independent applications into an extension of the knowledge worker.
6. References

Abecker, A., Bernardi, A., & Sintek, M. (1999). Enterprise Information Infrastructures for Active, Context-Sensitive Knowledge Delivery. Paper presented at ECIS'99, Copenhagen, Denmark.
Adomavicius, G., & Tuzhilin, A. (2001). Extending Recommender Systems: A Multidimensional Approach. Paper presented at the IJCAI-01 Workshop on Intelligent Techniques for Web Personalization (ITWP'2001), Seattle, WA.
Antunes, P., Guimaraes, N., Segovia, J., & Cardenosa, J. (1995). Beyond Formal Processes: Augmenting Workflow with Group Interaction Techniques. Paper presented at COOCS'95, Milpitas, CA.
Bernstein, A. (2000, December). How can cooperative work tools support dynamic group processes? Bridging the specificity frontier. Paper presented at CSCW'00, Philadelphia, PA, pp. 279-288. ACM Press.
Biuk-Aghai, R., & Hawryszkiewycz, I. (1999). Analysis of Virtual Workspaces. Paper presented at Database Applications in Non-Traditional Environments, Japan.
Damiani, E., Oliboni, B., Quintarelli, E., & Tanca, L. (2001). Modeling Users' Navigation History. Paper presented at the International Joint Conference on Artificial Intelligence (IJCAI), Seattle, WA.
December, J., & Ginsburg, M. (1996). HTML 3.2 and CGI Unleashed. Sams/Macmillan.
Edelstein, H. A. (2001, March 12). Pan for Gold in the Clickstream. Information Week, 7791.
Ginsburg, M. (1998). Annotate: A Knowledge Management Support System for Intranet Document Collections. Unpublished PhD dissertation, NYU, New York City.
Ginsburg, M., & Kambil, A. (1999). Annotate: A Knowledge Management Support System. Paper presented at HICSS-32, Hawaii.
Ginsburg, M., & Madhusudan, T. (2001). Pattern Acquisition to Improve Organizational Knowledge Management. Paper presented at AMCIS 2001, Boston, MA.
Gutwin, C., & Greenberg, S. (1998). Design for Individuals, Design for Groups: Tradeoffs Between Power and Workspace Awareness. Paper presented at CSCW'98.
Jörding, T., & Michel, S. (2001). Personalized Shopping in the Web by Monitoring the Customer. Paper presented at the Active Web conference, UK.
Kimbrough, S., Padmanabhan, B., & Zheng, Z. (2000, December). On Usage Metrics for Determining Authoritative Sites. Paper presented at WITS 2000.
LaMacchia, B. A. (1996). Internet Fish. Unpublished PhD dissertation, MIT, Cambridge.
Mobasher, B., Cooley, R., & Srivastava, J. (2000). Automatic Personalization Based on Web Usage Mining. Communications of the ACM, 43(8), 143-151.
Mobasher, B., Dai, H., Luo, T., & Nakagawa, M. (2000). Improving the Effectiveness of Collaborative Filtering on Anonymous Web Usage Data. Technical report, DePaul University, Chicago, IL.
Padmanabhan, B., Zheng, Z., & Kimbrough, S. O. (2001). Personalization from Incomplete Data: What You Don't Know Can Hurt. Paper presented at KDD'01, San Francisco, CA.
Schnurr, H.-P., & Staab, S. (1999). A Proactive Inferencing Agent for Desk Support. Paper presented at the AAAI Spring Symposium on Bringing Knowledge to Business Processes, Stanford, CA.
Schwartz, J. (2001, September 4). Giving Web a Memory Cost Its Users Privacy. New York Times, p. C1.
Sen, S., Padmanabhan, B., Tuzhilin, A., White, N., & Stein, R. (1998). On the Analysis of Web Site Usage Data: How Much Can We Learn About the Consumer from Web Logfiles? European Journal of Marketing: Special Issue on Marketing in Cyberspace.
Vranica, S. (2001, July 27). Web Sites Seek to Turn Data into Dollars. Wall Street Journal, p. B8.
Weitz, W. (1998). SGML Nets: Integrating Document and Workflow Modeling. Paper presented at HICSS-32, Hawaii.
7. Appendix

Table 1 — Internet Explorer Events

Event                Description of event
Visible              Fired when the window should be shown/hidden.
TitleChange          Document title changed.
DownloadBegin        Download of a page started.
Quit                 Fired when application is quitting.
CommandStateChange   The enabled state of a command changed.
ToolBar              Fired when the toolbar should be shown/hidden.
NavigateComplete2    Document being navigated to becomes visible and enters the navigation stack.
BeforeNavigate2      Fired before navigation occurs in the given WebBrowser.
NewWindow2           A new, hidden, non-navigated WebBrowser window is needed.
TheaterMode          Theater mode should be on/off.
ProgressChange       Download progress is updated.
StatusBar            Status bar should be shown/hidden.
PropertyChange       PutProperty method has been called.
MenuBar              Menu bar should be shown/hidden.
StatusTextChange     Status text changed. (Occurs when the mouse hovers over a link; the status text changes to the URL.)
DocumentComplete     Document being navigated to reaches ReadyState_Complete.
DownloadComplete     Download of page complete.
FullScreen           Full-screen mode should be on/off.
Table 2 — Properties of Internet Explorer

AddressBar, Application, Busy, Container, Document, FullName, FullScreen, Height, HWND, Left, LocationName, LocationURL, MenuBar, Name, Offline, Parent, Path, ReadyState, RegisterAsBrowser, RegisterAsDropTarget, Resizable, Silent, StatusBar, StatusText, TheaterMode, ToolBar, Top, TopLevelContainer, Type, Visible, Width
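The properties in Table 2 are read directly from the same automation object that Listing 1 (below) creates. The short sketch that follows is our own illustration of that access pattern, not part of the original appendix; it assumes an ieBrowser object created as in the listing.

    # Assumes ieBrowser was created as in Listing 1.
    print "Current URL:   " + ieBrowser.LocationURL
    print "Window title:  " + ieBrowser.LocationName
    print "Still loading: " + str(ieBrowser.Busy)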
from win32com.client import DispatchWithEvents

class IEEvents:
    # Called after a page is loaded by IE
    def OnNavigateComplete2(self, pDisp, URL):
        print "Arrival at page " + URL

# Start Internet Explorer with event monitoring
# provided by the IEEvents class.
ieBrowser = DispatchWithEvents("InternetExplorer.Application", IEEvents)

# Force the browser to load the ASIST home page
ieBrowser.Navigate2("http://www.asis.org/")

Listing 1 — Skeleton Python code for monitoring Internet Explorer's NavigateComplete2 event. Loading the ASIST home page prints Arrival at page http://www.asis.org/
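As written, the skeleton in Listing 1 returns as soon as Navigate2 is called; for events to keep reaching the IEEvents handlers, the monitoring script must continue to dispatch Windows messages. The loop below is our own minimal addition for that purpose, using the standard pythoncom module from pywin32; it is not part of the original listing.

    import time
    import pythoncom

    # Keep pumping COM/Windows messages so that Internet Explorer
    # events continue to be delivered to the IEEvents handlers.
    while True:
        pythoncom.PumpWaitingMessages()
        time.sleep(0.1)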