Mining Client-Side Activity for Personalization

Kurt D. Fenstermacher and Mark Ginsburg
Department of Management Information Systems
Eller College of Business and Public Administration
University of Arizona
{kurtf,[email protected]}

Abstract

"Garbage in, garbage out" is a well-known phrase in computer analysis, and one that comes to mind when mining Web data to draw conclusions about Web users. The challenge is that data analysts wish to infer patterns of client-side behavior from server-side data. However, because only a fraction of the user's actions ever reach the Web server, analysts must rely on incomplete data. In this paper, we propose a client-side monitoring system that is unobtrusive and supports flexible data collection. Moreover, the proposed framework encompasses client-side applications beyond the Web browser. Expanding monitoring beyond the browser to incorporate standard office productivity tools enables analysts to derive a much richer and more accurate picture of user behavior on the Web.

Keywords: data mining, human-computer interaction, personalization, cross-application, client-side

1. Introduction

Imagine a library where the librarians know not only which books have been checked out, but also which books patrons have pulled from the shelves. For each book pulled from the shelf, the library staff knows whether the reader simply skimmed the text and returned it to the shelf, or carefully read two of the chapters. If the copier were used to copy pages from a journal, the staff would know which issues and which pages were copied and how they were used. With such information, the staff could better judge how the collection was being used, which areas could be improved, and more. In short, the staff could help people become better library users. Such monitoring is not possible in the world of printed texts, but it is possible in the online world. In cyberspace, many people access information through the World Wide Web, making the Web a candidate for improved monitoring. Unfortunately, just as many libraries only know which works are checked out, very little information is available about how people use the Web. Today, Web server logs are the primary source of data for mining. Server logs are thus the base from which analysts draw inferences about user behavior. Although the problems of server log analysis are well known, there has not been extensive work exploring user monitoring at the client. In this paper, we summarize the shortcomings of server-side data and propose a framework for client-side Web monitoring. Monitoring at the Web-browser level enables much richer data collection, supporting better analysis. Although client-side monitoring schemes have been proposed before, our proposed framework is extensible to other client applications. With broader monitoring on the client side, analysts can view Web usage within the overall client context. In the following section, we survey current server-side analysis, describing the collection of the data and its shortcomings for inferring Web user behavior. After summarizing the current state of the art, we outline the goals of a client-based monitoring framework. With the goals specified, we describe a framework that achieves them for users who work with common applications on the Microsoft Windows platform. After presenting a framework for client-side data gathering, we discuss some novel analyses that draw on the richer knowledge available at the client. We focus on the potential impact of integrating client-side monitoring of Web access with other client-side applications. Finally, we conclude by describing future directions for the framework.

2. Mining the Web Today

Analysts mine Web data in support of many goals: better-targeted advertising, usability evaluations of both browsers and sites, and data consolidation (for example, at bargain-finder sites) are all common aims. As researchers in knowledge and content management, we have a different goal — to study the flow of information among applications. On most computer desktops, the Web browser is the gateway to information. Although other applications might be used more often and for longer, the Web browser is a key element in studying and improving information access. Improving today's access requires understanding how people interact with the Web, which in turn requires mining Web interaction data to infer new knowledge. Despite our focus on tracking information flows, other analysts can adapt much of the framework to other uses. Web mining has been a subject of much study, for example (Kimbrough 2000), especially in electronic commerce applications (Sen 1998; Mobasher 2000; Mobasher 2000; Padmanabhan 2001; Vranica 2001), as electronic retailers aim to target their pitch ever more narrowly. However, Web mining has been problematic due to the poor and incomplete data analyzed. Typically, data is drawn from Web server logs, which have several shortcomings as an information source. In this section, we first describe site-centric data sources that record information on the Web server as visitors arrive at the site. Next, we discuss techniques that combine site-centric server data with third-party ancillary data for better customer profiling, which shifts the focus from the site to the site's users. We then discuss methods, such as clustering, nearest neighbor and other predictive techniques, that pre-process server data to provide better personalization. Following the thread from site-centric to server-based, user-centric analysis, we conclude with recent work that shifts user-centric analysis from the server to the client.

2.1. History of Server-Side Techniques

The widespread use of the Web among computer users makes Web analysis a potentially rich source of data for studying user behavior and designing information access. Because the Web protocol (HyperText Transfer Protocol, or HTTP) is designed for efficiency and scalability, it offers little information in the course of a Web transaction (December 1995). This basic information includes the client's network address, the page requested, the time of the request and the size of the page requested, but not, for example, the login name of the user. As the information is transmitted from the client to the server, it is usually written to a log file to create a record of server accesses. The absence of user information is aggravated by the statelessness of the Web protocol (December 1995). Hence, without additional programming, each connection is established anew and inferences of consumer identities across time become tenuous. Early works have identified both the usefulness and the shortcomings of Web server logs to marketers (Sen 1997): The readily available data in Web logfiles satisfies a significant part of the behavior segmentation information needs of marketers, while most of the demographic, geographic, psychographic and benefit segmentation information needs require information to be elicited from the visitors.
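To make concrete how sparse this record is, the following minimal sketch (in Python, the language used for the monitoring prototype described later in the paper) parses a single entry in the Common Log Format. The log line itself is hypothetical and the field names are ours, not drawn from the paper.

    import re

    # A hypothetical Web server log entry in the Common Log Format (CLF).
    # Only the client address, timestamp, request line, status and size survive;
    # nothing identifies the person behind the request.
    SAMPLE_LINE = '192.0.2.17 - - [10/Jun/2002:14:03:22 -0700] "GET /products/index.html HTTP/1.0" 200 5120'

    CLF_PATTERN = re.compile(
        r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
        r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
    )

    def parse_clf(line: str) -> dict:
        """Extract the handful of fields a standard server log records."""
        match = CLF_PATTERN.match(line)
        return match.groupdict() if match else {}

    print(parse_clf(SAMPLE_LINE))
    # {'host': '192.0.2.17', 'time': '10/Jun/2002:14:03:22 -0700',
    #  'request': 'GET /products/index.html HTTP/1.0', 'status': '200', 'size': '5120'}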

Web server logs paint a distorted picture not only because some data never reaches the server, but also because data that is reported is itself incorrect. Conflated data arises because Web server logs record a client's Internet Protocol (IP) address. Due to the shortage of unique IP addresses, most organizations use the Dynamic Host Configuration Protocol (DHCP) to manage a pool of IP addresses. Addresses are assigned to computers as needed, and thus the same address might be assigned to two (or more) users in sequence. Moreover, the same user might receive different IP addresses on successive days, or even during the course of a day. Some of these shortcomings are addressed through the use of proxy servers. Proxy servers act as links between Web clients and servers, by responding to requests from clients as if they were Web servers themselves. Proxy servers are installed by many organizations, including Internet Service Providers (ISPs), for many reasons, including more efficient bandwidth utilization, faster response times and security. With respect to Web mining, proxy servers represent an improvement over Web server data because they consolidate the Web traffic from an entire group or organization. Thus, the proxy server log records all the page accesses of a group, regardless of the destination Web server. Because a typical Web user will visit many servers in a session, the upstream data available from a proxy server can lead to new insights into user behavior. Although proxy servers offer a richer dataset to those who mine them, they also aggravate the problems of those who work with Web server logs. Because the proxy server interposes itself between the client and the Web server, the Web server sees a single proxy server in the place of many clients. For example, AOL's extensive use of proxy servers prevents Web servers from determining which AOL member is accessing a particular page. Many companies use proxy servers for security reasons, and thus all company employees appear as the same proxy server to a Web server. Content caching is another problem in capturing accurate data for later mining. Because most Web pages change infrequently, Web browsers streamline page access by storing a copy of Web pages on the user's computer in a local cache. Later, if a user requests a cached page, the browser simply displays the cached version. (The use of a browser's Back button is a common instance because the just-viewed page is generally in the browser's cache.) For Web data mining, the problem is that cached page accesses are not recorded by the server, since the browser simply presents its own copy, without accessing the server. In an extreme case, a user might retrieve two pages from a server and then spend hours moving back and forth between the two pages. The Web server would record this as one page access followed shortly by another, but the successive switches between the two pages would not be recorded. Proxy servers add to the caching challenge because proxies cache pages as well: a browser may request an updated version of a page, and the proxy will intercept the request and return its own cached copy — again preventing the Web server from recording the request. In short, Web server logs suffer from several important shortcomings: network configuration issues cloud server data, users' Web browsing experiences are fragmented across proxy and Web servers, and caching reduces server hits.

2.2. Cookies

Due to the problems in using network addresses to identify users, Web developers have looked for a means to associate individual users with specific Web accesses. The challenge is that the Web itself is stateless — meaning that there is no connection among successive page accesses by the same user. In 1995, Netscape Communications introduced the "cookie" as a mechanism to add state to Web client-server sessions. (The original cookie specification has since been codified as Internet Engineering Task Force RFC 2965 (Kristol 2000).) With cookies, Web servers can store short strings at the client. For example, a database key or user login name is a common choice for cookie data. When the user returns to the same site, the previously written cookie is automatically sent to the server with each page access. On operating systems that support multiple user profiles, cookies can lessen the problems of using network addresses to represent users by associating Web traffic with a returning user, rather than simply an IP number. Often, however, computers shared by many people use a single login, defeating cookies. Other people turn off the cookie feature in their browsers due to privacy concerns (Schwartz 2001).
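As a minimal illustration of the mechanism (not of any particular server's implementation), the sketch below uses Python's standard http.cookies module to show a server issuing a short identifying string and the browser echoing it back on a later request; the cookie name and value are hypothetical.

    from http.cookies import SimpleCookie

    # Server side: attach an identifier to the response (hypothetical value).
    response_cookie = SimpleCookie()
    response_cookie["visitor_id"] = "a1b2c3d4"            # e.g., a database key
    print(response_cookie.output())                        # Set-Cookie: visitor_id=a1b2c3d4

    # Client side: on the next visit to the same site, the browser echoes the
    # stored string back, letting the server recognize the returning user even
    # though the underlying HTTP connection is stateless.
    request_cookie = SimpleCookie("visitor_id=a1b2c3d4")   # parsed from the Cookie: header
    print(request_cookie["visitor_id"].value)              # a1b2c3d4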

2.3. Server-Side Programming to Augment Basic Server-Side Data

Although cookies improve the quality of server-side data, incorporating cookies still maintains a site-centric perspective. Data must still be analyzed site-by-site, meaning that cross-site behavior cannot be observed. Moreover, user actions on the client are only visible through the Web browser — preventing analysts from observing patterns among several applications, which is critical in the analysis of information flows. Web browsers can be augmented with either server-side programs or embedded scripts to provide additional information to Web servers. LaMacchia (LaMacchia 1997) modified the freely available Excite search engine to build an active search agent in his "Internet Fish" system. Ginsburg (Ginsburg 1998) and Ginsburg and Kambil (Ginsburg 1999) modified the same search engine to provide readership timings of documents within a full-text search framework in the "Annotate" system. In the latter work, a trimodal Gaussian distribution of readership timings was observed. The readership timing distributions lead us to infer three types of document readers:

• Very short browse time – Document uninteresting; the next document is chosen from the search results or the session is exited

• Medium browse time – Document interesting enough to read for a few minutes

• Long browse time – User is idle (gone away from the computer) or is doing other knowledge-work tasks at the computer

Without user interviews or observation, however, there is no way to substantiate these inferences. Client monitoring could help disambiguate the very long browse times observed; in a cross-application monitoring environment, we would know whether the user is still active or idle.
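A small sketch of how such an inference might be automated: browse times are mapped to the three reader types using threshold values. The cutoffs below are illustrative assumptions, not the boundaries of the distribution reported in the Annotate work.

    # Hypothetical cutoffs (in seconds) separating the three modes of the
    # readership-timing distribution; in practice the boundaries would be fit
    # to the observed trimodal distribution.
    SHORT_CUTOFF = 15        # below this: document judged uninteresting
    LONG_CUTOFF = 20 * 60    # above this: user likely idle or multitasking

    def classify_browse_time(seconds: float) -> str:
        """Map a document's browse time to one of the three inferred reader types."""
        if seconds < SHORT_CUTOFF:
            return "very short: uninteresting, user moved on"
        if seconds < LONG_CUTOFF:
            return "medium: document read for a few minutes"
        return "long: user idle or doing other work (client monitoring could disambiguate)"

    print(classify_browse_time(8))      # very short
    print(classify_browse_time(240))    # medium
    print(classify_browse_time(3600))   # long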

2.4. "User-Centric" Surf Mining

In addition to the technical obstacles that limit the accuracy and richness of server logs, server-side data suffers from the absence of data on user context. As (Padmanabhan 2001) point out, Web server data is inherently incomplete. Although many users simply browse the Web to pass the time, others access the Web for specific purposes. For example, a user sending an email about an upcoming restaurant reservation might wish to attach a map to the email, and so would use her browser to look up the restaurant's address and then submit it to a map site to generate the map. She might also wish to include the restaurant's phone number in the email, and visit a directory site for that information. With server-side log analysis, the link among the email, the map lookup and the phone lookup is lost. By expanding client-side analysis to monitor applications outside the browser, data mining can reveal patterns of user behavior surrounding Web access, and not simply within it. This field has been very active in both academia and industry. Another focus is user session tagging. For example, in Kimbrough et al. (Kimbrough 2000) the authors provide a framework of usage metrics and analyze 30,000 users' surfing behavior in an online travel-booking domain. To do the study, they used user-centric click-stream data provided by a leading market data vendor. In Damiani et al. (Damiani 2001) the authors present a new user modeling technique based on a temporal graph-based model for semi-structured information in order to formalize the user trajectory, but no specific software techniques are suggested. Edelstein (Edelstein 2001) discusses the challenges of clickstream analysis: consolidating raw data from Web server access logs across multiple servers and turning them into usable records. Another problem is that logical servers may be multiple physical servers for load-balancing or geographical reasons. The logs can also be difficult to clean: it is not always easy to remove extraneous items, to identify users and sessions, or to identify transactions. All of these issues make it hard to integrate clickstream data with other data sources.
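To make the session-identification problem concrete, the following sketch groups timestamped page requests into per-user sessions with a common idle-timeout heuristic; the 30-minute threshold and the record format are assumptions, not part of the cited studies.

    from datetime import datetime, timedelta

    SESSION_TIMEOUT = timedelta(minutes=30)   # common heuristic, not from the paper

    def sessionize(requests):
        """Group (user_id, timestamp, url) records into per-user sessions.

        Requests must be sorted by timestamp; a gap longer than SESSION_TIMEOUT
        starts a new session for that user.
        """
        sessions = {}          # user_id -> list of sessions (each a list of URLs)
        last_seen = {}         # user_id -> timestamp of that user's previous request
        for user_id, ts, url in requests:
            if user_id not in sessions or ts - last_seen[user_id] > SESSION_TIMEOUT:
                sessions.setdefault(user_id, []).append([])   # open a new session
            sessions[user_id][-1].append(url)
            last_seen[user_id] = ts
        return sessions

    log = [
        ("u1", datetime(2002, 6, 10, 9, 0), "/home"),
        ("u1", datetime(2002, 6, 10, 9, 5), "/fares"),
        ("u1", datetime(2002, 6, 10, 11, 0), "/home"),   # gap over 30 minutes: new session
    ]
    print(sessionize(log))   # {'u1': [['/home', '/fares'], ['/home']]}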

2.5. Generic Agents to Help Knowledge Work

The work presented by (Schnurr 1999) is not directed specifically at any software architecture. The authors propose an abstract agent to help integrate the semantics of semi-structured documents (e.g., XML) and business processes, in order to support the "desk work" of the end user. The goal is to provide a proactive reasoning agent and improve knowledge management. A similar discussion is set forth by Abecker et al. (Abecker 1999). The authors discuss the need to actively support the user in a search task. They propose a principle that the OM system should actively offer interesting knowledge. They also note that applications that interweave data, formal knowledge, and informal representations all have embedded knowledge. Thus there is a need for a 'conjoint view and usage of all these representations' (Abecker 1999). The most ambitious of their ideas is that normal work should not be disturbed (the principle of unobtrusiveness): the system should observe the user doing tasks, and then 'automatically gather and store interesting facts and information in an unobtrusive manner' (Abecker 1999). Their implementation is called KnowMore, and features workflow support and information retrieval over heterogeneous data via homogeneous knowledge descriptions.

2.6. Improving Personalization and Recommender Systems

Many researchers and companies have studied recommender systems: systems designed to recommend products and services of likely interest to a user. Because users differ in their tastes and needs, recommender systems are closely tied to issues in personalization. One of the problems facing personalization and recommender system designers is incomplete and inaccurate data about consumer preferences and transaction histories. As Padmanabhan (Padmanabhan 2001) discusses, personalization based on Web server log analysis alone is likely to be faulty. To personalize effectively or to provide robust recommendations, a richer data source is needed. While Adomavicius and Tuzhilin (Adomavicius 2000) propose multi-dimensional "hypercube" and OLAP techniques to provide a richer recommendation framework, they do not define the exact functional form. Moreover, they do not describe a software architecture to gather the data needed to populate the hypercube. The next section discusses the start of a new generation in data collection: the client-side monitoring system.

2.7. Client-Side Monitoring Today

The "next generation" of monitoring recognizes the inherent limitations of server-side inference, data mining, data filling, and data manipulation. Instead, recent research has turned to software augmentation on the client to give a more granular view of session activities. Such client data can be usefully merged with server data or third-party ancillary data for a more complete user profile. The tip of the iceberg is a paper on personalized shopping (Jörding 2001); the paper presents a system, TELLIM, that generates custom multimedia presentations at runtime, matching the display to customer appeal. To do this, the system monitors client interactions, for example the start of an audio or video player. The system then deduces whether the end user is interested in various presentation elements. After a learning algorithm is applied, only preferred presentation elements are provided going forward, and the suppressed items are shown only as links. The technology uses a combination of dynamic HTML (DHTML) and a Java applet for event monitoring on the client. Many researchers in human-computer interaction (HCI) have developed systems to monitor end user activity on desktops. In a broad survey (Hilbert 2000), Hilbert and Redmiles describe the history of such monitoring systems, which have focused primarily on low-level user interface actions such as mouse clicks and keystrokes. Although such systems can reveal problems with a user interface, for example by measuring the total mouse travel needed to perform an operation, the primitive nature of the monitored actions makes it difficult to draw inferences about higher-level behavior, such as information flows among applications. The rise of the Web as a marketing medium has led many companies to develop Web monitoring systems to track usage by individual users. Perhaps the best known such system is DoubleClick. The DoubleClick system uses a unique DoubleClick ID, stored as a cookie on a computer. Although cookies are generally only sent to the server that placed the cookie, DoubleClick servers serve many of the ads users see when browsing the Web. Because the Web client must make a connection to the DoubleClick server to retrieve the ad, the DoubleClick ID cookie is sent even when users view pages that do not appear to be served by DoubleClick. By associating accesses across many sites, DoubleClick can build a profile of users. In addition, if demographic information on a user is known, DoubleClick can study user behavior segmented by demographics.


MediaMetrix has gone a step further to integrate Web monitoring with other client applications. As described in an online whitepaper (Media Metrix, 2002), "[Media Metrix] captures actual usage data from randomly recruited, representative samples of tens-of-thousands of people in homes and businesses around the world. The meter is a software application that works with the PC operating system to passively monitor all user activity in real time — click-by-click, page-by-page, and minute-by-minute." As discussed in (Hilbert 2000), low-level event monitoring makes it difficult to draw the higher-level inferences needed for information flow monitoring. Moreover, Media Metrix is targeted specifically at tracking viewing times associated with documents. If a user clicks on a file in Portable Document Format (PDF) and the operating system starts Acrobat Reader to display the document, Media Metrix's software will follow the viewing from the browser to Acrobat Reader. However, Media Metrix does not appear to track activity such as copying and pasting from a Web page to a working document.

3. Goals for Client-Side Monitoring

Given the strengths and weaknesses of server-side monitoring, we have outlined four goals for a client-side monitoring framework: monitoring should not interfere with user activities, data collection should be specific to individual users, developers should be able to specify easily what data is collected and how, and, finally, the framework should be extensible to incorporate monitoring of other client-side applications. Non-interference with user activities is already true of server-side logging, but the shift to the client side makes unobtrusiveness a key element. If users are required to perform additional actions to support monitoring, they will seek ways to lessen the burden of doing so, and will likely distort the results of the monitoring. Thus, the goal becomes to monitor actions in the background, as users work with Web client applications. Unobtrusiveness must be balanced against the need to gather data, however. At one extreme, client-side monitoring could require that users comment on every action they take, including links they examine but choose not to follow, returns to previous pages and all other actions. Although such monitoring would generate rich data (at least for a short period), it is clearly untenable. One of the key weaknesses of server-side monitoring is that user identities are generally lost to the Web server. (If a proxy server is employed, then the proxy server will usually have access to user data, but the proxy will certainly mask user identities when accessing the requested Web server.) By recording user behavior on the client, it is possible to tie actions to users much more tightly. For example, on the client side, the user's login name and profile information are available, in addition to the network address information sent to the server. The use of proxy servers, a significant problem for Web server log mining, is not a problem with client-side monitoring. Tightly coupling user identity and user actions, although it aids in data mining, raises issues of user privacy. With so much data available at the client, recording and mining the data might reveal extensive personal information. In an organizational setting, this might be appropriate, as companies might wish to know how employees are using company resources. In other settings, however, such extensive data collection could be problematic. In such cases, cryptographic techniques can be applied to preserve the integrity of the data while breaking the link between the data and a known person (a minimal sketch of this idea appears at the end of this section). The breadth of data available on the client side poses another problem for those designing client-side solutions: what data should be collected? At the client, every mouse movement, every keystroke, every click and more are available for logging and analysis. The problem of what to collect is not simply a question of granularity, but also of focus. For example, within a Web session, should the system record link hovers, where a user moves the mouse over a hyperlink but does not click? Or should only followed hyperlinks be noted? Which actions to log depends on what the analyst wishes to infer, of course. The range of candidate actions has an important implication for monitoring framework design: the framework must allow framework users to specify easily what data is needed. Finally, as discussed in the previous section, users access the Web as part of a larger context. Thus, the ideal monitoring framework should be extensible to encompass additional applications; ideally, office productivity applications would be included. A complete client-side monitoring solution could log user actions consistently across many applications. Integrated logging could then support new inferences about user behavior. For example, a user writing a memo might search the Web for background information, copying useful information into the word processor she is using to write the memo. Without knowing what information has been copied and pasted, there is no way to distinguish among all the pages the user has browsed. Once the word processor is incorporated into the framework, however, mining the client-side data could reveal which Web sources are most often copied and pasted into documents. The implication is then that those sources most frequently copied offer the most relevant information for a particular user. In summary, an ideal monitoring framework would address several goals: non-interference, close association between logged data and users, flexible data collection and the ability to integrate monitoring with other client-side applications. In the next section, we discuss key differences between server-side and client-side monitoring.
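The following sketch illustrates the pseudonymization idea mentioned above: log records are keyed by a keyed hash of the login name, so per-user behavior can still be correlated without storing the identity itself. The key handling and field names are illustrative assumptions, not part of the proposed framework.

    import hashlib
    import hmac

    # Secret key held by the analysis team (or discarded entirely for one-way
    # anonymization); real key management would need more care than shown here.
    SECRET_KEY = b"replace-with-a-real-secret"

    def pseudonymize(login_name: str) -> str:
        """Return a stable pseudonym for a user, breaking the link to the real identity."""
        return hmac.new(SECRET_KEY, login_name.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

    def log_event(login_name: str, application: str, action: str) -> dict:
        """Build a log record keyed by pseudonym rather than by login name."""
        return {"user": pseudonymize(login_name), "app": application, "action": action}

    print(log_event("jsmith", "iexplore", "navigate:http://example.com/"))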


4. Mining Client-Side Data

Because much of the data on the client side is not transmitted to the server, shifting from the analysis of server-side data to client-side data offers many new opportunities for data mining. As discussed in the previous section, integrating data collection and mining across applications allows Web access to be understood in a larger context. However, much can be gained from a Web-only analysis as well. The key difference between server and client data is that on the client side, the monitoring application has access to the entire Web page. Thus, the data gathered can be the basis for comparative analysis of the links on a page, for example.

4.1. Link Mining

Because Internet Explorer (IE) exposes the document object model (DOM) of the current Web page, the complete page content is available. IE's DOM implementation includes some notable collections as well, including all the links on a page, all the IMG (image) objects and all the named anchors, among others. With the complete DOM, mining the log files can compare link hovers (which perhaps suggest consideration of the link) to followed links, to identify distinguishing features of followed links. Path analysis could reveal that a user considered two links, followed one, then used the Back button to return to the original page and followed the second link, then continued on, suggesting that the user was "fooled" by the link descriptions of the original page into selecting the less relevant link. Identifying link hovers relies on the browser to expose a hover event, or a reasonable proxy for a hover event. Internet Explorer fires an event each time the browser's status text changes, and the status text changes in response to the user moving the mouse over or away from a link. Thus, if a StatusTextChange event fires, it may indicate that the user has considered following a link. The conclusion is not certain, because StatusTextChange events are generated as the mouse glides over the link, no matter how briefly — the user may have simply dragged the mouse pointer to a new location. However, if all StatusTextChange events are time-stamped (preferably with millisecond resolution, which Windows supports), post-processing of the logs can ignore extremely rapid StatusTextChange event sequences. If a user were to quickly drag the mouse over a link, the log would record the first StatusTextChange as the mouse entered the link area, followed very shortly by another StatusTextChange event. By setting a floor on hover time (say, 400 milliseconds), many of the spurious changes can be filtered out. Table 1 shows some of the COM events we can trap.
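As a minimal sketch of how such events might be captured, the following Python fragment (using the pywin32 COM bindings) attaches an event sink to Internet Explorer and writes timestamped records for two of the events in Table 1; the class, file and function names are ours, and error handling is omitted.

    import time
    import pythoncom
    import win32com.client

    class BrowserEventSink:
        """Receives WebBrowser events and writes timestamped records to a log."""

        def OnStatusTextChange(self, text):
            # Fired whenever the status bar changes, including when the mouse
            # moves over or off a hyperlink (the status text becomes the URL).
            log_event("StatusTextChange", text)

        def OnNavigateComplete2(self, pDisp, url):
            # Fired when a navigation finishes; the full DOM of the new page
            # is then available through the browser's Document property.
            log_event("NavigateComplete2", url)

    def log_event(event_name, detail):
        timestamp = time.time()   # seconds; Windows supports millisecond resolution
        with open("client_events.log", "a") as log:
            log.write(f"{timestamp:.3f}\t{event_name}\t{detail}\n")

    if __name__ == "__main__":
        # Launch Internet Explorer with our event sink attached.
        browser = win32com.client.DispatchWithEvents("InternetExplorer.Application",
                                                     BrowserEventSink)
        browser.Visible = True
        pythoncom.PumpMessages()   # process COM events until the process exits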

A final complication is that a user who has already decided to follow a link might move the mouse over the link and click almost immediately. Following the link will cause IE to update the status text to reflect the loading of the new page, which will in turn lead to a rapid sequence of StatusTextChange events. In post-processing the log files, one must therefore ensure that links only briefly considered and then followed are still counted as hovered links. Once post-processing has identified which links the user has considered, an analyst can compare and contrast three groupings: links not considered, links considered but not followed and links followed. Identifying patterns among the three groups could aid page designers in reformatting pages to improve access (or boost sales on an e-commerce site). A second-order analysis might incorporate the referring site or page, suggesting that different pages could be dynamically generated based on the referring site.
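A sketch of this post-processing step: consecutive StatusTextChange records are turned into hover durations, hovers shorter than the 400-millisecond floor are discarded, and the remaining links are split into followed and considered-but-not-followed groups. The event-log format is the one assumed in the capture sketch above.

    HOVER_FLOOR = 0.4   # seconds; the 400 ms floor suggested in the text

    def analyze_hovers(events, followed_urls):
        """Classify links from a time-ordered list of (timestamp, status_text) pairs.

        An empty status_text means the mouse has left the previous link.
        """
        hovered = set()
        hover_start = hover_url = None
        for timestamp, status_text in events:
            if hover_url is not None:
                # The previous hover ends at this event; keep it only if long enough.
                if timestamp - hover_start >= HOVER_FLOOR:
                    hovered.add(hover_url)
                hover_start = hover_url = None
            if status_text:                      # the mouse has moved onto a link
                hover_start, hover_url = timestamp, status_text
        followed = set(followed_urls)            # known from navigation events
        considered_not_followed = hovered - followed
        return followed, considered_not_followed

    events = [(0.00, "http://a/"), (0.12, ""),   # quick pass over a link: filtered out
              (1.00, "http://b/"), (1.90, "")]   # 0.9-second hover: counted
    print(analyze_hovers(events, followed_urls={"http://a/"}))
    # ({'http://a/'}, {'http://b/'})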

5. Framework Applications

Applications are already being developed using the framework, but they integrate monitoring into a larger system, rather than being focused exclusively on Web data mining. We present brief descriptions of several projects underway to give a sense of what can be done with the proposed framework. The first, a restaurant reservation assistant, is primarily Web-based, and thus its use of the client-side framework is very similar to that presented in this paper. Briefly, the system monitors user actions looking for Web access to known restaurant recommendation sites, such as Zagat.com or the restaurant section of a CitySearch city guide. After a user's visit to a review site is noted, the system is able to recommend review sites with similar coverage. After the user selects a particular restaurant, the system uses the browser's document object model representation to parse the page, extracting relevant information such as the restaurant's name, address and possibly phone number. The system will then offer to aid the user by verifying the phone number, retrieving a map and directions to the restaurant, and sending an email notifying others of the restaurant, time and date once a reservation is made. The restaurant reservation assistant relies on the browser's NavigateComplete2 event to infer when a user is engaged in a reservation process. When the reservation process is read in at startup, a table of review sites and their URLs is built. As the user browses Web sites (generating NavigateComplete2 events), the URL passed to the event handler is used to index into the table, looking for a site that matches a known restaurant review site. If a match is found, the assistant uses COM's automation capabilities to continue the process on the user's behalf.
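A minimal sketch of the lookup step described above: a table of known review sites is built at startup and consulted from the NavigateComplete2 handler. The site list and handler wiring are illustrative assumptions, not the deployed assistant.

    from urllib.parse import urlparse

    # Hypothetical table of known restaurant-review sites, loaded at startup.
    REVIEW_SITES = {
        "www.zagat.com": "Zagat",
        "www.citysearch.com": "CitySearch",
    }

    def match_review_site(url: str):
        """Return the review-site name if the navigated URL belongs to a known site."""
        return REVIEW_SITES.get(urlparse(url).netloc.lower())

    class ReservationAssistantSink:
        """Event sink fragment: watches navigations for known review sites."""

        def OnNavigateComplete2(self, pDisp, url):
            site = match_review_site(str(url))
            if site is not None:
                # From here the assistant would walk the page's DOM to extract the
                # restaurant name, address and phone number, then offer follow-up
                # actions (map lookup, phone verification, notification email).
                print(f"Detected a visit to {site}: {url}")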


5.1. Improving Workflow Management Support with Client-Side Monitoring

In workflow systems, there is a tension between highly structured workflow (with a strong underlying model) and the need for model revision via synchronous and asynchronous group interaction. Building flexibility into workflow processes is the theme of many recent works, for example (Antunes 1995; Ginsburg 2001; Bernstein 2000). Client-side monitoring allows us to go further in designing an effective and flexible workflow management system (WFMS). Knowledge trails in the absence of coordination and awareness are purely ad hoc, with no carry-over from one session to the next. Awareness thus helps the actors perform informed artifact discovery, and new artifacts found should be recorded for re-use. Client-side monitoring is ideal for this purpose. Our proposal aims to create an infrastructure to capture this awareness. With the Python client monitoring software in place, we can funnel client logs to a server for collation, pre-processing, and analysis, enabling pattern discovery on the event logs.
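A minimal sketch of the log-funneling step, assuming a hypothetical collation endpoint and a JSON payload; the actual transport and record format of the proposed infrastructure are not specified in the paper.

    import json
    import urllib.request

    ANALYSIS_SERVER = "http://analysis.example.org/collect"   # hypothetical endpoint

    def ship_log_batch(records):
        """POST a batch of client event records to the analysis server."""
        payload = json.dumps({"events": records}).encode("utf-8")
        request = urllib.request.Request(
            ANALYSIS_SERVER,
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request) as response:
            return response.status   # 200 expected on success

    # Example: ship the records produced by the event sink sketched earlier.
    # ship_log_batch([{"ts": 1023724800.123, "event": "NavigateComplete2",
    #                  "detail": "http://www.zagat.com/..."}])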

5.2. Future Development

Although Python simplifies development work with Microsoft's COM specification, monitoring user events requires some programming skill. Moreover, even those with prior programming experience must investigate the COM events, methods and properties exposed by a particular application. To ease the burden of applying the framework, we are investigating the development of a toolkit that would enable non-programmers to select the applications and events they wish to monitor and choose from a menu of actions to take in each case. Our future work focuses on three major stages. In the first stage, we provide a toolkit to the system administrator, enabling a simple mechanism to activate and configure client-side monitoring. In parallel with this stage, we incorporate more client-side productivity tools (such as PowerPoint) into the monitoring framework to increase the range of the offering. In the second stage, we establish automated communication channels between the individual or workgroup client PCs and an analysis server, as mentioned in Section 5.1's discussion of workflow support systems. Finally, in the third stage, we perform pattern analysis on the collated client activity logs on the analysis server and provide a means for the server to feed discoveries back to the client interfaces as users interact with the Web and the other productivity tools. Thus, adaptive user interfaces become the means to signal an individual between sessions, or a workgroup between members, as knowledge tasks are undertaken. Taken as a whole, the future work seeks to implement the client-monitoring framework with as broad a reach as possible and as simple a configuration interface as possible. As server-side analysis and adaptive client interfaces are implemented, it will become possible to evaluate this framework in a wide variety of field settings.

6. Conclusions

The quality of inference in Web mining today is constrained by a dearth of rich and reliable data. Much of the data mined is read from Web server logs, or proxy server logs. Unfortunately, the parsimonious nature of network transactions on the Web means that little information about the client is transferred to the server. However, by shifting the focus from the server to the client, data can be collected at the source. The much richer data available from the client will enable new observations about user behavior. For online retailers, better data might mean more browsers converted to buyers, or more successful cross-selling of related items. For organizations, client-side information could identify which information sources are most valuable — not because such sources return the most results or are the most frequently searched, but because those sources contribute information that is used elsewhere. As more library services are offered online, librarians can more effectively target library resources to those areas needed most. Client-side Web monitoring represents an important advance over server-side techniques. More importantly, however, client-side monitoring offers the potential for an integrated analysis of Web usage in a broader usage context. Client-side Web monitoring offers the potential to change our understanding of Web usage. With a better understanding of the most popular tool for online information access, designers can craft systems that work with people to find the information they need when they need it. By identifying sources that people not only skim, but also incorporate into other work, monitoring systems can rank the usefulness of sources. Client-side monitoring — not only of Web usage but of all applications — offers the potential to change how users interact with their computers. The quality of the recommendations of personalized e-commerce systems depends on the quality of the available data. From the business perspective, client-side data enables better recommendations, resulting in more sales and more satisfied customers. With individualized information about application usage, user interfaces can adapt to users. Any application could be developed to recognize information needs that might arise in the course of using the application. For example, word processors could incorporate an "Insert citation" button that would search for relevant citations given the surrounding text. Information search and retrieval functions could coordinate across sessions and across users, and discover new commonalities among people based upon the richer client-side data. Our framework empowers users to define their own applications, rather than relying on the static features dictated by application programmers. Client-side monitoring and adaptive interface design transform the personal computer from a collection of independent applications into an extension of the knowledge worker.

7. References

Abecker, A., Bernardi, Ansgar, Michael Sintek (1999). "Enterprise Information Infrastructures for Active, Context-Sensitive Knowledge Delivery". ECIS'99.

Adomavicius, G., Alexander Tuzhilin (2000). "Extending Recommender Systems: A Multidimensional Approach". AAAI.

Antunes, P., Guimaraes, Nuno, Segovia, Javier, Jesus Cardenosa (1995). "Beyond Formal Processes: Augmenting Workflow with Group Interaction Techniques". COOCS'95, Milpitas, CA, ACM Press.

Bernstein, A. (2000). "How Can Cooperative Work Tools Support Dynamic Group Processes? Bridging the Specificity Frontier". CSCW'00, pp. 279-288, Philadelphia, PA, ACM Press.

Damiani, E., Oliboni, Barbara, Quintarelli, Elisa, Letizia Tanca (2001). "Modeling Users' Navigation History".

December, J., Mark Ginsburg (1995). HTML and CGI Unleashed. Sams/Macmillan.

Edelstein, H. A. (2001). "Pan for Gold in the Clickstream". Information Week: 77-91.

Ginsburg, M. (1998). "Annotate: A Knowledge Management Support System for Intranet Document Collections". IS, New York City, NYU.

Ginsburg, M., Ajit Kambil (1999). "Annotate: A Knowledge Management Support System". HICSS-32, Hawaii, IEEE.

Ginsburg, M., Therani Madhusudan (2001). "Pattern Acquisition to Improve Organizational Knowledge Management". AMCIS 2001, Boston, MA.

Jörding, T., Stefan Michel (2001). "Personalized Shopping in the Web by Monitoring the Customer". Active Web, UK.

Kimbrough, S., Padmanabhan, Balaji, and Z. Zheng (2000). "On Usage Metrics for Determining Authoritative Sites". WITS.

LaMacchia, B. (1997). "Internet Fish". MIS, Cambridge, MIT.

Mobasher, B., Cooley, Robert, and Jaideep Srivastava (2000). "Automatic Personalization Based on Web Usage Mining". Communications of the ACM 43(8): 143-151.

Mobasher, B., Dai, Honghua, Luo, Tao, Miki Nakagawa (2000). "Improving the Effectiveness of Collaborative Filtering on Anonymous Web Usage Data".

Padmanabhan, B., Zheng, Zhiquang, Steven O. Kimbrough (2001). "Personalization from Incomplete Data: What You Don't Know Can Hurt". KDD'01, San Francisco, CA, ACM.

Schnurr, H.-P., Steffen Staab (1999). "A Proactive Inferencing Agent for Desk Support". American Association for Artificial Intelligence (AAAI).

Schwartz, J. (2001). "Giving Web a Memory Cost Its Users Privacy". New York Times: C1.

Sen, S., Padmanabhan, Balaji, Tuzhilin, Alexander, White, Norman and Roger Stein (1997). "On the Analysis of Web Site Usage Data: How Much Can We Learn About the Consumer from Web Logfiles?" European Journal of Marketing: Special Issue on Marketing in Cyberspace.

Sen, S., Padmanabhan, Balaji, Tuzhilin, Alexander, White, Norman and Roger Stein (1998). "On the Analysis of Web Site Usage Data: How Much Can We Learn About the Consumer from Web Logfiles?" European Journal of Marketing: Special Issue on Marketing in Cyberspace.

Vranica, S. (2001). "Web Sites Seek to Turn Data into Dollars". Wall Street Journal: B8.


Event                 Description of event
BeforeNavigate2       Before navigation occurs in the given WebBrowser.
TitleChange           Document title changed.
DownloadBegin         Download of a page started.
NavigateComplete2     Document being navigated to becomes visible and enters the navigation stack.
StatusTextChange      Status text changed. (Occurs when the mouse hovers over a link — the status text changes to the URL.)

Table 1 — Selected Internet Explorer Events