CHI 2007 Proceedings • Collaboration at Work

April 28-May 3, 2007 • San Jose, CA, USA

Exploring Patterns of Social Commonality Among File Directories at Work

John C. Tang1, Clemens Drews1, Mark Smith1, Fei Wu2, Alison Sue1, Tessa Lau1
1 IBM Research, 650 Harry Road, San Jose, CA 95120 USA
{john.tang, cdrews, mark1smi, alisue, tessalau}@us.ibm.com
2 Dept. of Computer Science & Engr., University of Washington, Seattle, WA 98195-2350 USA
[email protected]

ABSTRACT

We studied files stored by members of a work organization for patterns of social commonality. Discovering identical or similar documents, applications, developer libraries, or other files may suggest shared interests or experience among users. Examining actual file data revealed a number of individual and aggregate practices around file storage. For example, pairs of users typically have many (over 13,000) files in common. A prototype called LiveWire exploits this commonality to make file backup and restore more efficient for a work organization. We removed commonly shared files and focused on specific filetypes that represent user activity to find more meaningful files in common. The Consolidarity project explores how patterns of file commonality could encourage social networking in an organizational context. Mechanisms for addressing the privacy concerns raised by this approach are discussed.

Author Keywords

Social networking, social recommendation, enterprise work setting, file management, CSCW, usage logs.

ACM Classification Keywords

H5.m. Information interfaces and presentation (e.g., HCI): Miscellaneous.

INTRODUCTION

Social networking and recommendation features have been rapidly gaining popularity in consumer services. For example, Amazon analyzes their voluminous purchasing data to provide recommendations like, "Customers who bought this item also bought..." [1]. Similarly, the del.icio.us social bookmarking system [4] allows users to benefit from how others have organized their web bookmarks. These systems leverage a collection of individual interactions to deliver some social benefits without any additional work by the individuals. That is, starting with a common point of reference (browsing an item for sale) and mining the volume of sales transactions from individuals' purchases, Amazon can infer relationships with other items that shoppers may be interested in learning about.

We wanted to explore applying these social mechanisms, popular in the consumer arena, to the enterprise work setting. We chose to examine how knowledge workers store and manage files on their workplace computers to see if any social patterns could be identified for the benefit of the company. The computer files that are stored and circulated within a company represent much of the information and knowledge that is being created and processed for the operation of its business. Especially for large, information-intensive enterprises, such as multi-national corporations or research institutions, managing the information contained within the enterprise in a way that enables its members to find the information that they need is a substantial problem.

Thus, how individual members store and manage their computer files is a basic transaction for an enterprise, and one of high enough volume to identify potential patterns. By looking for patterns in the overall aggregate that emerge from individuals' file storage practices, we wanted to identify ways that could help the enterprise better manage its information. These patterns could encourage social networking between people who are looking for information and those who have already stored that information in their personal file directories. We analyzed the collections of data that people store within a shared organizational context, such as a company, to discover meaningful patterns that could appropriately be shared with others within the same organization.

A particular focus of our work is exploring opportunities from identifying commonality among the computer files that people store within a shared organizational context. These commonalities suggest shared interests or shared experiences that would be useful to discover. Could finding other people who also downloaded the same report from the web help you discover others in your company with a shared interest? Or, could detecting who has the same software libraries help you find experienced users who could help solve a problem you are having with that software?
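The co-occurrence mining behind recommendations like Amazon's can be sketched in a few lines. This is a minimal illustration with hypothetical data and function names of our own choosing; the same counting idea applies whether the items are purchases or files stored in common across users.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical data: each set holds the items one individual has
# (purchases for a shopper, or stored files for a user).
transactions = [
    {"book_a", "book_b", "book_c"},
    {"book_a", "book_b"},
    {"book_b", "book_c"},
    {"book_a", "book_d"},
]

def co_occurrence(baskets):
    """Count how often each unordered pair of items appears together."""
    counts = defaultdict(int)
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

def also_bought(item, baskets, top=3):
    """Items most often seen alongside `item`, most frequent first."""
    counts = co_occurrence(baskets)
    related = [(n, a if b == item else b)
               for (a, b), n in counts.items() if item in (a, b)]
    return [other for _, other in sorted(related, reverse=True)[:top]]
```

For example, `also_bought("book_a", transactions)` ranks `book_b` first, since it co-occurs with `book_a` in two of the four baskets.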

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CHI 2007, April 28–May 3, 2007, San Jose, California, USA. Copyright 2007 ACM 978-1-59593-593-9/07/0004...$5.00.


Note that identifying these collaboration patterns does not require any additional work by the user, but falls out of the work users are already doing in managing and storing files concerning information that they care about. This could be information that they are creating about a topic (in which case they are likely experts on the topic), or information that they are collecting because they are interested in learning about it (they may not be experts, but would still be a useful resource for learning more about the topic).

The problem of finding information in the enterprise

A growing problem in information technology is managing, organizing, finding, and utilizing the increasing amounts of data that are available within a work organization. Companies expend much effort to encourage employees to organize and publish information so that it can be easily found, with only limited success. Various approaches have included content management systems [20], databases, and other mechanisms for preserving organizational memory.

Even the advent of powerful search technologies over the web has not addressed this problem. Recent research has shown that finding information within a corporate intranet is typically not as successful as on the internet [6]. Intranet web pages tend to be efficiently and hierarchically structured, without much linking to other information, so search engines that rely on page rank scoring are not as effective within intranets. The motivation for arranging information in ways that attract browsing traffic is not the same within the intranet as it is on the internet.

Furthermore, much of the information being stored within the enterprise is not easily shared with others in the organization even though it would be useful to them. Voida et al. [23] document the current lack of support for effectively sharing information with others in a work context. That information is typically stored in "personal" file directories that are not published even within the intranet for internal search. These repositories of information are personal in the sense that they are typically stored (out of convenience) on a hard disk in a personal computer. Much of this information is not personal in a privacy sense, since most of the files created and stored on work computers are intended to be shared with others in the same work organization and beyond.

The work enterprise context

Studying user patterns within an enterprise setting has some features distinct from the consumer environment. Enterprises typically use a trusted "intranet login" mechanism for identifying people who are using its computer services. The Dogear social bookmarking system [14] exploited that user authentication to encourage social networking among the people associated with the information that they share via bookmarks. Furthermore, being part of the same enterprise provides a different context for the privacy concerns that arise when sharing information with others. Within a company, some level of shared goals and mechanisms for dealing with privacy abuses create opportunities for sharing more information internal to the enterprise than would be possible in the consumer world. In that sense, much information created in an enterprise work context should be available to other colleagues within the enterprise, since a business enterprise typically claims ownership of the information managed on its computer infrastructure [22]. Nonetheless, analyzing and searching through personally organized information raises many privacy and security concerns. Recent research highlights several concerns about this kind of shared access to and comparison among personal files [9, 13]. An integral part of our research includes understanding and dealing with the privacy concerns and risks involved within a work setting.

Studying social practices around file directories

This research explores whether meaningful social patterns can be found in the way that users in a work enterprise store and manage files on their work computers. We began by collecting information from enterprise users' file directories and analyzing it for patterns in files that people share in common. This examination of file storage led to a deeper understanding of file storage practice, especially when viewed in the aggregate. Based on understanding and identifying meaningful social patterns, we also explore potential applications that take advantage of these patterns.

RELATED WORK

A few related projects have similar goals of enabling more effective access to information in a work context. The Hebb system [3] analyzed the content of email messages sent within a work organization to abstract users' interests. This system was designed to increase awareness of shared interests among work subgroups, even those located at the same site. By analyzing the semantic content of email, Hebb used "implicit sensing" to infer interests from everyday email correspondence. Our approach differs by looking at all files, not just email. Furthermore, we want to examine only the meta-data about files, which is less privacy sensitive than the content of email messages. Their experience with Hebb deployments showed that initial alarm around privacy intrusions eventually gave way to users seeking more detailed information gleaned from analyzing email messages.

Lotus Discovery Server (LDS) [16] analyzed the content of each user's files (and optionally email) to create an "affinity profile". These profiles could be used to search for expertise within the enterprise. While LDS tried to make this as automated as possible, uncertainty around the content analysis technology indicated the need for more user confirmation and editing of the affinity profiles. Our approach avoids analyzing the semantic content of the files and initially focuses more narrowly on file commonality, which may mitigate some of the privacy concerns but also limits the richness of connections we can find.


The Stuff I've Seen (SIS) project [5] also analyzed the stream of information that a user has seen through their computer desktop and enabled searching through it to find objects of interest. Google desktop search is a modern commercial example of indexing and searching the files on one's own computer desktop [10]. Our approach generalizes beyond searching through the individual's files to searching through the organization's files, and explores what might be thought of as "Stuff We've Seen" or "workgroup search".

The Relescope project [7] inferred which people are associated with which topics by indexing and analyzing publicly available relationship information. For example, Relescope noticed who has co-authored publications with whom on what topics. We look beyond publicly shared information to the information on users' "personal" computers within a shared organizational context. Also, by focusing on discovering files that are shared in common with others, we look for traces of having actually worked together on projects, rather than being listed as a paper co-author or patent co-inventor.

Recently popular systems for social bookmarking demonstrate that people can leverage others' work in organizing and tagging web bookmarks. Del.icio.us [4] enables people to tag bookmarks with labels to help describe their content and share them with others. Dogear [14] translates a similar system into the enterprise environment, where a corporate directory authenticates contributors' identities to encourage social networking. While these sites have shown the collaborative benefits of leveraging how individuals organize information, they are limited to information shared via web pages and require a bit of extra user work in tagging the information.

Concerning finding files in common, Forman et al. [8] worked on a system for finding similar files in large document repositories. Their interest was in managing a large collection of technical documentation, which accreted large amounts of redundant information through updating files and restructuring as a result of mergers and reorganization. While we use a similar approach for identifying similar files, we focus on the information stored on personal work computers, which hold more information and raise more privacy concerns than published file repositories.

This prior work suggests the need for finding ways of discovering shared interests and expertise within an organization and the promise of searching through various kinds of information to infer those interests. Our approach in this paper is to first explore what meaningful patterns are exhibited in the files shared in common with other users. We are letting this inductive exploration drive the research to identify possible applications of the patterns discovered.

EXPLORING COMMONALITY AMONG ACTUAL USERS

To explore what files users have in common, we built a crawler that runs through each file in a user's file directory and captures the file name, full directory path, file size, the date and time information associated with the file (last access date, last modified date, creation date), and an algorithmic hash (MD5 [18]) that encodes the contents of the file into a number. This MD5 hash of the entire file is algorithmically precise and inexpensively encodes the content of the file in an efficient, privacy-preserving format. The MD5 hash enables us to compare which files have identical content without having to semantically analyze the content of the file.

Fifteen volunteers (12 male, 3 female; 13 Windows, 2 Linux) from our company ran the crawler over their file directories. The volunteers included researchers, software developers, managers, interns, and four of the authors of this paper. Users could exclude any branch of their file system by designating folders (and the subfolders they contained) to be skipped by the crawler. Furthermore, access to the data was limited to the research team of four people.

We fed the data from the crawls into a MySQL database that enabled us to run queries to look for patterns over the dataset. We examined pair-wise comparisons among the users as well as patterns over the whole aggregate of the data. The first stage of this project was to get a more descriptive understanding of what files users have in common, which ones are interesting, and how we could programmatically identify patterns of interest.

How much we have in common

Table 1 shows the statistics describing the number of files each person stores and the amount of storage space these files consume. In our sample, users typically had over 120,000 files that occupied over 18 GB of storage space. The data show the general scope of the size of individual file directories, as well as the wide variation among users.

              Average    Median     Range             Avg. Redundancy
# files       134,142    126,780    40,820–273,857    25%
GB storage    18.93      18.62      5.90–42.77        14%

Table 1. Statistics of individual file storage and redundancy

Table 1 also shows the amount of redundancy that occurs within an individual's file system. This redundancy is calculated by taking the mathematical complement of the number of unique files in the user's directory (as determined by the MD5 hash) divided by the total number of files counted in the directory:

% redundancy = [ 1 - ( #unique files / #total files ) ] x 100

On average, there was 25% redundancy in the number of files, which accounts for 14% redundancy in the amount of file storage space for each user. Examining these redundant files revealed that most are a result of multiple installations of the same or similar (perhaps upgraded) programs in different branches of the file directory. This is most likely the result of programmatic installations of software libraries and components in different locations, redundant downloads, or redundant copies kept by the Windows DLL cache for recovering from corrupted DLL files. A small fraction of this redundancy is due to actions initiated by the user, such as deliberately saving a file in multiple locations or copying in the process of versioning a file.
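A minimal sketch of this kind of crawl and redundancy measure, assuming Python and the standard library only (the function names are ours, not those of the crawler described above):

```python
import hashlib
import os

def md5_of_file(path, chunk_size=1 << 20):
    """MD5 hash of a file's full contents, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def crawl(root):
    """Yield (path, size, md5) for every regular file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                yield path, os.path.getsize(path), md5_of_file(path)

def redundancy_pct(hashes):
    """% redundancy = [1 - (#unique files / #total files)] x 100."""
    total = len(hashes)
    return (1 - len(set(hashes)) / total) * 100 if total else 0.0
```

For instance, a directory whose four files hash to `["a", "a", "b", "c"]` has 25% redundancy: three unique hashes out of four files.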

Extending these statistics across the aggregated set of data from all the users, we calculated the "aggregate redundancy" as the complement of the number of unique files across all users divided by the total number of files aggregated over all users. We found an aggregate redundancy of 54% in the number of files, accounting for 32% of total file storage space. These numbers give a sense of the amount of file commonality that naturally occurs.

A closer look at pair-wise comparisons yields more information about the amount and nature of commonality between people. The number of identical files between each pair of participants is surprisingly high; the average was 15,223 and the median was 13,974. The number of identically shared files ranged from 304 to 62,186. On average, the number of files shared in common with another user constituted 19.5% of the files that the user stored. This percentage of files shared with others ranged from as low as 0.3% to as high as 72.9% (a recently hired intern who did not have much time to create unique files on his computer)!

While this number was much higher than expected, the vast majority of these files were related to the operating system (OS), common applications, or files systematically loaded onto the computers by our company before deploying them to employees. Comparing a Windows and a Linux user resulted in almost an order of magnitude fewer identical files than comparing between similar Windows users. Not surprisingly, the OS platform and the applications that run on it define a number of files that computers will have in common. These explorations helped us refine how to find meaningful files in common, as detailed in the next section.

The problem of "temporal pollution"

One temporal pattern we wanted to explore was how recently a user had accessed a file. We believed that the most recent access time of a file would be the best indicator of when a user might have last read or interacted with a file (without modifying it), reflecting how current their interest was in the file. We also reasoned that files that users have in common but have not accessed recently may be more interesting to discover than those that have been seen recently, since users would probably already be aware of recently sharing such a file. However, we discovered that a file's last access date does not accurately represent when the user actually interacted with the file, due to a number of programs that can modify it. In fact, our own crawler reset the last access time when it touched a file to collect information from it! We rewrote our crawler to restore the original access date so as to preserve it, but realized that a number of other programs artificially change the last access date. Many backup programs, virus scanners, and desktop search crawlers change the last access date as a side effect of their normal function. Unfortunately, we believe that the last access date is no longer a meaningful indicator of when the user last accessed a file, because it can be changed by a number of tools that do not preserve the user's last access date.

We refer to this issue as "temporal pollution". Many file systems store temporal data around interactions with files, and this temporal data could be useful in identifying meaningful patterns of user interaction with files. However, if programs are not careful to preserve this temporal meta-data as an indicator of actual user interaction, the data become polluted and cannot be used. This issue reflects a more general concern of "meta-data pollution" that may thwart some efforts to automatically mine usage data for meaningful user interaction patterns. We need to raise environmental protection awareness to preserve the meta-data of files so that they can be used by the mining programs that are coming into popular use.

HOW TO DEFINE MEANINGFUL "FILES IN COMMON"?

Because our own data were included in the files that we analyzed, we could look for data that we expected to be in common among us. For example, we looked for files that we knew we had shared among ourselves, discerned which files in common were not of interest, and recalled stories for how unexpected files in common came to be shared. This first-hand understanding helped us refine our notion of what shared files were meaningful for identifying patterns of interest. As we learned more about how people actually use files, we came to realize that our simplistic approach of using an MD5 hash of the entire file both returns too many uninteresting files in common, and misses files of interest.

Too much noise

As mentioned earlier, much of the commonality between computers is driven by the computing platform being used, such as the OS, applications, and other files that are systematically loaded onto computers (often before a user even touches the computer). We devised two methods to help filter out much of this noise. First, we crawled a "base machine", a standard Windows computer configured with our company's standard client platform, and subtracted the files found on this base machine from the crawls of all the other machines. Second, we applied a Term Frequency-Inverse Document Frequency (TF-IDF) filtering scheme [21] to subtract files that were widely shared among the dataset. If any file was shared in common by at least 70% of the dataset (an empirically determined threshold), it was subtracted from all of the crawls. This algorithm essentially defines an automatically maintained "blacklist" of files that are not interesting because they are so commonly shared (i.e., do not reflect information that is unique to the user). Taken together, these measures reduced the number of files examined per person to an average of 20.4% of the original number of files.
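These filtering steps can be sketched as follows. The data layout (a per-user map from MD5 hash to file size) and the function name are our own assumptions for illustration; the base-machine subtraction, the 70% commonality blacklist, and the 200-byte minimum size used in this work are applied in turn.

```python
def filter_noise(user_files, base_hashes, threshold=0.70, min_size=200):
    """Remove files unlikely to reflect user-specific activity.

    user_files: {user: {md5_hash: file_size_in_bytes}}
    base_hashes: set of hashes found on the stock "base machine" image
    """
    n_users = len(user_files)

    # Blacklist any hash shared by at least `threshold` of the users.
    counts = {}
    for files in user_files.values():
        for h in files:
            counts[h] = counts.get(h, 0) + 1
    blacklist = {h for h, n in counts.items() if n / n_users >= threshold}

    # Keep only files that survive all three filters.
    return {
        user: {
            h: size for h, size in files.items()
            if h not in base_hashes and h not in blacklist and size >= min_size
        }
        for user, files in user_files.items()
    }
```

A design note: computing the blacklist from the dataset itself means it maintains itself automatically as new crawls are added, which is the appeal of the TF-IDF-style approach over a hand-curated exclusion list.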

Null files and other very small files were another source of noise. These are often programmatically created by applications for scratch storage or other preferences. Null files were especially problematic for us, as each user could have many null files that would match with all other users who had any null files. Consequently, we filtered out any files smaller than 200 bytes as not interesting for our analysis.

Missing similar files of interest

We also quickly realized that there are many files that we share with people that did not show up as matching identically via their MD5 checksum. Since we were analyzing data collected from ourselves, we could notice files that we knew we had shared with each other, as email attachments or otherwise, that were not showing up in our comparisons. Further investigation revealed at least two common reasons why our initial approach was missing common files of interest.

First, files that are shared via email attachments often are not permanently stored in a user's file directory. If a user simply double-clicks on an incoming email attachment to view it, that file may only get saved in a temporary scratch file area maintained by the email client. Thus, we would only capture such temporary files that had been viewed recently, before they were overwritten by subsequent files. This can be addressed either by crawling frequently enough to catch those temporarily stored files, or by tapping into each email client's temporary storage area and making a more permanent cache of those files.

The second reason for missing files in common was a more involved discovery of how files could be semantically the same but have different file attributes (including file length and MD5 checksum). This disparity can be introduced depending on how the user interacts with incoming email attachments. If a user directly saves an email attachment into her folder directory (e.g., by dragging and dropping the file into a folder in her directory), that file will be directly copied in a way that preserves the original file length and MD5 hash. However, if the user opens the attachment using an application, and then uses the application to save the file into her file directory, many applications alter the file by embedding the last modified date into the file header. This process fundamentally changes the file length and MD5 hash, meaning that the file will no longer match between sender and receiver, even though the files are semantically identical. Given the prevalence of information shared through email attachments that are opened and saved by an application, we realized that our initial approach of an MD5 checksum on the whole file was not detecting many virtually identical files.

More generally, a number of situations where different people have very similar files would be interesting to detect. Co-authors working on different versions of a document or programmers working on different versions of code are collaboratively sharing information that we would like to detect. Given the meta-data that we collected, we looked for ways to identify these similarities.

Other work on detecting similar files [8, 17] led us to a method of breaking files into smaller chunks, hashing the file chunks, and comparing among the chunks of each file. Such a "fingerprinting scheme" for segmenting files into natural chunks allowed us to find files that are similar but not identical. For our purposes of finding files that were similar enough to be of interest, we counted files as "similar" if at least 20% of the file chunks matched.

Focusing on particular filetypes of interest

We also noticed that most of the interesting files in common come from a small subset of filetypes. By looking at the file extensions most closely related to user activities in our dataset, we could limit the search for commonality to a small subset of all files collected. Table 2 lists those file extensions (in two groups), and some of our analyses are based on this more focused view of commonality. These groupings of filetypes distinguish between files generated by productivity applications that many information workers would use and those more specific to software developers, which was a useful distinction among the users in our dataset.

Productivity Filetypes: doc, ppt, xls, pdf, txt, htm/html, odf, odg, odp, ods, tex

Software Developer Filetypes: c, h, cc, hh, class, cpp, java, cxx, jar, idl, lib, o, rb, pdb, rc, tlb

Table 2. Specific filetypes of interest

Reflecting on aggregate work practices of file storage

Having studied actual data on how people store and manage their files in a work setting, we identified some patterns that could be utilized by applications:

•	Users have a considerable amount (25% on average) of identical file redundancy within their file directories.

•	Work organizations have even more aggregate redundancy (54%) of identical files when comparing the files stored across the collection of individuals.

•	Any pair of individuals has a surprisingly high number (typically more than 13,000) of identical files in common, but the vast majority of them are uninteresting artifacts of the OS, applications, and other systematically installed files.

•	Finding meaningful files in common involves filtering out many uninteresting files and breaking files into smaller chunks to find files that are the same or similar.

We now discuss two directions for applications that take advantage of these patterns.

LIVEWIRE: AN EFFICIENT FILE BACKUP AND RESTORE SYSTEM FOR A WORK ORGANIZATION

One opportunity for taking advantage of the aggregate redundancy exhibited in our dataset is to design a more efficient file backup and restore system. Since we found that there are many identical files both within a user's file directory and shared across two or more users, a file backup and restore system for a work organization need not store those files redundantly. Instead, it can store only one copy of the redundant file and keep pointers to all the users who stored the file. That single copy can be used to restore the file for any of those users, in any location in their file system. This design saves backup storage space and avoids the unnecessary bandwidth of transferring redundant copies of files over the network.

We have built the LiveWire prototype, which streamlines the backup and restore process in this way. In fact, LiveWire realizes even further savings by looking for "natural chunks" within files that are shared in common, using a fingerprinting algorithm similar to that described by Rabin [19]. In general, fingerprinting algorithms are used on streams of bytes to find "meaningful" places to segment the byte stream. In LiveWire, each byte stream is a file, and the fingerprinting algorithm is used to discover meaningful points at which to break the file, forming file extents. These extents are hashed and stored in a hash table to enable redundancy elimination when comparing at the level of a file extent.

By comparing the hash lists between the LiveWire server and the client running on each user's machine, the system only needs to back up the file extents that are not already stored on the server. Thus, the LiveWire client only transfers over the network those file extents that the server does not already have from prior backups of the same user or of any other user in the work organization using the same LiveWire server.

LiveWire is a fully functional backup and restore prototype. Bonkenberg et al. [2] describe the technical details and features of an earlier version of the system. LiveWire has been deployed in our work organization and has been used by 28 users to manage backups for 45 machines (some users back up multiple machines).

CONSOLIDARITY: IDENTIFYING SOCIAL PATTERNS AROUND FILE COMMONALITY

At this point, we have been able to analyze data from fourteen Windows OS machines representing thirteen users (one user backed up two different machines), totaling about 457 GB. The users included researchers, software engineers, interns, a manager, and five of the authors of this paper.

Method

From among the people using LiveWire to back up their computers, we asked volunteers to opt in to having their data analyzed for Consolidarity. To protect the privacy of the volunteers, we enabled them to send us a list of folders that they would like excluded from the Consolidarity analysis. For efficiency in storing backup files, LiveWire broke files into extents that were approximately 1 megabyte in size. However, to find the kind of similarity among files that we were seeking, we needed a much smaller extent size, on the order of 32 kilobytes (KB). We were able to re-chunk the data into the smaller 32 KB extents, with a unique hash for each extent, by copying from the LiveWire server into another image of the server. At this point, we also filtered out any files that the users requested to exclude from the analysis.

Another level of privacy that we offered our volunteers is that we would not look at the semantic content of their files, but would only examine attributes inferred from the content that could help us detect file commonality, along with some other useful meta-data. From the LiveWire server, we abstracted the following information:

•	For each file, the list of extents (and their hashes) that belong to the file

•	For each extent, the list of files that it is a member of

•	Meta-data for each file: size of the file; owner of the file; path to the file in the owner's file directory; created and last modified dates

The resulting information was loaded into a MySQL database. We also computed a "pseudo-MD5" hash of each entire file by concatenating the hashes of the extents that compose the file and taking an MD5 hash of that string. This pseudo-MD5 made it more efficient to find identical files and perform some of the filtering steps. Based on our earlier finding of many uninteresting files in common, we filtered the dataset to exclude any files found on a "base machine", applied a TF-IDF scheme to subtract out any files that were shared by at least 70% of the users in the dataset, and removed files smaller than 200 bytes.

Table 3 shows the effect of our filtering approaches on the pair-wise number of matches found in our dataset. The data from our first approach using the crawler are included for reference, and the reduction from the first row shows the effect of subtracting the base machine, common files identified by TF-IDF, and files smaller than 200 bytes.

Since LiveWire’s fingerprinting algorithm enabled identifying similar (but not identical) files, we switched to using LiveWire to collect data for further analysis in finding patterns of file commonality. This project, which we call Consolidarity, explores for patterns of commonality in consolidated file storage for a work organization. At this


CHI 2007 Proceedings • Collaboration at Work

April 28-May 3, 2007 • San Jose, CA, USA

                               avg. # files   avg. #    median #
                               / computer     matches   matches
Crawler data                   134,142        15,223    13,974
All files identical match       83,615         8,336     5,828
All files similar match                          416       387
All files total match                          8,752     6,214
Productivity files identical    16,090           929       247
Productivity files similar                        15         6
Developer files identical       13,505           986       350
Developer files similar                           15        11
Prod + Dev files total          29,595         1,944       654

Table 3. Comparing statistics based on various filtering approaches for removing noise among number of matches.
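A minimal sketch of the analysis steps behind these numbers: the pseudo-MD5 whole-file hash, the three noise filters, and the identical/similar match classification. All names and data shapes are invented for illustration, and the direction of the 20% overlap ratio is our assumption, since the paper states only "at least 20% of chunks in common".

```python
import hashlib

def pseudo_md5(extent_hashes):
    """Whole-file identity: MD5 over the concatenated extent hashes."""
    return hashlib.md5("".join(extent_hashes).encode()).hexdigest()

def keep_file(file, base_hashes, owners_by_hash, n_users,
              common_cutoff=0.70, min_size=200):
    """Apply the three noise filters: base-machine files, files shared
    by >= 70% of users, and files smaller than 200 bytes."""
    h = pseudo_md5(file["extent_hashes"])
    if h in base_hashes:                               # stock OS/app file
        return False
    if len(owners_by_hash.get(h, ())) >= common_cutoff * n_users:
        return False                                   # near-universal file
    return file["size"] >= min_size                    # drop tiny files

def match_type(extents_a, extents_b, threshold=0.20):
    """'identical' when the extent-hash sequences match exactly;
    'similar' when at least `threshold` of the smaller file's extents
    appear in the other file (ratio direction is our assumption)."""
    if extents_a == extents_b:
        return "identical"
    a, b = set(extents_a), set(extents_b)
    shared = len(a & b)
    if shared and shared / min(len(a), len(b)) >= threshold:
        return "similar"
    return None
```

With these helpers, pair-wise counts like those in Table 3 amount to classifying every surviving file pair across two users' machines.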

From this data, we could write database queries that looked for file commonality among pairs of users, either identical or similar. We could also write queries that focused on the subset of filetypes listed in Table 2, looking at productivity or software developer filetypes. We note that our research method would not scale up to find patterns among all users of a large work organization. However, this approach allowed us to leverage an existing base of LiveWire users and gave us the flexibility to explore patterns for research purposes. We report the patterns we found at this stage to guide how to re-engineer the process for large-scale data capacity, analysis, and performance.

Focusing on the subset of filetypes reduces the number of matches even further. Most of the matches found are entire files that are identical, but the similar matches (using the metric of having at least 20% of chunks in common) reveal the files we were missing in our earlier analysis.

Complementary patterns of file commonality

Our analysis of the data shows that patterns of file commonality reveal relationships different from traditional views of the work organization. Figure 1 shows a visualization of meaningful file commonality among the thirteen users and fourteen machines for all filetypes. In the visualization, each node represents a computer, and the size of the node reflects the number of unique files stored on the disk of that computer (after filtering). The thickness of an edge between nodes reflects the number of matches (identical and similar) between the computers, and the darkness of each edge indicates the percentage of files shared in common relative to the originating node. Each pair of nodes is thus connected by two edges, one in each direction, since the number and percentage of files in common differ depending on the direction. Nodes are plotted according to each person's job role along the x-axis and according to the user labels of each computer on the y-axis.

Figure 1. Visualization of identical or similar files among 14 machines representing 13 users.

The first observation is that people (through their computers) exhibit differing numbers of files in common. The various edge thicknesses represent a wide range of shared files between nodes. This variation in the visualization may help identify the "information brokers" [12] or other information hubs in an organization. For example, the user labeled jl (on the middle right) shares a moderate amount in common with many other users. By contrast, dg shares relatively little in common with others.

Figure 1 also shows a tendency to share more in common with people in the same or similar roles, represented by the thick edges between vertically aligned nodes. The two summer interns (fw and xc) and the six Software Engineers (SE) show much commonality with others in their role and with those in related roles (interns, SEs, and RSM-Ds all do computer programming). Thus, Consolidarity could identify people in similar roles who have experience using the same tools (in this case, software development tools), even though they may work on unrelated projects.

For many users, the amount of commonality differs depending on the specific user being compared. For example, ds shares a lot in common with ek, but not much with everyone else. Knowing the users and the projects that they work on, it is unexpected that they would have many files in common. When asked, both users were surprised and could not offer any explanation for the amount of commonality. Closer inspection of the actual common files shows that they relate to various developer tools. This connection represents just the kind of unexpected but useful commonality that Consolidarity can discover: this pair of users could be useful resources for each other with regard to the tools they use in common, even though neither was aware of it. Note that ehDT and ehLT share much in common, as they are the desktop and laptop computers (respectively) of the same user. However, since ehDT is used to archive video files for a specific project, it has more files in common with others who also worked on that project (such as ek) than ehLT does. Thus, the visualization accurately reflects the different roles of this user's two machines.

To get a less cluttered view of the data, Figure 2 shows a visualization of only the Productivity and Developer filetypes (all of Table 2), those most closely related to the user's activity. This time, nodes are plotted along the x-axis according to the project they are currently working on. This visualization shows that the three nodes in the Koala project share a lot in common, but (ironically) the three nodes in the Consolidarity (C11y) project do not. Since several of the thicker edges slant across project columns, it appears that more of the commonality found bridges across different projects.

Figure 2. Visualization of only Productivity and Developer filetypes that are identical or similar.

For our dataset, viewing only the Developer filetypes could help identify people with shared experience with specific developer tools or software components. As Table 3 shows, viewing only the Productivity filetypes yields the smallest number of matches, but could indicate specific shared interests based on the information content of the files (e.g., a Word document or PowerPoint presentation).

It appears that file commonality helps identify relationships that differ from what can be found in more traditional, organizational displays of social relationship. The Consolidarity approach may find relationships among people that complement other approaches to determining a social network. Our experience suggests that it is relatively rare to discover shared interests based on matching a file with the same information content. The approach may be more commonly useful in finding others who use common resources, such as tools, software libraries, or components. Although they may not be aware of this commonality since they work on different projects, these people may be able to offer help from experience in using those resources.

PRESERVING PRIVACY

Defining appropriate measures for sharing information, even within a work context, is a potentially sensitive issue. The privacy laws in some countries can be interpreted as preventing the sharing of computationally inferred information without first notifying the user [22]. A large part of making any of these ideas usable is preserving people's sense of control over the privacy of their files. Essentially, we are taking advantage of the work people do in managing their "personal files" and finding safe ways to share some of that information with other people within a shared organizational context. We summarize the steps we have taken to preserve the privacy of users' information in our research and in the kinds of applications we imagine growing from this research:

• We enable participants to exclude any file directory branches from the Consolidarity analysis.

• We do not look at the semantic content of users' files, only at attributes derived from the files themselves.

• We only share information learned from this approach within the work environment, and examples used for public presentation are first anonymized.

• By informing users of the files they have in common with others, rather than of inferences extracted from their contents, we let the users interpret why they may have those files in common and what that commonality means.

• By only sharing information around files that users have in common, users will only discover information about activities in which they are "complicit", mitigating the potential for privacy exposure.

Some work functions that deal with sensitive information, such as human resources and legal groups, might also need to be excluded from the data pool for analysis. Companies that set up competing teams to work on the same problem would need to develop safeguards to prevent revealing information to each other. Another hypothetical concern is reviewing research papers: both the authors and the reviewers would have the paper under review in common, but they should not be able to discover each other's identity through that commonality.

Within a shared organizational context, there is some amount of shared values (e.g., most companies share some common business goals and own all the information generated by their employees), so sharing information with other employees at some level seems reasonable. We plan to build on existing and emerging privacy guidelines, such as limiting the scope of what is shared depending on the closeness of organizational context, consistent with some of the privacy groupings found by Olson et al. [15]. For example, you might share more information with people in your same work group, less with those in the same division, and perhaps even less with just anyone in the same company. You might also be willing to share some information even with those outside your company. The main concern is to give users clear awareness of and control over how their information is being used. Furthermore, we are seeing a growing number of services that analyze aggregated collections of data to make inferences about the interests of their users. Besides the Amazon example mentioned earlier, Google's Gmail [11] takes advantage of information available in users' personal email to suggest topics of interest to them. We need to actively develop privacy guidelines for analyses over large collections of data so that these services are developed within appropriate privacy perspectives.

FUTURE WORK

While the data we have collected so far show some intriguing patterns, we plan to continue collecting data from a more diverse set of users and looking for common patterns. We also want to explore whether the structure that users give to their data as they store it in their file directories, which helps them manage and retrieve it, would be "transitively" useful to others, too. That is, users store files in specific file folders, along with other related files, and perhaps within a folder structure. Thus, the everyday practice of structuring information in personal directories could suggest related information that would be useful to colleagues who are searching for that information. Two users who share a file in common may have each related it to several other files placed in their respective folders. Providing an interface that enables users to transitively discover this related information, without violating privacy or security concerns (given that people are working within a common business organization), could be useful when searching for information on the intranet. We would like to explore leveraging the implicit data relationships that people naturally create when organizing their personal file directories in a way that would benefit work colleagues when they look for information.
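The "transitive" discovery idea above can be sketched as follows: for each file two users have in common, the other files that the second user keeps in the same folder become candidate suggestions. The data structures here are hypothetical, invented for illustration:

```python
import os
from collections import defaultdict

def related_through_commonality(paths_by_user, me, other):
    """paths_by_user maps user -> {file_hash: path}. For every file that
    `me` and `other` have in common, suggest the other files that `other`
    keeps in the same folder -- files possibly related by their filing."""
    mine, theirs = paths_by_user[me], paths_by_user[other]
    # Group the other user's files by their containing folder.
    by_folder = defaultdict(set)
    for h, path in theirs.items():
        by_folder[os.path.dirname(path)].add(h)
    suggestions = set()
    for h in mine.keys() & theirs.keys():          # files in common
        siblings = by_folder[os.path.dirname(theirs[h])] - {h}
        suggestions |= siblings - mine.keys()      # only files I lack
    return suggestions
```

A real interface would still have to mediate these suggestions through the privacy controls described earlier, rather than expose folder contents directly.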

Figure 3. Concept sketch of reflecting commonality patterns back to users.

We have also been exploring ideas for user interfaces that reflect the patterns we find among users' files back to the users, shown in the concept sketch in Figure 3. We would reflect statistical data about a user's files, such as how many files they have, how much disk space they occupy, and how that compares with others in their work organization. Then, we might present a "tag cloud" of labels that characterize their data, drawn from the file folder labels used to organize it. In the experimentation we have done so far, the most promising approach is to use only the folder labels that contain user-created files of the focused set of filetype extensions (all of Table 2). We could then rank each file folder label by adding up the number of files contained in that folder (or its subfolders). We want to explore ways in which we could algorithmically create a tag cloud that characterizes the user's files. Furthermore, we want to show which folders in the user's file hierarchy contain files shared with others, and with whom they are shared. Figure 3 shows an idea where each person is assigned a color, and the file folder tags that contain files shared with other people have their letters "colorized" with the colors of the people with whom they are shared. The people associated with these colors are shown in the bottom section, along with a representative list of the files shared in common with them. We hope to create a compact and effective representation that shows users where in their file directory their commonality with others occurs, and with whom.
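The folder-label ranking described above might be sketched like this, counting each qualifying file toward every folder label on its path so that parent folders accumulate their subfolders' files. The extension list is a stand-in for the Table 2 filetypes:

```python
import os
from collections import Counter

# Hypothetical stand-ins for the Productivity and Developer filetypes of Table 2.
ACTIVITY_EXTS = {".doc", ".ppt", ".xls", ".pdf", ".java", ".c", ".h", ".py"}

def folder_tag_weights(paths):
    """Weight each folder label by the number of user-created files
    stored in that folder or in any of its subfolders."""
    weights = Counter()
    for path in paths:
        if os.path.splitext(path)[1].lower() not in ACTIVITY_EXTS:
            continue                                # not a user-activity filetype
        labels = path.strip("/").split("/")[:-1]    # folder labels on the path
        for label in labels:
            weights[label] += 1    # ancestors count subfolder files too
    return weights
```

The resulting weights could size the labels in a tag cloud; the most-populated folders of user-created files would render largest.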

We should point out that one of our original hopes, easily finding commonality in files stored among enterprise users, has turned out to be more complicated than expected. Even with the LiveWire prototype, detecting meaningful commonality among users has required surmounting computational challenges in analyzing large volumes of data (chunks of files), as well as clever algorithms for sorting out noise while enabling the discovery of meaningful patterns. However, most of this work lies in discovering meaningful ways of detecting these patterns; we still expect that once they are discovered, they can be computed algorithmically without any additional end-user work and with only minimal administrative work. One lesson to be learned from this research is that the actual work practice of enterprise file management makes the conceptually simple premise of comparing files among users to detect commonality rather complex.

We plan to continue collecting data and exploring for social patterns. We believe that storing and managing files is a fundamental user "transaction" in the work enterprise space, just as purchases and bookmarks are in the consumer social space. We believe that, without any additional user effort, there is an opportunity to provide organizational benefit by sharing these social patterns around files while preserving individuals' sense of control over the privacy of their files.

ACKNOWLEDGEMENTS

We gratefully thank the 15 anonymous users who allowed us to crawl their file directory data and the 13 anonymous LiveWire users who opted in to the Consolidarity analysis. We thank Martin Wattenberg and the Visual Computing Lab for the visualization code. We thank Anurag Sharma, Steve Welch, and Dan Shiffman for their help and support.

REFERENCES

1. Amazon, "Online Shopping for Electronics, Apparel, Computers, Books, DVDs & more", http://amazon.com/ (verified January 17, 2007).

2. Bonkenburg, Ted, Dejan Diklic, Benjamin Reed, Mark Smith, Michael Vanover, Steve Welch, and Roger Williams, "LifeBoat: An Autonomic Backup and Restore Solution", LISA 2004, Atlanta, November 2004, pp. 159-170.

3. Carter, Scott, Jen Mankoff, and P. Goddi, "Building Connections among Loosely Coupled Groups: Hebb's Rule at Work", Computer Supported Cooperative Work, Vol. 13, Issue 3-4, 2004, pp. 305-327.

4. del.icio.us, "del.icio.us/about", http://del.icio.us/about/ (verified January 17, 2007).

5. Dumais, Susan T., E. Cutrell, J. J. Cadiz, G. Jancke, R. Sarin, and D. C. Robbins, "Stuff I've Seen: A system for personal information retrieval and re-use", SIGIR 2003, Toronto, July 2003, pp. 72-79.

6. Fagin, Ronald, Ravi Kumar, Kevin McCurley, Jasmine Novak, D. Sivakumar, John A. Tomlin, and David Williamson, "Searching the workplace web", WWW 2003, pp. 366-375.

7. Farrell, Stephen, Christopher Campbell, and Suvda Myagmar, "Relescope: an experiment in accelerating relationships", Extended Abstracts: CHI 2005, Portland, April 2005.

8. Forman, George, Kave Eshghi, and Stephane Chiochetti, "Finding Similar Files in Large Document Repositories", KDD 2005, Chicago, August 2005.

9. Good, Nathaniel, and Aaron Krekelberg, "Usability and privacy: a study of Kazaa P2P file-sharing", CHI 2003, pp. 137-144.

10. Google Desktop, "Google Desktop -- Features", http://desktop.google.com/about.html (verified January 17, 2007).

11. Gmail, "About Gmail", http://mail.google.com/mail/help/about.html (verified January 17, 2007).

12. Haythornthwaite, Caroline, "Social network analysis: an approach and technique for studying information exchange", Library & Information Science Research, Vol. 18, No. 4, 1996, pp. 323-342.

13. Lau, Tessa, Oren Etzioni, and Daniel S. Weld, "Privacy Interfaces for Information Management", Communications of the ACM, Vol. 42, No. 10, October 1999, pp. 88-94.

14. Millen, David R., Jonathan Feinberg, and Bernard Kerr, "Dogear: Social bookmarking in the enterprise", CHI 2006, Montreal, April 2006, pp. 111-120.

15. Olson, Judith S., Jonathan Grudin, and Eric Horvitz, "A Study of Preferences for Sharing and Privacy", Extended Abstracts: CHI 2005, Portland, April 2005, pp. 1985-1988.

16. Pohs, W., G. Pinder, C. Dougherty, and M. White, "The Lotus Knowledge Discovery System: Tools and Experiences", IBM Systems Journal, Vol. 40, No. 4, 2001, pp. 956-966.

17. Policroniades, Calicrates, and Ian Pratt, "Alternatives for Detecting Redundancy in Storage Systems Data", USENIX 2004 Annual Technical Conference, Boston, June 2004, pp. 73-86.

18. "RFC 1321", http://rfc.net/rfc1321.html (verified January 17, 2007).

19. Rabin, M. O., "Fingerprinting by Random Polynomials", Report TR-15-81, Center for Research in Computing Technology, Harvard University, 1981.

20. Rockley, Ann, with Pamela Kostur and Steve Manning, Managing Enterprise Content: A Unified Content Strategy, New Riders Publishing, 2002.

21. Salton, Gerard, and Christopher Buckley, "Term-weighting approaches in automatic text retrieval", Information Processing and Management, Vol. 24, No. 5, 1988, pp. 513-523.

22. Schirmer, Andy, "Privacy and knowledge management: Challenges in the design of the Lotus Discovery Server", IBM Systems Journal, Vol. 43, No. 3, 2003, pp. 519-531.

23. Voida, Stephen, W. Keith Edwards, Mark W. Newman, Rebecca E. Grinter, and Nicolas Ducheneaut, "Share and Share Alike: Exploring the User Interface Affordances of File Sharing", CHI 2006, Montreal, April 2006, pp. 221-230.
