EmailNet: A System for Automatically Mining Social Networks from Organizational Email Communication

Marshall van Alstyne
School of Information, University of Michigan
[email protected]

Jun Zhang
School of Information, University of Michigan
[email protected]

Abstract

Email is a valuable and pervasive means of communication in the information society. It pervades business, social, and technical exchanges, and as such it is a highly relevant area for research on communities and social networks [Tyler et al., 2003]. As an indicator of collaboration and knowledge exchange, email provides a rich source for extracting informal social networks across an organization. This approach has many advantages over traditional self-report-based methods. However, there are currently few tools that help social researchers automatically extract network structures from email while also handling privacy concerns. As part of an NSF project on Information Productivity, we developed EmailNet, a system that automatically mines organizational email traffic and generates information on social networks for further analysis. EmailNet flexibly handles the issues that researchers usually encounter in accessing and processing email data. It can also analyze email usage patterns at the individual level as well as interaction details at the working group level. Furthermore, it provides an email-oriented network visualization interface that helps researchers explore social network patterns intuitively. EmailNet is intended to help social scholars and policy and organizational archivists gather, process, and analyze email more easily and accurately, and so to facilitate the study of organizational electronic communications.

Contact:
[email protected]
School of Information
University of Michigan
Ann Arbor, MI, 48109
Tel: 1-734-763-2285
Fax: 1-734-764-2475
Email: [email protected]

Keywords: Email Analysis, Social Network Analysis, Network Visualization, Communication Network Analysis

Support: This work has been generously funded by Intel and NSF Career Award #9876233
EmailNet: A System for Automatically Mining Social Networks from Organizational Email Communication

Marshall van Alstyne, Jun Zhang

Email is a valuable and pervasive means of communication in the information society. It pervades business, social, and technical exchanges, and as such it is a highly relevant area for research on communities and social networks [Tyler et al., 2003]. As an indicator of collaboration and knowledge exchange, email provides a rich source for extracting informal social networks across an organization [Wellman, 2001; Tyler et al., 2003]. This approach has many advantages over traditional self-report-based methods [Guimera et al., 2003]. However, there are currently few tools that help social researchers automatically extract network structures from email while also handling privacy concerns. As part of an NSF project on Information Productivity, we developed EmailNet, a system that automatically mines organizational email traffic and generates information on social networks for further analysis. EmailNet flexibly handles the issues that researchers usually encounter in accessing and processing email data. It can also analyze email usage patterns at the individual level as well as interaction details at the working group level. Furthermore, it provides an email-oriented network visualization interface that helps researchers explore social network patterns intuitively. EmailNet is intended to help social scholars and policy and organizational archivists gather, process, and analyze email more easily and accurately, and so to facilitate the study of organizational electronic communications.

EmailNet is used in our Information Productivity project, which studies how consultants use email to share information and make connections to their customers. In this project, the data capture module of EmailNet was installed at two mid-size executive search firms, and the email communications of more than eighty participants were continuously captured and processed over a six-month period. Patterns such as social network structure, email usage, and their correlations with accounting data were revealed with the help of the EmailNet system.
Strengths and issues of studying social networks through email

Strengths of email as network data

The rapid development of electronic communication provides a powerful alternative for studying social networks in some respects. The advantages of email communication as social network data include:

• It provides plentiful data on personal communications in a standard electronic form that is relatively easy to process.
• The ubiquity of email within society and organizations makes it the most promising resource for studying large-scale social network structures, which is very difficult to do with other methods [Schwartz and Wood, 1993].
• Email records not only who links to whom, but also the frequency, longevity, and reciprocity of such interactions, which may reveal a weighted organizational social network structure more precisely [Whittaker et al., 2002].
• Email records the content of communication, which can be used to categorize different types of social relationships through text or genre analysis.
• Email automatically timestamps each social interaction at a fine granularity. Temporal analysis of email archives can help us examine the dynamics of the organizational social structure [Carley, forthcoming].
• Because each email is stored in both the sender's and the receivers' mailboxes, a network generated from only part of an organization's mailboxes can come close to the organization's complete social network.

Issues in accessing and processing email networks

First, personal privacy and organizational confidentiality concerns are the biggest barriers to email-related social research. Analyzing only header information can alleviate this problem. However, ignoring the information carried in the message body significantly limits the potential of email as a research proxy for social relationships.

Second, although the format of email is relatively standard and it is easy to generate a list of links from email archives, automatically generating an accurate informal social network from email is a far more complex process [Tyler et al., 2003]. Several processing issues must be addressed, such as merging multiple identities of the same person, identifying spammers and group aliases, categorizing social relations by email content, weighting ties by indicators such as reciprocity and frequency, and aggregating the "To", "Cc", and "Bcc" fields with different algorithms.

Finally, email is not a substitute for other forms of communication, so social networks generated from email may differ from networks generated by other methods [Carley and Wendt, 1991].

EmailNet is designed and implemented to make full use of the potential of email data and to address these processing issues.
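The processing issues listed above can be made concrete with a small sketch. It is not EmailNet's implementation: the field weights, the alias table, the exclusion list, and the reciprocity measure are all assumptions chosen for illustration, and the addresses are invented. The sketch merges multiple identities, drops group aliases, aggregates "To", "Cc", and "Bcc" with different weights, and computes a simple reciprocity score.

```python
from collections import defaultdict

# Hypothetical weights per recipient field; the paper does not give EmailNet's values.
FIELD_WEIGHTS = {"to": 1.0, "cc": 0.5, "bcc": 0.5}

def canonical(address, aliases):
    """Map an address to a single identity; unknown addresses map to themselves."""
    address = address.strip().lower()
    return aliases.get(address, address)

def build_ties(messages, aliases=None, excluded=frozenset()):
    """Return {(sender, recipient): weight} aggregated over all messages."""
    aliases = aliases or {}
    ties = defaultdict(float)
    for msg in messages:
        sender = canonical(msg["from"], aliases)
        if sender in excluded:  # e.g. known spammers
            continue
        for field, weight in FIELD_WEIGHTS.items():
            for rcpt in msg.get(field, []):
                rcpt = canonical(rcpt, aliases)
                if rcpt in excluded or rcpt == sender:  # group aliases, self-copies
                    continue
                ties[(sender, rcpt)] += weight
    return dict(ties)

def reciprocity(ties, a, b):
    """Ratio of the weaker to the stronger direction between a and b (0 to 1)."""
    ab, ba = ties.get((a, b), 0.0), ties.get((b, a), 0.0)
    return min(ab, ba) / max(ab, ba) if max(ab, ba) > 0 else 0.0

# Invented header records for illustration only.
messages = [
    {"from": "ann@firm.example", "to": ["bob@firm.example"],
     "cc": ["all-staff@firm.example"]},
    {"from": "Ann.Lee@firm.example", "to": ["bob@firm.example"]},
    {"from": "bob@firm.example", "to": ["ann@firm.example"],
     "cc": ["cy@firm.example"]},
]
aliases = {"ann.lee@firm.example": "ann@firm.example"}
ties = build_ties(messages, aliases, excluded={"all-staff@firm.example"})
```

With these records, the two addresses for the same person collapse into one node, the group alias contributes no tie, and the tie from ann to bob carries twice the weight of the reverse tie. Different choices of field weights yield different networks, which is why the aggregation rule is worth making explicit.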
Implementation of EmailNet

Framework

As shown in Figure 1, the EmailNet system framework consists of three main modules. First, in the capturing module, email is captured, hashed, and stored in the database. Second, in the preprocessing module, filters are applied to address the issues noted above. Finally, the analysis module provides several statistical functions and a social network visualization and analysis interface.

Figure 1. EmailNet system framework

Email capturing and hashing

There are two ways to capture individual emails, depending on the agreement between researchers and participants: extracting mail from personal email clients or directly from organizational mail servers. EmailNet supports both and emphasizes the latter. During data capture, to protect participants' privacy, each email is hashed so that the messages are unreadable but can still be analyzed automatically with information retrieval methods. The hashing process stems all the words in the subject and body, deletes common words, and converts the remaining readable words into hash-code vectors.

Email preprocessing

The original network generated from all captured emails is large and complex, and it contains significant noise, such as spammers and duplicated identities. Different aggregations of "To", "Cc", and "Bcc" also generate different networks. EmailNet uses several filters to handle these issues. An email categorization function based on text clustering is also provided, so researchers can categorize emails into several given types of social connections.

Email usage patterns

Several email usage pattern analysis functions help users investigate email interactions in detail, such as the distribution of messages across hours and days (Figure 2) and response thread visualization.
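The hashing step described under "Email capturing and hashing" can be sketched roughly as follows. This is an illustration under stated assumptions, not EmailNet's code: the stop-word list is a tiny sample, `crude_stem` is a deliberately simple stand-in for a real stemmer, and the use of a keyed SHA-256 hash with a private key is our assumption about one reasonable way to make the codes hard to reverse.

```python
import hashlib
import hmac
import re

# Hypothetical stop-word list and secret key; a real deployment would use a
# fuller list and a key held privately by the participating organization.
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "on", "the", "to", "we"}
SECRET_KEY = b"replace-with-a-private-key"

def crude_stem(word):
    """Strip one common English suffix (a stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def hash_token(token):
    """Keyed hash, so the same stem always maps to the same opaque code."""
    digest = hmac.new(SECRET_KEY, token.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]

def hash_message(text):
    """Return {hash code: count} for the stemmed, non-stop-word tokens."""
    vector = {}
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in STOP_WORDS:
            continue
        code = hash_token(crude_stem(word))
        vector[code] = vector.get(code, 0) + 1
    return vector

v1 = hash_message("We are interviewing the placed candidate")
v2 = hash_message("Candidate interviews and placing")
```

Here `v1` and `v2` contain the same three codes even though neither vector holds a readable word, so text clustering or similarity measures can still operate on the hashed vectors. Word order is lost, which is part of what makes the stored messages unreadable.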
Figure 2. Incoming email distribution across hours

Network visualization and analysis

An email-oriented network visualization interface helps users explore the email social network. As shown in Figure 3, the visualization has three main features: a graph layout based on social attributes, flexible filters for reducing the complexity of the network, and time-based animation of network dynamics.
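One way to prepare data for filtering and time-based animation is to slice the timestamped messages into fixed windows and keep only the ties that exceed a threshold in each window. The sketch below assumes weekly windows and a per-window message-count threshold; both choices, the function name `window_frames`, and the sample events are hypothetical and do not describe how EmailNet builds its animation.

```python
from collections import Counter
from datetime import datetime, timedelta

def window_frames(events, start, window=timedelta(days=7), min_weight=1):
    """Group (timestamp, sender, recipient) events into per-window edge counts.

    Returns a time-ordered list of (window_start, {(sender, recipient): count}),
    keeping only edges seen at least min_weight times within a window.
    """
    buckets = {}
    for when, sender, recipient in events:
        if when < start:
            continue
        index = (when - start) // window  # whole windows elapsed since start
        buckets.setdefault(index, Counter())[(sender, recipient)] += 1
    frames = []
    for index in sorted(buckets):
        edges = {e: n for e, n in buckets[index].items() if n >= min_weight}
        frames.append((start + index * window, edges))
    return frames

# Invented events for illustration only.
events = [
    (datetime(2004, 1, 5, 9, 0), "ann", "bob"),
    (datetime(2004, 1, 6, 14, 30), "ann", "bob"),
    (datetime(2004, 1, 7, 10, 0), "bob", "cy"),
    (datetime(2004, 1, 13, 11, 0), "bob", "ann"),
]
frames = window_frames(events, start=datetime(2004, 1, 5), min_weight=2)
```

Each frame can then be drawn in sequence. Raising `min_weight` thins the network in every frame, which is one simple form of complexity-reducing filter.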
Figure 3. Visualization of one person's email network in EmailNet

EmailNet uses the R SNA package as its social network analysis engine, and its data can be exported directly to other social network analysis and visualization tools, such as UCINET and Pajek. A detailed case study on the use of EmailNet to research information productivity is described in the study's final paper.
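As an illustration of what such an export involves, the sketch below writes a weighted, directed tie list in the plain-text Pajek .net layout (a `*Vertices` section with 1-based ids and quoted labels, followed by an `*Arcs` section of source, target, and weight). The function name `to_pajek` and the sample ties are hypothetical; EmailNet's own exporter is not described in this paper.

```python
def to_pajek(ties):
    """Render {(source, target): weight} as Pajek .net text with directed arcs."""
    nodes = sorted({name for edge in ties for name in edge})
    index = {name: i for i, name in enumerate(nodes, start=1)}  # Pajek ids start at 1
    lines = [f"*Vertices {len(nodes)}"]
    lines += [f'{index[name]} "{name}"' for name in nodes]
    lines.append("*Arcs")
    for (source, target), weight in sorted(ties.items()):
        lines.append(f"{index[source]} {index[target]} {weight:g}")
    return "\n".join(lines) + "\n"

# Invented ties for illustration only.
ties = {("ann", "bob"): 2.0, ("bob", "ann"): 1.0, ("bob", "cy"): 0.5}
net_text = to_pajek(ties)
```

The sketch assumes that labels contain no double quotes. With hashed or anonymized identifiers as labels, the same file can be shared without exposing participants' addresses.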
Discussion and future work

In this paper, we discussed issues in the study of email social networks and described the implementation of EmailNet, a system for automatically mining social networks from organizational email communications. By providing EmailNet freely online, we hope to help social researchers automate the complex process of data capture, processing, and mining so that they can focus on the social analysis of the final results, and ultimately to foster research in this field. Future work might include adding the Girvan-Newman (GN) algorithm [Girvan and Newman, 2002] for identifying email communities and developing a genre-based categorization algorithm for email classification.
References

1. Tyler, J. R., Wilkinson, D. M. & Huberman, B. A. (2003). "Email as spectroscopy: automated discovery of community structure within organizations". Preprint, http://xxx.lanl.gov/arXiv:cond-mat/0303264
2. Whittaker, S., Jones, Q. & Terveen, L. (2002). "Contact Management: Identifying Contacts to Support Long-term Communication". In Proceedings of CSCW 2002. New Orleans, LA.
3. Wellman, B. (2001). "Computer Networks As Social Networks". Science 293, 14 Sept 2001: 2032-34.
4. Carley, K. (forthcoming). "Dynamic Network Analysis". In the summary of the NRC workshop on Social Network Modeling and Analysis, Ron Breiger and Kathleen M. Carley (Eds.), National Research Council.
5. Schwartz, M. F. & Wood, D. C. M. (1993). "Discovering Shared Interests Using Graph Analysis". CACM 36(8).
6. Guimera, R., Danon, L., Diaz-Guilera, A., Giralt, F. & Arenas, A. (2003). "Self-similar community structure in organizations". Preprint, http://xxx.arxiv.org/abs/cond-mat/0211498
7. Girvan, M. & Newman, M. (2002). "Community structure in social and biological networks". Proc. Natl. Acad. Sci. USA 99, 7821-7826.
8. Carley, K. & Wendt, K. (1991). "Electronic Mail and Scientific Communication: A Study of the Soar Extended Research Group". Knowledge: Creation, Diffusion, Utilization, 12(4): 406-440.