Conversation Detection in Email Systems Shai Erera
THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE MASTER DEGREE
University of Haifa Faculty of Social Science Department of Computer Science November, 2007
Conversation Detection in Email Systems By: Shai Erera Supervised By: Dr. Shuly Wintner Dr. David Carmel (IBM Labs, Haifa University)
THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE MASTER DEGREE University of Haifa Faculty of Social Science Department of Computer Science Approved by:
Date: (supervisor)
Approved by:
Date: (supervisor)
Approved by:
Date: (Chairman of M.A Committee) November, 2007
Acknowledgment

The research was done under the supervision of Dr. David Carmel from the IBM Haifa Research Lab and Dr. Shuly Wintner from the Department of Computer Science, University of Haifa. I would like to thank David for his excellent guidance, assistance and support throughout this work. I highly regard his support, both academic and social, throughout my graduate studies. I would like to thank Dr. Shuly Wintner for his insights and suggestions on various aspects of this work. I would like to give my special thanks and love to my wife Orit for her endless love and support and for just being there through the whole bumpy road.
Contents

List of Figures
List of Tables
Abstract
1 Introduction
1.1 Threading Importance
1.2 Existing Solutions for Thread Detection
1.3 Our Solution
2 Related Work
2.1 Email Thread Usage
2.2 Email Thread Detection
2.3 Email Topic Detection
2.4 Email Thread Detection Drawbacks
3 Conversation Detection
3.1 Definitions
3.2 Message similarity
3.3 The conversation detection algorithm
4 Experiments
4.1 Evaluation Process
4.2 Dataset
4.3 Results
4.3.1 Email threads evaluation
4.3.2 Subject threads evaluation
4.3.3 Sub-Conversation detection evaluation
4.3.4 Max date difference threshold impact
4.3.5 Conversation detection evaluation
4.3.6 Coefficients Weights Selection
4.4 Discussion
5 Conversations Analysis
5.1 Data Set
5.2 Subject-threads and Conversations Size
5.3 Participants Distribution
5.4 Response Time
5.5 Bursts
5.6 Discussion
6 Summary
List of Figures

1 Email thread example according to RFC 2822
2 The conversation detection algorithm
3 The conversation detection process
4 Date difference threshold impact
5 Summary of evaluation scores
6 Subject-threads and Conversations distributions of size and messages
7 Participants distribution in conversations
8 Response times
9 Message bursts in conversations
List of Tables

1 The five different text matching strategies used to detect a parent message.
2 The distribution of the number of sub-conversations in subject threads and in conversations, in mailboxes A and B.
3 Sub-conversation detection results, using different coefficient weights in Equation 7 (mailboxes A and B respectively).
4 Conversation detection results using different coefficient weights in Equation 7 (mailboxes A and B respectively).
5 Statistics of the mailboxes used for conversations analysis.
Abstract

Electronic mail (email) has become one of the most popular tools for handling conversations among people. In general, a typical user mailbox contains hundreds of conversations. Detecting these conversations was identified long ago as an important task. Clustering the messages into coherent conversations has many useful applications, among them:
• Allowing users to see a greater context of the messages they are reading.
• Reminding users that a conversation is in progress.
• Recording the state of a discussion.
• Collating related messages automatically.
• Reducing messages displayed in inboxes.
• Allowing users to perform actions such as reading or deleting on a group of messages.

This work explores a novel approach for conversation detection in email mailboxes. It clusters messages into coherent conversations by using a similarity function among messages that takes into consideration all relevant email attributes, such as the message subject, the participants, the date of submission, and the message content. The output of the similarity function is a decision score that indicates whether a message should be included in a conversation or not.

The clustering algorithm starts by detecting subject threads in the mailbox (messages are grouped by their subject attribute). It continues by further dividing each subject thread into sub-conversations (messages within a subject thread are grouped by their similarity according to the above attributes). Lastly, it groups sub-conversations into conversations, using the similarity function described above.

The detection algorithm is evaluated against a manual partition of two mailboxes into sub-conversations and conversations. Experimental results demonstrate the superiority of subject threads over traditional email threads and the superiority of the detected conversations over subject threads. In addition, they show that the division of subject threads into sub-conversations is necessary to detect high-quality, coherent conversations.

We also study the behavior of the detected conversations from several aspects, such as:
1) How many participants are in a conversation?
2) Does the conversation size (number of messages) affect the number of participants?
3) What is the response time in small and large conversations?
4) How many emails are sent per day in a conversation?

The results show that a mailbox can be roughly (and equally) partitioned into three groups: single-message conversations, short conversations (2-5 messages) and longer ones. We also show that the number of participants is not affected significantly when the conversation grows.
1 Introduction
1.1 Threading Importance
Electronic mail (email) has become one of the most popular tools for handling conversations among people. In general, a typical user mailbox contains hundreds of conversations. Detecting these conversations was identified long ago as an important task (Venolia, G. D. et al., 2001; Kerr, B., 2003; Ducheneaut, N. and Watts, L. A., 2005). Clustering the messages into coherent conversations has many useful applications, among them:
• Allowing users to see a greater context of the messages they are reading.
• Reminding users that a conversation is in progress.
• Recording the state of a discussion.
• Collating related messages automatically.
• Reducing messages displayed in inboxes.
• Allowing users to perform actions such as reading or deleting on a group of messages.

Venolia et al. (2001) studied the activities people use email for in their own company. One of the activities is “email triage” on a body of new messages. They note that people often spend blocks of time going through their mail and deciding what to do with all their messages. This time is used to process both unread messages and messages that have been read but saved for later action. They built a threaded email client prototype and evaluated it in a lab study. The email client presented the user with messages grouped by email threads, rather than the traditional grouping by folders. They write: “... displaying a message along with all the replies above and below it in the chain provides better local context, which can help users better understand conversations that occur via e-mail”. Their results show that the people who worked with the threaded email client prototype preferred to use it for the given task. They note that one participant said “all messages referring to one idea were grouped together. Made it easy to read & refer back”.

Kerr (2003) mentions the importance of identifying conversation threads in email systems and the necessity of tools for inspecting them: “... The main advantages are: allowing users to see a greater context of the messages they are reading, reminding users that a conversation is in progress, recording the state of a discussion, collating related messages automatically, reducing messages displayed in inboxes, and allowing users to perform actions such as reading or deleting on a group of messages”.

Ducheneaut and Watts (2005) survey the research literature on email and identify three metaphors that have guided email research: email as a file cabinet extending human information processing capabilities; email as a production line and a place to coordinate work; and email as a communication genre supporting social and organizational processes. They note that email conversations are quite involved and hard to track, and that mechanisms to help the user deal with entire conversations are needed. They also note that email conversations take place in some socio-organizational context. However, email clients today do not provide users with any means to understand what is being talked about in a conversation. Some email clients do sort messages by their subject line, but in the authors' opinion this is mostly inefficient because conversations drift away from their initial subject.
1.2 Existing Solutions for Thread Detection
Several email clients deal with conversation detection by detecting email threads. The Internet message format, RFC 2822 (2001), is a common standard that specifies a syntax for text messages within the framework of email systems. It does not provide a precise definition of an email thread, but instead defines how an email thread can be detected using structural attributes embedded in email messages. Figure 1 is an example of an email thread and the attributes that can be used to detect it. It contains a series of three messages that compose a conversation thread between John, Bob and Mary. John first sends a message to Mary, who then replies to John’s message, carbon copying (CCing) Bob to notify him about the discussion. In addition, Mary modifies the subject line of her reply to welcome Bob. Then John replies to Mary’s reply message, quoting her original message in the message body. Notice the “In-Reply-To”, “References” and “Message-ID” attributes, which can be followed to detect that thread.

Thread detection based on structural attributes has some difficulties:
• Not all email clients support the structural attributes required for detecting email threads (Lewis, D. D. and Knowles, K. A., 1997; Klimt, B. and Yang, Y., 2004).
• When the thread is broken, some systems may fail to detect the entire email thread. For example, if a message m3 is a reply to m2, which is a reply to m1, and m2 is deleted, systems that rely on the “In-Reply-To” attribute will fail to detect that thread.
• To continue a conversation one may compose a new message rather than reply to an existing one. In most email clients composing a new message initiates a new email thread. Therefore, the new message will not have any structural relationship with the original thread.
• One may reply to a message while modifying the recipients (as Mary did in Figure 1, CCing Bob in her reply), changing the subject of the conversation (as Mary did), or changing the topic (as John did in his reply to Mary). This might indicate the initiation of a new conversation. However, the structural attributes embedded in the message as a result of the reply command will still link the newly composed message to the previous ones.

Several previous works have studied thread detection that does not rely on structural attributes only. Rudy (1996) defines an email thread as “a collection of messages with a common subject”. According to this definition, conversations can be easily detected by looking for messages with identical subjects. Lewis and Knowles (1997) showed that by applying text matching techniques to the textual portion of messages (the content attribute) they were able to improve thread detection. Klimt and Yang (2004) added message participants into the picture by grouping messages that have the same subject and are exchanged among the same group of people.
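To make the broken-thread difficulty concrete, the following is a minimal sketch of structural thread detection that follows only the “In-Reply-To” attribute. It is not taken from any particular email client; the dictionary keys `id` and `in_reply_to` and the function name `detect_threads` are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def detect_threads(messages):
    """Group messages into threads by following In-Reply-To links only.

    `messages` is a list of dicts with the hypothetical keys
    'id' (the Message-ID) and 'in_reply_to' (parent Message-ID or None).
    """
    # Union-find forest: every message starts as its own thread.
    parent = {m["id"]: m["id"] for m in messages}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for m in messages:
        ref = m.get("in_reply_to")
        # A link is usable only if the referenced message is still in the mailbox.
        if ref in parent:
            parent[find(m["id"])] = find(ref)

    threads = defaultdict(list)
    for m in messages:
        threads[find(m["id"])].append(m["id"])
    return sorted(sorted(ids) for ids in threads.values())


mailbox = [
    {"id": "m1", "in_reply_to": None},
    {"id": "m2", "in_reply_to": "m1"},
    {"id": "m3", "in_reply_to": "m2"},
]
without_m2 = [m for m in mailbox if m["id"] != "m2"]

print(detect_threads(mailbox))     # all three messages form one thread
print(detect_threads(without_m2))  # m1 and m3 end up in separate threads
```

With the full mailbox the three messages are linked into a single thread. Once m2 is deleted, m3 points at a message that no longer exists, so this sketch can no longer connect it to m1.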
1.3 Our Solution
In this work we follow the direction of conversation detection based on email attributes. We begin by sharpening the distinction between email threads and conversations. A conversation is defined in Wikipedia (2007a) as “communication by two or more people, or sometimes with one’s self, often on a particular topic”. Wikipedia (2007b) also defines a threaded discussion as “an electronic discussion (such as one via e-mail, ...) in which the software aids the user by visually grouping messages. Messages are usually grouped visually in a hierarchy by topic”.
From: John Doe
To: Mary Smith
Subject: Saying Hello
Date: Fri, 21 Nov 1997 09:55:06 -0600
Message-ID:

This is a message just to say hello.
-------
From: Mary Smith
To: John Doe
Cc: Bob Roth
Subject: Saying Hello, and welcome Bob
Date: Fri, 21 Nov 1997 10:01:10 -0600
Message-ID:
In-Reply-To:
References:

This is a reply to your hello.
-------
From: John Doe
To: "Mary Smith: Personal Account"
Cc: Bob Roth
Subject: Re: Saying Hello, and welcome Bob
Date: Fri, 21 Nov 1997 11:00:00 -0600
Message-ID:
In-Reply-To:
References:

>> This is a reply to your hello.
I would like to raise another issue...
----

Figure 1: An example of an email thread according to RFC 2822 Internet message format.
As noted before, messages in the same thread may discuss different topics. We refine the conversation definition by defining a conversation as an exchange of messages among the “same” group of people on the “same” topic¹. Messages are grouped into coherent conversations by using a similarity function among messages that takes into consideration all relevant email attributes, such as the message subject, the participants, the date of submission, and the message content. We intentionally ignore structural attributes such as the “In-Reply-To” field, which may link messages belonging to different conversations and may or may not be supported by the email client.

We study the contribution of the message attributes to conversation detection by experimental analysis, using a set of email messages that were manually clustered into coherent conversations. We compare the conversations extracted automatically by the detection process with the manually marked conversations. We show experimentally that the detected conversations better suit the user’s expectation (as reflected by the manually marked conversations) than structure-based email threads and conversations detected by alternative approaches.

In addition, we study the behavior of the newly defined conversations from different aspects: What is the size of a conversation? How many people participate in a conversation? Do conversations have bursts? What is the response time in a conversation? We use a set of conversations, detected using our algorithm, from three different mailboxes. We show that the number of participants in a conversation is independent of the conversation size and that large conversations are characterized by bursts of activity. We also show that a mailbox can be roughly (and equally) divided into three parts: single-message conversations, small conversations (with few messages) and large conversations (with more than six messages).

The rest of the paper is organized as follows. Section 2 discusses related work in the field of thread usage, thread detection and topic detection in messages, and compares it to our approach. Section 3 describes the similarity function among messages and the detection algorithm used in this work. Section 4 describes the experiments we conducted. We first describe the evaluation process, in which a set of manually marked conversations is compared to a set of automatically detected conversations. The evaluation process is used to compare our detection approach to other approaches and to study the relative significance of the message attributes used by the similarity function. Section 5 describes the analysis we have done on the detected conversations. We discuss different aspects of conversations, such as the number of participants in a conversation, the average response time and conversation sizes. Section 6 concludes our work and discusses open issues for further research.
¹ The notions of the same group of people and the same topic will be clarified in Section 3.
2 Related Work
2.1 Email Thread Usage
Extensive research has been done over the past years on using email structure, in particular email threads, in several email-based applications.

Summarization
Lam et al. (2002) present work on email summarization that exploits the thread structure of email conversations and common features such as named entities and dates. The message to be summarized is “expanded” using the content from its ancestor messages. They note that email threads provide valuable context for summarizing email messages and allow summarization systems to exploit the structure of email not found in other documents. Rambow et al. (2004) apply sentence extraction techniques to the thread to construct a generic summary of the thread rather than of a specific message. Wan and McKeown (2004) address the problem of constructing a summary of email discussions. Specifically, they examine ongoing discussions which will ultimately culminate in a consensus in a decision-making process. Their summary provides a snapshot of the current state of affairs of the discussion and facilitates a speedy response from the user. They present a method which uses the structure of the thread dialogue and word vector techniques to determine which sentence in the thread should be extracted as the main issue. Shrestha and McKeown (2004) detect question-answer pairs in an email conversation for the task of email summarization. They note that summarization techniques based on sentence extraction are inadequate for the task, since the produced summaries lack the discourse segments that would have made them more readable and complete, and that a summarization module that includes answers to questions posed in extractive summaries would be very useful. Carenini et al. (2007) describe a method for email summarization, called ClueWordSummarizer (CWS), based on a novel concept called clue words. A clue word from a message is a word (modulo stemming) that also appears in its parent message(s) and/or child message(s) in a quotation graph. They show that CWS produces more accurate summaries than other methods.

Visualization
Venolia and Neustaedter (2003) define an email conversation as the tree of related messages that arises from the use of the reply operation. They propose two models of conversation: the first characterizes a conversation as a chronological sequence of messages; the second as a tree based on the reply relationship. They present a mixed-model visualization that simultaneously presents sequence and reply relationships among the messages of a conversation. They integrated the visualization into a working prototype email client and show through a usability study that the system meets their usability goals and that the visualization fully conveys both types of relationships among the messages of an email conversation. Kerr (2003) describes Thread Arcs, a novel interactive visualization designed to help people use threads found in email. Thread Arcs combines the chronology of messages with the branching tree structure of a conversational thread in a mixed-model visualization that is stable and compact. By quickly scanning and interacting with Thread Arcs, people can see various attributes of conversations and find relevant messages in them easily. He tested his technique against other visualization techniques with the users’ personal email in a functional prototype email client. His results show that Thread Arcs proved an excellent match for the types of threads found in users’ email and for the qualities users wanted in small-scale visualizations.

Kerr and Wilcox (2004) investigate how people use email and how a better email system might be designed and built to aid users in their tasks. They show a prototype email client developed as part of a larger project called “reinventing email”. Among other new capabilities, this integrated prototype incorporates novel visualizations of the documents within mail databases to aid understanding and navigation, advanced text analysis of the content of messages, and different visualizations to help users deal with email threads, such as thread summarization and display techniques as described in (Kerr, B., 2003). Golder et al. (2006) take a different approach to email visualization. Instead of attempting to visualize the current state of the mailbox, they describe Themail, a visualization of the contents of an email archive. Their system uses the content of exchanged messages to show the words that characterize one’s correspondence with an individual and how they change over the period of the relationship. Their user studies show that the participants appreciated being able to look at the big picture, while still being able to seek pieces of information in their mailbox.

Usage
Broder et al. (2005) describe a method of using email threads for efficient indexing of shared content. A reply to an email often quotes the content of the original message. Therefore it is important to efficiently identify email threads to reduce the index size. Furthermore, they mention the advantage of using email threads to better organize the search results of an email search system.

None of the above works, however, specifies how email threads are detected. Furthermore, all of those applications would benefit significantly from better tools for conversation detection.
2.2 Email Thread Detection
Thread detection has also attracted significant attention. Lewis and Knowles (1997) show that by applying text matching techniques to the textual portions of messages they are able to detect threads effectively. They divided a set of 2435 messages into three collections, each containing a single attribute of all messages: subject text, quoted text and unquoted text. In addition, they extracted those attributes from a set of 941 child messages whose parent message was known (according to the In-Reply-To attribute) and used them as test queries against the three collections. Table 1 shows the five different text matching strategies they studied to indicate whether one message is a response to another. Each pair (query, collection) specifies which test queries were run against which collection. For example, the pair (Unquoted Text, Quoted Text) specifies that they used the test queries generated from the unquoted text attribute of the messages to query the collection of messages that was created using the quoted text attribute of the messages. Their results show that the most effective strategy for thread detection is to match the quoted part of one message against the unquoted part of the other messages. This strategy is expected to work well especially for email clients that automatically quote the original message in the reply message.

Queries          Collections
Subject Text     Subject Text
Unquoted Text    Unquoted Text
Unquoted Text    Quoted Text
Quoted Text      Unquoted Text
Quoted Text      Quoted Text

Table 1: The five different text matching strategies used to detect a parent message.
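The strategy reported as most effective, matching the quoted text of a child message against the unquoted text of candidate parents, can be illustrated with a small sketch. This is not the retrieval setup used by the cited authors: a plain Jaccard word-overlap score stands in for a retrieval engine, quoted lines are assumed to start with “>”, and the names `split_quoted`, `overlap` and `find_parent` are hypothetical.

```python
import re


def split_quoted(body):
    """Split a message body into (unquoted, quoted) text.

    Assumption: quoted lines start with '>' after optional whitespace.
    Real email clients use other quoting conventions as well.
    """
    quoted, unquoted = [], []
    for line in body.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(">"):
            quoted.append(stripped.lstrip("> "))
        else:
            unquoted.append(line)
    return "\n".join(unquoted), "\n".join(quoted)


def tokens(text):
    """Lower-cased word set of a text."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def overlap(a, b):
    """Jaccard overlap of two word sets (a stand-in for a retrieval score)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def find_parent(child_body, candidates):
    """Return the id of the candidate whose unquoted text best matches
    the child's quoted text, or None if nothing overlaps."""
    _, child_quoted = split_quoted(child_body)
    scored = [
        (overlap(child_quoted, split_quoted(body)[0]), msg_id)
        for msg_id, body in candidates.items()
    ]
    best_score, best_id = max(scored, default=(0.0, None))
    return best_id if best_score > 0 else None


candidates = {
    "a": "This is a message just to say hello.",
    "b": "Lunch at noon on Friday?",
}
child = "> This is a message just to say hello.\nThis is a reply to your hello."
print(find_parent(child, candidates))
```

A reply that quotes nothing yields an empty query in this sketch, so `find_parent` returns `None` for it.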
Klimt and Yang (2004) addressed the drawbacks of detecting email threads based on structural data. Instead, they group messages that have the same subject attribute and are exchanged among the same group of people (ignoring the content and other attributes). Aaron and Jen-Yuan (2006) follow the work of Lewis and Knowles (1997) by improving their content matching method. They extract nested quotations rather than taking the entire quoted text as one consecutive part. This distinguishes older quoted messages from more recent ones. Furthermore, they compare additional message attributes such as subject, date, and sender/recipient relationship between messages. In addition to thread detection they focus on recovering missing messages (when messages are deleted from a thread). Carvalho and Cohen (2004) describe methods for automatically identifying signature blocks and reply lines in email messages. Although their work focuses on extracting signatures from messages, they note that their methods can be used to successfully detect reply lines in messages, and thus to detect email threads. Shen et al. (2006) detect threads in dynamic message streams such as Internet Relay Chat. Their work focuses on extracting the thread from an interleaved sequence of messages. Detecting threads in such systems differs from detecting them in email systems for two main reasons:
• As noted, messages in chat systems are very short and incomplete, whereas email messages are much more descriptive.
• In many cases a thread of chat messages is the result of several interleaved discussions.
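Returning to the nested-quotation idea described above for email, the following sketch separates a message body by quotation depth, so that older quoted text can be told apart from more recent text. It is only an illustration under a stated assumption: depth is taken to be the number of leading “>” markers on a line, which is not necessarily how the cited work extracts quotations. The name `quotation_levels` is hypothetical.

```python
import re
from collections import defaultdict

# Leading run of '>' markers, possibly separated by whitespace.
QUOTE_PREFIX = re.compile(r"^\s*((?:>\s*)+)")


def quotation_levels(body):
    """Map quotation depth to the text found at that depth.

    Depth 0 is newly written text, depth 1 is text quoted from the
    parent message, depth 2 from the grandparent, and so on.
    """
    levels = defaultdict(list)
    for line in body.splitlines():
        match = QUOTE_PREFIX.match(line)
        depth = match.group(1).count(">") if match else 0
        text = line[match.end():] if match else line
        if text.strip():
            levels[depth].append(text.strip())
    return {depth: " ".join(parts) for depth, parts in sorted(levels.items())}


body = (
    ">> This is a message just to say hello.\n"
    "> This is a reply to your hello.\n"
    "I would like to raise another issue..."
)
for depth, text in quotation_levels(body).items():
    print(depth, text)
```

Each depth can then be compared separately against candidate messages, instead of treating all quoted text as a single block.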
2.3 Email Topic Detection
The problem of email overload (Whittaker, S. and Sidner, C., 1996) is now widely acknowledged and has attracted a lot of attention even in the popular media. To help users deal with the growing amounts of email they receive, new structures of email organization are needed. A variety of solutions have been proposed, including sender-based (Bälter, O. and Sidner, C., 2002), activity-based (Dredze, M., Lau, T., and Kushmerick, N., 2006), and task-based (Whittaker, S., Bellotti, V., and Gwizdka, J., 2006) organization schemes. For example, Dredze et al. (2006) provide successful algorithms to recognize emails that belong to particular activities, such as organizing a conference, reviewing papers, or purchasing equipment.

Recently, several works have proposed organizing messages by topic, rather than by the traditional email threads. Topic detection in email is the task of grouping together messages that discuss the same story. While considerable clustering research has been based on global clustering (i.e., clustering stories over an entire data set), topic detection clustering is done incrementally. For email, this means that once a message is received, the topic detection algorithm can decide to associate the message with one or more of the existing clusters, or create a new cluster with the new message. Clustering messages by topics differs from clustering messages into email threads in several ways:
• In topic detection, a message can be associated with any number of clusters, while in email threads a message belongs to a single thread.
• Email threads maintain a conversation-like pattern, where a message is always a reply to another message (excluding the first message in the thread). Messages that are grouped by topics may be related to different email threads and conversations.
• A topic, conceptually, is not a conversation and is therefore, theoretically, unbounded in time, meaning messages can be added to a topic over several years (for example, when working on a very long project). Although users can continue to reply to messages as long as they wish, the conversation-like pattern that email threads exhibit limits their lifespan (one can reasonably agree that email threads longer than a couple of weeks, or even months, are very rare).

Kleinberg (2002) notes that the appearance of a topic in a stream of email messages is signaled by a “burst of activity”, with certain features rising sharply in frequency as the topic emerges. He developed a formal approach for modeling such “bursts” and showed that they can be identified by the rate of message arrival in a given period of time. We study the existence of such bursts in the conversations we detect in Section 5.5.

Recently, Gabor et al. (2007) developed an email client extension that clusters messages by topics. The similarity between two messages, as applied by the clustering algorithm, is based on the email message subject, date, participants and content attributes. The output of the similarity function is a decision score that indicates whether an email should be matched to a topic or not. When this score is below a clustering threshold for all existing topics, the email is mapped to a new topic. Otherwise, it is mapped to any number of closest topics.
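The threshold rule described above can be sketched as a small incremental assignment step. This is a hedged illustration, not the cited system: the similarity function is supplied by the caller, the toy `word_overlap` score and the cap `max_topics` on how many closest topics receive the message are assumptions made here, and all names are hypothetical.

```python
def assign_to_topics(message, topics, similarity, threshold, max_topics=2):
    """Add `message` to its closest existing topics, or open a new topic.

    `topics` is a list of lists of messages and is modified in place.
    `similarity(message, topic)` returns a score; higher means closer.
    Returns the indices of the topics the message was added to.
    """
    # Score against every existing topic before changing anything.
    scores = [(similarity(message, topic), i) for i, topic in enumerate(topics)]
    matches = sorted((s, i) for s, i in scores if s >= threshold)
    matches = matches[::-1][:max_topics]  # best scores first
    if not matches:
        # Below the threshold for all existing topics: start a new one.
        topics.append([message])
        return [len(topics) - 1]
    chosen = [i for _, i in matches]
    for i in chosen:
        topics[i].append(message)
    return chosen


def word_overlap(message, topic):
    """Toy similarity: Jaccard overlap between the message's words and
    all words already in the topic (an assumption for illustration)."""
    words = set(message.lower().split())
    topic_words = {w for m in topic for w in m.lower().split()}
    union = words | topic_words
    return len(words & topic_words) / len(union) if union else 0.0


topics = []
first = assign_to_topics("budget review meeting", topics, word_overlap, 0.3)
second = assign_to_topics("budget review notes", topics, word_overlap, 0.3)
third = assign_to_topics("lunch on friday", topics, word_overlap, 0.3)
print(first, second, third, len(topics))
```

Because a message may be added to more than one topic, the result is not a partition of the mailbox, which is one of the differences from email threads listed above.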
2.4 Email Thread Detection Drawbacks
None of the above solutions copes completely with the conversation detection task:

• Klimt and Yang (2004) require the messages belonging to a thread to have exactly the same subject line and identical sets of participants, but do not consider the content of the messages. Conversations, on the other hand, may span several threads with similar (but not identical) subject lines. This might happen when the subject has been modified during the conversation thread (see for example Mary’s reply in Figure 1). In addition, a conversation may not include all the participants in all the messages. Participants can be roughly split into two groups, active and passive. The active participants contribute to the conversation while the passive ones are only CCed. When passive participants are added to or removed from the thread, it is reasonable to assume that the conversation continues rather than that a new one begins.

• Lewis and Knowels (1997), and lately Aaron and Jen-Yuan (2006), discover conversation threads by looking for the replies to a given message, comparing the quoted and unquoted text of the messages. While their methods exhibit high effectiveness in detecting threads, they may fail to detect all conversations. When a conversation spans several threads, a message from one thread will not contain the quoted part of messages from the other threads. Aaron and Jen-Yuan (2006) noted that in such cases their approach will fail. In addition, reply messages may not quote the content of the previous message, whether because of the email client behavior or due to the replier’s decision. In such cases, the matching method will not be able to compare quoted and unquoted parts of messages and therefore may fail to detect the conversation.

• Gabor et al. (2007) cluster messages by their topic. Their approach is the most similar to ours and will likely detect conversations similar to those detected by our method. However, as they noted, messages belonging to different conversations on the same topic will be clustered together. Furthermore, they regard the body of the message as one consecutive part rather than identifying its nested quoted parts. This may wrongly group messages that should have belonged to two different conversations. Any text similarity technique that ignores the structure of the content will consider two messages, one quoting the other, as highly similar, even when the unquoted part of the text indicates a significant topic drift.
In our work we use message attributes similar to those in (Gabor, C., Keno, A., and Roger, W., 2007) to measure message similarity. There are two main differences. First, Gabor et al. (2007) focus on message clustering based on topic detection, allowing messages from different conversations to be clustered together. Second, we apply a different similarity function between messages, which we found to be more appropriate for the conversation detection task.
3 Conversation Detection
In this section we formally describe the process of detecting conversations from a collection of email messages. We begin by defining the basic concepts used in this work. We continue by describing the features extracted from the messages to be used by the similarity function among messages, as well as among conversations. Finally, we describe the detection algorithm.
3.1 Definitions
This section defines the basic concepts that are used in this work:

Email Thread is a sequence of messages, ordered by their date of submission, that are related according to their structural attributes. Two messages, e1 and e2, belong to the same email thread if and only if, without loss of generality, e1 references e2 through its structural attributes as defined by the Internet standard RFC 2822 (2001). Figure 1 provides an example of such an email thread.

Subject Thread is a sequence of messages, ordered by their date of submission, that are related according to their subject attribute regardless of their structural relationships. Two messages, e1 and e2, belong to the same subject thread if and only if their “core” subject is identical. A “core” subject is extracted from the full message subject by eliminating prefixes such as “Re:”, “Re: Re:”, “Fwd:”, etc., which are very common in email systems. A variant of subject threads was used by Klimt and Yang (2004) for thread detection, with the additional restriction that the messages also have the same participants.

Sub-Conversation is a sequence of messages, ordered by their date of submission, belonging to a subject thread, that focus on the same topic and are among the same group of people (note that “the same” does not mean “identical” here; the notion is defined precisely in the following subsection). A subject thread may contain one or more sub-conversations. One breaks a subject thread by replying to a message while keeping the subject but changing the topic or modifying the participants. Another indication of a new conversation in the same subject thread is a long time break between two consecutive messages. Such behavior was noted by Lewis and Knowels (1997): users may reply to a message even when their intention is to initiate a new conversation, to save the time of retyping the addresses of the participants. In addition, users may compose a new message on a different topic, but with a subject that appeared in previous messages. Such a message will be part of the subject thread, as determined by its subject, but will belong to a different sub-conversation. A sub-conversation is a special type of conversation in which all messages share the same subject line. Several sub-conversations can also be grouped together to compose a conversation.

Conversation is a sub-conversation, or an ordered sequence of sub-conversations, that focus on the same topic and are among the same group of people. The order of the sub-conversations is determined by their date attribute. The date of a sub-conversation is defined as a (ds, de) pair, where ds and de are set to the earliest and latest dates of all messages in the sub-conversation, respectively. The sub-conversations are then ordered by the ds part of their date.
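The notion of a “core” subject can be illustrated with a short sketch. It is not the implementation used in this work: the prefix list (“re:”, “fw:”, “fwd:”), the case folding, and the names core_subject and subject_threads are assumptions made for illustration only.

```python
import re
from collections import defaultdict

# Assumed prefix list: the text names "Re:" and "Fwd:"; "fw:" is added as a common variant.
_PREFIX = re.compile(r"^\s*(?:re|fwd?)\s*:\s*", re.IGNORECASE)


def core_subject(subject: str) -> str:
    """Strip leading reply/forward prefixes until none is left.

    Lower-casing and whitespace normalisation are assumptions of this sketch.
    """
    previous = None
    while previous != subject:
        previous = subject
        subject = _PREFIX.sub("", subject, count=1)
    return " ".join(subject.split()).lower()


def subject_threads(messages):
    """Group (date, subject) pairs by core subject; each thread is sorted by date."""
    threads = defaultdict(list)
    for date, subject in messages:
        threads[core_subject(subject)].append((date, subject))
    return {core: sorted(thread) for core, thread in threads.items()}
```

For instance, under these assumptions “Re: RE: Fwd: Budget plan” and “Budget plan” share the core subject “budget plan” and therefore fall into the same subject thread.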
3.2 Message similarity
We measure the similarity between two messages as a linear combination of the similarity between their attributes. In the following we describe how the similarity between message attributes is measured.

Subject In general, the subject attribute of a message is a short title of the message topic. The similarity of two subject attributes is determined by the subject words. We use the Dice similarity coefficient to calculate the similarity of subject attributes. Let Si and Sj be the sets of words belonging to the core subject attributes of two messages, ei and ej, respectively. Then the subject similarity is defined as:

$$\mathrm{subj}(e_i, e_j) = \frac{2\,|S_i \cap S_j|}{|S_i| + |S_j|} \qquad (1)$$
Note that two messages with an identical core subject receive the maximal score under this function, while messages with disjoint subjects are scored zero. Messages with similar subjects (for example, after a slight modification of one of the subjects) are still scored highly. Since all messages belonging to a subject thread, or to a sub-conversation, share the same core subject attribute, we define that subject as the subject of the subject thread (or sub-conversation). For a conversation containing several sub-conversations, the subject attribute is determined by concatenating the subjects of all sub-conversations. Therefore, the same function is used to measure subject similarity between conversations.

Date Date attributes are highly important for detecting conversations. As Kalman and Rafaeli (2005) discovered, a reasonable response time for an email message is five days after it has been sent, while most messages are replied to within a matter of hours. A reply to a message after a very long time indicates an intention of the replier to revive the conversation. However, from many perspectives, it is wiser to consider such a reply as the start of a new conversation. Therefore, the date similarity between two messages should decay as their date difference grows. In addition, we use a max date difference threshold, mdf, above which the date similarity is zeroed. Let di and dj be the date attributes of two messages, ei and ej respectively. The date similarity of two messages is defined as:

$$\mathrm{date}(e_i, e_j) = 1 - \min\left(1, \frac{|d_i - d_j|}{mdf}\right) \qquad (2)$$

The date of a conversation is defined as a (ds, de) pair where ds and de are set to the earliest and latest dates of all messages in the conversation. Let (dsi, dei) and (dsj, dej) be the date attributes of two conversations, ci and cj respectively, and without loss of generality let dsi ≤ dsj. The date similarity of two conversations is defined as:

• If dej ≤ dei then the time range of conversation cj is fully contained in the time range of ci and date(ci, cj) = 1.

• If dei ≤ dsj then the conversations are disjoint. If dei > dsj then the conversations intersect. In both cases date(ci, cj) is computed using Equation 2, where di = dei and dj = dsj.

Participants A participants attribute is determined by aggregating all participants mentioned in the From, To, and Cc attributes of the message. In (Gabor, C., Keno, A., and Roger, W., 2007), the similarity between two
participant attributes is calculated by the Dice coefficient similarity between two sets (see Equation 1), while excluding the mailbox owner due to his/her participation in all the mailbox messages. One drawback of this approach is that there is no distinction between active and passive participants. As noted before, passive participants are very common in email conversations, especially in the enterprise. Passive participants who appear in one message but are missing from another can strongly affect the participants similarity under the Dice coefficient, even though their contribution to the discussion is negligible. Therefore we define participants similarity using a variant of the Dice similarity that takes the activity role of participants into consideration.

Let Pi and Pj be the sets of participants of messages ei and ej respectively, including the mailbox owner (we include the mailbox owner since she might have a different activity role in the two messages). Let w(p, e) be the activity weight of participant p in message e. We associate a high activity weight with active participants (coming from the From and To message attributes), a lower weight with passive ones (from the Cc attribute), and a zero weight when p does not participate in message e. Since participants might have different activity roles in the two messages, we average the activity weights of the participants over the two messages:

$$aw(p, e_i, e_j) = \frac{w(p, e_i) + w(p, e_j)}{2}$$

The participants similarity between two messages is defined as:

$$\mathrm{part}(e_i, e_j) = \frac{\sum_{p \in P_i \cap P_j} aw(p, e_i, e_j)}{\sum_{p \in P_i \cup P_j} aw(p, e_i, e_j)} \qquad (3)$$

The similarity function returns the sum of activity weights of the joint participants in the two messages, normalized by the sum of weights of all participants (including those who are absent from one of the messages). This function is highly affected by the absence of active participants from one of the messages, and is less affected by the absence of passive ones.

For a conversation, the average activity weight of a participant is determined by averaging its activity weight over the sequence of messages:

$$aw(p, (e_1, \ldots, e_k)) = \frac{\sum_{i=1}^{k} w(p, e_i)}{k}$$

Hence, Equation 3 can also be used to measure participants similarity between such instances, using the average weight of participants over the sequence of messages.
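The three attribute similarities defined so far (Equations 1 to 3) can be sketched as follows. This is a minimal illustration rather than the implementation used in this work. The text only states that active participants receive a higher weight than passive ones, so the concrete weights 1.0 and 0.5 below are assumptions, as are all function names.

```python
def dice(words_i: set, words_j: set) -> float:
    """Equation 1: Dice coefficient over the word sets of two core subjects."""
    total = len(words_i) + len(words_j)
    return 2 * len(words_i & words_j) / total if total else 0.0


def date_similarity(d_i: float, d_j: float, mdf: float) -> float:
    """Equation 2: linear decay that reaches zero once the gap is mdf or more.

    Dates and mdf must be expressed in the same unit (for example, days).
    """
    return 1.0 - min(1.0, abs(d_i - d_j) / mdf)


ACTIVE, PASSIVE = 1.0, 0.5  # assumed values; the text only requires ACTIVE > PASSIVE


def activity_weights(sender, to, cc) -> dict:
    """w(p, e): From/To participants are active, Cc participants are passive."""
    weights = {p: PASSIVE for p in cc}
    weights.update({p: ACTIVE for p in [sender, *to]})
    return weights


def participants_similarity(w_i: dict, w_j: dict) -> float:
    """Equation 3: averaged activity weight of shared participants over all participants."""
    def aw(p):
        return (w_i.get(p, 0.0) + w_j.get(p, 0.0)) / 2

    shared = sum(aw(p) for p in w_i.keys() & w_j.keys())
    everyone = sum(aw(p) for p in w_i.keys() | w_j.keys())
    return shared / everyone if everyone else 0.0
```

With these assumed weights, dropping a Cc-only participant from a reply lowers the participants similarity less than dropping a To participant does, which is the behavior Equation 3 is designed to produce.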
Content When creating a reply message, many email clients automatically quote the previous message content in the reply. In a long thread of messages, the quoted part of some messages might be very long compared to their unquoted part (which may contain only a few lines of text). For example, consider a sequence of messages (e1, ..., ek) where the email client quotes the previous message in each reply. The content of ek will include the content of all previous messages, along with its own unquoted part. Text similarity between the content of the first message, e1, and the content of ek will likely be low since ek contains many other quoted messages. Furthermore, if the topic has drifted during the thread, the content of e1 might be unrelated to the unquoted content of ek. We therefore split the content of a message into its quoted and unquoted parts, while the quoted part is recursively split into its quoted and unquoted parts. Splitting the content of a message can be done either using the email client’s internal API (if available), or using methods based on the internal representation of the quoted part of a message. For example, many email clients mark the quoted part of a message by prefixing each quoted line with ‘>’.

Consider two messages ei and ej with their corresponding elements ei1, ..., eim and ej1, ..., ejn, as extracted from the message content, and without loss of generality let date(ei) > date(ej). We compute the textual similarity between each element in ej and the unquoted part of ei using the standard cosine similarity measure, taking the maximum value as the similarity measure between the two messages. The idea is that when the unquoted content of the new message, ei, is similar to the unquoted part of ej, or to one of the messages quoted by ej, then ei is similar to the conversation ej belongs to. Comparing the message sub-elements copes well with cases where ei is a reply to ej but the reply text is interleaved with ej’s content, as often occurs in corporate email, when the replier adds his/her reply within the content of the previous message. This usually indicates that ei and ej belong to the same conversation.

The similarity between two elements is calculated using the standard tf-idf cosine similarity. Assume there are n terms in the vocabulary. For each element eli, we refer to the term frequency of term tk (the number of occurrences in the element text) as tfik, and to the document frequency of tk (the number of messages in the mailbox containing it) as dfk. The weight wik is defined as

$$w_{ik} = \log(tf_{ik} + 1) \cdot \log\frac{N}{df_k}$$

where N is the number of messages in the mailbox. Given two elements eli and elj, the similarity between them is calculated by the cosine similarity:

$$\mathrm{sim}(el_i, el_j) = \frac{\sum_{k=1}^{n} w_{ik} \cdot w_{jk}}{\sqrt{\sum_{k=1}^{n} (w_{ik})^2} \cdot \sqrt{\sum_{k=1}^{n} (w_{jk})^2}} \qquad (4)$$

Given elui, the unquoted part of message ei, and elj1, ..., eljk, the elements of message ej, the content similarity between two messages (ei, ej) is defined as:

$$\mathrm{content}(e_i, e_j) = \max_{1 \le t \le k} \mathrm{sim}(el^{u}_{i}, el_{jt}) \qquad (5)$$
For a sub-conversation Si, the content attribute includes the unique elements from all its messages. The content similarity between a message ek and sub-conversation Si, content(ek, Si), is calculated using Equation 5, except that we compare the unquoted part of ek to all the elements of Si. Given two sub-conversations Si and Sj and their corresponding unique elements Si1, ..., Sim and Sj1, ..., Sjn, we compute the similarity between all elements of Si and all elements of Sj using Equation 4. The content similarity between two sub-conversations is defined as:

$$\mathrm{content}(S_i, S_j) = \max_{\substack{1 \le t \le m \\ 1 \le k \le n}} \mathrm{sim}(S_{it}, S_{jk}) \qquad (6)$$
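The content similarity of Equations 4 and 5 can be sketched as below. This is an illustrative sketch under stated assumptions, not the implementation used in this work: elements are tokenized by whitespace after lower-casing, document frequencies are computed over a list of message texts, and all function names are hypothetical.

```python
import math
from collections import Counter


def document_frequencies(texts):
    """df_k: the number of texts (messages) in which each term occurs."""
    df = Counter()
    for text in texts:
        df.update(set(text.lower().split()))
    return df


def tfidf_vector(text, df, n_messages):
    """w_k = log(tf_k + 1) * log(N / df_k) for every known term of one element."""
    tf = Counter(text.lower().split())
    return {t: math.log(c + 1) * math.log(n_messages / df[t])
            for t, c in tf.items() if df.get(t)}


def cosine(u, v):
    """Equation 4: cosine of two sparse weight vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def content_similarity(unquoted_i, elements_j, df, n_messages):
    """Equation 5: best match between the unquoted part of e_i and any element of e_j."""
    u = tfidf_vector(unquoted_i, df, n_messages)
    return max((cosine(u, tfidf_vector(e, df, n_messages)) for e in elements_j),
               default=0.0)
```

Equation 6 follows the same pattern, taking the maximum of cosine over all pairs of elements of the two sub-conversations.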
Similarity Function Finally, the similarity between two messages, ei and ej, is computed by a linear combination of the similarities between their attributes:

$$\mathrm{Sim}(e_i, e_j) = w_s \cdot \mathrm{subj}(e_i, e_j) + w_d \cdot \mathrm{date}(e_i, e_j) + w_p \cdot \mathrm{part}(e_i, e_j) + w_b \cdot \mathrm{content}(e_i, e_j) \qquad (7)$$

Since all similarity functions of all attributes are also defined for conversation attributes, Equation 7 can also be used to measure similarity between such instances.
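As a small worked illustration of Equation 7, the sketch below combines precomputed attribute similarities with a set of coefficient weights. The function name and the dictionary layout are assumptions; the weights in the example are those reported in column All of Table 3 for sub-conversation detection, where the subject coefficient is zero.

```python
def combined_similarity(sims: dict, weights: dict) -> float:
    """Equation 7: weighted sum of the attribute similarities."""
    return sum(weight * sims.get(name, 0.0) for name, weight in weights.items())


# Coefficients from column "All" of Table 3 (subject is ignored at that stage).
SUB_CONVERSATION_WEIGHTS = {"subject": 0.0, "date": 0.5, "content": 0.2, "participants": 0.3}

# Hypothetical attribute similarities for one pair of messages.
example = {"subject": 1.0, "date": 0.8, "content": 0.5, "participants": 1.0}
score = combined_similarity(example, SUB_CONVERSATION_WEIGHTS)  # 0.4 + 0.1 + 0.3
```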
3.3 The conversation detection algorithm
This section describes the algorithm used to detect conversations. In the first step the algorithm clusters messages by their subject attribute: all messages with an identical core subject attribute are grouped together into a subject thread. In the second step, the clustering is applied to the messages in each subject thread, grouping them into sub-conversations based on their similarity (using Equation 7). In the third step, it is applied to the set of sub-conversations, grouping similar sub-conversations into conversations (using Equation 7).

The clustering algorithm efficiently exploits the order between messages. In an ordered set of messages, a message ei cannot be a reply to any message ej, where j > i. Therefore, after processing messages e1, ..., en, there is no need to re-cluster them when new messages arrive. This is similar to many clustering algorithms over a stream of data. Figure 2 details the conversation detection algorithm.

Figure 3 illustrates the conversation detection process. Subject threads include all messages with an identical subject. These threads are broken into sub-conversations, each of which includes all messages belonging to the same conversation according to their date, participants, and content attributes. Similar sub-conversations are then grouped together to form conversations.

1. Detect subject threads: Group messages with an identical core subject attribute (after removing common prefixes, such as “re:”, “fw:”, “fwd:”).

2. Break a subject thread into sub-conversations:
(a) Sort all messages belonging to a subject thread by their date attribute.
(b) Group messages belonging to a specific subject thread into sub-conversations based on their similarity (using Equation 7). Messages are considered similar when their similarity exceeds a certain threshold. Since all messages belonging to a subject thread have an identical subject attribute, we ignore this attribute at this stage by zeroing the subject coefficient weight in the equation, ws = 0.

3. Group sub-conversations into conversations:
(a) Sort all sub-conversations by their date attribute.
(b) Group sub-conversations into conversations based on their similarity (using Equation 7). Sub-conversations are considered similar when their similarity exceeds a certain threshold. Note that in this phase, sub-conversations belonging to the same subject thread might be re-grouped together when their content is highly similar. At this stage, the content similarity plays a major role in the similarity measurement while the date similarity is almost neglected. In addition, since subjects play an important role in the similarity measurement for sub-conversations belonging to different subject threads, ws > 0.

Figure 2: The conversation detection algorithm.

The clustering algorithm can also be applied incrementally. Sorting messages and conversations by their date attributes allows us to efficiently compare a newly arriving message to all the conversations detected so far. If a good candidate is found (that is, the message is very similar to one of the conversations), we add the message to that conversation. Otherwise we start a new one.
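The incremental grouping used in steps 2 and 3 can be sketched as a single pass over date-ordered items. The sketch below is an illustration under assumptions, not the implementation used in this work: the name cluster_stream, the choice of joining the best-scoring cluster, and the toy similarity date_link (which compares a message only with the latest message of a cluster, using Equation 2 with an assumed mdf of 10 days) are all hypothetical. In the actual process the similarity would be Equation 7 evaluated between an item and a (sub-)conversation.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def cluster_stream(items: Sequence[T],
                   similarity: Callable[[T, List[T]], float],
                   threshold: float) -> List[List[T]]:
    """Assign each item, in date order, to its most similar existing cluster.

    An item joins a cluster only if the similarity exceeds the threshold;
    otherwise it starts a new cluster. Earlier items are never re-clustered.
    """
    clusters: List[List[T]] = []
    for item in items:  # items are assumed to be sorted by date already
        best, best_score = None, threshold
        for cluster in clusters:
            score = similarity(item, cluster)
            if score > best_score:
                best, best_score = cluster, score
        if best is None:
            clusters.append([item])
        else:
            best.append(item)
    return clusters


def date_link(day: float, cluster: List[float], mdf: float = 10.0) -> float:
    """Toy similarity: Equation 2 against the latest message in the cluster."""
    return 1.0 - min(1.0, abs(day - cluster[-1]) / mdf)


# Message dates in days; long gaps start new groups.
groups = cluster_stream([0, 1, 3, 20, 22, 40], date_link, threshold=0.5)
```

Because each arriving item is compared only with clusters that already exist, the same function can be called again on newly arrived messages without revisiting earlier assignments.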
Figure 3: The conversation detection process. Subject threads include all messages with an identical subject. These threads are broken to sub-conversations, each includes all messages belonging to the same conversation according to their date, participants, and content attributes. Similar sub-conversations are then grouped together to form conversations.
4 Experiments
To evaluate our approach, we need a set of manually marked messages of a single mailbox. One candidate is a mailbox from the publicly available Enron corpus (Klimt, B. and Yang, Y., 2004). The advantage of using a public corpus is that other systems can be compared to our approach on the same data. However, marking conversations in a mailbox is a subjective task which strongly depends on the user’s familiarity with the mailbox content. Our attempts to mark messages in this corpus proved almost infeasible: since we are not familiar with the content of the messages in the mailbox, looking at a single message often gave no obvious clues as to which conversation the message should belong to. Therefore, given the difficulty of identifying conversations in an unfamiliar mailbox, we evaluated conversations in two mailboxes, each manually marked by its owner. One of the mailboxes belongs to the author of this work, and the other belongs to a colleague who has not been directly involved with this work. This section details the evaluation process, the dataset used for evaluation, and the results obtained by the different approaches.
4.1 Evaluation Process
Our evaluation process is based on a comparison of the manually marked conversations in a given mailbox with the automatically detected ones. One way to compare manual and automatic conversations is to measure the distance between the two partitions of the same set of messages. The following running example will be used for illustration. Assume that the manual detection outputs the following conversations: MC1 = {e1, e2, e3}, MC2 = {e4, e5} and MC3 = {e6, e7, e8, e9}. On the other hand, the automatic process divides the same set of messages into the conversations AC1 = {e1, e2}, AC2 = {e4, e6}, AC3 = {e3, e7, e8} and AC4 = {e5, e9}. It is quite clear that the automatic partition is significantly different from the manual partition. Only two pairs, (e1, e2) and (e7, e8), belong to the same conversation in both partitions. In addition, while the manual partition contains only three conversations, the automatic partition contains four. Therefore, we compare the two partitions by measuring the agreement between them.

Consider a set of messages partitioned into n manual conversations Mc = {MC1, ..., MCn} and k automatic conversations Ac = {AC1, ..., ACk}. We denote by Tp the number of message pairs that belong to the same automatic conversation ACi and for which there exists a manual conversation MCj containing both messages of the pair. The number of all message pairs, Np, in the manual conversations is:

$$N_p = \sum_{i=1}^{n} \frac{|MC_i| \, (|MC_i| - 1)}{2}$$

We define the similarity between the manual and automatic partitions to be Tp / Np. Note that the similarity between two identical partitions is maximal (1.0), since any pair of messages belonging to an automatic conversation also belongs to the corresponding manual conversation. In the example described previously, Np = 10 and Tp = 2 (since only the pairs (e1, e2) and (e7, e8) in the automatic conversations belong to a manual conversation). Hence, the similarity between the manual and automatic partitions is 0.2.
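The agreement score can be checked on the running example with a few lines of code. The sketch below is only an illustration of the definition above; the function name partition_agreement is hypothetical, and raising an error when the manual partition contains no message pairs is an assumption, since that case is not discussed in the text.

```python
from itertools import combinations


def partition_agreement(manual, automatic):
    """Return Tp / Np for two partitions of the same set of messages."""
    # Map every message to the index of its manual conversation.
    manual_of = {msg: idx for idx, conv in enumerate(manual) for msg in conv}
    # Np: number of message pairs inside manual conversations.
    n_p = sum(len(conv) * (len(conv) - 1) // 2 for conv in manual)
    if n_p == 0:
        raise ValueError("the manual partition contains no message pairs")
    # Tp: pairs inside an automatic conversation that share a manual conversation.
    t_p = sum(1
              for conv in automatic
              for a, b in combinations(sorted(conv), 2)
              if manual_of[a] == manual_of[b])
    return t_p / n_p


manual = [{"e1", "e2", "e3"}, {"e4", "e5"}, {"e6", "e7", "e8", "e9"}]
automatic = [{"e1", "e2"}, {"e4", "e6"}, {"e3", "e7", "e8"}, {"e5", "e9"}]
score = partition_agreement(manual, automatic)  # Tp = 2, Np = 10
```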
4.2 Dataset
To carry out the evaluation we need a manually marked dataset against which to compare the conversation detection results. As described above, we used two manually marked mailboxes (mailbox A and mailbox B), each marked by its owner. The statistics of the mailboxes are as follows:

• Mailbox A
– The most recent 448 messages were classified into 145 subject threads (3.09 messages per subject thread on average).
– The 145 subject threads were manually split into 161 sub-conversations (2.78 messages per sub-conversation on average).
– The 161 sub-conversations were manually grouped into 147 conversations (3.05 messages per conversation on average).

• Mailbox B
– The most recent 500 messages were classified into 355 subject threads (1.41 messages per subject thread on average).
– The 355 subject threads were manually split into 376 sub-conversations (1.33 messages per sub-conversation on average).
– The 376 sub-conversations were manually grouped into 334 conversations (1.48 messages per conversation on average).

Table 2 describes the distribution of the number of sub-conversations in the subject threads and in the final detected conversations. In both mailboxes, most subject threads were marked as one coherent conversation, while the rest were split into several sub-conversations. As expected, long subject threads are likely to include more than one conversation. Moreover, approximately 90% of the manually detected conversations are actually sub-conversations (all messages belong to the same subject thread). Only 10% contain more than one sub-conversation, and there is one conversation that contains 4 sub-conversations in mailbox A (7 in mailbox B). These conversations are the most difficult to detect.

Number of sub-conversations      1           2           3           4           7
Mailbox                        A     B     A     B     A     B     A     B     A     B
Number of subject threads    121   337    20    16     3     1     1     1     0     0
Percent (%)                   83    95    14     5     2     0     1     0     0     0
Number of conversations      131   303    15    28     0     1     1     0     0     2
Percent (%)                   89    91    10     8     0     0     1     0     0     1
Table 2: The distribution of the number of sub-conversations in subject threads and in conversations, in mailboxes A and B.
4.3 Results

In this subsection we measure the quality of the conversation detection process by comparing the detected conversations to the manually marked conversations.
4.3.1 Email threads evaluation

To evaluate the applicability of using email threads for conversation detection we compared email threads, extracted by tracking the “in-reply-to” attribute of the messages, to the manual partition of messages to conversations. The evaluation score was 0.679 in mailbox A and 0.377 in mailbox B. This provides a baseline score for all alternative detection processes.

4.3.2 Subject threads evaluation
To evaluate the applicability of using subject threads for conversation detection, we compared the subject threads, detected by accumulating messages with an identical subject attribute (as described in Section 3.3), to the manual partition. The evaluation score improves to 0.872 for mailbox A and 0.701 for mailbox B, demonstrating the superiority of subject threads over email threads for the conversation detection task. Furthermore, subject thread detection can be implemented very efficiently using standard indexing techniques, an important consideration for very large mailboxes.

4.3.3 Sub-Conversation detection evaluation
To evaluate the applicability of using sub-conversations for conversation detection, we performed the following steps:

• Manually break subject threads into sub-conversations. In this step we consider only multi-message subject threads, since for one-message subject threads the algorithm is not invoked.

• Automatically break subject threads into sub-conversations using the algorithm described in Section 3.3, experimenting with different coefficients applied to the similarity function defined in Equation 7.

• Evaluate the similarity between the two partitions.

The results are given in Table 3 and demonstrate the significance of considering all message features (date, content and participants) in the similarity function. The Independent columns show the results of considering only a single attribute to detect sub-conversations. As can be seen, the date attribute alone detects high-quality sub-conversations and omitting it hurts the detection quality. The Pairs columns show the results of considering two attributes at a time with equal weights, while zeroing the third. The results show that the owner of mailbox A values the content attribute more than the owner of mailbox B, who values the date attribute the most. Column All shows that considering all the attributes gives better results than considering only a subset of them. The coefficient weights were selected empirically after applying various weight combinations. The detection quality for both mailboxes was very high (0.99), which suggests that detecting sub-conversations in subject threads is an easy task. In addition, Table 3 shows that both mailbox owners consider the date attribute as the most important one (followed by the participants and content attributes, respectively) for detecting sub-conversations.

4.3.4 Max date difference threshold impact
In addition, we evaluated the impact of the max date difference (mdf) threshold used in the date similarity function on the sub-conversation detection task. This parameter sets the maximal date difference between two messages beyond which the date attribute no longer contributes to their similarity. For this we detected sub-conversations by splitting multi-message subject threads, using different mdf values, and evaluated the partitions. The coefficient weights in the similarity function were the same as the set of values given in the column All of Table 3.

Coeff.          Independent            Pairs                  All
Date            1.00   0.00   0.00     0.50   0.50   0.00     0.50
Content         0.00   1.00   0.00     0.00   0.50   0.50     0.20
Participants    0.00   0.00   1.00     0.50   0.00   0.50     0.30
Score A         0.98   0.91   0.87     0.89   0.93   0.90     0.99
Score B         0.97   0.84   0.85     0.97   0.86   0.84     0.99

Table 3: Sub-conversation detection results, using different coefficient weights in Equation 7 (mailboxes A and B respectively).

Figure 4: The max date difference threshold impact on detection score.

The results are given in Figure 4. The graph shows that the optimal mdf threshold for the two mailboxes is 10 days. A threshold that is too low hurts detection quality. Increasing mdf beyond 10 days neither improved nor hurt the detection quality. This may suggest that either (a) all messages in the two mailboxes were replied to in less than 10 days, or (b) both mailbox owners consider a reply to a message after more than 10 days as no longer part of the same conversation.

4.3.5 Conversation detection evaluation
To evaluate the applicability of our conversation detection process, we performed the following steps:

• Manually group the (manually marked) sub-conversations into conversations, including the single-message ones.

• Automatically group the automatically detected sub-conversations into conversations, applying the algorithm described in Section 3.3 with different coefficient weights of the similarity function.

• Evaluate the similarity between the two partitions.

The results are given in Table 4. The Independent columns show the results of considering only a single attribute to detect conversations. As can be seen, for both mailboxes the content attribute detects high-quality conversations. When it is omitted, the detection quality for mailbox B drops, while for mailbox A the reduction is less significant. Columns All 1 and All 2 show that considering all the attributes gives better results than considering only a subset of them. The coefficient weights were selected empirically after applying various weight combinations. The detection quality for both mailboxes was higher when considering all message attributes than when considering single attributes independently. In addition, the results show that to get higher-quality conversations for mailbox B, the content attribute should be weighted higher than for mailbox A. This shows that detecting conversations in one’s mailbox is not an intuitive task and that different coefficient weights apply to different mailbox owners. However, Table 4 shows that both mailbox owners consider the date attribute as almost insignificant for the conversation detection task, and that the content attribute is more important than the subject attribute, which is more important than the participants attribute.

Coeff.          Independent                   All 1   All 2
Date            1.00   0.00   0.00   0.00     0.05    0.05
Content         0.00   1.00   0.00   0.00     0.40    0.60
Participants    0.00   0.00   1.00   0.00     0.25    0.10
Subject         0.00   0.00   0.00   1.00     0.30    0.25
Score A         0.93   0.93   0.91   0.92     0.96    0.92
Score B         0.71   0.82   0.72   0.79     0.85    0.90

Table 4: Conversation detection results using different coefficient weights in Equation 7 (mailboxes A and B respectively).

Figure 5 summarizes the evaluation scores of the different detection methods based on email threads, subject threads, sub-conversations and automatically detected conversations.

4.3.6 Coefficients Weights Selection
The coefficient weights in the All columns in Tables 3 and 4 were selected empirically after applying various weight combinations. However, as the results show, the weights in column All 1 in Table 4 are better for mailbox A than for mailbox B. The weights in column All 2 are better for mailbox B, and although mailbox A’s score is still higher, it is lower than in column All 1. The weights somewhat reflect the mailbox owner’s opinion on what constitutes a conversation. Therefore we believe that although the algorithm can start with the weights given in Table 4, it should learn and adjust the proper weights for a mailbox over time.

One way to learn the appropriate weights for a given mailbox is to use machine learning techniques. Lewis and Gale (1994) describe an algorithm for sequential sampling during machine learning of statistical classifiers. They outline a general algorithm for teaching a classifier:

1. Create an initial classifier.
2. While the teacher is willing to label examples:
(a) Apply the current classifier to each unlabeled example.
(b) Find the b examples for which the classifier is least certain of class membership.
(c) Have the teacher label the subsample of b examples.
(d) Train a new classifier on all labeled examples.

By adjusting the described algorithm to the conversation detection task, we can improve our conversation detection algorithm over time:

1. Partition the mailbox into conversations using initial weights as described in Table 4.
2. Adjust the weights upon user selection of conversations:
(a) If the algorithm is uncertain which conversation a message belongs to, display a list of candidate conversations and have the user select one.
(b) When the user selects a conversation, add this message and conversation to the set of labeled examples and train the algorithm to adjust its weights.
3. Adjust the weights upon user modifications to conversations:
(a) Allow the user to move messages between conversations.
(b) When the user moves a message between conversations, add the message and both conversations to the set of labeled examples and train the algorithm to adjust its weights.

Learning the proper weights for a mailbox is beyond the scope of this work. Moreover, we believe that such an algorithm should monitor the user’s behavior over time, and adjust itself constantly, in order to produce high-quality conversations.

Figure 5: A summary of the evaluation scores of the detection methods based on email threads, subject threads, automatic sub-conversation detection, and automatically detected conversations in both mailboxes.
4.4 Discussion
The results shown in Table 3 demonstrate the importance of the date attribute for detecting sub-conversations within subject-threads. However, the date attribute alone cannot detect a drift in the topic of a conversation, or a change in participants. According to Kalman and Rafaeli (2005), in corporate emails most messages are answered less than 5 days after they were sent. The results in Table 3 (the run with the date attribute alone) somewhat strengthen their findings: this run suggests that in most cases, when the date difference exceeds the threshold, a new conversation emerges.

Table 4 shows that the content and subject attributes of messages are the most significant features for conversation detection. The significance of the date distance between sub-conversations is negligible (for both mailboxes) and may be discarded altogether for this specific task. It follows that two different sub-conversations extracted from the same subject thread might be re-grouped to construct a conversation because of their content similarity, despite the time distance between their messages, which is probably what caused them to be split into two sub-conversations in the first place. In addition, Table 4 shows that both mailbox owners consider the content attribute more important than the subject attribute, which in turn is more important than the participants attribute. This suggests that although the date attribute is very important for detecting sub-conversations, the textual portion of a message (content and subject attributes) is more important for detecting conversations.

Figure 5 compares email threads, subject threads, sub-conversations, and conversations to the manually detected conversations in two mailboxes. It shows very clearly the drawbacks of email threads for this task. It also shows that subject threads encapsulate most conversations in the mailbox, but that they are inferior to the automatically detected conversations. Sub-conversations alone, as expected, are inferior to full conversations, as they are only part of the process. However, Figure 5 shows that they are also inferior to subject threads. The reason is that some of the sub-conversations represent partial conversations, which are scored lower than the subject threads containing them, while the full conversations that include those partial sub-conversations are only detected by the final stage of the algorithm.

In conclusion, our experiments (and their corresponding results) show that for detecting conversations in email messages, one should consider all message attributes. Using all the attributes for measuring similarity gave the highest scores in both the sub-conversation and the conversation detection experiments. In addition, we have shown that breaking subject threads into sub-conversations, and then grouping similar sub-conversations together, is a superior detection process to alternative approaches.
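The date-only run discussed above can be illustrated with a short sketch that splits a single subject thread wherever the gap between consecutive messages exceeds a threshold. This is a simplification for illustration, not the full similarity-based algorithm: it looks at dates only, the function name is hypothetical, and the default of 10 days (with a gap of exactly 10 days kept in the same group) is an assumption.

```python
from datetime import datetime, timedelta


def split_by_date_gap(timestamps, max_gap_days=10):
    """Split one subject thread into sub-conversations on long silences.

    timestamps: datetimes of the messages in a single subject thread.
    Returns a list of groups, each a chronologically ordered list.
    """
    groups = []
    for ts in sorted(timestamps):
        # Stay in the current group while the silence is short enough
        # (assumption: a gap of exactly max_gap_days does not split).
        if groups and ts - groups[-1][-1] <= timedelta(days=max_gap_days):
            groups[-1].append(ts)
        else:
            groups.append([ts])
    return groups


thread = [
    datetime(2007, 1, 1), datetime(2007, 1, 2),    # first exchange
    datetime(2007, 1, 20), datetime(2007, 1, 22),  # resumes after 18 days
]
sub_conversations = split_by_date_gap(thread)
```

With these made-up dates the thread is split into two sub-conversations of two messages each, since the 18-day silence exceeds the threshold.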
5 Conversations Analysis
This work proposes a new definition for a conversation in an email mailbox, including a conversation detection process that is different from traditional approaches that treat email threads as conversations. Previous works (Kleinberg, J., 2002; Kalman, Y. M. and Rafaeli, S., 2005) studied the behavior of messages in traditional email threads from several aspects, such as average response time. In this section we study the difference in behavior between threads and conversations from several aspects:
• What is the average conversation size? What is the average subject-thread size?
• Participants distribution: does the conversation size affect the number of participants? How are the active/passive participants distributed?
• Following Kalman and Rafaeli’s work (2005), what is the average response time in a conversation?
• Are there bursts in conversations (influenced by Kleinberg’s (2002) work)?

It is important to note that each aspect may behave differently according to the conversation size. We assume that single-message conversations behave differently from very large conversations. In addition, we assume that conversations with 2 or 3 messages behave the same. This led us to divide the conversations into buckets according to their size:
• Single-message conversations - these are special cases of conversations. They include messages that were sent but were never answered. It is interesting to observe what type of messages are included in such conversations. We believe that such conversations may include spam messages and general notices.
• Small conversations (2-5) - these are conversations of special interest. Due to their low number of messages, it is interesting to observe how many participants appear in such conversations and what the average response time is.
• The rest of the buckets are Medium, Large and Very Large conversations (6-10, 11-15 and ≥ 16, respectively) - these are special conversations. We expect their frequency to be low and expect them to include many more participants than small ones. These are the conversations in which we are interested in finding bursts of activity and in measuring the average response time.
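The bucket boundaries listed above translate directly into a small helper. The function name and the bucket labels below are hypothetical and chosen only for illustration; the conversation sizes in the example are made up.

```python
from collections import Counter


def size_bucket(num_messages):
    """Map a conversation size to its bucket (1, 2-5, 6-10, 11-15, >= 16)."""
    if num_messages < 1:
        raise ValueError("a conversation holds at least one message")
    if num_messages == 1:
        return "single"
    if num_messages <= 5:
        return "small"
    if num_messages <= 10:
        return "medium"
    if num_messages <= 15:
        return "large"
    return "very large"


# Made-up conversation sizes, counted per bucket.
sizes = [1, 1, 3, 5, 6, 12, 16, 40]
conversations_per_bucket = Counter(size_bucket(n) for n in sizes)
```

Counting conversations per bucket in this way, or summing message counts instead, gives the kind of distribution that is plotted by bucket size.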
5.1 Data Set
To carry out the different experiments we applied the conversation detection algorithm to three mailboxes. Table 5 shows that there are 5862 messages in 3019 subject-threads and 2643 conversations. In addition, the average subject-thread contains 6.66 messages, while the average conversation contains 3.32.
5.2 Subject-threads and Conversations Size
Figure 6(a) shows that most subject-threads (2100) and conversations (1700) are single-message ones. The messages in them, whether sent or received, were never replied to, which means that the same number of messages in a mailbox never receive a reply. Figure 6(b) shows that there are approximately 2100 messages in small subject-threads and conversations (of size 2-5) and that a similar number of messages belong to larger subject-threads and conversations (size > 5). This suggests that a user’s activity is almost equally dispersed among single-message, small and larger conversations.
Num Messages                    5862
Num Subject-threads             3019
Avg. Subject-threads length     6.66
Num Conversations               2643
Avg. Conversations length       3.32
Table 5: Statistics of the mailboxes used for conversations analysis.
(a) Distribution of subject-threads and conversations in different bucket sizes.
(b) Distribution of messages in subject-threads and conversations, by bucket sizes. For example, 2100 messages belong to subject-threads (1700 for conversations) of size 1.
Figure 6: Subject-threads and Conversations distributions of size and messages.
5.3 Participants Distribution
Figure 7 shows the distribution of participants in conversations by their size. It distinguishes between active participants and passive-only ones (a passive-only participant is one that appears in the Cc attribute of all the messages in the conversation). It is interesting to note that the number of passive-only participants is not affected by the size of the conversation, and on average there are 1.3 passive-only participants in a conversation. A typical email client handles a reply command by moving all the participants to the Cc attribute, putting the sender of the new message in the From attribute and the sender of the replied message in the To attribute. The fact that the number of passive-only participants is so low (about 20% of the active ones) means that most of the participants compose at least one of the messages in a conversation.

Moreover, Figure 7 shows that the number of active participants is higher in smaller conversations. This may suggest that when one composes a message, one writes most or all of the participants in the To attribute. As a result, in small conversations some participants are counted as active although they never composed a message, so active participants are over-represented compared to large conversations. It may also suggest that the participants do not change during the conversation, which would explain why the number of participants in small and large conversations is almost the same, as Figure 7 shows.
Figure 7: Participants distribution in conversations, plotted by the size of the conversation.
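As a rough illustration of the active/passive-only distinction, the sketch below classifies the participants of one conversation. It assumes a hypothetical message representation (a dict with "from", "to" and "cc" keys) and interprets a passive-only participant as one who never appears in the From or To attribute of any message; both choices are assumptions made for the example.

```python
def split_participants(messages):
    """Return (active, passive_only) participant sets for one conversation.

    messages: iterable of dicts with a "from" address and "to"/"cc" lists
    (a hypothetical representation used only for this sketch).
    """
    everyone = set()
    seen_outside_cc = set()
    for msg in messages:
        sender_and_to = {msg["from"], *msg["to"]}
        everyone |= sender_and_to | set(msg["cc"])
        seen_outside_cc |= sender_and_to
    # Passive-only: appears in the conversation, but only ever in Cc.
    passive_only = everyone - seen_outside_cc
    active = everyone - passive_only
    return active, passive_only


conversation = [
    {"from": "alice", "to": ["bob"], "cc": ["carol"]},
    {"from": "bob", "to": ["alice"], "cc": ["carol"]},
]
active, passive_only = split_participants(conversation)
```

In this made-up conversation alice and bob are active while carol is passive-only. Note that under this interpretation a recipient listed in To counts as active even without composing a message.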
5.4 Response Time
Figure 8 shows the average response times in conversations. Single-message conversations are excluded here because their messages are never replied to. The results in Figure 8 are surprising, since Kalman and Rafaeli (2005) note that although some messages are answered several days after they were sent, most of them are answered in a matter of hours. However, as Figure 8 shows, the average response time in small conversations is 5.6 days. A deeper look into these conversations revealed that most of their messages are indeed answered on the same or the next day, but that in some conversations messages were answered after more than 15 days. For larger conversations, it can be seen that on average a message is answered after more than a day but less than three days. Here, too, there are messages that are answered after several days. However, because of the size of the conversations, the effect of those messages on the average response time is smaller.
Figure 8: Response times in conversations, plotted by their size.
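The effect of a few late replies on the average in a small conversation can be seen in a short worked example. The sketch assumes, as a simplification, that each message responds to the chronologically previous one; the function name and the timestamps are made up for illustration.

```python
from datetime import datetime
from statistics import mean, median


def response_gaps_in_days(timestamps):
    """Gaps, in days, between consecutive messages of one conversation.

    Assumption: every message is treated as a reply to the message that
    precedes it chronologically.
    """
    ordered = sorted(timestamps)
    return [
        (later - earlier).total_seconds() / 86400
        for earlier, later in zip(ordered, ordered[1:])
    ]


small_conversation = [
    datetime(2007, 3, 1, 9, 0),
    datetime(2007, 3, 1, 15, 0),  # answered after 6 hours
    datetime(2007, 3, 2, 9, 0),   # answered after 18 hours
    datetime(2007, 3, 22, 9, 0),  # answered after 20 days
]
gaps = response_gaps_in_days(small_conversation)
average_gap = mean(gaps)
typical_gap = median(gaps)
```

Here two of the three replies arrive within a day, yet the single 20-day gap pulls the mean up to 7.0 days, while the median stays at 0.75 days. In a longer conversation the same late reply would be averaged over many more gaps and would weigh less.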
5.5 Bursts
Figure 9 shows the rate of messages per day, i.e., how many messages were sent on each day of the conversation. We partitioned the conversations according to their size and plotted the first 20 days, although there were conversations that spanned more than 100 days (the number of conversations that spanned more than 20 days is negligible). Figure 9 shows that a large number of messages are sent on the first day. Depending on the conversation size, the percentage of messages that are sent on the first day varies from 40-50% (for small conversations) to 25% (for large conversations). The rate of messages sent on the second day drops by 30-50% in all conversations. This may suggest that on the first day of the conversation, many participants reply to the conversation. As can be seen in the figure, the numbers drop to fewer than 0.5 messages per day as the conversation continues. The results in Figure 9 also support Kleinberg’s (2002) claim of the existence of bursts in message streams, as we have observed bursts in large conversations after 10, 15 and 19 days.
Figure 9: Message bursts in conversations by their size.
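A per-day message rate of this kind could be computed along the following lines. The sketch assumes that day 0 is the 24-hour window starting at the first message of the conversation (rather than a calendar day), and the 20-day horizon mirrors the plotted range; the function name and the sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def messages_per_day(timestamps, horizon_days=20):
    """Count messages on each day of a conversation, starting at day 0.

    Day k covers the k-th 24-hour window after the first message
    (an assumption made for this sketch).
    """
    if not timestamps:
        return []
    start = min(timestamps)
    counts = Counter((ts - start).days for ts in timestamps)
    return [counts.get(day, 0) for day in range(horizon_days)]


start = datetime(2007, 5, 1, 8, 0)
conversation = [
    start, start + timedelta(hours=2), start + timedelta(hours=5),  # day 0
    start + timedelta(days=1, hours=1),                             # day 1
    start + timedelta(days=10), start + timedelta(days=10, hours=3) # day 10
]
rate = messages_per_day(conversation)
```

In this made-up conversation half of the messages fall on day 0, the rate drops on day 1, and a second, smaller burst appears on day 10. Averaging such per-day counts over all conversations in a size bucket would give one curve per bucket.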
5.6 Discussion
Figure 6 suggests that a user’s activity is almost equally dispersed among single-message, small and larger conversations. It also shows that one third of the messages in a mailbox are never replied to. Such messages may be spam or general notices. This information may be used for automatic foldering of messages, as discussed in (Klimt, B. and Yang, Y., 2004). It also shows that the number of large conversations is very low.

A surprising and interesting result is that the number of participants is not affected by the conversation size. The average number of participants is 6.6 (standard deviation is 0.79). Figure 7 shows that the number of participants in single-message conversations is higher than in larger conversations. This further supports our assumption that single-message conversations are general notices or spam messages. They are usually sent to a large number of participants and one almost never replies to such messages. In addition, the number of passive-only participants is low, even in large conversations (approximately 20% of all participants). This suggests that most of the participants play an active role in a conversation. Another result that may be deduced from Figure 7 is that only a few people are involved in email conversations, as indicated by the low number of participants in conversations with sixteen or more messages (7.0). The low number of passive-only participants may suggest that users include in email conversations (at least in corporate ones) only the people with whom they actually wish to discuss a matter, rather than carbon copying many unrelated people.

The results in Figure 8 show that on average, a message is answered two days or more after submission. As mentioned before, a deeper look into these conversations revealed that most messages are indeed answered on the same or the next day, but that some messages are answered after several days. In their work, Kalman and Rafaeli (2005) studied the response time in email threads. As discussed in this work, email threads are different from conversations, especially in the sense that a conversation may span several email threads. Specifically, a conversation can include a message that was sent as a “response” to, or continuation of, a message from a different thread, several days after the last message was sent. This may explain the relatively longer response time we have discovered.

Figure 9 shows that large email conversations are characterized by bursts. In all conversations a large number of messages are sent on the first day, but in large conversations there are bursts of activity every several days. This behavior is somewhat similar to kickoff meetings. At first, everyone would like to share his/her ideas with the rest. As time goes by, the ideas converge and the amount of activity is reduced. However, from time to time a newly emerging idea causes a burst in the discussion.
6 Summary
This work explored a novel approach for detecting conversations in a mailbox. We defined a conversation as an exchange of messages among the “same” group of people on the “same” topic, and discussed the distinction between such conversations and traditional email threads that are usually used for conversation detection. We showed that subject threads are much more appropriate for describing conversations than traditional email threads. Our results in this area can be used by email clients to group messages together and remove the structural information that is embedded in messages.

In addition we discussed the existence of sub-conversations within subject threads. We showed that most sub-conversations can be detected by including messages from the same subject thread that were sent/received in close date proximity (less than 10 days). Our experiments show that both mailbox owners consider the date attribute of messages the most important one for the sub-conversation detection task.

We also showed the existence of a different type of conversation, one that is spread over several subject threads and groups sub-conversations together. These conversations cannot be detected by any traditional conversation detection technique that relies on structural information embedded in messages. Our experiments show the superiority of our method for detecting such conversations over traditional email threads and subject threads. In addition, we show that both mailbox owners consider the textual portion of the messages (content and subject attributes) more important for detecting conversations.

These results are not surprising given the nature of the conversations detected. Sub-conversations are detected within subject-threads. It is reasonable to assume that most subject-threads carry most of the participants throughout the thread, and therefore the participants attribute alone cannot detect sub-conversations. The only major difference between sub-conversations is their date attribute. Our results show that when there is a long time break in the subject-thread (more than 10 days), a new sub-conversation starts. Changes in the participants or the topic do not affect the detection algorithm much. Conversations, on the other hand, span several subject-threads, and therefore one cannot disregard the other attributes of the messages. Our results show that the detection quality improves by 10% when combining all of the message attributes. However, the date attribute of the messages is almost insignificant for the conversation detection task.

In this work we evaluated our algorithm against two manually marked mailboxes. Each mailbox was partitioned into sub-conversations and conversations by its owner. An interesting topic for future work may be to ask different people to partition the same mailbox into sub-conversations and conversations, and measure the overlap between their partitions. We expect that the partitions will agree on most conversations, and disagree on only a few of them. However, even if the results are different, in our opinion it is the mailbox owner’s partition that should be weighed the highest, as we have shown in Table 4 that different coefficient weights should be applied for each mailbox owner.

Additionally, we have studied the new type of conversations in several mailboxes. Our results show that different mailboxes are equally dispersed between single-message, small and large conversations. Therefore, approximately one third of the messages in a mailbox are never replied to. This information can be used for further research on the nature of such messages and on whether they can be used for spam detection and automatic foldering of messages.

We have shown the existence of bursts in large conversations. Initially, all conversations start with a large burst, with most of the messages sent on the first day, but in large conversations we observed bursts of activity every several days. This may be explained in several ways:
• Large conversations start with a few topics, of which only a few “survive” over time to continue the conversation.
• Initially, many participants would like to share their thoughts, but over time only the “true” active participants, those that have the most meaningful arguments or are the most experienced in the field, continue the conversation.

We discovered that the average response time is different from that reported in (Kalman, Y. M. and Rafaeli, S., 2005). Combining all of our results, we conclude that response time in conversations cannot be measured simply by averaging the response times of all the messages in the conversation; rather, it should be measured within periods of time in the conversation. As described previously, all conversations begin with most of the messages being sent on the first day. That means that the average response time for most of the messages is less than a day, as reported by Kalman and Rafaeli (2005). However, large conversations are characterized by bursts of activity. From the end user’s perspective, the response time within a burst is still less than a day.

We believe that the type of conversations described in this work can benefit applications that rely on email threads from the end user’s perspective. For example, the email summarization methods described in (Lam, D. et al., 2002; Rambow, O. et al., 2004; Wan, S. and McKeown, K., 2004; Shrestha, L. and McKeown, K., 2004; Carenini, G., Ng, T. R., and Zhou, X., 2007) may produce better summaries for the end user when coherent conversations are used rather than email threads. Similarly, the visualization systems described in (Venolia, G. D. and Neustaedter, C., 2003; Kerr, B., 2003; Kerr, B. and Wilcox, E., 2004; Golder, S., Viégas, F. B., and Donath, J., 2006) might do better to display conversations rather than email threads to the end user.

The results reported in (Venolia, G. D. et al., 2001) of using the threaded email client prototype show that people will benefit if allowed to work with conversations. Their lab study shows that email triage is an easier task when users are able to look at the local context of a message, rather than scan through all messages to find related ones. Most modern email clients provide users with a threaded display of email messages, which somewhat strengthens their conclusions. We believe that users will benefit even more when provided with a view that displays coherent conversations rather than email threads.
References

Aaron, H. and Jen-Yuan, Y. 2006. Email thread reassembly using similarity matching. In Proceedings of the Third Conference on Email and Anti-Spam (CEAS).
Bälter, O. and Sidner, C. 2002. Bifrost Inbox Organizer: Giving Users Control over the Inbox. In Proceedings of the second Nordic conference on Human-computer interaction, pages 111–118, (Aarhus, Denmark).
Broder, A., Eiron, N., Fontoura, M., Herscovici, M., Lempel, R., McPherson, J., Shekita, E., and Qi, R. 2005. Efficient Indexing of Shared Content in Information Retrieval Systems. In Proceedings of the 31st VLDB Conference, (Trondheim, Norway).
Carenini, G., Ng, T. R., and Zhou, X. 2007. Summarizing Email Conversations with Clue Words. In Proceedings of the 16th international conference on World Wide Web WWW ’07, pages 91–100, (Banff, Alberta, Canada).
Carvalho, V. and Cohen, W. 2004. Learning to Extract Signature and Reply Lines from Email. First Conference on Email and Anti-Spam (CEAS).
Dredze, M., Lau, T., and Kushmerick, N. 2006. Automatically Classifying Emails into Activities. In Proceedings of the 11th international conference on Intelligent user interfaces, pages 70–77, (Sydney, Australia).
Ducheneaut, N. and Watts, L. A. 2005. In Search of Coherence: a Review of E-Mail Research. Human-Computer Interaction, 20:11–48.
Gabor, C., Keno, A., and Roger, W. 2007. BuzzTrack: Topic Detection and Tracking in Email. In Proceedings of the 12th international conference on Intelligent user interfaces IUI ’07. ACM Press.
Golder, S., Viégas, F. B., and Donath, J. 2006. Visualizing Email Content: Portraying Relationships from Conversational Histories. In Proceedings of the SIGCHI conference on Human Factors in computing systems, pages 979–988, (Montréal, Québec, Canada).
Kalman, Y. M. and Rafaeli, S. 2005. Email Chronemics: Unobtrusive Profiling of Response Times. In Proceedings of the 38th Annual Hawaii International Conference on System Sciences (HICSS’05), volume 04, page 108.2.
Kerr, B. 2003. THREAD ARCS: An Email Thread Visualization. In Proceedings of IEEE InfoVis, pages 211–218, (Seattle, WA).
Kerr, B. and Wilcox, E. 2004. Designing Remail: Reinventing the Email Client Through Innovation and Integration. In CHI ’04 extended abstracts on Human factors in computing systems, pages 837–852, (Vienna, Austria).
Kleinberg, J. 2002. Bursty and Hierarchical Structure in Streams. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Klimt, B. and Yang, Y. 2004. Introducing the Enron Corpus. In Proceedings of the First Conference on Email and Anti-Spam (CEAS), (Mountain View, CA).
Lam, D., Rohall, S. L., Schmandt, C., and Stern, M. K. 2002. Exploiting e-mail structure to improve summarization. In ACM 2002 Conference on Computer Supported Cooperative Work (CSCW2002), (New Orleans, LA).
Lewis, D. D. and Gale, W. A. 1994. A Sequential Algorithm for Training Text Classifiers. In Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval, pages 3–12, (Dublin, Ireland).
Lewis, D. D. and Knowles, K. A. 1997. Threading Electronic Mail: a preliminary study. In Information Processing and Management, 33(2):209–217.
Rambow, O., Shrestha, L., Chen, J., and Lauridsen, C. 2004. Summarizing Email Threads. In Proceedings of HLT-NAACL 2004: Short Papers.
Rudy, I. A. 1996. A Critical Review of Research on Electronic Mail. European Journal of Information Systems, 4:198–213.
Shen, D., Yang, Q., Sun, J. T., and Chen, Z. 2006. Thread Detection in Dynamic Text Message Streams. In Proceedings of the 29th annual international ACM SIGIR conference, pages 35–42, (Seattle, Washington, USA).
Shrestha, L. and McKeown, K. 2004. Detection of Question-Answer Pairs in Email Conversations. In Proceedings of the 20th international conference on Computational Linguistics, (Geneva, Switzerland).
The Internet Society. 2001. RFC 2822 Internet Message Format. http://www.faqs.org/rfcs/rfc2822.html.
Venolia, G. D., Dabbish, L., Cadiz, J. J., and Anoop, G. 2001. Supporting Email Workflow. Microsoft Research, Report MSR-TR-2001-88, http://research.microsoft.com/scripts/pubs/view.asp?TR ID=MSRTR-2001-88.
Venolia, G. D. and Neustaedter, C. 2003. Understanding Sequence and Reply Relationships within Email Conversations: A Mixed-Model Visualization. In Proceedings of the SIGCHI conference on Human factors in computing systems, pages 361–368, (Ft. Lauderdale, Florida, USA).
Wan, S. and McKeown, K. 2004. Generating Overview Summaries of Ongoing Email Thread Discussions. In Proceedings of the 20th international conference on Computational Linguistics, (Geneva, Switzerland).
Whittaker, S., Bellotti, V., and Gwizdka, J. 2006. Email in Personal Information Management. In Communications of the ACM, volume 49, pages 68–72.
Whittaker, S. and Sidner, C. 1996. Email Overload: Exploring Personal Information Management of Email. In Proceedings of the SIGCHI conference on Human factors in computing systems: common ground, pages 276–283, (Vancouver, British Columbia, Canada).
Wikipedia. 2007a. Conversation - Wikipedia. http://en.wikipedia.org/wiki/Conversation.
Wikipedia. 2007b. Threaded discussion - Wikipedia. http://en.wikipedia.org/wiki/Threaded discussion.
Conversation Detection in Email Systems
Shai Erera

Abstract

Over the years, email has become an important tool through which people conduct conversations. A typical mailbox contains hundreds of conversations. Detecting these conversations has been identified as an important task that can help users see the broader context of the messages they read, group related messages together, reduce the number of messages they see in their inbox, and perform actions on groups of messages, such as reading or deleting them.

In this work we present an algorithm for detecting coherent conversations in an email system. The algorithm uses a similarity function between messages that weighs general attributes of the messages, such as "subject", "participants", "date" and "message body". The result of the function is a number that determines whether a message belongs to a given conversation. The algorithm starts by detecting conversations according to the "subject" (messages with the same "subject" are grouped together). It then splits each conversation detected by "subject" into "sub-conversations" (using the similarity function), and finally groups "sub-conversations" into "conversations" (using the similarity function).

In order to evaluate the quality of the conversations detected by the algorithm, we manually marked conversations in two different email collections. The results show the superiority of the conversations detected by the algorithm over the standard conversations detected in email systems today.

In addition, we study how these conversations behave from several aspects: (1) how many people participate in a conversation; (2) whether the size of the conversation (number of messages) affects the number of participants; (3) what the average response time in conversations is; (4) how many messages are sent on each day of a conversation. The results show that a mailbox can be divided into three groups: single-message conversations, conversations with few messages (2-5), and conversations with more messages. In addition, we show that the number of participants is not affected by the size of the conversation.
Conversation Detection in Email Systems
By: Shai Erera
Supervised By: Dr. Shuly Wintner, Dr. David Carmel (IBM Labs, Haifa University)
THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE MASTER DEGREE
University of Haifa, Faculty of Social Science, Department of Computer Science
November, 2007