Archiving and Indexing Chat Utterances - CiteSeerX

Archiving and Indexing Chat Utterances

Jason Bengel, Susan Gauch, Rajan Vijayaraghavan, Solomon Nagelli
{jasonb, sgauch, rajan, manasvi}@ittc.ku.edu
Electrical Engineering and Computer Science and Information Technology and Telecommunications Center
University of Kansas

Abstract

Chat networks and instant messaging are growing Internet applications. They are rapidly replacing the telephone as a means of communication for teenagers, and corporations are embracing the use of lower-cost electronic meetings. It is difficult to track who is talking to whom and what they are talking about. Given the recent increase of interest in security issues at the personal, corporate, and national levels, we discuss a framework for archiving and indexing chat data. We also present ChatTrack, a chat archiving system that can be deployed as a client or server application. We propose future work in automated user profiling and topic identification based on chat discussions.

Keywords Instant messaging, chat, incremental indexing, information retrieval

Motivation

Instant messaging and chat rooms have expanded the ways in which we communicate, and with them the number of messages we must manage daily. Online chatting offers an informal and unobtrusive way of reaching friends and business contacts, both one-on-one and in large electronic town hall-style meeting rooms. Once popular mainly with teenagers and college students, instant messengers (IM) have recently seen corporate usage soar [1]. Unlike e-mail, chat discussions are not easily controlled, and there is no way to review who said what to whom after the chat session ends.

Corporations and home users have many reasons to archive chat sessions. Electronic meetings require at least one participant to manually record the minutes. Virtual meetings might take place over several sessions, spanning several days or months, so employees spend valuable time keeping track of the many decisions generated along the way. An automated system of minute-taking would not only spare employees this tedious task, but also provide indisputable documentation of team discussions and decisions. It follows that businesses employing online chat as a form of electronic meeting will want an archival record of the discussion and resolutions from each meeting.

Archiving would also be ideal for the protection of children participating in chat sessions. Predators lurking in chat rooms may adopt alternate personas to trick children into sharing personal information [2]. Few safe chat rooms, in which a human monitor attempts to ensure that all conversations, both public and private, remain appropriate, currently exist. This undertaking is expensive and is not performed by any of the popular chat services. Archiving will allow parents who wish to review their children’s chat sessions to periodically verify that they are observing safe chat procedures. Additionally, chat room administrators, particularly those with chat rooms aimed at children, may monitor the content of discussions to confirm that it remains appropriate for their subscribers.

With the increased emphasis on security issues, government agencies may also want to monitor chat rooms to identify suspicious conversations. For example, USA Today reports that U.S. officials are monitoring Internet chat rooms where participants appear to be planning terrorist attacks against the United States [3].

Chat systems can be categorized as either client- or server-based architectures. Client-based systems are typically point-to-point and generally use an intermediary directory service that keeps track of who is available to chat. Desktop-based instant messengers, such as Microsoft MSN Instant Messenger and Yahoo Instant Messenger, are typical examples. Server-based systems route all chat traffic through a central server before it reaches end users. Users connect to the server by means of a client application such as a Web browser. Most servers handle thousands of users at any given moment. Examples of server-based chat systems are the Microsoft MSN Chat rooms (http://chat.msn.com) and Yahoo Chat (http://chat.yahoo.com) as well as the Internet Relay Chat (IRC) servers.

Related Work

The ability to record online conversations is provided by a related product, IamBigBrother [4], which offers some of the features that parents may desire. Analysis is done as a review rather than interactively as chat sessions proceed. In addition, only questionable exchanges are stored. Users must customize the software themselves, providing keywords to indicate which messages to block and how. There is no search ability and only limited capacity to review the archive.

Some instant messenger products can record chat for playback, but lack facilities for searching through text. For example, ICQ [5] stores all IM traffic, but its search facility is limited to keyword and username search only. The ICQ archive is also not indexed. Various plug-ins for IM products support text recording, but do not provide a query capability.

Approach

The goals of this project are twofold. First, we wish to provide a real-time archiving system for IRC servers and Instant Messenger clients that captures and stores chat messages so they can be searched by keyword and by username. Second, we wish to provide the ability to classify archived dialogues by topic, and allow topic-based retrieval and replay of chat sessions.

We are developing ChatTrack, a system that archives and filters chat on the client or server machine. ChatTrack comprises two modules: SafeChat, which is under development, and ChatLog, which we discuss here.

ChatLog stores user messages to secondary storage. These chat messages are essentially text streams that are quite small in comparison with images, video, or even documents. Even if a user chatted for five hours a day and received 40 words per minute, they would require only ~120 KB of storage per day (12,000 words, which implies roughly 10 bytes per word). An entire year’s worth of chat would use about 44 MB of storage, achievable with today’s storage technologies and prices. The original unfiltered messages are stored along with any record of substitutions or modifications provided by SafeChat. These messages are retrievable by date and time, by username, by keywords, and by combinations of the above.
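The storage estimate above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the five hours per day and 40 words per minute come from the text, while the figure of 10 bytes per word is an assumption chosen because it reproduces the ~120 KB per day stated above. The function and constant names are hypothetical.

```python
# Back-of-the-envelope storage estimate for archived chat text.
HOURS_PER_DAY = 5        # from the text: five hours of chat per day
WORDS_PER_MINUTE = 40    # from the text: 40 words received per minute
BYTES_PER_WORD = 10      # assumption: value implied by the ~120 KB/day figure
DAYS_PER_YEAR = 365


def daily_bytes(hours=HOURS_PER_DAY, wpm=WORDS_PER_MINUTE,
                bytes_per_word=BYTES_PER_WORD):
    """Bytes of chat text received in one day."""
    return hours * 60 * wpm * bytes_per_word


def yearly_bytes(**kwargs):
    """Bytes of chat text received in one year at the same daily rate."""
    return daily_bytes(**kwargs) * DAYS_PER_YEAR


if __name__ == "__main__":
    print(f"per day:  {daily_bytes() / 1_000:.0f} KB")
    print(f"per year: {yearly_bytes() / 1_000_000:.1f} MB")
```

Under these assumptions the daily total is 120,000 bytes and the yearly total is 43,800,000 bytes, which rounds to the 44 MB quoted above.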

[Figure 1 – Server-based Chat System. Components shown: Chat Server, Message Recorder, Chat Archive, Indexer, Indices, Retrieval System.]

[Figure 2 – Client-based Chat System. Components shown: Instant Messenger, Message Recorder, Chat Archive, Indexer, Indices, Retrieval System.]

As users chat, either privately amongst themselves or in a public chat room where there can be multiple recipients, utterances are immediately stored. Each utterance produces two files, one storing the username and the other the message. The files are placed in a directory structure encoding the channel identifier and date, and the filenames encode the utterance transmission time.

Figure 1 depicts the server-based system. Server code must be modified to support the recording of messages. The channel identifier consists of either the public chat room in which the utterance was created or the username to which the message is being sent.

The client-based system is depicted in Figure 2. Here, the IM application remains untouched. Through the use of Microsoft® Windows® programming interfaces, the text of the messages is retrieved and stored as described above. The channel identifier indicates the names of the users who are chatting. Whenever possible, the usernames consist of the unique user identifier, such as an e-mail address, to avoid confusion if a user changes their nickname during a chat session.

Chat transcription takes place while users chat, which keeps the session data current. An incremental indexer runs in the background of the client or server system at a predefined interval, typically every one to two minutes; if no new chat data exists, the indexer idles until new utterances arrive. Typical indexers create an index from scratch, requiring more time as data size increases. Incremental indexing avoids this cost by indexing only the chat utterances that have appeared since the last indexing pass. There is a penalty for retrieval, but a re-indexing process could be performed during scheduled maintenance to improve index speed.

Administrators and parents make use of a retrieval application to perform queries on the indexed chat data. Server-based systems would most likely use a web-based interface so that the chat data can remain on a secure server. Figure 3 depicts an example of the web-based query interface.
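The storage layout and incremental indexing described above can be sketched as follows. This is a minimal illustration, not the ChatLog implementation: the file extensions (`.user`, `.msg`), the date and time formats, the in-memory inverted index, and all function and class names are assumptions made for the example. Only the general scheme (two files per utterance, directories keyed by channel and date, filenames keyed by transmission time, and indexing only what is new since the last pass) comes from the text.

```python
from collections import defaultdict
from datetime import datetime
from pathlib import Path


def store_utterance(root, channel, username, message, sent_at):
    """Store one utterance as two files under <root>/<channel>/<date>/.

    The filename stem encodes the transmission time (hypothetical format).
    Two utterances with the same microsecond timestamp in one channel
    would collide; a real system would need a tie-breaker.
    """
    day_dir = Path(root) / channel / sent_at.strftime("%Y-%m-%d")
    day_dir.mkdir(parents=True, exist_ok=True)
    stem = sent_at.strftime("%H%M%S%f")
    (day_dir / f"{stem}.user").write_text(username, encoding="utf-8")
    msg_path = day_dir / f"{stem}.msg"
    msg_path.write_text(message, encoding="utf-8")
    return msg_path


class IncrementalIndexer:
    """Keyword index that only processes files added since the last pass."""

    def __init__(self, root):
        self.root = Path(root)
        self.indexed = set()               # message files already indexed
        self.postings = defaultdict(set)   # keyword -> message file paths

    def run_once(self):
        """Index message files not seen before; return how many were new."""
        new_files = [p for p in sorted(self.root.glob("*/*/*.msg"))
                     if p not in self.indexed]
        for path in new_files:
            for word in path.read_text(encoding="utf-8").lower().split():
                self.postings[word].add(path)
            self.indexed.add(path)
        return len(new_files)

    def search(self, keyword):
        """Return the message files containing the keyword, in path order."""
        return sorted(self.postings.get(keyword.lower(), set()))
```

A background process could call `run_once()` at the chosen interval (for example, every one to two minutes); a pass that finds no new files returns 0 and leaves the index untouched.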

Achievements

This system provides an effective way of recording chat utterances in near real-time and of retrieving them quickly by keyword, recipient/sender, and time frame, as well as replaying chat sessions. Session replay offers the ability to view entire conversations between individuals and groups.

The framework is adaptable to both client- and server-based chat technologies, meaning that both home users and online chat providers could benefit. Client-side users would best be served by “plug-in” technologies, such as a program that can integrate seamlessly with Microsoft MSN Messenger. Server-side systems would require a slight modification of code to interact with the chat storage library that we have developed.

A separate interface for viewing, searching, and retrieving chat sessions is required. Client-based systems would use a locally installed program to search the chat sessions, whereas a server-based system would best use a web interface. It is most likely that server-based systems will be used primarily by chat moderators, whereas client-based systems will be used by a home audience.

A problem arises on server-based systems when a moderator wishes to retrieve the recipients of a message written in a public chat room. Because a public room can contain hundreds of users, the vast quantity of information makes rapid storage and retrieval complex.
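The retrieval modes named in this section (keyword, sender, time frame, and session replay) can be illustrated with a small sketch. It assumes the archived utterances have already been loaded into memory as records and filters them with a linear scan; a deployed system would consult the indices instead. The `Utterance` record, the `query` and `replay` functions, and the replay line format are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Utterance:
    channel: str       # public room name, or the chat partners
    sender: str        # unique user identifier where available
    sent_at: datetime  # transmission time
    text: str          # the message itself


def query(archive, keyword=None, sender=None, start=None, end=None):
    """Return utterances matching every criterion given, oldest first."""
    results = []
    for u in archive:
        if keyword is not None and keyword.lower() not in u.text.lower().split():
            continue
        if sender is not None and u.sender != sender:
            continue
        if start is not None and u.sent_at < start:
            continue
        if end is not None and u.sent_at > end:
            continue
        results.append(u)
    return sorted(results, key=lambda u: u.sent_at)


def replay(archive, channel):
    """Render one channel's conversation in chronological order."""
    session = sorted((u for u in archive if u.channel == channel),
                     key=lambda u: u.sent_at)
    return [f"[{u.sent_at:%H:%M:%S}] {u.sender}: {u.text}" for u in session]
```

Criteria left as `None` are ignored, so keyword, sender, and time-frame filters can be combined freely in a single call.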

Figure 3 – Retrieval User Interface

Future Work

Currently, we have a system that retrieves data based on keywords, recipient and sender username, and time. Our goal is to perform more sophisticated analysis of the chat text than is currently provided by ChatLog. Through the Obiwan [6] and KeyConcept [7] projects, we have investigated techniques for automatically classifying web pages to identify their major topics or concepts. We wish to investigate the applicability of these algorithms to chat data. Due to the informal nature and “non-English” text of chat messages (e.g., “brb – be right back”, “lol – laughing out loud”) [8], special preprocessing will be necessary to extract meaningful words from the meaningless chatter. Once this is done, we hope to allow administrators to search the chat archive based on topics (e.g., “U.S. Economy”) rather than specific words (e.g., “interest rates” or “employment figures”).

Topic-based chat archiving has several applications. Parents who wish to restrict and monitor the types of messages that their family members send and receive can proactively block dangerous messages based on topic. Corporations could create summaries of multiple virtual meetings concerning the same subject. Corporations could also protect themselves from employee harassment or corporate espionage by documenting chat sessions and looking for unusual chat topics. Chat network operators could identify rogue accounts that send spam messages from automated spambots and flag suspicious users for post-chat review. Questionable chat rooms could also be identified by key topics. Chat messages could be used to produce a user profile based on a set of frequently discussed topics. This could be used to monitor users who may pose a threat to either children or domestic or international security.

There are also fundamental research issues to be investigated in this area. Searching finite documents, which have a beginning and an end, is relatively easy. In contrast, chat sessions do not contain linear topic discussions, but rather threaded, interwoven topics that come and go. Developing techniques to unravel these threads and follow the topic streams throughout a session is an interesting and fundamental research question.
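The preprocessing step mentioned above, extracting meaningful words from chat shorthand, might look like the following. This is only a sketch of one possible approach, not a method proposed in this paper: the two shorthand entries are the examples given in the text, while the stopword list, the tokenization rule, and the function name are assumptions made for illustration.

```python
import re

# Shorthand examples taken from the text; a real table would be much larger.
CHAT_SHORTHAND = {
    "brb": "be right back",
    "lol": "laughing out loud",
}

# Assumption: a tiny stopword list, used here only for illustration.
STOPWORDS = {"a", "an", "and", "are", "is", "of", "the", "to"}


def meaningful_words(message):
    """Lowercase and tokenize a message, dropping shorthand and stopwords."""
    tokens = re.findall(r"[a-z0-9']+", message.lower())
    return [t for t in tokens
            if t not in CHAT_SHORTHAND and t not in STOPWORDS]
```

For instance, `meaningful_words("lol brb, interest rates are rising")` keeps only `interest`, `rates`, and `rising`, which are the kind of terms a topic classifier could then work with.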

References

[1] Cherry, Steven M. “IM Means Business”, IEEE Spectrum Magazine, November 2002.
[2] “A Parent’s Guide to Internet Safety”, Federal Bureau of Investigation, URL Jan. 29, 2003.
[3] “Agents pursue terrorists online”, USA Today. June 24, 2002. URL Feb. 13, 2003.
[4] IamBigBrother.com, “Parental Internet Control Software,” URL , May 9, 2001.
[5] ICQ URL .
[6] X. Zhu, S. Gauch, L. Gerhard, N. Kral and A. Pretschner. “Ontology-Based Web Site Mapping for Information Exploration,” Proceedings of the Eighth International Conference on Information and Knowledge Management (CIKM ’99), Kansas City, MO, November 1999, 188-194.
[7] ITTC, “KeyConcept Project,” URL , Jan. 29, 2003.
[8] Komando, Kim. “Using IM: Know the lingo,” URL , Feb. 13, 2003.
