
A Study of the Language Used in Electronic Negotiations – a Preliminary Report

Mario Jarmasz and Stan Szpakowicz
School of Information Technology and Engineering, University of Ottawa
Ottawa, Ontario, Canada, K1N 6N5
{mjarmasz, szpak}@site.uottawa.ca

Abstract. With the advent of the Internet, inexpensive yet powerful computers and wide bandwidth, it is realistic to expect that computer-assisted negotiations will become a common way of conducting business in the near future. The Inspire system [12], which has been in use for more than half a decade, facilitates this type of negotiation. The system logs negotiation records in a rich database from which negotiation techniques can be studied and learned. This paper discusses avenues for analyzing the e-mail messages that accompany these interactions using natural language processing techniques. We survey the available tools, report preliminary experiments on the available data, and propose a three-year plan for further research.

1. Introduction

The Inspire Web-based negotiation support system has been in use since 1996 [12], and many electronic negotiations have been recorded with it. The users of this system communicate using formal offers and written messages. We want to study these messages using natural language processing (NLP) techniques. The system records each negotiation in .log files, one for each participant. The two .log files are combined into a transcript.html file that contains all aspects of the negotiation between the two parties. These include tables presenting the values of the four variables which define these negotiations: price, delivery, payment and return. The transcript.html files also contain graphs describing the progress of the negotiation. Graphs cannot be analyzed using NLP techniques; our immediate goal is therefore to study the parts of the files that contain the text messages, though other elements will be used in several stages of the NLP project. According to the InterNeg documentation for encoding the transcript.html file generated by the Inspire system [11] into a database, the important components of a message are the following (a small data-structure sketch follows the list):
• User_id: the identifier of the user who sent this particular message.
• Negotiation_id: the identifier of the negotiation being dealt with.
• Message_datetime: the date and time of the message.

This work has been supported by an INE grant from the Social Sciences and Humanities Research Council of Canada.


• Message_text: the text that makes up this exchange. This is the most important part to be studied.
• Offer_id: indicates if this message is part of an offer.
• Reply_to_message: indicates if this message is a reply to another message.
• Reply_to_offer: indicates if this message is the reply to an offer.
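As a simple illustration, one message record could be represented by a data structure such as the one below. This is a minimal sketch: the class name, field names and field types are our own assumptions, since the InterNeg documentation [11] only lists the components above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class NegotiationMessage:
    """One text message exchanged during an Inspire negotiation (illustrative sketch)."""
    user_id: str                              # identifier of the user who sent the message
    negotiation_id: str                       # identifier of the negotiation
    message_datetime: datetime                # date and time of the message
    message_text: str                         # the text that makes up this exchange
    offer_id: Optional[str] = None            # set if the message is part of an offer
    reply_to_message: Optional[str] = None    # set if the message replies to another message
    reply_to_offer: Optional[str] = None      # set if the message replies to an offer
```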

In this paper we present the aspects of electronic negotiations that we want to model and review related research. We then discuss software that can help us achieve this goal and report a preliminary analysis of the Inspire data. Finally, we give the milestones of a three-year plan to study the data and propose a model of electronic negotiations.

2. Aspects of the E-Negotiations to be Modeled

Our objective is to study the various aspects of the textual component of e-negotiations using NLP techniques. Some of the questions that we want to examine are: Do native English speakers have an advantage when participating in e-negotiations? How many of the interactions are irrelevant to the actual negotiations? How many are important for reaching an agreement? What words are repeated often in the messages of the transcripts, and will this allow us to define a sublanguage specific to negotiations? We would also like to model the mood of a negotiation. Are the two parties getting irritated? Is one more hostile than the other? Are they working towards a compromise?

Natural language has been considered as a model for electronic negotiations, but it has often been criticized for being ambiguous and lacking a formal basis [2]. Many computerized systems for negotiation and mediation have been developed without analyzing text at all [1], [3]. The fact remains that people communicate using natural language. Schoop and Quix developed a system called DOC.COM [23] to support electronic negotiations between human parties. Its communication management system allows for the analysis of e-mail messages. It is based on Searle's theory of speech acts [25] and Habermas' theory of communicative action [9], combined into the language-action perspective (LAP). The communication management system is of interest for further studying the language of e-negotiations; the document management system is essentially a version control system. Schoop and Quix identify four message types: request, offer, order, and inquiry. Each message is classified according to one of Searle's five types of illocutionary force [25]: assertive, commissive, directive, expressive and declarative. The message content is tagged manually, resulting in an XML document. Some of the tagged entities are: delivery, supplier, recipient, quantity, unit, product, product code, date, payment, payer, amount, currency, term, and contract. Schoop and Quix argue that it is important to identify messages as comprehensible as well as truthful. They also propose an order of required negotiation steps. If this model proves to be valid, it can be used to predict what kind of message will come next; for example, an offer can be followed by an acceptance, a counter-offer or a rejection. A small sketch of such a transition model appears at the end of this section. Finally, Schoop and Quix state that the reasoning mechanisms in a negotiation can be analyzed using a formalism for representing speech acts called cooperative language.

The transcripts generated by the Inspire system [12] and stored in the .log files contain short messages, similar in length and style to e-mail messages. There has been research on filtering e-mail based on content [10], [5] as well as on automatically answering electronic messages [13], but so far no work seems to have been done on analyzing the content of a large collection of e-mail messages. The Message Understanding Conferences (MUC) [14], in particular MUC-7, included a task of identifying named entities. Many systems that participate in the Question-Answering track of the Text REtrieval Conference (TREC) [28] also rely heavily on named entity recognition. Some of these techniques might be used to identify the important elements of e-negotiations mentioned in the transcripts. Research performed in the fields of discourse and spoken dialogue might also prove useful for this project, in particular the Dialog Act Markup in Several Layers (DAMSL) scheme [4] developed at the University of Rochester.
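To make the idea of predicting the next message type concrete, the following is a minimal sketch of such a transition model. The set of message types and the allowed transitions are our own illustration, loosely based on the example above (an offer can be followed by an acceptance, a counter-offer or a rejection); they are not the actual ordering proposed by Schoop and Quix [23].

```python
# Illustrative transition model over negotiation message types.
# The states and transitions below are assumptions made for the sake of the example,
# not the ordering actually proposed by Schoop and Quix [23].
ALLOWED_NEXT = {
    "request":       {"offer", "inquiry"},
    "inquiry":       {"offer", "request"},
    "offer":         {"acceptance", "counter-offer", "rejection"},
    "counter-offer": {"acceptance", "counter-offer", "rejection"},
    "acceptance":    {"order"},
    "rejection":     {"offer", "request"},
    "order":         set(),            # terminal step in this sketch
}

def predict_next(message_type: str) -> set:
    """Return the set of message types that may follow the given one in this sketch."""
    return ALLOWED_NEXT.get(message_type, set())

if __name__ == "__main__":
    print(predict_next("offer"))   # {'acceptance', 'counter-offer', 'rejection'}
```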

3. NLP Software Repositories

There exist several repositories of public-domain and commercial NLP programming resources. The major ones are SourceForge [26], GATE (General Architecture for Text Engineering) [8], TUSTEP (TUebingen System of Text Processing Programs) [29], NLSR (the Natural Language Software Registry) [16], and PennTools (computational linguistics resources at the University of Pennsylvania) [21]. There are also websites that are not so much repositories as collections of bookmarks, for example the Speech and Language Web Resources page [27]. Interesting commercial software is available from the Xerox Research Centre Europe (XRCE) [30]. Two major software suites for NLP programming are freely available: NLTK (the Natural Language Processing Toolkit) and GATE. The contents of these repositories, as well as a discussion of the major software suites, are presented in this section.

3.1. SourceForge

SourceForge [26] is “the world's largest Open Source development Website, with the largest repository of Open Source code and applications available on the Internet”. Many open-source NLP projects are hosted on that website, in particular OpenNLP [17], the OpenNLP Grok Library [18], the OpenNLP Maximum Entropy Package [19], and NLTK [15]. SourceForge assigns a development status to every software project. The ratings are: 1 – Planning; 2 – Pre-Alpha; 3 – Alpha; 4 – Beta; 5 – Production/Stable; 6 – Mature; 7 – Inactive. Since many of these projects are still under development, the website is worth monitoring for new releases.

3.1.1. OpenNLP

OpenNLP provides the organizational structure for coordinating several different projects that each address some aspect of NLP. OpenNLP also defines a set of Java interfaces and implements some basic infrastructure for NLP components; it is therefore a framework for possible implementations [17]. The OpenNLP Grok Library, discussed in the next section, is an implementation of the OpenNLP interfaces and abstract classes. The development status of this project is 3 – Alpha. Some interesting abstract classes and interfaces are:

opennlp.common.morph Package:
• MorphAnalyzer: the interface for morphological analyzers, which return morphological information for word forms.

opennlp.common.parse Package:
• Parser: a module that takes a string as input and produces a structured form representing that string.
• Derivation: draws a graphical representation of the derivation of a parse.
• Lexicon: contains words and their associated categories and semantics.
• RuleGroup: a set of rules that describe how lexical items should be combined.


opennlp.common.preprocess Package:
• NameFinder: the interface for name finders, which search text for names of people, companies, countries, etc.
• POSTagger: the interface for part-of-speech taggers.
• SentenceDetector: the interface for sentence detectors, which find the sentence boundaries in a text.
• Tokenizer: the interface for tokenizers, which turn messy text into nicely segmented text tokens.

opennlp.common.unify Package:
• XMLUtils: utilities for manipulating XML-based objects.
• NLPDocument: a class which wraps an XmlDocument and ensures that it fits the OpenNLP specifications. (This class seems to be used to mark up a text with paragraph, sentence, token and word tags.)
• NLPDocumentBuilder: creates an NLPDocument from a file or a string.

3.1.2. OpenNLP Grok Library

Grok is an open-source natural language processing library written in Java. It is part of the OpenNLP project and provides implementations for most of the interfaces defined in the opennlp.common package [18]. The development status of this project is 4 – Beta. Some interesting classes written in Java are:

opennlp.grok.ml.dectree Package:
• DecisionTree: implements decision trees based on a generic map that provides values for features when requested. These features could be computed on the fly by extending the Map interface.

opennlp.grok.parse Package:
• Chart: an implementation of the table (or chart) used by chart parsers such as CKY. Special functions are provided for combining cells of the chart into another cell.
• CKY: a chart parser that is used, in this case, with the CCG grammar formalism.
• PrefRankCKY: a version of CKY that gives preference to edges that score highly on some measure; that measure is the average of the P(cat|word) probabilities of its component parts.

opennlp.grok.preprocess.namefind Package:
• EmailDetector: finds e-mail addresses in a text and marks them. Not much of a detector at the moment, as it only checks whether there is a "@" in the string.
• NameFinderME: a name finder that uses maximum entropy to determine whether a word is a name.

opennlp.grok.preprocess.postag Package:
• EnglishPOSTaggerME: a part-of-speech tagger that uses a model trained on English data from the Wall Street Journal and the Brown corpus. The latest model achieved >96% accuracy on unseen data.
• POSTaggerME: a part-of-speech tagger that uses maximum entropy. It tries to predict whether words are nouns, verbs, or any of 70 other POS tags, depending on their surrounding context.

opennlp.grok.preprocess.sentdetect Package:


• DefaultEndOfSentenceScanner: the default end-of-sentence scanner implements all of the EndOfSentenceScanner methods in terms of the getPositions(char[]) method. It scans for . ? ! ) and ".
• EnglishSentenceDetectorME: a sentence detector which uses a model trained on English data (Wall Street Journal text).
• SentenceDetectorME: a sentence detector for splitting up raw text into sentences. A maximum entropy model is used to evaluate the characters ".", "!", and "?" in a string to determine if they signify the end of a sentence. (A plain-Python illustration of this kind of end-of-sentence scanning follows the class listing below.)

opennlp.grok.preprocess.tokenize Package:
• EnglishTokenizerME: a tokenizer which uses default English data for the maximum entropy model.
• TokenizerME: a tokenizer for converting raw text into separated tokens. It uses maximum entropy to make its decisions. The features are loosely based on Jeff Reynar's UPenn thesis "Topic Segmentation: Algorithms and Applications", which is available from his homepage.
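As a toy illustration of the simplest of these preprocessing components, the sketch below reimplements in plain Python the behaviour described above for EmailDetector (flag any token containing "@") and a rule-based end-of-sentence scan over the characters . ? ! ) and ". It is our own illustration of the idea, not the Grok code, which relies on maximum entropy models for the ambiguous cases.

```python
# Plain-Python illustration of two simple preprocessing ideas described above.
# This is not the Grok implementation, which uses maximum entropy models
# to resolve the ambiguous cases.

END_OF_SENTENCE_CHARS = {'.', '?', '!', ')', '"'}

def looks_like_email(token: str) -> bool:
    """Crude e-mail detector: like Grok's EmailDetector, it only checks for an '@'."""
    return '@' in token

def end_of_sentence_positions(text: str) -> list:
    """Return the character offsets of potential sentence-ending characters."""
    return [i for i, ch in enumerate(text) if ch in END_OF_SENTENCE_CHARS]

if __name__ == "__main__":
    message = 'Please contact me at buyer@example.com. Can we settle on 3.75 dollars?'
    print([t for t in message.split() if looks_like_email(t)])
    print(end_of_sentence_positions(message))
    # Note that the '.' inside '3.75' is also flagged; this is exactly the kind of
    # ambiguity that the maximum entropy sentence detector is meant to resolve.
```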

3.1.3. OpenNLP Maximum Entropy Package

The opennlp.maxent package is a mature Java package for training and using maximum entropy models [19]. Its development status is 6 – Mature. Applications using this package, such as a parser, can be found in the OpenNLP Grok Library discussed above.
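For readers unfamiliar with the technique, the following is a toy sketch of the idea behind maximum entropy classification, which in the two-class case is equivalent to logistic regression trained by gradient ascent on the conditional log-likelihood. It is not the opennlp.maxent API; the features and training data are invented for illustration.

```python
import math

def train_maxent(examples, feature_names, epochs=200, lr=0.1):
    """Toy binary maximum entropy (logistic regression) trainer.
    `examples` is a list of (feature_dict, label) pairs with labels 0 or 1."""
    weights = {f: 0.0 for f in feature_names}
    bias = 0.0
    for _ in range(epochs):
        for feats, label in examples:
            score = bias + sum(weights[f] * v for f, v in feats.items())
            p = 1.0 / (1.0 + math.exp(-score))   # model probability of label 1
            error = label - p                    # gradient of the log-likelihood
            bias += lr * error
            for f, v in feats.items():
                weights[f] += lr * error * v
    return weights, bias

def predict(feats, weights, bias):
    score = bias + sum(weights.get(f, 0.0) * v for f, v in feats.items())
    return 1.0 / (1.0 + math.exp(-score))

if __name__ == "__main__":
    # Invented task: does a '.' end a sentence? (1 = yes, 0 = no)
    data = [
        ({"next_is_capital": 1, "prev_is_digit": 0}, 1),
        ({"next_is_capital": 0, "prev_is_digit": 1}, 0),
        ({"next_is_capital": 1, "prev_is_digit": 0}, 1),
        ({"next_is_capital": 0, "prev_is_digit": 0}, 0),
    ]
    w, b = train_maxent(data, {"next_is_capital", "prev_is_digit"})
    print(round(predict({"next_is_capital": 1, "prev_is_digit": 0}, w, b), 3))
```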

3.1.4. Natural Language Processing Toolkit (NLTK)

The NLP Toolkit is a Python package intended to simplify the task of programming natural language systems. Its primary audience is graduate and undergraduate students studying computational linguistics; in particular, it was designed for use in the course CIS-530 at the University of Pennsylvania. It is intended as a teaching tool, not as a basis for building production systems [15]. The development status of this project is 3 – Alpha. This being said, its documentation is extremely thorough, much better than that of most open-source projects. The package seems quite complete, but Python is not a common development platform, and the project administrators warn that NLTK should not be used to build production systems, although it definitely seems adequate for building prototypes. The NLTK modules, written in Python and described in the NLTK reference documentation [15], are the following (a brief usage sketch follows the list):

• token: basic classes useful for processing individual elements of text, such as words or sentences.
• probability: classes for representing and processing probabilistic information, such as frequency distributions.
• set: an unordered container class that contains no duplicate elements.
• tree: classes for representing text trees.
• cfg: basic data classes for representing context-free grammars.
• fsa: an FSA class, deliberately simple so that the operations are easily understood.
• tagger: classes and interfaces used to tag each token of a document with supplementary information, such as its part of speech or its WordNet synset tag.
• parser: classes and interfaces for producing tree structures that represent the internal organization of a text.
  o chart: data classes and ParserI implementations for "chart parsers", which use dynamic programming to efficiently parse a text.


  o chunk: a chunk parser is a kind of robust parser which identifies linguistic groups (such as noun phrases) in unrestricted text, typically one sentence at a time.
  o probabilistic: classes and interfaces for associating probabilities with tree structures that represent the internal organization of a text.
• classifier: classes and interfaces used to classify texts into categories.
  o feature: basic data classes for building feature-based classifiers.
  o featureselection: classes and interfaces used to select which features are relevant for a given classification task.
  o maxent: a text classifier model based on the maximum entropy modeling framework.
  o naivebayes: a text classifier model based on the naive Bayes assumption.
• draw: tools for graphically displaying and interacting with the objects and processing classes defined by the toolkit.
  o chart: drawing charts, etc.
  o fsa: draws a finite state automaton.
  o play: methods for graphically displaying the objects created by the toolkit.
  o plot
  o plot_graph
  o tree: graphically displays a Tree or TreeToken.
  o tree_edit: a graphical tree editor.
• chktype: type checking for the parameters of functions and methods.
• test: unit tests for the NLTK modules (probability, rechunkparser, set, token, tree).
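As a brief illustration of the kind of processing these modules support, the sketch below tokenizes one negotiation message and builds a word frequency distribution. It uses function names from a recent NLTK release, which differ from the module layout listed above, so it should be read as an indication of the toolkit's style rather than as code for the exact version discussed here; the sample message is invented.

```python
from nltk.tokenize import wordpunct_tokenize   # regex-based tokenizer, needs no model files
from nltk import FreqDist                      # frequency distributions (probability module)

message = ("Thank you for your offer. We can accept a price of 3.98 "
           "if delivery is within 20 days.")

tokens = wordpunct_tokenize(message)
freq = FreqDist(w.lower() for w in tokens if w.isalpha())

print(tokens[:8])
print(freq.most_common(3))
```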






3.2. GATE

GATE comprises an architecture, a framework (or SDK) and a development environment, and has been developed since 1995 in the Sheffield NLP group. The system has been used for many language processing projects, in particular for information extraction in many languages [8]. GATE is quite a big SDK, the largest of the open-source projects discussed here. Its documentation seems to be quite extensive, but there is no concise summary of all classes, which slows down the learning curve. It is not clear whether the classes can be used independently, as in the NLTK package, or whether the entire GATE development environment must be used. The two major components of this system are:



• CREOLE: a Collection of REusable Objects for Language Engineering, which consists of:
  o LanguageResources (LRs), which represent entities such as lexicons, corpora or ontologies;
  o ProcessingResources (PRs), which represent entities that are primarily algorithmic, such as parsers, generators or n-gram modellers;
  o VisualResources (VRs), which represent visualization and editing components that participate in GUIs.
• ANNIE: a Nearly-New Information Extraction system. Some of its components are:
  o Tokenizer
  o Sentence Splitter
  o Part-of-Speech Tagger
  o Semantic Tagger
  o Pronominal Coreference module

3.3. TUSTEP: TUebingen System of Text Processing Programs

TUSTEP, developed at the University of Tuebingen Computing Center since the late 1960s, is a professional toolbox for the scholarly processing of textual data. Long-term availability, platform independence of procedures and data, coverage of all the steps of a typical humanities research project, and flexibility based on a consistently modular design are its important design principles [29]. The tasks which can be accomplished with the help of TUSTEP range from composing a brief seminar paper to preparing extensive bibliographies, lexica, indexes, concordances, dictionaries, critical editions and, of course, monographs; the final output can be formatted for photocomposition in a quality one is accustomed to in letterpress printing, or can be prepared in a form (e.g. XML, HTML) and encoding (e.g. Unicode) required for electronic publishing [ibid.]. The TUSTEP package requires more investigation. It may be a useful resource for preparing a lexicon of terms used in e-negotiations, if the proposed formalism is adequate; this has not yet been investigated.

3.4. The Natural Language Software Registry

The Natural Language Software Registry (NLSR) is a concise summary of the capabilities and sources of NLP software available to the community. It comprises academic, commercial and proprietary software, with specifications and the terms on which it can be acquired clearly indicated [16]. This website seems quite extensive and professional; it has not yet been studied in depth.

3.5. PennTools

PennTools is a website of computational linguistics resources hosted at the University of Pennsylvania [21]. Some of its sections are:
• packages
• corpora
• dictionaries / lexicons
• tools:
  o tools for administering and annotating corpora
  o part-of-speech taggers and parsers
  o sentence boundary detectors
• links to natural language research centres, organizations, and paper archives

3.6. Bookmarks of NLP resources

There are many websites that consist of organized bookmarks to NLP resources, for example the Speech and Language Web Resources site [27].

3.7. Xerox Research Centre Europe

XRCE conducts a great deal of NLP research. Its research agenda concerns theories, methods, tools and systems that make it possible to uncover the content of natural language texts [30]. XRCE has developed a language identifier that can classify texts in 47 languages [31]. Most of its products are commercial.


4. Preliminary Investigation of the Data

The following tasks have been identified for the preliminary phase of this study:

• Writing Perl scripts to classify the logs into three classes:
  o no messages
  o English messages
  o messages that are not in English
This task is performed using Michael Piotrowski's Lingua::Ident (version 1.4) Perl module [22], which performs statistical language identification based on Ted Dunning's algorithm [7]. (A small sketch of this kind of language identification appears after this list.)

• Writing Perl scripts to extract the textual elements of the negotiation transcripts:
  o user
  o date and time
  o text of the message
  o is the message part of an offer?
  o is the message a reply to an offer?
  o is it relevant to the negotiation?



• Further analysis of the English messages to determine:
  o the average length of the messages
  o word frequencies
  o common words used in the negotiations (if any)
  o the average number of interactions
  o the importance of the origin of the negotiator
The computation of message lengths and word frequencies is also illustrated in the sketch after this list.



• Investigating avenues of future work:
  o the possibility of implementing a system based on Searle's speech acts [25]
  o automating the identification of entities important to e-negotiations
  o labelling interactions as positive or negative; Charles Osgood's theory of semantic differentiation [20] can possibly be used for this task
  o automatic labelling of irrelevant interactions
  o identifying the elements necessary for proposing a model of e-negotiations
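The sketch below illustrates, in plain Python, two of the tasks listed above: a very simple character-trigram language guesser in the spirit of Dunning's statistical approach [7] (the actual scripts use the Lingua::Ident Perl module [22], whose models and smoothing are more refined), and the computation of basic statistics such as average message length and word frequencies. The training snippets, sample messages and class labels are illustrative assumptions.

```python
import math
import re
from collections import Counter

def trigram_counts(text: str) -> Counter:
    """Count character trigrams; padding spaces mark text boundaries."""
    text = " " + re.sub(r"\s+", " ", text.lower()) + " "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def log_likelihood(message: str, model: Counter) -> float:
    """Score a message against a trigram model with add-one smoothing."""
    total = sum(model.values())
    vocab = len(model) + 1
    return sum(math.log((model[t] + 1) / (total + vocab))
               for t in trigram_counts(message).elements())

def classify(message: str, models: dict) -> str:
    """Assign a message to one of the three classes used in the preliminary study."""
    if not message.strip():
        return "no messages"
    best = max(models, key=lambda lang: log_likelihood(message, models[lang]))
    return "English messages" if best == "en" else "messages that are not in English"

def statistics(messages: list) -> dict:
    """Average message length (in words) and the most frequent words."""
    words = [w for m in messages for w in re.findall(r"[a-z']+", m.lower())]
    avg_len = len(words) / len(messages) if messages else 0.0
    return {"average_length": avg_len, "common_words": Counter(words).most_common(10)}

if __name__ == "__main__":
    # Tiny illustrative training snippets; a real model would be built from much more text.
    models = {
        "en": trigram_counts("we accept your offer and propose delivery in twenty days"),
        "fr": trigram_counts("nous acceptons votre offre et proposons la livraison"),
    }
    msgs = ["Thank you for the offer, we accept the price.", ""]
    print([classify(m, models) for m in msgs])
    print(statistics([m for m in msgs if m.strip()]))
```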

5. Future Work

Being able to study the language of e-negotiations using NLP techniques is of great importance to future e-negotiations. We propose a three-year plan for a system that will be able to analyze the .log files generated by Inspire, model the e-negotiations and help the users reach consensus. The milestones of this plan are as follows.

Year 1 will consist of building tools to extract and categorize the textual elements of the negotiation transcripts. The named entities in the messages that are relevant to negotiation must be identified and tagged, for example price, delivery, payment, and return; a specialized lexicon containing these terms is to be prepared. The entities will be further classified in a domain-specific ontology and used to build models of negotiations.

Year 2 will be spent implementing a dedicated syntactic analyzer for the transcripts. The texts will be studied using the available NLP techniques. It should be possible to analyze the position of a negotiator automatically; interactions can be categorized as strong, weak, accommodating, or inflexible. Techniques for analyzing what is being discussed will also be developed: an interaction can be topical or superfluous. The system should be able to predict, given the interactions, whether the negotiation will be successful. To achieve this, a preliminary catalogue of phrases associated with winning, losing and abandoning negotiations will be compiled.

Year 3 will study the influence of the cultural backgrounds of the negotiators and propose methods for improving negotiations while they are in progress. Different negotiation strategies will be studied, taking into account the country of origin of the users of the Inspire system [12], and the most successful ones will be identified. Software to assist the negotiators by proposing alternative strategies will be developed.

6. References

[1] C. Beam and A. Segev. "Automated negotiations: A survey of the state of the art". CITM Working Paper 96-WP-1022, 1997.
[2] M. Benyoucef and R. Keller. "An Evaluation of Formalisms for Negotiations in E-Commerce". Proceedings of the Workshop on Distributed Computing on the Web, Quebec City, Canada, 2000.
[3] M. Benyoucef and R. Keller. "A conceptual architecture for a combined negotiation support system". Tech. Rep. GELO-118, Montreal, Canada, February 2000.
[4] M. Core and J. Allen. "Coding Dialogs with the DAMSL Annotation Scheme". Proceedings of the AAAI Fall Symposium on Communicative Action in Humans and Machines, Boston, MA, November 1997.
[5] R. J. Cole II and P. W. Eklund. "Analyzing an Email Collection Using Formal Concept Analysis". Proceedings of PKDD 1999, 309-315.
[6] M. Core. "Analyzing and Predicting Patterns of DAMSL Utterance Tags". Proceedings of the AAAI Spring Symposium on Applying Machine Learning to Discourse Processing, Stanford, CA, March 1998.
[7] T. Dunning. "Statistical Identification of Language". Technical report CRL MCCS-94-273, Computing Research Lab, New Mexico State University.
[8] General Architecture for Text Engineering. http://gate.ac.uk/
[9] J. Habermas. "Theorie des kommunikativen Handelns, vol. 1: Handlungsrationalität und gesellschaftliche Rationalisierung". Suhrkamp, Frankfurt am Main, 1981; Taschenbuchausgabe, 1995.
[10] M. Hfferer, B. Knaus and W. Winiwarter. "Adaptive Information Extraction from Online Messages". Proceedings of RIAO 94, 11-13 October 1994, Rockefeller University, New York.
[11] InterNeg Documentation. "Documentation for Entering Records into InterNeg Database". Inspire Data CD-ROM provided by G. Kersten.
[12] G. Kersten. "The Science and Engineering of E-Negotiation: An Introduction". Proceedings of the Hawai'i International Conference on System Sciences, January 6-9, 2003, Big Island, Hawaii.
[13] L. Kosseim, S. Beauregard and G. Lapalme. "Using information extraction and natural language generation to answer e-mail". Data & Knowledge Engineering 38(1): 85-100, 2001.
[14] Message Understanding Conferences (MUC). http://www.itl.nist.gov/iaui/894.02/related_projects/muc/
[15] Natural Language Processing Toolkit. http://nltk.sourceforge.net/
[16] Natural Language Software Registry. http://registry.dfki.de/
[17] OpenNLP. http://opennlp.sourceforge.net/
[18] OpenNLP Grok Library. http://grok.sourceforge.net/
[19] OpenNLP Maximum Entropy Package. http://maxent.sourceforge.net/
[20] C. Osgood. "Theory of Semantic Differentiation".
[21] PennTools. http://www.cis.upenn.edu/~adwait/penntools.html
[22] M. Piotrowski. Lingua::Ident 1.4 Perl module. http://search.cpan.org/author/MPIOTR/Lingua-Ident-1.4/Ident.pm
[23] M. Schoop and C. Quix. "DOC.COM: a framework for effective negotiation support in electronic marketplaces".
[24] M. Schoop and C. Quix. "DOC.COM: Combining document and communication management for negotiation support in business-to-business electronic commerce".
[25] J. R. Searle. "Speech Acts: An Essay in the Philosophy of Language". Cambridge University Press, Cambridge, 1969.
[26] SourceForge. http://sourceforge.net
[27] Speech and Language Web Resources. http://n106.is.tokushima-u.ac.jp/member/kita/NLP/
[28] Text REtrieval Conference (TREC). http://trec.nist.gov/
[29] TUebingen System of Text Processing Programs. http://www.uni-tuebingen.de/zdv/tustep/tdv_eng.html
[30] Xerox Research Centre Europe. http://www.xrce.xerox.com
[31] XRCE language identifier. http://www.xrce.xerox.com/competencies/contentanalysis/tools/guesser.en.html