Dynamic Building of Domain Specific Lexicons Using Emergent Semantics by Matt Selway
A thesis submitted for the degree of
Bachelor of Computer Science (Honours) School of Computer and Information Science Division of Information Technology, Engineering and the Environment
Supervisor: Professor Markus Stumptner Knowledge and Software Engineering Laboratory School of Computer and Information Science University of South Australia
24th November 2010
Declaration I declare that this thesis does not incorporate without acknowledgment any material previously submitted for a degree or diploma in any university; and that to the best of my knowledge it does not contain any materials previously published or written by another person except where due reference is made in the text. Matt Selway
24th November 2010
Abstract

The majority of current Natural Language Processing (NLP) methods and research are based on statistical analysis and machine learning. While these methods are quite effective, they are not perfect, and the search continues for better and more accurate methods for NLP. The work of Kleiner et al. (2009) takes a very different approach: it treats NLP as a model transformation problem, utilising configuration as a model transformation to deal with the ambiguity of natural language. Model transformations are used heavily in Model Driven Engineering (MDE) and are the process of producing one or more output models from one or more input models. Configuration is a general method of constraint-based search that, in this case, is used to find a model conforming to the desired meta-model. Although Kleiner et al. (2009) show promising results, their method requires a predefined lexicon to support the transformations from natural language to a useful model representation. This means that in order to utilise this method, a complete lexicon needs to be defined for each domain in which it is used, which is quite impractical. This research aims to develop a method of dynamically building a lexicon based on a multi-agent system and the principles of Emergent Semantics and Semiotic Dynamics, in a similar way to that described by Steels and Hanappe (2006). By dynamically generating a lexicon based on the input text, the amount of time required to define a lexicon would be reduced, with the potential for a fully automated process. Therefore, the overall development time of software would be reduced through the automatic transformation of specifications to formal representations (such as UML class diagrams). In addition, updating the lexicon would become simpler, as it would be a natural ability of the system. Finally, the dynamic generation may have long-range benefits for research in semantic interoperability and the Semantic Web.

This dissertation explores some background on NLP, MDE, and Emergent Semantics in order to provide context for this research. It then describes the development of the Lexicon Builder experimental system, a method for dynamically developing domain specific lexicons from input text. This system utilises an external data source in order to retrieve lexical data not present within the system. Several tests are performed in order to gauge the effectiveness of the Lexicon Builder in dynamically creating lexicons. Unfortunately, the results show that, in its current form, the Lexicon Builder is not very effective and only partly meets the goals of this research. However, the analysis of its limitations has provided many opportunities for future work to improve and expand the Lexicon Builder system.
Contents

Declaration . . . iii
Abstract . . . v
Contents . . . vii
List of Figures . . . xi
List of Tables . . . xiii
List of Acronyms . . . xv
1 Introduction . . . 1
1.1 The Problem . . . 1
1.2 Research Questions . . . 2
1.3 Goals and Contributions . . . 3
1.4 Dissertation Overview . . . 4
2 Literature Review . . . 5
2.1 Natural Language Processing . . . 5
2.2 Model Driven Engineering . . . 9
2.3 Emergent Semantics . . . 10
2.4 Summary . . . 12
3 Design and Implementation of the Lexicon Builder . . . 15
3.1 Type of Research, Results, and Validations . . . 15
3.2 Methodology . . . 16
3.3 Design Phase . . . 18
3.3.1 External Data Source Selection . . . 18
3.3.2 Lexicon Meta-Model . . . 21
3.3.3 Agent Interactions . . . 23
3.3.3.1 Scenario 1 . . . 23
3.3.3.2 Scenario 2 . . . 25
3.3.3.3 Scenario 3 . . . 28
3.3.4 Lexicon Builder Design . . . 30
3.4 Implementation . . . 36
3.4.1 Approach . . . 36
3.4.2 Classifiers . . . 38
3.4.3 User Interfaces . . . 40
3.5 Summary . . . 43
4 Experimental System Evaluation . . . 45
4.1 Configuration . . . 45
4.2 Validation Tests . . . 47
4.2.1 Validation Test 1 – Scenario 1 . . . 47
4.2.2 Validation Test 2 – Scenario 2: Few Changes . . . 48
4.2.3 Validation Test 3 – Scenario 2: Many Changes . . . 51
4.2.4 Validation Test 4 – Scenario 3: Few Local Changes, Few Remote Changes . . . 54
4.2.5 Validation Test 5 – Scenario 3: Few Local Changes, Many Remote Changes . . . 57
4.2.6 Validation Test 6 – Scenario 3: Many Local Changes . . . 63
4.3 Performance Tests . . . 66
4.3.1 Performance Test 1 – Automatically Retrieved Data . . . 66
4.3.2 Performance Test 2 – Categories Per Word . . . 69
4.3.3 Performance Test 3 – Correct and Superfluous Data . . . 73
4.4 Summary . . . 77
5 Analysis . . . 79
5.1 Automatically Retrieved Data . . . 79
5.2 Categories per Word . . . 80
5.3 Correct and Superfluous Data . . . 83
5.3.1 Missing Categories . . . 83
5.3.2 Additional Categories . . . 89
5.4 Summary . . . 92
6 Summary and Conclusion . . . 93
6.1 Summary . . . 93
6.2 Limitations and Future Work . . . 95
6.3 Conclusion . . . 97
References . . . 99
A Word Lists . . . 103
A.1 SBVR Word List . . . 103
A.2 Proposal Word List . . . 103
B Expected Lexical Categories . . . 111
B.1 SBVR Word List . . . 111
B.2 Proposal Word List . . . 112
C Incorrect Lexical Categories . . . 129
C.1 SBVR Word List . . . 129
C.2 Proposal Word List . . . 131
List of Figures

2.1 Process Overview . . . 8
3.1 Lexicon Builder Overview . . . 17
3.2 Lexicon Meta-Model . . . 22
3.3 Scenario 1 – New Lexicon . . . 25
3.4 Scenario 2 – New Lexicon and Matching Lexicon Found Remotely . . . 27
3.5 Scenario 3 – Matching Lexicon Found Locally . . . 31
3.6 Lexicon Builder Package Structure . . . 32
3.7 core Package Class Diagram . . . 33
3.8 lexicon Package Class Diagram . . . 34
3.9 util Package Class Diagram . . . 35
3.10 Example of the “print-metadata” Command . . . 40
3.11 Example of the Graphical User Interface (GUI) with the Owner Tags Highlighted . . . 41
3.12 Example of the GUI with the Words Context Menu . . . 42
3.13 Example of the GUI with the Category Context Menu . . . 42
4.1 Histogram of the Percentage of Words with Retrieved Data . . . 68
4.2 Histogram of the Number of Categories Retrieved per Word . . . 71
4.3 Histogram of the Percentage Frequencies of the Number of Categories . . . 72
4.4 Histogram of the Percentages of Correct and Additional Categories . . . 76
5.1 Definition of “sell” from The Free Dictionary . . . 82
5.2 Categories for “Sells” in the Lexicon Builder . . . 82
5.3 Potential Frequencies of Number of Categories . . . 83
5.4 Definition of “least” from The Free Dictionary . . . 85
5.5 Definition of “avoid” from The Free Dictionary . . . 86
5.6 Definition of “would” from The Free Dictionary . . . 86
5.7 Categories for “would” in the Lexicon Builder . . . 87
5.8 Definition of “Natural Language” from The Free Dictionary . . . 87
5.9 Definition of “developed” from The Free Dictionary . . . 88
5.10 Definition of “develop” from The Free Dictionary . . . 88
5.11 Definition of “base” from The Free Dictionary . . . 91
List of Tables

4.1 Number of Words with Retrieved Data . . . 68
4.2 Frequencies for Number of Categories . . . 72
4.3 Correct and Additional Categories . . . 76
A.1 SBVR Word List . . . 103
A.2 Proposal Word List . . . 105
B.1 SBVR Word List Expected Lexical Categories . . . 111
B.2 Proposal Word List Expected Lexical Categories . . . 113
C.1 SBVR Words With Missing Categories . . . 129
C.2 SBVR Words With Additional Categories . . . 130
C.3 Proposal Words with Missing Categories . . . 131
C.4 Proposal Words with Additional Categories . . . 133
List of Acronyms

General
EMF Eclipse Modelling Framework
GUI Graphical User Interface
GUID Globally Unique Identifier
MDE Model Driven Engineering
NLP Natural Language Processing
OMG Object Management Group
SBVR Semantics for Business Vocabulary and Rules
UML Unified Modelling Language
UUID Universally Unique Identifier
XML Extensible Markup Language

Lexicon Meta-Model
ADCat article determiner
DCat determiner
DADCat definite article determiner
IADCat indefinite article determiner
ITVCat intransitive verb
NCat noun
NumeralCat numeral
QCat quantified determiner
QUnvaluedCat quantified unvalued determiner
QValuedCat quantified valued determiner
TVCat transitive verb
VCat verb
Chapter 1

Introduction

Natural Language Processing (NLP) is an exciting concept with a wide range of possible applications. Most current NLP methods revolve around statistical analysis and symbolic machine learning methods. While these methods are quite effective, they are not perfect, and researchers continue to search for new and improved ways of performing NLP.
A radically different approach, proposed by Kleiner et al. (2009), applies Model Driven Engineering (MDE) to NLP. By viewing NLP as a model transformation problem and utilising configuration as an advanced model transformation technique, they have developed a process for the automated transformation of natural language specifications into formal representations such as Unified Modelling Language (UML) class diagrams.
1.1 The Problem

The preliminary results produced by Kleiner et al. (2009) are quite promising. However, their process requires a manually predefined lexicon to support the early stages of transforming natural language into a useful model representation. Therefore, in order to utilise this method of NLP, a complete lexicon needs to be defined, which is impractical for real-world applications. First, it can take a long time to manually build a complete lexicon, even within a limited domain. It would be pointless, say, for a software company to spend six months developing a lexicon only to save six months at the end through the use of MDE. Second, the predefined lexicon is static: if the specification changes (which is likely) and a word is added that was not initially put in the lexicon, then the whole process may break. Finally, the need to manually predefine a lexicon reduces the amount of automation in the process, yet one of the goals of MDE is to enable more automation.
1.2 Research Questions
The problems identified with the process described by Kleiner et al. (2009) would have a major impact on its long-term viability. Therefore, a method of dynamically developing domain specific lexicons that can be incorporated into their process is required. This research attempts to develop such a method and answer the question:
Can we reduce or eliminate the need to manually predefine a lexicon by dynamically building a lexicon based on the input text?
This question was broken down into several subquestions in order to gauge the effectiveness of the developed method in dynamically creating lexicons.
1. How much of a reduction can be gained?
2. How well does it work?
3. What are its limitations?
The first subquestion is straightforward: it tries to determine the extent to which manual work can be reduced, with the complete elimination of manual work as the goal. The second encompasses many aspects and aims to determine how correct the dynamically developed lexicons are; after all, there is no point in having a method that automatically generates incorrect lexicons. The third subquestion attempts to identify limitations such a system may have that would make it impractical for any real application.
1.3 Goals and Contributions
The aim of this research is to remove the need to manually predefine the lexicon by dynamically building a lexicon from the input text. To this end, an experimental system – dubbed the Lexicon Builder – has been developed that utilises the concepts of multi-agent systems, Emergent Semantics, and Semiotic Dynamics in order to support the automated development of a lexicon during the process developed by Kleiner et al. (2009). This would provide several direct benefits, including:
• the automated development of domain specific lexicons for NLP, and
• the inclusion of updates as an inherent part of the system, thus avoiding the trap of an out-of-date lexicon when a new word is included in the input text.
Long-range benefits of this and future work may include:
• the completely automated analysis and transformation of natural language specifications,
• the support of other research into the semantic interoperability of information systems, and
• the potential to develop global agreement on lexical information, through the inherent sharing of the developed domain specific lexicons, which would be a major contribution to the Semantic Web efforts.
The actual contributions made by this research are a little less lofty; however, they are an important step leading into future work. They are:
• the partially automated development of domain specific lexicons, providing a proof of concept that can be improved greatly in future work, and
• the incorporation of updates to the input text into the domain specific lexicon.
1.4 Dissertation Overview
Chapter 2 examines the areas of NLP, MDE, and Emergent Semantics in order to develop background and context for this research. Chapter 3 describes the development of the Lexicon Builder system, including the design of its elements and the approach taken in its implementation, while Chapter 4 describes the methods used to evaluate the experimental system and their results. Chapter 5 discusses the results of the tests in an attempt to answer the research questions. Finally, Chapter 6 concludes this dissertation with a summary of the work carried out, the limitations identified, and future work that could further develop this research.
Chapter 2

Literature Review

This research covers several areas of computer science, including NLP, MDE, and Emergent Semantics. The following literature review provides background for the research presented in this thesis and places it in the context of each of these three areas.
2.1 Natural Language Processing

Natural Language Processing has arisen from many different fields, including linguistics, computer science, mathematics, electrical engineering, and psychology (Jurafsky and Martin 2009). While NLP was initially based in linguistics (Liddy et al. 2003), many different approaches were developed across these fields, including: symbolic (probabilistic models, neural networks, etc.), stochastic (the Bayesian method, hidden Markov models, etc.), logic-based, natural language understanding (the development of semantic representations for language), and discourse modelling methods (Jurafsky and Martin 2009). However, since the 1990s empirical methods (probabilistic, statistical, and machine learning) have become increasingly popular and are now considered standard throughout the fields associated with NLP (Jurafsky and Martin 2009; Liddy et al. 2003; Daelemans and Van den Bosch 2006).

The rise of machine learning techniques in NLP can be attributed to several trends occurring in the late 1990s and early 2000s. First, large on-line corpora were becoming readily available, providing researchers with access to large databases of annotated text and lexical resources (Jurafsky and Martin 2009; Liddy et al. 2003). Second, computers with vastly more processing power and storage space were also becoming readily available, allowing the development of systems that were not possible before the 1990s (Jurafsky and Martin 2009; Daelemans and Van den Bosch 2006). Third, it became understood that manually constructed rules could not capture enough complexity for NLP algorithms to be really effective (Liddy et al. 2003); the resultant focus on learning language rules increased the interaction between researchers in the field of NLP and those in the field of machine learning (Jurafsky and Martin 2009). Finally, there was motivation to develop NLP systems that scaled well for practical applications and that could be more easily reused (Daelemans and Van den Bosch 2006).

An example of a recent (symbolic) machine learning method for NLP, proposed by Collobert and Weston (2008), performs multi-task learning in a deep neural network architecture. NLP is commonly separated into sub-tasks in order to make it more manageable by analysing them separately (Collobert and Weston 2008). These sub-tasks include part-of-speech tagging, chunking, parsing, named-entity recognition, word sense disambiguation, semantic-role labelling, named-entity extraction, and anaphora resolution (Collobert and Weston 2008; Jurafsky and Martin 2009). Rather than analyse the sub-tasks separately, Collobert and Weston (2008) have taken the opposite approach, integrating several of the sub-tasks and training their neural network on them simultaneously. In their results, Collobert and Weston (2008) show that multi-task learning can provide state-of-the-art performance even in semantic-role labelling, considered to be the most complex of the tasks they analysed. Their method also demonstrated other benefits, such as fast training speed, applicability to many similar tasks, improved generalisation performance, and the ability to provide state-of-the-art performance when operating on plain text without additional input features (Collobert and Weston 2008).

Another take on machine learning utilises inductive machine learning to guide a deterministic dependency parsing algorithm (Nivre 2007). Dependency parsing is a relatively new method for parsing unrestricted natural language text which aims to construct a dependency graph of the words in a sentence rather than a more traditional parse tree (Nivre 2007). In order to improve its accuracy, Nivre (2007) guides the dependency parser with a parse table built by a probabilistic machine learning algorithm, which predicts the next dependency for any given parser state.
The results of this method achieved close to state-of-the-art parsing accuracy even with the small training set available for Swedish (approximately 10% the size of a common corpus for English) (Nivre 2007).
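Guided deterministic dependency parsing of the kind Nivre describes can be pictured with a small sketch. The following toy (invented for this discussion, not drawn from Nivre's system) is a shift/reduce parser whose next action is chosen by a `guide` function standing in for the learned parse table:

```python
# Toy guided dependency parser: "guide" predicts the next parser action
# from the current state, as the learned parse table would.

def guide(stack, buffer):
    """Hand-written stand-in for the trained guide."""
    if stack and buffer:
        top, nxt = stack[-1], buffer[0]
        if top[1] == "DET" and nxt[1] == "NOUN":
            return "LEFT-ARC"    # determiner depends on the following noun
        if top[1] == "NOUN" and nxt[1] == "VERB":
            return "LEFT-ARC"    # subject depends on the verb
        if top[1] == "VERB" and nxt[1] == "NOUN":
            return "RIGHT-ARC"   # object depends on the verb
    return "SHIFT"

def parse(tagged_words):
    """Deterministically build a dependency graph as (head, dependent) arcs."""
    stack, buffer, arcs = [], list(tagged_words), []
    while buffer:
        action = guide(stack, buffer)
        if action == "LEFT-ARC":         # top of stack depends on next word
            arcs.append((buffer[0][0], stack.pop()[0]))
        elif action == "RIGHT-ARC":      # next word depends on top of stack
            arcs.append((stack[-1][0], buffer.pop(0)[0]))
        else:                            # SHIFT: move next word onto the stack
            stack.append(buffer.pop(0))
    return arcs

sentence = [("the", "DET"), ("customer", "NOUN"), ("buys", "VERB"),
            ("a", "DET"), ("car", "NOUN")]
print(parse(sentence))
# → [('customer', 'the'), ('buys', 'customer'), ('car', 'a'), ('buys', 'car')]
```

In the real approach the guide is induced from a treebank by a probabilistic learner; here it is three hand-written rules, which is enough to show how a deterministic parser can be steered state by state to produce a dependency graph rather than a parse tree.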
In 2007 an analysis of several common machine learning algorithms for NLP was performed by Gao et al. (2007). Five parameter estimation methods, which are core to many statistical approaches to NLP, were analysed and compared: Maximum Entropy with L1 (or Lasso) regularisation, Maximum Entropy with L2 regularisation, Averaged Perceptron, Boosting, and BLasso (Boosting with Lasso regularisation) (Gao et al. 2007). The Boosting and BLasso methods work by selecting a small number of highly useful features, while Maximum Entropy and Averaged Perceptron work by averaging over a large set of less useful features (Gao et al. 2007). Maximum Entropy with L1, Maximum Entropy with L2, and the Averaged Perceptron were found to perform almost identically, with Boosting and BLasso not far behind (Gao et al. 2007). What separates the two approaches greatly, however, is their computation time and the number of iterations needed to train them: Boosting and BLasso were found to be too computationally expensive and needed many more training iterations than the other methods, so they were cut from the second round of tests (Gao et al. 2007).
While machine learning methods have their benefits – including greater coverage of syntax and vocabulary, robustness, greater re-usability, and shorter development times than other methods (Daelemans and Van den Bosch 2006) – the comparative analysis performed by Gao et al. (2007) begins to highlight some of the issues associated with machine learning for NLP. To start with, training requires adequate data sets that must be selected carefully, which demands a large amount of human intervention to achieve the desired results (Steels 2002). In addition, supervised learning requires large annotated data sets that take longer to develop than unannotated data sets, while unsupervised learning is more likely to learn irrelevant concepts (Steels 2002). Furthermore, the results produced by the best machine learning methods are not perfect: Collobert and Weston (2008) present a per-word error rate for semantic-role labelling of 14.3% as state-of-the-art, while 3% error is common for part-of-speech tagging. So it seems there is still a way to go for the more complex tasks.
Figure 2.1: Process Overview (Kleiner et al. 2009)

These issues have led researchers to tackle NLP from different angles, such as that taken by Kleiner et al. (2009). Since they focus on the analysis and transformation of natural language specifications into formal representations (such as UML class diagrams), Kleiner et al. (2009) have taken an MDE approach to NLP.
This approach uses model transformations to parse the (controlled) natural language, produce a terminal model in Semantics for Business Vocabulary and Rules (SBVR), a controlled language for writing conceptual schemas, and create a UML (class diagram) model of the specified conceptual schema (Kleiner et al. 2009). In addition, they utilise configuration as an advanced model transformation in the parsing of natural language (Kleiner et al. 2009). Configuration is a general constraint-based search technique (Estratat and Henocque 2004). In the context of MDE, the constraints that the solution(s) must satisfy describe conformance to the target meta-model (Kleiner et al. 2009). As all possible solutions are found, the result is a non-deterministic parser of controlled languages (Kleiner et al. 2009). An overview of the process developed by Kleiner et al. (2009) is shown in Figure 2.1.
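The idea of configuration as constraint-based search can be illustrated with a toy sketch (invented here, not Kleiner et al.'s actual configurator): every labelling of a sentence that satisfies a pair of "meta-model" constraints is kept, and because all conforming solutions are returned, the result is non-deterministic.

```python
# Toy configuration: exhaustively search the labelling space and keep
# every assignment that conforms to the constraints.
from itertools import product

# A toy lexicon: each word maps to all of its possible lexical categories.
LEXICON = {
    "the": {"DET"},
    "time": {"NOUN", "VERB"},
    "flies": {"NOUN", "VERB"},
}

def conforms(tags):
    """Toy conformance constraints standing in for a syntax meta-model:
    exactly one verb, and every determiner immediately precedes a noun."""
    if tags.count("VERB") != 1:
        return False
    return all(tags[i + 1] == "NOUN"
               for i, t in enumerate(tags[:-1]) if t == "DET")

def configure(words):
    """Return every labelling of the words that conforms."""
    choices = [sorted(LEXICON[w]) for w in words]
    return [list(zip(words, tags))
            for tags in product(*choices) if conforms(list(tags))]

for model in configure(["time", "flies"]):
    print(model)        # two labellings conform: the parse is ambiguous
```

For "time flies" both noun-verb and verb-noun labellings conform, so two solutions survive; adding "the" in front prunes the space to a single model, showing how the constraints drive disambiguation.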
In this process the natural language text is injected into a simple model representing the words in a sentence, called Ordered Words. Next, it is combined with a Lexicon model in a model transformation that performs lexical analysis on the text. This model transformation produces a model called Labelled Words, which is a relaxed version of the Syntax model. At this point the Labelled Words model is the original text labelled with all possible lexical information from the Lexicon model. It then undergoes configuration, which performs the syntactic and grammatical analysis by searching for one or more models conforming to the Syntax meta-model. The produced Syntax model(s) are effectively the input text with part-of-speech tagging and chunking performed on it: part-of-speech tagging labels words with their syntactic role, such as noun or verb, while chunking labels the parts of a sentence with their syntactic constituents, such as noun phrase or verb phrase (Collobert and Weston 2008). In addition, the Syntax model has a similar structure to a dependency graph as described by Nivre (2007). After this, the Syntax model(s) undergo another model transformation to produce an SBVR model, and finally another model transformation produces a UML class diagram.
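The lexical analysis step can be pictured with a small sketch. The category codes follow the lexicon meta-model abbreviations listed in the front matter (NCat, TVCat, and so on), but the data structures and function names are invented for illustration, not the actual transformation:

```python
# Toy lexical analysis: combine an Ordered Words model with a Lexicon
# model to produce a Labelled Words model carrying every possible category.

# Toy Lexicon model: word -> set of lexical categories.
lexicon = {
    "each": {"QUnvaluedCat"},
    "customer": {"NCat"},
    "rents": {"TVCat", "NCat"},   # ambiguous: transitive verb or plural noun
    "a": {"IADCat"},
    "car": {"NCat"},
}

def lexical_analysis(ordered_words, lexicon):
    """Label each word with all lexical information from the lexicon;
    disambiguation is deferred to the later configuration step."""
    return [(position, word, sorted(lexicon.get(word, set())))
            for position, word in enumerate(ordered_words)]

labelled = lexical_analysis(["each", "customer", "rents", "a", "car"], lexicon)
for entry in labelled:
    print(entry)
```

Note that "rents" keeps both of its categories at this stage; it is the subsequent configuration step, searching for models that conform to the Syntax meta-model, that decides which label survives.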
This process shows promising results for the processing of (controlled) natural language. However, the approach is limited by its need for a manually predefined lexicon. The research presented in this thesis attempts to alleviate this issue by replacing the manually predefined lexicon with one developed dynamically from the input text.
2.2 Model Driven Engineering

In order to integrate the proposed dynamically built lexicon with the MDE method of NLP proposed by Kleiner et al. (2009), a basic understanding of the concepts of MDE is required. There are four main concepts in MDE: (terminal) models, meta-models, meta-meta-models, and (model) transformations (Favre and Nguyen 2005; Jouault and Bézivin 2006; Kleiner et al. 2009). The concepts of model, meta-model, and meta-meta-model form a three-level architecture, with the levels referred to as M1, M2, and M3, respectively (as can be seen in Figure 2.1). A model is a representation of a system and conforms to a meta-model (Favre and Nguyen 2005; Jouault and Bézivin 2006; Kleiner et al. 2009). The meta-model is essentially a model of a model: the definition of a language (itself represented as a model) that can be used to define a terminal model
(Favre and Nguyen 2005; Jouault and Bézivin 2006; Kleiner et al. 2009). In addition, the meta-model conforms to a meta-meta-model, which is a language that allows the definition of meta-models and has the added property of conforming to itself (Jouault and Bézivin 2006; Kleiner et al. 2009). These relationships can be seen in Figure 2.1: for example, the Ordered Words (terminal) model conforms to the meta-model MM-OW, which conforms to the meta-meta-model Ecore (the meta-meta-model of the Eclipse Modelling Framework (EMF) (Jouault and Bézivin 2006)). Another example is Extensible Markup Language (XML), where an XML document is a model, an XML Schema is a meta-model, and the schema for XML Schemas is the meta-meta-model (Favre and Nguyen 2005; Jouault and Bézivin 2006).

The fourth concept of MDE, and arguably the most important, is that of model transformations. A simple and clear definition of model transformations is that they are the process of producing one or more output models from one or more input models (Sendall and Kozaczynski 2003). An example of a model transformation can be seen in Figure 2.1, where the SBVR model is transformed into the CS (class diagram) model. However, the source and target meta-models do not need to be different (Kleiner et al. 2009), and since meta-models and meta-meta-models are still models, they can also be transformed in any combination with each other or with terminal models (Favre and Nguyen 2005). For example, extracting an XML Schema from an XML document is a model transformation where the source model is an XML document and the target model is an XML Schema – a model and a meta-model, respectively (Favre and Nguyen 2005).
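The XML example can be made concrete with a short sketch: the source model is an XML document and the target is a rudimentary "schema" recording which element types may contain which. This is only an illustration of the idea of a transformation that crosses modelling levels, far short of real XML Schema extraction:

```python
# Toy model transformation: derive a rudimentary schema (meta-model)
# from an XML document (model).
import xml.etree.ElementTree as ET

DOC = """
<library>
  <book><title>Thesis</title><year>2010</year></book>
  <book><title>Paper</title><year>2009</year></book>
</library>
"""

def extract_schema(xml_text):
    """For each element type, record the set of child element types
    observed beneath it."""
    schema = {}

    def visit(elem):
        schema.setdefault(elem.tag, set())
        for child in elem:
            schema[elem.tag].add(child.tag)
            visit(child)

    visit(ET.fromstring(xml_text))
    return schema

print(extract_schema(DOC))
# e.g. 'library' may contain 'book'; 'book' may contain 'title' and 'year'
```

The transformation reads an M1 artefact and produces (part of) an M2 artefact, which is precisely why the source and target of a model transformation need not sit at the same modelling level.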
2.3 Emergent Semantics

The inspiration for the research presented in this thesis comes from emergent semantics and, specifically, the use of semiotic dynamics in a manner similar to that described by Steels and Hanappe (2006). In their paper, Steels and Hanappe (2006) propose a system for the interoperability of collective information systems, such as peer-to-peer music sharing, that utilises information agents as mediators between the users, their data, and other information agents. These information agents support the semantic interoperability of disparate information systems by developing a set of categories and associated labels linked to the user's own organisation through the use of classifiers, giving the information agents' categories and labels grounded semantics (Steels and Hanappe 2006). The key to the semantic interoperability is the utilisation of the semiotic relationship between the data, its labels, and their categories, in conjunction with the communication between information agents and between information agents and the users (Steels and Hanappe 2006). The communication between agents allows them to arrive at agreement on the use of the labels and categories they develop, while the feedback provided by the user allows them to keep their labels and categories grounded in the user's semantics (Steels and Hanappe 2006). The negotiation and updating of the labels and categories produces an emergent environment that undergoes constant evolution (Steels and Hanappe 2006).

Abstracted from any specific system, emergent semantics is a form of semantic interoperability that depends on negotiations between agents to form agreements on a common interpretation within the context of a given task (Aberer et al. 2004). At any point in time, the state of a system using emergent semantics depends on the frequency, quality, and efficiency with which these negotiations are made (Aberer et al. 2004). According to Aberer et al. (2004), there are three major principles of emergent semantics: agreements act as a semantic handshake protocol, agreements emerge from negotiations, and agreements emerge from local interactions. The first two mean that agreement between agents is a set of mutual beliefs that is arrived at, and constantly verified, through the exchange of information during a negotiation (Aberer et al. 2004).
The third principle, and arguably the most important for the large distributed environments, such as the Internet, targeted by emergent semantics, restricts agreements to pair-wise local interactions rather than having agents attempt to form agreements among large groups simultaneously (Aberer et al. 2004). Another key concept of emergent semantics is that the constant negotiation and local consensus generate self-organisation in the global community, in effect producing a global agreement on the semantics in any particular context (Aberer et al. 2004).

A large portion of the work on semantic interoperability is performed in the context of the Semantic Web, as described by Lee et al. (2001). The Semantic Web is essentially an extension of the Internet that provides it with meta-data, allowing it to be more readily processed by machines instead of being targeted solely at human consumption (Lee et al. 2001; Nack 2002; Maedche 2002). To this end, a great deal of work has gone into developing large standard ontologies, knowledge bases, and/or lexicons using formal semantics (Steels and Hanappe 2006). Some examples are CYC (Lenat 1995), WordNet (Miller 1995), and EDR (Yokoi 1995). However, these efforts are not emergent; they are manually constructed, whereas the aim of the research presented in this thesis is to dynamically develop lexicons in an emergent system.

Previous work by Steels (1998) dynamically built lexicons and ontologies; however, this was done to investigate the development of language between autonomous agents (Steels 1998). As a result, the lexicons contained words that were made up by the agents rather than existing words provided to them. Work that is more similar to the goal of the research presented in this thesis has been performed by Maedche and Staab (2001) in the learning of ontologies. They presented a system that can learn ontologies from free text, dictionaries, and legacy ontologies (Maedche and Staab 2001). The aim of their system, however, is to assist a human designer in developing a final ontology through semi-automatic, iterative methods (Maedche and Staab 2001), whereas the ultimate goal of the research presented in this thesis is to fit into a fully autonomous method of processing natural language text. Furthermore, the focus of the research presented in this thesis is on building lexicons, a small aspect of the system developed by Maedche and Staab (2001), which would then assist in automated NLP. In fact, the research presented in this thesis may aid in making their system more automated, as their system covers a larger portion of NLP for ontology learning and its incorporated lexicon development appears to be a very manual process.
Finally, the two could be combined in the future to produce a system that feeds back into itself to learn and refine the ontologies by itself (for example, updates to the lexicon cause updates in the ontology and vice versa), without the aid of a human designer.
2.4
Summary
The examination of the literature has revealed that very little work similar to the research presented in this thesis has been undertaken. This can be attributed partly to this research being aimed at integrating into the NLP process of Kleiner et al. (2009), and partly to the focus of a large portion of current NLP research being on machine learning methods. Furthermore, the literature review has shown how the Lexicon Builder system fits into the areas of NLP, MDE and Emergent Semantics: it is intended to be integrated with an NLP process that utilises MDE, and principles of Emergent Semantics are used in the Lexicon Builder's design. Finally, some of the benefits the research presented in this thesis may provide to these fields have been outlined, including enhancing existing NLP methods and supporting and leveraging other research into Emergent Semantics. In the next chapter, the methods and approach to designing and implementing the Lexicon Builder system are described.
Chapter 3

Design and Implementation of the Lexicon Builder

This chapter outlines the approach that was taken to develop and evaluate the Lexicon Builder experimental system, followed by a description of each step of the process. Section 3.1 relates this research to the types of research, results, and validations proposed by Shaw (2002). Section 3.2 discusses the methodology used for this research, including how an external data source was selected, the way in which the Lexicon Meta-Model was designed, the interactions between agents, and the tests performed on the system.
3.1
Type of Research, Results, and Validations
This research takes an experimental, system-oriented approach that builds a system and then evaluates how well it works empirically. According to the classification of software engineering research types proposed by Shaw (2002), this research is a cross between a Method – 'How can we do/create (or automate doing) X?' – and Feasibility – 'Can we do it at all?' – producing a hybrid type that asks 'Can we do X in this way?'. Keeping with the classifications of Shaw (2002), the result is a prototype Specific Solution and will be validated by a combination of Analysis and Example.
3.2
Methodology
This research was executed using a basic process consisting of three phases: design, build, and test and validation. Due to time constraints, the use of an iterative Software Development Life Cycle, or a life cycle with a heavy emphasis on analysis, was not feasible. As a result, the design phase incorporated as much analysis as was required, but no more than necessary. This resulted in some issues being identified but not fixed; however, possible solutions have been proposed. The following sections provide a detailed description of the work that was carried out during each phase in order to develop and evaluate the experimental system.

The key concepts used in the experimental system stem largely from those described by Steels and Hanappe (2006). Such concepts include the use of a multi-agent system, labels and their association to lexicons using strengths, classifiers to compare lexicons, and local interactions between agents with constant updates to lexicons and their associations to produce the emergent semantics. In addition, the classifiers for comparing lexicons utilise the concept of semiotic dynamics similar to that described by Steels and Hanappe (2006). In this case, the intuition is that the words used in a specific domain will be used in a particular way, i.e. they will have certain lexical categories. This helps to identify where two lexicons may be equivalent when attempting to retrieve information to support their automated development. As a result of these concepts, the Lexicon Builder is designed as a multi-agent system to support the sharing of domain specific lexical information and to aid in developing consistency across projects in the same domain. The main idea is to inject the list of words from a given text into the Lexicon Builder and have their lexical information retrieved from other agents if possible, or from external data sources accessible via the Internet if not.
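The intuition behind the classifier can be illustrated with a minimal sketch (the class and method names are illustrative, not the thesis's implementation): each lexicon is represented as a map from words to their sets of lexical categories, and the fraction of shared words that are used with the same categories is taken as the measure of compatibility.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Minimal sketch of a lexicon classifier: two lexicons are considered
 * compatible when the words they share are used with the same lexical
 * categories. All names here are illustrative assumptions.
 */
public class LexiconClassifier {

    /**
     * Returns the fraction of shared words whose category sets agree.
     * 1.0 means every common word is used the same way; 0.0 means there
     * is no common usage (or no common words at all).
     */
    public static double similarity(Map<String, Set<String>> a,
                                    Map<String, Set<String>> b) {
        Set<String> common = new HashSet<>(a.keySet());
        common.retainAll(b.keySet());
        if (common.isEmpty()) {
            return 0.0;
        }
        long agreeing = common.stream()
                .filter(w -> a.get(w).equals(b.get(w)))
                .count();
        return (double) agreeing / common.size();
    }

    /** Two lexicons "match" when their similarity passes a threshold. */
    public static boolean matches(Map<String, Set<String>> a,
                                  Map<String, Set<String>> b,
                                  double threshold) {
        return similarity(a, b) >= threshold;
    }
}
```

The choice of threshold, and of a more sophisticated similarity measure, is exactly what the design leaves open to individual implementations.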
This is to assist in the development of the lexicons and reduce the amount of manual input required. After retrieving the information, user feedback is necessary to ensure the desired information has been retrieved (whether it is from an external source or a lexicon that another agent considers to be matching); however, one of the long-term goals is to have this process completely automated, possibly by taking the feedback from the information the agent already contains.

[Figure 3.1: Lexicon Builder Overview – the user injects the text into a Local Agent and provides feedback on the retrieved data; the Local Agent and each Remote Agent (1 to n) consist of a Lexicon Agent, a Network Manager, and a Lexicon Manager, and the agents can consult external data source(s) – currently only The Free Dictionary.]

The basic components of the Lexicon Builder system, and how they are connected, are displayed in Figure 3.1. The components are described in detail in Section 3.3.4. Each agent maintains associations with lexicons that are constantly updated based on interactions with both users and other agents. The way in which the lexicons and associations are updated depends on the results of comparing lexicons using classifiers – due to time constraints the current Lexicon Builder has a very simple classifier built in; however, there is the potential to use various classifiers depending on the situation. Depending on the nature of the interactions, the lexicons used by agents may remain synchronised, diverge, or converge over time. This constant updating of lexicons through interactions produces the emergent semantics and provides the potential for global agreement to be reached on the contents of a domain specific lexicon. There is also the possibility that a larger, more general lexicon may evolve over time, but this is not the intended goal of the system.

The development was performed using the Eclipse IDE and the Java programming language. This provided a couple of benefits: (1) simpler implementation than other languages due to experience with Java and its expansive standard API; and (2) simpler
integration with the NLP process developed by Kleiner et al. (2009), which was also developed in the Eclipse environment. Furthermore, the lexicon is implemented as a meta-model using the EMF. Implementing the lexicon as a meta-model supports much simpler integration with the NLP process, making it a simple matter of replacing the currently used lexicon meta-model with the new one. This is where the design of the lexicon is important in keeping the new meta-model compatible with the original. Lastly, requiring that the agents store their lexical data as a terminal model should not incur any penalty to the development time over developing a custom plain-text file type, using XML, or utilising a database such as SQLite, as there is an extensible API in EMF for working with models.
3.3
Design Phase
This section describes the tasks that were undertaken in designing the experimental Lexicon Builder system. Firstly, it documents the selection of a suitable external data source, followed by the development of the Lexicon Meta-Model used to store the lexical information, the design of the interactions between the agents, and finally the structure of the Lexicon Builder system as a whole.
3.3.1
External Data Source Selection
There were several important criteria used to evaluate possible lexical data sources for use as part of the Lexicon Builder system. These include:
1. Number of entries in the data source,
2. Coverage of lexical categories,
3. Detail of the entries,
4. Ease with which the data can be extracted, and
5. Cost.
A suitable lexical data source needs to contain a large number of entries, otherwise the Lexicon Builder system would appear to be ineffective simply due to an unsuitable data source being used. Similarly, the entries in the data source need to cover as many categories of the language as possible, such as noun, verb, determiner, etc., in order to provide data for as many words as possible. The detail of the entries is an important criterion as there are many categories and subcategories of words that define the way in which a word can be used; for example, a verb could be a transitive verb or an intransitive verb. The more specific the word categories associated with words in the lexicon are, the better the results should be from the transformations in the process defined by Kleiner et al. (2009). Furthermore, being able to extract the data from the source quickly and easily was a major concern for this research due to the very short time available to implement the experimental system. Finally, a free data source was important as this research is very preliminary, which would make any money spent a waste if a system could not be developed at all.

Initially, the scholarly data sources CYC (Lenat 1995), WordNet (Miller 1995), and EDR (Yokoi 1995) were investigated as possible options. Unfortunately, they were discounted for various reasons. CYC, for example, was originally aimed at creating a general purpose knowledge base encompassing everyday commonsense and general concepts in order to serve as the foundation for Natural Language Understanding and Machine Learning later in its development (Lenat 1995). Although CYC now incorporates a Natural Language subsystem including a lexicon (Cycorp Inc. 2010), it is a large and complex system, making it unsuitable for use as part of this initial Lexicon Builder system. Although WordNet is designed to be an on-line lexical database (Miller 1995), it was excluded as it does not cover enough categories of language (it covers only nouns, verbs, adjectives, and adverbs). In addition, it does not appear to have the detail in its word categories required by the process developed by Kleiner et al. (2009), and hence by the Lexicon Builder system. EDR was excluded as its English lexicon is lacking compared to its Japanese lexicon (Lenat et al. 1995; NICT 2007) and it must be purchased (NICT 2007). Although these databases were ruled out for the current experimental implementation of the Lexicon Builder, investigation into their use in future work is highly recommended.

As the scholarly data sources were not suitable, an investigation of on-line versions of traditional dictionaries, such as the Oxford (Oxford University Press 2010) and Macquarie (Macmillan Publishers Australia 2010) dictionaries, was undertaken. Unfortunately
such dictionaries were found to be generally unsuitable for the Lexicon Builder system. For example, the on-line version of the Oxford dictionary does not provide enough detail in its word categories (in order to keep it accessible to everyone) and requires a subscription to access the majority of its entries. Similarly, the Macquarie dictionary, and many others, require a subscription to access any of their entries.
As a result of the unsuitability of the official dictionaries, several free on-line dictionaries were surveyed. To start with, Wiktionary (Wikimedia Foundation 2010a), a free dictionary companion to the Wikipedia encyclopedia, was considered. It promises a large number of entries, completely free use of the dictionary, and wide coverage of the English language (and potentially other languages as well). However, the level of detail in the categories associated with words is highly variable and, although Wiktionary has strict layout conventions (Wikimedia Foundation 2010b), the structure of the entries was often quite different, potentially making it very difficult to extract the information. Eventually, The Free Dictionary (Farlex 2010) was discovered, which meets most of the criteria adequately. It is free, contains a large number of entries with good coverage of categories, has a high level of detail for most entries in the dictionary, and the format of the entries appeared to be consistent (although it was found to be not as consistent as first thought). In addition, The Free Dictionary has information from two sources: the American Heritage Dictionary of the English Language and the Collins English Dictionary – Complete and Unabridged. This had the potential benefit of being able to include multiple sources in the Lexicon Builder system from the one place, if time permitted. Finally, The Free Dictionary also has the useful feature of redirecting derived words to their root word, e.g. "sells" is redirected to "sell", which makes it simpler to link words to their base designation as required by the process developed by Kleiner et al. (2009).
The use of multiple data sources was considered; however, it was determined to be impractical to incorporate multiple sources in the time available to perform this research – even for the multiple dictionaries available from the one site. Therefore, a search for the best possible single source was performed, and the addition of more data sources has been left for future work to investigate.
3.3.2
Lexicon Meta-Model
The development of the Lexicon Meta-Model was driven by two needs: compatibility with the process described by Kleiner et al. (2009), and the information required by the agents in the system. To meet the first need, the basic Lexicon Meta-Model was developed using the EMF, based on the meta-models from Configuring Models for (Controlled) Languages (Kleiner et al. 2009). Apart from integration with the process developed by Kleiner et al. (2009), using the EMF had the additional advantage of being able to automatically generate Java code for the elements defined in the Lexicon Meta-Model. This allowed the rapid development of a major aspect of the Lexicon Builder.
The basic structure of the Lexicon Meta-Model is as follows: a Lexicon contains any number of Words, which have a Designation (this is used by the process to link concepts) and any number of Cats – i.e. categories, which can be NCats (nouns), VCats (verbs), DCats (determiners), etc. Words and Cats are SyntaxElements, which, along with Designations, are Elements. This relationship provides all Words, Cats and Designations with an objectName attribute in order to identify instances of these objects. These relationships are shown in Figure 3.2.
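Stripped of its EMF scaffolding, the containment structure just described can be sketched in plain Java. This is only an illustration of the relationships; the actual classes are generated by EMF from the meta-model, and only the attributes discussed here are shown.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the Lexicon Meta-Model containment structure.
// The real classes are EMF-generated; names follow the meta-model.
class Element { String objectName; }           // common superclass with objectName
class Designation extends Element { }          // links words to concepts
class SyntaxElement extends Element { }        // supertype of Word and Cat
class Cat extends SyntaxElement { }            // a lexical category
class NCat extends Cat { boolean plural; }     // noun category
class VCat extends Cat { boolean plural, passive; } // verb category
class DCat extends Cat { }                     // determiner category

class Word extends SyntaxElement {
    Designation baseDesignation;               // [0..1] link to the base concept
    List<Cat> wordCat = new ArrayList<>();     // [1..*] categories of this word
}

class Lexicon {
    List<String> labels = new ArrayList<>();    // system-wide lexicon identifiers
    List<String> ownerTags = new ArrayList<>(); // user-assigned names
    List<Word> words = new ArrayList<>();       // [0..*] contained words
}
```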
The second need was met by updating the basic Lexicon Meta-Model as required during the development of the interactions between agents. This resulted in only two additions to the Lexicon Meta-Model: the labels and ownerTags attributes of the Lexicon object, which can be seen in Figure 3.2. The labels are the strings used by the agents to uniquely identify lexicons, while the ownerTags are the strings used by people to refer to certain lexicons. These attributes were added to facilitate storing this information with the lexicon itself, rather than keeping separate meta-data files about the lexicons an agent manages. Storing this information with the lexicon it is associated with should also make it easier to move lexicons around if required, and aid in recovery if the agent fails and needs to be restarted, as it would be a simple matter of reloading all of the lexicon files from a directory. Since the model transformations used later in the process do not refer to these attributes, their addition will cause no issues when the Lexicon Builder is eventually integrated into the process.
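Because the labels travel with the lexicon file itself, restarting a failed agent can be sketched as a simple directory scan. This is purely illustrative: the real agent would load each file as an EMF terminal model and read its labels attribute, whereas here the first line of each file stands in for the label, and the class name is an assumption.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative recovery sketch: rebuild a label -> lexicon-file map by
 * reloading every lexicon file in the agent's data directory. The real
 * agent would parse each file as an EMF model; here the first line of
 * the file stands in for the stored label.
 */
public class AgentRecovery {
    public static Map<String, Path> reload(Path dir) {
        Map<String, Path> byLabel = new HashMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.lexicon")) {
            for (Path file : files) {
                String label = Files.readAllLines(file).get(0); // stand-in for the labels attribute
                byLabel.put(label, file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return byLabel;
    }
}
```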
[Figure 3.2: Lexicon Meta-Model – a class diagram of the lexicon::model package. It is based heavily on the syntax elements of the Kleiner et al. (2009) Syntax model in order to keep it consistent and simplify the model transformations later, with some additional elements required for the communication of the Lexicon Builder agents. Notes from the diagram: TVCat – transitive verb; ITVCat – intransitive verb; ADCat – article determiner; IADCat – indefinite article determiner; DADCat – definite article determiner; QCat – quantified determiner. A Designation allows linking of plural forms and tenses to a single element that can be checked for equivalency; Kleiner et al. also appear to use it with SBVR-specific semantics, e.g. the base designation of 'sells' is 'to sell'. The 'terminal' attribute is not present as all lexicon entries are terminal, even though some may be word combinations such as 'at least'. The 'passive' attribute of VCat denotes the voice ('true' is passive, 'false' is active), while the 'genre' attribute of NCat appears to be unused in the Syntax meta-model. The 'labels' attribute is used by the system to keep track of the labels associated with the lexicon (the strengths are maintained by the agents themselves, or could be encoded into the values for failure recovery), and 'ownerTags' holds any names assigned to the lexicon by the user for future reference; this may be important for the user to specify that a particular lexicon should be updated rather than a new lexicon created because there are too many changes.]
3.3.3
Agent Interactions
The interactions between agents in the Lexicon Builder system were based on the scenarios described in the paper by Steels and Hanappe (2006), Interoperability Through Emergent Semantics: A Semiotic Dynamics Approach. However, they needed some modification, as the nature of the Lexicon Builder is reversed compared to the system described by Steels and Hanappe (2006). While Steels and Hanappe (2006) compare meta-data to decide which data should be transferred between systems, the Lexicon Builder system needs to compare the data itself (i.e. the words) as well as the meta-data (the categories and base designation). Furthermore, the Lexicon Builder attempts to notify other agents about changes to the lexicons it knows about, as otherwise there would be no opportunity for lexicons to evolve outside of the agent being changed after the initial sharing of data.

There are three major scenarios in the Lexicon Builder system. All three involve a user, an agent that the user interacts with (referred to as the local agent), and an agent that the local agent is interacting with (the remote agent). In addition, each scenario is instigated by an action of the user, typically the retrieval of lexical data for a specified word list, and incorporates feedback from the user. This is an important aspect for the moment, as it is required to ensure the dynamically developed lexicon is grounded in the user's meaning. In the long term, however, the goal is to reduce, or eliminate altogether, the requirement for the user to provide feedback. Finally, these scenarios only cover the creation and updating of lexicons, and are intentionally quite unrestricted to allow various implementations to be developed. Due to the openness of the scenarios, additional scenarios to meet any identified needs can easily be developed in future work.

3.3.3.1
Scenario 1
The first scenario describes the initial retrieval of data for a lexicon that is newly injected into a local agent and has no matching lexicon at a remote agent. This scenario is displayed as a UML sequence diagram in Figure 3.3. Basically, it starts with the user requesting the lexical data (provided as a list of words) for a lexicon from a local agent. As the agent cannot find a matching lexicon, it creates a new one with the given word
list, associates a new label with it, and searches for possibly matching lexicons from a remote agent. Similarly, the remote agent cannot find a matching lexicon among the lexicons it handles, so it creates a new lexicon from the word list and associates the provided label with it. The remote agent then attempts to retrieve the lexical data from another source and returns it to the original agent after updating its own version of the lexicon. Finally, the local agent updates its lexicon and returns the results to the user, allowing them to make corrections, which are then fed back to the remote agent so that the lexicon is grounded in the user's domain.

The retrieval of the lexical data could be through either another agent, using a recursive Get Lexicon search, or an external data source such as The Free Dictionary. It is envisioned that the system has some method of discovering other agents, either through a Peer-to-Peer protocol or an agent registry, that allows them to search for matching lexicons, ending at an agent able to retrieve information from an external source if no information can be found within the system. In addition, by forwarding the search through several agents and returning the results back through them, the number of agents that know about a lexicon is increased, therefore increasing the probability of future searches finding possible matches. Leaving open how the data retrieval works allows different implementations to be developed, as the best approach is not currently known, nor is finding it a part of this research. Similarly, the classifier used to determine whether a lexicon is matching is not restricted by the definition of this scenario and can be chosen freely. The classifier used for the implementation in this research is described in Section 3.4.
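The new label that the local agent associates with a freshly created lexicon can be generated without any coordination between agents. A minimal sketch using Java's built-in UUID support follows (the class name is illustrative; nothing in the scenario mandates this particular scheme):

```java
import java.util.UUID;

/**
 * Minimal sketch of label generation for a newly created lexicon.
 * Any globally unique scheme would do; UUIDs avoid any need for
 * coordination between agents when creating labels.
 */
public class LabelFactory {
    public static String newLabel() {
        return UUID.randomUUID().toString();
    }
}
```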
Furthermore, the method used to create labels is not strictly defined, although the use of Universally Unique Identifiers (UUIDs) or Globally Unique Identifiers (GUIDs) is recommended. Finally, as it currently stands, this scenario does not correctly take into account the situation where a matching lexicon is eventually found. This is due to the openness with which the retrieval of data is defined; however, a future update should clarify this eventuality to ensure consistency across the system. A simple way to handle this case would be to incorporate Scenario 2, so that the correct labels are associated with the lexicons and truly matching lexicons (identified after the user feedback) are
associated with one another.

[Figure 3.3: Scenario 1 – New Lexicon and Matching Lexicon Not Found Remotely. Notes from the diagram: this is the Lexicon Builder main scenario (a new lexicon for both agent 1 and agent 2); a classifier is used to determine whether the given words are compatible with an existing lexicon, which always fails in this scenario for both agents; labels are produced by a UUID/GUID generator; the lexical data retrieval may, depending on the implementation, be to another agent or to some other source of lexical data such as a dictionary or lexical database (e.g. WordNet) – in the example system it is always to a source of lexical data (an on-line dictionary), while a production system would use a P2P protocol or agent registry for agent discovery.]

3.3.3.2
Scenario 2
The second scenario is displayed as a UML Sequence Diagram in Figure 3.4 and describes how the system should behave when a new lexicon is introduced at the local agent, but a matching lexicon is found at a remote agent. It begins in the same way as Scenario 1: with the user requesting the lexicon for a list of words, and the local agent, unable to find a matching lexicon, searching for one at a remote agent. In this scenario, however, the remote agent contains a matching lexicon and returns it immediately. The local agent then updates its lexicon with the returned information
and associates the label of the pre-existing lexicon with its updated lexicon. The association of this label is then strengthened so that the local agent uses the same label as other agents with the same lexicon, in order to maintain consistency with the rest of the system; otherwise, future queries for the same lexicon based on the label would not find any matches. The lexical data is then presented to the user, allowing them to make corrections, which are fed back to the remote agent so that the lexicon can be updated and evolve. Depending on the number of corrections the user made, the remote agent may or may not update its lexicon, choosing instead to spawn a new lexicon. This would result in the agent having both its original lexicon and the lexicon created by the local agent, (hopefully) utilising as much of the returned information as possible. Like various aspects of Scenario 1, the exact number of differences is not defined, in order to leave implementations free to define their own appropriate classifiers.
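The remote agent's choice at the end of this scenario can be sketched as a simple threshold test. The names and the correction ratio are illustrative assumptions; as noted above, the scenario deliberately leaves the exact criterion to the implementation.

```java
/**
 * Sketch of the remote agent's choice after user feedback in Scenario 2:
 * few corrections -> fold them into the existing lexicon; many -> spawn a
 * new lexicon so the original is not dragged into a different domain.
 * The threshold is illustrative and implementation-defined.
 */
public class FeedbackPolicy {
    public enum Action { UPDATE_EXISTING, SPAWN_NEW }

    public static Action decide(int corrections, int lexiconSize,
                                double maxCorrectionRatio) {
        // An empty lexicon gives no basis for comparison, so treat it as
        // maximally different and spawn a new lexicon.
        double ratio = lexiconSize == 0 ? 1.0
                                        : (double) corrections / lexiconSize;
        return ratio <= maxCorrectionRatio ? Action.UPDATE_EXISTING
                                           : Action.SPAWN_NEW;
    }
}
```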
Although this seemed appropriate initially, it was later realised that the end of this scenario is defined backwards. For example, by having the local agent update its label to match the remote agent as soon as it receives the lexical data, the case where the returned lexicon is not actually what the user means is not taken into account. Furthermore, pushing the creation of the new lexicon onto the remote agent has the potential to prevent the original lexicon it was maintaining from being overridden by this new lexicon that is not in the same domain (or just not similar enough). This scenario should be more similar to Scenario 3, in that it should check the number of differences locally before deciding whether or not to associate the remote agent's label with the lexicon and update the remote agent. Finally, if the number of differences is very large and the user could indicate this in their feedback, it would allow the local agent to immediately search elsewhere rather than forcing the user to update a completely incorrect lexicon manually.
It has also been noticed that, in its current form, this scenario does not allow the data for additional words to be retrieved from an external data source. This could be a major issue depending on how close the lexicons need to be before it is decided that they match.
[Figure 3.4: Scenario 2 – New Lexicon and Matching Lexicon Found Remotely. Notes from the diagram: this is the Lexicon Builder alternate scenario 1 (a new lexicon for agent 1, with a matching lexicon found at agent 2); agent 1's classifier check for an existing compatible lexicon fails, while the same check at agent 2 succeeds; the returned label is the one used by agent 2 – since the lexicon already exists elsewhere and agent 1 only just created its copy, agent 1 should make itself consistent with the others; after user feedback there are two alternatives – few corrections likely means the user is working in the same domain but uses a few extra words, for example, whereas otherwise the lexicon provided earlier is probably for a different domain entirely, so a new lexicon is created consistent with the feedback and more strongly associated with the label.]
3.3.3.3
Scenario 3
The third and final scenario describes how the Lexicon Builder system updates lexicons in order to provide the emergent characteristics of the system. In addition, it allows the updating of the association strengths to deal with lexicon synchronisation, divergence and convergence. If this scenario were not defined, all updates would only be performed at the local agent after the initial retrieval of lexical information from elsewhere. Scenario 3 is displayed as a UML Sequence Diagram in Figure 3.5.

Like Scenarios 1 and 2, Scenario 3 begins with the user requesting the lexicon for a list of words. Unlike Scenarios 1 and 2, though, a matching lexicon is found at the local agent. The user then provides updates to the lexicon, which are incorporated by the local agent if they are within an appropriate range; otherwise a new lexicon with a new label is created. If the changes are incorporated, the local agent then attempts to find other agents that know about the same lexicon (based on the label associated with it) and determines a set of corrections, or differences, between the lexicon of the remote agent and its own. Depending on the number of differences, the local agent will either update itself and forward the corrections to the remote agent in order to keep their lexicons synchronised, or it will diverge the two lexicons by creating a new label, associating it with the lexicon, and weakening the association of the old label.

When two lexicons diverge they can converge again, as the old label is still associated. Therefore, when another agent has its lexicon updated to be closer to a lexicon that has previously diverged, the set of differences will be smaller, resulting in the lexicons being resynchronised. This is important because lexicons in the same domain will likely evolve in a similar direction as new terms are incorporated into the specifications they are derived from.
In addition, if an agent is not accessible at the time of an update, the next update may return too many differences, causing the lexicons to diverge; however, if the lexicon is later updated to more closely match the lexicon that diverged, they will resynchronise. If they really do refer to different domains, when the second agent compares the diverged lexicon to its own it will find there are too many differences and create a new label itself, preventing future updates from comparing the two lexicons.

Although it initially appeared reasonable, this scenario has a few errors. First, when new words are added, their lexical data is not automatically retrieved. This could pose a major issue: for example, if a new version of a specification is written that adds another aspect of functionality, and a large number of new words with it, they would all need to have their lexical data added manually. With the ultimate goal being to completely automate the process of transforming natural language specifications, this is a problem that should be addressed in future work.

The other issue identified with this scenario in its current form is that if the lexicon diverges from the lexicons of multiple agents, a new label will be created each time. This is not the desired behaviour. Instead, a new label should be created and associated only once when the lexicon first diverges from another lexicon. The new label should then have its strength increased while the remaining labels for that lexicon have their associations weakened; this is closer to the way it is described by Steels and Hanappe (Steels and Hanappe 2006), which was initially thought not to be necessary. For each subsequent divergence within the same update, the new label should be strengthened and the rest weakened. Therefore, if the local agent's lexicon has diverged from a large number of lexicons, the new label's strength will be increased to the point where it is used to represent the lexicon rather than the old label. This would prevent further updates from comparing it to the lexicons from which it has diverged, as desired. If another update occurs while the old label is still rated higher than the new, causing the lexicon to diverge from the same lexicon(s) again, the matching label can simply be weakened and all other labels strengthened, as there are now other labels whose strengths can be manipulated. This has the effect that the new label, which is not shared by other lexicons, will bubble to the top after a series of interactions with other agents.
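The revised divergence rule can be sketched as follows. This is a simplified illustration rather than the thesis implementation: the class and method names are hypothetical, and the constants follow the configuration described later (strength delta 0.1, initial strength 0.5).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/**
 * Simplified sketch of the revised divergence rule: a single "divergence
 * label" is created on the first divergence within an update, then
 * strengthened on each subsequent divergence while all other labels for
 * the lexicon are weakened.
 */
class DivergenceSketch {
    static final double DELTA = 0.1;    // label strength delta
    static final double INITIAL = 0.5;  // initial label strength

    final Map<String, Double> labelStrengths = new HashMap<>();
    String divergenceLabel; // created at most once per update

    void onDivergence() {
        if (divergenceLabel == null) {
            // first divergence: create and associate the new label once
            divergenceLabel = UUID.randomUUID().toString();
            labelStrengths.put(divergenceLabel, INITIAL);
        } else {
            // subsequent divergences: strengthen the same new label
            labelStrengths.merge(divergenceLabel, DELTA, Double::sum);
        }
        // weaken every other label associated with this lexicon
        for (Map.Entry<String, Double> e : labelStrengths.entrySet()) {
            if (!e.getKey().equals(divergenceLabel)) {
                e.setValue(Math.max(0.0, e.getValue() - DELTA));
            }
        }
    }

    /** The label with the highest strength is the one the agent uses. */
    String preferredLabel() {
        return labelStrengths.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey).orElse(null);
    }
}
```

After enough divergences the new label's strength exceeds the old label's, so it "bubbles to the top" and the agent stops comparing against the lexicons it has diverged from.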
Solving the multiple-label problem when diverging lexicons as described introduces another issue, though. As the new label is not shared by other agents, once it is in use any updates will no longer be compared to lexicons from which it may not actually have diverged. This could be solved by sending the new label, when it is created, to the agents whose lexicons still match; that way, when querying lexicons by label, the matching lexicons will be returned. However, this introduces consistency issues with the way label associations are strengthened and weakened. Therefore, if these changes, or similar, are made in future work, the adjustments to association strengths should be re-evaluated across all of the scenarios to maintain consistency.
3.3.4 Lexicon Builder Design
An attempt has been made to develop the Lexicon Builder system in such a way as to make it simple to improve and expand in future work. To this end, the code base has been separated into several packages: core, lexicon, lexicon::model, and util. An overview of this package structure is shown in Figure 3.6. The diagram has been simplified, as the auto-generated code for the Lexicon Meta-Model is actually separated into a set of interfaces in the package lexicon::model::Lexicon, a set of implementation classes in lexicon::model::Lexicon::impl, and a couple of utility classes in lexicon::model::Lexicon::util. A brief description of the main packages is as follows:

core contains the core classes and interfaces, such as those for the agents and the networking code. It also contains the listeners, events, and additional exceptions used by the Lexicon Builder. A UML class diagram of this package is displayed in Figure 3.7. Note: the LexiconAgent interface does not have many methods defined. This is partially due to time constraints and partially due to the fact that the methods are almost entirely internal, with no need to provide external interfaces themselves.

lexicon contains classes and interfaces related to the lexicons used by the agents, including a generic form of lexicon for transferring between agents. A UML class diagram of this package is displayed in Figure 3.8.

lexicon::model contains the classes of the Lexicon Meta-Model. (It actually contains the meta-model itself, with the classes in the packages described above; conceptually, however, this is how the packages are organised.) A UML class diagram of this package is displayed in Figure 3.2.

util contains utility classes, such as a class to handle configuration settings for the agents and the two-way map (or dictionary) data structures used to associate labels to lexicons and vice versa. A UML class diagram of this package is displayed in Figure 3.9.
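As an illustration of the two-way association provided by the util package, the following is a simplified stand-in (not the actual TwoWayStrengthMap implementation) for looking up lexicons by label and labels by lexicon, with the strongest association returned first:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/**
 * Simplified sketch of a two-way, strength-weighted association between
 * labels and lexicons. Class and method names are illustrative only.
 */
class LabelAssociations {
    // lookup in each direction: (label -> lexicon -> strength) and vice versa
    private final Map<String, Map<String, Double>> byLabel = new HashMap<>();
    private final Map<String, Map<String, Double>> byLexicon = new HashMap<>();

    void associate(String label, String lexiconId, double strength) {
        byLabel.computeIfAbsent(label, k -> new HashMap<>())
               .put(lexiconId, strength);
        byLexicon.computeIfAbsent(lexiconId, k -> new HashMap<>())
                 .put(label, strength);
    }

    /** Strongest lexicon for a label (one direction of the map). */
    Optional<String> lexiconFor(String label) {
        return byLabel.getOrDefault(label, Map.of()).entrySet().stream()
                .max(Map.Entry.comparingByValue()).map(Map.Entry::getKey);
    }

    /** Strongest label for a lexicon (the other direction). */
    Optional<String> labelFor(String lexiconId) {
        return byLexicon.getOrDefault(lexiconId, Map.of()).entrySet().stream()
                .max(Map.Entry.comparingByValue()).map(Map.Entry::getKey);
    }
}
```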
The most important elements of the organisation of the Lexicon Builder system include the LexiconAgent, NetworkManager, and LexiconManager interfaces. These interfaces
[UML sequence diagram titled “Lexicon Builder Alternate Scenario 2 (matching lexicon at agent 1, updates other agents with changes)”. Its annotations note that a classifier determines whether the given words are compatible with an existing lexicon; that corrections are computed by comparing both the words and their lexical data, on the assumption that words used in the same way are likely being used in the same domain (a semiotic triangle); that few corrections indicate the two lexicons still refer to the same domain and are updated to match; and that many corrections cause divergence, creating a new label while the old label remains associated so that agents whose lexicons have diverged no longer update each other.]

Figure 3.5: Scenario 3 – Matching Lexicon Found Locally
[UML package diagram showing the lexiconbuilder package with its sub-packages core, lexicon, lexicon::model, and util, and the «uses»/«accesses» dependencies between them.]

Figure 3.6: Lexicon Builder Package Structure
[UML class diagram of the lexiconbuilder::core package: the LexiconAgent, NetworkManager, CorrectionsListener, and GetLexiconListener interfaces; the BasicLexiconAgent and BasicRootLexiconAgent classes; the BasicNetworkManager with its ServerThread, ReceiverThread, and SenderThread; and the CorrectionsEvent and GetLexiconEvent classes of the event-handling framework the NetworkManager uses to notify an agent about the messages it receives from other agents.]

Figure 3.7: core Package Class Diagram
[UML class diagram of the lexiconbuilder::lexicon package: the LexiconManager interface with its abstract LexiconToken, used as a reference to the internal lexicons from outside the LexiconManager; the immutable, serialisable GenericLexicon and LexicalElement classes; and the ModelLexiconManager, which implements the LexiconManager methods using the Lexicon Meta-Model classes.]

Figure 3.8: lexicon Package Class Diagram
[UML class diagram of the lexiconbuilder::util package: the Settings class with enums for the boolean, double, integer, and string typed configuration settings; the TwoWayHashMap, in which both keys and values act as keys; and the TwoWayStrengthMap, a TwoWayHashMap that associates a strength with each entry and orders entries by strength.]

Figure 3.9: util Package Class Diagram
define the way in which agents operate and communicate so as to allow various implementations to be developed quickly and easily. The basic idea is that a LexiconManager handles the storing and retrieval of data in the lexicons, the NetworkManager handles the communication between agents, and the LexiconAgent is the glue that brings the two together in order to perform the required operations of the Lexicon Builder system. Using this structure, the behaviour of the agent can be changed without affecting the networking code or the storing of the lexicons, the storing of lexicons can be modified without affecting the behaviour of the agent, and so on.

The other important elements of this design are the GenericLexicon and LexicalElement classes. These are very generic, immutable, and serialisable classes for storing lexical information in order to facilitate the transfer of lexicons between agents. In addition, this should provide the future benefit of allowing agents using different LexiconManagers to transfer lexical information between each other regardless of the storage mechanism used. Due to time constraints, a strict, general format for transferring the lexical information has not been defined. Furthermore, while the use of Java serialisation simplifies the transfer of information, it is only usable if all agents are implemented in Java; future work should therefore define a more language-independent protocol in order to encourage wider use of the Lexicon Builder system in the event it develops into a usable system.
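The three-way separation just described can be sketched with minimal interfaces. The method signatures below are illustrative assumptions rather than the actual Lexicon Builder API; they show only how composing the two managers lets either side be swapped without touching the other.

```java
import java.util.List;

// Minimal sketch of the LexiconManager / NetworkManager / LexiconAgent
// separation. Signatures are assumptions, not the thesis API.
interface LexiconManager {
    Object findLexicon(List<String> wordList); // storage and retrieval
}

interface NetworkManager {
    void sendGetLexicon(String label, List<String> wordList, String agentId);
}

interface LexiconAgent {
    void run(String[] args); // glue that wires the two together
}

class SketchAgent implements LexiconAgent {
    private final LexiconManager lexicons;
    private final NetworkManager network;

    SketchAgent(LexiconManager lexicons, NetworkManager network) {
        this.lexicons = lexicons;
        this.network = network;
    }

    @Override
    public void run(String[] args) {
        // Scenario 1 in miniature: if no local lexicon matches the word
        // list, ask a known remote agent for one.
        List<String> words = List.of(args);
        if (lexicons.findLexicon(words) == null) {
            network.sendGetLexicon(null, words, "remote-agent-1");
        }
    }
}
```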
3.4 Implementation
This section describes some of the important aspects of the current implementation, including an explanation of the approach that was taken, the classifiers used in determining whether lexicons match, and a description of the interfaces that can be used to view information on the agents and provide user feedback.
3.4.1 Approach
The approach taken for the experimental system developed as part of this research is a very simple one. It is based around two types of agent: a standard agent that retrieves data from agents it knows about (specified in a settings file) and a root agent that retrieves data from an external data source (in this case The Free Dictionary). This is a basic implementation of Scenario 1 of the agent interactions, which makes things much simpler to deal with for this experimental system than attempting to set up a network of agents using a Peer-2-Peer protocol, for example. The standard agent is defined in the BasicLexiconAgent class, while the root agent is defined in the BasicRootLexiconAgent class.

Each agent has an identifier string, which is used to uniquely identify a specific agent. This identifier aids in discovering and communicating with other agents, and is important no matter how the discovery mechanism operates, whether it is Peer-2-Peer, a registry, or simply using the identifier as a network address (as is the case in this implementation).

To create a lexicon, a preformatted word list can be injected at agent start-up by providing the name of the file containing the word list as an argument. A word list file can also be loaded using a command while the agent is running. Preformatted word lists were used to save time, as the injection of the actual text has several issues that need to be overcome in order to provide an accurate word list to the agents: for example, how to identify phrases that should be treated as a single word (such as “at least”), or how to identify names that are made up of multiple words (such as “Lexicon Builder”). Actually creating a parser to generate the correct word list from the text to be injected into the agent (and the Ordered Words model in the process defined by Kleiner et al. (Kleiner et al. 2009)) is left for future work. Furthermore, Kleiner et al. (Kleiner et al. 2009) themselves do not appear to have addressed the issues involved.

Once an agent is running, information on the lexicons it contains can be retrieved through the console by entering certain commands. In addition, the lexicons managed by the agent can be viewed through a simple GUI that contains a list of words and a table with the lexical data.
The information commands are mainly for debugging and informational purposes; however, the GUI is used to provide user feedback for the system and to insert changes to the lexicons into the system.

In order to retrieve data from The Free Dictionary website, a simple parser was written that parses the HTML of the web pages and extracts the categories for the words. This parser operates on plain text only, rather than generating a representation of the HTML elements and searching them, as it was quicker to implement for this research. However, this makes the pages harder to search and provides more opportunities for errors, which were realised later during the testing of a larger word list than that used in the initial development of the parser. Furthermore, the parser performs only basic linking of words to their base designation when a word is redirected to its root word. Complete linking may be a complicated matter, requiring additional parsing to find links, so it was not implemented at the time in order to get a mostly working system completed.

The NetworkManager implementation used by these agents to communicate with one another is defined in BasicNetworkManager and utilises a basic client-server architecture using standard TCP sockets. The protocol is similarly simple: it just sends the ID of the destination agent, the ID of the sending agent, the message being sent (e.g. GetLexicon, Corrections, etc.), and the data corresponding with the message, using Java object serialisation. Sending the ID of the destination agent was mainly done to ensure the BasicNetworkManager was behaving correctly, while the sending of the source agent’s ID facilitates the return of the results of a message to the calling agent. This is necessary as the NetworkManager utilises an event system in order to facilitate both synchronous and asynchronous message passing in the future.

Finally, the ModelLexiconManager implementation of the LexiconManager interface simply maintains a set of lexicons, labels, and associations using the classes generated by the EMF for the Lexicon Meta-Model defined in Section 3.3.2. In keeping with the extensible, adaptable framework, as much as possible is handled within the LexiconManager itself, such as comparing lexicons, modifying label–lexicon association strengths, etc. This results in none of the EMF-specific classes being used outside of the ModelLexiconManager, with all handling of lexicons outside of it being done using the GenericLexicon and LexicalElement classes. However, as a shortcut in system development, the current use of LexicalElements is linked to the ModelLexiconManager and the EMF classes for the Lexicon Meta-Model. This is something that should be addressed in future work.
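The message frame described above (destination agent ID, source agent ID, message name, serialised payload) can be sketched as follows. This is an illustrative reconstruction, not the BasicNetworkManager source; the class and method names are assumptions.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/**
 * Sketch of the simple wire protocol: destination ID, source ID, a
 * message name, and a Java-serialised payload.
 */
class MessageFrame {
    static void writeMessage(ObjectOutputStream out, String destId,
                             String srcId, String message,
                             Serializable payload) throws IOException {
        out.writeUTF(destId);  // lets the receiver sanity-check routing
        out.writeUTF(srcId);   // lets the receiver address its reply
        out.writeUTF(message); // e.g. "GetLexicon", "Corrections"
        out.writeObject(payload);
        out.flush();
    }

    /** Returns {destId, srcId, message, payload}. */
    static Object[] readMessage(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        return new Object[] {
            in.readUTF(), in.readUTF(), in.readUTF(), in.readObject()
        };
    }
}
```

As the text notes, this format only works between Java agents; a language-independent encoding would replace `writeObject`/`readObject` with an explicit serialisation of the GenericLexicon contents.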
3.4.2 Classifiers
There are two classifiers used in the Lexicon Builder system for determining matching lexicons. Currently they are both hard-coded into the implementation; however, it may be useful in future work to provide a classifier framework that allows classifiers to be interchanged more readily, and possibly to support the automated creation of classifiers as indicated by Steels and Hanappe (Steels and Hanappe 2006). The first classifier is used to determine possible matching lexicons when only a word list is available, while the other is used to determine matching lexicons when lexical data is available for the words. These classifiers are used in the FindLexicon and Corrections sections of the agent interaction scenarios, respectively.

The classifier used when only a word list is available is very simple: it calculates the percentage of words from the word list that are in the lexicon. The principle is that the more words from the word list are in the lexicon, the higher the likelihood that the lexicon is for the same domain. If multiple matching lexicons are found, only the highest is returned to the agent searching for a matching lexicon for the word list. A cutoff point can be defined, so that any percentage lower than the threshold is not returned as matching. This is important as there are often going to be a number of words repeated between lexicons in different domains, which will bring the percentage of matches to a certain level. For example, the word lists for specifications defined using SBVR are going to share the SBVR keywords, apart from the common words used to construct complete and proper sentences. The threshold for this implementation was selected arbitrarily to be 50%, as it was not the focus of this project. However, future work should investigate suitable values for this threshold in order to improve the effectiveness of the Lexicon Builder as a large system populated with many lexicons.

The second classifier, used to compare both words and their lexical information, is also quite simple. It determines whether or not there are a large number of changes by calculating the percentage of new words relative to the original lexicon and adding it to the weighted percentage of changed words over the original lexicon. This is where the semiotic triad between a word, its use, and the domain it is being used in comes into play: the idea is that the more changes there are in the lexical data, the less likely the lexicons are still referring to the same domain. However, new words are welcomed, and this is taken into account by using the weight to increase the effect the number of changed words has. Like the threshold value for the first classifier, the factor used to determine whether or not the lexicons are still matching was chosen quite arbitrarily (25%); hence future work should investigate the effects of this value in order to produce better results for more than just simple example scenarios. Furthermore, the classifier only takes into account new words and lexical elements. Removing words or lexical elements is not supported, since the lexicons are shared and removing terms from a lexicon could potentially have hazardous effects on another person’s system. Therefore, how to determine the effects of the removal of elements from a lexicon could be investigated in future work.
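The two calculations described above can be sketched as follows. The method names and signatures are assumptions; the constants are the arbitrarily chosen values noted in the text (a 50% word-match threshold, and a 25% differences threshold with a scale factor of 2.0).

```java
import java.util.List;
import java.util.Set;

/** Sketches of the two lexicon-matching classifiers described above. */
class LexiconClassifiers {
    static final double MATCH_THRESHOLD = 0.5;  // 50%, chosen arbitrarily
    static final double SCALE_FACTOR    = 2.0;  // weights changed lexical data
    static final double DIFF_THRESHOLD  = 0.25; // 25%, chosen arbitrarily

    /** Classifier 1: fraction of the word list already in the lexicon. */
    static boolean wordsMatch(List<String> wordList, Set<String> lexiconWords) {
        if (wordList.isEmpty()) return false;
        long present = wordList.stream().filter(lexiconWords::contains).count();
        return (double) present / wordList.size() >= MATCH_THRESHOLD;
    }

    /**
     * Classifier 2: new words plus the weighted count of changed lexical
     * elements, both relative to the size of the original lexicon. Above
     * the threshold, the lexicons are considered to have diverged.
     */
    static boolean stillMatching(int newWords, int changedWords,
                                 int originalSize) {
        double differences = (double) newWords / originalSize
                + SCALE_FACTOR * (double) changedWords / originalSize;
        return differences <= DIFF_THRESHOLD;
    }
}
```

For example, with an original lexicon of 20 words, adding 2 new words gives a differences value of 0.10 (still matching), while additionally changing the lexical data of 3 existing words gives 0.10 + 2.0 × 0.15 = 0.40, causing divergence.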
3.4.3 User Interfaces
The BasicLexiconAgent (and hence the BasicRootLexiconAgent, as its subclass) accepts certain commands from the console in order to display information about the lexicons it manages or to inject new lists of words. There are three commands: “print-metadata”, “update”, and “wordlist”. The last two take a single argument, while the first takes none. All three commands are used for the validation testing described in Chapter 4.

The first command, “print-metadata”, lists all of the lexicons stored by the agent with their associated labels and strengths, and their owner tags. An example is displayed in Figure 3.10, where there is one lexicon with an owner tag “example tag” and the associated label “8762d804-80fc-4dbc-92e0-ff011a56eb7d” with a strength of 0.5. Figure 3.10 also shows the commands that can be used in the agents.
Figure 3.10: Example of the “print-metadata” Command

The second command, “update”, is used to display the GUI that allows the user to view and update the lexicon. It is not particularly polished (it places all of the lexical data into the same grid, for starters) as it was primarily developed for testing purposes. The GUI allows the addition and deletion of words and categories using context menus (shown in Figure 3.12 and Figure 3.13), and allows the user to assign owner tags to the lexicon for future reference (shown in Figure 3.11). Multiple tags can be assigned to a lexicon by using a comma (“,”) to separate the different tags in the field. If no owner tags are assigned to a lexicon, the “update” command can be used with no arguments to access it; however, if there is more than one lexicon with no tags, only the first will ever be accessed. Therefore, in order to access all of the lexicons with the “update” command, owner tags need to be assigned. Furthermore, there is a peculiarity with the GUI when assigning owner tags in that it assigns them to the current lexicon, so if the agent decides a new lexicon needs to be created because there are too many changes, the owner tags entered into the GUI will not be assigned to the new lexicon.

The last command, “wordlist”, allows the loading of word lists while an agent is running, rather than having to specify the word lists at start-up. This has the advantage of being able to load word lists in a certain order among several agents. Otherwise they are injected one after the other without pausing at start-up, making it impossible to perform a scenario such as: load word list A in Agent 1, then word list A in Agent 2, and finally word list B in Agent 1.

Figure 3.11: Example of the GUI with the Owner Tags Highlighted

Figure 3.12: Example of the GUI with the Words Context Menu

Figure 3.13: Example of the GUI with the Category Context Menu
3.5 Summary
This chapter has discussed the design and implementation of the Lexicon Builder system. The methodology of the development was described, along with the key aspects of the design, such as the selection of an appropriate external data source, the design of the Lexicon Meta-Model, and the agent interactions, together with an overview of the implementation of the experimental Lexicon Builder system. The next chapter describes the tests performed on the experimental system and presents the results. An analysis of the results is then performed in Chapter 5.
Chapter 4

Experimental System Evaluation

This chapter describes how the experimental Lexicon Builder system was tested in order to demonstrate the feasibility of a dynamic lexicon system using emergent semantics. The testing examined whether or not the implementation worked as expected and assessed the performance of the current implementation. The evaluation of the Lexicon Builder experimental system was separated into validation testing and performance testing. The validation testing focused on ensuring that the Lexicon Builder executed as expected in relation to the scenarios described in Section 3.3.3 and was aimed at answering the research question at a high level. Conversely, the performance testing focused on answering the subquestions of this research in order to provide evidence of the feasibility and effectiveness of the Lexicon Builder, by attempting to gather some metrics on the information retrieved from the external data source. Section 4.1 describes the configuration of the system used throughout the tests, while Section 4.2 and Section 4.3 describe the validation and performance tests, respectively.
4.1 Configuration
For the tests, three agents were run on the same computer: two BasicLexiconAgents and one BasicRootLexiconAgent. The first BasicLexiconAgent and the BasicRootLexiconAgent were typically the main executors of most tests, while the second BasicLexiconAgent was used occasionally where tests could not be performed with only two agents (e.g. when testing lexicon divergence).
Each agent had a set of configuration settings supplied by settings files. There was a default file shared by all three agents – app.properties – that provides common settings. Each agent then had its own settings provided by agent1.properties, agent2.properties, and agent3.properties, which were loaded over the top of the shared settings using a command line argument. This allowed the default settings to be overridden by each agent individually as required. The agents were configured as follows:

Agent 1: Class: BasicLexiconAgent – v1.0; ID/Network Address: 127.0.0.1:40001; Known Agents: Agent 2, Agent 3

Agent 2: Class: BasicRootLexiconAgent – v1.0; ID/Network Address: 127.0.0.1:40002; Known Agents: Agent 1, Agent 3

Agent 3: Class: BasicLexiconAgent – v1.0; ID/Network Address: 127.0.0.1:40003; Known Agents: Agent 1

Common settings:

Label Strength Delta: 0.1 – the value by which the strength of a label association is modified when strengthened or weakened

Label Initial Strength: 0.5 – the strength that new label associations start with

Lexicon Match Threshold: 0.5 – the cutoff for matching lexicons when only the words are considered; in this case a match requires at least 50% matching words

Lexicon Differences Scale Factor: 2.0 – the factor used to weight the amount of new lexical data when matching lexicons using both words and lexical data; in this case 2 new lexical categories (not including the categories associated with new words) will be counted as 4

Lexicon Differences Threshold: 0.25 – the cutoff for matching lexicons when considering both words and lexical data; in this case, when the calculated differences value is over 25% the lexicons are considered no longer matching
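For illustration, the layered settings files might look as follows. The property key names are assumptions for the sake of the example, as the actual keys used by the implementation are not listed in the text:

```properties
# app.properties – defaults shared by all agents (illustrative key names)
label.strength.delta=0.1
label.initial.strength=0.5
lexicon.match.threshold=0.5
lexicon.differences.scale.factor=2.0
lexicon.differences.threshold=0.25

# agent1.properties – loaded over the top of app.properties,
# overriding or adding agent-specific settings
agent.id=127.0.0.1:40001
known.agents=127.0.0.1:40002,127.0.0.1:40003
```

Loading the agent-specific file after the shared file means any key it repeats silently overrides the default, which is what allows per-agent configuration without duplicating the common settings.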
All of the tests make the assumption that the user’s text has been parsed into a suitable word list for the Lexicon Builder system. For example, duplicate words have been removed and combined words have been identified and made into a single entry (i.e. “at least” becomes “AtLeast”). At this stage, the preprocessing for the injection of text is outside the scope of the Lexicon Builder. The same setup was used for both sets of tests.
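The assumed preprocessing can be sketched as follows. The phrase table and method are hypothetical, showing only the two steps the tests rely on: combining multi-word phrases into single entries and removing duplicates. A real parser would need the more careful phrase and name identification discussed in Section 3.4.1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

/** Sketch of the word-list preprocessing assumed by the tests. */
class WordListPreprocessor {
    // phrases that must be treated as a single word; naive substring
    // replacement is sufficient for this sketch only
    static final Map<String, String> PHRASES = Map.of(
            "at least", "AtLeast",
            "at most", "AtMost");

    static List<String> preprocess(String text) {
        String combined = text.toLowerCase();
        for (Map.Entry<String, String> e : PHRASES.entrySet()) {
            combined = combined.replace(e.getKey(), e.getValue());
        }
        // split on whitespace and drop duplicates, preserving order
        return new ArrayList<>(new LinkedHashSet<>(
                Arrays.asList(combined.split("\\s+"))));
    }
}
```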
4.2 Validation Tests
This section describes the validation tests that were performed on the Lexicon Builder system. Due to time constraints, thorough testing of complex scenarios was not possible; therefore, only the basic scenarios have been validated. Even so, this testing detected several shortcomings in the design of the interactions between agents, which have already been noted in Section 3.3.3.
4.2.1 Validation Test 1 – Scenario 1
Purpose The purpose of this test is to verify and demonstrate the behaviour of the Lexicon Builder system in the scenario where no agent contains any lexicons. This is one of the key scenarios for the Lexicon Builder as it defines the (partially) automated building of lexicons, one of the goals of this research.
Test Data The word list used for this test is provided in Appendix A.1. This list was taken from the combined sentences used as examples by Kleiner et al. (Kleiner et al. 2009).
Prerequisites The word list files exist at the locations specified by the command line arguments for the agents.
Expected Results Both Agent 1 and Agent 2 have identical lexicons containing all of the words in the word list and with lexical information retrieved from the external data source.
Test Procedure
1. Start Agent 2 using the following command:
java BasicRootLexiconAgent --settings=agent2.properties
2. Start Agent 1 using the following command:
java BasicLexiconAgent --settings=agent1.properties --word-list="Tests/Test Data/Verification01/word_list.txt"
3. A window will be displayed with the words listed on the left hand side and the lexical data in a table on the right. Check that all of the words from the loaded word list are present.
4. Check that each word contains the lexical data retrieved from the external data source.
5. In the console/terminal associated with the running Agent 2, type “update” and press Enter. This will display a window for the only lexicon present in Agent 2.
6. Check that the two windows match.
7. Shut down the agents by typing “exit” in their console/terminal windows or by using the Terminate button in Eclipse.
Results Pass. Both agents contained matching lexicons that included all of the words from the word list and lexical data retrieved from the external data source.
4.2.2 Validation Test 2 – Scenario 2: Few Changes
Purpose The purpose of this test is to verify and demonstrate the behaviour of the Lexicon Builder system in the scenario where the remote agent contains a lexicon that is compatible with the lexicon being requested by the local agent, and which has no or few changes compared to the existing lexicon. This includes the aspect of user feedback to provide the lexicon with data that could not be retrieved automatically. This scenario is a basic success scenario providing incremental updates to the remote agent and allowing the local agent to leverage the fact that a suitable lexicon already
exists elsewhere. The negotiation arrives at an agreement that the words the user wants to use are within the same domain.

Test Data There are two lists of words used to test this scenario: an initial list of words to create the pre-existing lexicon on the remote agent, and a different but compatible list of words for the local agent to request data for. List 1 is taken from the combined sentences used as examples by Kleiner et al. (Kleiner et al. 2009). List 2 is the same as List 1 with additional words taken from the sentence: ‘The Lexicon Builder is a software.’

List 1: refer to Appendix A.1.

List 2: same as List 1 with the additional words:

1. The
2. LexiconBuilder

Prerequisites The word list files exist at the locations specified by the command line arguments for the agents.

Expected Results Both Agent 1 and Agent 2 contain a single identical lexicon that includes all of the words in the second word list and contains both the manually entered and automatically retrieved lexical information. The lexicons of both agents have the same highest-strength label (if there is more than one label).

Test Procedure

1. Start Agent 2 using the following command:
java BasicRootLexiconAgent --settings=agent2.properties --word-list="Tests/Test Data/Verification02/initial_word_list.txt"
2. A window will be displayed with the words listed on the left hand side and the lexical data in a table on the right. Check that all of the words from the initial word list are present and that their lexical data is that which has been retrieved from the on-line dictionary.
3. Close the window without making any changes (this will not terminate Agent 2).
4. Start Agent 1 using the following command:
java BasicLexiconAgent --settings=agent1.properties --word-list="Tests/Test Data/Verification02/updated_word_list.txt"
5. A window will be displayed with the words listed on the left hand side and the lexical data in a table on the right. Check that all of the words from the updated word list are present.
6. Select the word The on the left hand side. Add a new category and change its Class value to DADCat (definite article determiner).
7. Select the word LexiconBuilder on the left hand side. Add a new category and change its Class value to NCat.
8. Close the window. (This commits the changes to the lexicon.)
9. In the console/terminal window associated with the running Agent 2, type “printmetadata” and press Enter. This will display a list of lexicons and their information, including owner tags, labels, and their strengths. Check that there is a single lexicon with a single label. Note the label.
10. In the console/terminal window associated with the running Agent 1, type “printmetadata” and press Enter. This will display a list of lexicons and their information, including owner tags, labels, and their strengths. Check that there is a single lexicon with two labels and that the one with the highest strength matches the label of Agent 2 – the other label is the one created by Agent 1 before retrieving the lexicon from Agent 2.
11. In the console/terminal window associated with the running Agent 1, type “update” and press Enter. This will display a window for the only lexicon present in Agent 1.
12. In the console/terminal window associated with the running Agent 2, type “update” and press Enter. This will display a window for the only lexicon present in Agent 2.
13. Check that the two windows match. They should contain the list of words from the second word list, all of the words from the initial list should contain lexical data downloaded from the on-line dictionary, and the changes made in steps 6, 7, and 8 should be present.
14. Shut down the agents by typing “exit” in their console/terminal windows or by using the Terminate button in Eclipse.
Results Pass. Both agents contained matching lexicons that included the entirety of the second word list and lexical data retrieved from the external data source. The agents had matching labels for the lexicon that had the highest strength of all the labels associated with it. Although correct according to the scenario, this result demonstrates the issue identified in Section 3.3.3.2 with the new words not having their lexical data automatically retrieved from the external data source.
4.2.3 Validation Test 3 – Scenario 2: Many Changes
Purpose The purpose of this test is to verify and demonstrate the behaviour of the Lexicon Builder system in the scenario where the remote agent contains a lexicon that is compatible with the lexicon being requested by the local agent, but which has many changes compared to the existing lexicon. This includes the aspect of user feedback to provide the lexicon with data that could not be retrieved automatically. This scenario is a basic failure scenario that informs the remote agent that it was incorrect. This results in a new lexicon with the same label (and possibly a higher strength), so that the next time the agent receives the same list of words it knows there could be two sets of possibilities. The other interactions that have occurred in the meantime will determine which lexicon is retrieved the next time the same list of words is classified. While this feedback behaviour is based on Steels and Hanappe (Steels and Hanappe 2006), it is not really suitable for the building of lexicons and should operate the other way around, as discussed in Section 3.3.3.2.
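The label strengthening and weakening involved here can be sketched as follows, using the Label Strength Delta (0.1) and Label Initial Strength (0.5) from the common agent settings. The class and method names are illustrative, and the clamping of strengths to [0, 1] is an assumption about the implementation rather than a documented behaviour.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of label-strength bookkeeping in the style of Steels and Hanappe.
// Names are illustrative; only the two constants come from the settings.
public class LabelStrengths {
    static final double INITIAL = 0.5; // Label Initial Strength
    static final double DELTA = 0.1;   // Label Strength Delta

    final Map<String, Double> strengths = new HashMap<>();

    void addLabel(String label)   { strengths.putIfAbsent(label, INITIAL); }
    void strengthen(String label) { adjust(label, +DELTA); }
    void weaken(String label)     { adjust(label, -DELTA); }

    private void adjust(String label, double delta) {
        // Clamp to [0, 1] so repeated feedback cannot run away (assumption).
        double s = strengths.getOrDefault(label, INITIAL) + delta;
        strengths.put(label, Math.max(0.0, Math.min(1.0, s)));
    }

    // The label with the highest strength decides which lexicon is preferred.
    String strongest() {
        return strengths.entrySet().stream()
            .max(Map.Entry.comparingByValue())
            .map(Map.Entry::getKey).orElse(null);
    }
}
```

Under this sketch, a successful interaction strengthens the shared label while a failure weakens it, so repeated agreement gradually fixes which lexicon is returned for a given word list.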
Test Data The word list used for this test is provided in Appendix A.1. This list is taken from the combined sentences used as examples by Kleiner et al. (Kleiner et al. 2009).

Prerequisites The word list files exist at the locations specified by the command line arguments for the agents.

Expected Results Both Agent 1 and Agent 2 contain identical lexicons that include both the manually entered and automatically retrieved lexical information. In addition, Agent 2 contains a second lexicon that includes only the automatically retrieved lexical data. Both lexicons of Agent 2 have the same label, and the label with the highest strength belongs to the lexicon that matches the lexicon of Agent 1. The lexicon of Agent 1 has the same label as the lexicons of Agent 2, and this label has the highest strength of all the labels associated with the lexicon.

Test Procedure
1. Start Agent 2 using the following command:
java BasicRootLexiconAgent --settings=agent2.properties --word-list="Tests/Test Data/Verification03/word_list.txt"
2. A window will be displayed with the words listed on the left hand side and the lexical data in a table on the right. Check that all of the words from the first word list are present and that their lexical data is that which has been retrieved from the external data source.
3. In the text field at the top of the window enter “initial lexicon”. This will associate an owner tag with the lexicon.
4. Close the window (this will not terminate Agent 2).
5. Start Agent 1 using the following command:
java BasicLexiconAgent --settings=agent1.properties --word-list="Tests/Test Data/Verification03/word_list.txt"
6. A window will be displayed with the words listed on the left hand side and the lexical data in a table on the right. Check that all of the words from the second word list are present.
7. Select the word AtLeast on the left hand side. Right-click the table on the right and select Add Category. Change the value in the Class column of the new category to QValuedCat (quantified valued determiner).
8. Select the word Each on the left hand side. Right-click the table on the right and select Add Category. Change the value in the Class column of the new category to QUnvaluedCat (quantified unvalued determiner).
9. Close the window. (This commits the changes to the lexicon.)
10. In the console/terminal window associated with the running Agent 2, type “printmetadata” and press Enter. This will display a list of lexicons and their information, including owner tags, labels, and their strengths. Check that there are two lexicons, one with no owner tags (the new lexicon, since it is created without owner tags) and the other with the tag “initial”. The two lexicons should have the same label associated with them, and the lexicon with no tag should have a higher strength for the association.
11. In the console/terminal window associated with the running Agent 2, type “update” and press Enter. This will display a window for the updated lexicon in Agent 2, since it is created with no owner tags.
12. In the console/terminal window associated with the running Agent 1, type “update” and press Enter. This will display a window for the only lexicon present in Agent 1.
13. Check that the two windows match. They should contain all of the words in the word list, including the lexical data retrieved from the external data source, and the changes made in steps 7 – 8 should be present. Once finished, close the windows.
14. In the console/terminal window associated with the running Agent 2, type “update initial” and press Enter. This will display a window for the lexicon with the
given owner tag, in this case the initial lexicon created by Agent 2.
15. Check that it contains only the words from the word list without the changes made in steps 7 – 8. Once finished, close the window.
16. Shut down the agents by typing “exit” in their console/terminal windows or by using the Terminate button in Eclipse.
Results Pass. The agents contained an identical lexicon that included all of the manually entered and automatically retrieved lexical data. Agent 2 had a second lexicon that contained only the lexical data retrieved from the external data source. The agents had matching labels for the identical lexicon. For Agent 2, the label had the highest strength for the updated lexicon containing all of the lexical data. For Agent 1, the matching label had the highest strength out of all of the labels associated with the lexicon. Although correct according to the scenario, this result demonstrates the issue identified in Section 3.3.3.2 where the wrong agent gains an additional lexicon, possibly overriding a lexicon actually in use by another person.
4.2.4 Validation Test 4 – Scenario 3: Few Local Changes, Few Remote Changes
Purpose The purpose of this test is to verify and demonstrate the behaviour of the Lexicon Builder system in the scenario where the local agent contains a compatible lexicon (i.e. previous interactions by the user have created the lexicon and they are now updating it) and there are few or no changes to the existing lexicon at both the local and remote agents. This includes the aspect of user feedback to provide the lexicon with data that was not automatically retrieved. This is one of the key scenarios for the Lexicon Builder as it defines the way the lexicons can be updated in such a way as to move the agents towards global agreement. It is a more complex success scenario that incrementally updates the same lexicon at
each agent, keeping them synchronised, preventing unnecessary divergence, and developing the lexicon globally. The negotiation comes to the conclusion that the users are still operating within the same domain.

Test Data There are two lists of words used to test this scenario: an initial list of words to create the pre-existing lexicon on the remote agent, and a different but compatible list of words for the local agent to request data for. List 1 is taken from the combined sentences used as examples by Kleiner et al. (Kleiner et al. 2009). List 2 is the same with additional words included from the sentence: ‘The Lexicon Builder is a software.’

List 1: refer to Appendix A.1.

List 2: same as List 1 with the additional words:

1. The
2. LexiconBuilder
Prerequisites The word list files exist at the locations specified by the command line arguments for the agents. Tests 1 – 3 provide the expected results.

Expected Results Both Agent 1 and Agent 2 contain an identical lexicon that includes all of the words in the second word list along with the lexical information both manually entered and automatically retrieved from the external data source. The lexicons of Agent 1 and Agent 2 have matching labels, where the matching label has the highest strength of any label associated with the same lexicon – the strengths do not have to be the same across agents.

Test Procedure
1. Start Agent 2 using the following command:
java BasicRootLexiconAgent --settings=agent2.properties --word-list="Tests/Test Data/Verification04/initial_word_list.txt"
2. A window will be displayed with the words listed on the left hand side and the lexical data in a table on the right. Check that all of the words from the initial word list are present and that their lexical data is that which has been retrieved from the external data source.
3. Close the window without making any changes (this will not terminate Agent 2).
4. Start Agent 1 using the following command:
java BasicLexiconAgent --settings=agent1.properties --word-list="Tests/Test Data/Verification04/initial_word_list.txt" --word-list="Tests/Test Data/Verification04/updated_word_list.txt"
This will retrieve the data for the initial list and then find a matching lexicon locally for the second list.
5. A window will be displayed with the words listed on the left hand side and the lexical data in a table on the right. Check that it contains only those words from the initial word list, then close the window.
6. Another window will be displayed with the words listed on the left hand side and the lexical data in a table on the right. This is for the second word list that was injected into the agent. Ensure that it has all of the words from the second word list.
7. Select the word The on the left hand side. Add a new category and change its Class value to DADCat.
8. Select the word LexiconBuilder on the left hand side. Add a new category and change its Class value to NCat.
9. Close the window. (This commits the changes to the lexicon.)
10. In the console/terminal window associated with the running Agent 2, type “printmetadata” and press Enter. This will display a list of lexicons and their information including: owner tags, labels and their strengths. Check that there is a single lexicon with a single label. Note the label.
11. In the console/terminal window associated with the running Agent 1, type “printmetadata” and press Enter. This will display a list of lexicons and their information, including owner tags, labels, and their strengths. Check that there is a single lexicon with two labels and that the one with the highest strength matches the label of Agent 2 – the other label is the one created by Agent 1 before retrieving the lexicon from Agent 2.
12. In the console/terminal window associated with the running Agent 1, type “update” and press Enter. This will display a window for the only lexicon present in Agent 1 (since no owner tag was specified).
13. In the console/terminal window associated with the running Agent 2, type “update” and press Enter. This will display a window for the only lexicon present in Agent 2 (since no owner tag was specified).
14. Check that the two windows match. They should contain the list of words from the second word list, all of the words from the initial list should contain lexical data downloaded from the external data source, and the changes made in steps 7 – 8 should be present. Close the windows.
15. Shut down the agents by typing “exit” in their console/terminal windows or by using the Terminate button in Eclipse.
Results Pass. Both agents had an identical lexicon that contained the words of the second word list and included the lexical data both manually entered and automatically retrieved from the external data source. The labels associated with the lexicons of each agent matched and, for Agent 1, the matching label had the highest strength of the labels associated with the lexicon.
4.2.5 Validation Test 5 – Scenario 3: Few Local Changes, Many Remote Changes
Purpose The purpose of this test is to verify and demonstrate the behaviour of the Lexicon Builder system in the scenario where the local agent contains a compatible
lexicon (i.e. previous interactions by the user have created the lexicon and they are now updating it) and there are few or no changes to the existing lexicon at the local agent, but many changes to the lexicon at the remote agent. This includes the aspect of user feedback to provide the lexicon with data that was not automatically retrieved. This is one of the key scenarios for the Lexicon Builder as it defines the way the lexicons can be updated in such a way as to move the agents towards global agreement. It is a more complex failure scenario where the local agent determines a lexicon is no longer within the same domain as the lexicon of the remote agent and therefore diverges its lexicon.

Note: This test uses the third agent to update the first without updating the second, simulating the effect of the network being disconnected between Agent 2 and Agent 3 at the time the updates occurred. This is a plausible scenario of how two lexicons become different enough to cause them to diverge.

Prerequisites The word list files exist at the locations specified by the command line arguments for the agents and the “wordlist” command. Tests 1 – 4 provide the expected results.

Test Data There are three lists of words used to test this scenario: an initial list of words to create the pre-existing lexicon on the remote agent and two different but compatible lists of words to incrementally update the local agent. List 1 is taken from the first two sentences used as examples by Kleiner et al. (Kleiner et al. 2009). List 2 is the same except that extra words have been included from the third sentence, while List 3 contains the additional words from the sentence: ‘The Lexicon Builder is a software.’

List 1:

1. Each
2. Company
3. Sells
4. AtLeast
5. One
6. Product
7. Is
8. Sold
9. By
10. Exactly
List 2, same as List 1 with the additional words:

1. A
2. Software
List 3, same as List 1 and List 2 combined with the additional words:

1. The
2. LexiconBuilder
Expected Results Agent 1 contains a single lexicon with all of the words from the third word list. This lexicon includes lexical data retrieved from the external data source and the manually entered data. Three labels will be associated with the lexicon. One will match the label associated with the lexicon at Agent 2 and Agent 3, while the other two will not match either of the other agents. All of the labels will have the same strength. Agent 2 contains a single lexicon consisting of only the words in the first word list, including lexical data retrieved from the external data source for those words. Agent 3 contains a single lexicon, identical to that of Agent 1. The label associated with the lexicon with the highest strength matches the label of the lexicon at Agent 2 and one of the labels at Agent 1.
Test Procedure
1. Start Agent 2 using the following command:
java BasicRootLexiconAgent --settings=agent2.properties --word-list="Tests/Test Data/Verification05/initial_word_list.txt"
2. A window will be displayed with the words listed on the left hand side and the lexical data in a table on the right. Check that all of the words from the first word list are present and that their lexical data is that which has been retrieved from the external data source.
3. In the text field at the top of the window enter “initial lexicon”. This will associate an owner tag with the lexicon.
4. Close the window (this will not terminate Agent 2).
5. Start Agent 1 using the following command:
java BasicLexiconAgent --settings=agent1.properties --word-list="Tests/Test Data/Verification03/initial_word_list.txt"
6. A window will be displayed with the words listed on the left hand side and the lexical data in a table on the right. Check that it displays the same as it did for Agent 2.
7. Start Agent 3 using the following command:
java BasicLexiconAgent --settings=agent3.properties --word-list="Tests/Test Data/Verification05/updated_word_list1.txt"
8. A window will be displayed with the words listed on the left hand side and the lexical data in the table on the right. Check that all of the words from the second word list are present.
9. Select the word A on the left hand side. Right-click the table on the right and select Add Category. Change the value in the Class column of the new category to IADCat (indefinite article determiner).
10. Select the word Software on the left hand side. Right-click the table on the right and select Add Category. Change the value in the Class column of the new category to NCat.
11. Close the window. (This commits the changes to the lexicon.)
12. In the console/terminal window associated with the running Agent 1, type “wordlist Tests/Test Data/Verification05/updated_word_list2.txt” and press Enter. This will inject the third word list into the agent.
13. A window will be displayed with the words listed on the left hand side and the lexical data in the table on the right. Check that all of the words from the third word list are present and that the words A and Software have the lexical data associated with them in steps 9 and 10.
14. Select the word The on the left hand side. Right-click the table on the right and select Add Category. Change the value in the Class column of the new category to DADCat.
15. Select the word LexiconBuilder on the left hand side. Right-click the table on the right and select Add Category. Change the value in the Class column of the new category to NCat.
16. Close the window. (This commits the changes to the lexicon.)
17. In the console/terminal window associated with the running Agent 2, type “printmetadata” and press Enter. This will display a list of lexicons and their information, including owner tags, labels, and their strengths. Check that there is a single lexicon with the owner tag “initial lexicon” and note the label.
18. In the console/terminal window associated with the running Agent 1, type “printmetadata” and press Enter. This will display a list of lexicons and their information, including owner tags, labels, and their strengths. Check that there is a single lexicon with no owner tags and three labels, one of which will match the label from Agent 2, all with the same strength.
19. In the console/terminal window associated with the running Agent 3, type “printmetadata” and press Enter. This will display a list of lexicons and their information, including owner tags, labels, and their strengths. Check that there is a single lexicon with no owner tags and a single label that matches the label from Agent 2.
20. In the console/terminal window associated with the running Agent 3, type “update” and press Enter. This will display a window for the lexicon in Agent 3 (since it has no owner tags).
21. In the console/terminal window associated with the running Agent 1, type “update” and press Enter. This will display a window for the lexicon in Agent 1 (since it has no owner tags).
22. Check that the two windows match. They should contain all of the words in the third word list and include the lexical data retrieved from the external data source as well as the data entered in steps 9, 10, 14, and 15. Once finished, close the windows.
23. In the console/terminal window associated with the running Agent 2, type “update initial lexicon” and press Enter. This will display a window for the lexicon in Agent 2.
24. Check that it contains only the words from the first word list and the automatically retrieved lexical data. Close the window.
25. Shut down the agents by typing “exit” in their console/terminal windows or by using the Terminate button in Eclipse.
Results Pass. Agent 1 contained a single lexicon identical to that of Agent 3. The matching label had the highest strength of the labels associated with the lexicon at Agent 3. The lexicon at Agents 1 and 3 contained all of the words from the third word list and included all of the manually entered and automatically retrieved data. Agent 1 had three labels associated with the lexicon, all with the same strength.
Agent 2 contained a single lexicon with only the words from the first word list and the data automatically retrieved from the external data source. Although the results matched those expected of the scenario, they highlight the issues with Scenario 3 lexicon divergence as discussed in Section 3.3.3.3. For example, if the lexicon at Agent 1 were updated again, it would be found to diverge from the lexicon at Agent 2 again, causing another label to be created. Since the original label would have its strength reduced, the next time the lexicon were updated at Agent 1 the lexicon at Agent 3 would not be updated, since none of the highest-strength labels at Agent 1 would match a label at Agent 3. If interactions with other agents in the meantime increased the strength of the label shared between all three then, although Agent 3 would continue to be updated, new labels would perpetually be created at Agent 1.
4.2.6 Validation Test 6 – Scenario 3: Many Local Changes
Purpose The purpose of this test is to verify and demonstrate the behaviour of the Lexicon Builder system in the scenario where the local agent already contains a lexicon that is then updated with many changes. This is an important scenario as the lexicon is effectively diverging from itself. In this case it is most likely going to diverge from the lexicons with the same label held by other agents, so it creates a new lexicon with a new label. Future requests for lexicons matching a word list may begin to propagate this new lexicon around the Lexicon Builder system.

Test Data The word list used for this test is provided in Appendix A.1. This list is taken from the combined sentences used as examples by Kleiner et al. (Kleiner et al. 2009).

Prerequisites The word list files exist at the locations specified by the command line arguments for the agents.

Expected Results Agent 2 contains two lexicons: one that includes both the manually entered and automatically retrieved lexical information, and the other with only the automatically retrieved data. The lexicons will have different labels associated with them.
Test Procedure
1. Start Agent 2 using the following command:
java BasicRootLexiconAgent --settings=agent2.properties --word-list="Tests/Test Data/Verification06/word_list.txt"
This causes the same word list to be injected twice, causing a matching lexicon to be found locally.
2. A window will be displayed with the words listed on the left hand side and the lexical data in a table on the right. Check that all of the words from the word list are present and that their lexical data is that which has been retrieved from the external data source. Take note of the data; it will be compared to the lexicon that is found for the second loading of the word list.
3. In the text field at the top of the window enter “initial lexicon”. This will associate an owner tag with the lexicon. Close the window.
4. Another window will be displayed with the words listed on the left hand side and the lexical data in a table on the right. Check that it contains the same data as the previous window.
5. Select the word AtLeast on the left hand side. Right-click the table on the right and select Add Category. Change the value in the Class column of the new category to QValuedCat.
6. Select the word Each on the left hand side. Right-click the table on the right and select Add Category. Change the value in the Class column of the new category to QUnvaluedCat.
7. Close the window. (This commits the changes to the lexicon.)
8. In the console/terminal window associated with the running Agent 2, type “printmetadata” and press Enter. This will display a list of lexicons and their information, including owner tags, labels, and their strengths. Check that there are two lexicons, one with no owner tags and the other with the tag “initial lexicon”. The two lexicons should have different labels associated with them.
9. In the console/terminal window associated with the running Agent 2, type “update” and press Enter. This will display a window for the updated lexicon in Agent 2 (since it is created with no owner tags).
10. Check that it contains the data that was entered in steps 5 and 6. Close the window.
11. In the console/terminal window associated with the running Agent 2, type “update initial lexicon” and press Enter. This will display a window for the lexicon with the given owner tag, in this case the initial lexicon created by Agent 2.
12. Check that it contains only automatically retrieved lexical data and not the changes made in steps 5 and 6. Close the window.
13. Shut down the agent by typing “exit” in its console/terminal window or by using the Terminate button in Eclipse.

Results Fail. The agent contained two lexicons, one with the automatically retrieved and manually added lexical data and the other with the automatically retrieved data only. However, the two lexicons had the same label, with a higher strength on the updated lexicon. This result is caused by an oversight in the implementation where updating the lexicon performs the same sequence of events as that described in Scenario 2 for many changes, rather than the events described in Scenario 3. Although this is an error, updating the scenarios in future work (and the respective implementation) as described in the sub-sections of Section 3.3.3 will bring Scenarios 2 and 3 in line with one another, allowing the reuse of code for this situation. This incorrect result actually highlights the errors caused by the current version of Scenario 2, as it makes it clear that the new lexicon hides the old. This is a problem because the new lexicon was deemed to have many changes over the original and is therefore likely to have many changes over any other lexicons that matched the original.
As a result, the next time a query is made for a lexicon with the same label, the new lexicon will be returned causing the calling agent to diverge its lexicon when it may still have matched the old lexicon.
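The hiding behaviour described above can be illustrated with a minimal sketch of label-based lexicon lookup (the class and field names here are illustrative, not the actual implementation):

```python
from dataclasses import dataclass

@dataclass
class Lexicon:
    name: str          # illustrative identifier, not part of the real system
    label: str         # the label agents use to refer to a shared lexicon
    strength: float    # strength of the label/lexicon association

def resolve(label, lexicons):
    """Return the strongest lexicon associated with the given label."""
    candidates = [lx for lx in lexicons if lx.label == label]
    return max(candidates, key=lambda lx: lx.strength, default=None)

# The faulty behaviour observed in this test: the updated lexicon was
# given the SAME label with a HIGHER strength, so it hides the original.
original = Lexicon("original", "lex-1", strength=0.5)
updated = Lexicon("updated", "lex-1", strength=0.8)

assert resolve("lex-1", [original, updated]) is updated
```

Under this lookup rule, any agent querying for “lex-1” now receives the updated lexicon and diverges, even if its own copy still matched the original.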
CHAPTER 4. EXPERIMENTAL SYSTEM EVALUATION
4.3 Performance Tests
This section describes the performance tests that were performed on the Lexicon Builder system. These tests were designed to gather statistics that can be analysed in order to evaluate the potential of the Lexicon Builder system and answer the subquestions of this research. The tests, including their results, are documented here, while an analysis of the results is presented in Chapter 5. Each performance test was run using two data sets: an SBVR-specific list of words, and a list of words taken from a more general text. The SBVR-specific word list includes the words from the example sentences used by Kleiner et al. (Kleiner et al. 2009), while the more general word list is taken from the introduction section of this research’s proposal. Both word lists are provided in Appendix A. These two different word lists were selected because one of the goals of the Lexicon Builder system is to integrate with the process developed by Kleiner et al. (Kleiner et al. 2009), which is intended to be used with SBVR-like texts; however, as that is a small word list, it was deemed beneficial to evaluate the Lexicon Builder’s potential when operating on more general texts as well.
4.3.1 Performance Test 1 – Automatically Retrieved Data
Purpose

The purpose of this test is to gather statistics on the Lexicon Builder system that aid in answering the first two subquestions of this research, i.e. ‘How much of a reduction can be gained?’ and ‘How well does it work?’ To this end, this test determined the number of words that successfully had their information obtained from the external data source, since this is the primary means of speeding up the process of developing a lexicon.

Prerequisites

Validation Test 1 completes with the expected results.

Test Procedure

1. Start Agent 2 using the following command:
   java BasicRootLexionAgent --settings=agent2.properties --word-list="Tests/Test Data/Performance/kleiner_word_list.txt"
2. A window will be displayed with the words on the left and the lexical data in a table on the right. Close the window.
3. In the console/terminal associated with Agent 2 type "exit" and press Enter to shut down the agent.
4. Find the section of output in the console/terminal that appears as follows:
   Total words: <value>
   Number of words with retrieved info: <value>
   Percentage words with retrieved info: <value>
   The <value>s provide the results of this test for the SBVR word list.
5. Start Agent 2 using the following command:
   java BasicRootLexionAgent --settings=agent2.properties --word-list="Tests/Test Data/Performance/proposal_word_list.txt"
6. A window will be displayed with the words on the left and the lexical data in a table on the right. Close the window.
7. In the console/terminal associated with Agent 2 type "exit" and press Enter to shut down the agent.
8. Find the section of output in the console/terminal that appears as follows:
   Total words: <value>
   Number of words with retrieved info: <value>
   Percentage words with retrieved info: <value>
   The <value>s provide the results of this test for the Proposal word list.
Results

For the first data set, the SBVR word list, the Lexicon Builder managed to retrieve data for approximately 92% of the words it contained. The word list taken from this research’s proposal was almost 10% lower, at approximately 82%. These results are displayed in Figure 4.1 and Table 4.1.
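These percentages follow directly from the counts in Table 4.1; as a quick arithmetic check (a sketch, not part of the Lexicon Builder):

```python
def retrieval_percentage(retrieved, total):
    """Percentage of words whose lexical data was retrieved."""
    return 100.0 * retrieved / total

sbvr = retrieval_percentage(11, 12)        # SBVR word list (Table 4.1)
proposal = retrieval_percentage(224, 272)  # Proposal word list (Table 4.1)

print(round(sbvr, 2), round(proposal, 2))  # roughly 92% and 82%
```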
Figure 4.1: Histogram of the Percentage of Words with Retrieved Data
Table 4.1: Number of Words with Retrieved Data

Word List    Words Retrieved    Total Words
SBVR         11                 12
Proposal     224                272
4.3.2 Performance Test 2 – Categories Per Word
Purpose

The purpose of this test is to gather statistics on the Lexicon Builder system that aid in answering all three subquestions of this research and gauge its overall effectiveness. To achieve this, it determined the number of categories that were retrieved and calculated various statistics, such as the minimum, maximum, average, median, and mode number of categories per word. In addition, it reported the number of times each number of categories per word was encountered.

Prerequisites

Validation Test 1 completes with the expected results.

Test Procedure
1. Start Agent 2 using the following command:
   java BasicRootLexionAgent --settings=agent2.properties --word-list="Tests/Test Data/Performance/kleiner_word_list.txt"
2. A window will be displayed with the words on the left and the lexical data in a table on the right. Close the window.
3. In the console/terminal associated with Agent 2 type "exit" and press Enter to shut down the agent.
4. Find the section of output in the console/terminal that appears as follows:
   Min No. Categories per word: <value>
   Max No. Categories per word: <value>
   Ave. No. Categories per word: <value>
   Median No. Categories per word: <value>
   Modes No. Categories per word: [<value>, ..., <value>]
   Frequencies of No. Categories per word
   0: <value>
   ...
   n: <value>
   The <value>s provide the results of this test for the SBVR word list.
   Note: ‘...’ for the modes represents the potential to have multiple numbers of categories that occur with the same frequency. For the frequencies, ‘...’ denotes a list of numbers of categories. This list may skip values that have a frequency of zero.
5. Start Agent 2 using the following command:
   java BasicRootLexionAgent --settings=agent2.properties --word-list="Tests/Test Data/Performance/proposal_word_list.txt"
6. A window will be displayed with the words on the left and the lexical data in a table on the right. Close the window.
7. In the console/terminal associated with Agent 2 type "exit" and press Enter to shut down the agent.
8. Find the section of output in the console/terminal that appears as follows:
   Min No. Categories per word: <value>
   Max No. Categories per word: <value>
   Ave. No. Categories per word: <value>
   Median No. Categories per word: <value>
   Modes No. Categories per word: [<value>, ..., <value>]
   Frequencies of No. Categories per word
   0: <value>
   ...
   n: <value>
   The <value>s provide the results of this test for the Proposal word list.
   Note: ‘...’ for the modes represents the potential to have multiple numbers of categories that occur with the same frequency. For the frequencies, ‘...’ denotes a list of numbers of categories. This list may skip values that have a frequency of zero.
Results

To begin with, the number of categories per word showed some important and interesting results regarding the Lexicon Builder’s potential. They are displayed in Figure 4.2. The minimum number of categories is not a surprise, as not all words were able to have data retrieved from the external source, while both word lists had a similar maximum of 4 and 5 for the SBVR and Proposal word lists respectively. The average number of categories per word was approximately 1.5 for each word list (1.42 and 1.61 for the SBVR and Proposal word lists respectively) and the most important measure, the mode, was found to be 1. Its importance will be discussed in Section 5.2.
Figure 4.2: Histogram of the Number of Categories Retrieved per Word

The percentages of words with each number of categories in both data sets show that words with one category appear more than twice as often as words with zero or two categories (the second and third most frequent numbers of categories). This can be seen in Figure 4.3. It is worth noting that the two data sets show similar trends across the frequencies, even though the small SBVR data set does not have enough entries for a proper statistical analysis. The actual frequencies for each of the data sets are displayed in Table 4.2.
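The summary statistics reported by this test are standard measures over the per-word category counts; the following sketch recomputes them from the SBVR column of Table 4.2 (illustrative only, not the Lexicon Builder’s implementation):

```python
import statistics
from collections import Counter

# Per-word category counts reconstructed from Table 4.2 (SBVR word list):
# one word with 0 categories, seven with 1, three with 2, one with 4.
counts = [0] + [1] * 7 + [2] * 3 + [4]

print("Min:", min(counts))                                # 0
print("Max:", max(counts))                                # 4
print("Ave:", round(statistics.mean(counts), 2))          # 1.42
print("Median:", statistics.median(counts))               # 1
print("Modes:", statistics.multimode(counts))             # [1]
print("Frequencies:", dict(sorted(Counter(counts).items())))
```

The recomputed average (1.42) and mode (1) agree with the reported results.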
Figure 4.3: Histogram of the Percentage Frequencies of the Number of Categories
Table 4.2: Frequencies for Number of Categories

                     No. of Words
No. of Categories    SBVR    Proposal
0                    1       48
1                    7       117
2                    3       49
3                    0       16
4                    1       34
5                    0       8
4.3.3 Performance Test 3 – Correct and Superfluous Data
Purpose

The purpose of this test is to gather additional statistics on the Lexicon Builder system that aid in answering all three subquestions of this research and gauge its overall effectiveness. This test determines the number of words that have not had their correct category retrieved from the external data source and the number of words that have had extra, unneeded categories retrieved. Furthermore, the actual number of categories in both instances is determined, and the lists of words with missing or additional categories are displayed with the number of categories that are missing or additional. These lists can be used to generate statistics on the effectiveness of the Lexicon Builder.

Note: To determine which words are missing categories or contain additional categories, the agent compares the created lexicon to the expected data read in from a file (see Appendix B).

Prerequisites

Validation Test 1 completes with the expected results.

Test Procedure
1. Start Agent 2 using the following command:
   java BasicRootLexionAgent --settings=agent2.properties --compare-list="Tests/Test Data/Performance/marked_up_kleiner_word_list.txt"
   The compare-list argument causes the agent to retrieve a lexicon for the words in the file and then compare it to the lexical data contained within the file.
2. A window will be displayed with the words on the left and the lexical data in a table on the right. Close the window.
3. In the console/terminal associated with Agent 2 type "exit" and press Enter to shut down the agent.
4. Find the section of output in the console/terminal that appears as follows:
   DISCREPANCIES
   Number of words with missing elements: <value>
   Total missing elements: <value>
   Number of words with additional elements: <value>
   Total additional elements: <value>
   Words with Missing Elements:
   WORD    AMOUNT
   ...
   Words with Additional Elements:
   WORD    AMOUNT
   ...
   The <value>s provide the results of this test for the SBVR word list.
5. Start Agent 2 using the following command:
   java BasicRootLexionAgent --settings=agent2.properties --compare-list="Tests/Test Data/Performance/marked_up_proposal_word_list.txt"
   The compare-list argument causes the agent to retrieve a lexicon for the words in the file and then compare it to the lexical data contained within the file.
6. A window will be displayed with the words on the left and the lexical data in a table on the right. Close the window.
7. In the console/terminal associated with Agent 2 type "exit" and press Enter to shut down the agent.
8. Find the section of output in the console/terminal that appears as follows:
   DISCREPANCIES
   Number of words with missing elements: <value>
   Total missing elements: <value>
   Number of words with additional elements: <value>
   Total additional elements: <value>
   Words with Missing Elements:
   WORD    AMOUNT
   ...
   Words with Additional Elements:
   WORD    AMOUNT
   ...
   The <value>s provide the results of this test for the Proposal word list.

Results

The results for the amount of correct and superfluous data are not very promising for the effectiveness of automating lexicon development using the Lexicon Builder. For the SBVR word list, less than 60% of words contained the correct information. The result was better for the proposal word list, at over 80%, which may indicate that the Lexicon Builder works more effectively for general language than for controlled languages; however, the small sample size for the SBVR word list is a likely factor in producing such a low result. These statistics make the assumption that the words whose correct category is not supported by the Lexicon Meta-Model (refer to Section 5.1) would have had the correct category retrieved if it were supported. This has been deemed a reasonable assumption, as entries for the words are available on The Free Dictionary that contain the correct information. Eliminating the words without supported categories altogether results in the number of words with correct categories remaining over 80%.
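The missing and additional element counts amount to a per-word multiset comparison between the retrieved lexicon and the marked-up expected data; a sketch of the idea, with illustrative entries (function and data names are not the actual implementation):

```python
from collections import Counter

def discrepancies(retrieved, expected):
    """Compare per-word category lists; return per-word missing and
    additional category counts as two dictionaries."""
    missing, additional = {}, {}
    for word in set(retrieved) | set(expected):
        have = Counter(retrieved.get(word, []))
        want = Counter(expected.get(word, []))
        miss = sum((want - have).values())
        extra = sum((have - want).values())
        if miss:
            missing[word] = miss
        if extra:
            additional[word] = extra
    return missing, additional

# Illustrative entries based on the cases discussed in Chapter 5.
retrieved = {"each": ["determiner"],
             "sells": ["noun", "verb", "transitive verb", "intransitive verb"]}
expected = {"each": ["quantified unvalued determiner"],
            "sells": ["transitive verb"]}

missing, additional = discrepancies(retrieved, expected)
```

Here “each” has one missing and one additional category, while “sells” has three additional categories.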
Almost 50% of the words in the proposal word list contained superfluous categories, which increased to almost 70% for the SBVR word list. Furthermore, the results show that between 50% and 60% of the categories retrieved for both word lists were superfluous, indicating there could be quite a large amount of information for the user to sort through and remove in order to complete the definition of an optimal lexicon. These results are summarised in Figure 4.4 and Table 4.3, while the actual words and their missing and/or additional categories are displayed in Appendix C.
Figure 4.4: Histogram of the Percentages of Correct and Additional Categories

Table 4.3: Correct and Additional Categories

                                    SBVR    Proposal
Words with Correct Categories*      7       226 (201**)
Words with Additional Categories    8       127
Total Additional Categories         10      239

* Inverse of the “Number of words with Missing Elements” statistic reported by the Lexicon Builder.
** The number of words with the correct category excluding words with categories that are not supported by the Lexicon Model.
4.4 Summary
In this chapter the testing performed on the experimental Lexicon Builder system was presented along with its results. The testing was separated into validation testing, which focused on ensuring the system behaved as expected, and performance testing, which focused on gathering metrics that could be used to analyse the effectiveness of the system. Performing the validation tests was important as it demonstrated the Example form of validation from Shaw (Shaw 2002) and partially demonstrated the feasibility of the Lexicon Builder system. However, the performance testing was most important, demonstrating the Analysis form of validation from Shaw (Shaw 2002); its results constitute the main results of this research. As the experimental Lexicon Builder implementation has quite a few limitations, the performance tests are primarily a means to more specifically identify the areas for future improvement and guide future research efforts. Furthermore, due to time constraints the validation testing was limited to simple scenarios; however, it still helped to identify several shortcomings in these scenarios, which have already been discussed in Section 3.3.3. In the next chapter, an analysis of the results gathered from the performance tests is presented.
Chapter 5

Analysis

This chapter discusses the results gathered from the three performance tests carried out on the Lexicon Builder system. The actual methods of testing and the data sets used were described in Chapter 4. The results of each test are examined in the context of answering one or more of the research subquestions: ‘How much of a reduction can be gained?’, ‘How well does it work?’, and ‘What are its limitations?’. Section 5.1 discusses the results for the overall amount of data that was automatically retrieved for new lexicons, while Section 5.2 discusses the results for the amount of data that was associated with each word in the lexicon. Finally, Section 5.3 discusses whether or not the retrieved information was correct or superfluous and how this affects the performance of the Lexicon Builder.
5.1 Automatically Retrieved Data
This section discusses the results of the first of the three tests: the number of words that had lexical information retrieved from the external data source. The results of this test provide an insight into the Lexicon Builder’s potential to reduce the amount of time it takes to completely define a lexicon when the data does not already exist in the system. Initial analysis indicates that these figures are quite good, as for SBVR-based texts it appears that manually defining a lexicon could be almost eliminated; however, the very high value of 92% is likely caused by the very small sample size of the SBVR words. A more representative text would most likely result in a percentage closer to that of the Proposal word list. Potentially, it could fall below the general text, as a larger SBVR text would likely refer to a number of business products and services by name that would not be in any external data source and would have to be entered manually no matter what. For example, combining the sentence “The Lexicon Builder is a software” with the existing SBVR sentences would result in two additional words (“the” and “Lexicon Builder”[1]), only one of which is recognised, causing a reduction in the percentage of words with retrieved data to approximately 85%. Furthermore, as can be seen from the previous sentence, SBVR texts often repeat the same words frequently and are therefore likely to increase the ratio of domain specific terms to other categories above that of a more general text. Although, for the SBVR controlled language in particular, if all other categories can be eliminated it may become possible to automatically categorise these domain specific terms.

The 82% figure for more general texts is still quite positive, as it indicates a possible 82% reduction in manual work when defining a lexicon. In fact, this percentage is lower than it should be due to the restrictions placed on the Lexicon Model by being directed at the SBVR controlled language. For example, the Lexicon Model does not define categories for Prepositions or Conjunctions for words such as “with”, “for”, or “while” that are commonly used in the text. As a result, the implemented parser does not extract this category from the external source and, even if it did, there is no way to store this information as it is currently not supported by the Lexicon Meta-Model. This opens an avenue for future work to expand the process of Kleiner et al. (Kleiner et al. 2009) to more general usage. Overall it is a promising result, indicating that an 80% or better improvement may be able to be made over manually defining an entire lexicon. In the future, improvements to the parsing, using multiple data sources, and generalising the Lexicon Model could be used to improve the level of automation even further.

[1] Together the words “Lexicon” and “Builder” form the name of the software.
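The approximately 85% figure can be checked directly against the counts in Table 4.1 (a back-of-the-envelope calculation, not system output):

```python
retrieved, total = 11, 12  # SBVR word list counts from Table 4.1
retrieved += 1             # "the" is recognised by the external source
total += 2                 # two new words: "the" and "Lexicon Builder"

new_percentage = 100.0 * retrieved / total
print(round(new_percentage, 1))  # roughly 85, as stated above
```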
5.2 Categories per Word

This section discusses the results of the second test: the number of categories retrieved for each word. These results help to answer all three of the research subquestions, as
too much or too little data will mean a time-consuming manual process will still need to be undertaken in order to complete the lexicon. In addition, this analysis discovered some limitations of the Lexicon Builder system.

Initially, the maximum number of categories of four or five appears quite high, since the goal is to have only those categories that are required for the domain specific lexicon. Additional categories will impact the amount of automation provided by the Lexicon Builder and possibly reduce the effectiveness and efficiency of the configuration aspect of the process developed by Kleiner et al. (Kleiner et al. 2009).

Analysis of the words with a large number of categories revealed a limitation in the Lexicon Builder’s parsing of the web pages of The Free Dictionary. It was discovered that an extra verb category is associated with some words. Most, if not all, of the words with four or five categories are verbs with different forms, where the web page lists the different forms with a separate category. For example, the word “sell” (redirected to by The Free Dictionary from the word “sells” in the SBVR word list) can be a transitive or intransitive verb or a noun. However, it can be of the forms “sold”, “selling” and “sells”, and these different forms are marked by The Free Dictionary as ordinary verbs. As a result, the parser picks up verb as another category and associates it with the words, resulting in the four categories: noun, verb, transitive verb, and intransitive verb. Obviously the verb category is superfluous, as it is covered by its subcategories transitive and intransitive verb. This situation is displayed in Figure 5.1 and Figure 5.2. Correcting the parser to identify this situation will reduce the maximum number of categories for a word to three or four, which should be better for maintaining a high level of automation.
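One way such a parser correction might work is to drop any category whose subcategories are also present; a minimal sketch, with an illustrative subcategory map rather than the actual Lexicon Model:

```python
# Illustrative subcategory relation: child category -> parent category.
PARENT = {
    "transitive verb": "verb",
    "intransitive verb": "verb",
}

def drop_subsumed(categories):
    """Remove a category when one of its subcategories is also present."""
    parents_covered = {PARENT[c] for c in categories if c in PARENT}
    return [c for c in categories if c not in parents_covered]

# The "sell" example: the plain verb category is subsumed and removed.
cats = ["noun", "verb", "transitive verb", "intransitive verb"]
assert drop_subsumed(cats) == ["noun", "transitive verb", "intransitive verb"]
```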
The mode is an important measure because the more categories a word has, the more ambiguous the sentence is and the more searching the Configuration process must do. This impacts both the quality of the output, as it may produce an incorrect answer, and the time it takes to produce a result; three sentences were already causing the configuration to take over one second (Kleiner et al. 2009, pg. 7). Therefore, the mode of 1 for both of the data sets, indicating that most words have only one usage or category, is a good sign for improving the automation of the entire process and the accuracy of the result of the configuration.

Figure 5.1: Definition of “sell” from The Free Dictionary (Farlex 2010)

Figure 5.2: Categories for “Sells” in the Lexicon Builder

A problem may arise, though, if the changes to the parsing and/or supporting more categories increase this number, as the average is already approximately 1.5 categories per word. However, as the frequency data shows that one category occurs twice as often as any other, the slight redistribution of frequencies from these changes should not affect this statistic. Furthermore, improvements should reduce the number of words associated with no categories, most likely moving them into the one-category bracket, and therefore maintaining the high number of words with only one category. A graph of what these results may look like with the improvements implemented is shown in Figure 5.3. The mode of one for the number of categories per word could be used as a potential
Figure 5.3: Potential Percentage Frequencies of Number of Categories for an Improved Lexicon Builder avenue to increase the level of automation of the Lexicon Builder (possibly completely automated) and to reduce the time it takes for the configuration process to complete. Future work could look at, for example, utilising the fact that most words have only one category associated in order to “fill in the gaps” for words in a sentence where data was not retrieved at all. Conversely, it could also be used to cull the additional data retrieved for some words down to the minimum set of categories used by the input text.
5.3 Correct and Superfluous Data
This section discusses the results of the last test: how much of the retrieved data was correct and how much was superfluous. The results of this test aid in answering all three subquestions, as large amounts of incorrect and superfluous category data reduce the effectiveness of automating the process, requiring the user to spend large amounts of time correcting and culling data from the lexicon.
5.3.1 Missing Categories
Analysis of the words that do not have their correct category retrieved from the external data source reveals that a majority of the errors are due to a few recurring cases, with
the rest often specific to individual words. The errors with the SBVR word list are caused by insufficiently detailed information in the external data source for all but one of the words. For example, although the category for “each” has been correctly retrieved as a determiner, the strictly correct category is a quantified unvalued determiner. Similarly, the word “sold” had the category verb retrieved; however, its strictly correct category is transitive verb. Due to the time constraints on this research it was not possible to determine the effect that less specific information has on the Configuration aspect of the process of Kleiner et al. (Kleiner et al. 2009). It may still be able to produce the correct results with less specific (but no less correct) information; however, it is likely that, at the very least, it will require more searching and therefore cause the process to take much longer to complete than if it had the most specific information.

The last case for the SBVR word list was caused by “at least”. This word did not have any information retrieved at all, as the Lexicon Builder does not currently search for phrases properly – although if it could, “at least” would have had the same problem as the rest with the ambiguity of its category information. Although The Free Dictionary allows phrase searches, not all phrases return appropriate results. For example, searching for “at least” displays the definition for the word “least”, which has the phrase “at least” at the bottom (see Figure 5.4). However, there is no usage information associated with it, so it cannot be determined whether it is a verb or noun, etc. This is a major issue, as a large number of SBVR key words are phrases, including “at least”, “at most” and “more than” (OMG 2008b). Since the goal is to enable the automatic transformation of specifications, not being able to determine the category of the SBVR key words severely hampers that effort.
A possible means of correcting this issue that could be looked at in future work is to allow the definition of small lexicons that can be included in any number of other lexicons. In this way a lexicon of SBVR key words can be manually created once and then reused by any number of lexicons that are in the SBVR domain. As there are only approximately 39 SBVR key words (OMG 2008b) compared to the possible hundreds of words in a complete lexicon, there would still be the benefit of reducing the amount of time spent defining a lexicon. Ideally, the Object Management Group (OMG) – the organisation that develops the SBVR standard – would produce the SBVR key word lexicon and make it available on the Internet, anyone could then incorporate it directly into their lexicon and maintain a completely automated process. The same thing could
potentially be achieved for very common words as well, preventing the redefinition of common words in every single lexicon and helping maintain the domain specificity of each lexicon. Doing this would help words such as “is” keep the most specific usage across all lexicons and hopefully improve the entire process.
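Such lexicon inclusion could amount to layering a manually defined SBVR key word lexicon under an automatically built domain lexicon; a sketch (the entries shown are a small illustrative sample, not the full set of approximately 39 SBVR key words):

```python
# A manually defined, reusable base lexicon of SBVR key words (sample only).
SBVR_KEYWORDS = {
    "at least": ["quantified valued determiner"],
    "at most": ["quantified valued determiner"],
    "each": ["quantified unvalued determiner"],
}

def include(base, *included):
    """Build a lexicon from included base lexicons, overridden by the
    domain lexicon itself (later entries win)."""
    merged = {}
    for lexicon in included + (base,):
        merged.update(lexicon)
    return merged

domain = {"rental": ["noun"], "sells": ["transitive verb"]}
merged = include(domain, SBVR_KEYWORDS)
```

Here the phrase entries come from the shared key word lexicon, while domain terms remain defined locally.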
Figure 5.4: Definition of “least” from The Free Dictionary (Farlex 2010)

The errors in the words forming the proposal word list are more extensive but are still mostly caused by a small number of repeating cases. The most recurring, and most easily fixed, issue is with the format of some of the verbs on The Free Dictionary. This issue is separate from the repetition of verb categories discussed in Section 5.2 and is caused by the text used to mark words that are only transitive verbs being reversed, i.e. “tr. v.” compared to “v. tr.”. For example, the word “avoiding”, which is redirected to “avoid”, has the definition shown in Figure 5.5 and can be compared to the definition of “sell” shown in Figure 5.1. This is a limitation of the implementation of the Lexicon Builder’s parser not checking for the two methods of marking transitive verbs, therefore causing some words to not be associated with the correct category. Some of the other words affected by this error include: “based”, “described”, “include”, and “included”. Another similar parser error is caused by words marked as auxiliary verbs in The Free Dictionary. Since this definition was not taken into account in the parser design, no
such category or equivalent is associated with these words. An example can be seen in the definition of “would” shown in Figure 5.6, where the category is denoted by “aux. v.” and is therefore not identified by the parser (displayed in Figure 5.7). The expected category is an ordinary verb; however, depending on how specific the usage information needs to be, an extra category for auxiliary verbs may need to be added to the Lexicon Model. In this case it could be discounted from the results as not being supported, along with the other categories such as conjunction and preposition; however, this is something for future work to investigate, as either adding support for auxiliary verbs to the Lexicon Model or updating the parser to identify “aux. v.” (and possibly “v. aux.”) as a verb would improve the number of words with correctly identified categories and help towards completely automating the development of lexicons.

Figure 5.5: Definition of “avoid” from The Free Dictionary (Farlex 2010)
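Both the reversed “tr. v.”/“v. tr.” markers and the unhandled “aux. v.” marker could be addressed by matching marker tokens order-insensitively; a sketch (the marker-to-category table is an illustrative assumption, not the actual parser):

```python
def normalise_marker(marker):
    """Map a part-of-speech marker to a category, ignoring the order of
    the marker's tokens (e.g. "tr. v." versus "v. tr.")."""
    tokens = frozenset(marker.replace(".", "").split())
    table = {
        frozenset(["tr", "v"]): "transitive verb",
        frozenset(["intr", "v"]): "intransitive verb",
        frozenset(["aux", "v"]): "verb",  # auxiliary verbs mapped to plain verb
        frozenset(["v"]): "verb",
        frozenset(["n"]): "noun",
    }
    return table.get(tokens)

assert normalise_marker("tr. v.") == "transitive verb"
assert normalise_marker("v. tr.") == "transitive verb"
assert normalise_marker("aux. v.") == "verb"
```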
Figure 5.6: Definition of “would” from The Free Dictionary (Farlex 2010)

The third main problem is phrases and proper nouns consisting of multiple words, for example, the phrase “such as” and the proper noun “Natural Language Processing”. As mentioned above for “at least” from the SBVR word list, The Free Dictionary does support phrase searches, although with mixed results. Some phrase searches may be successful, such as “Natural Language” (shown in Figure 5.8); however, many more are not. This is more of an issue for general language than SBVR, as phrases such as “such as” are quite commonly used and, while it may be practical to manually develop a
lexicon of the SBVR key word phrases, there are many more phrases in general usage that would need to be included than are a part of the SBVR key words. There may be a solution, as identified earlier in Section 5.2, by using the known information to “fill in the blanks” for such phrases and proper nouns. This is something that could be investigated in future work.

Figure 5.7: Categories for “would” in the Lexicon Builder
Figure 5.8: Definition of “Natural Language” from The Free Dictionary (Farlex 2010)

The last recurring issue is due to some words arriving at an alternate definition, or no definition at all. Some of the words that are subject to this error include: “developed”, “limitations”, and “2”. An example of an alternate definition occurs when searching for “developed”. This returns the adjective definition, which is mapped to the determiner category, of “developed” (as in “a developed country”) rather than the past tense of “develop” (compare Figure 5.9 and Figure 5.10). Similarly, “limitations” does not have a definition at all, when it should be redirected to “limitation” as it is being used in the plural form. It may be possible to solve both of these cases by performing another search without the “ed”, “s”, “ing”, et cetera, endings and checking that the word appears in the list of alternate forms. Although this may cause the process to take twice as long as it should, once the data has been retrieved from the external source the first time these checks should not need to be performed again. Alternatively, the use of multiple data sources may solve this problem without necessitating the determination of the endings
of words, which could become awkward. For example, “possibilities” might result in a check for “possibilitie” as it ends in “s”. Lastly, the problem with “2”, and other numerals, occurs due to the alternate definition containing only the adjective category when, in the case of the proposal word list, it is actually referring to the number as a noun. This is harder to solve, as simply converting the numeral to its textual form would result in a similar error due to the Lexicon Builder detecting the word as a number and assigning the numeral category from the Lexicon Model. A solution could be to eliminate the look-up of numerals altogether by having the Lexicon Builder detect numerals and assign the noun and numeral categories automatically, since all numerals have the same two categories.
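Both the inflected-form misses and the numeral case admit simple pre-processing sketches (the suffix list and category names are assumptions for illustration, not the implemented parser):

```python
COMMON_SUFFIXES = ["ing", "ed", "es", "s"]

def fallback_forms(word):
    """Candidate base forms to retry when a lookup fails or returns an
    alternate definition, e.g. "limitations" -> "limitation"."""
    return [word[: -len(s)] for s in COMMON_SUFFIXES if word.endswith(s)]

def numeral_categories(word):
    """Assign numerals their two fixed categories without an external lookup."""
    if word.isdigit():
        return ["noun", "numeral"]
    return None

assert "limitation" in fallback_forms("limitations")
assert "possibilitie" in fallback_forms("possibilities")  # the awkward case noted above
assert numeral_categories("2") == ["noun", "numeral"]
```

As the “possibilitie” case shows, naive suffix stripping still needs the retrieved entry’s list of alternate forms (or a second data source) to confirm the base form.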
Figure 5.9: Definition of “developed” from The Free Dictionary (Farlex 2010)
Figure 5.10: Definition of “develop” from The Free Dictionary (Farlex 2010)

Finally, the remaining errors are due to two reasons: (1) a prefixed word and (2) a word with UK English spelling, while The Free Dictionary contains words using American English spelling. The prefixed word is a difficult case to solve. If all prefixed
5.3. CORRECT AND SUPERFLUOUS DATA
words are guaranteed to be hyphenated then it may be possible to search for the non-prefix component separately and associate the categories from that with the prefixed word; however, it may be one of the few cases where user intervention is required to ensure the associated categories are correct. The error with the UK English spelling is caused by “utilise”, which is interesting as “utilising” is correctly redirected to “utilize”. Searching for some other words with UK English spelling on The Free Dictionary, such as “specialise” and “specialising”, reveals that they have the same problem as well. This could potentially be fixed easily by using a second source that correctly identifies the UK English spelling of words; however, it raises the issue of how the Lexicon Builder itself can deal with different spellings when comparing lexicons. The same issue arises for words that are not consistently hyphenated, although a simple fix in that case could be to remove all hyphenation. An alternative could be to combine the Lexicon Builder with the dictionaries used in spellcheckers, which contain alternate spellings and hyphenations, in order to normalise the words for comparisons of lexicons. This could even be achieved as part of the text injection phase so that the Lexicon Builder does not need to take it into account when performing the comparisons. These are all possibilities that should be investigated in future work.
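The normalisation idea can be sketched as below. The mapping table and the blanket -ise/-ize rule are illustrative assumptions, not part of the thesis implementation; as argued above, a real system would consult a spellchecker dictionary, since the rule has exceptions:

```python
# Illustrative spelling table; a real system would load pairs like these
# from a spellchecker dictionary rather than hard-code them.
UK_TO_US = {
    "utilise": "utilize",
    "specialise": "specialize",
}

def normalise(word):
    """Normalise spelling and hyphenation before comparing lexicons."""
    w = word.lower().replace("-", "")   # drop inconsistent hyphenation
    if w in UK_TO_US:
        return UK_TO_US[w]
    # Crude -ise/-ize fallback; real dictionaries are needed because the
    # rule has exceptions (e.g. "wise", "rise", "promise").
    if w.endswith("ise") and len(w) > 4:
        return w[:-3] + "ize"
    return w
```

Running such a function during text injection, as suggested above, would mean lexicons only ever store and compare the canonical forms.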
5.3.2 Additional Categories
A large number of additional categories in the proposal word list are caused by verbs that can be used both transitively and intransitively. Improving parsing to eliminate duplicates (as mentioned earlier in Section 5.2) will remove one extra category per word with duplicate categories. This would eliminate a total of 57 categories and bring the percentage of additional categories for the proposal word list down to 47.64% (almost a 10% reduction in superfluous categories). Further decreases may be seen depending on the style of writing and the size of the text, as it is possible that a large number of these verbs would end up being used both transitively and intransitively in the same text. Therefore, the additional categories caused by these verbs may be completely removed. As can be seen from the proposal word list, the writer’s style more often uses the transitive form of the verb. However, occasionally the intransitive form is used, the frequency of which would likely increase over the course of the document as the writer attempts to prevent repetitious sentences. The same reduction is unlikely
to occur for the SBVR word list due to the controlled nature of SBVR consistently using only one form of the verb in its short, specific sentences.

The ability of The Free Dictionary to link words to their base word has turned out to be a blessing as well as a curse, as categories of the base words are being added to derived words to which they do not belong. For example, the categories for the word “based” include noun and determiner when the only category that should apply is transitive verb. In this case, both of the additional categories are taken from the word “base” (as can be seen from its definition in Figure 5.11). By updating the parser to correctly identify the categories associated with a particular form of a word, these superfluous usages could be excluded, which would remove not only a number of additional categories but also some of the words with additional categories entirely.

Most of the additional categories for the SBVR word list stem from the correct category being more specific than the category that was retrieved from The Free Dictionary. For example, “each” and “exactly” were associated with the category Determiner, when they needed to be associated with the categories quantified unvalued determiner and quantified valued determiner, respectively. Other additional categories in the SBVR word list were caused by the issue with the duplication of verb categories and a couple of cases with legitimate multiple uses. While it appears that the additional categories can be almost eliminated in the SBVR word list using the methods discussed in Section 5.2, the small sample size is likely playing a factor in overstating the improvements that can be made. The evaluation of a larger sample of SBVR text should be an important aspect of future work with the Lexicon Builder. Unfortunately, most of the words with additional categories have legitimate multiple uses.
Although some small improvements can be made, the majority of the additional categories cannot be eliminated by improving the parsing of The Free Dictionary’s web pages. As mentioned in Section 5.3.2, it may be possible to use the fact that a large number of words have only one possible category in order to cull some of the superfluous categories. It may not be possible to reduce all of the words to a single category, but it may be possible to reduce the amount of searching required in the Configuration by eliminating some possibilities immediately. This is something else to look at in future work.
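The culling idea can be sketched as follows. The rule table is an illustrative assumption (the thesis does not specify such rules): a word with exactly one possible category is treated as certain, and its successor's candidate set is intersected with the categories allowed to follow it.

```python
# Toy rule set: which categories may follow an unambiguous category.
# These rules are illustrative assumptions, not taken from the thesis.
ALLOWED_AFTER = {
    "determiner": {"noun", "determiner"},    # e.g. "the (big) dog"
    "preposition": {"determiner", "noun"},
}

def cull(words, categories):
    """Drop categories that cannot follow an unambiguous neighbour.

    `categories` maps each word to a set of candidate categories; a word
    whose set has exactly one member is treated as certain.
    """
    result = {w: set(cats) for w, cats in categories.items()}
    for prev, curr in zip(words, words[1:]):
        prev_cats = result[prev]
        if len(prev_cats) == 1:                  # unambiguous neighbour
            (prev_cat,) = prev_cats
            allowed = ALLOWED_AFTER.get(prev_cat)
            if allowed:
                narrowed = result[curr] & allowed
                if narrowed:                     # never empty a word out
                    result[curr] = narrowed
    return result
```

For instance, given the words “the state” with “the” known to be only a determiner, “state” would keep noun and lose transitive verb, shrinking the search space handed to the Configuration without necessarily reducing every word to one category.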
Figure 5.11: Definition of “base” from The Free Dictionary (Farlex 2010)
5.4 Summary
This chapter has identified several major issues in the performance of the experimental Lexicon Builder. The analysis shows that, although there is potential for a large improvement to the development of lexicons, in its current form the Lexicon Builder system is ineffective due to a large amount of superfluous lexical information. Several avenues for future work to improve the system have been identified, such as improved parsing and the possibility of culling additional information by using other known information. The following chapter presents a summary of this research, its limitations, and future work.
Chapter 6 Summary and Conclusion

This chapter summarises the work that was undertaken for this research, the limitations identified, and the opportunities for future work. Section 6.1 summarises the work performed for this research, while Section 6.2 presents many of the limitations discovered before, during, and after performing this research. Finally, Section 6.3 concludes this dissertation.
6.1 Summary
This research developed the Lexicon Builder experimental system in order to investigate the possibility of reducing or eliminating the need to manually predefine complete lexicons for the NLP process proposed by Kleiner et al. (Kleiner et al. 2009). The system is a multi-agent system that utilises sharing of data, an external data source, and the principles of Emergent Semantics and Semiotic Dynamics, in a similar way to that described by Steels and Hanappe (Steels and Hanappe 2006), in an attempt to overcome the need to manually define lexicons for this process.

To develop this system, a simple methodology that included design, implementation, and test phases was used. The design phase consisted of selecting an appropriate external data source from which to retrieve lexical data, designing the Lexicon Meta-Model that defined the format in which lexicons would be stored, defining the behaviour of interacting agents through a set of interaction scenarios, and specifying a basic architecture that would aid in the effective implementation of the system. After considering a number of external data sources (including
CHAPTER 6. SUMMARY AND CONCLUSION
the scholarly sources CYC, WordNet, and EDR), the on-line dictionary The Free Dictionary was selected as an appropriate data source, as it contains a large number of entries, good coverage of lexical categories, and a high level of detail of categories and subcategories for most entries. The Lexicon Meta-Model was developed to match that already used by the process of Kleiner et al. (Kleiner et al. 2009) in order to support the future integration of the Lexicon Builder with that process. Finally, the interactions between agents were based on those described in Steels and Hanappe (Steels and Hanappe 2006) and modified to make them better suited to the Lexicon Builder system; however, they are subject to multiple limitations. Possible solutions have been proposed, but are left to future work due to time constraints.

In the implementation phase, the Lexicon Builder system was developed using the Java programming language in the Eclipse IDE. This was done for simplicity, due to prior knowledge of Java and use of the Eclipse Modelling Framework, which is used by Kleiner et al. (Kleiner et al. 2009) for their NLP process. A simple architecture was followed that allows the various components to be modified, adapted, and improved easily in future work. This involved a set of interfaces, NetworkManager, LexiconManager, and LexiconAgent, which handle the physical communication between agents, the storing and retrieval of lexical data, and the glue binding it all together, respectively. Furthermore, a GenericLexicon class and LexicalElement class were implemented that allow for the transfer of lexical data between agents regardless of the format used by their internal lexicon storage. In order to get the lexical data from the external data source into the Lexicon Builder system, a basic parser for The Free Dictionary’s web pages was implemented.
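How these named components might fit together can be sketched as below. The thesis implements them as Java interfaces; the Python class bodies and method names here are assumptions for illustration only, showing how a format-independent GenericLexicon lets one agent import data shared by another:

```python
class GenericLexicon:
    """Format-independent container used to transfer lexical data."""
    def __init__(self):
        self.elements = {}          # word -> set of category names

    def add(self, word, categories):
        self.elements.setdefault(word, set()).update(categories)

class LexiconManager:
    """Stores and retrieves lexical data for one agent."""
    def __init__(self):
        self.lexicon = GenericLexicon()

    def merge(self, shared):
        # Import lexical data received from another agent, whatever the
        # peer's internal storage format was.
        for word, cats in shared.elements.items():
            self.lexicon.add(word, cats)

class LexiconAgent:
    """Glue object: owns a manager and handles requests from peers."""
    def __init__(self, name):
        self.name = name
        self.manager = LexiconManager()

    def receive(self, shared):
        self.manager.merge(shared)
```

In the real system a NetworkManager would carry the GenericLexicon between agents; it is omitted here for brevity.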
This parser simply matches the lexical categories in the HTML and utilises the redirection performed by The Free Dictionary from derived words to their base word in order to perform some linking of base designations.

Finally, the testing was separated into two parts: Validation testing and Performance testing. The Validation tests were designed to ensure the Lexicon Builder system behaved in accordance with the scenarios developed during the design phase, while the Performance tests focused on gathering data on the effectiveness of the system with respect to answering the research questions. While most of the Validation tests produced the correct results (there was an issue with the last test due to an oversight in the implementation), they helped to identify some issues with the behaviour
6.2. LIMITATIONS AND FUTURE WORK
of the system not being as intended. The Performance tests, on the other hand, determined that, although there is a lot of potential for improvement, the amount of superfluous lexical information retrieved by the current implementation means the development of domain specific lexicons would still require a large amount of manual work. Several possibilities for improvement were discussed and will be revisited in the following section.
6.2 Limitations and Future Work
During the testing of the Lexicon Builder system a number of limitations were discovered and opportunities for improvements and future work identified. To begin with, the parsing of The Free Dictionary’s web pages could be improved greatly in order to overcome some defects with extraneous verb categories, reversed definitions of categories (e.g. “tr. v.” and “v. tr.”), and categories taken from base words that are not suitable for the derived words. Due to the time constraints placed on this research, it is no surprise that the parser for The Free Dictionary’s web pages could not handle all of these cases. Fixing these issues, though, provides easy opportunities to achieve better results for the Lexicon Builder in future work.

Another limitation of the current implementation of the Lexicon Builder (and parser related) is the linking of words with the correct base designation. In the majority of cases this is not done correctly, so the focus of the evaluation has been on the retrieval of correct word categories. However, in order for the model transformations at the end of the process defined by Kleiner et al. (Kleiner et al. 2009) to perform correctly, the base designations need to be correct. There are opportunities for the linking of words to their base designations to be improved, as The Free Dictionary provides web links between related words in a reasonably standard way, as well as redirecting a large number of words to their base designation. Therefore, it should be possible to link the majority of words with the correct base designation, reducing the amount of manual intervention required even further. However, there would be an issue if the user needs to go through all of the words to ensure their base designations are correct, so determining an automated method to check the base designations would be a desirable option.
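The reversed-definition defect mentioned above lends itself to a simple fix, sketched here in Python (illustrative only; the thesis parser is part of the Java system, and the abbreviation table is an assumption). Sorting the tokens of a raw label makes “tr. v.” and “v. tr.” identical before the table look-up:

```python
# Canonical names keyed by the *sorted* tokens of a raw category label,
# so that "tr. v." and "v. tr." resolve to the same entry.
ABBREVIATIONS = {
    ("tr.", "v."): "transitive verb",
    ("intr.", "v."): "intransitive verb",
    ("n.",): "noun",
}

def normalise_category(label):
    """Map a raw category label from a parsed page to one canonical name."""
    tokens = tuple(sorted(label.lower().split()))
    return ABBREVIATIONS.get(tokens, label)   # unknown labels pass through
```

A real parser would extend the table to the full set of abbreviations used by The Free Dictionary.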
There are also limitations in the design of the Lexicon Builder’s interactions that restrict the system to a series of relatively simple interactions. Some of these issues include the creation of superfluous labels, the general strengthening and weakening of associations, and the agent at which new lexicons and labels are created. Many of the limitations in this area have been identified and solutions proposed in Chapter 3. Therefore, future work should reevaluate the design, incorporate the proposed solutions, make adjustments as necessary, and perform tests using sets of complex interactions (rather than the simple scenarios used to test basic functionality during this research) in order to ensure that a large series of complex interactions does not cause the system to become inconsistent.

Currently, the Lexicon Builder system does not handle synonyms. Since the focus of this research was to evaluate an experimental system and determine, in the limited time available, whether or not dynamically developing domain specific lexicons was feasible, they were excluded for simplicity. However, synonyms could have a huge impact on the system’s effectiveness when comparing domain specific lexicons that use different words with the same meaning. Therefore, methods to include synonyms and their effects on the system should be evaluated in future work. There may be a simple solution, as The Free Dictionary incorporates a thesaurus, so it may be relatively simple to expand the parser of The Free Dictionary’s web pages to identify that information.

A further option that should be investigated in future work is the initial injection of the text into the system (and the Ordered Words model). This is a key aspect with many issues that need to be overcome in order for the entire process to be successful.
For example, phrases and names consisting of multiple words need to be considered in order for the output to be correct, so it is not just a simple matter of taking each individual word. Furthermore, how hyphenated words will be dealt with, as well as words with alternate spellings (such as UK English and American English spellings), needs to be considered in order to maintain consistency and allow correct comparisons across domain specific lexicons.

Multiple external data sources would be a good way to improve the quality of the information that is automatically retrieved by the Lexicon Builder system. Initially, the parser for The Free Dictionary could be expanded to check the second source referred
to by The Free Dictionary. However, in the long term it would be good to be able to retrieve information from scholarly lexical databases such as WordNet (Miller 1995), EDR (Yokoi 1995), and CYC’s lexical database (Cycorp Inc. 2010). The quality of, and features provided by, these databases (for example, synonyms) would likely enable the Lexicon Builder to perform much better than in its current experimental state.

Other future work identified in the analysis of the results includes: the ability to include other predefined lexicons in a new lexicon in order to handle situations such as SBVR key words and possibly the most common words of a language; a method that attempts to determine the category of a word using the category information of the words around it; and, similarly, a method to cull superfluous categories from a word by using the category information of the surrounding words. This likely sounds easier than it is, but it is an important aspect to investigate, as there is currently too much superfluous information retrieved from the external data source, likely reducing the effectiveness of the Lexicon Builder in the process defined by Kleiner et al. (Kleiner et al. 2009).

Finally, future work should evaluate the effect on the process defined by Kleiner et al. (Kleiner et al. 2009), in particular the configuration aspect, as well as the results of using a larger sample text based on the SBVR controlled language. These investigations drive right to the heart of the main goal of the Lexicon Builder, that is, to support the complete automation of transforming Natural Language specifications into formal representations.
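One way multiple sources could improve quality is by voting on categories. The sketch below is one possible policy, not the thesis's design, and the category sets for “based” are hypothetical: categories that at least two sources agree on are kept, with the union as a fallback so that no data is lost when there is no agreement.

```python
from collections import Counter

def merge_sources(word, sources):
    """Combine category sets for `word` from several data sources.

    Keep categories that at least two sources agree on; if no category
    is agreed on, fall back to the union so no data is lost. This
    majority-style policy is one possible design, not the thesis's.
    """
    counts = Counter()
    for source in sources:
        counts.update(source.get(word, set()))
    agreed = {cat for cat, n in counts.items() if n >= 2}
    return agreed or set(counts)

# Hypothetical category sets for "based" from two sources:
free_dictionary = {"based": {"transitive verb", "noun", "determiner"}}
wordnet = {"based": {"transitive verb"}}
```

With these hypothetical inputs, the superfluous noun and determiner categories inherited from the base word “base” would be culled because only one source reports them.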
6.3 Conclusion
Natural Language Processing is a broad field encompassing many methods. Although the most common of these methods, empirical methods such as machine learning, are quite effective, we are beginning to see some of their limitations. As a result, new, innovative methods such as that proposed by Kleiner et al. (Kleiner et al. 2009) are being investigated by many researchers. Their method takes a Model Driven Engineering approach to NLP with the goal of automatically transforming Natural Language specifications into formal models. However, these new methods have limitations of their own, such as the need to manually predefine complete domain specific lexicons before the process can be used, reducing the level of automation in the process. This research
has explored the possibility of enhancing the process described by Kleiner et al. (Kleiner et al. 2009) with the ability to automatically develop the domain specific lexicons required by this method. In doing so, it has developed the Lexicon Builder experimental system, which attempts to utilise the concept of Emergent Semantics, described by Steels and Hanappe (Steels and Hanappe 2006) for the interoperability of distributed information systems, in order to support the automated development of lexicons.

A literature review conducted on NLP, MDE, and Emergent Semantics discovered very little previous work on dynamically generating lexicons. Most similar work focused either on manually defining large lexical databases and knowledge bases, or on creating lexicons of new languages developed through interactions between autonomous agents, and as such focused on the development of the language rather than the lexicons. The development of the Lexicon Builder was aimed at answering the research question:
Can we reduce or eliminate the need to manually predefine a lexicon by dynamically building a lexicon based on the input text?
The results of testing the Lexicon Builder show that it is certainly possible to at least reduce the need to manually define lexicons. While the effectiveness of the system is open to improvement, it has provided the basis for evaluating the potential such a system may have, and has shown that there is plenty to be realised, as many of the present limitations have already been discussed and possible solutions proposed. Therefore, it would be worth investigating the dynamic development of domain specific lexicons in future work.
References

Aberer, K, Cudré-Mauroux, P, Catarci, A, Hacid, MS, Illarramendi, A, Kashyap, V, Mecella, M, Mena, E, Neuhold, E, Troyer, O, Risse, T, Scannapieco, M, Saltor, F, Santis, L, Spaccapietra, S, Staab, S and Studer, R 2004, ‘Emergent semantics principles and issues’, Database Systems for Advanced Applications, pp. 25–38.

Collobert, R and Weston, J 2008, ‘A unified architecture for natural language processing: Deep neural networks with multitask learning’, Proceedings of the 25th International Conference on Machine Learning, pp. 160–167.

Cycorp Inc. 2010, ‘The Cyc NL subsystem’, Cycorp Website, viewed 16 October 2010.

Daelemans, W and Van den Bosch, A 2006, ‘Memory-based language processing’, Computational Linguistics, vol. 32, no. 4, pp. 559–561.

Estratat, M and Henocque, L 2004, ‘Parsing languages with a configurator’, Proceedings of the European Conference for Artificial Intelligence ECAI’2004, vol. 16, p. 591.

Farlex 2010, The Free Dictionary, viewed 11 October 2010.

Favre, JM and Nguyen, T 2005, ‘Towards a megamodel to model software evolution through transformations’, Electronic Notes in Theoretical Computer Science, vol. 127, no. 3, pp. 59–74.

Gao, J, Andrew, G, Johnson, M and Toutanova, K 2007, ‘A comparative study of parameter estimation methods for statistical natural language processing’, Annual Meeting of the Association for Computational Linguistics, vol. 45, p. 824.
Jouault, F and Bézivin, J 2006, ‘KM3: a DSL for metamodel specification’, Lecture Notes in Computer Science, vol. 4037, pp. 171–185.

Jurafsky, D and Martin, JH 2009, Speech and Language Processing, Prentice Hall, New York, chapter 1, pp. 1–16.

Kleiner, M, Albert, P and Bézivin, J 2009, ‘Configuring models for (controlled) languages’, Proceedings of the IJCAI-09 Workshop on Configuration (ConfWS-09), Pasadena, CA, USA, pp. 61–68.

Lee, TB, Hendler, J, Lassila, O and Others 2001, ‘The semantic web’, Scientific American, vol. 284, no. 5, pp. 34–43.

Lenat, D, Miller, G and Yokoi, T 1995, ‘CYC, WordNet, and EDR: critiques and responses’, Communications of the ACM, vol. 38, no. 11, pp. 45–48.

Lenat, DB 1995, ‘CYC: a large-scale investment in knowledge infrastructure’, Communications of the ACM, vol. 38, no. 11, pp. 33–38.

Liddy, ED, Hovy, E, Lin, J, Prager, J, Radev, D, Vanderwende, L and Weischedel, R 2003, ‘Natural language processing’, Encyclopedia of Library and Information Science, pp. 2126–2136.

Macmillan Publishers Australia 2010, ‘Australia’s national dictionary online’, Macquarie Dictionary, viewed 16 October 2010.

Maedche, A 2002, ‘Emergent semantics for ontologies’, IEEE Intelligent Systems, vol. 17, no. 1, pp. 78–86.

Maedche, A and Staab, S 2001, ‘Ontology learning for the semantic web’, IEEE Intelligent Systems, vol. 16, no. 2, pp. 72–79.

Miller, GA 1995, ‘WordNet: A lexical database for English’, Communications of the ACM, vol. 38, no. 11, pp. 39–41.

Nack, F 2002, ‘Media information spaces – a semantic challenge’, IEEE Intelligent Systems, vol. 17, no. 1, pp. 78–86.
NICT 2007, ‘About purchasing the EDR electronic dictionary’, EDR Website, viewed 16 October 2010.

Nivre, J 2007, ‘Inductive dependency parsing’, Computational Linguistics, vol. 33, no. 2.

OMG 2008b, Semantics of Business Vocabulary and Business Rules (SBVR) Specification, v1.0, Object Management Group.

Oxford University Press 2010, ‘English dictionary and thesaurus’, Oxford Dictionaries Online, viewed 16 October 2010.

Sendall, S and Kozaczynski, W 2003, ‘Model transformation: The heart and soul of model-driven software development’, IEEE Software, vol. 20, no. 5, pp. 42–45.

Shaw, M 2002, ‘What makes good research in software engineering?’, International Journal on Software Tools for Technology Transfer (STTT), vol. 4, no. 1, pp. 1–7.

Steels, L 1998, ‘The origins of ontologies and communication conventions in multi-agent systems’, Autonomous Agents and Multi-Agent Systems, vol. 1, no. 2, pp. 169–194.

Steels, L 2002, ‘Language games for emergent semantics’, IEEE Intelligent Systems, vol. 17, no. 1, pp. 83–85.

Steels, L and Hanappe, P 2006, ‘Interoperability through emergent semantics: a semiotic dynamics approach’, Journal on Data Semantics VI, vol. 4090, Springer Berlin / Heidelberg, pp. 143–167.

Wikimedia Foundation 2010a, ‘Wiktionary, the free dictionary: Main page’, Wiktionary, the free dictionary, viewed 16 October 2010.

Wikimedia Foundation 2010b, ‘Wiktionary:Entry layout explained’, Wiktionary, the free dictionary, viewed 16 October 2010.

Yokoi, T 1995, ‘The EDR electronic dictionary’, Communications of the ACM, vol. 38, no. 11, pp. 42–44.
Appendix A Word Lists

A.1 SBVR Word List
The SBVR word list is taken from the three example sentences used by Kleiner et al. (Kleiner et al. 2009). (1) Each company sells at least one product. (2) Each product is sold by exactly one company. (3) A software is a product. The resulting word list contains the 12 words displayed in Table A.1.

Table A.1: SBVR Word List

A, AtLeast, By, Company, Each, Exactly, Is, One, Product, Sells, Software, Sold

A.2 Proposal Word List
The Proposal word list is taken from the introduction of the proposal for this research. Using this word list provides an indication of how well the system works for more
APPENDIX A. WORD LISTS
general texts and allows the evaluation of a larger set of words than the SBVR word list. This list contains 272 distinct words, which are displayed in Table A.2.
Table A.2: Proposal Word List

2, 3, 4, A, Actually, Added, Advanced, Agreement, Aim, Allow, Amount, An, Analysis, And, Answered, Answering, Appendix, Application, Applications, Approach, Are, Area, Areas, Around, As, At, Automated, Automation, Avoiding, Background, Based, Be, Been, Benefits, Break, Brief, Build, Building, But, By, Can, Changes, Class, Company, Complete, Completely, Concept, Concepts, Configuration, Contents, Continue, Contribution, Current, Defined, Described, Design, Determined, Develop, Developed, Developing, Development, Diagrams, Different, Direct, Do, Does, Domain, During, Dynamically, Dynamics, Early, Effective, Effort, Efforts, Eliminate, Emergent, EmergentSemantics, Enable, End, EtAl, Even, Exciting, Finally, Focus, Following, Follows, For, Formal, From, Fully, Further, Gained, Global, Goals, Has, Have, How, However, If, Impractical, Improved, Include, Included, Inclusion, Information, Inherent, Initially, InOrderTo, Input, Interoperability, Into, Involve, Is, It, Its, Kleiner, Lastly, Lexical, Lexicon, Lexicons, Limitations, Limited, Literature, Long, Long-range, Looking, MachineLearning, Mainly, Major, Manually, May, Method, Methodology, Methods, Model, ModelDrivenDevelopment, ModelDrivenEngineering, ModelTransformation, Months, More, Most, Much, Multi-agent, NaturalLanguage, NaturalLanguageProcessing, Need, Needs, New, Not, Of, On, Once, One, Or, Other, Out-of-date, Part, Perfect, Performing, Performs, Possibilities, Potential, Predefine, Predefined, Preliminary, Presents, Problem, Process, Produced, Project, Promising, Proposal, Proposed, Provide, Provided, Provisional, Put, Questions, Quite, Radically, Real, Realm, Reduce, Reduces, Reduction, Remainder, Remove, Replace, Representation, Representations, Required, Requires, Research, Researchers, Results, Review, Revolve, Save, Say, Search, Secondly, Section, Semantic, Semantics, SemanticWeb, Semiotic, Several, Sharing, Six, So, Software, Specific, Specification, Specifications, Spend, Stages, Static, Statistical, Sub-questions, SuchAs, SumsUp, Support, Symbolic, System, Systems, Table, Take, Takes, Technique, Text, That, The, Their, Then, Therefore, These, Thesis, They, This, Through, Time, Timeline, To, ToBeginWith, Transformation, Transforming, Trap, UML, Updates, Use, Used, Useful, Useless, Utilise, Utilising, Was, Ways, We, Well, What, When, Whether, Which, While, Whole, Will, With, Within, Word, Work, World, Would
Appendix B Expected Lexical Categories

B.1 SBVR Word List

Table B.1: SBVR Word List Expected Lexical Categories

Word       Lexical Categories               Plural     Voice
A          Indefinite Article Determiner
AtLeast    Quantified Valued Determiner
By         Determiner
Company    Noun                             Singular
Each       Quantified Unvalued Determiner
Exactly    Quantified Valued Determiner
Is         Transitive Verb                  Singular   Passive
One        Numeral
Product    Noun                             Singular
Sells      Transitive Verb                  Singular   Passive
Software   Noun                             Singular
Sold       Transitive Verb                  Singular   Passive
B.2 Proposal Word List

Note: Adjectives and adverbs are not supported by the Lexicon Meta-Model directly; therefore, words of either of those categories have been assigned the category determiner instead. If future work includes adjectives and adverbs in the Lexicon Meta-Model, then the expected lexical categories will need to be updated appropriately.
Table B.2: Proposal Word List Expected Lexical Categories Word
Lexical Categories
Plural
Voice
2
Noun
Singular
3
Noun
Singular
4
Noun
Singular
A
Indefinite Article Determiner
Actually
Determiner
Added
Verb
Advanced
Determiner
Agreement
Noun
Singular
Aim
Noun
Singular
Allow
Transitive Verb
Singular
Amount
Noun
Singular
An
Indefinite Article Determiner
Analysis
Noun
And
Conjunction
Answered
Verb
Singular
Passive
Answering
Verb
Singular
Passive
Appendix
Noun
Singular
Singular
Passive
Passive
Singular
Application
Noun
Singular
Applications
Noun
Plural
Approach
Noun
Singular
Are
Verb
Singular Passive
Area
Noun
Singular
Areas
Noun
Plural
Around
Determiner Determiner
As Preposition At
Preposition
Automated
Verb
Singular
Automation
Noun
Singular
Avoiding
Transitive Verb
Singular
Background
Noun
Singular
Based
Transitive Verb
Singular
Passive
Be
Intransitive Verb
Singular
Passive
Been
Verb
Singular
Passive
Benefits
Noun
Plural
Break
Intransitive Verb
Singular
Brief
Determiner
Passive
Passive
Passive
Building
Verb
Singular
But
Conjunction
By
Preposition
Can
Verb
Singular
Passive
Changes
Verb
Singular
Passive
Class
Noun
Singular
Company
Noun
Singular
Complete
Determiner
Completely
Determiner
Concept
Noun
Singular
Concepts
Noun
Plural
Configuration
Noun
Singular
Contents
Noun
Singular
Continue
Transitive Verb
Singular
Contribution
Noun
Singular
Current
Determiner
Defined
Verb
Singular
Passive
Described
Transitive Verb
Singular
Passive
Design
Noun
Singular
Passive
Passive
Determined
Determiner
Develop
Transitive Verb
Singular
Developed
Verb
Singular Passive
Developing
Verb
Singular
Development
Noun
Singular
Diagrams
Noun
Plural
Different
Determiner
Direct
Determiner
Do
Transitive Verb
Singular
Passive
Does
Verb
Singular
Passive
Domain
Noun
Singular
During
Preposition
Dynamically
Determiner
Dynamics
Noun
Early
Determiner
Effective
Determiner
Effort
Noun
Singular
Efforts
Noun
Plural
Eliminate
Transitive Verb
Singular
Passive
Singular
Passive
Emergent
Determiner
EmergentSemantics
Noun
Singular
Enable
Transitive Verb
Singular
End
Noun
Singular
EtAl
Determiner
Even
Determiner
Exciting
Determiner
Finally
Determiner
Passive
Transitive Verb
Singular
Passive
Intransitive Verb
Singular
Passive
Singular
Passive
Singular
Passive
Focus Following
Determiner
Follows
Verb Preposition
For Conjunction Formal
Determiner
From
Preposition
Fully
Determiner
Further
Determiner
Gained
Verb
Global
Determiner
Goals
Noun
Plural
Has
Verb
Singular
Passive
Have
Transitive Verb
Singular
Passive
How
Determiner
However
Determiner
If
Conjunction
Impractical
Determiner
Improved
Verb
Singular Passive
Include
Transitive Verb
Singular
Passive
Included
Transitive Verb
Singular
Passive
Inclusion
Noun
Singular
Information
Determiner
Inherent
Determiner
Initially
Determiner
InOrderTo
Preposition
Input
Noun
Interoperability
Determiner
Into
Preposition
Involve
Transitive Verb
Singular
Passive
Is
Verb
Singular
Passive
Singular
It | Pronoun
Its | Determiner
Kleiner | Noun
Lastly | Determiner
Lexical | Determiner
Lexicon | Noun | Singular
Lexicons | Noun | Plural
Limitations | Noun | Plural
Limited | Determiner
Literature | Noun
Long | Determiner
Long-range | Determiner
Looking | Verb | Singular
MachineLearning | Noun | Singular
Mainly | Determiner
Major | Determiner
Manually | Determiner
May | Verb
Methodology | Noun | Singular
Methods | Noun | Plural
Model | Noun | Singular
ModelDrivenDevelopment | Noun
ModelDrivenEngineering | Noun | Singular
ModelTransformation | Noun | Singular
Months | Noun | Plural
More | Determiner
Most | Determiner
Much | Determiner
Multi-agent | Determiner
NaturalLanguage | Noun
NaturalLanguageProcessing | Noun
Need | Noun
Needs | Determiner
New | Determiner
Not | Determiner
Of | Preposition
On | Preposition
Once | Conjunction
One | Numeral
Or | Conjunction
Other | Determiner
Out-of-date | Determiner
Part | Noun
Perfect | Determiner
Performing | Verb | Singular
Performs | Verb | | Passive
Possibilities | Noun | Plural
Potential | Noun | Singular
Predefine | Transitive Verb | Singular | Passive
Predefined | Transitive Verb | Singular | Passive
Preliminary | Determiner
Presents | Verb | | Passive
Problem | Noun | Singular
Process | Noun | Singular
Produced | Verb | Singular
Project | Noun | Singular
Promising | Determiner
Proposal | Noun | Singular
Proposed | Verb | Singular | Passive
Provide | Intransitive Verb | Singular | Passive
Provided | Verb | Singular | Passive
Provisional | Determiner
Put | Transitive Verb | Singular | Passive
Questions | Noun | Singular
Quite | Determiner
Radically | Determiner
Real | Determiner
Realm | Noun | Singular
Reduce | Transitive Verb | Singular | Passive
Reduces | Verb | Singular | Passive
Reduction | Noun | Singular
Remainder | Noun | Singular
Remove | Transitive Verb | Singular | Passive
Replace | Transitive Verb | Singular | Passive
Representation | Noun | Singular
Representations | Noun | Plural
Required | Determiner
Requires | Transitive Verb | Plural
Research | Noun | Singular
Researchers | Noun | Plural
Results | Noun | Plural
Review | Noun | Singular
Revolve | Transitive Verb | Singular | Passive
Save
Say | Determiner
Search | Transitive Verb
Secondly | Determiner
Section | Noun
Semantic | Determiner
Semantics | Noun | Singular
SemanticWeb | Noun | Singular
Semiotic | Determiner
Several | Determiner
Sharing | Verb | Singular
Six | Numeral
So | Determiner
Software | Noun
Specific | Determiner
Specification | Noun | Singular
Specifications | Noun | Plural
Spend | Transitive Verb | Singular
Stages | Noun | Plural
Static | Determiner
Statistical | Determiner
Sub-questions | Noun
SuchAs | Determiner
SumsUp | Verb | Singular
Support | Noun | Singular
Symbolic | Determiner
System | Noun | Singular
Systems | Noun | Plural
Table | Noun | Singular
Take | Transitive Verb | Singular | Passive
Takes | Verb | Singular
Technique | Noun | Singular
Text | Noun | Singular
That | Determiner
The | Definite Article, Determiner
Their | Determiner
Then | Determiner
Therefore | Determiner
These | Noun | Plural
Thesis | Noun | Singular
They | Noun | Singular
This | Determiner
Through | Preposition
Time | Noun | Singular
Timeline | Noun | Singular
To | Determiner
ToBeginWith | Intransitive Verb
Transformation | Noun | Singular
Transforming | Verb | Singular | Passive
Trap | Noun | Singular
UML | Noun | Singular
Updates | Noun | Plural
Use | Noun | Singular
Used | Verb | Singular | Passive
Useful | Determiner
Useless | Determiner
Utilise | Transitive Verb | Singular | Passive
Utilising | Transitive Verb | Singular | Passive
Was | Verb | Singular | Passive
Ways | Noun | Plural
We | Pronoun
Well | Determiner
What | Pronoun
When | Conjunction
Whether | Conjunction
Which | Pronoun
While | Conjunction
Whole | Determiner
Will | Verb
With | Preposition
Within | Preposition
Word | Noun | Singular
Work | Intransitive Verb | Singular
World | Noun | Singular
Would | Verb | Singular | Passive
Appendix C

Incorrect Lexical Categories

Note: The tables of words with missing categories and of words with additional categories are produced from the test results combined with the expected and retrieved data, respectively. The tables of words with missing categories do not include words whose expected category is not supported by the Lexicon Meta-Model.
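The note above describes deriving the "missing" and "additional" tables by comparing each word's expected categories against those retrieved during testing. The comparison amounts to two set differences per word, which can be sketched as follows (the function and variable names are illustrative, not the thesis code; the sample data values are taken from Tables C.1 and C.2):

```python
# Illustrative sketch, not the thesis implementation: per word,
# missing = expected - retrieved, additional = retrieved - expected.
expected = {
    "Is": {"Transitive Verb"},
    "Sold": {"Transitive Verb"},
    "Each": {"Quantified Unvalued Determiner"},
}
retrieved = {
    "Is": {"Verb"},
    "Sold": {"Verb"},
    "Each": {"Determiner"},
}

def compare_categories(expected, retrieved):
    """Return (missing, additional) category sets keyed by word."""
    missing, additional = {}, {}
    for word in expected.keys() | retrieved.keys():
        exp = expected.get(word, set())
        ret = retrieved.get(word, set())
        if exp - ret:
            missing[word] = exp - ret      # expected but not retrieved
        if ret - exp:
            additional[word] = ret - exp   # retrieved but not expected
    return missing, additional

missing, additional = compare_categories(expected, retrieved)
# missing["Is"] and additional["Is"] then correspond to the "Is" rows
# of Tables C.1 and C.2, respectively.
```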
C.1 SBVR Word List

Table C.1: SBVR Words With Missing Categories

Word | Missing Categories
AtLeast | Quantified Valued Determiner
Each | Quantified Unvalued Determiner
Exactly | Quantified Valued Determiner
Is | Transitive Verb
Sold | Transitive Verb
Table C.2: SBVR Words With Additional Categories

Word | Additional Categories
A | Noun
By | Noun
Each | Determiner
Exactly | Determiner
Is | Verb
One | Determiner, Verb
Sells | Intransitive Verb, Noun
Sold | Verb
C.2 Proposal Word List

Table C.3: Proposal Words with Missing Categories

Word | Missing Categories
2 | Noun
3 | Noun
4 | Noun
Avoiding | Transitive Verb
Based | Transitive Verb
Building | Verb
Can | Verb
Described | Transitive Verb
Developed | Verb
Developing | Verb
Eliminate | Transitive Verb
EmergentSemantics | Noun
Enable | Transitive Verb
Include | Transitive Verb
Included | Transitive Verb
Information | Determiner
Interoperability | Determiner
Involve | Transitive Verb
Kleiner | Noun
Limitations | Noun
MachineLearning | Noun
May | Verb
ModelDrivenEngineering | Noun
ModelTransformation | Noun
NaturalLanguage | Noun
NaturalLanguageProcessing | Noun
Predefine | Transitive Verb
Predefined | Transitive Verb
Provided | Verb
Replace | Transitive Verb
Representations | Noun
Requires | Transitive Verb
SemanticWeb | Noun
Sub-questions | Noun
SuchAs | Determiner
SumsUp | Verb
Systems | Noun
These | Noun
They | Noun
ToBeginWith | Intransitive Verb
UML | Noun
Used | Verb
Utilise | Transitive Verb
Utilising | Transitive Verb
Ways | Noun
Would | Verb
Table C.4: Proposal Words with Additional Categories

Word | Additional Categories
A | Noun, Transitive Verb
Added | Intransitive Verb
Aim | Verb, Transitive Verb, Intransitive Verb
Allow | Verb, Intransitive Verb
And | Noun
Answered | Noun, Intransitive Verb, Transitive Verb
Answering | Noun, Intransitive Verb, Transitive Verb
Application | Determiner
Applications | Determiner
Approach | Verb, Transitive Verb, Intransitive Verb
Are | Noun
As | Noun
At | Noun
Automated | Transitive Verb, Intransitive Verb
Based | Noun, Determiner
Be | Verb
Benefits | Verb, Transitive Verb, Intransitive Verb
Break | Verb, Transitive Verb, Noun
Brief | Noun
Build | Verb, Intransitive Verb, Noun
Building | Noun
But | Determiner
By | Noun, Determiner
Can | Noun
Changes | Transitive Verb, Intransitive Verb
Contents | Noun, Determiner
Continue | Verb, Intransitive Verb
Current | Noun
Defined | Transitive Verb, Intransitive Verb
Design | Verb, Transitive Verb, Intransitive Verb
Develop | Verb, Intransitive Verb
Developed | Determiner
Developing | Determiner
Direct | Verb, Transitive Verb, Intransitive Verb
Do | Verb, Intransitive Verb, Noun
Dynamically | Noun
Effective | Noun
Emergent | Noun
End | Verb, Transitive Verb, Intransitive Verb
Even | Noun
Finally | Noun
Focus | Noun, Verb
Following | Noun
Follows | Transitive Verb, Intransitive Verb
Formal | Noun
Gained | Transitive Verb, Intransitive Verb
Have | Verb, Noun
How | Noun
If | Noun
Improved | Transitive Verb, Intransitive Verb
Included | Determiner
Information | Noun
Initially | Noun
It | Noun
Lastly | Noun, Verb, Transitive Verb, Intransitive Verb
Limited | Noun
Long | Noun
Looking | Transitive Verb, Intransitive Verb, Noun
Major | Noun
Manually | Noun
May | Noun
Model | Verb, Transitive Verb, Intransitive Verb, Determiner
More | Noun
Most | Noun
Much | Noun
Need | Verb, Transitive Verb, Intransitive Verb
Not | Noun
On | Determiner
Once | Determiner, Noun
One | Determiner
Or | Noun
Other | Noun
Part | Verb, Transitive Verb, Intransitive Verb
Perfect | Determiner, Noun
Performing | Transitive Verb, Intransitive Verb
Performs | Transitive Verb, Intransitive Verb
Potential | Determiner
Preliminary | Noun
Presents | Noun, Determiner, Transitive Verb, Intransitive Verb
Problem | Determiner
Process | Determiner
Produced | Transitive Verb, Intransitive Verb
Project | Noun, Verb, Transitive Verb, Intransitive Verb
Proposed | Transitive Verb, Intransitive Verb
Provide | Verb, Transitive Verb
Provisional | Noun
Put | Verb, Intransitive Verb, Noun, Determiner
Questions | Verb, Transitive Verb, Intransitive Verb
Radically | Noun
Real | Noun
Reduce | Verb, Intransitive Verb
Reduces | Transitive Verb, Intransitive Verb
Remove | Verb, Intransitive Verb, Noun
Research | Verb, Transitive Verb, Intransitive Verb
Researchers | Verb, Transitive Verb, Intransitive Verb
Review | Verb, Transitive Verb, Intransitive Verb
Revolve | Verb, Intransitive Verb
Save | Noun, Verb, Transitive Verb
Say | Intransitive Verb, Noun
Search | Verb, Intransitive Verb, Noun
Sharing | Noun, Transitive Verb, Intransitive Verb
So | Noun
Specific | Noun
Spend | Verb, Intransitive Verb
Stages | Verb, Transitive Verb, Intransitive Verb
Static | Noun
Take | Verb, Intransitive Verb, Noun
Takes | Transitive Verb, Intransitive Verb, Noun
The | Determiner
Then | Noun
Through | Determiner
Time | Determiner
Transforming | Transitive Verb, Intransitive Verb, Noun
Trap | Verb, Transitive Verb, Intransitive Verb
Use | Verb, Transitive Verb, Intransitive Verb
Used | Determiner
Ways | Noun (non-plural)
Well | Noun, Verb, Transitive Verb, Intransitive Verb
What | Determiner
When | Determiner, Noun
Which | Determiner
While | Noun
Whole | Noun
Will | Noun, Transitive Verb, Intransitive Verb
Within | Determiner, Noun
Work | Noun, Determiner, Verb, Transitive Verb
World | Determiner