Natural Language Processing and Object-Oriented Analysis
H.M. Harmain and R. Gaizauskas
Department of Computer Science, University of Sheffield
Regent Court, 211 Portobello Street, Sheffield, S1 4DP, UK.
E-mail: [email protected], [email protected]
Technical Report CS-98-8
Contents

1 Introduction
  1.1 The Problem
  1.2 Overview of the rest of the report
2 Software Engineering and OO Analysis
  2.1 Software Development: A historical background
  2.2 Object Oriented Technology
    2.2.1 Fundamental Concepts
    2.2.2 Object-Oriented Modeling
  2.3 Approaches to Object-Oriented Analysis
    2.3.1 Classical Approaches
    2.3.2 Behaviour Analysis
    2.3.3 Use-Case Analysis
    2.3.4 Informal English Description
  2.4 UML
  2.5 Class Modelling
    2.5.1 Classes
    2.5.2 Associations
    2.5.3 Multiplicity
    2.5.4 Qualifier
    2.5.5 Generalization
    2.5.6 Composition
  2.6 Identification Techniques of Classes
    2.6.1 Using the Entities to be Modelled
    2.6.2 Using Object Decomposition
    2.6.3 Using Inheritance
    2.6.4 Using Nouns and Noun Phrases
    2.6.5 Using Semantic Nets
  2.7 A Review of some Systems that Support Software Requirements Analysis
    2.7.1 SAFE
    2.7.2 The Requirements Apprentice
    2.7.3 SPECIFIER
    2.7.4 Meziane
    2.7.5 NL-OOPS
    2.7.6 Attempto
3 GATE and Information Extraction: supporting OOA
  3.1 Overview
  3.2 GATE
    3.2.1 GATE Architecture
    3.2.2 TIPSTER Architecture
  3.3 Information Extraction
    3.3.1 IE Tasks
    3.3.2 An Example
  3.4 LaSIE and OOA
  3.5 The input
  3.6 The Natural Language Processing component
    3.6.1 Lexical Preprocessing
    3.6.2 Parsing and Semantic Interpretation
    3.6.3 Discourse interpretation
  3.7 OOA Component
    3.7.1 Worksheet Generator Module
    3.7.2 Class Model Generator module
4 A Case study
  4.1 Overview
  4.2 Analysis
    4.2.1 Discourse Model
    4.2.2 Finding Candidate Classes and Their Frequency
    4.2.3 Finding Candidate Relationships and their Arguments
    4.2.4 Identifying Classes, Attributes, and Operations
    4.2.5 Initial Class Model
5 Conclusions and Future Work
6 References
If you wait for a complete and perfect concept to germinate in your mind, you are likely to wait forever. Perfect ideas do not germinate, they evolve. So you put your lousy ideas down on paper, rout out their faults one by one, and gradually come up with a good product. DeMarco (1979)
1 Introduction

1.1 The Problem
Computer systems are economically critical in all industrialised countries, and increasingly so in developing countries. The software in these systems represents a large proportion of the total system development cost, and developing it is one of the most time- and resource-consuming activities in modern societies. When computer systems were first introduced, ad hoc approaches were used to develop software systems. However, software developed using ad hoc approaches was very difficult to maintain or extend. The search for more appropriate approaches led to the view that the development of software systems should not differ from the development of any other engineering system: a model of the system should be developed and tested before the real system is built. It also became generally accepted that the development process should be divided into several stages, such as analysis, design, implementation, and testing. Many researchers have observed that errors made early in the software life-cycle are the most expensive to correct if not detected early.

A wide variety of methods and techniques have been suggested and used to analyse user requirements and create models of the software to be developed. These methods range from traditional structured methods to formal methods and object-oriented techniques. Object-Oriented Technology (OOT) is becoming a popular approach for building software systems. This technology has recently been extended from the programming phase to cover the earlier phases of the software life-cycle: analysis and design. Many object-oriented methods have been proposed for the analysis and design phases (e.g. Booch, OMT, Objectory, OOSE). The basic terminology, the development process, and the definition of concepts vary from one object-oriented method to another. However, the concept of an object is central to anything object-oriented.
The analysis process is considered one of the most critical and difficult tasks because most of the input to this process is in natural languages, such as English, which are inherently ambiguous. Developers need to interact with customers in their language, and they need to review and analyse documents written in human languages. Advances in the field of Natural Language Processing suggest a promising approach that may help software engineers in the early analysis phases of software development. This report gives background information on this area of research and discusses how Natural Language Processing techniques can be used to help an analyst in the analysis of English descriptions of software systems.
1.2 Overview of the rest of the report
Section 2 gives background information on Software Engineering and Object-Oriented Technology. Some previous systems that aimed at supporting the analysis stage of software development are discussed in section 2.7. In section 3 we describe how our work is going to support building object-oriented models from English descriptions of software requirements. A case study is given in section 4. Section 5 gives some conclusions and highlights future work.
2 Software Engineering and OO Analysis

The aim of this section is to give some background knowledge on software development and object-orientation. We also discuss some previous work on applying Natural Language Processing techniques to the early stages of software development.
2.1 Software Development: A historical background
When computing began in the 1950s, programs were developed using ad hoc software development approaches. These programs were very hard to maintain or extend, and hence it was believed that the solution lay in providing a more methodical development approach [YA96]. In the 1960s the first change to the software development process appeared. A NATO conference was held in October 1968 in Garmisch, Germany, to discuss the problems of building complex software systems [Gol95]. The term software engineering was first introduced at this conference. Conference attendees agreed that building a software system should not be different from building any other engineering system. In particular, complex systems should be broken down into smaller modules which could be independently analysed, designed, implemented, and tested. This new concept received general agreement among software developers. In this new view, the development process of software systems was divided into several subtasks, called phases. Each phase addresses a different problem and has its own conceptual model, notation, and heuristics.
[Figure 1: Phases of software development. A V-shaped diagram: a cloud of requirements flows down the left through Analysis (Specification), Design (Architecture), and Implementation (Code); on the right, Code flows up through Unit testing (Tested subprograms), Integration (Tested system), and Delivery (System in use).]

Although there is general agreement on the nature of the phases and what kind of output they produce, some dispute always exists about their names and the exact placing of the boundaries between them [CAB+94]. A model of software development is shown in figure 1. The cloud in the model represents the original requirements for which a system is to be developed. The boxes represent the outputs of the different phases. The process flow on the left shows the stages from requirements to code, and the flow on the right shows the code being integrated, tested, and delivered. The dashed lines indicate that the results of the integration phases on the right are matched against, and should satisfy, the requirements on the left.

Many approaches have been suggested for software development. These approaches range from traditional structured methods to formal methods and object-oriented techniques. The following section discusses object-orientation and the fundamental concepts which make it different from other approaches.
2.2 Object Oriented Technology
Object-orientation is a new philosophy for the development of software systems. It is built upon sound engineering principles such as abstraction, encapsulation, modularity, hierarchy, typing, and concurrency [Boo94]. Whereas by themselves none of these elements is new, bringing them together has made object-orientation a fundamentally new approach.
The atoms of computation in this approach are called objects, as contrasted with the functions of the functional approach [CAB+94]. Objects collaborate with each other by sending messages. The necessary actions are performed by methods invoked by messages sent by other objects. The following subsections shed some light on some of the fundamental concepts of this approach.
2.2.1 Fundamental Concepts

Abstraction

Abstraction is a fundamental issue in Object-Oriented Technology (OOT). It focusses on the essential properties of an entity and ignores less important or nonessential aspects. Abstraction is concerned with both the attributes and the behaviour of an entity. Attributes are the characteristics or properties of an entity, while behaviour is the set of actions that an entity can perform. Abstraction is considered an important technique because complex situations can be simplified to the level where understanding can take place [Kaf98]. For example, if we look at the concept of a car in the real world, we can find a large number of properties associated with it (colour, age, make, model, etc.). In modelling this entity we will be interested in only some of these properties (such as make, model, age, and price in a sales tracking system).
Objects and Classes

A car, a chair, and a pen are example objects that appear in everyday life. Objects can be concrete or abstract. The things that we can picture, such as a car, are called concrete objects. Abstract objects are those things that we cannot picture but which form useful concepts for a model. Vehicle is an abstract concept for objects such as a car or a train.

Each object belongs to a class. A class provides a description for objects of a particular type, much in the same way that a drawing provides a description of a house. Objects are physical instantiations of their classes. A class description is made up of attributes and operations which together comprise the set of class members. An attribute of a car can be its colour or number of seats. Each attribute has a value (e.g. colour is red). The state of an object is defined by the values of its attributes. Operations govern the dynamic behaviour of an object. A bank account can be modelled as an object with operations to deposit cash, withdraw cash, and inspect the balance. Executing an operation can result in a change of state. There are three categories of operation: constructor, query, and transformation. The constructor is invoked automatically when an object is first created. Query operations allow attribute values to be read. Transformation operations produce a change in state.
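The bank-account example above can be sketched in code. This is an illustrative sketch only (the report does not prescribe an implementation); the class name, attribute names, and the guard in `withdraw` are our own choices.

```python
class BankAccount:
    """A bank account as an object: state (attributes) plus operations."""

    def __init__(self, owner, balance=0):
        # Constructor: invoked automatically when the object is first created.
        self.owner = owner
        self.balance = balance

    def deposit(self, amount):
        # Transformation operation: produces a change of state.
        self.balance += amount

    def withdraw(self, amount):
        # Transformation operation with a simple guard on the state.
        if amount > self.balance:
            raise ValueError("insufficient funds")
        self.balance -= amount

    def inspect_balance(self):
        # Query operation: reads attribute values without changing state.
        return self.balance


account = BankAccount("John", 100)  # an object: an instantiation of the class
account.deposit(50)
account.withdraw(30)
```

Here the state of the object is defined by the values of `owner` and `balance`, and each call to `deposit` or `withdraw` moves it to a new state.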
The class-specific implementation of an operation is described by a method. For example, the class Polygon may have an operation to calculate the area. Square and Triangle are classes of type Polygon, and therefore each has an operation to calculate the area, but the methods of calculating the area of a square and of a triangle are different [Kaf98, Boo94, FS97, Fir93].
Inheritance
Inheritance is a powerful concept that facilitates abstraction. A class can inherit features from another class. The class Square, for example, can inherit features, such as number of sides, from the class Polygon.
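The Polygon example from the preceding paragraphs can be sketched as follows; the constructor arguments and method bodies are our own illustrative choices, not taken from the report.

```python
class Polygon:
    def __init__(self, num_sides):
        self.num_sides = num_sides  # a feature inherited by all subclasses

    def area(self):
        raise NotImplementedError  # each subclass supplies its own method


class Square(Polygon):
    def __init__(self, side):
        super().__init__(num_sides=4)  # inherits num_sides from Polygon
        self.side = side

    def area(self):
        return self.side ** 2  # class-specific method for the area operation


class Triangle(Polygon):
    def __init__(self, base, height):
        super().__init__(num_sides=3)
        self.base, self.height = base, height

    def area(self):
        return 0.5 * self.base * self.height  # different method, same operation
```

Both subclasses offer the same operation, `area`, but each provides its own method for it, which is exactly the operation/method distinction drawn above.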
Aggregation

Aggregation (also known as the part-whole or has-a relationship) is a form of composition in which an aggregate object conceals its constituent parts [Kaf98]. It occurs in many natural systems in our everyday life. For example, a sentence could be modelled as an object which is composed of other objects: a noun phrase and a verb phrase. An automobile is composed of an engine, chassis, wheels, etc.
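The sentence example can be sketched directly: the aggregate conceals its parts behind its own interface. The class and attribute names below are ours, chosen only to mirror the example.

```python
class NounPhrase:
    def __init__(self, text):
        self.text = text


class VerbPhrase:
    def __init__(self, text):
        self.text = text


class Sentence:
    """Aggregate object: composed of a noun phrase and a verb phrase."""

    def __init__(self, np, vp):
        self.np = np  # constituent parts held by the aggregate
        self.vp = vp

    def render(self):
        # Clients see only the whole; the parts stay behind this interface.
        return f"{self.np.text} {self.vp.text}"


sentence = Sentence(NounPhrase("the customer"), VerbPhrase("opens an account"))
```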
Association
Objects in real life are associated with other objects in many ways. A system can be seen as a number of related parts. Compared to aggregation, where only the whole is externally visible, in an association each part maintains its identity and external visibility. A person can be associated with a car: a person drives a car. Associations describe relationships between classes. The relationship between individual objects may be referred to as a link. A link is an instance of an association just as an object is an instance of a class. For example, "a person owns a car" describes an association between the concept person and the concept car, whereas "John owns that green Mazda" describes a link between John and a particular car.
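The association/link distinction can be made concrete: the "owns" association is declared at the class level, while each call that connects one person to one car creates a link, an instance of that association. Names below are our own sketch.

```python
class Car:
    def __init__(self, description):
        self.description = description


class Person:
    def __init__(self, name):
        self.name = name
        self.cars = []          # the association "owns", at the class level

    def owns(self, car):
        self.cars.append(car)   # creates a link: an instance of the association


john = Person("John")
mazda = Car("green Mazda")
john.owns(mazda)                # the link between John and a particular car
```

Note that, unlike the aggregation sketch above, both parts of the association remain externally visible: `mazda` keeps its own identity outside `john`.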
2.2.2 Object-Oriented Modeling

Object-oriented modeling is a new way of thinking about problems using models organised around real-world concepts. In contrast with the conventional methods, which place their emphasis on the functional decomposition of systems, the object-oriented methods place the primary emphasis on identifying objects, which combine both data structure and behaviour, and on the classification of these objects. There exist many methods for object-oriented modeling (e.g. Booch, OMT, Objectory, and OOSE, to name some). Although the basic terminology,
the development process, and the notations vary from one object-oriented method to another, the notion of an object as a building block is central to all of them. These methods are applied to both activities of software modelling: analysis and design. Booch (in [Boo94]) gave the following definition of analysis and design: "in the analysis activity, we seek to model the world by discovering the classes and objects that form the vocabulary of the problem domain, and in design, we invent the abstractions and mechanisms that provide the behaviour that this model requires." Depending on the method used and the intention of the analyst(s), many models can be produced to reflect one aspect or another of the system under development. However, during analysis it is useful to show both the static and the behavioural aspects of the modelled system.

Having too many methods that use different, and sometimes conflicting (see Table 1), techniques and notations has caused problems for software developers and CASE tool vendors. Recently an effort has been made to provide a standardised notation called UML (the Unified Modeling Language). UML is briefly discussed in section 2.4. In the next section we discuss the general approach to Object-Oriented Analysis (OOA).
Method         Class            Association               Generalization   Aggregation
UML            Class            Association               Generalization   Aggregation
Booch          Class            Uses                      Inherits         Containing
Rumbaugh       Class            Association               Generalization   Aggregation
Jacobson       Object           Acquaintance association  Inherits         Consists of
Coad           Class & Object   Instance Connection       Gen-Spec         Part-Whole
Shlaer/Mellor  Object           Relation                  Subtype          N/A
Odell          Object Type      Relation                  Subtype          Composition

Table 1: OO Modelling Terminology
2.3 Approaches to Object-Oriented Analysis
As mentioned in the previous section, Object-Oriented Analysis is defined as the process of discovering the classes and objects of the problem domain and building models based on these classes and objects. Many approaches have been suggested for the analysis process and how it can be achieved. Some of these approaches are discussed in this section.
2.3.1 Classical Approaches

These approaches are called classical because they derive from the principles of classical categorisation and focus on tangible things in the problem domain [Boo94]. Shlaer and Mellor [SM88] suggest the following sources of objects and classes:

Tangible things (e.g. cars, customers, devices)
Roles (e.g. employee, student, father)
Events (e.g. landing, departure)
Interactions (e.g. meeting, instruction, order)

Ross [Ros89] offers a similar list, but from a database modelling point of view: people, places, organisations, concepts, events, and things.
2.3.2 Behaviour Analysis

This school of thought in object-oriented analysis places the primary focus on dynamic behaviour as the primary source of objects and classes. Rubin and Goldberg [RG92] suggest an approach for identifying objects and classes based on system functions. Their approach emphasises understanding the system behaviour and then trying to find out who initiates and who participates in these behaviours. These participants and initiators are considered as objects.
2.3.3 Use-Case Analysis

A use-case, first formalised by Jacobson [JCJO82], is defined as: a particular form, pattern, or exemplar of usage; a scenario that begins with some user of the system initiating some transaction or sequence of interrelated events. This approach can be used as early as the requirements analysis stage, where end users and software developers enumerate scenarios that are fundamental to the system's operation.
2.3.4 Informal English Description
This is a radical alternative to the classical object-oriented analysis approaches [Boo94]. It was first proposed by Abbott [Abb83], who suggested writing an English description of the problem and then underlining the nouns and verbs. The nouns represent candidate objects and the verbs represent candidate operations upon them. This approach is useful because it is simple and because it forces the developer to focus on the vocabulary of the problem domain.
2.4 UML
The Unified Modeling Language (UML) is a language for specifying, visualising, constructing, and documenting the artifacts of software systems, as well as for business modeling and other non-software systems (see www.rational.com). It was started as an effort by Grady Booch and Jim Rumbaugh at Rational Software Corporation in 1994 to combine their well-known methods: Booch and OMT (the Object Modeling Technique). Later they were joined by Ivar Jacobson, the inventor of OOSE (the Object-Oriented Software Engineering method). UML has the formal support of the OMG (the Object Management Group, an industry standards body), and is receiving wide acceptance in the software industry. Many software development organisations and CASE tool vendors have adopted UML. The main idea behind standardising the notation and not the process (method) is that software developers can communicate better while still using their favourite method, rather than being restricted to one standard method.

UML defines a collection of graphical diagrams (and their semantics) that can be used for modeling systems using OO concepts. The following is a list of the diagrams defined in UML version 1.1:

use case diagram
class diagram
behaviour diagrams:
  state-chart diagram
  activity diagram
  interaction diagrams:
    sequence diagram
    collaboration diagram
implementation diagrams:
  component diagram
  deployment diagram

It is outside the scope of this report to discuss all these diagrams; instead we will focus on one of them, the class diagram, which is discussed in the next section.
2.5 Class Modelling
Class modelling involves building structural views of the system. These views are expressed in a set of diagrams called class diagrams. Collectively these diagrams are referred to as the class model. This model describes the types of objects in the system (classes), their internal structure (attributes and operations), and the static relationships between them [BRJ97]. There are three principal kinds of relationships: associations, subtypes, and aggregations. The class diagram technique is a central modelling technique that is used in nearly all object-oriented methods [Fow97, Lar98]. In addition to being a central technique, the class diagram is also subject to the greatest range of modelling concepts. Before we begin describing the elements of this diagram, it is worthwhile to mention the way people use it, and from what perspective they draw and interpret it. Martin Fowler [Fow97] discusses three perspectives that are used in connection with this diagram, as follows:
Conceptual: In the conceptual perspective, the class diagram shows the concepts (object types) in the problem domain. These object types will naturally relate to the classes that implement them. This perspective is used at the analysis stage of development, where the focus is on the conceptual model with no regard to the software that might implement it [Lar98]. Cook and Daniels [CD94] call this the essential perspective.

Specification: From this perspective the diagram moves from the domain to look at the software to be implemented. It shows the interfaces of the software but not the implementation; in other words, it specifies the software to be implemented.

Implementation: In this view, the diagram shows language-dependent aspects of classes and methods; more implementation details are given.

Now we will discuss the class diagram elements.
2.5.1 Classes
UML defines a class as "the descriptor for a set of objects with similar structure, behaviour, and relationships". In UML, classes are drawn as solid-outline rectangles with three compartments separated by horizontal lines. The top compartment holds the class name, which begins with an uppercase letter. The middle compartment holds a list of attributes, and the bottom one holds a list of operations. Attribute and operation names begin with a lowercase letter. The class name is mandatory and should be unique within the diagram. The attribute and operation compartments can be suppressed.
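As a rough illustration of the three-compartment notation, the helper below renders a class box in ASCII. It is our own sketch for exposition, not part of UML or of this report's tooling.

```python
def class_box(name, attributes=(), operations=()):
    """Render a UML-style class rectangle: name, attributes, operations."""
    rows = [name] + list(attributes) + list(operations)
    width = max(len(r) for r in rows) + 2
    rule = "+" + "-" * width + "+"

    def row(text):
        return "| " + text.ljust(width - 1) + "|"

    out = [rule, row(name), rule]                 # top compartment: class name
    out += [row(a) for a in attributes] + [rule]  # middle: attributes
    out += [row(o) for o in operations] + [rule]  # bottom: operations
    return "\n".join(out)


print(class_box("Car", ["colour", "numSeats"], ["drive()"]))
```

Passing empty attribute or operation lists simply collapses the corresponding compartment, which loosely mirrors the "suppressed" compartments mentioned above.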
2.5.2 Associations

Associations (binary or n-ary) represent relationships between instances of object types. Conceptually, associations represent relationships between the object types (concepts) involved. In the specification view they represent responsibilities, and will be made explicit by operations. An implementation interpretation implies the presence of a pointer to another object. Associations can be bi-directional or uni-directional: conceptually they are bi-directional; for specification and implementation models they are uni-directional.
2.5.3 Multiplicity

A multiplicity specifies the range of allowable cardinalities that a set may assume. Multiplicity specifications may be given to roles within associations or to parts within compositions. A multiplicity is a subset of the nonnegative integers. It can be a single value (e.g. 1 for exactly one) or an integer range (e.g. 1..7). A single star (*) is used to denote an unlimited nonnegative integer range.
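The multiplicity rules just described can be sketched as a small parser and checker. This is our own minimal sketch of the notation as stated above (single value, `lo..hi` range, `*`), not a full UML multiplicity grammar.

```python
def parse_multiplicity(spec):
    """Parse a multiplicity string into a (lower, upper) pair.

    '1' -> (1, 1); '1..7' -> (1, 7); '*' -> (0, None), i.e. unlimited.
    """
    if spec == "*":
        return (0, None)
    if ".." in spec:
        lo, hi = spec.split("..")
        return (int(lo), None if hi == "*" else int(hi))
    n = int(spec)
    return (n, n)


def allows(spec, cardinality):
    """Check whether a set of the given size satisfies the multiplicity."""
    lo, hi = parse_multiplicity(spec)
    return lo <= cardinality and (hi is None or cardinality <= hi)
```

For example, `allows("0..1", 2)` is false, matching the qualified association in figure 2 where at most one Customer may be selected.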
2.5.4 Qualifier

A qualifier is an association attribute whose value serves to partition the set of objects associated with an object across an association. Figure 2 shows the association attribute account# for the association between Customer and Bank.

[Figure 2: Qualifier Association. A Bank class carrying the qualifier account#, linked with multiplicity * on the Bank side to 0..1 Customer.]
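The qualifier in figure 2 can be read as a lookup: given a Bank and an account number, at most one Customer is selected. A dictionary gives a minimal sketch of this partitioning; the class and method names here are our own.

```python
class Bank:
    def __init__(self):
        self._accounts = {}  # the qualifier account# partitions the customers

    def open_account(self, account_no, customer):
        self._accounts[account_no] = customer

    def customer_for(self, account_no):
        # Qualified association: (Bank, account#) yields 0..1 Customer.
        return self._accounts.get(account_no)


bank = Bank()
bank.open_account("1234", "John")
```

Without the qualifier the Bank is associated with many (*) customers; with it, each (Bank, account#) pair selects at most one (0..1).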
2.5.5 Generalization

Generalisation is a taxonomic relationship between a more general element and a more specific element that is fully consistent with the first element and that adds additional information.
2.5.6 Composition
Composition is a form of aggregation with strong ownership and coincident lifetime of the parts with the whole. The multiplicity of the aggregate end may not exceed one.
2.6 Identification Techniques of Classes
The identification of objects and classes is considered to be the hardest part of object-oriented analysis [Boo94]. Over the past two decades software engineers have developed numerous techniques to tackle this problem. Some of these techniques are now considered relatively obsolete, and others are state of the art (for full coverage see Firesmith [Fir93]). Depending on the analysis approach adopted (see section 2.3), multiple sources of information can be used for the identification process (for example, Software Requirements Documents (SRDs), Software Design Documents (SDDs), users, customers, application domain experts, and the software engineer's own experience). This section discusses the following identification techniques:

Using the types of the entities to be modelled.
Using object decomposition.
Using inheritance (generalization and specialization).
Using nouns and noun phrases.
Using semantic networks.
2.6.1 Using the Entities to be Modelled

This approach is used in some object-oriented analysis methods such as Firesmith's ADM3 [Fir93], Coad and Yourdon's Object-Oriented Analysis [CYGS92], and Shlaer and Mellor's Object-Oriented Systems Analysis [SM88]. In this approach it is highly recommended that software engineers identify the application domain entities first and then find the corresponding objects and classes:

1. Identify the individual important system devices, roles, locations, events, organizations, and interactions in the problem domain.
2. Identify the corresponding objects and classes.

This approach relies heavily on the experience of the software engineer.
2.6.2 Using Object Decomposition
This approach assumes that many objects and classes are aggregates of component objects and classes. It also assumes that these aggregate objects have already been identified, and that by using object decomposition we can find the component objects and classes. The two steps of this approach are as follows:

1. Look for aggregate objects and classes.
2. Identify their component objects and classes.

This approach can be used to find only some of the objects and classes, because not all objects and classes are aggregates [Fir93].
2.6.3 Using Inheritance

Generalization

This approach assumes that objects are identified before classes and that commonalities among related objects can be used to define classes:

1. Identify all objects.
2. Look for collections of objects that share the same attributes or operations.
3. Generalize these common resources to define classes.
Specialization

The focus of this approach is on subclasses. It assumes that some classes often contain common resources (attribute types, operations, etc.) from which subclasses can be generated that inherit these common resources:

1. Identify classes with common resources.
2. Use these common resources to generate superclasses and new, simpler subclasses.
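The generalization steps above amount to intersecting the attribute sets of related objects: the shared attributes become superclass resources, and what remains distinguishes the subclasses. A minimal sketch under that reading (the example objects and attribute names are ours):

```python
def generalize(objects):
    """Given objects described as sets of attribute names, return the shared
    attributes (candidates for a common superclass) and what each object
    would keep as its own specialised resources."""
    sets = [set(attrs) for attrs in objects.values()]
    common = set.intersection(*sets) if sets else set()
    # Each object keeps only what the superclass does not already provide.
    specialised = {name: set(attrs) - common for name, attrs in objects.items()}
    return common, specialised


objects = {
    "car":   {"make", "model", "numDoors"},
    "truck": {"make", "model", "payload"},
}
common, rest = generalize(objects)  # common -> superclass, rest -> subclasses
```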
2.6.4 Using Nouns and Noun Phrases

This approach was first proposed by Russell J. Abbott [Abb83] and expanded by Booch [Boo86] in the classic Software Engineering with Ada.

1. Either obtain a Software Requirements Document or use narrative English (or any other natural language) to write a description of the problem to be solved in terms of real application domain entities and their interactions.
2. Use nouns, pronouns, and noun phrases to identify the objects and classes: singular proper nouns and nouns of direct reference identify objects; plural and common nouns identify classes. Verbs are used to identify operations.
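As a toy illustration of Abbott's technique, the sketch below "underlines" the nouns and verbs of a description by checking words against small hand-made lexicons. A real tool would use a part-of-speech tagger; the lexicons, the example sentence, and the function name are entirely our own.

```python
# Toy lexicons; a real system would use a part-of-speech tagger instead.
NOUNS = {"customer", "account", "bank", "machine", "card"}
VERBS = {"inserts", "withdraws", "deposits", "checks"}


def abbott_candidates(description):
    """Nouns become candidate classes/objects; verbs become candidate
    operations upon them, following Abbott's informal-English technique."""
    words = [w.strip(".,").lower() for w in description.split()]
    classes = sorted({w for w in words if w in NOUNS})
    operations = sorted({w for w in words if w in VERBS})
    return classes, operations


text = "The customer inserts a card and withdraws cash from the machine."
classes, operations = abbott_candidates(text)
```

Even this toy version shows the appeal noted above: the candidates come straight from the vocabulary of the problem domain.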
2.6.5 Using Semantic Nets

This approach assumes that semantic nets (SNs) are used for documenting the relationships between objects and classes. SNs are very object-oriented: the nodes represent objects and classes, and the arcs represent semantic relationships among the nodes. SNs are the basis of several object-oriented requirements analysis methods, such as Firesmith's ADM3 [Fir93] and Berard's OORA [Ber91].
2.7 A Review of some Systems that Support Software Requirements Analysis
Because much of the input in the early stages of software development is in natural language, and most of the activities performed in these stages are knowledge-intensive, there have been only a few attempts to develop automated tools for supporting these early stages of software development. The early emphasis was on AI-based tools, mostly rule-based systems. SAFE, RA, and SPECIFIER, discussed below, are examples of AI rule-based systems. More recently, a few prototype tools based on Natural Language Processing (NLP) components have come to light to address some of the problems of analysing informal software specifications. This can be attributed to the recent advances in this area of research. Meziane's PhD work, NL-OOPS, and Attempto are examples of NLP-based systems.
2.7.1 SAFE

SAFE (Specification Acquisition From Experts) was one of the first attempts at providing such tools [BGW78]. The goal of this project was to process software specifications written in natural language and generate an operational specification. The main assumption was that informal specifications are usually partial, and that an automated tool has to complete these partially stated descriptions and transform them into specifications that can be run. However, no linguistic knowledge was used in this system and, as a result, the input text has to be manually parenthesised to avoid syntactic problems.
2.7.2 The Requirements Apprentice

The Requirements Apprentice (RA), developed at MIT [RW91], is another AI-based knowledge acquisition system. Here again the developers imposed a restricted command language to avoid the problems of free natural language. RA was based on three essential components: the Cliche Library, a declarative repository of information about particular domains of interest; Cake, a knowledge representation and reasoning system; and the Executive, an interface that interprets the communication language between the analyst and the RA.
2.7.3 SPECIFIER

SPECIFIER, developed at Illinois University [MH91], is intended to be an intelligent assistant to a requirements analyst interested in developing formal software specifications from informal specifications. The system uses common problem-solving techniques, such as schemas, analogy, and difference-based reasoning, to derive the formal specification. The system keeps a knowledge base of past solved problems and a set of reasoning rules to search for a problem matching the current one; if an exact match is not found, it tries to find the closest one and uses it to guide the specification process. The user interacts with the system in a restricted subset of natural language. The process is started by the system asking the user some questions, such as "what do you want me to specify, a program or a data type?" and "what is the name of your program?". Based on the user's answers the system performs its reasoning.
2.7.4 Meziane
Farid Meziane, in his PhD work, developed a system that aims at helping an analyst in the development of formal specifications in VDM [Mez94]. The main emphasis in this project was on the detection of ambiguous sentences in the informal specifications. A sentence is considered ambiguous if the system generates more than one meaning representation for it based on syntactic and semantic processing. In this case the user is prompted to choose the intended meaning. After analysing the text sentence by sentence, the system tries to find the entities mentioned in the text, based on nouns and noun phrases, and the relationships between these entities, based on verbs. These entities and their relationships are used to build an Entity-Relationship model, from which VDM data types are produced.
2.7.5 NL-OOPS
NL-OOPS (Natural Language Object Oriented Production System) is an NLP-based prototype system that aims to support the analysis of natural language requirements and the extraction of objects and their relationships, which are then used to build an OMT object model [Mic96]. LOLITA (Large-scale Object-based Linguistic Interactor Translator Analyser), a system developed at Durham University, is used as the core of this CASE tool. The system processes problem statements written in completely unrestricted English. It relies on LOLITA to perform all the linguistic processing and build a model of the text. After the text has been processed by LOLITA, an algorithm is used to extract the objects and relationships from which the object model is generated.
2.7.6 Attempto
Fuchs et al. have taken another approach by using a predefined subset of English, called Attempto Controlled English (ACE), to write requirements specifications [FF92, Fuc95, ES95]. Attempto is an NLP-based system that is used to analyse software specifications written in ACE. As ACE is a subset of English with a defined syntax, the system uses grammar rules and a lexicon for its analysis and optionally produces Prolog code that can be run to validate the specification. This approach has been used to analyse an Automated Teller Machine problem. However, as is the case with all controlled languages, users have to be trained before they can use them.
3 GATE and Information Extraction: supporting OOA
3.1 Overview
The goal of this section is to show our view of how Language Technology (LT) can be used to assist software analysts in doing their job. In section 3.2 we discuss the architecture of a general language engineering system, GATE, and then we look at Information Extraction with an example. Section 3.4 shows how language processing components in GATE are configured to build a system that supports OOA.
3.2 GATE
GATE (General Architecture for Text Engineering) is a rapid application development environment for Language Engineering (LE) applications [CGW95, GCW+96, GRCH96, CWG96b, CWG96a]. It provides tools for data visualisation, debugging, and evaluation of LE modules and systems. The system is being developed at the Department of Computer Science, University of Sheffield, and was used to develop the LaSIE (Large Scale Information Extraction) system; for more information on LaSIE see [GWH+95, HCGA96].
3.2.1 GATE Architecture
The GATE architecture comprises three principal components, as shown in figure 3:
GATE Document Manager (GDM): a database for storing information about the processed text. All components of an LE system store their results in, and read others' results from, this database. GDM is an implementation of the TIPSTER architecture; see section 3.2.2.
GATE Graphical Interface (GGI): a graphical interface for launching processing tools and viewing results.
Collection of REusable Objects for Language Engineering (CREOLE): a collection of algorithmic and data resources that inter-operate with the database and the graphical interface. All the real work of any LE system in the GATE environment is done by CREOLE modules. These modules may be built from scratch, or pre-existing components can be reused.
Figure 3: GATE Architecture
3.2.2 TIPSTER Architecture
TIPSTER is an ARPA-sponsored programme of research and development in information retrieval and extraction. This project has produced a data-driven architecture for NLP systems. In this architecture, information about the processed text is added in the form of annotations and stored in a separate database while the original text remains unchanged. Annotations associate analysis information (attributes) with portions of documents (identified by start/end byte offsets). For example, a part-of-speech tagger associates information with the text in the form of attribute-value pairs, as shown in figure 4. This architecture is implemented in GATE (see section 3.2.1); for more information on TIPSTER refer to [GDC94].
3.3 Information Extraction
Information Extraction (IE) is a relatively new field of research in the area of Natural Language Processing. The goal of IE research is to build systems that find and link relevant information in a text while ignoring extraneous and irrelevant information [CL96].
Text: Sara savored the soup.

Annotations:
Id  Type      Start  End  Attributes
1   token     0      5    pos=NP
2   token     6      13   pos=VBD
3   token     14     17   pos=DT
4   token     18     22   pos=NN
5   token     22     23
6   name      0      5    name_type=person
7   sentence  0      23

Figure 4: TIPSTER annotation example

Information Extraction can be defined as the process of analysing electronic texts in order to extract specific types of information and store them in a fixed format. The information produced can be stored in databases or analysed by spreadsheets. One feature that has made IE especially interesting is that it is possible to compare the output of a system with an ideal database produced by a human analyst for some text in order to determine how well the system is doing. This distinguishes information extraction systems from other natural language processing systems where evaluation is known to be problematic (e.g. MT, generation, etc.). The US Advanced Research Projects Agency (ARPA) has been sponsoring a series of evaluation conferences, known as the Message Understanding Conferences (MUCs), since 1987. In these conferences the extraction tasks are precisely defined, and human analysts are employed to perform them manually to produce a test corpus and specified filled templates. Participants develop their automated systems, which are then evaluated against the manually extracted information (see [GS94] for an overview of the history of the MUCs).
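The stand-off annotation idea in figure 4 can be sketched in a few lines of code. The following is an illustrative Python sketch, not a TIPSTER reference implementation; the class name, field names, and example offsets (here with exclusive end offsets) are our own assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    """A TIPSTER-style annotation: analysis results live apart from
    the text and point back at it via start/end byte offsets."""
    id: int
    type: str
    start: int
    end: int
    attributes: dict = field(default_factory=dict)

    def span_text(self, text: str) -> str:
        # The original document is never modified; an annotation
        # only references a span of it.
        return text[self.start:self.end]

text = "Sara savored the soup."
annotations = [
    Annotation(1, "token", 0, 4, {"pos": "NP"}),
    Annotation(6, "name", 0, 4, {"name_type": "person"}),
    Annotation(7, "sentence", 0, 22),
]

# Querying the store: every annotation whose span is the name "Sara".
sara = [a for a in annotations if a.span_text(text) == "Sara"]
```

Because each module only adds annotations, several analysers (tagger, named-entity recogniser, parser) can layer their results over the same unchanged text.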
3.3.1 IE Tasks
The most recent MUC competition (MUC-7, 1998) defined five information extraction tasks as follows:
Named Entity recognition (NE): all the names of people, places, organisations, etc. are extracted from the text.
Coreference Resolution (CO): identity relations between certain entities in the text are identified.
Figure 5: An Example Text
Template Element construction (TE): descriptive information is added to the entities extracted in the NE task.
Template Relationship (TR): relationships between template elements are identified.
Scenario Template production (ST): specific event scenarios are produced from the TE results.
3.3.2 An Example
Figure 5 shows a Wall Street Journal article reporting a takeover incident. This text is processed by LaSIE (Large Scale Information Extraction), a system for information extraction developed within GATE [HCGA96]. Below we discuss the IE tasks using this example.
Named Entity recognition
NE recognition is concerned with the identification of all the names of people, places, dates, organisations, and amounts of money mentioned in the text. The result of running the text in figure 5 through LaSIE is given in figure 6. As shown in the figure, the system identifies three types of names (date, location, organisation). NE recognition is the most reliable IE technology: MUC-7 showed that this task can be performed at above 93% accuracy. The current Sheffield system performed at above 85% accuracy [HG+98].
Figure 6: Named Entity
Coreference Resolution This task seeks to identify identity relations
between entities in the text. The process of coreference resolution is more relevant to the other IE tasks (e.g. TE and ST) than to end users [Cun97]. For the MUC-7 competition only noun phrases are linked; relations concerning verbs are not considered. For example, as shown in figure 7, Nixdorf in the last paragraph is coreferred with Nixdorf Computer AG in the first. The coreference algorithm of LaSIE achieved the highest combined precision and recall score, 61%, in MUC-7. It performs well on all definite NPs and the majority of proper names in the MUC-7 walkthrough text. However, only half of the pronouns are correctly resolved (for more details see [HG+98]).
Template Element The TE task associates descriptive information with
entities found in the text. For example, in our example text in figure 5 the system found Nixdorf Computer AG to be an organisation of type Company with two aliases, Nixdorf and NIXDORF. The format of the output can be seen as a database record and can be changed to any format suitable for the user's task. Humans can perform this task at about 93% accuracy. The best MUC-7 system scored above 85% and LaSIE scored above 76% accuracy (for details see the MUC-7 proceedings). However, this task is domain dependent, which means that if the domain of the system is changed, say from financial news to software requirements, considerable changes have to be made to the system in order to give accurate results.
Figure 7: Co-Reference Resolution
Template Relations MUC-7 included three relational objects which point
to Template Element objects. These three relations are LOCATION_OF, EMPLOYEE_OF, and PRODUCT_OF. In MUC-6 one limitation of the ST task coupled with the TE task was that relations between entities had to be encoded as attributes in the template. To overcome this limitation, in MUC-7 these relations were separated from the scenario. The best MUC-7 system scored 75% and LaSIE scored 54% (see the MUC-7 proceedings for more information on this task). In our example, a system might identify West Germany as the location of Nixdorf and the relation that Sven Kado is employed by Nixdorf.
Scenario Template In this task TE entities are tied together to produce
specific event scenarios. For example, if TE has found Nixdorf Computer AG to be an organization of type company, and Sven Kado and Klaus Luft are two person entities, then these entities can be linked together to produce an ST as follows:
ORGANIZATION
    NAME : "Nixdorf Computer AG"
POST : "finance and purchasing"
WHO_IS_IN
    NAME : "Sven Kado"
WHO_IS_OUT
    NAME : "Klaus Luft"

Figure 8: Template Element
ST is not an easy task. MUC competitions reported that humans can achieve about 81% accuracy. However, the best MUC-7 system scored around 50% and LaSIE scored 44%, which reflects the complexity involved in the task. ST is also domain dependent, and considerable changes have to be made to give accurate results or to move the system to another domain.
3.4 LaSIE and OOA
We see great relevance between some IE tasks (e.g. TE and TR) and the process of OO modelling. LaSIE's good performance on most of the IE tasks motivated us to use it to build a prototype system, called OO-Analyser, to extract class model elements (classes, attributes, associations, etc.) from unrestricted English text of software requirements. The system works in two major stages as follows:
1. Linguistically analyse the input text and build an integrated model of the text (a semantic net). This is the result of the discourse analysis stage of LaSIE.
2. Use the semantic net to extract class model elements (classes, attributes, associations, etc.) and other results that support OOA (e.g. worksheets).
Figure 9 shows the architecture of the system. The system components are discussed in detail as follows:
[Figure: Informal Spec -> NLP Engine (Discourse Model) -> OO Analyser -> Class Model and Worksheet]
Figure 9: System architecture
3.5 The input
The input to the system is a plain text file containing an informal description of the problem (informal requirements in English). Section 4 shows an example of this text as used in the case study. This text, along with all the processing results of the system modules, is stored in a special repository. The system accesses this repository through the GATE Document Manager (GDM) as described in [CGW95, GCW+96, GRCH96].
3.6 The Natural Language Processing component
This is the central component of our system. For the first prototype we have used the LaSIE system to build an integrated model of the text. What makes this system particularly appropriate for our task is, among other things, its modularity. All language processing modules interact with each other through the GATE Document Manager (GDM) as discussed in section 3.2.1. This makes it relatively easy to add other modules without affecting the existing ones. Three principal language processing stages are used to analyse the text and build the model, in the following order:
3.6.1 Lexical Preprocessing
The Lexical Preprocessor consists of four modules: a tokenizer, a sentence splitter, a part-of-speech tagger, and a morphological analyser. The input to the lexical preprocessor is a plain ASCII file containing a description of the problem at hand (informal requirements). The output is a set of charts, one per sentence, written as Prolog terms to be used directly by the parser and the discourse interpreter, which are written in Prolog.
The Tokenizer and Sentence Splitter
The tokenizer takes a plain text file as input and splits it into word tokens in the TIPSTER representation format. The sentence splitter identifies sentence boundaries within the given text. These two modules prepare the input for the part-of-speech tagger, which expects one sentence per line with all tokens separated by white space.
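The division of labour between these two modules can be sketched as follows. This is a deliberately crude Python sketch of the idea, not the GATE modules themselves; real tokenizers and splitters must also handle abbreviations, numbers, and punctuation inside tokens.

```python
import re

def tokenize(text):
    """Split raw text into word tokens, keeping punctuation as
    separate tokens (a rough stand-in for the tokenizer module)."""
    return re.findall(r"\w+|[^\w\s]", text)

def split_sentences(tokens):
    """Group tokens into sentences at full stops; a simplification
    of the real sentence splitter."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok == ".":
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences

# The tagger downstream expects one sentence per line, with tokens
# separated by white space:
text = "A library issues loan items. Each customer is known as a member."
lines = [" ".join(s) for s in split_sentences(tokenize(text))]
```

The output format (one whitespace-separated sentence per line) is exactly what the part-of-speech tagger described next consumes.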
Part-of-Speech Tagging
The Brill tagger [Bri94] is a public domain rule-based tagger. It uses 48 part-of-speech tags. Some changes were made to the tagger to use it in the LaSIE system; we have used this modified version as it was used in LaSIE.
The Morphological Analyser
After part-of-speech tagging, all nouns and verbs are passed to the morphological analyser and a root form is returned for inclusion in the initial input to the parser. We used the same module that was delivered with GATE and used in the LaSIE system [Gai95].
3.6.2 Parsing and Semantic Interpretation
The parser is a simple bottom-up chart parser implemented in Prolog. It is a modification of the Gazdar and Mellish parser [GM89]. It uses a feature-based phrase structure grammar and employs a unification-based mechanism. The parser takes the output of the Lexical Preprocessor, a sequence of lexical and multi-word chart edges, and uses the grammar rules to build a syntactic tree and compositionally generate a semantic representation for every sentence in the text. Semantic structures are assigned by hand to the grammar rules. The semantic representation is simply a predicate-argument structure (first-order logical terms). The morphological roots of the simple verbs and nouns are used as predicate names in the semantic representations. Tense and number are translated directly into this notation where appropriate. All NPs and VPs introduce unique instance constants in the semantics which serve as identifiers for the objects or events referred to in the text. For example, "A library issues loan items." will map to something like:
library(e2), determiner(e2,a), number(e2,sing), issue(e1), time(e1,present), aspect(e1,simple), voice(e1,active), lsubj(e1,e2), loan(e4), item(e3), qual(e3,e4), number(e3,plural), lobj(e1,e3).
The parser is a partial parser, which means it produces correct, but not necessarily complete, syntactic structures and hence semantic representations. If a full parse of a sentence is not found, the parser uses a best-parse algorithm to choose the best complete sub-structures (i.e. phrases of any category).
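To make the predicate-argument notation above concrete, the Prolog terms for "A library issues loan items." can be mirrored as plain tuples and queried for grammatical roles. This is an illustrative Python sketch using the instance constants e1..e4 from the example; the helper names are our own, not part of LaSIE.

```python
def semantics_for_example():
    """The predicate-argument structure for
    'A library issues loan items.' as (predicate, args) tuples,
    mirroring the first-order terms in the text. NPs and VPs each
    introduce a unique instance constant (e1..e4)."""
    return [
        ("library", ("e2",)), ("determiner", ("e2", "a")),
        ("number", ("e2", "sing")),
        ("issue", ("e1",)), ("time", ("e1", "present")),
        ("aspect", ("e1", "simple")), ("voice", ("e1", "active")),
        ("lsubj", ("e1", "e2")),
        ("loan", ("e4",)), ("item", ("e3",)), ("qual", ("e3", "e4")),
        ("number", ("e3", "plural")),
        ("lobj", ("e1", "e3")),
    ]

def role_of(event, facts, role):
    """Look up a grammatical role (lsubj/lobj) of an event instance."""
    return [args[1] for pred, args in facts
            if pred == role and args[0] == event]

facts = semantics_for_example()
# The logical subject of the issue event e1 is the library instance e2:
subject = role_of("e1", facts, "lsubj")
```

Later stages (discourse interpretation and the OO-Analyser) work entirely over facts of this shape, which is why the lsubj/lobj roles can be reused to find the arguments of candidate relationships.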
3.6.3 Discourse interpretation
This module reads the meaning representation of every sentence produced by the parser and adds it to a predefined world model to produce a final model specific to the processed text, called the discourse model. The meaning representation is translated into a representation of instances, their ontological classes, and their attributes in the XI knowledge representation language (pronounced ZI; the name is meant to suggest cross-classification, X, and Inheritance, I, the two primary features of the language). XI allows a straightforward definition of cross-classification hierarchies and the association of arbitrary attributes with classes or instances in the hierarchy (for a full description of XI see [Gai95]). An XI world model consists of a domain-specific ontology and an associated attribute knowledge base.
The World Model
The definition of a cross-classification hierarchy in XI is called an ontology. The ontology, together with an association of attributes with nodes in the ontology, forms a world model. The world model can be viewed as a frame upon which a discourse model is built. It also serves as a declarative knowledge base that contains semantic information. An example of a simple world model is given in figure 10.
[Figure: an ontology with top(X) divided into event(X), object(X), and property(X); issue(X) under event; library(X) and customer(X) under object, with student(X) and professor(X) under customer; name(X) and address(X) under property; the attributes name and address attached to customer.]
Figure 10: A simple world model
This model shows four object classes (namely library, customer, student, and professor), one event class, and two attributes. The customer class is shown to be a superclass of the student and professor subclasses. Also two attributes, name and address, are attached to this class.
Discourse Model
In this section we discuss how knowledge from the text is added to the world model in order to turn it into a model specific to the text, called the Discourse Model. The discourse interpreter takes the semantic representation of every sentence and applies four processing stages as follows:
1. Adding Instances and Attributes to the World Model: In this stage the Discourse Interpreter adds simple noun phrase instances and their attributes under the object node in the world model, and all verb events and their attributes under the event node.
2. Presupposition Expansion: After all instances and their attributes have been added to the world model, the discourse interpreter examines the knowledge base for any presupposition rules attached to the new instances or attributes.
3. Coreference Resolution: After adding the semantics of a new sentence to the world model and expanding all presuppositions, all newly added instances are compared with previously added ones to see if any two instances can be merged into a single one, representing a coreference resolution. This comparison is done in four stages:
(a) All new instances are compared with each other (intra-sentential coreference resolution).
(b) New pronoun instances are compared with existing instances in the same paragraph.
(c) All other new instances are compared with instances in the current and previous paragraphs.
(d) Instances with proper-name properties are compared with all other instances which have proper-name properties. This stage is specific to entities of the types specified in the named entity task of the MUC competition.
4. Consequence Expansion: In this stage the Discourse Interpreter checks whether any consequence expansion rules might be applied as a consequence of merging instances in the coreference resolution stage.
Having finished this stage, the Discourse Interpreter goes back to the first stage and takes the meaning representation of the next sentence. All sentences go through the same stages.
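The merging step at the heart of coreference resolution can be illustrated with a minimal sketch: two instances may be collapsed when their ontological classes match and none of their stated attributes disagree. This is our own simplification in Python, not LaSIE's actual algorithm, which also consults the ontology and paragraph distance.

```python
def compatible(inst_a, inst_b):
    """Two instances may corefer if they share a class and no
    attribute they both state has conflicting values (a crude
    stand-in for the real comparison)."""
    if inst_a["class"] != inst_b["class"]:
        return False
    shared = set(inst_a["attrs"]) & set(inst_b["attrs"])
    return all(inst_a["attrs"][k] == inst_b["attrs"][k] for k in shared)

def merge(inst_a, inst_b):
    """Collapse two coreferring instances into one, pooling their
    attributes into a single discourse-model node."""
    return {"class": inst_a["class"],
            "attrs": {**inst_a["attrs"], **inst_b["attrs"]}}

# 'Each customer ...' in one sentence, 'the customer' later:
old = {"class": "customer", "attrs": {"name": "e5"}}
new = {"class": "customer", "attrs": {"address": "e6"}}
resolved = merge(old, new) if compatible(old, new) else None
```

After such a merge the discourse model holds one customer instance carrying both attributes, which is what later allows the OO-Analyser to attach name and address to a single class.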
3.7 OOA Component
Having shown how we build an integrated model of the text and store all other results in the repository, we now discuss how (in our view) this model can be utilised to support the OOA stage of software development. An NLP-based tool can assist OO analysts in many ways. It is in the OOA component of the system, shown in figure 9, that we decide the type of help our system can provide. We distinguish three possible levels of analysis that lead to useful results:
1. Performing superficial analysis on the text and giving the results in the form of a worksheet which an OO analyst can populate and further analyse. This is discussed in section 3.7.1.
2. Using some general world knowledge and general key constructs (such as is-a and has-a relationships) to find class model components.
3. Using more specific domain knowledge, for example knowledge about library information systems only.
Next we discuss the modules of the OOA component in some detail:
3.7.1 Worksheet Generator Module
This is a simple tool that takes all the NPs found in the text as a result of the parsing process and produces a list of candidate classes. The list is given in tabular form, as shown in figure 11 (a similar worksheet is given in [YA96]), and it is the analyst's job to populate this table. In this case no semantic or discourse interpretation is needed. Another list of all verbs can also be produced in a similar worksheet and used as candidate relationships. This approach is trivial but essential, as a list of all the candidate classes and relationships is always produced as a starting point of the OOA process. We have implemented a module that takes the basic NPs from the output of the parser, counts them, and presents them sorted in descending order of frequency. The most frequently mentioned NPs are given at the top of the list with their frequency. Figure 11 shows a partial result of analysing the problem statement in section 4.

Candidate Classes  Frequency  Class  Attribute  Comments
Loan item          10
Customer           5
Library            3
Member             3
Name               2
Address            1
Figure 11: Example of a Worksheet
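The counting and sorting behind the worksheet is straightforward; the following Python sketch shows the idea under the assumption that the NPs have already been extracted by the parser (the function name and sample data are ours).

```python
from collections import Counter

def worksheet(noun_phrases):
    """Produce the candidate-class worksheet rows: count each NP
    and sort by descending frequency, most frequent first."""
    counts = Counter(np.lower() for np in noun_phrases)
    return counts.most_common()

# NPs as they might come out of the parser for a few sentences:
nps = ["loan item", "customer", "loan item", "library",
       "customer", "loan item", "member", "name"]
rows = worksheet(nps)
# rows[0] holds the most frequently mentioned candidate class
```

The Class, Attribute, and Comments columns of figure 11 are deliberately left empty: filling them in is the analyst's job.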
3.7.2 Class Model Generator Module
This is a more sophisticated and difficult tool to build. The difficulty comes from the fact that we need to encode a lot of knowledge about OOA and about the domain. As mentioned above, we distinguish two types of tools based on semantic and discourse interpretation: one uses domain-independent knowledge and the other uses very domain-dependent knowledge. For the domain-dependent one, a lot of world and domain knowledge has to be encoded in the system's world model. In other words, the discourse model generated by the Discourse Interpreter, as discussed in section 3.6.3, has to be geared towards the final goal of the analysis. This may lead to a problem, as the final output of the system will be the same as the encoded knowledge. For this reason no attempt has been made yet in this direction. As an experiment, and based on some hypotheses, we have investigated the other possibility: using semantic analysis and domain-independent knowledge. In this direction, we have implemented a simple module that takes the discourse model (semantic net) produced by LaSIE and extracts classes, attributes, and associations. Some initial results of this experiment are given in section 4.2. The hypotheses and the identification algorithm used to implement this module are as follows:
Hypotheses:
1. Classes are usually expressed by common nouns.
2. Important classes are mentioned more frequently than unimportant ones.
3. Certain kinds of relationships give rise to attributes (e.g. possession and has-a relationships).
4. Verbs express relationships or operations.
5. If a complement of a verb is an attribute then that verb denotes an operation.
The identification algorithm:
Our initial identification process can be summarised as follows:
1. All object classes newly added to the LaSIE semantic net (i.e. all nouns in the text) are taken as candidate classes.
2. For every candidate class, find its frequency in the text (i.e. how many times it has been mentioned).
3. Find attributes based on the simple hypotheses given above, and attach them to their classes.
4. All event classes newly added to the LaSIE semantic net (i.e. all verbs) are considered candidate relationships. For example, if the verb issue is mentioned four times in the text, one event class will be added to the discourse model with four instances.
5. For every candidate relationship, find its complements (i.e. logical object, logical subject, or any other complement).
6. Discard any candidate class that does not participate in any relationship.
7. Discard any candidate relationship that has no arguments (this may occur as a result of partial parsing).
8. What is left is our initial class model.
The next section discusses how the system is used to analyse a simple case study.
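The pruning steps (6-8) of the identification process can be sketched directly. This is an illustrative Python sketch, assuming frequency counting and attribute attachment (steps 2-3) have already happened; the data shapes mirror the cand_class and cand_relation terms used in section 4.

```python
def initial_class_model(cand_classes, cand_relations):
    """Steps 6-8: drop candidate classes that take part in no
    relationship, drop relationships with no arguments, and return
    what is left as the initial class model."""
    # Step 7: a relation whose argument slots are all empty is
    # discarded (typically the result of a partial parse).
    relations = [r for r in cand_relations
                 if any(arg is not None for arg in r[2:])]
    # Step 6: a class survives only if it appears as an argument
    # of some surviving relation.
    used = {arg for r in relations for arg in r[2:] if arg is not None}
    classes = [c for c in cand_classes if c in used]
    return classes, relations

cand_classes = ["library", "loan_item", "customer", "orphan"]
cand_relations = [
    ("active", "issue", "library", "loan_item", "customer"),
    ("passive", "stamp", None, None, None),  # partial parse, no args
]
classes, relations = initial_class_model(cand_classes, cand_relations)
```

Here the argument-less stamp relation and the unconnected class orphan are both discarded, leaving the relation and classes that form the initial model.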
4 A Case study 4.1 Overview
In this section we give a simple case study of a library information system and show how our initial prototype system can be used to analyse it. This case study is not meant to be a full description of a library system, as our goal is to show how some useful information can be extracted from text to support OOA. The text of the case study, figure 12, is taken from Callan's book [Cal94].
4.2 Analysis
We have used our prototype system to analyse the text in figure 12. It is outside the scope of this report to discuss all the problems of analysing the text in order to produce useful results, so we decided to pick some candidate sentences to illustrate our approach and discuss how the identification steps mentioned in section 3.7.2 are used. These sentences were chosen because they contain most of the information we want to extract (e.g. classes, attributes, operations, etc.).
1. A library issues loan items to customers.
2. Each customer is known as a member.
A library issues loan items to customers. Each customer is known as a member and is issued a membership card that shows a unique member number. Along with the membership number, other details on a customer must be kept such as a name, address, and date of birth. The library is made up of a number of subject sections. Each section is denoted by a classification mark. A loan item is uniquely identified by a bar code. There are two types of loan items, language tapes, and books. A language tape has a title, language (e.g. French), and level (e.g. beginner). A book has a title, and author(s). A customer may borrow up to a maximum of 8 items. An item can be borrowed, reserved or renewed to extend a current loan. When an item is issued the customer's membership number is scanned via a bar code reader or entered manually. If the membership is still valid and the number of items on loan less than 8, the book bar code is read, either via the bar code reader or entered manually. If the item can be issued (e.g. not reserved) the item is stamped and then issued. The library must support the facility for an item to be searched and for a daily update of records.
Figure 12: The problem statement
3. Each customer is issued a membership card.
4. A membership card shows a unique number.
5. A customer's name must be kept.
6. A customer's address must be kept.
7. The library is made up of a number of subject sections.
8. Each section is denoted by a classification mark.
9. A loan item is uniquely identified by a bar code.
4.2.1 Discourse Model
A portion of the discourse model of the above sentences is given graphically in figure 13. The textual representation of this discourse model in XI is given in Appendix I (see section 3.6.3 for information on how this model is generated).
4.2.2 Finding Candidate Classes and Their Frequency
By applying step 1 of the identification process, all object classes are extracted from the discourse model. These classes form the right-hand side of the clauses object(X) ==> xx(X), where xx is the class name. The frequency
[Figure: an ontology fragment with the event classes know(X), show(X), denote(X), issue(X), make(X), identify(X), and keep(X) under event(X); the object classes library(X), code(X), address(X), bar(X), name(X), loan(X), classification(X), mark(X), number(X), customer(X), item(X), member(X), card(X), and membership(X) under object(X); and instance nodes (e2, e3, e4) linked by attribute arcs such as lsubj, lobj, qual, aspect, and time.]
Figure 13: A portion of the discourse model
of the classes is taken from the GATE database. This information is then kept in Prolog as candidate classes. The nine sentences above gave rise to 15 candidate classes. These candidate classes, as given below, take the following form: cand_class(Frequency, Name).
cand_class(5,customer). cand_class(2,number). cand_class(2,membership). cand_class(2,loan). cand_class(2,library). cand_class(2,item). cand_class(2,card). cand_class(2,section). cand_class(1,name). cand_class(1,member). cand_class(1,mark). cand_class(1,code). cand_class(1,classification). cand_class(1,bar). cand_class(1,address).
4.2.3 Finding Candidate Relationships and their Arguments
The process of finding candidate relationships is similar to that of finding candidate classes. Candidate relations are extracted from the event classes of the discourse model. Event classes appear in the discourse model in the form event(X) ==> xx(X), where xx is the event name. Next, for every candidate relation we find its arguments and keep them in the following form: cand_relation(Voice, Name, Argument1, Argument2, Argument3), where Argument1, Argument2, and Argument3 are candidate classes associated with the verb (indicating the relation) either as subject, direct object, or indirect object.
cand_relation(active,issue,library,item,customer).
cand_relation(passive,issue,_,card,customer).
cand_relation(passive,keep,name,_,_).
cand_relation(passive,keep,address,_,_).
cand_relation(active,show,card,number,_).
cand_relation(passive,make,_,library,section).
cand_relation(passive,know,_,customer,_).
cand_relation(passive,identify,code,item,_).
cand_relation(passive,denote,mark,section,_).
4.2.4 Identifying Classes, Attributes, and Operations
Having generated a list of candidate classes and a list of candidate relationships, we now show how these lists are further revised.
Noun-Noun Structures
In our list of candidate classes there are four candidates (loan, membership, classification, and bar) that have the property of being qualifiers of other candidates. This knowledge is expressed in the discourse model by a qual(X,Y) attribute attached to each of these modifiers, where X is a descriptor of the noun being qualified and Y is a descriptor of the qualifier. Based on the assumption that if a noun appears only as a qualifier in a noun-noun compound then it is unlikely to be a class, we combine these qualifiers with the qualified nouns and remove them from the list of candidate classes. This process gave rise to four revised classes: membership_card, loan_item, classification_mark, and bar_code. So the list of candidate classes becomes as follows:
cand_class(5,customer). cand_class(2,number). cand_class(2,membership_card). cand_class(2,loan_item). cand_class(2,library). cand_class(2,section). cand_class(1,name). cand_class(1,member). cand_class(1,classification_mark). cand_class(1,bar_code). cand_class(1,address).
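The qualifier-combination step that produced this revised list can be sketched as follows. This is an illustrative Python sketch of the rule, with the qual relation flattened into a dictionary; the function name and data layout are our own, not the implemented Prolog module.

```python
def combine_qualifiers(cand_classes, quals):
    """Merge each noun that appears only as a qualifier into the
    noun it qualifies (e.g. loan + item -> loan_item) and drop the
    bare qualifier from the candidate list. cand_classes is a list
    of (frequency, name) pairs; quals maps a qualified noun to its
    qualifier."""
    merged = {}
    for freq, name in cand_classes:
        if name in quals.values():
            continue  # a pure qualifier is unlikely to be a class
        new_name = f"{quals[name]}_{name}" if name in quals else name
        merged[new_name] = merged.get(new_name, 0) + freq
    return sorted(((f, n) for n, f in merged.items()), reverse=True)

cand = [(5, "customer"), (2, "loan"), (2, "item"), (2, "card"),
        (2, "membership"), (1, "bar"), (1, "code")]
quals = {"item": "loan", "card": "membership", "code": "bar"}
revised = combine_qualifiers(cand, quals)
```

Applied to the case-study candidates, this yields customer, membership_card, loan_item, and bar_code, matching the revised list above.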
This process also affects the list of candidate relationships, which is modified to show the new class names after the combination with the qualifiers, as follows: cand_relation(active,issue,library,loan_item,customer). cand_relation(passive,issue,_,membership_card,customer). cand_relation(passive,keep,name,_,_). cand_relation(passive,keep,address,_,_). cand_relation(active,show,membership_card,number,_). cand_relation(passive,make,_,library,section). cand_relation(passive,know,_,customer,_). cand_relation(passive,identify,bar_code,loan_item,_). cand_relation(passive,denote,classification_mark,section,_).
Finding Attributes and Operations Here we look for relations between pairs of candidate classes. For example, by taking the (customer, address) pair we found that the only relation between them is the genitive relation expressed as follows (customer(e24), address(e23), of(e23,e24) ). Also this is the only relationship that the candidate class address participates in. There is another fact kept for it which says that it is the logical object of the verb keep which is given in passive voice. From this general information we infer that address is an attribute 34
for customer and keep is an operation. So address is removed from the list of candidate classes and attached to the class customer as an attribute. The same applies to the candidate class name. Now the class customer has two attributes and is mentioned five times in the text, so it was included as a class. The same sort of analysis is done for all the other pairs of candidate classes. By the end of the analysis we end up with the following list of classes, of the form class(Frequency, Name, Attributes, Operations):
class(5,customer,[address,name],[keep]).
class(2,number,[],[]).
class(2,membership_card,[],[]).
class(2,loan_item,[],[]).
class(2,library,[],[]).
class(2,section,[],[]).
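The attribute-inference heuristic just described can be sketched as below. This is a simplified Python illustration under assumed data shapes (relation triples distilled from the discourse model facts), not the system's Prolog code: a candidate that occurs only in a genitive relation with another candidate becomes an attribute of that candidate, and a verb taking it as logical object becomes an operation.

```python
# (relation_type, head, dependent) triples distilled from the discourse
# model: "of" is the genitive relation, "verb_object" records the verb
# whose logical object the dependent noun is.
relations = [("of", "customer", "address"),
             ("of", "customer", "name"),
             ("verb_object", "keep", "address"),
             ("verb_object", "keep", "name")]

def infer_attributes(cand_classes, relations):
    """Demote genitive dependents to attributes of their head class and
    record the verbs governing them as operations on that class."""
    classes = {c: {"freq": f, "attrs": [], "ops": []}
               for c, f in cand_classes.items()}
    for rel, head, dep in relations:
        if rel == "of" and head in classes and dep in classes:
            # the dependent stops being a class in its own right ...
            classes[head]["attrs"].append(dep)
            del classes[dep]
            # ... and any verb taking it as object becomes an operation
            for r2, verb, obj in relations:
                if r2 == "verb_object" and obj == dep and \
                        verb not in classes[head]["ops"]:
                    classes[head]["ops"].append(verb)
    return classes

cands = {"customer": 5, "address": 1, "name": 1}
model = infer_attributes(cands, relations)
```

On the (customer, address) and (customer, name) pairs this reproduces the class(5,customer,[address,name],[keep]) entry above.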
Following this analysis the list of candidate relations is also revised, by removing the relations which contain the identified attributes. This reduces the list from nine to seven relations, as follows:
cand_relation(active,issue,library,loan_item,customer).
cand_relation(passive,issue,_,membership_card,customer).
cand_relation(active,show,membership_card,number,_).
cand_relation(passive,make,_,library,section).
cand_relation(passive,know,_,customer,_).
cand_relation(passive,identify,bar_code,loan_item,_).
cand_relation(passive,denote,classification_mark,section,_).
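This relation-list revision amounts to a simple filter, sketched here in Python (an illustration, not the report's Prolog): any candidate relation mentioning a noun now known to be an attribute rather than a class is dropped. The tuples mirror the cand_relation(Voice, Verb, Subj, Obj1, Obj2) facts, with None standing in for the anonymous variable.

```python
# nouns demoted from candidate classes to attributes in the previous step
attributes = {"name", "address"}

# cand_relation(Voice, Verb, Subj, Obj1, Obj2) facts as tuples
cand_relations = [
    ("active", "issue", "library", "loan_item", "customer"),
    ("passive", "issue", None, "membership_card", "customer"),
    ("passive", "keep", "name", None, None),
    ("passive", "keep", "address", None, None),
    ("active", "show", "membership_card", "number", None),
    ("passive", "make", None, "library", "section"),
    ("passive", "know", None, "customer", None),
    ("passive", "identify", "bar_code", "loan_item", None),
    ("passive", "denote", "classification_mark", "section", None),
]

# drop any relation whose participants include an identified attribute
revised = [r for r in cand_relations
           if not any(arg in attributes for arg in r[2:])]
```

The two keep relations mention name and address, so the nine relations reduce to the seven listed above.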
4.2.5 Initial Class Model
As mentioned in section 2.5, a class model shows the types of objects (classes) in the system and the static relationships between these classes. The classes can be drawn straightforwardly from the list of classes given above. We have done this by automatically transforming the identified classes, with their attributes and operations, from Prolog into a CDIF transfer file. This file, as given in appendix II, can be edited by an analyst using a CASE tool to modify the model.
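The export step amounts to a serialization loop over the class(Frequency, Name, Attributes, Operations) facts. The sketch below is schematic only: the nested parenthesized tags shown are placeholders in the spirit of CDIF, not the verbatim syntax of the transfer file in appendix II, which carries its own header and meta-model identifiers.

```python
# class(Frequency, Name, Attributes, Operations) facts as tuples
classes = [
    (5, "customer", ["address", "name"], ["keep"]),
    (2, "membership_card", [], []),
    (2, "loan_item", [], []),
]

def to_transfer_text(classes):
    """Emit a schematic, CDIF-like nesting of Class / Attribute /
    Operation records (placeholder tag names, not real CDIF)."""
    lines = []
    for _freq, name, attrs, ops in classes:
        lines.append(f'(Class (Name "{name}")')
        for a in attrs:
            lines.append(f'  (Attribute (Name "{a}"))')
        for o in ops:
            lines.append(f'  (Operation (Name "{o}"))')
        lines.append(')')
    return "\n".join(lines)

transfer = to_transfer_text(classes)
```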
To test this idea we used the Object Modeler in the Select Enterprise CASE tool and produced the example model in figure 14. In this model the classes were automatically generated by importing the CDIF file into the CASE tool. The relationships, however, were added by hand, based on the information given in the last section. Automatic identification of the relationships is more problematic: lexical semantic knowledge about the verbs indicating the relations has to be available to anyone who wants to draw a sensible class diagram, and this sort of knowledge is not yet available to our system. Nonetheless, at this early stage our system has proved useful by being able to automatically generate the building blocks of the class diagram in CDIF format.
[Figure: classes Library, Loan-Item (bar-code, classification-mark), Customer (name, address; keep), Section, and Membership-Card, with an issues association between Library and Loan-Item]
Figure 14: An Initial Class Model
5 Conclusions and Future Work
This report gave an overview of one of the hardest problems in the software development process: requirements analysis and specification. Background on software development and object-oriented technology was given in section 2. The review of related work in the area shows that the problem of providing automated support tools has been approached from two perspectives. The first approach used general Artificial Intelligence (AI) techniques, such as schema-based reasoning and search strategies, to provide intelligent tools. However, given that most of the data available to software engineers analysing software requirements is in natural (human) languages, and given the recent advances in the area of Natural Language Processing, researchers have started to examine another approach. This approach argues that a useful CASE tool for supporting the early stages of software development has to be based on a real NLP system. Some researchers following this second approach advocate the use of a controlled subset of a natural language to write informal specifications, and build tools that can analyse these specifications and produce useful results. Others argue that a useful CASE tool has to be able to analyse free, unrestricted text. We are following the second approach and building a prototype system based on the LaSIE system. The current implementation of our system has shown how an initial class model can be generated by some simple analysis. Our system, as discussed in section 3, is not yet complete and has to be improved in several ways: our grammar rules have to be extended to cover a wider range of sentence types; parsing techniques such as partial parsing and shallow knowledge will be investigated; the discourse interpretation module has to be improved to provide a more suitable model for the task of requirements analysis; and the class model generator has to be improved to provide more useful results.
The system should be able to identify different types of domain-independent relationships (e.g. Gen-Spec, Part-Whole, Association) and also the cardinalities of the relationships where appropriate. In this respect we are building a corpus of software requirements and trying to find NL constructs that denote these types of relationships. At this stage our system is able to produce a partial class model automatically in CDIF format. More has to be done in this area; specifically, we will try to port our system to work on the same platform as the Select CASE tool (MS Windows 95 or NT).
In conclusion, this report has shown the benefits of our approach, and we strongly feel that this approach will be of considerable use in the area of Object-Oriented Analysis.
Appendix I: A discourse model in XI
object(_) ==> library(_).
object(_) ==> item(_).
object(_) ==> loan(_).
object(_) ==> customer(_).
object(_) ==> member(_).
object(_) ==> card(_).
object(_) ==> membership(_).
object(_) ==> number(_).
object(_) ==> name(_).
object(_) ==> address(_).
object(_) ==> mark(_).
object(_) ==> classification(_).
object(_) ==> code(_).
object(_) ==> bar(_).
event(_) ==> issue(_).
event(_) ==> know(_).
event(_) ==> show(_).
event(_) ==> keep(_).
event(_) ==> make(_).
event(_) ==> denote(_).
event(_) ==> identify(_).
e1