Natural language processing based automated system for UML diagrams generation
Natural language processing
Imran Sarwar Bajwa, M. Abbas Choudhary
Computer and Emerging Sciences, Balochistan University of Information Technology and Management Sciences, Quetta, Pakistan

Keywords: Natural language processing, Knowledge Engineering, Automatic diagrams generation, Text Understanding, UML Diagrams, Information extraction, UML design

Abstract
This paper presents a natural language processing based automated system that generates UML diagrams after analyzing business details given as text. A new model is presented for analyzing natural language and extracting the relevant and required information from the storyline given by the user. The user writes the requirements in simple English in a few paragraphs, and the designed system analyzes the given script. After compound analysis and extraction of the associated information, the designed system draws various UML diagrams, such as activity diagrams, sequence diagrams and class diagrams. Conventional CASE tools require a lot of extra time and effort from the system analyst during the process of creating, arranging, labeling and finishing the UML diagrams. The designed system provides a quick and reliable way to generate UML diagrams, saving the time and budget of both the user and the system analyst.
Introduction
The look and style of software engineering have changed completely in recent times. These days every step of software engineering follows the rules of object-oriented design patterns. All phases of software engineering are moving away from the old conventions, and new paradigms are more popular. The same is the case with the software analysis process, which uses the Unified Modeling Language to map and model the user requirements. Analysis is the key process in building modern information system applications and the base for robust software design and development. There are various object-oriented modeling languages and tools. The Unified Modeling Language (UML) is one of the best-known languages for the object-oriented analysis and design of software applications. UML is a standard language used to identify, visualize, develop and document the components of software systems. Additionally, it is used for modeling and mapping business logic and other non-software systems. Large and complex systems can be modeled using UML, which makes it an important part of developing object-oriented software and of the software development process. Like other conventional methodologies, UML uses graphical notations to represent and depict the design and flow of software projects. At present, few software packages other than Rational Rose, Smart Draw, etc. provide services for drawing UML diagrams efficiently; these are reasonably good tools, but they have many disadvantages. By convention, the system analyst has to do a lot of work deducing the business logic and understanding the user requirements before drawing the UML diagrams with conventional CASE tools. Hence, a great deal of time is wasted because of the passive nature of the available CASE tools. In today's world everybody needs a quick and reliable service.
Intelligent software for generating UML-based documentation was therefore needed to save the time and budget of both the user and the system analyst.

Description of Problem
A few years ago, data flow diagrams (DFDs) were used to symbolize the flow of data and represent the user's requirements. Today, the Unified Modeling Language is used to model and map the user requirements; it is a more comprehensive and authentic way of representation, and it is beneficial for the later stages of software development. The problem addressed in this research relates primarily to the software analysis and design phase of the software development process. The software currently on the market that provides this facility consists of paint-like tools such as Visual UML, GD Pro, Smart Draw, Rational Rose, etc. All of them are passive tools with no built-in intelligence. Using the heavily overloaded interfaces of these CASE tools is a vexing problem. Generating UML diagrams through these software engineering tools is a difficult, lengthy and time-consuming process. Therefore, anyone involved in software development needs a way to get the required output with maximum accuracy in minimum time.
18th National Computer Conference 2006 © Saudi Computer Society
Proposed Solution
Object-oriented modeling with less time and effort is a significant requirement. To resolve these issues and provide a robust solution, a helpful framework is required that can assist both users and software engineers. The functionality of the conducted research is domain specific, but it can easily be enhanced in the future according to requirements. The designed system maps user requirements after reading them in plain text and draws a set of UML diagrams such as Class Diagram, Activity Diagram, Sequence Diagram, Use Case Diagram and Component Diagram. An Integrated Development Environment is also provided for user interaction and efficient input and output.
Object-Oriented Analysis and Design
Analysis and design of an information system involves understanding the problem and planning the framework that will accomplish the actual job. Typically, design is concerned with managing and controlling complexity in a domain. A robust design method also helps to split big tasks into manageable parts (Condamines, 2001). In software engineering, design methods provide various notations, usually graphical ones. These notations make it possible to store and communicate design decisions. Object-oriented design has superseded traditional analysis and design techniques such as structured design and data-driven design (Androutsopoulos, 1995). Compared to old-style design paradigms, object-oriented design models every active entity of the problem domain using the concept of objects. Objects have:
• State (shape and condition)
• Behaviour (what they can perform)

Object-oriented languages use variables to manifest the state of an object and methods or procedures to implement the behaviour of an object. For example, a ball could be an object. It has different state parameters such as colour, size, diameter, shape, type, etc. This object can also have behaviour such as throw, roll, catch, hit, etc. The major task in the analysis and design phase is to identify the valid objects and specify their states and behaviours. In conventional methods, the system analyst performs this tough job and then maps the information into UML using a graphical tool such as Visio or Rational Rose.

In the context of this research, objects are identified automatically from a problem domain. The user provides input text in English related to the business domain. After lexical analysis of the text, syntax analysis is performed at word level to recognize the word category (Androutsopoulos, 1995). First, all the available lexicons are categorized into nouns, pronouns, prepositions, adverbs, articles, conjunctions, etc. The syntactic analysis has to be able to isolate subjects, verbs, objects, adverbs, adjectives and various other complements. It is a somewhat complex, multipart procedure. Consider the sentence "Zia is playing with the red ball."
For this example, the output is as follows.

Lexicons    Phase-I        Phase-II
Zia         Noun           Object
is          Helping-Verb   ---
playing     Verb           Method
with        Preposition    ---
the         Article        ---
red         Adjective      Attribute
ball        Noun           Object
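The two phases in the table above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system's actual implementation: the tiny `LEXICON` dictionary and the function names `phase_one` and `phase_two` are hypothetical, and "red" is tagged as an adjective so that it maps to an attribute, as the Phase-II column requires.

```python
# Hypothetical mini-lexicon: word -> word category (Phase-I).
# A real system would need a full dictionary or a trained tagger.
LEXICON = {
    "zia": "Noun",
    "is": "Helping-Verb",
    "playing": "Verb",
    "with": "Preposition",
    "the": "Article",
    "red": "Adjective",
    "ball": "Noun",
}

# Phase-II mapping from word category to object-oriented role.
ROLE_OF = {"Noun": "Object", "Verb": "Method", "Adjective": "Attribute"}


def phase_one(sentence):
    """Tag each word with its category; unknown words get 'Unknown'."""
    words = sentence.strip().rstrip(".").split()
    return [(w, LEXICON.get(w.lower(), "Unknown")) for w in words]


def phase_two(tagged):
    """Map categories to Object/Method/Attribute; other words map to None."""
    return [(w, ROLE_OF.get(tag)) for w, tag in tagged]


tagged = phase_one("Zia is playing with the red ball.")
roles = phase_two(tagged)
for (word, tag), (_, role) in zip(tagged, roles):
    print(f"{word:8} {tag:13} {role or '---'}")
```

Keeping the two phases as separate functions mirrors the table: the first pass only needs the lexicon, while the second pass only needs the category labels.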
This is the final output of the lexical assessment phase: all nouns are marked as objects, verbs are marked as methods, and adjectives are marked as states of the particular object. In the above example, 'Zia' and 'ball' are the objects, 'playing' is the method of the object Zia, and 'red' is an attribute of the object ball.

Natural Language Processing
The understanding and multi-aspect processing of natural languages, also termed "speech languages", is one of the topics of greatest interest in the field of artificial intelligence (Strzalowski, 1995). Natural languages are irregular and asymmetrical. Traditionally, natural languages are based on informal grammars. Geographical, psychological and sociological factors influence the behaviour of natural languages (Losee, 1996). Their sets of words are not fixed; they change and vary from area to area and from time to time. Because of these variations and inconsistencies, natural languages have different flavours; English, for example, has more than half a dozen renowned flavours all over the world. These flavours have different accents, vocabularies and phonological aspects. These discrepancies and inconsistencies make natural languages difficult to process compared to formal languages (Krovetz, 1992).
In the process of analyzing and understanding natural languages, researchers usually face various problems. The problems connected to the greater complexity of natural language include verb conjugation, inflexion, lexical amplitude, ambiguity, etc. Of these, the problem that causes the most difficulty is ambiguity. Ambiguity can be addressed at the syntactic and semantic levels by using a sound and robust rule-based system.

Used Methodology
Conventional natural language processing systems use rule-based approaches. Agents are another way to develop speech language based systems (Krovetz, 1992). In this research, a rule-based algorithm has been designed and used that can read, understand and extract the desired information. First, the basic elements of the language grammar, such as verbs, nouns and adjectives, are extracted (Drouin, 2004); further processing is then performed on the basis of this extracted information. In linguistic terms, verbs often specify actions, and noun phrases specify the objects that participate in the action (Zelle, 1993). Each noun phrase's thematic role specifies how the object participates in the action. In the following example, Ali is the agent: "Ali is writing a letter with a pen." A procedure that understands such a sentence must discover that Ali is the agent because he performs the action of writing, that the letter is the thematic object because it is the object that is written, and that the pen is an instrument because it is the tool with which the writing is done (Gómez-Pérez, 2004). Thus, complete sentence analysis finds information about the agent, co-agent, thematic object, beneficiary, etc. Identifying this information helps to understand the meaning of the input sentence, as described below.

Agent: The agent causes the action to occur, as in "Ahmed hit the ball," where Ahmed is the agent who performs the task.
In a passive sentence, the agent may instead appear after "by", as in "The ball was hit by Ahmed."
Co-agent: If the agent works with a partner, that partner is called the co-agent. Both of them carry out the action together, as in "Ahmed played tennis with Ali."
Beneficiary: The beneficiary is the person for whom an action has been performed: "Ahmed brought the balls for Ali." In this sentence Ali is the beneficiary.
Thematic object: The thematic object is the object the sentence is really all about, typically the object undergoing a change. Often the thematic object is the same as the syntactic direct object, as in "Ahmed hit the ball." Here the ball is the thematic object.
Conveyance: The conveyance is something in which or on which the agent travels: "Ahmed goes by train."
Trajectory: Motion from source to destination takes place over a trajectory. In contrast to the other role possibilities, several prepositions can serve to introduce trajectory noun phrases: "Ahmed and Ali went to London from Islamabad."
Location: The location is where an action occurs. Several prepositions can introduce the noun phrase that gives the location, as in "Ali studied in the library, at a desk, by the wall, under a picture, near the door."
Time: Time specifies when an action occurs. Prepositions such as at, before and after introduce noun phrases that depict time, as in "Ahmed and Ali left before evening."
Duration: Duration specifies how long an action takes. Prepositions such as since and for indicate duration: "Ahmed and Ali walked for an hour."
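The role list above shows that one preposition can introduce several different roles ("by" marks an agent, a conveyance or a location), which is exactly the ambiguity problem mentioned earlier. The following Python sketch makes that explicit by returning every candidate role for each preposition it finds. It is an illustration only: the `CANDIDATE_ROLES` table is built from the example sentences in this section, the function name `candidate_roles` is hypothetical, and choosing among the candidates would need further rules that are not shown here.

```python
# Candidate thematic roles per preposition, taken from the examples above.
# The table is illustrative, not exhaustive.
CANDIDATE_ROLES = {
    "by": ["agent", "conveyance", "location"],
    "with": ["co-agent", "instrument"],
    "for": ["beneficiary", "duration"],
    "to": ["trajectory"],
    "from": ["trajectory"],
    "in": ["location"],
    "at": ["location", "time"],
    "near": ["location"],
    "before": ["time"],
    "after": ["time"],
    "since": ["duration"],
}


def candidate_roles(sentence):
    """Return (preposition, following phrase, candidate roles) triples."""
    words = sentence.replace(",", " ").rstrip(". ").split()
    found = []
    for i, word in enumerate(words):
        roles = CANDIDATE_ROLES.get(word.lower())
        if roles:
            # Collect the words up to the next known preposition.
            phrase = []
            for nxt in words[i + 1:]:
                if nxt.lower() in CANDIDATE_ROLES:
                    break
                phrase.append(nxt)
            found.append((word.lower(), " ".join(phrase), roles))
    return found


for triple in candidate_roles("Ali is writing a letter with a pen."):
    print(triple)
```

Returning all candidates rather than a single guess keeps the sketch honest: "with a pen" and "with Ali" look identical to a preposition lookup, and only semantic knowledge about pens and people separates the instrument from the co-agent.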
Architecture of Designed System
The designed UMLG system has the ability to draw UML diagrams after reading the text scenario provided by the user. The system draws diagrams in five modules: text input acquisition, syntactic analysis, text understanding, knowledge extraction, and finally generation of UML diagrams, as shown in Figure 1.

[Figure 1: layered flow from input to output: text input acquisition from user, lexical analysis, token extraction from given text, syntax analysis, extracting nouns, verbs, adjectives, etc., semantic analysis, understanding meanings, knowledge extraction, objects, methods and attributes identification, diagram generation, class, activity, etc. diagrams]
Figure 1. Architecture of the Natural Language Processing based Automated System for UML Diagrams Generation
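The last stage of Figure 1 turns the identified objects, methods and attributes into diagrams. The paper does not describe its drawing code, so the sketch below swaps in a different technique as a stand-in: it emits PlantUML class-diagram text, a textual notation that a separate tool can render. The `ExtractedClass` container and the `to_plantuml` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractedClass:
    """Hypothetical container for one object found by knowledge extraction."""
    name: str
    attributes: list = field(default_factory=list)
    methods: list = field(default_factory=list)


def to_plantuml(classes):
    """Render extracted classes as PlantUML class-diagram source text."""
    lines = ["@startuml"]
    for cls in classes:
        lines.append(f"class {cls.name} {{")
        lines.extend(f"  {attr}" for attr in cls.attributes)
        lines.extend(f"  {method}()" for method in cls.methods)
        lines.append("}")
    lines.append("@enduml")
    return "\n".join(lines)


# The ball example: state (colour, size) and behaviour (roll, throw).
ball = ExtractedClass("Ball", attributes=["colour", "size"], methods=["roll", "throw"])
print(to_plantuml([ball]))
```

Separating the extracted data from the rendering step means the same `ExtractedClass` records could feed other diagram types, which matches the separate per-diagram functions the architecture calls for.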
i. Text input acquisition
This module acquires the input text scenario. The user provides the business scenario in the form of paragraphs of text. The module reads the input text as characters and generates words or lexicons (Tang, 2001) by concatenating the input characters. This module is the implementation of the lexical phase. Language-specific lexicons, tokens or symbols are generated in this module.

ii. Syntactic Analysis
This is the second module of the designed framework, and it reads the input from module one in the form of words. These words are categorized into various classes such as verbs, helping verbs, nouns, pronouns, adjectives, prepositions, conjunctions, etc. (Fagan, 1989) on the basis of the defined rules for categorization. A set of rules is defined here on the basis of the standard English grammatical rules, also called parts of speech conventions.

iii. Text Understanding
This module reads the input from module 1 in the form of words. The meanings of the given text are inferred in this module using semantic rules (Malaisé, 2005). These words are categorized into various classes such as verbs, helping verbs, nouns, pronouns, adjectives, prepositions, conjunctions, etc.
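The text input acquisition module is described as reading characters and concatenating them into words. A minimal Python sketch of that idea follows; the function name `tokenize` and the decision to keep punctuation marks as separate tokens are assumptions made for illustration, not details taken from the designed system.

```python
def tokenize(text):
    """Build tokens by concatenating characters, one character at a time."""
    tokens = []
    current = ""
    for ch in text:
        if ch.isalnum() or ch in "'-":
            current += ch              # extend the word being built
        else:
            if current:
                tokens.append(current)  # a separator ends the current word
                current = ""
            if not ch.isspace():
                tokens.append(ch)       # keep punctuation as its own token
    if current:
        tokens.append(current)          # flush the last word
    return tokens


print(tokenize("Ali is writing a letter with a pen."))
```

The resulting word list is the input that the syntactic analysis module would then categorize into parts of speech.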
iv. Knowledge extraction
Required data attributes are extracted in this module (Rijsbergen, 1977) according to the given guidelines. This module extracts the different objects and classes and their respective attributes on the basis of the input provided by the preceding module. Nouns are symbolized as classes and objects, and their associated adjectives are termed attributes.

v. UML diagram generation
This is the last module, which uses UML symbols and draws various UML diagrams by combining the available symbols according to the information extracted by the previous module. Because a separate scenario is provided for each kind of diagram (class, sequence and activity diagrams), separate functions are implemented for the respective diagrams.

Accuracy Evaluation
To test the accuracy of the diagrams generated by the designed system, four parameters were chosen: objects, attributes, sequence and labeling. Each generated diagram from each category was checked. The maximum score for each parameter was 25. Points were deducted for wrong nominations and extractions. A matrix of the results for the generated diagrams is shown below.

Table 1. Testing results of different UML Diagrams

Diagram type   Objects   Attributes   Sequence   Labeling   Total
Class          22        24           20         19         85%
Activity       23        21           16         20         80%
Sequence       21        24           21         22         88%

The matrix above gives the accuracy test results (%) for class, activity and sequence diagrams. The overall accuracy for all types of UML diagrams is determined by adding the total accuracy of all categories and calculating the average, which is about 84% in this case.
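The scoring scheme can be checked with a short calculation: four parameters at 25 points each give a maximum of 100, so the sum of a row is already a percentage, and the overall figure is the mean of the three row totals. The Python sketch below recomputes both from the Table 1 scores; the names `SCORES` and `total_percent` are hypothetical.

```python
# Per-parameter scores from Table 1 (maximum 25 points each).
SCORES = {
    "Class":    {"objects": 22, "attributes": 24, "sequence": 20, "labeling": 19},
    "Activity": {"objects": 23, "attributes": 21, "sequence": 16, "labeling": 20},
    "Sequence": {"objects": 21, "attributes": 24, "sequence": 21, "labeling": 22},
}
MAX_PER_PARAMETER = 25


def total_percent(scores):
    """Total score of one diagram type as a percentage of the maximum."""
    max_total = MAX_PER_PARAMETER * len(scores)
    return 100 * sum(scores.values()) / max_total


totals = {name: total_percent(s) for name, s in SCORES.items()}
overall = sum(totals.values()) / len(totals)

for name, pct in totals.items():
    print(f"{name:9} {pct:.0f}%")
print(f"Overall   {overall:.1f}%")
```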
[Figure 2: chart with a vertical axis from 0 to 30, showing series for Class, Activity and Sequence diagrams over the parameters Objects, Attributes, Sequence and Labeling]
Figure 2. A graphical representation of the accuracy of generated diagrams

The graph above shows the accuracy ratio of the various diagram types in terms of the objects, attributes, sequence and labeling parameters.
Conclusion
This research concerns the dynamic generation of UML diagrams by reading and analyzing a scenario provided by the user in English. The designed system finds the classes and objects and their attributes and operations using an artificial intelligence technique, natural language processing. UML diagrams such as activity, sequence, component and use case diagrams would then be drawn. The accuracy of the software is expected to be up to about 80% with the involvement of the software engineer, provided that the prerequisites for preparing the input scenario have been followed. The given scenario should be complete and written in simple and correct English. Within the scope of the project, the software performs a complete analysis of the scenario to find the classes, their attributes and operations, and draws the corresponding diagrams. A graphical user interface has also been provided for entering the input scenario in a proper way and generating UML diagrams.

Future Work
The designed system was started with the aim of producing software that can read user requirements given as English text and draw selected types of UML diagrams: class, activity, sequence, use case, component and deployment diagrams. The last three of these, the use case, component and deployment diagrams, are still untouched. There is also some margin for improvement in the algorithms that generate the first three types: class, activity and sequence diagrams. The current accuracy of generating diagrams is about
80% to 85%. It can be enhanced up to 95% by improving the algorithms and adding the ability to learn.

References
Androutsopoulos, I., Ritchie, G. D. and Thanisch, P. (1995). "Natural Language Interfaces to Databases – An Introduction." Natural Language Engineering, vol. 1, part 1, pp. 29–81.
Condamines, A. and Rebeyrolle, J. (2001). "Searching for and identifying conceptual relationships via a corpus based approach to a Terminological Knowledge Base (CTKB): Method and Results." Recent Advances in Computational Terminology, pp. 127–148.
Drouin, P. (2004). "Detection of Domain Specific Terminology Using Corpora Comparison." Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal.
Grosz, B. J., Appelt, D., Martin, P. and Pereira, F. (1987). "TEAM: An Experiment in the Design of Transportable Natural Language Interfaces." Artificial Intelligence, 32, pp. 173–243.
Fagan, J. L. (1989). "The effectiveness of a non-syntactic approach to automatic phrase indexing for document retrieval." Journal of the American Society for Information Science, 40(2), pp. 115–132.
Gómez-Pérez, A., Fernández-López, M. and Corcho, O. (2004). "Ontological Engineering: with examples from the areas of Knowledge Management, e-Commerce and the Semantic Web." Springer.
Khoo, C., Chan, S. and Niu, Y. (2002). "The Many Facets of the Cause-Effect Relation." The Semantics of Relationships, Kluwer Academic Press, pp. 51–70.
Krovetz, R. and Croft, W. B. (1992). "Lexical ambiguity and information retrieval." ACM Transactions on Information Systems, 10, pp. 115–141.
Losee, R. M. (1996). "Learning syntactic rules and tags with genetic algorithms for information retrieval and filtering: An empirical basis for grammatical rules." Information Processing and Management, 32(2), pp. 185–197.
Malaisé, V., Zweigenbaum, P. and Bachimont, B. (2005). "Mining Defining Contexts to Help Structuring Differential Ontologies." Terminology, 11:1.
Rijsbergen, C. J. van (1977). "A theoretical basis for use of co-occurrence data in information retrieval." Journal of Documentation, 33(2), pp. 106–119.
Strzalowski, T. (1995). "Natural language information retrieval." Journal of Information Processing and Management, 31(3), pp. 397–417.
Tang, L. R. and Mooney, R. J. (2001). "Using Multiple Clause Constructors in Inductive Logic Programming for Semantic Parsing." Proceedings of the 12th European Conference on Machine Learning (ECML-2001), Freiburg, Germany, pp. 466–477.
Weiss, S., Apte, C., Johnson, D., Oles, F., Goetz, T. and Hampp, T. (1999). "Maximizing text-mining performance." IEEE Intelligent Systems, 14, pp. 63–69.
Zelle, J. M. and Mooney, R. J. (1993). "Learning semantic grammars with constructive inductive logic programming." Proceedings of the 11th National Conference on Artificial Intelligence (AAAI Press/MIT Press, Washington, D.C.), pp. 817–822.