Code Generation For Distributed Systems

Markus Aleksy, Ralf Gitzel
Chair for Management Information Systems III, University of Mannheim, Germany
{aleksy|gitzel}@wifo3.uni-mannheim.de

Because distributed code is complex while the corresponding designs are easy to describe graphically, there is strong interest in code generators that create applications from abstract system descriptions. The many commercial products are an indicator of this. This paper explores the theoretical foundations of code generation for distributed systems with regard to data structures and template language syntax. Several existing approaches are analysed and a new hybrid-form data structure is proposed. The goal is an adaptable, middleware-independent way to produce software with minimal hand-written code.
Keywords: Distributed Design, Code Generation, Templates, Data Structures, Abstraction
1. INITIAL CONSIDERATIONS

Due to the vast complexity of distributed code, there is a genuine interest in ways to simplify the process of obtaining it. This, combined with the high redundancy in such code, makes its automatic generation a desirable goal. Indeed, there is great interest in tools that allow the visual design and automatic production of distributed applications. The ViDiO project [1] aims to provide a scientifically sound version of this concept, and one of its focal points is template-based code generation. A good definition of this term is provided by iMatix [2]: the basic idea is to automatically generate the repetitive parts of source code or of any kind of plain text. More generally, it can be defined as "any mechanical production of text files". The code generation process can be described as a merging of instance data (discussed in section 2) and templates (section 3). Instance data is the information relevant to the current problem, and templates are sets of rules that generate text based on the current instance data. Several general considerations have to be taken into account when designing a template language (see section 4). The process is shown in Fig. 1. Ideally, such generators work with a graphical front end that allows easy generation of the instance data; however, a simple approach based on code files also falls under this definition.
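The merging of instance data and templates described above can be illustrated with a minimal sketch. It uses Python's standard `string.Template`; the template text, the field names `name` and `host`, and the function `generate` are assumptions made for illustration and are not taken from any of the tools discussed in this paper.

```python
from string import Template

# Hypothetical template: the $-placeholders mark where instance data is merged in.
SERVER_TEMPLATE = Template(
    "// generated file, do not edit\n"
    "class ${name}Server {\n"
    "    static final String HOST = \"${host}\";\n"
    "}\n"
)


def generate(template: Template, instance_data: dict) -> str:
    """Merge one record of instance data into a template."""
    # substitute() raises KeyError if the template refers to missing data.
    return template.substitute(instance_data)


# Hypothetical instance data for a single node.
generated = generate(SERVER_TEMPLATE, {"name": "Main", "host": "155.155.57.21"})
print(generated)
```

The same template can be reused with different instance data, which is where the savings on redundant code come from.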
2. POSSIBLE APPROACHES TO INSTANCE DATA STRUCTURES

Before discussing the different ways of representing the instance data, it is essential to define what this data consists of. There are many different ways to do this - Unified Modelling Language (UML) [3] deployment diagrams would be one example - however, the form described here allows suitable modelling of all common distributed systems (such as database servers, web portals, peer-to-peer systems, etc.).

2.1. Elements Of Instance Data

Basically, a distributed system consists of nodes, interfaces, user interfaces, objects, and their relationships. A node is defined as any application within the system which may provide interfaces to other nodes or make use of the interfaces provided by others.
Fig. 1: The Code Generation Process (diagram: the Abstract Deployment Description and a Middleware Specific Template are fed to the Code Generator, which outputs the Generated Code)
An interface is an access point to functionality which is made available to those nodes using the interface. It is accessed via the methods defined by the interface. In many middleware technologies these are objects registered with the naming service. A similar concept is the user interface, which is defined as the sum of all functionality made available to the user running the application deploying it. This is a subset of the methods provided by all interfaces the node uses. Abstractly defining what the user can access allows the generation of vastly different code (e.g. web interfaces, GUIs, text-based access, etc.).

Objects, as defined in this paper, are collections of data with functionality to manipulate it (in the form of methods). Most code generation templates will generate a class for each object, which can be used for return values or arguments. The important difference between objects and interfaces with regard to middleware code is that the latter will generally be a single instance (of the same class) for each node they are deployed on, while objects will be generated as needed to store data or pass results from database queries. For example, when using Enterprise Java Beans (EJB) [4], the interface would be the home interface and the object the actual entity bean. In the Common Object Request Broker Architecture (CORBA) [5] one could use valuetypes for objects and interfaces for the singleton.

There are several different relationships between the elements of the system.
- Usage: When a node n or interface i1 uses an interface i2, it can access the functionality of i2.
- Deployment: The node n deploying interface i makes i's functionality available to the system. Thus, n is the server for i.
- Production: When an interface i produces an object o, the methods of i may return instances of o or take o as a parameter.
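The elements and the three relationship types can be written down as a small data model. The following is only a sketch: the class and field names are chosen for illustration, the user interface element is omitted for brevity, and the example names `Catalog`, `Item` and `findItem` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Interface:
    name: str
    methods: List[str] = field(default_factory=list)
    produces: List[str] = field(default_factory=list)  # Production: objects returned or accepted
    uses: List[str] = field(default_factory=list)      # Usage: other interfaces this one accesses


@dataclass
class Node:
    name: str
    deploys: List[str] = field(default_factory=list)   # Deployment: the node is server for these
    uses: List[str] = field(default_factory=list)      # Usage: the node is client of these


@dataclass
class System:
    nodes: Dict[str, Node] = field(default_factory=dict)
    interfaces: Dict[str, Interface] = field(default_factory=dict)
    objects: List[str] = field(default_factory=list)

    def servers_of(self, interface_name: str) -> List[str]:
        """Names of all nodes that deploy the given interface."""
        return [n.name for n in self.nodes.values() if interface_name in n.deploys]

    def clients_of(self, interface_name: str) -> List[str]:
        """Names of all nodes that use the given interface."""
        return [n.name for n in self.nodes.values() if interface_name in n.uses]


system = System(
    nodes={
        "MainServer": Node("MainServer", deploys=["Catalog"]),
        "TestClient": Node("TestClient", uses=["Catalog"]),
    },
    interfaces={"Catalog": Interface("Catalog", methods=["findItem"], produces=["Item"])},
    objects=["Item"],
)
print(system.servers_of("Catalog"), system.clients_of("Catalog"))
```

Because relationships are stored by name, no cardinalities are needed: an interface is deployed by whichever nodes list it.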
The major difference from the most commonly used notation (the UML deployment diagram) is that this form is easier to grasp visually since there are no cardinalities (due to the distinction between objects and interfaces). The data required for each instance of these elements is quite extensive and is examined only cursorily here. In fact it can be kept flexible, i.e. the meta-data itself is part of the instance data. In other words, the code generator is passed not only the descriptions of each element but also the type of data these elements consist of. An example of this is shown in Fig. 2. Typical data for nodes is the name and a connection strategy. The strategy determines when connections are established, how they are to be maintained and which instance of an interface to connect to, if several are available. An example value for this might be INITIAL_LIST, which means that connections are established at start-up and the user picks from a list of available interfaces. Interfaces will probably consist of a name, their methods and attributes, information on name service registration, and a persistence flag. Objects will be similar, with the possible inclusion of instructions on whether to pass by value or by reference. User interfaces, as the last category, are described by their method/attribute filter (i.e. which methods and attributes available from the connected node are to be shown) and a flag for application control. The latter governs trivial features such as user settings, accessing the INITIAL_LIST, etc.
Description File for Node:
  Metadata:       Name:String   IPAddress:long   Reliability:double   ....
  Instance Data:
    Node 1:       MainServer    155.155.57.21    0.94                 ...
    Node 2:       TestClient    155.155.57.22    0.99                 ...

Fig. 2: Meta-Data vs. Instance Data
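How meta-data can travel with the instance data is sketched below, using the three fields visible in Fig. 2. The parsing functions and the converter table are assumptions; in particular, storing the dotted address as a `long` by packing it into an integer is only one possible reading of `IPAddress:long`.

```python
import ipaddress

# Converters for the type names used in the metadata line (an assumed set of types).
CONVERTERS = {
    "String": str,
    # Assumption: a dotted address is packed into a single integer.
    "long": lambda text: int(ipaddress.IPv4Address(text)) if "." in text else int(text),
    "double": float,
}


def parse_metadata(line):
    """Turn 'Name:String IPAddress:long' into [('Name', 'String'), ...]."""
    return [tuple(entry.split(":", 1)) for entry in line.split()]


def parse_instance(metadata, line):
    """Convert one whitespace-separated row according to the metadata."""
    values = line.split()
    if len(values) != len(metadata):
        raise ValueError("row does not match metadata: " + line)
    return {
        name: CONVERTERS[type_name](value)
        for (name, type_name), value in zip(metadata, values)
    }


metadata = parse_metadata("Name:String IPAddress:long Reliability:double")
nodes = [
    parse_instance(metadata, row)
    for row in ("MainServer 155.155.57.21 0.94", "TestClient 155.155.57.22 0.99")
]
print(nodes)
```

Adding a field then only requires a new entry in the metadata line; the parser itself stays unchanged.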
2.2. Streamed Instance Data

Streamed instance data is the most basic way to generate code but also one that provides little flexibility. Basically, the templates are passed one piece of information after another in an order agreed upon before template creation. Thus, one starts out with all deployment info, then passes registration tuples, etc. Unfortunately, this severely restricts the language since information has to be passed in the same way for all templates written. Furthermore, this might not even be possible if the middleware technologies employed vary greatly.

2.3. Sets

The description of a system can be seen as a collection of sets - the set of all nodes, all deployed interfaces, all objects, and all relationships of each type. By numbering the elements it is possible to give them a unique identifier. Whenever an element references another (in the form of a relationship), the referenced element can be looked up by set name and number. This form of data storage might be appealing but is plagued by two problems. First, the internal implementation of the sets is still open, which might require some careful thought. Second, the constant search accesses required by set collections will make this much slower than trees or graphs (where only leaves/neighbours have to be regarded).

2.4. Tree-Based Instance Data

One promising way to generate code is to store the data in trees. XSLT [6] (which has been used for 'real' code generation in at least two instances [7],[8]) and GSLgen [9] are examples which are implemented in this way. The basic idea is that a tree is traversed according to rules provided in the template, and the data stored in the various tree nodes is then used to influence the code generation for each of them (see Fig. 3). Since there are already libraries and formats employing this method, it can easily be used in new tools. Trees are quite suitable for distributed systems as the deployment graph (i.e. the graph of connections between nodes, interfaces, objects, and clients) can be presented in tree form as seen on the left side of Fig. 3. However, there is the problem that most of the interfaces appear under both the client and the server. Thus, there is information replication in the instance set.
Fig. 3: Tree-based code generation (diagram: a Data Structure Tree rooted at System, with Server1, Server2 and Client branches containing Iface1, Iface2 and Obj1, is combined with Templates to produce the Generated Code)
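A tree traversal of this kind, and the replication it causes, can be sketched in a few lines. The tree layout below is an assumed reading of Fig. 3 (which child belongs to which branch is not certain), and the output rules are placeholders rather than real middleware code.

```python
# Each tree node is (kind, name, children); the layout is an assumed reading of Fig. 3.
TREE = ("system", "System", [
    ("server", "Server1", [("interface", "Iface1", [("object", "Obj1", [])])]),
    ("server", "Server2", [("interface", "Iface2", [])]),
    ("client", "Client", [("interface", "Iface1", []), ("interface", "Iface2", [])]),
])

# One output rule per node kind (placeholder output text).
RULES = {
    "system": "// system {name}",
    "server": "// server {name}",
    "client": "// client {name}",
    "interface": "// interface {name}",
    "object": "// object {name}",
}


def generate(node, depth=0):
    """Depth-first traversal: emit the rule for a node, then for its children."""
    kind, name, children = node
    lines = ["  " * depth + RULES[kind].format(name=name)]
    for child in children:
        lines.extend(generate(child, depth + 1))
    return lines


lines = generate(TREE)
print("\n".join(lines))

# Iface1 is visited twice: once under its server and once under the client.
iface1_visits = sum(1 for line in lines if line.strip().endswith("Iface1"))
```

The double visit of `Iface1` is the information replication mentioned above.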
This problem (as well as the equivalent problem in the interface-object relationship) can be addressed in three ways, all of which are shown in Fig. 4. The first technique (Solution A) is to introduce a reference id for the client interface which points to one of the leaves in the server branch. This method only works when the client requires only the reference (which usually is the lookup value stored by the Naming Service) and no other information. The second way (Solution B) is to have a reference for both the client and the server and to store the interfaces separately, presumably in a hash map or other sorted structure. This way, interfaces can be accessed via the reference without the parser having to perform a complex search algorithm on the server branch of the tree. Solution C is probably the best way in this context. It divides the interface information into the part required by the client and the part used by the server. This will lead to minimal replication, but that should not be a problem since these structures will most likely be set up shortly before the code generation and not be used by the editing tool. Another problem is the fact that there is no central point of interface information, so generating a code file for the interfaces can be difficult (as the information has to be gathered from several tree nodes). However, since most of the client-side information is not needed for the code generation, this problem can be solved by careful design of the template language.

2.5. Graph-Based Instance Data

While trees offer the advantage of existing software, graphs are a more natural way to represent a distributed system. In all likelihood, this will be the way the system is designed in the graphical front end anyway. To facilitate code generation, the graph will most likely have an entry point which would ideally be connected to all servers.
Due to the nature of the problem, the graph can now be traversed from there on and all elements will be reached (since there are no isolated parts). A template language based on such a data structure would have to be event-based (see section 3.3). Each time a graph node is analysed, the relevant code is generated (possibly using references and variables of other nodes) and then the graph is traversed further. Since a traversal algorithm marks off elements already visited, there will be no duplication. The exact nature of the algorithm is not important and can be chosen for maximum performance. For example, consider a graph (as seen in Fig. 5) with one server, two interfaces, and a client using both of them. First the code for the server would be generated with two deployment operations (based on the two links in the graph), then code would be created for the first interface, and finally for the client. The graph traversal would then move on to the second interface and terminate. This approach seems to be the most suitable, even though tree-based implementations are far more common in current products.

Fig. 4: Solutions to the node duplication problem (three panels, each a System tree with Server and Client branches. Solution A: the client branch holds a reference to an object in a server branch. Solution B: the objects are stored separately and both server and client branches hold references to them. Solution C: the server branches hold the server code and the client branch holds the client code for the objects)

Fig. 5: Graph Traversal Example (diagram: Server deploys Interface 1 and Interface 2; Client uses Interface 1 and Interface 2)
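The traversal of the example graph in Fig. 5 can be sketched as follows. The adjacency lists, the order of neighbours, and the fragment functions are assumptions made for illustration; a depth-first strategy is used here only because it reproduces the order described above, and any traversal that marks visited elements would do.

```python
# Undirected adjacency lists for the graph of Fig. 5 (neighbour order is an assumption).
GRAPH = {
    "Server": ["Interface 1", "Interface 2"],
    "Interface 1": ["Server", "Client"],
    "Interface 2": ["Server", "Client"],
    "Client": ["Interface 1", "Interface 2"],
}
KINDS = {
    "Server": "server",
    "Client": "client",
    "Interface 1": "interface",
    "Interface 2": "interface",
}


def on_server(name, graph):
    # One deployment operation per link from the server to an interface.
    return "server " + name + ": deploy " + ", ".join(graph[name])


def on_interface(name, graph):
    return "interface " + name


def on_client(name, graph):
    return "client " + name + ": use " + ", ".join(graph[name])


# Template fragments keyed by node kind; reaching a node is the "event".
FRAGMENTS = {"server": on_server, "interface": on_interface, "client": on_client}


def traverse(graph, entry):
    """Depth-first traversal that generates output once per graph node."""
    visited = set()
    output = []

    def visit(name):
        visited.add(name)  # mark off the element so it is never generated twice
        output.append((name, FRAGMENTS[KINDS[name]](name, graph)))
        for neighbour in graph[name]:
            if neighbour not in visited:
                visit(neighbour)

    visit(entry)
    return output


result = traverse(GRAPH, "Server")
for name, text in result:
    print(text)
```

In contrast to the tree form, each interface is generated exactly once even though both the server and the client are linked to it.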
3. APPROACHES TO TEMPLATE FORMATS

The template format can be either linear (i.e. instance data values are inserted at positions marked in the template text) or contain control structures (for, while). The latter needs a suitable way of structuring the instance data (see section 2). Both techniques are already employed in commercial products such as WithClass [10] (linear) and GSLgen (control structures). Alternatively, these could be called static and dynamic templates, as seen in [11].

3.1. Linear

Linear templates are parsed from beginning to end, and when a marker/variable is found, the code generator inserts the appropriate value in its place. The instance data does not need any complex structure in this case. Of course, the linear approach only works for simple cases where there are no dynamic arrays of information, e.g. a set of objects. Complex systems such as a distributed environment cannot suitably be served by this method, unless some of the code generation logic is hard-coded in the template parser. This approach is mediocre at best.

3.2. Control Structures

Templates with control structures can "navigate" the instance data using "selection criteria" [11]. This means that during the parsing of the template a piece of code will be generated for all graph nodes matching the given condition, using their specific attribute values. Since distributed systems can be described as a collection of sets (see section 2.3), they can thus be suitably produced with control structures. ArcStyler is an example of a product which uses this approach (as can be seen in the *.tpl files).

3.3. Event-Based

Event-based templates are a slightly enhanced version of control-structured templates. So far, templates were handled from start to end, looking up the appropriate information from the instance data structure. Event-based parsing, however, traverses the data structure (usually a graph) and looks for the appropriate template fragment corresponding to the node. The fragments themselves are linear or control-structured templates which correspond to an output file. Except for the latter, this is the concept behind XSLT, even though there is usually but one event - matching the root element.

This approach is probably the best for distributed code generation when combined with graph-based instance data. However, any template language working this way would have to be implemented from scratch.

3.4. Full Language

Instead of providing limited control structures, it is possible to provide full language support (e.g. Jostraca [12]). The templates can contain real pieces of code in a known language, placed within specially marked blocks and executed when they are encountered. This is quite similar to Java Server Pages (JSP) [13]. While this is probably the most powerful way of designing templates, it has the drawback that the resulting code is quite complex, as can be seen in Jostraca's "Hello World" example. Also, since the instance data is well-defined for distributed code, such a level of freedom is not needed.

3.5. Language Extension

This is not what springs to mind when talking about code generation, yet the basic idea is the same - let a machine take over the complex parts of programming. The 'templates' would be ready-to-use classes which are passed basic information and take care of the rest. The Dejay project [14] is an example of this approach. While this idea is sound and has its uses, the major drawback is that no instance data exists (at least not in non-code form) and therefore there is no way to use graphical front ends in the design of an application.
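The step from linear templates (section 3.1) to control structures (section 3.2) can be made concrete with a very small renderer. The `.for` / `.endfor` syntax below is invented for this sketch and is not the syntax of GSLgen, ArcStyler or any other product; the generated `naming.bind` lines are likewise placeholder text.

```python
from string import Template

# Invented template syntax: '.for <set>' repeats the enclosed lines for every
# item of the named set; all other lines are linear ($-placeholders only).
TEMPLATE = """\
// registration code for $node
.for interfaces
naming.bind("$name", new ${name}Impl());
.endfor
"""


def render(template_text, data):
    """Render a template that may contain non-nested '.for' blocks."""
    lines = template_text.splitlines()
    out = []
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith(".for "):
            set_name = line.split()[1]
            end = lines.index(".endfor", i)       # end of the loop body
            body = lines[i + 1:end]
            for item in data[set_name]:           # one pass per element of the set
                scope = {**data, **item}          # item fields shadow global fields
                out.extend(Template(b).substitute(scope) for b in body)
            i = end + 1
        else:
            out.append(Template(line).substitute(data))
            i += 1
    return "\n".join(out)


# Hypothetical instance data: one node with a set of deployed interfaces.
instance_data = {
    "node": "MainServer",
    "interfaces": [{"name": "Catalog"}, {"name": "Orders"}],
}
rendered = render(TEMPLATE, instance_data)
print(rendered)
```

A purely linear template could not express the repeated registration line without knowing the number of interfaces in advance.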
4. CHARACTERISTICS OF WELL-DESIGNED TEMPLATE LANGUAGES FOR DISTRIBUTED SYSTEMS

4.1. High Level Of Abstraction

Ideally, the template input required by the language will not necessitate any knowledge of the technology used. This de-coupling has advantages similar to those of the MVC paradigm [15]. An example of inadequate abstraction would be a language that requires a "BeanType" as instance data. This information would only be relevant for EJB templates and hard to interpret when using CORBA or Remote Method Invocation (RMI) [16] instead. Many commercial products suffer from this problem (ArcStyler [17], UMLBridge [8]) since they do not aim at a broad spectrum of middleware. The primary advantage of abstraction is improved maintenance - switching a template to a new technology does not require a change of the input (i.e. the instance meta-data). Also, with high abstraction the designers do not have to translate the problem into the terms and limitations of a certain technology since they solve it abstractly and the templates decide on the actual implementation details (see Fig. 6).
Fig. 6: Abstract vs. Concrete Design (diagram: a Real World Problem leads either to an EJB-based Solution and EJB-based Code, or to a "UML"-based Solution and technology-independent Code)
A drawback is added complexity in the design of templates, since the template provider has to translate the abstract information passed by the user interface into terms relevant for the chosen middleware. This translation might also lead to less efficient code which programmers might want to streamline by hand. The ideal level of abstraction is one where no computer knowledge whatsoever is needed and the user simply models business transactions. Presumably, though, this is impossible, so a good compromise has to be found.

4.2. Generic Nature

While the requirements for a template language for distributed applications are somewhat clearly defined, there are still many ways to define its syntax. In any case, it should be flexible enough to allow for expansions due to future developments. One aspect of this is a syntax which is independent of the utilised technologies (see footnote 1). Basically, the language should be able to create any kind of text, substituting instance data for variables and omitting or adding paragraphs based on the input. Secondly, the instance data should be expandable, as described next.

4.3. The Instance Data

Instance data should be in a format easily accessible by the template, i.e. there must be a syntactic feature that allows this information to be retrieved via some kind of identifier. Ideally, the data appears to be divided into sets of similar items since these are often processed together. One example is the set of registered objects on a server, each of which needs practically the same code for the registration process. Note that this is not the same as the actual data structure, only the way the data is retrieved in the templates. The information necessary in the instance data is defined by the sum of all features the code generator should provide. While independence from the middleware is an absolute requirement, it is at this stage necessary to determine what kind of services are provided.
Independence means in this case that no single technology is examined but rather as many as possible, and the data should be based on the sum of all capabilities (Fig. 7). One might argue that this is a bad idea since not all templates will work. However, there are two ways to circumvent this. First of all, some features can be emulated. RMI does not provide any support for persistent objects while EJB does, yet a template based on the former could easily implement code which allows the storage of objects. Alternatively, the templates could provide a list of supported features (i.e. the types of instance data this template can handle - compare section 2.1) in some kind of information header which advertises the supported capabilities. Naturally, this definition of the instance data will be incomplete since one can neither evaluate all technologies nor anticipate all future advancements. To allow the introduction of future features, the definition of the instance data must be expandable. Older templates must be able to ignore such additions for backward compatibility.

4.4. Other Template Syntax Requirements

The syntax of the template language must be easy to understand to allow simple maintenance of the templates. Also, the generated code should have decent formatting and human-readable names to make the generated software understandable to the programmer.
Fig. 7: Assessing Language Requirements (diagram: the Technology-based Requirements together make up the Total Template Requirements)

Footnote 1: Obviously, linear templates (section 3.1) are hard pressed to fulfil this criterion.
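The two mechanisms proposed in section 4.3, a capability header and the ability of older templates to ignore newer instance data, can be sketched as follows. The header layout, the feature names and both helper functions are hypothetical; the paper does not prescribe a concrete format.

```python
# Header a template might advertise (hypothetical layout and feature names).
RMI_TEMPLATE_HEADER = {
    "name": "rmi-basic",
    "version": "1.0",
    "supports": {"deployment", "usage"},
    # Not offered by the middleware itself, but provided by generated helper code.
    "emulates": {"persistence"},
}


def check_template(header, required_features):
    """Return (missing, emulated) feature lists for a set of required features."""
    required = set(required_features)
    available = header["supports"] | header["emulates"]
    missing = sorted(required - available)
    emulated = sorted(required & header["emulates"])
    return missing, emulated


def visible_fields(record, known_fields):
    """Older templates ignore fields added by later versions of the instance data."""
    return {key: value for key, value in record.items() if key in known_fields}


missing, emulated = check_template(
    RMI_TEMPLATE_HEADER, {"deployment", "persistence", "load_balancing"}
)
print("missing:", missing, "emulated:", emulated)

old_view = visible_fields(
    {"name": "Catalog", "persistent": True, "qos": "high"},
    known_fields={"name", "persistent"},
)
print(old_view)
```

A generator could refuse a template when `missing` is non-empty, or merely warn the user; that policy is left open here.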
4.5. Miscellaneous Features

The following properties would provide additional benefit for a generator of distributed code:
- A command to start external programs. Many middleware technologies provide special (pre-)compilers such as CORBA's IDL compiler. The ability to start these gives the template more control over the various resources.
- The ability to create multiple output files. Even the simplest technologies have to spawn at least two files (one for the client, the other for the server). This also allows documentation files to be generated automatically.
- Header information is required to provide data on the template's version and processing capabilities (see 4.3), and to give advice on when to use this specific template.
- Constants can allow the easy reuse of templates. This is especially true for CORBA, where there is a slight variation between different ORBs. Rather than writing a new template for each product, a single template would suffice, where constants such as the compiler path can be inserted via a file of constants.
- Easy usage is essential to increase acceptance of the tool since normally "programmers resist code generation" [18]. If code generation requires vast effort before its benefits are reaped, it defeats its purpose. This applies to the graphical front end / data instance format as well as the template syntax.
- Consistency checks: is the data consistent (e.g. are all interfaces deployed on at least one node)?
- Inclusion of archetypes: The template language should allow the designer to include archetypical software components. A good example would be standard filters as described by Joshi [19]. Basically, this allows ready-made code/byte code/packages to be included as nodes. Objects will be passed certain parameters during creation (as specified in the meta-template which influences the GUI).
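The consistency check named in the list (are all interfaces deployed on at least one node?) is simple to state in code. The function name and the shape of its arguments are assumptions made for this sketch.

```python
def undeployed_interfaces(interfaces, deployments):
    """Return the interfaces that no node deploys.

    interfaces:  iterable of interface names
    deployments: mapping of node name -> list of interface names it deploys
    """
    deployed = {iface for ifaces in deployments.values() for iface in ifaces}
    return sorted(set(interfaces) - deployed)


# Hypothetical instance data: 'Orders' has been modelled but never deployed.
problems = undeployed_interfaces(
    ["Catalog", "Orders"],
    {"MainServer": ["Catalog"], "TestClient": []},
)
print(problems)
```

Running such checks before any template is processed keeps errors in the model from surfacing only in the generated code.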
5. CONCLUSION

A distributed system is hard to program by hand, and code generation is an inevitable trend in this field, as seen by the many commercial products. From the standpoint of software engineering and expandability, however, these solutions are inadequate and call for an in-depth look at the different possibilities of code generation. As shown in this paper, there are many different template structures and ways to model instance data, each of which has different advantages. Some features which are paramount for all choices (e.g. a high level of abstraction) have also been presented. Of all the options, a combination of graph-based instance data and event-based templates seems to be the best, if one is willing to forgo the use of existing parsing tools. It remains to be seen how an actual implementation using it would work. Special consideration should be given to specific network problems such as security or load balancing. These will be addressed in future papers.
6. ACKNOWLEDGEMENTS

I would like to thank Prof. Dr. Dr. Martin Schader for reading this paper thoroughly and for his helpful comments and advice regarding its contents.
7. REFERENCES

[1] M. Aleksy, R. Gitzel, ViDiO - Visual Distribution of Objects, Fifth International Conference of Information Visualisation (IV2001), London, England, 2001
[2] iMatix Corporation, Template-based Code Generation, technical white paper, 23 January 2000, http://www.imatix.com/pub/info/codegen.pdf
[3] OMG, OMG Unified Modeling Language Specification, Version 1.3, June 1999
[4] L. G. DeMichiel, L. Ümit Yalçinalp, S. Krishnan, Sun Microsystems - Enterprise JavaBeans(TM) Specification, Version 2.0, October 23, 2000
[5] OMG, CORBA/IIOP 2.4 Specification, OMG Technical Document Number 00-10-01, 2000, ftp://ftp.omg.org/pub/docs/formal/00-10-01.pdf
[6] J. Clark, XSL Transformations (XSLT) Version 1.0, W3C Recommendation, 16 November 1999, http://www.w3.org/TR/xslt
[7] C. Oloff, Einsatz von XMI und XSLT zur Code- und Schemagenerierung aus UML-Modellen, 2001
[8] UMLBridge, http://www.avantis.de/frame.asp?s=8
[9] GSLgen, http://www.imatix.com
[10] Microgold, WithClass, http://www.microgold.com/
[11] M. Brassard, Dynamic template-based code generators offer solution to e-software paradox, Serverworld Magazine, http://www.serverworldmagazine.com/hpchronicle/2000/08/codegen.shtml
[12] Jostraca, http://www.jostraca.org/
[13] JavaServer Pages White Paper, http://java.sun.com/products/jsp/whitepaper.html
[14] Dejay, http://www.dejay.org/
[15] G. E. Krasner, S. T. Pope, A Description of The Model-View-Controller User Interface Paradigm in the Smalltalk-80 System, 1988, http://citeseer.nj.nec.com/krasner88description.html
[16] Sun Microsystems, Java Remote Method Invocation - Distributed Computing for Java, 1999, http://java.sun.com/marketing/collateral/javarmi.html
[17] IOSoftware, ArcStyler, http://www.io-software.com/products/as_docu_main.html
[18] A. Ryman, Rosetta, 1996, http://www.cs.yorku.ca/~ryman/Rosetta.html
[19] R. K. Joshi, Filter Configurations for Transparent Interactions in Distributed Object System, Journal of Object Oriented Programming, June/July 2001