Software Is a Directed Multigraph

Robert Dąbrowski, Krzysztof Stencel, and Grzegorz Timoszuk

Institute of Informatics, Warsaw University, Banacha 2, 02-097 Warsaw, Poland
{r.dabrowski,k.stencel,g.timoszuk}@mimuw.edu.pl

Abstract. The architecture of a software system is typically defined as the organization of the system, the relationships among its components and the principles governing their design. By including artifacts corresponding to software engineering processes, the definition is naturally extended into the architecture of a software system and process. In this paper we propose a holistic model to organize knowledge of such architectures. This model is graph-based. It collects architectural artifacts as vertices and their relationships as edges. It allows operations like metric calculation, refactoring, bad smell detection and pattern discovery to be expressed as algorithmic transformations on graphs. It is independent of development languages. It can be applied to both formal and adaptive projects. We have implemented prototype tools supporting this model. The artifacts are stored in a graph database. The operations are defined in a graph query language. They have short formulations and are efficiently executed by the graph database engine.

Keywords: architecture, graph, metric, model, software.

1 Introduction

As long as there were no software systems, managing their architecture was no problem at all; when there were only simple systems, managing their architecture became a mild problem; and now we have gigantic software systems, and managing their architecture has become an equally gigantic problem (to paraphrase Edsger Dijkstra).

Nowadays software systems are being developed by teams that are: changing over time; working under time pressure; working with incomplete documentation and changing requirements; integrating unfamiliar source code written in multiple development technologies, programming languages and coding standards; productively delivering only partially completed releases in iterative development cycles. When at some point development issues arise (bugs, changes, extensions), they frequently lead to refactoring of the software system and the software process. Even if the issues get addressed promptly, they often return in consecutive releases due to volatile team structure, insufficient flow of information and the inability to properly manage architectural knowledge about the software system and the software process.

I. Crnkovic, V. Gruhn, and M. Book (Eds.): ECSA 2011, LNCS 6903, pp. 360–369, 2011. © Springer-Verlag Berlin Heidelberg 2011


Unsurprisingly such challenges have already been identified and software engineering is focused on their resolution. In particular there emerged a number of software development methodologies (e.g. structured, iterative, adaptive), design models (e.g. Entity Relationship Diagram, Data Flow Diagram, State Transition Diagram), development languages (e.g. functional, object-oriented, aspect-oriented) and production management tools (e.g. issue trackers, build and configuration managers, source-code analyzers). Although they address important areas, it is still a challenge to integrate those methodologies, standards, languages, metrics, tools into a consistent environment. Such an environment should (1) include all software system and software process artifacts; (2) identify their dependencies; (3) facilitate systematic build of deliverables. Furthermore, it should be resilient to changes of the development team. This property can be achieved provided all architectural knowledge is preserved in this environment’s repository. For software practitioners this current lack of integration of architectural knowledge is a historical condition: while software was limited to a small number of files delivered in one programming language and built into a single executable, it was possible to browse the artifacts in a list mode (file by file; or procedure by procedure). Next, as software projects evolved to become more complex and sophisticated, the idea of a software project organized according to a tree (folders, subfolders and files; or classes, subclasses and methods) emerged to allow browsing artifacts in a hierarchical approach. This is no longer enough. We believe that although software engineering is going in the right direction, the research will lack proper momentum without a new sound model to support integration of current trends, technologies, languages. 
A new vision for an architectural repository of software system and software process is required, and this paper aims to introduce one in order to trigger a discussion. Our concept can be summarized as follows. All software system and software process artifacts created during a software project are explicitly organized as vertices of a graph (the next step after the list and the tree) connected by multiple edges that represent multiple kinds of dependencies among those artifacts. The key aspects of software production like quality, predictability, automation and metrics are easily expressible in graph-based terms. The integration of source code artifacts and process artifacts in a single model opens new possibilities. They include, for example, defining new metrics and qualities that take into account all architectural knowledge and not only the source code. This concept of a graph-based model for software and software process has been briefly announced in [4]. In this paper we present a detailed definition of the model and demonstrate by example that its implementation is feasible using graph databases.

The rest of the paper is organized as follows. In Section 2 we analyze the background that motivated our approach. In Section 3 we provide a definition of the graph-based model for architectural knowledge management. In Section 4 we describe our prototype implementation using a graph database. Section 5 concludes and enumerates challenges for further research.

2 Related Work

The idea of software development described in this paper is not an entirely novel one. Several existing approaches and practices have contributed to it. Software engineering strives for quantitative assessment of software quality and software process predictability. Typically this is achieved by different metrics. Frequently there are many contradicting definitions of a given metric (i.e. they depend on the implementation language). It has been suggested by Mens and Lanza [11] that metrics should be expressed and defined using a language-independent metamodel based on graphs. Such a graph-based approach allows for an unambiguous definition of generic object-oriented or higher-order metrics. Also Gossens, Belli, Beydeda and Dal Cin [7] considered view graphs for representation of source code. Such graphs are convenient for program analysis and testing at different levels of abstraction (e.g. white-box analysis and testing at the low level of abstraction; black-box analysis and testing at the high level of abstraction). A graph-based approach integrates the different techniques of analysis and testing.

Modern software models often describe systems by a number of (partially) orthogonal views (e.g. state machine, class diagram). Abstract models are often transformed into platform-specific models, and finally into the code. During such transformations it is usually not possible to keep a neat separation into different views (e.g. the specification language of the target models might not support all such views). The target model, however, still needs to preserve the behavior of the abstract model. Therefore, model transformations have to be capable of moving behavioral aspects across views. Derrick and Wehrheim [5] studied aspects of model transformations from state-based views (e.g. class specifications with data and methods) into protocol-based views (e.g. process specifications on orderings of methods) and vice versa.
They suggested that specification languages for these two views should be equipped with a joint, formal semantics which enables a proof of behavior preservation and consequently derives conditions for the transformations to be behavior-preserving. Also Fleurey, Baudry, France and Ghosh [6] have observed that it is necessary to automatically compose models to build a global view of the system. The graph-based approach allows for a generic framework of model composition that is independent of a modeling language.

The use of components is beneficial for the development of complex software systems. However, component testing is still one of the top issues in software engineering. In particular, both the developer of a component and the developer of a system that uses components often face the problem that information vital for certain development tasks is not available. One important consequence is that this lack of information might not only oblige the developer of a system to test the components used, but might also complicate these tests. Beydeda and Gruhn [2] have focused on component testing approaches that explicitly respect this lack of information during development.

As Kühne, Selic, Gervais and Terrier [9] have noticed, an automated transition from use cases to activity diagrams would provide significant, practical help. Additionally, traceability could be established through automated transformation, which could then be used to relate requirements to design decisions and test cases. They proposed an approach to automatically generate activity diagrams from use cases while establishing traceability links. Such an approach has already been implemented (e.g. RAVEN, ravenflow.com).

Osterweil [12] perceived software systems as large, complex and intangible objects developed without suitably visible, detailed and formal descriptions of how to proceed. He suggested that not only the software, but also the software process should be included in a software project as programs with explicitly stated descriptions. According to Osterweil, a software architect should communicate with developers, customers and other managers through a software process program, indicating steps that are to be taken in order to achieve product development or evolution goals. Osterweil postulates that developers would benefit from communicating by software process programs, as reading them should indicate the way in which work is to be coordinated and the way in which each individual's contribution is to fit with others' contributions. In that sense a software process program would be yet another artifact in the graph we propose in this paper.

The RDF (Resource Description Framework) model [10] is also worth mentioning. The model presented in this paper is somewhat similar to the RDF idea. RDF defines subject-predicate-object triples, which are similar to graph relationships (vertex-edge-vertex triples). It is usually stored in textual formats (XML or N3). Several languages have already been proposed to query this model, such as Sesame [3] and SPARQL [13].

3 Model

In this section we introduce a graph-based model for software engineering methodologies. The model is based on directed multigraphs.

Definition 1. Let S be a software-intensive system. Let A denote the set of all types of artifacts that are created during construction of S, and let D denote the set of all types of dependencies among those artifacts. In the remaining part we assume A, D to be given and denote S = S(A, D).

The set A is a dictionary of attributes that annotate artifacts created during development of S. For simplicity of reasoning we assume A to be predefined in the rest of the paper. There remains a challenge to derive a representative and consistent classification (a superset) of such attributes, though during a given software project typically only a subset of A would be used.

Example 1. Typically, A may contain some of the following values: class; coding standard; field; grammar; interface; library; method; module; requirement; test suite; use case; unit test.

Analogously, the set D is the dictionary of labels describing dependencies traced among the artifacts. Again, in the remainder of this paper we assume it to be predefined, although the set of actual dependencies may be software-specific and derivation of a common superset remains a challenge.


Example 2. Typically, D may contain some of the following values: apply to, call, contain, define, depend on, generate, implement, limit, require, return, override, use, verify.

Definition 2. The software graph G is an ordered triple G(S) = (V, L, E), where V is the set of vertices that represent the artifacts of the software system or software process, L ⊆ V × A is the labeling of vertices with their attributes, and E ⊆ V × D × V is the set of directed edges that trace dependencies between artifacts.

Example 3. Typically, E may contain some of the following values: a class calls a class; a class contains a field; a class contains a method; a class implements an interface; a coding standard limits a module; a grammar generates a class; a method calls a method; a module depends on a module; a requirement defines a module; a unit test verifies a method.

G is a multigraph, that is, there can be more than one edge in E from one vertex in V to another vertex in V. G is a directed graph, that is, forward and backward relations traced among artifacts are distinguished.

Example 4. Figure 1 shows an example software graph G where A = { Abstract class, Class, Field, Method }, D = { Call, Contain, Extend, Override }.

The model integrates all artifacts created during a software project. It provides a graph-based abstraction of software engineering methodology. Being graph-based, the abstraction is well recognized in the software community; in particular, efficient graph algorithms already exist for many problems.

Fig. 1. An example software graph
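Definition 2 can be made concrete with a small in-memory structure. The following is only a minimal sketch in Python, not the repository described in Section 4; the class name SoftwareGraph, its methods and the sample artifacts Ledger and Account are hypothetical and serve illustration only. Because E is kept as a set of (source, dependency type, target) triples, two vertices may be joined by several edges as long as their dependency types differ, and the two directions are stored separately.

```python
from dataclasses import dataclass, field


@dataclass
class SoftwareGraph:
    """G = (V, L, E) in the sense of Definition 2, kept as plain sets."""
    vertices: set = field(default_factory=set)  # V: artifact identifiers
    labels: set = field(default_factory=set)    # L: (vertex, artifact type) pairs
    edges: set = field(default_factory=set)     # E: (source, dependency type, target)

    def add_artifact(self, vertex, *artifact_types):
        """Add a vertex and label it with zero or more artifact types."""
        self.vertices.add(vertex)
        for artifact_type in artifact_types:
            self.labels.add((vertex, artifact_type))

    def add_dependency(self, source, dependency_type, target):
        """Add one directed, typed edge between two existing vertices."""
        if source not in self.vertices or target not in self.vertices:
            raise KeyError("both endpoints must already be in V")
        self.edges.add((source, dependency_type, target))

    def edges_between(self, source, target):
        """Return the dependency types of all edges from source to target."""
        return {d for (u, d, w) in self.edges if u == source and w == target}


# Hypothetical sample data, not taken from Figure 1.
graph = SoftwareGraph()
graph.add_artifact("Ledger", "class")
graph.add_artifact("Account", "class")
graph.add_dependency("Ledger", "call", "Account")
graph.add_dependency("Ledger", "use", "Account")   # parallel edge, different type
graph.add_dependency("Account", "call", "Ledger")  # reverse direction is distinct

print(sorted(graph.edges_between("Ledger", "Account")))
print(sorted(graph.edges_between("Account", "Ledger")))
```

Storing edges as triples mirrors E ⊆ V × D × V directly; a production repository would more likely delegate this to a graph database.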


We now provide several examples to demonstrate how the model can be applied to collect software architectural knowledge and to analyze its properties. For this purpose, we introduce some model transformations. The list of transformations presented in this paper is not exhaustive and there remains a challenge to provide a canonical classification of such operations (including basic operations like adding or deleting graph vertices or edges). However, they can be summarized by the following intuitive set of main transformation types: an evaluation that maps a graph into a real number; a selection that maps a graph into one of its subgraphs; and a transition that maps a graph into a new graph (and in particular may introduce new vertices or edges).

First we define the diagram transformation that limits the graph to a given scope of artifacts and dependencies. The transformation is particularly useful for providing a human-convenient representation of the graph, as in a non-trivial software project the model itself may grow large.

Definition 3. For a given software graph G = (V, L, E) and subsets of its artifact types A′ ⊆ A and dependency types D′ ⊆ D, its diagram is a selection G|A′,D′ = (V′, L′, E′), where V′ = {v ∈ V | ∃a ∈ A′ : (v, a) ∈ L}, L′ = L ∩ (V′ × A′) and E′ = E ∩ (V′ × D′ × V′). In particular, this transformation allows generating the class and entity relationship diagrams directly from the model.

Example 5. Figure 2 shows the graph G1 that is a selection G|A′,D′ where A′ = { Abstract class, Class }, D′ = { Contain, Extend }.

Fig. 2. The result of an example diagram transformation
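The diagram transformation of Definition 3 amounts to three set comprehensions. The sketch below is written in Python under the assumption that labels are restricted to L ∩ (V′ × A′) and edges to E ∩ (V′ × D′ × V′); the function name diagram and the sample artifacts (Shape, Circle and their members) are hypothetical and are not the graph of Figure 1.

```python
def diagram(V, L, E, artifact_types, dependency_types):
    """Select the subgraph limited to the given artifact and dependency types."""
    # V': vertices carrying at least one of the requested artifact types.
    V2 = {v for v in V if any((v, a) in L for a in artifact_types)}
    # L': labels of the kept vertices, limited to the requested types.
    L2 = {(v, a) for (v, a) in L if v in V2 and a in artifact_types}
    # E': edges of a requested type whose endpoints both survive.
    E2 = {(u, d, w) for (u, d, w) in E
          if u in V2 and w in V2 and d in dependency_types}
    return V2, L2, E2


# Hypothetical sample graph.
V = {"Shape", "Circle", "Circle.radius", "Shape.area"}
L = {
    ("Shape", "Abstract class"),
    ("Circle", "Class"),
    ("Circle.radius", "Field"),
    ("Shape.area", "Method"),
}
E = {
    ("Circle", "Extend", "Shape"),
    ("Circle", "Contain", "Circle.radius"),
    ("Shape", "Contain", "Shape.area"),
}

V1, L1, E1 = diagram(V, L, E, {"Abstract class", "Class"}, {"Contain", "Extend"})
print(sorted(V1))
print(sorted(E1))
```

With the class-level scope of Example 5, fields and methods drop out together with the Contain edges that pointed at them, so only the inheritance edge between the two classes remains.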

Software architects may choose to stop distinguishing certain differences in artifact or dependency types (adopt a higher level of abstraction, e.g. hide fields and methods while preserving class dependencies). For this purpose we define the map transformation. The transformations can be combined, e.g. the map transformation combined with the diagram transformation is useful for generating a visual representation (e.g. two- or three-dimensional) of a given software graph.


Definition 4. For a given software graph G = (V, L, E) and t : D × D → D, its map is a transition G|t = (V, L, E′), where E′ is the set of new edges resulting from a transitive closure of t calculated on the neighboring edges of vertices in G.

Example 6. Figure 3 shows the graph G2 that is the result of a combination of a map and a selection G|f|A′,D′ where A′ = { Abstract class, Class }, D′ = { Call, Contain, Extend } and f : { Call, Contain, Extend } → { Depend }.

Fig. 3. The result of the example transformations composed of a map and a selection
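Definition 4 leaves the exact closure procedure open. One possible reading, sketched below in Python, treats t as a partial function on pairs of dependency types and repeatedly composes adjacent edges u → v → w into a derived edge u → w until nothing new appears. The function name map_closure, the rule table and the sample artifacts are assumptions made for illustration; other readings of the definition are possible.

```python
def map_closure(E, t):
    """Close the edge set E under the composition rule t.

    t maps a pair (d1, d2) of dependency types to a derived type; pairs
    missing from t produce no edge. Returns E together with derived edges.
    """
    closed = set(E)
    while True:
        derived = {
            (u, t[(d1, d2)], w)
            for (u, d1, v) in closed
            for (v2, d2, w) in closed
            if v == v2 and (d1, d2) in t
        }
        if derived <= closed:  # fixpoint reached, nothing new
            return closed
        closed |= derived


# Hypothetical sample edges: a module contains a method that calls others.
E = {
    ("Billing", "Contain", "charge"),
    ("charge", "Call", "log"),
    ("log", "Call", "write"),
}
# Assumed rules: containing a caller means depending on the callee,
# and a dependency extends along further calls.
t = {
    ("Contain", "Call"): "Depend",
    ("Depend", "Call"): "Depend",
}

closure = map_closure(E, t)
new_edges = closure - E
print(sorted(new_edges))
```

The loop terminates because only finitely many triples over V and D exist. Combining the result with a diagram selection on the derived type would then hide the intermediate method-level vertices.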

Software architects need to assess the model quantitatively. For this purpose, we introduce metric transformations. The graph-based approach not only allows using existing metrics that can be efficiently calculated using graph algorithms, but also allows designing new metrics. The metrics that integrate both system and process artifacts are particularly interesting.

Definition 5. For a given software graph G = (V, L, E), its metric is an evaluation m : (V, L, E) → R (R being the real numbers) which can be calculated by a graph algorithm on V, L, E.

Sometimes vertices that meet certain conditions need to be discovered. For this purpose we introduce detection transformations.

Definition 6. For a given software graph G = (V, L, E) and f : V → bool, its detection is a selection G|f = (V′, L, E′), where V′ = {v ∈ V | f(v) = true} and E′ = E ∩ (V′ × D × V′).

This way discovery of bad smells can be conducted using a detection transformation. In particular, we can easily find classes that define their own fields but do not redefine the comparison method.

Example 7. Figure 4 shows the graph G3 that is a detection G|f where f(v) evaluates to true iff v is of type Class, has a neighbor of type Field and does not have a neighbor equals() of type Method.


Fig. 4. A bad smell detected
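Definitions 5 and 6 can be sketched together. The Python fragment below shows one hypothetical metric (the average number of methods contained per class) and a detection driven by the predicate of Example 7. Several choices here are assumptions rather than part of the model: neighbors are taken to be the targets of outgoing edges, the method name is read from the vertex identifier, labels are restricted to the selected vertices, and the sample classes Point and Tag are invented.

```python
def methods_per_class(V, L, E):
    """Evaluation: average number of methods contained in a class."""
    classes = {v for (v, a) in L if a == "Class"}
    methods = {v for (v, a) in L if a == "Method"}
    if not classes:
        return 0.0
    contained = sum(1 for (u, d, w) in E
                    if d == "Contain" and u in classes and w in methods)
    return contained / len(classes)


def detect(V, L, E, f):
    """Selection: keep the vertices satisfying f and the edges among them."""
    V2 = {v for v in V if f(v)}
    L2 = {(v, a) for (v, a) in L if v in V2}
    E2 = {(u, d, w) for (u, d, w) in E if u in V2 and w in V2}
    return V2, L2, E2


# Hypothetical sample graph; identifiers have the form Owner.member.
V = {"Point", "Point.x", "Point.equals", "Tag", "Tag.text"}
L = {
    ("Point", "Class"), ("Point.x", "Field"), ("Point.equals", "Method"),
    ("Tag", "Class"), ("Tag.text", "Field"),
}
E = {
    ("Point", "Contain", "Point.x"),
    ("Point", "Contain", "Point.equals"),
    ("Tag", "Contain", "Tag.text"),
}


def lacks_equals(v):
    """Predicate of Example 7: a class with a field but without equals()."""
    neighbors = {w for (u, d, w) in E if u == v}
    has_field = any((w, "Field") in L for w in neighbors)
    has_equals = any((w, "Method") in L and w.split(".")[-1] == "equals"
                     for w in neighbors)
    return (v, "Class") in L and has_field and not has_equals


average = methods_per_class(V, L, E)
smelly, _, _ = detect(V, L, E, lacks_equals)
print(average, sorted(smelly))
```

Both functions take the whole triple (V, L, E), so a metric or predicate is free to mix system and process artifacts once such vertices are present in the graph.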

4 Model Implementation

We have decided to implement the repository with a graph database. Graph databases are members of the family of NoSQL databases that directly store unconstrained graph structures. Therefore they are well suited to the needs of our approach. Graph databases provide efficient traversal between vertices, called index-free adjacency. The graph structure in such a database is explicit, thus joins and index probes are not necessary to walk the graph from one vertex to another. This facility is important for model browsing tools and IDEs. Graph databases also provide implementations of query languages and transactional operations. A query language is needed to easily define and efficiently execute the graph transformations sketched in Section 3. Transactional operations are necessary for large teams who work concurrently on the same repository.

For our implementation we have selected the open-source graph database Neo4j (neo4j.org). Neo4j offers high-availability facilities that make it feasible to build repositories for large projects. To express model operations we have selected the graph query language Gremlin (github.com/tinkerpop/gremlin). Gremlin is a path language similar to XPath; however, a number of additional facilities like backtracking and loops make Gremlin Turing-complete. Thus, we can code in Gremlin any evaluation, selection and transition as described in Section 3.

We have implemented a number of graph transformations in Gremlin. As an example we show a selection of classes with a specific bad smell, namely classes that add their own fields but do not redefine the comparison method equals. The result of this transformation applied to the graph from Figure 1 is shown in Figure 4. The respective query in Gremlin follows. As this code shows, a relatively complex search condition has a concise formulation in Gremlin.
    g = new Neo4jGraph('Repository')
    g.V{ it.TYPES_KEY == '[JAVA_CLASS]' \
      && !it.outE('CONTAINS').inV{it.NAME_KEY=='equals'} \
      && it.outE('CONTAINS').inV{it.TYPES_KEY=='[JAVA_FIELD]'} \
      && it.outE('EXTENDS').inV.loop(2){ \
           !it.object.outE('CONTAINS').inV{it.NAME_KEY=='equals'} } \
    }.NAME_KEY


V is the collection of all vertices of graph g. The query applies a filter to this collection. The first part of the condition selects nodes that are Java classes. The second drops all nodes that have a CONTAINS edge towards a node describing a method equals. The third keeps only those classes that have their own Java field. The fourth is the most complex, since it utilizes Gremlin's loop step. This condition traverses the inheritance lattice upwards and stops when it finds a class having a method equals. Only when such an ancestor is found is the tested class added to the result of the selection. When this step is finished, the query projects its result to the value of the NAME_KEY property. Eventually, we get the following answer, conformant with the contents of Figure 4.

    ==>Motor
    ==>Truck
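For readers without a Gremlin installation, the four conditions described above can be restated in plain Python. This is only a sketch of the logic as the prose explains it, not a translation verified against Gremlin semantics. The property keys type and name stand in for TYPES_KEY and NAME_KEY, the type label JAVA_METHOD is an assumption, and the toy hierarchy (Vehicle, Car, Van, Bike) is invented; it is not the graph of Figure 1.

```python
def contained(vertex, edges):
    """Targets of CONTAINS edges leaving the given vertex."""
    return {w for (u, d, w) in edges if u == vertex and d == "CONTAINS"}


def ancestors(vertex, edges):
    """All classes reachable by following EXTENDS edges upwards."""
    seen, frontier = set(), [vertex]
    while frontier:
        current = frontier.pop()
        for (u, d, w) in edges:
            if u == current and d == "EXTENDS" and w not in seen:
                seen.add(w)
                frontier.append(w)
    return seen


def declares_equals(vertex, props, edges):
    return any(props[m]["name"] == "equals" for m in contained(vertex, edges))


def smelly_classes(props, edges):
    """Names of classes passing the four conditions of the query."""
    names = []
    for v, p in props.items():
        if p["type"] != "JAVA_CLASS":                       # 1: a Java class
            continue
        if declares_equals(v, props, edges):                # 2: no own equals
            continue
        if not any(props[m]["type"] == "JAVA_FIELD"
                   for m in contained(v, edges)):           # 3: has own field
            continue
        if not any(declares_equals(a, props, edges)
                   for a in ancestors(v, edges)):           # 4: ancestor has equals
            continue
        names.append(p["name"])
    return sorted(names)


# Hypothetical toy repository.
props = {
    "Vehicle": {"type": "JAVA_CLASS", "name": "Vehicle"},
    "Vehicle.equals": {"type": "JAVA_METHOD", "name": "equals"},
    "Car": {"type": "JAVA_CLASS", "name": "Car"},
    "Car.doors": {"type": "JAVA_FIELD", "name": "doors"},
    "Van": {"type": "JAVA_CLASS", "name": "Van"},
    "Van.load": {"type": "JAVA_FIELD", "name": "load"},
    "Bike": {"type": "JAVA_CLASS", "name": "Bike"},
    "Bike.gears": {"type": "JAVA_FIELD", "name": "gears"},
}
edges = {
    ("Vehicle", "CONTAINS", "Vehicle.equals"),
    ("Car", "EXTENDS", "Vehicle"), ("Car", "CONTAINS", "Car.doors"),
    ("Van", "EXTENDS", "Car"), ("Van", "CONTAINS", "Van.load"),
    ("Bike", "CONTAINS", "Bike.gears"),
}

result = smelly_classes(props, edges)
print(result)
```

In this toy data Bike is not reported, because it has no ancestor that defines equals; this follows the fourth condition as described in the text.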

5 Conclusions

Following the research on the architecture of software [8] and software process, we consider an approach that avoids separating software artifacts from software process artifacts to be the one worth taking [12]. Implementation of such an approach has already become feasible, starting with a graph-based model and using graph databases [1] as the foundation for artifact representation. The concept is not an entirely novel one; rather, it should be perceived as an attempt to support existing trends with a sound and common foundation. A holistic approach is required for current research to gain proper momentum, as despite many advanced tools, current software projects still suffer from the lack of a visible, detailed and complete setting to govern their architecture and evolution.

We are also aware that turning this idea into an actual contribution to software engineering requires further research. In particular, the following research areas seem to be especially inspiring: assessing a representative number of existing projects in an effort to provide a systematic classification of artifact types A and dependency types D; perhaps the artifact types and dependency types should evolve to be trees rather than mere lists; designing metrics (in graph-based terms, so they can be calculated by graph algorithms) to assess software quality and software process maturity; implementing graph algorithms to calculate those metrics; classifying existing software and its process according to the model, in particular calculating metrics in order to assess software quality and software process maturity, which would eventually allow comparing software projects with one another; defining UML diagrams as reports obtained from the integrated software graph as a combination of its transformations; precise definitions for the model and its components (views, maps), and new components enriching the model; a production-grade implementation of the graph based on graph databases; a project query language that would operate on the graph model and allow architects and developers to conveniently filter, zoom and drill down into the project's architectural information.


References

1. Angles, R., Gutiérrez, C.: Survey of graph database models. ACM Computing Surveys 40(1) (2008)
2. Beydeda, S., Gruhn, V.: State of the art in testing components. In: Proceedings of Third International Conference on Quality Software, pp. 146–153. IEEE Computer Society, Los Alamitos (2004)
3. Broekstra, J., Kampman, A., Harmelen, F.: Sesame: A generic architecture for storing and querying RDF and RDF schema. In: Proceedings of the First International Semantic Web Conference, pp. 54–68 (2002)
4. Dąbrowski, R., Stencel, K., Timoszuk, G.: Software is a directed multigraph (and so is software process). arXiv:1103.4056 (2011)
5. Derrick, J., Wehrheim, H.: Model transformations across views. Science of Computer Programming 75(3), 192–210 (2010)
6. Fleurey, F., Baudry, B., France, R., Ghosh, S.: A generic approach for automatic model composition. In: Proceeding of MoDELS Workshops, pp. 7–15 (2007)
7. Gossens, S., Belli, F., Beydeda, S., Dal Cin, M.: View graphs for analysis and testing of programs at different abstraction levels. In: Proceedings of the Ninth IEEE International Symposium on High-Assurance Systems Engineering, pp. 121–130. IEEE Computer Society, Los Alamitos (2005)
8. Kruchten, P., Lago, P., van Vliet, H., Wolf, T.: Building up and exploiting architectural knowledge. In: Proceedings of the 5th Working IEEE/IFIP Conference on Software Architecture, pp. 291–292. IEEE Computer Society, Los Alamitos (2005)
9. Kühne, T., Selic, B., Gervais, M.-P., Terrier, F. (eds.): ECMFA 2010. LNCS, vol. 6138. Springer, Heidelberg (2010)
10. Lassila, O., Swick, R.R.: Resource description framework (RDF) model and syntax specification. W3C Recommendation (1999)
11. Mens, T., Lanza, M.: A graph-based metamodel for object-oriented software metrics. Electronic Notes in Theoretical Computer Science 72(2), 57–68 (2002)
12. Osterweil, L.: Software processes are software too. In: Proceedings of the 9th International Conference on Software Engineering, pp. 2–13. IEEE Computer Society, Los Alamitos (1987)
13. Prud'hommeaux, E., Seaborne, A.: SPARQL query language for RDF. W3C Recommendation (2008)
14. Hopcroft, J., Tarjan, R.: Algorithm 447: efficient algorithms for graph manipulation. Communications of the ACM 16, 372–378 (1973)
15. Hasse, P., Broekstra, J., Eberhart, A., Volz, R.: A comparison of RDF query languages. The Semantic Web (2004)
