
Procedia Computer Science 77 (2015) 167–175

ICTE in Regional Development

Source Code Driven Enterprise Application Decomposition: Preliminary Evaluation

Inese Šūpulniece a,*, Inese Poļaka a, Solvita Bērziša a, Edgars Ozoliņš b, Edgars Palacis b, Egils Meiers b, Jānis Grabis a

a Institute of Information Technology, Riga Technical University, Kalku str. 1, Riga, LV-1658, Latvia
b Visma Enterprise, Riga, Latvia

Abstract

Many legacy enterprise applications are large, monolithic and tightly coupled. Their maintenance is expensive, and these systems are inflexible in fast changing environments. Modularization can improve the maintainability and increase the efficiency of legacy enterprise applications. This paper describes a source code driven decomposition of a large enterprise application using clustering methods and taking into account business domain knowledge. The method is applied for decomposition of a large industrial ERP system.

© 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of the Sociotechnical Systems Engineering Institute of Vidzeme University of Applied Sciences.

Keywords: Decomposition; Enterprise application; Modularization; Software clustering

1. Introduction

Enterprise core business processes are implemented in enterprise applications. Many of these applications are large, monolithic and inflexible in a fast changing business environment1. This can be improved by decomposing large enterprise applications into modules. Modularization can improve the maintainability and increase the efficiency of enterprise applications by reducing their complexity2.

* Corresponding author. E-mail address: [email protected]

doi:10.1016/j.procs.2015.12.377


Enterprise application decomposition is expected to lead towards an application design that is easier to maintain and a development process suitable for autonomous teams. The main requirements for the enterprise application decomposition method are3: 1) scalable and computationally feasible for large applications (several million LOC and >10000 classes); 2) source code driven; 3) considers business domain knowledge; 4) automatically produces a list of loosely coupled modules; 5) tractable for software architects; 6) parameterized process.

This paper reports initial evaluation results of an enterprise application decomposition method4, which was developed according to the results of a systematic literature review5. The decomposition method is developed and validated in a university-industry collaboration research project. The industry partner is an IT company that has been developing and supporting a legacy enterprise resource planning system for over 20 years. This enterprise application is developed in the Delphi environment, has around 4 million lines of code and around 10,000 classes. It is used by many customers from different business domains.

The rest of the paper is structured as follows: Section 2 describes the related work. Section 3 presents an overview of the large enterprise application decomposition method. The case study is described in Section 4. Evaluation results are discussed in Section 5. Conclusions and future work are presented at the end of the paper.

2. Related Work

We have conducted a systematic literature review about enterprise application decomposition and published it in a separate report3. The main findings were: 1) Research about the applicability of decomposition methods in the area of enterprise applications is limited. During the systematic literature review we identified case studies on 187 software systems. 13 of them were business systems and 3 of them were related to enterprise resource planning systems. The rest were software development tools and scientific applications. Enterprise application data are usually confidential; therefore open source systems dominated in the described case studies. Decomposition is mainly performed for systems written in Java, whereas large legacy enterprise systems are usually built in other programming languages. 2) Most of the identified studies have tested one system. 3) Small (10K-100K LOC) and micro (<10K LOC) systems dominated; large systems (>1 million LOC) were only 4% of all tested systems. 4) An interesting observation was that only one paper reports negative results6. That paper presents industry experience, and its authors use decomposition approaches created by other researchers.

3. Enterprise application decomposition method

The proposed method for decomposition of large monolithic enterprise systems4 includes the basic phases of the decomposition process: facts extraction, pre-processing, component identification and post-processing. The method uses static information from the source code, is partially automated and requires human involvement in some of the decomposition tasks. An overview of the method is given in this section, while practical implementation details are given in the case study description (Section 4).

The facts extraction phase prepares data about the system's structure – nodes and relations. At first, node types and relation types used in the system are identified. A list of all nodes with previously defined node types is extracted from the source code, and similarly a list of relations among nodes with defined relation types is extracted.
The result of the facts extraction phase is a directed graph G = {N, E}, where node set N is a collection of the nodes and edge set E is a collection of relations among the nodes.
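To make the facts-extraction output concrete, the following is a minimal sketch of the directed graph G = {N, E} with typed nodes and typed, possibly parallel, relations. It is a hypothetical Python data model written for illustration only (the authors' tooling is Delphi-based and stores the graph in SQL Server); the names Node, FactsGraph and add_relation are invented here.

from collections import defaultdict
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    name: str        # e.g. a unit, class or method identifier
    node_type: str   # one of the node types of Table 1 (Unit, Class, ...)

@dataclass
class FactsGraph:
    nodes: set = field(default_factory=set)
    # edges[(source, target)] -> list of relation type names;
    # parallel edges of the same or different types are allowed
    edges: dict = field(default_factory=lambda: defaultdict(list))

    def add_relation(self, source: Node, target: Node, relation_type: str) -> None:
        self.nodes.add(source)
        self.nodes.add(target)
        self.edges[(source, target)].append(relation_type)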


The pre-processing phase improves and supplements the results of the facts extraction and reduces the number of nodes for clustering. Based on the circumstances of the particular case, the pre-processing phase includes the following tasks: 1) marking of key nodes identified by business experts as known distinctive functional areas of the system; 2) calibration of weights characterizing the relative importance of relation types; 3) identification and blacklisting of omnipresent nodes (utilities, library classes etc.); 4) application of pre-processing rules to group tightly related nodes according to language-dependent coding patterns.

The results of the pre-processing phase are a list of nodes (excluding omnipresent nodes and grouped nodes), node groups (initial clusters) and similarities between nodes/clusters. Similarity is calculated between all nodes i and j using relations (the number or value r of relations of all types 1…z, where z is the count of relation types), weights (w) and the key node coefficient KeyNode(i, j) (Equation 1). Similarity between a cluster and related nodes/clusters is recalculated using one of the strategies: closest nodes, average distance between nodes, or farthest nodes.

sim(n_i, n_j) = KeyNode(i, j) * Σ_{x=1..z} w_x * r_x(i, j)    (1)
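Read literally, Equation 1 sums the weighted relation values of all relation types between two nodes and scales the result by the key-node coefficient. A minimal Python sketch of that calculation follows; treating KeyNode(i, j) as a simple boost factor (key_coeff) for pairs involving expert-marked key nodes is an assumption, since the paper does not fix its concrete form.

def similarity(i, j, relations, weights, key_nodes, key_coeff=2.0):
    # relations[(i, j)] maps relation type x -> r_x(i, j), the number or
    # value of relations of that type between nodes i and j;
    # weights maps relation type x -> w_x
    key = key_coeff if (i in key_nodes or j in key_nodes) else 1.0  # assumed form of KeyNode(i, j)
    r = relations.get((i, j), {})
    return key * sum(weights.get(x, 1.0) * r_x for x, r_x in r.items())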

The component identification phase identifies clusters of nodes using a hierarchical clustering algorithm7. The clusters can be considered as system components. The clustering process continues until all nodes are merged into one cluster or a defined threshold is reached. The result of the component identification phase is clusters and free nodes.

The post-processing phase improves the result of the component identification. Based on the circumstances of the particular case, the following tasks can be performed: 1) identification of the most suitable cluster for free nodes by adding a node to each component and calculating cohesion and/or coupling based on an optimization fitness function; and 2) interface identification using language-dependent rules. The result of the post-processing phase is the proposed clusters and/or their interfaces.

4. Case study

The method was applied for decomposition of a subsystem of the ERP system Horizon developed and maintained by Visma Enterprise. The source code subset used for the experiments is a multifunctional module with 223 806 SLOC written in Delphi.

Table 1. Node Types (Node Type – Description).

Program – Software module that is compiled in the form of an executable file (*.exe) and contains application initialization code.
Library – Software module that is compiled in the form of a dynamically loadable library (*.dll).
Package – A file that contains a collection of references to individual files (Pascal units) to group them into one logical functional package (*.dpk).
Unit – A single Delphi source code file (*.pas).
Class – Delphi language element "class". The following class type declarations are supported by our source code extraction tool: simple definition, inherited, forward declaration, helper, reference.
Record – Delphi language element "record" and its subtype "record helper for".
Interface – Delphi language element "interface".
DispInterface – Delphi system interface that serves as a superclass for the IDispatch interface. It provides recognition of COM object methods by their ID values.
Generic and GenericParam – Represents the Delphi term "generic" or "generic type". It describes the set of things in the Delphi platform that can be parameterized by type. The term generics can refer to either generic types or generic methods, i.e., generic procedures and generic functions.
CompType – A set of different compound types. The following compound types are recognized by our source code extraction tool: record, array, enumeration, set, subrange, type redef, pointer type, method and class of.
Gvar – All global Delphi variables, constants and resource strings.
Method – Global Delphi methods.
SClass – Virtual node named "singleton class". Each singleton class unites all global methods and variables declared in a particular Delphi unit.


4.1. Facts extraction

A custom-made tool was developed based on an open-source Delphi source code parser from CnPack. The tool was used to automatically parse the chosen subset of Delphi source code and extract the directed graph G = {N, E}. Node types are listed in Table 1. Between any two nodes there could be zero, one or multiple edges of the same or different relationship types (Table 2). Relationship categories were taken from8; additionally, Delphi-specific relationship types were added. The relationship types Table usage and Business Class Component were extracted using dynamic source code analysis. Another custom-made tool was developed to extract information about database tables used by source code classes. The graph data and database usage data were imported and merged into a SQL Server database. The resulting graph had 2582 nodes and 94266 edges.

Table 2. Relationship Types (Relationship Type – Category – Description).

Uses – Global based – Module uses another module or a module includes a unit.
Defines – Containment based – Delphi unit contains the definition of a class.
UsesAsArg – Association based – The method argument list of a class includes a variable whose type is another class.
UsesAsDeclVar – Association based – In the body of a class, a variable of another class is instantiated.
UsesAsDefVar – Association based – A class or unit defines a variable whose type is another class.
Extends – Inheritance based, incl. genericity based – Class definition extends another class definition.
Implements – Inheritance based, incl. genericity based – Class implements an interface.
Calls – Containment based – Class calls a method of another class.
Level 0-8 – Business Class Component – Nodes from dynamic source code analysis.
CommonTable – Table usage – Nodes use the same table.

4.2. Pre-processing

Several automated algorithms based on fan-in/fan-out analysis (e.g.,9,10) for identification of utility nodes did not provide the expected results – some identified utilities were business objects, and many known utilities were not included in the list of utilities. Therefore, utilities were identified manually by an expert. The pre-processing phase included marking of the library nodes so that they could be excluded from the clustering. The marking was performed by matching node names to combinations of known Delphi library class names and namespaces. These structures were considered uninformative for decomposition, and their inclusion into clusters would have no practical consequences for software redesign. In total 67 nodes and 7073 edges were blacklisted as omnipresent nodes.

During the first clustering iteration it was discovered that 25%-50% of the clustering time was spent on collecting nodes that have only a single relationship. Node chaining (Fig. 1a) was used to reduce the amount of data used in the clustering process. Relationships between the same nodes were aggregated, and the weight for all relationship types was set to 1. Weight seeding (Fig. 1b) was used to incorporate business domain expert knowledge about potential cluster centres; a sketch of both steps follows Fig. 1.

171

Inese Šūpulniece et al. / Procedia Computer Science 77 (2015) 167 – 175

Fig. 1. (a) Node chaining; (b) Seed weights for business centers.
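The two pre-processing optimizations can be sketched as follows. This is illustrative Python under stated assumptions: chaining is reduced here to "fold any node with exactly one neighbour into that neighbour", and seeding to "multiply the weights of edges touching an expert-marked business centre by a boost factor"; the paper does not spell out the exact rules or the boost value, so both are assumptions.

def chain_single_link_nodes(adjacency):
    # Node chaining (Fig. 1a): repeatedly fold nodes with exactly one
    # neighbour into that neighbour. adjacency: node -> set of neighbours.
    # Follow merged_into transitively (A -> B -> C) for final representatives.
    merged_into = {}
    changed = True
    while changed:
        changed = False
        for node, neighbours in list(adjacency.items()):
            if len(neighbours) == 1:
                (rep,) = neighbours
                merged_into[node] = rep      # node is absorbed by rep
                adjacency[rep].discard(node)
                del adjacency[node]
                changed = True
    return merged_into

def seed_weights(edge_weights, business_centres, boost=3.0):
    # Weight seeding (Fig. 1b): strengthen edges incident to expert-marked
    # business centres so clusters form around them. boost is an assumed value.
    # edge_weights: frozenset({a, b}) -> weight; business_centres: set of nodes.
    return {pair: w * boost if pair & business_centres else w
            for pair, w in edge_weights.items()}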

4.3. Clustering

The clustering phase was carried out using a modified hierarchical agglomerative algorithm. This bottom-up approach starts by considering each node a cluster and merges the two most similar clusters at each stage. The similarity of two nodes was measured using information about the edges between them. The calculation of similarity between nodes c_i and c_j in a graph with m nodes is shown in Equation 2.

sim(c_i, c_j) = 2 * n_edges(c_i, c_j) / ( Σ_{k=1..m} n_edges(c_i, c_k) + Σ_{l=1..m} n_edges(c_j, c_l) )    (2)

where n_edges(c_i, c_j) is the number of edges between nodes c_i and c_j. The cluster linkage criterion, which is used in this algorithm to evaluate the similarity of two clusters C_a and C_{b+c} after a merge (C_{b+c} = C_b + C_c) in the following iterations, measures the unweighted average similarity between both sets of nodes and is calculated using Equation 3.

sim(C_a, C_{b+c}) = ( sim(C_a, C_b) + sim(C_a, C_c) ) / 2    (3)

Since this criterion decreases with each iteration and the decomposed systems used in the study are large, it also sets a stopping point – the clustering process is stopped when the cluster similarity criterion exceeds the precision available in the SQL database. This criterion can be perceived as arbitrary, but it has proven sufficient to reach an end point with a result that can be passed to the post-processing stage. The clustering process has to be stopped before reaching the last iteration, which by design is a situation with only one cluster. The calculation of the cluster linkage criterion usually stops with several hundred clusters left, which is an adequate situation to start post-processing. In systems where the precision is exceeded too soon, or if there is a need to reach the last iteration, the linkage criterion can be adjusted by multiplying it by a coefficient larger than 1. This coefficient has to be tuned empirically because the dynamics of the linkage criterion depend on the features of the specific system.
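The clustering loop itself can be summarized in a few lines. The sketch below is illustrative Python, not the authors' implementation: it greedily merges the most similar pair (Equation 2 gives the initial node similarities), re-derives cluster similarity with the unweighted average linkage of Equation 3, and stops when the best similarity drops below a threshold standing in for the SQL precision limit described above.

def node_similarity(i, j, n_edges, degree):
    # Equation 2: sim(c_i, c_j) = 2 * n_edges(i, j) / (total edge count
    # incident to i + total edge count incident to j)
    return 2.0 * n_edges.get(frozenset((i, j)), 0) / (degree[i] + degree[j])

def agglomerate(sim, stop_threshold=1e-12):
    # sim: frozenset({a, b}) -> similarity of clusters a and b
    # stop_threshold stands in for the SQL precision limit (assumed value)
    while len({c for pair in sim for c in pair}) > 1:
        pair, best = max(sim.items(), key=lambda kv: kv[1])
        if best < stop_threshold:
            break                          # similarity no longer representable
        b, c = tuple(pair)
        merged = (b, c)                    # the new cluster C_{b+c}
        for a in {x for p in sim for x in p} - {b, c}:
            # Equation 3: unweighted average linkage after the merge
            sim[frozenset((a, merged))] = (sim.get(frozenset((a, b)), 0.0) +
                                           sim.get(frozenset((a, c)), 0.0)) / 2
        sim = {p: v for p, v in sim.items() if b not in p and c not in p}
    return sim                             # remaining inter-cluster similarities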


4.4. Post-processing

The post-processing step is carried out in order to assign smaller clusters and free nodes to larger clusters based on their relationships. The main goal in this step is to decrease the number of components (clusters and nodes) while increasing cohesion (the number of relationships between nodes in the same component) and decreasing coupling (the number of relationships between components). The post-processing step starts by identifying the large clusters, which will grow by absorbing smaller clusters and free nodes. The large clusters are identified based on expert knowledge or an arbitrary cut in the cluster size graph (see Fig. 2). The optimal assignment of the small clusters and nodes is found using a genetic algorithm. Each individual of the population consists of genes, one per small cluster/node, and the label attached to each gene indicates the large cluster to which that small cluster/node is assigned. Each solution (individual) is evaluated according to a fitness function intended to minimize coupling and maximize cohesion. The fitness function for the situation represented by solution x is shown in Equation 4 and is minimized in the process.

F(x) = coupling(x) / ( coupling(x) + cohesion(x) )    (4)

The genetic algorithm stops when the population becomes homogenous, the result has converged, or the time that can be dedicated to post-processing runs out (the time threshold depends on the size and complexity of the solution and should be determined empirically).
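For completeness, the fitness calculation of Equation 4 can be written out directly. This is an illustrative Python sketch; the representation of a solution as a mapping from small clusters/nodes to large-cluster labels follows the gene encoding described above, while the edge list format is an assumption.

def fitness(assignment, edges):
    # Equation 4: F(x) = coupling(x) / (coupling(x) + cohesion(x)), minimized.
    # assignment: small cluster/node -> label of the large cluster it joins
    # edges: iterable of (component_a, component_b, relation_count)
    coupling = cohesion = 0
    for a, b, count in edges:
        if assignment.get(a, a) == assignment.get(b, b):
            cohesion += count   # relationship stays inside one component
        else:
            coupling += count   # relationship crosses component boundaries
    total = coupling + cohesion
    return coupling / total if total else 0.0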

[Figure omitted: bar chart of cluster sizes; series "Node count" and "Number of clusters".]

Fig. 2. Numbers of clusters and their sizes.

5. Results

The decomposition process was carried out according to the method described above, using the ERP system module described in Section 4.



Process dynamics are illustrated in Figure 3. It shows the exponential decrease of the similarity metric value until it reaches a value so small that the precision of real numbers allowed by SQL is exceeded. Although this metric reaches very small values, it is enough to decompose the module. The graph on the right shows the dynamics of free node count, cluster count and inter-component relationship count.

Fig. 3. (a) Similarity dynamics; (b) Graph change dynamics.

It is clearly visible that the counts of free nodes and relationships decrease constantly throughout the whole process as they are absorbed into clusters. The cluster count grows for the first 500 iterations, while mostly new clusters are created from free nodes, and then it starts decreasing, which points to the merging of clusters. This part of the process is important because a useful decomposition should contain a small enough number of components for them to be post-processed and interpreted. The end situation holds 126 clusters, 107 free nodes and 6 edges. The small number of edges is one of the reasons this end state was reached. This situation could be caused by strict blacklisting or by the nature of the decomposed system – the source code holds only one module and could have contained some interface-specific elements. This is also the reason why the post-processing phase was not executed at this time – it requires expert interpretation and the return of blacklisted nodes/edges to the final graph. The situation in the iterations close to the end of the process is very similar: there is one large cluster connected to a few smaller clusters (with less than 30 nodes) and many small clusters and free nodes without any edges connecting them to other elements (see Fig. 4 for a representation of iteration 1200, where only 5 clusters are interconnected; the numbers represent the number of nodes in each cluster, and the small clusters without connections are shown as circles without a number). Nevertheless, this result is substantial: the test run allowed the authors to improve the method and to verify that it can be used to achieve the goal set for this study and that it is applicable to large software systems like Horizon.


[Figure omitted: interconnected clusters with node counts 2536, 55, 52, 5 and 3.]

Fig. 4. State of decomposition for iteration number 1200.

6. Conclusion and future work

The overall method has shown satisfactory results in the test runs. When applying this method to the full system and in other circumstances, small adjustments in data extraction and algorithm parameters might be necessary and should be made based on empirical testing.

Fact extraction is the most important stage of decomposition. Selecting the correct types of objects and relationships is the most significant precondition. Missing important objects will result in meaningless components, while a lack of relationships can cause a poorly connected graph that consists of random unrelated components. The pre-processing stage can measurably reduce resource use and enhance the quality of the resulting decomposition. Exclusion of unimportant and noisy objects and relationships improves decomposition speed and results in more meaningful graphs due to noise reduction.

The clustering process presented in this study has been empirically shown to be suitable for decomposition of large software systems. Although the similarity measure leaves room for improvement, it is satisfactory for the presented application. The similarity measure has to be adjusted for each specific application. The presented similarity metric can be adapted and will be used in further research for decomposition of the full Horizon system.

The obtained results did not provide an opportunity to test the post-processing step, but unit tests of the method show promising results. This method will be tested in future research on the decomposition of the full Horizon software system.

Acknowledgements The research has been conducted within the framework of European Regional Development Fund's project “Information and communication technologies competence center” Nr. KC/2.1.2.1.1/10/01/001 (Contract No. L-KC11-0003, www.itkc.lv) activity 1.3. “The Method of Monolithic System Decomposition According to SOA Principles.”

References

1. Singh A, Wesson J. Improving the Usability of ERP Systems through the Application of Adaptive User Interfaces. In: ICEIS 2009 – 11th International Conference on Enterprise Information Systems, Proceedings; 2009:208-214.
2. Anquetil N, Laval J. Legacy Software Restructuring: Analyzing a Concrete Case. In: Proceedings of the 15th European Conference on Software Maintenance and Reengineering, CSMR 2011. IEEE; 2011:279-286.
3. Supulniece I, Polaka I, Berzisa S, Meiers E, Ozolins E, Grabis J. Decomposition of Enterprise Application: A Systemic Literature Review and Research Outlook. Computer Science: Information Technology and Management Science. 2015: In Press.
4. Berzisa S, Polaka I, Supulniece I, Grabis J, Ozolins E, Meiers E. Method for Decomposition of Monolithic Enterprise Applications. 2015: In Press.
5. Supulniece I, Berzisa S, Polaka I, Meiers E, Ozolins E, Grabis J. Source Code Driven Decomposition of Object-Oriented Legacy Systems: A Systemic Literature Review and Research Outlook. In: PROFES 2015. 2015: In Press.
6. Glorie M, Zaidman A, van Deursen A, Hofland L. Splitting a Large Software Repository for Easing Future Software Evolution – an Industrial Experience Report. J Softw Maint Evol Res Pract. 2009;21(2):113-141.
7. Cui JF, Chae HS. Applying Agglomerative Hierarchical Clustering Algorithms to Component Identification for Legacy Systems. Inf Softw Technol. 2011;53(6):601-614.
8. Muhammad S, Maqbool O, Abbasi AQ. Evaluating Relationship Categories for Clustering Object-oriented Software Systems. IET Softw. 2012;6(3):260-274.
9. Hamou-Lhadj A, Lethbridge T. Summarizing the Content of Large Traces to Facilitate the Understanding of the Behaviour of a Software System. In: IEEE International Conference on Program Comprehension; 2006:181-190.
10. Sora I, Glodean G, Gligor M. Software Architecture Reconstruction: An Approach Based on Combining Graph Clustering and Partitioning. In: 2010 International Joint Conference on Computational Cybernetics and Technical Informatics. IEEE; 2010:259-264.

Inese Supulniece holds a Doctoral degree and is a Researcher at the Institute of Information Technology of Riga Technical University (Latvia). Her main research fields are ERP system usability, user-adaptive systems, user modelling, business process modelling and process personalization in business applications. Her professional activities are related to business and system analysis.

Inese Polaka holds a Doctoral degree and is a Lecturer at the Institute of Information Technology of Riga Technical University (Latvia). Her main research interests include data mining, machine learning, classifiers, evolutionary algorithms and their applications, as well as bioinformatics and biostatistics.

Solvita Berzisa holds a Doctoral degree (2012) and is a Lecturer and Researcher at the Institute of Information Technology of Riga Technical University (Latvia). Her main research fields are IT project management, project management information system implementation and application, as well as project data analytics. She also works as an IT Project Manager at Exigen Services Latvia. She holds a PMP certificate and received the IPMA Outstanding Research Contribution of a Young Researcher award in 2013. She is a member of PMI, IIBA and the Latvian National Project Management Association.

Janis Grabis is a Professor at Riga Technical University and the Head of the Institute of Information Technology. His main research interests lie in the application of mathematical programming methods in information technology, enterprise applications and system integration. He has published more than 60 scientific papers, including a monograph on supply chain configuration. He has led a number of national projects and has participated in five projects in collaboration with the University of Michigan-Dearborn (USA).