Testing Density as a Complexity Metric for EPCs

Jan Mendling
Vienna University of Economics and Business Administration
Augasse 2-6, A-1090 Wien, Austria
[email protected]

Abstract: Measuring complexity of business process models is a rather new area of research. Since there are only a few contributions, it is not yet clear which metrics are needed in order to guide and increase the quality of model design. In this paper, we propose a density metric inspired by social network analysis in order to quantify the complexity of an EPC business process model on a scale between zero and one. In particular, we consider minimum and maximum number of arcs for a given set of function, event, and connector nodes. Furthermore, we test the EPC density metric in combination with simple metrics of size for its capability to predict errors in the SAP reference model. While the significance of density is promising, the experiment reveals that there are further metrics needed in addition to density.

1 Introduction

Measuring the complexity of business process models has recently gained increasing interest in the research community. An analysis of the SAP reference model revealed that complexity seems to be a key determinant of error probability (see [MMN+ 06b, MMN+ 06a]). Against this background, it is promising to consider complexity metrics in the design phase of process models in order to quantify which of multiple design alternatives might be easier for model users to understand and less prone to errors. Yet, the notion of complexity in business process models is still little understood, and the number of publications on this new research topic is less than a dozen.

Pioneering work by Cardoso adapts insights from software engineering and software metrics to business process models. In [Car05] he proposes a metric called control flow complexity that is inspired by McCabe's cyclomatic complexity [McC76, MB89]. Further software metrics, such as lines of code [Kal90], Halstead complexity [Hal87], or Henry and Kafura's information flow metric [HK81], have been considered for business process models in [CMNR06, LG06]. Graph theory is a second source of inspiration. From this area, network complexity, restrictiveness, and tree-width were proposed as complexity metrics, e.g. in [LK01, ARGP06].

A problem with these metrics is that they do not provide a proper distinction between size and complexity. Since several of them tend to grow with the number of process elements, it is difficult to compare the complexity of models of different size. In this context, it is desirable to design a complexity metric that yields a rational number between zero, indicating absence of complexity, and one for maximum complexity. With such a

metric, it would be possible to compare the complexity of models even if one is large and the other small.

Against this background, the contribution of this paper is twofold. First, we consider social network analysis and identify density as a metric that comes close to these requirements. Second, we define density in such a way that it meets the syntactic restrictions of EPCs.

The remainder of the paper is structured as follows. Section 2 introduces density in social network analysis and adapts it to workflow nets and YAWL. Section 3 discusses density of EPCs and presents a respective density definition. In Section 4, we test the density metric as a predictor of errors in the SAP reference model. Finally, Section 5 concludes the paper and gives an outlook on future research.

2 Social Network Analysis and Density

Social network analysis is the quantitative study of relationships between social entities. Social entities in this context range from individuals on the micro level up to organizations and states on the macro level. The principal idea of social network analysis is to represent the entities and relationships of interest as a graph and to analyze it by calculating metrics on the structure of the graph. There is an extensive set of metrics available to quantify various aspects of a network as a whole or regarding the position of a node in the network. Prominent examples of metrics that refer to the network as a whole are size, diameter, and density: size is frequently identified with the number of nodes; the diameter refers to the largest geodesic distance between two nodes in the network, i.e. the longest shortest path; density relates the number of arcs to the number of all possible arcs for a given number of nodes. For a more in-depth introduction to social network analysis refer e.g. to [Was94].

In the introduction we identified the requirement that a complexity metric for business process models should be unbiased towards size and range between zero and one. Since density d relates the number of arcs to the number of potential arcs, it is a good starting point for designing a respective business process density metric. For a workflow net [vdA97] W = (P, T, A), with P referring to the set of places, T to the set of transitions, and A ⊆ (P × T) ∪ (T × P) to the set of arcs, the density can easily be calculated as follows:

d_W = |A| / (|P| ∗ |T| + |T| ∗ |P|)

For a YAWL net [vdAtH05] Y = (C, T, A), with C referring to the set of conditions, T to the set of tasks, and A to the set of arcs, the density can be calculated as

d_Y = |A| / (|C| ∗ |T| + |T| ∗ |C| + |T| ∗ |T|)

since tasks can be connected directly without a condition in between. In contrast to workflow nets and YAWL nets, the density of EPCs is more difficult to calculate, since EPCs impose more syntactical constraints on nodes and arcs.
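To make the two formulas concrete, here is a minimal sketch in Python; the function names and the convention of passing node and arc counts directly are our own choices for illustration, not part of any workflow tooling.

```python
def workflow_net_density(num_places: int, num_transitions: int,
                         num_arcs: int) -> float:
    # In a workflow net, arcs only run between places and transitions,
    # so at most |P|*|T| + |T|*|P| arcs are possible.
    max_arcs = num_places * num_transitions + num_transitions * num_places
    return num_arcs / max_arcs

def yawl_net_density(num_conditions: int, num_tasks: int,
                     num_arcs: int) -> float:
    # In a YAWL net, tasks may also be connected directly to each other,
    # which adds |T|*|T| potential arcs to the denominator.
    max_arcs = (num_conditions * num_tasks
                + num_tasks * num_conditions
                + num_tasks * num_tasks)
    return num_arcs / max_arcs

# A minimal sequence p1 -> t1 -> p2 (2 places, 1 transition, 2 arcs):
print(workflow_net_density(2, 1, 2))  # 2 / 4 = 0.5
```

Note that for the same counts the YAWL denominator is larger, so a YAWL net with identical structure receives a lower density value.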

3 Density of EPCs

In this section, we introduce density for EPCs. First, we consider density between connectors and then for a whole EPC, and we define density with respect to the minimum and the maximum number of arcs given the syntax constraints of an EPC.

Before calculating the minimum and maximum number of arcs for an EPC, we have to consider syntactical constraints. The following three constraints from a larger set of constraints (see e.g. [NR02]) are important in this context:

• Minimal EPC: An EPC must have at least one start event, one end event, and one function.

• Cardinality: Each function has one input and one output arc, each event has at most one input and one output arc, and connectors have either one-to-many (splits) or many-to-one (joins) input-output cardinality.

• Coherence: An EPC is a coherent graph.

Let EPC = (E, F, C, l, A) be an EPC with E being the set of events, F the set of functions, C the set of connectors, l a mapping from the connectors onto the connector labels, and A the set of arcs. The small letters n, e, f, c, and a refer to the number of nodes, events, functions, connectors, and arcs, respectively. Given the syntax constraints, the minimum number of arcs amin = n − 1 is achieved by a coherent EPC that is a sequence of alternating functions and events. Furthermore, if there are connectors, each of them introduces a minimum branching with one additional arc on the one hand; on the other hand, an additional branch implies an additional start or end node that has one arc less than a node in a sequence. The minimum number of arcs therefore remains n − 1. Figure 1 illustrates this fact for six nodes.

Figure 1: EPCs with a minimum density

The maximum number of arcs is less trivial to calculate. Figure 2 illustrates the construction of EPCs with a maximum number of arcs for an increasing number of connectors. For one connector, there is just one input and one output arc. In the case of two connectors, the first connector is a join and the second a split. With three connectors, two arcs are added: one from the split and one to the join. From there on, the construction works as follows: if the number of connectors c becomes even, add a join that points to an existing join and that receives input from all splits, yielding c/2 + 1 additional arcs. If the number of connectors becomes odd, add a split behind an existing split that links to all joins, resulting in (c − 1)/2 + 1 new arcs.

c   a    ∆
1   2    1
2   4    2
3   6    2
4   9    3
5   12   3
6   16   4

Figure 2: EPCs with a maximum density

Figure 2 indicates the number of arcs a for a given number of connectors c and the increase ∆. Accordingly, we obtain the following formulae for the maximum number of arcs between connectors:

cmax_even = (c/2 + 1)^2

and

cmax_odd = ((c − 1)/2 + 1)^2 + (c − 1)/2 + 1

and cmax = 1 if c is one or less. Beyond that, the density metric has to reflect that amin = n − 1 is the minimum number of arcs and that the maximum number of arcs must include arcs to and from events and functions. The maximum number of arcs that an additional event or function can add is two, namely when it is introduced on an additional branch between a split and a join connector. Accordingly, we define the density of an EPC for an even and an odd number of connectors as follows:

d_even = (a − amin) / (cmax_even + 2 ∗ (e + f) − amin)

and

d_odd = (a − amin) / (cmax_odd + 2 ∗ (e + f) − amin)

and d = 0 if c is one or less.
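Putting the pieces together, a minimal sketch of the EPC density metric might look as follows. The function names are ours, and the counts e, f, c, and a are passed in directly rather than extracted from a concrete EPC model.

```python
def a_min(e: int, f: int, c: int) -> int:
    # Minimum number of arcs of a coherent EPC with n = e + f + c nodes:
    # a sequence of alternating functions and events needs n - 1 arcs,
    # and extra branches introduced by connectors do not change this.
    return e + f + c - 1

def c_max(c: int) -> int:
    # Maximum number of arcs between c connectors (cf. Figure 2);
    # the paper sets this to 1 for at most one connector.
    if c <= 1:
        return 1
    if c % 2 == 0:
        return (c // 2 + 1) ** 2
    return ((c - 1) // 2 + 1) ** 2 + (c - 1) // 2 + 1

def epc_density(e: int, f: int, c: int, a: int) -> float:
    # Density on a scale from 0 (sequence) to 1 (maximally connected);
    # defined as 0 when the EPC has at most one connector.
    if c <= 1:
        return 0.0
    lo = a_min(e, f, c)
    return (a - lo) / (c_max(c) + 2 * (e + f) - lo)

# The formulas reproduce the arc counts a of Figure 2 for c >= 2 ...
assert [c_max(c) for c in range(2, 7)] == [4, 6, 9, 12, 16]
# ... and a pure sequence (no connectors) has density zero.
assert epc_density(e=3, f=2, c=0, a=4) == 0.0
```

The two assertions serve as a sanity check that the formulas agree with the values tabulated in Figure 2 and with the intended behavior for a plain sequence.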

4 Testing Density with the SAP Reference Model

In this section, we present first empirical evidence on whether density can serve as a predictor of error probability. We utilize an analysis of the SAP reference model (see e.g. [CKL97, KT98]) reported in [MMN+ 06b, MMN+ 06a] as the data sample for our test. This sample covers 604 EPC business process models, of which 34 have errors with respect to a relaxed soundness [DR01] check. Based on the sample, we test the hypothesis that complexity is a determinant of error probability. We operationalize complexity with density and size. For size, we consider the number of nodes and the number of arcs, respectively. These variables are used as independent variables to explain the variance of the dependent variable "hasError".

As the dependent variable is binary, we use a logistic regression (logit) model. The idea of a logit model is to capture the probability of a binary event by its odds, i.e., the ratio of event probability divided by non-event probability. The relationship between independent and dependent variables is represented by the S-shaped curve of the logistic function, which converges to 0 for −∞ and to 1 for ∞. The cut value of 0.5 defines whether the event or the non-event is predicted. Exp(B) then gives the change in the odds if the independent variable B is increased by one unit. The logit model is calculated by maximizing the likelihood value. The significance of the model can be interpreted with the help of two statistics: the Hosmer&Lemeshow statistic is a goodness-of-fit test for the null hypothesis that the difference between observed and predicted values is zero; for an adequate fit, this statistic should be greater than 5%. Nagelkerke's R2 statistic ranges from 0 to 1 and indicates how much of the variance is explained by the model in comparison to a base model without predictors. Beyond that, each coefficient has to be tested with the Wald statistic for being significantly different from zero; the significance should be less than 5%. For more details on logistic regression see e.g. [HATB98].

In Table 1 we also give the percentage of correct classifications and the number of correctly predicted EPC models with errors. As our sample includes only 5.6% error cases, a correct classification rate of 94.4% is easily achieved by always predicting that the EPC is correct. Therefore, the number of correctly predicted errors is more interesting in this context. Table 1 basically reveals two things. First, the Wald statistics for all coefficients indicate that they are significantly different from zero, i.e., they do have an impact on error probability.

                     | Density            | Density&Nodes      | Density&Arcs
Coefficient          | Exp(B)    Wald Sig.| Exp(B)    Wald Sig.| Exp(B)    Wald Sig.
Constant             | 0.046     0.0%     | 0.016     0.0%     | 0.017     0.0%
density              | 435.020   0.1%     | 394.507   0.4%     | 269.885   0.7%
n                    | -         -        | 1.018     0.0%     | -         -
a                    | -         -        | -         -        | 1.035     0.0%
Hosmer&Lem. Sig.     | 0.0%               | 0.3%               | 0.2%
Nagelkerke R2        | 4.7%               | 19.0%              | 19.3%
Correct Classif.     | 94.2%              | 94.4%              | 94.4%
Correct Error Pred.  | 0                  | 3                  | 3

Table 1: Logit Models based on potential Error Determinants

In particular, density has a very strong positive influence: an increase in density implies an over-proportional increase in error probability. Second, the statistics for the model as a whole leave room for improvement. The Hosmer&Lemeshow statistic is less than 5%, which is not desirable, and the Nagelkerke R2 signals a mediocre explanation of the variance of the dependent variable. Therefore, it seems that density and size together can only partially explain error probability, and additional error determinants have to be identified.
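The kind of test reported above can be reproduced on any sample of models. The sketch below fits a one-feature logit model by gradient ascent on synthetic data; the data-generating process and the hand-rolled fitting routine are our own illustration, not the paper's actual setup or tooling.

```python
import math
import random

def fit_logit(xs, ys, lr=0.5, epochs=5000):
    # One-feature logistic regression fitted by gradient ascent on the
    # average log-likelihood; returns (intercept, coefficient).
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p
            g1 += (y - p) * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Synthetic sample: the probability of "hasError" grows with "density".
random.seed(0)
xs = [random.random() for _ in range(400)]
ys = [1 if random.random() < x else 0 for x in xs]

b0, b1 = fit_logit(xs, ys)
odds_ratio = math.exp(b1)  # Exp(B): odds multiplier per unit of density
print(f"Exp(B) estimate: {odds_ratio:.2f}")  # should be well above 1 here
```

Since errors were generated to become more likely with density, the fitted coefficient b1 is positive and Exp(B) exceeds 1, mirroring the direction of the effect reported in Table 1.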

5 Conclusion

In this paper, we designed a density metric inspired by social network analysis in order to quantify the complexity of an EPC business process model on a scale between zero and one. In particular, we considered the minimum and maximum number of arcs for a given set of function, event, and connector nodes. Furthermore, we tested the EPC density metric in combination with simple metrics of size for their capability to predict errors in the SAP reference model. This experiment yields a mixed result: on the one hand, density has an over-proportional, positive impact on error probability at a significance level better than 99%. On the other hand, density and size together are not sufficient to explain the variance of errors, as indicated by the Nagelkerke and the Hosmer&Lemeshow statistics.

Since the measurement of business process model complexity is a rather new area, the definition of density is only a single step towards a suite of meaningful complexity metrics. In future research, we aim to design additional metrics for EPC business process models that might be helpful to predict errors. Furthermore, we aim to test them on a large set of models from practice for their capability to serve as error predictors. Beyond that, complexity metrics are an interesting tool to guide next-generation process-mining algorithms in order to arrive at a model that maximizes some goal metric for a given log.

References

[ARGP06]

Elvira Rolón Aguilar, Francisco Ruiz, Félix García, and Mario Piattini. Towards a suite of metrics for business process models in BPMN. In Yannis Manolopoulos, Joaquim Filipe, Panos Constantopoulos, and José Cordeiro, editors, ICEIS 2006 - Proceedings of the Eighth International Conference on Enterprise Information Systems: Databases and Information Systems Integration (III), Paphos, Cyprus, May 23-27, 2006, pages 440–443, 2006.

[Car05]

J. Cardoso. Workflow Handbook 2005, chapter Evaluating Workflows and Web Process Complexity, pages 284–290. Future Strategies, Inc., Lighthouse Point, FL, USA, 2005.

[CKL97]

Thomas Curran, Gerhard Keller, and Andrew Ladd. SAP R/3 Business Blueprint: Understanding the Business Process Reference Model. Enterprise Resource Planning Series. Prentice Hall PTR, Upper Saddle River, 1997.

[CMNR06]

J. Cardoso, J. Mendling, G. Neumann, and H. Reijers. A Discourse on Complexity of Process Models. In Schahram Dustdar et al., editors, Proceedings of BPM 2006, Lecture Notes in Computer Science, Vienna, Austria, 2006. Springer-Verlag.

[DR01]

Juliane Dehnert and Peter Rittgen. Relaxed Soundness of Business Processes. In K. R. Dittrich, A. Geppert, and M. C. Norrie, editors, Proceedings of the 13th International Conference on Advanced Information Systems Engineering, volume 2068 of Lecture Notes in Computer Science, pages 151–170, Interlaken, 2001. Springer.

[Hal87]

M. H. Halstead. Elements of Software Science. Elsevier, Amsterdam, 1987.

[HATB98]

J. F. Hair, jr., R. E. Anderson, R. L. Tatham, and W. C. Black. Multivariate Data Analysis. Prentice-Hall International, Inc., 5th edition, 1998.

[HK81]

S. Henry and D. Kafura. Software structure metrics based on information-flow. IEEE Transactions On Software Engineering, 7(5):510–518, 1981.

[Kal90]

G. E. Kalb. Counting lines of code, confusions, conclusions, and recommendations. Briefing to the 3rd Annual REVIC User’s Group Conference, 1990.

[KT98]

G. Keller and T. Teufel. SAP(R) R/3 Process Oriented Implementation: Iterative Process Prototyping. Addison-Wesley, 1998.

[LG06]

Ralf Laue and Volker Gruhn. Complexity metrics for business process models. In Witold Abramowicz and Heinrich C. Mayr, editors, 9th International Conference on Business Information Systems (BIS 2006), volume 85 of Lecture Notes in Informatics, pages 1–12, 2006.

[LK01]

Antti M. Latva-Koivisto. Finding a complexity measure for business process models. Research report, Helsinki University of Technology, February 2001.

[MB89]

T. J. McCabe and C. W. Butler. Design complexity measurement and testing. Communications of the ACM, 32:1415–1425, 1989.

[McC76]

T. J. McCabe. A complexity measure. IEEE Transactions on Software Engineering, 2(4):308–320, 1976.

[MMN+ 06a]

J. Mendling, M. Moser, G. Neumann, H.M.W. Verbeek, B.F. van Dongen, and W.M.P. van der Aalst. A Quantitative Analysis of Faulty EPCs in the SAP Reference Model. BPM Center Report BPM-06-08, Eindhoven University of Technology, Eindhoven, 2006.

[MMN+ 06b]

J. Mendling, M. Moser, G. Neumann, H.M.W. Verbeek, B.F. van Dongen, and W.M.P. van der Aalst. Faulty EPCs in the SAP Reference Model. In S. Dustdar, J.L. Fiadeiro, and A. Sheth, editors, Proceedings of BPM 2006, volume 4102 of Lecture Notes in Computer Science, pages 451–457, Vienna, Austria, 2006. Springer-Verlag.

[NR02]

M. Nüttgens and F. J. Rump. Syntax und Semantik Ereignisgesteuerter Prozessketten (EPK). In J. Desel and M. Weske, editors, Proceedings of Promise 2002, Potsdam, Germany, volume 21 of Lecture Notes in Informatics, pages 64–77, 2002.

[vdA97]

W. M. P. van der Aalst. Verification of Workflow Nets. In Pierre Azéma and Gianfranco Balbo, editors, Application and Theory of Petri Nets 1997, volume 1248 of Lecture Notes in Computer Science, pages 407–426. Springer Verlag, 1997.

[vdAtH05]

Wil M. P. van der Aalst and Arthur H. M. ter Hofstede. YAWL: Yet Another Workflow Language. Information Systems, 30(4):245–275, 2005.

[Was94]

Stanley Wasserman. Social Network Analysis: Methods and Applications. Structural Analysis in the Social Sciences. Cambridge University Press, 1994.