2015 IEEE/ACM 37th IEEE International Conference on Software Engineering
Statistical Learning and Software Mining for Agent Based Simulation of Software Evolution
Verena Honsel
Institute of Computer Science, University of Göttingen
Goldschmidtstrasse 7, 37077 Göttingen, Germany
[email protected]
Abstract - In the process of software development, it is of high interest for a project manager to gain insights into the ongoing process and possible development trends at several points in time. Substantial factors influencing this process are, e.g., the constellation of the development team, the growth and complexity of the system, and the error-proneness of software entities. For this purpose we build an agent based simulation tool which predicts the future of a project under given circumstances, stored in parameters which control the simulation process. We estimate these parameters with the help of software mining. Our work exposed the need for a more fine-grained model of developer behavior. We therefore create a learning model which helps us to understand the contribution behavior of developers and, thereby, to determine simulation parameters close to reality. In this paper we present our agent based simulation model for software evolution and describe how methods from statistical learning and data mining serve to estimate suitable simulation parameters.
Index Terms - Software Process Simulation; Hidden Markov Model; Agent Based Modeling; Developer Behavior
I. INTRODUCTION
Understanding the development of software systems, and especially the behavior of developers, is an important step toward identifying the factors that influence software evolution. Previous work focused solely on software evolution and its influencing factors [1] [2], also in a predictive context [3] [4], but did not approach the problem with simulation methods. Simulation allows us to describe complex processes, even those with chaotic behavior, in an elegant way and then to try out and compare different evolutionary scenarios [5]. To do so, it is important to take a closer look at the system to be simulated. In our case, this is the process of software development. Software evolves in growth and complexity over time [6]. Tools exist for specific research questions, e.g., for the analysis of object oriented software or the detection of code duplication [7]. Our approach is to combine software quality assurance issues with social and process controlled factors which influence the software development process. From this point of view it is expedient to treat the process to be simulated as agent based. Agents are autonomous individuals whose behavior is specified by certain rules [8]. They can be active as well as passive. In our context, the developers are active agents and the software entities are passive ones. By mining software repositories we are able to estimate simulation parameters for the simulation of software evolution and compare empirical with simulated results. For this simulation model we use average developers. This means that each developer spends the same effort on the project. Since this does not hold true in reality, we identified the need for a more exhaustive investigation of developer behavior. The contribution behavior of developers depends on their personal state of experience and learning in the project they are working on. When analyzing repositories, only the output of these states, such as code contributions, is visible. To gain knowledge about the underlying hidden states, it is useful to employ Hidden Markov Models (HMMs). HMMs are stochastic models which are very flexible and useful for studying a set of observations in discrete time. The outcome of this technique is a sequence of learning states for each developer. By evaluating these sequences for the different developers, we gain valuable insights for adjusting the behavior of the developer agents.
978-1-4799-1934-5/15 $31.00 © 2015 IEEE DOI 10.1109/ICSE.2015.279
II. RELATED WORK
We base our approach on the work of Smith et al. [9]. Their model considers the fitness of modules, which can be seen as a goodness indicator, and their complexity, as well as a boredom threshold on the developer side. The latter means that developers can leave the project if they consider it boring. The probability distributions which underlie the simulation processes are not properly explained. Moreover, Smith et al. take refactorings into account as a countermeasure against complexity. This is a good approach, but the assumption that module complexity grows randomly with every change, and the fact that modules can easily remain in a rather complex state, do not match our experience with software development processes. While some work has been done on the behavior of developers and their influence on software evolution [10], to the best of our knowledge, Singh et al. [11] performed the only work similar to our suggested HMM based approach. They use an HMM in which developers learn from their own and their peers' experiences, i.e., their contributions and their communication with others via mailing lists. We extend this approach by splitting the communication activities into questions, i.e., threads opened by a developer, and answers. We also add issue-solving activities, which can be extracted from bug tracking systems like
ICSE 2015, Florence, Italy Doctoral Symposium
Bugzilla [12]. This is of interest because developers discuss bugs and enhancements in such systems.
III. RESEARCH OBJECTIVE
The research questions under investigation deal, on the one hand, with the software evolution process itself, in order to reveal patterns, rules, and triggers of software development processes. On the other hand, we are interested in the developers and want to find behavioral rules and traceable strategies. We formulate the following overall research objective:
• How well can a simulation reproduce and predict quality attributes of software processes?
We address this objective through the following research questions:
• RQ1 How can we estimate simulation parameters with software mining for fitting an accurate agent based simulation model for software evolution that reflects system growth, bug trends, and developer activity?
• RQ2 Can we understand developer contribution behavior in order to create different agents for different developer types?
• RQ3 Can we determine different phases of software development and their influence on other simulated parameters?
IV. PROPOSED SOLUTION AND CURRENT PROGRESS
In this section we describe our approach for achieving our research objective and how we want to answer our research questions. We first describe our underlying simulation model. Then we explain the application of HMMs for analyzing developer learning behavior and for discovering software phases. Finally, we state the next steps and our approach for validating our results.
A. Software Process Simulation Model
In contrast to the extensive model of Smith et al. [9], we tailored our current model towards three specific questions: system growth, bug density, and developer activity. This way, we obtain a clear structure for our simulation model that can be extended to other research questions. As part of our research on RQ1, we built the simulation model shown in Figure 1. Accordingly, we selected the parameters according to the three parts of the software development process we wanted to simulate. To estimate them we examined a large open source software project with over ten years of development. This way, we are able to reproduce the system growth of this software project: Figure 2 depicts the real growth and Figure 3 the growth we simulated. We also derived stochastic distributions for the kinds of software changes as well as for the lifetime of bugs. Additionally, we gained some interesting observations about the collaboration of developers by investigating developer-file networks. In addition to our work in [13], we are now able to simulate the developer-file networks and have added some file properties. Finally, we are still working on an elegant way to describe the dependencies between the files.
Fig. 1. Agent Based Simulation Model for Software Processes [13].
Fig. 2. Empirical System Growth of K3b [13].
Fig. 3. Simulated System Growth [13].
B. HMM for Developer Contribution Behavior
The model described above contains only average developers who spend the same effort on the project. This was satisfactory for a first simulation, but we identified the need for more sophisticated models of developer behavior. To answer RQ2 we identified HMMs as appropriate. Our idea for learning the contribution behavior extends the work by Singh et al. [11] and is summarized in Figure 4.
Fig. 4. Process for Developer Contribution Behavior Learning.
In the HMM, we have on one side a sequence of observations and on the other side a sequence of hidden states, which represents the result of the learning process. We will use the hidden states as experience states that influence the contribution behavior. As input for the developer experience, we have already collected data about contribution behavior (commits and bugfixes), communication behavior (the number of threads opened and of answers to open threads), and bug activities (bug comments and bug reports). In Figure 4, these learning activities are shown at the bottom together with the sources from which the input data is extracted. To use this data as input for the learning model, it must be preprocessed, because our observations are multidimensional: we have a set of six observations at each point in time. To deal with multiple observations, we could either use combinatorial methods [14] or map our observations into a comprehensible format. We chose the second option and use a classifier that learns thresholds [15], which divide our observations for each category into low, medium, and high. To build the model with parameters close to reality, we decided to conduct another case study. We selected an open source project which has all three data sources available and also enough developers to learn from, the web browser Rekonq [16]. Moreover, we only consider developers with at least twenty commits. After this filtering, ten active contributors remained in the project. The monthly contribution behavior is pictured in Figure 5 for commits and in Figure 6 for bugfixes.
Fig. 5. Commits. (Axes: month, commits; legend: Developer.)
Fig. 6. Bugfixes. (Axes: month, bugfixes; legend: Developer.)
Here it can be seen that the project has one main contributor (dev1). However, a closer look at Figure 7 for the threads opened and Figure 8 for the answers in threads, which together reveal the communication activities of the same developers, shows that other developers are also very active in communication. One possible explanation is that less experienced developers need to ask for help more often. This is also part of the learning process. Table I summarizes all developer learning activities.
Fig. 7. Opened Threads. (Axes: month, threads opened; legend: Developer.)
Fig. 8. Responses. (Axes: month, responses; legend: Developer.)
TABLE I
MONTHLY DEVELOPER CONTRIBUTION AND COMMUNICATION BEHAVIOR OF REKONQ.
Factor            Max    Mean    Std.Dev.
commits           85     32.1    21.1
bugfixes          55     20.1    16.4
threads opened    20     4.3     4
responses         100    30.4    24.3
bug comments      164    38.2    48.9
bug reports       10     5.2     3.3
These observations of monthly developer activities serve as input for the HMM. For this, we first calculate the thresholds for low, medium, and high behavior. The thresholds we retrieved from the Rekonq project are shown in Table II.
TABLE II
THRESHOLDS FOR CLASSIFYING CONTRIBUTION BEHAVIOR OF DEVELOPERS.
Threshold    Contributions            Communication         Bug Activity
             (Commits/Bugfixes)       (Opened/Responses)    (Bug Comments/Reports)
low          < 13.8/0                 < 0/11                < 20.7/0
medium       ≥ 13.8/0, < 34.5/5.5     ≥ 0/11, < 0/27.5      ≥ 20.7/0, < 62.1/0
high         ≥ 34.5/5.5               ≥ 0/27.5              ≥ 62.1/0
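The thresholds reported in Table II can be applied with a small lookup, sketched below. The threshold values are taken from Table II; everything else is an assumption on our part. In particular, the rule that a level is reached only when both metrics of a category meet their bound is our reading of the paper's example (a high contribution month requires at least 35 commits and 6 bugfixes), and the names THRESHOLDS and classify are hypothetical. The threshold-learning classifier of [15] is not reproduced here.

```python
# Per metric: (lower bound for "medium", lower bound for "high"), from Table II.
THRESHOLDS = {
    "contributions": {"commits": (13.8, 34.5), "bugfixes": (0.0, 5.5)},
    "communication": {"threads_opened": (0.0, 0.0), "responses": (11.0, 27.5)},
    "bug_activity": {"bug_comments": (20.7, 62.1), "bug_reports": (0.0, 0.0)},
}


def classify(category, counts):
    """Classify one month of activity in a category as low, medium, or high.

    Assumption: a level is reached only if every metric of the category
    meets the corresponding lower bound.
    """
    bounds = THRESHOLDS[category]
    if all(counts[metric] >= high for metric, (_, high) in bounds.items()):
        return "high"
    if all(counts[metric] >= medium for metric, (medium, _) in bounds.items()):
        return "medium"
    return "low"
```

For instance, classify("contributions", {"commits": 40, "bugfixes": 2}) yields "medium" under this reading, because the bugfix count stays below the high bound of 5.5.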
According to Table II, a developer needs, e.g., for the contribution activities, at least 14 commits and no bugfixes to be classified as medium, and at least 35 commits and 6 bugfixes to be classified as high in a month. The actual training of the HMM is not yet finished. Our next step is to estimate the state transition matrix with the Baum-Welch algorithm. Afterwards we apply the Viterbi algorithm to determine the most probable sequence of hidden states that produced the observations. We hope that this investigation yields enough knowledge to enrich our agent model of the developers.
C. Software Development Phases Learning
Most software projects exhibit at least four phases: an initial phase, a changing phase, a maintenance phase, and a phaseout [17], which may contain loops. HMMs can also be beneficial for studying the development phases of the software. Possible items of the input space of observations for this purpose are the number of bugs that appeared, the different kinds of software changes, and the developers involved. However, this work, which aims to answer RQ3, is at a very early stage.
D. Next Steps
The planned steps are pictured in Figure 9. Big bullets mean that the main work is planned for this time section; smaller ones mean that there is something to cross-check or to adapt according to other results. Further steps planned as depicted in the timeline (Figure 9) are more research on social networks, including communication networks built from mailing list data, the incorporation of developer strategies into the agent model, and a model to predict bugs based on the previously retrieved metrics. The evaluation of the simulation results and the adaptation of parameters and models for specific questions take place in nearly every step, with a focus on the end of the timeline, where final case studies evaluate the simulation model.
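To make the decoding step planned in Section IV-B more concrete, the following is a minimal Viterbi sketch for a discrete HMM. It is an illustration only: the HMM training is not yet finished, so the two states, the three observation symbols, and all probabilities in START, TRANS, and EMIT are made-up values rather than results, and the Baum-Welch estimation that would normally produce them is not shown.

```python
import math


def viterbi(observations, start_prob, trans_prob, emit_prob):
    """Return the most probable hidden-state sequence for a discrete HMM."""
    n_states = len(start_prob)

    def log(p):
        # Work in log space to avoid underflow on long sequences.
        return math.log(p) if p > 0 else float("-inf")

    # score[s]: best log-probability of any state path ending in state s.
    score = [log(start_prob[s]) + log(emit_prob[s][observations[0]])
             for s in range(n_states)]
    backpointers = []
    for obs in observations[1:]:
        prev, score, pointers = score, [], []
        for s in range(n_states):
            best = max(range(n_states),
                       key=lambda r: prev[r] + log(trans_prob[r][s]))
            pointers.append(best)
            score.append(prev[best] + log(trans_prob[best][s])
                         + log(emit_prob[s][obs]))
        backpointers.append(pointers)

    # Backtrack from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical parameters: state 0 = less experienced, state 1 = experienced;
# symbols 0, 1, 2 = low, medium, high monthly activity.
START = [0.8, 0.2]
TRANS = [[0.7, 0.3], [0.1, 0.9]]
EMIT = [[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]]

print(viterbi([0, 0, 2, 2], START, TRANS, EMIT))  # prints [0, 0, 1, 1]
```

In this toy run, two low-activity months followed by two high-activity months are decoded as a switch from the first to the second hidden state, which is the kind of per-developer state sequence the approach aims to obtain.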
Fig. 9. Timeline for Dissertation Research.
E. Validation
For evaluating this approach we collect more open source project data, use it as input for our simulation, and compare empirical with simulated data in order to ensure that our simulation produces results that match reality. We are also currently working on an evaluation method for the simulation itself based on Conditional Random Fields (CRFs), which are suitable for the classification of sequence data.
V. CONCLUSION
This paper presents first results and models for simulating software development processes. The research combines the search for software evolution patterns with the investigation of developer behavior and presents both in an agent based simulation model. The current work deals with calculating an HMM from real project data to learn the experience processes underlying the contribution behavior of developers, in order to finally enrich the developer agents. The methodology for this process is described and the input data observations are presented. Furthermore, the detection of software development phases is planned using a similar approach. The overall goal is to understand the contribution behavior of developers, so that the gained knowledge enriches our software process simulation. Long-term plans include adapting our simulation tool to further challenges in software evolution, so that it can predict different evolutionary scenarios based on the chosen parameter constellation.
ACKNOWLEDGMENT
I would like to thank my supervisor Prof. Dr. Jens Grabowski for his support. This work is part of the project “Simulation-Based Quality Assurance for Software Systems” sponsored by the SWZ Göttingen-Clausthal.
REFERENCES
[1] G. Xie, J. Chen, and I. Neamtiu, “Towards a better understanding of software evolution: An empirical study on open source software,” in ICSM. IEEE, 2009, pp. 51–60.
[2] M. Lanza and S. Ducasse, “Understanding software evolution using a combination of software visualization and software metrics,” in Proceedings of LMO 2002 (Langages et Modèles à Objets). Lavoisier, 2002, pp. 135–149.
[3] P. Bhattacharya, M. Iliofotou, I. Neamtiu, and M. Faloutsos, “Graph-based analysis and prediction for software evolution,” in Proceedings of the 34th International Conference on Software Engineering, ser. ICSE ’12. Piscataway, NJ, USA: IEEE Press, 2012, pp. 419–429. [Online]. Available: http://dl.acm.org/citation.cfm?id=2337223.2337273
[4] M. Goulão, N. Fonte, M. Wermelinger, and F. B. e Abreu, “Software evolution prediction using seasonal time analysis: A comparative study,” in CSMR, T. Mens, A. Cleve, and R. Ferenc, Eds. IEEE, 2012, pp. 213–222.
[5] J. A. Sokolowski and C. M. Banks, Principles of Modeling and Simulation: A Multidisciplinary Approach. Hoboken, NJ: Wiley, 2009.
[6] M. M. Lehman, “Programs, life cycles, and laws of software evolution,” Proc. IEEE, vol. 68, no. 9, pp. 1060–1076, September 1980.
[7] C. Marinescu, R. Marinescu, P. F. Mihancea, and R. Wettel, “iPlasma: An integrated platform for quality assessment of object-oriented design,” in ICSM (Industrial and Tool Volume). Society Press, 2005, pp. 77–80.
[8] C. M. Macal and M. J. North, “Introductory tutorial: Agent-based modeling and simulation,” in Simulation Conference (WSC), Proceedings of the 2011 Winter. IEEE, 2011, pp. 1451–1464.
[9] N. Smith and J. F. Ramil, “Agent-based simulation of open source evolution,” in Software Process Improvement and Practice, 2006, pp. 423–434.
[10] T. Girba, A. Kuhn, M. Seeberger, and S. Ducasse, “How developers drive software evolution,” in Proceedings of the Eighth International Workshop on Principles of Software Evolution, ser. IWPSE ’05. Washington, DC, USA: IEEE Computer Society, 2005, pp. 113–122. [Online]. Available: http://dx.doi.org/10.1109/IWPSE.2005.21
[11] P. V. Singh, Y. Tan, and N. Youn, “A hidden Markov model of developer learning dynamics in open source software projects,” Information Systems Research, vol. 22, no. 4, pp. 790–807, 2011.
[12] T. Weissman et al., “Bugzilla,” Programa de Computador, 1998. [Online]. Available: http://www.bugzilla.org
[13] V. Honsel, D. Honsel, and J. Grabowski, “Software process simulation based on mining software repositories,” in Proceedings of the Third International Workshop on Software Mining, 2014.
[14] X. Li, M. Parizeau, and R. Plamondon, “Training hidden Markov models with multiple observations: a combinatorial method,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 4, pp. 371–377, 2000.
[15] S. Herbold, J. Grabowski, and S. Waack, “Calculation and optimization of thresholds for sets of software metrics,” Empirical Software Engineering, vol. 16, no. 6, pp. 812–841, 2011. [Online]. Available: http://dx.doi.org/10.1007/s10664-011-9162-z
[16] KDE.org, “Rekonq,” online, 2014. [Online]. Available: https://rekonq.kde.org/
[17] A. Capiluppi, J. M. Gonzalez-Barahona, I. Herraiz, and G. Robles, “Adapting the ‘staged model for software evolution’ to free/libre/open source software,” in 9th International Workshop on Principles of Software Evolution (IWPSE). Dubrovnik, Croatia: ACM, 09/2007, pp. 79–82. [Online]. Available: http://iwpse2007.inf.unisi.ch/