
Predicting Bugs in Distributed Large Scale Software Systems Development

R.B. Lenin, Department of Mathematics, University of Central Arkansas
R.B. Govindan, Department of OB/GYN, University of Arkansas, Medical Sciences
S. Ramaswamy, Department of Computer Science, University of Arkansas at Little Rock

SCS M&S Magazine – 2010 / n2 (April)

Introduction

The software industry's need for better prediction is no different from that of other industries and applications. Software companies have long worked on plans to improve their development, maintenance and management processes by predicting the behavior of their software systems. A crucial responsibility of any software company is maintaining its released systems, that is, taking care of the defects (bugs) that arise once the software is released to the market. Companies spend substantial amounts of money allocating resources to maintenance in order to keep customer satisfaction as high as possible. Predicting the behavior of bugs in a software system would help in planning an optimal allocation of those resources, and an optimal allocation in turn reduces the cost of software maintenance.

Recently, with the fast emergence of open-source software systems, coupled with the upward-spiraling costs of ‘closed’ and ‘customized’ software projects, it has become crucial to address the issue of prediction accuracy in software projects. Open-source software systems offer lower software and hardware costs and simplified license management, in addition to several other benefits. Currently, the data from open-source repositories, including defect tracking systems, are analyzed using probabilistic models to predict files with bugs [1]. A statistical model based on historical fault information and file characteristics was proposed to predict the files that contain the largest numbers of faults [2-5]. Such information can be used to prioritize the components to be tested, making the testing process more efficient and the resulting software more dependable. In [6], the authors discovered common method usage patterns that are likely to encounter violations in Java applications by mining software repositories. Their approach is more dynamic in the sense that it finds violation patterns from method pairs. Bug prediction schemes utilizing association rules have also been proposed to predict multi-component bugs [7]. In [8-10], the authors successfully predicted the bug density on a per-module basis by using software change history.

Table 1: 31 components of Eclipse

ID  Component            ID  Component
 1  Equinox.Bundles      17  Platform.Debug
 2  Equinox.Framework    18  Platform.Doc
 3  Equinox.Incubator    19  Platform.IDE
 4  Equinox.Website      20  Platform.Releng
 5  JDT.APT              21  Platform.Resources
 6  JDT.Core             22  Platform.Runtime
 7  JDT.Debug            23  Platform.Scripting
 8  JDT.Doc              24  Platform.Search
 9  JDT.Text             25  Platform.SWT
10  JDT.UI               26  Platform.Team
11  PDE.Build            27  Platform.Text
12  PDE.Doc              28  Platform.UI
13  PDE.UI               29  Platform.Update
14  Platform.Ant         30  Platform.WebDAV
15  Platform.Compare     31  Platform.Website
16  Platform.CVS

The Modeling Procedures

So far, we have applied two mathematical models, developed and used in two different domains, to predict bugs in Eclipse's components [11]. The 31 components used for prediction are listed in Table 1 above. The first model is based on the Power Law (PL) [13], which is used in several software and natural systems; the second is based on Detrended Fluctuation Analysis (DFA) [12], which is used extensively in medical and healthcare research applications.

In a typical software development project, the number of bugs in the software is expected to decrease with time (i.e. as the software system stabilizes). Moreover, future bugs tend to depend strongly on bugs from the recent past. We refer to this type of dependency as recency. Using recency, we hypothesize that the reporting of bugs is further influenced by seasonal factors (holidays, vacations, etc.). This is based on the observation that bug reports tend to be fewer during holiday periods. We refer to this kind of trend in the bug count as a local trend. We refer to the longer-term fluctuation trend of the bug counts collected over the period starting from year 2001 as the global trend. Given the above, we propose a mathematical model that uses recency dependencies, i.e. local and global trends, of the collected data to predict future data.

We first introduce the notation used in our model. Let $N$ denote the total number of collected data points on bug counts. Let $B_i$, $i = 1, 2, \ldots, N$, denote the bug count collected for the time period $i$. $B_i$ may denote the bug count collected on day $i$, week $i$ or month $i$, depending on which period we use to predict future bug counts. Hence, $B_i$ can further be denoted by any of the following three notations, depending on whether it is collected daily, weekly or monthly:

$$B_i = \begin{cases} B_{d_i}, & \text{for daily bug count,} \\ B_{w_i}, & \text{for weekly bug count,} \\ B_{m_i}, & \text{for monthly bug count.} \end{cases}$$

In this work we restrict our analysis to the monthly bug count, though the analysis can easily be extended to daily and weekly counts.

Let the (bug count) difference between $B_{i-1}$ and $B_i$ be denoted by $d_i$, that is, $d_i = B_i - B_{i-1}$, $i = 2, \ldots, N$, with $d_1 = 0$. We note that $d_i$ denotes the slope of the graph connecting the two consecutive bug counts $B_{i-1}$ and $B_i$. We define the weights of the slopes as

$$w_i^{(N)} = f(N - i), \quad i = 1, 2, \ldots, N, \tag{1}$$

where $f(N - i)$ is a function defined suitably to account for the recency dependency property of the collected data. In order to capture the global trend, we define the function

$$G_t = \frac{\sum_{i=1}^{N} d_i \, w_i^{(N)}}{\sum_{i=1}^{N} w_i^{(N)}}.$$

To capture the local trend, we define

$$Lt_k^{(m)} = \frac{\sum_{i \in M_k} d_i \, w_i^{(|M_k|)}}{\sum_{i \in M_k} w_i^{(|M_k|)}}, \quad k = 1, 2, \ldots, 12,$$

with

$$M_k = \left\{ m_{ik} \;\middle|\; m_{ik} = \text{month } k \text{ in year } i, \; i = 1, 2, \ldots, \left\lfloor \frac{N}{12} \right\rfloor \right\},$$

where $|\cdot|$ and $\lfloor \cdot \rfloor$ denote the cardinality of a set and the floor function, respectively. The formula to predict the bugs is given as follows: for $k = N + 1, N + 2, \ldots$,

$$B_{m_k} = G_t + Lt_k^{(m)} + \frac{1}{k - K} \sum_{j=K}^{k-1} B_{m_{j-1}}.$$
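As a rough illustration, the sketch below evaluates the global trend, the local trend and the prediction formula above in Python for the first unobserved month, k = N + 1. Several things here are assumptions rather than part of the model as stated: the function names are hypothetical, the weight function f is supplied by the caller, the series is assumed to start in month 1 of a year so that period k falls in calendar month ((k - 1) mod 12) + 1, the local weights are taken as f(|M_k| - i) with i counting the years in M_k, and K is assumed to be at least 2 so that every index in the sum exists.

```python
from typing import Callable, List, Sequence


def slopes(bugs: Sequence[float]) -> List[float]:
    """d_i = B_i - B_{i-1} for i = 2..N, with d_1 = 0 (stored 0-based)."""
    return [0.0] + [bugs[i] - bugs[i - 1] for i in range(1, len(bugs))]


def weighted_trend(d: Sequence[float], f: Callable[[int], float]) -> float:
    """Weighted mean of the slopes d with weights w_i = f(n - i), i = 1..n."""
    n = len(d)
    weights = [f(n - i) for i in range(1, n + 1)]
    return sum(di * wi for di, wi in zip(d, weights)) / sum(weights)


def predict_next(bugs: Sequence[float], f: Callable[[int], float], K: int) -> float:
    """Predict the bug count for period k = N + 1 from monthly counts."""
    n = len(bugs)
    k = n + 1
    if n < 12:
        raise ValueError("need at least one full year of monthly counts")
    if not 2 <= K < k:
        raise ValueError("K must satisfy 2 <= K < k")

    d = slopes(bugs)
    global_trend = weighted_trend(d, f)

    # Assumption: the series starts in month 1, so period k is this month (0-based).
    month = (k - 1) % 12
    full_years = n // 12
    # Slopes of the same calendar month across all complete years (the set M_k).
    d_month = [d[12 * year + month] for year in range(full_years)]
    local_trend = weighted_trend(d_month, f)

    # (1 / (k - K)) * sum over j = K..k-1 of B_{j-1}; 1-based B_{j-1} is bugs[j - 2].
    past = [bugs[j - 2] for j in range(K, k)]
    return global_trend + local_trend + sum(past) / (k - K)


# Illustrative, made-up monthly counts (not Eclipse data), with uniform weights.
example_bugs = [120.0 - i for i in range(36)]
print(predict_next(example_bugs, lambda lag: 1.0, K=25))
```

Passing f as a parameter keeps the trend computation independent of the particular weighting scheme, so the same function can be reused with either of the two approaches to f.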

In the prediction formula, $K$ ($< k$) is a parameter whose value is chosen based on the error tolerance on the predicted results. The parameter $K$ determines how many past bug counts, counted back from the current prediction period, are used in the above formulas. With these notations and basic definitions, we adopt two different approaches to compute $f(N - i)$ in (1). These are elaborated below.

PL-based Recency Approach: Assuming that the trend $d_i$ decreases monotonically in a way that can be characterized using a power law, in the first approach we model $f(N - i)$ using the power law function $f(N - i) = 1 / 2^{N - i}$. The global and local trend functions have been defined in such a way that the weighting functions place more weight on recent bug counts and relatively less weight on bug counts from the distant past. Hence we refer to these weights as recency weights. In operating systems, this concept has been widely used to define aging functions.
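To make the recency effect concrete, the short Python sketch below tabulates the weights for a small N, taking the PL weight to be f(N - i) = 1 / 2^(N - i). The function names and the choice N = 6 are hypothetical and serve only as an illustration.

```python
from typing import List


def pl_weight(lag: int) -> float:
    """PL recency weight f(N - i) = 1 / 2**(N - i); lag = N - i is the age of period i."""
    return 1.0 / 2 ** lag


def recency_weights(n: int) -> List[float]:
    """Weights w_i = f(n - i) for i = 1..n, ordered from oldest to most recent period."""
    return [pl_weight(n - i) for i in range(1, n + 1)]


weights = recency_weights(6)
total = sum(weights)
for i, w in enumerate(weights, start=1):
    # The share column shows how much each period contributes to a weighted mean.
    print(f"period {i}: weight {w:.5f}, share {w / total:.3f}")
```

Each step back in time halves the weight, so the most recent period alone carries roughly half of the total weight in any weighted average built from these values.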


DFA Approach: In the second approach, we attempt to exploit the autocorrelations in the bug count to fine-tune our predictions. In this approach we quantify the long-range correlations in the bug counts using DFA, and $f(N - i)$ is given by $f(N - i) = 1 / (N - i)^{\alpha}$. DFA involves the following three steps to find $\alpha$:

(a) For the data $B_i$, $i = 1, \ldots, N$, we remove the mean of the data and denote the modified data as $Y_i$.
(b) The profile is divided into $M$ disjoint time windows of size $s$ indexed by $\nu$, where $M = [N / s]$. If the length of the data is not an integer multiple of the scale $s$, a small portion of the data towards the end of the record is left unanalyzed.
(c) The profile in the $\nu$th window is fitted by a polynomial $p_\nu^q$ of order $q$. To this end we compute the fluctuation function $F_\nu(s)$ in the $\nu$th window as follows:

$$F_\nu(s) = \frac{1}{s} \sum_{i=(\nu-1)s+1}^{\nu s} \left[ Y_i - p_\nu^q(i) \right]^2.$$

Finally, $F_\nu(s)$ is averaged over all the windows to get the fluctuation function $F(s)$. For power-law correlated data, $F(s)$ follows a power law, $F(s) \sim s^{\alpha}$, where $\alpha$ is called the scaling exponent or fluctuation exponent.

Using our approach, we achieve highly effective prediction results. Figures 1-4 show the results for year 2008, predicted using the bugs from 2001 to 2007. Figures 5-7 show the results for year 2009, predicted using the bugs from 2001 to 2008. During the year 2008, after each version release, the total bug counts across all components were higher than the total before the version release. This was not the case for the total bug counts of previous years. In order to test how resilient our models are to this drastic change, we predicted the bug counts of years 2008 and 2009.

Fig. 1. Actual monthly bug count of the year 2008

Fig. 2. Predicted monthly bug count of year 2008 using DFA

Fig. 3. Predicted monthly bug count of year 2008 using PL

Fig. 4. Pearson correlation among actual and predicted bug counts for 2008
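The three DFA steps (a) to (c) described earlier can be sketched in pure Python as follows. This is a minimal illustration under stated assumptions, not the implementation used for the results: it uses linear detrending (q = 1), builds the profile as the cumulative sum of the mean-removed data and takes the square root of the window-averaged squared residuals, both of which follow the common DFA convention of [12], and it estimates the exponent as the least-squares slope of log F(s) against log s. The function names, the scales and the synthetic test series are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def linear_fit(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept of y against x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fluctuation(profile: Sequence[float], s: int) -> float:
    """F(s): root of the window-averaged squared residuals around linear fits."""
    windows = len(profile) // s  # M = floor(N / s); a short tail is left unanalyzed
    if s < 3 or windows < 1:
        raise ValueError("scale s must be at least 3 and at most len(profile)")
    x = list(range(s))
    total = 0.0
    for v in range(windows):
        segment = profile[v * s:(v + 1) * s]
        slope, intercept = linear_fit(x, segment)
        total += sum((yi - (slope * xi + intercept)) ** 2
                     for xi, yi in zip(x, segment)) / s
    return math.sqrt(total / windows)


def dfa_exponent(data: Sequence[float], scales: Sequence[int]) -> float:
    """Estimate the scaling exponent as the slope of log F(s) versus log s."""
    mean = sum(data) / len(data)
    profile: List[float] = []
    running = 0.0
    for value in data:
        running += value - mean  # assumption: profile = cumulative sum of mean-removed data
        profile.append(running)
    log_s = [math.log(s) for s in scales]
    log_f = [math.log(fluctuation(profile, s)) for s in scales]
    slope, _ = linear_fit(log_s, log_f)
    return slope


# Synthetic uncorrelated noise (not bug data) as a sanity check of the estimator.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(2048)]
print(dfa_exponent(noise, [8, 16, 32, 64, 128]))
```

For uncorrelated data an estimator of this kind is expected to give an exponent near 0.5, while long-range correlated data give larger values. A monthly bug series covering only a few years is short, so the range of usable scales s is correspondingly narrow.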

Fig. 5. Predicted monthly bug count of year 2009 using DFA

Fig. 6. Predicted monthly bug count of year 2009 using PL

Fig. 7. Pearson correlation among predicted bug counts of year 2009

Conclusions and Future Work

Our future work will focus on defining and validating a parameterized framework for bug prediction using multiple techniques, and on adaptive tuning of the power law function based on extractable domain knowledge, such as observed prediction cycle patterns. We will also investigate whether the bug cycles of various components can help in finding optimum software release dates. By choosing release dates based on component bug cycles, a software project management team can improve workforce allocation planning, workload characterization and balancing, and test scenario planning, minimize the severity of major disruptions following a release, and potentially reduce the number of bug reports.

References


[1] M. Askari and R. Holt, "Information Theoretic Evaluation of Change Prediction Models for Large-Scale Software," in 3rd ICSE Workshop on Mining Software Repositories, Shanghai, China, 2006, pp. 126-132.
[2] T. Ostrand and E.J. Weyuker, "The distribution of faults in a large industrial software system," in 2002 ACM International Symposium on Software Testing and Analysis, Rome, Italy, 2002, pp. 55-64.
[3] T. Ostrand, E.J. Weyuker, and R.M. Bell, "Where the bugs are?," in 2004 ACM International Symposium on Software Testing and Analysis, Boston, MA, 2004, pp. 86-96.
[4] T.J. Ostrand and E.J. Weyuker, "A tool for mining defect-tracking systems to predict fault-prone files," in 1st ICSE Workshop on Mining Software Repositories, Edinburgh, Scotland, 2004, pp. 85-89.
[5] T.J. Ostrand, E.J. Weyuker, and R.M. Bell, "Predicting the location and number of faults in large software systems," IEEE Transactions on Software Engineering, vol. 31, pp. 340-355, 2005.
[6] B. Livshits and T. Zimmermann, "DynaMine: Finding Common Error Patterns by Mining Software Revision Histories," in 2005 European Software Engineering Conference and 2005 Foundations of Software Engineering (ESEC/FSE 2005), Lisbon, Portugal, 2005, pp. 296-305.
[7] Q. Song, M. Shepperd, M. Cartwright, and C. Mair, "Software Defect Association Mining and Defect Correction Effort Prediction," IEEE Transactions on Software Engineering, vol. 32, pp. 69-82, 2006.
[8] T.L. Graves, A.F. Karr, J.S. Marron, and H. Siy, "Predicting Fault Incidence Using Software Change History," IEEE Transactions on Software Engineering, vol. 26, pp. 653-661, 2000.
[9] A. Mockus and D.M. Weiss, "Predicting Risk of Software Changes," Bell Labs Technical Journal, vol. 5, pp. 169-180, 2002.
[10] N. Nagappan and T. Ball, "Use of Relative Code Churn Measures to Predict System Defect Density," in 27th International Conference on Software Engineering (ICSE 2005), Saint Louis, Missouri, U.S.A., 2005, pp. 284-292.
[11] Eclipse, "Eclipse - an open development platform," Available online: http://www.eclipse.org, 2008.
[12] J.W. Kantelhardt, E. Koscielny-Bunde, H.H.A. Rego, S. Havlin, and A. Bunde, "Detecting Long-range Correlations with Detrended Fluctuation Analysis," Physica A, vol. 295, pp. 441-454, 2001.
[13] M.E.J. Newman, "Power laws, Pareto distributions and Zipf's law," Contemporary Physics, vol. 46, pp. 323-351, 2005.

Author Biographies

R.B. Lenin received the Ph.D. degree in Mathematics from the Indian Institute of Technology, Madras, India, in 1998. He is currently an assistant professor in the Department of Mathematics at the University of Central Arkansas (UCA), U.S.A. Prior to joining UCA, he was a post-doctoral fellow at the University of Arkansas at Little Rock (UALR), U.S.A. Earlier, he had also worked as a post-doctoral fellow at the University of Twente, The Netherlands, and at the University of Antwerp, Belgium, for about 5 years. His research interests are in mathematical modeling and simulation of discrete-event systems, stochastic models, optimization techniques, performance analysis of computer and communication networks, rational approximation, information retrieval and network analysis.
He has 27 papers published in international journals including Journal of Applied Probability, Queuing Systems, IEEE Transactions on Computers and IEEE Transactions on Microwave Theory and Techniques.

R.B. Govindan has been an Assistant Professor at the Department of Obstetrics and Gynecology, University of Arkansas for Medical Sciences in Little Rock, AR, since June 2007. His research interests include developing methods to further the understanding of maternal-fetal medicine. He received the B.S. and M.S. degrees in Chemistry from the University of Madras, Tamil Nadu, India, in 1991 and 1993, respectively. He received his Ph.D. degree in nonlinear dynamics and chaos theory from the Indian Institute of Technology, Madras, India, in 1999. Earlier, he worked as a post-doctoral fellow and research scientist in Israel and Germany from 2000 to 2005, and as a research assistant professor at the Graduate Institute of Technology, University of Arkansas at Little Rock, until 2007.

S. Ramaswamy is currently Professor and Chairperson of the Computer Science Department at the University of Arkansas at Little Rock. His research interests are in behavior modeling, analysis and simulation, software stability and scalability, particularly in the design and development of better software systems, and intelligent and flexible control systems. At UALR, he is currently associated with several research initiatives, which include serving as the statewide program manager for the wireless nano sensors and systems center and as the principal investigator at UALR for a high performance computing initiative. He is also the research coordinator for a collaboration on "Engineering Innovative Software Systems for Marine Transportation Logistics" with the National Institute of Applied Sciences (INSA) in Rouen, France, where he was a visiting research professor in 2006 and 2007. During the summers of 2003, 2004 and 2007, he was a visiting research professor of Computer Science in the Institute of Software Integrated Systems (ISIS) at Vanderbilt University as part of an NSF ITR project, Foundations of Hybrid and Embedded Software Systems. In 1994-1995, and subsequently during the summer months of 1996 and 1998, he was a post-doctoral research fellow / visiting scientist in the Laboratory for Intelligent Processes and Systems (LIPS) at the University of Texas at Austin, where he helped with research efforts on Sensible Agents. Dr. Ramaswamy earned his Ph.D. degree in Computer Science in 1994 from the Center for Advanced Computer Studies (CACS) at the University of Louisiana at Lafayette. He serves as an Associate Editor for the IEEE Transactions on Systems, Man and Cybernetics, Part C: Applications and Reviews. He is a senior member of the ACM, a member of the Society for Computer Simulation International and of Computing Professionals for Social Responsibility, and a senior member of the IEEE.
