Threshold-based prediction of schedule overrun in software projects

Morakot Choetkiertikul
Hoa Khanh Dam
Aditya Ghose
School of Computer Science and Software Engineering University of Wollongong, Australia
[email protected]
[email protected]
ABSTRACT

Risk identification is the first critical task of risk management, providing the basis for planning measures to deal with risks. Software projects have a high risk of schedule overruns, yet current practices in risk management mostly rely on high-level guidance and the subjective judgement of experts. In this paper, we propose a novel approach to support risk identification using historical data associated with a software project. Specifically, our approach identifies patterns of abnormal behaviour that have caused project delays, and uses this knowledge to develop an interpretable predictive model of whether current software tasks (in the form of issues) will cause a schedule overrun. Abnormal behaviour is identified using a set of configurable threshold-based risk factors. Our approach aims to provide not only predictive models but also an interpretable outcome: the patterns of combinations of risk factors. The evaluation results from two case studies (Moodle and Duraspace) demonstrate the effectiveness of our predictive models, achieving 78% precision, 56% recall, 65% F-measure, and 84% Area Under the ROC Curve.
CCS Concepts •Software and its engineering → Risk management; •Information systems → Data mining; Decision support systems; •Social and professional topics → Project and people management;
Keywords Risk factor; Mining software repository; Decision tree model
1. INTRODUCTION
Schedule overruns are a major threat to software projects – approximately one-third of software projects studied in the Standish Group’s well-known CHAOS report [5] missed their schedule. Similar portions of delayed software projects
were also reported in other studies (e.g. [9, 16]). Lack of effective risk management is cited as one of the main reasons for schedule overruns [4]. Current practices in software risk management mostly rely on high-level guidance (e.g. Boehm’s “top 10 list of software risk items” [1] or SEI’s risk management framework [2]) and/or subjective expert judgement. Risk identification, i.e. determining the factors that, when present, can adversely affect a project, is the most important of these tasks and is the focus of this paper. Behaviours that deviate from anticipated norms (e.g. statistically exceed a threshold) are factors that signal risk. Hence, we propose an automated, data-driven approach to risk identification that detects abnormal behaviours in the ongoing development of a software project. Our approach identifies instances of schedule overruns in the form of delayed issues. In today’s software development (especially in the open source context), software tasks are recorded in an issue tracking system (e.g. JIRA) as new feature requests, bug reports, and so on. We define a number of risk factors related to issues (tasks) and developers (resources) that are available in essentially all software projects. Each risk factor has a threshold indicating whether an issue or developer exhibits normal or abnormal behaviour. The thresholds are learned from historical data (i.e. past issues) using outlier detection techniques. We then use supervised learning to build a decision tree model that describes the combinations of risk factors that could cause schedule overruns. The paper makes two main contributions. First, we propose a set of threshold-based risk factors (abnormal resolving time for an issue, abnormal repetitions in the life-cycle of an issue, a developer involved with a number of delayed issues, overloaded developers, and similarity to a large number of delayed issues).
Second, we developed a cut-off threshold derivation technique to identify abnormal behaviours, and learned decision tree models that can predict whether an issue poses a risk of delay. The performance of our predictive models was evaluated on two open source projects, Moodle and Duraspace, achieving 78% precision, 56% recall, 65% F-measure, and 84% Area Under the ROC Curve. The decision tree models also present the combinations of abnormal behaviours as hierarchical decision rules, from which the abnormal behaviours that could cause a schedule overrun can be inferred. The remainder of this paper is organized as follows. Section 2 describes a number of threshold-based risk factors
affecting schedule overruns that we use in our approach. Section 3 describes how our approach learns a threshold for each risk factor and how we build a risk prediction model using those factors. Section 4 reports our evaluation results. Threats to validity of our approach, related work, and our conclusions are discussed in Section 5.
2. THRESHOLD-BASED RISK FACTORS
This paper focuses on investigating the risk of schedule overruns in terms of delayed issues (i.e. tasks). Traditionally, software projects have release dates as major milestones, and a number of issues must be resolved before the planned release date. A single release, however, can have thousands of related issues. Each issue is given a planned resolution date (i.e. a due date), and a number of delayed issues can thus cause a release schedule overrun. Hence, we describe here a set of risk factors involving the characteristics of an issue and of a developer. Each risk factor is associated with a cut-off threshold which determines whether the corresponding behaviour is abnormal. The derivation of a cut-off threshold is discussed in Section 3.

Risk factor 1: Abnormal resolving time for an issue. A software project can be viewed as a network of activities. Each activity is recorded as an issue whose details can be described and tracked in an issue tracking system. Hence, the time taken to complete a task (i.e. resolve an issue) contributes to the overall schedule of the project [10]. More specifically, issues that take a significant, unusual amount of time to resolve may lead to delays. The time taken to resolve an issue can be divided into three periods that differentiate the resolving activities. Discussion time is the time that the people involved with an issue spend on finding a solution. It starts when an issue is created or opened and ends when development activities begin (i.e. coding, a request to check out code, or assigning a developer). Issues with abnormal discussion time tend to be delayed. For example, delayed issue MDL-38314 in version 2.5 spent 92 days in discussion, whilst the normal discussion time for issues in Moodle is 42 days. Development time includes the time that developers spend on development activities such as coding, reviewing, testing, and integration.
Issues with abnormal development time also tend to be delayed. For example, issue MDL-12810 in version 2.3 had a development period of 164 days, which is highly abnormal (the usual development time for Moodle issues is only 56 days). Note that development time is collected from historical issue reports (the training data), in which issues have been resolved and their development time is therefore known. In practical terms, the development time cut-off threshold can be used to monitor whether existing issues are spending an unusual amount of time in development. Waiting time is the time an issue spends waiting to be acted upon, e.g. waiting to be assigned to a peer reviewer, tester, integrator, or integration reviewer. An abnormal waiting time is an indication of an issue being delayed due to a lack of team cooperation or because nobody wants to deal with the issue [10].

Risk factor 2: Abnormal repetitions in the life-cycle of an issue. The software development process consists of different tasks such as coding, testing, reviewing, and integration, and this is reflected in the life-cycle of an issue. Each software project may define a different life-cycle for its issues. For example, the life-cycle of a Moodle issue consists of
four main tasks: developing, reviewing, testing, and production, each with several sub-tasks. Repetitions in the life-cycle of an issue can have a number of causes. For example, if a severe defect is found in testing, the issue is moved back to the developing state. In addition, we also count resolved issues that are reopened (i.e. a problem found after deployment) as repetitions in the life-cycle. Previous research (e.g. [15, 18]) has shown that task repetition (i.e. repetitions in the life-cycle of an issue) degrades the overall quality of software development projects; it often leads to additional, unnecessary rework that contributes to release delay. An abnormal number of repetitions may indicate that the issue has underlying problems, which potentially lead to delays. For example, issue MDL-3030 was delayed (in the delayed release 2.3): it was reopened twice and went back to the “coding” state twice. Issue MDL-31560 in the same release went back to the “coding” state 10 times and was eventually closed after the planned release date (delayed).

Risk factor 3: A developer involved with a number of delayed issues. A developer’s lack of skill and knowledge is considered a major threat to schedules. Teams consisting of incompetent developers, or of developers who lack the expertise to work on their issues, can be a cause of delay [10]. The number of delayed issues that a developer has been involved with is used to characterize this risk factor. Developers who have been involved with a large number of delayed issues (exceeding a threshold) are considered a risk. For example, when we examined delayed issue MDL-22504 in Moodle, we found that it was assigned to a developer who had been involved with a high number of delayed issues.

Risk factor 4: Overloaded developers.
Developer workload reflects the quality of resource planning, which is crucial for project success. Poor resource planning is a common cause of project failure, and developer workload can have a significant impact on the progress of a project [11]. Pika et al. [10] also reported that assigning issues to a developer with a high workload may be a cause of time overruns. The workload of a developer is determined by the number of open issues assigned to that developer at a given time. Issues assigned to an overloaded developer tend to be delayed, since the developer does not have enough time to deal with them concurrently with the other issues assigned to them. For example, we observed developer workload in the Moodle project from release 2.2 to 2.7. The highest number of open issues assigned to one developer was 125, while the average number handled by one developer at a time was 16. Issue MDL-31501 is associated with release 2.6, which was delayed. This issue was assigned to a developer who had 108 open issues on hand at the time (compared to a normal workload of 35 issues). A learned threshold is likewise used to determine overloaded developers.

Risk factor 5: Being similar to a large number of delayed issues. An issue which is similar in its characteristics (e.g. difficulty) to delayed issues in the past tends to be delayed as well. Similarity can be measured using different techniques; a simple one is comparing the text descriptions of issues using a bug duplicate detection technique (e.g. [14]). For example, issue MDL-40359 (in version 2.6) was delayed, and we found that its description
is similar to those of three delayed issues in previous versions. For each issue, we keep track of the number of delayed issues it is similar to, and we learn a threshold on this number to indicate an abnormal count of similar delayed issues.
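Several of these factors reduce to simple counts over the issue archive. As a minimal sketch (the issue fields and the `workload` helper are our own illustration, not part of any specific issue tracker's API), the developer-workload factor (risk factor 4) can be computed by counting how many issues assigned to the same developer are open at a given point in time:

```python
from datetime import date

def workload(issues, developer, at):
    """Risk factor 4: number of issues assigned to `developer` that
    were open (created but not yet resolved) at time `at`.
    Each issue is a dict with 'assignee', 'created', and 'resolved'
    ('resolved' is None while the issue is still open)."""
    return sum(
        1 for i in issues
        if i["assignee"] == developer
        and i["created"] <= at
        and (i["resolved"] is None or i["resolved"] > at)
    )

issues = [
    {"assignee": "dev1", "created": date(2014, 1, 1), "resolved": date(2014, 3, 1)},
    {"assignee": "dev1", "created": date(2014, 2, 1), "resolved": None},
    {"assignee": "dev2", "created": date(2014, 1, 15), "resolved": date(2014, 2, 10)},
]

print(workload(issues, "dev1", date(2014, 2, 15)))  # -> 2
```

The resulting count would then be compared against the learned workload threshold to flag overloaded developers.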
3. RISK IDENTIFICATION

3.1 The Architectural Framework
Our risk identification approach consists of three main phases: (1) the data extraction phase, (2) the cut-off threshold derivation phase, and (3) the prediction model building phase. The data extraction phase collects historical data. Two inputs are required: the record of past releases and the archive of past issues. The record of past releases shows the planned release date and the actual release date, which are used to distinguish between delayed and non-delayed issues. The discussion time, development time, waiting time, number of repetitions in the life-cycle, number of delayed issues for a developer, developer’s workload, and issue similarity value that correspond to each risk factor are then derived from the historical issues. The second phase derives the cut-off threshold for each risk factor; the cut-off threshold is used to identify abnormal behaviours. We do so using an outlier detection technique, under the assumption that abnormal behaviour appears as an outlier in the data. Risk predictive models are learned in the prediction model building phase using a decision tree model. This phase consists of two steps: classifying abnormal behaviour using the cut-off thresholds, and learning a decision tree model. The learned decision tree can be used to predict whether current issues will cause delays, so that delays can be prevented. Moreover, the correlations between risk factors, in the form of a hierarchical decision rule structure, can be read off the learned decision tree to support project managers in risk mitigation.
3.2 The Data Extraction Phase
The archive of past issues can be extracted from an issue tracking system. The extracted issues are processed, corresponding to the risk factors, to be fed into the cut-off threshold derivation phase. To describe the data extraction process, we define R() as an extraction function corresponding to each risk factor, S_i as an issue in the archive of past issues, and i as an index over the issues from 1 to n. There are seven extraction functions: R_Dis_time(S_i) denotes the discussion time spent on issue S_i; R_Dev_time(S_i) denotes the development time spent on issue S_i; R_Wait_time(S_i) denotes the waiting time spent on issue S_i; R_Rep(S_i) denotes the number of repetitions in the life-cycle of issue S_i; R_Developer(S_i) denotes the number of delayed issues that the developer of issue S_i has been involved with; R_Workload(S_i) denotes the workload of the developer of issue S_i; and R_Sim(S_i) denotes the number of delayed issues that are similar to issue S_i.

The issue state-transition data is used for computing R_Dis_time, R_Dev_time, R_Wait_time, and R_Rep. The time spent resolving an issue is captured by summing the durations between state changes along the life-cycle (e.g. changing from the coding to the testing state). For example, R_Dis_time sums the durations between the Open and Development In Progress states; R_Dev_time sums the durations between the Development In Progress and Waiting for Peer Review states; and R_Wait_time sums the durations between the Waiting for Testing and Testing In Progress states. R_Rep counts the number of backward transitions (repeating a previous state) and reopened states. R_Developer counts the number of delayed issues assigned to the assignee of issue S_i, and R_Workload counts the number of open issues assigned to the assignee of issue S_i at a given time.

Issue similarity is determined based on the bug duplicate detection technique using Natural Language Processing (NLP) presented by Runeson et al. [14]. In our context, the issue summary and issue description are used as features to determine similarity between issues. Tokenization, stemming, stop word removal, and vector space representation are performed. The similarity between S_i and S_j is determined using cosine similarity:

similarity(S_i, S_j) = cos(θ) = (V_i · V_j) / (||V_i|| * ||V_j||)

where V_i and V_j are the vector representations of issues S_i and S_j, respectively. R_Sim(S_i) outputs the number of delayed issues within the top 30 most similar issues to issue S_i.
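To make the time-period and repetition functions concrete, the following sketch (our own simplification, not code from the paper: the state names follow the Moodle examples above, R_Dis_time is reduced to time spent in the Open state, and R_Wait_time to time spent in Waiting for Peer Review) computes R_Dis_time, R_Dev_time, R_Wait_time, and R_Rep from an issue's ordered state-transition history:

```python
from datetime import date

# An issue's history as ordered (timestamp, state) pairs.
HISTORY = [
    (date(2013, 1, 1),  "Open"),
    (date(2013, 1, 20), "Development In Progress"),
    (date(2013, 2, 5),  "Waiting for Peer Review"),
    (date(2013, 2, 12), "Development In Progress"),  # sent back: one repetition
    (date(2013, 2, 20), "Waiting for Peer Review"),
    (date(2013, 3, 1),  "Closed"),
]

# Simplified life-cycle ordering, used to detect backward transitions.
ORDER = ["Open", "Development In Progress", "Waiting for Peer Review", "Closed"]

def time_in(history, state):
    """Total days the issue spent in `state`, summed over all visits."""
    days = 0
    for (t1, s1), (t2, _) in zip(history, history[1:]):
        if s1 == state:
            days += (t2 - t1).days
    return days

def repetitions(history):
    """R_Rep: number of backward (repeated) transitions in the life-cycle."""
    return sum(
        1 for (_, s1), (_, s2) in zip(history, history[1:])
        if ORDER.index(s2) < ORDER.index(s1)
    )

r_dis_time = time_in(HISTORY, "Open")                      # discussion time
r_dev_time = time_in(HISTORY, "Development In Progress")   # development time
r_wait_time = time_in(HISTORY, "Waiting for Peer Review")  # waiting time
r_rep = repetitions(HISTORY)
print(r_dis_time, r_dev_time, r_wait_time, r_rep)  # -> 19 24 16 1
```

Note how the two visits to Development In Progress are summed (16 + 8 days) and the single backward transition is counted once.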
3.3 The Cut-off Threshold Derivation Phase
We employed a cut-off threshold optimization technique that determines the cut-off threshold based on a desired precision level given as input by the user (e.g. a project manager). The benefit of this technique is that the desired precision level can be adjusted to fine-tune the sensitivity of the predictive models: a critical project may require a lower precision level (more issues are flagged as risky), while a higher precision level is suitable for a non-critical project.

We define: DP is the desired precision level; MAD is the Median Absolute Deviation, a robust scale estimator [13]; Median() is a function that calculates the median; T is the base threshold; P is the precision; and CT is the optimized cut-off threshold. Using the development time factor as an example, and given DP as input, MAD is calculated as:

MAD = 1.4826 * Median{ |R_Dev_time(S_i) - MD| }

where MD = Median(R_Dev_time(S)) and i ranges over the issues in the archive of past issues from 1 to n. Then, T is calculated as:

T_Dev_time = MD + 2 * MAD

The list of candidate cut-off thresholds t_j is defined as:

t_j = { T_Dev_time * n_j | n_j ∈ {0, 0.25, 0.50, ..., 10} }

The precision P_j of each candidate is calculated as:

P_j = x / (x + y)  if (x + y) > 0, and 0 otherwise

where x is the number of delayed issues in the archive of past issues whose R_Dev_time is higher than t_j, and y is the number of non-delayed issues whose R_Dev_time is higher than t_j. The optimized cut-off threshold of the development time factor, CT_Dev_time, is then calculated as:

CT_Dev_time = min{ t_j | P_j > DP }

i.e. the minimum candidate t_j whose precision P_j exceeds DP. Note that the desired precision level DP is the input through which we configure the precision level of the cut-off thresholds CT.

[Figure 1: Example of the learned decision tree. Internal nodes test the classified risk factors (e.g. overloaded developer, similar to delayed issues, abnormal discussion time), branches are labelled 1 (abnormal) and 0 (normal), and leaves are Delayed or Non-Delayed.]

Table 1: Experimental setting

             Training set             Test set
Project      Delayed  Non-delayed     Delayed  Non-delayed
Moodle         169        791            45        195
Duraspace      157      2,805            70        500
Total          326      3,596           115        695
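The derivation above can be sketched in a few lines. This is our own minimal re-implementation (function and variable names are illustrative): given the extracted R_Dev_time values and delay labels of past issues, it computes MD, MAD, the base threshold T, and scans the candidate thresholds for the smallest one whose precision exceeds DP:

```python
import statistics

def cutoff_threshold(values, delayed, desired_precision):
    """Derive the optimized cut-off threshold CT for one risk factor.
    values[i] is the factor value of past issue i (e.g. development
    time in days); delayed[i] is True if that issue was delayed."""
    md = statistics.median(values)                                 # MD
    mad = 1.4826 * statistics.median(abs(v - md) for v in values)  # MAD
    t = md + 2 * mad                                               # base threshold T
    # Candidate thresholds t_j = T * n_j for n_j in {0, 0.25, ..., 10}.
    candidates = [t * n * 0.25 for n in range(41)]
    for tj in candidates:  # ascending, so the first hit is the minimum
        x = sum(1 for v, d in zip(values, delayed) if v > tj and d)
        y = sum(1 for v, d in zip(values, delayed) if v > tj and not d)
        precision = x / (x + y) if (x + y) > 0 else 0.0
        if precision > desired_precision:
            return tj
    return None  # no candidate reaches the desired precision

# Toy archive: six on-time issues and three delayed outliers.
dev_times = [10, 12, 14, 15, 16, 20, 90, 120, 150]
delayed = [False, False, False, False, False, False, True, True, True]
ct = cutoff_threshold(dev_times, delayed, desired_precision=0.8)
print(ct)
```

In the full approach, a threshold is derived in this way for each risk factor; an issue's factor values are then binarized against the thresholds and fed to the decision tree learner.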
3.4 Building a Risk Prediction Model

The prediction model building phase consists of two steps: (1) abnormal behaviour classification and (2) decision tree learning. The cut-off threshold is the device for classifying abnormal versus normal behaviour. After the cut-off threshold of each risk factor has been derived in the previous phase, we classify each risk factor of each issue against its corresponding cut-off threshold. For the development time factor:

C_Dev_time(S_i) = 1 if R_Dev_time(S_i) > CT_Dev_time, and 0 otherwise

where C is the classification result (1 = abnormal behaviour, 0 = normal behaviour). The second step builds a risk predictive model using a decision tree (C4.5) [12] over the classified behaviours of the risk factors. A decision tree presents a set of rules in a hierarchical structure that can be used to predict whether an existing issue will be delayed. The learned decision tree can also be interpreted as patterns of combinations of risk factors. For example, an issue may cause a delay when abnormal behaviours are found on three risk factors together: overloaded developer, similar to a number of delayed issues, and abnormal discussion time, as shown in Figure 1.

4. EVALUATION AND RESULTS

This section explains the experimental setting, evaluations, and results. The experiment was conducted on two well-known open source projects: Moodle and Duraspace.

4.1 Experimental Setup

We collected the archive of issue reports from Moodle and Duraspace using the REST API provided by JIRA. Table 1 shows the number of issues in the training and test sets for our experiment. The issues in the training set are those that were opened before the issues in the test set. This setting mimics a real development scenario in which a prediction is made on current issues using knowledge from the past. As can be seen from Table 1, the delayed class is imbalanced. Since delayed issues are rare (only 9% of all collected issues), we had to be careful in creating the training and test sets. Specifically, we placed 80% of the delayed issues into the training set and the remaining 20% into the test set. In addition, we maintained a similar ratio between delayed and non-delayed issues in both sets, i.e. stratified sampling.

4.2 Results

We evaluated the effectiveness of the risk prediction models when all risk factors are aggregated using a decision tree model, with a desired precision level of 80%. Figure 2 shows the predictive performance achieved by the decision tree model for Moodle and Duraspace. The predictive performance achieved by the learned decision tree is high and consistent across the two projects: it achieved 0.78 precision, 0.56 recall, and 0.65 F-measure (averaged across the two projects). The degree of discrimination achieved by our predictive models is also high, as reflected in the AUC results; the AUC quantifies the overall ability to discriminate between the delayed and non-delayed classes.

[Figure 2: Evaluation result for all risk factors. Precision, recall, F-measure, and AUC for Moodle and Duraspace.]

5. DISCUSSION
Our risk prediction approach aims to make a prediction while a task is in its development phase (i.e. developers are working on the implementation), when the required information is available for collecting and predicting risks. However, we acknowledge that finding a good time to make a prediction requires further study, because the prediction time affects both accuracy and usefulness: the later the prediction, the more accurate it can be, since more information has become available, but the less useful it is (too late to mitigate risks).

Our data set has a class imbalance problem: the majority of issues are non-delayed. We used stratified sampling to mitigate this problem. We empirically studied issues from issue tracking systems to identify delayed issues using the proposed risk factors. However, there may be other factors contributing to the risk of delay that would affect the predictive performance of the model. All issue reports are real data generated in open source project settings. We cannot claim that our data set is representative of all kinds of software projects, especially in commercial settings; further study of how our predictive models perform on commercial projects is needed.

A number of studies on mining software repositories are related to our work. Hooimeijer and Weimer [7] modeled the quality of bug reports to predict bug resolving time. Zimmermann et al. [18] studied reopened bugs and developed a bug-reopening prediction model. Several studies focus on bug fix-time prediction (e.g. [6, 8, 17]). In particular, the approach proposed in [3] focused on extracting a number of primitive features to build a risk predictive model. Our approach, in contrast, presents an interpretable, threshold-based risk predictive model that supports project managers not only in risk identification but also in risk mitigation.

We have proposed an automated risk identification model based on historical data. We proposed five threshold-based risk factors which reflect abnormal behaviours that can cause schedule overruns. We developed a risk predictive model using a decision tree, which yields two important benefits: an effective predictive model, and decision rules that present the combinations of abnormal behaviours causing a delay.
In this paper, we performed a study on two major open source projects: Moodle and Duraspace. The evaluation results demonstrate the effectiveness of our predictive models, achieving on average 78% precision, 56% recall, 65% F-measure, and 84% Area Under the ROC Curve. In the future, we plan to expand our study to other large open source projects and to commercial software projects to further assess our predictive models. We will also explore additional risk factors such as collaboration among teams. Furthermore, we aim to study the correlation and causality between risk factors, which may improve predictive performance and provide more insightful information for project managers to mitigate risks.
6. REFERENCES

[1] B. W. Boehm. Software risk management: principles and practices. IEEE Software, 8(1):32–41, 1991.
[2] M. J. Carr and S. L. Konda. Taxonomy-based risk identification. Technical report, Software Engineering Institute, Carnegie Mellon University, June 1993.
[3] M. Choetkiertikul, H. K. Dam, T. Tran, and A. Ghose. Characterization and prediction of issue-related risks in software projects. In Proceedings of the 12th Working Conference on Mining Software Repositories (MSR 2015), to appear, 2015.
[4] K. de Bakker, A. Boonstra, and H. Wortmann. Does risk management contribute to IT project success? A meta-analysis of empirical evidence. International Journal of Project Management, 28(5):493–503, July 2010.
[5] J. L. Eveleens and C. Verhoef. The rise and fall of the Chaos report figures. IEEE Software, 27(1):30–36, 2010.
[6] E. Giger, M. Pinzger, and H. Gall. Predicting the fix time of bugs. In Proceedings of the 2nd International Workshop on Recommendation Systems for Software Engineering (RSSE ’10), pages 52–56. ACM Press, May 2010.
[7] P. Hooimeijer and W. Weimer. Modeling bug report quality. In Proceedings of the 22nd IEEE/ACM International Conference on Automated Software Engineering (ASE ’07), page 34. ACM Press, Nov. 2007.
[8] L. Marks, Y. Zou, and A. E. Hassan. Studying the fix-time for bugs in large open source projects. In Proceedings of the 7th International Conference on Predictive Models in Software Engineering (Promise ’11), pages 1–8. ACM Press, Sept. 2011.
[9] B. Michael, S. Blumberg, and J. Laartz. Delivering large-scale IT projects on time, on budget, and on value. McKinsey Quarterly, 2012.
[10] A. Pika, W. M. van der Aalst, C. J. Fidge, A. H. ter Hofstede, and M. T. Wynn. Profiling event logs to configure risk indicators for process delays. In Advanced Information Systems Engineering (CAiSE 2013), pages 465–481, July 2013.
[11] A. A. Porter, H. P. Siy, and L. G. Votta. Understanding the effects of developer activities on inspection interval. In Proceedings of the 19th International Conference on Software Engineering (ICSE ’97), pages 128–138. ACM Press, May 1997.
[12] J. R. Quinlan. C4.5: Programs for Machine Learning. Elsevier, 2014.
[13] P. J. Rousseeuw and A. M. Leroy. Robust Regression and Outlier Detection, volume 589. John Wiley & Sons, 2005.
[14] P. Runeson, M. Alexandersson, and O. Nyholm. Detection of duplicate defect reports using natural language processing. In 29th International Conference on Software Engineering (ICSE ’07), pages 499–510. IEEE, May 2007.
[15] E. Shihab, A. Ihara, Y. Kamei, W. M. Ibrahim, M. Ohira, B. Adams, A. E. Hassan, and K. Matsumoto. Studying re-opened bugs in open source software. Empirical Software Engineering, 18(5):1005–1042, Sept. 2012.
[16] L. Tichy and T. Bascom. The business end of IT project failure. Mortgage Banking, 68(6):28, 2008.
[17] C. Weiss, R. Premraj, T. Zimmermann, and A. Zeller. How long will it take to fix this bug? In Proceedings of the Fourth International Workshop on Mining Software Repositories (MSR 2007). IEEE, May 2007.
[18] T. Zimmermann, N. Nagappan, P. J. Guo, and B. Murphy. Characterizing and predicting which bugs get reopened. In 34th International Conference on Software Engineering (ICSE 2012), pages 1074–1083. IEEE Press, June 2012.