Predicting bug-fix time: using standard versus topic-based text categorization techniques
Pasquale Ardimento 1, Massimo Bilancia 2, Stefano Monopoli 3
1 Department of Informatics, University of Bari Aldo Moro
2 Ionian Department of Law, Economics and Environment, University of Bari Aldo Moro
3 Everis Italia S.p.A., Milan, Italy
DISCOVERY SCIENCE 2016. 19 – 21 October 2016 – Bari, Italy
Introduction
- In recent years, with the increasing complexity of software systems, the task of software quality assurance has become progressively more challenging
- Software companies spend over 45 percent of their costs on fixing bugs [Pressman and Maxim, 2014; Xuan et al., 2015]
- Predicting the time to fix a newly reported bug is an important target to support the whole bug triage process
- Broadly speaking, bug-fix time is defined as the calendar time from the triage of a bug to the time the bug is resolved and closed as fixed
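As a minimal sketch, the definition above can be made concrete as follows (the timestamp format is an assumption for illustration; Bugzilla stores these fields in its own schema):

```python
from datetime import datetime

def bug_fix_time_days(triaged_at: str, closed_at: str) -> float:
    """Calendar time, in days, from the triage of a bug to the time it
    is resolved and closed as FIXED (timestamp format is hypothetical)."""
    fmt = "%Y-%m-%d %H:%M"
    delta = datetime.strptime(closed_at, fmt) - datetime.strptime(triaged_at, fmt)
    return delta.total_seconds() / 86400.0

print(bug_fix_time_days("2016-01-10 09:00", "2016-01-17 09:00"))  # 7.0
```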
Introduction
- Once a bug is reported, it is typically recorded in the Bug Tracking System (BTS) and assigned to a developer to resolve (bug triage)
- Current practice of bug triage is largely a manual collaborative process, prone to error
- The triager first examines whether a bug report contains sufficient or duplicated information
- Then she/he confirms the bug, sets severity and priority, and finally decides who has the expertise to resolve it
Example: KDE BTS on Bugzilla
Bug-fix time prediction
- Bug-fix time can be considered a valuable proxy variable for bug severity
- Hence, many researchers have proposed methods for automated bug-fix time prediction, in order to make the assignment process more effective
- Most existing approaches build prediction models based on selected attributes of bug reports
- Machine learning plays an essential role
Bug-fix time prediction
- However, despite apparently positive findings, existing models often fail to validate on multiple large projects [Bhattacharya and Neamtiu, 2011]
- An alternative approach: instead of focusing on attribute subset selection, we use all available textual information
- The problem of bug-fix time estimation is then mapped to a text categorization problem
- A new bug report is classified into a set of discretized time-to-resolution classes (discretized bug-fix time, SLOW/FAST) by a classifier trained on historical data
The conceptual design
Text preprocessing
Traditional text categorization methods
- Multivariate Bernoulli model (MB): generates an indicator for each term of the vocabulary V, either 1 indicating presence of the term in the text or 0 indicating absence
- Vector Space (VS) model: documents are represented as long vectors in R^|V|
- Vectors are weighted using either the term frequency TF_td of word t in document d, or the term frequency-inverse document frequency TF-IDF_td = TF_td × IDF_t
- State-of-the-art classifier with VS representation: non-linear Support Vector Machine (SVM) with soft-margin classification
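The TF-IDF weighting can be sketched in a few lines; this uses the plain IDF_t = log(N / df_t) variant, where df_t is the number of documents containing term t (real systems often smooth the IDF):

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Weight term t in document d by TF-IDF_td = TF_td * IDF_t,
    with IDF_t = log(N / df_t) over the N documents (unsmoothed variant)."""
    N = len(corpus)
    doc_tfs = [Counter(doc.lower().split()) for doc in corpus]
    df = Counter()                       # document frequency of each term
    for tf in doc_tfs:
        df.update(tf.keys())
    return [{t: tf[t] * math.log(N / df[t]) for t in tf} for tf in doc_tfs]

reports = ["crash on startup", "crash when saving file", "slow startup"]
weights = tf_idf(reports)
# "crash" occurs in 2 of 3 reports, so it is weighted lower than "saving",
# which occurs in only 1 of 3
```

Terms common to many reports (low IDF) are down-weighted, which is exactly why TF-IDF vectors discriminate better than raw term frequencies.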
Topic models
- The models introduced so far limit each bug textual report to a single topic
- This assumption may often be too limiting to model a large collection of textual bug reports
- Any report typically concerns multiple topics and specific sub-issues in different proportions
- We want to infer such hidden structure using Bayesian posterior inference
Latent Dirichlet Allocation [Blei, 2012]
Supervised Latent Dirichlet Allocation (sLDA)
- Latent Dirichlet Allocation (LDA) is a powerful model for visualizing the hidden thematic structure in large corpora [Blei, Ng and Jordan, 2003]
- But LDA is an unsupervised model. How can we build a topic model that is good at the task we care about?
- Supervised topic models are topic models of documents and responses, fit to find topics predictive of the response
- Supervised Latent Dirichlet Allocation (sLDA) was introduced in [Blei and McAuliffe, 2007]
The sLDA graphical model [Blei, 2007]
sLDA as a Bayesian hierarchical model
1. Draw topic proportions θ | α ∼ Dirichlet_K(α)
2. For each word n = 1, …, N:
   - Draw topic assignment z_n | θ ∼ Multinomial_K(θ)
   - Draw word w_n | z_n, β_1:K ∼ Multinomial_|V|(β_{z_n})
3. Draw the response label y (SLOW/FAST, SLOW ≡ 1) from a logistic Generalized Linear Model (GLM):
   y | z_1:N, η ∼ Bernoulli( exp(η^T z̄) / (1 + exp(η^T z̄)) )
   where z̄ = (1/N) Σ_{n=1}^{N} z_n is the vector of empirical topic frequencies
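A toy draw from the sLDA generative process can be sketched as follows; the hyperparameters α, β and η here are made up for illustration, whereas a real model estimates β and η from the training corpus:

```python
import math
import random

def sample_slda_document(alpha, beta, eta, n_words, rng):
    """Toy draw from the sLDA generative process (hyperparameters are
    hypothetical, not fitted values)."""
    K = len(alpha)
    # 1. theta ~ Dirichlet_K(alpha), via normalized Gamma draws
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    theta = [g / sum(gammas) for g in gammas]
    # 2. For each word: z_n ~ Multinomial_K(theta), w_n ~ Multinomial_|V|(beta[z_n])
    topics = [rng.choices(range(K), weights=theta)[0] for _ in range(n_words)]
    words = [rng.choices(range(len(beta[z])), weights=beta[z])[0] for z in topics]
    # 3. y ~ Bernoulli(sigmoid(eta . zbar)), zbar = empirical topic frequencies
    zbar = [topics.count(k) / n_words for k in range(K)]
    score = sum(e * z for e, z in zip(eta, zbar))
    p_slow = math.exp(score) / (1.0 + math.exp(score))
    y = 1 if rng.random() < p_slow else 0       # 1 == SLOW
    return words, topics, y

rng = random.Random(0)
words, topics, y = sample_slda_document(
    alpha=[1.0, 1.0, 1.0],                                       # symmetric prior
    beta=[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]],    # 3 topics, |V| = 3
    eta=[2.0, -1.0, 0.0],                                        # GLM coefficients
    n_words=10, rng=rng)
```

Note that the response y depends on the words only through the empirical topic frequencies z̄, which is what makes the fitted topics predictive of the label.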
Posterior inference and label prediction
- Exact posterior inference of the latent model variables is not feasible
- The conditional posterior distribution p(θ, z_1:N | w_1:N, y, α, β_1:K, η) does not have a closed form
- Posterior inference hinges on mean-field variational inference (MFVI)
- MFVI is emerging as an exciting framework for fully Bayesian and empirical Bayesian inference problems [Blei et al., 2016]
- MFVI also provides a full solution for estimating the discretized bug-fix time of a newly opened bug via E(y_new | w_new_1:N, α̃, β̃_1:K, η̃). If E(y_new | w_new_1:N, α̃, β̃_1:K, η̃) > 0.5 then y_new ≡ 1 (≡ SLOW)
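The decision rule above can be sketched by pushing the expected topic frequencies of the new report through the fitted logistic GLM and thresholding at 0.5; the η values below are hypothetical, not fitted ones:

```python
import math

def predict_label(zbar_expected, eta_hat, threshold=0.5):
    """Approximate E(y_new | w_new) with the logistic GLM mean at the
    expected empirical topic frequencies, then threshold at 0.5."""
    score = sum(e * z for e, z in zip(eta_hat, zbar_expected))
    expected_y = math.exp(score) / (1.0 + math.exp(score))
    return ("SLOW" if expected_y > threshold else "FAST"), expected_y

label, ey = predict_label([0.6, 0.3, 0.1], eta_hat=[2.0, -1.0, 0.0])
# score = 2.0*0.6 - 1.0*0.3 = 0.9, so E(y) is about 0.71 > 0.5 -> SLOW
```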
Case study
- We obtained bug report information from the Bugzilla repositories of four large open source software projects: Eclipse, Gentoo, KDE and OpenOffice
- Data were automatically extracted from the Bugzilla data sources, using a scraping visual interface written in PHP/JavaScript/Ajax
- Raw textual reports were pre-processed and analyzed using the R software system
- We assumed SLOW as the positive class, SLOW thus being the target class of our prediction exercise
- We are interested in increasing the number of true positives for the positive class: over-estimating bug-fix time is considered a less severe error than under-estimating it
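With SLOW as the positive class, the evaluation metrics can be computed from the confusion matrix counts as follows:

```python
def classification_metrics(y_true, y_pred, positive="SLOW"):
    """Precision, recall and false positive rate, with SLOW as positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0   # true positives among all SLOW
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return precision, recall, fpr

y_true = ["SLOW", "SLOW", "FAST", "FAST"]
y_pred = ["SLOW", "FAST", "SLOW", "FAST"]
print(classification_metrics(y_true, y_pred))  # (0.5, 0.5, 0.5)
```

Maximizing recall directly targets the stated goal: as few SLOW bugs as possible should be missed, even at the price of more false positives.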
The visual scraping interface
Case study
- We selected textual reports of resolved and closed bugs only, whose Status field was set to VERIFIED and whose Resolution field was set to FIXED
- We discarded a few fields. For example, both Status and Resolution were discarded because these fields were used for bug report selection
- We filtered out all post-submission information from the test set (for example, comments posted after priority and severity were set for the first time)
- We tokenized the text into bi-grams, to take care of multi-word expressions
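Bi-gram tokenization can be sketched as a sliding window of width two over the word sequence:

```python
def bigram_tokens(text):
    """Tokenize a report into word bi-grams so that multi-word expressions
    (e.g. 'null pointer') survive as single features."""
    tokens = text.lower().split()
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

print(bigram_tokens("Null pointer exception on startup"))
# ['null pointer', 'pointer exception', 'exception on', 'on startup']
```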
Sample size dimensions

        Eclipse   Gentoo   KDE     OpenOffice
n1      44435     3704     1275    3057
n2      44347     2466     1270    3057
n3      1500      1500     1270    1500
n4      1200      1200     1016    1200
n5      300       300      254     300

n1: bug reports extracted for each project
n2: bug reports after removing corrupted or duplicated reports
n3: randomly sampled bug reports
n4: bugs in the training set (80:20 training/test ratio)
n5: bugs in the test set
Results: Eclipse and Gentoo

Eclipse
        Parameters              Accuracy  Precision  Recall  FPR
MB      λ = 2                   0.73      0.60       0.04    0.01
SVM     γ = 0.001, C = 10       0.67      0.23       0.09    0.11
SLDA    K = 25                  0.57      0.32       0.48    0.40

Gentoo
        Parameters              Accuracy  Precision  Recall  FPR
MB      λ = 2                   0.74      0.50       0.13    0.05
SVM     γ = 0.001, C = 100      0.74      0.67       0.03    0.00
SLDA    K = 30                  0.43      0.27       0.70    0.67
Results: KDE and OpenOffice

KDE
        Parameters              Accuracy  Precision  Recall  FPR
MB      λ = 2                   0.83      0.64       0.79    0.02
SVM     γ = 0.001, C = 100      0.60      0.03       0.01    0.19
SLDA    K = 40                  0.41      0.29       0.84    0.74

OpenOffice
        Parameters              Accuracy  Precision  Recall  FPR
MB      λ = 2                   0.78      0.00       0.00    0.00
SVM     γ = 0.001, C = 100      0.58      0.22       0.38    0.38
SLDA    K = 10                  0.51      0.23       0.55    0.55
Discussion
- The proposed model greatly improves recall (the proportion of true positives among all SLOW bugs), compared to single-topic algorithms
- Our method, however, suffers a substantial loss of accuracy
- Yet predictive accuracy provides meaningful and reliable comparisons only when the two target classes have equal importance
- In our experimental setting the negative class (FAST) plays a minor role, as the costs incurred by false positives are often very low
Conclusion
The proposed method seems promising for implementing a large-scale bug-fix time prediction system. However:
1. Using the quantile q_a with a = 0.75 to separate positive and negative instances is arbitrary; a sensitivity analysis is needed
2. Each method is trained under a number of parameter settings and tested on the test set, but only the best results are presented; extensive testing on large validation sets is therefore needed as well
3. Potential outliers of the distribution of bug-fix times were not identified and filtered out
4. Most defect tracking systems are just ticketing systems, which cannot keep track of the actual person-hours spent to resolve a bug
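The SLOW/FAST discretization discussed in point 1 can be sketched as follows; the exclusive-quantile estimator used here is one common choice, and the paper's exact estimator is an assumption:

```python
import statistics

def discretize_fix_times(fix_times_days, a=0.75):
    """Label fix times above the a-quantile (a = 0.75) as SLOW, the rest FAST."""
    # quantiles(..., n=100) returns the 99 percentile cut points
    q = statistics.quantiles(fix_times_days, n=100)[int(a * 100) - 1]
    return ["SLOW" if t > q else "FAST" for t in fix_times_days], q

labels, q = discretize_fix_times(list(range(1, 101)))
# q = 75.75 for the values 1..100, so the 25 slowest bugs (76..100) are SLOW
```

A sensitivity analysis would simply re-run the pipeline over a grid of values of a and compare the resulting classifiers.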
Essential bibliography
- Bhattacharya, P., Neamtiu, I. (2011). Bug-fix time prediction models: can we do better? In Proceedings of the 8th Working Conference on Mining Software Repositories (MSR '11), pp. 207–210. New York, NY, USA: ACM Press
- Blei, D., Ng, A., Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022
- Blei, D., McAuliffe, J.D. (2007). Supervised topic models. In Advances in Neural Information Processing Systems (NIPS '07)
- Blei, D. (2012). Probabilistic topic models. Communications of the ACM, 55(4):77–84
- Blei, D.M., Kucukelbir, A., McAuliffe, J.D. (2016). Variational inference: a review for statisticians. http://arxiv.org/abs/1601.00670
- Pressman, R.S., Maxim, B.R. (2014). Software Engineering: A Practitioner's Approach (Eighth Edition). McGraw-Hill Higher Education
- Xuan, J., Jiang, H., Hu, Y., Ren, Z., Zou, W., Luo, Z., Wu, X. (2015). Towards effective bug triage with software data reduction techniques. IEEE Transactions on Knowledge and Data Engineering, 27(1):264–280