Predicting bug-fix time: using standard versus topic-based text categorization techniques
Pasquale Ardimento 1, Massimo Bilancia 2, Stefano Monopoli 3
1 Department of Informatics, University of Bari Aldo Moro
2 Ionian Department of Law, Economics and Environment, University of Bari Aldo Moro
3 Everis Italia S.p.A., Milan, Italy
DISCOVERY SCIENCE 2016. 19 – 21 October 2016 – Bari, Italy
Introduction
- In recent years, with the increasing complexity of software systems, the task of software quality assurance has become progressively more challenging
- Software companies spend over 45 percent of their costs on fixing bugs [Pressman and Maxim, 2014; Xuan et al., 2015]
- Predicting the time to fix a newly reported bug is an important target to support the whole bug triage process
- Broadly speaking, bug-fix time is defined as the calendar time from the triage of a bug to the time the bug is resolved and closed as fixed
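As a minimal sketch, the definition above can be made concrete as follows (the timestamp format is an assumption for illustration; Bugzilla stores these fields in its own schema):

```python
from datetime import datetime

def bug_fix_time_days(triaged_at: str, closed_at: str) -> float:
    """Calendar time, in days, from the triage of a bug to the time it
    is resolved and closed as FIXED (timestamp format is hypothetical)."""
    fmt = "%Y-%m-%d %H:%M"
    delta = datetime.strptime(closed_at, fmt) - datetime.strptime(triaged_at, fmt)
    return delta.total_seconds() / 86400.0

print(bug_fix_time_days("2016-01-10 09:00", "2016-01-17 09:00"))  # 7.0
```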
Introduction
- Once a bug is reported, it is typically recorded in the Bug Tracking System (BTS) and assigned to a developer to resolve (bug triage)
- Current practice of bug triage is largely a manual collaborative process, prone to error
- The triager first examines whether a bug report contains sufficient or duplicated information
- Then she/he confirms the bug, sets severity and priority, and finally decides who has the expertise to resolve it
Example: KDE BTS on Bugzilla
Bug-fix time prediction
- Bug-fix time can be considered a valuable proxy variable for bug severity
- Hence, many researchers have proposed methods for automated bug-fix time prediction, in order to make the assignment process more effective
- Most existing approaches build prediction models based on selected attributes of bug reports
- Machine learning plays an essential role
Bug-fix time prediction
- However, despite apparently positive findings, existing models often fail to validate on multiple large projects [Bhattacharya and Neamtiu, 2011]
- An alternative approach: instead of focusing on attribute subset selection, we use all available textual information
- The problem of bug-fix time estimation is then mapped to a text categorization problem
- A new bug report is classified into a set of discretized time-to-resolution classes (discretized bug-fix time, SLOW/FAST) by a classifier trained on historical data
The conceptual design
Text preprocessing
Traditional text categorization methods
- Multivariate Bernoulli model (MB): generates an indicator for each term of the vocabulary V, either 1 indicating presence of the term in the text or 0 indicating absence
- Vector Space (VS) model: documents are represented as long vectors in R^|V|
- Vectors are weighted using either the term frequency TF_td of word t in document d, or the term frequency-inverse document frequency TF-IDF_td = TF_td × IDF_t
- State-of-the-art classifier with VS representation: non-linear Support Vector Machine (SVM) with soft-margin classification
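The TF-IDF weighting can be sketched in a few lines; this uses the plain IDF_t = log(N / df_t) variant, where df_t is the number of documents containing term t (real systems often smooth the IDF):

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Weight term t in document d by TF-IDF_td = TF_td * IDF_t,
    with IDF_t = log(N / df_t) over the N documents (unsmoothed variant)."""
    N = len(corpus)
    doc_tfs = [Counter(doc.lower().split()) for doc in corpus]
    df = Counter()                       # document frequency of each term
    for tf in doc_tfs:
        df.update(tf.keys())
    return [{t: tf[t] * math.log(N / df[t]) for t in tf} for tf in doc_tfs]

reports = ["crash on startup", "crash when saving file", "slow startup"]
weights = tf_idf(reports)
# "crash" occurs in 2 of 3 reports, so it is weighted lower than "saving",
# which occurs in only 1 of 3
```

Terms common to many reports (low IDF) are down-weighted, which is exactly why TF-IDF vectors discriminate better than raw term frequencies.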
Topic models
- The models introduced so far limit each bug textual report to a single topic
- This assumption may often be too limiting to model a large collection of textual bug reports
- Any report typically concerns multiple topics and specific sub-issues in different proportions
- We want to infer such hidden structure using Bayesian posterior inference
Latent Dirichlet Allocation [Blei, 2012]
Supervised Latent Dirichlet Allocation (sLDA)
- Latent Dirichlet Allocation (LDA) is a powerful model for visualizing the hidden thematic structure in large corpora [Blei, Ng and Jordan, 2003]
- But LDA is an unsupervised model. How can we build a topic model that is good at the task we care about?
- Supervised topic models are topic models of documents and responses, fit to find topics predictive of the response
- Supervised Latent Dirichlet Allocation (sLDA) was introduced in [Blei and McAuliffe, 2007]
The sLDA graphical model [Blei, 2007]
sLDA as a Bayesian hierarchical model
1. Draw topic proportions θ | α ∼ Dirichlet_K(α)
2. For each word n = 1, …, N:
   - Draw topic assignment z_n | θ ∼ Multinomial_K(θ)
   - Draw word w_n | z_n, β_1:K ∼ Multinomial_|V|(β_{z_n})
3. Draw the response label y (SLOW/FAST, SLOW ≡ 1) from a logistic Generalized Linear Model (GLM):
   y | z_1:N, η ∼ Bernoulli( exp(η^T z̄) / (1 + exp(η^T z̄)) )
   where z̄ = (1/N) Σ_{n=1}^{N} z_n is the vector of empirical topic frequencies
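A toy draw from the sLDA generative process can be sketched as follows; the hyperparameters α, β and η here are made up for illustration, whereas a real model estimates β and η from the training corpus:

```python
import math
import random

def sample_slda_document(alpha, beta, eta, n_words, rng):
    """Toy draw from the sLDA generative process (hyperparameters are
    hypothetical, not fitted values)."""
    K = len(alpha)
    # 1. theta ~ Dirichlet_K(alpha), via normalized Gamma draws
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    theta = [g / sum(gammas) for g in gammas]
    # 2. For each word: z_n ~ Multinomial_K(theta), w_n ~ Multinomial_|V|(beta[z_n])
    topics = [rng.choices(range(K), weights=theta)[0] for _ in range(n_words)]
    words = [rng.choices(range(len(beta[z])), weights=beta[z])[0] for z in topics]
    # 3. y ~ Bernoulli(sigmoid(eta . zbar)), zbar = empirical topic frequencies
    zbar = [topics.count(k) / n_words for k in range(K)]
    score = sum(e * z for e, z in zip(eta, zbar))
    p_slow = math.exp(score) / (1.0 + math.exp(score))
    y = 1 if rng.random() < p_slow else 0       # 1 == SLOW
    return words, topics, y

rng = random.Random(0)
words, topics, y = sample_slda_document(
    alpha=[1.0, 1.0, 1.0],                                       # symmetric prior
    beta=[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]],    # 3 topics, |V| = 3
    eta=[2.0, -1.0, 0.0],                                        # GLM coefficients
    n_words=10, rng=rng)
```

Note that the response y depends on the words only through the empirical topic frequencies z̄, which is what makes the fitted topics predictive of the label.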
Posterior inference and label prediction
- Exact posterior inference of the latent model variables is not feasible
- The conditional posterior distribution p(θ, z_1:N | w_1:N, y, α, β_1:K, η) does not have a closed form
- Posterior inference hinges on mean-field variational inference (MFVI)
- MFVI is emerging as an exciting framework for fully Bayesian and empirical Bayesian inference problems [Blei et al., 2016]
- MFVI also provides a full solution for estimating the discretized bug-fix time of a newly opened bug via E(y_new | w_new_1:N, α̃, β̃_1:K, η̃). If E(y_new | w_new_1:N, α̃, β̃_1:K, η̃) > 0.5 then y_new ≡ 1 (≡ SLOW)
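The decision rule above can be sketched by pushing the expected topic frequencies of the new report through the fitted logistic GLM and thresholding at 0.5; the η values below are hypothetical, not fitted ones:

```python
import math

def predict_label(zbar_expected, eta_hat, threshold=0.5):
    """Approximate E(y_new | w_new) with the logistic GLM mean at the
    expected empirical topic frequencies, then threshold at 0.5."""
    score = sum(e * z for e, z in zip(eta_hat, zbar_expected))
    expected_y = math.exp(score) / (1.0 + math.exp(score))
    return ("SLOW" if expected_y > threshold else "FAST"), expected_y

label, ey = predict_label([0.6, 0.3, 0.1], eta_hat=[2.0, -1.0, 0.0])
# score = 2.0*0.6 - 1.0*0.3 = 0.9, so E(y) is about 0.71 > 0.5 -> SLOW
```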
Case study
- We obtained bug report information from the Bugzilla repositories of four large open source software projects: Eclipse, Gentoo, KDE and OpenOffice
- Data were automatically extracted from the Bugzilla data sources, using a scraping visual interface written in PHP/JavaScript/Ajax
- Raw textual reports were pre-processed and analyzed using the R software system
- We assumed SLOW as the positive class, SLOW thus being the target class of our prediction exercise
- We are interested in increasing the number of true positives for the positive class: over-estimating bug-fix time is considered a less severe error than under-estimating it
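With SLOW as the positive class, the evaluation metrics can be computed from the confusion matrix counts as follows:

```python
def classification_metrics(y_true, y_pred, positive="SLOW"):
    """Precision, recall and false positive rate, with SLOW as positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0   # true positives among all SLOW
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return precision, recall, fpr

y_true = ["SLOW", "SLOW", "FAST", "FAST"]
y_pred = ["SLOW", "FAST", "SLOW", "FAST"]
print(classification_metrics(y_true, y_pred))  # (0.5, 0.5, 0.5)
```

Maximizing recall directly targets the stated goal: as few SLOW bugs as possible should be missed, even at the price of more false positives.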
The visual scraping interface
Case study
- We selected textual reports of resolved and closed bugs only, whose Status field was set to VERIFIED and whose Resolution field was set to FIXED
- We discarded a few fields. For example, both Status and Resolution were discarded because these fields were used for bug report selection
- We filtered out all post-submission information from the test set (for example, comments posted after priority and severity were set for the first time)
- We tokenized the text into bi-grams, to take care of multi-word expressions
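Bi-gram tokenization can be sketched as a sliding window of width two over the word sequence:

```python
def bigram_tokens(text):
    """Tokenize a report into word bi-grams so that multi-word expressions
    (e.g. 'null pointer') survive as single features."""
    tokens = text.lower().split()
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

print(bigram_tokens("Null pointer exception on startup"))
# ['null pointer', 'pointer exception', 'exception on', 'on startup']
```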
Sample size dimensions

        Eclipse   Gentoo   KDE     OpenOffice
n1      44435     3704     1275    3057
n2      44347     2466     1270    3057
n3      1500      1500     1270    1500
n4      1200      1200     1016    1200
n5      300       300      254     300

n1: bug reports extracted for each project
n2: bug reports after removing corrupted or duplicated reports
n3: randomly sampled bug reports
n4: bugs in the training set (80:20 training/test ratio)
n5: bugs in the test set
Results: Eclipse and Gentoo

Eclipse
        Parameters              Accuracy  Precision  Recall  FPR
MB      λ = 2                   0.73      0.60       0.04    0.01
SVM     γ = 0.001, C = 10       0.67      0.23       0.09    0.11
SLDA    K = 25                  0.57      0.32       0.48    0.40

Gentoo
        Parameters              Accuracy  Precision  Recall  FPR
MB      λ = 2                   0.74      0.50       0.13    0.05
SVM     γ = 0.001, C = 100      0.74      0.67       0.03    0.00
SLDA    K = 30                  0.43      0.27       0.70    0.67
Results: KDE and OpenOffice

KDE
        Parameters              Accuracy  Precision  Recall  FPR
MB      λ = 2                   0.83      0.64       0.79    0.02
SVM     γ = 0.001, C = 100      0.60      0.03       0.01    0.19
SLDA    K = 40                  0.41      0.29       0.84    0.74

OpenOffice
        Parameters              Accuracy  Precision  Recall  FPR
MB      λ = 2                   0.78      0.00       0.00    0.00
SVM     γ = 0.001, C = 100      0.58      0.22       0.38    0.38
SLDA    K = 10                  0.51      0.23       0.55    0.55
Discussion
- The proposed model greatly improves recall (the proportion of true positives among all SLOW bugs), compared to single-topic algorithms
- Our method, however, suffers a substantial loss of accuracy
- Yet predictive accuracy provides meaningful and reliable comparisons only when the two target classes have equal importance
- In our experimental setting the negative class (FAST) plays a minor role, as the costs incurred by false positives are often very low
Conclusion
The proposed method seems promising for implementing a large-scale bug-fix time prediction system. However:
1. Using the quantile q_a with a = 0.75 to separate positive and negative instances is arbitrary; a sensitivity analysis is needed
2. Each method is trained under a number of parameter settings and tested on the test set, but only the best results are presented; extensive testing on large validation sets is therefore needed as well
3. Potential outliers of the distribution of bug-fix times were not identified and filtered out
4. Most defect tracking systems are just ticketing systems, which cannot keep track of the actual person-hours spent to resolve a bug
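The SLOW/FAST discretization discussed in point 1 can be sketched as follows; the exclusive-quantile estimator used here is one common choice, and the paper's exact estimator is an assumption:

```python
import statistics

def discretize_fix_times(fix_times_days, a=0.75):
    """Label fix times above the a-quantile (a = 0.75) as SLOW, the rest FAST."""
    # quantiles(..., n=100) returns the 99 percentile cut points
    q = statistics.quantiles(fix_times_days, n=100)[int(a * 100) - 1]
    return ["SLOW" if t > q else "FAST" for t in fix_times_days], q

labels, q = discretize_fix_times(list(range(1, 101)))
# q = 75.75 for the values 1..100, so the 25 slowest bugs (76..100) are SLOW
```

A sensitivity analysis would simply re-run the pipeline over a grid of values of a and compare the resulting classifiers.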
Essential bibliography
- Bhattacharya, P., Neamtiu, I. (2011). Bug-fix time prediction models: can we do better? In Proceedings of the 8th Working Conference on Mining Software Repositories (MSR '11), pp. 207–210. New York, NY, USA: ACM Press
- Blei, D., Ng, A., Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022
- Blei, D., McAuliffe, J.D. (2007). Supervised topic models. In Advances in Neural Information Processing Systems (NIPS '07)
- Blei, D. (2012). Probabilistic topic models. Communications of the ACM, 55(4):77–84
- Blei, D.M., Kucukelbir, A., McAuliffe, J.D. (2016). Variational inference: a review for statisticians. http://arxiv.org/abs/1601.00670
- Pressman, R.S., Maxim, B.R. (2014). Software Engineering: A Practitioner's Approach (Eighth Edition). McGraw-Hill Higher Education
- Xuan, J., Jiang, H., Hu, Y., Ren, Z., Zou, W., Luo, Z., Wu, X. (2015). Towards effective bug triage with software data reduction techniques. IEEE Transactions on Knowledge and Data Engineering, 27(1):264–280