Predictive Data Mining Model for Software Bug Estimation Using Average Weighted Similarity Naresh Kumar Nagwani,

Dr. Shrish Verma,

Department of CS&E, NIT Raipur, [email protected]

Department of Info. Tech., NIT Raipur, [email protected]

Abstract – Software bug estimation is an essential activity for effective and proper software project planning. All software bug related data are kept in software bug repositories. Software bug (defect) repositories contain a lot of useful information related to the development of a project. Data mining techniques can be applied to these repositories to discover useful and interesting patterns. In this paper a predictive data mining technique is proposed to predict the software bug estimation from a software bug repository. A two-step prediction model is proposed. In the first step, the summary and description of the bug for which an estimation is required are matched against the summary and description of the bugs available in the bug repository. A weighted similarity model is suggested to match the summary and description for a pair of software bugs. In the second step the fix durations of all the similar bugs are calculated and their average is taken, which indicates the predicted estimation for the bug. The proposed model is implemented using open source technologies and is explained with the help of an illustrative example.

Keywords: Software bug repositories, Bug estimation, Weighted similarity, Estimation prediction

I. INTRODUCTION

A bug is a defect in software. A bug indicates unexpected behavior of some given requirement during software development. During software testing, unexpected behavior of requirements is identified by software testers or quality engineers and marked as a bug. In this paper defect and bug are used as synonyms. Bugs are managed and tracked using a number of available tools like Bugzilla, Perforce, JIRA etc.

1.1 Bug Repositories

Most open source projects and large projects manage their software development related data using project management tools. For managing the bugs associated with the software, bug tracking tools are used. These bug tracking systems provide online interfaces to the various users associated with the projects. The tools internally manage the bug repositories where all the bugs and related data are stored. For example, for the Mozilla project the bugs are tracked using the Bugzilla tool [10]. Bugzilla provides all the Mozilla bugs in the form of an online repository. By specifying the bug id in Mozilla's online repository, any user can fetch the required bug information. The url for Mozilla's bug repository is "https://bugzilla.mozilla.org/show_bug.cgi?id=".

1.2 Bug States

A software bug passes through various states during its resolution. Figure-1 depicts a general bug state diagram: the boxes indicate the bug states and the arrows indicate the transitions between them. The most common and simple path a bug follows is Open → In-Progress → Resolved → Closed. When a bug is identified by a tester or a quality engineer, its summary, description and related information are entered into the bug tracking system, and every bug gets one unique id number. As soon as the bug is created it enters the "Open" state and is assigned to one of the developers for fixing. Once the assigned developer starts working on the resolution, the bug enters the "In-Progress" state. After fixing the bug, the developer marks it as "Resolved", and it is assigned back to the tester or quality engineer for verification. Once the bug is verified and found ok, it is marked as "Closed".

Figure 1. Bug state diagram

c 978-1-4244-4791-6/10/$25.00 2010 IEEE

1.3 Bug Estimation Prediction

In this section two important questions are addressed:

1. What is software bug estimation?
2. Why is bug estimation needed?

Software bug estimation is the process of analyzing the time and effort required for a software bug. Software bug estimation is a primary need of project planning; it helps in scheduling releases, managing resources efficiently, discovering risk in project builds etc.


1.4 Text Similarity Techniques

Prediction of the estimation can be done by analyzing the effort taken by similar tasks previously. This rule is applicable to software bugs as well. A bug is composed of a number of attributes like summary, description, comments etc. All these attributes are text data, so discovering similar software bugs requires text similarity techniques. A number of string similarity algorithms exist; some of the most commonly used, studied in [4], are discussed here. They can be used for calculating the similarity of bug summaries, descriptions, comments etc., because these data are in string format.

1.4.1 Dice Similarity

The Dice coefficient is a word-based similarity measure. The similarity value is the ratio of the number of words common to both sentences to the total number of words of the two sentences. When comparing two sentences Q and S, if Ncommon is the count of common words, NQ is the total count of words of sentence Q, and NS is the total count of words of sentence S, the Dice coefficient can be expressed as follows.

Dice(Q, S) = (2 · Ncommon) / (NQ + NS)
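The Dice coefficient above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes whitespace tokenization and treats each sentence as a set of lowercased words.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class DiceSimilarity {
    // Dice(Q, S) = 2 * Ncommon / (NQ + NS), computed over word sets
    public static double dice(String q, String s) {
        Set<String> qWords = new HashSet<>(Arrays.asList(q.toLowerCase().split("\\s+")));
        Set<String> sWords = new HashSet<>(Arrays.asList(s.toLowerCase().split("\\s+")));
        if (qWords.isEmpty() && sWords.isEmpty()) return 0.0;
        Set<String> common = new HashSet<>(qWords);
        common.retainAll(sWords);  // Ncommon = words shared by both sentences
        return (2.0 * common.size()) / (qWords.size() + sWords.size());
    }
}
```

For example, "crash with group by" vs "crash with rollup" share two of seven total words, giving 2·2/7 ≈ 0.571.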

In the vector space model, documents (S) and queries (Q) are decomposed into smaller word units. All words are used as elements in the vectors that represent Q and S. Both vectors contain weights assigned to each word corresponding to the number of occurrences of that word within them.

1.4.2 Cosine Similarity

Cosine similarity measurement is a very common way of calculating corpus-based sentence or string similarity. The formula is given in the following equation (Salton et al., 1983) for t words:

COSINE(Q, S) = Σ_{k=1..t} (wqk · wsk) / √( Σ_{k=1..t} (wqk)² · Σ_{k=1..t} (wsk)² )

where wqk and wsk are the weights of the words present in sentences Q and S.

1.4.3 BLEU Similarity


BLEU is a method for automatic evaluation of machine translation. Here the algorithm is used to rank similar sentences by comparing the input sentence Q with a single reference sentence S. The implementation is based on the following formula:

log BLEU = min(1 − r/c, 0) + Σ_{n=1..N} wn log pn

where c is the length of Q and r is the length of S. The second term calculates the geometric average of the modified n-gram precisions pn. If pn is zero, a constant value ε is added to make pn non-zero.

1.4.4 TF-IDF Similarity

Term Frequency (TF) and Inverse Document Frequency (IDF) are very common factors for calculating the similarity among strings, and both are widely used in text mining to discover similarities. The product of the two gives the weight of a term in a document:

TF-IDF Sim = tf(t) · idf(t), where idf(t) = log( |D| / document-frequency(t) )
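The idf(t) factor can be sketched as below. This is an illustrative helper, not part of the paper's framework; it assumes a corpus given as a list of strings and a simple substring-based document-frequency count.

```java
import java.util.List;

public class IdfExample {
    // idf(t) = log(|D| / document-frequency(t)): terms appearing in fewer
    // documents receive a higher weight
    public static double idf(List<String> documents, String term) {
        long df = documents.stream()
                .filter(doc -> doc.toLowerCase().contains(term.toLowerCase()))
                .count();
        if (df == 0) return 0.0; // absent term: no discriminative weight
        return Math.log((double) documents.size() / df);
    }
}
```

With a three-document corpus where "rollup" appears once, idf("rollup") = log(3/1) = log 3, while a term in two documents gets the smaller log(3/2).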

1.4.5 Jaccard Similarity

This is another token-based vector space similarity measure, like the cosine distance and the matching coefficient. Jaccard similarity uses the word sets of the compared instances to evaluate similarity, and it penalizes a small number of shared entries (as a portion of all non-zero entries) more than the Dice coefficient does. Jaccard similarity is frequently used as a similarity measure for chemical compounds. The Jaccard similarity between two vectors X and Y is

Jaccard Sim = (X·Y) / (|X|² + |Y|² − (X·Y))

where (X·Y) is the inner product of X and Y, and |X| = (X·X)^(1/2), i.e. the Euclidean norm of X. For word sets this can be described more simply as |X ∩ Y| / |X ∪ Y|.

In this paper a software bug estimation model is proposed for predicting the resolution time of a software bug using the information available in software bug repositories. The structure of this paper is as follows. Section 2 describes previous and related work in the same area. The proposed prediction model is explained in Section 3. An illustrative example of the proposed prediction model is given in Section 4. Implementation details of the model are covered in Section 5. Section 6 carries the performance evaluation of the proposed prediction model, and Section 7 discusses the conclusion and future scope of the proposed work.
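The set form of the Jaccard similarity from Section 1.4.5, |X ∩ Y| / |X ∪ Y|, can be sketched as follows; like the other sketches here it is illustrative only and assumes whitespace tokenization into lowercased word sets.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class JaccardSimilarity {
    // Jaccard = |X ∩ Y| / |X ∪ Y| over word sets
    public static double jaccard(String x, String y) {
        Set<String> xs = new HashSet<>(Arrays.asList(x.toLowerCase().split("\\s+")));
        Set<String> ys = new HashSet<>(Arrays.asList(y.toLowerCase().split("\\s+")));
        Set<String> union = new HashSet<>(xs);
        union.addAll(ys);
        if (union.isEmpty()) return 0.0;
        Set<String> inter = new HashSet<>(xs);
        inter.retainAll(ys);
        return (double) inter.size() / union.size();
    }
}
```

On "crash with group by" vs "crash with rollup" the intersection has 2 words and the union 5, giving 0.4 — smaller than the Dice value for the same pair, illustrating the stronger penalty.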

2010 IEEE 2nd International Advance Computing Conference

II. BACKGROUND

Many researchers have worked in the software bug estimation area. In this section previous and related work is discussed. Weiß, Premraj, Zimmermann and Zeller have given a method to predict software bug estimates: in their technique the Lucene framework is used to search for similar bugs in earlier reports, and the average of their fix times is used as the predicted estimation time [1-2]. Kim, Pan and Whitehead, Jr. [6] have presented a bug finding algorithm using bug fix memories; a project-specific bug and fix knowledge base was developed to analyze the history of bug fixes. Pieter Hooimeijer and Westley Weimer [10] have presented a descriptive model of bug report quality based on a statistical analysis of surface features of available bug reports for the Mozilla Firefox project; their model predicts whether a bug report is triaged within a given amount of time. Panjer [3] has explored the feasibility of using data mining tools to predict the time to fix a bug given only the basic information known at the beginning of a bug's lifetime. To address this question, a historical portion of the Eclipse Bugzilla database was used for modeling and predicting bug lifetimes; a bug history transformation process was described and several data mining models were built and tested.

In this paper a new approach towards estimating software bug effort is proposed. The goal is to provide users an accurate way to predict the estimate for a software bug using an average weighted similarity model. The framework is implemented using open source technologies only, and experiments are done for performance analysis.

III. PROPOSED PREDICTION MODEL

In this section the proposed prediction model is explained. Whenever a new bug enters the bug repository, its summary and description are matched against the summary and description of the software bugs stored in the repository. If the summary and description match, i.e. the match value of summary and description is greater than some minimum similarity threshold, then the predicted fix duration is the average of the fix durations of the matched (similar) bugs. A bug is said to be similar to another bug if its overall similarity Simbug is above the similarity threshold τ, as expressed in (1). The overall bug similarity is given by (2), where WS is the similarity weight for the summary of the bug, SimSummary is the similarity value for the summary, and WD and SimDescription are the weight and similarity value for the description of the bug. The overall similarity of the bug is mapped to a value in the range [0…1] by applying the restriction given in (3), and the similarity value ranges for summary and description are given in (4) and (5).

Simbug > τ                                          (1)

Simbug = WS * SimSummary + WD * SimDescription      (2)

WS + WD = 1                                         (3)

0 ≤ SimSummary ≤ 1                                  (4)

0 ≤ SimDescription ≤ 1                              (5)
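Equations (1)–(3) can be sketched in code as follows. The method and class names are illustrative, not from the paper's implementation; the similarity values for summary and description are assumed to come from one of the text similarity techniques of Section 1.4.

```java
public class BugSimilarity {
    // Simbug = WS * SimSummary + WD * SimDescription, with WS + WD = 1 (eqs. 2, 3)
    public static double overallSimilarity(double simSummary, double simDescription,
                                           double ws, double wd) {
        if (Math.abs(ws + wd - 1.0) > 1e-9)
            throw new IllegalArgumentException("weights must sum to 1");
        return ws * simSummary + wd * simDescription;
    }

    // eq. (1): a candidate bug counts as similar only when Simbug exceeds tau
    public static boolean isSimilar(double simBug, double tau) {
        return simBug > tau;
    }
}
```

For example, with cosine similarities 0.4718 (summary) and 0.5215 (description) and weights WS = 0.3, WD = 0.7, the overall similarity is 0.3·0.4718 + 0.7·0.5215 = 0.50659, which passes a threshold of τ = 0.5.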

After finding the similar bugs, their fix durations are taken, and the average of the fix durations represents the predicted estimation for the given software bug. This can be represented using (6):

T(Bug) = (T(SimBug1) + T(SimBug2) + … + T(SimBugN)) / N    (6)

where T(Bug) is the estimated time for the software bug, T(SimBugi) is the fix duration of the i'th similar bug, and N is the total number of similar bugs.

Once the similar software bugs are found, their fix durations are counted for the prediction. The total fix duration of a bug can be calculated by examining the attributes bug-status, bug-creation-date and bug-modified-date. If the bug-status is "Fixed" or "Resolved" then the fix duration is the difference between bug-modified-date and bug-creation-date. This difference can be transformed into any time granularity, e.g. hours, days, weeks, months etc. In this paper the number of days is taken as the time granularity for the bug fix duration.

    import java.text.DateFormat;
    import java.text.SimpleDateFormat;
    import java.util.Calendar;

    public static long fixDurationInDays(BugObject bugObject) {
        long diffDays = 0;
        String submitted = bugObject.getSubmitted();
        String modified = bugObject.getModified();
        DateFormat df = new SimpleDateFormat("dd MMM yyyy hh:mm");
        try {
            if (bugObject.getStatus().equals("Fixed")
                    || bugObject.getStatus().equals("Resolved")) {
                Calendar calendarSubmitted = Calendar.getInstance();
                Calendar calendarModified = Calendar.getInstance();
                calendarSubmitted.setTime(df.parse(submitted));
                calendarModified.setTime(df.parse(modified));
                long milliseconds1 = calendarSubmitted.getTimeInMillis();
                long milliseconds2 = calendarModified.getTimeInMillis();
                long diff = milliseconds2 - milliseconds1;
                diffDays = diff / (24 * 60 * 60 * 1000);
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        return diffDays;
    }
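The averaging step of equation (6) can be sketched as a standalone helper. The class and method names are illustrative; it only assumes the per-bug fix durations (in days) have already been computed, e.g. by a routine like fixDurationInDays above.

```java
import java.util.List;

public class EstimationAverage {
    // T(Bug) = (T(SimBug1) + ... + T(SimBugN)) / N  -- eq. (6),
    // rounded to the nearest whole day
    public static long predictedEstimateDays(List<Long> fixDurationsDays) {
        if (fixDurationsDays.isEmpty()) return 0L;
        double sum = 0;
        for (long d : fixDurationsDays) sum += d;
        return Math.round(sum / fixDurationsDays.size());
    }
}
```

With the fix durations 58, 45, 42 and 65 days used later in the illustrative example, the predicted estimate is (58+45+42+65)/4 = 52.5 ≈ 53 days.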

Figure-2 summarizes the process of calculating the bug similarity between a pair of software bugs. A bug consists of a number of attributes, of which summary and description are the most important. Summary and description are matched using one of the text similarity approaches, weights are assigned to the match values of summary and description, and the overall similarity is calculated.

TABLE II. SIMILARITY CALCULATIONS FOR SUMMARY AND DESCRIPTION USING VARIOUS SIMILARITY TECHNIQUES

          Cosine Similarity        Jaccard Similarity       TF-IDF Similarity
Bug #id   SimSummary  SimDescr.    SimSummary  SimDescr.    SimSummary  SimDescr.
32482     0.4718      0.5215       0.315       0.5934       0.502       0.6315
34945     0.174       0.2215       0.0952      0.1332       0.1672      0.1872
35993     0.4398      0.6228       0.3875      0.5675       0.5198      0.7168
36023     0.496       0.5972       0.4666      0.5415       0.5636      0.6231
36128     0.123       0.1761       0.0625      0.1146       0.123       0.2367
39002     0.1672      0.2147       0.0909      0.1324       0.1818      0.2141
41868     0.4896      0.6332       0.4052      0.5964       0.4906      0.7767

Figure 2. Calculation of bug similarity by using the attributes summary and description

IV. ILLUSTRATIVE EXAMPLE

For the illustrative example, bugs from the MySql bug repository are taken. Table-1 contains the software bugs from MySql. Suppose the estimation for MySql bug 44399 is required. Table-2 presents the similarity measures for the summary and description of the bugs in Table-1 using different text similarity techniques. Table-3 presents the overall bug similarity for different summary and description similarity weights using the cosine similarity technique. Table-4 and Table-5 present the overall bug similarities for different weights using the Jaccard and TF-IDF text similarity techniques respectively.

TABLE I. EXAMPLE BUGS FROM MYSQL BUG REPOSITORY

BUG-ID   SUMMARY
32482    crash with GROUP BY alias_of_user_variable WITH ROLLUP
34945    ref_or_null queries that are null_rejecting and have a null value crash mysql
35993    severe memory corruption and crash with multibyte conversion
36023    crash/huge memory alloc with max and case when if statement
36128    not in subquery causes crash in cleanup..
39002    crash with insert .. select * from ... on duplicate key update col=default
41868    crash or memory overrun with concat + upper, date_format functions
44399    crash with statement using TEXT columns, aggregates, GROUP BY, and HAVING

TABLE III. OVERALL BUG SIMILARITY CALCULATION USING DIFFERENT SIMILARITY WEIGHTS FOR COSINE SIMILARITY RESULTS.

          SimBug for each match weight (WS, WD)
Bug #id   (0.3, 0.7)  (0.4, 0.6)  (0.5, 0.5)  (0.6, 0.4)  (0.7, 0.3)
32482     0.50659     0.50162     0.49665     0.49168     0.48671
34945     0.20725     0.2025      0.19775     0.193       0.18825
35993     0.5679      0.5496      0.5313      0.513       0.4947
36023     0.56684     0.55672     0.5466      0.53648     0.52636
36128     0.16017     0.15486     0.14955     0.14424     0.13893
39002     0.20045     0.1957      0.19095     0.1862      0.18145
41868     0.59012     0.57576     0.5614      0.54704     0.53268

TABLE IV. OVERALL BUG SIMILARITY CALCULATION USING DIFFERENT SIMILARITY WEIGHTS FOR JACCARD SIMILARITY RESULTS.

          SimBug for each match weight (WS, WD)
Bug #id   (0.3, 0.7)  (0.4, 0.6)  (0.5, 0.5)  (0.6, 0.4)  (0.7, 0.3)
32482     0.50988     0.48204     0.4542      0.42636     0.39852
34945     0.1218      0.118       0.1142      0.1104      0.1066
35993     0.5135      0.4955      0.4775      0.4595      0.4415
36023     0.51903     0.51154     0.50405     0.49656     0.48907
36128     0.09897     0.09376     0.08855     0.08334     0.07813
39002     0.11995     0.1158      0.11165     0.1075      0.10335
41868     0.53904     0.51992     0.5008      0.48168     0.46256

Suppose the values of the similarity threshold τ, WS and WD are taken as 0.5, 0.3 and 0.7 respectively, using the cosine text similarity technique for the experiments (Table-III contains the result values for these parameters). Then the similar bugs found in the MySql bug repository are {32482, 35993, 36023, 41868}. Table-6 presents the calculated fix durations (in number of days) of the similar software bugs. The average of all these fix durations, i.e. (58+45+42+65)/4 = 52.5 ≈ 53 days, is the estimation for bug 44399, which was required.


TABLE V. OVERALL BUG SIMILARITY CALCULATION USING DIFFERENT SIMILARITY WEIGHTS FOR TF-IDF SIMILARITY RESULTS.

          SimBug for each match weight (WS, WD)
Bug #id   (0.3, 0.7)  (0.4, 0.6)  (0.5, 0.5)  (0.6, 0.4)  (0.7, 0.3)
32482     0.59265     0.5797      0.56675     0.5538      0.54085
34945     0.1812      0.1792      0.1772      0.1752      0.1732
35993     0.6577      0.638       0.6183      0.5986      0.5789
36023     0.60525     0.5993      0.59335     0.5874      0.58145
36128     0.20259     0.19122     0.17985     0.16848     0.15711
39002     0.20441     0.20118     0.19795     0.19472     0.19149
41868     0.69087     0.66226     0.63365     0.60504     0.57643

TABLE VI. CALCULATION OF BUG FIX DURATION FOR THE SIMILAR BUGS.

Bug #id        32482   35993   36023   41868
Fix Duration   58      45      42      65

V. IMPLEMENTATION

The implementation is done using Java, JDBC [7] and MySql [8] technologies. The Xapian stemmer is also used to clean up the strings for comparison. Implementation is done in two phases – data preprocessing and estimation model implementation.

5.1 Data Preprocessing

As with every data mining technique, preprocessing is required here also for the software bug repositories. A stemming algorithm is a process of linguistic normalisation, in which the variant forms of a word are reduced to a common form. In this paper an open source stemmer named Xapian [9] is used for normalization of the text. The Xapian stemmer is based on the Snowball stemming algorithms.

5.2 Bug Estimation Implementation

Figure 3 depicts the implementation of the proposed prediction model. The bug estimation technique consists of a number of sub-tasks. First of all, the user has to enter the bug id for which the estimation is required. Secondly, the user has to specify the location of the bug repository, which could be either a local database or some online bug repository. The user can select the appropriate location of the repository and then specify the various similarity parameters.

Figure 3. Main GUI for the bug estimation where all the parameters can be specified.
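As an illustration of the preprocessing phase, a naive normalization step might look like the sketch below. This is not Xapian's API: the paper uses the Xapian (Snowball-based) stemmer, which is far more sophisticated than the crude lowercase/punctuation-strip/plural-strip shown here; the class and rules are purely hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class Normalizer {
    // Naive normalization: lowercase, strip punctuation, crude plural stemming.
    // Only an illustration of the preprocessing idea, not the Snowball algorithm.
    public static List<String> normalize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            String t = raw.replaceAll("[^a-z0-9]", "");
            if (t.isEmpty()) continue;
            if (t.endsWith("es") && t.length() > 3) t = t.substring(0, t.length() - 2);
            else if (t.endsWith("s") && t.length() > 2) t = t.substring(0, t.length() - 1);
            tokens.add(t);
        }
        return tokens;
    }
}
```

Normalizing bug summaries this way before applying the similarity measures of Section 1.4 lets "Crashes" and "crash" match as the same token.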

Figure 4. Result GUI for bug estimation.

Figure 4 depicts the bug estimation result GUI (Graphical User Interface). In this GUI, data related to all the similar bugs and the model parameters are displayed, and the result is shown for the given bug id.

VI. PERFORMANCE EVALUATION

The performance of a prediction model is decided by its accuracy. For evaluating the performance of the proposed model the MySql bug repository is taken. MySql is open source and all the data related to its development are available freely over the web. Around 12000 bugs from the MySql bug repository are taken for measuring accuracy. 10-fold cross validation is used for measuring the accuracy of the proposed model; the accuracy measured is around 91%. Weka's [11] Java library (the open source data mining implementation) is used for implementing the 10-fold cross validation for the proposed model.

VII. CONCLUSION AND FUTURE SCOPE

In this paper a new data mining model is proposed to predict the software bug estimation. The proposed model is implemented using open source technologies and applied over the open source MySql bug repository. For the proposed work only two attributes of the bug, summary and description, are taken for the similarity measurement on which the estimation prediction for the software bugs is based. Future scope for the related work could be analyzing the impact of other bug attributes on software bug estimation and incorporating them into the prediction calculation to achieve more accurate results. Also, semantic similarities between software bugs can be measured and applied for meaningful bug estimation.

REFERENCES

[1] Cathrin Weiß, Rahul Premraj, Thomas Zimmermann, Andreas Zeller, "How Long will it Take to Fix This Bug?", Proceedings of the Fourth International Workshop on Mining Software Repositories, SIGSOFT, 2007.
[2] Cathrin Weiß, Rahul Premraj, Thomas Zimmermann, Andreas Zeller, "Predicting Effort to Fix Software Bugs", Proceedings of the 9th Workshop Software Reengineering (WSR 2007), Bad Honnef, Germany; also appeared in Softwaretechnik-Trends (27:2), published by the Gesellschaft für Informatik (GI), 2007.
[3] Lucas D. Panjer, "Predicting Eclipse Bug Lifetimes", 29th International Conference on Software Engineering Workshops (ICSEW'07), 2007.
[4] Naresh Kumar Nagwani, Pradeep Singh, "Weight Similarity Measurement Model Based, Object Oriented Approach for Bug Databases Mining to Detect Similar and Duplicate Bugs", Proceedings of the International Conference on Advances in Computing, Communication and Control (ICAC), pp. 202-207, 2009.
[5] Nicolas Bettenburg, Sascha Just, Adrian Schröter, Cathrin Weiss, Rahul Premraj, Thomas Zimmermann, "What Makes a Good Bug Report?", SIGSOFT 2008/FSE-16, November 9–15, Atlanta, Georgia, USA, 2008.
[6] Sunghun Kim, Kai Pan, E. James Whitehead, Jr., "Memories of Bug Fixes", SIGSOFT'06/FSE-14, November 5–11, Portland, Oregon, USA, 2006.
[7] Java, the open source object oriented programming API: http://java.sun.com
[8] MySql, the open database management system: http://www.mysql.com
[9] Xapian, the open source stemmer for text data cleaning: http://xapian.org/docs/stemming.html
[10] Bugzilla, the free bug tracking system: http://www.bugzilla.org
[11] Weka, the open source Java data mining implementation: http://www.cs.waikato.ac.nz/ml/weka
