Bug Assignee Prediction Using Association Rule Mining

Meera Sharma(1), Madhu Kumari(2), and V.B. Singh(2)

(1) Department of Computer Science, University of Delhi, Delhi, India
[email protected]
(2) Delhi College of Arts and Commerce, University of Delhi, Delhi, India
{mesra.madhu,vbsinghdcacdu}@gmail.com
Abstract. In open source software development, a bug repository is available to which both developers and users can report bugs. Bug triage, deciding what to do with an incoming bug report, takes a large amount of developer resources and time. Every newly arriving bug report must be triaged to determine whether it is correct and requires attention and, if it is, which potentially experienced developer/fixer should be assigned the responsibility of resolving it. In this paper, we propose to apply association rule mining, using the Apriori algorithm, to assist in bug triage by predicting the developer that should work on a bug based on the bug's severity, priority and summary terms. We demonstrate our approach on a collection of 1,695 bug reports of the Thunderbird, AddOnSDK and Bugzilla products of the Mozilla open source project. We have analyzed the association rules for the top five assignees of the three products. Association rules can help managers improve their process during development and save time and resources.
1
Introduction
The availability of various software repositories, namely source code, bugs, attributes of bugs, source code changes, developer communication and mailing lists, enables new research areas in software engineering such as mining software repositories, empirical software engineering, and machine learning based software engineering. Various machine learning based prediction models have been developed and are currently being used to improve the quality of software in terms of choosing the right developer to fix a bug, predicting bug fix time, predicting the attributes of a bug, namely severity and priority, and detecting bugs lying dormant in the software [1-6]. One of the important software repositories is the bug tracking system (BTS), which is used to manage bug reports submitted by users, testers, and developers [7]. Each newly reported bug must be triaged to determine if it describes a meaningful new problem or enhancement, and if it does, it must be assigned to an appropriate developer to fix it. A bug is characterized by many attributes, shown in Table 1 [8]. Some of the important bug attributes are severity, priority and summary. The degree of impact of a bug on the functionality of the software is known as its severity. It is defined on seven levels from 1 to 7, namely Blocker, Critical, Major, Normal, Minor, Enhancement and Trivial, with Blocker as level 1 and Trivial as level 7. Bug priority describes the importance and order in which a bug should be fixed compared to other bugs. P1 is considered the highest and P5 the lowest. The summary attribute of a bug report consists of a brief textual description of the bug.

© Springer International Publishing Switzerland 2015. O. Gervasi et al. (Eds.): ICCSA 2015, Part IV, LNCS 9158, pp. 444–457, 2015. DOI: 10.1007/978-3-319-21410-8_35
Table 1. Bug attributes description

Severity: Indicates how severe the problem is, e.g. trivial, critical.
Bug Id: The unique numeric id of a bug.
Priority: Describes the importance and order in which a bug should be fixed compared to other bugs. P1 is considered the highest and P5 the lowest.
Resolution: Indicates what happened to this bug, e.g. FIXED.
Status: Indicates the current state of a bug, e.g. NEW, RESOLVED.
Number of Comments: Bugs have comments added to them by users; the number of comments made to a bug report.
Create Date: When the bug was filed.
Dependencies: If this bug cannot be fixed unless other bugs are fixed (depends on), or this bug stops other bugs being fixed (blocks), their numbers are recorded here.
Summary: A one-sentence summary of the problem.
Date of Close: When the bug was closed.
Keywords: The administrator can define keywords which you can use to tag and categorize bugs, e.g. the Mozilla project has keywords like crash and regression.
Version: Defines the version of the software the bug was found in.
CC List: A list of people who get mail when the bug changes; the number of people in the CC list.
Platform and OS: Indicate the computing environment where the bug was found.
Number of Attachments: Number of attachments for a bug.
Bug Fix Time: Last Resolved time minus Opened time; the time to fix a bug.
To the best of our knowledge, no work available in the literature has discovered association rules among bug attributes. Such rules can help managers improve their process during development. In this paper, we attempt to predict the developer that should work on a bug by applying association rule mining with the Apriori algorithm, based on the bug's severity, priority and summary terms. We demonstrate our approach on a collection of 1,695 bug reports of the Thunderbird, AddOnSDK and Bugzilla products of the Mozilla open source project. Our prediction method is based on the association rule mining method first explored by [9].
Association rule mining is used to discover patterns of co-occurrence of attributes in a database. Associations do not imply causality. An association rule is an expression A ⇒ C, where A (antecedent) and C (consequent) are sets of items. Given a database D of transactions, where each transaction T ∈ D is a set of items, A ⇒ C expresses that whenever a transaction T contains A, then T also contains C with a specified confidence and support. The rule confidence is defined as the percentage of transactions containing C in addition to A with regard to the overall number of transactions containing A [10]. Support is the number of times the items in a rule appear together in a single entry within the entire set. Association rule mining can successfully be applied to a wide range of business and science problems. Extensive performance studies have also shown that associative classification frequently achieves better accuracy than state-of-the-art classification methods [11-20]. The successful use of association rule mining in various fields motivates us to apply it to open source software bug data sets.

The rest of the paper is organized as follows. Section 2 describes the datasets and the preprocessing of the data. Results are presented in Section 3. Section 4 presents related work. Threats to validity are discussed in Section 5, and the paper is concluded in Section 6.
2
Description of Data Sets and Data Preprocessing
In this paper, an empirical experiment has been conducted on 1,695 bug reports of the Mozilla open source software products Thunderbird, AddOnSDK and Bugzilla. We collected bug reports with resolution "fixed" and status "verified", "resolved" or "closed", because only these types of bug reports contain meaningful information for the experiment. The bug reports collected from Bugzilla have also been compared and validated against general change data (i.e. CVS or SVN records). Table 2 shows the data collection in the observed period.

Table 2. Number of bug reports in each product
Product       Number of bugs   Observation period
Thunderbird   115              Apr. 2000 - Mar. 2013
Add-on SDK    616              May 2009 - Aug. 2013
Bugzilla      964              Sept. 1994 - June 2013
We have used four quantified bug attributes, namely severity, priority, summary and assignee. Terms must first be extracted from the bug summary attribute (a textual description of the bug). We pre-processed the bug summary in the RapidMiner tool [21] with the following steps:

Tokenization: Tokenization is the process of breaking a stream of text into words, phrases, symbols, or other meaningful elements called tokens. In this paper a word or a term has been considered as a token.

Stop Word Removal: In the bug summary, prepositions, conjunctions, articles, verbs, nouns, pronouns, adverbs, adjectives, etc. are treated as stop words and have been removed.
Stemming to base stem: The process of converting derived words to their base word is known as stemming. In this paper, we have used the standard Porter stemming algorithm [22].

Feature Reduction: Tokens of minimum length 3 and maximum length 50 have been considered, because most data mining algorithms may not be able to handle large feature sets. As a result of this process we obtain a set of terms of the bug summary attribute for each dataset. In the RapidMiner tool, for the calculation of summary terms we have set the tokenize mode to non-letters. For the filter tokens option we have set the min chars parameter to 3 and the max chars parameter to 50. We have filtered the stop words using the English dictionary.

The importance, i.e. the usefulness and certainty, of an association rule is measured by its support and confidence. Rules discovered with high levels of support (or relevance) and high confidence do not necessarily imply causality. Let Y = {Y1, Y2, ..., Ym} be a set of attribute values, called items. A set A ⊆ Y is called an item set. Let the database D be a multiset of transactions, where each transaction T ∈ D is a subset of Y. An association rule is an expression A ⇒ C, where A ⊂ Y, C ⊂ Y, and A ∩ C = φ. We refer to A as the antecedent of the rule, and C as the consequent. The rule A ⇒ C has support Supp(A ⇒ C) in D, where the support is defined as Supp(A ⇒ C) = Supp(A ∪ C). That means Supp(A ⇒ C) percent of the transactions in D contain A ∪ C, where Supp(A) = |{T ∈ D | A ⊆ T}| / |D| is the support of A, i.e. the fraction of transactions T supporting the item set A with respect to the database D. The number of transactions required for an item set to satisfy minimum support is referred to as the minimum support count. A transaction T ∈ D supports an item set A ⊆ Y if A ⊆ T holds. The rule A ⇒ C holds in D with confidence Conf(A ⇒ C), where the confidence is defined as Conf(A ⇒ C) = Supp(A ∪ C) / Supp(A). That means Conf(A ⇒ C) percent of the transactions in D that contain A also contain C.
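As a small illustration of the preprocessing steps and of these support and confidence definitions, the following Python sketch can be used. The stop-word list and the suffix stripper here are simplified stand-ins, not RapidMiner's English dictionary or the full Porter stemmer, and the example summaries and assignee labels are hypothetical:

```python
import re

# Illustrative (hypothetical) stop-word list and suffix stripper -- the paper
# actually uses RapidMiner's English dictionary and the Porter stemmer [22].
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "and", "when", "not"}

def stem(token):
    # Crude suffix stripping, standing in for Porter stemming.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(summary):
    # Tokenize on non-letters, drop stop words, stem, keep tokens of length 3-50.
    tokens = [t.lower() for t in re.split(r"[^A-Za-z]+", summary) if t]
    tokens = [stem(t) for t in tokens if t.lower() not in STOP_WORDS]
    return {t for t in tokens if 3 <= len(t) <= 50}

def supp(itemset, db):
    # Supp(A): fraction of transactions containing the item set.
    return sum(1 for t in db if itemset <= t) / len(db)

def conf(antecedent, consequent, db):
    # Conf(A => C) = Supp(A u C) / Supp(A).
    return supp(antecedent | consequent, db) / supp(antecedent, db)

# Toy transactions: preprocessed summary terms plus the assignee as an item.
db = [
    preprocess("Test fails when loading content") | {"assignee:poirot"},
    preprocess("Content script fails in test") | {"assignee:poirot"},
    preprocess("Docs page is not loading") | {"assignee:bamberg"},
]
print(supp({"fail"}, db))                       # 2/3
print(conf({"fail"}, {"assignee:poirot"}, db))  # 1.0
```

Treating the assignee itself as an item in each transaction is what later allows rules of the form "severity/priority/terms ⇒ assignee" to be mined.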
The confidence is a measure of the rule's strength or certainty, while the support corresponds to statistical significance or usefulness. Association rule mining generates all association rules that have support greater than a minimum support min.Supp(A ⇒ C) in the database, i.e., the rules are frequent. The rules must also have confidence greater than a minimum confidence min.Conf(A ⇒ C), i.e., the rules are strong. The process of association rule mining consists of two steps: 1) find all frequent item sets, where each A ∪ C of these item sets must be at least as frequently supported as the minimum support count; 2) generate strong rules from the discovered frequent item sets, where each rule A ⇒ C must satisfy min.Supp(A ⇒ C) and min.Conf(A ⇒ C) [10]. We have carried out the following steps for our study:
1. Data Extraction
   a. Download the bug reports of different products of the Mozilla open source software from the CVS repository: https://bugzilla.mozilla.org/
   b. Save the bug reports in Excel format.
2. Data Pre-processing
   a. Extract individual terms from the bug summary attribute.
3. Data Preparation
   a. Assign the severity attribute numeric values from 1 to 7 and the priority levels values from 8 to 12.
   b. Take the top 30 terms based on the occurrences of each term in the dataset of the summary attribute and assign them numeric values from 13 to 43.
   c. Assign a unique numeric value to each assignee.
4. Modeling
   a. Build a model in MATLAB by using the ARMADA tool [23]. ARMADA (Association Rule Miner And Deduction Analysis) is a data mining tool that extracts association rules from numerical data files using a variety of selectable techniques and criteria. The program integrates several mining methods which allow the efficient extraction of rules, while allowing the thoroughness of the mining to be specified at the user's discretion. We have applied the Apriori algorithm to find association rules with the assignee as consequent and severity, priority and summary terms as antecedents, with minimum confidence 20% and minimum support 7% for AddOnSDK and Bugzilla. We have taken minimum confidence 20% and minimum support 3% for the Thunderbird dataset, since support 7% did not yield sufficient rules because of the smaller number of transactions in that dataset.
5. Testing and Validation
   a. Assess the association rules in terms of support and confidence.

3
Results and Discussion
We have applied the Apriori algorithm for association rule mining using the ARMADA tool in MATLAB to predict the bug assignee using bug severity, priority and summary terms. When mining association rules for bug assignee prediction, we have taken minimum confidence 20% and minimum support 7% for the AddOnSDK and Bugzilla products. As the number of bugs is much smaller for the Thunderbird product, we have taken minimum support 3% and minimum confidence 20%. As a result we get 3 sets of rules, where each set consists of more than 100 rules. For this reason, we do not list them all; instead we present the top 5 rules for the top 5 assignees based on the highest confidence. Table 3 shows the typical forms of the association rules for the top five bug assignees of the AddOnSDK dataset.

Table 3. Association rules for top five assignees in AddOnSDK
Association rules with minimum support 7% and minimum confidence 20%

Assignee = Alexandre Poirot
• Priority {P1} ∧ Term {con} ∧ Term {content} ∧ Term {fail} ⇒ Assignee {Alexandre Poirot} @ (11%, 79%)
• Severity {Normal} ∧ Priority {P1} ∧ Term {con} ∧ Term {content} ∧ Term {fail} ⇒ Assignee {Alexandre Poirot} @ (7%, 78%)
• Priority {P1} ∧ Term {con} ∧ Term {test} ∧ Term {content} ∧ Term {fail} ⇒ Assignee {Alexandre Poirot} @ (10%, 77%)
• Severity {Normal} ∧ Priority {P1} ∧ Term {con} ∧ Term {content} ∧ Term {script} ⇒ Assignee {Alexandre Poirot} @ (14%, 67%)
• Priority {P1} ∧ Term {con} ∧ Term {test} ∧ Term {fail} ⇒ Assignee {Alexandre Poirot} @ (10%, 67%)

Assignee = Will Bamberg
• Severity {Normal} ∧ Term {doc} ∧ Term {document} ∧ Term {page} ⇒ Assignee {Will Bamberg} @ (7%, 100%)
• Severity {Normal} ∧ Term {doc} ∧ Term {tab} ⇒ Assignee {Will Bamberg} @ (8%, 89%)
• Severity {Normal} ∧ Priority {P1} ∧ Term {doc} ∧ Term {mod} ∧ Term {modul} ⇒ Assignee {Will Bamberg} @ (10%, 83%)
• Severity {Normal} ∧ Priority {P1} ∧ Term {con} ∧ Term {doc} ⇒ Assignee {Will Bamberg} @ (9%, 82%)
• Severity {Normal} ∧ Priority {P1} ∧ Term {doc} ∧ Term {mod} ⇒ Assignee {Will Bamberg} @ (11%, 79%)

Assignee = Brian Warner
• Severity {Normal} ∧ Priority {P1} ∧ Term {add} ∧ Term {sdk} ⇒ Assignee {Nobody} @ (7%, 26%)
• Term {P1} ∧ Term {add} ∧ Term {sdk} ⇒ Assignee {Nobody} @ (7%, 24%)
• Severity {Normal} ∧ Term {add} ∧ Term {sdk} ⇒ Assignee {Nobody} @ (7%, 21%)
• Severity {Normal} ∧ Term {pack} ⇒ Assignee {Nobody} @ (7%, 20%)
• Term {add} ∧ Term {sdk} ⇒ Assignee {Nobody} @ (7%, 20%)

Assignee = Erik Vold
• Severity {Normal} ∧ Term {test} ∧ Term {win} ∧ Term {window} ⇒ Assignee {Erik Vold} @ (7%, 47%)
• Term {privat} ∧ Term {brows} ⇒ Assignee {Erik Vold} @ (11%, 44%)
• Priority {P1} ∧ Term {test} ∧ Term {win} ∧ Term {window} ⇒ Assignee {Erik Vold} @ (7%, 44%)
• Severity {Normal} ∧ Term {test} ∧ Term {win} ⇒ Assignee {Erik Vold} @ (7%, 44%)
• Term {test} ∧ Term {win} ∧ Term {window} ⇒ Assignee {Erik Vold} @ (7%, 42%)

Assignee = Irakli Gozilalishvili
• Severity {Normal} ∧ Priority {P1} ∧ Term {load} ⇒ Assignee {Irakli Gozilalishvili} @ (8%, 32%)
• Priority {P1} ∧ Term {load} ⇒ Assignee {Irakli Gozilalishvili} @ (8%, 27%)
• Severity {Normal} ∧ Term {load} ⇒ Assignee {Irakli Gozilalishvili} @ (8%, 24%)
• Term {load} ⇒ Assignee {Irakli Gozilalishvili} @ (8%, 21%)
The first association rule is a four-antecedent rule, which reveals that the assignee Alexandre Poirot can be assigned a bug having priority P1 and a summary containing the terms con, content and fail, with a significance of 11 percent and a certainty of 79 percent. The second association rule is a five-antecedent rule, which means that the assignee Alexandre Poirot can be assigned a bug having severity Normal, priority P1 and a summary containing the terms con, content and fail, with a significance of 7 percent and a certainty of 78 percent. The third rule shows that the assignee Alexandre Poirot can be assigned a bug having priority P1 and a summary containing the terms con, test, content and fail, with a significance of 10 percent and a certainty of 77 percent. Rule four reveals that 14 percent of the bugs in the bug data set have severity Normal, priority P1, a summary containing the terms con, content and script, and assignee Alexandre Poirot; 67 percent of the bugs that have severity Normal, priority P1 and a summary containing the terms con, content and script also have assignee Alexandre Poirot. The fifth rule shows that the assignee Alexandre Poirot can be assigned a bug having priority P1 and a summary containing the terms con, test and fail, with a significance of 10 percent and a certainty of 67 percent. We can similarly interpret the association rules for the other assignees. Table 4 shows the top five rules for the top five bug assignees of the Thunderbird dataset.

Table 4. Association rules for top five assignees in Thunderbird
Association rules with minimum support 3% and minimum confidence 20%

Assignee = David
• Priority {P2} ∧ Term {folder} ∧ Term {mar} ⇒ Assignee {David} @ (3%, 100%)
• Term {folder} ∧ Term {mar} ⇒ Assignee {David} @ (4%, 100%)
• Term {move} ∧ Term {account} ⇒ Assignee {David} @ (3%, 100%)
• Severity {Normal} ∧ Priority {P2} ∧ Term {folder} ⇒ Assignee {David} @ (3%, 75%)
• Priority {P2} ∧ Term {folder} ⇒ Assignee {David} @ (4%, 67%)

Assignee = Phil Ringnalda
• Term {show} ⇒ Assignee {Phil Ringnalda} @ (3%, 60%)
• Severity {Normal} ∧ Term {text} ⇒ Assignee {Phil Ringnalda} @ (3%, 38%)
• Severity {Normal} ∧ Term {remov} ⇒ Assignee {Phil Ringnalda} @ (3%, 33%)
• Term {remov} ⇒ Assignee {Phil Ringnalda} @ (3%, 30%)
• Severity {Normal} ∧ Priority {P3} ⇒ Assignee {Phil Ringnalda} @ (8%, 24%)

Assignee = Mark Banner
• Severity {Normal} ∧ Priority {P3} ∧ Term {thunderbird} ⇒ Assignee {Mark Banner} @ (3%, 50%)
• Severity {Normal} ∧ Term {thunderbird} ⇒ Assignee {Mark Banner} @ (5%, 33%)
• Term {thunderbird} ⇒ Assignee {Mark Banner} @ (5%, 28%)
• Term {move} ⇒ Assignee {Mark Banner} @ (4%, 31%)

Assignee = Blake Winton
• Term {config} ∧ Term {auto} ⇒ Assignee {Blake Winton} @ (3%, 75%)
• Term {tool} ∧ Term {toolbar} ⇒ Assignee {Blake Winton} @ (3%, 60%)
• Term {auto} ⇒ Assignee {Blake Winton} @ (3%, 60%)
• Term {tool} ⇒ Assignee {Blake Winton} @ (3%, 43%)
• Term {add} ⇒ Assignee {Blake Winton} @ (4%, 31%)

Assignee = Andreas Nilsson
• Severity {Normal} ∧ Priority {P3} ∧ Term {icon} ⇒ Assignee {Andreas Nilsson} @ (5%, 71%)
• Severity {Normal} ∧ Priority {P3} ∧ Term {window} ⇒ Assignee {Andreas Nilsson} @ (3%, 60%)
• Severity {Normal} ∧ Term {icon} ⇒ Assignee {Andreas Nilsson} @ (5%, 56%)
• Priority {P3} ∧ Term {button} ⇒ Assignee {Andreas Nilsson} @ (3%, 43%)
• Severity {Normal} ∧ Term {button} ⇒ Assignee {Andreas Nilsson} @ (3%, 43%)
The first rule is a three-antecedent rule, which reveals that the assignee David can be assigned a bug having priority P2 and a summary containing the terms folder and mar, with a significance of 3 percent and a certainty of 100 percent. The second association rule is a two-antecedent rule, which means that the assignee David can be assigned a bug having a summary containing the terms folder and mar, with a significance of 4 percent and a certainty of 100 percent. The third rule shows that the assignee David can be assigned a bug having a summary containing the terms move and account, with a significance of 3 percent and a certainty of 100 percent. Rule four reveals that 3 percent of the bugs in the bug data set have severity Normal, priority P2, a summary containing the term folder, and assignee David; 75 percent of the bugs that have severity Normal, priority P2 and a summary containing the term folder also have assignee David. The fifth rule shows that the assignee David can be assigned a bug having priority P2 and a summary containing the term folder, with a significance of 4 percent and a certainty of 67 percent. We can similarly interpret the association rules for the other assignees. Table 5 shows the top five rules for the top five bug assignees of the Bugzilla dataset.

Table 5. Association rules for top five assignees in Bugzilla
Association rules with minimum support 7% and minimum confidence 20%

Assignee = Terry Weissman
• Severity {Normal} ∧ Priority {P3} ∧ Term {mai} ∧ Term {mail} ⇒ Assignee {Terry Weissman} @ (11%, 73%)
• Severity {Normal} ∧ Priority {P3} ∧ Term {bug} ∧ Term {bugzilla} ⇒ Assignee {Terry Weissman} @ (9%, 41%)
• Severity {Normal} ∧ Priority {P3} ∧ Term {bug} ⇒ Assignee {Terry Weissman} @ (21%, 41%)
• Priority {P3} ∧ Term {bug} ∧ Term {bugzilla} ⇒ Assignee {Terry Weissman} @ (19%, 42%)
• Priority {P3} ∧ Term {mai} ∧ Term {mail} ⇒ Assignee {Terry Weissman} @ (16%, 53%)

Assignee = Max Kanat-Alexander
• Severity {Enhancement} ∧ Term {sql} ⇒ Assignee {Max Kanat-Alexander} @ (7%, 78%)
• Severity {Normal} ∧ Priority {P1} ∧ Term {sql} ⇒ Assignee {Max Kanat-Alexander} @ (11%, 69%)
• Severity {Enhancement} ∧ Priority {P1} ⇒ Assignee {Max Kanat-Alexander} @ (32%, 62%)
• Severity {Enhancement} ∧ Term {bug} ∧ Term {bugzilla} ⇒ Assignee {Max Kanat-Alexander} @ (11%, 58%)
• Priority {P1} ∧ Term {sql} ⇒ Assignee {Max Kanat-Alexander} @ (21%, 55%)

Assignee = Joel Peshkin
• Priority {P2} ∧ Term {abl} ⇒ Assignee {Joel Peshkin} @ (8%, 38%)
• Priority {P2} ∧ Term {user} ⇒ Assignee {Joel Peshkin} @ (7%, 37%)
• Severity {Enhancement} ∧ Priority {P2} ⇒ Assignee {Joel Peshkin} @ (10%, 24%)
• Term {abl} ∧ Term {tab} ⇒ Assignee {Joel Peshkin} @ (9%, 24%)
• Severity {Major} ∧ Priority {P2} ⇒ Assignee {Joel Peshkin} @ (8%, 22%)

Assignee = Gervase Markham
• Severity {Blocker} ∧ Term {cgi} ∧ Term {temp} ∧ Term {templat} ⇒ Assignee {Gervase Markham} @ (7%, 100%)
• Priority {P1} ∧ Term {cgi} ∧ Term {temp} ∧ Term {templat} ⇒ Assignee {Gervase Markham} @ (7%, 78%)
• Term {cgi} ∧ Term {temp} ∧ Term {templat} ⇒ Assignee {Gervase Markham} @ (9%, 75%)
• Severity {Blocker} ∧ Term {temp} ∧ Term {templat} ⇒ Assignee {Gervase Markham} @ (11%, 50%)
• Severity {Blocker} ∧ Term {temp} ⇒ Assignee {Gervase Markham} @ (11%, 48%)

Assignee = Bradley Baetz
• Severity {Blocker} ∧ Priority {P1} ∧ Term {ing} ⇒ Assignee {Bradley Baetz} @ (8%, 30%)
• Severity {Blocker} ∧ Term {ing} ⇒ Assignee {Bradley Baetz} @ (8%, 26%)
The first association rule is a four-antecedent rule, which reveals that the assignee Terry Weissman can be assigned a bug having severity Normal, priority P3 and a summary containing the terms mai and mail, with a significance of 11 percent and a certainty of 73 percent. The second association rule is also a four-antecedent rule, which means that the assignee Terry Weissman can be assigned a bug having severity Normal, priority P3 and a summary containing the terms bug and bugzilla, with a significance of 9 percent and a certainty of 41 percent. The third rule shows that the assignee Terry Weissman can be assigned a bug having severity Normal, priority P3 and a summary containing the term bug, with a significance of 21 percent and a certainty of 41 percent. Rule four reveals that 19 percent of the bugs in the bug data set have priority P3, a summary containing the terms bug and bugzilla, and assignee Terry Weissman; 42 percent of the bugs that have priority P3 and a summary containing the terms bug and bugzilla also
have assignee Terry Weissman. The fifth rule shows that the assignee Terry Weissman can be assigned a bug having priority P3 and a summary containing the terms mai and mail, with a significance of 16 percent and a certainty of 53 percent. We can similarly interpret the association rules for the other assignees. The rules for assignee Gervase Markham show that bugs with severity Blocker are assigned to him. We have drawn the distribution of association rules according to their length, i.e. the number of antecedents, for all three products. Figures 1 to 3 show the number of association rules with different rule lengths for all products.
Fig. 1. Distribution of rules by rule length for AddOnSdk data set for min.supp=7% and min.conf=20%
Fig. 2. Distribution of rules by rule length for Thunderbird data set for min.supp=3% and min.conf=20%
Fig. 3. Distribution of rules by rule length for Bugzilla data set for min.supp=7% and min.conf=20%
Figures 1 to 3 show that the largest number of association rules have length 2 (i.e. two antecedents) across all datasets.
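The overall mining procedure used in this section (attribute values treated as items, followed by the two-step Apriori process of Section 2) can be sketched in Python. This is only an illustrative re-implementation, not ARMADA's MATLAB code; the item names and the toy transactions below loosely mimic the AddOnSDK rules in Table 3 but are invented for the example:

```python
def apriori(db, min_supp):
    """Step 1: level-wise frequent-itemset mining."""
    n = len(db)

    def support(iset):
        # Fraction of transactions that contain the item set.
        return sum(1 for t in db if iset <= t) / n

    # L1: frequent single items.
    items = {i for t in db for i in t}
    frequent = {frozenset([i]) for i in items if support(frozenset([i])) >= min_supp}
    all_frequent = set(frequent)
    k = 2
    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets, then prune by support.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        frequent = {c for c in candidates if support(c) >= min_supp}
        all_frequent |= frequent
        k += 1
    return {iset: support(iset) for iset in all_frequent}

def assignee_rules(db, min_supp, min_conf):
    """Step 2: strong rules A => {assignee} from the frequent item sets."""
    freq = apriori(db, min_supp)
    rules = []
    for iset, s in freq.items():
        for item in iset:
            if item.startswith("assignee:") and len(iset) > 1:
                antecedent = iset - {item}
                c = s / freq[antecedent]  # Conf = Supp(A u C) / Supp(A)
                if c >= min_conf:
                    rules.append((set(antecedent), item, s, c))
    return rules

# Toy bug reports encoded as items (severity, priority, summary terms, assignee).
db = [
    {"sev:normal", "pri:P1", "term:con", "term:fail", "assignee:poirot"},
    {"sev:normal", "pri:P1", "term:con", "term:fail", "assignee:poirot"},
    {"sev:normal", "pri:P1", "term:doc", "term:mod", "assignee:bamberg"},
    {"sev:normal", "term:doc", "term:tab", "assignee:bamberg"},
]
for antecedent, assignee, s, c in assignee_rules(db, min_supp=0.25, min_conf=0.8):
    print(sorted(antecedent), "=>", assignee, f"@ ({s:.0%}, {c:.0%})")
```

Restricting the consequent to assignee items mirrors the paper's setup, where severity, priority and summary terms form the antecedents and the assignee is the predicted consequent.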
4
Related Work
Each newly reported bug must be triaged to determine if it describes a meaningful new problem or enhancement, and if it does, it must be assigned to an appropriate developer to fix it. In recent years, there have been a number of valuable contributions addressing this problem. An attempt was made by [7], using descriptions of fixed bug reports in open bug repositories as machine learning features and names of developers as class labels in a Bayesian classifier. They achieved an accuracy of 30% for Eclipse projects. In another approach, the authors of [24] expanded the work of [7] by using additional textual information of bug reports beyond the bug description to form the machine learning features. They also applied non-linear Support Vector Machines (SVMs) and the C4.5 algorithm in addition to the Naive Bayes classifier used in the predecessor work, and found that SVM performed best. An approach for assisting human bug triagers in large open source software projects by semi-automating the bug assignment process has been proposed by [25]. It employs a simple and efficient n-gram-based algorithm for approximate string matching, collecting the natural language textual information available in the summary and description fields of previously resolved bug reports and classifying that information in a number of separate inverted lists with respect to the resolver of each issue. In a study [26], the authors outlined an approach based on information retrieval in which they report recall levels of around 20% for Mozilla. A new technique which automatically selects the most appropriate developers for fixing the fault represented by a failing test case has been proposed by [27]. Their technique is the first to assign developers to execution failures without the need for textual bug reports. The reported results show 81% accuracy for the top-three developer suggestions.
5
Threats to Validity
The following factors affect the validity of our study:

Construct Validity: The independent attributes taken in our study are not based on any empirical validation.

Internal Validity: We have considered four bug attributes: severity, priority, summary terms and assignee. The developer's reputation attribute could also be considered.

External Validity: We have considered only open source Mozilla products. The study can be extended to other open source and closed source software.
456
M. Sharma et al.
Reliability: The RapidMiner and MATLAB software have been used in this paper for model building and testing. The increasing use of these tools supports the reliability of the experiments; however, we have not considered or handled any accuracy errors of these tools.
6
Conclusion
Bug triaging, the process of deciding what to do with an incoming bug report, takes a large amount of developer resources and time. Triaging all incoming bugs to determine the assignee to whom each bug should be assigned is a cumbersome task. In the literature, much work has been done using classification based on the textual information about a bug, i.e. the bug summary or description. To the best of our knowledge, no work has been done until now to find associations between bug attributes in order to predict the assignee for a bug. We have used association rule mining, with the Apriori algorithm, to assist in bug triage by predicting the developer that should work on a bug based on the bug's severity, priority and summary terms. We have used 1,695 bug reports of the Thunderbird, AddOnSDK and Bugzilla products of the Mozilla open source project for result validation. For a minimum confidence of 20% and minimum supports of 3% and 7%, we have summarized the association rules for the top five assignees based on the number of bugs assigned to them. Prediction of the assignee will help managers in the software development process by assigning a bug to a potentially efficient developer. In the future, we can extend our study to other association rule mining algorithms to empirically validate the results.
References

1. Menzies, T., Marcus, A.: Automated severity assessment of software defect reports. In: IEEE Int. Conf. Software Maintenance, pp. 346–355 (2008)
2. Lamkanfi, A., Demeyer, S., Giger, E., Goethals, B.: Predicting the severity of a reported bug. In: Mining Software Repositories (MSR), pp. 1–10 (2010)
3. Lamkanfi, A., Demeyer, S., Soetens, Q.D., Verdonck, T.: Comparing mining algorithms for predicting the severity of a reported bug. In: CSMR, pp. 249–258 (2011)
4. Chaturvedi, K.K., Singh, V.B.: Determining bug severity using machine learning techniques. In: CSI-IEEE Int. Conf. Software Engineering (CONSEG), pp. 378–387 (2012)
5. Chaturvedi, K.K., Singh, V.B.: An empirical comparison of machine learning techniques in predicting the bug severity of open and close source projects. Int. J. Open Source Software and Processes 4(2), 32–59 (2013)
6. Sharma, M., Bedi, P., Chaturvedi, K.K., Singh, V.B.: Predicting the priority of a reported bug using machine learning techniques and cross project validation. In: IEEE Int. Conf. Intelligent Systems Design and Applications (ISDA), pp. 27–29 (2012)
7. Cubranic, D., Murphy, G.C.: Automatic bug triage using text categorization. In: Int. Conf. Software Engineering, Citeseer, pp. 92–97 (2004)
8. Sharma, M., Kumari, M., Singh, V.B.: Understanding the meaning of bug attributes and prediction models. In: I-CARE 5th IBM Collaborative Academia Research Exchange Workshop, Article No. 15. ACM (2013)
9. Agrawal, R., Imielinski, T., Swami, A.: Mining association rules between sets of items in large databases. In: SIGMOD Conf. Management of Data. ACM, May 1993
10. Song, Q., Shepperd, M., Cartwright, M., Mair, C.: Software defect association mining and defect correction effort prediction. IEEE Transactions on Software Engineering 32(2), 69–82 (2006)
11. Ali, K., Manganaris, S., Srikant, R.: Partial classification using association rules. In: Int. Conf. Knowledge Discovery and Data Mining, pp. 115–118 (1997)
12. Dong, G., Zhang, X., Wong, L., Li, J.: CAEP: classification by aggregating emerging patterns. In: Arikawa, S., Nakata, I. (eds.) DS 1999. LNCS (LNAI), vol. 1721, pp. 30–42. Springer, Heidelberg (1999)
13. Liu, B., Hsu, W., Ma, Y.: Integrating classification and association rule mining. In: Int. Conf. Knowledge Discovery and Data Mining, pp. 80–86 (1998)
14. She, R., Chen, F., Wang, K., Ester, M., Gardy, J.L., Brinkman, F.L.: Frequent-subsequence-based prediction of outer membrane proteins. In: ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (2003)
15. Wang, K., Zhou, S.Q., Liew, S.C.: Building hierarchical classifiers using class proximity. In: Int. Conf. Very Large Data Bases, pp. 363–374 (1999)
16. Wang, K., Zhou, S., He, Y.: Growing decision tree on support-less association rules. In: Int. Conf. Knowledge Discovery and Data Mining (2000)
17. Yang, Q., Zhang, H.H., Li, T.: Mining web logs for prediction models in WWW caching and prefetching. In: ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (2001)
18. Yin, X., Han, J.: CPAR: classification based on predictive association rules. In: SIAM Int. Conf. Data Mining (2003)
19. Ying, A.T.T., Murphy, G.C., Ng, R., Chu-Carroll, M.C.: Predicting source code changes by mining revision history. In: Int. Workshop Mining Software Repositories (2004)
20. Zimmermann, T., Weissgerber, P., Diehl, S., Zeller, A.: Mining version histories to guide software changes. In: Int. Conf. Software Engineering (2004)
21. Mierswa, I., Wurst, M., Klinkenberg, R., Scholz, M., Euler, T.: YALE: rapid prototyping for complex data mining tasks. In: ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD 2006) (2006). http://www.rapid-i.com
22. Porter, M.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980)
23. www.mathworks.in/…/3016-armada-data-mining-tool-version-1-4
24. Anvik, J., Hiew, L., Murphy, G.C.: Who should fix this bug? In: Int. Conf. Software Engineering (ICSE) (2006)
25. Amir, H.M., Neumann, G.: Assisting bug triage in large open source projects using approximate string matching. In: Int. Conf. Software Engineering Advances (ICSEA 2012), Lisbon, Portugal (2012)
26. Canfora, G., Cerulo, L.: How software repositories can help in resolving a new change request. In: Workshop on Empirical Studies in Reverse Engineering (2005)
27. Servant, F., Jones, J.A.: WhoseFault: automatic developer-to-fault assignment through fault localization. In: Int. Conf. Software Engineering (ICSE 2012), pp. 36–46. IEEE Press, Piscataway (2012)