Applying Data Mining to Software Maintenance Records

Jelber Sayyad Shirabad, Timothy C. Lethbridge and Stan Matwin
School of Information Technology and Engineering
University of Ottawa, Ottawa, Ontario, K1N 6N5 Canada
{jsayyad,tcl,stan}@site.uottawa.ca

Abstract

In a system maintained over a long time period, as is the case for legacy software, there will be many unknown and non-trivial relationships among components. Finding such hidden relationships may help software engineers in their maintenance activities. In this paper we present an approach whereby we mine software update records to find relationships between files that are changed together. The generalized models we present as results are obtained by using features extracted from different sources of knowledge, such as source code and problem reports. The predictive quality of some of the generated models suggests that they can be deployed in a real world setting. The paper also includes the results of analyzing the structure of some of the best models obtained.

1. Introduction

Understanding a software system is key to its proper maintenance. However, more often than not, knowledge about the system and how to proceed with maintenance is scarce. This is certainly the case for legacy software systems, which tend to be very large and complex. These systems also suffer from other factors such as high staff turnover, poor documentation, and the long periods of time over which they must be maintained. In such systems it is not surprising to find non-trivial relationships among different components of the system that were not previously known. This lack of knowledge also contributes to delays, cost overruns, and other complications, such as the introduction of new errors during maintenance.

Research in reverse engineering and program comprehension has resulted in many tools and techniques to help software engineers recover lost knowledge and make latent knowledge explicit, so that software engineers may better understand software systems. Discovering and understanding non-trivial relationships plays an important role in the overall understanding of a given system. As suggested by other researchers [2, 5], using Artificial Intelligence (AI) techniques, such as employing knowledge bases, can further benefit reverse engineering and program comprehension. Using large knowledge bases introduces various difficulties, however, some of which are mentioned in an earlier paper by the authors [11]. In this paper we discuss the application of an alternative AI method to software maintenance, namely inductive or machine learning methods. Unlike knowledge-based systems, inductive systems do not require a large body of knowledge. Instead, from past experience, they learn models, also known as concepts, that can be used in future predictions.

Data mining and inductive methods in general have been successfully applied to a variety of application domains. However, they have been largely neglected by the software engineering community, particularly in research involving source code level software maintenance. In this paper we will present an application of inductive methods in the context of the following software maintenance issue: when a software engineer (SE) is looking at a piece of code in a file F, one of the questions he or she needs to answer is "are there other files that may be affected by the changes applied to F?" In other words, "are there other files in the system that he or she needs to be aware of in the process

of maintaining the code in the file F?" Perhaps this problem can better be appreciated if one considers that a large legacy system can include thousands of files.

Our approach to helping a software engineer answer this question is to find relations among tuples of files which, when present, suggest that the files will both need to change together. We refer to this particular class of relations as co-update relations. We refer to these and similar relations in general as Maintenance Relevance Relations (MRR), since we are interested in the relevance of files to each other during software maintenance. We find an MRR by learning a classifier. The measure of relevance is the result of classification. The nature of relevance is defined by the particular relation, in our case the relevance of two files to each other in terms of their being updated together.

Problem tracking systems record problems or bugs reported against a software system. They also typically maintain a record of changes applied to the software to fix problems. Such systems can be seen as repositories of system maintenance experience. One can use classification learning and other data mining techniques to extract or learn relations, such as co-update relations, which are hidden in such repositories.

As data for our studies, we used maintenance records from a large legacy telephone switch system, i.e. a PBX, developed by Mitel Networks. This system is written in a high level programming language (HLL) and assembler. The version of the PBX software which we used in our experiments had close to 1.9 million lines of code distributed among more than 4700 files, out of which about 75% were HLL source files.

In the following sections we describe how we formulated and implemented the learning tasks, and discuss the results.

2. Casting the problem as classification

The machine learning techniques used in our experiments fall under the category of classification learning. This is a so-called supervised learning method, because we learn from an existing pre-classified set of examples of a concept (what we want to learn). This set is referred to as a training set. The classifier that is the result of the learning process is also known as a

model. To learn the concept we apply an induction algorithm to the training set. To evaluate the generated model we test it on a testing set, which consists of a set of examples independent from those used in training. Each training example is a known case of the concept we want to learn, in that it is assigned an outcome or a class. An example is described in terms of the observed values for a set of features or attributes. Once a model or classifier is learned from a set of training examples, it can be used to predict the class or outcome of unclassified or unsolved examples.

A co-update relation exists among pairs of files. We can cast the task of learning this relation as a classification problem where we classify two files as one of m distinct classes that show how likely it is that the two files will change together. This approach results in a discrete classifier. The generated model or classifier is the co-update relation that we are trying to learn. As an alternative to a discrete classifier, one can learn a classifier that returns a numeric value as the likelihood of two files being changed together.

In this paper we will present a discrete co-update relation with two classes called Relevant and Not-Relevant. If an example is classified as Relevant, it means that a change in one file will likely result in a change in the other. The Not-Relevant class indicates that a change in one file will likely not have an effect on the other, e.g. when two files belong to two independent subsystems.

We used C5.0, a decision tree induction system by RuleQuest Research, to learn the co-update relation.¹
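As a purely illustrative sketch, a single training example for the co-update relation might be represented as follows; the file names and feature values here are invented, and the real attribute set is described in Section 2.2.

```python
# Illustrative only: one labeled file-pair example for classification
# learning. The feature names and file names are hypothetical.
from dataclasses import dataclass

@dataclass
class FilePairExample:
    file_a: str
    file_b: str
    features: dict   # attribute name -> observed value
    label: str       # "Relevant" or "Not-Relevant"

ex = FilePairExample(
    file_a="callproc.pas",
    file_b="trunkmgr.pas",
    features={"SameExtension": True, "NumberOfSharedFilesIncluded": 3},
    label="Relevant",
)
```

A training set is simply a collection of such examples; the induction algorithm generalizes from them to a model that predicts the label of unseen pairs.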
As shown in Figure 1, a decision tree has a condition at each node. The condition is a test Ti of the value of one of the predefined attributes used to describe the example. At each leaf, there is a prediction for the class Cj of an example that satisfies the conditions on the path from the root of the tree to that particular leaf. In other words, each such path provides a conjunction of tests performed on the values of some of the features

—————
¹ We have also experimented with a deliberately trivial learning algorithm called 1R and another approach called Set Covering Machines [7]. As expected, 1R did not produce satisfactory results due to the complexity of the data. Results obtained for SCM were not significantly better than the ones obtained with C5.0.

and the corresponding outcome if all the tests on the path result in a true value.

We have chosen decision tree learning because the algorithms are among the most established techniques in machine learning and data mining. They produce explainable models that can be presented to software engineers. This is an important factor, since software engineers can actually find the reason for the outcome of a prediction by following the path in the tree that is taken by the example or case. In Section 4 we provide an example of such analysis.

[Figure 1. A decision tree]
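The following sketch illustrates this explainability with a small hand-built tree in the spirit of Figure 1. The tests and thresholds are hypothetical; in practice C5.0 induces the tree automatically from the training set.

```python
# Illustrative only: a hand-built decision tree with hypothetical tests.
def classify(example, tree, path=None):
    """Walk a decision tree; return (predicted class, tests taken)."""
    path = [] if path is None else path
    if isinstance(tree, str):            # leaf: a class label
        return tree, path
    feature, threshold, left, right = tree
    if example[feature] <= threshold:
        path.append(f"{feature} <= {threshold}")
        return classify(example, left, path)
    path.append(f"{feature} > {threshold}")
    return classify(example, right, path)

# Each internal node: (feature, threshold, subtree-if-<=, subtree-if->)
tree = ("SharedRoutines", 0,
        "Not-Relevant",
        ("CommonPrefixLength", 2, "Not-Relevant", "Relevant"))

label, why = classify({"SharedRoutines": 4, "CommonPrefixLength": 5}, tree)
# why lists the tests on the root-to-leaf path, which is exactly the
# kind of explanation a software engineer could be shown.
```

The list of tests collected along the path is the conjunction described above: the prediction holds when all of them are true.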

2.1. Training and testing repositories

As we mentioned earlier, to be able to learn a model or concept, we first need to find examples of it. Since the co-update relation is between pairs of files, we first need to find a set of file pairs where each pair is labeled as either Relevant or Not-Relevant. Once we have a file pair and its corresponding relevance label, we can create an example by calculating the values for a set of predefined features and assigning the example a class value matching the file pair label.

At Mitel Networks a problem report is stored in a system called SMS (Software Management System). SMS provides a range of facilities including bug tracking and source code management. Problem reports typically include a textual description of the problem, with possible additional information such as hardware status

when the problem was observed, etc. If the problem causes a change in the source code, an update report (simply referred to as an update) will be submitted and stored in SMS. An update records the problem reports that it intends to address and the names of the files changed as the result of applying the update. If the submitted update passes the testing and verification process, the update is considered as final or closed.

Using heuristics, we automatically extracted from SMS the following two kinds of examples of the co-update relation:
• Relevant: two files that have been changed in the same update are Relevant to each other.
• Not-Relevant: during the period of time covered by our data, two files were not updated together.

To apply these heuristics, first we select a time period T = [T1, T2] of a certain number of years from which we want to learn. We find all the updates that were 'closed' in this time period. For each update, we find all the files changed by that update and create a list of Relevant file pairs. We refer to this heuristic as the co-update heuristic. Any other file pairs that are not labeled as Relevant by the co-update heuristic in a time period T' = [T3, T2], where T3 ≤ T1, are labeled as Not-Relevant. We refer to this second heuristic as the Not-Relevant heuristic. Due to the complementary nature of these heuristics, the number of Not-Relevant file pairs is reduced as the number of Relevant file pairs increases. In other words, we obtain the Not-Relevant labels by applying what is known in AI as the Closed World Assumption.

In principle, widening T' (increasing the time period covered) results in a higher degree of confidence regarding two files not being changed together. This is because the evidence to the contrary does not exist in a larger time period. On the other hand, choosing a very large time period T' means applying the co-update heuristic to that time period, which can result in the inclusion of changes applied to the system in a distant past that may no longer reflect the current state of the system. In the experiments discussed in this paper we have used a time period T' = T, or in other words T3 = T1.
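Assuming the updates have already been reduced to (update id, set of changed files) records for the chosen period T, the two heuristics can be sketched as follows; the group size limit shown here is discussed later in this section.

```python
# A simplified sketch of the co-update and Not-Relevant heuristics.
# The paper further refines the Not-Relevant set (Relevant Based method);
# this version applies the plain closed world assumption.
from itertools import combinations

def label_pairs(updates, max_group_size=20):
    """Relevant: changed together in some update within the size limit.
    Not-Relevant: every other pair of the files seen (closed world)."""
    relevant = set()
    files = set()
    for _update_id, changed in updates:
        files |= changed
        if 2 <= len(changed) <= max_group_size:
            for pair in combinations(sorted(changed), 2):
                relevant.add(pair)
    all_pairs = set(combinations(sorted(files), 2))
    return relevant, all_pairs - relevant

rel, not_rel = label_pairs([("u1", {"a.pas", "b.pas"}),
                            ("u2", {"b.pas", "c.pas"})])
# ("a.pas", "c.pas") never changed together, so it falls to Not-Relevant.
```

Note how a pair becomes Not-Relevant purely by absence of evidence, which is why widening the time period raises confidence in that label.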
We created the set of Relevant and Not-Relevant file pairs by processing the updates for the 1995-1999 time period. There were 1213 problem reports in this time period, for which we found 1401 closed updates. An update can change

more than one file. We define the update group size as the number of files changed by it. To apply the co-update heuristic we need updates with a group size greater than or equal to two. Out of all closed updates in the above time period, 43% changed two or more files. However, it turned out that some updates change a large number of files. For instance, we encountered updates that changed more than 280 files. It makes sense to assume that updates that change a smaller number of files capture relations that exist between files that are more closely related. Further analysis of the updates showed that 40% of them have a group size between 2 and 20 inclusive. These updates constitute 93% of updates with a group size larger than 1. In the experiments reported in this paper we have limited the update group size to 20.

Our version of the co-update relation only focuses on the files in a particular high level language (HLL). This is because our experience with the Assembler parsers available to us showed that the information generated by them was not as accurate as we had wished. This can be mostly traced back to the less structured nature of Assembler code. The number of Relevant HLL file pairs changed by updates with a group size of up to 20 in the 1995-1999 time period was 4547. This number includes repeated file pairs if two files were changed together in more than one update.

For N source files we can create N(N - 1)/2 pairs of files. This formula takes into account the symmetric nature of the co-update relation and the fact that we do not pair a file with itself. To create the set of Not-Relevant pairs we need to create a set of potential file pairs and then from it remove the set of Relevant file pairs (SR). Using the above formula, the number of potential file pairs when all the HLL source files in the system are used is close to six million pairs. This immense number of file pairs or examples brings to light a major difficulty with learning the co-update relation, namely the very skewed distribution of classes.

To reduce the number of potential file pairs we have constrained the first file in a potential file pair to be one of the first files in some Relevant file pair in SR. We refer to this approach as the Relevant Based method of creating Not-Relevant pairs. By applying the Relevant Based method we are pairing each file fi with files that are Relevant to it and files that are Not-Relevant to it. Using this method, we paired the first file fi of Relevant file pairs (fi, fj) ∈ SR with other HLL files to create the set of potential file pairs. By removing SR from this set, i.e. applying the closed world assumption, we create the default Not-Relevant file pairs set. We can further refine this set by removing file pairs that, from other sources of information, we know should not be labeled as Not-Relevant. An example of this is the case where a software engineer may suggest that some of the Not-Relevant files created by the Not-Relevant heuristic should be removed from the default Not-Relevant file pairs set. In our case we used the data from SMS for this purpose. We removed all the file pairs that were changed together by updates in the 1995-1999 time period. This set of file pairs of course includes all the file pairs in SR, which was created from updates in the same time period but with a maximum group size of 20.

To evaluate our classifiers, we randomly split the set of file pair examples into three parts. Two parts are used to create a training repository, and the remaining part is used to create a testing repository. This approach is known as the hold-out method, and results in a testing set that is independent of the training repository. As we will discuss below, we use the training repository to create training sets to learn the co-update relation. Table 1 shows the distribution of file pairs in the training repository and the testing set.

Table 1. Training repository and testing set class distribution

                             All        Training   Testing
  Relevant                   4547       3031       1516
  Not-Relevant               1226827    817884     408943
  #Relevant/#Not-Relevant    0.00371    0.00371    0.00371
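The pair counting and the Relevant Based method can be sketched as follows. The HLL file count used below is an assumption (roughly 75% of the 4700 files), chosen only because it reproduces the "close to six million" figure in the text.

```python
# Sketch of the pair count and the Relevant Based method of creating
# the default Not-Relevant set. File names are illustrative.
def potential_pair_count(n):
    # Symmetric relation, no self-pairs: N(N - 1)/2
    return n * (n - 1) // 2

# Assumed HLL file count (~75% of 4700): about 6.2 million pairs,
# matching the "close to six million" figure in the text.
print(potential_pair_count(3525))

def relevant_based_not_relevant(sr, hll_files):
    """Pair the first file of each Relevant pair with every other HLL
    file, then remove the Relevant pairs (closed world assumption)."""
    first_files = {fi for fi, _fj in sr}
    candidates = {(fi, g) for fi in first_files
                  for g in hll_files if g != fi}
    return candidates - sr

sr = {("a", "b"), ("a", "c")}
nr = relevant_based_not_relevant(sr, {"a", "b", "c", "d"})
# only ("a", "d") remains as a default Not-Relevant pair
```

Constraining the first file this way is what shrinks the six million potential pairs down to the roughly 1.2 million Not-Relevant pairs of Table 1.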

As can be seen in Table 1, although the number of Not-Relevant file pairs has been greatly reduced, the set of file pairs is nevertheless very imbalanced. Such imbalanced data frequently appear in real world applications and pose difficulties for the learning task [1]. In Section 3 we discuss how we learn from less skewed training sets to circumvent this problem.
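The less skewed training sets used in Section 3 can be sketched as follows. This version uses simple random sampling; the syntactic-attribute experiments described later actually used a stratified sample.

```python
# Sketch: a training set with all Relevant examples and K times as many
# Not-Relevant examples, for the ratios K = 1 .. 50 used in the paper.
import random

def make_training_set(relevant, not_relevant, k, seed=0):
    rng = random.Random(seed)
    n = min(k * len(relevant), len(not_relevant))
    sample = rng.sample(sorted(not_relevant), n)
    return ([(p, "Relevant") for p in relevant] +
            [(p, "Not-Relevant") for p in sample])

rel = [("a", "b"), ("b", "c")]
nr = [("a", "c"), ("a", "d"), ("b", "d"), ("c", "d")]
train = make_training_set(rel, nr, k=2)
# 2 Relevant examples plus 2 x 2 = 4 sampled Not-Relevant examples
```

The generated classifier is then always evaluated against the complete, fully skewed testing set, not against a rebalanced one.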

2.2. Feature sets

Corresponding to each Relevant or Not-Relevant file pair, there is an example that is labeled Relevant or Not-Relevant. Since the concept that we are trying to learn is about a pair of files, the attributes or features describing the example can be based on the properties of one of these files or a function of both of these files. The feature sets that we have used in our experiments can be divided into two groups:
• syntactic attributes
• text based attributes
In this section we present and discuss each one of these attribute sets.

2.2.1. Syntactic features. Syntactic attributes are based on syntactic constructs in the source code, such as function calls, or data flow information, such as variable or type definitions. These are the attributes extracted by static analysis of the source code. Computing the values of some of these attributes involves steps similar to the ones taken to measure well known static software product metrics such as fan-in, fan-out, and cyclomatic complexity [8]. In addition to these we have also used attributes based on the names given to files. Table 2 shows the set of syntactic attributes used in the experiments reported in this paper.

Many of the entries in this table are self explanatory. The interesting software unit (ISU) denotes the first file in a pair and the other software unit (OSU) denotes the second file. Conceptually, in the Relevant Based method we want to learn what makes the second file in a pair (OSU) Relevant or Not-Relevant to the first file in the pair (ISU). We say a subroutine is directly referred to if it appears in the main executable portion of a source file, e.g. the main function in a C program. Subroutines that are referred to outside the main executable part of a source file, or non-root subroutines in the static call graph of the source file, are said to be indirectly referred to.

2.2.2. Text based features. Most source files include comments. These comments provide a textual description of what the program is intended to do.

Problem reports describe, in English, a problem or potential bug in the software. They may also include other information deemed to be helpful to the person who will eventually fix the problem. Examples of such additional information include program traces, memory dumps, hardware status information, etc.

Table 2. Syntactic attributes

  Same File Name: Boolean
  ISU File Extension: Text
  OSU File Extension: Text
  Same Extension: Boolean
  Common Prefix Length: Integer
  Number of Shared Types Directly Referred: Integer
  Number of Shared Directly Referred non-Type Data Items: Integer
  Number of Routines Directly Referred in ISU and Defined in OSU: Integer
  Number of Routines Directly and Indirectly Referred in ISU and Defined in OSU: Integer
  Number of Routines Defined in ISU and Directly Referred in OSU: Integer
  Number of Routines Defined in ISU and Directly and Indirectly Referred by OSU: Integer
  Number of Shared Routines Directly Referred: Integer
  Number of Shared Routines Among All Routines Referred in the Units: Integer
  Number of Shared Direct and Indirectly Referred Routines: Integer
  Number of Shared Files Included: Integer
  Directly Include You: Boolean
  Directly Include Me: Boolean

Both problem reports and source file comments provide additional sources of knowledge that can be used to extract attributes. The idea here is to associate a set of words with each source file. Therefore, instead of looking at a source file as a sequence of syntactic constructs, we can view them as documents.

We have adapted the vector or bag of words representation [9] to the task of learning the co-update relation by associating a vector of Boolean features with each pair of files. The bag of words representation is frequently used in information retrieval and text classification. The idea is that each document in a set of documents is represented by a bag of words appearing in all the documents in the set. In the Boolean version of this representation, the vector corresponding to a document consists of Boolean features. A feature is set to true if the word corresponding to that feature appears in the document; otherwise the feature is assigned a false value.

Since our examples describe the relation between pairs of source files (fi, fj), we need to create a file pair feature vector for each such pair. We do this by finding the intersection of the individual file feature vectors associated with fi and fj. This is in effect similar to applying a logical AND operation between the two file feature vectors. By doing so we find similarities between the two files. In the intersection vector, a feature is set to true if the word corresponding to the feature is in both documents that are associated with the files in the pair.

An important issue in using a bag of words approach in text classification is the selection of the words or features. We used an acronym definition file at Mitel Networks to create a set of acceptable words. Since these acronyms belong to the application domain, one would expect that they will include some of the most important and commonly used words. From this initial set we removed an expanded version of the set of stop words proposed in [6]. We further refined the set of acceptable words by filtering a set of problem reports against them. We repeated this process a few times, and at each iteration updated the set of acceptable words by analyzing the rejected words using word frequencies and the intuitive usefulness of words in the context of the application domain.
Although this acceptable word list was created by a non-expert and as a proof of concept, we were motivated by earlier promising work in creating a lightweight knowledge base about the same application domain that used a similar manual selection technique for the initial set of concepts [10].

The word extractor program also performs a limited text style analysis to detect words in upper case in an otherwise mostly lower case text. We found that in many cases, such words tend to be good candidates for technical abbreviations or acronyms and domain specific terms. The program also incorporates some domain specific knowledge to reduce the noise in the extracted set of words. For instance, the program detects memory dumps, or distinguishes a word followed by a period at the end of a sentence from a file name or an assembler instruction that includes a period. We believe the results presented in this paper would be further improved if we could benefit from the domain experts' knowledge in creating these word lists.

As part of the process of creating the set of acceptable words we also created two other lists that we call the transformation list and the collocation list. Using the transformation list, our word extractor program performs actions such as lemmatization or conversion of plurals to singulars that result in the transformation of some words to acceptable words. We would also like to preserve sequences of interesting words that appear together, even though the participating words may not be deemed acceptable on their own. This is accomplished by using the collocation list.

In Sections 2.2.3 and 2.2.4 we discuss how we can use the source comments and problem report words as features.

2.2.3. Source file comment features. Finding comments in a source file is a relatively trivial task. We used the three lists mentioned above to filter the comments in each source file and associate each file with a set of comment words. We then used the bag of words representation discussed above to create the file pair feature vectors.
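A minimal sketch of this pipeline, with tiny stand-in word lists (the real acceptable, stop word and transformation lists were far larger): filter each file's comment words, then AND the resulting Boolean views to obtain the pair vector.

```python
# Illustrative only: stand-in word lists and the comment-word pipeline.
acceptable = {"trunk", "extension", "display", "call"}
stop_words = {"the", "of", "a"}
transform = {"trunks": "trunk", "calls": "call"}   # plural -> singular

def filter_words(tokens):
    """Keep acceptable, non-stop words, applying transformations."""
    kept = set()
    for t in tokens:
        t = transform.get(t.lower(), t.lower())
        if t in acceptable and t not in stop_words:
            kept.add(t)
    return kept

def pair_vector(words_a, words_b, vocabulary):
    """Boolean intersection vector: true only if the word is in both bags."""
    return [w in words_a and w in words_b for w in vocabulary]

wa = filter_words(["The", "trunks", "display"])   # file fi's comments
wb = filter_words(["display", "calls", "of"])     # file fj's comments
v = pair_vector(wa, wb, sorted(acceptable))
# only "display" is shared between the two files' comment words
```

The collocation list is omitted from this sketch; it would additionally keep selected multi-word sequences as single features.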

2.2.4. Problem report features. Since each update that results in changing source files is in response to one or more problem reports, one can associate each problem report with one or more files changed to fix the problem.

Since the description of a problem can be viewed as a document, a bag of words can be associated with each problem report. Let us assume that a file fi is changed as a result of addressing problem reports p1, p2, ..., pn. Furthermore, assume that these problem reports are represented by bags of words called Wp1, Wp2, ..., Wpn. We create a bag of words for a file by finding the union of the problem report bags of words, Wp1 ∪ Wp2 ∪ ... ∪ Wpn. By doing so we associate with each file a bag of all the words that appear in some problem report that caused the file to change. The use of a union of bags of words allows us to account for different reasons resulting in a file being changed. Once again this new bag of words is filtered using the three lists discussed in Section 2.2.2. We remind the reader that a word in principle can appear in any problem report. Two files do not need to share a problem report in order to share a word. In other words, there is a 1 to n relation between problem report words and source files. The process described above will not generate attributes that uniquely identify examples.

Once a file is associated with a bag of problem report words, it can be represented as a feature vector, and a file pair feature vector can be created using these individual file feature vectors. In Section 3 we describe the results of experiments using different sets of feature vectors.
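The union step can be sketched as follows, with invented report contents:

```python
# Sketch: a file's bag of words is the union W_p1 ∪ W_p2 ∪ ... ∪ W_pn
# of the bags of all problem reports whose fixes changed it.
def file_bag(report_bags):
    words = set()
    for bag in report_bags:
        words |= bag
    return words

w = file_bag([{"trunk", "busy"}, {"busy", "display"}])
# the file inherits words from every report that caused it to change
```

Two files then share a feature whenever the same word appears somewhere in each file's accumulated reports, even if the reports themselves differ.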

3. Experiments

To properly evaluate and compare the effects of using different techniques and feature sets, first we need to choose a performance measure or visualization method that best suits the task at hand. As we discussed earlier, the data in Table 1 shows the large imbalance that exists between the Relevant and Not-Relevant classes. It is well known that in settings such as this, accuracy, which is the number of cases correctly classified over the total number of cases classified, is not the best performance measure. The reason for this is that a simple classifier that always chooses the majority class will have a very high accuracy; however, such a classifier has no practical value.

It is also the case that most learning algorithms are designed to maximize accuracy, and therefore, faced with a training set with a large imbalance, they generate a classifier that always suggests the majority class. Therefore, to learn the co-update relation, we created less-skewed data sets from the training repository and tested the generated classifiers using the complete testing set. Each training set included all the Relevant examples in the training repository and K times as many Not-Relevant examples from the training

repository. We used the following 18 values for K: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50.

In the experiments using the syntactic attributes, the Not-Relevant examples in these less skewed training sets were formed by selecting a stratified sample of the Not-Relevant examples in the training repository. In other words, in the training sets we tried to maintain, as much as possible, the same proportion of Not-Relevant examples with a certain attribute value vector a1, a2, ..., an as was present in the skewed training repository.

Precision and recall are standard metrics from information retrieval, and are also used in machine learning. Precision tells one the proportion of returned results that are in fact valid (i.e. assigned the correct class). Recall is a complementary metric that tells one the proportion of valid results that are in fact found. While the precision and the recall of the Relevant class are appropriate measures for this application, they are not always convenient to use to compare classifier performance. We frequently found that, when comparing classifiers generated by different methods of creating training sets, while the precision plot for one method was more dominant, the recall plot for the other method tended to be more dominant.

We would like to use a visualization method which depicts measures that are equally important as precision and recall, but on one single plot. ROC plots [4] provide us with such a tool. In an ROC plot the horizontal axis represents the false positive rate and the vertical axis represents the true positive rate. For our application the Relevant class is the same as the positive class. We use the following 'confusion matrix' to define these measures.

                               Classified As
                         Not-Relevant    Relevant
  True   Not-Relevant         a             b
  Class  Relevant             c             d

Figure 2. A confusion matrix for a two class classification problem

The true and false positive rates for the Relevant class are defined as

  TPR = Relevant cases correctly classified / Number of Relevant cases = d / (c + d)

  FPR = Not-Relevant cases incorrectly classified / Number of Not-Relevant cases = b / (a + b)

Note that TPR is the same as the recall of the Relevant class.

In an ROC curve the following holds:
• Point (0,1) corresponds to perfect classification, where every positive case is classified correctly, and no negative case is classified as positive.
• Point (1,0) is a classifier that misclassifies every case.
• Point (1,1) is a classifier that classifies every case as positive.
• Point (0,0) is a classifier that classifies every case as negative.

Better classifiers have (FP, TP) values closer to point (0,1). A classifier is said to dominate another classifier if it is more 'northwest' of the inferior classifier.

Both true and false positive measures are intuitively applicable to our application domain. A high true positive value means that the classifier correctly classifies all the existing Relevant examples in the testing set. A low false positive value means that the classifier does not classify many Not-Relevant examples as Relevant. Once the superiority of a classifier over another classifier is determined by using the ROC plot, one can further investigate the quality of the better classifier by other means such as precision and recall plots.

3.1. Syntactic versus text based attributes

Figure 3 shows the ROC plots generated for the 18 ratios of skewness between Not-Relevant and Relevant examples using syntactic and text based features. We used exactly the same set of Relevant and Not-Relevant pairs for all experiments done with the same skewness ratio. In each plot the classifier with imbalance ratio 1 corresponds to the rightmost point on the plot. As the ratio of Not-Relevant examples to Relevant examples in the training set increases, the true and false positive rates decrease. The lower leftmost point on each plot corresponds to an imbalance ratio 50 classifier. While an increase in the training set skewness has the undesirable effect of decreasing the true positive rate, the amount of change in the case of problem report feature based classifiers is much less than for the classifiers using other feature sets.

[Figure 3. Comparison of syntactic and text based feature sets]

As Figure 3 shows, the classifiers learned from syntactic attributes generate interesting true and false positive values, but their performance is not sufficient to use them in the field. The figure shows that classifiers generated from the problem report feature set clearly dominate the classifiers generated from source file comment and syntactic feature sets. The figure also shows that although the comment feature set does not perform as well as the problem report feature set, the classifiers generated from this feature set still dominate the classifiers generated from syntactic features.

In Figures 4 and 5 we have shown the precision and recall plots corresponding to the 18 skewness ratios discussed above. As can be seen here, the increase in skewness results in an increase in the precision of the Relevant class and a decrease in its recall. However, in the case of problem report feature based classifiers the degradation of recall values occurs at a much slower rate. For the imbalance ratio of 50, the problem report feature based classifier can achieve a 62% precision and 86% recall. In other words, out of every 100 file pairs that the classifier predicts may change together, it is correct in 62 cases. Also, the classifier can correctly identify 86 out of every 100 file pairs that actually change together. The importance of such performance values becomes more apparent when we take into account that a maintenance programmer may be faced with thousands of potential source files in the system.
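The measures defined around Figure 2 can be computed directly from the four confusion matrix counts; the counts below are invented purely for illustration:

```python
# Sketch: an ROC point (plus precision) from confusion matrix counts
# a, b (true Not-Relevant row) and c, d (true Relevant row).
def roc_point(a, b, c, d):
    tpr = d / (c + d)            # recall of the Relevant class
    fpr = b / (a + b)            # Not-Relevant pairs flagged as Relevant
    precision = d / (b + d)      # correct among pairs flagged Relevant
    return tpr, fpr, precision

# Hypothetical counts, not taken from the paper's experiments.
tpr, fpr, precision = roc_point(a=900, b=100, c=14, d=86)
```

Each (fpr, tpr) pair computed this way is one point on the ROC plots of Figure 3; classifiers closer to (0, 1) dominate.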

[Figure 4. Comparison of the precision of classifiers generated from syntactic and text based feature sets]

[Figure 5. Comparison of the recall of classifiers generated from syntactic and text based feature sets]

[Figure 6. Combining syntactic and source file comment features]

3.2. Combining feature sets

We also investigated the effects of combining different feature sets. Feature sets can be combined by simply juxtaposing, i.e. concatenating, the features in each set. This is the obvious choice when combining syntactic and text based features, because they are different in their nature.

In the case of text based features, such as problem report word features and source file comment word features, we have at least two other alternatives. A word can appear in both a problem report and a file comment, and therefore be a feature in both feature sets. We can create a combined feature set by finding the intersection of the two sets of features, or we can find the union of the sets. In the first case, the new feature set consists of words appearing both in the comments of some source file and in the problem report words associated with a source file. In the second case, a word need only appear in the comments of some source file or in a problem report associated with a source file.

We have performed extensive experiments using the concatenation and the union methods of combining problem report word and source file comment feature sets. Due to space constraints, the experiments using the union method will be discussed elsewhere. Here, we would like to report that the union representation did not improve the results in most cases, including the more interesting results such as the ones obtained for the ratio 50 problem report word features classifier discussed in the previous section.

In Figure 6 we have shown the ROC plots for classifiers that are generated from examples using the concatenation of syntactic and source file comment features. Instead of using all the comment word features in the combined feature set, we only used features that appeared in the decision trees generated in the comment word feature set experiments presented in the previous section. This in effect performs a limited feature selection on the source file comment feature set. Feature selection is an active research area in machine learning. The idea is to select a subset of the available features to describe the example without loss of quality in the results obtained. A smaller feature set has the obvious benefit that creating an example or case requires fewer observations or calculations to find the values for the features. In some cases, appropriately selected smaller feature sets may also improve the quality of the results obtained.

This figure shows that the concatenation of these two feature sets generated classifiers that considerably dominate both the syntactic and source file comment feature sets.

We also combined syntactic features with the features that were used in the problem report word feature set experiments discussed in Section 3.1. The results are shown in Figure 7. In this figure once again we observe that the combination of syntactic and text based features improved the existing results for most of the ratios, including the more interesting classifiers generated from more skewed training sets. The improvements in the results are not as important as the ones seen in the case of combining file comment and syntactic features. This is not very surprising considering the fact that problem report word features on their own generate high quality results.

92

91

90

89

88

87 File Problem Report Words Juxtaposition of Syntactic and Used Problem Report Words 86 0

2

4 False Positive Rate

6

Figure7.Combiningsyntacticandproblem reportwordfeatures
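The dominance comparisons behind these ROC plots are mechanical and can be sketched in a few lines. The confusion counts below are invented for illustration and are not taken from the paper's experiments:

```python
def roc_point(tp, fn, fp, tn):
    """Map a classifier's confusion counts to its (TPR, FPR) ROC coordinates."""
    # True positive rate: fraction of Relevant file pairs correctly flagged.
    tpr = tp / (tp + fn)
    # False positive rate: fraction of Not Relevant pairs wrongly flagged.
    fpr = fp / (fp + tn)
    return tpr, fpr

def dominates(a, b):
    """Classifier a dominates b if it is at least as good on both ROC axes
    (higher TPR, lower FPR) and strictly better on at least one."""
    (tpr_a, fpr_a), (tpr_b, fpr_b) = a, b
    return tpr_a >= tpr_b and fpr_a <= fpr_b and (tpr_a > tpr_b or fpr_a < fpr_b)

# Hypothetical confusion counts for two classifiers on the same test set.
combined = roc_point(tp=96, fn=4, fp=2, tn=98)        # (0.96, 0.02)
syntactic_only = roc_point(tp=90, fn=10, fp=5, tn=95)  # (0.90, 0.05)
print(dominates(combined, syntactic_only))  # prints True
```

A curve "considerably dominating" another, as in Figure 6, corresponds to this check holding at every operating point along the plotted ratios.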

4. A closer look into the models learned

One benefit of an explainable model such as a decision tree is the ability to find out why the model classifies an example as being of a certain class. The other benefit is that the model may reveal interesting relations that are not necessarily known. The extent of what can be learned from a model depends on factors such as the features themselves and the complexity of the model.

As mentioned above, our best models were obtained when we used problem report based features. We could improve on these results by combining them with syntactic attributes. We further analyzed the decision trees generated from these combined features to see which features are the most influential in the decision making as far as the C5.0 induction algorithm is concerned. These are the words that appear higher up in the generated decision trees.

In Table 3 we have shown the features appearing at the top two levels of the eighteen decision trees generated from the combined syntactic and used problem report feature sets. For each feature we have only shown the highest level at which it appeared among the eighteen decision trees. For instance, the word "hogger" appeared nine out of eighteen times at the highest level in the decision trees, i.e. level 0. In the remaining nine decision trees this feature may have appeared at a lower level, including level 1; however, we only show the highest contribution of the word to the decision making process.

Table 3. Top attributes in the combined syntactic and problem report feature set

Level  Frequency  Feature
0      9          hogger
0      3          rad
0      2          play
0      2          trunk
0      1          fd
0      1          ver
1      10         SameFileName
1      4          acd
1      4          ce..ss
1      4          e&m
1      2          group
1      1          appear_show
1      1          call
1      1          display
1      1          extension
1      1          internal
1      1          mitel
1      1          ms2008
1      1          per
1      1          set
1      1          single
1      1          ss430
1      1          tapi

As can be seen in this table, only one of the features shown, namely "SameFileName", belongs to the syntactic features. Upon further analysis, we found the following ranking among the remaining syntactic attributes used, where the number in parentheses shows the highest level at which the feature appeared in a decision tree:

(2) Common Prefix Length
(3) Directly Include You
(3) Number of Shared Files Included
(3) Same Extension
(4) Number of Shared Direct and Indirectly Referred Routines

(4) Number of Routines Directly and Indirectly Referred in ISU and Defined in OSU
(5) Number of Shared Directly Referred Types
(6) Number of Shared Directly Referred non Type Data Items
(6) Number of Shared Routines Directly Referred
(13) Number of Routines Defined in ISU and Directly and Indirectly Referred by OSU
(15) Number of Shared Routines Among All Routines Referred in the Units
(78) OSU File Extension

These results show that the file name based attributes and file inclusion appear before other features based on routine calls and on data and type definitions and references. This suggests the importance of file naming conventions and the file inclusion mechanism as a way to group files in the system and share functionality.

We also computed the paths in the decision tree generated from this combined feature set for the imbalance ratio 50 training set. To provide a better sense of how the tests on these features result in a classification of a file pair, in Table 4 we show the top 5 paths to a leaf classified as Relevant. In the second column we show the number of training examples labeled as Relevant which satisfy the conditions in the path. If there is a second number in the column, it shows the number of Not Relevant examples that satisfy the conditions in the path and are therefore misclassified as Relevant. A tilde (~) is used to show that the test is negative, which means the word is not shared by the problem reports or the syntactic attribute evaluated to a false value, e.g. the two files do not have the same file name.

Table 4. Most accurate paths in covering training Relevant examples for imbalance ratio 50

Path                                                   #Examples
hogger & ce..ss                                        935
~hogger & SameFileName & fd & ~fac                     153/2
~hogger & ~SameFileName & alt & corrupt                103/2
~hogger & ~SameFileName & ~alt & zone & command        99/3
~hogger & ~SameFileName & ~alt & ~zone & msx & debug   76/1

Obviously there is a need for further analysis of these paths by an expert, since some words or their combinations carry special meanings that may be missed by someone who is not as familiar with the system.

We also looked into the classification results for the Relevant examples in the testing set. We sorted the paths by their accuracy in terms of classifying testing Relevant examples. We found exactly the same five paths, in the same order shown above.

C5.0 is also capable of creating rule sets as opposed to decision trees. Rule sets have the benefit of sometimes being simpler than decision trees. However, our experiments with the problem report feature sets showed that the generated rule sets perform worse than the corresponding decision trees for all imbalance ratios; we therefore only present the decision tree analysis results here. The fact remains, however, that one could in principle find alternative and potentially simpler explanations than the ones obtained from decision trees.

Although we are constrained by space, it is not difficult to envision similar or even more sophisticated analyses of the decision trees or rule sets generated from other feature sets. We intend to continue this analysis and report the results in future publications.
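The observation above, that the most influential feature is the one tested nearest the root, follows from how decision tree induction greedily maximizes information gain at each node. The following minimal sketch mirrors that choice; the feature names echo Table 3, but the toy file-pair examples and labels are invented:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, f):
    """Entropy reduction from splitting on binary feature f,
    as C4.5/C5.0-style induction evaluates at each node."""
    gain = entropy(labels)
    for value in (0, 1):
        subset = [lab for row, lab in zip(rows, labels) if row[f] == value]
        if subset:
            gain -= len(subset) / len(labels) * entropy(subset)
    return gain

# Toy file-pair examples; "hogger" almost determines Relevant (1) vs Not (0).
features = ["hogger", "SameFileName", "fd"]
rows = [(1,0,0), (1,1,0), (1,0,1), (1,1,1), (0,0,0), (0,1,0), (0,0,1), (0,1,1)]
labels = [1, 1, 1, 1, 0, 0, 0, 1]

gains = {name: information_gain(rows, labels, i) for i, name in enumerate(features)}
root = max(gains, key=gains.get)
print(root)  # prints "hogger": the level-0 test, i.e. the most influential feature
```

On this data "hogger" yields by far the largest gain, so it becomes the root test, just as it dominates level 0 in Table 3.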

5. Conclusions

In this paper we presented an application of machine learning to source code level software maintenance activity. We showed how one can use software update records to learn a relation that can be used to predict whether a change in one source file may require a change in another source file. We empirically compared three feature sets extracted from different information sources. Our experiments show that the problem report word features generate classifiers with precision and recall values appropriate for real life deployment of our method. We also showed that combining the used text based features with syntactic attributes in most cases improves the results. In Section 4 we presented examples of the analysis that can be performed on the generated decision trees. Such analysis can be used to better understand the learned model (or the relation that it represents). Clearly, one important future step for our research is the question of the field evaluation of these classifiers. We also perceive other research opportunities, including:

• Investigating the effects of automatic feature selection on the results.
• Experimenting with other SMS and non-SMS based feature sets.
• Using alternative learning methodologies such as Co-Training [3] to label file pairs. For instance, we could start with a set of Relevant examples labeled by the co-update heuristic, and a smaller set of Not-Relevant examples provided by software engineers. We could repeatedly generate unlabeled file pairs (and the corresponding examples) and then, using classifiers based on problem report, source file comment, or syntactic features, classify a desired number of unlabeled examples and add them to the pool of labeled examples. In each repetition we would create new classifiers from the new set of available training examples and evaluate them. This process is repeated as long as better results are obtained.
• Using the learned co-update relation in other applications, such as subsystem clustering, which is an important tool in better understanding large software systems.

Acknowledgements

The authors would like to thank the SX2000 team for their continuous support and help. This work was sponsored by CSER and supported by Mitel and NSERC.

References

[1] AAAI 2000 Workshop on Learning from Imbalanced Data Sets, Austin, Texas.
[2] B. Bellay and H. Gall. An Evaluation of Reverse Engineering Tool Capabilities. Journal of Software Maintenance: Research and Practice, v. 10 no. 5, pp. 305-331, 1998.
[3] A. Blum and T. Mitchell. Combining Labeled and Unlabeled Data with Co-Training. Proceedings of the 11th Conference on Computational Learning Theory, Madison, WI, Morgan Kaufmann Publishers, pp. 92-100, 1998.
[4] J. Egan. Signal Detection Theory and ROC Analysis. New York, Academic Press, 1975.
[5] K. A. Kontogiannis and P. G. Selfridge. Workshop Report: The Two-day Workshop on Research Issues in the Intersection between Software Engineering and Artificial Intelligence (held in conjunction with ICSE-16). Automated Software Engineering, v. 2, pp. 87-97, 1995.
[6] D. D. Lewis. Representation and Learning in Information Retrieval. Doctoral dissertation, University of Massachusetts, 1992.
[7] M. Marchand and J. Shawe-Taylor. The Set Covering Machine. Journal of Machine Learning Research, v. 3, pp. 723-746, 2002.
[8] T. J. McCabe. A Complexity Measure. IEEE Transactions on Software Engineering, v. 2 no. 4, pp. 308-320, 1976.
[9] D. Mladenic. Text-Learning and Related Intelligent Agents: A Survey. IEEE Intelligent Systems, v. 14 no. 4, pp. 44-54, 1999.
[10] J. Sayyad Shirabad, T. C. Lethbridge, and S. Lyon. A Little Knowledge Can Go a Long Way Towards Program Understanding. Proceedings of the 5th International Workshop on Program Comprehension, Dearborn, MI, pp. 111-117, 1997.
[11] J. Sayyad Shirabad, T. C. Lethbridge, and S. Matwin. Supporting Software Maintenance by Mining Software Update Records. Proceedings of the 17th IEEE International Conference on Software Maintenance, Florence, Italy, pp. 22-31, 2001.
