UNIVERSITÀ DEGLI STUDI DEL SANNIO DIPARTIMENTO DI INGEGNERIA
Dottorato di Ricerca in Tecnologie dell’Informazione per l’Ingegneria XXIX Ciclo
Doctoral Dissertation
Exploiting NLP Techniques to Support Developers during Software Maintenance and Evolution Tasks

Andrea Di Sorbo

Tutors: Prof. Gerardo Canfora, Prof. Corrado Aaron Visaggio
PhD Coordinator: Prof. Luigi Glielmo
Sommario

The huge availability of data, both in the web archives of open-source communities and in mobile application distribution platforms, has made it easier to study the communication dynamics of maintenance processes, and in particular the types of interactions and the tools through which software developers interact with each other, as well as the ways in which developers collect suggestions coming from users in order to improve their products. Several studies have shown that developers make intense use of written communication channels (e.g., mailing lists, issue trackers and chats) to organize and coordinate their activities and to gather feedback from users, which is useful for improving software products and making them better suited to users' needs. However, these kinds of communications (i.e., developer discussions and user comments) are commonly written in natural language, and in them it is possible to find: (i) heterogeneous mixtures of structured, semi-structured and unstructured information (e.g., a message in a developer discussion may include code fragments or debugging information); (ii) portions of text with different goals (e.g., discussing a new feature, or reporting an error condition); (iii) unnecessary details (e.g., about two thirds of the user comments in app stores contain information of little use to developers interested in maintaining and evolving their applications). For these reasons, manually classifying (or filtering) the different intents that these kinds of messages may have is a tedious and expensive process. Therefore, helping developers find, among the many messages written in natural language, the contents they need most is a very important task for supporting the decision-making processes that take place during software maintenance and evolution (for example, deciding which new features to implement, or which bugs to fix).
To this end, this dissertation studies the use of a semi-supervised approach, called Intention Mining, which aims to model the intention, i.e., the original goal of the writer behind a fragment of text written in natural language, by exploiting the grammatical structure of that fragment. Besides presenting this technique, we discuss how it has been exploited to (i) build automatic classifiers of development discussions; (ii) build classifiers able to identify, within the comments of mobile application users, contents that are useful to developers interested in performing maintenance and evolution tasks; (iii) automatically produce summaries of user comments that help developers better understand users' needs. We also present the tools we built to support developers in identifying the useful information that flows through these communication channels during software maintenance and evolution. The proposed technique is not intended to replace previously developed text mining techniques. Rather, it can be used together with such techniques to identify useful information within documents written in natural language.
Keywords: (software evolution, software maintenance, data mining, mobile applications, user feedback).
Abstract

The great availability of data in OSS communities, as well as in mobile distribution platforms (i.e., app stores), has encouraged research on (i) how open source projects and mobile apps are maintained, (ii) how developers interact with each other, and (iii) how developers gather suggestions from users in order to evolve their products. Previous research demonstrated that developers (i) make intense use of written communication channels (e.g., mailing lists, issue trackers and chats) to coordinate themselves during software maintenance activities, and (ii) usually collect user feedback helpful for improving their products. However, such kinds of messages (i.e., user feedback and developers' discussions) are usually written in natural language and may contain (i) a mix of structured, semi-structured and unstructured information (e.g., a development email may enclose code snippets or stack traces), (ii) text having different purposes (e.g., discussing feature requests, bugs to fix, etc.), and (iii) unnecessary details (e.g., about 2/3 of app reviews contain useless information from a software maintenance and evolution perspective). Thus, the manual classification (or filtering) of such messages according to their purpose would be a daunting and time-consuming task. We argue that helping developers discern the content of natural language messages that best fits their information needs is a relevant task to support them in decision-making processes during software maintenance and evolution (i.e., establishing the new features/functionalities to implement in the next release, the bugs which have to be fixed, etc.). To address this issue, in this dissertation we explore a semi-supervised technique, named Intention Mining, which tries to model the writer's main purpose within a natural text fragment, by exploiting the grammatical structure of the fragment and assigning it to an intention category (e.g., asking/providing help, proposing new features or solutions for a known problem, discussing bugs, etc.). In particular, we show how we exploited the approach for (i) building automated classifiers of development messages, (ii) constructing categorizers of mobile app reviews able to discern useful
contents from a software maintenance and evolution perspective, and (iii) building summaries of app reviews which help developers better understand users' needs. We also discuss several tools we built in order to support developers in discerning useful information over natural text channels during the maintenance and evolution of their software. Our approach is not aimed at replacing previous text mining techniques. Rather, it might be profitably used in combination with other techniques in order to mine helpful information within natural text documents.
Keywords: (software maintenance and evolution, unstructured data mining, natural language processing, mobile applications, user feedback).
List of publications

List of the publications of the candidate:

[1] Sebastiano Panichella, Andrea Di Sorbo, Emitza Guzman, Corrado Aaron Visaggio, Gerardo Canfora and Harald Gall (2015) "How Can I Improve My App? Classifying User Reviews for Software Maintenance and Evolution", Proceedings of the 31st International Conference on Software Maintenance and Evolution (ICSME 2015), Bremen, Germany, pages 281-290.

[2] Andrea Di Sorbo, Sebastiano Panichella, Corrado Aaron Visaggio, Massimiliano Di Penta, Gerardo Canfora, and Harald Gall (2015) "Development Emails Content Analyzer: Intention Mining in Developer Discussions", Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering (ASE 2015), Lincoln, Nebraska, pages 12-23.

[3] Gerardo Canfora, Andrea Di Sorbo, Francesco Mercaldo, Corrado Aaron Visaggio (2015) "Obfuscation Techniques against Signature-Based Detection: A Case Study", 2015 Mobile Systems Technologies Workshop (MST), Milano, Italy, pages 21-26.

[4] Andrea Di Sorbo, Sebastiano Panichella, Corrado A. Visaggio, Massimiliano Di Penta, Gerardo Canfora and Harald C. Gall (2016) "DECA: Development Emails Content Analyzer", Proceedings of the 38th International Conference on Software Engineering Companion (ICSE 2016), Austin, TX, pages 641-644.

[5] Andrea Di Sorbo, Sebastiano Panichella, Carol V. Alexandru, Junji Shimagaki, Corrado A. Visaggio, Gerardo Canfora and Harald C. Gall (2016) "What Would Users Change in My App? Summarizing App Reviews for Recommending Software Changes", Proceedings of the ACM SIGSOFT International Symposium on the Foundations of Software Engineering (FSE 2016), Seattle, WA, pages 499-510.
[6] Sebastiano Panichella, Andrea Di Sorbo, Emitza Guzman, Corrado A. Visaggio, Gerardo Canfora and Harald C. Gall (2016) "ARdoc: App Reviews Development Oriented Classifier", Proceedings of the ACM SIGSOFT International Symposium on the Foundations of Software Engineering (FSE 2016), Seattle, WA, pages 1023-1027.

[7] Gerardo Canfora, Andrea Di Sorbo, Francesco Mercaldo, Corrado A. Visaggio (2016) "Exploring Mobile User Experience Through Code Quality Metrics", Product-Focused Software Process Improvement, 17th International Conference Proceedings (PROFES 2016), Trondheim, Norway, pages 705-712.

[8] Andrea Di Sorbo, Sebastiano Panichella, Carol V. Alexandru, Corrado A. Visaggio, and Gerardo Canfora (2017) "SURF: Summarizer of User Reviews Feedback", Proceedings of the 39th International Conference on Software Engineering Companion (ICSE 2017), Buenos Aires, Argentina, pages 55-58.

[9] Giovanni Grano, Andrea Di Sorbo, Francesco Mercaldo, Corrado A. Visaggio, Gerardo Canfora, and Sebastiano Panichella (2017) "Android Apps and User Feedback: A Dataset for Software Evolution and Quality Improvement", Proceedings of the 2nd International Workshop on App Market Analytics (WAMA 2017), Paderborn, Germany, pages 8-11.
Acknowledgements

It is not trivial to find the words to thank the people who, in these three years, contributed in different ways to this achievement.

I would like to thank both my supervisors, Prof. Corrado Aaron Visaggio and Prof. Gerardo Canfora, for giving me the possibility and the freedom to pursue my research interests. They have been providing me with valuable advice since my first day as a PhD student! With their help I learnt to conduct research, to be more concrete, to write more clearly, and to take more risks. Thank you for your availability and your precious advice!

Special thanks to the external reviewers, Prof. Denys Poshyvanyk (College of William and Mary) and Prof. Rocco Oliveto (Università del Molise), for taking the time to read this dissertation and for their helpful suggestions.

My sincere thanks also go to Dr. Sebastiano Panichella (University of Zurich). We met and started collaborating when he was just a PhD student at the University of Sannio. Despite his huge workload, he was always available to discuss research ideas with me, and worked hard to realize them. From a professional point of view, he acted as an "unofficial" additional adviser, but now, after three years, I would rather consider him a friend. I have personally benefited from working with Dr. Panichella and sincerely hope to continue collaborating with him for a long time. Above all, I want to thank him because he taught me to have a method in my research work.

I would also like to offer my thanks to all the people at the University of Sannio who made my PhD experience unforgettable.

I sincerely thank my family. Cultural and economic soundness are important requirements for deciding to continue studying at my age. I would like to thank my parents for
giving me the opportunity to pursue my ambitions! They taught me the importance of building true relationships. "An ugly truth is much better than a beautiful lie. Say what you really think and not what other people expect", they said when I was younger. This principle guides me in my everyday life, and I can say that I have found it very useful also in my professional experience. Thanks also to my brothers, Leo and Alessandro, for their advice, for criticizing some of my actions when it was needed, and for supporting my decisions. They have always believed in me!

Finally, all the words in the world are not enough to express my gratitude to the person who decided to stay with me, despite my not always sweet temper. Thank you, Ilaria, for supporting and (sometimes) tolerating me. Thank you for taking care of me in these years. You are not only my girlfriend, but also my best friend. You are my reason for living. I love you!
Contents

Sommario
Abstract
List of publications
Acknowledgements

1 Introduction
  1.1 Context
  1.2 Motivation
  1.3 Research Goals
  1.4 Outline

2 Background
  2.1 Managing Information in Development-related Conversational Channels
  2.2 App Reviews Analysis for Software Engineering
    2.2.1 Classification
    2.2.2 Content
    2.2.3 Requirements Engineering
    2.2.4 Sentiment Analysis
    2.2.5 Summarization
  2.3 Main Contributions

3 Intention Mining in Developers' Discussions
  3.1 Introduction
  3.2 Classifying Emails according to Intentions
    3.2.1 Categories Definitions of Development Sentences
    3.2.2 Identifying Linguistic Patterns
    3.2.3 The DECA tool
  3.3 Evaluation: Study Design
    3.3.1 Research Questions
    3.3.2 Context Selection and Data Extraction
    3.3.3 Analysis Method
    3.3.4 An Approach Based on ML for Email Content Classification
  3.4 Evaluation: Analysis of Results
    3.4.1 RQ1-a: Is the proposed approach effective in classifying writers' intentions in development emails?
    3.4.2 RQ1-b: Is the proposed approach more effective than existing ML in classifying development emails content?
  3.5 DECA in a real-life application: Code Re-documentation
  3.6 Threats to Validity

4 Classifying App Reviews for Software Maintenance and Evolution
  4.1 Introduction
  4.2 Approach
    4.2.1 Taxonomy for Software Maintenance and Evolution
    4.2.2 Text Analysis
    4.2.3 Natural Language Processing
    4.2.4 Sentiment Analysis
    4.2.5 Learning classifiers
    4.2.6 The ARdoc tool
  4.3 Evaluation: Study Design
    4.3.1 Research Questions
    4.3.2 Dataset
    4.3.3 Evaluation Methodology
  4.4 Evaluation: Analysis of Results
    4.4.1 RQ2-a: Are the language structure, content and sentiment information able to identify user reviews that could help developers in accomplishing software maintenance and evolution tasks?
    4.4.2 RQ2-b: Does the combination of language structure, content and sentiment information produce better results than individual techniques used in isolation?
    4.4.3 Evaluating ARdoc's classification performances
  4.5 Threats to Validity

5 Summarizing App Reviews for Recommending Software Changes
  5.1 Introduction
  5.2 Approach
    5.2.1 The User Reviews Model
    5.2.2 The SURF Approach
    5.2.3 How to SURF app reviews data
  5.3 Evaluation: Study Design
    5.3.1 Context
    5.3.2 Analysis Method
    5.3.3 Research Method
  5.4 Evaluation: Analysis of Results
    5.4.1 RQ3-a Results
    5.4.2 RQ3-b Results
    5.4.3 RQ3-c Results
  5.5 Threats to Validity

6 Conclusion
  6.1 Concluding remarks
  6.2 Future work

References
List of figures
List of tables
Chapter 1
Introduction

Developers and software engineers undertaking maintenance and evolution activities often need to search, collect and cross-reference information relevant to the maintenance task at hand. This information helps them make decisions on specific courses of action (e.g., the new functionalities to implement, the bugs to fix). However, the amount of information they have to manage is often huge, and a developer usually ends up glancing or skimming through the details of an artifact to retrieve the desired piece of information, which is tedious and time-consuming. Starting from these considerations, we argue that there is an urgent need to study new techniques to better assist developers in finding the specific information they seek in software artifacts. This dissertation studies mechanisms for processing the unstructured information contained in natural language texts, in order to automatically mine contents that are useful for developers interested in accomplishing software maintenance and evolution tasks. In this chapter we describe the context and the motivations behind our research, as well as the overall organization of the document.
1.1 Context
In software engineering the terms evolution and maintenance are often used as synonyms. However, Godfrey and German [45] pointed out some semantic differences between the two terms. In particular, while maintenance implies preservation and repair without any changes to the software design, the term evolution allows for the creation of new software designs that evolve from previous ones. Furthermore, maintenance is generally considered to be a set of planned activities, whereas evolution includes everything that occurs to a software system over time and therefore includes planned and unplanned activities (e.g., the appearance of users with different needs and the emergence of new usage scenarios). In this dissertation we use software maintenance and/or evolution to refer to any modification made on a system after its delivery¹.

¹ https://www.iso.org/standard/39064.html

The literature is full of studies showing that software maintenance is, by far, one of the most important and costly software engineering practices. In 1974, Lehman and Belady proposed a set of laws pertaining to software evolution [81]. These laws essentially characterize the software evolution process as a self-regulating and self-stabilizing system, subject to continuing growth and change [85]. In the following, the eight laws of software evolution formulated by Lehman during his research activities [81–85] are described:

1. Continuing Change. The first law postulates that a program must continually adapt to its environment, otherwise it becomes progressively less useful [82].

2. Increasing Complexity. The second law postulates that as a program evolves, its complexity increases, unless proactive measures are taken to reduce or stabilize the complexity [82].

3. Self Regulation. The third law postulates that the evolution of large software systems is a self-regulating process, i.e., the system will adjust its size throughout its lifetime [83].

4. Conservation of Organizational Stability. This law, also known as "invariant work rate", stipulates that the rate of productive output tends to stay constant throughout a program's lifetime [83].

5. Conservation of Familiarity. This law suggests that incremental system growth tends to remain constant (statistically invariant) or to decline, because
developers need to understand the program's source code and behavior. A corollary is often presented, stating that releases that introduce many changes will be followed by smaller releases that correct problems introduced in the prior release, or restructure the software to make it easier to maintain [83].

6. Continuing Growth. This law stipulates that programs usually grow over time to accommodate pressure for change and satisfy an increasing set of requirements [83].

7. Declining Quality. This law stipulates that, over time, software quality appears to decline, unless proactive measures are taken to adapt the software to its operational environment [83].

8. Feedback System. According to this law, software projects are self-regulating systems with feedback [84]. More precisely, based on the concept of feedback in system dynamics, this law states that the size of a system can be described in terms of the size of the previous release and the effort for the actual release.

Much time has passed since then, but many of Lehman's laws still apply today, despite the evolution of technology in general. Software continues to evolve long after the first version has been deployed, and numerous studies indicate that the costs associated with software maintenance and evolution exceed 50% (and are sometimes more than 90%) of the total costs related to a software system [121]. To reduce these costs, both managers and developers must understand the factors that drive software evolution and take proactive steps that facilitate changes and ensure software does not decay [138]. The advent of open-source communities and agile methodologies tried to address some of the issues related to software maintenance. In particular, under these development philosophies, the line between maintenance and the earlier portions of the development life-cycle has blurred. Indeed, in more traditional development life-cycles, maintenance begins after the software has been used in a production environment [129]. However, open source applications are often put into use as soon as they can be compiled to accomplish something useful, stemming from the "release early, release often" philosophy popularized by the development of Linux [120]. This is necessary so that user feedback can be incorporated. The open source model is an incremental model of development, where the user is given a series of prototypes or partially working pieces of software [129]. However, in open source projects, these increments occur far more often. The "release often" portion of the Linux philosophy
maximizes the usefulness of user feedback. The differences between open source and more traditional development methods also lead to a number of advantages during the maintenance phase. Specifically, open source projects allow any end user to double as a co-developer, as users can examine the source code and suggest bug fixes. However, the Lehman laws of Continuing Change, Increasing Complexity, Self Regulation and Continuing Growth are still applicable to the evolution of open source software [138].

The spread of smartphones, tablets and other mobile devices has posed a number of specific challenges to software developers (e.g., mobile devices exhibit different form factors and interaction paradigms, mobile software must operate taking into account the varying reliability of the Internet connection, and software has to be designed in order to minimize the usage of hardware and communication resources and, consequently, reduce power consumption) [40]. In particular, there has been exponential growth in mobile application (also called app) development since the Apple App Store opened in July 2008. Nowadays, more than 4 new apps are released on the Google Play Store every day (while on the Apple App Store the average number of new apps per day is greater than 3)². Thus, the quantity of software applications for smart devices grows very quickly, along with the features and computing power offered by mobile equipment.

² https://42matters.com/stats

Mobile software development suffers from several limitations not present in desktop computing that make the mobile ecosystem a peculiar environment. The fast-paced mobile market creates the need for lightweight processes that facilitate change and the adoption of new technologies or emerging trends. An effective development strategy for mobile apps should consider the quality drivers of the mobile ecosystem, the expectations of the end users, and the conditions posed by the application marketplace; at the same time, it should be flexible enough to adapt to the advancements of the enabling technologies, and to cope with the marketplace's competitiveness and evolution [27]. Practically speaking, mobile apps should be developed quickly and kept at a low price to be successful in a marketplace of millions of potential users with an offer of millions of products. A decade of evolution in the mobile domain (software, hardware and business models) has brought significant advancements. In 2010, most mobile applications were relatively small (i.e., averaging several thousand lines of source code), with one or two developers responsible for conceiving, designing, and implementing the application, but rarely using any formal development processes [133]. Nowadays, mobile software is still developed by small teams and small-medium enterprises, but it is also part of major developments that involve large corporate teams. Moreover, mobile applications are no longer limited to
stand-alone applications: they also interact with other systems and collaboration tools, and make heavy use of network and hardware resources. This also implies that the mobile software product is no longer small by definition [27]. Nevertheless, the researchers in [140] provided evidence that mobile apps also follow Lehman's law of Continuing Change. In particular, they also observed that mobile applications (i) show an increase in lines of code over time (i.e., Continuing Growth and Increasing Complexity) and (ii) continue to decline in quality over time (i.e., Declining Quality). Practical evidence of the validity of Lehman's law of Continuing Change in this context can be found in the fact that about 14% of mobile applications in the Google Play store are updated on a bi-weekly basis (or more frequently) [94].

In conclusion, although the Continuing Change law of software evolution was formulated more than forty years ago in the context of proprietary software, it has been proven to be still valid in current open-source projects and mobile app development. Thus, developers and managers usually require techniques and tools to cope with it.
1.2 Motivation
During software maintenance, developers need to deal with information captured in different sources, such as requirements documents, bug reports, and source code. These artifacts may contain excessive information, with a lot of unnecessary details, and developers are often forced to skim these details to find the desired information [102]. This is especially true in some open-source communities adopting agile methods, which mostly use informal communication (between developers and with users) rather than formal documentation as a helper for software development [34]. For example, there are many ways through which the members of the Apache community can report problems and propose enhancements (e.g., change requests are reported on the developer mailing list, in issue trackers, and in newsgroups). Thus, in order to identify the work to be done, developers are often required to be aware of discussions occurring over the different communication means, because they may contain information or knowledge that could be crucial for deciding about a software maintenance activity [97].

However, the manual analysis of these discussions is time-consuming for several reasons: (i) a development email or a post on an issue tracker contains a mix of structured, semi-structured, and unstructured information [6], (ii) the messages posted on such communication means may have different purposes (e.g., an issue report may
relate to a feature request, a bug, or just to a project management discussion [60]), (iii) sometimes emails or discussions are too long and the reader gets lost in unnecessary details, or (iv) pieces of information regarding the same topic are scattered among different sources [111].

As remarked by Lehman's laws of software evolution, developers accomplishing software maintenance also have to be aware of user needs, and of how these needs and expectations could change over time [106]. At the very beginning of the digital era, the customers of the software market mostly consisted of small groups of technically skilled engineers and researchers, who had very specific and limited needs. However, with the evolution of computing power and the spread of personal computers and, later, of mobile devices, the audience of software users has expanded more and more, capturing the interest of more heterogeneous groups with a wide variety of needs and expectations. However, satisfying such a variety of needs is not always straightforward. In particular, many software requirements can be identified only after a product is deployed, once users have had a chance to try the software and provide feedback [74]. Thus, previous research has pointed out the importance of user involvement during the post-deployment phases, and in particular during the software maintenance and evolution of a software product [106].

One way to extract user feedback is to adopt typical channels used in traditional software development, such as (i) bug/change repositories, (ii) crash reporting systems, (iii) online forums, and (iv) emails. However, mobile distribution platforms (such as the Google Play Store and the Apple App Store) promote user involvement by providing user feedback features, which allow users who download an application to rate it with a number of "stars" and post a review message. Thus, app stores provide a sort of communication channel between users and developers. User reviews transmitted through this channel may contain multiple and heterogeneous topics, such as (i) user experience, (ii) bug reports, and (iii) feature requests, but previous studies demonstrated that only a small subset of comments includes helpful feedback from a software maintenance and evolution perspective [24, 105]. However, for some popular apps, the volume of user reviews is simply too large to manually check all of them [105].

One of the main challenges for software evolution researchers is to persuade developers about the value of cross-referencing information as it is created, when the knowledge is still fresh in their minds, in order to document the rationale behind a specific decision taken during a development activity. The field of Mining Software Repositories (MSR) aims to recover latent relationships between many kinds of
software artifacts: CVS check-ins, bug reports, test suites, email messages, user forum postings, and various kinds of documentation. This is done using a variety of techniques, such as parsing, data mining, various kinds of analysis, and the use of heuristics [45].
1.3 Research Goals
In this dissertation our aim is to address some of the issues described in the previous section, to better support developers when collecting information to accomplish software maintenance and evolution activities. In particular, our purpose is to provide assistance to developers in finding information that is useful from a software maintenance and evolution perspective within (i) development discussions (i.e., occurring in mailing lists) and (ii) user feedback. Since the information contained in both sources is written in natural language, and its unstructured nature makes the task non-trivial [6, 24], we devise an approach, named Intention Mining, relying on natural language parsing to capture linguistic patterns and classify text fragments according to the writers' purposes. The use of natural language parsing is motivated by the need to better capture the intent of a sentence in a discussion, a task for which techniques based on lexicon analysis, such as Vector Space Models [8], Latent Semantic Analysis, or Latent Dirichlet Allocation (LDA) [13], would not be sufficient. For example, consider the following sentences:

1. The button in the page doesn't work

2. A button in the page should be added

We notice that (i) these sentences share a lot of terms, and (ii) a topic analysis would reveal that they are likely to treat the same topics. However, these two sentences have completely different intentions: in sentence (1) the writer describes a problem, while in sentence (2) the writer asks for the implementation of a new feature. Thus, they could be useful in different maintenance contexts. This example highlights that understanding the intentions in developers' communication and/or user comments could add valuable information for guiding developers in detecting text fragments useful to accomplish different and specific maintenance and evolution tasks. Our research stems from the conjecture that when developers and users write about existing bugs, suggest new functionalities to be implemented or propose solutions for known problems, they tend to use some recurrent linguistic patterns. These patterns are exploited in order to identify the intention of each text fragment.
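To make this intuition concrete, the following minimal Python sketch shows how a handful of recurrent linguistic patterns, rather than shared vocabulary, can assign the two example sentences above to different intention categories. It is an illustration only, not the approach developed in this dissertation (which relies on full natural language parsing rather than regular expressions); the pattern set and the rules used here are assumptions made for this example.

import re

# Illustrative only: a tiny set of hand-written patterns mapping sentences to
# intention categories. The real approach exploits the grammatical structure
# of the sentence (natural language parsing), not regular expressions.
INTENT_PATTERNS = [
    ("problem discovery", re.compile(r"\b(doesn't|does not|can't|cannot)\s+\w+", re.IGNORECASE)),
    ("problem discovery", re.compile(r"\b(crash(es|ed)?|fails?|broken)\b", re.IGNORECASE)),
    ("feature request", re.compile(r"\bshould be (added|implemented|supported)\b", re.IGNORECASE)),
    ("feature request", re.compile(r"\b(please|could you)\s+add\b", re.IGNORECASE)),
]

def classify_intention(sentence):
    # Return the first category whose pattern matches, or "other" if none does.
    for category, pattern in INTENT_PATTERNS:
        if pattern.search(sentence):
            return category
    return "other"

for s in ["The button in the page doesn't work",
          "A button in the page should be added"]:
    print(s, "->", classify_intention(s))
# Despite sharing most of their vocabulary, the first sentence is classified
# as a problem discovery and the second as a feature request.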
In particular, we design our study in order to answer the following research question:

• RQ: Is Intention Mining effective in supporting developers during software maintenance and evolution tasks?

Our aim is to assess whether the Intention Mining approach facilitates the analysis of (i) development discussions occurring in development mailing lists and (ii) user feedback appearing in app reviews. Thus, stemming from this general research question, we derive three further sub-questions that need to be answered to quantitatively and qualitatively assess the effectiveness of Intention Mining in the mentioned fields of research:

– RQ1: To what extent is Intention Mining effective in automatically discerning useful paragraphs in development emails for developers interested in accomplishing software maintenance and evolution tasks? This research question is aimed at evaluating our approach when used by developers to recognize specific paragraphs in development discussions which could be connected to specific software maintenance activities (e.g., implementing new solutions/features, fixing bugs, etc.).

– RQ2: Does Intention Mining support developers in automatically classifying useful feedback for maintaining and evolving their apps? The purpose of this research question is to investigate whether Intention Mining is able to overcome some of the limitations present in previous approaches aimed at identifying useful feedback, from a software maintenance and evolution perspective, in user reviews of mobile apps.

– RQ3: To what extent does a summarization technique relying on Intention Mining produce summaries of app reviews that help developers better understand users' needs? The goal of this research question is to investigate whether Intention Mining enables a summarization technique able to reduce the effort required to analyze the feedback contained in user reviews.
1.4 Outline
The dissertation is divided into six chapters and proceeds as follows. In Chapter 2 we review the research background.
In Chapter 3 we propose a taxonomy of high-level categories modeling developers' information needs when they have to deal with software maintenance and evolution tasks. Moreover, we describe how we successfully employed Intention Mining for classifying text paragraphs in developers' discussions according to this taxonomy. Finally, we illustrate a prototype able to automatically process development emails, identifying useful paragraphs from a software maintenance and evolution perspective.

Chapter 4 shows how Intention Mining can be applied to the context of mobile applications' evolution. Specifically, we illustrate how the use of different combinations of techniques helps against some limitations shown by previous approaches. Finally, we describe a prototype able to automatically classify app reviews useful for software maintenance.

In Chapter 5 we first define URM (User Reviews Model), a two-level classification model which takes into account both the users' intentions (when giving feedback to developers) and the specific aspect of the app covered by the review. Moreover, we introduce SURF (Summarizer of User Reviews Feedback), a summarization approach, built on top of URM, to automatically generate summaries of user feedback with the aim of helping developers better understand users' needs.

Chapter 6 concludes the thesis and describes future research directions.
Chapter 2
Background

This chapter describes (i) the background knowledge related to techniques to support developers in efficiently managing the information contained in natural language development artifacts, as well as (ii) the literature related to app review analysis and summarization.
2.1 Managing Information in Development-related Conversational Channels
As discussed in Chapter 1, there are many different kinds of software artifacts which developers usually employ in order to coordinate themselves. For example, a bug report usually contains information about the creation and the resolution of bugs, feature enhancements, and other maintenance activities, written in natural language form. Since a bug report often involves comments by other users and developers, it also represents a means through which different contributors discuss problems in a conversational way [102]. Another very popular communication channel for developers is the mailing list. Mailing lists usually comprise sets of time-stamped email messages, and each message consists of (i) a header (including sender, receiver(s) and timestamp), (ii) a message body (i.e., the text content), and (iii) a set of attachments. Sometimes mailing lists also support different software engineering tasks, such as the documentation of source code. Developer forums are also used by developers to discuss a variety of development topics related to a given software system, programming problems, feature requests, or project management issues. Finally, IRC chats are means through which developers can hold on-line meetings and organize their work. These chats usually contain information about the project status and the project discussions about specific and relevant development topics [122].

In recent years, several research works proposed approaches based on Natural Language Processing (NLP) analysis as tools to derive important information aimed at supporting program comprehension and maintenance activities [36, 59, 73, 90, 108, 135, 137, 141]. Some of these approaches are aimed at providing developers with information about the reason behind a code change. For example, Rastkar and Murphy [117] proposed the automated generation of summaries describing the motivation of code changes based on the information present in the related documents. They identified a set of sentence-level features to locate the most relevant sentences in a change set to be included in a summary. In a similar effort, some other approaches have been developed to achieve the same purpose [16, 29].

Bug reports and mailing lists both contain conversational data along with combinations of structured (e.g., code snippets, stack traces, etc.) and unstructured information, which constitute a precious source of information for developers interested in accomplishing software maintenance activities. Rastkar et al. [118] first recognized the similarity between email threads and bug reports, and utilized an existing
technique [101], created for email and conversation summarization, to produce concise summaries of bug reports. In subsequent research [119] they also demonstrated that concise summaries of original bug reports could help developers save time in detecting duplicate bug reports. To ease the comprehension of bug reports, Lotufo et al. [30, 88] proposed a PageRank-based summarization technique with the aim of making the information contained in this kind of artifact more digestible (i.e., easier to look through). Yeasmin et al. [139] produced an interactive visualization of bug reports using extractive summaries and the relationships between different bug reports. In addition, several studies have been carried out to address the problem of automatically establishing a link between two artifacts (e.g., emails and source code, or bug reports and source code). For example, Bacchelli et al. [7] devised a set of lightweight methods to establish the link between emails and source code. Similarly, Aponte and Marcus [4] used text summarization techniques to address the traceability link recovery problem. Shihab et al. [123] showed that mailing list discussions are closely related to source code activities and that the types of mailing list discussions are good indicators of the types of source code changes (i.e., code additions and code modifications) being carried out on a project.

In the literature, several techniques have been devised to speed up and facilitate the management of emails. The work by Corston-Oliver et al. [28] illustrated an approach to provide task-oriented summaries of email messages by identifying task-related sentences in development messages. Ko et al. [75] presented a study that observed (i) how nouns, verbs, adverbs and adjectives are used to describe software problems and (ii) the extent to which these parts of speech can be used to detect software problems. Cohen et al. [26] proposed an approach that, relying on machine learning methods, classifies emails according to the intent of the sender. Differently from these previous works, we focus on classifying sentences written by developers during development discussions (in mailing list data). Thus, we analyze both syntactic and semantic information to discover significant recurrent patterns useful to recognize relevant sentences within messages.
2.2 App Reviews Analysis for Software Engineering
Before the spread of the personal computer, software users were mainly engineers, scientists or programmers, and software was developed according to the needs of this homogeneous group. With the diffusion of personal computers, the group of software users started to become increasingly heterogeneous. To address the problems deriving
from this variety of users, research began to take user satisfaction into account [48] and software engineering methodologies began to involve users throughout the software lifecycle [68]. Nowadays, user feedback is the means through which developers become aware of users' needs and expectations. More formally, user feedback is the "relevant information provided by end-users of software applications with the purpose of requesting enhancements and changes, reporting issues and communicating needs, as well as to report their overall experience and opinions about the applications" [98]. User feedback can be obtained through direct (i.e., the user information is sent directly to the developers or analysts) or indirect (i.e., the information is shared among other users) communication with the developers and analysts involved in the software development [58]. Research has shown that integrating user feedback in software development can improve the quality of the requirements, as well as the usability and usefulness of the software [86]. These improvements result in lower maintenance costs and higher sales [130]. Pagano et al. [106] hypothesized that there is a need for tools aimed at structuring, analyzing and tracking user feedback. In this dissertation we apply data mining techniques for processing explicit user feedback, provided through direct communication with developers and analysts.

A mobile application (or app) is a software application that runs on smartphones or other mobile devices. Mobile applications can be delivered to users through mobile application distribution platforms (also referred to as app stores). These app stores provide a functionality which allows users who have downloaded an app to write reviews (also called app reviews) about it. Within an app review the user can give a rating to the app and write a comment. The feedback given by users in the form of app store reviews has a high value for developers and managers involved in the app's evolution, as it allows them to become familiar with users and their needs. User expectations towards apps change rapidly, and developers must keep up with demand to remain competitive [63]. App stores are a recent phenomenon: Apple's App Store and Google Play were launched in 2008, and since then both have accumulated several million downloadable apps. Consequently, software engineering researchers have access to large numbers of software applications together with customers' feedback (i.e., app reviews) unavailable in previous software deployment mechanisms [92]. Research related to app review analysis was first published in 2012, and since then the literature in this field has developed along five different directions: (i) classification, (ii) content analysis, (iii) requirements engineering, (iv) sentiment analysis, and (v) summarization [92].
2.2.1 Classification
Reviews have been classified for spam, maturity ratings, and privacy and security risks. Research in 2015 has also used reviews to help detect erroneous apps [92]. Chandy and Gu [23] mined about 6 million reviews from the Apple App Store and trained both a supervised decision tree and an unsupervised latent class analysis in order to identify spam reviews. Chen et al. [25] compared the maturity ratings of apps present in both the Apple App Store and Google Play. They found that about 10% of Android apps were underrated while about 18% were overrated, and they also trained a classifier on the sets of app descriptions, user reviews and iOS maturity ratings, in order to automatically verify app maturity ratings. Ha et al. [56] manually analyzed 556 comments from 59 Google Play apps and clustered them into topics and sub-topics. They found that most reviews dealt with the quality of the app rather than security and privacy contents. Cen et al. [19] contrived an approach to identify comments dealing with security and privacy issues from a set of app reviews. Gómez et al. [46] employed an unsupervised machine learning approach to identify apps that may contain bugs, using about 1.5 million reviews mined from about 50,000 apps. Guzman et al. [53] developed an ensemble of machine learning classifiers in order to classify user reviews. They achieved a precision of 74% and a recall of about 60% on a manually labelled set of about 2,000 reviews. McIlroy et al. [95] presented an approach that can automatically assign multiple labels (i.e., multi-labelling) to user reviews in order to assist (i) developers in better understanding users' concerns, (ii) app store owners in uncovering anomalous apps, and (iii) users in comparing competing apps. Through a large-scale empirical study they demonstrated that (i) up to 30% of the user reviews raise more than one issue type, (ii) the proposed multi-label approach can correctly label issue types in user reviews with a precision of 66% and a recall of 65%, and (iii) the multi-label approach assists all three stakeholder types (i.e., developers, users, and app store owners) in three proof-of-concept scenarios: app comparison, market overview, and anomaly detection.
2.2.2 Content
Review content literature has investigated the vocabulary and ontology of reviews, the factors affecting feedback, and the devices most used for review submission [92]. In particular, Hoon et al. [64] and Vasa et al. [125] collected a dataset containing about 9 million reviews from the Apple App Store and analysed the reviews and the vocabulary used. Iacob et al. [67] studied how the price and rating of an app influence
the type and amount of user feedback that it receives through reviews. Khalid et al. [72] investigated the devices used to submit app reviews, in order to determine the optimal devices for testing. Palomba et al. [107], by linking reviews to the code changes of 100 open source Android apps, found that about 50% of the requests were implemented in new app releases. Hoon et al. [61] devised an ontology of words used to describe software quality attributes in app reviews. McIlroy et al. [93] studied responses to reviews related to Google Play apps. Gui et al. [50] analysed ad-related complaints extracted from app user reviews. They discovered that most ad complaints were related to how ads interfered or interacted with UI-related aspects of the mobile app. Specifically, they found that three topics were raised most often: (i) the frequency with which ads were displayed, (ii) the timing of when ads were displayed, and (iii) the location of the displayed ads.
2.2.3 Requirements Engineering
In this direction of research, reviews have been mainly used to extract bug reports and feature requests, in addition to prioritizing critical reviews [92]. Oh et al. [104] developed a review digest system based on an SVM (Support Vector Machine) classifier, which automatically discerns between informative and non-informative reviews, relying on two sets of words: (i) the first set contains the words appearing in reviews manually labeled as informative, and (ii) the second set comprises words present within reviews manually categorized as non-informative. Iacob and Harrison [65] developed MARA, which relies on linguistic rules to automatically identify feature requests in app reviews. Moreover, as an extension to the MARA system, Iacob et al. [66] introduced a set of linguistic rules for also identifying bug reports in app reviews, in order to facilitate app development. Park et al. [114] developed AppLDA, a topic model which discards review-only topics, in order to allow developers to gather only reviews discussing features present in the app descriptions. Maalej and Nabil [89] used multiple binary classifiers to identify bug reports and feature requests in user reviews. They also found that stopword removal and lemmatisation could negatively affect the effectiveness of the classification task. Moran et al. [99] developed the FUSION system, which helps users complete bug reports, through static and dynamic analysis of Android apps.
2.2.4 Sentiment Analysis
The works discussed in this subsection have exploited sentiment in their study of reviews. Sentiment describes a user's moods or opinions towards the topics treated in a text. Typically a sentiment can be positive or negative, and it is extracted from reviews relying on "positive" sentiment words (e.g., good, great, love, ...) and "negative" sentiment words (e.g., bad, hate, terrible, ...). In particular, sentiment analysis helps in identifying causes of user complaints, and in prioritizing informative reviews [92]. Goul et al. [47] performed sentiment analysis on 5,000 reviews from the Apple App Store in order to facilitate requirements engineering. Galvis Carreño and Winbladh [42] extracted user requirements from comments using the ASUM model [69], a sentiment-aware topic model. Hoon et al. [62] analysed about 30,000 short reviews and found that they are mostly made up of sentiment words. Khalid [70, 71] studied negative reviews related to 20 free iOS apps and found that users were most dissatisfied by issues related to invasion of privacy, unethical behaviour and hidden costs. Pagano and Maalej [105], through a study involving more than 1 million reviews from the Apple App Store, showed that positive feedback was often associated with highly downloaded apps, while negative feedback was often associated with less downloaded applications. Chen et al. [24] produced AR-Miner, a system for extracting the most informative reviews, placing weight on negative sentiment reviews. Guzman and Maalej [52] studied the differences between user sentiments in Google Play and the Apple App Store. Guzman et al. [51] developed DIVERSE, a tool that clusters reviews with similar sentiments about the same feature in order to condense the information.
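As a purely illustrative aside, the word-counting idea described at the beginning of this subsection can be sketched in a few lines of Python. The word lists below are assumptions made for this example; real sentiment tools use much richer lexicons and also handle negation, intensifiers and emoticons.

# Toy lexicon-based sentiment scoring: count "positive" and "negative" words.
POSITIVE_WORDS = {"good", "great", "love", "awesome", "nice"}
NEGATIVE_WORDS = {"bad", "hate", "terrible", "awful", "useless"}

def review_sentiment(review):
    words = [w.strip(".,!?").lower() for w in review.split()]
    score = sum(w in POSITIVE_WORDS for w in words) - sum(w in NEGATIVE_WORDS for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(review_sentiment("I love this app, great design!"))       # positive
print(review_sentiment("Terrible update, the app is useless"))  # negative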
2.2.5 Summarization
A number of tools aimed at summarizing review data have been proposed, in order to help developers mine information from large amounts of comments that would be infeasible to read manually [92]. Fu et al. [39] designed WisCom for enabling summarization of reviews at a per-review, per-app or per-market level. This tool provides overviews of "complaint" or "praise" comments over time. It is also useful for large-scale overviews of competitor apps. Gao et al. [43] developed AR-Tracker, which automatically collects app reviews and ranks them in order to prioritize the review set according to frequency and importance. Vu et al. [131, 132] proposed MARK, a system which identifies keywords in sets of reviews to assist summarization and search tasks. Gu and Kim [49] designed
SUR-Miner, a tool for categorizing and summarizing app reviews. The tool produces a visualization, and 28 out of 32 of the surveyed mobile developers agreed that the tool is useful. Villarroel et al. [127] introduced CLAP, a solution to (i) identify bug reports and suggestions for new features in user reviews, by using a customized text preprocessing step which unifies synonyms relying on a previously defined custom dictionary [10] and the Random Forest machine learning algorithm [14], (ii) cluster together related reviews (e.g., all reviews reporting the same bug) by applying the DBSCAN clustering algorithm [37], and (iii) automatically prioritize the clusters of reviews to be implemented when planning the subsequent app release, relying on a Random Forest machine learner which labels each cluster as high or low priority on the basis of some features characterizing the cluster (e.g., the average rating of the cluster, the number of reviews in the cluster, etc.).
2.3 Main Contributions
The main contributions of our research work with respect to the related literature discussed in this chapter can be summarized as follows:

• Managing Information in Development-related Conversational Channels. We design a taxonomy of high-level categories of sentences in development emails that are useful for accomplishing software maintenance and evolution tasks. Moreover, we devise an approach and make available a tool (named DECA) to automatically classify development email content according to these high-level categories. An empirical comparison of our approach with machine learning classifiers demonstrates that DECA outperforms the results obtained by six different machine learning techniques.

• App Reviews Analysis for Software Engineering. In this context our research efforts focus on two main research directions:

  – Requirements Engineering. We model a taxonomy of high-level categories of sentences that are relevant for the maintenance and evolution of mobile apps and develop a novel approach to classify reviews' sentences according to the categories in this taxonomy. In particular, the approach (i) extracts different kinds of features (i.e., structure, lexicon, and sentiment) from the reviews' text, and (ii) combines the extracted features through machine learning techniques in order to automatically recognize review fragments that could be useful for developers interested in accomplishing
2.3. Main Contributions
software maintenance and evolution tasks. We also make available a tool (called ARdoc) implementing the proposed approach. Empirical results show how particular combinations of the extracted features could lead to higher classification performances. – Summarization. We define URM (User Reviews Model ), a two-level classification model which takes into account both users’ intentions (when giving feedback to developers) and review topics (i.e., the specific aspects of the app discussed in the review). On top of URM we build SURF, an approach which exploits summarization techniques to summarize user reviews and generate an interactive, structured and condensed agenda of recommended software changes. Results of two experiments involving 17 real-word apps and 23 subjects in total demonstrate the practical usefulness of summaries generated by SURF in the developers’ “working context”. A prototype implementation of our reviews summarizer is made available for research purposes.
Chapter 3
Intention Mining in Developers' Discussions

Written development communication (e.g., mailing lists, issue trackers) constitutes a precious source of information to build recommenders for software engineers, for example aimed at suggesting experts, or at re-documenting existing source code. In this chapter we propose a novel, semi-supervised approach named DECA (Development Emails Content Analyzer) that uses Natural Language Parsing to classify the content of development emails according to their purpose (e.g., feature request, opinion asking, problem discovery, solution proposal, information giving, etc.), identifying email elements that can be used for specific tasks. A study based on data from Qt and Ubuntu highlights a high precision (90%) and recall (70%) of DECA in classifying email content, outperforming traditional machine learning strategies. Moreover, we successfully used DECA for re-documenting source code of Eclipse and Lucene, improving the recall, while keeping high precision, of a previous approach based on ad-hoc heuristics.
3.1 Introduction
In many open source and industrial projects, developers make intense use of written communication channels, such as mailing lists, issue trackers and chats [111]. Although voice communication still remains unavoidable [5, 80], such channels ease the communication of developers spread around the world and working around the clock, and allow for keeping track of discussions and of the decisions taken [11, 112]. From a completely different perspective, the information contained in such recorded communication has been exploited by researchers to build recommender systems, for example aimed at performing bug triaging [3], suggesting mentors [17], or providing a description of an existing, undocumented software artifact [110]. However, profitably using information available in development communication is challenging, because of its noisiness and heterogeneity. Firstly, a development email or a post on an issue tracker contains a mix of different kinds of structured, semi-structured, and unstructured information [6]. For example, they may contain source code fragments, logs, stack traces, or natural language paragraphs mixed with some source code snippets, e.g., method signatures. In order to work effectively, recommenders must separate such elements, and this has been achieved by approaches combining machine learning techniques with island parsers [6], or by using other statistical techniques such as Hidden Markov Models [21]. The second issue is that communication posted on issue trackers, mailing lists or forums may have different purposes. For example, an issue report may relate to a feature request, a bug, or just to a project management discussion. Indeed, Herzig et al. [60] and Antoniol et al. [2] found that over 30% of all issue reports are misclassified (i.e., rather than referring to a code fix, they resulted in a new feature, an update of documentation, or an internal refactoring). Hence, relying on such data to build fault prediction or localization approaches might produce incorrect results. Kochhar et al. [76] shed light on the need for additional cleaning steps to be performed on issue reports for improving bug localization tasks. This, for example, may involve a re-classification of issue reports. On a different side, certain recommenders may require mining specific portions of a written communication, for example to identify questions being asked by developers [59] or to mine descriptions about certain methods [110, 126]. Moreover, an email or a discussion is sometimes too long, and developers get lost in unnecessary details. To cope with this issue, previous literature proposed approaches aimed at generating summaries of emails [79, 116] and bug reports [118, 119]. However, none of the aforementioned approaches is able to classify paragraphs contained in developers' communication according to the developers' intent, in order to focus only on paragraphs useful for specific purposes (e.g., fixing bugs, adding new features, improving existing features, etc.).
In order to cope with this issue and answer our RQ1 (stated in Paragraph 1.3), in this chapter we propose an approach named DECA (Development Emails Content Analyzer), based on Intention Mining, which is able to classify email content according to developers' intentions, such as asking/providing help, proposing a new feature or reporting/discussing a bug. The main contributions of this chapter are:
• A taxonomy of high-level categories of sentences, obtained by manually classifying development emails using grounded theory [44].
• DECA, a novel automated approach to classify development email content according to developers' intentions.
• A prototype implementation of the proposed approach.
• Results of an empirical comparison of DECA with machine learning classifiers.
• Last, but not least, as a practical application of DECA, we show how it can be used to mine method descriptions from developers' communication, and how DECA can overcome the limitations of a previously proposed approach [110].
The proposed taxonomy defines a conceptual framework for indexing discussions of different nature, as the inferred categories reflect some of the actual developers' needs in searching information across different channels. In this context, DECA could be very useful for classifying and indexing the content of developer discussions occurring over several communication channels (i.e., issue trackers, IRC chats, on-line forums, etc.). The results of our empirical study on data from Qt and Ubuntu highlight a high precision (90%) and recall (70%) of DECA in classifying email content. Moreover, the proposed approach can be used in a wider application domain, such as the preprocessing phase of various summarization tasks. For example, DECA could be used as a preprocessing support to discard irrelevant sentences within email or bug report summarization approaches.
3.2 Classifying Emails according to Intentions
In this paragraph we describe the approach applied for the automatic classification of development email content. In particular, we defined a taxonomy of sentence categories in order to capture content that is useful for developers. We then extracted a set of linguistic patterns for each category and, for each linguistic pattern, defined a heuristic responsible for recognizing that specific pattern.
3.2.1 Categories Definitions of Development Sentences
We have defined six categories of sentences describing the “intent” of the writer: feature request, opinion asking, problem discovery, solution proposal, information seeking and information giving. Table 3.1 provides a description of each category. These categories are designed to capture the aim of a given email and, consequently, to recognize the kind of information generally contained in messages regarding development concerns.
Table 3.1: Sentence Categories Definition
The categories were identified by a manual inspection of a sample of 100 emails taken from the Qt Project development mailing list. During this task, we manually grouped all the extracted emails according to the categories defined by Guzzi et al. [54], which are: implementation, technical infrastructure, project status, social interactions, usage and discarded. We obtained 5 groups of emails, one for each category, with the exception of discarded. The aim of the taxonomy presented in [54] was to assign topics to discussion threads. Thus, this classification is useful to assign a scope to the entire mail message, but not the intent related to relevant sentences in the email content, which is our purpose. Indeed, we believe a single message may contain relevant sentences of different nature (e.g., the identification of a bug, and a subsequent solution proposal to fix it). Thus, for each group of emails we manually selected and extracted significant sentences that evoke, or suggest, the intent of the writer: e.g., is the writer saying that there is something to be implemented? Is the writer saying that she/he discovered a bug? With the aim of defining a complete set of categories, we relied on a further taxonomy proposed by Guzzi and Begel [55]. This second taxonomy classifies the reasons why developers need to communicate about source code. It includes three categories: (i) coordination, (ii) seeking information, and (iii) courtesy. This second classification is close to the aim of our work, but it is not detailed enough. Through a manual inspection of the extracted sentences, we extended and refined the set of categories of Guzzi and Begel [55] with a new set of categories (detailed in Table 3.1), which better fits the content of development mailing list messages. We discarded the sentences that do not belong to any of the defined categories because they carry negligible information (and/or are too generic) to help during a development task. The categories we defined are intended to capture the intent of each sentence (requesting new features, describing a problem, or proposing a solution) and consequently allow developers to better manage the information contained in emails.
Table 3.2: Sentence types by Guzzi and Begel (left side) and ours (right side)
Table 3.2 compares our classification to the one proposed by Guzzi and Begel [55], highlighting that our classification is more suitable for the software development domain, as it is more detailed and specific. As an example, let's consider three different kinds of sentences: (i) discuss a change (e.g., “A computer icon is required”, “I would like to see a chat included in this release”), (ii) file a bug (e.g., “The server doesn't start”, “I found a problem during the installation process”), and (iii) propose a solution (e.g., “You may have to enable the package”, “What you need to do is to re-install the application”). While in the classification of Guzzi and Begel [55] all of them fall into the same category (coordination), in our model each of them is associated with a different category: Feature Request, Problem Discovery, and Solution Proposal, respectively. We neglected some forms of courtesy (e.g., asking for permission), adding the Information Giving category to model cases in which the developers' intent is to provide useful information to other participants in the discussion (e.g., “I have provided some extra information in the bug report”, “Plan is to make available a new release for this month”). We finally introduced the Opinion Asking class for capturing explicit opinion requests (e.g., “What do you think about creating a single webpage for all the services?”). The sentences belonging to the Opinion Asking class may emphasize discussion elements which could be useful for developers' activities; thus, it appears reasonable to distinguish them from more general information requests (mapped with the Information Seeking category).
3.2.2 Identifying Linguistic Patterns
We assume that when developers write about existing bugs (Problem Discovery) or suggest solutions to solve these bugs (Solution Proposal) within discussions about development issues, they tend to use some recurrent linguistic patterns. For instance, let's consider the example sentence:
“We could use a leaky bucket algorithm to limit the bandwidth”
A human who reads this sentence can easily understand that the writer's intention is to propose a solution to solve some kind of problem. Hence, she/he could conclude, according to our taxonomy, that the sentence belongs to the Solution Proposal category. Observing the syntax of this sentence, we can notice that it presents a well-defined predicate-argument structure:
• the verb “to use” constitutes the principal predicate of the sentence;
• “could” is the auxiliary of the principal predicate;
• “we” represents the subject of the sentence;
• “a leaky bucket algorithm” represents the direct object of the principal predicate;
• “to limit” is a non-finite clausal complement depending on the principal predicate;
• “the bandwidth” is the direct object of the clausal complement.
By exploiting this information, most of the sentences that present a similar predicate-argument structure would indicate a Solution Proposal. Thus, we define a heuristic to detect this particular predicate-argument structure. The formalization of a heuristic requires three steps: (i) discovering the relevant details that make the particular syntactic structure of the sentence recognizable (e.g., the verb “to use” as principal predicate of the sentence and the auxiliary “could”), (ii) generalizing some kinds of information (e.g., the subject does not necessarily have to be “we” and the direct object does not necessarily have to be “leaky bucket algorithm”), and (iii) ignoring useless information (e.g., the clausal complement and its direct object do not provide any useful information for the structure identification). So, we define a general pattern “[someone] could use [something]” (the words in square brackets are placeholders indicating generic subjects, [someone], and generic direct objects, [something]) and associate it with the Solution Proposal class. On the contrary, if we consider the sentence:
“The leaky bucket algorithm fails in limiting the bandwidth”
we can notice that this second sentence has a totally different structure. Indeed, the verb “to fail” constitutes the principal predicate of the sentence and this would rather suggest the description of a problem. We used the Stanford typed dependencies (SD) representation [32] in order to describe a set of heuristics able to recognize similar recurrent linguistic patterns used by developers in an automated way. Each category is then associated with a group of heuristics, and each heuristic is leveraged for the recognition of a specific linguistic pattern. The typed dependency parser represents dependencies between individual words contained in sentences and labels each of them with a specific grammatical relation (such as subject or indirect object) [32, 33]. The SD representation was successfully used in a range of tasks, including Textual Entailment [31] and BioNLP [41],
and in recent years SD structures have also become a de-facto standard for parser evaluation in English [18], [20], [103]. Figure 3.1, through an example, shows the process we applied to define each heuristic. More precisely, in Figure 3.1 we can see how previously discussed concepts on sentences’ predicate-argument structures can be implemented and captured through the Stanford typed dependencies. Firstly, we analyze a sentence containing one of the recurrent linguistic patterns (e.g. “We should add a new button to access to personal contents” in Figure 3.1) and build its SD representation.
Figure 3.1: Natural language parse tree from a Feature Request
In Figure 3.1, the main predicate, “add”, jointly with the auxiliary verb “should”, is a clue for a Feature Request. Obviously this predicate has to be connected to a generic subject (that indicates who makes the request) and to one (or more) generic direct objects (that, along with the predicate, indicate the request object). At this point, we can define the related heuristic “[someone] should add [something]” and associate it with the Feature Request class. Once we have defined the heuristic, a sentence having a similar structure (“add” or synonyms in the role of principal predicate, “should” or synonyms in the role of auxiliary verb, and one or more direct objects that indicate
Table 3.3: Defined heuristics for each Sentence Category
the things a user would add) can be recognized as belonging to the class of Feature Request. Through the process described above, we defined a set of heuristics for each of the defined categories. Table 3.3 shows the number of heuristics implemented for each class.
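To make the mechanics of such a heuristic more concrete, the sketch below shows how a pattern like “[someone] should add [something]” could be checked over the typed dependencies produced by the Stanford CoreNLP toolkit. It is a minimal illustration, not DECA's actual implementation: the annotator configuration and the relation names (“nsubj”, “dobj”/“obj”, “aux”) follow the Stanford Dependencies conventions, while checking the lemmas “add” and “should” simply mirrors the example discussed above (DECA generalizes this by also accepting synonyms of the predicate and of the auxiliary).

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.semgraph.SemanticGraphEdge;
import edu.stanford.nlp.util.CoreMap;
import java.util.Properties;

public class FeatureRequestHeuristicSketch {

    public static void main(String[] args) {
        // Build a CoreNLP pipeline that produces sentences, lemmas and typed dependencies.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, depparse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation("We should add a new button to access to personal contents.");
        pipeline.annotate(doc);

        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            SemanticGraph sd = sentence.get(SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class);
            if (matchesShouldAddPattern(sd)) {
                System.out.println("FEATURE REQUEST: " + sentence);
            }
        }
    }

    // Returns true if the sentence contains the predicate "add" governing a subject,
    // a direct object, and the modal auxiliary "should", i.e., "[someone] should add [something]".
    private static boolean matchesShouldAddPattern(SemanticGraph sd) {
        boolean hasSubject = false, hasObject = false, hasModal = false;
        for (SemanticGraphEdge edge : sd.edgeIterable()) {
            String relation = edge.getRelation().getShortName();
            if (!"add".equals(edge.getGovernor().lemma())) {
                continue;
            }
            if ("nsubj".equals(relation)) {
                hasSubject = true;                  // generic [someone]
            } else if ("dobj".equals(relation) || "obj".equals(relation)) {
                hasObject = true;                   // generic [something]
            } else if ("aux".equals(relation) && "should".equals(edge.getDependent().lemma())) {
                hasModal = true;                    // the modal auxiliary "should"
            }
        }
        return hasSubject && hasObject && hasModal;
    }
}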
3.2.3 The DECA tool
We implemented a Java tool leveraging the proposed approach. In particular, by exploiting the defined heuristics and working on the SD representation of the emails' sentences, the tool highlights the recognized sentences with different colors. We make the tool available online1. DECA classifies development email content according to the six categories describing the “intent” of the writer: feature request, opinion asking, problem discovery, solution proposal, information seeking and information giving (detailed in Table 3.1). DECA exploits the predicate-argument structures related to intentions (see Paragraph 3.2.2) to automatically detect and categorize text fragments that are relevant for developers within email content. Specifically, the tool analyzes messages at the sentence-level granularity because, within a raw mail message, some sentences could be relevant for software development purposes, while others may not be. DECA has two main modules: the Parser and the Classifier. The Parser aims at preparing the text for the analysis. Firstly, it performs sentence splitting and tokenization, relying on the Stanford CoreNLP API [91]. Once the text is divided into sentences, the Parser creates, for each sentence, the Stanford Dependencies (SD) representation [33]. The Stanford Dependencies parser represents dependencies
1 http://www.ifi.uzh.ch/seal/people/panichella/tools/DECA.html
Figure 3.2: Implementation of a NLP heuristic
between individual words contained in sentences and labels each dependency with a specific grammatical relation (e.g., subject or direct/indirect object). Such a representation is exploited by the Classifier to perform its analysis. We identified a set of 231 linguistic patterns related to the sentence categories of Table 3.1. For each pattern, the Classifier implements an NLP heuristic to recognize it. Each NLP heuristic tries to detect the presence of a text structure that may be connected to one of the development categories, looking for the occurrences of specific keywords in precise grammatical roles and/or specific grammatical structures. Figure 3.2 describes how the Classifier performs the classification for an example sentence reported in the Ubuntu development mailing list. The figure depicts the SD representation of the sentence (on the left side) and the implementation of the NLP heuristic (on the right side) able to detect the structure indicating (in the majority of cases) the disclosure of a problem/bug. The NLP heuristic has to analyze only a few typed dependencies (in the example of Figure 3.2, the code checks only the underlined dependencies) to detect the presence of a linguistic pattern. Once a pattern is recognized, the Classifier returns the classification result to the Parser, which adds the result to a collection and provides the SD representation of the next sentence to the Classifier. The Classifier labels only the sentences that present known structures, assuming that all other sentences are too generic or have negligible content. At the end of this process, results are provided as output. We provide two different versions of the tool. The first version of DECA provides a practical GUI, which can be found in the zipped file DECA GUI.zip available from the tool's webpage.
Figure 3.3: DECA’s Interface
Figure 3.4: DECA’s output in presence of code snippets and error logs
Figure 3.5: Using DECA as a Java library
The README.txt contained in the zipped file provides all the information to run the tool's GUI. To analyze discussions or messages, users can paste them in the text area of the GUI or, alternatively, load them from a text file and press the Recognize button. When the recognition process is complete, DECA highlights all recognized sentences using different colors for different categories. Figure 3.3 shows the tool's GUI with an example of output. It is important to point out that DECA classifies exclusively the natural language fragments contained in the messages, since the Classifier can start its elaboration only when the Parser is able to construct the SD representation for the sentence under analysis. Figure 3.4 depicts the tool's behaviour for two examples of text fragments: the first one (on the top) contains a code snippet (for which the Parser is not able to construct the SD representation), while the second fragment (on the bottom) exhibits an error log containing some text in natural language form (recognized by the Parser, which can identify the dependency structure). The second version of DECA is a Java API that provides an easy way to integrate our classifier in Java projects. Figure 3.5 shows an example of Java code that integrates DECA's capabilities. To use it, it is necessary to download the
DECA API.zip from the tool's Web page, unzip it, and import the library DECA API.jar, as well as the Stanford CoreNLP libraries (which can be found in the lib folder contained in DECA API.zip), in the build path of the Java project. Then, to use the DECA classifier, it is sufficient to import the classes org.emailClassifier.Parser and org.emailClassifier.Result, and instantiate the Parser through the method getInstance. The method extract of the class Parser represents the entry point to access the tool's classification. This method accepts as input a String containing the text to classify and returns as output a collection of objects of the Result class. The Result class provides all the methods to access DECA's classification results.
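As a rough illustration of the integration described above (and of the kind of client code shown in Figure 3.5), the following sketch instantiates the Parser and iterates over the returned Result objects. It only assumes the entry points mentioned in the text (getInstance and extract); how each result is inspected is left generic, since the exact accessors exposed by the Result class are documented in the API package.

import org.emailClassifier.Parser;
import org.emailClassifier.Result;
import java.util.Collection;

public class DecaApiExample {

    public static void main(String[] args) {
        // Obtain the DECA parser (singleton access point, as described in the text).
        Parser parser = Parser.getInstance();

        // Classify a small piece of development discussion.
        String text = "The server doesn't start after the last update. "
                    + "We should add a new button to access to personal contents.";
        Collection<Result> results = parser.extract(text);

        // Inspect the classification results; printing the Result objects is used here as a
        // placeholder for the accessors actually provided by the Result class.
        for (Result result : results) {
            System.out.println(result); // e.g., sentence text plus the assigned intention category
        }
    }
}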
3.3 Evaluation: Study Design
The goal of this study is to analyze development email content, with the purpose of investigating the effectiveness of the approach in identifying discussions relevant for developers performing specific maintenance tasks. The perspective is that of researchers interested in identifying relevant recurrent linguistic patterns in the software engineering domain, useful for performing several software engineering tasks.
3.3.1 Research Questions
In order to answer the RQ1 specified in Paragraph 1.3, we derive two research sub-questions:
• RQ1-a: Is the proposed approach effective in classifying writers' intentions in development emails? This research question represents the core part of our study, aimed at developing and evaluating a novel approach for classifying messages that helps retrieve meaningful information from message content.
• RQ1-b: Is the proposed approach more effective than existing ML techniques in classifying development email content? This research question aims at comparing the results achieved by DECA with the results obtained by a set of existing ML techniques previously used in the literature for classifying bug reports. Thus, this research question is aimed at quantifying the benefits obtained by the use of Natural Language Parsing with respect to existing ML techniques.
Table 3.4: Analyzed Projects
3.3.2 Context Selection and Data Extraction
The context of the study consists of mailing lists belonging to two open source projects whose characteristics are summarized in Table 3.4. Specifically, for each project Table 3.4 provides: (i) the name, (ii) the home page, (iii) the period of time considered to collect the emails and (iv) the total number of analyzed emails. The Qt project is a cross-platform application and UI framework used to develop application software that can run on various software and hardware platforms. The development of the Qt framework started in 1991, and Nokia founded the Qt Project on 21 October 2011 with the aim of easing online communication among Qt developers and community members through public forums, mailing lists and wikis. Ubuntu is a Debian-based Linux operating system. Both projects have large development communities and this ensures a high message density (more than 100 messages per month). In order to have as many message types as possible in our study, we selected emails in specific time windows for the two projects. For the Qt Project we selected emails in a period related to a very advanced development stage (the development of the Qt framework started in 1991), in which we expected to find more messages related to information requests and solution proposals. For Ubuntu we selected emails in a period related to a very early development stage (the first release of Ubuntu was issued in October 2004), in which we expected to find more messages related to newly discovered bugs and/or feature requests. Table 3.5 summarizes the samples of emails we randomly selected for our study, reporting for each sample (i) the name of the project, (ii) the number of messages considered in the sample for that project and (iii) the period of time in which the messages of the sample were exchanged. In this dataset, we anonymized the messages and pruned the email metadata (removing, for example, the names of sender and receiver) that was not relevant for our goals. Starting from the sentence categories defined in Paragraph 3.2.1, two PhD students (one of them not involved in this research work) separately analyzed all the messages in the dataset and manually extracted significant sentences, assigning each of them to one of the defined categories. We involved an external evaluator to avoid any
Table 3.5: Development mailing lists samples
bias related to the subjectivity of the classification. Specifically, the classifications performed by the two evaluators largely coincide (in about 97% of the cases they assigned a sentence to the same category). We considered only the sentences that both evaluators assigned to the same category. Table 3.6 shows the sample sizes of the classified sentences for the two projects considered in our study.
Table 3.6: Samples Size of Classified Sentences
It is important to stress that the proportion of the categories of sentences varies across the projects. However, in both projects Opinion Asking and Information Seeking are, respectively, the categories with the lowest and the highest percentage of occurrences compared to the other classes of sentences.
3.3.3 Analysis Method
To answer RQ1-a we defined a sequence of training and test set pairs for a progressive assessment of the results. Thus, we scheduled three experiments.
Experiment I
a. We randomly selected as training set 102 emails among the messages sent in June 2014 to the Qt Project development mailing list. As explained in Paragraph 3.3.2, two humans performed the manual classification of the sentences contained in such email messages according to the categories defined in Paragraph 3.2.1. Then, we manually detected the recurring linguistic patterns found in this set of messages for the defined categories. Through the process explained in Paragraph 3.2.2 (Figure 3.1), we defined and implemented 87 heuristics for automatically classifying (i.e., recognizing the patterns of) the sentences contained in the training set.
b. We randomly selected as test set 100 emails sent in May 2014 regarding the Qt Project. Also in this case, two people performed a manual classification of the contents of these messages according to the defined categories. Specifically, only sentences evaluated as belonging to the same category were selected.
c. Relying on the 87 defined heuristics, we used our tool to automatically classify the sentences of the emails in the test set. We compared the tool outputs with the human-generated oracle and computed (i) true positives (TP) as the number of sentences correctly classified by the tool, (ii) false positives (FP) as the number of sentences incorrectly labeled as belonging to a given class, and (iii) false negatives (FN) as the number of items which were not assigned to any sentence category but belonged to one of them. Thus, we evaluated the tool performance relying on the widely adopted Information Retrieval metrics of Precision, Recall and F-Measure (their standard definitions are recalled after the description of the three experiments).
Experiment II
a. To improve the effectiveness of DECA's classification, we used the set of sentences classified as false negatives in Experiment I as a gold set for defining new heuristics able to capture such sentences. Specifically, 82 new heuristics were identified, formalized and implemented in order to detect the sentences not identified in Experiment I. Thus, our heuristics set increased from 87 to 169.
b. We prepared a new test set to verify whether the augmented heuristics set allowed us to obtain better results. 100 emails were randomly selected from messages sent in the months of March, April, July, August and September of the year 2014 to the Qt Project mailing list. Following the same approach previously discussed, two human judges contributed to constitute the oracle for this experiment.
c. We executed DECA on the email sentences of this new test set and compared the results with the human-generated oracle.
Experiment III
a. To further improve the effectiveness of the classification performed by DECA, we again used the false negatives found in Experiment II as a new set for identifying new recurrent patterns. In this way, 62 new heuristics were implemented, giving a total number of 231.
b. To evaluate the potential usefulness of the new set of heuristics, we created a third test set by randomly selecting 100 emails sent from September 2004 to January 2005 and related to the Ubuntu distribution. Two human judges created the oracle table for this test set, according to the previously explained process.
c. We executed DECA on the email messages of this new test set and compared the results with the human-generated oracle.
It is worth noting that the two evaluators are the same in all three experiments.
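For completeness, these are the standard Information Retrieval definitions, computed per category from the TP, FP and FN counts introduced in Experiment I and used in the usual way throughout this evaluation:

Precision = \frac{TP}{TP + FP}, \quad Recall = \frac{TP}{TP + FN}, \quad F\text{-}Measure = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}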
3.3.4 An Approach Based on ML for Email Content Classification
This paragraph discusses the methodology we used to train machine learning techniques to classify the email content (RQ1-b). Specifically, the work by Antoniol et al. [2] exploited conventional text mining and machine learning techniques to automatically classify bug reports. They used the terms contained in bug reports as features (fields) of machine learning models to discern bugs from other issues. The work by Zhou et al. [142] extended the work of Antoniol et al. by considering structural information as additional features, improving the ML prediction accuracy. We implemented an approach, similar to the one used by Antoniol et al., to classify sentences contained in mailing list data using as features the terms contained in the sentences themselves. Formally, given a training set of mailing list sentences T1 and a test set of mailing list sentences T2, we automatically classify the email content contained in T2 by performing the following steps:
1. Text Features: the first step uses all sentences contained in T1 and T2 as base information to build a textual corpus (indexing the text).
In particular, we preprocessed the textual content applying stop-word removal and stemming [38] (similarly to the work of Zhou et al. [142]) to reduce the number of features for the ML techniques. The output of this phase is a Term-by-Document matrix M where each column represents a sentence and each row represents a term contained in the given sentence. Thus, each entry M[i,j] of the matrix represents the weight (or importance) of the i-th term contained in the j-th sentence. Similarly to the work of Antoniol et al. [2], we weighted words using the tf (term frequency), which weights each word i in a document j as
tf_{i,j} = \frac{rf_{i,j}}{\sum_{k=1}^{m} rf_{k,j}},
where rf_{i,j} is the raw frequency (number of occurrences) of word i in document j. We used the tf (term frequency) instead of tf-idf indexing, because the use of the inverse document frequency (idf) penalizes too much the terms appearing in too many documents [38]. In our work, we are not interested in penalizing such terms (e.g., “fix”, “problem”, or “feature”) that actually appear in many documents, because they may constitute interesting features that guide ML techniques in classifying development sentences.
2. Split training and test features: the second step splits the matrix M (the output of the previous step) in two sub-matrices, Mtraining and Mtest. Specifically, Mtraining and Mtest represent the matrix that contains the sentences (i.e., the corresponding columns in M) of T1 and the matrix that contains the sentences (i.e., the corresponding columns in M) of T2, respectively.
3. Oracle building: this step aims at building the oracle to allow ML techniques to learn from Mtraining and predict on Mtest. Thus, in this stage, we manually classified the sentences in T1 and T2, assigning each of them to one of the categories defined in Paragraph 3.2.1 (as described in Paragraph 3.3.3, two human evaluators performed this classification). We added the value of the classification as a further column in both Mtraining and Mtest. The machine learning techniques use this classification column “C” during the training phase for learning the model.
4. Classification: this step aims at automatically classifying sentences relying on the output data obtained from the previous step (Mtraining and Mtest with classified sentences). The automatic classification of sentences is performed using the Weka tool [134], experimenting with eight different machine learning techniques, namely the standard probabilistic naive Bayes classifier, Logistic Regression, Simple Logistic, J48, the alternating decision tree (ADTree), Random Forest, FT and NNge.
Table 3.7: Results for Experiment I
Table 3.8: Results for Experiment II
The choice of these techniques is not random: we selected them since they were successfully used for bug report classification [2, 142] (i.e., ADTree, Logistic Regression) and for defect prediction in many previous works [9, 12, 22, 87, 143], thus increasing the generalisability of our findings. It is important to specify that the generic training set T1 and the generic test set T2 correspond to the training and test set pairs discussed in Paragraph 3.3.3.
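To give an idea of how this baseline can be run in practice, the sketch below trains and evaluates one of the above classifiers with the Weka API on a pre-built training/test split. It is only an illustrative setup under the assumption that the Term-by-Document matrices (with the class column “C”) have already been exported to ARFF files; the file names are placeholders and the J48 classifier stands in for any of the eight techniques listed above.

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MlBaselineSketch {

    public static void main(String[] args) throws Exception {
        // Placeholder ARFF files holding the tf-weighted Term-by-Document matrices
        // plus the manually assigned category column "C" (used as the class attribute).
        Instances train = new DataSource("qt_train_sentences.arff").getDataSet();
        Instances test = new DataSource("qt_test_sentences.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // Train one of the eight classifiers (J48 here; swap in NaiveBayes, RandomForest, etc.).
        J48 classifier = new J48();
        classifier.buildClassifier(train);

        // Evaluate on the held-out test set and report per-class precision, recall and F-Measure.
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(classifier, test);
        for (int c = 0; c < train.numClasses(); c++) {
            System.out.printf("%s: P=%.3f R=%.3f F=%.3f%n",
                    train.classAttribute().value(c),
                    eval.precision(c), eval.recall(c), eval.fMeasure(c));
        }
    }
}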
3.4 Evaluation: Analysis of Results
3.4.1 RQ1-a: Is the proposed approach effective in classifying writers' intentions in development emails?
Tables 3.7, 3.8 and 3.9 report the results achieved by DECA in classifying development email content. In particular, these tables show the number of (i) true positives, (ii) false negatives and (iii) false positives, as well as the (iv) precision, (v) recall and (vi) F-Measure achieved for each defined class in the three experiments, respectively. In general,
Table 3.9: Results for Experiment III
the results of the classification performed by DECA are rather positive and the addition of new heuristics improves the effectiveness of the approach across the various experiments. Specifically, while the precision is always very high (ranging between 87% and 90%) and stable for all the experiments, the recall increases with the addition of new heuristics from 34% to 70% (i.e., it roughly doubles). This is also reflected by the increase of the F-Measure over the three experiments: it varies from an initial value of 49% (Experiment I) to 79% (Experiment III). Furthermore, the data shows that DECA works well for all the categories of sentences and all the experiments. The only exception is in Experiment I for the Opinion Asking category, where recall and precision are equal to zero. However, in general for Experiment I, precision ranges from 79.2%, obtained for Solution Proposal, to 100%, achieved for Feature Request, whereas recall ranges from 23.2% for the Information Seeking category to 50% achieved for Problem Discovery. In Experiment II, precision ranges from 75%, obtained for Solution Proposal, to 100%, achieved for the Information Seeking and Opinion Asking categories, whereas recall ranges from 42.9%, obtained for Solution Proposal, to 69.6%, achieved for Problem Discovery. In Experiment III, precision ranges from 73.9%, obtained for Solution Proposal, to 100%, achieved for the Opinion Asking category, whereas recall ranges from 50%, obtained for Solution Proposal, to 85.2%, achieved for Feature Request. It is important to note that we achieved the best results in terms of recall when classifying Problem Discoveries in Experiments I and II. This indicates that developers very often rely on common/recurrent patterns, successfully detected by DECA, when their intent is to communicate a bug or a problem. On the other hand, we achieved the worst results in detecting Solution Proposals in all three experiments, with a precision that gradually (and slightly) deteriorated (from 79.2% in Experiment I to 73.9% in Experiment III) and a recall that gradually increased but
Table 3.10: Examples of correctly classified sentences by DECA
never exceeded 50%. This suggests that developers use many different ways of proposing solutions, making it hard to identify common patterns to detect them. Table 3.10 shows some examples of sentences from the Ubuntu development mailing list correctly classified by DECA.
Summary RQ1-a: the automatic classification performed by DECA achieves very good results in terms of precision, recall and F-Measure (over all the experiments). The results tend to improve when adding new heuristics. DECA achieved the best F-Measure values for Problem Discovery sentences and the worst F-Measure results for Solution Proposal sentences.
3.4.2 RQ1-b: Is the proposed approach more effective than existing ML techniques in classifying development email content?
As discussed in Paragraph 3.3.4, this research question aims at comparing the performance of DECA with that of a set of machine learning techniques. For lack of space, we report in this paragraph only the results of the ML models that obtained the best performance in classifying development content. Specifically, in order to get a more complete picture, we selected a set of techniques belonging to different ML categories: regression functions (i.e., Logistic Regression, Simple Logistic), decision trees (i.e., J48, FT, Random Forest), and rule-based models (i.e., NNge). The comparison of results for Experiment I highlights that DECA achieves the best global results in terms of both precision (see Figure 3.6) and recall (see Figure 3.7). The precision of DECA was worse than that of the J48 technique only when identifying
Figure 3.6: Compared Precision for Experiment I
Figure 3.7: Compared Recall for Experiment I
Figure 3.8: Compared Precision for Experiment II
Figure 3.9: Compared Recall for Experiment II
Figure 3.10: Compared Precision for Experiment III
Figure 3.11: Compared Recall for Experiment III
Solution Proposal (79% for DECA versus 100% for J48). However, for the Solution Proposal category DECA achieved a recall of 26.8%, while J48 reached a recall of only 1.4% (see Figure 3.7). Focusing the attention on recall, our approach was worse than the RandomForest technique in detecting Feature Request (41.1% versus 56.2%, see Figure 3.7); on the other hand, for the Feature Request class DECA achieved a precision of 100%, while RandomForest obtained a precision of only 23% (see Figure 3.6). Furthermore, the recall of our approach was better only than that of the RandomForest technique in identifying Information Seeking sentences. However, the results of DECA are much better than those of all the other techniques in terms of precision (around 94%, as shown in Figure 3.6). Finally, in Experiment I, the recall achieved by DECA in detecting Information Giving sentences was comparable to that of the RandomForest technique (23.2% versus 25.5%). Also in this case DECA achieved a better precision (81.3% versus 20% for RandomForest). In both Experiment II and Experiment III DECA achieved the best global results in terms of both Precision and Recall. We discuss in detail the results of the comparison in Experiment III (Figure 3.10 and Figure 3.11). Specifically, in Experiment III DECA outperforms all the ML techniques in terms of recall, precision and F-Measure. What is interesting to highlight is that our approach was the only technique able to recognize Opinion Asking sentences in Experiment III (the same happened in Experiment II), with substantially high precision and recall (100% precision and 75% recall). DECA obtained the best F-Measure values for all the defined sentence categories in all three experiments. To evaluate the performance of the proposed approach we repeated the run of each experiment 100 times. The results of the ML classifiers and DECA were quite stable and statistically equal to the results shown in the study. In Experiment I DECA achieved an average F-Measure of 31%, which is better than the F-Measure achieved by all the other considered techniques. In Experiment III DECA obtained an average F-Measure of 58% (49% in Experiment II), in this case too higher than the F-Measure of all ML techniques. Moreover, while the results of DECA improve in Experiment II and Experiment III, this does not happen for all the considered ML techniques. Indeed, their performance is quite stable across all the experiments, even though the training set grows with the number of experiments, with precision and recall never exceeding 38.3% and 26%, respectively.
Summary RQ1-b: DECA outperforms traditional ML techniques in terms of recall, precision and F-Measure when classifying email content. Moreover, while the results of DECA improve in Experiment II and Experiment III, this does not happen for all the considered ML techniques.
3.5 DECA in a real-life application: Code Re-documentation
In this paragraph we show how DECA can be used for a specific application, namely mining source code documentation from developers' communication. Specifically, previous work by Panichella et al. [110] proposed an approach, based on vector space models and ad-hoc heuristics, able to automatically extract, with high precision (up to 79%), method descriptions from the developer communications (bug tracking systems and mailing lists) of two open source systems, namely Lucene and Eclipse. The limitation of such an approach is that it tends to discard a high number of potentially useful method descriptions. Indeed, the approach discarded around 33% and 22% of useful paragraphs for Lucene and Eclipse, respectively. However, the authors pointed out that several discourse patterns characterize false negative method descriptions. We argue that DECA can be successfully used to overcome such limitations by capturing some of the discourse patterns contained in false negative method descriptions. Thus, we applied our intention-based approach to all the 200 paragraphs (100 for each project) validated in [110] to provide a preliminary evaluation of how many (useful) paragraphs (i.e., false negatives) could be recovered by using our approach. Specifically, we considered as valid method descriptions the paragraphs in which DECA detected sentences belonging to the Feature Request, Problem Discovery, Information Seeking or Information Giving categories (according to the taxonomy defined in Paragraph 3.2.1), because they are more likely to contain information about the behaviour of a Java method.
Table 3.11: Number of paragraphs recognized by DECA
Table 3.11 reports, for each project: (i) the number of analyzed paragraphs, (ii) the number of paragraphs previously labeled as false negatives (FN), (iii) the number of FN recovered by DECA, and (iv) the number of false positives (FP) generated by DECA.
Table 3.12: Examples of paragraphs recognized by DECA
For Eclipse, DECA was able to recover about 64% of the paragraphs previously labeled as false negatives, while for Apache Lucene DECA recovered about 79% of them. Moreover, DECA generates a reduced set of false positives for both projects, achieving a precision of 74% for Eclipse and 70% for Lucene. These results demonstrate how DECA can significantly improve the recall of the previous approach, even if we obtain a slight degradation of the performance in terms of precision. Table 3.12 shows some examples of paragraphs correctly detected by DECA that were marked as false negatives in the previous work by Panichella et al. [110]. This happens because the previous approach assigned a score to paragraphs as candidate method documentation if they contained certain keywords, such as “return”, “override”, “invoke”, etc. However, there can be valid method descriptions that do not contain any of such keywords. As a consequence, these paragraphs were discarded by the approach of Panichella et al. [110]. Instead, as can be noticed from the examples reported in Table 3.12, DECA is able to identify at least 70% of them. For example, in the case of Eclipse, the paragraph referring to the build method of the JavaBuilder class contains the sentence “JavaBuilder.build() triggers a PDE model change...”. DECA successfully recovers such a paragraph and correctly assigns this method description to the Information Giving category. A similar situation occurs for the other recovered
paragraphs. Thus, the intention mining performed by DECA could improve the recall (and precision) of a previous approach that mines source code documentation from developer communication means. Specifically, this result shows that DECA really overcomes the limitations of traditional lexicon-based approaches (e.g., LDA) that are not able to capture the discourse patterns contained in paragraphs useful for code re-documentation.
3.6 Threats to Validity
Threats to internal validity concern any confounding factor that could influence our results. This kind of threat can be due to a possible level of subjectivity caused by the manual classification of entities. To reduce this kind of threat we attempted to avoid any bias in the building of oracles, by keeping one of the human judges who contributed to define our oracle tables unaware of all defined and implemented patterns. Moreover, we built the oracle tables on the basis of predictions that two human judges separately made. Only the predictions which both judges agreed upon formed our oracles for the experiments. Another threat to internal validity could regard the particular ML classification algorithm used as a baseline for estimating our results, as the results could be dependent on the specific technique employed. To mitigate this threat we used different ML algorithms and compared the results achieved through our approach with the results obtained through each of them. Threats to external validity concern the generalization of our findings. In our experiments, we used a subset of messages in the original mailing lists. This factor may be a threat to the external validity, as the experimental results may be applicable only to the selected messages and not to the entire mailing lists. To reduce this threat, we tried to use as many messages as possible in our experiments. For the same reason we prepared different test sets in which we tried to select both messages related to different periods of the same year and messages related to different development stages. Messages posted in the same month often belong to the same discussion threads and often include quotations of the messages to which they reply. To avoid analyzing the same sentences more than once, for Experiments II and III we randomly selected test sets containing messages posted in time windows of five months. Another threat to the external validity is represented by the mailing lists used in our experiments. It is possible that some particular characteristics of the mailing lists we selected led to our experimental results. To reduce this threat we used
two different existing development mailing lists (as discussed in Paragraph 3.3.2). Moreover, we further experimented with our approach on text fragments belonging to the mailing lists of two other open source projects (Eclipse and Lucene).
Chapter 4
Classifying App Reviews for Software Maintenance and Evolution

App Stores, such as Google Play or the Apple Store, allow users to provide feedback on apps by posting review comments and giving star ratings. These platforms constitute a useful electronic means through which application developers and users can productively exchange information about apps. Previous research showed that user feedback contains usage scenarios, bug reports and feature requests that can help app developers accomplish software maintenance and evolution tasks. However, in the case of the most popular apps, the large amount of received feedback, its unstructured nature and varying quality can make the identification of useful user feedback a very challenging task. In this chapter we present a taxonomy to classify app reviews into categories relevant to software maintenance and evolution, as well as an approach that merges three techniques: (1) Natural Language Processing, (2) Text Analysis and (3) Sentiment Analysis to automatically classify app reviews into the proposed categories. We show that the combined use of these techniques allows achieving better results than those obtained using each technique individually. We also describe ARdoc, a prototypical implementation of the approach.
4.1 Introduction
App stores are digital distribution platforms that allow users to download and rate mobile apps. Notable distribution platforms for mobile devices include the Apple and Android app stores, in which users can comment and write reviews of the mobile apps they are using. These reviews serve as a communication channel between developers and users, where users can provide relevant information to guide app developers in accomplishing several software maintenance and evolution tasks, such as the implementation of new features, bug fixing, or the improvement of existing features or functionalities. App developers spend considerable effort in collecting and exploiting user feedback to improve user satisfaction. Previous work [24, 42, 105] has shown that approximately one third of the information contained in user reviews is helpful for developers. However, processing, analyzing and selecting useful user feedback presents several challenges. First of all, app stores include a substantial body of reviews, which requires a large amount of effort to manually analyze and process. An empirical study by Pagano and Maalej [105] found that mobile apps received approximately 23 reviews per day and that popular apps, such as Facebook, received on average 4,275 reviews per day. Additionally, users usually provide their feedback in the form of unstructured text that is difficult to parse and analyze. Thus, developers and analysts have to read a large amount of textual data to become aware of the comments and needs of their users [24]. In addition, the quality of reviews varies greatly, from useful reviews providing ideas for improvement or describing specific issues to generic praises and complaints (e.g., “You have to be stupid to program this app”, “I love it!”, “this app is useless”). To handle this problem, Chen et al. [24] proposed AR-Miner, an approach to help app developers discover the most informative user reviews. Specifically, the authors use: (i) text analysis and machine learning to filter out non-informative reviews and (ii) topic analysis to recognize topics treated in the reviews classified as informative. We argue that text content represents just one of the possible dimensions that can be explored to detect informative reviews from a software maintenance and evolution perspective. In particular, topic analysis techniques are useful to discover topics treated in the review texts, but they are not able to reveal the authors' intentions (i.e., the writers' goals) for reviews containing specific topics. We conjecture that a deep analysis of the sentence structure in user reviews can be exploited to determine the intention of a given review. In addition, the sentiment expressed in user reviews can also be exploited to distinguish different kinds of informative reviews. In this chapter we investigate whether the (i) structure, (ii) sentiment and (iii) text
features contained in user reviews can be used to classify and select the user reviews that are helpful for developers to maintain and evolve their apps. Thus, we propose an approach that combines Natural Language Processing (NLP), Sentiment Analysis (SA) and Text Analysis (TA) techniques for the extraction of information present in user reviews that is relevant to the maintenance and evolution of mobile apps. Furthermore, we use machine learning (ML) to combine the three techniques and show, through a quantitative evaluation, that the combination of the three techniques outperforms each individual technique. In addition, we illustrate ARdoc, a tool for automatically extracting and classifying reviews according to the proposed taxonomy. The main contributions of this chapter are as follows:
• A high-level taxonomy of categories of sentences contained in app user reviews that are relevant for the maintenance and evolution of mobile apps.
• A novel approach to extract users' intentions expressed in app store reviews based on Natural Language Processing.
• An empirical study that investigates to what extent NLP, SA and TA features help to detect app store reviews relevant for the maintenance and evolution of mobile apps.
• ARdoc, a tool able to extract structure, sentiment and text features from app reviews and use this information for automatically classifying reviews that are useful from a software maintenance and evolution perspective.
• An end-to-end evaluation of ARdoc involving professional mobile developers of three real-life applications.
4.2 Approach
Figure 4.1 depicts the approach applied for the automated classification of app review content. Specifically, our approach consisted of four steps:
1. Taxonomy for Software Maintenance and Evolution: we manually analyzed user reviews of seven Apple Store and Google Play apps and rigorously deduced a taxonomy of the reviews containing useful content for software maintenance and evolution. The output of this phase consisted of a taxonomy of user review categories that can guide developers in selecting the reviews most useful for a specific maintenance task (e.g., bug fixing, feature adding, etc.).
Figure 4.1: Overview Research Approach
2. Feature Extraction: the goal of this step was to extract a set of meaningful features from user review data to train ML techniques and automatically label app review content according to the taxonomy deduced in the first step. Thus, we designed three different techniques based on (i) Text Analysis, (ii) Natural Language Processing and (iii) Sentiment Analysis, which analyze the content of app reviews and extract features for the learning phase of our approach. The output of this phase was represented by a set of NLP, TA and SA features (a sketch of one possible feature extractor is shown after this list).
3. Learning Classifiers: in this step we used the NLP, TA and SA features extracted in the previous phase of the approach to train ML techniques and classify app reviews according to the taxonomy deduced in the first step. Moreover, we also experimented with different combinations of NLP, TA and SA features to train ML approaches.
4. Evaluation: in this step we evaluated the performance of the ML techniques experimented with in the previous step, relying on widely adopted metrics for machine learning evaluation.
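As a purely illustrative example of what a feature extractor in step 2 might look like, the sketch below derives a sentence-level sentiment score with the Stanford CoreNLP sentiment annotator. This is an assumption made for illustration only: the concrete SA (and NLP/TA) components actually adopted by our approach are presented later in this chapter and may differ from the annotator used here.

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.util.CoreMap;
import java.util.Properties;

public class SentimentFeatureSketch {

    public static void main(String[] args) {
        // The sentiment annotator works on the trees produced by the parse annotator.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, parse, sentiment");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation review = new Annotation("I love this app, but it crashes every time I open a video.");
        pipeline.annotate(review);

        for (CoreMap sentence : review.get(CoreAnnotations.SentencesAnnotation.class)) {
            // Predicted sentiment class on a 0 (very negative) to 4 (very positive) scale:
            // this integer could be used as the SA feature of the sentence.
            Tree sentimentTree = sentence.get(SentimentCoreAnnotations.SentimentAnnotatedTree.class);
            int sentimentScore = RNNCoreAnnotations.getPredictedClass(sentimentTree);
            System.out.println(sentimentScore + "\t" + sentence);
        }
    }
}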
4.2.1 Taxonomy for Software Maintenance and Evolution
The goal of this first step is to deduce a taxonomy of user review categories that is relevant to software maintenance and evolution. To achieve this objective, we analyse user review data at the sentence-level granularity because, within a raw user review, some sentences are relevant to software evolution and maintenance, while others are not.
Table 4.1: Mapping of topics to the identified categories of sentences
We argue that the definition of such a taxonomy requires an understanding of which kinds of feedback developers look for in user reviews. Developers usually exchange messages on development communication channels, such as mailing lists and issue trackers, to plan and discuss maintenance and evolution tasks. Therefore, we conjecture that the kinds of discussions occurring in such communication channels can guide us in defining a taxonomy of sentence categories that developers perceive as important for software maintenance and evolution. For this reason, we performed a systematic mapping between the relevant categories of sentences reported in mailing list messages (presented in Paragraph 3.2.1) and a previously defined taxonomy of contents generally present in user reviews [105]. Table 4.1 reports the mapping between the initial set of categories (see Table 3.1) and the taxonomy proposed by Pagano and Maalej [105], which describes a set of 17 common topics present in app reviews. Additionally, we evaluated the relevance of each of the topics proposed by Pagano and Maalej for developers performing software evolution and maintenance tasks.
We noticed that some of the categories identified in the context of development mailing lists (see Table 3.1) were irrelevant in the domain of app user reviews (see Table 4.1). More precisely, as happens in developers’ discussions, user reviews also frequently contain requests for new features (i.e., Feature Request in Table 3.1) and clarification questions on how to use specific features of a given app (i.e., Information Seeking in Table 3.1). However, differently from discussions occurring in development mailing lists, users of mobile apps rarely ask other users to explicitly express opinions about technical software aspects related to the app (i.e., Opinion Asking in Table 3.1). Moreover, in app stores, users very often report bugs (i.e., Problem Discovery in Table 3.1), but, differently from development mailing lists, they rarely propose a solution to fix them (i.e., Solution Proposal in Table 3.1). The results of the systematic mapping highlight that eight of the topics reported in the taxonomy of Pagano and Maalej [105] were relevant for developers. Table 4.1 shows (i) the categories of topics proposed by Pagano and Maalej, (ii) their relevance for software maintenance and evolution tasks and (iii) the mapping to the sentence categories in the initial taxonomy presented in Table 3.1. These topics match four of the six categories of sentences we identified in the context of development mailing lists:
• Information Giving: sentences that inform or update users or developers about an aspect related to the app.
• Information Seeking: sentences related to attempts to obtain information or help from other users or developers.
• Feature Request: sentences expressing ideas, suggestions or needs for improving or enhancing the app or its functionalities.
• Problem Discovery: sentences describing issues with the app or unexpected behaviours.
We consider such categories as the base categories in our taxonomy; thus, they represent the output of this first phase of our research approach.
4.2.2 Text Analysis
This paragraph discusses the approach we used to extract textual features from app reviews. Specifically, it consists of two steps:
1. Preprocessing: all terms contained in our set of user reviews are used as an information base to build a textual corpus, which is preprocessed applying stop-word removal (using the standard English stop-word list) and stemming (English Snowball Stemmer) to reduce the number of text features for the ML techniques. The output of this phase corresponds to a Term-by-Document matrix M where each column represents a sentence and each row represents a term contained in the given sentence. Thus, each entry M[i,j] of the matrix represents the weight (or importance) of the i-th term contained in the j-th sentence.
2. Textual Feature Weighting: words are weighted using the tf (term frequency), which weights each word i in a review j as:

\[ tf_{i,j} = \frac{rf_{i,j}}{\sum_{k=1}^{m} rf_{k,j}} \]

where $rf_{i,j}$ is the raw frequency (number of occurrences) of word i in review j. We used the tf (term frequency) instead of tf-idf indexing because the use of the inverse document frequency (idf) penalises too heavily terms appearing in many reviews [38]. In our work, we are not interested in penalising such terms (e.g., “fix”, “problem”, or “feature”) that actually appear in many reviews, because they may constitute interesting features that guide ML techniques in classifying sentences containing useful feedback from the software maintenance and evolution perspective. The weighted matrix M represents the output of this phase and the input for the ML strategies described in Paragraph 4.2.5.
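To make the weighting step concrete, the following minimal Java sketch (an illustration, not the implementation actually used in our pipeline; class and method names are chosen only for this example) builds the Term-by-Document matrix for a list of already preprocessed sentences and normalizes the raw counts into tf weights according to the formula above.

```java
import java.util.*;

public class TfMatrixSketch {

    // Builds a term-by-sentence tf matrix: rows are terms, columns are sentences.
    // Input sentences are assumed to be already stop-word-filtered and stemmed.
    public static Map<String, double[]> buildTfMatrix(List<List<String>> sentences) {
        // Collect the vocabulary (the rows of the matrix M).
        Set<String> vocabulary = new TreeSet<>();
        for (List<String> sentence : sentences) {
            vocabulary.addAll(sentence);
        }

        Map<String, double[]> matrix = new LinkedHashMap<>();
        for (String term : vocabulary) {
            matrix.put(term, new double[sentences.size()]);
        }

        for (int j = 0; j < sentences.size(); j++) {
            int totalTerms = sentences.get(j).size(); // sum over k of rf_{k,j}
            // Raw frequencies rf_{i,j}: occurrences of term i in sentence j.
            for (String term : sentences.get(j)) {
                matrix.get(term)[j] += 1.0;
            }
            // Normalize: tf_{i,j} = rf_{i,j} / sum_k rf_{k,j}.
            if (totalTerms > 0) {
                for (double[] row : matrix.values()) {
                    row[j] /= totalTerms;
                }
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        List<List<String>> sentences = Arrays.asList(
                Arrays.asList("pleas", "fix", "crash"),
                Arrays.asList("add", "new", "button", "pleas"));
        buildTfMatrix(sentences).forEach(
                (term, weights) -> System.out.println(term + " -> " + Arrays.toString(weights)));
    }
}
```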
4.2.3 Natural Language Processing
Similarly to what happens in developers’ discussions (see Paragraph 3.2.2), we observe that when users write app reviews (e.g., to report bugs or propose new features) they also tend to use recurrent linguistic patterns. For instance, let’s consider the sentence “You should add a new button”. A developer who reads this sentence can easily understand that the writer’s intention is to make a feature request. Observing the sentence syntax, we can notice that the sentence presents a well-defined predicate-argument structure:
• “add” constitutes the principal predicate of the sentence
• “you” represents the subject of the sentence
• “button” represents the direct object of the predicate
• “new” represents an attribute of the direct object
• “should” is the auxiliary of the principal predicate
Table 4.2: Examples of defined NLP heuristics
We argue that this sentence matches a recurrent linguistic pattern that can be exploited for the recognition of sentences belonging to the feature request category of the taxonomy presented in Paragraph 4.2.1. Our conjecture is that this and similar patterns are intrinsically related to the intentions of the users who wrote the text. Furthermore, we believe that user intentions relevant for our purposes can be mapped to the categories defined in our taxonomy. Therefore, recurrent linguistic patterns can be exploited to recognize sentences of the other categories belonging to our taxonomy. Through a manual inspection of 500 reviews (different from the reviews described in Paragraph 4.3) from different kinds of apps (games, communication, productivity, photography, etc.) we identified 246 recurrent linguistic patterns (the complete list is available at http://www.ifi.uzh.ch/seal/people/panichella/Appendix.pdf). For each identified linguistic pattern we formalized and implemented an NLP heuristic to automatically recognize it. For instance, for the previous example we defined the general NLP heuristic “[someone] should add [something]”. The implementation of an NLP heuristic enables the automatic detection of a sentence which matches a specific structure (e.g., “add” or a synonym as principal predicate, “should” in the auxiliary role of the
principal predicate, a generic subject indicating who makes the request and a generic direct object indicating the request object). Table 4.2 shows several examples of NLP heuristics. To automatically detect sentences matching our defined NLP heuristics we again relied on the Stanford Typed Dependencies (STD) parser [33], which is able to represent dependencies between individual words contained in sentences and to label each of them with a specific grammatical relation. In this step, the NLP parser we implemented assigns each sentence in the input to its corresponding NLP heuristic. If the sentence structure does not match any of the defined NLP heuristics, the NLP parser simply labels the sentence as others. The output of this step is a mapping between each sentence contained in a review and its corresponding NLP heuristic. We then use the NLP heuristic extracted for each sentence to train different ML techniques, as explained in Paragraph 4.2.5.
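For illustration, the sketch below shows how a heuristic of the “[someone] should add [something]” kind could be checked on top of the Stanford dependency representation. It is a simplified example and not ARdoc’s actual implementation: the single hard-coded predicate, the handling of both the older “dobj” and the newer “obj” relation labels, and the class name are choices made only for this sketch.

```java
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.semgraph.SemanticGraphEdge;
import edu.stanford.nlp.util.CoreMap;

import java.util.Properties;

public class FeatureRequestHeuristicSketch {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,depparse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation("You should add a new button.");
        pipeline.annotate(doc);

        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            SemanticGraph deps =
                    sentence.get(SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class);

            boolean hasAuxShould = false;      // "should" as auxiliary of the main predicate
            boolean hasAddWithObject = false;  // "add" governing a direct object

            for (SemanticGraphEdge edge : deps.edgeIterable()) {
                String relation = edge.getRelation().getShortName();
                String governorLemma = edge.getGovernor().lemma();
                if ("aux".equals(relation) && "add".equals(governorLemma)
                        && "should".equalsIgnoreCase(edge.getDependent().word())) {
                    hasAuxShould = true;
                }
                if (("dobj".equals(relation) || "obj".equals(relation))
                        && "add".equals(governorLemma)) {
                    hasAddWithObject = true;
                }
            }

            String label = (hasAuxShould && hasAddWithObject) ? "Feature Request" : "others";
            System.out.println(sentence + " -> " + label);
        }
    }
}
```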
4.2.4 Sentiment Analysis
Sentiment analysis is the process of assigning a quantitative value to a piece of text expressing an affect or mood [78]. We consider sentiment analysis as a text classification task which assigns each given sentence in a user review to one corresponding class. For our purpose, the classes are defined as three different levels of sentiment intensity: positive, negative and neutral. In our approach we use Naive Bayes for predicting the sentiment in the user reviews. Previous work [109] found that Naive Bayes performed better than other machine learning algorithms traditionally used for text classification when analyzing the sentiment in movie reviews.

Table 4.3: Examples of sentiments assigned to different user review sentences.
User Review Sentence | Sentiment Score
“Nice app to post pics” | 1
“Just installed this, I’m really enjoying this.” | 1
“Please fix this.” | 0
“Please do a better job getting spammers off your site.” | 0
“This new update is always crashing.” | -1
“It’s annoying now that u have to click on that pin and then you can hit the like button.” | -1
For our sentiment analysis task we performed the same preprocessing steps performed in the TA technique (stop-word removal, stemming and transformation to a Term-by-Document matrix). Additionally, we performed a selection of the words considered to be most important for determining sentiment according to the Chi-squared (χ2) metric. We trained our classifier with a set of 2090 App Store and Google Play
review sentences which were randomly selected from the dataset described in Paragraph 4.3.2. The sentences were manually labeled by two annotators, a researcher and a graduate student in Computer Science with experience in sentiment analysis. To ensure that both annotators had a similar understanding of the task, a short clarification session was held and examples of annotated sentences were shown. The disagreement rate between the annotators was 5%. We computed the final sentiment score of each manually labeled sentence by averaging the two scores. We performed the sentiment analysis task using the Weka tool [134], generating as output of this step an integer value in the [-1,1] range for each of the input sentences. A value of 1 denotes positive sentiment, whereas 0 and -1 denote neutral and negative sentiment, respectively. Table 4.3 shows some of the sentences and their corresponding sentiment as determined by the classifier.
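A sentiment classifier of this kind can be assembled with the Weka API as in the minimal sketch below. It is not our exact experimental setup: the file name reviews.arff, the attribute names text and sentiment, and the number of selected features are assumptions made for the example (and ChiSquaredAttributeEval may need to be installed as an additional package in recent Weka releases).

```java
import weka.attributeSelection.ChiSquaredAttributeEval;
import weka.attributeSelection.Ranker;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class SentimentClassifierSketch {

    public static void main(String[] args) throws Exception {
        // Load the manually labeled sentences (hypothetical file and attribute names).
        Instances data = DataSource.read("reviews.arff");
        data.setClassIndex(data.attribute("sentiment").index());

        // Turn the raw text into a Term-by-Document representation.
        StringToWordVector textToVector = new StringToWordVector();
        textToVector.setInputFormat(data);
        Instances vectorized = Filter.useFilter(data, textToVector);
        vectorized.setClassIndex(vectorized.attribute("sentiment").index());

        // Keep the words that are most discriminative according to the Chi-squared metric.
        AttributeSelection selection = new AttributeSelection();
        selection.setEvaluator(new ChiSquaredAttributeEval());
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(500); // arbitrary cut-off for this sketch
        selection.setSearch(ranker);
        selection.setInputFormat(vectorized);
        Instances reduced = Filter.useFilter(vectorized, selection);

        // Train and evaluate a Naive Bayes classifier with 10-fold cross-validation.
        NaiveBayes naiveBayes = new NaiveBayes();
        Evaluation evaluation = new Evaluation(reduced);
        evaluation.crossValidateModel(naiveBayes, reduced, 10, new java.util.Random(1));
        System.out.println(evaluation.toSummaryString());
    }
}
```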
4.2.5 Learning Classifiers
This paragraph discusses how we trained machine learning techniques to classify user reviews, while Paragraph 4.3 describes the data used as training and test sets (below we refer to them as T1 and T2), as well as the procedure we followed to manually create the truth set. Formally, given a training set of app review sentences T1 and a test set of app review sentences T2, we automatically classify the review content in T2 by performing the following steps:
1. NLP, TA and SA features: the first step uses the NLP, TA and SA approaches discussed in the previous paragraphs to compute the corresponding features contained in the sets of app review sentences T1 and T2. In particular, the output of this phase corresponds to a matrix M where each column represents an app review sentence and each row represents a feature extracted using the NLP, TA and SA approaches. Thus, each entry M[i,j] in the matrix represents the value of the i-th feature for the j-th app review sentence.
2. Split training and test features: the second step splits the matrix M (the output of the previous step) into two sub-matrices Mtraining and Mtest. Specifically, Mtraining is the matrix that contains the sentences (i.e., the corresponding columns in M) of T1, and Mtest is the matrix that contains the sentences (i.e., the corresponding columns in M) of T2.
3. Oracle building: the third step aims at building the oracle that allows ML techniques to train on Mtraining and predict on Mtest. Thus, in this stage, the
sentences in T1 and T2 are manually classified and assigned to one of the categories defined in Paragraph 4.2.1 (as described in Paragraph 4.3, two human evaluators performed this manual labelling).
4. Classification: the fourth step automatically classifies sentences relying on the output data obtained from the previous step, that is, Mtraining and Mtest (with classified sentences). Specifically, we experimented (relying on the Weka tool) with different machine learning techniques, namely the standard probabilistic naive Bayes classifier, Logistic Regression, Support Vector Machines, J48, and the alternating decision tree (ADTree). The choice of these techniques is not random, since they were successfully used for bug report classification [2, 142] and for defect prediction in many previous works [9, 12, 22, 87, 143], thus increasing the generalisability of our findings.
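The classification step can be illustrated with the Weka API as in the sketch below, assuming the NLP, TA and SA features have already been exported to two ARFF files (training.arff and test.arff are hypothetical names) whose last attribute is the sentence category; ADTree is omitted here because in recent Weka releases it ships as a separate package.

```java
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.Logistic;
import weka.classifiers.functions.SMO;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ReviewClassificationSketch {

    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("training.arff"); // Mtraining with manual labels
        Instances test = DataSource.read("test.arff");      // Mtest with manual labels
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        Classifier[] classifiers = {
                new NaiveBayes(),   // probabilistic naive Bayes
                new Logistic(),     // logistic regression
                new SMO(),          // support vector machines
                new J48()           // C4.5 decision tree
        };

        for (Classifier classifier : classifiers) {
            classifier.buildClassifier(train);
            Evaluation evaluation = new Evaluation(train);
            evaluation.evaluateModel(classifier, test);
            System.out.printf("%s: weighted precision=%.3f, recall=%.3f, F-measure=%.3f%n",
                    classifier.getClass().getSimpleName(),
                    evaluation.weightedPrecision(),
                    evaluation.weightedRecall(),
                    evaluation.weightedFMeasure());
        }
    }
}
```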
4.2.6 The ARdoc tool
In this paragraph we present ARdoc (App Reviews Development Oriented Classifier), an all-in-one tool that automatically classifies useful sentences in user reviews from a software maintenance and evolution perspective. ARdoc classifies the sentences contained in user reviews into five categories: feature request, problem discovery, information seeking, information giving and other. Table 4.4 shows, for each category: (i) the category name, (ii) the category description and (iii) an example sentence belonging to the category. Figure 4.2 depicts ARdoc’s architecture. The tool’s main module is the Parser, which prepares the text for the analysis (i.e., text cleaning, sentence splitting, etc.). Our Parser exploits the functionalities provided by the Stanford CoreNLP API [91], which annotates the natural text with a set of meaningful tags. Specifically, it instantiates a pipeline with annotations for tokenization and sentence splitting. Once the text is divided into sentences, ARdoc extracts from each of these sentences three kinds of features: (i) the lexicon (i.e., the words used in the sentence) through the TAClassifier, (ii) the structure (i.e., the grammatical frame of the sentence) through the NLPClassifier, and (iii) the sentiment (i.e., a quantitative value assigned to the sentence expressing an affect or mood) through the SAClassifier. Finally, in the last step the MLClassifier uses the NLP, TA and SA information extracted in the previous phase to classify app reviews according to the taxonomy reported in Table 4.4 by exploiting a Machine Learning (ML) algorithm. We briefly describe, in Paragraph 4.2.6, the information extracted by our tool from app reviews
Table 4.4: Categories Definition

Category | Description | User Feedback Example
Information Giving | Sentences that inform or update users or developers about an aspect related to the app | “This app runs so smoothly and I rarely have issues with it anymore”
Information Seeking | Sentences related to attempts to obtain information or help from other users or developers | “Is there a way of getting the last version back?”
Feature Request | Sentences expressing ideas, suggestions or needs for improving or enhancing the app or its functionalities | “Please restore a way to open links in external browser or let us save photos”
Problem Discovery | Sentences describing issues with the app or unexpected behaviours | “App crashes when new power up notice pops up”
Other | Sentences not providing any useful feedback to developers | “What a fun app”
and, in Paragraph 4.2.6, the classification techniques we adopted.

Features Extraction

The NLPClassifier implements the set of the previously identified 246 NLP heuristics (see Paragraph 4.2.3). The NLPClassifier uses the Stanford Typed Dependencies (STD) parser [33], which represents dependencies between individual words contained in sentences and labels each dependency with a specific grammatical relation (e.g., subject or direct/indirect object). Through the analysis of the typed dependencies, each NLP heuristic tries to detect the presence of a text structure that may be connected to one of the categories in Table 4.4, looking for the occurrences of specific keywords in precise grammatical roles and/or specific grammatical structures.
Figure 4.2: ARdoc’s architecture overview
For each sentence in input, the NLPClassifier returns the corresponding linguistic pattern. If the sentence does not match any of the patterns we defined, the classifier simply returns the label “No patterns found”. The SAClassifier analyzes the sentences through the sentiment annotator provided by the Stanford CoreNLP [91] and for each sentence in input returns a sentiment value from 1 (strong negative) to 5 (strong positive). We use this sentiment prediction system because it is independent of hard-coded dictionaries – a drawback of the lexical sentiment analysis techniques that have been previously used for the analysis of app reviews [51,61,89]. The TAClassifier exploits the functionalities provided by the Apache Lucene API (http://lucene.apache.org) for analyzing the text content of user reviews. Specifically, this classifier performs stop-word removal (i.e., removal of words not carrying important information) through the StopFilter and normalizes the input sentences (i.e., reduces the inflected words to their root form) through the EnglishStemmer in combination with the SnowballFilter, in order to extract a set of meaningful terms that are weighted using the tf (term frequency), which weights each word i in a review j as:
\[ tf_{i,j} = \frac{rf_{i,j}}{\sum_{k=1}^{m} rf_{k,j}} \]

where $rf_{i,j}$ is the raw frequency (number of occurrences) of word i in review j. We use the tf (term frequency) instead of tf-idf indexing because the use of the idf penalizes too heavily terms (such as “fix”, “problem”, or “feature”) appearing in many reviews [38]. Such terms may constitute interesting features for guiding ML techniques in classifying useful feedback.

Classification via ML Techniques

We used the NLP, TA and SA features extracted in the previous phase of the approach to train ML techniques and classify app reviews according to the taxonomy in Table 4.4. To integrate ML algorithms in our code, we used the Weka API [57]. The MLClassifier module provides a set of Java methods for prediction, each of which exploits a different pre-trained ML model and uses a specific combination of the three kinds of extracted features: (i) text features (extracted through the TAClassifier), (ii) structures (extracted through the NLPClassifier) and (iii) sentiment features (extracted through the SAClassifier). Specifically, the methods implemented in the MLClassifier may use the following combinations of features (as shown in Figure 4.3): (i) only text features, (ii) only text structures, (iii) text structures + text features, (iv) text structures + sentiment, and (v) text structures + text features + sentiment. We do not provide the (i) sentiment and (ii) text features + sentiment combinations because, as discussed in Paragraph 4.4, they showed very poor effectiveness in classifying sentences into the defined categories. All the prediction methods provided by the MLClassifier class create a new Instance using a combination of the extracted features to learn a specific ML model and classify the Instance according to the categories shown in Table 4.4. Among all the available ML algorithms we use the J48 algorithm, since it proved to be the algorithm achieving the best results (see Paragraph 4.4). We trained all the ML models using as training data a set of 852 manually labeled sentences randomly selected from the user reviews of seven popular apps.

Using ARdoc

We provide two versions of ARdoc. The first version provides a practical and intuitive Graphical User Interface. Users simply have to download the zipped file ARDOC.zip, unzip the downloaded file and follow the running instructions provided in the README.txt file.
Figure 4.3: ARdoc Graphic Interface
Figure 4.3 shows the tool’s interface. The tool’s window is divided into the following sections: (i) the menu bar (point 1 in Figure 4.3) provides functions for creating a new blank window, loading the text to classify from an existing text file, importing the reviews for classification from Google Play, or exporting the classified data for further analysis; (ii) the feature selection panel (point 2 in Figure 4.3) allows users to choose the desired combination of features for review classification; (iii) the input text area (point 3 in Figure 4.3) allows users to write (or copy and paste) the reviews to classify and to visualize the classification results; (iv) the panel with the legend (point 4 in Figure 4.3) reports the categories and their associated colors; (v) the Classify button (point 5 in Figure 4.3) allows users to start the classification and produces the classification results. To analyze the reviews the user can simply (i) paste the reviews in the input text area of the GUI, or (ii) load them from a text file or import them directly from Google Play (specifying the url of the app as reported in the instructions of the provided README.txt file); then (iii) select the desired combination of features she wants to exploit for the classification, and press the Classify button.
Figure 4.4: ARdoc java API usage
For classifying multiple reviews, users can insert blank lines to separate the reviews from each other, as shown in Figure 4.3. At the end of the recognition process, all the recognized sentences are highlighted with different colors depending on the categories the tool assigned to them. The second version of ARdoc is a Java API that provides an easy way to integrate our classifier in other Java projects. Figure 4.4 shows an example of Java code that integrates ARdoc’s capabilities. To use it, it is necessary to download the ARdoc API.zip from the tool’s Web page, unzip it, and import the library ARdoc API.jar, as well as the jars contained in the lib folder of ARdoc API.zip, in the build path of the project. To use ARdoc it is sufficient to import the classes org.ardoc.Parser and org.ardoc.Result and instantiate the Parser through the method getInstance. The method extract of the class Parser represents the entry point to access the tool’s classification. This method accepts as input a String representing the combination of features the user wants to exploit, and a String containing the text to classify. The extract method returns a list of Result objects, providing all the methods to access ARdoc’s classification results.
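As a concrete counterpart of the code shown in Figure 4.4, the snippet below classifies two review sentences through the API described above. The calls to Parser.getInstance and extract follow the description given in this paragraph, while the feature-combination string, the caught exception type and the Result accessor names are assumptions to be checked against the README and the examples shipped with ARdoc API.zip.

```java
import org.ardoc.Parser;
import org.ardoc.Result;

import java.util.List;

public class ARdocUsageSketch {

    public static void main(String[] args) {
        Parser parser = Parser.getInstance();
        String reviews = "App crashes when new power up notice pops up. "
                + "Please restore a way to open links in external browser.";
        try {
            // "NLP+TA+SA" is assumed to be the string selecting the combined feature set;
            // see the tool's README for the exact accepted values.
            List<Result> results = parser.extract("NLP+TA+SA", reviews);
            for (Result result : results) {
                // getSentence() and getSentenceClass() are assumed accessor names.
                System.out.println(result.getSentence() + " -> " + result.getSentenceClass());
            }
        } catch (Exception e) { // the actual exception type may be more specific
            e.printStackTrace();
        }
    }
}
```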
4.3 Evaluation: Study Design
The main goal of our study is to help developers of mobile apps to categorize information from user reviews that is relevant for software maintenance and evolution.
4.3.1 Research Questions
Stemming from our RQ2 stated in Paragraph 1.3 we derived two research subquestions that guided our work:
• RQ2-a: Are the language structure, content and sentiment information able to identify user reviews that could help developers in accomplishing software maintenance and evolution tasks?
• RQ2-b: Does the combination of language structure, content and sentiment information produce better results than individual techniques used in isolation?
In this paragraph we describe the dataset and methodology we used during the evaluation of our approach.
4.3.2 Dataset

Table 4.5: Overview of the dataset

App | Category | Platform | Total Reviews
AngryBirds | Games | App Store | 1538
Dropbox | Productivity | App Store | 2009
Evernote | Productivity | App Store | 8878
TripAdvisor | Travel | App Store | 3165
PicsArt | Photography | Google Play | 4438
Pinterest | Social | Google Play | 4486
Whatsapp | Communication | Google Play | 7696
To answer our research subquestions we evaluated our approach on the set of reviews collected by Guzman and Maalej [52], which contains reviews of the AngryBirds, Dropbox and Evernote apps available in Apple’s App Store (https://itunes.apple.com/us/genre/ios/id36) and reviews of the TripAdvisor, PicsArt, Pinterest and Whatsapp apps available in Android’s Google Play store (https://play.google.com/store?hl=en). All seven apps were in the list of the most popular apps in the year 2013
in their respective app store and belong to different app categories. The diversity of the chosen apps allows for evaluating the robustness of the approach by classifying reviews which contain different vocabularies and are written by different user audiences. Table 4.5 shows for each app considered in our dataset: (i) the application name, (ii) the app category it belongs to, (iii) the platform from which comments were collected, and (iv) the number of collected reviews.
4.3.3 Evaluation Methodology
To answer our RQ2-a we experimented with the ML techniques described in Paragraph 4.2.5, training them on the NLP, TA, and SA features. Furthermore, to answer RQ2-b we investigated whether specific combinations of NLP, TA and SA features allow a better classification to be obtained. Specifically, we trained the ML techniques using different combinations of features: (i) NLP+TA, (ii) NLP+SA and (iii) NLP+TA+SA. We then compared our results against a manually labelled truth set by using metrics commonly used in machine learning and NLP tasks. In this paragraph we describe the procedure for creating the truth set and the metrics used.

Creation of Truth Set

To create our truth set we first used AR-Miner [24] to filter out non-informative reviews in our dataset. Then, we manually labeled a sample of dataset sentences. The sentences were selected through a stratified random sampling strategy. During the sampling we verified that the percentage of extracted sentences per app was the same as the percentage of reviews per app in the original set. In total we sampled 1421 sentences out of 7696 reviews (18.46%). Two researchers involved in this study manually labeled the sample according to the categories of our taxonomy (see Paragraph 4.2.1). An additional category, named other, was used whenever a sentence did not match any of the predetermined categories. To ensure that both annotators applied the same criteria when labeling the results, the definitions of each category were discussed among them before any labeling was done. Then, each annotator labeled a set of 20 sentences. All disagreements were deliberated and the definitions for each taxonomy category were updated to avoid further misunderstandings. Afterwards, each annotator labeled half of the remaining set independently of the other. Whenever the annotators were unsure about the appropriate category for a sentence, they marked the sentence as unsure and labeled it with the category they thought would suit best. Afterwards, the other annotator
Table 4.6: Percentages of labeled sentences in the truth set

Category | # Reviews | Proportion
Information Seeking | 101 | 0.07107671
Information Giving | 583 | 0.41027445
Feature Request | 218 | 0.15341309
Problem Discovery | 488 | 0.34342013
Others | 31 | 0.02181562
Total | 1421 | 1
labeled all sentences that were marked as unsure by the original annotator. For the cases where the second annotator was also unsure about the category, both annotators discussed the final labeling and a decision was made. In total there were 88 sentences (6.2% of the whole truth set) where at least one annotator was unsure about the labeling, indicating that most of the time the annotators were confident about their work. The disagreement for the unsure cases was 2.81%. Only 31 sentences were labeled in the other category, i.e., 2.18% of the truth set, indicating that our taxonomy covers most of the evolution topics discussed in sentences that are informative for developers. After this annotation process our truth set comprised 1390 sentences. Table 4.6 shows the number of reviews in the truth set that were labeled as belonging to each category. Information giving was the most common category, making up 41% of the truth set; problem discovery followed with 34% of the truth set, whereas feature request and information seeking were only present in 15% and 7% of the sentences, respectively. The truth set is used to generate the training and test sets for the machine learning phase of our approach. Specifically, we used 278 items from our fully manually labeled truth set (20%) as a training set for the different ML techniques we employed, while the remaining 1112 sentences (80%) of the truth set constituted the test set.

Used Metrics

We evaluate our results using the precision, recall, and F-measure metrics commonly used in machine learning. In our evaluation we compare the human-generated truth set with the automatically generated classification. For each category, the correctly classified items have been counted as true positives, the items incorrectly labeled as belonging to that specific category have been considered false positives, and the items incorrectly labeled as belonging to other categories have been counted as false negatives. Precision is computed by dividing the number of true positives by the sum of true positives and false positives. Recall is computed by dividing the number of true positives by the sum of true positives and false negatives. We compute the F-measure using its general form definition, which returns the harmonic mean of precision and recall.
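In formulas, denoting with TP, FP and FN the true positives, false positives and false negatives of a given category, these standard definitions correspond to:

\[ Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN}, \qquad F\text{-}measure = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} \]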
Statistical Tests

In order to assess whether the differences between the different input features and classifiers were statistically significant, we performed a Friedman test, followed by a post-hoc Nemenyi test, as recommended by Demšar [35].
4.4 Evaluation: Analysis of Results

4.4.1 RQ2-a: Are the language structure, content and sentiment information able to identify user reviews that could help developers in accomplishing software maintenance and evolution tasks?
Figure 4.5 gives an overview of the main results obtained through different configurations of machine learning algorithms: (i) NLP features only, (ii) TA features only, (iii) both NLP and SA features, (iv) both NLP and TA features, (v) all NLP, SA, and TA features. The figure does not report the results achieved when training the ML techniques using only SA features, since in that case we obtained the worst results, with a precision and recall that never exceeded 20% and 10%, respectively. These results are not surprising because SA features can take only three possible values, which are insufficient to assign the reviews to one of the four categories of our taxonomy. The results in Figure 4.5 show that the NLP+TA+SA configuration achieved the best results with the J48 algorithm, among all possible feature inputs and classifiers, with 75% precision and 74% recall. Therefore, we base the forthcoming result analysis on the NLP+TA+SA configuration with the J48 classifier. Table 4.7 shows the precision, recall and F-Measure for each category (see Paragraph 4.2.1) obtained through the J48 algorithm, using the NLP+TA+SA features. In particular, problem discovery was the class with the highest F-measure, followed by the information giving and information seeking categories. On the other hand, the feature request category was the category with the lowest F-measure. This mirrors the high performance obtained in terms of both precision and recall by three of the
Classifier | NLP: Precision / Recall / F-Measure | TA: Precision / Recall / F-Measure | NLP+SA: Precision
Bayes | 0.572 / 0.661 / 0.609 | 0.665 / 0.584 / 0.545 | 0.572
SVM | 0.577 / 0.662 / 0.61 | 0.592 / 0.614 / 0.584 | 0.643
Logistic Regression | 0.577 / 0.662 / 0.61 | 0.462 / 0.46 / 0.457 | 0.561
J48 | 0.577 / 0.662 / 0.61 | 0.572 / 0.58 / 0.563 | 0.726
ADTree | 0.697 / 0.67 / 0.63 | 0.619 / 0.611 / 0.591 | 0.79