FOCUS: SOFTWARE QUALITY
Examining the Relationship between FindBugs Warnings and App Ratings

Hammad Khalid, Shopify
Meiyappan Nagappan, Rochester Institute of Technology
Ahmed E. Hassan, Queen's University
A case study examined the relationship between user app ratings and static-analysis warnings for 10,000 Android apps. Certain categories of warnings had a strong relationship with app ratings and thus can be considered as closely related to the user experience.
Static-analysis tools automatically examine source code and produce warnings that help developers identify possible issues before releasing software. Research has confirmed that these tools can help identify bugs.1 So, addressing static-analysis warnings can help improve software quality. However, many categories of warnings exist, and it's not clear whether some of them lead to bugs that impact user perception of an app's quality. User perception in the mobile-app ecosystem, represented as user ratings, exhibits a statistically significant relation to downloads and hence the revenue from apps.2 When users are dissatisfied with an app's quality, they often give the app a low rating (for example, 1 star to indicate the lowest quality). Users can also leave review comments explaining the rating. So, if low ratings are related to certain categories of static-analysis warnings, developers could use static-analysis tools to identify and fix the bugs that lead to those ratings.

We examined the categories of static-analysis warnings related to 10,000 free-to-download Android apps from Google Play. We obtained these warnings from FindBugs (http://findbugs.sourceforge.net), an open source program that automatically warns about potential bugs in Java code. We wanted to empirically examine the relationship between each category of warning in an app and that app's rating. We also compared the complaints in the review comments with the FindBugs warnings. Specifically, we wanted to determine which categories occurred more frequently in low-rated apps than in highly rated apps. We found that specific types of warnings had significantly higher densities in low-rated apps. Our results suggest that developers could indeed use static-analysis tools (that is, FindBugs) on Android apps to identify bugs that could be related to low ratings.

FindBugs
Although other prominent static-analysis tools exist, we chose FindBugs because it strives to reduce the number of false-positive warnings.3 This, we feel, makes a static-analysis tool useful because developers look for low-cost, highly effective tools. FindBugs identifies warnings for more than 400 possible types of bugs, grouped into eight categories: bad practice, correctness, internationalization,
malicious code vulnerability, multithreaded correctness, performance, security, and dodgy code. FindBugs also assigns a low, medium, or high priority to each warning, indicating how confident it is about whether the warning really indicates a bug.

FIGURE 1. An overview of our study, in which we examined the categories of static-analysis warnings (from the FindBugs tool) related to 10,000 Android apps from Google Play. The workflow: select 10,000 Android apps from Google Play and download their APKs (Android application package files), details, and reviews; decompile each APK to the JAR (Java Archive) format; run FindBugs on the JAR and remove the warnings of common libraries; search the reviews for complaints; and compare the FindBugs warnings with the user complaints.
The Study Design
Figure 1 illustrates our study's workflow.
Selecting Data
We randomly selected the 10,000 apps from a list that Steffen Dienst and Thorsten Berger generated.4 The minimum rating was 1.3 stars, the maximum was 5 stars, and the median was 4.07 stars. To ensure that a few users didn't skew the ratings, we selected only apps that had at least 30 ratings. The median number of ratings was 181. The apps came from all the Google Play categories. Game apps accounted for the highest number of apps; weather apps accounted for the lowest.
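A minimal sketch of this selection step is shown below. The file name and column names are hypothetical (they assume the app list has been exported to a CSV with one row per app); only the thresholds come from the study.

```python
# Sketch of the app-selection step (hypothetical file and column names).
# Assumes the app list has been exported to a CSV with columns:
# package_name, category, rating, rating_count.
import csv
import random
import statistics

def select_apps(path, sample_size=10000, min_ratings=30, seed=1):
    with open(path, newline="", encoding="utf-8") as f:
        apps = [row for row in csv.DictReader(f)
                if int(row["rating_count"]) >= min_ratings]  # drop thinly rated apps
    random.Random(seed).shuffle(apps)                        # random sample
    sample = apps[:sample_size]
    ratings = [float(a["rating"]) for a in sample]
    print(f"{len(sample)} apps, median rating {statistics.median(ratings):.2f}, "
          f"min {min(ratings)}, max {max(ratings)}")
    return sample

# apps = select_apps("android_apps.csv")
```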
Collecting Data
For each app, we downloaded the APK (Android application package) file, the reviews (both the ratings and comments), and the number of people who rated the app. We collected up to 500 of the newest reviews for each app (Google Play limits the total number of reviews that nondevelopers can see).
Decompiling Android Apps
Because FindBugs requires the JAR (Java Archive) format, we used dex2jar (https://code.google.com/p/dex2jar), an open source tool, to extract the JAR files from the APKs. Other researchers have also used dex2jar for this.5
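A rough sketch of batch conversion with dex2jar follows. The d2j-dex2jar.sh entry point and its -o output flag are assumptions; the script name and options vary across dex2jar releases, so check your installation.

```python
# Sketch of batch APK-to-JAR conversion with dex2jar (assumed entry point
# d2j-dex2jar.sh and its -o flag; verify against your dex2jar version).
import pathlib
import subprocess

def apk_to_jar(apk_path: str, out_dir: str = "jars") -> pathlib.Path:
    apk = pathlib.Path(apk_path)
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)
    jar = out / (apk.stem + ".jar")
    # Equivalent shell command: d2j-dex2jar.sh -o <jar> <apk>
    subprocess.run(["d2j-dex2jar.sh", "-o", str(jar), str(apk)], check=True)
    return jar

# for apk in pathlib.Path("apks").glob("*.apk"):
#     apk_to_jar(str(apk))
```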
Running FindBugs
We ran FindBugs using its recommended settings, in which it detects high- and medium-priority warnings and ignores low-priority warnings, which often include false positives. We also ignored all style and naming-convention warnings (because we were looking at the decompiled binary of the original code). After running FindBugs on each app, we extracted the density of each warning per app. This density was the number of warnings per thousand lines of noncommented source statements. We also identified the number of warnings in each of the eight categories and the name of the class in which a warning occurred.
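The density computation can be sketched as follows. This assumes FindBugs's XML report format, in which each BugInstance element carries category and priority attributes (1 = high, 2 = medium, 3 = low); verify the attribute names against your FindBugs version. The noncommented-source-statement count and file name are hypothetical inputs.

```python
# Sketch of per-category warning densities from a FindBugs XML report.
# Assumes <BugInstance> elements with 'category' and 'priority' attributes;
# ncss is the app's noncommented source statements, counted separately.
from collections import Counter
import xml.etree.ElementTree as ET

def warning_densities(report_path: str, ncss: int, max_priority: int = 2):
    counts = Counter()
    for bug in ET.parse(report_path).getroot().iter("BugInstance"):
        if int(bug.get("priority", "3")) <= max_priority:  # keep high/medium only
            counts[bug.get("category")] += 1
    kloc = ncss / 1000.0
    return {category: n / kloc for category, n in counts.items()}

# densities = warning_densities("findbugs_report.xml", ncss=42137)
# print(densities.get("PERFORMANCE", 0.0))
```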
Removing Warnings of Common Libraries
Android apps, like all software, have numerous external libraries. Because we wanted to examine the relationship between ratings and warnings in each app, we couldn't have interference due to warnings from common libraries. For example, attributing the warnings of the Android Support Library (code that provides backward compatibility and is in many Android apps) to the apps themselves would have added unnecessary noise to the data.

We first identified the external libraries found across many apps, using the packaging information and base classes' names. For example, the base class android.support.app.fragment let us identify that an app used the Android Support Library. We counted the number of apps in which we found each package. We found 4,049 shared packages, with a few that were in many apps. After the first few hundred popular packages, the frequency quickly declined. The skew in the data was very high, with 5,611 apps that included com.google packages.

We manually examined 766 packages that 10 or more apps shared and the libraries those packages were part of. We ensured that the packages were part of a library and not a commonly used package name (for example, com.myapp.ButtonFragment). For each of these potential libraries, we examined the public code bases (for example, GitHub and Ohloh [now Black Duck Open Hub]) that matched its package names. We flagged 329 common libraries; our analysis ignored the warnings from these libraries' packages.
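A minimal sketch of this filtering step appears below. It assumes the warnings have already been parsed into (app, class, category) tuples; the two-level package prefix and the helper names are illustrative, and in the study the final library-versus-app-package decision was made manually.

```python
# Sketch of the library-filtering step (illustrative data structures).
from collections import defaultdict

def shared_packages(app_classes, min_apps=10, depth=2):
    """app_classes: dict of app_id -> iterable of fully qualified class names."""
    apps_per_pkg = defaultdict(set)
    for app_id, classes in app_classes.items():
        for cls in classes:
            pkg = ".".join(cls.split(".")[:depth])  # e.g., com.google, android.support
            apps_per_pkg[pkg].add(app_id)
    # Candidate library packages: shared by at least min_apps apps.
    return {pkg for pkg, apps in apps_per_pkg.items() if len(apps) >= min_apps}

def drop_library_warnings(warnings, library_pkgs):
    """warnings: iterable of (app_id, class_name, category) tuples."""
    return [w for w in warnings
            if not any(w[1].startswith(pkg + ".") for pkg in library_pkgs)]
```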
The Results
A common criticism of static-analysis tools is that they often produce many false positives.1 Thus, developers using them could end up wasting time dealing with warnings unrelated to their software's quality. Although FindBugs focuses on reducing false positives as much as possible, it isn't perfect; for example, it can still flag benign issues. So, we wanted to identify the warnings most related to the low-rated apps. Identifying them will help developers concentrate on such warnings, which could indicate the culprits behind the issues about which users complain.
Our Approach
In particular, we wanted to determine whether densities of a particular category of FindBugs warning differed between highly rated and low-rated apps. From the 10,000 apps, we chose the 25 percent (2,500 apps) with the highest ratings and the 25 percent with the lowest ratings. The highly rated apps ranged from 4.3 to 5 stars; the low-rated apps ranged from 1.29 to 3.6 stars.

We compared the warning densities (to control for the apps' size) for each of the eight categories, for the highly rated and low-rated apps, using a one-tailed Mann-Whitney U test (MWU) at a significance level of 0.05.6 The dependent variable was each category's warning density (which is continuous); the independent variable was the group, highly rated or low-rated. The MWU's null hypothesis was that a category's warning density is the same for the highly rated and low-rated apps. The alternative hypothesis was that the warning density is larger for the low-rated apps. We used MWU because, unlike the Student's t-test, it doesn't assume that the data is normally distributed.
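For one warning category, the comparison might be sketched as follows (hypothetical input lists of per-app densities; requires SciPy).

```python
# Sketch of the one-tailed Mann-Whitney U test comparing a category's
# warning densities in low-rated versus highly rated apps.
from scipy.stats import mannwhitneyu

def compare_category(low_rated_densities, high_rated_densities, alpha=0.05):
    # Alternative hypothesis: densities are larger in the low-rated group.
    stat, p = mannwhitneyu(low_rated_densities, high_rated_densities,
                           alternative="greater")
    return p, p < alpha

# p, significant = compare_category(low_perf_densities, high_perf_densities)
```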
The Three Significant Categories
Three categories of warnings occurred statistically significantly more often in low-rated apps:

• Bad-practice warnings indicate violation of essential coding practices (for example, equals problems, dropped exceptions, and misuse of finalize).
• Internationalization warnings indicate misused encoding of characters.
• Performance warnings indicate slow code.
Table 1 shows statistics for these categories.

TABLE 1. Categories of FindBugs warnings with a statistically significantly higher density in low-rated apps.

Warning category        MWU p-value*   Median density, highly rated apps   Median density, low-rated apps
Bad practice            0.011          0.21                                0.24
Internationalization    1.57e-11       0.11                                0.18
Performance             4.03e-05       0.39                                0.48

* MWU stands for Mann-Whitney U test.
Analyzing User Complaints
We examined whether the reviews explicitly mentioned issues related to the three warnings. To do this, we compared the reviews of the apps with the highest densities of these warnings with those of the apps with the lowest densities. To focus on the complaints, we analyzed only the reviews for ratings of 3 or fewer stars.7 So that a few reviews didn't skew the overall complaints, we filtered out the apps with fewer than 10 reviews.8 That left us with 4,708 apps. From them, we identified the 25 percent (1,177 apps) with the highest densities and the 25 percent with the lowest densities.

To analyze the reviews, we first identified keywords to look for. On the basis of our knowledge of the three categories and our experience with manually categorizing mobile-app reviews,9 we selected the keywords in Table 2. We counted the number of reviews per app that had a keyword related to a warning category. For example, to determine performance complaints, we counted the number of reviews per app that included "slow," "hang," "lag," or "slug." We included stemmed versions of each keyword (for example, "lags," "lagging," and "lagged"). We counted only one occurrence of these keywords per review. For example, if a review contained both "slow" and "hangs," we counted it as one occurrence of a performance complaint. We did this because we cared only whether a review contained a particular type of complaint. So, for each of the highest-density and lowest-density apps, we got the total number of reviews and
the number of reviews with complaints corresponding to a particular warning category (for example, 500 reviews, 50 performance complaints). We turned that raw count into percentages (for example, 10 percent performance complaints). We compared the percentages for the highest-density and lowest-density apps using MWU, to see whether users complained about a particular category in apps with a higher density of the corresponding warnings.

Apps with the highest warning densities for a category had a statistically significantly higher rate of the corresponding complaints. Table 2 shows the statistics for each category. One such app was the trial version of a media player, which had a performance warning density of 6.4. A quick examination of this app's reviews revealed numerous comments mentioning performance issues and crashes, such as "Takes an age to search SD [Secure Digital] card, then when you try to play a video it just says Buffering until you get bored and close it. Rubbish." Many of the internationalization warnings occurred in apps for which users complained about the encoding or being forced to use a specific language. So, we established that the bugs related to FindBugs warnings could directly manifest in user reviews as complaints and thus impact the apps' ratings.

TABLE 2. Keywords used to identify user complaints. The last two columns give the mean percentage of reviews containing the corresponding complaint.

Warning category       Keywords                                            MWU p-value   Lowest-density apps   Highest-density apps
Bad practice           Bug, buggy, issue, problem, broke                   4.02e-09      5.6                   6.7
Internationalization   Country, language, word, international,             0.0002261     3.5                   3.8
                       internationalization, UTF, encoding
Performance            Slow, hang, lag, slug                               0.0004456     3.9                   6.0
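The keyword matching described above can be sketched as follows. The prefix-based stemming is an approximation of the stemmed keyword lists (so "lag" also matches "lags," "lagging," and "lagged"), and at most one match is counted per review; the function and variable names are illustrative.

```python
# Sketch of the complaint-keyword matching per app.
import re

KEYWORDS = {
    "bad_practice": ["bug", "buggy", "issue", "problem", "broke"],
    "internationalization": ["country", "language", "word", "international",
                             "internationalization", "utf", "encoding"],
    "performance": ["slow", "hang", "lag", "slug"],
}

# Treat each keyword as a prefix to approximate stemming.
PATTERNS = {cat: re.compile(r"\b(" + "|".join(words) + r")\w*", re.IGNORECASE)
            for cat, words in KEYWORDS.items()}

def complaint_percentages(reviews):
    """reviews: list of review texts for one app (ratings of 3 stars or fewer)."""
    if not reviews:
        return {cat: 0.0 for cat in PATTERNS}
    hits = {cat: sum(1 for r in reviews if PATTERNS[cat].search(r))  # one hit per review
            for cat in PATTERNS}
    return {cat: 100.0 * n / len(reviews) for cat, n in hits.items()}

# pcts = complaint_percentages(app_low_star_reviews)
```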
Threats to Validity
Here we address the perceived threats to our research's validity.
Construct Validity
As we mentioned before, we ran FindBugs on decompiled versions of the apps (some of which could be obfuscated). Although this might have affected the results, we were limited to this approach because we didn't have access to the source code.

The warnings from third-party libraries that we removed could have been even more problematic than the warnings we dealt with. Nonetheless, we feel that removing these libraries was a better approach because many apps shared them. Assigning the warning attributes of the third-party libraries to the apps themselves would have added the threat of low reliability of the measures, which could have led to invalid conclusions. However, we verified that the relationship between FindBugs warnings and ratings held even when we included all the libraries.
Internal Validity
The tools we used to decompile the apps and identify the warnings weren't perfect and thus might have affected our results. However, we used standard tools that past studies have used to reverse-engineer Android apps.5,8,10 (For a look at other related research on mobile apps, see the sidebar.) We were also restricted
to this approach because of the study's large scope.

To describe complaints, app reviewers might have used words other than the keywords we chose. However, manually analyzing thousands of reviews was outside this study's scope. As we mentioned before, we chose keywords based on our experience, and we focused on reviews with low ratings.
External Validity
Although our findings might not generalize to all free-to-download Android apps, we feel that 10,000 apps was a considerable sample. To mitigate the threat of generalization, we studied apps that covered all the Google Play categories, as we mentioned before. In addition, the apps covered a range of ratings similar to the range for all apps in Google Play (www.appbrain.com/stats/android-app-ratings).

Also, we considered only one static-analysis tool, FindBugs. To control for the threat that arises from using only FindBugs, we've made conclusions only about FindBugs warnings throughout this article. Other static-analysis tools, such as Coverity (www.coverity.com) or IBM Security AppScan (www-03.ibm.com/software/products/en/appscan), might yield a different set of warnings. So, developers could use them to identify other warnings indicating bugs that could lower user ratings. Toward that end, future research could use our analysis technique to examine those tools to identify other types of warnings related to ratings.
Conclusion Validity
We're not claiming that fixing the bugs related to the FindBugs warnings will increase ratings, only that those warnings appear significantly more often in low-rated apps. We didn't carry out a controlled study to check whether fixing those bugs increases ratings. Although that question is interesting, such a controlled study is beyond this article's scope, and we would like to address it in future research.

Our results imply that app developers shouldn't neglect running FindBugs (or another static-analysis tool) because it's a low-cost way to find solutions to some user complaints. If their app's overall warning density is too high, they should look at the categories of bugs with high warning densities and address those warnings before releasing the app. For researchers, this study provides a direct link between static-analysis warnings from one tool and software quality (expressed as user ratings). We plan to examine other static-analysis tools and how they could help developers improve app quality. We also intend to apply the same experimental process to the other steps in the quality assurance cycle, so that we can recommend a more complete set of quality assurance practices to mobile-app developers.

RELATED WORK IN MOBILE APPS

David Hovemeyer and William Pugh introduced FindBugs, a tool that automatically warns about a variety of bugs in Java programs.1 Here, we survey research related to static-analysis tools for Android apps and other pertinent research about mobile apps.

STATIC ANALYSIS OF ANDROID APPS
Chaorong Guo and colleagues proposed Relda, a static-analysis tool that identifies resource leaks in Android apps.2 To do this, Relda analyzes callbacks in the Android framework. Similarly, Étienne Payet and Fausto Spoto extended the static-analysis tool Julia and improved the precision of detecting nullness in Android apps.3 Their research also demonstrated static-analysis tools' usefulness and versatility for Android apps. Amiya Kumar Maji and colleagues characterized failures in the Android and Symbian OSs.4 Our research (see the main article) complements these studies. We show that developers can benefit from using automated static analysis on mobile apps (not the OS on which the apps are installed) to check for issues that could lead to complaints in user reviews.

MOBILE-APP RATINGS
Mark Harman and his colleagues discovered a strong correlation between an app rating and the number of downloads, indicating that ratings are a strong indicator of users' opinions of apps.5 Dennis Pagano and Walid Maalej found that review comments often contained useful feedback, bug reports, and user experience.6 Our case study used app ratings to identify highly rated and low-rated Android apps and compared FindBugs warnings for them. We also examined the complaints in the review comments.

MOBILE-APP QUALITY
Previously, we showed that the two most common complaints in app reviews were functional errors (bugs) and crashes.7 This highlights the importance of improving app quality. Ryan Stevens and his colleagues examined permission use in 10,000 Android apps and found a relationship between a permission's popularity and the number of times it was misused.8 Mario Linares-Vásquez and his colleagues examined API use of Android apps.9 They found that fault and change proneness in the APIs might have affected app quality and ratings.

References
1. D. Hovemeyer and W. Pugh, "Finding Bugs Is Easy," ACM SIGPLAN Notices, vol. 39, no. 12, 2004, pp. 92–106.
2. C. Guo et al., "Characterizing and Detecting Resource Leaks in Android Applications," Proc. IEEE/ACM 28th Int'l Conf. Automated Software Eng. (ASE 13), 2013, pp. 389–398.
3. É. Payet and F. Spoto, "Static Analysis of Android Programs," Information and Software Technology, vol. 54, no. 11, 2012, pp. 1192–1201.
4. A. Kumar Maji et al., "Characterizing Failures in Mobile OSes: A Case Study with Android and Symbian," Proc. IEEE 21st Int'l Symp. Software Reliability Eng. (ISSRE 10), 2010, pp. 249–258.
5. M. Harman, Y. Jia, and Y. Zhang, "App Store Mining and Analysis: MSR for App Stores," Proc. 9th Working Conf. Mining Software Repositories (MSR 12), 2012, pp. 108–111.
6. D. Pagano and W. Maalej, "User Feedback in the AppStore: An Empirical Study," Proc. 21st IEEE Int'l Requirements Eng. Conf. (RE 13), 2013, pp. 125–134.
7. H. Khalid et al., "What Do Mobile App Users Complain About?," IEEE Software, vol. 32, no. 3, 2014, pp. 70–77.
8. R. Stevens et al., "Asking for (and about) Permissions Used by Android Apps," Proc. 10th Int'l Workshop Mining Software Repositories (MSR 13), 2013, pp. 31–40.
9. M. Linares-Vásquez et al., "API Change and Fault Proneness: A Threat to the Success of Android Apps," Proc. 9th Joint Meeting Foundations of Software Eng. (ESEC/FSE 13), 2013, pp. 477–487.
References
1. A. Vetro, M. Morisio, and M. Torchiano, "An Empirical Validation of FindBugs Issues Related to Defects," Proc. 15th Ann. Conf. Evaluation and Assessment in Software Eng. (EASE 11), 2011, pp. 144–153.
2. M. Harman, Y. Jia, and Y. Zhang, "App Store Mining and Analysis: MSR for App Stores," Proc. 9th Working Conf. Mining Software Repositories (MSR 12), 2012, pp. 108–111.
3. D. Hovemeyer and W. Pugh, "Finding Bugs Is Easy," ACM SIGPLAN Notices, vol. 39, no. 12, 2004, pp. 92–106.
4. S. Dienst and T. Berger, "Static Analysis of App Dependencies in Android Bytecode," tech. note, 2014; www.informatik.uni-leipzig.de/berger/tr/2012-dienst.pdf.
5. I.J. Mojica Ruiz et al., "Understanding Reuse in the Android Market," Proc. 20th IEEE Int'l Conf. Program Comprehension (ICPC 12), 2012, pp. 113–122.
6. H. Mann and D. Whitney, "On a Test of Whether One of Two Random Variables Is Stochastically Larger than the Other," Annals of Mathematical Statistics, vol. 18, no. 1, 1947, pp. 50–60.
7. H. Khalid et al., "Prioritizing the Devices to Test Your App On: A Case Study of Android Game Apps," Proc. 22nd ACM SIGSOFT Int'l Symp. Foundations of Software Eng. (FSE 14), 2014, pp. 610–620.
8. I.J. Mojica Ruiz, "Large-Scale Empirical Studies of Mobile Apps," master's thesis, School of Computing, Faculty of Arts and Science, Queen's Univ., 2013.
9. H. Khalid et al., "What Do Mobile App Users Complain About?," IEEE Software, vol. 32, no. 3, 2014, pp. 70–77.
10. M. Linares-Vásquez et al., "API Change and Fault Proneness: A Threat to the Success of Android Apps," Proc. 9th Joint Meeting Foundations of Software Eng. (ESEC/FSE 13), 2013, pp. 477–487.

ABOUT THE AUTHORS

HAMMAD KHALID is a software engineer at Shopify. His research examines the link between user feedback and software quality. Khalid received a master's in computer science from Queen's University. Contact him at hammad@cs.queensu.ca; http://hammad.ca.

MEIYAPPAN NAGAPPAN is an assistant professor in the Rochester Institute of Technology's Department of Software Engineering. He previously was a postdoctoral fellow in the Software Analysis and Intelligence Lab at Queen's University. His research centers on using large-scale software-engineering data to address stakeholders' concerns. Specifically, he has focused on mining mobile-app-store data to provide actionable recommendations to the various stakeholders of mobile apps. Nagappan received a PhD in computer science from North Carolina State University. He received best-paper awards at the 2012 and 2015 International Working Conference on Mining Software Repositories. Nagappan is the editor of the IEEE Software blog. Contact him at [email protected]; mei-nagappan.com.

AHMED E. HASSAN is the Natural Sciences and Engineering Research Council of Canada / BlackBerry Software Engineering Chair at the School of Computing at Queen's University. His research interests include mining software repositories, empirical software engineering, load testing, and log mining. Hassan received a PhD in computer science from the University of Waterloo. He spearheaded the creation of the International Working Conference on Mining Software Repositories and its research community. Hassan also serves on the editorial boards of IEEE Transactions on Software Engineering, Empirical Software Engineering, and Computing. Contact him at ahmed@cs.queensu.ca.