Which Android Devices Should I Test My Game App On?

Hammad Khalid, Meiyappan Nagappan, Ahmed E. Hassan
Software Analysis and Intelligence Lab (SAIL), Queen's University, Kingston, Canada
{hammad, mei}@cs.queensu.ca, [email protected]

Emad Shihab
Department of Software Engineering, Rochester Institute of Technology, Rochester, USA
[email protected]
ABSTRACT
For Android game app developers, the star ratings that their apps receive from users are directly tied to their revenues. These ratings are given by users who run the game app on one of hundreds of Android devices, which can have very different interfaces and hardware capabilities. Therefore, to achieve (and maintain) a good rating, developers have to test their apps on many devices and fix device-specific problems. This turns out to be very difficult because of the sheer number of Android devices. On analyzing 99 free Android game apps, we find that these apps receive user reviews from 38 to 132 devices. However, most of the user reviews come from a small subset of devices: on average, 33% of the devices account for 80% of the reviews. Thus, testing first on the devices that account for most of the reviews can be a good prioritization scheme for app developers with limited resources. Furthermore, we find that developers of new game apps that have no reviews can use the review data of similar game apps to prioritize their testing efforts. Finally, among the set of devices that generate the most reviews for an app, we find that some devices tend to generate worse ratings than others. Game app developers can take specific corrective actions for these devices.

Keywords
Mobile Applications, Android, Empirical Study, Software Quality

1. INTRODUCTION
Usage of Android devices has grown at a tremendous rate over the past few years [1]. To capitalize on this growth, both small and large companies are developing an enormous number of applications (called mobile apps) designed to run on Android devices. However, most users only download the top-rated or featured apps from the app markets [2]. This makes it very challenging to attract new users, especially for game app developers, who have to compete with almost 120,000 game apps already in the Android market – more than any other category.

To compete in this environment, developers need to get (and maintain) good ratings for their apps. This can be difficult since users are easily annoyed by buggy apps, and that annoyance can lead to bad ratings [3]. Hence, app developers need to test their apps thoroughly on different devices to improve their chances of getting a high rating. Yet there are hundreds of Android devices, each with its own nuances. In fact, dealing with the device-specific issues of the many Android devices is considered one of the biggest challenges developers face when creating an app [4]. A recent survey by Appcelerator showed that developer interest in Android has fallen to 76% this quarter from a high of 87% [5]. A staggering 94% of the developers that avoid working on Android apps cited Android fragmentation as the main reason. Android fragmentation refers to the concern that, since there are many devices with different screen sizes and other hardware specifications, an app that functions correctly on one device might not work on another [6]. Therefore, Android game developers need to carefully prioritize their testing efforts on the most important devices. While there has been some previous research on automated testing for Android apps [7, 8, 9, 10], to the best of our knowledge, there has not been any work on prioritizing testing efforts on the devices which have the most impact. Although market share can help prioritize testing, it is the devices from which more reviews are given that affect the rating of an app, and not the devices on which the app is installed most often (although they may be similar). Therefore, in this study, we analyze 89,239 user reviews of 99 free game apps from various Android devices to help developers prioritize their testing efforts, thus answering the question: Which Android devices should I test my game app on? In addition, we explore how developers can identify the devices that have the most impact on their app ratings. We do this by examining the devices which have the highest percentage of reviews per app. We call this 'review share'. For example, a review share of 10% means that a device gave 10% of all ratings for an app. While we study the reviews of free game apps, individual developers can easily adopt our approach to analyze the reviews of specific apps. We formalize our study in the following three research questions:

RQ1. What percentage of devices account for the majority of user reviews? By examining the devices that users post app reviews from, we find that on average 33% of the devices account for 80% of all reviews given to a free game app. With this information, app developers can prioritize their testing effort by focusing on a small set of devices that have the highest potential review share.
RQ2. How can new developers identify the devices that they should focus their testing efforts on? We find that developers can use the devices with the most review share across all other game apps as a good indicator of which devices they should focus their testing efforts on.

RQ3. Do the star ratings from different Android devices vary significantly? By examining the reviews from different devices for the same apps, we find that some devices give significantly lower ratings. Developers can take corrective actions for such devices, or remove support for them if they see fit.

The remainder of this paper is organized as follows: We survey the related work in Section 2. In Section 3, we discuss in detail the data that we analyze for our study. In Section 4, we discuss our preliminary analysis. In Section 5, we present the results of this study. In Section 6, we perform further analysis on paid apps. In Section 7, we discuss the potential threats to the validity of our work. We conclude our paper in Section 8.

2. RELATED WORK
In this section, we discuss the related work. In particular, we present work related to Android fragmentation, work related to mobile app reviews and app quality, and work related to testing Android apps.

2.1 Work Related to Android Fragmentation
A recent study by Han et al. manually labeled bugs reported for specific vendors of Android devices [6]. Their study found that HTC devices and Motorola devices have their own vendor-specific bugs. They considered this as evidence of Android fragmentation. Ham et al. came up with their own compatibility test to prevent Android fragmentation problems [11]. They focus on code analysis and API pre-testing to identify possible issues. Our study differs from their work since we seek to help game app developers determine which devices they need to test their game apps on. We determine this set of Android devices from the user reviews of game apps.

2.2 Work Related to App Reviews and App Quality
Previous studies focused on how to improve the testing process of mobile apps. Kim et al. [12] look at testing strategies to improve the quality of mobile apps. Agarwal et al. study how to better diagnose unexpected app behaviour [13]. Recent studies have also examined the importance of app reviews. Harman et al. mined information about BlackBerry apps and found that there is a strong correlation between app rating and number of downloads [2]. Kim et al. also found ratings to be one of the key determinants in a user's decision to purchase an app [14]. Linares-Vásquez et al. [15] show that the quality (in terms of change- and fault-proneness) of the APIs used by Android apps negatively impacts their success, in terms of user ratings. Our work complements the aforementioned work since its goal is to use mobile app reviews to help developers determine whether they need to test their game apps on all Android devices. In our previous study, we manually analyzed and tagged reviews of iOS apps to identify the different issues that users of iOS apps complain about [3], in order to help developers prioritize the issues that they should be testing for. Our current study differs from that work in that we focus on Android to identify the different devices that Android game app developers should be focusing on.

2.3 Work Related to Testing Android Apps
Several recent studies have attempted to reduce the testing burden of Android developers. One of these studies is by Hu et al., who suggested techniques for automated test generation and analysis for Android apps [7]. MacHiry et al. presented a dynamic input generation system which can be used by developers for black-box testing [16]. Anand et al. also presented an approach which generates input events to execute different event sequences for an Android app [8]. There has also been previous work which aimed to automate testing in a virtual environment. Amalfitano et al. presented a tool which automatically generates tests for Android apps based on their graphical user interface [9]. DroidMate is another such tool, which uses genetic algorithms to generate input sequences (i.e., user interactions, or simulated events) [10]. Joorabchi et al. [17] examined the challenges in mobile application development by interviewing 12 mobile developers. One of the main findings of their study is that dealing with the device-specific issues of different Android devices and testing apps on these devices remains a challenge for mobile app developers. While testing in a virtual environment is useful for identifying general issues, developers also test their apps on actual Android devices to identify device-specific issues. Our work focuses on helping developers identify the devices which have the most impact on their ratings, and hence where they should focus their testing efforts.

Previous research has confirmed the effect of fragmentation in Android hardware and software. In addition, recent studies have identified the importance of mobile app user reviews. In this paper, we seek to help developers understand how they can prioritize their testing efforts towards the devices which have the most impact on their ratings.

3. STUDY DESIGN
In this section we discuss the data used to conduct our study, the main sources of this data, and the collection process. The main data used in our study are the user reviews of the top 99 Android game apps. We collect these reviews from Google Play, which is the main market for Android apps. After collecting these reviews, we identify the devices that the reviews were posted from, as well as the ratings associated with each review. Figure 1 provides an overview of the process used in our study. The following sections describe the data used in our study and our data collection method in further detail.
3.1 Data Selection
In order to perform our study, we selected 99 free game apps from Google Play, namely the top 99 apps as ranked by Google Play. It is important to note that this list does not just contain the top-rated apps; it also contains popular and trending apps in the category.
3.2 Data Collection
To collect the user reviews, we use a Python screen-scraping framework called Scrapy and a web automation and testing tool called Selenium to implement a web crawler which extracts data such as the app name, the review title, the review description, the device used to review the app, the version of the app being reviewed, and the numerical star rating assigned by the user [18, 19]. This crawler opens a new instance of the Firefox browser, visits the URL of one of the selected apps, and then clicks through the review pages of this app. The required information is located on each page using XPath and then stored in an SQLite database [20, 21].
Figure 1: Overview of our process

We used a similar crawler to collect the URLs of the top apps. This crawler is much more limited and slower than typical web crawlers, since Google Play puts heavy restrictions on crawling. We found this to be the only reliable method for extracting the reviews. All of the data is extracted as an anonymous user. Google Play limits the total number of reviews that a user can see to a maximum of 480 for each of the star ratings (i.e., a user is restricted to viewing 2,400 reviews for each app). However, it also allows users to vote on the helpfulness of the reviews, which filters away spam reviews. We selected the 'most helpful' reviews, since they are more reliable.
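As a rough illustration of this step, the sketch below uses Selenium and SQLite to pull review fields from a loaded page and store them. It is a minimal sketch, not the exact crawler used in this study: the URL and the XPath selectors are placeholders (the real Google Play page structure differs), and the clicking through review pages is omitted.

```python
# Minimal sketch of the review-collection step described in Section 3.2.
# Assumptions: APP_URL and every XPath selector below are placeholders;
# pagination ("clicking through the review pages") is omitted for brevity.
import sqlite3
from selenium import webdriver
from selenium.webdriver.common.by import By

APP_URL = "https://play.google.com/store/apps/details?id=com.example.game"  # placeholder

def scrape_reviews(url):
    driver = webdriver.Firefox()                      # new Firefox instance
    try:
        driver.get(url)
        rows = []
        # Hypothetical selector for a single review block on the page.
        for block in driver.find_elements(By.XPATH, "//div[@class='single-review']"):
            rows.append((
                block.find_element(By.XPATH, ".//span[@class='review-title']").text,
                block.find_element(By.XPATH, ".//p[@class='review-body']").text,
                block.find_element(By.XPATH, ".//span[@class='review-device']").text,
                block.find_element(By.XPATH, ".//div[@class='review-rating']").get_attribute("aria-label"),
            ))
        return rows
    finally:
        driver.quit()

def store_reviews(rows, db_path="reviews.db"):
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS reviews
                   (title TEXT, body TEXT, device TEXT, rating TEXT)""")
    con.executemany("INSERT INTO reviews VALUES (?, ?, ?, ?)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    store_reviews(scrape_reviews(APP_URL))
```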
3.3 Collected Data
For the game apps, we collect reviews of 99 free apps. In total, we collected 144,689 of the most helpful one- to five-star reviews from these apps. We limit our reviews to those given between October 2012 and January 2013, a three-month window, since the rate of churn in the devices being used is very high. From this set of reviews, we only consider the ones that have a device associated with them. This reduces the set of reviews to 89,239.

4. PRELIMINARY ANALYSIS
Prior to delving into our research questions, we perform some preliminary analysis on our dataset as a whole, i.e., using data from all 99 game apps taken together. We perform this analysis to determine whether our review dataset contains reviews from many different devices or from a small set of devices. In other words, we would like to know whether our data contains enough information to help us detect Android fragmentation. We also look at how the reviews are distributed for the apps taken as a whole, on a per-device basis. Using all of the reviews, we identify the number of reviews that each device gave to our selected game apps. We break these reviews down by star rating to identify the number of one- to five-star ratings given by each device. To identify and compare the devices that give the most bad, medium and good reviews, we combine 1 and 2 star reviews into a group which we call 'bad reviews', and combine 4 and 5 star reviews into another group which we call 'good reviews'. We label 3 star reviews as 'medium reviews'. To ensure that this grouping makes sense, we run a sentiment analysis tool over the text of the reviews in these groups to validate that the groupings are appropriate (i.e., 'good reviews' actually contain the most positive reviews and vice versa) [22]. This tool assigns an integer score to the sentiment expressed in a review (where a negative score represents a negative review). In our case, the 'bad reviews' group was assigned a score of -0.32, the 'medium reviews' group a score of 0.44, and the 'good reviews' group a score of 1.25. These scores support our grouping scheme.

We then transform the bad, medium and good reviews given from each device into 'review share'. For example, a review share of 10% for bad ratings means that a device gave 10% of all bad ratings. We then determine what percentage of devices is required to cover X% of the user reviews; in our case, X varies from 0 to 80%. In total, our dataset contains reviews from 194 unique Android devices, and 114 of these devices had more than 100 user reviews. These facts highlight the magnitude of the Android fragmentation problem and the importance of identifying the devices which give the most reviews to apps in order to prioritize the testing effort.

Figure 2: Percent of Android devices used to give X% (X ranges from 0 to 80) of the reviews for free game apps

Figure 2 shows the percentage of devices vs. the cumulative percentage of reviews in the different rating groups. The x-axis of this figure shows the percentage of devices, while the y-axis shows the cumulative percentage of reviews received in each group (i.e., bad, medium, good and all). We find that 19% of the devices cover approximately 80% of the posted user reviews. This finding suggests that developers working on free game apps only need to focus their testing efforts on 19% of the devices to cover the majority (i.e., approximately 80%) of the reviews. Observing Figure 2, we note that the bad, medium and good rating curves are very similar. After comparing how different devices give bad, medium and good reviews, we find that while the devices which gave the most bad ratings were largely the same as the devices which gave the most good ratings, there were a few discrepancies. The set of devices that accounted for 80% of the bad ratings contained four devices that were not in the set of devices that gave the most good reviews. This suggests that there is additional variation in how specific devices rate these game apps.
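The following is a minimal sketch of this bookkeeping, assuming the reviews are loaded into a pandas DataFrame with columns 'app', 'device' and an integer 1-5 'rating' (illustrative names and tooling, not the exact scripts used in this study).

```python
# Sketch of the review-share computation from Section 4.
# Assumed input: a pandas DataFrame `reviews` with one row per review and
# columns 'app', 'device', and an integer 1-5 'rating' (names are assumptions).
import pandas as pd

def rating_group(stars):
    """Map a 1-5 star rating to the paper's bad/medium/good groups."""
    return "bad" if stars <= 2 else ("medium" if stars == 3 else "good")

def review_share(reviews):
    """Per-device review share (percent of all reviews), most-reviewed first."""
    counts = reviews.groupby("device").size().sort_values(ascending=False)
    return 100.0 * counts / counts.sum()

def coverage_curve(reviews):
    """Cumulative review share as devices are added, most-reviewed first
    (the quantity plotted in Figure 2)."""
    share = review_share(reviews)
    percent_of_devices = [100.0 * (i + 1) / len(share) for i in range(len(share))]
    return pd.Series(share.cumsum().values, index=percent_of_devices)

# Example: the coverage curve for the bad reviews only.
# bad_reviews = reviews[reviews["rating"].map(rating_group) == "bad"]
# bad_curve = coverage_curve(bad_reviews)
```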
5. RESULTS
Now that we have determined that our dataset contains reviews from many different Android devices, in this section we answer our research questions. For each research question, we present the motivation for the question, the approach taken to answer it, and the results.

5.1 RQ1 - What percentage of devices account for the majority of user reviews?
Motivation: As mentioned earlier, Android fragmentation is a major concern for developers seeking to develop and test Android apps [11]. There may be hundreds of devices on which developers may need to test their apps. Testing on such a large number of devices is not feasible for developers with limited time and budget. As our preliminary analysis showed, 19% of the devices account for more than 80% of the reviews. Therefore, our goal is to determine what percentage of devices accounts for most of the reviews for each app. For example, we would like to know whether there is a subset of devices that a particular app's developer can test in order to cover most of the devices that users post their reviews from. This analysis is different from our preliminary analysis, which simply grouped the reviews from all apps.

Approach: To answer this question, we use the reviews that we collected for the 99 free game apps. For each review, we determine the device that the review was posted from. For each of the 99 game apps, we calculate the number of unique devices that post a review. Then, we order the devices according to the number of reviews from each of them for a particular app, and determine the number of devices it takes to cover 80% of the reviews.

Findings: Figure 3 shows the number of unique devices per app. We find that the app with the fewest devices posting reviews had 38 different devices, while the app with the most had reviews posted from 132 different devices. The median number of devices from which an app receives reviews is 87. Our results indicate that fragmentation clearly exists even at the app level.

Figure 3: Number of unique Android devices from which an app receives reviews

Figure 4 shows the percentage of devices that account for 80% of the reviews in each of the 99 free game apps. We observe that, on a per-app basis, 80% of the review share can be achieved by considering as few as 22% of the unique devices, and at most 53% of the devices are needed to cover 80% of the reviews. The average percentage of devices required to cover 80% of the review share is 33% (and the median is 32%). This finding indicates that, on average, by focusing their testing efforts on 33% of the devices, an app developer can cover the majority (i.e., 80%) of the reviews.

Figure 4: Percent of Android devices that account for 80% of the reviews for each game app

To better understand the actual number of devices that developers would have to examine, we calculate the number of devices that account for 80% of the reviews for each of the 99 free game apps, shown in Figure 5. We find that a minimum of 13, a maximum of 45, and a median of 30 devices accounted for 80% of the reviews, per app. While these numbers may seem large, one has to keep in mind that Android developers often have to think about testing their apps on hundreds of devices.

Figure 5: Number of Android devices that account for 80% of the reviews for each game app

Discussion: A scenario that we consider is one where a team of developers is working on a new free game app but can only afford to buy 10 devices to test their app on. In such a scenario, the ideal set of devices would be the one that accounts for the most reviews given to game apps. Using our app review data, we generate Figure 6, a stacked area chart that shows the percentage of reviews that the top 10 devices would cover. This figure shows that even a small number of devices, if carefully picked (i.e., by looking at which devices post reviews for the app), can account for a considerable number of reviews – more than half of the reviews in most cases. This leads to the question of how developers without any reviews for their app can pick the best set of top devices to focus their testing efforts on. For example, can a developer use the review share data from all the other apps that are currently present in the game app category? Can they just pick the most popular devices? We explore this issue in RQ2.

Figure 6: Percent of reviews accounted for by the top 10 devices

We find that only a small set of devices is needed to cover 80% of the reviews. On average, 33% of all devices account for 80% of the reviews given to free game apps.
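The per-app numbers behind Figures 3-5 could be computed along the following lines, under the same assumed DataFrame layout as in the earlier sketch (this is illustrative, not the exact analysis code of this study).

```python
# Sketch of the per-app computation behind RQ1 (Figures 3-5): for each app,
# the number of distinct reviewing devices and the number (and percentage) of
# its most-reviewed devices needed to cover 80% of its reviews.
import pandas as pd

def devices_for_coverage(app_reviews, target=80.0):
    counts = app_reviews.groupby("device").size().sort_values(ascending=False)
    cum_share = 100.0 * counts.cumsum() / counts.sum()
    needed = int((cum_share < target).sum()) + 1   # first device that crosses 80%
    return pd.Series({
        "unique_devices": len(counts),
        "devices_for_80": needed,
        "percent_for_80": 100.0 * needed / len(counts),
    })

# per_app = reviews.groupby("app").apply(devices_for_coverage)
# per_app[["unique_devices", "devices_for_80", "percent_for_80"]].median()
```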
5.2 RQ2 - How can new developers identify the devices that they should focus their testing efforts on?
Motivation: Thus far, we have shown that a small percentage of devices makes up the majority of user reviews. Another implication of this finding is that developers need to carefully pick their set of test devices so that they can maximize the effectiveness of their testing. Identifying the optimal set of devices is even more important for developers with limited resources who can only afford a few devices. One method of picking these devices is to aggregate the review data of every app in a category and select the devices with the most review share. This method can provide developers with a working set to start from, which they can later augment with other devices. In this RQ, we examine the effectiveness of this method.

Approach: To answer this RQ, we examine the effectiveness of selecting the top 10 most popular devices in the game app category. Working with our dataset of 99 free game apps, we identify the set of 10 devices with the most review share for each app, and compare it with the set of 10 devices with the most review share in the remaining 98 apps combined. Doing so allows us to simulate an app developer who picks the top 10 devices that provide reviews for other apps and uses these devices to test his or her own app. Since we have the reviews for that app, we use them to measure the review share that would be covered for the app by the selected 10 devices. In other words, this analysis allows us to identify the devices which were in the top set of devices for an app, but not in the set of top devices of the rest of the group (essentially a set difference). If such devices are identified, we sum their review share to highlight the review share the developer would have missed had they picked the top 10 devices of the overall category.

Findings: After comparing the top 10 devices of each app with the top 10 devices of the whole set of game apps, we count the number of the app's top devices that were not in the overall top set. This count indicates the devices that a developer would have missed had they picked the top 10 overall devices. Figure 7 (right box) shows the distribution of these missed devices for the 99 free game apps. We observe that the median number of missed devices is 3, with a maximum of 5 and a minimum of 0. This finding suggests that such a prioritization technique works well overall (i.e., in the majority of cases the developer will identify the 7 most review-producing devices). Next, we want to know what percentage of the review share would be impacted by these missed devices. Figure 8 (right box) contains a box plot that shows the percentage of review share impacted by the missed devices. The median missed review share is 7.1%. This is not a large loss in review share, especially given that, using this technique (i.e., using devices that have a large review share in their category), developers have the benefit of picking an adequate set of devices even before releasing their app to the market.

Discussion: While identifying the devices with the most review share within an app's category is useful, some developers may opt to pick the devices with the most overall market share (which is posted on many mobile analytics websites, e.g., App Brain) to prioritize their testing. Market share is generally determined by the number of active Android devices. To compare this naive approach of simply using market share with our approach, which uses the reviews, we compare the devices with the most market share to the devices with the most review share of game apps. We obtain the list of devices with the most market share from App Brain, which gives us a list of the top 10 devices (shown in Table 1). Figure 7 (left box) shows the distribution of missed devices for the 99 free game apps when the market share devices are considered. We find that the median number of devices missed when using the market share devices is 4, which is higher than when using review share in the same category. Moreover, Figure 8 (left box) shows the distribution of the percentage of review share impacted by the missed devices. Once again, the median value here, 9.8%, is higher than the 7.1% median obtained when the review share devices are used. The difference in the number of missed devices and in the missed review share between choosing an order based on market share and choosing an order based on the review share of all the remaining apps in the game category is statistically significant (p-value << 0.01 for a paired Mann-Whitney U test). Our findings suggest that simply using the market share is not enough, and that our technique, which uses reviews from apps in the same category, can improve the effectiveness of device prioritization. While examining the top 10 devices with the most review share in the game category, we notice that not all of the devices rate apps the same way. This makes us wonder whether some specific devices give worse ratings than others, and thus need special attention from developers. We explore this question in RQ3.

Figure 7: Number of devices missed if the overall top game devices are picked
Figure 8: Percent of review share missed if the overall top game devices are picked

Table 1: The market share of the most popular Android devices and their review share

Top market share device      Market share (%)
Samsung Galaxy S3            9.4
Samsung Galaxy S2            7.8
Samsung Galaxy S             2.6
Samsung Galaxy Ace           2
Samsung Galaxy Note          1.9
Samsung Galaxy Y             1.9
HTC Desire HD                1.5
Asus Nexus 7                 1.1
Samsung Galaxy Tab 10.1      1.1
Motorola Droid RAZR          1.1

We find that developers can use the devices with the most review share, collected from all of the existing game apps, as a good indicator of which devices they should test their particular game app on.
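A sketch of this leave-one-out simulation is shown below, under the same assumed DataFrame layout as in the earlier sketches. The market-share list is whatever externally obtained top-10 list is used (e.g., Table 1), and the paired comparison is illustrated with a Wilcoxon signed-rank test as a stand-in for the paired test reported above; treat the block as illustrative rather than the exact analysis code of this study.

```python
# Sketch of the RQ2 simulation: compare an app's own top-10 devices against
# the top-10 devices of (a) all other game apps and (b) a market-share list,
# and measure the missed devices and the lost review share.
import pandas as pd
from scipy import stats

def top_devices(reviews, k=10):
    return set(reviews.groupby("device").size().nlargest(k).index)

def missed(app_reviews, candidate_devices, k=10):
    counts = app_reviews.groupby("device").size()
    own_top = set(counts.nlargest(k).index)
    missed_devices = own_top - set(candidate_devices)        # set difference
    lost_share = 100.0 * counts.loc[list(missed_devices)].sum() / counts.sum()
    return len(missed_devices), lost_share

def simulate(reviews, market_share_top10):
    rows = []
    for app, app_reviews in reviews.groupby("app"):
        others = reviews[reviews["app"] != app]               # leave-one-out
        rows.append({
            "app": app,
            "lost_share_review_based": missed(app_reviews, top_devices(others))[1],
            "lost_share_market_based": missed(app_reviews, market_share_top10)[1],
        })
    result = pd.DataFrame(rows)
    # Paired comparison of the two strategies across apps.
    _, p_value = stats.wilcoxon(result["lost_share_market_based"],
                                result["lost_share_review_based"])
    return result, p_value
```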
5.3 RQ3 - Do the star ratings from different Android devices vary significantly?
Motivation: Since different devices have varying specifications, it could be the case that users of different Android devices perceive and rate the same app differently (i.e., with different one to five star ratings). Understanding how different devices rate apps will allow developers to understand why their apps were given the ratings that they received (i.e., is it really about the device the user has and not the app itself). If different devices give varying ratings, this would imply that not all reviews of apps should be treated the same way. With this kind of information, the app developer can do one of two things: 1) they can allocate even more testing effort to devices that give a poor rating, when the revenue from those devices is critical, or 2) if there are not enough users of the app on a particular device, they can manually exclude that device from the list of supported devices on Google Play [23]. In either case, knowing whether certain devices give worse ratings than others will help developers prioritize their testing effort even further.

Approach: We separately compare the bad (1 and 2 star), medium (3 star) and good (4 and 5 star) reviews from the 20 devices with the most review share across the 99 free game apps. For example, to compare the ratings of the devices that rated game apps poorly, we create a table where the columns are the 99 game apps and the rows are the ratings from the top 20 devices. Each cell in the table holds a bad-ratings:all-ratings ratio (the number of 1 and 2 star reviews over all reviews) given to an app by a specific device. Table 2 is an example of such a table. For instance, the ratio 21:100 in the 'Samsung Galaxy S3 x Angry Birds' cell means that 21 out of every 100 ratings that the S3 device gave to the Angry Birds game were bad (i.e., 1 and 2 star reviews). Similarly, 33 out of every 100 ratings from the 'Samsung Galaxy S2' device for the 'Temple Run' game were bad. To compare how the top review-share devices give bad, medium and good ratings, we create three 20x99 tables for the free game apps. We illustrate the bad-rating table for free apps using the box plots in Figure 9, which allow us to easily compare the distributions of bad ratings for the 99 apps from each of the top 20 devices. We then test the statistical significance of the differences in ratings from these devices using Tukey's honestly significant difference (HSD) test [24]. This is a multiple comparison test which compares the mean bad-rating ratio of each device with the means of every other device. This process is repeated for the medium and good ratings as well. The test assigns a group to each device based on how its ratings compare to the other devices. Devices which are not in a common group give significantly different reviews; for devices that share a group, we cannot say that their ratings differ significantly. For example, if 'Samsung Galaxy S3' and 'Asus Nexus 7' are assigned to a common group, then we cannot say that these devices have significantly different ratings.

Findings: We illustrate the difference in ratings from each device with the box plot in Figure 9. The horizontal axis shows the distribution of the ratio of bad ratings given from these devices to the 99 free game apps (i.e., bad:total ratings). The vertical axis shows the 20 devices with the most reviews, sorted by the medians of the values on the horizontal axis; the devices at the bottom of the plot gave the highest percentage of bad reviews. This figure indicates that some devices, such as the 'Motorola Droid X2', give far more bad ratings to apps than others. On the other hand, some devices, such as the 'Samsung Galaxy Y', 'Asus Nexus 7' and 'Samsung Galaxy S3', give far fewer bad ratings than other devices. We also note that there are a few outlier apps which received many more bad ratings than others. By investigating these outliers, we discover that they differ across devices (i.e., it is not the same apps that receive the terrible ratings from the different devices). For example, 92% of all ratings given from the 'HTC Sensation 4G' device to the 'Tap Tap Revenge 4' game app are bad ratings, while the median percentage of bad ratings given by the same device is 37.5%. None of the other top devices gave this app such poor ratings. Thus, this particular app simply does not work well on this device. On further examination, we find many complaints by users of this device on the forum of the developer of 'Tap Tap Revenge 4'. Examining the reviews from this device, we see that the app crashes after most major events (e.g., a song ending in the app) [25].

To verify the statistical significance of our observations, we use Tukey's HSD test and discover that some devices give statistically significantly different ratings than others. One example of such a device is the 'Motorola Droid X2'. We find that this device has a significantly higher ratio of bad ratings than the devices which give the fewest bad ratings (i.e., Samsung Galaxy Y, Samsung Galaxy Ace). The same test shows that this device has the lowest ratio of good ratings. A reason behind these poor ratings could be manufacturer-specific problems; a recent study by Han et al. provided evidence of vendor-specific problems when they compared bug reports of HTC devices with those of Motorola devices [6]. A report on this device from Android Police (an Android-dedicated blog) describes its sluggish performance and poor screen resolution [26]. This may be the main issue with this device, since most game apps require good screen resolution and performance. Tukey's HSD also shows that the ratios of medium ratings from the different devices are not statistically different. Since the ratio of medium ratings does not vary, the ratio of good ratings is directly determined by the ratio of bad ratings (i.e., if the ratio of bad ratings from a device is high, its ratio of good ratings will be low).
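A minimal sketch of this comparison is shown below: it builds the per-device, per-app bad:all ratio (as in Table 2) and applies Tukey's HSD via statsmodels' pairwise_tukeyhsd. Note that this routine reports pairwise comparisons rather than the letter-style groups described above, and the DataFrame layout is the same assumption as in the earlier sketches.

```python
# Sketch of the RQ3 analysis: one bad:all ratio per (device, app) cell,
# then a Tukey HSD comparison of the devices' mean bad-rating ratios.
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def bad_ratio_cells(reviews, top_devices):
    subset = reviews[reviews["device"].isin(top_devices)].copy()
    subset["is_bad"] = subset["rating"] <= 2
    # Mean of the 0/1 indicator per (device, app) = the bad:all ratio of that cell.
    return subset.groupby(["device", "app"])["is_bad"].mean().reset_index()

def compare_devices(reviews, top_devices):
    cells = bad_ratio_cells(reviews, top_devices)
    result = pairwise_tukeyhsd(endog=cells["is_bad"].values,
                               groups=cells["device"].values)
    return result.summary()      # rows with reject=True differ significantly

# top20 = reviews.groupby("device").size().nlargest(20).index
# print(compare_devices(reviews, top20))
```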
Table 2: An example of a table used to compare bad ratings given by devices

Device               Angry Birds    Temple Run
Samsung Galaxy S3    21:100         19:100
Samsung Galaxy S2    29:100         33:100

Figure 9: Percentage of bad ratings given from each device to free game apps
Discussion: Our findings imply that developers should be aware that a few devices may give significantly worse reviews than others. These bad ratings may be given because the device itself generally gives poor ratings (e.g., the 'Motorola Droid X2'), or because the app does not work well on a particular device (as indicated by the outliers). In either case, developers aware of this finding can specifically address the concerns expressed in the reviews from such devices (i.e., do detailed testing) or remove support for such devices. We are not suggesting that developers only need to test on devices that give statistically different star ratings than others; developers just need to devote specific attention to problematic devices. Monitoring how different devices rate apps can reveal devices which are bringing down the overall rating of an app. Additionally, our findings suggest that the user's perception of the quality of an app depends on the device which runs it; hence, research on testing Android apps should factor in the effect of the device.

Another possible reason why some devices give worse ratings than others could be that those devices are simply older. Older devices also tend to have worse hardware specifications than newer devices, so the performance difference may be the main reason for these bad ratings. To test this theory, we identify the release dates of the 20 devices with the most review share and compute the correlation between the release dates and each device's median ratio of bad ratings to all ratings. Using the Spearman correlation test, we find a correlation of -0.61. Since the correlation is moderate, it implies that, for the top 20 devices, the age of the device may be a factor in bad ratings. It is important to note that what we are observing is a correlation, not causation.

For the devices which give the most reviews to game apps, we find statistical evidence suggesting that some of these devices give worse ratings than others. This implies that game app developers should be aware of such devices so they can take appropriate actions.
5.4 Implications for Developers
While we think that the results of this study will be useful for developers, device usage may have changed by the time most developers read this study, because of the rapid growth and evolution of the mobile industry, where newer devices are constantly being released [1]. Thus, developers should focus on our general findings and approach rather than on the device-specific results of this study. For example, developers can take away the idea of prioritizing their testing efforts on a subset of impactful devices rather than all devices. Moreover, they can use our technique of analyzing reviews per device to identify devices which consistently give poor ratings. On the other hand, app ratings from specific devices and the list of devices with the most review share will most likely change. While it is ultimately up to developers to decide how they are going to prioritize their testing efforts, we think that focusing on the devices which have the most impact on the app's ratings is an effective approach. This is because these ratings are directly linked to how new users perceive apps, the number of downloads [2], and thus the revenue generated by apps.
6. COMPARISON WITH PAID GAME APPS
While the findings of the study so far are most relevant for developers of free game apps, we want to see whether they hold for paid game apps as well. Although it is likely that many devices review the paid game apps too, we believe that there will be a more even distribution of reviews among the devices, because users of paid apps may be more inclined to give reviews irrespective of the device, since they paid for the app. Using the same method that we used to collect the reviews of the free game apps, we gather reviews of the top 100 apps in the paid game app category. We limit these reviews to those given after October 2012 and before January 2013 (as for the free game apps). Similar to RQ1, we identify the percentage of devices needed to account for the majority of the reviews given to the paid game apps. We use the same method as in RQ2 to determine the better way to choose the top 10 devices to test on: based on the reviews of other apps, or based on market share. We use the same method as in RQ3 to identify whether ratings from different devices vary for paid game apps.

What percentage of devices account for the majority of user reviews? Figure 10 shows the number of unique devices that review each paid game app. We can see that there are fewer devices for the paid game apps (median of 50 devices) than for the free game apps (median of 87 devices). Figure 11 shows the percentage of devices vs. the cumulative percentage of reviews in the different rating groups for all the paid game apps taken together. We find that only 10.3% of the devices are needed to cover 80% of the reviews given. It takes 11.8%, 10.3% and 9.27% of the devices to cover 80% of the bad, medium and good reviews given to paid game apps, respectively. Figure 12 shows the percentage of devices that are required to cover 80% of the reviews for each of the apps individually. We observe that the range of these percentages is wider for paid game apps than for free game apps: the minimum percentage is 16.7%, whereas the maximum percentage is 69.2%. This finding suggests that developers working on paid apps should be even more attentive to their analytics, since in some cases a select few devices have a huge impact on their ratings.

Figure 10: Number of unique Android devices that review each paid game app
Figure 11: Percent of Android devices used to give 80% of the reviews for all paid game apps
Figure 12: Percent of Android devices used to give 80% of the reviews for each of the paid game apps

How can new developers identify the devices that they should focus their testing efforts on? From Figure 13 and Figure 14, we can see that it is indeed more beneficial to target devices based on the review share of the other paid game apps instead of using the market share. The difference in both cases is statistically significant (p-value << 0.01 for a Mann-Whitney U test). This result is similar to the result for free game apps. However, one noticeable difference is that the number of devices missed (median of 5 devices) and the review share missed (median of 15.9%) for paid apps when using the market share data are higher than in the case of free game apps (where the median values are 4 devices and 9.8%). Thus, in the case of paid game apps, the market share data is even less helpful for identifying the devices to test the app on first.

Figure 13: Number of devices missed if the overall top game devices are picked
Figure 14: Percent of review share missed if the overall top game devices are picked

Do the star ratings from different Android devices vary significantly? For the paid game apps, we illustrate the variation of ratings with the box plot in Figure 15. This box plot shows that, similar to the free game apps, paid apps also receive varying ratings from certain devices. The statistical significance of these variations is confirmed by Tukey's HSD test. For example, the 'Asus Nexus 7' and the 'Samsung Galaxy S3' devices give significantly better ratings to paid game apps than many of the other devices (e.g., Motorola Xoom, EeePad TF101, HTC Evo 4G). We also find that the devices which give a higher percentage of bad ratings to paid apps differ from the devices that give a higher percentage of bad ratings to free game apps; the Motorola Xoom and EeePad TF101 are not even in the top 20 devices that review the free game apps. Thus, developers need to be careful about using free game apps to prioritize their test effort for paid game apps, as the users can vary.

Figure 15: Percentage of bad ratings given from each device to paid game apps

Trends and results for the free game apps hold for the paid game apps as well. However, the argument for prioritization is more pronounced in the case of paid game apps. Therefore, paid game app developers can save more testing effort by prioritizing their efforts based on the share of reviews from a device.
7. THREATS TO VALIDITY
In this section we discuss the perceived threats to our work and how we address them.

7.1 Threats to Construct Validity
This type of threat considers the relationship between theory and observation, in case the measured variables do not measure the actual factors. Since we collected reviews from Google Play, our selection was limited to the most helpful (as determined by Google Play) subset of the total reviews. However, since these reviews are the most recent, they allow us to accurately measure the most recent trends in the Android market. Analysis over all of the reviews would not be representative, as it may include older phones which are no longer relevant. We compared the bad, medium and good reviews given from different devices to identify whether certain devices give different (and worse) ratings than other devices. It could be the case that it was not the specific device that led to poor ratings, but the specific version of the Android operating system that the device was running. Since the reviews themselves do not contain any information about the Android OS version, our approach can be used as a first step to identify problematic devices. Additionally, each device is by default tied to a specific Android OS version. Hence, our conclusions from the data collected about the devices are not incorrect: developers can still benefit from prioritizing testing efforts based on ratings from devices, even though the real underlying reason may be the OS used on the device. Further studies need to be performed to know whether prioritization should be done at the OS level instead of the device level.

7.2 Threats to Internal Validity
This type of threat refers to whether the experimental conditions make a difference or not, and whether there is sufficient evidence to support the claim being made. Since we limited our reviews to those from a few months (i.e., October 1st, 2012 till January 15th, 2013), our data may not accurately represent ratings for the entire year or the entire life of the devices. However, to mitigate this threat, we made sure to apply statistical tests, where applicable, to ensure that our findings are statistically significant. All of our findings are derived from user reviews. In certain cases, user reviews may not directly correlate with other measures of quality, such as defects. However, prior research showed that user reviews are directly correlated with app revenues (even for free apps, which make their money through ads). Therefore, we believe that user reviews are a good proxy for quality.

7.3 Threats to External Validity
This type of threat considers the generalization of our findings. Our work on fragmentation focused largely on Android devices and hence may not generalize to all mobile devices. While our results are specific to Android devices, the method of analyzing reviews can be applied to other platforms which have numerous devices that rate apps. In addition, since our study was performed on 99 mobile apps, our results may not generalize to all game apps. To address this threat, we picked apps which are labeled as 'Top apps' by Google Play. We feel that these apps are an appropriate representation of the apps in the game category, and a better choice than hand-picking apps.

7.4 Threats to Conclusion Validity
We assume that testing apps on certain devices will help find the device-specific bugs, which can be fixed by the developer, thereby improving the quality and hence the revenues of the app. Even though it may seem logical, it is still an assumption. However, this assumption (testing finds bugs that can be fixed to improve quality) is the basis for any software team that tests a software application. The software testing community focuses on improving testing practices in the hope of improving quality; we do not verify whether our test effort prioritization actually improves quality or increases revenues. This is similar to other test effort prioritization research that only focuses on improving the prioritization and not on whether the end quality is improved (the large body of literature on defect prediction is an example). Conducting such a verification experiment, although valuable, is out of the scope of this paper. Additionally, we have been careful to place our conclusions in the proper context: our approach can help in prioritizing testing effort (better than prioritizing based on market share data, which is hard to come by and questionable, due to the lack of knowledge on how it is calculated). We are careful not to make any claims of improving quality or revenues.

8. CONCLUSION
This study seeks to help game app developers deal with Android fragmentation by focusing their testing efforts on the devices which have the most impact on their app ratings. By studying the reviews of game apps, we find that only a small percentage of devices accounts for most of the reviews given to apps. Thus, developers can focus their testing efforts on a small set of devices. New developers who do not yet have reviews for their own app can use data from the remaining apps in the game app category to prioritize their testing effort. We also find that some devices give significantly worse ratings than others, so developers can identify particularly problematic devices and prioritize their testing efforts even further. Finally, we find that the results and conclusions from the free game apps generalize to the paid game apps as well. Developers can adopt our approach of analyzing Android app reviews to alleviate the testing challenge brought on by Android fragmentation. In future work, we would like to investigate the problems reported from each device, and why Android devices have certain kinds of problems.
References
[1] L. Goasduff and C. Pettey, "Gartner says worldwide smartphone sales soared in fourth quarter of 2011 with 47 per cent growth," Gartner, Inc., 2012.
[2] M. Harman, Y. Jia, and Y. Zhang, "App store mining and analysis: MSR for app stores," in Proceedings of the 9th Working Conference on Mining Software Repositories (MSR '12), Zurich, Switzerland, June 2012.
[3] H. Khalid, "On identifying user complaints of iOS apps," in Proceedings of the 2013 International Conference on Software Engineering (ICSE '13). IEEE Press, 2013, pp. 1474-1476.
[4] M. E. Joorabchi, A. Mesbah, and P. Kruchten, "Real challenges in mobile app development."
[5] H. Bodden, "Android's fragmentation problem," November 2012. [Online]. Available: http://greyheller.com/Blog/androids-fragmentation-problem
[6] D. Han, C. Zhang, X. Fan, A. Hindle, K. Wong, and E. Stroulia, "Understanding Android fragmentation with topic analysis of vendor-specific bugs," in 19th Working Conference on Reverse Engineering (WCRE 2012). IEEE, 2012, pp. 83-92.
[7] C. Hu and I. Neamtiu, "Automating GUI testing for Android applications," in Proceedings of the 6th International Workshop on Automation of Software Test. ACM, 2011, pp. 77-83.
[8] S. Anand, M. Naik, M. J. Harrold, and H. Yang, "Automated concolic testing of smartphone apps," in Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering. ACM, 2012, p. 59.
[9] D. Amalfitano, A. R. Fasolino, P. Tramontana, S. De Carmine, and A. M. Memon, "Using GUI ripping for automated testing of Android applications," in Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering. ACM, 2012, pp. 258-261.
[10] Jamrozik and Gross, "DroidMate: Fully automatic testing of Android apps," September 2013. [Online]. Available: http://www.droidmate.org/
[11] H. Ham and Y. Park, "Mobile application compatibility test system design for Android fragmentation," Software Engineering, Business Continuity, and Education, pp. 314-320, 2011.
[12] H. Kim, B. Choi, and W. E. Wong, "Performance testing of mobile applications at the unit test level," in IEEE Int'l Conf. on Secure Software Integration and Reliability Improvement (SSIRI '09), 2009, pp. 171-180.
[13] S. Agarwal, R. Mahajan, A. Zheng, and V. Bahl, Diagnosing mobile applications in the wild. ACM Press, 2010, pp. 1-6.
[14] H.-W. Kim, H. L. Lee, and J. E. Son, "An exploratory study on the determinants of smartphone app purchase," in The 11th International DSI and the 16th APDSI Joint Meeting, Taipei, Taiwan, July 2011.
[15] M. Linares-Vásquez, G. Bavota, C. Bernal-Cárdenas, M. Di Penta, R. Oliveto, and D. Poshyvanyk, "API change and fault proneness: a threat to the success of Android apps," in Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2013), 2013, pp. 477-487.
[16] A. MacHiry, R. Tahiliani, and M. Naik, "Dynodroid: An input generation system for Android apps," Technical report, 2012.
[17] M. E. Joorabchi, A. Mesbah, and P. Kruchten, "Real challenges in mobile app development," to appear, 2013.
[18] "Scrapy - an open source web scraping framework for Python," June 2012. [Online]. Available: http://scrapy.org/
[19] "Selenium - web browser automation," June 2012. [Online]. Available: http://seleniumhq.org/
[20] J. Clark, S. DeRose et al., "XML Path Language (XPath) version 1.0," 1999.
[21] D. R. Hipp and D. Kennedy, "SQLite," Available from: http://www.sqlite.org, 2007.
[22] "SentiStrength - sentiment strength detection in short texts - sentiment analysis, opinion mining," July 2013. [Online]. Available: http://sentistrength.wlv.ac.uk/
[23] "Using the device availability dialog - Android developer help," February 2013. [Online]. Available: https://support.google.com/googleplay/android-developer/answer/1286017
[24] J. Hsu, Multiple Comparisons: Theory and Methods. Chapman & Hall/CRC, 1996.
[25] "Tap Tap Revenge 4 keeps closing after I complete a song," September 2013. [Online]. Available: https://getsatisfaction.com/tapulous/topics/tap_tap_revenge_4_keeps_closing_after_i_complete_a_song
[26] "Motorola Droid X2: Even raw horsepower can't save this phone," September 2013. [Online]. Available: http://www.androidpolice.com/2011/05/28/review-motorola-droid-x2-even-raw-horsepower-cant-save-this-phone-from-mediocrity/