Predicting Popularity of Online Distributed Applications: iTunes App Store Case Analysis

Miao Chen
School of Information Studies, Syracuse University
[email protected]

Xiaozong Liu
School of Information Studies, Syracuse University
[email protected]
ABSTRACT
Online distributed applications are becoming increasingly important to users. A growing number of individuals and companies develop applications and sell them online. In the past couple of years, Apple Inc. has successfully built an online application distribution platform, the iTunes App Store, which is facilitated by its fashionable hardware such as the iPad and iPhone. Unlike traditional selling networks, iTunes has some unique features for advertising applications, for example daily application rankings, application recommendations, free trial versions, application updates, and user comments. All of these make us wonder what makes an application popular in the iTunes store and why users are interested in specific types of applications. We plan to answer these questions by using machine learning techniques.

Keywords
Online distributed application, machine learning, data mining, popularity

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
iConference 2011, February 8-11, 2011, Seattle, WA, USA
Copyright © 2011 ACM 978-1-4503-0121-3/11/02…$10.00

1. INTRODUCTION
In the past 20 years, users have grown accustomed to and satisfied with online shopping, and the success of major online shopping systems such as Amazon and eBay shows that online shopping is becoming a strong competitor, as well as a complement, to traditional shopping outlets. Only recently, with the success of the iPhone and iPad, have online distributed applications been widely accepted by end users. Apple Inc. launched the iTunes service in April 2003, and in the years since, the iTunes Application Store has become one of Apple’s most profitable services. On this platform, developers can set the price of their application (a paid app) or make it free (a free app). By January 5, 2010, 3 billion applications had been downloaded from iTunes. As in other online distribution systems, user attention often follows a power law (Nazir et al., 2008), with most content getting only a few downloads while a few items receive most of the attention (Wu & Huberman, 2007).

For this paper, we investigate 102,337 applications from the iTunes App Store. By tracking daily rankings, application properties, and user ratings and comments, we want to answer two research questions: 1) What makes an application popular? 2) Can we predict the popularity of new and existing applications? To answer these questions, we will employ the Classification And Regression Tree (CART) (Breiman et al., 1984; Steinberg & Colla, 1995) classification algorithm to analyze a list of innovative numeric and textual features.

2. RELATED WORK
The popularity of objects such as software applications and webpages has been measured in two ways: 1) network-based measurement, which measures an object’s popularity based on its connections with other related objects, for instance the webpage popularity ranking PageRank (Page et al., 1998) and ontology popularity (Sabou et al., 2006); 2) feedback-based measurement, which measures popularity based on user feedback such as votes, ratings, and comments (Cha et al., 2007). In our study we adopt the second kind of measurement, taking the ranking of an application, as ranked by users, as its popularity.

There have been a few studies on iPhone applications, and fewer on iPad applications because the iPad is new (it came out in 2010). A number of iPhone studies have been about using iPhone applications for educational purposes, e.g. teaching fundamental computer concepts by using iPhone games (O’Rourke et al., 2010). Kim et al. (2010) studied factors that affect smartphone application developers’ intention to develop applications frequently, in the context of platform business, which enables third-party developers to distribute and possibly profit from their applications. Our study will integrate features from different perspectives, including features of application information and features of user-contributed content. To the best of our knowledge, some of the features we take into account have not been used in previous studies, such as whether an application has a free version, application update information, and user comment sentiment.

3. DATA
We collected data in two steps. First, 102,337 applications were sampled from the iTunes application databases, with their application name, provider, release date, and category extracted (see Table 1). Second, a list of dynamic features was collected every day by tracking the applications' daily rankings (top 200 paid and free applications from the general and categorical rank lists). A variety of categories of iTunes applications were collected (shown in Table 1), and three groups of features were collected for our machine learning purpose: static, dynamic, and comments features (shown in Table 2).
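The daily tracking step of the data collection can be illustrated with a small sketch. It assumes each day's rank list is available as an ordered list of application identifiers (best rank first) and folds those lists into per-application current rank, peak rank, and number of days listed. The names `RankTrack` and `track_ranks`, and the input layout, are hypothetical and chosen only for illustration; they do not describe the collection code actually used for this study.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RankTrack:
    """Rank history of one application on one daily list (hypothetical structure)."""
    current_rank: Optional[int] = None  # rank on the latest day; None if absent that day
    peak_rank: Optional[int] = None     # best (numerically smallest) rank seen so far
    days_listed: int = 0                # number of days the application appeared


def track_ranks(daily_lists: List[List[str]], top_n: int = 200) -> Dict[str, RankTrack]:
    """Fold daily top-N lists (best application first) into per-application rank features."""
    tracks: Dict[str, RankTrack] = {}
    for day in daily_lists:
        # Map each application on today's list to its 1-based rank.
        todays = {app_id: pos for pos, app_id in enumerate(day[:top_n], start=1)}
        # Applications seen earlier but missing today have no current rank.
        for app_id, track in tracks.items():
            if app_id not in todays:
                track.current_rank = None
        for app_id, rank in todays.items():
            track = tracks.setdefault(app_id, RankTrack())
            track.current_rank = rank
            track.days_listed += 1
            if track.peak_rank is None or rank < track.peak_rank:
                track.peak_rank = rank
    return tracks


# Toy example with three days of a very short rank list.
days = [["a", "b", "c"], ["b", "a"], ["b", "d"]]
tracks = track_ranks(days)
```

In this sketch the count of days listed is kept alongside the ranks because it is cheap to maintain in the same pass; whether such a count is a useful proxy for how long an application stays on a list is an assumption, not a result.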
Table 1. Categories of Sample from iTunes App Store.

| Category      | Number of Apps | Category          | Number of Apps |
| Books         | 17227          | Navigation        | 4691           |
| Business      | 2036           | Photography       | 3996           |
| Education     | 3604           | Productivity      | 3280           |
| Entertainment | 6651           | Reference         | 1265           |
| Finance       | 3257           | Social Networking | 3238           |
| Games         | 30956          | Sports            | 2966           |
| News          | 4971           | Travel            | 5499           |
| Lifestyle     | 2354           | Utilities         | 3498           |
| Medical       | 3393           | Weather           | 947            |
| Music         | 982            |                   |                |

Table 2. Features.

| Feature group    | Feature name            | Description                                |
| Static features  | Name                    | Navigation                                 |
|                  | Provider                | Developer name                             |
|                  | Category                | E.g. Games, Books, and Travel              |
|                  | Release date            | The date that the application is released  |
| Dynamic features | Current rank            | Rank (paid or free) in a specific category |
|                  | Peak rank               | The peak rank that this app has achieved   |
|                  | Most recent update date | Most recent version release date           |
|                  | Current version         | Current version, e.g. 1.5.3                |
|                  | Current rate            | Current version (average) rate             |
|                  | Current rate count      | Count for current rate                     |
|                  | All version rate        | Average rate for all versions              |
|                  | All version count       | Count of rate for all versions             |
|                  | Current App Description | Most recent application description        |
| Comments feature | User rate               | User rate for this application             |
|                  | Comment title           | Comment title                              |
|                  | Comment content         | Comment content                            |

4. METHODOLOGY
Our methodology is based on machine learning: we generate a model of the associations between application popularity (or ranking stability) and the three feature groups, as well as other innovative features. More specifically, three steps are involved in the methodology: 1) identify useful features using data mining and text mining techniques; 2) train a list of CART models to derive a model that predicts application popularity; 3) evaluate model performance using testing data.

We will track three months of daily top 200 iTunes application lists (paid and free, for general and categorical ranks). The data will be split into training and testing groups, for model building and model evaluation respectively.

5. PRELIMINARY OBSERVATIONS AND FUTURE WORK
We made some preliminary observations on the data set. Interestingly, we find that a top rank among paid applications is not necessarily closely correlated with customer ratings. For instance, the average ratings of the top 200 paid apps are listed below (sampled over 3 days):

[Table: average rate of the top 200 paid apps by rank. Rank boundaries: 0, 50, 100, 150, 200. Average rates: 3.489, 3.795, 3.418, 3.846.]

Besides the features listed in Table 2, we find that a large percentage of top-ranked paid applications have a free trial version. For instance, 26.0% of the top 50 paid applications have a trial version, while 24.7% of the top 150 paid applications have a free download version. The ranks of those free trial applications are also available to help us better understand the contribution of “free trial”.

In the future we will also work on the following problems to better understand the causes of popularity as well as other concepts such as application stability: 1) Some applications stay on the top list for a long time, while others drop off after a short period, indicating that applications differ in stability. We can also rank and predict stability from the collected data. 2) Price reduction may help an application come back to the top. 3) Application provider authority may contribute to the popularity of the application. 4) The descriptions and comments (textual features) may help us predict the popularity of the application. Sentiment analysis and opinion mining will be used to extract these kinds of textual features. 5) Updating to a higher version may help an application get a better rank.

6. REFERENCES
[1] Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees. Pacific Grove: Wadsworth.
[2] Cha, M., Kwak, H., Rodriguez, P., Ahn, Y., & Moon, S. (2007). I tube, you tube, everybody tubes: analyzing the world's largest user generated content video system. Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement. San Diego, CA.
[3] Kim, H.J., Kim, I., & Lee, H.G. (2010). The success factors for App Store-like platform business from the perspective of third-party developers: An empirical study based on a dual model framework. Proceedings of Pacific Asia Conference on Information Systems 2010, 242-283.
[4] Nazir, A., Raza, S., & Chuah, C. (2008). Unveiling Facebook: A measurement study of social network based applications. Proceedings of the 8th ACM SIGCOMM Conference on Internet Measurement, 43-56.
[5] O’Rourke, J., MacDonald, I., & Goldschmidt, D. (2010). Learning computer science concepts using iPhone applications. Journal of Computing Sciences in Colleges, 25(6), 121-128.
[6] Page, L., Brin, S., Motwani, R., & Winograd, T. (1998). The PageRank citation ranking: Bringing order to the web. Technical Report, Stanford Digital Library Technologies Project.
[7] Sabou, M., Lopez, V., Motta, E., & Uren, V. (2006). Ontology selection: Ontology evaluation on the real semantic web. Proceedings of the 4th International Evaluation of Ontologies on the Web Workshop. Edinburgh, UK.
[8] Steinberg, D., & Colla, P. (1995). CART: Tree-structured non-parametric data analysis. San Diego, CA: Salford Systems.
[9] Wu, F., & Huberman, B.A. (2007). Novelty and collective attention. Proceedings of the National Academy of Sciences, 104 (45), 17599-17601.