Predicting popularity of online distributed applications - Amazon Web ...

Predicting Popularity of Online Distributed Applications: iTunes App Store Case Analysis

Miao Chen
School of Information Studies, Syracuse University
[email protected]

Xiaozong Liu
School of Information Studies, Syracuse University
[email protected]

ABSTRACT

Online distributed applications are becoming increasingly important to users. A growing number of individuals and companies develop applications and sell them online. In the past couple of years, Apple Inc. has successfully built an online application distribution platform, the iTunes App Store, whose growth is helped by fashionable hardware such as the iPad and iPhone. Unlike traditional sales networks, iTunes has some unique features for promoting applications, for example daily application rankings, application recommendations, free trial versions, application updates, and user comments. All of these make us wonder what makes an application popular in the iTunes store and why users are interested in specific types of applications. We plan to answer these questions using machine learning techniques.

Keywords

Online distributed application, machine learning, data mining, popularity

1. INTRODUCTION

In the past 20 years, users have become accustomed to and satisfied with online shopping, and the success of major online shopping systems such as Amazon and eBay shows that online shopping is becoming a strong competitor as well as a complement to traditional shopping outlets. Only recently, with the success of the iPhone and iPad, have online distributed applications been widely accepted by end users. Apple Inc. launched the iTunes service in April 2003, and in the years since, the iTunes App Store has become one of Apple's most profitable services. On this platform, developers can set the price of their application (a paid app) or make it free (a free app). By Jan 5, 2010, 3 billion applications had been downloaded from iTunes. As in other online distribution systems, user attention often follows a power law (Nazir et al., 2008), with most content getting only a few downloads while a few items receive most of the attention (Wu & Huberman, 2007).

For this paper, we investigate 102,337 applications from the iTunes App Store. By tracking daily rankings, application properties, and user ratings and comments, we want to answer two research questions: 1) What makes an application popular? 2) Can we predict the popularity of new and existing applications? To answer these questions, we will employ the Classification And Regression Tree (CART) classification algorithm (Breiman et al., 1984; Steinberg & Colla, 1995) to analyze a list of innovative numeric and textual features.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. iConference 2011, February 8-11, 2011, Seattle, WA, USA. Copyright © 2011 ACM 978-1-4503-0121-3/11/02…$10.00

2. RELATED WORK

The popularity of objects such as software applications and webpages has been measured in two ways: 1) network-based measurement, which measures an object's popularity based on its connections with other related objects, for instance the PageRank webpage popularity ranking (Page et al., 1998) and ontology popularity (Sabou et al., 2006); 2) feedback-based measurement, which measures popularity based on user feedback such as votes, ratings, and comments (Cha et al., 2007). In our study we adopt the second measurement, taking the user-driven application ranking as the measure of popularity.

There have been a few studies on iPhone applications, and fewer on iPad applications because the iPad is new (it came out in 2010). A number of iPhone studies have been about using iPhone applications for educational purposes, e.g. teaching fundamental computer concepts with iPhone games (O’Rourke et al., 2010). Kim et al. (2010) studied factors that affect smartphone application developers’ intention to develop applications frequently, in the context of platform business, which enables third-party developers to distribute and possibly profit from their applications. Our study will integrate features from different perspectives, including features of application information and features of user-contributed content. We will take into account some innovative features that, to the best of our knowledge, have not been used in previous studies, such as whether an application has a free version, application update information, and sentiment analysis of user comments.

3. DATA

We collected data in two steps. First, 102,337 applications were sampled from the iTunes application databases, with their application name, provider, release date, and category extracted (Table 1). Second, a list of dynamic features was collected every day by tracking daily rankings (the top 200 paid and free applications from the general and categorical rank lists). A variety of categories of iTunes applications were collected (shown in Table 1), and three groups of features were collected for our machine learning purpose: static, dynamic, and comments features (shown in Table 2).
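As an illustration of how dynamic features could be derived from the daily rank tracking described in the Data section, the sketch below computes the "current rank" and "peak rank" features of Table 2 from a set of daily top-200 snapshots. It is a minimal sketch, not the authors' collection code: the `RankSnapshot` record, its field names, the list naming scheme, and the sample app identifiers are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RankSnapshot:
    """One row of a daily top-200 list (hypothetical record layout)."""
    day: date
    app_id: str
    list_name: str  # assumed naming, e.g. "paid/Games" or "free/general"
    rank: int       # 1 is the best position


def dynamic_rank_features(snapshots):
    """Return {(app_id, list_name): {"current_rank": int, "peak_rank": int}}.

    current_rank is the rank on the most recent day the app appeared in the
    list; peak_rank is the best (numerically smallest) rank observed.
    """
    features = {}
    # Process in chronological order so the last snapshot seen is the current one.
    for snap in sorted(snapshots, key=lambda s: s.day):
        key = (snap.app_id, snap.list_name)
        entry = features.setdefault(
            key, {"current_rank": snap.rank, "peak_rank": snap.rank}
        )
        entry["current_rank"] = snap.rank
        entry["peak_rank"] = min(entry["peak_rank"], snap.rank)
    return features


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        RankSnapshot(date(2010, 10, 1), "app.a", "paid/Games", 12),
        RankSnapshot(date(2010, 10, 2), "app.a", "paid/Games", 5),
        RankSnapshot(date(2010, 10, 3), "app.a", "paid/Games", 30),
    ]
    for key, feats in dynamic_rank_features(sample).items():
        print(key, feats)
```

Keeping one entry per (application, list) pair reflects the fact that the same application can appear in both a general and a categorical rank list with different ranks.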

Category         Number of Apps    Category             Number of Apps
Books            17227             Navigation           4691
Business         2036              Photography          3996
Education        3604              Productivity         3280
Entertainment    6651              Reference            1265
Finance          3257              Social Networking    3238
Games            30956             Sports               2966
News             4971              Travel               5499
Lifestyle        2354              Utilities            3498
Medical          3393              Weather              947
Music            982

Table 1. Categories of Sample from iTunes App Store.

Group               Feature name               Description
Static features     Name                       Navigation
                    Provider                   Developer name
                    Category                   E.g. Games, Books, and Travel
                    Release date               The date that the application was released
Dynamic features    Current rank               Rank (paid or free) in a specific category
                    Peak rank                  The peak rank that this app has achieved
                    Most recent update date    Most recent version release date
                    Current version            Current version, e.g. 1.5.3
                    Current rate               Current version (average) rate
                    Current rate count         Count for current rate
                    All version rate           Average rate for all versions
                    All version count          Count of rate for all versions
                    Current App Description    Most recent application description
Comments feature    User rate                  User rate for this application
                    Comment title              Comment title
                    Comment content            Comment content

Table 2. Features.

4. METHODOLOGY

Our methodology is based on machine learning: we generate a model of associations between application popularity (or ranking stability) and the three feature groups, together with other innovative features. More specifically, the methodology involves three steps: 1) identifying useful features using data mining and text mining techniques; 2) training a list of CART models to derive a popularity prediction model for applications; 3) evaluating model performance using testing data.

We will track three months of daily top 200 iTunes application lists (paid and free, for general and categorical ranks). The data will be split into training and testing groups, for model building and model evaluation respectively.

5. PRELIMINARY OBSERVATIONS AND FUTURE WORK

We made some preliminary observations on the data set. Interestingly, we find that the rank of top paid applications is not necessarily closely correlated with customer ratings. For instance, the average ratings of the top 200 paid apps are listed below (sampled over 3 days):

[Table: average rate by rank range over the top 200 paid apps. Rank boundaries: 0, 50, 100, 150, 200. Average rates as extracted: 3.489, 3.795, 3.418, 3.846.]

Besides the features listed in Table 2, we find that a large percentage of top-ranked paid applications have a free trial version. For instance, 26.0% of the top 50 paid applications have a trial version, while 24.7% of the top 150 paid applications have a free download version. The rank of those free trial applications is also available to help us better understand the contribution of “free trial”.

In the future we will also work on the following problems to better understand the causes of popularity as well as other concepts such as application stability:

1) Some applications stay on the top list for a long time, while others drop after a short period, indicating that applications differ in stability. We can also rank and predict stability from the collected data.
2) Price reduction may help an application come back to the top.
3) Application provider authority may contribute to the popularity of the application.
4) The descriptions and comments (textual features) may help us predict the popularity of the application. Sentiment analysis and opinion mining will be used to extract these kinds of textual features.
5) Updating to a higher version may help an application get a better rank.

6. REFERENCES

[1] Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees. Pacific Grove: Wadsworth.

[2] Cha, M., Kwak, H., Rodriguez, P., Ahn, Y., & Moon, S. (2007). I tube, you tube, everybody tubes: analyzing the world's largest user generated content video system. Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement. San Diego, CA.

[3] Kim, H.J., Kim, I., & Lee, H.G. (2010). The success factors for App Store-like platform business from the perspective of third-party developers: An empirical study based on a dual model framework. Proceedings of Pacific Asia Conference on Information Systems 2010, 242-283.

[4] Nazir, A., Raza, S., & Chuah, C. (2008). Unveiling Facebook: A measurement study of social network based applications. Proceedings of the 8th ACM SIGCOMM Conference on Internet Measurement, 43-56.

[5] O’Rourke, J., MacDonald, I., & Goldschmidt, D. (2010). Learning computer science concepts using iPhone applications. Journal of Computing Sciences in Colleges, 25(6), 121-128.

[6] Page, L., Brin, S., Motwani, R., & Winograd, T. (1998). The PageRank citation ranking: Bringing order to the web. Technical Report, Stanford Digital Library Technologies Project.

[7] Sabou, M., Lopez, V., Motta, E., & Uren, V. (2006). Ontology selection: Ontology evaluation on the real semantic web. Proceedings of the 4th International Evaluation of Ontologies on the Web Workshop. Edinburgh, UK.

[8] Steinberg, D., & Colla, P. (1995). CART: Tree-structured non-parametric data analysis. San Diego, CA: Salford Systems.

[9] Wu, F., & Huberman, B.A. (2007). Novelty and collective attention. Proceedings of the National Academy of Sciences, 104(45), 17599-17601.
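Section 4 describes training CART models on a training group and evaluating them on a testing group. The sketch below illustrates that train-and-evaluate loop under stated assumptions; it is not the authors' implementation. It is a small from-scratch classification tree in the spirit of CART (Breiman et al., 1984), using Gini impurity and binary threshold splits on numeric features, with a depth limit instead of pruning. The feature columns (peak rank, average rate, free-version flag), the toy rows, and the "popular" / "not popular" labels are all hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature index, threshold) that most reduces weighted Gini,
    or None if no split improves on the unsplit node."""
    best, best_score, n = None, gini(labels), len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(rows, labels, max_depth=3):
    """Grow a binary tree; internal nodes are (feature, threshold, left, right)
    tuples and leaves are the majority class label (assumed to be a string)."""
    split = best_split(rows, labels) if max_depth > 0 else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (
        f, t,
        build_tree([r for r, _ in left], [y for _, y in left], max_depth - 1),
        build_tree([r for r, _ in right], [y for _, y in right], max_depth - 1),
    )


def predict(tree, row):
    """Follow threshold tests from the root until a leaf label is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


def train_test_split(rows, labels, test_fraction=0.25, seed=0):
    """Shuffle indices reproducibly and split into training and testing groups."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_test = int(len(idx) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return ([rows[i] for i in train_idx], [labels[i] for i in train_idx],
            [rows[i] for i in test_idx], [labels[i] for i in test_idx])


def accuracy(tree, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(tree, r) == y for r, y in zip(rows, labels)) / len(rows)


if __name__ == "__main__":
    # Hypothetical rows: [peak rank, average rate, has free version (0/1)].
    rows = [[10, 4.5, 1], [20, 3.0, 0], [30, 4.0, 1], [45, 3.8, 1],
            [120, 4.8, 0], [150, 3.5, 1], [180, 2.0, 0], [195, 4.1, 0]]
    labels = ["popular"] * 4 + ["not popular"] * 4
    x_train, y_train, x_test, y_test = train_test_split(rows, labels)
    tree = build_tree(x_train, y_train)
    print("tree:", tree)
    print("test accuracy:", accuracy(tree, x_test, y_test))
```

A production study would more likely rely on an established CART implementation with pruning and cross-validation; the from-scratch version is used here only to keep the example self-contained.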
