Cloud Based Ranking System on Clustered Databases

Niladri Dey, Manager – IT Projects, BVRIT, Hyderabad, +91 9542 60 1772

Dr. Sujoy Bhattacharya, Professor, BVRIT, Hyderabad, +91 9618 06 2749

Dasaradha Ramaiah K, Professor, BVRIT, Hyderabad, +91 9491 91 9183

ABSTRACT Ranking is important for all types of comparisons, including website ranking and other voting and polling situations. However, general ranking systems focus primarily on individual or standalone databases, and their algorithms do not connect components in distributed or clustered databases. As the computing community migrates towards cloud and distributed computing, the information used for ranking will be distributed over multiple clusters. In this paper, we propose a new ranking algorithm that takes clustered databases into account. An application has been used to showcase the implementation of the algorithm. The algorithm can also be used for any ranking, voting or polling application.

Keywords Ranking, Clustered Database, Clustered Ranking, Movie Ranking, Distributed database.

1. INTRODUCTION Ranking is the most effective method of calculating popularity [1] for public objects such as movies. The ranking method is also applicable to consumer goods and polling processes [2]. The data stored in the database and its statistics become the deciding factor in ranking. In most cases, the data is located on a centralized server [3], where all the participants of the survey update their responses. In a few cases, it has been observed that the location where the survey took place affects the interpretation of the results. Hence, storing the ranking statistics per cluster and calculating the ranking from the clusters provides a more accurate ranking or popularity rating. In this research we propose an algorithm that calculates the ranking on a clustered architecture with a popularity factor, that is, a rank for the cluster itself based on the usage of that cluster. This ensures that the clusters from which the largest number of users take the survey are given due consideration. It helps decision makers approach the most targeted customers for a specific product while also obtaining the database-wide ranking. The rest of the paper is organized as follows: Part 2 discusses the related work, Part 3 describes the algorithm, Part 4 describes the application for Movie Ranking based on the algorithm, Part 5 shows the effects of the algorithm over standalone rankings, Part 6 discusses the hardware and software setups for the application and test beds, and Part 7 provides the conclusion of the research.

2. RELATED WORK This research also provides a comparison with other popular ranking mechanisms and notes the limitations of those works. We classify the related work, based on its treatment of clustering, into the following two categories:

2.1 Ranking without Considering the Cluster A large number of major research works have been undertaken in the area of ranking where the data is located on a single or standalone database. “Factoring and weighting approaches to status scores and clique detection” by Phillip Bonacich [4] presents an independent ranking approach in which the individual ranking does not consider the clustering of social data. A similar work by Alice Cheng and Eric Friedman, “Manipulability of pagerank under sybil strategies” [5], studies a page ranking strategy without considering the data distribution [6]. “A new status index derived from sociometric analysis” by L. Katz also does not consider location-based weight factors for the analyzed data. The works mentioned above focus only on the standard factors of the social or page hit information, whereas the geodetic location may have an impact on the final results [7].

2.2 Ranking with Cluster Data A few existing research works [8] [9] [10] have considered the impact of geodetic clustering and distribution of the data, but they have not reported the impact of cluster-wise weights on any such ranking mechanism. Moreover, these ranking algorithms work only on certain specific types of data and are not applicable to other purposes such as ranking an individual, a movie or a product. The two most important ranking algorithms, PageRank and HITS, also compute rankings without considering the weight factor of the geodetic clusters [11] [12], which may affect the final conclusions of those algorithms.

3. ALGORITHM The cluster setup is organized with multiple movie entries with redundancy [Figure 1]. The algorithm for this research is a simple cluster-weight-based algorithm, in which the popularity of a cluster is defined as its weight. Weighting each cluster by its popularity makes the final statistics more reliable for decision making. The steps of the algorithm are defined in the following paragraphs.

Figure 1. The Clustered information of Movies.

3.1 Calculation for the number of clusters The number of clusters is one of the main factors affecting the performance of a distributed database [13]. It is calculated with a standard algorithm:

$$ d_K = \frac{1}{p} \min_{c_1, \ldots, c_K} E\left[ (X - c_X)^T \Gamma^{-1} (X - c_X) \right] \qquad (1) $$

Where, if we let $c_1, \ldots, c_K$ be a set of K cluster centers, with $c_X$ the closest center to a given sample X, then $d_K$ is the minimum average distortion per dimension (p being the dimension of X) when fitting the K centers to the data.

3.2 Calculation for the local rank The local rank for a movie is the sum of the parameters given in the survey:

$$ R_{LK} = \sum_{i=1}^{n} f_i \qquad (2) $$

Where, $R_{LK}$ represents the local ranking for the movie, with L and K as the cluster and movie indexes respectively. The factors $f_i$ are the survey parameters.

3.3 Calculation for cluster weight The most influential factor, the cluster weight or access frequency, is calculated from the number of accesses made to take the survey:

$$ W_i = \sum_{t=0}^{\infty} USE(M_i) \qquad (3) $$

Where, $W_i$ denotes the cluster weight and $M_i$ denotes the movie information stored on the cluster; the weight is a time-variant calculation. The USE() function counts the number of accesses for the survey.

3.4 Calculation of the final rank The final rank for the movie is calculated from the cluster weights, the local rankings and the number of clusters:

$$ R_M = \left( \sum_{L=1, K=1}^{n, m} R_{LK} \cdot W_i \right) / d_K \qquad (4) $$

Where, $R_M$ denotes the final ranking for the movie.

4. APPLICATION BASED ON THE ALGORITHM A sample application [Figure 2] was developed to test the effectiveness of the algorithm. The components of the application are described below.

Figure 2. The Application for Ranking.

4.1 Survey Tool The first component of the application is the survey tool, a cloud-based web application for conducting a survey. The application is deployed on the cloud to handle the heavy load during peak access periods. The survey tool is connected to the independent master site logger, which in turn is connected to the final ranking component.

4.2 Master Site Access Logger The Master Site Access Logger is connected to the main application component to log the movie entries for redundant movie accesses. The Master Site Access Logger also determines the nearest cluster site for storing the survey reports.

4.3 Cluster Access The Cluster Access component is referenced by the Master Site Access Logger to store the data in the proper domain considering the access factors [14] [15] [16].

4.4 Site Access Log The Cluster Access Log component measures the number of accesses and keeps track of the cluster weights. The USE() method calculates the total number of accesses for each movie.

4.5 Movie Ranking The Movie Ranking component implements the major part of the algorithm. This component calculates the final rank of the movie based on the local ranks and the weights of the clusters.
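The calculation performed by the Movie Ranking component can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the application's actual code: the `ClusterSurvey` structure and the function names are hypothetical, the cluster weight of equation (3) is taken to be an access count that has already been accumulated, and the value d_K of equation (1) is passed in as a given number rather than computed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterSurvey:
    """Survey data held by one cluster for one movie (hypothetical structure)."""
    factor_scores: List[float]  # survey parameters f_i collected for the movie
    access_count: int           # accumulated USE(M_i) count, used as weight W_i


def local_rank(factor_scores: List[float]) -> float:
    """Equation (2): the local rank is the sum of the survey parameters."""
    return sum(factor_scores)


def final_rank(clusters: List[ClusterSurvey], d_k: float) -> float:
    """Equation (4): weighted sum of local ranks divided by d_K.

    d_k is assumed to come from equation (1); it is not computed here.
    """
    if d_k <= 0:
        raise ValueError("d_k must be positive")
    weighted_sum = sum(
        local_rank(c.factor_scores) * c.access_count for c in clusters
    )
    return weighted_sum / d_k


# Illustrative values only; they are not taken from the survey in the paper.
example_clusters = [
    ClusterSurvey(factor_scores=[2.0, 3.0], access_count=10),
    ClusterSurvey(factor_scores=[1.0, 1.0], access_count=5),
]
example_rank = final_rank(example_clusters, d_k=2.0)
print(example_rank)
```

With these illustrative inputs the weighted sum is 5 × 10 + 2 × 5 = 60, so dividing by d_K = 2 gives 30. The paper does not state how the result is brought back to the scale of 10 used in the results, so the sketch leaves the value unnormalized.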

5. PERFORMANCE IMPROVEMENT The survey covered over 300 movies, a few of which are listed in Table 1. Each row gives the ranking of a movie at the various sites (one column per site), and the resultant ranking is shown in the final column. All ranking values are on a scale of 10.

Table 1. Comparative ranking of the movie databases

Movie Names                 Site 1   Site 2   Site 3   Site 4   Final Ranking
The Shawshank Redemption    9.2      7        9.8      4        9.4
Pulp Fiction                8        2.6      5        6        8.5
Rope                        6        8.3      2        6.4      5.7
In the Heat of the Night    8        8        5.5      2.1      7.03

Figure 3. Ranking Comparisons over Individual Sites and Final Ranking.

The calculated values demonstrate that the final ranking is much more stable than the cluster based individual rankings, which vary by wide margins.
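The width of the per-site variation can be checked directly from the site columns of Table 1. The short Python sketch below computes, for each listed movie, the gap between its highest and lowest site ranking; the helper name `site_spread` is hypothetical, and the sketch only measures the spread of the site values without recomputing the Final Ranking column.

```python
# Site rankings copied from Table 1 (Site 1 to Site 4, scale of 10).
site_rankings = {
    "The Shawshank Redemption": [9.2, 7, 9.8, 4],
    "Pulp Fiction": [8, 2.6, 5, 6],
    "Rope": [6, 8.3, 2, 6.4],
    "In the Heat of the Night": [8, 8, 5.5, 2.1],
}


def site_spread(scores):
    """Return the gap between the highest and lowest per-site ranking."""
    return round(max(scores) - min(scores), 1)


spreads = {movie: site_spread(scores) for movie, scores in site_rankings.items()}
for movie, spread in spreads.items():
    print(f"{movie}: {spread}")
```

For the four listed movies the gap between the best and worst site ranking is between 5.4 and 6.3 points on the scale of 10.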

6. HARDWARE AND SOFTWARE REQUIREMENTS The application architecture is based on a simple preconfigured Amazon EC2 instance. The architecture relies on the elasticity of EC2, which enables the application to increase or decrease capacity within minutes rather than hours or days. The Completely Controlled Environment (CCE) gives us complete control of the application instances. Amazon EC2 offers a robust environment in which replacement instances can be rapidly and predictably commissioned. The service runs within Amazon’s network infrastructure and datacenters without requiring a local database instance. The cluster setups use Apache HBase databases, where one node works as the Master and the other sites are considered slaves. The nodes are configured with Oracle Java 6 on Ubuntu 12 operating systems with SSH and SSHD enabled. The DNS and loopback IP addresses are also configured for all the nodes, including the Master.

7. CONCLUSION In the Movie Database application, it is assumed that the ranking is based on multiple factors that together determine the total local ranks. It is also assumed that the movie survey information is redundant in nature, as each geodetic location where the survey takes place has at least one cluster. Each cluster shows the impact of the movie on its region, based on the local language and culture.

Figure 3 shows the local rankings and the final ranking or rating of each movie based on the provided survey information and calculations. Some of the clusters do not give the correct ranking or rating for the movies. This variation can be due to cultural differences, social impacts or policies, which may ultimately affect the ratings. It is demonstrated that the Final Ranking Algorithm generates a more reliable ranking that enables the user to take an informed decision, while also providing the region-wise survey reports and statistics. This research is also applicable to any type of survey that may have geodetically varying results. The results of such surveys frequently have a significant impact on business decisions or on scientific and social decisions with a lasting impact on society.

8. ACKNOWLEDGMENTS The research work is being carried out at the Cloud Computing Center of Padmasri Dr. B. V. Raju Institute of Technology, Hyderabad. We would like to thank the management for allocating the necessary funds for the research.

9. REFERENCES
[1] J. Bilmes. A gentle tutorial on the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models, 1997.
[2] J. E. Gentle and W. Härdle. Handbook of Computational Statistics: Concepts and Methods, chapter 7 Evaluation of Eigenvalues, pages 245–247. Springer, 1 edition, 2004.
[3] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst., 30(1-7):107–117, 1998.
[4] Phillip Bonacich. Factoring and weighting approaches to status scores and clique detection. Journal of Mathematical Sociology, pages 113–120, 1972.
[5] Alice Cheng and Eric Friedman. Manipulability of pagerank under sybil strategies. In First Workshop on the Economics of Networked Systems (NetEcon06), 2006.
[6] DBLP. The DBLP computer science bibliography. http://www.informatik.uni-trier.de/~ley/db/.
[7] J. E. Hirsch. An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences, 102:16569, 2005.
[8] Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with TrustRank. In Proceedings of the Thirtieth International Conference on Very Large Data Bases (VLDB'04), pages 576–587. VLDB Endowment, 2004.
[9] G. Jeh and J. Widom. SimRank: a measure of structural-context similarity. In Proceedings of the eighth ACM SIGKDD conference (KDD'02), pages 538–543. ACM, 2002.
[10] W. Jiang, J. Vaidya, Z. Balaporia, C. Clifton, and B. Banich. Knowledge discovery from transportation network data. In Proceedings of the 21st ICDE Conference (ICDE'05), pages 1061–1072, 2005.
[11] U. Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.
[12] Z. Nie, Y. Zhang, J.-R. Wen, and W.-Y. Ma. Object-level ranking: Bringing order to web objects. In Proceedings of the fourteenth International World Wide Web Conference (WWW'05), pages 567–574. ACM, May 2005.
[13] S. Roy, T. Lane, and M. Werner-Washburne. Integrative construction and analysis of condition-specific biological networks. In Proceedings of AAAI'07, pages 1898–1899, 2007.
[14] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
[15] X. Yin, J. Han, and P. S. Yu. LinkClus: Efficient clustering via heterogeneous semantic links. In Proceedings of the 32nd VLDB conference (VLDB'06), pages 427–438, 2006.
[16] O. Zamir and O. Etzioni. Grouper: A dynamic clustering interface to web search results. pages 1361–1374, 1999.
