Oct 28, 2012 - Privacy Preserving k-medoids Clustering: An Approach towards Securing Data in Mobile Cloud Architecture. Sanjit Kumar Dash. Dept. of IT.
Privacy Preserving k-medoids Clustering: An Approach towards Securing Data in Mobile Cloud Architecture Sanjit Kumar Dash
Debi Pr. Mishra
Ranjita Mishra
Sweta Dash
Dept. of IT College of Engineering & Technology Bhubaneswar, Odisha +91-9437990892
Dept. of IT College of Engineering & Technology Bhubaneswar, Odisha +91-9437990892
Dept. of IT College of Engineering & Technology Bhubaneswar, Odisha +91-9437990892
Dept. of CSE Synergy Institute of Engg & Technology Dhenkanal, Odisha +91-9090585246
[sanjitkumar303, dpmishra.07, ranjita586, swetadash123]@gmail.com
ABSTRACT The proliferation of mobile computing and cloud services is driving a revolutionary change in today's information society. We are moving into the ubiquitous computing age, in which a user works with several electronic platforms at the same time and can access all required information whenever and wherever it is needed. Mobile users can use their cellular phones to check e-mail and browse the internet; travelers with portable computers can surf the internet from airports, railway stations, etc. Mobile capabilities can be integrated with cloud computing services to give subscribers more secure and advanced services. At the same time, privacy is an important issue in collaborative ubiquitous computing, since privacy concerns may prevent the parties from directly sharing the data and some types of information about the data. The main challenge is how multiple parties can collaboratively exchange information without breaching data privacy. This paper investigates solutions for a secure mobile cloud architecture using privacy preserving k-medoids clustering, which is one of the data mining tasks.
Keywords Cloud Computing, Mobile Cloud Computing, k-medoid, Clustering
1. INTRODUCTION The term "cloud computing" [17] is being talked about a lot these days, mainly in the context of the "future of the web". But the potential of cloud computing does not begin and end with the personal computer's transformation into a thin client; the mobile platform will be heavily impacted as well. At least that is the analysis being put forth by ABI Research [13]. Their recent report, Mobile Cloud Computing, theorizes that the mobile cloud will soon become a disruptive force in the mobile world, eventually becoming the dominant way in which mobile applications operate. ABI gives two primary reasons: the number of users the technology can reach is far greater than the number of smartphone users alone, and the way applications are distributed today. Currently, mobile applications are tied to a carrier. If you want an iPhone app, for example, you first have to have a relationship with the mobile operator that carries the iPhone. If you want a Blackberry app, the same rule applies. But with mobile cloud computing applications, as long as you have access to the web, you have access to the mobile application.
There is still a question mark, however, over responsibility and accountability in the mobile cloud architecture. Users are still in the dark over questions of legal responsibility, insurance, assurance and ownership of content in the cloud. Some uncertainty also exists about business continuity and risk management.
Clustering can be thought of as one approach to addressing these issues in the mobile cloud. Clustering is the process of grouping a set of objects into classes or clusters so that objects within a cluster are similar to one another but dissimilar to objects in other clusters [19]. A complete appraisal of the current state of the art of clustering techniques can be found in [20]. Our paper focuses on the k-medoids method since it allows arbitrary objects that are not limited to numerical attributes [21]. In k-medoids clustering, a cluster is represented by one of its points. The approach is simple since it covers any attribute type, and medoids are resistant to outliers. Once medoids are chosen, clusters are defined as subsets of points close to their respective medoids, and the objective function is defined in terms of the distance between a point and its medoid. From the privacy protection point of view, k-medoids clustering seems more challenging than k-means clustering [16] because k-medoids clustering uses real instances to compute the distances, whereas k-means uses the mean instance, whose values are the means of the real instance values.
The rest of this paper is organized as follows. Section 2 reviews the privacy preserving protocols that have been proposed over time. Section 3 explains the architecture of mobile cloud computing. Section 4 outlines the security issues in the mobile cloud. Section 5 describes the k-medoids clustering algorithm and presents the protocol used in our paper. Section 6 identifies where private data may be exposed in the mobile cloud architecture, and Section 7 discusses a case in support of Section 6. Finally, Section 8 concludes the paper.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CCSEIT-12, October 26-28, 2012, Coimbatore [Tamil nadu, India] Copyright © 2012 ACM 978-1-4503-1310-0/12/10…$10.00.
2. RELATED WORK In early work on privacy-preserving data mining, Lindell and Pinkas [1] proposed a solution to the privacy-preserving classification problem using the oblivious transfer protocol, a powerful tool developed by secure multi-party computation (SMC) research [2, 3]. Techniques based on SMC for efficiently dealing with large data sets have been addressed in [4]. Agrawal and Srikant first proposed randomization approaches in [5] to solve the privacy-preserving data mining problem. Researchers have since proposed further random perturbation-based techniques [6, 7, 8]. In addition to perturbation, aggregation of data values [9] provides another way to disguise the actual data values. In [10], the authors studied the problem of computing the kth-ranked element. Dwork and Nissim [11] showed how to learn certain types of boolean functions from statistical databases in terms of a measure of probability difference with respect to probabilistic implication, where data are perturbed with noise for the release of statistics.
Lately, there have been several efforts on privacy preserving clustering [12, 14, 15, 18]. A framework for clustering over horizontally partitioned data in unsupervised and semi-supervised scenarios using sampling techniques is provided in [18]. In [12], Klusch et al. presented an approach to distributed data clustering based on sampling density estimates. Oliveira and Zaiane introduced a family of geometric data transformation methods that ensure the mining process does not infringe privacy up to a certain degree of security in [14], and showed that a solution can be achieved by transforming a database using object similarity-based representation and dimensionality reduction-based transformation in [15]. Vaidya and Clifton's work [16] is an important contribution to the problem of privacy preserving clustering over vertically partitioned data. Their approach used the k-means method, whereas our paper focuses on clustering using the k-medoids method. Since the two algorithms are different, the design of the secure protocols differs as well. Our protocols provide two levels of protection. Even if P1, Pn−1 and Pn (three of the n parties) collude with one another, the other parties' private data remain securely hidden unless all of the parties but one collude.
3. OVERVIEW OF THE MOBILE CLOUD COMPUTING ARCHITECTURE The architecture of mobile cloud computing mainly consists of three tiers: 1. Presentation tier 2. Application tier 3. Database tier. The presentation tier handles user interaction with the server, the application tier provides business logic through different service providers, and the database tier provides full database and mobile communication functionality. The database tier allows a mobile user to initiate transactions from anywhere at any time and guarantees their consistency-preserving execution. It also provides security, device provisioning, etc. The database in this tier is used to provide backup to users in case of loss or damage.
Figure 1. Mobile cloud Computing Architecture
3.1 Presentation Tier The presentation tier of the mobile cloud computing architecture consists of the following components:
Microbrowser: A mobile browser, also called a microbrowser, minibrowser or wireless internet browser (WIB), is a web browser designed for use on a mobile device such as a mobile phone or PDA. Mobile browsers are optimized to display Web content effectively on the small screens of portable devices.
Email Client: This is an application specifically designed to access remote mail servers, retrieve mail from them, and manipulate that mail. Popular examples are Microsoft Outlook, Thunderbird, and Eudora.
Mobile OS: A mobile operating system, also known as a mobile OS, a mobile platform, or a handheld operating system, is the operating system that controls a mobile device, similar in principle to an operating system such as Linux or Windows that controls a desktop computer or laptop.
3.2 Application Tier The application tier focuses on the services provided by a group of service providers. Some of the important cloud service providers are Google Apps, Amazon Web Services, Facebook Developers, IBM, Windows Azure, etc.
3.3 Database Tier In mobile cloud computing (MCC), both data storage and data processing happen outside the mobile device, i.e., the concept of cloud computing is applied in a mobile environment. In the MCC scenario, all the computing power and data storage move into the mobile cloud. MCC benefits not only smartphone users but also a broader range of mobile subscribers. The database not only stores subscribers' data but also provides a backup facility.
4. SECURITY ISSUES IN MOBILE CLOUD In this section we outline new problem areas in security that arise from mobile cloud computing. These problems may only become perceptible after the maturation and more widespread adoption of mobile cloud computing as a technology.
Cheap data and data analysis: The rise of mobile cloud computing has created enormous data sets that can be monetized by applications such as advertising. What is the impact on privacy of abundant data and cheap data mining? Because of the mobile cloud, attackers potentially have massive, centralized databases available for analysis and also the raw computing power to mine these databases. Because of privacy concerns, enterprises that run mobile clouds and collect data have felt increasing pressure to anonymize their data. EPIC has called for Gmail, Google Docs, Google Calendar, and the company's other Web applications to be shut down until appropriate privacy guards are in place [22]. The anonymized data is retained, though, to support the continual testing of their algorithms. Another reason to anonymize data is to share data with other parties (e.g., the AOL incident [23]).
Cost-effective defense of availability: Availability also needs to be considered in the context of an adversary whose goal is simply to disrupt activities. Increasingly, such adversaries are becoming realistic as political conflict is taken onto the web, as the recent cyber attacks on Lithuania confirm [24]. The damages are not only related to losses of productivity, but extend to losses due to degraded trust in the infrastructure and potentially costly backup measures. The mobile cloud computing model encourages single points of failure. It is therefore important to develop methods for sustained availability (in the context of attack) and for recovery from attack.
Increased authentication demands: The development of cloud computing may, in the extreme, allow the use of thin clients on the client side. Rather than purchasing a license and installing software on the client side, users will authenticate in order to use a cloud application. There are some advantages in such a model, such as making software piracy more difficult and making centralized monitoring possible. It may also help prevent the spread of sensitive data to untrustworthy clients. This architecture encourages user mobility, but increases the need to address authentication in a secure manner. In addition, the movement towards increased hosting of data and applications in the cloud and less reliance on specific user machines is likely to increase the threat of phishing and other offensive techniques aimed at stealing access credentials or otherwise deriving them, e.g., by brute force methods.
This paper mainly emphasizes the authentication issues in mobile cloud computing.
5. OVERVIEW OF THE K-MEDOIDS CLUSTERING ALGORITHM The k-medoids method [25] divides a distance-space into k clusters. A medoid [11] that is selected from the dataset represents a cluster. The algorithm chooses k medoids to represent the k clusters. Clusters are then created by assigning each of the remaining instances to the nearest medoid. The k-medoids clustering algorithm is described as follows:
--------------------------------------------------------------------------
Algorithm:
(1) Randomly select k instances from the dataset as medoids.
(2) Assign each remaining (non-medoid) instance to the cluster with the nearest medoid.
(3) Compute the compactness of the clustering, denoted by TDcurrent.
(4) For each pair (medoid M, non-medoid NM), compute the value of TD for the partition that results from swapping M with NM, denoted by TD_{NM↔M}.
(5) Select the non-medoid NM for which TD_{NM↔M} is minimal.
(6) If TD_{NM↔M} < TDcurrent:
• Swap NM with M
• Set TDcurrent to TD_{NM↔M}
• Go to Step 4
--------------------------------------------------------------------------
The algorithm requires a distance function. For example, the distances can be defined in terms of the standard Euclidean distance. Each party computes its own portion of the distance, and the use of a particular distance measure does not cause a privacy violation. Therefore, other distance functions can be applied as well.
5.1 Notations The following notations are used for illustration purposes:
• k: the total number of medoids.
• t: a non-medoid instance.
• mCi: the medoid of the cluster Ci.
• M: a general term for medoids. It contains all possible medoids.
• NM: a general term for non-medoids. It contains all possible non-medoids.
• TD(Ci): the measure of the compactness of a cluster Ci.
• TD: the measure of the compactness of a clustering that contains all the clusters.
5.2 Privacy Preserving Protocol for Vertical Collaboration In vertical collaboration, since each party holds only a portion of the attributes for each instance, each party computes her portion of the distance (called the distance portion, which is the square of the standard Euclidean distance) according to her attribute set. To decide the nearest medoid of t, all the parties need to sum their distance portions together and then compare the sums. For example, assume that the distance portions between t and the medoid instance mCi are s11, s12, …, s1n, and the distance portions between t and the medoid instance mCl (i ≠ l) are s21, s22, …, s2n, where s1j and s2j belong to Pj for j ∈ [1, n]. To decide whether the distance between mCi and t is larger than the distance between mCl and t, we need to compare the sums $\sum_{j=1}^{n} s_{1j}$ and $\sum_{j=1}^{n} s_{2j}$.
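As an illustration of the k-medoids procedure and of distance portions, the following plaintext sketch runs steps (1) to (6) of the algorithm, computing each distance as the sum of per-party squared-Euclidean portions. It is not the privacy preserving protocol: all portions are summed in the clear, and the function names, the `parties` attribute split, and the sample data are assumptions made for the example.

```python
import random


def distance_portion(t, m, attrs):
    """One party's portion: squared Euclidean distance over its own attributes."""
    return sum((t[a] - m[a]) ** 2 for a in attrs)


def distance(t, m, parties):
    """Sum of all parties' portions (equals the full squared Euclidean distance)."""
    return sum(distance_portion(t, m, attrs) for attrs in parties)


def total_distance(data, medoids, parties):
    """TD: every instance contributes its distance to the nearest medoid."""
    return sum(min(distance(t, data[m], parties) for m in medoids) for t in data)


def k_medoids(data, k, parties, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(range(len(data)), k)              # step (1)
    td_current = total_distance(data, medoids, parties)    # steps (2)-(3)
    while True:
        best_td, best_medoids = td_current, None
        for pos in range(k):                               # step (4): try every swap
            for nm in range(len(data)):
                if nm in medoids:
                    continue
                candidate = medoids[:pos] + [nm] + medoids[pos + 1:]
                td = total_distance(data, candidate, parties)
                if td < best_td:                           # step (5): keep the minimum
                    best_td, best_medoids = td, candidate
        if best_medoids is None:                           # step (6): no improving swap
            break
        medoids, td_current = best_medoids, best_td
    # Final assignment of each instance to the cluster of its nearest medoid.
    clusters = [
        min(range(k), key=lambda c: distance(t, data[medoids[c]], parties))
        for t in data
    ]
    return medoids, clusters, td_current


if __name__ == "__main__":
    # Two parties: the first holds attribute 0, the second holds attribute 1.
    sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(k_medoids(sample, k=2, parties=[[0], [1]]))
```

In the protocol of Section 5.2 the inner `sum` in `distance` is exactly the quantity that must be compared without any party revealing its own portion.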
Problem 1: Assume that Pj has a private distance portion of the ith instance, sij, for i ∈ [1, k], j ∈ [1, n]. The problem is to decide whether $\sum_{j=1}^{n} s_{ij} \le \sum_{j=1}^{n} s_{lj}$ for i, l ∈ [1, k] (i ≠ l) and to select the smallest value TD(Ci), without disclosing any individual distance portion.
Protocol: The protocol has four steps:
(1) Key and digital envelope generation: the parties select one of them, e.g., Pn, as the key generator, who creates a cryptographic key pair (e, d) of a semantically secure homomorphic encryption scheme. Each party generates k digital envelopes.
(2) Computing $e\left(\sum_{j=1}^{n} (s_{ij} + R_{ij})\right)$ for i ∈ [1, k]: each party puts her private distance portion into a digital envelope and sends it to Pn−1.
(3) Computing $e\left(\sum_{j=1}^{n} R_{ij}\right)$ for all i ∈ [1, k]: each party encrypts her digital envelopes and sends them to P1.
(4) P1, Pn−1 and Pn jointly compute the nearest medoid.
We present the formal protocol as follows:
Step I: Key and digital envelope generation
(1) The parties Pj, j ∈ [1, n], randomly select a key generator, e.g., Pn.
(2) Pn generates a cryptographic key pair (e, d) of a semantically secure homomorphic encryption scheme and publishes its public key e.
(3) Each party independently generates k digital envelopes, i.e., Pj generates k digital envelopes Rij, for all i ∈ [1, k], j ∈ [1, n].
Step II: To compute $e\left(\sum_{j=1}^{n} (s_{ij} + R_{ij})\right)$ for i ∈ [1, k]
(1) P1 computes e(si1 + Ri1), for i ∈ [1, k], and sends them to P2.
(2) P2 computes e(si1 + Ri1) × e(si2 + Ri2) = e(si1 + si2 + Ri1 + Ri2), where i ∈ [1, k], and sends them to P3.
(3) Repeat sub-steps 1 and 2 of Step II until Pn−1 obtains e(si1 + si2 + … + si(n−1) + Ri1 + Ri2 + … + Ri(n−1)), for all i ∈ [1, k].
(4) Pn computes e(sin + Rin) for i ∈ [1, k], and sends them to Pn−1.
(5) Pn−1 computes e(si1 + si2 + … + si(n−1) + Ri1 + Ri2 + … + Ri(n−1)) × e(sin + Rin) = e(si1 + si2 + … + si(n−1) + sin + Ri1 + Ri2 + … + Ri(n−1) + Rin) = $e\left(\sum_{j=1}^{n} (s_{ij} + R_{ij})\right)$, for i ∈ [1, k].
Let e(S + R) denote the k encrypted elements [e(S1 + R1), e(S2 + R2), …, e(Sk + Rk)], where $S_i = \sum_{j=1}^{n} s_{ij}$ and $R_i = \sum_{j=1}^{n} R_{ij}$.
Step III: To compute $e\left(\sum_{j=1}^{n} R_{ij}\right)$ for all i ∈ [1, k]
(1) Pn computes e(Rin) for i ∈ [1, k] and sends them to Pn−1.
(2) Pn−1 computes e(Rin) × e(Ri(n−1)) = e(Rin + Ri(n−1)) for i ∈ [1, k], and sends them to Pn−2.
(3) Repeat sub-steps 1 and 2 of Step III until P1 obtains e(Ri1 + Ri2 + … + Ri(n−1)) × e(Rin) = $e\left(\sum_{j=1}^{n} R_{ij}\right)$, for all i ∈ [1, k].
The k encrypted elements are denoted by e(R), which contains [e(R1), e(R2), …, e(Rk)], where $R_i = \sum_{j=1}^{n} R_{ij}$.
Step IV: To compute the nearest medoid
(1) Computation between P1 and Pn:
a) P1 randomly permutes e(R1), e(R2), …, e(Rk), then sends the permuted elements to Pn.
b) Pn decrypts each element and sends them to P1 in the same order as P1 did.
c) P1 computes R, which contains [R1, R2, …, Rk].
(2) Computation between Pn−1 and Pn:
a) Pn−1 randomly permutes e(S1 + R1), e(S2 + R2), …, e(Sk + Rk), then sends the permuted elements to Pn.
b) Pn decrypts each element and sends them to Pn−1 in the same order as Pn−1 did.
c) Pn−1 computes [S1 + R1, S2 + R2, …, Sk + Rk], denoted by S + R. Note that the permutation function that P1 used is independent of the permutation function that Pn−1 used.
(3) Pn−1 and P1 compute e(Si − Sl) = $e\left(\sum_{j=1}^{n} s_{ij} - \sum_{j=1}^{n} s_{lj}\right)$, for i, l ∈ [1, k] (i ≠ l), and collect the results into a sequence Φ which contains k(k − 1) elements. This computation can be achieved via the following process:
a) P1 computes e(Rl) and e(−Ri) for i, l ∈ [1, k] (i ≠ l), then sends them to Pn−1.
b) Pn−1 computes e(Si − Sl) for i, l ∈ [1, k] (i ≠ l) as follows:
• e(Si + Ri) × e(−Ri) = e(Si)
• e(−Sl − Rl) × e(Rl) = e(−Sl)
• e(Si) × e(−Sl) = e(Si − Sl)
(4) Computation between Pn−1 and Pn:
a) Pn−1 randomly permutes the sequence Φ and obtains the permuted sequence Φ', then sends Φ' to Pn. Note that this permutation is independent of the ones she used earlier.
b) Pn decrypts each element in the sequence Φ'. He assigns the element +1 if the result of decryption is not less than 0, and −1 otherwise. Finally, he obtains a +1/−1 sequence denoted by Φ''.
c) Pn sends Φ'' to Pn−1, who computes the smallest element. It is the nearest medoid for the given non-medoid instance t. He then decides the cluster to which t belongs.
6. THE STATE OF THE AFFAIRS WHERE THE PRIVATE DATA MAY BE EXPOSED The key step of the k-medoids clustering algorithm is the computation of the distance between each non-medoid t and its medoid mCi without disclosing their private data. There are two cases where privacy-oriented computations are required:
(1) Assign each non-medoid instance to the cluster with the nearest medoid.
(2) Compute TD. That is, for a particular cluster, compute the distances between each non-medoid instance and its medoid, then add all the distances together to obtain TD(Ci). TD can then be computed by summing TD(Ci) over all k clusters.
Given a non-medoid instance t, multiple parties want to compute the distance between t and its medoid instances mCi.
7. AN INTERESTING CASE Let us discuss an interesting scenario in which P1, Pn−1 and Pn collude with one another. What we want to know is whether the private data can be unveiled. In this case, these three parties can gain more information than they should according to the protocol. The extra useful information that they can obtain in protocol 1 is $\sum_{j=1}^{n} s_{ij}$ for i ∈ [1, k].
Based on this information, other parties' individual private distance portions cannot be derived unless, among the remaining n−3 parties, n−4 parties collude with P1, Pn−1 and Pn. In other words, to break the two-level protection and gain private data that should not be disclosed, n−1 parties in total need to collude. Thus, although we assume that P1, Pn−1 and Pn do not collude, the assumption can be relaxed to a certain extent.
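The message flow of Steps I to IV can be traced with a toy simulation. The sketch below is only illustrative: it assumes Paillier as the additively homomorphic scheme (the paper does not name one), uses tiny insecure primes, runs all parties inside one process, omits the permutations of Step IV (1) and (2), and uses hypothetical names (`ToyPaillier`, `nearest_medoid`, `portions`).

```python
import math
import random


class ToyPaillier:
    """Paillier encryption with tiny primes: additively homomorphic, NOT secure."""

    def __init__(self, rng, p=1009, q=1013):
        self.rng = rng
        self.n = p * q
        self.n2 = self.n * self.n
        self.lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
        self.mu = pow(self.lam, -1, self.n)  # valid because the generator is n + 1

    def encrypt(self, m):
        r = self.rng.randrange(1, self.n)
        while math.gcd(r, self.n) != 1:
            r = self.rng.randrange(1, self.n)
        return pow(self.n + 1, m % self.n, self.n2) * pow(r, self.n, self.n2) % self.n2

    def decrypt(self, c):
        m = (pow(c, self.lam, self.n2) - 1) // self.n * self.mu % self.n
        return m - self.n if m > self.n // 2 else m  # map back to signed values

    def add(self, c1, c2):
        """e(a) x e(b) = e(a + b)."""
        return c1 * c2 % self.n2


def nearest_medoid(portions, seed=0):
    """portions[j][i] is s_ij: party j's private distance portion for medoid i."""
    rng = random.Random(seed)
    n, k = len(portions), len(portions[0])
    he = ToyPaillier(rng)  # Step I: P_n generates the key pair (e, d)
    R = [[rng.randrange(1, 1000) for _ in range(k)] for _ in range(n)]  # envelopes R_ij

    def chain(value_of):
        # Each party multiplies its own ciphertext into the running product.
        out = []
        for i in range(k):
            c = he.encrypt(value_of(0, i))
            for j in range(1, n):
                c = he.add(c, he.encrypt(value_of(j, i)))
            out.append(c)
        return out

    enc_SR = chain(lambda j, i: portions[j][i] + R[j][i])  # Step II: e(S_i + R_i)
    enc_R = chain(lambda j, i: R[j][i])                    # Step III: e(R_i)

    # Step IV (1), (2): P_n decrypts; P_1 learns R_i, P_{n-1} learns S_i + R_i.
    R_plain = [he.decrypt(c) for c in enc_R]
    SR_plain = [he.decrypt(c) for c in enc_SR]

    # Step IV (3): P_{n-1} builds e(S_i - S_l) from e(-R_i), e(R_l) sent by P_1
    # and from e(-S_l - R_l), which she can encrypt herself under the public key.
    pairs = [(i, l) for i in range(k) for l in range(k) if i != l]
    phi = []
    for i, l in pairs:
        e_Si = he.add(enc_SR[i], he.encrypt(-R_plain[i]))
        e_neg_Sl = he.add(he.encrypt(-SR_plain[l]), he.encrypt(R_plain[l]))
        phi.append(he.add(e_Si, e_neg_Sl))

    # Step IV (4): P_{n-1} permutes, P_n returns only signs, P_{n-1} un-permutes.
    perm = list(range(len(phi)))
    rng.shuffle(perm)
    signs_permuted = [1 if he.decrypt(phi[p]) >= 0 else -1 for p in perm]
    sign = {pairs[p]: s for p, s in zip(perm, signs_permuted)}

    # Medoid i is nearest when S_l - S_i >= 0 for every other medoid l.
    for i in range(k):
        if all(sign[(l, i)] == 1 for l in range(k) if l != i):
            return i


if __name__ == "__main__":
    # Three parties, three medoids: S = [13, 4, 15], so medoid index 1 is nearest.
    print(nearest_medoid([[4, 1, 9], [2, 3, 5], [7, 0, 1]]))
```

In this simulation `R_plain` and `SR_plain` are held by different parties (P1 and Pn−1); combining them yields each S_i, which is the extra information a collusion of P1, Pn−1 and Pn would obtain, while the individual portions s_ij of the other parties stay hidden inside the sums.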
8. CONCLUSION AND FUTURE WORK In this paper we present a protocol for privacy preservation of data in the mobile cloud computing architecture. We focus on the k-medoids method since it allows arbitrary objects that are not limited to numerical attributes [4]. In particular, we have proposed using a homomorphic encryption technique to achieve collaborative clustering without sharing the private data among the collaborating parties. From the privacy protection point of view, k-medoids clustering seems more challenging than k-means clustering because k-medoids always uses real instances to compute the distances, instead of the mean instance whose values are the means of the real instance values.
9. REFERENCES
[1] Lindell Y. and Pinkas B. Privacy preserving data mining. In Advances in Cryptology - Crypto 2000, Lecture Notes in Computer Science, volume 1880, 2000.
[2] Goldreich O. Secure multi-party computation (working draft). http://www.wisdom.weizmann.ac.il/home/oded/public_html/foc.html, 1998.
[3] Yao A. Protocols for secure computations. In Proceedings of the 23rd Annual IEEE Symposium on Foundations of Computer Science, 1982.
[4] Kaufman L. and Rousseeuw P. Finding groups in data. Wiley, New York, NY, 1990.
[5] Agrawal R. and Srikant R. Privacy-preserving data mining. In Proceedings of the ACM SIGMOD Conference on Management of Data, 439-450, ACM Press, May 2000.
[6] Gehrke J. E., Evfimievski A., and Srikant R. Limiting privacy breaches in privacy preserving data mining. In Proceedings of the 22nd ACM SIGMOD Symposium on Principles of Database Systems, San Diego, CA, June 2003.
[7] Du W. and Zhan Z. Using randomized response techniques for privacy preserving data mining. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 24-27, 2003.
[8] Rizvi S. and Haritsa J. Maintaining data privacy in association rule mining. In Proceedings of the 28th VLDB Conference, Hong Kong, China, 2002.
[9] Sweeney L. k-anonymity: a model for protecting privacy. In International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5), 557-570, 2002.
[10] Aggarwal G., Mishra N., and Pinkas B. Secure computation of the kth-ranked element. In EUROCRYPT, 40-55, 2005.
[11] Dwork C. and Nissim K. Privacy-preserving data mining on vertically partitioned databases. In CRYPTO 2004, 528-544.
[12] Klusch M., Lodi S. and Moro G-L. Distributed clustering based on sampling local density estimates. In Proceedings of International Joint Conference on Artificial Intelligence, Mexico, 2003.
[13] http://www.abiresearch.com/research/1003385Mobile+Cloud+Computing.
[14] Oliveira S. and Zaiane O. Privacy preserving clustering by data transformation. In Proceedings of the 18th Brazilian Symposium on Databases, 304-318, Manaus, Brazil, October 6-8, 2003.
[15] Oliveira S. and Zaiane O. Privacy preserving clustering by data object similarity-based representation and dimensionality reduction transformation. In Workshop on Privacy and Security Aspects of Data Mining in conjunction with the 4th IEEE International Conference on Data Mining, 21-30, Brighton, UK, November 1, 2004.
[16] Vaidya J. and Clifton C. W. Privacy preserving k-means clustering over vertically partitioned data. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 24-27, 2003.
[17] Mishra R., Dash S. K., Mishra D. P. and Tripathy A. A privacy preserving repository for securing data across the cloud. In Proceedings of the 3rd International Conference on Electronics Computer Technology, Kanyakumari, India, April 8-10, 2011.
[18] Merugu S. and Ghosh J. Privacy-preserving distributed clustering using generative models. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 24-27, 2003.
[19] Han J. and Kamber M. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.
[20] Berkhin P. Survey of clustering data mining techniques. Accrue Software, San Jose, CA, 2002.
[21] Kaufman L. and Rousseeuw P. Finding groups in data. Wiley, New York, NY, 1990.
[22] FTC questions cloud-computing security. http://news.cnet.com/8301-13578_3-1019857738.html?part=rss&subj=news&tag=2547-1_3-0-20.
[23] AOL apologizes for release of user search data. http://news.cnet.com/2100-1030_3-6102793.html
[24] Lithuania Weathers Cyber Attack Braces for Round 2. http://blog.washingtonpost.com/securityfix/2008/07/lithuania_weathers_cyber_attac_1.html.
[25] Zhan J. Privacy preserving k-medoids clustering. IEEE, 2007.