Fast-anonymity - An approach for improved security in cloud database

Antonio M. R. Almeida¹, Matheus Pericini¹, Javam C. Machado¹

¹ Departamento de Computação – Universidade Federal do Ceará (UFC)
{manoel.ribeiro, matheus.pericini}@lsbd.ufc.br, [email protected]

Abstract. Privacy-preserving data publishing (PPDP) provides methods to collect, store and publish information in the cloud while preventing unauthorized access. In cloud computing, the centralization of data requires extra care with data privacy. Encryption has been adopted as the main solution to this problem. However, applying encryption to the data sent to and retrieved from the database is costly. Encrypted data also requires more storage space and cannot be sorted. This paper therefore proposes a data anonymization method that does not suffer from these problems. We assume that the data is sent to a relational database stored in the cloud or in a cloud provider's infrastructure. We also compare our proposed method to the encryption approach discussed in the survey Security and Privacy in Cloud Computing.

1. Introduction

People can only enjoy the full benefits of cloud computing if we are able to address the real concerns regarding privacy and security that arise when sensitive personal information is stored in databases and software spread throughout the Internet [Armbrust et al. 2010]. There are many service providers on the Internet. We can treat each service as a cloud; each cloud service exchanges data with other clouds, and privacy disclosure may occur when data is exchanged between them [Lu and Tseng 2002]. The privacy disclosure problem, whether it concerns an individual or a company, is therefore inevitably exposed when data is released or shared in a cloud service. Privacy is an important issue for cloud computing, both in terms of legal compliance and user trust, and needs to be considered at every phase of design. Our paper presents some privacy-preserving technologies used in cloud computing services [Deutsch and Papakonstantinou 2005]. Security and privacy are the main challenges that may prevent wide adoption of the cloud computing approach [Sweeney 2002]. Because a security flaw in any component can affect the other security components, the security of the whole system could collapse [Jr et al. ]. Intrusion into accounts or services is a well-known safety issue, occurring through phishing, fraud, and the exploitation of vulnerabilities in systems and applications; the common practice of reusing user credentials and passwords amplifies the impact of this type of attack.

The use of encryption in the infrastructure, in the database and in the communication protocol with the browser is an important measure for privacy and security, especially for databases hosted in the cloud [Jensen et al. 2010]. However, a complex and naturally exposed cloud scenario remains vulnerable: a cloud service provider that has occasional access to the database, together with a password that enables data decryption, puts the safety of this model in doubt. This paper proposes creating another layer of protection for the data at the application layer, independent of the database infrastructure and file systems [Pacheco 2013].

The most popular methods of anonymization are deletion (suppression), generalization, shuffling and replacement. With the suppression and generalization methods data is lost, while with shuffling and replacement the original data can be recovered. The shuffling method is similar to the replacement method, except that the anonymized data is derived from the column itself. Both methods have their pros and cons, depending on the size of the database in use. For example, in the replacement process the integrity of the information remains intact (unlike the information resulting from the encryption process). However, replacement can pose a challenge if the records consist of a million user names that require substitution [Owens et al. 2004]: an effective substitution requires a list that is equal to or longer than the amount of data that requires substitution. In the shuffling process the integrity of the data also remains intact, and the substitute data is easy to obtain since it is derived from the existing column itself. However, shuffling can be an issue if the number of records is small [Machanavajjhala et al. 2007]. To keep the context of the information, Fast-anonymity uses the shuffling method.

2. Related Work

Security and Privacy in Cloud Computing: A Survey. According to [Zhou et al. 2010], the confidentiality of data is a major obstacle to the use of cloud data services. The authors argue that the future prosperity of cloud computing will come only after the security and privacy issues have been resolved.

Database Security Approach for Distributed Datasets: A Survey. This work seeks to strengthen encryption-based security schemes for database infrastructure in the cloud using the concept of a trusted third party (TTP), where the parties involved rely on a third party to validate a certificate of access to the data [Palve and Deshpande 2014].

3. Implementation and Evaluation

We define this study as a strategy based on perturbation by word exchange, using a mapping table that is dynamically generated for a given context of the data. This word mapping table, the mapTable, acts as the perturbation parameter: when applied to the original data, it generates anonymized data that is also in text format rather than in binary, as is the case with traditional cryptography. This approach has some advantages over encryption.

Figure 1. The architecture of Fast-anonymity model.

The display format of the data makes it difficult to define a stop criterion for a brute-force de-anonymization algorithm that lacks the mapTable. It also allows a database with anonymized data to be used normally while preserving the privacy of the data. The mapTable can and should be stored in a different infrastructure from the one where the anonymized data is stored, thereby making it harder to break the de-anonymization process. To provide performance and availability for this data structure, we persist the mapTable in a NoSQL database. For the sake of simplicity and performance, we used MongoDB in the experiments [Cattell 2011]. MongoDB also natively offers data replication, which is critical for maintaining system availability in disaster recovery.

The mapTable structure basically consists of a key and a value, where the key is the original term and the value is the anonymized term. In MongoDB, the mapTable is a collection within a database whose name identifies a unique context of the application using a universally unique identifier (UUID) [Leach et al. 2005]. The UUID is necessary to allow the use of Fast-anonymity in a multi-tenant environment. The relational database also receives a UUID mark to avoid attempts to de-anonymize with the wrong mapTable, in which case the algorithm returns the anonymized value itself to the application (dirty read). This may be particularly useful when releasing copies of the production database to application test environments.

The dicWords data structure stores the terms that will be used in the anonymization process, so that anonymization is not done with a different set of words from the language used in the application. Early in the process, a dictionary of words based on the original database is created. This dictionary is used during anonymization and is stored in MongoDB in a collection called dicWords.
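The key/value structure of the mapTable and its UUID context can be illustrated with a minimal in-memory sketch. This is not the authors' implementation: the class name `MapTable`, its methods, and the use of plain dictionaries in place of a MongoDB collection are assumptions made only for illustration. The behaviour on a context mismatch (returning the anonymized value unchanged) follows our reading of the "dirty read" remark above.

```python
import uuid


class MapTable:
    """Hypothetical in-memory stand-in for one mapTable collection."""

    def __init__(self, context_id=None):
        # The context UUID plays the role of the collection name.
        self.context_id = context_id or str(uuid.uuid4())
        self._forward = {}  # original term -> anonymized term
        self._reverse = {}  # anonymized term -> original term

    def put(self, original, anonymized):
        # An anonymized token must map back to exactly one original term.
        if self._reverse.get(anonymized, original) != original:
            raise ValueError("token already in use: " + anonymized)
        self._forward[original] = anonymized
        self._reverse[anonymized] = original

    def anonymize(self, original):
        return self._forward[original]

    def deanonymize(self, anonymized, context_id):
        # Data marked with another context UUID is returned as-is
        # (assumed reading of the "dirty read" behaviour).
        if context_id != self.context_id:
            return anonymized
        return self._reverse[anonymized]
```

In a real deployment the two dictionaries would presumably be replaced by lookups against the separately hosted store, so that the relational database alone never holds enough information to reverse the mapping.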
If the number of words in dicWords is not sufficient, the anonymization algorithm creates random words of the required size. Initially the algorithm tries to combine two smaller words separated by ' '; if a new word cannot be formed this way, it is generated entirely at random using letters and numbers. If more than 50% of the word to be anonymized consists of numbers, the new word is generated purely at random by changing the numbers. The Fast-anonymity process thus uses a set of words in context (dicWords) to produce a random mapping of words (mapTable) within a context (UUID). This structure, stored in a place (MongoDB) separate from the relational database, functions as an access key to the original data, thus creating another layer of protection for sensitive data. The algorithms may be implemented in the application layer or in the JDBC driver, where they are transparent to both the application layer and the database.

Anonymized data completely loses its ordering within a table in a relational database. This may harm the application, since many queries sort their results through the ORDER BY clause of the SELECT command. To preserve data ordering, Fast-anonymity adds to the anonymization process a hashcode of the field that preserves the order of the original data. An anonymized field is thus composed of the hashcode plus the anonymized terms. The hashcode is stored in base 36 (hexatridecimal). Each character of the original ASCII string is normalized to the ASCII range 32-96, i.e. 64 values, which are then reduced in the proportion 36/64. The last digit of the hashcode is generated randomly to hinder any attempt to recover the original key value. The size of the hashcode (Table 1) is proportional to the difference between the size of the anonymized field and the maximum field size.

Table 1. Hashcode example

Original Field      Hashcode   Anonymized Field
Adrian Castaneda    IKSNIP     Cade Camille
Adrian Dickerson    IKSNIP     Cade Osborne
...                 ...        ...
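The order-preserving hashcode can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, normalization is assumed to mean upper-casing and clamping each character to codes 32..95 (64 values), and the 36/64 reduction is assumed to use integer floor division. Under these assumptions the six-character prefix of "Adrian Castaneda" comes out as IKSNIP, as in Table 1.

```python
import random

BASE36 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def order_prefix(text, length):
    """Map the first `length` characters to base-36 digits, keeping order."""
    digits = []
    # Pad short strings with spaces, which map to the lowest digit '0'.
    for ch in text.upper()[:length].ljust(length):
        code = min(max(ord(ch), 32), 95)  # clamp to the 64 values 32..95
        digits.append(BASE36[(code - 32) * 36 // 64])  # scale 64 -> 36
    return "".join(digits)


def hashcode(text, length, rng=random):
    """Order prefix of length - 1 digits plus one random trailing digit."""
    return order_prefix(text, length - 1) + rng.choice(BASE36)
```

Because the per-character mapping is monotonic but not one-to-one, two different originals can share a prefix (both rows of Table 1 do), which is one source of the residual sort error discussed next.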

To estimate the residual error, we sort by the hashcode and compare the order of the original records with the order of the anonymized records. The displacement in order, if any, is computed for each record, and the average displacement over the whole set is obtained. The displacement percentage is the ratio of the individual displacement to the sample size (Table 2).

Table 2. Residual error of sorting with anonymization

Order     Original Field    Anonymized Field     Shift   Error%
1         Aaron Carson      IISQP0 mFW Dawn              0.00%
2         Aaron Hoover      IISQP0 mFW nYvw              0.00%
3         Abbot Pierce      IJJQT0 jfN Hale              0.00%
4         Abra Garner       IJSI0M QbG Bree      0.20    0.02%
5         Abraham Morin     IJSIMI Kylan kPB     0.40    0.04%
7         Acton Browning    IJTQP0 uat Briggs    1.00    0.10%
...       ...               ...                  ...     ...
Average                                          9.65    0.97%
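The displacement measure behind Table 2 can be sketched as below. The function name and return shape are hypothetical, and the sketch measures a single sort pass only; the fractional shifts in Table 2 suggest the reported figures were averaged in some way that the text does not specify.

```python
def residual_sort_error(originals, anonymized):
    """Compare sort order before and after anonymization.

    originals[i] and anonymized[i] describe the same record. Returns the
    per-record shifts (listed in original sort order), the mean shift, and
    the mean shift as a fraction of the sample size.
    """
    n = len(originals)
    by_original = sorted(range(n), key=lambda i: originals[i])
    by_anonymized = sorted(range(n), key=lambda i: anonymized[i])
    # Position of each record in the anonymized ordering.
    rank_anon = {rec: pos for pos, rec in enumerate(by_anonymized)}
    shifts = [abs(pos - rank_anon[rec]) for pos, rec in enumerate(by_original)]
    mean_shift = sum(shifts) / n
    return shifts, mean_shift, mean_shift / n
```

With a sample of 1,000 records, a mean shift of 9.65 positions corresponds to 9.65 / 1000, roughly 0.97%, which matches the last row of Table 2.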

getFreeToken

getFreeToken is the core function of the anonymization process. Its main objective is to obtain a unique token and to reduce the size of the new word compared to the original word, which enables the hashcode strategy without exceeding the maximum field size. The size is reduced as follows:

$size = \lfloor n - (\log_n 3 + 1) \rfloor$

where n is the size of the original term. This formula allows a gradual reduction in the size of an anonymized term in relation to the original term, saving space for the hashcode. Only terms with size greater than three are anonymized.

Contributions

In addition to the components of the engineering model, we consider the main contributions of this work to be:

1. A solution to the sorting problem in anonymized fields, using the hashcode
2. An anonymization algorithm that respects the maximum field size, avoiding unnecessary growth of the database
3. Very fast anonymization and de-anonymization algorithms

Experiments

The algorithms were implemented at the level of the JDBC driver. To carry out performance tests we used several supporting tools: spawner, PostgreSQL, MongoDB and the Java Cipher library. The spawner software was used to generate four sets of test data, each a random collection of people's names with a different number of elements: 1,000, 10,000, 100,000 and 1,000,000. These datasets were imported into PostgreSQL tables, and a test routine was applied that repeats the following steps 100 times: Read (plain), Read (anonymized), Read (encrypted), Read and write (plain), Read and write (anonymized), Read and write (encrypted). All experiments were performed in a controlled environment, free from outside influence.

Results

The residual sort error chart shows the cumulative percentage of tuple positioning errors. The lower the error, the fewer anonymized records are listed out of their natural order.
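Returning to the getFreeToken size rule given earlier in this section, the short sketch below evaluates it for a few term lengths. It assumes the formula reads as the floor of n − (log_n 3 + 1), with n the length of the original term and the logarithm taken in base n; this reading is an interpretation of the extracted text, chosen because it reproduces the lengths in Table 1 (Adrian, 6 characters, becomes Cade, 4; Castaneda, 9, becomes Camille, 7). The function name `token_size` is hypothetical, and returning n unchanged for short terms is an assumption based on the statement that only terms longer than three characters are anonymized.

```python
import math


def token_size(n):
    """Length of the anonymized term for an original term of length n."""
    if n <= 3:
        # Assumed: short terms are not anonymized and keep their length.
        return n
    # math.log(3, n) is the logarithm of 3 in base n.
    return math.floor(n - (math.log(3, n) + 1))
```

For terms of five or more characters this works out to a saving of two characters per term, which is the space the hashcode prefix can reuse.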

Figure 2. Read/Write [plot "Residual sort error": % error versus Records]

Figure 3. Read/Write [plot "one million names (r/w)": Time (sec.) versus Experiment; series: default, anonymized, crypto]

Figure 4. Read [plot "one million names (read)": Time (sec.) versus Experiment; series: default, anonymized, crypto]

The first observation is that the impact of the anonymization and encryption methods on data size is very different. While anonymization causes an increase of 26.48% in the average size of the records stored in the PostgreSQL table, encryption causes an average increase of 175.62%, as shown in Table 3. Analyzing Figures 3 and 4 of the stress tests, we can conclude that in read-only cases the overhead of the anonymization method is at least five times smaller than that of the encryption method. In read-and-write cases, the anonymization method performs slightly worse than the encryption method. The weakness of our solution is its dependence on the mapTable to de-anonymize the data: if this structure is lost, the original data cannot be recovered.

Table 3. Impact of anonymization on data size

Data Set      Max   Mean
Original      22    12.84
Anonymized    24    16.24
Encrypted     64    35.39
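As a check, the size increases quoted in the text follow directly from the Mean column of Table 3, taking the original mean of 12.84 as the baseline:

```latex
\[
\frac{16.24 - 12.84}{12.84} = \frac{3.40}{12.84} \approx 0.2648 = 26.48\%,
\qquad
\frac{35.39 - 12.84}{12.84} = \frac{22.55}{12.84} \approx 1.7562 = 175.62\%.
\]
```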

4. Conclusion and Future Work

The benefit of increased privacy in cloud systems and the low overhead of our proposed method lead us to conclude that Fast-anonymity is a viable data privacy alternative in the cloud: it has low impact on performance and ensures that the original data remains protected even if the cloud infrastructure barriers are broken by malicious users. A deficiency of this model, open to further contributions, is the residual sorting error, which may cause small displacements in order of up to 1% of the records of a data set.

References

Armbrust, M., Fox, A., Griffith, R., Joseph, A. D., Katz, R., Konwinski, A., Lee, G., Patterson, D., Rabkin, A., Stoica, I., et al. (2010). A view of cloud computing. Communications of the ACM, 53(4):50–58.

Cattell, R. (2011). Scalable SQL and NoSQL data stores. ACM SIGMOD Record, 39(4):12–27.

Deutsch, A. and Papakonstantinou, Y. (2005). Privacy in database publishing. In Database Theory-ICDT 2005, pages 230–245. Springer.

Jensen, M., Schage, S., and Schwenk, J. (2010). Towards an anonymous access control and accountability scheme for cloud computing. 2013 IEEE Sixth International Conference on Cloud Computing, 0:540–541.

Jr, A. M., Laureano, M., Santin, A., and Maziero, C. Aspects of security and privacy in computing environments cloud.

Leach, P. J., Mealling, M., and Salz, R. (2005). A universally unique identifier (UUID) URN namespace.

Lu, C.-C. and Tseng, S.-Y. (2002). Integrated design of AES (advanced encryption standard) encrypter and decrypter. In Application-Specific Systems, Architectures and Processors, 2002. Proceedings. The IEEE International Conference on, pages 277–285. IEEE.

Machanavajjhala, A., Kifer, D., Gehrke, J., and Venkitasubramaniam, M. (2007). l-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):3.

Owens, L., Duffy, A., and Dowling, T. (2004). An identity based encryption system. In Proceedings of the 3rd international symposium on Principles and practice of programming in Java, pages 154–159. Trinity College Dublin.

Pacheco, V. M. (2013). Employment of anonymity privacy for the improvement in the consumption of services in SaaS.

Palve, K. K. and Deshpande, R. (2014). Database security approach for distributed datasets: A survey. Database, 2(11).

Sweeney, L. (2002). k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05):557–570.

Zhou, M., Zhang, R., Xie, W., Qian, W., and Zhou, A. (2010). Security and privacy in cloud computing: A survey. In Semantics Knowledge and Grid (SKG), 2010 Sixth International Conference on, pages 105–112. IEEE.
