Procedia Technology 24 (2016) 1550 – 1557
International Conference on Emerging Trends in Engineering, Science and Technology (ICETEST - 2015)
Role and access based data segregator for security of Big Data
Aayush Gupta*, Ketan Pandhi, P V Bindu, P Santhi Thilagam
Department of Computer Science and Engineering, National Institute of Technology Karnataka, Surathkal, India
Abstract

A conceptual storage model based on Hadoop is proposed in order to address security problems such as transmission, data storage, and data verification in distributed networks. The model is meant for the private cloud of an organization and allows the organization to store data ranging from confidential to public on the same cloud. The proposed model for cloud storage segregates data based on the roles in the organizational hierarchy that have access to the data. In addition, it takes into account how valuable the data is for the organization and the number of times the data has been accessed within a certain period of time. The model is a combination of Role Based Access Control and the Hadoop Distributed File System, and employs a normalization technique to utilize resources efficiently by evenly distributing data over different servers.

© 2016 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of the organizing committee of ICETEST – 2015.

Keywords: Role Based Access Control; Hadoop Distributed File System; Hadoop; Big Data; storage criticality
1. Introduction

Cloud computing is an Internet-based development that provides IT services over the Internet. As a software concept, it is an extension of grid computing and distributed computing. It relies mainly on resource sharing, which makes it highly scalable over large-scale distributed systems. It is realized mainly through virtualization technology, which is classified into single-machine and multiple-machine virtualization [1]. Single-machine virtualization makes a single machine work virtually as multiple machines working together; VMware is an example of single-machine virtualization. Multiple-machine virtualization connects multiple machines through a control center and makes them work like one machine; Hadoop is an example of multiple-machine virtualization.
* Corresponding author. Tel.: +0-000-000-0000. E-mail address: [email protected]
doi:10.1016/j.protcy.2016.05.130
Because of the rapid growth of cloud computing [2, 3], Big Data has become ubiquitous and crucial in numerous application domains, leading to significant challenges from a data management perspective [4, 5]. Cloud computing is evolving from single private clouds into a complex cloud ecosystem, introducing new settings such as multi-clouds and cloud federations that involve end users, Service Providers (SP), and Infrastructure Providers (IP). This evolution has introduced many challenges in terms of security, trust, risk, eco-efficiency, cost, and legal issues.

Cloud storage systems provide application-level security and involve different components that authenticate and process user requests. These components execute with sufficient privileges to access any user's data, and each component is responsible for authorizing a request based on the credentials of the user. OpenStack Swift [6] and a few publicly available cloud storage systems use this architecture. Further, a logged-in user can store data in the cloud without any additional storage media and can access the same data anywhere and anytime. A cloud disk makes storage simple, fast, and convenient, but users are not allowed to upload confidential data for security reasons. The security problems of cloud disks are quite common in cloud computing [7]. Security challenges exist for all three cloud service models: SaaS, PaaS, and IaaS. The threats faced from different aspects are analyzed and summarized in [8, 9]. The security of cloud storage is weakened by the following factors:
1. Transmission security: Data can be intercepted while being transmitted in the cloud because of weak encryption protection.
2. Access control: Since user data is stored in the cloud without setting access authority, the user loses the absolute right to monitor the data.
3. Data storage: Once data has been uploaded, it is stored in a distributed manner and the user is unaware of the specific position of the data. Since confidential and non-confidential data are not classified when stored, data leakage may occur.
4. Data verification: The cloud performs no verification or inspection of the data being uploaded. It cannot guarantee that the uploaded data corresponds to the right user's data or to the original data from the user.

The top ten Big Data security and privacy challenges are discussed in [10]. The scientific challenges faced by cloud security have been discussed in survey articles [11, 12]. Based on the security threats discussed in [10, 11, 12], we propose a model for the private cloud of an organization that stores data by considering the role hierarchy of the organization along with several other parameters.

The rest of the paper is organized as follows: Section 2 discusses the related work. Section 3 describes the conceptual model along with the assumptions made. Section 4 explains the algorithms proposed for storing data in the cloud, and Section 5 analyzes the results of the model over different data sets. Section 6 presents the conclusion and future work.

2. Related work

Hadoop is an Apache open source project which consists of the Hadoop Distributed File System (HDFS), MapReduce, HBase, Hive, ZooKeeper, and other subprojects [13]. Its main components are HDFS and MapReduce. MapReduce targets large-scale parallel processing of tasks, which makes the MapReduce scheduler particularly important [14]. HDFS is an open-source distributed file system modeled on the Google File System (GFS) [15], and provides high fault tolerance and data access control.
We use Hadoop's HDFS for cloud storage. As Hadoop originally lacked built-in security measures, Kerberos was integrated into Hadoop by Yahoo in 2009. A user has to obtain access certification from a third-party key distribution center before accessing the Hadoop cluster, which greatly reduces the risk of user data leakage.

Batten et al. [16] focus on preventing information harvesting in cloud data storage. The information harvesting problem is of major concern to cloud customers, the cloud business center, and the cloud data storage center. Data stored in the storage center may be highly confidential and come from a multitude of customers. The staff of the data storage center have access to the data and can hence steal identities to compromise cloud customers. The authors propose an efficient method of data storage that prevents the staff from accessing and stealing data. The Large Iterative Multitier Ensemble (LIME) classifier model has been proposed by Abawajy et al. [17] for information security of Big Data. The LIME classifier is a four-tier classifier for the detection of malware in Big Data; the classifiers are automatically generated as iterative multi-tier ensembles. Zhang et al. [18] proposed the use of HDFS to build a private enterprise cloud suitable for big data.
For secure scheduling of big data applications in cloud computing, Liu et al. [19] proposed an iterative hierarchical key exchange scheme. Privacy preservation over big data in the cloud is considered in [20]. Qinghua et al. [21] put forth a method for protecting user data privacy in a cloud storage platform. Yu et al. [22] encrypt cloud data using an attribute-based encryption (ABE) scheme, which defines a public key for each attribute associated with a data file. These public keys are used to encrypt the data files before they are uploaded. The scheme ensures the safety of data storage while the server has no need to keep a public key for each user. Chang et al. [23] implement access control by identifying users with image processing methods such as face recognition and fingerprint recognition. Jing et al. [24] proposed a cloud-disk storage based on Hadoop, adopting principles from the Kerberos authorization process. Factor et al. [25] proposed Secure Logical Isolation for Multitenancy (SLIM), a model that provides an orthogonal tenant isolation mechanism over existing application-level security; it uses the Linux process isolation mechanisms to provide the isolation.

3. Conceptual model

We propose a model that segregates data in the cloud into classes by evaluating the storage criticality of each file. The proposed model is intended for unstructured data, which may include text files, binary files, images, audio, etc. The model is meant for the private cloud of an organization, and it requires the hierarchy model of the organization as well as the weight associated with each role in the hierarchy. For each file being stored, the following information is associated with it:
• the lowest role in the hierarchy that has access to the file;
• the external criticality of the file, which indicates how valuable the file is for the organization;
• the number of times the file is accessed within a time period of 24 hours.

To evaluate the storage criticality (SC) of a file, we use the following proportionality relations between SC and the above-mentioned inputs:
• The greater the value of the external criticality (EC), the more security the file needs, since the file is more valuable for the organization: SC ∝ EC.
• All the roles lying above the lowest role having access to a file also have access to the file. Thus, such a file needs less security in order to make intra-organization access easier. Hence, SC is inversely proportional to the sum of the role weights (RW) of the roles from the top of the hierarchy down to the lowest role having access to the file: SC ∝ 1/∑_{i=1}^{L} RW(i), where L is the level, counted from the top of the hierarchy, of the lowest role having access to the file.
• The more times the file is accessed within a time period of 24 hours (denoted by the access number, AN), the more security the file needs, as it is being accessed frequently: SC ∝ AN.
• To normalize the proportionality of the storage criticality with AN, we identify the maximum access number (MAN) over all files, and SC ∝ 1/MAN.

Combining the above four proportionalities, the final value of the storage criticality of a file x is given as

SC(x) = (EC(x) × AN(x)) / (SRW(x) × MAN(x)), where SRW(x) = ∑_{i=1}^{L} RW(i).   (1)
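To make the computation concrete, the following is a minimal Python sketch of equation (1). The function name, the representation of the role hierarchy as a list of weights ordered from the top role downward, and the example values are our own illustrative assumptions, not part of the proposed system.

def storage_criticality(ec, an, man, role_weights, lowest_role_level):
    """Compute SC(x) = (EC(x) * AN(x)) / (SRW(x) * MAN(x)) as in equation (1).

    ec                -- external criticality of the file
    an                -- number of accesses to the file in the last 24 hours
    man               -- maximum access number observed over all files
    role_weights      -- role weights ordered from the top of the hierarchy downward
    lowest_role_level -- level (1-based, from the top) of the lowest role with access
    """
    srw = sum(role_weights[:lowest_role_level])   # SRW(x) = sum of RW(i) for i = 1..L
    return (ec * an) / (srw * man)

# Illustrative values only (not taken from the paper's data sets):
print(storage_criticality(ec=750, an=120, man=2000,
                          role_weights=[90, 70, 50, 30, 10], lowest_role_level=3))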
Let the SC obtained using the above formula lie in the range [0, N]. If we have K security classes, their boundary values will be [0, N/K], (N/K, 2N/K], (2N/K, 3N/K], …, ((K-1)N/K, N]. However, the number of files will be larger in the lower classes and smaller in the higher classes, leading to relative over-utilization of resources in the lower classes and under-utilization in the higher classes. To overcome this shortcoming, we normalize the storage criticalities using the cumulative probability distribution function. The normalization works as follows:
• For a given storage criticality x, identify the total number of files with SC = x. Let Nx be the number of files with storage criticality x, and let N be the total number of files stored in the server.
• Find the probability of SC = x:
  P(x) = Nx / N   (2)
• Evaluate the cumulative probability for SC = x:
  CP(x) = ∑_{i=0}^{x} P(i)   (3)
• Let K be the number of storage classes to be built. Replace SC = x with CP(x) × K.
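For illustration (the numbers are of our own choosing, not from the experiments): with K = 5 classes, if 30% of the files have SC less than or equal to x, then CP(x) = 0.3 and the normalized criticality is CP(x) × K = 1.5, so the file falls in the class (1, 2]. Since the cumulative probability is roughly uniform over the ranked files, each class receives about one-fifth of the files.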
Following the above procedure, the new boundary values of the storage criticalities of the security classes are [0, 1], (1, 2], …, (K-1, K]. The normalization further ensures an even distribution of files over the security classes, leading to better utilization of resources.

4. Experimental setup

For experimental purposes, we built our own name node with functionality quite similar to the NameNode of HDFS. We performed experiments over data sets of 1000, 10000, and 100000 files, where each file was assigned the following values of the aforementioned parameters (a sketch of how such a synthetic data set can be generated is given after the list):
• External criticality (EC): [1, 1000]
• Number of roles in the hierarchy model: 20
• Weight of each role (RW): [1, 100]
• Access number for a file: [0, 2000]
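The paper does not specify how the synthetic files were generated; the following sketch simply assumes values drawn uniformly at random from the ranges above, with the lowest accessing role picked uniformly among the 20 levels. The function and field names are our own.

import random

def generate_dataset(num_files, num_roles=20, seed=0):
    """Generate a synthetic data set with parameters in the ranges listed above.
    Uniform sampling is our assumption; the paper does not state the distributions used."""
    rng = random.Random(seed)
    role_weights = [rng.randint(1, 100) for _ in range(num_roles)]  # RW for each role, top-down
    files = []
    for i in range(num_files):
        files.append({
            "name": "file_%d" % i,
            "ec": rng.randint(1, 1000),              # external criticality
            "an": rng.randint(0, 2000),              # accesses in the last 24 hours
            "lowest_role_level": rng.randint(1, num_roles),
        })
    return role_weights, files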
The number of security classes into which the data is to be stored is taken as 5. Thus the name node had 5 servers, with each server having 1000 racks and each rack having 1000 data nodes. For the experiments, a buffer is also required whose threshold capacity is one-tenth of the total number of files present in the server. The experiments were performed on a system with a dual-core processor with a 2.2 GHz clock speed, 3 GB RAM, and a 64-bit Windows 7 OS. The pseudocode for normalizing the storage criticality using the cumulative distribution function is given in Algorithm 1.

Algorithm 1: Cumulative Distribution Normalization Function
file_dir ← dictionary of files, where the key is a file name and the value is the file's storage_criticality
criticality_count ← dictionary where the key is a storage_criticality, say S, and the value is the number of files having storage_criticality equal to S
critics ← list to store all the storage_criticality values
N ← total number of files in file_dir
for file in file_dir.keys() do
    if file_dir[file] not in criticality_count.keys() then
        criticality_count[file_dir[file]] = 1
    else
        criticality_count[file_dir[file]] = criticality_count[file_dir[file]] + 1
    end if
end for
for k in criticality_count.keys() do
    criticality_count[k] = criticality_count[k] / N
    critics.append(k)
end for
Sort(critics)
for i = 1 to critics.length() - 1 do
    criticality_count[critics[i]] = criticality_count[critics[i]] + criticality_count[critics[i-1]]
end for
for k in criticality_count.keys() do
    criticality_count[k] = criticality_count[k] * 5
end for
return criticality_count
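For reference, the following is a minimal runnable Python rendering of Algorithm 1 (our own sketch; the names follow the pseudocode, and the number of classes K = 5 is kept as a default to match the experiments):

def cumulative_distribution_normalization(file_dir, k_classes=5):
    """Normalize storage criticalities with the empirical cumulative distribution (Algorithm 1).
    file_dir maps file name -> storage criticality; returns SC value -> normalized value in (0, k_classes]."""
    n = len(file_dir)
    criticality_count = {}
    for sc in file_dir.values():                       # count files per SC value
        criticality_count[sc] = criticality_count.get(sc, 0) + 1
    for sc in criticality_count:                       # convert counts to probabilities P(x)
        criticality_count[sc] /= n
    critics = sorted(criticality_count)                # distinct SC values in increasing order
    for prev, cur in zip(critics, critics[1:]):        # cumulative probabilities CP(x)
        criticality_count[cur] += criticality_count[prev]
    return {sc: cp * k_classes for sc, cp in criticality_count.items()}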
Algorithm 2 performs the role and access based data segregation; it calls Algorithm 1 to normalize the storage criticalities.

Algorithm 2: Role and Access Based Data Segregation
f ← file to be stored in the server
file_dir ← dictionary holding the name of each file, its normalized criticality, and the server allocated to it
external_critic ← external criticality of file f
buffer_threshold ← threshold size of the buffer
buffer_count ← number of files in the buffer
master_node ← master node of the proposed file distribution system
if buffer_count < buffer_threshold then
    server = (external_critic / 500) mod 5
    master_node.add_file(f, server)
    buffer_count += 1
else
    normalized_files = CumulativeDistributionNormalization(file_dir)
    for file in file_dir do
        if 0 ≤ normalized_files[file] ≤ 1 then server = 0
        else if 1 < normalized_files[file] ≤ 2 then server = 1
        else if 2 < normalized_files[file] ≤ 3 then server = 2
        else if 3 < normalized_files[file] ≤ 4 then server = 3
        else if 4 < normalized_files[file] ≤ 5 then server = 4
        end if
        if file_dir[file].server ≠ server then
            master_node.shift_file(file, server)
        else
            file_dir[file].criticality = normalized_files[file]
        end if
    end for
end if
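The following compact Python sketch shows the core of the else-branch of Algorithm 2, i.e. mapping each file to a server index from its normalized criticality. It is our own rendering: it reuses cumulative_distribution_normalization from the sketch above, and the buffer handling and master-node calls are omitted.

import math

def assign_servers(file_dir, k_classes=5):
    """Assign each file to a server 0..k_classes-1 according to its normalized criticality.
    file_dir maps file name -> storage criticality (as produced by equation (1))."""
    normalized = cumulative_distribution_normalization(file_dir, k_classes)
    assignment = {}
    for name, sc in file_dir.items():
        value = normalized[sc]                                      # normalized criticality in (0, k_classes]
        server = max(0, min(k_classes - 1, math.ceil(value) - 1))   # class (i, i+1] -> server i
        assignment[name] = server
    return assignment

# Example with illustrative SC values: lower criticality maps to lower-numbered servers.
print(assign_servers({"a": 0.2, "b": 0.9, "c": 0.5}))               # {'a': 1, 'b': 4, 'c': 3}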
5. Results

We tested Algorithm 2 on data sets of 1000, 10000, and 100000 files and evaluated the number of files stored under the five levels of security classes before and after normalizing the storage criticality.

Table 1: File distribution for unnormalized and normalized SC for the 1000-file data set.

Level of          Unnormalized SC                                  Normalized SC
security class    Lower boundary  Upper boundary  Files stored     Lower boundary  Upper boundary  Files stored
Level V           > 80            ≤ 100           20               > 4             ≤ 5             200
Level IV          > 60            ≤ 80            20               > 3             ≤ 4             200
Level III         > 40            ≤ 60            19               > 2             ≤ 3             202
Level II          > 20            ≤ 40            133              > 1             ≤ 2             198
Level I           ≥ 0             ≤ 20            808              ≥ 0             ≤ 1             200
Table 1, Table 2, and Table 3 show the file distributions belonging to the different levels of security classes under the servers with unnormalized and normalized storage criticalities for the 1000-, 10000-, and 100000-file data sets, respectively.

Table 2: File distribution for unnormalized and normalized SC for the 10000-file data set.

Level of          Unnormalized SC                                  Normalized SC
security class    Lower boundary  Upper boundary  Files stored     Lower boundary  Upper boundary  Files stored
Level V           > 80            ≤ 100           230              > 4             ≤ 5             2001
Level IV          > 60            ≤ 80            216              > 3             ≤ 4             2000
Level III         > 40            ≤ 60            193              > 2             ≤ 3             2000
Level II          > 20            ≤ 40            1402             > 1             ≤ 2             1999
Level I           ≥ 0             ≤ 20            7959             ≥ 0             ≤ 1             2000
Table 3: File distribution for unnormalized and normalized SC for the 100000-file data set.

Level of          Unnormalized SC                                  Normalized SC
security class    Lower boundary  Upper boundary  Files stored     Lower boundary  Upper boundary  Files stored
Level V           > 80            ≤ 100           2006             > 4             ≤ 5             20015
Level IV          > 60            ≤ 80            1956             > 3             ≤ 4             19982
Level III         > 40            ≤ 60            1966             > 2             ≤ 3             19981
Level II          > 20            ≤ 40            13962            > 1             ≤ 2             20011
Level I           ≥ 0             ≤ 20            80110            ≥ 0             ≤ 1             20011
From the results, it can be inferred that when the storage criticality is not normalized, the servers of the security classes at Levels I and II are over-utilized. However, with the proposed normalization methodology, the files are distributed evenly across all levels of security classes.
6. Conclusion and future work

The following conclusions are derived from the experiments:
1. The normalization ensures an even distribution of files over the different security servers, resulting in efficient utilization of resources.
2. When the data set is quite large, the normalization function is called iteratively, thereby resulting in re-evaluation of the storage criticality.
3. As the number of times a file is accessed changes every 24 hours, the storage criticality of the file changes, resulting in dynamic placement of the file and thus ensuring more security for the file system.

Future work for the proposed model includes its practical implementation in the Hadoop framework. Furthermore, the periodic normalization of the storage criticality results in an overhead; to reduce it, the threshold capacity of the buffer can be tuned so that a minimum number of normalizations is performed. We can also look at optimizing the MapReduce tasks to overcome the induced overhead. Future work also involves identifying more parameters for evaluating the storage criticality so as to make the formula more accurate and reliable.
References
[1] Virtualization (May 26, 2015). URL https://en.wikipedia.org/wiki/Virtualization
[2] Wang L, Ranjan R, Chen J, Benatallah B. Cloud computing: methodology, systems, and applications. CRC Press; 2011.
[3] Liu X, Chen J, Yang Y. Temporal QoS management in scientific cloud workflow systems. Elsevier; 2012.
[4] Chen J, Chen Y, Du X, Li C, Lu J, Zhao S, Zhou X. Big data challenge: a data management perspective. Frontiers of Computer Science; 2013. p. 157-164.
[5] Liu L. Computing infrastructure for big data processing. Frontiers of Computer Science; 2013. p. 165-170.
[6] OpenStack Swift (May 26, 2015). URL http://docs.openstack.org/developer/swift/
[7] Munier M, Lalanne V, Ricarde M. Self-protecting documents for cloud storage security. IEEE 11th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom); 2012. p. 231-238.
[8] Shaikh F B, Haider S. Security threats in cloud computing. IEEE International Conference on Internet Technology and Secured Transactions (ICITST); 2011. p. 214-219.
[9] Winkler V J. Securing the Cloud: Cloud Computer Security Techniques and Tactics. Elsevier; 2011.
[10] Expanded top ten big data security and privacy challenges. Tech. rep.; 2013.
[11] Ryan M D. Cloud computing security: The scientific challenge, and a survey of solutions. Journal of Systems and Software; 2013. p. 2263-2268.
[12] Rong C, Nguyen S T, Jaatun M G. Beyond lightning: A survey on security challenges in cloud computing. Computers & Electrical Engineering; 2013. p. 47-54.
[13] Xu G, Xu F, Ma H. Deploying and researching Hadoop in virtual machines. IEEE International Conference on Automation and Logistics (ICAL); 2012. p. 395-399.
[14] Tang Z, Zhou J, Li K, Li R. MTSD: A task scheduling algorithm for MapReduce based on deadline constraints. Proceedings of the 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum. IEEE Computer Society; 2012.
[15] Ghemawat S, Gobioff H, Leung S T. The Google file system. ACM SIGOPS Operating Systems Review; 2003. p. 29-43.
[16] Batten L M, Abawajy J, Doss R. Prevention of information harvesting in a cloud services environment. CLOSER 2011: Proceedings of the 1st International Conference on Cloud Computing and Services Science, INSTICC; 2011. p. 66-72.
[17] Abawajy J H, Kelarev A, Chowdhury M. Large iterative multitier ensemble classifiers for security of big data. IEEE Transactions on Emerging Topics in Computing; 2014. p. 352-363.
[18] Zhang D W, Sun F Q, Cheng X, Liu C. Research on Hadoop-based enterprise file cloud storage system. Awareness Science and Technology (iCAST), IEEE 2011 3rd International Conference; 2011. p. 434-437.
[19] Liu C, Zhang X, Liu C, Yang Y, Ranjan R, Georgakopoulos D, Chen J. An iterative hierarchical key exchange scheme for secure scheduling of big data applications in cloud computing. Trust, Security and Privacy in Computing and Communications (TrustCom), 12th IEEE International Conference; 2013. p. 9-16.
[20] Zhang X, Liu C, Nepal S, Yang C, Chen J. Privacy preservation over big data in cloud systems. Security, Privacy and Trust in Cloud Systems. Springer; 2014. p. 239-257.
[21] Qinghua H, Yongwei W, Weimin Z, Guangwen Y. A method on protection of user data privacy in cloud storage platform. Journal of Computer Research and Development; 2011. p. 1146-1154.
[22] Yu S, Wang C, Ren K, Lou W. Achieving secure, scalable, and fine-grained data access control in cloud computing. INFOCOM, IEEE; 2010. p. 1-9.
[23] Chang B R, Tsai H F, Lin Z Y, Chen C M. Access security on cloud computing implemented in Hadoop system. Genetic and Evolutionary Computing (ICGEC), IEEE Fifth International Conference; 2011. p. 77-80.
[24] Jing H, Renfa L, Zhuo T. The research of the data security for cloud disk based on the Hadoop framework. Intelligent Control and Information Processing (ICICIP), IEEE Fourth International Conference; 2013. p. 293-298.
[25] Factor M, Hadas D, Hamama A, Har'El N, Kolodner E K, Kurmus A, Shulman-Peleg A, Sorniotti A. Secure logical isolation for multitenancy in cloud storage. IEEE 29th Symposium on Mass Storage Systems and Technologies; 2013. p. 1-5.