Security and Privacy Challenges in Big Data

National Conference on "Data Analytics, Machine Learning and Security", -16 February 2018, Department of CSIT, GGV, Bilaspur, C.G., India - 495009, ISBN: 978-93-5291-457-9


Ram Milan¹, Research Scholar, Dept. of Computer Science & Applications, Dr. Harisingh Gour Vishwavidyalaya, Sagar, M.P.

Kamlesh Kumar Pandey², Research Scholar, Dept. of Computer Science & Applications, Dr. Harisingh Gour Vishwavidyalaya, Sagar, M.P.

Prof. Diwakar Shukla³, HOD, Dept. of Computer Science & Applications, Dr. Harisingh Gour Vishwavidyalaya, Sagar, M.P.

Abstract: Big data is a term for data so large that conventional computing technology cannot process it and conventional storage cannot hold it. Extra storage space and extra processing power, such as parallel processing, are required. Big data also raises security concerns, because the data is very large, is distributed across many small files, and is connected through high-speed internet, so that anyone may be able to reach the database. Security has to be provided for these distributed files. Security is therefore a challenging task in big data.

Keywords: Security, Privacy, Structured, Unstructured, Differential privacy.

I. INTRODUCTION

Big data is the term for a collection of unstructured, semi-structured and structured datasets whose volume, complexity and rate of growth make them difficult to capture, manage, process or analyze with typical database software tools and technologies. The data comes in many varieties, such as text, video, image, audio, webpage log files, blogs, tweets, location information and sensor data [1]. Big Data means not only an enormous volume of data but also other features that differentiate it from the concepts of "Very Large Data" and "Massive Data". In fact, several definitions of Big Data appear in the literature.

International Data Corporation (IDC) defines Big Data as follows: "Big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis" [2]. The McKinsey report defines Big Data as "datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze". Big Data is also defined in terms of three V's: Volume, Variety, and Velocity. Volume represents the size of the data. Velocity refers to the speed of both data generation and data delivery for real-time data. Variety adds to the scale of the problem, since data arrives from many different sources, and it also means that the data may be structured, semi-structured, or unstructured. Traditional data supported by a DBMS, by contrast, is structured only, so its uses are limited and the traditional approach does not carry over to big data. Table 1 gives the different types of big data and their sources.

TABLE 1: Different Types of Big Data & Its Sources

Data Types | Sources | Formats
Structured | Business applications such as retail, finance, bioinformatics etc. | RDBMS, OLAP, Data warehousing
Semi-structured | Web applications such as web logs, email, web pages | XML, CSV, HTML, RDF
Unstructured | Images, audio, video, sensor data, blogs, tweets etc. | User-generated text content

The above definitions of Big Data provide a set of tools for comparing the emerging Big Data with traditional analytics. Big data is generally measured in petabytes, whereas traditional data is generally measured in gigabytes, so traditional data is very small in comparison with big data. This comparison is summarized in Table 2 under the three V's: Volume, Velocity, and Variety. The three V's of Big Data are shown in Fig. 1. In today's environment, big data security is a very challenging task: because big data keeps growing, more emphasis has to be put on data security. Big data has many applications. Smart energy big data analytics, for example, is a very complex topic. Big data is also used in the healthcare industry, where a large amount of customer data is stored in data warehouses. These data warehouses are distributed, so providing data warehouse security is a very challenging task.

TABLE 2: Comparison of Big Data and Traditional Data

Characteristics | Big Data | Traditional Data
Volume | Terabyte, Petabyte, Exabyte | Gigabyte
Velocity | More rapidly | Per hour, day
Variety | Structured, semi-structured or unstructured | Structured
Data integration | Difficult | Simple
Data access | Real-time or batch | Interactive
Source of data | Fully distributed | Centralized

Fig. 1: V's of Big Data

II. SECURITY CHALLENGES OF BIG DATA

Strict security has to be enforced in the data warehouse where big data is stored. If security is not strict, it is easily compromised. The system is easy to attack because it is distributed and connected to the internet, where, without restrictions, anyone can reach it. The security issues are:

1) Privacy: Information privacy is concerned with how data is collected and how it is stored. One serious privacy issue is the theft of data while it is being transferred over the internet. One of today's most challenging tasks, for example, is to secure Aadhaar data, a huge collection of data whose security is a major concern of the Indian government, because a leak of that data would be very harmful. Indian statistical data, Indian space and research data, and military data raise similar privacy concerns, because they are closely tied to the safety and security of the country.

2) Security: Security means protecting data against unauthorized access, disclosure, disruption, modification, inspection, and destruction. Only authorized users should be able to access the data. If security is breached, the whole company or organization suffers, so users have to be restricted by means of security methods.

3) Privacy during the storage of big data: When data is stored in the cloud, it has to be secured in terms of confidentiality, integrity, and availability. Availability means that authorized users can access the data when they need it, while confidentiality means that unauthorized users cannot. Protecting the data of an individual requires methods that preserve the privacy of the data. For example, either a public-key method or a private-key method can be used. With public-key encryption (PKE), the sender encrypts the data with the receiver's public key and the receiver decrypts it with the matching private key. In this way, data security can be provided in big data as well.

4) Methods of protecting data stored in a cloud database:
A) Attribute-based encryption: Use of the data is tied to the identity of a user, and a user can access data only on the basis of that identity.
B) Homomorphic encryption: It can be deployed in an identity-based encryption (IBE) or attribute-based encryption (ABE) scheme.
C) Storage-path encryption: It secures data along the path as well as in storage. Path-based security can be applied through link encryption, which means that no one can read the information during transmission because it is encrypted. When the data reaches its destination, the receiver decrypts it.
D) Hybrid clouds: A hybrid cloud combines two or more clouds; for example, data is saved on a private cloud as well as on a public cloud. The data is distributed and two-level security is applied.

Table 3 discusses the differences between privacy and security.

TABLE 3: Difference between Privacy and Security

S.No | Privacy | Security
1 | Privacy concerns how data can be used securely. | Security means the "confidentiality, integrity and availability" of data.
2 | Privacy concerns what information is stored, where, and how. | Data integrity is the assurance that digital information is accessed or modified only by those authorized to do so.
3 | Privacy is related to data handling. | Security concerns how the data is used.
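The public-key exchange described under privacy during storage (encrypt with the public key, decrypt with the private key) can be illustrated with a minimal sketch. It uses textbook RSA with tiny primes and no padding, so it only shows the principle and is not secure; the function names and the numbers are assumptions chosen for illustration and do not come from the paper.

```python
# Toy textbook RSA: tiny primes, no padding. For illustration only, NOT secure.
# Function names and numeric values are illustrative assumptions.

def make_keypair(p: int, q: int, e: int):
    """Build a (public_key, private_key) pair from two primes p and q."""
    n = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)  # modular inverse of e modulo phi (Python 3.8+)
    return (e, n), (d, n)


def encrypt(message: int, public_key) -> int:
    """Sender side: encrypt an integer message with the receiver's public key."""
    e, n = public_key
    return pow(message, e, n)


def decrypt(ciphertext: int, private_key) -> int:
    """Receiver side: recover the message with the matching private key."""
    d, n = private_key
    return pow(ciphertext, d, n)


public_key, private_key = make_keypair(p=61, q=53, e=17)
message = 65                                  # must be smaller than n = 3233
ciphertext = encrypt(message, public_key)     # what travels over the network
recovered = decrypt(ciphertext, private_key)  # only the key holder can do this
print(ciphertext, recovered)
```

In practice a vetted cryptographic library with proper padding and key sizes would be used; the sketch only shows why the public key can be shared openly while the private key stays with the receiver.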

III. RECENT APPROACHES TO PRIVACY IN BIG DATA

Big data has nowadays become so large that it cannot be processed with conventional processing, so more processing capacity is required. Because the data is distributed, stricter security measures are also needed to protect it. Most companies keep their data in the cloud, so security has to be provided for the cloud as well.

1) Differential privacy: A technique in which the personal information of people is stored without revealing the identities of the individuals.

2) Identity-based anonymization: Identity-based security can be provided using cloud computing. Cloud computing is a large-scale distributed computing environment that has become a driving force for information and communication technology over the past several years, due to its innovative and promising vision. Cloud storage services offer significant benefits to data owners. A cloud computing environment provides virtualization, meaning that data is made available when it is needed and not otherwise. Companies also hire ethical hackers to check whether their own cloud is safe. The objectives are:
i) Investment should be reduced by cloud computing.
ii) Data access should be independent of geographical location.
iii) Data access should be provided by the cloud at any time from anywhere.
To meet these objectives, Intel created an open architecture for anonymization that allowed a variety of tools to be utilized for both de-identifying and re-identifying web log records.

Access controls: An access control mechanism maintains a table of authenticated users. When a person wants to access the data, the request is matched against this list; if it is verified, access is granted, otherwise it is denied.
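The differential privacy approach listed above is often realized by adding calibrated random noise to query results. The sketch below shows one common construction, the Laplace mechanism applied to a counting query. The record layout, the function names, and the choice of epsilon are assumptions made for illustration; the paper does not prescribe a particular mechanism.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the records that satisfy predicate.

    Adding or removing one person changes a count by at most 1 (sensitivity 1),
    so Laplace noise with scale 1/epsilon is used.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical healthcare records; the fields are illustrative only.
patients = [
    {"age": 34, "diabetic": True},
    {"age": 51, "diabetic": False},
    {"age": 47, "diabetic": True},
    {"age": 29, "diabetic": False},
    {"age": 62, "diabetic": True},
    {"age": 40, "diabetic": False},
]

rng = random.Random(2018)  # seeded only to make the demo repeatable
noisy_count = private_count(patients, lambda p: p["diabetic"], epsilon=0.5, rng=rng)
print(f"noisy count of diabetic patients: {noisy_count:.2f}")
```

A smaller epsilon adds more noise and gives stronger privacy, while a larger epsilon gives answers closer to the true count. The analyst only ever sees the noisy value.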

IV. PRIVACY-PRESERVING APRIORI ALGORITHM IN MAPREDUCE FRAMEWORK

Privacy preservation can also be achieved with the Apriori algorithm, a data mining algorithm based on frequent itemsets. When the algorithm is used for privacy preservation in the cloud, the frequent itemsets are maintained by the private cloud. It is a robust algorithm for providing privacy in the cloud [6].
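To make the frequent-itemset idea concrete, here is a minimal sketch of the plain Apriori algorithm on a single machine. It is not the privacy-preserving MapReduce variant of [6], which adds further steps; the transactions, the function name, and the support threshold are assumptions chosen for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support: int):
    """Return {itemset: support_count} for every frequent itemset.

    min_support is an absolute count of transactions.
    """
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every single item that occurs anywhere.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Count the transactions that contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
]

frequent_itemsets = apriori(transactions, min_support=3)
for itemset, count in sorted(frequent_itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The prune step is what keeps Apriori tractable: an itemset can only be frequent if every one of its subsets is frequent, so most candidates are discarded before they are counted.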

V. CONCLUSION

Privacy and security are major challenges in big data and its implementation. In the IT age, data is growing exponentially, and securing it is a recent and challenging task. Data such as healthcare data and e-commerce data is shared across networks. If this data is compromised, the consequences for the company are very serious, so strong security and privacy have to be provided during storage as well as during transmission. Scalable anonymization methods within the MapReduce framework have been presented [4]. With the rapid development of IoT, further challenges arise where IoT and big data meet: the quantity of data is big but its quality is low, and the data comes from different sources with many different types and representation forms, so it is heterogeneous, being structured, semi-structured, or even entirely unstructured [5]. This poses new privacy challenges and open research issues. Different methods of privacy-preserving mining may therefore be studied and implemented in future, and there remains considerable scope for further research on privacy-preserving methods in big data.

REFERENCES

[1]. Puneet Singh Duggal and Sanchita Paul, "Big Data Analysis: Challenges and Solutions".
[2]. Gantz J, Reinsel D (2011) "The 2011 Digital Universe Study: Extracting Value from Chaos".
[3]. Abadi DJ, Carney D, Cetintemel U, Cherniack M, Convey C, Lee S, Stonebraker M, Tatbul N, Zdonik SB (2003) "Aurora: a new model and architecture for data stream management". VLDB J; 12(2): 120-39.
[4]. Zhang X, Yang LT, Liu C, Chen J (2014) "A scalable two-phase top-down specialization approach for data anonymization using MapReduce on cloud". IEEE Trans Parallel Distrib Syst; 25(2): 363-73.
[5]. Chen F, et al (2015) "Data mining for the internet of things: literature review and challenges". Int J Distrib Sens Netw; 501: 431047.
[6]. Jung K, Park S (2014) "Hiding a needle in a haystack: privacy preserving Apriori algorithm in MapReduce framework". PSBD'14, Shanghai; pp. 11-17.
[7]. https://www.alienvault.com/blogs/securityessentials/9-key-big-data-security-issues
[8]. Jain Priyank et al (2016) "Big Data Privacy: Technological perspective and review", Journal of Big Data.
