Big Data Security and Privacy Challenges: A Review

Ali Javid, Department of Software Engineering, University of Engineering and Technology, Taxila ([email protected])
Ali Nawaz, Software Engineering Department, University of Engineering and Technology, Taxila ([email protected])
Javid Iqbal, Software Engineering Department, University of Engineering and Technology, Taxila ([email protected])
Abstract: Big data has become a popular research area in the information and communication technology industry. Interconnected components of cyber society, such as search engines, social media, websites, and blogs, generate massive amounts of data known as "Big Data". Big data has become an essential part of our lives: organizations in banking, retail, health care, government, agriculture, smart cities, social media, the IT sector, defense, education, and many other domains have already adopted it. The generation of such large quantities of data creates multiple research challenges, including data life cycle management, scalability, security and privacy, heterogeneity of data processing, and data visualization. However, big data privacy and security, and the development of secure big data applications, remain distant milestones. Businesses focus on protecting customers' data, but the sheer size of the data makes this difficult, and conventional security solutions do not resolve all the security and privacy issues that big data creates. Security is therefore one of the most serious challenges of big data, and the new privacy and security problems it generates are a serious challenge for researchers. In this paper we perform a comprehensive analysis of existing privacy and security challenges in big data, study different frameworks for handling big data, and review the state of big data security and privacy. Keywords: Big Data, Security, Privacy, Challenges of Big Data, Big Data challenges
1. Introduction
The term Big Data was coined by Roger Mougalas back in 2005; however, the application of big data and the mission to understand the available data have been in existence for a long time [1]. The first book to discuss the term Big Data was a data mining book published in 1998 by Weiss and Indurkhya, and "Big Data" was the title of the first academic paper on the subject, by Diebold, in 2000 [2]. With the development and daily interaction of smart devices, Internet applications, the Internet of Things (IoT), cyber-physical systems, cloud systems, and social networks, data is generated continuously; this data is known as big data [3]. It is produced from different sources in multiple formats. According to a recent survey, about 200 thousand users query Google every second, and 40 billion items of content such as pictures and videos are shared on Facebook. Furthermore, organizations of all categories continuously produce vast amounts of data. The global amount of
information reached 8 ZB in 2015 and was expected to reach 15 ZB by 2018. Most of this massive amount of data was generated in the last three years, meaning that the amount of data in the world increased approximately eightfold, and it will probably double again within the next two years [4]. Another survey shows the size of data increasing from 130 Exabytes in 2005 to a projected 40,000 Exabytes in 2020 [5]. Nowadays, big data research spans different organizations, industry, science, technology, the media, and the government and public sectors. Big data is all about managing massive amounts of data from different sources, including database management systems (DBMS), social media posts, log files, sensor data, Google Drive, and many more [4][6]. In simple words, big data refers to large volumes of computer-readable data. But it is not only the amount; as argued by [7], the complexity of these data also exceeds the computational, storage, and communication capabilities of conventional methods and systems.
2. Characteristics of Big Data
Big Data clearly attracts a considerable measure of attention nowadays. Data is being created at huge rates; indeed, 90% of the data in the world was generated in the most recent years. The term "Big Data" can be characterized as data that has grown so large that it cannot be handled by traditional methods [8]. Its size is hard to pin down, because big data is an ever-changing notion and new devices are produced continuously to deal with it. It is changing our world completely and shows no sign of being a passing fad that will end any time soon. To make sense of it, scientists almost always describe "big data" as having at least three distinct dimensions: volume, velocity, and variety; some then add more V's to the list, such as variability and value. Here we define the "five V's of big data": Velocity, Volume, Value, Variety, and Veracity [9].
Fig. 1: Fundamental Characteristics of Big Data
2.1 Velocity
Velocity refers to the speed at which big data is created and moves from one point to another: the pace at which enormous amounts of data are generated, flow, are collected, and are analyzed. This data comes from sources such as Twitter messages, emails, photos, videos, machines, business processes, networks, and human interaction with social media sites, mobile devices, and so on, and it increases at lightning speed around the world. Big data technology now allows us to analyze data while it is being generated, without ever putting it into databases. The flow of data is massive and continuous, and this real-time data can help researchers and businesses make valuable decisions that provide strategic competitive advantages, provided they are able to handle the velocity.
2.2 Volume
The word "big" in big data reflects the extreme size of the data. Volume refers to the vast amounts of data generated every moment from different kinds of sources: social media, cell phones, M2M sensors, credit cards, videos, email, cars, photographs, and so on. We are not talking about terabytes but zettabytes or brontobytes, and the Internet of Things (IoT) is generating exponential growth in data. Generally speaking, this kind of massive data has become so large that it causes storage problems for local databases [48]. To solve this problem, data scientists use distributed systems, where data is stored in different locations and formats and brought together by different kinds of applications. Looking at the Facebook example, there are 10 billion messages, 4.5 billion presses of the "like" button, and over 350 million new pictures uploaded every day. Collecting and analyzing this data is clearly an engineering challenge of immensely vast proportions. The size of the data is now huge, ranging from terabytes to zettabytes (1 ZB = 10^21 bytes).
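The storage-unit arithmetic behind these figures can be made concrete with a short sketch (pure Python; the survey numbers are the ones quoted in Section 1, and decimal SI units are assumed):

```python
# Decimal SI byte-size ladder used in the text: 1 ZB = 10**21 bytes.
UNITS = {"TB": 10**12, "PB": 10**15, "EB": 10**18, "ZB": 10**21}

def to_bytes(value, unit):
    """Convert a value in the given unit to bytes (decimal SI units)."""
    return value * UNITS[unit]

# Survey figures quoted in Section 1: 130 EB (2005) vs 40,000 EB (2020).
growth = to_bytes(40_000, "EB") / to_bytes(130, "EB")
print(f"2005-2020 growth factor: about {round(growth)}x")          # about 308x
print(f"40,000 EB expressed in ZB: {to_bytes(40_000, 'EB') / UNITS['ZB']}")  # 40.0
```

This also shows why the quoted figures of 8 ZB and 15 ZB are of a different order than the exabyte-scale numbers: one zettabyte is a thousand exabytes.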
2.3 Value
Last but not least, big data must have value. Whether the data is big or little, no matter where it is generated or in what format, it should have some value. When we talk about value, we are referring to the worth of the information being extracted: having endless amounts of data is one thing, but unless it can be turned into value it is useless. Data in itself has no importance or utility; we need valuable data to obtain information. Data value helps in measuring the usefulness of data in decision making: queries can be run on the stored data to deduce important results and gain insights from the filtered data, so as to solve the most analytically complex business problems [10][11].
2.4 Variety
With increasing velocity and volume comes increasing variety. Variety refers to the different types of data we can now use. Data today looks very different from data from the past. Three types of data are considered: structured, unstructured, and semi-structured. In fact, 80% of the world's data is now unstructured and therefore cannot easily be put into tables or relational databases. Of these, we are most familiar with structured data, which takes the form of pure text (a person's name) or numbers (age) stored in databases. The other two types are
newer in big data. New and innovative big data technology now allows structured and unstructured data to be harvested, stored, and used simultaneously [48].
2.5 Veracity
Veracity is the quality or trustworthiness of the data: is the data that is being stored and mined accurate and meaningful to the problem being analyzed? For example, people use Twitter in daily life; if we think about all the posts with hashtags, abbreviations, and so on, processes are required to keep bad data from accumulating in your systems and to maintain the reliability and accuracy of all that content. Big data technology now allows us to work with these types of data, and the volumes often make up for the lack of quality or accuracy [48].

Table No 1: Comparison of 5V's with the Big Data Privacy and Security Aspects [56]
(Rows: Velocity, Volume, Value, Variety, Veracity. Columns: Confidentiality, Integrity, Data Authenticity, Data Availability, Data Efficiency.)
3. Framework of Big Data
We now look at different open-source big data processing frameworks in use today. Big data work involves data gathering, preparation, storage and management, analysis and mining, and verification of the data. Computational capabilities keep growing, so it is possible to apply more computational power to the same work; however, high-performance network capacity has not increased at the same rate as processing and storage capabilities. The bottleneck has therefore shifted from moving data to a big supercomputer toward moving the application to many smaller computers that store the data [12]. Processing frameworks and processing engines are responsible for computing over data in a data system.
3.1 Hadoop
Hadoop is the long-standing standard and one of the best frameworks in use today. It is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment, and it is an Apache project supported by the Apache Software Foundation [13]. Hadoop processes large data sets across a cluster of servers; applications can run on systems with thousands of nodes holding thousands of terabytes. Hadoop was first out of the gate and saw widespread implementation in industry, and many supporting tools are available. The distributed file system in Hadoop speeds up data transfer and allows the system to continue normal operation even when an error or failure occurs in a node. This minimizes the risk of a complete system failure, even when a significant number of nodes fail. Hadoop provides a computing solution that is scalable, cost-effective, fault-tolerant, and flexible. The Hadoop framework is used by popular companies such as Google,
Yahoo, Amazon, and IBM, among others, to support their applications involving large amounts of data. Hadoop has two main sub-frameworks, discussed briefly below [14]. The Apache Hadoop core consists of a storage section, the Hadoop Distributed File System (HDFS), and a processing section implementing the MapReduce programming model. Hadoop breaks files into chunks and distributes them across the nodes of the cluster.
3.2 MapReduce
Google introduced the MapReduce framework in 2004; it is a programming model for generating and processing large data sets [15]. MapReduce consists of two processes, Map (simple calculation) and Reduce (integration) [16][17]. MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster [18], and it is at the core of Apache Hadoop. This programming paradigm enables large-scale scalability across hundreds or thousands of servers in a Hadoop cluster, and the concept is easy to understand for those familiar with clustered, scalable data processing solutions. The term MapReduce actually refers to two separate and distinct tasks executed by a Hadoop program [19]. The first is the map job, which takes a set of data and transforms it into another set of data in which each element is converted into tuples (key/value pairs). The reduce job takes the output of the map as input and combines those tuples into a smaller set of tuples. As the name MapReduce implies, the reduce step is always performed after the map step.
Fig. 2: MapReduce Workflow [45]
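The map and reduce phases described above can be illustrated with a minimal word-count sketch. This is pure Python, not Hadoop's actual Java API, and it runs the phases sequentially rather than in parallel across a cluster; the shuffle step stands in for the grouping the framework performs between the two phases:

```python
from collections import defaultdict

# Map phase: each input record becomes a list of (key, value) tuples.
def map_phase(record):
    return [(word, 1) for word in record.split()]

# Shuffle: group intermediate tuples by key, as the framework does
# between the map and reduce phases.
def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

# Reduce phase: combine all values for a key into a smaller result set.
def reduce_phase(key, values):
    return (key, sum(values))

records = ["big data security", "big data privacy"]
intermediate = [pair for record in records for pair in map_phase(record)]
result = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(result)  # {'big': 2, 'data': 2, 'security': 1, 'privacy': 1}
```

In a real Hadoop cluster the same three steps run distributed: mappers on the nodes holding the input chunks, a network shuffle, then reducers producing the final tuples.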
3.3 Hadoop Distributed File System (HDFS)
HDFS is a file system that spans all the nodes in a Hadoop cluster for data storage; it links the file systems of the local nodes together into one large file system. HDFS improves reliability by replicating data across multiple nodes to overcome node failure [20]. HDFS has many similarities with other distributed file systems but differs in several ways. One notable difference is that HDFS's write-once, read-many model relaxes concurrency control requirements, simplifying data consistency and enabling high-performance access. Another unique attribute of HDFS is the notion that it is often better to place
processing logic close to the data rather than move the data into application space. HDFS strictly limits writing to a single writer at a time; bytes are always appended to the end of the stream and are guaranteed to be stored in write order [45].
Fig. 3: Hadoop Distributed File System [45]
3.4 Apache Spark
Apache Spark is a batch framework with the ability to handle streaming, which makes it a hybrid framework. Spark is particularly easy to use, and writing applications in Java, Scala, Python, and R is straightforward. This open-source cluster computing framework is well suited to machine learning, though it requires a cluster manager and a distributed storage system. Spark can run on a single machine, with one executor per CPU core, and can be used as a standalone framework or together with Hadoop or Apache Mesos, making it suitable for almost any business [21]. Spark is based on a data structure called Resilient Distributed Datasets (RDDs): read-only collections of data elements distributed across the machines of the cluster. RDDs act as the working set for distributed programs and provide a limited form of distributed shared memory. Spark has access to data sources such as HDFS, Cassandra, HBase, and S3 for distributed storage, and it also supports a pseudo-distributed local mode that can be used for development or testing.
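A defining property of RDDs is that transformations (map, filter) are lazy: nothing is computed until an action such as collect forces evaluation. The toy class below is a pure-Python analogue of that behavior, built on generators; it is a sketch of the idea only, not the PySpark API, and all names here are hypothetical:

```python
# A tiny RDD-like wrapper: transformations build a lazy pipeline, and only
# an action (collect) actually pulls data through it, mirroring Spark's model.
class MiniRDD:
    def __init__(self, data):
        self._data = data  # any iterable; nothing is evaluated yet

    def map(self, fn):                      # lazy transformation
        return MiniRDD(fn(x) for x in self._data)

    def filter(self, pred):                 # lazy transformation
        return MiniRDD(x for x in self._data if pred(x))

    def collect(self):                      # action: forces evaluation
        return list(self._data)

rdd = MiniRDD(range(10)).filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(rdd.collect())  # [0, 4, 16, 36, 64]
```

In real Spark the same chaining style applies, but the partitions of the dataset are distributed across the cluster and recomputed from lineage if a node fails.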
4. Challenges of Big Data
Today, a great deal of data is generated every second, and every large company is trying to find ways to make this information useful. However, this is not an easy task: the amount of data generated makes it very difficult to store, manage, analyze, and use. The development of multiple big data analytics tools has greatly helped in data management [22]. The following are some of the main challenges faced in big data analysis.
4.1 Understanding the Data
Acquiring data in the right way requires a great deal of knowledge, since visualization is part of the data analysis. For example, if the data comes from social media content, you need to know who the users are in a general sense, for example customers using a specific product collection, and what
they are trying to learn from the data. Without such context, visualization tools are far less valuable to the user.
4.2 Addressing Data Quality
Even if you can find and analyze data quickly and put it in the right context for the audience consuming the information, the value of the data for decision making is at risk if the data is not accurate or timely. This is a challenge in any data analysis, but it becomes more pronounced when enormous volumes of data are involved. Again, data visualization is a valuable tool for ensuring data quality [23]. To solve this problem, businesses need information governance or information management processes in place to ensure that data is clean. It is better to take a proactive approach to data quality problems so that they do not cause trouble later.
4.3 Displaying Meaningful Results
Plotting points on an analysis chart becomes difficult when the amount of information is very large or spans many categories. For example, imagine you have 10 billion rows of retail SKU data that you are trying to compare: a user attempting to view 10 billion points on a screen will have a hard time seeing so much data. The solution is to group the data into a higher-level view in which smaller groups of the data become visible. By grouping, or "binning," the overall data, it can be visualized more effectively.
4.4 Data Storage and Quality
Companies and organizations are growing very fast, and the amount of data they generate grows just as rapidly. Storing this data is becoming everyone's challenge. Data repositories are used to collect and store vast amounts of unstructured data in its original format. The problem, however, is that when a data warehouse attempts to combine heterogeneous data from different sources, it encounters errors: inconsistent data, duplicate data, logical conflicts, and missing data all pose challenges to data quality [23].
4.5 Security and Privacy of the Data
Big data is a priceless source of information, but it also contains important data and information that must be secured against unauthorized access and release. Once companies and organizations figure out how to use big data, it offers them a wide range of opportunities; however, it also implies huge risks in terms of data security and privacy [23]. Tools for storing, managing, analyzing, and using data draw on a variety of sources, which ultimately raises the risk of data exposure and makes the data very vulnerable. Producing more and more data therefore adds security and privacy issues, and analysts and data scientists must consider these issues and process data in ways that do not result in privacy breaches.
Fig. 4: Challenges of Big Data
5. Security and Privacy Challenges of Big Data
Security and privacy may be the most challenging and difficult issues in big data. Because of the huge benefits of big data, government agencies, the health care industry, biomedical researchers, and private companies have devoted enormous resources to collecting, aggregating, and sharing large amounts of personal data [5]. "Recent disclosures show that the NSA regularly collects and analyzes PalTalk, YouTube, Skype, and other data from heterogeneous sources such as telecommunications, the Internet, and large user databases, including those of large corporations such as Microsoft, Yahoo, Google, Facebook, AOL, and Apple" [22]. Many facts show that big data will harm users' privacy if it is not handled properly. The following are some of the most important security and privacy challenges for big data.
5.1 Data Preparation
By definition, big data analysis demands powerful and accurate technology, and its most important basis is high-quality, usable, accurate, and reliable data. Data preparation is therefore the most important step in increasing the value of big data.
5.2 Protecting Data and Transaction Logs
Data stored on a storage medium, such as transaction logs and other kinds of sensitive information, can be protected at varying levels, but these levels are often not enough. Data transfer between levels, for example, gives IT administrators insight into the data being moved. The company must also protect these storage devices from unauthorized access and ensure that they remain readily available. Untrusted storage service providers often look for clues that let them associate user activity with datasets and infer attributes that may be of crucial importance to them; however, they cannot see more than the encrypted data, because the data owner stores ciphertext in the storage system and keeps the private key. Each user has the right to access only certain parts of certain users' data, yet users who are not authorized to access some data can still collude by exchanging keys and data, thereby obtaining data they are not authorized to see. Data corruption and data loss caused by malicious users often lead to disputes with the data storage provider or among users [24].
5.3 Effective Online Data Analysis
Online analysis of multi-dimensional data has become an inevitable and potent source for decision making, so existing OLAP systems need to be adapted to handle big data. Depending on your
needs, some data is more valuable than other data. Therefore, before you start the big data analytics process, identify some goals and objectives, be as concrete as possible, and pay attention to the problems to be solved.
5.4 Validation and Filtration of End-to-End Inputs
Endpoints are an important part of any big data collection: storage, processing, and other necessary tasks are performed on the input data that the endpoints provide. It is therefore necessary to ensure that only genuine endpoints are used and that the network contains no malicious endpoints. Malicious users can create fake IDs and feed malicious data into the central data collection system. ID-cloning attacks, such as Sybil attacks, are especially prevalent in bring-your-own-device (BYOD) settings, where a malicious user can pass off a fake device as a trusted one and from there supply malicious information to the central collection system. The source of sensor data can also be manipulated: an artificially altered temperature can be fed from a temperature sensor into the temperature acquisition process, and GPS signals can be manipulated in the same way. In short, malicious users can alter the data that a legitimate source sends to the central data collection system [24].
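A minimal sketch of endpoint-side validation for the temperature-sensor example above. The endpoint registry, field names, and plausibility bounds are all assumptions for illustration; a real deployment would authenticate devices cryptographically rather than by ID alone:

```python
# Filter sensor readings at the collection point: reject inputs from
# unregistered endpoint IDs and values outside a plausible physical range.
TRUSTED_ENDPOINTS = {"sensor-01", "sensor-02"}   # assumed device registry
PLAUSIBLE_RANGE = (-40.0, 60.0)                  # assumed bounds, degrees C

def accept(reading):
    """Return True only for readings from trusted endpoints within range."""
    if reading["endpoint"] not in TRUSTED_ENDPOINTS:
        return False                             # possible Sybil / fake-ID device
    low, high = PLAUSIBLE_RANGE
    return low <= reading["value"] <= high       # reject manipulated values

readings = [
    {"endpoint": "sensor-01", "value": 21.5},
    {"endpoint": "sensor-99", "value": 20.0},    # unknown device
    {"endpoint": "sensor-02", "value": 400.0},   # manipulated input
]
print([r["endpoint"] for r in readings if accept(r)])  # ['sensor-01']
```

Range checks alone do not stop a subtly manipulated value inside the plausible band; they are the filtration layer, to be combined with the device authentication discussed above.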
5.5 Providing Safety and Monitoring Data in Real Time
In the context of big data analysis, real-time security monitoring has been a continuous challenge because of the alerts generated by security devices. These alerts may or may not be correlated, and they lead to many false positives; because humans cannot successfully handle alerts arriving at such speed, most are clicked away or ignored [25]. Unfortunately, many traditional platforms are unable to perform real-time monitoring because of the huge amounts of data generated continuously.
5.6 Data Provenance
For the categorization of data, it is very important that data scientists know the source of the data, since it allows the data to be classified. The origin can be accurately determined through authorization, verification, and fine-grained access control.
5.7 Privacy of Non-Relational Data Sources
There are many loopholes in data stores such as NoSQL databases and other types of data storage that create privacy threats and security problems. These gaps include the inability to encrypt data during transmission or storage, during tagging or logging, or during the classification of data into different groups. Because of the large amount of data, many powerful solutions must be introduced to protect the various parts of the infrastructure involved, and the data store must ensure that there are no leaks. Finally, real-time protection must be enabled during the initial data collection. All of this helps ensure that consumer privacy is preserved.
5.8 Securing Communications and Encryption of Access Control Methods (ACM)
A secure data storage device is an intelligent step toward protecting the data; thus, a simple way to protect data is to secure the storage platform that holds it. However, the applications that protect the data storage platform are often fragile, so the access methods themselves must also be strictly encrypted.
5.9 Social Analytics
To use social data reliably, it is important to be able to distinguish between users and to comply with their differing needs and preferences. Sound social analytics should therefore be conducted, solving this problem by providing accurate information through solid analysis of the social data [5].
6. Data Protection Requirements
In the literature, researchers have introduced many security theories and practices to cover the confidentiality, availability, integrity, and privacy aspects of big data security [21]. We discuss them briefly here.
6.1 Confidentiality
Confidentiality relates to the application of rules and restrictions that prevent data from being disclosed illegally. Limited access to data and various cryptographic techniques are widely used for confidentiality. An intuitive way to protect the confidentiality of big data is encryption. Traditional encryption algorithms (e.g., the symmetric algorithm AES and the asymmetric algorithm RSA) ensure confidentiality in common settings, but they are limited in that they do not allow processing of the encrypted data. Homomorphic encryption was developed to solve this problem: it achieves confidentiality and protected data processing at the same time [26]. A disadvantage of these encryption techniques is that they still require expensive computation. Additional encryption schemes, such as classical public-key approaches and attribute-based encryption, are used for data sharing; their main drawback is the costly computation imposed on the data owner. To overcome this limitation, the authors in [27][28] proposed a cheaper and more flexible proxy re-encryption scheme [26]. Proxy re-encryption assumes that the cloud is semi-trusted. The main idea of the algorithm is as follows: first, the data owner encrypts the data with a public key; then, for each potential recipient, the owner generates a re-encryption key. If a recipient is authorized to share data with the owner, the cloud re-encrypts the ciphertext and sends it to the recipient for decryption [30].
6.2 Integrity
Data integrity provides protection against alteration of data by an unauthorized user in an unauthorized manner. Hardware errors, user errors, software errors, and intruders are the main causes of data integrity issues [30]. Integrity is maintained through data provenance, data trustworthiness, data loss prevention, and data deduplication [31].
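One common low-level mechanism for detecting unauthorized alteration, not specific to the schemes cited above, is to store a cryptographic digest alongside each record and recompute it on read. A minimal sketch with Python's standard hashlib (the record content is hypothetical):

```python
import hashlib

def digest(data: bytes) -> str:
    """SHA-256 digest of a record, stored alongside it at write time."""
    return hashlib.sha256(data).hexdigest()

record = b"patient_id=42;status=ok"
stored_digest = digest(record)          # saved with the record

# Later, an unauthorized modification changes the content.
tampered = b"patient_id=42;status=no"
print(digest(record) == stored_digest)    # True: data unchanged
print(digest(tampered) == stored_digest)  # False: alteration detected
```

A plain digest only detects accidental or naive tampering; an attacker who can rewrite the record can also rewrite the digest, which is why keyed constructions (HMACs) or signatures are used when the storage itself is untrusted.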
6.3 Availability
The availability of data ensures that data is accessible to authorized users when needed; a High Availability (HA) system is a solution that meets this requirement [30]. The integrity of information refers to the requirement that all
resources can only be modified by authorized personnel or in authorized ways; the purpose is to prevent information from being modified by unauthorized users. Because of the openness of big data, information transmitted over the network can be intercepted, interrupted, manipulated, or forged by hackers. Encryption technology addresses the data confidentiality requirement and protects the integrity of the data, but it cannot solve every security problem.
6.4 Data Privacy
Data privacy plays an important role in several areas such as health care, management, and social security; with big data, privacy protection becomes even more critical, because data from the past is aggregated and remains traceable. Many privacy models have been proposed in the literature to protect the privacy of big data, including anonymization and differential privacy. The goal of anonymization is to protect privacy by hiding user identities and secrets. Data privacy aims to ensure that Personally Identifiable Information (PII) is not leaked or disclosed; for most applications, the raw data must first be processed to secure the sensitive information, and only then can extraction, release, and other operations take place [34]. Several techniques are used to anonymize data. We now capture large amounts of traceable data that were never collected and stored in the past; the advantage of privacy processing is that the released data remains anonymous even though it was traceable. For example, to develop effective treatments or medicines, the study of patients' medical records is essential; hence, patients' PII must be anonymized to protect their privacy.
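The PII-handling step described above can be sketched as pseudonymization of direct identifiers plus generalization of quasi-identifiers. The salt, field names, and patient record below are illustrative assumptions, and real-world anonymization (k-anonymity, differential privacy) requires much more care than this:

```python
import hashlib

# Pseudonymize the direct identifier and generalize the quasi-identifier
# so the record stays useful for analysis without exposing PII.
def anonymize(record, salt=b"study-salt"):  # salt is an assumed secret value
    out = dict(record)
    out["name"] = hashlib.sha256(salt + record["name"].encode()).hexdigest()[:12]
    band = record["age"] // 10 * 10
    out["age"] = f"{band}-{band + 9}"       # coarse age band, not exact age
    return out

patient = {"name": "Ali Javid", "age": 34, "diagnosis": "flu"}
anon = anonymize(patient)
print(anon["age"])            # '30-39' : generalized age band
print("Ali" in anon["name"])  # False   : identifier replaced by a pseudonym
```

Keeping the salt secret matters: without it, an attacker could hash candidate names and link pseudonyms back to identities.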
6.5 Data Analysis
Data analysis is a method of collecting and organizing data from which useful information can be obtained; in other words, its main purpose is to observe what the data is trying to tell us. There are several facets and methods of data analysis, covering a wide range of technologies in business, science, and the social sciences. Data analysis uses different techniques for extracting information from data; 26 papers are listed in Table No 2, of which 15 use machine learning and the remainder use data mining or other techniques, for example detecting cyclic patterns or applying different versions of the k-nearest-neighbor (KNN) algorithm, which finds the k nearest points to a given reference point [34]. The listed papers are selected from reference 30 onward.
Table No 2: The Literature Review and its Implementation Topic [55]
(Columns: Short Title; Author; Confidentiality; Integrity; Availability; Data Privacy; Data Analysis; Remarks, e.g. SDN (Software Defined Network). The papers reviewed are:)
- Efficient secure similarity computation on encrypted trajectory data. Liu, An, et al. [35]
- Privacy-Preserving SimRank over Distributed Information Network. Chu, Yu-Wei, et al. [36]
- Explainable Security for Relational Databases. Bender et al. [37]
- JustMyFriends: Full SQL, Full Transactional Amenities, and Access Privacy. Meacham et al. [38]
- Is Feature Selection Secure against Training Data Poisoning? Xiao et al. [39]
- Secure Database-as-a-service with Cipherbase. Arasu, Arvind, et al. [40]
- A Secure Search Engine for the Personal Cloud. Lallali, Saliha, et al. [41]
- Preserving privacy in social networks against connection fingerprint attacks. Wang et al. [42]
- DObjects+: Enabling Privacy-Preserving Data Federation Services. Jurczyk, Pawel, et al. [43]
- Publishing Microdata with a Robust Privacy Guarantee. Cao et al. [44]
- Security and Privacy Issues of Big Data. José Moura et al. [46]
- Big data privacy: a technological perspective and review. Jain et al. [47]
- Big data security and privacy issues: A survey. N. Joshi et al. [48]
- Big data: A survey. Chen, Min et al. [49]
- Benefits and risks of big data. Cole, Dana et al. [50]
- Secure k-nearest neighbor query over encrypted data in outsourced environments. Elmehdwi, Yousef et al. [51]
- Differentially private learning with kernels. Jain, Prateek et al. [52]
- Secure nearest neighbor revisited. Yao, Bin et al. [53]
- Big Data Security and Privacy Challenges: A Review (this paper). Ali Nawaz et al. [54]
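The k-nearest-neighbor algorithm mentioned in Section 6.5 can be sketched in a few lines. This is a plain, unencrypted KNN for illustration only; the secure variants surveyed above ([51], [53]) run the same idea over encrypted data:

```python
import math
from collections import Counter

def knn_classify(points, labels, query, k=3):
    """Label a query point by majority vote among its k nearest neighbors."""
    order = sorted(
        range(len(points)),
        key=lambda i: math.dist(points[i], query),  # Euclidean distance
    )
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Two well-separated clusters with hypothetical labels.
points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
labels = ["low", "low", "low", "high", "high", "high"]
print(knn_classify(points, labels, query=(0.5, 0.5)))  # 'low'
print(knn_classify(points, labels, query=(5.5, 5.5)))  # 'high'
```

The choice of k trades noise sensitivity (small k) against blurring of class boundaries (large k); odd k avoids ties in two-class problems.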
Discussion
Security and privacy are the most important of all the challenges in big data. The big data revolution has brought substantial benefits to companies and end users, but there are many risks in the use of big data. Whether the data set is large or small, users need to protect personal information while the quality of the data remains paramount. Moreover, the 5V's concept (volume, velocity, value, veracity, and variety) creates new kinds of problems that need to be addressed.
Conclusion
In this review paper, we presented an overview of big data, its characteristics, and its challenges. We then reviewed the main concern, security and privacy challenges in big data, together with the 5V's concept. We briefly explained big data challenges such as understanding the data, addressing data quality, displaying meaningful results, data storage, and security and privacy of the data. Furthermore, we studied security and privacy issues in big data, such as confidentiality, integrity, availability, and data privacy, in detail. We have pointed out these issues only theoretically; the next aim is to solve them. Although this paper does not resolve every issue related to big data security and privacy, we hope it provides a useful discussion and research framework for future researchers and students. Future work will focus on simulating these approaches to build a hybrid solution for the problems related to big data security and privacy.
References
[1] "Brief History of Big Data", Cleverism, 2018. [Online]. Available: https://www.cleverism.com/briefhistory-big-data/. [Accessed: 12-Jan-2018].
[2] "Big Data: Security Issues, Challenges and Future Scope", International Journal of Research Studies in Computer Science and Engineering, vol. 3, no. 3, 2016.
[3] A. Jha, M. Dave and S. Madan, "A Review on the Study and Analysis of Big Data using Data Mining Techniques", International Journal of Latest Trends in Engineering and Technology (IJLTET), vol. 6, issue 3, 2016.
[4] M. Chen, S. Mao and Y. Liu, "Big data: A survey", Springer Science+Business Media New York, January 2014.
[5] B. Matturdi, X. Zhou, S. Li and F. Lin, "Big Data security and privacy: A review", China Communications, vol. 11, no. 14, 2014, pp. 135-145.
[6] R. Bajaj and P. P. Ramteke, "Big Data – The New Era of Data", IJCSIT, vol. 5, no. 2, 2014, pp. 1875-1885; K. Jaya Bharathi, "Big Data Security Challenges", IJRSET, vol. 2, no. 14, 2014, pp. 8-11.
[7] H. K. Patil and R. Seshadri, "Big data security and privacy issues in healthcare", in 2014 IEEE International Congress on Big Data (BigData Congress), pp. 762-765.
[8] Cano, "The V's of Big Data: Velocity, Volume, Value, Variety, and Veracity", Xsnet.com, 2018. [Online]. Available: https://www.xsnet.com/blog/bid/205405/the-v-s-of-big-data-velocity-volume-valuevariety-and-veracity. [Accessed: 12-Jan-2018].
[9] A. Jain, "The 5 Vs of Big Data", Watson Health Perspectives, 2018. [Online]. Available: https://www.ibm.com/blogs/watson-health/the-5-vs-of-big-data/. [Accessed: 12-Jan-2018].
[10] Introduction to Big Data, University of California, San Diego. [Online]. Available: https://www.coursera.org/learn/big-data-introduction
[11] http://www.slideshare.net/HarshMishra3/harsh-big-data-seminar-report
[12] B. Matturdi, X. Zhou, S. Li and F. Lin, "Big Data security and privacy: A review", China Communications, vol. 11, no. 14, pp. 135-145, 2014.
[13] K. Ghosh and A. Nath, "Big Data: Security Issues and Challenges", International Journal of Research Studies in Computer Science and Engineering (IJRSCSE), vol. 3, issue 3, 2016, pp. 1-9.
[14] R. Singh and K. A. Ali, "Challenges and Security Issues in Big Data Analysis", IJIRSET, vol. 5, issue 1, January 2016.
[18] "Google spotlights data center inner workings", Tech news blog, CNET News.com.
[19] "What is MapReduce?", IBM Analytics, Ibm.com, 2018. [Online]. Available: https://www.ibm.com/analytics/hadoop/mapreduce. [Accessed: 13-Jan-2018].
[20] K. Iswarya, "Security Issues Associated with Big Data in Cloud Computing", Department of Computer Science, Idhaya College for Women, Kumbakonam, India. SSRG International Journal of Computer Science and Engineering (SSRG-IJCSE), vol. 1, issue 8, October 2014.
[21] B. Davies, S. Agarwal and R. Dhole, "5 Best Data Processing Frameworks", KnowledgeHut Blog, 2018. [Online]. Available: https://www.knowledgehut.com/blog/information-technology/5-best-data-processing-frameworks. [Accessed: 13-Jan-2018].
[22] "Top 5 Challenges in Big Data Analytics", UpX Academy, 2018. [Online]. Available: http://upxacademy.com/big-data-analysis-top-5-challenges/. [Accessed: 14-Jan-2018].
[23] J. Bamford, "The NSA is building the country's biggest spy center (Watch What You Say)", Wired, March 2012.
[24] Big Data Working Group, "Expanded Top Ten Big Data Security and Privacy Challenges", Cloud Security Alliance, 2013.
[25] R. Singh and K. A. Ali, "Challenges and Security Issues in Big Data Analysis", IJIRSET, vol. 5, issue 1, January 2016.
[26] L. Xu and W. Shi, "Security Theories and Practices for Big Data", in Big Data Concepts, Theories, and Applications. Springer, 2016.
[27] M. Blaze et al., "Divertible protocols and atomic proxy cryptography", Springer, 1998.
[28] M. Mambo and E. Okamoto, "Proxy cryptosystems: delegation of the power to decrypt ciphertexts".
[29] A. Kourid and S. Chikhi, "A Comparative Study of Recent Advances in Big Data for Security and Privacy", Networking Communication and Data Knowledge Engineering, pp. 249-259, 2017.
[30] S. Sudarsan, R. Jetley and S. Ramaswamy, "Security and Privacy of Big Data", Studies in Big Data, 2015, pp. 121-136.
[31] "Types of Network Attacks against Confidentiality, Integrity and Availability", Omnisecu.com, 2017. [Online]. Available: http://www.omnisecu.com/ccna-security/types-of-network-attacks.php. [Accessed: 23-Jan-2017].
[32] B. Kitchenham and S. Charters, "Guidelines for performing Systematic Literature Reviews in Software Engineering", Tech. rep., Keele University, United Kingdom, 2007.
[33] R. Schutt and C. O'Neil, Doing Data Science. Sebastopol, CA: O'Reilly Media, 2014.
[34] B. Nelson and T. Olovsson, "Security and privacy for big data: A systematic literature review", in 2016 IEEE International Conference on Big Data (Big Data). IEEE, 2016.
[35] A. Liu, K. Zheng, L. Li, G. Liu, L. Zhao and X. Zhou, "Efficient secure similarity computation on encrypted trajectory data", in 2015 IEEE 31st International Conference on Data Engineering (ICDE), pp. 66-77. IEEE, 2015.
[36] Y.-W. Chu, C.-H. Tai, M.-S. Chen and P. S. Yu, "Privacy-preserving SimRank over distributed information network", in 2012 IEEE 12th International Conference on Data Mining (ICDM), pp. 840-845. IEEE, 2012.
[37] G. Bender, L. Kot and J. Gehrke, "Explainable security for relational databases", in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pp. 1411-1422. ACM, 2014.
[38] A. Meacham and D. Shasha, "JustMyFriends: full SQL, full transactional amenities, and access privacy", in Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pp. 633-636. ACM, 2012.
[39] H. Xiao, B. Biggio, G. Brown, G. Fumera, C. Eckert and F. Roli, "Is feature selection secure against training data poisoning?", in International Conference on Machine Learning, pp. 1689-1698, 2015.
[40] A. Arasu, S. Blanas, K. Eguro, M. Joglekar, R. Kaushik, D. Kossmann, R. Ramamurthy, P. Upadhyaya and R. Venkatesan, "Secure database-as-a-service with
Cipherbase", in Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pp. 1033-1036. ACM, 2013.
[41] S. Lallali, N. Anciaux, I. Sandu Popa and P. Pucheral, "A secure search engine for the personal cloud", in Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 1445-1450. ACM, 2015.
[42] Y. Wang and B. Zheng, "Preserving privacy in social networks against connection fingerprint attacks", in 2015 IEEE 31st International Conference on Data Engineering (ICDE), pp. 54-65. IEEE, 2015.
[43] P. Jurczyk, L. Xiong and S. Goryczka, "DObjects+: enabling privacy-preserving data federation services", in 2012 IEEE 28th International Conference on Data Engineering (ICDE), pp. 1325-1328. IEEE, 2012.
[44] J. Cao and P. Karras, "Publishing microdata with a robust privacy guarantee", Proceedings of the VLDB Endowment, vol. 5, no. 11 (2012): 1388-1399.
[45] "Conceptual Overview of Map-Reduce and Hadoop", Glennklockwood.com, 2018. [Online]. Available: http://www.glennklockwood.com/data-intensive/hadoop/overview.html. [Accessed: 28-Jan-2018].
[46] J. Moura and C. Serrão, "Security and privacy issues of big data", arXiv preprint arXiv:1601.06206, 2016.
[47] P. Jain, M. Gyanchandani and N. Khare, "Big data privacy: a technological perspective and review", Journal of Big Data, vol. 3, no. 1 (2016): 25.
[48] N. Joshi and B. Kadhiwala, "Big data security and privacy issues — A survey", 2017 Innovations in Power and Advanced Computing Technologies (i-PACT), Vellore, 2017, pp. 1-5.
[49] M. Chen, S. Mao and Y. Liu, "Big data: A survey", Mobile Networks and Applications, vol. 19, no. 2 (2014): 171-209.
[50] D. Cole, J. Nelson and B. McDaniel, "Benefits and risks of big data", Proceedings of SAIS 2015 (2015): 1-5.
[51] Y. Elmehdwi, B. K. Samanthula and W. Jiang, "Secure k-nearest neighbor query over encrypted data in outsourced environments", in 2014 IEEE 30th International Conference on Data Engineering (ICDE), pp. 664-675. IEEE, 2014.
[52] P. Jain and A. Thakurta, "Differentially private learning with kernels", in International Conference on Machine Learning, pp. 118-126, 2013.
[53] B. Yao, F. Li and X. Xiao, "Secure nearest neighbor revisited", in 2013 IEEE 29th International Conference on Data Engineering (ICDE), pp. 733-744. IEEE, 2013.
[54] A. Nawaz and J. Iqbal, "Big Data Security and Privacy Challenges: A Review" (this paper, not yet published).
[55] B. Nelson and T. Olovsson, "Security and privacy for big data: A systematic literature review", in 2016 IEEE International Conference on Big Data (Big Data), pp. 3693-3702. IEEE, 2016.
[56] A. Jha, M. Dave and S. Madan, "Big Data Security and Privacy: A Review on Issues, Challenges and Privacy Preserving Methods", International Journal of Computer Applications (0975-8887), vol. 177, no. 4, November 2017.