
Procedia Computer Science 45 (2015) 446–452
doi:10.1016/j.procs.2015.03.077

International Conference on Advanced Computing Technologies and Applications (ICACTA2015)

Testing Data Integrity in Distributed Systems

Manika Mittal, Ronak Sangani, Kriti Srivastava

Information Technology Department, DJSCOE, Vile Parle (W), Mumbai -400056, India

Abstract

Security threats in a distributed environment are among the greatest threats in IT, and big data on distributed infrastructure makes it more complex and difficult to implement a foolproof security framework. The network provides the access path for both inside and outside attacks, which makes it the key access point for any type of security threat. Securing distributed infrastructure is not an easy task: it is complex, it requires careful configuration, and it is subject to human error. A distributed environment should be open, but with a proper security framework in place. Testing for security threats in this kind of complex infrastructure, with huge volumes of unstructured data, is an ongoing process. Hence we need a security testing model which is capable of identifying any unauthenticated intrusion immediately. This paper presents a technique to test distributed environments against attacks on data integrity.

© 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of the scientific committee of the International Conference on Advanced Computing Technologies and Applications (ICACTA-2015).

Keywords: Distributed System; Two Phase Commit Protocol; Hash Function; Collisions

1. Introduction

In this age of big data, every piece of data, small or big, old or new, is considered to be of great importance. Many analytical systems use this data and produce results which can be used for different purposes. There is thus no single source, nor even a countable set of sources, contributing to data collection: data is collected manually and online through a variety of new and old resources, and it is frequently created, copied and moved around. Existing distributed computing networks and various cloud infrastructures have deployed their own security mechanisms, but with the rapid increase in data quantity and variety, security threats have also increased [1,2]. This is the main reason why enterprises and large organizations use their own private data storage instead of public clouds [3]. Privacy preservation is a must in social networks [4], e-commerce [5], and service orientation and the cloud [6].



History shows that security breaches have happened even in the largest and leading organizations [7]. In 2010 Google revealed, through one of its blogs, that it had been the victim of a cyber-attack in which some of its intellectual property was stolen. In 2011, the Sony PlayStation Network was hacked, which shut the service down for several days and exposed the information of millions of users. Security models are never sufficient; they require constant checking and frequent updating. Broadly classified, the types of threats are: damaging existing data, stealing information, creating disruption in the network, exposing confidential data, and corruption [8]. All these threats stem from unauthenticated access to the system. Different service providers have implemented secure authentication systems such as Kerberos, but these are still vulnerable to security threats. Hence there should be a testing mechanism which constantly checks data integrity within the system and raises an alert whenever there is a disruption. This paper proposes a security model which tests data integrity in a distributed framework. Section 2 discusses the existing systems and their pitfalls, Section 3 proposes a testing model to identify vulnerability to data integrity attacks, and Section 4 concludes with future scope.

2. Existing System

A distributed system is a collection of independent nodes, as shown in Fig. 1, each of which stores data fragments (D1, D2, etc.). The nodes are connected via a network, which could be a local area network or a wide area network depending on whether the nodes are situated in the same local area or in different areas.

Fig. 1. Distributed System with replicated data fragments.

2.1. Data Storage in a Distributed System

In a distributed system the data to be stored [9] is split into fragments, and these fragments are distributed across several nodes depending on the requirement for the data on each node. This process is called data fragmentation. Certain data, or fragments of data, may also be replicated judiciously in order to increase the reliability and availability, and hence the performance, of a distributed system (a minimal sketch of both steps follows the list below). Data replication has the following advantages:


- Loss due to the crash of one replica can be avoided, since the failed replica can be replaced by another. In other words, even if a site fails, its data remains accessible as long as it is replicated on another site. This improves both availability and reliability.
- Placing a copy of data in close proximity to the process using it reduces the access time of that data, leading to enhanced performance.
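As a concrete illustration of the two steps above, the following minimal Python sketch fragments a byte string and places each fragment on multiple nodes. The node names, fragment size and round-robin placement policy are assumptions made for the example, not part of the system described in this paper.

```python
# Minimal sketch of data fragmentation and replication. Node names, fragment
# size and the round-robin placement policy are assumed for illustration.
from itertools import cycle

def fragment(data: bytes, size: int) -> list[bytes]:
    """Split data into fixed-size fragments (D1, D2, ...)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def replicate(fragments: list[bytes], nodes: list[str], copies: int = 2) -> dict:
    """Place each fragment on `copies` nodes, chosen round-robin."""
    placement = {node: [] for node in nodes}
    ring = cycle(range(len(nodes)))
    for idx, frag in enumerate(fragments):
        for _ in range(copies):
            placement[nodes[next(ring)]].append((f"D{idx + 1}", frag))
    return placement

nodes = ["node1", "node2", "node3"]
print(replicate(fragment(b"some data to store", size=6), nodes))
```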

2.2. Maintaining Consistency: Two Phase Commit Protocol

The downside of keeping multiple copies of the same data is maintaining consistency [10] across the replicas: if one copy of the data is changed, the corresponding modification must be made to all copies. In order to achieve this tight consistency [11], the Two Phase Commit Protocol (2PC) is used. In the following example, the coordinator wants to update a replica of some data in the database, while P1 and P2 are the participant nodes which also hold replicas of the same data. A minimal sketch of the coordinator's logic follows Fig. 2.

- Phase 1: Voting Phase – The coordinator, after adding the request to its log, sends a voting request (PREPARE T) to all the participants holding a replicated copy of the data to be updated. Each participant either agrees to commit (READY T) or votes to abort (NO T) the transaction (i.e., the modification of the data).
- Phase 2: Decision Phase – The coordinator commits the transaction if all the participants reply with READY T, sending COMMIT T to all the participants, as shown in Fig. 2(a); otherwise it aborts the transaction with ABORT T, as shown in Fig. 2(b).

Fig. 2. (a) 2PC with committed transaction


Fig. 2. (b) 2PC with aborted transaction
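The following minimal sketch shows the coordinator's side of this exchange. The Participant class, its method names, and the always-READY vote are placeholders assumed for illustration, not an interface defined in the paper.

```python
# Sketch of the 2PC coordinator's decision logic. The Participant class and
# its prepare()/commit()/abort() methods are assumed placeholders.
class Participant:
    def prepare(self, txn: str) -> str:
        # A real participant would write to its log before voting;
        # here every participant simply votes READY.
        return "READY " + txn

    def commit(self, txn: str) -> None:
        print(f"committed {txn}")

    def abort(self, txn: str) -> None:
        print(f"aborted {txn}")

def two_phase_commit(coordinator_log: list, participants: list, txn: str) -> bool:
    coordinator_log.append("PREPARE " + txn)        # log before sending PREPARE T
    votes = [p.prepare(txn) for p in participants]  # Phase 1: collect votes
    if all(v == "READY " + txn for v in votes):     # Phase 2: decide
        for p in participants:
            p.commit(txn)                           # COMMIT T to everyone
        return True
    for p in participants:
        p.abort(txn)                                # ABORT T to everyone
    return False

log = []
two_phase_commit(log, [Participant(), Participant()], "T")
```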

2.3. Problem: Attack on the Integrity of Data

Suppose an attacker makes an illegal modification to a copy of some data on Node 1 (refer to Fig. 1). To the user who requests access to data D1, the distributed system appears as a single coherent system, and the user is unaware of which copy of the data is made available. The user could be given access to D1 from either Node 1 or Node 2. Due to the attack, the content of D1 on Node 1 differs from the content of D1 on Node 2. Since the user has no way of verifying the integrity of the data, he will not realize that it has been changed by an unauthorized party, and he may continue using the malicious data D1.

2.4. Solution: Hash Functions

To overcome this problem, we can add a global Hash Store to the distributed system [12,13], which holds the hash values of all the data and data fragments stored, as shown in Fig. 3. A hash function is a one-way function which takes a message of arbitrary length as input and returns a fixed-length output, called the hash or message digest of the input message. This hash value is appended to the message and recomputed by the receiver in order to check for transmission errors or to detect attacks on the integrity of the data. Even a small modification of a message changes its hash, and this property makes hash functions suitable for ensuring message integrity: if the recomputed hash value does not match the appended hash value, the receiver can conclude that the content of the message has been changed. Adding a global Hash Store to a distributed system makes the hash values of all data and data fragments available to authenticated users. Whenever a user accesses any data, he/she can confirm its integrity by computing the hash of the data and comparing it with the hash value retrieved from the Hash Store; the computed hash will not match the retrieved one if the data has been modified illegally by an attacker. A minimal sketch of this check follows Fig. 3.


Fig. 3. Distributed Network with Hash Store
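Here is a minimal sketch of the check described above, assuming SHA-256 as the hash function and an in-memory dictionary standing in for the global Hash Store; neither choice is prescribed by the paper.

```python
# Minimal sketch of the Hash Store integrity check. The hash_store dict is a
# stand-in for the global Hash Store; SHA-256 is one possible hash function.
import hashlib

hash_store = {}  # fragment id -> hex digest

def store_fragment(frag_id: str, data: bytes) -> None:
    hash_store[frag_id] = hashlib.sha256(data).hexdigest()

def verify_fragment(frag_id: str, data: bytes) -> bool:
    """Recompute the hash and compare it with the stored value."""
    return hashlib.sha256(data).hexdigest() == hash_store[frag_id]

store_fragment("D1", b"original content")
print(verify_fragment("D1", b"original content"))   # True
print(verify_fragment("D1", b"tampered content"))   # False: integrity violated
```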

2.5. Modified Two Phase Commit Protocol

The hash value of each data item or data fragment can be computed at the end of the Two Phase Commit Protocol, as shown in Fig. 4, every time data is updated in or added to the distributed system. These hash values are then stored in the global Hash Store, from where they are accessible to all authenticated users. A sketch of this extension follows Fig. 4.

Fig. 4. Modified 2PC Protocol
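Combining the two previous sketches, the modification amounts to a single extra step once the commit succeeds. The code below is meant to run together with those sketches: two_phase_commit and store_fragment are the placeholder functions defined earlier, not names from the paper.

```python
# Sketch of the modified 2PC: after a successful commit, the coordinator
# recomputes the fragment's hash and publishes it to the global Hash Store.
def two_phase_commit_with_hash(log, participants, txn, frag_id, new_data):
    if two_phase_commit(log, participants, txn):  # unchanged 2PC from above
        store_fragment(frag_id, new_data)         # update the global Hash Store
        return True
    return False
```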


3. Testing the Existing System for Data Integrity

If the attacker replaces the data with one of its hash collisions, the hash of the modified data will evaluate to the same value as the hash of the original data. Since the hash values of the legitimate and the malicious data are identical, the user will not be able to distinguish between them, defeating the purpose of the Hash Store.

Hash Collisions: Hash functions need to be collision resistant, i.e. it should be computationally infeasible to find two different messages that hash to the same value. If two or more messages have the same hash value, the situation is called a collision [14,15]. Let H: M → R be a hash function, where M is the message space and R is the space of hash values. Since M is larger than R, by the pigeonhole principle there must be more than one message in M that hashes to the same value in R. Such collisions can be found using a parallel collision search technique [16,17], as follows (a runnable sketch follows the list):

- Let m ∈ M be the message for which we need to find a hash collision, m′ ∈ M be the message the victim will willingly sign, and g_m: R → M be an injective function which maps a hash result in R to a perturbation of message m in M. The perturbation is such that it does not change the semantic meaning of m. Since g_m is injective, each r ∈ R maps to exactly one message g_m(r) in M.
- Partition the set R into two equal-sized subsets R1 and R2, i.e. |R1| = |R2|, and define a function f: R → R by

  f(r) = H(g_m(r)) if r ∈ R1,
  f(r) = H(g_m′(r)) if r ∈ R2.

- Generate a trail of points x_i = f(x_{i-1}) for i = 1, 2, … until a distinguished point x_d, one with an easily testable property, is found. Store the distinguished points in a list and continue the process until the same distinguished point appears twice in the list. Finding the same distinguished point twice implies that two trails intersected, i.e. f(x_i) = f(x_j) with x_i ≠ x_j.
- Check whether x_i and x_j belong to different subsets R1 and R2. If they do, we have successfully found a collision; otherwise repeat the above steps until a collision is found.
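To make the procedure concrete, here is a single-threaded sketch of the trail-based search (the parallel, MapReduce-scale version is future work discussed in Section 4). The hash is deliberately truncated to 20 bits so a collision appears in seconds, and the perturbation function g, the distinguished-point property (low byte zero) and the even/odd split of R into R1 and R2 are all assumptions made for this demonstration.

```python
# Single-threaded sketch of distinguished-point collision search on a toy
# hash (SHA-256 truncated to 20 bits). The perturbation g, the distinguished
# property, and the R1/R2 split are assumptions made for the demo.
import hashlib

BITS = 20
MASK = (1 << BITS) - 1

def H(msg: bytes) -> int:
    return int.from_bytes(hashlib.sha256(msg).digest()[:4], "big") & MASK

def g(r: int) -> bytes:
    # Even r -> perturbation of m (subset R1); odd r -> perturbation of m' (R2).
    base = b"message m #" if r % 2 == 0 else b"message m' #"
    return base + str(r).encode()

def f(r: int) -> int:
    return H(g(r))

def trail(start: int, max_len: int = 1 << 16):
    """Follow x_i = f(x_{i-1}) until a distinguished point (low byte zero)."""
    points = [start]
    while points[-1] & 0xFF != 0:
        if len(points) > max_len:
            return None                    # stuck in a cycle; abandon the trail
        points.append(f(points[-1]))
    return points

def merge_pair(t1, t2):
    """Both trails end at the same distinguished point. Align them from the
    end; the predecessors of their first common point collide under f."""
    if len(t1) < len(t2):
        t1, t2 = t2, t1
    t1 = t1[len(t1) - len(t2):]            # equal distance from the endpoint
    for i in range(len(t2)):
        if t1[i] == t2[i]:
            return None if i == 0 else (t1[i - 1], t2[i - 1])
    return None

seen, start = {}, 0                        # distinguished endpoint -> trail
while True:
    t = trail(start)
    start += 1
    if t is None:
        continue
    d = t[-1]
    if d in seen:
        pair = merge_pair(seen[d], t)
        if pair and pair[0] % 2 != pair[1] % 2:   # collision across R1 and R2
            a, b = pair
            print(f"H({g(a)!r}) == H({g(b)!r}) == {f(a):#07x}")
            break
    seen[d] = t
```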

A system is vulnerable to data integrity attacks if the testing model can easily find collisions for the data stored in the system and replace the original data with a collision. Hence, using the above technique, a testing tool can test a distributed system for data integrity, providing users with an evaluation criterion that enables them to select the most secure distributed system.

4. Conclusion / Future Work

The security of a hash function depends on its collision resistance, which is determined by the size of its range, usually of the order of 2^128, 2^256, etc. This huge space makes the obvious method, selecting distinct inputs x_i for i = 1, 2, 3, … and checking for a collision among the f(x_i) values, very time consuming: by the birthday bound, a range of size 2^n requires on the order of 2^(n/2) hash evaluations before a collision is expected. Thus there is a need to marshal large processing power from distributed processors and run the collision search task in parallel in order to find collisions in feasible time. We aim to implement the testing algorithm described above using Apache Hadoop's MapReduce framework. MapReduce's parallel processing capability caters to the parallelism required by the testing algorithm: it automatically partitions the data, schedules the execution of the algorithm across different machines, and handles the communication between machines.

The data integrity of a distributed system, and hence its trustworthiness, can be tested using the technique described above. If the system does not have a global Hash Store, then any illegal modification made to one replica of a data fragment by the testing model will go undetected, since the user has no means of verifying the content of the fragment made accessible to him. On the other hand, if the system does have a global Hash Store but the testing model can find a collision for a data fragment in feasible time using the above algorithm, then again the illegal modification will go undetected and the distributed system fails the data integrity test. The above testing strategy helps a user evaluate the trustworthiness of different distributed systems and choose the one which is strongest against data integrity attacks.

References

1. Mohamed Firdhous. Implementation of Security in Distributed Systems – A Comparative Study. International Journal of Computer Information Systems, Vol. 2, No. 2, 2011.
2. Ragib Hasan, Suvda Myagmar, Adam J. Lee, William Yurcik. Toward a Threat Model for Storage Systems. National Center for Supercomputing Applications (NCSA), University of Illinois at Urbana-Champaign (UIUC).
3. João Moreno, Donald Kossmann, Tim Kraska, Simon Loesing. A Testing Framework for Cloud Storage Systems. Systems Group, Department of Computer Science, ETH Zürich, March 2010 – September 2010. http://systems.ethz.ch/
4. T. H. Ngac, I. Echizen, K. Komes and H. Yoshiura. New approach to quantification of privacy on Social Network sites. In IEEE, 2010, pp. 556-564.
5. G. Udo. Privacy and Security concerns as major barriers for e-commerce: A survey study. Information Management & Computer Security, Vol. 9, No. 4, pp. 165-174, Oct. 2001.
6. R. Chow, P. Golle, M. Jakobsson, E. Shi, J. Staddon, R. Masuoka and J. Molina. Controlling Data in the Cloud: Outsourcing Computation without Outsourcing Control. In ACM, 2009, pp. 85-90.
7. Kevin Hamlen, Murat Kantarcioglu, Latifur Khan and Bhavani Thuraisingham. Security Issues for Cloud Computing. Technical Report UTDCS-02-10, February 2010.
8. Theodoros K. Dikaliotis, Alexandros G. Dimakis, Tracey Ho. Security in Distributed Storage Systems by Communicating a Logarithmic Number of Bits. 4 May 2010.
9. Ilker Kose. Distributed Database Security. GYTE, Veri ve Ag Guvenligi, Spring 2002.
10. Bettina Kemme, Ganesan Ramalingam, André Schiper, Marc Shapiro and Kapil Vaswani. Consistency in Distributed Systems. Dagstuhl Reports 3, 2 (2013), 92-126.
11. Peter Triantafillou and Carl Neilson. Achieving Strong Consistency in a Distributed File System. IEEE Transactions on Software Engineering, Vol. 23, No. 1, January 1997.
12. Anthony W. Scabby, Anil L. Pereria. Using Hashing to Maintain Data Integrity in Cloud Computing Systems. Department of Accounting, Computer Science and Entrepreneurship, Southwestern Oklahoma State University, Weatherford, OK 73096.
13. Selman Haxhijaha, Gazmend Bajrami, Fisnik Prekazi. Data Integrity Check using Hash Functions in Cloud Environment.
14. Jean-Jacques Quisquater and Jean-Paul Delescaille. How Easy is Collision Search? New Results and Applications to DES. Springer-Verlag, 1989.
15. Jean-Jacques Quisquater and Jean-Paul Delescaille. How Easy is Collision Search? Application to DES. Springer-Verlag, 1989.
16. Paul C. van Oorschot and Michael J. Wiener. Parallel Collision Search with Cryptanalytic Applications. Second ACM Conference on Computer and Communications Security and the proceedings of Crypto '96. September 23, 1996.
17. Paul C. van Oorschot and Michael J. Wiener. Parallel Collision Search with Application to Hash Functions and Discrete Logarithms. August 17, 1994.
