Queen Mary University of London, School of Law Legal Studies Research Paper No. 75/2011


Updated 13 April 2011: added full reference to pt 2 of this series, link and related cross references; corrected uppercase on p 33; removed superfluous phrase on p 36; added clarification to n 58 and its text.


The Problem of 'Personal Data' in Cloud Computing - What Information is Regulated? The Cloud of Unknowing, Part 1 (note 1)

Kuan Hon,2 Christopher Millard3 and Ian Walden4

10 March 2011

Abstract

Cloud computing service providers, even those based outside Europe, may become subject to the EU Data Protection Directive's extensive and complex regime purely through their customers' choices, of which they may have no knowledge and over which they may have no control. We consider the definition and application of the EU 'personal data' concept in the context of anonymisation/pseudonymisation, encryption and data fragmentation in cloud computing, arguing that the definition should be based on the realistic risk of identification, and that the applicability of data protection rules should be based on the risk of harm and its likely severity. In particular, the status of encryption and anonymisation/pseudonymisation procedures should be clarified to promote their use as privacy-enhancing techniques; data encrypted and secured to recognised standards should not be considered 'personal data' in the hands of those without access to the decryption key, such as many cloud computing providers; and finally, unlike, for example, social networking sites, Infrastructure as a Service and Platform as a Service providers (and certain Software as a Service providers) offer no more than utility infrastructure services, and may not even know whether information processed using their services is 'personal data' (hence, the 'cloud of unknowing'), so it seems inappropriate for such cloud infrastructure providers to become arbitrarily subject to EU data protection regulation due to their customers' choices.

1 This paper forms part of the Cloud Legal Project at the Centre for Commercial Law Studies, Queen Mary, University of London. The authors are grateful to Microsoft for providing generous financial support to make this project possible. The views expressed within this paper, however, are solely those of the authors.

2 Research Assistant, Cloud Legal Project, Centre for Commercial Law Studies, Queen Mary, University of London.

3 Professor of Privacy and Information Law and Project Leader, Cloud Legal Project, Centre for Commercial Law Studies, Queen Mary, University of London, and Senior Research Fellow, Oxford Internet Institute, University of Oxford.

4 Professor of Information and Communications Law, Centre for Commercial Law Studies, Queen Mary, University of London.


Keywords: Cloud Computing, Data Privacy, Data Protection, EU, European Union, Internet, Legal Issues, Outsourcing, Personal Data, Personal Identifying Information, Privacy

Table of Contents

1. Introduction and scope
2. Terminology, definitions and cloud computing overview
3. 'Personal data' definition
   3.1 Relevance to cloud computing, and problems with the definition
   3.2 'Personal data' definition
   3.3 Anonymisation and pseudonymisation
       3.3.1 Anonymisation or pseudonymisation itself as 'processing'
       3.3.2 Common anonymisation and pseudonymisation techniques
   3.4 Encryption
   3.5 Sharding
   3.6 The provider's ability to access stored data
   3.7 The way forward?
4. Concluding remarks
Appendix: Suggested issues for consideration in the revision of the DPD

1. Introduction and scope

The QMUL Cloud Legal Project ('CLP') was set up in 2009 to identify and evaluate the legal issues arising from the use of cloud computing.5 The CLP comprises a number of distinct workstreams focussing on a range of specific issues; it will produce a series of papers intended to illuminate the legal concerns that providers and users of cloud services may encounter and propose some regulatory solutions designed to improve the legal framework for the provision of cloud services. Privacy is a major reason commonly cited for hesitation on the part of many potential cloud users to move to cloud computing.6 Privacy law is a broad area.7 This paper focuses on certain key aspects of the 1995 EU Data Protection Directive8 ('DPD'). The DPD is relevant even to non-EU entities because of its potentially-broad reach.9 This paper is the first of a series of three related CLP papers which will consider key foundational DPD legal issues relevant to cloud computing, namely: what information is regulated under the DPD; who is regulated; and which country's laws apply?10

5 The QMUL Cloud Legal Project is being undertaken by members of the Centre for Commercial Law Studies at Queen Mary, University of London.

6 See surveys such as Ponemon Institute, Flying Blind in the Cloud - The State of Information Governance (7 April 2010) for security firm Symantec, and European Network and Information Security Agency, An SME perspective on Cloud Computing - Survey (ENISA, November 2009). On privacy concerns in cloud computing, see Robert Gellman, Privacy in the clouds: Risks to privacy and confidentiality from cloud computing (World Privacy Forum, 23 February 2009), and Michael R Nelson, 'Cloud Computing and Public Policy - briefing paper for the OECD Information, Computer and Communications Policy Committee's Technology Foresight Forum' (OECD, 2009), and Nicole A. Ozer and Chris Conley, Cloud Computing: a Storm Warning for Privacy? (ACLU of Northern California, 2010). Key data protection challenges in cloud computing are outlined in Peter Hustinx, 'Data Protection and Cloud Computing under EU law' (Third European Cyber Security Awareness Day, BSA, European Parliament, 13 April 2010).

7 This paper does not cover other privacy law issues such as confidentiality; the use of private information or the right to private life under the European Convention on Human Rights or the EU Charter of Fundamental Rights; or Directive 2002/58/EC of the European Parliament and of the Council of 12 July 2002 concerning the processing of personal data and the protection of privacy in the electronic communications sector OJ L 201/37, 31.07.2002. On confidential information in the cloud, see the CLP paper by Chris Reed, 'Information "Ownership" in the Cloud' (2010) Queen Mary School of Law Legal Studies Research Paper No. 45/2010.

8 Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data OJ L 281/31, 23.11.1995. Unless otherwise stated, references in this paper to articles and recitals will be to articles and recitals of the DPD.

9 Jurisdictional aspects of the DPD will be discussed in detail in a separate CLP paper, the third in this data protection series.

10 Issues surrounding compliance with data protection laws, such as security measures, will be dealt with in more detail in a separate CLP paper.

The DPD was intended to encourage the free movement of personal data within the EU11 by harmonising national provisions on data protection12 while protecting to a high level the rights and freedoms of individuals (known as 'data subjects'13) who are the subjects of such data, in particular their rights to privacy with respect to the processing of their personal data14 'wholly or partly by automatic means'.15 The DPD requires member states to impose certain obligations on the data 'controller' (the person who determines the purposes and means of processing the personal data),16 provided the controller has the requisite connection with the EEA.17 The DPD does not apply to certain matters,18 in respect of which EU member states' national implementations may, for example, allow exemptions from certain data protection law obligations.

This paper considers the scope of 'personal data' under the DPD, and in what circumstances information is (or is not) within that definition, and therefore caught by national laws implementing the DPD. It should be noted that member states of the EEA had some flexibility in implementing the DPD's requirements into law locally, and they were also permitted to extend the DPD's scope (for example, to include the data of non-natural persons such as companies).19 This means there are important national differences in data protection laws within the EEA - such as on civil liability, and on penalties for non-compliance with data protection laws.20 This paper addresses the DPD only at a European level.

11 In fact the DPD extends to non-EU countries within the European Economic Area, namely Iceland, Liechtenstein and Norway, by virtue of a Decision of the EEA Joint Committee No 83/1999 of 25 June 1999 amending Protocol 37 and Annex XI (Telecommunication services) to the EEA Agreement OJ L 296/41, 23.11.2000. This paper thus often refers to 'EEA' instead of 'EU'.

12 In this paper, 'data protection' refers to its meaning under the DPD, which regulates the processing of 'personal data' wholly or partly 'by automatic means', notably computerised processing. The IT industry often uses 'data protection' to mean data loss protection, ie protecting all kinds of data against loss, including backup, data recovery, disaster recovery and business continuity. Although those elements may be relevant to 'data protection' in the DPD sense, in this paper we use the term only in the DPD sense.

13 Art 2(a).

14 Art 1(1).

15 Art 3(1).

16 Art 2(d). This paper assumes that, whenever a 'controller' is discussed, it has the requisite EEA connection so as to be subject to the DPD.

17 Art 4 (through being 'established' in the EEA or making 'use' of equipment in the EEA). This paper will not cover conflict of laws issues of applicable law or jurisdiction, the DPD's applicability to an entity through its having the requisite EEA connection, or transfer of personal data outside the EEA. Another CLP paper will address those questions.

18 Such as national security or defence. Art 3(2). However, following the Lisbon Treaty, which took effect in December 2009, the EU will extend the DPD to cover such areas more comprehensively, including police and judicial cooperation in criminal matters. The new DPD framework should extend to data retained for law enforcement purposes. European Commission, 'Data protection reform - frequently asked questions' (4 November 2010) MEMO/10/542.

19 As in Austria, Italy and Luxembourg. C Kuner, European Data Protection Law: Corporate Compliance and Regulation (2nd edn, OUP Oxford, 2007), ch 2.37.

20 Kuner, ibid, ch 1 pt G.

Space does not permit coverage of individual member states, though illustrative examples from various states will be given. The DPD is currently undergoing review.21 The European Commission consulted on the DPD in 2009 ('2009 consultation')22 and its subsequent November 2010 Communication on updating the DPD23 indicated that the text of the proposed revised Directive would be issued in 2011. The Communication included mention of cloud computing24 and another 2010 Commission Communication, on the Digital Agenda for Europe, recommended25 the development of an EU-wide strategy on cloud computing, notably for government and science, including economic, legal and institutional aspects. In a recent speech, European Commissioner Kroes, Vice-President for the Digital Agenda, discussed cloud computing, privacy and data protection.26 Therefore it seems likely that the DPD revision will seek to address the implications of cloud computing in some fashion.

2. Terminology, definitions and cloud computing overview

In this paper, those who provide cloud services are termed 'providers'. Users of cloud services (whether corporate or public administration bodies, small/medium enterprises or private individuals) are called 'users' or 'customers'. Individuals using cloud services in a private capacity are termed 'consumers'. There are many different definitions of cloud computing. A common view is that any kind of 'internet-based computing' is cloud computing.27 This paper uses the following definition:28

21 The Council of Europe's Convention 108 for the Protection of Individuals with regard to Automatic Processing of Personal Data is also to be modernised - Council of Europe, Council of Europe response to privacy challenges - Modernisation of Convention (position paper, 32nd International Conference of Data Protection and Privacy Commissioners, 27-29 October 2010, Jerusalem, Israel) and 30th Council of Europe Conference of Ministers of Justice, Resolution No. 3 on data protection and privacy in the third millennium (Istanbul, Turkey, 24-26 November 2010) MJU-30 (2010) RESOL 3 E. EU states are parties to this Convention, which formed much of the basis for the DPD, so issues being considered in relation to the Convention's modernisation (see for example text to n 193) may also be relevant when considering the modernisation of the DPD.

22 At http://ec.europa.eu/justice/news/consulting_public/news_consulting_0003_en.htm.

23 European Commission, 'A comprehensive approach on personal data protection in the European Union' (Communication) COM (2010) 609 final.

24 Said to 'pose challenges to data protection, as it may involve the loss of individuals' control over their potentially sensitive information when they store their data with programs hosted on someone else's hardware'. COM (2010) 609 (n 23), 2.

25 European Commission, A Digital Agenda for Europe COM (2010) 245 (19 May 2010), 23.

26 Neelie Kroes, 'Cloud computing and data protection' (Les Assises du Numérique conference, Université Paris-Dauphine, 25 November 2010) SPEECH/10/686.

27 For example the European Commission in its Communication (n 23) refers to cloud computing as 'Internet-based computing whereby software, shared resources and information are on remote servers ("in the cloud")', and see UK Information Commissioner's Office, Personal information online code of practice (2010), 29: 'The use of cloud computing, or "internet-based computing", involves an organisation using services - for example data storage - provided through the internet'.

- Cloud computing provides flexible, location-independent access to computing resources that are quickly and seamlessly allocated or released in response to demand.

- Services (especially infrastructure) are abstracted and typically virtualised, generally being allocated from a pool shared as a fungible resource with other customers.

- Charging, where present, is commonly on an access basis, often in proportion to the resources used.

Cloud computing activities are often described in terms of three service models.29 Infrastructure as a Service ('IaaS') provides a computing resource such as processing power and/or storage;30 Platform as a Service ('PaaS'), tools for the construction (and usually deployment) of custom applications;31 and Software as a Service ('SaaS'), functionality akin to that of an end-user application.32 These services may be viewed as a spectrum, from low-level functionality (IaaS) to high-level functionality (SaaS), with PaaS in the middle.33

It should be noted that often there may be layers of cloud providers involved in a service provided to the customer, not always to the customer's knowledge. Also, the perspective may affect a cloud service's classification. For instance, to customers of data storage services provider Dropbox, Dropbox offers SaaS. However, behind the scenes Dropbox uses Amazon's IaaS infrastructure in order to provide its services to its own customers,34 so from the viewpoint of Dropbox, as a customer of Amazon, Amazon provides it with IaaS.35

28 S Bradshaw, C Millard and I Walden, 'Contracts for Clouds: Comparison and Analysis of the Terms and Conditions of Cloud Computing Services' (2010) Queen Mary School of Law Legal Studies Research Paper No 63/2010 ('CLP Contracts Paper').

29 Peter Mell and Tim Grance, 'The NIST Definition of Cloud Computing version 15' (US National Institute of Standards and Technology 2009).

30 Examples being Rackspace, and Amazon's EC2 and S3.

31 Such as Google's App Engine and Microsoft's Windows Azure.

32 Examples are webmail services such as Yahoo! Mail, and social networking sites such as Facebook. Salesforce's online customer relationship management service is often cited as an example of SaaS for enterprise cloud users.

33 It is possible to view PaaS as a specialised kind of SaaS, providing application software over the cloud combined with an application hosting service. The specific type of application software provided to the cloud customer in PaaS is programming and code testing software, which enables the customer to develop its own application software using the cloud provider's infrastructure. The customer's resulting application software may then be hosted on the provider's servers, thus also enabling the cloud customer (or employees or customers of the cloud customer) to run the customer's application software on the cloud provider's servers over the internet. Alternatively, PaaS may be viewed as a kind of IaaS, providing development and hosting infrastructure to enable the cloud customer to develop its own cloud applications and make them available to its own employees or customers. PaaS is probably more akin to IaaS than SaaS, but this illustrates the blurred boundaries in cloud computing concepts.

34 Dropbox, 'Where are my files stored?' (2010).

Also, PaaS may be layered on IaaS, SaaS layered on PaaS, and so on. To illustrate, provider Makara offers several kinds of cloud platform which may use, for example, Amazon's EC2 or VMware's vCloud IaaS infrastructure.36

Finally, it should be noted that it is not just cloud providers but also cloud customers themselves who increasingly combine cloud services from different cloud providers. One use might be ancillary support for the primary cloud service - for example analytics, monitoring,37 and even cloud-based billing systems.38 As further examples of the increasing integration of SaaS across different cloud providers, independent vendor Ribbit's unified communications application39 integrates with Salesforce.com's SaaS customer relationship management service, while Google Apps Marketplace40 allows customers of its SaaS office suite Google Apps to purchase and deploy third party cloud applications which integrate with, and can be managed and accessed through, Google Apps. These examples illustrate the increasing sophistication of cloud use, and the increasing layering of providers. Just as, with traditional IT usage, the same organisation may install and operate different software applications together, so cloud customers are increasingly integrating different cloud applications and support services with each other as well as with legacy internal IT systems.

The definition of 'processing' under the DPD is very broad and includes any operation on data, such as collection or disclosure of data.41 In this paper we assume 'processing' includes all operations relating to data in cloud computing, including storage of personal data in the cloud.

35 CLP Contracts paper (n 28), s 3, 8.

36 Makara, 'What clouds does Makara support?' (FAQ).

37 For example Nimsoft, which provides cloud applications monitoring and reporting.

38 For example, for Windows Azure users, Zuora's pricing and credit card processing system. 'Real World Windows Azure: Interview with Jeff Yoshimura, Head of Product Marketing, Zuora' (Windows Azure Team Blog, 11 November 2010).

39 Ie voice, SMS, email, and voice-to-text transcriptions. Ribbit, What is Ribbit for Salesforce?

40 'Google Apps Marketplace now launched' (Google Apps, 10 March 2010).

41 Art 2(b).

3. 'Personal data' definition

3.1 Relevance to cloud computing, and problems with the definition

In considering the processing of personal data in the cloud, the definition of 'personal data' is critical, because the DPD only applies to 'personal data'.42 Information which is not, or which ceases to be, 'personal data' may be processed, in the cloud or otherwise, free of EU data protection law requirements such as the principles ('Principles') in accordance with which 'personal data' must be processed.43 Those Principles include restrictions on transfers of personal data outside the EEA. In cloud computing, the 'personal data' definitional issue is most relevant in three specific contexts:

- anonymised and pseudonymised data;

- encrypted data - whether encrypted in transmission (ie data in motion or data in transit) or in storage (data at rest);44 and

- sharding or fragmentation of data.

In each case, the question is: should such data be treated as 'personal data'? These forms of data involve different procedures being applied to the original personal data, at different stages, and/or by different actors. Those procedures will be discussed in more detail below. First, we provide a brief overview.

Anonymised or pseudonymised data45 result from actions deliberately taken on the original personal data to try to conceal or hide the data subjects' identities. Common anonymisation and pseudonymisation procedures are outlined at 3.3.2. In cloud computing, a cloud user may perform an anonymisation or pseudonymisation procedure on a data set before processing the resulting data in the cloud. It is also possible, and in some cases key to its business model, for a cloud provider to anonymise personal data stored with it, in order to then use, sell or share the anonymised data.46 For example, some US cloud providers who store health data records anonymise and sell the data.47

42 Including 'special category' or 'sensitive personal data'. See paragraph containing n 207.

43 See recital 26.

44 Data held by a cloud service provider, i.e. 'at rest', may in fact be regularly 'in transmission' between internal resources of the service provider, in order to maximise the efficient utilisation of such resources.

45 See further 3.3 below.

46 Miranda Mowbray, 'The Fog over the Grimpen Mire: Cloud Computing and the Law' (2009) 6:1 SCRIPTed 129, 144-5. Some providers may not even anonymise personal data prior to 'processing', such as when their automated software scans social networking profiles (or web-based emails, which may contain personal data) to display advertisements to users based on the data content. See also the findings of Joseph Bonneau and Sören Preibusch, 'The Privacy Jungle: On the Market for Data Protection in Social Networks' in Tyler Moore, David J. Pym, Christos Ioannidis (eds), Economics of Information Security and Privacy (Springer 2010). Consent to such processing is a separate matter.

As an EU example, UK service HipSnip enables mobile phone users to 'snip' and save information such as offers from consumer product brands. Its terms state that it may 'share, rent, sell, or trade aggregated and anonymised data (e.g. which age groups prefer a particular service, but never individual information)'.48 Many anonymisation and pseudonymisation techniques involve amendments to part only of a data set, such as deleting or disguising names or other identifiers, including applying cryptography to identifiers. The other information in the data set, such as usage information or test results associated with the name, remains available to those receiving or having access to the resulting data set.49

Encrypted data50 normally involves encrypting the entire data set for security purposes, ie transforming or converting the entire original data set by applying an 'algorithm' to it, rather like translating information into a foreign language so that only those who understand that language can read it.51 The security of encrypted data depends on several factors, notably the strength of the encryption method used (ie the cryptographic strength of the algorithm) and the length of the encryption key, with longer keys generally providing better security against attacks; another important factor is the management of the decryption key - how securely the key is stored, how many people have access to it, etc.52
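The following minimal sketch (ours, not the paper's, and assuming the third-party Python 'cryptography' package and an invented sample record) illustrates the point: data encrypted with a symmetric algorithm is unreadable without the decryption key, so who holds the key, and how it is managed, determines who can read the data.

from cryptography.fernet import Fernet  # third-party package; an assumption for this sketch

# The cloud user generates and keeps the key; it is never given to the provider.
key = Fernet.generate_key()
cipher = Fernet(key)

record = b"Name: Jane Example; Test result: negative"   # invented sample data

# What a provider merely storing the data would hold: an opaque token.
ciphertext = cipher.encrypt(record)
print(ciphertext)

# Only a party holding the key can reverse the operation.
assert cipher.decrypt(ciphertext) == record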

47 K Zetter, 'Medical Records: Stored in the Cloud, Sold on the Open Market' (Wired, 2009): '[M]edical data is collected, anonymized and sold, not by insurance agencies and health care providers, but by third-party vendors who provide medical-record storage in the cloud.' Another US example is the site PatientsLikeMe, which hosts discussion boards where users discuss experiences with AIDS and other medical conditions. 'PatientsLikeMe also sells data about its users. PatientsLikeMe says the data it sells is anonymized, no names attached.' Julia Angwin and Steve Stecklow, 'Scrapers' Dig Deep for Data on Web', The Wall Street Journal (2 October 2010).

48 HipSnip, HipSnip Legal Statement.

49 For examples of social networking sites sharing 'anonymised' data, and re-identification of individuals from their anonymised social graphs (network of individuals to whom they have connections), see Arvind Narayanan and Vitaly Shmatikov, 'De-anonymizing Social Networks' (Proceedings of the 2009 30th IEEE Symposium on Security and Privacy, IEEE Computer Society Washington, DC, USA, 17-20 May 2009) 173.

50 See generally Ross Anderson, Security Engineering: A Guide to Building Dependable Distributed Systems (2nd edn, John Wiley & Sons, 2008) Chapter 5 'Cryptography', for an excellent guide to cryptography for those without cryptology training. See further 3.4 below.

51 Similarly, encrypted data are intended to be read by only those who hold and can use the decryption key - typically, those who know the password or passphrase required to generate and use the decryption key, which will itself usually be encrypted.

52 See generally eg Matt Blaze et al, Minimal key lengths for symmetric ciphers to provide adequate commercial security (US Defense Technical Information Center 1996) and William Stallings, Cryptography and Network Security: Principles and Practice (5th ed, Prentice Hall 2010). The widely-reported incident, where contents of supposedly-secret US embassy cables were published on the Wikileaks website, illustrates the importance of restricting access to the key. While the leaked data may have been stored very securely, the overall measures were not conducive to security because it seems too many people had access - see Greg Miller, 'CIA to examine impact of files recently released by WikiLeaks', The Washington Post (22 December 2010) and Bruce Schneier, 'Wikileaks' (Schneier on Security, 9 December 2010).

Some encryption methods have been 'broken' or 'cracked', and data encrypted via the method revealed.53 In this paper, we will consider information to be 'strongly' secured if it is secure against decryption for most practical purposes most of the time in the real world,54 and in particular if it has been encrypted, and the decryption keys secured, to recognised industry standards.55 As well as encrypting all information in a data set, it is possible to apply cryptography to only part of the set, for example only names. Using cryptography in this way overlaps with pseudonymisation or anonymisation. In attempting to anonymise or pseudonymise data, identifiers may be 'hashed' using 1-way cryptography, or encrypted using 2-way cryptography, respectively, but the other data left readable as 'plaintext'.56 Cryptography may be applied to data recorded in an electronic file, folder, database, parts only of a database, etc. In cloud computing, a user may apply cryptography to parts or, perhaps more commonly, the whole of a data set recorded in whatever form, before storing it in the cloud. Encryption performed on data within the user's computer prior to the data transmission may employ the user's own software, or the cloud provider's.57 Even if the user intends to process data unencrypted in the cloud, the provider itself may nevertheless choose to encrypt all or part of the data it receives, in order to store it more securely.
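As a minimal illustration of applying cryptography to only part of a data set, the sketch below (ours, using only Python's standard library and an invented record) replaces the name field with a 1-way hash while leaving the remaining fields readable as plaintext; unlike 2-way encryption, there is no key with which the digest can be turned back into the name.

import hashlib

record = {"name": "Jane Example", "age_group": "30-39", "test_result": "negative"}

# 1-way: the same name always produces the same digest, but the digest cannot
# be 'decrypted' back into the name.
record["name"] = hashlib.sha256(record["name"].encode("utf-8")).hexdigest()

print(record)   # {'name': '<64-character hex digest>', 'age_group': '30-39', 'test_result': 'negative'}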

53 Researchers and others have over time found vulnerabilities in various encryption techniques, which have then had to be replaced by more secure algorithms. Also, technological advances - including cloud computing, see nn 117 and 124 - have made it easier to decrypt encrypted data via 'brute force' attacks, whereby many computers very quickly try different decryption keys to find one that 'fits'. For example, messages encrypted using the Data Encryption Standard (DES) were decrypted by the US Electronic Frontier Foundation in 1998.

54 Security expert Bruce Schneier has noted, on weaknesses being discovered in various cryptographic systems, that in practice the system will not become suddenly insecure. However, practical attacks will probably be possible someday, based on the techniques discovered. So such discoveries provide impetus to move to more secure encryption techniques, rather than signifying that everything encrypted using the system concerned is immediately insecure. Bruce Schneier, 'Cryptanalysis of SHA-1' (Schneier on Security, 18 February 2005).

55 For an example of recognised encryption standards see the US Code of Federal Regulations 45 CFR Part 170, Health Information Technology: Initial Set of Standards, Implementation Specifications, and Certification Criteria for Electronic Health Record Technology issued by the US Department of Health and Human Services (Federal Register, July 28, 2010). § 170.210 stipulates standards to which § 170.302 generally requires electronic health information to be encrypted: 'Any encryption algorithm identified by the National Institute of Standards and Technology (NIST) as an approved security function in Annex A of the Federal Information Processing Standards (FIPS) Publication 140-2...'.

56 1-way and 2-way cryptography are explained further at 3.4 below.

57 For example Mozilla Weave, now Firefox Sync - mentioned in Christopher Soghoian, 'Caught In The Cloud: Privacy, Encryption, And Government Back Doors In The Web 2.0 Era' [2010] Journal on Telecommunications and High Technology Law, 8(2) 359, which suggests reasons for the relative lack of security measures in the cloud such as encryption, and offers possible solutions. On technological cryptographic options in the cloud, see Seny Kamara and Kristin Lauter, 'Cryptographic Cloud Storage', in Radu Sion and others (eds), FC'10 Proceedings of the 14th International Conference on Financial Cryptography and Data Security (Springer-Verlag Berlin, Heidelberg 2010), 136. Baking in secure encryption for users can result in other problems for providers. Mobile phone maker Research In Motion (RIM) designed its Blackberry Enterprise Solution for very strong security, with organisation-controlled servers, where encryption keys are not accessible to Blackberry and there are no back doors - 'RIM does not possess a "master key", nor does any "back door" exist in the system that would allow RIM or any third party to gain unauthorized access to the key or corporate data' (RIM's statement sent to customers in Saudi Arabia, reproduced in Joanne Bladd, 'BlackBerry's response: RIM statement in full' (ArabianBusiness.com, 3 August 2010)). While those features have made it popular with businesses and others exchanging sensitive data, RIM is increasingly under pressure from some governments who wish to intercept the encrypted communications of Blackberry customers despite access to most communications being 'technologically infeasible'. RIM, BlackBerry Enterprise Solution Security release 4.1.2 Technical Overview (2006); RIM, 'RIM Offers to Establish an Industry Forum to Collectively Work with the Government of India on Policies and Processes Surrounding Lawful Access' (Blackberry, 26 August 2010); RIM, 'Customer Update August 12, 2010' (Blackberry); Satchit Gayakwad, 'RIM statement on enterprise server access, India access deadline' (Bloomberg, January 2011); Josh Halliday, 'BlackBerry denies India email access deal as struggle continues' (Guardian, London, 30 December 2010).

The data transmission may itself be encrypted or unencrypted, usually depending on how the cloud provider has set up its systems, for example to use encrypted connections.

Sharding or fragmentation58 is an automated procedure performed by a cloud provider's software, which automatically breaks data up into fragments for storage in different storage equipment, possibly in different locations, based on the provider's sharding policies, such as performance maximisation. Cloud users have no say in the automated sharding procedure, although in some cases they may be able to choose in what broad geographical zone the resulting shards should be stored.59
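A hypothetical sketch of the kind of automated fragmentation described above follows (ours, in Python; the node names, fragment size and placement rule are all invented). It illustrates only the basic idea that the provider's software, not the user, decides how data are split and where each fragment is placed.

import hashlib

NODES = ["node-eu-1", "node-eu-2", "node-us-1"]   # invented storage locations
FRAGMENT_SIZE = 16                                # bytes per fragment; tiny, for illustration only

def shard(data: bytes):
    """Split data into fragments and assign each fragment to a node."""
    placement = []
    for i in range(0, len(data), FRAGMENT_SIZE):
        fragment = data[i:i + FRAGMENT_SIZE]
        # A toy placement policy: hash the fragment index and pick a node.
        node = NODES[int(hashlib.sha256(str(i).encode()).hexdigest(), 16) % len(NODES)]
        placement.append((node, fragment))
    return placement

for node, fragment in shard(b"customer file contents, possibly containing personal data"):
    print(node, fragment)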

3.2 'Personal data' definition

We now consider in detail the 'personal data' definition. A key characteristic of data protection law, which distinguishes it from privacy law, is the use of objective definitions for personal data and sensitive personal data, rather than subjective reasonable expectations.

58 See further 3.5 below. Users may choose to 'shard' cloud databases logically, for example into different 'domains' they create using Amazon's SimpleDB; but in this paper we use 'sharding' solely to refer to the data fragmentation automatically performed by providers' systems.

59 Providers such as Amazon and Microsoft offer users the option to confine storage (and presumably other processing) of their data to the EU or other geographically-circumscribed locations. CLP Contracts paper (n 28).

This has resulted, as will be seen below, in a 'binary', 'all or nothing' perspective, and wide-ranging applicability. The DPD defines 'personal data' as:

    any information relating to an identified or identifiable natural person ('data subject'); an identifiable person is one who can be identified, directly or indirectly, in particular by reference to an identification number or to one or more factors specific to his physical, physiological, mental, economic, cultural or social identity.60

The processing of certain special categories of personal data, deemed to be particularly sensitive ('sensitive data'), is regulated more stringently.61 These categories are personal data revealing 'racial or ethnic origin, political opinions, religious or philosophical beliefs, trade-union membership', 'data concerning health or sex life', and data relating to criminal offences or convictions.62 Just as information which is 'personal data' is subject to the DPD whatever its nature, so information which qualifies as sensitive data is automatically subject to the stricter regime regardless of its exact nature or context. For example, information that X had flu last week is required to be treated in the same way as news that X has contracted a very embarrassing disease.

Not all information will merit protection as 'personal data'. One might think some information is intrinsically 'non-personal' and can never be related to an identifiable individual, such as information on meteorological conditions on top of Mt Everest recorded at fixed intervals by automated equipment.63 Leaving aside such apparently 'non-personal' information, however, recital 26 of the DPD also recognises that information which is 'personal data', as defined, may nevertheless be rendered 'anonymous' - and, as such, no longer subject to data protection requirements:

    Whereas the principles of protection must apply to any information concerning an identified or identifiable person; whereas, to determine whether a person is identifiable, account should be taken of all the means likely reasonably to be used either by the controller or by any other person to identify the said person; whereas the principles of protection shall not apply to data rendered anonymous in such a way that the data subject is no longer identifiable.64

60 Art 2(a), emphasis added.

61 Certain other types of data are also subject to more stringent regulation, such as 'traffic data' and 'location data' under Directive 2002/58/EC of the European Parliament and of the Council of 12 July 2002 concerning the processing of personal data and the protection of privacy in the electronic communications sector OJ L 201/37, 31.07.2002.

62 Art 8. The DPD uses the term 'special category' data, but such data are generally referred to as 'sensitive data' or 'sensitive personal data'. The stricter requirements may include requiring the data subject's 'explicit consent' to the processing.

63 That is not strictly correct, as will be seen. Even seemingly non-personal information can be 'personal data'.

Unfortunately, interpreting and applying these provisions is not straightforward,65 especially when considering how personal data may be sufficiently 'anonymised' or 'pseudonymised' to take it outside the regulatory regime. In 2007 the Article 29 Working Party ('A29WP')66 produced an opinion ('WP136') on the concept of 'personal data'67 to provide guidance contributing to the uniform application of data protection rules across the EU. There are some important general points to note regarding WP136.

First, it takes a broad view of 'personal data'. It states that the DPD was intended to cover all information concerning, or which may be linked to, an individual,68 and that 'unduly restricting' the interpretation of the 'personal data' definition should be avoided. What might seem an over-broad application of the DPD, resulting from wide interpretation of the definition, should instead be balanced out by using the flexibility allowed in the actual application of the DPD's rules.69

Secondly, as with all A29WP opinions, WP136 may have persuasive authority but it is not binding on any EU member states' courts or the European Court of Justice. Data protection regulators may perhaps be less likely to pursue an entity which complies with A29WP opinions, but even individual national regulators, not to mention national courts, may have different views from those expressed in A29WP opinions.70 Therefore, in practice controllers may take a conservative or cautious approach to reliance on such opinions.

Thirdly, WP136 emphasises that whether information is 'personal data' is a question of fact, depending on the context. For example, a very common family name will not be sufficient to single someone out from the whole of a country's population, but is likely to identify a pupil in a classroom.71

64 Emphasis added.

65 Particularly as many national law definitions of 'personal data' also differ from the DPD definition - see the table of comparative definitions in Kuner (n 19) ch 2.82.

66 Established under DPD art 29, this Working Party comprises national EU data protection regulators and the European Data Protection Supervisor (who supervises compliance by EU institutions such as the European Commission with data protection requirements). Their functions include advising the European Commission on various DPD-related matters, and considering national implementations of data protection laws to contribute to their uniform application.

67 Article 29 Data Protection Working Party, Opinion 4/2007 on the concept of personal data, WP 136 (2007).

68 Even information about things can be 'personal data' if it can be linked to an individual, see WP136 part 2.

69 WP136, 4-6.

70 This is perhaps not surprising given that A29WP decisions are taken by a simple majority - art 29(3).

In particular, it is important to note that information which is not 'personal data' in the hands of one person (such as a cloud user) may, depending on the circumstances, become 'personal data' when obtained or processed by another72 (such as a cloud provider). For example, information may become 'personal data' if the provider tries to process the information further for its own purposes. Indeed, information held by the same entity, which is not originally 'personal data', may become so if the entity later processes it for other purposes, such as to identify the individuals concerned. Similarly, when considering whether individuals can be identified, account must be taken of 'all the means likely reasonably to be used by the controller or any other person' to identify them.73 This test is dynamic. Methods 'likely reasonably to be used' to identify individuals may change over time, especially as re-identification technology improves and costs decrease. Accordingly, the intended storage period of the information is also a factor.74

There is a final important point on the context-dependent nature of the definition. Whether information is personal data or not may be affected by what technical and organisational measures are taken to prevent identification (where the purpose of processing is not identification).75 The more effective the measures taken to prevent identification,76 the more likely it is that the information will not be 'personal data'.

71 WP136, 13, and the contextual nature of 'personal data' has been recognised for example in Common Services Agency v Scottish Information Commissioner (Scotland) [2008] UKHL 47, [2008] 1 WLR 1550 ('CSA'), [27].

72 WP136, 20, and see the examples in part 5.2, UK Information Commissioner's Office, Determining what is personal data (Data Protection Technical Guidance) (2007).

73 Which means, according to WP136, that 'a mere hypothetical possibility to single out the individual is not enough to consider the person as "identifiable".' The difficult question is, where does the possibility of identification become more than merely hypothetical?

74 WP136, 15 notes that for data meant to be stored for a month, identification may be expected to be impossible during the 'lifetime' of the information, so they should not be considered as personal data. But for data intended to be kept for 10 years, the controller should consider the possibility of identification in say year 9, which may make them personal data at that point.

75 WP136, 17 (emphasis added): 'Putting in place the appropriate state-of-the-art technical and organizational measures to protect the data against identification may make the difference to consider that the persons are not identifiable, taking account of all the means likely reasonably to be used by the controller or by any other person to identify the individuals. In this case, the implementation of those measures are not the consequence of a legal obligation arising from Article 17 of the Directive (which only applies if the information is personal data in the first place), but rather a condition for the information precisely not to be considered to be personal data and its processing not to be subject to the Directive.'

76 Which may include formulating and implementing appropriate policies, procedures and staff training, as well as technical measures.

3.3 Anonymisation and pseudonymisation

As summarised above, because anonymous data is outside the DPD's scope, information is often processed in the cloud, by cloud user and/or provider, on the basis that it is not 'personal data' but is 'anonymous', or it may be 'anonymised' to facilitate future processing. We therefore now discuss anonymisation and pseudonymisation in further detail, to assess the extent to which anonymised or pseudonymised data may be processed using cloud computing without being subject to EU data protection rules.

3.3.1 Anonymisation or pseudonymisation itself as 'processing'

The release or other processing of anonymous or pseudonymised data in fact involves two discrete steps, of necessity:

1. anonymising or pseudonymising the 'personal data'; followed by

2. disclosing, releasing or otherwise processing the 'anonymised' or 'pseudonymised' data.

If the first step, anonymisation or pseudonymisation of personal data, is itself considered a form of 'processing' within the DPD's wide definition, then all data protection laws would apply to the anonymisation or pseudonymisation procedure. This would mean, for example, that data subjects' consent, even explicit consent for sensitive data like medical data, might be needed - simply to enable personal data to be anonymised or pseudonymised. WP136 did not explicitly consider the significance of these procedures involving two steps. The current legal uncertainty regarding the status of these procedures may, unfortunately, discourage their use as privacy-enhancing techniques, and may also discourage the production and use of anonymised or pseudonymised data for socially-desirable purposes like medical research.77 For the purposes of this paper, we will assume that there is no issue with the actual procedure of anonymisation or pseudonymisation itself.

3.3.2 Common anonymisation and pseudonymisation techniques

The main ways used to try to 'anonymise' personal data, particularly to enable release of statistical information or publication of research results, include the following (a brief illustrative sketch of some of these techniques follows the list):

77 I Walden, 'Anonymising Personal Data' [2002] 10(2) International Journal of Law and Information Technology 224. There are however views that consent should be required for anonymisation, such as where anonymised data are to be used for medical research in areas where the data subject has moral objections - ibid, fn 33. See also The European Privacy Officers Forum, Comments on the Review of European Data Protection Framework (2009) ('EPOF'), commenting that while that may not have been the legislative intention, the 'processing' definition conceivably catches the act of anonymisation itself: 'Organisations are discouraged from adopting anonymisation as a privacy enhancing technique, when in reality they have to expend disproportionate resource de-identifying data to the required standard, or in seeking unnecessary consents from individuals'.

- deleting or omitting 'identifying details' from the information, such as names;

- substituting code numbers for names or other direct identifiers (this is pseudonymisation, effectively);

- aggregating information, for example by age group, year of occurrence or town of residence;78 and

- barnardisation,79 or other techniques which introduce statistical noise into information, such as differential privacy techniques applied to statistical databases;80

- or some combination of the above.
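The sketch below (ours, in Python, with invented sample records) illustrates three of the listed techniques in their simplest form: deletion of direct identifiers, aggregation into bands, and barnardisation-style noise on a statistical count. Real implementations involve further rules, for example recalculating row and column totals after barnardisation.

import random

records = [
    {"name": "Jane Example", "age": 34, "town": "Smalltown", "result": "positive"},
    {"name": "John Sample", "age": 37, "town": "Smalltown", "result": "negative"},
]

def drop_identifiers(record):
    """Deletion: remove direct identifiers such as names."""
    return {k: v for k, v in record.items() if k != "name"}

def aggregate_age(record):
    """Aggregation: replace an exact age with a ten-year band."""
    low = (record["age"] // 10) * 10
    return {**record, "age": f"{low}-{low + 9}"}

def barnardise(count):
    """Barnardisation: randomly add -1, 0 or +1 to a non-zero cell count."""
    return count + random.choice([-1, 0, 1]) if count else 0

anonymised = [aggregate_age(drop_identifiers(r)) for r in records]
print(anonymised)       # no names; ages shown only as bands
print(barnardise(3))    # prints 2, 3 or 4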

We now consider WP136's detailed views on anonymisation and pseudonymisation,81 which will also be relevant to the subsequent discussion on encryption. Regarding 'directly or indirectly' in the 'personal data' definition, WP136 notes82 that identification involves singling someone out or distinguishing them from other people. It states that:

- direct identification includes identifying someone by their name or other 'direct identifier', and

78 WP136, 22. Aggregating individuals into a group is intended to make it harder to single out individuals in the group.

79 A statistical technique attempting to anonymise statistical counts, and thus ensure individuals cannot be identified from statistics, while still providing indications of the actual numbers concerned. It involves randomly adding 0, +1 or -1 to all non-zero cell counts in the cells of a table, then recalculating table row and column totals accordingly. CSA (n 71) [8], [15]. After CSA (n 71), the Scottish Information Commissioner found, on the facts, that barnardisation would not adequately anonymise the data, but broader aggregation would. Statistics requested by age range 0-14, by year, from 1990-2003, for the Dumfries and Galloway area by census ward, were considered too identifying, even if barnardised. However, aggregated statistics for the whole Dumfries and Galloway area for each of the years 1990-2001 were ordered to be disclosed instead. Collie and the Common Services Agency for the Scottish Health Service, Childhood leukaemia statistics in Dumfries and Galloway, [2010] UKSIC 021/2005 ref 200500298.

80 C Dwork, 'Differential Privacy: A Survey of Results', in Manindra Agrawal, Dingzhu Du, Zhenhua Duan and Angsheng Li (eds), Theory and Applications of Models of Computation (Springerlink 2008). Differential privacy seeks to enable accurate statistical information to be provided through querying a database containing information on a group of individuals, without compromising their privacy. For a non-mathematical overview see Ethan Zuckerman, 'Cynthia Dwork defines Differential Privacy' (my heart's in accra, 29 September 2010).

81 WP136 analysed all 'building blocks' of the 'personal data' definition - 'any information', 'relating to', 'an identified or identifiable' and 'natural person'. Here we consider only the 'identified or identifiable' element.

82 WP136, 12-15.


- indirect identification includes identifying someone by reference to an identification number or other specific characteristic of theirs. This also includes identifying someone by combining different pieces of information (identifiers), whether held by the controller or not, which individually might not be enough to identify the person (see the short sketch below).
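The following invented illustration (ours) makes WP136's point about combining identifiers concrete: neither attribute singles anyone out on its own, but the combination matches exactly one record in this small, fictional data set.

dataset = [
    {"age_band": "30-39", "town": "Smalltown", "condition": "asthma"},
    {"age_band": "30-39", "town": "Bigcity", "condition": "flu"},
    {"age_band": "40-49", "town": "Smalltown", "condition": "diabetes"},
]

# Each attribute alone matches two records; together they match only one,
# 'singling out' a single individual.
matches = [r for r in dataset if r["age_band"] == "30-39" and r["town"] == "Smalltown"]
print(len(matches))   # 1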

Identification numbers attached to an individual, or similar unique identifiers, are considered of particular concern because they can enable disparate information, linked to the same ID number or other indirect identifier, to be correlated with the same physical individual, and therefore combined to identify them. Thus, omitting or deleting obvious direct identifiers such as names, while leaving indirect identifiers untouched, may not be enough to render information non-personal. Nevertheless, deleting direct identifiers is often considered adequate to prevent identifiability.83 Proposed practical guidance on the minimum standard for de-identifying datasets, to ensure patient privacy when sharing clinical research data, recommends deleting direct identifiers. The authors consider that 'direct identifiers' include not just names but also email addresses, biometric data, medical device identifiers and IP addresses. They also recommend that, even after such deletion, if the remaining information includes at least 3 indirect identifiers, for example age or sex, it should be reviewed independently before publication. In other words, they consider that the ability to combine 3 or more indirect identifiers presents sufficient risk of identification so as to require independent review of whether the risk is 'non-negligible', in which case they recommend approval by a relevant advisory body before publication.84

Pseudonyms, involving substituting a nickname or code number for the real name, are considered a form of indirect identifier.85 WP136 describes86 pseudonymisation as:

    the process of disguising identities. The aim of such a process is to be able to collect additional information relating to the same individual without having to know his identity. This is particularly relevant in the context of research and statistics.

Pseudonymisation methods may be retraceable/reversible, or irreversible:

- Retraceable/reversible pseudonymisation. Retraceable pseudonymisation is intended to allow the individuals concerned to be 'retraced' or re-identified in certain restricted circumstances.87 An example of such pseudonymisation involves changing names to code numbers, with a key or list88 mapping which name corresponds to which number. The resulting information is known as 'key-coded data'. Retraceable pseudonymisation is often used in pharmaceutical trials. Another example of reversible pseudonymisation is applying 2-way cryptography to names or other direct identifiers. (Key-coding is sketched briefly after this list.)

83 For example, in Joined Cases C 92/09 and C 93/09 Volker und Markus Schecke (Approximation of laws) OJ C 13/6, 15.1.2011 - which seemed to assume that deleting names etc would adequately anonymise recipients of certain funds.

84 I Hrynaszkiewicz and others, 'Preparing raw clinical data for publication: guidance for journal editors, authors, and peer reviewers' British Medical Journal 2010; 340:c181.

85 For example, in Germany the holder of a pseudonym may seek access to information held by online service providers relating to his pseudonym. Kuner (n 19) ch 2.10.

86 WP136, 17.

87 Such as, with pseudonymised data from medical trials, to identify individuals who may need follow-up treatment, or to allow regulators to audit the trials.

- Irreversible pseudonymisation. Irreversible pseudonymisation is intended to render re-identification impossible. An example is 'hashing', applying 1-way cryptography (hash functions) to names or other direct identifiers.89
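As a minimal sketch of retraceable pseudonymisation ('key-coded data'), the code below (ours, with invented trial records) replaces each name with a code and keeps the name-to-code mapping, the 'key', separately, for example with the lead researcher only. Recipients of the coded data without the key cannot retrace identities; the key holder can. Contrast the 1-way hashing sketched earlier, which cannot be reversed.

trial_records = [
    {"name": "Jane Example", "dose_mg": 50, "outcome": "improved"},
    {"name": "John Sample", "dose_mg": 75, "outcome": "no change"},
]

key = {}   # name -> code; held separately from the coded data

def key_code(record):
    """Replace the name with a code, recording the mapping in the key."""
    if record["name"] not in key:
        key[record["name"]] = f"P{len(key) + 1:04d}"
    return {**record, "name": key[record["name"]]}

coded = [key_code(r) for r in trial_records]
print(coded)   # what recipients of the key-coded data see
print(key)     # retained only by the key holder, enabling re-identification if needed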

Retraceably pseudonymised data such as key-coded data may still be considered 'personal data', as the point of making the procedure retraceable is to enable future re-identification of individuals, albeit only in strictly limited circumstances. If codes used are unique to each individual, WP136 considers there is still a risk of identification, and the pseudonymised information remains 'personal data' to which data protection rules still apply.90 However, if the pseudonymisation reduces risks for the individual, the rules could be applied more flexibly, and processing of pseudonymised data made subject to less strict conditions than when processing information on directly-identifiable individuals.91

The Austrian implementation of the DPD92 illustrates this less strict approach. It uses a concept of 'indirectly personal data'. Information is 'indirectly personal data' if the controller, processor or recipient of the data cannot identify the individuals using legally permissible means. Indirectly personal data are, effectively, a kind of pseudonymous data, where identities can be retraced, but not via legal methods. Key-coded pharmaceutical trials data are considered 'indirectly personal data'. Such information, presumably because risks to individuals are considered lower, is given less protection than fully-fledged 'personal data'. It can, for example, be transferred out of the EEA without regulatory approval.93

If codes used are not unique to individuals, but the same code number is used, for example, for all individuals in the same town, or all records for the same year, WP136 seems to consider that the risk of identification might even be eliminated, and the data rendered no longer personal. Such measures effectively involve aggregation of the data; in the examples given in WP136, aggregation by town or year respectively.94

88 Accessible only to one person or a restricted number of individuals.

89 Anderson (n 50), Ch 5.3.1.

90 WP136 19, example 17, suggests that the risks of the key being hacked or leaked should be taken into account in considering 'means likely reasonably to be used'.

91 WP136, 18-19.

92 Datenschutzgesetz 2000 (Austrian Federal Act concerning the Protection of Personal Data).

93 Kuner (n 19) ch 2.12, and Peter Fleischer, 'Austrian insights' (Peter Fleischer: Privacy...?, 22 February 2010).

94 Aggregation is discussed further below.


Key-coded data used in pharmaceutical trials may be 'personal data'; but WP136 also recognises that:95

    This does not mean, though, that any other data controller processing the same set of coded data would be processing personal data, if within the specific scheme in which those other controllers are operating re-identification is explicitly excluded and appropriate technical measures have been taken in this respect.96

In other words, according to WP136 key-coded data may be considered non-personal data in the hands of another person, where the other person is specifically not intended to identify individuals, and appropriate measures have been taken to exclude re-identification by that person.97 Furthermore, WP136 notes98 that, in such a situation, the information may not be 'personal data' in the hands of the other person if 'the identification is not supposed or expected to take place under any circumstance, and appropriate technical measures (for example cryptographic, irreversible hashing) have been put in place to prevent that from happening' - even if it is still theoretically possible to identify individuals in 'unforeseeable circumstances', for example through 'accidental matching of qualities of the data subject that reveal his/her identity' to a third party.99 This is because:

    the information processed by the original controller [and now held by another person] may not be considered to relate to identified or identifiable individuals

95 20. Emphasis added.

96 Other controllers processing the same data may be considered not to be processing 'personal data' because only the lead researcher holds the key, under an obligation of confidentiality, and the key-coding is intended to enable the researcher or authorities to identify the individual concerned if necessary for follow-up treatment or regulatory auditing, while disguising trial participants' identities from those who will receive the pseudonymised data without the key. Typically, intended recipients will include sponsors/funders of the research or, where research containing key-coded data is to be published, readers.

97 It may however be noted that, based on the 'personal data' definition in the UK Data Protection Act 1998, which differs from the DPD definition, there is currently disagreement about how personal data may be anonymised in the UK. The First-tier Tribunal (Information Rights) seems to consider that anonymised data remains 'personal data' unless the controller destroys both the key and the original personal data from which the anonymised data was derived, whereas the Information Commissioner's Office considers that anonymised data may be safely released as non-personal data to others who cannot access the key, without the controller having to destroy key and original data. The Scottish Information Commissioner appears to share the Information Commissioner's view – see n 79. The case exemplifying this stark difference of opinion, Department Of Health v Information Commissioner & The Pro Life Alliance [2009] UKIT EA/2008/0074 ('PLA'), is going before the UK Court of Appeal for decision on this very point. It should be heard during 2011. While the Scottish court in Craigdale Housing Association & Ors v The Scottish Information Commissioner [2010] CSIH 43, [2010] SLT 655 did not discuss PLA, it observed at [19] that the 'hard-line' interpretation, 'under which anonymised information could not be released unless the data controller at the same time destroyed the raw material from which the anonymisation was made (and any means of retrieving that material)', was 'hardly consistent' with recital 26.

98 20.

99 How such accidental matching could happen was not elaborated upon.


    taking account of all the means likely reasonably to be used by the controller or by any other person. Its processing may thus not be subject to the provisions of the Directive. A different matter is that for the new controller [the third party who has identified individuals through the 'accidental matching'] who has effectively gained access to the identifiable information, it will undoubtedly be considered to be 'personal data'.

The European Commission has opined100 in FAQs101 that transfer of key-coded data outside the EU to, say, the United States (without transferring or revealing the key), does not involve transfer of personal data subject to Safe Harbor principles.102 WP136 considers that it is consistent with the European Commission's opinions, on the basis that the recipient of key-coded data never knows the identity of the individuals concerned; only the EU researcher has the key.

In discussing pseudonymised data, WP136 appears to have focused mainly on changing names or other perceived unique identifiers into code numbers, ie key-coded data, rather than attempting to pseudonymise the data irreversibly by deleting direct identifiers altogether or by 1-way encrypting them. There was also relatively little discussion of aggregating data, which is often used to try to disguise identities, for example when releasing general statistics derived from research. What is the position in relation to such 'irreversibly pseudonymised' data, or data intended to be anonymised by aggregation or other means?

WP136 seems to consider103 irreversible pseudonymisation as essentially the same as anonymisation, ie resulting in 'anonymised' non-personal data. However, WP136 states further104 that whether such information is truly anonymous depends on the circumstances, looking at all means likely reasonably to be used to identify individuals. It considers this particularly pertinent to statistical information where, even though information is aggregated, the size of the original group must be taken into account. If that size is small, identification is still possible through combining the aggregated information with other information.

Deleting or irreversibly changing direct identifiers such as names still leaves untouched the other information originally associated with the direct identifier. As another illustration, if information comprises name, age, gender, postcode and pharmacological test results, and only the name is deleted or changed to a code number, information about each person's age, gender etc still remains available. Indeed, usually the purpose of the deletion or change is to enable disclosure to others of the remaining information while attempting to disguise individual identities. That purpose would be defeated if age, and, certainly in the hypothetical examples, test results, had to be deleted before the information could be disclosed to intended recipients.
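A toy sketch of that linkage risk (all records, names and sources below are invented purely for illustration): even after names are removed from the hypothetical trial data, the remaining attributes can be joined against an identified source such as a public register.

    # Pseudonymised trial data: names removed, quasi-identifiers retained.
    trial_data = [
        {"age": 47, "gender": "F", "postcode": "E1 4NS", "result": "positive"},
        {"age": 62, "gender": "M", "postcode": "OX1 2JD", "result": "negative"},
    ]

    # A separate, identified data set (e.g. an electoral roll or public register).
    public_register = [
        {"name": "Alice Smith", "age": 47, "gender": "F", "postcode": "E1 4NS"},
        {"name": "Bob Jones", "age": 62, "gender": "M", "postcode": "OX1 2JD"},
    ]

    quasi = ("age", "gender", "postcode")
    index = {tuple(p[k] for k in quasi): p["name"] for p in public_register}

    for row in trial_data:
        match = index.get(tuple(row[k] for k in quasi))
        if match:
            print(f"{match} -> {row['result']}")   # re-identified by linkage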

100 The European Commission's opinions in this area, like the A29WP's, are persuasive rather than definitive.

101 European Commission, Frequently Asked Questions relating to Transfers of Personal Data from the EU/EEA to Third Countries (2009).

102 The Safe Harbor is one way of allowing transfer of personal data from the EEA to the USA. Commission Decision 2000/520/EC OJ L 215/7, 25.8.2000.

103 18.

104 21.


However, information like age and gender can itself be identifying, especially when combined with each other and perhaps information obtainable from public or other sources.105 Research has shown106 that, with developing techniques, information is increasingly linkable, and therefore individuals are increasingly identifiable. It is possible, particularly with automated data mining methods operating quickly over huge quantities of information, to correlate, associate, combine or link and analyse different information,107 perhaps from different sources, all with reference to the same individual.108 Over time, increasingly more information can be gathered which is linkable to, and increasingly enables identification of, the same person. This is so whether the data are key-coded, irreversibly-pseudonymised, aggregated or barnardised.

To summarise, after WP136:

- Retraceably pseudonymised data such as key-coded data may be considered personal data even after pseudonymisation, and this should be borne in mind when processing such data in the cloud.

- However:
  - aggregating pseudonymised data, such as through using non-unique codes, may be effective to turn that data into 'anonymous data', with the size of the data set being one factor that should be taken into account, so that it is possible to process such aggregated pseudonymised data in the cloud free of data protection law constraints; and
  - even retraceably pseudonymised data may be considered anonymous data in the hands of another person who operates within a scheme where re-identification has been explicitly excluded and appropriate measures have been taken to prevent re-identification by that person - even if a third party could theoretically re-identify individuals through 'accidental matching'.

- Critically, whether information amounts to 'personal data', including in cloud computing, depends on the circumstances, and consideration of all means likely reasonably to be used to identify individuals - including, in relation to anonymised or pseudonymised data, the strength of the 'anti-identification' measures used.

105 For example, a person's internet search queries can identify them, especially when different queries by the same person are recorded against the same pseudonym or code number, and therefore can be combined - see the AOL search results release incident, referred to in Paul Ohm, 'Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization' (2009) University of Colorado Law Legal Studies Research Paper No 09-12.

106 See the examples and papers cited in Ohm, ibid. A Narayanan and V Shmatikov, 'Myths and fallacies of "personally identifiable information"' [2010] 53(6) Commun ACM 24 concisely summarises the current position and problems.

107 These operations and analyses may themselves employ cloud computing. See for example Michael Farber, Mike Cameron, Christopher Ellis, Josh Sullivan, Massive Data Analytics and the Cloud - A Revolution in Intelligence Analysis (Booz Allen Hamilton 2010).

108 The linking may be facilitated if a name or unique individual identifier is attached to each separate datum.


- The status of anonymisation/pseudonymisation procedures themselves is unclear, and again this uncertainty should be noted in relation to anonymising or pseudonymising data for processing, or further processing, in the cloud.

The A29WP has been criticised for 'deficient' understanding of the issues on the basis that it is the size of the data sets involved which determines the effectiveness of pseudonymisation or anonymisation procedures, rather than the quality and effectiveness of measures used including the level of encryption.109 However, in our view, by the same token the size of the data sets should not be the only determinant; it is a major factor, but the quality and effectiveness of measures are also major factors to be considered, all in the particular circumstances involved. If, for example, strong encryption is applied to the whole of the data set (discussed below), the size of the data set may not be relevant.

The DPD and WP136 certainly recognised the possibility of indirect identification through combining different pieces of information on the same individual and, as discussed above, WP136 mentioned data set size as well as the ability to link, and noted that deanonymisation techniques would improve over time. However, neither the DPD nor WP136 appears to have anticipated the pace of technological advances, and they do not deal adequately with the implications.

Research continues into re-identification methods, which are still improving.110 This reinforces the present reality that current techniques, such as removing or scrambling direct identifiers or even indirect identifiers, and/or aggregating data, may be of little effect in anonymising data irreversibly.111 Indeed, any and all information which is linkable to an individual is potentially personal data within the current DPD definition, because it can lead to their identification if combined or correlated with enough other information linked to the same individual.112 We will discuss further below possible responses to these scientific advances.

3.4 Encryption113

What then is the status of encrypted data in the cloud? Would such data constitute 'personal data'?

109 Douwe Korff, European Commission Comparative study on different approaches to new privacy challenges, in particular in the light of technological developments - Working Paper No. 2: Data protection laws in the EU: The difficulties in meeting the challenges posed by global social and technical developments (European Commission 2010), 48.

110 Even anonymisation methods involving the introduction of statistical noise, such as differential privacy, may be defeated by new re-identification techniques - see Graham Cormode, 'Individual Privacy vs Population Privacy: Learning to Attack Anonymization' (2010).

111 Although work also continues on improving anonymisation techniques - for example the Purdue project on anonymising textual data, and its impact on utility.

112 Even meteorological data collected automatically at the top of Mt Everest could be linked to the individual researcher who publishes the data.

113 Anderson (n 50).


It may be noted that the issues analysed at 3.2.1 regarding anonymisation and pseudonymisation also arise with encryption, namely whether the act of encrypting personal data is itself 'processing' which may require data subject consent or other justification for the processing and may also be subject to other Principles. The arguments there apply equally to encryption, and will not be repeated here.

Moving on to the status of encrypted data, from the previous section it can be seen that EU data protection regulators, at least collectively, take the view that whether information is 'personal data' depends on the circumstances, and in particular whether there are 'means likely reasonably to be used' (fair or foul) to re-identify individuals.

It may be recalled that three important factors affecting the security of encrypted data against decryption are the strength (eg complexity) of the encryption algorithm used; the length of the encryption key (a longer key is more likely to be secure against attacks than a shorter key); and the security of the key management, ie how securely the decryption key is stored, who has access to it, etc.

As outlined previously, users or providers may apply 1-way or 2-way cryptography to identifiers within personal data, to try to anonymise or pseudonymise personal data before storage or other processing of the data in the cloud. Alternatively, 2-way cryptography may be applied to the whole data set before storing it in the cloud. In both cases, the cryptography is usually applied by the user on the user's computer before transmission of the data to the cloud. However, the provider may also apply cryptography to data, to enable it to use or sell the anonymised or pseudonymised data (by applying 1-way or 2-way cryptography to identifiers and the like), or to enable it to store data more securely (applying 2-way cryptography to the full data set).114

1-way cryptography involves applying 1-way functions (cryptographic hash functions) to data. It is often called 'hashing'. It produces fixed-length 'hashes' or 'hash values' from the original data, and is intended to be irreversible. 2-way cryptography, often simply called 'encryption', is intended to be reversible so as to enable reconstitution of the original unencrypted data set, but only by certain persons or in certain circumstances.115 Each will now be discussed in turn.

WP136 stated116 that 'Disguising identities can also be done in a way that no reidentification is possible, e.g. by one-way cryptography, which creates in general anonymised data'. This statement suggests that data containing 1-way encrypted identifiers would not be considered 'personal data'. However, WP136 then goes on to discuss the effectiveness of the procedures involved, so in reality the key issue is the reversibility of the 1-way process. Even 1-way cryptography may be broken, and the original data reconstituted.117 The more secure the 1-way cryptography method used,

114 The encryption of data by the cloud provider after receipt is not discussed further in this paper.

115 Anderson (n 50).

116 18.

117 The SHA-1 cryptography technique was 'broken' in 2004 so that 'collisions', ie different original data with the same hash values, can be found far more quickly than with brute force attacks. X Wang, YL Yin and H Yu, 'Advances in Cryptology – CRYPTO 2005' in V Shoup (ed) Advances in Cryptology – CRYPTO 2005 (Springer Berlin / Heidelberg 2005).


the less likely it is that the information will be considered 'personal data'. If the particular cryptography technique employed to 'anonymise' data is cracked, to maintain 'non-personal data' status the data may require re-encryption using an even more secure method.118

WP136 suggests119 that if the controller has effectively tried its best to anonymise the information, taking appropriate technical measures to prevent re-identification, such as irreversible hashing, the information may be treated as anonymous data in the hands of a third party, even if re-identification is theoretically still possible from 'accidental matching'.120 However, as previously discussed, irreversibly hashing direct identifiers cannot stop others from being able to identify individuals using indirect identifiers or other remaining information in the data set, or perhaps information from other sources. Thus, personal data where identifiers have been deleted or 1-way hashed may, after taking into account such 'means likely reasonably to be used', still be 'personal data'. In that case, their storage or use by the cloud provider would be subject to the Principles.

WP136's discussion of cryptography focused mainly on 1-way cryptographic hashing,121 in particular scrambling direct identifiers like names, supposedly irreversibly, to anonymise or pseudonymise personal data. However, often a cloud user wants to store data in the cloud for future retrieval and use, but to ensure the security and confidentiality of the stored data, will first encrypt the full data set, and not just one component of it such as names, using reversible 2-way encryption. The user can download and read the encrypted data using the controller's secret decryption key, but no one else is intended to be able to decipher it. In such a case, as far as the user is concerned, the data remain 'personal data'.
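The point that the practical question is reversibility, rather than the '1-way' label, can be shown with a short hypothetical sketch (the names below are invented). Where the space of possible identifiers is small or guessable, a hash of a direct identifier can often be matched by simple enumeration, without 'reversing' the hash function at all.

    import hashlib

    # The 'anonymised' value: a SHA-256 hash of a direct identifier.
    target = hashlib.sha256(b"Alice Smith").hexdigest()

    # If plausible identifiers can be enumerated (an electoral roll, a staff
    # list, a dictionary of common names), each candidate is simply hashed
    # and compared.
    candidates = ["Ann Brown", "Alice Smith", "Bob Jones"]
    for name in candidates:
        if hashlib.sha256(name.encode()).hexdigest() == target:
            print("re-identified:", name)
            break

    # A keyed hash (eg HMAC with a secret key) resists this kind of
    # enumeration, but the key holder is then in a position analogous to
    # that of the holder of key-coded data.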

Cloud computing, in the form of Amazon's then newly-introduced EC2 GPU cluster instances, was used in November 2010 by a security blogger to find 14 passwords of 1-6 characters long from their SHA-1 hashes. He succeeded in less than 1 hour at a cost of under US $2. Thomas Roth, 'Cracking Passwords In The Cloud: Amazon's New EC2 GPU Instances' (Stacksmashing.net, 15 November 2010).

118 At least, within a reasonable period of time – not necessarily immediately. See n 57 and associated text.

119 See the paragraphs containing nn 95-102.

120 Whatever that may involve - 'accidental' re-identification seems less likely with data anonymised through 1-way cryptography.

121 As an example of 1-way cryptography, passwords are often 1-way 'hashed'. A 1-way cryptographic 'hash function' is applied to the password, and the resulting hash stored. A password later entered by users is similarly hashed using the same hash function, and the hash value compared with the stored hash to check that the correct password was entered. This avoids the service provider storing passwords themselves, which would not be very secure. Comparing hashes also enables checking of integrity, ie that there has been no change to or corruption of the original data. Hashes, also known as message digests, are generally smaller than the original data (but not always). A hash can be transmitted and/or stored along with the original data. If integrity has been compromised, the hashes would be different. WP136 discusses the use of 1-way cryptography, not for password authentication or integrity checking, but to scramble irreversibly identifiers like names within a larger data set.


What is the position of cloud providers? The 'personal data/not personal data' debate has focused mainly on anonymisation or pseudonymisation of parts of a data set. However, we suggest that it applies equally to 2-way encryption of the full data set prior to its storage in the cloud. This may benefit cloud providers by arguably removing them from the scope of the data protection legislation altogether, at least where the encryption has been performed to industry standards and best practices by the controller prior to transmission of the encrypted data to the provider, and the provider has no access to the decryption key.

An analogy may122 be drawn with key-coded clinical trials data, summarised above.123 In the hands of one who encrypts the data and holds the encryption (or, strictly, decryption) key, the information is likely to remain 'personal data'. However, in the hands of another person, such as a cloud provider storing encrypted data on its servers, who has no access to the key and no means 'reasonably likely' to be used to decrypt the data,124 under WP136 the information may be considered anonymous data.125 This might be the case, for example, where a SaaS provider uses a PaaS or IaaS provider's infrastructure to provide its own services. Even if the SaaS provider has the key, so that it must treat the information as 'personal data', the PaaS or IaaS provider may not have the key. In SaaS involving provision of mere data storage, where the user encrypted the data before transmission, the SaaS provider itself may not have access to the key.

With key-coded data, only names and the like are pseudonymised or anonymised; other information in the data set remains accessible. However, where the entire data set is encrypted, dataset size should not be an issue because, unlike with key-coded data, no part of the remaining data is available 'in the clear' to be used as linkable indirectly-identifying information. Encrypted data, completely transformed into another form, is very different in nature from, and arguably poses fewer risks than, key-coded data or even aggregated data. Thus there is a stronger argument than with key-coded or aggregated data that fully-encrypted data are no longer 'personal data'.

The key question here is, again, how secure are the encrypted data against decryption by unauthorised persons? The stronger the measures taken to prevent re-identification, the more likely that originally-personal data, to which such measures have been applied, will be considered anonymous data, based on WP136. Again, the strength of the encryption will be a major factor, along with the effectiveness of other technical or organisational measures such as access restrictions and key management. If personal data were not encrypted at all before uploading for storage in the cloud, or only weakly encrypted, or if the decryption key was not stored securely or was accessible to too many people, then the data stored might be 'personal data', and the cloud provider might be a 'processor'.126 In contrast,

122 Ignoring the UK difficulties for now, pending the Court of Appeal PLA decision, n 97.

123 See paragraphs containing nn 95-102.

124 Bearing in mind that cloud computing processing power may itself increasingly be 'means reasonably likely' to be utilised to decrypt data! See n 123 and 'Cracking Passwords in the Cloud: Breaking PGP on EC2 with EDPR' (Electric Alchemy, 30 October 2009).

125 See paragraphs containing nn 95-102.

126 Assuming for present purposes that a cloud provider storing personal data 'processes' the data on behalf of its cloud user customer, and so is a 'processor'. That issue is discussed in detail in part 2 of this CLP data protection series: W Kuan Hon, Christopher Millard and Ian


had the personal data been encrypted strongly before being transmitted to the provider, provided the key was securely managed, the stored data would be unlikely to be considered 'personal data' in the hands of the provider.

However, why should whether or not the user decides to encrypt the data,127 and how effective their chosen encryption method or other security measures are, determine whether encrypted data hosted by the cloud provider is 'personal data' or not? Generally, 'pure' cloud storage providers128 cannot control the form in which their user chooses to upload the user's data to the cloud.129 Nor would the provider necessarily know the nature of data the user intends to store, hence the 'cloud of unknowing' used in the title of this paper. Yet the status of data stored with a cloud provider, which affects the status of the provider as 'processor' (or not) of data stored by its users,130 will vary with each user's decisions and actions - which may differ for different users, and may even differ for the same user storing different kinds of data, or the same data at different times. This position seems unsatisfactory.

In relation to the knowledge (or not) of the cloud provider, it is interesting to note an Italian judgment issued in 2010. Even in the case of a SaaS provider who was considered more than a pure passive storage provider, the court stated that, for criminal data protection liability to be imposed on a data host for failing to monitor or pre-screen uploaded sensitive data, there had to be knowledge and will on its part.131

What about data in the course of transmission, also known as 'data in flight' or 'data in motion'? The connection or channel for transmitting unencrypted or encrypted data may itself be unencrypted or encrypted.132 Normally, whether a transmission is encrypted or

Walden, 'Who is Responsible for 'Personal Data' in Cloud Computing? The Cloud of Unknowing, Part 2' (2011) <http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1794130>.

127 There are disadvantages to storing data in encrypted form, such as the inability to index and therefore search the stored data. These may lead some users to prefer not to encrypt the data. However, searchable encryption is being investigated and may someday become feasible – see Kamara and Lauter, n 57.

128 Ie providers of IaaS or PaaS involving storage by the user of data on the provider's infrastructure, or providers of SaaS in the form of 'storage as a service' where the service provided to the user is limited to that of data storage and tools to upload, download and manage the stored data.

129 Although they can control it if they build it into their systems. The procedures of some providers such as Mozy may involve the data being encrypted on the user's local computer, using a key of the user's choice, before transfer to Mozy's servers via encrypted transmissions - for consumer users as well as organisations. Mozy, Inc, 'Is MozyHome secure?' (2010) and 'Is MozyEnterprise secure?' (2010).

130 The status of 'storage' cloud providers will be discussed in more detail in part 2 of this CLP data protection series – see n 126.

131 Liability was, however, imposed for other reasons, and the judgment is being appealed. Decision of Judge Oscar Magi, Milan, Sentenza n 1972.2010, Tribunale Ordinario di Milano in composizione monocratica 92-96.

132 The TLS protocol, formerly called SSL, is used to secure transmission of, for example, credit card details between web browser and remote server (https in the browser address bar).


unencrypted depends on how the provider has set up its systems, so this aspect is generally within the control of the provider. Different approaches could be taken towards data encrypted in transmission and data encrypted for longer-term storage.133 The transmission duration, and therefore available window for possible interception, may be relatively short. Therefore perhaps it can be argued that encryption of 'data in motion' needs to be less powerful in order to be considered secure enough to make the transmitted information non-personal, in most cases.134 In contrast, for data held in persistent storage,135 it would seem that stronger encryption by the user is likely to be required before the information may be considered non-personal in cloud storage.136 If the user transmits unencrypted personal data, even using a secure connection, the personal data will still be received as such by the storage provider.

In this connection, it is interesting to consider the DPD art 25(1) restriction on transfers of personal data to 'third countries' outside the EEA. Such transfers are prohibited unless the country concerned ensures an 'adequate level of protection'. Art 25(2) requires adequacy to be assessed in light of all circumstances surrounding the data transfer operation or set of operations, giving particular consideration to specified matters including the 'duration of the proposed processing operation or operations'. This implies that where personal data are only transferred for a short time, and presumably therefore only exposed to the third country for a short time, the risks are lower, and less stringent protective measures may be considered adequate when they might not be adequate for data processed for a longer period within the third country. Similarly it could be argued that information which is exposed unencrypted for only a relatively short period should not lose 'anonymous' status simply because of transient processing operations on unencrypted data.

A further issue should be noted. Suppose that military-grade encryption and other security measures are applied in relation to data transmitted for storage, such that the stored data can be considered non-'personal data' in the hands of the storage provider. Should that stored data later need to be operated on by application software, for example to sort or analyse the data, the data must be decrypted first.137 Research continues on secure computation,138 whereby operations on encrypted data can take

Anderson (n 50), Ch 21.4.5.6. Connections may also be secured using virtual private networks (VPNs). For instance, IaaS provider Amazon offers VPN connections to transfer data between 'Virtual Private Clouds' on its infrastructure to and from its users' own data centers.

133 Recall that 'processing' is very wide and includes transmission and storage of personal data as well as collecting it and performing operations on it - art 2(b).

134 Where, however, transmissions from or to a particular source or destination are actively monitored and captured, the brevity of each transmission operation of course may not then matter.

135 Where, therefore, the potential attack window will be longer.

136 See n 74 on the expected duration of the storage also being a factor affecting how strong the encryption should be.

137 Mowbray (n 45), 135-6.

138 For example N Smart and F Vercauteren, 'Fully Homomorphic Encryption with Relatively Small Key and Ciphertext Sizes' in Phong Q. Nguyen and David Pointcheval (eds), Public Key Cryptography – PKC 2010, 420 (Springer Berlin/ Heidelberg 2010).


place securely, for example inside encrypted 'containers', and there are hopes that it will one day be practicable.139 However, currently such operations would take an unfeasibly long time to complete.140

Thus, at present, if a cloud user wishes to run application software to operate on (originally personal) data it has stored encrypted in the cloud, that requires decrypting the stored data first, then operating on the decrypted personal data. The operation therefore involves the user in processing 'personal data'.141 If the user has downloaded the data to a local computer before decrypting it, the cloud provider would not be involved in the local processing of the decrypted personal data, but the user would lose the benefit of cloud processing power. If however the user runs cloud application software on the provider's servers to operate on the decrypted data, that may possibly make the cloud provider a 'processor'.142 However, as with transmissions of data, such operations may be relatively transient in nature; the information may remain encrypted for far longer than it is in decrypted 'personal data' form. Must all the Principles nevertheless be applied in full force to those operations, or would it make sense for a more limited application in that situation?

In summary, it seems the best way to try to ensure information stored in the cloud is not considered 'personal data' in the hands of the cloud provider is to encrypt it strongly before transmitting it for storage, also taking other technical and organisational measures such as securely storing the decryption key and strictly limiting access to it. However, the matter is not free from doubt, and even personal data previously encrypted for storage as anonymous data must be decrypted in order to be operated on, so those operations will then be on 'personal data'.

Practical concerns clearly arise from the uncertainty regarding whether, and in what circumstances, encrypted data would be considered 'personal data', and the extent to which the user's own encryption decisions affect the status of data in the cloud. It would therefore be helpful if the forthcoming revised DPD were to clarify the position in relation to encrypted data and encryption procedures.143
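By way of a hedged illustration of the 'encrypt before transmission' approach discussed above - a sketch only, using the third-party Python 'cryptography' package and an invented upload function rather than any particular provider's API - the user can generate and retain the key locally so that the provider only ever receives opaque ciphertext:

    # Minimal sketch, assuming 'pip install cryptography'; upload_to_cloud()
    # is a hypothetical placeholder, not a real provider API.
    from cryptography.fernet import Fernet

    def upload_to_cloud(blob: bytes) -> None:
        print(f"storing {len(blob)} opaque bytes with the provider")

    key = Fernet.generate_key()   # generated and retained by the user only
    f = Fernet(key)

    plaintext = b'{"name": "Alice Smith", "result": "positive"}'
    ciphertext = f.encrypt(plaintext)   # symmetric, authenticated encryption

    upload_to_cloud(ciphertext)         # provider never sees key or plaintext

    # Only the key holder can later recover the personal data:
    assert f.decrypt(ciphertext) == plaintext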

139 R Chow and others, 'Controlling data in the cloud: outsourcing computation without outsourcing control' in Radu Sion and Dawn Song (chairs), Proceedings of the 2009 ACM workshop on Cloud computing security (ACM, Chicago, Illinois, USA 2009). Other possible approaches to improve secure processing of data in the cloud include data obfuscation – see Miranda Mowbray, Siani Pearson and Yun Shen, 'Enhancing privacy in cloud computing via policy-based obfuscation' The Journal of Supercomputing (online 31 March 2010) DOI: 10.1007/s11227-010-0425-z.

140 Daniele Catteddu and Giles Hogben, Cloud Computing - Benefits, risks and recommendations for information security (European Network and Information Security Agency, 2009) 55, V10.

141 Even if information is 'personal data' ephemerally, only while being operated on in decrypted form, and is then encrypted and saved in permanent storage in encrypted form. The DPD does not exempt data which is 'personal data' for only a short length of time, but it could be argued based on WP136 that the risks for data subjects are lower in that situation, as with data in flight.

142 Subject to arguments we will make in part 2 of this series of papers – see n 126.

143 See, for example, the submission by an IaaS provider in Rackspace US, Inc., International transfer of personal data (Consultation Paper on the Legal Framework for the Fundamental Right to Protection of Personal Data) (2009).


3.5 Sharding

We now consider the detailed mechanics of cloud data storage and its implications. In this section 3.5 and section 3.6, we do not deal with securely-encrypted data, but assume the storage discussed is of data which the cloud user has not encrypted, or has secured only weakly. This is because, as argued above, personal data stored in the cloud in securely-encrypted form ought in any event to be treated as anonymous data in the hands of a provider who has no access to the decryption key.

Data may be stored in 'plaintext' unencrypted form in IaaS or PaaS as well as SaaS.144 Indeed, in PaaS or SaaS involving cloud-based applications, beyond simple storage, it may be more common for stored data to be unencrypted, or for the provider to have access to the user's keys or its own secondary decryption keys. This is to enable related value-added services from the provider, such as indexing and full-text searching of data, retention management or data format conversion, which would not be possible with encrypted data.145

Cloud computing typically uses virtual machines ('VMs') as 'virtual servers'. VMs are simulations of physical computers, but exist only in the RAM of the physical server hosting the virtual machines. Terminating a VM instance which is no longer needed is like shutting down a physical computer - if the data operated on using the VM had not been saved to persistent storage146 before the VM's termination or indeed failure, the data may be lost. Depending on the cloud service, even where VM instances appear to have attached storage, 'stored' data may vanish on 'taking down' or terminating the instance, unless the data had been deliberately saved to persistent storage first.

Cloud providers often allow data to be stored in their facilities in persistent storage. Data may be saved on the cloud provider's non-volatile storage equipment in databases or distributed filesystems which can store files over different hardware units or 'nodes', to ensure the data's availability for future retrieval and use even after the VM instance operating on the data is terminated.147

144 This obviously includes SaaS in the form of 'storage as a service'. However, it may also include other types of SaaS like webmail where, as well as providing cloud application software, in this case email client software, the provider also stores the data used in relation to that application, such as in this example emails and contacts data.

145 Tim Mather, Subra Kumaraswamy, Shahed Latif, Cloud Security and Privacy: An Enterprise Perspective on Risks and Compliance (O'Reilly 2009), 62.

146 For example, ultimately one or more hard disk drives, or SSD flash storage. Stephen Lawson, 'The cloud could be a boon for flash storage' (Computerworld, 24 August 2009).

147 Storage systems include Amazon's S3, SimpleDB or Elastic Block Storage, and Windows Azure's SQL Azure, blob or table storage or XDrive. As an example, social networking site Facebook uses the open source Hadoop framework for large scale parallel processing using a distributed file system and the MapReduce programming model (on which see further n 149 below). Joydeep Sen Sarma, 'Hadoop' (Facebook Engineering's notes, 4 June 2008). For a concise summary of the


In order to 'scale out' flexibly, adding more (often commodity-priced) physical computers as and when additional processing power or storage space is needed, cloud providers may employ not only server virtualisation but also 'storage virtualisation', where the stored data appears as one simple logical unit (or several) to the user, while behind the scenes the provider deals automatically with the details of exactly where to store the data physically. Providers may use distributed file systems along with distributed relational databases such as the open source MySQL and/or distributed non-relational 'NoSQL' databases.148 With distributed data storage, data may be spread out amongst different hardware, even in different geographical locations. Also, to maximise efficient use of resources and reduce costs, stored data may at different times be automatically moved between different storage hardware in different physical locations: a well-known feature of cloud computing.

In cloud computing, it is not just storage of data which may be distributed. Operations on data may also be distributed. Such operations are often split up into smaller sub-operations. These sub-operations are run in parallel simultaneously within different nodes, again perhaps in different locations, each processing a different chunk or sub-set of the data set being worked on, and the results of the sub-operations are then combined and sent to the user.149 For backup and redundancy as well as performance, availability and tolerance to any failure of individual nodes, stored data are normally 'replicated' automatically in two150 or three151 different data centres in different locations.
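A toy sketch of fragmentation with replication may make the storage pattern just described more concrete (the node names, chunk size and replication scheme below are invented for illustration; real providers' schemes differ and are generally not public):

    from collections import defaultdict

    SHARD_SIZE = 8        # bytes; real systems use far larger chunks
    REPLICAS = 2          # each shard copied to two simulated 'data centres'
    NODES = ["eu-west", "eu-north", "us-east"]

    storage = defaultdict(dict)   # node name -> {shard id: shard bytes}

    def store(blob: bytes) -> list:
        shard_ids = []
        for n, i in enumerate(range(0, len(blob), SHARD_SIZE)):
            shard_id = f"shard-{n}"
            shard = blob[i:i + SHARD_SIZE]
            for r in range(REPLICAS):                 # round-robin replication
                storage[NODES[(n + r) % len(NODES)]][shard_id] = shard
            shard_ids.append(shard_id)
        return shard_ids

    def retrieve(shard_ids: list) -> bytes:
        # 'reunification': read any available replica of each shard, in order
        return b"".join(
            next(storage[node][sid] for node in NODES if sid in storage[node])
            for sid in shard_ids
        )

    ids = store(b"name=Alice Smith;result=positive")
    assert retrieve(ids) == b"name=Alice Smith;result=positive"
    # Note that the first shard happens to contain b"name=Ali": whether a
    # given fragment includes identifying content depends on where the
    # split falls - a point taken up in the discussion of shards below.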

software infrastructure used by Facebook, see Sean Michael Kerner, 'Inside Facebook's Open Source Infrastructure' (Developer.com 22 July 2010). On Windows Azure see Brad Calder, 'Windows Azure Storage Architecture Overview' (Windows Azure Storage Team Blog, 30 December 2010).

148 For example Facebook uses the relational database MySQL and also the open source NoSQL database Cassandra – Kerner, ibid. NoSQL databases do not fit into the more traditional and hitherto ubiquitous relational SQL database model, but they do scale easily. Other examples of NoSQL databases are Google's BigTable, Amazon's Dynamo and the open source HBase. Fay Chang and others, 'Bigtable: A Distributed Storage System for Structured Data' [2008] 26 ACM Transactions on Computer Systems (TOCS) 4:1; Giuseppe DeCandia and others, 'Dynamo: Amazon's Highly Available Key-value Store', ACM SIGOPS Operating Systems Review (SOSP '07 ACM, New York, NY, USA 2007).

149 The MapReduce programming model is employed by several cloud providers for the efficient distributed processing of data. Jeffrey Dean and Sanjay Ghemawat, 'MapReduce: simplified data processing on large clusters' [2008] Comms ACM 51(1) 107. MapReduce, first developed within Google, is implemented for example in Hadoop (n 147). Amazon's Elastic MapReduce service for distributed data processing uses Apache Hadoop. Amazon, 'Amazon Elastic MapReduce FAQs' (2011).

150 Google Apps – Rajen Sheth, 'Disaster Recovery by Google' (Official Google Enterprise Blog, 4 March 2010).

151 Microsoft Windows Azure – Microsoft, 'Introduction to the Windows Azure Platform' (MSDN Library).


'Sharding', sometimes called 'partitioning', involves 'splitting a data set into smaller fragments ('shards') and distributing them across a large number of machines'.152 Sharding policies, which dictate how to shard the data, may vary with space constraints and performance considerations.153 Applications' requests for operations on the data set are automatically sent to some or all servers hosting relevant shards, and the results are coalesced by the application. Sharding also helps availability because it is faster to retrieve small data fragments than larger ones, thus improving response times. While the term 'sharding' is most commonly applied to the fragmentation of databases, data which are not part of a structured database may also be split up into chunks or fragments for storage or operations.154 This paper uses 'sharding' and 'shards' broadly, to refer to fragmentation of data in whatever form stored or operated upon.

If unencrypted personal data uploaded to the cloud are automatically split up155 for distributed storage, it seems clear that a cloud user who uploads the personal data, and perhaps runs applications in the cloud which retrieve and operate on that data and return the coalesced results to the user, has to be treated as processing personal data. However, do the separate shards held on the provider's equipment in distributed storage constitute 'personal data' in the hands of the provider?

This situation seems similar to that of key-coded data discussed above, where the researcher has the key, but no one else does, and where, if adequate measures against re-identification have been taken, then the data may be considered non-'personal data' in the hands of another.156 Here, much will depend on the methods of sharding or replication used, and on the precautionary measures taken to prevent 'reunification' of the shards or other access to the reunified data by anyone other than authorised users.

A fuller shard may well contain sufficient information to enable identification of an individual, such that its access or replication involves 'processing' personal data. A tiny shard may be considered anonymous data if it does not contain sufficient identifying information. However, it should be borne in mind that it is not the size of the shard which matters, but its content. Even a tiny shard may contain personal data. The Google Streetview incident illustrates that fragmentary data may sometimes contain personal data.157 Such data, when combined with geolocation to find out from whose home the

152 LA Barroso and U Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines in Mark D Hill (ed), Synthesis Lectures on Computer Architecture (Morgan & Claypool 2010) 1.

153 For example, to equalise workload across different servers and/or data centres. Ibid.

154 For example, data stored using Amazon's Elastic Block Store ('EBS'), which to the Amazon user appears as an emulation of a physical hard drive, is, at least before a snapshot of an EBS is stored on Amazon's S3 storage service, first broken into chunks, whose size depends on Amazon's optimisations. Amazon, Amazon Elastic Block Store (EBS) (2011).

155 The automated sharding procedure, as with anonymisation and pseudonymisation, may itself be considered 'processing', as discussed at 3.3.1 above. However, at least where the shards are each too small to contain personal data, it could be argued that, just as anonymisation ought to be permitted or indeed encouraged under data protection laws, so sharding into non-personally identifying fragments ought to be permissible.

156 See text to nn 95-102.

157 Internet corporation Google sent out vehicles to travel the streets of many cities and towns globally, to collect data for its Street View online mapping service. It was discovered in


data originated, together with the ability to correlate building location with residents' identities, might well enable a username and password to be associated with the individual concerned. Different cloud providers have different sharding policies, so it may be difficult to analyse the position further without detailed knowledge of their exact policies.158

A concern sometimes expressed regarding cloud computing is that data stored 'in the cloud' may be seized or accessed by local law enforcement authorities in the jurisdiction where the storage equipment is located.159 However, if someone other than the cloud user should physically seize the cloud provider's hard drive or other equipment containing a shard of that user's data, would they thereby retrieve any personal data stored by the user? Not if only an incomprehensible shard is contained in the seized equipment, and they cannot access other equipment holding the remaining shards, for example because the other equipment is located in a different jurisdiction. It depends on how the shards have been stored; often they may be stored in the same jurisdiction, but, depending on the provider's policies and practices, they may not be.160

Finally, if a user runs cloud applications to operate on data, distributed processing may be automatically employed, whereby operations on data are split up and may operate on different shards, running on different servers perhaps in different locations. In the course of a cloud application's operations on data, shards may also be stored in the provider's equipment, albeit usually ephemerally – irrespective of whether the user intends the

May 2010 that the vehicles had also captured data transmitted wirelessly over open (non password-protected) wi-fi networks, such as some consumers' home computer networks. That data included 'payload' data, ie the content of transmitted data, as well as addressing information. Investigations by various data protection authorities found that 'while most of the [captured] data is fragmentary, in some instances entire emails and URLs were captured, as well as passwords'. Alan Eustace, 'Creating stronger privacy controls inside Google' (Official Google Blog 22 October 2010).

158 An interesting multi-layered approach involving both encryption and sharding is taken by cloud provider Symform, which offers distributed storage of data on its users' own computers. Users' files are split into smaller blocks, and each block is encrypted. The encrypted blocks are then further fragmented before being distributed for storage. Even for very small files contained within a single data fragment: '[T]he information about which file is associated with which data fragment and where that data fragment is located is stored separate from the data fragment itself – in Symform's cloud control. So, an attacker would have to identify a file and break into Symform to find out where its fragments are located. After this, they would have to actually have access to at least one of those fragments to be able reconstruct the encrypted contents. Last, and certainly not least, they would have to break the 256-AES encryption.' Symform, 'How Symform Processes and Stores Data' (Symform).

159 Law enforcement issues in cloud computing are discussed in detail in another CLP paper: Ian Walden, 'Law Enforcement Access in a Cloud Environment' (2011) Queen Mary School of Law Legal Studies Research Paper No. 72/2011.

160 It is more likely that the authorities of a jurisdiction to which the provider is subject may require the provider to give the authorities access to data stored using the provider's services, regardless of where its data centres are located (even perhaps in another jurisdiction). The provider's ability to access the data is discussed further at 3.6 below.


original or resulting data to be permanently stored in the cloud. Similar issues would thus arise regarding whether individual shards operated on or temporarily stored are large enough to include personal data, and whether, as with transient operations on decrypted data in the cloud, such temporary operations or storage merits the full application of all the Principles.

In summary, the position in relation to storing or operating on data in shards in the cloud is unclear, and detailed operational aspects, which may vary with the cloud service and/or the user, may determine whether the stored information is 'personal data' or not. Again, this does not seem satisfactory. More transparency by cloud providers as to their sharding and other operational procedures would help to inform the debate.

3.6 The provider's ability to access stored data

With securely-encrypted data, we have argued that it is not possible to identify individuals, and therefore such data should not be considered 'personal data'. What about data stored unencrypted or only weakly encrypted in the cloud? Even with personal data stored unencrypted in the cloud, re-identification of individuals through the stored data can only be achieved by someone who can access the reunified shards of the data in intelligible form, typically by logging into the user's account with the provider.161

Now recall that, under WP136, 'anonymised' data may be considered anonymous in the hands of the provider if 'within the specific scheme in which those other controllers [eg cloud providers] are operating re-identification is explicitly excluded and appropriate technical measures have been taken in this respect'.162 We suggest, based on this rationale, that if you cannot view or read certain data, you cannot identify the data subjects, and therefore identification may be excluded by excluding others from being able to view or read the data. The clearest scenario which excludes others from viewing data, which we have already discussed, is where data are strongly encrypted.

However, what if the cloud user is the only person who can access the reunified shards of their stored data through logging into their account with the cloud provider, and the provider's scheme excludes anyone else from being able to access the account and thence the re-unified data? In that situation, it is arguable that the data may be 'personal data' as far as the user is concerned, but not in relation to anyone else. This may perhaps be arguable even with data stored unencrypted, in cases where the data are accessible only through the user's account. In other words, the effectiveness of measures to prevent persons other than the user from accessing the user's stored personal data may affect whether the data are 'personal data' as regards those persons.

Therefore, it seems that a key factor will be the effectiveness of the cloud provider's access control system, which typically only allows authenticated and authorised cloud users to use a particular cloud account. By logging into their cloud account, the user gains the ability to access and operate on the full set of any personal data stored. That does not, however, mean that others can access the set.163 The more effective and restrictive the access control measures, the

161 Leaving aside for now the possibility that individual shards may contain personal data.

162 See text to nn 99-102.

163 Subject to the 'back door' issue discussed at 3.6 below.


more likely it is that re-identification by others will be excluded, and therefore that the stored data will not constitute 'personal data'.164

Another key factor is the existence of any 'back doors' allowing the provider to sign in to users' accounts or otherwise access users' re-unified data. One advantage of cloud computing is that log-in software and (in the case of SaaS and PaaS) application software is usually maintained and updated automatically by the provider, without any user involvement. While convenient for users, this feature also means that many providers can, unbeknownst to users, build in and use back doors - or even introduce back doors at the behest of law enforcement or other authorities. While seizure of storage equipment hardware containing fragmentary personal data is theoretically possible, as discussed above, it may be that undesired access to data stored in the cloud is more likely to occur in practice through the cloud provider's ability to access,165 or to allow third parties166 to access, re-unified shards of stored data by accessing users' accounts. Currently, many providers expressly reserve the right in their contract terms to monitor their users' data and/or usage of the data.167

Where a cloud provider has an individual user as its customer, it may well be acting as controller of any personal data which it collects relating to the customer. This may include personal data provided by the customer in the sign-up process as well as, for example, metadata168 generated relating to the customer's ongoing usage of the service. However, in this paper we do not discuss any further the user-related personal data monitored by the provider. We consider only any personal data, perhaps relating to other individuals, processed by the cloud user using the provider's facilities. What is the significance of providers' reservation of the right to monitor such data?
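To make the access-control point concrete, a deliberately simplified sketch (the account, password handling and 'back door' flag below are hypothetical, and not modelled on any provider's system): whether identification is 'excluded' turns on who is able to cause the stored data to be returned in intelligible form, not on where the data sit.

    import hashlib
    import hmac

    # Hypothetical account and stored (unencrypted) user data. Real systems
    # would use a salted, deliberately slow password hash rather than plain SHA-256.
    ACCOUNTS = {"alice": hashlib.sha256(b"correct horse battery staple").hexdigest()}
    STORE = {"alice": b"name=Alice Smith;result=positive"}

    def authenticate(user: str, password: str) -> bool:
        digest = hashlib.sha256(password.encode()).hexdigest()
        return user in ACCOUNTS and hmac.compare_digest(digest, ACCOUNTS[user])

    def read_data(user: str, password: str, back_door: bool = False) -> bytes:
        if back_door:
            # A provider-side capability to bypass the user's login: if such a
            # capability exists at all, identification is arguably not
            # 'excluded' within the provider's scheme.
            return STORE[user]
        if not authenticate(user, password):
            raise PermissionError("access denied")
        return STORE[user]

    print(read_data("alice", "correct horse battery staple"))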

164 The strength of the login password chosen by the user may also affect the effectiveness of access control measures. Here, again, something within the control of the user rather than the provider, ie the selection of the user's password, contributes towards determining whether the information held by the provider is 'personal data'.

165 Site Reliability Engineers at Google apparently had 'unfettered access to users' accounts for the services they oversee'. In late 2010 one such engineer was dismissed for accessing minors' Google accounts without their consent, including call logs and contact details from internet phone service Google Voice, instant messaging contact lists and chat transcripts. Adrian Chen, 'GCreep: Google Engineer Stalked Teens, Spied on Chats (Updated)' (Gawker 14 September 2010). See also, regarding allegedly 'universal access' by employees to users' accounts on social networking site Facebook, Soghoian n 57, fn 99.

166 Such as law enforcement or other official authorities in the jurisdiction where the provider is incorporated, or private parties under court judgments made against it in that jurisdiction or elsewhere. See Soghoian n 57. See also Google's Government Requests Report – David Drummond, 'Greater transparency around government requests' (Google Public Policy Blog, 20 April 2010). A recent example is the US Justice Department's subpoena against US micro-blogging SaaS service Twitter for private messages sent or received using Twitter by Icelandic Member of Parliament Birgitta Jonsdottir. Dominic Rushe, 'Icelandic MP fights US demand for her Twitter account details' (Guardian, London, 8 January 2011).

167 CLP Contracts paper (n 28) s 4.11, 30.

168 On metadata generated by cloud providers, see Reed (n 7), 9.


It may be noted that having the legal right and technical ability to access such data is not the same as actually accessing the data. Is there a difference? Should there be? Unless and until the provider accesses the data, it may not even know that the data are 'personal data'. It might be thought that a provider who restricts access to very few employees, allows it only in tightly-controlled circumstances and to a very limited extent, and who takes other technical and organisational measures such as regularly checking logs of such accesses, ought to be exposed to fewer liabilities than a provider who, say, allows all employees access to all users' data at any time for whatever purposes any employee wishes.

However, cloud providers face a difficulty if they enable any kind of internal or external access to users' accounts or data - even where access is strictly controlled, is restricted only to that essential for maintenance and proper provision of their cloud service, and is audited frequently. This is because WP136 does not seem to envisage limited re-identification where it is incidental to the cloud provider accessing the data in order to deal with service issues, for instance, and is not for the purpose of identifying data subjects. There is a blanket requirement that re-identification must be excluded altogether before the data may be considered non-'personal data' in the hands of the provider. Thus, even if the provider never actually accesses users' unencrypted stored personal data, it would seem that if it simply has the capability to access the data in practice, identification is not excluded by its scheme, and the data would be 'personal data' in its hands.

Many SaaS services, such as social networking sites or photo sharing sites, comprise more than 'pure' storage facilities, or involve the provider having access to stored unencrypted personal data (for example to run advertisements against the content), and/or expose the personal data to widespread or even public view as part of the service or the business model.169 In those cases, excluding identification may be impossible. This is also the case with IaaS, PaaS or SaaS 'passive' storage services where, although stored unencrypted data are intended to be non-public and accessible only to the user, in order to investigate a problem with the user's account the provider's engineers may need to have the ability to view the stored data, and accordingly may have sight of any identifying information in the data.170 The same also applies if shards of personal data remain temporarily on the provider's equipment pending being overwritten automatically by other data. This might happen when a user decides to delete data or terminates an account, at least where the provider is able to read the shards marked for deletion.171

Therefore, under current data protection laws, it seems some data stored by all these kinds of cloud services may have to be treated as 'personal data'. This result seems inevitable because of WP136's focus on preventing identification rather than on assessing risks to individuals' privacy in the particular circumstances. However, there may be policy reasons for encouraging the development of cloud

169 CLP Contracts paper (n 28) s 4.9, 27.

170 The same would apply to storage services which offer indexing and searching facilities or other value-added services requiring the provider or its software to be able to access the data; see n 145.

171 CLP Contracts paper (n 28) s 4.8, 23. If the provider does not promptly delete such data, including duplicates, and the retained data include personal data, would that affect the provider's status?


infrastructure service offerings, and for recognising the more neutral position of infrastructure providers. More flexible application of data protection laws may be more appropriate in some cloud computing situations, with fewer requirements perhaps being made applicable to infrastructure or passive storage providers than to SaaS providers who actively encourage the processing of personal data through their services. We discuss these issues in more detail in a paper on controllers and processors in this CLP series (see n 126).

Even with obviously 'personal' data stored unencrypted and generally accessible in the cloud, one twist merits exploration. The DPD's prohibition on processing sensitive data does not apply where 'the processing relates to data which are manifestly made public by the data subject...'.172 In such cases, the sensitive data may be processed without the data subject's explicit consent. The making public of the data by the data subject is, in itself, enough to provide a justification for further processing by others without breaching the ban on processing of sensitive data. However, there is no such provision in relation to non-sensitive 'personal data', because it is accepted that general consent under the DPD can be implied through the actions of a data subject. The data nevertheless remain subject to the protections of the DPD relating to 'personal data' generally, so that while explicit consent or some other justification for processing sensitive data may no longer be necessary, consent or another justification for processing the personal data is still needed. It might even be argued that personal data which are not sensitive data, but which have been made public by the data subject, are impliedly also allowed to be processed freely thereafter, even though the DPD did not explicitly so provide. Related difficulties here include determining when data have been 'made public' for this purpose,173 especially with social networks, and the role of the data subject's intention to make the data public.

There are obvious policy implications. In this world of 'Web 2.0' user-generated content, consumers are posting an unprecedented volume of information online, much of which is personal data. Yet they are not necessarily aware of the possible extensive consequences, both for themselves and for others whose data they post, notably loss of control over the personal data posted. Regulators and policy-makers are increasingly concerned that consumers may need to be protected in such situations, for example by requiring more privacy-friendly default settings.174 Many social networking sites' default settings make the posted information available to a wider group, sometimes even to the

172 Art 8(2)(e). This is implemented in the UK Data Protection Act 1998, for instance, by Sch 3 para 9, 'The information contained in the personal data has been made public as a result of steps deliberately taken by the data subject'.

173 Peter Carey, Data Protection: a Practical Guide to UK and EU Law (OUP, 2009), 86: a statement in a TV interview would be public, but what about a personal announcement to a group of friends?

174 See for example Article 29 Data Protection Working Party and Working Party on Police and Justice, The Future of Privacy - Joint contribution to the Consultation of the European Commission on the legal framework for the fundamental right to protection of personal data, WP 168 (2009) [48], [71]; Article 29 Data Protection Working Party, Opinion 5/2009 on online social networking, WP 163 (2009), 3.1.2 and 3.2. On privacy protection in social networking, see also Martin Pekárek and Ronald Leenes, 'Privacy and Social Network Sites: Follow the Money!' (Tilburg Institute for Law, Technology, and Society (TILT), Tilburg University, Position paper for the W3C Workshop on the Future of Social Networking, Barcelona, January 15-16, 2009).


public at large, whereas some consumers may believe that the data they post will only be available to a limited group of friends and family. In such an environment, there may be reluctance to endorse the free processing of personal data made public by the data subject. However, there may still be some scope for relaxing data protection requirements applicable to such data.

In summary, the effectiveness of access control restrictions, and the existence of any back doors or other means for a cloud provider to access personal data stored unencrypted with it, may affect whether the data are 'personal data' in the hands of the provider, even if the provider only has limited incidental access to such data. Yet it is arguable that, in relation to utility infrastructure providers who may not know the nature of the data stored in their infrastructure 'cloud of unknowing', the DPD should not be applied in full force or indeed at all. We will develop these arguments in a future CLP paper.
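To make the technical premise concrete, the following is a minimal illustrative sketch (ours, not drawn from any provider's actual service) of user-side encryption before upload, under which a provider would hold only ciphertext and never the key; the Python 'cryptography' library is assumed, and upload_to_cloud is a hypothetical placeholder rather than a real storage API.

    # Illustrative sketch only: client-side encryption before upload, so that the
    # provider stores ciphertext but never holds the decryption key.
    # Assumes the 'cryptography' package; upload_to_cloud is a hypothetical placeholder.
    from cryptography.fernet import Fernet

    def upload_to_cloud(name: str, blob: bytes) -> None:
        # Stand-in for a provider's storage API call.
        print(f"uploading {len(blob)} bytes as {name!r}")

    key = Fernet.generate_key()          # generated and retained on the user's side only
    cipher = Fernet(key)

    record = b"Employee 1234 absent with a cold, 10 March 2011"
    ciphertext = cipher.encrypt(record)  # this is all the provider ever receives
    upload_to_cloud("records/1234", ciphertext)

    assert cipher.decrypt(ciphertext) == record  # only a key holder can recover the content

On the approach argued for in this paper, ciphertext held in this form by a provider without the key would carry only a remote risk of identification unless the key were compromised.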

3.7 The way forward?

We suggest there should be a two-stage approach. First, the definition of 'personal data' should be based on the realistic likelihood of identification. Secondly, rather than applying all the Principles to information which has been determined to be 'personal data', there should be consideration in each particular context of which data protection rules should be applied, and to what extent, based on the realistic risk of harm and its likely severity.

The concept of 'personal data' is currently used to determine the applicability (or not) of the DPD in a binary, bright-line manner. If information is personal data, all the requirements of the DPD apply to it in full force; if it is not, none of the DPD's provisions will regulate its processing. However, as WP136 and courts have recognised, the 'personal data' concept is in reality not binary but analogue. Identifiability falls on a continuum. There are degrees of identifiability; identifiability can change with the circumstances, with who processes the information and for what purpose; and the more information that is accumulated about an individual, the easier it becomes to identify them. On that basis, it seems that almost all data could potentially be personal data, which is not helpful in determining the applicability of data protection laws to specific data processing operations in practice.

Similarly, whether or not particular information about a person constitutes 'sensitive data' will, in many cases, depend entirely on the context. As the UK Information Commissioner's Office ('ICO') has pointed out in its submission175 to the UK Ministry of Justice's call for evidence on current data protection law,176 while many individuals

175 ICO, The Information Commissioner's response to the Ministry of Justice's call for evidence on the current data protection legislative framework (2010) ('MoJ').

176 In order to help inform the UK's position in the forthcoming negotiations on updating the DPD – UK Ministry of Justice, Call for Evidence on the Current Data Protection Legislative Framework (2010). The responses to the call have been summarised at UK Ministry of Justice, Response to the Call for Evidence on the Current Data Protection Legislative Framework (2011).


would consider health data to be sensitive, 'is a record kept in a manager's file recording that an employee was absent from work because he or she had a cold particularly sensitive in any real sense?'. Advances in re-identification techniques over the last decade have surely put paid to the justifications for applying the Principles in an 'all or nothing' fashion.

In its submission, mentioned above, the ICO noted that the 'personal data' definition needs to be clearer and more relevant to modern technologies and the practical realities of processing personal data held in both automated and manual filing systems. The ICO also pointed out that any future framework must deal better with new forms of identifiability, but suggested that different kinds of information, such as IP address logs, might have different data protection rules applied to them, or disapplied.177 Its overall view178 was that 'a simple "all or nothing" binary approach to the application of data protection requirements no longer suffices, given the breadth of information now falling within the definition of "personal data"'.

In its Communication on the approach to data protection in the EU,179 the European Commission did not discuss the possibility of changing or eliminating the definition of personal data. However, the Communication noted,180 as regards the current broad DPD definition of 'personal data', that there are 'numerous cases where it is not always clear, when implementing the Directive, which approach to take, whether individuals enjoy data protection rights and whether data controllers should comply with the obligations imposed by the Directive'. The Commission considered that certain situations involving the processing of specific kinds of data181 would require additional protections, and said it will consider how to ensure coherent application of data protection rules in light of new technologies and the objective of ensuring free circulation of personal data within the EU.

Perhaps it is time to consider focusing less on the protection of data as such,182 and more on achieving a proper balance in specific processing situations between the

177 Such as, in the case of IP logs, requiring security measures but not subject access rights or consent to log recording: 2 of the ICO's submission to the 2009 Consultation (n 22), ICO, The Information Commissioner's response to the European Commission's consultation on the legal framework for the fundamental right to protection of personal data; reiterated in MoJ (n 175) 7.

178 MoJ (n 175) 7; 2 of the ICO's submission to the Commission (ibid); and ICO, The Information Commissioner's (United Kingdom) response to A comprehensive approach on personal data protection in the European Union - A Communication from the European Commission to the European Parliament, the Council, the Economic and Social Committee and the Committee of the Regions on 4 November 2010 (2011).

179 n 23.

180 n 23, 5.

181 For example location data, and even key-coded data.

182 In its submission to the 2009 Consultation (n 22) telecommunications provider Vodafone commented on the 'data-centric' nature of the current framework. Vodafone, The future direction of EU Data Protection and privacy regulation (2009). Also, the European branch of a global trade association for GSM mobile phone operators noted EU states' application of data protection rules based on distinguishing individuals from others, 'even though an organisation collecting and processing such information has no intention of using it to target or in any way affect a specific individual'. Martin Whitehead, GSMA Europe response to the European Commission consultation on the framework for the fundamental right to the protection of personal data (GSMA Europe, 2009).


protection of individuals' rights in relation to data and the free movement of data within the EEA and, where appropriate, beyond. In particular, rather than looking solely at whether information is or is not personal data,183 might it not make more sense to consider the risk of identification and the risk of harm184 to individuals as a result of the particular processing in question, and the likely severity of any resulting harm? How the information is processed should then be tailored accordingly, considering what measures are appropriate in the circumstances in light of such risks.185

Arguably, the underlying approach of the A29WP is in fact a risk-based one; consider, for example, the WP136 suggestion that pseudonymous data may involve fewer risks for the individual.186 Regarding anonymisation, the ICO has pointed out the different levels of identifiability and argued for a more nuanced and contextual approach to the protection of 'personal data' and 'anonymised' data.187 Austria's less restrictive treatment of 'indirectly personal data' has already been mentioned above.188 The amended Swedish implementation of the DPD189 from 2007

183 As the UK ICO pointed out in the context of manual records (MoJ, n 175), reliance on the 'personal data' definition has produced some attempts to avoid data protection regulation by trying to structure the situation to make data fall outside the definition.

184 As per the approach of the APEC Privacy Framework (APEC Secretariat, 2005). Technology multinational Cisco also supports 'a more explicit link between harm and the necessary data protection requirements', noting that the DPD 'tends towards seeing all personal data as worthy of protection regardless of the privacy risk', and that the divergence of this approach from one based on privacy risk 'is exacerbated by the broad definition of personal data, which encompasses any data that can be linked to an individual'. Cisco Systems, Cisco response to the consultation on the legal framework for the fundamental right to protection of personal data (2009) s 3.3, 5.

185 If you've been identified as belonging to a group which has 100 members, the chance of identifying you is 1 in 100, and the risk of your being identified is similarly 1 in 100. That may entail more precautions in relation to that data compared with data which would only identify you as belonging to a group which has 1 billion members. EPOF (n 77) consider that it should be possible to 'match the level of regulation with the degree of identifiability of data... The concept of personal data should rather be defined pragmatically, based upon the likelihood of identification...'

186 See n 91 and accompanying main text.

187 MoJ (n 175), 8.

188 Text to n 92.


also illustrates a risk-based approach, with reduced regulation of personal data in unstructured material190 such as word processing documents, webpages, emails, audio and images, presumably on the basis that the risk is considered lower with such material. The amendment relaxed requirements relating to the processing of personal data contained in such unstructured material, provided the material is not included or intended for inclusion in a document management system, case management system or other database. Such unstructured data are exempt from most of the Principles, for example the restriction on transfer of personal data outside the EEA. However, security requirements apply to such material, and the processing of personal data in such material must not violate the integrity (privacy) of data subjects, otherwise penalties will apply.191 Thus, the focus is on preventing harm.

As the DPD was largely based on the Council of Europe's Convention 108,192 it is also interesting to note the approach taken in a recent report on lacunae in the Convention arising from technological developments. The author argued that it is becoming less relevant to ask whether data constitute personal data; rather, one should identify the risks posed by information and communication technologies in relation to the use of data in a particular context, and respond accordingly.193

Whether information is 'personal data' under the DPD already requires consideration of the specific circumstances. Considering the risk of identification and the risk of harm in the context of a specific processing operation would not seem to be more difficult than determining whether particular information is 'personal', and yet may make more sense in terms of forcing those involved to think about the overall objective of protecting individuals' privacy.

In short, the 'personal data' definition is currently the single trigger for the applicability of all data protection rules. But, as the definition stands, given scientific advances it seems almost all data could qualify as such. Therefore, in considering whether particular information should be treated as 'personal data' so as to trigger data protection obligations, the key factor ought instead to be: what is the realistic risk of identification? Only where there is a sufficient risk of identifiability (for example, 'more likely than not') should the data be considered 'personal data'.

189 Personuppgiftslag (1998:204) (Swedish Personal Data Act) (rough translation).

190 The UK Data Protection Act 1998's five categories of 'data' in s 1(1) also include certain unstructured data, such as any recorded information held by a public authority. However, manual data held by a public authority are explicitly exempted from most of the Principles and data protection legislation, the main exception being the Principle requiring such data to be accurate and, where necessary, kept up to date (s 33A).

191 Swedish Ministry of Justice, Personal Data Protection – Information on the Personal Data Act (2006).

192 n 21.

193 Jean-Marc Dinant, Rapport sur les lacunes de la Convention n° 108 pour la protection des personnes à l'égard du traitement automatisé des données à caractère personnel face aux développements technologiques T-PD-BUR(2010)09 (I) FINAL (Conseil de l'Europe 2010) 8.


If there is only a remote, highly theoretical risk of identification, particularly because of technical and other measures taken, then we suggest the information should not be treated as 'personal data'. In particular, in the cloud computing context the status of encrypted data as non-personal data should be clarified, at least where it has been encrypted and secured to industry standards and best practices.194 Clarification is also needed regarding anonymised data and the procedures of anonymisation and encryption. In many ways the current law recognises this already ('means reasonably likely to be used', and WP136's reference to theoretical risks). However, in the current environment it may make sense for the threshold to be higher, based on realistic risks (such as 'more likely than not'). The boundary also needs to be clearer.

The criteria for triggering the application of data protection obligations need to be more nuanced, rather than 'all or nothing'. It may be appropriate to apply all the Principles in some situations, but not in others. A better starting point might be simply to require explicit consideration of the degree of risk of harm to living individuals which the intended processing carries, and the likely severity of that harm, balancing the interests involved, and subject to appropriate exemptions. This would require an accountability-based approach that is proportionate to the circumstances, including the situation of the controller, data subject and any processor.195 More sensitive situations, with greater risk of resulting harm and/or greater severity of the likely harm, would require more precautions than less sensitive ones. Such an approach would be in line with the European Commission's desire to require additional protections in certain circumstances. Yet it would also allow fewer or even no Principles to be applied, for example, to encrypted data held by someone who has no access to the key, where the encryption and other security measures conform to accepted industry standards and best practices,196 or where recognised certifications have been achieved.
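As a hedged illustration of the kind of anonymisation/pseudonymisation procedure referred to above (our own sketch under stated assumptions, not a method prescribed by WP136 or the DPD), a controller might replace direct identifiers with keyed hashes before placing records with a provider, so that re-identification is realistic only for whoever holds the secret key; the record fields and key handling below are purely illustrative.

    # Illustrative sketch only: keyed pseudonymisation of direct identifiers.
    # The secret key stays with the controller; a provider storing only the
    # pseudonymised records cannot realistically reverse the mapping.
    import hashlib
    import hmac
    import secrets

    SECRET_KEY = secrets.token_bytes(32)  # retained by the controller, never sent to the cloud

    def pseudonymise(identifier: str) -> str:
        # Deterministic keyed hash: the same person maps to the same pseudonym,
        # preserving linkability for the controller without exposing the identity.
        return hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

    record = {"name": "Jane Doe", "postcode": "E1 4NS", "diagnosis": "cold"}
    stored = {"subject": pseudonymise(record["name"]), "diagnosis": record["diagnosis"]}
    print(stored)  # the form in which the record might be placed with a cloud provider

Whether such data remain 'personal data' in the provider's hands would, on the analysis above, turn on the realistic risk that someone without the key could re-identify the data subjects.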

194 The European Commission wants to support development of 'technical commercial fundamentals' in cloud computing, including standards. In a recent speech Commissioner Kroes said 'As a mediator, the Commission can also play a stronger role in the technical standardisation of APIs and data formats, as well as in the development of template contracts and service level agreements.' Neelie Kroes, 'Towards a European Cloud Computing Strategy' (World Economic Forum, Davos, 27 January 2011) SPEECH/11/50. Encryption and security measures ought also to be standardised, with a view to appropriate legal recognition of accepted industry standards.

195 We refer here to accountability in the sense used in the Canadian Personal Information Protection and Electronic Documents Act 2000 ('PIPEDA'). Views differ on what 'accountability' involves. The Article 29 Data Protection Working Party in Opinion 1/2010 on the principle of accountability, WP 173 (2010) seems to consider that 'accountability' involves taking measures to enable compliance with the data protection requirements and being able to demonstrate that this has been done. In contrast, the broader PIPEDA approach treats accountability as end-to-end responsibility on the part of the controller; for instance, there is no prohibition on transfer of personal data abroad, but the controller remains accountable in relation to the data wherever they are held – PIPEDA sch 1, 4.1.3. Accountability and the respective positions of the different actors will be discussed in more detail in future CLP papers.

196 At least where it is reasonable to do so, and assuming that such standards would require stronger encryption for more sensitive data such as medical records. If encryption under recognised standards is broken, however, it might be reasonable to expect that a stronger form of encryption should be substituted within a reasonable time period. See nn 117 and 51.


It could be said that such a broad and flexible approach may result in uncertainty and may exacerbate the lack of harmonisation of DPD rules within the EU, but arguably no more so than the current regime.197 Indeed, it may better reflect the practical reality in many member states. It is also interesting in this regard to note that the A29WP's submission198 responding to the European Commission's December 2009 consultation on the DPD did not mention the definition of 'personal data'. Instead it concentrated, in our view rightly, on broader practical structural protections such as requiring privacy by design199 and accountability.200

It has been suggested201 that, because of the increasing ease of re-identification and the 'near-impossibility of full anonymisation of personal data in the new socio-technical global environment', 'the basic approach should be to reduce the collecting and even initial storing of personal data to the absolute minimum'.202 A requirement to use privacy

197 In addition, given the importance of harmonisation, and the recognition in the Commission Communication (n 23) that more needs to be done to foster a unified approach, it would be possible, if the Commission thought fit, for the proposed new Data Protection Directive to provide that any guidance issued by the A29WP or Commission would be binding on member states and their courts, and/or to empower the A29WP to assess the adequacy and consistency of national implementations and issue other binding rulings. European Commission, Summary of Replies to the Public Consultation about the Future Legal Framework for Protecting Personal Data (2010) 2.2.18 noted that some organisations indeed want the A29WP to have more powers to oversee harmonisation, such as through more use of authoritative opinions.

198 The Future of Privacy (n 174).

199 Embedding privacy-protective features into the design of technologies, policies and practices. PbD was pioneered by Ontario's Information & Privacy Commissioner Dr Ann Cavoukian as developed from her PhD thesis. See NEC Company, Ltd. and Information and Privacy Commissioner, Ontario, Canada, Modelling Cloud Computing Architecture Without Compromising Privacy: A Privacy by Design Approach (Privacy by Design 2010), and more generally Ann Cavoukian, Privacy by Design (Privacy by Design 2009). PbD is increasingly finding favour amongst other regulators and legislators – for example, 32nd International Conference of Data Protection and Privacy Commissioners, 'Privacy by Design Resolution' (27-29 October 2010, Jerusalem, Israel); The Future of Privacy (n 174); the Communication (n 23); Viviane Reding, 'Doing the Single Market justice' (Unleashing the digital single market Conference, Lisbon Council, 16 September 2010) SPEECH/10/441 and 'Towards a true Single Market of data protection' (Meeting of Article 29 Working Party 'Review of the Data protection legal framework', Brussels, 14 July 2010) SPEECH/10/386; Peter Hustinx, Opinion of the European Data Protection Supervisor on Promoting Trust in the Information Society by Fostering Data Protection and Privacy (European Data Protection Supervisor, 18 March 2010) and 'The Strategic Context and the Role of Data Protection Authorities in the Debate on the Future of Privacy' (European Privacy and Data Protection Commissioners' Conference, Prague, Czech Republic, 29 April 2010); FAQs (n 101); Kroes (n 26).

200 These issues will be discussed in more detail in a future CLP paper.

201 LRDP Kantor and Centre for Public Reform, Comparative study on different approaches to new privacy challenges, in particular in the light of technological developments - final report to European Commission (European Commission 2010) [121].

202 The Principle of data minimisation is already included in the DPD as one of the data quality Principles at art 6(1)(c); however, the suggestion was to focus primarily, or more strongly, on that Principle.


by design and privacy enhancing technologies203 should certainly assist in data minimisation,204 and in our view such a requirement should be an essential part of the forthcoming revised DPD. However, data minimisation by itself will not be enough to deal with all the personal data that are already 'out there'.205 Such personal data should still be subject to a requirement that they be processed in a way that takes due account of the risks of harm to living individuals arising from the processing operation in question, and their likely severity.

What about sensitive data? Under the current DPD, all data may potentially be personal (or not); indeed, all data may potentially be sensitive (or not), depending on the context. However, the distinction between personal data and sensitive data is, arguably, also no longer tenable. The European Commission is to consider whether other categories of data should be considered as 'sensitive data', for example genetic data, and will be further 'clarifying and harmonising' the conditions under which sensitive data may be processed.206 The UK ICO has made the point207 that a fixed list of categories can be problematic: sensitivity can be subjective or cultural, a set list208 does not take sufficient account of the context and may even exclude data which individuals consider

203 Privacy enhancing technologies ('PETs') are not necessarily the same as privacy by design (n 199), which involves an approach to the design of technologies. Views differ on the definition of PETs. See for instance Working Group on Data Protection in Telecommunications of the Committee on Technical and organisational aspects of data protection of the German Federal and State Data Protection Commissioners, Privacy Enhancing Technologies in Telecommunications - Working document (European Commission 1997), which merely attempts to define criteria for PETs rather than defining the term, including data minimisation and anonymous use of technologies. The ICO considers PETs 'are not limited to tools that provide a degree of anonymity for individuals but they are also any technology that exists to protect or enhance an individual's privacy, including facilitating individuals' access to their rights under the Data Protection Act 1998' – Information Commissioner's Office, Data Protection Technical Guidance Note: Privacy enhancing technologies (PETs) (2006). A more recent study for the European Commission on the economic benefits of PETs pointed out the complexity of the PETs concept, noting that data security technologies are PETs if used to enhance privacy, but they can be used in privacy-invasive applications; the study suggested that the use of more specific terminology such as 'data protection tools', 'data minimisation tools', 'consent mechanisms' etc might be preferable in many cases. London Economics, Study on the economic benefits of privacy-enhancing technologies (PETs): Final Report to The European Commission DG Justice, Freedom and Security (European Commission 2010).

204 Eg IBM's Idemix, Microsoft's Windows CardSpace and other technology incorporating Credentica's U-Prove, and the Information Card Foundation's Information Cards. Touch2ID is a UK pilot of smart cards proving the holder is of drinking age without revealing other personal data - Kim Cameron, 'Doing it right: Touch2Id' (Identity Weblog, 3 July 2010).

205 And see Case C-73/07 Tietosuojavaltuutettu v Satakunnan Markkinaporssi and Satamedia Oy (Approximation of laws) OJ C 44/6, 21.2.2009 - public domain personal data is still protectable as personal data.

206 n 23, 2.1.6.

207 MoJ (n 175), 10.

208 Paragraph containing n 62.


sensitive, and non-EU jurisdictions may have different lists, which could cause difficulties for multinationals.209 Going further, the ICO has suggested:

    [A] definition based on the concept that information is sensitive if its processing could have an especially adverse or discriminatory effect on particular individuals, groups of individuals or on society more widely. This definition might state that information is sensitive if the processing of that information would have the potential to cause individuals or [sic] significant damage or distress. Such an approach would allow for flexibility in different contexts, so that real protection is given where it matters most. In practice, it could mean that the current list of special data categories remains largely valid, but it would allow for personal data not currently in the list to be better protected, for example financial data or location data. Or, more radically, the distinctions between special categories and ordinary data could be removed from the new framework, with emphasis instead on the risk that particular processing poses in particular circumstances.

This hints at possible support for a similar risk-based approach to personal data generally, and not just to sensitive personal data. If a general risk-based approach were taken, there might not be any need for 'sensitive data' as a special category, with the sensitivity of particular data in a specific context being one of the factors determining how it may or should be processed.

We suggest a two-stage, technologically-neutral, accountability-based210 approach to address the privacy concerns originally intended to be dealt with by the DPD 'personal data' concept:

1. risk of identification - appropriate technical and organisational measures should be taken to minimise the risk of identification. Only if the resulting risk of identification is still deemed sufficiently high should the data be considered 'personal data', triggering data protection obligations;

2. risk of harm and likely extent of harm - the risk of harm and its likely severity should then be assessed, and appropriate measures taken in relation to the personal data, with the obligations being proportionate to the risks.

Accordingly, if appropriate measures have been successfully implemented by a controller to minimise the risk of identification, notably secure encryption measures so that the information is not considered 'personal data', then there should be no need to address the risk of harm. However, if there is a sufficient risk of identification in the specific circumstances, such as with pseudonymisation or aggregation carried out in certain ways, then the risk of harm and likely severity of harm should be assessed, and appropriate measures taken.
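Purely to illustrate the shape of this two-stage assessment (a hypothetical sketch of our own; the numeric thresholds and obligation labels are invented for illustration and are not drawn from any regulator or from the DPD), the logic might be expressed as follows.

    # Hypothetical sketch of the two-stage approach suggested above.
    # All thresholds and obligation labels are illustrative assumptions only.
    def assess(identification_risk: float, harm_risk: float, harm_severity: float) -> list:
        # Stage 1: is identification realistically likely ('more likely than not')?
        if identification_risk <= 0.5:
            return []  # not treated as 'personal data'; no obligations triggered

        # Stage 2: obligations proportionate to the risk of harm and its likely severity.
        obligations = ["security measures", "accountability"]
        if harm_risk * harm_severity > 0.5:
            obligations += ["justification for processing", "enhanced safeguards"]
        return obligations

    # Securely encrypted data held without the key: identification risk treated as remote.
    print(assess(identification_risk=0.01, harm_risk=0.9, harm_severity=0.9))   # -> []
    # Pseudonymised health data accessible to key holders: proportionate obligations apply.
    print(assess(identification_risk=0.8, harm_risk=0.6, harm_severity=0.9))

The point of the sketch is structural only: the first stage gates whether data protection obligations are triggered at all, while the second scales those obligations to the risk and severity of harm in the particular processing context.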

209 The ICO's response to the Commission's Communication, n 177, points to the recent draft US Privacy Bill to demonstrate the difficulties with defining 'sensitive personal data' as a list of categories, as opposed to defining it against the impact, or potential impact, of the processing of the data on the individual.

210 To be discussed further in future CLP papers. For instance, the cloud user ought to consider any risk of the provider inadvertently deleting the encrypted data, and protect against it accordingly, such as by keeping further copies of the encrypted data with other providers or storing copies locally.


There seem to be two situations where it may be possible to allow free, or at least less restricted, processing of originally 'personal' data. The first is where the data have been so depersonalised that they are no longer 'personal', such as through secure encryption of the full data set. The second, perhaps involving more difficult policy issues, is where personal data have been intentionally made public by the data subject. Account should be taken not just of what is done to data, but also of who does it – data subject, cloud user, or cloud provider.

4. Concluding remarks

We have advanced proposals which we suggest would enable data protection laws to cater for cloud computing and other technological developments in a clearer and more balanced way. An accountability approach to data protection responsibilities should be taken by raising the threshold inherent in the 'personal data' definition, basing it instead on the realistic risk of identification, and by recognising a continuum or spectrum of parties (depending on the circumstances) who may be processing personal data, each having varying degrees of obligations and liabilities under data protection law, with the risk of identification and the risk of harm (and its likely severity) being the key factors. Such an approach should result in lighter, or even no, data protection regulation of passive utility infrastructure cloud providers, while reinforcing the obligation of cloud providers who knowingly and actively process personal data to handle such data appropriately.

More specifically, it is important to clarify the status of encrypted data and anonymised data to ensure that securely-encrypted data are not treated as 'personal data'. The legal status of the encryption or anonymisation procedure, i.e. converting personal data into an encrypted or anonymised state, also needs consideration and clarification.

As for the industry, cloud computing providers, especially infrastructure providers, may wish to consider developing and putting into place measures to minimise the likelihood of their cloud service being regulated inappropriately by EU data protection laws, such as encryption at the user end by default. They may also benefit from providing more transparency on their sharding and other operational procedures, and from continuing work on developing industry standards, such as on encryption of data to be stored in the cloud, including various elements of privacy by design. Such an emphasis on standards, while facilitating a more flexible and pragmatic approach to the regulation of the various actors in the cloud ecosystem, should also help to shift regulatory focus back to protecting the interests of individuals.


Appendix

Suggested issues for consideration in the revision of the DPD

This Appendix summarises some matters which might be considered in revising the DPD. The DPD was first proposed by the European Commission in 1990211 and adopted in 1995. Technology, and in particular the use of the internet, has moved on significantly since then. It is time to make the DPD fit for the next decade, and hopefully beyond.

1. Anonymisation, pseudonymisation and encryption procedures212 – clarify the position, such as by providing that such procedures are not within the DPD's scope, or are not considered 'processing', or that fewer obligations apply to data which have undergone such procedures, or that such procedures are authorised under the DPD,213 so as not to deter their use as privacy enhancing technologies.

2. Encrypted data – consider providing explicitly that data encrypted and secured to industry standards (including on key management) are not 'personal data' and thus may be processed freely by those who do not hold the decryption key.

3. Anonymised data – consider clarifying the circumstances in which anonymisation may produce non-'personal data', with the likelihood of identification being the main determinant - for example, treating as 'personal data' only data which would 'more likely than not' lead to the identification of individuals. Data should not be treated as 'personal data' where there is insufficient realistic risk of identification.

4. Accountability – consider moving to a more nuanced, proportionate and flexible regime, which bases the general approach on end-to-end accountability rather than the binary 'controller/processor' distinction, basing the applicability of data protection obligations on the risk of harm and its likely severity, with appropriate exemptions.214

5. Sensitive data – consider a similar risk of harm approach, with definitions as suggested by the UK ICO.215

6. Personal data manifestly made public – clarify the position in relation to such data, for non-sensitive as well as sensitive data, such as whether to limit the applicability of data protection laws to such data, only apply certain Principles to such data, etc.

211 Proposal for a Council Directive concerning the Protection of Individuals in relation to the Processing of Personal Data, COM/90/314 final - SYN 287, OJ C 277/3, 5.11.1990.

212 Including separate consideration of anonymisation and pseudonymisation procedures in this context, and possibly separate treatment of the two kinds of procedures.

213 Walden (n 77).

214 This will be considered in a future CLP paper.

215 See paragraph containing n 207.
