Supplementary Material

S1. RISK ANALYSIS

This does not constitute an exhaustive list of risks, but it does provide readers with a platform from which they may develop their own analysis. Details regarding checklists can be found in Section S6.

S1.1. Hardware failure

Although our daily interaction with electronic storage media may suggest that they are infallible, this belief is tested by both the volume of data at hand and the length of time for which they must be stored. Such failures may take the form of corrupted storage media, or may simply be a mechanical fault that limits the ability of the disk to function whilst leaving the data intact. Research by Google[1] revealed an annualised failure rate above 5% for disks two or more years of age. Although their definition of failure did not imply a complete loss of data, there was still a need to repair the device.
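
The practical impact of such a failure rate scales with the number of disks in service. As a rough illustration (assuming independent failures and a hypothetical disk count, not figures from the study itself), the probability that at least one disk in a server fails within a year can be estimated as follows:

```python
# Probability of at least one disk failure per year, assuming independent
# failures at a fixed annualised failure rate (AFR).
afr = 0.05   # annualised failure rate per disk (>5%, per the Google study[1])
n = 20       # hypothetical number of disks in a laboratory server

p_any_failure = 1 - (1 - afr) ** n
print(f"P(at least one failure in a year) = {p_any_failure:.1%}")  # roughly 64%
```

Even with a modest fleet, a failure within the year becomes more likely than not, which motivates the redundancy measures of Section S2.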

S1.2. Non-technical risks

It is easy, when considering the technological aspects of risk, to forget about physical aspects of data security such as fire or water damage, or direct access to a laboratory instrument or a backup device. The seriousness of such risks is reflected in the protective frameworks surrounding credit card data,[2] which impose explicit physical requirements. The trust placed in laboratory employees, associates, and visitors is yet another point of potential weakness in the security of data.

S1.2.1. Human error

An elegantly engineered data architecture can be rendered moot by a single human error. Automation of processes, as in laboratory practice[3], provides a level of quality assurance against imperfect humans.

S2. TECHNOLOGICAL MITIGATION OF RISK

S2.1. Redundancy

A key approach to data protection lies in keeping redundant copies of the data. In its most rudimentary form this simply means duplicating data within the same local architecture, but that fails to mitigate risks common to both copies. Further mitigations are hence implemented in an attempt to minimise the probability of a complete loss of all copies.

S2.1.1. RAID

Redundant array of independent disks (RAID) “is a method by which many independent disks attached to a computer can be made, from the perspective of users and applications, to appear as a single disk”.[4] A set of standard configurations exists (see Vadala[4] for details), each of which provides varying degrees of fault tolerance and read/write performance improvements.
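
The fault tolerance of parity-based configurations such as RAID 5 rests on a simple property of the XOR operation: a parity block computed across the data blocks allows any single lost block to be rebuilt from the survivors. A minimal sketch of the principle (illustrative only, not an implementation of any RAID software):

```python
# XOR parity, the principle behind RAID 5 fault tolerance: the parity
# block is the bytewise XOR of the data blocks, so any single lost
# block can be rebuilt by XOR-ing the surviving blocks with the parity.
def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

data = [b"ACGT", b"TTGA", b"CCAG"]  # three equal-sized data blocks
parity = xor_blocks(data)

# Simulate losing the second block and rebuild it from the rest.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```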

S2.2. Backup

S2.2.1. Off-site copies

As alluded to above, the disks of a single RAID array are all subject to a common set of risks (for example, water damage) and thus require a complementary approach. A common mitigation in this case is to create geographically separated backups. It is important to ensure that the data transfer between sites (presumably over a public network) is performed over a secured connection as detailed in Section S4.

S2.2.2. Rolling backups

We are fortunate within genomics that NGS data are static: the output of a historical sequencing run will never change, which allows for the creation of a single set of backup copies. More dynamic data, such as ongoing analyses or those associated with other disciplines, will require ongoing backup creation. Given finite hardware resources we are forced to overwrite historical backups after a particular period of time, rolling through disks in a rotating fashion. The overwriting of historical backups pertains only to the redundant copies of newer data; no permanent deletion occurs, as this would constitute the obliteration of a medical record, but the depth of redundancy is reduced. For example, a doubling of data within a period will result in a halving of the redundancy protections. It is not necessary to duplicate unchanged data, and an approach known as incremental backup can be utilised whereby only new data are copied. This is best done in an automated fashion, and free, open-source approaches such as rsync are described in Preston.[5]
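
The essence of incremental backup is to compare each file's modification time against the time of the last backup and copy only the newer files. The following is a minimal sketch of that idea (the function name and arguments are our own; real tools such as rsync also handle deletions, permissions, and partial transfers, and should be preferred in practice):

```python
import os
import shutil

def incremental_backup(src, dst, last_backup_time):
    """Copy to dst only those files under src modified after last_backup_time.

    A minimal sketch of incremental backup; production use should prefer
    a dedicated tool such as rsync.
    """
    copied = []
    for root, _dirs, files in os.walk(src):
        for name in files:
            source = os.path.join(root, name)
            if os.path.getmtime(source) > last_backup_time:
                rel = os.path.relpath(source, src)
                target = os.path.join(dst, rel)
                os.makedirs(os.path.dirname(target), exist_ok=True)
                shutil.copy2(source, target)  # copy2 preserves timestamps
                copied.append(rel)
    return copied
```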

S3. ADDITIONAL SECURITY PRINCIPLES

S3.1. Defence in depth

As with the use of redundant storage media, we can implement a series of security measures even though a single one should, in theory, suffice. Theory and practice differ, and such an approach, known as defence in depth, increases the probability that an error in one level of implementation is safeguarded by a secondary, redundant implementation. This is not to say that we should necessarily encrypt sensitive data with multiple algorithms as some cryptographic onion; remember that we may end up losing our information should we accidentally block our own access. Defence in depth applies to the plastic padlock scenario described in the main text: even encrypted data should ideally be inaccessible to those who lack authorisation to access their contents.

How deep is deep enough? The answer depends on the threat analysis performed before implementing our security mechanisms.

S3.2. Least privileges

Kerckhoffs’ principle is partly based on the premise that the more widely a piece of information is shared, the more difficult it is to limit its dissemination to only those authorised to be privy to it. This can be generalised to the concept of least privileges; when concerned with information, this amounts to need-to-know. The greater the number of people with authorised access to a computer system, the greater the probability that someone’s user account will be compromised, and it takes only one vulnerability for an adversary to compromise a system. When considering all elements of data security we should limit the authorisation of all computer users such that they are able to perform only the tasks that they are expected to perform, and no more. Should their real-world authorisation level change (through resignation, for example), then so too should their electronic equivalent. There is little point in engaging in an arduous security implementation only to have it foiled by a disgruntled individual such as an ex-employee, or even a current employee who opens the wrong, virus-laden email attachment.
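
At the filesystem level, least privilege can be as simple as ensuring that sensitive files are readable only by their owner. On a POSIX system this might look as follows (the function name and example path are hypothetical):

```python
import os
import stat

def restrict_to_owner(path):
    """Set file permissions to owner read/write only (mode 0600),
    removing all group and world access."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)

# e.g. restrict_to_owner("/secure/keys/backup-key.pem")  # hypothetical path
```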

S4. APPLICATIONS OF CRYPTOGRAPHY

There are many applications of the cryptographic concepts described in the main text. For the purposes of laboratory-data protection they fall into two broad categories: protecting data at rest and data in motion. Data at rest are merely being stored (e.g. genomic archives), whilst data in motion are being transmitted elsewhere (e.g. to an off-site backup). Perhaps the most common means of protecting data in motion is Transport Layer Security (commonly TLS), often confused with its predecessor, Secure Sockets Layer, which readers may know as SSL. Broadly speaking, this involves establishing communications with a remote party (your bank’s website, perhaps), who presents their public key along with a certificate attesting to their true ownership of it. After verification of the certificate’s signature, which provides sufficient evidence that you are in fact communicating with your bank rather than an adversary, the parties use the public key to agree upon (negotiate) a secret session key, which is then used for symmetric encryption of further communications. A session refers to that particular electronic conversation. Key negotiation can take more complex forms than one party simply deciding upon a symmetric key and sharing it via public-key cryptography; interested readers are encouraged to seek information regarding Diffie-Hellman key exchange[6] and other algorithms pertaining to perfect forward secrecy.
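
The idea behind Diffie-Hellman key exchange[6] can be shown with textbook-sized numbers. The values below are for illustration only; real deployments use primes of 2048 bits or more and standardised parameter groups:

```python
# Toy Diffie-Hellman key exchange. Each party combines its own secret
# exponent with the other's public value; both arrive at the same shared
# key without the secret exponents ever being transmitted.
p, g = 23, 5          # public modulus and generator (toy-sized)

a = 6                 # Alice's secret exponent
b = 15                # Bob's secret exponent

A = pow(g, a, p)      # Alice transmits A = g^a mod p
B = pow(g, b, p)      # Bob transmits   B = g^b mod p

alice_shared = pow(B, a, p)   # Alice computes B^a mod p
bob_shared = pow(A, b, p)     # Bob computes   A^b mod p
assert alice_shared == bob_shared
```

An eavesdropper sees only p, g, A, and B; recovering the shared key from these is the discrete logarithm problem, which is believed to be intractable at real-world parameter sizes.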

S5. IMPLEMENTATION NOTES

Depending on the chosen asymmetric algorithm, the size of the private key will differ, and in some cases be too large to practically store on dead-tree media; such is the case with RSA. In such scenarios an electronic copy can be kept in an encrypted format, utilising an ASD-approved symmetric algorithm, with the symmetric key kept in hard copy before being electronically discarded. The need for an HMAC of the encrypted NGS data can be negated by the use of an authenticated mode of encryption. The OpenSSL programmatic library contains an AES-GCM implementation, but it is not made available via the command line[7] even as of version 1.0.2a, the latest version as of writing, as tested by the author. OpenSSL forms the basis of at least two-thirds[8] of security on the World Wide Web. In light of recent vulnerabilities[8] it is undergoing a thorough public audit.[9] Given the extent of global reliance on its proper functioning, its adoption in the laboratory makes for a prudent choice.
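
Where an authenticated mode is unavailable from the command line, an HMAC can be computed separately. Python’s standard library provides an implementation; a sketch for tagging a data file (the key and chunk size below are placeholders, not a recommendation):

```python
import hashlib
import hmac

def file_hmac_sha256(path, key):
    """Compute an HMAC-SHA256 tag over a file, reading in chunks so that
    large NGS archives need not be held in memory."""
    mac = hmac.new(key, digestmod=hashlib.sha256)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(8192), b""):
            mac.update(chunk)
    return mac.hexdigest()

# Verification should use a constant-time comparison to avoid timing leaks:
# hmac.compare_digest(stored_tag, file_hmac_sha256(path, key))
```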

S6. RESOURCES AND FURTHER READING

• Cloud Computing Security, Australian Signals Directorate. http://www.asd.gov.au/infosec/cloudsecurity.htm
  ˚ Cloud Computing Security for Tenants; and
  ˚ Cloud Computing Security Considerations.
• Amazon Web Services Whitepapers. https://aws.amazon.com/whitepapers/
  ˚ Overview of Security Processes; and
  ˚ Architecting for Genomic Data Security and Compliance in AWS, Amazon Web Services.
• Blog by cryptography and security expert Bruce Schneier, author or co-author of many of the references of this paper, including Ferguson et al.[12*] and Abelson et al.[20*] (*references in main text). https://www.schneier.com/
• Information security forum. A strictly-moderated Q&A platform on which users are assigned reputation scores based upon the quality of their contributions. https://security.stackexchange.com/
• Qualys SSL Labs. Automated tools for testing servers and browsers for known vulnerabilities in TLS/SSL configuration. https://www.ssllabs.com
• Open-source implementation of two-factor authentication whereby a device generates time-limited six-digit codes to complement passwords. https://github.com/google/google-authenticator

Interested readers are encouraged to seek information regarding the “birthday paradox”, which has security implications for hash collisions (and hence also the proper selection of unique patient identifiers). At the time of writing, the Wikipedia article pertaining to this subject provided an accessible and accurate introduction. The permanent link to this version of the article is included below. https://en.wikipedia.org/w/index.php?title=Birthday_problem&oldid=668887660

SUPPLEMENTARY REFERENCES
1. Pinheiro E, Weber WD, Barroso LA. Failure trends in a large disk drive population. In: FAST 2007;7:17-23.
2. Official Source of PCI DSS Data Security Standards Documents and Payment Card Compliance Guidelines. 2015. Available from: . [Last accessed on 2015 Aug 10].
3. Kalra J. Medical errors: Impact on clinical laboratories and other critical areas. Clinical Biochemistry 2004;37:1052-62.
4. Vadala D. Managing RAID on Linux. Sebastopol, CA: O’Reilly Media, Inc.; 2002.
5. Preston C. Backup and Recovery: Inexpensive Backup Solutions for Open Systems. Sebastopol, CA: O’Reilly Media, Inc.; 2007.
6. Diffie W, Hellman ME. New directions in cryptography. IEEE Transactions on Information Theory 1976;22:644-54.
7. Google Groups. v1.0.1g command line gcm error. Available from: . [Last accessed on 2015 Aug 10].
8. Heartbleed Bug. Available from: . [Last accessed on 2015 Aug 10].
9. NCC Group. OpenSSL Audit. Available from: . [Last accessed on 2015 Aug 10].