Development of a Protected Environment for Translational Research Data and Analytics

Wayne Bradford 1, John Hurdle 2, Bernie LaSalle 2, and Julio C. Facelli 1,2

1 Center for High Performance Computing and 2 Department of Biomedical Informatics, The University of Utah, Salt Lake City, UT 84112.
Abstract
This paper describes how the Center for High Performance Computing (CHPC) has developed a Protected Environment for Translational Research Data and Analytics. This space can be used for computationally intensive tasks such as data mining, machine learning, and statistics, as well as for deploying protected virtual machines (VMs). In the planning stages, CHPC engaged the University of Utah Information Security Office (ISO) and the Information Security and Privacy Office (ISPO). The environment was architected to require a two-level authentication mechanism: the first level requires access to the University of Utah's Virtual Private Network (VPN), and the second requires that users be listed in the CHPC NIS directory. The physical hardware resides in a datacenter with controlled room access. All employees and users of the cluster have taken the University of Utah's HIPAA training. In three years the number of users has increased from 6 to 58.
Introduction
Translational research is increasingly dependent on reusing clinical data for research purposes.1 However, in the US and elsewhere, stringent laws and regulations require significant measures to assure the privacy of data collected during clinical encounters.2-4 These regulations can be a significant deterrent to translational investigators, who are accustomed to much more fluid and open environments, especially in terms of data access. Informatics solutions are required to bridge this gap, making data and analytical tools available to researchers without undue barriers while preserving the privacy of health data as required by law. Several approaches to this problem have been reported in the literature,5-6 but to date the translational research community has not adopted universally accepted solutions or best practices. Here we present the architecture adopted by the University of Utah to address this problem.

Until the deployment of the CHPC Protected Environment, the only options available to University of Utah researchers were "stand alone" islands of work in individual research groups, which limited cross-group collaboration. To the extent that there was cooperation, it was severely limited, cumbersome, and often haphazard. Security and risk management were not well understood, controlled, reviewed, or documented. Rarely did the research groups who attempted to solve this problem on their own have the time, resources, and expertise to put the necessary security and risk management controls in place. Often there was no segregation of duties for administrative responsibilities and accountability.

The CHPC supports a wide variety of research users with different requirements and was tasked by the University administration to address the needs of researchers working with sensitive and restricted Electronic Health Record (EHR) data, both in an HPC computational workload environment and with various Web and database applications. Processing Protected Health Information (PHI) could not easily be accommodated within the existing CHPC infrastructure while still meeting regulatory compliance requirements and without significantly impacting the rest of our research groups. This prompted us to design and deploy a separate restricted environment that we call the "Protected Environment". Our semi-formal assessment of the needs of the research community, performed through a series of formal and informal meetings (using semi-structured interview methods) with researchers at the University of Utah, indicated two distinct but complementary needs:
• The need to provide large storage capacity with access to analytical tools that are well integrated into high-performance computing (the HPC environment), and
• The ability to provide virtual machines (VMs) to deploy applications containing PHI, such as clinical trials database management tools or specialized systems that provide personalized health data accessible to patients (the VM environment).
Our preliminary study also indicated that the institutional financial support available was quite limited and that existing infrastructure would have to be used to the maximum possible extent.
Methods
When architecting the protected environment discussed above, we wanted to isolate it as much as possible while still using some pieces of the core infrastructure to minimize costs. The network IP space used for the protected environments (i.e., the HPC and the VM environments) was assigned to logical subnets and vLANs separate both from our existing services and from each other. We further segregated the individual virtual machines (VMs) into project-specific groups. With this approach we were able to use most of our existing core services, such as VPN, DNS, NTP, and Kerberos.

The first phase of the project focused on building the HPC environment. This space is for computationally intensive needs such as data mining, machine learning, natural language processing, statistics, and operations across large datasets. Such a space requires large storage and network-bandwidth capacity, hardened storage, and high performance computing. The second phase was building the VM environment, a protected VM farm that is not intended for computationally complex tasks but rather for scientists who need various services that do not justify the expense or capabilities of a dedicated server.

In the early planning stages we engaged the University of Utah Information Security Office (ISO) and the Information Security and Privacy Office (ISPO) to gain a better understanding of the security requirements associated with housing PHI and HIPAA-regulated data. Working with the ISO/ISPO was an important step in designing a system with appropriate security controls and safeguards to protect the confidentiality, integrity, and availability of University data and systems. Additionally, CHPC reviewed policies from similar institutions to determine the best approach to building this new cluster and appropriately securing the data.

Physical Infrastructure: The physical hardware resides in a datacenter with controlled room access. The hosts are racked in a locked cabinet and have locked server bezels. Physical access to the data center is reviewed biannually and documented on an access-controlled departmental Wiki. Backups are restricted to one specific backup server on one particular port. Backup data traffic is automatically encrypted (Blowfish encryption) on the client side before traversing the network. Backup media are stored in locked cabinets in the access-restricted data center. All employees and users who interact with the cluster have completed the University of Utah's HIPAA training, and many have completed the well-known CITI human subjects research training.7

HPC analytic environment: The HPC environment was architected to require a two-level authentication mechanism. The first level of authentication requires use of the University of Utah campus Virtual Private Network (VPN), and only users with specifically assigned privileges may log into the VPN pool of IP addresses for this cluster. The second authentication level requires users to be listed in the CHPC NIS directory server in order to interact with the cluster. Access to both the VPN and the NIS server is set up only by explicit approval, which is documented and kept in perpetuity. Access to the computational cluster servers is provided via front-end "login servers." The login servers are restricted by router Access Control Lists (ACLs) to Remote Desktop Protocol (RDP, TCP port 3389) for the Windows hosts and Secure Shell (SSH v2, TCP port 22) for the Linux hosts. Public key, host, or RHOSTS-based authentication is not allowed. The login servers also employ firewall services to limit access to VPN addresses. All interactions between the front-end hosts and the cluster file server, batch controller, and computing nodes take place on an isolated back-end, high-speed InfiniBand network. All Linux hosts in the cluster are automatically updated on at least a monthly basis using the Red Hat Network update service, while the Windows hosts are updated using Microsoft Update and run anti-virus software that is also regularly updated. We had to provision a dedicated fileserver for the HPC home directories in the protected environment, but we utilize the existing UNIX application file systems in read-only mode to deliver executable programs.
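To make the two-level gate concrete, the following Python sketch restates the logic in schematic form. It is not CHPC's implementation; the VPN subnet, user names, and source of NIS group membership are hypothetical placeholders.

# Illustrative sketch only: schematic of the two-level access check described
# above. Subnet, user names, and membership data are hypothetical placeholders.
import ipaddress

# Hypothetical VPN address pool assigned to the protected-environment cluster.
PROTECTED_VPN_POOL = ipaddress.ip_network("10.0.42.0/24")

def allowed_to_reach_login_node(client_ip: str, username: str,
                                nis_members: set) -> bool:
    """Level 1: the connection must originate from the protected VPN pool.
    Level 2: the user must be listed in the cluster's isolated NIS directory."""
    in_vpn_pool = ipaddress.ip_address(client_ip) in PROTECTED_VPN_POOL
    in_nis_directory = username in nis_members
    return in_vpn_pool and in_nis_directory

if __name__ == "__main__":
    members = {"alice", "bob"}  # would be read from the isolated NIS server
    print(allowed_to_reach_login_node("10.0.42.17", "alice", members))   # True
    print(allowed_to_reach_login_node("192.168.1.5", "alice", members))  # False: not on the VPN pool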
Virtual Machine Implementation: The VM cluster consists of four servers (two VMware ESX, one Windows, one Red Hat Linux) and a disk tray (Dell MD3200i with MD1200 expansion): five devices in total. One Windows server (MS Windows Server 2008 R2 SP1) runs VMware vCenter Server, which coordinates the load-balancing and failover functions of the two ESX servers and also acts as a single management point; this server does not process any protected data. One Red Hat Linux server (Red Hat Enterprise Linux 6) acts as an administrative access point and also does not process any protected data. Two VMware servers (VMware ESX 4.1u1) host the actual guest VMs. These servers do process protected data, but do not store it internally (i.e., all transactions are RAM-based). The disk tray is an iSCSI array providing shared storage to the two ESX servers; these disks store the VMs, and thus all sensitive data in those VMs. We require all VMs to encrypt their disks, so all sensitive data is stored encrypted. VMs and applications are regularly scanned by the University ISO.
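As one illustration of how the disk-encryption requirement could be audited, the hedged Python sketch below checks whether a block device carries a LUKS header. It assumes Linux guests whose data volumes use LUKS, which is our assumption rather than a detail reported above, and the device paths are hypothetical.

# Illustrative sketch only: a possible spot-check of the "all VM disks must be
# encrypted" policy, assuming hypothetical Linux guests with LUKS volumes.
import subprocess

def is_luks_encrypted(device: str) -> bool:
    """Return True if the block device carries a LUKS header.
    `cryptsetup isLuks` exits 0 for LUKS devices and non-zero otherwise."""
    result = subprocess.run(["cryptsetup", "isLuks", device],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0

if __name__ == "__main__":
    for dev in ["/dev/sdb", "/dev/sdc"]:  # hypothetical guest data volumes
        status = "encrypted" if is_luks_encrypted(dev) else "NOT encrypted"
        print(f"{dev}: {status}")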
Administrative procedures: In order for a user to access the HIPAA environment at CHPC, they must meet all of the following requirements (summarized schematically in the sketch after the list):

• Have an active account in the University of Utah's Kerberos authentication system, using the University of Utah's mechanisms. This extends to any external collaborators as well; there are no 'local' user accounts.
• Have an active CHPC departmental account, for which sponsorship and approval by a Principal Investigator (PI) is required.
• Have an active CHPC account created in the HIPAA environment's isolated NIS (Network Information System) and be a member of the 'NIS group' listed in the PAM security access.conf file. This requires verification and completion of the on-line HIPAA privacy and security training.
• Be added to the HIPAA Virtual Private Network (VPN) pool, and use this encrypted VPN tunnel to access the designated login nodes. CHPC account provisioning requires that addition to the HIPAA VPN be limited to those authorized to access the cluster.
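The Python sketch below restates these provisioning requirements as a single schematic check. The record fields are hypothetical stand-ins for the campus Kerberos realm, CHPC accounting records, the isolated NIS group, and the HIPAA VPN roster; it is not the provisioning workflow itself.

# Illustrative sketch only: schematic checklist mirroring the four provisioning
# requirements above, with hypothetical stand-in fields.
from dataclasses import dataclass
from typing import Optional

@dataclass
class AccessRecord:
    has_kerberos_account: bool      # active campus Kerberos principal
    pi_sponsor: Optional[str]       # sponsoring Principal Investigator, if any
    in_hipaa_nis_group: bool        # member of the NIS group listed in access.conf
    completed_hipaa_training: bool  # on-line HIPAA privacy/security training done
    in_hipaa_vpn_pool: bool         # added to the HIPAA VPN address pool

def may_access_protected_environment(rec: AccessRecord) -> bool:
    """All requirements must hold before the account is provisioned."""
    return (rec.has_kerberos_account
            and rec.pi_sponsor is not None
            and rec.in_hipaa_nis_group
            and rec.completed_hipaa_training
            and rec.in_hipaa_vpn_pool)

if __name__ == "__main__":
    applicant = AccessRecord(True, "Dr. Example", True, True, False)
    print(may_access_protected_environment(applicant))  # False: not yet in the VPN pool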
Permission to use a given dataset is governed by the approval of the University's Institutional Review Board (IRB). Researchers must submit a proposal to the IRB listing the data to be used and the people who will have access to it. If the IRB approves the use of the data in question, the researcher is given an IRB number. In order to store the data in CHPC's HIPAA environment, the researcher must provide CHPC with this number and a list of the users who will be permitted to see the data; the list is independently verified with the IRB. Thereafter, the data may be transferred to CHPC, and only the IRB-approved users will be able to work with it.

Logical access is monitored by SYSLOG and process accounting (PSACCT). Logs are kept both locally and on a remote SYSLOG server and are routinely reviewed; logs on the SYSLOG server are currently retained indefinitely. Logwatch reports are emailed daily to designated administrator accounts; these reports show, for the last 24 hours, which accounts logged in (or failed to log in) and from what IP address. The firewall configuration deters 'brute force' login attempts: if an account login is unsuccessful more than 3 times, the account is prevented from authenticating for 3 minutes. Access to view or manipulate other users' or groups' data is enforced using UNIX file and directory permissions. VPN logs, which show access and failures by date and time, are available from the University of Utah's IT (UIT) Department. When an account is locked or disabled at the campus, VPN, or local department level, login access is prevented.

Account validation is periodically reviewed, and user access to IRB data is reviewed on a scheduled biannual basis. IRB projects are the authoritative source for who has access to HIPAA data: if a person is not listed on an approved IRB project, they are not allowed in the UNIX group that grants access to that project's data. Our IRB reviews studies on an annual basis.
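As a rough illustration of this kind of daily review, the following Python sketch summarizes successful and failed SSH logins by account and source IP from a syslog-format authentication log. The log path and message patterns assume a standard OpenSSH/syslog setup rather than the actual CHPC configuration, and the 24-hour time-window filtering is omitted for brevity.

# Illustrative sketch only: a minimal, logwatch-style summary of SSH logins and
# failures from a syslog-format auth log (path and formats are assumptions).
import re
from collections import Counter

ACCEPTED = re.compile(r"Accepted \S+ for (\S+) from (\S+)")
FAILED = re.compile(r"Failed \S+ for (?:invalid user )?(\S+) from (\S+)")

def summarize(log_path: str = "/var/log/secure") -> None:
    accepted, failed = Counter(), Counter()
    with open(log_path, errors="replace") as log:
        for line in log:
            if m := ACCEPTED.search(line):
                accepted[(m.group(1), m.group(2))] += 1
            elif m := FAILED.search(line):
                failed[(m.group(1), m.group(2))] += 1
    print("Successful logins (user, source IP, count):")
    for (user, ip), n in accepted.most_common():
        print(f"  {user:<12} {ip:<16} {n}")
    print("Failed logins (user, source IP, count):")
    for (user, ip), n in failed.most_common():
        print(f"  {user:<12} {ip:<16} {n}")

if __name__ == "__main__":
    summarize()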
Results
The system has been operational for more than three years and, as shown in the following tables, it has gained broad acceptance among our campus researchers, as demonstrated by its continuous growth in size, in number of users, and in the variety of VMs supported.
Table 1: Growth of the HPC protected space
Date     Number of Hosts     Total Disk     Number of Users
2/09     9                   5.6 TB         6
10/10    16                  27.7 TB        26
04/11    19                  33.7 TB        37
03/12    20                  33.7 TB        58
Table 2: Resources allocated in the protected virtual space
Application       Number of VMs     RAM       Disk
REDCap            8                 26 GB     8 TB
Asthmatracker     4                 8 GB      4 TB
caTissue          4                 8 GB      4 TB
Summary
We have been able to reuse a substantial part of the CHPC HPC infrastructure to develop a new protected environment, increasing user productivity and compliance at the same time. New users from various organizations on campus are using this infrastructure, and it is stimulating new collaborations and ideas; these users include the Departments of Biomedical Informatics (DBI), Pediatrics (Primary Children's Medical Center), and Radiology, the College of Nursing, and the University's FURTHeR project, an infrastructure being built under the large NIH translational research grant called the Clinical and Translational Science Award (CTSA). For these researchers, in addition to access to high performance computing power, a tangible benefit is that CHPC handles systems management issues, such as rapid response to electrical power problems, provision of reliable cooling and heating, VPN support for a work-anywhere computing experience, and a hardened, secure environment compared to office computers or departmental servers. For the institution, this resource enables much better compliance and reduces the vulnerability of PHI data to exposure.
References
1. Prokosch HU, Ganslandt T. Perspectives for medical informatics. Reusing the electronic medical record for clinical research. Methods Inf Med. 2009;48(1):38-44.
2. Erlen JA. HIPAA--implications for research. Orthop Nurs. 2005 Mar-Apr;24(2):139-42.
3. Gunter KP. The HIPAA privacy rule: practical advice for academic and research institutions. Healthc Financ Manage. 2002 Feb;56(2):50-4.
4. Gunn PP, Fremont AM, Bottrell M, Shugarman LR, Galegher J, Bikson T. The Health Insurance Portability and Accountability Act Privacy Rule: a practical guide for researchers. Med Care. 2004 Apr;42(4):321-7.
5. Cimino JJ, Ayres EJ. The clinical research data repository of the US National Institutes of Health. Stud Health Technol Inform. 2010;160(Pt 2):1299-303.
6. Malin B, Karp D, Scheuermann RH. Technical and policy approaches to balancing patient privacy and data sharing in clinical and translational research. J Investig Med. 2010 Jan;58(1):11-8.
7. CITI Collaborative Institutional Training Initiative. https://www.citiprogram.org/default.asp. Last referenced 09/27/2012.
Acknowledgements: This work has been partially supported by grants from the NIH, National Center for Research Resources award UL1RR025764, National Library of Medicine 5RC2LM010798 and 5R21LM009967, and DHHS Health Resources & Services award 1D1BRH20425-01-00.