
Challenges and Chances of a distributed network architecture for sharing confidential data in the ESS

Wolf Heinrich Reuter (1)

(1) Vienna University of Economics and Business, Austria, e-mail: [email protected]

Abstract In many statistical offices with regional branches, databases are assembled centrally from several regional parts. The same situation prevails for confidential microdata in the European Statistical System, where Eurostat centrally assembles databases from datasets supplied by the Member States. This paper uses the Eurostat example to highlight the benefits and challenges of a distributed database system. A distributed database system holds the regional parts of the database in the regional offices and stores centrally only the parameters describing how to combine the various parts. Query results are then provided to the user via secured VPN connections as if she were working with a single, non-distributed database. The distributed database system can improve data quality, data security and availability, give regional branches full control over their data, and lower costs. On the other hand, new infrastructure, security measures and standards are needed to establish such a system.

Keywords: Distributed database management system, European Statistical System, Access to confidential data

1. Introduction

New technical opportunities and regulatory changes allow statistical offices to offer new ways of accessing confidential data. Among others, remote access systems have been implemented in various countries. Eurostat is also examining the possibility of implementing such a system for the European Statistical System (ESS) (see e.g. Reuter and Museux, 2011). In the short run, users get access to the confidential data using their workplace infrastructure or by connecting from within a safe center in a local branch of the statistical office or of Eurostat (e.g. in Germany at the respective state statistical office). The advantage is that users no longer have to travel the sometimes long way to the central statistical office to access the assembled confidential database. Nevertheless, the database to which the users get access is still collected, assembled and managed centrally at the headquarters of the statistical office.

A distributed database could be of interest in two settings: in a country where regional offices collect and edit data and afterwards send them to the central office, and at Eurostat, where the national statistical institutes collect and edit the data and afterwards send them to Eurostat. Such a system would allow the statistical office to give remote access to a complete database without actually assembling it as a whole at the central node. The central node would store and manage only the information on where to find the various parts of the database, together with the query management tools. This could significantly reduce the overall workload, and especially the workload at the central office. Each regional statistical office would retain control over its data, and the data would never physically leave that office.

Since many regulatory, infrastructure and collaboration issues have to be tackled first, the approach should be seen as a possible solution for the ESS only in the very long run. Big science networks (e.g. CERN) and internet applications (e.g. Facebook) are already working with sophisticated distributed database technology and could provide best-practice examples as well as expertise in this field. This paper aims to give an overview of the key concepts associated with this technology and a rough picture of a possible implementation in the ESS.

The paper is organized as follows: Section 2 introduces some important technical concepts for distributed databases and distributed database management systems. Section 3 highlights the benefits and challenges of implementing such a system, using the example of the European Statistical System. Section 4 concludes.

2. Technical Overview

2.1 Distributed Database System

As defined e.g. in Özsu and Valduriez (1999), a distributed database is characterized by parts of one database being attached to multiple processing units. These units can be in multiple computers in multiple physical locations which are connected via a local or wide area network (e.g. one database can be distributed across several servers around the world). Such a system of distributed parts of a database is managed by a (central) database management system. This system ensures the integrity, performance and consistency of the database. Transactions have to be managed both to achieve good performance and to make sure that the retrieved data have not been changed in the meantime. These issues are tackled by so-called locking and time stamping, which are described e.g. in Bernstein and Goodman (1981). The database management system lets the distributed database appear to users as a single database. In the best case the user therefore sees no difference between working with one single database and working with the output of the distributed database management system.

2.2 Horizontal Partitioning

In this work we focus on splitting a database into disjoint subsets of rows, which is called horizontal partitioning, i.e. sets of rows/entries of a database are stored in different locations. This contrasts with vertical partitioning, where a dataset is split according to its attributes, i.e. sets of columns of a database are stored in different locations. Normally horizontal partitioning is used to divide large amounts of data into smaller portions, which can be managed and accessed more easily. Furthermore, techniques like partition pruning and partition-wise joins can significantly improve query performance (Eadon et al., 2008). Ceri et al. (1983) call the subsets resulting from horizontal partitioning fragments, each of which can be allocated to a particular site of the distributed database system. To reassemble the complete database, the system has to understand the data objects and links in the logical schema and has to know the interrelationships among the data, i.e. the common criteria by which multiple objects are partitioned. A subset of columns forms the so-called partitioning key, by which the database is split across the several locations.
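The idea of fragments and a partitioning key can be illustrated with a minimal Python sketch. The row layout, the column names (`id`, `country`, `income`) and the helper functions are hypothetical and chosen only for illustration; a real database system implements partitioning internally and far more efficiently.

```python
from collections import defaultdict

# Hypothetical microdata rows; "country" plays the role of the partitioning key.
ROWS = [
    {"id": 1, "country": "AT", "income": 31000},
    {"id": 2, "country": "DE", "income": 28000},
    {"id": 3, "country": "AT", "income": 45000},
    {"id": 4, "country": "FR", "income": 39000},
]


def partition_horizontally(rows, key):
    """Split rows into disjoint fragments, one per distinct value of `key`."""
    fragments = defaultdict(list)
    for row in rows:
        fragments[row[key]].append(row)
    return dict(fragments)


def reassemble(fragments):
    """Rebuild the complete table as the union of all fragments."""
    return [row for fragment in fragments.values() for row in fragment]


def query(fragments, key_value=None):
    """Read rows from the fragments.

    If the query filters on the partitioning key, only the matching
    fragment is read (a very simple form of partition pruning).
    """
    if key_value is not None:
        return list(fragments.get(key_value, []))
    return reassemble(fragments)


fragments = partition_horizontally(ROWS, "country")
```

Because every row carries exactly one value of the partitioning key, the fragments are disjoint and their union is the original table. A query restricted to one key value never has to touch the other fragments.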

2.3 Virtual Private Networks

When parts of a database are distributed across different, possibly far apart, physical locations, the transmission of the data becomes crucial for its security. The data should not be transmitted over public connections without encryption and security measures. We herein focus on Virtual Private Networks (VPNs) for that purpose. VPNs make it possible to connect different local area networks (LANs) over public connections (a WAN) by establishing so-called tunnels through the WAN. This way, e.g., home workers can connect to their company network and act as if they were in the same building. Data security is provided by various encryption technologies (e.g. DES, IDEA, AES), checksum algorithms (e.g. MD5, SHA) and digital signatures.
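The role of checksums in such a tunnel can be sketched with Python's standard library. This is only an illustration of keyed integrity checking (HMAC with SHA-256) under the assumption of a pre-shared key. It is not a VPN implementation, it does not encrypt anything, and the function names are hypothetical.

```python
import hashlib
import hmac


def sign(payload: bytes, shared_key: bytes) -> str:
    """Compute a keyed checksum (HMAC-SHA256) over the payload."""
    return hmac.new(shared_key, payload, hashlib.sha256).hexdigest()


def verify(payload: bytes, tag: str, shared_key: bytes) -> bool:
    """Check that the payload was not altered in transit.

    compare_digest performs a constant-time comparison of the two tags.
    """
    return hmac.compare_digest(sign(payload, shared_key), tag)
```

The sender transmits the payload together with its tag. The receiver recomputes the tag with the same key and rejects the payload if the two tags differ. Real VPN protocols perform an equivalent check on every packet, in addition to encrypting it.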

2.4 Available solutions

Compared to vertical partitioning, horizontal partitioning is already thoroughly understood in the literature (see e.g. Agrawal et al., 2004) and implemented in most of the common database systems. Table 1 shows links to the horizontal partitioning sections of the documentation of some of the most common database systems.

Table 1: Horizontal partitioning in selected database solutions

Database system        Documentation of horizontal partitioning
Oracle                 http://www.oracle.com/technetwork/database/features/bi-datawarehousing/dbbi-tech-info-part-100980.html
IBM DB2                http://www.ibm.com/developerworks/data/library/techarticle/dm-0605ahuja2/
Microsoft SQL Server   http://msdn.microsoft.com/en-us/library/dd578580(v=sql.100).aspx
PostgreSQL             http://www.postgresql.org/docs/9.0/static/ddl-partitioning.html
MySQL                  http://dev.mysql.com/doc/refman/5.5/en/partitioning-overview.html

Virtual private networks, including various encryption and security measures, are available and already incorporated in many of the major server operating systems. VPN technology is currently used in many ways and circumstances. Overall, the parts required for establishing a distributed database system have been in use for several years. However, no single software package combines them, which makes expertise in the field important for a successful implementation.

3. Implementation

This section describes a potential implementation of a distributed database system for the European Statistical System (ESS), with the central node at Eurostat and child nodes in the offices of the National Statistical Institutes (NSIs) of the Member States. We see this setting only as one example of such a system. Workflows similar to those in the ESS can also be found in other statistical institutions with local branches (e.g. in the various states of a country). The exemplary implementation can thus also be applied to those settings.

Figure 1: Comparison of current network structure and possible solution with distributed database

3.1. Current Situation

Among others, Eurostat provides access to confidential microdata (see e.g. Regulation EC No 831/2002). Those databases consist of 27 separable country-specific parts. At specific dates the NSIs have to transmit these parts to Eurostat staff, who collect and merge the data, check their integrity and edit them (e.g. remove direct identifiers). The resulting database files are currently organized as flat, character-separated files. For each (research) project, the approval for each country part of the database is collected from the Member States in a cumbersome process. Research projects may also be granted access to specific parts of the database only. Those permissions are set centrally by Eurostat staff on the merged database.

The left side of Figure 1 sketches the medium-term approach with the current database structure. As proposed in Reuter and Museux (2011), in the medium run researchers could either go to a safe center in their country's NSI and remotely connect to the Eurostat server, or use their own office infrastructure to establish a remote desktop connection to the central server. The researcher can then perform calculations on the database provided to her through the process described above. After she has finished working with the database, all output is checked for breaches of confidentiality by Eurostat staff.

The upper panel in Figure 2 shows the current process in a simplified illustration. The NSI staff collect the data necessary for the country part of the database. At specific deadlines they transmit the assembled country dataset to Eurostat. Often the single country dataset is also made available in a separate working environment for researchers at the NSI. Eurostat staff merge the data, edit and adjust parts of them to obtain a homogeneous database, apply confidentiality measures and set the permissions for users according to the decisions of the countries. After the researchers have finished their work, Eurostat staff check the output to make sure there is no breach of confidentiality.

3.2. Possible Solution

A distributed database system would allow Eurostat to offer users (e.g. via remote access) a confidential database which consists of different parts located in the Member States. The user could nevertheless access and run calculations on the full dataset. She would interact with the database as if it were one logical system. But there would no longer be any need for Eurostat to collect and assemble the database. At Eurostat only a database management system would be running, holding information on the location of the various parts and on how to access and assemble them.

The lower panel of Figure 2 depicts the situation with an implemented distributed database management system. Each NSI is in full control of, and fully responsible for, its country part of the database. NSI staff can directly update the provided data, independently apply confidentiality measures to the data and set the permissions for specific research projects. This way the data would be up to date and easy to control by the respective NSIs, which presumably have more expertise in handling the data of their own country. After researchers have finished their calculations, the output could be checked directly by the involved NSIs, which presumably also have more knowledge of potential confidentiality threats in their country data. In this scenario the staff at the central node (Eurostat) would only have to maintain a database with the connection parameters of the country databases and supervise the status of the system and the timely fulfillment of the NSIs' tasks.

As shown on the right side of Figure 1, researchers could get access to the country part of the database locally and, as before, to the combined European database via remote access to Eurostat, either from a safe center or from an accredited office. The distributed database management system at the central node would then provide the database assembled from the respective NSIs according to the central directory. No actual dataset needs to be held at Eurostat.

In the scenario described above, horizontal partitioning is applied to the full database using the data-origin country attribute as the only partitioning key. Virtual private networks including encrypted transmission tunnels could be set up between each of the NSIs and Eurostat. Eurostat can thereby build on the experience with VPNs in connection with remote access systems in some of the Member States (e.g. France, Netherlands). On the user side, a move away from the currently used flat, character-separated files is feasible, since the most common statistical packages used by researchers also allow connections, e.g. via ODBC, to all kinds of databases.
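The interplay of central directory and country fragments can be sketched as follows. This is a minimal illustration, not a proposed architecture: in-memory SQLite databases stand in for the NSI databases, the table and column names are hypothetical, and the directory holds open connections where a real system would hold VPN connection parameters.

```python
import sqlite3


def make_fragment(country, rows):
    """Create an in-memory database standing in for one NSI's country fragment."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE microdata (id INTEGER, country TEXT, income REAL)")
    conn.executemany(
        "INSERT INTO microdata VALUES (?, ?, ?)",
        [(row_id, country, income) for row_id, income in rows],
    )
    return conn


# Central directory: records only how to reach each fragment, never the data.
DIRECTORY = {
    "AT": make_fragment("AT", [(1, 31000.0), (2, 45000.0)]),
    "DE": make_fragment("DE", [(3, 28000.0)]),
    "FR": make_fragment("FR", [(4, 39000.0)]),
}


def federated_query(directory, sql, params=()):
    """Run the same query on every fragment and concatenate the result rows.

    Fragments that cannot be reached are skipped and reported, so one
    failing site does not block access to the remaining country parts.
    """
    rows, unavailable = [], []
    for country, conn in directory.items():
        try:
            rows.extend(conn.execute(sql, params).fetchall())
        except sqlite3.Error:
            unavailable.append(country)
    return rows, unavailable
```

A call such as `federated_query(DIRECTORY, "SELECT id, income FROM microdata WHERE income > ?", (30000,))` returns the matching rows from all reachable fragments as if they came from one table. Simple concatenation is only correct for row-level queries. Aggregates such as averages would have to be computed at the central node from partial results.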

Figure 2: Comparison of current workflow and situation with distributed database system

3.3. Chances

Using a process involving a distributed database management system instead of the current workflow described above could 1) improve the data quality, 2) improve the data security and availability, 3) give NSIs full control over their country data and 4) lower costs:

1) Data quality is improved by the increased timeliness of the data. NSIs can update their part of the database individually or correct errors without a cumbersome process through Eurostat. Furthermore, the local staff possess more knowledge about the properties and features of the data and have to work with a much smaller amount of data than the staff at Eurostat. Thus the adjustment process to meet the respective table formats will lead to less unneeded data suppression/loss.

2) The distributed architecture would increase data security, provided that, in addition to secure transmission channels, the security measures already existing today (firewalls, encrypted storage devices, separation of production and secure environment, etc.) are in place and maintained. An intruder would never get access to the whole database, but only to specific country parts. Furthermore, availability will increase, since a single site failure would affect only the availability of that country's dataset and not the performance of the system as a whole.

3) The country data would never physically leave the statistical offices. The NSIs always have full control over their data, can set permissions according to their policies and can monitor the activity as well as the security of their data.

4) In total the costs will decrease, because NSIs can perform their tasks more efficiently: they are more familiar with their parts of the database, and the volume of each dataset is much smaller. At the central node the workload and costs will decrease significantly because the former central tasks are distributed. New costs will arise in connection with the maintenance of the central directory and network.

3.4. Challenges

For the establishment of a distributed database system, additional components would be necessary:

• At the NSIs the database infrastructure required to run a part of the database needs to be created. In many offices such a system is already running in order to provide local researchers with country data. These remote database fragments must be secured physically and against attacks on the remote site.

• A secure communication tunnel between each of the NSIs and Eurostat has to be established and maintained. At Eurostat as well as at some NSIs, VPN technology is already used for remote access facilities, e.g. for home workers or researchers.

• At Eurostat a central directory needs to be built and maintained which holds all information about the location of the several database parts. There could be different entries for different databases from one NSI.

• At least the database and server communication and interfaces should be standardized in order to enable the combination of the several parts. Furthermore, standards for database structure, language/system/software or commands should in principle already exist due to the current data combination process. Nevertheless, e.g. Sheth and Larson (1990) show that such standards are not essential for the functioning of such a system.

• Although officials are currently finishing the work to renew the regulatory basis for access to confidential data (see task force on Regulation EC 831/2002), the distributed database system would need some additional legal rules. In some countries, though, the fact that the country data are never physically given away could even make the introduction of the system easier.

In general, distributed database technology is relatively new compared to other IT systems. Establishing such an infrastructure would therefore need special expertise, which could be found e.g. in big research networks relying on huge distributed database networks.

4. Conclusions

This paper introduces the main technical concepts needed for the establishment of a distributed database system. Building on that, it sketches a possible implementation in the European Statistical System. This implementation is realistic only in the long run, since technical and legal changes at European and country level would be necessary. In countries with similar workflows, however, the solution could be of interest earlier, especially if large parts of the needed infrastructure and legal basis are already in place. Such a system could offer lower costs, higher data quality and full control by the local branches over the local data. Currently an ESSnet project is discussing the possibilities of decentralized access to confidential microdata in Europe, with a focus on short- and medium-run solutions. An investigation of a distributed database architecture or similar technologies could follow from the results of this project. In the meantime, expertise with distributed databases will grow in the industry and further technical solutions will emerge. Automatic output checking or de-identification of data would further reduce the workload of the statistical institutes.

References

Agrawal, S., Narasayya, V.R., Yang, B. (2004) Integrating Vertical and Horizontal Partitioning Into Automated Physical Database Design, in: SIGMOD '04: Proceedings of the 2004 ACM SIGMOD international conference on Management of data, 359-370.

Bernstein, P. A., Goodman, N. (1981) Concurrency Control in Distributed Database Systems, Computing Surveys, Vol. 13, No. 2.

Ceri, S., Navathe, S., Wiederhold, G. (1983) Distribution Design of Logical Database Schemas, IEEE Transactions on Software Engineering, Vol. SE-9, No. 4, 487-503.

Eadon, G., Chong, E. I., Shankar S., Raghavan, A., Srinivasan, J., Souripriya, D. (2008) Supporting table partitioning by reference in Oracle, in: SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, 1111-1122.

Özsu, M.T., Valduriez, P. (1999) Principles of Distributed Database Systems, Prentice-Hall.

Reuter, W.H., Museux, J.-M. (2011) Establishing an Infrastructure for Remote Access to Microdata at Eurostat, in: Privacy in Statistical Databases, Lecture Notes in Computer Science, Volume 6344/2011, 249-257.

Sheth, A., Larson, J. (1990) Federated Databases: Architectures and Integration, ACM Computing Surveys, Vol. 22, No. 3, 183-236.
