A Content-Based Data Masking Technique for A Built-In Framework in Business Intelligence Platform

Osama Ali (Ozkan) and Abdelkader Ouda
Electrical and Computer Engineering
The University of Western Ontario
London, Ontario, Canada
{oali8, aouda}@uwo.ca

Abstract: The implementation of a Business Intelligence (BI) platform has become an important project in many types of organizations, including healthcare. One common implementation is to extract sensitive data from production databases and then load them into an enterprise Data Warehouse (DW). However, the internal privacy breach that occurs when developers, researchers, and testers access the DW is a serious threat that should be taken into consideration. Reversible data masking techniques are used to de-identify the sensitive data, preserve the data format, and maintain the quality of data utility (data analytics). In this paper, a practical built-in data masking framework (IMETU: Identify, Map, Execute, Test, and Utilize) is proposed, focusing on the execution and testing modules. The proposed data masking algorithm is derived from the statistical content of the extracted dataset, which is grouped at certain levels (micro-aggregation) that are associated with a numeric attribute. The related statistical information is combined within a mathematical formula to generate the new masked value, and the statistical variables are then put together in sequence and encapsulated to form a strong pair of public and secret keys. This strengthens security while introducing small overheads in performance and space in comparison with native encryption techniques.

Keywords: Business Intelligence; Data Masking; Software Framework; Data Warehouse; Anonymization; Data Security and Privacy; Health Data.

I. INTRODUCTION

Advances in BI systems have increased the capability to integrate, store, and utilize sensitive operational data, such as Personally Identifiable Information (PII) and Personal Health Information (PHI). Security breaches of data, whether internal or external, are on the rise. These breaches stem from hacker attacks or missteps by business associates [1], and a study [2] shows that more than 70% of breaches are caused by internal users (e.g., curious employees) [11]. As shown in Fig. 1, all the sensitive data (operational and legacy) are saved into the staging data area in the DW [3], [11], which poses a serious threat to privacy when many internal users with different roles and responsibilities have access to the BI system. Traditionally, data privacy has been protected by means of cryptography. However, the need to preserve the data format, so that data remain realistic for end users in the non-production environment, has led to data masking techniques that maintain the utility of data analytics.

  Fig. 1: BI architecture and the location of sensitive data (PII & PHI).

The problem that needs to be solved is how to protect individual privacy at rest in the physical DW layer, comply with regulations, and ensure the usefulness of data analytics. Therefore, proper masking techniques must be chosen to automate the manual data transformation process that de-identifies the sensitive data and preserves the data utility. This automation has led to the design of a built-in data masking framework within the BI platform for healthcare data. In this paper, we continue to develop the conceptual data masking framework called IMETU. The framework consists of five modules (Identify, Map, Execute, Test, and Utilize), as shown in Fig. 2; the first two modules were covered during the first phase of our investigation [11]. The second phase, which includes the third and fourth modules, is described and analyzed in this paper. A new practical masking approach is proposed as an alternative to traditional masking or standard encryption techniques for preventing individual re-identification. The proposed built-in data masking framework, including the new masking technique, is required to be available in the early integration service stage (ETL process), as well as in the analytical layer to re-identify the sensitive data in a simple way. Fig. 2 describes the general context diagram of the five modules of the IMETU framework [11].

Fig. 2: General context diagram of the proposed data masking framework within the BI platform

The rest of this paper is organized as follows: Section II introduces the related works on data masking techniques. Section III presents the empirical assessment using health data to describe the problem. Section IV assesses the execution and tests the new data masking algorithm within the framework. Section V concludes the paper.

II. RELATED WORKS ON DATA MASKING TECHNIQUES

Traditionally, data masking has been used to solve data privacy issues in test and non-production environments, such as duplicating databases for developers, researchers, and testers. However, a Gartner Magic Quadrant report from December 2013 extends the scope of data masking technology to more broadly include "data de-identification in production, nonproduction, and analytic use cases" [4]. Research on data masking in general can be classified into two categories:

A. Traditional Data Masking Studies and Industrial Applications (Irreversible)

Ravikumar et al. analyzed traditional data masking techniques for testing purposes in order to design a uniform application architecture that automates the process and reduces the exposure of sensitive data, without taking into account the usefulness of the analyzed data. The architecture is divided into two sub-divisions: Analysis and App&DB. In Analysis, the sensitive data is identified and further processed for the implementation of data masking. In App&DB, an empirical assessment of two masking techniques is carried out. The first masking technique is data shuffling, which does not require any parameter specifications. The second technique is the SBLM procedure with the requirement that Beta2 be a diagonal matrix. Their measurements highlighted the strengths and weaknesses of the assessment of disclosure risk [12], [14].

Min Li et al. analyzed basic algorithms, such as numeric alteration, format-preserving encryption, random substitution, and classic schemes of data masking, and provided a formal definition of data masking according to its working principle and process. The paper also proposed a generic data masking model (without taking into account the quality of data analysis), and described and analyzed the algorithms with a focus on the masking process and on functions that implement secure and effective masking operations. The masked data generated from the generic model are sufficiently similar to the original data that the model can satisfy the requirements of development and testing without the leakage of sensitive data [5].

For Oracle Data Masking, Oracle has developed a comprehensive four-step approach to implementing data masking called Find, Assess, Secure, and Test (FAST). These steps provide out-of-the-box irreversible masking approaches for various types of data attributes [6], as shown in Table (1). This built-in data masking tool is of limited use for the BI-DW platform because we need to retrieve the original data while preserving privacy, although some of its masking techniques might be applicable to certain data attributes.

Table (1): Traditional data masking techniques (irreversible).

| Algorithm | Original data | Masked data | Explanation |
|---|---|---|---|
| Masking out | 552-888-3291 | 552-88#-#### | Health record number's last four characters hashed out |
| Random Data Substitute | Susan Brown | Cindy Arthur | Random data substitution from pre-prepared dataset |
| Date Aging | 2015/05/25 | 2015/05/15 | Date of admission decreased by ± 10 days |
| Numeric Alteration | 8 | 15 | Patient's length of stay value increment by ± 10 |
| Data Shuffling | N6G7K8 | N5V1A1 | Postal code was shuffled based on the same data set |
| Data Shuffling | M6285 | M4433 | Diagnosis code ICD-10 was shuffled |
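As a rough illustration of the techniques listed in Table (1), the following Python sketch implements simplified versions of masking out, random substitution, date aging, numeric alteration, and shuffling. The function names, the `keep` parameter, and the substitution pool are hypothetical choices made for this example; they are not taken from Oracle's tool or from any of the cited works.

```python
import random
from datetime import date, timedelta


def mask_out(value, keep=6, symbol="#"):
    """Keep the first `keep` characters, replace later letters and digits,
    and leave separators such as '-' in place."""
    return "".join(
        ch if i < keep or not ch.isalnum() else symbol
        for i, ch in enumerate(value)
    )


def substitute(pool, rng):
    """Random data substitution: pick a replacement from a prepared pool."""
    return rng.choice(pool)


def age_date(d, rng, max_days=10):
    """Date aging: shift the date by a random number of days within +/- max_days."""
    return d + timedelta(days=rng.randint(-max_days, max_days))


def alter_number(n, rng, max_delta=10):
    """Numeric alteration: add a random offset within +/- max_delta."""
    return n + rng.randint(-max_delta, max_delta)


def shuffle_column(values, rng):
    """Data shuffling: reorder the values of one column across rows."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


rng = random.Random(42)  # fixed seed only to make the sketch repeatable
print(mask_out("552-888-3291"))
print(substitute(["Cindy Arthur", "Maria Lopez", "Dana White"], rng))
print(age_date(date(2015, 5, 25), rng))
print(alter_number(8, rng))
print(shuffle_column(["N6G7K8", "N5V1A1", "N6A3K7"], rng))
```

None of these functions stores the information needed to undo the transformation, which is why the table labels the techniques as irreversible.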

In the SQL Server 2016 DBMS and BI tools, Microsoft added built-in dynamic data masking at the presentation layer, with no changes to the original data [8]. This approach is also not suitable for the BI-DW platform because we need to protect the data at rest. In addition, several individual sample projects have attempted to build ad hoc masking procedures, functions, and simple components for reuse by other developers; these proved difficult to customize and modify, and made process standardization more time consuming [9], [10].

B. Enhanced Data Masking vs. Data Encryption (Reversible)

Santos et al. investigated the best database encryption solutions to protect sensitive data. However, given the volume of data typically processed by DW queries, the existing encryption solutions heavily increase storage space and introduce very large overheads in query response time. They proposed a data masking solution for numerical values in DWs based on the mathematical modulus operator, which can be used with an extra software application layer (not embedded within the BI platform) [7].

Muralidhar and Sarathy provided a method for data shuffling to preserve data confidentiality. The method comprises masking of particular attributes of a dataset that are to be kept confidential, followed by a shuffling step that sorts the transformed dataset and a transformed confidential attribute in accordance with the same rank order criteria. For normally distributed datasets, the transformation was achieved by general additive data perturbation, followed by generating a normalized perturbed value of the confidential attribute using a conditional distribution of the confidential and non-confidential attributes. In another aspect, a software program for accomplishing the method was provided. They claimed that their method provides greater security and utility for the data, and increased user comfort by allowing use of the actual data without identifying its origin. However, they used a complex multi-pass statistical process to mask and retrieve the original data [5].

III. EMPIRICAL ASSESSMENT USING SIMULATED HEALTH DATA

To demonstrate the classical masking problem, we used simulated health data that summarize an Emergency Department's (ED) daily visits (with no patient names or health card numbers) [13]. The proposed ED data set consists of 9 attributes, with observations reflecting 30 days of summarized records (approximately 3000 patient visits), as shown in Table (2):

Table (2): Simulated Emergency Department summary data.
| Data Attribute | Description | Data Type | Value or range |
|---|---|---|---|
| VisitDate | Emergency Visit Date | Date | 2015-05-30 |
| ED_Visits | Emergency Visit Number | Integer | 125 |
| ED_CTAS1 | Number of patients with Triage Level 1 (Resuscitation) | Integer | 3 |
| ED_CTAS2 | Number of patients with Triage Level 2 (Emergent) | Integer | 28 |
| ED_CTAS3 | Number of patients with Triage Level 3 (Urgent) | Integer | 48 |
| ED_CTAS4 | Number of patients with Triage Level 4 (Less urgent) | Integer | 39 |
| ED_CTAS5 | Number of patients with Triage Level 5 (Non-urgent) | Integer | 2 |
| IP_Admits | Number of patients being admitted to Inpatient units | Integer | 16 |
| ED_ALOS_AllDisp | Average Length Of Stay-LOS (Wait Time) for All Disposition Status | Real | 4.18 hours |
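As a rough illustration of how random-variance masking of summary data of this kind affects aggregates, the sketch below masks a column of daily visit counts and then recomputes SUM and AVG. The 30 daily values are generated at random for this example and are not the paper's dataset; the function name `numeric_variance` and the ±10 range are likewise assumptions.

```python
import random
from statistics import mean


def numeric_variance(values, rng, max_delta=10):
    """Add an independent random offset in [-max_delta, +max_delta] to each
    value, never going below zero."""
    return [max(0, v + rng.randint(-max_delta, max_delta)) for v in values]


rng = random.Random(2015)  # fixed seed only to make the sketch repeatable

# Hypothetical ED_Visits values for 30 days (not taken from the paper).
original = [rng.randint(90, 140) for _ in range(30)]
masked = numeric_variance(original, rng)

# Compare the aggregates an analyst would compute on each dataset.
print("SUM original:", sum(original), " SUM masked:", sum(masked))
print("AVG original:", round(mean(original), 2),
      " AVG masked:", round(mean(masked), 2))

# Count how many daily values no longer match the original.
changed = sum(o != m for o, m in zip(original, masked))
print("rows changed:", changed, "of", len(original))
```

Because each offset is drawn independently, nothing ties the masked aggregates to the original ones, and the original values cannot be recovered from the masked column.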

From the sample data set, we applied the "Date and Numeric Variance" algorithm using the random number generation function within SQL. We then applied statistical functions, such as SUM and AVG, at the programming level to get the final results. The results in Fig. 3 show large discrepancies between the original and masked datasets: the two are not even similar, and the masked data did not produce the same outcome. This means that classical data masking algorithms have a significant negative impact on data accuracy and utility.

Fig. 3: Visual outcome of the analysis of original data vs. masked data

IV. A CONTENT-BASED DATA MASKING TECHNIQUE, COBAD

As presented in the previous section, the traditional masking technique offers very little data utility. A masking technique is therefore needed that balances minimizing privacy disclosure against maintaining data utility. Within the recommendation of the IMETU data masking framework, we aim to propose an acceptable column-based masking algorithm for numeric, date, and alphanumeric data attributes. The proposed data masking algorithm is derived from the statistical content of the extracted dataset (T is a data table made up of a set of columns C1, C2, …, CN and rows R1, R2, …, RN), which is grouped at certain levels (micro-aggregation) based on a selected numeric attribute within the dataset. The associated statistical variables (COUNT, MAX, MIN, SUM, AVG, STDDEV, VAR, CHECKSUM, etc.) are combined within our proposed mathematical formulas (1, 2, and 3) so as to generate the new masked value at three different complexity levels. The masked value is generated by using consecutive MOD (%) and XOR (^) operators (notation taken from SQL Server script), as follows:

• Simple Formula, using 16 bytes (128 bits) of variables … (1)
• Medium Formula, using 24 bytes (192 bits) of variables … (2)
• Hard Formula, using 32 bytes (256 bits) of variables … (3)

[Equations (1) to (3): the operands are not recoverable from the source text; for each formula only the operator sequence survives, namely two MOD (%) operations followed by one XOR (^).]
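The general idea described above (derive statistics per sub-group, combine them with MOD and XOR, and mix in a private key) can be sketched in Python as follows. This is not a reproduction of formulas (1) to (3): the choice of the four statistics, their order, the 32-bit packing that yields a 16-byte key, the small private key, and all function names are assumptions made for illustration, and the construction has not been reviewed for security.

```python
import struct

PRIVATE_KEY = 0x5A17C0DE  # hypothetical small key; the paper uses a 128-bit key


def group_stats(rows, group_attr, numeric_attr):
    """Micro-aggregation: COUNT, MIN, MAX, and SUM of numeric_attr per sub-group."""
    stats = {}
    for row in rows:
        v = row[numeric_attr]
        s = stats.setdefault(row[group_attr],
                             {"count": 0, "min": v, "max": v, "sum": 0})
        s["count"] += 1
        s["min"] = min(s["min"], v)
        s["max"] = max(s["max"], v)
        s["sum"] += v
    return stats


def pack_public_key(s):
    """Encapsulate four 32-bit statistics into 16 bytes (128 bits)."""
    return struct.pack(">4I", s["count"], s["min"], s["max"], s["sum"])


def pad_from_keys(public_key, private_key):
    """Assumed combination: two consecutive MODs followed by one XOR."""
    count, _low, high, total = struct.unpack(">4I", public_key)
    return ((total % (high + 1)) % (count + 1)) ^ private_key


def mask(value, public_key, private_key=PRIVATE_KEY):
    """XOR the value with the derived pad."""
    return value ^ pad_from_keys(public_key, private_key)


def unmask(masked, public_key, private_key=PRIVATE_KEY):
    """XOR is its own inverse, so the same pad restores the original value."""
    return masked ^ pad_from_keys(public_key, private_key)


# Hypothetical rows: statistics come from ED_Visits, IP_Admits is the masked column.
rows = [
    {"month": "2015-05", "ED_Visits": 125, "IP_Admits": 16},
    {"month": "2015-05", "ED_Visits": 118, "IP_Admits": 12},
    {"month": "2015-06", "ED_Visits": 131, "IP_Admits": 19},
]
stats = group_stats(rows, "month", "ED_Visits")
for row in rows:
    row["public_key"] = pack_public_key(stats[row["month"]])
    row["IP_Admits_masked"] = mask(row["IP_Admits"], row["public_key"])
    print(row["month"], row["IP_Admits_masked"],
          unmask(row["IP_Admits_masked"], row["public_key"]))
```

Storing the packed statistics next to each row is what makes the mask reversible in this sketch: a reader who holds the private key can rebuild the pad from the stored bytes, while a reader without it sees only the masked value and the group statistics.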

Note: Adding a private key KP to the masking formula increases the protection of the new masked value. This predefined 128-bit private key is saved in a black box and is accessible only through the querying process. The detailed processes of the "Execute" and "Test" modules are organized in the diagram shown in Fig. 4. Moreover, the statistical variables are put together in sequence and encapsulated to construct a public key KPb based on the sub-group of data; the key is saved for each row in a new column within the new dataset, as shown in Fig. 5. The sequence in this key will be used to re-identify the original value at the query stage of analysis services.

Fig. 4: UML Activity Diagram for the "Execute" and "Test" modules in the Data Masking Framework for BI platform

Fig. 5: Constructing Public Key from derived statistical variables

V. CONCLUSION

This research has highlighted the poor data utility of traditional data masking techniques within the BI analytic platform, taking into consideration the importance of data privacy. It has also outlined the main modules of the proposed built-in data masking framework (IMETU: Identify, Map, Execute, Test, and Utilize). Furthermore, this research has focused on the "Execute" and "Test" modules of a new proposed data masking technique (COBAD) based on the statistical content derived from the loaded dataset at aggregated levels. From these statistical variables and the pre-stored private key, a data masking formula has been created and applied to protect the selected sensitive data. The strength of this technique relies on the following factors:

• Manual or random selection of a sensitive numeric attribute to be used for the statistical analysis in Phase 1.
• Selection of the granularity levels of the aggregation process that is applied to the loaded dataset to determine the statistical variables of each sub-group.
• The number of statistical variables used in the masking formula, and the order in which they are combined with the MOD and XOR operators, which add different complexity levels to a function that cannot be inverted other than by a brute-force attack.
• The encapsulation process that constructs the public key from a sequence of the selected statistical variables; the complexity of the de-identification calculation depends on the size of the generated public key (Simple 128-bit, Medium 192-bit, Hard 256-bit).
• The use of a pre-stored private key (128-bit) within the masking formula, which increases the protection of the sensitive data attributes in the DW.

REFERENCES

[1] M. K. McGee, "Biggest Health Data Breaches in 2014", Data Breach Today (HealthInfoSec), December 22, 2014.
[2] Dataguise Inc., "Why Add Data Masking To Your Best Practices For Securing Sensitive Data", White Paper, p. 1, 2010.
[3] R. Guro, "Components of business intelligence", The Business Intelligence Guy [Online], 2011. Available: http://www.the-businessintelligenceguy.com/components-of-business-intelligence-bi/
[4] Voltage Security Solution for Data De-Identification, White Paper, 2014. Available: https://www.voltage.com/wpcontent/uploads/Voltage_UC_Data_De-Identification.pdf
[5] K. Muralidhar and R. Sarathy, "Data shuffling procedure for masking data", Patent US 7200757 B1, April 2007.
[6] Oracle Inc., "Data Masking Best Practices", White Paper, July 2010.
[7] R. J. Santos, J. Bernardino, M. Vieira, "A Data Masking Technique for Data Warehouses", ACM Communication, IDEAS11 2011, September 21-23, Lisbon, Portugal.
[8] Microsoft Inc., "Dynamic Data Masking", SQL Server 2016, https://msdn.microsoft.com/en-CA/library/mt130841.aspx
[9] R. Dobson, "Masking Personal Identifiable SQL Server Data", November 2013, https://www.mssqltips.com/sqlservertip/3091/maskingpersonal-identifiable-sql-server-data/
[10] Z. Arthur, "SSIS Data Flow Transformation Component To Provide Basic Data Masking Capabilities", Feb 2012, http://ssisdatamasker.codeplex.com/
[11] O. Ali, A. Ouda, "A Classification Module in Data Masking Framework for Business Intelligence Platform in Healthcare", IEEE 7th Annual Conference (IEMCON), Vancouver, Canada, 2016.
[12] G. K. Ravikumar, B. Justus Rabi, Dr MGR University, "Experimental Study of Various Data Masking Techniques with Random Replacement using data volume", (IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 8, August 2011, pp. 154-158.
[13] O. Ali, H. Johnson, P. Crvenkoviski, "Using a Business Intelligence Data Analytics Solution in Healthcare", IEEE 7th Annual Conference (IEMCON), Vancouver, Canada, 2016.
[14] G. K. Ravikumar, T. N. Manjunath, S. Hegadi Ravindra, I. M. Umesh, "A Survey on Recent Trends, Process and Development in Data Masking for Testing", (IJCSI) International Journal of Computer Science Issues, Vol. 8, Issue 2, March 2011.
