Making Big Data, Privacy, and Anonymization work together in the Enterprise: Experiences and Issues
Jeff Sedayao, Rahul Bhardwaj, Intel Corporation, Santa Clara, United States
[email protected],
[email protected]
Nakul Gorade, University of Texas at Dallas, Dallas, United States
[email protected]

Abstract— Some scholars feel that Big Data techniques render anonymization (also known as de-identification) useless as a privacy protection technique. This paper discusses our experiences and the issues encountered when we successfully combined anonymization, privacy protection, and Big Data techniques to analyze usage data while protecting the identities of users. Our Human Factors Engineering team wanted to use web page access logs and Big Data tools to improve the usability of Intel's heavily used internal web portal. To protect Intel employees' privacy, they needed to remove Personally Identifying Information (PII) from the portal's usage log repository, but in a way that did not affect the use of Big Data tools for analysis or the ability to re-identify a log entry in order to investigate unusual behavior. To meet these objectives, we created an open architecture for anonymization that allowed a variety of tools to be used for both de-identifying and re-identifying web log records. In the process of implementing our architecture, we found that enterprise data has properties different from the standard examples in anonymization literature. Our proof of concept showed that Big Data techniques could yield benefits in the enterprise environment even when working on anonymized data. We also found that, despite masking obvious PII like usernames and IP addresses, the anonymized data was vulnerable to correlation attacks. We explored the tradeoffs of correcting these vulnerabilities and found that User Agent (browser/OS) information strongly correlates to individual users. While browser fingerprinting has been known before, it has implications for tools and products currently used to de-identify enterprise data. We conclude that Big Data, anonymization, and privacy can be successfully combined, but that doing so requires analysis of data sets to make sure that the anonymization is not vulnerable to correlation attacks.

Keywords- Anonymization, de-identification, Big Data, Hadoop, privacy, encryption, tokenization
I. INTRODUCTION
Anonymization, also known as de-identification, has been used as a way to protect privacy in data sets stored in clouds and medical databases. One of the largest issues that organizations have with Cloud Computing, particularly in public Clouds, is protecting the confidentiality of data. Anonymization would seem to solve this problem by transforming data so that it can be processed without concern about who looks at it. Another issue is jurisdictional requirements for personally identifiable information that mandate storing and processing that data only in designated geographies. Anonymization can be used to deal with those requirements while allowing the transformed data to be processed where desired.

Intel has a heavily used internal web portal called Circuit. Our Human Factors Engineering team wanted to use Big Data techniques to optimize Circuit's layout and to improve employee productivity. For example, if they found that a frequently used area on Circuit averaged ten clicks to reach, they could put a link to that area on the portal's opening page or on a list of frequently used links. While that kind of data was available in web usage logs, privacy concerns about individuals and their usage prevented the human factors team from pursuing this. To answer these privacy concerns, anonymization seemed like an obvious solution, but some open questions remained. In a world of Big Data tools, would anonymization fail to preserve privacy, as some suggest? If privacy was preserved, would our results still be useful to an enterprise? We decided to try using anonymization and see if we could answer those questions. Our experiences implementing anonymization in an enterprise context proved to be different from what is described in the anonymization literature.

This paper is a case study of anonymization deployment in an enterprise, describing requirements, implementation, and experiences encountered when using anonymization to protect privacy in enterprise data analyzed using Big Data techniques. Section II covers relevant work in this area, which we compare and contrast with our work. Section III describes in more detail the problems our human factors team faced, both with improving the portal and with protecting privacy. Section IV goes over our solution design, followed by Section V, where we describe our implementation experiences. Section VI talks about how we measured the quality of our anonymization and the tradeoffs that we encountered while improving that quality. Section VII discusses the lessons learned, the future work planned, and the issues that remain.
II. PREVIOUS WORK
Analyses of the general effects of Big Data technologies on anonymization have been done by legal scholars. Ohm [1] claims that computer science findings render anonymization useless, a claim which would seem to be confirmed by the case of an anonymized Netflix dataset being compromised through correlation with the Internet Movie Database [2]. Ohm also contends that Big Data benefits are underwhelming [3], while Tene and Polonetsky [4] take a more nuanced look at the benefits of big data versus its negative effects on privacy. These studies do not provide technical solutions to anonymization problems in the enterprise, but Ohm points out [1] the inherent tradeoff in anonymization: utility versus privacy.

CAIDA has catalogued open source tools for anonymizing network traces [5]. A number of general open source tools for anonymization are available, such as the Cornell Anonymization Toolkit [6] and ARX [7]. Toolkits that work with Big Data tools, like the Hadoop Anonymization Toolkit [8], emerged after we completed our work. A thorough treatment of the IT governance, usage models, and available techniques for applying anonymization in the enterprise has been done by Raghunathan [9]. Vinogradov and Pastsyak [10] published an evaluation of data anonymization tools for enterprises. Neither of these studies discusses evaluating how well anonymization is actually done, something that we do in this paper. Ross et al. [11] look at anonymization as a possible method for solving cloud computing data security concerns but do not discuss measuring the quality of anonymization. We had also done an implementation that used anonymization to secure data stored in a cloud Software as a Service for maintaining log files [12]. In that work, we anonymized IP addresses and analyzed data in the cloud, but it did not involve true enterprise data, Big Data techniques, or measuring the quality of anonymization. Another body of work looks at issues with anonymizing enterprise network data [13,14]. These efforts do look at the quality of anonymization, albeit in the specific area of network traces. Medical data has traditionally been anonymized [15] to enable research while protecting privacy. Standard metrics have been established in the medical data space to measure the utility of data and the level of privacy, and we utilize these in our work in the enterprise environment.
III. HOW OPTIMIZING A WEB PORTAL REQUIRED ANONYMIZATION
An important web site within Intel is Circuit, our internal web portal. Many internal employee applications, like expense reporting and health benefits management, are launched through it, as are searches for internal information. Since it is heavily used, improving its usability could generate large productivity improvements.
Intel's Human Factors Engineering group (HFE) wanted to use Circuit web page access logs to improve the Circuit user experience. For example, if the logs indicated that users took many clicks to reach a commonly used application, employee productivity could be improved by placing a link to that application on Circuit's main page. Other questions, like "what do Intel workers search for?", could be answered through analyzing the logs. Table 1 contains a list of use cases and questions that HFE wanted to investigate.

The Circuit usage logs that we were allowed to use for our proof of concept were between 250 and 300 megabytes for a weekday (less on weekends), and administrators had a number of years of these logs on hand. Ideally, these files would be kept in a central repository where HFE analysts could use big data tools like Hadoop to investigate usage. This was not possible because Circuit usage logs contain individuals' user names and system IP addresses. Curious HFE staff or system administrators should not be able to look up individual employee usage patterns. Also, the privacy of those searching for topics like "Open Door Policy", which could indicate that an employee has sensitive personnel issues, needed to be protected. In contrast, there are circumstances when Intel needs to know who executed a particular search or looked at a particular page. Security investigations and the need to investigate the reasoning behind particular actions are two such circumstances.
TABLE I. USE CASES FOR THE STUDY OF CIRCUIT LOGS
Use Case: Description

Aggregate page hits: Per user, how often do they hit the Circuit website as a whole? Need to identify users and aggregate by calendar month.
Session Time: What is the aggregate time per session by users?
Aggregate Search vs. Browse: Aggregate hits, by user per month, of search (identified by iSearch in the string) vs. browsed page hits.
Search terms: What search terms were used by users in a given month?
Browse vs. Search in a session: How do users leverage Circuit? Within a session, how do users navigate the environment? Do they browse for items of interest before searching? Do they directly search and then browse through topics?
Search Efficiency: How many different searches do users enter in a session, and what is the variation between searches?
Referrer Pages: Which pages are referred to by what page? Can we see how users get to different information and where they are coming from to get there?
User Demographics: Who are the frequent users? Where are they from? What business group? Test merging CDIS data with Circuit output.
Peak Usage Times: What are the peak use times by users by region? Aggregate at the hour using system time.
Anonymous Circuit and Anonymous Circuit CDIS: Can we bring two anonymized data sources together and get joined information while the data is anonymized? Can we still use simple key encryption, or is tokenization a requirement?
Personally Identifiable Information (PII) would be deleted or somehow obscured in the log data repository. While protecting privacy, the anonymization process should still allow Big Data tools to analyze the log files in useful ways. HFE also wanted the ability to re-identify and rebuild a log entry in order to contact employees who were found to behave unusually. These conflicting goals typify the challenges in using anonymization. Privacy trades off with utility – an increase in one leads to a decrease in the other. We explore these tradeoffs in a later section.
IV. AN OPEN ANONYMIZATION ARCHITECTURE
A requirement not directly related to privacy was to have the anonymization process open and readily accessible to a variety of approaches and tools, especially open source tools. From an IT department perspective, we wanted to avoid dependencies on a single vendor or program. Since we were experimenting with de-identification tools, we wanted to make it easy to switch tools with no impact on existing data.

Given these additional requirements, in addition to the privacy and analysis needs specified by our HFE staff, we created the architecture shown in Figure 1. This anonymization architecture makes de-identification (and re-identification when necessary) happen in a secure enclave. To start the anonymization process, sensitive fields like IP addresses and user IDs in Circuit log files are encrypted using AES [16] symmetric key encryption. Once the usage data has been anonymized, we can safely move it to Hadoop Distributed File System (HDFS) based storage, where it is available to HFE analysts to study Circuit usage. When the analysts need to re-identify log data, the logs can be moved back to the secure enclave and the sensitive fields decrypted with the same symmetric key.

We chose symmetric key encryption because it easily allows multiple tools to work on the data. A set of tools could generate or read the same data as long as they had the same key and used the same encryption mode. An alternative to using a symmetric encryption key across multiple tools was tokenization [17], which maps a string to each item that needs to be de-identified. While this approach is perfectly viable for de-identification, in order for multiple de-identification and re-identification tools to interoperate, a large token table would need to be maintained and accessed by the tools.
Fig. 1: Anonymization Architecture
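To make the de-identification and re-identification steps concrete, the following is a minimal sketch of the kind of field-level encryption the architecture calls for. It assumes Python and the third-party `cryptography` package; the helper names (`encrypt_field`, `decrypt_field`) and the choice of AES in ECB mode over padded values are illustrative assumptions rather than the production configuration. A deterministic mode is shown only because it keeps identical identifiers mapping to identical ciphertexts, which preserves the ability to group and count anonymized values.

```python
# Minimal sketch of AES-based field de-identification/re-identification.
# Assumptions: Python 3, the "cryptography" package, and a deterministic
# mode (ECB over PKCS7-padded values) so equal plaintexts yield equal
# ciphertexts and can still be grouped and counted after anonymization.
import base64
import os

from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

KEY = os.urandom(32)  # in practice, a shared key held only in the secure enclave


def encrypt_field(value: str, key: bytes = KEY) -> str:
    """De-identify one field value (e.g. cs-username or c-ip)."""
    padder = padding.PKCS7(algorithms.AES.block_size).padder()
    padded = padder.update(value.encode("utf-8")) + padder.finalize()
    encryptor = Cipher(algorithms.AES(key), modes.ECB(), backend=default_backend()).encryptor()
    ciphertext = encryptor.update(padded) + encryptor.finalize()
    return base64.urlsafe_b64encode(ciphertext).decode("ascii")


def decrypt_field(token: str, key: bytes = KEY) -> str:
    """Re-identify a field value; done only inside the secure enclave."""
    ciphertext = base64.urlsafe_b64decode(token.encode("ascii"))
    decryptor = Cipher(algorithms.AES(key), modes.ECB(), backend=default_backend()).decryptor()
    padded = decryptor.update(ciphertext) + decryptor.finalize()
    unpadder = padding.PKCS7(algorithms.AES.block_size).unpadder()
    return (unpadder.update(padded) + unpadder.finalize()).decode("utf-8")


if __name__ == "__main__":
    token = encrypt_field("jcsedaya")
    assert decrypt_field(token) == "jcsedaya"
```

With a shared key and a fixed mode, any tool in the enclave can reproduce or reverse these tokens, which is the interoperability property the architecture requires.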
Another thing that we needed to design was how to handle the format of the log files. Circuit uses an extended Webtrends [18] format, with the fields defined as follows:

date time c-ip cs-username cs-host cs-method cs-uri-stem cs-uri-query sc-status sc-bytes cs-version cs(User-Agent) cs(Referer)

Obvious PII are c-ip (IP address) and cs-username (the user's official Intel user name), so it might seem that the anonymization process would simply be to encrypt those two fields. But a look at an actual example of a Circuit log entry, displayed in Figure 2, shows that de-identifying the files is not so simple. The user name jcsedaya occurs not only in the cs-username field but repeatedly in the cs-uri-query field! We also found that user names could be embedded in the cs(Referer) field, and IP addresses have also been known to reoccur multiple times within a record.

The canonical model for performing anonymization [19] expects data organized into relational database style tables. Some columns are identifiers, like user name and IP address, which can directly identify a person. Other attributes are quasi-identifiers, which could potentially lead to identification of the person whose behavior is recorded, while others are data fields of interest. The model does not deal with identifiers embedded within quasi-identifiers or even in data fields of interest. Our challenge was to transform Circuit log records so that each record's fields could fit into that canonical format.

The fields where IP addresses and user names reoccur are not free-form text fields, as they contain enough structure to distinguish where these identifiers occur. We put a field reference (e.g. 1 or 2) wherever an identifier occurs within a field that is not an identifier field. An identifier is encrypted once, and later occurrences in that record refer back to the encrypted field. In the Circuit log files, a username could occur multiple times in a single record. Since the username is the second encrypted field (the IP address is the first), we replace its first occurrence with its encrypted equivalent. Each later occurrence of the user name gets replaced with a pointer of "2" (since it is the second encrypted field). Figure 3 shows how this would work based on the log entry displayed in Figure 2.
Fig. 2: Example of a Circuit log entry
The field references and the encrypted values are shown in bold. Note that we delimit the anonymized fields with %%, as that sequence is not used by any field in the Circuit log file. Using field references has the property that all the quasi-identifying fields will be the same if two people looked at the same web page at the same time. This property improves the resiliency of the data to attack. An obvious alternative to using these field references would be to simply delete the additional occurrences of the user name or IP address. Doing that would make it difficult to recover the full log record should re-identification be required, so we did not pursue that option.
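The sketch below illustrates the reference-pointer substitution on a single record. It reuses the hypothetical `encrypt_field` helper from the previous sketch; writing back-references as %%N%% tokens is our illustrative reading of the pointer scheme described above, not the exact production format, and the parsing assumes well-formed, space-separated records.

```python
# Sketch of reference-pointer de-identification for one Circuit log record.
# Identifier fields (c-ip, cs-username) are encrypted once; any later
# occurrence of the same value elsewhere in the record is replaced by a
# back-reference (%%1%%, %%2%%, ...) to the Nth encrypted field.
from circuit_anon import encrypt_field  # hypothetical module holding the earlier helper

FIELD_NAMES = [
    "date", "time", "c-ip", "cs-username", "cs-host", "cs-method",
    "cs-uri-stem", "cs-uri-query", "sc-status", "sc-bytes", "cs-version",
    "cs(User-Agent)", "cs(Referer)",
]
IDENTIFIER_FIELDS = ["c-ip", "cs-username"]  # assumed PII fields


def anonymize_record(line: str) -> str:
    # Assumes a well-formed record with one space-separated value per field.
    fields = dict(zip(FIELD_NAMES, line.rstrip("\n").split(" ", len(FIELD_NAMES) - 1)))
    for ref_number, name in enumerate(IDENTIFIER_FIELDS, start=1):
        plaintext = fields[name]
        fields[name] = "%%" + encrypt_field(plaintext) + "%%"
        # Replace embedded occurrences (e.g. in cs-uri-query or cs(Referer))
        # with a pointer back to the Nth encrypted field.
        for other in FIELD_NAMES:
            if other not in IDENTIFIER_FIELDS and plaintext in fields[other]:
                fields[other] = fields[other].replace(plaintext, "%%" + str(ref_number) + "%%")
    return " ".join(fields[name] for name in FIELD_NAMES)
```

Because the substitution is purely positional, re-identification can reverse it by decrypting the delimited fields and expanding each %%N%% pointer back to the recovered value.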
V. IMPLEMENTATION EXPERIENCES
Our initial implementation of anonymization was done in Perl and processed all log files serially. Since this first program also implemented AES in Perl, anonymizing a day's logs took hours. Later, a much faster implementation in Python was written, reducing the processing time to minutes. We then leveraged Hadoop, writing Pig Latin [20] code to process the input data in minutes and in a more scalable fashion.

After making improvements in the encryption process, we implemented two use cases from Table 1 with anonymized Circuit log data from October 2012. We felt that one month of data would be at sufficient scale to validate our efforts. The two use cases were 1) what search terms employees used and 2) how much individual users use the Circuit web site.

Per the first use case, Table 2 contains the top eight searches from October 2012. This data was easily found using HIVE [21]. October is typically when influenza vaccines are given in the United States. The queries for "recognition awards" and "recognition" occurred because Intel changed its recognition award vendor during that time. This table can be used by Intel management to understand what concerns are "trending" within Intel's employee base.
Fig. 3: Anonymized version of the log record shown in Figure 2.
TABLE II. TOP 8 SEARCHES ON CIRCUIT FROM OCTOBER 2012

Search Term          Occurrences
Recognition          2308
Plts                 1294
Flu shots            1120
Mft                  892
Recognition awards   861
PLTS                 824
MFT                  818
Health for Life      783
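As an illustration of the first use case, the following minimal sketch pulls the search term out of the cs-uri-query field and counts occurrences. The query parameter name ("q") is an assumption made for the example; the paper only states that searches are identified by "iSearch" in the string, and the actual analysis was a HIVE query over the anonymized table rather than this Python code.

```python
# Sketch: top search terms from anonymized Circuit logs.
# Assumption: search requests are identified by "iSearch" in the URI and the
# search term travels in a query parameter here called "q" (illustrative name).
from collections import Counter
from urllib.parse import parse_qs


def top_searches(records, n=10):
    counts = Counter()
    for fields in records:  # each record is a dict keyed by log field name
        if "iSearch" in fields.get("cs-uri-stem", ""):
            params = parse_qs(fields.get("cs-uri-query", ""))
            for term in params.get("q", []):
                counts[term.strip()] += 1
    return counts.most_common(n)
```

Terms are counted as-is here; normalizing case would merge entries such as "Plts" and "PLTS", which Table II keeps separate.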
Per the second use case, we implemented traffic counts by user. Following the architecture of Figure 1, we processed the anonymized log files and computed traffic counts of the anonymized user names. HIVE makes this kind of work extremely easy. We took the table of anonymized user names and traffic counts, decrypted the user names, and generated a table of real usernames and traffic counts. Overall, we successfully implemented anonymization software and dealt with two of HFE's use cases, doing both using Big Data tools. That part of the work was complete.
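The second use case reduces to a group-and-count over the encrypted cs-username column, with decryption applied only to the small aggregate table inside the secure enclave. A minimal sketch, reusing the hypothetical `decrypt_field` helper from Section IV:

```python
# Sketch: per-user traffic counts computed on anonymized usernames, then
# re-identified (decrypted) only for the final report inside the enclave.
from collections import Counter

from circuit_anon import decrypt_field  # hypothetical module holding the Section IV helper


def traffic_counts(records):
    # Counts are keyed on the encrypted user names, so this step can run in HDFS.
    return Counter(fields["cs-username"] for fields in records)


def reidentify_counts(counts):
    # Strip the %%...%% delimiters and decrypt; done only in the secure enclave.
    return {decrypt_field(token.strip("%")): hits for token, hits in counts.items()}
```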
VI. MEASURING AND IMPROVING THE QUALITY OF ANONYMIZATION
Even though we successfully implemented anonymization in our environment, one question remained – just how well did we do it? Answering that question is a step that many products in the enterprise space seem to miss when implementing anonymization. Just because we have obscured the user name and IP address does not mean that the data is safe and that individuals cannot be identified. An attacker with side knowledge could potentially correlate that data with some of the clearly visible fields to figure out the identity of someone who generated a particular log line.

One fundamental measure of the quality of anonymization is k-anonymity [19]. This means that for each record in an anonymized file, there are k entries with the same fields that could be used to identify it. A minimum k value of one for a file means that, with side knowledge, there exists a record that could be uniquely identified. Privacy is better the higher the k values are for a set of data. We ran an analysis using HIVE to find the k value for each entry of the October 2012 data after aggregating time stamps into one-hour intervals. Figure 4 shows the distribution of k values. These results show that our initial attempt at anonymizing data left our data set highly vulnerable to correlation attacks.
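A minimal sketch of this k-anonymity measurement follows. The particular quasi-identifier columns shown (hour-truncated time stamp, user agent, and the local-information fields) are our assumption for illustration; the actual analysis was a HIVE aggregation over the anonymized table.

```python
# Sketch: compute the k-anonymity value of each record by grouping on
# quasi-identifier fields after generalizing time stamps to one-hour bins.
from collections import Counter

# Assumed quasi-identifiers for illustration; the real column set depends on
# which fields remain visible in the anonymized log.
QUASI_IDENTIFIERS = ["date", "cs(User-Agent)", "site_code", "time_zone", "user_language"]


def hour_bucket(time_field: str) -> str:
    return time_field.split(":")[0]  # "14:23:05" -> "14"


def k_values(records):
    keys = [
        (hour_bucket(fields["time"]),) + tuple(fields.get(q, "") for q in QUASI_IDENTIFIERS)
        for fields in records
    ]
    group_sizes = Counter(keys)        # how many records share each quasi-identifier tuple
    return [group_sizes[key] for key in keys]  # k for each record, in input order


def k_histogram(records):
    return Counter(k_values(records))  # number of records at each k level, as in Figure 4
```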
Figure 4: Occurrence Distribution of k-anonymity levels
To guide our improvement efforts, we adapted metrics used in anonymizing medical records. We used two metrics for privacy: average risk and maximum risk [15]. For a given record, we say that the probability of disclosure is the inverse of the number of occurrences of that record (or the inverse of the k-anonymity value for the record). The average risk is the average of all the disclosure probabilities and provides a "big picture" measure of disclosure risk. The maximum risk is the highest of all of those disclosure probabilities. It presents a view of the maximum risk to any individual whose behavior is recorded in the data set.

One measure for utility is entropy, which indicates how much information loss happens. Less entropy equates with more utility. An obvious entropy metric is precision, which measures how much information is lost during anonymization. For some measures like aggregating time stamps, precision effects are obvious: a one-minute time stamp aggregation reduces precision to 1/60 of the original, a ten-minute aggregation to 1/600, and a one-hour aggregation to 1/3600. Since that metric did not vary with what we were planning to do (mask fields), we did not track precision. The utility metric that we needed if we considered removing records was completeness. This is the ratio of records that remain in the data set to the original number of records, the inverse of missingness [15]. Ideally we want this to be 1, but completeness (and thus the utility of the data) decreases as we remove records that are vulnerable to correlation.

Once we had metrics defined, we looked at what could make a log entry unique and traceable to the user who looked at Circuit. Figure 2 shows that there are many different fields that can narrow down the possibilities of who a user might be even if the user's name is removed. There is a site code field embedded into the log entry that indicates where a user works (ext.CampusCode), a time zone field (WT.tz), and a user language field (WT.ul), along with other identifiers like a nodeId value. The time stamp of the entry could reveal individual users, and so could the user's browser/OS combination in the user agent field. We decided to change the log file in the following ways in order to improve the anonymization:

• Hide all references to sites, nodes, languages, and other characteristics of the individual (no local info)
• Hide all user agent/browser information (no browser)
• Aggregate timestamps into increasingly large intervals
• Remove log entries above a certain risk level

Using Pig, we first looked at the effects of taking these measures on our month of usage log data. Figure 5 shows the initial results of our experiments. For each time aggregation, eliminating local information improved average risk considerably, as did eliminating browser information. Aggregating time stamps into increasingly large time frames also yielded improvement, although at the cost of losing resolution for timing events. Figure 6 shows what happens to average risk when we eliminate log entries at certain vulnerability levels from data that already has local and browser information removed. A maximum risk of 1 means that we did not delete any entries, a maximum risk of .5 means that we deleted entries that occurred only once (k = 1), and a maximum risk of .33 means that we deleted entries with two or fewer identical occurrences (k ≤ 2). Reducing maximum risk has a very powerful effect on average risk, as more vulnerable log entries are eliminated. Doing this clearly has a cost, as shown in Figure 7. As we decrease maximum risk, completeness goes down. Completeness is less of an issue with a one-hour time stamp aggregation, but as we mentioned before, this results in a significant cost in precision.
Figure 5: The effects of removing data on Average Risk
Figure 6: Improving Average Risk and Maximum Risk
Figure 7: Tradeoffs of Completeness versus Maximum risk
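The privacy and utility metrics behind Figures 5 through 7 reduce to simple arithmetic on the per-record k values. The sketch below, which builds on the `k_values` helper shown earlier, computes average risk, maximum risk, and the completeness cost of dropping high-risk records; the threshold handling is our assumption for illustration.

```python
# Sketch: average risk, maximum risk, and completeness from per-record k values.
def risk_metrics(ks):
    risks = [1.0 / k for k in ks]            # per-record disclosure probability
    average_risk = sum(risks) / len(risks)
    maximum_risk = max(risks)                # equals 1 / min(k)
    return average_risk, maximum_risk


def enforce_max_risk(ks, max_risk):
    # Drop records whose disclosure probability exceeds the target maximum risk,
    # i.e. records with k < 1/max_risk, and report the completeness that remains.
    kept = [k for k in ks if 1.0 / k <= max_risk]
    completeness = len(kept) / len(ks)
    return kept, completeness
```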
VII. CONCLUSIONS, ISSUES, AND FUTURE WORK

When we set out to anonymize usage data for our Human Factors Engineering team, we wanted to know whether anonymization could be combined with Big Data technologies to do something useful in our enterprise. The answer would seem to be a qualified yes. We used Hadoop to analyze the anonymized data and obtain useful results for the Human Factors analysts. At the same time, we learned that anonymization needs to be more than just masking or generalizing certain fields – anonymized datasets need to be carefully analyzed to determine whether they are vulnerable to attack.

We encountered a number of issues and surprises when working on this project. First, we did not expect that structured enterprise data like a web server log file would have identifiers distributed throughout a record, differing so much from the canonical case. We used our reference pointer technique to convert entries into the canonical case without loss of information. Second, we were surprised that browser and user agent information would be so closely tied to individuals. Later, we found that others had previously made the same discovery [22]. This finding has implications for anonymization tools – if browser information leaks into an anonymized data set, it can potentially disclose the individual. Third, we found that anonymization tools destined for the enterprise generally did not seem to consider the quality of anonymization and whether an anonymized data set was vulnerable to correlation attacks.

Given our experiences, we can make the following recommendations when de-identifying enterprise data sets:

• After identifying fields to be obscured, account for all occurrences of those fields. If fields recur throughout a record, our reference pointer technique can hide those occurrences while helping to increase record k-anonymity.
• Look at all fields in a record for their ability to identify an individual. User language, time zone, and screen resolution information may seem harmless but can be used to deduce possible users.
• Web browser information is a special case of the previous recommendation. Consider omitting browser information if possible, and obscure it with either encryption or tokenization if you need to process that information.
• Measure and manage anonymization and information loss metrics. Simply obscuring fields in a record probably will not protect privacy.

Following these recommendations should improve privacy and provide metrics for monitoring and improving both the privacy and the utility of anonymized enterprise data.

Work on the Circuit log data has now passed to an engineering team for production implementation. This team chose to use tokenization instead of encryption for obscuring PII. Tokenization has the advantage of not being vulnerable to any potential weaknesses in encryption algorithms, since there is no mathematical relationship between the original PII and the anonymized values. We have also begun using anonymization on the log files for www.intel.com, Intel's corporate web presence. These files are much larger than the Circuit log files.

Our study of the quality of anonymization used k-anonymity based metrics. While we feel that this is a necessary start, other attacks exist that can cause privacy loss even with reasonable k-anonymity values. Other measures of privacy, like l-diversity [23], can be used to improve privacy, and we want to explore them. Also, our current approach does not look at data longitudinally [15] and does not hide sequential occurrences that can reveal users. For example, if an adversary knew that someone looked at Circuit five times a day, he or she could potentially identify that person even if we aggregated time stamps to one-hour time frames and removed other fields. We need to explore dealing with this threat using techniques already used with medical data [15].

Other future work includes finding and removing quasi-identifier data fields that correlate with other fields. As an example, work site and time zone fields correlate with each other, so there is no need to have both. Managing redundant correlating data could both save storage space and reduce analysis time. We are looking forward to evaluating the Hadoop Anonymization Toolkit when it becomes available. This toolkit also calculates metrics for entropy, which we did not consider in our evaluation, and other measures of privacy like l-diversity.

We also see a number of areas for further research and open source tool development. While we carefully analyzed our data for occurrences such as identifiers distributed through records and applied techniques to handle such problems, it would be good to have tools that do that automatically. We worked with static data sets to calculate our metrics, but we foresee the de-identification process and the privacy and utility metrics calculation needing to happen dynamically as anonymized data is continuously released into big data repositories or into clouds. Finally, it would be
useful to have tools that anonymize appropriately given a set of desired privacy and utility metrics, automating the analysis needed to figure out the best techniques for anonymization.

ACKNOWLEDGMENT

We would like to thank Derrick Schloss for presenting us with an excellent problem to work on.

REFERENCES
P. Ohm, "Broken promises of privacy: Responding to the surprising failure of anonymization," UCLA Law Review, vol. 57, 2010, pp. 1701. [2] A. Narayanan and V. Shmatikov, "How to break anonymity of the netflix prize dataset," arXiv preprint cs/0610105, 2006. [3] P. Ohm, "The Underwhelming Benefits of Big Data," U. Pa. L. Rev. Online, vol. 161, 2013, pp. 339-347. [4] O. Tene, and J. Polonetsky, "Privacy in the age of big data: a time for big decisions," Stanford Law Review Online, vol. 64, 2012: pp. 63. [5] CAIDA, “Anonymization Tools Taxonomy,” July 2009, http://www.caida.org/tools/taxonomy/anonymization.xml. [6] X. Xiao, G. Wang, and J. Gehrke, Interactive anonymization of sensitive data, Proceedings of the 35th SIGMOD international conference on Management of data (SIGMOD '09), Carsten Binnig and Benoit Dageville (Eds.), ACM, New York, NY, USA, pp.10511054. [7] F. Kohlmayer, F. Prasser, C. Eckert, A. Kemper, and K. Kuhn, “Flash: Efficient, Stable and Optimal K-Anonymity,” Proceedings of the 4th IEEE International Conference on Information Privacy, Security, Risk and Trust (PASSAT), 2012. [8] A. Radwan, “Scalable, Flexible Data Privacy in the Cloud,” Hadoop World 2013, October 2013. [9] B. Raghunathan, The Complete Book of Data Anonymization: From Planning to Implementation. Auerbach Pub, 2013. [10] S. Vinogradovand A.r Pastsyak, "Evaluation of Data Anonymization Tools," The Fourth International Conference on Advances in Databases, Knowledge, and Data Applications (DBKDA 2012), 2012.
[11] M. Dave, T. Kohlenberg, S. Purcell, A. Ross, and J.Sedayao, "Some Key Cloud Security Considerations," Intel Technology Journal, vol. 16.4, 2012, pp 112-125. [12] J. Sedayao. "Enhancing Cloud Security Using Data Anonymization." Intel IT, IT@ Intel White Paper. IT Best Practices, Cloud Computing and Information Security, 2012. [13] D. Koukis, A., Spyros, K. Anagnostakis, "On the privacy risks of publishing anonymized IP network traces," Communications and Multimedia Security, Springer Berlin Heidelberg, 2006. [14] B. Ribeiro, et al., "Analyzing Privacy in Enterprise Packet Trace Anonymization," NDSS, 2008. [15] K. El Emam and L. Arbuckle, Anonymizing Health Data, O’Reilly Media, December 2013. [16] J. Daemen, and V. Rijmen, The design of Rijndael: AES-the advanced encryption standard. Springer, 2002. [17] T. Axon, “Understanding and Selecting a Tokenization Solution,” https://securosis.com/assets/library/reports/Securosis_Understanding_ Tokenization_V.1_.0_.pdf. [18] Webtrends reference, http://kb.webtrends.com/articles/Information/What-is-the-WebtrendsEnterprise-Data-Connector-log-file-format-1365447888162. [19] L. Sweeney, “k-anonymity: A model for protecting privacy,” International Journal of Uncertainty, Fuzziness and KnowledgeBased Systems, vol. 10(05), pp. 557-570. [20] A. Gates, et al., "Building a high-level dataflow system on top of Map-Reduce: the Pig experience," Proceedings of the VLDB Endowment 2.2, 2009, pp. 1414-1425. [21] A. Thusoo et al., "Hive-a petabyte scale data warehouse using hadoop," 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010),. IEEE, 2010. [22] P. Eckersley, "How unique is your web browser?" Privacy Enhancing Technologies, Springer Berlin. Heidelberg, 2010. [23] A. Machanavajjhala, et al., "l-diversity: Privacy beyond kanonymity." ACM Transactions on Knowledge Discovery from Data (TKDD) vol. 1.1, 2007, pp. 3.