EUC 2016
An efficient scalable parallelized version of the Mondrian Algorithm

John Christoforidis
Department of Informatics and Telecommunications Engineering
Kozani, Greece
[email protected]

Minas Dasygenis
Department of Informatics and Telecommunications Engineering
Kozani, Greece
[email protected]

Presentation is delivered by: Mrs Tsourma Maria, Dipl. Eng
Introduction
• Information traffic and data generation have increased heavily.
• Privacy is a major issue.
• Combining data from various sources can lead to the exploitation of personal data.
• Anonymization is mandatory to ensure security in published micro-data.
Micro-data
• Micro-data are generated by various companies for research, statistical or other purposes.
• This kind of data is usually in table form, where each row represents a person and each column represents an attribute.

Age  Sex     Zipcode  Illness
19   Male    52261    Flu
23   Female  52221    Nausea
25   Male    52221    Flu
32   Male    52244    Healthy
Exploit example
By combining raw data from various sources, one can gain access to personal information.

Table 1 (left)
Age  Sex     Zipcode  Illness
19   Male    52261    Flu
23   Female  52221    Nausea
25   Male    52221    Flu
32   Male    52244    Healthy

Table 2 (right)
Name    Age  Zipcode  Married
Chris   18   52282    Yes
John    23   52221    No
Peter   25   52221    No
Petrof  30   52210    No

For instance, the two tables above could be the output of 2 separate studies/surveys. Matching the left table against the right one on Age and Zipcode links the record (25, 52221, Flu) to Peter, revealing that Peter has Flu.
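The linkage described above can be sketched in a few lines of Python. This is only an illustration: it assumes both tables are held as lists of dictionaries and are joined on the attributes they share (Age and Zipcode). The names `medical`, `census` and `link` are hypothetical and not part of the presented implementation.

```python
# Illustrative linkage of two separately published tables on shared attributes.
medical = [
    {"Age": 19, "Sex": "Male", "Zipcode": "52261", "Illness": "Flu"},
    {"Age": 23, "Sex": "Female", "Zipcode": "52221", "Illness": "Nausea"},
    {"Age": 25, "Sex": "Male", "Zipcode": "52221", "Illness": "Flu"},
    {"Age": 32, "Sex": "Male", "Zipcode": "52244", "Illness": "Healthy"},
]
census = [
    {"Name": "Chris", "Age": 18, "Zipcode": "52282", "Married": "Yes"},
    {"Name": "John", "Age": 23, "Zipcode": "52221", "Married": "No"},
    {"Name": "Peter", "Age": 25, "Zipcode": "52221", "Married": "No"},
    {"Name": "Petrof", "Age": 30, "Zipcode": "52210", "Married": "No"},
]


def link(left, right, keys):
    """Return (right_row, left_row) pairs that agree on every attribute in keys."""
    index = {}
    for row in left:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    matches = []
    for row in right:
        for hit in index.get(tuple(row[k] for k in keys), []):
            matches.append((row, hit))
    return matches


linked = link(medical, census, keys=("Age", "Zipcode"))
for person, record in linked:
    print(person["Name"], "->", record["Illness"])
```

On the slide's data the join links Peter to the Flu record. Because only Age and Zipcode are compared, it also pairs John with the (23, 52221) record.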
Anonymization
• Mondrian achieves anonymization by combining data entries.
• The number of entries combined together is defined by the letter K.
For K=2 the previous example would be anonymized into:

Age    Sex           Zipcode      Illness
19-23  Male, Female  52221-52261  Flu
19-23  Male, Female  52221-52261  Nausea
25-32  Male          52221-52244  Flu
25-32  Male          52221-52244  Healthy
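One way to check the property that K describes is to count how many entries share each combination of non-key values. The sketch below assumes rows are tuples whose last element is the key argument; the function name `is_k_anonymous` is hypothetical and not taken from the presented program.

```python
from collections import Counter


def is_k_anonymous(rows, k):
    """True if every combination of non-key values is shared by at least k rows."""
    counts = Counter(row[:-1] for row in rows)  # the last column is the key argument
    return all(count >= k for count in counts.values())


raw = [
    (19, "Male", "52261", "Flu"),
    (23, "Female", "52221", "Nausea"),
    (25, "Male", "52221", "Flu"),
    (32, "Male", "52244", "Healthy"),
]
anonymized = [
    ("19-23", "Male, Female", "52221-52261", "Flu"),
    ("19-23", "Male, Female", "52221-52261", "Nausea"),
    ("25-32", "Male", "52221-52244", "Flu"),
    ("25-32", "Male", "52221-52244", "Healthy"),
]

print(is_k_anonymous(raw, 2))         # every raw row is unique
print(is_k_anonymous(anonymized, 2))  # each combination appears twice
```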
The number K
• The law requires K to be at least 2.
• Low K numbers result in low protection.
• High K numbers result in high data loss.
• K should be set according to the data-set type and value range.
The key argument
• The last argument of each micro-data list is defined as “the key argument”.
• This argument is usually the result of the respective list and should not be merged or otherwise changed.
Parallel Mondrian
• The development of IoT has led to an increase in the data generated.
• Since anonymization is mandatory, these data need to be processed by Mondrian as quickly as possible, as more data are continually being created.
• Parallelizing Mondrian results in faster anonymization, as it uses all of the computer’s available CPU cores.
Our Methodology
Program’s Arguments
Our implementation of Mondrian requires the following arguments to operate:
• The input file.
• The desired number ‘K’.
• The number of threads to be created.
• The output file.
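A command line taking these four arguments could be parsed as in the sketch below. The slides do not state the argument order, names or file formats, so the positional order, the names `input_file`, `k`, `threads` and `output_file`, and the sample file names are all assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser for the four arguments listed on the slide (order assumed)."""
    parser = argparse.ArgumentParser(description="Parallel Mondrian sketch")
    parser.add_argument("input_file", help="comma-separated input file")
    parser.add_argument("k", type=int, help="number of entries combined together")
    parser.add_argument("threads", type=int, help="number of threads to create")
    parser.add_argument("output_file", help="where the anonymized entries are written")
    return parser


# Hypothetical invocation: entries.csv, K=2, 4 threads, anonymized.csv
args = build_parser().parse_args(["entries.csv", "2", "4", "anonymized.csv"])
print(args.input_file, args.k, args.threads, args.output_file)
```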
Read File
The algorithm requires an input file with the following structure:
• Each line represents an entry.
• Arguments are separated by commas.
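Reading a file of this shape can be sketched with Python's standard `csv` module. The helper name `read_entries` is hypothetical, and an in-memory string stands in for the input file so the example runs on its own.

```python
import csv
import io


def read_entries(handle):
    """Read one entry per line, splitting arguments on commas and skipping blank lines."""
    return [[field.strip() for field in row] for row in csv.reader(handle) if row]


# An in-memory stand-in for the input file (first rows of the example file).
sample = io.StringIO("1,A,John\n1,B,Joanne\n2,E,George\n")
entries = read_entries(sample)
print(entries)
```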
Define Argument Type and Range
• After the file has been read, the arguments are stored in a Lines × Arguments table.
• Each argument is marked according to its type: Number or Alphanumeric.
• The program also counts the distinct values of each argument; this count is the argument’s range.
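The type and range detection can be sketched as follows. It assumes entries arrive as lists of strings and that an argument counts as a Number when every one of its values parses as a number; the helper names `is_number` and `describe_columns` are hypothetical.

```python
def is_number(text):
    """True if the text parses as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def describe_columns(entries):
    """Return one (type, range) pair per column, where range = count of distinct values."""
    info = []
    for column in zip(*entries):
        kind = "number" if all(is_number(v) for v in column) else "alphanumeric"
        info.append((kind, len(set(column))))
    return info


# The example file used later in the slides.
example = [
    ("1", "A", "John"), ("1", "B", "Joanne"), ("2", "E", "George"),
    ("3", "D", "Willow"), ("4", "C", "Senel"), ("5", "B", "Odysseus"),
    ("5", "C", "Lina"), ("6", "B", "Leon"), ("7", "B", "Sora"),
    ("7", "D", "Varian"), ("8", "C", "Orteil"), ("8", "E", "Phill"),
]
print(describe_columns(example))
```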
The Selected Argument
• To best split the entries into groups, the algorithm selects the argument with the biggest range.
• This argument is marked as “the selected argument”.
• The threads are then created, and each one takes a portion of the selected argument’s range.
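A minimal sketch of the selection and of handing each thread a portion of the range. The slides do not say how the portions are sized, so splitting the sorted distinct values into contiguous, near-equal chunks is an assumption; `select_argument` and `split_range` are hypothetical names.

```python
def select_argument(entries):
    """Index of the non-key column with the most distinct values."""
    candidates = range(len(entries[0]) - 1)  # the last column is the key argument
    return max(candidates, key=lambda c: len({row[c] for row in entries}))


def split_range(values, n_threads):
    """Split the sorted distinct values into n_threads contiguous chunks (assumed policy)."""
    values = sorted(set(values))
    size, extra = divmod(len(values), n_threads)
    chunks, start = [], 0
    for t in range(n_threads):
        end = start + size + (1 if t < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


example = [
    (1, "A", "John"), (1, "B", "Joanne"), (2, "E", "George"), (3, "D", "Willow"),
    (4, "C", "Senel"), (5, "B", "Odysseus"), (5, "C", "Lina"), (6, "B", "Leon"),
    (7, "B", "Sora"), (7, "D", "Varian"), (8, "C", "Orteil"), (8, "E", "Phill"),
]
selected = select_argument(example)
chunks = split_range([row[selected] for row in example], n_threads=2)
print(selected, chunks)
```

With the example file and 2 threads, this yields the portions 1 to 4 and 5 to 8.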
Population counter
• Each thread calculates the population of each value in its range.
• The population is the number of entries that have that value.
• Each thread’s execution is halted until every thread has calculated its population, so that all threads know their output range.
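The counting and the wait can be sketched with `threading.Barrier`. Two assumptions are made here: that "output range" means the position at which a thread's entries start in the output, and that this position is the total population of the threads before it. The function name `count_populations` is hypothetical, and Python threads are used only to show the synchronization, not to reproduce the reported speedups.

```python
import threading
from collections import Counter


def count_populations(values, chunks):
    """Count entries per value in each thread's chunk, then derive output offsets."""
    n = len(chunks)
    populations = [None] * n
    offsets = [0] * n
    barrier = threading.Barrier(n)

    def worker(i):
        wanted = set(chunks[i])
        populations[i] = Counter(v for v in values if v in wanted)
        barrier.wait()  # halt until every thread has finished counting
        # Assumed meaning of "output range": entries of earlier threads come first.
        offsets[i] = sum(sum(populations[j].values()) for j in range(i))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return populations, offsets


values = [1, 1, 2, 3, 4, 5, 5, 6, 7, 7, 8, 8]  # selected argument of the example file
populations, offsets = count_populations(values, [[1, 2, 3, 4], [5, 6, 7, 8]])
print(populations, offsets)
```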
Combination group
Each thread calculates its combination groups, which determine the entries that will be combined. The combination groups are created by this method:
• Create a group and add the range value’s population to a counter.
• If the counter is greater than or equal to ‘K’, create a new group for the next range value; otherwise, the next range value is added to the current group.
• If the last group’s counter is less than ‘K’, combine the last group with the group that has the smallest population counter.
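The grouping rules above can be sketched as a single pass over a thread's (value, population) pairs. The slides do not say how ties are broken when several groups share the smallest counter; this sketch assumes the most recent such group is chosen, which reproduces the example tables in the slides. The name `make_groups` is hypothetical.

```python
def make_groups(populations, k):
    """Group consecutive range values until each group holds at least k entries.

    populations: list of (value, count) pairs in range order.
    Returns a list of (values, total_count) pairs.
    """
    groups = []
    current, counter = [], 0
    for value, count in populations:
        current.append(value)
        counter += count
        if counter >= k:  # group is large enough: close it and start a new one
            groups.append((current, counter))
            current, counter = [], 0
    if current:  # the last group ended below k
        if groups:
            # Merge into the least-populated group; ties go to the latest one (assumption).
            idx = min(range(len(groups)), key=lambda i: (groups[i][1], -i))
            values, total = groups[idx]
            groups[idx] = (values + current, total + counter)
        else:
            groups.append((current, counter))
    return groups


# Populations of the example file's first argument, K = 2.
whole = [(1, 2), (2, 1), (3, 1), (4, 1), (5, 2), (6, 1), (7, 2), (8, 2)]
print(make_groups(whole, 2))      # one thread sees the whole range
print(make_groups(whole[:4], 2))  # first of two threads: values 1 to 4
print(make_groups(whole[4:], 2))  # second of two threads: values 5 to 8
```

Note that merging with a group that is not adjacent would produce a group whose values are not contiguous; the slides do not describe how that case is handled.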
Combining entries
• Each entry in each group will have its arguments’ values combined.
• The last argument is not merged, as it is the key argument and is left unchanged.
• If the argument’s type is numeric, the merged information is a range: smallest value-biggest value.
• If the argument’s type is alphanumeric, all distinct values are displayed: Alpha1| Alpha2| Alpha3.
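A sketch of the merge step under a few assumptions: distinct alphanumeric values are listed in first-seen order, a numeric group with a single value is shown as that value alone (as in the example tables), and the separator is a parameter because the slides show both `|` and `,`. The names `merge_column` and `combine_group` are hypothetical.

```python
def merge_column(values, numeric, sep="|"):
    """Merge one argument's values across a group."""
    if numeric:
        lo, hi = min(values), max(values)
        return str(lo) if lo == hi else f"{lo}-{hi}"
    distinct = list(dict.fromkeys(values))  # distinct values, first-seen order
    return sep.join(distinct)


def combine_group(rows, numeric_flags, sep="|"):
    """Generalize every argument except the last one, which is the key argument."""
    n = len(rows[0]) - 1
    merged = [merge_column([r[c] for r in rows], numeric_flags[c], sep) for c in range(n)]
    return [merged + [row[-1]] for row in rows]


group = [(4, "C", "Senel"), (5, "B", "Odysseus"), (5, "C", "Lina")]
for entry in combine_group(group, numeric_flags=[True, False], sep=","):
    print(entry)
```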
Example File

Argument 1  Argument 2  Key Argument
1           A           John
1           B           Joanne
2           E           George
3           D           Willow
4           C           Senel
5           B           Odysseus
5           C           Lina
6           B           Leon
7           B           Sora
7           D           Varian
8           C           Orteil
8           E           Phill
Example File Analysis
• The example file has 2 arguments plus the key argument.
• The first argument is numeric and has a range of 8.
• The second argument is alphanumeric and has a range of 5.
• The selected argument is the first argument, as it has the biggest range.
File Graph
Standard Mondrian (1 thread)
• If the file is run with only 1 thread, that thread will view the whole range of the first argument, which is from 1 to 8.
• It will split the file according to the group creation rules.
• The execution time will be 4 time cycles (4 splits).
Standard Mondrian (1 thread)

Argument 1  Argument 2  Key Argument
1           A,B         John
1           A,B         Joanne
2-3         E,D         George
2-3         E,D         Willow
4-5         C,B         Senel
4-5         C,B         Odysseus
4-5         C,B         Lina
6-7         B,D         Leon
6-7         B,D         Sora
6-7         B,D         Varian
8           C,E         Orteil
8           C,E         Phill
Standard Mondrian (1 thread)
Parallel Mondrian (2 threads)
• If the file is run with 2 threads, the first thread takes half of the range (1 to 4) and the second thread takes the other half (5 to 8).
• The difference is that each thread is unaware of the other thread’s range, so it makes its own groups.
• The execution time will be 2 time cycles (2 splits for each thread).
Parallel Mondrian (2 threads)

Argument 1  Argument 2  Key Argument
1           A,B         John
1           A,B         Joanne
2-4         E,D,C       George
2-4         E,D,C       Willow
2-4         E,D,C       Senel
5           B,C         Odysseus
5           B,C         Lina
6-7         B,D         Leon
6-7         B,D         Sora
6-7         B,D         Varian
8           C,E         Orteil
8           C,E         Phill
Parallel Mondrian
Mondrian Execution Time
• Our implementation’s execution time is highly dependent on the selected argument’s range.
• File size and the number of arguments do not affect the run time.
• The algorithm was tested on 2 machines: a Windows PC with an 8-core CPU and a Linux PC with a 16-core CPU.
• The two computers have different clock speeds and peripherals, so their results should not be compared with each other.
Results
• Our results show that doubling the file’s argument range roughly doubles the execution time.
• However, if the number of threads is doubled as well, the execution time returns to roughly its previous value.
• The execution times are also consistent with Amdahl’s law.
File Range 8000

Windows PC
Threads  Total Execution Time (sec)  Memory Allocated (MB)
1        255                         320
2        136                         564
4        73                          1054
8        53                          2036

Linux PC
Threads  Total Execution Time (sec)  Memory Allocated (MB)
1        408                         294
2        279                         544
4        142                         997
8        82                          1953
16       54                          3868
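The reported times can be turned into speedup and parallel efficiency figures, and set against the standard form of Amdahl's law, S(n) = 1 / ((1 - p) + p / n), where p is the parallelizable fraction of the work. The sketch below uses only the execution times from the File Range 8000 tables; the function names are hypothetical, and the slides do not report a value for p.

```python
def speedup_table(times):
    """times: {threads: seconds}. Returns {threads: (speedup, efficiency)} vs 1 thread."""
    base = times[1]
    return {n: (base / t, base / t / n) for n, t in sorted(times.items())}


def amdahl_speedup(p, n):
    """Amdahl's law: ideal speedup on n threads when a fraction p of the work is parallel."""
    return 1.0 / ((1.0 - p) + p / n)


# Total execution times (sec) for file range 8000, as reported in the tables.
windows_8000 = {1: 255, 2: 136, 4: 73, 8: 53}
linux_8000 = {1: 408, 2: 279, 4: 142, 8: 82, 16: 54}

for name, times in (("Windows", windows_8000), ("Linux", linux_8000)):
    for n, (s, e) in speedup_table(times).items():
        print(f"{name} {n:>2} threads: speedup {s:.2f}, efficiency {e:.2f}")
```

Comparing the measured speedups with `amdahl_speedup(p, n)` for a few candidate values of p gives a rough sense of how much of the run is parallel.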
File range 4000 to 8000 comparison
TET: Total Execution Time (sec); MA: Memory Allocated (MB)

Windows PC
         Range 4000          Range 8000
Threads  TET(sec)  MA(MB)    TET(sec)  MA(MB)
1        116       138       255       320
2        67        202       136       564
4        37        333       73        1054
8        29        587       53        2036

Linux PC
         Range 4000          Range 8000
Threads  TET(sec)  MA(MB)    TET(sec)  MA(MB)
1        200       106       408       294
2        139       170       279       544
4        72        295       142       997
8        41        546       82        1953
16       31        1004      54        3868
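The two claims from the Results slide can be checked against these tables with a short calculation. The sketch uses only the reported execution times; the names `tet`, `range_ratios` and `doubled_threads` are hypothetical.

```python
# Total execution times (sec) from the comparison tables: machine -> range -> threads.
tet = {
    "Windows": {4000: {1: 116, 2: 67, 4: 37, 8: 29},
                8000: {1: 255, 2: 136, 4: 73, 8: 53}},
    "Linux": {4000: {1: 200, 2: 139, 4: 72, 8: 41, 16: 31},
              8000: {1: 408, 2: 279, 4: 142, 8: 82, 16: 54}},
}


def range_ratios(machine):
    """Time at range 8000 divided by time at range 4000, per thread count."""
    small, large = tet[machine][4000], tet[machine][8000]
    return {n: large[n] / small[n] for n in small}


def doubled_threads(machine):
    """Pairs (time at range 4000 with n threads, time at range 8000 with 2n threads)."""
    small, large = tet[machine][4000], tet[machine][8000]
    return {n: (small[n], large[2 * n]) for n in small if 2 * n in large}


for machine in tet:
    print(machine, {n: round(r, 2) for n, r in range_ratios(machine).items()})
    print(machine, doubled_threads(machine))
```

On the reported data every ratio falls between roughly 1.7 and 2.2, which matches the "about double" wording.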
Execution Time and Memory Graphs
Conclusion
• Our algorithm provides a fast, high-quality solution for anonymizing data-sets by parallelizing the process.
• It is ideal for decent-sized data-sets that are produced on a daily basis.
• It can easily be modified and optimized for any specific data-set type.
An efficient scalable parallelized version of the Mondrian Algorithm
Thank you!