EUC 2016

An efficient scalable parallelized version of the Mondrian Algorithm

John Christoforidis
Department of Informatics and Telecommunications Engineering
Kozani, Greece

Minas Dasygenis
Department of Informatics and Telecommunications Engineering
Kozani, Greece

Presentation is delivered by: Mrs Tsourma Maria, Dipl. Eng


Introduction
• Information traffic and data generation have increased heavily.
• Privacy poses a major issue.
• Combining data from various sources can lead to personal data being exploited.
• Anonymization is mandatory to ensure security in published micro-data.


Micro-data
• Micro-data are generated by various companies for research, statistical or other purposes.
• This kind of data is usually in table form, where each row represents a person and each column represents an attribute.

Age   Sex      Zipcode   Illness
19    Male     52261     Flu
23    Female   52221     Nausea
25    Male     52221     Flu
32    Male     52244     Healthy


Exploit example
By combining raw data from various sources, one can gain access to personal information.

Table 1 (left)
Age   Sex      Zipcode   Illness
19    Male     52261     Flu
23    Female   52221     Nausea
25    Male     52221     Flu
32    Male     52244     Healthy

Table 2 (right)
Name     Age   Zipcode   Married
Chris    18    52282     Yes
John     23    52221     No
Peter    25    52221     No
Petrof   30    52210     No

For instance, the two tables above could be the output of 2 separate studies/surveys. Rows of the first table (left) can be matched with rows of the second table (right) on Age and Zipcode. The row with Age 25 and Zipcode 52221 appears in both tables, so it is revealed that Peter has Flu.
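The matching step can be sketched in a few lines of Python. This is only an illustration of the idea, not part of the presented implementation: the function name `link` and the dictionary layout are assumptions, and the records are copied from the two example tables.

```python
# Records copied from the two example tables (left and right).
medical = [
    {"Age": 19, "Sex": "Male", "Zipcode": "52261", "Illness": "Flu"},
    {"Age": 23, "Sex": "Female", "Zipcode": "52221", "Illness": "Nausea"},
    {"Age": 25, "Sex": "Male", "Zipcode": "52221", "Illness": "Flu"},
    {"Age": 32, "Sex": "Male", "Zipcode": "52244", "Illness": "Healthy"},
]
survey = [
    {"Name": "Chris", "Age": 18, "Zipcode": "52282", "Married": "Yes"},
    {"Name": "John", "Age": 23, "Zipcode": "52221", "Married": "No"},
    {"Name": "Peter", "Age": 25, "Zipcode": "52221", "Married": "No"},
    {"Name": "Petrof", "Age": 30, "Zipcode": "52210", "Married": "No"},
]


def link(left, right, keys=("Age", "Zipcode")):
    """Join two record lists on the attributes they share (hypothetical helper)."""
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    matches = []
    for row in left:
        for other in index.get(tuple(row[k] for k in keys), []):
            matches.append((other["Name"], row["Illness"]))
    return matches


leaks = link(medical, survey)
print(leaks)
```

Every pair in `leaks` is a name linked to an illness, which is exactly the kind of disclosure that anonymization is meant to prevent.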


Anonymization
• Mondrian achieves anonymization by combining data entries.
• The number of entries combined together is defined by the letter K.

For K=2 the previous example would be anonymized into:

Age     Sex            Zipcode        Illness
19-23   Male, Female   52221-52261    Flu
19-23   Male, Female   52221-52261    Nausea
25-32   Male           52221-52244    Flu
25-32   Male           52221-52244    Healthy


The number K
• The law requires K to be at least 2.
• Low K values result in low protection.
• High K values result in high data loss.
• K should be set according to the data-set type and value range.


The key argument
• The last argument of each micro-data entry is defined as “the key argument”.
• This argument usually holds the result recorded for that entry (Illness in the earlier example) and must not be merged or otherwise changed.


Parallel Mondrian
• The development of IoT has led to an increase in the data generated.
• Since anonymization is mandatory, these data need to be processed by Mondrian as quickly as possible, because more data are continually being created.
• Parallelizing Mondrian results in faster anonymization, as it uses all of the computer’s available CPU cores.


Our Methodology


Program’s Arguments
Our implementation of Mondrian requires the following arguments to operate:
• The input file.
• The desired number ‘K’.
• The number of threads to be created.
• The output file.
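A command line taking these four arguments could look like the sketch below. The argument names, their order and the validation rules are assumptions made for illustration; the slides do not specify how the original program parses its input.

```python
import argparse


def parse_args(argv=None):
    """Parse the four arguments listed above (all names are hypothetical)."""
    parser = argparse.ArgumentParser(description="Parallel Mondrian sketch")
    parser.add_argument("input_file", help="comma-separated input file")
    parser.add_argument("k", type=int, help="desired number K")
    parser.add_argument("threads", type=int, help="number of threads to create")
    parser.add_argument("output_file", help="file that receives the anonymized entries")
    args = parser.parse_args(argv)
    # Assumed checks: K below 2 gives no protection, and one thread is the minimum.
    if args.k < 2:
        parser.error("K must be at least 2")
    if args.threads < 1:
        parser.error("at least one thread is required")
    return args


args = parse_args(["entries.csv", "2", "4", "anonymized.csv"])
print(args.input_file, args.k, args.threads, args.output_file)
```

Passing an explicit list to `parse_args` keeps the example runnable without a real command line; calling it with no argument would read `sys.argv` instead.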


Read File
The algorithm requires an input file of the following structure:
• Each line represents an entry.
• Each argument is separated by a comma.
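Reading such a file is straightforward with Python's standard `csv` module. The sketch below is an assumption about how the step might be written, not the original code; `read_entries` is a hypothetical name, and an in-memory string stands in for the input file.

```python
import csv
import io


def read_entries(stream):
    """Return one list of argument strings per non-empty line."""
    return [[field.strip() for field in row] for row in csv.reader(stream) if row]


# An in-memory stand-in for the input file: one entry per line, comma-separated.
sample = io.StringIO("1,A,John\n1,B,Joanne\n2,E,George\n")
entries = read_entries(sample)
print(entries)
```

With a real file, the same function can be called on `open(path, newline="")`.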


Define Argument Type and Range
• After the file has been read, each argument is stored in a Lines × Arguments table.
• Each argument is marked according to its type: Number or Alphanumeric.
• The program also counts the distinct values of each argument; this count is the argument’s range.
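The type and range detection could be sketched as follows. The helper names are hypothetical, and two details are assumptions: a column counts as Number only if every value parses as a number, and the key argument (last column) is skipped because it is never merged.

```python
def is_number(text):
    """True if the string parses as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def describe_arguments(entries):
    """Return a (type, range) pair for every column except the key argument."""
    described = []
    for column in list(zip(*entries))[:-1]:
        kind = "Number" if all(is_number(v) for v in column) else "Alphanumeric"
        described.append((kind, len(set(column))))  # range = count of distinct values
    return described


entries = [
    ["1", "A", "John"],
    ["1", "B", "Joanne"],
    ["2", "E", "George"],
    ["3", "D", "Willow"],
]
print(describe_arguments(entries))
```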


The Selected Argument
• To best split the entries into groups, the algorithm selects the argument with the biggest range.
• This argument is marked as “the selected argument”.
• The threads are then created, and each one takes a portion of the range of the selected argument.
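A possible reading of this step is sketched below, using the rows of the example file that appears later in these slides. Splitting the range into contiguous, near-equal chunks is an assumption (the slides only say that each thread takes "a portion"), and both function names are hypothetical.

```python
def select_argument(entries):
    """Index of the non-key column with the most distinct values."""
    columns = list(zip(*entries))[:-1]  # the last column is the key argument
    return max(range(len(columns)), key=lambda i: len(set(columns[i])))


def split_range(values, threads):
    """Split sorted distinct values into contiguous, near-equal chunks."""
    size, extra = divmod(len(values), threads)
    chunks, start = [], 0
    for t in range(threads):
        end = start + size + (1 if t < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


entries = [
    [1, "A", "John"], [1, "B", "Joanne"], [2, "E", "George"], [3, "D", "Willow"],
    [4, "C", "Senel"], [5, "B", "Odysseus"], [5, "C", "Lina"], [6, "B", "Leon"],
    [7, "B", "Sora"], [7, "D", "Varian"], [8, "C", "Orteil"], [8, "E", "Phill"],
]
selected = select_argument(entries)
values = sorted({entry[selected] for entry in entries})
print(selected, split_range(values, 2))
```

With 2 threads the first chunk covers the values 1 to 4 and the second covers 5 to 8.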


Population counter
• Each thread calculates the population of each value in its range.
• The population of a value is the number of entries that have that value.
• Each thread’s execution is halted until every thread has calculated its populations, so that each thread knows its output range.
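The counting and the synchronization point can be illustrated with `threading.Barrier`. This is a sketch under assumptions, not the original implementation: the function name is hypothetical, and the offset computed after the barrier is one plausible interpretation of how a thread could learn where its output begins.

```python
import threading
from collections import Counter


def count_populations(column, chunks):
    """Count entries per value with one thread per chunk of the value range."""
    counts = [None] * len(chunks)
    offsets = [0] * len(chunks)
    barrier = threading.Barrier(len(chunks))

    def worker(i):
        wanted = set(chunks[i])
        counts[i] = Counter(v for v in column if v in wanted)
        barrier.wait()  # halt here until every thread has counted its values
        # Assumed use of the shared counts: the entries owned by earlier
        # threads determine where this thread's output begins.
        offsets[i] = sum(sum(c.values()) for c in counts[:i])

    workers = [threading.Thread(target=worker, args=(i,)) for i in range(len(chunks))]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return counts, offsets


column = [1, 1, 2, 3, 4, 5, 5, 6, 7, 7, 8, 8]
counts, offsets = count_populations(column, [[1, 2, 3, 4], [5, 6, 7, 8]])
print(counts, offsets)
```

In standard CPython, threads running pure Python code do not execute on several cores at once, so this sketch shows the coordination pattern rather than a real speed-up.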


Combination group
Each thread calculates its combination groups, that is, the sets of entries that will be combined. The combination groups are created by this method:
• Create a group and add the population of the current range value to a counter.
• If the counter is greater than or equal to ‘K’, create a new group for the next range value.
• If the last group’s counter is less than ‘K’, combine the last group with the group that has the smallest population counter.
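The three rules can be sketched as a single function. The function name and data layout are hypothetical, and one detail is an assumption: when several groups share the smallest counter, the sketch merges the leftover into the latest of them. The populations are those of the example file shown later in these slides, with K=2.

```python
def make_groups(populations, k):
    """Group consecutive range values until each group holds at least k entries.

    `populations` is a list of (value, population) pairs in range order.
    Returns a list of (values, counter) pairs.
    """
    groups, values, counter = [], [], 0
    for value, population in populations:
        values.append(value)
        counter += population
        if counter >= k:  # the group is large enough, so close it
            groups.append((values, counter))
            values, counter = [], 0
    if values:  # a leftover group with fewer than k entries
        if not groups:
            return [(values, counter)]  # nothing to merge with
        # Assumed tie-break: among the smallest groups, take the latest one.
        target = min(range(len(groups)), key=lambda i: (groups[i][1], -i))
        merged_values, merged_counter = groups[target]
        groups[target] = (merged_values + values, merged_counter + counter)
    return groups


whole_range = [(1, 2), (2, 1), (3, 1), (4, 1), (5, 2), (6, 1), (7, 2), (8, 2)]
one_thread = make_groups(whole_range, k=2)
first_of_two = make_groups(whole_range[:4], k=2)
print(one_thread)
print(first_of_two)
```

Run on the whole range, the sketch yields the five groups of the 1-thread example; run on the values 1 to 4 only, the leftover value 4 is merged into the group 2-3, as in the 2-thread example.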


Combining entries
• Each entry in each group will have its arguments’ values combined.
• The last argument is not merged, as it is the key argument and is left unchanged.
• If the argument’s type is numeric, the merged value is a range: smallest value-biggest value.
• If the argument’s type is alphanumeric, the merged value lists all distinct values: Alpha1| Alpha2| Alpha3.
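A minimal sketch of the merge rules follows. The function name and the `numeric` flag list are hypothetical, and the separator is left as a parameter because the slides use both "|" and "," when listing alphanumeric values. Printing a single number when the smallest and biggest values coincide follows the example tables.

```python
def combine_group(entries, numeric, sep=","):
    """Generalize one group; numeric[i] tells whether column i is numeric."""
    merged = []
    for i, is_numeric in enumerate(numeric):
        column = [entry[i] for entry in entries]
        if is_numeric:
            low, high = min(column), max(column)
            merged.append(str(low) if low == high else f"{low}-{high}")
        else:
            # Distinct values in order of first appearance.
            merged.append(sep.join(dict.fromkeys(column)))
    # The key argument (last field) is copied through unchanged.
    return [merged + [entry[-1]] for entry in entries]


group = [[4, "C", "Senel"], [5, "B", "Odysseus"], [5, "C", "Lina"]]
for row in combine_group(group, numeric=[True, False]):
    print(row)
```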


Example File

Argument 1   Argument 2   Key Argument
1            A            John
1            B            Joanne
2            E            George
3            D            Willow
4            C            Senel
5            B            Odysseus
5            C            Lina
6            B            Leon
7            B            Sora
7            D            Varian
8            C            Orteil
8            E            Phill


Example File Analysis
• The example file has 2 arguments plus the key argument.
• The first argument is numeric and has a range of 8.
• The second argument is alphanumeric and has a range of 5.
• The selected argument is the first argument, as it has the biggest range.


File Graph


Standard Mondrian (1 thread)
• If the file is run with only 1 thread, that thread sees the whole range of the first argument, which is from 1 to 8.
• It splits the file according to the group creation rules.
• The execution time will be 4 time cycles (4 splits).


Standard Mondrian (1 thread)

Argument 1   Argument 2   Key Argument
1            A,B          John
1            A,B          Joanne
2-3          E,D          George
2-3          E,D          Willow
4-5          C,B          Senel
4-5          C,B          Odysseus
4-5          C,B          Lina
6-7          B,D          Leon
6-7          B,D          Sora
6-7          B,D          Varian
8            C,E          Orteil
8            C,E          Phill


Standard Mondrian (1 thread)


Parallel Mondrian (2 threads)
• If the file is run with 2 threads, the first thread takes half the range (1 to 4) and the second thread takes the other half (5 to 8).
• The difference is that the first thread is not aware of the other thread’s range, so it makes its own groups.
• The execution time will be 2 time cycles (2 splits for each thread).


Parallel Mondrian (2 threads)

Argument 1   Argument 2   Key Argument
1            A,B          John
1            A,B          Joanne
2-4          E,D,C        George
2-4          E,D,C        Willow
2-4          E,D,C        Senel
5            B,C          Odysseus
5            B,C          Lina
6-7          B,D          Leon
6-7          B,D          Sora
6-7          B,D          Varian
8            C,E          Orteil
8            C,E          Phill


Parallel Mondrian


Mondrian Execution Time
• Our implementation’s execution time is highly dependent on the selected argument’s range.
• File size and the number of arguments do not affect the run time.
• The algorithm was tested on 2 machines: a Windows machine with an 8-core CPU and a Linux machine with a 16-core CPU.
• The two computers have different clock speeds and peripherals, so their results should not be compared with each other.


Results
• Our results show that doubling the file’s argument range roughly doubles the execution time.
• However, if the number of threads is doubled as well, the execution time returns to about its original value.
• The execution times are also consistent with Amdahl’s law.


File Range 8000

Windows PC
Threads   Total Execution Time (sec)   Memory Allocated (MB)
1         255                          320
2         136                          564
4         73                           1054
8         53                           2036

Linux PC
Threads   Total Execution Time (sec)   Memory Allocated (MB)
1         408                          294
2         279                          544
4         142                          997
8         82                           1953
16        54                           3868
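The relation to Amdahl's law can be explored with a short calculation on the execution times reported above for file range 8000. The sketch solves S = 1 / ((1 - p) + p / n) for the parallel fraction p, given the measured speed-up S on n threads. The function names are hypothetical, and the resulting fractions are rough estimates derived from these few measurements, not figures reported by the authors.

```python
def speedup(t1, tn):
    """Measured speed-up: single-thread time divided by n-thread time."""
    return t1 / tn


def parallel_fraction(t1, tn, n):
    """Solve Amdahl's law S = 1 / ((1 - p) + p / n) for p."""
    s = speedup(t1, tn)
    return (1 - 1 / s) / (1 - 1 / n)


# threads -> total execution time in seconds, file range 8000
windows = {1: 255, 2: 136, 4: 73, 8: 53}
linux = {1: 408, 2: 279, 4: 142, 8: 82, 16: 54}

for name, times in (("Windows", windows), ("Linux", linux)):
    for n, t in times.items():
        if n > 1:
            s = speedup(times[1], t)
            p = parallel_fraction(times[1], t, n)
            print(f"{name} {n:2d} threads: speed-up {s:.2f}, implied parallel fraction {p:.2f}")
```

A parallel fraction that stays below 1 means the speed-up grows more slowly than the thread count, which is the behaviour Amdahl's law describes.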


File range 4000 to 8000 comparison
TET = Total Execution Time, MA = Memory Allocated

Windows PC
          Range 4000              Range 8000
Threads   TET (sec)   MA (MB)     TET (sec)   MA (MB)
1         116         138         255         320
2         67          202         136         564
4         37          333         73          1054
8         29          587         53          2036

Linux PC
          Range 4000              Range 8000
Threads   TET (sec)   MA (MB)     TET (sec)   MA (MB)
1         200         106         408         294
2         139         170         279         544
4         72          295         142         997
8         41          546         82          1953
16        31          1004        54          3868


Execution Time and Memory Graphs


Conclusion
• Our algorithm provides a fast, high-quality solution for anonymizing data-sets by parallelizing the process.
• It is ideal for decent-sized data-sets that are produced on a daily basis.
• It can easily be modified and optimized for any specific data-set type.


An efficient scalable parallelized version of the Mondrian Algorithm

Thank you!