Accelerating String Matching Using Multi-threaded Algorithm on GPU
Cheng-Hung Lin*, Sheng-Yu Tsai**, Chen-Hsiung Liu**, Shih-Chieh Chang**, Jyuo-Min Shyu**
*National Taiwan Normal University, Taiwan
**National Tsing Hua University, Taiwan

Introduction
• Network Intrusion Detection Systems (NIDS) are widely used to detect network attacks.
• The pattern matching engine dominates the performance of an NIDS.
• Traditional pattern matching on a uniprocessor is too slow for today’s network speeds.
• Hardware approaches for accelerating pattern matching:
  – Logic-based
  – Memory-based
  – Multiprocessor-based

GPU for Pattern Matching
• Parallel computation on a GPU is well suited to accelerating pattern matching.
[Figure: the input "AAAAAAAAAAAAAAAAAAAAAAAB" scanned by 1 thread takes 24 cycles; split into 4 segments and scanned by 4 threads (Thread #1 to Thread #4), it takes 6 cycles.]

Boundary Problem
• A pattern that occurs across the boundary of adjacent segments cannot be detected.
• The result is a false negative.
[Figure: the input "AAAAAAAAAAAAAAAAAABBBBBB" split among Thread #1 to Thread #4; the pattern that straddles a segment boundary is missed (false negative).]

Overlapped Computation
• To resolve the boundary problem, each thread scans across the boundary into the next segment.
• Problems
  – Overhead of overlapped computation
  – Throughput reduction
[Figure: the input "AAAAAAAAAAAAAAAAAABBBBBB" split among Thread #1 to Thread #4; with overlapped scanning, Thread #3 can identify "AB".]
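The boundary problem and its overlap fix can be illustrated with a minimal Python sketch. It simulates the threads sequentially, and `find_in_segments` is a hypothetical helper written for this illustration, not code from the presentation. It assumes the input length divides evenly among the threads.

```python
def find_in_segments(text, pattern, num_threads, overlap=0):
    """Scan `text` in equal segments, as if one thread handled each segment.

    Each simulated thread reads its own segment plus `overlap` extra
    characters from the next one. Returns sorted match start positions.
    """
    seg_len = len(text) // num_threads  # assumption: length divides evenly
    hits = set()
    for t in range(num_threads):
        start = t * seg_len
        end = min(len(text), start + seg_len + overlap)
        window = text[start:end]
        pos = window.find(pattern)
        while pos != -1:
            hits.add(start + pos)
            pos = window.find(pattern, pos + 1)
    return sorted(hits)


text = "A" * 18 + "B" * 6  # 24 characters: four segments of 6
no_overlap = find_in_segments(text, "AB", num_threads=4)
with_overlap = find_in_segments(text, "AB", num_threads=4, overlap=len("AB") - 1)
print(no_overlap)    # [] : "AB" straddles segments 3 and 4 and is missed
print(with_overlap)  # [17] : the third thread reads one extra character
```

The extra characters each thread reads are the overlapped computation whose cost the slide refers to.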

Aho-Corasick Algorithm
• The Aho-Corasick (AC) algorithm is widely used for pattern matching because it matches multiple patterns in a single pass.
  – Multiple patterns are compiled into one composite state machine.
• Patterns: (1) AB (2) ABG (3) BEDE (4) EF
[Figure: AC state machine for the four patterns, with states 0 to 9, transitions labeled A, B, G, E, D and F, and a [^ABE] transition at state 0.]

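Aho-Corasick Algorithm (cont.)
• An Aho-Corasick (AC) state machine consists of two kinds of transitions:
  – Solid lines represent valid transitions.
  – Dotted lines represent failure transitions.
• A failure transition backtracks the state machine so that it can recognize patterns that start at different locations.
[Figure: the AC state machine (states 0 to 9) processing the input string "A B E D E", with location 1 and location 2 marked on the input. After matching "AB" from location 1, a failure transition lets the machine go on to match "BEDE" from location 2.]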
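As a rough illustration of how valid and failure transitions cooperate, here is a compact Python sketch of a textbook Aho-Corasick matcher. The function names `build_ac` and `ac_search` are chosen for this example and are not taken from the presentation; reported positions are 0-based, whereas the slide counts locations from 1.

```python
from collections import deque


def build_ac(patterns):
    """Build goto, failure and output tables for an Aho-Corasick machine."""
    goto = [{}]        # goto[state][char] -> next state (valid transitions)
    output = [set()]   # patterns recognized on entering each state
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                output.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        output[s].add(p)
    fail = [0] * len(goto)
    queue = deque(goto[0].values())  # depth-1 states fail back to state 0
    while queue:
        s = queue.popleft()
        for ch, nxt in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] |= output[fail[nxt]]
            queue.append(nxt)
    return goto, fail, output


def ac_search(text, patterns):
    """Return sorted (start_index, pattern) pairs found in one pass over text."""
    goto, fail, output = build_ac(patterns)
    s, found = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]              # follow failure transitions
        s = goto[s].get(ch, 0)       # unknown characters keep us at state 0
        for p in output[s]:
            found.append((i - len(p) + 1, p))
    return sorted(found)


print(ac_search("ABEDE", ["AB", "ABG", "BEDE", "EF"]))
# [(0, 'AB'), (1, 'BEDE')]
```

A single pass finds both patterns: the failure transition taken after "AB" is what allows "BEDE", which starts one character later, to be recognized.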

Problems of AC on GPU
• Direct implementation of AC on GPU
  – To resolve the boundary problem, each thread has a lower bound on its scanning length:
    • Lower bound = segment length + overlapped length
    • Overlapped length = length of the longest pattern - 1
  – Overhead of overlapped computation
[Figure: the input "AAAAAAAAAAAAAAAAAABBBBBB".]
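The scanning-length constraint can be worked through numerically. The sketch below is only an illustration: `scan_ranges` is a hypothetical helper, and the 24-character input split among 4 threads and the four example patterns (AB, ABG, BEDE, EF) are borrowed from other slides of this deck.

```python
def scan_ranges(input_len, num_threads, patterns):
    """Per-thread [start, end) scan ranges for a direct AC port with overlap."""
    overlap = max(len(p) for p in patterns) - 1   # longest pattern - 1
    seg_len = -(-input_len // num_threads)        # ceiling division
    ranges = []
    for t in range(num_threads):
        start = t * seg_len
        end = min(input_len, start + seg_len + overlap)
        ranges.append((start, end))
    return overlap, ranges


overlap, ranges = scan_ranges(24, 4, ["AB", "ABG", "BEDE", "EF"])
scanned = sum(end - start for start, end in ranges)
print(overlap)       # 3
print(ranges)        # [(0, 9), (6, 15), (12, 21), (18, 24)]
print(scanned - 24)  # 9 characters are read twice
```

With segments this short, the overlap adds 9 reads to an input of 24 characters, which shows why the overhead matters when segments are small relative to the longest pattern.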


Failureless-AC State Machine
• AC state machine
  – A failure transition backtracks the state machine to recognize patterns that start at different locations.
[Figure: AC state machine (states 0 to 9) processing the input string "A B E D E", with location 1 and location 2 marked.]
• Failureless-AC state machine
  – All failure transitions are removed.
  – Traversal terminates when there is no valid transition.
  – Patterns are recognized at location 1.
[Figure: Failureless-AC state machine (states 0 to 9) processing the input string "A B E D E" from location 1; traversal stops when no valid transition exists.]
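A minimal Python sketch of the failureless idea follows. It builds only the trie of valid transitions and stops at the first character with no valid transition. The names `build_failureless` and `match_from` are assumptions for this illustration, and start indices are 0-based (index 0 corresponds to location 1 on the slide).

```python
def build_failureless(patterns):
    """Build valid transitions and final-state labels only, no failure links."""
    goto = [{}]      # goto[state][char] -> next state
    final = [None]   # final[state] is the pattern ending at that state, if any
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                final.append(None)
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        final[s] = p
    return goto, final


def match_from(text, start, goto, final):
    """Traverse from text[start]; terminate when no valid transition exists."""
    s, found = 0, []
    for ch in text[start:]:
        if ch not in goto[s]:
            break
        s = goto[s][ch]
        if final[s] is not None:
            found.append(final[s])
    return found


goto, final = build_failureless(["AB", "ABG", "BEDE", "EF"])
print(match_from("ABEDE", 0, goto, final))  # ['AB']
print(match_from("ABEDE", 1, goto, final))  # ['BEDE']
```

A traversal started at the first character finds only "AB"; finding "BEDE" requires a second traversal that starts one character later.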

Parallel Failureless-AC Algorithm
• Parallel Failureless-AC (PFAC) algorithm
  – Assign one thread to each byte of the input; each thread traverses the Failureless-AC state machine starting from its own byte.
[Figure: the input "XXXXXXXXXABEDEXXXXXXXXXX".]

Mechanism of PFAC
[Figure: for the input "…XXXXABEDEXXX…", adjacent threads Thread #n and Thread #n+1 start at adjacent bytes of the input, and each traverses the Failureless-AC state machine (states 0 to 9) from state 0.]
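The PFAC mechanism can be simulated on a CPU in a few lines of Python. This is a sequential sketch under stated assumptions, not the authors' GPU kernel: the loop in `pfac` stands in for launching one GPU thread per input byte, and all function names are hypothetical.

```python
def build_trie(patterns):
    """Failureless-AC machine: valid transitions plus final-state labels."""
    goto, final = [{}], [None]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                final.append(None)
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        final[s] = p
    return goto, final


def pfac_thread(text, start, goto, final):
    """Work of one thread: traverse from text[start] until no valid transition."""
    s, hits = 0, []
    for i in range(start, len(text)):
        s = goto[s].get(text[i])
        if s is None:          # no valid transition: this thread terminates
            break
        if final[s] is not None:
            hits.append((start, final[s]))
    return hits


def pfac(text, patterns):
    goto, final = build_trie(patterns)
    results = []
    for start in range(len(text)):  # stand-in for one GPU thread per byte
        results.extend(pfac_thread(text, start, goto, final))
    return results


text = "X" * 9 + "ABEDE" + "X" * 10
print(pfac(text, ["AB", "ABG", "BEDE", "EF"]))  # [(9, 'AB'), (10, 'BEDE')]
```

Because every byte has its own traversal, no match can be lost at a segment boundary, and the iterations are independent of one another, which is what makes the scheme a natural fit for massively parallel hardware.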

Reducing Overlapped Computation
• Direct implementation of the AC algorithm
  – Each thread has a lower bound on its scanning length.
  – Overlapped computation (overlapped length = 3)
• PFAC algorithm
  – No boundary problem.
  – Each thread has a variable scanning length.
  – Most threads terminate early.
  – Overlapped computation is reduced.
[Figure: the input "…CCCCCCCCBCCCCCC…" with labels 1, 1, 3.]
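To see why most threads terminate early, one can count how many characters each per-byte thread examines. In the hedged sketch below, a set of pattern prefixes stands in for the Failureless-AC state machine (a traversal is alive exactly while the text read so far is a prefix of some pattern). The counting convention, which includes the character that causes termination, and the helper `chars_read` are assumptions of this example; the patterns are the deck's four example patterns.

```python
patterns = ["AB", "ABG", "BEDE", "EF"]
# Every non-empty prefix of a pattern corresponds to one state of the machine.
prefixes = {p[:k] for p in patterns for k in range(1, len(p) + 1)}


def chars_read(text, start, prefixes):
    """Characters examined by the thread that starts at text[start]."""
    n = 0
    while start + n < len(text):
        n += 1
        if text[start:start + n] not in prefixes:
            break  # no valid transition for this character: terminate
    return n


text = "C" * 8 + "B" + "C" * 6  # modeled on the slide's "CCCCCCCCBCCCCCC"
counts = [chars_read(text, i, prefixes) for i in range(len(text))]
print(counts)       # [1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1]
print(sum(counts))  # 16
```

Only the thread that starts at "B" reads a second character; every other thread stops after one, so the total work stays close to the input length.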

Experimental Environments
• CPU: Intel® Core™ i7 CPU 950 @ 3.07 GHz
  – 4 cores
  – 12 GB DDR3 memory
• GPU: NVIDIA® GeForce® GTX 480 @ 1.4 GHz
  – 480 cores
  – 1536 MB GDDR5 memory
• Patterns: string patterns of Snort V2.4
  – 1,998 rules containing 41,997 characters
  – 27,754 states in total
• Input: normal and worst case
  – DEFCON packets

Experimental Results
Table 1: Throughput of normal case inputs

Input size   AC_CPU     AC_OMP      AC_Pthread   PFAC            Speedup
             1 thread   8 threads   8 threads    multi-threads   to fastest
             (Gbps)     (Gbps)      (Gbps)       (Gbps)
2 MB         0.73       2.84        3.98         69.73           17.53
4 MB         0.72       2.98        4.06         80.36           19.78
8 MB         0.72       3.04        4.24         87.66           20.66
16 MB        0.72       3.06        3.26         90.13           27.63
32 MB        0.72       3.04        4.32         88.20           20.43
64 MB        0.73       3.11        4.39         94.89           21.63
128 MB       0.82       3.36        4.71         110.63          23.51
192 MB       0.86       3.43        4.69         117.04          24.94
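The "Speedup to fastest" column appears to be the PFAC throughput divided by the faster of the two multi-threaded CPU results for the same input size. That reading is an assumption, checked in the short Python sketch below against the Table 1 figures; small differences are expected because the published throughputs are rounded to two decimals.

```python
# Table 1 (normal case), in order of input size: 2 MB ... 192 MB.
ac_omp = [2.84, 2.98, 3.04, 3.06, 3.04, 3.11, 3.36, 3.43]
ac_pthread = [3.98, 4.06, 4.24, 3.26, 4.32, 4.39, 4.71, 4.69]
pfac = [69.73, 80.36, 87.66, 90.13, 88.20, 94.89, 110.63, 117.04]
reported = [17.53, 19.78, 20.66, 27.63, 20.43, 21.63, 23.51, 24.94]

# Assumed definition: PFAC throughput over the fastest CPU implementation.
speedup = [g / max(o, t) for g, o, t in zip(pfac, ac_omp, ac_pthread)]

for got, want in zip(speedup, reported):
    print(f"computed {got:6.2f}   reported {want:6.2f}")

max_gap = max(abs(got - want) for got, want in zip(speedup, reported))
print(max_gap < 0.05)  # True
```

Every computed ratio lands within 0.05 of the reported value, which is consistent with the assumed definition.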

Comparisons of Normal Case
[Figure: chart of throughput for AC_CPU (1 thread), AC_OMP (8 threads), AC_Pthread (8 threads) and PFAC (multi-threads) at input sizes from 2 MB to 192 MB; vertical axis from 0.00 to 140.00.]

Experimental Results
Table 2: Throughput of worst case inputs

Input size   AC_CPU     AC_OMP      AC_Pthread   PFAC            Speedup
             1 thread   8 threads   8 threads    multi-threads   to fastest
             (Gbps)     (Gbps)      (Gbps)       (Gbps)
2 MB         0.45       2.40        2.44         12.09           4.96
4 MB         0.45       2.50        2.57         12.61           4.91
8 MB         0.45       2.58        2.62         12.79           4.88
16 MB        0.45       2.61        2.63         12.98           4.93
32 MB        0.46       2.55        2.61         12.95           4.96
64 MB        0.45       2.62        2.63         13.10           4.97
128 MB       0.46       2.57        2.60         13.12           5.04
192 MB       0.46       2.63        2.64         13.10           4.97

Comparisons of Worst Case
[Figure: chart of throughput for AC_CPU (1 thread), AC_OMP (8 threads), AC_Pthread (8 threads) and PFAC (multi-threads) at input sizes from 2 MB to 192 MB; vertical axis from 0.00 to 14.00.]

Comparisons

Approach                          Characters    Memory    Throughput   Memory       Notes
                                  in rule set   (KB)      (Gbps)       efficiency
PFAC                              41997         27754     117.04       177.10       NVIDIA GTX 480
Huang et al. [10], Modified WM    1565          230       2.40         16.33        NVIDIA 7600 GT
Schatz et al. [11], Suffix Tree   200000        14125     2.00         28.32        NVIDIA GTX 8800
Vasiliadis et al. [12], DFA       N.A.          200000    6.40         N.A.         NVIDIA 9800 GX2
Smith et al. [13], XFA            N.A.          3000      10.40        N.A.         NVIDIA 8800 GTX

Memory efficiency = (Throughput × number of characters) / Memory
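The memory-efficiency formula can be checked against the rows of the comparison that report all three inputs. The Python sketch below is a worked example only: the function name `memory_efficiency` is chosen for illustration, and the units (Gbps, characters, KB) are those given in the table.

```python
def memory_efficiency(throughput_gbps, num_chars, memory_kb):
    """Memory efficiency = (throughput x number of characters) / memory."""
    return throughput_gbps * num_chars / memory_kb


# (throughput in Gbps, characters in rule set, memory in KB)
rows = {
    "PFAC": (117.04, 41997, 27754),
    "Huang et al. [10]": (2.40, 1565, 230),
    "Schatz et al. [11]": (2.00, 200000, 14125),
}

efficiency = {name: round(memory_efficiency(*vals), 2) for name, vals in rows.items()}
for name, value in efficiency.items():
    print(f"{name}: {value:.2f}")
# PFAC: 177.10
# Huang et al. [10]: 16.33
# Schatz et al. [11]: 28.32
```

The two remaining approaches list no character count for their rule sets, so their memory efficiency cannot be computed from the table.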

Conclusions
• We have proposed a novel parallel string matching algorithm that is well suited to GPUs and is free from the boundary detection problem.
• The proposed algorithm creates a new state machine that has lower complexity and memory usage than the traditional Aho-Corasick state machine.
• The new algorithm achieves a significant speedup over the traditional Aho-Corasick algorithm accelerated by OpenMP on a CPU.
• Compared to other GPU approaches, the new algorithm is 11.6 times faster than the state-of-the-art approach.
