Small-Ruleset Regular Expression Matching on GPGPUs: Quantitative Performance Analysis and Optimization

Jamin Naghmouchi (1,2), Daniele Paolo Scarpazza (1), Mladen Berekovic (2)

(1) IBM T.J. Watson Research Center, Business Analytics & Math Dept., Yorktown Heights, NY, USA
[email protected], [email protected]

(2) Institut für Datentechnik und Kommunikationsnetze, Technische Universität Braunschweig, Braunschweig, Germany
[email protected]

ABSTRACT We explore the intersection between an emerging class of architectures and a prominent workload: GPGPUs (General-Purpose Graphics Processing Units) and regular expression matching, respectively. It is a challenging task because this workload –with its irregular, non-coalesceable memory access patterns– is very different from the regular, numerical workloads that run efficiently on GPGPUs. Small-ruleset expression matching is a fundamental building block for search engines, business analytics, natural language processing, XML processing, compiler front-ends and network security. Despite the abundant power that GPGPUs promise, little work has investigated their potential and limitations with this workload, and how to best utilize the memory classes that GPGPUs offer. We describe an optimization path of the kernel of flex (the popular, open-source regular expression scanner generator) to four nVidia GPGPU models, with decisions based on quantitative micro-benchmarking, performance counters and simulator runs. Our solution achieves a tokenization throughput that exceeds the results obtained by the GPGPU-based string matching solutions presented so far, and compares well with solutions obtained on any architecture.

Categories and Subject Descriptors D.1.3 [Programming Techniques]: Concurrent Programming— Parallel Programming; F.2.2 [Analysis of Algorithms]: Nonnumerical Algorithms—Pattern matching

General Terms Algorithms, Design, Performance

1. INTRODUCTION

With the advent of “Web 2.0” applications, the volume of unstructured data that Internet and enterprise applications produce and consume has been growing at extraordinary rates. The tools we use to access, transform and protect these data are search engines, business analytics suites, Natural-Language Processing (NLP) tools, XML processors and Intrusion Detection Systems (IDSs). These tools rely crucially on some form of Regular Expression (regexp) scanning. We focus on tokenization: a form of small-ruleset regexp matching used to divide a character stream into tokens like English words, e-mail addresses, company names, URLs, phone numbers, IP addresses, etc. Tokenization is the first stage of any search engine indexer (where it consumes between 14 and 20% of the execution time [1]) and of any XML processing tool (where it can absorb 30% [2, 3]). It is also part of NLP tools and programming language compilers.


The further growth of unstructured-data applications depends on whether fast, scalable tokenizers are available. Architectures are offering more cores per socket [4, 5] and wider SIMD units (Single-Instruction Multiple-Data). For example, Intel is increasing per-chip core counts from the current 4–8 to the 16–48 of Larrabee [6], and SIMD width from the current 128 bits of SSE (Streaming SIMD Extensions [7]) to the 256 bits of AVX (Advanced Vector eXtensions [8]) and the 512 bits of LRBni [9]. nVidia GPGPUs [10] employ hundreds of light-weight cores that juggle thousands of threads, in an attempt to mask the latency of an uncached main memory.

Despite this promising amount of parallelism, little work has explored the potential of GPGPUs for text processing tasks, whereas traditional multi-core architectures have received abundant attention [11, 12, 13, 14, 15]. Filling this gap is the objective of this paper. It is a challenging task because tokenization is far from the numerical, array-based applications that traditionally map well to GPGPUs. Unlike numerical kernels that aim at fully coalesced memory accesses, our workload never enjoys coalescing. Also, automaton-based algorithms have been called embarrassingly sequential [16] for their inherent lack of parallelism.

Our optimization reasoning relies on performance figures that are not available from the manufacturer or independent publications. We determine these figures with micro-benchmarks specifically designed for the purpose. We start our optimization from a naïve port to GPGPUs of a tokenizer kernel produced by flex [17]. We analyze compute operations and memory accesses, and explore data-layout improvements on a quantitative basis, with the help of benchmarks, profiling, performance counters, static analysis of the disassembled bytecode, and simulator [18] runs. On a GTX280 device, we achieve a typical tokenizing throughput of 1.185 Gbyte/s per device on realistic (Wikipedia) data, and a peak scanning throughput of 6.92 Gbyte/s (i.e., 3.62× and 8.59× speedups over naïve GPGPU ports, respectively). This performance is 20.1× faster than the original, unmodified flex tokenizer running in 4 threads on a modern commodity processor, and 49.8× faster than a single-threaded flex. The limitations of our approach are the size of the rule set and the need for a large number of independent input streams. The first limitation derives from our mapping of the automaton state tables to (small) core-local memories and caches; this constraint does not fit applications that require large state spaces, like IDSs or content-based traffic filtering. The second limitation is due to the high number of threads (approx. 4,000–6,000) that must be in flight at any time to reach good GPGPU utilization. Traditional CPUs, in contrast, have fewer cores and threads, and reach full utilization with fewer input streams.

2. THE GPGPU ARCHITECTURE AND PROGRAMMING MODEL

We briefly introduce the architecture and programming model of nVidia GPGPUs of the CUDA (Compute-Unified Device Architecture) family. We focus primarily on the GTX280 device, but these concepts apply broadly to the other devices we consider (Table 1). For more detailed information, see the technical documentation [19] and the relevant research papers [10, 20, 21, 22, 18].

2.1 Compute cores and memory hierarchy

A CUDA GPGPU is a hierarchical collection of cores, as in Figure 1. Cores are called Scalar Processors (SPs). A Streaming Multiprocessor (SM) contains 8 SPs and associated resources (e.g., a common instruction unit, shared memory, a common register file, and an L1 cache). Three SMs, together with texture units and an L2 cache, constitute a Thread Processing Cluster (TPC). The GTX280 has 10 TPCs, connected to memory controllers and an L3 cache. Some of the internals we report are unofficial and derive from Papadopoulou et al. [22]. nVidia often does not disclose the internals of its devices, possibly in an attempt to discourage non-portable optimizations. Nevertheless, the high-performance computing community pursues efficiency even at the cost of device-specific, low-level optimizations.

The memory hierarchy includes a shared register file, a block of shared memory and a global memory. The register file is statically partitioned among threads at compile time. In the general case, GPGPUs do not mitigate memory latency by using caches (except for the Fermi models, not available at the time this article was written). Rather, they maintain a large number of threads, and mask latencies by switching to a different, ready group of threads; the cores are designed to perform inexpensive context switches. The L1–L3 caches are used only for instructions, constants and textures. In detail, the constant memory is a programmer-designated area of global memory, up to 64 kbytes, initialized by the host and not modifiable by code running on the device; accesses to this area are cached. A visual representation of memory classes and latencies is in Figure 2. In Section 4, we analyze quantitatively the performance of these classes of memory.
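As a concrete illustration of the constant memory class described above, the hedged sketch below (our illustration, not code from the paper; d_lookup and kLookupDemo are placeholder names) declares a small lookup table in __constant__ memory, initializes it from the host with cudaMemcpyToSymbol, and reads it from device code, where the accesses are served by the constant cache.

    #include <cuda_runtime.h>

    /* Hypothetical lookup table kept in the 64-kbyte constant space. */
    __constant__ short d_lookup[64];

    __global__ void kLookupDemo(const unsigned char *in, short *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = d_lookup[in[i] & 63];   /* read-only, cached access */
    }

    int main()
    {
        short h_lookup[64];
        for (int i = 0; i < 64; ++i) h_lookup[i] = (short)(i * 3);

        /* The host initializes constant memory; device code cannot modify it. */
        cudaMemcpyToSymbol(d_lookup, h_lookup, sizeof(h_lookup));

        /* ... allocate in/out buffers, launch kLookupDemo, copy results back ... */
        return 0;
    }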

2.2 Programming Model

The CUDA programming model does not explicitly associate threads to cores. Rather, the programmer provides a kernel of code that describes the operations of a single thread, and a rectangular grid that defines the thread count and numbering. Threads are organized in a hierarchy: a grid of blocks of warps of threads. A warp is a collection of 32 threads that run concurrently on the 8 SPs of an SM (each SP runs 4 threads). A block is a collection of threads of programmer-defined size (up to 512). We found no benefit in defining blocks whose size is not a multiple of 32 threads; therefore we regard a block as a group of warps. A grid is an array of blocks, with 1, 2 or 3 dimensions. Image processing and physics kernels map naturally to 2D and 3D grids, but our text-based workload does not need more than one dimension. Therefore we treat this grouping as a linear array of blocks and refer to this degree of freedom only as the “number of blocks” from now on.
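For readers unfamiliar with this hierarchy, the following hedged sketch (our illustration, not the paper's kernel; kScan and its parameters are placeholders) shows how a 1D grid of blocks is launched and how each thread derives a unique global index, which a tokenizer can use to select its private input stream.

    /* Each thread processes one independent input stream, selected by its
       global index within the 1D grid. */
    __global__ void kScan(const char *inputs, int stream_len, int num_streams)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread index */
        if (tid >= num_streams)
            return;
        const char *my_input = inputs + (size_t)tid * stream_len;
        /* ... run the per-thread scanner over my_input ... */
        (void)my_input;
    }

    /* Host-side launch: blocks sized as whole warps (multiples of 32),
       enough blocks to cover all streams:

         int threads_per_block = 128;
         int num_blocks = (num_streams + threads_per_block - 1) / threads_per_block;
         kScan<<<num_blocks, threads_per_block>>>(d_inputs, stream_len, num_streams);
    */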

2.3 Compilation and Execution Model

With CUDA, GPGPUs are coprocessor devices to which the programmer offloads portions of code called kernels. The programmer can write single-source C/C++ hybrid applications, where the kernels are limited to a subset of the C language with no recursion and no function pointers. The NVCC compiler separates host code from device code, and forwards the host code to an external compiler like GNU GCC. The device source code is compiled to a virtual instruction set called PTX [23]. At program startup, the device driver applies final optimizations and translates the PTX code into real binary code. The real instruction set and the compilation process are undocumented. Threads in a warp execute in lockstep (with a shared program counter) and therefore have no independent control flows. Control flow statements in the source map to predicated instructions: the hardware nullifies the instructions of non-taken branches. The memory accesses of the threads in each half-warp can coalesce into a single transaction that serves 16 threads concurrently, provided that the target addresses respect constraints of stride and contiguity. Uncoalesced accesses are far less efficient than coalesced ones. Due to the lack of regularity in our workload, its memory accesses never coalesce.
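As an illustration of the coalescing rule just described, the hedged fragment below (illustrative only, not code from the paper) contrasts an access pattern in which consecutive threads of a half-warp read consecutive words (which the hardware can merge into one transaction) with the kind of data-dependent, scattered access a DFA performs (which it cannot).

    __global__ void kAccessPatterns(const int *table, const int *indices,
                                    int *out_a, int *out_b, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n)
            return;

        /* Coalesced: thread t reads element t; addresses within a half-warp
           are consecutive and aligned, so 16 reads merge into one transaction. */
        out_a[tid] = table[tid];

        /* Uncoalesced: the address depends on per-thread data (as in a DFA
           state lookup), so the 16 reads are scattered and serviced separately. */
        out_b[tid] = table[indices[tid]];
    }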

3. REGULAR-EXPRESSION MATCHING AND TOKENIZATION

A regular expression describes a set of strings in terms of an alphabet and operators like choice ‘|’, optionality ‘?’, and unbounded repetition ‘*’. Regexps are a powerful abstraction: one expression can denote an infinite class of strings with desired properties. Finding matches between a stream of text and a set of regexps is a common task. Antivirus scanners find matches between user files and a set of regexps that denote the signatures of malicious content. IDSs do the same on live network traffic. These examples exhibit large rule sets (since threat signatures can number in the tens of thousands) and low match rates (because the majority of traffic and files are usually legitimate). In this work, we rather focus on the small-ruleset, high-match-rate regexp matching involved in the tokenization stage of search engines, XML and NLP processors, and compilers. Tokenizers also implement a different matching semantics: they accept only non-overlapping, maximum-length matches. Regexp matching is performed with a Deterministic Finite Automaton (DFA), often generated automatically with tools like flex [17]. Flex takes as input a rule set like the one in Figure 3, and produces the C code of an automaton that matches those rules, as in Figure 4. Our GPGPU-based tokenizer implements this example. The flex-generated DFA has 174 states, which illustrates practically what we mean by a small rule set. In the listing, the while loop scans the input. At each iteration, the automaton reads one input character, determines the next state and whether it is an accepting one, performs the associated semantic actions, and transitions to the new state. In tokenizers, the usual semantic actions add the accepted token to an output table. The memory accesses of this DFA are illustrated in Figure 5. The DFA reads characters sequentially from the input (1), with no reuse except for backup transitions, discussed below; the input is read-only and not bounded in size. The automaton accesses the STT (2) and the accept table (3); both accesses are, at a first approximation, random, and both tables are limited in size.

[Figure 1 diagram: a GPGPU card with 10 TPCs connected through an interconnect to memory; each TPC contains texture units, an 8-kbyte L2 cache and 3 SMs; each SM contains an instruction unit, a 2-kbyte L1 cache, a register file, shared memory, 8 SPs, SFUs and a DPU; a 32-kbyte L3 cache sits next to the memory controllers.]

Figure 1: A GTX280 contains 10 Thread Processing Clusters (TPCs), each of which groups 3 Streaming Multiprocessors (SMs). One SM contains 8 Scalar Processors (SPs), i.e., 10×3×8 = 240 cores. An SM also contains Special Function Units (SFUs) and Double-Precision Units (DPUs) which we do not use in this work.

Table 1: Architectural characteristics of the compute devices we employ in our experiments.

Device             Architecture   Number   Total   Clock      Global       Constant    Shared      Registers
                   Revision       of SMs   cores   Rate       Memory       Memory      Memory      per Block
GeForce GTX280     1.3            30       240     1.30 GHz   1.00 Gbytes  64 kbytes   16 kbytes   16,384
GeForce GTX8800    1.0            16       128     *1.40 GHz  0.75 Gbytes  64 kbytes   16 kbytes   8,192
Tesla C870         1.1            16       128     1.35 GHz   1.50 Gbytes  64 kbytes   16 kbytes   8,192
Quadro FX3700      1.0            14       112     1.24 GHz   0.50 Gbytes  64 kbytes   16 kbytes   8,192

In all devices a warp is 32 threads and the maximum number of threads per block is 512.
(* Overclocked specimen; factory clock rate was 1.35 GHz.)
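Most of the characteristics in Table 1 can also be queried programmatically at run time; the hedged sketch below (illustrative, not part of the paper) prints them with the CUDA runtime API.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int d = 0; d < count; ++d) {
            cudaDeviceProp p;
            cudaGetDeviceProperties(&p, d);
            printf("%s: rev %d.%d, %d SMs, %.2f GHz, "
                   "%zu Mbytes global, %zu kbytes constant, "
                   "%zu kbytes shared/block, %d registers/block, warp %d\n",
                   p.name, p.major, p.minor, p.multiProcessorCount,
                   p.clockRate / 1.0e6,              /* clockRate is reported in kHz */
                   p.totalGlobalMem >> 20, p.totalConstMem >> 10,
                   p.sharedMemPerBlock >> 10, p.regsPerBlock, p.warpSize);
        }
        return 0;
    }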

[Figure 2 diagram: latency annotations of >406 and ~220 cycles for the 1-Gbyte global memory, ~81 cycles for the 32-kbyte L3, and 6 cycles for both the 2-kbyte L1 and the 16-kbyte shared memory; the 8-kbyte L2 is also shown, within the TPC/SM/SP hierarchy.]

Figure 2: Round-trip read latencies to the memories on a GTX280 GPGPU, from the point of view of a Scalar Processor (SP), expressed in clock cycles for a 1.30 GHz device [22]. Color coding is consistent with Fig. 1, but some blocks were omitted for clarity.
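Latencies like those in Figure 2 are the kind of figures that can be obtained with small, purpose-built micro-benchmarks rather than from vendor documentation. A hedged sketch of one classic technique, dependent pointer chasing timed with the device clock() counter (illustrative only, not the micro-benchmarks used in the paper), follows.

    /* A single thread follows a chain of dependent indices; dividing the
       elapsed clocks by the number of loads approximates the round-trip read
       latency of the memory holding 'chain'.  Launch as kChase<<<1, 1>>>(...). */
    __global__ void kChase(const int *chain, int iters, int *cycles, int *sink)
    {
        int idx = 0;
        unsigned int t0 = (unsigned int)clock();
        for (int i = 0; i < iters; ++i)
            idx = chain[idx];          /* each load depends on the previous one */
        unsigned int t1 = (unsigned int)clock();
        *cycles = (int)(t1 - t0);      /* total cycles; divide by iters on the host */
        *sink = idx;                   /* keeps the loop from being optimized away */
    }

The host initializes chain as a permutation (e.g., chain[i] = (i + stride) % n with a large stride) and divides the reported cycle count by iters; pointing the same loop at a __shared__ copy measures shared-memory latency instead.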

LETTER       [a-z]
DIGIT        [0-9]
P            ("_"|[,-/])
HAS_DIGIT    ({LETTER}|{DIGIT})*{DIGIT}({LETTER}|{DIGIT})*
ALPHA        {LETTER}+
ALPHANUM     ({LETTER}|{DIGIT})+
APOSTROPHE   {ALPHA}("’"{ALPHA})+
ACRONYM      {ALPHA}"."({ALPHA}".")+
COMPANY      {ALPHA}("&"|"@"){ALPHA}
EMAIL        {ALPHANUM}(("."|"-"|"_"){ALPHANUM})*"@"{ALPHANUM}(("."|"-"){ALPHANUM})+
HOST         {ALPHANUM}("."{ALPHANUM})+
NUM          {ALPHANUM}{P}{HAS_DIGIT}|{HAS_DIGIT}{P}{ALPHANUM}|
             {ALPHANUM}({P}{HAS_DIGIT}{P}{ALPHANUM})+|
             {HAS_DIGIT}({P}{ALPHANUM}{P}{HAS_DIGIT})+|
             {ALPHANUM}{P}{HAS_DIGIT}({P}{ALPHANUM}{P}{HAS_DIGIT})+|
             {HAS_DIGIT}{P}{ALPHANUM}({P}{HAS_DIGIT}{P}{ALPHANUM})+
STOPWORD     "a"|"an"|"and"|"are"|"as"|"at"|"be"|"but"|"by"|
             "for"|"if"|"in"|"into"|"is"|"it"|"no"|"not"|
             "of"|"on"|"or"|"s"|"such"|"t"|"that"|"the"|
             "their"|"then"|"there"|"these"|"they"|"this"|
             "to"|"was"|"will"|"with"
KEPT_AS_IS   {ALPHANUM}|{COMPANY}|{EMAIL}|{HOST}|{NUM}

%%
{STOPWORD}|.|\n    /* ignore */;
{KEPT_AS_IS}       emit_token (yytext);
{ACRONYM}          emit_acronym (yytext);
{APOSTROPHE}       emit_apostrophe (yytext);
%%

Figure 3: An example tokenizer rule set, specified in flex, similar to that of Lucene [24], the open-source search-engine library. The rules accept words, company names, email addresses, host names and numbers as one class of tokens; they also recognize acronyms and apostrophe expressions as distinct classes.

const flex_int16_t yy_nxt[][...] = { ... };   /* next-state table */
const flex_int16_t yy_accept[ ]  = { ... };   /* accept table */
/* ... */
while ( 1 ) {
    yy_bp = yy_cp;
    yy_current_state = yy_start;   /* initial state */
    while ( (yy_current_state = yy_nxt[ yy_current_state ][ *yy_cp ]) > 0 ) {
        if ( yy_accept[yy_current_state] ) {
            (yy_last_accepting_state) = yy_current_state;
            (yy_last_accepting_cpos)  = yy_cp;
        }
        ++yy_cp;
    }
    yy_current_state = -yy_current_state;
yy_find_action:
    yy_act = yy_accept[yy_current_state];
    switch ( yy_act ) {
        case 0: /* back-up transition */
            yy_cp = (yy_last_accepting_cpos) + 1;
            yy_current_state = (yy_last_accepting_state);
            goto yy_find_action;
        case 1: /* ignore */               break;
        case 2: emit_token(yytext);        break;
        case 3: emit_acronym(yytext);      break;
        case 4: emit_apostrophe(yytext);   break;
        /* ... */
    }
}

Figure 4: The core of the tokenizer generated by flex, corresponding to the rule set of Figure 3. In the code, yy_nxt contains the State Transition Table (STT). Array yy_accept marks the accepting states and the semantic actions (rule numbers) associated with them. At any time, the characters between pointers yy_bp and yy_cp are the input partial match candidate that the automaton is considering.
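To give a feel for how such a DFA maps onto the GPGPU programming model of Section 2, the following hedged sketch (our illustration, not the authors' kernel; table layouts and names are placeholders) runs one simplified flex-style scanner loop per thread, each over its own input stream, with the STT and accept table read directly from device memory, as a naïve port would.

    /* One thread = one independent input stream.  'stt' is a flattened
       next-state table (num_states x alphabet entries); 'accept' marks
       accepting states.  Both layouts are illustrative. */
    __global__ void kNaiveTokenize(const short *stt, const char *accept,
                                   const unsigned char *inputs, int stream_len,
                                   int num_streams, int alphabet, int *match_count)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= num_streams)
            return;

        const unsigned char *cp  = inputs + (size_t)tid * stream_len;
        const unsigned char *end = cp + stream_len;
        int matches = 0;

        while (cp < end) {
            int state = 1;                                  /* initial state */
            int last_accept = -1;
            const unsigned char *last_cp = cp;
            while (cp < end &&
                   (state = stt[state * alphabet + *cp]) > 0) {
                if (accept[state]) {                        /* remember the longest match so far */
                    last_accept = accept[state];
                    last_cp = cp;
                }
                ++cp;
            }
            if (last_accept >= 0) {
                cp = last_cp + 1;                           /* back up to the last valid match */
                ++matches;                                  /* a real tokenizer would emit a token here */
            } else {
                ++cp;                                       /* no match: skip one character */
            }
        }
        match_count[tid] = matches;
    }

In such a naïve port, every stt and accept access goes to uncached global memory and, as discussed in Section 2.3, never coalesces; the optimizations in this paper revolve around where these tables are placed.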

[Figure 5 diagram: the automaton, positioned between pointers yy_bp and yy_cp in the input text (e.g., “... its biggest annual gain since 2003, ...”), performs access 1 by reading *yy_cp, access 2 by reading yy_nxt[yy_current_state][*yy_cp] in the state transition table, access 3 by reading yy_accept[yy_current_state] in the accept table, and access 4 by writing a record (start_p, stop_p, file_id, rule_id) to the output token table through token_table_p.]

Figure 5: The memory accesses of a tokenizing automaton. Access 1 is a mostly-linear read with no reuse. Accesses 2 and 3 are mostly-random accesses. Access 4 is a linear write with no reuse. Reads are drawn in green, writes in red, computation in blue.

The uncompressed 16-bit STT for the example above, with ASCII-128 inputs, occupies |states| × |input alphabet| × 16 bits = 174 × 128 × 2 bytes ≈ 44 kbytes. Since flex-based tokenizers implement the longest-of-the-leftmost matching semantics, upon a valid match the DFA can postpone the action in search of a longer match. If the attempt fails, the automaton backs up to the last valid match and re-processes the rest of the input again. Accesses (1–4) never coalesce because, at any time, the DFAs running in the different threads find themselves in uncorrelated states, accessing independent portions of the STT. Not even the input pointers advance with constant strides, since each DFA might at any time take a backup transition and jump back to an arbitrary input location.

STTs naturally contain redundant information. One STT compression technique consists of removing duplicate columns by mapping the input characters into a smaller number of equivalence classes. The number of classes needed depends on the rule set. Our example needs 29 classes, thus reducing the STT size by a factor of 128/29 ≈ 4.4. Equivalence classes come at the price of one additional lookup, since each input character must now be mapped to the class it belongs to.
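The effect of equivalence classes on the inner loop can be seen in a short hedged fragment (our illustration; array names are placeholders): the per-character transition becomes two table lookups, but the second table shrinks from 128 to 29 columns.

    /* Without equivalence classes: a 174 x 128 entry table (about 44 kbytes):
         next = yy_nxt_full[state * 128 + c];
       With equivalence classes: a 128-entry class map plus a 174 x 29 table
       (about 10 kbytes), at the cost of one extra lookup per character. */
    __device__ short next_state(const short *yy_nxt,        /* 174 x 29 compressed STT */
                                const unsigned char *yy_ec, /* 128-entry class map */
                                short state, unsigned char c)
    {
        unsigned char cls = yy_ec[c & 0x7F];   /* map character to its class (ASCII-128) */
        return yy_nxt[state * 29 + cls];       /* transition on the class */
    }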

4. A TOKENIZATION-ORIENTED PERFORMANCE CHARACTERIZATION

On a GPGPU, the different memory classes have significantly different performance, which varies with the characteristics of the accesses: coalescing, stride, block size, caching, spread, locality, alignment, bank and controller congestion, etc. Actual performance can be significantly lower than the peak values advertised by the manufacturer or measured in ideal conditions. For example, the GTX280 is advertised with a bandwidth of 120 Gbyte/s, and the BWtest benchmark in the SDK reports 114 Gbyte/s. In practice, parallel threads accessing single contiguous characters from device memory experience