An FPGA-based syntactic parser for real-life unrestricted context-free grammars

Cristian Ciressan¹, Eduardo Sanchez², Martin Rajman¹, and Jean-Cedric Chappelier¹

¹ Swiss Federal Institute of Technology, Computer Science Dept., LIA, CH-1015 Lausanne, Switzerland
² Swiss Federal Institute of Technology, Computer Science Dept., LSL, CH-1015 Lausanne, Switzerland
Abstract. This paper presents an FPGA-based implementation of a syntactic parser that can process languages generated by real-life unrestricted context-free grammars (CFGs). More precisely, we study the advantages offered by a hardware implementation of a parallel version of a chart parsing algorithm adapted for word-lattice parsing with unrestricted CFGs. Natural Language Processing applications for which parsing speed is an important issue may benefit from such an implementation. The parsing algorithm and the hardware design are first described. Then a method called tiling, based on the decomposition of processor tasks into subtasks of predefined size, is proposed. This method allows a processor and I/O bandwidth load that optimally takes into account the data dependencies associated with the parsing algorithm. Finally, an evaluation of the design performance on real-world data is presented.
1 Introduction

Parsing is an important part of most NLP applications. Typical examples of such applications include:

Textual data processing: advanced Information Retrieval [8,9] and Text Mining [6,7] techniques represent promising approaches for the automated processing of large collections of textual data. They all may require parsing to further enhance performance by integrating syntactic knowledge about the language.

Human-machine interfaces: better integration of syntactic processing within speech-recognition systems is an important goal for the design of efficient multimodal systems. In the case of a sequential coupling, the output of a speech recognizer (a word lattice) may be further processed with a syntactic parser to filter out the word sequences that are not syntactically correct.

Moreover, for industrial applications with strong real-time and/or data-size constraints, especially efficient parsing solutions need to be proposed. As far as context-free languages are concerned, VLSI implementations, based
on 2D arrays of processors, have been proposed, both for the CYK [2] and the Earley [3] parsing algorithms. Although these designs do meet the usual VLSI requirements (constant-time processor operations, regular communication geometry, uniform data movement), the hardware resources they require do not allow them to accommodate the real-life unrestricted context-free grammars used in large-scale NLP applications (for instance, the CFG extracted from the SUSANNE corpus [1], used in our experiments, contains 1,912 non-terminals and 66,123 grammar rules).

In previous contributions [15,16], we proposed two FPGA-based implementations of the CYK algorithm [5] that can process such real-life CFGs. We chose FPGA technology because it provides hardware designers with flexible and rapid system-prototyping tools, and also because, in the near future, FPGA components are expected to appear in general-purpose processors. The main advantages of these two designs were: (1) their ability to parse word lattices, which makes them better adapted for integration within speech-recognition systems; (2) their scalability; and (3) a speed-up of about a factor of 30 [16] when compared against a software implementation of the CYK algorithm. Both designs were described in VHDL, simulated for validation, synthesized, and tested on an existing FPGA board.

On the other hand, the weak points of the former designs were: (1) the fact that they require CFGs written in Chomsky Normal Form, (2) a low average processor load, and (3) a low average I/O bandwidth load. These drawbacks are addressed in this paper, in which we propose a new implementation. The new design is based on a general item-based chart parsing algorithm (an enhanced version of the CYK algorithm [13,17]) that allows the processing of almost unrestricted CFGs; the only restrictions are that the CFGs have to be non partially lexicalized (see section 2.1) and must not contain any chain rules, i.e. rules of the form X → Y (notice, however, that any CFG containing chain rules without cycles can be automatically rewritten into an equivalent CFG with no such rules).

The low average processor and I/O bandwidth load was a direct consequence of the data dependencies imposed by the CYK algorithm. Indeed, due to these data dependencies, it often happened that only a few processors were working on some big chunk of data while all the others were idle, waiting for input. In the new design, this blocking-data problem is solved by splitting the big chunks of data to be processed into smaller parts (tiles) that are then distributed over the processors. Thus, almost all processors are kept busy almost all the time, and the I/O bandwidth load is also improved thanks to the continuous reading and writing of the processing results. Among the other improvements over the previous designs, we can mention a simpler initialization procedure (that eliminates the need for guard-vectors) and a faster processor implementation.

In section 2, we briefly present the general item-based chart parsing algorithm. Section 3 describes the hardware design and its main components.
Section 4 analyzes the performance of the FPGA design in comparison with a software implementation of the parsing algorithm. Finally, conclusions are presented in section 5.
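As an aside on the chain-rule restriction mentioned above: when the chain relation is acyclic, each chain rule X → Y can be eliminated by letting X inherit the non-chain rules of every Y it chain-derives. The sketch below illustrates this rewriting under our own assumptions (lexical rules handled separately, as in the nplCFG setting; the rule encoding and the toy grammar are purely illustrative).

```python
def eliminate_chain_rules(rules):
    """Remove chain rules X -> Y (single non-terminal right-hand side),
    assuming the chain relation is acyclic and lexical rules are kept elsewhere.
    `rules` is a set of (lhs, rhs) pairs, rhs a tuple of non-terminal names."""
    chain = {(x, rhs[0]) for x, rhs in rules if len(rhs) == 1}
    # Transitive closure of the chain relation (terminates: no cycles assumed).
    closure = set(chain)
    changed = True
    while changed:
        changed = False
        new = {(x, z) for x, y in closure for y2, z in chain if y2 == y}
        if not new <= closure:
            closure |= new
            changed = True
    longer = {(x, rhs) for x, rhs in rules if len(rhs) > 1}
    # Every X that chain-derives Y inherits Y's non-chain rules.
    return longer | {(x, rhs) for x, y in closure for lhs, rhs in longer if lhs == y}

rules = {("S", ("NP", "V")), ("NP", ("N",)), ("N", ("ADJ", "N"))}
expanded = eliminate_chain_rules(rules)
print(sorted(expanded))
# [('N', ('ADJ', 'N')), ('NP', ('ADJ', 'N')), ('S', ('NP', 'V'))]
```

The chain rule NP → N disappears, and NP inherits N's rule instead, leaving a weakly equivalent grammar with no single-non-terminal right-hand sides.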
2 The general item-based chart parsing algorithm for word lattice parsing

2.1 The parsing algorithm

A CF grammar is a 4-tuple G = (N, Σ, S, R) where: N is a set of non-terminals (representing grammatical categories, e.g. verb V, noun-phrase NP, sentence S, ...); Σ is a set of terminals (representing words); S ∈ N is the top-level non-terminal (corresponding to a sentence); R is a set of grammar rules, i.e. a subset of N × (N ∪ Σ)⁺ whose elements are written in the form X → α, where X ∈ N and α ∈ (N ∪ Σ)⁺. For instance, a grammar rule such as S → NP V expresses the fact that a sentence (S) consists of a noun-phrase (NP) followed by a verb (V). In the following, we use capital letters X, Y, Z, ... to denote non-terminal symbols and w1, w2, ... to denote terminal symbols.

The implemented parsing algorithm is designed to process non partially lexicalized CFGs, a large subclass of CFGs henceforth denoted as nplCFGs. An nplCFG is a CFG in which terminal symbols, i.e. words, only occur in rules of the form X → w1 w2 ... wn (usually n = 1; n > 1 for compound words), called lexical rules. For our design, the restriction to nplCFGs is very useful as it allows us to restrict the processing of lexical rules to the initialization step only, outside the FPGA chip (with an on-board general-purpose processor). In addition, to further increase the efficiency of the hardware implementation, we restrict ourselves to nplCFGs without chain rules (though this is not a restriction of the general item-based chart parsing algorithm, it avoids the implementation of stacks within the FPGA design for walking the chains).

The implemented parsing algorithm uses a triangular table (hereafter called the parsing table) with n(n+1)/2 cells, where n is the length of the input sentence w1 ... wn. Let w_{i,j} = w_i w_{i+1} ... w_{i+j-1} be the subpart of length j of the input sentence that starts at w_i. The cell at row j and column i in the table will contain the following two kinds of sets:

- a set of non-terminals X that can derive w_{i,j} (i.e. that can produce w_{i,j} after the application of a finite number of grammar rules). Formally, this set is a subset of N defined by: N1_{i,j} = {X ∈ N : X ⇒* w_{i,j}}
- a set of items representing partial parsings of the string w_{i,j}. Formally, this set is defined by: N2_{i,j} = {α : ∃β ∈ N⁺, ∃(X → αβ) ∈ R, s.t. α ⇒* w_{i,j}}
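The triangular table above can be stored as a flat array of n(n+1)/2 cells. The row-major layout sketched below is an illustrative assumption of ours, not the memory layout actually used by the design: row j holds the n − j + 1 spans of length j, and cell (i, j) maps to a single linear index.

```python
def cell_index(i, j, n):
    """Flat index of cell (row j, column i) in the triangular parsing table
    for an n-word sentence; row j holds the n - j + 1 spans of length j.
    The layout is an illustrative assumption, not the design's actual one."""
    assert 1 <= j <= n and 1 <= i <= n - j + 1
    offset = (j - 1) * n - (j - 1) * (j - 2) // 2  # total cells in rows 1..j-1
    return offset + (i - 1)

n = 4
indices = [cell_index(i, j, n) for j in range(1, n + 1) for i in range(1, n - j + 2)]
print(indices)            # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(n * (n + 1) // 2)   # 10 cells in total
```

Enumerating the cells row by row yields exactly the indices 0 to n(n+1)/2 − 1, confirming that the mapping is a bijection onto the flat array.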
With these notations, the implemented parsing algorithm is then defined as follows:
1: for j = 1 to n do
2:   for i = 1 to n − j + 1 do
3:     N1_{i,j} = {X : (X → w_i ... w_{i+j−1}) ∈ R}
4:     N2_{i,j} = {Y : ∃β ∈ N⁺, ∃(X → Yβ) ∈ R, s.t. Y ⇒* w_{i,j}}
5: for j = 2 to n do
6:   for i = 1 to n − j + 1 do
7:     for k = 1 to j − 1 do
8:       N1_{i,j} = N1_{i,j} ∪ {X : ∃(X → αY) ∈ R with α ∈ N2_{i,k} and Y ∈ N1_{i+k,j−k}}
9:       N2_{i,j} = N2_{i,j} ∪ {αY : ∃(X → αYβ) ∈ R with α ∈ N2_{i,k}, Y ∈ N1_{i+k,j−k} and β ∈ N⁺}
Lines 1-4 of the algorithm correspond to the initialization step, during which the sets N1_{i,j} are initialized with the elements X for which the grammar contains lexical rules of the form X → w_i w_{i+1} ... w_{i+j−1}. The sets N2_{i,j} are initialized with the elements Y of N1_{i,j} that occur as the leftmost non-terminal in the right-hand side of some grammar rule. Lines 5-9 correspond to the subsequent filling-up of the parsing table. When processing word lattices, the initialization step is adapted in the same way as discussed in [15,16].
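The nine lines above can be sketched in software as follows. This is a minimal sketch under our own assumptions: the grammar encoding is ours, the toy grammar is hypothetical (not from the paper), and we explicitly register length-1 items for newly derived non-terminals at the end of each span, a step the paper folds into its set definitions.

```python
from collections import defaultdict

def parse(words, lexical, rules):
    """Item-based chart parsing (generalized CYK) for a non-partially-lexicalized
    CFG without chain rules. `lexical` maps word tuples to sets of non-terminals;
    `rules` is a set of (lhs, rhs) with rhs a tuple of non-terminals, len(rhs) >= 2."""
    n = len(words)
    # Proper prefixes of rule right-hand sides: the possible N2 items.
    proper_prefixes = {rhs[:k] for _, rhs in rules for k in range(1, len(rhs))}
    N1 = defaultdict(set)  # N1[i, j]: non-terminals deriving words[i-1 : i-1+j]
    N2 = defaultdict(set)  # N2[i, j]: proper RHS prefixes deriving the same span
    # Initialization (lines 1-4): lexical rules, then length-1 items.
    for j in range(1, n + 1):
        for i in range(1, n - j + 2):
            span = tuple(words[i - 1:i - 1 + j])
            N1[i, j] |= lexical.get(span, set())
            N2[i, j] |= {(Y,) for Y in N1[i, j] if (Y,) in proper_prefixes}
    # Filling up the parsing table (lines 5-9).
    for j in range(2, n + 1):
        for i in range(1, n - j + 2):
            for k in range(1, j):
                for alpha in list(N2[i, k]):
                    for Y in N1[i + k, j - k]:
                        item = alpha + (Y,)
                        if item in proper_prefixes:   # line 9: extend the item
                            N2[i, j].add(item)
                        for lhs, rhs in rules:        # line 8: complete a rule
                            if rhs == item:
                                N1[i, j].add(lhs)
            # Register length-1 items for the non-terminals just derived,
            # so that longer spans can extend them later.
            N2[i, j] |= {(Y,) for Y in N1[i, j] if (Y,) in proper_prefixes}
    return N1

lexical = {("the",): {"D"}, ("dog",): {"N"}, ("cat",): {"N"}, ("sees",): {"V"}}
rules = {("NP", ("D", "N")), ("S", ("NP", "V", "NP"))}
chart = parse("the dog sees the cat".split(), lexical, rules)
print(chart[1, 5])  # {'S'}: the whole sentence is derivable from S
```

Note that the ternary rule S → NP V NP is handled directly, without any conversion to Chomsky Normal Form: the item (NP, V) in cell (1, 3) records a partial parse that cell (1, 5) later completes with the NP of cell (4, 2).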
3 The FPGA Design

The block diagram given in figure 1 provides a synoptic view of our design for an N-processor FPGA parser. In this diagram, all the elements inside the dashed line are implemented within the FPGA chip. The other ones (the parsing memory and the grammar memories GMi) are implemented with SRAM chips. Notice that the maximal length of the sentences that can be parsed is independent of the number of processors in the system and only depends on the size of the parsing memory. For example, with the current implementation, 256 Kbytes of memory allow a maximal sentence length of 32 words, which can be increased up to 256 words if 16 Mbytes are available. Another important aspect of the design is that the parsing result, i.e. the produced parse forest, is not entirely stored in the parsing memory but shipped (in an efficiently packed format) outside the FPGA circuit during parsing (through the outPARSE output), and can be further sent through a PCI interface to an external processing device (PC or workstation). Let us now provide a brief description of the functioning of the hardware parser.
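The memory figures quoted above are consistent with the quadratic growth of the parsing table. As a sanity check (ours, not the paper's): going from 32 to 256 words multiplies the number of cells by about 62, which fits within the quoted 64-fold memory increase, assuming a roughly fixed per-cell budget.

```python
def cells(n):
    """Number of cells in the triangular parsing table for an n-word sentence."""
    return n * (n + 1) // 2

# The cell count is quadratic in n, so multiplying the maximal sentence
# length by 8 multiplies the table size by roughly 64.
print(cells(32), cells(256))          # 528 32896
print(cells(256) / cells(32))         # about 62.3
print((16 * 2**20) // (256 * 2**10))  # 64: the quoted 256 KB -> 16 MB ratio
```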
[Figure 1 (block diagram): inside the FPGA, the SEQ-GEN, CHECKER (with its D-TABLE), POOL, DISPATCHER/TILLER, READER, WRITER, EXTRACTOR, LOOKUP (with CTX and CTXRAW memories), BUFFER MEMORY and IOctrl modules, with control signals startPARSE, overPARSE, SLEN_1 and the outPARSE output; outside the FPGA, the parsing memory with its arbiter, and the processors P1, P2, ..., PN with their grammar memories GM1, GM2, ..., GMN.]

Fig. 1. An N-processor system
Initialization. Before any parsing can start, the grammar memories are configured with a binary image of the data structure representing the nplCFG grammar used. This initialization needs to be done only once for a given grammar. Similarly, the parsing memory is initialized with the data structures representing the sets N1_{i,j} and N2_{i,j} derived from the lexical rules. In addition, the I/O controller IOctrl is initialized with the length of the sentence to be parsed. These two initializations need to be done before each parsing.

Parsing. Once initialization is done, the startPARSE signal initiates the parsing and the SEQ-GEN module starts to generate triples of the form [D,S1,S2], where S1 and S2 identify two source cells whose contents need to be combined (see lines 8-9 of the algorithm) and possibly stored in the destination cell D (the sequence of generated triples depends on the length of the currently parsed sentence). For each of the generated triples, the CHECKER module checks whether the two source cells to be combined are ready, i.e. that they are not destinations of previous unfinished combination processes (data-dependency constraints). For this purpose, the CHECKER uses the D-table (destination table) to store information about all destination cells currently under processing. If both source cells satisfy the data-dependency check, the corresponding triple is passed to the DISPATCHER module, which has the task of distributing the work corresponding to the processing of the triple over the available processors. Conversely, when the data-dependency check
Fig. 2. Comparison of processor activity for three different designs, for an 8-word (left column) and a 15-word (right column) sentence: the design presented in [15] (top), in [16] (middle), and in this paper (bottom). The vertical axis represents the processors (1-14) and the horizontal axis the time in nanoseconds (scale ×10⁵).
is not satisfied, the triple [D,S1,S2] is stored in the POOL, where it is kept for further periodical data-dependency checks. As soon as the data-dependency constraints are satisfied, the triple is sent back to the CHECKER and immediately handed over to the DISPATCHER.

Task distribution (tiling). As already mentioned, the goal of the DISPATCHER is to split the work corresponding to the combination of two cells into smaller subtasks that can be distributed over the available processors. To do so, the DISPATCHER first fetches the N2 set corresponding to the S1 cell and the N1 set corresponding to the S2 cell. The combination of S1 and S2 then requires exploring all the possible pairings of an element of N2 with an element of N1, in other words, exploring the cross-product N2 × N1. To realize this in an efficiently distributed way, the idea is to
break the N2 and N1 sets into smaller chunks of predefined size (not necessarily the same for N2 and N1), say N2^1, N2^2, ..., N2^{k2} and N1^1, N1^2, ..., N1^{k1} respectively, and to distribute the tasks corresponding to the exploration of the smaller cross-products N2^1 × N1^1, N2^1 × N1^2, ..., N2^{k2} × N1^{k1} (called tiles) over the available processors. The choice of the predefined sizes for the tiles is important: they should be neither too small (e.g. 2x2) nor too large (e.g. 16x16), to guarantee optimal processor load and design performance.

Once tiling has been performed, the DISPATCHER sends the subtasks to the processors along with an ID that uniquely identifies the destination cell to which they correspond. Thus, when a processor finishes working on a subtask, it sends the result to the WRITER module, which further sends it both to the EXTRACTOR and to the parsing memory (via the LOOKUP unit). The LOOKUP unit contains two associative memories, one for the destination sets N2 and one for the destination sets N1, that are used to verify whether a given value is already present in the destination set or not. This is necessary in order not to duplicate the information in the N1, respectively N2, sets. The associative memories replace the guard-vectors that were used in our previous designs [15,16]. As the processors usually work on several destination cells at the same time and the associative memories can store information for only one given destination cell, the ability of these memories to switch contexts is essential. This is accomplished with the ID that comes along with every processor result (and which is the same as the ID associated by the DISPATCHER with each subtask). The CTXRAW memory stores valid contexts for all destinations currently under processing, and the current context is set by the CHECKER whenever a new destination is encountered. The CTX memory contains all the information required for context switching: the size of the N1 and N2 sets, the physical address in the parsing memory for each of these sets, etc.

Impact of tiling. Figure 2 shows a comparison of the processor activity for three different designs with 14 processors, for the analysis of an 8-word and a 15-word sentence. The two figures at the top correspond to the design presented in [15] (speed-up factor 10.77), the two in the middle to the design presented in [16] (speed-up factor 29.45), and the two at the bottom to the design proposed in this paper (speed-up factor 58). For all these designs, we used the same grammar: the CF grammar extracted from the SUSANNE corpus and transformed into Chomsky Normal Form (Chomsky Normal Form was required because the older designs can only accommodate such grammars and we wanted to use exactly the same grammar for all three designs in this comparison; when the general form of the grammar is used instead, the speed-up factor grows from 58 to 244 for the current implementation). In the figures, the vertical axis identifies the processors (numbered from 1 to 14), the horizontal axis the time scale in nanoseconds, and a black area indicates that a processor is working. For the oldest design (top figures), all the processors are resynchronized after processing a line in the parsing table. This strategy leads to poor processor load, because processors are often idle while waiting for the ones working on the longest tasks. In the second design (middle figures), improvement is achieved by eliminating the synchronization requirement, but low processor load can still be observed, because some processors work on large chunks of data while the others are blocked (and idle) by data-dependency constraints. The usefulness of splitting the tasks into smaller chunks, as implemented in the current design, is clearly illustrated by the bottom figures, which show almost continuous processor activity.
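The work-distribution pipeline described above (SEQ-GEN triples, then DISPATCHER tiling) can be sketched in software as follows. The loop order of the triple generator and the 4x4 tile size are our assumptions for illustration, not values taken from the design.

```python
from itertools import product

def generate_triples(n):
    """SEQ-GEN sketch: yield triples (D, S1, S2) in the order suggested by
    lines 5-9 of the algorithm, where destination cell (i, j) combines the
    source cells (i, k) and (i + k, j - k)."""
    for j in range(2, n + 1):
        for i in range(1, n - j + 2):
            for k in range(1, j):
                yield (i, j), (i, k), (i + k, j - k)

def tiles(n2_items, n1_items, tile_h=4, tile_w=4):
    """DISPATCHER sketch: split the cross-product N2 x N1 of one triple into
    tile_h x tile_w sub-tasks (tiles) to be spread over the processors."""
    for r in range(0, len(n2_items), tile_h):
        for c in range(0, len(n1_items), tile_w):
            yield list(product(n2_items[r:r + tile_h], n1_items[c:c + tile_w]))

# For a 3-word sentence, SEQ-GEN emits 4 triples ...
triples = list(generate_triples(3))
print(triples)
# ... and combining a 10-item N2 set with a 6-element N1 set yields
# 3 x 2 = 6 tiles that together cover all 60 pairings.
work = list(tiles([f"alpha{i}" for i in range(10)], [f"Y{i}" for i in range(6)]))
print(len(work), sum(len(t) for t in work))  # 6 60
```

Because every pairing appears in exactly one tile, the tiles can be processed in any order and on any processor, which is what keeps the processors busy despite the data-dependency constraints on whole cells.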
4 Design Performance

The performance measurements presented in this section were made with a grammar extracted from the SUSANNE corpus, henceforth referred to as the SUSANNE grammar. As already mentioned, the SUSANNE grammar contains 1,912 non-terminals and 66,123 grammar rules, and its binary image requires 558,576 bytes of memory. The chosen parsing memory size is 256 Kbytes, thus allowing the parsing of any sentence of length up to 32 words. The width of the data bus between the processors and the grammar memories is 32 bits. Due to the limited number of pins (804 user pins) on the Xilinx Virtex 2000efg1156 FPGA available, only 14 processors were used.

In order to determine the maximal clock frequency at which the system could work, the design was synthesized (with LeonardoSpectrum v2000.1a2) and placed and routed (with Design Manager, Xilinx Alliance Series 2.1i) in the FPGA. The resulting system uses about 60% of the FPGA and was benchmarked (with ModelSim SE-EE 5.4c) at a clock frequency of 60 MHz.

The software parser used for comparison is an implementation of the parsing algorithm run on a SUN workstation (Ultra-SPARC 60) with 512 Mbytes of memory, 770 Mbytes of swap memory, and one processor at a clock frequency of 360 MHz. The initialization of the parsing memory was not taken into account in the computation of the run-times. For accuracy, the timing was done with the time() C library function and not by profiling the code. To compute the comparison results, 2,022 sentences were parsed and validated (the hardware output was compared against the software output to detect mismatches). These test sentences had a length ranging from 3 to 15 words and were all taken from the SUSANNE corpus. The obtained average speed-up factor was 244, and figure 3 shows its evolution as a function of the sentence length.
Fig. 3. Hardware speedup (vertical axis: speedup factor, 170-270) as a function of the sentence length (horizontal axis: 2-16 words). Average timings over at least 100 sentences for each sentence length.
As the output data (the produced compacted parse forest) was dynamically made available at the output pins of the hardware, it was also important to determine the minimal data-rate at which an external interface (e.g. a PCI interface) should collect the data in order not to create a bottleneck in the system at this point. The same 2,022 sentences of the SUSANNE corpus were used, and the test showed that the required minimal transmission rate was 132 Mbyte/s, which can be achieved with a PCI interface.
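The figure of 132 Mbyte/s sits just under the peak throughput of a conventional 32-bit/33 MHz PCI bus; the specific bus variant is our assumption, as the text only says "a PCI interface".

```python
# Peak throughput of a standard 32-bit / 33 MHz PCI bus (an assumption about
# the interface the text has in mind): 4 bytes per cycle at ~33.33 MHz.
pci_peak_mb_s = 33.33e6 * 4 / 1e6    # about 133 Mbyte/s
required_mb_s = 132.0                # measured minimal rate over 2,022 sentences
print(pci_peak_mb_s >= required_mb_s)  # True, though with very little headroom
```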
5 Conclusions

A design implementing an enhanced version of the CYK algorithm for the parsing of word lattices with almost unrestricted CFGs has been presented. Several hardware implementations and performance measurements have been carried out, allowing us to compare the original designs for the CYK algorithm (as presented in [15,16]) with the one given in this paper. The main conclusions of this work are: (1) in the CYK framework, data dependencies are the main reason for low processor load; (2) task distribution, as implemented through the tiling mechanism in our design, strongly contributes to the improvement of the overall processor load; (3) when tiling is used, I/O bandwidth and processor load are directly dependent. This means that increasing the processor load also requires increasing the available I/O bandwidth to avoid an I/O bottleneck in the system. In its current version, our design achieves a speed-up factor of about 244 when compared with a pure software implementation. Additional investigations will be carried out in order to determine the critical parameters that should be further optimized to increase the efficiency and integrability of our hardware parser.
References

1. G. Sampson (1994) The Susanne Corpus Release 3. School of Cognitive & Computing Sciences, University of Sussex, Falmer, Brighton, England.
2. K.H. Chu and K.S. Fu (1982) VLSI architectures for high speed recognition of context-free languages and finite-state languages. Proc. 9th Annu. Int. Symp. Comput. Arch., 43-49.
3. Y.T. Chiang and K.S. Fu (1984) Parallel Parsing Algorithms and VLSI Implementations for Syntactic Pattern Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
4. S.R. Kosaraju (1975) Speed of recognition of context-free languages by array automata. SIAM J. Comput., 4, 331-340.
5. A.V. Aho and J.D. Ullman (1972) The Theory of Parsing, Translation and Compiling. Prentice-Hall.
6. R. Feldman et al. (1998) Text Mining at the Term Level. Proc. of the 2nd European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD'98), Nantes, France.
7. R. Feldman et al. (1996) Efficient Algorithm for Mining and Manipulating Associations in Texts. 13th European Meeting on Cybernetics and Systems Research.
8. P. Schauble (1997) Multimedia Information Retrieval: Content-Based Information Retrieval from Large Text and Audio Databases. Kluwer Academic Publishers.
9. D. Hull et al. (1996) Xerox TREC-5 Site Report: Routing, Filtering, NLP and Spanish Tracks. NIST Special Publication 500-238: The Fifth Text REtrieval Conference (TREC-5), Gaithersburg, Maryland.
10. P. Linz (1997) An Introduction to Formal Languages and Automata. Jones and Bartlett Publishers.
11. K.S. Fu (1974) Syntactic Methods in Pattern Recognition. Academic Press.
12. J.-C. Chappelier et al. (1999) Lattice Parsing for Speech Recognition. Proc. of the 6ème conférence sur le Traitement Automatique du Langage Naturel (TALN'99), 95-104, Cargèse, France.
13. J.-C. Chappelier and M. Rajman (1998) A Generalized CYK Algorithm for Parsing Stochastic CFG. TAPD'98 Workshop, 133-137, Paris, France.
14. E. Sanchez and M. Tomassini (1996) Towards Evolvable Hardware. Springer-Verlag, Berlin.
15. C. Ciressan et al. (2000) An FPGA-based coprocessor for the parsing of context-free grammars. 2000 IEEE Symp. on Field-Programmable Custom Computing Machines, Computer Society Press, 236-245.
16. C. Ciressan et al. (2000) Towards NLP co-processing: an FPGA implementation of a context-free parser. TALN 2000, 7ème conférence annuelle sur le Traitement Automatique des Langues Naturelles, 99-100, Lausanne, Switzerland.
17. J.-C. Chappelier and M. Rajman (1998) A generalized CYK algorithm for parsing stochastic CFG (TR 98/284). DI-LIA, EPFL.