Parsing using the PARSEC Vector Processing Chip

M. T. Rowland†, B. Perazich‡, R. A. Helzerman, M. P. Harper, J. P. Robertson, G. D. Rogers, T. R. Johoski, E. Toepke, and H. Rosario
School of Electrical Engineering, Purdue University, West Lafayette, IN 47907

Abstract

This paper describes the implementation of the PARSEC¹ chip, a vector processing element (PE) for parsing languages. This chip has applications not only in natural language processing, but can also be applied to other constraint satisfaction problems. The PARSEC chip is based on a parsing algorithm which formerly ran in real time on a massively parallel machine [8]; the chip, however, can achieve processing speeds fast enough for real-time language processing systems while at the same time having a price and form factor suitable for mass market applications.
A key component of any natural language interface is its parsing algorithm. Because some features of English (e.g., context) are clumsy or impossible to handle using existing parsers, we have extended and implemented a parsing algorithm based on a new, flexible grammatical formalism, called Constraint Dependency Grammar (CDG), introduced by Maruyama [11, 12, 13].

This work has been supported in part by Purdue Research Foundation, Intel, NSF grant number IRI9011179, and NSF Parallel Infrastructure Grant CDA-9015696.
† Current address: JF1-71, 2111 N. E. 25th Avenue, Hillsboro OR 97124-5961; current email:
[email protected]. z Current address: RN4-16, 2200 Mission College Blvd., Santa Clara CA 95052, current email:
[email protected]. 1 PARSEC is an acronym for Parallel Architecture Sentence Constrainer.
Although CDG has proven effective for processing English [6, 20] and improving the accuracy of spoken language recognition systems [6, 20], its O(n^4) running time (where n is the number of words in a sentence) has motivated the use of massively parallel processing to speed up its performance. Although O(log n) parsing times can be achieved with commercially available parallel machines such as the MasPar MP-1 [8], the cost and bulkiness of such systems prohibit their use for commercial applications. To make the benefits of CDG parsing more cost effective, we have designed and are developing the PARSEC chip, a vector PE. The PARSEC chip is a fully-custom 8,000 transistor device currently implemented in a two-micron CMOS process. It has a maximum clock rate of ten MHz, and its two-stage pipeline gives it a sustained throughput of ten million constraints per second (for comparison, a Sparc-I has a sustained throughput of not quite one million constraints per second). Multiple PARSEC chips can be used in parallel during parsing to provide additional speedup. In Section 1, we describe Constraint Dependency Grammar. Section 2 discusses some of the advantages of constraint parsing. Section 3 outlines the parallelization of the CDG parsing algorithm given a CRCW P-RAM model of computation [8]. In Section 4, we describe the vectorization of CDG parsing, and in Section 5, we describe the design decisions involved in implementing the algorithms in silicon. In Section 6, we estimate the speedup that the chip will provide by parsing sentences of various lengths in two different languages.
1 Constraint Dependency Grammars

A CDG grammar consists of a four-tuple, ⟨Σ, L, R, C⟩, where:

    Σ = set of terminal symbols
    L = set of labels = {l_1, ..., l_p}
    R = set of roles = {r_1, ..., r_q}
    C = a set of k_u unary and k_b binary constraints
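As an illustration only, this four-tuple can be rendered as a small data structure; the Python names below are ours, and the concrete symbols mirror the running example rather than a full grammar.

    from dataclasses import dataclass
    from typing import Callable, List, Set

    @dataclass
    class CDGGrammar:
        """A CDG grammar <Sigma, L, R, C>: terminal symbols, labels, roles,
        and unary/binary constraints (modeled here as Python predicates)."""
        sigma: Set[str]                      # terminal symbols (word categories)
        labels: Set[str]                     # L = {l_1, ..., l_p}
        roles: Set[str]                      # R = {r_1, ..., r_q}
        unary_constraints: List[Callable]    # each tests a single role value
        binary_constraints: List[Callable]   # each tests a pair of role values

    toy_grammar = CDGGrammar(
        sigma={"det", "noun", "verb"},
        labels={"DET", "SUBJ", "ROOT"},
        roles={"governor"},
        unary_constraints=[],                # filled in as constraints are introduced below
        binary_constraints=[],
    )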
To develop a syntactic analysis for a sentence using CDG, a constraint network (CN) of word nodes is constructed. Associated with each node is its position and a set of roles, which indicate the various functions the word fills in a sentence. Though two roles are required to write a grammar at least as expressive as a context-free grammar [11], our examples depict a single role, governor, which represents the sentence function a word fills when governed by its head.
Figure 1: The word nodes for A fish eats.

Each role is initially assigned a set of role values, where a role value consists of a label (the function the word can serve, e.g., SUBJ) and a modifiee (either nil or the position of one of the other words in the sentence). The role values assigned to a role are limited by the word's lexical categories. There are p × n = O(n) possible role values (where p is a grammatical constant for the number of labels, and n is the number of modifiees or words) for each of the q × n roles in the sentence (where q is a grammatical constant for the number of roles per word). Hence, there are O(n^2) role values, which require O(n^2) time to generate. Figure 1 shows the initialization of the role values for the sentence A fish eats.² Each role value keeps track of several pieces of information including its label, modifiee, role, category, and position. Once the word nodes are constructed, constraints are applied to the role values to eliminate the ungrammatical ones. Each constraint must be of the form (if Antecedent Consequent), where Antecedent and Consequent are predicates or predicates joined by logical connectives. The variables in constraints denote role values in a role. Constraints can contain the following access functions, predicates, and logical connectives:
Access Functions:
(pos x) returns the position of the word for role value x.
(rid x) returns the role-id for role value x.
(lab x) returns the label for role value x.
(mod x) returns the modifiee's position for role value x.
(cat x) returns the category (in Σ) for role value x.³
2. For simplicity of presentation, assume that each word has a single word category.
3. Maruyama uses the access function word rather than cat to access the category of a word using its position.
; [U-1] A ROOT modifies no word in the sentence.
(if (eq (lab x) ROOT) (eq (mod x) nil))

; [U-2] A DET and a SUBJ must modify a word to its right.
(if (or (eq (lab x) DET) (eq (lab x) SUBJ)) (lt (pos x) (mod x)))

; [B-1] A DET must modify a noun to its right.
(if (and (eq (lab x) DET) (eq (mod x) (pos y))) (eq (cat y) noun))

Figure 2: Examples of unary and binary constraints.
Predicate symbols:
(eq x y) returns true if x = y, false otherwise.
(gt x y) returns true if x > y and x, y ∈ Integers, false otherwise.⁴
(lt x y) returns true if x < y and x, y ∈ Integers, false otherwise.
(elt x y) returns true if x ∈ y, false otherwise.

Logical Connectives:
(and p q) returns true if p and q are true, false otherwise.
(or p q) returns true if p or q is true, false otherwise.
(not p) returns true if p is false, false otherwise.
A unary constraint contains only one variable (e.g., U-1 and U-2 in Figure 2), while a binary constraint contains two (e.g., B-1 in Figure 2). One and two variable constraints allow for sufficient expressivity [11], and more than two would unreasonably increase the running time of the parsing algorithm. To parse a sentence, unary constraints are initially applied to each role value to eliminate ungrammatical role values from the roles. For example, the first unary constraint U-1 eliminates all of the role values from the governor role for eats except for ROOT-nil in Figure 1. Because a unary constraint can be tested against one role value in constant time and there are O(n^2) role values to check, the time to apply a single unary constraint is O(n^2). Initially, many unary constraints are applied to reduce the number of legal role values, requiring O(k_u n^2) time, where k_u represents a constant number of unary constraints. After the second unary constraint is propagated, a's role contains {DET-2, DET-3} and fish's role contains {SUBJ-3} (see Figure 3).
4. For example, (gt 1 nil) is false, because nil is not an integer.
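To make the initialization and the unary step concrete, here is a minimal sequential sketch in Python. The data-structure and lexicon names are ours, the toy lexicon assigns each word the single label shown in Figure 1, and u1/u2 transcribe U-1 and U-2 from Figure 2; this is an illustration, not the paper's implementation.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass(frozen=True)
    class RoleValue:
        pos: int            # position of the word this role value belongs to
        rid: str            # role id (here always "governor")
        lab: str            # label, e.g. SUBJ
        mod: Optional[int]  # modifiee position, or None for nil
        cat: str            # lexical category of the word

    # Toy lexicon: each word has a single category and a single candidate label.
    LEXICON = {"a": ("det", ["DET"]), "fish": ("noun", ["SUBJ"]), "eats": ("verb", ["ROOT"])}

    def generate_role_values(sentence):
        """Create the O(n^2) initial role values: every label times every modifiee."""
        n = len(sentence)
        values = []
        for pos, word in enumerate(sentence, start=1):
            cat, labels = LEXICON[word]
            for lab in labels:
                for mod in [None] + [p for p in range(1, n + 1) if p != pos]:
                    values.append(RoleValue(pos, "governor", lab, mod, cat))
        return values

    def u1(x):  # U-1: a ROOT modifies no word in the sentence
        return x.mod is None if x.lab == "ROOT" else True

    def u2(x):  # U-2: a DET and a SUBJ must modify a word to their right
        return (x.mod is not None and x.pos < x.mod) if x.lab in ("DET", "SUBJ") else True

    def propagate_unary(values, constraints):
        return [x for x in values if all(c(x) for c in constraints)]

    if __name__ == "__main__":
        survivors = propagate_unary(generate_role_values(["a", "fish", "eats"]), [u1, u2])
        # Matches Figure 3: a keeps {DET-2, DET-3}, fish keeps {SUBJ-3}, eats keeps {ROOT-nil}.
        for v in survivors:
            print(v.pos, v.lab, v.mod)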
Figure 3: The CN before binary constraint propagation.
Figure 4: The CN after binary constraint propagation.

Next, each of the roles for the sentence is connected to all of the other roles with an arc in preparation for the propagation of binary constraints. Associated with each arc is a matrix whose row and column indices are the role values associated with the two roles. Initially, all entries in the matrices are set to one. Figure 3 shows the network after arc construction prior to binary constraint propagation. After the arc matrices are constructed in O(n^4) time, binary constraints are applied to the pairs of role values that represent the indices for matrix entries. If a binary constraint fails for a pair of role values, then they cannot coexist in the same sentence, which is indicated by setting the entry in the matrix to zero. Figure 4 shows the network after the propagation of the binary constraint B-1. Since it is applied to O(n^4) pairs of role values, the time to apply the constraint is O(n^4). The time required to propagate k_b binary constraints is O(k_b n^4). Following the propagation of binary constraints, the network could contain role values that would never be legal role values in a parse for the sentence. Illegal role values are eliminated by iteratively removing those with rows or columns containing only zeros, a process called filtering.
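A minimal sketch of the arc-matrix construction and binary constraint propagation just described, continuing the toy example (B-1 is transcribed from Figure 2; the function and variable names are ours):

    from dataclasses import dataclass
    from itertools import combinations
    from typing import Dict, List, Optional

    @dataclass(frozen=True)
    class RoleValue:          # same shape as in the unary sketch above
        pos: int
        rid: str
        lab: str
        mod: Optional[int]
        cat: str

    def b1(x: RoleValue, y: RoleValue) -> bool:
        """B-1 from Figure 2: a DET must modify a noun to its right."""
        if x.lab == "DET" and x.mod == y.pos:
            return y.cat == "noun"
        return True

    def build_arcs(roles: Dict[int, List[RoleValue]]):
        """One 0/1 matrix per pair of roles; every entry starts out as 1."""
        return {
            (i, j): {(x, y): 1 for x in roles[i] for y in roles[j]}
            for i, j in combinations(sorted(roles), 2)
        }

    def propagate_binary(arcs, constraints) -> None:
        """Set an entry to 0 when its pair of role values violates some constraint."""
        for matrix in arcs.values():
            for x, y in matrix:
                if not all(c(x, y) and c(y, x) for c in constraints):
                    matrix[(x, y)] = 0

    # Role values surviving unary propagation (Figure 3):
    roles = {
        1: [RoleValue(1, "governor", "DET", 2, "det"),
            RoleValue(1, "governor", "DET", 3, "det")],
        2: [RoleValue(2, "governor", "SUBJ", 3, "noun")],
        3: [RoleValue(3, "governor", "ROOT", None, "verb")],
    }
    arcs = build_arcs(roles)
    propagate_binary(arcs, [b1])
    # As in Figure 4, the entry for (DET-3, ROOT-nil) on the arc between "a" and
    # "eats" drops to 0: DET-3 would make the determiner modify the verb.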
Figure 5: The final parse of the sentence.

A filtered role value is removed from its role and from the row or column it indexes for each matrix associated with the arcs emanating from the role. For example, the role value DET-3 in Figure 4 can be eliminated from the role for the word a, and the rows indexed by that value can also be eliminated from the matrices on the arcs emanating from the role for a, leading to an unambiguous parse for the sentence. Figure 5 shows the network after filtering. Notice that there is a single role value per role, which together form a parse graph for the sentence. A single application of filtering may be insufficient to eliminate illegal role values, since the elimination of a role value from one role could lead to the elimination of a role value from another role. Filtering continues until there are no role values indexing matrix rows or columns containing only zeros, requiring O(n^4) time (see [11]).
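Filtering can then be written as the fixpoint loop sketched below, continuing the roles/arcs dictionaries of the previous sketch; again, this is an illustrative sketch rather than the paper's code.

    def has_support(value, role_id, roles, arcs):
        """A role value is supported if, on every arc touching its role, the row
        (or column) it indexes still contains at least one 1."""
        for (i, j), matrix in arcs.items():
            if role_id == i and not any(matrix[(value, y)] for y in roles[j]):
                return False
            if role_id == j and not any(matrix[(x, value)] for x in roles[i]):
                return False
        return True

    def filter_network(roles, arcs):
        """Iteratively drop unsupported role values and zero out their rows/columns,
        until no role value changes."""
        changed = True
        while changed:
            changed = False
            for role_id, values in roles.items():
                for value in list(values):
                    if has_support(value, role_id, roles, arcs):
                        continue
                    values.remove(value)
                    for (i, j), matrix in arcs.items():
                        for pair in matrix:
                            if (role_id == i and pair[0] == value) or \
                               (role_id == j and pair[1] == value):
                                matrix[pair] = 0
                    changed = True
        return roles

    # With the arcs from the previous sketch, filtering removes DET-3 and leaves
    # {DET-2}, {SUBJ-3}, {ROOT-nil}: the unambiguous parse of Figure 5.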
2 Benefits of a Constraint-Based Approach

There are many benefits to using a constraint-based parser, with the primary one being flexibility. When a traditional context-free grammar (CFG) parser generates a set of ambiguous parses for a sentence, it cannot invoke additional production rules to further prune the analyses. In CDG parsing, on the other hand, the presence of ambiguity can trigger the propagation of additional constraints to further refine the parse. A core set of constraints that hold universally can be propagated first, and then if ambiguity remains, additional,
possibly context-dependent, constraints can be applied. We have developed semantic constraints which are used to eliminate parses with semantically anomalous readings from the set represented in the constraint network [6]. Additional knowledge sources are quite easy to add given the uniform framework provided by constraints. Tight coupling of prosodic [2] and semantic rules with CFG grammar rules typically increases the size and complexity of the grammar and reduces its understandability. Semantic grammars have been effective for limited domains, but they often do not scale up well to larger systems [1]. The most successful modules for semantics are more loosely coupled with the syntactic module. The constraint-based approach represents a loosely-coupled approach for combining a variety of knowledge sources. It differs from a blackboard approach in that all constraints are applied using the uniform mechanism of constraint propagation [5]. Hence, the designer does not need to create a set of functionally different modules and worry about their interface with other modules. Constraint propagation is a uniform method which allows the designer to focus on the best way to order the sources of information impacting comprehension. The set of languages accepted by a CDG grammar is a superset of the set of languages which can be accepted by CFGs. In fact, Maruyama [11, 12] is able to construct CDG grammars with two roles (degree = 2) and up to two variables in a constraint (arity = 2) which accept the same language as an arbitrary CFG converted to Greibach Normal Form. We have also devised an algorithm to map a set of CFG production rules into a CDG grammar. This algorithm does not assume that the rules are in normal form, and the number of constraints created is O(G), where G is the size of the CFG. In addition, CDG can accept languages that CFGs cannot (e.g., a^n b^n c^n and ww, where w is some string of terminal symbols). There has been considerable interest in the development of parsers for grammars that are more expressive than the class of context-free grammars, but less expressive than context-sensitive grammars [9, 17, 18]. The running time of the CDG parser compares quite favorably to the running times of parsers for languages which are beyond context-free. For example, the parser for tree adjoining grammars (TAG) has a running time of O(n^6). CFG parsing has been parallelized by several researchers. For example, Kosaraju's method [10] using cellular automata can parse CFGs in O(n) time using O(n^2) processors. However, achieving CFG parsing times of less than O(n) has required more powerful and less
implementable models of parallel computation than used by [10], as well as significantly more processors. Ruzzo's method [16] has a running time of O(log^2 n) using a CREW P-RAM model (Concurrent Read, Exclusive Write, Parallel Random Access Machine), but requires O(n^6) processors. In contrast, we have devised a parallelization for the single sentence CDG parser [7, 8] which uses O(n^4) processors to parse in O(k) time for a CRCW P-RAM model (Concurrent Read, Concurrent Write, Parallel Random Access Machine), where n is the number of words in the sentence, and k, the number of constraints, is a grammatical constant. Furthermore, this algorithm has been simulated on the MasPar MP-1, a massively parallel SIMD computer. The MP-1 supports up to 16K 4-bit processing elements, each with 16KB of local memory. The CDG algorithm on the MP-1 achieves an O(k + log n) running time by using O(n^4) processors. By comparison, the TAG parsing algorithm has also been parallelized, and operates in linear time with O(n^5) processors [15]. To parse a free-order language like Spanish or Latin, CFGs require that additional rules containing the permutations of the right-hand side of a production be explicitly included in the grammar [14]. Unordered CFGs do not have this combinatorial explosion of rules, but the recognition problem for this class of grammars is NP-complete. A free-order language can easily be handled by a CDG parser because order between constituents is not a requirement of the grammatical formalism. Furthermore, CDG is capable of efficiently analyzing free-order languages because it does not have to test for all possible word orders. In summary, CDG supports a framework which is more expressive and flexible than CFGs, making it an attractive alternative to traditional parsers. It is able to utilize a variety of different knowledge sources in a uniform framework to incrementally disambiguate a sentence's parse. The algorithm also has the advantage that it is efficiently parallelizable.
3 Parallelization of CDG Parsing

For deriving minimum time complexities, we use the CRCW P-RAM (Concurrent Read, Concurrent Write Parallel Random Access Machine) model of parallel computation [3, 4, 19], which allows any number of processors to read from or write to any memory location. If more than one processor tries to write to the same location in memory, then a single random processor will succeed. This allows the ORing and ANDing of any number of bits in constant time with a large enough number of processors [4].
Figure 6: The representation of the CN for parallel unary constraint propagation.
Figure 7: Broadcasting unary constraints to PEs.

To determine the time and processor complexity of the parallel algorithm, consider the generation of role values, the propagation of unary and binary constraints, and filtering. Each role value, being nothing more than a label-modifiee pair, can be generated independently of all others. Recall that the number of role values initially generated for a sentence is O(n^2); hence, all the role values can be generated in constant time with O(n^2) processors. Figure 6 shows how the role values would be independently generated by the PEs for our example sentence A fish eats. The propagation of constraints is a very local computation. To apply unary constraints, a processor requires access to information from one role value only, and for the binary constraints, only two role values need to be checked. Because of the shared-memory feature of the CRCW P-RAM, this information is immediately available to any processor that needs it. Furthermore, the checking of one role value or pair of role values is independent of the checking of other role values or pairs of role values, and hence all the checking can go on in parallel. Because there are O(n^2) role values to test against the unary constraints, they can be propagated in constant time with O(n^2) processors. Figure 7 shows how each unary constraint is broadcast to all of the PEs simultaneously.
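As a software analogue of this data parallelism, the same unary step can be phrased with elementwise array operations, where NumPy's vectorized comparisons stand in for the per-role-value PEs of Figure 6; the integer encodings below are ours, and this is only an illustration, not the MasPar implementation.

    import numpy as np

    # The nine initial role values of "A fish eats" as parallel arrays, one slot
    # per PE in Figure 6.  Label codes: 0=DET, 1=SUBJ, 2=ROOT; modifiee 0 means nil.
    lab = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
    pos = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])
    mod = np.array([0, 2, 3, 0, 1, 3, 0, 1, 2])

    # U-1: a ROOT modifies no word.  Applied to every role value at once.
    ok_u1 = np.where(lab == 2, mod == 0, True)
    # U-2: a DET or SUBJ must modify a word to its right.
    ok_u2 = np.where((lab == 0) | (lab == 1), mod > pos, True)

    alive = ok_u1 & ok_u2
    # alive is [False, True, True, False, False, True, True, False, False]:
    # exactly the role values {DET-2, DET-3, SUBJ-3, ROOT-nil} kept in Figure 3.
    print(alive)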
Figure 8: The representation of the CN for parallel binary constraint propagation.
Figure 9: Broadcasting binary constraints to PEs.
Since there are O(n^2) arcs in the constraint graph (one between each pair of the q × n roles), each of which keeps track of the compatibility of O(n^2) pairs of role values, a binary constraint must be checked against O(n^2 × n^2) = O(n^4) pairs of role values. Figure 8 shows how the arc elements are mapped onto PEs in our example sentence, and Figure 9 shows how the binary constraints are broadcast to each PE. Given that k is the number of unary and binary constraints, and that each pair of role values can be checked concurrently, all of the constraints can be propagated in O(k) time using O(n^4) processors. Recall that one step of filtering involves removing unsupported role values from their roles. A role value is still supported by the arc matrices associated with the role if each of the rows (or columns) indexed by the role value contains at least one 1. This can be determined by logically ORing the elements contained in all the rows and columns indexed by the role value and then logically ANDing the results of those operations. If the result is 1, then the role value is still supported. Since logical AND and OR operations on a CRCW P-RAM can be done in constant time [4], this requires only constant time. After removing a role value from a role, the algorithm must zero out all of the entries in the row (or
column) which the role value indexes. Because there are O(n) arcs connected to each role, and a column contains O(n) entries, this step requires O(n^2) work and can be performed in constant time with O(n^2) processors (all entries can be zeroed simultaneously). Since there are O(n^2) role values to check and each is checked independently of the others, with O(n^4) processors one step of filtering can be completed in constant time. The parallelization of filtering is limited by the fact that one deleted role value can enable the deletion of other role values, resulting in a cascade of role value elimination. In the worst case, O(n^2) role values would have to be sequentially eliminated, resulting in a running time of O(n^2) for filtering. However, this worst case is not a problem in practice; we have found that very few filtering steps (typically fewer than six) are required at the end of constraint propagation. The running time of the CDG parsing algorithm is dominated by the time required to propagate unary and binary constraints. Hence, in practice the total running time of the algorithm is O(k) with O(n^4) processors, where k is the number of constraints propagated. Because the typical length of an English sentence is on the order of tens of words, massively parallel machines like the MasPar MP-1 have sufficient processors (i.e., 16,000) for parsing a typical sentence quickly [8].
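The support check described above is an OR over each row followed by an AND across the arcs. A small sketch for a single role value, transcribing the post-B-1 rows for ROOT-nil from the Figure 4 example (NumPy again stands in for the constant-time parallel reductions; row contents are ours):

    import numpy as np

    # Rows of the arc matrices indexed by ROOT-nil after B-1 has been propagated,
    # one row per arc leaving the role (values follow the running example).
    rows = [np.array([1, 0]),   # vs. {DET-2, DET-3}: only the pair with DET-2 survives
            np.array([1])]      # vs. {SUBJ-3}

    # OR along each row (is some partner still compatible?), AND across the arcs.
    supported = all(row.any() for row in rows)
    print(supported)   # True: ROOT-nil keeps its support and is retained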
4 Vectorization of CDG Parsing

Vectorized CDG parsing is the dual of parallelized CDG parsing. In parallel CDG parsing, one constraint at a time is applied to all of the role values simultaneously; whereas in vectorized CDG parsing, all of the constraints are applied to one role value at a time. Parallel CDG parsing exploits the data parallelism of constraint propagation, while vectorized CDG parsing exploits the instruction parallelism of constraint propagation. Consider the unary constraint U-1 in Figure 2. There are two tests for equality: whether the label of x is ROOT and whether the modifiee of x is nil. Because these two tests are independent of each other, they can be executed together. In general, the tests and comparisons which make up a constraint are independent of each other. Hence, given enough functional units, a constraint can be evaluated in a single cycle. With this in mind, we now show how each step of the algorithm can be vectorized. Recall that constraint propagation proceeds in three steps: unary constraint propagation, binary constraint propagation, and filtering.
Figure 10: Vectorization of unary constraint propagation.

To vectorize unary constraint propagation, all of the role values are generated and stored in a one-dimensional vector. They are then fed into a vector processor, as shown in Figure 10. The vector processor has a modular structure; it is composed of vector PEs. There is one vector PE for each of the unary constraints. Each vector PE has enough functional units to evaluate its constraint in a single clock cycle. Hence, a vector of r_u role values can be processed in r_u clock cycles. Because a typical English grammar has a few hundred unary constraints, it might seem like an inordinate number of functional units are necessary to create the vector processor. However, each functional unit only has to do simple comparisons, and therefore requires very few transistors. For instance, the average functional unit on the PARSEC chip has around 25 transistors. With today's chip densities exceeding three million transistors, it is clear that this hypothetical vector processor is realizable. Vectorized propagation of binary constraints is illustrated in Figure 11. Directing the process is a control unit, which has access to the topology of the constraint network and the role values left over after unary constraint propagation. Note that we do not explicitly construct the arc matrices. This is advantageous because the size of the arc matrices is O(n^4). The arc matrices of even moderate-sized sentences can require many megabytes of storage. In low-cost, embedded applications of this algorithm, the cost and size of this extra memory may be prohibitive. Because of this, the control unit simply generates the rows and columns of the arc matrices which it needs on the fly and then sends them to the binary constraint vector processor. Since we only have to store O(n^2) role values, a minimal system would require only a few kilobytes of memory even for very large sentences.
Figure 11 shows the process of determining whether ROOT-nil is a valid role value. First, the control unit looks at the topology of the network, and determines that ROOT-nil must be compatible with one of the role values from node 1 (DET-2 or DET-3), and the role value SUBJ-3 from node 2. So it creates two vectors of role value pairs, [(ROOT-nil, DET-2) (ROOT-nil, DET-3)] and [(ROOT-nil, SUBJ-3)], to send to the binary constraint vector processor. The binary constraint vector processor is like the unary constraint vector processor in that it has a vector PE for each of the binary constraints in the grammar. However, it requires more hardware than the unary constraint vector processor, since it operates on pairs of role values. It must have twice the input bandwidth of the unary constraint vector processor, and each of its vector PEs must have enough functional units to evaluate a binary constraint in a single cycle. In order for ROOT-nil to be compatible with word node 1 in Figure 11, it must be compatible with at least one of node 1's role values. To test for this, the vector processor has extra hardware to maintain a cumulative OR for the results of testing ROOT-nil against all of node 1's role values. The result of testing [(ROOT-nil, DET-2) (ROOT-nil, DET-3)] against the binary constraints is the vector [0 1]. The cumulative OR is 1, so ROOT-nil is compatible with node 1. Next the vector processor tests ROOT-nil for compatibility with word node 2. The cumulative OR from testing [(ROOT-nil, SUBJ-3)] is also 1. The control unit maintains a cumulative AND of the results of all the cumulative ORs from testing ROOT-nil. If this cumulative AND is 1 (which it is), then ROOT-nil is retained. If the cumulative AND had been 0, then ROOT-nil would have been eliminated from the network. If r_b role values remain after unary constraint propagation, the binary constraints can be propagated in O(r_b^2) clock cycles. After binary constraint propagation, the filtering algorithm normally goes through the arc matrices and checks for rows and columns which contain all zeros. However, we have not built any arc matrices, so when the filtering algorithm must check if a row or column contains all zeros, it must ask the control unit to construct it. The control unit does this by forming the vector of pairs of role values which pertain to the row or column requested by the filtering algorithm, and then running that vector through the binary constraints to generate the vector of results.
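In software, the control unit's check for a single role value can be written as the following sketch. It reuses the toy RoleValue objects and the b1 predicate from the earlier sketches, and only illustrates the cumulative OR/AND scheme, not the chip's actual control logic.

    def survives(value, network, binary_constraints):
        """Check one role value without storing arc matrices, as in Figure 11.

        `network` maps each word position to the role values that survived unary
        propagation.  For every other word node, `value` must be compatible with
        at least one of that node's role values (cumulative OR); the per-node
        results are then ANDed together (cumulative AND)."""
        for node, candidates in network.items():
            if node == value.pos:
                continue
            node_ok = False                                   # cumulative OR
            for other in candidates:
                if all(c(value, other) and c(other, value) for c in binary_constraints):
                    node_ok = True
                    break
            if not node_ok:                                   # cumulative AND
                return False
        return True

With the Figure 3 role values and B-1 as the only binary constraint, this retains ROOT-nil, DET-2, and SUBJ-3 but rejects DET-3, matching Figure 5.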
Figure 11: Vectorization of binary constraint propagation.

If the vector of results contains all zeros, the filtering algorithm will eliminate the role value which indexes the row or column from the network. The cumulative OR allows the filtering algorithm to determine whether the row or column contains all zeros. Hence, a single step of the filtering algorithm is equivalent to simply repropagating all of the binary constraints. Filtering then becomes simply a matter of iteratively reapplying the binary constraints until no more role values are eliminated. As we explained in Section 3, typically only a small, constant number of filtering steps is required. Also, the number of remaining role values drops dramatically with each iteration of the algorithm, so each succeeding iteration runs faster because the lengths of the vectors decrease. This, combined with the fact that vector propagation of the binary constraints is so fast, implies that we pay only a small penalty in increased running time for the convenience of not having to explicitly create all of the arc matrices. Although vectorization of the CDG parsing algorithm does not reduce the asymptotic running time, it dramatically reduces the number of clock cycles required to evaluate the constraints. On a non-superscalar CPU each constraint would take between five and 20 clock cycles to evaluate; therefore a sufficiently powerful vector processor could easily parse
a sentence thousands of times faster than a conventional CPU. In a typical scenario, using a constraint grammar of several hundred constraints to parse a sentence of twelve words would require upwards of five million comparisons. This can easily be handled in real time by a single PARSEC chip. We conclude this section by observing that the two techniques for speeding up CDG parsing, vectorization and parallelization, can both be applied simultaneously. A massively parallel machine with O((k_u + k_b) n^4) vector PEs could parse a sentence in a single clock cycle. This "ultimate constraint machine" would require about 50 billion transistors to create, but would be very useful for doing intelligent text searches on very large databases: it could parse the entire Encyclopaedia Britannica in about half a minute.
5 The PARSEC Chip

Constraint propagation is highly parallelizable at the instruction level because its primitive operations consist of simple comparisons and a logical ORing of the results. These comparisons require very little feedback, as opposed to typical arithmetic operations on a generic CPU. Special hardware allows us to exploit this extra parallelism, resulting in superpipelining of the hardware comparators, an unusually fast clock cycle, and higher overall performance. Implementing PARSEC in custom hardware has several distinct advantages. Constraint propagation can now parse sentences in real time, which previously could only be achieved with high-performance systems such as the MasPar MP-1. Customized hardware eliminates the expensive system requirements, dramatically reducing computational cost. Finally, the miniaturization provides an easy means to place constraint-based propagation in virtually any system, regardless of size. The PARSEC chip consists of nine highly-pipelined functional units working concurrently to process a constraint. The constraints can then be applied to the role values, which are broadcast onto the bus. This allows a linear increase in performance by placing additional chips on the bus. These chip outputs are ANDed using a trinary tree constructed from specialized on-chip hardware; the length of the internal pipeline increases logarithmically with the number of additional PARSEC chips. The design consists of only 8,000 transistors,
and could be integrated in a very small space and possibly incorporated into other chips. The PARSEC chip was designed to be fabricated using a two-micron MOSIS process. Although this process is ten years old and considered primitive compared to current VLSI technology, it would provide a high performance chip capable of over ninety million computations per second with only a ten MHz clock. This would allow processing of up to ten million constraints per second, roughly ten times faster than a Sparc-I. Figure 12 shows the VLSI layout and the pin assignments for the PARSEC vector PE. We had three primary design goals for the PARSEC chip: 1) single-cycle evaluation of both unary and binary constraints; 2) single-chip implementation; and 3) on-chip support for multiple PEs. In order to support single-cycle evaluation times for constraints, we had to provide seven functional units on chip: four 6-bit comparators for comparing the positions of words and modifiees, three 3-bit equality testers for labels, words, and categories, and one 1-bit comparator for roles. Feeding so many functional units made bussing a major consideration. The word lines are twenty-two bits long, and at any given time, these bits may need to be routed to any one of several functional blocks. Furthermore, another twenty-two bits are stored on the chip, and these must also be available to other functional blocks. With this in mind, we designed the chip logic so that the 22-bit bus could be broken into smaller, more specific buses. This was done by splitting the processing of the various parts of the input word into autonomous functional blocks, and by carefully defining the bits within the input word so that there would be as little "bus overlap" as possible. This accounts for the slightly strange, nonsequential pin assignments of the chip. Whenever possible, bus lines were time shared. Even after this careful bus assignment, bussing accounts for approximately 50% of the chip area. Getting the whole PARSEC chip to fit on a single MOSIS tiny chip was, to say the least, an interesting challenge. Originally we just wanted to implement our massively-parallel algorithm in hardware. However, it became apparent that this algorithm was not suitable for VLSI implementation given our small transistor budget. The solution was to use the vectorization described in this paper. The vectorized algorithm, however, has tremendous bandwidth requirements. Unary constraint evaluation was no problem because a MOSIS tiny chip had enough pins to let us feed in one role value at a time. However, binary constraint evaluation required us to feed in two role values at once, and there simply were not enough pins to support this.
Figure 12: The VLSI layout and pin assignments for the PARSEC chip.
We were faced with two alternatives. The first was to just multiplex the data bus and feed in the two role values one after the other. We considered this unacceptable, however, because it would have doubled our execution times. Our second alternative was to somehow find a way to cut the bandwidth requirements of binary constraint evaluation in half. The solution was to use the fact that, for each of the pairs of role values which make up a row or column of an arc matrix, there is a common role value, namely, the role value which indexes that row or column. By first loading the role value which every pair has in common onto the PARSEC chip, we can simply feed in the other half of the role value pairs one by one. The only problem with this solution was that the seven functional units and the bus lines required to feed them had very nearly exhausted our chip real estate. We simply did not have any room left to store a 22-bit role value. However, we did have room to store the six 1-bit results from passing the role value through the functional units. This allowed us to squeeze the entire PARSEC vector PE onto a single chip and still maintain the single-cycle evaluation times. To make it easier for a board designer to build systems with many PARSEC chips, we included all of the glue logic necessary to do so on the PARSEC chip. This glue logic is tightly synchronized with the rest of the chip and, in effect, allows us to extend the execution pipeline over multiple chips. Each PARSEC chip can accept the output of three other PARSEC chips, which it ANDs together with its own output to form the result. Thus, for every power of three processors we incur an extra cycle of execution latency on the pipeline. The cumulative OR was implemented as a one-shot: once it goes high, it stays high. The processor enable pin serves two functions. First, it allows us to select a single PE so that we can program it for the constraint it is to evaluate. Second, it allows us to turn on or off groups of constraints which might be applied under various circumstances. For example, different constraints might apply when parsing different dialects of English.
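As a rough software illustration of the data format and the multi-chip glue logic described above: the 22 input bits match the functional-unit widths quoted earlier (6-bit position and modifiee, 3-bit label, word, and category, and a 1-bit role), but the chip's actual bit assignment is not given here, so the packing below is an assumption; the latency helper simply restates the power-of-three rule.

    # Hypothetical packing of a role value into the 22-bit input word.  The field
    # widths follow the functional units listed above, but the real chip's bit
    # assignment is not published here, so this layout is illustrative only.
    def pack_role_value(pos, mod, label, word, category, role):
        assert pos < 64 and mod < 64 and label < 8 and word < 8 and category < 8 and role < 2
        return pos | (mod << 6) | (label << 12) | (word << 15) | (category << 18) | (role << 21)

    def extra_pipeline_stages(num_chips):
        """Each chip ANDs its own output with up to three upstream chips, so the
        result tree adds one pipeline stage for every factor of three chips."""
        stages, capacity = 0, 1
        while capacity < num_chips:
            capacity *= 3
            stages += 1
        return stages

    print(hex(pack_role_value(pos=3, mod=2, label=1, word=0, category=4, role=0)))
    print(extra_pipeline_stages(27))   # 3 extra cycles of pipeline latency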
6 Chip Simulations

We have conducted simulations of the chip using two different grammars. The first grammar is capable of parsing strings in the language a^n b^n c^n. Though this language is beyond context-free, our grammatical formalism is capable of parsing strings in the language using three unary and eight binary constraints. The second grammar covers simple statements in English using 17 unary and 27 binary constraints. The grammars are contained in the Appendix. Gprof was used to profile the execution of our parser for these two grammars. In particular, we determined how much time was spent in the functions that apply the unary constraints and binary constraints, and also how many times those routines were called. All simulations were performed on a Sparc-I running SunOS. The estimated times to execute the unary and binary constraints using the chip instead of the Sparc-I CPU were calculated using the formulas:

    chip-unary  = ⌈ |unary-constraints| / |chips| ⌉ × chip-clock
    chip-binary = ⌈ |binary-constraints| / |chips| ⌉ × chip-clock

For these simulations a clock of 10 MHz was assumed. The time saved by using the chip to apply the unary constraints was estimated by the following equation:

    time-saved_u = cpu-unary - chip-unary

The time saved by using the chip to apply the binary constraints was estimated by the following equation:

    time-saved_b = cpu-binary - chip-binary

The overall time saved by using the chip to apply the constraints was estimated by the following equation:

    time-saved = cpu-time - cpu-unary - cpu-binary + chip-unary + chip-binary

The first simulation we ran used the a^n b^n c^n grammar, which has only three unary and eight binary constraints. Because of this, the maximum parallelism is achieved by using three chips for unary constraint propagation and eight chips for binary constraint propagation. Figures 13 and 14 show the amount of time saved over a Sparc-I CPU when propagating the unary and binary constraints, respectively. Notice that the speedup for binary constraints is much higher than for unary constraints.
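The estimates above can be reproduced with a short helper that transcribes these equations directly. This is a sketch only: the function name is ours, and the sample inputs are placeholders rather than measured gprof values.

    from math import ceil

    def estimate(cpu_time, cpu_unary, cpu_binary,
                 n_unary, n_binary, n_chips, chip_clock=1.0 / 10e6):
        """Transcription of the chip-time and time-saved formulas above.
        cpu_* are gprof-measured Sparc-I times (seconds); chip_clock is the
        10 MHz cycle time."""
        chip_unary = ceil(n_unary / n_chips) * chip_clock
        chip_binary = ceil(n_binary / n_chips) * chip_clock
        return {
            "time-saved-u": cpu_unary - chip_unary,
            "time-saved-b": cpu_binary - chip_binary,
            "time-saved": cpu_time - cpu_unary - cpu_binary + chip_unary + chip_binary,
        }

    # Placeholder numbers for the a^n b^n c^n grammar on a single chip.
    print(estimate(cpu_time=5.0, cpu_unary=0.4, cpu_binary=3.5,
                   n_unary=3, n_binary=8, n_chips=1))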
Figure 13: Estimated time saved by the PARSEC chip during unary constraint propagation for the a^n b^n c^n grammar.

By utilizing multiple chips, some speedup occurs for both unary constraint propagation and binary constraint propagation, as Figures 15 and 16 show. These figures plot the chip time for parsing sentences of varying lengths as a function of the number of chips used and the number of words in the sentence. Note that the timings for unary constraint propagation are not monotonically increasing as the number of words increases. This is probably due to the fact that the short time to propagate the unary constraints was more impacted by system load at the time of measurement than the binary constraint propagation time. Despite the fact that multiple chips do help, the effect is hidden in Figures 13 and 14 because the most dramatic gain was obtained by using a single chip for all constraints. The second simulation we ran used a natural language grammar, which has 17 unary and 27 binary constraints. Because of this, the maximum parallelism is achieved by using 17 chips for unary constraint propagation and 27 chips for binary constraint propagation. Figures 17 and 18 show the amount of time saved over a Sparc-I CPU when propagating the unary and binary constraints, respectively. Notice that the speedup for this grammar is somewhat greater than in our first simulation because the grammar is larger.
Figure 14: Estimated time saved by the PARSEC chip during binary constraint propagation for the a^n b^n c^n grammar.
Figure 15: Time for unary constraint propagation for the a^n b^n c^n grammar over sentences of varying lengths given one to three PARSEC chips.
Figure 16: Time for binary constraint propagation for the a^n b^n c^n grammar over sentences of varying lengths given one to eight PARSEC chips.

Also notice that the speedup for binary constraints is much higher than for unary constraints. By utilizing multiple chips, some speedup occurs for both unary constraint propagation and binary constraint propagation, as Figures 19 and 20 show. These figures plot the chip time for parsing sentences of varying lengths as a function of the number of chips used and the number of words in the sentence. Note that the timings for unary constraint propagation are not monotonically increasing as the number of words increases. This again is probably due to the fact that the short time to propagate the unary constraints was more impacted by system load at the time of measurement than the binary constraint propagation time. Despite the fact that multiple chips do help, the effect is again hidden in Figures 17 and 18 because the most dramatic gain occurs with the use of a single chip.
Figure 17: Estimated time saved by the PARSEC chip during unary constraint propagation for the NLP grammar.
7 Conclusion

The experimental phase in developing a new language understanding system is almost always limited by computational resources, and our experience with the research and development of the PARSEC system has been no exception. To overcome this limitation, our work has made extensive use of parallel and vector processing. However, because we are now trying to bring the benefits of PARSEC out of the lab and into the commercial arena, we had to provide a cost-effective way to deliver the computational power demanded by constraint dependency parsing. The PARSEC chip, with its multiple functional units and its pipelined design, fits the bill admirably. We hope that the low cost and small form factor of the PARSEC chip will let it find its way into many consumer electronic niches, such as improving the quality of voice and handwriting recognition in hand-held PDAs and voice-command VCRs.
Figure 18: Estimated time saved by the PARSEC chip during binary constraint propagation for the NLP grammar.
Figure 19: Time for unary constraint propagation for the NLP grammar over sentences of varying lengths given one to 17 PARSEC chips.
Figure 20: Time for binary constraint propagation for the NLP grammar over sentences of varying lengths given one to 27 PARSEC chips.
A Appendix

A.1 a^n b^n c^n

;; Grammar Parameters
(roles governor)
(labels a b c)

;; 3 Unary Constraints
(if (eq (root_word x) a) (^ (eq (lab x) a) (gt (mod x) (pos x))))
(if (eq (root_word x) b) (^ (eq (lab x) b) (lt (mod x) (pos x))))
(if (eq (root_word x) c) (^ (eq (lab x) c) (lt (mod x) (pos x))))

;; 8 Binary Constraints
(if (^ (eq (lab x) a) (eq (lab y) b)) (lt (pos x) (pos y)))
(if (^ (eq (lab x) b) (eq (lab y) c)) (lt (pos x) (pos y)))
(if (^ (eq (lab x) a) (eq (lab y) a) (gt (pos x) (pos y))) (gt (mod x) (mod y)))
(if (^ (eq (lab x) a) (eq (mod x) (pos y)) (eq (rid y) governor)) (eq (lab y) c))
(if (^ (eq (lab x) c) (eq (lab y) c) (gt (pos x) (pos y))) (gt (mod x) (mod y)))
(if (^ (eq (lab x) c) (eq (mod x) (pos y)) (eq (rid y) governor)) (eq (lab y) b))
(if (^ (eq (lab x) b) (eq (lab y) b) (gt (pos x) (pos y))) (gt (mod x) (mod y)))
(if (^ (eq (lab x) b) (eq (mod x) (pos y)) (eq (rid y) governor)) (eq (lab y) a))
A.2 A Simple English Grammar

;; Grammar Parameters
(categories det adj noun verb propernoun pronoun prep adv)
(roles governor needs)
(labels root subj obj adj det noun_mod s blank np pp_obj pp_need part_need v_pp)
(label_table
  (governor (noun subj noun_mod obj pp_obj) (det det) (propernoun subj obj pp_obj)
            (pronoun subj obj pp_obj) (verb root) (adj adj) (adv part) (prep v_pp n_pp))
  (needs (noun np) (det blank) (propernoun blank) (pronoun blank) (verb s)
         (adj blank) (adv part_need) (prep pp_need)))
(grammar_features (number 1s 2s 3s 1p 2p 3p) (subcats dobj none))
(feature_table
  (noun (number 3s 3p [3s]))
  (det (number 3s 3p [3s]))
  (pronoun (number 3s 3p [3s]))
  (propernoun (number 3s 3p [3s]))
  (verb (number 1s 2s 3s 1p 2p 3p [3s]) (subcats dobj none [])))

;; 17 Unary Constraints
(if (eq (lab x) root) (eq (mod x) nil))
(if (eq (lab x) subj) (lt (pos x) (mod x)))
(if (eq (lab x) obj) (gt (pos x) (mod x)))
(if (eq (lab x) adj) (lt (pos x) (mod x)))
(if (eq (lab x) det) (lt (pos x) (mod x)))
(if (eq (lab x) noun_mod) (lt (pos x) (mod x)))
(if (^ (eq (lab x) s) (eq (subcat x) dobj)) (lt (pos x) (mod x)))
(if (^ (eq (lab x) s) (eq (subcat x) none)) (eq (mod x) nil))
(if (eq (lab x) blank) (eq (mod x) nil))
(if (^ (eq (lab x) np) (eq (number x) 3s)) (gt (pos x) (mod x)))
(if (^ (eq (lab x) np) (eq (number x) 3p)) (eq (mod x) nil))
(if (eq (lab x) pp_obj) (gt (pos x) (mod x)))
(if (eq (lab x) pp_need) (lt (pos x) (mod x)))
(if (eq (lab x) part_need) (lt (pos x) (mod x)))
(if (eq (lab x) v_pp) (gt (pos x) (mod x)))
(if (eq (lab x) n_pp) (gt (pos x) (mod x)))
(if (eq (lab x) part) (gt (pos x) (mod x)))

;; 27 Binary Constraints
(if (eq (pos x) (pos y)) (eq (cat x) (cat y)))
(if (^ (eq (pos x) (pos y)) (eq (cat x) (cat y)) (eq (cat x) noun)) (eq (number x) (number y)))
(if (^ (eq (pos x) (pos y)) (eq (cat x) (cat y)) (eq (cat x) propernoun)) (eq (number x) (number y)))
(if (^ (eq (pos x) (pos y)) (eq (cat x) (cat y)) (eq (cat x) pronoun)) (eq (number x) (number y)))
(if (^ (eq (pos x) (pos y)) (eq (cat x) (cat y)) (eq (cat x) verb))
    (^ (eq (number x) (number y)) (eq (subcat x) (subcat y))))
(if (^ (eq (lab x) s) (eq (subcat x) dobj) (eq (rid y) governor) (eq (mod x) (pos y)))
    (^ (eq (lab y) obj) (eq (mod y) (pos x))))
(if (^ (eq (lab x) obj) (eq (rid y) needs) (eq (mod x) (pos y)))
    (^ (eq (lab y) s) (eq (subcat y) dobj) (eq (mod y) (pos x))))
(if (^ (eq (lab x) root) (eq (pos x) (pos y)) (eq (rid y) needs)) (eq (lab y) s))
(if (^ (eq (lab x) subj) (eq (rid y) governor) (eq (mod x) (pos y))) (eq (lab y) root))
(if (^ (eq (lab x) det) (eq (mod x) (pos y))) (eq (cat y) noun))
(if (^ (eq (lab x) adj) (eq (mod x) (pos y))) (eq (cat y) noun))
(if (^ (eq (lab x) noun_mod) (eq (mod x) (pos y))) (eq (cat y) noun))
(if (^ (eq (lab x) np) (eq (rid y) governor) (eq (mod x) (pos y)))
    (^ (eq (lab y) det) (eq (mod y) (pos x))))
(if (^ (eq (lab x) obj) (eq (rid y) governor) (eq (mod x) (pos y))) (eq (lab y) root))
(if (^ (eq (lab x) pp_obj) (eq (mod x) (pos y)) (eq (rid y) needs))
    (^ (eq (lab y) pp_need) (eq (mod y) (pos x))))
(if (^ (eq (lab x) pp_need) (eq (mod x) (pos y)) (eq (rid x) governor))
    (^ (eq (lab y) pp_obj) (eq (mod y) (pos x))))
(if (^ (eq (lab x) v_pp) (eq (mod x) (pos y))) (eq (cat y) verb))
(if (^ (eq (lab x) n_pp) (eq (mod x) (pos y))) (eq (cat y) noun))
(if (^ (eq (lab x) part) (eq (rid y) governor) (eq (mod x) (pos y))) (eq (lab y) root))
(if (^ (eq (lab x) part_need) (eq (rid y) governor) (eq (mod x) (pos y)))
    (^ (eq (lab y) obj) (lt (mod y) (pos x))))
(if (^ (eq (lab x) subj) (eq (mod x) (pos y))) (agree (number x) (number y)))
(if (^ (eq (root_word x) a) (eq (lab x) det) (eq (mod x) (pos y))) (eq (number y) 3s))
(if (^ (eq (lab x) noun_mod) (eq (cat y) prep) (gt (pos y) (pos x))) (lt (mod x) (pos y)))
; if pos(x) < pos(y) < mod(x) -> pos(x)