A Parallel CKY Parsing Algorithm on Large-Scale Distributed-Memory Parallel Machines
NINOMIYA Takashi, TORISAWA Kentaro, TAURA Kenjiro, TSUJII Jun'ichi
Department of Information Science, University of Tokyo, Hongo 7-3-1, Tokyo 113, Japan
{ninomi,torisawa,tau,
[email protected]
Abstract
This paper describes an efficient parallel CKY algorithm for CFG. We intend to obtain an efficient HPSG parsing algorithm by using this parallel CKY algorithm in Torisawa's HPSG parsing algorithm. Torisawa's parsing algorithm for HPSG consists of two phases. In Phase 1, a parser enumerates possible parse trees using CFG rules compiled from lexical entries in HPSG. In Phase 2, the parser solves the constraints that cannot be covered by the CFG. We implemented a parallel parsing algorithm for Phase 1 on a massively parallel computer, the AP1000+ (256 SuperSPARC processors, 50 MHz), in the concurrent object-oriented programming language ABCL/f. The average parsing time for a corpus of 2,173 sentences (average length 41.43 words) was 120.6 msec. The speedup with 256 nodes was about 45 times when the average length of the input sentences was 119.0 words.
1 Introduction
This paper proposes a parallel CFG parsing algorithm that is practical in terms of speed, data distribution and memory efficiency. Although many parallel CFG parsing algorithms exist, a parallel parser that can be used for parsing real-world texts has not yet been developed. We developed a parallel CFG parser for practical use and applied this algorithm to a parser using a more sophisticated grammar formalism, HPSG [1].
Natural Language Processing (NLP) based on HPSG has recently attracted a great deal of attention from researchers [2], but only a few works go beyond theoretical speculation or experiments with a small grammar. We aim at constructing a framework and an environment based on HPSG in order to develop several NLP techniques on them, including knowledge acquisition, machine translation and information extraction.
To accomplish our aims, Torisawa developed an efficient two-phased HPSG parsing algorithm [3]. The key ideas of Torisawa's algorithm are compilation of HPSG and a two-phased parsing technique. At compile time, the lexical entries in HPSG are compiled into CFG rules. In Phase 1, a parser enumerates possible parse trees using bottom-up chart parsing for the CFG obtained by the compiler. The remaining constraints, which cannot be covered by the CFG, are solved in Phase 2.
This paper describes a parallel parsing algorithm for Phase 1. We chose the CKY algorithm [4][5] as the basis of our parallel CFG parsing algorithm. A parallel CKY algorithm is desirable from the viewpoints of speedup, distribution of data and memory efficiency. The next section describes the sequential CKY algorithm. Section 3 describes our parallel CKY algorithm. The effectiveness of our method is demonstrated with a series of experiments using real-world text in Section 4, and the performance limit of our algorithm is discussed in Section 5.
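For orientation, the chart-filling idea behind the CKY algorithm chosen here can be sketched in a few lines of Python. This is only an illustrative sequential recognizer for a toy grammar in Chomsky Normal Form, not the parser described in this paper. The function name `cky_recognize`, the dictionary-based chart and the toy grammar (`LEXICAL`, `BINARY`) are all assumptions made for the example.

```python
from itertools import product


def cky_recognize(words, lexical_rules, binary_rules, start):
    """Return True if `words` can be derived from `start` (illustrative sketch).

    lexical_rules: dict mapping a terminal to the set of nonterminals A with A -> terminal
    binary_rules:  dict mapping a pair (B, C) to the set of nonterminals A with A -> B C
    """
    n = len(words)
    chart = {}  # chart[(i, j)] holds the nonterminals that derive words[i:j]

    # Spans of length 1 are filled from the lexical rules.
    for i, word in enumerate(words):
        chart[(i, i + 1)] = set(lexical_rules.get(word, ()))

    # Longer spans are built bottom-up from every split point k.
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            cell = set()
            for k in range(i + 1, j):
                for b, c in product(chart[(i, k)], chart[(k, j)]):
                    cell |= binary_rules.get((b, c), set())
            chart[(i, j)] = cell

    # The input is accepted if the start symbol covers the whole string.
    return n > 0 and start in chart[(0, n)]


# Hypothetical toy grammar in Chomsky Normal Form.
LEXICAL = {"she": {"NP"}, "eats": {"V"}, "a": {"Det"}, "fish": {"N"}}
BINARY = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}, ("Det", "N"): {"NP"}}

if __name__ == "__main__":
    print(cky_recognize("she eats a fish".split(), LEXICAL, BINARY, "S"))
```

In this sketch the chart cells for one span length depend only on cells for shorter spans, so cells of equal span length can in principle be computed independently of each other.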
2 Sequential CKY Algorithm
This section describes the sequential CKY algorithm. Let $G = (V_N, V_T, P, \sigma)$ be a context-free grammar, where $V_N$ is a set of nonterminal symbols, $V_T$ is a set of terminal symbols, $P$ is a set of rewriting rules and $\sigma$ is the starting symbol. For any input string $w = w_1 w_2 \ldots w_n$, $S_{i,j}$ is defined as the subset of $V_N$ such that $A \in S_{i,j}$ if and only if $A \to^{*} w_{i+1} \ldots w_j$. The string $w$ belongs to $L(G)$ if and only if $\sigma$ is in $S_{0,n}$. When the set of rewriting rules $P$ is in Chomsky Normal Form (CNF) (i.e. each rule is of the form $A \to BC$ or $A \to w$, where $A, B, C \in V_N$ and $w \in V_T$), the following constraints between the sets $S_{i,j}$ hold.
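$$\alpha \in S_{i-1,i} \iff \exists w_i \, (\alpha \to w_i \in P) \qquad (1)$$
$$\alpha \in S_{i,j} \iff \exists k, \beta, \gamma \, (i < k < j,\ \beta \in S_{i,k},\ \gamma \in S_{k,j},\ \alpha \to \beta\gamma \in P) \qquad (2)$$
We introduce $T_{i,k,j}$ for the convenience of describing our parallel CKY algorithm. $T_{i,k,j}$ is defined as
$$\alpha \in T_{i,k,j} \iff \exists \beta, \gamma \, (\beta \in S_{i,k},\ \gamma \in S_{k,j},\ \alpha \to \beta\gamma \in P) \qquad (3)$$
That is, $T_{i,k,j}$ is a subset of $S_{i,j}$ and Si;j = Si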