Efficient finite-state algorithms for the application of local grammars

Javier M. Sastre-Martínez (1, 2, 3)

Ph.D. thesis supervised by Mikel L. Forcada (2) and Eric Laporte (1)

1: LIGM, Université Paris-Est
2: Grup Transducens, DLSI, Universitat d’Alacant
3: iTEAM, Universitat Politècnica de València

Javier Sastre (Univs. Paris-Est & Alacant)
Ph.D. public defense, 11th July 2011

Outline

1 Background
    Local grammars
    Lexicon grammar + local grammars = natural language parsing
    The MovistarBot: an industrial natural language application
2 Motivation & goal
3 Hierarchy of finite-state machines
4 Efficient algorithms of application of local grammars
5 Algorithm optimizations
    Efficient management of sets
    Efficient management of sequences
6 Empirical tests
    Experimental conditions
    Experimental results
7 Conclusion

Background


Local grammars

Local grammars (Gross, 1997)

- Describe sets of meaningful sequences in natural languages (NLs)
- Handcrafted or semi-automatically built
- More control over the results than statistical methods
- Formalism: recursive transition networks (RTNs; Woods, 1970) with output, taking a set of lexical masks as input alphabet
- Lexical masks: predicates representing sets of lexical units complying with some morphosyntactic and/or semantic criteria
- RTNs ≡ context-free grammars (CFGs) ≡ push-down automata
- RTNs + “unification output” ≡ lexical-functional grammars
- RTNs are more compact than CFGs, hence more efficient algorithms of application (Woods, 1969)
- Very intuitive graphical representation

Example of local grammar I (excerpt)

[Figure: excerpt of a local grammar graph, not reproduced]

Example of local grammar II (excerpt)

[Figure: excerpt of a local grammar graph, not reproduced]

Lexicon grammar + local grammars = natural language parsing

Lexicon grammar (Gross, 1996)

- First empirical method for the exhaustive description of the syntax of NLs (according to Laporte, 2005)
- Classes of syntactic structures of sentences (Leclère, 2002), but taking into account irregularities within each class due to the use of specific predicative elements
- One lexicon grammar table per class: a matrix of predicative elements × differential properties
- Syntactic structures described in the table’s documentation
- Lexicon grammar of French: one of the richest linguistic resources for French (72000 entries, starting with Gross, 1975)
- Can be semi-automatically transformed into local grammars for NL parsing (Roche, 1993; Constant, 2003a), though it must first be transformed into some exploitable format, which is not a negligible task (e.g., see Tolone, 2011)

Example of lexicon grammar table (excerpt)

- Red area: possibility to use auxiliary verbs avoir and être
- Visualized with HOOP (Sastre, 2006a; Sastre, 2006b): http://hoop.univ-mlv.fr

The MovistarBot: an industrial natural language application

The MovistarBot

- Conversational agent created by Telefónica I+D
- Text-based communication in Spanish through MSN Messenger
- Sells mobile services:
  - sending text & multimedia messages
  - searching & downloading games, photos and music
  - searching & subscribing to alerts
  - providing information about products and offers
- Initially based on AIML (Wallace, 2004):
  - Simple formalism based on XML
  - Less powerful than regular expressions
- Extended with local grammars for boosting the recognition of request sentences (Sastre et al., 2009)

Motivation & goal


- Conception of algorithms of application of local grammars that are faster than the algorithms currently in use:
  - Unitex (Paumier et al., 2009): top-down depth-first
  - Outilex (Blanc and Constant, 2006): Earley-like
  - Intex/NooJ (Silberztein, 1998; Silberztein, 2007): unknown
    - not open-source
    - conceived for research on linguistics, but not on algorithmics
- Other classic algorithms are not so straightforwardly adaptable:
  - LR (Knuth, 1965):
    - deterministic & non-ambiguous grammars only; not NL grammars
    - requires a table having an entry per possible input symbol, but input alphabets of local grammars are potentially infinite
  - Tomita, 1987: LR extension, solves the 1st problem but not the 2nd
  - CYK (Cocke and Schwartz, 1970; Younger, 1967; Kasami, 1965): Chomsky normal form; grammar broken into binary pieces
- Outilex’s Earley-like algorithm can be further improved


Hierarchy of finite-state machines


Why a hierarchy?

- Different problems to solve → different machines to use, but not so different
  - Common features, properties and generic procedures
  - Complex machines (and associated algorithms) are easier to understand as extensions of simpler ones
- Hierarchy of finite-state machines:
  - Factors out common features, properties and generic procedures
  - Incremental definition of machines
  - Incremental construction of proofs
  - Incremental definition of their respective algorithms


The hierarchy I: finite-state machines (FSMs)

- A virtual base class
- Features & properties common to every machine
- A set of input symbols, states, transitions & transition labels
- Some states are initial, some states are final
- A transition (qs, ξ, qt) brings the machine from qs to qt depending on ξ and the current context of execution (e.g., next input)
- Classic representation equivalent to that of local grammars

[Figure: generic FSM graph with states q0 ... qn connected by transitions labelled ξ0 ... ξk]


The hierarchy II: finite-state automata (FSAs)

- Transitions either consume one input symbol or none (ε)
- Compact lexicon representation (Revuz, 1992; Daciuk et al., 2000; Carrasco and Forcada, 2002; Daciuk et al., 2005)
- Factor out prefixes & suffixes
- Example lexicon: folk/s, fork/s, four/s, yolk/s, york/s, your/s

[Figure: FSA with states q0 ... q6 accepting the example lexicon; the initial letters f and y, the shared letters o, l, r, u, k and the optional final s (ε or s) are factored out]


The hierarchy III: retrieval trees (tries)

- FSAs having a tree-like structure (Fredkin, 1960)
- Factor out prefixes but not suffixes
- Useful property: each state corresponds to a unique prefix

[Figure: trie for folk/s, fork/s, four/s, yolk/s, york/s, your/s; each state is labelled with the prefix it represents, from ε, f, fo, fol, folk, folks up to yours]


The hierarchy IV: recursive transition networks (RTNs)

- FSAs extended with recursive calls (Woods, 1970)
- Compact grammar representation: factor out infixes as well
- Modular grammars (Gross, 1999; Friburger, 2002; Constant, 2003b; Jung, 2005; Yannacopoulou, 2005; Voyatzi, 2006; Laporte, 2007. . . )
- A determiner (DET) followed by a noun (N) is a noun phrase (NP), e.g.: the machine
- A NP followed by a prepositional phrase (PP) is another NP, e.g.: the machine with calls

[Figure: RTN for NP; from qNP0 a call to {qDET0} leads to qNP1 and a call to {qN0} leads from qNP1 to qNP3; from qNP0 a call to {qNP0} leads to qNP2 and a call to {qPP0} leads from qNP2 to qNP3]


The hierarchy V: input alphabets

- Letter machines: transitions may consume one specific symbol
- Lexical machines: transitions may consume any word complying with a set of morphosyntactic and/or semantic restrictions
- An input alphabet of predicates (e.g., lexical masks) is better suited for NL grammars (van Noord and Gerdemann, 2001)
- The difference affects the implementation but not the theory
  - Letter machines are conceptually simpler
  - Therefore, the hierarchy is described in terms of letter machines
  - Guidelines are given for the implementation of lexical machines
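The difference between letter and lexical machines can be sketched as follows. This is only an illustration under assumptions: the `Word` record, the `pos_mask` helper and the tiny DET N machine are hypothetical names that do not come from the thesis, and real lexical masks are richer than a part-of-speech test.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class Word:
    """Hypothetical lexical unit: inflected form, lemma, category, features."""
    form: str
    lemma: str
    pos: str
    features: FrozenSet[str] = frozenset()


# A lexical mask is modelled as a predicate over words (assumption).
Mask = Callable[[Word], bool]


def pos_mask(pos, *features):
    """Mask matching any word of category `pos` carrying all `features`."""
    required = frozenset(features)

    def mask(word):
        return word.pos == pos and required <= word.features

    return mask


def accepts(transitions, initial, finals, words):
    """Apply a lexical machine without ε-transitions to a word sequence.

    A transition (s, mask, t) is taken when mask(word) holds, instead of
    comparing the input symbol with one specific letter.
    """
    states = {initial}
    for word in words:
        states = {t for (s, mask, t) in transitions
                  if s in states and mask(word)}
    return bool(states & finals)


# DET N machine for "the machine"
the = Word("the", "the", "DET")
machine = Word("machine", "machine", "N", frozenset({"sg"}))
np_machine = [(0, pos_mask("DET"), 1), (1, pos_mask("N", "sg"), 2)]
```

A letter machine is the special case where every mask tests equality with one symbol, which is why the theory can be stated for letter machines only.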


The hierarchy VI: machines with output

Finite-state transducers (FSTs) & RTNs with different kinds of output:

- Blackboards (FSTBOs & RTNBOs) → generic output
- Strings (FSTSOs & RTNSOs) → to translate or to insert metadata
- Weights (WFSMs) → implement heuristics for ambiguity resolution
- Unification (UFSMs) → ease the representation of long distance phenomena, subcategorization and free-permutable constituents
- Composite output (FSMCOs) → combined solution

[Figure: the NP network with string output, states qNP0 ... qNP5; two ε-transitions with output are added around the calls to {qDET0}, {qN0}, {qNP0} and {qPP0}; the output symbols are not recoverable from the extraction]


The hierarchy VII: filtered-popping RTNs (FPRTNs)

- Also called filtered-popping networks (FPNs)
- Efficient representation of the outputs generated by applying a RTN with output to an input sequence (Sastre, 2009)
- FPN = RTN + map κ of states to keys
- Pop transitions cannot be taken unless the keys of the final and return states match (they are filtered)
- FPN paths represent translations performed by the RTN
- Keys are indexes to the input symbols consumed by the RTN
- Keys give the correspondence between FPN paths and input segments
- Pop transitions are filtered in order to ensure that connected FPN paths correspond to translations of connected input segments


The hierarchy VII: filtered-popping RTNs (example)

[Figure: a RTNSO with states q0 ... q9, calls to {q6} and transitions a:(, a:[, b:x, b:y, c:), c:], and the FPN obtained by translating abc, with states r0 ... r9, keys 0 ... 3 and calls to {r6}]

Red pop transitions are forbidden:

- (r7, r5, r5) would allow skipping the translation of c
- (r9, r3, r3) would allow two consecutive translations of c


Efficient algorithms of application of local grammars


Formal description of machine behaviour I

- Application of machines in terms of execution states (ESs) x ∈ X
- ES = last machine state reached (q ∈ Q) + additional data representing the algorithm’s context of execution
- The exact definition depends on the machine and the algorithm
- Examples of ESs for top-down breadth-first and depth-first algorithms:
  - FSAs: q, last state reached
  - FSTBOs: (q, b), last state reached + output generated up to q
  - RTNs: (q, π), last state reached + stack of return states
  - RTNBOs: (q, b, π), combination of FSTBOs and RTNs
- Algorithms for FSAs are generalized to other machines by treating ESs as FSA states
- Non-deterministic machines ⇒ multiple ESs for the same input
- Management of sets of ESs Vi (SESs) rather than single ESs


Formal description of machine behaviour II

- XI: initial SES (for a RTNBO: initial state, empty blackboard and empty stack)
- XF: final SES (for a RTNBO: final state, some blackboard, empty stack)
- D(V): set of ESs directly ε-reachable from V, that is, reachable from any ES of V through one transition that does not consume input
- Cε(V): set of ESs ε-reachable from V, that is, through zero, one or more ε-transitions
- ∆(V, σ): set of ESs directly reachable from V by consuming σ
- ∆*(V, σ1 . . . σn): set of ESs reachable from V by consuming σ1 . . . σn
- L(A): language accepted by machine A
- τ(A): language of translations generated by machine A


ε-closure

Generic computation of the ε-closure à la van Noord, 2000

Algorithm 1 eclosure(V)    ⊲ Cε(V)
    enqueue every ES in V
    while there are enqueued ESs do
        dequeue next ES xs
        for each xt ∈ D({xs}) do
            if xt ∉ V then
                add xt to V and enqueue it
            end if
        end for
    end while
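The worklist computation of Cε(V) can be sketched in Python as follows. It is a minimal sketch under assumptions: `direct` stands for D applied to a single ES and is passed in as a function, ESs only need to be hashable, and the small ε-graph `EPS` is made up for the example.

```python
from collections import deque


def eclosure(V, direct):
    """Return Cε(V): every ES reachable from V through ε-transitions.

    `direct(x)` must return the ESs directly ε-reachable from the ES x.
    """
    closure = set(V)
    queue = deque(closure)
    while queue:
        xs = queue.popleft()
        for xt in direct(xs):
            if xt not in closure:      # each ES is enqueued at most once
                closure.add(xt)
                queue.append(xt)
    return closure


# Toy FSA whose ESs are plain states; q0 -ε-> q1, q1 -ε-> q2 and q1 -ε-> q0
EPS = {"q0": {"q1"}, "q1": {"q2", "q0"}, "q3": {"q4"}}


def direct(x):
    return EPS.get(x, set())
```

Because an ES is only enqueued the first time it is added, the ε-cycle between q0 and q1 does not cause an infinite loop as long as the number of distinct ESs is finite.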


Breadth-first & depth-first application

Generic breadth-first application à la Sastre and Forcada, 2009

Algorithm 2 translate_string(σ1 . . . σn)    ⊲ τ(σ1 . . . σn)
    V0 = Cε(XI)
    for i = 0 to n − 1 do
        if Vi = ∅ then
            return ∅
        end if
        Vi+1 = Cε(∆(Vi, σi+1))
    end for
    return the set of blackboards of the ESs in Vn ∩ XF

- Unitex’s depth-first algorithm produces the same ESs but in depth-first order
- No use of SESs: follow a single path & backtrack at forks
- May process the same ES twice, but managing SESs is expensive
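As an illustration, the breadth-first procedure can be instantiated for a small RTN with string output. The sketch below hard-codes the example grammar used in the trace slides at the end of this deck (a:( and a:[ out of q0, calls to q0, b:) and b:] into q5, plus ε:x from q0 to q5); the table layout and helper names are assumptions of this sketch, not the thesis implementation.

```python
from collections import deque

# ES = (state, output string, stack of return states as a tuple)
CONSUME = {("q0", "a"): [("(", "q1"), ("[", "q3")],
           ("q2", "b"): [(")", "q5")],
           ("q4", "b"): [("]", "q5")]}
EPSILON = {"q0": [("x", "q5")]}                    # ε-transition with output
CALL = {"q1": [("q0", "q2")], "q3": [("q0", "q4")]}  # (called, return state)
INITIAL, FINAL = "q0", {"q5"}


def direct_eps(es):
    """D for one ES: ε-transitions, call pushes and pops."""
    q, b, stack = es
    for out, t in EPSILON.get(q, []):
        yield (t, b + out, stack)
    for called, ret in CALL.get(q, []):
        yield (called, b, stack + (ret,))
    if q in FINAL and stack:
        yield (stack[-1], b, stack[:-1])


def eclosure(V):
    closure, queue = set(V), deque(V)
    while queue:
        for xt in direct_eps(queue.popleft()):
            if xt not in closure:
                closure.add(xt)
                queue.append(xt)
    return closure


def delta(V, sym):
    """∆(V, σ): ESs directly reachable from V by consuming sym."""
    return {(t, b + out, stack)
            for (q, b, stack) in V
            for (out, t) in CONSUME.get((q, sym), [])}


def translate_string(symbols):
    V = eclosure({(INITIAL, "", ())})
    for sym in symbols:
        if not V:
            return set()
        V = eclosure(delta(V, sym))
    return {b for (q, b, stack) in V if q in FINAL and not stack}
```

For the input aabb this returns the four outputs listed in the breadth-first translator trace. The closure terminates here only because the example grammar has no left-recursive calls; stacks would otherwise grow without bound.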


Pseudo-determinization

- Reduces the number of ESs reachable from other ESs
  - Apply the machine for every input sequence it may consume
  - Take SESs as the new states of the machine
- Problem: contrary to FSAs, machines with output may not be determinizable
- Solution: interpret the machines as FSAs by taking transition labels as mere input symbols
  - Not a full determinization, but it removes some structures that may lead to infinite loops
  - Other problematic structures do not make sense for NL grammars


Pseudo-minimization

- Minimization reduces the size of the machine
- Pseudo-minimization à la van de Snepscheut, 1985
- There are more efficient minimization algorithms (Hopcroft et al., 2000), but we focus on algorithms of application: minimize once, apply to multiple sentences
- Simple procedure: reverse, pseudo-determinize, reverse, pseudo-determinize
- Reverse machine: produces reversed translations of reversed sequences
  - Basically consists in swapping the initial and final sets of states and in reversing the transitions
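The reverse, determinize, reverse, determinize procedure can be sketched for automata without ε-transitions as below. The triple representation `(initials, finals, transitions)` and the helper names are assumptions of this sketch. Labels are treated as opaque symbols, so an input:output pair can be used as a label, in the spirit of the pseudo-determinization described above.

```python
def reverse(aut):
    """Swap initial and final states and reverse every transition."""
    initials, finals, trans = aut
    return (set(finals), set(initials), {(t, a, s) for (s, a, t) in trans})


def determinize(aut):
    """Subset construction; the new states are frozensets of old states."""
    initials, finals, trans = aut
    start = frozenset(initials)
    states, todo, new_trans = {start}, [start], set()
    while todo:
        S = todo.pop()
        by_label = {}
        for (s, a, t) in trans:
            if s in S:
                by_label.setdefault(a, set()).add(t)
        for a, targets in by_label.items():
            T = frozenset(targets)
            new_trans.add((S, a, T))
            if T not in states:
                states.add(T)
                todo.append(T)
    new_finals = {S for S in states if S & set(finals)}
    return ({start}, new_finals, new_trans)


def minimize(aut):
    return determinize(reverse(determinize(reverse(aut))))


def accepts(aut, word):
    """Set-based simulation, valid for deterministic and other automata."""
    initials, finals, trans = aut
    current = set(initials)
    for sym in word:
        current = {t for (s, a, t) in trans if s in current and a == sym}
    return bool(current & set(finals))


def state_count(aut):
    initials, finals, trans = aut
    return len(set(initials) | set(finals)
               | {s for (s, _, _) in trans} | {t for (_, _, t) in trans})


# Trie-shaped automaton for {ab, cb}: the common suffix b is not shared
trie = ({0}, {2, 4}, {(0, "a", 1), (1, "b", 2), (0, "c", 3), (3, "b", 4)})
small = minimize(trie)
```

On this example the five trie states collapse into three, since the suffix b is factored out while the accepted language stays the same.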


Flattening of RTNs

- Recursively replace call transitions by copies of the called substructures (up to n recursion levels)
- Equivalent to function inlining in C++
- Accelerates the machine application, but size may increase exponentially w.r.t. n
- Complete flattening of RTNs results in FSAs
- Complete flattening of RTNs with output results in FSTs
- Complete flattening is only possible for non-recursive RTNs, but natural languages are recursive (in theory)
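A possible way to picture flattening is the sketch below, which inlines call transitions up to a given number of levels and drops the calls that exceed it. The RTN encoding (a dictionary from subnetwork name to `(initial, finals, transitions)`, with `("call", name)` labels and `None` for ε) is an assumption of this sketch, as is the choice of linking copies with ε-transitions.

```python
import itertools


def flatten(rtn, root, depth):
    """Inline calls up to `depth` levels; deeper calls are dropped.

    Returns an FSA (initial, finals, transitions) whose labels are
    terminals or None (ε).
    """
    counter = itertools.count()
    transitions = []

    def copy(name, level):
        initial, finals, trans = rtn[name]
        fresh = {}                       # old state -> new state of this copy

        def st(q):
            if q not in fresh:
                fresh[q] = next(counter)
            return fresh[q]

        for s, label, t in trans:
            if isinstance(label, tuple):             # call transition
                if level == 0:
                    continue                         # recursion limit reached
                sub_init, sub_finals = copy(label[1], level - 1)
                transitions.append((st(s), None, sub_init))
                for f in sub_finals:
                    transitions.append((f, None, st(t)))
            else:
                transitions.append((st(s), label, st(t)))
        return st(initial), {st(f) for f in finals}

    initial, finals = copy(root, depth)
    return initial, finals, transitions


def accepts(fsa, word):
    """Apply an FSA with ε-transitions (label None) to a word."""
    initial, finals, trans = fsa

    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            s = stack.pop()
            for (p, label, t) in trans:
                if p == s and label is None and t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen

    current = closure({initial})
    for sym in word:
        current = closure({t for (p, label, t) in trans
                           if p in current and label == sym})
    return bool(current & finals)


# S: 0 -a-> 1 -call S-> 2 -b-> 3, plus 0 -ε-> 3 (accepts a^n b^n)
RTN = {"S": (0, {3}, [(0, "a", 1), (1, ("call", "S"), 2),
                      (2, "b", 3), (0, None, 3)])}
flat2 = flatten(RTN, "S", 2)
```

With two inlining levels the flattened machine covers nesting depths 0 to 2 only, which illustrates why complete flattening is impossible for a recursive RTN.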


Earley-like acceptor

- Contrary to FSAs, RTNs can factor out infixes
- Breadth-first & depth-first treat RTNs as FSAs: they take ESs with a stack as FSA states
  - Problem 1: computation of common infixes is not factored out (exponential worst-case cost)
  - Problem 2: left-recursive calls lead to infinite loops
- Earley-like: as breadth-first but without stacks
  - Exploration of call transitions is paused
  - Calls to the same set of states are initiated only once
  - Paused explorations are resumed each time the call they depend on is completed
- Both problems solved
- Polynomial worst-case cost (n³, without output generation)
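The pausing and resuming of calls can be sketched as below. This is a simplified reading of the slide, not the thesis algorithm: an item `(q, c, j)` means "at state q, inside a call to state c initiated at input position j", calls target single states rather than sets of states, and the table layout is an assumption.

```python
def earley_accepts(rtn, symbols):
    """Earley-like acceptor for an RTN given as
    (consume, epsilon, call, initial, finals)."""
    consume, epsilon, call, initial, finals = rtn
    waiting = {}                  # (called state, start index) -> paused items
    V = {(initial, initial, 0)}
    n = len(symbols)
    for i in range(n + 1):
        completed, queue = set(), list(V)
        while queue:
            q, c, j = queue.pop()
            new = [(t, c, j) for t in epsilon.get(q, [])]
            for called, ret in call.get(q, []):
                # pause the exploration until the call (called, i) completes
                waiting.setdefault((called, i), set()).add((ret, c, j))
                new.append((called, called, i))     # initiated only once
                if (called, i) in completed:        # call already completed
                    new.append((ret, c, j))
            if q in finals:
                completed.add((c, j))
                new.extend(waiting.get((c, j), ()))  # resume paused items
            for item in new:
                if item not in V:
                    V.add(item)
                    queue.append(item)
        if i == n:
            return any(q in finals and c == initial and j == 0
                       for (q, c, j) in V)
        V = {(t, c, j) for (q, c, j) in V
             for t in consume.get((q, symbols[i]), [])}
        if not V:
            return False


# Example grammar of the trace slides (accepts a^n b^n)
ANBN = ({("q0", "a"): ["q1", "q3"], ("q2", "b"): ["q5"], ("q4", "b"): ["q5"]},
        {"q0": ["q5"]},
        {"q1": [("q0", "q2")], "q3": [("q0", "q4")]},
        "q0", {"q5"})

# Left-recursive grammar S -> S a | a
LEFT = ({("s0", "a"): ["s2"], ("s1", "a"): ["s2"]},
        {},
        {"s0": [("s0", "s1")]},
        "s0", {"s2"})
```

Since items carry a start index instead of a stack, the left-recursive grammar is processed without looping: the recursive call to s0 at position 0 is the item that already exists, so it is not initiated again.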


Earley-like translator

- Outilex’s “trivial” extension of Earley’s algorithm for RTNs with output (see Sastre and Forcada, 2009):
  - extend ESs with the blackboards generated from the last initiated call up to reaching the ES
  - upon call completion, resume explorations with the combination of the pre-call and in-call blackboards
- Problem: ESs are cloned due to the different outputs to generate
  - Indeed, implicit computation of pre-call × in-call blackboards
  - RTN generating an exponential number of outputs w.r.t. the input length ⇒ exponential worst-case cost
- Example: unresolved prepositional phrase attachments
  - The boy saw the man: 2⁰ interpretations
  - The boy saw the man with the telescope: 2¹ interpretations
  - The boy saw the man with the telescope in the garden: 2² interpretations


Translation into FPNs

- Compute the set of outputs as a FPN accepting them instead of extending ESs with blackboards (Sastre, 2009)
  - Earley acceptor ESs become FPN states
  - Call transitions between ESs become FPN call transitions
  - RTN infixes are also factored out in the FPN
  - No more ES cloning: create FPN transitions consuming the outputs
- Polynomial worst-case cost (n³) even for grammars generating an exponential number of outputs w.r.t. the input length


FPN pruning & language generation

- FPN pruning: remove useless substructures due to input misinterpretations (Sastre et al., 2009)
- Prune before generating the language of outputs, if needed (the effective list of outputs, Sastre et al., 2009)
- Again, exponential worst-case cost, but no time is wasted computing translations of misinterpreted input segments


Blackboard set processing of FPNs

- Blackboard set processing (BSP): efficient generation of the language of a FPN
  - Traverse the FPN by following a topological sort
  - Avoid multiple explorations of FPN transitions
- Output FPNs are a kind of “acyclic” RTNs
  - Topological sort is possible as for PERT networks (Kahn, 1962), though there are no calls in PERT networks!
- Redefinition of topological sort for FPNs:
  - Topological sort within call substructures as for PERT networks
  - Initialization of call substructures in arbitrary order, but return states must wait for every call completion they depend on

[Figure: example FPN with states r0 ... r7, keys 0 ... 3, calls to {r2} and outputs A and B]
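The idea of visiting each transition once can be sketched for the call-free case, that is, for a plain acyclic graph with string outputs. This is a deliberate simplification: FPN calls, keys and filtered pops are left out, and the function names and example graph are assumptions of this sketch.

```python
from collections import deque


def kahn_order(states, transitions):
    """Topological order of an acyclic graph (Kahn, 1962).

    transitions: list of (source, output, target).
    """
    indegree = {s: 0 for s in states}
    for _, _, t in transitions:
        indegree[t] += 1
    ready = deque(s for s in states if indegree[s] == 0)
    order = []
    while ready:
        s = ready.popleft()
        order.append(s)
        for p, _, t in transitions:
            if p == s:
                indegree[t] -= 1
                if indegree[t] == 0:   # every incoming transition processed
                    ready.append(t)
    if len(order) != len(states):
        raise ValueError("the graph has a cycle")
    return order


def output_sets(states, transitions, initial):
    """Set of outputs generated up to each state; each transition is
    explored exactly once thanks to the topological order."""
    outputs = {s: set() for s in states}
    outputs[initial] = {""}
    for s in kahn_order(states, transitions):
        for p, out, t in transitions:
            if p == s:
                outputs[t] |= {b + out for b in outputs[s]}
    return outputs


STATES = ["r0", "r1", "r2", "r3"]
TRANSITIONS = [("r0", "(", "r1"), ("r0", "[", "r1"), ("r1", "x", "r2"),
               ("r2", ")", "r3"), ("r2", "]", "r3")]
```

A state is processed only after all of its incoming transitions, so its output set is complete when it is propagated; the number of outputs themselves can still grow exponentially.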


Computing a FPN’s top-ranked blackboard in time n³

- Extension of blackboard set processing for FPNs with weighted output
- Inspired by the dynamic programming algorithm for the computation of the edit distance between two strings (Wagner and Fischer, 1974)
  - Traverse the FPN by following a topological sort
  - Annotate each state with the maximum weight that can be generated up to reaching it and the corresponding last transition
  - Traverse the succession of last transitions backwards and build the corresponding top blackboard
- Finally, a polynomial worst-case cost algorithm even for RTNs generating an exponentially increasing number of outputs
- Unification machines may produce incompatible feature structures
  - The top blackboard might be illegal
  - Compute the top non-illegal blackboard in time n³? (future work)
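The annotate-then-backtrack scheme can be sketched for a call-free weighted acyclic graph as follows. As before this leaves out FPN calls and keys, and the tuple layout `(source, output, weight, target)` is an assumption; the standard-library `graphlib.TopologicalSorter` (Python 3.9+) provides the topological order.

```python
from graphlib import TopologicalSorter


def top_blackboard(transitions, initial, final):
    """Return (max weight, output) of the best path from initial to final,
    or None when final is unreachable."""
    preds = {}
    for s, out, w, t in transitions:
        preds.setdefault(t, set()).add(s)
        preds.setdefault(s, set())
    # state -> (max weight up to the state, last transition taken)
    best = {initial: (0, None)}
    for state in TopologicalSorter(preds).static_order():
        if state not in best:          # not reachable from the initial state
            continue
        for tr in transitions:
            s, out, w, t = tr
            if s == state:
                candidate = best[state][0] + w
                if t not in best or candidate > best[t][0]:
                    best[t] = (candidate, tr)
    if final not in best:
        return None
    # follow the last transitions backwards to rebuild the top output
    outputs, state = [], final
    while best[state][1] is not None:
        s, out, w, t = best[state][1]
        outputs.append(out)
        state = s
    return best[final][0], "".join(reversed(outputs))


WEIGHTED = [("r0", "(", 1, "r1"), ("r0", "[", 3, "r1"), ("r1", "x", 0, "r2"),
            ("r2", ")", 2, "r3"), ("r2", "]", 1, "r3")]
```

Only one annotation is stored per state, so the cost depends on the number of states and transitions rather than on the number of outputs the graph represents.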

Algorithm optimizations



Efficient management of sets

Why should we care about set management?

- Algorithms make intensive use of set data structures:
  - Construction of sets of execution states (SESs)
  - Construction of sets of outputs
- Self-balancing binary search trees (BSTs) are an efficient option
  - Element searches have a logarithmic cost, but addition & removal cost is increased due to rebalancing
  - Worst case: successive additions in direct or reverse order

[Figure: a balanced BST over 1 ... 7, and a BST built by adding 1, 2 and 3 in order, which needs a rebalance after adding 3]

Red-black trees

- Addition of elements in random order tends to keep balance
- Strict balancing is unnecessary
- GNU’s C++ Standard Template Library implements red-black trees (following Cormen et al., 2001): “half”-balanced BSTs
- Good compromise between balanced & unbalanced BSTs, but once a FPN is built, further processing does not require element additions or searches ⇒ rebalances are unnecessary

[Figure: red-black trees obtained by adding the elements 1 ... 6 in random order and in direct order]



Double-linked red-black trees

- Red-black tree + double-linked list = double-linked red-black tree
- Once no more elements are to be added or searched, the structure can be treated as a mere double-linked list
- Faster element removal without rebalancing or even maintaining the tree structure
- Unexpected (but good) side effects (Das et al., 2008):
  - Faster access to neighbour elements → faster element addition
  - Faster sequential traversal → faster set deletion

[Figure: removing element 4 from a tree over 1 ... 7 whose elements are also chained in a double-linked list]


Efficient management of sequences

Why should we care about sequence management?

- Algorithms generating output sequences or using stacks make intensive use of sequence copies and comparisons:
  - Compare a sequence when adding it to a set of ESs or outputs
  - Copy α when building β as ασ (appending σ to the output)
  - Copy π when building π′ as πqr (pushing return state qr)
  - Copy π when popping return state qr from π′
- Cost proportional to the sequence lengths
- Recall: each trie state corresponds to a unique string
- Sequences can be reduced to integer numbers: use the pointers to the nodes of a trie as identifiers

Retrieval trees for string management

- Build the trie as needed and retrieve the pointers (red arrows)
- Append σ to α: follow the pointer to α, search/insert the child ασ & return its pointer
- Remove σ from ασ: follow the pointer to ασ & return the parent pointer
- Operations on sequences are reduced to pointer copies & comparisons
- Logarithmic worst-case cost instead of linear

[Figure: a trie over the strings o, of, on, i, in, t, to, and its step-by-step construction: ε·o = o, o·f = of, o·n = on; red arrows are the pointers returned for each string]
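The pointer-as-identifier idea can be sketched as follows. The class and method names are assumptions of this sketch. Children are kept in a Python dictionary, so child lookup here is hashed rather than the ordered search that gives the logarithmic bound mentioned above.

```python
class TrieNode:
    """One node per distinct sequence; the node itself is the identifier."""
    __slots__ = ("parent", "symbol", "children")

    def __init__(self, parent=None, symbol=None):
        self.parent = parent
        self.symbol = symbol
        self.children = {}


class SequenceTrie:
    def __init__(self):
        self.root = TrieNode()          # identifier of the empty sequence ε

    def append(self, node, symbol):
        """Identifier of ασ given the identifier of α (built on demand)."""
        child = node.children.get(symbol)
        if child is None:
            child = node.children[symbol] = TrieNode(node, symbol)
        return child

    def remove_last(self, node):
        """Identifier of α given the identifier of ασ."""
        return node.parent

    def to_list(self, node):
        """Rebuild the explicit sequence; only needed for final results."""
        symbols = []
        while node.parent is not None:
            symbols.append(node.symbol)
            node = node.parent
        return symbols[::-1]


trie = SequenceTrie()
o = trie.append(trie.root, "o")
of = trie.append(o, "f")
on = trie.append(o, "n")
```

Two sequences are equal exactly when their identifiers are the same node, so adding an ES or an output to a set compares pointers instead of whole sequences. The same structure can represent stacks of return states, with `append` as push and `remove_last` as pop.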

Empirical tests


Experimental conditions

The MovistarBot grammar & testing corpus

- Used for comparing the performance of the different algorithms and optimizations
- Translates sentences in Spanish requesting mobile services into commands that an AIML chatterbot can easily understand
- The grammar (two versions):
  - Pseudo-minimized version: 1359 states & 3141 transitions
  - Flattened & pseudo-minimized version: 5504 states & 31702 transitions
- The corpus:
  - 168 sentences
  - 6.9 interpretations per sentence (average)
  - 10.1 words per sentence (average)
  - 4.1 characters per word (average)

Experimental results

Speedups w.r.t. Unitex’s depth-first algorithm

Grammar flattening (before pseudo-minimization): [1.43, 5.05]

Algorithm                    Pseudo-minimized    Flattened & pseudo-minimized
FPN top blackboard           2.12                1.45
FPN blackboard set proc.     1.74                1.15
Optimized Earley             1.64                0.72
Outilex’s Earley             1.48                0.69
Optimized depth-first        1.15                1.16
Unitex’s depth-first         1                   1
Optimized breadth-first      0.68                0.76

Optimization                 Pseudo-minimized    Flattened & pseudo-minimized
Set management               [1.02, 1.37]        [1.02, 1.12]
Sequence management          [1.14, 1.43]        [1.11, 1.37]
Both                         [1.30, 1.64]        [1.12, 1.45]

What about large coverage NL grammars?

- Speedups of the new algorithms can be expected to be greater for large coverage NL grammars
  - Main difference between the new algorithms and Unitex’s and Outilex’s algorithms: more efficient treatment of non-determinism and ambiguity
  - These factors are greater in large coverage NL grammars
- Furthermore, the speedup of FPN top blackboard is expected to increase exponentially w.r.t. ambiguity and non-determinism, since it has a polynomial worst-case cost instead of an exponential one

Conclusion


- Grammar flattening: best optimization (when possible)
- Faster algorithms of application of local grammars
  - FPN top blackboard: the fastest for both MovistarBot grammars
  - FPN blackboard set processing: faster than Unitex’s & Outilex’s algorithms
  - Flattened grammar ⇒ Unitex’s algorithm faster than Outilex’s
- A polynomial worst-case cost algorithm instead of an exponential one
- The new algorithms treat ambiguity and non-determinism more efficiently
  - Therefore, even better results are expected for larger NL grammars
- Optimizations applicable to parsing algorithms in general:
  - Efficient management of sets with double-linked red-black trees
  - Efficient management of sequences with retrieval trees
- A family of finite-state machines and algorithms of application
- A theoretical framework providing the tools for future extensions

Future work

Multiple proposals for continuing this work (extensive list in the thesis)

- Algorithm enhancements
  - Better strategies for the management of sets and sequences
  - Parallelization: concurrent traversal of transitions
- Grammar optimizations
  - Grammar filtering according to the sentence to apply (Boullier and Sagot, 2007)
  - Flattening initial fragments of grammar paths (prefix overlay transducers, Marschner, 2007)
- Additional functionalities
  - Efficient support of unification grammars (problem of the illegal top blackboard)
  - Tolerating errors (approximate string matching)

Acknowledgements

- Université Paris-Est, Ministère de l’Éducation Nationale, de la Recherche et de la Technologie & Centre National de la Recherche Scientifique: contrat d’engagement en qualité d’allocataire de recherche No 15198-2004
- Universitat d’Alacant: grant numbers INV05-40 & VIGROB-127
- Spanish Government: grant number TIC20033-080681-C02-02
- Universitat Politècnica de València, Instituto de Telecomunicaciones y Aplicaciones Multimedia & Telefónica I+D: Project “Tecnologías disruptivas para servicios avanzados en movilidad”, Ref. 48566/1

Traces


Breadth-first acceptor trace

[Figure: RTN with initial state q0 and final state q5; transitions q0 -a→ q1, q1 -{q0}→ q2, q2 -b→ q5, q0 -a→ q3, q3 -{q0}→ q4, q4 -b→ q5 and q0 -ε→ q5]

Input: aabb. ESs are pairs (state, stack of return states); λ is the empty stack.

V0
  XI: (q0, λ)
  ∪ Cε(XI): (q5, λ)
V1
  ∆(V0, a): (q1, λ), (q3, λ)
  ∪ Cε(∆(V0, a)): (q0, q2), (q0, q4), (q5, q2), (q5, q4), (q2, λ), (q4, λ)
V2
  ∆(V1, a): (q1, q2), (q3, q2), (q1, q4), (q3, q4)
  ∪ Cε(∆(V1, a)): (q0, q2q2), (q0, q2q4), (q0, q4q2), (q0, q4q4), (q5, q2q2), (q5, q2q4), (q5, q4q2), (q5, q4q4), (q2, q2), (q4, q2), (q2, q4), (q4, q4)
V3
  ∆(V2, b): (q5, q2), (q5, q4)
  ∪ Cε(∆(V2, b)): (q2, λ), (q4, λ)
V4
  ∆(V3, b): (q5, λ)
  ∪ Cε(∆(V3, b)): no further ESs

Breadth-first translator trace

[Figure: RTN with string output, initial state q0 and final state q5; transitions q0 -a:(→ q1, q1 -{q0}→ q2, q2 -b:)→ q5, q0 -a:[→ q3, q3 -{q0}→ q4, q4 -b:]→ q5 and q0 -ε:x→ q5]

Input: aabb. ESs are triples (state, output, stack of return states); ε is the empty output and λ the empty stack.

V0
  XI: (q0, ε, λ)
  ∪ Cε(XI): (q5, x, λ)
V1
  ∆(V0, a): (q1, (, λ), (q3, [, λ)
  ∪ Cε(∆(V0, a)): (q0, (, q2), (q0, [, q4), (q5, (x, q2), (q5, [x, q4), (q2, (x, λ), (q4, [x, λ)
V2
  ∆(V1, a): (q1, ((, q2), (q3, ([, q2), (q1, [(, q4), (q3, [[, q4)
  ∪ Cε(∆(V1, a)): (q0, ((, q2q2), (q0, ([, q2q4), (q0, [(, q4q2), (q0, [[, q4q4), (q5, ((x, q2q2), (q5, ([x, q2q4), (q5, [(x, q4q2), (q5, [[x, q4q4), (q2, ((x, q2), (q4, ([x, q2), (q2, [(x, q4), (q4, [[x, q4)
V3
  ∆(V2, b): (q5, ((x), q2), (q5, ([x], q2), (q5, [(x), q4), (q5, [[x], q4)
  ∪ Cε(∆(V2, b)): (q2, ((x), λ), (q2, ([x], λ), (q4, [(x), λ), (q4, [[x], λ)
V4
  ∆(V3, b): (q5, ((x)), λ), (q5, ([x]), λ), (q5, [(x)], λ), (q5, [[x]], λ)

Earley acceptor trace

[Figure: same RTN as in the breadth-first acceptor trace]

Input: aabb.

V0
  XI: (q0, λ, {q0}, 0)
  ∪ Cε(XI): (q5, λ, {q0}, 0)
V1
  ∆(V0, a): (q1, λ, {q0}, 0), (q3, λ, {q0}, 0)
  ∪ Cε(∆(V0, a)): (q0, λ, {q0}, 1), (q5, λ, {q0}, 1), (q2, λ, {q0}, 0), (q4, λ, {q0}, 0)
V2
  ∆(V1, a): (q1, λ, {q0}, 1), (q3, λ, {q0}, 1)
  ∪ Cε(∆(V1, a)): (q0, λ, {q0}, 2), (q5, λ, {q0}, 2), (q2, λ, {q0}, 1), (q4, λ, {q0}, 1)
V3
  ∆(V2, b): (q5, λ, {q0}, 1)
  ∪ Cε(∆(V2, b)): (q2, λ, {q0}, 0), (q4, λ, {q0}, 0)
V4
  ∆(V3, b): (q5, λ, {q0}, 0)

Earley translator trace

[Figure: same RTN with string output as in the breadth-first translator trace]

Input: aabb.

V0
  XI: (q0, ε, λ, {q0}, 0)
  ∪ Cε(XI): (q5, x, λ, {q0}, 0)
V1
  ∆(V0, a): (q1, (, λ, {q0}, 0), (q3, [, λ, {q0}, 0)
  ∪ Cε(∆(V0, a)): (q0, ε, λ, {q0}, 1), (q5, x, λ, {q0}, 1), (q2, (x, λ, {q0}, 0), (q4, [x, λ, {q0}, 0)
V2
  ∆(V1, a): (q1, (, λ, {q0}, 1), (q3, [, λ, {q0}, 1)
  ∪ Cε(∆(V1, a)): (q0, ε, λ, {q0}, 2), (q5, x, λ, {q0}, 2), (q2, (x, λ, {q0}, 1), (q4, [x, λ, {q0}, 1)
V3
  ∆(V2, b): (q5, (x), λ, {q0}, 1), (q5, [x], λ, {q0}, 1)
  ∪ Cε(∆(V2, b)): (q2, ((x), λ, {q0}, 0), (q4, [(x), λ, {q0}, 0), (q2, ([x], λ, {q0}, 0), (q4, [[x], λ, {q0}, 0)
V4
  ∆(V3, b): (q5, ((x)), λ, {q0}, 0), (q5, [(x)], λ, {q0}, 0), (q5, ([x]), λ, {q0}, 0), (q5, [[x]], λ, {q0}, 0)

References


Blanc, O. and Constant, M. (2006). Outilex, a linguistic platform for text processing. In Interactive Presentation Session of Coling-ACL06, pages 73–76, Morristown, NJ, USA. Association for Computational Linguistics.

Boullier, P. and Sagot, B. (2007). Are very large context-free grammars tractable? In Proceedings of the 10th International Workshop on Parsing Technologies (IWPT 07), Prague, Czech Republic.

Carrasco, R. C. and Forcada, M. L. (2002). Incremental construction and maintenance of minimal finite-state automata. Computational Linguistics, 28(2):207–216.

Cocke, J. and Schwartz, J. T. (1970). Programming languages and their compilers: Preliminary notes. Technical report, Courant Institute of Mathematical Sciences, New York University, New York.

Constant, M. (2003a). Converting linguistic systems of relational matrices into finite-state transducers. In Proceedings of the EACL Workshop on Finite-State Methods in Natural Language Processing, pages 75–82, Budapest.

Constant, M. (2003b). Grammaires locales pour l’analyse automatique de textes : Méthodes de construction et outils de gestion. PhD thesis, Université de Marne la Vallée.

Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. (2001). Introduction to algorithms. MIT press, Cambridge, Massachusetts, 2nd edition.

Daciuk, J., Maurel, D., and Savary, A. (2005). Dynamic perfect hashing with finite-state automata. In Kłopotek, M. A., Wierzchoń, S., and Trojanowski, K., editors, Intelligent Information Processing and Web Mining, volume 31 of Advances in Soft Computing, pages 169–178. Springer Berlin / Heidelberg.

Daciuk, J., Mihov, S., Watson, B. W., and Watson, R. E. (2000). Incremental construction of minimal acyclic finite-state automata. Computational Linguistics, 26(1):3–16.

Das, D., Valluri, M., Wong, M., and Cambly, C. (2008). Speeding up STL set/map usage in C++ applications. In Kounev, S., Gorton, I., and Sachs, K., editors, Performance Evaluation: Metrics, Models and Benchmarks, volume 5119 of Lecture Notes in Computer Science, pages 314–321. Springer-Verlag.

Fredkin, E. (1960). Trie memory. Communications of the ACM, 3(9):490–499.

Friburger, N. (2002). Reconnaissance automatique de noms propres: Application à la classification automatique des textes journalistiques. PhD thesis, Université de Tours.

Gross, M. (1975). Méthodes en syntaxe. Hermann, Paris.

Gross, M. (1996). Lexicon-grammar. In Brown, K. and Miller, J., editors, Concise Encyclopedia of Syntactic Theories, pages 224–259. Pergamon Press, Oxford.

Gross, M. (1997). The construction of local grammars. In Roche, E. and Schabes, Y., editors, Finite State Language Processing, pages 329–352. MIT Press, Cambridge, MA, USA.

Gross, M. (1999). Lemmatization of compound tenses in English. Lingvisticæ Investigationes, 22:71–122.

Hopcroft, J. E., Motwani, R., and Ullman, J. D. (2000). Introduction to automata theory, languages, and computation. Addison-Wesley, 2nd edition.

Jung, E.-J. (2005). Grammaire des adverbes de duree et de date en coréen. PhD thesis, Université de Marne-la-Vallée.

Kahn, A. B. (1962). Topological sorting of large networks. Communications of the ACM, 5(11):558–562.

Kasami, T. (1965). An efficient recognition and syntax analysis algorithm for context free languages. Scientific Report AF CRL-65-758, Air Force Cambridge Research Laboratory, Bedford, Massachusetts.

References IV Knuth, D. E. (1965). On the translation of languages from left to right. Information and Control, 8(6):607–639. Laporte, E. (2005). In memoriam Maurice Gross. Archives of Control Sciences, 15(3):257–278. Special issue on Human Language Technologies as a challenge for Computer Science and Linguistics. Part I. (2nd Language and Technology Conference). Laporte, E. (2007). Evaluation of a grammar of French determiners. In 27th Congress of the Brazilian Society of Computation (SBC’07) , pages 1625–1634. Workshop on Information Technology and Human Language (TIL). Leclère, C. (2002). Organization of the lexicon-grammar of French verbs. Lingvisticæ Investigationes, 25(1):29–48. Marschner, C. (2007). Efficiently matching with local grammars using prefix overlay transducers. In Holub, J. and Ždárek, J., editors, Implementation and Application of Automata, volume 4783 of Lecture Notes in Computer Science, pages 314–316. Springer-Verlag. Paumier, S., Nakamura, T., and Voyatzi, S. (2009). UNITEX, a corpus processing system with multi-lingual linguistic resources. In eLexicography in the 21st century: new challenges, new applications (eLEX’09), pages 173–175.




References V

Revuz, D. (1992). Minimisation of acyclic deterministic automata in linear time. Theoretical Computer Science, 92(1):181–189.

Roche, E. (1993). Une représentation par automate fini des textes et des propriétés transformationnelles des verbes. Lingvisticæ Investigationes, 17(1):189–222.

Sastre, J. M. (2006a). Computer tools for the management of lexicon-grammar databases. In Proceedings of TALN’06, pages 600–608, Leuven, Belgium.

Sastre, J. M. (2006b). HOOP: a Hyper-Object Oriented Platform for the management of linguistic databases. Presentation in 25th Lexis and Grammar Conference, Palermo, Italy, September 6-9. Abstract available for download at http://www-igm.univ-mlv.fr/~sastre/publications/sastre06b.zip.

Sastre, J. M. (2009). Efficient parsing using filtered-popping recursive transition networks. In Maneth, S., editor, Implementation and Application of Automata, volume 5642 of Lecture Notes in Computer Science, pages 241–244. Springer-Verlag.

Sastre, J. M. and Forcada, M. L. (2009). Efficient parsing using recursive transition networks with output. In Vetulani, Z. and Uszkoreit, H., editors, Human Language Technology. Challenges of the Information Society, volume 5603 of Lecture Notes in Artificial Intelligence, pages 192–204. Springer-Verlag. Extended version.




References VI

Sastre, J. M., Sastre, J., and García, J. (2009). Boosting a chatterbot understanding with a weighted filtered-popping network parser. In Vetulani, Z., editor, Proceedings of the 4th Language & Technology Conference (LTC’09), pages 74–78, Poznań, Poland. Wydawnictwo Poznańskie Sp. z o.o.

Silberztein, M. D. (1998). INTEX: An integrated FST toolbox. In Wood, D. and Yu, S., editors, Automata Implementation, volume 1436 of Lecture Notes in Computer Science, pages 185–197. Springer Berlin / Heidelberg.

Silberztein, M. D. (2007). An alternative approach to tagging. In Kedad, Z., Lammari, N., Métais, E., Meziane, F., and Rezgui, Y., editors, Natural Language Processing and Information Systems, volume 4592 of Lecture Notes in Computer Science, pages 1–11. Springer Berlin / Heidelberg.

Tolone, E. (2011). Analyse syntaxique à l’aide des tables du Lexique-Grammaire du français. PhD thesis, Université Paris-Est.

Tomita, M. (1987). An efficient augmented-context-free parsing algorithm. Computational Linguistics, 13(1-2):31–46.

van de Snepscheut, J. L. A. (1985). Trace Theory and VLSI Design, volume 200 of Lecture Notes in Computer Science. Springer-Verlag. PhD thesis, Eindhoven University of Technology.




References VII

van Noord, G. (2000). Treatment of epsilon moves in subset construction. Computational Linguistics, 26(1):61–76.

van Noord, G. and Gerdemann, D. (2001). Finite state transducers with predicates and identities. Grammars, 4(3):263–286.

Voyatzi, S. (2006). Description morpho-syntaxique et sémantique des adverbes figés de phrase en vue d’un système d’analyse automatique des textes grecs. PhD thesis, Université de Marne-la-Vallée.

Wagner, R. A. and Fischer, M. J. (1974). The string-to-string correction problem. Journal of the ACM, 21(1):168–173.

Wallace, R. (2004). The elements of AIML style. ALICE AI Foundation.

Woods, W. A. (1969). Augmented transition networks for natural language analysis. Technical Report CS-1, Harvard Computation Laboratory.

Woods, W. A. (1970). Transition network grammars for natural language analysis. Communications of the ACM, 13(10):591–606.




References VIII

Yannacopoulou, A. (2005). Le lexique-grammaire des verbes du grec moderne – Les constructions transitives locatives standard. PhD thesis, Université de Marne-la-Vallée.

Younger, D. H. (1967). Recognition and parsing of context-free languages in time n³. Information and Control, 10(2):189–208.



