Ambiguity Detection: Scaling to Scannerless

Bas Basten, Paul Klint, Jurgen Vinju
Centrum Wiskunde & Informatica, Amsterdam, The Netherlands

Motivation: Scannerless Generalized Parsing
● No separate scanner/tokenizer
● Modular grammar definitions
● Enables parsing of:
  – Legacy languages
  – Language embeddings

Problem: possible ambiguity!

Parsing (Legacy) Languages
● PL/I:
    IF IF = THEN THEN IF = ENDIF; ENDIF;
● Pascal:
    a : array [1..10] of Integer
● C++:
    List setList;

Language Embeddings
● Embedding AspectJ into Java*
● Problem: different reserved keywords

Java:
    class Screen {
      private float width, height;
      public float aspect() {
        return width / height;
      }
    }

AspectJ:
    aspect MyAspect {
      pointcut aspectCall(): target(Screen)
        && call(float aspect());
      ...
    }

* Bravenboer, Tanter, Visser – OOPSLA 2006

Character-level grammars
● EBNF
● Include lexical definitions (no tokens)
● Terminals are character classes ([a-z], [0-9])
● Disambiguation filters:
  – Follow restrictions (longest match)
    ● Identifier -/- [a-z]
  – Rejects (keyword reservation)
    ● Identifier → "else" {reject}
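
The effect of the two filters above can be illustrated with a minimal Python sketch. It is not how SDF or AmbiDexter implement them; the function names and the one-word keyword set are assumptions made for illustration. Without a scanner, every non-empty prefix of a run of letters is a candidate Identifier. The follow restriction keeps only the longest match, and the reject removes reserved keywords.

```python
import string

LETTERS = set(string.ascii_lowercase)
RESERVED = {"else"}  # assumed keyword set, mirroring: Identifier -> "else" {reject}


def identifier_candidates(text, start):
    """Return every end index e such that text[start:e] is a non-empty run of [a-z].

    Without a scanner, all of these prefixes are valid Identifier derivations.
    """
    ends = []
    pos = start
    while pos < len(text) and text[pos] in LETTERS:
        pos += 1
        ends.append(pos)
    return ends


def follow_restriction_ok(text, end):
    """Identifier -/- [a-z]: an Identifier may not be followed by a letter."""
    return end >= len(text) or text[end] not in LETTERS


def not_rejected(text, start, end):
    """Reject filter: a reserved keyword is not an Identifier."""
    return text[start:end] not in RESERVED


def identifiers_at(text, start):
    """Apply both filters to the candidate matches starting at `start`."""
    return [
        text[start:end]
        for end in identifier_candidates(text, start)
        if follow_restriction_ok(text, end) and not_rejected(text, start, end)
    ]
```

For the input "abc+" there are three candidate matches at position 0 ("a", "ab", "abc"), and only "abc" survives the follow restriction. For "else " the longest match is the reserved word itself, so no Identifier remains, while "elsewhere" is still accepted.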

Ambiguity Detection
● Undecidable in general
● Trade-off: performance/termination ↔ accuracy
● Ambiguity detection methods:
  – Approximative
  – Exhaustive

Research Question
● Previous work: AmbiDexter
  – Harmless production filtering
  – Significant speed-ups (LDTA 2010)
  – Proved correct (ICTAC 2010)
● Applicable to character-level grammars?
  – More complex: full definition of lexical syntax
  – Less deterministic: no heuristics of scanner
  – Disambiguation filters

AmbiDexter
[Diagram of the AmbiDexter pipeline. Labels: Grammar; Non-deterministic Finite Automaton; Filter & Reduce; Sentence Generator; outcomes "Unambiguous", "Ambiguous", "?" (time-out)]
● NFA describes an overapproximation of the parse trees
● Smaller NFA = fewer sentences
● Goal: find ambiguous strings faster

NFA Describes Parse Trees
● Parse trees:
  [Figure: the two parse trees of Exp + Exp * Exp, one with * at the root and one with + at the root]
● Bracketed strings:
    (2 (1 Exp + Exp )1 * Exp )2
    (1 Exp + (2 Exp * Exp )2 )1

Example NFA
[Figure: NFA accepting the bracketed strings above]
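
The bracketed strings above can be reproduced with a small Python sketch. It assumes, as the brackets on the slide suggest, that production 1 is Exp → Exp + Exp and production 2 is Exp → Exp * Exp. The function names are hypothetical, and the sketch enumerates trees directly rather than building an NFA as AmbiDexter does. A sentential form is ambiguous when more than one bracketed string reduces to it once the brackets are removed.

```python
PRODUCTIONS = {"+": 1, "*": 2}  # assumed: 1: Exp -> Exp + Exp, 2: Exp -> Exp * Exp


def bracketed_strings(symbols):
    """Enumerate the bracketed strings of all parse trees of `symbols`.

    `symbols` alternates the leaf "Exp" with operators, e.g.
    ["Exp", "+", "Exp", "*", "Exp"]. Each application of production n is
    wrapped in the brackets "(n" and ")n".
    """
    if len(symbols) == 1:
        return [symbols[0]]  # a single unexpanded Exp leaf
    results = []
    # Try every operator position as the root of the tree.
    for i in range(1, len(symbols), 2):
        n = PRODUCTIONS[symbols[i]]
        for left in bracketed_strings(symbols[:i]):
            for right in bracketed_strings(symbols[i + 1:]):
                results.append(f"({n} {left} {symbols[i]} {right} ){n}")
    return results


def strip_brackets(bracketed):
    """Remove the bracket symbols to recover the plain sentential form."""
    return " ".join(t for t in bracketed.split() if t[0] not in "()")


if __name__ == "__main__":
    for s in bracketed_strings(["Exp", "+", "Exp", "*", "Exp"]):
        print(s, "->", strip_brackets(s))
```

Both bracketed strings strip down to the same form, Exp + Exp * Exp, which is what makes that form ambiguous.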

Extensions to baseline algorithm
● Modifications to NFA for:
  – Character classes (replace tokens)
  – Follow restrictions (propagation)
  – Rejects (language difference)
  – Priority/Associativity (derivation restriction)
● General improvement:
  – Grammar unfolding
    ● Often used non-terminals (whitespace)
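
To make "derivation restriction" concrete, the following Python sketch filters explicit parse trees with priority and associativity rules. It works on trees rather than on the NFA, so it illustrates only the effect of the filter, not the NFA modification itself. The priority table, the choice of left associativity, and all names are assumptions made for the example.

```python
PRIORITY = {"*": 2, "+": 1}   # assumed: * binds tighter than +
LEFT_ASSOC = {"+", "*"}       # assumed: both operators are left-associative


def trees(symbols):
    """All binary parse trees over an alternating Exp/operator sequence."""
    if len(symbols) == 1:
        return ["Exp"]
    out = []
    for i in range(1, len(symbols), 2):
        for left in trees(symbols[:i]):
            for right in trees(symbols[i + 1:]):
                out.append((symbols[i], left, right))
    return out


def allowed(tree):
    """Check the derivation restrictions at every node of the tree."""
    if tree == "Exp":
        return True
    op, left, right = tree
    for child, is_right in ((left, False), (right, True)):
        if child != "Exp":
            child_op = child[0]
            # Priority: a lower-priority production may not be a direct child.
            if PRIORITY[child_op] < PRIORITY[op]:
                return False
            # Left associativity: the same operator may not nest on the right.
            if is_right and child_op == op and op in LEFT_ASSOC:
                return False
    return allowed(left) and allowed(right)


def disambiguated(symbols):
    """Parse trees that survive the priority/associativity restrictions."""
    return [t for t in trees(symbols) if allowed(t)]
```

With these assumed rules, Exp + Exp * Exp has two unrestricted trees but only one that satisfies the restrictions, the one with + at the root. Exp + Exp + Exp likewise keeps only its left-nested tree.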

Experiment setup
● Grammar test set:

    Grammar      Productions   Disambiguation annotations
    Oberon0      189           190
    C            324           374
    ECMAScript   403           53
    SQL-92       419           58
    Java 1.5     698           431
    C++          807           162

● Measurements:
  – NFA filtering (time, memory, edges filtered)
  – Sentence generation time before & after filtering

Measurement results (small grammar)
● Oberon0: filtering time 14s, edges filtered 53%
[Chart: Unfiltered vs. Filtered (including filtering time)]

Measurement results (medium sized grammars)
● C: filtering time 97s, edges filtered 28%
● ECMAScript: filtering time 31s, edges filtered 50%
● SQL-92: filtering time 37s, edges filtered 49%
[Charts, one panel per grammar: Unfiltered vs. Filtered (including filtering time)]

Measurement results (large grammars)
● Java 1.5: filtering time 28m, memory 16GB
● C++: filtering time >2h40m, memory >17GB. Too ambiguous!
[Chart: Unfiltered vs. Filtered (including filtering time)]

Summary

    Grammar      Break-even time   Maximum speedup   Ambiguous non-terminals found faster*
    Oberon0      15s               3399x             0
    C            6m                1.7x              2
    ECMAScript   2m                2.7x              4
    SQL-92       1m                15x               3
    Java 1.5     35m               5.3x              0

* Average time limit: 10hrs

Conclusions
● Ambiguity detection for character-level grammars
● Staged approach:
  – NFA filtering
  – Sentence generation
● Experimental evaluation
● Significant speedups
● Next step: integration with Rascal