Grammar Induction
Inductive and statistical learning of formal grammars

Pierre Dupont
[email protected]


Machine Learning

Goal: to give a machine the ability to learn, i.e. to design programs whose performance improves over time.

Inductive learning is a particular instance of machine learning:
• Goal: to find a general law from examples
• A subproblem of theoretical computer science, artificial intelligence and pattern recognition
Outline

• Grammar induction definition
• Learning paradigms
• DFA learning from positive and negative examples
• RPNI algorithm
• Probabilistic DFA learning
• Application to a natural language task
• Links with Markov models
• Smoothing issues
• Related problems and future work


Grammar Induction or Grammatical Inference

Grammar induction is a particular case of inductive learning:
• The general law is represented by a formal grammar or an equivalent machine
• The set of examples, known as the positive sample, is usually made of strings or sequences over a specific alphabet
• A negative sample, i.e. a set of strings not belonging to the target language, can sometimes help the induction process

Example: Data {aaabbb, ab} --Induction--> Grammar S → aSb, S → λ
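As a side note (not part of the original slides), here is a minimal Python sketch of what the induced grammar S → aSb, S → λ accepts, namely the strings a^n b^n; it covers the observed data {aaabbb, ab} and generalizes beyond it.

def in_language(s: str) -> bool:
    """Membership test for the language of S -> aSb | lambda, i.e. a^n b^n."""
    half = len(s) // 2
    return s == "a" * half + "b" * half

# The induced grammar accepts the observed data...
assert in_language("aaabbb") and in_language("ab") and in_language("")
# ...and generalizes to unseen strings while rejecting others
assert in_language("aaaabbbb") and not in_language("aab")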
Examples

• A natural language sentence
• Speech
• A chronological series
• Successive moves during a chess game
• Successive actions of a WEB user
• A musical piece
• A program
• A form characterized by a chain code
• A biological sequence (DNA, proteins, ...)


Chromosome classification

[Figure: grey density and grey density derivative profiles of chromosome 2a along the median axis, with the centromere marked]

String of primitives:
"=====CDFDCBBBBBBBA==bcdc==DGFB=bccb== ...... ==cffc=CCC==cdb==BCB==dfdcb====="
Pattern Recognition

[Figure: a 2-D contour drawn on a grid and encoded with an 8-direction chain code (8dC)]

8dC: 000077766676666555545444443211000710112344543311001234454311


A modeling hypothesis

G0 --Generation--> Data --Induction--> Grammar G

• Find G as close as possible to G0
• The induction process does not prove the existence of G0: it is a modeling hypothesis
Identification in the limit

G0 --Generation--> Data d1, d2, ..., dn --Induction--> Grammars G1, G2, ..., G*

• convergence in finite time to G*
• G* is a representation of L(G0) (exact learning)
Learning paradigms

How to characterize learning?
• which concept classes can or cannot be learned?
• what is a good example?
• is it possible to learn in polynomial time?


PAC Learning

G0 --Generation--> Data d1, d2, ..., dn --Induction--> Grammars G1, G2, ..., G*

• convergence to G*
• G* is close enough to G0 with high probability ⇒ Probably Approximately Correct learning
• polynomial time complexity
Define a probability distribution D on the set of strings Σ≤n.

[Figure: the symmetric difference L(G*) ⊕ L(G0) as a region within Σ*]

  P[ P_D( L(G*) ⊕ L(G0) ) < ε ] > 1 − δ

• The same unknown distribution D is used to generate the sample and to measure the error
• The result must hold for any distribution D (distribution-free requirement)
• The algorithm must return a hypothesis in polynomial time with respect to 1/ε, 1/δ, n and |R(L)|


Other learnability results

• Identification in the limit in polynomial time
  – DFAs cannot be efficiently identified in the limit
  – unless we can ask equivalence and membership queries to an oracle
• PAC learning
  – DFAs are not PAC learnable (under some cryptographic limitation assumption)
  – unless we can ask membership queries to an oracle
Identification in the limit: good and bad news

The bad one...
Theorem 1. No superfinite class of languages is identifiable in the limit from positive data only.

The good one...
Theorem 2. Any admissible class of languages is identifiable in the limit from positive and negative data.


PAC learning with simple examples

• Simple examples are drawn according to the conditional Solomonoff-Levin distribution

    Pc(x) = λ_c 2^(−K(x|c))

  where K(x|c) denotes the Kolmogorov complexity of x given a representation c of the concept to be learned
• Regular languages are PACS learnable with positive examples only
• ... but Kolmogorov complexity is not computable!
Cognitive relevance of learning paradigms

A largely unsolved question.

Learning paradigms seem irrelevant to model human learning:
• Gold's identification in the limit framework has been criticized, as children seem to learn natural language without negative examples
• All learning models assume a known representation class
• Some learnability results are based on enumeration

However, learning models show that:
• an oracle can help
• some examples are useless, others are good: characteristic samples ⇔ typical examples
• learning well is learning efficiently
• example frequency matters
• good examples are simple examples ⇔ cognitive economy


Regular Inference from Positive and Negative Data

Additional hypothesis: the underlying theory is a regular grammar or, equivalently, a finite state automaton.

Property 1. Any regular language has a canonical automaton A(L) which is deterministic and minimal (minimal DFA).

Example: L = (ba*a)*

[Figure: the canonical three-state DFA A(L) over states 0, 1, 2]
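As a concrete companion to the example (a sketch, not from the slides; the state numbering is my own reading of the three-state figure), the canonical DFA of L = (ba*a)* can be written as a transition table and used for membership tests.

# Canonical (minimal) DFA for L = (ba*a)*; undefined transitions lead to an implicit dead state.
DELTA = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 2, (2, "b"): 1}
ACCEPTING = {0, 2}          # state 0 is also the initial state

def accepts(word: str) -> bool:
    """Run the DFA on `word`; an undefined transition means rejection."""
    state = 0
    for symbol in word:
        if (state, symbol) not in DELTA:
            return False
        state = DELTA[(state, symbol)]
    return state in ACCEPTING

assert accepts("") and accepts("ba") and accepts("baaaba")
assert not accepts("b") and not accepts("ab")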
A few definitions

Definition 1. A positive sample S+ is structurally complete with respect to an automaton A if, when generating S+ from A:
• every transition of A is used at least once
• every final state is used as the accepting state of at least one string

Example: {ba, baa, baba, λ}


A theorem

The positive data can be represented by a prefix tree acceptor (PTA).

Example: {aa, abba, baa}

[Figure: the PTA of {aa, abba, baa}, a tree-shaped automaton with states 0 to 8, one per prefix of the sample]

Theorem 3. If the positive sample is structurally complete with respect to a canonical automaton A(L0), then there exists a partition π of the state set of the PTA such that PTA/π = A(L0).
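A minimal Python sketch (not from the slides) of building a prefix tree acceptor: one state per prefix of the positive sample, numbered here in creation order rather than in the standard order used later by RPNI.

def build_pta(positive_sample):
    """Prefix tree acceptor: a tree-shaped DFA accepting exactly the positive sample."""
    delta = {}               # (state, symbol) -> state
    accepting = set()
    n_states = 1             # state 0 is the root, i.e. the empty prefix
    for word in positive_sample:
        state = 0
        for symbol in word:
            if (state, symbol) not in delta:
                delta[(state, symbol)] = n_states
                n_states += 1
            state = delta[(state, symbol)]
        accepting.add(state)
    return delta, accepting, n_states

delta, accepting, n = build_pta(["aa", "abba", "baa"])
print(n)                     # 9 states (0 to 8), as in the PTA of the example above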
Merging is fun

[Figure: a small three-state automaton and the quotient automata obtained by merging the state pairs {0,1}, {1,2}, {0,2} and all three states {0,1,2}]

• Merging ⇔ definition of a partition π on the set of states. Example: {{0,1}, {2}}
• If A2 = A1/π then L(A1) ⊆ L(A2): merging states ⇔ generalizing the language

How are we going to find the right partition? Use negative data!
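A sketch of the merging operation itself (using the automaton representation of the PTA sketch above; the helper names are my own): deriving the quotient automaton A/π for a partition π. Note that the result may be non-deterministic, which is why RPNI later merges further states for determinism.

def quotient(delta, accepting, partition):
    """Quotient automaton A/pi: each block of `partition` becomes a single state."""
    block_of = {state: min(block) for block in partition for state in block}
    new_delta = {}                       # (block, symbol) -> set of blocks (possibly non-deterministic)
    for (state, symbol), target in delta.items():
        new_delta.setdefault((block_of[state], symbol), set()).add(block_of[target])
    return new_delta, {block_of[q] for q in accepting}

# Merging states 0 and 1 of an automaton accepting {ab} yields an automaton accepting a*b:
# the language can only grow, never shrink.
print(quotient({(0, "a"): 1, (1, "b"): 2}, {2}, [{0, 1}, {2}]))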
Summary

[Figure: A(L0) --Generation--> Data --> PTA --Induction--> Grammar PTA/π; which partition π?]

• We observe some positive and negative data
• The positive sample S+ comes from a regular language L0
• The positive sample is assumed to be structurally complete with respect to the canonical automaton A(L0) of the target language L0 (not an additional hypothesis, but a way to restrict the search to reasonable generalizations!)
• We build the prefix tree acceptor of S+; by construction, L(PTA) = S+
• Merging states ⇔ generalizing S+
• The negative sample S− helps to control over-generalization

Note: finding the minimal DFA consistent with S+, S− is NP-complete!
An automaton induction algorithm

Algorithm Automaton-Induction
  input: S+                              // positive sample
         S−                              // negative sample
  A ← PTA(S+)                            // prefix tree acceptor
  while (i, j) ← choose_states() do      // choose a state pair
    if compatible(i, j, S−) then         // check compatibility of merging i and j
      A ← A/π_ij
    end if
  end while
  return A


RPNI algorithm

RPNI is a particular instance of the "generalization as search" paradigm.
RPNI follows the prefix order in the PTA.

[Figure: the PTA of the positive sample, with states numbered in standard (prefix) order]

• Polynomial time complexity with respect to the sample size (S+, S−)
• RPNI identifies the class of regular languages in the limit
• A characteristic sample, i.e. a sample such that RPNI is guaranteed to produce the correct solution, has quadratic size with respect to |A(L0)|
• Additional heuristics exist to improve performance when such a sample is not provided
RPNI algorithm: pseudo-code

input: S+, S−
output: A, a DFA consistent with S+, S−
begin
  A ← PTA(S+)                            // N denotes the number of states of PTA(S+)
  π ← {{0}, {1}, ..., {N − 1}}           // one state per prefix, in standard order <
  for i = 1 to |π| − 1 do                // loop over partition subsets
    for j = 0 to i − 1 do                // loop over subsets of lower rank
      π' ← π \ {Bj, Bi} ∪ {Bi ∪ Bj}      // merge Bi and Bj
      A/π' ← derive(A, π')
      π'' ← determ_merging(A/π')         // merge further states for determinism
      if compatible(A/π'', S−) then      // deterministic parsing of S−
        π ← π''
        break                            // break the j loop
      end if
    end for
  end for
  return A/π
end


Search space characterization

• Conditions on the learning sample to guarantee the existence of a solution
• DFA and NFA in the lattice
• Characterization of the set of maximal generalizations ⇒ similar to the G set from Version Space
• Efficient incremental lattice construction is possible ⇒ RPNI2 algorithm
• Possible search by genetic optimization
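The following Python sketch implements the RPNI idea in a compact way (the representation, the union-find bookkeeping and the helper names are my own choices, not the original implementation): build the PTA, try to merge each state with every earlier state in order, fold away any non-determinism created by a merge, and keep the merge only if no negative string is accepted.

def build_pta(sample):
    """Prefix tree acceptor: one state per prefix of the positive sample."""
    delta, accepting, n = {}, set(), 1
    for word in sample:
        q = 0
        for a in word:
            if (q, a) not in delta:
                delta[(q, a)] = n
                n += 1
            q = delta[(q, a)]
        accepting.add(q)
    return delta, accepting, n

def rpni(positive, negative):
    """Sketch of RPNI: merge PTA states in prefix order whenever the resulting
    (determinized) automaton still rejects every negative string."""
    delta, accepting, n = build_pta(positive)

    def find(parent, q):
        while parent[q] != q:
            q = parent[q]
        return q

    def try_merge(parent, i, j):
        """Merge the blocks of i and j, then keep merging to restore determinism."""
        parent = dict(parent)
        stack = [(i, j)]
        while stack:
            a, b = stack.pop()
            ra, rb = find(parent, a), find(parent, b)
            if ra == rb:
                continue
            parent[max(ra, rb)] = min(ra, rb)
            seen = {}                                 # (block, symbol) -> target block
            for (q, s), t in delta.items():
                key, target = (find(parent, q), s), find(parent, t)
                if key in seen and seen[key] != target:
                    stack.append((seen[key], target)) # fold conflicting targets
                seen[key] = target
        return parent

    def consistent(parent):
        dfa = {(find(parent, q), s): find(parent, t) for (q, s), t in delta.items()}
        acc = {find(parent, q) for q in accepting}
        for word in negative:
            q = find(parent, 0)
            for s in word:
                q = dfa.get((q, s))
                if q is None:
                    break
            if q is not None and q in acc:
                return False                          # a negative string is accepted
        return True

    parent = {q: q for q in range(n)}
    for i in range(1, n):
        if find(parent, i) != i:                      # i was already merged into an earlier state
            continue
        for j in range(i):
            if find(parent, j) != j:
                continue
            candidate = try_merge(parent, i, j)
            if consistent(candidate):
                parent = candidate
                break
    dfa = {(find(parent, q), s): find(parent, t) for (q, s), t in delta.items()}
    return dfa, {find(parent, q) for q in accepting}

print(rpni(["a", "aa", "aaa"], ["b", "ab"]))          # -> ({(0, 'a'): 0}, {0}), i.e. the DFA for a*

The final call shows the expected generalization: from the positive sample {a, aa, aaa} and the negative sample {b, ab}, all PTA states collapse into a single accepting state with an a-loop, i.e. the language a*.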
An execution step of RPNI

[Figure: the sequence of automata produced by successive merges on the PTA — merge 5 and 2, merge 8 and 4, merge 7 and 4, merge 9 and 4, merge 10 and 6 — ending with a compact DFA]
Probabilistic DFA

[Figure: a two-state probabilistic DFA with transition and end-of-string probabilities; following the path for the string ab gives P(ab) = 0.6 × 0.7 × 0.3]

• A structural and probabilistic model ⇒ an explicit and noise-tolerant theory
• A combined inductive learning and statistical estimation problem
• Learning from positive examples only and from frequency information
• Outside the scope of the previous learning paradigms


A probabilistic automaton induction algorithm

Algorithm Probabilistic-Automaton-Induction
  input: S+                              // positive sample
         α                               // precision parameter
  A ← PPTA(S+)                           // probabilistic PTA
  while (i, j) ← choose_states() do      // choose a pair of states
    if compatible(i, j, α) then          // check compatibility of merging i and j
      A ← A/π_ij
    end if
  end while
  return A
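A sketch of how a probabilistic DFA assigns a probability to a string: the product of the transition probabilities along its unique path, times the end-of-string probability of the last state. The two-state automaton below is a hypothetical example chosen so that P(ab) = 0.6 × 0.7 × 0.3 as on the slide; it is not necessarily the automaton of the figure.

TRANSITIONS = {                 # (state, symbol) -> (next state, probability)
    (0, "a"): (1, 0.6),
    (0, "b"): (0, 0.4),
    (1, "b"): (1, 0.7),
}
END = {0: 0.0, 1: 0.3}          # end-of-string probability of each state
# Note: the outgoing probabilities of each state sum to 1 (0.6 + 0.4 + 0.0 and 0.7 + 0.3).

def probability(word: str) -> float:
    """Probability of `word` under the probabilistic DFA above."""
    state, p = 0, 1.0
    for symbol in word:
        if (state, symbol) not in TRANSITIONS:
            return 0.0
        state, transition_prob = TRANSITIONS[(state, symbol)]
        p *= transition_prob
    return p * END[state]

print(probability("ab"))        # 0.6 * 0.7 * 0.3 = 0.126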
Probabilistic prefix tree acceptor (PPTA)

[Figure: the PTA of the positive sample annotated with relative frequencies (1/3, 1/2, 2/3, 1, ...) on its transitions and final states, and the automaton obtained after merging states 1 and 3]


Compatibility criterion

ALERGIA, RLIPS
• Two states are compatible (can be merged) if their suffix distributions are close enough

MDI
• Two states are compatible if the prior probability gain of the merged model compensates for the likelihood loss of the data
  ⇒ Bayesian learning (not strictly, in this case)
  ⇒ based on the Kullback-Leibler divergence
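A sketch (my own helper, not from the slides) of turning the PTA into a PPTA: annotate each transition and each final state with its relative frequency in the positive sample, which is the maximum likelihood estimate.

from collections import Counter

def build_ppta(sample):
    """Probabilistic prefix tree acceptor: PTA plus relative-frequency probabilities."""
    delta, n = {}, 1
    visits, transitions, ends = Counter(), Counter(), Counter()
    for word in sample:
        state = 0
        visits[state] += 1
        for symbol in word:
            if (state, symbol) not in delta:
                delta[(state, symbol)] = n
                n += 1
            transitions[(state, symbol)] += 1
            state = delta[(state, symbol)]
            visits[state] += 1
        ends[state] += 1
    transition_prob = {key: count / visits[key[0]] for key, count in transitions.items()}
    end_prob = {q: ends[q] / visits[q] for q in range(n)}
    return delta, transition_prob, end_prob

delta, tp, ep = build_ppta(["aa", "abba", "baa"])
print(tp[(0, "a")])             # 2/3: two of the three strings start with a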
Kullback-Leibler divergence

D(P_A0 || P_A1) = D(A0 || A1)   (notation)
                = Σ_{x∈Σ*} P_A0(x) log [ P_A0(x) / P_A1(x) ]
                = − Σ_{x∈Σ*} P_A0(x) log P_A1(x) − H(A0)

Likelihood of x given model A1: P_A1(x) = P(x|A1)

Cross entropy between A0 and A1: − Σ_{x∈Σ*} P_A0(x) log P_A1(x)

When A0 is a maximum likelihood estimate, e.g. the PPTA, the cross entropy measures the likelihood loss while going from A0 to A1.


ALERGIA

RPNI state merging order.

[Figure: two candidate states q1 and q2 with their outgoing a and b transitions]

Compatibility measure:
• | C(q1,a)/C(q1) − C(q2,a)/C(q2) | < sqrt( (1/2) ln(2/αA) ) · ( 1/sqrt(C(q1)) + 1/sqrt(C(q2)) ),  ∀a ∈ Σ ∪ {#}
• δ(q1, a) and δ(q2, a) are αA-compatible, ∀a ∈ Σ

Remarks:
• It is a recursive measure of suffix proximity
• This measure does not depend on the prefixes of q1 and q2 ⇒ a local criterion
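A sketch of the ALERGIA compatibility test above (assuming the counts C(q) and C(q, a) are available, e.g. from the PPTA sketch, with "#" standing for end of string); the recursive part of the criterion, requiring that the successors δ(q1, a) and δ(q2, a) are compatible as well, is omitted for brevity.

from math import log, sqrt

def frequencies_differ(f1, n1, f2, n2, alpha):
    """Hoeffding-style test used by ALERGIA: are f1/n1 and f2/n2 significantly different?"""
    bound = sqrt(0.5 * log(2.0 / alpha)) * (1.0 / sqrt(n1) + 1.0 / sqrt(n2))
    return abs(f1 / n1 - f2 / n2) > bound

def locally_compatible(q1, q2, counts, visits, alphabet, alpha):
    """Local part of the ALERGIA test: compare the outgoing frequencies of q1 and q2
    for every symbol and for the end-of-string marker '#'."""
    for a in list(alphabet) + ["#"]:
        if frequencies_differ(counts.get((q1, a), 0), visits[q1],
                              counts.get((q2, a), 0), visits[q2], alpha):
            return False
    return True

# 5/10 and 48/100 are not significantly different at this sample size and precision:
print(locally_compatible(1, 3, {(1, "a"): 5, (3, "a"): 48}, {1: 10, 3: 100}, {"a", "b"}, 0.05))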
Bayesian learning

Find a model M̂ which maximizes the likelihood of the data P(X|M) and the prior probability of the model P(M):

  M̂ = argmax_M P(X|M) · P(M)

• The PPTA maximizes the data likelihood
• A smaller model (in number of states) is a priori assumed more likely


MDI algorithm

RPNI state merging order.

Compatibility measure: a small divergence increase (= small likelihood loss) relative to the size reduction (= prior probability increase) ⇒ a global criterion

  ∆(A1, A2) / (|A1| − |A2|) < αM

Efficient computation of the divergence increase:

  D(A0 || A2) = D(A0 || A1) + ∆(A1, A2)

  ∆(A1, A2) = Σ_{qi ∈ Q012} Σ_{a ∈ Σ∪{#}} ci γ0(qi, a) log [ γ1(qi, a) / γ2(qi, a) ]

where Q012 = {qi ∈ Q0 | Bπ01(qi) ≠ Bπ02(qi)} denotes the set of states of A0 which have been merged to get A2 from A1.
Comparative results

[Figure: perplexity of the models induced by ALERGIA and MDI as a function of the training sample size (up to 12000 sentences)]

Perplexity measures the prediction power of the model: the smaller, the better.
Perplexity

P(x_i^j | q^i): probability of generating x_i^j, the i-th symbol of the j-th string, from state q^i

  LL = − (1/||S||) Σ_{j=1..|S|} Σ_{i=1..|x^j|} log P(x_i^j | q^i)

  PP = 2^LL

• PP = 1 ⇒ a perfectly predictive model
• PP = |Σ| ⇒ uniform random guessing over Σ


Natural language application: the ATIS task

Air travel information system, "spontaneous" American English:
"Uh, I'd like to go from, uh, Pittsburgh to Boston next Tuesday, no wait, Wednesday."

• Lexicon (alphabet): 1294 words
• Learning sample: 13044 sentences, 130773 words
• Validation set: 974 sentences, 10636 words
• Test set: 1001 sentences, 11703 words
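A sketch of the perplexity computation, assuming the per-symbol probabilities P(x_i^j | q^i) have already been produced by a model (e.g. by following a string through a probabilistic DFA); normalizing by the total number of predicted symbols is one reading of ||S|| here.

from math import log2

def perplexity(per_symbol_probs):
    """Perplexity of a model on a sample, given one list of per-symbol probabilities per string."""
    total_symbols = sum(len(probs) for probs in per_symbol_probs)   # ||S||, an assumption
    ll = -sum(log2(p) for probs in per_symbol_probs for p in probs) / total_symbols
    return 2 ** ll

# Uniform random guessing over an alphabet of size 4 gives PP = |Sigma| = 4
print(perplexity([[0.25, 0.25], [0.25]]))     # 4.0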
Equivalence between PNFA and HMM

Probabilistic non-deterministic automata (PNFA), with no end-of-string probabilities, are equivalent to Hidden Markov Models (HMMs).

[Figure: a two-state HMM with emission on transitions and the equivalent two-state PNFA]
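To make the link concrete, here is a sketch of computing string probabilities in a PNFA by summing over state paths, which is exactly the forward algorithm used for HMMs; the two-state automaton is a hypothetical example, not the one in the figure.

INITIAL = {0: 1.0}                   # initial state distribution
TRANSITIONS = {                      # (state, symbol) -> list of (next state, probability)
    (0, "a"): [(0, 0.2), (1, 0.3)],
    (0, "b"): [(0, 0.5)],
    (1, "a"): [(1, 0.4)],
    (1, "b"): [(0, 0.6)],
}

def forward(word):
    """Probability mass of generating `word` and ending in each state (no end-of-string probabilities)."""
    alpha = dict(INITIAL)
    for symbol in word:
        new_alpha = {}
        for state, mass in alpha.items():
            for next_state, p in TRANSITIONS.get((state, symbol), []):
                new_alpha[next_state] = new_alpha.get(next_state, 0.0) + mass * p
        alpha = new_alpha
    return alpha

print(forward("ab"))                 # {0: 0.28}: 0.2*0.5 (path 0->0->0) + 0.3*0.6 (path 0->1->0)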
[Figure: the same distribution represented as an HMM with emission on states, an HMM with emission on transitions, and a PNFA]


Links with Markov chains

A subclass of regular languages: the k-testable languages in the strict sense (k-TSS).

A k-TSS language is generated by an automaton such that all subsequences sharing the same last k − 1 symbols lead to the same state.

[Figure: a k-TSS automaton (k = 3) whose states are indexed by the most recent symbols: λ, a, b, aa, ab, ba, bb]

  p̂(a|bb) = C(bba) / C(bb)

• A probabilistic k-TSS language is equivalent to a Markov chain of order k − 1
• There exist probabilistic regular languages that are not reducible to Markov chains of any finite order
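A sketch of the Markov-chain view of probabilistic k-TSS languages with k = 3: the state is simply the last k − 1 = 2 symbols, and the conditional probabilities are estimated by relative frequencies, as in p̂(a|bb) = C(bba)/C(bb) above (end-of-string events are ignored here for brevity).

from collections import Counter

def ktss_estimator(sample, k=3):
    """Estimate p(symbol | last k-1 symbols) by relative frequencies of k-grams."""
    context_counts, ngram_counts = Counter(), Counter()
    for word in sample:
        for i in range(len(word) - k + 1):
            ngram_counts[word[i:i + k]] += 1
            context_counts[word[i:i + k - 1]] += 1
    def p(symbol, context):
        return ngram_counts[context + symbol] / context_counts[context]
    return p

p = ktss_estimator(["abba", "bba", "bbab"], k=3)
print(p("a", "bb"))       # C(bba)/C(bb) = 3/3 = 1.0 on this toy sample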
The smoothing problem

• A probabilistic DFA defines a probability distribution over a set of strings
• Some strings are not observed in the training sample but could be observed ⇒ their probability should be strictly positive
• The smoothing problem: how to assign a reasonable probability to (yet) unseen random events?
• Highly optimized smoothing techniques exist for Markov chains
• How can these techniques be adapted to more general probabilistic automata?


Related problems and approaches

I did not talk about:
• other induction problems (NFA, CFG, tree grammars, ...)
• heuristic approaches such as neural nets or genetic algorithms
• how to use prior knowledge
• smoothing techniques
• how to parse natural language without a grammar (decision trees)
• how to learn transducers
• benchmarks, applications
Ongoing and future work

• Definition of a theoretical framework for inductive and statistical learning
• Links with HMMs: parameter estimation, structural induction
• Smoothing techniques improvement ⇒ a key issue for practical applications
• Applications to probabilistic modeling of proteins
• Automatic translation
• Applications to text categorization or text mining