Grammar Induction
Inductive and statistical learning of formal grammars

Pierre Dupont
[email protected]


Machine Learning

Goal: to give a machine the ability to learn, i.e. to design programs whose performance improves over time.

Inductive learning is a particular instance of machine learning:
• Goal: to find a general law from examples
• A subproblem of theoretical computer science, artificial intelligence and pattern recognition
Outline

• Grammar induction definition
• Learning paradigms
• DFA learning from positive and negative examples
• RPNI algorithm
• Probabilistic DFA learning
• Application to a natural language task
• Links with Markov models
• Smoothing issues
• Related problems and future work


Grammar Induction or Grammatical Inference

Grammar induction is a particular case of inductive learning:
• The general law is represented by a formal grammar or an equivalent machine
• The set of examples, known as the positive sample, is usually made of strings or sequences over a specific alphabet
• A negative sample, i.e. a set of strings not belonging to the target language, can sometimes help the induction process

Example: Data {aaabbb, ab} --Induction--> Grammar S → aSb, S → λ
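As a side note (not part of the original slides), here is a minimal Python sketch of what the induced grammar S → aSb, S → λ accepts, namely the strings a^n b^n; it covers the observed data {aaabbb, ab} and generalizes beyond it.

def in_language(s: str) -> bool:
    """Membership test for the language of S -> aSb | lambda, i.e. a^n b^n."""
    half = len(s) // 2
    return s == "a" * half + "b" * half

# The induced grammar accepts the observed data...
assert in_language("aaabbb") and in_language("ab") and in_language("")
# ...and generalizes to unseen strings while rejecting others
assert in_language("aaaabbbb") and not in_language("aab")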
Examples

• A natural language sentence
• Speech
• A chronological series
• Successive moves during a chess game
• Successive actions of a WEB user
• A musical piece
• A program
• A form characterized by a chain code
• A biological sequence (DNA, proteins, ...)


Chromosome classification

[Figure: grey density and grey density derivative profiles of chromosome 2a along the median axis, with the centromere marked]

String of primitives:
"=====CDFDCBBBBBBBA==bcdc==DGFB=bccb== ...... ==cffc=CCC==cdb==BCB==dfdcb====="
Pattern Recognition

[Figure: a 2-D contour drawn on a grid and encoded with an 8-direction chain code (8dC)]

8dC: 000077766676666555545444443211000710112344543311001234454311


A modeling hypothesis

G0 --Generation--> Data --Induction--> Grammar G

• Find G as close as possible to G0
• The induction process does not prove the existence of G0: it is a modeling hypothesis
Identification in the limit

G0 --Generation--> Data d1, d2, ..., dn --Induction--> Grammars G1, G2, ..., G*

• convergence in finite time to G*
• G* is a representation of L(G0) (exact learning)
Learning paradigms

How to characterize learning?
• which concept classes can or cannot be learned?
• what is a good example?
• is it possible to learn in polynomial time?


PAC Learning

G0 --Generation--> Data d1, d2, ..., dn --Induction--> Grammars G1, G2, ..., G*

• convergence to G*
• G* is close enough to G0 with high probability ⇒ Probably Approximately Correct learning
• polynomial time complexity
Define a probability distribution D on the set of strings Σ≤n.

[Figure: the symmetric difference L(G*) ⊕ L(G0) as a region within Σ*]

  P[ P_D( L(G*) ⊕ L(G0) ) < ε ] > 1 − δ

• The same unknown distribution D is used to generate the sample and to measure the error
• The result must hold for any distribution D (distribution-free requirement)
• The algorithm must return a hypothesis in polynomial time with respect to 1/ε, 1/δ, n and |R(L)|


Other learnability results

• Identification in the limit in polynomial time
  – DFAs cannot be efficiently identified in the limit
  – unless we can ask equivalence and membership queries to an oracle
• PAC learning
  – DFAs are not PAC learnable (under some cryptographic limitation assumption)
  – unless we can ask membership queries to an oracle
Identification in the limit: good and bad news

The bad one...
Theorem 1. No superfinite class of languages is identifiable in the limit from positive data only.

The good one...
Theorem 2. Any admissible class of languages is identifiable in the limit from positive and negative data.


PAC learning with simple examples

• Simple examples are drawn according to the conditional Solomonoff-Levin distribution

    Pc(x) = λ_c 2^(−K(x|c))

  where K(x|c) denotes the Kolmogorov complexity of x given a representation c of the concept to be learned
• Regular languages are PACS learnable with positive examples only
• ... but Kolmogorov complexity is not computable!
Cognitive relevance of learning paradigms

A largely unsolved question.

Learning paradigms seem irrelevant to model human learning:
• Gold's identification in the limit framework has been criticized, as children seem to learn natural language without negative examples
• All learning models assume a known representation class
• Some learnability results are based on enumeration

However, learning models show that:
• an oracle can help
• some examples are useless, others are good: characteristic samples ⇔ typical examples
• learning well is learning efficiently
• example frequency matters
• good examples are simple examples ⇔ cognitive economy


Regular Inference from Positive and Negative Data

Additional hypothesis: the underlying theory is a regular grammar or, equivalently, a finite state automaton.

Property 1. Any regular language has a canonical automaton A(L) which is deterministic and minimal (minimal DFA).

Example: L = (ba*a)*

[Figure: the canonical three-state DFA A(L) over states 0, 1, 2]
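As a concrete companion to the example (a sketch, not from the slides; the state numbering is my own reading of the three-state figure), the canonical DFA of L = (ba*a)* can be written as a transition table and used for membership tests.

# Canonical (minimal) DFA for L = (ba*a)*; undefined transitions lead to an implicit dead state.
DELTA = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 2, (2, "b"): 1}
ACCEPTING = {0, 2}          # state 0 is also the initial state

def accepts(word: str) -> bool:
    """Run the DFA on `word`; an undefined transition means rejection."""
    state = 0
    for symbol in word:
        if (state, symbol) not in DELTA:
            return False
        state = DELTA[(state, symbol)]
    return state in ACCEPTING

assert accepts("") and accepts("ba") and accepts("baaaba")
assert not accepts("b") and not accepts("ab")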
A few definitions

Definition 1. A positive sample S+ is structurally complete with respect to an automaton A if, when generating S+ from A:
• every transition of A is used at least once
• every final state is used as the accepting state of at least one string

Example: {ba, baa, baba, λ}


A theorem

The positive data can be represented by a prefix tree acceptor (PTA).

Example: {aa, abba, baa}

[Figure: the PTA of {aa, abba, baa}, a tree-shaped automaton with states 0 to 8, one per prefix of the sample]

Theorem 3. If the positive sample is structurally complete with respect to a canonical automaton A(L0), then there exists a partition π of the state set of the PTA such that PTA/π = A(L0).
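A minimal Python sketch (not from the slides) of building a prefix tree acceptor: one state per prefix of the positive sample, numbered here in creation order rather than in the standard order used later by RPNI.

def build_pta(positive_sample):
    """Prefix tree acceptor: a tree-shaped DFA accepting exactly the positive sample."""
    delta = {}               # (state, symbol) -> state
    accepting = set()
    n_states = 1             # state 0 is the root, i.e. the empty prefix
    for word in positive_sample:
        state = 0
        for symbol in word:
            if (state, symbol) not in delta:
                delta[(state, symbol)] = n_states
                n_states += 1
            state = delta[(state, symbol)]
        accepting.add(state)
    return delta, accepting, n_states

delta, accepting, n = build_pta(["aa", "abba", "baa"])
print(n)                     # 9 states (0 to 8), as in the PTA of the example above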
Merging is fun

[Figure: a small three-state automaton and the quotient automata obtained by merging the state pairs {0,1}, {1,2}, {0,2} and all three states {0,1,2}]

• Merging ⇔ definition of a partition π on the set of states. Example: {{0,1}, {2}}
• If A2 = A1/π then L(A1) ⊆ L(A2): merging states ⇔ generalizing the language

How are we going to find the right partition? Use negative data!
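A sketch of the merging operation itself (using the automaton representation of the PTA sketch above; the helper names are my own): deriving the quotient automaton A/π for a partition π. Note that the result may be non-deterministic, which is why RPNI later merges further states for determinism.

def quotient(delta, accepting, partition):
    """Quotient automaton A/pi: each block of `partition` becomes a single state."""
    block_of = {state: min(block) for block in partition for state in block}
    new_delta = {}                       # (block, symbol) -> set of blocks (possibly non-deterministic)
    for (state, symbol), target in delta.items():
        new_delta.setdefault((block_of[state], symbol), set()).add(block_of[target])
    return new_delta, {block_of[q] for q in accepting}

# Merging states 0 and 1 of an automaton accepting {ab} yields an automaton accepting a*b:
# the language can only grow, never shrink.
print(quotient({(0, "a"): 1, (1, "b"): 2}, {2}, [{0, 1}, {2}]))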
Summary

[Figure: A(L0) --Generation--> Data --> PTA --Induction--> Grammar PTA/π; which partition π?]

• We observe some positive and negative data
• The positive sample S+ comes from a regular language L0
• The positive sample is assumed to be structurally complete with respect to the canonical automaton A(L0) of the target language L0 (not an additional hypothesis, but a way to restrict the search to reasonable generalizations!)
• We build the prefix tree acceptor of S+; by construction, L(PTA) = S+
• Merging states ⇔ generalizing S+
• The negative sample S− helps to control over-generalization

Note: finding the minimal DFA consistent with S+, S− is NP-complete!
An automaton induction algorithm

Algorithm Automaton-Induction
  input: S+                              // positive sample
         S−                              // negative sample
  A ← PTA(S+)                            // prefix tree acceptor
  while (i, j) ← choose_states() do      // choose a state pair
    if compatible(i, j, S−) then         // check compatibility of merging i and j
      A ← A/π_ij
    end if
  end while
  return A


RPNI algorithm

RPNI is a particular instance of the "generalization as search" paradigm.
RPNI follows the prefix order in the PTA.

[Figure: the PTA of the positive sample, with states numbered in standard (prefix) order]

• Polynomial time complexity with respect to the sample size (S+, S−)
• RPNI identifies the class of regular languages in the limit
• A characteristic sample, i.e. a sample such that RPNI is guaranteed to produce the correct solution, has quadratic size with respect to |A(L0)|
• Additional heuristics exist to improve performance when such a sample is not provided
RPNI algorithm: pseudo-code

input: S+, S−
output: A, a DFA consistent with S+, S−
begin
  A ← PTA(S+)                            // N denotes the number of states of PTA(S+)
  π ← {{0}, {1}, ..., {N − 1}}           // one state per prefix, in standard order <
  for i = 1 to |π| − 1 do                // loop over partition subsets
    for j = 0 to i − 1 do                // loop over subsets of lower rank
      π' ← π \ {Bj, Bi} ∪ {Bi ∪ Bj}      // merge Bi and Bj
      A/π' ← derive(A, π')
      π'' ← determ_merging(A/π')         // merge further states for determinism
      if compatible(A/π'', S−) then      // deterministic parsing of S−
        π ← π''
        break                            // break the j loop
      end if
    end for
  end for
  return A/π
end


Search space characterization

• Conditions on the learning sample to guarantee the existence of a solution
• DFA and NFA in the lattice
• Characterization of the set of maximal generalizations ⇒ similar to the G set from Version Space
• Efficient incremental lattice construction is possible ⇒ RPNI2 algorithm
• Possible search by genetic optimization
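The following Python sketch implements the RPNI idea in a compact way (the representation, the union-find bookkeeping and the helper names are my own choices, not the original implementation): build the PTA, try to merge each state with every earlier state in order, fold away any non-determinism created by a merge, and keep the merge only if no negative string is accepted.

def build_pta(sample):
    """Prefix tree acceptor: one state per prefix of the positive sample."""
    delta, accepting, n = {}, set(), 1
    for word in sample:
        q = 0
        for a in word:
            if (q, a) not in delta:
                delta[(q, a)] = n
                n += 1
            q = delta[(q, a)]
        accepting.add(q)
    return delta, accepting, n

def rpni(positive, negative):
    """Sketch of RPNI: merge PTA states in prefix order whenever the resulting
    (determinized) automaton still rejects every negative string."""
    delta, accepting, n = build_pta(positive)

    def find(parent, q):
        while parent[q] != q:
            q = parent[q]
        return q

    def try_merge(parent, i, j):
        """Merge the blocks of i and j, then keep merging to restore determinism."""
        parent = dict(parent)
        stack = [(i, j)]
        while stack:
            a, b = stack.pop()
            ra, rb = find(parent, a), find(parent, b)
            if ra == rb:
                continue
            parent[max(ra, rb)] = min(ra, rb)
            seen = {}                                 # (block, symbol) -> target block
            for (q, s), t in delta.items():
                key, target = (find(parent, q), s), find(parent, t)
                if key in seen and seen[key] != target:
                    stack.append((seen[key], target)) # fold conflicting targets
                seen[key] = target
        return parent

    def consistent(parent):
        dfa = {(find(parent, q), s): find(parent, t) for (q, s), t in delta.items()}
        acc = {find(parent, q) for q in accepting}
        for word in negative:
            q = find(parent, 0)
            for s in word:
                q = dfa.get((q, s))
                if q is None:
                    break
            if q is not None and q in acc:
                return False                          # a negative string is accepted
        return True

    parent = {q: q for q in range(n)}
    for i in range(1, n):
        if find(parent, i) != i:                      # i was already merged into an earlier state
            continue
        for j in range(i):
            if find(parent, j) != j:
                continue
            candidate = try_merge(parent, i, j)
            if consistent(candidate):
                parent = candidate
                break
    dfa = {(find(parent, q), s): find(parent, t) for (q, s), t in delta.items()}
    return dfa, {find(parent, q) for q in accepting}

print(rpni(["a", "aa", "aaa"], ["b", "ab"]))          # -> ({(0, 'a'): 0}, {0}), i.e. the DFA for a*

The final call shows the expected generalization: from the positive sample {a, aa, aaa} and the negative sample {b, ab}, all PTA states collapse into a single accepting state with an a-loop, i.e. the language a*.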
An execution step of RPNI

[Figure: the sequence of automata produced by successive merges on the PTA — merge 5 and 2, merge 8 and 4, merge 7 and 4, merge 9 and 4, merge 10 and 6 — ending with a compact DFA]
Probabilistic DFA

[Figure: a two-state probabilistic DFA with transition and end-of-string probabilities; following the path for the string ab gives P(ab) = 0.6 × 0.7 × 0.3]

• A structural and probabilistic model ⇒ an explicit and noise-tolerant theory
• A combined inductive learning and statistical estimation problem
• Learning from positive examples only and from frequency information
• Outside the scope of the previous learning paradigms


A probabilistic automaton induction algorithm

Algorithm Probabilistic-Automaton-Induction
  input: S+                              // positive sample
         α                               // precision parameter
  A ← PPTA(S+)                           // probabilistic PTA
  while (i, j) ← choose_states() do      // choose a pair of states
    if compatible(i, j, α) then          // check compatibility of merging i and j
      A ← A/π_ij
    end if
  end while
  return A
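A sketch of how a probabilistic DFA assigns a probability to a string: the product of the transition probabilities along its unique path, times the end-of-string probability of the last state. The two-state automaton below is a hypothetical example chosen so that P(ab) = 0.6 × 0.7 × 0.3 as on the slide; it is not necessarily the automaton of the figure.

TRANSITIONS = {                 # (state, symbol) -> (next state, probability)
    (0, "a"): (1, 0.6),
    (0, "b"): (0, 0.4),
    (1, "b"): (1, 0.7),
}
END = {0: 0.0, 1: 0.3}          # end-of-string probability of each state
# Note: the outgoing probabilities of each state sum to 1 (0.6 + 0.4 + 0.0 and 0.7 + 0.3).

def probability(word: str) -> float:
    """Probability of `word` under the probabilistic DFA above."""
    state, p = 0, 1.0
    for symbol in word:
        if (state, symbol) not in TRANSITIONS:
            return 0.0
        state, transition_prob = TRANSITIONS[(state, symbol)]
        p *= transition_prob
    return p * END[state]

print(probability("ab"))        # 0.6 * 0.7 * 0.3 = 0.126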
Probabilistic prefix tree acceptor (PPTA)

[Figure: the PTA of the positive sample annotated with relative frequencies (1/3, 1/2, 2/3, 1, ...) on its transitions and final states, and the automaton obtained after merging states 1 and 3]


Compatibility criterion

ALERGIA, RLIPS
• Two states are compatible (can be merged) if their suffix distributions are close enough

MDI
• Two states are compatible if the prior probability gain of the merged model compensates for the likelihood loss of the data
  ⇒ Bayesian learning (not strictly, in this case)
  ⇒ based on the Kullback-Leibler divergence
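A sketch (my own helper, not from the slides) of turning the PTA into a PPTA: annotate each transition and each final state with its relative frequency in the positive sample, which is the maximum likelihood estimate.

from collections import Counter

def build_ppta(sample):
    """Probabilistic prefix tree acceptor: PTA plus relative-frequency probabilities."""
    delta, n = {}, 1
    visits, transitions, ends = Counter(), Counter(), Counter()
    for word in sample:
        state = 0
        visits[state] += 1
        for symbol in word:
            if (state, symbol) not in delta:
                delta[(state, symbol)] = n
                n += 1
            transitions[(state, symbol)] += 1
            state = delta[(state, symbol)]
            visits[state] += 1
        ends[state] += 1
    transition_prob = {key: count / visits[key[0]] for key, count in transitions.items()}
    end_prob = {q: ends[q] / visits[q] for q in range(n)}
    return delta, transition_prob, end_prob

delta, tp, ep = build_ppta(["aa", "abba", "baa"])
print(tp[(0, "a")])             # 2/3: two of the three strings start with a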
Kullback-Leibler divergence

D(P_A0 || P_A1) = D(A0 || A1)   (notation)
                = Σ_{x∈Σ*} P_A0(x) log [ P_A0(x) / P_A1(x) ]
                = − Σ_{x∈Σ*} P_A0(x) log P_A1(x) − H(A0)

Likelihood of x given model A1: P_A1(x) = P(x|A1)

Cross entropy between A0 and A1: − Σ_{x∈Σ*} P_A0(x) log P_A1(x)

When A0 is a maximum likelihood estimate, e.g. the PPTA, the cross entropy measures the likelihood loss while going from A0 to A1.


ALERGIA

RPNI state merging order.

[Figure: two candidate states q1 and q2 with their outgoing a and b transitions]

Compatibility measure:
• | C(q1,a)/C(q1) − C(q2,a)/C(q2) | < sqrt( (1/2) ln(2/αA) ) · ( 1/sqrt(C(q1)) + 1/sqrt(C(q2)) ),  ∀a ∈ Σ ∪ {#}
• δ(q1, a) and δ(q2, a) are αA-compatible, ∀a ∈ Σ

Remarks:
• It is a recursive measure of suffix proximity
• This measure does not depend on the prefixes of q1 and q2 ⇒ a local criterion
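A sketch of the ALERGIA compatibility test above (assuming the counts C(q) and C(q, a) are available, e.g. from the PPTA sketch, with "#" standing for end of string); the recursive part of the criterion, requiring that the successors δ(q1, a) and δ(q2, a) are compatible as well, is omitted for brevity.

from math import log, sqrt

def frequencies_differ(f1, n1, f2, n2, alpha):
    """Hoeffding-style test used by ALERGIA: are f1/n1 and f2/n2 significantly different?"""
    bound = sqrt(0.5 * log(2.0 / alpha)) * (1.0 / sqrt(n1) + 1.0 / sqrt(n2))
    return abs(f1 / n1 - f2 / n2) > bound

def locally_compatible(q1, q2, counts, visits, alphabet, alpha):
    """Local part of the ALERGIA test: compare the outgoing frequencies of q1 and q2
    for every symbol and for the end-of-string marker '#'."""
    for a in list(alphabet) + ["#"]:
        if frequencies_differ(counts.get((q1, a), 0), visits[q1],
                              counts.get((q2, a), 0), visits[q2], alpha):
            return False
    return True

# 5/10 and 48/100 are not significantly different at this sample size and precision:
print(locally_compatible(1, 3, {(1, "a"): 5, (3, "a"): 48}, {1: 10, 3: 100}, {"a", "b"}, 0.05))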
Bayesian learning

Find a model M̂ which maximizes the likelihood of the data P(X|M) and the prior probability of the model P(M):

  M̂ = argmax_M P(X|M) · P(M)

• The PPTA maximizes the data likelihood
• A smaller model (in number of states) is a priori assumed more likely


MDI algorithm

RPNI state merging order.

Compatibility measure: a small divergence increase (= small likelihood loss) relative to the size reduction (= prior probability increase) ⇒ a global criterion

  ∆(A1, A2) / (|A1| − |A2|) < αM

Efficient computation of the divergence increase:

  D(A0 || A2) = D(A0 || A1) + ∆(A1, A2)

  ∆(A1, A2) = Σ_{qi ∈ Q012} Σ_{a ∈ Σ∪{#}} ci γ0(qi, a) log [ γ1(qi, a) / γ2(qi, a) ]

where Q012 = {qi ∈ Q0 | Bπ01(qi) ≠ Bπ02(qi)} denotes the set of states of A0 which have been merged to get A2 from A1.
Comparative results

[Figure: perplexity of the models induced by ALERGIA and MDI as a function of the training sample size (up to 12000 sentences)]

Perplexity measures the prediction power of the model: the smaller, the better.
Perplexity

P(x_i^j | q^i): probability of generating x_i^j, the i-th symbol of the j-th string, from state q^i

  LL = − (1/||S||) Σ_{j=1..|S|} Σ_{i=1..|x^j|} log P(x_i^j | q^i)

  PP = 2^LL

• PP = 1 ⇒ a perfectly predictive model
• PP = |Σ| ⇒ uniform random guessing over Σ


Natural language application: the ATIS task

Air travel information system, "spontaneous" American English:
"Uh, I'd like to go from, uh, Pittsburgh to Boston next Tuesday, no wait, Wednesday."

• Lexicon (alphabet): 1294 words
• Learning sample: 13044 sentences, 130773 words
• Validation set: 974 sentences, 10636 words
• Test set: 1001 sentences, 11703 words
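A sketch of the perplexity computation, assuming the per-symbol probabilities P(x_i^j | q^i) have already been produced by a model (e.g. by following a string through a probabilistic DFA); normalizing by the total number of predicted symbols is one reading of ||S|| here.

from math import log2

def perplexity(per_symbol_probs):
    """Perplexity of a model on a sample, given one list of per-symbol probabilities per string."""
    total_symbols = sum(len(probs) for probs in per_symbol_probs)   # ||S||, an assumption
    ll = -sum(log2(p) for probs in per_symbol_probs for p in probs) / total_symbols
    return 2 ** ll

# Uniform random guessing over an alphabet of size 4 gives PP = |Sigma| = 4
print(perplexity([[0.25, 0.25], [0.25]]))     # 4.0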
Equivalence between PNFA and HMM

Probabilistic non-deterministic automata (PNFA), with no end-of-string probabilities, are equivalent to Hidden Markov Models (HMMs).

[Figure: a two-state HMM with emission on transitions and the equivalent two-state PNFA]
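To make the link concrete, here is a sketch of computing string probabilities in a PNFA by summing over state paths, which is exactly the forward algorithm used for HMMs; the two-state automaton is a hypothetical example, not the one in the figure.

INITIAL = {0: 1.0}                   # initial state distribution
TRANSITIONS = {                      # (state, symbol) -> list of (next state, probability)
    (0, "a"): [(0, 0.2), (1, 0.3)],
    (0, "b"): [(0, 0.5)],
    (1, "a"): [(1, 0.4)],
    (1, "b"): [(0, 0.6)],
}

def forward(word):
    """Probability mass of generating `word` and ending in each state (no end-of-string probabilities)."""
    alpha = dict(INITIAL)
    for symbol in word:
        new_alpha = {}
        for state, mass in alpha.items():
            for next_state, p in TRANSITIONS.get((state, symbol), []):
                new_alpha[next_state] = new_alpha.get(next_state, 0.0) + mass * p
        alpha = new_alpha
    return alpha

print(forward("ab"))                 # {0: 0.28}: 0.2*0.5 (path 0->0->0) + 0.3*0.6 (path 0->1->0)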
[Figure: the same distribution represented as an HMM with emission on states, an HMM with emission on transitions, and a PNFA]


Links with Markov chains

A subclass of regular languages: the k-testable languages in the strict sense (k-TSS).

A k-TSS language is generated by an automaton such that all subsequences sharing the same last k − 1 symbols lead to the same state.

[Figure: a k-TSS automaton (k = 3) whose states are indexed by the most recent symbols: λ, a, b, aa, ab, ba, bb]

  p̂(a|bb) = C(bba) / C(bb)

• A probabilistic k-TSS language is equivalent to a Markov chain of order k − 1
• There exist probabilistic regular languages that are not reducible to Markov chains of any finite order
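A sketch of the Markov-chain view of probabilistic k-TSS languages with k = 3: the state is simply the last k − 1 = 2 symbols, and the conditional probabilities are estimated by relative frequencies, as in p̂(a|bb) = C(bba)/C(bb) above (end-of-string events are ignored here for brevity).

from collections import Counter

def ktss_estimator(sample, k=3):
    """Estimate p(symbol | last k-1 symbols) by relative frequencies of k-grams."""
    context_counts, ngram_counts = Counter(), Counter()
    for word in sample:
        for i in range(len(word) - k + 1):
            ngram_counts[word[i:i + k]] += 1
            context_counts[word[i:i + k - 1]] += 1
    def p(symbol, context):
        return ngram_counts[context + symbol] / context_counts[context]
    return p

p = ktss_estimator(["abba", "bba", "bbab"], k=3)
print(p("a", "bb"))       # C(bba)/C(bb) = 3/3 = 1.0 on this toy sample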
The smoothing problem

• A probabilistic DFA defines a probability distribution over a set of strings
• Some strings are not observed in the training sample but could be observed ⇒ their probability should be strictly positive
• The smoothing problem: how to assign a reasonable probability to (yet) unseen random events?
• Highly optimized smoothing techniques exist for Markov chains
• How can these techniques be adapted to more general probabilistic automata?


Related problems and approaches

I did not talk about:
• other induction problems (NFA, CFG, tree grammars, ...)
• heuristic approaches such as neural nets or genetic algorithms
• how to use prior knowledge
• smoothing techniques
• how to parse natural language without a grammar (decision trees)
• how to learn transducers
• benchmarks, applications
Ongoing and future work

• Definition of a theoretical framework for inductive and statistical learning
• Links with HMMs: parameter estimation, structural induction
• Smoothing techniques improvement ⇒ a key issue for practical applications
• Applications to probabilistic modeling of proteins
• Automatic translation
• Applications to text categorization or text mining