Basics: Formal Language Theory
Doug Arnold, University of Essex
[email protected]
This lecture covers some basic ideas of formal language theory and computation.
1 Languages, Grammars, Derivations
For the purpose of discussion, a language is just a set of strings over some alphabet. In a linguistic context, it is natural to think of the strings as sentences, and the alphabet as the words of the language. [But many works on formal language theory use the term ‘word’ to refer to the strings, and take the alphabet to be characters.] A grammar G is a 4-tuple G = ⟨S, NT, T, R⟩ where:

• NT is the non-terminal vocabulary: a finite set;
• T is the terminal vocabulary: a finite set, with T ∩ NT = ∅;
• S ∈ NT is the start symbol;
• R is a finite set of productions, or rules, α → β, the precise form of which determines the class of grammar.
The way a grammar defines, or generates, a language can be viewed in terms of the notion of derivation. If A → α is a production, we say A (immediately) derives α (A ⇒ α). Suppose S ⇒ ABC, A ⇒ a, B ⇒ b, C ⇒ c; then we write S ⇒⁺ abc (S derives abc in one or more steps). Similarly, we use S ⇒* abc to mean S derives abc in zero or more steps:

• A ⇒ α    A derives α in one step
• A ⇒⁺ α   A derives α in one or more steps
• A ⇒* α   A derives α in zero or more steps

We write L(G) for the language generated by the grammar G. A string of terminal elements ω is in L(G) if S ⇒* ω, where S is the start symbol of G. If α is a string over T ∪ NT, and S ⇒* α, then α is a sentential form.

Example

(1)  S → A S B
     S → A B
     A → a
     B → b

This generates the language aⁿbⁿ, i.e. the set of strings consisting of some number of as followed by the same number of bs: {ab, aabb, aaabbb, . . . }
The following are some sentential forms:

(2)  S
     ASB
     AASBB
     AAABBB
     aAABBB
     aAABbB

At each step of a derivation, there are potentially two choices:

• which nonterminal to replace;
• which rule to use for this nonterminal.

In a leftmost derivation only the leftmost nonterminal is replaced at each step. In a rightmost derivation, only the rightmost nonterminal is replaced. From this one gets the notion of left and right sentential forms of the grammar.

[Diagrams omitted: the successive stages of a derivation of aaabbb, shown as trees.]
Thinking logically, one can see this as a deductive process, where there is a single axiom (the start symbol), and the productions are inference rules.
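To make the notion of derivation concrete, here is a minimal sketch (my own illustration, not part of the original notes) of a leftmost derivation using the grammar in (1); the rule representation and the function name are assumptions.

    import random

    # Grammar (1): each non-terminal maps to its possible right-hand sides.
    RULES = {'S': [['A', 'S', 'B'], ['A', 'B']],
             'A': [['a']],
             'B': [['b']]}

    def leftmost_derivation(start='S', max_steps=20):
        """Rewrite the leftmost non-terminal at each step; return the sentential forms."""
        form = [start]
        history = [form[:]]
        for _ in range(max_steps):
            nonterminals = [i for i, sym in enumerate(form) if sym in RULES]
            if not nonterminals:
                break                                   # only terminals left: a sentence
            i = nonterminals[0]                         # the leftmost non-terminal
            form[i:i + 1] = random.choice(RULES[form[i]])
            history.append(form[:])
        return history

    for form in leftmost_derivation():
        print(' '.join(form))
    # one possible output:
    #   S
    #   A S B
    #   a S B
    #   a A B B
    #   a a B B
    #   a a b B
    #   a a b b

Each printed line is a sentential form; the last one, containing only terminals, is a sentence of L(G).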
2 Trees
A parse tree provides a way of abstracting away from details of the derivation (e.g. whether it was leftmost or rightmost):

[Tree diagram for aaabbb; equivalently, the labelled bracketing [S [A a] [S [A a] [S [A a] [B b]] [B b]] [B b]].]

A tree is a collection of nodes, connected by branches. The root of the tree is written at the top. The as and bs are terminal items or leaves, and constitute the frontier of the tree. Nodes stand in relations of (immediate) dominance, and (immediate) precedence. We speak of mothers immediately dominating daughters. The daughters of a node are sisters. Various notions of command are also sometimes used: in general, a node commands its sisters and all their descendants (but not its own descendants).
It is possible for a grammar to assign more than one parse tree to a string, in which case we say the grammar is ambiguous.

(3)  S → NP VP
     VP → V NP (PP)
     NP → DET N (PP)

(4)  Sam saw the baby [PP with the telescope ]
3 Classes of Grammar
It is possible to classify grammars according to the precise form of the productions:

Type 0  no restrictions: arbitrary transformations are possible;

Type 1  productions are of the form φAψ → φαψ (or A → α / φ _ ψ), where A ∈ NT, and α is a string over T ∪ NT. Here φ and ψ provide the context for the production. Such productions are called context sensitive.

Type 2  productions are of the form A → α, i.e. there is no context: productions are called context free;

Type 3  productions are of the form A → a B or A → a, where a ∈ T and B ∈ NT; i.e. the right-hand-side consists of a terminal, optionally followed by a single non-terminal (alternatively, all productions must be of the form A → B a or A → a).

Notice that the classes form a hierarchy of increasing restrictiveness, so every type 3 production is also a type 2 production, and every type 2 production is also a type 1 production. A grammar is type N if all its productions are of type N. Thus, the grammar above is of type 2. Notice that G3 ⊆ G2 ⊆ G1 ⊆ G0.

Languages can be classified by the grammars required to generate them. A language is of type N if it can be generated by a grammar of type N. Notice that L3 ⊆ L2 ⊆ L1 ⊆ L0. Thus, corresponding to the grammars, we have the following hierarchy of languages:

type 0   the recursively enumerable sets;
type 1   the context sensitive languages;
type 2   the context free languages;
type 3   the regular languages/sets.
Exactly where in this hierarchy Natural Languages fall was a topic of intense research in the 1980s (answer: they are ‘mildly’ context sensitive – almost context free, see below).
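As a rough illustration (my own, not part of the original notes), the classification can be mechanized: the sketch below checks what form each production has, using an invented rule representation (LHS and RHS as tuples of symbols, non-terminals upper-case) and the non-contracting characterization of type 1 rather than the context format.

    # A minimal sketch: classify productions and grammars by type.
    def is_nonterminal(sym):
        return sym.isupper()

    def rule_type(lhs, rhs):
        """Return the most restrictive type (3, 2, 1 or 0) this rule satisfies."""
        if len(lhs) == 1 and is_nonterminal(lhs[0]):
            # Type 3: a terminal, optionally followed by one non-terminal.
            if (len(rhs) == 1 and not is_nonterminal(rhs[0])) or \
               (len(rhs) == 2 and not is_nonterminal(rhs[0]) and is_nonterminal(rhs[1])):
                return 3
            return 2                      # any other RHS with a single non-terminal LHS
        if len(rhs) >= len(lhs):
            return 1                      # non-contracting: context sensitive power
        return 0                          # otherwise unrestricted

    def grammar_type(rules):
        """A grammar is of type N if all its productions are of type N."""
        return min(rule_type(lhs, rhs) for lhs, rhs in rules)

    # The aⁿbⁿ grammar from example (1): context free but not regular.
    g1 = [(('S',), ('A', 'S', 'B')), (('S',), ('A', 'B')),
          (('A',), ('a',)), (('B',), ('b',))]
    # The regular grammar in (5) below: every rule is right-linear.
    g5 = [(('S',), ('a', 'S')), (('S',), ('a', 'B')),
          (('B',), ('b', 'B')), (('B',), ('b',))]
    print(grammar_type(g1))   # 2
    print(grammar_type(g5))   # 3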
4 Automata
A Finite State Automaton (FSA) is a (finite) collection of states and transitions; certain states are designated start and end states. There are various notations, but a transition diagram is very intuitive. The following generates the language aⁿbᵐ:
[Transition diagram omitted: states S (the start state) and E (a final state), with a and b transitions, generating aⁿbᵐ.]
The idea is that the automaton starts in a start state, and makes transitions, writing the symbol on the transition to the output stream. (If one thinks of them as acceptors rather than generators, then the symbol is read from the input stream.) FSAs such as the above are equivalent to regular grammars:

(5)  S → a S
     S → a B
     B → b B
     B → b

However, they are strictly limited. For example, they cannot recognize/generate aⁿbⁿ. To see why, consider what such an FSA would look like for a finite version of the language (e.g. where n ≤ 3):
[Transition diagram omitted: an FSA for aⁿbⁿ with n ≤ 3, using separate chains of states to count one, two, or three as and then the matching number of bs before reaching the final state E.]
If we want to allow strings with four as and bs, we would need to add another layer. If we want to be able to generate any of the infinite number of strings in aⁿbⁿ, we would need an infinite number of states (which is not allowed). It is possible to equip FSAs with additional memory; this gives classes of automata which relate exactly to the classes of grammar:

ATNs  Augmented Transition Networks: transitions may require arbitrary tests/actions; equivalent to type 0 grammars, and to Turing Machines (see below);

Linear Bounded Automata  equivalent to CSGs (type 1);

PDAs  Push Down Automata: equipped with an auxiliary ‘push down’ (stack) memory, which allows one automaton to temporarily transfer control to another and return; equivalent to CFGs (type 2);

FSAs  equivalent to type 3 (regular) grammars.

The following PDA will generate aⁿbⁿ:
[Diagram omitted: a push down automaton for aⁿbⁿ.]
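To make the contrast concrete, here is a small sketch (my own illustration, not part of the original notes) of both kinds of machine viewed as acceptors: an FSA that accepts aⁿbᵐ (n, m ≥ 1), and a PDA-style recognizer for aⁿbⁿ that uses a stack to match bs against as. State names and the transition-table format are assumptions.

    # A minimal FSA acceptor for aⁿbᵐ (n, m ≥ 1).
    # The transition table maps (state, symbol) -> next state.
    FSA_TRANSITIONS = {('S', 'a'): 'A',   # first a
                       ('A', 'a'): 'A',   # further as
                       ('A', 'b'): 'E',   # first b
                       ('E', 'b'): 'E'}   # further bs
    FSA_FINAL = {'E'}

    def fsa_accepts(string, start='S'):
        state = start
        for symbol in string:
            if (state, symbol) not in FSA_TRANSITIONS:
                return False               # no available transition: reject
            state = FSA_TRANSITIONS[(state, symbol)]
        return state in FSA_FINAL

    # A PDA-style acceptor for aⁿbⁿ (n ≥ 1): push a marker for each a,
    # pop one for each b, and accept only if the stack is exactly emptied.
    def pda_accepts(string):
        stack = []
        i = 0
        while i < len(string) and string[i] == 'a':
            stack.append('a')              # push for each a
            i += 1
        while i < len(string) and string[i] == 'b':
            if not stack:
                return False               # more bs than as
            stack.pop()                    # pop for each b
            i += 1
        return i == len(string) and not stack and len(string) > 0

    print(fsa_accepts('aaabb'))    # True  (aⁿbᵐ)
    print(pda_accepts('aaabbb'))   # True  (aⁿbⁿ)
    print(pda_accepts('aaabb'))    # False (unmatched a left on the stack)

The FSA has no way to remember how many as it has seen; the stack is precisely the extra memory that lets the second recognizer check the matching.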
5 Equivalence of Grammars

• Grammars are weakly equivalent if they generate the same string sets.
• Grammars are strongly equivalent if they generate the same strings with the same structural descriptions (e.g. parse trees).
6 Natural Languages
6.1 Natural Languages are not Finite State (‘regular’)
• There is no FSA (hence no type 3 grammar) that can generate aⁿbⁿ.
• Natural Languages are infinite, and have constructions like aⁿbⁿ, i.e. ‘nested dependencies’:

  (8)  a₁ a₂ a₃ . . . b₃ b₂ b₁

• An NL like English is infinite, because there is no longest sentence of English: any English sentence can be made longer by putting “I don’t believe that . . . ” in front of it; any English sentence containing “very” can be made longer by adding another “very” (“that was very nice” → “that was very very nice”).
• Constructions like the following show nested dependencies:

  (9)  If S₁ then S₂
  (10) Either S₁ or S₂
6.2 Are NLs Context Free?
This turns out to be a subtle question. Cases of agreement seem to suggest they are not:

(11) These babies are sleeping.
(12) This baby is sleeping.
(13) DET → these / _ babies
(14) DET → this / _ baby

But the matter is more complex. So long as there are only a finite number of features involved, a CFG is adequate:

(15) NPsing → DETsing Nsing
(16) NPpl → DETpl Npl
(17) S → NPsing VPsing
(18) S → NPpl VPpl
(19) VPsing → Vsing . . .
(20) VPpl → Vpl . . .

But a CF grammar like this would be very large and unrevealing: there is no such thing as an NP according to this grammar — just NPsings and NPpls. But the notion NP is useful in English, e.g. in rules like:

(21) PP → P NP
(22) VP → V NP
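As an illustration of why a finite feature system keeps things context free, and of how quickly the resulting grammar grows, here is a small sketch (my own; the ‘#’ schema notation and names are invented) that expands rule schemas mentioning an agreement feature into plain CFG rules, one per feature value.

    # A minimal sketch: expand rule schemas into plain context free rules.
    AGR_VALUES = ['sing', 'pl']

    # Symbols written with '#' share the same agreement value within a rule.
    SCHEMAS = [('S',   ['NP#', 'VP#']),
               ('NP#', ['DET#', 'N#']),
               ('VP#', ['V#'])]

    def expand(schemas, values):
        rules = []
        for lhs, rhs in schemas:
            for agr in values:
                inst = lambda sym: sym.replace('#', agr)   # instantiate the feature
                rules.append((inst(lhs), [inst(s) for s in rhs]))
        return rules

    for lhs, rhs in expand(SCHEMAS, AGR_VALUES):
        print(lhs, '→', ' '.join(rhs))
    # S → NPsing VPsing        S → NPpl VPpl
    # NPsing → DETsing Nsing   NPpl → DETpl Npl
    # VPsing → Vsing           VPpl → Vpl

With several features (number, person, gender, case, . . .) the number of expanded rules multiplies quickly: this is the ‘very large and unrevealing’ grammar referred to above, in which the generalization NP is lost.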
7 Normal Forms
It is possible to convert grammars whose productions are of various forms to ‘normal forms’, which may be easier to process. For example:

• making all productions at most binary branching;
• eliminating left recursive rules (problematic for top-down parsing algorithms);
• eliminating ε-productions (problematic for bottom-up parsing algorithms).

(The resulting grammars are only weakly equivalent to the originals, of course.)
7.1 Reduction to (at most) binary branching
A ternary rule such as S → A B C can be replaced by a pair of binary rules, where B-C is a new non-terminal:

1. S → A B-C
2. B-C → B C

A grammar where rules are of the form:

1. A → B C (RHS consisting of two non-terminals)
2. A → a (RHS consisting of a single terminal)

is said to be in ‘Chomsky Normal Form’ (CNF).
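Here is a minimal sketch of the binarization step (my own illustration; the rule representation and helper names are assumptions): any rule with more than two symbols on the right is split by introducing fresh non-terminals.

    # A minimal sketch of binarization: rules are (lhs, [rhs symbols]).
    def binarize(rules):
        new_rules = []
        for lhs, rhs in rules:
            while len(rhs) > 2:
                fresh = '-'.join(rhs[-2:])          # fresh non-terminal, e.g. 'B-C'
                new_rules.append((fresh, rhs[-2:])) # B-C → B C
                rhs = rhs[:-2] + [fresh]            # shorten the original RHS
            new_rules.append((lhs, rhs))
        return new_rules

    print(binarize([('S', ['A', 'B', 'C'])]))
    # [('B-C', ['B', 'C']), ('S', ['A', 'B-C'])]

Full conversion to CNF would additionally require replacing terminals that occur alongside other symbols, and eliminating ε- and unit productions.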
7.2 Eliminating Left Recursion
Rules of the form A → A B, which give left recursive structures, can be replaced with ones that give right branching trees:

[Diagrams omitted: a left-recursive tree for A, branching down to the left over a₁, a₂, a₃, and the equivalent right-branching tree built with the new category A′ (ending in A′ → ε), with the same frontier.]
Given:

1. A → A α
2. A → β, where β does not begin with A

we produce:

1. A → β A′ (because the β comes before all the αs)
2. A′ → α A′
3. A′ → ε

There are similar techniques for eliminating non-immediate left-recursion.
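A small sketch of immediate left-recursion removal (my own illustration; the rule representation is assumed as before, and ‘EPS’ stands for ε):

    # A minimal sketch: remove immediate left recursion for one non-terminal A.
    def remove_left_recursion(rules, A):
        recursive = [rhs[1:] for lhs, rhs in rules if lhs == A and rhs[:1] == [A]]   # the αs
        others    = [rhs      for lhs, rhs in rules if lhs == A and rhs[:1] != [A]]  # the βs
        if not recursive:
            return rules                                   # nothing to do
        A1 = A + "'"                                       # the new category A′
        new_rules = [(lhs, rhs) for lhs, rhs in rules if lhs != A]
        new_rules += [(A, beta + [A1]) for beta in others]         # A  → β A′
        new_rules += [(A1, alpha + [A1]) for alpha in recursive]   # A′ → α A′
        new_rules += [(A1, ['EPS'])]                               # A′ → ε
        return new_rules

    g = [('A', ['A', 'B']), ('A', ['x'])]
    for lhs, rhs in remove_left_recursion(g, 'A'):
        print(lhs, '→', ' '.join(rhs))
    # A → x A'
    # A' → B A'
    # A' → EPS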
7.3 Eliminating ε-productions
• Find every non-terminal that can rewrite as ε; suppose X is such a non-terminal.
• For every production containing such a non-terminal on the RHS, add a new production to the grammar which is identical except that it lacks the non-terminal: for A → B X C, add the production A → B C. (If a production contains several such non-terminals, add a version for each combination of omissions.)
• Remove the ε-productions from the grammar.
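A sketch of this procedure (my own, and simplified: it assumes the only ε-rules are directly of the form X → ε, represented here by an empty right-hand side):

    from itertools import combinations

    # A minimal sketch of ε-elimination; rules are (lhs, [rhs symbols]).
    def remove_epsilon(rules):
        nullable = {lhs for lhs, rhs in rules if rhs == []}
        new_rules = set()
        for lhs, rhs in rules:
            if rhs == []:
                continue                                  # drop the ε-productions
            positions = [i for i, s in enumerate(rhs) if s in nullable]
            # Add a version of the rule for every combination of omitted nullables.
            for k in range(len(positions) + 1):
                for omit in combinations(positions, k):
                    reduced = [s for i, s in enumerate(rhs) if i not in omit]
                    if reduced:                           # don't create new ε-rules
                        new_rules.add((lhs, tuple(reduced)))
        return sorted(new_rules)

    g = [('A', ['B', 'X', 'C']), ('X', [])]
    for lhs, rhs in remove_epsilon(g):
        print(lhs, '→', ' '.join(rhs))
    # A → B C
    # A → B X C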
8 NLs again
NLs are not Finite State or Context Free (cf. above), and have some problematic properties (e.g. they are ambiguous, have left-recursion, perhaps ε-productions, etc.), and very large grammars. But in some respects they are easier than general CF languages: they do not have cycles, or infinite ambiguities — these are the same thing, essentially — or ‘dense’ ambiguity, and typically sentence length is fairly short (about 10–20 words, rarely more than 30 words).

• infinite ambiguity:

  (23) s → s s
       s → ε
       s → x

• cycles:

  (24) s → s
       s → x

• dense ambiguity:

  (25) s → s s s s
       s → s s s
       s → s s
       s → x

One implication of this is that one may be able to do better than general CF parsing algorithms, which are rather inefficient for practical use.
9 Computation and Computational Complexity
A few words about computation in general, and ‘computational complexity’ in particular.
9.1 Algorithms and decidability
An algorithm for a problem is an “effective procedure” for solving the problem, that is, a completely specified, completely mechanical procedure that is guaranteed to succeed in a finite number of steps. Not all problems have algorithmic solutions: for example, problems that require the computation of the complete decimal expansion of π (which does not terminate).

Some problems are undecidable, or only semi-decidable: for example, checking whether a given formula f of predicate logic follows from a collection of axioms according to the standard rules of inference. Suppose you try to prove f; after a certain amount of time (say 10 years) you might manage this — good. But what if you don’t? The fact that you haven’t managed to prove it yet does not mean it does not follow, and this is just as true after 100 or 1000 or . . . years. (An implication of this is that the problem of logical equivalence for predicate logic is undecidable in general, since checking whether f and f′ are equivalent is just checking whether f follows from f′ and vice versa.)

More generally, suppose you have a procedure for generating (enumerating) members of a set S, but no general procedure for deciding if something is not in S (i.e. S is recursively enumerable, but not recursive). You want to know if x is an element of S: if it is in S you can just generate elements of S until you produce x (which you will do after a finite number of steps). But suppose it is not in S — you can never be sure that you won’t generate it.
9.2 Turing Machines
A Turing Machine (TM) is an abstract machine intended to capture the idea of computation, ignoring the practical details of real computers. A TM consists of a control unit, which may be in one of a finite number of states, a read-write head, and a tape, divided into squares. A program for a TM defines: • a set of tape symbols (the input and output, intuitively); • a set of states (including start and halt states); • instructions for moving the tape left and right, writing on the tape, and changing state, depending on the current state and the symbol on the tape. The idea is that a TM can simulate any computation at all — anything that can be computed can be computed on a TM. They also provide a standard model of a computer for the study of computational complexity.
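For concreteness, here is a tiny TM simulator sketch (entirely my own illustration: the instruction format and the example program are invented). The example program moves right along a binary input, flipping 0s and 1s, and halts on reaching a blank square.

    # A minimal Turing Machine sketch. The program table maps
    # (state, symbol) -> (new state, symbol to write, head move L/R/S).
    def run_tm(program, tape, state='start', blank='_', max_steps=1000):
        tape = dict(enumerate(tape))          # tape squares, indexed by position
        pos = 0
        for _ in range(max_steps):
            if state == 'halt':
                break
            symbol = tape.get(pos, blank)
            state, write, move = program[(state, symbol)]
            tape[pos] = write                 # write on the current square
            pos += {'L': -1, 'R': 1, 'S': 0}[move]
        return ''.join(tape[i] for i in sorted(tape)).strip(blank)

    FLIP = {('start', '0'): ('start', '1', 'R'),
            ('start', '1'): ('start', '0', 'R'),
            ('start', '_'): ('halt',  '_', 'S')}

    print(run_tm(FLIP, '01101'))   # 10010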
9.3 Computational Complexity
It is sometimes interesting to think about what computational resources (time and space/memory) are required by algorithms for different problems (do some problems inherently require more resources than others, for example?), or to compare alternative algorithms for the same problem. If one is going to concentrate on the problems and algorithms (rather than distractions like actual computers and coding systems) one must agree on:

• a standard model of computation (e.g. the Turing Machine),
• standard coding methods,
• standard measures of:
  – problem ‘size’ (e.g. the length of the input tape),
  – time (the number of moves the TM makes),
  – space (the amount of tape it uses).

One might look at performance and resources either in relation to:

• realistic cases; or
• worst cases.

The latter is generally easier, because:

1. you don’t have to know what sorts of case really arise in practice;
2. you can ignore ‘constant factors’ which affect all computations (if there is no limit on the length of inputs, then ‘most’ inputs are very long, so long that the resource demands due to constant factors are overwhelmed by factors that depend on input length).

9.3.1 ‘Big Oh’ Notation
We say that an algorithm is O(n²) (‘Big Oh n²’, or just ‘O n²’) if in the worst case it requires time proportional to the square of the input length (n), that is, it requires quadratic time. Similarly an algorithm is O(n³) if it requires time proportional to the cube of the input length (‘cubic time’). An algorithm is O(n) if it requires time proportional to the length of the input (‘linear complexity’). O(nᵏ) algorithms/problems are said to require polynomial time. These are the problems that can be solved efficiently. An algorithm that is O(2ⁿ), O(3ⁿ), or O(kⁿ) (for any k > 1) is said to be exponential, ‘computationally explosive’. Such problems are computationally intractable. To see why, consider a computer that can perform 1,000,000 operations a second (one operation every .000001 of a second):

        n=10     n=20     n=30     n=40        n=50         n=60
n²      .0001s   .0004s   .0009s   .0016s      .0025s       .0036s
2ⁿ      .001s    1.0s     17.9m    12.7 days   35.7 years   366 centuries
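The entries can be reproduced with a few lines of code (my own sketch; the 1,000,000 operations/second rate is the one assumed in the table above):

    # A minimal sketch: time for n² and 2ⁿ operations at 10⁶ operations/second.
    OPS_PER_SECOND = 1_000_000
    SECONDS_PER_YEAR = 60 * 60 * 24 * 365

    for n in (10, 20, 30, 40, 50, 60):
        quadratic = n ** 2 / OPS_PER_SECOND       # seconds for an O(n²) algorithm
        exponential = 2 ** n / OPS_PER_SECOND     # seconds for an O(2ⁿ) algorithm
        if exponential < SECONDS_PER_YEAR:
            expo = f"{exponential:.3f} seconds"
        else:
            expo = f"{exponential / SECONDS_PER_YEAR:.1f} years"
        print(f"n={n}: n² needs {quadratic:.4f}s, 2ⁿ needs {expo}")

    # n=50: n² needs 0.0025s, 2ⁿ needs 35.7 years
    # n=60: n² needs 0.0036s, 2ⁿ needs 36558.9 years (roughly 366 centuries)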
There are some known complexity properties for recognition of languages in different classes:

Grammar Type            Complexity of recognition
regular (3)             linear
context free (2)        cubic (n³)
context sensitive (1)   exponential
unrestricted (0)        undecidable

9.3.2 P, NP, PSPACE
Three complexity classes are often mentioned:

P  problems that can be solved in polynomial time on a deterministic TM. These problems can be solved ‘efficiently’.

NP  problems that can be solved in polynomial time on a non-deterministic TM — intuitively, this is a TM that can, when faced with alternative possibilities, explore them all in parallel. On a deterministic TM (or a real computer), the known algorithms for such problems require exponential time.

PSPACE  problems where the amount of space required is a polynomial function of the input length.

These form an ascending hierarchy, and it is generally thought (though not proven) that each is a proper subset of the higher classes. We say a problem is “NP-Hard” if it is at least as hard as any problem in NP. We say it is “NP-Complete” if it is NP-Hard, and included in NP (and similarly for P and PSPACE). Notice that if two algorithms are in P, then their composition is in P (and similarly for NP and PSPACE) — it is not just that problems in NP are more complex than those in P; these are different orders of complexity.

Intuitively, the NP-Hard problems are those which involve considering (exponentially) many different possibilities, and where one appears to have to consider all possibilities. Some examples in NP:

The traveling salesman: a salesman must visit a number of cities; what is the shortest route that takes in all of them? (You appear to have to compare all routes.)

SAT: the satisfiability problem for propositional logic; a formula of propositional logic consists of a number of propositional variables, and connectives. Is there an assignment of truth values to the variables that makes the formula true? (You appear to have to consider all the exponentially many assignments: for n variables, there are 2ⁿ alternatives.) (3-SAT is a variant of this.)

PSPACE examples include generalized two player games (e.g. checkers — but normal checkers is finite, so much easier).
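A brute-force SAT checker (my own sketch; the formula representation is invented) makes the 2ⁿ growth concrete: it simply tries every assignment.

    from itertools import product

    # A minimal sketch of brute-force SAT: try all 2ⁿ truth-value assignments.
    def satisfiable(variables, formula):
        """formula is a function from an assignment (a dict) to True/False."""
        for values in product([True, False], repeat=len(variables)):
            assignment = dict(zip(variables, values))
            if formula(assignment):
                return assignment            # a satisfying assignment
        return None                          # unsatisfiable

    # (p ∨ q) ∧ (¬p ∨ r) ∧ ¬q
    f = lambda a: (a['p'] or a['q']) and ((not a['p']) or a['r']) and (not a['q'])
    print(satisfiable(['p', 'q', 'r'], f))   # {'p': True, 'q': False, 'r': True}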
9.3.3 Application to NLs
Context free parsing algorithms exist with O(n³) complexity. (This is not bad, but people seem to be able to parse without much increase in complexity due to length — i.e. linearly: O(n).) Many formal problems that arise in relation to NL, and many problems in NL understanding, appear to be NP-Hard (e.g. finding the antecedent for a pronoun — is there any way to avoid considering all potential antecedents?). But these results relate to worst case, not normal or typical case behaviour.
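As an illustration of an O(n³) algorithm, here is a minimal sketch of a CYK-style recognizer (my own; it assumes a grammar in Chomsky Normal Form, represented as binary and lexical rules, and the tiny example grammar is invented).

    # A minimal CYK-style recognizer sketch for a CNF grammar.
    # binary: set of (A, B, C) for rules A → B C; lexical: set of (A, word).
    def cyk_recognize(words, binary, lexical, start='S'):
        n = len(words)
        # chart[i][j] = set of categories spanning words[i:j]
        chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
        for i, w in enumerate(words):
            chart[i][i + 1] = {A for A, word in lexical if word == w}
        for width in range(2, n + 1):              # O(n) span widths
            for i in range(n - width + 1):         # O(n) start positions
                j = i + width
                for k in range(i + 1, j):          # O(n) split points
                    for A, B, C in binary:
                        if B in chart[i][k] and C in chart[k][j]:
                            chart[i][j].add(A)
        return start in chart[0][n]

    binary = {('S', 'NP', 'VP'), ('VP', 'V', 'NP')}
    lexical = {('NP', 'Sam'), ('NP', 'babies'), ('V', 'saw')}
    print(cyk_recognize('Sam saw babies'.split(), binary, lexical))   # True

The three nested loops over span width, start position, and split point are what give the n³ behaviour.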
10 Reading
Partee et al. (1990, Part E, Ch. 18 ff) is an introduction to formal language theory for linguists. Gazdar and Mellish (1989, 132ff) is a brief discussion of some issues. Gazdar and Mellish (1989, Chs 1–3) discuss FSAs. The argument that NLs are not Finite State can be found in Chomsky (1957); convincing arguments for the non-context freeness of NLs were given in ?, and there is an excellent discussion in Pullum (1991). Barton et al. (1987) and Ristad (1993) are discussions of Computational Complexity from a linguistic perspective. A standard general reference on Computational Complexity is Garey and Johnson (1979). Aho et al. (1986) give techniques for eliminating left-recursion.
References

Aho, Alfred V., Sethi, Ravi and Ullman, Jeffrey D. 1986. Compilers: Principles, Techniques, and Tools. Reading, Mass.: Addison-Wesley Publishing Co.

Barton, Jr, G. Edward, Berwick, Robert C. and Ristad, Eric Sven. 1987. Computational Complexity and Natural Language. Cambridge, MA: The MIT Press.

Chomsky, Noam. 1957. Syntactic Structures. The Hague: Mouton.

Garey, Michael R. and Johnson, David S. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. San Francisco: W. H. Freeman and Co.

Gazdar, G. and Mellish, C. 1989. Natural Language Processing in Prolog. Wokingham: Addison Wesley.

Partee, Barbara H., ter Meulen, Alice and Wall, Robert E. 1990. Mathematical Methods in Linguistics. Dordrecht: Kluwer Academic Publishers.

Pullum, Geoffrey K. 1991. The Great Eskimo Vocabulary Hoax. Chicago: Chicago University Press.

Ristad, Eric Sven. 1993. The Language Complexity Game. Cambridge, Mass.: MIT Press.