Available online at www.sciencedirect.com
Procedia Engineering
Procedia Engineering 24 (2011) 700–707 www.elsevier.com/locate/procedia
2011 International Conference on Advances in Engineering
Advances in Ambiguity Detection Methods for Formal Grammars
Hari Mohan Pandey *
Department of Computing, Middle East College of Information Technology, Muscat, Oman
Abstract
In computer science and related fields, grammars play a vital role from both theoretical and practical points of view. The range of applications of formal grammars is growing steadily in areas such as pattern recognition, machine learning, computational biology, robotics and control systems, speech recognition, inductive logic programming and others. Context free grammars, one of the four classes of grammars defined by Noam Chomsky, have a wide variety of applications. Primarily, context free grammars are used to build compilers that verify the syntax of computer programs. However, research in this field is known to be computationally hard. This paper explores the language model presented by Chomsky, the problem of ambiguity, the degree of ambiguity, approaches to detect ambiguity, comparisons of existing methods and recent trends.
© 2011 Published by Elsevier Ltd. Selection and/or peer-review under responsibility of ICAE2011
Keywords: Ambiguity; Context Free Grammars; Horizontal and Vertical Ambiguity; Harmless Productions; Non-canonical Test.
1. Introduction
The problem of ambiguity of formal grammars arises in a variety of contexts. In some situations the application requires a finitely ambiguous input grammar, while in others what matters is a bound on the finite ambiguity or the asymptotic rate at which ambiguity increases as a function of the string length. In all these cases we need algorithms that detect ambiguity efficiently. The problem of grammar ambiguity has been analysed and discussed in relatively few publications. In [1] Chomsky defined grammar as: “Grammar is based on a finite number of observed sentences, and it projects this set to an infinite set of grammatical sentences by establishing general laws framed in terms of phonemes word and phrases”. In his work, Chomsky showed how to build a relationship between grammatical sentences and observed sentences. The diagrammatic representation of language
* Corresponding author.Tel: +968-92415993, fax: +968-24446554 E-mail address:
[email protected]
1877-7058 © 2011 Published by Elsevier Ltd. doi:10.1016/j.proeng.2011.11.2722
models is given in Fig. 1. Sag and Wasow [2] describe two distinct grammatical theories, namely Transformational Generative Grammar and Constraint-Based Lexicalist Grammar. Both theories are rooted in the generative tradition. A transformational derivation proceeds in two main phases: the first generates a phrase structure using a CFG, and the second applies transformational rules to map it into other phrase structures. Bresnan [3] and Montague [4]-[6] paved the way for non-transformational generative frameworks. Bresnan [7] introduced Lexical Functional Grammar (LFG), while others developed Generalized Phrase Structure Grammar (GPSG) and Categorial Grammar (CG).
Fig. 1. Description of Language presented by Chomsky
Another grammatical theory that has played a crucial role for natural languages is Construction Grammar. The idea of Construction Grammar originated at Berkeley, particularly in the work of Charles J. Fillmore. The basis of Construction Grammar lies in the secondary processes that take place when general rules and principles interact with each other. The grammatical theory that descends directly from GPSG is Head-driven Phrase Structure Grammar (HPSG). The development of HPSG has been influenced by Relational Grammar (RG) and Optimality Theory (OT). The goal of this paper is to present the work conducted by the author on various major grammars, along with a detailed description of ambiguity, ambiguity detection methods and their comparison.
Outline of the paper: Section 2 is dedicated to ambiguity, the degree of ambiguity, detection methods and comparisons of existing methods. Section 3 presents the conclusions, followed by the references.
2. Ambiguity and detection methods
2.1 What is Ambiguity?
Ambiguity is the situation in which a grammar derives some word with more than one parse tree (equivalently, with more than one leftmost or rightmost derivation). Every parse tree yields exactly one word of the grammar, but the reverse does not always hold: a word may have several parse trees. An example of a grammar with more than one parse tree for a word is Grammar 1.1:
$$E \to E + E \mid E * E \mid \mathrm{id} \qquad (1.1)$$
It recognizes the word id + id * id, which has two different leftmost (rightmost) derivations and parse trees. One leftmost derivation is:
$$E \Rightarrow_{lm} E + E \Rightarrow_{lm} \mathrm{id} + E \Rightarrow_{lm} \mathrm{id} + E * E \Rightarrow_{lm} \mathrm{id} + \mathrm{id} * E \Rightarrow_{lm} \mathrm{id} + \mathrm{id} * \mathrm{id} \qquad (1.2)$$
Another leftmost derivation:
$$E \Rightarrow_{lm} E * E \Rightarrow_{lm} E + E * E \Rightarrow_{lm} \mathrm{id} + E * E \Rightarrow_{lm} \mathrm{id} + \mathrm{id} * E \Rightarrow_{lm} \mathrm{id} + \mathrm{id} * \mathrm{id} \qquad (1.3)$$
Having seen Fig. 2, we can define an ambiguous context free grammar as [11]: “A Context Free Grammar $G = (V, \Sigma, P, S)$ is ambiguous iff a word $w \in \Sigma^{*}$ exists which has multiple leftmost derivations $S \Rightarrow^{*}_{lm} w$ in $G$”.
Fig. 2. Two parse trees of the word “id + id * id” using Grammar 1.1 and derivations (1.2) and (1.3)
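As an illustration of the two parse trees in Fig. 2, the following minimal Python sketch counts the parse trees of a token string under Grammar 1.1. The function name `count_trees` and the token-list representation are assumptions made for this example and do not come from [11]; the recursion simply tries every binary operator as the root of the tree.

```python
from functools import lru_cache


def count_trees(tokens):
    """Count the parse trees of `tokens` under E -> E + E | E * E | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of distinct ways in which E derives tokens[i:j].
        if j - i == 1:
            return 1 if tokens[i] == "id" else 0
        total = 0
        for k in range(i + 1, j - 1):
            if tokens[k] in ("+", "*"):
                # tokens[k] is taken as the operator at the root of the tree.
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))


print(count_trees("id + id * id".split()))
```

For the word id + id * id the count is 2, one tree per derivation (1.2) and (1.3); a count above 1 for any word shows that the grammar is ambiguous.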
For derivations (1.2) and (1.3) the leftmost derivation has been used; using rightmost derivations or parse trees instead results in an equivalent definition. We call a grammar unambiguous if it is not ambiguous.
2.2 Degree of ambiguity
The degree of ambiguity of a string w is defined as the total number of different parse trees of w. For example, the degree of ambiguity of w = id + id * id under Grammar 1.1 is 2, corresponding to the leftmost derivations (1.2) and (1.3). We can classify ambiguity in grammars using the degree of ambiguity. If the number of distinct parse trees per string increases with the length of the strings generated by a grammar, it is possible that the degree of ambiguity of that grammar is infinite.
2.3 Detection and removal methods
The central question for a context free grammar is whether the grammar is ambiguous. In [11] Michael Kruse presents a theorem on the undecidability of the ambiguity problem for context free grammars. This problem may hamper the working of technical languages. Although no algorithm exists that decides ambiguity for every CFG, there are algorithms that work for some grammars.
2.3.1. Gorn’s method (1963)
Saul Gorn [13] described a Turing machine that generates all possible strings of a grammar. After each new string is generated, a search checks whether the string has been generated before. If it has, the string has multiple derivations and hence the grammar is ambiguous. The search is a simple breadth first search. The method is able to identify ambiguities, but it cannot always recognize unambiguity. Moreover, it cannot be run forever: it always has to be halted at some point, and at that moment ambiguities may remain undiscovered in the grammar. A further drawback is that as long as the method has not found an ambiguity, the question of whether the grammar is ambiguous remains undecided.
2.3.2. Knuth’s method: LR (k) parsing (1965)
The LR (k) parsing algorithm described by Knuth [18] makes decisions based on k input symbols of lookahead. It is a bottom-up parsing technique. LR (k) grammars can be parsed deterministically using this algorithm. To apply this method a parse table has to be maintained, which helps in
identifying the action (shift, reduce, accept or error) to be performed. If the parse table for k symbols of lookahead can be constructed without conflicts, parsing is deterministic; if the table contains multiple actions for the same lookahead, a conflict occurs and no deterministic choice can be made at that point during parsing. A parse table without conflicts therefore yields a single derivation for every string in the language of the grammar. Hence, if a grammar is LR (k) then it is unambiguous. The LR (k) test is simple to apply: if the parse table contains no conflicts then the grammar is LR (k). The same kind of test can therefore be used in ambiguity detection, although a conflict does not by itself prove that the grammar is ambiguous.
2.3.3. Cheung and Uzgalis method (1995)
In [14] Cheung and Uzgalis gave a search method with pruning over all possible derivations of a grammar. The method is an optimization of [13]. Unlike [13], it can also detect unambiguity for some grammars with an infinite language. Since the pruning in [14] is always done in a safe way [15], no ambiguities are overlooked. The method also generates all possible sentential forms of a grammar and checks for duplicates. To do this, it compares the terminal prefixes and postfixes of the generated sentential forms.
2.3.4. Schröer’s method: AMBER (2001)
Schröer [16] developed the ambiguity detection tool “AMBER”. It uses an Earley parser to generate all possible strings and to find duplicates. Strings are derived in a way similar to [13], with some variations in the search method that improve the search time and the comparison of strings. A further improvement is that the search space can be bounded and the search stopped at any point. It is also possible to specify the maximum length of the generated strings or the maximum number of strings to generate, which leads to a random search method that reaches strings of greater length earlier.
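The exhaustive methods of Sections 2.3.1, 2.3.3 and 2.3.4 share one core idea: enumerate derivations and report a string that is derived more than once. The sketch below is a simplified stand-in for that idea, not Gorn's Turing machine construction and not AMBER's Earley-based generator. The names `find_ambiguous_strings` and `GRAMMAR_1_1`, the dictionary encoding of the grammar and the length bound are assumptions of this example, and the grammar is assumed to have no epsilon productions and no unit cycles.

```python
from collections import Counter, deque


def find_ambiguous_strings(grammar, start, max_len):
    """Breadth first search over leftmost derivations, bounded by `max_len`.

    `grammar` maps each nonterminal to a list of right-hand sides (tuples).
    Assumes no epsilon productions and no unit cycles, so sentential forms
    never shrink and pruning by length is safe.
    """
    derivations = Counter()  # sentence -> number of leftmost derivations
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        # Position of the leftmost nonterminal, or None for a sentence.
        pos = next((i for i, s in enumerate(form) if s in grammar), None)
        if pos is None:
            derivations[form] += 1
            continue
        for rhs in grammar[form[pos]]:
            new_form = form[:pos] + rhs + form[pos + 1:]
            if len(new_form) <= max_len:
                queue.append(new_form)
    # A sentence with more than one leftmost derivation witnesses ambiguity.
    return {" ".join(s): n for s, n in derivations.items() if n > 1}


GRAMMAR_1_1 = {"E": [("E", "+", "E"), ("E", "*", "E"), ("id",)]}

for sentence, n in sorted(find_ambiguous_strings(GRAMMAR_1_1, "E", 5).items()):
    print(sentence, n)
```

An empty result only means that no ambiguity exists up to the chosen bound, which mirrors the limitation of exhaustive search noted above: the search can confirm ambiguity but cannot in general confirm unambiguity.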
2.3.5. Jampana’s method (2005)
Jampana [17] presented an algorithm to detect ambiguity. The method is based on the observation that “non-trivial grammars define non-finite languages in a recursive manner”. Based on this observation the method divides the sentential forms of a grammar into two groups: those with one or more repetitions of one or more patterns, and the rest. For the first group it is assumed that all ambiguities of a grammar in Chomsky Normal Form (CNF) occur in derivations in which every live production (A → BC) is used at most once. This assumption about the repeated application of productions is incorrect: because strings with repetitive derivations are not searched, ambiguities can remain undetected, hence the algorithm might report false negatives.
2.3.6. Brabrand’s, Giegerich’s, and Møller’s method (2007)
Ambiguity Checking with Language Approximation (ACLA) was presented in [19]. The idea is to split the ambiguity problem into two problems of the same type, namely vertical and horizontal ambiguity. All individual productions are checked for vertical and horizontal ambiguity in terms of languages rather than in terms of grammars. If $(A \to \alpha)$ and $(A \to \beta)$ with $\alpha \neq \beta$ are two productions in $P$ of a grammar $G$, then $G$ is vertically ambiguous iff
$$L_G(\alpha) \cap L_G(\beta) \neq \emptyset \qquad (1.4)$$
If $(A \to \alpha\beta) \in P$ is any production of a grammar $G$, then $G$ is horizontally ambiguous iff
$$L_G(\alpha) \Cap L_G(\beta) \neq \emptyset \qquad (1.5)$$
where $\Cap$ denotes the overlap operation.
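Conditions (1.4) and (1.5) can be tried out on small finite languages. The sketch below assumes that the overlap of two languages X and Y is the set of strings xay with a non-empty, such that x and xa are in X while y and ay are in Y; this definition is recalled from [19] and should be checked against that paper. The function names are hypothetical, and Python strings stand in for words with one character per token. ACLA itself works on regular approximations of possibly infinite languages, which this sketch does not attempt.

```python
def vertical_overlap(lang_a, lang_b):
    """Words that both alternatives A -> alpha and A -> beta can derive."""
    return set(lang_a) & set(lang_b)


def horizontal_overlap(lang_x, lang_y):
    """Words x a y (a non-empty) with x, xa in lang_x and y, ay in lang_y.

    Such a word can be split between the two parts of A -> alpha beta in
    two different ways: x | ay and xa | y.
    """
    lang_x, lang_y = set(lang_x), set(lang_y)
    result = set()
    for xa in lang_x:
        for split in range(len(xa)):  # a = xa[split:] is never empty
            x, a = xa[:split], xa[split:]
            if x not in lang_x:
                continue
            for y in lang_y:
                if a + y in lang_y:
                    result.add(x + a + y)
    return result


print(vertical_overlap({"ab", "c"}, {"c", "d"}))
print(horizontal_overlap({"x", "xa"}, {"y", "ay"}))
```

In the second call the word xay is reported because it can be read either as x followed by ay or as xa followed by y; a non-empty result corresponds to the non-emptiness tests in (1.4) and (1.5).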
The languages of the productions are approximated to make the intersection and overlap problems decidable. Approximating here means extending the original language to a larger language that can be represented by a regular grammar; this kind of approximation is called a
conservative approximation because all strings of the original language are also included in the regular one. In this way we can compute regular approximations of vertical and horizontal ambiguity. If the resulting set is non-empty, a potential vertical or horizontal ambiguity is found.
2.3.7. Schmitz’s method (2007)
A conservative approach was presented in [20] by Sylvain Schmitz, who showed that the approximations cannot result in overlooked ambiguous cases. The method provides a framework for the conservative detection of ambiguities, which allows only false positives. To begin with, the given grammar is converted into a bracketed grammar. In a bracketed grammar each rule $i = A \xrightarrow{i} \alpha$ is enclosed in a pair of opening and closing brackets $d_i$ and $r_i$, where $i$ is the number of the production and $d_i$ and $r_i$ are its derivation and reduction symbols respectively. Let $G = \langle N, T, P, S \rangle$ be an input context free grammar and $G_b = \langle N, T_b, P_b, S \rangle$ the corresponding bracketed context free grammar. The language $L(G_b)$ contains a unique string for each derivation of $G$. By applying a homomorphism we can map every string of $L(G_b)$ to its corresponding string in $L(G)$. When multiple strings of $L(G_b)$ map to the same string of $L(G)$, that string has multiple derivations and is ambiguous. For Grammar 1.1 the bracketed grammar is:
$$E \to d_1\, E + E\, r_1 \mid d_2\, E * E\, r_2 \mid d_3\, \mathrm{id}\, r_3 \qquad (1.6)$$
To find the different bracketed strings that map to the same original string, a position graph is constructed. The elements of the position graph are:
• Node: a position in a sentential form of $G_b$
• Label: labels are of the form a. + a
• Edges: valid transitions between the nodes
• Transitions: shift, derivation and reduction
The bracketed grammar cannot be ambiguous, because the $d_i$ and $r_i$ terminals always indicate which production rule the terminals they enclose should be parsed with.
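The relation between $L(G_b)$ and $L(G)$ can be illustrated by brute force on Grammar 1.1. The sketch below builds the bracketed grammar (1.6), enumerates its sentences up to a length bound, erases the brackets and groups the results. It does not construct Schmitz's position graph or any approximation of it; the function names, the dictionary encoding of the grammar and the bound are assumptions of this example, and epsilon productions are assumed to be absent.

```python
from collections import defaultdict, deque


def bracket(grammar):
    """Build G_b: wrap the right-hand side of production i in d_i ... r_i."""
    bracketed, i = {}, 0
    for lhs, alternatives in grammar.items():
        bracketed[lhs] = []
        for rhs in alternatives:
            i += 1
            bracketed[lhs].append((f"d{i}",) + rhs + (f"r{i}",))
    return bracketed


def sentences(grammar, start, max_len):
    """All sentences with at most `max_len` symbols (no epsilon rules assumed)."""
    found, seen, queue = set(), {(start,)}, deque([(start,)])
    while queue:
        form = queue.popleft()
        pos = next((k for k, s in enumerate(form) if s in grammar), None)
        if pos is None:
            found.add(form)
            continue
        for rhs in grammar[form[pos]]:
            new_form = form[:pos] + rhs + form[pos + 1:]
            if len(new_form) <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return found


def erase_brackets(sentence):
    """Homomorphism from L(G_b) to L(G): drop every d_i and r_i symbol."""
    return tuple(s for s in sentence if not (s[0] in "dr" and s[1:].isdigit()))


def ambiguous_by_brackets(grammar, start, max_len):
    """Group bracketed sentences by their image; keep images hit more than once."""
    groups = defaultdict(set)
    for bracketed_sentence in sentences(bracket(grammar), start, max_len):
        groups[erase_brackets(bracketed_sentence)].add(bracketed_sentence)
    return {" ".join(w): g for w, g in groups.items() if len(g) > 1}


GRAMMAR_1_1 = {"E": [("E", "+", "E"), ("E", "*", "E"), ("id",)]}

for b in sorted(ambiguous_by_brackets(GRAMMAR_1_1, "E", 15)["id + id * id"]):
    print(" ".join(b))
```

The two bracketed strings printed for id + id * id encode the two parse trees of that word, while each bracketed string on its own determines its tree completely.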
2.3.8. Basten’s and Vinju’s method: harmless production filtering (2010)
Basten and Vinju [21] presented a new approach to detecting ambiguity that combines an approximation method with an exhaustive method. The method is an extension of the Noncanonical Unambiguity Test (NU-Test). It identifies production rules that do not contribute to the ambiguity of a grammar, so that these rules can be filtered out. This reduces the search space of the exhaustive method and improves its performance. For the noncanonical test, Basten and Vinju [21] use the bracketed grammar and position graph of [20] (Section 2.3.7). After applying the NU-Test we can decide whether the grammar is unambiguous or potentially ambiguous. The item0 relation is an equivalence relation that yields an approximated graph closely resembling the LR (0) parse automaton of the grammar, where:
• Node: LR (0) items.
• Edges: corresponding actions.
• Start Node: dot at the beginning of a production of the start symbol.
• End Node: dot at the end of a production of the start symbol.
• Shift Transition: $A \to \alpha \bullet X \beta \xrightarrow{X} A \to \alpha X \bullet \beta$
• Derivation Transition: $A \to \alpha \bullet B \gamma \xrightarrow{d_i} B \to \bullet \beta$, where $i$ is the number of the production $B \to \beta$.
• Reduction Transition: $B \to \beta \bullet \xrightarrow{r_i} A \to \alpha B \bullet \gamma$, where $i$ is the number of the production $B \to \beta$.
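The three transition kinds listed above can be enumerated directly from a list of productions. The following sketch does so for Grammar 1.1; it only builds the labelled edges between LR (0) items and does not implement the NU-Test or the filtering of [21]. The function name `item0_graph`, the encoding of an item as a pair (production number, dot position) and the edge triples are assumptions of this example.

```python
def item0_graph(productions):
    """Return labelled edges (source item, label, target item).

    `productions` is a list of (lhs, rhs) pairs; production i is the i-th
    pair, counted from 1. An item is (production number, dot position).
    """
    nonterminals = {lhs for lhs, _ in productions}
    edges = []
    for i, (_, rhs) in enumerate(productions, start=1):
        for dot, symbol in enumerate(rhs):
            # Shift: A -> alpha . X beta  --X-->  A -> alpha X . beta
            edges.append(((i, dot), symbol, (i, dot + 1)))
            if symbol not in nonterminals:
                continue
            for j, (lhs_j, rhs_j) in enumerate(productions, start=1):
                if lhs_j == symbol:
                    # Derivation: A -> alpha . B gamma  --d_j-->  B -> . beta
                    edges.append(((i, dot), f"d{j}", (j, 0)))
                    # Reduction: B -> beta .  --r_j-->  A -> alpha B . gamma
                    edges.append(((j, len(rhs_j)), f"r{j}", (i, dot + 1)))
    return edges


PRODUCTIONS_1_1 = [
    ("E", ("E", "+", "E")),  # production 1
    ("E", ("E", "*", "E")),  # production 2
    ("E", ("id",)),          # production 3
]

for source, label, target in item0_graph(PRODUCTIONS_1_1):
    print(source, label, target)
```

For example, the edge from item (1, 0) labelled d3 to item (3, 0) is the derivation transition from $E \to \bullet E + E$ into $E \to \bullet \mathrm{id}$, and the edge from (3, 1) labelled r3 to (1, 1) is the matching reduction transition.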
To find ambiguity in an item0 graph we use two cursors simultaneously. If the two cursors follow different paths but construct the same string of shifted tokens, there is a possible ambiguity. The Position Pair Graph (PPG) is the graph obtained by traversing with two cursors. In a PPG the nodes are pairs of cursor positions, while the edges are the steps made by the two cursors. Hence an edge exists only when:
• one of the cursors applies an individual derivation or reduction transition, or
• both cursors simultaneously apply a shift transition over the exact same symbol.
Therefore, a path in a PPG describes two parse trees of the same string; such a path is called an ambiguous path pair if the two paths it represents are not identical. A join point indicates the presence of an ambiguous path pair. Hence we can efficiently detect ambiguity by constructing a PPG and finding the join points, although there is still scope for optimization. The NU-Test stops after constructing the PPG, and the ambiguous path pairs are reported to the user. In [21], however, the PPG is used to find those production rules that certainly do not contribute to the ambiguity of the grammar; such production rules are called harmless production rules. Once the harmless production rules are known, they can be removed with the filtering algorithm given in [21]. Upon completion of the filtering process we are left with the productions that potentially lead to ambiguity. These productions may, however, form an incomplete grammar, because non-terminals from the top of the grammar may have been filtered out, or non-terminals may have no productions left while still occurring in productions of other non-terminals. Therefore, to restore the reachability and productivity properties of the grammar, a new start symbol, new terminals, non-terminals and production rules have to be introduced.
2.3.9. Bas Basten’s and Tijs van der Storm’s method: AMBIDEXTER (2010)
Bas Basten and Tijs van der Storm [22] combine exhaustive and approximative searching. AMBIDEXTER allows us to detect both ambiguity and unambiguity. The aim of the approximative filtering is to reduce the search space for the exhaustive checker. Similar to [21], the tool works in two stages: a harmless production filter and a derivation generator. It accepts input grammars in YACC and SDF formats. The reported results show that AMBIDEXTER is on average twice as fast as AMBER [16]. Since it combines the features of exhaustive and approximative approaches, it produces comprehensible ambiguity reports in much less time.
2.3.10. Basten’s, Klint’s and Vinju’s method: Ambiguity Detection for Character-Level Grammars (2011)
In [23] the authors extended the earlier work of [21] and [22] with a new algorithm to detect ambiguity in character-level grammars. The new method showed that the time taken by ambiguity detection for character-level grammars of languages such as C and Java is significantly reduced without any loss. Several extensions make the baseline algorithm more suitable for character-level grammars:
• Instead of treating every character as a separate token, treat character classes as shiftable symbols.
• Use disambiguation filters to handle issues such as keyword reservation and longest match.
• Use grammar unfolding as an optimization technique, which allows the identification of parts of the grammar that do not contribute to ambiguity.
The detailed architecture of the baseline algorithm proposed in [23] is given in Fig. 4. The figure shows seven steps that lead to a report of non-ambiguity, ambiguity or time-out.
Fig. 4. Baseline Architecture for fast ambiguity detection [24]
Although the computation presented in Fig. 4 is correct and efficient, the baseline algorithm is not suitable for character-level grammars for two reasons: first, it is difficult to handle the increased complexity; second, it still finds ambiguities that have already been resolved. To deal with these problems, nodes and edges in the NFA and PG representations are filtered, and selected parts of the grammar are then unfolded.
2.4 Comparison of methods
Table 1 shows a comparative study of the ambiguity detection methods M1 to M10 studied in this paper with respect to several factors/properties.
Table 1. Comparative Summary of the existing Ambiguity Detection Methods
Method  Termination  Ambiguity  Unambiguity  False Positive  False Negative  Scalable  Optimize
M1      N            Y          N            N               N               N
M2      N            N          Y            N               N               N
M3      N            Y          Y            N               N               N         Y*
M4      N            Y          N            N               Y               Y
M5      Y            Y          Y            N               Y               N         Y**
M6      Y            Y          Y            Y               N               Y         Y
M7      Y            Y          Y            Y               N               Y         Y
M8      Y            Y          Y            Y               N               Y         Y***
M9      Y            Y          Y            Y               N               Y         Y****
M10     Y            Y          Y            Y               N               Y         Y
M1: Gorn’s Method; M2: Knuth’s Method; M3: Cheung and Uzgalis Method; M4: Schröer’s Method; M5: Jampana’s Method; M6: Brabrand’s, Giegerich’s and Møller’s Method; M7: Schmitz’s Method; M8: Basten’s and Vinju’s Method; M9: Bas Basten’s and Tijs van der Storm’s Method; M10: Basten’s, Klint’s and Vinju’s Method.
Y: Yes; N: No; Y*: Optimized to M1; Y**: Optimized but reports false negatives; Y***: Optimized by using approximation and exhaustive methods; Y****: Optimized to M4.
3. Conclusions
We found relatively few articles on ambiguity detection for formal grammars. This paper covers the language model given by Chomsky [1] to categorize languages, the major grammar types that play a vital role in many areas of computer science and beyond, and a detailed description of ambiguity, the degree of ambiguity and various ambiguity detection methods. It also mentions the application areas of formal grammars. In preparing this paper, the author's fundamental objective was to provide a detailed introduction to the active and challenging area of grammars and ambiguity detection
methods. The target audience is mainly beginners and those who wish to start research in this field and to find appropriate ideas or techniques for their own research work.
References
[1] Chomsky, N., Three Models for the Description of Language, IEEE Transactions on Information Theory, 1956.
[2] Sag, Wasow and Bender (2003). Syntactic Theory 2nd edition. CSLI Publications.
[3] Bresnan, Joan, 1978. A realistic transformational grammar. In Linguistic theory and psychological reality, ed. Morris Halle, Joan Bresnan, and George A. Miller, Cambridge, MA; The MIT Press.
[4] Montague, Richard (1970a). English as a Formal Language. In Visentini, Bruno et al. (eds.) Linguaggi nella società e nella tecnica. Milan: Edizioni di Comunità. 189-224. Reprinted in Montague (1974), 188-221.
[5] Montague, Richard (1970b). Universal grammar. Theoria 36:373-398. Reprinted in Montague (1974), 222-246.
[6] Montague, Richard (1970c). Pragmatics and intensional logic. Synthèse 22:68-94. Reprinted in Montague (1974), 119-147.
[7] Bresnan, Joan, 1982. The Passive in Lexical Theory. In The Mental Representation of Grammatical Relations, ed. Joan Bresnan, 3-86, Cambridge, MA; The MIT Press.
[8] J. Dassow and G. Paun. Regulated Rewriting in Formal Language Theory. EATCS Monographs in Theoretical Computer Science 18, Springer-Verlag, 1989.
[9] A. Fleck. Formal Models of Computation. World Scientific Publishing, 2001.
[10] A.J. Jones. Formal Languages and Automata. School of Computer Science, Cardiff University. [Online]. Available from: . [Accessed 4 October 2006].
[11] Michael Kruse. “Ambiguity Detection for Context-Free Grammars in Eli”, Bachelor’s thesis, Paderborn, 7th May, 2008.
[12] H.J.S. Basten, “Ambiguity Detection Methods for Context-Free Grammars”, Master’s thesis, August 17, 2007.
[13] Saul Gorn. Detection of generative ambiguities in context-free mechanical languages. J. ACM, 10(2):196–208, 1963. http://doi.acm.org/10.1145/321160.321168.
[14] Bruce S. N. Cheung and Robert C. Uzgalis. Ambiguity in context-free grammars. In SAC’95: Proceedings of the 1995 ACM symposium on applied computing, pages 272–276, New York, NY, USA, 1995. ACM Press. http://doi.acm.org/10.1145/315891.315991.
[15] B.S.N. Cheung. A Theory of automatic language acquisition. PhD thesis, University of Hong Kong, 1994.
[16] Friedrich Wilhelm Schröer. AMBER, an ambiguity checker for context-free grammars. Technical report, compilertools.net, 2001. http://accent.compilertools.net/Amber.html.
[17] Saichaitanya Jampana. Exploring the problem of ambiguity in context-free grammars. Master’s thesis, Oklahoma State University, July 2005. http://e-archive.library.okstate.edu/dissertations/AAI1427836/.
[18] Donald E. Knuth. On the translation of languages from left to right. Information and Control, 8(6):607–639, 1965.
[19] Claus Brabrand, Robert Giegerich, and Anders Møller. Analyzing ambiguity of context free grammars. In Miroslav Balik and Jan Holub, editors, Proc. 12th International Conference on Implementation and Application of Automata, CIAA ’07, July 2007. http://www.brics.dk/~brabrand/grambiguity/.
[20] Sylvain Schmitz. Conservative ambiguity detection in context-free grammars. In Lars Arge, Christian Cachin, Tomasz Jurdziński, and Andrzej Tarlecki, editors, ICALP’07: 34th International Colloquium on Automata, Languages and Programming, volume 4596 of Lecture Notes in Computer Science, pages 692–703. Springer, 2007.
[21] H.J.S. Basten and J.J. Vinju, “Faster Ambiguity Detection by Grammar Filtering”, 2010 Association for Computing Machinery. ACM, ISBN: 978-4503-0063-6/6/10/03.
[22] Bas Basten and Tijs van der Storm, “AMBIDEXTER: Practical Ambiguity Detection Tool Demonstration”, http://homepages.cwi.nl/~basten/ambiguity/
[23] H.J.S. Basten, P. Klint and J.J. Vinju. Ambiguity Detection: Scaling to Scannerless. Pre-proceedings of the 4th International Conference on Software Language Engineering (SLE 2011), Braga, Portugal, July 2011.