Télécom SudParis – CSC 4508 - Linux Tools for Parsing (10/04/12)

Using Linux/Unix Compiler-of-Compilers Tools

I - Lexical analysis with flex alone
   1.1 - Preparing a lexical analyzer with flex (lex)
   1.2 - Lexical analysis: recognition of "tokens"
   1.3 - Lexical analysis: actions associated with expressions
II - Syntactic analysis with bison alone
   2.1 - Generating syntactic analyzers with bison
   2.2 - Getting / using a parser with a hand-made lexical analyzer
III - Using flex + bison together
   3.1 - Combined use of flex / bison
   3.2 - Preparing a (lexical + syntactic) parser with flex + bison
IV - More on syntactic analysis with bison
   4.1 - Grammar specification for bison
   4.2 - bison grammars: simple examples
   4.3 - bison specifications: action blocks
   4.4 - Lexical units' values may be passed on

1.1 - PREPARING A LEXICAL ANALYZER WITH flex

1) Prepare a specification file, e.g. "lexV.flex":

      %{
      possible C definitions, enclosed in a %{ ... %} block
      %}
      %%
      ...
      expression    actionBlock          expression/action rules
      expression    actionBlock
      ...
      %%
      possible C functions

2) Generate the analyzer:
      flex lexV.flex
   => file "lex.yy.c": lexical analyzer (C source), containing yylex() { ... }

3) Compile + link:
      gcc lex.yy.c -ll
   => lexical analyzer "a.out" (binary)
   (some flex installations name the library -lfl instead of -ll)

4) Use the analyzer:
      ./a.out < inputFile
   => reads its input from inputFile and writes its output to standard output

1.2 - LEXICAL ANALYSIS: RECOGNIZING "TOKENS"

a) Lexical analysis proceeds on a "fragment" basis:

      input:   ... | fragment f            | next fragment f'         | ...
                     application of reg.exp e   application of reg.exp e'

   - the fragment f currently recognized is available in: char yytext[ ]
   - a fragment begins with the recognition of a regular expression
   - a fragment ends:
      . at the end of input
      . or when the recognition of the next fragment begins
   - the user may specify to synchronize on newlines by using \n in expressions

b) When 2 regular expressions apply, the analyzer chooses:
   - the one which recognizes the longest fragment
   - in case of equal lengths, the expression defined first in the specification

c) Each fragment recognition may be associated with an action

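The selection rule in b) can be sketched in plain C. This is only an illustration under assumptions, not the code flex generates: the two rules (the keyword "if" and the expression [a-z]+) and the names match_if, match_ident and pick_rule are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Rule 0 (hypothetical): the literal keyword "if".
   Returns the length of the fragment matched at s, 0 if none. */
static size_t match_if(const char *s) {
    return strncmp(s, "if", 2) == 0 ? 2 : 0;
}

/* Rule 1 (hypothetical): the expression [a-z]+ */
static size_t match_ident(const char *s) {
    size_t n = 0;
    while (islower((unsigned char)s[n])) n++;
    return n;
}

/* Returns the index of the selected rule, or -1 if no rule applies.
   The strict '>' keeps the earlier rule when two lengths are equal. */
int pick_rule(const char *s) {
    size_t (*rules[])(const char *) = { match_if, match_ident };
    int best = -1;
    size_t best_len = 0;
    for (int i = 0; i < 2; i++) {
        size_t len = rules[i](s);
        if (len > best_len) {
            best_len = len;
            best = i;
        }
    }
    return best;
}
```

With these two rules, pick_rule("ifx = 1") selects rule 1 (longest fragment, "ifx"), while pick_rule("if (x)") selects rule 0 (equal lengths, first rule defined).
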
1.3 - LEXICAL ANALYSIS: ACTIONS ASSOCIATED WITH EXPRESSIONS

a) When a regular expression recognizes a fragment or "token":
   - IF an action (a C block) is associated with the expression, it is executed
     EX.
        %%
        expReg1   { ECHO; }     ECHO: predefined macro ( ≅ printf("%s",yytext) )
        expReg2   ;             empty action: token not reported in output
        %%
   - ELSE (input matched by no expression): the default action is to report
     the fragment in the output (≠ empty action)

b) Actions may use user-defined variables & functions, plus:
   - predefined global variables: char yytext[ ], int yyleng, union yylval ...
   - predefined functions: yymore() ...
   - predefined macros: ECHO, BEGIN ...

c) When coupled with a syntactic parser yyparse(), an action may:
   - return a lexical unit to it
   - maybe after preparing data for it, in yylval
     (see 4.4 "Lexical units' values may be passed on")

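The difference between an empty action and the default action can be imitated by hand in C. The sketch below assumes a specification with a single rule, "[0-9]+ ;" (empty action), and leaves everything else to the default action; the function name drop_digits is hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* Copies `in` to `out`, dropping every run of digits.
   `out` must have room for strlen(in) + 1 characters.
   Returns the number of characters written (excluding the final '\0'). */
size_t drop_digits(const char *in, char *out) {
    size_t n = 0;
    while (*in != '\0') {
        if (isdigit((unsigned char)*in)) {
            /* rule "[0-9]+ ;": empty action, the whole fragment is discarded */
            while (isdigit((unsigned char)*in)) in++;
        } else {
            /* no rule applies: default action, the character is echoed */
            out[n++] = *in++;
        }
    }
    out[n] = '\0';
    return n;
}
```

Called on "ab12cd3", the function leaves "abcd" in out: the digit runs are consumed silently, the other characters are reported.
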
2.1 - SYNTACTIC ANALYZERS with bison (= improvement of Unix's Yacc)

a) Specify a grammar with syntactic rules
   bison accepts LALR-type grammars

      … /* defs. for a lexical analyzer */
      %%
      Sentence : Subject Verb Complement
               | …
               ;
      Subject  : PRONOUN ADJECTIVE NOUN   { printf( "... "); }
               ;
      …
      %%
      …

b) Derive a "parser" or syntactic analyzer = a C program containing a yyparse( ) function
   - bison turns the grammar into C code: yyparse( ) { … } ...
   - adding a lexical analyzer yylex( ) { … } gives the full parser (C code)
   - gcc compiles it into the full parser (machine code)

c) The parser:
   - recognizes syntactic forms among tokens
   - may generate some output: each rule may have an action block (C code),
     executed if the rule applies to a piece of input
   - allows ambiguities in the input, but yields at most one interpretation

   Ex.: in the input "Verizon cuts the bills ." a token such as "cuts" has
        lexical class = TRANS_VERB, value in class = 1
   Tokens are recognized by yylex(), the lexical analysis function;
   it is called by yyparse() to get the next token

2.2 - A parser may use a hand-made lexical analyzer
(This slide may be skipped in a first go)

Specif. of a grammar: gramX.bison

      %token WORD '\n'                  define lexical classes
      %{                                possible definitions
      #include <stdio.h>
      extern int yylex();
      void yyerror(char *);
      %}
      %%
      Text :                            rules (+actions)
           | Text WORD { printf("WORD "); }
           | Text '\n' { printf("\n"); }
           ;
      %%
      void yyerror(char *s){/*...*/}    possible functions
      int main(){ return yyparse(); }

Hand-made lexical analyzer: lex.yy.c

      #include <stdio.h>
      #include <ctype.h>
      #include "gram.tab.h"
      int yylex(){
        char buff[256];
        int c, i = 0;

        while( (c=getchar())==' ' );
        if( c==EOF ) { return 0; }      /* test for lexical classes */
        if( c=='\n' ) { return c; }

        buff[i++] = c;
        while( isalpha(c=getchar()) )
          if( i < 255 ) buff[i++] = c;  /* over-long words are truncated */
        buff[i] = '\0';
        printf("%s ", buff);
        if( c != EOF ) ungetc(c,stdin);

        return WORD;                    /* pass lexical classes to parser */
      }

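To try the same scanning logic without bison, the lexer can be rewritten to read from a string instead of stdin. This is a sketch under assumptions: the name next_token and the constants TOK_EOF and TOK_WORD are hypothetical (in the real setup bison assigns the value of WORD in gram.tab.h), and the word text is returned through a caller-supplied buffer instead of being printed.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token codes; 258 only mimics the kind of value bison assigns. */
enum { TOK_EOF = 0, TOK_WORD = 258 };

/* Scans one token starting at *cursor and advances *cursor past it.
   Returns TOK_EOF at the end of the string, '\n' for a newline,
   TOK_WORD otherwise, with the word text copied into buf.
   cap is the size of buf and must be at least 1; longer words are truncated. */
int next_token(const char **cursor, char *buf, size_t cap) {
    const char *p = *cursor;
    size_t i = 0;

    while (*p == ' ') p++;                      /* skip blanks */
    if (*p == '\0') { *cursor = p; return TOK_EOF; }
    if (*p == '\n') { *cursor = p + 1; return '\n'; }

    /* As in the slide, the first character is taken unconditionally,
       then letters are accumulated. */
    do {
        if (i + 1 < cap) buf[i++] = *p;
        p++;
    } while (isalpha((unsigned char)*p));
    buf[i] = '\0';

    *cursor = p;                                /* nothing to "unget": p was only peeked */
    return TOK_WORD;
}
```

Starting with a cursor on "  hello world\n", successive calls return TOK_WORD ("hello"), TOK_WORD ("world"), '\n', then TOK_EOF.
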
2.2 - A parser may use a hand-made lexical analyzer, cont.
(This slide may be skipped in a first go)

1) Prepare a grammar file, e.g. "gramV.bison":

      %{                                definitions
      #include <...>
      extern int yylex();
      extern void yyerror();
      %}
      %token …
      %%
      symb1 : symb2 symb3 ...  { /* actions*/ }      syntactic rules (+ actions)
            | symb4 ...        { /* actions*/ }
            ;
      ...
      %%
                                        functions

2) Generate the parser:
      bison -d gramV.bison -o gram.tab.c
   => gram.tab.c:  extern int yylex();  yyparse() { ... yylex(); ... } ...

3) Compile + link:
      gcc gram.tab.c lex.yy.c
   => parser "a.out" (binary)

4) Use the analyzer:
      ./a.out < inputFile

3 - COMBINED USE of flex / bison
3.1 - Possible organization (Ex., Linux Fedora Core)

gramV.bison   --bison-->   gram.tab.h, gram.tab.c

      %{
      extern int yylex();
      %}
      %token tok1 tok2 ...              lexical class definitions
      %%
      syntactic rules + actions
      %%
      possible functions

lexV.flex   --flex-->   lex.yy.c

      %{
      #include "gram.tab.h"             include syntactic analyzer header
      %}                                (=> lexical class defs available)
      %%
      …   return tok1;                  lexical rules
      …   return tok2;
      %%

prin.c

      #include <stdio.h>
      extern int yyparse();             /* analyzer prototype */
      int main() {
        …
        yyparse( );                     /* launch syntactic parser: it will call yylex()
                                           to get tokens before returning */
        …                               /* possible final processing */
      }
      int yywrap( ) { /* … */ return 1; }
      void yyerror(char * s) { fprintf(stderr,"%s\n",s); }

3.2 - Preparing a full parser with flex + bison

Example with previous names / organization:

      bison -d gramV.bison -o gram.tab.c                   => gram.tab.h, gram.tab.c
      flex lexV.flex                                       => lex.yy.c
      gcc gram.tab.c lex.yy.c prin.c -ll -ly -o parserV    => parserV

   -ll : to use the Lex library ( input(), ...)
   -ly : useless under Linux (RedHat, no library proper to bison)

Ex. of possible elementary automation: Makefile (recipe lines start with a TAB)

      version =
      parser$(version) : lex$(version).flex gram$(version).bison prin.c
      	bison -d gram$(version).bison -o gram.tab.c
      	flex lex$(version).flex
      	gcc -o parser$(version) gram.tab.c lex.yy.c prin.c -ll

Launch making: make version=V
- this way re-uses lex.yy.c and prin.c for every version, to spare disk space;
  the binary name a.out could be re-used as well
- a less crude makefile should care for possible modifications to flex's output, etc.

4.1 - GRAMMAR SPECIFICATION for bison

      %{
      ...              possible C defs or macros:
      ...              - included files
      %}               - variable defs.
                       - function defs.
      possible Yacc definitions
      %%
      syntactic rules
      %%
      possibly other C functions

Example of possibilities:

      #include <...>
      #include <...>
      extern int yylex();
      int total = 0;
      ...
      %token tok1 tok2 ...
      %union { ... }          => "Lexical values may be passed on"
      %type …
      %start …
      %left …

- the first rule implicitly defines the "axiom"
- each rule has the form:

      NTSymbol : Symbol1 Symbol2 ...  { /* C Code */ }      possible action blocks
               | SymbolP SymbolQ ...  { /* C Code */ }
               ...
               ;

4.2 - Bison / Yacc GRAMMARS: SIMPLE EXAMPLES

First symbol = the "axiom" symbol = what the grammar can produce

EX.1
      List        :
                  | List Instruction '\n'
                  ;
      Instruction : Expression
                  | SYMBOL '=' Expression
                  ;
      Expression  : Operand
                  | Expression '+' Expression
                  ;
      Operand     : SYMBOL
                  | NUMBER
                  ;

   Possible productions:
      toto
      999
      x=1
      y = x + 10
   - each line = a possible Instruction
   - => the whole entry, as well as each list of Instructions = a possible List

EX.2
      Form : Part
           | '(' Form ')'
           | Part Part Part
           ;
      Part : SYMBOL
           | NUMBER
           ;

   Possible productions:
      123 ( MyValue ) 0 ( 100 ) 1000 0(1 2 3 ) (4)
   - each entry = a possible Form
   - this grammar can produce only one Form, not a list of Forms:
     a corresponding parser will accept only 1 Form

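As a rough check of EX.1, the Instruction rule can be recognized by a small hand-written C function. This is a recursive-descent style sketch, not the LALR parser bison would generate. It rests on assumptions: SYMBOL is taken to be a run of letters, NUMBER a run of digits, and the ambiguous rule Expression '+' Expression is treated as a flat sequence Operand ('+' Operand)*. All function names are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

static const char *skip_blanks(const char *p) {
    while (*p == ' ') p++;
    return p;
}

/* Operand : SYMBOL | NUMBER
   Returns a pointer just past the operand, or NULL if there is none.
   *is_symbol tells which alternative was recognized. */
static const char *operand(const char *p, int *is_symbol) {
    p = skip_blanks(p);
    if (isalpha((unsigned char)*p)) {
        while (isalpha((unsigned char)*p)) p++;
        *is_symbol = 1;
        return p;
    }
    if (isdigit((unsigned char)*p)) {
        while (isdigit((unsigned char)*p)) p++;
        *is_symbol = 0;
        return p;
    }
    return NULL;
}

/* Remainder of an Expression once its first Operand has been read:
   zero or more occurrences of '+' Operand. Returns NULL on error. */
static const char *expression_tail(const char *p) {
    int unused;
    for (;;) {
        p = skip_blanks(p);
        if (*p != '+') return p;
        p = operand(p + 1, &unused);
        if (p == NULL) return NULL;
    }
}

/* Instruction : Expression | SYMBOL '=' Expression
   `line` holds one instruction without its final '\n'.
   Returns 1 if the whole line is a valid Instruction, 0 otherwise. */
int is_instruction(const char *line) {
    int is_symbol;
    const char *p = operand(line, &is_symbol);
    if (p == NULL) return 0;
    p = skip_blanks(p);
    if (is_symbol && *p == '=') {          /* second alternative */
        p = operand(p + 1, &is_symbol);
        if (p == NULL) return 0;
    }
    p = expression_tail(p);
    return p != NULL && *skip_blanks(p) == '\0';
}
```

The four sample lines of EX.1 (toto, 999, x=1, y = x + 10) are accepted, while a line such as "1 = 2" is rejected because only a SYMBOL may appear on the left of '='.
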
4.3 - yacc / bison SPECIFICATIONS: ACTION BLOCKS

Each rule may have an associated action block (C code)

EX.1' Part of prev. Ex1, modified

      List :                         { printf("…"); }
           | List Instruction '\n'   { printf("…"); }
           ;

The actions may:
- operate on global variables:
      yytext yyleng yylineno yylval
      $1, $2 ... $$ ...            => see "Lexical values may be passed on"
- use functions or macros:
      ECHO yyerror( )
      yyerrok yyclearin ...        to reset the parser's state after error

= Bison predefined variables / macros. In addition the user may define others.

4.4 - Lexical units' values may be passed on

For each recognized lexical unit, the lexical analyzer yylex( ):
- returns a lexical class (token)
- may prepare a value:
   . in the predefined variable yylval
   . in a field suited to the type
  (yylval: a union of user-defined variables tokVal1, tokVal2, ...)

      %%
      EXPREGX   { /* … */ yylval.tokValk = .. ; return TOKCLASS; }
      …
      %%

For each lexical unit it gets, the parser yyparse( ):
- matches it with a tail symbol Tk in k-th position of some rule R
- the action part of R may:
   . use as $k the value associated with Tk
   . give a value $$ to the head symbol

      %token TOKCLASS1 TOKCLASS2
      %union { Ctype1 tokVal1; Ctype2 tokVal2; ... };
      %%
      ...
      HeadR : SymbolS TOKCLASS ...   { printf("...", $i ); $$ = $2 ; }
            ;
      …
      %%

- the required definitions are a bit technical... refer to the example on the next slide
- with previous Lex/Yacc, it was simpler, but weaker: yylval was just an int

4.4 - Lexical units' values may be passed on: Example

recordGram1.bison

      %token ',' '\n' ...                   declare lexical classes with no value
      %token <intVal> INT …                 declare lexical classes with values
      %token <string> NAME …
      %type <intVal> Record RecordList …    declare a value type for syntactic symbols
                                            that will be given values
      %union {
        int intVal;                         yylval will be usable to pass
        char * string;                      two different types of values
      }
      %%
      RecordList :
                 | RecordList Record        here Record's value would be $2
                 ;
      Record : INT NAME ',' INT NAME '\n'
                 { printf("%d %s, %d %s ",$1,$2,$4,$5);
                   $$ = $1 + $4; }          here a Record symbol is given a value
             ;                              (a RecordList could sum up these values)
      ...
      %%

recordLex1.flex

      %{
      #include "recordGram.tab.h"
      %}
      %%
      [0-9]+   { yylval.intVal = /* */ ; return INT; }     here, must compute suitable values
      [a-z]+   { yylval.string = /* */ ; return NAME; }    for passing them through yylval
      [,\n]    return yytext[0];                           before returning the lexical class
      ...
      %%

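The value-passing mechanism of this example can be imitated in plain C, without flex or bison. The sketch below is only an analogy under assumptions: SemVal stands for the %union, yylval_sketch for yylval, lex_sketch for yylex(), and parse_record for the Record rule with its action; all these names and the token codes are hypothetical. The parser is a simple hand-written sequence check, not a bison-generated one.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical token codes; bison would define INT and NAME in its header. */
enum { TOK_EOF = 0, TOK_INT = 258, TOK_NAME = 259 };

/* Counterpart of: %union { int intVal; char * string; } */
typedef union { int intVal; char *string; } SemVal;

static SemVal yylval_sketch;    /* plays the role of yylval */
static const char *src;         /* current position in the input */

/* Plays the role of yylex(): sets yylval_sketch, then returns the class. */
static int lex_sketch(void) {
    while (*src == ' ') src++;
    if (*src == '\0') return TOK_EOF;
    if (isdigit((unsigned char)*src)) {
        char *end;
        yylval_sketch.intVal = (int)strtol(src, &end, 10);
        src = end;
        return TOK_INT;
    }
    if (islower((unsigned char)*src)) {
        size_t n = 0;
        while (islower((unsigned char)src[n])) n++;
        char *copy = malloc(n + 1);
        if (copy == NULL) return TOK_EOF;   /* treated as an early end of input */
        memcpy(copy, src, n);
        copy[n] = '\0';
        yylval_sketch.string = copy;
        src += n;
        return TOK_NAME;
    }
    return (unsigned char)*src++;           /* ',' and '\n' stand for themselves */
}

/* Record : INT NAME ',' INT NAME '\n'  { $$ = $1 + $4; }
   Returns 1 and stores the Record's value in *value on success, 0 otherwise. */
int parse_record(const char *line, int *value) {
    static const int expected[6] = { TOK_INT, TOK_NAME, ',', TOK_INT, TOK_NAME, '\n' };
    int ints[6] = { 0 };            /* ints[k-1] plays the role of $k for INT */
    char *names[6] = { NULL };      /* names[k-1] plays the role of $k for NAME */
    int ok = 1;

    src = line;
    for (int k = 0; k < 6 && ok; k++) {
        int tok = lex_sketch();
        if (tok == TOK_INT) ints[k] = yylval_sketch.intVal;
        if (tok == TOK_NAME) names[k] = yylval_sketch.string;
        ok = (tok == expected[k]);
    }
    if (ok) *value = ints[0] + ints[3];         /* $$ = $1 + $4 */
    for (int k = 0; k < 6; k++) free(names[k]); /* release the copied names */
    return ok;
}
```

For the line "12 ab, 30 cd\n", parse_record succeeds and the Record's value is 12 + 30. A line with the comma missing is rejected at the third token.
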