Using Linux/Unix Compiler Of Compilers Tools

I - Lexical analysis with flex alone
    1.1 - Preparing a lexical analyzer with flex (lex)
    1.2 - Lexical analysis: recognition of "tokens"
    1.3 - Lexical analysis: actions associated with expressions

II - Syntactic analysis with bison alone
    2.1 - Generating syntactic analyzers with bison
    2.2 - Get / use a parser using a hand-made lexical analyzer

III - Using flex + bison together
    3.1 - Combined use of flex / bison
    3.2 - Preparing a (lexical + syntactic) parser with flex + bison

IV - More on syntactic analysis with bison
    4.1 - Grammar specification for bison
    4.2 - bison grammars: simple examples
    4.3 - bison specifications: action blocks
    4.4 - Lexical units' values may be passed on

10/04/12 - Télécom SudParis – CSC 4508 - Linux Tools for Parsing

1.1 - PREPARING A LEXICAL ANALYZER WITH flex

1) Prepare a specification file, e.g. "lexV.flex":

    %{
    possible C definitions (enclosed in a %{ %} block)
    %}
    %%
    ...
    expression    actionBlock
    expression    actionBlock
    ...
    %%
    possible C functions

   The part between the two %% holds the expression/action rules.

2) Generate the analyzer:

    flex lexV.flex

   => file "lex.yy.c": lexical analyzer (C source), containing yylex() { ... }

3) Compile + link:

    gcc lex.yy.c -ll

   => lexical analyzer "a.out" (binary)

4) Use the analyzer:

    ./a.out < inputFile

   input (inputFile) -> a.out -> output

1.2 - LEXICAL ANALYSIS: RECOGNIZING "TOKENS"

a) Lexical analysis proceeds on a "fragment" basis:

    input:  ... | fragment f  (application of reg. exp. e) | next fragment f'  (application of reg. exp. e') | ...

   The fragment f currently recognized is available in: char yytext[ ]

   - a fragment begins with the recognition of a regular expression
   - a fragment ends:
       . at the end of input
       . or when the recognition of the next fragment begins
   - the user may synchronize on newlines by using \n in expressions

b) When 2 regular expressions apply, the analyzer chooses:
   - the one which recognizes the longest fragment
   - in case of equal lengths, the expression defined first in the specification

c) Each fragment recognition may be associated with an action
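The selection policy in b) can be sketched in plain C. This is only an illustration of the rule, not the table-driven automaton that flex generates; the three matcher functions and `choose_rule` are hypothetical names introduced for this sketch.

```c
#include <ctype.h>
#include <stddef.h>

/* Each hypothetical "rule" reports how many leading characters of s
 * it recognizes (0 = the rule does not apply). */
typedef size_t (*Matcher)(const char *s);

/* Rule 0: the keyword "if" */
static size_t match_if(const char *s) {
    return (s[0] == 'i' && s[1] == 'f') ? 2 : 0;
}

/* Rule 1: an identifier, written [a-z]+ in a flex specification */
static size_t match_ident(const char *s) {
    size_t n = 0;
    while (islower((unsigned char)s[n])) n++;
    return n;
}

/* Rule 2: a number, written [0-9]+ in a flex specification */
static size_t match_number(const char *s) {
    size_t n = 0;
    while (isdigit((unsigned char)s[n])) n++;
    return n;
}

/* The order of this table plays the role of the order of the rules
 * in the specification file. */
static const Matcher rules[] = { match_if, match_ident, match_number };
enum { RULE_COUNT = sizeof rules / sizeof rules[0] };

/* Returns the index of the winning rule (or -1 if none applies) and
 * stores the length of the recognized fragment in *len. */
int choose_rule(const char *s, size_t *len) {
    int best = -1;
    *len = 0;
    for (int i = 0; i < RULE_COUNT; i++) {
        size_t n = rules[i](s);
        if (n > *len) {      /* strictly longer only: on a tie the earlier rule is kept */
            *len = n;
            best = i;
        }
    }
    return best;
}
```

With this table, the input "if" is claimed by rule 0 (both rule 0 and rule 1 recognize 2 characters, and rule 0 comes first), whereas "ifx" goes to rule 1 because the identifier rule recognizes a longer fragment.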

1.3 - LEXICAL ANALYSIS: ACTIONS ASSOCIATED WITH EXPRESSIONS

a) When a regular expression recognizes a fragment or "token":
   - IF an action (a C block) is associated with the expression, it is executed

     EX.
        %%
        expReg1    { ECHO; }    /* predefined macro ( ≅ printf("%s",yytext) ) */
        expReg2    ;            /* empty action: token not reported in output */
        %%

   - ELSE (no action block, ≠ empty action): the default action is to report the fragment in the output

b) Actions may use user-defined variables & functions, plus:
   - predefined global variables: char yytext[ ], int yyleng, union yylval ...
   - predefined functions: yymore() ...
   - predefined macros: ECHO, BEGIN ...

c) When coupled with a syntactic parser yyparse(), an action may:
   - return a lexical unit to it
   - maybe after preparing data for it, in yylval
     (see 4.4 "Lexical units' values may be passed on")

2.1 - SYNTACTIC ANALYZERS with bison

bison = improvement of Unix's Yacc

a) Specify a grammar with syntactic rules (bison accepts LALR-type grammars):

    … /* defs. for a lexical analyzer */
    %%
    Sentence : Subject Verb Complement
             | …
             ;
    Subject  : PRONOUN ADJECTIVE NOUN    { printf( "... "); }
             ;
    …
    %%
    …

b) Derive a "parser" or syntactic analyzer = a C program containing a yyparse( ) function:

    grammar -> bison -> yyparse( ) { … } ...
    yyparse( ) { … } ... + yylex( ) { … } = full parser (C code)
    full parser (C code) -> gcc -> full parser (machine code)

c) The parser:
   - recognizes syntactic forms among tokens
   - may generate some output: each rule may have an action block (C code), executed if the rule applies to a piece of input
   - allows ambiguities in the input, but yields at most one interpretation

Tokens are recognized by yylex(), the lexical analysis function, which yyparse() calls to get the next token.

    EX. input: Verizon cuts the bills .
        "cuts" is a token: lexical class = TRANS_VERB, value in class = 1
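The idea that each token carries a lexical class plus a value inside that class, and that the parser pulls tokens one at a time, can be sketched in C. Everything below is a hypothetical stand-in: `next_token` imitates the calling convention of yylex() (it returns 0 at end of input), `token_value` plays the role of yylval, and the lexicon entries other than TRANS_VERB / value 1 are invented for the illustration.

```c
#include <ctype.h>
#include <string.h>

/* Lexical classes; values above 255 cannot collide with
 * single-character tokens returned as their own character code. */
enum { NOUN = 258, TRANS_VERB, ARTICLE, UNKNOWN };

/* Hypothetical lexicon: the value distinguishes words within one class. */
static const struct {
    const char *word;
    int lexClass;
    int value;
} lexicon[] = {
    { "Verizon", NOUN,       1 },
    { "cuts",    TRANS_VERB, 1 },
    { "the",     ARTICLE,    1 },
    { "bills",   NOUN,       2 },
};

int token_value;    /* plays the role of yylval in this sketch */

/* Stand-in for yylex(): returns the class of the next word found at
 * *cursor and advances *cursor past it; returns 0 at end of input. */
int next_token(const char **cursor) {
    const char *p = *cursor;

    /* skip blanks and punctuation */
    while (*p != '\0' && !isalpha((unsigned char)*p)) p++;
    if (*p == '\0') {
        *cursor = p;
        return 0;
    }

    const char *start = p;
    while (isalpha((unsigned char)*p)) p++;
    size_t len = (size_t)(p - start);
    *cursor = p;

    for (size_t i = 0; i < sizeof lexicon / sizeof lexicon[0]; i++) {
        if (strlen(lexicon[i].word) == len &&
            strncmp(lexicon[i].word, start, len) == 0) {
            token_value = lexicon[i].value;
            return lexicon[i].lexClass;
        }
    }
    token_value = 0;
    return UNKNOWN;
}
```

A caller standing in for yyparse() would loop on `next_token(&cursor)` until it returns 0, reading `token_value` after each call.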

2.2 - A parser may use a hand-made lexical analyzer
(This slide may be skipped on a first reading)

Specification of a grammar: gramX.bison

    /* possible definitions */
    %token WORD '\n'                  /* define lexical classes */
    %{
    #include <stdio.h>
    extern int yylex();
    void yyerror(char *);
    %}
    %%
    /* rules (+actions) */
    Text :
         | Text WORD  { printf("WORD "); }
         | Text '\n'  { printf("\n"); }
         ;
    %%
    /* possible functions */
    void yyerror(char *s){/*...*/}
    int main(){ return yyparse(); }

Hand-made lexical analyzer: lex.yy.c

    #include <stdio.h>
    #include <ctype.h>
    #include "gram.tab.h"

    int yylex(){
        char buff[256];               /* no bounds check: assumes words shorter than 256 chars */
        int  c, i = 0;

        /* test for lexical classes */
        while( (c=getchar())==' ' );
        if( c==EOF ) { return 0; }
        if( c=='\n' ) { return c; }

        buff[i++] = c;
        while( isalpha(c=getchar()) ) buff[i++] = c;
        buff[i] = '\0';
        printf("%s ", buff);
        if( c != EOF ) ungetc(c,stdin);

        /* pass lexical classes to parser */
        return WORD;
    }

2.2 - A parser may use a hand-made lexical analyzer, cont.
(This slide may be skipped on a first reading)

1) Prepare a grammar file "gramV.bison": definitions, syntactic rules (+ actions), functions

    %{
    #include <...>
    extern int yylex();
    extern void yyerror();
    %}
    %token …
    %%
    symb1 : symb2 symb3 ... { /* actions*/ }
          | symb4 ...       { /* actions*/ }
          ;
    ...
    %%

2) Generate the parser:

    bison -d gramV.bison -o gram.tab.c

   => gram.tab.c, containing: extern int yylex(); yyparse() { ... yylex(); ... } ...

3) Compile + link:

    gcc gram.tab.c lex.yy.c

   => parser (binary) a.out

4) Use the analyzer:

    ./a.out < inputFile

3 - COMBINED USE of flex / bison
3.1 - Possible organization (Ex., Linux Fedora Core)

gramV.bison -> bison -> gram.tab.h, gram.tab.c

    %{
    extern int yylex();
    %}
    %token tok1 tok2 ...          /* lexical class definitions */
    %%
    syntactic rules + actions
    %%
    possible functions

lexV.flex -> flex -> lex.yy.c

    %{
    /* include syntactic analyzer (=> lexical class defs available) */
    #include "gram.tab.h"
    %}
    %%
    /* lexical rules */
    …    return tok1;
    …    return tok2;
    %%

prin.c

    #include <stdio.h>
    extern int yyparse();         /* analyzer prototype */
    int main() {
        …
        yyparse( );               /* launch syntactic parser; it will call yylex() to get tokens */
        …                         /* possible final processing before returning */
    }
    int yywrap( ) { /* … */ return 1; }
    void yyerror(char * s) { fprintf(stderr,"%s\n",s); }

3.2 - Preparing a full parser with flex + bison

Example with previous names / organization:

    bison -d gramV.bison -o gram.tab.c                    => gram.tab.h, gram.tab.c
    flex lexV.flex                                        => lex.yy.c
    gcc gram.tab.c lex.yy.c prin.c -ll -ly -o parserV     => parserV

    -ll : to use the Lex library ( input(), ...)
    -ly : useless under Linux (RedHat, no library proper to bison)

Ex. of possible elementary automation: Makefile

    version =
    parser$(version) : lex$(version).flex gram$(version).bison prin.c
            bison -d gram$(version).bison -o gram.tab.c
            flex lex$(version).flex
            gcc -o parser$(version) gram.tab.c lex.yy.c prin.c -ll

Launch making: make version=V
- this way re-uses lex.yy.c and prin.c across versions, to save disk space. Could also re-use a.out
- a less crude makefile should take care of possible modifications to flex's output, etc.

4.1 - GRAMMAR SPECIFICATION for bison

    %{
    ...        possible C defs or macros:
    ...          - included files
    ...          - variable defs.
    ...          - function defs.
    %}
    possible Yacc definitions
    %%
    syntactic rules
    %%
    possibly other C functions

Example of possibilities:

    #include <...>
    #include <...>
    extern int yylex();
    int total = 0;
    ...
    %token tok1 tok2 ...
    %union { ... }          => see 4.4 "Lexical units' values may be passed on"
    %type
    %start
    %left…

Syntactic rules:
- the first rule implicitly defines the "axiom"
- each rule takes the form (action blocks are optional):

    NTSymbol : Symbol1 Symbol2 ... { /* C Code */ }
             | SymbolP SymbolQ ... { /* C Code */ }
             ...
             ;

4.2 - Bison / Yacc GRAMMARS: SIMPLE EXAMPLES

First symbol = the "axiom" symbol = what the grammar can produce

EX.1

    List        :
                | List Instruction '\n'
                ;
    Instruction : Expression
                | SYMBOL '=' Expression
                ;
    Expression  : Operand
                | Expression '+' Expression
                ;
    Operand     : SYMBOL
                | NUMBER
                ;

    Possible productions:

        toto
        999
        x=1
        y = x + 10

    - each line = a possible Instruction
    - => the whole entry, as well as each list of Instructions = a possible List

EX.2

    Form : Part
         | '(' Form ')'
         | Part Part Part
         ;
    Part : SYMBOL
         | NUMBER
         ;

    Possible productions:

        123 ( MyValue ) 0 ( 100 ) 1000 0(1 2 3 ) (4)

    - each entry = a possible Form
    - this grammar can produce only one Form, not a list of Forms: a corresponding parser will accept only 1 Form
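To make EX.1 concrete, here is a minimal hand-written recognizer in C for a single Instruction. It is a sketch under stated assumptions, not what bison produces: bison builds a table-driven LALR parser, whereas this uses recursive descent, so the left-recursive rule for Expression is rewritten as "an Operand followed by any number of '+' Operand". The lexical classes are also assumptions: SYMBOL is taken to be a letter followed by letters or digits, NUMBER a run of digits. All function names are hypothetical.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

static const char *skip_blanks(const char *p) {
    while (*p == ' ' || *p == '\t') p++;
    return p;
}

/* Operand : SYMBOL | NUMBER
 * Returns the position after the operand, or NULL if none is found. */
static const char *operand(const char *p) {
    p = skip_blanks(p);
    if (isalpha((unsigned char)*p)) {            /* assumed SYMBOL shape */
        while (isalnum((unsigned char)*p)) p++;
        return p;
    }
    if (isdigit((unsigned char)*p)) {            /* assumed NUMBER shape */
        while (isdigit((unsigned char)*p)) p++;
        return p;
    }
    return NULL;
}

/* Expression : Operand | Expression '+' Expression
 * handled here as: Operand ( '+' Operand )* */
static const char *expression(const char *p) {
    p = operand(p);
    while (p != NULL) {
        const char *q = skip_blanks(p);
        if (*q != '+') break;
        p = operand(q + 1);
    }
    return p;
}

/* Instruction : Expression | SYMBOL '=' Expression
 * Returns true if the whole line (without its '\n') is one Instruction. */
bool is_instruction(const char *line) {
    const char *p = skip_blanks(line);

    /* look ahead for the "SYMBOL '='" prefix of the second alternative */
    if (isalpha((unsigned char)*p)) {
        const char *q = p;
        while (isalnum((unsigned char)*q)) q++;
        q = skip_blanks(q);
        if (*q == '=') p = q + 1;
    }

    p = expression(p);
    if (p == NULL) return false;
    return *skip_blanks(p) == '\0';
}
```

Under these assumptions the four sample lines of EX.1 (`toto`, `999`, `x=1`, `y = x + 10`) are all accepted, while a line such as `1 = 2` is rejected because only a SYMBOL may appear to the left of '='.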

4.3 - yacc / bison SPECIFICATIONS: ACTION BLOCKS

Each rule may have an associated action block (C code)

EX.1' (part of previous EX.1, modified):

    List :                         { printf("..."); }
         | List Instruction '\n'   { printf("..."); }
         ;

The actions may:
- operate on global variables:
      yytext  yyleng  yylineno
      yylval  $1, $2 ...  $$ ...      => see 4.4 "Lexical units' values may be passed on"
- use functions or macros:
      ECHO  yyerror( )
      yyerrok  yyclearin ...          to reset the parser's state after error

These are Bison predefined variables / macros. In addition the user may define others.

4.4 - Lexical units' values may be passed on

For each recognized lexical unit, the lexical analyzer yylex( ):
- returns a lexical class (token)
- may prepare a value:
    . in the predefined variable yylval
    . in a field suited to the type
  (yylval: a union of user-defined variables)

    %%
    EXPREGX  { /* … */ yylval.tokValk = .. ; return TOKCLASS; }
    …
    %%

For each lexical unit it gets, the parser yyparse( ):
- matches it with a tail symbol Tk in k-th position of some rule R
- the action part of R may:
    . use as $k the value associated with Tk
    . give a value $$ to the head symbol

    %token <tokVal1> TOKCLASS1
    %token <tokVal2> TOKCLASS2
    %union { Ctype1 tokVal1; Ctype2 tokVal2; ... };
    %%
    ...
    HeadR : SymbolS TOKCLASS ...  { printf("...", $i ); $$ = $2 ; }
          ;
    …
    %%

- the required definitions are a bit technical: refer to the example on the next slide
- with previous Lex/Yacc, it was simpler, but weaker: yylval was just an int
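The way $k and $$ relate to yylval can be pictured with a small value stack in C. This is a simplified model, assuming a fixed-size stack and leaving out the symbol/state stack that a real bison parser keeps alongside it; `SemValue`, `shift_value`, `dollar` and `reduce_values` are hypothetical names used only for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the union type that a %union declaration describes. */
typedef union {
    int intVal;
    const char *string;
} SemValue;

enum { STACK_MAX = 32 };
static SemValue stack[STACK_MAX];   /* values of the symbols recognized so far */
static size_t depth = 0;

/* Shift: the parser saves the value the lexer left in yylval
 * for the token it has just accepted. */
void shift_value(SemValue lexerValue) {
    assert(depth < STACK_MAX);
    stack[depth++] = lexerValue;
}

/* $k inside the action of a rule with n right-hand symbols
 * (k counts from 1, as in bison). */
SemValue dollar(size_t n, size_t k) {
    assert(k >= 1 && k <= n && n <= depth);
    return stack[depth - n + (k - 1)];
}

/* Reduce: the n right-hand values are dropped and replaced by the
 * value $$ that the action gave to the head symbol. */
void reduce_values(size_t n, SemValue head) {
    assert(n <= depth && depth - n < STACK_MAX);
    depth -= n;
    stack[depth++] = head;
}

size_t stack_depth(void) {
    return depth;
}
```

For a rule with three right-hand symbols whose action is `$$ = $1 + $3`, a caller would shift three values, compute the sum from `dollar(3, 1)` and `dollar(3, 3)`, then call `reduce_values(3, head)`; the head symbol's value is afterwards available as `dollar(1, 1)` to whichever rule uses it next.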

4.4 - Lexical units' values may be passed on: Example

recordGram1.bison

    %token ',' '\n' ...                   /* declare lexical classes with no value */
    %token <intVal> INT …                 /* declare lexical classes with values */
    %token <string> NAME …
    %type <intVal> Record RecordList …    /* declare a value type for syntactic symbols
                                             that will be given values */
    %union {
        int intVal;                       /* yylval will be usable to pass */
        char * string;                    /* two different types of values */
    }
    %%
    RecordList :
               | RecordList Record        /* here Record's value would be $2 */
               ;
    Record : INT NAME ',' INT NAME '\n'
                 { printf("%d %s, %d %s ",$1,$2,$4,$5);
                   $$ = $1 + $4; }        /* here a Record symbol is given a value
                                             (a RecordList could sum up these values) */
           ;
    ...
    %%

recordLex1.flex

    %{
    #include "recordGram.tab.h"
    %}
    %%
    [0-9]+   { yylval.intVal = /* */ ; return INT; }
    [a-z]+   { yylval.string = /* */ ; return NAME; }
    [,\n]    return yytext[0];
    ...
    %%

In the first two lexical rules, suitable values must be computed and passed through yylval before returning the lexical class.
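The slide leaves the computation of the two yylval fields open. One possible way to fill them in, shown here as plain C helpers rather than as the author's solution, is to convert the matched digits to an int and to copy the matched letters into fresh storage. The copy matters because flex reuses yytext for the next match, so a pointer into it would not stay valid. The helper names are hypothetical; in a flex action they would presumably be called with yytext and yyleng.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper for the [0-9]+ rule: the value of an INT token,
 * computed from the matched text. */
int int_token_value(const char *text) {
    return (int)strtol(text, NULL, 10);
}

/* Hypothetical helper for the [a-z]+ rule: a private, NUL-terminated
 * copy of the first len characters of the matched text.
 * Returns NULL if memory cannot be allocated; the caller owns the
 * result and releases it with free(). */
char *name_token_value(const char *text, size_t len) {
    char *copy = malloc(len + 1);
    if (copy == NULL) return NULL;
    memcpy(copy, text, len);
    copy[len] = '\0';
    return copy;
}
```

With such helpers, the two actions could read `yylval.intVal = int_token_value(yytext);` and `yylval.string = name_token_value(yytext, yyleng);`, with the grammar's actions taking responsibility for freeing the strings once they have been printed.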