Laboratory exercises - Department of Information and Computing ...

Appendix A

Laboratory exercises

A.1 Parsing BibTeX files

A.1.1 Introduction

The first exercise in this compiler construction course consists of writing a parser for BibTeX using the latest implementation of parser combinators. The purpose of this exercise is to refresh your memory of parsing, parsers being basic ingredients of compilers. BibTeX is a data format for describing bibliographic data, i.e., data about books, scientific articles in journals, theses, etc. A bib database can be queried to get the entries that are cited in a document, and the entries obtained can be typeset in a variety of ways. This makes the bibliographic data portable and independent of a specific document or document style. An example entry is

  @PhdThesis{Vis97.thesis,
    author = {Visser, Eelco},
    title = {Syntax Definition for Language Prototyping},
    year = {1997},
    month = {September},
    school = {University of Amsterdam}
  }

The bibtex tool reads the citations from a document, retrieves the corresponding entries from a BibTeX database (file) and outputs the entries formatted in the document preparation language LaTeX. For example, the entry above is formatted as

  \bibitem{Vis97.thesis}
  Eelco Visser.
  \newblock {\em Syntax Definition for Language Prototyping}.
  \newblock PhD thesis, University of Amsterdam, September 1997.

The Assignment

The goal of this assignment is to produce a Haskell program that defines

• a data type description for BibTEX in Haskell, and


• a parser for BibTeX using the parser combinators described in Chapter 2. The parser should produce an element of the BibTeX data type definition.

Tips

• The complete syntax of BibTeX as accepted by bibtex is described in Section A.1.3 using the syntax definition formalism EBNF. This definition can be used as a guideline in writing the parser and is considered to define the syntax of BibTeX for this assignment.

• Several aspects of the BibTeX syntax are rather tricky. In the next section five levels of BibTeX are defined, ranging from easy to difficult. Try making a parser for level n before you try level n + 1. If you don't succeed in making a parser for full BibTeX, hand in a parser for the highest level that you can deal with. Your grade will be twice the highest level that you hand in correctly.

• Several example input files that can be used for testing your parser are provided at the web page mentioned on the previous page.
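As a starting point for the first deliverable, the following is a minimal sketch of one possible abstract syntax. All type and constructor names (BibTeX, Entry, Field, Value and their constructors) are assumptions made for illustration; the assignment does not prescribe them, and a different design may suit your parser better.

```haskell
-- One possible abstract syntax for BibTeX files (names are illustrative).

-- | A BibTeX file as a list of entries; free text between entries is dropped.
newtype BibTeX = BibTeX [Entry]
  deriving (Show, Eq)

data Entry
  = Comment   String                 -- @comment{...}
  | Preamble  Value                  -- @preamble{...}
  | StringDef Field                  -- @string{name = value}
  | Regular   String String [Field]  -- entry name, key, fields
  deriving (Show, Eq)

-- | A field: name "=" value.
data Field = Field String Value
  deriving (Show, Eq)

data Value
  = Ref    String       -- a name (which also covers numbers)
  | Braced String       -- {...}
  | Quoted String       -- "..."
  | Concat Value Value  -- value # value
  deriving (Show, Eq)

-- | Part of the level 1 example entry, written out by hand.
chicago :: Entry
chicago = Regular "INBOOK" "chicago"
  [ Field "title" (Quoted "The Chicago Manual of Style")
  , Field "year"  (Ref "1982")
  ]
```

Keeping Value as a small tree rather than a flat string leaves the resolution of names and of the # operator to a later pass.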

A.1.2 BibTeX - Level 1 to 5

Level 1: Regular Entries, String Values

At this level the parser can deal with regular entries (nonterminal EEntry in the EBNF definition), i.e., no comment, preamble or string entries. The field values of these regular entries are numbers or strings delimited by double quotes that do not themselves contain double quotes. Entry names and keys are simple identifiers, as in Java and Haskell. The fields are enclosed in curly braces only. A level 1 parser can parse the file level1.bib. An example of such an entry is:

  @INBOOK{chicago,
    title = "The Chicago Manual of Style",
    publisher = "University of Chicago Press",
    edition = "Thirteenth",
    year = 1982,
    pages = "400--401",
    key = "Chicago"
  }
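The level 1 restrictions can be sketched as a small parser. This sketch uses Text.ParserCombinators.ReadP from the base library as a stand-in for the Chapter 2 combinators, which it does not reproduce; the types and the helper names (lexeme, symbol, identifier, parseEntry) are assumptions made for this example only.

```haskell
import Data.Char (isAlpha, isAlphaNum, isDigit)
import Text.ParserCombinators.ReadP

-- Illustrative result types, not prescribed by the assignment.
data Value = Number Int | Str String deriving (Show, Eq)
data Entry = Entry String String [(String, Value)] deriving (Show, Eq)

-- Run a parser, then skip trailing layout (spaces, tabs, newlines).
lexeme :: ReadP a -> ReadP a
lexeme p = p <* skipSpaces

symbol :: Char -> ReadP Char
symbol = lexeme . char

-- Simple identifier: a letter followed by letters, digits or underscores.
identifier :: ReadP String
identifier =
  lexeme ((:) <$> satisfy isAlpha <*> munch (\c -> isAlphaNum c || c == '_'))

-- Level 1 values: a number, or a double-quoted string without double quotes.
value :: ReadP Value
value = lexeme (number +++ quoted)
  where
    number = Number . read <$> munch1 isDigit
    quoted = Str <$> (char '"' *> munch (/= '"') <* char '"')

field :: ReadP (String, Value)
field = (,) <$> identifier <* symbol '=' <*> value

-- "@" EName "{" Key "," fields "}" with curly braces only, no trailing comma.
entry :: ReadP Entry
entry = Entry <$> (symbol '@' *> identifier)
              <*> (symbol '{' *> identifier <* symbol ',')
              <*> (field `sepBy` symbol ',')
              <*  symbol '}'

-- Succeeds only if the whole input is exactly one level 1 entry.
parseEntry :: String -> Maybe Entry
parseEntry s = case readP_to_S (skipSpaces *> entry <* eof) s of
  [(e, "")] -> Just e
  _         -> Nothing
```

Wrapping every token parser in lexeme is one common way to implement the LAYOUT rule: layout is skipped once after each lexical element instead of being mentioned in every production.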

Level 2: Strings and Preambles

At level 2 two other types of entries, i.e., preamble (EPreamble) and string (EString), are added. Field values can now also be names, but otherwise the restriction on field values remains the same. The keywords "preamble" and "string" should be in lowercase only. A level 2 parser can parse the file level2.bib. Examples of non-regular entries and the use of names are:

  @preamble{"% Bibliography generated from level2.bib"}

  @string{tosem = "ACM Transactions on Software Engineering and Methodology"}


  @Article{BV96,
    author = "Van den Brand, Mark G. J. and Visser, Eelco",
    title = "Generation of Formatters for Context-free Languages",
    journal = tosem,
    year = 1996,
    month = "January",
    volume = 5,
    number = 1,
    pages = "1--41"
  }

Level 3: Parentheses, Commas and Hash

A few more peculiarities are added:

(1) Instead of { and } to delimit entries, ( and ) can also be used for all types of entries:

  @string (SCRIBE-NOTE = "Chapter 12 deals with bibliographies")

(2) The last field of an entry can be followed by a comma.

(3) Field value strings can be concatenated by the concatenation operator #. For example,

  @string{trans = "ACM Transactions on "}
  @string{toplas = trans # "Programming Languages and Systems"}
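The concatenation operator in item (3) fits the usual chain pattern of parser combinators. Below is a minimal sketch using chainl1 from Text.ParserCombinators.ReadP (base library) as a stand-in for the course combinators; the Value type and the names atom, value and parseValue are hypothetical, and only quoted strings and names are handled.

```haskell
import Data.Char (isAlphaNum)
import Text.ParserCombinators.ReadP

-- Illustrative value type: a name, a quoted string, or a concatenation.
data Value = Name String | Str String | Concat Value Value
  deriving (Show, Eq)

-- A single operand: a quoted string without double quotes, or a name.
atom :: ReadP Value
atom = (Str <$> (char '"' *> munch (/= '"') <* char '"'))
   +++ (Name <$> munch1 (\c -> isAlphaNum c || c `elem` "-_/"))

-- Value "#" Value, parsed as a left-nested chain of operands.
value :: ReadP Value
value = atom `chainl1` (Concat <$ hash)
  where hash = skipSpaces *> char '#' <* skipSpaces

-- Succeeds only if the whole input is one value.
parseValue :: String -> Maybe Value
parseValue s = case readP_to_S (value <* eof) s of
  [(v, "")] -> Just v
  _         -> Nothing
```

Parsing the chain left-nested is an arbitrary choice here; since string concatenation is associative, a right-nested tree would denote the same string.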

A level 3 parser can parse the file level3.bib.

Level 4: Fields with Quotes and Curly Braces

At level 4 the parser can deal with field values that contain double quotes and nested pairs of curly braces. Note that a double quote can occur in a field value only when enclosed in { and }. Also note that field values can contain newlines. A level 4 parser can parse the file level4.bib. An example entry with these phenomena is:

  @PhdThesis{Aas92,
    author = {Aasa, Annika},
    title = {User Defined Syntax},
    year = {1992},
    school = {Department of Computer Sciences, Chalmers University of Technology and University of {G\"oteborg}},
    address = "S-412 96 {G\"oteborg}, Sweden"
  }
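Nested curly braces call for a recursive parser. The following sketch handles brace-delimited values only, again with ReadP from the base library standing in for the course combinators; the names valWords, braced and parseBraced are assumptions, chosen to echo the nonterminal ValWords of the EBNF definition.

```haskell
import Text.ParserCombinators.ReadP

-- ValWords: a sequence of brace-free text and nested {...} groups.
-- Inner braces are kept verbatim in the result.
valWords :: ReadP String
valWords = concat <$> many (valWord <++ nested)
  where
    valWord = munch1 (`notElem` "{}")   -- may contain quotes and newlines
    nested  = (\s -> "{" ++ s ++ "}") <$> (char '{' *> valWords <* char '}')

-- A complete brace-delimited value; returns the text between the outer braces.
braced :: ReadP String
braced = char '{' *> valWords <* char '}'

-- Succeeds only if the whole input is one balanced braced value.
parseBraced :: String -> Maybe String
parseBraced s = case readP_to_S (braced <* eof) s of
  [(v, "")] -> Just v
  _         -> Nothing
```

A value delimited by double quotes can reuse the same nested parser, with the brace-free text additionally excluding the double quote.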

Level 5: Comment (Full BibTeX)

To deal with full BibTeX we have to make four final modifications to the parser:

(1) Keywords should not be case sensitive, i.e., String, sTrIng, and STRING are all valid and equivalent keywords to indicate a string entry.

(2) Entry keys have a much more liberal syntax than just identifier syntax.

(3) Comment can be included using comment entries.

(4) Comment entries are not actually needed (although the parser should support them), since any text between entries that does not contain @ is considered comment in BibTeX.

A level 5 parser can parse the file btxdoc.bib, which documents all the weird things allowed by bibtex. An example fragment of that file with comment is:

  Copyright (C) 1988, all rights reserved.

  @COMMENT{You may put a comment in a ‘comment’ command,
  the way you would with SCRIBE.}

  Or you may dispense with the command and simply give the comment,
  as long as it's not within an entry.

  @MISC{prime-number-theorem,
    author = "Charles Louis Xavier Joseph de la Vall{\'e}e Poussin",
    note = "A strong form of the prime number theorem, 19th century"
  }
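Modifications (1) and (4) can be sketched in a few lines. The sketch below uses ReadP from the base library rather than the course combinators, and the names keyword, freeComment and startsStringEntry are hypothetical helpers introduced only for this illustration.

```haskell
import Data.Char (toLower)
import Text.ParserCombinators.ReadP

-- Match a keyword regardless of case, e.g. "string", "String", "sTrIng".
-- Returns the text as it was actually written in the input.
keyword :: String -> ReadP String
keyword = mapM (\k -> satisfy (\c -> toLower c == toLower k))

-- Free comment between entries: any text that does not contain '@'.
freeComment :: ReadP String
freeComment = munch (/= '@')

-- True if, after leading free comment, the input starts with "@string"
-- in any capitalisation. It does not check what follows the keyword.
startsStringEntry :: String -> Bool
startsStringEntry s =
  not (null (readP_to_S (freeComment *> char '@' *> keyword "string") s))
```

In a full parser, freeComment would be run before the first entry and after every entry, mirroring the nonterminal C in the EBNF definition.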

A.1.3 BibTeX in EBNF

Notes:

1. The syntax is not complete. For example, the nonterminal "EName" can also produce the word string, but as this is a keyword it should be excluded.

2. LAYOUT is allowed between 'lexical' elements. Lexical elements are elements that form a unit, e.g. numbers, identifiers, strings between " " or { }, or characters that stand for themselves like @ ( ) outside of strings.

3. ~[...] means all characters except those between the [ ].

4. \t = tab, \n = newline.

5. The notation {Field ","}* means a (possibly empty) sequence of Fields separated by commas. This is a bit awkward as the * is already used to denote sequences in BNF. It should be read here as if defined by one of the pChain parser combinators.

  Bibtex:      C {Entry C}* C

  Entry:       EComment | EPreamble | EString | EEntry

  EComment:    "@" Comment "{" ~[\}]+ "}"
             | "@" Comment "(" ~[\)]+ ")"

  EPreamble:   "@" Preamble "{" Value "}"
             | "@" Preamble "(" Value ")"

  EString:     "@" String "{" Field "}"
             | "@" String "(" Field ")"

  EEntry:      "@" EName "{" Key "," {Field ","}* ","? "}"
             | "@" EName "(" Key "," {Field ","}* ","? ")"

  String:      [Ss][Tt][Rr][Ii][Nn][Gg]
  Preamble:    [Pp][Rr][Ee][Aa][Mm][Bb][Ll][Ee]
  Comment:     [Cc][Oo][Mm][Mm][Ee][Nn][Tt]

  LAYOUT:      [\ \t\n]*
  C:           ~[\@]*

  Field:       Name "=" Value

  Value:       Name
             | "{" ValWords "}"
             | "\"" ValWordsDQ "\""
             | Value "#" Value

  ValWord:     ~[\{\}]+
  ValWordDQ:   ~[\{\}"]+
  ValWords:    (ValWord | ("{" ValWords "}"))*
  ValWordsDQ:  (ValWordDQ | ("{" ValWords "}"))*

  Name:        [A-Za-z0-9\-\_\/]+
  EName:       Name
  Key:         ~[\ \t\n\,\=\{\}\@]+
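The fragment {Field ","}* ","? from the EEntry production (see note 5) can be read as a separated list followed by an optional trailing comma. Here is a minimal sketch of that reading, using sepBy and optional from Text.ParserCombinators.ReadP (base library) as a stand-in for the pChain-style combinators of the course; commaList and parseNames are hypothetical names, and plain alphabetic words stand in for fields.

```haskell
import Data.Char (isAlpha)
import Text.ParserCombinators.ReadP

-- {p ","}* ","?  : zero or more p separated by commas,
-- optionally followed by one trailing comma.
commaList :: ReadP a -> ReadP [a]
commaList p = (p `sepBy` comma) <* optional comma
  where comma = skipSpaces *> char ',' <* skipSpaces

-- Demonstration with alphabetic words in place of fields.
parseNames :: String -> Maybe [String]
parseNames s = case readP_to_S (commaList (munch1 isAlpha) <* eof) s of
  [(xs, "")] -> Just xs
  _          -> Nothing
```

Because the separator is consumed only when another item follows, the same combinator accepts lists with and without the trailing comma allowed from level 3 onwards.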

A.1.4 Template for Parsing with Combinators

See http://www.cs.uu.nl/docs/vakken/ipt.

A.2 Generating OZIS files from BibTeX files

A.2.1 Introduction

Scientific articles are often written using LaTeX. Part of LaTeX is the BibTeX bibliography database subsystem, a tool which uses a text-based format for storing bibliography entries in such a way that they are easily incorporated in publications written with LaTeX. However, the BibTeX format is not the only existing format for storing bibliographies. In particular, employees of the Institute of Information and Computing Sciences at Utrecht are required to submit their publications in a format specifically used by the University for its publications, the format used by OZIS (universitaire OnderZoek InformatieSysteem). The purpose of this exercise is to write a tool which translates bibliography entries written in BibTeX to the format required by OZIS. The following is an example of what is accepted by this tool (i.e. BibTeX format):

  @book{dijk01ipt-lecturenotes
    , title = {{Implementation of Programming Languages, Lecture Notes}}
    , author = {Dijkstra, Atze and Swierstra, Doaitse}
    , publisher = {Utrecht University, Institute of Information and Computing Sciences}
    , year = {2001}}