Application Driven Software Methodology for Automatic Compiler Generation

Teodor Rus and Donald Curtis
Department of Computer Science, University of Iowa, Iowa City, IA 52242

Abstract

Software tools were created in order to ease the compiler generation task. But the demand for new programming languages and thus for new compilers grows with the domain of computer applications. The new approaches for language design and compiler implementation, such as those spawned by the domain specific languages, rely on programming and are not easily followed by problem experts. On the other hand compiler design and implementation is a topic frequently eliminated from the curriculum. The consequence is that while the demand for programming language design and compiler implementation increases, expertise in the conventional methods for solving this problem decreases. Since one cannot expect the compiler to disappear from software systems of future computers we need new approaches for language design and compiler implementation that would not rely on conventional knowledge that sits at the basis of current technology. We are developing a new computer-based problem solving methodology where application domain (AD) experts develop problem solving algorithms using the natural language of their problem domains and information technology (IT) experts develop software tools that can execute the algorithms developed by AD experts without translating them into programs. This paper illustrates this new computer-based problem solving methodology using the domain of programming language design and compiler implementation. We show how a programming language expert can generate a parser without programming. A similar approach can be developed for all other compiler components as well as for their integration into a compiler.

1 Introduction

Since its inception, software engineering has suffered from a “chronic problem”: each new method proposed as the “cure” suffers from the same problem [Par06]. The problem is communication. The traditional software process, such as the waterfall model, is seen as a transformation from higher level software abstractions, defined in the requirements documents, to low level implementations, as code. At the first step of the process the information given by the domain expert (customer) is passed to the developer, who is responsible for writing requirements documents based on the customer’s wants and needs. The development of the software is immediately out of the hands of the domain expert and passes to the software engineer who, not being an AD expert, is forced to explain the AD concepts from his point of view, as software in the IT domain. While requirements documents are intended to express the problem in terms of the application domain, they end up serving as the initial representation of the final software system in terms of the IT domain. The problem is that the domain expert’s requirements are frozen at the beginning of the software development process [Raj06], and this causes a breakdown in communication between the domain expert and software developers. When changes happen in the AD after the start of the software development lifecycle, the requirements must be modified and the process restarted so that the changes can propagate through the rest of the development process.

Iterative methods, such as Agile software development, use multiple, quickly executed iterations of the traditional software process in order to shorten the time during which the requirements are frozen. The benefit is that at each iteration the requirements can be modified and the change propagated down to the implementation level. But this process ends up suffering from the same problems [Par06, Raj06].

In order to approach the communication problem we believe a relationship between the AD and IT must be preserved throughout the software lifecycle. To illustrate how our approach differs from others we consider Model Driven Architecture and eXtreme Programming as approaches to improving communication in the software engineering process.

Model Driven Architecture (MDA) [Bro04] is a formalization of the traditional waterfall model in which the domain expert’s requirements are specified as high-level Computation Independent Models (CIM). These are transformed to Platform Independent Models (PIM), which represent the software system independent of the hardware, operating system, and programming language, and finally to Platform Dependent Models (PDM), which are expressed in a programming language and represent the implementation of the system [Bro04, MSUW04, Sei03]. MDA provides a formalized method for communicating the system structure from the system architects to the programmers and augments this process with automation, but from the point of creation these models represent domain concepts in the IT domain rather than domain concepts as they exist in the domain environment in which they operate.

eXtreme Programming [BA05] considers software development a process that evolves as a monologue between software developers who “tell stories” and “generate code”. The problem is that software developers are creating software for a domain in which they are not experts. Thus, the monologue is about domain concepts as they exist in the IT domain but not as they exist in the domain environment in which they operate.

These methodologies address communication problems occurring between software engineers but not the communication problem between the domain expert and the software engineer. We believe this communication gap exists as the result of a gap between the domains of the domain expert and the IT expert. We also believe that this gap must exist, because the domains exist independently of each other. The methodology that we propose in this paper is a software development process where the domain expert and software developer can work independently and in parallel. We have addressed the problem by initiating Application Driven Software (ADS) [RC06], where problem domain experts use their natural languages to express problem solving algorithms while computer experts develop tools that map domain experts’ solutions into computer processes. This provides a foundation for a new computer based problem solving methodology based on Computational Emancipation of the Application Domain, which means:

1. domain knowledge is structured using an appropriate domain ontology, and
2. concepts in the domain ontology are associated with computer artifacts (programs and processes) as computational meaning.

With this methodology the problem domain is computationally characterized using domain abstractions that are identified and defined by domain experts and implemented by IT experts. In the ontology of the application domain these abstractions are domain characteristic terms associated with the processes that implement them, developed by IT experts. The software development process is then based on well-defined concepts that characterize the application domain of the software thus developed. In this paper we choose language processing as the application domain to illustrate this new problem solving methodology.

2 Problem Definition

Languages are (potentially) infinite sets of strings over given alphabets. For an infinite language $L$, the specification of its elements is usually provided by a finite mechanism that can enumerate (or generate) all elements of $L$. We consider here languages that can be specified by context-free grammars, which are quadruples $G = (N, \Sigma, P, S)$ where $N$ and $\Sigma$ are finite sets of symbols called nonterminals (or variables) and terminals (or lexical elements) respectively, $N \cap \Sigma = \emptyset$, and $P$ is a set of rules of the form $lhs \rightarrow rhs$ called productions, where $lhs \in N$ and $rhs = t_0 N_1 t_1 \ldots t_{n-1} N_n t_n$ for $n \geq 0$, $t_0, t_1, \ldots, t_n \in \Sigma^*$, $N_1, N_2, \ldots, N_n \in N \cup \{\epsilon\}$, and $S \in N$ is a nonterminal symbol designated as the start symbol or the axiom. The specification mechanism is provided by a process of nonterminal rewriting called derivation [AU72], which is defined by the following rules:

1. For $x, y \in (N \cup \Sigma)^*$ we say that $x$ directly derives $y$ using the rule $A \rightarrow \alpha \in P$ if $x = \beta_1 A \beta_2$ and $y = \beta_1 \alpha \beta_2$, and denote it by $x \stackrel{A \rightarrow \alpha}{\Rightarrow} y$, or by $x \Rightarrow y$ if $A \rightarrow \alpha$ does not need to be specified;

2. For $x, y \in (N \cup \Sigma)^*$ we say that $x$ derives $y$, and denote it by $x \Rightarrow^* y$, if there are $y_0, y_1, \ldots, y_n$ such that (a) $x = y_0$, (b) $y_i \Rightarrow y_{i+1}$, $i = 0, 1, \ldots, n-1$, and (c) $y = y_n$.

The language specified by a grammar $G = (N, \Sigma, P, S)$ is the set of strings $L(G) = \{w \in \Sigma^* \mid S \Rightarrow^* w\}$. For $w \in L(G)$, a derivation $S \Rightarrow^* w$ is visualized by a tree, called a derivation or parse tree [AU72, ASU86], whose root is labeled by $S$, whose leaves are labeled by the symbols of $\Sigma$ used in $w$, and whose interior nodes are labeled by the lhs of the rules used in the derivation process.
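The two derivation rules can be illustrated with a minimal sketch. It assumes a hypothetical example grammar S → a S b | ε (not taken from this paper) and represents a sentential form as a tuple of symbols; the names `RULES`, `directly_derives`, and `derive` are illustrative only.

```python
from typing import List, Sequence, Tuple

Symbol = str
Rule = Tuple[Symbol, Tuple[Symbol, ...]]  # (lhs, rhs); an empty rhs stands for epsilon

# Hypothetical example grammar: S -> a S b | epsilon
RULES: List[Rule] = [
    ("S", ("a", "S", "b")),
    ("S", ()),
]


def directly_derives(x: Tuple[Symbol, ...], rule: Rule, position: int) -> Tuple[Symbol, ...]:
    """Rewrite x = beta1 A beta2 into beta1 alpha beta2 using the rule A -> alpha."""
    lhs, rhs = rule
    if x[position] != lhs:
        raise ValueError(f"symbol at index {position} is {x[position]!r}, not {lhs!r}")
    return x[:position] + rhs + x[position + 1:]


def derive(start: Symbol, steps: Sequence[Tuple[int, int]]) -> List[Tuple[Symbol, ...]]:
    """Apply (rule index, position) steps in order and return every sentential form."""
    forms = [(start,)]
    for rule_index, position in steps:
        forms.append(directly_derives(forms[-1], RULES[rule_index], position))
    return forms


if __name__ == "__main__":
    # S => a S b => a a S b b => a a b b
    for form in derive("S", [(0, 0), (0, 1), (1, 2)]):
        print(" ".join(form))
```

Each call to `directly_derives` is one step of rule 1; the list returned by `derive` is the sequence y0, y1, ..., yn of rule 2.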

2.1 Language Parsers

One of the key problems in the domain of language processing is: for a particular language $L$, specified by the grammar $G = (N, \Sigma, P, S)$, and a string $w \in \Sigma^*$, develop an algorithm that decides whether $w \in L(G)$. Such an algorithm is called a language decider. Parsers are implementations (i.e., programs) of language deciders that solve this problem. In addition, since parsers are practical tools used in various applications, if $w \in L(G)$ then the parser also constructs the derivation of $w$ using $G$. In this paper we introduce a procedure that allows language processing professionals to design and implement parsers without developing the programs that implement the underlying decision algorithms. Hence, the users of this procedure are people involved in language processing (such as compiler constructors, students learning compiler construction, computational linguists, and many others) who are language processing professionals and who don’t know (or don’t want to know) the intricacies of the programs that implement parsers.

There are two strategies for parsing algorithms: top-down and bottom-up. With the top-down strategy the algorithm takes as input $G$ and $w$ and discovers the derivation $S \Rightarrow^* w$, if one exists, by constructing a derivation tree for $w$ starting with the root of the tree, which is labeled by the axiom $S$, and performing nonterminal rewriting operations. With the bottom-up strategy the algorithm takes as input $G$ and $w$ and discovers a derivation $S \Rightarrow^* w$, if one exists, by constructing a derivation tree for $w$ starting with the leaves of the tree, which are nodes labeled by symbols of $w$, and performing “handle-pruning operations” [ASU86]. The general algorithms that result from both strategies are exponential in the length of $w$ and therefore are of little practical interest.
In addition, in some applications, such as compiler construction, the derivation discovered by the parser must be unique; otherwise it is difficult to associate an appropriate computational meaning (semantics) with $w$. But the problem of deciding whether a context-free grammar $G$ is ambiguous is unsolvable [Sip06]. This situation, combined with the utilitarian nature of parsers, led to constraints imposed on the grammar $G = (N, \Sigma, P, S)$ which ensure both that the parsers of the language $L(G)$ are efficient, i.e., polynomial in the length of the input, and that, when $w \in L(G)$, they generate a unique derivation of $w$. Examples of efficient parsing algorithms are:

1. The Earley algorithm, which can be used for unrestricted context-free grammars, generates all derivations of $w$, and has complexity $O(|w|^3)$, where $|w|$ is the length of $w$. This algorithm is mostly used for natural language parsing.

2. LL algorithms, which are top-down algorithms that read the input from Left to right and generate Leftmost derivations. LL algorithms are based on grammar restrictions that ensure that for any two distinct rules $A \rightarrow \alpha$, $A \rightarrow \beta \in P$ the following equality holds: $first(k, \alpha \circ follow(A)) \cap first(k, \beta \circ follow(A)) = \emptyset$, where $first(k, x) = \{y \in \Sigma^* \mid x \Rightarrow^* y\alpha, |y| \leq k\}$, $follow(A) = \{y \in \Sigma^* \mid S \Rightarrow^* \gamma A y\}$, and $\circ$ is the string concatenation operator.

3. LR algorithms, which are bottom-up algorithms that read the input from Left to right and generate Rightmost derivations. LR algorithms are based on grammar restrictions that ensure that for any rightmost derivation $S \Rightarrow^* \alpha A x \Rightarrow \alpha\beta x$, where $x \in \Sigma^*$ and $A \in N$, knowing the string $\alpha\beta$ and at most the first $k$ symbols of $x$ we can uniquely determine the rule that was used in the previous step of the rightmost derivation. In other words, $G$ must satisfy the following condition: if $S \Rightarrow^* \alpha A w \stackrel{A \rightarrow \beta}{\Rightarrow} \alpha\beta w$, $S \Rightarrow^* \gamma B x \stackrel{B \rightarrow \beta}{\Rightarrow} \alpha\beta y$, and $first(k, w) = first(k, y)$, then $\alpha A y = \gamma B x$, i.e., $\alpha = \gamma$, $A = B$, and $x = y$.

The algorithms of practical interest are LL(1) and LR(1), which run in O(n) time and O(n) space, where n is the length of the input stream [Tay02].
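For k = 1 the LL condition above can be checked mechanically. The following is a minimal sketch, not the isLL1 implementation referred to in this paper: it uses the textbook fixpoint computation of first and follow sets, represents a grammar as a hypothetical dictionary from nonterminals to lists of right-hand sides, uses the empty string for ε, and assumes an end-of-input marker `$`.

```python
from typing import Dict, List, Set, Tuple

Grammar = Dict[str, List[Tuple[str, ...]]]  # nonterminal -> alternative right-hand sides
EPS = ""   # stands for epsilon inside first sets
END = "$"  # end-of-input marker (an assumption of this sketch)


def first_of_string(symbols: Tuple[str, ...], first: Dict[str, Set[str]], grammar: Grammar) -> Set[str]:
    """first(1, x) for a string x of grammar symbols, given first sets of nonterminals."""
    result: Set[str] = set()
    for sym in symbols:
        sym_first = first[sym] if sym in grammar else {sym}
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    result.add(EPS)  # every symbol of x can derive epsilon
    return result


def compute_first(grammar: Grammar) -> Dict[str, Set[str]]:
    first: Dict[str, Set[str]] = {a: set() for a in grammar}
    changed = True
    while changed:  # iterate until no first set grows
        changed = False
        for a, alternatives in grammar.items():
            for rhs in alternatives:
                new = first_of_string(rhs, first, grammar)
                if not new <= first[a]:
                    first[a] |= new
                    changed = True
    return first


def compute_follow(grammar: Grammar, start: str, first: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    follow: Dict[str, Set[str]] = {a: set() for a in grammar}
    follow[start].add(END)
    changed = True
    while changed:
        changed = False
        for a, alternatives in grammar.items():
            for rhs in alternatives:
                for i, sym in enumerate(rhs):
                    if sym not in grammar:
                        continue  # only nonterminals have follow sets
                    tail = first_of_string(rhs[i + 1:], first, grammar)
                    new = tail - {EPS}
                    if EPS in tail:
                        new |= follow[a]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow


def is_ll1(grammar: Grammar, start: str) -> bool:
    """True if the sets first(1, alpha . follow(A)) of the alternatives of each A are disjoint."""
    first = compute_first(grammar)
    follow = compute_follow(grammar, start, first)
    for a, alternatives in grammar.items():
        seen: Set[str] = set()
        for rhs in alternatives:
            f = first_of_string(rhs, first, grammar)
            predict = (f - {EPS}) | (follow[a] if EPS in f else set())
            if predict & seen:
                return False
            seen |= predict
    return True
```

For example, the hypothetical grammar S → a S b | ε passes the check, while the left-recursive grammar E → E + n | n does not, because both alternatives of E can start with n.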

2.2 Parser Generation

We must observe here that all practical parsers enumerated above are based on information that can be precomputed and then organized as appropriate data structures. Since here we are interested in parsing programming languages, we focus further only on LL and LR parsers where $k = 1$, i.e., LL(1) and LR(1), and organize their information as arrays called parse tables, denoted PT. Thus, any parser of the class considered here is actually a program structured as follows:

1. A parse table, PT. Let $PT_{LL1}$ and $PT_{LR1}$ be the parse tables of the corresponding parsers.
2. A procedure that operates on PT and performs the parsing actions, which we refer to by the names LL1Procedure and LR1Procedure.

The process of generating these programs is independent of a particular grammar $G$. The procedures LL1Procedure and LR1Procedure are implementations of push-down automata simulators that can be developed independently of the content of $G$. Computing PT from the grammar and merging it with the appropriate procedure (LL1Procedure or LR1Procedure) we obtain an automatic mechanism for parser implementation, usually called parser generation. Hence, we illustrate application driven software development with the software tool that solves the following parser generation problem: given a language $L$, specified by a grammar $G = (N, \Sigma, P, S)$, develop a methodology which determines the parsing approach that can be used to decide whether $w \in L$ and generates an appropriate parser. With our methodology, for a grammar $G = (N, \Sigma, P, S)$, generating a parser of the language $L(G)$ requires the following:

• A computational representation of $G$, denoted further Grammar.
• Procedures that decide if Grammar satisfies the constraints set forth by the LL(1) and LR(1) parsing approaches, denoted isLL1 and isLR1 respectively.
• Procedures LL1Constructor and LR1Constructor, that construct the parse tables $PT_{LL1}$ and $PT_{LR1}$ respectively from the Grammar.
• The parsing procedures controlled by the parse tables generated from Grammar, denoted LL1Procedure and LR1Procedure, respectively.
• A procedure to merge the parse table and its parsing procedure together to form the resulting parser, denoted Merger.

Grammar $G$ is represented computationally using text-based BNF rules. The procedures isLL1, isLR1, LL1Constructor, LR1Constructor, LL1Procedure, LR1Procedure and Merger already exist as developed by domain experts, and implementations of these procedures exist as developed by the TICS group at the University of Iowa.
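The separation between a grammar-specific parse table and a grammar-independent procedure can be sketched as follows. This is an illustration under stated assumptions, not the TICS implementation: `ParseTable`, `ll1_procedure`, and `merger` are hypothetical names, the table for the example grammar S → a S b | ε is written by hand, and `$` is an assumed end-of-input marker.

```python
from dataclasses import dataclass
from functools import partial
from typing import Callable, Dict, FrozenSet, List, Sequence, Tuple

Rhs = Tuple[str, ...]
END = "$"  # end-of-input marker (an assumption of this sketch)


@dataclass(frozen=True)
class ParseTable:
    start: str
    nonterminals: FrozenSet[str]
    entries: Dict[Tuple[str, str], Rhs]  # (nonterminal, lookahead) -> rhs to expand


def ll1_procedure(table: ParseTable, tokens: Sequence[str]) -> List[Tuple[str, Rhs]]:
    """Table-driven push-down automaton; returns the rules of the leftmost derivation."""
    stack = [END, table.start]
    stream = list(tokens) + [END]
    pos = 0
    derivation: List[Tuple[str, Rhs]] = []
    while stack:
        top = stack.pop()
        look = stream[pos]
        if top in table.nonterminals:
            rhs = table.entries.get((top, look))
            if rhs is None:
                raise ValueError(f"no rule for {top!r} on lookahead {look!r}")
            derivation.append((top, rhs))
            stack.extend(reversed(rhs))  # leftmost symbol of rhs ends up on top
        elif top == look:
            pos += 1  # terminal (or end marker) matched
        else:
            raise ValueError(f"expected {top!r}, found {look!r}")
    return derivation


def merger(table: ParseTable, procedure: Callable[..., List[Tuple[str, Rhs]]]):
    """Bind a parse table to its parsing procedure, yielding a parser."""
    return partial(procedure, table)


# Hand-written LL(1) table for the hypothetical grammar S -> a S b | epsilon.
EXAMPLE_TABLE = ParseTable(
    start="S",
    nonterminals=frozenset({"S"}),
    entries={("S", "a"): ("a", "S", "b"), ("S", "b"): (), ("S", END): ()},
)

if __name__ == "__main__":
    parser = merger(EXAMPLE_TABLE, ll1_procedure)
    for lhs, rhs in parser(["a", "a", "b", "b"]):
        print(lhs, "->", " ".join(rhs) or "epsilon")
```

Only `EXAMPLE_TABLE` depends on the grammar; `ll1_procedure` and `merger` would be reused unchanged for any other LL(1) table.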

3 Domain Driven Development

Domain driven development implies the following process:

1. Structure the application domain by modeling domain concepts using an appropriate formalization.
2. Supply the application domain model concepts with appropriate IT artifacts (implementations representing the domain concept in the IT domain).
3. Solve problems in the application domain and execute the domain level solutions on the computer.

3.1 Domain Ontology

The term ontology gets thrown around a lot, both in philosophy, where the term originates, and in IT, where it is used to represent what Gruber calls a “specification of a conceptualization”. We use the term ontology to represent a formalization of domain knowledge. That is, an ontology is a collection of AD terms organized based on their properties and relations to each other. Structuring the application domain is the responsibility of the domain expert and is the process of creating the domain ontology by collecting the domain concepts and organizing them. Each application domain is characterized by a collection of terms (terminology), which have the same semantic interpretation for all domain experts. The association of the symbolic names of these terms within a table provides an easy way of translating among them while preserving their semantics. Developing an ontology for an application domain means identifying these terms and organizing them in a manner in which they can be reasoned about.

The ontology for an application domain (in our case language processing) may be very large. However, the terms of the ontology can be classified in a hierarchy of sub-domains, and the problem solver may focus only on the ontology of the appropriate sub-domain. For the purpose of using ADS to solve our problem, we are concerned with a small subset of terms in the domain whose meanings are computational processes characterized by their input/output behavior as follows:

1. isLL1
   Input: Grammar $G = (N, \Sigma, P, S)$
   Output: true if $G$ is in the class of LL(1) grammars, false otherwise
2. isLR1
   Input: Grammar $G = (N, \Sigma, P, S)$
   Output: true if $G$ is in the class of LR(1) grammars, false otherwise
3. LL1Constructor
   Input: Grammar $G = (N, \Sigma, P, S)$
   Output: LL(1) Parse Table
4. LR1Constructor
   Input: Grammar $G = (N, \Sigma, P, S)$
   Output: LR(1) Parse Table
5. Merger
   Input: parse table PT and parsing procedure PA
   Output: Parser
6. LL1Procedure
   Input: none
   Output: LL(1) parsing algorithm for LL(1) parse table.
7. LR1Procedure
   Input: none
   Output: LR(1) parsing algorithm for LR(1) parse table.

3.2 Computational Emancipation of AD

The process of supplying the application domain ontology [NM] with the IT artifacts that implement its terms is called Computational Emancipation of the Application Domain, or what we refer to as CEADing the AD. Applying this approach to the language sub-domain, CEADing is done by associating each term in the language ontology with a Uniform Resource Identifier (URI) pointing to the IT component implementing it. We organize these terms in a tree, giving us the fully emancipated language processing ontology shown in Figure 1.

[Figure 1 (tree diagram, not reproduced): Language is recognized by Parser and specified by Grammar. The node labels of the diagram are Grammar, Checker, isLL1, isLR1, LL1 Constructor, LR1 Constructor, LL1PT, LR1PT, Parser, PDA Simulator, LL1 Simulator, LR1 Simulator, LL1 Procedure, LR1 Procedure, LL1 Merger, LR1 Merger, LL1 Parser, and LR1 Parser; its edge labels are recognized by, specified by, properties, check, implement, and PT.]

Figure 1: Computationally Emancipated Language Processing Ontology
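One possible way to picture an emancipated ontology is a tree of terms in which some nodes carry the URI of the artifact that implements them. The sketch below is only an illustration: the fragment of the hierarchy shown is an assumption, the `example.org` URIs are placeholders rather than real components, and `Term`, `resolve`, and `path_to_root` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Term:
    parent: Optional[str]  # name of the parent term, None for the root
    uri: Optional[str]     # URI of the implementing artifact, None if not emancipated


# Hypothetical fragment of the ontology; the URIs are placeholders.
ONTOLOGY: Dict[str, Term] = {
    "Language": Term(None, None),
    "Grammar": Term("Language", None),
    "Checker": Term("Grammar", None),
    "isLL1": Term("Checker", "http://example.org/ads/isLL1"),
    "isLR1": Term("Checker", "http://example.org/ads/isLR1"),
}


def resolve(name: str) -> str:
    """Return the URI of the artifact associated with a domain term."""
    term = ONTOLOGY[name]
    if term.uri is None:
        raise LookupError(f"term {name!r} has no associated artifact")
    return term.uri


def path_to_root(name: str) -> List[str]:
    """List the term and its ancestors, from the term up to the root."""
    path: List[str] = []
    current: Optional[str] = name
    while current is not None:
        path.append(current)
        current = ONTOLOGY[current].parent
    return path
```

A lookup such as `resolve("isLL1")` is the step that lets a domain term be replaced by its IT artifact when a solution is executed.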

3.3 Problem Solving with ADS

For any problem, the solution is formulated in the domain of discourse from which the problem originates. As language processing experts we provide the following formal solution to our problem, formulated in terms of the computationally emancipated language processing domain.

Input: Grammar G = ⟨N, Σ, P, S⟩
Output: Parser P_L(G)

if isLL1(G) is true then
    PT := LL1Constructor(G)
    PA := LL1Procedure
else if isLR1(G) is true then
    PT := LR1Constructor(G)
    PA := LR1Procedure
else
    return Error
end if
return Merger(PT, PA)
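For readers who prefer executable notation, the same algorithm can be transcribed as a short sketch. It assumes the domain terms are available as entries of a dictionary named `domain` (a hypothetical convention), and the stub artifacts in `STUB_DOMAIN` exist only so the control flow can be exercised; they are not parser components.

```python
from typing import Any, Dict


def generate_parser(grammar: Any, domain: Dict[str, Any]) -> Any:
    """Choose the parsing approach for the grammar and merge table and procedure."""
    if domain["isLL1"](grammar):
        pt = domain["LL1Constructor"](grammar)
        pa = domain["LL1Procedure"]
    elif domain["isLR1"](grammar):
        pt = domain["LR1Constructor"](grammar)
        pa = domain["LR1Procedure"]
    else:
        raise ValueError("grammar is neither LL(1) nor LR(1)")
    return domain["Merger"](pt, pa)


# Stub artifacts (hypothetical) standing in for the real components.
STUB_DOMAIN: Dict[str, Any] = {
    "isLL1": lambda g: g == "ll1-grammar",
    "isLR1": lambda g: g == "lr1-grammar",
    "LL1Constructor": lambda g: f"LL1PT({g})",
    "LR1Constructor": lambda g: f"LR1PT({g})",
    "LL1Procedure": "LL1Procedure",
    "LR1Procedure": "LR1Procedure",
    "Merger": lambda pt, pa: (pt, pa),
}
```

The sketch raises an exception where the domain solution returns Error; the branch structure is otherwise the same.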

3.4 Solution Execution

The ADS methodology for the execution of domain solutions is a two step process: translation and interpretation. Before a solution can be executed it must first be translated into a process-interpretable language called the Software Architecture Description Language (SADL) [RC06]. This translation replaces each domain concept with its associated IT artifact and each operational instruction (if, while, etc.) with an appropriate SADL operator. After translation to SADL, the solution process is carried out by the SADL interpreter.

3.5 SADL

SADL is a process-description language used in ADS to represent domain solutions in an intermediate form. The structure of the SADL language is built on the following principles.

1. The lexical elements of the language are AD terms and SADL operators. The semantics of AD terms are the software artifacts associated with them in the AD ontology. The semantics of SADL operators are specified by the SADL interpreter.
2. The SADL primitive processes are specified by the signature of the component used in the AD ontology or by SADL operators that compose such processes in SADL.
3. The SADL composed process consists of sequential and parallel process compositions of one or more SADL processes that implement an AD solution algorithm.

The syntax of SADL is built on the extensible markup language (XML) syntax. The two types of SADL processes are represented by the two types of XML elements:

• SADL primitive processes are represented by empty XML elements of the form <op atr1 ... atrn/>, where op is a SADL operator that performs a process and atr1, ..., atrn define the properties of that process, such as the URI of the code that implements it, input, output, etc.
• SADL composed processes are represented by content XML elements of the form <op atr1 ... atrn> p1 ... pn </op>, where op is a SADL operator that composes the processes p1 ... pn using atr1, ..., atrn to determine the behavior of the composition.

The process performed by the SADL interpreter for each element is determined by the SADL operator. For our solution the interpreter recognizes the following set of SADL process operators.

Primitive Processes:

1. Operator: execute
   Attributes: {component, input, output}
   Process: Executes the process specified by the value of component with input input and output output, where input is a comma separated list of inputs for the process.
2. Operator: message
   Attributes: {text}
   Process: Displays the message specified by text to stdout.

Composed Processes (with content p1 ... pn):

1. Operator: system
   Attributes: ∅
   Process: Specifies a domain solution where p1 ... pn is the sequence of SADL processes performing the solution.
2. Operator: if
   Attributes: {test}
   Process: If evaluation of the boolean expression specified by test is true (non-empty string or non-zero numeric value) then the sequence of processes p1 ... pn is performed. Otherwise no processes are performed.

Interpretation of the SADL language is handled by the SADL interpreter acting as a virtual processor. That is, the SADL interpreter mimics the behavior of a physical processor using a virtual program counter that points to the emancipated nodes of the domain ontology. The processes performed by the SADL interpreter are based on the SADL semantics defined above, and each process defined by a software artifact used in the ontology is performed on the computer platform in the network where it exists. The challenge for the interpreter is in performing process composition. In sequential composition this means executing a software artifact, waiting for its execution to terminate, and then performing the next process in the composition. This challenge is compounded by the fact that solutions contain sub-compositions, as is the case with the if process. The assumption here is that the software artifacts associated with ontology nodes are correct and terminating.
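The sequential interpretation of the four operators described above can be sketched in a few lines. This is not the SADL interpreter itself: the XML text in `SOLUTION` is a hypothetical fragment written for this sketch, components are looked up by name in a local registry of Python callables rather than fetched by URI, and the test attribute is assumed to name a variable in the environment.

```python
import xml.etree.ElementTree as ET
from typing import Any, Callable, Dict


def truthy(value: Any) -> bool:
    """A non-empty string or a non-zero number counts as true."""
    if isinstance(value, str):
        return value != ""
    if isinstance(value, (int, float)):
        return value != 0
    return False


def interpret(element: ET.Element, components: Dict[str, Callable[..., Any]],
              env: Dict[str, Any]) -> None:
    """Perform the process denoted by one element, recursing into compositions."""
    op = element.tag
    if op == "system":
        for child in element:  # sequential composition
            interpret(child, components, env)
    elif op == "if":
        if truthy(env.get(element.get("test", ""))):
            for child in element:
                interpret(child, components, env)
    elif op == "execute":
        names = [n.strip() for n in element.get("input", "").split(",") if n.strip()]
        result = components[element.get("component")](*(env[n] for n in names))
        output = element.get("output")
        if output:
            env[output] = result  # the call returned, so the process has terminated
    elif op == "message":
        print(element.get("text", ""))
    else:
        raise ValueError(f"unknown SADL operator: {op}")


# Hypothetical SADL-like text and stub components, for illustration only.
SOLUTION = """<system>
  <execute component="isLL1" input="G" output="ll1"/>
  <if test="ll1">
    <execute component="LL1Constructor" input="G" output="PT"/>
    <message text="LL(1) parse table constructed"/>
  </if>
</system>"""

COMPONENTS: Dict[str, Callable[..., Any]] = {
    "isLL1": lambda g: True,
    "LL1Constructor": lambda g: f"LL1PT({g})",
}

if __name__ == "__main__":
    environment: Dict[str, Any] = {"G": "G0"}
    interpret(ET.fromstring(SOLUTION), COMPONENTS, environment)
    print(environment["PT"])
```

The recursion in `interpret` plays the role of the virtual program counter: a composed process is finished only when all of its child processes have been performed in order.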

The solution given in Figure 2 is the SADL form of our domain solution given in Section 3.3. The process performing this solution is the composition of processes between the <system> and </system> tags.

Figure 2: SADL Solution

4 Conclusion

The major problem with software engineering is still the semantic gap, i.e., the gap between the abstractions manipulated by the application domain experts and their implementations manipulated by the IT domain experts. This is a problem of knowledge communication. At the initial stage of the software development process, knowledge is handed from the AD expert to the IT expert. But, while the IT expert manipulates IT abstractions, the AD expert manipulates domain abstractions representing domain concepts that do not necessarily exist in the IT domain. Application Driven Software bridges this difference, allowing the AD expert and the IT expert to manipulate the same abstractions using natural language terms of their own domains. Thus, ADS enables the AD and IT to collaborate while evolving independently. Application Driven Software is a new frontier of research and this work is just an illustration of how it can be done. It represents a change of mindset, breaking the inertia of current thinking about problem solving and software development. Computational Emancipation of the Application Domain is the key to this change, providing a new methodology for teaching the computer-oriented problem solving process based on “hands on the problem” rather than hands on the textbook. For the domain of language processing and compiler construction this means enabling automatic compiler generation without relying on programming or conventional compiler knowledge, in order to satisfy the increasing demand for compilers.

References

[ASU86]

A.V. Aho, R. Sethi, and J.D. Ullman. Compilers – Principles, Techniques, and Tools. Addison-Wesley, Reading, MA, 1986.

[AU72]

A.V. Aho and J.D. Ullman. The Theory of Parsing, Translation, and Compiling, Volume I: Parsing. Prentice–Hall, Englewood Cliffs, N. J., 1972.

[BA05]

K. Beck and C. Andres. Extreme Programming Explained. John Wait, second edition, 2005.

[Bro04]

A. Brown. An introduction to model driven architecture. Technical report, IBM, http://www.ibm.com/developerworks/rational/library/3100.html, February 2004.

[MSUW04]

S. J. Mellor, K. Scott, A. Uhl, and D. Weise. MDA Distilled: Principles of Model-Driven Architecture. Addison-Wesley, 2004.

[NM]

N.F. Noy and D.L. McGuinness. Ontology development 101: A guide to creating your first ontology. http://ksl.stanford.edu/people/dlm/papers/ontology-tutorial-noymcguinness-abstract.html.

[Par06]

D. Parnas. Agile methods and GSD: The wrong solution to an old but real problem. Communications of the ACM, 49(10):29, October 2006.

[Raj06]

V. Rajlich. Changing the paradigm of software engineering. Communications of the ACM, 49(8):67–70, 2006.

[RC06]

T. Rus and D.E. Curtis. Application driven software development. In International Conference on Software Engineering Advances, Proceedings, Tahiti, 2006.

[Sei03]

E. Seidewitz. What models mean. IEEE Software, 20(5):26–32, 2003.

[Sip06]

M. Sipser. Introduction to the Theory of Computation. Thomson/Course Technology, second edition, 2006.

[Tay02]

R.G. Taylor. LL parsing, LR parsing, complexity, and automata. ACM SIGCSE Bulletin, 34(4):71–75, 2002.
