An Editor and Parser for Data Formats in End-User Programming

Christopher Scaffidi, Brad Myers, and Mary Shaw
School of Computer Science, Carnegie Mellon University
{cscaffid, bam, mary.shaw}@cs.cmu.edu

Abstract

It is currently difficult and time-consuming to validate and manipulate data in web applications, so we have developed an editor and a parser to simplify these tasks. Our editor enables end-user programmers to create and debug reusable, flexible data formats without learning a complex new language. Our parser uses these formats to turn strings into structured objects and to report its level of confidence that each string is a valid instance of the format. End-user programmers can use our system to create validation code that takes a graduated response to slightly invalid data. To assist end users in fixing invalid inputs, our parser dynamically generates targeted error messages. We evaluate our system's expressiveness by defining formats for data that users regularly enter into web forms.
1. Introduction
Many tools exist for creating and manipulating spreadsheets, web pages, and databases. In many cases, the intended users are "end-user programmers," people who have enough skill to create simple software but who are not professional software developers [23]. Unfortunately, such environments treat semi-structured data such as part numbers and person names as plain text. To validate these fields, the end-user programmer typically writes a regular expression (or custom code in a programming language), which the application uses to check values before they are inserted into data fields. However, as we discuss in detail below, regular expression notation is unintuitive to many users, and it is cumbersome for expressing common data formats.

In this paper, we present two main contributions that make it much easier for end-user programmers to specify formats and to use formats for validating and manipulating data. First, we present a direct manipulation user interface that helps users define formats without learning unnatural notation; our editor internally transforms our human-readable notation into an augmented context-free grammar (ACFG). Second, we present a parser that transforms strings into structured objects and returns a number between 0 and 1 to indicate the parser's confidence in each string's validity. These contributions have four ancillary benefits.
First, our editor presents formats in understandable English, which should help users to check their work. This also facilitates sharing of formats, since one user can save a format to a file and email it to a co-worker, and then the co-worker can review the format to see if it is correct and if it is useful for that co-worker's needs.

Second, our editor allows users to incrementally add, remove, and change parts of the format, which should also facilitate sharing: one user can create a generic format with just the basic structure, then share it with other users who can customize the format as needed to make it more precisely match their needs.

Third, our system allows what we call "soft constraints"—statements about data that are often but not always true. These enable programmers to create an application that responds in a graduated manner to slightly invalid input. For example, if an input only violates one soft constraint, the application can display a warning rather than rejecting the input; if the user confirms the input, then the application can accept it.

Finally, if an input is invalid, our parser generates a message that explains in English what portion of the input needs to be fixed, and it also explains what input is expected. For example, instead of reporting "Syntax error: Unexpected input '1'" as in traditional parsers, our parser reports, "Each US Phone Number has a part called the area code. The area code never ends with 11." We hope this will make it much easier for end users to understand how to fix their inputs.

We conceive of our editor and parser as part of a future system that will enable users to specify families of related formats, as well as the operations to transform from one format to another [19]. For example, one family might describe all US phone number formats. Unlike typical programming primitives such as floats and strings, each family of formats seems to have a natural place in end users' intuitive mental models of data. Consequently, we refer to each family of formats as a "tope," the Greek word meaning "place."

After discussing related work and our approach, we describe two illustrative examples of how our system greatly lowers the difficulty of validating and manipulating web data. We then describe our system in detail and evaluate it by implementing several formats for data observed in use. We anticipate that future studies will show that users can create and debug such data formats more quickly and accurately than regular expressions, since our system does not require users to learn an unnatural language.
2. Related work
The most closely related system, Lapis, aims to let users write formats in a relatively user-friendly notation [12]. However, this notation is still not plain English, so significant training is required for proficiency. For example, the Lapis library [11] defines the day in a date as

    @DayOfMonth is Number equal to /[12][0-9]|3[01]|0?[1-9]/ ignoring nothing

The SWYN [2] and Grammex [10] editors are somewhat similar to ours, as they provide direct manipulation interfaces to help users write regular expressions and context-free grammars (CFGs), respectively. These systems do not address fundamental limitations of the underlying notations (which we discuss in Section 3). Other distantly related systems, including DataPro [8] and Apple Data Detectors [14], also use regular expressions and CFGs to describe formats. None provides a sophisticated editor, so creating formats may be even harder with them than with SWYN and Grammex.

In addition to our editor's presentation of formats in English, another difference between our system and the systems above is that ours lets users specify constraints on parts of the format. These constraints may be either textual or numeric, and they may be soft. A number of systems support numeric constraints in specific domains such as web feeds [16] and spreadsheets [4]. These apparently do not support soft numeric constraints or any sort of textual constraints.

Several tools recognize or manipulate some of the same kinds of data as ours, including email addresses, US phone numbers, and mailing addresses [5][6][25]. However, in these tools, the formats are hard-coded and cannot be extended or customized by users.

Internally, our system represents formats with an ACFG, meaning that each CFG production can be annotated ("augmented") with a constraint [1]. However, our ACFG differs from prior ACFGs, in which soft constraints were not allowed: if a constraint was violated, then the parse was forbidden [1][3][26]. In contrast, our ACFG notation can express soft constraints that allow a slightly invalid parse but cause the parser to downgrade its confidence in the parse's validity.

Stochastic CFGs use probabilities to quantify confidence in parses, but these numbers are associated with the productions themselves rather than with constraints on the productions [1]. Through machine learning, some parsers can be trained to apply different probabilities based on context in natural language [7][18][24], but this is still not as expressive as letting users specify arbitrary numeric penalties on arbitrary constraints.
3. Overcoming the limitations of regular expressions
End-user programmers have three problems with using regular expressions to describe and validate data. First, regular expressions specify binary recognizers: either a value matches the format, or it does not. Thus, regular expressions cannot describe soft constraints. For example, a valid email address theoretically could contain dozens of periods in the domain name, but an input with 5 or more periods in the domain would probably not be a real address (as such domains are rare in practice). Soft constraint violations do not disprove an input's validity, but they call it into question. Second, regular expressions are cumbersome for expressing negation (e.g.: US phone numbers' area codes never end with "11") and numeric ranges (e.g.: the parts of an IPv4 address are in the range 0 through 255). Finally, because regular expressions "are confusing and difficult to read", end-user programmers find them difficult to master [2].

Even experienced programmers recognize that regular expressions are hard to read, write, and maintain. For example, the top few results in a Google search for the phrase "regular expressions" include comments such as:

"Do not worry if the above example or the quick start make little sense."

"While Python code will be slower than an elaborate regular expression, it will also probably be more understandable."

"Sometimes you have a programming problem and it seems like the best solution is to use regular expressions; now you have two problems."

For these reasons, some programmers prefer not to write regular expressions. In most environments, the alternative is to write custom code, though this requires mastering a programming language such as Python or JavaScript. For example, in JavaScript, a programmer can validate inputs by specifying a new type of structured object, then manually writing code to parse a string in the object's constructor function. Structured objects can then be instantiated with the "new" operator.
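To make these two alternatives concrete, the sketch below (ours, not the paper's) contrasts a regular expression for the IPv4 range example above with hand-written validation code of the kind just described; the Octet constructor and its error message are hypothetical names chosen for illustration.

    // Illustrative sketch: checking that one part of an IPv4 address
    // lies in the range 0 through 255.

    // Regex approach: the numeric range must be encoded by enumerating
    // digit patterns, which is verbose and easy to get wrong.
    var octetRegex = /^(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])$/;

    // Custom-code approach: a hypothetical constructor function that
    // parses the string and rejects out-of-range values.
    function Octet(text) {
      var n = parseInt(text, 10);
      if (!/^[0-9]{1,3}$/.test(text) || n > 255) {
        throw new Error("Not a number between 0 and 255: " + text);
      }
      this.value = n;
    }

    // Structured objects are then instantiated with the "new" operator:
    var part = new Octet("255");      // ok
    // var bad = new Octet("256");    // throws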
Unfortunately, the complexity of manually writing custom code turns away some programmers. For example, in one survey of over 800 highly-skilled end-user programmers, 43% of respondents reported use of JavaScript, but only 23% reported use of "new" [21]. In addition, in interviews of 6 creators of "person locator" sites after Hurricane Katrina, we learned that one team of professional programmers omitted validation code for data like phone numbers and addresses because they felt that writing this code would have been too time-consuming [22]. Another interviewed programmer omitted validation because he wanted to make it as easy as possible for end users to enter data. However, the lack of validation led to data errors. For example, one end user put "12 Years old" into an address field. Writing custom code to detect the difference between this invalid value and a valid address like "12 Years St" is difficult and time-consuming, particularly because the structure of US addresses is quite flexible.

Externally, our system uses English in the user interface in order to provide more readability than is the case with regular expressions or custom code. Internally, our system represents formats as ACFGs in order to provide more expressiveness than is the case with regular expressions. As discussed in Section 8, the ability to associate constraints with grammar productions enables our system to represent the data formats that end users commonly enter into web forms.

There are three other important reasons why we chose this underlying representation. First, a number of algorithms exist for parsing CFGs, and in Section 7.2 we describe our extension of one algorithm to handle our ACFG. Selecting an ACFG representation lets us take advantage of these algorithms and will create opportunities for us to further tailor these algorithms to users' needs. In particular, CFGs can contain pairs (or sets) of productions that reference each other in mutual recursion, but we have not yet seen any data formats that would require mutual recursion among productions. Removing support for this language feature from the algorithm might improve efficiency without limiting expressiveness in practice. Second, a CFG-driven parser generates a tree, which is a data structure familiar to many programmers. This is important in situations where a programmer will use the parser's output in order to manipulate the data, as is the case in our mashup example in Section 4.2. Finally, a CFG associates a name with each part of the format, so that each part can be referenced. This is essential for data manipulation, and it will be essential for future work aimed at helping end-user programmers specify how to transform one format into another.
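Before turning to the examples, the following sketch (our own conceptual illustration, not the system's actual ACFG notation) shows the soft-constraint idea in JavaScript terms: a grammar part carries an annotation whose violation lowers confidence rather than forbidding the parse. The structure check, penalty value, and function names are all assumptions made for illustration.

    // Conceptual illustration of a format part annotated with a soft constraint.
    var lastnamePart = {
      name: "lastname",
      matches: function (s) { return /^[A-Za-z' -]+$/.test(s); },  // hard structure (assumed)
      softConstraints: [
        { describe: "The lastname often starts with 1 uppercase letter.",
          holds: function (s) { return /^[A-Z]/.test(s); },
          penalty: 0.5 }   // assumed: multiply confidence by this on violation
      ]
    };

    // A parser using such annotations could score an input like this:
    function confidenceFor(text, part) {
      if (!part.matches(text)) { return 0; }     // hard violation: invalid
      var confidence = 1;
      part.softConstraints.forEach(function (c) {
        if (!c.holds(text)) { confidence *= c.penalty; }  // soft violation: downgrade
      });
      return confidence;                         // a number between 0 and 1
    }

    confidenceFor("Gates", lastnamePart);        // 1
    confidenceFor("von Gates", lastnamePart);    // 0.5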
4. Illustrative applications
We now present two applications that demonstrate the usefulness of our system. In each example, an end-user programmer uses our editor to define one or more data formats. The programmer then creates an application that feeds data into our parser, along with data formats, in order to validate and manipulate the data. In each example, the programmer needs to do much less work than would be the case without our editor and parser.
4.1. Data validation example

When constructing a web form for a web application, a programmer must create code that validates inputs from end users. Some IDEs such as phpClick [17] or Microsoft Visual Studio¹ can automatically generate this code from a regular expression specified by the programmer. Below, we examine this approach and its deficiencies, then present an improved approach based on a reusable JavaScript library API that we have built on top of our editor and parser.

Suppose that a programmer wants an application to verify that inputs to a single "person name" field follow "Lastname, Firstname" format. In the Visual Studio IDE, the programmer drags a textbox from the toolbox onto the web page, then drags a RegularExpressionValidator from the toolbox and drops it beside the textbox. The next step is to click on the validator's icon and type a regular expression such as

    [A-Z][a-z]+,\s[A-Z][a-z]+

Last, the programmer enters an error message (e.g.: "Please enter a valid name") that will appear for invalid input. The IDE then generates code that checks inputs against the regular expression at run-time. For example, if a user forgets a comma, then the error message appears.

This approach has three deficiencies. First, a straightforward regular expression like this one is too constraining. For example, it would reject "d'Gates, Bill" and "Gates-Billings, Bill"; specifying a more accurate expression would be much more work and probably beyond most end-user programmers' abilities. Second, this approach is inflexible: as we saw in our Katrina study, it is sometimes desirable to let users enter slightly invalid data, but to warn them so they can correct input if desired [22]. Last, the error message must be specified as a fixed string. Thus, it cannot change depending on each specific input's error, so it provides only limited guidance for fixing inputs.
¹ Microsoft Visual Studio is primarily used by professional programmers, though there is an "express" version for "hobbyist, novice, and student developer" programmers: http://msdn.microsoft.com/vstudio/express/
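To see the first deficiency concretely, the snippet below (ours, not code generated by the IDE) tests the regular expression above against a few inputs; the pattern is anchored here because the validator effectively requires the entire input to match.

    // Illustrative sketch: the fixed regular expression rejects several
    // reasonable name formats.
    var nameRegex = /^[A-Z][a-z]+,\s[A-Z][a-z]+$/;

    nameRegex.test("Gates, Bill");           // true
    nameRegex.test("d'Gates, Bill");         // false (apostrophe)
    nameRegex.test("Gates-Billings, Bill");  // false (hyphen)
    nameRegex.test("von Gates, Bill");       // false (lowercase prefix)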
To address these deficiencies, we have implemented a new validator library API (called a PatternValidator) on top of our editor and parser to replace the old RegularExpressionValidator. This improved validator enables programmers to specify formats without resorting to regular expressions.

First, the programmer drags a textbox from the toolbox onto the web page, then drags our PatternValidator and drops it beside the textbox. Clicking on our validator brings up our format editor, which is shown in Figure 1 and discussed in more detail in Section 5.

Figure 1: Editing a format for a person name. The end-user programmer specifies the format's parts ("lastname" followed by "firstname") and various facts about each part. Buttons beside each entry add parts/facts, remove parts/facts, and change the order of parts/facts in the format. Each § symbol indicates a space. During Step 3, the programmer can specify some sample values; if these do not match the format, then a tooltip with a targeted error message appears. See Section 5 for more discussion of our editor.

Using the editor, the programmer names the format, specifies its parts, tests it if desired, and then saves the format in a file in a subdirectory of the website (alongside any images, stylesheets, scripts, and other supplementary content that comprises the website). Our validator works with the IDE to generate the necessary code for validating inputs according to the format.

At run-time, if an end user's input violates the format, then our parser generates a targeted error message, and our validator library shows this message to the end user. (The message shown by our validator to the end user is similar to the message shown by the editor to the programmer in Step 3 in Figure 1.) This targeted message is much more helpful to the end user than simply stating "Please enter a valid name," since the targeted message describes the format of a valid input. To avoid overwhelming the user, our parser only describes portions of the format that are related to the error, rather than describing the format in its entirety.

To accommodate slightly invalid input, our parser assigns a number between 0 and 1 to represent the parser's confidence in each input's validity. The validator always rejects any input with a confidence of 0 and always accepts any input with a confidence of 1. If the confidence is between 0 and 1, the validator displays a confirmation popup rather than rejecting the input. If the user confirms the input's validity, then the validator accepts it. For example, the popup would appear if the user entered "von Gates, Bill" because the format contains a soft constraint that the last name often starts with 1 uppercase letter. (The programmer can also configure our validator to always reject any input with confidence less than 1, thus dispensing with the popup.)

In addition to the validator described above, we used our parser to implement other JavaScript functions so that programmers who know JavaScript can customize validation still further. For example, with a few lines of code, an application can accept both "Firstname Lastname" and "Lastname, Firstname" inputs and reformat the former into the latter so only a single canonical format is submitted by the form to the server.
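For instance, a minimal sketch of such a customization appears below. It reuses the parseStructure and textFor functions that our mashup library exposes (described in Section 4.2); the format file name, the null-on-failure behavior, and the use of textFor on the parse result are assumptions made for illustration rather than details taken from the paper.

    // Illustrative sketch: accept "Firstname Lastname" as well as
    // "Lastname, Firstname" and canonicalize to the latter before the
    // form submits the value to the server.
    function canonicalizeName(input) {
      // Assumes parseStructure returns null when the input does not
      // match the format, and that the result exposes textFor as in
      // Section 4.2. "firstname-lastname.xtope" is a hypothetical file.
      var parsed = parseStructure(input, "firstname-lastname.xtope");
      if (parsed != null) {
        return parsed.textFor("lastname") + ", " + parsed.textFor("firstname");
      }
      // Otherwise keep the input as typed ("Lastname, Firstname").
      return input;
    }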
4.2. Data manipulation example

As another example, in a client-side mashup, JavaScript on a web page dynamically grabs data from within the web page, then uses this data to silently request additional data from the server. The script combines this new data with the original data, and it may use the combined data to issue additional requests for data. Ultimately, the script dynamically updates its web page to display some or all of the data [9].
Currently, writing mashups is difficult, even for professional programmers. The main reason is that grabbing data from pages requires breaking apart unstructured or semi-structured data in the HTML, and this is brittle without a great deal of code to handle exceptional cases. Just as we have created a validator library (Section 4.1) to simplify the process of flexibly validating inputs, we have created a mashup library (described below) to simplify the process of mashing up data. In both cases, the effect is to greatly reduce the amount of work that a programmer must invest to create an application.

Suppose that a programmer needs to enhance a web page that shows citations by adding JavaScript to display author contact information when an end user clicks a citation. First, the script must grab each author's name out of the citation. Then, it must pass each name to an existing web application that returns a web page with that person's contact information. Finally, out of that page, the script must grab the email address and phone number, then reformat these for display to the end user.

    Chris Scaffidi, Mary Shaw, Brad A. Myers. Estimating the Numbers of End Users and End User Programmers, 2005 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC'05), Dallas, Texas, USA, 20-24 September 2005. pp. 207-214.

Figure 2: Sample citation in this mashup

As shown in Figure 2, some authors have middle initials on this particular page of citations. However, the script will send each author's name to a web application that expects full names in "Firstname Lastname" format. Therefore, the programmer must use our editor to define a format for a person name's inner structure, so the script can reference first and last names. The programmer can then reuse this person name format inside of a larger citation format.

To define the needed format, the programmer can start with the person name format defined in Figure 1. Adapting this format requires swapping the first name and last name (with a single button click), inserting a part for the middle initial, and tweaking the separators between parts (for example, removing the comma after the last name and adding a space between parts).

Next, the programmer uses our editor to define a format for the citation as a whole. The first part of this citation format can be defined by reusing the format defined above to create an "author" part, which can repeat with a comma and space between authors. In the same format, the programmer can specify ", and" as well as " and " for allowed separators between authors. The resulting part of the format is shown in Figure 3.
Figure 3: Reusing the person name format for an author part inside the citation format.
Since there is no need to parse the citation’s other parts, such as paper title and date, the programmer can simply add a part called “otherstuff”. Leaving that part unspecified allows it to consume whatever characters in the citation are not picked up by the author list. The programmer tells the editor to save the citation format in a file called “citation.xtope” located in a subdirectory of the web site. At run-time, the script will retrieve this file and pass it to our mashup library. The programmer now creates two additional formats for parsing author contact information. This information will need to be grabbed from HTML sent back by a web application that returns a page showing contact information as in Figure 4.
Figure 4: Contact information returned by server

The script must grab the phone number and email address. The relevant snippet of the HTML is:

    Office
    412-268 3564
    412-268-2338 (fax)
    cscaffid @ andrew.cmu.edu
    Home Page

Note the extra spaces in the email address and the use of a space rather than a hyphen between the phone number's exchange and local number. The programmer might want the script to reformat these data before presenting them to the end user. This reformatting will require referencing parts of the phone number and email address, so the programmer next defines a format for each and stores them in files on the server alongside the citation format file.

At this point, the programmer has defined all the needed formats without writing a single line of code. As shown next, the script can now use our mashup library to parse data and access its parts, such as author last names and phone number exchanges.

The script first passes the text from the selected citation into our mashup library's parseStructure function, which uses our parser to parse the string into a structured object according to the specified format file:

    var cit = parseStructure(citationText, "citation.xtope");

The object returned by parseStructure is a parse tree for the selected citation. For example, parsing the string shown in Figure 2 yields the structure shown below. Our parser gives each node a label (such as "firstname") according to the part names that the programmer specified in our editor.

    citation
    ├─ author
    │   ├─ firstname = "Chris"
    │   └─ lastname  = "Scaffidi"
    ├─ author
    │   ├─ firstname = "Mary"
    │   └─ lastname  = "Shaw"
    ├─ author
    │   ├─ firstname = "Brad"
    │   ├─ initial   = "A."
    │   └─ lastname  = "Myers"
    └─ otherstuff    = ". Estimating ..."

The script then uses our mashup library's nodes function to get an array of the author nodes (3 in this case):

    var authors = cit.nodes("author");

The script loops through the authors and uses our mashup library's textFor function to retrieve the text in the "firstname" and "lastname" nodes:
    for (var i = 0; i < authors.size(); i++) {
      var author = authors.get(i);
      var fname = author.textFor("firstname");
      var lname = author.textFor("lastname");
      // ... the loop body continues with the steps described next ...
    }
With the author names in hand, the script can now request each author's contact information from the web server. The script appends the names to the web application's URL, with a space between the names, to yield the URL for each person's contact information. The script then passes these URLs to our mashup library's callWhenReady function, which calls the server with those URLs and then asynchronously passes the contact information HTML back to the script when all of the URL calls are completed. The script can then use our mashup library's grabStructure function, which looks for two chunks of delimiter HTML and locates an instance of a format between those delimiting markers. In this case, the script grabs a phone number and an email address:

    var SMARK = "Office";
    var EMARK = "
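The calls just described might be wired together roughly as in the sketch below. This is our own illustration: the callWhenReady callback shape, the grabStructure signature, the contact application's URL, the closing delimiter value, and the format file names are all assumptions rather than code from the paper.

    // Illustrative sketch only; every name marked as assumed is
    // hypothetical and not taken from the paper.
    var SMARK = "Office";
    var EMARK = "...";   // assumed closing delimiter HTML

    var urls = [];
    for (var i = 0; i < authors.size(); i++) {
      var author = authors.get(i);
      // Append "Firstname Lastname" to the contact application's URL
      // (base URL assumed for illustration).
      urls.push("http://example.org/contacts?name=" +
                author.textFor("firstname") + " " + author.textFor("lastname"));
    }

    // Assumed callback style: the library fetches every URL and hands
    // the resulting HTML pages back once all calls have completed.
    callWhenReady(urls, function (pages) {
      for (var j = 0; j < pages.length; j++) {
        // Assumed signature: find an instance of the given format
        // between the two delimiter strings.
        var phone = grabStructure(pages[j], SMARK, EMARK, "phone.xtope");
        var email = grabStructure(pages[j], SMARK, EMARK, "email.xtope");
        // ... reformat phone and email, then display them ...
      }
    });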