Inference of Regular Expressions for Text Extraction from Examples

Alberto Bartoli, Andrea De Lorenzo, Eric Medvet, and Fabiano Tarlao

Abstract—A large class of entity extraction tasks from text that is either semistructured or fully unstructured may be addressed by regular expressions, because in many practical cases the relevant entities follow an underlying syntactical pattern and this pattern may be described by a regular expression. In this work we consider the long-standing problem of synthesizing such expressions automatically, based solely on examples of the desired behavior. We present the design and implementation of a system capable of addressing extraction tasks of realistic complexity. Our system is based on an evolutionary procedure carefully tailored to the specific needs of regular expression generation by examples. The procedure executes a search driven by a multiobjective optimization strategy aimed at simultaneously improving multiple performance indexes of candidate solutions while at the same time ensuring an adequate exploration of the huge solution space. We assess our proposal experimentally in great depth, on a number of challenging datasets. The accuracy of the obtained solutions seems to be adequate for practical usage and improves over earlier proposals significantly. Most importantly, our results are highly competitive even with respect to human operators. A prototype is available as a web application at http://regex.inginf.units.it.

Index Terms—Genetic Programming, Information extraction, Programming by examples, Multiobjective optimization, Heuristic search


1 INTRODUCTION

A REGULAR EXPRESSION is a means for specifying string patterns concisely. Such a specification may be used by a specialized engine for extracting the strings matching the specification from a data stream. Regular expressions are a long-established technique in a large variety of application domains, including text processing, and continue to be a routinely used tool due to their expressiveness and flexibility. A large class of entity extraction tasks, in particular, may be addressed by regular expressions, because in many practical cases the relevant entities follow an underlying syntactical pattern and this pattern may be described by a regular expression. However, the construction of regular expressions capable of guaranteeing high precision and high recall for a given extraction task is tedious, difficult and requires specific technical skills.

In this work, we consider the problem of synthesizing a regular expression automatically, based solely on examples of the desired behavior. This problem has attracted considerable interest for a long time and from different research communities. A wealth of research efforts considered classification problems in formal languages [1], [2], [3], [4], [5], [6]—those results are not immediately useful for text extraction. Essentially, the problem considered by those efforts consisted in inferring an acceptor for a regular language based on positive and negative sample strings, i.e., strings described by the language and strings not described by it. Learning of deterministic finite automata (DFA) from examples was also a very active area, especially because of competitions that resulted in several important insights and algorithms, e.g., [7], [8].



All the authors are with the Department of Engineering and Architecture (DIA), University of Trieste, Italy. E-mail: [email protected].

Manuscript received . . .

Such research, however, usually considered problems that were not inspired by any real-world application [8], and the applicability of the corresponding learning algorithms to other application domains is still largely unexplored [9]. For example, the so-called Abbadingo competition was highly influential in this area and considered short sequences of binary symbols, with training data drawn uniformly from the input space. Settings of this sort do not fit the needs of practical text processing applications, which have to cope with much longer sequences of symbols, from a much larger alphabet, not drawn uniformly from the space of all possible sequences. Furthermore, regular expressions used in modern programming languages allow specifying a wider variety of extraction tasks than can be specified with a DFA.

A text extraction problem was addressed by researchers from IBM Almaden and the University of Michigan, who developed a procedure for improving an initial regular expression, to be provided by the user, based on examples of the desired functioning [10]. The cited work is perhaps the first one addressing entity extraction from real text of non-trivial size and complexity: the entities to be extracted included software names, email addresses and phone numbers, while the datasets were unstructured and composed of many thousands of lines. A later proposal by researchers from IBM India and the Chennai Mathematical Institute still required an initial regular expression but was more robust toward initial expressions of modest accuracy and noisy datasets [11]. Refinement of a given regular expression was also considered by an IBM Research group, which advocated the involvement of a human operator for providing feedback during the process [12]. The need for an initial solution was removed by researchers from SAP AG, who demonstrated the practical feasibility of inferring a regular expression from scratch, based solely on a set of examples


derived from enterprise data, such as a product catalog or historical invoices [13]. A more recent proposal of ours obtained further significant improvements in this area, in terms of precision and recall of the generated solutions as well as in terms of the amount of training data required [14], [15]. Regular expressions for text extraction tasks of practical complexity may now be obtained in a few minutes, based solely on a few tens of examples of the desired behavior.

In this work we present a system that aims at improving the state of the art in this area. Our proposal is internally based on Genetic Programming (GP), an evolutionary computing paradigm which implements a heuristic search in a space of candidate solutions [16]. We execute a search driven by a multiobjective optimization strategy aimed at simultaneously improving multiple performance indexes of candidate solutions while at the same time ensuring an adequate exploration of the huge solution space. Our proposal is a significant improvement and redesign of the approach in [15], resulting in a system that generates solutions of much better accuracy. The improvements include: (a) a radically different way of quantifying the quality of candidate solutions; (b) inclusion, among the starting points of the search, of candidate solutions built based on an analysis of the training data, rather than being fully random; (c) a strategy for restricting the solution space by defining potentially useful "building blocks" based on an analysis of the training data; and (d) a simple mechanism for enforcing structural diversity of candidate solutions. Furthermore, the redesign features several novel properties which greatly broaden the scope of extraction tasks that may be addressed effectively:

• Support for the or operator. In many cases learning a single pattern capable of describing all the entities to be extracted may be very difficult—e.g., dates may be expressed in a myriad of different formats. Our system is able to address such scenarios by generating several regular expressions that are all joined together with or operators to form a single, larger regular expression. We implement this functionality by means of a separate-and-conquer procedure [17], [18], [19]. Once a candidate regular expression provides adequate accuracy on a subset of the examples, the expression is inserted into the set of final solutions and the learning process continues on a smaller set of examples including only those not yet solved adequately [20]. The key point is that the system is able to realize automatically how many regular expressions are needed (a minimal sketch of this or-composition is given at the end of this section).

• Context-dependent extraction. It is often the case that a text snippet must or must not be extracted depending on the text surrounding the snippet—e.g., an email address might have to be extracted only when following a Reply-To: header name. Modern regular expression engines provide several constructs for addressing these needs, but actually taking advantage of those constructs is very challenging: the more the available constructs, the larger the search space. Our system is able to generate regular expressions which exploit lookaround operators effectively, i.e., operators specifying constraints on the text that precedes or follows the text to be extracted.

• No constraints on the size of training examples. We place

no constraints on the size of training examples: the training data may consist either of a single, potentially very large, file with an annotation of all the desired extractions, or of a set of lines with zero or more extractions in each one. This seemingly minor detail may in fact be quite important in practice: the cited work [15] was not able to correctly exploit training examples including multiple extractions (this point will be discussed in detail later), thus the training data had to be segmented into units containing at most one extraction, and in such a way that desired extractions did not span adjacent units. The need for such a tricky operation is now removed. Accommodating the possibility of multiple extractions in each training example has required significant changes in the search strategy internally used by the system.

We assess our proposal experimentally in great depth, on a number of challenging datasets of realistic complexity and with a very small portion of each dataset available for learning. We compare precision and recall of the regular expressions generated by our system to significant baseline methods proposed earlier in the literature. The results indicate a clear superiority of our proposal, and the obtained accuracy values seem to be adequate for practical usage. Our results are highly competitive also with respect to a pool of more than 70 human operators, both in terms of accuracy and of time required for building a regular expression. Indeed, we are not aware of any prior proposal for the automatic generation of regular expressions in which human operators were used as a baseline. We made the source code of our system publicly available (https://github.com/MaLeLabTs/RegexGenerator) and deployed an implementation as a web app (http://regex.inginf.units.it).
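As anticipated in the feature list above, separately learned sub-expressions are joined with or operators into one final regular expression. The following minimal Java sketch shows only this composition step; the two date sub-expressions are invented here purely for illustration and are not output of the system.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class OrComposition {
    public static void main(String[] args) {
        // Hypothetical sub-expressions, each learned on a subset of the examples
        // by a separate-and-conquer procedure (made up here for illustration).
        List<String> learned = List.of(
                "\\d{1,2}/\\d{1,2}/\\d{4}",   // dates like 3/14/2015
                "\\d{4}-\\d{2}-\\d{2}"        // dates like 2015-03-14
        );
        // Join the sub-expressions with the or operator; each one is wrapped in
        // a non-capturing group so quantifiers cannot leak across alternatives.
        String joined = learned.stream()
                .map(r -> "(?:" + r + ")")
                .collect(Collectors.joining("|"));
        Matcher m = Pattern.compile(joined).matcher("due 3/14/2015, shipped 2015-03-20");
        while (m.find()) {
            System.out.println(m.group()); // prints both dates despite different formats
        }
    }
}
```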

2 RELATED WORK

In this section we discuss further proposals, beyond those already covered in the introduction, that may help place our work in perspective with respect to the existing literature.

As pointed out by [10], the learning of regular expressions for information extraction prior to the cited work focused on scenarios characterized by alphabet sizes much smaller than those found in natural language text. Rather than attempting to infer patterns over the text to be extracted, the usual approach consisted in learning patterns over tokens generated with various text processing techniques, e.g., POS tagging, morphological analysis, gazetteer matching [21], [22], [23]. An attempt at learning regular expressions over real text was proposed in [24]. The cited work considered reduced forms of regular expressions (a small subset of POSIX rules) and, most importantly, considered a simple classification problem consisting in the detection of HTML lines with a link to other web documents. Text classification and text extraction are related but different problems, though. The former assumes an input stream segmented into units to be processed one at a time; one has to detect whether a given input unit contains at least one interesting substring. The latter requires instead the ability to identify, in the (possibly very long) input stream, the boundaries of all


the relevant substrings, if any. Furthermore, text extraction usually requires the ability to identify a context for the desired extraction, that is, a given sequence of characters may or may not have to be extracted depending on its surroundings. Interestingly, the approach in [15] was developed for extraction but also delivered better results than [24] in classification. Further proposals for addressing classification problems have been developed but are tailored to very specific scenarios; recent examples include email spam campaigns [25], [26] and clinical symptoms [27].

There have been other proposals for regular expression learning aimed at information extraction from real text, specifically web documents [28]. The cited work provides an accuracy in URL extraction from real web documents that is quite low—the reported value for F-measure being 27% (on datasets that are not public). In this respect, it is useful to observe that the latest proposal [15] obtained accuracy well above 90% on the 12 datasets considered; moreover, two of those datasets were used also in [10], [13], and in those cases it obtained similar or much better accuracy with a training set smaller by an order of magnitude.

The problem of learning a regular expression from examples of the desired extraction behavior could be seen as a very specific problem in the broader category of programming by examples, where a program in a given programming language is to be synthesized based on a set of input-output pairs [29]. In particular, the problem is an underspecified task [30] in the sense that there may usually be many different solutions whose behavior on the training data is identical while their behavior on unseen data is different. The cited work considers the generation of regular expressions for classification tasks on phone numbers, dates, email addresses and URLs—tasks that are considered to be tricky even for expert developers and to lack an easy-to-formalize specification. It advocates the writing of solutions by several expert developers based on some examples, an assessment of their behavior on unseen data made in crowdsourcing, and an evolutionary optimization of the available solutions based on the feedback from the crowd. Our proposal generates a regular expression in a fully automatic way. Furthermore, we assess our work on datasets that are orders of magnitude larger than those considered in [30] and on tasks that it seems fair to call much more challenging. Of course, we make these observations in an attempt to clarify our proposal and by no means do we intend to criticize the cited work: besides, the cited work investigates the possibility of crowdsourcing difficult programming tasks and is not meant to propose a method for the automatic generation of regular expressions from examples. It is useful to observe, though, that the authors of the cited work were not aware of any approach suitable for learning regular expressions capable of handling the large alphabet sizes occurring in real-world text files, while such functionality was demonstrated in [13], [14], [15].

As pointed out above, learning a program from examples of the desired behavior is an intrinsically underspecified task—there might be many different solutions with identical behavior over the examples. Furthermore, in practice, there is usually not even any guarantee that a solution which perfectly fits all the examples actually exists.

The common approach for addressing this issue, which is also our approach, aims at a heuristic balance between generalization and overfitting: we attempt to infer from the examples what the actual desired behavior is, without insisting on obtaining perfect accuracy on the training set. It may be worth mentioning that coding challenges exist (and occasionally become quite popular in programming forums) which are instead aimed at overfitting a list of examples [31], [32]. The challenge¹ consists in writing the shortest regular expression that matches all strings in a given list and does not match any string in another given list. Our proposal is not meant to address these scenarios. From the point of view of our discussion, scenarios of this sort differ from text extraction in several crucial ways. First, they are a classification problem rather than an extraction problem. Second, they place no requirements on how strings not listed in the problem specification should be classified—e.g., strings in the problem specification followed or preceded by additional characters. Text extraction requires instead a form of generalization, i.e., the ability to induce a general pattern from the provided examples.

Finally, we mention a recent proposal for information extraction from examples [33]. The cited work describes a powerful and sophisticated framework for extracting multiple different fields automatically from semi-structured documents. As such, the framework encompasses a much broader scenario than our work. A tool implementing this framework has been publicly released as part of Windows PowerShell². The tool does not generate a regular expression; instead, it generates a program in a specified algebra of string processing operators that is to be executed by a dedicated engine. We decided to include this tool in our experimental evaluation in order to obtain further insights into our results.

¹ https://www.google.it/search?q=regex+golf
² Windows Management Framework 5.0 Preview, November 2014.

3 SCENARIO

We are concerned with the task of generating a regular expression which can generalize the extraction behavior represented by some examples, i.e., by strings annotated with the desired portions to be extracted. In this section we define the problem statement in detail, along with the notation which will be used hereafter.

We focus on the regular expression implementation provided by the Java standard libraries. A deep comparison of different flavours of regular expressions is beyond the scope of this paper [34], yet it is worth mentioning that Java regular expressions provide more constructs than POSIX extended regular expressions (ERE)—e.g., lookarounds (see Section 4.1.1)—which allow patterns to be defined in a more compact form.

3.1 Definitions

A snippet $x_s$ of a string $s$ is a substring of $s$, identified by its starting and ending index in $s$. For readability, we refer to snippets using their textual content followed by their starting index as a subscript—e.g., $\text{ex}_5$, $\text{extra}_5$ and $\text{traction}_7$ are three different snippets of the string text extraction. We denote


by $\mathcal{X}_s$ the set of all the snippets of $s$. Let $x_s, x'_s \in \mathcal{X}_s$. A total order is defined among snippets in $\mathcal{X}_s$ based on their starting index: $x_s$ precedes $x'_s$ if the starting index of the former is strictly lower than the starting index of the latter. We say that $x_s$ is a supersnippet of $x'_s$ if the index interval of $x_s$ strictly contains the index interval of $x'_s$: in this case, $x'_s$ is a subsnippet of $x_s$. Finally, we say that $x_s$ overlaps $x'_s$ if the intersection of their index intervals is not empty. For instance, $\text{ex}_1$, $\text{ex}_5$, $\text{extra}_5$ and $\text{traction}_7$ are snippets of the string text extraction: $\text{extra}_5$ is a supersnippet of $\text{ex}_5$ (but not of $\text{ex}_1$), and $\text{extra}_5$ precedes and overlaps $\text{traction}_7$. A regular expression $r$ applied on a string $s$ deterministically extracts zero, one or more snippets. We denote the (possibly empty) set of such snippets, which we call extractions, by $[\mathcal{X}_s]_r$.
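These relations reduce to simple index arithmetic. The following minimal Java sketch is our own illustration (not code from the system); it encodes a snippet as an index interval over $s$, taken here as half-open for convenience since the paper leaves the exact convention implicit.

```java
// A snippet as a half-open index interval [start, end) over a string s.
record Snippet(String s, int start, int end) {
    String content() { return s.substring(start, end); }

    // Total order: a snippet precedes another if its starting index is strictly lower.
    boolean precedes(Snippet o) { return start < o.start; }

    // Supersnippet: the index interval strictly contains the other interval.
    boolean isSupersnippetOf(Snippet o) {
        return start <= o.start && o.end <= end && (end - start) > (o.end - o.start);
    }

    // Overlap: the intersection of the index intervals is not empty.
    boolean overlaps(Snippet o) { return start < o.end && o.start < end; }
}
```

With this encoding, $\text{extra}_5$ corresponds to new Snippet("text extraction", 5, 10), which indeed both precedes and overlaps $\text{traction}_7$, i.e., new Snippet("text extraction", 7, 15).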

3.2 Problem statement

The problem input consists of a set of examples, where an example $(s, X_s)$ is a string $s$ associated with a (possibly empty) set of non-overlapping snippets $X_s \subset \mathcal{X}_s$. String $s$ may be, e.g., a text line, or an email message, or a log file, and so on. Set $X_s$ represents the desired extractions from $s$, whereas snippets in $\mathcal{X}_s \setminus X_s$ are not to be extracted. Intuitively, the problem consists in learning a regular expression $\hat{r}$ whose extraction behavior is consistent with the provided examples—$\hat{r}$ should extract from each string $s$ only the desired extractions $X_s$. Furthermore, $\hat{r}$ should capture the pattern describing the extractions, thereby generalizing beyond the provided examples. In other words, the examples constitute an incomplete specification of the extraction behavior of an ideal and unknown regular expression $r^\star$. The learning algorithm should aim at inferring the extraction behavior of $r^\star$ rather than merely obtaining from the example strings exactly the desired extractions.

We formalize this intuition as follows. Let $E$ and $E^\star$ be two different sets of examples, both representing the extraction behavior of a target regular expression $r^\star$. The problem consists in learning, from only the examples in $E$, a regular expression $\hat{r}$ which maximizes its F-measure on $E^\star$, i.e., the harmonic mean of precision and recall w.r.t. the desired extractions from the examples in $E^\star$:

$$\mathrm{Prec}(\hat{r}, E^\star) := \frac{\sum_{(s, X_s) \in E^\star} \left| [\mathcal{X}_s]_{\hat{r}} \cap X_s \right|}{\sum_{(s, X_s) \in E^\star} \left| [\mathcal{X}_s]_{\hat{r}} \right|}$$

$$\mathrm{Rec}(\hat{r}, E^\star) := \frac{\sum_{(s, X_s) \in E^\star} \left| [\mathcal{X}_s]_{\hat{r}} \cap X_s \right|}{\sum_{(s, X_s) \in E^\star} \left| X_s \right|}$$

The greater the F-measure of $\hat{r}$ on $E^\star$, the more similar the extraction behavior of $\hat{r}$ and $r^\star$. We call the pair of sets of examples $(E, E^\star)$ a problem instance. In our experimental evaluation we built several problem instances starting from quite complex target expressions $r^\star$ and strings consisting of real-world datasets (e.g., logs, HTML lines, Twitter posts, and alike). Of course, in a practical deployment of the system, set $E^\star$ is not available because the target expression $r^\star$ is not known.
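As a concrete reading of these definitions, the following Java sketch (our own illustration; the Example record and its "start:end" snippet encoding are hypothetical) counts extraction-level true positives for a candidate expression $\hat{r}$ and computes precision, recall and F-measure. The two examples anticipate the with-context scenario discussed in Section 3.2.1.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FMeasure {
    // An example: a string s plus its desired extractions, each encoded as
    // "start:end" (half-open index interval), a hypothetical encoding for brevity.
    record Example(String s, Set<String> desired) {}

    public static void main(String[] args) {
        List<Example> eStar = List.of(
                new Example("I have 12 dogs", Set.of()),
                new Example("Today is 7-12-11", Set.of("11:13"))); // the snippet 12
        Pattern rHat = Pattern.compile("\\d+"); // a deliberately naive candidate

        int tp = 0, extracted = 0, desired = 0;
        for (Example e : eStar) {
            Set<String> found = new HashSet<>();
            Matcher m = rHat.matcher(e.s());
            while (m.find()) found.add(m.start() + ":" + m.end());
            extracted += found.size();
            desired += e.desired().size();
            found.retainAll(e.desired()); // extractions that are also desired
            tp += found.size();
        }
        double prec = extracted == 0 ? 0 : (double) tp / extracted;
        double rec = desired == 0 ? 0 : (double) tp / desired;
        double f = (prec + rec) == 0 ? 0 : 2 * prec * rec / (prec + rec);
        System.out.printf("Prec=%.2f Rec=%.2f F=%.2f%n", prec, rec, f);
        // prints Prec=0.25 Rec=1.00 F=0.40: the naive candidate recalls the
        // desired 12 but also extracts three undesired number snippets.
    }
}
```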

3.2.1 Observations on the problem statement

We point out that characterizing the features of a problem instance which may impact the quality of a generated solution is beyond the scope of this paper. Assessing the difficulty of a given problem instance, either in general or when solved by a specific approach, is an important theoretical and practical problem. Several communities have long started addressing this specific issue, e.g., in information retrieval [35], [36] or in pattern classification [37], [38]. Obtaining practically useful indications, though, is still a largely open problem, in particular in evolutionary computing [39] as well as in more general search heuristics [40], [41].

A notable class of problem instances is the one which we call with context. Intuitively, these are the problem instances in which a given sequence of characters is the textual content of a snippet to be extracted and also the textual content of a snippet which is not to be extracted. For example, consider a problem instance with the two examples (I have 12 dogs, $\emptyset$) and (Today is 7-12-11, $\{12_{11}\}$). This problem instance is with context because the sequence of characters 12 is not to be extracted from the first example but is to be extracted from the second. The discriminant between the two cases is the portion of the string surrounding the sequence 12, that is, its context. Of course, similar scenarios could occur with respect to sequences of characters in the same example rather than in different examples—e.g., assuming an email message is an example, one might want to extract only the email addresses following a Reply-To: header name.
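The with-context scenario is exactly what lookaround operators (Section 4.1.1) address. The fragment below is a hand-written illustration of the Reply-To: case, not an expression produced by the system, and the address pattern is deliberately simplified.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContextDemo {
    public static void main(String[] args) {
        // Positive lookbehind: the header name must precede the match but is
        // not part of the extracted snippet.
        Pattern p = Pattern.compile("(?<=Reply-To: )[\\w.]+@[\\w.]+");
        String msg = "From: alice@example.com\nReply-To: bob@example.org\n";
        Matcher m = p.matcher(msg);
        while (m.find()) {
            System.out.println(m.group()); // prints only bob@example.org
        }
    }
}
```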

4 OUR APPROACH

Our approach is based on Genetic Programming (GP) [16]. GP is an evolutionary computing paradigm in which candidate solutions for a target problem, called individuals, are encoded as trees. A problem-dependent numerical function, called fitness, must be defined in order to quantify the ability of each individual to solve the target problem. This function is usually implemented by computing a performance index of the individual on a predefined set of problem instances, called the learning set. A GP execution consists of a heuristic and stochastic search in the solution space, looking for a solution with optimal fitness. To this end, an initial population of individuals is built, usually at random, and an iterative procedure is performed which consists in (i) building new individuals from existing ones using genetic operators (usually crossover and mutation), (ii) adding the new individuals to the population, and (iii) discarding the worst individuals. The procedure is repeated a predefined number of times or until a predefined condition is met (e.g., a solution with perfect fitness is found).

We carefully adapted the general framework outlined above to the specific problem of regular expression generation from examples. Our GP procedure is built upon our earlier proposal [15]—the numerous improvements were listed in the introduction. We describe this procedure in detail in the next sections: encoding of regular expressions as trees (Section 4.1.1), fitness definition (Section 4.1.2), construction of the initial population and its evolution for exploring the solution space (Section 4.1.3). Next, we describe our separate-and-conquer strategy (Section 4.1.4) and the overall organization of GP searches (Section 4.2).
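Before descending into the details, the following sketch restates the generic GP iteration described above in Java form. It is deliberately schematic: the Individual type, the genetic operators, the crossover probability and the fitness function are placeholders of our own, not the actual components of the system.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Schematic GP loop; concrete types and parameters are illustrative placeholders.
abstract class GpSearch<Individual> {
    abstract Individual randomIndividual();                 // for the initial population
    abstract Individual crossover(Individual a, Individual b);
    abstract Individual mutate(Individual a);
    abstract double fitness(Individual a);                  // lower is better here
    abstract Individual pickParent(List<Individual> pop);   // e.g., tournament selection

    Individual run(int popSize, int generations) {
        List<Individual> pop = new ArrayList<>();
        for (int i = 0; i < popSize; i++) pop.add(randomIndividual());
        for (int g = 0; g < generations; g++) {
            List<Individual> offspring = new ArrayList<>();
            for (int i = 0; i < popSize; i++) {
                // (i) build new individuals from existing ones via genetic operators
                Individual child = (Math.random() < 0.8)
                        ? crossover(pickParent(pop), pickParent(pop))
                        : mutate(pickParent(pop));
                offspring.add(child);
            }
            // (ii) add the new individuals, (iii) discard the worst ones
            pop.addAll(offspring);
            pop.sort(Comparator.comparingDouble(this::fitness));
            pop = new ArrayList<>(pop.subList(0, popSize));
        }
        return pop.get(0); // best individual found
    }
}
```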


4.1 GP search

We designed a GP search which takes a training set $T$ as input and outputs a regular expression $\hat{r}$. The training set is composed of tuples $(s, X_s^d, X_s^u)$, the components of each tuple being as follows: (i) a string $s$; (ii) a set of snippets $X_s^d$ representing the desired extractions from $s$; (iii) a set of snippets $X_s^u$ representing the undesired extractions from $s$, i.e., no snippet of $s$ overlapping a snippet in $X_s^u$ should be extracted. The training set $T$ must be constructed such that, $\forall s \in T$, (i) $X_s^d \cap X_s^u = \emptyset$, and (ii) snippets in $X_s^d \cup X_s^u$ must not overlap each other. The goal of a GP search is to generate a regular expression $r$ such that, $\forall s \in T$, $X_s^d = [\mathcal{X}_s]_r$. We recall that, from a broader point of view, the generated regular expression $r$ should generalize beyond the examples in $T$ (see Section 3.2).

4.1.1 Tree representation

In our proposal an individual is a tree which represents a regular expression $r$. Each node in a tree is associated with a label, which is a string representing basic components of a regular expression that are available to the GP search (discussed in detail below). Labels of non-leaf nodes include the placeholder symbol A: each child of a node is associated with an occurrence of the symbol A in the label of that node. The regular expression represented by a tree is the string constructed by means of a depth-first post-order visit of the tree. In detail, we execute a string transformation of the root node of that tree. The string transformation of a node is a string obtained from the node label where each A symbol is replaced by the string transformation of the associated child. Figure 1 shows two examples of tree representations of regular expressions.

Available labels are divided into two sets: a set of predefined labels which represent regular expression constructs, and a set of $T$-dependent labels constructed as described below. In other words, the GP search explores a space composed of candidate solutions assembled from general regular expression constructs and from components constructed before starting the GP search by analyzing the provided examples—this procedure was not present in [15]. The rationale for $T$-dependent labels consists in attempting to shrink the size of the solution space by identifying those sequences of characters which occur often in the desired extractions (or "around" them) and making these sequences available to the GP search as unbreakable building blocks. For instance, in the task of generating a regular expression for extracting URLs, the string http could be a useful such block.

Predefined labels are the following: character classes (\d, \w), predefined ranges (a-z, A-Z), digits (0, . . . , 9), predefined characters (\., :, ,, ;, =, ", ', \\, /, \?, \!, \}, \{, \(, \), \[, \], @, #), the concatenator (AA), sets of possible and impossible matches ([A], [^A]), possessive quantifiers (A*+, A++, A?+, A{A,A}+), the non-capturing group ((?:A)), and lookarounds, i.e., lookbehind and lookahead ((?<=A)A, A(?=A)).
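To illustrate the string transformation just described, here is a minimal Java reconstruction of the tree-to-string visit, based only on the description above rather than on the system's actual code. For readability the placeholder is the literal character A, whereas the real system presumably uses a special token that cannot collide with labels such as A-Z.

```java
import java.util.List;

public class RegexTree {
    // A node of an individual: a label possibly containing placeholder 'A'
    // symbols, one per child, plus the list of children.
    record Node(String label, List<Node> children) {}

    // String transformation: replace each 'A' in the label with the
    // transformation of the corresponding child, recursing depth-first.
    static String transform(Node n) {
        StringBuilder sb = new StringBuilder();
        int child = 0;
        for (char c : n.label().toCharArray()) {
            if (c == 'A') sb.append(transform(n.children().get(child++)));
            else sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The tree for \d++, built from the labels listed above: the possessive
        // quantifier "A++" applied to the character class "\d".
        Node leaf = new Node("\\d", List.of());
        Node root = new Node("A++", List.of(leaf));
        System.out.println(transform(root)); // prints \d++
    }
}
```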
