Chapter 1

Automatic Web Data Extraction based on Genetic Algorithms and Regular Expressions

David F. Barrero, David Camacho and Maria D. R-Moreno

David F. Barrero, Departamento de Automática, Universidad de Alcalá, Madrid, Spain, e-mail: [email protected]
David Camacho, Computer Science Department, Universidad Autónoma de Madrid, Madrid, Spain, e-mail: [email protected]
Maria D. R-Moreno, Departamento de Automática, Universidad de Alcalá, Madrid, Spain, e-mail: [email protected]

Abstract Data Extraction from the World Wide Web is a well-known, unsolved, and critical problem in the design of complex information systems. The problem concerns the extraction, management and reuse of the huge amount of Web data available. These data usually show high heterogeneity, volatility and low quality (i.e. format and content mistakes), so it is quite hard to build reliable systems. In this chapter we provide an updated revision of the state of the art on Web Data Extraction, and an Evolutionary Computation approach, based on Genetic Algorithms and Regular Expressions, to the problem of automatically learning software entities. These entities, also called wrappers, are able to extract certain kinds of Web data structures from examples.

1.1 Introduction

The integration of information needs flexible and scalable mechanisms to obtain the necessary data from the available sources. However, if these sources are not relational, or no design has previously been made by an expert (i.e. a database designer), it is usually quite hard to build and maintain those mechanisms. In the World Wide Web, considered as a highly heterogeneous data source, this situation is a critical issue. Web Data Extraction (WDE) is a well-known and unsolved problem. It is related to the extraction, management and reuse of the huge amount of Web data available.


These data usually show high heterogeneity, volatility and low quality. One popular approach to address this problem is based on the concept of wrappers. Wrappers are specialized programs that automatically extract data from documents and convert the stored information into a structured format. Three main functions need to be implemented in a Web wrapper. First, it must be able to download HTML pages from a website. Second, it must search for, recognize, and extract the specified data. Third, it has to save these data in a suitably structured format to enable further manipulation. The second function, related to a pattern-matching problem, is a hot topic for a wide number of research areas such as Artificial Intelligence (AI), Information Extraction (IE), Databases, Intelligent Agents (IA) or the Semantic Web (SW). These areas have provided interesting results and advances in several topics such as machine learning algorithms, automatic (or semi-automatic) Web-based data extraction tools, ontology management (browsing, interaction, extraction), Web Services (orchestration, composition), Service Oriented Architectures, etc. Due to the complex features of the Web, there is still important open work in this area.
The apparently easiest approach to building wrappers is the hand-made method. That is, a set of patterns is identified by the programmer and then these patterns are used to extract the data from a particular site. However, this is a poor approach because changes in the document make the wrapper fail; it has been found impractical because it is too difficult to update and maintain the wrappers. Another kind of approach to easily building wrappers relies on heuristics to guide the implementation. For instance, Ashish and Knoblock [3] propose an approach that uses heuristics to locate HTML tags that presumably surround an interesting attribute. Hsu et al. [17] proposed a template-based approach that also takes advantage of HTML tags. However, these approaches only cover a small proportion of Web pages, and it is difficult to extend their heuristics to other kinds of Web pages. Other more sophisticated approaches use Machine Learning techniques to learn the wrappers. A revision of these types of techniques and the tools available in the market is given in the next section.
The main contribution of this work is a novel approach to WDE based on Genetic Algorithms (GA), which are used to automatically evolve wrappers. The main difference with other close approaches is the use of Regular Expressions. A Regular Expression, or simply regex, is a powerful way to identify a pattern in a particular text. Any regex is written in a formal language (translated into a particular syntax such as POSIX or Perl) and later processed by a regex engine (e.g. Perl, Ruby or Tcl). Regular expressions are used by many text editors, utilities, and programming languages to search and manipulate text based on patterns.
Our approach considers the basic (evolved) regex as the atomic extraction element. The representation, genetic operators and fitness function are designed in order to obtain simple extraction elements that are later used, shared, and integrated by a set of information extraction agents (we are using a multi-agent platform named
Searchy [6] to deploy and test the evolved regex). This approach raises two important questions. First, how the regex can be represented, taking into account its particular features (i.e. vocabulary, syntax, grammar and the semantic relationship between the grammatical syntax and the patterns it can extract). Second, how, once a particular individual is found, it can be combined (or integrated) with others to build a new data extraction element (regex).
This chapter is structured as follows. Section 1.2 revises the state of the art on tools and techniques for WDE. Then, section 1.3 describes the basic concepts of GA and their application to wrappers and regular expressions. Section 1.4 describes how a variable-length population of agents can support the evolution of regular expressions. Section 1.5 shows how a specific information agent uses simple grammatical rules to combine and integrate the evolved atomic regular expressions. Section 1.6 shows the experimental results obtained for a set of Web documents. Finally, some conclusions and future lines of work are outlined.

1.2 Related Work

This section aims to provide an overview of the existing tools and techniques for wrapper generation. We have divided the section into two subsections. The first one considers the commercial and non-commercial tools available in the market. The second subsection explains the algorithms (i.e. Machine Learning) used by most of these tools.

1.2.1 Tools

Currently it is possible to find more than thirty tools that aid in the wrapper generation process [9, 20, 21]. Usually, these tools display the HTML page so that the user can select (mark) the piece of information to be extracted, or they take a set of examples (HTML pages) given by the user. The tool generates a wrapper (usually one or several patterns used to identify the beginning and the end of the data to be extracted, or a set of extraction rules) that is able to parse the information. Some of the non-commercial tools are:
• DEByE [25] (www.lbd.dcc.ufmg.br/debye/) is a tool for extracting semi-structured data from Web sources. With DEByE, the extraction process is fully based on examples specified by the user. To assist the user with the specification of the examples, DEByE features a graphical user interface that adopts a visual paradigm based on nested tables. The examples specified by the user are then used to generate extraction patterns. The extractor module plus the extraction patterns implement a wrapper. It exports the extracted data in XML format.


• W4F [32] (db.cis.upenn.edu/research/w4f.html) allows the user to select the relevant data in a page using a DOM representation of the page. Then, it automatically generates the rule code for each selection. The result is a Java class file. The tool is composed of three layers: the extraction layer, the retrieval layer and the mapping rules layer. The mapping rules specify the binding between the proprietary internal data representation and the output notation. The advanced user can specify his/her own mapping rules.
• Ariadne [2] (www.isi.edu/ariadne) is a system for building mediators or wrappers that integrate data from multiple sites. Whenever a new semi-structured Web source of interest has to be incorporated into a mediator application, the tool is able to build a wrapper with minimal effort using the wrapper generation toolkit. The wrapper determines which sources can be used to answer the query, retrieves information from these sources, and presents the integrated results to the user. To do that, it integrates an AI planner that specifies which data processing operations are going to be needed, in what order, and from which sources.
• XWrap Elite [28] (www.cc.gatech.edu/projects/disl/XWRAPElite/) is a software toolkit for the generation of XML-enabling wrapper programs for Web sources. The wrapper programs generated by the tool can transform an HTML document into an XML document and deliver the extracted data content in XML format using a DTD. In this tool, the user is presented with a syntax tree that describes the HTML structure of the page the user is interested in. The nodes of this tree correspond to HTML tags. For each portion of the tree, the tool applies a special predefined set of rules.
• We can also mention WebL [19], RoadRunner [8], JEDI [18], the Garlic project (http://www.almaden.ibm.com/cs/garlic/adagency.html), NoDoSE [1], the University of Maryland Wrapper Generation Project [11], TSIMMIS [12] or LAPIS [21].
Among the commercial systems, only some well-known tools will be described:
• Lixto [5] (http://www.lixto.com/) defines its own rule representation language, named Elog. However, the wrapper designer does not have to deal with Elog. The wrapper designer can label example instances directly on the web page; it is not necessary to work inside the HTML source or the tree representation that it generates. The user interacts with the tool through a visual and interactive interface where he can highlight extracted instances on sample page(s), add further conditions to restrict matched instances, or add filters (rules) to a pattern to enlarge the number of matched instances for a particular pattern.
• AgentBuilder [3] (http://www.agentbuilder.com/) is based on the Ariadne project. It consists of two major components: the Toolkit and the Run-Time System. The AgentBuilder Toolkit includes tools for managing the agent-based software development process, analyzing the domain of agent operations, designing and developing networks of communicating agents, defining behaviors of individual agents, and debugging and testing agent software. The AgentBuilder toolkit provides assistance for adding learning and planning capabilities to the agents.
• We can also mention TextPipe (www.datamystic.com/textpipe.html), Unwwwrap Search (www.extradata.com/), firstRain Studio (www.firstrain.com/), Internet
Macros (www.iopus.com/), ParserStudio (www.informatica.com/Pages/index.aspx), RoboSuite (www.kapowtech.com/), Visual Web Task (www.lencom.com/), WebDataKit (www.lotontech.com/) or NQL (www.nqltech.com/).

1.2.2 Machine Learning

It is possible to use Machine Learning algorithms to automatically learn the wrappers [22, 23, 26, 27, 31]. Automatic wrapper induction [22, 31] allows the wrappers to be automatically changed when the HTML content changes. Assistant tools, like Wien [13] (www.cs.ucd.ie/staff/nick/home/research/wrappers/wien/), implemented from Kushmerick's wrapper induction work [24], allow learning a wrapper that is able to extract the desired information. The wrappers are induced from a set of labeled training tuples: a set of marked (HTML) pages is given to Wien. However, the representation used by this inductive algorithm is too restrictive to wrap semi-structured pages with varying structuring patterns. Wien is also limited when treating missing components in a particular piece of information to be extracted, or when extracting complex data structures (which can be nested structures). These limitations have been overcome by other Machine Learning based wrapper systems like SoftMealy [16, 15] or Stalker [29, 30, 31]. Both systems generate wrappers automatically by generalizing a set of given examples. The main problem in SoftMealy is that the absence of a component must be represented beforehand by an example in order to allow proper recognition and extraction. Stalker can deal with such problems in a very flexible way, since each object component is extracted independently. Its wrappers are based on a set of disjunctive landmark automata. Each landmark automaton is specialized in extracting an attribute; therefore, to complete the extraction, wrappers in Stalker need to apply the landmark automata several times.
Other inductive algorithms [33] are used to look for patterns inside semi-structured HTML documents. Machine Learning algorithms have also been used to automatically maintain the wrappers. Maintaining wrappers involves two different issues: on the one hand, detecting when a wrapper is not correctly retrieving the data (wrapper verification); on the other hand, automatically recovering the wrapper by generating a new one that takes into account the possible changes in the Web source (wrapper re-induction) [23, 26, 27].

1.3 Genetic Algorithms and their application to wrappers and regular expressions

In this section we explain some basic concepts related to Genetic Algorithms and regex that will later be used to automatically obtain the wrappers.


1.3.1 Genetic Algorithms

GA are part of Evolutionary Computation, a computing paradigm inspired by the biological process of evolution [14]. Living beings have their attributes coded in a genome that can evolve through a set of natural genetic operators. Modifications in the genome usually generate negative attributes that lead to a lower survival probability and thus these negative attributes are not likely to be propagated. However, some new attributes can be positive and will be transmitted to the offspring. The result is the propagation of good quality genomes and thus the evolution of the whole population.
GA use the same mechanism adapted to a computational context. Individuals in a GA are not, of course, biological entities, but representations of a particular solution to a problem. The solution is coded in the chromosome and, as with biological individuals, there are genetic operators that modify the chromosomes. Using selection mechanisms inspired by natural selection, individuals with better chromosomes (i.e. the best solutions to the problem) have more chances to reproduce, and thus the whole population improves its genetic code through the successive generations.
From the AI point of view, a GA can be seen as a stochastic search algorithm where the chromosome encodes the search space and each individual is a point in that space. If the GA is successful, the individuals will explore the search space until a global solution is found and the population converges to that solution. The success or failure of a GA depends on four principal parameters: genome codification, genetic operators, selection strategy, and the fitness function. The following subsections explain each of these parameters in detail.

1.3.1.1 Genome codification

Genome codification is a key subject in any GA. Since each individual contains genetic information that codes a solution, a mechanism is needed to map the solution (phenotype) into the genetic code (genotype). Traditionally, the GA community has used a binary representation in which the solution is coded with a fixed-length sequence of binary digits. However, other approaches are also valid, such as integer or float representations and variable-length genomes.
Figure 1.1 represents an example of coding. The individual represented is the string [rc]at. It is composed of six characters whose genetic representation is a binary chromosome. Each attribute in the individual (each character) is coded by four bits in the chromosome. The piece of chromosome that codes one attribute is called a gene, so in the example one gene codes one character using four bits. The position of the gene, also named locus, determines the position of the attribute in the phenotype, i.e., the gene in position three codes the character in position three. It is important to point out that many different coding schemes can be used, from simple ones such as the one depicted in the example to complex codings involving variable-length genomes with marker-delimited genes, overlapping and competitive genes. It should
be noticed that different genotypes might represent the same phenotype, i.e. a solution can be represented by different chromosomes.

Fig. 1.1: Chromosome coding example
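To make the mapping of Figure 1.1 concrete, the short sketch below decodes a binary chromosome into a string phenotype using four bits per gene. The symbol table is a hypothetical assumption for illustration; the chapter does not specify the actual alphabet used in the figure.

def decode(chromosome, gene_bits=4):
    """Split the bit string into genes and map each gene to one character."""
    alphabet = ['[', ']', 'r', 'c', 'a', 't', 'b', 'd']   # hypothetical symbol table
    phenotype = []
    for locus in range(0, len(chromosome), gene_bits):
        gene = chromosome[locus:locus + gene_bits]
        phenotype.append(alphabet[int(gene, 2) % len(alphabet)])
    return ''.join(phenotype)

# Two different genotypes may decode to the same phenotype:
print(decode('000000100011000101000101'))   # -> '[rc]at'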

1.3.1.2 Genetic operators

Once the coding has been defined, it is necessary to modify the population's genetic code in order to explore the solution space. Genome modifications are done by the genetic operators. There are two main types of genetic operations: recombination and mutation. Recombination, also known as crossover, imitates biological sexual reproduction. It consists of interchanging the genetic code of two individuals. It is commonly accepted in the GA literature that recombination increases the adaptation of the population by joining up good pieces of chromosomes. Of course, this situation has a stochastic nature and will only occur in a limited number of recombinations; in a more general situation no significant genetic improvement is found, or it is even possible to produce worse offspring. A simple recombination algorithm is the one-point crossover, which interchanges two chunks of chromosomes by cutting them at a random point.
By means of recombination only a fraction of the search space can be explored, the one contained in the genome of the population; it is not possible to explore new areas of the search space. The mutation operator is used to overcome this limitation. Mutation introduces random changes in the genome that can generate new attributes in the phenotype not previously present. In a binary representation, the most used mutation operator inverts a single digit with a certain, usually quite small, independent probability. It should be noticed that mutation can have positive effects, but also negative ones.
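A minimal sketch of these two operators on binary chromosomes, assuming the small mutation probability that is later selected in section 1.6.1, could look as follows (illustrative only):

import random

def one_point_crossover(parent_a, parent_b):
    """Swap the tails of two equal-length bit strings at a random cut point."""
    point = random.randint(1, len(parent_a) - 1)
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

def mutate(chromosome, p=0.015):
    """Invert each bit independently with a small probability p."""
    return ''.join(('1' if bit == '0' else '0') if random.random() < p else bit
                   for bit in chromosome)

child_a, child_b = one_point_crossover('000000100011', '010101010101')
print(mutate(child_a), mutate(child_b))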

1.3.1.3 Selection strategy

Genetic operators have a stochastic nature; thus, it is not possible to ensure that their result will increase chromosome quality. On the contrary, there are high
probabilities of getting worse chromosomes. It is necessary to introduce a selection strategy in order to improve the population in the successive iterations of the GA. It is analogous to biological natural selection. The goal of the selection strategy is to generate selective pressure, i.e., to ensure that good chromosomes have a higher probability of reproducing than bad ones.
The natural question that arises is how the selection is performed. If the selection algorithm only selects the best individuals, some apparently bad genetic material is exterminated; however, a bad chromosome may contain potentially good chromosome parts that would then be prematurely lost. In GA terminology, excessive selective pressure might produce premature convergence to a local maximum with a low probability of escaping from it. Similarly, if the selective pressure is too low, the population is not forced to evolve and, thus, to improve.
There is an extensive literature on selection mechanisms [34]; however, the most common one is tournament selection of size N. In this mechanism, N individuals in the population are chosen and the best one is selected. Selective pressure can be adapted through the number of individuals in the tournament: as the number of individuals participating in the tournament increases, the probability of a low-fitness chromosome generating offspring decreases.
The goal of the selection strategy is to pressure the population to improve its individuals, increasing, generation after generation, the number of good chromosomes and reducing the number of bad chromosomes. However, there is not yet a clear, unambiguous meaning for "good" and "bad".
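A possible sketch of tournament selection of size N, assuming individuals are represented as (chromosome, fitness) pairs:

import random

def tournament_select(population, n=3):
    """Pick n random individuals and return the one with the highest fitness."""
    contenders = random.sample(population, n)
    return max(contenders, key=lambda individual: individual[1])

population = [('000000100011', 0.2), ('010101010101', 0.8), ('111100001111', 0.5)]
print(tournament_select(population, n=2))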

1.3.1.4 Fitness function

Goodness and badness are two fuzzy concepts that cannot be used in a scientific context without a precise definition. A GA defines good and bad using a fitness function, whose goal is to quantify how good or bad a chromosome is. It is a basic piece of any GA, and is usually one of the most challenging problems that must be faced in order to successfully implement a GA.
In some cases defining a fitness function is a trivial problem. For instance, a canonical problem in GA is to obtain an individual whose genotype is composed of a certain number of ones, using, of course, a binary representation. In this case the fitness function can simply count the number of ones: the closer the number of ones in the individual is to the desired number, the better the individual. In other problems, the definition of the fitness function is more complex. This is the case of the regex evolution.
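For the canonical case just mentioned, the fitness can be as simple as the following sketch, which returns the fraction of ones in the genotype:

def ones_fitness(chromosome):
    """Fraction of bits set to one in the binary genotype."""
    return chromosome.count('1') / len(chromosome)

print(ones_fitness('110100111011'))   # 8 ones out of 12 bits -> 0.666...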


1.3.2 Regular expressions

A regular expression is defined in Automata Theory as a characterization of a regular language (i.e. a language that can be recognized by a finite state machine). Regex are thus a way to detect sets of strings based on a pattern. The automaton is usually characterized by a string that follows a certain notation, such as the POSIX notation for regex. Along this text we will refer to the regex notation simply as regex. Programming languages (e.g. Perl) and scripting languages (e.g. the UNIX shell) use a regex engine that creates the automaton specified by the regex and is able to detect and extract the strings that conform to the pattern specified in the regex. More information about the implementation of regex can be found in [35].
From an applied point of view, regex are a powerful tool to define string patterns. Using regex makes it possible to manipulate strings according to a (potentially quite complex) pattern. A quite extended use of regex is to define sets of files in many user interfaces. The string rm *.jpg means, in a UNIX shell, delete all the files whose name ends in .jpg. Actually *.jpg is a regex representing the finite state automaton that accepts any string that ends in .jpg. In this case, the UNIX shell interprets the regex notation, generates the automaton and runs it.
The simplest regex that can be constructed is a sequence of characters. Any string matches the regex if it contains the same sequence of characters. For instance, the regex bcd matches the string abcde but does not match edcba. In this regex only the concatenation operator is involved; however, it is possible to build more complex -and expressive- regex by using more operators. For example, the union of characters can be expressed with []. The strings rat and cat are matched by [rc]at, while it does not match bat. To match the three strings rat, cat and bat, the regex [rcb]at should be used instead. Groups of strings and operators can be defined using parentheses. The regex (111)|(222) - 3333 matches 111 - 3333 as well as 222 - 3333. More complex compositions can be created by grouping strings. For example, it is possible to detect whether a certain URL begins with ftps, https or file using the regex ((ftp|http)s)|(file)://. This regex detects the strings ftp or http followed by an s, or the string file, followed by ://. Regex use some reserved characters, such as "(" or "[", so an escape character is defined that removes the special meaning of the following character. Using the escape character it is possible to match the string (000), for example, with the regex \(000\).
The Kleene closure is a powerful means to associate a cardinality to a part of a regex, i.e., a number of occurrences of any part of the regex. This operator boosts the expressivity of regex. Regex notations provide a range of operators that implement the Kleene closure: ? means zero or one appearance, + means one or more, and {n,m} means between n and m. In this way, it is possible to write a simple regex that matches the strings http and https, namely https?. This regex means that the string must contain the sequence of characters http, while the s may appear or not. Regex notations also define several special characters for the programmer's convenience. For example, the regex \d matches any single decimal digit while \w
matches any "word" character. The regex \d\d\d matches three consecutive digits, and it is equivalent to \d{3}. A complete description of the regex syntax and its practical use is presented in [10].
Many practical applications have been found for regex, especially in the UNIX community, which has a long experience using them. Actually, regex are a basic feature of shell commands like ls or grep and of some programming languages largely used by the UNIX community such as Perl or AWK. Regex are a powerful tool with a wide range of applications; however, the generation of regex is a tedious, error-prone and time-consuming task, especially when dealing with complex patterns that require complex regex. Reading and understanding a regex, even a moderately simple one, is far from an easy task. In order to ease regex generation, several assistant tools have been developed, but writing regex is still difficult. An automatic way to generate regex using Machine Learning techniques is a desirable goal that could exploit the potential that regex provide.
We propose a two-stage approach to the generation of regex. In the first stage a multi-agent system is used to evolve a variable-length regex able to extract data from documents that follow a known pattern. Then, a second stage uses two or more evolved specialized regex to compose a complex regex able to extract and integrate several types of data.
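The following lines check a few of the patterns discussed above using Python's re module; the chapter itself uses generic POSIX/Perl-style notation, so this is only an illustrative demonstration:

import re

print(re.findall(r'[rcb]at', 'the cat chased a rat, not a bat'))   # ['cat', 'rat', 'bat']
print(re.fullmatch(r'https?', 'http') is not None)                 # True: the s is optional
print(re.findall(r'\d{3}', 'codes 123 and 456'))                   # ['123', '456']
print(re.search(r'\(\d+\)\d+-\d+', 'call (555)123-4567').group())  # '(555)123-4567'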

1.4 How agents support data mining: variable-length population

The first stage of the data extraction mechanism proposed in this chapter deals with the automatic generation of a basic regex. The goal is to use supervised learning to automatically generate a regex in an evolutionary fashion. A multi-agent system (MAS) is used in order to generate basic regex able to extract information conforming to a pattern, for example phone numbers or URLs. The agents share a training set, composed of positive and negative examples, that is used to guide the evolutionary process until the regex is generated.
The extraction capabilities of a regex are closely related to its length, and the length of the regex is determined by the length of the chromosome. A traditional fixed-length GA introduces an arbitrary constraint on the size of the evolved regex that should be avoided: the GA should be able to self-adapt its genome length without human intervention. One solution might be the use of variable-length genomes; however, we were interested in an intrinsically parallel solution such as a MAS. Regex are generated by a MAS that evolves a variable-length genome, where each subset of agents uses a fixed-length genome, as shown in figure 1.2. Each agent runs a GA containing a population whose individuals own a chromosome of fixed length, and can evolve on its own with a high degree of independence, forming a microevolution. However, agents are not isolated and thus their populations are influenced by other populations by means of emigration: a part of the population
can emigrate from one agent to another agent each generation, so the evolution of one subpopulation is affected by the evolution of other agents. The result is that the total population of the MAS presents a macroevolution. Microevolution and macroevolution are different problems that must be addressed individually.

1.4.1 Macroevolution

The MAS is actually a way to implement a variable-length GA, in which the sets of agents containing populations of different chromosome lengths evolve as a whole. The mechanism that makes this macroevolution possible is the population interchange among agents. An agent containing a population with chromosomes of length n always clones a number of individuals into a population with chromosome size n + k, where k is the gene size. Thus, the genetic operation that modifies the length of the chromosome is performed when the individual is emigrating, adding a new chunk at a random position of the chromosome. The chunk inserted into the genome is a non-coding region, i.e., a chunk that codes an empty character and therefore does not affect the phenotype; otherwise the (potentially good) genetic properties of the individual might be lost.

Fig. 1.2: Population interchange among agents

An important issue is the selection process of the individuals that emigrate and the selection of the individuals that are replaced in the target population. Both selections are done using a tournament, with one difference: the tournament among the individuals that emigrate is won by the individual with the best fitness, while the tournament among the individuals to be replaced is won by the individual with the worst fitness. In this way, individuals with high fitness in a population have more chances to emigrate than individuals with a bad fitness. On the other hand, chromosomes with low fitness will likely be replaced by better immigrant chromosomes.
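A hedged sketch of this migration step is shown below; the non-coding gene code and the helper names are assumptions made for illustration, not the authors' actual identifiers, and the fitness values are toy numbers:

import random

EMPTY_GENE = '000'   # assumed code of the non-coding (empty) symbol (three-bit genes, as in section 1.6.2)

def insert_noncoding(chromosome, gene_bits=3):
    """Insert one non-coding chunk at a random gene boundary."""
    position = random.randrange(0, len(chromosome) + 1, gene_bits)
    return chromosome[:position] + EMPTY_GENE + chromosome[position:]

def migrate(source_pop, target_pop, tournament=3):
    """Copy a tournament winner from source into target, replacing a tournament loser."""
    emigrant = max(random.sample(source_pop, tournament), key=lambda ind: ind[1])
    loser = min(random.sample(target_pop, tournament), key=lambda ind: ind[1])
    target_pop.remove(loser)
    # The lengthened clone keeps its fitness: the extra gene codes an empty symbol.
    target_pop.append((insert_noncoding(emigrant[0]), emigrant[1]))

source = [('000100100011', 0.9), ('111100001111', 0.1), ('101010101010', 0.4)]
target = [('000100100011010', 0.3), ('111100001111000', 0.2), ('101010101010010', 0.6)]
migrate(source, target)
print(target)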


Although the agents in the MAS contain chromosomes of different sizes and send or receive a group of individuals each generation, every agent evolves its population using the same parameters, selection strategy and fitness function.

1.4.2 Microevolution

A fixed-length GA is run within each agent, evolving a population of chromosomes in a microevolution. It is a classical GA, as described in section 1.3.1, and thus it can be described in terms of genome codification, genetic operators, selection strategy and fitness function.
The GA implemented in the MAS uses a binary genome divided into several genes of fixed length. Each gene represents a symbol from an alphabet composed of a set of valid regular expression constructions. It is important to point out that the alphabet is not composed of single characters but of any valid regex. These simple regular expressions are the building blocks of all the evolved regex and cannot be divided; thus, we will call them atomic regex.
The genetic operators used in the evolution of regular expressions were mutation and crossover. Since the codification relies on a binary representation, the mutation operator is the common bit inversion, while the recombination is performed with a one-point crossover. These genetic operators do not modify the genome length; chromosomes modify their length only when an individual migrates to another agent. The selection mechanism used is tournament selection. Section 1.6.1 shows the details of these parameters and the procedure used to set them.
The fitness function that we have defined uses a training set composed of a positive and a negative set of examples. It takes the genotype of the individual and transforms it into the corresponding regex, then tries to match it against the positive and negative examples. The number of characters extracted from each positive example is divided by the total number of characters in the example, that is, it calculates the percentage of characters correctly extracted by the regex; this percentage is averaged over the positive examples. Finally, the fitness is calculated by subtracting the average proportion of false positives in the negative example set from the average proportion of characters correctly extracted.
From a formal point of view, the fitness function is defined as follows. Given a set of positive examples, P, with M elements, and a set of negative examples, Q, with N elements, let p ∈ P be an element of P, and q ∈ Q an element of Q. We define Ω = {ω0, ω1, ..., ωn | ωi ∈ P ∪ Q, n = N + M} as the set of elements contained in P and Q; therefore any element of P or Q also belongs to Ω. Let us call R the set of all regular expressions, and r an element of R. Then we can define a function ϕr(ω) : Ω × R −→ N as the number of characters of ω that are matched by the regex r. The fitness function F : R −→ [−1, 1] is then defined as:

F(r) = (1/M) ∑_{pi ∈ P} ϕr(pi)/|pi|  −  (1/N) ∑_{qi ∈ Q} Mr(qi)        (1.1)

where |pi| is the number of characters of pi, and Mr(qi) is:

Mr(qi) = 1 if ϕr(qi) > 0;  Mr(qi) = 0 if ϕr(qi) = 0        (1.2)

Since true positives are computed at the character level while false positives have no intermediate values, the fitness function presents an intrinsic bias and is therefore more sensitive to false positives than to true positives. It is important to point out that the maximum fitness an individual can achieve is 1, and it is attained when all the positive examples have been completely retrieved and no negative example has been matched. A chromosome that obtains a fitness of 1 will be called an ideal chromosome.
From the evolution of each specialized MAS we obtain a basic regex, e.g. one able to extract a URL or a phone number. Each MAS requires a training set and, eventually, an appropriate alphabet of atomic regex. Once the basic regex have evolved, it is possible to build more complex regex in the second stage of the extraction process.
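A possible implementation of the fitness of equations (1.1) and (1.2), using Python's re module as the regex engine, is sketched below; it is consistent with the definitions above but is not the authors' code:

import re

def matched_chars(regex, text):
    """phi_r(w): number of characters of the example covered by the regex."""
    try:
        return sum(m.end() - m.start() for m in re.finditer(regex, text))
    except re.error:
        return 0   # syntactically invalid individuals simply match nothing

def fitness(regex, positives, negatives):
    true_part = sum(matched_chars(regex, p) / len(p) for p in positives) / len(positives)
    false_part = sum(1 for q in negatives if matched_chars(regex, q) > 0) / len(negatives)
    return true_part - false_part   # in [-1, 1]; 1 corresponds to an ideal chromosome

positives = ['(555)123-4567', '(91)885-6920']
negatives = ['not a phone', '12 monkeys']
print(fitness(r'\(\d+\)\d+-\d+', positives, negatives))   # 1.0: ideal regex
print(fitness(r'\d+', positives, negatives))              # lower: penalized by a false positive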

1.5 Composition of basic regex

We use the grammatical rules provided by the regex notation in a composition agent to integrate the basic regex evolved by each evolutive MAS. A rule database within the agent, manually created, contains the rules, and each rule integrates two or more basic regex. The agent applies the grammatical rules to the input regex, obtaining a set of composed, potentially complex, regex. Since these may not be suitable to extract information, the regex created by the grammatical rules have to be filtered in order to select the valid ones. We use the traditional data mining F-measure to automatically evaluate their extraction capabilities and select the composed regex.
A graphical representation of the composition process can be seen in figure 1.3, where a composition of two basic regex is implemented. The composition agent takes two basic (evolved) regex from the output of two evolutive MAS (see section 1.4.2) and applies a set of grammatical rules stored in a database to generate a set of composed regex. Suppose that the composition agent takes http://\w+\.\w+\.com and \(\d+\)\d+-\d+ as basic regex, and we wish to compose them using a subset of regex operators, for example |, (, ), + and ?. Then it is possible to define a grammatical rule database in the composition agent such as:

Rule 1: X|Y
Rule 2: XY
Rule 3: X+Y?
Rule 4: (Y)+|((X)+|foo)


Fig. 1.3: Composition agent architecture

Where X and Y are http://\w+\.\w+\.com and \(\d+\)\d+-\d+, respectively. The composition agent applies the grammatical rules to the input regex, generating the following set of composed regex:

Composed 1: http://\w+\.\w+\.com|\(\d+\)\d+-\d+
Composed 2: http://\w+\.\w+\.com\(\d+\)\d+-\d+
Composed 3: http://\w+\.\w+\.com+\(\d+\)\d+-\d+?
Composed 4: (\(\d+\)\d+-\d+)+|((http://\w+\.\w+\.com)+|foo)

Since the regex composition uses a brute-force approach, not all the composed regex are expected to correctly extract data. Therefore it is necessary to select at least one valid regex. This is done based on the retrieval capacity of each regex, evaluating the generated regex by calculating the F-measure over a dataset composed of several documents (see equation 1.3). This measure is based on the weighted harmonic mean of the classical Information Retrieval Precision (P) and Recall (R) values [4, 36]. Of course, other extraction quality measures such as Fβ or E are also valid. Based on these quantitative measures, automatic selection of the best composed regex is possible.

F-measure = 2PR / (P + R)        (1.3)
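Putting the composition stage together, the sketch below applies rule templates to the two basic regex and scores each candidate with the F-measure over a labelled document; the rule templates, the toy document and the record list are illustrative assumptions, not the actual rule database or dataset used in the experiments:

import re

RULES = ['({x})|({y})', '{x}{y}', '({x})+|(({y})+|foo)']   # assumed rule templates over X and Y

def f_measure(regex, document, records):
    """F-measure of the strings extracted by the regex against the labelled records."""
    found = {m.group(0) for m in re.finditer(regex, document)}
    true_pos = len(found & set(records))
    precision = true_pos / len(found) if found else 0.0
    recall = true_pos / len(records) if records else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def best_composition(x, y, document, records):
    candidates = [rule.format(x=x, y=y) for rule in RULES]
    return max(candidates, key=lambda r: f_measure(r, document, records))

x = r'\(\d+\)\d+-\d+'                 # evolved phone regex
y = r'http://\w+\.\w+\.com'           # evolved URL regex
doc = 'Call (555)123-4567, visit http://www.example.com, or just say foo'
records = ['(555)123-4567', 'http://www.example.com']
print(best_composition(x, y, doc, records))   # the OR composition obtains the best F-measure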

1.6 Experimental evaluation

To perform the experimental evaluation we have selected a classical knowledge extraction case study: the extraction of phone numbers and URLs. The goal of the experiments is to retrieve a set of URLs and phone numbers by using a composition of evolved regex, evaluating its performance with precision, recall and the F-measure. The experiment has been divided into three stages with different goals. The first stage is the setup of the experiment, in which several tests were carried out in order to set the basic GA parameters needed for the second stage: the regex evolution using a MAS. Finally, some grammatical rules are applied to the evolved regex, and the results are evaluated in order to obtain a composed regex.

1.6.1 Experiment setup

Some initial experiments were carried out to acquire knowledge about the behavior of the regex evolution. The results of these experiments are used in the configuration of the evolutive MAS. To achieve this goal a single GA was used with the same configuration required by the MAS (see section 1.3); the exception is the chromosome length, which was fixed to 12 bits. Due to the lack of space, we only show the most significant results for the GA parameters: the influence of the mutation rate and the tournament size.
The random nature of the GA generates some difficulties: different initial populations may lead to different results in successive runs. To limit this effect, which could provide inaccurate results, each experiment was run 200 times with different, randomly created, initial populations. The results shown in this section are the average values over all the runs. The GA parameters used in the setup experiments were an initial population of fifty individuals, a mutation rate of 0.01, one-point crossover with a probability of 0.8 and, finally, a tournament selection of size two.
A critical parameter is the mutation rate. Figure 1.4 depicts how the average fitness depends on the mutation rate in the two study cases included in the experimental evaluation; the standard error is also included (shown as vertical lines). Two different problems are represented in figure 1.4, yet the behaviors of the average fitness are correlated. Very low mutation probabilities are associated with a low average fitness; the fitness then increases until a maximum in the range of 0.01 to 0.02, after which it begins to decrease slowly. The conclusion is that there is an optimum range of mutation probability between 0.01 and 0.02, so an average value of 0.015 is selected for both study cases.
Another fact that should be remarked is the high standard deviation of the average fitness, which means that the different runs of the experiment yield quite different average fitness values. This variability shows an intrinsic attribute of the fitness function: it is a rough fitness function. Small variations in the evaluated chromosome lead to high fitness variations.

Fig. 1.4: Relationship between mutation probability and the average fitness in URL and phone regex evolution

In practical terms, a rough fitness means that the fitness landscape presents many local maxima.
The variation of the tournament size shows an interesting behavior. Figure 1.5 (a) shows the average fitness of the phone number regex evolution for different tournament sizes, ranging from two to ten. As the selective pressure is increased (by increasing the tournament size), the average fitness increases. A complementary view of this fact is shown in figure 1.5 (b), which represents the influence of the selective pressure on the total number of ideal chromosomes: if the selective pressure is increased, the number of ideal individuals decreases. Selective pressure forces the elimination of low-fitness individuals that may contain potentially good pieces of chromosome, so there is a trade-off between a selective pressure that increases the average fitness and the probability of finding an ideal individual. For this reason, a tournament size of three has been selected for the second stage of the experimental evaluation.
Experiments run using different recombination mechanisms and probabilities did not show remarkable differences; however, a one-point crossover with a recombination probability of 0.8 was selected, as it is a common value in the GA literature. Variation in the size of the population shows the usual GA behavior: with small populations, a small increment in the population size increases the convergence speed and the average fitness, while with big populations a further increase has little effect on the evolution but a high impact on the execution time. Experiments show that a population of fifty individuals is a good trade-off between convergence speed and computational resources.

(a): Influence of the tournament size on the phone regex average fitness. (b): Influence of the tournament size on the phone regex ideal individuals.
Fig. 1.5: Influences in the tournament selection process for both problems.

1.6.2 Results: regex evolution

The second experimental stage uses a MAS with subsets of agents evolving populations of chromosomes with different lengths. Six subsets of agents have been used, with chromosome sizes of 6, 9, 12, 15, 18 and 21 bits organized in genes of three bits. The emigration strategy was set as described in section 1.4. The agents in the MAS run a GA with the parameters obtained in section 1.6.1, following the same experimental procedure.
Figure 1.6 depicts the evolution of the average fitness of each subset of agents (each one with the same chromosome length) for the phone number regex evolution. Since the results for the URL regex evolution are analogous, we do not include a figure for them. The close relationship between the convergence speed and the chromosome size should be noticed: the longer the chromosome, the longer it takes to converge, because the chromosome codes a solution in a bigger search space. Another fact that influences the difference in convergence speed is the limited speed of propagation of good chromosomes along the MAS: a good chromosome of size 6 takes at least one generation to migrate to an agent storing a population of size 9, and so on.
Another interesting fact shown in figure 1.6 (a) is the different fitness convergence values of the agents for the phone number regex evolution. A population with a chromosome length of 6 bits presents a fast convergence to 0.54. With 6 bits only very small phenotypes can be represented, just two symbols, so only part of the examples can be extracted, yielding a maximum fitness of 0.54. This fact is also found in populations of size 9, but with less dramatic effects. The populations with chromosomes of length 12 present a very important growth in the fitness; the reason is that the shortest ideal regex can be coded in a chromosome of 12 bits.

(a): Average fitness. (b): Ideal individuals.
Fig. 1.6: Average fitness and ideal individuals for the phone number regex evolution

The populations with longer chromosomes can also contain ideal individuals; however, this is because of the presence of non-coding regions in the chromosomes, which lets them represent the same phenotype as chromosomes of 12 bits. Figure 1.6 (b) shows the number of ideal chromosomes evolved by the different subsets of agents for the phone number regex evolution. It can be seen that there are no ideal individuals in populations of size 6 and 9, for the reasons exposed above: a chromosome needs at least 12 bits to code an ideal individual, as can be seen in the figure. Longer chromosomes can also represent ideal individuals by incorporating non-coding regions. Actually, the migration mechanism tends to increase this type of chromosome in the population, in a behavior similar to the genome bloat observed in variable-length genomes [7].
Some evolved regex and their associated fitness can be found in table 1.1. Of course, the regex with the best fitness were selected to pass to the composition stage; in this case they were http://\w+\.\w+\.com and \(\d+\)\d+-\d+.

Table 1.1: Evolved regular expressions

Evolved regex (URL)             Fitness    Evolved regex (Phone)    Fitness
http://-http://http://          0          \w+                      0
conwww\.http://com-www\.com     0          \(\d+\)                  0.33
/\w+\.                          0.55       \(\d+\)\d+               0.58
http://\w+\.\w+\                0.8        \(\d+\)\d+-\d+           1
http://\w+\.\w+\.com            1


1.6.3 Results: regex composition

Two basic regex have been selected for the third evaluation stage, where they are composed and the performance is measured using precision, recall and the F-measure. The composition agent uses the set of grammatical rules shown in table 1.2. Two of the rules are not actually compositions: they are just the basic regex that serve as input, which makes it possible to evaluate the extraction capabilities of each basic regex alone. The other two rules are an OR operation and a concatenation operation. Note that the regex that matches phone numbers has been named X, while the regex that extracts URLs is Y.

Table 1.2: Grammatical rules and composed regex

Rules      Composed regex
X          \(\d+\)\d+-\d+
Y          http://\w+\.\w+\.com
(X)|(Y)    (\(\d+\)\d+-\d+)|(http://\w+\.\w+\.com)
XY         \(\d+\)\d+-\d+http://\w+\.\w+\.com

A dataset is needed to measure precision and recall. In the experiment we have used a dataset composed of ten documents from different origins containing URLs and phone numbers, both mixed and alone. Table 1.3 shows basic information about the dataset and its records. Documents 1 and 2 are composed of the testing set extracted from the training set. Document 3 contains the records of documents 1 and 2, and thus is composed of URL as well as phone number records. The rest of the documents are pages retrieved from the Web; documents 5 and 6 were transformed to a plain text format in order to remove all the URLs, while documents 7 to 10 are real web pages fetched, for instance, from Google Directory or Yahoo Yellow Pages, without any modification.

Table 1.3: Documents used in the regex composition experiment

              URL   Phones   Total
Document 1      0        5       5
Document 2      5        0       5
Document 3      5        5      10
Document 4      0       99      99
Document 5      0       10      10
Document 6      0       51      51
Document 7     77       20      97
Document 8    436       37     723
Document 9    241       24     184
Document 10    51        0      51

The computation of precision and recall uses the total number of records in the document, i.e., the sum of URLs and phone numbers, regardless of the evaluated regex.


This means that the result is strongly biased against regex that are not able to extract both URLs and phone numbers. It should also be noticed that an extracted string is counted as a true positive if and only if it exactly matches a record; otherwise it is computed as a false positive.
The results, as can be seen in table 1.4, are quite satisfactory for the preprocessed documents, i.e., documents 1 to 5, but the measures get worse for real raw documents. X has a perfect precision, while Y has a poor average precision of 0.41. This can be explained by looking at Y. This regex has the form http://\w+\.\w+\.com, which means that it only extracts the protocol and the host name from the URLs, but it cannot extract the path. Real URLs usually contain a protocol, a host and a path, likely with GET parameters, and in this situation Y generates a false positive, reducing its precision and recall. In short, URLs have a much more complex pattern than phone numbers. A special case is the regex XY, a direct concatenation of X and Y. This regex only extracts a phone number directly followed by a URL; obviously, such a situation is not likely to happen, and thus it is unable to extract any record (for this reason these results are not shown in the table). In any case, the best balance between precision and recall is achieved by the composed regex (X)|(Y), with a precision of 0.63 and a recall of 0.62.

Table 1.4: Extraction capacity of basic, and composed, regex. This value is calculated using traditional Precision (Prec.) and Recall values. The table shows the Retrieved elements (Retr) and the True Positives (TPos) detected.

                     X                          Y                       (X)|(Y)
              Retr TPos Prec. Recall    Retr TPos Prec. Recall    Retr TPos Prec. Recall
Document 1       5    5     1      1       0    0     -      -       5    5     1      1
Document 2       0    0     -      -       5    5     1      1       5    5     1      1
Document 3       5    5     1    0.5       5    5     1    0.5      10   10     1      1
Document 4      99   99     1      1       0    0     -      -      99   99     1      1
Document 5      10   10     1      1       0    0     -      -      10   10     1      1
Document 6       0    0     -      -      43    6  0.14   0.12      43    6  0.14   0.12
Document 7      20   20     1   0.21     773   12  0.16   0.12      97   32  0.33   0.33
Document 8      37   37     1   0.05     668   76  0.11   0.11     705  113  0.16   0.16
Document 9      24   24     1   0.13      88    1  0.01   0.01     112   25  0.22   0.14
Document 10      0    0     -      -      49   23  0.47   0.45      49   23  0.47   0.45
Average                     1   0.56                0.41   0.33                0.63   0.62

The balance between precision and recall can be quantified with the F-measure (shown in table 1.5). The (X)|(Y) regex obtained the best value, followed closely by X. As expected, the composition agent selects (X)|(Y) based on the F-measure.


Table 1.5: F-measure of atomic, and composed, regexes.

                 X      Y   (X)|(Y)
Document 1       1      -        1
Document 2       -      1        1
Document 3    0.67   0.67        1
Document 4       1      -        1
Document 5       1      -        1
Document 6       -   0.12     0.12
Document 7    0.35   0.14     0.33
Document 8    0.09   0.11     0.16
Document 9    0.23   0.01     0.17
Document 10      -   0.46     0.46
Average       0.62   0.35     0.69

1.7 Conclusions

A novel approach to data extraction based on regex evolution and the grammatical composition of regex has been presented. We have shown that it is possible to use a GA to evolve regex in a variable-length MAS and to apply grammatical rules to the evolved regex in order to generate a composition of regular expressions able to extract different records of data.
Using a MAS to simulate a variable-length genome population has shown to be a successful way to implement variable-length chromosome evolution. Each agent is able to evolve a population, and the MAS presents a macroevolution that tends to generate correctly sized regex. But the experiments carried out also show some limitations. The linear nature of the GA codification is not the best option to represent a hierarchical structure such as a regex. The result is a natural difficulty in defining a fine-grained fitness function able to evaluate not only the whole regex but also its parts. For these reasons the next step is to use other evolutionary algorithms, such as Genetic Programming and Grammatical Evolution, that overcome this limitation. Finally, grammatical rules offer a simple way to automatically compose basic regex and to select the best composed regex by measuring its F-measure over a set of documents.
Acknowledgements This work has been supported by the research projects TIN2007-65989, TIN2007-64718 and the Junta de Castilla-La Mancha project PAI07-0054-4397.

References

1. Brad Adelberg and Matthew Denny. NoDoSE version 2.0. SIGMOD Rec., 28(2):559–561, 1999.


2. Jose Luis Ambite, Naveen Ashish, Greg Barish, Craig A. Knoblock, Steven Minton, Pragnesh J. Modi, Ion Muslea, Andrew Philpot, and Sheila Tejada. ARIADNE: A System for Constructing Mediators for Internet Sources. In Proceedings of the ACM SIGMOD International Conference on Management of Data, June 2-4, Seattle, Washington, USA, 1998.
3. Naveen Ashish and Craig A. Knoblock. Semi-automatic wrapper generation for internet information sources. In Second IFCIS Conference on Cooperative Information Systems (CoopIS), Charleston, South Carolina, 1997.
4. Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1999.
5. Robert Baumgartner, Sergio Flesca, and Georg Gottlob. Visual web information extraction with Lixto. The VLDB Journal, pages 119–128, 2001.
6. David Camacho, Maria D. R-Moreno, David F. Barrero, and Rajendra Akerkar. Semantic wrappers for semi-structured data extraction. Computing Letters (COLE), 4(1), 2008.
7. Dominique Chu and Jonathan E. Rowe. Crossover operators to control size growth in linear GP and variable length GAs. In Jun Wang, editor, 2008 IEEE World Congress on Computational Intelligence, Hong Kong, 1-6 June 2008. IEEE Computational Intelligence Society, IEEE Press.
8. Valter Crescenzi and Giansalvatore Mecca. Automatic information extraction from large web sites. Journal of the ACM, 51(5):731–779, 2004.
9. Line Eikvil. Information extraction from World Wide Web - a survey. Technical Report 945, Norwegian Computing Center, 1999.
10. Jeffrey E. F. Friedl. Mastering Regular Expressions. O'Reilly & Associates, Inc., Sebastopol, CA, USA, 2002.
11. Jean-Robert Gruser, Louiqa Raschid, Maria Esther Vidal, and Laura Bright. Wrapper generation for web accessible data sources. In Proceedings of the Third IFCIS International Conference on Cooperative Information Systems, New York, NY, USA, 1998.
12. J. Hammer, M. Breunig, H. Garcia-Molina, S. Nestorov, V. Vassalos, and R. Yerneni. Template-based wrappers in the TSIMMIS system. In Proceedings of the Twenty-Sixth SIGMOD International Conference on Management of Data, Tucson, AZ, USA, May 12-15, 1997.
13. Andreas Heß and Nicholas Kushmerick. Learning to attach semantic metadata to web services. In ISWC-2003 (Florida), 2003.
14. John H. Holland. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control and Artificial Intelligence. MIT Press, Cambridge, MA, USA, 1992.
15. C.-N. Hsu and Chien-Chi. Finite-state transducers for semi-structured text mining. In Proceedings of the IJCAI-99 Workshop on Text Mining: Foundations, Techniques and Applications, Stockholm, Sweden, 1999.
16. C.-N. Hsu and M.-T. Dung. Generating finite-state transducers for semistructured data extraction from the web. Information Systems, Special Issue on Semistructured Data, 23(8):521–538, 1998.
17. J.Y.-J. Hsu and W.-T. Yih. Template-based information mining from HTML documents. In Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI-97), pages 256–262. AAAI Press, Menlo Park, CA, 1997.
18. Gerald Huck, Peter Fankhauser, Karl Aberer, and Erich Neuhold. JEDI: Extracting and synthesizing information from the web. pages 32–43, 1998.
19. T. Kistler and H. Marais. WebL: a programming language for the web. In Proceedings of WWW7, Brisbane, Australia, 1998.
20. Stefan Kuhlins and Ross Tredwell. Wrapper-generating toolkits, online overview. http://www.wifo.uni-mannheim.de/~kuhlins/wrappertools/, December 2001.
21. Stefan Kuhlins and Ross Tredwell. Toolkits for generating wrappers. In Mehmet Aksit, Mira Mezini, and Rainer Unland, editors, International Conference NetObjectDays on Objects, Components, Architectures, Services, and Applications for a Networked World, pages 184–198. Lecture Notes in Computer Science, Springer-Verlag, 2002.

1 Automatic Web Data Extraction based on Regex Evolution

23

22. Nicholas Kushmerick. Wrapper induction: Efficiency and expressiveness. Artificial Intelligence, 118(1-2):15–68, 2000.
23. Nicholas Kushmerick. Wrapper verification. World Wide Web Journal, Special issue on Web Data Management, 3(2):79–94, 2000.
24. Nicholas Kushmerick, Daniel S. Weld, and Robert B. Doorenbos. Wrapper induction for information extraction. In International Joint Conference on Artificial Intelligence (IJCAI), pages 729–737, 1997.
25. Alberto H.F. Laender, Berthier Ribeiro-Neto, and Altigran S. da Silva. DEByE - data extraction by example. Data & Knowledge Engineering, 40(2):121–154, 2002.
26. Kristina Lerman, Lise Getoor, Steven Minton, and Craig Knoblock. Using the structure of web sites for automatic segmentation of tables. In Proceedings of SIGMOD-2004, 2004.
27. Kristina Lerman, Steven Minton, and Craig Knoblock. Wrapper maintenance: A machine learning approach. Journal of Artificial Intelligence Research, (18):149–181, 2003.
28. Ling Liu, Calton Pu, and Wei Han. XWRAP: An XML-enabled wrapper construction system for web information sources. In Proceedings of the 16th International Conference on Data Engineering (ICDE), pages 611–621. IEEE Computer Society, Washington, DC, USA, 2000.
29. Ion Muslea, Steve Minton, and Craig Knoblock. STALKER: Learning extraction rules for semistructured, web-based information sources. In Proceedings of the AAAI-98 Workshop on AI and Information Integration, Technical Report WS-98-01, Menlo Park, CA, USA, 1998. AAAI Press.
30. Ion Muslea, Steve Minton, and Craig Knoblock. A hierarchical approach to wrapper induction. In Oren Etzioni, Jörg P. Müller, and Jeffrey M. Bradshaw, editors, Proceedings of the Third International Conference on Autonomous Agents (Agents'99), pages 190–197, Seattle, WA, USA, 1999. ACM Press.
31. Ion Muslea, Steven Minton, and Craig Knoblock. Wrapper induction for semistructured data extraction from the web. In Proceedings of Automatic Learning and Discovery, 1998.
32. Arnaud Sahuguet and Fabien Azavant. Building intelligent web applications using lightweight wrappers. Data & Knowledge Engineering, 36(3):283–316, 2001.
33. H. Sakamoto, Y. Murakami, H. Arimura, and S. Arikawa. Extracting partial structures from HTML documents. In Hiroshi Sakamoto, Hiroki Arimura, and Setsuo Arikawa, editors, Proceedings of the 14th International FLAIRS Conference, pages 264–268. AAAI Press, 2001.
34. Byoung-Tak Zhang and Jung-Jib Kim. Comparison of selection methods for evolutionary optimization. Evolutionary Optimization, 2:55–70, 2000.
35. Ken Thompson. Programming techniques: Regular expression search algorithm. Communications of the ACM, 11(6):419–422, 1968.
36. Cornelis Joost van Rijsbergen. Information Retrieval. London: Butterworths, 1979.
