
HW-STALKER: A Machine Learning-based Approach to Transform Hidden Web Data to XML

Vladimir Kovalev1, Sourav S. Bhowmick1
School of Computer Engineering1, Nanyang Technological University, Singapore
[email protected]

Sanjay Madria2
Department of Computer Science2, University of Missouri-Rolla, Rolla, MO 65409
[email protected]

Abstract. In this paper, we propose an algorithm called HW-Transform for transforming hidden web data to XML format using machine learning, extending stalker to handle hyperlinked hidden web pages. One of the key features of our approach is that we identify and transform key attributes of query results into XML attributes. These key attributes facilitate applications such as change detection and data integration by efficiently identifying related or identical results. Based on the proposed algorithm, we have implemented a prototype system called hw-stalker using Java. Our experiments demonstrate that HW-Transform shows acceptable performance for transforming query results to XML.

1 Introduction

Most of the data on the Web is “hidden” in databases and can only be accessed by posing queries over these databases using search forms; most such databases are accessible only through HTML forms. This portion of the Web is known as the hidden Web or the deep Web. The task of harvesting information from the hidden web can be divided into four parts: (1) formulate a query or search task description; (2) discover sources that pertain to the task; (3) for each potentially useful source, fill in the source's search form and execute the query; and (4) extract the query results from the result pages, as the useful data is embedded in the HTML code. We illustrate these steps with an example. AutoTrader.com is one of the largest car Web sites, with over 1.5 million used vehicles listed for sale by private owners, dealers, and manufacturers. The search page is available at http://www.autotrader.com/findacar/index.jtmpl?ac_afflt=none. Examples of result pages are shown in Figure 1. The first page of query results contains short descriptions of a set of cars. Each short description provides a link to a separate page containing more details on the particular car. There is also a link to a page containing the next set of car descriptions, formatted similarly to the first page; the second page is linked to the third page, and so on. Note that all the pages in this set are generated dynamically: every time a user queries AutoTrader.com with the same query, all the resulting pages and the links that connect them to one another are generated anew from the hidden web database.
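The per-query chain of result pages in Step 4 can be sketched as follows. This is a toy illustration: `fetch` and `next_link` are placeholders for an HTTP client and learned link-extraction rules, and the three-page in-memory “site” stands in for dynamically generated result pages.

```python
import re
from typing import Callable, Optional

def collect_result_pages(first_url: str,
                         fetch: Callable[[str], str],
                         next_link: Callable[[str], Optional[str]]) -> list:
    """Follow "next" links from the first result page until none is found."""
    pages, url = [], first_url
    while url is not None:
        html = fetch(url)          # download the current result page
        pages.append(html)
        url = next_link(html)      # None on the last page of the chain
    return pages

# Toy stand-in for three dynamically generated, chained result pages.
site = {
    "p1": "cars ... <a href='p2'>next</a>",
    "p2": "cars ... <a href='p3'>next</a>",
    "p3": "cars ...",
}

def toy_next(html):
    m = re.search(r"href='(p\d)'>next", html)
    return m.group(1) if m else None

pages = collect_result_pages("p1", site.__getitem__, toy_next)
# pages now holds the HTML of all three result pages in chain order
```

The same loop applies unchanged whether the "next" link is extracted by a regular expression, as here, or by induced extraction rules.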

Fig. 1. AutoTrader.com: Query results for Ford. (a) July 1, 2003. (b) July 2, 2003. (Screenshots omitted; each listed car shows Year, Price, Model, Seller, Color, and VIN.)

In this paper, we assume that the task is formulated clearly (Step 1). Step 2, source discovery, usually begins with a keyword search on a search engine or a query to a web directory service. The works in [2, 4] address the resource discovery problem and describe the design of topic-specific PIW crawlers. Techniques for automatically querying hidden web search forms (Step 3) have been proposed in [6]; they allow a user to specify complex relational queries to hidden web sites that are executed as combinations of real queries. In this paper, we focus on Step 4. We present hw-stalker, a prototype system for extracting relevant hidden web query results and transforming them to XML using a machine learning technique. We propose an algorithm called HW-Transform for transforming hidden web data to XML format. Our approach is based on the stalker technique [7, 9]. We use stalker because, apart from being easy to use, in most cases it needs only one example to learn extraction rules, even for documents containing lists. Moreover, stalker models a page as an unordered tree, and many hidden web query results are unordered. However, stalker was designed to extract data from a single web page and cannot handle a set of hyperlinked pages. Hence, we extend the stalker technique to extract results from a set of dynamically generated hyperlinked web pages, using a machine learning-based technique to induce the rules for this transformation. The process of transforming query results from HTML to XML can be divided into three steps: (1) constructing an EC tree [7, 9] describing the hidden web query results; one of the key features of our approach is that a user maps special key attributes (identifiers and peculiarities) of query results into XML attributes, which facilitate change detection, data integration, etc. by efficiently identifying related or identical results (we discuss this in detail in Section 2); (2) inducing extraction rules for tree nodes by providing learning examples; we do not discuss this step here as it is similar to the stalker approach; (3) transforming results from HTML to XML using the extended EC tree with the assigned rules; we discuss this step in Section 3. Note that we do not address transformations related to client-side scripts in this paper.
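As background for the extraction rules used throughout the paper: stalker locates a field by skipping over landmark sequences that bound its prefix and suffix. The sketch below is our own single-disjunct simplification (real stalker rules are disjunctions of landmark sequences with wildcards):

```python
def skip_to(html, landmarks, start=0):
    """Consume input up to and past each landmark in turn.
    Returns the index just after the last landmark, or -1 if any is missing."""
    pos = start
    for lm in landmarks:
        i = html.find(lm, pos)
        if i < 0:
            return -1
        pos = i + len(lm)
    return pos

def extract(html, prefix_landmarks, suffix_landmark):
    """Extract the text between a prefix rule and a suffix landmark."""
    begin = skip_to(html, prefix_landmarks)
    if begin < 0:
        return None   # rule fails, e.g. no "next" link on the last chain page
    end = html.find(suffix_landmark, begin)
    return html[begin:end] if end >= 0 else None

page = "<b>1</b> <a href='/page2'>next</a>"
link = extract(page, ["<a href='"], "'>next")
# link == "/page2"
```

A failed match (None) is itself informative: it is how the end of a chain of result pages is detected in Section 2.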

2 Modelling Hidden Web Query Results

As hidden web results are distributed over a set of pages, we need a general model to represent these pages. In other words, we should model the links between a collection of hyperlinked hidden web pages. We distinguish two types of links: chain links and fanout links. When the pages returned by a query are linked to one another by a set of links, we say that the links between them are chain links. When a result returned by a query contains hyperlinks to pages with additional information, we say that these links are fanout links.

2.1 Constructing HW-EC Tree

stalker uses the Embedded Catalog (EC) formalism to model an HTML page. This formalism composes a tree-like structure of the page from List and Element nodes; EC tree nodes are unordered. Thus, to apply stalker to a set of pages, we extend the EC formalism to model a set of pages by adding three new types of nodes: chain, fanout, and attribute nodes. The new formalism is called Hidden Web Embedded Catalog (HW-EC). The chain and fanout nodes are assigned descriptions of the chain and fanout links in the results. The attribute nodes are used to represent semantic constraints in the query results.

Fanout Node. A fanout node models a fanout of pages. The node is assigned stalker extraction rules for extracting the fanout link. This node should be nested under a List node, so that the rules for locating a fanout link are applied inside each element of this List. A fanout node does not appear in the output XML document; all the nodes nested under a fanout node appear nested under each element of the List that is the ancestor of that fanout node.

Chain Node. A chain node models a chain of pages. The node is assigned stalker extraction rules for extracting the chain links. A chain node should be nested under an Element node so that the rules for locating a chain link are applied inside each next page of results. A chain node does not appear in the output XML document; all the nodes nested under a chain node appear nested under the element that is the ancestor of that chain node. A chain node also carries a parameter ChainType, which can take only two values, RightChain or LeftChain; we elaborate on this parameter below. Hidden web sites compose chain links in their own ways. The main type of chain link, common to most sites, is the link to the “next” page containing a set of results. For example, reconsider the AutoTrader.com query results: the “next” link is followed by a different suffix on every page of the results except the last. stalker extraction rules locate the beginning and the end, i.e., the prefix and suffix, of the piece of data to be extracted. We assign the chain

Fig. 2. HW-EC description and XML representation of result pages in AutoTrader. (a) HW-EC description. (b) XML representation. (Figure content omitted.)

node with rules to extract the link to the “next” page of results. Observe that the “next” link is commonly followed by (or follows) a block of links to every other page of results. A page is the last result page in the chain if no link fits the extraction rule assigned to the node. To reduce the number of learning examples needed for extracting the “next” link, we ask the user to specify learning examples for extracting the “next” link along with the block of links to every other page, and also to specify whether the “next” link is followed by, or follows, that block. We call the first and second choices LeftChain and RightChain respectively.

Attribute Node. An attribute node is a leaf node in the HW-EC tree and is used for modelling semantic constraints in the query results. It is transformed into an XML attribute in the output XML file. It can be used to identify a result uniquely (identifier), or it may provide enough information to determine which results are related and have the potential to be identical (peculiarity) in the old and new versions of the query results. Four parameters need to be specified for an attribute node: ParentNode, Type, Name, and isOnlyAttribute. The ParentNode parameter is a link to the node in the HW-EC description that should be assigned this attribute in the output XML file. The Type parameter defines the type of the attribute (identifier or peculiarity); we discuss these two types in the following subsections. Name denotes the name of the node. isOnlyAttribute holds a boolean value: if it is set to “true”, the piece of data extracted for this node appears in the output XML only as a semantic attribute; otherwise, it appears both as a semantic attribute and as an element. So if this information is needed as part of the XML document, isOnlyAttribute is set to “false” so that the node also appears as an element. The following example illustrates the HW-EC formalism and the mapping of a set of hidden web pages to XML. Consider the results from AutoTrader.com. Figure 2(a) depicts the HW-EC tree for these results and Figure 2(b) depicts the XML representation of the results

Fig. 3. Tree representation of query results from Figure 1 modelling Identifiers. (a) T1: 01 July. (b) T2: 02 July. (Figure content omitted; each Car node carries child elements such as Model, Year, Price, Color, Seller, and Vin, plus an Id attribute holding the VIN where specified.)

according to the tree. For clarity and space, we show only a subset of the attributes in each result of the query. The root Cars node is established to unite all the other nodes. The next node, Chain, models the set of pages connected with “left” chain links. The List(Car) node is assigned an iterative rule for extracting the Car elements. The fanout node denotes that each Car element contains a link to a page with extra data; it is assigned rules for extracting this link from the piece of HTML document that was extracted for each Car element in the previous step. The next level of the tree contains five elements. Four of them are Element nodes; the last is an Attribute node with its four parameters. We can see in Figure 2(b) that the VIN is extracted only once for each Car, as an attribute. The rest of the elements are nested in the output XML the same way they are nested in the tree.

2.2 Identifiers

Some elements in a set of query results can serve as a unique identifier for a particular result, distinguishing it from the other results. For example, a Flight number uniquely characterizes every Flight returned when querying an on-line booking site, a VIN uniquely characterizes every Car in the query results from a car database, an Auction Id identifies an Auction in the query results from an on-line auction site, and so on. These elements are called identifiers. Identifiers may be either automatically generated by the hidden web database, like Auction Id, or stored in the database along with the data, like VIN. In this work we assume that an identifier, once assigned to a particular query result, does not change for that result across different versions of the query results. That is, an identifier behaves like a unique identifier or “key” for the result. However, it is possible for an identifier to be missing from a result. Also, if an identifier is specified (not specified) for a node in the initial version of the query results, or when the node appears for the first time in the results, then it remains specified (not specified) throughout all versions until the node is deleted. This reflects

the case for most web sites we have studied. Note that we allow specifying only one identifier per result. As each result is transformed into a subtree in the XML representation of the hidden web query results, we model the identifier for a particular node in the subtree as an XML attribute with name Id and the identifier information as its value. We now illustrate with an example the usefulness of identifiers in change detection. Figure 1 depicts search query results for Ford cars on AutoTrader.com, executed on two different days. The results are presented as a list of car offerings. Each offering contains details of the car, such as Model, Year, Price, Color, Seller, VIN, etc. The VIN (Vehicle Identification Number) is the unique identifying number assigned to a specific car by its manufacturer. Since the VIN is information that must be supplied by the publisher, it may be missing from the query results for some cars. For instance, the first car in the results depicted in Figure 1(b) has no VIN specified. Nevertheless, if we use VINs as identifiers for the cars in the list, then we can effectively distinguish those cars whose VIN is specified. A sample tree representation of the results in Figure 1 is shown in Figure 3. In the tree representation, every offering from the real data is represented by a Car node. The Car nodes in T1 and T2 have attributes with name Id and value equal to the VIN. Intuitively, if we wish to detect changes between the two versions of the query results, then we can match Car nodes between the two trees by comparing Id values. For instance, node 2 in T1 matches node 6 in T2, and node 1 in T1 matches node 5 in T2. Suppose there are no more Car nodes in either tree. Then we can say that node 3 is deleted in T2, as its identifier does not match any identifier in T2. Node 3 also does not match node 4, as node 4 has no identifier attribute. Hence, node 4 is inserted in T2.
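The Id-based matching described above can be sketched as follows; the flat dictionary representation of Car nodes is our own simplification of the XML trees in Figure 3:

```python
def match_by_id(old, new):
    """Match results across two versions by their Id attribute.
    Results without an Id never match anything (a VIN may be unspecified)."""
    old_ids = {r["Id"] for r in old if "Id" in r}
    new_ids = {r["Id"] for r in new if "Id" in r}
    matched = old_ids & new_ids
    deleted = [r for r in old if r.get("Id") not in new_ids]
    inserted = [r for r in new if r.get("Id") not in old_ids]
    return matched, deleted, inserted

# Cars from Figure 3, with all details other than the Id attribute elided:
t1 = [{"Id": "1FAHP60A92Y120755"}, {"Id": "1FTNW21P83EC02037"},
      {"Id": "1FTWW33F41EA96343"}]
t2 = [{"Model": "Ford F350"},                       # no VIN specified
      {"Id": "1FAHP60A92Y120755"}, {"Id": "1FTNW21P83EC02037"}]
matched, deleted, inserted = match_by_id(t1, t2)
# two Ids match; the car with VIN 1FTWW33F41EA96343 is deleted,
# and the VIN-less car counts as inserted
```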

2.3 Peculiarities

One or more elements in a hidden web query result can serve as non-unique characteristics distinguishing it from other results: two results that have the same characteristics (the same attribute/value pairs) can be matched with each other, while results that have different characteristics cannot. Examples of such characteristics are the Year or Model of a Car node in the query results from a car trading site, the Title or Year of a Movie node in the query results from a movie database, and the Title or Publisher of a Book node in the query results from an on-line book catalog. These non-unique elements are called peculiarities. Such elements may not identify a result (entity) uniquely, but they may provide enough information to identify results that do not refer to the same entity. We allow specifying any number of peculiarities on a node. Peculiarities are denoted by node attributes named Pec_1, Pec_2, ..., Pec_n for the n peculiarity attributes specified for a particular node. If a node does not have a peculiarity attribute (the subelement may be missing), then the peculiarity value is set to “*”. Note that a peculiarity attribute for a node can appear in any version of the query results, but once it appears we assume that it will not disappear in future versions. As we never know which peculiarity may appear for

Fig. 4. Tree representation of query results from Figure 1 modelling Peculiarities. (a) T1: 01 July. (b) T2: 02 July. (Figure content omitted; each Car node carries a Pec_1 attribute holding the Year and a Pec_2 attribute holding the Seller, alongside its elements.)

a node in the future, a node with no peculiarity attribute can still be matched with nodes that have peculiarities. This reflects the case for most hidden web sites we have studied. We illustrate with an example how peculiarities can be useful for the change detection problem. Reconsider Figure 1. We can find several candidates for peculiarities among the cars in the list, e.g., Color, Year, or Seller. Figure 4 shows a tree representation of the results in Figure 1 highlighting peculiarities for various nodes. Two peculiarities are specified for every Car node: Year as an attribute named Pec_1, and Seller as an attribute named Pec_2. We can now match Car nodes between the two versions of the tree using the peculiarity attributes. Using only Pec_1, we can exactly match node 1 to node 5, as no other nodes have a Pec_1 attribute with value “2002”. Using only Pec_1, we also observe that node 3 in T1 does not match any node in T2, as there are no nodes with Pec_1=“2001” in T2; thus, node 3 is deleted in T2. Using only Pec_1, we can match node 2 to nodes 4 or 6, but we cannot say which of the two does not match node 2. However, if we use both Pec_1 and Pec_2, then we can answer this question by comparing the Pec_2 of node 2 with those of nodes 4 and 6. As we can then exactly match node 2 to node 6, we can say that node 4 is inserted in T2.
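The Pec_1/Pec_2 matching just described can be sketched as follows; this is our own illustration, where “*” marks a missing peculiarity and is compatible with any value:

```python
def pec_compatible(a, b, keys=("Pec_1", "Pec_2")):
    """Two results may refer to the same entity only if every peculiarity
    agrees; a missing peculiarity ("*") is compatible with any value."""
    def val(r, k):
        return r.get(k, "*")
    return all(val(a, k) == val(b, k) or "*" in (val(a, k), val(b, k))
               for k in keys)

# Nodes 2, 4, and 6 from Figure 4, with other fields elided:
n2 = {"Pec_1": "2003", "Pec_2": "L & B Lincoln Mercury"}
n4 = {"Pec_1": "2003", "Pec_2": "Smithtown Ford"}
n6 = {"Pec_1": "2003", "Pec_2": "L & B Lincoln Mercury"}
# Pec_1 alone cannot separate nodes 4 and 6; adding Pec_2 resolves it:
# pec_compatible(n2, n6) -> True, pec_compatible(n2, n4) -> False
```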

3 Algorithm HW-Transform

Figure 5 depicts the HW-Transform algorithm for transforming hidden web query results to XML format. The algorithm takes as input the first page of the results and the HW-EC description of these results, and returns an XML representation of the results. There are three main steps in the algorithm. The first step is to extract information from the query results and generate a tree T storing the extracted information hierarchically according to the HW-EC description of these results. Lines 8-11 in Figure 5 describe this step. The ExtractNode

Input: Index /* index page of query results */, HW-EC /* HW-EC description of these results */
Output: Doc /* XML representation of query results */

1  T: tree            /* for storing tree representation of query results */
2  Ta: tree           /* which can store attributes for every node */
3  [Te]: set of trees
4  Doc: XML document
5  set T, Ta, [Te], and Doc empty
6  let Root be the root of HW-EC
7  add Root as a root to T
   /* extract information from query results */
8  for all Ch that is a child of Root in HW-EC do
9      add ExtractNode(Index, Ch) to [Te]
10 end for
11 add every Te from [Te] as a child of the root node to T
   /* assign attributes */
12 Ta = AssignAttributes(T, HW-EC)
   /* generate XML */
13 Doc = GenerateXML(Ta)
14 Return Doc

(a) Algorithm HW-Transform. (b) Performance study: transformation time (s, 0-800) versus number of results (0-1500) for AutoTrader.com, Amazon.com, CiteSeer.org, and IMDb.com (plot omitted).
Fig. 5. Algorithm HW-Transform.

function implements a recursive algorithm for extracting the pieces of data from the query results corresponding to node N in the HW-EC description. The next step is to traverse the tree T to assign attributes to the nodes according to the HW-EC description of the query results (Line 12). Note that we do not assign attributes to the nodes while first constructing the tree in the previous step, as attributes can be located in different parts of the tree; it is faster to assign all attributes in one traversal in a separate step than to perform many additional traversals in the previous step. In the final step, we generate the XML document from the tree by traversing it starting from the root.
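A minimal sketch of the final GenerateXML step, using a tuple-based tree of our own devising (name, text, attributes, children) in place of the internal tree Ta:

```python
import xml.etree.ElementTree as ET

def generate_xml(node):
    """Recursively turn an attribute-annotated tree into an XML element."""
    name, text, attrs, children = node
    elem = ET.Element(name, attrs)   # semantic attributes become XML attributes
    if text is not None:
        elem.text = text
    for child in children:
        elem.append(generate_xml(child))
    return elem

# One Car from the AutoTrader example, with its Id attribute:
car = ("Car", None, {"Id": "1FAHP60A92Y120755"},
       [("Model", "Ford F250 3/4 Ton Truck Super Duty", {}, []),
        ("Year", "2003", {}, []),
        ("Price", "$39995", {}, [])])
doc = generate_xml(("Cars", None, {}, [car]))
xml = ET.tostring(doc, encoding="unicode")
# xml nests each Car's elements under <Cars>, with the Id as an XML attribute
```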

4 Performance Study

In this section, we present a performance analysis of our prototype system. All experiments were performed on a Pentium 4 2.4 GHz CPU with 512 MB of RAM, running Microsoft Windows 2000 Professional. We have implemented hw-stalker using Java. We use data from the following four hidden web sites in our experiments: AutoTrader.com, Amazon.com, IMDb.com, and CiteSeer.org. To evaluate the performance of extracting relevant data from hidden web pages to XML format, we have to evaluate both the rule induction mechanism and the HTML-to-XML transformation mechanism. The performance of the rule induction system is determined by the time a user spends creating the HW-EC tree, and by the time the user needs to provide examples of data for each element of the HW-EC tree. This time depends heavily on the number of examples a user must provide per element to learn correct extraction rules. The number of training examples for some of the elements of the hidden web data used in our experiments is shown in Figure 6(a). Observe that for 40% of the elements in the query results we need only one example to learn the extraction rules for an element; we need more than five examples per element for only 5% of the elements. The number of examples needed to learn extraction rules for a particular element is

(a) Number of samples.

  AutoTrader.com        Amazon.com            IMDb.com               CiteSeer.org
  Element     Samples   Element      Samples  Element       Samples  Element     Samples
  AutoTrader  1         Amazon       1        IMDB_Comments 2        CiteSeer    2
  Car         8         Book         6        Comment       3        Paper       3
  Model       5         ISBN         2        Author        2        Title       1
  Year        1         Title        1        From          2        Year        1
  VIN         1         Price        1        Date          1        Authors     1
  Price       1         Year         1        Summary       1        Conference  1
  Color       1         Authors      3        CHAIN         3        Abstract    1
  Miles       2         Availability 1                               CHAIN       4
  CHAIN       3         CHAIN        6                               FANOUT      3
                        FANOUT       3

(b) Different queries to sample sites.

  Site            Query                                                                      Results  CHAIN pages  FANOUT pages
  AutoTrader.com  2000-2004 Acura of any model within 25 miles from ZIP 00501                     48        2           -
  AutoTrader.com  1983-2004 Ford Escort within 25 Miles from ZIP 10001                           120        5           -
  AutoTrader.com  1983-2004 Jaguar of any model within 50 Miles from ZIP 00501                   202        9           -
  AutoTrader.com  1983-2004 Land Rover Range Rover within 200 Miles from ZIP 00501               310       13           -
  AutoTrader.com  1990-2004 Cadillac with mileage under 75,000 within 50 Miles from ZIP 10001    430       18           -
  AutoTrader.com  1991-2004 Toyota with price 10,000 to 15,000 within 50 Miles from ZIP 10001    472       19           -
  AutoTrader.com  1983-2004 Ford of any model within 25 Miles from ZIP 10001                     500       20           -
  Amazon.com      Search for "gardenia"                                                           21        2          21
  Amazon.com      Search for "snooker"                                                           190        9         190
  Amazon.com      Search for "intranet"                                                          431       44         431
  Amazon.com      Search for "dock"                                                              599       60         599
  Amazon.com      Search for "dot"                                                              1001      101        1001
  IMDb.com        User comments to "Arrival"                                                      58        3           -
  IMDb.com        User comments to "Once Upon a Time in Mexico"                                  201       11           -
  IMDb.com        User comments to "Star Wars V"                                                 746       38           -
  IMDb.com        User comments to "Harry Potter and the Sorcerer's Stone"                      1212       61           -
  CiteSeer.org    Search for "hidden web"                                                         38        2          38
  CiteSeer.org    Search for "google"                                                            523       27         523
  CiteSeer.org    Search for "google"                                                            754       38         754
  CiteSeer.org    Search for "web"                                                              1000       50        1000

Fig. 6. Performance study.

determined by the number of different HTML environments in which this element can appear across results [7]. To evaluate the performance of the HTML-to-XML transformation module based on the HW-Transform algorithm, we extracted the results of different queries from the four hidden web sites. The set of queries used in this experiment is shown in Figure 6(b), and the summary of this experiment is shown in Figure 5(b).

5 Related Work

WIEN [8] takes as input a set of example pages where the data of interest is labelled; the output is a wrapper that is consistent with each labelled page. SRV [5] is an approach to wrapper generation based on Natural Language Processing (NLP): it learns extraction rules for text documents from a given set of training examples. It is a single-slot tool, like WIEN, and thus supports neither nesting nor semistructured objects. A technique for supervised wrapper generation based on a logic-based declarative language called Elog is presented in [1]. This technique is implemented in a system called Lixto, which assists the user in semi-automatically creating wrapper programs through a fully visual and interactive user interface, enabling expressive visual wrapper generation. Compared to WIEN, which extracts all items at once, our approach uses several single-slot rules based on stalker; also, unlike WIEN and SRV, we support nesting. Compared to Lixto, our approach is more user-friendly because it is based on stalker: in Lixto, the user must define the extraction rules himself, which is a complicated task even with a GUI. Furthermore, unlike the above approaches, hw-stalker is developed specifically for hidden web data and hence is able to extract key attributes (identifiers and peculiarities) from the query results. Our system also focuses exclusively on extracting data from dynamically generated hyperlinked web pages.

RoadRunner [3] is an HTML-aware tool for automatic wrapper generation based on the inherent features of HTML documents. The feature that distinguishes RoadRunner from all the other approaches is that the process of wrapper generation is fully automatic, requiring no user intervention. However, this flexibility is a disadvantage for extracting hidden web data, as RoadRunner cannot extract identifiers and peculiarities automatically.

6 Conclusions

In this paper, we present a machine learning-based approach for extracting relevant hidden web query results and transforming them to XML. We propose an algorithm called HW-Transform for transforming hidden web data to XML format. In our approach, we extend the stalker technique to extract results from a set of dynamically generated hyperlinked web pages. The XML representation of the query results encapsulates only the data that a user is interested in. One of the key features of our approach is that a user maps special key attributes in the query results, called identifiers and peculiarities, into XML attributes. These attributes facilitate change detection, data integration, etc. by efficiently identifying related results. As a proof of concept, we have implemented a prototype system called hw-stalker using Java. Our experiments demonstrate that HW-Transform shows acceptable performance for transforming query results to XML.

References

1. R. Baumgartner, S. Flesca, and G. Gottlob. Visual Web Information Extraction with Lixto. In Proceedings of the 27th VLDB Conference, Roma, Italy, 2001.
2. S. Chakrabarti, M. van den Berg, and B. Dom. Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery. In 8th World Wide Web Conference, May 1999.
3. V. Crescenzi, G. Mecca, and P. Merialdo. RoadRunner: Towards Automatic Data Extraction from Large Web Sites. In Proceedings of the 27th VLDB Conference, pages 109-118, Roma, Italy, 2001.
4. M. Diligenti, F. Coetzee, S. Lawrence, C. L. Giles, and M. Gori. Focused Crawling using Context Graphs. In 26th International Conference on Very Large Databases (VLDB 2000), September 2000.
5. D. Freitag. Machine Learning for Information Extraction in Informal Domains. Machine Learning, 39(2/3):169-202, 2000.
6. H. Davulcu, J. Freire, M. Kifer, and I. V. Ramakrishnan. A Layered Architecture for Querying Dynamic Web Content. In ACM Conference on Management of Data (SIGMOD), June 1999.
7. C. A. Knoblock, K. Lerman, S. Minton, and I. Muslea. Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach. IEEE Data Engineering Bulletin, 23(4):33-41, 2000.
8. N. Kushmerick. Wrapper Induction: Efficiency and Expressiveness. Artificial Intelligence, 118(1-2):15-68, 2000.
9. I. Muslea, S. Minton, and C. A. Knoblock. Hierarchical Wrapper Induction for Semistructured Information Sources. Autonomous Agents and Multi-Agent Systems, 4(1/2):93-114, 2001.
