Automated Information Extraction using Amorphic

Dawn G. Gregg
(University of Colorado Denver and Health Sciences Center
[email protected])

Abstract: The Amorphic system is an adaptive web information extraction scheme for building intelligent systems that mine information from web pages. It can locate data of interest based on domain knowledge or page structure, can automatically generate a wrapper for an information source, and can detect when the structure of a web-based resource has changed and act on this knowledge to search the updated resource and locate the desired information. This allows Amorphic to adapt to the changing structures of websites, letting users manage their information extraction more effectively. Five example implementations are described to illustrate the need for information extraction systems capable of extracting information from semi-structured web documents. They demonstrate the versatility of the system, showing how a system like Amorphic can be used in systematic data extraction applications that require data collection over an extended period of time. The current Amorphic system represents a cost-effective approach to developing large-scale adaptable information extraction systems for a variety of domains.

Keywords: information extraction, wrapper recovery and repair, pattern matching, semistructured data, ontology
Categories: H3.3, H3.4, H5.4
1 Introduction
The Internet is the source of vast amounts of information. There are company, government, organization and individual websites, tutorials, discussion forums, blogs, wikis, social bookmarks and social networking services. Most of these sites contain information that could be useful for a variety of applications if the information could be systematically extracted and used. For example, networks of friends on social networking sites can be used by businesses to understand their customers or by governments mapping terrorist groups; data from blogs can be aggregated based on their labels and used to create a repository of knowledge on a particular topic; and social bookmarks can be used by anthropologists studying patterns in society. However, the volume of information available on the web makes it impossible to process manually for use in these types of applications. Instead, users must rely on computerized tools that can systematically roam the web and carry out sophisticated information processing tasks on their behalf.

Understanding web content is one of the big challenges facing developers interested in tapping the Internet's vast information resources. One approach to simplifying the process of understanding the web is the "Semantic Web." The idea behind the Semantic Web is that if web page authors add semantics to meaningful web content, it will be easier to find and use web data [Berners-Lee et al. 2001]. Adding these formal semantics to web pages will aid in everything from resource discovery to the automation of information processing tasks [Koivunen, Miller 2002].
On the Semantic Web, systems would be able to query the Internet and obtain information of interest based on metadata written in the Resource Description Framework (RDF) [Carroll, Klyne 2004]. However, currently only a small fraction of web content contains the metadata necessary to make such Internet queries possible. For years people have been writing web pages in HTML, creating a vast amount of human-readable content. It is unlikely that even a small fraction of these documents will ever be rewritten to include the RDF metadata that would allow them to be processed by Semantic Web agents [Embley 2004]. Instead, developers interested in using web data must create systems capable of using heterogeneous human-readable data.

Using information from the current human-readable web in any sort of systematic fashion presents numerous challenges to developers. The first challenge is locating information of interest. Typical web searches return thousands of documents related to a search term. Information retrieval systems attempt to locate interesting documents from within the thousands of uninteresting documents returned from these searches. The second challenge is extracting data from the web in a form that a computer system can then manipulate "in ways that are useful and meaningful to the human user" [Berners-Lee et al. 2001]. The third challenge is to develop extraction processes that are resilient to changes in the structure of the underlying web resource. The web is a dynamic medium and websites frequently change the layout of their web pages to make them more useful or attractive to the human user. Unfortunately, this can cause extraction systems designed to work with the original website to fail. A final challenge is to develop tools that allow the information content of websites to be "understood" by applications. There are a number of techniques that can be used to cluster, categorize, or otherwise derive meaning from web content [see Liu, Maes 2005, Schuff, Turketken 2006]. However, applying these techniques is rarely straightforward, and the process of developing systems to understand the meaning of human-generated content is a continuously evolving science.

The focus of this paper is on two of the four challenges discussed above: the extraction of data from web pages of interest and the resilience of the extraction system. This paper discusses the Amorphic web information extraction system. It can locate data of interest based on domain knowledge or page structure, and can detect when the structure of a web-based resource has changed and act on this knowledge to search the updated resource to locate the desired information. The Amorphic information extraction system has been demonstrated using five different example implementations which illustrate:

1. the ability of Amorphic to interpret web pages from a wide variety of domains,
2. the ability of Amorphic to adapt to changes in page structure without breaking, and
3. the variety of information-centric tasks that an adaptive information extraction system can support.
The remainder of this paper is structured as follows. Section two describes related work on information extraction wrappers and wrapper repair. Section three describes the Amorphic information extraction system. Section four presents the five example implementations and section five presents our discussion and conclusions.
2 Related Work
Information extraction automates the translation of input pages or text into structured data. Information extraction systems usually rely on extraction rules tailored to a particular information source. These extraction rules are generally called wrappers, where a wrapper is defined as a program or a rule that understands information provided by a specific source and translates it into a regular form (e.g. XML or relational tables). The most challenging aspect of wrappers is that they must be able to recognize the data of interest among many other uninteresting pieces of text (e.g., markup tags, inline code, navigation hints, etc.) [Laender et al. 2002]. Generally, information extraction can be evaluated on three different characteristics: the type of input document and extraction target, the extraction technique, and the degree of automation [Chang et al. 2006]. For example, the input documents used for information extraction can be unstructured free-text documents written in natural language or semi-structured documents that are frequently found on the web. Extraction can use position-based wrappers or wrappers that are ontology driven. Finally, extraction systems vary in the degree of automation involved in generating the initial wrapper or in repairing the wrapper when the underlying data source changes.

2.1 Extraction Targets
There are a wide variety of document types that information extraction systems can be designed to extract data from. These document types include freeform text documents like news articles, semi-structured documents like medical records or computer logs, and structured documents like XML documents or documents annotated using RDF [Chang et al. 2006]. The focus of this paper is on information extraction processes tailored to HTML documents available via the web. A substantial proportion of these HTML documents are semi-structured, either because they are dynamic pages generated from a database using document templates which produce a set of regularly formatted pages (e.g. an eBay auction page or a search results page) or because the human-generated content conforms to a regular pattern (e.g. manually generated blogs or lists of publications).

Semi-structured HTML data can also vary based on the way extraction targets (data of interest within the web page) are defined. For example, a web page can contain data for a single data entity or record (e.g. a single product page) or the web page can contain data for multiple data entities or records (e.g. a search results page or an organization membership list). Web pages containing data for a single data entity use page-level wrappers to extract all of the extraction targets embedded in that page. The data can be labeled or unlabeled, and can exist in tables, lists, or other regular formats. Web pages containing multiple records utilize record-level wrappers to discover record boundaries and then divide each record into individual attributes [Sarawagi 2002]. Multiple-record data can be in regularly formatted tables with column headings, in tables with some items in columns with headings and others individually labeled, in repeating paragraphs with many individual data items independently labeled, or in some other repeating structure with few (if any) labels
describing the content (e.g. general search results) [Gregg, Walczak 2007]. All of these variations in the structure of the web page and the definition of extraction targets can complicate the extraction process.

2.2 Extraction Technique
For a wrapper to extract data from an input document, it needs to tokenize the input string, apply extraction rules for each extraction target, and assemble extracted values into individual records [Chang et al. 2006]. Generally, there are two classes of extraction rules that can be applied to web documents: extraction rules that rely on the position of the extraction target within the web page and extraction rules that rely on domain knowledge to locate and extract the target data.

2.2.1 Position-based Extraction
Wrappers utilizing position-based extraction rely on inherent structural features of HTML documents to accomplish information extraction. Position-based extraction systems utilize an HTML parser to decompose an HTML document into a parse-tree that reflects its HTML tag hierarchy. Position-based extraction rules are written to locate data based on this parse-tree hierarchy. The extraction rules can be either regular expression rules or Prolog-like logic rules, which make an assignment between a variable name and a path expression.

Early information extraction systems utilize position-based wrappers where the extraction rules are constructed manually [see Atzeni et al. 2002]. These systems require a human developer to create a new wrapper for any new information source, and each wrapper only works for web pages that use the same underlying page template. This limits users to accessing information from a pre-fixed set of information sources. In addition, when there are changes to the structure of the target page templates, these position-based wrappers fail to extract the target information correctly. However, position-based wrappers do guarantee a high accuracy of information extraction, with both precision and recall being at least 98% [Chidlovskii 2002]. In addition, it is possible to use wrapper induction to create position-based wrappers based on a sample of regularly formatted web pages (e.g. those generated from a database using a web page template). Wrapper induction automatically builds a wrapper by learning from a set of the resource's sample pages, and then uses the wrapper to extract specific information from web information sources [Crescenzi et al. 2001]. This can greatly speed the development and update of position-based wrappers [Arasu, Garcia-Molina 2003, Flesca et al. 2004, Kushmerick et al. 1997, Muslea et al. 1999].
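To make the idea concrete, the following sketch shows a position-based rule expressed as an XPath path expression over the parse-tree. It is illustrative only (Python with the lxml library, a hypothetical page layout, and a hypothetical table id), not the implementation of any of the systems cited above:

# Illustrative position-based extraction rule (hypothetical page
# structure). The rule binds a target to a fixed path expression in
# the parse-tree; it fails as soon as the page template changes.
import lxml.html

def extract_prices(html: str) -> list[str]:
    tree = lxml.html.fromstring(html)
    # Hypothetical rule: the price is assumed to sit in the second
    # cell of each row of the table with id "results".
    return tree.xpath('//table[@id="results"]//tr/td[2]/text()')

html = """<table id="results">
  <tr><td>Widget</td><td>$9.99</td></tr>
  <tr><td>Gadget</td><td>$14.50</td></tr>
</table>"""
print(extract_prices(html))  # ['$9.99', '$14.50']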
2.2.2 Ontology-based Extraction
An alternative to position-based extraction is to generate extraction rules based on knowledge about the reference domain. Ontology-based information extraction systems utilize domain knowledge to recognize and extract data of interest from web pages [Embley et al. 1998, Seo et al. 2001]. Similar to position-based extraction systems, ontology-based extraction systems utilize wrappers which make
an assignment between a variable name and specific domain keywords used to label data within the document. They also use the lexical appearance of the data to help identify extraction targets. Since ontology-based information extraction tools use domain knowledge to describe the data of interest, the wrappers generated using domain ontologies continue to work properly even if the formatting features of the source pages change, and will work for pages from many distinct sources belonging to the same application domain [Embley et al. 1998]. However, ontology-based tools require the extraction targets to be fully described using page-independent features. This means all data to be extracted must either have unique characteristics or be labeled using context keywords. Unfortunately, data of interest on the web does not always meet these requirements. Some data of interest is freeform, cannot be identified using a specific lexical pattern, and is not labeled using context keywords. This type of data can only be extracted using its specific location in the web page.
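As a contrast to the position-based sketch above, the following minimal sketch shows keyword-plus-pattern extraction in the ontology-driven style. The keyword list and currency pattern are hypothetical, not taken from any of the cited systems:

# Minimal sketch of ontology-driven extraction: find a domain keyword,
# then match the lexical pattern of the data near it. Works regardless
# of where the data sits in the page layout.
import re

ONTOLOGY = {
    "price": {
        "keywords": ["Price", "Current bid"],              # hypothetical labels
        "pattern": re.compile(r"[$€]\s?\d+(?:\.\d{2})?"),  # currency pattern
    }
}

def extract(text: str, token: str):
    rules = ONTOLOGY[token]
    for keyword in rules["keywords"]:
        position = text.find(keyword)
        if position == -1:
            continue
        # Search the text following the keyword for the lexical pattern.
        match = rules["pattern"].search(text, position)
        if match:
            return match.group()
    return None

print(extract("Current bid: $12.50 (3 bids)", "price"))  # $12.50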
2.3 Wrapper Recovery and Repair
An important characteristic of an adaptive information extraction system is the capability to repair itself when an information extraction error occurs [see Knoblock et al. 2000]. Wrapper recovery and repair consists of three steps. First, the information extraction system must be able to detect that the structure of the data source has changed in a way that could potentially cause the wrapper to fail [Kushmerick 2000]. Then, the recovery routine attempts to locate the extraction data within the revised page structure. If the extraction target is located successfully, a repair process is initiated to regenerate the extraction rules to match the new page format [Chidlovskii 2002, Knoblock et al. 2000, Lerman et al. 2003].

One technique to perform wrapper recovery is to use alternative views of web pages, content-based classifiers and backward wrappers to cope with small changes to the web page format [Chidlovskii 2002]. A second system that performs wrapper recovery is a schema-guided wrapper maintenance system that utilizes preserved syntactic patterns, annotations and hyperlinks to help identify the locations of desired values in changed web pages [Meng et al. 2003]. An alternative to wrapper repair is to detect when the underlying data source has changed and then to completely regenerate the wrapper using sample pages from the revised site. One system that uses this approach detects changes to a web page's underlying structure using previous positive extraction examples. It identifies data of interest on the new web page, allowing a new wrapper to be generated automatically [Lerman et al. 2003]. A second automatic wrapper generation/regeneration system uses clustering algorithms to effectively determine which tokens of interest should be extracted from the page. This system has been demonstrated to work well on documents with a regular structure [Papadakis et al. 2005].
3 Amorphic System
One alternative to the limitations of exclusively position-based extraction systems and exclusively ontology-based systems is an extraction system that combines the best of both approaches. The Amorphic system combines a position-based
extraction system with an ontology-based extraction system to allow data to be successfully extracted from a wide variety of web pages. In addition, the system includes a wrapper recovery and repair module that allows it to recover from page structure/terminology changes that would otherwise cause the wrapper to fail [Gregg, Walczak 2007].

The current Amorphic system, shown in Figure 1, is designed to work in conjunction with a separate application interface that allows Amorphic to extract information for applications with different extraction goals. On the front-end, the interface application provides tools for the generation of domain-specific ontologies and for the retrieval of relevant web pages to be passed to the Amorphic information extraction system for processing. On the back-end, the interface application receives an XML file containing the structured tokens of interest from the web page and processes those tokens as needed for the current application.
Figure 1. System Architecture

The Amorphic information extraction system consists of several modules. The data preprocessing module examines the page structure and determines how best to parse the site. It analyzes the web page passed from the interface application and constructs extraction rules based on the domain ontology. The extraction rules are used to locate tokens of interest in the web page. If the web page contains tabular data, a modified wrapper is generated that maps the table columns to tokens defined in
the domain ontology. The data extraction module extracts the specific data from the web page. If the extraction completes successfully, it generates an XML file containing the extracted data and passes it back to the interface application. If extraction fails, the web page and domain ontology are passed to the wrapper recovery module, which attempts to locate the "missing" tokens and generate a revised domain ontology.

3.1 Domain Ontology Generation
The Amorphic system utilizes domain ontologies as its primary tool for identifying and extracting tokens from web pages. As with domain ontologies proposed in prior research [see Benslimane et al. 2006, Embley et al. 1998, Tijerino et al. 2005], the domain ontologies used by the Amorphic system utilize keywords and regular expressions to identify target data within a web page [Gregg, Walczak 2007]. However, in addition to extracting data based on domain keywords and patterns, the Amorphic system can also extract a token based on its position within the web page. This allows Amorphic to extract data from unlabeled semi-structured web pages in addition to the labeled data that can be extracted by traditional ontology extraction systems.

A domain ontology is a "rigorous and exhaustive organization of some knowledge domain that is usually hierarchical and contains all the relevant entities and their relations" [WordNet 2007]. In the context of information extraction systems, a domain ontology needs to define the keywords (or labels) of the tokens to be extracted and the characteristics of the tokens to be extracted. The Amorphic system uses an XML-based ontology for representing the domain knowledge. The XML ontology used by Amorphic uses three features to help identify the token to be extracted: keywords (or labels), patterns, and data types [Gregg, Walczak 2007]:

• Keywords can either be domain-specific labels commonly used within web pages to describe the token or, for unlabeled tokens, path expressions that describe the position of the token within a specific web page.
• Patterns are regular expressions specifying what form the data usually takes. For example, currency is usually preceded or followed by a currency identifier (e.g. €200 or $100).
• Data types specify the type of data the returned token should be. Data types are especially useful for locating numeric data within web pages.
Figure 2 shows a simple XML domain ontology to extract basic user profile information from a web page. The Amorphic domain ontology contains one record name and one or more allowed instances of the nested ElementMap element. The ElementMap contains all of the information necessary to extract a single token from a web document (e.g. a user name or user rating). The ElementMap contains four required nested elements and one optional nested element (a hypothetical sketch follows this list) [Gregg, Walczak 2007]:

• The token name is used when generating XML output for the extracted token.
• There are then one or more keyword elements.
• The keywords are followed by one or more pattern elements.
• The type element specifies the data type of the returned token. It has an optional length attribute, which is used to limit the number of characters returned for Strings, and optional minimum and maximum attributes that are used to specify a valid range for numeric data.
• The optional required element can be used to indicate that the token needs to be present in every valid data record. If a required token is not retrieved from a web page using normal extraction procedures, Amorphic initiates a wrapper recovery algorithm to locate the token and repair the ontology.
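For concreteness, the fragment below sketches what such an ontology might look like. The element names and attribute values follow the description above, but they are reconstructions, not the verbatim Amorphic schema:

<!-- Hypothetical sketch of an Amorphic-style domain ontology; element
     names are inferred from the description above and may differ from
     the original schema. -->
<DomainOntology record="UserProfile">
  <ElementMap>
    <TokenName>User</TokenName>
    <keyword>User information</keyword>
    <keyword>User:</keyword>
    <pattern>\w*\w :? ?\( ?\d*\d ?\)</pattern>
    <type length="30">WebString</type>
    <required>True</required>
  </ElementMap>
  <ElementMap>
    <TokenName>UserRating</TokenName>
    <keyword>User information</keyword>
    <pattern>\( ?\d*\d ?\)</pattern>
    <type minimum="0" maximum="5">int</type>
  </ElementMap>
</DomainOntology>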
[Figure 2 content: the XML source of a sample domain ontology defining ElementMaps for a User token (keywords "User", "User information", "User:"; patterns such as \w*\w :? ?\( ?\d*\d ?\); type WebString; required True) and a UserRating token (patterns such as \( ?\d*\d ?\); type int).]
Figure 2: Sample Domain Ontology

Domain ontology generation can be a complex process because there can frequently be dozens of different tokens defined for a given domain, and fully specifying all of the possible keywords used to label the tokens and all of the possible forms (or patterns) the extracted data can take requires time and effort. To aid in the domain ontology generation process, a set of tools is being developed to help with keyword selection, pattern generation and location specification (for unlabeled data).
Currently, automatic keyword or pattern identification algorithms work well for clearly labeled features in simple single-item web pages. However, automated processes frequently require manual correction or augmentation of the generated ontology. The goal of the domain ontology generation process is to develop domain ontologies that are as general as possible so that they can be used to extract data from a variety of websites from a given domain.

3.2 Web Page Retrieval
All information extraction systems require documents to extract information from. In the Amorphic system shown in Figure 1, this function is accomplished in the web page retrieval module, which is part of the interface application. The web page retrieval module is responsible for locating web pages of interest and then sending them to Amorphic for information extraction. The process used to locate web pages of interest varies by application. For example, the module includes a form/query processor that is used to create a user query by parsing the site's search form, writing application-specific query terms into the extracted form elements, and posting the search parameters to the appropriate search page to obtain the HTML search result. This capability allows the web page retrieval module to search individual sites or even search the web using a general-purpose search engine like Google. It can also recursively search a set of predefined pages to detect updates (e.g. for blog sites that regularly change their homepage to add new posts).

The web page retrieval module is responsible for downloading the web page and passing it to Amorphic to begin the extraction process. In cases where a search is used to retrieve multiple web pages, the web page retrieval module passes each page to Amorphic individually. In some applications, the web page retrieval module might navigate through a tree of connected web pages, evaluate the pages encountered based on simple application-specific heuristics, and pass relevant pages to Amorphic for information extraction.

3.3 Data Preprocessing
The web page goes through several preprocessing steps to prepare the data for extraction. Once a document is retrieved by the web page retrieval module, the Amorphic data preprocessor uses an HTML parser to separate the content-text in the web page from the HTML (or XHTML) tags surrounding the text. The parser creates a representation of the web page's structure as a tree of nested tags that follows the Document Object Model (DOM). The data preprocessing module uses the DOM parse-tree to create a location-key that identifies the content-text found in the web page. The location-key is a path expression that defines the set of nested tags that the content-text resides within [Cohen et al. 2002]. Once the parsing process is complete, the Amorphic data preprocessor determines whether the page contains multiple records contained in a table, multiple individually labeled records, or data for a single record of interest. This step is required because multiple-record pages require additional preprocessing before data can be extracted from the document.
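As a rough illustration of this step, the sketch below pairs each piece of content-text with a location-key derived from the DOM parse-tree. It is a minimal sketch, not the Amorphic preprocessor: Python with lxml, whose getpath() stands in for whatever location-key format the real system uses:

# Sketch of location-key generation: pair each content-text fragment
# with a path expression describing the nested tags it resides in.
import lxml.html

def location_keys(html: str):
    tree = lxml.html.fromstring(html)
    pairs = []
    for element in tree.iter():
        text = (element.text or "").strip()
        if text:
            # getpath() yields a DOM path such as /html/body/div/p
            pairs.append((tree.getroottree().getpath(element), text))
    return pairs

html = "<div><h1>Profile</h1><p>User: JaneDoe (42)</p></div>"
for key, text in location_keys(html):
    print(key, "->", text)
# /html/body/div/h1 -> Profile
# /html/body/div/p -> User: JaneDoe (42)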
3.3.1 Preprocessing Tabular Data
Amorphic detects whether the information extraction tokens are contained in a table. To do this, Amorphic searches the parsed HTML data for keywords defined in the domain ontology. When keywords are located in adjacent cells in a table row, they are assumed to be column headings for tabular data. In order to map each of these column headings to the data tokens contained in each row of the table, a temporary wrapper is generated. The temporary wrapper redefines the domain ontology such that the keyword elements found in the heading row are converted to path expressions that describe the position (column) of the corresponding token within the HTML table [Gregg, Walczak 2007].

On the web, data is packaged in many different ways. For example, it is possible that a table can contain tokens arranged in columns (where keywords are the column headers) as well as tokens that are individually labeled. In these situations, Amorphic uses two separate labeling schemes. One scheme uses path expressions to identify the tokens labeled using column headings, and the second uses the original keyword expressions found in the ElementMap for all tokens that do not have a corresponding column heading. This allows Amorphic to locate the locally labeled tokens in addition to locating the tokens based on the column headings.
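The column-mapping step can be sketched as follows; the keyword-to-token table is hypothetical, and the real wrapper emits path expressions rather than column indices:

# Sketch of mapping ontology keywords found in a heading row to column
# positions, so that each row's cells can be extracted by position.
KEYWORDS = {"Item": "title", "Price": "price", "Bids": "bids"}  # hypothetical

def map_columns(header_cells: list[str]) -> dict[int, str]:
    """Return {column index: token name} for recognized headings."""
    mapping = {}
    for index, cell in enumerate(header_cells):
        for keyword, token in KEYWORDS.items():
            if keyword.lower() in cell.lower():
                mapping[index] = token
    return mapping

def extract_rows(header: list[str], rows: list[list[str]]):
    columns = map_columns(header)
    return [{token: row[i] for i, token in columns.items()} for row in rows]

header = ["Item", "Current Price", "Bids"]
rows = [["Widget", "$9.99", "3"], ["Gadget", "$14.50", "0"]]
print(extract_rows(header, rows))
# [{'title': 'Widget', 'price': '$9.99', 'bids': '3'},
#  {'title': 'Gadget', 'price': '$14.50', 'bids': '0'}]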
3.3.2 Preprocessing Multiple Data Records
Web pages (especially those resulting from a search) frequently contain multiple data records. For Amorphic to extract the data for each of these records correctly, it must be able to separate the data into groups, each of which contains data for an individual record. The separation of data into single-record groups can be accomplished in two ways. First, the developer of the domain ontology can specify a record delimiter in the domain ontology. The record delimiter represents a group of characters (or HTML tags) that are used to separate individual records on web pages from a particular site. The advantage of pre-specifying a record delimiter is that it guarantees that Amorphic will be able to effectively separate the records, as long as the web page structure remains the same. However, if the page structure changes, separating records using predefined delimiters can fail, and Amorphic will automatically revert to a dynamic process for locating record boundaries.

The dynamic record separation process uses two independent heuristics. The first heuristic works for documents with a uniform structure for the individual records. The uniform structure heuristic examines the order of the data on the web page and determines if there is a repeating pattern in which the tokens occur. It then locates the token that occurs first in the pattern and the token that occurs last in the pattern. These beginning and ending tokens are then used to separate the parsed HTML document into individual record groups [Gregg, Walczak 2007]. The uniform structure heuristic is appropriate for results returned from a web search or other consistently formatted data.

The second record separation heuristic is used to process less structured documents where the order of tokens within an individual record is not uniform. For example, blogs and social networking pages allow individuals to pick and choose the template they use to display their content, and individuals on these sites are allowed to
generate much of the text displayed within the templates; as such, these pages lack the uniform structure of search page results. The nonuniform structure heuristic identifies record boundaries in these types of pages by looking for the HTML tags typically used as record separators. Prior research indicates that records on HTML pages are frequently separated by a repeating pattern of two or more commonly used HTML tags [Embley et al. 1999]. Amorphic counts the number of occurrences of all pairs of commonly used separator tags that have no intervening plain text and then ranks the identified pairs to find the pairs that most closely approximate the estimated number of returned records on the page. The highest-ranked pair is then used to separate the data, and the resulting data element groups are then examined to determine if they represent valid records (based on the domain ontology). If the tag-pair record separation heuristic cannot identify the individual records, the occurrences of individual commonly used separator tags are counted to see if a correct number of records can be identified [Gregg, Walczak 2007].
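Under simplifying assumptions (a flat tag/text stream and a known record-count estimate), the tag-pair counting idea can be sketched as follows; the separator tag set and ranking rule are illustrative, not the exact Amorphic heuristic:

# Sketch of the nonuniform-structure record separation heuristic:
# count candidate separator-tag pairs with no intervening text and
# pick the pair whose count best matches the estimated record count.
from collections import Counter

SEPARATOR_TAGS = {"hr", "br", "p", "td", "li"}  # illustrative tag set

def best_separator(tokens: list[str], estimated_records: int):
    """tokens: parsed stream in which tags appear as 'tag:NAME' and
    content-text appears as plain strings."""
    pairs = Counter()
    for first, second in zip(tokens, tokens[1:]):
        if (first.startswith("tag:") and second.startswith("tag:")
                and first[4:] in SEPARATOR_TAGS
                and second[4:] in SEPARATOR_TAGS):
            pairs[(first[4:], second[4:])] += 1
    if not pairs:
        return None
    # Rank pairs by how closely their count matches the estimate.
    return min(pairs, key=lambda pair: abs(pairs[pair] - estimated_records))

tokens = ["tag:p", "Record one", "tag:br", "tag:br",
          "Record two", "tag:br", "tag:br", "Record three"]
print(best_separator(tokens, estimated_records=3))  # ('br', 'br')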
3.4 Data Extraction
Once the document has completed data preprocessing, data extraction is initiated. The Amorphic system uses a three-step process to locate, correctly identify, and extract tokens from the web page (a simplified sketch follows these steps) [Gregg, Walczak 2007]:

• Step 1: The set of location-key/content-text pairs generated by the HTML parser (and, if appropriate, grouped in the record separation process) is searched for any of the keywords defined in one of the ElementMaps specified in the domain ontology. If the keyword represents a path expression, the location-key is used for the search. Otherwise, the keyword is searched for within the content-text.
• Step 2: The content-text immediately following the keyword location is searched to find the first occurrence of one of the patterns defined in the domain ontology. If this forward search does not locate any content-text matching one of the patterns within the ten content-text groups following the keyword, the ten content-text groups preceding the keyword are searched for the first occurrence of one of the domain ontology patterns.
• Step 3: The data type is used to extract the desired token from the content-text. When extracting text data, the data extraction module uses the length attribute to determine how much text to retrieve. It retrieves all the content-text following the occurrence of the keyword from the same subtree of the parse-tree in which the keyword resides. If the length of the text retrieved is less than the length specified in the length attribute, it also retrieves the text in the subtree immediately following the keyword subtree. If the retrieved text exceeds the length specified in the length attribute, the data extraction module trims the returned string to size. If the data type is a numeric type and the maximum and/or minimum attributes are specified, the data retrieved is examined to determine whether it falls within the specified range. If it does not, the search for a matching token continues.
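A condensed sketch of the three steps, assuming a hypothetical ontology entry and plain content-text groups rather than the real location-key pairs:

# Simplified sketch of the three-step extraction process: keyword
# search, forward then backward pattern search within ten content-text
# groups, and data-type/range checking. Hypothetical ontology entry.
import re

def extract_token(groups: list[str], keywords, pattern, dtype=float,
                  minimum=None, maximum=None):
    for i, text in enumerate(groups):
        if not any(keyword in text for keyword in keywords):
            continue  # Step 1: find a keyword occurrence
        # Step 2: search up to ten groups after the keyword, then the
        # ten groups before it (nearest first).
        candidates = groups[i:i + 11] + groups[max(0, i - 10):i][::-1]
        for candidate in candidates:
            match = re.search(pattern, candidate)
            if not match:
                continue
            # Step 3: enforce the declared data type and value range.
            try:
                value = dtype(match.group(1))
            except ValueError:
                continue
            if minimum is not None and value < minimum:
                continue
            if maximum is not None and value > maximum:
                continue
            return value
    return None

groups = ["Seller info", "Current bid:", "US $12.50", "Shipping: $4.00"]
print(extract_token(groups, ["Current bid"], r"\$(\d+\.\d{2})",
                    minimum=0.0))  # 12.5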
Once all of the tokens for a single record are located and extracted, each token is enclosed in appropriate XML tags and the entire XML record is returned. The XML record name will be the record name defined in the domain ontology. The individual tokens will be enclosed in XML tags with the token name defined in the specific ElementMap.

On many web sites, or in many larger web domains, the data available may vary from page to page. For example, on a commercial web site some products may include a specification for the weight of the product (e.g. to indicate an additional shipping charge) and other product pages may not include a weight. Amorphic therefore assumes that it does not have to retrieve every token specified in the domain ontology for the record to be valid. If a record must include a particular token (or group of tokens) to be valid, then the token definition must include the required element in the domain ontology and its value must be "True". If a required token is not found using normal extraction procedures, Amorphic initiates an automatic wrapper recovery and repair process.

3.5 Automatic Wrapper Recovery and Repair
The Amorphic system uses a wrapper recovery and repair process that attempts to locate the missing token by creating a new set of keywords to locate the data of interest on the web page. A thesaurus-based system is used to generate additional keywords to search for within the web data source. The Amorphic system utilizes a standard thesaurus that defines a set of synonyms for many words in the English language. This thesaurus consists of a hash table where each initial keyword is the key and the set of possible thesaurus synonyms is stored as an array of strings. Any of the synonyms can be used to replace a single word or set of words found in a keyword list for a particular domain. For example, if the keywords "User information" were originally specified for a particular domain, the thesaurus-based generation process would use the following synonyms for the word user: consumer, customer, client, occupant, addict or abuser. It would use the following synonyms for the word information: advice, clue, cue, data, description, dossier, enlightenment, erudition, instruction, intelligence, knowledge, learning, lore, material, message, network, news, notice, notification, propaganda, report, scoop, score, tip, or wisdom. The keyword generation process would first replace each of the individual terms in the keyword pair with each of the synonyms for that word. It then creates keyword sets for each unique combination of synonyms. Finally, each individual keyword and keyword synonym is added to the list of candidate keywords to search for.

Once the set of new candidate keywords is generated, the Amorphic system uses a three-step location process to identify candidate tokens that could be the tokens of interest. This process is similar to the process used to locate tokens for the original keywords [Gregg, Walczak 2007]:
• Step 1: Scan the set of location-key/content-text pairs generated by the HTML parser to identify all text segments that contain one of the thesaurus-generated keywords or keyword sets.
• Step 2: Search the content-text before and after the keyword location to find the first occurrence of one of the patterns defined in the domain ontology. If content-text containing an appropriate pattern is found within the ten text segments surrounding the keyword, the keyword and matching content are added to a list of candidate tokens.
• Step 3: Identify the correct data token from within the set of candidate tokens identified. Three factors are used to determine the relevance of candidate tokens. First, the location of a candidate token relative to other tokens already located on the web page is determined (tokens closer to other extracted text are more likely to contain data of interest). Second, the distance (in characters) between the keyword and the content-text containing an appropriate pattern is calculated. Finally, the number of keywords in the keyword phrase is calculated. A candidate token that is located near a group of related keywords is assumed to be more precisely labeled (and more likely to be the relevant token) than one that is located near a single thesaurus word.
For example, assume that the keyword "User information" cannot be located in the document being scanned. Amorphic looks for all possible keywords based on the generated thesaurus words. Assume it finds the keyword "News" at the top of the document with a candidate token "Events" located 2 characters after the candidate keyword. It also finds the keyword "User Description" in the main body of the page with the candidate token "JaneDoe" immediately following the keyword. The second keyword/candidate token pair would score much higher on all three relevance factors. Its location relative to the other tokens extracted would be on the order of 50 to 200 characters, versus 1000+ characters for the token at the top of the page. The distance between the second keyword and the candidate token is smaller. Finally, there are two keywords describing the second candidate token instead of just one. Once the candidate token with the highest relevance score is identified, the content-text for the highest-ranking token is extracted and the new keywords describing the token are added to the set of keywords defined in the ElementMap for the domain ontology.
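A compact sketch of the expansion and ranking ideas, with a toy thesaurus and an illustrative scoring function (the relative weighting of the three factors is an assumption, not the published algorithm):

# Sketch of wrapper recovery: expand failed keywords via a thesaurus,
# then rank candidate tokens by the three relevance factors described
# above. Thesaurus entries and score weights are illustrative only.
from itertools import product

THESAURUS = {
    "user": ["consumer", "customer", "client"],
    "information": ["data", "description", "news"],
}

def expand(keyword: str) -> list[str]:
    """Generate candidate keywords for every synonym combination."""
    options = [[word] + THESAURUS.get(word, [])
               for word in keyword.lower().split()]
    return [" ".join(combo) for combo in product(*options)]

def relevance(token_distance, keyword_distance, keyword_count):
    # Factors: distance to other extracted tokens, keyword-to-content
    # distance (both smaller is better), and keyword phrase length.
    return keyword_count * 100 - token_distance - keyword_distance

print(expand("User information")[:4])
# ['user information', 'user data', 'user description', 'user news']

candidates = {"Events": (1200, 2, 1), "JaneDoe": (80, 1, 2)}
print(max(candidates, key=lambda name: relevance(*candidates[name])))
# JaneDoe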
4 Example Implementations
The primary purpose of an adaptive information extraction system is to support information extraction across a variety of domains and to adapt to changes in websites over time without breaking. This section describes five different applications to show the power of the Amorphic approach and the role information extraction systems can play in a variety of different domains. In three of the example implementations the information extraction occurred over a period of years. The purpose of these example implementations is to demonstrate the advantages of using a system like Amorphic and to provide some examples of how information extraction systems might be used in practice.
4.1 Auction Advisor
The Amorphic information extraction system was first used as part of an Auction Advisor system. The Auction Advisor uses Amorphic to acquire a variety of information related to online auctions so that meaningful pricing and bidding recommendations can be made to end-users [Gregg, Walczak 2006a]. The initial Auction Advisor system used position-based wrappers to extract information from the eBay, Amazon and Yahoo auction sites. This required six separate wrappers to be developed (a wrapper for the search page and a wrapper for the auction item page for each site). These six position-based wrappers frequently failed due to page structure changes at the three auction websites.

The Auction Advisor system was modified in 2003 to use Amorphic as its information extraction system. Two separate domain ontologies were generated for the online auction domain. The first domain ontology was designed to extract data from auction search results pages, and the second was designed for individual auction pages containing a single item being auctioned. Since Amorphic domain ontologies allow more than one set of keywords and more than one pattern to be specified, a single ontology can be used to extract data from more than one web site. Thus, the same two domain ontologies were used to extract data from all three auction sites being evaluated.

The Amorphic information extraction system was tested to determine whether it was a suitable tool for use with the Auction Advisor. Both auction search results pages and individual item pages were used for the validation tests. Amorphic succeeded in retrieving both single-item data and multiple-item data from all three web sites. It was capable of extracting multiple data records from pages with data in tables with headings, from pages with formatted paragraphs and labels, and from pages with a mix of labels and table headings. Searches were conducted at each auction site using 74 different search strings. Each of these searches resulted in at least one search results page being retrieved per auction site. If the search resulted in more than 50 items being matched for a given auction site, then multiple search results pages were parsed by the system. The Amorphic system retrieved and extracted data for 70,892 auctions from eBay, 1,137 auctions from Yahoo and 1,061 auctions from Amazon. The system correctly extracted 100% of the auction records from the multi-item pages with 99.99% accuracy. Only the "Buy Now" price was extracted incorrectly for a small fraction of the eBay records (0.04%) [Gregg, Walczak 2007, Gregg, Walczak 2006b].

A sample of 250 individual item pages for each of the three auction sites was then selected for parsing once the auction had closed. When the actual data extraction process was performed, all three auction sites had removed a few of the selected individual auction pages from their databases. Thus, a final sample of 245 pages for eBay, 189 pages for Yahoo and 192 pages for Amazon was processed using the Amorphic system. Results of the information extraction tests at all three sites are summarized in Table 1. The Amorphic system was able to parse 100% of the individual item records for each of these web sites. 99.02% of the tokens extracted from the individual item pages represented data of interest [Gregg, Walczak 2007, Gregg, Walczak 2006b]. In 0.8% of the cases, keywords from the domain ontology appeared elsewhere in the text (e.g. in the freeform item descriptions) and caused the
Amorphic data extractor to extract the wrong token from the file.

TABLE 1: INFORMATION EXTRACTED FROM AUCTION ITEM PAGES IN 2003

Site     # Records Retrieved   % Records Retrieved   % Data Items Retrieved (Amorphic)
eBay     245                   100.00%               99.13%
Yahoo    189                   100.00%               98.68%
Amazon   192                   100.00%               99.22%
One of the challenges typically faced when relying on information extracted from the web is that the web is a constantly changing medium. Websites like eBay frequently change the structure of the site to provide more attractive listings or to make more information available to users. One of the reasons Amorphic was adopted for use in the Auction Advisor system is its ability to work over a period of years without requiring manual changes to the domain ontology. To determine if Amorphic met this objective, three separate tests of the Amorphic information extraction system were made at the eBay online auction site: the original test in 2003, a test in 2004, and a final test in 2007. As a part of each of these tests, a sample of individual item pages was retrieved and the information extracted using the original auction domain ontology was determined. The information extraction was compared to the information extraction that would have occurred using a position-based wrapper and an ontology-based wrapper, both developed at the time of the original extraction.

Amorphic demonstrated superior performance to both the position-based wrappers and the ontology wrappers in all three tests, as shown in Table 2. The reliability of the position-based wrapper degraded dramatically from 98.58% in 2003 to only 14.28% of tokens extracted in 2007. This was due to major changes in the structure of the individual item pages on eBay between 2003 and 2007. The ontology-based wrappers performed significantly better and had limited degradation in performance over the period examined. However, the performance of the Amorphic wrapper actually improved between 2003 and 2007! This was entirely due to the fact that eBay implemented a standardized way of presenting shipping charges between 2004 and 2007. The wrapper recovery module was able to locate the new shipping charge data, allowing Amorphic to more reliably extract the shipping charge data. Based on this evaluation, the Amorphic information extraction system represents a significant improvement over the position-based wrappers used in the original Auction Advisor system.

TABLE 2: INFORMATION EXTRACTED FROM EBAY OVER 5 YEARS

Year   # Records Retrieved   % Records Retrieved   % Data Items Retrieved (Amorphic)   % Data Items Retrieved (Position Only)   % Data Items Retrieved (Ontology Only)
2003   245                   100.00%               99.13%                              98.58%                                   98.36%
2004   872                   100.00%               99.00%                              75.92%                                   98.23%
2007   101                   99.36%                100.00%                             14.28%                                   98.20%
4.2 Online Auction Fraud
Amorphic has been used in a pair of studies that sought to better understand online auction complaints and online auction fraud through the analysis of negative feedback comments posted at eBay [Gregg, Scott 2006, Gregg, Scott 2007]. These exploratory studies used the Amorphic information extraction system to extract negative feedback comments about online auction sellers posted in the eBay reputation system. A domain ontology was constructed for eBay comment pages. The ontology was designed to extract buyer and seller user names, buyer and seller ratings and the negative comments placed by buyers.

The interface application retrieved user pages for over 40,000 different eBay users. In May 2003, Amorphic successfully extracted 6,571 negative feedback comments for 3,862 different users (the remaining users did not have any negative feedback comments in the one-month period preceding the information extraction). An additional 867 negative comments were extracted in July 2005. The extracted data was passed back to the interface application, which then analyzed the complaints using a set of 15 rules that created a preliminary classification of the negative comments. This was followed up by a manual content analysis which refined the classification and determined the frequency of problems reported on eBay. The overall complaint rate for this study was 0.73 complaints for every 100 comments made. The study indicated that 69.7% (71.8% in 2005) of negative comments posted in eBay's feedback forum indicate that the seller may have defrauded the buyer by failing to deliver the item, by misrepresenting the item in the product description, by selling illegal goods, by adding charges after the close of the auction, or by shill bidding [Gregg, Scott 2006, Gregg, Scott 2007]. Thus, the rate of fraud accusations (as a percentage of completed auctions) made in the eBay reputation system was nearly 0.2%, 20 times higher than the rate reported through official channels.

These online auction fraud studies demonstrate that web information extraction and classification could be used by law enforcement to identify criminal activity that is not being reported. This could help law enforcement identify criminal activity more quickly and potentially allow criminals to be apprehended earlier than if automatic web information extraction and analysis were not used.

4.3 Medical Article Mining
One domain with high interest in automated information retrieval and extraction is the medical domain. There is a huge amount of data available in medical domains and it is necessary for medical professionals to keep their knowledge up-to-date [Spath 2000]. Most medical data is rapidly moving towards more web-oriented content [Bazzoli 2000, Detmer, Shortliffe 1997]. Physicians have indicated that they are willing to use data obtained from the web [Bazzoli 2000], but they need assistance navigating the web to locate relevant data and to access it. Information retrieval agents have been proposed to facilitate access to web-based medical information [Walczak 2003], but these agents rely on specific web-page formats and known access protocols.

The Cardiac Lab at the Denver Veterans Administration Medical Center began a project to extract abstracts related to heart/cardiac problems from PubMed and other online medical publication databases. The goal of the project was to automatically obtain abstracts of interest from online sources and then to classify them using automated classification tools, such that the process of finding relevant articles could be automated for the staff at the Cardiac Lab. The Amorphic agent was used for the information location part of the application. The agent queried four different online publication databases and retrieved abstracts and other citation information to be used in the classifier application. The information extraction portion of the project was successful, with Amorphic extracting 95.67% of the data of interest [Gregg, Walczak 2007].

4.4 Social Networks
Social networking sites are storehouses for vast amounts of information related to individual preferences, social relationships, and recommendations. Researchers are just beginning to use the information contained in social networks. One example is InterestMap, an application that extracted information from social networking sites to build models of people based on their interests and then created a network-style view of the space of interconnecting interests and identities [Liu, Maes 2005]. Another study examines patterns of information revelation by university students in online social networks and the associated privacy implications [Gross et al. 2005]. There are a variety of other social networks that can be used to provide information for applications. For example, it is possible to detect conflict-of-interest relationships among potential reviewers and authors of scientific papers using 'semantic associations' created by integrating entities and relationships from two social networks, namely "knows," from a FOAF (Friend-of-a-Friend) social network and "co-author," from the underlying co-authorship network of the DBLP bibliography [Aleman-Meza et al. 2006].

Amorphic has been used to study social network relationships between buyers and sellers on eBay. As part of this project, Amorphic accessed eBay profile and "About Me" pages and extracted a variety of user characteristics. In all, information from 43,859 different users was extracted over a three-year period. This information was used as a part of ongoing studies to understand the dynamics of online auctions and the role of reputation systems in decisions regarding whom to transact with online. Unlike prior research on social networks that focused on data gathered during a single slice of time, this study gathered data over time, and thus it was important for the information extraction system to adapt to changes in the structure of the site.

There are other important questions that can best be answered using longitudinal data from a variety of social networks. This research can map changing associations, friendship networks, and evolving preferences. The Amorphic system is currently being expanded to gather longitudinal data related to social networks on MySpace, Facebook, and LinkedIn. The advantage of using an adaptive information extraction
tool for this application is that it can continue to successfully extract a wide variety of data from individual profiles, even when a user changes the template used to display the data or the site updates the way it organizes content.

4.5 Blog Article Classifier
Weblogs (or blogs) are another emerging information extraction domain. Blogs have become an important communication mechanism for individuals and organizations. However, the rapid proliferation of blogs has led to an information explosion that makes obtaining a systematic understanding of blog information content difficult. Several researchers have begun to use information extraction techniques to systematically process blog data. One system automatically collects and monitors Japanese blog collections made with both blog software and those written as normal web pages. Their approach located date expressions within documents, extracts content-text between the dates, and mines useful information from the collected blog pages [Nanno et al. 2004]. A second system tracks information flow through the "blogsphere" using the existing link structure between blog entries [Adarm, Adamic 2005]. A third system uses text clustering to organize blog posts from multiple sources and present them within a single user interface The system (FeedWiz) retrieves blog feeds that conform to either the Atom or RSS specification and then uses the textual content of the post to create a hierarchical structure based on term similarity [Schuff, Turketken 2006]. As blog extraction systems evolve they need to provide enhanced tools for managing this unique form of information. This requires using a wide variety of data to better understand and classify blog entries. For example, the blog information extraction tools discussed previously do not analyze either blog labels or blog comments when evaluating or classifying blog entries. Amorphic is currently being used in a project to extract and evaluate blog postings. The blog interface application (currently being developed) uses a blog search engine to locate blog postings related to a particular topic. It then passes the blog posting page to Amorphic, which scans the blog posting page using a blog specific domain ontology and extracts the tokens of interest from the page. These tokens include the date of the post, the author, the actual post text, any labels applied to the post, comments made about the post and the authors of those comments. Amorphic returns the extracted content to the blog interface application which will pass the extracted data to a text clustering tool like Feedwiz. Using a tool like Amorphic would allow Feedwiz to work with a wider variety of blog sites, and capture richer data than is available with RSS or Atom alone. In addition, the ability to capture labels, comments and comment authors has the potential to expand the effectiveness of the clustering algorithm (e.g. more accurate keywords) and provide a measure of the quality of the individual post (e.g. using number of comments and average length of a comment). 5
5 Discussion and Conclusions
One of the biggest challenges in using web-based information for organizational decision making is the unreliability of the web as an information
source. Web sites containing data of interest come in a wide array of formats: some include labels and some do not; some are well structured and others are not. This makes information extraction difficult to manage because, frequently, a tool that can extract information from one site is not well suited to extracting data from another. A second challenge is that web sites frequently change their page designs, causing most wrappers written for the original site to fail or become less effective. This is a problem for organizations interested in using web-based information for decision making because the effort required to maintain information extraction wrappers often exceeds what the data is worth to the organization.

This paper presents an adaptive, combined ontology-based and position-based information extraction system that can be used as part of intelligent web mining systems to extract data from a wide variety of data sources. The Amorphic information extraction system can locate data of interest based on domain knowledge or web page structure, and it can automatically repair a wrapper when the structure of the information source changes. One feature of the system is that the domain-knowledge-based wrappers and the extracted data are represented as XML documents (sketched below), which allows the system to be easily configured for a variety of domains. The prototype system demonstrated good performance for all five example implementations discussed in this paper.

The five example implementations illustrate the need for information extraction systems capable of extracting information from semi-structured web documents. Contrary to the predictions of semantic web proponents, the web does not seem to be moving towards a form where the majority of content is labeled to facilitate processing by autonomous software agents. Instead, the past few years have seen an explosion of semi-structured, human-readable data facilitated by Web 2.0 technologies such as blogs, wikis, and social network services. The example implementations show that these sites are potentially large information sources for information extraction applications. For example, the online auction fraud study demonstrates the usefulness of web information to law enforcement. Similar studies could be undertaken at social network sites, examining comments to gain a more accurate estimate of the percentage of young people at these sites who are approached by sexual predators or solicited to participate in some type of illegal activity. Other uses for data from these sites include applications like FeedWiz that are capable of aggregating information on a wide variety of topics [Schuff, Turetken 2006].

Three of the five example implementations illustrate the adaptability of the Amorphic system. They show that Amorphic can be used to gather data from information sources over time, even when a website changes the design of its web pages. This dramatically improves the reliability of the information extraction process, especially when compared to traditional position-based extraction systems. The results of the example implementations demonstrate that the Amorphic system represents a cost-effective approach to developing reliable, large-scale, adaptable information extraction systems for a variety of domains.
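As a concrete illustration of the XML wrapper representation mentioned above, the sketch below loads a hypothetical wrapper definition that pairs ontology keywords with an optional position-based fallback. The element and attribute names are invented for this example; they are not Amorphic's actual schema.

# A minimal sketch of loading an XML wrapper definition. The schema shown
# (wrapper/field/keyword/position) is hypothetical, for illustration only.
import xml.etree.ElementTree as ET

WRAPPER_XML = """
<wrapper domain="online-auction">
  <field name="feedback_score">
    <keyword>feedback score</keyword>
    <keyword>feedback rating</keyword>
    <position fallback="true">table[1]/tr[1]/td[2]</position>
  </field>
  <field name="member_since">
    <keyword>member since</keyword>
    <keyword>registered</keyword>
  </field>
</wrapper>
"""

def load_wrapper(xml_text: str) -> dict:
    """Map each field name to its keyword synonyms and optional fallback path."""
    root = ET.fromstring(xml_text)
    rules = {}
    for f in root.findall("field"):
        keywords = [k.text for k in f.findall("keyword")]
        position = f.findtext("position")   # None when no positional fallback
        rules[f.get("name")] = {"keywords": keywords, "position": position}
    return rules

print(load_wrapper(WRAPPER_XML))

Keeping both keyword synonyms and a positional fallback in one declarative file is what makes reconfiguring such a system for a new domain an editing task rather than a programming task.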
There is a need for additional research on information extraction systems, specifically on improving the ability of systems to work with both position-based extraction and ontology-based extraction. For example, the wrapper recovery algorithm used by the current Amorphic system can only recover from failures in the ontology extraction rules. The system could be combined with other wrapper
recovery and repair approaches capable of recovering from position-based failures [see Chidlovskii 2002, Lerman et al. 2003, Meng et al. 2003, Papadakis et al. 2005]. Further wrapper recovery algorithms could also be added to the existing wrapper recovery module so that the system can handle a wider variety of terminology and page structure changes with greater accuracy. For example, the current thesaurus-based recovery system could be extended to deal with misspellings (missing or transposed letters), allowing the system to cope more readily with human errors made when constructing web pages; a sketch of such matching follows this paragraph. Finally, the ability to automatically create domain ontologies from a set of sample pages needs to be improved. An ontology generation system needs to locate both potential keywords and data of interest, and it should also identify potential unlabeled extraction targets and add their positions to the generated ontology rules.
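The sketch below shows one way such misspelling-tolerant recovery could work, using approximate string matching from the Python standard library. The label set and similarity cutoff are illustrative assumptions, not part of the current Amorphic implementation.

# A sketch of misspelling-tolerant label recovery via approximate matching.
# The known-label list and cutoff are hypothetical.
import difflib

KNOWN_LABELS = ["feedback score", "member since", "location"]

def recover_label(page_label: str, cutoff: float = 0.8):
    """Map a possibly misspelled page label (missing or transposed letters)
    back to a known ontology keyword; return None if nothing is close enough."""
    matches = difflib.get_close_matches(page_label.lower(), KNOWN_LABELS,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(recover_label("feedbcak score"))  # transposed letters -> 'feedback score'
print(recover_label("membr since"))     # missing letter -> 'member since'
print(recover_label("shipping cost"))   # unrelated label -> None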
References
[Adar, Adamic 2005] E. Adar, L.A. Adamic, "Tracking Information Epidemics in Blogspace," Proc. IEEE/WIC/ACM International Conference on Web Intelligence, (2005), 207-214.
[Aleman-Meza et al. 2006] B. Aleman-Meza, M. Nagarajan, C. Ramakrishnan, L. Ding, "Semantic Analytics on Social Networks: Experiences in Addressing the Problem of Conflict of Interest," Proc. 15th International Conference on World Wide Web, (2006).
[Arasu, Garcia-Molina 2003] A. Arasu, H. Garcia-Molina, "Extracting Structured Data from Web Pages," ACM SIGMOD Record, (June 2003), 337-348.
[Atzeni et al. 2002] P. Atzeni, G. Mecca, P. Merialdo, "Managing Web-based Data: Database Models and Transformations," IEEE Internet Computing, 6, 4, (July-Aug. 2002), 33-37.
[Bazzoli 2000] F. Bazzoli, "Gateways to the Internet," Internet Health Care Magazine, (Mar./Apr. 2000), 70-77.
[Benslimane et al. 2006] S. M. Benslimane, D. Benslimane, M. Malki, Y. Amghar, H. Saliah-Hassane, "Acquiring OWL Ontologies from Data-intensive Web Sites," Proc. 6th International Conference on Web Engineering, Palo Alto, California, (July 2006), 361-368.
[Berners-Lee et al. 2001] T. Berners-Lee, J. Hendler, O. Lassila, "The Semantic Web," Scientific American, 284, 5, (May 2001), 34-43.
[Carroll, Klyne 2004] J. J. Carroll, G. Klyne, "Resource Description Framework (RDF): Concepts and Abstract Syntax," (February 10, 2004), http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/.
[Chang et al. 2006] C. Chang, M. Kayed, M. R. Girgis, K. F. Shaalan, "A Survey of Web Information Extraction Systems," IEEE Transactions on Knowledge and Data Engineering, 18, 10, (October 2006), 1411-1428.
[Chidlovskii 2002] B. Chidlovskii, "Automatic Repairing of Web Wrappers by Combining Redundant Views," Proc. 14th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2002), (4-6 Nov. 2002), 399-406.
[Cohen et al. 2002] W. Cohen, M. Hurst, L. Jensen, "A Flexible Learning System for Wrapping Tables and Lists in HTML Documents," Proc. 11th International Conference on World Wide Web, (2002), 232-241.
[Crescenzi et al. 2001] V. Crescenzi, G. Mecca, P. Merialdo, "ROADRUNNER: Towards Automatic Data Extraction from Large Websites," Proc. of VLDB, (2001), 109-118.
[Detmer, Shortliffe 1997] W.M. Detmer, E.H. Shortliffe, "Using the Internet to Improve Knowledge Diffusion in Medicine," Communications of the ACM, 40, 8, (1997), 101-108.
[Embley 2004] D. W. Embley, "Toward Semantic Understanding: An Approach Based on Information Extraction Ontologies," Proc. 15th Australasian Database Conference, Dunedin, New Zealand, (2004), 3-12.
[Embley et al. 1998] D. W. Embley, D. M. Campbell, R. D. Smith, S.W. Liddle, "Ontology-based Extraction and Structuring of Information from Data-rich Unstructured Documents," Proc. 7th International Conference on Information and Knowledge Management, (Nov. 1998), 52-59.
[Embley et al. 1999] D. W. Embley, Y. Jiang, Y. K. Ng, "Record-boundary Discovery in Web Documents," ACM SIGMOD Record, 28, 2, (June 1999), 467-478.
[Flesca et al. 2004] S. Flesca, G. Manco, E. Masciari, E. Rende, A. Tagarelli, "Web Wrapper Induction: A Brief Survey," AI Communications, 17, 2, (2004), 57-61.
[Gregg, Scott 2006] D. Gregg, J. Scott, "The Role of Reputation Systems in Reducing Online Auction Fraud," International Journal of Electronic Commerce, 10, 3, (Spring 2006), 97-122.
[Gregg, Scott 2007] D. Gregg, J. Scott, "A Typology of Complaints about eBay Sellers," Communications of the ACM, (forthcoming 2007).
[Gregg, Walczak 2006a] D. Gregg, S. Walczak, "Auction Advisor: Online Auction Recommendation and Bidding Decision Support System," Decision Support Systems, 41, 2, (January 2006), 449-471.
[Gregg, Walczak 2006b] D. Gregg, S. Walczak, "Adaptive Web Information Extraction," Communications of the ACM, 49, 5, (May 2006), 78-84.
[Gregg, Walczak 2007] D. Gregg, S. Walczak, "Exploiting the Information Web," IEEE Transactions on Systems, Man and Cybernetics, Part C, 37, 1, (January 2007), 109-125.
[Gross et al. 2005] R. Gross, A. Acquisti, H. J. Heinz III, "Information Revelation and Privacy in Online Social Networks," Workshop on Privacy in the Electronic Society, Alexandria, VA, USA, (2005), 71-80.
[Knoblock et al. 2000] C.A. Knoblock, K. Lerman, S. Minton, I. Muslea, "Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach," Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 23, 4, (2000), 33-41.
[Koivunen, Miller 2002] M. Koivunen, E. Miller, "W3C Semantic Web Activity," in E. Hyvonen, ed., Semantic Web Kick-Off in Finland, Helsinki, Finland, (2002), 27-44.
[Kushmerick 2000] N. Kushmerick, "Wrapper Verification," World Wide Web Journal, 3, 2, (2000), 79-94.
[Kushmerick et al. 1997] N. Kushmerick, D. Weld, R. Doorenbos, "Wrapper Induction for Information Extraction," Proc. of the International Joint Conference on Artificial Intelligence, (1997), 729-735.
[Laender et al. 2002] A. H. F. Laender, B. A. Ribeiro-Neto, A. S. da Silva, J. S. Teixeira, "A Brief Survey of Web Data Extraction Tools," ACM SIGMOD Record, 31, 2, (June 2002), 84-93.
[Lerman et al. 2003] K. Lerman, S. Minton, C. Knoblock, "Wrapper Maintenance: A Machine Learning Approach," Journal of Artificial Intelligence Research, 18, (Feb. 2003), 149-181.
[Liu, Maes 2005] H. Liu, P. Maes, "InterestMap: Harvesting Social Network Profiles for Recommendations," Beyond Personalization, IUI'05, San Diego, CA, USA, (January 9, 2005), http://ambient.media.mit.edu/assets/_pubs/BP2005-hugo-interestmap.pdf.
[Meng et al. 2003] X. Meng, D. Hu, C. Li, "Schema-Guided Wrapper Maintenance for Web-Data Extraction," Proc. ACM Intl. Workshop on Web Information and Data Mining, (Nov. 2003), 1-8.
[Muslea et al. 1999] I. Muslea, S. Minton, C. Knoblock, "A Hierarchical Approach to Wrapper Induction," Proc. 3rd International Conference on Autonomous Agents, (1999), 190-197.
[Nanno et al. 2004] T. Nanno, T. Fujiki, Y. Suzuki, M. Okumura, "Automatically Collecting, Monitoring, and Mining Japanese Weblogs," Proc. 13th International World Wide Web Conference, New York, NY, USA, (2004), 320-321.
[Papadakis et al. 2005] N.K. Papadakis, D. Skoutas, K. Raftopoulos, T. A. Varvarigou, "STAVIES: A System for Information Extraction from Unknown Web Data Sources through Automatic Web Wrapper Generation Using Clustering Techniques," IEEE Transactions on Knowledge and Data Engineering, 17, 12, (December 2005), 1638-1652.
[Sarawagi 2002] S. Sarawagi, "Automation in Information Extraction and Integration," Proc. 28th International Conference on VLDB, (2002).
[Schuff, Turetken 2006] D. Schuff, O. Turetken, "FeedWiz: Using Automated Document Clustering to Map the Blogosphere," 4th Annual SIGDSS Pre-ICIS Workshop on Decision Support Systems, Milwaukee, WI, USA, (December 10, 2006).
[Schwartz 2003] D. G. Schwartz, "From Open IS Semantics to the Semantic Web: The Road Ahead," IEEE Intelligent Systems, 18, 3, (2003), 52-58.
[Seo et al. 2001] K. Seo, J. Yang, J. Choi, "Building Intelligent Systems for Mining Information Extraction Rules from Web Pages by Using Domain Knowledge," Proc. IEEE International Symposium on Industrial Electronics, 1, (12-16 June 2001), 322-327.
[Spath 2000] P. Spath, "Case Management: Making the Case for Information Systems," M.D. Computing, 17, 3, (2000), 40-44.
[Tijerino et al. 2005] Y. A. Tijerino, D. W. Embley, D. W. Lonsdale, Y. Ding, G. Nagy, "Towards Ontology Generation from Tables," World Wide Web, 8, 3, (Sept. 2005), 261-285.
[Walczak 2003] S. Walczak, "A Multiagent Architecture for Developing Medical Information Retrieval Agents," Journal of Medical Systems, 27, 5, (2003), 479-498.
[WordNet 2007] WordNet 3.0, "ontology," Princeton University, http://dictionary.reference.com/browse/ontology, (Sept 2007).