Type-rule-based Wrapper Generation

Youngju Son, Hasan Jamil, and Farshad Fotouhi
Department of Computer Science, Wayne State University, Detroit, MI 48202
{yson,jamil,fotouhi}@cs.wayne.edu
Keywords: Information Extraction; Type Hierarchy; Graph Grammar

Abstract. Biological data sources are useful to bioinformatics researchers, and several computational tools have been developed to make these sources as easy to use as possible. Most biological data is provided over the web. Web data is mostly represented in unstructured formats and cannot be queried using traditional query languages. Furthermore, the problems faced when integrating biological data stem from several factors, such as varying data types, presentations, and formats, so it is not easy to find the desired data across diverse data sources. Although human beings can easily understand web data that is heterogeneous and unstructured, a machine cannot figure it out by itself. For a machine to extract data from the web, it requires knowledge of both structure and content. We propose a novel architecture for automatic wrapper induction that exploits a user-supplied type system and an ontology to establish schema correspondences precisely and efficiently. In this paper, the type system helps recognize target data and improves the precision of schema matching, which is otherwise impossible without manual intervention.
1 Introduction
There is a huge amount of information available on the web, but much of it can be easily read by human beings, not machines. Applications using information extraction from semi-structured web sources have increased. These applications rely on wrappers that extract data from web sources and convert it to a structured format. Semi-structured data has no explicit grammar or schema, but implicit rules are used to identify relevant information on the web [12]. For example, even an email message has structure, including title, address, sender, and body. There is much research related to data extraction, such as semi-automatic or machine learning approaches. These systems may have source-dependent characteristics: slight changes in a page can prevent a wrapper from extracting correctly. We need to generate extraction rules that are independent of particular web pages, as automatically as possible. It is entirely possible that, in the absence of a concrete ontology and a mediator, users can still submit queries to an autonomous web query engine and expect to receive a reasonable response that is consistent with the intention of the submitted query. The responsibility of the query engine then includes comprehending the
query intent, identifying target data by developing schema correspondences, extracting data from multiple sources while resolving schema conflicts, combining the responses, and returning an answer that the user will most likely accept. In such a setup, the manual mediator is replaced by the query engine, and no pre-fabricated wrappers are used; rather, the wrappers and mediation models are generated dynamically at run time. If the cost of processing such a query is negligible, we can then support on-the-fly, or ad hoc, integration without the maintenance cost that is prohibitive for many applications. Such an integration framework also allows full autonomy of participating target sites, dynamic inclusion of information providers, and rapid adaptation to changes occurring without notice. In this paper, we present an architecture and a query interface for query-driven autonomous mediation of heterogeneous web data sources. We describe a method for schema matching based on an extensible user-supplied type system and ontology. The proposed system automatically generates data types and extracts data from the web sites a user visits. There is no fixed expectation on the types or number of web pages to be processed. We use a graph grammar for learning a set of type rules, which are used to discover the data types of a specific web source, without annotating training examples. In addition, the user's query acts as a guide during query processing. This paper is organized as follows. Section 2 discusses related work, and section 3 presents a system overview. Sections 4, 5, 6, and 7 introduce the ontology, the model for type hierarchy based schema mapping, wrapper generation, and the wrapper interpreter in detail. The implementation is described in section 8; finally, section 9 presents the conclusion.
2 Related Work
A large quantity of work has been performed in the field of data extraction and data integration from web sources. Data integration starts with data extraction from several sources through wrappers [4]. A wrapper is a specialized procedure designed for extracting specific data from web sites of interest. The result of a wrapper should be in a formal structure suitable for processing. We will discuss research related to automatic wrapper generation and compare it with our approach. Although many systems generate wrappers automatically, the generation methods depend on large sets of training documents or complicated induction rules. AutoWrapper [17] extracts table structures from unlabeled examples using the Smith-Waterman algorithm, but cannot handle table structures represented without HTML table tags, nested tables, empty tables, or even single-row tables. PickUp [13, 14, 11] extracts semi-automatically for non-table structures and automatically for table structures. Compared to AutoWrapper, PickUp is able to extract complex table structures, but cannot deal with non-HTML sources or data without repeating patterns. SoftMealy [4, 6] automatically generates extraction rules as a finite state transducer (FST). The wrapper
segments the input HTML document into tokens and then induces extraction rules for the attributes present in given training examples. SoftMealy uses delimiter-based extraction patterns and is limited to fairly structured data. RoadRunner [16] generates a wrapper for a set of HTML pages by inferring an HTML tag-based regular grammar for the documents. It works by comparing two HTML pages at a time, a sample page and a wrapper page. All extraction processes are based on the study of similarities and dissimilarities between these pages, and mismatches are used to identify relevant structures. However, RoadRunner requires both HTML tags and two training documents, and cannot handle non-HTML sources. The system proposed by Embley et al. [8, 7] (called BYU) is based on an ontology approach to data extraction. In order to construct a wrapper, BYU requires an ontology designed manually by experts. Once the ontology is supplied, the system can map and consolidate records from multiple sources to the ontological schema. As a major limitation, if the sites involved in an application have wide variations in ontological structure, the BYU tool will probably extract much less information. Table 1 presents a summary of the characteristics of each system mentioned so far.

Tools            Automation      Non-HTML support  Methodology
AutoWrapper      Automatic       None              Smith-Waterman
PickUp           Automatic       None              Hierarchical repeated structure recognition
BYU              Manual          Full              Ontology
SoftMealy        Semi-automatic  Partial           Finite state transducer
RoadRunner       Automatic       None              HTML-tag based regular grammar
Proposed system  Automatic       Full              Graph grammar

Table 1. Summary of characteristics
3 System Overview
There are several modules, including the interface, the type learner, and the wrapper interpreter, as shown in figure 1. The interface module forms a user query, submits it, and receives the corresponding result pages. The type learner is the main module that generates type rules using base types provided by the user. This module analyzes the contents of the HTML page passed from the interface, and generates type rules by referring to knowledge in the ontology. A wrapper is constructed on the basis of the generated type rules for each web page. The wrapper interpreter module activates the learned wrappers to extract specific parts of the information in the web pages. The system has been designed to deal with web pages as target documents. In general, a web page is a text-based document. Elements in the web page, such as numbers or dates, are parsed as text. If these elements can be categorized by
Fig. 1. The overall system architecture
specific types, users can extract their desired data easily. To categorize elements in terms of types, we propose a type hierarchy structure. Initially, pre-defined base types must be defined by the developer and are stored in the ontology. Base types are simple valued data, such as string and integer, and they are used for generating the extension and composite extension type rules, which are extended versions of base types. Figure 2 shows a hypothetical type hierarchy in the genomic area. We define the node "String" as the root node because of the text-based nature of HTML documents. For example, even when a number is used in a web document, the machine sees it as text data during parsing. This type system does not rely on HTML tag structure and is not sensitive to changes in web format. After defining types for a particular domain, any document in the domain can be targeted.
Fig. 2. A hypothetical type hierarchy in the genomic area
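To make the hierarchy concrete, the following is a minimal sketch of such a user-supplied type hierarchy in code, with "String" as the root as in figure 2; the node names below the root and the subtype test are illustrative assumptions, not the system's actual data structures.

```python
# A minimal sketch of a user-supplied type hierarchy rooted at "String",
# mirroring figure 2. Node names below the root are assumptions.

class TypeNode:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def is_subtype_of(self, other):
        # A node is a subtype of itself and of every ancestor.
        node = self
        while node is not None:
            if node is other:
                return True
            node = node.parent
        return False

string_t = TypeNode("String")             # root: everything parses as text
number_t = TypeNode("Number", string_t)
dna_seq = TypeNode("DNA-Sequence", string_t)

print(dna_seq.is_subtype_of(string_t))    # True
```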
4 Ontology
Ontology usually describes terms, concepts, and mapping information for a specific application domain. It plays an important role in communicating among documents in a given domain. We define an ontology consisting of a set of wrappers, type rules, and a term mapping set in an XML based scheme. An example ontology is shown in table 2. The ontology is enclosed within the <ontology> and </ontology> tags. The <types> structure contains <type> structures that specify a number of types, each consisting of a type name, rule, and source. The type name is used in user query statements, the rule is represented in context free grammar format, and the source gives the target web source name. The <wrappers> .. </wrappers> structure specifies a number of wrappers, each consisting of a source, URL, and elements. The source is the same as the source in the <types> structure. Each element has a name corresponding to an attribute used in a target source, and the URL specifies the location of the web source. The <terms> structure
<ontology>
  <types>
    <type>
      <name>DNA-SEQ</name>
      <rule>[ACGT](2,)</rule>
      <source>GENBANK</source>
    </type>
  </types>
  <wrappers>
    <wrapper>
      <source>GENBANK</source>
      <URL>WWW.GENBANK.ORG</URL>
      <element name="ORIGIN" type="DNA-SEQ"/>
    </wrapper>
  </wrappers>
  <terms>
    <term>
      <name>SEQ</name>
      <synonym>ORIGIN</synonym>
      <destination>GENBANK</destination>
    </term>
  </terms>
</ontology>

Table 2. The example of ontology
specifies a number of terms, each consisting of a term name, its several synonyms, and the destination source in which each synonym is actually used. The <terms> structure is therefore used to connect the variables used in a query statement to the sources. The ontology is developed by a domain expert.
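Since the ontology is an XML document, it can be loaded with a standard XML parser. The sketch below reads the table 2 example (using the tag names reconstructed above, which are an assumption) and indexes the type rules so the wrapper interpreter can look them up by name and source.

```python
# Sketch: loading the table 2 ontology and indexing its type rules.
# The tag names follow the reconstruction above and are assumptions.
import xml.etree.ElementTree as ET

ONTOLOGY = """
<ontology>
  <types>
    <type>
      <name>DNA-SEQ</name><rule>[ACGT](2,)</rule><source>GENBANK</source>
    </type>
  </types>
  <wrappers>
    <wrapper>
      <source>GENBANK</source><URL>WWW.GENBANK.ORG</URL>
      <element name="ORIGIN" type="DNA-SEQ"/>
    </wrapper>
  </wrappers>
  <terms>
    <term>
      <name>SEQ</name><synonym>ORIGIN</synonym><destination>GENBANK</destination>
    </term>
  </terms>
</ontology>
"""

root = ET.fromstring(ONTOLOGY)
# Index type rules by (name, source) for lookup during interpretation.
rules = {(t.findtext("name"), t.findtext("source")): t.findtext("rule")
         for t in root.iter("type")}
print(rules[("DNA-SEQ", "GENBANK")])   # [ACGT](2,)
```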
5 A Model for Type Hierarchy based Schema Mapping
The mediation model assumes that the query data and the target data are structured and typed according to a context dependent type system. The context is captured in the form of a user supplied type hierarchy, which allows changes in the type hierarchy to alter the interpretation of query variables and thereby the meaning of the query. While the presentation of the target data can be unstructured or semi-structured, the data is generated from a structured data source with a precise presentation format, and structural regularities, no matter how complex, are recognizable in automatic ways. This principle has been utilized successfully in many autonomous wrapper induction tools, and we adopt this assumption in our system too.

5.1 Query Scheme
Let Q be a query of the form

    extract v1, ..., vn
    from [index|format] URL
    where c1 and/or ... and/or ck
    types x1(t1), ..., xm(tm) as in T

where URL is a web page location, and each ci is a boolean expression involving variables ui and constants wi. Let X = ∪{vi} ∪ ∪{uj} = ∪{xk} and T = ∪{tk}, where X is the set of all query variables and T is the set of all types corresponding to the query variables. Let 𝒯 be the type hierarchy associated with query Q. Then the query scheme of Q, denoted SQ, is the pair ⟨X, T⟩. The following SQL-like example returns two items (SOURCE and DNA-SEQ), which correspond to the types t1 and t2, where the value of ACCNO is matched to "AB000263" on the GenBank web site. We assume that the two types t1 and t2 are already defined.

    extract SOURCE, DNA-SEQ
    from GenBank
    where ACCNO = "AB000263"
    types SOURCE(t1), DNA-SEQ(t2) as in T
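A sketch of how the clauses of such a query might be pulled apart is shown below; the clause keywords come from the grammar above, while the regular expression itself is an illustrative assumption.

```python
# Sketch: splitting an extract/from/where/types query into its clauses.
import re

QUERY = ('extract SOURCE, DNA-SEQ from GenBank '
         'where ACCNO = "AB000263" '
         'types SOURCE(t1), DNA-SEQ(t2) as in T')

m = re.match(r'extract\s+(?P<vars>.+?)\s+from\s+(?P<src>\S+)'
             r'(?:\s+where\s+(?P<cond>.+?))?'
             r'\s+types\s+(?P<types>.+?)\s+as in\s+(?P<hier>\S+)', QUERY)

variables = [v.strip() for v in m.group('vars').split(',')]
# Pair each query variable with its declared type x_i(t_i).
typing = dict(re.findall(r'([\w-]+)\((\w+)\)', m.group('types')))
print(variables)   # ['SOURCE', 'DNA-SEQ']
print(typing)      # {'SOURCE': 't1', 'DNA-SEQ': 't2'}
```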
5.2 Type Scheme
Let t be a given type, and {t1, ..., tn} a set of arbitrary types in 𝒯. We say t′ is an extension of t if t is a distinct element of an exact tandem repeating sequence of types involving types in {t1, ..., tn}, and t′ is maximal (an interspersed repeat). t is then called a base type. Conversely, a generalization of t′ into t is an order preserving sequence involving the elements in t′ that are identical to t. Let t1 and t2 be two types in 𝒯. A type t2 is called a composite extension of t1 if t2 is a sequence of the form t2′ t t2″ such that t2′ and t2″ are base types, and t is either a base type or an extension of some base type t′. t is called the prime component of t2. A generalization of a composite extension t2 into t1 is the prime component t of t2. A type hierarchy 𝒯 is called a type system if the elements in 𝒯 are related via conventional sub-type, extension, and composite extension type relationships only.
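As a small illustration of the extension relationship, the sketch below scans a token sequence for an exact tandem repeat of a fixed unit containing a base type t; the unit-search strategy and the list representation are assumptions made for the example.

```python
# Sketch: finding an extension t' of a base type t as a maximal exact
# tandem repeat in a token sequence. The fixed unit length is an
# assumption for illustration.

def find_extension(tokens, t, unit_len=2):
    """Return (start, end) of the maximal tandem repeat of a unit
    containing base type t, or None if t is not repeated."""
    for i in range(len(tokens) - unit_len + 1):
        unit = tokens[i:i + unit_len]
        if t not in unit:
            continue
        end = i + unit_len
        while tokens[end:end + unit_len] == unit:   # extend the repeat
            end += unit_len
        if end - i > unit_len:                      # occurs at least twice
            return (i, end)
    return None

# "B t1" repeats five times, so t1 has an extension (B t1)+ here.
tokens = ["N", "B", "t1", "B", "t1", "B", "t1", "B", "t1", "B", "t1"]
print(find_extension(tokens, "t1"))   # (1, 11)
```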
5.3 Source and Candidate Scheme
Let w be a web document, τ a set of terms in w, and µ a partial mapping (type assignment) from terms in τ to types in 𝒯. Then a source scheme S is the set of all type assignments of the terms in w with respect to 𝒯, denoted S = {⟨wi, ti⟩} such that wi ∈ τ and ti ∈ 𝒯. A source scheme S is called a candidate scheme CQ with respect to a query Q when CQ ⊆ S and it is so chosen that all the ti's in CQ are in T.
5.4 Target Scheme
Let θ be a set of term mappings, Q a query, and W another set of terms from a web page. Let Ψ be a similarity mapping of terms in Q and W, with respect to the mapping θ, to a number between 0 and 1, i.e., Ψ : Q × W → [0,1], such that the types of q and w match, and Ψ(q,w) = 1 when q = w or there exists a mapping from q to w in θ, and otherwise Ψ(q,w) = v, 0 ≤ v < 1, where v is computed using algorithm similarity.
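A sketch of such a similarity mapping is given below; the use of the ontology's term mappings for exact matches, and difflib's ratio as the fallback measure in [0,1), are assumptions, since the paper's algorithm similarity is not reproduced here.

```python
# Sketch of the similarity mapping Psi: identical terms and terms
# mapped in theta score 1; anything else gets a string-similarity
# score strictly below 1. difflib is an assumed stand-in for the
# paper's algorithm "similarity".
from difflib import SequenceMatcher

THETA = {("SEQ", "ORIGIN")}   # term mappings from the ontology

def psi(q, w):
    if q == w or (q, w) in THETA:
        return 1.0
    v = SequenceMatcher(None, q.lower(), w.lower()).ratio()
    return min(v, 0.99)       # keep non-identical matches below 1

print(psi("SEQ", "ORIGIN"))       # 1.0 (mapped in theta)
print(psi("ACCNO", "ACCESSION"))  # a value in [0, 1)
```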
6 Wrapper Generation
Information extraction is the process of identifying the particular fragments of a document that constitute its core semantic content [3]. An information extraction system needs a wrapper, based on extraction rules tailored to a specific data source, for gathering relevant data. Developing wrappers manually [9] is time-consuming and error-prone because of the difficulty of writing and maintaining them. Many information extraction tools have been proposed, using HTML structure analysis and machine learning approaches. The HTML structure-based tools [15, 5, 18] are source-dependent, and wrapper induction [6, 10] has to construct the wrapper from a set of examples. After generation, the learned wrapper is used to extract the contents of pages. In this section, we introduce the technique employed for generating type rules from web pages in order to construct wrappers.
6.1 Type Learner
The type learner automatically generates type rules for the elements in a given web page. There are two submodules: 1) converting data and 2) generating type rules. The input page is converted into a token string after cleaning the dummy-data within the web page, and then the type rules are generated by applying a graph grammar to the token string.

Converting Data. Web sites usually contain unnecessary parts such as HTML delimiter tags. Although this information may be useful for user browsing, we consider it dummy-data that complicates the data extraction process. We need a data cleaning process to discard the dummy-data. The system then identifies the specific part containing the data of interest to the user. The identified core part is represented by pre-defined base types and tokens; this is called tokenizing. Table 3 shows an example conversion from input HTML data to a token string. Assume that we have a web page represented in HTML whose sequence data starts with the string "ORIGIN" and ends with the symbol "//". In this example, the base types provided by the user are as follows: 1) STR for string, 2) N for number, 3) B for space, 4) SYM for symbol, and 5) t1 for a base DNA sequence. The input data goes through a cleanup phase to remove dummy-data, filtering out the core contents. The tokenizing phase then converts the filtered data into a token string.

Generating Type Rules. A context free graph grammar is a set of grammar production rules that describe a graph-based database. Although textual grammars are useful, they are limited in describing databases that contain DNA sequences. Sequences can be viewed as linear directed graphs. We decided to use a graph grammar [2, 1] because most of our target documents, being bioinformatics databases, include information related to many sequences. A GenBank sequence is represented by numbers, sequences, and blanks. A number is located at the beginning of each sequence line and is followed by pairs of blank and t1 (the base type of the DNA sequence). The second sequence line has the same pattern except for the number of blank-t1 pairs. Figure 3 shows a graph for the token sequence corresponding to the whole raw DNA sequence of GenBank. There are several repeated patterns in the graph shown in figure 3, so the graph needs to be simplified. Figure 4 presents the simplified graph: in comparison with figure 3, which is drawn sequentially, figure 4 is drawn with only the distinct tokens. The pseudocode shown in table 4 describes how to create graph grammar rules from the target data. A token string enters as input, and each token is labeled to indicate whether it is a start, an end, or both. For every distinct token in the token string, its next tokens are found, and the graph grammar is built using this data structure. Let us assume that there are n distinct tokens in a token string. The procedure is as follows: 1) label the tokens as start node, end node, or both, 2) read the first token (ti, i=1) from the token set T, 3) find its next tokens
1) INPUT DATA
ORIGIN 1 AAAGG GTTTT GGGTT TTTTT GGCCC 26 GGTTG GGGG //

2) CLEANED DATA
ORIGIN 1 AAAGG GTTTT GGGTT TTTTT GGCCC 26 GGTTG GGGG //

3) TOKENIZED STRING
STR N B t1 B t1 B t1 B t1 B t1 N B t1 B t1 SYM

Table 3. The example for data converting
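The tokenizing phase can be sketched as a small regular-expression scanner over the base types of table 3; the character classes chosen for each base type are assumptions, and this naive scanner also emits the blanks that table 3 elides in a few places.

```python
# Sketch of tokenizing with the base types of table 3. t1 is tried
# before STR so DNA fragments are not swallowed as plain strings;
# the character classes themselves are assumptions.
import re

BASE_TYPES = [
    ("t1",  r"[ACGT]{2,}"),   # base DNA sequence: two or more of A,C,G,T
    ("N",   r"\d+"),          # number
    ("STR", r"[A-Za-z]+"),    # string
    ("B",   r"[ \t]+"),       # blank
    ("SYM", r"//|\S"),        # symbol, e.g. the terminating //
]
TOKEN_RE = re.compile("|".join(f"(?P<{n}>{p})" for n, p in BASE_TYPES))

def tokenize(text):
    return [m.lastgroup for m in TOKEN_RE.finditer(text)]

cleaned = "ORIGIN 1 AAAGG GTTTT GGGTT TTTTT GGCCC 26 GGTTG GGGG //"
print(" ".join(tokenize(cleaned)))
# STR B N B t1 B t1 B t1 B t1 B t1 B N B t1 B t1 B SYM
```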
Function Graph-Grammar-Creation
Input: token string S
Output: graph grammar G
Begin
    For token string S
        Label tokens as S, E, or Both;
    End
    For every distinct token in token string
        Find next tokens;            // let us call it N
        Build a graph grammar for N;
    End
End

Table 4. Pseudo code for graph grammar creation
Fig. 3. GenBank DNA sequence graph
Fig. 4. Simplified GenBank DNA sequence graph
from T, 4) increase i by 1, and 5) repeat from step 2 until arriving at the end of the tokens (tn). Table 5 shows the data structure generated by applying algorithm Graph-Grammar-Creation to figure 4. There are three tokens. The token "N" is a start node and is followed by the token "B". The token "B", which is followed by t1, is an end token. The other end token, "t1", has two next tokens ("B" and "N").
i  token  label  next
1  N      S      B
2  B      E      t1
3  t1     E      B, N

Table 5. The example for applying the algorithm
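A runnable version of table 4 is sketched below; the token string is an assumed fragment of the simplified GenBank graph of figure 4, and the printout approximates table 5 (the labels depend on exactly where the token string starts and ends).

```python
# Sketch of Graph-Grammar-Creation (table 4): label each distinct
# token as start/end and collect its distinct successor tokens.

def graph_grammar(tokens):
    info = {}
    for i, tok in enumerate(tokens):
        entry = info.setdefault(tok, {"label": set(), "next": []})
        if i == 0:
            entry["label"].add("S")                # start node
        if i == len(tokens) - 1:
            entry["label"].add("E")                # end node
        if i + 1 < len(tokens) and tokens[i + 1] not in entry["next"]:
            entry["next"].append(tokens[i + 1])    # distinct successors
    return info

# An assumed token string from the simplified graph of figure 4.
tokens = ["N", "B", "t1", "B", "t1", "N", "B", "t1", "B", "t1"]
for tok, e in graph_grammar(tokens).items():
    print(tok, "".join(sorted(e["label"])) or "-", e["next"])
# N S ['B']
# B - ['t1']
# t1 E ['B', 'N']
```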
6.2 Constructing Wrapper
As shown in table 6, a wrapper describes a source as well as the names and types of the elements in the page. This example represents only one element (Seq) within the source (GenBank). When a user issues a query for a specific web page, the <wrappers> .. </wrappers> part is checked to see whether a wrapper for this page is defined in the ontology. If it is defined, the actual web page is fetched using the information in <URL> .. </URL>. The elements in the web page are described within <element> .. </element>. A generated wrapper is stored in the wrapper database in the ontology. Later, when the page is requested again, we directly retrieve its wrapper from the wrapper database without running the wrapper generation again.
<wrapper>
  <source>GenBank</source>
  <URL>GenBank URL</URL>
  <element name="Seq" type="t2"/>
  ........ omitted .....
</wrapper>

Table 6. The partial wrapper for GenBank
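The caching behavior described above can be sketched as a simple lookup table keyed by source; the function names here are assumptions.

```python
# Sketch: reuse a stored wrapper when one exists for the source,
# otherwise run the type learner once and cache the result.

wrapper_db = {}   # source name -> wrapper, persisted in the ontology

def get_wrapper(source, learn_wrapper):
    if source not in wrapper_db:
        wrapper_db[source] = learn_wrapper(source)   # one-time generation
    return wrapper_db[source]

# First call runs the (stubbed) type learner; later calls hit the cache.
w = get_wrapper("GenBank", lambda s: {"source": s, "elements": {"Seq": "t2"}})
print(get_wrapper("GenBank", None) is w)   # True: retrieved, not regenerated
```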
7 Wrapper Interpreter
The wrapper interpreter parses the wrappers for the searched pages. The procedure for information extraction is depicted in figure 5.
Fig. 5. The procedures for extracting information
Assume that the type rules, wrappers, and term mapping set for a specific domain have already been generated. When a user requests a query for a web page already defined in the ontology, the HTML web page is fetched using the URL defined in the wrapper. Both the "from" part and the variables used in the user query are used to look up the ontology. The interpreter parses the wrappers corresponding to the user request through the following steps:

1. Fetch the web page corresponding to the URL in the user query statement
2. Let p be the actual web page
3. Find a wrapper of p from <wrappers>
4. Let w be the wrapper, if found
5. Let s be the value of <source>, and E the set of <element>s in w
6. Find the term names matched to query variables from <terms>
7. Let T be these term names
8. For each t ∈ T do
   (a) Find the <term> whose <destination> value is equal to the value of s
   (b) Look for e ∈ E having the same element name as the <synonym>
   (c) Find a type rule for the <type> of e from <types>
   (d) Apply the type rule to p, obtaining data d
   (e) Add d to the result if d is the matched data
9. Return the result to the user
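The steps above can be condensed into a short sketch; the ontology is reduced to plain dictionaries and the type rule is applied as a regular expression, all of which are assumptions made for illustration.

```python
# Sketch of the interpreter loop (steps 1-9) over a dictionary-based
# ontology; structure and names are assumptions.
import re

ontology = {
    "wrappers": {"GenBank": {"source": "GENBANK",
                             "elements": {"ORIGIN": "DNA-SEQ"}}},
    "terms":    [{"name": "SEQ", "synonym": "ORIGIN",
                  "destination": "GENBANK"}],
    "types":    {"DNA-SEQ": r"[ACGT]{2,}"},
}

def interpret(page_text, page_name, query_vars):
    w = ontology["wrappers"][page_name]               # steps 1-4
    s, elements = w["source"], w["elements"]          # step 5
    terms = [t for t in ontology["terms"]
             if t["name"] in query_vars]              # steps 6-7
    result = []
    for t in terms:                                   # step 8
        if t["destination"] != s:                     # 8(a)
            continue
        type_name = elements.get(t["synonym"])        # 8(b)
        rule = ontology["types"][type_name]           # 8(c)
        result += re.findall(rule, page_text)         # 8(d), 8(e)
    return result                                     # step 9

print(interpret("ORIGIN 1 AAAGG GTTTT //", "GenBank", ["SEQ"]))
# ['AAAGG', 'GTTTT']
```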
8 Implementation
The proposed system produces correct wrappers using type rules, and the wrapper interpreter is able to extract all of the available data from the sites. We have tested several web sites having two different formats, plus 50 randomly generated documents. Type rules for DNA sequences are generated because the test documents include DNA sequences. The wrapper constructed using the type rules extracts the data fields from the target document. The result is displayed in different colors for different types; this coloring allows the user to distinguish which types are applied to the test data. The following screen shots show the results extracted from several DNA sequences. On the result window, the left pane is separated into two parts: the top part for the user query and the bottom part for the result. The right pane displays the content of the web page. We assume that the base type of DNA sequence, called t1, has been pre-defined so that t1 consists of at least two of the characters 'A', 'C', 'G', or 'T'. If a user enters the GenBank web site as an input page and wants the DNA sequence defined by type t1, the system chooses all data matching type t1. Naturally, the result for type t1 includes many false positives, because t1 is such a generic type rule. So the system automatically generates extended type rules and extracts more reliable, concrete data using them. The composite extension is created by combining both the "before token" and the "after token" of a sequence that satisfies the extension rule; the "before token" and "after token" are used as start and end tokens. Table 7 lists the before and after tokens of several DNA sequence formats. According to table 7, the GenBank format starts with the token "ORIGIN", which is followed by the DNA sequence, and ends with "//". The EMBL format begins with a semicolon (;) and ends with "//", the GCG format has just a start token without an end token, and the IG format ends with the number 1 or 2 without a start symbol.
8.1 GenBank
The GenBank format starts with a line containing the word “LOCUS” and a number of annotation lines. The start of the sequence is marked by a line containing “ORIGIN” and the end of the sequence is marked by two slashes “//”.
Format        GenBank  EMBL  GCG  IG
Start tokens  ORIGIN   ;     ..   -
End tokens    //       //    -    1 or 2

Table 7. Gene sequence extraction delimiters of different formats
We have tested a GenBank format document as a target with the following user type: "(N[1,3]B[1,10](t1[2,10]B[1,10]){1,6}){1,20}". A user type is a combination of tokens and their ranges, which are surrounded by square brackets. Each bracket contains two numbers, the minimum and maximum conditions to be satisfied. As with single tokens, a set of tokens can be bound using round brackets and ranged by the same method. If a token is used without a range, [1,1] for a single token or {1,1} for a set of tokens is used as the default. Figure 6 shows a screen shot for the type "(N[1,3]B[1,10](t1[2,10]B[1,10]){1,6}){1,20}" applied to a GenBank web page. The highlighted part of the right window is used in the graph grammar, and both extension and composite extension rules are generated. When applying a composite extension rule to a target document, the target data corresponding to each type of the rule is rendered in a type-dependent color scheme. In our case, green is for the base sequence (t1), yellow for blank (B), blue for number (N), and brown for the start and end symbols. The left window of figure 6 shows the result of representing each type in a different color.
Fig. 6. The result of GenBank format data
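The bracketed ranges map naturally onto regular-expression quantifiers, so a user type can be compiled into a pattern and matched against the cleaned page text. The sketch below performs this translation; treating type rules as regular expressions, and the character classes for the base types, are assumptions carried over from the tokenizer sketch.

```python
# Sketch: translating a user type such as
# "(N[1,3]B[1,10](t1[2,10]B[1,10]){1,6}){1,20}" into a regex.
# [m,n] on a token becomes {m,n} on its (assumed) character class;
# '(' ')' and set ranges {m,n} are already regex syntax.
import re

CLASSES = {"t1": "[ACGT]", "N": r"\d", "B": r"[ \t]", "STR": "[A-Za-z]"}

def type_to_regex(spec):
    names = sorted(CLASSES, key=len, reverse=True)   # match "t1" before "N"
    pattern, i = "", 0
    while i < len(spec):
        for name in names:
            if spec.startswith(name, i):
                pattern += CLASSES[name]
                i += len(name)
                break
        else:
            pattern += {"[": "{", "]": "}"}.get(spec[i], spec[i])
            i += 1
    return pattern

pattern = type_to_regex("(N[1,3]B[1,10](t1[2,10]B[1,10]){1,6}){1,20}")
print(pattern)
# (\d{1,3}[ \t]{1,10}([ACGT]{2,10}[ \t]{1,10}){1,6}){1,20}
print(bool(re.fullmatch(pattern, "1 AAAGG GTTTT ")))   # True
```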
8.2 EMBL
A sequence file in EMBL format can contain several sequences. One sequence entry starts with an identifier line (“ID”), followed by further annotation lines. The start of the sequence is marked by a line starting with “SQ” and the end
of the sequence is marked by two slashes "//". Figure 7 shows the screen shot for the type "((B[1,2]t1[2,10]){1,6}B[1,2]N[1,3]){1,20}" on EMBL format data. The user type reads as follows: 1) one or two blanks, followed by a basic DNA sequence (t1) of length at least two and at most ten; 2) this set of blank and t1 is repeated at least once and at most six times; 3) one or two blanks and a number of at most three digits must follow; and 4) steps 1) to 3) may be repeated up to 20 times.
Fig. 7. The result of EMBL format data
8.3 Sequence Data by Random Generator
In the case of the previous screen shots, the target sequences are well known data. However, we need randomly generated test data for a more reliable evaluation. Figure 8 shows the screen shot for the types "(t1[2,7]B){1,5}" and "(B[1,10]t1[2,400]){1,40}" with 50 randomly generated sequences for each type. Each sequence is shown on the right window with a serial number given by the random generator. There are two buttons, "Random-Gen" and "Query-Random-Gen". If a user wants to generate random sequences without considering user types, the user clicks the first button; if the user wants to generate test data on the basis of user types, the user selects the second button. Figure 8 shows the data matched with the given type from the randomly generated sequences, with serial numbers.
9 Conclusion
We implemented a type system which automatically generates type rules for web documents. Our system is based on the following two parts: 1) the type rule generation, and 2) the ontology embedding. There are three attributes in the
Fig. 8. The result of randomly generated data
ontology: sets of wrappers, sets of mappings, and sets of types. We will extend the system to process several types from multiple documents, because the current system deals with a single type and one web document per user query. Finally, we will strengthen the system so that the integration of disparate sources can be performed by processing multiple types and multiple web sites.
References

1. L. B. Holder, I. Jonyer, and D. Cook. Concept formation using graph grammar. Proc. ACM SIGMOD, 2003.
2. L. B. Holder, I. Jonyer, and D. Cook. MDL-based context-free graph grammar induction. Proc. ACM SIGMOD, 2003.
3. N. Kushmerick. Gleaning the web. IEEE Intelligent Systems, 14(2):20–22, 1999.
4. A. Laender and B. Ribeiro-Neto. Brief survey of web data extraction tools. ACM SIGMOD Record, 31(2), 2002.
5. A. Sahuguet and F. Azavant. Building intelligent web applications using lightweight wrappers. 2000.
6. C. Hsu and M. Dung. Generating finite-state transducers for semi-structured data extraction from the web. AAAI, 1998.
7. Y. Jiang, D. Embley, and Y. Ng. Record-boundary discovery in web documents. Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 467–478, 1999.
8. Y. Jiang, S. Liddle, Y. Kai, D. Embley, E. Campbell, and R. Smith. Conceptual-model-based data extraction from multiple-record web documents. Data and Knowledge Engineering, 1999.
9. D. Quass, A. Rajaraman, Y. Sagiv, J. Ullman, J. Widom, H. Garcia-Molina, Y. Papakonstantinou, and V. Vassalos. The TSIMMIS approach to mediation: data models. Journal of Intelligent Information Systems, 8(2):117–132, 1997.
10. I. Muslea. Extraction patterns for information extraction tasks: a survey. Machine Learning for Information Extraction, 1999.
11. Y. Lu, L. Chen, Y. Liao, J. Han, H. M. Jamil, and J. Pei. DNA-Miner: a system prototype for mining DNA sequences. Proc. ACM SIGMOD, 2001.
12. K. Lerman, S. N. Minton, and C. A. Knoblock. Wrapper maintenance: a machine learning approach. Journal of Artificial Intelligence Research, 18:149–181, 2003.
13. L. Chen. On ad hoc integration and querying of heterogeneous online distributed databases. Ph.D. thesis, Computer Science, Mississippi State University, 2004.
14. L. Chen and H. M. Jamil. On using remote user defined functions as wrappers for biological database interoperability. IJCIS, 12(2):161–195, 2003.
15. L. Liu, C. Pu, and W. Han. XWRAP: an XML-enabled wrapper construction system for web information sources. ICDE, 2000.
16. V. Crescenzi, G. Mecca, and P. Merialdo. RoadRunner: towards automatic data extraction from large web sites. The VLDB Journal, pages 109–118, 2001.
17. X. Gao and L. Sterling. AutoWrapper: automatic wrapper generation for multiple online services. Asia Pacific Web Conference, 1999.
18. Y. Son, D. S. Hwang, and F. Fotouhi. Case study: development of an organism-specific protein interaction database and its associated tools. International Journal of Cooperative Information Systems, 2002.