On the Complexity of Schema Inference from Web Pages in the Presence of Nullable Data Attributes

Guizhen Yang
Department of Computer Science and Engineering, University at Buffalo, Buffalo, NY 14260, USA
[email protected]

I. V. Ramakrishnan
Department of Computer Science, Stony Brook University, Stony Brook, NY 11794, USA
[email protected]

Michael Kifer
Department of Computer Science, Stony Brook University, Stony Brook, NY 11794, USA
[email protected]
ABSTRACT
An increasingly large number of Web pages are machine-generated by filling in templates with data stored in backend databases. These templates can be viewed as the implicit schemas of those Web pages. The ability to infer the implicit schema from a collection of Web pages is important for scalable data extraction, since the inferred schema can be used to automatically identify schema attributes that are "encoded" in Web pages. However, the task of inferring a "good" schema is complicated by the existence of nullable (missing) data attributes. Usually, if an attribute contains a null value, then it is omitted in the generated Web page, giving rise to different variations and permutations of layout structures in Web pages generated from the same template. In this paper we investigate the complexity of schema inference from Web pages in the presence of nullable data attributes. We introduce the notion of unambiguity as a quality measure for inferred schemas and prove that the problem of inferring "good" (unambiguous) schemas is NP-complete. Our complexity results imply that ambiguity resolution is one of the root causes of the computational difficulty underlying schema inference from Web pages.
Categories and Subject Descriptors: F.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems

General Terms: Theory

Keywords: Data Extraction, Web Mining, Wrapper Induction, Schema Inference, Data Mining, Machine Learning, World Wide Web

1. INTRODUCTION

A growing number of Web sites are maintained using content management software, and an increasingly large number of Web pages are machine-generated from backend databases. Such Web pages normally possess an implicit, fixed schema, which underlies the templates used to generate them. By and large, the process of generating Web pages can be viewed as instantiating the implicit schema with the attribute data stored in backend databases. Knowing the implicit schema can help uncover the attribute data "encoded" in Web pages and thus automate the process of data extraction.

Along with learning-based data extraction techniques [21, 16, 6, 20, 4, 7], schema inference approaches to data extraction are becoming important, since they exhibit a high degree of automation and scalability. Interesting work on inferring the schema of a collection of template-driven Web pages has recently been reported in [13, 8, 3]. Grumbach and Mecca [13] first introduced the schema inference problem and established several complexity results, assuming that the collection of Web pages satisfies certain properties such as data richness. In [8] the inferred schema is represented as a union-free regular expression with optionals (i.e., the "?" operator), but the sophisticated algorithm proposed there for discovering a "good" schema can suffer from exponential blowup. More recently, Arasu and Garcia-Molina [3] described efficient heuristics for the schema inference problem.

However, the above proposals all assume that the Web pages from which the schema is to be inferred are of "good quality". In practice the schema inference problem is complicated by all kinds of variations in Web pages. One such complication can be traced to nullable data attributes. In a typical template, if an attribute's value is null, then the attribute is omitted in the generated Web page. This omission gives rise to a multitude of possible variations in the layout structures of the Web pages generated from the same template (see Section 2 for a motivating example). Therefore, in the presence of nullable data attributes the inferred schema must be able to accommodate variations in Web pages. Moreover, when used to extract attribute values, the inferred schema should also be able to disambiguate occurrences of different attributes. Such a property is highly desirable, since an unambiguous schema can boost the precision of the extracted data.

The quality of inferred schemas has been a relatively unexplored subject in the literature. In this paper we introduce the notion of unambiguity as a quality measure for inferred schemas. Intuitively, the notion of a good schema corresponds to our notion of an unambiguous schema. We formally study the complexity of schema inference from Web pages in the presence of nullable data attributes. In particular, we prove that the problem of inferring unambiguous schemas (represented, as in [13, 8], by union-free regular expressions) is NP-complete. Our results point to ambiguity resolution as one of the root causes of the computational difficulty underlying schema inference from Web pages.

The rest of this paper is organized as follows. Section 2 presents a motivating example to illustrate the significance of unambiguous schema inference. In Section 3 we formalize the problem of unambiguous schema inference and present our complexity results. Section 4 discusses related work and Section 5 concludes the paper.
[Figure 1: Web Pages Listing Product Information. Three machine-generated product pages, (a)-(c), listing items such as "OLFA SCS-1 / Utility Scissor, 5 Inches, Stainless Steel / $19.99" and "FISKARS 99247097 / Bent Trimmer, Cut Length 3 3/4 Inches, Overall Length 8 1/8 Inches".]

[Figure 2: HTML Documents. Panels (a)-(c) show the HTML sources of the Web pages in Figures 1(a)-(c).]
2. A MOTIVATING EXAMPLE

Web pages of data-intensive Web sites are usually generated automatically from backend databases. Consider the three Web pages shown in Figure 1. They list product description and pricing information about scissors, and each Web page corresponds to a database record about a scissor item. In our example the schema of the hidden relational database can be viewed as the triple

⟨ModelNumber, Description, Price⟩

where ModelNumber is a key attribute containing the manufacturer's model number, Description is a short text about the product, and Price is the unit sales price.
However, observe that while Figure 1(a) lists all three attributes of a scissor item, the Price attribute is missing in Figure 1(b) and Description is missing in Figure 1(c). Normally, if an attribute contains a null value, then it is completely omitted in the generated HTML document. (By contrast, in the problem settings of [13] some form of markup encoding is always generated for a null data value; this difference in the handling of nullable data attributes is what makes the schema inference problems of [13] tractable. See Section 4 for a detailed account of related work.) To reflect the fact that the Description and Price attributes are nullable, we represent the implicit database schema as

⟨ModelNumber, Description?, Price?⟩

where a question mark ("?") following an attribute name denotes that the attribute is nullable.

Despite the fact that some attributes may be missing from the generated Web pages, there is still a high degree of consistency in the presentation style used to encode the different attributes of a database record. Note that in all three Web pages of Figure 1: (i) the manufacturer's model number is always followed by the product description (if present), followed by the unit sales price (if present); (ii) the manufacturer's model number is always marked by a hyperlink, while the product description is presented in bold, italic font and the unit sales price in bold font. This consistency can be verified in the HTML documents shown in Figures 2(a), 2(b), and 2(c), which correspond to the Web pages shown in Figures 1(a), 1(b), and 1(c), respectively. (The two tag sequences <b><i>...</i></b> in Figure 2(a) and <i><b>...</b></i> in Figure 2(b) render the same bold, italic font in a Web browser.)
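To make the omission of null attributes concrete, the following is a minimal sketch of template instantiation. The template markup loosely follows Figures 1 and 2, and the third model number is made up; both are assumptions for illustration, not the paper's actual template.

```python
from typing import Optional

def render(model: str, description: Optional[str], price: Optional[str]) -> str:
    """Instantiate a (hypothetical) page template; null attributes are omitted."""
    items = [f"<li><a>{model}</a></li>"]
    if description is not None:  # a null Description leaves no trace in the page
        items.append(f"<li><b><i>{description}</i></b></li>")
    if price is not None:        # likewise for a null Price
        items.append(f"<li><b>{price}</b></li>")
    return "<ul>" + "".join(items) + "</ul>"

# The three layout variants of Figure 1:
full = render("OLFA SCS-1", "Utility Scissor, 5 Inches, Stainless Steel", "$19.99")
no_price = render("FISKARS 99247097", "Bent Trimmer, Cut Length 3 3/4 Inches, ...", None)
no_desc = render("OLFA SCS-2", None, "$30.32")  # model number here is made up
```

One template thus yields several distinct layout structures, which is precisely the complication studied in this paper.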
[Figure 3: Specialized Schemas of HTML Documents, panels (a)-(c).]

Usually, associated with a collection of Web pages is an implicit schema from which these Web pages are generated. For instance, if we replace the attribute values with their corresponding attribute names in the HTML documents shown in Figures 2(a), 2(b), and 2(c), then we obtain the schemas shown in Figures 3(a), 3(b), and 3(c), respectively. We can think of the HTML documents in Figure 2 as being generated from the schemas in Figure 3 by filling in the attribute names with the corresponding attribute values stored in the backend database.

Knowing the schema for a collection of Web pages enables automated data extraction. For instance, if we use the schema in Figure 3(a) to parse the HTML document in Figure 2(a), then the text strings matched at the positions of the three attributes, ModelNumber, Description, and Price, will be automatically extracted and assigned as their values. (Without loss of generality, we match HTML tags regardless of their attribute values, so a tag carrying attributes matches the same tag without them. Moreover, attribute names such as ModelNumber, Description, and Price match any text string; in other words, they act as placeholders for XML #PCDATA strings.) Similarly, applying the schema in Figure 3(b) to the HTML document in Figure 2(b) pulls out the attribute values of ModelNumber and Description.

Note, however, that the schemas in Figure 3 are specialized in the sense that they capture only the structure common to a small collection of Web pages. For instance, the schema in Figure 3(b) is applicable only to the HTML document in Figure 2(b), not to the other two in Figures 2(a) and 2(c). If we assume that a Web site can produce all three different types of Web pages shown in Figure 2, then one immediate consequence is that we cannot know a priori which schema to use for data extraction given an HTML document.
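One way to operationalize "parsing with a schema" is to turn the attribute names of a specialized schema into named capture groups. The concrete tag layout below is the same assumed layout as in the earlier render() sketch, not the paper's actual schema:

```python
import re

# A specialized schema (Figure 3(a) style): attribute names become named
# groups, and #PCDATA placeholders become [^<]+ (any tag-free text).
SCHEMA_A = re.compile(
    r"<ul><li><a>(?P<ModelNumber>[^<]+)</a></li>"
    r"<li><b><i>(?P<Description>[^<]+)</i></b></li>"
    r"<li><b>(?P<Price>[^<]+)</b></li></ul>"
)

page = ("<ul><li><a>OLFA SCS-1</a></li>"
        "<li><b><i>Utility Scissor, 5 Inches, Stainless Steel</i></b></li>"
        "<li><b>$19.99</b></li></ul>")
m = SCHEMA_A.fullmatch(page)
assert m and m.group("Price") == "$19.99"  # attribute values fall out of the match
```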
Therefore, the first important question is:

Can we infer a schema that generalizes all the Web pages of a given Web site?

[Figure 4: An Ambiguous Schema]

In fact, a generalized schema can be obtained by aggregating several specialized schemas. Such a generalized schema not only saves storage space but also speeds up parsing when it is used to extract attribute values. For instance, the schema in Figure 4 generalizes all three HTML documents in Figure 2, i.e., those pages can be obtained by instantiating this schema.

Note that, as in [13], we use union-free regular expressions to represent schemas in this paper. In particular, the "?" operator denotes that a token (or a sequence of tokens) is optional, i.e., can appear zero times or once. Note that for any expression E, E? = ε|E, where ε denotes the empty string. Therefore, although explicit use of the union operator ("|") is not allowed, a very restricted form of union is in fact available in union-free regular expressions.

There are practical reasons for restricting arbitrary use of the union operator. First, union-free regular expressions have been shown to be a natural, expressive formalism for representing the schematic information of a large class of nested, structured data [5, 13]. It is therefore highly desirable to see whether the true identity of an implicit schema can be inferred from examples, given the knowledge that it is in union-free regular expression form. Second, allowing arbitrary use of the union operator can easily cause the so-called futility effect in the schema inference process [19]. For example, a naive algorithm could simply take the union of all three specialized schemas in Figure 3 and claim that this is the schema for all HTML documents in Figure 2. Clearly, such a schema does not generalize anything beyond the examples seen and, as a result, will fail to recognize an unseen Web page in which both the Description and Price attributes are missing (whereas the schema in Figure 4 will not). In fact, such a naive algorithm may need an exponential number of examples (one per attribute permutation) to work properly. Moreover, when an inferred schema is put into plain union form, its size can suffer an exponential blowup with respect to the number of nullable data attributes.

[Figure 5: An Unambiguous Schema]
There is a problem, however, with the generalized schema in Figure 4. Suppose that we apply it to the HTML document shown in Figure 2(c) to extract attribute values. Then ambiguity arises, since the string

<b>$30.32</b>

can be recognized by either of the following two expressions,

(<b>Price</b>)?
((<b>)?(<i>)?(<b>)?Description(</b>)?(</i>)?(</b>)?)?

given that Description and Price both match any text string. Accordingly, we say that the schema in Figure 4
is ambiguous. Clearly, an ambiguous schema is of "bad" quality, since it compromises the precision of the extracted data. Therefore, a highly desirable property to require of an inferred schema is that it be unambiguous. So the second important question is: Can we infer an unambiguous schema that generalizes a collection of Web pages? This is the problem we study in the rest of this paper.

Observe that the ambiguity of the schema in Figure 4 stems from its failure to identify the <i> tag as the distinguishing presentation style of the Description attribute. Once this feature is identified, we obtain the unambiguous schema shown in Figure 5. In this sense, the schema in Figure 5 is of "good" quality.
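The ambiguity can be demonstrated mechanically. The sketch below renders the two optional branches as Python regexes (the tag strings follow the reconstruction above, and #PCDATA becomes [^<]+) and shows that both accept the same string:

```python
import re

# Price branch and Description branch of the (ambiguous) schema in Figure 4.
PRICE = re.compile(r"<b>[^<]+</b>")
DESC = re.compile(r"(?:<b>)?(?:<i>)?(?:<b>)?[^<]+(?:</b>)?(?:</i>)?(?:</b>)?")

s = "<b>$30.32</b>"
assert PRICE.fullmatch(s) and DESC.fullmatch(s)  # both branches accept s
```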
3. PROBLEM FORMALIZATION AND COMPLEXITY RESULTS

In this section we formalize the problem of unambiguous schema inference from Web pages. We show that it can be cast as the problem of inferring union-free regular expressions (possibly with optionals) that are consistent with respect to positive and negative examples.

First, we can view the generalized schemas in Figures 4 and 5 as two regular expressions, S1 and S2, respectively. These are not arbitrary regular expressions: their syntax is restricted to union-free regular expressions with optionals, which are formally defined below.

Definition 1 (Union-Free Regular Expressions). Let Σ be a finite alphabet. A union-free regular expression (abbr. UFRE) over Σ is defined inductively as follows:
• Any symbol c ∈ Σ is a UFRE.
• If E1 and E2 are UFREs, so is E1·E2.
• If E is a UFRE, so are E* and E?.

In the above definition, E* means that E can be repeated zero or more times, while E? means zero times or once, i.e., E is optional. For instance, (ab)*·d? is a valid UFRE. Note that E? = ε|E, where ε denotes the empty string. Therefore, the above definition actually allows a very restricted use of the union operator ("|"), but for simplicity we still call such expressions union-free regular expressions. The union-free regular expressions described here are exactly those used in [13, 8] for representing schemas of HTML documents. Following the standard convention, we use L(E) to denote the set of strings recognized by a regular expression E.

Second, we can view the specialized schemas in Figure 3 as strings. Since these strings essentially mirror the collection of HTML documents in Figure 2, they serve as positive examples from which a schema is to be inferred. Let us denote this set of strings by POS. Then clearly POS ⊆ L(S1) and POS ⊆ L(S2), where S1 and S2 represent the generalized schemas in Figures 4 and 5, respectively. This condition means that a schema should generalize all the positive examples.

But we only see Web pages coming from a particular Web site as positive examples. Where do the negative examples come from? In fact, negative examples can be implicitly derived from positive examples. For instance, from the positive examples in Figure 3 we can derive the two negative examples shown in Figure 6. Intuitively, these two negative examples state that the Price attribute is not presented using bold, italic font and that the Description attribute is not presented using bold font alone. Clearly, if an inferred schema accepts either of these two examples, then ambiguity can arise when the schema is used to extract attribute values from an HTML document.
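Definition 1 can be rendered executable. The sketch below (an illustration of the definition, not machinery from the paper) represents UFREs as a small AST and compiles them to Python regexes that use no union:

```python
import re
from dataclasses import dataclass
from typing import Union

@dataclass
class Sym:
    c: str

@dataclass
class Cat:
    left: "UFRE"
    right: "UFRE"

@dataclass
class Star:
    body: "UFRE"

@dataclass
class Opt:
    body: "UFRE"

UFRE = Union[Sym, Cat, Star, Opt]

def to_regex(e: UFRE) -> str:
    """Compile a UFRE (Definition 1) to a Python regex; no union is emitted."""
    if isinstance(e, Sym):
        return re.escape(e.c)
    if isinstance(e, Cat):
        return to_regex(e.left) + to_regex(e.right)
    if isinstance(e, Star):
        return f"(?:{to_regex(e.body)})*"
    return f"(?:{to_regex(e.body)})?"  # Opt: zero or one occurrence

# (ab)*·d? from the text: accepts ababd and d, rejects aba.
e = Cat(Star(Cat(Sym("a"), Sym("b"))), Opt(Sym("d")))
assert re.fullmatch(to_regex(e), "ababd")
assert re.fullmatch(to_regex(e), "d")
assert not re.fullmatch(to_regex(e), "aba")
```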
[Figure 6: Implicitly Derived Negative Examples, panels (a)-(b).]

Therefore, taking the strings in Figure 6 as negative examples, we expect a good schema to exclude all of them. Observe that the ambiguous schema in Figure 4 accepts the negative example in Figure 6(b), whereas the unambiguous schema in Figure 5 does not. Technically, if we denote the set of strings in Figure 6 by NEG, then NEG ∩ L(S1) ≠ ∅ and NEG ∩ L(S2) = ∅. This is essentially what distinguishes a good schema (S2) from a bad one (S1). In summary, an unambiguous schema can be viewed as a union-free regular expression that accepts all the positive examples but excludes all the negative examples. This discussion is formalized in the following definition.

Definition 2 (Consistency). Let POS and NEG be two sets of strings. Given a union-free regular expression E, we say that E is consistent w.r.t. ⟨POS, NEG⟩ iff L(E) ⊇ POS and L(E) ∩ NEG = ∅.

In Definition 2, the strings in POS serve as positive examples while those in NEG serve as negative examples. Intuitively, a consistent UFRE generalizes all positive examples but excludes all negative examples. For instance, aab* is consistent w.r.t. ⟨{aa, aab}, {ab, cd}⟩, whereas a*b* is not, since the negative example ab ∈ L(a*b*).

Given a collection of Web pages as positive examples, our goal is to infer an unambiguous schema (represented as a UFRE) from these Web pages. Although we do not define here how the set of implicit negative examples is derived, we informally say that an inferred schema is unambiguous iff it is consistent w.r.t. the positive examples and the (implicitly derived) negative examples. Our unambiguous schema inference problem can now be stated as follows.

Problem 1 (Consistent UFRE). Given two sets of strings POS and NEG, is there a UFRE that is consistent w.r.t. ⟨POS, NEG⟩?

Is there always a UFRE that is consistent w.r.t. given sets of positive and negative examples? The answer turns out to be "No". For instance, it can be shown that there is no UFRE that is consistent w.r.t. ⟨{a, b}, {ab, ba}⟩.
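Definition 2 translates directly into a check, again treating UFREs as compiled regexes; the helper name is ours:

```python
import re

def is_consistent(ufre: str, pos: set, neg: set) -> bool:
    """Definition 2: L(E) must contain every positive and no negative example."""
    r = re.compile(ufre)
    return all(r.fullmatch(s) for s in pos) and not any(r.fullmatch(s) for s in neg)

# aab* is consistent w.r.t. <{aa, aab}, {ab, cd}>; a*b* is not, since it accepts ab.
assert is_consistent(r"aab*", {"aa", "aab"}, {"ab", "cd"})
assert not is_consistent(r"a*b*", {"aa", "aab"}, {"ab", "cd"})
```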
It turns out that the consistent UFRE problem is intractable in general.

Theorem 1. The consistent UFRE problem is NP-complete.

Proof. See Appendix A. □
Interestingly, we can also look at the problem of unambiguous schema inference from a slightly different perspective. Recall that in the motivating example of Section 2 the database schema is represented as ⟨ModelNumber, Description?, Price?⟩, where the attributes Description and Price are nullable. It is therefore not surprising that the corresponding schema for these Web pages is roughly of the form α·M·(D)?·(P)?·β, where α and β are fixed strings and M, D, P are the subschemas for the attributes ModelNumber, Description, and Price, respectively.

Let S1 and S2 again denote the two generalized schemas in Figures 4 and 5, respectively. Replacing all attribute names with #PCDATA (all attribute values in our example are plain text strings, and for simplicity we use #PCDATA to denote any text string), let α and β denote the fixed tag sequences that precede and follow the attribute encodings, and define

M = <a>#PCDATA</a>
P = <b>#PCDATA</b>
D1 = (<b>)?(<i>)?(<b>)?#PCDATA(</b>)?(</i>)?(</b>)?
D2 = (<b>)?<i>(<b>)?#PCDATA(</b>)?</i>(</b>)?

Then S1 can be represented as α·M·(D1)?·(P)?·β and S2 as α·M·(D2)?·(P)?·β. Also let POS_D and POS_P be the sets of occurrences of the attributes Description and Price, respectively (see Figure 2):

POS_D = {<b><i>#PCDATA</i></b>, <i><b>#PCDATA</b></i>}
POS_P = {<b>#PCDATA</b>}

We can think of POS_D and POS_P as two sets of positive examples for the attributes Description and Price, respectively. Looking inside the structures of S1 and S2, we see that both D1 and D2 generalize the set of examples POS_D, i.e., POS_D ⊆ L(D1) and POS_D ⊆ L(D2). However, if we take POS_P (the set of positive examples for the attribute Price) as the set of negative examples for the attribute Description, then it is straightforward to verify that D1 is not consistent w.r.t. ⟨POS_D, POS_P⟩ whereas D2 is. Clearly, the ambiguity of the schema S1 is due to its inconsistency w.r.t. the negative examples (observe that here the set of positive examples for one attribute serves as a set of negative examples for another attribute).

In principle, to infer a schema in which multiple attributes A1, ..., An are nullable, we need to infer a subschema for each attribute Ai. So given a list of sets of positive examples P1, ..., Pn for the attributes A1, ..., An, respectively, it is highly desirable to infer a subschema Ei for each attribute Ai such that Ei accepts all the positive examples in Pi but excludes all the positive examples of the other attributes combined. This notion is formalized in Definition 3 below.
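Before the formal definition, these consistency claims about D1 and D2 can be checked mechanically; the tag strings are as given above, and the concrete attribute values are placeholders:

```python
import re

PC = r"[^<]+"  # #PCDATA: any text string
D1 = rf"(?:<b>)?(?:<i>)?(?:<b>)?{PC}(?:</b>)?(?:</i>)?(?:</b>)?"
D2 = rf"(?:<b>)?<i>(?:<b>)?{PC}(?:</b>)?</i>(?:</b>)?"

POS_D = {"<b><i>desc</i></b>", "<i><b>desc</b></i>"}  # both tag orders occur
POS_P = {"<b>$19.99</b>"}

for d in (D1, D2):  # both subschemas generalize POS_D
    assert all(re.fullmatch(d, s) for s in POS_D)
assert re.fullmatch(D1, "<b>$19.99</b>")      # D1 also accepts a Price occurrence
assert not re.fullmatch(D2, "<b>$19.99</b>")  # D2 excludes it, so D2 is consistent
```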
Definition 3 (Unambiguity). Given a list of sets of strings (P1, ..., Pn) and a list of UFREs (E1, ..., En), we say that (E1, ..., En) is unambiguous w.r.t. (P1, ..., Pn) iff Ei is consistent w.r.t. ⟨Pi, ∪_{j≠i} Pj⟩ for all 1 ≤ i ≤ n.
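A direct transcription of Definition 3, repeating the is_consistent sketch from above for self-containment (helper names are ours):

```python
import re

def is_consistent(ufre: str, pos: set, neg: set) -> bool:
    r = re.compile(ufre)
    return all(r.fullmatch(s) for s in pos) and not any(r.fullmatch(s) for s in neg)

def is_unambiguous(ufres: list, examples: list) -> bool:
    """Definition 3: each E_i is consistent w.r.t. <P_i, union of the other P_j>."""
    return all(
        is_consistent(e, p, set().union(*(q for j, q in enumerate(examples) if j != i)))
        for i, (e, p) in enumerate(zip(ufres, examples))
    )

assert is_unambiguous([r"a(?:b)?", r"c(?:d)?"], [{"a", "ab"}, {"c", "cd"}])
# Fails: the first UFRE also accepts "b", a positive example of the second attribute.
assert not is_unambiguous([r"(?:a)?(?:b)?", r"(?:b)?"], [{"a", "ab"}, {"b"}])
```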
However, even when a list of UFREs is unambiguous w.r.t. the given examples, the sets of strings recognized by these UFREs may overlap. In that case ambiguity can still arise, since overlapping UFREs may match the same text string in a given Web page. Therefore, an even stronger property that we may want to impose is that the sets of strings recognized by these UFREs be pairwise disjoint. If so, we say the list of UFREs is inherently unambiguous.
Definition 4 (Inherent Unambiguity). Let (P1, ..., Pn) be a list of sets of strings and (E1, ..., En) a list of UFREs. We say that (E1, ..., En) is inherently unambiguous w.r.t. (P1, ..., Pn) iff L(Ei) ⊇ Pi for all 1 ≤ i ≤ n, and L(Ei) ∩ L(Ej) = ∅ for all 1 ≤ i, j ≤ n with i ≠ j.

Inherently unambiguous UFREs retain more precision than those that are only unambiguous w.r.t. the given examples. Therefore, when we cast the problem of unambiguous schema inference as one of inferring multiple subschemas for multiple (nullable) data attributes, we obtain two different notions of unambiguity: unambiguity w.r.t. the given examples and inherent unambiguity. We formally state the new schema inference problems and complexity results below.

Problem 2 (Unambiguous UFREs). Given a list of sets of strings (P1, ..., Pn), is there a list of UFREs (E1, ..., En) that is unambiguous w.r.t. (P1, ..., Pn)?

Problem 3 (Inherently Unambiguous UFREs). Given a list of sets of strings (P1, ..., Pn), is there a list of UFREs (E1, ..., En) that is inherently unambiguous w.r.t. (P1, ..., Pn)?

Theorem 2. The unambiguous UFREs problem is NP-complete.

Theorem 3. The inherently unambiguous UFREs problem is decidable.

Proof. [Sketch] We can show that, given a list of sets of strings (P1, ..., Pn), if there exists a list of UFREs (E1, ..., En) that is inherently unambiguous w.r.t. (P1, ..., Pn), then the size of each Ei is bounded by the size of Pi. Therefore, a naive algorithm can enumerate each candidate Ei and check whether the resulting list of UFREs is inherently unambiguous w.r.t. (P1, ..., Pn). □

We should point out that Problems 1 and 2 are not equivalent. Given a pair of sets of strings (P1, P2), if there is a pair of UFREs (E1, E2) that is unambiguous w.r.t. (P1, P2), then clearly E1 is consistent w.r.t. ⟨P1, P2⟩. Therefore, a "yes" answer to Problem 2 implies a "yes" answer to Problem 1. The converse, however, is not necessarily true: the existence of a UFRE consistent w.r.t. ⟨P1, P2⟩ does not imply the existence of a pair of UFREs unambiguous w.r.t. (P1, P2). For example, (ab)?(ba)? is consistent w.r.t. ⟨{ab, ba}, {a, b}⟩,
but there is no pair of UFREs that is unambiguous w.r.t. ({ab, ba}, {a, b}). Therefore, Problem 1 does not directly reduce to Problem 2, although Theorem 1 immediately implies that Problem 2 is in NP. Hence establishing the complexity result of Theorem 2 requires a separate treatment. The proof of Theorem 2 is similar to that of Theorem 1 but is omitted here due to space limitations.

Problems 2 and 3 are not equivalent either, and Problem 2 does not directly reduce to Problem 3. For example, the pair of UFREs

( (a?)(b?)(a?)(cbbc)?(bc*b)?, (c?)(b?)(c?)(abba)?(ba*b)? )

is unambiguous w.r.t. the pair of sets of strings

( {ab, ba, cbbc, bccb, bcb}, {cb, bc, abba, baab, bab} )

but it can be shown that no pair of UFREs is inherently unambiguous w.r.t. these two sets. In fact, we can show that for any pair of UFREs (E1, E2) that is unambiguous w.r.t. the two sets above, it must be the case that b ∈ L(E1) ∩ L(E2). Because checking whether two regular expressions are disjoint can be done in polynomial time, it follows that Problem 3 is in NP. Although we do not yet know the exact complexity of the inherently unambiguous UFREs problem, we conjecture that it is also NP-complete.

Finally, summarizing the main results of this section, we make the following claim.

Corollary 4. It is an NP-complete problem to decide whether there exists an unambiguous schema, represented using union-free regular expressions, that generalizes an arbitrary collection of Web pages with multiple nullable data attributes.
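The naive algorithm of Theorem 3 needs a disjointness test. Below is a bounded brute-force approximation (an exact test would use a product automaton), used to confirm that the pair of UFREs displayed above indeed shares the string b:

```python
import re
from itertools import product

def bounded_disjoint(r1: str, r2: str, alphabet: str, max_len: int) -> bool:
    """Approximate the test L(E1) ∩ L(E2) = ∅ by enumerating all strings
    up to max_len; an exact test would intersect the two automata."""
    p1, p2 = re.compile(r1), re.compile(r2)
    return not any(
        p1.fullmatch("".join(w)) and p2.fullmatch("".join(w))
        for n in range(max_len + 1)
        for w in product(alphabet, repeat=n)
    )

E1 = r"(?:a)?(?:b)?(?:a)?(?:cbbc)?(?:bc*b)?"
E2 = r"(?:c)?(?:b)?(?:c)?(?:abba)?(?:ba*b)?"
assert re.fullmatch(E1, "b") and re.fullmatch(E2, "b")  # the shared string b
assert not bounded_disjoint(E1, E2, "abc", max_len=1)   # not inherently unambiguous
```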
4. RELATED WORK

Our work is closely related to the grammar inference problem, first addressed in the seminal works of Gold and Angluin. Gold [11, 12] proved that the problem of inferring a minimum-size DFA consistent with given examples is NP-complete. Angluin [1] showed that the problem of inferring a minimum-size regular expression (with no restriction on the use of unions) from positive and negative examples is NP-complete. In this paper, however, we restrict the syntax to union-free regular expressions and impose no constraint on the size of the inferred UFREs. Our problems of inferring consistent UFREs and unambiguous lists of UFREs have no equivalent counterparts in the classical work on grammar inference, and hence none of the known results there applies. In [2] Angluin proposed a polynomial-time algorithm for actively learning the minimum DFA of a regular language from a teacher who knows the true identity of this regular language. Such an active learning framework is different from ours, which is passive.

There is also a large body of work on learning subsequences and supersequences from sets of strings. The following problems are all NP-complete: (1) finding either the shortest common supersequence or the longest common subsequence of an arbitrary number of strings over a binary alphabet [17, 22]; (2) finding a sequence that is a common subsequence/supersequence of a set of positive examples but not a subsequence/supersequence of any string in a set of negative examples [15, 18]. The syntax of UFREs is much more expressive than plain strings and hence much better suited for representing schematic information. Moreover, our problem of inferring an unambiguous list of UFREs (one per nullable attribute) has no counterpart in the area of sequence learning.

The XTRACT system for inferring schemas of XML documents was reported in [10]. However, in the problem settings of [10] the examples are labeled trees instead of strings, and the main constraint there is minimization of the size of the inferred schema. Consequently, all input examples in [10] are positive examples, and our notion of unambiguity w.r.t. positive and negative examples is not explored there.

The complexity results presented in this paper are inspired by our recent work on multi-attribute data extraction from Web sources [23]. But the focus of [23] is on learning extraction patterns (using a syntax different from UFREs) from examples and on the precision and recall of the extracted data. In fact, the issue of ambiguity resolution commonly arises in learning-based approaches to data extraction and schema inference [4, 16, 14, 7, 8, 3] but has remained relatively unexplored in the literature. A different notion of unambiguity was introduced in our earlier work [9] to ascribe a quality measure to learned extraction patterns (again using a syntax different from UFREs). However, [9] is mainly concerned with the computational complexity of checking various properties (including unambiguity) of extraction patterns rather than with inferring extraction patterns from examples.

Our results have direct bearing on the work reported in [14, 13, 8, 3]. In [13], several schema inference problems were introduced and shown to be solvable in polynomial time. However, the low complexity reported in [13] is due to the assumption that null data values are explicitly encoded. Our results show that if null data values are completely omitted (as is commonly observed in practice), then the (unambiguous) schema inference problem quickly becomes intractable in general. Although exponential-time and polynomial-time heuristics have been proposed in [8] and [3], respectively, to infer schemas from unlabeled Web pages, our complexity results imply that the schema inference problem remains computationally difficult even when the labels are already known.
5. CONCLUSION
Research on wrapper construction for Web sources has made a transition from its early focus on manual and semi-automatic approaches to fully automated techniques based on machine learning and schema inference. In this paper we have formalized the problem of schema inference from Web pages in the presence of nullable data attributes. We introduced the notion of unambiguity as an important quality measure for inferred schemas, studied the schema inference problem from several different perspectives, and proved that the problem of inferring a good schema is NP-complete. Our results provide a theoretical basis for the complexity-related aspects of unambiguous data extraction and schema inference.

It should be noted, however, that our complexity results, although pessimistic, deal with worst-case scenarios. In practice many Web sources exhibit a rather high degree of regularity in presentation styles, which is, in fact, a hallmark of good Web site design. For such Web sources efficient heuristics can still be developed to construct robust wrapper systems. The notion of unambiguity and its impact on the precision of the extracted data can provide important guidance in building these systems.

An important issue raised by our complexity results has to do with training examples. Recall our assumption that schema inference can begin with an arbitrary set of examples. This assumption is a major reason for the exponential blowup in worst-case complexity. We believe that certain "good" properties of the set of training examples can lower the computational cost of schema inference even when unambiguity is sought. It is therefore very important to be able to obtain "clean" training data to improve the performance of fully automated data extraction techniques. We believe that good domain ontologies can play a significant role in boosting the quality of training data, and the generation of training examples from such ontologies appears to be a promising direction for future research.
6. ACKNOWLEDGMENTS

This work was supported in part by NSF grants IIS-0072927, CCR-0205376, and CCR-0311512. The authors would like to thank the anonymous referees for their comments and suggestions, which helped improve the quality of this paper.
7. REFERENCES

[1] D. Angluin. On the complexity of minimum inference of regular sets. Information and Control, 39(3):337–350, 1978.
[2] D. Angluin. Learning regular sets from queries and counterexamples. Information and Computation, 75(2):87–106, 1987.
[3] A. Arasu and H. Garcia-Molina. Extracting structured data from web pages. In ACM International Conference on Management of Data (SIGMOD), 2003.
[4] N. Ashish and C. Knoblock. Wrapper generation for semi-structured internet sources. SIGMOD Record, 26(4):8–15, 1997.
[5] P. Atzeni, G. Mecca, and P. Merialdo. To weave the web. In International Conference on Very Large Data Bases (VLDB), 1997.
[6] B. Chidlovskii. Wrapping web information providers by transducer induction. In European Conference on Machine Learning, 2001.
[7] W. Cohen, M. Hurst, and L. Jensen. A flexible learning system for wrapping tables and lists in HTML documents. In International World Wide Web Conference (WWW), 2002.
[8] V. Crescenzi, G. Mecca, and P. Merialdo. RoadRunner: Towards automatic data extraction from large web sites. In International Conference on Very Large Data Bases (VLDB), 2001.
[9] H. Davulcu, G. Yang, M. Kifer, and I. V. Ramakrishnan. Computational aspects of resilient data extraction from semistructured sources. In ACM International Symposium on Principles of Database Systems (PODS), 2000.
[10] M. N. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri, and K. Shim. XTRACT: A system for extracting document type descriptors from XML documents. In ACM International Conference on Management of Data (SIGMOD), 2000.
[11] E. M. Gold. Language identification in the limit. Information and Control, 10(5):447–474, 1967.
[12] E. M. Gold. Complexity of automaton identification from given data. Information and Control, 37(3):302–320, 1978.
[13] S. Grumbach and G. Mecca. In search of the lost schema. In International Conference on Database Theory (ICDT), 1999.
[14] C.-N. Hsu and M.-T. Dung. Generating finite-state transducers for semi-structured data extraction from the web. Information Systems, 23(8):521–538, 1998.
[15] T. Jiang and M. Li. On the complexity of learning strings and sequences. Theoretical Computer Science, 119(2):363–371, 1993.
[16] N. Kushmerick, D. S. Weld, and R. B. Doorenbos. Wrapper induction for information extraction. In International Joint Conference on Artificial Intelligence (IJCAI), 1997.
[17] D. Maier. The complexity of some problems on subsequences and supersequences. Journal of the ACM, 25(2):322–336, 1978.
[18] M. Middendorf. On finding various minimal, maximal, and consistent sequences over a binary alphabet. Theoretical Computer Science, 145(1-2):317–327, 1995.
[19] T. M. Mitchell. Machine Learning. McGraw-Hill, 1997.
[20] I. Muslea, S. Minton, and C. Knoblock. A hierarchical approach to wrapper induction. In Third International Conference on Autonomous Agents (Agents'99), 1999.
[21] M. Perkowitz, R. B. Doorenbos, O. Etzioni, and D. S. Weld. Learning to understand information on the Internet: An example-based approach. Journal of Intelligent Information Systems, 8(2):133–153, 1997.
[22] K.-J. Räihä and E. Ukkonen. The shortest common supersequence problem over binary alphabet is NP-complete. Theoretical Computer Science, 16:187–198, 1981.
[23] G. Yang, S. Mukherjee, and I. V. Ramakrishnan. On precision and recall of multi-attribute data extraction from semistructured sources. In IEEE International Conference on Data Mining (ICDM), 2003.
APPENDIX

A. PROOF OF NP-COMPLETENESS

In this section we formally prove the complexity result of Theorem 1. In the sequel we use ε to denote either the empty string or the empty expression; the intended usage will be clear from the context. We also use the notation α^k, where α is a string and k an integer, for the string obtained by repeating the string α k times; in particular, α^0 = ε.

Let POS and NEG be two sets of strings. First, deciding whether a string is accepted by a regular expression can be done in polynomial time. Second, we can show that the size of the shortest UFRE that is consistent w.r.t. ⟨POS, NEG⟩ is bounded by the size of POS and NEG. Therefore, the consistent UFRE problem is in NP.

To prove that the problem is NP-hard, we reduce SAT to it. We assume the alphabet Σ = {$, 0, 1}. Let F be a propositional formula in conjunctive normal form with m clauses C1, C2, ..., Cm and n variables V1, V2, ..., Vn. For 1 ≤ i ≤ m and 1 ≤ j ≤ n, define:

Fij = $10, if Vj appears positively in Ci;
Fij = $01, if Vj appears negatively in Ci;
Fij = $0, if Vj does not appear in Ci.
The idea is that, within a string, we use $01 and $10 to represent the logical values true and false, respectively. Thus for all 1 ≤ i ≤ m, the string Fi1 Fi2 ⋯ Fin encodes the only assignment of truth values to the variables V1, V2, ..., Vn that makes the clause Ci false. Moreover, define:

POS = {($0)^n, ($1)^n}
NEG = N1 ∪ N2 ∪ N3, where
N1 = {$^k | 0 ≤ k ≤ n−1}
N2 = {$^k 00 $^{n−k}, $^k 11 $^{n−k} | 1 ≤ k ≤ n}
N3 = {Fi1 Fi2 ⋯ Fin | 1 ≤ i ≤ m}
Next we show that the formula F is satisfiable iff there is a UFRE that is consistent w.r.t. ⟨POS, NEG⟩. We use the following two UFREs,

Et = $(0?)(1?)
Ef = $(1?)(0?),

to represent the logical values true and false, respectively.

First, given an assignment of truth values to the variables V1, V2, ..., Vn of the formula F, we construct a UFRE E = E1 E2 ⋯ En, where for all 1 ≤ j ≤ n:

Ej = Et, if the truth value assigned to Vj is true;
Ej = Ef, if the truth value assigned to Vj is false.

So if the formula F is satisfiable, then there is an assignment of truth values to V1, V2, ..., Vn that satisfies F, and it can be verified that the UFRE E constructed as above is consistent w.r.t. ⟨POS, NEG⟩.

Now suppose that there is a UFRE E that is consistent w.r.t. ⟨POS, NEG⟩, i.e., L(E) ⊇ POS and L(E) ∩ NEG = ∅. We show that from E we can obtain an assignment of truth values to V1, V2, ..., Vn that satisfies the formula F.
Let E = A1 A2 ⋯ Al, where each Ak (1 ≤ k ≤ l) is either a single symbol ($, 0, or 1) or an expression of the form (X)? or (X)*. Since L(E) ⊇ POS, every Ak that is a single symbol must be $. Since L(E) ∩ N1 = ∅, the number of Ak's that are the single symbol $ must be exactly n. It follows that E has the form B0 $ B1 $ B2 ⋯ $ Bn, where each Bk is a concatenation of expressions of the form (X)? or (X)*. Moreover, if B0 is not the empty expression, then we can remove B0 from E, because the resulting expression is still consistent w.r.t. ⟨POS, NEG⟩. Therefore, in the following we assume that E = $B1 $B2 ⋯ $Bn.

Next we show that each Bk in E can be transformed into either (0?)(1?) or (1?)(0?) while keeping the resulting expression consistent w.r.t. ⟨POS, NEG⟩. We define two transformation operations on the Bk's: (i) remove a "?" operator; (ii) remove a "·" operator together with one of its operands. We keep performing either of these operations on each Bk unless doing so gives rise to inconsistency. When this process stops, no operator can be removed from any of the Bk's without losing consistency.

We claim that when the above process ends, each Bk is either (0?)(1?) or (1?)(0?). Because L(E) ⊇ POS, it must be the case that {ε, 0, 1} ⊆ L(Bk) for each Bk. Since L(E) ∩ N2 = ∅, we have 00 ∉ L(Bk) and 11 ∉ L(Bk) for each Bk. Therefore Bk cannot have the form (X)*. Nor can Bk have the form (X)?: otherwise we could remove the "?" operator and the resulting expression would still be consistent, because {0, 1} ⊆ L(X) ⊆ L((X)?). Therefore it must be the case that Bk = (Ck1)·(Ck2). Note that each of L(Ck1) and L(Ck2) contains at most one of 0 and 1, but not both; otherwise the "·" operator together with one of Ck1 and Ck2 could be removed. Since ε ∈ L(Bk), it follows that ε ∈ L(Ck1) and ε ∈ L(Ck2). Suppose 0 ∈ L(Ckj) (j = 1 or j = 2). Since 00 ∉ L(Bk), it follows that 00 ∉ L(Ckj). So Ckj cannot be a single symbol, nor of the form (X)* or (X)·(Y). Therefore Ckj must have the form (D)?. By a similar argument, D must be the single symbol 0, so Ckj must be (0)?. Similarly, if 1 ∈ L(Ckj) (j = 1 or j = 2), then Ckj must be (1)?. Therefore Bk is either (0?)(1?) or (1?)(0?).

We have shown that we can obtain a consistent UFRE E = $B1 $B2 ⋯ $Bn in which each Bk is either (0?)(1?) or (1?)(0?). We define an assignment of truth values to the variables V1, V2, ..., Vn as follows: if Bk = (0?)(1?), assign true to Vk; if Bk = (1?)(0?), assign false to Vk. Because L(E) ∩ N3 = ∅, one can verify that this truth assignment satisfies the formula F.

Finally, the total size of POS and NEG is clearly polynomial in m and n. Therefore the consistent UFRE problem is NP-hard. □
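As a mechanical sanity check of the forward direction of this reduction, the sketch below builds ⟨POS, NEG⟩ for a small illustrative CNF formula and confirms that the UFRE built from a satisfying assignment is consistent. The encoding follows the definitions above; the formula itself is made up:

```python
import re

def encode(clauses, n):
    """Build POS and NEG for a CNF over variables 1..n.
    Each clause is a set of signed ints: +j stands for Vj, -j for not Vj."""
    pos = {"$0" * n, "$1" * n}
    n1 = {"$" * k for k in range(n)}                       # too few $'s
    n2 = {"$" * k + "00" + "$" * (n - k) for k in range(1, n + 1)} | \
         {"$" * k + "11" + "$" * (n - k) for k in range(1, n + 1)}
    n3 = set()                                             # clause-falsifying strings
    for c in clauses:
        s = ""
        for j in range(1, n + 1):
            s += "$10" if j in c else ("$01" if -j in c else "$0")
        n3.add(s)
    return pos, n1 | n2 | n3

def assignment_regex(truth):
    """E = E1...En with Et = $(0?)(1?) for true and Ef = $(1?)(0?) for false."""
    return "".join(r"\$(?:0)?(?:1)?" if t else r"\$(?:1)?(?:0)?" for t in truth)

# F = (V1 or not V2) and (V2 or V3); V1 = V2 = V3 = true satisfies F.
pos, neg = encode([{1, -2}, {2, 3}], n=3)
e = re.compile(assignment_regex([True, True, True]))
assert all(e.fullmatch(s) for s in pos) and not any(e.fullmatch(s) for s in neg)
```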