Extracting Structures of HTML Documents Using a High-Level Stack Machine

Seung-Jin Lim
Computer Science Department, Brigham Young University, Provo, Utah 84602, U.S.A.
Email: [email protected]

Yiu-Kai Ng
Computer Science Department, Brigham Young University, Provo, Utah 84602, U.S.A.
Email: [email protected]

The preliminary version of this paper was published in the Proceedings of the 12th International Conference on Information Networking (ICOIN’98), Tokyo, Japan, January 1998.

Abstract

Information on the Web, which is a conglomeration of heterogeneous data such as texts, images, and audio clips, is often accessed through documents written according to the HTML specification [7]. According to that specification, HTML documents are semistructured in nature. We propose a high-level stack machine (HSM) which accesses an HTML document through its URL and constructs a semistructured data graph (SDG) of the document. The SDG of an HTML document H precisely captures the structure of the semistructured data embedded in H, based on the dependency relationship [11] among the data objects in H. HSM is configurable to accommodate a user’s interest with respect to the HTML elements in H to be considered during the construction of the SDG of H.

1 Introduction

During the early days of the World-Wide Web (WWW or Web), users relied heavily on the mouse-button-click method of navigating through hyperlinks provided by Web browsers to retrieve information of interest, and soon found themselves lost somewhere in the midst of cyberspace [9]. Since then, Web designers, as well as Web users, have been looking for better alternatives. Two recent alternative approaches are (i) the keyword search method using index servers, and (ii) the method of using extended query languages, including SQL-like query languages such as [2, 12, 1, 8, 13], and Datalog-like query languages such as WebLog [10].

To better understand this issue, we distinguish three types of data with respect to their structures, as in [9]:

Unstructured data: Data which is stored in files such as executable files, pure text files which contain no formatting code, and audio files. It is difficult, if not impossible, to ascertain the semantics of this type of data. ([5, 4] use the term unstructured data for any data of no rigid structure.)

Structured data: A typical example of this type of data is tables in the relational database model. The semantics of this type of data can be obtained using the grammar of the formal language in which the data file is written.

Semistructured data: This type of data is anything between the two extreme types of data mentioned above. Examples of this type of data are text files that contain formatting codes, such as LaTeX or HTML, and files which require a strict inner structure but in which some of the structural components can be omitted, such as BibTeX files, Unix environment files, etc. For a BibTeX file, if an entry of the file is of type article, then the meaning of article can be obtained using the set of predefined fields for article.

We view information on the Web as a collection of heterogeneous data, consisting largely of HTML documents together with other types of data such as images, sound, and video clips. The HTML specification [7], which is the most widely used paradigm for posting information on the Web, does not require a uniform structure in documents; e.g., an element which appears in one HTML document may be missing in others. Hence, we treat an HTML document H as a textual representation of semistructured data embedded within H.

One of the benefits of using semistructured data is its great flexibility in data representation [5]. A theory of semistructured data, however, is still missing [1], and hence there is no universally standardized definition of semistructured data. In this paper, we consider a finite set of data objects $\{o_1, o_2, \ldots, o_n\}$ as semistructured data $D$ if



• the structure of $D$ is irregular or incomplete [14, 2, 3],

• the distinction between the schema of $D$ and $\{o_1, o_2, \ldots, o_n\}$ is blurred, and the schema may change dynamically [1, 4], and

• $o_i$ ($1 \le i \le n$) is not type-sensitive. □



For example, semistructured data may include components A and B. Let the data fields of A be name and date of birth, and those of B be name, date of birth, and address. Further assume that date of birth in A is of type string, such as “6-12-1938”, whereas date of birth in B consists of three parts, Month, Day and Year, and Month can be of type integer, i.e., “1”, “2”, ..., or string, i.e., “January”, “February”, ..., or even encoded in a character set other than ASCII. The structures of A and B can be changed subsequently as needed.

In this paper, we present an approach for extracting the structures of semistructured data embedded within an HTML document H on the WWW, assuming that H is written in compliance¹ with the HTML specification [7] and is referenced by a given URL. (Additional URLs can be further obtained from the hyperlinks included in H.) We first propose a graphical data model of semistructured data, called the semistructured data graph (SDG), and then present a tool, a high-level stack machine (HSM)², to extract the structures embedded in H.

The contribution of this paper is three-fold. First, we extend the concept of dependency relationship among database components in [11] to capture the structure of the semistructured data embedded in an HTML document. Second, we design and implement a simple automaton, HSM, using the Java Language Environment, to construct the SDG for an HTML document specified by a URL. Since HSM is based on pushdown automata, it is fairly easy to implement using a stack, without being concerned with sophisticated functions such as rereading or replacing an input, or assuming unlimited auxiliary memory as a Turing machine does. Third, HSM is easily configurable according to the user’s needs in terms of which HTML elements are to be included in an SDG. To configure HSM, a user simply provides a configuration file which contains a list of the HTML elements of an HTML document H chosen by the user to be included in the SDG of H. Writing a configuration file does not require any knowledge of additional commands and their syntax, such as get(), split() and citytemp[1:0] in [8].

This paper is organized as follows: In Section 2, we discuss related work. In Section 3, we describe the details of SDGs. In Section 4, we propose HSM and demonstrate its application using a real-world example. In Section 5, we give concluding remarks.

¹ Note that erroneous HTML documents are not excluded from our consideration.
² HSM is a variation of the well-known pushdown automata (PDA).

2 Related Work

A number of methods have been proposed in the literature for improving information search and retrieval on the Web. Two commonly used methods are (i) the keyword search method using index servers [15] (which we do not discuss further), and (ii) the method of using extended query languages, which we discuss below.

Several query languages, typically of the SQL type, have been proposed for querying information on the WWW as extensions to existing query languages. This approach has been proposed mainly from the object-oriented database model and is considered an extension to OQL, which is based on the ODMG model. The approaches in [2, 1, 8, 13] are based on OEM [6] with minor variations. Lorel [2], a query language of SQL/OQL style for querying semistructured data, is implemented on top of the O2 system. Strict typing is relaxed in Lorel by the extensive use of coercion. In [8], a text-pattern-based configurable extraction program for converting HTML pages into database objects is proposed. The configuration is specified in a static specification file which consists of a sequence of commands. (Writing a specification file, however, tends to be error-prone because it requires knowledge of additional commands and their syntax.) [13] identifies a type hierarchy for irregular data and proposes an algorithm for deriving the type hierarchy, together with rules for assigning types to the elements of semistructured data.

WebSQL [12], an SQL-like query language, integrates textual retrieval with structure- and topology-based queries. It is one of the few Web query languages that provide a formal semantics.

Araneus [3] is another variation of ODMG and is a page-oriented model. In this model, each piece of information in a Web page, such as text, images and links, is considered an attribute of the page. Two other SQL-like languages, Ulixes and Penelope, are proposed based on Araneus to construct database views of the Web and to generate hypertextual views over the Web.

WebLog [10], another Web language, is a Datalog-like language which exploits partial knowledge of the information being queried. It has a restructuring ability, namely, the answer to a query is presented in an HTML document.

Two existing data models, OEM [2] and UnQL [5], are of particular interest. Although these models are different, both are tree-like data structures represented by a rooted, labeled, directed graph. A superficial difference between the two models is that labels are on nodes and edges in OEM, whereas labels are only on edges in UnQL.

3 The Data Model

The foundation of our data model for semistructured data, SDG, is the notion of dependency relationship [11] among database objects. We give the definition of objects below.

Definition 1 An object $o$ in a set $D$ is a triple $\langle label, value, identifier \rangle$, where

• label is the textual description (i.e., a string) of $o$.

• value is a finite ordered set of strings. If value is an empty set, then $o$ is called a free object; otherwise, $o$ is called a bound object.

• identifier is a non-empty string which uniquely identifies $o$ among the objects in $D$. □

In an SDG, the identifier and label of an object are static, whereas its value is dynamic. In other words, the identifier and label of an object do not change, whereas its value may change. We formalize and illustrate how the value of an object changes from free to bound in Definition 2 and Example 1. Also, type constraints are not strictly enforced on an object in an SDG, and hence the value of an object can be of any type. For instance, the value of object Month can be a string “January”, an integer “1”, or a string encoded using a character set other than ASCII. This relaxation is desirable in that whenever a query is posed for an object $o$ using values of types other than that of $o$, the system may gracefully fail to process the query rather than raise an error [14]. (For simplicity, we assume the value of an object is a set of strings, since all atomic types such as integer, real and boolean, and even a set of atomic types, can be represented as a string. In addition, we use $str$ instead of $\{str\}$ for each singleton set $\{str\}$, and $\{A.B, A.C\}$ is written as $A.\{B, C\}$, where $A$, $B$ and $C$ are strings.)

In conjunction with the definition of an object, we define a few utility functions below, where $S$ is a set of strings:

(1) Function $o.label()$ on object $o$ returns the label of $o$.

(2) Function $o.value()$ on object $o$ returns the value of $o$.

(3) $con: S \times S \rightarrow S$ is a binary function such that $con(arg_1, arg_2) = arg_1.arg_2$.

Definition 2 An object $o_1$ is said to directly depend on another object $o_2$, denoted $o_1 \rightarrow o_2$, if $o_1.value() := o_2.value()$. Dependency is transitive, which means that $o_1 \rightarrow o_3$ if $o_1 \rightarrow o_2$ and $o_2 \rightarrow o_3$. In such a case, we say that $o_1$ indirectly depends on $o_3$, denoted $o_1 \rightarrow o_2 \rightarrow o_3$. If an object $o$ directly depends on multiple objects $o_1, o_2, \ldots, o_n$, i.e., $o \rightarrow o_1$, $o \rightarrow o_2$, ..., $o \rightarrow o_n$, then $o.value() := o_1.value() \cup o_2.value() \cup \cdots \cup o_n.value()$. □

Example 1 Consider the following set of objects, where each object $o$ is associated with information of the form ($o.label()$; $o.value()$; a list of objects on which $o$ directly depends). It is assumed that a string which contains spaces is enclosed in double quotes, and ‘–’ denotes an empty list.

Location (Location; (free); Address and “Time Zone”),
Address (Address; (free); “Street Addr”, City and “Zip code”),
“Street Addr” (Street; (free); “11 E. Pine Lane”),
City (City; (free); Orem),
“Zip code” (ZipCode; (free); 84057),
“Time Zone” (TimeZone; (free); MST),
“11 E. Pine Lane” (“11 E. Pine Lane”; “11 E. Pine Lane”; –),
Orem (Orem; Orem; –),
84057 (84057; 84057; –), and
MST (MST; “Mountain Standard Time”; –).

Let us consider object Address. The dependency constraint of Address is $\text{Address} \rightarrow \{\text{"Street Addr"} \rightarrow \text{"11 E. Pine Lane"},\ \text{City} \rightarrow \text{Orem},\ \text{"Zip code"} \rightarrow 84057\}$. Hence,

$\text{Address}.value() := \{\text{"Street Addr"}.value(),\ \text{City}.value(),\ \text{"Zip code"}.value()\}$
$= \{\text{"11 E. Pine Lane"}.value(),\ \text{Orem}.value(),\ 84057.value()\}$
$= \{\text{"11 E. Pine Lane"},\ \text{Orem},\ 84057\}$. □
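The way a free object obtains its value in Example 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `Obj`, the helper `resolve`, and the use of ordered lists to stand for ordered sets are hypothetical choices, not part of the HSM implementation, and the sketch computes the value on demand rather than storing it back into the object.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Obj:
    """An object as in Definition 1: identifier, label, and value (hypothetical encoding)."""
    identifier: str
    label: str
    value: List[str] = field(default_factory=list)       # empty list = free object
    depends_on: List[str] = field(default_factory=list)  # identifiers of direct dependencies


def resolve(identifier: str, objects: Dict[str, Obj]) -> List[str]:
    """Return the value of an object; a free object takes the union of its dependencies' values."""
    obj = objects[identifier]
    if obj.value:                      # bound object: its value is already known
        return list(obj.value)
    merged: List[str] = []
    for dep in obj.depends_on:         # union over all direct dependencies, order preserved
        for v in resolve(dep, objects):
            if v not in merged:
                merged.append(v)
    return merged


# The objects of Example 1.
objects = {o.identifier: o for o in [
    Obj("Location", "Location", depends_on=["Address", "Time Zone"]),
    Obj("Address", "Address", depends_on=["Street Addr", "City", "Zip code"]),
    Obj("Street Addr", "Street", depends_on=["11 E. Pine Lane"]),
    Obj("City", "City", depends_on=["Orem"]),
    Obj("Zip code", "ZipCode", depends_on=["84057"]),
    Obj("Time Zone", "TimeZone", depends_on=["MST"]),
    Obj("11 E. Pine Lane", '"11 E. Pine Lane"', value=["11 E. Pine Lane"]),
    Obj("Orem", "Orem", value=["Orem"]),
    Obj("84057", "84057", value=["84057"]),
    Obj("MST", "MST", value=["Mountain Standard Time"]),
]}

address_value = resolve("Address", objects)
```

With these definitions, `resolve("Address", objects)` yields the three strings derived at the end of Example 1, and `resolve("Location", objects)` additionally includes the value of MST.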

We define the long label of an object $o$ based on the dependency relationship of $o$ with other objects.

Definition 3 Given a dependency among objects $o_1, o_2, \ldots, o_N$ such that $o_1 \rightarrow \cdots \rightarrow o_N$, the long label of $o_i$, denoted $o_i.Label()$, $1 \le i \le N$, is defined as

$o_1.Label() = o_1.label()$, and
$o_i.Label() = con(o_{i-1}.Label(),\ o_i.label())$, $2 \le i \le N$. □

Example 2 Consider the set of objects in Example 1. According to Definition 3, all the expressions listed below are valid.

Location.Label() = Location.label() = Location

Orem.Label() = (City.Label()).(Orem.label())
= (Address.Label()).(City.label()).Orem
= (Location.Label()).(Address.label()).City.Orem
= (Location.label()).Address.City.Orem
= Location.Address.City.Orem

“Time Zone”.Label() = Location.TimeZone □
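The recursion in Definition 3 amounts to a left fold of con over the labels along a dependency chain. The following Python sketch illustrates this; the function names `con` and `long_label` and the representation of a chain as a list of labels are assumptions made for the illustration only.

```python
from typing import List


def con(arg1: str, arg2: str) -> str:
    """Concatenate two strings with a dot, as in con(arg1, arg2) = arg1.arg2."""
    return arg1 + "." + arg2


def long_label(chain_labels: List[str]) -> str:
    """Long label of the last object in a dependency chain o_1, ..., o_N, given the labels in order."""
    label = chain_labels[0]                 # o_1.Label() = o_1.label()
    for next_label in chain_labels[1:]:     # o_i.Label() = con(o_{i-1}.Label(), o_i.label())
        label = con(label, next_label)
    return label


orem_label = long_label(["Location", "Address", "City", "Orem"])
```

Applied to the chain from Location to Orem, the fold reproduces the last line of the derivation of Orem.Label() in Example 2.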

We now formally define SDGs, which are based on the notion of dependency constraints among objects.

Definition 4 Given semistructured data $D$, the semistructured data graph SDG³ of $D$ is a triple $(V, E, g)$ which is a rooted, directed, labeled graph, where

• $V$ is a finite set of nodes and $V = \{V_R\} \cup V_I \cup V_L$, where $\{V_R\} \cap V_I \cap V_L = \emptyset$. $V_R$ with label ‘D’ is the root node of the SDG, which serves as the entry point of the SDG and represents $D$. $V_L$ is a finite set of leaf nodes, and $V_I$ is a finite set of nodes other than $V_R$ and $V_L$ in the SDG. A node $n \in (V_I \cup V_L)$ with label ‘o’ represents an object $o$ in $D$.

• $E$ is a finite set of directed edges.

• $g$ is a function from an edge to a pair of endpoints such that $g(e) = (n_1, n_2)$ if and only if the object represented by $n_1$ ($\in \{V_R\} \cup V_I$) depends on the object represented by $n_2$ ($\in V_I \cup V_L$). □

³ Although we define SDG on semistructured data, SDG can also be applied to structured data. In fact, semistructured data subsumes structured data.

Example 3 Consider the object Location in Example 1 and suppose semistructured data Info is defined by Location. Given the dependency constraints for Location and its relevant objects, the SDG for Info is obtained as illustrated in Figure 1. A more comprehensive example of SDGs is presented in Section 4. □

Figure 1. The SDG for the semistructured data Info

Definition 5 Given a long label of an object $o_N$ in the form of $(o_1.label()).(\ldots).(o_N.label())$ in an SDG, the path expression $pe$ is a binary function, $pe: S \times S \rightarrow S$, where $S$ is a set of strings, such that

$pe(o_i, o_i) = o_i.label()$ ($1 \le i \le N$), and
$pe(o_i, o_j) = con(pe(o_i, o_{j-1}),\ o_j.label())$, $1 \le i < j \le N$,

where $o_1$ denotes the root node of the SDG. □

It is easy to see that $pe(V_R, o) = o.Label()$, where $V_R$ is the root node and $o \in (V_I \cup V_L)$ of the corresponding SDG.

An SDG precisely captures the structure and values of semistructured data graphically. Given below is the definition of the lexical representation of an SDG. The lexical representation of semistructured data $D$ is useful in some situations, e.g., finding an object in $D$ using a textual path expression.

Definition 6 Given an SDG = $(\{V_R\} \cup V_I \cup V_L, E, g)$, the lexical representation $L_o$, or lexical SDG, of $o \in (\{V_R\} \cup V_I \cup V_L)$ is a textual representation of the subgraph $S$ of the SDG rooted at $o$ such that $L_o = \bigcup_{\forall i} \{pe(o, o_i)\}$, where $o_i$ is a leaf node in $S$. □

Example 4 Consider the SDG of Info in Example 3. The lexical representation of Info is obtained as follows. For the node Address,

$pe(\text{Address}, 84057) = con(pe(\text{Address}, \text{"Zip code"}),\ 84057.label())$
$= con(con(pe(\text{Address}, \text{Address}),\ \text{"Zip code"}.label()),\ 84057)$
$= con(con(\text{Address}.label(),\ \text{ZipCode}),\ 84057)$
$= con(\text{Address.ZipCode},\ 84057)$
$= \text{Address.ZipCode.84057}$

$pe(\text{Address}, \text{Orem}) = \text{Address.City.Orem}$

$L_{\text{Address}} = \{\text{Address.Street."11 E. Pine Lane"}\} \cup \{\text{Address.City.Orem}\} \cup \{\text{Address.ZipCode.84057}\}$
$= \text{Address}.\{\text{Street."11 E. Pine Lane"},\ \text{City.Orem},\ \text{ZipCode.84057}\}$

Similarly, the lexical SDG of Info, $L_{\text{Info}}$, can be obtained:

$L_{\text{Info}} = \text{Info.Location}.\{\text{Address}.\{\text{Street."11 E. Pine Lane"},\ \text{City.Orem},\ \text{ZipCode.84057}\},\ \text{TimeZone.MST}\}$. □

Note that given an SDG = $(\{V_R\} \cup V_I \cup V_L, E, g)$, the lexical representation of the SDG is obtained by $\bigcup_{\forall i} \{pe(V_R, o_i)\} = \bigcup_{\forall i} \{o_i.Label()\}$, where $o_i \in V_L$. The lexical representation $L$ of an SDG is more difficult to read than the SDG itself. To resolve this problem, $L$ can be formatted using indentation and line feeds. An example of the formatted lexical representation of the SDG for the semistructured data Guide in Figure 2 is shown below:

Guide.{
    restaurant#1.{
        category.“gourmet”,
        name.“Chef Chu”,
        address.{
            street.“El Camino Real”,
            city.“Palo Alto”,
            zipcode.“92310”
        },
        nearby.{ restaurant#2, restaurant#3 }
    },
    restaurant#2.{
        category.“Vietnamese”,
        name.“Saigon”,
        address.{ “Mountain View”, “Menlo Park” },
        nearby.restaurant#1,
        zipcode.“92310”,
        price.“cheap”
    },
    restaurant#3.{
        price.“cheap”,
        name.“McDonald’s”,
        category.“fast food”
    }
}

Figure 2. The SDG for the semistructured data Guide

Definition 7 Given semistructured data $D$ and its corresponding semistructured data graph SDG = $(\{V_R\} \cup V_I \cup V_L, E, g)$, the schema $S$ of $D$ in the SDG is $S = \bigcup_{\forall i} \{pe(o, o_i)\}$, where $o_i$ is any object in $V_I$ and $o$ is the ancestor of $o_i$ which is a child of $V_R$. □

Example 5 Consider Info in Example 3 again. The schema $S$ of Info is

$S = \{pe(\text{Location}, \text{Location})\} \cup \{pe(\text{Location}, \text{Address})\} \cup \{pe(\text{Location}, \text{"Time Zone"})\}$
$\quad \cup\ \{pe(\text{Location}, \text{"Street Addr"})\} \cup \{pe(\text{Location}, \text{City})\} \cup \{pe(\text{Location}, \text{"Zip code"})\}$
$= \{\text{Location}\} \cup \{\text{Location.Address}\} \cup \{\text{Location.TimeZone}\}$
$\quad \cup\ \{\text{Location.Address.Street}\} \cup \{\text{Location.Address.City}\} \cup \{\text{Location.Address.ZipCode}\}$
$= \text{Location}.\{\text{Address}.\{\text{Street}, \text{City}, \text{ZipCode}\},\ \text{TimeZone}\}$ □
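Path expressions, lexical representations, and schemas can all be computed from one small graph encoding. The Python sketch below does this for the Info SDG of Example 3. The dictionary `SDG` (node identifier mapped to its label and children) and the helper names `path`, `leaves`, `internal_nodes`, `lexical`, and `schema` are hypothetical, and the results are returned as flat sets of dotted strings rather than in the factored form A.{B, C}.

```python
from typing import List, Set

# Hypothetical encoding of the Info SDG: identifier -> (label, identifiers of children).
SDG = {
    "Info": ("Info", ["Location"]),
    "Location": ("Location", ["Address", "Time Zone"]),
    "Address": ("Address", ["Street Addr", "City", "Zip code"]),
    "Time Zone": ("TimeZone", ["MST"]),
    "Street Addr": ("Street", ["11 E. Pine Lane"]),
    "City": ("City", ["Orem"]),
    "Zip code": ("ZipCode", ["84057"]),
    "11 E. Pine Lane": ('"11 E. Pine Lane"', []),
    "Orem": ("Orem", []),
    "84057": ("84057", []),
    "MST": ("MST", []),
}


def con(arg1: str, arg2: str) -> str:
    return arg1 + "." + arg2


def path(src: str, dst: str) -> List[str]:
    """Nodes on the directed path from src to dst, or [] if dst is not reachable."""
    if src == dst:
        return [src]
    for child in SDG[src][1]:
        tail = path(child, dst)
        if tail:
            return [src] + tail
    return []


def pe(src: str, dst: str) -> str:
    """Path expression of Definition 5, built by folding con over the labels on the path."""
    nodes = path(src, dst)
    if not nodes:
        raise ValueError(f"{dst!r} is not reachable from {src!r}")
    expr = SDG[nodes[0]][0]
    for node in nodes[1:]:
        expr = con(expr, SDG[node][0])
    return expr


def leaves(node: str) -> List[str]:
    children = SDG[node][1]
    return [node] if not children else [leaf for c in children for leaf in leaves(c)]


def internal_nodes(node: str) -> List[str]:
    children = SDG[node][1]
    return [] if not children else [node] + [n for c in children for n in internal_nodes(c)]


def lexical(node: str) -> Set[str]:
    """Lexical representation of Definition 6 as a flat set of path expressions."""
    return {pe(node, leaf) for leaf in leaves(node)}


def schema(root: str) -> Set[str]:
    """Schema of Definition 7: path expressions from each child of the root to the internal nodes below it."""
    return {pe(child, n) for child in SDG[root][1] for n in internal_nodes(child)}


address_lexical = lexical("Address")
info_schema = schema("Info")
```

Here `lexical("Address")` contains the three path expressions whose union forms the lexical representation of Address in Example 4, and `schema("Info")` contains the six path expressions listed in Example 5.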

4 The High-level Stack Machine

In this section, we present a high-level stack machine (HSM) as a tool which extracts the data structures embedded in an HTML document H, and demonstrate the construction of the SDG for H, specified by a URL, using HSM⁴. To construct the SDG for a given HTML document, HSM processes the HTML elements of the document. HSM is implemented in the Java language environment, as shown in Figure 3, and is still evolving. The HTML elements to be handled by HSM can be determined by users through a configuration file, as discussed earlier. The current version of HSM is implemented with a small set of HTML elements which are commonly used in HTML documents, and is continuously being enhanced to handle more generic HTML documents.

⁴ The SDG is constructed by HSM in various formats, all of which are equivalent with respect to the structure that the SDG represents. The formats include the lexical SDG, a textual definition of the SDG using long labels of the objects or object identifiers, and a graphical display of the SDG on the screen. Each of these output formats can be chosen by the user. See Figure 3 for an example of the HSM user interface.

Figure 3. An HSM user interface

As mentioned earlier, HSM is built based on pushdown automata (PDA). We are particularly interested in employing PDA since PDA are relatively easy to comprehend, design, and implement compared with Turing machines, and have sufficient power to construct the SDGs for HTML documents. (We assume that readers are familiar with PDA.)

With respect to the design of HSM, we classify the HTML elements into two types as follows:

(1) elements which begin with a start-tag and end with an end-tag, and
(2) elements whose end-tags are either optional or do not exist.

Most of the commonly used HTML elements are of type 1. Such elements include the document element HTML, head element HEAD, body BODY, headings H1 ... H6, title TITLE, anchor A, some of the block structuring elements such as the list elements OL and UL, block quote BLOCKQUOTE, preformatted text PRE, directory list DIR, menu list MENU, address element ADDRESS, and phrase markups such as EM, B, I and STRONG. Examples of elements of type 2 are some of the block structuring elements such as the list element LI, the definition list elements DT and DD, line break BR, and horizontal rule HR. Some HTML elements, such as IMG, are not accompanied by an end-tag, but the closing angle bracket indicates where the element ends. Therefore, we categorize IMG as an element of type 1.

Besides the two types of HTML elements mentioned above, we treat some elements as ‘unproductive’ with respect to the SDG, i.e., they do not generate an output on an SDG in our current version of HSM, and are simply ignored by HSM. UPE, the set of unproductive elements, is determined by the user on demand by excluding its elements from the configuration file, as discussed earlier. In the current version of HSM, elements such as HR, B, CITE, DIR, TT, DL, DT, DD, and comments are included in UPE. In addition, we treat the set of elements of type 2 as a proper subset of UPE. However, when we consider a query language for SDGs in our future work, it may be worthwhile to adjust UPE accordingly to give more weight to styled text than to plain text. (Styled text is surrounded by character style tags and their corresponding end-tags.)

During the construction process of an SDG, the following two rules of HSM are applied to the elements in an HTML document H:

• Skip an element $e$ if $e \in$ UPE. No stack operation is necessary, and no changes occur in the SDG of H.

• For an element $e'$ of type 1, push the corresponding stack symbol (defined in Definition 8) of $e'$ onto the stack whenever the start-tag of $e'$ is encountered, and pop from the stack of HSM whenever the corresponding end-tag of $e'$ is detected. In general, given the top-of-stack symbol $p$ and the SDG being constructed, which includes a node $o_p$ created for $p$, whenever a new stack symbol is pushed, a node for it is attached to the SDG as a child $c$ of $o_p$ with the edge from $c$ to $o_p$. (We assume the existence of the function append() in HSM, in addition to the ordinary stack operations push() and pop().)
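The two rules above can be illustrated with a short Python sketch built on the standard library's `html.parser`. It is not the Java implementation of HSM: the class name `StackSketch`, the function `build_sdg`, the root label "DOC", and the representation of a node as a (label, children) pair are all assumptions made for the illustration. The set `productive` plays the role of the configuration file, i.e., every element not listed in it is treated as a member of UPE, and it is assumed to contain only elements of type 1.

```python
from html.parser import HTMLParser
from typing import Iterable, List, Tuple

Node = Tuple[str, List]  # (label, list of child nodes)


class StackSketch(HTMLParser):
    """Builds a tree of productive elements with an explicit stack (hypothetical sketch)."""

    def __init__(self, productive: Iterable[str], root_label: str = "DOC"):
        super().__init__()
        self.productive = {tag.lower() for tag in productive}
        self.root: Node = (root_label, [])
        self.stack: List[Node] = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag not in self.productive:       # rule 1: skip unproductive elements
            return
        node: Node = (tag.upper(), [])
        self.stack[-1][1].append(node)       # attach as a child of the node on top of the stack
        self.stack.append(node)              # rule 2: push on a start-tag

    def handle_endtag(self, tag):
        if tag not in self.productive:
            return
        # rule 2: pop on the matching end-tag; stray end-tags are ignored
        if len(self.stack) > 1 and self.stack[-1][0] == tag.upper():
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:                             # plain text is appended, never pushed
            self.stack[-1][1].append((text, []))


def build_sdg(html: str, productive: Iterable[str], root_label: str = "DOC") -> Node:
    parser = StackSketch(productive, root_label)
    parser.feed(html)
    parser.close()
    return parser.root


page = ("<html><head><title>BYU CS Department Homepage</title></head>"
        "<body><hr><a href='x'>Faculty</a></body></html>")
tree = build_sdg(page, ["html", "head", "title", "body", "a"])
```

In this run the HR element is skipped because it is not listed as productive, while the text inside TITLE and A is attached as leaf nodes under the element that is on top of the stack when the text is read.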

Definition 8 The high-level stack machine HSM is a system of seven components: a set of states $Q$, a set of input string symbols $\Sigma$, a set of stack symbols $\Gamma$, a set of production rules $\delta$, an initial state $q_{BOF}$, an initial stack symbol, and a set of final states $F$, where

1. $Q$ is a finite set of states: $q_{BOF}$, $q_1$, $q_A$, $q_{A\_ATTR}$, $q_B$, $q_{B\_ATTR}$, and $q_{EOF}$. State symbol $q_{BOF}$ denotes the beginning-of-the-file state and $q_{EOF}$ denotes the end-of-the-file state. States $q_A$ and $q_{A\_ATTR}$ are used for anchors and APPLETs, and $q_B$ and $q_{B\_ATTR}$ are used for processing the IMG and META elements. Also, when the machine is in state $q_1$, HSM is not currently processing the elements IMG, META, A, or APPLET (as shown in the production rules of HSM below).

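2. $\Sigma$ is a finite set of input string symbols. It comprises the start-tags and end-tags of the HTML elements processed by HSM, the attribute names HTTP-EQUIV, NAME, and CONTENT, among others, and the symbols EOF and NT.

The production rules of HSM, $\delta$, include the following, where ‘.’ is used to separate different stack symbols:

$\delta(\ ,\ ,\ \text{NAME}) = (q_A,\ \text{DES})$
$\delta(q_A,\ ,\ \text{DES}) = (q_1,\ \varepsilon)$
$\delta(q_A,\ \text{NT},\ \text{DES}) = (q_A,\ \text{DES})$ †
$\delta(q_1,\ ,\ \text{ARCHIVE}) = (q_A,\ \text{DES})$
$\delta(q_A,\ >,\ \text{CODEBASE}) = (q_A,\ \text{DES})$
$\delta(q_A,\ >,\ \text{CODE}) = (q_A,\ \text{DES})$
$\delta(q_A,\ ,\ \text{DES}) = (q_1,\ \varepsilon)$

$\delta(q_1,\ ,\ \text{IMG}) = (q_1,\ \varepsilon)$
$\delta(q_B,\ >,\ \text{SRC}) = (q_1,\ \varepsilon)$
$\delta(q_B,\ >,\ \text{ALT}) = (q_1,\ \varepsilon)$
$\delta(q_B,\ \text{SRC},\ \text{IMG}) = (q_B,\ \text{SRC.IMG})$
$\delta(q_B,\ \text{SRC},\ \text{ALT}) = (q_B,\ \text{SRC})$
$\delta(q_B,\ \text{ALT},\ \text{IMG}) = (q_B,\ \text{ALT.IMG})$
$\delta(q_B,\ \text{ALT},\ \text{SRC}) = (q_B,\ \text{ALT})$
$\delta(q_B,\ \text{“},\ \text{SRC}) = (q_{B\_ATTR},\ \text{DES.SRC})$
$\delta(q_B,\ \text{“},\ \text{ALT}) = (q_{B\_ATTR},\ \text{DES.ALT})$
$\delta(q_{B\_ATTR},\ \text{NT},\ \text{DES}) = (q_{B\_ATTR},\ \text{DES})$ †
$\delta(q_{B\_ATTR},\ \text{”},\ \text{DES}) = (q_B,\ \varepsilon)$

$\delta(q_1,\ ,\ \text{META}) = (q_1,\ \varepsilon)$
$\delta(q_B,\ >,\ \text{HTTP-EQUIV}) = (q_1,\ \varepsilon)$
$\delta(q_B,\ >,\ \text{NAME}) = (q_1,\ \varepsilon)$
$\delta(q_B,\ >,\ \text{CONTENT}) = (q_1,\ \varepsilon)$
$\delta(q_B,\ \text{HTTP-EQUIV},\ \text{META}) = (q_B,\ \text{HTTP-EQUIV.META})$
$\delta(q_B,\ \text{HTTP-EQUIV},\ \text{NAME}) = (q_B,\ \text{HTTP-EQUIV})$
$\delta(q_B,\ \text{HTTP-EQUIV},\ \text{CONTENT}) = (q_B,\ \text{HTTP-EQUIV})$
$\delta(q_B,\ \text{NAME},\ \text{META}) = (q_B,\ \text{NAME.META})$
$\delta(q_B,\ \text{NAME},\ \text{HTTP-EQUIV}) = (q_B,\ \text{NAME})$
$\delta(q_B,\ \text{NAME},\ \text{CONTENT}) = (q_B,\ \text{NAME})$
$\delta(q_B,\ \text{CONTENT},\ \text{META}) = (q_B,\ \text{CONTENT.META})$
$\delta(q_B,\ \text{CONTENT},\ \text{HTTP-EQUIV}) = (q_B,\ \text{CONTENT})$
$\delta(q_B,\ \text{CONTENT},\ \text{NAME}) = (q_B,\ \text{CONTENT})$
$\delta(q_B,\ \text{“},\ \text{HTTP-EQUIV}) = (q_{B\_ATTR},\ \text{DES.HTTP-EQUIV})$
$\delta(q_B,\ \text{“},\ \text{NAME}) = (q_{B\_ATTR},\ \text{DES.NAME})$
$\delta(q_B,\ \text{“},\ \text{CONTENT}) = (q_{B\_ATTR},\ \text{DES.CONTENT})$
$\delta(q_{B\_ATTR},\ \text{NT},\ \text{DES}) = (q_{B\_ATTR},\ \text{DES})$ †
$\delta(q_{B\_ATTR},\ \text{”},\ \text{DES}) = (q_B,\ \varepsilon)$

$\delta(q_1,\ ,\ \text{BODY}) = (q_1,\ \text{ADDRESS.BODY})$
$\delta(q_1,\ ,\ \text{ADDRESS}) = (q_1,\ \varepsilon)$

In addition, in any state and with any top-of-stack symbol, reading EOF moves the machine to $q_{EOF}$; and in any state with any top-of-stack symbol other than DES, reading NT leaves both the state and the stack unchanged.

† In this case, i.e., when the current top-of-stack symbol is DES, nothing is pushed onto the stack, but NT is appended to the SDG.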

Each rule of the form $\delta(state_1,\ INPUT,\ TOS_1) = (state_2,\ TOS_2)$ is interpreted as follows: “the automaton is currently in state $state_1$ with the top-of-stack symbol $TOS_1$. After reading the input symbol $INPUT$, the automaton replaces $TOS_1$ by $TOS_2$, i.e., pops $TOS_1$ and pushes $TOS_2$ onto the stack, and enters state $state_2$.” Hence, in the case of $\delta(q_1,\ \texttt{<HEAD>},\ \text{HTML}) = (q_1,\ \text{HEAD.HTML})$, the stack symbol HEAD is pushed on top of HTML on the stack. In the case of the rule that reads the end-tag of HEAD in $q_1$ with HEAD on top of the stack and yields $(q_1,\ \varepsilon)$, HEAD is popped, nothing is pushed onto the stack, and the machine stays in $q_1$.

A wildcard symbol is used as “syntactic sugar” to simplify the notation. In the top-of-stack position of a rule it denotes any stack symbol, so that the rule which reads the start-tag of TABLE in $q_1$ with the wildcard on top of the stack, and replaces the wildcard by TABLE followed by the wildcard, is interpreted as “regardless of what the top-of-stack symbol is, when the machine is in $q_1$ and the input string symbol is the start-tag of TABLE, push TABLE onto the stack and remain in the same state.” Furthermore, the wildcard with subscript 1 denotes any stack symbol except DES.

5. $q_{BOF} \in Q$ is the initial state.
6. The initial stack symbol is an element of $\Gamma$.
7. $F \subseteq Q$ is the set of final states, i.e., $\{q_{EOF}\}$. □

Figure 4. SDG of WWW.CS.BYU.EDU
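The production rules can be read as a lookup table keyed by (state, input symbol, top-of-stack symbol). The Python sketch below encodes only the IMG attribute rules of Definition 8 and runs them on a short token sequence. Several things are assumptions made for the illustration: the token names `OPEN_QUOTE` and `CLOSE_QUOTE` stand in for the opening and closing quote symbols, the start configuration (state qB with IMG already on the stack) is assumed, and the attribute value "logo.gif" is made up.

```python
from typing import Dict, List, Optional, Tuple

Rule = Tuple[str, Tuple[str, ...]]  # (next state, replacement for the top of stack, top first)

# Subset of the rules for IMG attributes; an empty tuple means nothing is pushed.
RULES: Dict[Tuple[str, str, str], Rule] = {
    ("qB", "SRC", "IMG"): ("qB", ("SRC", "IMG")),
    ("qB", "ALT", "IMG"): ("qB", ("ALT", "IMG")),
    ("qB", "SRC", "ALT"): ("qB", ("SRC",)),
    ("qB", "ALT", "SRC"): ("qB", ("ALT",)),
    ("qB", "OPEN_QUOTE", "SRC"): ("qB_ATTR", ("DES", "SRC")),
    ("qB", "OPEN_QUOTE", "ALT"): ("qB_ATTR", ("DES", "ALT")),
    ("qB_ATTR", "NT", "DES"): ("qB_ATTR", ("DES",)),
    ("qB_ATTR", "CLOSE_QUOTE", "DES"): ("qB", ()),
    ("qB", ">", "SRC"): ("q1", ()),
    ("qB", ">", "ALT"): ("q1", ()),
}


def run(tokens: List[Tuple[str, Optional[str]]],
        state: str = "qB",
        stack: Optional[List[str]] = None) -> Tuple[str, List[str], List[str]]:
    """Apply the rules to (symbol, text) tokens; the top of the stack is the end of the list."""
    stack = list(stack) if stack is not None else ["IMG"]
    appended: List[str] = []                 # text that would be appended to the SDG
    for symbol, text in tokens:
        top = stack[-1]
        state, replacement = RULES[(state, symbol, top)]
        stack.pop()                          # replace TOS1 ...
        stack.extend(reversed(replacement))  # ... by TOS2, whose first symbol ends up on top
        if symbol == "NT" and top == "DES" and text is not None:
            appended.append(text)            # NT with DES on top is appended, not pushed
    return state, stack, appended


tokens = [("SRC", None), ("OPEN_QUOTE", None), ("NT", "logo.gif"),
          ("CLOSE_QUOTE", None), (">", None)]
final_state, final_stack, sdg_text = run(tokens)
```

After the closing angle bracket the sketch is back in q1 with only IMG left on the stack, which is the configuration in which the rule that pops IMG in state q1 applies.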

Example 6 Consider the Web page of the Computer Science Department at Brigham Young University, whose URL is WWW.CS.BYU.EDU. WWW.CS.BYU.EDU is created as the root node of the SDG, and the top-level element directly depends on two other elements. Furthermore, the first of these directly depends on an element which directly depends on the string value “BYU CS Department Homepage”, in addition to five further elements, whereas the second directly depends on two elements of one kind, five anchor elements, and one element of another kind. The complete SDG of WWW.CS.BYU.EDU is shown in Figure 4, and the SDG constructed by HSM in Figure 5. □

Figure 5. SDG of WWW.CS.BYU.EDU by HSM

Proposition 1 Given an HTML document H, HSM always halts.

Proof. Note that the final state of HSM is $q_{EOF}$, and HSM enters $q_{EOF}$ when it encounters the closing tag of the document or EOF. Since the reading head of HSM continues to move forward during the process of constructing an SDG, and will eventually encounter EOF if the closing tag is missing in H, HSM always terminates. □

Proposition 2 Given an HTML document H with $N_W$ words, the time complexity of constructing the SDG for H using HSM is proportional to $N_W$.

Proof. Recall that HSM processes H in one direction. While moving forward and detecting a word in H, HSM determines whether the word is an HTML tag. If it is, then the stack and the SDG are modified accordingly. Otherwise, the word is discarded. Hence, the proof. □

Proposition 3 Given an HTML document H with $N_W$ words, the space complexity of HSM is proportional to $N_W$.

Proof. In the worst case, when all the start-tags are placed before any end-tag in H, at most $N_W$ stack cells are required to keep all the start-tags on the stack until EOF is encountered. Hence, the proof. □

5 Conclusion

In this paper, we view an HTML document H as the textual representation of semistructured data embedded within H. We present a graphical data model, called the semistructured data graph (SDG), which is based on the notion of the dependency relationships among the data objects in semistructured data D, to describe the structure of D. The SDG of an HTML document precisely captures the structure and textual data embedded in the document. HSM, a high-level stack machine, is introduced as a tool to extract the structures of HTML documents.

At present, we continue to extend HSM so that it is able to handle more general HTML documents, including forms. The recent rapid emergence of Java applets and the use of forms in HTML documents potentially make accessing information on the Web beyond the given HTML document more difficult. We intend to investigate how this new trend affects our current approach. We also plan to design a logic query language for querying SDGs.

References

[1] S. Abiteboul. Querying Semi-structured Data. In Proceedings of the 6th International Conference on Database Theory, 1997.
[2] S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. Wiener. The Lorel Query Language for Semistructured Data. Journal on Digital Libraries, 1(1), 1996.
[3] P. Atzeni, G. Mecca, and P. Merialdo. To Weave the Web. In Proceedings of the International Conference on Very Large Data Bases (VLDB), 1997.
[4] P. Buneman, S. Davidson, M. Fernandez, and D. Suciu. Adding Structure for Unstructured Data. In Proceedings of the International Conference on Database Theory, 1997.
[5] P. Buneman, S. Davidson, G. Hillebrand, and D. Suciu. A Query Language and Optimization Techniques for Unstructured Data. In SIGMOD, 1996.
[6] R. Goldman, S. Chawathe, A. Crespo, and J. McHugh. A Standard Textual Interchange Format for the Object Exchange Model (OEM). Technical Report, Stanford University, 1996.
[7] Network Working Group. HTML - 2.0. Request for Comments: #1866, November 1995.
[8] J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha, and A. Crespo. Extracting Semistructured Information from the Web. In Proceedings of the Workshop on Management of Semistructured Data, June 1997.
[9] D. Konopnicki and O. Shmueli. W3QS: A Query System for the World-Wide Web. In Proceedings of the 21st VLDB, pages 54–65, Sept. 1995.
[10] L. Lakshmanan, F. Sadri, and I. Subramanian. A Declarative Language for Querying and Restructuring the Web. In Proceedings of the Post-ICDE IEEE Workshop on Research Issues in Data Engineering, 1996.
[11] S.-J. Lim and Y.-K. Ng. Vertical Fragmentation and Allocation in Distributed Deductive Database Systems. Information Systems, 22(1):1–24, 1997.
[12] A. Mendelzon, G. Mihaila, and T. Milo. Querying the World Wide Web. In Proceedings of the Conference on Parallel and Distributed Information Systems, 1996.
[13] S. Nestorov, S. Abiteboul, and R. Motwani. Inferring Structure in Semistructured Data. In Proceedings of the Workshop on Management of Semistructured Data, May 1997.
[14] D. Quass, A. Rajaraman, Y. Sagiv, J. Ullman, and J. Widom. Querying Semistructured Heterogeneous Information. In Proceedings of the DOOD ’95 Conference, pages 319–344, December 1995.
[15] B. Yuwono and D. Lee. WISE: A World Wide Web Resource Database System. IEEE Transactions on Knowledge & Data Engineering, 8(4):548–554, 1996.
