Studying the XML Web: Gathering Statistics from an XML Sample

Denilson Barbosa
Department of Computer Science, University of Calgary, 2500 University Drive NW, Calgary, Alberta, T2N 1N4, Canada

Laurent Mignet
IBM India Research Laboratory, Block 1 I.I.T., Hauz Kanz, New Delhi, India

Pierangelo Veltri
Department of Experimental and Clinical Medicine, Magna Graecia University of Catanzaro, Campus di Germaneto, Viale Europa Germaneto, 88100, Catanzaro, Italy

Abstract. XML has emerged as the language for encoding and exchanging data on the web and has attracted considerable interest both in industry and in academia. Nevertheless, to date, little is known about the XML documents published on the web. This article presents a comprehensive analysis of a sample of about 200,000 XML documents publicly available on the web, and is the first study of its kind. We study the distribution of XML documents across the web in several ways; moreover, we provide a detailed characterization of the structure of real XML documents. Our results provide valuable input to the design of algorithms, tools and systems that use XML in one form or another.

Keywords: World Wide Web, XML, XML web, XML Documents, XML processing tools.
© 2006 Kluwer Academic Publishers. Printed in the Netherlands.

1. Introduction
The advent of the World Wide Web (web for short) enabled the sharing of information at an unprecedented scale, both through content publishing for human consumption and through data exchange among applications. The tremendous success and popularity of the web is due in great part to the adoption of the Hypertext Markup Language (HTML) [37], proposed by the World Wide Web Consortium (W3C) as the standard format for content representation. Besides providing a means for structuring hypertext and multimedia content for visual presentation, HTML enables the exchange of data among web agents (humans or computer applications) via forms and simple data transfer operations supported by the Hypertext Transfer Protocol (HTTP) [19]. However, HTML has a fixed set of markup tags, which limits content authoring; moreover, although forms are adequate for simple transactions, they do not scale up to complex data exchange transactions involving
WWWJ.tex; 20/07/2006; 16:14; p.1
several agents, such as those required in common business-to-business transactions.

In response to these limitations, the W3C introduced the Extensible Markup Language (XML) [10], a simple and flexible text format derived from the Standard Generalized Markup Language (SGML) [23]. XML plays a key role in enabling web services, an emerging framework for service-oriented computation on the web that provides data exchange capabilities beyond those of HTML and HTTP. Essentially, web services represent the promise of bringing to distributed computation the flexibility that the web has brought to content sharing. The ultimate goal is to have collections of network-resident software services that are accessible via standardized protocols and whose functionality can be automatically discovered and integrated into applications or composed to form other services [21].

Among the several challenges that arise in realizing the promise of web services, ranging from the development of the right middleware infrastructure to reasoning about web services based on their descriptions, the development of appropriate data management techniques for XML is paramount. As a consequence, XML has received considerable attention from the database community, which typically views XML documents as (semistructured) data on the web [1]. For instance, there has been work on storing XML data, both by developing new technologies (e.g., [18, 26]) and by leveraging mature ones (e.g., [9, 41]), on new indexing and querying techniques (e.g., [30, 28]), and on ways of updating XML data (e.g., [36, 43]). Furthermore, the database industry has also adopted XML aggressively: all major DBMS vendors already provide support for XML in one form or another [22, 31, 35], and “native” XML data management systems are also commercially available [42].
It is clear that the massive ongoing effort in building algorithms, tools and systems for dealing with XML could benefit from a clear description of the kinds of documents found in the subset of the web made of XML documents only, which we call the XML web. While the web as a whole has been extensively studied and we have a good understanding of its properties (e.g., shape, size and connectivity) [17, 27], the same cannot be said of the XML web. Thus, to date, the development of XML tools and systems has been mostly guided by folklore (e.g., XML documents are “shallow”), by studying the few well-known publicly available XML documents (e.g., [40, 49]), or by some proprietary XML content. We note that, because XML is in fact a meta-language, the XML web can be stratified into document classes, typically specified by conceptual schemas such as Document Type Definitions [10] (DTDs) or XML Schema [44] specifications, or by associating particular file
extensions with different kinds of documents (e.g., the “.rdf” extension is associated with documents from the semantic web initiative [45]). Characterizing these classes of documents potentially gives a much more accurate picture of the actual content available on the XML web than is possible for the HTML web.

In this article, which extends an earlier report [32], we study a sample of the XML web, consisting of about 200,000 documents publicly available on the web, crawled by Xyleme [4, 50, 51]. Our results discuss the XML web as a whole, providing information about its pervasiveness, the kinds of contents found in it, and its connectivity, as well as structural information about the documents in the XML web. We note that while our results contribute to the development of algorithms, tools and systems (as discussed above), they are also invaluable for the generation of synthetic data sets for testing and benchmarking purposes, which are areas of active research [48, 47].

Outline
The article is organized as follows. We start by discussing related work and briefly describing the Xyleme crawler and the sample we used in our work in Sections 2 and 3. Our results are divided into statistics about the XML web as a whole (Sections 4 and 5), and statistics about the structure and contents of the documents found on the web (Sections 6 and 7). Section 4 covers the distribution of documents in the XML web and its connectivity properties, while in Section 5 we identify the kinds of XML content found in our sample, and discuss the use of schemas and namespaces. In Section 6, we discuss properties such as the number of (element, attribute and text) nodes in the XML documents, their depth and their fan-out. Finally, Section 7 presents statistics about recursive content in the XML web. We conclude and discuss directions for future work in Section 8.
2. Related Work
There are several organizations that periodically release statistics about the size and the shape of the Internet [13, 24]. The data collected by these organizations comes primarily from accessing network addresses found in Domain Name Service (DNS) servers, and thus is very accurate. Those reports, however, count the number of computers that belong to a given Internet domain, regardless of how much (if any) web content (XML or otherwise) is published by them. Our work, on the other hand, focuses on the XML web only. Also, we give the distributions of the number of sites, number of documents, and volume of published content according to Internet domains.
The connectivity and structure of the web as a whole have been studied extensively [17, 27]. Those results are primarily aimed at studying web algorithmics; moreover, results of that nature have been shown to improve the accuracy of search engines [11]. Statistics on XML document contents and structure have been gathered to guide query optimization on single documents [20]. Choi [15] reports a study on the kinds of constructs defined in 60 DTDs found on the web, at about the same time our sample was collected. Although we were able to find references to 75 different DTDs in our sample, our goals in this work differ from those of Choi [15], in the sense that we are interested in the usage of DTDs (and XML Schema specifications) on the web, rather than the kinds of rules defined in these schemas. Furthermore, we present several statistical results that cannot be derived from analyzing conceptual schemas alone, for obvious reasons. More recently, Jan Bex et al. [8] have conducted a similar study considering both DTDs and XML Schema specifications found on the web. In a sense, both studies above arrive at the same conclusion: the kinds of schemas defined on the web are somewhat simple, compared to the power of these schema formalisms.
3. The XML Documents Sample
The sample of the XML web used for obtaining the results we present in this article consists of 190,417 XML documents that combined represent approximately 843MB of XML content. These documents come from 19,254 different web sites, and were randomly chosen from Xyleme’s repository of publicly available XML documents, which is populated by crawling the web. At the time our sample was collected (February 2002), Xyleme’s public repository contained about 500,000 XML documents, while its private repository, consisting of documents provided by Xyleme’s customers, contained 700,000 documents. Using fingerprinting techniques [34], we determined that 26,989 documents (nearly 20% of the total) are replicas of other documents in the sample. We note this rate is lower than usual replica rates on the web (e.g., Choo and Garcia-Molina [14] report that 36% of the documents in a large crawl were exact replicas of other documents).

While describing the internal architecture of Xyleme’s crawler is outside the scope of this article (see [33, 51] for details), we briefly describe how the crawler operates next. Each document is uniquely identified by its URL. The crawling starts with an initial set of pages (called seeds), from which all URLs are extracted and stored in a link matrix. Eventually, the pages referred to by entries in the matrix are
collected, and more URLs are extracted and added to the matrix. In order to increase its coverage, Xyleme’s crawler fetches both HTML and XML pages, and collects URLs from all of them; however, only well-formed XML documents are added to the repository. For technical reasons, XHTML documents were not considered by Xyleme as XML documents, mostly because in an XHTML document there is no clear separation between presentation and data. Xyleme’s crawler continuously scans the link matrix looking for pages to fetch or refresh; the decision on whether or not to fetch a page is guided by the minimization of a cost function that prioritizes the freshness of XML pages. Some parameters of this optimization are the importance of the page [2], the estimated page change frequency and the resources available to the crawler, such as bandwidth. We note that the design and implementation of web crawlers is an interesting and challenging problem in its own right, which has received considerable attention in the literature.

Our sample represents only a snapshot of the publicly available XML documents, and, unfortunately, there is not much we can say about its representativeness. Given the lack of reliable estimates of the size of the XML web, and the intrinsic difficulty of obtaining such estimates [3], we cannot claim our study is definitive. Nevertheless, it gives an accurate and valuable starting point for understanding the XML web. Finally, we note that our sample is available to interested researchers. Moreover, we have built a relational database containing meta-data extracted from the sample, corresponding to roughly 2.5 GB of data. All our results were obtained from this database, which is also available to those interested in it.
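The replica count reported earlier in this section relies on fingerprinting [34]. As a rough illustration only, the sketch below groups documents by a digest of their raw bytes; SHA-256 from the Python standard library is used here as a stand-in for the fingerprinting technique of [34], and the function names and the URL-to-bytes dictionary are assumptions made for this example.

```python
import hashlib
from collections import defaultdict


def find_replicas(documents):
    """Group URLs whose contents are byte-for-byte identical.

    `documents` maps a URL to the raw bytes of the document (hypothetical
    input format). SHA-256 is a stand-in for the fingerprints of [34].
    """
    groups = defaultdict(list)
    for url, content in documents.items():
        groups[hashlib.sha256(content).hexdigest()].append(url)
    # Keep only the digests shared by more than one URL.
    return {digest: urls for digest, urls in groups.items() if len(urls) > 1}


def count_replicas(documents):
    """Count every document beyond the first in each group as a replica."""
    return sum(len(urls) - 1 for urls in find_replicas(documents).values())
```

Exact-content hashing only detects exact replicas; near-duplicates would need a different technique.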
4. XML Documents Across the Web
We start by describing how the documents in the XML web are distributed with respect to web sites; then we show how those sites and the contents of the XML web are distributed in terms of Internet domains and geographical regions.

4.1. Distribution of XML Documents per Site
For simplicity, we group the 19,254 web sites in our sample by zones, which correspond to either the geographical region associated with the domain of the site (e.g., .fr for French sites), or a generic Internet domain (i.e., .com, .edu, .net, etc.) when no geographical information is available. Our geographical zones are as follows: Asia consists of
Figure 1. Distribution of XML Sites by Zone.
Figure 2. Distribution of XML Documents by Zone.
China, India, Indonesia, Japan, Pakistan, Taiwan, South Korea and Singapore; the European Union consists of the fifteen countries in the European Union (as of 2002); North America consists of Canada, the United States, and Mexico; and the Rest of the World represents sites from all other countries. Figure 1 shows the distribution of sites according to the zones defined above. The two major zones are .com with 5,312 sites, and the European Union, with 3,993 sites. Following those, we have .edu with 2,022 sites, .org with 1,611 sites, Asia with 1,553 sites, .net with 968 sites and North America with 890 sites. The Rest of the World zone is mainly composed of sites from the Russian Federation (314 sites), Switzerland (260), Czech Republic (251) and Norway (199).
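The zone assignment described above can be sketched as a lookup on the last label of a site's host name. This is a minimal illustration under stated assumptions: the mapping from countries to two-letter country-code domains, the exact set of generic domains treated as zones, and the function name `zone_of` are ours and are not taken from the tooling behind Figures 1 to 3.

```python
from collections import Counter
from urllib.parse import urlsplit

# Assumption: these generic domains are each reported as their own zone.
GENERIC = {"com", "edu", "net", "org", "gov", "mil"}
# Country-code domains for the countries listed in the text.
ASIA = {"cn", "in", "id", "jp", "pk", "tw", "kr", "sg"}
EUROPEAN_UNION = {"at", "be", "dk", "fi", "fr", "de", "gr", "ie",
                  "it", "lu", "nl", "pt", "es", "se", "uk"}  # 15 members in 2002
NORTH_AMERICA = {"ca", "us", "mx"}


def zone_of(url):
    """Return the zone of an absolute URL, based on its top-level domain."""
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    if tld in GENERIC:
        return "." + tld
    if tld in ASIA:
        return "Asia"
    if tld in EUROPEAN_UNION:
        return "European Union"
    if tld in NORTH_AMERICA:
        return "North America"
    return "Rest of the World"


def sites_per_zone(site_urls):
    """Count sites per zone, as in a distribution of sites by zone."""
    return Counter(zone_of(url) for url in site_urls)
```

A site such as ludiwap.co.uk falls into the European Union zone under this rule, because only the final label of the host name is inspected.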
Figure 3. Distribution of Volume of XML Content by Zone.
In purely geographical terms, we can see that North America has at least 16% of all sites (corresponding to the zones North America, .edu, .gov and .mil). Although a more accurate description cannot be given since we are unable to distinguish the geographical origin of the other generic domains, we note that at least one country from each other continent is represented in our sample: Brazil (56 sites), Cuba (1), Iran (3), South Africa (83), and Niue Island, Polynesia (39 sites).

4.2. Number of Documents per Site
We now discuss how the contents of the XML web are distributed according to the zones defined above. First, we consider the distribution of the documents (i.e., the number of documents per zone). As already mentioned, there are 190,417 XML documents and 19,254 sites in our sample. This gives an average of 9.89 documents per site. The sites with the largest number of documents are: rpmfind.net, with 12,340 documents (6.5% of the total); download.sourceforge.net, with 7,948 documents (4%); and ludiwap.co.uk, with 7,029 documents (3.7%). The distribution of documents per zone is shown in Figure 2. One can notice that the two largest sites have an interesting impact on how the contents of the XML web are distributed: a comparison between Figures 1 and 2 shows a relative increase in the participation of the .net zone, from 5% of the sites to 20% of the documents in the XML web. Figure 2 also shows that the .com and European Union zones still dominate the distribution. Figure 3 shows the distribution of the volume of XML content (i.e., the sum of the sizes of all documents) per zone. This graph shows the .com zone as the dominant one, but it also shows another increase in
Figure 4. Distribution of Documents by their Out-Degree (Power Law of Exponent 1.8).
the participation of the .net zone, now moving to second place with 35%. These two zones alone account for 53% of all documents and 76% of the volume of content on the XML web. One lesson learned from the previous results is the early adoption of XML in the open source and Linux communities, which is in line with the openness of all W3C standards, including XML.

4.3. XML Web Connectivity
A common and important study of the HTML web concerns its connectivity [27]; that is, the distribution of outgoing links from HTML pages (the nodes in the HTML web graph). We conclude the section with a similar analysis of the XML web graph. As usual, the out-degree of an XML document is defined as the number of links to other documents it contains; we measure this quantity by counting the number of attribute nodes labeled href, xmlhref, or xlink:href in each document. Figure 4 is a log-log plot of the out-degree distribution for the documents in our sample. Similarly to previous observations on the HTML web [27], we observe that the out-degrees of the documents in our sample seem to follow a power law: the fraction of XML documents with out-degree i is proportional to 1/i^x for x = 1.8. This value is derived from the slope of the line providing the best fit to the data. The average out-degree of the documents in our sample is about 11.4, while the out-degree for HTML pages is about 7.2 [27]. However, given
Figure 5. Distribution of DTDs by Zone.
the (expected) small size of our sample compared to the (unknown to us) size of the XML web, we cannot generalize this result.
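To make the measurement above concrete, the following sketch counts link attributes in one document and estimates a power-law exponent from a list of out-degrees. It is an illustration under assumptions: it uses Python's `xml.etree.ElementTree`, it matches any attribute whose local name is href (which covers href and xlink:href, but would also match href attributes in other namespaces), and it fits the exponent by ordinary least squares on the log-log histogram, which is one common way to read a slope off such a plot and not necessarily the fitting procedure used for Figure 4.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def out_degree(xml_text):
    """Count attributes labeled href, xmlhref or (any prefix):href."""
    root = ET.fromstring(xml_text)
    count = 0
    for elem in root.iter():
        for name in elem.attrib:
            # ElementTree spells namespaced names as "{uri}local".
            local = name.rsplit("}", 1)[-1]
            if local in ("href", "xmlhref"):
                count += 1
    return count


def fit_power_law_exponent(degrees):
    """Least-squares slope of log(fraction) versus log(out-degree), negated."""
    hist = Counter(d for d in degrees if d > 0)  # log(0) is undefined
    if len(hist) < 2:
        raise ValueError("need at least two distinct positive out-degrees")
    total = sum(hist.values())
    xs = [math.log(d) for d in hist]
    ys = [math.log(n / total) for n in hist.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope
```

Documents with out-degree zero are left out of the fit; this changes the intercept of the fitted line but not its slope.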
5. XML Documents: Schema, Extension and Namespace
As mentioned earlier, one distinguishing feature of the XML web is that it is stratified by classes of documents, defined by conceptual schemas. We now consider the use of the two standard schema languages defined for XML: Document Type Definitions [10] and XML Schema specifications [44]. It turns out that 48% of the XML documents in our sample link to a DTD, and, surprisingly, only 75 different DTDs are referenced in our sample. These DTDs come mostly from the .com, .org, and .net zones, as shown in Figure 5. Another surprise is that 92% of all DTD references are made to versions 1.1 or 1.2 of the Wireless Application Protocol [46]. The use of XML Schema is insignificant: only 0.09% of the documents (precisely 179) use either the attribute label “SchemaLocation” or “noNameSpaceSchemaLocation” as specified in the standard [44]. This can be partly explained by the fact that XML Schema was still a recent language, which had only just been published when our sample was collected.

We note that both standard schema mechanisms for XML define document validity by requiring that the contents of elements of a given type (in DTDs, all elements with the same label have the same type [10]) form a word in an associated regular language. For example, the DTD rule book (title, author*, year) requires that, in a valid
Figure 6. Distribution of Namespaces by web Domains. (Chart values: other http: 54%; www.w3.org: 18%; xschema: 8%; urn: 8%; purl.org: 4%; schemas.xmlsoap.org: 3%; www.govtalk.gov.uk: 3%; www.openarchive.org: 2%.)
document, the content of each element of type book be a sequence of elements e_1, e_2, ..., e_n such that the type of e_1 is title, the type of each of e_2, ..., e_{n−1} is author and the type of e_n is year. Testing if this is the case is done by testing whether the typed content of the element (i.e., the string formed by concatenating the types of each e_i) is a word in the language described by the regular expression title,author*,year. It is interesting then to consider what kinds of regular expressions or other features of these schema languages are actually used in practice. As mentioned, Choi [15] and Jan Bex et al. [8] study the kinds of DTDs and XML Schema specifications on the web, and arrive at essentially the same conclusion: the schemas used in practice are quite simple, compared to the features provided by these languages. Furthermore, to date, the only new feature of XML Schema (compared to DTDs) that is heavily used in practice is the specification of datatypes for text nodes [8] (as opposed to #PCDATA in DTDs). While we do not analyze the quality of the DTDs we found, we consider a related issue in Section 6.5: we show that the number of distinct words defining the contents of XML elements of a given type is rather small compared to the total number of elements of that same type.

Namespaces
Namespaces are a mechanism for giving a name (e.g., an element label) in an XML document a precise definition, by attaching to it a reference to a URL where that name is defined; moreover, the same name may appear associated with different namespaces in the same document. New and emerging web technologies (such as the web services framework and the semantic web initiative) benefit from this mechanism immensely. While namespaces
Figure 7. Distribution of Documents by Content Type.
Figure 8. Distribution of Volume by Content Type.
are a powerful mechanism for enriching the semantics of the XML web, their use may introduce challenges to usual document processing tasks such as document validation: it is not clear how to validate an element when the type of its subtree is defined outside the DTD. Thus, knowing how namespaces are used in the XML web is crucial. A total of 77,100 documents (40% of the sample) include at least one reference to a namespace. Figure 6 shows the origins of the most popular namespaces found in our sample.

Distribution of Documents by Content Type
We also study the documents in the XML web in terms of what kind of information they
Figure 9. Distribution of Documents by Size.
contain; to do that, we look at the extension of the associated files or the method by which they can be accessed. We distinguish the following major groups of content in this work: documents from the semantic web (file extensions “.rdf” and “.rss”); wireless application protocol [46] (WAP) documents (file extension “.wml”); XSL and XSLT documents; form-accessible documents; and indistinguishable “.xml” documents. The distribution of the documents in our sample according to the groups described above is given by Figure 7. The graph shows that most documents belong to the “.xml”, WAP, and form-accessible classes. Documents from the semantic web community also make up a large fraction of the distribution. We also give the distribution of the volume of content according to these categories (Figure 8). Contrasting Figures 7 and 8, note that although WAP documents account for a large number of documents, their combined volume is not as significant. This can be explained by the fact that WAP documents are intended for consumption by mobile devices, for which memory and communication capabilities are severely limited. Another remark worth making is the considerable volume of the semantic web class. Finally, we observe an almost insignificant volume of XML content obtained from accessing forms (i.e., the “hidden XML web”), which means that either those resources were not well-formed XML content or Xyleme’s crawler could not retrieve them. We note that even estimating the size of the hidden web (XML or otherwise) is not a trivial task and usually requires special-purpose tools [7, 38] as well as some human guidance [25].
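The grouping by content type described above can be sketched as a small classifier over document URLs. Several details are assumptions made for this example: the “.xsl” and “.xslt” extensions for the XSL and XSLT group, the use of a non-empty query string as the test for form-accessible documents, the extra “other” group for unrecognized extensions, and the function name `content_group`.

```python
import posixpath
from urllib.parse import urlsplit

# Extension-to-group table; the XSL/XSLT extensions are an assumption.
GROUP_BY_EXTENSION = {
    ".rdf": "semantic web",
    ".rss": "semantic web",
    ".wml": "WAP",
    ".xsl": "XSL/XSLT",
    ".xslt": "XSL/XSLT",
    ".xml": ".xml",
}


def content_group(url):
    """Classify a document URL into one of the content groups."""
    parts = urlsplit(url)
    if parts.query:
        # Assumption: a query string marks a document reached through a form.
        return "form-accessible"
    extension = posixpath.splitext(parts.path)[1].lower()
    return GROUP_BY_EXTENSION.get(extension, "other")
```

Checking the query string before the extension means that a URL such as a search result ending in “.xml” with parameters is counted as form-accessible; the opposite order would be an equally defensible choice.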
Figure 10. Document Categories by Size.
6. XML Documents Statistics
This section discusses the properties of the documents in our sample. We start by clustering the documents by size and comparing the distribution of element, attribute and text nodes according to this clustering; we show that the markup/content ratio in the documents we found is surprisingly high. Next, we describe the structure of the documents in the XML web by studying their tree representations; for instance, we consider their depth, the distribution of the different kinds of nodes per level in the tree, and the fan-out of the element nodes in terms of element and attribute nodes. We conclude the section with an observation about the number of distinct strings forming the content of elements in our sample.

6.1. Document Categories by Size
The sizes of the XML documents range from 10 to 500,608 bytes, for an average of 4,641 bytes. Figure 9 shows how the documents are distributed according to their sizes, on a log-log scale. The vertical lines in the figure represent, from left to right, 512, 1024 and 4096 bytes; we use these values for clustering the documents by size since they are natural candidates for disk page sizes in secondary-memory storage systems. We name the categories from C1 (documents smaller than 512 bytes) to C4 (documents larger than 4096 bytes). Figure 10 gives the distribution of documents per cluster. It is interesting to note the distribution of the different kinds of content discussed in the previous section in each cluster:
Figure 11. Distribution of Namespaces by Category. (Plot: number of documents, on a logarithmic scale, against number of namespaces, from 0 to 11, with one series for each of the categories C1 to C4.)
− C1 (48,671 documents in total): 12% “.wml” (5,871 documents), 60% “.xml” (29,556 documents), 1% “.rdf + .rss”, and 27% for other types;
− C2 (39,449 documents in total): 62% “.wml” (24,500 documents), 20% “.xml” (8,035 documents), 1% “.rdf + .rss” (681 documents), and 17% for other types;
− C3 (69,846 documents in total): 30% “.wml” (21,403 documents), 36% “.xml” (25,115 documents), 16% “.rdf + .rss” (11,765 documents), and 18% for other types;
− C4 (32,361 documents in total): 1% “.wml” (356 documents), 37% “.xml” (12,156 documents), 58% “.rdf + .rss” (18,733 documents), and 4% for other types.

The categories above reveal that most “.wml” documents are relatively small (88% of them belong to either C1 or C2), while “.rdf” and “.rss” documents are usually larger than average (96% of all such documents belong to either C3 or C4); no simple characterization is apparent for the other kinds of documents. Figure 11 shows the usage of namespaces in each of the clusters. Note that the larger documents, a large fraction of which come from the semantic web, tend to define substantially more references to namespaces than the other ones.
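The size clustering used throughout this section amounts to a lookup against the three boundaries of 512, 1024 and 4096 bytes. The sketch below shows one way to express it; the handling of documents whose size equals a boundary exactly is not specified in the text, so placing them in the larger cluster is an assumption, as are the names used.

```python
from bisect import bisect_right
from collections import Counter

BOUNDARIES = [512, 1024, 4096]          # bytes, as in Figure 9
LABELS = ["C1", "C2", "C3", "C4"]


def size_cluster(size_in_bytes):
    """Return the cluster label for a document of the given size.

    Assumption: a size equal to a boundary goes to the larger cluster.
    """
    return LABELS[bisect_right(BOUNDARIES, size_in_bytes)]


def documents_per_cluster(sizes):
    """Count documents per cluster for an iterable of sizes in bytes."""
    return Counter(size_cluster(size) for size in sizes)
```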
Figure 12. Structural vs. Textual Content: Percentage of Element, Attribute and Text Nodes by Cluster. (Bar chart: percentage of nodes, from 0% to 60%, for elements, attributes and text in each of the document size clusters C1 to C4.)
Figure 13. Structural vs. Textual Content: Relative Size in Bytes of Markup Versus Text by Cluster.
6.2. Node Distribution
At least in spirit, elements in an XML document should be viewed as a means for providing more semantics to the text they contain, and attributes should be viewed as describing or qualifying either the usage of elements or their content [10]. Our goal in this section is to characterize the balance between markup (i.e., tags delimiting elements and attributes) and content (actual text) in the documents in our sample. In what follows, we ignore “empty” text nodes (that is, text nodes with no characters except the different blank characters as defined in [10]).
Figure 14. Distribution of Documents by Depth.
Figure 12 shows the distribution of nodes of each type. It is interesting to note the following: the proportional number of element nodes decreases with document size, while the proportional number of text nodes follows an inverse trend in the first three clusters; also, the proportion of attribute nodes in the first three clusters is practically constant. In the C4 cluster, however, we have proportionally more attributes than elements (51.13% vs. 37.83%), and a much smaller fraction of text nodes (10.64%). This explains the rather odd fact that, out of 36,498,256 nodes in our sample, 14,514,673 are element nodes; 17,602,141 attribute nodes; and 4,381,442 text nodes. Thus there are 3,087,468 more attribute than element nodes. Figure 13 compares the size (in bytes) of the structural content versus the size of the textual content. For the text size we count the number of characters contained in each non-empty text node. The size of the structural part of the document is simply the size of the serialized form of the document minus the total size of the textual information in the document. As we can see, in all clusters, the structural information dominates the size of the documents. The observations above lead to the conclusion that the structural information found in XML documents is in fact dominant over the textual content. This comes as no surprise for small documents, since XML (fortunately) requires explicit closing tags for all elements in the document. However, although our results show that the content/markup ratio increases with the size of the documents, the dominance of the markup over the content and, especially, the high number of attribute nodes indicate that the notions of data (i.e., content) and meta-data (i.e., markup) may be somewhat blurred in the XML web.
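The node counts and the markup-versus-text sizes discussed above can be approximated for a single document as sketched below. The sketch assumes Python's `xml.etree.ElementTree`, which does not expose namespace declarations as attributes and discards comments, so its counts may differ slightly from those of another parser; sizes are measured in characters of the input string, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

XML_WHITESPACE = " \t\r\n"  # the blank characters defined by XML 1.0


def node_stats(xml_text):
    """Count element, attribute and non-empty text nodes in one document."""
    root = ET.fromstring(xml_text)
    elements = attributes = text_nodes = text_chars = 0
    for elem in root.iter():
        elements += 1
        attributes += len(elem.attrib)
        # elem.text precedes the first child of elem; elem.tail follows
        # elem's end tag inside its parent. Both are text nodes of the tree.
        for chunk in (elem.text, elem.tail):
            if chunk and chunk.strip(XML_WHITESPACE):
                text_nodes += 1
                text_chars += len(chunk)
    return {
        "elements": elements,
        "attributes": attributes,
        "text_nodes": text_nodes,
        "text_chars": text_chars,
        # Structural size: serialized size minus the size of the text.
        "markup_chars": len(xml_text) - text_chars,
    }
```

Because the parser resolves entity references, the text size is slightly underestimated for documents that use them.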
Figure 15. Distribution of Attribute Nodes by Level.
Figure 16. Distribution of Element Nodes by Level.
Another interesting fact we observed concerns the use of mixed element content (recall that an element has mixed content if it has both text and other elements in its content [10]). It turns out that 782,602 elements (5% of the total), spread over 138,298 documents (72% of all documents), have mixed content. We note that the results above, particularly those describing the use of attributes and mixed element content, are in sharp contrast with the current assumption in the database community that attributes and mixed element content are not prevalent, and thus that dealing with them should not be a concern. In a sense, the focus of most of the work done so far misses the majority of the content found on the XML web!
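The definition of mixed content recalled above translates directly into a check on each element. The following is a minimal sketch using Python's `xml.etree.ElementTree`, in which the text directly inside an element is spread over the element's `text` field and the `tail` fields of its children; whitespace-only text is ignored, consistent with the convention of Section 6.2, and the function names are ours.

```python
import xml.etree.ElementTree as ET

XML_WHITESPACE = " \t\r\n"


def has_mixed_content(elem):
    """True if elem has at least one child element and non-blank text."""
    if len(elem) == 0:          # no child elements, so not mixed
        return False
    chunks = [elem.text] + [child.tail for child in elem]
    return any(bool(c and c.strip(XML_WHITESPACE)) for c in chunks)


def count_mixed_elements(xml_text):
    """Count the elements with mixed content in one document."""
    root = ET.fromstring(xml_text)
    return sum(1 for elem in root.iter() if has_mixed_content(elem))
```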
Figure 17. Distribution of Text Nodes by Level
6.3. Depth
XML documents are naturally viewed as trees (see, e.g., [5]). Such a representation is often convenient when one wants to describe structural properties of documents. For instance, the level of a node in the XML tree is its distance from the root node of the (tree representing the) document (the level of the root node is 0). Similarly, the depth of an XML document is defined as the largest level among its elements. The distribution of documents according to their depth is given in Figure 14. As one can see, most documents are relatively shallow: 99% of the documents have fewer than 8 levels. The average depth is 4, and the deepest document has 135 levels. There are 1,986 documents whose depth is zero: 1,671 documents which consist of a single empty element node, and 377 other documents that have a single element with some textual content.

Figures 15, 16 and 17 present the distribution of attribute, element and text nodes per level in the XML tree. We note that the second level contains (on average) more attributes than any other level. In fact, 89% of all attributes are found in the first 3 levels of the documents. A similar pattern is also observed for element nodes and text nodes: 77% of all element nodes and 61% of all text nodes are found in the first 3 levels of the documents (see Figures 16 and 17, respectively). The next two sections analyze these distributions further to study the fan-out of the element nodes in terms of attributes and child elements.
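The definitions of level and depth given above can be illustrated with a short traversal. This sketch, which assumes Python's `xml.etree.ElementTree` and a hypothetical function name, returns the depth of a document together with the number of element nodes at each level; an explicit stack is used so that very deep documents do not run into recursion limits.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def depth_and_elements_per_level(xml_text):
    """Return (depth, Counter mapping level -> number of element nodes)."""
    root = ET.fromstring(xml_text)
    per_level = Counter()
    stack = [(root, 0)]                 # the root element is at level 0
    while stack:
        elem, level = stack.pop()
        per_level[level] += 1
        stack.extend((child, level + 1) for child in elem)
    depth = max(per_level)              # largest level among the elements
    return depth, per_level
```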
Figure 18. Fan-Out of Element and Attribute Nodes at Levels 0, 1, and 2. (Panels: (a) Element Fan-Out at Level 0; (b) Element Fan-Out at Level 1; (c) Element Fan-Out at Level 2; (d) Attribute Fan-Out at Level 0; (e) Attribute Fan-Out at Level 1; (f) Attribute Fan-Out at Level 2. Each panel plots the number of nodes against the fan-out, with logarithmic scales on both axes.)
6.4. Element and Attribute Fan-Out
In this section we study the element fan-out; the fan-out of an element e is defined as the number of elements that are children of e. Our goal is to correlate the number of nodes for the first three levels in Figure 16 to study the structure of the subtrees rooted at them. Intuitively, one can expect large element fan-out for “collection” documents containing several similar items. For instance, in a document like DBLP [49], containing bibliographic data about conferences and journals, one would expect a large fan-out for elements representing conferences. Small fan-out, on the other hand, intuitively indicates that the element represents a single object (say, a single conference paper). Figures 18 (a), (b), and (c) show the element fan-out at levels 0, 1 and 2, respectively. As already mentioned, 1,986 documents consist of a single root node with no children, and 53,401 documents have exactly 2 nodes (a root node with a single child). The distributions of the element fan-out in the figures above follow power laws of degrees 1.85 for level 0 and 3.1 for level 2. The distribution for the element fan-out of element nodes at level 1 – Figure 18 (b) – is not as easy to characterize, however. Although part of the distribution follows a power law of degree 2.8, there is a considerable number of element nodes that have fan-out of about 10,000. A closer look at this cluster reveals that these elements belong to 518 documents (514 from the ibm.com/developerworks site and 4 from the w3.org/TR site) which, curiously enough, are all character encoding maps for various languages. Another observation we make is that the average values for the element fan-out at levels zero, one and two are, respectively, 8.57, 5.76 and 0.18. This not only reinforces the previous observation that XML documents are shallow but also suggests that “tall” documents (i.e., documents with large depth) are not wide.
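As an illustration of the definition of element fan-out given above, the following sketch builds the fan-out histogram for one level of a single document and derives the average fan-out from it. It uses Python's standard xml.etree.ElementTree module; the function names fanout_histogram and average_fanout and the toy document are hypothetical, and the sketch is not the code used to produce Figure 18.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def fanout_histogram(xml_text, level):
    """Map fan-out -> number of elements at `level` having that many child elements."""
    histogram = Counter()
    stack = [(ET.fromstring(xml_text), 0)]
    while stack:
        elem, depth = stack.pop()
        if depth == level:
            histogram[len(elem)] += 1  # len(elem) is the number of child elements
        else:
            stack.extend((child, depth + 1) for child in elem)
    return histogram


def average_fanout(histogram):
    """Average fan-out over all elements counted in the histogram (None if empty)."""
    total = sum(histogram.values())
    if total == 0:
        return None
    return sum(fanout * count for fanout, count in histogram.items()) / total


# Toy "collection" document: a root holding two conferences with 3 and 1 papers.
doc = "<dblp><conf><paper/><paper/><paper/></conf><conf><paper/></conf></dblp>"
for level in (0, 1, 2):
    hist = fanout_histogram(doc, level)
    print(level, dict(hist), average_fanout(hist))
```

Over a whole sample, the per-document histograms for a level would simply be summed before plotting or averaging.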
Attribute Fan-Out
A similar analysis with respect to the number of attribute nodes per element (i.e., the attribute fan-out) for levels 0, 1 and 2 is given by Figures 18 (d), (e) and (f), respectively. The distributions seem to follow power laws as well, with degrees 4.5, 4.2 and 4.6 for levels 0, 1 and 2, respectively. Other results are: 2,588,286 element nodes (18% of the total) have no attributes, and, thus, were not counted in this analysis. The average numbers of attributes per element for the first four levels are: 0.09, 1.06, 1.48, and 0.48. The attribute fan-out values greater than 1 explain the excess of attribute nodes mentioned earlier.
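The article does not state how the power-law degrees quoted above were obtained. One common rough estimate, shown here purely as an assumption, is the negated slope of a least-squares line fitted to the fan-out histogram on log-log axes. The function name fit_power_law_degree and the synthetic histogram are hypothetical.

```python
import math


def fit_power_law_degree(histogram):
    """Estimate k in count ~ C * fanout**(-k) by least squares on log-log points.

    `histogram` maps a fan-out value to the number of nodes with that fan-out.
    Entries with a zero fan-out or a zero count are skipped, since log(0) is undefined.
    """
    points = [(math.log(x), math.log(y)) for x, y in histogram.items() if x > 0 and y > 0]
    if len(points) < 2:
        raise ValueError("need at least two usable points")
    n = len(points)
    mean_x = sum(px for px, _ in points) / n
    mean_y = sum(py for _, py in points) / n
    sxx = sum((px - mean_x) ** 2 for px, _ in points)
    sxy = sum((px - mean_x) * (py - mean_y) for px, py in points)
    if sxx == 0:
        raise ValueError("need at least two distinct fan-out values")
    return -sxy / sxx  # the slope is negative for a decaying distribution


# Synthetic histogram that follows count = 1000 * fanout**-2 exactly.
synthetic = {x: 1000.0 * x ** -2.0 for x in range(1, 50)}
print(round(fit_power_law_degree(synthetic), 3))
```

A log-log regression of this kind is sensitive to the sparse tail of the histogram, so the estimate should be read as indicative rather than as a rigorous fit.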
Figure 19. Average Number of Distinct Strings Versus Total Number of Strings per Element Type. The plot shows the average number of distinct strings (vertical axis) against the total number of strings for the type (horizontal axis), both on logarithmic scales, with curves distinct(x), 2log(x), 8log(x) and 32log(x).
Figure 20. A Recursive XML Tree. Nodes (identifier: label): 1: a, 2: b, 3: a, 4: a, 5: a.
6.5. Distinct Number of Strings
An important aspect of the management of XML data is the validation problem, which consists of determining whether a document satisfies all the constraints in a given schema. The incremental version of the problem is determining whether an update to a valid document yields another valid document [6, 36]. As mentioned in Section 6, the validity of XML documents is determined by testing whether the content of an element of a given type spells a word in an associated regular language. It has been shown [6] that the incremental validation of XML documents can be done very efficiently in practice, by effectively storing in auxiliary data structures the trace of the computation done during validation of each string found in the document. This requires auxiliary data that is proportional to the size of the document; thus, an interesting question that arises in this context is: how many different words are there? To answer this question, we compare the number of distinct words to the total number of words in each element type (recall the discussion in Section 6). That is, we group all element types having the
(a) AD interpretation
e: 1 1 1 4
d: 3 4 5 5
(b) CD interpretation
e: 1 1 4
d: 3 4 5
Figure 21. ed Pairings in the AD and CD Interpretations.
same number of strings (i.e., occurring the same number of times in our sample), and compute the average number of distinct strings (regardless of length) for each group. Figure 19 shows our results. The graph shows that, fortunately, the number of distinct words is clearly bounded by a logarithmic factor of the total number of words. That is, the amount of auxiliary information that needs to be stored for the purposes of incremental validation is substantially smaller than the documents themselves, which means that more space-efficient methods are possible in practice. This considerable redundancy can also be viewed as an empirical justification for the large compression rates obtained for XML documents [29, 12].
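The grouping procedure described above can be sketched as follows. The sketch assumes that an element type is identified by its label and that the string (word) spelled by an element is the sequence of labels of its child elements; both are simplifying assumptions, and the function names string_stats and average_distinct_by_total are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def string_stats(xml_texts):
    """Map element label -> (total number of strings, number of distinct strings)."""
    words = defaultdict(list)
    for xml_text in xml_texts:
        for elem in ET.fromstring(xml_text).iter():
            # Assumption: the word spelled by an element is its sequence of child labels.
            words[elem.tag].append(tuple(child.tag for child in elem))
    return {tag: (len(ws), len(set(ws))) for tag, ws in words.items()}


def average_distinct_by_total(stats):
    """Group element types by total string count; average the distinct counts per group."""
    groups = defaultdict(list)
    for total, distinct in stats.values():
        groups[total].append(distinct)
    return {total: sum(ds) / len(ds) for total, ds in groups.items()}


docs = ["<r><a><x/><y/></a><a><x/><y/></a><a><x/></a></r>"]
stats = string_stats(docs)
print(stats)
print(average_distinct_by_total(stats))
```

In the toy document, the type a occurs three times but spells only two distinct words, which is the kind of redundancy that Figure 19 measures across the sample.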
7. Recursion in XML documents
Our final study concerns the 28,208 XML documents (14.81% of the total) that contain recursive element content. The reasons for studying recursion in the XML Web are simple: while recursion is naturally captured by XML documents and schema specifications, it can have a considerable impact on the performance of query processing and parsing of XML documents [39]. This study is complementary to those in [8, 15], which characterize recursive DTDs and XML Schemas found on the Web. For our purposes here, we say an element e is recursive if there exists at least one descendant of e (which we will call d) such that e and d have the same label. For simplicity, we call such an element-descendant association an ed pair. A recursive XML tree is an XML tree that is rooted at a recursive element and whose leaves are recursive descendants of the root. For reasons that will become apparent shortly, we use two different interpretations of what to count as ed pairs. Consider the recursive XML tree in Figure 20, where numbers are node identifiers and letters are node labels. In the All-Descendants interpretation (AD), given in
Figure 22. Distance Between Recursive Elements and their Descendants: (a) AD interpretation; (b) CD interpretation.
Figure 21 (a), elements 3, 4 and 5 are the recursive descendants of element 1. In the Closest-Descendants interpretation (CD), given in Figure 21 (b), only elements 3 and 4 are considered to be recursive descendants of element 1. In both interpretations, element 5 is a recursive descendant of element 4. We now present some statistics about the ed pairs found in our sample. In total, there are 66,139 recursive XML trees in our sample; there are 213,507 ed pairs in the AD interpretation, and 147,557 ed pairs in the CD interpretation. Only 260 different element labels are found in all recursive trees. In 27,577 documents with recursive content (98% of the total), a single element label is used for all recursive elements, and in 307 documents (1% of the total), 2 labels are found among all recursive elements. The maximum number of labels used for recursive elements in a single document is 9. The most popular labels for the recursive elements in all documents are:
− ae, which labels 68,930 elements (32.28%) and is found in 77 documents (0.27% of all documents with recursive content);
− description, which labels 30,509 elements (14.28% of the total) and is found in 25,368 documents (89.93%);
− and page, which labels 30,429 elements (14.25%) and is found in 19 documents (0.06%).
95% of the documents with recursive content do not reference a DTD; amongst those that do, we observe the DTDs for the WAP protocol are the most popular ones: 876 documents (or 3% of the total) reference the WAP 1.2 DTD while 338 documents (1%) reference the
Figure 23. Average Distance between Elements in all ed Pairs: (a) AD interpretation; (b) CD interpretation.
WAP 1.1 DTD. We also note that most of the recursive documents come from the semantic Web community: 89% of them are either “.rdf” or “.rss” documents. Finally, we observe that 89% of these documents come from the .net zone: 36% from the rpmfind.net site and another 22% from download.sourceforge.net. In general, it seems that recursion is another feature of XML that has been widely used only in a specific community.
Distance
Our first study of the recursive XML content concerns the distance in the XML tree between elements and their recursive descendants. We measure distance by counting the number of edges separating the two nodes in the XML tree. For instance, the distance between elements 1 and 3 in Figure 20 is 2. Figures 22 (a) and 22 (b) show the distribution of the ed pairs in each interpretation according to the distance of the paired elements. The graphs have only one point in common: the 115,622 ed pairs of elements whose distance is 1; a quick look at the AD plot shows that there are recursive XML trees of depth up to 119 levels, while the CD plot shows that there are recursive elements separated by a path of length 20 that does not contain other elements with the same label. These results alone already justify the need for the AD and CD interpretations: the AD interpretation describes “global” properties of the recursive XML trees, while the CD interpretation describes “local” properties related to each recursive element and its descendants only. Evidently, some observations can be derived from either interpretation; for instance, both graphs above show that most elements in ed pairs have distance of 5 or less.
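The AD and CD interpretations and the distance measure can be made concrete with a short traversal. The sketch below uses Python's standard xml.etree.ElementTree module and is not the code used for the study; the function name ed_pairs and the id attribute are hypothetical. The tree shape assigned to Figure 20 is inferred from the pairings and distances stated in the text (1 is the parent of 2 and 4, 2 is the parent of 3, and 4 is the parent of 5), since the drawing itself is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Assumed shape of the tree in Figure 20, inferred from the text.
FIGURE_20 = "<a id='1'><b id='2'><a id='3'/></b><a id='4'><a id='5'/></a></a>"


def ed_pairs(xml_text, id_attr="id"):
    """Return (ad, cd): lists of (e, d, distance) triples under each interpretation."""
    ad, cd = [], []

    def visit(elem, ancestors):
        closest = True
        # Walk the ancestors from the parent (distance 1) up to the root.
        for distance, anc in enumerate(reversed(ancestors), start=1):
            if anc.tag == elem.tag:
                pair = (anc.get(id_attr), elem.get(id_attr), distance)
                ad.append(pair)      # AD keeps every same-label ancestor
                if closest:
                    cd.append(pair)  # CD keeps only the nearest same-label ancestor
                    closest = False
        ancestors.append(elem)
        for child in elem:
            visit(child, ancestors)
        ancestors.pop()

    visit(ET.fromstring(xml_text), [])
    return ad, cd


ad, cd = ed_pairs(FIGURE_20)
print(sorted(ad))
print(sorted(cd))
```

On this tree the AD list pairs element 1 with elements 3, 4 and 5 and element 4 with element 5, while the CD list omits the pair of elements 1 and 5, in agreement with Figure 21.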
Figure 24. Recursive Fan-Out of the Elements: (a) AD interpretation; (b) CD interpretation.
Regularity
A natural question about the recursive XML trees is whether there is any regularity in their shape. A simple notion of regularity can be the average distance between the elements in all ed pairs in the document. For this study, the CD interpretation provides a better reading. To see why, consider again the recursive tree in Figure 20; its average distance using the CD interpretation is (1 + 1 + 2)/3 ≈ 1.33. Now, consider the subtree obtained by deleting element 2 from the tree in the figure; the average distance of the new tree is (1 + 1)/2 = 1. Intuitively, we can say that the recursion in the second tree is more regular, because the distance between the elements in all ed pairs is constant and, thus, equals the average. The notion of regularity we describe above has the advantage of being extremely simple to compute. However, it is easy to see that it can be misleading if different labels are present in the same recursive tree: consider an XML tree containing 3 ed pairs with elements labeled a, all with distance of 1, and 2 ed pairs with elements labeled b whose distances are 2. The regularity of this tree is (1 + 1 + 1 + 2 + 2)/5 = 1.4 despite the fact that the distance of the elements in ed pairs for each given label is constant. Therefore, a better notion of regularity would take different labels into account. However, since 98% of the documents with recursive content have a single label for all of their ed pairs, our simple notion of regularity can be used without incurring any significant error. Our results for regularity using the simple metric described above are shown in Figure 23. We draw the reader’s attention to two observations: (1) most documents have average distance of 1, which is nothing more than a re-statement of the results presented in Figure 22; (2) there is
a lot of regularity in the recursion among the documents. The highest values in the CD interpretation are: 26,388 documents (93% of the total) with average distance 1; 1,141 documents (4%) with average distance 2; and 124 documents (0.44%) with average distance 3. This shows that more than 97% of all documents with recursive content exhibit high regularity. We note that a similar reading can be inferred from the AD interpretation as well.
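The simple regularity metric, and the per-label refinement suggested in the discussion of its limitations, can be sketched as follows under the CD interpretation. The function names are hypothetical, and the tree used for Figure 20 is the shape inferred from the distances quoted in the text (1 is the parent of 2 and 4, 2 is the parent of 3, and 4 is the parent of 5).

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def cd_distances(xml_text):
    """List of (label, distance) for every ed pair under the CD interpretation."""
    result = []
    stack = [(ET.fromstring(xml_text), [])]  # (element, labels from root to parent)
    while stack:
        elem, path = stack.pop()
        for distance, label in enumerate(reversed(path), start=1):
            if label == elem.tag:
                result.append((elem.tag, distance))
                break  # CD: stop at the closest ancestor with the same label
        child_path = path + [elem.tag]
        stack.extend((child, child_path) for child in elem)
    return result


def average_distance(xml_text):
    """Simple regularity metric: mean CD distance over all ed pairs (None if none)."""
    pairs = cd_distances(xml_text)
    return sum(d for _, d in pairs) / len(pairs) if pairs else None


def average_distance_by_label(xml_text):
    """Refinement: mean CD distance computed separately for each label."""
    groups = defaultdict(list)
    for label, distance in cd_distances(xml_text):
        groups[label].append(distance)
    return {label: sum(ds) / len(ds) for label, ds in groups.items()}


figure_20 = "<a><b><a/></b><a><a/></a></a>"   # assumed shape, inferred from the text
without_2 = "<a><a><a/></a></a>"              # the same tree after deleting element 2
print(average_distance(figure_20), average_distance(without_2))
```

The two printed values correspond to the averages (1 + 1 + 2)/3 and (1 + 1)/2 worked out in the text.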
Recursive fan-out
Another important parameter in studying the recursion in XML documents is what we call the recursive fan-out of an element, which is the number of recursive descendants of that element (or, the number of ed pairs in which the element appears in the e column). Again, the AD and CD interpretations provide complementary readings. The AD interpretation measures the total number of recursive elements in a given recursive XML tree. This is precisely the semantics of an XPath [16] expression of the form //e//e, where e is the label of a recursive element. The recursive fan-out under the CD interpretation, on the other hand, can be viewed as a “branching factor” of the recursive XML trees: intuitively, it measures how wide the XML tree gets as a function of the distance from the root of the tree. We note that a recursive fan-out of 1 means that the recursive tree gets only taller, but not wider, as the distance from the root increases. Figures 24 (a) and (b) show the distribution of the recursive elements with respect to their average recursive fan-out, using the AD and CD interpretations, respectively. As one can see, both distributions seem to follow power laws (with degrees 2.4 and 2.9, respectively). We note that a similar notion of regularity applies here, and, again using the CD interpretation, we observe that the most common average fan-outs are 1, found in 36,498 elements (60% of the total); 3, found in 24,177 elements (37%); and 2, found in 2,951 elements (4%). Also, the average recursive fan-out of all elements is 2.23; the largest recursive fan-out is 752, found in a single element. Several observations can be made from our results in this section. First, the fraction of documents containing recursive elements (roughly 15%) is not negligible. Second, both the width and the height of recursive XML trees can grow relatively large. The final, and perhaps most important, observation that we make is that there is a considerable amount of regularity in the recursion found in the XML documents of the Web; this is indeed good news as it represents opportunities to be explored by system developers.
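The recursive fan-out defined in this section can be computed by counting, for each element, the ed pairs in which it plays the role of e. The following sketch does so under both interpretations; the function name recursive_fanout and the id attribute are hypothetical, and the tree used for Figure 20 is the shape inferred from the text (1 is the parent of 2 and 4, 2 is the parent of 3, and 4 is the parent of 5).

```python
import xml.etree.ElementTree as ET
from collections import Counter


def recursive_fanout(xml_text, id_attr="id"):
    """Return (ad, cd): counters mapping element id -> recursive fan-out."""
    ad, cd = Counter(), Counter()
    stack = [(ET.fromstring(xml_text), [])]  # (element, ancestors from root to parent)
    while stack:
        elem, ancestors = stack.pop()
        # Same-label ancestors, nearest first.
        same = [anc for anc in reversed(ancestors) if anc.tag == elem.tag]
        for anc in same:
            ad[anc.get(id_attr)] += 1      # AD: every same-label ancestor gains a descendant
        if same:
            cd[same[0].get(id_attr)] += 1  # CD: only the closest one does
        path = ancestors + [elem]
        stack.extend((child, path) for child in elem)
    return ad, cd


figure_20 = "<a id='1'><b id='2'><a id='3'/></b><a id='4'><a id='5'/></a></a>"
ad, cd = recursive_fanout(figure_20)
print(dict(ad), dict(cd))
```

Element 1 has recursive fan-out 3 under AD and 2 under CD, while element 4 has fan-out 1 under both; a chain of same-label elements yields a CD fan-out of 1 everywhere, the "taller but not wider" case mentioned above. Elements without the assumed id attribute would all be counted under the key None, so a real implementation would need another way to identify nodes.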
8. Conclusion
In this article we presented the results of a statistical analysis of a sample of the XML web, consisting of about 200,000 documents. Our results can be classified into two broad categories: macro-level results, describing the XML web and the kinds of contents in it, and document-level results, describing structural properties of typical XML documents on the web. Our results can be summarized as follows. We showed that XML content already permeates the web and can be found in all major Internet domains and in all continents of the globe, although 75% of all documents and 85% of the total volume of XML content come from the .com and .net domains and from the different countries of the European Union. We further categorized the XML web in terms of the kinds of documents in it, which, when cross-referenced with the provenance of those documents, revealed, among other interesting findings, a considerable amount of content related to the semantic web initiative and to open source communities. We showed that, as in the HTML web, the out-degree of the XML documents seems to follow a power law. We showed that the use of DTDs and XML Schema specifications is not widespread: only 48% of the documents reference DTDs while the number of documents that reference XML Schema specifications is negligible (0.09%). In a sense, these statistics are empirical evidence for motivating the work on techniques for discovering semantic information from data (e.g., schema discovery, web mining and clustering, etc.). With respect to the XML documents on the web, we provide a comprehensive characterization describing such properties as their size, the depth and the shape of the trees that represent them, the use of recursion and the number of distinct strings they encode.
We found that the volume of markup is surprisingly high when compared to the actual content of the documents, even for the larger ones; moreover, the number of attributes exceeds the number of elements in our sample by a significant margin. While these findings contradict some assumptions guiding the development of systems for handling XML, particularly within the database community, some of our results corroborate part of the folklore that prevails in the development of such tools. We hope that our results will provide valuable insight for guiding the development of algorithms, tools and systems that process XML in one form or another. In particular, we believe our results have direct application in the development of meaningful benchmarks for XML applications. Evidently, the results described in this article are only a starting point to the study of the XML web. We are currently working on fetching a new snapshot of the XML web, aiming at studying how it
has evolved over time. It is our intent that interested researchers have prompt access to the documents and databases for carrying out other investigations concerning aspects of the XML web we did not study here.
Acknowledgments
We would like to thank Serge Abiteboul, Sophie Cluet and Guy Ferran for their support. The authors are also very grateful to Mariano P. Consens, Sergio Greco, Mario Cannataro, Alberto O. Mendelzon, Tova Milo and Ken Sevcik for their comments and useful remarks on an earlier version of this article. This work was done while D. Barbosa was a Ph.D. student at the University of Toronto; he was partially supported by an IBM Ph.D. Fellowship. L. Mignet did this work while he was a Ph.D. student at INRIA and then a Post-Doc fellow at the University of Toronto. Finally, we thank the anonymous reviewers for their encouraging and useful comments.
References
1. Serge Abiteboul, Peter Buneman, and Dan Suciu. Data on the Web. Morgan Kaufmann, 1999.
2. Serge Abiteboul, Mihai Preda, and Gregory Cobena. Adaptive On-Line Page Importance Computation. In Proceedings of the Twelfth International World Wide Web Conference, pages 271–279, Budapest, Hungary, May 20-24 2003.
3. Serge Abiteboul and Victor Vianu. Queries and Computation on the Web. In Sixth International Conference on Database Theory, pages 262–275, Delphi, Greece, 1997.
4. Vincent Aguiléra, Sophie Cluet, Tova Milo, Pierangelo Veltri, and Dan Vodislav. Views in a Large Scale XML Repository. VLDB Journal, 11(3):238–255, November 2002.
5. Vidur Apparao, Steve Byrne, Mike Champion, Scott Isaacs, Ian Jacobs, Arnaud Le Hors, Gavin Nicol, Jonathan Robie, Robert Sutor, Chris Wilson, and Lauren Wood. Document Object Model (DOM) Level 1 Specification. W3C Recommendation, http://www.w3.org/TR/1998/REC-DOM-Level-1-19981001, October 1 1998.
6. Denilson Barbosa, Alberto O. Mendelzon, Leonid Libkin, Laurent Mignet, and Marcelo Arenas. Efficient Incremental Validation of XML Documents. In Proceedings of the 20th International Conference on Data Engineering, pages 671–682, Boston, MA, USA, 2004. IEEE Computer Society.
7. Luciano Barbosa and Juliana Freire. Siphoning Hidden-Web Data through Keyword-Based Interfaces. In XIX Simpósio Brasileiro de Bancos de Dados (Brazilian Symposium on Databases), pages 309–321, Brasília, Distrito Federal, Brasil, October 18-20 2004.
8. Geert Jan Bex, Frank Neven, and Jan Van den Bussche. DTDs versus XML Schema: A Practical Study. In Proceedings of the Seventh International Workshop on the Web and Databases, WebDB 2004, pages 79–84, Maison de la Chimie, Paris, France, June 17-18 2004.
9. Philip Bohannon, Juliana Freire, Prasan Roy, and Jérôme Siméon. From XML Schema to Relations: A Cost-Based Approach to XML Storage. In Proceedings of the 18th International Conference on Data Engineering, pages 64–75, San Jose, CA, USA, February 26-March 1 2002.
10. Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, and Eve Maler (Editors). Extensible Markup Language (XML) 1.0. World Wide Web Consortium, third edition, February 4 2004. http://www.w3.org/TR/2004/REC-xml-20040204.
11. Sergey Brin and Larry Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. In Proceedings of the Seventh International World Wide Web Conference, Brisbane, Australia, April 1998.
12. Peter Buneman, Martin Grohe, and Christoph Koch. Path Queries on Compressed XML. In Proceedings of 29th International Conference on Very Large Data Bases, pages 141–152, Berlin, Germany, September 9–12 2003.
13. Cooperative Association for Internet Data Analysis. http://www.caida.org/.
14. Junghoo Cho and Hector Garcia-Molina. Finding Replicated Web Collections. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 355–366, Dallas, TX, USA, May 14-19 2000.
15. Byron Choi. What Are Real DTDs like. In Proceedings of the Fifth International Workshop on the Web and Databases, pages 43–48, Madison, Wisconsin, June 6-7 2002.
16. James Clark and Steve DeRose. XML Path Language (XPath) – Version 1.0. World Wide Web Consortium, November 16 1999. http://www.w3.org/TR/1999/REC-xpath-19991116.
17. Stephen Dill, Ravi Kumar, Kevin S. McCurley, Sridhar Rajagopalan, D. Sivakumar, and Andrew Tomkins. Self-similarity in the Web. In Proceedings 27th International Conference on Very Large Data Bases, pages 69–78, Rome, Italy, September 11-14 2001.
18. Thorsten Fiebig, Sven Helmer, Carl-Christian Kanne, Guido Moerkotte, Julia Neumann, Robert Schiele, and Till Westmann. Anatomy of a Native XML Base Management system. VLDB Journal, 11(4):292–314, 2002.
19. Roy T. Fielding, Jim Gettys, Jeffrey C. Mogul, Henrik Frystyk Nielsen, Larry Masinter, Paul Leach, and Tim Berners-Lee. Hypertext Transfer Protocol – HTTP/1.1. RFC 2616. HTTP Working Group, June 1999. ftp://ftp.isi.edu/in-notes/rfc2616.txt.
20. Juliana Freire, Jayant R. Haritsa, Maya Ramanath, Prasan Roy, and Jérôme Siméon. StatiX: making XML count. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, pages 181–191, Madison, WI, USA, June 3-6 2002.
21. Richard Hull, Michael Benedikt, Vassilis Christophides, and Jianwen Su. E-services: a Look Behind the Curtain. In Proceedings of the Twenty-Second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 1–14, San Diego, CA, USA, June 09 - 11 2003.
22. IBM DB2 v8.1. http://www.ibm.com.
23. International Standards Organization. ISO 8879 - Standard Generalized Markup Language (SGML), 1986.
24. Internet Domain Survey. http://www.isc.org/ds/.
25. Panagiotis Ipeirotis, Luis Gravano, and Mehran Sahami. Probe, Count, and Classify: Categorizing Hidden Web Databases. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, pages 67–78, Santa Barbara, CA, USA, May 21-24 2001.
26. H. V. Jagadish, Shurug Al-Khalifa, Adriane Chapman, Laks V. S. Lakshmanan, Andrew Nierman, Stelios Paparizos, Jignesh M. Patel, Divesh Srivastava, Nuwee Wiwatwattana, Yuqing Wu, and Cong Yu. TIMBER: A native XML database. VLDB Journal, 11(4):274–291, 2002.
27. Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, D. Sivakumar, Andrew Tomkins, and Eli Upfal. The Web as a Graph. In Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Dallas, TX, USA, May 15-17 2000.
28. Quanzhong Li and Bongki Moon. Indexing and Querying XML Data for Regular Path Expressions. In Proceedings 27th International Conference on Very Large Data Bases, pages 361–370, Rome, Italy, September 11-14 2001.
29. Hartmut Liefke and Dan Suciu. XMILL: An Efficient Compressor for XML Data. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 153–164, Dallas, TX, USA, May 14-19 2000.
30. Ioana Manolescu, Daniela Florescu, and Donald Kossmann. Answering XML Queries on Heterogeneous Data Sources. In Proceedings 27th International Conference on Very Large Data Bases, Rome, Italy, September 11-14 2001.
31. Microsoft SQL Server, 2000. http://www.microsoft.com/sql.
32. Laurent Mignet, Denilson Barbosa, and Pierangelo Veltri. The XML Web: A First Study. In Proceedings of the Twelfth International Conference on World Wide Web, Budapest, Hungary, May 20-24 2003.
33. Laurent Mignet, Mihai Preda, Serge Abiteboul, Sébastien Ailleret, Bernd Amann, and Amélie Marian. Acquiring XML Pages for a WebHouse. In Base de Données Avancées, 2000.
34. RFC 1321 - The MD5 Message-Digest Algorithm.
35. Oracle 9i. http://www.oracle.com.
36. Yannis Papakonstantinou and Victor Vianu. Incremental Validation of XML Documents. In Proceedings of The 9th International Conference on Database Theory, pages 47–63, Siena, Italy, January 8-10 2003.
37. Dave Raggett, Arnaud Le Hors, and Ian Jacobs. HTML 4.01 Specification. World Wide Web Consortium, December 24 1999. http://www.w3.org/TR/1999/REC-html401-19991224.
38. Sriram Raghavan and Hector Garcia-Molina. Crawling the Hidden Web. In Proceedings 27th International Conference on Very Large Data Bases, pages 129–138, Rome, Italy, September 11-14 2001.
39. Luc Segoufin and Victor Vianu. Validating streaming XML documents. In Proceedings of the Twenty-First ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 53–64, Madison, Wisconsin, June 03-05 2002.
40. The Plays of Shakespeare in XML. http://metalab.unc.edu/bosak/xml/.
41. Jayavel Shanmugasundaram, Kristin Tufte, Chun Zhang, Gang He, David J. DeWitt, and Jeffrey F. Naughton. Relational Databases for Querying XML Documents: Limitations and Opportunities. In Proceedings of 25th International Conference on Very Large Data Bases, pages 302–314, Edinburgh, Scotland, UK, September 7-10 1999.
42. Tamino XML Server. http://www.softwareag.com/tamino.
43. Igor Tatarinov, Zachary Ives, Alon Halevy, and Daniel Weld. Updating XML. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, pages 413–424, Santa Barbara, CA, USA, May 21-24 2001.
44. Henry S. Thompson, David Beech, Murray Maloney, and Noah Mendelsohn (Editors). XML Schema Part 1: Structures. World Wide Web Consortium, May 2 2001. http://www.w3.org/TR/2001/REC-xmlschema-1-20010502/.
45. Semantic Web. http://www.w3.org/2001/sw.
46. Wireless Application Protocol. http://www.wapforum.org/.
47. The XOO7 Benchmark. http://www.comp.nus.edu.sg/~ebh/XOO7.html.
48. The XML benchmark project. http://www.xml-benchmark.org/.
49. DBLP XML. http://dblp.uni-trier.de/xml/.
50. Xyleme S.A. http://www.xyleme.com/.
51. Lucie Xyleme. A Dynamic Warehouse for XML Data of the Web. IEEE - Data Engineering Bulletin, 24(2), 2001.