Proceedings of the Seventh International Workshop on ... - WebDB 2004

Proceedings of the Seventh International Workshop on the Web and Databases WebDB 2004 June 17–18, 2004 Maison de la Chimie, Paris, France Colocated with ACM SIGMOD/PODS 2004 In Cooperation with ACM SIGMOD Sponsored by INRIA Sihem Amer-Yahia and Luis Gravano Editors

WORKSHOP CO-CHAIRS Sihem Amer-Yahia, AT&T Labs Research, USA Luis Gravano, Columbia University, USA

PROGRAM COMMITTEE MEMBERS Serge Abiteboul, INRIA-Futurs, France Ricardo Baeza-Yates, Universidad de Chile, Chile Sergey Brin, Google, USA Kevin Chang, University of Illinois at Urbana-Champaign, USA Junghoo Cho, University of California at Los Angeles, USA Juliana Freire, Oregon Health and Science University, USA Norbert Fuhr, University of Duisburg-Essen, Germany Alon Halevy, University of Washington, USA Taher Haveliwala, Google, USA Rick Hull, Bell Labs, USA Giansalvatore Mecca, Universitá della Basilicata, Italy Philippe Pucheral, INRIA, France Jayavel Shanmugasundaram, Cornell University, USA Divesh Srivastava, AT&T Labs Research, USA Torsten Suel, Polytechnic University, USA Wang-Chiew Tan, University of California at Santa Cruz, USA Val Tannen, University of Pennsylvania, USA Andrew Tomkins, IBM Almaden, USA Philip Wadler, University of Edinburgh, UK

WEB CHAIR Panagiotis G. Ipeirotis, Columbia University, USA

EXTERNAL REVIEWERS Maha Abdallah Bernd Amann Chavdar Botev Luc Bouganim Paolo Cappellari Jean Carletta Yi Chen Laura Chiticariu Byron Choi Shui-Lung Chuang Valter Crescenzi Francois Dang-Ngoc Roberto De Alin Deutsch

Wenfei Fan Gudrun Fischer Lin Guo Harry Halpin Bin He Seung-Won Hwang Utku Irmak Zack Ives Jonathan Kilgaur Claus-Peter Klas Laks V. S. Laksmanan Wang Lam Chengkai Li

Prakash Linga Victor Liu Xiaohui Long Ashwin Machanavajjhala Gurmeet Manku Ioana Manolescu Alexander Markowetz Paolo Merialdo Sudarshan Murthy Henrik Nottelmann Alexandros Ntoulas Paolo Papotti Nicoleta Preda

KEYNOTE TALK Text Centric Structure Extraction and Exploitation, by Prabhakar Raghavan (Verity, Inc.)

Sourashis Roy Cristian-Augustin Saita Feng Shao Ka Cheung Sia Cristina Sirangelo Igor Tatarinov Henry Thompson Richard Tobin Gaurav Vijayvargiya Lingzhi Zhang Zhen Zhang Yifeng Zheng Qinghua Zou

CONTENTS PAPER SESSION 1: Web Querying and Mining Spam, Damn Spam, and Statistics: Using Statistical Analysis to Locate Spam Web Pages Dennis Fetterly, Mark Manasse, and Marc Najork

1

Querying Bi-level Information Sudarshan Murthy, David Maier, and Lois Delcambre

7

Visualizing and Discovering Web Navigational Patterns Jiyang Chen, Lisheng Sun, Osmar R. Zaiane, and Randy Goebel

13

PAPER SESSION 2: Peer-to-Peer Search Systems One Torus to Rule Them All: Multidimensional Queries in P2P Systems Prasanna Ganesan, Beverly Yang, and Hector Garcia-Molina

19

Querying Peer-to-Peer Networks Using P-Trees Adina Crainiceanu, Prakash Linga, Johannes Gehrke, and Jayavel Shanmugasundaram

25

PAPER SESSION 3: Data Dissemination Scalable Dissemination: What’s Hot and What’s Not Jonathan Beaver, Nicholas Morsillo, Kirk Pruhs, Panos K. Chrysanthis, and Vincenzo Liberatore

31

Semantic Multicast for Content-based Stream Dissemination Olga Papaemmanouil and Ugur Cetintemel

37

PAPER SESSION 4: XML Query Processing Twig Query Processing over Graph-Structured XML Data Zografoula Vagena, Mirella M. Moro, and Vassilis J. Tsotras

43

Unraveling the Duplicate-Elimination Problem in XML-to-SQL Query Translation Rajasekar Krishnamurthy, Raghav Kaushik, and Jeffrey F. Naughton

49

Best-Match Querying from Document-Centric XML Jaap Kamps, Maarten Marx, Maarten de Rijke, and Borkur Sigurbjornsson

55

PAPER SESSION 5: Approximate and Ranked Query Processing Challenges in Selecting Paths for Navigational Queries: Trade-Off of Benefit of Path versus Cost of Plan Maria-Esther Vidal, Louiqa Raschid, and Julian Mestre

61

Content and Structure in Indexing and Ranking XML Felix Weigel, Holger Meuss, Klaus U. Schulz, and Francois Bry

67

Mining Approximate Functional Dependencies and Concept Similarities to Answer Imprecise Queries Ullas Nambiar and Subbarao Kambhampati

73

PAPER SESSION 6: XML Schemas and Validation DTDs versus XML Schema: A Practical Study Geert Jan Bex, Frank Neven, and Jan Van den Bussche

79

On Validation of XML Streams Using Finite State Machines Cristiana Chitic and Daniela Rosu

85

Checking Potential Validity of XML Documents Ionut Emil Iacob, Alexander Dekhtyar, and Michael I. Dekhtyar

91

Spam, Damn Spam, and Statistics Using statistical analysis to locate spam web pages Dennis Fetterly

Mark Manasse

Marc Najork

Microsoft Research Microsoft Research Microsoft Research 1065 La Avenida 1065 La Avenida 1065 La Avenida Mountain View, CA 94043, USA Mountain View, CA 94043, USA Mountain View, CA 94043, USA

[email protected]

[email protected]

ABSTRACT

pages returned by a search engine. In fact, high placement in a search engine is one of the strongest contributors to a commercial web site’s success. For these reasons, a new industry of “search engine optimizers” (SEOs) has sprung up. Search engine optimizers promise to help commercial web sites achieve a high ranking in the result pages to queries relevant to a site’s business, and thus experience higher traffic by web surfers. In the best case, search engine optimizers help web site designers generate content that is well-structured, topical, and rich in relevant keywords or query terms. Unfortunately, some search engine optimizers go well beyond producing relevant pages: they try to boost the ratings of a web site by loading pages with a wide variety of popular query terms, whether relevant or not. However, such behavior is relatively easily detected by a search engine, since pages loaded with disjoint, unrelated keywords lack topical focus, and this lack of focus can be detected through term vector analysis. Therefore, some SEOs go one step further: Instead of including many unrelated but popular query terms into the pages they want to boost, they synthesize many pages, each of which contains some tightly-focused popular keywords, and all of which redirect to the page intended to receive traffic. Another reason for SEOs to synthesize pages is to boost the PageRank [11] of the target page: each of the dynamically-created pages receives a minimum guaranteed PageRank value, and this rank can be used to endorse the target page. Many small endorsements from these dynamically-generated pages result in a sizable PageRank for the target page. Search engines can try to counteract such behavior by limiting the number of pages crawled and indexed from any particular web site. In a further escalation of this arms race, SEOs have responded by setting up DNS servers that will resolve any host name within their domain (and typically map it to a single IP address). Most if not all of the SEO-generated pages exist solely to (mis)lead a search engine into directing traffic towards the “optimized” site; in other words, the SEO-generated pages are intended only for the search engine, and are completely useless to human visitors. In the following, we will refer to such web pages as “spam pages”. Search engines have an incentive to weed out spam pages, so as to improve the search experience of their customers. This paper describes a variety of techniques that can be used by search engines to detect a portion of the spam pages. In the course of two earlier studies, we collected statistics on a large sample of web pages. As part of the first

The increasing importance of search engines to commercial web sites has given rise to a phenomenon we call “web spam”, that is, web pages that exist only to mislead search engines into (mis)leading users to certain web sites. Web spam is a nuisance to users as well as search engines: users have a harder time finding the information they need, and search engines have to cope with an inflated corpus, which in turn causes their cost per query to increase. Therefore, search engines have a strong incentive to weed out spam web pages from their index. We propose that some spam web pages can be identified through statistical analysis: Certain classes of spam pages, in particular those that are machine-generated, diverge in some of their properties from the properties of web pages at large. We have examined a variety of such properties, including linkage structure, page content, and page evolution, and have found that outliers in the statistical distribution of these properties are highly likely to be caused by web spam. This paper describes the properties we have examined, gives the statistical distributions we have observed, and shows which kinds of outliers are highly correlated with web spam.

Categories and Subject Descriptors H.5.4 [Information Interfaces and Presentation]: Hypertext/Hypermedia; K.4.m [Computers and Society]: Miscellaneous; H.4.m [Information Systems]: Miscellaneous

General Terms Measurement, Experimentation, Algorithms

Keywords Web characterization, web spam, statistical properties of web pages

1.

[email protected]

INTRODUCTION

Search engines have taken pivotal roles in web surfers’ lives: Most users have stopped maintaining lists of bookmarks, and are instead relying on search engines such as Google, Yahoo! or MSN Search to locate the content they seek. Consequently, commercial web sites are more dependant than ever on being placed prominently within the result Copyright is held by the author/owner. Seventh International Workshop on the Web and Databases (WebDB 2004), June 17-18, 2004, Paris, France.

1

1E+7

1,000,000

1E+6

Number of IP addresses

Number of host names

100,000

10,000

1,000

100

10

1E+5

1E+4

1E+3

1E+2

1E+1

1E+0 1E+0

1 0

20

40

60

80

100

120

140

160

180

200

1E+1

1E+2

1E+3

1E+4

1E+5

1E+6

1E+7

Number of host names mapping to single IP address

Host name length (in characters)

Figure 1: Distribution of lengths of symbolic host names

Figure 2: Distribution of number of different hostnames mapping to the same IP address

study [5], we crawled 429 million HTML pages and recorded the hyperlinks contained in each page. As part of the second study [8], we crawled 150 million HTML pages repeatedly, once a week for 11 weeks, and recorded a feature vector for each page allowing us to measure how much a given page changes week over week, as well as several other properties. In the study presented in this paper, we computed statistical distributions for a variety of properties in these data sets. We discovered that in a number of these distributions, outlier values are associated with web spam. Consequently, we hypothesize that statistical analysis is a good way to identify certain kinds of spam web pages (namely, various types of machine-generated pages). The ability to identify a large number of spam pages in a data collection is extremely valuable to search engines, not only because it allows the engine to exclude these pages from their corpus or to penalize them when ranking search results, but also because these pages can then be used to train other, more sophisticated machine-learning algorithms aimed at identifying additional spam pages. The remainder of the paper is structured as follows: Section 2 describes the two data sets on which we based our experiments. Section 3 discusses how various properties of a URL are predictive of whether or not the page referenced by the URL is a spam page. Section 4 describes how domain name resolutions can be used to identify spam sites. Section 5 describes how the link structure between pages can be used to identify spam pages. Section 6 describes how even purely syntactic properties of the content of a page are predictive of spam. Section 7 describes how anomalies in the evolution of web pages can be used to spot spam. Section 8 discusses how excessive replication of the same (or nearly the same) content is indicative of spam. Section 9 discusses related work, and section 10 offers concluding remarks and outlines avenues for future work.

the time of download, the document length, the number of non-markup words in the document, a checksum of the entire page, and a “shingle” vector (a feature vector that allows us to measure how much the non-markup content of a page has changed between downloads). In addition, we retained the full text of 0.1% of all downloaded pages, chosed based on a hash of the URL. Manual inspection of 751 pages sampled from the set of retained pages discovered 61 spam pages, a prevalence of 8.1% spam in the data set, with a confidence interval of 1.95% at 95% confidence. The second data set (“DS2”) is the result of a single breadth-first-search crawl. This crawl was conducted between July and September 2002, started at the Yahoo! home page, and covered about 429 million HTML pages as well as 38 million HTTP redirects. For each downloaded HTML page, we retained the URL of the page and the URLs of all hyperlinks contained in the page; for each HTTP redirection, we retained the source as well as the target URL of the redirection. The average HTML page contained 62.55 links, the median number of links per page was 23. If we consider only distinct links on a given page, the average was 42.74 and the median was 17. Unfortunately, we did not retain the full-text of any downloaded pages when the crawl was performed. In order to estimate the prevalence of spam, we looked at current versions of a random sample of 1,000 URLs from DS2. Of these pages, 465 could not be downloaded or contained no text when downloaded. Of the remaining 535 pages, 37 (6.9%) were spam.

2.

3. URL PROPERTIES Link spam is a particular form of web spam, where the SEO attempts to boost the PageRank of a web page p by creating many pages referring to p. However, given that the PageRank of p is a function of both the number of pages endorsing p as well as their quality, and given that SEOs typically do not control many high-quality pages, they must resort to using a very large number of low-quality pages to endorse p. This is best done by generating these pages automatically; a technique commonly known as “link spam”. One might expect the URLs of automatically generated pages to be different from those of human-created pages, given that the URLs will be machine-generated as well. For example, one might expect machine-generated URLs to be

DESCRIPTION OF OUR DATA SETS

Our study is based on two data sets collected in the course of two separate previous experiments [5, 8]. The first data set (“DS1”) represents 150 million URLs that were crawled repeatedly, once every week over a period of 11 weeks, from November 2002 to February 2003. For every downloaded page, we retained the HTTP status code,

2

1E+8 1E+7

Number of pages

1E+6 1E+5 1E+4 1E+3 1E+2 1E+1 1E+0 1E+0

1E+1

1E+2

1E+3

1E+4

1E+5

1E+6

Out-degree

Figure 3: Distribution of “host-machine ratios” among all links on a page, averaged over all pages on a web site

Figure 4: Distribution of out-degrees

host names map to a single IP address; the vertical axis indicates how many such IP addresses there are. A point at position (x, y) indicates that there are y IP addresses with the property that each IP address is mapped to by x hosts. 1,864,807 IP addresses in DS2 are referred to by one host name each (indicated by the topmost point); 599,632 IP addresses are referred to by two host names each; and 1 IP address is referred to by 8,967,154 host names (far-right point). We found that 3.46% of the pages in DS2 are served from IP addresses that are mapped to by more than 10,000 different symbolic host names. Casual inspection of these URLs showed that they are predominantly spam sites. If we drop the threshold to 1,000, the yield rises to 7.08%, but the rate of false positives goes up significantly. Applying the same technique to DS1 flagged 2.92% percent of all pages in DS1 as spam candidates; manual inspection of a sample of 250 of these pages showed that 167 (66.8%) were spam, 64 (25.6%) were false positives (largely attributable to community sites that assign unique host names to each user), and 19 (7.6%) were “soft errors”, that is, pages displaying a message indicating that the resource is not currently available at this URL, despite the fact that the HTTP status code was 200 (“OK”). It is worth noting that this metric flags about 20 times more URLs as spam than the hostname-based metric did. Another item of folklore in the SEO community is that Google’s variant of PageRank assigns greater weight to offsite hyperlinks (the rationale being that endorsing another web site is more meaningful than a self-endorsement), and even greater weight to pages that link to many different web sites (these pages are considered to be “hubs”). Many SEOs try to capitalize on this alleged behavior by populating pages with hyperlinks that refer to pages on many different hosts, but typically all of the hosts actually resolve to one or at most a few different IP addresses. We detect this scheme by computing the average “hostmachine-ratio” of a web site. Given a web page containing a set of hyperlinks, we define the host-machine-ratio of that page to be the size of the set of host names referred to by the link set divided by the size of the set of distinct machines that the host names resolve to (two host names are assumed to refer to distinct machines if they resolve to non-

longer, have more arcs, more digits, or the like. However, when we examined our data set DS2 for such correlations, we did not find any properties of the URL at large that are correlated to web spam. However, we did find that various properties of the host component of a URL are indicative of spam. In particular, we found that host names with many characters, dots, dashes, and digits are likely to be spam web sites. (Coincidentally, 80 of the 100 longest host names we discovered refer to adult web sites, while 11 refer to financial-credit-related web sites.) Figure 1 shows the distribution of host name length. The horizontal axis shows the host name length in characters; the vertical axis shows how many host names with that length are contained in DS2. Obviously, the choice of threshold values for the number of characters, dots, dashes and digits that cause a URL to be flagged as a spam candidate determines both the number of pages flagged as spam as well as the rate of false positives. 0.173% of all URLs in DS2 have host names that are at least 45 characters long, or contain at least 6 dots, 5 dashes, or 10 digits. The vast majority of these URLs appear to be spam.

4.

HOST NAME RESOLUTIONS

One piece of folklore among the SEO community is that search engines (and Google in particular), given a query q, will rank a result URL u higher if u’s host component contains q. SEOs try to exploit this by populating pages with URLs whose host components contain popular queries that are relevant to their business, and by setting up a DNS server that resolves those host names. The latter is quite easy, since DNS servers can be configured with wildcard records that will resolve any host name within a domain to the same IP address. For example, at the time of this writing, any host within the domain highriskmortgage.com resolves to the IP address 65.83.94.42. Since SEOs typically synthesize a very large number of host names so as to rank highly for a wide variety of queries, it is possible to spot this form of web spam by determining how many host names resolve to the same IP address (or set of IP addresses). Figure 2 shows the distribution of host names per IP address. The horizontal axis shows how many

3

1E+9 1E+8 1E+7

Number of pages

1E+6 1E+5 1E+4 1E+3 1E+2 1E+1 1E+0 1E+0

1E+1

1E+2

1E+3

1E+4 In-degree

1E+5

1E+6

1E+7

1E+8

Figure 6: Variance of the word counts of all pages served up by a single host

Figure 5: Distribution of in-degrees

denotes the number of pages in DS2 with that in-degree, and both axes are drawn on a logarithmic scale. The graph appears linear over an even wider range than the previous graph, exhibiting an even more pronounced Zipfian distribution. However, there is also an even larger set of outliers, and some of them are even more pronounced. For example, there are 369,457 web pages with in in-degree 1001 in DS2, while according to the overall in-degree distribution we would expect only about 2,000 such pages. Overall, 0.19% of the pages in DS2 have an in-degree that is at least three times more common than the Zipfian distribution would suggest. We examined a cross-section of these pages, and the vast majority of them are spam.

identical sets of IP addresses). The host-machine-ratio of a machine is defined to be the average host-machine-ratio of all pages served by that machine. If a machine has a high host-machine-ratio, most pages served by this machine appear to link to many different web sites (i.e. have nonnepotistic, meaningful links), but actually all endorse the same property. In other words, machines with high hostmachine-ratios are very likely to be spam sites. Figure 3 shows the host-machine ratios of all the machines in DS2. The horizontal axis denotes the host-machine-ratio; the vertical axis denotes the number of pages on a given machine. Each point represents one machine; a point at position (x, y) indicates that DS2 contains y pages from this machine, and that the average host-machine-ratio of these pages is x. We found that host-machine ratios greater than 5 are typically indicative of spam. 1.69% of the pages in DS2 fulfill this criterion.

5.

6. CONTENT PROPERTIES As we mentioned earlier on, SEOs often try to boost their rankings by configuring web servers to generate pages on the fly, in order to perform “link spam” or “keyword stuffing.” Effectively, these web servers spin an infinite web — they will return an HTML page for any requested URL. A smart SEO will generate pages that exhibit a certain amount of variance; however, many SEOs are na¨ıve. Therefore, many auto-generated pages look fairly templatic. In particular, there are numerous spam web sites that dynamically generate pages which each contain exactly the same number of words (although the individual words will typically differ from page to page). DS1 contains the number of non-markup words in each downloaded HTML page. Figure 6 shows the variance in word count of all pages drawn from a given symbolic host name. We restrict ourselves to hosts with a nonzero mean word count. The x-axis shows the variance of the word count, the y-axis shows the number of pages in DS1 downloaded from that host. Both axes are shown on a log-scale; we have offset data points with zero variance by 10−7 , in order to deal with the limitations of the log-scale. The blue oval highlights web servers that have at least 10 pages and no variance in word count. There are 944 such hosts serving 323,454 pages (0.21% of all pages). Drawing a random sample of 200 of these pages and manually assessing them showed that 55% were spam, 3.5% contained no text, and 41.5% were soft errors.

LINKAGE PROPERTIES

Web pages and the hyperlinks between them induce a graph structure. Using graph-theoretic terminology, the out-degree of a web page is equal to the number of hyperlinks embedded in the page, while the in-degree of a page is equal to the number of hyperlinks referring to that page. Figure 4 shows the distribution of out-degrees. The xaxis denotes the out-degree of a page; the y-axis denotes the number of pages in DS2 with that out-degree. Both axes are drawn on a logarithmic scale. (The 53.7 million pages in DS2 that have out-degree 0 are not included in this graph due to the limitations of the log-scale plot.) The graph appears linear over a wide range, a shape characteristic of a Zipfian distribution. The blue oval highlights a number of outliers in the distribution. For example, there are 158,290 pages with out-degree 1301; while according to the overall distribution of out-degrees we would expect only about 1,700 such pages. Overall, 0.05% of the pages in DS2 have an outdegree that is at least three times more common than the Zipfian distribution would suggest. We examined a crosssection of these pages, and virtually all of them are spam. Figure 5 shows the distribution of in-degrees. As in figure 4, the x-axis denotes the in-degree of a page, the y-axis

4

1E+8 1E+7

Number of clusters

1E+6 1E+5 1E+4 1E+3 1E+2 1E+1 1E+0 1E+0

7.

1E+2

1E+3

1E+4

1E+5

1E+6

Figure 8: Distribution of sizes of clusters of nearduplicate documents

CONTENT EVOLUTION PROPERTIES

Some spam web sites that dynamically generate a page for any requested URL do so without actually using the URL in the generation of the page. This approach can be detected by measuring the evolution of web pages and web sites. Overall, the web evolves slowly, 65% of all pages will not change at all from one week to the next, and only about 0.8% of all pages will change completely [8]. In contrast, spam pages that are created in response to an HTTP request, independent of the requested URL, will change completely on every download. Therefore, we can detect such spam sites by looking for web sites that display a high rate of average page mutation. Figure 7 shows the average amount of week-to-week change of all the web pages on a given server. The horizontal axis denotes the average week-to-week change amount; 0 denotes complete change, 85 denotes no change. The vertical axis denotes the number of pairs of successive downloads served up by a given IP address (change from week 1 to week 2, week 2 to week 3, etc.). The data items are represented as points; each point represents a particular IP address. The blue oval highlights IP addresses for which almost all pages change almost completely every week. There are 367 such servers, which account for 1,409,353 pages in DS1 (0.93% of all pages). Sampling 106 of these pages and manually assessing them showed that 103 of them (97.2%) were spam, 2 pages were soft errors, and 1 page was a (pornographic) false positive. One might think that our technique would conflate news sites with spam sites, given that news changes often. However, we did not find any news pages among the spam candidates returned by this method. We attribute this to the fact that most news sites have fast-changing index pages, but essentially static articles. Since we measure the average amount of change of all pages from a particular site, news sites will not show up prominently.

8.

1E+1

Cluster size

Figure 7: Average change week over week of all pages served up by a given IP address

from the template hardly vary. We can detect this by forming clusters of very similar pages, for example by using the “shingling” algorithm due to Broder et al. [3]. The full details of our clustering algorithm are described elsewhere [9]. Figure 8 shows the distribution of the sizes of clusters of near-duplicate documents in DS1. The x-axis shows the size of the cluster (i.e. how many web pages are in the same near-equivalence class), the y-axis shows how many clusters of that size exist in DS1. Both axes are drawn on a log-scale; as so often, the distribution is Zipfian. The distribution contains two groups of outliers. Examining the outliers highlighted by the red oval did not uncover any spam site; these outliers were due to genuine replication of popular content across many distinct web sites (e.g. mirrors of the PHP documentation). However, the clusters highlighted by the blue oval turned out to be predominantly spam: 15 of the 20 largest clusters were spam, accounting for 2,080,112 pages in DS1 (1.38% of all pages).

9. RELATED WORK Henzinger et al. [10] identified web spam as one of the most impiortant challenges to web search engines. Davison [7] investigated techniques for discovering nepotistic links, i.e. link spam. Move recently, Amitay et al. [1] identified feature-space based techniques for identifying link spam. Our paper, in contrast, presents techniques for detecting not only link spam, but more generally spam web pages. All of our techniques are based on detecting anomalies in statistics gathered through web crawls. A number of papers have presented such statistics; but focused on the trend rather than the outliers. Broder et al. investigated the link structure of the web graph [4]. They observed that the in-degree and the outdegree distributions are Zipfian, and mentioned that outliers in the distribution were attributable to web spam. Bharat et al. have expanded on this work by examining not only the link structure between individual pages, but also the higher-level connectivity between sites and between top-level domains [2]. Cho and Garcia-Molina [6] studied the fraction of pages

CLUSTERING PROPERTIES

Section 6 argued that many spam sites serve large numbers of pages that all look fairly templatic. In some cases, pages are formed by inserting varying keywords or phrases into a template. Quite often, the individual pages created

5

on 270 web servers that changed day over day. Fetterly et al. [8] expanded on this work by studying the amount of week-over-week change of 150 million pages (parts of the results described in this paper are based on the data set collected during that study). They observed that the much higher than expected change rate of the German web was due to web spam. Earlier, we used that same data set to examine the evolution of clusters of near-duplicate content [9]. In the course of that study, we observed that the largest clusters were attributable to spam sites, each of which served a very large number of near-identical variations of the same page.

Our next goal is to benchmark the individual and combined effectiveness of our various techniques on a unified data set that contains the full text and the links of all pages. A more far-reaching ambition is to use semantic techniques to see whether the actual words on a web page can be used to decide whether it is spam. Techniques for detecting web spam are extremely useful to search engines. They can be used as a factor in the ranking computation, in deciding how much and how fast to crawl certain web sites, and, in the most extreme scenario, they can be used to excise low-quality content from the engine’s index. Applying these techniques enables engines to present more relevant search results to their customers while reducing the index size. More speculatively, the techniques described in this paper could be used to assemble a large collection of spam web pages, which can then be used as a training set for machine-learning algorithms aimed at detecting a more general class of spam pages.

10. CONCLUSIONS This paper described a variety of techniques for identifying web spam pages. Many search engine optimizers aim to improve the ranking of their clients’ web sites by trying to inject massive numbers of spam web pages into the corpus of a search engine. For example, raising the PageRank of a web page requires injecting many pages endorsing that page into the search engine. The only way to effectively create a very large number of spam pages is to generate them automatically. The basic insight of this paper is that many automatically generated pages differ in one way or another from web pages authored by a human. Some of these differences are due to the fact that many automatically generated pages are too “templatic”, that is, they have little variance in word count or even actual content. Other differences are more intrinsic to the goal of the optimizers: pages that are ranked highly by a search engine must, by definition, differ from average pages. For example, effective link-spam requires pages to have a high in-degree, while effective keyword spam requires pages to contain many popular terms. This paper describes a number of properties that we have found to be indicative of spam web pages. These properties include:

11. REFERENCES [1] E. Amitay, D. Carmel, A. Darlow, R. Lempel and A. Soffer. The Connectivity Sonar: Detecting Site Functionality by Structural Patterns. In 14th ACM Conference on Hypertext and Hypermedia, Aug. 2003. [2] K. Bharat, B. Chang, M. Henzinger, and M. Ruhl. Who Links to Whom: Mining Linkage between Web Sites. In 2001 IEEE International Conference on Data Mining, Nov. 2001. [3] A. Broder, S. Glassman, M. Manasse and G. Zweig. Syntactic Clustering of the Web. In 6th International World Wide Web Conference, Apr. 1997. [4] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins and J. Wiener. Graph Structure in the Web. In 9th International World Wide Web Conference, May 2000. [5] A. Broder, M. Najork and J. Wiener. Efficient URL Caching for World Wide Web Crawling. In 12th International World Wide Web Conference, May 2003. [6] J. Cho and H. Garcia-Molina. The evolution of the web and implications for an incremental crawler. In 26th International Conference on Very Large Databases, Sep. 2000. [7] B. Davison. Recognizing Nepotistic Links on the Web. In AAAI-2000 Workshop on Artificial Intelligence for Web Search, July 2000. [8] D. Fetterly, M. Manasse, M. Najork and J. Wiener. A large-scale study of the evolution of web pages. In 12th International World Wide Web Conference, May 2003. [9] D. Fetterly, M. Manasse and M. Najork. On the Evolution of Clusters of Near-Duplicate Web Pages. In 1st Latin American Web Congress, Nov. 2003. [10] M. Henzinger, R. Motwani, C. Silverstein. Challenges in Web Search Engines. SIGIR Forum 36(2), 2002. [11] L. Page, S. Brin, R. Motwani and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report, Stanford Digital Libraries Technologies Project, Jan. 1998.

• various features of the host component of a URL, • IP addresses referred to by an excessive number of symbolic host names, • outliers in the distribution of in-degrees and out-degrees of the graph induced by web pages and the hyperlinks between them, • the rate of evolution of web pages on a given site, and • excessive replication of content. We applied all the techniques that did not require link information (that is, all techniques except for the in- and outdegree outlier detection and the host-machine-ratio technique) in concert to the DS1 data set. The techniques flagged 7,475,007 pages as spam candidates according to at least one technique (4.96% of all pages in DS1, out of an estimated 8.1% ± 2% true spam pages). The false positives, without excluding overlap between the techniques, amount to 14% of the flagged pages. Most of the false positives are due to imprecisions in the host name resolution technique. Judging from the results we observed for DS2, the techniques that we could not apply to DS1 (since it does not include linkage information) could have flagged up to an additional 1.7% of the pages in DS1 as spam candidates.

6

Querying Bi-level Information David Maier

Sudarshan Murthy

Lois Delcambre

Dept. of CSE, OGI School of Sc. & Eng. at OHSU 20000 NW Walker Road Beaverton, OR 97006, USA +1 503 748 7068

smurthy, maier, lmd @cse.ogi.edu ABSTRACT 1. INTRODUCTION

In our research on superimposed information management, we have developed applications where information elements in the superimposed layer serve to annotate, comment, restructure, and combine selections from one or more existing documents in the base layer. Base documents tend to be unstructured or semi-structured (HTML pages, Excel spreadsheets, and so on) with marks delimiting selections. Selections in the base layer can be programmatically accessed via marks to retrieve content and context. The applications we have built to date allow creation of new marks and new superimposed elements (that use marks), but they have been browse-oriented and tend to expose the line between superimposed and base layers. Here, we present a new access capability, called bilevel queries, that allows an application or user to query over both layers as a whole. Bi-level queries provide an alternative style of data integration where only relevant portions of a base document are mediated (not the whole document) and the superimposed layer can add information not present in the base layer. We discuss our framework for superimposed information management, an initial implementation of a bi-level query system with an XML Query interface, and suggest mechanisms to improve scalability and performance.

You are conducting background research for a paper you are writing. You have found relevant information in a variety of sources: HTML pages on the web, PDF documents on the web and on your SIGMOD anthology of CDs, Excel spreadsheets and Word documents from your past work in a related area, and so on. You identify relevant portions of the documents and add annotations with clarifications, questions, and conclusions. As you collect information, you frequently reorganize the information you have collected thus far (and your added annotations) to reflect your perspective. You intentionally keep your information structure loose so you can easily move things around. When you have collected sufficient information, you import it, along with your comments, in to a word-processor document. As you write your paper in your word-processor, you revisit your sources to see information in its context. Also, as you write your paper you reorganize its contents, including the imported information, to suit the flow. Occasionally, you search the imported annotations, selections, and the context of the selections. You mix some of the imported information with other information in the paper and transform the mixture to suit presentation needs. Most researchers will be familiar with manual approaches to the scenario we have just described. Providing computer support for this scenario requires a toolset with the following capabilities: 1. Select portions of documents of many kinds (PDF, HTML, etc.) in many locations (web, CD, local file system, etc.), and record the selections.

Categories and Subject Descriptors

2. Create and associate annotations (of varying structure) with document selections.

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval---Information filtering, Retrieval models, H.2.5 [Database Management]: Heterogeneous Databases.

3. Group and link document selections and annotations, reorganize them as needed, and possibly even maintain multiple organizations.

General Terms Management, Performance, Design.

4. See a document selection in its context by opening the document and navigating to the selected region, or access the context of a selection without launching its original document.

Keywords Bi-level queries, SPARCE, Superimposed information management, Information integration.

5. Place document selections and annotations in traditional documents (such as the word-processor document that contains your paper).

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Seventh International Workshop on the Web and Databases (WebDB 2004) June 17–18, 2004, Paris, France. Copyright author/owner.

6. Search and transform a mixture of document selections, annotations, and other information. Systems that support some subset of these capabilities exist, but no one system supports the complete set. It is hard to use a collection of systems to get the full set of features because the systems do not interoperate well. Some hypertext systems can

7

as these operate on both superimposed information and base information. Consequently, we call them bi-level queries.

create multiple organizations of the same information, but they tend to lack in the types of source, granularity of information, or the location of information consulted. For example, Dexter [6] requires all information consulted to be stored in its proprietary database. Compound document systems can address subdocuments, but they tend to have many display constraints. For example, OLE 2 [9] relies on original applications to render information. Neither type of system supports querying a mixture of document selections and annotations. Superimposed information management is an alternative solution for organizing heterogeneous in situ information, at document and sub-document granularity. Superimposed information (such as annotations) refers to data placed over existing information sources (base information) to help organize, access, connect and reuse information elements in those sources [8]. In our previous work [12], we have described the Superimposed Pluggable Architecture for Contexts and Excerpts (SPARCE), a middleware for superimposed information management, and presented some superimposed applications built using SPARCE. Together they support Capabilities 1 through 4. In this paper, we show how SPARCE can be used to support Capability 6. Details of support for Capability 5 are outside the scope of this paper. Before we proceed with the details of how we support Capability 6, we introduce a superimposed application called RIDPad [12]. Figure 1 shows a RIDPad document that contains information selections and annotations related to the topic of information integration. The document shown contains eight items: CLIO, Definition, SchemaSQL, Related Systems, Goal, Model, Query Optimizer, and Press. These items are associated with six distinct base documents of three kinds—PDF, Excel, and HTML. An item has a name, a descriptive text, and a reference (called a mark) to a selection in a base document. For example, the item labeled ‘Goal’ contains a mark into a PDF document. The boxes labeled Schematic Heterogeneity and Garlic are groups. A group is a named collection of items and other groups. A RIDPad document is a collection of items and groups.

Figure 1: A RIDPad document. There are many possible choices on how to present the contents of superimposed documents (such as the RIDPad document in Figure 1) and base documents for querying. We could make the division between the superimposed and base documents obvious and let the user explicitly follow marks from superimposed information to base information. Instead, our approach is to integrate a superimposed document’s contents and related base information to present a uniform representation of the integrated information for querying. The rest of this paper is organized as follows: Section 2 provides an overview of SPARCE. Section 3 provides an overview of bilevel query systems and describes a naïve implementation of a bilevel query system along with some example bi-level queries. Section 4 discusses some applications and implementation alternatives for bi-level query systems. Section 5 briefly reviews related work. Section 6 summarizes the paper.

RIDPad affords many operations for items and groups. A user can create new items and groups, and move items between groups. The user can also rename, resize, and change visual characteristics such as color and font for items and groups. With the mark associated with an item, the user can navigate to the base layer if necessary, or examine the mark’s properties and browse context information (such as containing paragraph) from within RIDPad via a reusable Context Browser we have built.

We use the RIDPad document in Figure 1 for all examples in this paper.

2. SPARCE OVERVIEW

The operations RIDPAD affords are at the level of items and groups. However, we have seen the need to query and manipulate a RIDPad document and its base documents as a whole. For example, possible queries over the RIDPad document in Figure 1 include:

The Superimposed Pluggable Architecture for Contexts and Excerpts (SPARCE) facilitates management of marks and context information in the setting of superimposed information management [12]. A mark is an abstraction of a selection in a base document. Several mark implementations exist, typically one per base type (PDF, HTML, Excel, and so on). A mark implementation chooses an addressing scheme appropriate for the base type it supports. For example, an MS Word mark implementation uses the starting and ending character index of a text selection, whereas an MS Excel mark uses the row and column names of the first and last cell in the selection. All mark implementations provide a common interface to address base information, regardless of base type or access protocol they

Q1: List base documents used in this RIDPad document. Q2: Show abstracts of papers related to Garlic. Q3: Create an HTML table of contents from the groups and items. Query Q1 examines the paths to base documents of marks associated with items in the RIDPad document. Q2 examines the context of marks of items in the group labeled ‘Schematic Heterogeneity.’ Q3 transforms the contents of the RIDPad document to another form (table of contents). In general, queries such

8

support. A superimposed application can work uniformly with any base type due to this common interface.

Superimposed Application

Context is information concerning a base-layer element. Presentation information such as font name, containment information such as enclosing paragraph and section, and placement information such as line number are examples of context information. An Excerpt is the content of a marked base-layer element. (We treat an excerpt also as a context element.) Figure 2 shows the PDF mark corresponding to the item ‘Goal’ (of the RIDPad document in Figure 1) activated. The highlighted portion is the marked region. Table 1 shows some of the context elements for this mark.

Clipboard

Mark Management Superimposed Information Management

Base Application

Context Management

Figure 3: SPARCE architecture reference model. A superimposed application allows creation of information elements (such as annotations) associated with marks. It can use an information model of its choice (SPARCE does not impose a model) and the model may vary from one application to another. For example, RIDPad uses a group-item model (simple nesting), whereas the Schematics Browser, another application we have built, uses an ER model [2, 12]. The superimposed model may be different from any of the base models. A detailed description of SPARCE is available in our previous work [12].

3. BI-LEVEL QUERY SYSTEM A bi-level query system allows a superimposed application and its user to query the superimposed information and base information as a whole. User queries are in a language appropriate to the superimposed model. For example, XQuery may be the query language if the superimposed model is XML (or a model that can be mapped to XML), whereas SQL may be the query language if superimposed information is in the relational model. Figure 2: A PDF mark activated. Superimposed Info

Figure 3 shows the SPARCE architecture reference model. The Mark Management module is responsible for operations on marks (such as creating and storing marks). The Context Management module retrieves context information. The Superimposed Information Management module provides storage service to superimposed applications. The Clipboard is used for interprocess communication.

Model Transformers

Table 1: Some context elements of a PDF mark. Element name

Value

Excerpt

provide applications and users with … Garlic system

Font name

Times New Roman

Enclosing paragraph

Loosely speaking, the goal …

Section Heading

Garlic Overview

Mark Info

Query

Query Processor

Base Info 1

Context Agents

Base Info n

Result

Figure 4: Overview of a bi-level query system. Figure 4 provides an overview of a bi-level query system. An oval in the figure represents an information source. A rectangle denotes a process that manipulates information. Arrows indicate data flow. The query processor accepts three kinds of information—superimposed, mark, and context. Model transformers transform information from the three sources in to model(s) appropriate for querying. One of these transformers, the context transformer, is responsible for transforming context information. We restrict bi-level query systems to use only one superimposed model at a time, for practical reasons. Choosing a query language and the model for the result can be hard if superimposed models are mixed.

SPARCE uses mediators [13] called context agents to interact with different base types. A context agent is responsible for resolving a mark and returning the set of context elements appropriate to that mark. A context agent is different from mediators used in other systems because it only mediates portions of base document a mark refers to. For example, if a mark refers to the first three lines of a PDF document, the mark’s context agent mediates those three lines and other regions immediately around the lines. A user could retrieve broader context information for this mark, but the agent will not do so by default.

9

3.1 Implementation

Q2: Show abstracts of papers related to Garlic.

We have implemented a naïve bi-level query system for the XML superimposed model. We have developed a transformer to convert RIDPad information to XML. We have developed a context transformer to convert context information to XML. We are able to use mark information without any transformation since SPARCE already represents that information in XML. User queries can be in XPath, XSLT, and XQuery. We use Microsoft’s XML SDK 4.0 [10] and XQuery demo implementation [11] to process queries.

This query must examine the context of items in the group labeled ‘Garlic.’ The following XPath expression suffices. This expression returns the text of a context element whose name attribute is ‘Abstract’, but only for items in the required group.

We use three XML elements to represent RIDPad information in XML— for the document, for a group, and for an item. For each RIDPad item, the system creates four children nodes in the corresponding element. These children nodes correspond to the mark, container (base document where the mark is made), application, and context. We currently transform the entire context of the mark. The XML data is regenerated if the RIDPad document changes.

We use an XSLT style-sheet to generate a table of contents (TOC) from a RIDPad document. Figure 6 shows the query in the left panel and its results in the right panel. The right panel embeds an instance of MS Internet Explorer. The result contains one list item (HTML LI tag) for each group in the RIDPad document. There is also one list sub-item (also an HTML LI tag) for each item in a group. The group-less item CLIO is in the list titled ‘Other Items.’ A user can save the HTML results, and open it in any browser outside our system.

//Group[@name='Garlic']/Item/Context//Elemen t[@name='Abstract']/text() Q3: Create an HTML table of contents from the groups and items.

Figure 5: Partial XML data from a RIDPad document. Figure 5 shows partial XML data generated from the RIDPad document in Figure 1. It contains two elements (corresponding to the two groups in Figure 1). The ‘Garlic’ element contains four elements (one for each item in that group in Figure 1). There is also an element for the group-less item CLIO. The element for ‘Goal’ is partially expanded to reveal the , , , and elements it contains. Contents of these elements are not shown.

The HTML TOC in Figure 6 shows that each item has a hyperlink (HTML A tag) attached to it. A hyperlink is constructed using a custom URL naming scheme and handled using a custom handler. Custom URLs are one means of implementing Capability 5 identified in Section 1.

3.2 Example Bi-level Queries

4. DISCUSSION

Figure 6: RIDPAD document transformed to HTML TOC.

We now provide bi-level query expressions for the queries Q1 to Q3 listed in Section 1.

The strength of the current implementation is that it retrieves context information for only those parts of base documents that the superimposed document refers to (via marks). Interestingly, the same is also its weakness: it retrieves context information for all parts of the base documents the superimposed document refers to, regardless of whether executing a query requires those elements. For example, only Query Q2 looks at context information (Q1 looks only at container information, Q3 looks at superimposed information and mark information). However, the XML data generated includes context information for all queries. Generating data in this manner is both inefficient and unnecessary— information may be replicated (different items may use the same mark), and context information can be rather large (the size of the complete context of a mark could exceed the size of its docu-

Q1: List base documents used in this RIDPad document. This query must retrieve the path to the base document of the mark associated with each item in a RIDPad document. The following XQuery expression does just that. The Location element in the Container element contains the path to the document corresponding to the mark associated with an item. {FOR $l IN document("source")//Item/Container/Location RETURN {$l/text()} }

10

capability to construct and format (on demand) superimposed information elements themselves. For example, a RIDPad item’s name may be a section heading. Such a representation of an item could be expressed as the result of a query or a transformation.

ment), depending on what context elements a context agent provides. It is possible to get the same results by separating RIDPad data from the rest and joining the various information sources. Doing so preserves the layers, and potentially reduces the size of data generated. Also, it is possible to execute a query incrementally and only generate or transform data that qualifies in each stage of execution. Figure 7 gives an idea of the proposed change to the schema of the XML data generated. Comparing with the Goal Item element of Figure 5, we see that mark, container, application, and context information are no longer nested inside the Item element. Instead, an element has a new attribute called markID. In the revised schema, the RIDPad data, mark, container, application, and context information exist independently in separate documents, with references linking them. With the revised schema, no context information would be retrieved for Query Q1. Context information would be retrieved only for items in the ‘Schematic Heterogeneity’ group when Q2 is executed.

Figure 8: A RIDPad document with relationships. Bi-level queries could also be used for repurposing information. For example, Query Q3 could be extended to include the contents of items (instead of just names) and transform the entire RIDPad document to HTML (like in Figure 6). The HTML version can then be published on the web.

Figure 7: XML data in the revised schema. Preserving the layers of data has some disadvantages. A major disadvantage is that a user will need to use joins to connect data across layers. Such queries tend to be error-prone, and writing them can take too much time and effort. A solution would be to allow a user to write bi-level queries as they currently do (against a schema corresponding to the data in Figure 5), and have the system rewrite the query to match the underlying XML schema (as in Figure 7). That is, user queries would actually be expressed against a view of the actual data. We are currently pursuing this approach to bi-level querying.

We have demonstrated bi-level queries using XML query languages, but superimposed applications might benefit from other query languages. The choice of the query language depends largely on the superimposed information model (which in turn depends on the task at hand). More than one query language may be appropriate for some superimposed information models, in some superimposed applications. For example, both CXPath [3] and XQuery may be appropriate for some applications that use the XML superimposed model.

Our current approach of grabbing context information for all marks could be helpful in some cases. For example, if a query workload ends up retrieving context of all (or most) marks, the current approach is similar to materializing views, and could lead to faster overall query execution.

The base applications we have worked with so far do not themselves have query capabilities. If access to context or a selection over context elements can be posed as a query in a base application, we might benefit from applying distributed queryprocessing techniques. Finally, the scope of a bi-level query is currently the superimposed layer and the base information accessible via the marks used. Some applications might benefit from including marks generated automatically (for example, using IR techniques) in the scope of a query.

The current implementation does not exploit relationships between superimposed information elements. For example, Figure 8 shows the RIDPad document in Figure 1 enhanced with two relationships ‘Uses’ and ‘Addresses’ from the item CLIO. A user may exploit these relationships, to pose richer queries and possibly recall more information. For example, with the RIDPad document in Figure 8, a user could now pose the following queries: What system does CLIO use? How is CLIO related to SchemaSQL?

5. RELATED WORK SPARCE differs from mediated systems such as Garlic [4] and MIX [1]. Sources are registered with SPARCE simply by the act of mark creation in those sources. Unlike in Garlic there is no need to register a source and define its schema. Unlike MIX, SPARCE does not require a DTD for a source.

Our initial use anticipated for bi-level queries was to query superimposed and base information as a whole, but we have noticed that superimposed application developers and users could use the

11

METAXPath [5] allows a user to attach metadata to XML elements. It enhances XPath with an ‘up-shift’ operator to navigate from data to metadata (and metadata to meta-metadata, and so on). A user can start at any level, but only cross between levels in an upwards direction. In our system, it is possible to move both upwards and downwards between levels. METAXPath is designed to attach only metadata to data. A superimposed information element can be used to represent metadata about a base-layer element, but it has many other uses.

Based Information Mediation with MIX. In Proceedings of the SIGMOD conference on Management of Data (Philadelphia, June, 1999). ACM Press, New York, NY, 1999, 597-599. [2] Bowers, S., Delcambre, L. and Maier, D. Superimposed Schematics: Introducing E-R Structure for In-Situ Information Selections. In Proceedings of ER 2002 (Tampere, Finland, October 7-11, 2002). Springer LNCS 2503, 2002. 90–104.

CXPath [3] is an XPath-like query language to query concepts, not elements. The names used in query expressions are concept names, not element names. In the CXPath model there is no document root—all concepts are accessible from anywhere. For example, the CXPath expression ‘/Item’ and ‘Item’ are equivalent. They both return all Item elements when applied to the XML data in Figure 5. The ‘/’ used for navigation in XPath follows a relationship (possibly named) in CXPath. For example, the expression “/Item/{Uses}Group” returns all groups that are related to an item by the ‘Uses’ relationship when applied to an XML representation of the RIDPad in Figure 8. CXPath uses predefined mappings to translate CXPath expressions to XPath expressions. There is one mapping for each concept name and for each direction of every relationship of every XML source. In our system, we intend to support multiple sources without predefined mappings, but we would like our query system to operate at a conceptual level like CXPath does.

[3] Camillo, S.D., Heuser, C.A., and Mello, R. Querying Heterogeneous XML Sources through a Conceptual Schema. In Proceedings of ER 2003 (Chicago, October 13-16, 2003). Springer LNCS 2813, 2003. 186–199. [4] Carey, M.J., Haas, L.M., Schwarz, P.M., Arya, M., Cody, W.F., Fagin, R., Flickner, M., Luniewski, A.W., Niblack, W., Petkovic, D., Thomas, J., Williams, J.H., and Wimmers, E.L. Towards heterogeneous multimedia information systems: The Garlic approach. IBM Technical Report RJ 9911, 1994. [5] Dyreson, C.E., Bohlen, M.H., and Jensen, C.S. METAXPath. In Proceedings of the International Conference on Dublin Core and Metadata Applications (Tokyo, Japan, October 2001). 2001, 17-23. [6] Halasz, F.G., and Schwartz, F. The Dexter Hypertext Reference Model. Communications of the ACM, 37, 2, 3039.

As discussed in Section 4, preserving the layers of data, yet allowing a user to express queries as if all data is in one layer means queries are expressed against views. Information Manifold [7] provides useful insight in to how heterogeneous source may be queried via views. That system associates a capability record with each source to describe its inputs, outputs, and selection capabilities. We currently do not have such a notion in our system, but we expect to consider source descriptions in the context of distributed query processing mentioned in Section 4.

[7] Levy, A.Y., Rajaraman, A., and Ordille, J.J. Querying heterogeneous information sources using source descriptions. In Proceedings of VLDB (Bombay, India 1996). 251-262. [8] Maier, D., and Delcambre, L. Superimposed Information for the Internet. In Informal Proceedings of WebDB ’99 (Philadelphia, June 3-4, 1999). 1-9. [9] Microsoft. COM: The Component Object Model Specification, Microsoft Corporation. 1995.

6. SUMMARY

[10] Microsoft. MS XML 4.0 Software Development Kit. Microsoft Corporation. Available online at http://msdn.microsoft.com/

Our existing framework for superimposed applications supports examination and manipulation of individual superimposed and base information elements. More global ways to search and manipulate information become necessary as the size and number of documents gets larger. A bi-level query system is a first step in that direction. We have an initial implementation of a query system, but still have a large space of design options to explore.

[11] Microsoft. XQuery Demo. Microsoft Corporation. Available online at http://xqueryservices.com/ [12] Murthy, S., Maier, D., Delcambre, L., and Bowers, S. Putting Integrated Information in Context: Superimposing Conceptual Models with SPARCE. In Proceedings of the First Asia-Pacific Conference of Conceptual Modeling (Dunedin, New Zealand, Jan. 22, 2004). 71-80.

7. ACKNOWLEDGMENTS This work was supported in part by US NSF Grant IIS-0086002. We thank all reviewers.

[13] Wiederhold, G. Mediators in the architecture of future information systems. IEEE Computer, 25, 3 (March 1992). 38–49.

8. REFERENCES [1] Baru, C., Gupta, A., Ludäscher, B., Marciano, R., Papakonstantinou, Y., Velikhov, P., and Chu, V. XML-

12

Visualizing and Discovering Web Navigational Patterns Jiyang Chen, Lisheng Sun, Osmar R. Za¨ıane, Randy Goebel Department of Computing Science, University of Alberta Edmonton, Alberta, Canada {jiyang, lisheng, zaiane, goebel}@cs.ualberta.ca

ABSTRACT

expected?”, “Where are people leaving my site?” or “What is the average viewing time of a particular document?”, etc.

Web site structures are complex to analyze. Cross-referencing the web structure with navigational behaviour adds to the complexity of the analysis. However, this convoluted analysis is necessary to discover useful patterns and understand the navigational behaviour of web site visitors, whether to improve web site structures, provide intelligent on-line tools or offer support to human decision makers. Moreover, interactive investigation of web access logs is often desired since it allows ad hoc discovery and examination of patterns not a priori known. Various visualization tools have been provided for this task but they often lack the functionality to conveniently generate new patterns. In this paper we propose a visualization tool to visualize web graphs, representations of web structure overlaid with information and pattern tiers. We also propose a web graph algebra to manipulate and combine web graphs and their layers in order to discover new patterns in an ad hoc manner.

Visualization tools for web site structure, web site usage data and navigational patterns are paramount because to better understand events on a site or consequences of changes and design on a web site, it is important to have a big picture of the state of the site and the activities on the web site. Moreover, while there are myriad algorithms and web mining tools to discover useful patterns from within web usage logs, the discovered patterns are so numerous and complex that it is difficult to analyze them and understand them without their web site context. A visualization tool to represent the discovered patterns and overlap them over a comprehensive web site structure depiction is capital for the evaluation and interpretation of web mining results. A depiction of the facts with cues such as colour, thickness and shape of vertexes and edges is easier to understand and use to spot anomalies or interesting phenomenon than a long list of patterns, rules and numbers provided by data mining algorithms. Furthermore, visualization in its own is not enough. Interactive visualization can help evaluate data mining results more effectively. We believe that visualization can and should be even a means for data mining. Visual Data mining has already been used in particular for classification [1]. We consider here the visualization of cues representing web information, and the means to manipulate and operate on these visual cues to discover useful knowledge in an ad hoc manner. We call this process visual web mining.

Keywords Web Visualization, Web Graph Algebra, Interactive Web Mining, Visual Data Mining

1.

INTRODUCTION

Web information visualization is of interest to two major groups of people[8]. On one hand, web users and information seekers need information visualization tools to facilitate their browsing experience. On the other hand, web site administrators and information publishers need visualization tools to assist them in their maintenance, publishing or analysis work. While web users are assisted by many navigation recommendation systems and other tools for information retrieval, web site administrators and decision makers are still lacking the needed effective visualization tools to help them discover and explain the interesting navigational patterns within their web site usage data. Website administrators are concerned with a site’s effectiveness, asking questions such as “Are people entering and visiting my site in the way I had

In this paper, we describe a visual data mining system which uses a new algebra to operate on web graph objects to highlight interesting characteristics of web data, and consequently help discover new patterns and interesting hidden facts in web navigational data. Our system, Web Knowledge Visualization and Discovery System(WEBKVDS), is composed of two main parts: FootPath and Web Graph Algebra. FootPath renders a 2D representation of a web structure overlaid with tiers of various data, called a web Graph. The algebra operates on graphs and their layers to generate other graphs highlighting relevant information. The contributions of the paper are as follows. First, we propose the concept of Web Graph, a multi-tier object comprised of layers of data or patterns overlaid over a web structure representation. We adopt the idea of a disk-tree for the web structure representation [4]. However, we propose

Copyright is held by the author/owner. Seventh International Workshop on the Web and Databases (WebDB 2004), June 17-18, 2004, Paris, France.

13

a different approach to transform the web topology into a 2D tree. Moreover, the depiction is not static but rather dynamically adapts to the representation of the layers of data and parameter settings (i.e. thickness and number of links). Finally, we propose an algebra to operate on a series of web graphs for the purpose of ad hoc visual web mining. Contrary to most of the previous visualization systems that usually only visualize the web data, the system we present is unique in that several operations are designed to mine the rendered visualizations to achieve more valuable patterns.

A recent preliminary proposal uses 3D representations rather than a 2D layout techniques to visualize the web structure and map it with web usage data [10]. However, the work is preoccupied with fashionable 3D visualization issues rather than mining the information itself or making the visualized data more accessible. As a result, the 3D images are information-rich but extremely difficult for an ordinary user to understand. Moreover, there is a significant issue with occlusion, where objects obstruct others and prevent appropriate visualization. Thus, its importance as an information visualization tool, whose objective is to explain and help users understand the information, is very limited.

The remainder of the paper is organized as follows. We review related web visualization work in Section 2. Section 3 introduces our WEBKVDS system and identifies the different objects we intend to manipulate and operate on. FootPath, which renders web graphs is detailed in Section 4. The Web Graph Algebra is introduced in Section 5. Finally we summarize our contributions in Section 6.

2.

Although none of the existing visualization tools presented above include the possibility of data manipulation and operation based on visualization, they do provide a universal background of where the research sits, and motivates our design of visual mining in web domains. To the best of our knowledge, there is no published work describing any kind of operators to manipulate graphs representing web structure or web usage data. The only relevant work worth mentioning is map-algebra [9]. It pertains to geographical information systems and provides operators to combine information layered on geographical maps. Our web algebra and the details of the layered web graphs we manipulate are described in the following sections.

RELATED WORK

Most of the existing tools for web visualizations concentrate on representing the relationship among the web usage data, users’ behaviour and the web structure. Among them, DiskTree [3, 4], designed by Ed Chi et al., was the first to display the web usage and structure information together by mapping the usage data on the structural objects. The graph representing the web site structure is collapsed into a ”disc” using a breadth-first search algorithm. If a page is linked from many other pages, only the first link found with the breadth-first search is kept and represented. This technique has been used to visualize web site evolution, web usage trends over time, and evaluation of information foraging. The concept of Time Tube Visualization, which is a series of visualizations constructed at different time periods but aligned together for comparison, was proposed in [4].

3. VISUALIZATION & DISCOVERY The Web Knowledge Visualization and Discovery System (WEBKVDS) is mainly composed of two parts: 1- FootPath: for visualizing the web structure with the different data and pattern layers. 2- Web Graph Algebra: for manipulating and operating on the web graph objects for visual data mining. In this section, we first present our framework with the details of the web graph objects and terminology we use, then discuss the architecture of our visualization and discovery system. The rendering engine, FootPath, and the algebra are presented in subsequent sections.

Minar’s Crowds Dynamics [5] is an animation visualization system, which treats a web site as a social, active space and illustrate how the web users navigate the web. The dynamic animation is simple but information-rich since it acts like a documentary movie for real-time user behaviour. However, the system doesn’t scale well for large web sites. First, the web site structure is hand made, the system does include the function to generate the web site map automatically. More importantly, rather than showing what web users have done, the system visualizes what web users are doing, making any analysis less revealing.

In addition to visualizing data from web access logs and visualizing patterns derived from web mining processes, the main goal steering the design of our system is to provide means to interactively play with visualization objects in order to perform ad hoc visual data mining and interpretation of the discoveries. To achieve this, our framework encompasses the definition of: 1. objects that include navigational data and patterns with their context, 2. tools to render these objects, and 3. ways to interact with these objects and their visualization. The navigational data are statistics extracted from pre-processed web logs, navigational patterns are patterns discovered with web mining techniques while the context of these is simply the structure of the web site.

Although web visualization is a complex task, researchers can begin to understand its characteristics by decomposing the task into the three dimension of scale, first proposed in [7], of web structure visualization, representation of aggregate and individual navigation usage behaviour, and the comparative display of navigation improvement methods. The authors also present Web Knowledge and Information Visualization (WEBKIV), a system that combines the strategies from several other web visualization tools, as part of the WebFrame project [11], which presents a initial development of a framework for gathering, analyzing, and re-deploying web data. However, WEBKIV lacks the functionality to operate on the graphs, presents a static graph, and uses the same web topology rendering algorithm as DiskTree. Our system is different on these three accounts.

There are tools to visualize discovered web patterns such as visitation click-stream sequences [2] or other regularities. However, patterns are visualized outside their context - i.e, the web site where they occur. Our goal is to render discovered patterns overlaid on top of the web structure visualization (i.e. in context). The idea of visualizing web usage data on relevant web struc-

14

ture was proposed before to help people better understand the statistics of web accesses [3, 4, 7, 10]. However, the visualization is static and the proposed tools do not provide the visual interaction that leads to new discoveries, which is known as visual data mining. To overcome the above challenges, we introduce a visualization object that can both be rendered, and manipulated for interpretation and application. This object, which we call a Web Graph, assembles together data and knowledge about navigational behaviour in their web structure context.

A Vertex size represents the NumberofVisits Layer

B

Arc thickness represents LinkUsage Layer

3.1 Visualization Objects The object that we use both for visualizing the web information and for expressing web mining operations is a web graph. It combines all the necessary data regarding the web structure, the usage data and whatever patterns available. A web graph is also a multi-tier object. The first tier, called a web image, is a tree representing the structure of a given web site (or subset of it). Each other tier, called an information layer, represents some statistics about web pages or the links connecting them. Layers can also contain patterns about the pages and links. Combining the tiers allows putting the navigational data in its web structural context. Each layer is identified separately and when the object is displayed, layers can be inhibited or rendered. The tree representing the web structure is always displayed as a background of the information layers, allowing the localization of any information vis-` a-vis the web site. Here we enumerate and summarize the terms and concepts we use:

C

D

Vertex colour represents the ViewTime Layer

Arc colour represents the ProbUsage Layer

Colour Images of all the graphs are available at http://db.cs.ualberta.ca/webkvds/

Figure 1: Visualization of Multiple Layers

Web Image: Also referred to as bare graph, is a tree representing the structure of a web site or a subset of it. It has a root, which is a starting page, and a certain given depth. Each node in the tree represents a web page and an edge represents a hyperlink between pages. A web site is in reality a graph, but it is collapsed into a tree as explained in Section 4. When visualized, the tree is displayed as a disc with the centre being the root, and the nodes of each level displayed on a circular perimeter, each level successively away from the centre. Figure 1 illustrates such discs. Note that all graphs illustrated in this paper are only 2 levels deep, but could be arbitrarily deeper.

Support= 0.4% Confidence = 40%

Figure 2: Association Rule Layer

Pattern layers: While information layers compile statistical data collected after cleaning and pre-processing the web access log, pattern layers hold patterns discovered using data mining algorithms such as sequence analysis, clustering, and association mining, etc. When displayed, they play the role of visualization for evaluation of patterns in the knowledge discovery process. Currently we have identified three such layers: Association Rules, Page Clustering, and Page Classification. Figure 2 shows a visualization of the Association Rule layer. The thickness represents the support of a rule, the arrow head size represents the confidence of the rule, while the colour can represent any defined interestingness measure. For clustering and classification, the class labels are identified by the colour of nodes in the graph.

Information Layers: Similar to the map layers geographers manipulate in GIS, we represent various web usage information as layers. An information layer is a logical collection of web data abstractions that can be laid over the web image. We identify here four different layers. These basic layers are prescribed by the identified useful expressions of the Web Graph algebra we will present later. However, other information layers are also possible. The four current basic layers are: (1) NumofVisit layer giving the visit statistics per page; (2) LinkUsage layer showing the usage statistics of links; (3) ViewTime layer representing the average access time per page; and (4) ProbUsage layer giving the access probability of the links. Visualizing these information layers is achieved using visual cues such as colour, size and shape of nodes, as well as colour and thickness of links. Figure 1 illustrates the visualization of these layers overlaid on the web image. Note that combining different visual cues, different information layers can be shown together.

Web Graph: A web image with one or more information or pattern layers constitutes a web graph.

3.2 System Architecture

15

Figure 3 illustrates the architecture of our system WEBKVDS. Given a web site and its web access log, a web image is generated using the structure of the site and the pre-processed web log. The data obtained after pre-processing the web access log is also used both to generate information layers and as input to some data mining modules. The patterns discovered by these data mining modules are used to generate pattern layers. The different layers and the web images are used to create web graphs, which are displayed using a rendering engine: FootPath. The web graph algebra allows the creation of new graphs, again displayed with FootPath. The combination of the algebra expressions and the visualization with FootPath constitutes the basis of our Visual Web Mining system.

4.

more “aerated” web image (i.e. sparse) without missing any page. Notice that no page is lost from the web graph, but like with any other strategy, some links in the site topology are not represented.

4.2 Dynamic Layout The more links are used or pages are visited, the thicker or bigger they appear in our web graph visualization, hence the name FootPath. In addition to the size of nodes and the thickness of edges, Footpath also uses colours and shapes of nodes as well as colours of edges to represent other information. It is the user that associates these visual cues to the different available information layers. The user, at display time, also chooses some parameters to indicate ranges inside which nodes or links are interesting. Other nodes and edges are eliminated from the visualization. To avoid occlusion because of various thicknesses and sizes, and to take advantage of space made available by dropping irrelevant edges and nodes identified by the algebra operators, we provide a dynamic layout. This method uses more efficiently the disc area by dividing the 360 degrees of the out most perimeter by the remaining number of nodes on the last level weighted by their respective edge thickness or node diameter. Parents, on the next level, are positioned in the middle of their children, and the same is done for each upper level up to the root. This process for distributing the nodes and edges on the disc is redone whenever necessary during the interactive visualization.

FOOTPATH

FootPath is the rendering engine of our visualization and discovery system. A web graph is displayed by first rendering the web image and then attributing visual characteristics to nodes and edges such as colour, thickness etc., to represent data from information layers. For efficiently visualizing the web topology in a web image, we adopted the Disktree representation proposed in [4], in which a node indicates a page and an edge represents the hyperlink that connects two pages; the root, a node located in the center of the graph, is usually the home page of a site or a given starting page for the analysis at hand. Nodes on the same perimeter are from the same level with regard to the root. The consecutive levels form a disc on top of which statistics and patterns are displayed. However, the novelty of FootPath is two folds, first we adopt a new strategy for collapsing the web topology into a tree structure exploiting usage data in addition to the structure of the site; second, we deal with the problem of occlusion, due to the additional visual characteristics representing information layers, by using a dynamic layout that redistributes the nodes of each level along the 360 degree perimeter.

5. WEB GRAPH ALGEBRA We propose an algebra, Web Graph Algebra, to manipulate and produce web graphs. A combination of unary and binary operators can form expressions and equations. Variables in our algebra are web graphs. Unary operators extract sub-parts of the underlying web structure or given information layers. Binary operators combine two web graphs by evaluating arithmetic expressions on the common information layers as well as the nodes and edges constituting the web images. Essentially, the algebra mechanism is designed to assist information layer manipulation and web analysis visualization once the resulting web graph is rendered by the FootPath tool.

4.1 Web Image rendering The Disktree algorithm, and many other other algorithms representing web topology in a tree structure, use a simple Breadth First Search (BFS) method to convert a connected graph into a tree. The idea is that for each node, only one link coming to it is represented. The BFS strategy is simply to represent the link that first leads to the node following the scan of the web topology. The popularity of this approach is due to its simplicity. Depth First Search (DFS) is another alternative. However, a study comparing both the BFS and the DFS strategies for a tree representation of web topology indicates BFS as a better choice in terms of ballance of the tree[6]. One disadvantage of BFS is the sheer number of nodes in the first levels of the disc making it difficult to overlay additional information without occluding or compressing the edges and nodes in the graph.

The idea of the algebra is similar to the Map Algebra[9], a general set of conventions, capabilities, and techniques that have been widely adopted for use with geographic information systems (GIS). Analogically, while the GIS researchers operate the geographic layers on a map, we intend to do web usage mining and data operations on a web graph. In the following, because of lack of space, we enumerate only one unary operator as example, and some of the proposed binary operators of the algebra for web graphs. Operator FILTER: θ = F LTLayer,threshold (α) The unary operator FILTER selects from α the objects (i.e. nodes and edges), whose layer contents fits within the given threshold parameters. The FILTER operator not only is used to refine the graph, but also acts as a basic step for many other operations and further analysis manipulations.

To better visualize the website’s structure, we introduce a new method, a Usage-Based method. The idea is simple and is integrated into the BFS. Since, in order to collapse a connected graph into a tree only one incoming link is kept per page, we chose to keep the link with the highest usage count in the web access log while avoiding cycles in the graph, and avoiding disconnecting the tree. This strategy leads to deeper trees but less nodes per level. This in turn gives a

Operator ADD: θ =α+β The binary operator ADD selects the objects that exist in both α and β, and transposes them into θ with the sum

16

FootPath Web Graph Rendering Engine

GUI

Web Image Web Site

Web Graph Algebra Operations

Web Usage Data Web Access Log

Information Layers Web Graph

Preprocessing

Web Mining

Web Usage Pattern

Ad-hoc Visual Mining

Figure 3: Visualization System Architecture and β, and transposes them into θ with the minimum content value for those layers that belong to both graphs. For example, a node with the layers N umOf V isit and V iewT ime values respectively 100 and 50 in α, and 80 and 30 in β will be transposed in θ with N umOf V isit layer value 80 and V iewT ime layer value of 30.

−. or .−

NumOfVisit

The COMMON operator is similar to the notion of intersection and can help achieve many interesting analysis. For example, to find in a web site the set of pages considered content pages, we can combine the information about the viewing time and the visit statistics. This is performed by filtering the graph to get only those nodes that are highly visited (based on a threshold) and filtering the same graph to get those nodes that are viewed for a long time, and find the commonality between these two resulting graphs.

LinkUsage

or

=

Graph = F LTNumOf V isit,tn (G) :: F LTV iewT ime,tv (G) A

Entry Points

B

Exit Points

Operator MINUS IN: θ = α − .β and Operator MINUS OUT θ = α. − β The two similar operators subtract values across different information layers. Both variables , α and β, have to have the information layers NumOfVisit and LinkUsage. These operators are basically similar to the COMMON operator except that the NumOfVisit layer of the resulting graph θ is calculated using both layers, NumOfVisit and LinkUsage, the following way:

Figure 5: The MINUS IN & OUT Operators (Finding Entry and Exit points)

of their respective content from the available layers in both graphs. The sums are done at each individual information layer. If an object exists only in one graph, it is kept with no change and transposed into θ. This operator makes the summary analysis possible. For example, to see a visualization of two consecutive months’ data together, simply adding the graphs of the two months achieves that. Figure 4 shows the ADD operation for web graphs of two months with 2 layers, NumOfVisit and LinkUsage layers. The result represents an aggregation over a time tube as defined in [4].

For MINUS IN, the NumOfVisit value of a node N in θ is calculated by subtracting the sum of the values of all links pointing to N from the layer LinkUsage of β from the NumOfVisit of N in α. For MINUS OUT, the NumOfVisit value of a node N in θ is calculated by subtracting the sum of the values of all links pointing from N from the layer LinkUsage of β from the NumOfVisit of N in α.

Operator MINUS: θ =α−β The binary operator MINUS also acts on individual information layers. The operator MINUS selects the objects that exist in both α and β, and transfers them to θ with the content being the difference of their content in α and β. If an object only exists in one graph, it is kept with positive values if it is from α and negative values if it is from β. The MINUS operator is mostly used in a Time Tube[4] to compare the graphs of different time periods.

These operators can generate some interesting analysis, such as “Entry Points”, pages where visitors typically enter the web site represented in the web graph, and “Exit Points”, pages where visitors leave the web site. For Web Entry Analysis, we use MINUS IN operator, subtract the sum of the access to the page from any other pages that have a link to it from the number of visit of that page, therefore, the left value is how often users enter the page directly. For Web

Operator COMMON: θ = α :: β Operator COMMON selects the objects that exist in both α

17

+

NumOfVisit and LinkUsage Layers of April

=

NumOfVisit and LinkUsage Layers of March

A Two−Month Summary

Figure 4: The ADD Operator (Cumulative navigational behaviour) Exit Analysis, we use MINUS OUT operator, subtract the sum of the access to any other pages from the specific page from the number of visit of that page. Therefore, the left value is how often a user stops the navigation and leaves the site in that page. Figure 5 shows the described examples for Entry Points and Exit Points.

ural Sciences and Engineering Research Council of Canada, and by the Alberta Ingenuity Centre for Machine Learning. We would like to thank Tong Zheng for providing the data used as layers for the web graphs visualized in this paper.

7. REFERENCES [1] M. Ankerst, M. Ester, and H.-P. Kriegel. Towards an effective cooperation of the user and the computer for classification. In ACM SIGKDD Conference, pages 179–188, 2000.

Operator EXCEPT: θ=αβ Operator EXCEPT selects the objects that exist in α but not in β, and transposes them into θ with no changes in the layers. While the MINUS operator makes it possible to show the time trends, the EXCEPT operator provides new possibilities for Time Tube comparison, such as identifying the recent popular pages for a given month. By applying EXCEPT on last month’s graph from the current month’s and adjusting an appropriate FILTER threshold setting, we can show the pages that have just become “hot” recently.

[2] B. Berendt. Understanding web usage at different levels of abstraction: coarsening and visualising sequences. In ACM SIGKDD - WEBKDD workshop, San Francisco, USA, August 2001. [3] E. H. Chi. Improving web usability through visualization. IEEE Internet Computing, 6(2):64–71, March/April 2002.

These operators look simple but are in fact powerful. We have yet to discover all the possible expressions and useful equations in this algebra. Based on web log data visualizations, we can interactively generate many other explainable and information-rich visualizations.

6.

[4] E. H. Chi, J. Pitkow, J. Mackinlay, P. Pirolli, R. Gossweiler, and S. K. Card. Visualizing the evolution of web ecologies. In Proceedings of the Conference on Human Factors in Computing Systems CHI’98, 1998.

CONCLUSIONS AND FUTURE WORK

[5] N. Minar and J. Donath. Visualizing the crowds at a web site. In Proceedings of CHI99., 1999.

In this paper, we introduced the concepts of web image and web graph and proposed an algebra for web graphs. We presented the idea of layering data and patterns in distinct layers on top of a disktree representation of the structure of the web, allowing the display of information in context which is more suited for the interpretation of discovered patterns. Our web knowledge visualization and discovery system visualizes multi-tier web graphs, and with the help of the web graph algebra, provides a powerful means for interactive visual web mining

[6] T. Munzner and P. Burchard. Visualizing the structure of the world wide web in 3rd hyperbolic space. In ACM VRML Conf., pages 33–38, 1995. [7] Y. Niu, T. Zheng, J. Chen, and R. Goebel. Webkiv: Visualizing structure and navigation for web mining applications. In Proceedings IEEE WIC, Halifax, Canada, Oct 13-17 2003. [8] R. Spence. Information Visualization. Addison-Wesley (ACM Press), 2000.

The work we presented in this paper is still preliminary. We have implemented a prototype. However, more theoretical work is needed for the algebra to mature. The few operators we presented are powerful and useful for the web mining and analysis we targeted, but more operators could be defined. Moreover, we have yet to study interesting properties such as commutativity, associativity, or distributivity of operators if coefficients are introduced later in the algebra. Also the order of operations in the algebra is not defined. This could be an interesting investigation leading to the implementation of web graph algebra expression optimizers.

[9] C. Tomlin. Map algebra-one perspective. Landscape and Urban Planning, 30(1-2):3–12, Oct 1994. [10] A. Youssefi, D.Duke, M.Zaki, and E.Glinert. Toward visual web mining. In Proceeding of Visual Data Mining at IEEE Intl Conference on Data Mining (ICDM), Florida, 2003. [11] T. Zheng, Y. Niu, and R. Goebel. Webframe: In pursuit of computationally and cognitively efficient web mining. In PAKDD, pages 264–275, 2002.

Acknowledgments: This research is supported by the Nat-

18

One Torus to Rule them All: Multi-dimensional Queries in P2P Systems Prasanna Ganesan

Beverly Yang

Hector Garcia-Molina

Stanford University {prasannag,byang,hector}@cs.stanford.edu

ABSTRACT

within 100 metres of a specified place. As another example, massively multi-player online games involve large sets of users moving about in a “virtual space”. Each user continuously queries the P2P system to locate all objects, and other users, within a certain distance of her own position in a two-dimensional, or three-dimensional, world [12]. Many solutions for multi-dimensional queries are available in the world of databases. However, adapting these solutions to the P2P world presents four challenges:

Peer-to-peer systems enable access to data spread over an extremely large number of machines. Most P2P systems support only simple lookup queries. However, many new applications, such as P2P photo sharing and massively multiplayer games, would benefit greatly from support for multidimensional range queries. We show how such queries may be supported in a P2P system by adapting traditional spatialdatabase technologies with novel P2P routing networks and load-balancing algorithms. We show how to adapt two popular spatial-database solutions – kd-trees and space-filling curves – and experimentally compare their effectiveness.

1.

Distribution Data needs to be partitioned across a large number of nodes while ensuring both load balance across nodes and efficient queries. Dynamism Nodes in a P2P system may join and leave frequently. Therefore, the data partitioning needs to be over a dynamic set of nodes, while still retaining good balance and efficient queries.

INTRODUCTION

Peer-to-peer systems have become a key medium for publishing and finding information on the internet today. Their popularity stems from their ease of use, self-administering nature, scalable support for large numbers of users, and their relatively anonymous, privacy-preserving content-publishing model. Content in a P2P system can be modelled as a horizontally partitioned relation, just as in a parallel database, with the system possessing control over how data is partitioned across nodes [17, 16, 7]. Most P2P systems thus far, both deployed and proposed in literature, support only simple lookup queries over such a relation, i.e., queries that retrieve all tuples with a particular key value [17, 15, 16]. Some recent work has extended this functionality to support efficient range queries over a single attribute [7, 11]. However, many interesting P2P applications require more powerful multi-dimensional range queries, i.e., conjunctive queries containing range predicates on two or more attributes of the relation. For example, consider a P2P photo-sharing application where each user publishes photographs tagged with metadata such as GPS location information, the time the picture was taken, keywords associated with the picture, and so on. A typical query in such a system would contain range predicates on multiple attributes; a user may request all photographs taken within the last year, whose location is

Data Evolution Data distributions may change over time, and can cause load imbalance even if nodes remain stable. Thus, data may need to be re-partitioned across nodes frequently to ensure load balance. Decentralization P2P systems do not have a central site that maintains a directory mapping data to nodes. Instead, a query submitted at any node must be efficiently transmitted to the relevant nodes by forwarding the query along on an overlay network of nodes. The overlay network is designed to ensure that both the cost of forwarding queries, and the cost of updating the network structure when nodes join and leave, are low. The problem of supporting multi-dimensional queries in a P2P system can be broken into two components: partitioning and routing. A partitioning strategy distributes a relation R over a set of nodes S, supporting the insertion and deletion of tuples from R, as well as the joining of new nodes into S or the leaving of existing ones from S. Once a partitioning strategy is chosen to distribute data across nodes, we require a routing strategy to transmit a query to the relevant nodes. As discussed earlier, nodes are interconnected in an overlay network, with each node having communication links to a small number of “neighbor” nodes; queries are routed on this overlay network to be delivered to the relevant nodes. In this paper, we adapt two different database approaches for multi-dimensional queries – space-filling curves and kdtrees – to the P2P setting while tackling the above challenges (Sections 3 and 4). We then compare the two resulting solutions in order to understand the strengths and weaknesses of each approach in the P2P context (Section 5).


19

1.1 Desiderata

one dimension; this approach becomes very expensive when queries involve ranges on a non-partitioning attribute, since the query would have to be forwarded to a large number of nodes. The BERD declustering strategy used in the Bubba parallel machine [5] improves on this idea by enhancing range partitioning with secondary indexes, but even short range queries continue to be expensive [8]. The MAGIC declustering strategy [8] fragments the data space into a grid of rectangular fragments, using a set of partitioning values on each dimension. Each fragment is allocated to a node, while ensuring that all nodes manage roughly the same number of fragments. The strategy requires prior knowledge of the total number of nodes, and it is unclear whether it can be adapted to support dynamic node joins and leaves. Moreover, there are serious load-balancing issues when the data distribution is not uniform [8]. Reference [6] partitions data using space-filling curves, as does our approach in Section 3. However, our use of spacefilling curves is very different from [6], where the objective was to destroy data locality rather than to preserve it.

Any P2P solution for supporting multi-dimensional queries should ideally have certain properties. First, a good partitioning strategy should have the following characteristics: (1) Locality The cost of executing a query in a P2P system is often proportional to the number of nodes that need to process the query; hence, each query should ideally execute at as few nodes as possible. For multi-dimensional range queries, this implies that the partitioning must have locality, i.e., tuples nearby in the data space should ideally be stored at the same node. (2) Load Balance The amount of data stored by each node should be roughly the same, to ensure load balance1 . This load balance should be ensured even as (a) data evolves with tuple insertions and deletions, and (b) nodes join and leave the system. (3) Minimal Metadata We define partition metadata as a directory that maps each data point to the node managing the partition containing the point. In a P2P system, there is no central site that can maintain this directory, and metadata will be distributed across the participant nodes themselves. The more the metadata, the more work that needs to be performed in updating it across multiple nodes when nodes join and leave the system. Therefore, we wish to keep the metadata as small and simple as possible. We note that these properties are desirable even when partitioning data in a parallel database. The P2P environment merely makes them even more crucial. In addition, the routing algorithm and overlay network should have the following characteristics: (1) Low per-node state: The number of links maintained by each node should be small. This is necessary since links need to be updated every time a new node joins, or an old node leaves. (2) Efficient Routing: The number of messages required to send a query to the relevant nodes should be small. (3) Routing Load Balance: The number of routing messages forwarded per second should be roughly equal across nodes. This rules out the use of tree-like overlay networks, since much traffic would have to pass through the root of the tree.

2.

Routing As we will discuss later, our solutions are adaptations of routing structures used in distributed hash tables [17, 15, 9, 3], but have to deal with added complexities arising from the multi-dimensional nature of the data space, and non-uniformity of node partitions in the data space.

3. SCRAP: SPACE-FILLING CURVES WITH RANGE PARTITIONING Our first approach to supporting multi-dimensional queries, SCRAP, uses a two-step solution to partition the data: (a) Data is first mapped down into a single dimension using a space-filling curve; (b)This single-dimensional data is then range-partitioned across a dynamic set of participant nodes. For the first step, we may map multi-dimensional data down to a single dimension using a space-filling curve such as z-ordering [14] or the Hilbert curve (e.g., [10]). For example, say we have two-dimensional data that consists of two 4-bit attributes, X and Y . A two-dimensional data point hx, yi can be reduced to a single 8-bit value z, by interleaving the bits in x and y. Thus, the two-dimensional point h0100, 0101i would be mapped to the single-dimensional value 00110001. (This mapping corresponds to the use of z-ordering as the space-filling curve.) Note that this mapping is bijective, i.e., every two-dimensional point maps to a unique single-dimensional value and vice versa. In the second step, once data has been reduced to one dimension using the space-filling curve, relation R can then be range partitioned across the available S nodes; i.e., each node manages data in one contiguous range of z values. Maintaining this range partitioning when nodes join and leave is easy: When a new node joins, it simply splits the range of some existing node; when a node leaves, one of its “neighbors” takes over its range2 .

RELATED WORK

Partitioning Single-Dimensional Data Hash partitioning can be used to distribute tuples across a set of disks. When using a relational key as the hash attribute, this approach ensures load balance, and minimal meta-data (just the hash function). However, hashing destroys data locality, and range queries are very expensive [2]. Range partitioning designates each node responsible for one contiguous range of attribute values, and thus provides good locality. The amount of meta-data is fairly small, requiring just the attribute values at the partition boundaries. However, ensuring load balance across partitions as data evolves is a non-trivial problem. Recently, we have shown how such load balance may be achieved efficiently [7].

Routing To execute queries over a SCRAP network, the multi-dimensional query must be sent to the set of nodes that contain data relevant to the query. We may again visualize routing as a two-step process: (In an actual implementation, the two steps are interleaved for efficiency.) (a) The

Partitioning Multi-Dimensional Data When data is multi-dimensional, one could still partition it based on just 1 More generally, we may require the number of queries executing at each node to be equal, when access patterns are not uniform across data. Our subsequent discussion and algorithms generalize directly to this case.

2

Since nodes may not leave gracefully, data may need to be replicated to avoid loss. We will ignore this issue here as standard P2P techniques take care of this problem.

20

1

multi-dimensional range query is first converted to a set of 1-d range queries, which together are guaranteed to contain all query answers. (b) We route each of the 1-d range queries acquired in step (a) to the appropriate nodes for that query, i.e., those nodes whose ranges intersect the query range. Step (a) is performed using well-known algorithms for query-mapping with space-filling curves, e.g., [14, 4]. We note that the query mapping algorithms use simple heuristics to ensure that the number of 1-d ranges returned is not too large. These heuristics may result in “false positives” – portions of the resulting 1-d ranges that do not actually map to a point in the native query range. For example, if the correct set of 1-d ranges is {[0,4), [6,9)}, the algorithm may return {[0,9)}. False positives may result in a non-relevant node receiving and processing the query. Step (b) is performed using a well-known routing network known as the Skip graph [3, 9]. A skip graph is a circular linked list of nodes, arranged in order of their partition boundaries, enhanced with additional skip pointers to enable faster routing. The pointers are constructed using a randomized protocol by which each node establishes O(log n) pointers (n is the number of nodes) at exponentially increasing distances from itself on the list, i.e., the ith skip pointer is expected to point to a node that is 2i positions away on the list. With a normal doubly-linked list, locating the node containing a specific data point will take O(n) messages. With skip pointers, only O(log n) messages are required. Finding all relevant nodes for a query consists of first finding the node containing the minimum point in the query range, via the skip-graph routing protocol; all remaining relevant nodes can then be reached via neighbor links to the successor nodes on the list.

1

1 2

2

(a)

3

? x (b)

Figure 1: (a) Evolution of partitions as nodes join the network. Each partition represents a node in the network. (b) Routing example over a MURK network.

4. MURK: MULTI-DIMENSIONAL RECTANGULATION WITH KD-TREES Our second approach, MURK, partitions the data in situ in the high-dimensional space, breaking up the data space into “rectangles” (hypercuboids in high dimensions), with each node managing one rectangle. One way to achieve this partitioning is to use kd-trees, in which each leaf corresponds to a rectangle being stored by a node. We illustrate this partitioning in Figure 1(a). Imagine there is initially one node in the system that manages the entire 2-d space, corresponding to a single-node kd-tree. When a second node arrives, the space is split along the first dimension into two parts of equal load, with one node managing each part; this corresponds to splitting the root node of the kd-tree to create two children. As more nodes arrive, each node splits the partition managed by an existing node, i.e., an existing leaf in the kd-tree. The dimensions are used cyclically in splitting, to ensure that locality is preserved in all dimensions. When a node leaves, its space needs to be taken over by existing nodes. The simple case is when the node’s sibling in the tree is also a leaf, e.g., node 3 in figure; in this case, the sibling node 2 takes over 3’s space. The more complex case arises when the node’s sibling is an internal node, e.g., node 1 in figure. When node 1 leaves, a lowest-level leaf node in its sibling subtree, say node 2, hands over data to its sibling node 3, and takes over the position of node 1. We note that the above means of partitioning is very similar to that employed in an existing P2P system called CAN [15], with one crucial difference: CAN hashes data into a multi-dimensional space and, since data is expected to be uniformly distributed, a new node splits an existing node’s data space equally, rather than splitting the load equally. At a philosophical level, another key difference in CAN is that the number of dimensions used by CAN is governed not by the dimensionality of the data, but by routing considerations. We will see (Section 5) that for low dimensionality (e.g., 2-d), routing in CAN performs very poorly.

Discussion The SCRAP approach meets many of our desiderata defined earlier. Locality is achieved since the space-filling curve attempts to ensure that nearby data points in the multi-dimensional space are also adjacent in the single dimension. However, as the number of dimensions increases, locality becomes worse since space-filling curves are afflicted by the curse of dimensionality. Load Balance in SCRAP can be achieved using recent techniques we have developed for single-dimensional range partitioning [7]. The key idea behind the techniques is to perform “local” alterations to the range partitioning using two operations: (a) NbrAdjust adjusts the partition boundary between two nodes managing neighboring ranges, to transfer load from one node to the other; (b) Reorder uses a node with an empty partition to split the range of a heavily loaded node. We show in [7] that a judicious use of these two operations leads to guaranteed load balance, while guaranteeing that the cost of achieving load balance is very small. The metadata required to describe the partitioning is also small: all that is required for each node is the partition boundaries of itself and its neighbors. For the overlay network, low state is achieved with only O(log n) links per node. Routing load balance is achieved due to the symmetric nature of skip graphs. Query routing for a single 1-d range is efficiently performed in O(log n) hops. However, as the native dimensionality of the data increases, the number of “relevant” nodes for the original multi-dimensional query may increase dramatically, leading to an increased routing cost.

Routing First, we must interconnect nodes so that a multidimensional range query can be sent along to all the relevant nodes. One simple way to interconnect nodes is to create a link between all “neighboring” nodes, i.e., nodes that share a boundary, resulting in a grid-like structure. (CAN [15] uses this very structure for routing, albeit with all rectangles being roughly the same size.) Observe that this structure is the multi-dimensional analogue of the linked list. Routing over these “grid” links proceeds by the use of greedy routing. Say the multi-dimensional query requires data in the “rectangle” Q. We define the distance from a node N to rectangle Q as the minimum Manhattan distance (L1 distance) from any point in N ’s rectangle to any point in Q. The routing protocol forwards the query from a node

21

to its grid neighbor that reduces the Manhattan distance to Q by the largest amount. An example of query routing is shown in Figure 1(b), where a query is routed from the node labeled ‘?’ to its destination marked ‘X’. Once the query has reached one of the nodes with relevant data, say D, node D sends the query along to those of its neighbors that also have relevant data, proceeding recursively until all relevant nodes are reached. Note that each node must know the partition boundaries of each of its neighbors.

created, such that the distance between nodes in the linear ordering approximates the grid distance between nodes in the native space. This linear ordering is achieved as follows: The ID of a node is defined to be the coordinate of the centroid of its partition, mapped down to one dimension using a spacefilling curve, such as the z-curve. Nodes are ordered linearly using this node ID, and a skip graph is built on this linear ordering of nodes, i.e., nodes maintain a linked list sorted by node ID along with additional skip pointers just as dictated in the skip graph. (Note that the skip graph construction continues to occur in a completely decentralized fashion.) Intuitively, this structure is a multi-dimensional approximation of the skip graph, and the skip pointers of nodes are distributed in an exponential fashion.

Optimized Routing The above “naive” approach is simple; however, it has two major shortcomings: (1) Nonuniformity: The number of grid neighbors that a node has is no longer uniform, unlike in a linked list where each node has two neighbors. Unless the distribution of data points is uniform, nodes will have to manage partitions of unequal space. Nodes that manage partitions covering a large space are likely to have more grid neighbors, compared to nodes managing small spaces. This is a problem because we would like the number of neighbors to be balanced among peers, and because it may translate into unbalanced routing load. (2) Inefficiency: “Grid” pointers are not very effective in improving query efficiency, especially when dimensionality is low. For example, in a uniform 2-d grid of n nodes, with each √ node having 4 neighbors, query routing still requires Θ( n) messages. On the other hand in a linked list with each node having just one skip pointer in addition to its two list pointers, routing requires only Θ(log 2 n) messages. Therefore, despite the presence of grid pointers, additional skip pointers are necessary to ensure efficient routing. We do not attempt to combat the first problem of nonuniform numbers of grid pointers; it is unclear whether it can be overcome while preserving routing load balance. For the second problem, we can have each node maintain “skip pointers” to a few other nodes in order to speed up routing. Note that we maintain the same greedy routing protocol: each node would forward the query to the neighbor closest to the destination, whether the neighbor be a “grid” neighbor, or a neighbor through a skip pointer. The key is to determine how skip pointers are chosen for each node. In the 1-d case, we simply used skip graphs to assign these skip pointers. However, there is no obvious analogue of skip graphs in higher dimensions3 . Our approach, therefore, is to develop heuristic methods for establishing skip pointers, and evaluate their performance through experiments. We consider two different approaches for establishing skip pointers:

Discussion The MURK solution offers good data locality, since it partitions the data space into exactly as many rectangles as there are nodes. The metadata necessary to describe the partitioning is simply the kd-tree itself, whose size is fairly small (proportional to the number of nodes in the system). However, load balancing across partitions is tricky. If the data distribution is static, and does not change over time, it is possible to also obtain good load balance across nodes, using simple techniques developed for load balancing in distributed hash tables, e.g. [1, 13]. When the data distribution is itself dynamic, however, the insertion and deletion of tuples can make the partitions unbalanced even if the set of nodes is fixed. As ongoing work, we are investigating the use of the NbrAdjust and Reorder primitives discussed in Section 3 to achieve dynamic load balancing. Since the routing properties of MURK are difficult to formalize, we defer a discussion of routing performance to our evaluation in Section 5.

5. EVALUATION We now evaluate the effectiveness of our proposed solutions for multi-dimensional queries in P2P systems. Setup Our experiments compare the following approaches: • SCRAP – A SCRAP network using the z-ordering spacefilling curve [14]. • MURK-Ran – A MURK network in which each node maintains 2 log n skip pointers, chosen at random. • MURK-SF – A MURK network in which skip pointers are determined by a space-filling skip graph over nodes’ partition centroids. • MURK-CAN – As a baseline for comparison, we will evaluate multi-dimensional range queries over a standard CAN [15] (grid) network with no skip pointers. We evaluate each approach via simulation. For each highlevel action, such as a query, we simulate the messages that are sent between nodes. We measure the performance of an approach with the following two metrics: • Locality – the average number of nodes across which the answer to a given query is stored. • Routing Cost – the average number of messages exchanged between nodes in order to route a given query. Due to space constraints, we omit a description of our evaluation on other metrics, such as data and routing load distribution across nodes, and comment on these metrics in Section 6.

(a) Random: Each node maintains skip pointers to s nodes chosen at random from the set S of all nodes. In practice, finding a random node is implemented using random walks on the overlay network. Such random skip pointers allow the same kind of query efficiency and routing robustness as multiple realities in CAN, without requiring that data be replicated [15]. (b) Space-filling Skip graph: Recall that the key idea in a skip graph was to ensure that each node maintained skip pointers at exponentially increasing distances from it. To emulate such an exponential distribution of skip pointers, we use the following strategy: a linear ordering of nodes is 3 Note that when node distribution is uniform, it is indeed possible to obtain exactly the same routing performance irrespective of the number of dimensions, using a generalization of skip graphs. We do not discuss this here.

22

40 SCRAP MURK

30

Locality

Locality

6000 4000

1200 SCRAP MURK

1000

Locality

8000

20 10

2000

SCRAP MURK

Uniform: solid line Skew: dotted line

800 600 400 200

0 1

0 2 10

8

7

6

5

4

3

2

Dimensionality 4

2

10

1

10

0

2

4

3

5

0.004

0.002

6

7

8

Dimensionality

Figure 2: Performance as dimensionality of data is varied

0.008

0.01

60

15

10 SCRAP MURK−Can MURK−sf MURK−Ran

5

0 2 10

0.006

Query Selectivity

Routing Cost

10

Routing Cost

Routing Cost

0 0

6

10

20

SCRAP MURK−Can MURK−sf MURK−Ran

3

1

10

10

Network Size

10

10

5

4

3

10

5

4

3

10

10

10

6

10

Network Size

Figure 3: Performance as size of network is varied

50 40 30

SCRAP MURK−Can MURK−sf MURK−Ran

Uniform: solid line Skew: dotted line

20 10 0 0

0.002

0.004

0.006

0.008

0.01

Query Selectivity

Figure 4: Performance as selectivity of range query is varied

creasing dimensions in MURK. However, query locality in SCRAP becomes very poor even for three dimensions; i.e., queries will have to visit many nodes to find all relevant data. Locality suffers in SCRAP because of the curse of dimensionality. As dimensionality increases, nearby points in native space become increasingly far apart in the mapped 1-d space. In addition, because the approximate query-mapping algorithms returns a fixed number of 1-d intervals in which the native region is contained (see Section 3 for discussion), the number of false positives introduced by the algorithm increases with dimensionality. With false positives, nodes that are not actually relevant will process the query; therefore, locality suffers. Figure 3 depicts performance as the number of nodes in the system increases, for two-dimensional data and a fixed query selectivity. In the locality graph (top), we see that MURK scales well with increasing numbers of nodes; the query almost always hits just one node. SCRAP, on the other hand, does somewhat worse with the locality increasing roughly linearly with the number of nodes. (Note that the x-axis is in logarithmic scale.) Finally, Figure 4 shows the impact of query selectivity and data skew on performance. The solid lines correspond to experiments with the UNIFORM dataset, while the dotted lines correspond to the CLUSTERED dataset. Along the x-axis we vary the selectivity of the query, executed over 2-dimensional data on a network of 8192 nodes. In the locality graph (top), we see that locality in MURK increases linearly with increasing query selectivity, even for clustered data, which suggests that the data is well balanced across nodes. Locality of SCRAP is much worse than for MURK, with some additional degradation being induced by clustered data. Locality degrades in SCRAP because of the imperfect ability of space-filling curves to map nearby points in native space to nearby points in the mapped 1-d space. As the query covers an increasingly large region in native space, a proportionally larger number of disjoint 1-d intervals can be needed to express this native region. We see in Figure 4 (top) that locality is in fact sublinear, largely due to the fact that many of the disjoint intervals map to the same node. In summary, MURK far outperforms SCRAP in terms of locality, especially as dimensionality, selectivity, and net-

Note that when a query is sent out to the network, some nodes process the query to “route” it to nodes with answers, while other nodes process it because their partitions overlap with the query (and hence, potentially have answers to the query). We differentiate between these two types of work – the first type is captured by Routing Cost, while the second is captured by Locality. Our workload consists of hypercubic region queries of a given selectivity s over the entire data space. We define the selectivity to be s, 0 ≤ s ≤ 1, if a fraction s of the entire data space is covered by the query region. Note that selectivity does not determine the number of data points in the region, if data distribution is not uniform. Selectivity may be varied to study how each approach performs under various range sizes. We evaluate queries over two types of data: (1) uniformly random data (UNIFORM) which we generate in any dimensionality desired, and (2) highly skewed/clustered data (CLUSTERED), consisting of 2-d GPS coordinates over a real-life digital photo collection. We present experiments on a static network, constructed by initially inserting all data into a single node, and progressively allowing nodes to join the system to expand it to the desired size. Experiments with dynamic networks are omitted due to space constraints.

5.1 Results Locality Our first set of experiments compares the merits of the SCRAP and MURK partitioning strategies, in terms of the locality obtained for queries, as the following parameters are varied: (a) the number of dimensions, (b) the number of nodes (c) query selectivity, and (d) data skew. Each experiment varies one of the above four parameters, while holding the others fixed. Locality results are shown in the upper set of graphs in Figures 2, 3, and 4. We will later discuss routing cost results, shown in the lower set of graphs. Recall that the total number of nodes that handle the query is equal to routing cost and locality added together. Figure 2 depicts performance of each approach as the number of dimensions increases. All queries are of a fixed selectivity of 10−5 , executed on a network of 8192 nodes. As we can see, locality (top) degrades very little with in-

23

ing costs together, we find that MURK-sf and MURK-ran are superior approaches across all the variables studied.

work size increase. Routing Our second set of experiments compares the routing costs in SCRAP and MURK, while varying the same four parameters as in the earlier experiments. In Figure 2 (bottom), we see the routing cost of each approach as the number of dimensions is varied. Not surprisingly, the cost of SCRAP routing is independent of the number of dimensions, since SCRAP routes in a 1-d space, and the number of 1-d intervals output by the query-mapping algorithm is fixed. For MURK, we see that both MURK-sf and MURK-ran are much better than MURK-CAN in low dimensions. For example, MURK-CAN requires over 2 orders of magnitude more messages to route a query in 1-d, compared to MURK-sf and MURK-ran. (Note that the yaxis is in logarithmic scale.) In high dimensions, there are so many grid pointers per node that the improvement obtained from skip pointers becomes marginal. Figure 3 (bottom) plots the routing cost for 2-d data as the number of nodes is varied, with a fixed-selectivity query. We observe that MURK-CAN performs √ very poorly, which is expected since the routing cost is Θ( n). The routing cost of SCRAP increases logarithmically with the number of nodes, just as expected. MURK-sf performs consistently better than SCRAP, suggesting that the space-filling skip graph heuristic is an effective one; it performs better than SCRAP because nodes also have additional grid pointers in MURK, and because only a single query needs to be routed. In SCRAP, we must route one 1-d query for each 1-d interval output by the query-mapping algorithm. Intriguingly, MURK-Ran outperforms MURK-sf for small network sizes. This is because many skip-graph pointers in small networks are “too close” and do not aid efficient routing as much as in large networks. In fact, it turns out that the “threshold” network size at which MURK-sf outperforms MURK-ran increases with dimensionality. In 1-d, the threshold is at around 1000 nodes, in 2-d around 8000, and even higher values for higher dimensions. Similarly, as queries select more and more data, the threshold size increases; a query selecting lots of data only needs to reach one of the many relevant nodes through routing, which is intuitively akin to routing in a network with fewer nodes. Figure 4 (bottom) depicts routing costs for uniform and clustered data, as the selectivity of the query is varied. Once again, solid lines depict the cost for uniform data, and dotted lines the cost for clustered data. The cost of SCRAP routing remains flat irrespective of the query range or data clustering, showing that it adapts well to data distribution skew. The cost of MURK-CAN is much higher than that of the other approaches, confirming the need for skip pointers in MURK. Both MURK-sf and MURK-ran perform better for uniform data than for clustered data. For clustered data, we see that MURK-sf performs considerably better than MURK-ran (by about a factor of 2), for all query ranges. This confirms once again that the space-filling skip graph heuristic used by MURK-sf performs well in practice, and is better than using random skip pointers. In summary, we find that routing in the baseline MURKCAN network is very expensive and unscalable. Skip pointers are effective in reducing routing cost in MURK, especially as network size increases. In particular, MURK-sf tends to outperform MURK-ran when network size is large, or data is skewed or in low dimensionality. SCRAP routing is also efficient; however, when looking at locality and rout-

6. CONCLUSIONS AND FUTURE WORK We have presented two approaches for supporting multidimensional queries. The first approach, SCRAP, uses spacefilling curves with range partitioning, and performs well in low dimensions. It also allows for efficient load balancing across nodes even as tuples are inserted and deleted [7]. The second approach, MURK, partitions data into rectangles in the native space. In combination with a space-filling skip graph, MURK proves much more efficient than SCRAP, especially in high dimensions. Preliminary experiments suggest that both SCRAP and MURK offer good data and routing load balance when data distribution is static. However, achieving efficient load balance in MURK under dynamic data distributions is potentially expensive, and is the subject of current research. In future work, we plan to investigate the use of SCRAP and MURK for the specific application of P2P photo sharing.

7. REFERENCES [1] M. Adler, E. Halperin, R. M. Karp, and V. V. Vazirani. A stochastic process on the hypercube with applications to peer-to-peer networks. In Proc. STOC, pages 575–584, June 2003. [2] A.Silberschatz, H.F.Korth, and S.Sudarshan. ”Database System Concepts”, chapter 17. McGraw-Hill, 1997. [3] J. Aspnes and G. Shah. Skip graphs. In Proc. SODA, 2003. [4] C. Bohm, G. Klump, and H. Kriegel. Xz-ordering: A space-filling curve for objects with spatial extension. In Proc. Symposium on Large Spatial Databases, 1999. [5] G. Copeland, W. Alexander, E. Boughter, and T. Keller. Data placement in Bubba. In Proc. SIGMOD, 1988. [6] C. Faloutsos and P. Bhagwat. Declustering using fractals. In Proc. Intl. Conf. on Parallel and Distributed Information Systems, 1993. [7] P. Ganesan, M. Bawa, and H. Garcia-Molina. Online balancing of range-partitioned data with applications to peer-to-peer systems. Technical report, Stanford University, 2004. [8] S. Ghandeharizadeh and D. J. DeWitt. A performance analysis of alternative multi-attribute declustering strategies. In Proc. SIGMOD, 1992. [9] N. J. A. Harvey, M. Jones, S. Saroiu, M. Theimer, and A. Wolman. Skipnet: A scalable overlay network with practical locality properties. In Proc. USITS, 2003. [10] H. Jagadish. Linear clustering of objects with multiple attributes. In Proc. SIGMOD, 1990. [11] D. R. Karger and M. Ruhl. Simple efficient load-balancing algorithms for peer-to-peer systems. In Proc. IPTPS, 2004. [12] G. Knutsson, H. Lu, W. Xu, and B. Hopkins. Peer-to-peer support for massively multiplayer games. In Proc. INFOCOM, 2004. [13] M. Naor and U. Wieder. Novel architectures for P2P applications: The continuous-discrete approach. In Proc. 15th ACM Symp. on Parallelism in Algorithms and Architectures (SPAA 2003), pages 50–59, June 2003. [14] J. Orenstein and T. Merrett. A class of data structures for associative searching. In Proc. PODS, 1984. [15] S. Ratnasamy, P. Francis, M. Handley, and R. M. Karp. A scalable Content-Addressable Network. In Proc. SIGCOMM, 2001. [16] A. I. T. Rowstron and P. Druschel. Pastry: Scalable, distributed object location, and routing for large-scale peer-to-peer systems. In Proc. Middleware, 2001. [17] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In Proc. SIGCOMM, 2001.

24

Querying Peer-to-Peer Networks Using P-Trees Adina Crainiceanu Prakash Linga Johannes Gehrke Jayavel Shanmugasundaram Department of Computer Science Cornell University {adina,linga,johannes,jai}@cs.cornell.edu

ABSTRACT We propose a new distributed, fault-tolerant peer-to-peer index structure called the P-tree. P-trees efficiently evaluate range queries in addition to equality queries.

1.

INTRODUCTION

Peer-to-Peer (P2P) networks are emerging as a new paradigm for structuring large-scale distributed systems. The key advantages of P2P networks are their scalability, their faulttolerance, and their robustness, due to symmetrical nature of peers and self-organization in the face of failures. The above advantages made P2P networks suitable for content distribution and service discovery applications [1, 7, 8, 9]. However, many existing systems only support location of data items based on a key value (i.e. equality lookups). In this paper, we argue for a richer query semantics for P2P networks. We envision a future where users will use their local servers to offer data or services described by semantically-rich XML documents. Users can then query this “P2P data warehouse” or ”P2P service directory” as if all the data were stored in one huge centralized database. As a first step towards this goal we propose the P-tree, a new distributed fault-tolerant index structure that can efficiently support range queries in addition to equality queries. As an example, consider a large-scale computing grid distributed all over the world. Each grid node (peer) has an associated XML document that describes the node and its available resources. Specifically, each XML document has an IPAddress, an OSType, and a MainMemory attribute, each with the evident meaning. Given this setup, a user may wish to issue a query to find suitable peers for a main-memory intensive application - peers with a “Linux” operating system with at least 4GB of main memory: for $peer in //peer where $peer/@OSType = ’Linux’ and $peer/@MainMemory >= 4096 return $peer/@IPAddress A naive way to evaluate the above query is to contact every peer in the system, and select only the relevant peers. However, this approach has obvious scalability problems because all peers have to be contacted for every query, even though only a few of them may satisfy the query predicates. P2P index structures that support only equality queries will also be inefficient here: they will have to contact all the peers having “Linux” as the OSType, even though a large fraction of these may have main memory less than 4GB. Copyright is held by the author/owner. Seventh International Workshop on the Web and Databases (WebDB 2004), June 17-18, 2004, Paris, France

25

In contrast, the P-tree supports the above query efficiently as it supports both equality and range queries. In a stable system (no insertions or deletions), a P-tree of order d provides O(m+logd N ) search cost for range queries, where N is the number of peers in the system, m is the number of peers in the selected range and the cost is the number of messages. The P-tree requires O(d · logd N ) space at each peer and is resilient to failures of even large parts of the network. Our experimental results (both on a large-scale simulated network and in a small real network) show that P-trees can handle frequent insertions and deletions with low maintenance overhead and small impact on search performance. In the rest of the paper we use the following terminology and make the following assumptions. We target applications that offer a single data item or service per peer, such as resource discovery applications for web services or the grid. We call the XML document describing each service a data item. Our techniques can be applied to systems with multiple data items per peer by first using a scheme such as [4] to assign ranges of data items to peers, and then considering each range as being one data item. We call the attributes of the data items on which the index is built the search key (in our example, the search key is a composite key of the OSType and MainMemory attributes). For ease of exposition, we shall assume that the search key only consists of a single attribute. The generalization to multiple attributes is similar to B+-tree composite keys [3].

2.

THE P-TREE INDEX

The P-tree index structure supports equality and range queries in a distributed environment. P-trees are highly distributed, fault-tolerant, and scale to a large number of peers.

2.1

P-tree: Overview

Centralized databases use the B+-tree index [3] to efficiently evaluate equality and range queries. The key idea behind the P-tree is to maintain parts of semi-independent B+-trees at each peer. This allows for fully distributed index maintenance, without need for inherently centralized and unscalable techniques such as primary copy replication. Conceptually, each peer views the search key values as being organized in a ring, with the highest value wrapping around the lowest value (see Figure 1). When constructing its semi-independent B+-tree, each peer views its search key value as being the smallest value in the ring (on a ring, any value can be viewed as the smallest). In a P-tree, each peer stores and maintains only the left-most root-to-leaf path of its corresponding B+-tree. Each peer relies on a selected sub-set of other peers to complete its tree.

5|29|31

2 5|29|31

p5 p7

2 42| 7|23|30

p2 p4 p6 p8 p1 p8

p2 p5 p7 1 31|42| 5

p7 p8 p1

2 30|42| 7

p8 p2 p6 p7

p1 p2 p3 p4 p1 p2 5 7 42 13

31

p6

1 30|31

p5 p7 p1

1 5| 7|13|23

1 42| 5

2 31| 7|29

2 7|29|31|5

p2 p3 p4 p5 p5 p8 p2

13 p3

23 p4

29 p5

31|42| 5

30 p6

31 p7

42 p8

Figure 2: P-tree nodes for p1 ’s tree

1 13|23|29|30

5|42

p3 p4 p5 p6 30 29 23 p4 p5 2 23|31| 5 p7 p1 2 29|42| 7 p8 p2 1 23|29|30 p4 p5 p6

1 29|30|31

5|23|29|30

23|30|42| 7 29|31| 7 30|42| 7

5| 7

23|29

5 p1

p5 p6 p7 Figure 1: Full P-tree

7 13 23 p2 p3 p4

29|30

30|31

29 p5

30 p6

42| 7|23|30 42|5

31 42 p7 p8

Figure 3: Inconsistent P-tree

As an illustration, consider Figure 2. The peer p1 , which stores the item with value 5, only stores the root-to-leaf path of its B+-tree. To complete the remaining parts of its tree - i.e., the sub-trees corresponding to the search key values 29 and 31 at the root node - p1 simply points to the corresponding nodes in the peers p5 and p7 (which store the data items corresponding to 29 and 31, respectively). p5 and p7 also store the root-to-leaf paths of their independent B+-trees, as shown in Figure 1, so p1 just points to the appropriate nodes in p5 and p7 to complete its own B+-tree. To illustrate an important difference between P-trees and B+-trees, consider the semi-independent B+-tree at peer p1 . The root node of this tree has three sub-trees stored at the peers with values 5, 29, and 31, respectively. The first subtree covers values in the range 5-23, the second sub-tree covers values in the range 29-31, and the third sub-tree covers values in the range 31-5. These sub-trees have overlapping ranges, and the same data values (31 and 5) are indexed by multiple sub-trees. Such overlap is permitted because it allows peers to independently grow or shrink their tree; this in turn eliminates the need for excessive coordination and communication between peers. The above P-tree structure has the following advantages. First, since the P-tree maintains the B+-tree-like hierarchical structure, it provides O(logd N ) search performance for equality queries in a consistent state. Second, since the order of values in the ring corresponds to the order in the search key space, range queries can be answered efficiently by first finding the smallest value in the range (using equality lookup), and then scanning the relevant portions of the ring. Third, since each peer is only responsible for maintaining the consistency of its leftmost root-to-leaf path nodes, it does not require global coordination and does not need to be notified for every insertion/deletion. Finally, since each peer only stores tree nodes on the root-to-leaf path, and each node has at most 2d entries, the total storage requirement per peer is O(d · logd N ).

2.2

7 p2

5 p1

2 13|29|42| 7

p3

29|30|31

5| 7|13|23

1 7|13|23|29

peer storing the data item with search key value value. For convenience, we define level 0 in the P-tree at p as having d entries (p.value, p). We define succ(p) to be the peer p0 such that p0 .value appears right after p.value in the clockwise traversal of the P-tree ring. In Figure 1, succ(p1 ) = p2 , succ(p8 ) = p1 , and so on. We similarly define pred(p). To easily reason about the ordering of peers in the P-tree ring, we introduce the comparison operator
1=(L) Si =(2L) and the bandwidth for the unicast requests is i:pi 1=(L) pi Si . In particular, if L increases, while all other variables remain fixed, then more documents are pushed, an observation that will be used to derive the final algorithm. The second and more substantial modification to the previous argument is due to the mismatch between the objectives of Algorithm 1 and the desired objectives for bandwidth division and document selection. Algorithm 1 minimizes the amount of required bandwidth to achieve a fixed L, whereas our goal is to minimize L given a fixed deployed bandwidth B . In some sense, our problem is the dual of the one that Algorithm 1 solves. Algorithm 2 solves the bandwidth division and document selection problems, and uses Algorithm 1 as a subroutine. The algorithm employs a parameter > 1 that measures the target level of over-provisioning for the pull channel. More precisely, the actual bandwidth we reserve for pull is times what an idealized estimate predicts. Queuing theory asserts that > 1 guarantees bounded queuing delays, whereas 1 leads to infinite queuing delays. As such, the parameter can also be thought of as a safety margin for the pull channel. The algorithm also uses a parameter > 0, which is an arbitrarily small positive number, and finds a solution that has latency within of the optimum for the given bandwidth and popularities. Algorithm 1 assumes that documents have been sorted in nonincreasing order of popularity, i.e., pi pi+1 (1 i < n). It can be easily seen that if i is pushed, then j < i should be pushed as well. Then, the problem becomes that of finding a value of k such

(1)

Algorithm 1 follows this approach. However, as we are interested in the average latency of a pushed document instead of the worst-case latency, we need to make the following modification to equation (1). The unicast pull term pi Si in (1) is the bandwidth

32

Algorithm 2 Bandwidth Division and Document Classification Require: n B S p, and as defined in Table 1, and p i pi+1 (1 i < n) Ensure: k is the optimal number of documents on the push channel, pullBW is the optimal pull bandwidth, pushBW is the optimal push bandwidth 1: for i = 1 : : : n do 2: rspti rspti;1 + pi Si 3: sizeTotali = sizeTotali;1 + Si 4: end for 5: lMax sizeTotaln =B 6: lMin 0 7: while lMax lMin > do (lMax + lMin)=2 8: L 9: k tryLatency(L p n) 10: pullBW (rsptn rsptk ) 11: pushBW B pullBW 12: if pushBW (sizeTotalk =(2L)) then L 13: lMax 14: else L 15: lMin 16: end if 17: end while

the objective of minimizing the maximum relative inaccuracy observed in the estimated popularities of the pushed documents. In this case, we show analytically that each report probability should be set inversely proportional to the predicted access probability for that document. First, the server calculates the rate of incoming reports that it can tolerate. Presumably, is approximately equal to the rate that the server can accept TCP connections minus the rate of connection arrivals for pulled documents. Therefore, the value of can be estimated from the access probabilities and the current request rate, all scaled down by a safety factor to give the server a little leeway for error. Then, the si ’s have to be set such that ki=1 pi si , where documents 1 : : : k are on the push channel. The expected number of reports i that the server can expect to see for i over a unit time period is pi si . Using standard Chernoff bounds, the probability that number of reports is more than (1+ ) i is roughly ; i 2 e 4 , and that the probability that number of reports is less than ; 2 (1 ) i is roughly e 2i . If the goal is to minimize the expected maximum relative inaccuracy of the reports, all of the upper tail bounds should be equal and all of the lower tail bounds should be equal. That is, all i should be equal, or equivalently it should i k, si = pi k . Hence, each be the case that for all i, 1 document should have a report percentage inversely proportional to its access probability.

P

; ; ;

f

;

g

that the multicast push set 1 2 : : : k minimizes the latency L given a certain bandwidth B and pull over-provisioning factor . The optimal value k can be found by trying all possible values of L, computing the document k that achieves L with Algorithm 1, and checking that this value of k satisfies the bandwidth requirements. The pull bandwidth requirement is n i=k+1 pi Si , which leaves pushBW = B n k+1 pi Si for the push channel, and average latency for the pushed documents of ki=1 Si =2pushBW. If this computed average latency for the pushed documents is greater than L, then L needs to be increased, otherwise L needs to be decreased. Algorithm 2 follows this approach but with two optimizations. In the first place, the algorithm performs a binary search over all possible values of L and stops when the interval for L is bounded by the tolerance . Moreover, the algorithm pre-computes the sums k pi Si and k Si in the arrays rspt and the sizeTotal rei=1 i=1 spectively (Lines 1–4). The purpose of these computations is to use the totals in the place of the sums in the bandwidth computations. Because of this optimization, the portion of the algorithm before the binary search runs in linear time. The maintenance of the rspt and sizeTotal arrays can be implemented in logarithmic time per query using standard augmented binary tree techniques [8]. Thus, n Si =1 ))). We the running time of algorithm 2 is O(max(n log( iB expect that as a practical matter that the running time will be O(n).

;

P

P

3.

EVALUATION

The objective of the experiments is to validate the algorithms introduced in Section 2. In particular, the development of Algorithm 2 made several idealized assumptions about the environment and these assumptions need to be investigated experimentally. The choice of is a major parameter in the following experiments. We also wish to verify that lower delays are achieved by an integrated algorithm that does both document classification and bandwidth division. Finally, the scalability of various popularity estimation algorithms remains to be verified.

P P

3.1

Methodology

The experimental analysis leverages on an existing prototype middleware. The middleware supports the hybrid dissemination scheme utilizing multicast push and unicast pull. It acts as a reverseproxy to a Web server for the delivery of documents that are materialized views [23]. A simulated client uses the middleware and generates Poisson requests for documents with a Zipf probability distribution. In this paper, we report on the case in which the size of the documents is fixed to 0.5KB, and we have additional evidence suggesting that results are fundamentally the same with variable sized documents. An objective of this evaluation was to isolate algorithm performance from network factors, such as network congestion or routing transients. On the other hand, scalability is asserted when requests are generated by a large number of clients. Our solution was to run both the client and server on the same machine so that network effects would not be visible. (Although the emulation runs on a single machine, the middleware is capable of running on a distributed environment [23].) Aggregate requests from multiple clients was simulated by a background request filler. The filler simulates a specified number of clients, and sends requests to the server. The requests by the filler are treated identically to those made by another distinguished client, except that we record latency only for the requests from the distinguished client. All experiments were run for 10000 requests and figures reflect the average statistics from these runs. The computer used in these simulations was a 2.0Ghz

P

P

2.2

Report Probabilities

Document selection and bandwidth division rely on estimates p of document popularity. The values of p can be estimated by sampling the client population as follows. The server publishes a report probability si for each pushed document i. Then, if a client wishes to access document i, it submits an explicit request for that document with probability si . In principle, clients would not need to submit any request for push documents, but if they do send requests with probability si , the server can use those requests to estimate pi . At the same time, the report probability si should be small enough that server is almost surely not going to be overwhelmed with requests for pushed documents. In particular, we consider

33

dual processor machine with 1.2GB of RAM and running Linux Redhat Version 8.0. JRMS was used for multicasting [24]. Parameter Document Size Zipf parameter System Bandwidth Request rate Total items n Total Requests Made

Value 0.5K bytes 1.1 - 2.0 100000 bytes/sec 250 / sec 100 - 10000 10000 1.1 - 4.2 .005

Default 0.5K bytes 1.5 100000 bytes/sec 250/sec 1000 10000 2 .005

Table 2: Simulation Parameters Figure 1: Effects of various values on average latency

Table 2 describes the parameters used in the experiments. Although we explored the algorithm sensitivity to parameter values within the stated range, we report for compactness only on the default values unless otherwise noted.

3.2

Document Classification and Bandwidth Division Evaluation

Figure 1 shows the effects of various values of on the average latency of Algorithm 2. The curve in Figure 1 is jagged because an infinitesimal change in can have a discrete effect in the number of items pushed. Figure 1 shows that the value of that minimizes average latency is between 2.0 to 3.0. We adopt = 2:0 in the rest of the paper — although this is not the actual minimum, any value in the range produces similarly good results. Note that as changes in figure 1 our system adjusts the bandwidth division and document classification to maintain optimality. This in part explains why the average latency is near optimal for a relatively wide range of . Figure 2 can be interpreted as a brute force search for a good bandwidth split and document classification by trying several closely spaced values of k and pushBW. In the chart legend, the first number in the bandwidth split refers to pull. In addition to the points plotted in the figure, we verified that if less than half of the bandwidth was devoted to pull, the latency was suboptimal. In this scenario, Algorithm 2 assigns the most popular 7 documents on the push channel, and allocates 63% percent of the bandwidth to push. The figure shows the algorithm’s outcome with a circular point and an arrow pointing to it. The solution produced by our algorithm is better than any other point in the diagram. More specifically, our algorithm chose a split of 63/37 and the closest brute force curve in the figure is the 65/35 curve. The 65/35 line was also the lowest in the graph. Algorithm 2 chose k = 7 point as the number of push documents, which is also the minimum point on the 65/35 curve. Thus, Algorithm 2 chose a better bandwidth split than the brute force approach and a document classification that was just as good. Let G(k) be the average latency if the k most popular documents are placed on the push channel. The function G(k) is a weighted average of the average latency for pushed documents and the average latency for pulled documents. A graph showing an idealized G(k) from [25] is shown in Figure 3. The function G(k) has a unique local minimum, which can be be found by local search [25]. Figure 3 shows that the minimum of G(k) is to the right of the intersection of the push and pull curves. In this case, pulled documents would have lower latency than pushed documents. The actual curve that we obtained from our experiments is shown in Figure 4. Notice that the minimum of G(k) is to the left of the intersection of the push and pull curves, and thus pushed documents have lower latencies than pulled documents. Further, the minimum of G(k) occurs at a relatively small value of k, and thus compli-

Figure 2: Demonstrating the optimality of Algorithm 2 for document classification and bandwidth division. The arrow points to the single point found by the algorithm. cated hierarchical schemes for the push channel may not be useful in this setting. The location of the minimum is due to two complementary reasons. First, the most popular items are chosen for push and are also those to which a Zipf (or Zipf-like) distribution gives substantially more weight. Therefore, if a solution favors multicast push, it will also have the largest impact on the globally average delays. Second, the unicast pull curve levels off and, from that point on, the exact choice of k has little impact on pull delay. In other where words, pull delays are practically minimized at the point k precedes the intersection the pull curve flattens out. However, k of the pull curve with the push curve, and so the overall minimum occurs before that intersection. In conclusion, Algorithm 2 was shown to be better than the best value returned by a brute force search. Furthermore, the integrated algorithm led to a behavior of the push and pull curves that differs qualitatively and quantitatively from previously published work, e.g., in terms of the relative behavior of push and pull delays.

3.3 Report Probabilities Evaluation In order to determine the usefulness of our proposed push popularity scheme, we compare it to a solution found in a comparable work to our own. The solution for the push popularity problem proposed in [25] was to occasionally drop each pushed document i off of the push channel so that clients would have to make explicit requests to i. However, there is a danger that these explicit requests for i could overload the server. Thus, in [25] it was recommended that i should be dropped as short of a period of time as possible.

34

Figure 5: Effect on latency of demoting an item.

Figure 3: Relation of Push and Pull latencies as number of items pushed is changed, according to Stathatos et al.

Figure 4: Relation of Push and Pull latencies as number of items pushed changes according to our experiments

Figure 6: Drop down method versus our probability method.

The shortest possible time the the document can be dropped is one broadcast cycle. However, we show here that even such a short drop disrupts the server, while our proposed method does not suffer from such disruptions. Figure 5 shows the average latencies around the broadcast cycle T when the most popular item is dropped from the push channel. The figure shows a performance degradation for about 5 broadcast cycles. Basically, looking at the graph shows that before the drop occurs, the system is in a steady state of response times. However, once the item is dropped down the clients are no longer getting requests off the push channel. Instead, they must make requests directly to the server. Based on the Zipf distribution, as mentioned earlier, the bulk of requests were for items that were on the push channel. Therefore, dropping an item down causes a brief but substantial influx of requests to the server. This brief surge causes response times for requests during the given broadcast cycle and a few subsequent cycles to suffer while the server recovers and returns to its steady state. Figure 6 shows the average latency over the next 5 broadcast cycles when the i th most popular document is dropped from the push channel for one broadcast cycle. The flat line represents the average response time using our method for push popularity. If the most popular document is dropped, then we see a 35% increase in average latency over the next 5 broadcast cycles. If the 6th most popular document is dropped, we see an 8% increase in average latency over the next 5 broadcast cycles. This increase is in comparison to using the simple yet effective scheme we proposed of

simply including a popularity estimator with the broadcast index. In fairness, our proposed method has the disadvantages that it requires extra space in the broadcast index and it slightly increases the request rate at the server during all broadcast cycles.

4. RELATED WORK Scalable data delivery has often been approached through data caching and replication such as, for example, in client or proxy caches [11, 16], server-side caches [10], and content-delivery networks [22]. Moreover, back-end methods are deployed between a web server and a back-end database server and include web server cache plug-in mechanisms and asynchronous caches [5, 17, 18, 19]. These approaches follow the traditional unicast pull paradigm, whereby data is delivered from the server to each client individually on demand. In turn, the unicast pull approach severely limits the inherent scalability of data delivery. The document classification problem was introduced in [25]. In addition to directly related work, some other work has been done addressing the issue of hot and cold documents and of bandwidth division, though not in the context we are describing. In [1, 14, 2, 26] the issue of mixing pull and push documents together on a single broadcast channel is examined. The idea is that popular documents are similarly considered hot, and are continuously broadcast while all other documents are cold. These documents are request through a back channel and scheduled for broadcast. Similarly, in [1] the authors discuss how to divide the broadcast

35

[8] T. Cormen, C. Leiserson, R. Rivest, and C. Stein. Introduction to Algorithms. MIT Press, 2001. [9] G. Cormode and S. Muthukrishnan. What’s hot and what’s not: Tracking frequent items dynamically. In Proceedings of Principles of Database Systems, pp. 296–306, 2003. [10] A. Datta, K. Dutta, K. Ramamritham, H. M. Thomas, and D. E. Vandermeer. Dynamic content acceleration: A caching solution to enable scalable dynamic web page generation. In ACM SIGMOD, 2001. [11] A. Datta, K. Dutta, H. M. Thomas, D. E. VanderMeer, Suresha, and K. Ramamritham. Proxy-based acceleration of dynamically generated content on the world wide web: an approach and implementation. In ACM SIGMOD, pp. 97–108, 2001. [12] M. Franklin and S. Zdonik. “ data in your face ”: Push technology in perspective. In ACM SIGMOD, pp. 516–519, 1998. [13] Y. Guo, M. Pinotti, and S. Das. A new hybrid scheduling algorithm for asymmetric communication systems. ACM SIGMobile Computing and Communications Review, 5(3):123–130, 2001. [14] A. Hall and H. Taubig. Comparing push- and pull-based broadcasting or: Would “microsoft watches” profit from a transmitter? LCNS, 2647, January 2003. [15] J. Jannotti, D. Gifford, K. Johnson, M. Kaashoek, and J. O’Toole, Jr. Overcast: Reliable multicasting with an overlay network. In OSDI, pp. 197–212, 2000. [16] S. Jin and A. Bestavros. Temporal locality in Web request streams: sources, characteristics, and caching implications. In SIGMETRICS, pp. 110–111, 2000. [17] A. Labrinidis and N. Roussopoulos. Webview materialization. In ACM SIGMOD, pp. 367–378, 2000. [18] A. Labrinidis and N. Roussopoulos. Webview balancing performance and data freshness in web database servers. In VLDB, pp. 393–404, 2003. [19] Q. Luo, S. Krishnamurthy, C. Mohan, H. Woo, H. Pirahesh, B. G. Lindsay, and J. F. Naughton. Middle-tier database caching for e-business. In ACM SIGMOD, pp. 600–611, 2002. [20] J. Nonnenmacher and E. W. Biersack. Scalable feedback for large groups. IEEE/ACM Trans. Netw., 7(3):375–386, 1999. [21] V. Padmanabhan and L. Qiu. The context and access dynamics of a busy web site: Findings and implications. In ACM SIGCOMM’00, pp. 111–123, 2000. [22] V. S. Pai, L. Wang, K. Park, R. Pang, and L. Peterson. The dark side of the Web: An open proxy’s view. In HotNets-II, 2004. [23] V. Penkrot, J. Beaver, M. Sharaf, S. Roychowdhury, W. Li, W. Zhang, P. Chrysanthis, K. Pruhs, and V. Liberatore. An optimized multicast-based data dissemination middleware: A demonstration. In ICDE 2003, pp. 761–764, 2003. [24] P. Rosenzweig, M. Kadansky, and S. Hanna. The java reliable multicast service: A reliable multicast library. SMLI TR-98-68, Sun Microsystems, 1998. [25] K. Stathatos, N. Roussopoulos, and J. S. Baras. Adaptive data broadcast in hybrid networks. In VLDB, pp. 326–335, 1997. [26] P. Triantafillou, R. Harpantidou, and M. Paterakis. High performance data broadcasting systems. Mobile Networks and Applications, 7:279–290, 2002. [27] W. Zhang, W. Li, and V. Liberatore. Application-perceived multicast push performance. In IPDPS, 2004.

channel bandwidth between hot and cold documents. The main difference between previous work and ours is previous work deals with a broadcast environment with a single channel and focuses on scheduling items, not how to divide them into hot and cold. We are looking into the division of both documents and bandwidth to minimize latency. The hybrid scheme relies on estimates of the popularity of documents in the web site because popularity determines the assignment of documents to dissemination modes. Popularity estimation can be approached separately for pulled and for pushed documents. Pull popularity can be solved in sub-linear space by monitoring the client request stream [9]. As for push popularity, the problem is complicated by the absence of a client request stream. One solution is to occasionally drop each pushed document from the push channel, thus forcing clients to send explicit requests. Such requests can then be counted and the document popularity estimated [25]. A related problem is multicast group estimation [20], which can be specialized as follows in our context: remove a document from the multicast push channel and re-insert it as soon as the first request for that document is received. The document popularity can be estimated by the length of time it takes for the first client request to reach the server.

5.

CONCLUSION

In this paper we examined three data management problems that arise at the server in a hybrid data dissemination scheme. We argued that the document classification problem and bandwidth division problem should be solved in an integrated manner. We then presented a simple, yet essentially optimal, algorithm for the integrated problem. We validated the optimality of our algorithm experimentally. We proposed solving the push popularity problem by having each client request a hot document D i with some probability si , which the server sets in the push index. We looked at the difference in using our push popularity scheme versus using a scheme which simply drops an item off the push channel in order to test its popularity. We showed that dropping an item off the hot channel for one broadcast cycle can appreciably increase the average latency for approximately five broadcast cycles. Our proposed scheme does not suffer from such disruptions.

6.

REFERENCES

[1] S. Acharya, M. Franklin, and S. Zdonik. Balancing push and pull data broadcast. In ACM SIGMOD, pp. 183–194, 1997. [2] D. Aksoy and M. Franklin. Rxw: A scheduling approach for large-scale on-demand data broadcast. ACM/IEEE Transactions on Networking, 7(6):846–860, 1999. [3] K. C. Almeroth, M. H. Ammar, and Z. Fei. Scalable delivery of Web pp. using cyclic best-effort (UDP) multicast. In INFOCOM, pp. 1214–1221, 1998. [4] M. Altinel, D. Aksoy, T. Baby, M. Franklin, W. Shapiro, and S. Zdonik. Dbis toolkit: Adaptable middleware for large scale data delivery. In ACM SIGMOD, pp. 544–546, 1999. [5] M. Altinel, Q. Luo, S. Krishnamurthy, C. Mohan, H. Pirahesh, B. G. Lindsay, H. Woo, and L. Brown. Dbcache: Database caching for web application servers. In ACM SIGMOD, page 612, 2002. [6] Y. Azar, M. Feder, E. Lubetzky, D. Rajwan, and N. Shulman. The multicast bandwidth advantage in serving a web site. In 3rd NGC, pp. 88–99, 2001. [7] P. Chrysanthis, K. Pruhs, and V. Liberatore. Middleware support for multicast-based data dissemination: a working reality. In WORDS, 2003.

36

Semantic Multicast for Content-based Stream Dissemination 1

Olga Papaemmanouil

Uğur Çetintemel

Department of Computer Science Brown University Providence, RI, USA

Department of Computer Science Brown University Providence, RI, USA

[email protected]

[email protected]

ABSTRACT We consider the problem of content-based routing and dissemination of highly-distributed, fast data streams from multiple sources to multiple receivers. Our target application domain includes real-time, stream-based monitoring applications and large-scale event dissemination. We introduce SemCast, a new semantic multicast approach that, unlike previous approaches, eliminates the need for content-based forwarding at interior brokers and facilitates fine-grained control over the construction of dissemination overlays. We present the initial design of SemCast and provide an outline of the architectural and algorithmic challenges as well as our initial solutions. Preliminary experimental results show that SemCast can significantly reduce overall bandwidth requirements compared to traditional event-dissemination approaches.

1. INTRODUCTION There is a host of existing and newly-emerging applications that require sophisticated processing of high-volume, real-time streaming data. Examples of such stream-based applications include network monitoring, large-scale environmental monitoring, and real-time financial services and enterprises. Many stream-based applications are inherently distributed and involve data sources and consumers that are highly dispersed. As a result, there is a need for data streams to be routed, based on their contents, from their sources to the destinations where they will be consumed. Such content-based routing differs from traditional IP-based routing schemes in that routing is based on the data being transmitted rather than any routing information attached to it. In this model, sources generate data streams according to application-specific stream schemas, with no particular destinations associated with them. The destinations are independent of the producers of the Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Copyright is held by the author/owner. International Workshop on the Web and Databases (WebDB), June 17-18, 2004, Paris, France. 1

This work was supported by the National Science Foundation under grant ITR-0325838.

37

messages and are identified by the consumers’ interests. Consumers express their interests using declarative specifications, called profiles, which are typically expressed as query predicates over the application schemas. The goal of the content-based routing infrastructure is to efficiently identify and route the relevant data to each receiver. In this paper, we present an overview of SemCast, an overlay network-based substrate that performs efficient and Quality-ofService (QoS)-aware content-based routing of high-volume data streams. SemCast creates a number of semantic multicast channels for disseminating streaming data. Each channel is implemented as an independent dissemination tree of brokers (i.e., routers) and is defined by a specific channel content expression. Sources forward the streaming data to one or more channels. Destination brokers listen to one or more channels that collectively cover their clients’ profiles. The key advantages of SemCast are twofold: First, SemCast requires content-based filtering only at the source and destination brokers. As each message enters the network, it is mapped to a specific semantic multicast channel and is forwarded to that channel. As a result, the routing at each interior broker corresponds to simply reading the channel identifier and forwarding the message to the corresponding channel, thereby eliminating the need to perform potentially expensive content-based filtering. This approach differs from traditional content-based routing approaches [4, 5] that commonly rely on distributed discrimination trees: starting from the root, at each level of the tree, a predicate-based filtering network filters each incoming message and identifies the outgoing links to which the message should be forwarded. Even though a large body of work has focused on optimizing local filtering using intelligent data structures [3, 6, 9, 11, 13], the forwarding times in the presence of a large of number of profiles are typically in the order of tens of milliseconds or higher (depending on expressiveness of the data and profile model, and the number of profiles) [3, 6]. As a result, forwarding costs can easily dominate overall dissemination latencies, as well as excessively consume broker processing resources. The problem becomes more pronounced when data needs to be compressed for transmission, as each broker will then have to incur the cost of decompression and recompression, which are commonly employed when transmitting XML data streams [15].

Second, SemCast semantically splits the incoming data streams and sends each of the sub-streams through a different dissemination channel. This approach makes the system quite flexible and adaptive, as different types of dissemination trees can be dynamically created for each dissemination channel, depending on the QoS required by the clients of the system. On the other hand, existing approaches commonly rely on predetermined overlay topologies, assuming that an acyclic undirected graph (e.g., a shortest path tree) of the broker network is determined in advance [4, 5]. As we demonstrate, these approaches fail to recognize many opportunities for creating better optimized dissemination trees. Moreover, as shown by the recent work [7, 12], using meshes or multiple trees can significantly increase the efficiency and effectiveness of data dissemination. In the SemCast approach, different channels are created and the content of each channel is identified based on stream and profile characteristics. Each channel is implemented as a multicast tree, and clients subscribe to the channels by joining the corresponding multicast trees. The system gathers statistics in order to adapt to dynamic changes in the profile set and stream rates. SemCast uses such statistical information to periodically reorganize the channels’ contents, striving to improve the overall bandwidth efficiency of the system. The rest of the paper is structured as follows. In Section 2, we present the basic design and architecture of SemCast. Section 3 outlines the algorithmic challenges and presents our content-based channelization algorithm. In Section 4, we discuss our heuristic for QoS–aware tree construction. In Section 5, we present preliminary experimental results that characterize the bandwidth savings that can be achieved by SemCast over traditional content-based routing approaches. Section 6 briefly discusses prior related work and, finally Section 7 concludes with a brief discussion of our contributions and directions for future research.

2. SYSTEM MODEL SemCast disseminates XML streams2 consisting of a sequence of XML messages, where each XML message is a single XML document. Subscribers of the system express their profiles using the XPath query language [17]. Each client profile is also associated with a QoS specification that expresses the client’s service expectations [1]. A QoS specification is a constraint on a specific metric of interest: our initial concern is staleness. The system consists of a set of brokers organized as a P2P overlay network. Each publisher and subscriber is connected to some broker in the network; we refer to these brokers as source brokers and gateway brokers, respectively. Brokers with no publishers or subscribers connected to them are called interior brokers. SemCast also includes a coordinator node. The coordinator is responsible for maintaining the dissemination channels and deciding their content. Specific brokers of the overlay will also serve as rendezvous points (RPs). Each RP is responsible for at least one channel and serves as the root of the corresponding multicast tree. This network model is shown in Figure 1. Sources receive from the coordinator the current XPath channel content expressions. Any modification to these 2

Our approach is not restricted to XML streams and is also readily applicable to relational data.

38

Publishers

Source brokers

S

S

S Rendezvous point

Coordinator

RP

RP

RP

Gateway broker

Internal broker

Broker Network

Subscribers Channel A Channel B

Channel C

Figure 1: Basic SemCast system model expressions is pushed to the sources to keep them up-to-date. Sources filter incoming messages according to each channel’s content expression and decide to which channels to forward them. They forward these messages to the corresponding RPs, which are responsible for efficiently multicasting them to the subscribed clients through the broker network. Each broker maintains a simple routing table whose entries map channel identifiers to the descendant brokers. Upon receipt of a message addressed to a specific channel, a simple lookup is performed on the channel’s identifier and the message is forwarded to the returned descendant list. Each gateway broker maintains a local filtering table whose entries map the subscribers’ profiles to their IP addresses. Received messages are matched against the profiles of the filtering table and are forwarded only to interested clients. This local filtering ensures that end clients do not receive irrelevant data.

3. CONTENT-BASED CHANNELIZATION SemCast addresses the problem of content-based channelization of data streams. The problem requires determining: (1) how many channels to use, (2) the content of each channel, (3) which channels to use for each incoming message, and (4) which channels each subscriber should listen to. SemCast creates semantic channels on the fly: the number and content of the channels depend on the overlap of the subscribers’ profiles as well as stream characteristics (e.g., rates) and thus may change dynamically. SemCast has two primary operational goals while performing channelization. First, it ensures that there are no false exclusions: the subscription of gateway brokers to semantic channels guarantees that every XML message will be delivered to all the end clients with matching profiles. Second, SemCast strives to minimize the run-time cost: the cost metric we use here is the overall bandwidth consumed by the system. To minimize the run-time cost, we focus on (1) reducing any redundant messages reaching the channels and (2) creating efficient dissemination trees. In order to reduce data redundancy, our channelization algorithm strives to minimize

the overlap between the channel contents. Moreover, a decentralized incremental algorithm for constructing low cost multicast trees is used, in order to reduce bandwidth consumption. We distinguish two phases during SemCast’s operation: initial channelization and dynamic channelization. The former is used when there are no statistics available. As a result, initial channelization is based solely on syntactic similarities among profiles. On the other hand, dynamic channelization is used to reorganize the channels’ contents and membership, and is based on statistical information obtained from the system at run time. In the rest of the section, we describe these two phases in more detail.

3.1 Initial Channelization In SemCast, the initial content assignment to channels is based on a syntax-based containment function, which identifies containment relations among profiles. To achieve this, we use existing algorithms for identifying containment relations among XPath queries [18]3. We say that an XPath expression P1 covers (or contains) another expression P2 if and only if any document matching P2 also matches P1. Covering relationships for simple conjunctive queries can be evaluated in O(nm) time, where n and m are the number of nodes of the two tree patterns representing the XPath expressions [18]. Upon receipt of a new profile from a client, a gateway broker GB checks if the client can be served by the channels to which GB is already subscribed. Profiles not covered by these channels are sent to the coordinator. Each new profile, Pnew, received at the coordinator will be assigned to a channel whose content expression covers Pnew. If such a channel does not exist, a new channel is created for Pnew. If there are multiple channels that cover Pnew, then Pnew is assigned to the channel with the minimum incoming stream rate. This approach strives to assign profiles to a channel that disseminates the least relevant content to the gateway brokers. The coordinator informs the GB of the channel C it should subscribe to, and the GB joins the corresponding multicast tree. We then say that Pnew is assigned to channel C. As profiles are assigned to channels based on containment relations, syntax-based containment hierarchies among profiles are materialized and maintained. In these hierarchies, each profile has as ancestor a more general profile and as descendants more specific profiles. More specifically, each profile P has as parent a profile covering P, and as children profiles covered by P. Among multiple candidate parents, the most specific one (based on syntax-based containment function) is chosen. This implies that if Pi covers Pj and Pj covers Pk, the candidate parents for Pk are Pi and Pj, and Pj is chosen as Pk’s ancestor.

3.2 Dynamic Channelization The containment algorithms used in the initial channel assignment are based on the structure of XPath expressions. However, no conclusions can be inferred for the overlap of matching documents when no containment relation is found among a profile and a channel content expression. This implies that channels may not disseminate disjoint sets of documents, leading to multiple copies of the same document being sent 3

In the case of relational data, where profiles (attribute, value) pairs, containment relations defined in [5].

39

through different channels. Thus, initial channelization based on syntactic analysis alone is unlikely to produce good results. SemCast continuously re-evaluates its channel assignment decisions based on run-time stream rates and profile similarities, attempting to minimize the overall bandwidth consumption. To implement this, SemCast creates rate-based containment hierarchies existing among profiles. Periodically, SemCast creates the rate-based hierarchies to decide whether the channels’ contents and membership should be reorganized. Rate-based hierarchies reflect how the channels’ membership and content should be organized, based on the latest stream rates and profiles overlap. If the current state of SemCast “sufficiently” differs from the state shown in the rate-based hierarchies, channel reorganization takes place. In the following, we explain the dynamic channelization phase in more detail.

3.2.1 Rate-Based Containment Hierarchies We now define a partial overlap metric between profiles Pi and Pj: we say that Pi k-overlaps with Pj, denoted Pi ⊆k Pj, if the ratio of the number of messages matching Pj to that matching Pi is k. Obviously if k=1, then Pj contains (or covers) Pi, and the set of messages matching Pi is a subset of those matching Pj. In order to reduce the overlap among the contents of the channels, SemCast places profiles matching similar messages under the same channel. Thus, we construct the rate-based hierarchies such that each profile Pj is an ancestor of Pi if Pj covers Pi. Among multiple candidate parents for Pi, SemCast chooses the one with (1) higher overlapping part with Pi and (2) lower stream rate of messages matching the non-overlapping part between Pi and the candidate parent. Using the k-overlap metric defined above, the best parent Pj is the one with Pi ⊆1 Pj and k = max k r | P ⊆ k P , Pj , j ≠ i

{

j −i

j

i

}

where rj-i is the stream rate matching the non-overlapping part between Pi and Pj. To create the rate-based hierarchies, SemCast maintains information about the selectivity of the profiles in a selectivity matrix. This matrix is implemented as a sliding window of the incoming messages and maintained by the sources. The size of this window should be large enough to collect statistics with high confidence. Incoming messages are assigned weights. New messages have higher importance, and as different incoming messages are added in the sliding window, the importance of the old ones decades and their weight decreases. By holding a weighted sliding window, we base our statistics on the most recent messages, allowing a graceful aging of messages. Each row in this matrix refers to a different incoming message and each column to a different profile. Whenever the source creates a message matching a channel, the channel’s containment hierarchy is parsed in a top-down manner. At each step, we check whether the message matches the current profile. If it does, we continue with its children, otherwise we stop. If a match is found for a profile, the corresponding entry in the selectivity matrix is set to one, otherwise it remains zero. We note that although an array of size p x w has to be stored (where p is the number of profiles and w the size of the sliding window), only the profiles of the matching channels are updated each time. However, only a part of these profiles will be updated for each new message, since the match check will not always reach the leaves of the hierarchy.

Profile overlap estimates. SemCast calculates the pairwise partial overlap metric of the profile set in order to identify the current rate-based hierarchies. An overlap matrix is computed from the selectivity matrix. Each entry Oij represents the koverlap metric for the profiles Pi and Pj. One example is shown in the Figure 2 and Table 1, where we assume that rj-i is equal to 1. The dashed line shows that P1 is a candidate parent for P6, but it is not it final parent because of the low overlap they share. The following rules apply for the overlap matrix: (1) the root profiles of the hierarchies Pj have a single ‘1’ in the j-th column at the entry Ojj; (2) candidate parents Pj of each non-root profile Pi have Oji =1, i≠j; (c) the best candidate parent of Pi, is Pj such that j = max O r | O = 1 .

{

j ,i ≠ j

ij

j −i

ji

}

Once the coordinator has decided on the new rate-based hierarchy, it identifies if there are new channels to be created or removed and which profiles should migrate to other channels. This is a straightforward procedure involving comparisons between the root profiles of the current channels and the roots of the new rate-based hierarchies. Brokers are informed by the coordinator about the new channels they should subscribe or unsubscribe and sources receive the new channel list and their updated content expressions. Table 1. Overlap matrix P1 P2 P3 P4 P5 P6 P7

P1 1 0 0 0 0.8 0.2 0

P2 0 1 1 1 0 0 0

P3 0 0.4 1 1 0 0 0

P4 0 0.5 0.6 1 0 0.4 0.2

P5 1 0 0 0 1 0 0

P6 1 0 0 1 0 1 0.5

P7 0 0 0 1 0 1 1

3.2.2 Hierarchy Merging Highly diverse profiles could result in a large number of channels. This scenario may arise when only a small percentage of the profiles is covered by another. At the extreme case, the number of channels could be equal to the number of distinct profiles in the system. In this case, the system will not be able to exploit the actual overlap among the subscribers’ interests and will create independent paths for potentially each profile regardless of common messages. Moreover, the number of channels and space requirements for routing tables will increase. To address this problem, SemCast places in the same channel the containment hierarchies that overlap. However, hierarchy merging could increase redundancy, as gateway brokers attached to the final channel might receive extra messages matching the non-overlapping part between the merged hierarchies. For this reason, SemCast merges only those channels with a content overlap above a predefined threshold. Moreover, the channels to be merged should have low stream P4

P3

P6

P2

P7

P1

P5

Figure 2. Containment hierarchy from Table 1

40

rate for their non-overlapping parts. This strategy attempts to allow merging operations which cause the least possible data redundancy among channels. To implement this SemCast consults the overlap matrix. We compare two root profiles Pj and Pi and check their partial overlap, given by Oij. If this overlap is above the merging threshold, then the hierarchies of Pi and Pj are placed on the same dissemination channel. The root profile of the channel is set to the disjunction of Pi and Pj. We note that the dynamic channelization contains implicitly the hierarchy splitting operation. Assume that by the time for the next reorganization phase, the selectivity of P2 in Figure 2 has changed (or even the rate of its matching messages) and it is no longer covered by any profile. In this case the first hierarchy will split into two, one with P4 as the root and one with P2, creating three final hierarchy trees (and possibly three channels).

4. DISSEMINATION TREE CONSTRUCTION SemCast relies on a distributed and incremental approach to create two different types of dissemination trees, low-cost or delay-bounded trees, which we briefly describe in the rest of this section. Low-cost trees. In order to connect to a channel, a gateway broker, GB, first receives a list of the current destinations in the channel from the coordinator. GB then finds the min-cost path to each destination and connects to the channel through the closest one. Join requests are sent all the way from the GB to the first node in the tree and the routing tables of the brokers on this route are updated to reflect the new tree structure. In order to reduce message replication, we try to reduce the number of edges in the trees. This algorithm is essentially a distributed and incremental adaptation of centralized Steiner tree construction algorithms (e.g., [16]). Delay-bounded trees. SemCast can also create delaybounded dissemination trees if clients indicate latency expectations (as part of their QoS specifications). As before, a GB first finds a connection point, CP, in the tree using the lowcost heuristic described earlier. All brokers continuously track their “distance” from the root of the channels they maintain. If the CP realizes that it will not be able to meet the latency constraint tagged to the join request, it will forward the join request to its parent. This process will continue until either a broker that is sufficiently close to the root is found or it is decided that the constraint cannot be met. In the latter case, the client will be notified and, optionally, the GB will be connected through the shortest path to the root.

5. PRELIMINARY EVALUATION We performed a preliminary evaluation of SemCast using simulations over broker networks, generated using GT-ITM [19]. In the experiments, we used networks of up to 700 nodes and 7000 profiles. In order to study the impact of profile similarities, we use a parameter called the profile selectivity factor, which determines the number of distinct nonoverlapping profiles. Unless otherwise noted, no partial overlap exists among channels; so there is no need to merge hierarchies. We simulated four different approaches to content-based dissemination. First, we simulated an approach that models a traditional, distributed Publish-Subscribe system (such as SIENA). In this model, profiles are propagated and aggregated

90

Unicast SPT

80

% cost degradation

% cost degradation

90

SemCast Heuristic

70 60 50 40 30 20

Unicast

80

SPT

70

SemCast Heuristic

60 50 40 30 20 10

10

0

0

200 0

0.025

0.05

0.06

0.125

0.18

0.25

0.5

0.7

400

600

network size (profile selectivity factor =0.7%)

profile selectivity factor (%)

Figure 3. Relative bandwidth efficiency for varying profile similarity factors

Figure 4. Relative bandwidth efficiency for different network sizes

up to the source over shortest paths from the clients to the source. In other words, we created a predicate-based overlay filtering tree on top of a shortest path spanning tree of the broker network. Filter-based routing tables are placed at each broker and messages are matched at each hop against the filters in the filter table to determine the next hop(s) towards the clients. We refer to this algorithm as SPT. Second, we simulated our tree construction heuristic described in the previous section. Third, we simulated a centralized Steiner tree construction algorithm [16] where the coordinator knew a priori all the profiles and the destinations for each channel. This approach provides a good approximation to the optimal solution; we therefore refer to it as the Optimal approach. Finally, we simulated a Unicast approach, where messages are filtered at the sources and are forwarded to the interested parties using shortest paths. This last approach models a centralized filtering system that uses unicast dissemination of filtered messages. Basic Performance. Figure 3 shows the cost degradation (i.e., extra bandwidth consumption) over the optimal algorithm for the other three approaches for 700 nodes. SemCast performs the closest to the optimal, incurring up to only 6.4% higher cost on the average. SPT performs better than Unicast; it does, however, increase the cost by 23% on the average, whereas Unicast’s average cost increase is 41%. Figure 4 shows that the cost degradation increases with increasing network size for Unicast and SPT. At the same time, SemCast’s heuristic is able to keep the bandwidth consumption lower than 8.5% of the optimal solution, for all network sizes. Benefits of Dynamic Reorganization. We also measured how dynamic reorganization can improve bandwidth consumption when there is partial overlap among channels and, thus, hierarchy merging is used to reduce data redundancy. We observed that, using a merging threshold of 40% partial overlap among hierarchies can reduce the cost up to 23%, compared to SPT. As the partial overlap among profiles increases, SemCast also improves its bandwidth efficiency. In cases where a profile is contained by another with probability less than 0.7, SemCast has a cost improvement up to 21% (with a 30% merging threshold and 1% profile selectivity factor). We also conducted experiments with varying number of profiles and selectivity factors. The results show that even with large profile sets, SemCast can significantly improve bandwidth

efficiency compared to the other approaches. Finally, as the profile selectivity factor in the system increases, we can dynamically set the merging threshold to higher values and continue to achieve significant cost improvements. QoS-Aware Dissemination. We also studied SemCast’s performance when subscribers express latency constraints. Here, there is a fundamental tradeoff between dissemination costs and latencies: the stricter the delay expectations are, the higher the cost of dissemination is. Our results clearly demonstrated this tension, but also showed that the cost increase due to the latency constraints typically does not go beyond 10%, even with low profile selectivity factors (i.e., 1%). Due to space limitations, we do not include these results.

41

6. RELATED WORK XML filtering. A large body of work has focused on optimizing local filtering using intelligent data structures. For instance, XFilter [3] uses indexing mechanisms and converts queries to finite state machine representations to create matching algorithms for fast location and evaluation of profiles. In [9], a space-efficient index structure that decomposes tree patterns into collections of substrings and index them using a trie is used to improve the efficiency of XML data. Content-based routing. Unlike SemCast, existing approaches for content-based networking do not the address the issue of constructing the dissemination overlays. Gryphon [4] assumes that the best paths connecting the brokers are known a priori. Likewise, Siena [5] assumes that an acyclic spanning tree of the brokers is given. Application-level multicast. Many recent research proposals (e.g., [8, 14]) employ application-level multicast for delivering data to multiple receivers. Members of the multicast group self-organize into an overlay topology over which multicast trees are created. In all these cases, members of a group have the same interests. Similar to SemCast, SplitStream [7] aims to achieve high-bandwidth data dissemination by striping content across various multicast trees. However, SplitStream does not address profile-based dissemination and channelization. Bullet [12] uses an overlay construction algorithm for creating a mesh over any distribution tree, improving the bandwidth throughput of the participants. As a result, SemCast can utilize the techniques introduced in this work to improve its bandwidth efficiency. XML-based filtering using overlays was investigated in [15]. This work uses

multiple interior brokers in the overlay to redundantly transmit the same data from source to destination, reducing loss rates and improving delivery latency. Channelization. The channelization problem was studied in [2] in the context of simple traffic flows and a fixed number of channels. The problem was shown to be NP-complete. In our model, the fact that the number of multicast groups is not fixed, and is dependent on the overlap of the receivers’ interests, makes the problem more challenging. Finally topic-based semantic multicast was introduced in [10]. In this work, users express their interests by specifying a specific topic area. Similarity ontologies are used to discover more general topics served by existing channels. This work does not consider content-based routing, and the focus is not on minimizing bandwidth consumption.

[5]

[6] [7]

[8]

7. CONCLUSIONS & FUTURE WORK We introduced a semantic multicast system, called SemCast, which implements content-based routing and dissemination of data streams. The key idea is to split the data streams based on their contents and spread the pieces across multiple multicast channels, each of which is implemented as an independent dissemination tree. SemCast continuously reevaluates its channelization decisions (i.e., the number channels and the contents of each channel) based on current stream rates and profile similarities. The key advantages of the SemCast approach over traditional approaches are that (1) it does not require multi-hop content-based forwarding of streams, and (2) it facilitates finegrained control over the construction of dissemination trees. Preliminary experimental results show that SemCast can significantly improve the bandwidth-efficiency of dissemination, even when the similarity among client profiles is low. Currently, we are finalizing SemCast’s design and pertinent channelization protocols. There are several directions in which we plan to extend our current work. First, we plan to extend our QoS notion to include data quality ― the idea is to have multiple channels carrying the same data stream at different “resolutions”. Second, we would like to extend our basic channelization approach to also take into account the network topology and physical locality of the brokers, in order to be able to create better optimized dissemination trees. Finally, we plan to implement a SemCast prototype to verify the practicality and efficiency of the approach as well as to demonstrate the claimed processing cost benefits for the interior brokers.

[9]

[10]

[11] [12]

[13]

[14]

[15]

8. REFERENCES [1]

[2] [3] [4]

D. Abadi, D. Carney, U. Cetintemel, M. Cherniack, C. Convey, C. Erwin, E. Galvez, M. Hatoun, J. Hwang, A. Maskey, A. Rasin, A. Singer, M. Stonebraker, N. Tatbul, Y. Xing, R.Yan, and S. Zdonik. Aurora: A Data Stream Management System. In Proc. of the International SIGMOD Conf, 2003. M. Adler, Z. Ge, J. F. Kurose, D. Towsley, and S. Zabele. Channelization problem in large scale data dissemination. In ICNP'01, November 2001. M. Altinel and M. J. Franklin. Efficient Filtering of XML Documents for Selective Dissemination of Information. In 26th VLDB Conference, 2000. G. Banavar, T. Chandra, B. Mukrerjee, J. Nagarajarao, R. Strom, and D. Sturman. An Efficient

42

[16] [17] [18] [19]

Multicast Protocol for Content-Based PublishSubscribe Systems. In 19th ICDCS, May 1999. A. Carzaniga, D. S. Rosenblum, and A. L. Wolf. Design and Evaluation of a Wide-Area Event Notification Service. ACM Transactions on Computer Systems, 19(3):332-383, August 2001. A. Carzaniga and A. L. Wolf. Forwarding in a Content-Based Network. In SIGCOMM, 2003. M. Castro, P. Druschel, A.-M. Kermarrec, A. Nandi, A. Rowstron, and A. Singh. SplitStream: Highbandwidth content distribution in cooperative environments. In 19th ACM Symposium on Operating Systems Principles, October 2003. M. Castro, P. Druschel, A.-M. Kermarrec, and A. Rowstron. SCRIBE: A large-scale and decentralized application-level multicast infrastructure. IEEE Journal on Selected Areas in communications (JSAC), 20(8), October 2002. C.-Y. Chan, P. Felber, M. N. Garofalakis, and R. Rastogi. Efficient Filtering of XML Documents with XPath Expressions. VLDB Journal, Special Issue on XML, 11(4):354-379, 2002. S. Dao, E. Shek, A. Vellaikal, R. R. Muntz, L. Zhang, M. Potkonjak, and O. Wolfson. Semantic multicast: intelligently sharing collaborative sessions. ACM Computing Surveys (CSUR), 31(2es), 1999. Y. Diao, M. Franklin, P. Fischer, and R. To. YFilter: Efficient and Scalable Filtering of XML Documents (Demonstration Description). In ICDE, 2002. D. Kostic, A. Rodriguez, J. Albrecht, and A. Vahdat. Bullet: High Bandwidth Data Dissemination Using an Overlay Mesh. In 19th ACM Symposium on Operating Systems Principles, October 2003. L. Opyrchal, M. Astley, J. Auerbach, G. Banavar, R. Strom, and D. Sturman. Exploiting IP Multicast in Content-Based Publish-Subscribe Systems. In IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), April 2000. S. Ratnasamy, M. Handley, R. Karp, and S. Shenker. Application-Level Multicast using Content Addressable Networks. In 3rd International Workshop on Networked Group Communication (NGC '01), November 2001. A. C. Snoeren, K. Conley, and D. K. Gifford. MeshBased Content Routing using XML. In 20th ACM Symposium on Operating Systems Principles, 2001. H. Takahashi and A. Matsuyama. An Approximate Solution for the Steiner Tree Problem in Graphs. Mathematica Japonica, 1980. W3C. XML Path Language (XPath) 1.0. 1999. P. T. Wood. Containment for XPath fragments under DTD constraints. In 9th International Conference of Database Theory, January 2003. E. Zegura, K. Calvert, and S. Bhattacharjee. How to Model an Internetwork. In INFOCOM96, 1996.

Twig Query Processing over Graph-Structured XML Data Zografoula Vagena

Mirella M. Moro

Vassilis J. Tsotras

University of California Computer Science & Engineering Riverside, CA 92521, USA



[email protected]

[email protected]

[email protected]

ABSTRACT

benefit from the existence of a schema, such as query formulation. These structural summaries can play an important role in query evaluation, since they make it possible to answer queries directly from them, instead of considering the original data, which has potentially larger size. In general, they have graph forms, even when derived from tree structures. Query languages proposed for XML data, such as XQuery [5] and XPath [4], consider the inherent graph-structure and lack of schema and permit querying both on the structure and on simple values. The structural selection part is performed with a navigational approach, where data is explored and elements are located starting from determined entry points. Tree Pattern Queries [3], also known as twigs, that involve element selections with specified tree structures, have been defined to enable efficient processing of the structural part of the queries. Consider for example the bibliography database of Figure 1. Figure 1a illustrates the graph representation of the XML database, Figure 1b shows its structural summary, namely the A(k)-Index (where k is ∞) [18]. In Figure 1a, solid lines represent edges between elements and subelements, while dashed lines represent idref edges. We issue the query: //bib[.//author[@name=‘Chen’]]/article[@title=‘t1’] that retrieves articles with title ‘t1’ if there exists an author with name ‘Chen’. The structural part of the query can be represented as a twig and is shown in Figure 1c. While value-based queries can be evaluated by adopting traditional indexing schemes, e.g. B + -trees, efficient support for the structural part is a unique and challenging problem. Previous efforts to process the structural part of the query directly from the data [29, 2, 6, 12, 21, 27, 26] have assumed the tree structured model of XPath and cannot be applied when the structure is a general directed graph. Considering graph structures, research on structural summaries [14, 24, 13, 20, 18, 9, 7, 28] has focused on the construction and maintenance of effective structures, i.e. structures that are (a) space efficient, (b) optimized with regard to a specified query workload, and (c) gracefully adaptable in the presence of updates. In other words, the query processing task has received little attention. In this paper, we propose techniques for the evaluation of twig queries over graph-structured data. We investigate new ways to adapt ideas that have been proved successful for the case of tree-structured data, namely mapping of the structural conditions to join conditions. We start with the particular class of directed and acyclic graphs (which we show being the structure of the data in many cases) and

XML and semi-structured data is usually modeled using graph structures. Structural summaries, which have been proposed to speedup XML query processing have graph forms as well. The existent approaches for evaluating queries over tree structured data (i.e. data whose underlying structure is a tree) are not directly applicable when the data is modeled as a random graph. Moreover, they cannot be applied when structural summaries are employed and, to the best of our knowledge, no analogous techniques have been reported for this case either. As a result, the potential of structural summaries is not fully exploited. In this paper, we investigate query evaluation techniques applicable to graph-structured data. We propose efficient algorithms for the case of directed acyclic graphs, which appear in many real world situations. We then tailor our approaches to handle other directed graphs as well. Our experimental evaluation reveals the advantages of our solutions over existing methods for graph-structured data.

Categories and Subject Descriptors H.3.3 [Information Search and Retrieval]

1.

∗

INTRODUCTION

The widespread acceptance and employment of XML in B2B applications as the standard for data exchange among heterogeneous sources has claimed for efficient storage and retrieval of XML data. Contrary to relational data, XML data is self-describing and irregular. Hence, it belongs to the category of semi-structured data, inheriting its graphstructure model. This absence of schema in XML has led to the employment of structural summaries [14, 24, 13, 20, 18, 9, 7, 28], derived from the data, in order to facilitate tasks that would ∗ This research is partially supported by CAPES, NSF IIS0339032, and Lotus Interworks under UCMicro grant 47938

Copyright is held by the author/owner. Seventh International Workshop on the Web and Databases (WebDB 2004), June 17-18, 2004, Paris, France .

43

Figure 1: (a) Bibliography XML database, (b) A(k)-Index , (c) Sample Query

then adapt them for the case of general directed graphs. Our techniques can be used to answer queries either over the original data or through structural summaries. The rest of the paper is organized as follows. Section 2 presents background and related work. In Section 3, we describe and analyze our techniques, and in Section 4 we experimentally investigate their effectiveness. Section 5 provides our conclusions and directions for further work.

2.

construct the result by matching parts of the query incrementally. While the above techniques operate at the granularity of the original data, query processing can employ structural summaries that have been proposed [20, 18, 19, 9, 7]. Using those structures, the data graph is summarized with a graph of a smaller size that maintains the structural characteristics of the original data. Each index node identifies a group of nodes in the original document. The concept of bisimilarity [25] is utilized to form the groups. Those proposals are mainly concerned with creating effective indexes and little is said about the processing of the twig queries. In summary, they regard the twig query as a path query (primary path) with structural constraints on the nodes of that path. They employ navigation-based techniques to identify nodes that match the primary path and recursively check the validity of structural constraints. However, such an approach may suffer from unnecessary access over the input in order to check the structural constraints. It would be desirable to devise techniques with similar characteristics as the ones in the previous paragraph for the case of structural summaries too. Nevertheless, the graph structure of those summaries hinders the direct adaptation of the techniques described in the previous paragraph.

BACKGROUND AND RELATED WORK

We consider a model of XML data in which we ignore comments, processing instructions and namespaces. Then an XML database can be modeled as a directed, node-labeled graph G = (V, E, ERef , root), where V is the set of nodes (indicating elements, values or attributes), E is the set of edges, which indicate an object-subobject relation, and ERef is the set of reference edges (idref ). Each node in V is assigned a string literal label and has a unique identifier. The single root element is labeled ROOT. In Figure 1a, the graph structure of a bibliography database is presented. Previous work has considered in detail the case where no reference edges exist. In those cases, the data structure is a tree. In order to efficiently capture the structural relationships between nodes, a region-based numbering scheme [11, 29, 2, 6] is usually embedded at the document nodes. In that scheme, each node in the document tree is assigned a region: (left,right). Given a node pair (u, v), node v is a descendant of node u (which is then an ancestor of v), if and only if, u.lef t < v.lef t < v.right < u.right. Using that numbering scheme, set-based techniques [29, 22, 2, 6] regard the input as sequences of elements, and perform a join in a sort-merge fashion. They use the containment condition described in the previous paragraph as the join condition. Indexes have been proposed to skip elements that do not participate in any of the results [8, 6, 16]. Navigation-based techniques [12, 15], which use the input to guide the computation, have also been proposed. Those techniques answer the queries with a single pass over the input. The input is traversed in document order (i.e. in pre-order) and, using an FSM (Finite Sate Machine), the query pattern is matched with particular path instances as those instances become available. Work in [21] also utilizes navigation based techniques to solve a wider range of queries. More recently, similar numbering schemes have been employed in query-driven input-probing methods [27, 26], where the structure of the query defines the part of the input data to be examined next. Methods in this category

3.

PROCESSING TWIGS IN GRAPHS

In this section, we describe our solution to efficiently process twig queries over graph structured data. We focus on the descendant axis and consider the child axis as a special case. In a directed graph environment, the ancestordescendant relationship of a tree pattern edge is satisfied if there is a path from the ancestor node to the descendant node. We first describe node labeling schemes that identify the structural relationship between two graph nodes (as the numbering scheme described in the previous section does for tree structures). We then describe our matching algorithms that utilize those schemes to efficiently compute twig queries and analyze their behavior.

3.1

Labeling Scheme for Digraphs

As mentioned earlier, in a directed graph environment, the ancestor-descendant relationship is satisfied if there is a path from the ancestor node to the descendant node. We would like to be able to answer the question of the existence of such a path within reasonable time. In graph theory, this is the well know reachability problem and can be handled by computing the transitive closure of the graph structure.

44

Then, for each pair of nodes, the reachability question can be answered in constant time. However such an approach is space consuming, as it requires space O(n2 ) in the number of nodes. The question that is posed is whether one can get the same functionality using less space, possibly trading some of the time to check reachability. The answer is yes, by employing the minimum 2-hop cover of a graph. However, it has been shown [10] that the problem of identifying the minimum 2-hop cover is a NP-hard problem. A practical solution that identifies a 2-hop cover close to the smallest cover is presented in [10] and is employed in our techniques. The notion of 2-hop labeling is defined as follows:

Theorem 1. The A(k)-Index derived from a tree structure, with k lower bounded by the height of the tree, is acyclic. Proof Sketch. The proof is based on the observation that if two nodes do not have the same level, then they cannot be k-bisimilar for k larger than the height of the tree (proof is omitted for lack of space). Moreover, documents where subelements cannot have the same label as their superelements also produce directed and acyclic graphs. From the above discussion, it is obvious that the class of directed acyclic graphs is very important in many real world situations. As one would expect, there are opportunities to further optimize the computation of twig queries over such structures. In [17], a labeling scheme is discussed to handle the reachability problem in planar directed graphs with one source and one sink. We argue that a large number of directed acyclic graphs present in our environment can be mapped to this framework, by introducing, if necessary, a dummy node to play the role of a sink. A similar method (dummy nodes introduction) allows to use techniques for planar graphs into non-planar ones. Moreover, because the reachability property between two nodes is not affected by the introduction of those dummy nodes, the latters can be used to create the labeling and be discarded afterwards. The role of the source can be played by the root element in the graph model described in Section 2. In such graphs the following theorem holds [17]:

Definition 1. Let G = (V, E) be a directed graph. A 2hop reachability labeling of G assigns to each vertex v ∈ V a label L(v) = (Lin (v), Lout (v)), such that Lin (v), Lout (v) ⊆ V and there is a path from every x ∈ Lin to v and from v to every x ∈ Lout . Furthermore, for any two vertices u, v ∈ V , we should have: \ v ; u iff Lout (v) Lin (u) 6= ∅ (1) P The labeling size is defined to be v∈V |Lin (v)| + |Lin (v)|. To obtain 2-hop labels, for each node, the 2-hop cover is defined as: Definition 2. Let G = (V, E) be a directed graph. For every u, v ∈ V , let Puv be a collection of paths from u to v and P = {Puv }. A hop is defined to be a pair (h, u), where h is a path in G and u ∈ V is one of the endpoints of h (u is called the handle of the hop). A collection of hops H is said to be a 2-hop cover of P if for every u, v ∈ V , such that Puv 6= ∅, there is a path p ∈ Puv and two hops (h1 , u) ∈ H and (h2 , v) ∈ H, such that p = h1 h2 , i.e. p is the concatenation of h1 and h2 . The size of the cover H, is equal to the number of hops in H.

Theorem 2. Let G = (V, E) be a planar, directed and acyclic graph with one source and one sink. For each node v ∈ V , a label L consisting of two numbers av and bv is assigned to it. Then for every pair of nodes u and v ∈ V with labels (au ,bu ) and (av ,bv ) the following holds: u ; v iff au < av and bu < bv

In the above definition, the set of paths P can be the set of all shortest paths between each pair of labels in the graph. Having produced a 2-hop cover for a graph, the 2-hop labels for the nodes in the graph can be created by mapping a hop with handle v to an item in the label of the node v. In [10], a polynomial time algorithm is provided that finds a 2-hop cover close to the smallest such cover (larger by a factor of O(logn)). Their experiments show that 2-hop covers produced are compact, and each label size is a very small portion of the total number of nodes in the graph. Having two nodes v and u with labels (Lin (v), Lout (v)) and (Lin (u), Lout (u)), equation 1 can be checked efficiently either in a hash or in a sort based fashion.

3.2

(2)

One can always find such a label. The algorithm to embed the labeling is very simple. Two depth-first searches are performed. A counter assigns the labels to the nodes and is initialized to n + 1, where n is the number of vertices in the graph. First, a left/depth-first search is performed. A stack is used such that a node is pushed when it is first reached, and popped when all the edges directed from it have been examined. The value that the counter has when a node is popped is assigned to the node as the first number of each label, and the value of the counter is decreased by one. At the end of the traversal, the counter is reinitialized and a second, right/depth-first, search is performed. During this search, the second number is assigned to the nodes in the same way as before.

Labeling Scheme for DAGs

In the previous section we discussed a labeling scheme that can be used to answer the reachability question on any directed graph. However, although the time to answer is reasonably small, it is not constant anymore. In this section, we attempt to fix that for special graphs, namely directed acyclic graphs (DAG). We first argue that many useful structures in our environment fall into this category and then we describe an alternative scheme, which achieves to answer the reachability question in constant time for those graphs. As mentioned in Section 1, in general, the structural summaries are graph structures, even when derived from tree structures. Nevertheless, in that case, one can prove the following theorem:

3.3

Twig Processing in XML Graphs

We proceed with first describing our algorithm to handle twig queries when the input is a directed acyclic graph and the 2-number labelling scheme is utilized. Subsequently, we explain how the technique can be adapted when the structure is a general directed graph.

3.3.1

Twig Processing in DAGs

The algorithm proceeds in a navigational manner and employs an FSM to identify matching nodes. It adopts ideas that have appeared in [6, 12, 15] to guide the computation of the FSM, to encode partial results and to output

45

Figure 2: Twig processing example

the final results. We assume that the whole twig instance is required as output, however, XPath semantics can also be supported with minimal modifications (e.g. by filtering out the elements that do not participate in the result in a post-processing step that is pipelined with the matching algorithm). In [12], a technique to answer multiple path queries, by sharing common prefixes over tree structures, is proposed. We describe an adaptation of this technique to efficiently answer twig queries too. Then, we point out necessary modifications to support directed acyclic graph structures. The twig query is first divided into its constituent path queries, which are then simultaneously handled as in [12]. The algorithm utilizes an FSM, which is derived by the twig query to guide pattern matching. Either an NFA (Nondeterministic Finite Automaton) or DFA (Deterministic Finite Automaton) can be used, and each choice has its advantages and disadvantages. An NFA will possibly have a smaller number of states, while a DFA will avoid the backtracking that the programmatic simulation of an NFA incurs. We decided to adopt the DFA solution as it has been shown to provide performance advantages [23, 15]. The query-to-automaton mapping algorithm is an extension of the one provided in [12], so that the NFA is converted to the minimal equivalent DFA, and is omitted for lack of space. A runtime stack maintaining the DFA states is utilized to buffer previous states of the machine and to allow backtracking to those states, when necessary. Besides the runtime stack, a stack is associated with each query node and is called “elements stack” from now on. The role of elements stacks is to buffer document nodes that compose intermediate results, until the final results that contain them are formulated. The set of those stacks creates a compact representation of partial and total answers as in [6]. The input is accessed in document order, and only elements with the same tag as those of the query nodes are accessed. To achieve that, the document is preprocessed, and the elements are tag divided into sequences sorted in document order. Moreover, each element is augmented with the region-based numbering scheme described in Section 2. That way, the access in document order can be performed by sequentially traversing each sequence and picking, among the current elements, the one with the smallest left number. Visiting the elements in such a way guarantees that: (a) descendant nodes will be accessed after their ancestor counterparts and (b) when an ancestor node is found not to be joined with a descendant node, it can be discarded, as it is not going to be joined with any of the subsequent elements. The algorithm proceeds as follows:

into the runtime stack and becomes the active state. • When a new element arrives, the DFA execution follows all matching transitions from all current active states. Moreover, the new element is pushed into its corresponding element stack, possibly triggering the popping of elements that will not participate into new results. • When the subtree rooted at the element that pushed the current active states into the stack has been processed, the top set of active states is popped and the automaton backtracks. • When an element with the same tag as a query leaf is about to be pushed into the stack, the path instances being formulated are created in a recursive manner, in a way similar to [6]. As described above, the algorithm produces all the path instances matching the path patterns that constitute the twig query. To construct the twig instances, those path instances need to be combined. An efficient way is by merging them on their common prefixes. This requires the path instances to be sorted in root-to-leaf order. However, the algorithm produces the results in leaf-to-root order. We adapt the blocking technique described in [6] to achieve that. In this technique, two linked lists are associated with each node q in the stack: the first, (S)elf-list, holds all partial results with root node q, and the second, (I)nherit-list, holds all partial results of descendant of q. When a node is to be popped from its stack, its self- and inherit- lists (in that order) are appended to the inherit lists of the node below it, if one exists; or to the self list of all the associated elements of the stack that corresponds to the parent node in the query. If the element is the last one in the stack corresponding to the root node of the query, then the lists are returned. When the input structure is a directed acyclic graph, one node may have multiple parents. In this case, region based numbering schemes cannot be used to identify the relative position of two elements, and the scheme described in Section 3.2 is employed instead. Moreover, the document order is not defined anymore. However, one can still get the advantages of that order as follows: if the nodes are accessed in the same order as in the left/depth-first search described in 3.2, it still holds that when an ancestor node is found not to be joined with a descendant node, it can be discarded, as it is not going to be joined with any of the subsequent elements. The computation proceeds similarly as in the case of the tree data structure. Furthermore, using the numbering scheme of Section 3.2 and by tag grouping the elements,

• The DFA is set to its initial state. That state is pushed

46

Figure 3: (a) Document DTD, (b) Queries, (c) Performance Evaluation sorted on the first number of the label, we need only access elements having the same tags as the query nodes, instead of the whole document. Paths rooted at nodes with multiple parents will be accessed multiple times (once for each parent), i.e. backtracking needs to be performed in the element sequences. In several cases we can further optimize the performance of the algorithm, if we have a bound on the number of parents of each node. In those cases, and possibly performing a breadth first like access of the graph, we can re-use portions of paths that have already been accessed, and reside in the stacks, without accessing the same nodes again.

formance characteristics are the same as in [20] (where only path queries are discussed).

4.

4.1

Experimental Setup

We compared the performance of our techniques with the technique described in [18] for the evaluation of branch expressions, where twig queries are treated as having a primary path with structural constraints on the nodes that belong to that path. We call this technique PEWSC (Path Evaluation With Structural Constraints) from now on. For graph structured data, we used the XMark [1] generator to create an 100Mb database, and we only took into consideration the part of the document described by the DTD graphically illustrated in Figure 3a. For this part of the document, the node category residing under the node incategory is an IDREF referencing the node category residing under the node categories. Hence, the document is a DAG. To abstract between XML documents and structural summaries, we treat IDREF edges as any other edge. We run the queries shown in Figure 3b. In this figure, the primary path is marked with a ∗ symbol. We compare the PEWSC technique with our algorithm when 2-hop covers are used to determine reachability (we call the technique IN2HR Input Navigation with 2 Hop-cover for Reachability - from now on) and when the labeling described in Section 3.2 is employed (we call this technique IN2NR - Input Navigation with 2 Numbers for Reachability - from now on). In this last case we could access only the sequences with the same tags as the query nodes (as already described in Section 3.3). The performance measure is the total number of nodes visited to answer the query. This choice is justified as in the general case of graph structures it is difficult to make any guarantees about clustering for the index nodes [20]. Consequently, the access to the index nodes is, in general, an I/O operation per index node accessed.

Example 1. Figure 2 illustrates a simple example. The query to be evaluated is //a//b[.//c]//d, which is decomposed in two path queries, P1 and P2, as showed in Figure 2a. This figure also illustrates the DFA that is generated by grouping the path queries in common shared states (0, 1, and 2). In the DFA: a circle denotes a state; a bold circle denotes a final (accepting) state, which is marked by the IDs of accepted queries; a directed edge is a transition; the symbol on an edge is the input that triggers the transition. This is just a simplified illustration, since the actual DFA has more transitions (one per symbol in the query in each state), not showed here for clarity purposes. Consider the document fragment in Figure 2b. While the elements from a1 to d1 are read through the left path, the list of current states is pushed to the run-time stack as shown in Figure 2c, and each element is pushed into its corresponding element stack, as shown in Figure 2d. When c1 is read, the query path P1 is matched, and when d1 is read, P2 is matched. When the subtree with root b2 finishes processing, the backtracking is performed by popping the run-time stack once for each of its elements, as the second stack in 2c shows, while the elements b2 , c1 , d1 are popped from the elements stacks. When b3 is read, its parent is a2 , and it behaves the same way as b2 , such that it is combined with c1 and d1 to form new partial results as well (not shown in the figure).

3.3.2

EXPERIMENTAL EVALUATION

In order to investigate the effectiveness of our techniques, we performed a group of experiments over benchmark data. We begin by describing our evaluation methodology and then provide the results of our study.

Twig Processing in general digraphs

The same algorithm can be utilized for the case of general directed graphs. However, in this case the relative position of two nodes is determined by 2-hop covers, as described in Section 3.1, and access is performed by following the actual edges of the graph. A table similar to the one described in [20] is employed to avoid unnecessary cycles and the per-

4.2

Performance Results

The results of the experiments described in the previous paragraph are shown in Figure 3c. When the query is a path

47

query (Q1), PEWSC and IN2HR access the same number of nodes. However, when structural constraints are added (Q2Q5), PEWSC has to rescan parts of the documents to check for their validity, and as a result its performance degrades. Algorithm IN2NR performs always much better than the other algorithms as it manages to discard the sequences of elements that do not participate in the results. On the other hand, when the tag of query node exists under different contexts (i.e. nodes with tag name), then the algorithm will need to access a possibly large number of elements that do not participate in the result. However, in those cases, indexing as in the case of region-based numbering schemes ([8, 6, 16]) could further improve performance. The encouraging news is that the labeling described in Section 3.2 enables such indexing. We plan to investigate the indexing possibility in the future.

5.

[10] E. Cohen, E. Halperin, H. Kaplan, and U. Zwick. Reachability and distance queries via 2-hop labels. In Proc. of ACM-SIAM SODA, 2002. [11] M. P. Consens and T. Milo. Optimizing queries on files. In Proc. of SIGMOD, 1994. [12] Y. Diao, M. Altinel, M. J. Franklin, H. Zhang, and P. M. Ficher. Path sharing and predicate evaluation for high-performance xml filtering. ACM TODS, 28(4), Dec 2003. [13] M. N. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri, and K. Shim. Xtract: A system for extracting document type descriptors from xml documents. In Proc. of SIGMOD, 2000. [14] R. Goldman and J. Widom. Dataguides: Enabling formulation and optimization in semistructured databases. In Proc. of VLDB, 1997. [15] A. Halverson, J. Burger, L. Galanis, A. Kini, R. Krishnamurthy, A. N. Rao, F. Tian, S. Viglas, Y. Wang, J. F. Naughton, and D. J. DeWitt. Mixed mode xml query processing. In Proc. of VLDB, 2003. [16] H. Jiang, W. Wang, H. Lu, and J. X. Yu. Holistic twig joins on indexed xml documents. In Proc. of VLDB, 2003. [17] T. Kameda. On the vector representation of the reachability in planar directed graphs. Information Processing Letters, 3(3), January 1975. [18] R. Kaushik, P. Bohannon, J. F. Naughton, and H. F. Korth. Covering indexes for branching path queries. In Proc. of SIGMOD, 2002. [19] R. Kaushik, P. Bohannon, J. F. Naughton, and P. Shenoy. Updates for structure indexes. In Proc. of VLDB, 2002. [20] R. Kaushik, P. Shenoy, P. Bohannon, and E. Gudes. Exploiting local similarity for indexing paths in graph-structured data. In Proc. of ICDE, 2002. [21] C. Koch. Efficient processing of expressive node-selecting queries on xml data in secondary storage: A tree automata-based approach. In Proc. of VLDB, 2003. [22] Q. Li and B. Moon. Indexing and querying xml data for regular path expressions. In Proc. of VLDB, 2001. [23] H. Liefke. Hotizontal query optimization on ordered semistructured data. In Proc. of WEDDB, 1999. [24] T. Milo and D. Suciu. Index structures for path expressions. In Proc. of ICDT, 1999. [25] R. Paige and R. Tarjan. Three partition refinement algorithms. SIAM Journal on Computing, 16(6), Dec 1987. [26] P. R. Rao and B. Moon. Prix: Indexing and querying xml using prufer sequences. In Proc. of ICDE, 2004. [27] H. Wang, S. Park, W. Fan, and P. S. Yu. Vist: A dynamic index method for querying xml data by tree structures. In Proc. of ACM SIGMOD, 2003. [28] K. Yi, H. He, I. Stanoi, and J. Yang. Incremental maintainance of xml structural indexes. In Proc. of SIGMOD, 2004. [29] C. Zhang, J. Naughton, D. DeWitt, Q. Luo, and G. Lohman. On supporting containment queries in relational database management systems. In Proc. of SIGMOD, 2001.

CONCLUSION

In this paper, we proposed techniques for evaluating twig queries over graph-structured data. We motivated our work both by the graph structure of XML documents, and by the existence of index graphs, namely structural summaries, which are graph structures too. We identified an important class of graphs that emerges in many real world situations, namely DAGs, for which we further tailored our approaches. Our preliminary evaluation shows the technique to be a viable solution to the aforementioned problem. We plan to further evaluate our techniques with a variety of documents as well as structural summaries. Moreover, the investigation of the properties of the graph structures that emerge in related applications is an interesting path for future research.

6.

REFERENCES

[1] The xml benchmark project. In Available from http://www.xml-benchmark.org. [2] S. Al-Khalifa, H. Jagadish, N. Koudas, J. M. Patel, D. Srivastava, and Y. Wu. Structural joins: A primitive for efficient xml query pattern matching. In Proc. of ICDE, 2002. [3] S. Amer-Yahia, S. Cho, L. V. S. Lakshmanan, and D. Srivastava. Minimization of tree pattern queries. In Proc. of SIGMOD, 2001. [4] A. Berglund, S. Boag, D. Chamberlin, M. Fernandez, M. Key, J. Robie, and J. Simeon. Xml path language (xpath) 2.0. Nov 2003. [5] S. Boag, D. Chamberlin, M. F. Fernandez, D. Florescu, J. Robie, and J. Simeon. Xquery 1.0: An xml query language. In W3C Working Draft. Available from http://www.w3.org/TR/xquery, Nov 2003. [6] N. Bruno, N. Koudas, and D. Srivastava. Holistic twig joins: Optimal xml pattern matching. In Proc. of ACM SIGMOD, 2002. [7] Q. Chen, A. Lim, and K. W. Ong. D(k)-index: An adaptive structural summary for graph-structured data. In Proc. of SIGMOD, 2003. [8] S.-Y. Chien, Z. Vagena, D. Zhang, V. J. Tsotras, and C. Zaniolo. Efficient structural joins on indexed xml documents. In Proc. of VLDB, 2002. [9] C.-W. Chung, J.-K. Min, and K. Shim. Apex: An adaptive path index for xml data. In Proc. of SIGMOD, 2002.

48

Unraveling the Duplicate-Elimination Problem in XML-to-SQL Query Translation Rajasekar Krishnamurthy

Raghav Kaushik

Jeffrey F Naughton

University of Wisconsin

Microsoft Corporation

University of Wisconsin

[email protected]

[email protected]

[email protected]

1

ABSTRACT We consider the scenario where existing relational data is exported as XML. In this context, we look at the problem of translating XML queries into SQL. XML query languages have two different notions of duplicates: node-identity based and value-based. Path expression queries have an implicit node-identity based duplicate elimination built into them. On the other hand, SQL only supports value-based duplicate elimination. In this paper, using a simple path expression query we illustrate the problems that arise when we attempt to simulate the node-identity based duplicate elimination using value-based duplicate elimination in the SQL queries. We show how a general solution for this problem covering the class of views considered in published literature requires a fairly complex mechanism.

1.

2

3 title Book.title

(ii) : TopSection.id = NestedSection.topsectionid

* book Book

(i) 5 sectionTopSection

* *

* 6

7

title

(ii) NestedSection

section

TopSection.title 8

title

NestedSection.title

Figure 1: XML view T1

THE DUPLICATE ELIMINATION PROBLEM

Suppose we want to retrieve the titles of all sections. One possible XML query is Q = //section//title. XML-to-SQL query translation algorithms proposed in literature, such as [6, 15], work as follows. Logically, the first step is to identify schema nodes that match each step of the query. In this example, there are three matching evaluations for Q, namely S = {, , }. The second step is to generate an SQL query for each matching evaluation in S. The final query is the union of queries generated for all matching evaluations. For Q, the SQL query SQ1 obtained in this fashion is given below.

Using a simple example scenario, we first explain why we need duplicate elimination in XML-to-SQL query translation. Consider the following relational schema for a collection of books. • • • •

4 author Author

(i) : Book.id = TopSection.bookid

books

Book (id, title, price, . . . ) Author (name, bookid, . . . ) TopSection (id, bookid, title, . . . ) NestedSection (id, topsectionid, title, . . . )

select TS.title from Book B, TopSection TS where B.id = TS.bookid union all select NS.title from Book B, TopSection TS, NestedSection NS where B.id = TS.bookid and TS.id = NS.topsectionid union all select NS.title from Book B, TopSection TS, NestedSection NS where B.id = TS.bookid and TS.id = NS.topsectionid

The Book relation has basic information about books and the Author relation has information about authors of each book. Each book has sections and subsections, and the corresponding information is in the TopSection and NestedSection relations respectively. Consider the XML view T1 defined over this relational schema shown in Figure 1. We represent the XML view using simple annotations on the nodes and edges of the XML schema. For example, each book tuple creates a new book element. The title of the book is represented as a title subelement. Each section is represented as a subelement and this is captured by the join condition on the edge . The rest of the view definition can be understood in a similar fashion.

In the above query, we see that there are three entries in S, and, as a result, SQ1 is the union of three queries. On the other hand, looking at the view definition, we notice that there are only two paths that match the query ending at the two schema nodes 6 and 8. The path appears twice in S, once as and again as . This occurs because the section step in the query matches both the section elements in the schema. The following title step matches the title element (node 8) for each of these evaluations, due to the // axis in the query. As a result, the second and third (sub)queries in SQ1 are identical and generate duplicate results.

Copyright is held by the author/owner. Seventh International Workshop on the Web and Databases (WebDB 2004), June 17-18, 2004, Paris, France .

49

According to XPath semantics, the result of a query should not have any duplicates. Here, duplicate-elimination is defined in terms of node-identity. As a result, we need to add a distinct clause to SQ1 to achieve the same effect. We refer to this as the Duplicate-Elimination problem. The fact that we need to simulate node-identity based duplicate elimination using the value-based distinct clause in SQL creates several problems and providing a complete solution to this problem is the focus of this paper. Notice that this extra duplicate-elimination step is required to make sure that existing algorithms work correctly for this particular example. Let us start with the simplest approach to eliminate duplicates from SQ1 . By adding an outer distinct(title) clause, we can eliminate duplicate titles. The corresponding SQL query SQ11 is given below.

1

2

6 title Book.title

7 author Author 12

title

books

cheapbooks (i)

4

(i) : Book.price P2

3 costlybooks (ii)

*

5

book Book

*

* 8 sectionTopSection

NestedSection

13

section

9 title Book.title

10 author Author

* 11 sectionTopSection

NestedSection

15

section

title

TopSection.title

TopSection.title 16

17

title

NestedSection.title

with Temp(title) as ( select TS.title from Book B, TopSection TS where B.id = TS.bookid union all select NS.title from Book B, TopSection TS, NestedSection NS where B.id = TS.bookid and TS.id = NS.topsectionid union all select NS.title from Book B, TopSection TS, NestedSection NS where B.id = TS.bookid and TS.id = NS.topsectionid ) select distinct(title) from Temp

*

14

*

book Book

title

NestedSection.title

Figure 2: XML view T2 In order to address this issue, we need to create a key across the two relations. A straightforward way to do this is to combine the name of the relation (or some other identifier) along with the key column(s). This results in the following query SQ31 . with Temp(relname, id,title) as ( select ‘‘TS’’, TS.id, TS.title from Book B, TopSection TS where B.id = TS.bookid union all select ‘‘NS’’, NS.id, NS.title from Book B, TopSection TS, NestedSection NS where B.id = TS.bookid and TS.id = NS.topsectionid union all select ‘‘NS’’, NS.id, NS.title from Book B, TopSection TS, NestedSection NS where B.id = TS.bookid and TS.id = NS.topsectionid ) select distinct(relname, id, title) from Temp

Notice that the above query does value-based duplication and may eliminate more values than required. For example, duplicates in the XML view that arise in the following ways must be retained in the query result. 1. Two top-level sections have the same title 2. Two nested sections have the same title 3. A top-level section and a nested section have the same title Since SQ11 applies a distinct clause on title, it eliminates duplicates that arise in the above three contexts. As a result, SQ11 is not a correct query. Let us next see if using the key column(s) in the relational schema help us solve this problem. Recall that the id column is the key for both the TopSection and NestedSection relations. So, by projecting this key column and applying the following distinct clause: “distinct(id,title)”, we get the following query SQ21 .

Notice how the above approach creates a global key across the entire relational schema. This duplicate-elimination technique is correct for this query, and in fact, it is correct for any query over a class of views that we call non-redundant (see Table 1 in Section 4). Unfortunately this solution is not general enough and is incorrect when parts of the relational data may appear multiple times in the XML view. In the rest of the paper, we identify the techniques required for different class of views ending with a generic solution that is applicable over all views.

with Temp(id,title) as ( select TS.id, TS.title from Book B, TopSection TS where B.id = TS.bookid union all select NS.id, NS.title from Book B, TopSection TS, NestedSection NS where B.id = TS.bookid and TS.id = NS.topsectionid union all select NS.id, NS.title from Book B, TopSection TS, NestedSection NS where B.id = TS.bookid and TS.id = NS.topsectionid ) select distinct(id, title) from Temp

2. REDUNDANT XML VIEWS In this section, we look at some of the simplifying assumptions we implicitly made while generating the correct duplicate-elimination clause for the query Q over view T1 . First, using a slightly modified XML view over the same underlying relational schema we show how the correct solution gets more complex than before. Then, we look at the scenario when the join conditions are not key-foreign key joins.

2.1 A Hierarchical XML view example The XML view T2 , in Figure 2, has created a simple hierarchy, partitioning the books into cheap and costly books by the relationship of their prices to two constants P1 and P2 . Let us look at how Q1 will be translated in this case by some of the existing algorithms [6, 15]. The equivalent SQL query is the union of six queries, three for cheapbooks and

While the above query retains the duplicate values corresponding to 1 and 2 above, it does not retain duplicates when a top-level section and a nested section have the same title and the same id as well. This can occur because keys in relational databases are unique only in the context of the corresponding relation.

50

three for costlybooks. Again, the nested section titles occur twice and need to be eliminated through a duplicate elimination operation. At the end of the previous section, we saw how a distinct clause over the three fields: relname, id and title may suffice (see query SQ31 in Section 1). Applying the same idea here, we obtain the following query, SQ12 .

union all select 16, NS.id, NS.title from Book B, TopSection TS, NestedSection NS where B.id = TS.bookid and TS.id = NS.topsectionid and B.price P2 union all select 17, NS.id, NS.title from Book B, TopSection TS, NestedSection NS where B.id = TS.bookid and TS.id = NS.topsectionid and B.price > P2 union all select 17, NS.id, NS.title from Book B, TopSection TS, NestedSection NS where B.id = TS.bookid and TS.id = NS.topsectionid and B.price > P2

with Temp(relname, id,title) as ( select ‘‘TS’’, TS.id, TS.title from Book B, TopSection TS where B.id = TS.bookid and B.price P2 ) select distinct(relname, id, title) from Temp

) select distinct(nodeid, id, title) from Temp

The above query will be a correct translation for all the three cases: P1 = P2 , P1 < P2 and P1 > P2 . The above solution is correct for any query on a class of views that we call well-formed (see Table 1 in Section 4). Notice how simple syntactic restrictions on the view definition language will not allow us to differentiate between views T1 and T2 . As a result, unless we know that T1 is a non-redundant view, when we translate Q over T1 , we have to use the schema node number to perform duplicate elimination. In Section 4, we present a way for identifying when XML views are non-redundant.

2.2 Beyond key-foreign key joins For the two example views, T1 and T2 , we saw how a correct translation for query Q1 results in queries SQ31 and SQ22 respectively. The join conditions present in both these view definitions are key-foreign key joins. Under these circumstances, the duplicate elimination technique used is correct. An interesting point to note is that the class of views considered in literature, such as [1, 5, 6, 15], allow the join conditions to be between any two columns (in particular, non key-foreign key joins). Also, excepting [5], the XML-to-SQL query translation algorithms in literature do not know (or use) information about the integrity constraints that hold on the underlying relational schema. In this section, we look at what needs to be done to perform duplicate-elimination correctly when the join conditions are allowed to be over any two columns. Suppose id is not a key for the Book, TopSection and NestedSection relations. Then, the join between Book.id and TopSection.bookid in the view definition T2 is not a keyforeign key join. Similarly, the join between TopSection.id and NestedSection.topsectionid is also not a key-foreign key join. What happens in this case is that some parts of the relational data may appear in the XML view multiple times. For example, part of an instance of the relational data is shown in Figure 3. Suppose the XML view T2 was defined with P1 = 65 and P2 = 65. For the above data instance, the corresponding view will have three book elements. Since two of the books have the same value for the id column, the sections of each of these two books will be repeated under both of them. For example, the Introduction and Motivation sections will appear twice in the XML view, once for each of the two books with id 1. But, both occurrences of these section titles correspond to the same schema node (node 12 in Figure 2). As a result, our earlier technique using schema node numbers does not work. Note how the XML view has redundant data

Let us now consider three possible scenarios: P1 = P2 , P1 < P2 and P1 > P2 . If P1 = P2 , then the XML view has information about all the books exactly once, while if P1 < P2 the XML view has information about only certain books. On the other hand, when P1 > P2 , the XML view has information about the books in the price range {P2 . . . P1 } twice. For the two cases, P1 = P2 and P1 < P2 , each book appears at most once in the XML view. As a result, the SQL query SQ12 eliminates duplicates correctly. But when P1 > P2 , books in the price range {P2 . . . P1 } appear twice in the XML view. So, the corresponding section titles must appear twice in the query result. But, since SQ12 applies a distinct clause using the triplet “relname, id, title”, it will retain only one copy of each section title and the query result is incorrect. The main problem in this example scenario is that some parts of the relational data appear multiple times in the XML view. For the XML view T2 , notice that multiple occurrences of the same section title are associated with different schema nodes. One way to obtain the correct result in this case is to keep track of the schema node corresponding to each result tuple. The distinct clause in this case will include the schema node instead of the relation name. The corresponding SQL query SQ22 is shown below. with Temp(nodeid, id,title) as ( select 12, TS.id, TS.title from Book B, TopSection TS where B.id = TS.bookid and B.price ǫ}

ǫ ∈ {0, 1}

The database M supports the boolean query processing model (i.e. a tuple either satisfies or does not satisfy a query). To access tuples of R one must issue structured queries over R. The answers to Q must be determined without altering the data model of M . Moreover the solution should require minimal domain-specific information. 2 Challenges: Supporting imprecise queries over autonomous Web enabled databases requires us to solve the following problems: • Model of similarity: Supporting imprecise queries necessitates the extension of the query processing model from binary (where tuples either satisfy the query or not) to a matter of the degree (to which a given tuple is a satisfactory answer).

INTRODUCTION

The rapid expansion of the World Wide Web has made a variety of databases like bibliographies, scientific databases, travel reservation systems, vendor databases etc. accessible to a large number of lay external users. The increased visibility of these systems has brought about a drastic change in their average user profile from tech-savvy, highly trained professionals to lay users demanding “instant gratification”. Often such users may not know how to precisely express their needs and may formulate queries that lead to unsatisfactory results. Although users may not know how to phrase their queries, they can often tell which tuples are of interest when presented with a mixed set of results with varying degrees of relevance to the query. Example: Suppose a user is searching a car database for “sedans”. Since most sedans have unique model names, the user must convert the general query sedan to queries binding the attribute Model. To find all sedans in the database, the user must iteratively change her search criteria and submit queries to the database. Usually the user might know only a couple of models and would end up just searching for them. Thus limited domain knowledge combined with the

• Estimating Semantic similarity: Expecting ‘lay’ users of the system to provide similarity metrics for estimating the similarity among values binding a given attribute is unrealistic. Hence an important but difficult issue we face is developing domain-independent similarity functions that closely approximate “user beliefs”. • Attribute ordering: To provide ranked answers to a query, we must combine similarities shown over distinct attributes of the relation into a overall similarity score for each tuple. While this measure may vary from user to user, most users usually are unable to correctly quantify the importance they ascribe to an attribute. Hence another challenging issue we face is to automatically (with minimal user input) determine the importance ascribed to an attribute.

1.1


where

Overview of our approach

In response to these challenges, we propose a query processing framework that integrates techniques from IR and database literature to efficiently determine answers for imprecise queries over autonomous databases. Below we begin by describing the query representation model we use and explain how we map from imprecise to precise queries. Precise Query: A user query that requires data exactly matching the query constraint is a precise query. For exam-

73

Map: Convert “like” to “=” Q pr = Map(Q)

Derive Base Set Abs A bs = Q

(R)

Use Base Set as set of relaxable selection queries

Use Concept similarity to measure tuple similarities

Using AFDs find relaxation order

Prune tuples below threshold

pr

Return

Derive Extended Set by executing relaxed queries

Ranked Set

Figure 1: FlowGraph of our Approach

answers to Q, Abs . Suppose Abs contains the tuples

ple, the query Q : −CarDB(M ake = “F ord”)

M ake = “T oyota”, M odel = “Camry”, P rice = “10k”, Y ear = “2000”

is a precise query, all of whose answer tuples must have attribute ‘Make’ bound by the value ‘Ford’. Imprecise Query: A user query that requires a close but not necessarily exact match is an imprecise query. Answers to such a query must be ranked according to their closeness/similarity to the query constraints. For example, the query

M ake = “T oyota”, M odel = “Camry”, P rice = “10k”, Y ear = “2001”

Q : −CarDB(M ake

like

“F ord”)

is an imprecise query, the answers to which must have the attribute ‘Make’ bound by a value similar to ‘Ford’.

The tuples in Abs exactly satisfy the base query Qpr . But the user is also interested in tuples that may be similar to the constraints in Q. Assuming we knew that “Honda Accord” and “Toyota Camry” are similar cars, then we could also show tuples containing “Accord” to the user if these tuples had values of Price or Year similar to tuples in Abs . Thus, M ake = “Honda”, M odel = “Accord”, P rice = “9.8k”, Y ear = “2000”

could be seen as being similar to the first tuple in Abs and therefore a possible answer to Q. We could also show other Camrys whose Price and Year values are slightly different to Algorithm 1 Finding Relevant Answers those of the tuples in Abs . Specifically, all tuples of CarDB Require: Imprecise Query Q, Relation R, Threshold Tsim , that have one or more binding values close to some tuple in Concept Similarities, Approximate Functional DependenAbs can be considered as potential answers to query Q. cies (AFDs) By extracting tuples having similarity above a predefined begin threshold, Tsim , to the tuples in the Abs , we can get a larger Let RelaxOrder = FindAttributeOrder(R, AFDs). subset of potential answers called extended set,Aes . But to Let Qpr = Map(Q) such that Abs = Qpr (R) & |Abs | > 0. extract additional tuples we would require new queries. We for k=1 to |Abs | do can identify new queries by considering each tuple in the Qrel = CreateQueries(Abs [k], RelaxOrder) base set, Abs , as a relaxable selection query. However ranfor j=1 to |Qrel | do domly picking attributes of tuples to relax could generate Arel = Qrel [j](R). a large number of tuples of possibly low relevance. In thefor n=1 to |Arel | do ory, the tuples closest to a tuple in the base set will have simval = MeasureSimilarity(Arel [n],Abs [k]). differences in the attribute that least affect the binding valif simval ≥ Tsim then ues of other attributes. Approximate functional dependenAes = Aes + Arel [n]. cies(AFDs) [11] capture relationships between attributes of end if a relation and can be used to determine the degree to which end for a change in binding value of an attribute affects other atend for tributes. Therefore we mine approximate dependencies beend for tween attributes of the relation and use them to determine Return Top-K(Aes ). a heuristic to guide the relaxation process. Details about end our approach of using AFDs to create a relaxation order are in Section 2. Relaxation involves extracting tuples by identifying and executing new relaxed queries obtained by Finding Relevant Answers: Figure 1 shows a flow graph reducing the constraints on an existing query. of our approach for answering an imprecise query. AlgoIdentifying possibly relevant answers only solves part of rithm 1 gives further details of our approach. Specifically, the problem since we must now rank the tuples in terms given an imprecise query Q, of the similarity they show to the seed tuples. We assume Q : −CarDB(M odel like “Camry”, P rice like “10k”) that a similarity threshold Tsim is available and only the answers that are above this threshold are to be provided to over the relation CarDB(Make, Model, Price, Year), we bethe user. This threshold may be user given or decided by 1 gin by converting Q to a precise base query (which is equivthe system. The tuple similarity is estimated as a weighted alent to replacing all “like” predicates with equality predisum of similarities over distinct attributes in the relation. cates) with non-null result set over the database, Qpr . Thus That is, the base query Qpr is n X Qpr : −CarDB(M odel = “Camry”, P rice = “10k”) Sim(t1 , t2 ) = Sim(t1 (Ai ), t2 (Ai )) × Wi

The tuples of CarDB satisfying Qpr also satisfy the imprecise query Q. Thus answers to Qpr form the base set of 1 In this paper, we assume that the resultset of the base query is non-null. If the resultset of the base query is null, then by iteratively relaxing the base query we may obtain a query that has a non-null resultset. The attributes to be relaxed can be arbitrarily chosen.

i=1

P where |attributes(R)| = n and n i=1 Wi = 1. In this paper we assume that attributes have either discrete numerical or categorical binding values. We assume that the Euclidean distance metric captures the semantic similarity between numeric values. But no universal measure is available for measuring the semantic distances between values binding a

74

categorical attribute. Hence in the Section 3 we present a solution to automatically learn the semantic similarity among values binding a categorical attribute. While estimating the similarity between values is definitely an important problem, an equally important issue is that of assigning weights to the similarity shown by tuples over different attributes. Users can be expected to assign weights to be used for similarity shown over a particular attribute. However, in [19, 18], our studies found users are not always able to map the amount of importance they ascribe to an attribute into a good numeric weight. Hence after determining the attribute order for query relaxation, we will automatically assign importance weights to attributes based on their order i.e. attribute to be relaxed first is least important and so gets lowest weight.

1.2

Organization

The rest of the paper is organized as follows. Section 2 explains how we use approximate functional dependencies between attributes to guide the query relaxation process. Section 3 describes our domain-independent approach for estimating the semantic similarity among concepts binding a categorical attribute. In Section 4, we provide preliminary results showing the accuracy of our concept learning approach and the effectiveness of the query relaxation technique we develop. In Section 5, we compare our work with research done in the areas of similarity search, cooperative query answering and keyword search over databases, all of which focus on providing answers to queries with relaxed constraints. Finally, in Section 6, we summarize our contributions and list the future directions for expansion of this work .

2.

QUERY RELAXATION USING AFDS

Our proposed solution for answering an imprecise query requires us to generate new selection queries by relaxing the constraints of tuples in the initial set tÎS . The underlying motivation there is to identify tuples that are closest to some tuple t ∈ tÎS . Randomly relaxing constraints and executing queries will produce tuples in arbitrary order of similarity thereby increasing the cost of answering the query. In theory the tuples most similar to t will have differences only in the least important attribute. Therefore the first attribute to be relaxed must be the least important attribute. We define least important attribute as the attribute whose binding value, when changed, has minimal effect on values binding other attributes. Approximate Functional Dependencies (AFDs) [11] efficiently capture such relations between attributes. In the following we will explain how we use AFDs to identify the importance of an attribute and thereby guide the query relaxation process.

2.1

Definitions

Functional Dependency: For a relational schema R, an expression of the from X → A where X ⊆ R and A ∈ R is a functional dependency over R. The dependency is said to hold in a given relation r over R if for all pairs of tuples t, u ∈ r we have t[B] = u[B] ⇒ t[A] = u[A]

where

A, B ∈ X

Several algorithms [13, 12, 7, 17] have proposed various measures to approximate functional dependencies that hold in a database. Among them, the g3 measure proposed by Kivinen and Mannila [12], has been widely accepted. The g3 measure is defined as the minimum number of tuples that need be removed from relation r so that X → Y is an FD divided by the number of tuples in r. Huhtala et al [11] have developed an algorithm, TANE, for efficiently discovering all AFDs in a database whose g3 approximation measure is below a user specified threshold. We use TANE

to extract the AFDs and approximate keys. We mine the AFDs and keys using the a subset of the database extracted by probing. Approximate Functional Dependency: The functional dependency X → A is an approximate functional dependency if it does not hold over a small fraction of the tuples. Specifically, X → A is an approximate functional dependency if and only if error(X → A) is at most equal to an error threshold ǫ (0 < ǫ < 1), where the error is measured as the fraction of tuples that violate the dependency. Approximate Key: An attribute set X is a key if no two distinct tuples agree on X. Let error(X) be the minimum fraction of tuples that need to be removed from relation r for X to be a key. If error(X) is ≤ ǫ then X is an approximate key. Some of the AFDs determined in the used car database CarDB are: error(Model → Make) < 0.1, error(Model → Price) < 0.6, error(Make, Price → Model) < .7 and error(Make, Year → Model)< 0.7.

2.2

Generating the relaxation order

Algorithm 2 Query Relaxation Order Require: Relation R, Tuples(R) begin for ǫ = 0.1 to 0.9 do SÂF Ds = ExtractAFDs(R, ǫ). SÂKeys = ExtractKeys(R, ǫ). end for Kbest = BestSupport(SÂKeys ). NKey = Attributes(R)-Kbest . for k=1 to |Kbest | do W tKey [k]=[Kbest [k],Decides(NKey,Kbest [k], SÂF Ds )]. end for for j=1 to |N Key| do W tN Key [k]=[NKey[k],Depends(Kbest ,NKey[k], SÂF Ds )]. end for Return [Sort(W tKey ), Sort(W tN Key )]. end

The query relaxation technique we use is described in Algorithm 2. Given a database relation R and a dataset containing tuples of R, we begin by extracting all possible AFDs and approximate keys by varying the error threshold. Next we identify the approximate key (Kbest ) with the least error (or highest support). If a key has high support then it implies that fewer tuples will have the same binding values for the subset of attributes in the key. Thus the key can be seen as almost uniquely identifying a tuple. Therefore we can assume that two tuples are similar if the values binding the key are similar. All attributes of relation R not found in Kbest are approximately dependent on Kbest . Hence by relaxing the non-key attributes first we can create queries whose answers do not satisfy the dependency but have the same key. We now face the issue of which key (non-key) attribute to relax first. We make use of the AFDs to decide the relaxation order within the two subsets of attributes. For each attribute belonging to the key we determine a weight depending on how strongly it influences the non-key attributes. The influence weight for an attribute is computed as W eight(Ai ) =

n X 1 − error(Aˆ → Aj ) ˆ |A| j=1

where

Ai ∈ Aˆ ⊆ R,

j 6= i

&

n = |Attributes(R)|

If an attribute highly influences other non-key attributes then it should be relaxed last. By sorting the key attributes

75

in ascending order of their influence weights we can ensure that the least influencing attribute is relaxed first. On similar lines we would like to relax the least dependent non-key attribute first and hence we sort the non-key attributes in descending order of their dependence on the key attributes. The relaxation order we produce using Algorithm 2 only provides the order for relaxing a single attribute of the query. Basically we use a greedy approach towards relaxation and try to create all 1-attribute relaxed queries first, then the 2-attribute relaxed queries and so on. The multi-attribute query relaxation order is generated by assuming independence among attributes and combining the attributes in terms of their single attribute order. E.g., if {a1, a3, a4, a2} is the 1-attribute relaxation order, then the 2-attribute order will be {a1a3, a1a4, a1a2, a3a4, a3a2, a4a2}. The 3-attribute order will be a cartesian product of 1 and 2-attribute orders and so on. Attribute sets appearing earlier in the order are relaxed first. Given the relaxation order and a query Q, we formulate new queries from Q by removing the constraints (if any) on the attributes as given in the order. The number of attributes to be relaxed in each query will depend on order (1-attribute, 2-attribute etc). To ease the query generation process, we assume that the databases do not impose any binding restrictions. For our example database CarDB, the 1-attribute relaxation order was determined as {Make, Price, Year, Model}. Consequently the 2-attribute relaxation order becomes {(Make, Price), (Make, Year), (Make, Model), (Price, Year), (Price, Model), (Year, Model)}.

3.

LEARNING CONCEPT SIMILARITIES

Below we provide an approach to solve the problem of estimating the semantic similarity between values binding a categorical attribute. We determine the similarity between two values as the similarity shown by the values correlated to them. Concept: We define a concept over the database as any distinct attribute value pair. E.g. Make=“Ford” is a concept over database CarDB(Make,Model,Price,Year). Concept Similarity: Two concepts are correlated if they occur in the same tuple. We estimate the semantic similarity between two concepts as the percentage of correlated concepts which are common to both the concepts. More specifically, given a concept, the concepts correlated to a concept can be seen as the features describing the concept. Consequently, the similarity between two concepts is the similarity among the features describing the concepts. For example, suppose the database CarDB contains a tuple t ={Ford, Focus, 15k, 2002}. Given t, the concept Make=“Ford” is correlated to the concepts Model=“Focus”, Price=“15k” and Year=“2002”. The distinct values binding attributes Model, Price and Year can be seen as features describing the concepts over Make. Similarly Make, Price and Year for Model and so on. Let Make=“Ford” and Make=“Toyota” be two concepts over attribute Make. Suppose most tuples containing the two concepts in the database CarDB have same Price and Year values. Then we can safely assume that Make=“Ford” and Make=“Toyota” are similar over the features Price and Year.

3.1

The number of concepts identified is proportional to the size of the database extracted by sampling. However we

Focus:5, ZX2:7, F150:8 ... 10k-15k:3, 20k-25k:5, .. 1k-5k:5, 15k-20k:3, .. White:5, Black:5, ... 2000:6, 1999:5, ....

Table 1: Supertuple for Concept Make=‘Ford’ A concept can be visualized as a selection query called concept query that binds only a single attribute. By issuing the concept query over the extracted database we can identify a set of tuples all containing the concept. We represent the answerset containing each concept as a structure called the supertuple. The supertuple contains a bag of keywords for each attribute in the relation not bound by the concept. Table 1 shows the supertuple for the concept Make=‘Ford’ over the relation CarDB as a 2-column tabular structure. To represent a bag of keywords we extend the semantics of a set of keywords by associating an occurrence count for each member of the set. Thus for attribute Color in Table 1, we see ‘White’ with an occurrence count of five, suggesting that there are five White colored Ford cars in the database that satisfy the concept-query.

3.2

Measuring Concept Similarities

The similarity between two concepts is measured as the similarity shown by their supertuples. The supertuples contain bags of keywords for each attribute in the relation. Hence we use Jaccard Coefficient [10, 2] with bag semantics to determine the similarity between two supertuples. The Jaccard Coefficient (SimJ ) is calculated as SimJ (A, B) =

|A ∩ B| |A ∪ B|

We developed the following two similarity measures based on Jaccard Coefficient to estimate the similarity between concepts: Doc-Doc similarity: In this method, we consider each supertuple STQ , as a document. A single bag representing all the words in the supertuple is generated. The similarity between two concepts C1 and C2 is then determined as Simdoc−doc (C1 , C2 ) = SimJ (STC1 , STC2 ) Weighted-Attribute similarity: Unlike pure text documents, supertuples would rarely share keywords across attributes. Moreover all attributes may not be equally important for deciding the similarity among concepts. For example, given two cars, their prices may have more importance than their color in deciding the similarity between them. Hence, given the answersets for a concept, we generate bags for each attribute in the corresponding supertuple. The similarity between concepts is then computed as a weighted sum of the attribute bag similarities. Calculating the similarity in this manner allows us to vary the importance ascribed to different attributes. The supertuple similarity will then be calculated as m X SimJ (BagC1 (Ai ), BagC2 (Ai )) × Wi Simwatr (C1 , C2 ) = i=1

Semantics of a Concept

Databases on the web are autonomous and cannot be assumed to provide any meta-data such as possible distinct values binding an attribute. Hence we must extract this information by probing the database using sample queries. We begin by extracting a small subset of the database by sampling the database. From the extracted subset we can then identify a subset of concepts2 over the relation. 2

Model Mileage Price Color Year

where C1, C2 have m attributes

4.

EVALUATION

To evaluate the effectiveness of our approach in answering imprecise queries, we set up a prototype used car search

can incrementally add new concepts as and when they are encountered and learn similarities for them. But in this paper we do not focus on the issue of incremental updating of the concept space.

76

Time 181 sec 1.5 hours

Size 11 MB 6.0 MB

180

Th 0.7 160

Th 0.6 140

Work/Relevant Tuple

Algorithm Step SuperTuple Generation Similarity Estimation

Table 2: Timing Results and Space Usage database system that accepts precise queries over the relation

100 80 60 40

CarDB(M ake, M odel, Y ear, P rice, M ileage, Location, Color) The database was setup using the open-source relational database MySQL. We populated the relation CarDB using 30, 000 tuples extracted from the publicly accessible used car database Yahoo Autos [20]. The system was hosted on a Linux server running on Intel Celeron- 2.2 Ghz with 512Mb RAM.

Th 0.5

120

20 0 1

2

3

4

5

6

7

8

9

10

Queries

Figure 3: Work/RelevantTuple using GuidedRelax 900

4.1 Concept Similarity Estimation

Th 0.7 800

Th 0.6

Th 0.5

Work/Relevant Tuple

700

Dodge Nissan

0.15 0.11 Honda

BMW

0.12

600 500 400 300 200 100

0.22

0.25 0.16

0

Ford

1

2

3

4

5

6

7

8

9

10

Queries

Chevrolet

Toyota

Figure 4: Work/RelevantTuple using RandomRelax Figure 2: Concept Similarity Graph for Make The attributes Make, Model, Location, Color in the relation CarDB are categorical in nature and contained 132, 1181, 325 and 100 distinct values. We estimated concept similarities for these attributes as described in Section 3. Time to estimate the concept similarity is high (see Table 2), as we must compare each concept with every other concept binding the same attribute. We calculated only the doc-doc similarity between each pair of concepts. The concept similarity estimation is a preprocessing step and can be done offline and hence the high processing time requirement for this process can be ignored. Figure 2 provides a graphical representation of the estimated semantic similarity between some of the values binding attribute Make. The concepts Make=“Ford” and Make=“Chevrolet” show high similarity and so do concepts Make=“Toyota” and Make=“Honda” while the concept Make=“BMW” is not connected to any other node in the graph. We found these results to be intuitively reasonable and feel our approach is able to efficiently determine the semantic distances between concepts. Moreover in [19, 18] we used a similar approach to determine the semantic similarity between queries in a query log. The estimated similarities were validated by doing a user study and our approach was found to have above 75% accuracy.

4.2 Efficient query relaxation To verify the efficiency of the query relaxation technique we propose in Section 2, we setup a test scenario using the CarDB database and a set of 10 randomly picked tuples. For each of these tuples our aim was to extract 20 tuples from CarDB that had similarity above some threshold Tsim (0.5 ≤ Tsim < 1). We designed two query relaxation algorithms GuidedRelax and RandomRelax for creating selection queries by relaxing the tuples in the initial set. GuidedRelax makes use of the AFDs and approximate keys and decides a relaxation scheme as described in Algorithm 2. The RandomRelax algorithm was designed to somewhat mimic the

random process by which users would relax queries. The algorithm randomly identifies a set of attributes to relax and creates queries. We put an upper limit of 64 on the number of queries that could be issued by either algorithm for extracting the 20 similar answers to a tuple from the initial set. To measure the efficiency of the algorithms we use a metric called Work/RelevantTuple defined as W ork/RelevantT uple =

|TExtracted | |TRelevant |

where TExtracted gives the total tuples extracted while TRelevant is the number of extracted tuples that were found as relevant. Specifically Work/RelevantTuple is a measure of the average number of tuples that an user would have to look at before finding a relevant tuple. Tuples that showed similarity above the threshold Tsim were considered relevant. Similarity between two tuples was estimated as the weighted sum of semantic similarities shown by each attribute of the tuple. Equal weightage was given to the similarity shown by all attributes. The graphs in figures Figure 3 and Figure 4 show the average number of tuples that had to be extracted by GuidedRelax and RandomRelax respectively to identify a relevant tuple for the query. Intuitively the larger the expected similarity, the more the work required to identify a relevant tuple. While both algorithms do follow this intuition, we note that for higher thresholds RandomRelax (see Figure 4) ends up extracting hundreds of tuples before finding a relevant tuple. GuidedRelax is much more resilient to the variations in threshold and generally needs to extract about 4 tuples to identify a relevant tuple. Thus by using GuidedRelax, a user would have to look at much less number of tuples before obtaining satisfactory answers. The initial results we obtained are quite encouraging. However for the current set of experiments we did not verify

77

whether the tuples considered relevant are truly relevant as measured by the user. We plan to conduct a user study to verify that our query relaxation approach not only saves time but also provides answers that are truly relevant according to the user. The evaluations we performed were aimed at studying the accuracy and efficiency of our concept similarity learning and query relaxation approaches in isolation. We are currently working on evaluating our approach for answering imprecise queries over BibFinder [1, 21], a fielded autonomous bibliography mediator that integrates several autonomous bibliography databases such as DBLP, ACM DL, CiteSeer. Studies over BibFinder will enable us to better evaluate and tune the query relaxation approach we use. We also plan to conduct user studies to measure how many of the answers we present are considered truly relevant by the user.

results showing the efficiency and accuracy of our concept similarity learning and query relaxation approaches. Both the concept similarity estimation and AFDs and keys extraction process we presented heavily depend on the size of the initial dataset extracted by probing. Moreover the size of the initial dataset also decides the number of concepts we may find for each attribute of the database. A future direction of this work is to estimate the effect of the probing technique and the size of the initial dataset on the quality of the AFDs and concept similarities we learn. Moreover the data present in the databases may change with time. We plan to investigate ways to incrementally update the similarity values between existing concepts and develop efficient methods to compute distances between existing and new concepts without having to recompute the entire concept graph. In this paper we only looked at answering imprecise selection queries over a single database relation. Answering imprecise queries spanning multiple re5. RELATED WORK lations forms an interesting extension to our work. Early approaches for retrieving answers to imprecise queries Acknowledgements: We thank Hasan Davulcu for helpful were based on theory of fuzzy sets. Fuzzy information sysdiscussions during the development of this work. This work tems [14] store attributes with imprecise values, like height= is supported by ECR A601, the ASU Prop301 grant to ET I 3 “tall” and color=“blue or red”, allowing their retrieval with initiative. fuzzy query languages. The WHIRL language [6] provides approximate answers by converting the attribute values in 7. REFERENCES the database to vectors of text and ranking them using the [1] BibFinder: A Computer Science Bibliography Mediator. Availvector space model. In [16], Motro extends a conventional able at :http://kilimanjaro.eas.asu.edu/. [2] R. Baeza-Yates and B. Ribiero-Neto. Modern Information Redatabase system by adding a similar-to operator that uses trieval. Addison Wesley Longman Publishing, 1999. distances metrics over attribute values to interpret vague [3] C. Buckley, G. Salton, and J. Allan. Automatic Retrieval with queries . The metrics required by the similar-to operator Locality Information Using Smart. TREC-1, National Institute must be provided by database designers. Binderberger [22] of Standards and Technology, Gaithersburg, MD, 1992. [4] W.W. Chu, Q. Chen, and R. Lee. Cooperative query answering investigates methods to extend database systems to supvia type abstraction hierarchy. Cooperative Knowledge Based port similarity search and query refinement over arbitrary Systems, pages 271–290, 1991. abstract data types. In [9], the authors propose to provide [5] W.W. Chu, Q. Chen, and R. Lee. A structured approach for ranked answers to queries over Web databases but require cooperative query answering. IEEE TKDE, 1992. [6] W. Cohen. Integration of heterogeneous databases without comusers to provide additional guidance in deciding the similarmon domains using queries based on textual similarity. Proc. of ity. These approaches however are not applicable to existing SIGMOD, pages 201–212, June 1998. databases as they require large amounts of domain specific [7] M. Dalkilic and E. Robertson. Information Dependencies. In information either pre-estimated or given by the user of the Proc. of PODS, 2000. [8] N.E. Efthimiadis. Query Expansion. In Annual Review of Inquery. Further [22] requires changing the data models and formation Systems and Technology, Vol. 31, pages 121–187, operators of the underlying database while [9] requires the 1996. database to be represented as a graph. [9] R. Goldman, N .Shivakumar, S. Venkatasubramanian, and In contrast to the above, the solution we propose proH. Garcia-Molina. Proximity search in databases. VLDB, 1998. [10] T. Haveliwala, A. Gionis, D. Klein, and P Indyk. Evaluatvides ranked results without re-organizing the underlying ing strategies for similarity search on the web. Proceedings of database and thus is easier to implement over any database. WWW, Hawai, USA, May 2002. In our approach we assume that tuples in the base set are all [11] Y. Huhtala, J. Krkkinen, P. Porkka, and H. Toivonen. Efficient relevant to the imprecise query and create new queries. The discovery of functional and approximate dependencies using partitions. Proceedings of ICDE, 1998. technique we use is similar to the pseudo-relevance feedback [12] J. Kivinen and H. Mannila. Approximate Dependency Inference [3, 8] technique used in IR system. Pseudo-relevance feedfrom Relations. Theoretical Computer Science, 1995. back ( also known as local feedback or blind feedback) in[13] T. Lee. An information-theoretic analysis of relational volves using top k retrieved documents to form a new query databases-part I: Data Dependencies and Information Metric. IEEE Transactions on Software Engineering SE-13, October to extract more relevant results. 1987. In [4, 5], authors explore methods to generate new queries [14] J.M. Morrissey. Imprecise information and uncertainty in inforrelated to the user’s original query by generalizing and remation systems. ACM Transactions on Information Systems, fining the user queries. The abstraction and refinement rely 8:159–180, April 1990. [15] A. Motro. Flex: A tolerant and cooperative user interface to on the database having explicit hierarchies of the relations database. IEEE TKDE, pages 231–245, 1990. and terms in the domain. In [15], Motro proposes allowing [16] A. Motro. Vague: A user interface to relational databases that the user to select directions of relaxation, thereby indicating permits vague queries. ACM Transactions on Office Informawhich answers may be of interest to the user. In contrast, tion Systems, 6(3):187–214, 1998. [17] K. Nambiar. Some analytic tools for the Design of Relational we automatically learn the similarity between concepts and Database Systems. In Proc. of 6th VLDB, 1980. use functional dependency based heuristics to decide the [18] U. Nambiar and S. Kambhampati. Providing ranked relevant direction for query relaxation. results for web database queries. To appear in WWW Posters

6.

CONCLUSION AND FUTURE WORK

In this paper we first motivated the need for supporting imprecise queries over databases. Then we presented a domain independent technique to learn concept similarities that can be used to decide semantic similarity of values binding categorical attributes. Further we identified approximate functional dependencies between attributes to guide the query relaxation phase. We presented preliminary

2004,, May 17-22, 2004. [19] U. Nambiar and S. Kambhampati. Answering imprecise database queries: A novel approach. ACM Workshop on Web Information and Data Management, November 2003. [20] Yahoo! autos. Available at http://autos.yahoo.com/ . [21] Z. Nie, S. Kambhampati, and T. Hernandez. BibFinder/StatMiner: Effectively Mining and Using Coverage and Overlap Statistics in Data Integration. In Proc. of VLDB, 2003. [22] M. Ortega-Binderberger. Integrating Similarity Based Retrieval and Query Refinement in Databases. PhD thesis, UIUC, 2003.

78

DTDs versus XML Schema: A Practical Study Geert Jan Bex

Frank Neven

Limburgs Universitair Centrum Diepenbeek, Belgium


[email protected]

[email protected]

ABSTRACT Among the various proposals answering the shortcomings of Document Type Definitions (DTDs), XML Schema is the most widely used. Although DTDs and XML Schema Defintions (XSDs) differ syntactically, they are still quite related on an abstract level. Indeed, freed from all syntactic sugar, XML Schemas can be seen as an extension of DTDs with a restricted form of specialization. In the present paper, we inspect a number of DTDs and XSDs harvested from the web and try to answer the following questions: (1) which of the extra features/expressiveness of XML Schema not allowed by DTDs are effectively used in practice; and, (2) how sophisticated are the structural properties (i.e. the nature of regular expressions) of the two formalisms. It turns out that at present real-world XSDs only sparingly use the new features introduced by XML Schema: on a structural level the vast majority of them can already be defined by DTDs. Further, we introduce a class of simple regular expressions and obtain that a surprisingly high fraction of the content models belong to this class. The latter result sheds light on the justification of simplifying assumptions that sometimes have to be made in XML research.

1.

INTRODUCTION

As Document Type Definitions where historically the first means to describe the structure of XML documents, a large number of them can be found on the Web. The growing success of XML, combined with certain shortcomings of DTDs, generated a large number of alternative proposals for the description of schemas, such as RELAX [12], TREX [6], Relax NG [7], DSD [11], and XML Schema [1, 9, 17]. Judging from the number of schemas one can find on the Web, XML Schema seems the most accepted one. The definition of XML Schema is nevertheless quite complicated and the necessity of various constructs is not always very clear. For this reason, we investigate a number of XSDs collected from the Web, and try to determine to what extent the features of XML Schema not occurring in DTDs are used in practice. In the second part of the paper we look at struc-

Jan Van den Bussche


[email protected]

tural properties of schemas. In particular, we show that the vast majority of content models occurring in practice belong to a well-defined class of simple regular expressions. To facilitate a comparison between the two formalisms, we first describe DTDs and XSDs on a structural level.

1.1 A structural view of DTDs and XSDs When dealing with the structure of XML documents only, it is common to view XML documents as finite ordered trees with node labels from some finite alphabet Σ. We refer to such trees as Σ-trees. definition 1. A DTD is a pair (d, s) where d is a function that maps Σ-symbols to regular expressions over Σ, and s ∈ Σ is the start symbol. A tree satisfies the DTD if its root is labeled by s and for every node u with label a, the sequence a1 · · · an of labels of its children matches the regular expression d(a). The class of tree languages definable by DTDs is usually referred to as the local tree languages [4, 13]. A simple example of a DTD defining the inventory of a store is the following: store dvd

→ →

dvd dvd∗ title price

For clarity, in examples we write a → r rather than d(a) = r. We next recall the definition of a specialized DTD [15]. definition 2. A specialized DTD (SDTD) is a 4-tuple (Σ, Σ , δ, µ), where Σ is an alphabet of types, δ is a DTD over Σ and µ is a mapping from Σ to Σ. Note that µ can be applied to a Σ -tree as a relabeling of the nodes, thus yielding a Σ-tree. A Σ-tree t then satisfies the SDTD if t can be written as µ(t ) where t satisfies the DTD δ. As SDTDs are equivalent to unranked tree automata [4], the class of tree languages definable by SDTDs is the class of regular tree languages. The XML equivalent of that class is captured by the schema language Relax NG [7].


79

For ease of exposition, we always take Σ = {ai | 1 ≤ i ≤ ka , a ∈ Σ, i ∈ N} for some natural numbers ka and set µ(ai ) = a.

2. DATASET AND METHODOLOGY

A simple example of an SDTD is the following: store dvd1 dvd2

→ (dvd1 + dvd2 )∗ dvd2 (dvd1 + dvd2 )∗ dvd2 · (dvd1 + dvd2 )∗ → title price → title price discount

Here, dvd1 defines ordinary DVDs while dvd2 defines DVDs on sale. The rule for store specifies that there should be at least two of the latter. The following restriction on SDTDs corresponds to the expressiveness of XML Schema [13]: definition 3. A single-type SDTD is an SDTD (Σ, Σ , (d, s), µ) with the property that no regular expression d(a) has occurrences of types of the form bi and bj with the same b but with different i and j. The example SDTD above is not single type as both dvd1 and dvd2 occur in the rule for store. It is shown by Murata et al [13], that the class of trees defined by single-type SDTDs is strictly between the local and the regular tree languages. An example of a single-type grammar is given below: store regulars discounts dvd1 dvd2

→ → → → →

regulars discounts (dvd1 )∗ dvd2 dvd2 (dvd2 )∗ title price title price discount

Although there are still two element definitions dvd1 and dvd2 , they can only occur in a different context, regulars and discounts respectively.

1.2 Related work In 2000, while the XML Schema specification was still under development, Sahuguet [16] investigated a sample of DTDs to determine the shortcomings of the Document Type Definition specification. What he found missing has been remedied in XML Schema. Moreover, XML Schema introduces many features not envisioned by Sahuguet. One of the goals of this paper is to investigate to what measure these features are used in real world XSDs currently available. Choi [5] has tried to identify features that are characteristic for DTDs used to describe three types of schemas: application, data and meta-data related. He created a content model classification based on syntactic features and considered several measures for the complexity of a DTD. In this paper we extend his classification of content models and consider XSDs beside DTDs.

1.3 Overview Based on the characterizations given above, it is clear that DTDs and XSDs are both grammar based where XML Schema in addition is extended with a restricted typing mechanism. In Section 3, we inspect the use of that typing mechanism in practice together with another notion added to XML Schema: derived types. In Section 4, we compare the properties of the grammars underlying real DTDs and XSDs. First, we discuss in the next section our dataset and methodology.

80

We have tried to gather a representative sample of DTDs and XSDs. The XML Cover pages [8] have proved to be an excellent repository so almost all schemas in our sample have been obtained automatically from that source by using a simple web crawler. To ensure that the sample contains a base set of quality DTDs and XSDs, a number of the W3C standards have been included. Among those are the DTDs for MathML, SVG, XHTML, XML Schema and SMIL and the XSD for RDF and XML Schema. All in all, 109 DTDs and 93 XSDs have been obtained. Although some 600 DTDs and XSDs are mentioned on the Cover Pages, only these 109 + 93 were actually available for download, thus illustrating once again the transient nature of the Internet and its various technologies. All 93 XSDs have been used for the analysis in Sections 3.2 and 3.3 while unfortunately only 30 of the 93 XSDs can be used for most of the analysis in Sections 3.1 and 4, due to various errors discussed in Section 6 below. In the appendix, we provide a list of some of the XSDs we used.

3. EXPRESSIVENESS OF XML SCHEMA 3.1 Single-type The formal taxonomy presented in Section 1, elicits the following question: is the expressive power of single-type SDTDs actually used in real-world XSDs, and if so, to what extent? Given that XSDs tend to be a bit unwieldy due to their inherent verbosity, it is interesting to identify use cases for the distinctive features of single-type SDTDs versus DTDs. Those use cases might suggest a simpler formalism that finds an appropriate balance between designerfriendliness and expressive power. Surprisingly, most XSDs in our sample turn out to define local tree languages, that is, can actually be defined by DTDs. Only 5 out of 30 are true single-type SDTDs, which corresponds to approximately 15%. There might be several possible reasons for this low percentage. A first possibility is that expressiveness beyond local tree languages is simply rarely needed. Another explanation might be that due to the relatively new nature of XML Schema and its complicated definition most users have no clear view on what can be expressed. All five examples we found are of the following form: p q a1 a2

→ → → →

. . . a1 . . . . . . a2 . . . expr1 expr2

The meaning of a1 and a2 is the following: when the parent of an A is P (resp. Q) use the rule for A1 (resp. A2 ). No other use cases have been found in the sample.

3.2 Derived types Two kinds of types are provided by XML Schema: simple and complex types. The former describes the character data an element can contain (cfr. #PCDATA in DTDs) while the latter specifies which elements may occur as children in a given element.

extension restriction

simple type (%) 27 73

complex type (%) 37 7

abstract implying that one should derive new types from it. Although slightly more common, it is only used in 11 XSDs in our sample.

Table 1: Relative use of derivation features in XSDs XML Schema facilitates derivation of new types from existing types via two mechanisms: extension and restriction. Both simple and complex types can be extended or restricted. The four cases are introduced below; for a thorough discussion, we refer to the W3C specification [9, 17]. • A complex type can be derived from a simple type by extension to add attributes to elements. • A complex type can be extended to add a sequence of additional elements to its content model or to add attributes. • Restricting a simple type limits the acceptable range of values for that type. For example, one can enforce that a phone number should consist of three digits, a dash, followed by six more digits. • Restricting a complex type is similar to restricting a simple type in that it limits the set of acceptable subtrees. Table 1 lists the number of XSDs using a particular derivation feature. Note that in this section we used all 93 XSDs we retrieved since conformance was not an issue for this analysis. Approximately one fifth of the XSDs considered do not construct new types through derivation at all. Extension is mostly used to define additional attributes (58%); elements are added to a content model to a lesser degree (42%). As expected, restriction of complex elements is hardly used (7%). A typical example of the latter mechanism is the modification of the multiplicity of an element: maxOccurs =”unbounded” to maxOccurs=”1”. The statistics also show that only just over a third of the XSDs (37%) use extension of complex types, a feature that parallels inheritance in the object orientation paradigm. This might indicate that the data modeled by these XSDs is often too simple to merit such a (relatively) sophisticated approach. It might also be “underused” due to the relative novelty of XML Schema, since many data architects are trained to think in terms of relational data rather than object orientation. Extension of simple types occurs in 27% of the XSDs. Restriction of simple types is most heavily used (73%), which comes as no surprise since it allows a much more fine-grained control over the content of an element, rather than the unrestrictive #PCDATA DTDs are limited to, thus alleviating one of the more glaring shortcomings of DTDs. Several mechanisms have been defined to control type creation by derivation. The final attribute for type definitions indicates that the type can not either be restricted, extended or both. Only 6 out of the 93 XSDs use this feature. As opposed to finalizing a type definition, it can also be declared

81

As a general rule, derived types can occur anywhere in the content model where the original type is allowed. However, this can be prevented by applying the block attribute to the original type. As for the final attribute, replacement by either restricted, extended or both types can be blocked. Blocking is used in 2 out of the 93 XSDs. The fixed attribute that is usually used to indicate that an element or an attribute is restricted to a specific value also serves a purpose in the context of derivation from simple types. It can be applied to fix a facet of a simple type (e.g. the length of a xsd: string ) in a restrictive type derivation. Only a single XSD uses the fixed attribute in this sense. Although not directly related to derivation, the substitution group feature nevertheless deserves to be mentioned here. Elements are declared members of a substitution group using the substitutionGroup attribute with an element name as value and may occur instead in the content model akin to derived types. Substitution groups are used in 10 out of 93 XSDs.

3.3 Additional features XML Schema defines various additional features with respect to DTDs, see [9, 18] for an introduction. One feature of SGML DTDs that was lost to XML DTDs is the &-operator that specifies that all elements must occur but that their order is not significant. Obviously this can be simulated in an XML DTD by explicitly listing all orders (e.g. a1 &a2 &a3 ≡ a1 a2 a3 | a2 a3 a1 | . . . | a3 a2 a1 , so a choice between 6 cases), but this doesn’t exactly improve the clarity of the content model. XML Schema restores this feature by defining the xsd: all element. However, only 4 out of 93 XSDs use this operator. Elements in an XML document can be identified using ID attributes and referred to by IDREF or IDREFS. This feature is part of the XML 1.0 specification [2] and is supported by DTDs. These IDs are unique throughout a document and are the only attributes with such a restriction for DTDs. In XML Schema, any element or attribute can be declared to require a unique value by selecting the relevant nodes using an XPath expression and specifying the list of fields that combined should have a unique value. In our sample, 6 XSDs out of 93 use this feature, all applied to a single field. Referring to elements can also be accomplished by key/keyref pairs. Using a reference to a key implies that the element with the corresponding key should exist in the document. This feature is reminiscent of the foreign key concept in relational databases. It is used in 4 XSDs in our sample. An interesting feature introduced in XML Schema is the use of namespaces for modularity. This allows to use elements and types defined in the current XSD that are defined elsewhere without fear of name clashes. Apart from the obvious inclusion of the XML Schema namespace, 20 XSDs in our sample used this mechanism.

A last feature to discuss is the ability to redefine types and groups. It should be noted that W3C’s primer on XML Schema cautions against the use of this feature since it may break type derivation without warning. It turns out that the authors of the XSDs in our sample set heeded this advice and avoided redefine altogether.

4.

REGULAR EXPRESSION CHARACTERIZATION

The second question we try to answer is how sophisticated regular expressions tend to be in real world DTDs and XSDs. If simple expressions make up the vast majority of schema definitions, it is worthwhile to take this into account when developing implementations of XML related applications and fine-tune algorithms to take advantage of this simplicity whenever possible. In order to facilitate the analysis some preprocessing was performed. For the DTDs parsed entities were resolved and conditional sections included/excluded as appropriate. Since we are only concerned with the schema structure, the DTD element definitions were extracted and converted to a canonical form, which abstracts away the actual node labels and replaces them by canonical names c1 , c2 , c3 , . . . For example, < !ELEMENT l i b ( ( book | j o u r n a l ) ∗ )>

is represented by a canonical form (c1 | c2 )∗ to preserve only the structure related DTD information. The XSDs were preprocessed using XSLT to the canonical representation mentioned above for DTDs. To capture multiplicity constraints, ’ ?’ is used, e.g. for an element a with minOccurs=”1”, maxOccurs=”3”, \lstinlinea a? a?— is substituted. This approach allows us to reuse much of the software developed to analyze DTDs for XSDs as well. For all DTDs, there is a total of 11802 element definitions which reduce to 750 distinct canonical forms. The 1016 element definitions in the XSDs yield 138 distinct canonical forms, totaling 838 for both types of schemas combined. The majority of these can be classified in one of the categories of “simple expressions”, which are subclasses of the expressions studied by Martens, Neven, and Schwentick [14].

a1 , . . . , an ∈ Σ and n ≥ 1, or a? for some a ∈ Σ. Table 2 provides an overview. Factor a a∗ a? (a1 + · · · + an )

Abbr. a a∗ a? (+a)

Factor (a1 + · · · + an )∗ (a1 + · · · + an )? (a∗1 + · · · + a∗n ) (a∗1 + · · · + a∗n )∗

Abbr. (+a)∗ (+a)? (+a∗ ) (+a∗ )∗

Table 2: Possible factors in simple regular expressions and how they are denoted (a, a1 , . . . , an ∈ Σ). We analyze the DTDs and XSDs to characterize their content models according to the subclasses defined above. The result is represented in Table 3 that lists the non-overlapping categories of expressions having a significant population (i.e. more than 0.5%). Two prominent differences between DTDs and XSDs immediately catch the eye: XSDs have (1) more simpleType elements (denoted by #PCDATA) and (2) less expressions in the category RE(a, (+ a)∗ ). The first difference is due to the fact that it pays to introduce more distinct simpleType elements in XSD since thanks to type restriction, it is now possible to fine tune the specification of an element’s content (cfr. the discussion in Section 3.2). The second distinction is most probably due to the nature of the XSDs in the sample since those describing data are overrepresented with respect to those describing meta documents [5]. The latter tend to have more complex recursive structures than the former. To gauge the quality of our sample of XSDs, we compared DTDs and XSDs using several of the measures proposed by Choi [5]. No significant differences between the two samples are observed, which is confirmed by an additional measure in Figure 1, the density of XSDs. The density of a schema is defined as the number of elements occurring in the right hand side of its rules divided by the number of elements. DTDs and XSDs do not fundamentally differ in this respect. Several other measures such as the width and depth of canonical forms viewed as expression trees show no significant differences.

definition 4. A base symbol is a regular expression a, a?, or a∗ where a ∈ Σ; a factor is of the form e, e∗ , or e? where e is a disjunction of base symbols. A simple regular expression is ε, ∅, or a sequence of factors. The following is an example of a simple regular expression: (a∗ + b∗ )(a + b)?b∗ (a + b)∗ . We introduce a uniform syntax to denote subclasses of simple regular expressions by specifying the allowed factors. We distinguish base symbols extended by ? or ∗. Further, we distinguish between factors with one disjunct or with arbitrarily many disjuncts; the latter is denoted by (+ · · · ). Finally, factors can again be extended by ∗ or ?. For example, we write RE((+ a)∗ , a?) for the set of regular expressions e1 · · · en where every ei is (a1 + · · · + an )∗ for some

82

Figure 1: Fraction of DTDs (left) and XSDs (right) versus their density More importantly though, it is clear that the vast majority of expressions are simple, i.e. 92% and 97% of all element definitions in DTDs and XSDs respectively. Figure 2 shows the fraction of DTDs and XSDs versus the fraction of their

#PCDATA EMPTY ANY RE(a) RE(a, a?) RE(a, a∗ ) RE(a, a?, a∗ ) RE(a, (+ a)) RE(a, (+ a)?) RE(a, (+ a)∗ ) RE(a, (+ a)?, (+ a)∗ ) RE(a, (+ a∗ )∗ ) total simple expr. non-simple expr.

DTDs (%) 34 16 1 5 2 8 1 3 0 20 0 0 92 8

XSDs (%) 48 10 0 5 10 10 4 3 1 2 1 2 97 3

holonym/meronym relations) that are very often quite simple to express. Some randomly chosen examples of non-simple regular expressions that we encountered follow: c1 + | (c2 ?c3 + ) (c1 c2 ?c3 ?)?c4 ?(c5 | . . . | c18 )∗ c1 ?(c2 c3 ?)?(c4 | . . . | c44 )∗ c45 + c1 ?c2 c3 ?c4 ?(c5 + | ((c6 | . . . | c61 )+ c5 ∗ )) ∗ c1 (c2 | c3 )∗ (c4 , (c2 | c3 | c5 )∗ )

Table 3: Relative occurrence of various types of regular expressions given in % of element definitions simple content models: the majority of documents have 90% or more simple content models.

5. SCHEMA AND AMBIGUITY The XML 1.0 specification published by the W3C [2] requires schema definitions to be one-unambiguous, i.e. that all regular expressions in the grammar’s rules are deterministic in the following sense [3]: a regular expression is oneunambiguous iff the corresponding Glushkov automaton is deterministic. Note that the terminology is somewhat confusing in the literature: in the context of SGML ‘unambiguous’ is used to denote this feature while Choi [5] refers to it as ‘deterministic’. We checked whether the DTDs and XSDs in our sample respect this requirement and find that they almost all do. IBM’s XML Schema Quality Checker (SQC) [10] reported 3 out of 93 XSDs as having one or more ambiguous content models (see also Section 6). For DTDs, a first exception is a regular expression of the following type: (. . . | ci | . . . | ci | . . .)∗ that occurred in several DTDs. However, this is merely a typo, not a design feature.

Figure 2: Fraction of DTDs (left) and XSDs (right) having a given % of simple expression content models The relative simplicity of most DTDs and XSDs is further illustrated by the star height that is given in Table 4. The star height of a regular expression is the maximum nesting depth of Kleene stars occurring in the expression, e.g. 2 for the last example given below, 1 for all others. Content models with star height larger than 1 are very rare. No significant differences are observed between DTDs and XSDs, except for the star height but this is consistent with the relative abundance of RE(a, (+ a)∗ ) type of expressions in DTDs with respect to XSDs. star height 0 1 2 3

DTDs (%) 61 38 1 0

XSDs (%) 78 17 4 ≈0

The latter example illustrates a shortcoming of DTDs that has been addressed in XML Schema. Element definitions in the latter formalism allow the specification of the number of times an element can occur using the minOccurs and maxOccurs attributes. The specification for the example above would be captured by the following snippet of XML Schema (with slight abuse of notation):

We found three XSDs defining non-deterministic (or ambiguous) content models. Two canonical forms are found: c1 ?(c1 | c2 )∗ and (c1 c2 ) | (c1 c3 ).

Table 4: Star height observed in DTDs and XSDs

6. ERRORS

In a sense this should not come as too great a surprise: DTDs and XSDs model data that reflect real world entities. Mostly those entities are subject to simple relations among one another such as is-a, or is-part relations (pertainym1 , 1

A second type of one-unambiguous regular expression proves to be more interesting: c1 c2 ?c2 ?. The designer’s intention is clearly to state that c2 may occur zero, one or two times.

meaning relating to or pertaining to

83

It was a bit disappointing to notice that a relatively large fraction of the XSDs we retrieved did not pass a conformance test by SQC. As mentioned in Section 2, only 30 out of a total of 93 XSDs were found to be adhering to the current specifications of the W3C [17]. We decided to use only conforming XSDs for those parts of the analysis that require

conversion to canonical form to ensure correct processing by our software. Often, lack of conformance can be attributed to growing pains of an emerging technology: the SQC validates according to the 2001 specification and 19 out of the 93 XSDs have been designed according to a previous specification. Some simple types have been omitted or added from one version of the specs to another causing the SQC to report errors. Some errors concern violation of the Datatypes part of the specification [1]: regular expressions restricting xsd:string are malformed. Some XSDs violate the XML Schema specification by e.g. specifying a type attribute for a complexType element or leaving out the name attribute for a top-level complexType element.

7.

CONCLUSION

Our analysis has shown that many features defined in the XML Schema specification are not widely used yet, especially those that are related to object oriented data modeling such as derivation of complex types by extension. Most importantly, it turns out that almost all XSDs are local tree grammars, i.e. proper single type grammars are rarely used. The expressive power encountered in real world XSDs turns out to be mostly equivalent to that of DTDs. Hence it seems that — barring some exceptions — the current generation of XSDs could just as well have been written as DTDs from the point of view of structure. This might change in the future, as acceptance of a relatively new technology increases, or it might be a symptom that the level of sophistication offered by XML Schema is simply unnecessary for many applications. The data type part of the XML Schema specification is heavily used though since it alleviates a glaring shortcoming of DTDs, namely the ability to specify the format and type of the text of an element. This is accomplished through restriction of simple types. The content models specified in both real world DTDs and XSDs tend to be very simple. For XSDs, 97% of all content models can be classified in the categories of simple expressions we identified. This observation can guide software engineers when developing new implementations of XML related tools and applications, for instance by avoiding optimizations for complex cases that rarely occur in practice.

8.

REFERENCES

[1] P. Biron and A. Malhotra. XML Schema part 2: datatypes. W3C, May 2001, http://www.w3.org/TR/xmlschema-2/

[6] J. Clark. TREX - Tree Regular Expressions for XML: language specification, February 2001, http://www.thaiopensource.com/trex/spec.html [7] J. Clark and M. Murata. RELAX NG Specification. OASIS, December 2001, http://www.oasis-open.org/committees/ relax-ng/spec-20011203.html [8] R. Cover. The cover pages, 2003, http://xml.coverpages.org/ [9] D. Fallside. XML Schema part 0: primer. W3C, May 2001, http://www.w3.org/TR/xmlschema-0/ [10] IBM corp. XML Schema Quality Checker, 2003. http://www.alphaworks.ibm.com/tech/xmlsqc [11] A. Møller. Document Structure Description 2.0. BRICS, 2003, http://www.brics.dk/DSD/dsd2.pdf [12] M. Murata. Document description and processing languages – regular language description for XML (RELAX): Part 1: RELAX core. Technical report, ISO/IEC, May 2001. [13] M. Murata, D.Lee, M. Mani, and K. Kawaguchi. Taxonomy of xml schema languages using formal language theory. To be submitted to ACM TOIT, 2003. [14] W. Martens, F. Neven and T. Schwentick Complexity of Decision Problems for Simple Regular Expressions. Submitted. [15] Y. Papakonstantinou and V. Vianu. DTD inference for views of XML data. In PODS proceedings, pages 35–46, 2000. [16] A. Sahuguet. Everything you ever wanted to know about DTDs, but were afraid to ask. In Proceedings of WebDB 2000, 2000. [17] H. Thompson, D. Beech, M. Maloney, and N. Mendelsohn. XML Schema part 1: structures. W3C, May 2001, http://www.w3.org/TR/xmlschema-1/ [18] E. van der Vliet. XML Schema. O’Reilly, Cambridge, 2002.

APPENDIX

[2] T. Bray, J. Paoli, C. M. Sperberg-McQueen, E. Maler, and F. Yergeau. Extensible Markup Language (XML) 1.0. W3C, 3 edition, February 2004, http://www.w3.org/TR/2004/REC-xml-20040204/ [3] A. Br¨ uggemann-Klein and D. Wood. One-unambiguous regular languages. Information and computation, 140(2):229–253, 1998. [4] A. Br¨ uggemann-Klein, M. Murata, and D. Wood. Regular tree languages over non-ranked alphabets (draft 1). Unpublished manuscript, 1998. [5] B. Choi. What are real DTDs like? In Proceedings WebDB 2002, pages 43–48, 2002.

84

A list of XSDs used in the regular expression and single-type analysis (with number of definitions in brackets) and a few of the XSDs considered in the other parts of the paper. DSML v2 (1), EPAL-cs-xacml-schema-policy (34), EPALepal-interface (12), epal-interface (12), extensions (13), ipdr 2.5 (14), mets (1), OAI DC (1) OAI GetRecord (9), ODRLEX v1.0 (25), ODRL-EX v1.1 (23), PersonName v1.2 (8), PIDXLib-2002-02-14 v1.0 (255), PMXML2 (1), PostalAddress v1.2 (16), RIXML2 (1), simpledc (15), TC-1025 schema v1.0 xpdl (91), UKGovTalk-BS7666 v1.2 (68), VRXML 20021204 (43), wsrp v1.0 types (1), WS-Security-Schema-xsd-20020411 (7), wsui (26), xgmml (8), xpdl (91), BPML, GenXML v1.0, GML Base, HEML, LogML, MPEG21, PSTC-CS v1.0, RDF, UDDI v2.0, XML Schema

On validation of XML streams using finite state machines Cristiana Chitic

Daniela Rosu

Department of Comp. Science University of Toronto Toronto, ON, Canada [email protected]

Department of Comp. Science University of Toronto Toronto, ON, Canada [email protected]

ABSTRACT We study validation of streamed XML documents by means of finite state machines. Previous work has shown that validation is in principle possible by finite state automata, but the construction was prohibitively expensive, giving an exponential-size nondeterministic automaton. Instead, we want to find deterministic automata for validating streamed documents: for them, the complexity of validation is constant per tag. We show that for a reading window of size one and nonrecursive DTDs with one-unambiguous content (i.e. conforming to the current XML standard) there is an algorithm producing a deterministic automaton that validates documents with respect to that DTD. The size of the automaton is at most exponential and we give matching lower bounds. To capture the possible advantages offered by reading windows of size k, we introduce k-unambiguity as a generalization of one-unambiguity, and study the validation against DTDs with k-unambiguous content. We also consider recursive DTDs and give conditions under which they can be validated against by using one-counter automata.

1. INTRODUCTION As an increasing number of organizations and individuals are using the Internet, the ability to manipulate information from various sources is becoming a fundamental requirement for modern information systems. The XML data format has been adopted as the common format for data exchange on the Web, and, to facilitate query answering, XML data needs to be validated against DTDs and XML-Schemas. Streamed data is data originating from sources that provide a continuous flow of information. Storing the flowing data stream in order to process it is not a feasible solution since it would normally require large amounts of memory. To cope with this issue, one needs to find ways to process data as it comes without going back and forth in the stream. The symbols in the stream are the element tags and the included text in the order of their appearance in the XML document. Since we are interested in checking only structural properties, such as DTD conformance, we ignore the data values and consider the stream as a sequence of opening and closing Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Copyright is held by the author/owner. Seventh International Workshop on the Web and Databases (WebDB 2004), June 17-18, 2004, Paris, France.

85

tags. Checking that an XML stream conforms to a DTD in a single pass under memory constraints is referred to as the on-line validation of streamed XML documents [13]. There are two types of on-line validation considered in [13]: strong validation, which includes checking well-formedness of the input document, and validation, which assumes that the input is a well-formed document. In this work, we investigate on-line validation of XML documents using finite-state machines. The nonrecursive DTDs conforming to the current standard [14] (i.e. with one-unambiguous content) correspond to a reading window of size one. Further, we introduce k-unambiguous regular expressions as a generalization of one-unambiguous regular expressions and study nonrecursive DTDs with kunambiguous content. In both cases the finite machines are deterministic. The advantage is having constant time complexity of validation for each open/closed tag. We prove exponential lower and upper bounds for the size of the minimal deterministic automaton used for strong validation. According to statistical data [6], nonrecursive DTDs are more frequent than recursive DTDs, but still recursive DTDs are used commonly in practice. For on-line validation against recursive DTDs we propose using one-counter automata. We also give syntactic conditions under which recursive DTDs can be recognized by one-counter automata. Related Work On-line validation of streamed XML documents under memory constraints has also been studied in [13]. One of the results there is that for any nonrecursive DTD one can construct a finite automaton (a particular case of a result in [12]). An algorithm was given that constructs for any nonrecursive DTD a finite automaton that can be used to perform strong validation. However, the resulting automaton was nondeterministic, exponential in the size of the DTD and had exponential (in the size of the DTD) pertag complexity of validation. Validating against recursive DTDs was also considered in [13]. Under the restrictive assumption that the input stream is well-formed, they present a class of recursive DTDs validated by finite automata. Strong validation against recursive DTDs can be performed by push-down automata. However, their disadvantage is having a stack of size proportional, in the worst case, to the size of the XML stream. This approach was discussed in [13, 10]. In our approach the space required by the counter is logarithmic in the size of the input stream. One-unambiguous regular expressions [4] reflect the requirement that a symbol in the input word be matched uniquely to a position in the regular expressions without looking ahead in the word. We generalize this concept in a

different way than in the extension considered in [7]. With a lookahead of k symbols we want to determine the next, unique, matching position in the regular expression, while in the approach considered in [7] a lookahead of at most k symbols will determine uniquely the next k positions. We assume the reader is familiar with basic notions of language theory: (nondeterministic) deterministic finite-state automaton ((N)DFSA), context-free grammar (CFG) and language (CFL), extended context-free grammar (ECFG)(e.g., see [2, 12]). The paper is organized as follows. The first section describes the problem of validating XML streams against DTDs. Section 2 presents canonical XML-grammars associated to any DTD and also the size of the minimal automaton used for validating nonrecursive DTDs. In Section 3 we establish results for DTDs with one-unambiguous and k-unambiguous content. Finally, in Section 4 we investigate strong validation against recursive DTDs.

2.

THE VALIDATION PROBLEM

Let Σ be a finite alphabet. An XML document is abstracted as a tree document. A tree document over Σ is a finite ordered tree with labels in Σ. Formally, a string representation denoted [t] is associated to each tree document t as follows: if t is a single node labeled a, then [t] = a a ¯; if t consists of a root labeled a and subtrees t1 ...tk then [t] = a[t1 ]...[tk ]¯ a, where a and a ¯ are opening and closing a|a ∈ Σ} denote the alphabet of closing tags. Let Σ = {¯ tags. An XML document is a well-formed document if the string representation corresponding to the XML tree is wellbalanced. If T is a set of tree documents, L(T ) denotes the language consisting of the string representations of the tree documents in T . A DTD (Document Type Definition) [14] D = (Σ, R, r) that ignores the attribute elements is a finite set of rules R of the form a → Ra such that a ∈ Σ, Ra is a regular expression over Σ and r ∈ / Σ is called the root. The set of tree documents satisfying a DT D D is denoted by SAT (D). The language over Σ ∪ Σ consisting of the string representations of all tree documents in SAT (D) is defined as: L(D) = {[t] | t ∈ SAT (D)}. The dependency graph Dg of a DTD D is the graph whose set of nodes is Σ, and for each rule a → Ra in the DTD there is an edge from a to b for each b occurring in some word in Ra . A DTD is nonrecursive if and only if Dg is acyclic and is recursive if and only if Dg is cyclic. The problem of validating an XML stream with respect to a DTD D is defined as checking that the string representation of the XML document is contained in the associated language L(D).

3. CANONICAL XML-GRAMMARS In the context of streaming, since the attributes are ignored, a DTD appears to be a special kind of extended context-free grammar. The formal grammar that captures explicitly the opening and closing tags is the XML-grammar, first introduced in [1]. Given a DTD D = (Σ, R) we denote a corresponding XML-grammar by DECF = (N, Σ ∪ Σ, ROOT, P ), where Σ ∪ Σ is the set of terminals, N is the set of nonterminals and N is in 1-1 correspondence with Σ, ROOT is the start symbol and P is the set of productions. The set P contains only rules of the following types:

86

ROOT → r RROOT r¯; A → a RA a; A → a a, where A ∈ N , a ∈ Σ and a ¯ ∈ Σ. RROOT , RA is a regular expression containing only nonterminals and corresponds to the nonterminal ROOT , A respectively. Since XML-grammars [1] were studied without providing a way to link them with DTDs, we give an algorithm that transforms a DTD D into an XML-grammar DECF such that L(DECF ), i.e. the language of the grammar DECF , is the same as L(D). We call the grammar produced by the algorithm the canonical XML-grammar associated to the DTD D. Let D = (Σ ∪ Σ, r, R) be a DTD, Π be an alphabet such that Π ∩ Σ = ∅ and ROOT be the symbol such that ROOT ∈ Π. Π is the alphabet of nonterminals in the canonical XML-grammar. Let f : RegExp(Σ) ∪ {r} → RegExp(Π) ∪ {ROOT } be a function such that f (r) = ROOT , f (Σ) = Π and f/E is a bijection. The set of terminals of the canonical XML-grammar is T = Σ ∪ Σ. The set of productions P of the canonical XML-grammar are modifications of the rules of the DTD where rules of the form a → Ra are transformed into productions of the form f (a) → af (Ra )¯ a. The output is the canonical XML-grammar DECF = (N, T, ROOT, P ). The canonical XML-grammar associated to a DTD is instrumental in proving the results of this paper. Example 3.1. Consider the DTD D over Σ = {r, a, b, c} with the rules: r → a∗ b, a → bc, b → c + , c → . The algorithm gives the XML-grammar DECF = (N, Σ ∪ Σ, ROOT, P ), where N = {ROOT, A, B, C} and the set of productions P is: {ROOT → rA∗ B¯ r, A → aBC¯ a, B → b(C + )¯b, C → c¯ c}. Given a DTD D, the language L(D) is the same as the language generated by the canonical XML-grammar that corresponds to D. Thus, validating an XML stream with respect to a DTD D becomes equivalent to checking that the stream belongs to L(DECF ), which is the definition we will use throughout the paper. The canonical XML-grammar DECF corresponding to a nonrecursive DTD D is nonrecursive, thus the language L(DECF ) is regular [12]. We give an algorithm of bottomup substitution that takes as input a canonical nonrecursive XML-grammar DECF and returns a regular expression that generates the language L(DECF ) [5]. The regular expression that generates L(DECF ) is the regular expression corresponding to the nonterminal ROOT , computed by the bottom-up substitution algorithm. Example 3.2. Let D and DECF be the DTD, respectively the XML-grammar grammar from the previous example. The regular expression obtained by bottom-up substitution is r(a(b(c¯ c + )¯bc¯ c)¯ a)∗ (b(c¯ c + )¯b)¯ r. The regular expression obtained by applying bottom-up substitution to a DECF is used to construct an automaton which is used as a validating tool. The size of the automaton is linear in the size of the expression. An algorithm was presented in [13], which yields a validating non-deterministic automaton whose size is exponential in the size of the DTD. Here we show that even for the minimal automata the lower bound on their size is still exponential with respect to the size of the DTD. In order to compute the lower bound we partition the symbols of the alphabet into strata, defined below.

Definition 3.3. Stratum 0 of a nonrecursive DTD is the set of all symbols in Σ such that the right-hand sides of the corresponding rules contain only the symbol . Stratum i of a DTD is the set of all symbols in Σ such that the right-hand sides of the corresponding rules contain only symbols from Stratum 0, ..., Stratum i − 1 with at least one symbol from Stratum i − 1. We define the depth of a nonrecursive DTD to be the corresponding number of strata. We parameterize the size of a nonrecursive DTD by its depth and the maximum length of a right-hand side of a rule. Remark 3.4. For a nonrecursive DTD, the number of strata is finite and a symbol in the alphabet belongs to only one stratum. Also, in general, the depth of a DTD is not related to the number of productions. Let D be a nonrecursive DTD. Let d be its depth and let L be the maximum number of symbols that appear in the body of a rule. Lemma 3.5. [5] The size of the minimal automaton AD d −1 that recognizes L(D) is bounded from above by 1 + 2 · LL−1 . We show that the bound is tight by considering the following example. Example 3.6. Let D = ({r, a1 , ..., an }, R) be a nonrecursive DTD s.t. the set of rules is R = {r → a1 a1 , a1 → a2 a2 , ............, an−1 → an an , an → }. The regular expression corresponding to this DTD obtained by applying the bottom-up substitution algorithm to the associated DECF is r (a1 (a2 (... (an−1 (an an )2 an−1 )2 ...)2 a2 )2 a1 )2 r. The minimal automaton that recognizes L(D) has 1 + 2 · (2n+1 − 1) states, where n + 1 is the depth of D and the maximum length of a production in R is 2.

4.

STREAMED VALIDATION OF NONRECURSIVE DTDS

We now investigate the problem of validating XML streams against nonrecursive DTDs. The XML standard [14] requires that the regular expressions associated with each production match uniquely a position of a symbol in the expression to a symbol in an input word without looking beyond that symbol. This scenario corresponds to processing the stream with a deterministic automaton having a reading window of size one. The regular expressions described by the standard are the one-unambiguous regular expressions introduced in [4]. Before presenting the results of this section, we recall the definition and some characterizations of one-unambiguous regular expressions [4]. To denote different occurrences of the same symbol in an expression, all the symbols are marked with unique subscripts. For example, a marking of the expression b((a+ bc)∗ |(abd)∗ ) is b7 ((a1 + b2 c3 )∗ |(a4 b5 d6 )∗ ). The set of symbols of expression E is denoted by sym(Σ). For expression E, we denote its marking by E . Each subscripted symbol is called a position. For a given position x, X (x) indicates the corresponding symbol in Σ without the subscript. Formally, an expression E is defined to be one-unambiguous if and only if, for all words u, v, w over sym(E ) and all symbols x,y in sym(E ), if the words uxv, uyw ∈ L(E ) and x = y then X (x) = X (y). One possible method to convert a regular expression into a finite

87

automaton was proposed in [8]. In the Glushkov automaton of an expression E, the states correspond to positions in E and the transitions connect positions that are consecutive in a word in L(E ). For each expression E, the following sets are defined: F irst(E ), the set of positions that match the first symbol of some word in L(E ); Last(E ), similarly for the last positions; and F ollow(E , z), the set of positions that can follow position z in some word in L(E ). A Glushkov automaton corresponding to a regular expression is constructed using the sets defined above. The Glushkov automaton has as many states as the number of positions in the corresponding marked expression plus one. Definition 4.1. A DTD D (an XML-grammar DECF ) is one-unambiguous if all the rules (productions) have oneunambiguous regular expression in their right-hand sides. The canonical XML-grammar associated to a oneunambiguous DTD is also one-unambiguous. Theorem 4.2. Let D = (Σ, R) be a nonrecursive oneunambiguous DTD. Then the language L(D) is oneunambiguous. To prove that the language L(D) (or the language L(DECF ) generated by the canonical grammar DECF ) is one-unambiguous it is sufficient to show that there exists a one-unambiguous expression that generates the language. The solution to this problem is the regular expression obtained by bottom-up substitution [5]. Corollary 4.3. There is an algorithm that, given a nonrecursive one-unambiguous DTD D, constructs a deterministic Glushkov automaton GD such that GD accepts precisely the language consisting of the string representations of the documents that conform to D. The algorithm in [13] constructs an exponential-size nondeterministic automaton, which yields per open/closed tag complexity that is exponential in the size of the DTD. In contrast, for one-ambiguous DTDs, we can construct a deterministic automaton that verifies conformance to the DTD for streaming documents. This yields constant per open/closed tag complexity. Remark 4.4. If the validation of nonrecursive DTDs is performed using their corresponding Glushkov automata, the exponential lower and upper bounds of these automata in the size of the DTDs are the same as the bounds shown in the previous section. From example 3.6 and remark 4.4 is follows that the exponential bound on the size of the Glushkov automata accepting L(D) for a one-unambiguous DTD D cannot be improved. The lower bound on the size of the minimal deterministic automata accepting L(D) is also exponential in the size of D. We consider now the case of DTDs with non-deterministic content, which appear in practice in a variety of fields [6, 5]. To model these types of DTDs we introduce the kunambiguous regular expressions. They are a generalization of one-unambiguous regular expressions, different from the one considered in [7]. Informally, a k-unambiguous expression matches uniquely a position of a symbol in the expression to a symbol in an input word by looking ahead k symbols. Practically, an XML stream can be validated against

a nonrecursive DTD with k-unambiguous content by a finite automaton that uses a reading window of size k to move to a unique state. Definition 4.5. A regular expression E is kunambiguous, where k ≤ |E|, if and only if for all words u, v, w, x = x1 ... xk and y = y1 ... yk over sym(E ), ux1 ...xk v ∈ L(E ), uy1 ...yk w ∈ L(E ) and xk = yk implies that X (x1 ) ... X (xk ) = X (y1 ) ... X (yk ). Example 4.6. Consider the regular expression b((a+ bc)∗ |(abd)∗ ) marked as b7 ((a1 + b2 c3 )∗ |(a4 b5 d6 )∗ ). This regular expression is not one-unambiguous. Given the word babc it cannot be decided if the symbol a following the symbol b is matching position a1 or a4 . By looking ahead 4 symbols we can match the occurrence of symbol c with position c3 . Example 4.7. The following 3-unambiguous content model stand point → ((back sight, f ore sight|back sight, f ore sight, back sight), intermidiate sight∗ , inf o stand?, inf o i∗ ) appears in a DTD that describes digital measurements http:// gama.fsv.cvut.cz/˜soucek/ dis/ convert/ dtd/ dnp-1.00.dtd. Another way of characterizing k-unambiguous regular expressions is using Glushkov automata. Definition 4.8. A Glushkov automaton G = (Q, Σ, δ, F ) is k-deterministic if for every state p ∈ Q and every word a1 ... ak over Σ, the extended transition δ ∗ (p, a1 ...ak ) contains at most one state. In other words, we call a Glushkov automaton kdeterministic if from any state following all paths labeled a1 ... ak we reach at most one state. Given a marked expression E we define the sets: F irst(E , k) = {w ∈ sym(E )| there is a word u s.t. wu ∈ L(E ) and |w| = k}, F ollow(E , z, k) = {w ∈ sym(E )| there are words u and v s.t. uzwv ∈ L(E ) and |w| = k}, for all symbols z ∈ sym(E ) and Last(E , k) = { w ∈ sym(E )| there is a word u s.t. uw ∈ L(E ) and |w| = k}. These sets can be computed in time polynomial in the size of E , and thus we can give a polynomial-time algorithm to check whether a regular expression is k-unambiguous, for a given k. The worst case time complexity is O(|E |k+1 ). There are regular expressions that are not k-unambiguous for any number k. Example 4.9. The Glushkov automaton in Figure 1(i) is k-deterministic for any k ∈ N, since δ ∗ (a1 , aaaaaaaaa) = {a1 , a3 }. The Glushkov automaton in Figure 1(ii) is 3 − unambiguous. Proposition 4.10. (a) If a regular expression E is kunambiguous then E is k + 1-unambiguous. (b) A regular expression E is k-unambiguous if and only if the Glushkov automaton corresponding to E is k-deterministic. Given a nonrecursive DTD D = (Σ, R), we obtain by applying our bottom-up substitution algorithm to the canonical XML-grammar DECF a regular expression ED that describes the language L(DECF ) = L(D). The Glushkov automaton corresponding to ED is used for validation XML streams against D. If the regular expression ED is k-unambiguous, then the automaton used for online validation is also k-deterministic. We define the set

88

a4

a

b

d

b5

d6

d d7

b

a a1

b2

b

(i)

d

c3

c

b b

a a1

a a

b a

(ii)

b2

a

a a3

a

b5 b a

b7

a b a4

a

a6

Figure 1: Glushkov automaton corresponding to the regular expressions (i): (a + b)∗ a(a + b)(a + b) and (ii): (abc + ab∗ d)d. Σk = { a ∈ Σ| a → Ra ∈ R and Ra is k -unambiguous} for any number k. A regular expression is called finite if it denotes a finite language. Now, we define the set Σfk in = { a ∈ Σ | a → Ra ∈ R and the regular expression Ra is finite k-unambiguous} for any number k. On the dependency graph Dg we define the set Reachablea , a ∈ Σ, to be the set of symbols contained in the subtree rooted in a. As shown in the example 4.9, there exist expressions that are not k-unambiguous for any number k. Also, similarly to the class of one-unambiguous expressions, the class of k-unambiguous expressions is not closed under union, concatenation and star operation. Thus, we need to impose conditions on the nonrecursive DTDs that guarantee that the bottom-up substitution algorithm applied to the corresponding XML-grammars yields a k-unambiguous expression. Theorem 4.11. Let D = (Σ, R) be a nonrecursive DTD with the root rule denoted by r → Rr . Let Dg be the dependency graph of D. Assume that one of the following conditions is true: 1. The regular expression Rr is k-unambiguous and Reachablea ⊆ Σfp in for a ∈ sym(Rr ) 2. The regular expression Rr is one-unambiguous and there exists at most one a ∈ Σk such that Reachablea ⊆ Σf1 in , none of the elements on the path from a to the root appear under a Kleene-star and

a∈

Reachableb .

b∈sym(Rr )

Then there exists a number k such that the regular expression associated to the canonical XML-grammar DECF and obtained by bottom-up substitution is k -unambiguous. We illustrate the conditions of the theorem in the following two examples. Example 4.12. Let D = ({r, a, b, c, d, e, f }, {r → (a+ e|ac)(cd)∗ , a → b, b → ce|cf, c → , d → ccf |cce, e → , f → }) be a nonrecursive DTD satisfying the first condition of the theorem. The expression Rr is 2unambiguous and the rest of the rules correspond to finite 3-unambiguous expressions. The regular expression

obtained by the algorithm of bottom-up substitution is r((ab(c¯ ce¯ e|c¯ cf f¯)¯b¯ a)+ e¯ e|ab(c¯ ce¯ e|c¯ cf f¯)¯b¯ ac¯ c)(c¯ cd(c¯ cc¯ cf f¯|c¯ cc¯ c ¯ ∗ r¯, where the maximum of the lengths of the regular e¯ e)d) expressions associated to the nonterminals A, B, C, D, E, F is 14. Thus, k = 14. Example 4.13. Let D = ({r, a, b, c, d, e, f }, {r → ba+ , a → ed∗ f, b → c|d+ , c → f + e|f d, f → ee, d → , e → }) be a nonrecursive DTD that satisfies the second condition of the theorem. The regular expression obtained by the algorithm of bottom-up substitution applied on the canonical XML-grammar associated to D is ¯ c|(dd) ¯ + )¯b(ae¯ ¯ ∗ f e¯ rb(c((f e¯ ee¯ ef¯)+ e¯ e|f e¯ ee¯ ef¯dd)¯ e(dd) ee¯ ef¯a ¯)+ r¯. In this case k = 18, which is the length of the regular expression associated to the nonterminal C. As the following examples show, the theorem is quite tight, since there are DTDs that deviate only slightly from the conditions of the theorem and have associated regular expressions that are not k-unambiguous. Example 4.14. Let D = ({r, a, b, c, d}, {r → ab|ac, a → d∗ , b → , c → , d → }) be a nonrecursive DTD, for which the first condition of the theorem is not satisfied. The regular expression associated to the DTD is ¯ ∗a ¯ ∗a r(a(dd) ¯b¯b|a(dd) ¯c¯ c)¯ r, which is not a k -unambiguous regular expression for any k ≤ 14 (the length of the expression being 14). Example 4.15. Let D = ({r, a, b, c, d, e}, {r → ab, a → cd|ce, b → , c → d∗ , d → , e → }) be a nonrecursive DTD, for which the second condition of the theorem is not satisfied. The regular expression associated to the DTD ¯ ¯ ∗ c¯e¯ is ra(c(dd¯∗ c¯dd|c(d d) e)¯ ab¯b¯ r. The Glushkov automaton of this expression is not k -unambiguous for any k , since the extended transition of the automaton contains more ¯ d...d ¯ d¯ of arbitrary than one state when reading words racddd length.

5.

STREAMED VALIDATION OF RECURSIVE DTDS

We now consider the problem of strongly validating an XML stream against a recursive DTD using (restricted) onecounter automata. Recursive DTDs appear in fields like: computational linguistics, web distributed data exchange, financial reposting, etc [5]. Example 5.1. The following recursive content model exception → (type, msg, contextdata∗ , exception?) appears in a DTD from Workflow Management Coalition http://www.oasis-open.org/cover/WFXML10a-Alpha.htm. A one-counter automaton (1-CF SA) is a quintuple M = (Q, Σ, H, q0 , F ), where Q is a finite set of states, q0 ∈ Q is the initial state, F ⊂ Q is the set of final states, Σ is the finite alphabet and the transition set H is a finite subset of Q × (Σ ∪ ) × Z+ × {−1, 0, 1} × Q [9, 11]. Thus, the 1CF SA consists of a finite state automaton and a counter that can hold any nonnegative integer and can only test if the counter is zero or not. The move of the machine depends on the current state, input symbol and whether the counter is zero or not. In one move, M can change state, add +1, 0 or −1 to the counter. However, the counter is not allowed to subtract 1 from 0. The machine accepts a word if it starts

89

in the initial state with the counter 0 and reaches a final state with the input completely scanned and the counter 0. A language accepted by a one-counter machine is called one-counter language. We denote the family of one-counter languages by OCL. A restricted one-counter automaton (restricted-1-CF SA) is a one-counter automaton which, during the computation cannot test if the counter is zero [9, 2, 3]. The language accepted by a restricted one-counter automaton is called restricted one-counter language and the family of such languages is denoted ROCL. It is known that the family of restricted one-counter languages are strictly included in the family of one-counter languages [3] and that OCL is N SP ACE(log n) [15]. Every language accepted by a one-counter automaton is a context-free language [2]. a/+1

q0

r

q1

a/+1

a/−1 a/−1

q2

r

q3

q4

Figure 2: One-counter automaton recognizing the language {ran an r|n > 0}

c2 c/+1 a

a1

a

a3

a

c a

c4

a6 a

c

c5

c

c7 a

b10

c/−1 a8

b/+1

b/−1

b/−1 b9

b/+1

Figure 3: One-counter automaton recognizing the c)¯ a(¯ ca ¯)n bm¯bm } language {(ac)n a(|c¯ Example 5.2. The language corresponding to the DTD D = {r → a, a → a|} is ran an r and a streamed XML document can be strongly validated against D using the onecounter automaton in Figure 2. Definition 5.3. A DTD D is (deterministic) onecounter recognizable, (deterministic) restricted one-counter recognizable, if the language L(D) is in (D−)OCL, (D−)ROCL respectively. An XML-grammar DECF is (deterministic) one-counter recognizable, (deterministic) restricted one-counter recognizable, if the language L(DECF ) is in (D−)OCL,(D−)ROCL respectively. We denote the family of one-counter recognizable (restricted one-counter recognizable) DTDs by OCRD (ROCRD). The family of languages generated by (restricted) one-counter recognizable DTDs is contained strictly in the family (R)OCL, since there are languages that are (restricted) one-counter but cannot be generated by a DTD, e.g. {αβ| α, β ∈ Σ∗ , ∈ / Σ, |α| = |β|} [2]. The family of strongly recognizable DTDs [13] is included in (R)OCRD, since the regular languages are included in the (restricted) one-counter languages. However, both ROCRD and OCRD are incomparable with the family of recognizable DTDs [13]. We illustrate this result in the following example. Example 5.4. Consider the DTD D = {r → aa, a → a|}. D is not recognizable, but is in OCRD since L(D) = ¯ n am a ¯m r¯|n, m ≥ 0} ∈ OCL. Conversely, let D be {ran a

the DTD {r → a, a → a?|b, b → b?}. D is recognizable but the corresponding language, L(D) = {ran bm¯bm a ¯n r¯|n ≥ 1, m ≥ 0} is not OCL [2], hence D is not in OCRD. Consider now the restricted one-counter recognizable DTD D = {r → a, a → a∗ }. By straightforward testing of the necessary conditions provided in [13], one can prove that D is not recognizable. Finally, the DTD D = {r → a1 a2 , a1 → a1 ?, a2 → a2 ?} is recognizable, but using the iteration theorem for the family of restricted one-counter languages [3] one can show that L(D) ∈ / ROCL and thus, D is not restricted one-counter recognizable. Finding grammatical characterizations of (D)ROCL and (D)OCL is still an open problem. However, adapting a well known result from [3] one can infer sufficient conditions for a DTD to be in ROCRD. We present syntactic restrictions that yield a class of recursive DTDs that are OCRD. Theorem 5.5. Let D = (Σ, R) be a recursive DTD and let DECF = (Σ ∪ Σ, N, P ) be the associated canonical XMLgrammar. Let {A1 , ..., An } ⊆ N be the set of recursive nonterminals in DECF and let G be the dependency graph of DECF . Assume the following conditions are true: • every node in G appears in at most one simple cycle. • the regular expressions RA1 , ..., RAn contain only one recursive nonterminal and that recursive nonterminal does not appear under the scope of a Kleene-star. / {A1 , ..., An } have corre• all the nonterminals Bi ∈ sponding regular expressions by bottom-up substitution. Then the language L(DECF ) = L(D) is a one-counter language, which means that D is in OCRD. Moreover, there is an algorithm to construct a one-counter automaton that can be used to validate an XML stream against the DTD D. If D has only one-unambiguous content then the resulting one-counter automaton is deterministic. Intuitively, if the dependency graph of a DTD D satisfies the conditions of Theorem 5.5, one can find an expression describing L(D) and construct based on it a one-counter automaton that precisely accepts L(D) [5]. Example 5.6. Let D = ({r, a, b, c}, {r → ab, a → c|, c → a|, b → b|}) be a recursive DTD. The corresponding canonical XML-grammar DECF has the following productions {ROOT → rAB¯ r, A → aC¯ a| a¯ a, C → cA¯ c| c¯ c, B → bB¯b|b¯b}. The language generated by the grammar is {(ac)n a(|c¯ c)¯ a(¯ ca ¯)n bm¯bm |n, m ≥ 1}, which is not a regular language. The one-counter automaton recognizing the language generated by DECF is presented in Figure 3. Relaxing slightly the conditions of the theorem we can find examples of the DTDs whose languages are no longer one-counter. Example 5.7. Let D = ({r, a, b}, {r → a, a → a|b|, b → b|}) be a recursive DTD. D deviates slightly from the first two conditions of the theorem. The language generated by the corresponding canonical XML-grammar is {ran bm¯bm a ¯n r¯|n ≥ 1, m ≥ 0}, which is not OCL. Example 5.8. Let D = ({r, a, b, c}, {r → a, a → ac|, c → b, b → b|}) be a recursive DTD. D verifies the first two conditions of the theorem but deviates slightly from the ¯)n |n ≥ 1, m ≥ 0}, which is not third. L(D) = {an (cbm¯bm c¯a OCL.

90

6. CONCLUSION This paper continues the formal investigation of the problem of on-line validation of streamed XML documents with respect to a DTD. We provided further insights on the size of the minimal (deterministic) automata that can be used for strong validation against non recursive DTDs. Motivated by real world examples of DTDs with non-oneunambiguous content models and to capture the possible advantages of using reading windows of size greater than one, we introduced the notion of k-unambiguous regular expressions as a generalization of one-unambiguous regular expressions. We also investigated the problem of strong validation against recursive DTDs without imposing that the streamed document be well-formed. We introduced a hierarchy of classes of DTD that can be recognized using variants of onecounter automata and provided syntactic conditions on the structure of the DTDs that ensure the existence of a onecounter automaton for performing strong validation. A precise characterization of the DTDs recognizable by deterministic counter automata is yet not available and will be considered in future work.

7.

ACKNOWLEDGMENTS

We are indebted to Leonid Libkin and Alberto Mendelzon for their guidance and feedback.

8. REFERENCES [1] J. Berstel and L. Boasson. XML Grammars. MFCS 2000. [2] J. Berstel and L. Boasson. Context-Free Languages. Handbook of Theoretical Computer Science, 1990. [3] L. Boasson. Two iteration theorems for some families of languages. J. Computer and System Sciences, 7, 1973. [4] A. Bruggemann-Klein and D. Wood. One-unambiguous regular languages. Information and Computation, 140, 1998. [5] C. Chitic and D. Rosu. On validation of XML streams using finite state machines. Technical Report CSRG-489, University of Toronto, 2004. [6] B. Choi. What are real DTDs like. WebDB, 2002. [7] D. Giammarresi, R. Montalbano, and D. Wood. Blockdetereministic regular languages. ICTCS, 2001. [8] V.M. Glushkov. The abstract theory of automata. Russian Mathematical Surveys, 16, 1961. [9] S. Greibach. An inifinite hierarchy of context-free languages. Journal of the ACM, 16, 1969. [10] C. Koch and S.Scherzinger. Attribute grammars for scalable query processing on XMl streams. DBPL 2003. [11] A. Meyer P. Fisher and A. Rosenberg. Counter machines and counter languages. Mathematical Systems Theory, 2, 1968. [12] A. Salomaa. Computation and Automata. Cambridge University Press, Cambridge, 1985. [13] L. Segoufin and V. Vianu. Validating streaming XML documents. PODS, 2002. [14] J. Paoli T. Bray and C.M. Sperberg-McQueen. XML 1.0. World Wide Web Consortium Recommendation, 1998. http://www.w3.org/TR/REC-xml/. [15] K. Wagner and G.Wechsung. Computational Complexity. Mathematics and its applications. VEB Deutscher Verlang der Wissenschaften, Berlin, 1986.

Checking Potential Validity of XML Documents Ionut E. Iacob

Alex Dekhtyar

Michael I. Dekhtyar "

Dep. of Computer Science University of Kentucky Lexington, KY 40506

Dep. of Computer Science University of Kentucky Lexington, KY 40506

Dep. of Computer Science Tver State University Tver 170000, Russia

!

#$% &'( ! )*! +!%

`z?234@BAC.< :D 34.4<:1 ;'3\> 234@ AC.4<:|/?1 234.4.>?595l7(2WW'2QR54§l: -B.: .V : @T9WT342<: .4<:|;'5 W'29> .>Y9Z: -B.4?;': 21k3%-B225.45\: - .X7v1%9mD AC.4<:e27: - .$: .V :Yo96;'<:1 2 > @ 34.45~52AC.C.4< 342 > ;'< mZ7(21$;[:^M`z< AC25O:l395.454Y: -B.21%> .1|;'?2ZQR;[: -G: -B.ChX,h5O:1 @ 3: @ 1 .Ya96A89}rwg.q> ;'3:%9P: .> w}r9Z5/g.434;[B3$:%95]r9:e-B9r7(21e: -B.{.>?;[: 21§X.^ m?^YyA891 ] ;' 23@BAC.4<:l5.4<: .40QX^ 1^ :^c: -B. hX,heiPFXj 3- .4A89~@ 5.>M7v21R: -B.IA89P1 ] @B/_^

ABSTRACT ,- .0/ 1 234.55627831 .9: ;'2?2[email protected]<:DE34.4<:1 ;'3GFIHKJL> 23@ D AC.< : 5M27+: .4 @ 34.45RA891 ] @ /_â`b ;9: .CFIHKJc;'5$9WAC25O:$< .4d.1ed9W';>fQR;[: -G1 .45/g.43:e: 2K: - . hI,heij 3- .4A89k@B5.4>$7(21l: -B.R.4< 342 > ;'< m ôn:o: -B.R5 9AC.: ;'AC.Y;[:);'5 ;'AC/g21:%9<:a: 2X.4 ;[:b}f27RFXHKJc>?2[email protected]<: 54Y_QR-B;'39W'W2QR5~@B5k: 2Z> ;'5O: ;'Zwg.439@B5.~52AC.e27o: - .~hI,h1 @BW'.45 m@B;> ;'< mM: -B.5O:1 @B3: @?1 .$27): -B..4 ;' 23@BAC.4<: 54^

n:o: -B.R5 9AC.: ;'AC.YQ).\< 2: .: -T9P:o: -B.R;'