Σ_{i : p_i ≥ 1/(αL)} S_i/(2L), and the bandwidth for the unicast requests is α Σ_{i : p_i < 1/(αL)} p_i S_i. In particular, if L increases, while all other variables remain fixed, then more documents are pushed, an observation that will be used to derive the final algorithm.
The second and more substantial modification to the previous argument is due to the mismatch between the objectives of Algorithm 1 and the desired objectives for bandwidth division and document selection. Algorithm 1 minimizes the amount of required bandwidth to achieve a fixed L, whereas our goal is to minimize L given a fixed deployed bandwidth B. In some sense, our problem is the dual of the one that Algorithm 1 solves. Algorithm 2 solves the bandwidth division and document selection problems, and uses Algorithm 1 as a subroutine. The algorithm employs a parameter α > 1 that measures the target level of over-provisioning for the pull channel. More precisely, the actual bandwidth we reserve for pull is α times what an idealized estimate predicts. Queuing theory asserts that α > 1 guarantees bounded queuing delays, whereas α ≤ 1 leads to infinite queuing delays. As such, the parameter α can also be thought of as a safety margin for the pull channel. The algorithm also uses a parameter ε > 0, which is an arbitrarily small positive number, and finds a solution that has latency within ε of the optimum for the given bandwidth and popularities.
Algorithm 1 assumes that documents have been sorted in nonincreasing order of popularity, i.e., p_i ≥ p_{i+1} (1 ≤ i < n). It can be easily seen that if i is pushed, then j < i should be pushed as well. Then, the problem becomes that of finding a value of k such
(1)
Algorithm 1 follows this approach. However, as we are interested in the average latency of a pushed document instead of the worst-case latency, we need to make the following modification to equation (1). The unicast pull term p_i S_i in (1) is the bandwidth
Algorithm 2 Bandwidth Division and Document Classification
Require: n, B, S, p, α, and ε as defined in Table 1, and p_i ≥ p_{i+1} (1 ≤ i < n)
Ensure: k is the optimal number of documents on the push channel, pullBW is the optimal pull bandwidth, pushBW is the optimal push bandwidth
1: for i = 1 ... n do
2:   rspt_i ← rspt_{i-1} + p_i S_i
3:   sizeTotal_i ← sizeTotal_{i-1} + S_i
4: end for
5: lMax ← sizeTotal_n / B
6: lMin ← 0
7: while lMax − lMin > ε do
8:   L ← (lMax + lMin)/2
9:   k ← tryLatency(L, p, n)
10:  pullBW ← α (rspt_n − rspt_k)
11:  pushBW ← B − pullBW
12:  if pushBW ≥ sizeTotal_k/(2L) then
13:    lMax ← L
14:  else
15:    lMin ← L
16:  end if
17: end while
the objective of minimizing the maximum relative inaccuracy observed in the estimated popularities of the pushed documents. In this case, we show analytically that each report probability should be set inversely proportional to the predicted access probability for that document. First, the server calculates the rate λ of incoming reports that it can tolerate. Presumably, λ is approximately equal to the rate that the server can accept TCP connections minus the rate of connection arrivals for pulled documents. Therefore, the value of λ can be estimated from the access probabilities and the current request rate, all scaled down by a safety factor to give the server a little leeway for error. Then, the s_i's have to be set such that Σ_{i=1}^k p_i s_i ≤ λ, where documents 1 ... k are on the push channel. The expected number of reports μ_i that the server can expect to see for i over a unit time period is p_i s_i. Using standard Chernoff bounds, the probability that the number of reports is more than (1+δ)μ_i is roughly e^{−μ_i δ²/4}, and the probability that the number of reports is less than (1−δ)μ_i is roughly e^{−μ_i δ²/2}. If the goal is to minimize the expected maximum relative inaccuracy of the reports, all of the upper tail bounds should be equal and all of the lower tail bounds should be equal. That is, all μ_i should be equal, or equivalently it should be the case that for all i, 1 ≤ i ≤ k, s_i = λ/(p_i k). Hence, each document should have a report percentage inversely proportional to its access probability.
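As an illustration, a minimal Python sketch of this rule follows (λ above denotes the tolerable report rate; the example values and helper names are assumptions of the sketch and not part of the system described):

    import random

    def report_probabilities(p_push, lam):
        # s_i = lam / (p_i * k): inversely proportional to the access probability,
        # so every pushed document yields the same expected number of reports
        # (values are assumed small enough that each s_i stays below 1).
        k = len(p_push)
        return [lam / (p_i * k) for p_i in p_push]

    def client_should_report(s_i):
        # a client that reads pushed document i sends an explicit report with probability s_i
        return random.random() < s_i

    def estimate_popularity(report_counts, s):
        # the server scales observed report counts back up by 1/s_i to recover
        # relative popularities of the pushed documents
        raw = [c / s_i for c, s_i in zip(report_counts, s)]
        total = sum(raw)
        return [r / total for r in raw] if total else raw

    p_push = [0.4, 0.2, 0.1]   # popularities of the k pushed documents (example values)
    lam = 0.06                 # tolerable aggregate report rate (example value)
    s = report_probabilities(p_push, lam)          # -> [0.05, 0.1, 0.2]
    assert abs(sum(p * s_i for p, s_i in zip(p_push, s)) - lam) < 1e-12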
that the multicast push set {1, 2, ..., k} minimizes the latency L given a certain bandwidth B and pull over-provisioning factor α. The optimal value k can be found by trying all possible values of L, computing the document k that achieves L with Algorithm 1, and checking that this value of k satisfies the bandwidth requirements. The pull bandwidth requirement is α Σ_{i=k+1}^n p_i S_i, which leaves pushBW = B − α Σ_{i=k+1}^n p_i S_i for the push channel, and an average latency for the pushed documents of Σ_{i=1}^k S_i / (2 pushBW). If this computed average latency for the pushed documents is greater than L, then L needs to be increased, otherwise L needs to be decreased. Algorithm 2 follows this approach but with two optimizations. In the first place, the algorithm performs a binary search over all possible values of L and stops when the interval for L is bounded by the tolerance ε. Moreover, the algorithm pre-computes the sums Σ_{i=1}^k p_i S_i and Σ_{i=1}^k S_i in the arrays rspt and sizeTotal respectively (Lines 1–4). The purpose of these computations is to use the totals in the place of the sums in the bandwidth computations. Because of this optimization, the portion of the algorithm before the binary search runs in linear time. The maintenance of the rspt and sizeTotal arrays can be implemented in logarithmic time per query using standard augmented binary tree techniques [8]. Thus, the running time of Algorithm 2 is O(max(n, log(Σ_{i=1}^n S_i / (Bε)))). We expect that as a practical matter the running time will be O(n).
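The binary search can be summarized in a short Python sketch; the tryLatency stand-in below simply applies the threshold rule p_i ≥ 1/(αL) and is an assumption of the sketch rather than the full Algorithm 1:

    def bandwidth_division(p, S, B, alpha=2.0, eps=0.005):
        # p: popularities sorted in nonincreasing order; S: document sizes; B: total bandwidth
        n = len(p)
        rspt = [0.0] * (n + 1)          # rspt[i] = sum of p_j*S_j over the i most popular documents
        sizeTotal = [0.0] * (n + 1)     # sizeTotal[i] = sum of S_j over the i most popular documents
        for i in range(1, n + 1):
            rspt[i] = rspt[i - 1] + p[i - 1] * S[i - 1]
            sizeTotal[i] = sizeTotal[i - 1] + S[i - 1]

        def try_latency(L):
            # stand-in for Algorithm 1: push every document whose popularity clears the threshold
            k = 0
            while k < n and p[k] >= 1.0 / (alpha * L):
                k += 1
            return k

        def feasible(L):
            k = try_latency(L)
            pushBW = B - alpha * (rspt[n] - rspt[k])
            return pushBW > 0 and pushBW >= sizeTotal[k] / (2 * L)

        lMin, lMax = 0.0, sizeTotal[n] / B
        while lMax - lMin > eps:
            L = (lMax + lMin) / 2
            if feasible(L):
                lMax = L
            else:
                lMin = L
        k = try_latency(lMax)
        pullBW = alpha * (rspt[n] - rspt[k])
        return k, pullBW, B - pullBW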
3. EVALUATION
The objective of the experiments is to validate the algorithms introduced in Section 2. In particular, the development of Algorithm 2 made several idealized assumptions about the environment and these assumptions need to be investigated experimentally. The choice of α is a major parameter in the following experiments. We also wish to verify that lower delays are achieved by an integrated algorithm that does both document classification and bandwidth division. Finally, the scalability of various popularity estimation algorithms remains to be verified.
3.1 Methodology
The experimental analysis leverages an existing prototype middleware. The middleware supports the hybrid dissemination scheme utilizing multicast push and unicast pull. It acts as a reverse-proxy to a Web server for the delivery of documents that are materialized views [23]. A simulated client uses the middleware and generates Poisson requests for documents with a Zipf probability distribution. In this paper, we report on the case in which the size of the documents is fixed to 0.5KB, and we have additional evidence suggesting that results are fundamentally the same with variable sized documents. An objective of this evaluation was to isolate algorithm performance from network factors, such as network congestion or routing transients. On the other hand, scalability is asserted when requests are generated by a large number of clients. Our solution was to run both the client and server on the same machine so that network effects would not be visible. (Although the emulation runs on a single machine, the middleware is capable of running in a distributed environment [23].) Aggregate requests from multiple clients were simulated by a background request filler. The filler simulates a specified number of clients, and sends requests to the server. The requests by the filler are treated identically to those made by another distinguished client, except that we record latency only for the requests from the distinguished client. All experiments were run for 10000 requests and figures reflect the average statistics from these runs. The computer used in these simulations was a 2.0GHz
2.2 Report Probabilities
Document selection and bandwidth division rely on estimates p of document popularity. The values of p can be estimated by sampling the client population as follows. The server publishes a report probability s_i for each pushed document i. Then, if a client wishes to access document i, it submits an explicit request for that document with probability s_i. In principle, clients would not need to submit any request for push documents, but if they do send requests with probability s_i, the server can use those requests to estimate p_i. At the same time, the report probability s_i should be small enough that the server is almost surely not going to be overwhelmed with requests for pushed documents. In particular, we consider
dual processor machine with 1.2GB of RAM and running Linux Redhat Version 8.0. JRMS was used for multicasting [24].

Table 2: Simulation Parameters
Parameter              Value               Default
Document Size          0.5K bytes          0.5K bytes
Zipf parameter         1.1 - 2.0           1.5
System Bandwidth       100000 bytes/sec    100000 bytes/sec
Request rate           250 / sec           250/sec
Total items n          100 - 10000         1000
Total Requests Made    10000               10000
α                      1.1 - 4.2           2
ε                      .005                .005

Figure 1: Effects of various α values on average latency
Table 2 describes the parameters used in the experiments. Although we explored the algorithm sensitivity to parameter values within the stated range, we report for compactness only on the default values unless otherwise noted.
3.2 Document Classification and Bandwidth Division Evaluation
Figure 1 shows the effects of various values of α on the average latency of Algorithm 2. The curve in Figure 1 is jagged because an infinitesimal change in α can have a discrete effect on the number of items pushed. Figure 1 shows that the value of α that minimizes average latency is between 2.0 and 3.0. We adopt α = 2.0 in the rest of the paper — although this is not the actual minimum, any value in the range produces similarly good results. Note that as α changes in Figure 1 our system adjusts the bandwidth division and document classification to maintain optimality. This in part explains why the average latency is near optimal for a relatively wide range of α. Figure 2 can be interpreted as a brute force search for a good bandwidth split and document classification by trying several closely spaced values of k and pushBW. In the chart legend, the first number in the bandwidth split refers to pull. In addition to the points plotted in the figure, we verified that if less than half of the bandwidth was devoted to pull, the latency was suboptimal. In this scenario, Algorithm 2 assigns the most popular 7 documents to the push channel, and allocates 63% of the bandwidth to push. The figure shows the algorithm's outcome with a circular point and an arrow pointing to it. The solution produced by our algorithm is better than any other point in the diagram. More specifically, our algorithm chose a split of 63/37 and the closest brute force curve in the figure is the 65/35 curve. The 65/35 line was also the lowest in the graph. Algorithm 2 chose k = 7 as the number of push documents, which is also the minimum point on the 65/35 curve. Thus, Algorithm 2 chose a better bandwidth split than the brute force approach and a document classification that was just as good. Let G(k) be the average latency if the k most popular documents are placed on the push channel. The function G(k) is a weighted average of the average latency for pushed documents and the average latency for pulled documents. A graph showing an idealized G(k) from [25] is shown in Figure 3. The function G(k) has a unique local minimum, which can be found by local search [25]. Figure 3 shows that the minimum of G(k) is to the right of the intersection of the push and pull curves. In this case, pulled documents would have lower latency than pushed documents. The actual curve that we obtained from our experiments is shown in Figure 4. Notice that the minimum of G(k) is to the left of the intersection of the push and pull curves, and thus pushed documents have lower latencies than pulled documents. Further, the minimum of G(k) occurs at a relatively small value of k, and thus complicated hierarchical schemes for the push channel may not be useful in this setting.
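Because G(k) has a single local minimum, the classification step reduces to a short descent over k; the following sketch is only an illustration (G is assumed to be a callable returning the average latency when the k most popular documents are pushed), not the procedure of [25]:

    def best_push_count(G, n):
        # walk k upward while pushing one more document still lowers average latency
        k = 0
        while k < n and G(k + 1) < G(k):
            k += 1
        return k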
Figure 2: Demonstrating the optimality of Algorithm 2 for document classification and bandwidth division. The arrow points to the single point found by the algorithm.
The location of the minimum is due to two complementary reasons. First, the most popular items are chosen for push and are also those to which a Zipf (or Zipf-like) distribution gives substantially more weight. Therefore, if a solution favors multicast push, it will also have the largest impact on the global average delays. Second, the unicast pull curve levels off and, from that point on, the exact choice of k has little impact on pull delay. In other words, pull delays are practically minimized at the point k* where the pull curve flattens out. However, k* precedes the intersection of the pull curve with the push curve, and so the overall minimum occurs before that intersection. In conclusion, Algorithm 2 was shown to be better than the best value returned by a brute force search. Furthermore, the integrated algorithm led to a behavior of the push and pull curves that differs qualitatively and quantitatively from previously published work, e.g., in terms of the relative behavior of push and pull delays.
3.3 Report Probabilities Evaluation
In order to determine the usefulness of our proposed push popularity scheme, we compare it to a solution found in a comparable work to our own. The solution for the push popularity problem proposed in [25] was to occasionally drop each pushed document i off of the push channel so that clients would have to make explicit requests for i. However, there is a danger that these explicit requests for i could overload the server. Thus, in [25] it was recommended that i should be dropped for as short a period of time as possible.
Figure 5: Effect on latency of demoting an item.
Figure 3: Relation of Push and Pull latencies as number of items pushed is changed, according to Stathatos et al.
Figure 4: Relation of Push and Pull latencies as number of items pushed changes according to our experiments
Figure 6: Drop down method versus our probability method.
The shortest possible time that the document can be dropped is one broadcast cycle. However, we show here that even such a short drop disrupts the server, while our proposed method does not suffer from such disruptions. Figure 5 shows the average latencies around the broadcast cycle T when the most popular item is dropped from the push channel. The figure shows a performance degradation for about 5 broadcast cycles. Basically, the graph shows that before the drop occurs, the system is in a steady state of response times. However, once the item is dropped down, the clients are no longer getting requests off the push channel. Instead, they must make requests directly to the server. Based on the Zipf distribution, as mentioned earlier, the bulk of requests were for items that were on the push channel. Therefore, dropping an item down causes a brief but substantial influx of requests to the server. This brief surge causes response times for requests during the given broadcast cycle and a few subsequent cycles to suffer while the server recovers and returns to its steady state. Figure 6 shows the average latency over the next 5 broadcast cycles when the i-th most popular document is dropped from the push channel for one broadcast cycle. The flat line represents the average response time using our method for push popularity. If the most popular document is dropped, then we see a 35% increase in average latency over the next 5 broadcast cycles. If the 6th most popular document is dropped, we see an 8% increase in average latency over the next 5 broadcast cycles. This increase is in comparison to using the simple yet effective scheme we proposed of
simply including a popularity estimator with the broadcast index. In fairness, our proposed method has the disadvantages that it requires extra space in the broadcast index and it slightly increases the request rate at the server during all broadcast cycles.
4. RELATED WORK
Scalable data delivery has often been approached through data caching and replication such as, for example, in client or proxy caches [11, 16], server-side caches [10], and content-delivery networks [22]. Moreover, back-end methods are deployed between a web server and a back-end database server and include web server cache plug-in mechanisms and asynchronous caches [5, 17, 18, 19]. These approaches follow the traditional unicast pull paradigm, whereby data is delivered from the server to each client individually on demand. In turn, the unicast pull approach severely limits the inherent scalability of data delivery. The document classification problem was introduced in [25]. In addition to directly related work, some other work has been done addressing the issue of hot and cold documents and of bandwidth division, though not in the context we are describing. In [1, 14, 2, 26] the issue of mixing pull and push documents together on a single broadcast channel is examined. The idea is that popular documents are similarly considered hot, and are continuously broadcast while all other documents are cold. These documents are requested through a back channel and scheduled for broadcast. Similarly, in [1] the authors discuss how to divide the broadcast
[8] T. Cormen, C. Leiserson, R. Rivest, and C. Stein. Introduction to Algorithms. MIT Press, 2001. [9] G. Cormode and S. Muthukrishnan. What’s hot and what’s not: Tracking frequent items dynamically. In Proceedings of Principles of Database Systems, pp. 296–306, 2003. [10] A. Datta, K. Dutta, K. Ramamritham, H. M. Thomas, and D. E. Vandermeer. Dynamic content acceleration: A caching solution to enable scalable dynamic web page generation. In ACM SIGMOD, 2001. [11] A. Datta, K. Dutta, H. M. Thomas, D. E. VanderMeer, Suresha, and K. Ramamritham. Proxy-based acceleration of dynamically generated content on the world wide web: an approach and implementation. In ACM SIGMOD, pp. 97–108, 2001. [12] M. Franklin and S. Zdonik. “ data in your face ”: Push technology in perspective. In ACM SIGMOD, pp. 516–519, 1998. [13] Y. Guo, M. Pinotti, and S. Das. A new hybrid scheduling algorithm for asymmetric communication systems. ACM SIGMobile Computing and Communications Review, 5(3):123–130, 2001. [14] A. Hall and H. Taubig. Comparing push- and pull-based broadcasting or: Would “microsoft watches” profit from a transmitter? LCNS, 2647, January 2003. [15] J. Jannotti, D. Gifford, K. Johnson, M. Kaashoek, and J. O’Toole, Jr. Overcast: Reliable multicasting with an overlay network. In OSDI, pp. 197–212, 2000. [16] S. Jin and A. Bestavros. Temporal locality in Web request streams: sources, characteristics, and caching implications. In SIGMETRICS, pp. 110–111, 2000. [17] A. Labrinidis and N. Roussopoulos. Webview materialization. In ACM SIGMOD, pp. 367–378, 2000. [18] A. Labrinidis and N. Roussopoulos. Webview balancing performance and data freshness in web database servers. In VLDB, pp. 393–404, 2003. [19] Q. Luo, S. Krishnamurthy, C. Mohan, H. Woo, H. Pirahesh, B. G. Lindsay, and J. F. Naughton. Middle-tier database caching for e-business. In ACM SIGMOD, pp. 600–611, 2002. [20] J. Nonnenmacher and E. W. Biersack. Scalable feedback for large groups. IEEE/ACM Trans. Netw., 7(3):375–386, 1999. [21] V. Padmanabhan and L. Qiu. The context and access dynamics of a busy web site: Findings and implications. In ACM SIGCOMM’00, pp. 111–123, 2000. [22] V. S. Pai, L. Wang, K. Park, R. Pang, and L. Peterson. The dark side of the Web: An open proxy’s view. In HotNets-II, 2004. [23] V. Penkrot, J. Beaver, M. Sharaf, S. Roychowdhury, W. Li, W. Zhang, P. Chrysanthis, K. Pruhs, and V. Liberatore. An optimized multicast-based data dissemination middleware: A demonstration. In ICDE 2003, pp. 761–764, 2003. [24] P. Rosenzweig, M. Kadansky, and S. Hanna. The java reliable multicast service: A reliable multicast library. SMLI TR-98-68, Sun Microsystems, 1998. [25] K. Stathatos, N. Roussopoulos, and J. S. Baras. Adaptive data broadcast in hybrid networks. In VLDB, pp. 326–335, 1997. [26] P. Triantafillou, R. Harpantidou, and M. Paterakis. High performance data broadcasting systems. Mobile Networks and Applications, 7:279–290, 2002. [27] W. Zhang, W. Li, and V. Liberatore. Application-perceived multicast push performance. In IPDPS, 2004.
channel bandwidth between hot and cold documents. The main difference between previous work and ours is that previous work deals with a broadcast environment with a single channel and focuses on scheduling items, not how to divide them into hot and cold. We are looking into the division of both documents and bandwidth to minimize latency. The hybrid scheme relies on estimates of the popularity of documents in the web site because popularity determines the assignment of documents to dissemination modes. Popularity estimation can be approached separately for pulled and for pushed documents. Pull popularity can be solved in sub-linear space by monitoring the client request stream [9]. As for push popularity, the problem is complicated by the absence of a client request stream. One solution is to occasionally drop each pushed document from the push channel, thus forcing clients to send explicit requests. Such requests can then be counted and the document popularity estimated [25]. A related problem is multicast group estimation [20], which can be specialized as follows in our context: remove a document from the multicast push channel and re-insert it as soon as the first request for that document is received. The document popularity can be estimated by the length of time it takes for the first client request to reach the server.
5. CONCLUSION
In this paper we examined three data management problems that arise at the server in a hybrid data dissemination scheme. We argued that the document classification problem and bandwidth division problem should be solved in an integrated manner. We then presented a simple, yet essentially optimal, algorithm for the integrated problem. We validated the optimality of our algorithm experimentally. We proposed solving the push popularity problem by having each client request a hot document D i with some probability si , which the server sets in the push index. We looked at the difference in using our push popularity scheme versus using a scheme which simply drops an item off the push channel in order to test its popularity. We showed that dropping an item off the hot channel for one broadcast cycle can appreciably increase the average latency for approximately five broadcast cycles. Our proposed scheme does not suffer from such disruptions.
6. REFERENCES
[1] S. Acharya, M. Franklin, and S. Zdonik. Balancing push and pull data broadcast. In ACM SIGMOD, pp. 183–194, 1997. [2] D. Aksoy and M. Franklin. Rxw: A scheduling approach for large-scale on-demand data broadcast. ACM/IEEE Transactions on Networking, 7(6):846–860, 1999. [3] K. C. Almeroth, M. H. Ammar, and Z. Fei. Scalable delivery of Web pp. using cyclic best-effort (UDP) multicast. In INFOCOM, pp. 1214–1221, 1998. [4] M. Altinel, D. Aksoy, T. Baby, M. Franklin, W. Shapiro, and S. Zdonik. Dbis toolkit: Adaptable middleware for large scale data delivery. In ACM SIGMOD, pp. 544–546, 1999. [5] M. Altinel, Q. Luo, S. Krishnamurthy, C. Mohan, H. Pirahesh, B. G. Lindsay, H. Woo, and L. Brown. Dbcache: Database caching for web application servers. In ACM SIGMOD, page 612, 2002. [6] Y. Azar, M. Feder, E. Lubetzky, D. Rajwan, and N. Shulman. The multicast bandwidth advantage in serving a web site. In 3rd NGC, pp. 88–99, 2001. [7] P. Chrysanthis, K. Pruhs, and V. Liberatore. Middleware support for multicast-based data dissemination: a working reality. In WORDS, 2003.
Semantic Multicast for Content-based Stream Dissemination 1
Olga Papaemmanouil
Uğur Çetintemel
Department of Computer Science Brown University Providence, RI, USA
Department of Computer Science Brown University Providence, RI, USA
[email protected]
[email protected]
ABSTRACT We consider the problem of content-based routing and dissemination of highly-distributed, fast data streams from multiple sources to multiple receivers. Our target application domain includes real-time, stream-based monitoring applications and large-scale event dissemination. We introduce SemCast, a new semantic multicast approach that, unlike previous approaches, eliminates the need for content-based forwarding at interior brokers and facilitates fine-grained control over the construction of dissemination overlays. We present the initial design of SemCast and provide an outline of the architectural and algorithmic challenges as well as our initial solutions. Preliminary experimental results show that SemCast can significantly reduce overall bandwidth requirements compared to traditional event-dissemination approaches.
1. INTRODUCTION There is a host of existing and newly-emerging applications that require sophisticated processing of high-volume, real-time streaming data. Examples of such stream-based applications include network monitoring, large-scale environmental monitoring, and real-time financial services and enterprises. Many stream-based applications are inherently distributed and involve data sources and consumers that are highly dispersed. As a result, there is a need for data streams to be routed, based on their contents, from their sources to the destinations where they will be consumed. Such content-based routing differs from traditional IP-based routing schemes in that routing is based on the data being transmitted rather than any routing information attached to it. In this model, sources generate data streams according to application-specific stream schemas, with no particular destinations associated with them. The destinations are independent of the producers of the
This work was supported by the National Science Foundation under grant ITR-0325838.
messages and are identified by the consumers’ interests. Consumers express their interests using declarative specifications, called profiles, which are typically expressed as query predicates over the application schemas. The goal of the content-based routing infrastructure is to efficiently identify and route the relevant data to each receiver. In this paper, we present an overview of SemCast, an overlay network-based substrate that performs efficient and Quality-of-Service (QoS)-aware content-based routing of high-volume data streams. SemCast creates a number of semantic multicast channels for disseminating streaming data. Each channel is implemented as an independent dissemination tree of brokers (i.e., routers) and is defined by a specific channel content expression. Sources forward the streaming data to one or more channels. Destination brokers listen to one or more channels that collectively cover their clients’ profiles. The key advantages of SemCast are twofold: First, SemCast requires content-based filtering only at the source and destination brokers. As each message enters the network, it is mapped to a specific semantic multicast channel and is forwarded to that channel. As a result, the routing at each interior broker corresponds to simply reading the channel identifier and forwarding the message to the corresponding channel, thereby eliminating the need to perform potentially expensive content-based filtering. This approach differs from traditional content-based routing approaches [4, 5] that commonly rely on distributed discrimination trees: starting from the root, at each level of the tree, a predicate-based filtering network filters each incoming message and identifies the outgoing links to which the message should be forwarded. Even though a large body of work has focused on optimizing local filtering using intelligent data structures [3, 6, 9, 11, 13], the forwarding times in the presence of a large number of profiles are typically in the order of tens of milliseconds or higher (depending on the expressiveness of the data and profile model, and the number of profiles) [3, 6]. As a result, forwarding costs can easily dominate overall dissemination latencies, as well as excessively consume broker processing resources. The problem becomes more pronounced when data needs to be compressed for transmission, as each broker will then have to incur the cost of decompression and recompression, which are commonly employed when transmitting XML data streams [15].
Second, SemCast semantically splits the incoming data streams and sends each of the sub-streams through a different dissemination channel. This approach makes the system quite flexible and adaptive, as different types of dissemination trees can be dynamically created for each dissemination channel, depending on the QoS required by the clients of the system. On the other hand, existing approaches commonly rely on predetermined overlay topologies, assuming that an acyclic undirected graph (e.g., a shortest path tree) of the broker network is determined in advance [4, 5]. As we demonstrate, these approaches fail to recognize many opportunities for creating better optimized dissemination trees. Moreover, as shown by the recent work [7, 12], using meshes or multiple trees can significantly increase the efficiency and effectiveness of data dissemination. In the SemCast approach, different channels are created and the content of each channel is identified based on stream and profile characteristics. Each channel is implemented as a multicast tree, and clients subscribe to the channels by joining the corresponding multicast trees. The system gathers statistics in order to adapt to dynamic changes in the profile set and stream rates. SemCast uses such statistical information to periodically reorganize the channels’ contents, striving to improve the overall bandwidth efficiency of the system. The rest of the paper is structured as follows. In Section 2, we present the basic design and architecture of SemCast. Section 3 outlines the algorithmic challenges and presents our content-based channelization algorithm. In Section 4, we discuss our heuristic for QoS–aware tree construction. In Section 5, we present preliminary experimental results that characterize the bandwidth savings that can be achieved by SemCast over traditional content-based routing approaches. Section 6 briefly discusses prior related work and, finally Section 7 concludes with a brief discussion of our contributions and directions for future research.
2. SYSTEM MODEL SemCast disseminates XML streams2 consisting of a sequence of XML messages, where each XML message is a single XML document. Subscribers of the system express their profiles using the XPath query language [17]. Each client profile is also associated with a QoS specification that expresses the client’s service expectations [1]. A QoS specification is a constraint on a specific metric of interest: our initial concern is staleness. The system consists of a set of brokers organized as a P2P overlay network. Each publisher and subscriber is connected to some broker in the network; we refer to these brokers as source brokers and gateway brokers, respectively. Brokers with no publishers or subscribers connected to them are called interior brokers. SemCast also includes a coordinator node. The coordinator is responsible for maintaining the dissemination channels and deciding their content. Specific brokers of the overlay will also serve as rendezvous points (RPs). Each RP is responsible for at least one channel and serves as the root of the corresponding multicast tree. This network model is shown in Figure 1. Sources receive from the coordinator the current XPath channel content expressions. Any modification to these 2
Our approach is not restricted to XML streams and is also readily applicable to relational data.
Figure 1: Basic SemCast system model
expressions is pushed to the sources to keep them up-to-date. Sources filter incoming messages according to each channel’s content expression and decide to which channels to forward them. They forward these messages to the corresponding RPs, which are responsible for efficiently multicasting them to the subscribed clients through the broker network. Each broker maintains a simple routing table whose entries map channel identifiers to the descendant brokers. Upon receipt of a message addressed to a specific channel, a simple lookup is performed on the channel’s identifier and the message is forwarded to the returned descendant list. Each gateway broker maintains a local filtering table whose entries map the subscribers’ profiles to their IP addresses. Received messages are matched against the profiles of the filtering table and are forwarded only to interested clients. This local filtering ensures that end clients do not receive irrelevant data.
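A minimal sketch of these two table lookups follows (illustrative Python; the send and matches callables are assumed to be supplied by the transport and filtering layers):

    class InteriorBroker:
        def __init__(self):
            self.routing = {}            # channel id -> list of descendant brokers

        def forward(self, channel_id, message, send):
            # forwarding needs no content inspection: just a lookup on the channel id
            for broker in self.routing.get(channel_id, []):
                send(broker, channel_id, message)

    class GatewayBroker:
        def __init__(self):
            self.filtering = []          # list of (profile predicate, subscriber address)

        def deliver(self, message, send, matches):
            # local filtering: only subscribers whose profiles match receive the message
            for profile, address in self.filtering:
                if matches(profile, message):
                    send(address, message)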
3. CONTENT-BASED CHANNELIZATION SemCast addresses the problem of content-based channelization of data streams. The problem requires determining: (1) how many channels to use, (2) the content of each channel, (3) which channels to use for each incoming message, and (4) which channels each subscriber should listen to. SemCast creates semantic channels on the fly: the number and content of the channels depend on the overlap of the subscribers’ profiles as well as stream characteristics (e.g., rates) and thus may change dynamically. SemCast has two primary operational goals while performing channelization. First, it ensures that there are no false exclusions: the subscription of gateway brokers to semantic channels guarantees that every XML message will be delivered to all the end clients with matching profiles. Second, SemCast strives to minimize the run-time cost: the cost metric we use here is the overall bandwidth consumed by the system. To minimize the run-time cost, we focus on (1) reducing any redundant messages reaching the channels and (2) creating efficient dissemination trees. In order to reduce data redundancy, our channelization algorithm strives to minimize
the overlap between the channel contents. Moreover, a decentralized incremental algorithm for constructing low cost multicast trees is used, in order to reduce bandwidth consumption. We distinguish two phases during SemCast’s operation: initial channelization and dynamic channelization. The former is used when there are no statistics available. As a result, initial channelization is based solely on syntactic similarities among profiles. On the other hand, dynamic channelization is used to reorganize the channels’ contents and membership, and is based on statistical information obtained from the system at run time. In the rest of the section, we describe these two phases in more detail.
3.1 Initial Channelization In SemCast, the initial content assignment to channels is based on a syntax-based containment function, which identifies containment relations among profiles. To achieve this, we use existing algorithms for identifying containment relations among XPath queries [18]3. We say that an XPath expression P1 covers (or contains) another expression P2 if and only if any document matching P2 also matches P1. Covering relationships for simple conjunctive queries can be evaluated in O(nm) time, where n and m are the number of nodes of the two tree patterns representing the XPath expressions [18]. Upon receipt of a new profile from a client, a gateway broker GB checks if the client can be served by the channels to which GB is already subscribed. Profiles not covered by these channels are sent to the coordinator. Each new profile, Pnew, received at the coordinator will be assigned to a channel whose content expression covers Pnew. If such a channel does not exist, a new channel is created for Pnew. If there are multiple channels that cover Pnew, then Pnew is assigned to the channel with the minimum incoming stream rate. This approach strives to assign profiles to a channel that disseminates the least relevant content to the gateway brokers. The coordinator informs the GB of the channel C it should subscribe to, and the GB joins the corresponding multicast tree. We then say that Pnew is assigned to channel C. As profiles are assigned to channels based on containment relations, syntax-based containment hierarchies among profiles are materialized and maintained. In these hierarchies, each profile has as ancestor a more general profile and as descendants more specific profiles. More specifically, each profile P has as parent a profile covering P, and as children profiles covered by P. Among multiple candidate parents, the most specific one (based on syntax-based containment function) is chosen. This implies that if Pi covers Pj and Pj covers Pk, the candidate parents for Pk are Pi and Pj, and Pj is chosen as Pk’s ancestor.
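The assignment rule can be illustrated with a short sketch (the covers predicate stands for the XPath containment test of [18]; the Channel type and field names are placeholders of the sketch, not the actual implementation):

    from dataclasses import dataclass

    @dataclass
    class Channel:
        content: str      # channel content expression (e.g., an XPath)
        rate: float       # current incoming stream rate for the channel

    def assign_profile(profile, channels, covers):
        # candidates are the channels whose content expression covers the new profile
        candidates = [c for c in channels if covers(c.content, profile)]
        if not candidates:
            new_channel = Channel(content=profile, rate=0.0)   # no covering channel: create one
            channels.append(new_channel)
            return new_channel
        # among covering channels, pick the one disseminating the least traffic
        return min(candidates, key=lambda c: c.rate)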
3.2 Dynamic Channelization The containment algorithms used in the initial channel assignment are based on the structure of XPath expressions. However, no conclusions can be inferred for the overlap of matching documents when no containment relation is found among a profile and a channel content expression. This implies that channels may not disseminate disjoint sets of documents, leading to multiple copies of the same document being sent 3
In the case of relational data, where profiles are (attribute, value) pairs, containment relations are defined in [5].
through different channels. Thus, initial channelization based on syntactic analysis alone is unlikely to produce good results. SemCast continuously re-evaluates its channel assignment decisions based on run-time stream rates and profile similarities, attempting to minimize the overall bandwidth consumption. To implement this, SemCast creates rate-based containment hierarchies existing among profiles. Periodically, SemCast creates the rate-based hierarchies to decide whether the channels’ contents and membership should be reorganized. Rate-based hierarchies reflect how the channels’ membership and content should be organized, based on the latest stream rates and profiles overlap. If the current state of SemCast “sufficiently” differs from the state shown in the rate-based hierarchies, channel reorganization takes place. In the following, we explain the dynamic channelization phase in more detail.
3.2.1 Rate-Based Containment Hierarchies
We now define a partial overlap metric between profiles Pi and Pj: we say that Pi k-overlaps with Pj, denoted Pi ⊆k Pj, if the ratio of the number of messages matching Pj to that matching Pi is k. Obviously if k=1, then Pj contains (or covers) Pi, and the set of messages matching Pi is a subset of those matching Pj. In order to reduce the overlap among the contents of the channels, SemCast places profiles matching similar messages under the same channel. Thus, we construct the rate-based hierarchies such that each profile Pj is an ancestor of Pi if Pj covers Pi. Among multiple candidate parents for Pi, SemCast chooses the one with (1) higher overlapping part with Pi and (2) lower stream rate of messages matching the non-overlapping part between Pi and the candidate parent. Using the k-overlap metric defined above, the best parent Pj is the one with Pi ⊆1 Pj and

    k = max_{Pj, j ≠ i} { k / r_{j−i} | Pj ⊆k Pi },
where r_{j−i} is the stream rate matching the non-overlapping part between Pi and Pj. To create the rate-based hierarchies, SemCast maintains information about the selectivity of the profiles in a selectivity matrix. This matrix is implemented as a sliding window of the incoming messages and maintained by the sources. The size of this window should be large enough to collect statistics with high confidence. Incoming messages are assigned weights. New messages have higher importance, and as different incoming messages are added in the sliding window, the importance of the old ones decays and their weight decreases. By holding a weighted sliding window, we base our statistics on the most recent messages, allowing a graceful aging of messages. Each row in this matrix refers to a different incoming message and each column to a different profile. Whenever the source creates a message matching a channel, the channel’s containment hierarchy is parsed in a top-down manner. At each step, we check whether the message matches the current profile. If it does, we continue with its children, otherwise we stop. If a match is found for a profile, the corresponding entry in the selectivity matrix is set to one, otherwise it remains zero. We note that although an array of size p x w has to be stored (where p is the number of profiles and w the size of the sliding window), only the profiles of the matching channels are updated each time. However, only a part of these profiles will be updated for each new message, since the match check will not always reach the leaves of the hierarchy.
Profile overlap estimates. SemCast calculates the pairwise partial overlap metric of the profile set in order to identify the current rate-based hierarchies. An overlap matrix is computed from the selectivity matrix. Each entry Oij represents the k-overlap metric for the profiles Pi and Pj. One example is shown in Figure 2 and Table 1, where we assume that r_{j−i} is equal to 1. The dashed line shows that P1 is a candidate parent for P6, but it is not its final parent because of the low overlap they share. The following rules apply for the overlap matrix: (1) the root profiles of the hierarchies Pj have a single ‘1’ in the j-th column at the entry Ojj; (2) candidate parents Pj of each non-root profile Pi have Oji = 1, i ≠ j; (3) the best candidate parent of Pi is Pj such that

    j = max_{j, i ≠ j} { Oij / r_{j−i} | Oji = 1 }.
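The following sketch illustrates one way to derive the overlap matrix from the selectivity matrix and to apply rules (2) and (3); the orientation of O (rows index the covering profile, columns the covered one) and the rate matrix r for the non-overlapping parts are assumptions of the illustration:

    def overlap_matrix(sel):
        # sel[m][i] = 1 if window message m matches profile P_i, else 0
        n = len(sel[0])
        matches = [sum(row[i] for row in sel) for i in range(n)]
        O = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                both = sum(row[i] * row[j] for row in sel)
                # O[j][i]: fraction of P_i's matching messages that also match P_j
                O[j][i] = both / matches[i] if matches[i] else 0.0
        return O

    def best_parent(i, O, r):
        # rule (2): candidates are the P_j that fully cover P_i; rule (3): prefer
        # high overlap O[i][j] and a low rate r[j][i] for the non-overlapping part
        candidates = [j for j in range(len(O)) if j != i and O[j][i] == 1.0]
        if not candidates:
            return None          # P_i is the root of its own hierarchy
        return max(candidates, key=lambda j: O[i][j] / r[j][i])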
Once the coordinator has decided on the new rate-based hierarchy, it identifies if there are new channels to be created or removed and which profiles should migrate to other channels. This is a straightforward procedure involving comparisons between the root profiles of the current channels and the roots of the new rate-based hierarchies. Brokers are informed by the coordinator about the new channels they should subscribe to or unsubscribe from, and sources receive the new channel list and their updated content expressions.

Table 1. Overlap matrix
      P1    P2    P3    P4    P5    P6    P7
P1    1     0     0     0     1     1     0
P2    0     1     0.4   0.5   0     0     0
P3    0     1     1     0.6   0     0     0
P4    0     1     1     1     0     1     1
P5    0.8   0     0     0     1     0     0
P6    0.2   0     0     0.4   0     1     1
P7    0     0     0     0.2   0     0.5   1
Figure 2. Containment hierarchy from Table 1

3.2.2 Hierarchy Merging
Highly diverse profiles could result in a large number of channels. This scenario may arise when only a small percentage of the profiles is covered by another. At the extreme case, the number of channels could be equal to the number of distinct profiles in the system. In this case, the system will not be able to exploit the actual overlap among the subscribers’ interests and will create independent paths for potentially each profile regardless of common messages. Moreover, the number of channels and space requirements for routing tables will increase. To address this problem, SemCast places in the same channel the containment hierarchies that overlap. However, hierarchy merging could increase redundancy, as gateway brokers attached to the final channel might receive extra messages matching the non-overlapping part between the merged hierarchies. For this reason, SemCast merges only those channels with a content overlap above a predefined threshold. Moreover, the channels to be merged should have low stream
rate for their non-overlapping parts. This strategy attempts to allow merging operations which cause the least possible data redundancy among channels. To implement this, SemCast consults the overlap matrix. We compare two root profiles Pj and Pi and check their partial overlap, given by Oij. If this overlap is above the merging threshold, then the hierarchies of Pi and Pj are placed on the same dissemination channel. The root profile of the channel is set to the disjunction of Pi and Pj. We note that the dynamic channelization implicitly contains the hierarchy splitting operation. Assume that by the time of the next reorganization phase, the selectivity of P2 in Figure 2 has changed (or even the rate of its matching messages) and it is no longer covered by any profile. In this case the first hierarchy will split into two, one with P4 as the root and one with P2, creating three final hierarchy trees (and possibly three channels).
4. DISSEMINATION TREE CONSTRUCTION
SemCast relies on a distributed and incremental approach to create two different types of dissemination trees, low-cost or delay-bounded trees, which we briefly describe in the rest of this section. Low-cost trees. In order to connect to a channel, a gateway broker, GB, first receives a list of the current destinations in the channel from the coordinator. GB then finds the min-cost path to each destination and connects to the channel through the closest one. Join requests are sent all the way from the GB to the first node in the tree and the routing tables of the brokers on this route are updated to reflect the new tree structure. In order to reduce message replication, we try to reduce the number of edges in the trees. This algorithm is essentially a distributed and incremental adaptation of centralized Steiner tree construction algorithms (e.g., [16]). Delay-bounded trees. SemCast can also create delay-bounded dissemination trees if clients indicate latency expectations (as part of their QoS specifications). As before, a GB first finds a connection point, CP, in the tree using the low-cost heuristic described earlier. All brokers continuously track their “distance” from the root of the channels they maintain. If the CP realizes that it will not be able to meet the latency constraint tagged to the join request, it will forward the join request to its parent. This process will continue until either a broker that is sufficiently close to the root is found or it is decided that the constraint cannot be met. In the latter case, the client will be notified and, optionally, the GB will be connected through the shortest path to the root.
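A rough sketch of the low-cost variant, in the spirit of the incremental Steiner heuristic cited above (illustrative only; shortest_path and path_cost are assumed to be provided by the overlay routing layer):

    def grow_low_cost_tree(root, joining_gateways, shortest_path, path_cost):
        # Each gateway broker attaches through its cheapest path to the broker
        # that is already part of the channel's dissemination tree.
        tree_nodes = {root}
        tree_edges = set()
        for gb in joining_gateways:
            cp = min(tree_nodes, key=lambda node: path_cost(gb, node))
            path = shortest_path(gb, cp)          # e.g., [gb, b1, ..., cp]
            for u, v in zip(path, path[1:]):
                tree_edges.add((u, v))            # routing tables along the path get updated
                tree_nodes.add(u)
                tree_nodes.add(v)
        return tree_nodes, tree_edges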
5. PRELIMINARY EVALUATION We performed a preliminary evaluation of SemCast using simulations over broker networks, generated using GT-ITM [19]. In the experiments, we used networks of up to 700 nodes and 7000 profiles. In order to study the impact of profile similarities, we use a parameter called the profile selectivity factor, which determines the number of distinct nonoverlapping profiles. Unless otherwise noted, no partial overlap exists among channels; so there is no need to merge hierarchies. We simulated four different approaches to content-based dissemination. First, we simulated an approach that models a traditional, distributed Publish-Subscribe system (such as SIENA). In this model, profiles are propagated and aggregated
Figure 3. Relative bandwidth efficiency for varying profile similarity factors (y-axis: % cost degradation; x-axis: profile selectivity factor (%); curves: Unicast, SPT, SemCast Heuristic)
Figure 4. Relative bandwidth efficiency for different network sizes (y-axis: % cost degradation; x-axis: network size, with profile selectivity factor = 0.7%; curves: Unicast, SPT, SemCast Heuristic)
up to the source over shortest paths from the clients to the source. In other words, we created a predicate-based overlay filtering tree on top of a shortest path spanning tree of the broker network. Filter-based routing tables are placed at each broker and messages are matched at each hop against the filters in the filter table to determine the next hop(s) towards the clients. We refer to this algorithm as SPT. Second, we simulated our tree construction heuristic described in the previous section. Third, we simulated a centralized Steiner tree construction algorithm [16] where the coordinator knew a priori all the profiles and the destinations for each channel. This approach provides a good approximation to the optimal solution; we therefore refer to it as the Optimal approach. Finally, we simulated a Unicast approach, where messages are filtered at the sources and are forwarded to the interested parties using shortest paths. This last approach models a centralized filtering system that uses unicast dissemination of filtered messages. Basic Performance. Figure 3 shows the cost degradation (i.e., extra bandwidth consumption) over the optimal algorithm for the other three approaches for 700 nodes. SemCast performs the closest to the optimal, incurring up to only 6.4% higher cost on the average. SPT performs better than Unicast; it does, however, increase the cost by 23% on the average, whereas Unicast’s average cost increase is 41%. Figure 4 shows that the cost degradation increases with increasing network size for Unicast and SPT. At the same time, SemCast’s heuristic is able to keep the bandwidth consumption lower than 8.5% of the optimal solution, for all network sizes. Benefits of Dynamic Reorganization. We also measured how dynamic reorganization can improve bandwidth consumption when there is partial overlap among channels and, thus, hierarchy merging is used to reduce data redundancy. We observed that, using a merging threshold of 40% partial overlap among hierarchies can reduce the cost up to 23%, compared to SPT. As the partial overlap among profiles increases, SemCast also improves its bandwidth efficiency. In cases where a profile is contained by another with probability less than 0.7, SemCast has a cost improvement up to 21% (with a 30% merging threshold and 1% profile selectivity factor). We also conducted experiments with varying number of profiles and selectivity factors. The results show that even with large profile sets, SemCast can significantly improve bandwidth
efficiency compared to the other approaches. Finally, as the profile selectivity factor in the system increases, we can dynamically set the merging threshold to higher values and continue to achieve significant cost improvements. QoS-Aware Dissemination. We also studied SemCast’s performance when subscribers express latency constraints. Here, there is a fundamental tradeoff between dissemination costs and latencies: the stricter the delay expectations are, the higher the cost of dissemination is. Our results clearly demonstrated this tension, but also showed that the cost increase due to the latency constraints typically does not go beyond 10%, even with low profile selectivity factors (i.e., 1%). Due to space limitations, we do not include these results.
6. RELATED WORK
XML filtering. A large body of work has focused on optimizing local filtering using intelligent data structures. For instance, XFilter [3] uses indexing mechanisms and converts queries to finite state machine representations to create matching algorithms for fast location and evaluation of profiles. In [9], a space-efficient index structure that decomposes tree patterns into collections of substrings and indexes them using a trie is used to improve the efficiency of filtering XML data. Content-based routing. Unlike SemCast, existing approaches for content-based networking do not address the issue of constructing the dissemination overlays. Gryphon [4] assumes that the best paths connecting the brokers are known a priori. Likewise, Siena [5] assumes that an acyclic spanning tree of the brokers is given. Application-level multicast. Many recent research proposals (e.g., [8, 14]) employ application-level multicast for delivering data to multiple receivers. Members of the multicast group self-organize into an overlay topology over which multicast trees are created. In all these cases, members of a group have the same interests. Similar to SemCast, SplitStream [7] aims to achieve high-bandwidth data dissemination by striping content across various multicast trees. However, SplitStream does not address profile-based dissemination and channelization. Bullet [12] uses an overlay construction algorithm for creating a mesh over any distribution tree, improving the bandwidth throughput of the participants. As a result, SemCast can utilize the techniques introduced in this work to improve its bandwidth efficiency. XML-based filtering using overlays was investigated in [15]. This work uses
multiple interior brokers in the overlay to redundantly transmit the same data from source to destination, reducing loss rates and improving delivery latency. Channelization. The channelization problem was studied in [2] in the context of simple traffic flows and a fixed number of channels. The problem was shown to be NP-complete. In our model, the fact that the number of multicast groups is not fixed, and is dependent on the overlap of the receivers’ interests, makes the problem more challenging. Finally topic-based semantic multicast was introduced in [10]. In this work, users express their interests by specifying a specific topic area. Similarity ontologies are used to discover more general topics served by existing channels. This work does not consider content-based routing, and the focus is not on minimizing bandwidth consumption.
7. CONCLUSIONS & FUTURE WORK
We introduced a semantic multicast system, called SemCast, which implements content-based routing and dissemination of data streams. The key idea is to split the data streams based on their contents and spread the pieces across multiple multicast channels, each of which is implemented as an independent dissemination tree. SemCast continuously reevaluates its channelization decisions (i.e., the number of channels and the contents of each channel) based on current stream rates and profile similarities. The key advantages of the SemCast approach over traditional approaches are that (1) it does not require multi-hop content-based forwarding of streams, and (2) it facilitates fine-grained control over the construction of dissemination trees. Preliminary experimental results show that SemCast can significantly improve the bandwidth-efficiency of dissemination, even when the similarity among client profiles is low. Currently, we are finalizing SemCast’s design and pertinent channelization protocols. There are several directions in which we plan to extend our current work. First, we plan to extend our QoS notion to include data quality ― the idea is to have multiple channels carrying the same data stream at different “resolutions”. Second, we would like to extend our basic channelization approach to also take into account the network topology and physical locality of the brokers, in order to be able to create better optimized dissemination trees. Finally, we plan to implement a SemCast prototype to verify the practicality and efficiency of the approach as well as to demonstrate the claimed processing cost benefits for the interior brokers.
8. REFERENCES
[1] D. Abadi, D. Carney, U. Cetintemel, M. Cherniack, C. Convey, C. Erwin, E. Galvez, M. Hatoun, J. Hwang, A. Maskey, A. Rasin, A. Singer, M. Stonebraker, N. Tatbul, Y. Xing, R. Yan, and S. Zdonik. Aurora: A Data Stream Management System. In Proc. of the International SIGMOD Conf., 2003.
[2] M. Adler, Z. Ge, J. F. Kurose, D. Towsley, and S. Zabele. Channelization problem in large scale data dissemination. In ICNP'01, November 2001.
[3] M. Altinel and M. J. Franklin. Efficient Filtering of XML Documents for Selective Dissemination of Information. In 26th VLDB Conference, 2000.
[4] G. Banavar, T. Chandra, B. Mukrerjee, J. Nagarajarao, R. Strom, and D. Sturman. An Efficient Multicast Protocol for Content-Based Publish-Subscribe Systems. In 19th ICDCS, May 1999.
[5] A. Carzaniga, D. S. Rosenblum, and A. L. Wolf. Design and Evaluation of a Wide-Area Event Notification Service. ACM Transactions on Computer Systems, 19(3):332-383, August 2001.
[6] A. Carzaniga and A. L. Wolf. Forwarding in a Content-Based Network. In SIGCOMM, 2003.
[7] M. Castro, P. Druschel, A.-M. Kermarrec, A. Nandi, A. Rowstron, and A. Singh. SplitStream: High-bandwidth content distribution in cooperative environments. In 19th ACM Symposium on Operating Systems Principles, October 2003.
[8] M. Castro, P. Druschel, A.-M. Kermarrec, and A. Rowstron. SCRIBE: A large-scale and decentralized application-level multicast infrastructure. IEEE Journal on Selected Areas in Communications (JSAC), 20(8), October 2002.
[9] C.-Y. Chan, P. Felber, M. N. Garofalakis, and R. Rastogi. Efficient Filtering of XML Documents with XPath Expressions. VLDB Journal, Special Issue on XML, 11(4):354-379, 2002.
[10] S. Dao, E. Shek, A. Vellaikal, R. R. Muntz, L. Zhang, M. Potkonjak, and O. Wolfson. Semantic multicast: intelligently sharing collaborative sessions. ACM Computing Surveys (CSUR), 31(2es), 1999.
[11] Y. Diao, M. Franklin, P. Fischer, and R. To. YFilter: Efficient and Scalable Filtering of XML Documents (Demonstration Description). In ICDE, 2002.
[12] D. Kostic, A. Rodriguez, J. Albrecht, and A. Vahdat. Bullet: High Bandwidth Data Dissemination Using an Overlay Mesh. In 19th ACM Symposium on Operating Systems Principles, October 2003.
[13] L. Opyrchal, M. Astley, J. Auerbach, G. Banavar, R. Strom, and D. Sturman. Exploiting IP Multicast in Content-Based Publish-Subscribe Systems. In IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), April 2000.
[14] S. Ratnasamy, M. Handley, R. Karp, and S. Shenker. Application-Level Multicast using Content Addressable Networks. In 3rd International Workshop on Networked Group Communication (NGC '01), November 2001.
[15] A. C. Snoeren, K. Conley, and D. K. Gifford. Mesh-Based Content Routing using XML. In 20th ACM Symposium on Operating Systems Principles, 2001.
[16] H. Takahashi and A. Matsuyama. An Approximate Solution for the Steiner Tree Problem in Graphs. Mathematica Japonica, 1980.
[17] W3C. XML Path Language (XPath) 1.0. 1999.
[18] P. T. Wood. Containment for XPath fragments under DTD constraints. In 9th International Conference on Database Theory, January 2003.
[19] E. Zegura, K. Calvert, and S. Bhattacharjee. How to Model an Internetwork. In INFOCOM '96, 1996.
Twig Query Processing over Graph-Structured XML Data

Zografoula Vagena, Mirella M. Moro, Vassilis J. Tsotras
University of California, Computer Science & Engineering, Riverside, CA 92521, USA
[email protected], [email protected], [email protected]
ABSTRACT
XML and semi-structured data is usually modeled using graph structures. Structural summaries, which have been proposed to speed up XML query processing, have graph forms as well. The existing approaches for evaluating queries over tree-structured data (i.e. data whose underlying structure is a tree) are not directly applicable when the data is modeled as a general graph. Moreover, they cannot be applied when structural summaries are employed and, to the best of our knowledge, no analogous techniques have been reported for this case either. As a result, the potential of structural summaries is not fully exploited. In this paper, we investigate query evaluation techniques applicable to graph-structured data. We propose efficient algorithms for the case of directed acyclic graphs, which appear in many real world situations. We then tailor our approaches to handle other directed graphs as well. Our experimental evaluation reveals the advantages of our solutions over existing methods for graph-structured data.
Categories and Subject Descriptors H.3.3 [Information Search and Retrieval]
1. INTRODUCTION
∗ This research is partially supported by CAPES, NSF IIS0339032, and Lotus Interworks under UCMicro grant 47938.
The widespread acceptance and employment of XML in B2B applications as the standard for data exchange among heterogeneous sources has created a demand for efficient storage and retrieval of XML data. Contrary to relational data, XML data is self-describing and irregular. Hence, it belongs to the category of semi-structured data, inheriting its graph-structure model. The absence of a schema in XML has led to the employment of structural summaries [14, 24, 13, 20, 18, 9, 7, 28], derived from the data, in order to facilitate tasks that would benefit from the existence of a schema, such as query formulation. These structural summaries can play an important role in query evaluation, since they make it possible to answer queries directly from them, instead of considering the original data, which is potentially much larger. In general, they have graph forms, even when derived from tree structures.

Query languages proposed for XML data, such as XQuery [5] and XPath [4], consider the inherent graph structure and lack of schema and permit querying both on the structure and on simple values. The structural selection part is performed with a navigational approach, where data is explored and elements are located starting from determined entry points. Tree Pattern Queries [3], also known as twigs, which involve element selections with specified tree structures, have been defined to enable efficient processing of the structural part of the queries. Consider for example the bibliography database of Figure 1. Figure 1a illustrates the graph representation of the XML database, and Figure 1b shows its structural summary, namely the A(k)-Index (where k is ∞) [18]. In Figure 1a, solid lines represent edges between elements and subelements, while dashed lines represent idref edges. We issue the query //bib[.//author[@name='Chen']]/article[@title='t1'], which retrieves articles with title 't1' if there exists an author with name 'Chen'. The structural part of the query can be represented as a twig and is shown in Figure 1c. While value-based queries can be evaluated by adopting traditional indexing schemes, e.g. B+-trees, efficient support for the structural part is a unique and challenging problem. Previous efforts to process the structural part of the query directly from the data [29, 2, 6, 12, 21, 27, 26] have assumed the tree-structured model of XPath and cannot be applied when the structure is a general directed graph. Considering graph structures, research on structural summaries [14, 24, 13, 20, 18, 9, 7, 28] has focused on the construction and maintenance of effective structures, i.e. structures that are (a) space efficient, (b) optimized with regard to a specified query workload, and (c) gracefully adaptable in the presence of updates. As a result, the query processing task itself has received little attention.

In this paper, we propose techniques for the evaluation of twig queries over graph-structured data. We investigate new ways to adapt ideas that have proved successful for the case of tree-structured data, namely the mapping of structural conditions to join conditions. We start with the particular class of directed and acyclic graphs (which we show to be the structure of the data in many cases) and
then adapt them for the case of general directed graphs. Our techniques can be used to answer queries either over the original data or through structural summaries.

Figure 1: (a) Bibliography XML database, (b) A(k)-Index, (c) Sample Query

The rest of the paper is organized as follows. Section 2 presents background and related work. In Section 3, we describe and analyze our techniques, and in Section 4 we experimentally investigate their effectiveness. Section 5 provides our conclusions and directions for further work.
2. BACKGROUND AND RELATED WORK
We consider a model of XML data in which we ignore comments, processing instructions and namespaces. Then an XML database can be modeled as a directed, node-labeled graph G = (V, E, ERef, root), where V is the set of nodes (indicating elements, values or attributes), E is the set of edges, which indicate an object-subobject relation, and ERef is the set of reference edges (idref). Each node in V is assigned a string literal label and has a unique identifier. The single root element is labeled ROOT. In Figure 1a, the graph structure of a bibliography database is presented.

Previous work has considered in detail the case where no reference edges exist. In those cases, the data structure is a tree. In order to efficiently capture the structural relationships between nodes, a region-based numbering scheme [11, 29, 2, 6] is usually embedded at the document nodes. In that scheme, each node in the document tree is assigned a region (left, right). Given a node pair (u, v), node v is a descendant of node u (which is then an ancestor of v) if and only if u.left < v.left < v.right < u.right. Using that numbering scheme, set-based techniques [29, 22, 2, 6] regard the input as sequences of elements and perform a join in a sort-merge fashion. They use the containment condition described above as the join condition. Indexes have been proposed to skip elements that do not participate in any of the results [8, 6, 16]. Navigation-based techniques [12, 15], which use the input to guide the computation, have also been proposed. Those techniques answer the queries with a single pass over the input. The input is traversed in document order (i.e. in pre-order) and, using an FSM (Finite State Machine), the query pattern is matched with particular path instances as those instances become available. Work in [21] also utilizes navigation-based techniques to solve a wider range of queries. More recently, similar numbering schemes have been employed in query-driven input-probing methods [27, 26], where the structure of the query defines the part of the input data to be examined next. Methods in this category construct the result by matching parts of the query incrementally.

While the above techniques operate at the granularity of the original data, query processing can also employ the structural summaries that have been proposed [20, 18, 19, 9, 7]. Using those structures, the data graph is summarized with a graph of a smaller size that maintains the structural characteristics of the original data. Each index node identifies a group of nodes in the original document. The concept of bisimilarity [25] is utilized to form the groups. Those proposals are mainly concerned with creating effective indexes, and little is said about the processing of twig queries. In summary, they regard the twig query as a path query (the primary path) with structural constraints on the nodes of that path. They employ navigation-based techniques to identify nodes that match the primary path and recursively check the validity of the structural constraints. However, such an approach may suffer from unnecessary accesses over the input in order to check the structural constraints. It would be desirable to devise techniques with characteristics similar to those in the previous paragraph for the case of structural summaries too. Nevertheless, the graph structure of those summaries hinders the direct adaptation of the techniques described in the previous paragraph.
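To make the containment test concrete, the following is a minimal sketch (ours, not taken from the cited papers) of the region-based descendant check and a simple structural join over (left, right) intervals; representing each element as a bare (left, right) pair is an assumption made only for illustration.

def is_descendant(anc, desc):
    # desc's region lies strictly inside anc's region
    return anc[0] < desc[0] and desc[1] < anc[1]

def structural_join(ancestors, descendants):
    # both inputs are lists of (left, right) pairs sorted by their left number
    out = []
    for a in ancestors:
        for d in descendants:
            if d[0] > a[1]:
                break  # d and all later descendants start after a's region ends
            if is_descendant(a, d):
                out.append((a, d))
    return out

# e.g. with a book region (1, 10) containing sections (2, 5) and (6, 9):
# structural_join([(1, 10)], [(2, 5), (6, 9)]) returns both pairs.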
3. PROCESSING TWIGS IN GRAPHS
In this section, we describe our solution to efficiently process twig queries over graph-structured data. We focus on the descendant axis and consider the child axis as a special case. In a directed graph environment, the ancestor-descendant relationship of a tree pattern edge is satisfied if there is a path from the ancestor node to the descendant node. We first describe node labeling schemes that identify the structural relationship between two graph nodes (as the numbering scheme described in the previous section does for tree structures). We then describe our matching algorithms that utilize those schemes to efficiently compute twig queries and analyze their behavior.
3.1 Labeling Scheme for Digraphs
As mentioned earlier, in a directed graph environment, the ancestor-descendant relationship is satisfied if there is a path from the ancestor node to the descendant node. We would like to be able to answer the question of the existence of such a path within reasonable time. In graph theory, this is the well-known reachability problem and can be handled by computing the transitive closure of the graph structure.
Then, for each pair of nodes, the reachability question can be answered in constant time. However, such an approach is space consuming, as it requires O(n²) space in the number of nodes. The question that is posed is whether one can get the same functionality using less space, possibly trading some of the time needed to check reachability. The answer is yes, by employing the minimum 2-hop cover of a graph. However, it has been shown [10] that the problem of identifying the minimum 2-hop cover is an NP-hard problem. A practical solution that identifies a 2-hop cover close to the smallest cover is presented in [10] and is employed in our techniques. The notion of 2-hop labeling is defined as follows:
Definition 1. Let G = (V, E) be a directed graph. A 2-hop reachability labeling of G assigns to each vertex v ∈ V a label L(v) = (Lin(v), Lout(v)), such that Lin(v), Lout(v) ⊆ V, there is a path from every x ∈ Lin(v) to v, and there is a path from v to every x ∈ Lout(v). Furthermore, for any two vertices u, v ∈ V, we should have:

v ⇝ u iff Lout(v) ∩ Lin(u) ≠ ∅    (1)

The labeling size is defined to be Σ_{v∈V} (|Lin(v)| + |Lout(v)|).

To obtain 2-hop labels for each node, the 2-hop cover is defined as:

Definition 2. Let G = (V, E) be a directed graph. For every u, v ∈ V, let Puv be a collection of paths from u to v and P = {Puv}. A hop is defined to be a pair (h, u), where h is a path in G and u ∈ V is one of the endpoints of h (u is called the handle of the hop). A collection of hops H is said to be a 2-hop cover of P if for every u, v ∈ V such that Puv ≠ ∅, there is a path p ∈ Puv and two hops (h1, u) ∈ H and (h2, v) ∈ H, such that p = h1 h2, i.e. p is the concatenation of h1 and h2. The size of the cover H is equal to the number of hops in H.

In the above definition, the set of paths P can be the set of all shortest paths between each pair of nodes in the graph. Having produced a 2-hop cover for a graph, the 2-hop labels for the nodes in the graph can be created by mapping a hop with handle v to an item in the label of the node v. In [10], a polynomial time algorithm is provided that finds a 2-hop cover close to the smallest such cover (larger by a factor of O(log n)). Their experiments show that the 2-hop covers produced are compact, and each label size is a very small portion of the total number of nodes in the graph. Given two nodes v and u with labels (Lin(v), Lout(v)) and (Lin(u), Lout(u)), equation (1) can be checked efficiently either in a hash-based or in a sort-based fashion.
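As a small illustration of how equation (1) is checked, the sketch below assumes each node carries hypothetical label sets Lin and Lout produced by some 2-hop cover (for instance, the one computed by the algorithm of [10]); the concrete labels in the example are chosen by hand for a three-node chain a → b → c.

def reaches(u, v, l_out, l_in):
    # u can reach v iff Lout(u) and Lin(v) share a handle (hash-based check)
    return not l_out[u].isdisjoint(l_in[v])

l_out = {"a": {"a", "b"}, "b": {"b"}, "c": {"c"}}
l_in  = {"a": {"a"}, "b": {"b"}, "c": {"b", "c"}}
print(reaches("a", "c", l_out, l_in))   # True: the hop handle 'b' is shared
print(reaches("c", "a", l_out, l_in))   # False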
3.2 Labeling Scheme for DAGs

In the previous section we discussed a labeling scheme that can be used to answer the reachability question on any directed graph. However, although the time to answer is reasonably small, it is not constant anymore. In this section, we attempt to fix that for special graphs, namely directed acyclic graphs (DAGs). We first argue that many useful structures in our environment fall into this category, and then we describe an alternative scheme, which manages to answer the reachability question in constant time for those graphs. As mentioned in Section 1, structural summaries are, in general, graph structures, even when derived from tree structures. Nevertheless, in that case, one can prove the following theorem:

Theorem 1. The A(k)-Index derived from a tree structure, with k lower bounded by the height of the tree, is acyclic.

Proof Sketch. The proof is based on the observation that if two nodes do not have the same level, then they cannot be k-bisimilar for k larger than the height of the tree (the full proof is omitted for lack of space).

Moreover, documents where subelements cannot have the same label as their superelements also produce directed and acyclic graphs. From the above discussion, it is clear that the class of directed acyclic graphs is very important in many real world situations. As one would expect, there are opportunities to further optimize the computation of twig queries over such structures. In [17], a labeling scheme is discussed to handle the reachability problem in planar directed graphs with one source and one sink. We argue that a large number of directed acyclic graphs present in our environment can be mapped to this framework by introducing, if necessary, a dummy node to play the role of a sink. A similar method (the introduction of dummy nodes) also allows techniques for planar graphs to be applied to non-planar ones. Moreover, because the reachability property between two nodes is not affected by the introduction of those dummy nodes, the latter can be used to create the labeling and be discarded afterwards. The role of the source can be played by the root element in the graph model described in Section 2. In such graphs the following theorem holds [17]:

Theorem 2. Let G = (V, E) be a planar, directed and acyclic graph with one source and one sink. For each node v ∈ V, a label L consisting of two numbers a_v and b_v is assigned to it. Then for every pair of nodes u and v ∈ V with labels (a_u, b_u) and (a_v, b_v) the following holds:

u ⇝ v iff a_u < a_v and b_u < b_v    (2)

One can always find such a label. The algorithm to embed the labeling is very simple. Two depth-first searches are performed. A counter assigns the labels to the nodes and is initialized to n + 1, where n is the number of vertices in the graph. First, a left/depth-first search is performed. A stack is used such that a node is pushed when it is first reached, and popped when all the edges directed from it have been examined. The value that the counter has when a node is popped is assigned to the node as the first number of its label, and the value of the counter is decreased by one. At the end of the traversal, the counter is reinitialized and a second, right/depth-first, search is performed. During this search, the second number is assigned to the nodes in the same way as before.
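The following sketch implements the two-pass numbering just described; it assumes the graph is given as an adjacency list whose child ordering reflects the planar embedding (an assumption, since for an arbitrary DAG the child order below is only illustrative), and it checks reachability with the condition of Theorem 2.

def post_order_numbers(adj, root, n, rightmost_first=False):
    num, visited, counter = {}, set(), [n + 1]
    def dfs(v):
        visited.add(v)
        children = adj.get(v, [])
        for c in (reversed(children) if rightmost_first else children):
            if c not in visited:
                dfs(c)
        num[v] = counter[0]      # value assigned when the node is popped
        counter[0] -= 1
    dfs(root)
    return num

def two_number_labels(adj, root):
    n = len(set(adj) | {c for cs in adj.values() for c in cs})
    a = post_order_numbers(adj, root, n)                        # left/depth-first pass
    b = post_order_numbers(adj, root, n, rightmost_first=True)  # right/depth-first pass
    return {v: (a[v], b[v]) for v in a}

def reaches(labels, u, v):
    return labels[u][0] < labels[v][0] and labels[u][1] < labels[v][1]

# For the diamond root -> x, root -> y, x -> z, y -> z (one source, one sink):
labels = two_number_labels({"root": ["x", "y"], "x": ["z"], "y": ["z"]}, "root")
print(reaches(labels, "x", "z"), reaches(labels, "x", "y"))   # True False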
3.3 Twig Processing in XML Graphs
We proceed with first describing our algorithm to handle twig queries when the input is a directed acyclic graph and the 2-number labelling scheme is utilized. Subsequently, we explain how the technique can be adapted when the structure is a general directed graph.
3.3.1 Twig Processing in DAGs
The algorithm proceeds in a navigational manner and employs an FSM to identify matching nodes. It adopts ideas that have appeared in [6, 12, 15] to guide the computation of the FSM, to encode partial results and to output the final results. We assume that the whole twig instance is required as output; however, XPath semantics can also be supported with minimal modifications (e.g. by filtering out the elements that do not participate in the result in a post-processing step that is pipelined with the matching algorithm).

In [12], a technique to answer multiple path queries, by sharing common prefixes over tree structures, is proposed. We describe an adaptation of this technique to efficiently answer twig queries too. Then, we point out the modifications necessary to support directed acyclic graph structures. The twig query is first divided into its constituent path queries, which are then handled simultaneously as in [12]. The algorithm utilizes an FSM, which is derived from the twig query to guide pattern matching. Either an NFA (Nondeterministic Finite Automaton) or a DFA (Deterministic Finite Automaton) can be used, and each choice has its advantages and disadvantages. An NFA will possibly have a smaller number of states, while a DFA will avoid the backtracking that the programmatic simulation of an NFA incurs. We decided to adopt the DFA solution as it has been shown to provide performance advantages [23, 15]. The query-to-automaton mapping algorithm is an extension of the one provided in [12], so that the NFA is converted to the minimal equivalent DFA, and is omitted for lack of space. A runtime stack maintaining the DFA states is utilized to buffer previous states of the machine and to allow backtracking to those states when necessary. Besides the runtime stack, a stack is associated with each query node and is called the "elements stack" from now on. The role of elements stacks is to buffer document nodes that compose intermediate results, until the final results that contain them are formulated. The set of those stacks creates a compact representation of partial and total answers as in [6]. The input is accessed in document order, and only elements with the same tag as those of the query nodes are accessed. To achieve that, the document is preprocessed, and the elements are divided by tag into sequences sorted in document order. Moreover, each element is augmented with the region-based numbering scheme described in Section 2. That way, the access in document order can be performed by sequentially traversing each sequence and picking, among the current elements, the one with the smallest left number. Visiting the elements in such a way guarantees that: (a) descendant nodes will be accessed after their ancestor counterparts and (b) when an ancestor node is found not to be joined with a descendant node, it can be discarded, as it is not going to be joined with any of the subsequent elements. The algorithm proceeds as follows:
• The DFA is set to its initial state. That state is pushed into the runtime stack and becomes the active state.
• When a new element arrives, the DFA execution follows all matching transitions from all current active states. Moreover, the new element is pushed into its corresponding element stack, possibly triggering the popping of elements that will not participate in new results.
• When the subtree rooted at the element that pushed the current active states into the stack has been processed, the top set of active states is popped and the automaton backtracks.
• When an element with the same tag as a query leaf is about to be pushed into the stack, the path instances being formulated are created in a recursive manner, in a way similar to [6].

As described above, the algorithm produces all the path instances matching the path patterns that constitute the twig query. To construct the twig instances, those path instances need to be combined. An efficient way is by merging them on their common prefixes. This requires the path instances to be sorted in root-to-leaf order. However, the algorithm produces the results in leaf-to-root order. We adapt the blocking technique described in [6] to achieve that. In this technique, two linked lists are associated with each node q in the stack: the first, the (S)elf-list, holds all partial results with root node q, and the second, the (I)nherit-list, holds all partial results of descendants of q. When a node is to be popped from its stack, its self- and inherit-lists (in that order) are appended to the inherit lists of the node below it, if one exists, or to the self list of all the associated elements of the stack that corresponds to the parent node in the query. If the element is the last one in the stack corresponding to the root node of the query, then the lists are returned.

When the input structure is a directed acyclic graph, one node may have multiple parents. In this case, region-based numbering schemes cannot be used to identify the relative position of two elements, and the scheme described in Section 3.2 is employed instead. Moreover, the document order is not defined anymore. However, one can still get the advantages of that order as follows: if the nodes are accessed in the same order as in the left/depth-first search described in Section 3.2, it still holds that when an ancestor node is found not to be joined with a descendant node, it can be discarded, as it is not going to be joined with any of the subsequent elements. The computation proceeds similarly as in the case of the tree data structure. Furthermore, using the numbering scheme of Section 3.2 and by grouping the elements by tag, sorted on the first number of the label, we need only access elements having the same tags as the query nodes, instead of the whole document. Paths rooted at nodes with multiple parents will be accessed multiple times (once for each parent), i.e. backtracking needs to be performed in the element sequences. In several cases we can further optimize the performance of the algorithm if we have a bound on the number of parents of each node. In those cases, possibly performing a breadth-first-like access of the graph, we can re-use portions of paths that have already been accessed and reside in the stacks, without accessing the same nodes again.

Figure 2: Twig processing example

Example 1. Figure 2 illustrates a simple example. The query to be evaluated is //a//b[.//c]//d, which is decomposed into two path queries, P1 and P2, as shown in Figure 2a. This figure also illustrates the DFA that is generated by grouping the path queries in common shared states (0, 1, and 2). In the DFA: a circle denotes a state; a bold circle denotes a final (accepting) state, which is marked by the IDs of accepted queries; a directed edge is a transition; the symbol on an edge is the input that triggers the transition. This is just a simplified illustration, since the actual DFA has more transitions (one per symbol in the query in each state), not shown here for clarity. Consider the document fragment in Figure 2b. While the elements from a1 to d1 are read through the left path, the list of current states is pushed to the run-time stack as shown in Figure 2c, and each element is pushed into its corresponding element stack, as shown in Figure 2d. When c1 is read, the query path P1 is matched, and when d1 is read, P2 is matched. When the subtree with root b2 finishes processing, the backtracking is performed by popping the run-time stack once for each of its elements, as the second stack in Figure 2c shows, while the elements b2, c1, d1 are popped from the element stacks. When b3 is read, its parent is a2, and it behaves the same way as b2, so it is combined with c1 and d1 to form new partial results as well (not shown in the figure).
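The document-order access over the per-tag sequences can be pictured with the small sketch below; it is only an illustration (a heap-based k-way merge on the left numbers), and the element representation as (left, right, tag) triples is an assumption rather than something taken from the paper.

import heapq

def document_order(sequences):
    # sequences: dict tag -> list of (left, right, tag), each list sorted by left
    heap = []
    for tag, seq in sequences.items():
        if seq:
            heapq.heappush(heap, (seq[0][0], tag, 0))
    while heap:
        left, tag, i = heapq.heappop(heap)
        yield sequences[tag][i]
        if i + 1 < len(sequences[tag]):
            heapq.heappush(heap, (sequences[tag][i + 1][0], tag, i + 1))

# Only the tags that appear in the query need to be passed in, so the whole
# document is never scanned.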
3.3.2 Twig Processing in general digraphs

The same algorithm can be utilized for the case of general directed graphs. However, in this case the relative position of two nodes is determined by 2-hop covers, as described in Section 3.1, and access is performed by following the actual edges of the graph. A table similar to the one described in [20] is employed to avoid unnecessary cycles, and the performance characteristics are the same as in [20] (where only path queries are discussed).

4. EXPERIMENTAL EVALUATION

In order to investigate the effectiveness of our techniques, we performed a group of experiments over benchmark data. We begin by describing our evaluation methodology and then provide the results of our study.

4.1 Experimental Setup

We compared the performance of our techniques with the technique described in [18] for the evaluation of branch expressions, where twig queries are treated as having a primary path with structural constraints on the nodes that belong to that path. We call this technique PEWSC (Path Evaluation With Structural Constraints) from now on. For graph-structured data, we used the XMark [1] generator to create a 100MB database, and we only took into consideration the part of the document described by the DTD graphically illustrated in Figure 3a. For this part of the document, the node category residing under the node incategory is an IDREF referencing the node category residing under the node categories. Hence, the document is a DAG. To abstract between XML documents and structural summaries, we treat IDREF edges as any other edge. We ran the queries shown in Figure 3b. In this figure, the primary path is marked with a ∗ symbol. We compare the PEWSC technique with our algorithm when 2-hop covers are used to determine reachability (we call this technique IN2HR, for Input Navigation with 2-Hop cover for Reachability, from now on) and when the labeling described in Section 3.2 is employed (we call this technique IN2NR, for Input Navigation with 2 Numbers for Reachability, from now on). In this last case we could access only the sequences with the same tags as the query nodes (as already described in Section 3.3). The performance measure is the total number of nodes visited to answer the query. This choice is justified because, in the general case of graph structures, it is difficult to make any guarantees about clustering for the index nodes [20]. Consequently, the access to the index nodes is, in general, an I/O operation per index node accessed.

Figure 3: (a) Document DTD, (b) Queries, (c) Performance Evaluation

4.2 Performance Results

The results of the experiments described in the previous paragraph are shown in Figure 3c. When the query is a path
query (Q1), PEWSC and IN2HR access the same number of nodes. However, when structural constraints are added (Q2-Q5), PEWSC has to rescan parts of the documents to check for their validity, and as a result its performance degrades. Algorithm IN2NR always performs much better than the other algorithms, as it manages to discard the sequences of elements that do not participate in the results. On the other hand, when the tag of a query node exists under different contexts (i.e. nodes with tag name), the algorithm will need to access a possibly large number of elements that do not participate in the result. However, in those cases, indexing as in the case of region-based numbering schemes ([8, 6, 16]) could further improve performance. The encouraging news is that the labeling described in Section 3.2 enables such indexing. We plan to investigate the indexing possibility in the future.
5. CONCLUSION
In this paper, we proposed techniques for evaluating twig queries over graph-structured data. We motivated our work both by the graph structure of XML documents, and by the existence of index graphs, namely structural summaries, which are graph structures too. We identified an important class of graphs that emerges in many real world situations, namely DAGs, for which we further tailored our approaches. Our preliminary evaluation shows the technique to be a viable solution to the aforementioned problem. We plan to further evaluate our techniques with a variety of documents as well as structural summaries. Moreover, the investigation of the properties of the graph structures that emerge in related applications is an interesting path for future research.
6. REFERENCES
[1] The XML Benchmark Project. Available from http://www.xml-benchmark.org.
[2] S. Al-Khalifa, H. Jagadish, N. Koudas, J. M. Patel, D. Srivastava, and Y. Wu. Structural joins: A primitive for efficient XML query pattern matching. In Proc. of ICDE, 2002.
[3] S. Amer-Yahia, S. Cho, L. V. S. Lakshmanan, and D. Srivastava. Minimization of tree pattern queries. In Proc. of SIGMOD, 2001.
[4] A. Berglund, S. Boag, D. Chamberlin, M. Fernandez, M. Kay, J. Robie, and J. Simeon. XML Path Language (XPath) 2.0. Nov 2003.
[5] S. Boag, D. Chamberlin, M. F. Fernandez, D. Florescu, J. Robie, and J. Simeon. XQuery 1.0: An XML query language. W3C Working Draft. Available from http://www.w3.org/TR/xquery, Nov 2003.
[6] N. Bruno, N. Koudas, and D. Srivastava. Holistic twig joins: Optimal XML pattern matching. In Proc. of ACM SIGMOD, 2002.
[7] Q. Chen, A. Lim, and K. W. Ong. D(k)-index: An adaptive structural summary for graph-structured data. In Proc. of SIGMOD, 2003.
[8] S.-Y. Chien, Z. Vagena, D. Zhang, V. J. Tsotras, and C. Zaniolo. Efficient structural joins on indexed XML documents. In Proc. of VLDB, 2002.
[9] C.-W. Chung, J.-K. Min, and K. Shim. APEX: An adaptive path index for XML data. In Proc. of SIGMOD, 2002.
[10] E. Cohen, E. Halperin, H. Kaplan, and U. Zwick. Reachability and distance queries via 2-hop labels. In Proc. of ACM-SIAM SODA, 2002.
[11] M. P. Consens and T. Milo. Optimizing queries on files. In Proc. of SIGMOD, 1994.
[12] Y. Diao, M. Altinel, M. J. Franklin, H. Zhang, and P. M. Fischer. Path sharing and predicate evaluation for high-performance XML filtering. ACM TODS, 28(4), Dec 2003.
[13] M. N. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri, and K. Shim. XTRACT: A system for extracting document type descriptors from XML documents. In Proc. of SIGMOD, 2000.
[14] R. Goldman and J. Widom. DataGuides: Enabling query formulation and optimization in semistructured databases. In Proc. of VLDB, 1997.
[15] A. Halverson, J. Burger, L. Galanis, A. Kini, R. Krishnamurthy, A. N. Rao, F. Tian, S. Viglas, Y. Wang, J. F. Naughton, and D. J. DeWitt. Mixed mode XML query processing. In Proc. of VLDB, 2003.
[16] H. Jiang, W. Wang, H. Lu, and J. X. Yu. Holistic twig joins on indexed XML documents. In Proc. of VLDB, 2003.
[17] T. Kameda. On the vector representation of the reachability in planar directed graphs. Information Processing Letters, 3(3), January 1975.
[18] R. Kaushik, P. Bohannon, J. F. Naughton, and H. F. Korth. Covering indexes for branching path queries. In Proc. of SIGMOD, 2002.
[19] R. Kaushik, P. Bohannon, J. F. Naughton, and P. Shenoy. Updates for structure indexes. In Proc. of VLDB, 2002.
[20] R. Kaushik, P. Shenoy, P. Bohannon, and E. Gudes. Exploiting local similarity for indexing paths in graph-structured data. In Proc. of ICDE, 2002.
[21] C. Koch. Efficient processing of expressive node-selecting queries on XML data in secondary storage: A tree automata-based approach. In Proc. of VLDB, 2003.
[22] Q. Li and B. Moon. Indexing and querying XML data for regular path expressions. In Proc. of VLDB, 2001.
[23] H. Liefke. Horizontal query optimization on ordered semistructured data. In Proc. of WebDB, 1999.
[24] T. Milo and D. Suciu. Index structures for path expressions. In Proc. of ICDT, 1999.
[25] R. Paige and R. Tarjan. Three partition refinement algorithms. SIAM Journal on Computing, 16(6), Dec 1987.
[26] P. R. Rao and B. Moon. PRIX: Indexing and querying XML using Prufer sequences. In Proc. of ICDE, 2004.
[27] H. Wang, S. Park, W. Fan, and P. S. Yu. ViST: A dynamic index method for querying XML data by tree structures. In Proc. of ACM SIGMOD, 2003.
[28] K. Yi, H. He, I. Stanoi, and J. Yang. Incremental maintenance of XML structural indexes. In Proc. of SIGMOD, 2004.
[29] C. Zhang, J. Naughton, D. DeWitt, Q. Luo, and G. Lohman. On supporting containment queries in relational database management systems. In Proc. of SIGMOD, 2001.
Unraveling the Duplicate-Elimination Problem in XML-to-SQL Query Translation

Rajasekar Krishnamurthy (University of Wisconsin), Raghav Kaushik (Microsoft Corporation), Jeffrey F. Naughton (University of Wisconsin)
[email protected], [email protected], [email protected]
ABSTRACT

We consider the scenario where existing relational data is exported as XML. In this context, we look at the problem of translating XML queries into SQL. XML query languages have two different notions of duplicates: node-identity based and value-based. Path expression queries have an implicit node-identity based duplicate elimination built into them. On the other hand, SQL only supports value-based duplicate elimination. In this paper, using a simple path expression query we illustrate the problems that arise when we attempt to simulate the node-identity based duplicate elimination using value-based duplicate elimination in the SQL queries. We show how a general solution for this problem covering the class of views considered in published literature requires a fairly complex mechanism.

1. THE DUPLICATE ELIMINATION PROBLEM

Using a simple example scenario, we first explain why we need duplicate elimination in XML-to-SQL query translation. Consider the following relational schema for a collection of books.
• Book (id, title, price, . . . )
• Author (name, bookid, . . . )
• TopSection (id, bookid, title, . . . )
• NestedSection (id, topsectionid, title, . . . )

The Book relation has basic information about books and the Author relation has information about authors of each book. Each book has sections and subsections, and the corresponding information is in the TopSection and NestedSection relations respectively. Consider the XML view T1 defined over this relational schema, shown in Figure 1. We represent the XML view using simple annotations on the nodes and edges of the XML schema. For example, each book tuple creates a new book element. The title of the book is represented as a title subelement. Each section is represented as a subelement and this is captured by the join condition on the edge. The rest of the view definition can be understood in a similar fashion.

Figure 1: XML view T1 (the view tree has schema nodes books (1), book (2), title (3, Book.title), author (4, Author), section (5, TopSection), title (6, TopSection.title), section (7, NestedSection) and title (8, NestedSection.title); edge annotations: (i) Book.id = TopSection.bookid, (ii) TopSection.id = NestedSection.topsectionid)

Suppose we want to retrieve the titles of all sections. One possible XML query is Q = //section//title. XML-to-SQL query translation algorithms proposed in the literature, such as [6, 15], work as follows. Logically, the first step is to identify schema nodes that match each step of the query. In this example, there are three matching evaluations for Q, namely S = {, , }. The second step is to generate an SQL query for each matching evaluation in S. The final query is the union of queries generated for all matching evaluations. For Q, the SQL query SQ1 obtained in this fashion is given below.

select TS.title
from Book B, TopSection TS
where B.id = TS.bookid
union all
select NS.title
from Book B, TopSection TS, NestedSection NS
where B.id = TS.bookid and TS.id = NS.topsectionid
union all
select NS.title
from Book B, TopSection TS, NestedSection NS
where B.id = TS.bookid and TS.id = NS.topsectionid
In the above query, we see that there are three entries in S, and, as a result, SQ1 is the union of three queries. On the other hand, looking at the view definition, we notice that there are only two paths that match the query ending at the two schema nodes 6 and 8. The path appears twice in S, once as and again as . This occurs because the section step in the query matches both the section elements in the schema. The following title step matches the title element (node 8) for each of these evaluations, due to the // axis in the query. As a result, the second and third (sub)queries in SQ1 are identical and generate duplicate results.
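The duplicate evaluation can be reproduced with the small sketch below, which encodes the T1 view tree of Figure 1 (node numbers and labels as read from the figure; the Python representation itself is only illustrative) and enumerates the schema-node matchings of //section//title step by step.

schema = {
    1: ("books",   [2]),
    2: ("book",    [3, 4, 5]),
    3: ("title",   []),
    4: ("author",  []),
    5: ("section", [6, 7]),
    6: ("title",   []),
    7: ("section", [8]),
    8: ("title",   []),
}

def descendants(node):
    out = []
    for child in schema[node][1]:
        out.append(child)
        out.extend(descendants(child))
    return out

def evaluations(tags, start=1):
    # all schema-node sequences matching //tags[0]//tags[1]//... below `start`
    if not tags:
        return [[]]
    result = []
    for d in descendants(start):
        if schema[d][0] == tags[0]:
            result.extend([d] + rest for rest in evaluations(tags[1:], d))
    return result

print(evaluations(["section", "title"]))   # [[5, 6], [5, 8], [7, 8]]
# Node 8 is reached both through section node 5 and through section node 7,
# which is exactly why two of the generated subqueries coincide.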
According to XPath semantics, the result of a query should not have any duplicates. Here, duplicate-elimination is defined in terms of node-identity. As a result, we need to add a distinct clause to SQ1 to achieve the same effect. We refer to this as the Duplicate-Elimination problem. The fact that we need to simulate node-identity based duplicate elimination using the value-based distinct clause in SQL creates several problems and providing a complete solution to this problem is the focus of this paper. Notice that this extra duplicate-elimination step is required to make sure that existing algorithms work correctly for this particular example. Let us start with the simplest approach to eliminate duplicates from SQ1 . By adding an outer distinct(title) clause, we can eliminate duplicate titles. The corresponding SQL query SQ11 is given below.
with Temp(title) as (
  select TS.title
  from Book B, TopSection TS
  where B.id = TS.bookid
  union all
  select NS.title
  from Book B, TopSection TS, NestedSection NS
  where B.id = TS.bookid and TS.id = NS.topsectionid
  union all
  select NS.title
  from Book B, TopSection TS, NestedSection NS
  where B.id = TS.bookid and TS.id = NS.topsectionid
) select distinct(title) from Temp

Notice that the above query does value-based duplicate elimination and may eliminate more values than required. For example, duplicates in the XML view that arise in the following ways must be retained in the query result.
1. Two top-level sections have the same title
2. Two nested sections have the same title
3. A top-level section and a nested section have the same title
Since SQ11 applies a distinct clause on title, it eliminates duplicates that arise in the above three contexts. As a result, SQ11 is not a correct query. Let us next see if using the key column(s) in the relational schema helps us solve this problem. Recall that the id column is the key for both the TopSection and NestedSection relations. So, by projecting this key column and applying the following distinct clause: "distinct(id,title)", we get the following query SQ21.

with Temp(id, title) as (
  select TS.id, TS.title
  from Book B, TopSection TS
  where B.id = TS.bookid
  union all
  select NS.id, NS.title
  from Book B, TopSection TS, NestedSection NS
  where B.id = TS.bookid and TS.id = NS.topsectionid
  union all
  select NS.id, NS.title
  from Book B, TopSection TS, NestedSection NS
  where B.id = TS.bookid and TS.id = NS.topsectionid
) select distinct(id, title) from Temp

While the above query retains the duplicate values corresponding to 1 and 2 above, it does not retain duplicates when a top-level section and a nested section have the same title and the same id as well. This can occur because keys in relational databases are unique only in the context of the corresponding relation. In order to address this issue, we need to create a key across the two relations. A straightforward way to do this is to combine the name of the relation (or some other identifier) along with the key column(s). This results in the following query SQ31.

with Temp(relname, id, title) as (
  select ''TS'', TS.id, TS.title
  from Book B, TopSection TS
  where B.id = TS.bookid
  union all
  select ''NS'', NS.id, NS.title
  from Book B, TopSection TS, NestedSection NS
  where B.id = TS.bookid and TS.id = NS.topsectionid
  union all
  select ''NS'', NS.id, NS.title
  from Book B, TopSection TS, NestedSection NS
  where B.id = TS.bookid and TS.id = NS.topsectionid
) select distinct(relname, id, title) from Temp

Notice how the above approach creates a global key across the entire relational schema. This duplicate-elimination technique is correct for this query, and in fact, it is correct for any query over a class of views that we call non-redundant (see Table 1 in Section 4). Unfortunately this solution is not general enough and is incorrect when parts of the relational data may appear multiple times in the XML view. In the rest of the paper, we identify the techniques required for different classes of views, ending with a generic solution that is applicable over all views.

2. REDUNDANT XML VIEWS

In this section, we look at some of the simplifying assumptions we implicitly made while generating the correct duplicate-elimination clause for the query Q over view T1. First, using a slightly modified XML view over the same underlying relational schema, we show how the correct solution gets more complex than before. Then, we look at the scenario when the join conditions are not key-foreign key joins.

2.1 A Hierarchical XML view example

Figure 2: XML view T2 (the view partitions books into cheapbooks and costlybooks subtrees, each repeating T1's book/title/author/section structure with its own schema node numbers; the edge annotations compare Book.price with the constants P1 and P2)

The XML view T2, in Figure 2, has created a simple hierarchy, partitioning the books into cheap and costly books by the relationship of their prices to two constants P1 and P2. Let us look at how Q1 will be translated in this case by some of the existing algorithms [6, 15]. The equivalent SQL query is the union of six queries, three for cheapbooks and
three for costlybooks. Again, the nested section titles occur twice and need to be eliminated through a duplicate elimination operation. At the end of the previous section, we saw how a distinct clause over the three fields relname, id and title may suffice (see query SQ31 in Section 1). Applying the same idea here, we obtain the following query, SQ12.

with Temp(relname, id, title) as (
  select ''TS'', TS.id, TS.title
  from Book B, TopSection TS
  where B.id = TS.bookid and B.price P2
) select distinct(relname, id, title) from Temp

Let us now consider three possible scenarios: P1 = P2, P1 < P2 and P1 > P2. If P1 = P2, then the XML view has information about all the books exactly once, while if P1 < P2 the XML view has information about only certain books. On the other hand, when P1 > P2, the XML view has information about the books in the price range {P2 . . . P1} twice. For the two cases, P1 = P2 and P1 < P2, each book appears at most once in the XML view. As a result, the SQL query SQ12 eliminates duplicates correctly. But when P1 > P2, books in the price range {P2 . . . P1} appear twice in the XML view. So, the corresponding section titles must appear twice in the query result. But, since SQ12 applies a distinct clause using the triplet "relname, id, title", it will retain only one copy of each section title and the query result is incorrect. The main problem in this example scenario is that some parts of the relational data appear multiple times in the XML view. For the XML view T2, notice that multiple occurrences of the same section title are associated with different schema nodes. One way to obtain the correct result in this case is to keep track of the schema node corresponding to each result tuple. The distinct clause in this case will include the schema node instead of the relation name. The corresponding SQL query SQ22 is shown below.

with Temp(nodeid, id, title) as (
  select 12, TS.id, TS.title
  from Book B, TopSection TS
  where B.id = TS.bookid and B.price
  union all
  select 16, NS.id, NS.title
  from Book B, TopSection TS, NestedSection NS
  where B.id = TS.bookid and TS.id = NS.topsectionid and B.price P2
  union all
  select 17, NS.id, NS.title
  from Book B, TopSection TS, NestedSection NS
  where B.id = TS.bookid and TS.id = NS.topsectionid and B.price > P2
  union all
  select 17, NS.id, NS.title
  from Book B, TopSection TS, NestedSection NS
  where B.id = TS.bookid and TS.id = NS.topsectionid and B.price > P2
) select distinct(nodeid, id, title) from Temp

The above query will be a correct translation for all the three cases: P1 = P2, P1 < P2 and P1 > P2. The above solution is correct for any query on a class of views that we call well-formed (see Table 1 in Section 4). Notice how simple syntactic restrictions on the view definition language will not allow us to differentiate between views T1 and T2. As a result, unless we know that T1 is a non-redundant view, when we translate Q over T1, we have to use the schema node number to perform duplicate elimination. In Section 4, we present a way for identifying when XML views are non-redundant.

2.2 Beyond key-foreign key joins

For the two example views, T1 and T2, we saw how a correct translation for query Q1 results in queries SQ31 and SQ22 respectively. The join conditions present in both these view definitions are key-foreign key joins. Under these circumstances, the duplicate elimination technique used is correct. An interesting point to note is that the class of views considered in the literature, such as [1, 5, 6, 15], allows the join conditions to be between any two columns (in particular, non key-foreign key joins). Also, excepting [5], the XML-to-SQL query translation algorithms in the literature do not know (or use) information about the integrity constraints that hold on the underlying relational schema. In this section, we look at what needs to be done to perform duplicate elimination correctly when the join conditions are allowed to be over any two columns. Suppose id is not a key for the Book, TopSection and NestedSection relations. Then, the join between Book.id and TopSection.bookid in the view definition T2 is not a key-foreign key join. Similarly, the join between TopSection.id and NestedSection.topsectionid is also not a key-foreign key join. What happens in this case is that some parts of the relational data may appear in the XML view multiple times. For example, part of an instance of the relational data is shown in Figure 3. Suppose the XML view T2 was defined with P1 = 65 and P2 = 65. For the above data instance, the corresponding view will have three book elements. Since two of the books have the same value for the id column, the sections of each of these two books will be repeated under both of them. For example, the Introduction and Motivation sections will appear twice in the XML view, once for each of the two books with id 1. But, both occurrences of these section titles correspond to the same schema node (node 12 in Figure 2). As a result, our earlier technique using schema node numbers does not work. Note how the XML view has redundant data
1. INTRODUCTION

The rapid expansion of the World Wide Web has made a variety of databases like bibliographies, scientific databases, travel reservation systems, vendor databases etc. accessible to a large number of lay external users. The increased visibility of these systems has brought about a drastic change in their average user profile from tech-savvy, highly trained professionals to lay users demanding "instant gratification". Often such users may not know how to precisely express their needs and may formulate queries that lead to unsatisfactory results. Although users may not know how to phrase their queries, they can often tell which tuples are of interest when presented with a mixed set of results with varying degrees of relevance to the query. Example: Suppose a user is searching a car database for "sedans". Since most sedans have unique model names, the user must convert the general query sedan to queries binding the attribute Model. To find all sedans in the database, the user must iteratively change her search criteria and submit queries to the database. Usually the user might know only a couple of models and would end up just searching for them. Thus limited domain knowledge combined with the

where ǫ ∈ {0, 1}

The database M supports the boolean query processing model (i.e. a tuple either satisfies or does not satisfy a query). To access tuples of R one must issue structured queries over R. The answers to Q must be determined without altering the data model of M. Moreover the solution should require minimal domain-specific information.
Challenges: Supporting imprecise queries over autonomous Web-enabled databases requires us to solve the following problems:
• Model of similarity: Supporting imprecise queries necessitates the extension of the query processing model from binary (where tuples either satisfy the query or not) to a matter of degree (the degree to which a given tuple is a satisfactory answer).
• Estimating semantic similarity: Expecting 'lay' users of the system to provide similarity metrics for estimating the similarity among values binding a given attribute is unrealistic. Hence an important but difficult issue we face is developing domain-independent similarity functions that closely approximate "user beliefs".
• Attribute ordering: To provide ranked answers to a query, we must combine similarities shown over distinct attributes of the relation into an overall similarity score for each tuple. While this measure may vary from user to user, most users usually are unable to correctly quantify the importance they ascribe to an attribute. Hence another challenging issue we face is to automatically (with minimal user input) determine the importance ascribed to an attribute.

1.1 Overview of our approach

In response to these challenges, we propose a query processing framework that integrates techniques from IR and database literature to efficiently determine answers for imprecise queries over autonomous databases. Below we begin by describing the query representation model we use and explain how we map from imprecise to precise queries.
Precise Query: A user query that requires data exactly matching the query constraint is a precise query. For example, the query
Q :− CarDB(Make = "Ford")

is a precise query, all of whose answer tuples must have attribute 'Make' bound by the value 'Ford'.
Imprecise Query: A user query that requires a close but not necessarily exact match is an imprecise query. Answers to such a query must be ranked according to their closeness/similarity to the query constraints. For example, the query

Q :− CarDB(Make like "Ford")

is an imprecise query, the answers to which must have the attribute 'Make' bound by a value similar to 'Ford'.

Figure 1: Flow graph of our approach (the boxes read: Map: convert "like" to "=", Qpr = Map(Q); Derive Base Set Abs = Qpr(R); Use Base Set as set of relaxable selection queries; Using AFDs find relaxation order; Derive Extended Set by executing relaxed queries; Use Concept similarity to measure tuple similarities; Prune tuples below threshold; Return Ranked Set)

Finding Relevant Answers: Figure 1 shows a flow graph of our approach for answering an imprecise query. Algorithm 1 gives further details of our approach. Specifically, given an imprecise query Q,

Q :− CarDB(Model like "Camry", Price like "10k")

over the relation CarDB(Make, Model, Price, Year), we begin by converting Q to a precise base query (which is equivalent to replacing all "like" predicates with equality predicates) with a non-null result set over the database, Qpr.¹ Thus the base query Qpr is

Qpr :− CarDB(Model = "Camry", Price = "10k")

¹ In this paper, we assume that the resultset of the base query is non-null. If the resultset of the base query is null, then by iteratively relaxing the base query we may obtain a query that has a non-null resultset. The attributes to be relaxed can be arbitrarily chosen.

The tuples of CarDB satisfying Qpr also satisfy the imprecise query Q. Thus answers to Qpr form the base set of answers to Q, Abs. Suppose Abs contains the tuples

Make = "Toyota", Model = "Camry", Price = "10k", Year = "2000"
Make = "Toyota", Model = "Camry", Price = "10k", Year = "2001"

The tuples in Abs exactly satisfy the base query Qpr. But the user is also interested in tuples that may be similar to the constraints in Q. Assuming we knew that "Honda Accord" and "Toyota Camry" are similar cars, then we could also show tuples containing "Accord" to the user if these tuples had values of Price or Year similar to tuples in Abs. Thus,

Make = "Honda", Model = "Accord", Price = "9.8k", Year = "2000"

could be seen as being similar to the first tuple in Abs and therefore a possible answer to Q. We could also show other Camrys whose Price and Year values are slightly different to those of the tuples in Abs. Specifically, all tuples of CarDB that have one or more binding values close to some tuple in Abs can be considered as potential answers to query Q. By extracting tuples having similarity above a predefined threshold, Tsim, to the tuples in Abs, we can get a larger subset of potential answers called the extended set, Aes. But to extract additional tuples we would require new queries. We can identify new queries by considering each tuple in the base set, Abs, as a relaxable selection query. However, randomly picking attributes of tuples to relax could generate a large number of tuples of possibly low relevance. In theory, the tuples closest to a tuple in the base set will have differences in the attribute that least affects the binding values of other attributes. Approximate functional dependencies (AFDs) [11] capture relationships between attributes of a relation and can be used to determine the degree to which a change in the binding value of an attribute affects other attributes. Therefore we mine approximate dependencies between attributes of the relation and use them to determine a heuristic to guide the relaxation process. Details about our approach of using AFDs to create a relaxation order are in Section 2. Relaxation involves extracting tuples by identifying and executing new relaxed queries obtained by reducing the constraints on an existing query.

Algorithm 1 Finding Relevant Answers
Require: Imprecise Query Q, Relation R, Threshold Tsim, Concept Similarities, Approximate Functional Dependencies (AFDs)
begin
  Let RelaxOrder = FindAttributeOrder(R, AFDs).
  Let Qpr = Map(Q) such that Abs = Qpr(R) & |Abs| > 0.
  for k = 1 to |Abs| do
    Qrel = CreateQueries(Abs[k], RelaxOrder)
    for j = 1 to |Qrel| do
      Arel = Qrel[j](R).
      for n = 1 to |Arel| do
        simval = MeasureSimilarity(Arel[n], Abs[k]).
        if simval ≥ Tsim then
          Aes = Aes + Arel[n].
        end if
      end for
    end for
  end for
  Return Top-K(Aes).
end

Identifying possibly relevant answers only solves part of the problem, since we must now rank the tuples in terms of the similarity they show to the seed tuples. We assume that a similarity threshold Tsim is available and only the answers that are above this threshold are to be provided to the user. This threshold may be user given or decided by the system. The tuple similarity is estimated as a weighted sum of similarities over distinct attributes in the relation. That is,
Sim(t1, t2) = Σ_{i=1}^{n} Sim(t1(Ai), t2(Ai)) × Wi

where |attributes(R)| = n and Σ_{i=1}^{n} Wi = 1. In this paper we assume that attributes have either discrete numerical or categorical binding values. We assume that the Euclidean distance metric captures the semantic similarity between numeric values. But no universal measure is available for measuring the semantic distances between values binding a
categorical attribute. Hence, in Section 3 we present a solution to automatically learn the semantic similarity among values binding a categorical attribute. While estimating the similarity between values is definitely an important problem, an equally important issue is that of assigning weights to the similarity shown by tuples over different attributes. Users can be expected to assign weights to be used for the similarity shown over a particular attribute. However, in [19, 18], our studies found users are not always able to map the amount of importance they ascribe to an attribute into a good numeric weight. Hence, after determining the attribute order for query relaxation, we automatically assign importance weights to attributes based on their order, i.e. the attribute to be relaxed first is least important and so gets the lowest weight.
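A minimal sketch of this weighting scheme is given below; the per-attribute similarity functions and the particular rank-proportional weight formula are assumptions made only for illustration, since the text fixes only that earlier-relaxed attributes receive lower weights and that the weights sum to 1.

def weights_from_relaxation_order(order):
    # order lists attributes from first-relaxed (least important) to last-relaxed
    ranks = {attr: i + 1 for i, attr in enumerate(order)}
    total = sum(ranks.values())
    return {attr: rank / total for attr, rank in ranks.items()}

def tuple_similarity(t1, t2, attr_sims, weights):
    # Sim(t1, t2) = sum_i Sim(t1[Ai], t2[Ai]) * Wi
    return sum(attr_sims[a](t1[a], t2[a]) * weights[a] for a in weights)

# Example with a toy numeric similarity for Price/Year and exact match elsewhere:
numeric = lambda x, y: max(0.0, 1.0 - abs(x - y) / 10.0)
exact = lambda x, y: 1.0 if x == y else 0.0
weights = weights_from_relaxation_order(["Year", "Price", "Make", "Model"])
sims = {"Make": exact, "Model": exact, "Price": numeric, "Year": numeric}
t1 = {"Make": "Toyota", "Model": "Camry", "Price": 10, "Year": 2000}
t2 = {"Make": "Honda", "Model": "Accord", "Price": 9.8, "Year": 2000}
print(tuple_similarity(t1, t2, sims, weights))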
1.2 Organization
The rest of the paper is organized as follows. Section 2 explains how we use approximate functional dependencies between attributes to guide the query relaxation process. Section 3 describes our domain-independent approach for estimating the semantic similarity among concepts binding a categorical attribute. In Section 4, we provide preliminary results showing the accuracy of our concept learning approach and the effectiveness of the query relaxation technique we develop. In Section 5, we compare our work with research done in the areas of similarity search, cooperative query answering and keyword search over databases, all of which focus on providing answers to queries with relaxed constraints. Finally, in Section 6, we summarize our contributions and list the future directions for expansion of this work .
2. QUERY RELAXATION USING AFDS
Our proposed solution for answering an imprecise query requires us to generate new selection queries by relaxing the constraints of tuples in the initial set tˆIS . The underlying motivation there is to identify tuples that are closest to some tuple t ∈ tˆIS . Randomly relaxing constraints and executing queries will produce tuples in arbitrary order of similarity thereby increasing the cost of answering the query. In theory the tuples most similar to t will have differences only in the least important attribute. Therefore the first attribute to be relaxed must be the least important attribute. We define least important attribute as the attribute whose binding value, when changed, has minimal effect on values binding other attributes. Approximate Functional Dependencies (AFDs) [11] efficiently capture such relations between attributes. In the following we will explain how we use AFDs to identify the importance of an attribute and thereby guide the query relaxation process.
2.1 Definitions
Functional Dependency: For a relational schema R, an expression of the form X → A, where X ⊆ R and A ∈ R, is a functional dependency over R. The dependency is said to hold in a given relation r over R if for all pairs of tuples t, u ∈ r we have

t[B] = u[B] for all B ∈ X  ⇒  t[A] = u[A]
Several algorithms [13, 12, 7, 17] have proposed various measures to approximate functional dependencies that hold in a database. Among them, the g3 measure proposed by Kivinen and Mannila [12] has been widely accepted. The g3 measure is defined as the minimum number of tuples that need to be removed from relation r so that X → Y is an FD, divided by the number of tuples in r. Huhtala et al [11] have developed an algorithm, TANE, for efficiently discovering all AFDs in a database whose g3 approximation measure is below a user specified threshold. We use TANE to extract the AFDs and approximate keys. We mine the AFDs and keys using a subset of the database extracted by probing.
Approximate Functional Dependency: The functional dependency X → A is an approximate functional dependency if it does not hold over a small fraction of the tuples. Specifically, X → A is an approximate functional dependency if and only if error(X → A) is at most equal to an error threshold ǫ (0 < ǫ < 1), where the error is measured as the fraction of tuples that violate the dependency.
Approximate Key: An attribute set X is a key if no two distinct tuples agree on X. Let error(X) be the minimum fraction of tuples that need to be removed from relation r for X to be a key. If error(X) ≤ ǫ then X is an approximate key. Some of the AFDs determined in the used car database CarDB are: error(Model → Make) < 0.1, error(Model → Price) < 0.6, error(Make, Price → Model) < 0.7 and error(Make, Year → Model) < 0.7.
2.2 Generating the relaxation order
Algorithm 2 Query Relaxation Order
Require: Relation R, Tuples(R)
begin
  for ǫ = 0.1 to 0.9 do
    ŜAFDs = ExtractAFDs(R, ǫ).
    ŜAKeys = ExtractKeys(R, ǫ).
  end for
  Kbest = BestSupport(ŜAKeys).
  NKey = Attributes(R) − Kbest.
  for k = 1 to |Kbest| do
    WtKey[k] = [Kbest[k], Decides(NKey, Kbest[k], ŜAFDs)].
  end for
  for j = 1 to |NKey| do
    WtNKey[j] = [NKey[j], Depends(Kbest, NKey[j], ŜAFDs)].
  end for
  Return [Sort(WtKey), Sort(WtNKey)].
end
The query relaxation technique we use is described in Algorithm 2. Given a database relation R and a dataset containing tuples of R, we begin by extracting all possible AFDs and approximate keys by varying the error threshold. Next we identify the approximate key (Kbest) with the least error (or highest support). If a key has high support then it implies that fewer tuples will have the same binding values for the subset of attributes in the key. Thus the key can be seen as almost uniquely identifying a tuple. Therefore we can assume that two tuples are similar if the values binding the key are similar. All attributes of relation R not found in Kbest are approximately dependent on Kbest. Hence by relaxing the non-key attributes first we can create queries whose answers do not satisfy the dependency but have the same key. We now face the issue of which key (non-key) attribute to relax first. We make use of the AFDs to decide the relaxation order within the two subsets of attributes. For each attribute belonging to the key we determine a weight depending on how strongly it influences the non-key attributes. The influence weight for an attribute is computed as

Weight(Ai) = Σ_{j=1}^{n} (1 − error(Â → Aj)) / |Â|,  where Ai ∈ Â ⊆ R, j ≠ i, and n = |Attributes(R)|
If an attribute highly influences other non-key attributes then it should be relaxed last. By sorting the key attributes
in ascending order of their influence weights we can ensure that the least influencing attribute is relaxed first. On similar lines we would like to relax the least dependent non-key attribute first and hence we sort the non-key attributes in descending order of their dependence on the key attributes. The relaxation order we produce using Algorithm 2 only provides the order for relaxing a single attribute of the query. Basically we use a greedy approach towards relaxation and try to create all 1-attribute relaxed queries first, then the 2-attribute relaxed queries and so on. The multi-attribute query relaxation order is generated by assuming independence among attributes and combining the attributes in terms of their single attribute order. E.g., if {a1, a3, a4, a2} is the 1-attribute relaxation order, then the 2-attribute order will be {a1a3, a1a4, a1a2, a3a4, a3a2, a4a2}. The 3-attribute order will be a cartesian product of 1 and 2-attribute orders and so on. Attribute sets appearing earlier in the order are relaxed first. Given the relaxation order and a query Q, we formulate new queries from Q by removing the constraints (if any) on the attributes as given in the order. The number of attributes to be relaxed in each query will depend on order (1-attribute, 2-attribute etc). To ease the query generation process, we assume that the databases do not impose any binding restrictions. For our example database CarDB, the 1-attribute relaxation order was determined as {Make, Price, Year, Model}. Consequently the 2-attribute relaxation order becomes {(Make, Price), (Make, Year), (Make, Model), (Price, Year), (Price, Model), (Year, Model)}.
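A small Python sketch of how the relaxation order could be computed and expanded to multi-attribute queries follows; the AFD error values are invented stand-ins for the output of a TANE-style miner, and the helper names are illustrative.

from itertools import combinations

# g3-style errors of AFDs X -> A mined from a probed sample (invented values).
afd_error = {
    (("Model",), "Make"): 0.1,
    (("Model",), "Price"): 0.6,
    (("Model",), "Year"): 0.8,
}

def depends_weight(key, attr):
    # How strongly a non-key attribute depends on the key: 1 - error(key -> attr).
    return 1.0 - afd_error.get((key, attr), 1.0)

def relaxation_order(attributes, best_key):
    key = tuple(best_key)
    non_key = [a for a in attributes if a not in best_key]
    # Relax the least dependent non-key attributes first, then the key attributes.
    non_key_sorted = sorted(non_key, key=lambda a: depends_weight(key, a))
    return non_key_sorted + list(best_key)

def k_attribute_order(order, k):
    # Combine attributes assuming independence, preserving the 1-attribute order,
    # as in the {a1a3, a1a4, a1a2, a3a4, a3a2, a4a2} example above.
    return list(combinations(order, k))

order1 = relaxation_order(["Make", "Model", "Price", "Year"], ["Model"])
print(order1)                      # e.g. ['Year', 'Price', 'Make', 'Model']
print(k_attribute_order(order1, 2))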
3. LEARNING CONCEPT SIMILARITIES
Below we provide an approach to solve the problem of estimating the semantic similarity between values binding a categorical attribute. We determine the similarity between two values as the similarity shown by the values correlated to them. Concept: We define a concept over the database as any distinct attribute value pair. E.g. Make=“Ford” is a concept over database CarDB(Make,Model,Price,Year). Concept Similarity: Two concepts are correlated if they occur in the same tuple. We estimate the semantic similarity between two concepts as the percentage of correlated concepts which are common to both the concepts. More specifically, given a concept, the concepts correlated to a concept can be seen as the features describing the concept. Consequently, the similarity between two concepts is the similarity among the features describing the concepts. For example, suppose the database CarDB contains a tuple t ={Ford, Focus, 15k, 2002}. Given t, the concept Make=“Ford” is correlated to the concepts Model=“Focus”, Price=“15k” and Year=“2002”. The distinct values binding attributes Model, Price and Year can be seen as features describing the concepts over Make. Similarly Make, Price and Year for Model and so on. Let Make=“Ford” and Make=“Toyota” be two concepts over attribute Make. Suppose most tuples containing the two concepts in the database CarDB have same Price and Year values. Then we can safely assume that Make=“Ford” and Make=“Toyota” are similar over the features Price and Year.
3.1 Semantics of a Concept
Databases on the web are autonomous and cannot be assumed to provide any meta-data such as the possible distinct values binding an attribute. Hence we must extract this information by probing the database using sample queries. We begin by extracting a small subset of the database by sampling the database. From the extracted subset we can then identify a subset of concepts over the relation. (The number of concepts identified is proportional to the size of the database extracted by sampling. However we can incrementally add new concepts as and when they are encountered and learn similarities for them. But in this paper we do not focus on the issue of incremental updating of the concept space.)

Model     Focus:5, ZX2:7, F150:8 ...
Mileage   10k-15k:3, 20k-25k:5, ...
Price     1k-5k:5, 15k-20k:3, ...
Color     White:5, Black:5, ...
Year      2000:6, 1999:5, ...
Table 1: Supertuple for Concept Make='Ford'

A concept can be visualized as a selection query, called a concept query, that binds only a single attribute. By issuing the concept query over the extracted database we can identify a set of tuples all containing the concept. We represent the answerset containing each concept as a structure called the supertuple. The supertuple contains a bag of keywords for each attribute in the relation not bound by the concept. Table 1 shows the supertuple for the concept Make='Ford' over the relation CarDB as a 2-column tabular structure. To represent a bag of keywords we extend the semantics of a set of keywords by associating an occurrence count for each member of the set. Thus for attribute Color in Table 1, we see 'White' with an occurrence count of five, suggesting that there are five White colored Ford cars in the database that satisfy the concept-query.

3.2 Measuring Concept Similarities
The similarity between two concepts is measured as the similarity shown by their supertuples. The supertuples contain bags of keywords for each attribute in the relation. Hence we use the Jaccard Coefficient [10, 2] with bag semantics to determine the similarity between two supertuples. The Jaccard Coefficient (SimJ) is calculated as

SimJ(A, B) = |A ∩ B| / |A ∪ B|

We developed the following two similarity measures based on the Jaccard Coefficient to estimate the similarity between concepts:
Doc-Doc similarity: In this method, we consider each supertuple STQ as a document. A single bag representing all the words in the supertuple is generated. The similarity between two concepts C1 and C2 is then determined as

Simdoc−doc(C1, C2) = SimJ(STC1, STC2)

Weighted-Attribute similarity: Unlike pure text documents, supertuples would rarely share keywords across attributes. Moreover all attributes may not be equally important for deciding the similarity among concepts. For example, given two cars, their prices may have more importance than their color in deciding the similarity between them. Hence, given the answersets for a concept, we generate bags for each attribute in the corresponding supertuple. The similarity between concepts is then computed as a weighted sum of the attribute bag similarities. Calculating the similarity in this manner allows us to vary the importance ascribed to different attributes. The supertuple similarity will then be calculated as

Simwatr(C1, C2) = Σ_{i=1}^{m} SimJ(BagC1(Ai), BagC2(Ai)) × Wi

where C1, C2 have m attributes
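A compact sketch of the two measures follows; the supertuples below are toy fragments in the spirit of Table 1 (the Chevrolet values are invented), and Python's Counter provides the bag (multiset) semantics of the Jaccard coefficient.

from collections import Counter

def bag_jaccard(a, b):
    # Jaccard coefficient with bag semantics: |A intersect B| / |A union B| on multisets.
    inter = sum((a & b).values())
    union = sum((a | b).values())
    return inter / union if union else 0.0

def doc_doc_similarity(st1, st2):
    # Treat each supertuple as one document, i.e. one big bag of keywords.
    return bag_jaccard(sum(st1.values(), Counter()), sum(st2.values(), Counter()))

def weighted_attribute_similarity(st1, st2, weights):
    # Weighted sum of per-attribute bag similarities.
    return sum(bag_jaccard(st1[a], st2[a]) * w for a, w in weights.items())

ford = {
    "Model": Counter({"Focus": 5, "ZX2": 7, "F150": 8}),
    "Price": Counter({"1k-5k": 5, "15k-20k": 3}),
    "Year":  Counter({"2000": 6, "1999": 5}),
}
chevrolet = {
    "Model": Counter({"Cavalier": 6, "Impala": 4}),
    "Price": Counter({"1k-5k": 4, "15k-20k": 2}),
    "Year":  Counter({"2000": 5, "1999": 3}),
}
print(round(doc_doc_similarity(ford, chevrolet), 3))
weights = {"Model": 0.2, "Price": 0.5, "Year": 0.3}
print(round(weighted_attribute_similarity(ford, chevrolet, weights), 3))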
4. EVALUATION
To evaluate the effectiveness of our approach in answering imprecise queries, we set up a prototype used car search database system that accepts precise queries over the relation

CarDB(Make, Model, Year, Price, Mileage, Location, Color)

The database was set up using the open-source relational database MySQL. We populated the relation CarDB using 30,000 tuples extracted from the publicly accessible used car database Yahoo Autos [20]. The system was hosted on a Linux server running on an Intel Celeron 2.2 GHz with 512 MB RAM.

Algorithm Step           Time        Size
SuperTuple Generation    181 sec     11 MB
Similarity Estimation    1.5 hours   6.0 MB
Table 2: Timing Results and Space Usage

4.1 Concept Similarity Estimation
The attributes Make, Model, Location, Color in the relation CarDB are categorical in nature and contained 132, 1181, 325 and 100 distinct values respectively. We estimated concept similarities for these attributes as described in Section 3. The time to estimate the concept similarities is high (see Table 2), as we must compare each concept with every other concept binding the same attribute. We calculated only the doc-doc similarity between each pair of concepts. The concept similarity estimation is a preprocessing step and can be done offline, and hence the high processing time requirement for this process can be ignored.

Figure 2: Concept Similarity Graph for Make (nodes include Ford, Chevrolet, Toyota, Honda, Nissan, Dodge and BMW; edge labels give the estimated similarities)

Figure 2 provides a graphical representation of the estimated semantic similarity between some of the values binding attribute Make. The concepts Make="Ford" and Make="Chevrolet" show high similarity and so do the concepts Make="Toyota" and Make="Honda", while the concept Make="BMW" is not connected to any other node in the graph. We found these results to be intuitively reasonable and feel our approach is able to efficiently determine the semantic distances between concepts. Moreover, in [19, 18] we used a similar approach to determine the semantic similarity between queries in a query log. The estimated similarities were validated by doing a user study and our approach was found to have above 75% accuracy.

Figure 3: Work/RelevantTuple using GuidedRelax (y-axis: Work/Relevant Tuple; x-axis: Queries; curves for thresholds 0.5, 0.6 and 0.7)

Figure 4: Work/RelevantTuple using RandomRelax (y-axis: Work/Relevant Tuple; x-axis: Queries; curves for thresholds 0.5, 0.6 and 0.7)
4.2 Efficient query relaxation
To verify the efficiency of the query relaxation technique we propose in Section 2, we set up a test scenario using the CarDB database and a set of 10 randomly picked tuples. For each of these tuples our aim was to extract 20 tuples from CarDB that had similarity above some threshold Tsim (0.5 ≤ Tsim < 1). We designed two query relaxation algorithms, GuidedRelax and RandomRelax, for creating selection queries by relaxing the tuples in the initial set. GuidedRelax makes use of the AFDs and approximate keys and decides a relaxation scheme as described in Algorithm 2. The RandomRelax algorithm was designed to somewhat mimic the
random process by which users would relax queries. The algorithm randomly identifies a set of attributes to relax and creates queries. We put an upper limit of 64 on the number of queries that could be issued by either algorithm for extracting the 20 similar answers to a tuple from the initial set. To measure the efficiency of the algorithms we use a metric called Work/RelevantTuple, defined as

Work/RelevantTuple = |TExtracted| / |TRelevant|
where TExtracted gives the total number of tuples extracted while TRelevant is the number of extracted tuples that were found to be relevant. Specifically, Work/RelevantTuple is a measure of the average number of tuples that a user would have to look at before finding a relevant tuple. Tuples that showed similarity above the threshold Tsim were considered relevant. Similarity between two tuples was estimated as the weighted sum of semantic similarities shown by each attribute of the tuple. Equal weightage was given to the similarity shown by all attributes. The graphs in Figures 3 and 4 show the average number of tuples that had to be extracted by GuidedRelax and RandomRelax respectively to identify a relevant tuple for the query. Intuitively, the larger the expected similarity, the more the work required to identify a relevant tuple. While both algorithms do follow this intuition, we note that for higher thresholds RandomRelax (see Figure 4) ends up extracting hundreds of tuples before finding a relevant tuple. GuidedRelax is much more resilient to the variations in threshold and generally needs to extract about 4 tuples to identify a relevant tuple. Thus by using GuidedRelax, a user would have to look at far fewer tuples before obtaining satisfactory answers. The initial results we obtained are quite encouraging. However for the current set of experiments we did not verify
whether the tuples considered relevant are truly relevant as measured by the user. We plan to conduct a user study to verify that our query relaxation approach not only saves time but also provides answers that are truly relevant according to the user. The evaluations we performed were aimed at studying the accuracy and efficiency of our concept similarity learning and query relaxation approaches in isolation. We are currently working on evaluating our approach for answering imprecise queries over BibFinder [1, 21], a fielded autonomous bibliography mediator that integrates several autonomous bibliography databases such as DBLP, ACM DL, CiteSeer. Studies over BibFinder will enable us to better evaluate and tune the query relaxation approach we use. We also plan to conduct user studies to measure how many of the answers we present are considered truly relevant by the user.
5. RELATED WORK
Early approaches for retrieving answers to imprecise queries were based on the theory of fuzzy sets. Fuzzy information systems [14] store attributes with imprecise values, like height="tall" and color="blue or red", allowing their retrieval with fuzzy query languages. The WHIRL language [6] provides approximate answers by converting the attribute values in the database to vectors of text and ranking them using the vector space model. In [16], Motro extends a conventional database system by adding a similar-to operator that uses distance metrics over attribute values to interpret vague queries. The metrics required by the similar-to operator must be provided by database designers. Binderberger [22] investigates methods to extend database systems to support similarity search and query refinement over arbitrary abstract data types. In [9], the authors propose to provide ranked answers to queries over Web databases but require users to provide additional guidance in deciding the similarity. These approaches however are not applicable to existing databases as they require large amounts of domain specific information either pre-estimated or given by the user of the query. Further, [22] requires changing the data models and operators of the underlying database while [9] requires the database to be represented as a graph.
In contrast to the above, the solution we propose provides ranked results without re-organizing the underlying database and thus is easier to implement over any database. In our approach we assume that tuples in the base set are all relevant to the imprecise query and create new queries. The technique we use is similar to the pseudo-relevance feedback [3, 8] technique used in IR systems. Pseudo-relevance feedback (also known as local feedback or blind feedback) involves using the top k retrieved documents to form a new query to extract more relevant results.
In [4, 5], the authors explore methods to generate new queries related to the user's original query by generalizing and refining the user queries. The abstraction and refinement rely on the database having explicit hierarchies of the relations and terms in the domain. In [15], Motro proposes allowing the user to select directions of relaxation, thereby indicating which answers may be of interest to the user. In contrast, we automatically learn the similarity between concepts and use functional dependency based heuristics to decide the direction for query relaxation.

6. CONCLUSION AND FUTURE WORK
In this paper we first motivated the need for supporting imprecise queries over databases. Then we presented a domain independent technique to learn concept similarities that can be used to decide semantic similarity of values binding categorical attributes. Further we identified approximate functional dependencies between attributes to guide the query relaxation phase. We presented preliminary results showing the efficiency and accuracy of our concept similarity learning and query relaxation approaches.
Both the concept similarity estimation and the AFDs and keys extraction process we presented heavily depend on the size of the initial dataset extracted by probing. Moreover, the size of the initial dataset also decides the number of concepts we may find for each attribute of the database. A future direction of this work is to estimate the effect of the probing technique and the size of the initial dataset on the quality of the AFDs and concept similarities we learn. Moreover, the data present in the databases may change with time. We plan to investigate ways to incrementally update the similarity values between existing concepts and develop efficient methods to compute distances between existing and new concepts without having to recompute the entire concept graph. In this paper we only looked at answering imprecise selection queries over a single database relation. Answering imprecise queries spanning multiple relations forms an interesting extension to our work.

Acknowledgements: We thank Hasan Davulcu for helpful discussions during the development of this work. This work is supported by ECR A601, the ASU Prop301 grant to the ET I3 initiative.

7. REFERENCES
[1] BibFinder: A Computer Science Bibliography Mediator. Available at: http://kilimanjaro.eas.asu.edu/.
[2] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley Longman Publishing, 1999.
[3] C. Buckley, G. Salton, and J. Allan. Automatic Retrieval with Locality Information Using Smart. TREC-1, National Institute of Standards and Technology, Gaithersburg, MD, 1992.
[4] W.W. Chu, Q. Chen, and R. Lee. Cooperative query answering via type abstraction hierarchy. Cooperative Knowledge Based Systems, pages 271–290, 1991.
[5] W.W. Chu, Q. Chen, and R. Lee. A structured approach for cooperative query answering. IEEE TKDE, 1992.
[6] W. Cohen. Integration of heterogeneous databases without common domains using queries based on textual similarity. Proc. of SIGMOD, pages 201–212, June 1998.
[7] M. Dalkilic and E. Robertson. Information Dependencies. In Proc. of PODS, 2000.
[8] N.E. Efthimiadis. Query Expansion. In Annual Review of Information Systems and Technology, Vol. 31, pages 121–187, 1996.
[9] R. Goldman, N. Shivakumar, S. Venkatasubramanian, and H. Garcia-Molina. Proximity search in databases. VLDB, 1998.
[10] T. Haveliwala, A. Gionis, D. Klein, and P. Indyk. Evaluating strategies for similarity search on the web. Proceedings of WWW, Hawaii, USA, May 2002.
[11] Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Toivonen. Efficient discovery of functional and approximate dependencies using partitions. Proceedings of ICDE, 1998.
[12] J. Kivinen and H. Mannila. Approximate Dependency Inference from Relations. Theoretical Computer Science, 1995.
[13] T. Lee. An information-theoretic analysis of relational databases, part I: Data Dependencies and Information Metric. IEEE Transactions on Software Engineering SE-13, October 1987.
[14] J.M. Morrissey. Imprecise information and uncertainty in information systems. ACM Transactions on Information Systems, 8:159–180, April 1990.
[15] A. Motro. Flex: A tolerant and cooperative user interface to database. IEEE TKDE, pages 231–245, 1990.
[16] A. Motro. Vague: A user interface to relational databases that permits vague queries. ACM Transactions on Office Information Systems, 6(3):187–214, 1998.
[17] K. Nambiar. Some analytic tools for the Design of Relational Database Systems. In Proc. of 6th VLDB, 1980.
[18] U. Nambiar and S. Kambhampati. Providing ranked relevant results for web database queries. To appear in WWW Posters 2004, May 17-22, 2004.
[19] U. Nambiar and S. Kambhampati. Answering imprecise database queries: A novel approach. ACM Workshop on Web Information and Data Management, November 2003.
[20] Yahoo! Autos. Available at http://autos.yahoo.com/.
[21] Z. Nie, S. Kambhampati, and T. Hernandez. BibFinder/StatMiner: Effectively Mining and Using Coverage and Overlap Statistics in Data Integration. In Proc. of VLDB, 2003.
[22] M. Ortega-Binderberger. Integrating Similarity Based Retrieval and Query Refinement in Databases. PhD thesis, UIUC, 2003.
DTDs versus XML Schema: A Practical Study

Geert Jan Bex, Limburgs Universitair Centrum, Diepenbeek, Belgium ([email protected])
Frank Neven, Limburgs Universitair Centrum, Diepenbeek, Belgium ([email protected])
Jan Van den Bussche, Limburgs Universitair Centrum, Diepenbeek, Belgium ([email protected])
ABSTRACT
Among the various proposals answering the shortcomings of Document Type Definitions (DTDs), XML Schema is the most widely used. Although DTDs and XML Schema Definitions (XSDs) differ syntactically, they are still quite related on an abstract level. Indeed, freed from all syntactic sugar, XML Schemas can be seen as an extension of DTDs with a restricted form of specialization. In the present paper, we inspect a number of DTDs and XSDs harvested from the web and try to answer the following questions: (1) which of the extra features/expressiveness of XML Schema not allowed by DTDs are effectively used in practice; and, (2) how sophisticated are the structural properties (i.e. the nature of regular expressions) of the two formalisms. It turns out that at present real-world XSDs only sparingly use the new features introduced by XML Schema: on a structural level the vast majority of them can already be defined by DTDs. Further, we introduce a class of simple regular expressions and find that a surprisingly high fraction of the content models belong to this class. The latter result sheds light on the justification of simplifying assumptions that sometimes have to be made in XML research.
1. INTRODUCTION
As Document Type Definitions were historically the first means to describe the structure of XML documents, a large number of them can be found on the Web. The growing success of XML, combined with certain shortcomings of DTDs, generated a large number of alternative proposals for the description of schemas, such as RELAX [12], TREX [6], Relax NG [7], DSD [11], and XML Schema [1, 9, 17]. Judging from the number of schemas one can find on the Web, XML Schema seems the most accepted one. The definition of XML Schema is nevertheless quite complicated and the necessity of various constructs is not always very clear. For this reason, we investigate a number of XSDs collected from the Web, and try to determine to what extent the features of XML Schema not occurring in DTDs are used in practice. In the second part of the paper we look at structural properties of schemas. In particular, we show that the vast majority of content models occurring in practice belong to a well-defined class of simple regular expressions. To facilitate a comparison between the two formalisms, we first describe DTDs and XSDs on a structural level.
1.1 A structural view of DTDs and XSDs
When dealing with the structure of XML documents only, it is common to view XML documents as finite ordered trees with node labels from some finite alphabet Σ. We refer to such trees as Σ-trees.

Definition 1. A DTD is a pair (d, s) where d is a function that maps Σ-symbols to regular expressions over Σ, and s ∈ Σ is the start symbol. A tree satisfies the DTD if its root is labeled by s and for every node u with label a, the sequence a1 · · · an of labels of its children matches the regular expression d(a).

The class of tree languages definable by DTDs is usually referred to as the local tree languages [4, 13]. A simple example of a DTD defining the inventory of a store is the following:

store → dvd dvd∗
dvd → title price
For clarity, in examples we write a → r rather than d(a) = r. We next recall the definition of a specialized DTD [15].

Definition 2. A specialized DTD (SDTD) is a 4-tuple (Σ, Σ′, δ, µ), where Σ′ is an alphabet of types, δ is a DTD over Σ′ and µ is a mapping from Σ′ to Σ. Note that µ can be applied to a Σ′-tree as a relabeling of the nodes, thus yielding a Σ-tree. A Σ-tree t then satisfies the SDTD if t can be written as µ(t′) where t′ satisfies the DTD δ.

As SDTDs are equivalent to unranked tree automata [4], the class of tree languages definable by SDTDs is the class of regular tree languages. The XML equivalent of that class is captured by the schema language Relax NG [7].
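As a small illustration of Definition 1, the Python sketch below checks whether a Σ-tree satisfies a DTD; children sequences are matched against the expressions d(a), written here as regexes over single-letter labels (s = store, d = dvd, t = title, p = price), an assumption made purely for brevity.

import re

dtd = {"start": "s", "d": {"s": "dd*", "d": "tp", "t": "", "p": ""}}

def satisfies(tree, dtd):
    # tree = (label, [subtrees]); True iff the root carries the start symbol and
    # every node's children word matches the corresponding content model.
    label, _ = tree
    return label == dtd["start"] and check(tree, dtd["d"])

def check(node, d):
    label, children = node
    word = "".join(child[0] for child in children)
    if re.fullmatch(d[label], word) is None:
        return False
    return all(check(child, d) for child in children)

dvd = ("d", [("t", []), ("p", [])])
print(satisfies(("s", [dvd, dvd]), dtd))   # True: a store with two dvds
print(satisfies(("s", []), dtd))           # False: at least one dvd is required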
For ease of exposition, we always take Σ′ = {ai | 1 ≤ i ≤ ka, a ∈ Σ, i ∈ N} for some natural numbers ka and set µ(ai) = a.
2. DATASET AND METHODOLOGY
A simple example of an SDTD is the following:

store → (dvd1 + dvd2)∗ dvd2 (dvd1 + dvd2)∗ dvd2 (dvd1 + dvd2)∗
dvd1 → title price
dvd2 → title price discount
Here, dvd1 defines ordinary DVDs while dvd2 defines DVDs on sale. The rule for store specifies that there should be at least two of the latter. The following restriction on SDTDs corresponds to the expressiveness of XML Schema [13]:

Definition 3. A single-type SDTD is an SDTD (Σ, Σ′, (d, s), µ) with the property that no regular expression d(a) has occurrences of types of the form bi and bj with the same b but with different i and j.

The example SDTD above is not single type as both dvd1 and dvd2 occur in the rule for store. It is shown by Murata et al [13] that the class of trees defined by single-type SDTDs is strictly between the local and the regular tree languages. An example of a single-type grammar is given below:

store → regulars discounts
regulars → (dvd1)∗
discounts → dvd2 dvd2 (dvd2)∗
dvd1 → title price
dvd2 → title price discount
Although there are still two element definitions dvd1 and dvd2 , they can only occur in a different context, regulars and discounts respectively.
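The single-type restriction of Definition 3 is easy to test mechanically; the sketch below represents types as (element name, index) pairs and checks that no content model mentions two different types of the same element. The two SDTDs are hand-transcribed versions of the dvd examples above.

def is_single_type(sdtd):
    # sdtd maps each type to the list of types occurring in its content model.
    for rhs in sdtd.values():
        seen = {}
        for name, idx in set(rhs):
            if name in seen and seen[name] != idx:
                return False   # both b^i and b^j occur in the same rule
            seen[name] = idx
    return True

sdtd_store = {   # not single-type: dvd1 and dvd2 both occur in the rule for store
    ("store", 1): [("dvd", 1), ("dvd", 2)],
    ("dvd", 1): [("title", 1), ("price", 1)],
    ("dvd", 2): [("title", 1), ("price", 1), ("discount", 1)],
}
sdtd_single = {  # single-type: dvd1 and dvd2 occur only in different contexts
    ("store", 1): [("regulars", 1), ("discounts", 1)],
    ("regulars", 1): [("dvd", 1)],
    ("discounts", 1): [("dvd", 2)],
    ("dvd", 1): [("title", 1), ("price", 1)],
    ("dvd", 2): [("title", 1), ("price", 1), ("discount", 1)],
}
print(is_single_type(sdtd_store))    # False
print(is_single_type(sdtd_single))   # True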
1.2 Related work
In 2000, while the XML Schema specification was still under development, Sahuguet [16] investigated a sample of DTDs to determine the shortcomings of the Document Type Definition specification. What he found missing has been remedied in XML Schema. Moreover, XML Schema introduces many features not envisioned by Sahuguet. One of the goals of this paper is to investigate to what measure these features are used in real world XSDs currently available. Choi [5] has tried to identify features that are characteristic for DTDs used to describe three types of schemas: application, data and meta-data related. He created a content model classification based on syntactic features and considered several measures for the complexity of a DTD. In this paper we extend his classification of content models and consider XSDs beside DTDs.
1.3 Overview
Based on the characterizations given above, it is clear that DTDs and XSDs are both grammar based where XML Schema in addition is extended with a restricted typing mechanism. In Section 3, we inspect the use of that typing mechanism in practice together with another notion added to XML Schema: derived types. In Section 4, we compare the properties of the grammars underlying real DTDs and XSDs. First, we discuss in the next section our dataset and methodology.
We have tried to gather a representative sample of DTDs and XSDs. The XML Cover pages [8] have proved to be an excellent repository so almost all schemas in our sample have been obtained automatically from that source by using a simple web crawler. To ensure that the sample contains a base set of quality DTDs and XSDs, a number of the W3C standards have been included. Among those are the DTDs for MathML, SVG, XHTML, XML Schema and SMIL and the XSD for RDF and XML Schema. All in all, 109 DTDs and 93 XSDs have been obtained. Although some 600 DTDs and XSDs are mentioned on the Cover Pages, only these 109 + 93 were actually available for download, thus illustrating once again the transient nature of the Internet and its various technologies. All 93 XSDs have been used for the analysis in Sections 3.2 and 3.3 while unfortunately only 30 of the 93 XSDs can be used for most of the analysis in Sections 3.1 and 4, due to various errors discussed in Section 6 below. In the appendix, we provide a list of some of the XSDs we used.
3. EXPRESSIVENESS OF XML SCHEMA
3.1 Single-type
The formal taxonomy presented in Section 1 elicits the following question: is the expressive power of single-type SDTDs actually used in real-world XSDs, and if so, to what extent? Given that XSDs tend to be a bit unwieldy due to their inherent verbosity, it is interesting to identify use cases for the distinctive features of single-type SDTDs versus DTDs. Those use cases might suggest a simpler formalism that finds an appropriate balance between designer-friendliness and expressive power. Surprisingly, most XSDs in our sample turn out to define local tree languages, that is, can actually be defined by DTDs. Only 5 out of 30 are true single-type SDTDs, which corresponds to approximately 15%. There might be several possible reasons for this low percentage. A first possibility is that expressiveness beyond local tree languages is simply rarely needed. Another explanation might be that due to the relatively new nature of XML Schema and its complicated definition most users have no clear view on what can be expressed. All five examples we found are of the following form:

p → . . . a1 . . .
q → . . . a2 . . .
a1 → expr1
a2 → expr2
The meaning of a1 and a2 is the following: when the parent of an a is p (resp. q), use the rule for a1 (resp. a2). No other use cases have been found in the sample.
3.2 Derived types
Two kinds of types are provided by XML Schema: simple and complex types. The former describes the character data an element can contain (cf. #PCDATA in DTDs) while the latter specifies which elements may occur as children of a given element.
              simple type (%)   complex type (%)
extension            27                37
restriction          73                 7
Table 1: Relative use of derivation features in XSDs

XML Schema facilitates derivation of new types from existing types via two mechanisms: extension and restriction. Both simple and complex types can be extended or restricted. The four cases are introduced below; for a thorough discussion, we refer to the W3C specification [9, 17].
• A complex type can be derived from a simple type by extension to add attributes to elements.
• A complex type can be extended to add a sequence of additional elements to its content model or to add attributes.
• Restricting a simple type limits the acceptable range of values for that type. For example, one can enforce that a phone number should consist of three digits, a dash, followed by six more digits.
• Restricting a complex type is similar to restricting a simple type in that it limits the set of acceptable subtrees.
Table 1 lists the number of XSDs using a particular derivation feature. Note that in this section we used all 93 XSDs we retrieved since conformance was not an issue for this analysis. Approximately one fifth of the XSDs considered do not construct new types through derivation at all. Extension is mostly used to define additional attributes (58%); elements are added to a content model to a lesser degree (42%). As expected, restriction of complex elements is hardly used (7%). A typical example of the latter mechanism is the modification of the multiplicity of an element: maxOccurs="unbounded" to maxOccurs="1". The statistics also show that only just over a third of the XSDs (37%) use extension of complex types, a feature that parallels inheritance in the object orientation paradigm. This might indicate that the data modeled by these XSDs is often too simple to merit such a (relatively) sophisticated approach. It might also be "underused" due to the relative novelty of XML Schema, since many data architects are trained to think in terms of relational data rather than object orientation. Extension of simple types occurs in 27% of the XSDs. Restriction of simple types is most heavily used (73%), which comes as no surprise since it allows a much more fine-grained control over the content of an element, rather than the unrestrictive #PCDATA DTDs are limited to, thus alleviating one of the more glaring shortcomings of DTDs. Several mechanisms have been defined to control type creation by derivation. The final attribute for type definitions indicates that the type cannot be restricted, extended, or both. Only 6 out of the 93 XSDs use this feature. As opposed to finalizing a type definition, it can also be declared abstract, implying that one should derive new types from it. Although slightly more common, it is only used in 11 XSDs in our sample.
As a general rule, derived types can occur anywhere in the content model where the original type is allowed. However, this can be prevented by applying the block attribute to the original type. As for the final attribute, replacement by either restricted, extended or both types can be blocked. Blocking is used in 2 out of the 93 XSDs. The fixed attribute that is usually used to indicate that an element or an attribute is restricted to a specific value also serves a purpose in the context of derivation from simple types. It can be applied to fix a facet of a simple type (e.g. the length of a xsd: string ) in a restrictive type derivation. Only a single XSD uses the fixed attribute in this sense. Although not directly related to derivation, the substitution group feature nevertheless deserves to be mentioned here. Elements are declared members of a substitution group using the substitutionGroup attribute with an element name as value and may occur instead in the content model akin to derived types. Substitution groups are used in 10 out of 93 XSDs.
3.3 Additional features
XML Schema defines various additional features with respect to DTDs, see [9, 18] for an introduction. One feature of SGML DTDs that was lost to XML DTDs is the &-operator that specifies that all elements must occur but that their order is not significant. Obviously this can be simulated in an XML DTD by explicitly listing all orders (e.g. a1 & a2 & a3 ≡ a1 a2 a3 | a2 a3 a1 | . . . | a3 a2 a1, so a choice between 6 cases), but this doesn't exactly improve the clarity of the content model. XML Schema restores this feature by defining the xsd:all element. However, only 4 out of 93 XSDs use this operator. Elements in an XML document can be identified using ID attributes and referred to by IDREF or IDREFS. This feature is part of the XML 1.0 specification [2] and is supported by DTDs. These IDs are unique throughout a document and are the only attributes with such a restriction for DTDs. In XML Schema, any element or attribute can be declared to require a unique value by selecting the relevant nodes using an XPath expression and specifying the list of fields that combined should have a unique value. In our sample, 6 XSDs out of 93 use this feature, all applied to a single field. Referring to elements can also be accomplished by key/keyref pairs. Using a reference to a key implies that the element with the corresponding key should exist in the document. This feature is reminiscent of the foreign key concept in relational databases. It is used in 4 XSDs in our sample. An interesting feature introduced in XML Schema is the use of namespaces for modularity. This allows elements and types that are defined elsewhere to be used in the current XSD without fear of name clashes. Apart from the obvious inclusion of the XML Schema namespace, 20 XSDs in our sample used this mechanism.
A last feature to discuss is the ability to redefine types and groups. It should be noted that W3C’s primer on XML Schema cautions against the use of this feature since it may break type derivation without warning. It turns out that the authors of the XSDs in our sample set heeded this advice and avoided redefine altogether.
4. REGULAR EXPRESSION CHARACTERIZATION
The second question we try to answer is how sophisticated regular expressions tend to be in real world DTDs and XSDs. If simple expressions make up the vast majority of schema definitions, it is worthwhile to take this into account when developing implementations of XML related applications and fine-tune algorithms to take advantage of this simplicity whenever possible. In order to facilitate the analysis some preprocessing was performed. For the DTDs, parsed entities were resolved and conditional sections included/excluded as appropriate. Since we are only concerned with the schema structure, the DTD element definitions were extracted and converted to a canonical form, which abstracts away the actual node labels and replaces them by canonical names c1, c2, c3, . . . For example,

<!ELEMENT lib ((book | journal)∗)>
is represented by a canonical form (c1 | c2)∗ to preserve only the structure related DTD information. The XSDs were preprocessed using XSLT to the canonical representation mentioned above for DTDs. To capture multiplicity constraints, '?' is used, e.g. for an element a with minOccurs="1" and maxOccurs="3", a a? a? is substituted. This approach allows us to reuse much of the software developed to analyze DTDs for XSDs as well. For all DTDs, there is a total of 11802 element definitions which reduce to 750 distinct canonical forms. The 1016 element definitions in the XSDs yield 138 distinct canonical forms, totaling 838 for both types of schemas combined. The majority of these can be classified in one of the categories of "simple expressions", which are subclasses of the expressions studied by Martens, Neven, and Schwentick [14].
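The canonicalization step itself is straightforward; a minimal sketch (assuming content models are written with plain element names and the usual operators) replaces names by c1, c2, . . . in order of first occurrence.

import re

def canonical_form(content_model):
    mapping = {}
    def rename(match):
        name = match.group(0)
        if name not in mapping:
            mapping[name] = "c%d" % (len(mapping) + 1)
        return mapping[name]
    # Element names are matched as identifiers; operators and parentheses are kept.
    return re.sub(r"[A-Za-z_][\w.-]*", rename, content_model)

print(canonical_form("(book | journal)*"))       # (c1 | c2)*
print(canonical_form("title, author+, year?"))   # c1, c2+, c3?
print(canonical_form("(a | b)*, a?"))            # (c1 | c2)*, c1?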
Definition 4. A base symbol is a regular expression a, a?, or a∗ where a ∈ Σ; a factor is of the form e, e∗, or e? where e is a disjunction of base symbols. A simple regular expression is ε, ∅, or a sequence of factors.

The following is an example of a simple regular expression: (a∗ + b∗)(a + b)? b∗ (a + b)∗. We introduce a uniform syntax to denote subclasses of simple regular expressions by specifying the allowed factors. We distinguish base symbols extended by ? or ∗. Further, we distinguish between factors with one disjunct or with arbitrarily many disjuncts; the latter is denoted by (+ · · · ). Finally, factors can again be extended by ∗ or ?. For example, we write RE((+ a)∗, a?) for the set of regular expressions e1 · · · en where every ei is (a1 + · · · + an)∗ for some a1, . . . , an ∈ Σ and n ≥ 1, or a? for some a ∈ Σ. Table 2 provides an overview.

Factor               Abbr.    Factor                   Abbr.
a                    a        (a1 + · · · + an)∗       (+a)∗
a∗                   a∗       (a1 + · · · + an)?       (+a)?
a?                   a?       (a1∗ + · · · + an∗)      (+a∗)
(a1 + · · · + an)    (+a)     (a1∗ + · · · + an∗)∗     (+a∗)∗
Table 2: Possible factors in simple regular expressions and how they are denoted (a, a1, . . . , an ∈ Σ).

We analyze the DTDs and XSDs to characterize their content models according to the subclasses defined above. The result is represented in Table 3, which lists the non-overlapping categories of expressions having a significant population (i.e. more than 0.5%). Two prominent differences between DTDs and XSDs immediately catch the eye: XSDs have (1) more simpleType elements (denoted by #PCDATA) and (2) fewer expressions in the category RE(a, (+ a)∗). The first difference is due to the fact that it pays to introduce more distinct simpleType elements in XSDs since, thanks to type restriction, it is now possible to fine tune the specification of an element's content (cf. the discussion in Section 3.2). The second distinction is most probably due to the nature of the XSDs in the sample, since those describing data are overrepresented with respect to those describing meta documents [5]. The latter tend to have more complex recursive structures than the former.

                          DTDs (%)   XSDs (%)
#PCDATA                      34         48
EMPTY                        16         10
ANY                           1          0
RE(a)                         5          5
RE(a, a?)                     2         10
RE(a, a∗)                     8         10
RE(a, a?, a∗)                 1          4
RE(a, (+ a))                  3          3
RE(a, (+ a)?)                 0          1
RE(a, (+ a)∗)                20          2
RE(a, (+ a)?, (+ a)∗)         0          1
RE(a, (+ a∗)∗)                0          2
total simple expr.           92         97
non-simple expr.              8          3
Table 3: Relative occurrence of various types of regular expressions given in % of element definitions

To gauge the quality of our sample of XSDs, we compared DTDs and XSDs using several of the measures proposed by Choi [5]. No significant differences between the two samples are observed, which is confirmed by an additional measure in Figure 1, the density of XSDs. The density of a schema is defined as the number of elements occurring in the right hand side of its rules divided by the number of elements. DTDs and XSDs do not fundamentally differ in this respect. Several other measures such as the width and depth of canonical forms viewed as expression trees show no significant differences.

Figure 1: Fraction of DTDs (left) and XSDs (right) versus their density

More importantly though, it is clear that the vast majority of expressions are simple, i.e. 92% and 97% of all element definitions in DTDs and XSDs respectively. Figure 2 shows the fraction of DTDs and XSDs versus the fraction of their simple content models: the majority of documents have 90% or more simple content models.

Figure 2: Fraction of DTDs (left) and XSDs (right) having a given % of simple expression content models

The relative simplicity of most DTDs and XSDs is further illustrated by the star height that is given in Table 4. The star height of a regular expression is the maximum nesting depth of Kleene stars occurring in the expression, e.g. 2 for the fourth example given below, 1 for all others. Content models with star height larger than 1 are very rare. No significant differences are observed between DTDs and XSDs, except for the star height, but this is consistent with the relative abundance of the RE(a, (+ a)∗) type of expressions in DTDs with respect to XSDs.

star height   DTDs (%)   XSDs (%)
0                61         78
1                38         17
2                 1          4
3                 0         ≈0
Table 4: Star height observed in DTDs and XSDs

In a sense this should not come as too great a surprise: DTDs and XSDs model data that reflect real world entities. Mostly those entities are subject to simple relations among one another such as is-a or is-part relations (pertainym¹, holonym/meronym relations) that are very often quite simple to express. Some randomly chosen examples of non-simple regular expressions that we encountered follow:

c1+ | (c2? c3+)
(c1 c2? c3?)? c4? (c5 | . . . | c18)∗
c1? (c2 c3?)? (c4 | . . . | c44)∗ c45+
c1? c2 c3? c4? (c5+ | ((c6 | . . . | c61)+ c5∗))∗
c1 (c2 | c3)∗ (c4, (c2 | c3 | c5)∗)

¹ meaning relating to or pertaining to

5. SCHEMA AND AMBIGUITY
The XML 1.0 specification published by the W3C [2] requires schema definitions to be one-unambiguous, i.e. that all regular expressions in the grammar's rules are deterministic in the following sense [3]: a regular expression is one-unambiguous iff the corresponding Glushkov automaton is deterministic. Note that the terminology is somewhat confusing in the literature: in the context of SGML 'unambiguous' is used to denote this feature while Choi [5] refers to it as 'deterministic'. We checked whether the DTDs and XSDs in our sample respect this requirement and find that they almost all do. IBM's XML Schema Quality Checker (SQC) [10] reported 3 out of 93 XSDs as having one or more ambiguous content models (see also Section 6). For DTDs, a first exception is a regular expression of the following type: (. . . | ci | . . . | ci | . . .)∗ that occurred in several DTDs. However, this is merely a typo, not a design feature.

A second type of one-unambiguous regular expression proves to be more interesting: c1 c2? c2?. The designer's intention is clearly to state that c2 may occur zero, one or two times. The latter example illustrates a shortcoming of DTDs that has been addressed in XML Schema. Element definitions in the latter formalism allow the specification of the number of times an element can occur using the minOccurs and maxOccurs attributes. The specification for the example above would be captured in XML Schema by declaring c2 with minOccurs="0" and maxOccurs="2" (with slight abuse of notation).

We found three XSDs defining non-deterministic (or ambiguous) content models. Two canonical forms are found: c1? (c1 | c2)∗ and (c1 c2) | (c1 c3).

6. ERRORS
It was a bit disappointing to notice that a relatively large fraction of the XSDs we retrieved did not pass a conformance test by SQC. As mentioned in Section 2, only 30 out of a total of 93 XSDs were found to be adhering to the current specifications of the W3C [17]. We decided to use only conforming XSDs for those parts of the analysis that require
conversion to canonical form to ensure correct processing by our software. Often, lack of conformance can be attributed to growing pains of an emerging technology: the SQC validates according to the 2001 specification and 19 out of the 93 XSDs have been designed according to a previous specification. Some simple types have been omitted or added from one version of the specs to another causing the SQC to report errors. Some errors concern violation of the Datatypes part of the specification [1]: regular expressions restricting xsd:string are malformed. Some XSDs violate the XML Schema specification by e.g. specifying a type attribute for a complexType element or leaving out the name attribute for a top-level complexType element.
7. CONCLUSION
Our analysis has shown that many features defined in the XML Schema specification are not widely used yet, especially those that are related to object oriented data modeling such as derivation of complex types by extension. Most importantly, it turns out that almost all XSDs are local tree grammars, i.e. proper single type grammars are rarely used. The expressive power encountered in real world XSDs turns out to be mostly equivalent to that of DTDs. Hence it seems that — barring some exceptions — the current generation of XSDs could just as well have been written as DTDs from the point of view of structure. This might change in the future, as acceptance of a relatively new technology increases, or it might be a symptom that the level of sophistication offered by XML Schema is simply unnecessary for many applications. The data type part of the XML Schema specification is heavily used though since it alleviates a glaring shortcoming of DTDs, namely the ability to specify the format and type of the text of an element. This is accomplished through restriction of simple types. The content models specified in both real world DTDs and XSDs tend to be very simple. For XSDs, 97% of all content models can be classified in the categories of simple expressions we identified. This observation can guide software engineers when developing new implementations of XML related tools and applications, for instance by avoiding optimizations for complex cases that rarely occur in practice.
8. REFERENCES
[1] P. Biron and A. Malhotra. XML Schema part 2: datatypes. W3C, May 2001, http://www.w3.org/TR/xmlschema-2/
[2] T. Bray, J. Paoli, C. M. Sperberg-McQueen, E. Maler, and F. Yergeau. Extensible Markup Language (XML) 1.0. W3C, 3rd edition, February 2004, http://www.w3.org/TR/2004/REC-xml-20040204/
[3] A. Brüggemann-Klein and D. Wood. One-unambiguous regular languages. Information and Computation, 140(2):229–253, 1998.
[4] A. Brüggemann-Klein, M. Murata, and D. Wood. Regular tree languages over non-ranked alphabets (draft 1). Unpublished manuscript, 1998.
[5] B. Choi. What are real DTDs like? In Proceedings of WebDB 2002, pages 43–48, 2002.
[6] J. Clark. TREX - Tree Regular Expressions for XML: language specification, February 2001, http://www.thaiopensource.com/trex/spec.html
[7] J. Clark and M. Murata. RELAX NG Specification. OASIS, December 2001, http://www.oasis-open.org/committees/relax-ng/spec-20011203.html
[8] R. Cover. The cover pages, 2003, http://xml.coverpages.org/
[9] D. Fallside. XML Schema part 0: primer. W3C, May 2001, http://www.w3.org/TR/xmlschema-0/
[10] IBM Corp. XML Schema Quality Checker, 2003. http://www.alphaworks.ibm.com/tech/xmlsqc
[11] A. Møller. Document Structure Description 2.0. BRICS, 2003, http://www.brics.dk/DSD/dsd2.pdf
[12] M. Murata. Document description and processing languages – regular language description for XML (RELAX): Part 1: RELAX core. Technical report, ISO/IEC, May 2001.
[13] M. Murata, D. Lee, M. Mani, and K. Kawaguchi. Taxonomy of XML schema languages using formal language theory. To be submitted to ACM TOIT, 2003.
[14] W. Martens, F. Neven, and T. Schwentick. Complexity of Decision Problems for Simple Regular Expressions. Submitted.
[15] Y. Papakonstantinou and V. Vianu. DTD inference for views of XML data. In PODS proceedings, pages 35–46, 2000.
[16] A. Sahuguet. Everything you ever wanted to know about DTDs, but were afraid to ask. In Proceedings of WebDB 2000, 2000.
[17] H. Thompson, D. Beech, M. Maloney, and N. Mendelsohn. XML Schema part 1: structures. W3C, May 2001, http://www.w3.org/TR/xmlschema-1/
[18] E. van der Vliet. XML Schema. O'Reilly, Cambridge, 2002.

APPENDIX
A list of XSDs used in the regular expression and single-type analysis (with number of definitions in brackets) and a few of the XSDs considered in the other parts of the paper. DSML v2 (1), EPAL-cs-xacml-schema-policy (34), EPALepal-interface (12), epal-interface (12), extensions (13), ipdr 2.5 (14), mets (1), OAI DC (1) OAI GetRecord (9), ODRLEX v1.0 (25), ODRL-EX v1.1 (23), PersonName v1.2 (8), PIDXLib-2002-02-14 v1.0 (255), PMXML2 (1), PostalAddress v1.2 (16), RIXML2 (1), simpledc (15), TC-1025 schema v1.0 xpdl (91), UKGovTalk-BS7666 v1.2 (68), VRXML 20021204 (43), wsrp v1.0 types (1), WS-Security-Schema-xsd-20020411 (7), wsui (26), xgmml (8), xpdl (91), BPML, GenXML v1.0, GML Base, HEML, LogML, MPEG21, PSTC-CS v1.0, RDF, UDDI v2.0, XML Schema
On validation of XML streams using finite state machines

Cristiana Chitic, Department of Computer Science, University of Toronto, Toronto, ON, Canada ([email protected])
Daniela Rosu, Department of Computer Science, University of Toronto, Toronto, ON, Canada ([email protected])
ABSTRACT
We study validation of streamed XML documents by means of finite state machines. Previous work has shown that validation is in principle possible by finite state automata, but the construction was prohibitively expensive, giving an exponential-size nondeterministic automaton. Instead, we want to find deterministic automata for validating streamed documents: for them, the complexity of validation is constant per tag. We show that for a reading window of size one and nonrecursive DTDs with one-unambiguous content (i.e. conforming to the current XML standard) there is an algorithm producing a deterministic automaton that validates documents with respect to that DTD. The size of the automaton is at most exponential and we give matching lower bounds. To capture the possible advantages offered by reading windows of size k, we introduce k-unambiguity as a generalization of one-unambiguity, and study the validation against DTDs with k-unambiguous content. We also consider recursive DTDs and give conditions under which they can be validated against by using one-counter automata.
1. INTRODUCTION
As an increasing number of organizations and individuals are using the Internet, the ability to manipulate information from various sources is becoming a fundamental requirement for modern information systems. The XML data format has been adopted as the common format for data exchange on the Web, and, to facilitate query answering, XML data needs to be validated against DTDs and XML-Schemas. Streamed data is data originating from sources that provide a continuous flow of information. Storing the flowing data stream in order to process it is not a feasible solution since it would normally require large amounts of memory. To cope with this issue, one needs to find ways to process data as it comes without going back and forth in the stream. The symbols in the stream are the element tags and the included text in the order of their appearance in the XML document. Since we are interested in checking only structural properties, such as DTD conformance, we ignore the data values and consider the stream as a sequence of opening and closing
tags. Checking that an XML stream conforms to a DTD in a single pass under memory constraints is referred to as the on-line validation of streamed XML documents [13]. There are two types of on-line validation considered in [13]: strong validation, which includes checking well-formedness of the input document, and validation, which assumes that the input is a well-formed document. In this work, we investigate on-line validation of XML documents using finite-state machines. The nonrecursive DTDs conforming to the current standard [14] (i.e. with one-unambiguous content) correspond to a reading window of size one. Further, we introduce k-unambiguous regular expressions as a generalization of one-unambiguous regular expressions and study nonrecursive DTDs with kunambiguous content. In both cases the finite machines are deterministic. The advantage is having constant time complexity of validation for each open/closed tag. We prove exponential lower and upper bounds for the size of the minimal deterministic automaton used for strong validation. According to statistical data [6], nonrecursive DTDs are more frequent than recursive DTDs, but still recursive DTDs are used commonly in practice. For on-line validation against recursive DTDs we propose using one-counter automata. We also give syntactic conditions under which recursive DTDs can be recognized by one-counter automata. Related Work On-line validation of streamed XML documents under memory constraints has also been studied in [13]. One of the results there is that for any nonrecursive DTD one can construct a finite automaton (a particular case of a result in [12]). An algorithm was given that constructs for any nonrecursive DTD a finite automaton that can be used to perform strong validation. However, the resulting automaton was nondeterministic, exponential in the size of the DTD and had exponential (in the size of the DTD) pertag complexity of validation. Validating against recursive DTDs was also considered in [13]. Under the restrictive assumption that the input stream is well-formed, they present a class of recursive DTDs validated by finite automata. Strong validation against recursive DTDs can be performed by push-down automata. However, their disadvantage is having a stack of size proportional, in the worst case, to the size of the XML stream. This approach was discussed in [13, 10]. In our approach the space required by the counter is logarithmic in the size of the input stream. One-unambiguous regular expressions [4] reflect the requirement that a symbol in the input word be matched uniquely to a position in the regular expressions without looking ahead in the word. We generalize this concept in a
different way than in the extension considered in [7]. With a lookahead of k symbols we want to determine the next, unique, matching position in the regular expression, while in the approach considered in [7] a lookahead of at most k symbols determines uniquely the next k positions. We assume the reader is familiar with basic notions of language theory: (nondeterministic) deterministic finite-state automata ((N)DFSA), context-free grammars (CFG) and languages (CFL), and extended context-free grammars (ECFG) (see, e.g., [2, 12]). The paper is organized as follows. Section 2 describes the problem of validating XML streams against DTDs. Section 3 presents the canonical XML-grammar associated to any DTD and also the size of the minimal automaton used for validating nonrecursive DTDs. In Section 4 we establish results for DTDs with one-unambiguous and k-unambiguous content. Finally, in Section 5 we investigate strong validation against recursive DTDs.
2. THE VALIDATION PROBLEM
Let Σ be a finite alphabet. An XML document is abstracted as a tree document. A tree document over Σ is a finite ordered tree with labels in Σ. Formally, a string representation denoted [t] is associated to each tree document t as follows: if t is a single node labeled a, then [t] = a ā; if t consists of a root labeled a and subtrees t1 ... tk, then [t] = a [t1] ... [tk] ā, where a and ā are the opening and closing tags. Let Σ̄ = {ā | a ∈ Σ} denote the alphabet of closing tags. An XML document is a well-formed document if the string representation corresponding to the XML tree is well-balanced. If T is a set of tree documents, L(T) denotes the language consisting of the string representations of the tree documents in T. A DTD (Document Type Definition) [14] D = (Σ, R, r), ignoring attribute elements, consists of a finite set R of rules of the form a → Ra, where a ∈ Σ and Ra is a regular expression over Σ, together with the root r ∈ Σ. The set of tree documents satisfying a DTD D is denoted by SAT(D). The language over Σ ∪ Σ̄ consisting of the string representations of all tree documents in SAT(D) is defined as L(D) = {[t] | t ∈ SAT(D)}. The dependency graph Dg of a DTD D is the graph whose set of nodes is Σ, and for each rule a → Ra in the DTD there is an edge from a to b for each b occurring in some word in L(Ra). A DTD is nonrecursive if and only if Dg is acyclic and is recursive if and only if Dg is cyclic. The problem of validating an XML stream with respect to a DTD D is defined as checking that the string representation of the XML document is contained in the associated language L(D).
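The dependency-graph test is easy to mechanize. The following Python sketch is our own illustration (the dictionary encoding of a DTD and all identifiers are ours, not taken from the paper): it records, for every element, the set of symbols occurring in its content model and classifies the DTD as recursive or nonrecursive by cycle detection on Dg.

# Illustrative sketch (our own encoding): a DTD is represented as a mapping
# from each element name to the set of symbols that occur in its content
# model; the content-model expressions themselves are not needed for Dg.
def is_recursive(dtd):
    # Depth-first search for a cycle in the dependency graph Dg,
    # which has an edge a -> b whenever b occurs in Ra.
    WHITE, GREY, BLACK = 0, 1, 2
    color = {a: WHITE for a in dtd}

    def dfs(a):
        color[a] = GREY
        for b in dtd.get(a, set()):
            if color.get(b, WHITE) == GREY:
                return True                      # back edge: Dg is cyclic
            if color.get(b, WHITE) == WHITE and b in dtd and dfs(b):
                return True
        color[a] = BLACK
        return False

    return any(dfs(a) for a in dtd if color[a] == WHITE)

# The DTD of Example 3.1 (r -> a*b, a -> bc, b -> c+, c -> empty):
example = {"r": {"a", "b"}, "a": {"b", "c"}, "b": {"c"}, "c": set()}
print(is_recursive(example))    # False: Dg is acyclic, so the DTD is nonrecursive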
3. CANONICAL XML-GRAMMARS

In the context of streaming, since the attributes are ignored, a DTD appears to be a special kind of extended context-free grammar. The formal grammar that captures explicitly the opening and closing tags is the XML-grammar, first introduced in [1]. Given a DTD D = (Σ, R) we denote a corresponding XML-grammar by DECF = (N, Σ ∪ Σ̄, ROOT, P), where Σ ∪ Σ̄ is the set of terminals, N is the set of nonterminals and N is in 1-1 correspondence with Σ, ROOT is the start symbol and P is the set of productions. The set P contains only rules of the following types:
ROOT → r RROOT r̄; A → a RA ā; A → a ā, where A ∈ N, a ∈ Σ and ā ∈ Σ̄. RROOT (respectively RA) is a regular expression containing only nonterminals and corresponds to the nonterminal ROOT (respectively A). Since XML-grammars [1] were studied without providing a way to link them with DTDs, we give an algorithm that transforms a DTD D into an XML-grammar DECF such that L(DECF), i.e. the language of the grammar DECF, is the same as L(D). We call the grammar produced by the algorithm the canonical XML-grammar associated to the DTD D. Let D = (Σ, R, r) be a DTD, Π be an alphabet such that Π ∩ Σ = ∅ and ROOT be a symbol such that ROOT ∈ Π. Π is the alphabet of nonterminals in the canonical XML-grammar. Let f : RegExp(Σ) ∪ {r} → RegExp(Π) ∪ {ROOT} be a function such that f(r) = ROOT, f(Σ) = Π, and the restriction of f to Σ is a bijection. The set of terminals of the canonical XML-grammar is T = Σ ∪ Σ̄. The set of productions P of the canonical XML-grammar consists of modifications of the rules of the DTD, where rules of the form a → Ra are transformed into productions of the form f(a) → a f(Ra) ā. The output is the canonical XML-grammar DECF = (N, T, ROOT, P). The canonical XML-grammar associated to a DTD is instrumental in proving the results of this paper.

Example 3.1. Consider the DTD D over Σ = {r, a, b, c} with the rules: r → a∗b, a → bc, b → c⁺, c → ε. The algorithm gives the XML-grammar DECF = (N, Σ ∪ Σ̄, ROOT, P), where N = {ROOT, A, B, C} and the set of productions P is: {ROOT → r A∗B r̄, A → a BC ā, B → b C⁺ b̄, C → c c̄}.

Given a DTD D, the language L(D) is the same as the language generated by the canonical XML-grammar that corresponds to D. Thus, validating an XML stream with respect to a DTD D becomes equivalent to checking that the stream belongs to L(DECF), which is the definition we will use throughout the paper. The canonical XML-grammar DECF corresponding to a nonrecursive DTD D is nonrecursive, thus the language L(DECF) is regular [12]. We give an algorithm of bottom-up substitution that takes as input a canonical nonrecursive XML-grammar DECF and returns a regular expression that generates the language L(DECF) [5]. The regular expression that generates L(DECF) is the regular expression corresponding to the nonterminal ROOT, computed by the bottom-up substitution algorithm.

Example 3.2. Let D and DECF be the DTD, respectively the XML-grammar, from the previous example. The regular expression obtained by bottom-up substitution is r(a(b(cc̄)⁺b̄ cc̄)ā)∗(b(cc̄)⁺b̄)r̄.

The regular expression obtained by applying bottom-up substitution to a DECF is used to construct an automaton which is used as a validating tool. The size of the automaton is linear in the size of the expression. An algorithm was presented in [13] which yields a validating nondeterministic automaton whose size is exponential in the size of the DTD. Here we show that even for the minimal automata the lower bound on their size is still exponential with respect to the size of the DTD. In order to compute the lower bound we partition the symbols of the alphabet into strata, defined below.
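A possible reading of the bottom-up substitution step is sketched below; the encoding (single-letter element names, uppercase letters for closing tags) is our own and is only meant to illustrate the idea, not to reproduce the algorithm of [5] verbatim. Each rule a → Ra is rewritten to a f(Ra) ā once the expressions for all symbols occurring in Ra are available; on the DTD of Example 3.1 this reproduces the expression of Example 3.2.

# Illustrative sketch of bottom-up substitution (our own encoding, assuming a
# nonrecursive DTD in which every element name has a rule).  Content models
# are strings over single-letter element names; the closing tag of element
# "x" is written as the uppercase letter "X".
dtd = {             # Example 3.1:  r -> a*b, a -> bc, b -> c+, c -> empty
    "r": "a*b",
    "a": "bc",
    "b": "c+",
    "c": "",
}

def bottom_up_expression(dtd, root="r"):
    expr = {}                         # element name -> regular expression
    remaining = dict(dtd)
    while remaining:                  # process rules whose bodies are ready
        for a, body in list(remaining.items()):
            if all(ch not in remaining for ch in body if ch.islower()):
                rewritten = "".join(
                    "(" + expr[ch] + ")" if ch.islower() else ch for ch in body
                )
                expr[a] = a + rewritten + a.upper()     # a f(Ra) closing-a
                del remaining[a]
    return expr[root]

print(bottom_up_expression(dtd))
# r(a(b(cC)+B)(cC)A)*(b(cC)+B)R  -- the expression of Example 3.2, up to
# parenthesisation, with X standing for the closing tag of x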
Definition 3.3. Stratum 0 of a nonrecursive DTD is the set of all symbols in Σ such that the right-hand sides of the corresponding rules contain only the symbol ε. Stratum i of a DTD is the set of all symbols in Σ such that the right-hand sides of the corresponding rules contain only symbols from Stratum 0, ..., Stratum i − 1, with at least one symbol from Stratum i − 1. We define the depth of a nonrecursive DTD to be the corresponding number of strata. We parameterize the size of a nonrecursive DTD by its depth and the maximum length of a right-hand side of a rule.

Remark 3.4. For a nonrecursive DTD, the number of strata is finite and a symbol in the alphabet belongs to only one stratum. Also, in general, the depth of a DTD is not related to the number of productions.

Let D be a nonrecursive DTD. Let d be its depth and let L be the maximum number of symbols that appear in the body of a rule.

Lemma 3.5. [5] The size of the minimal automaton AD that recognizes L(D) is bounded from above by 1 + 2 · (L^d − 1)/(L − 1).

We show that the bound is tight by considering the following example.

Example 3.6. Let D = ({r, a1, ..., an}, R) be a nonrecursive DTD such that the set of rules is R = {r → a1 a1, a1 → a2 a2, ..., an−1 → an an, an → ε}. The regular expression corresponding to this DTD, obtained by applying the bottom-up substitution algorithm to the associated DECF, is r(a1(a2(...(an−1(an ān)² ān−1)²...)² ā2)² ā1)² r̄. The minimal automaton that recognizes L(D) has 1 + 2 · (2^(n+1) − 1) states, where n + 1 is the depth of D and the maximum length of a production in R is 2.
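The bound of Lemma 3.5 can be evaluated mechanically once the strata are known. The sketch below is our own illustration (the DTD is given as a mapping from symbols to the list of symbols in the rule body): it computes the stratum of every symbol, the depth d and the maximum rule length L, and evaluates 1 + 2·(L^d − 1)/(L − 1); on Example 3.6 with n = 3 it returns 31 = 1 + 2·(2^4 − 1), matching the number of states stated there.

# Illustrative sketch (our own encoding): compute the strata, the depth d,
# the maximum rule length L (assumed >= 2) and the bound of Lemma 3.5 for a
# nonrecursive DTD given as {symbol: list of symbols in the rule body}.
def stratum_of(dtd):
    stratum, pending, level = {}, dict(dtd), 0
    while pending:                    # assumes the DTD is nonrecursive
        ready = [a for a, body in pending.items()
                 if all(b in stratum for b in body)]
        for a in ready:
            stratum[a] = level
            del pending[a]
        level += 1
    return stratum

def lemma_3_5_bound(dtd):
    strata = stratum_of(dtd)
    d = max(strata.values()) + 1                  # depth = number of strata
    L = max(len(body) for body in dtd.values())   # maximum rule length
    return 1 + 2 * (L**d - 1) // (L - 1)

# Example 3.6 with n = 3:  r -> a1 a1, a1 -> a2 a2, a2 -> a3 a3, a3 -> empty.
dtd = {"r": ["a1", "a1"], "a1": ["a2", "a2"], "a2": ["a3", "a3"], "a3": []}
print(lemma_3_5_bound(dtd))           # 31 = 1 + 2*(2**4 - 1), as in Example 3.6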
4. STREAMED VALIDATION OF NONRECURSIVE DTDS
We now investigate the problem of validating XML streams against nonrecursive DTDs. The XML standard [14] requires that the regular expressions associated with each production match uniquely a position of a symbol in the expression to a symbol in an input word without looking beyond that symbol. This scenario corresponds to processing the stream with a deterministic automaton having a reading window of size one. The regular expressions described by the standard are the one-unambiguous regular expressions introduced in [4]. Before presenting the results of this section, we recall the definition and some characterizations of one-unambiguous regular expressions [4]. To denote different occurrences of the same symbol in an expression, all the symbols are marked with unique subscripts. For example, a marking of the expression b((a + bc)∗|(abd)∗) is b7((a1 + b2c3)∗|(a4b5d6)∗). The set of symbols of an expression E is denoted by sym(E). For an expression E, we denote its marking by E′. Each subscripted symbol is called a position. For a given position x, χ(x) indicates the corresponding symbol in Σ without the subscript. Formally, an expression E is defined to be one-unambiguous if and only if, for all words u, v, w over sym(E′) and all symbols x, y in sym(E′), if the words uxv, uyw ∈ L(E′) and x ≠ y, then χ(x) ≠ χ(y). One possible method to convert a regular expression into a finite
automaton was proposed in [8]. In the Glushkov automaton of an expression E, the states correspond to positions in E′ and the transitions connect positions that are consecutive in a word in L(E′). For each expression E, the following sets are defined: First(E′), the set of positions that match the first symbol of some word in L(E′); Last(E′), similarly for the last positions; and Follow(E′, z), the set of positions that can follow position z in some word in L(E′). A Glushkov automaton corresponding to a regular expression is constructed using the sets defined above. The Glushkov automaton has as many states as the number of positions in the corresponding marked expression plus one.

Definition 4.1. A DTD D (an XML-grammar DECF) is one-unambiguous if all the rules (productions) have one-unambiguous regular expressions in their right-hand sides.

The canonical XML-grammar associated to a one-unambiguous DTD is also one-unambiguous.

Theorem 4.2. Let D = (Σ, R) be a nonrecursive one-unambiguous DTD. Then the language L(D) is one-unambiguous.

To prove that the language L(D) (or the language L(DECF) generated by the canonical grammar DECF) is one-unambiguous it is sufficient to show that there exists a one-unambiguous expression that generates the language. The solution to this problem is the regular expression obtained by bottom-up substitution [5].

Corollary 4.3. There is an algorithm that, given a nonrecursive one-unambiguous DTD D, constructs a deterministic Glushkov automaton GD such that GD accepts precisely the language consisting of the string representations of the documents that conform to D.

The algorithm in [13] constructs an exponential-size nondeterministic automaton, which yields per open/closed tag complexity that is exponential in the size of the DTD. In contrast, for one-unambiguous DTDs, we can construct a deterministic automaton that verifies conformance to the DTD for streaming documents. This yields constant per open/closed tag complexity.

Remark 4.4. If the validation of nonrecursive DTDs is performed using their corresponding Glushkov automata, the exponential lower and upper bounds on the size of these automata with respect to the size of the DTDs are the same as the bounds shown in the previous section.

From Example 3.6 and Remark 4.4 it follows that the exponential bound on the size of the Glushkov automata accepting L(D) for a one-unambiguous DTD D cannot be improved. The lower bound on the size of the minimal deterministic automata accepting L(D) is also exponential in the size of D. We consider now the case of DTDs with non-deterministic content, which appear in practice in a variety of fields [6, 5]. To model these types of DTDs we introduce the k-unambiguous regular expressions. They are a generalization of one-unambiguous regular expressions, different from the one considered in [7]. Informally, a k-unambiguous expression matches uniquely a position of a symbol in the expression to a symbol in an input word by looking ahead k symbols. Practically, an XML stream can be validated against
a nonrecursive DTD with k-unambiguous content by a finite automaton that uses a reading window of size k to move to a unique state.

Definition 4.5. A regular expression E is k-unambiguous, where k ≤ |E|, if and only if for all words u, v, w, x = x1 ... xk and y = y1 ... yk over sym(E′), if ux1...xk v ∈ L(E′), uy1...yk w ∈ L(E′) and xk ≠ yk, then χ(x1) ... χ(xk) ≠ χ(y1) ... χ(yk).

Example 4.6. Consider the regular expression b((a + bc)∗|(abd)∗) marked as b7((a1 + b2c3)∗|(a4b5d6)∗). This regular expression is not one-unambiguous: given the word babc it cannot be decided whether the symbol a following the symbol b matches position a1 or a4. By looking ahead 4 symbols we can match the occurrence of symbol c with position c3.

Example 4.7. The following 3-unambiguous content model stand_point → ((back_sight, fore_sight | back_sight, fore_sight, back_sight), intermidiate_sight∗, info_stand?, info_i∗) appears in a DTD that describes digital measurements, http://gama.fsv.cvut.cz/~soucek/dis/convert/dtd/dnp-1.00.dtd.

Another way of characterizing k-unambiguous regular expressions is using Glushkov automata.

Definition 4.8. A Glushkov automaton G = (Q, Σ, δ, F) is k-deterministic if for every state p ∈ Q and every word a1 ... ak over Σ, the extended transition δ∗(p, a1...ak) contains at most one state.

In other words, we call a Glushkov automaton k-deterministic if from any state, following all paths labeled a1 ... ak, we reach at most one state. Given a marked expression E′ we define the sets: First(E′, k) = {w over sym(E′) | there is a word u s.t. wu ∈ L(E′) and |w| = k}, Follow(E′, z, k) = {w over sym(E′) | there are words u and v s.t. uzwv ∈ L(E′) and |w| = k}, for all symbols z ∈ sym(E′), and Last(E′, k) = {w over sym(E′) | there is a word u s.t. uw ∈ L(E′) and |w| = k}. These sets can be computed in time polynomial in the size of E′, and thus we can give a polynomial-time algorithm to check whether a regular expression is k-unambiguous, for a given k. The worst-case time complexity is O(|E′|^(k+1)). There are regular expressions that are not k-unambiguous for any number k.

Example 4.9. The Glushkov automaton in Figure 1(i) is not k-deterministic for any k ∈ N, since δ∗(a1, aaaaaaaaa) = {a1, a3}. The Glushkov automaton in Figure 1(ii) is 3-deterministic.

Proposition 4.10. (a) If a regular expression E is k-unambiguous then E is (k+1)-unambiguous. (b) A regular expression E is k-unambiguous if and only if the Glushkov automaton corresponding to E is k-deterministic.

Given a nonrecursive DTD D = (Σ, R), we obtain, by applying our bottom-up substitution algorithm to the canonical XML-grammar DECF, a regular expression ED that describes the language L(DECF) = L(D). The Glushkov automaton corresponding to ED is used for validating XML streams against D. If the regular expression ED is k-unambiguous, then the automaton used for on-line validation is also k-deterministic.
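Definition 4.8 can be tested directly by enumerating all words of length k, which is where the O(|E′|^(k+1)) worst case mentioned above comes from. The sketch below is our own illustration (the transition-table encoding and the tiny automaton are assumptions made for the example, not taken from the paper); on the Glushkov automaton of a marking of a∗a it reports failure for every k, in the spirit of Example 4.9(i).

from itertools import product

# Illustrative check of Definition 4.8 (our own encoding of a nondeterministic
# transition table (state, symbol) -> set of states): an automaton is
# k-deterministic iff every word of length k leads from every state to at
# most one state.
def is_k_deterministic(states, alphabet, delta, k):
    for p in states:
        for word in product(alphabet, repeat=k):
            reached = {p}
            for symbol in word:                    # compute delta*(p, word)
                nxt = set()
                for q in reached:
                    nxt |= delta.get((q, symbol), set())
                reached = nxt
            if len(reached) > 1:
                return False
    return True

# Glushkov automaton of the marked expression a1* a2 (a marking of a*a):
states = {"q0", "a1", "a2"}
delta = {("q0", "a"): {"a1", "a2"}, ("a1", "a"): {"a1", "a2"}}
print([is_k_deterministic(states, {"a"}, delta, k) for k in range(1, 5)])
# [False, False, False, False]: not k-deterministic for any k, as in Example 4.9(i)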
Figure 1: Glushkov automata corresponding to the regular expressions (i): (a + b)∗a(a + b)(a + b) and (ii): (abc + ab∗d)d.

We define the set Σk = {a ∈ Σ | a → Ra ∈ R and Ra is k-unambiguous} for any number k. A regular expression is called finite if it denotes a finite language. We also define the set Σk^fin = {a ∈ Σ | a → Ra ∈ R and the regular expression Ra is finite and k-unambiguous} for any number k. On the dependency graph Dg we define the set Reachable_a, a ∈ Σ, to be the set of symbols contained in the subtree rooted in a. As shown in Example 4.9, there exist expressions that are not k-unambiguous for any number k. Also, similarly to the class of one-unambiguous expressions, the class of k-unambiguous expressions is not closed under union, concatenation and the star operation. Thus, we need to impose conditions on nonrecursive DTDs that guarantee that the bottom-up substitution algorithm applied to the corresponding XML-grammars yields a k-unambiguous expression.

Theorem 4.11. Let D = (Σ, R) be a nonrecursive DTD with the root rule denoted by r → Rr. Let Dg be the dependency graph of D. Assume that one of the following conditions is true:

1. The regular expression Rr is k-unambiguous and, for some number p, Reachable_a ⊆ Σp^fin for all a ∈ sym(Rr);

2. The regular expression Rr is one-unambiguous and there exists at most one a ∈ Σk such that Reachable_a ⊆ Σ1^fin, none of the elements on the path from a to the root appear under a Kleene star, and
a ∈ ⋃_{b ∈ sym(Rr)} Reachable_b.
Then there exists a number k′ such that the regular expression associated to the canonical XML-grammar DECF and obtained by bottom-up substitution is k′-unambiguous.

We illustrate the conditions of the theorem in the following two examples.

Example 4.12. Let D = ({r, a, b, c, d, e, f}, {r → (a⁺e|ac)(cd)∗, a → b, b → ce|cf, c → ε, d → ccf|cce, e → ε, f → ε}) be a nonrecursive DTD satisfying the first condition of the theorem. The expression Rr is 2-unambiguous and the rest of the rules correspond to finite 3-unambiguous expressions. The regular expression
obtained by the algorithm of bottom-up substitution is r((ab(cc̄eē|cc̄ff̄)b̄ā)⁺eē | ab(cc̄eē|cc̄ff̄)b̄ā cc̄)(cc̄d(cc̄cc̄ff̄|cc̄cc̄eē)d̄)∗ r̄, where the maximum of the lengths of the regular expressions associated to the nonterminals A, B, C, D, E, F is 14. Thus, k′ = 14.

Example 4.13. Let D = ({r, a, b, c, d, e, f}, {r → ba⁺, a → ed∗f, b → c|d⁺, c → f⁺e|fd, f → ee, d → ε, e → ε}) be a nonrecursive DTD that satisfies the second condition of the theorem. The regular expression obtained by the algorithm of bottom-up substitution applied on the canonical XML-grammar associated to D is rb(c((feēeēf̄)⁺eē|feēeēf̄dd̄)c̄|(dd̄)⁺)b̄(aeē(dd̄)∗feēeēf̄ā)⁺r̄. In this case k′ = 18, which is the length of the regular expression associated to the nonterminal C.

As the following examples show, the theorem is quite tight, since there are DTDs that deviate only slightly from the conditions of the theorem and have associated regular expressions that are not k-unambiguous.

Example 4.14. Let D = ({r, a, b, c, d}, {r → ab|ac, a → d∗, b → ε, c → ε, d → ε}) be a nonrecursive DTD for which the first condition of the theorem is not satisfied. The regular expression associated to the DTD is r(a(dd̄)∗ā bb̄ | a(dd̄)∗ā cc̄)r̄, which is not a k-unambiguous regular expression for any k ≤ 14 (the length of the expression being 14).

Example 4.15. Let D = ({r, a, b, c, d, e}, {r → ab, a → cd|ce, b → ε, c → d∗, d → ε, e → ε}) be a nonrecursive DTD for which the second condition of the theorem is not satisfied. The regular expression associated to the DTD is ra(c(dd̄)∗c̄dd̄ | c(dd̄)∗c̄eē)ā bb̄ r̄. The Glushkov automaton of this expression is not k-deterministic for any k, since the extended transition of the automaton contains more than one state when reading words r a c dd̄ dd̄ ... dd̄ of arbitrary length.
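When a deterministic (or, with a reading window of size k, a k-deterministic) automaton is available, on-line validation itself reduces to a one-pass loop with a single table lookup per opening or closing tag. The sketch below is our own illustration of that loop; the transition table, the toy DTD {r → a?, a → ε} and the uppercase-closing-tag convention are assumptions made for the example.

# Illustrative one-pass validation loop (our own encoding): `delta` is a
# deterministic transition table (state, tag) -> state, e.g. the Glushkov
# automaton of a one-unambiguous expression obtained by bottom-up
# substitution; closing tags are again written as uppercase letters.
def validate_stream(events, delta, start, final):
    state = start
    for tag in events:                 # one table lookup per open/close tag
        state = delta.get((state, tag))
        if state is None:              # no transition: the stream is invalid
            return False
    return state in final

# Deterministic automaton for the toy DTD {r -> a?, a -> empty}, i.e. the
# language {rR, raAR} (A and R denote the closing tags of a and r).
delta = {(0, "r"): 1, (1, "a"): 2, (2, "A"): 3, (1, "R"): 4, (3, "R"): 4}
print(validate_stream("raAR", delta, 0, {4}))    # True
print(validate_stream("rAR", delta, 0, {4}))     # False: unmatched closing tag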
5. STREAMED VALIDATION OF RECURSIVE DTDS
We now consider the problem of strongly validating an XML stream against a recursive DTD using (restricted) one-counter automata. Recursive DTDs appear in fields like computational linguistics, web distributed data exchange, financial reporting, etc. [5].

Example 5.1. The following recursive content model exception → (type, msg, contextdata∗, exception?) appears in a DTD from the Workflow Management Coalition, http://www.oasis-open.org/cover/WFXML10a-Alpha.htm.

A one-counter automaton (1-CFSA) is a quintuple M = (Q, Σ, H, q0, F), where Q is a finite set of states, q0 ∈ Q is the initial state, F ⊂ Q is the set of final states, Σ is the finite alphabet, and the transition set H is a finite subset of Q × (Σ ∪ {ε}) × {0, 1} × {−1, 0, 1} × Q [9, 11]. Thus, the 1-CFSA consists of a finite state automaton and a counter that can hold any nonnegative integer and can only be tested for being zero or not. The move of the machine depends on the current state, the input symbol and whether the counter is zero or not. In one move, M can change state and add +1, 0 or −1 to the counter. However, the counter is not allowed to subtract 1 from 0. The machine accepts a word if it starts
in the initial state with the counter 0 and reaches a final state with the input completely scanned and the counter 0. A language accepted by a one-counter machine is called a one-counter language. We denote the family of one-counter languages by OCL. A restricted one-counter automaton (restricted-1-CFSA) is a one-counter automaton which, during the computation, cannot test whether the counter is zero [9, 2, 3]. The language accepted by a restricted one-counter automaton is called a restricted one-counter language and the family of such languages is denoted ROCL. It is known that the family of restricted one-counter languages is strictly included in the family of one-counter languages [3] and that OCL is in NSPACE(log n) [15]. Every language accepted by a one-counter automaton is a context-free language [2].
Figure 2: One-counter automaton recognizing the language {raⁿāⁿr̄ | n > 0}.
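A run of a one-counter automaton such as the one in Figure 2 is easy to simulate on the tag stream with a single integer counter. The sketch below is our own encoding of such a run (the transition format and state names are ours); it strongly validates streams against D = {r → a, a → a|ε}, i.e. the language {raⁿāⁿr̄ | n > 0} of Example 5.2 below.

# Illustrative simulation of a one-counter automaton in the style of Figure 2
# (our own encoding): transitions are written as
# (state, tag, counter-is-zero) -> (next state, counter change), and A, R
# again denote the closing tags of a and r.
def run_counter_automaton(stream, delta, start, finals):
    state, counter = start, 0
    for tag in stream:
        key = (state, tag, counter == 0)
        if key not in delta:
            return False
        state, change = delta[key]
        counter += change
        if counter < 0:                # the counter may never drop below zero
            return False
    return state in finals and counter == 0

# Strong validation against D = {r -> a, a -> a|empty}, language {r a^n A^n R}:
delta = {
    ("q0", "r", True):  ("q1", 0),
    ("q1", "a", True):  ("q2", +1),    # count the opening a's
    ("q2", "a", False): ("q2", +1),
    ("q2", "A", False): ("q3", -1),    # match each closing A against the counter
    ("q3", "A", False): ("q3", -1),
    ("q3", "R", True):  ("q4", 0),     # closing r only once the counter is back to 0
}
print(run_counter_automaton("raaaAAAR", delta, "q0", {"q4"}))   # True
print(run_counter_automaton("raaAAAR", delta, "q0", {"q4"}))    # False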
Figure 3: One-counter automaton recognizing the language {(ac)ⁿa(ε|cc̄)ā(c̄ā)ⁿbᵐb̄ᵐ}.

Example 5.2. The language corresponding to the DTD D = {r → a, a → a|ε} is {raⁿāⁿr̄ | n > 0}, and a streamed XML document can be strongly validated against D using the one-counter automaton in Figure 2.

Definition 5.3. A DTD D is (deterministic) one-counter recognizable, respectively (deterministic) restricted one-counter recognizable, if the language L(D) is in (D-)OCL, respectively (D-)ROCL. An XML-grammar DECF is (deterministic) one-counter recognizable, respectively (deterministic) restricted one-counter recognizable, if the language L(DECF) is in (D-)OCL, respectively (D-)ROCL.

We denote the family of one-counter recognizable (restricted one-counter recognizable) DTDs by OCRD (ROCRD). The family of languages generated by (restricted) one-counter recognizable DTDs is contained strictly in the family (R)OCL, since there are languages that are (restricted) one-counter but cannot be generated by a DTD, e.g. {α#β | α, β ∈ Σ∗, # ∉ Σ, |α| = |β|} [2]. The family of strongly recognizable DTDs [13] is included in (R)OCRD, since the regular languages are included in the (restricted) one-counter languages. However, both ROCRD and OCRD are incomparable with the family of recognizable DTDs [13]. We illustrate this result in the following example.

Example 5.4. Consider the DTD D = {r → aa, a → a|ε}. D is not recognizable, but is in OCRD since L(D) = {raⁿāⁿaᵐāᵐr̄ | n, m ≥ 0} ∈ OCL. Conversely, let D be
the DTD {r → a, a → a?|b, b → b?}. D is recognizable but the corresponding language, L(D) = {raⁿbᵐb̄ᵐāⁿr̄ | n ≥ 1, m ≥ 0}, is not in OCL [2], hence D is not in OCRD. Consider now the restricted one-counter recognizable DTD D = {r → a, a → a∗}. By straightforward testing of the necessary conditions provided in [13], one can prove that D is not recognizable. Finally, the DTD D = {r → a1 a2, a1 → a1?, a2 → a2?} is recognizable, but using the iteration theorem for the family of restricted one-counter languages [3] one can show that L(D) ∉ ROCL and thus D is not restricted one-counter recognizable.

Finding grammatical characterizations of (D)ROCL and (D)OCL is still an open problem. However, adapting a well-known result from [3], one can infer sufficient conditions for a DTD to be in ROCRD. We present syntactic restrictions that yield a class of recursive DTDs that are in OCRD.

Theorem 5.5. Let D = (Σ, R) be a recursive DTD and let DECF = (N, Σ ∪ Σ̄, ROOT, P) be the associated canonical XML-grammar. Let {A1, ..., An} ⊆ N be the set of recursive nonterminals in DECF and let G be the dependency graph of DECF. Assume the following conditions are true:

• every node in G appears in at most one simple cycle;

• the regular expressions RA1, ..., RAn contain only one recursive nonterminal, and that recursive nonterminal does not appear under the scope of a Kleene star;

• all the nonterminals Bi ∉ {A1, ..., An} have corresponding regular expressions obtained by bottom-up substitution.

Then the language L(DECF) = L(D) is a one-counter language, which means that D is in OCRD. Moreover, there is an algorithm to construct a one-counter automaton that can be used to validate an XML stream against the DTD D. If D has only one-unambiguous content then the resulting one-counter automaton is deterministic.

Intuitively, if the dependency graph of a DTD D satisfies the conditions of Theorem 5.5, one can find an expression describing L(D) and construct based on it a one-counter automaton that precisely accepts L(D) [5].

Example 5.6. Let D = ({r, a, b, c}, {r → ab, a → c|ε, c → a|ε, b → b|ε}) be a recursive DTD. The corresponding canonical XML-grammar DECF has the following productions: {ROOT → rABr̄, A → aCā | aā, C → cAc̄ | cc̄, B → bBb̄ | bb̄}. The language generated by the grammar is {(ac)ⁿa(ε|cc̄)ā(c̄ā)ⁿbᵐb̄ᵐ | n, m ≥ 1}, which is not a regular language. The one-counter automaton recognizing the language generated by DECF is presented in Figure 3.

Relaxing slightly the conditions of the theorem, we can find examples of DTDs whose languages are no longer one-counter.

Example 5.7. Let D = ({r, a, b}, {r → a, a → a|b|ε, b → b|ε}) be a recursive DTD. D deviates slightly from the first two conditions of the theorem. The language generated by the corresponding canonical XML-grammar is {raⁿbᵐb̄ᵐāⁿr̄ | n ≥ 1, m ≥ 0}, which is not in OCL.

Example 5.8. Let D = ({r, a, b, c}, {r → a, a → ac|ε, c → b, b → b|ε}) be a recursive DTD. D verifies the first two conditions of the theorem but deviates slightly from the third. L(D) = {aⁿ(cbᵐb̄ᵐc̄ā)ⁿ | n ≥ 1, m ≥ 0}, which is not in OCL.
6. CONCLUSION

This paper continues the formal investigation of the problem of on-line validation of streamed XML documents with respect to a DTD. We provided further insights on the size of the minimal (deterministic) automata that can be used for strong validation against nonrecursive DTDs. Motivated by real-world examples of DTDs with non-one-unambiguous content models, and to capture the possible advantages of using reading windows of size greater than one, we introduced the notion of k-unambiguous regular expressions as a generalization of one-unambiguous regular expressions. We also investigated the problem of strong validation against recursive DTDs without imposing that the streamed document be well-formed. We introduced a hierarchy of classes of DTDs that can be recognized using variants of one-counter automata and provided syntactic conditions on the structure of the DTDs that ensure the existence of a one-counter automaton for performing strong validation. A precise characterization of the DTDs recognizable by deterministic counter automata is not yet available and will be considered in future work.
7. ACKNOWLEDGMENTS
We are indebted to Leonid Libkin and Alberto Mendelzon for their guidance and feedback.
8. REFERENCES

[1] J. Berstel and L. Boasson. XML Grammars. MFCS 2000.
[2] J. Berstel and L. Boasson. Context-Free Languages. Handbook of Theoretical Computer Science, 1990.
[3] L. Boasson. Two iteration theorems for some families of languages. J. Computer and System Sciences, 7, 1973.
[4] A. Brüggemann-Klein and D. Wood. One-unambiguous regular languages. Information and Computation, 140, 1998.
[5] C. Chitic and D. Rosu. On validation of XML streams using finite state machines. Technical Report CSRG-489, University of Toronto, 2004.
[6] B. Choi. What are real DTDs like. WebDB, 2002.
[7] D. Giammarresi, R. Montalbano, and D. Wood. Block-deterministic regular languages. ICTCS, 2001.
[8] V. M. Glushkov. The abstract theory of automata. Russian Mathematical Surveys, 16, 1961.
[9] S. Greibach. An infinite hierarchy of context-free languages. Journal of the ACM, 16, 1969.
[10] C. Koch and S. Scherzinger. Attribute grammars for scalable query processing on XML streams. DBPL 2003.
[11] P. Fischer, A. Meyer, and A. Rosenberg. Counter machines and counter languages. Mathematical Systems Theory, 2, 1968.
[12] A. Salomaa. Computation and Automata. Cambridge University Press, Cambridge, 1985.
[13] L. Segoufin and V. Vianu. Validating streaming XML documents. PODS, 2002.
[14] T. Bray, J. Paoli, and C. M. Sperberg-McQueen. XML 1.0. World Wide Web Consortium Recommendation, 1998. http://www.w3.org/TR/REC-xml/.
[15] K. Wagner and G. Wechsung. Computational Complexity. Mathematics and its Applications. VEB Deutscher Verlag der Wissenschaften, Berlin, 1986.
Checking Potential Validity of XML Documents

Ionut E. Iacob
Dep. of Computer Science, University of Kentucky, Lexington, KY 40506

Alex Dekhtyar
Dep. of Computer Science, University of Kentucky, Lexington, KY 40506

Michael I. Dekhtyar
Dep. of Computer Science, Tver State University, Tver 170000, Russia