A graph-based data model to represent transaction time in semistructured data

Carlo Combi1, Barbara Oliboni1, and Elisa Quintarelli2

1 Dipartimento di Informatica - Università degli Studi di Verona
Ca' Vignal 2, Strada le Grazie, 15, 37134 Verona (Italy)
{combi,oliboni}@sci.univr.it
2 Dipartimento di Elettronica e Informazione - Politecnico di Milano
Piazza Leonardo da Vinci, 32, 20133 Milano (Italy)
[email protected]
Abstract. In this paper we propose the Graphical sEmistructured teMporal data model (GEM), which is based on labeled graphs and allows one to represent in a uniform way semistructured data and their temporal aspects. In particular, we focus on transaction time.
1 Introduction
In recent years the database research community has focused on methods for representing and querying semistructured data [1]. Roughly speaking, this expression denotes data that have no absolute schema fixed in advance, and whose structure may be irregular or incomplete. A number of approaches have been proposed in which labeled graphs are used to represent semistructured data without considering any temporal dimension [4, 9, 14]. These models organize data in graphs where nodes denote either objects or values, and edges represent relationships between them. Proposals for representing temporal information in semistructured data also use labeled graphs [5, 8, 13]: they often deal with methods for representing and querying changes in semistructured data.

Recently, it has been recognized and emphasized that time is an important aspect to consider in designing and modeling (data-intensive) web sites [3]. In this regard, semistructured temporal data models can provide a suitable infrastructure for the effective management of time-varying documents on the web: indeed, semistructured data models play the same role as the relational model does for conventional database management systems. Thus, semistructured temporal data models could be considered as the reference model for the logical design of (data-intensive) web sites when time-dependent information plays a key role. In this scenario, it is also important to formalize, for semistructured temporal data models, the set of constraints needed to correctly manage the semantics of the represented time dimension(s), as has been studied in depth for classical temporal databases [11]. Such an issue has not yet been completely considered in the literature related to semistructured data.

We propose an original graphical temporal data model general enough to include the main features of semistructured data representation, while also modeling the semantics of a given time dimension. In particular, in this work we consider transaction time (TT), which is system-generated and represents the time when a fact is current in the database and may be retrieved [10, 11]: we focus on the specification of the main constraints and operations needed for a correct handling of this temporal aspect.

The structure of the paper is as follows: in the next Section we describe the main proposals dealing with time for semistructured data. Section 3 introduces the Graphical sEmistructured teMporal (GEM) data model, and Section 4 describes and discusses constraints and operations when dealing with transaction time. In Section 5 we sketch some conclusions and possible lines for future work.

This work was partially supported by contributions from the Italian Ministry of University and Research (MIUR) under program COFIN-PRIN 2003 “Representing and managing spatial and geographical data on the Web” and from the Italian Ministry of Research Basic Research Fund (FIRB) - Project KIWI.
2 Related Work
Recently, some research contributions have been concerned with temporal aspects in semistructured databases. While they share the common goal of representing time-varying information, they consider different temporal dimensions and adopt different data models and strategies to capture the main features of the considered notion of time.

The Delta Object Exchange Model (DOEM) proposed in [5] is a temporal extension of the Object Exchange Model (OEM) [14], a simple graph-based data model, with objects as nodes and object-subobject relationships represented as labeled arcs. Change operations (i.e., node insertion, update of node values, addition and removal of labeled arcs) are represented in DOEM by annotations on the nodes and arcs of an OEM graph. Intuitively, annotations represent the history of nodes and edges as it is recorded in the database. This proposal takes into account the transaction time dimension of a graph-based representation of semistructured data. DOEM graphs (and OEM graphs as well) do not consider labeled relationships between two objects (actually, each edge is labeled with the name of the unique pointed node).

Another graph-based model proposed in the literature is described in [8]. This model uses labeled graphs to represent semistructured databases; the peculiarity of these graphs is that each edge label is composed of a set of descriptive properties (e.g., name, transaction time, valid time, security properties of relationships). This proposal is very general and extensible: any property may be used and added to adapt the model to a specific context. In particular, the model also allows one to represent temporal aspects: in this regard, some examples are provided of constraints that need to be suitably managed to correctly support the semantics of the time-related properties, both for querying and for manipulating graphs.

The Temporal Graphical Model (TGM) [13] is a graphical model for representing semistructured data dynamics. This model uses temporal elements, instead of simple intervals, to keep track of the different time intervals in which an object exists in reality. In [13] the authors consider only issues (e.g., admitted operations and constraints) related to valid time representation.

The Temporal XPath Data Model [2] is an extension of the XPath Data Model capable of representing the history of changes of XML documents. In particular, this approach introduces the valid time label only for edges in the XPath model.
3 The Graphical sEmistructured teMporal data model
In this Section we propose the Graphical sEmistructured teMporal (GEM) data model, which is able to represent semistructured information and its time dimension in a uniform way. We focus on the classical notion of transaction time studied in the past years in the context of temporal databases [10, 11] and formalize the set of constraints that the considered time dimension imposes. Our proposal is based on rooted, connected, directed, labeled graphs. The transaction time is represented by means of an interval belonging to both node and edge labels.

A GEM graph is composed of two kinds of nodes, complex and simple nodes, which are graphically represented in different ways. Complex nodes are depicted as rectangles and represent abstract entities, while simple nodes are depicted as ovals and represent primitive values.

Formally, a GEM graph is a rooted labeled graph $\langle N, E, r, \ell \rangle$, where:
1. $N$ is a (finite) set of nodes (actually, it is the set of object identifiers).
2. $E$ is a set of labeled edges.
3. $r \in N$ is the unique root of the graph and it is introduced in order to guarantee the reachability of all the other nodes.
4. Each node label is composed of the node name, the node type (complex or simple), the content, and the time interval.³ The label function $\ell$ is such that for each node $n_i \in N$, $\ell(n_i) = \langle Nname_i, Ntype_i, Ncontent_i, Ntime_i \rangle$, where $Nname_i$ is a string, $Ntype_i \in \{complex, simple\}$, $Ncontent_i$ is a value for simple nodes and the null value $\bot$ for complex nodes (see constraints in Figures 1(a) and 1(b)), and $Ntime_i$ is a half-open interval.
5. Each edge label is composed of the relationship name and the time interval. Each edge $e_j = \langle (n_h, n_k), label_j \rangle$, with $n_h$ and $n_k$ in $N$, has a label $label_j = \langle Ename_j, Etime_j \rangle$.

We do not assume an identifier for edges: an edge can be identified by its label and the two connected nodes, because between two nodes we assume there is only one edge with a particular name and a particular time interval.

³ In the figures related to GEM graphs, we report only the name label of nodes and edges and the related time intervals, because we graphically represent the different types of nodes by means of their shape (rectangle or oval). Moreover, we specify the content label only for simple nodes.
It is worth noting that, unlike other proposals [5, 8], we choose to associate labels with both edges and nodes: this way, we obtain a more compact, graph-based representation of semistructured databases.
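The definition above can be sketched as a small data structure. The following Python fragment is only an illustration under stated assumptions: the class names (`NodeLabel`, `Edge`, `GemGraph`), the use of `None` for the null value ⊥, the encoding of the special value now as positive infinity, and the sample timestamps are all hypothetical choices, not part of the GEM proposal.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple

NOW = float("inf")  # assumption: the special value `now` is modelled as +infinity
Interval = Tuple[float, float]  # half-open interval [start, end)


@dataclass(frozen=True)
class NodeLabel:
    name: str                # Nname
    type: str                # Ntype: "complex" or "simple"
    content: Optional[str]   # Ncontent: None plays the role of the null value
    time: Interval           # Ntime


@dataclass(frozen=True)
class Edge:
    source: int              # object identifier of n_h
    target: int              # object identifier of n_k
    name: str                # Ename
    time: Interval           # Etime


@dataclass
class GemGraph:
    root: int
    labels: Dict[int, NodeLabel] = field(default_factory=dict)  # label function
    # Edges carry no identifier: two edges with the same end points, name and
    # interval compare equal, so a set stores each such edge only once.
    edges: Set[Edge] = field(default_factory=set)


# Illustrative instance (timestamps are made up for the example).
g = GemGraph(root=0)
g.labels[0] = NodeLabel("Person", "complex", None, (1, NOW))
g.labels[1] = NodeLabel("City", "simple", "Verona", (2, NOW))
g.edges.add(Edge(0, 1, "Lives in", (2, NOW)))
g.edges.add(Edge(0, 1, "Lives in", (2, NOW)))  # same edge, not duplicated
```

Using frozen dataclasses for edges mirrors the remark that an edge is identified by its label and its two end points rather than by a separate identifier.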
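[Figure 1 omitted: two constraint graphs, (a) annotated with a formula relating CONTENT to TYPE = complex, and (b) annotated with the formula { TYPE1 = complex }.]
Fig. 1. (a) Complex nodes do not have a specified content label. (b) Each simple node is a leaf.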
A GEM graph is subject to some constraints: we do not allow a complex node to have a primitive value as its content (this is comparable to disallowing mixed elements in XML [15], the widely known language for semistructured data); on the other hand, a simple node must be a leaf of the graph and must have a (primitive) content. Thus, a GEM graph must satisfy the following two basic constraints, which are not related to temporal aspects.
1. The content label of a node is ⊥ if and only if the node is complex (this is due to the fact that complex nodes represent abstract entities without a primitive value). We show this property in Figure 1(a).
2. If a node is simple, then it is a leaf. Figure 1(b) depicts this property by specifying that each node with outgoing edges must be a complex node.

The graphical formalism we use in these two figures has been defined in [7, 12]. In this graphical formalism a constraint is composed of a graph, which is used to identify the subgraphs (i.e., the portions of a semistructured database) where the constraint is to be applied, and a set of formulae, which represent restrictions imposed on those subgraphs. In the following Section we will focus on the constraints we have to introduce on GEM graphs to suitably represent transaction time. In [6] the complete formalization of constraints for both valid time and transaction time is described.

We preferred to adopt a semistructured data model instead of dealing directly with XML: this way, as mentioned in the introduction and in line with other proposals [5, 8, 13], the proposed solution can be considered as a logical model for temporal semistructured data, which can be translated into different XML-based languages/technologies.
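As a rough illustration of the two basic constraints, the following Python sketch checks them on a plain dictionary encoding of a graph. The encoding (node identifier mapped to a name, type, content triple, with `None` standing for ⊥) and the function name `basic_violations` are assumptions made for this example only; time intervals are left out because these two constraints do not involve them.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical encoding: node id -> (name, type, content); None stands for ⊥.
Node = Tuple[str, str, Optional[str]]


def basic_violations(nodes: Dict[int, Node],
                     edges: List[Tuple[int, int]]) -> List[str]:
    """Return a description of every violation of the two basic constraints."""
    problems = []
    sources = {h for h, _ in edges}  # nodes that have at least one outgoing edge
    for oid, (_, ntype, content) in sorted(nodes.items()):
        # Constraint 1: the content is ⊥ if and only if the node is complex.
        if (content is None) != (ntype == "complex"):
            problems.append(f"node {oid}: content/type mismatch")
        # Constraint 2: a simple node is a leaf (no outgoing edges).
        if ntype == "simple" and oid in sources:
            problems.append(f"node {oid}: simple node with outgoing edges")
    return problems


nodes = {0: ("Person", "complex", None), 1: ("City", "simple", "Verona")}
ok = basic_violations(nodes, [(0, 1)])
bad = basic_violations(nodes, [(0, 1), (1, 0)])  # edge leaving a simple node
```

Here `ok` is empty, while `bad` reports node 1, because the second edge list gives the simple node an outgoing edge.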
To the best of our knowledge, the only other work which explicitly addresses the issue of time-related semantics for semistructured data is [8]. In [8], the authors propose a framework for semistructured data where graphs are composed of nodes and labeled edges: edge labels contain properties (i.e., meta-data), such as valid and transaction times, the name of the edge, quality, security, and so on. Edges can have different properties: a property can be present in one edge and missing in another. Nodes have no labels and are identified by the paths leading to them. The focus of that work is on the definition of suitable operators (i.e., collapse, match, coalesce, and slice), which allow one to determine the (different) semantics of properties for managing queries on such graphs. As for the temporal aspects, even though [8] provides some examples of special update semantics to accommodate transaction time, a detailed and complete examination of all the constraints for modeling either transaction or valid time is missing and is outside the main goal of that work. Moreover, the authors state that they “leave open the issue of how these constraints are enforced on update” [8].

With respect to the proposal described in [8], we thus explicitly focus on the semantics of temporal aspects and do not consider the semantics of other properties. This way, even though we are less general than the authors of [8], we are able to provide a complete treatment of the constraints when representing either transaction or valid time [6], facing some important aspects which have not been completely considered in [8], such as the presence of nodes/subgraphs which could become unreachable from the root of the graph after some updates. Another novel feature of our work is that we explicitly address the issue of providing users with powerful operators for building GEM graphs consistent with the given temporal semantics.
4 Managing Transaction Time with the GEM Data Model
Transaction time allows us to maintain the evolution of the graph due to operations, such as insertion, deletion, and update, on nodes and edges. From this point of view, a GEM graph represents the changes of an (atemporal) graph, i.e., it represents a sequence of atemporal graphs, each obtained as the result of some operations on the previous one. In this context, the current graph represents the current facts in the semistructured database, and is composed of the nodes and edges whose transaction time interval ends with the special value now. Our idea is that operations on nodes and edges of a GEM graph must yield a rooted, connected graph. Thus, the current graph, composed of current nodes and edges, must be a rooted connected graph, i.e., a GEM graph. Changes in a GEM graph are timestamped by transaction time in order to represent the graph history, which is composed of the sequence of intermediate graphs resulting from the operations. In the next subsection we define the set of constraints that temporal labels must satisfy in order to guarantee that the current (atemporal) graph resulting after each operation is still a GEM graph.
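The view of a GEM graph as a sequence of atemporal graphs can be illustrated with a short Python sketch that extracts the atemporal graph at a given instant and the current graph. The encoding of now as positive infinity, the tuple layout, the function names, and the sample timestamps are assumptions of this example, not part of the model.

```python
from typing import Dict, List, Set, Tuple

NOW = float("inf")  # assumption: the special value `now` is modelled as +infinity
Interval = Tuple[float, float]            # half-open interval [start, end)
TimedEdge = Tuple[int, int, str, Interval]  # (source, target, name, interval)


def alive(interval: Interval, t: float) -> bool:
    """True when instant t falls inside the half-open interval."""
    start, end = interval
    return start <= t < end


def snapshot(node_times: Dict[int, Interval], edges: List[TimedEdge],
             t: float) -> Tuple[Set[int], List[Tuple[int, int, str]]]:
    """Atemporal graph made of the nodes and edges present at instant t."""
    nodes = {oid for oid, iv in node_times.items() if alive(iv, t)}
    es = [(h, k, name) for h, k, name, iv in edges if alive(iv, t)]
    return nodes, es


def current_graph(node_times: Dict[int, Interval], edges: List[TimedEdge]):
    """Nodes and edges whose interval ends with the special value now."""
    nodes = {oid for oid, (_, end) in node_times.items() if end == NOW}
    es = [(h, k, name) for h, k, name, (_, end) in edges if end == NOW]
    return nodes, es


# Person (0) lived in one City (1) during [2, 5) and in another (2) from 5 on.
node_times = {0: (1, NOW), 1: (2, 5), 2: (5, NOW)}
edges = [(0, 1, "Lives in", (2, 5)), (0, 2, "Lives in", (5, NOW))]
past = snapshot(node_times, edges, 3)
present = current_graph(node_times, edges)
```

In this example `past` contains nodes 0 and 1 with the first edge, while `present` contains nodes 0 and 2 with the second edge.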
Conversely, each operation on the graph corresponds to the suitable management of the temporal labels of (possibly) several nodes and edges of the GEM graph.
4.1 Constraints for Transaction Time
The following constraints on a GEM graph allow us to explicitly consider the append-only feature of semistructured data timestamped by the transaction time.
1. The time interval of a generic edge connecting two nodes must be related to their time intervals. Intuitively, a relation between two nodes can be established and maintained only in the time interval in which both nodes are present in the graph. For each edge $e_j = \langle (n_h, n_k), Ename_j, [t_{js}, t_{je}) \rangle$, where $\ell(n_h) = \langle Nname_h, Ntype_h, Ncontent_h, [t_{hs}, t_{he}) \rangle$ and $\ell(n_k) = \langle Nname_k, Ntype_k, Ncontent_k, [t_{ks}, t_{ke}) \rangle$, it must hold that $t_{js} \geq \max(t_{hs}, t_{ks})$ and $t_{je} \leq \min(t_{he}, t_{ke})$. In Figure 2 we report an example of this constraint: in part a) we show two nodes and the generic edge connecting them, while in part b) we show a possible set of node and edge time intervals that satisfy the constraint. The time interval $[t_{js}, t_{je})$ of the edge does not start before both node-related time intervals have started and does not end after one of the node-related time intervals has ended.
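[Figure 2 omitted: a) two nodes and the generic edge connecting them; b) a timeline t showing the node intervals [t_hs, t_he) and [t_ks, t_ke) and the edge interval [t_js, t_je).]
Fig. 2. The TT constraint on the time interval of a generic edge.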
2. The time interval of each node is related to the time intervals of all its ingoing edges. Intuitively, a node (different from the root) can survive only if it is connected to at least one current complex node by means of an edge. For each node $n_k$ with $\ell(n_k) = \langle Nname_k, Ntype_k, Ncontent_k, [t_{ks}, t_{ke}) \rangle$ and for the set of its ingoing edges $e_{j_i} = \langle (n_{h_i}, n_k), Ename_{j_i}, [t_{j_i s}, t_{j_i e}) \rangle$ (with $i = 1, \ldots, n$), it must hold that $t_{ks} = \min_i(t_{j_i s})$ and $t_{ke} = \max_i(t_{j_i e})$. Part a) of Figure 3 shows three complex nodes and a node (simple or complex) connected by means of three edges; part b) depicts an example of time intervals satisfying this constraint, by showing three possible edge-related time intervals $[t_{j_i s}, t_{j_i e})$ (with $i = 1, 2, 3$). The time interval of the pointed node cannot start before the first edge-related time interval starts (according to the insertion order) and cannot end after the last one ends (according to the deletion order).
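[Figure 3 omitted: a) three complex nodes connected by three edges to a node (simple or complex); b) a timeline t showing the edge intervals [t_j1s, t_j1e), [t_j2s, t_j2e), [t_j3s, t_j3e) and the node interval [t_ks, t_ke).]
Fig. 3. The TT constraint on the time interval of a node.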
3. For each complex node $n_h$ with label $\langle Nname_h, complex, \bot, [t_{hs}, t_{he}) \rangle$ and for all the simple nodes $n_{k_i}$, with $i = 1, \ldots, n$ and $\ell(n_{k_i}) = \langle Nname, simple, Ncontent_{k_i}, [t_{k_i s}, t_{k_i e}) \rangle$, related to $n_h$ by means of the edges $e_{j_i} = \langle (n_h, n_{k_i}), Ename, [t_{j_i s}, t_{j_i e}) \rangle$, it must hold that $\bigcap_{i=1}^{n} [t_{j_i s}, t_{j_i e}) = \emptyset$; as the intuition below and Figure 4 indicate, no two of these intervals overlap. Note that all the edges $e_{j_i}$ have the same name Ename, and all the simple nodes $n_{k_i}$ have the same name Nname. Intuitively, at a specific time instant, a complex node can be related to at most one simple node named Nname by means of a relation named Ename. This is due to the fact that a simple node represents a property and a complex node cannot have, at a given time, different values for the same property. For example, at a given time instant, the complex node Person can be connected to at most one simple node City by means of the edge Lives in. Part a) of Figure 4 shows a complex node and three simple nodes connected by means of edges with the same label; part b) shows the edge-related time intervals $[t_{j_i s}, t_{j_i e})$ (with $i = 1, 2, 3$), which satisfy this constraint since they do not intersect.

With these restrictions we do not allow one to represent multi-valued attributes of a complex node. In order to overcome this limitation, multi-valued properties could be represented, for example, by a set of complex nodes with the same name connected to the referring complex node by edges (possibly having the same name). Each complex node representing a value of the considered property has a single simple node storing the value of the property.
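[Figure 4 omitted: a) a complex node connected to three simple nodes by edges with the same label; b) a timeline t showing the non-intersecting edge intervals [t_j1s, t_j1e), [t_j2s, t_j2e), [t_j3s, t_j3e).]
Fig. 4. The TT constraint on time interval of edges pointing to simple nodes.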
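The three transaction time constraints can be turned into a simple checker. The Python sketch below is a minimal illustration under several assumptions: the tuple encoding and the name `tt_violations` are hypothetical, now is modelled as positive infinity, a non-root node without ingoing edges is reported under constraint 2, and constraint 3 is read as pairwise disjointness of the edge intervals, following the intuition given in the text.

```python
from itertools import combinations
from typing import Dict, List, Tuple

NOW = float("inf")  # assumption: the special value `now` is modelled as +infinity
Interval = Tuple[float, float]
NodeT = Tuple[str, str, Interval]       # (name, type, interval)
EdgeT = Tuple[int, int, str, Interval]  # (source, target, name, interval)


def tt_violations(nodes: Dict[int, NodeT], edges: List[EdgeT],
                  root: int) -> List[str]:
    problems = []
    # Constraint 1: an edge lives only while both of its end nodes live.
    for h, k, name, (js, je) in edges:
        (hs, he), (ks, ke) = nodes[h][2], nodes[k][2]
        if js < max(hs, ks) or je > min(he, ke):
            problems.append(f"C1: edge {h}-{name}->{k}")
    # Constraint 2: a non-root node spans exactly its ingoing edges.
    for k, (_, _, (ks, ke)) in nodes.items():
        if k == root:
            continue
        ingoing = [iv for _, tgt, _, iv in edges if tgt == k]
        if (not ingoing or ks != min(s for s, _ in ingoing)
                or ke != max(e for _, e in ingoing)):
            problems.append(f"C2: node {k}")
    # Constraint 3: same complex node, same edge name, same simple node name
    # implies edge intervals that do not overlap (pairwise reading).
    groups: Dict[Tuple[int, str, str], List[Interval]] = {}
    for h, k, ename, iv in edges:
        nname, ntype, _ = nodes[k]
        if ntype == "simple":
            groups.setdefault((h, ename, nname), []).append(iv)
    for key, ivs in groups.items():
        if any(max(s1, s2) < min(e1, e2)
               for (s1, e1), (s2, e2) in combinations(ivs, 2)):
            problems.append(f"C3: {key}")
    return problems


nodes = {0: ("Person", "complex", (1, NOW)),
         1: ("City", "simple", (2, 5)),
         2: ("City", "simple", (5, NOW))}
good = tt_violations(nodes, [(0, 1, "Lives in", (2, 5)),
                             (0, 2, "Lives in", (5, NOW))], root=0)
bad = tt_violations(nodes, [(0, 1, "Lives in", (2, 6)),
                            (0, 2, "Lives in", (5, NOW))], root=0)
```

In the second call the first edge outlives its target node and overlaps the second edge, so all three constraints are reported as violated.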
4.2 Operations
Let us now consider how node and edge insertions, deletions, and updates modify a GEM graph: the result of each operation is a GEM graph which satisfies the previous constraints. Since each single operation produces a GEM graph, any sequence of the defined operations is correct; thus we avoid the problem of having incorrect sequences of operations [5].
1. Insert the root node. $obj_r$ = insert-root-node(Nname) inserts at time $t_s$ the root node with label $\langle Nname, complex, \bot, [t_s, now) \rangle$ and gives as result the object identifier $obj_r$ of the node itself. The time interval $[t_s, now)$ is system-generated.
2. Insert a complex node. $obj_c$ = insert-complex-node(Nname, $obj_k$, Ename) inserts at time $t_s$ the complex node with label $\langle Nname, complex, \bot, [t_s, now) \rangle$ connected to the node with object identifier $obj_k$ by means of an edge starting from $obj_k$. The operation gives as result the object identifier $obj_c$ of the inserted node. The inserted edge is $\langle (obj_k, obj_c), Ename, [t_s, now) \rangle$ and is added in order to avoid the possibility that the node $obj_c$ cannot be reached from the root. If the node $obj_k$ is not current (i.e., its time interval does not end with the special value now), then the operation fails (for example, it returns NULL) and the GEM graph is not modified.
3. Insert a simple node. $obj_s$ = insert-simple-node(Nname, Ncontent, $obj_k$, Ename) inserts at time $t_s$ the simple node with label $\langle Nname, simple, Ncontent, [t_s, now) \rangle$ connected to the node $obj_k$ by means of an edge starting from $obj_k$. This operation checks whether there is another current simple node $obj_h$, with label $\langle Nname, simple, Ncontent_h, [t_{hs}, now) \rangle$, connected to the same node $obj_k$ by means of an edge with the same name Ename. If this is the case, it calls the operation remove-edge($obj_k$, $obj_h$, Ename), which (logically) deletes the old value of the considered property, which is updated by the new insertion. This restriction is added in order to avoid the possibility of storing, for a given node, two properties with the same label at the same time. The operation gives as result the object identifier $obj_s$ of the inserted node. The inserted edge is $\langle (obj_k, obj_s), Ename, [t_s, now) \rangle$. If the node $obj_k$ is not current, then the operation fails.
4. Insert an edge. insert-edge($obj_h$, $obj_k$, Ename) inserts at time $t_s$ the edge $\langle (obj_h, obj_k), Ename, [t_s, now) \rangle$. This operation fails when at least one of the two nodes is not current or there is already a current edge Ename between the two considered nodes.
5. Remove a node. remove-node($obj_k$) removes at time $t_e$ the node $obj_k$ if it is current; otherwise the operation fails. Suppose that the node $obj_k$ has as label $\langle Nname, Ntype, Ncontent, [t_s, now) \rangle$: the operation changes the label into $\langle Nname, Ntype, Ncontent, [t_s, t_e) \rangle$. Moreover, it also removes all its ingoing edges, i.e., for each edge $\langle (obj_h, obj_k), Ename, [t_s, now) \rangle$ it calls remove-edge($obj_h$, $obj_k$, Ename). It also removes all its outgoing edges, i.e., for each edge $\langle (obj_k, obj_x), Ename, [t_s, now) \rangle$ it calls remove-edge($obj_k$, $obj_x$, Ename).
6. Remove an edge. remove-edge($obj_h$, $obj_k$, Ename) removes at time $t_e$ the edge $\langle (obj_h, obj_k), Ename, [t_s, now) \rangle$, which becomes $\langle (obj_h, obj_k), Ename, [t_s, t_e) \rangle$. Note that this operation fails when there is no current edge labeled Ename between the nodes $obj_h$ and $obj_k$. Moreover, in order to avoid the possibility of having nodes that cannot be reached from the root, this operation implies another check on the node $obj_k$ by calling the operation garbage-collection($obj_k$).
7. Garbage collection. garbage-collection($obj_k$) checks whether $obj_k$ is current and there is at least one edge $\langle (obj_h, obj_k), Ename, [t_s, now) \rangle$. If such an edge exists, then the operation terminates; otherwise it removes the node with object identifier $obj_k$ by calling the operation remove-node($obj_k$).
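The operations above can be sketched in Python as follows. This is only an illustrative, in-memory sketch under explicit assumptions: the class `Gem` and its attribute layout are hypothetical, the system-generated transaction time is replaced by a logical counter, now is modelled as positive infinity, cascaded removals reuse the timestamp of the operation that triggered them, and garbage collection never removes the root. Only the checks stated in the text are implemented.

```python
import itertools

NOW = float("inf")  # assumption: the special value `now` is modelled as +infinity


class Gem:
    """Hypothetical in-memory GEM graph; names and layout are illustrative."""

    def __init__(self):
        self.nodes = {}   # oid -> [name, type, content, start, end]
        self.edges = []   # [source, target, name, start, end]
        self.root = None
        self._clock = itertools.count(1)  # stand-in for system-generated times
        self._ids = itertools.count(0)

    def _current(self, oid):
        return oid in self.nodes and self.nodes[oid][4] == NOW

    def _current_edges(self, h=None, k=None, name=None):
        # Current edges, optionally filtered by source, target and name.
        return [e for e in self.edges if e[4] == NOW and h in (None, e[0])
                and k in (None, e[1]) and name in (None, e[2])]

    def _add(self, name, ntype, content, ts, parent=None, ename=None):
        oid = next(self._ids)
        self.nodes[oid] = [name, ntype, content, ts, NOW]
        if parent is not None:
            self.edges.append([parent, oid, ename, ts, NOW])
        return oid

    def insert_root_node(self, name):
        self.root = self._add(name, "complex", None, next(self._clock))
        return self.root

    def insert_complex_node(self, name, parent, ename):
        if not self._current(parent):
            return None  # the operation fails and the graph is unchanged
        return self._add(name, "complex", None, next(self._clock), parent, ename)

    def insert_simple_node(self, name, content, parent, ename):
        if not self._current(parent):
            return None
        ts = next(self._clock)
        # Logically delete a current value of the same property, if any.
        for e in self._current_edges(h=parent, name=ename):
            old = self.nodes[e[1]]
            if old[0] == name and old[1] == "simple":
                self.remove_edge(parent, e[1], ename, ts)
        return self._add(name, "simple", content, ts, parent, ename)

    def insert_edge(self, h, k, ename):
        if (not (self._current(h) and self._current(k))
                or self._current_edges(h, k, ename)):
            return False
        self.edges.append([h, k, ename, next(self._clock), NOW])
        return True

    def remove_edge(self, h, k, ename, te=None):
        found = self._current_edges(h, k, ename)
        if not found:
            return False
        te = next(self._clock) if te is None else te
        found[0][4] = te                 # [ts, now) becomes [ts, te)
        self.garbage_collection(k, te)
        return True

    def garbage_collection(self, k, te):
        # Assumption: the root is exempt, since it needs no ingoing edge.
        if k != self.root and self._current(k) and not self._current_edges(k=k):
            self.remove_node(k, te)

    def remove_node(self, k, te=None):
        if not self._current(k):
            return False
        te = next(self._clock) if te is None else te
        self.nodes[k][4] = te
        for e in self._current_edges(k=k):   # ingoing edges
            self.remove_edge(e[0], k, e[2], te)
        for e in self._current_edges(h=k):   # outgoing edges
            self.remove_edge(k, e[1], e[2], te)
        return True


g = Gem()
person = g.insert_root_node("Person")
verona = g.insert_simple_node("City", "Verona", person, "Lives in")
milano = g.insert_simple_node("City", "Milano", person, "Lives in")
```

In this usage example the second insertion of a City closes the interval of the first one: its edge is logically removed and garbage collection then closes the node that is no longer reachable.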
5 Conclusions
In this paper we presented the new graph-based GEM model for semistructured temporal data. It allows us to model the temporal dimensions of data in a homogeneous way. In particular, we discussed in some detail constraints and operations on GEM graphs dealing with transaction time, the well-known time dimension of data [10]. We showed how a GEM graph can represent a sequence of timestamped atemporal graphs. GEM can also be used to represent valid time; in this case, the time dimension is given by the user, since valid time is related to the description of the considered real world. Thus, constraints and operations must be able to guarantee that the history of the given application domain is consistent.

As for future work, the GEM data model will be extended to deal with both valid and transaction times together: further issues should be considered when both temporal dimensions are represented on the same graph. Moreover, different approaches, such as the logic-based or the algebraic ones, will be considered and studied in order to provide the GEM data model with a language for querying, viewing, and “transforming” GEM graphs.
References
1. S. Abiteboul. Querying Semi-Structured Data. In Proceedings of the International Conference on Database Theory, volume 1186 of Lecture Notes in Computer Science, pages 262–275, 1997.
2. T. Amagasa, M. Yoshikawa, and S. Uemura. Realizing Temporal XML Repositories using Temporal Relational Databases. In Proceedings of the Third International Symposium on Cooperative Database Systems and Applications, pages 63–68. IEEE Computer Society, 2001.
3. P. Atzeni. Time: A coordinate for web site modelling. In Advances in Databases and Information Systems, 6th East European Conference, volume 2435 of Lecture Notes in Computer Science, pages 1–7. Springer-Verlag, Berlin, 2002.
4. S. Ceri, S. Comai, E. Damiani, P. Fraternali, S. Paraboschi, and L. Tanca. XML-GL: a Graphical Language for Querying and Restructuring XML Documents. Computer Networks, 31(11–16):1171–1187, 1999.
5. S. S. Chawathe, S. Abiteboul, and J. Widom. Managing historical semistructured data. Theory and Practice of Object Systems, 5(3):143–162, 1999.
6. C. Combi, B. Oliboni, and E. Quintarelli. A Unified Model for Semistructured Temporal Data. Technical Report 2003.11, Politecnico di Milano, February 2003.
7. E. Damiani, B. Oliboni, E. Quintarelli, and L. Tanca. Modeling semistructured data by using graph-based constraints. In OTM Workshops Proceedings, Lecture Notes in Computer Science, pages 20–21. Springer-Verlag, Berlin, 2003.
8. C. E. Dyreson, M. H. Böhlen, and C. S. Jensen. Capturing and Querying Multiple Aspects of Semistructured Data. In VLDB'99, Proceedings of 25th International Conference on Very Large Data Bases, pages 290–301. Morgan Kaufmann, 1999.
9. M. Fernandez, D. Florescu, J. Kang, A. Levy, and D. Suciu. STRUDEL: A web site management system. In Proceedings of the ACM SIGMOD International Conference on Management of Data, volume 26(2) of SIGMOD Record, pages 549–552. ACM Press, 1997.
10. C. S. Jensen, C. E. Dyreson, M. H. Böhlen, et al. The consensus glossary of temporal database concepts - February 1998 version. In Temporal Databases: Research and Practice, volume 1399 of Lecture Notes in Computer Science, pages 367–405. Springer, 1998.
11. C. S. Jensen and R. Snodgrass. Temporal data management. IEEE Transactions on Knowledge and Data Engineering, 11(1):36–44, 1999.
12. B. Oliboni. Blind queries and constraints: representing flexibility and time in semistructured data. PhD thesis, Politecnico di Milano, 2003.
13. B. Oliboni, E. Quintarelli, and L. Tanca. Temporal aspects of semistructured data. In Proceedings of The Eighth International Symposium on Temporal Representation and Reasoning (TIME-01), pages 119–127. IEEE Computer Society, 2001.
14. Y. Papakonstantinou, H. Garcia-Molina, and J. Widom. Object Exchange Across Heterogeneous Information Sources. In Proceedings of the Eleventh International Conference on Data Engineering, pages 251–260. IEEE Computer Society, 1995.
15. World Wide Web Consortium. Extensible Markup Language (XML) 1.0, 1998. http://www.w3C.org/TR/REC-xml/.