Storing Semistructured Data in Relations

Alin Deutsch
Mary Fernandez
Dan Suciu
[email protected]
[email protected]
[email protected]
1 Introduction

Existing systems for managing and querying semistructured data sources store the schema with the data. Lorel [QRS+95] and Tsimmis [PGMW95] store their data as graphs; the schema is stored as attributes labeling the graph's edges. Strudel [FFK+98] stores the data externally as structured text and internally as a graph. Text processing systems [GT87, BBT92, ST94, RTW96] leave the data in text format and add specialized indexes. XML is often stored in proprietary object repositories or in tagged-text files in which the schema is replicated at each object as element tags.

Storing the schema with the data provides important flexibility to some applications. In data integration, for example, data from new sources can be loaded immediately, regardless of its structure, and changes to the structure of old sources can be handled seamlessly. This flexibility, however, incurs a space cost, because the schema is replicated at each data item, and a time cost, because of the additional processing of the replicated schema. A more fundamental cost is that it prevents us from using off-the-shelf relational database-management systems for managing semistructured data.

In this paper we argue that one can store semistructured data in relational format by exploiting the regularities inherent in existing semistructured data instances. "Most" of the data will be stored in relational format; the outliers, and possible future insertions, will still be stored in a self-describing way. We propose to use data mining techniques to extract a "good" relational schema for a given semistructured data instance. Our algorithm accepts a variety of input parameters, such as the maximum number of relations allowed, the maximum number of attributes per relation, and, optionally, a collection of queries on the semistructured data for which the relational storage has to be optimized. Experimental results on the DBLP data show that around 90% of the data can be stored in relational format.

The techniques described here are presented in more detail in [DFS98].
Related Work. Nestorov, Abiteboul, and Motwani [NAM98] describe an algorithm which, given a semistructured data instance, extracts a schema for that data. This is essentially a clustering algorithm. We found our storage generation problem hard to model as a data clustering problem, because objects with widely different structures may be well stored together. Wang and Liu [WL98] have recently extended data mining techniques to semistructured data.

The only technique, other than a text file, proposed for storing XML data uses an object-oriented database and is best described by Christophides et al. [CACS94] in the context of SGML. The idea is to use the document type definition (DTD) to derive an object-oriented schema. Each tag name becomes a class name; additional classes may be introduced to split complex regular expressions in the DTD. Each element in the SGML document becomes an object in the database. This approach is generally accepted as the most appropriate for storing XML data and, for this reason, has renewed interest in object-oriented databases. The technique, however, is rigid, because the object-oriented schema cannot accommodate XML data that does not conform to the given DTD. Also, the object-oriented schema can induce data fragmentation, which can negatively impact performance. User intervention can help reduce this fragmentation [KKEX97, MKK96], but this can introduce other inefficiencies. In comparison, our approach relies on an aggressive mapping to the relational model, which may be more space efficient when the data is relatively regular. Our approach is also more flexible: when the data changes in an unanticipated way, we can still store it, but with lower performance.
Audit: &o1 {
  taxpayer : &o24 {
    name : &o41 "Gluschko",
    address : &o34 {street : &105 "Tyuratam", apartment : &o623 "2C" zip : &121 "07099"}
    audited : &o46 "10/12/63",
    taxamount : &o47 12332},
  taxpayer : &o21 {
    name : &o132 "Kosberg",
    address : &o25 {street : &427 "Tyuratam", number : &928 206, zip : &121 "92443"}
    audited : &o46 "11/1/68",
    audited : &o46 "10/12/77",
    taxamount : &o283 0,
    taxevasion : &o632 "likely"}
  taxpayer : &o20 {
    name : &o132 "Korolev",
    address : &o253 "Baikonur, Russia",
    audited : &o46 "10/12/86",
    taxamount : &o283 0,
    taxevasion : &o632 "likely"}
  company : &o26 {
    name : &o623 "Rocket Propulsion Inc.",
    owner : &o24}
}
Figure 1: Textual representation of semistructured data

Taxpayer1
oid   name       street     no     apt    zip     audit1     audit2     taxamount   taxevasion
o24   Gluschko   Tyuratam   null   2C     07099   10/12/63   null       12332       null
o21   Kosberg    Tyuratam   206    null   92443   11/1/68    10/12/77   0           likely

Taxpayer2
oid   name      address    audited    taxamount   taxevasion
o20   Korolev   Baikonur   10/12/86   0           likely

Company
name                     owner
Rocket Propulsion Inc.   o24

Figure 2: Relational storage
2 Illustration of Relational Storage for Semistructured Data

Our semistructured model is an ordered version of the OEM model [PGMW95]. Data consists of a collection of objects, in which each object is either complex or atomic. A complex object is an ordered set of (attribute, object) pairs, and an atomic object is an atomic value of type int, string, video, etc. Hence, data is a graph, with edges labeled by attributes and some leaves labeled with atomic values. Data is exchanged in a textual representation; an example is in Fig. 1. The order of an object's attributes is the only difference from the OEM model, and we use it only when storing the data (i.e., the data is still unordered from the user's perspective). Any order will do: it can be the order in the textual representation, or it can result from comparing the binary representations of the object identifiers.

The textual representation specifies the data in a tree-like format. To specify arbitrary graphs, we emit references to object identifiers, e.g., the value of the owner attribute in Fig. 1 is the object &o24. Hence, throughout the paper we consider the data to be a tree. Object identifiers in the textual representation are optional: when none is specified, a unique identifier is assigned automatically. We note that these conventions and assumptions are compatible with XML.

Considering the semistructured data in Fig. 1, one possible way to store it in a relational format is shown in Fig. 2. We separate objects by their "types": taxpayers and companies are stored separately. Within taxpayers, we separate those with a complex address from those whose address is a string. Even after this decomposition, objects are not uniform: there are nulls in several table entries. Table Taxpayer1 has two attributes, audit1 and audit2, to accommodate objects with two occurrences of the audited attribute. Most object identifiers from the semistructured data are omitted.
The actual "mapping" is not explicitly defined, but hidden in the choice of attribute and table names. For instance, the path Audit.taxpayer.name is mapped to both name in Taxpayer1 and name in Taxpayer2, and the path Audit.taxpayer.audited is mapped to audit1 and audit2 in Taxpayer1, and to audited in Taxpayer2. We emphasize that the data in Fig. 2 can be stored directly by any RDBMS. Unlike semistructured data, the schema is not stored with the data.

The particular choice of the schema in Fig. 2 is not the only choice, nor necessarily the best. Fig. 3 contains an alternative. All taxpayers are now stored together, and their addresses are stored separately. The Company relation is the same.

Taxpayer
oid   name       audit1     audit2     taxamount   taxevasion
o24   Gluschko   10/12/63   null       12332       null
o21   Kosberg    11/1/68    10/12/77   0           likely
o20   Korolev    10/12/86   null       0           likely

Address1
taxpayeroid   street     no     apt    zip
o24           Tyuratam   null   2C     07099
o21           Tyuratam   206    null   92443

Address2
taxpayeroid   address
o20           Baikonur

Figure 3: Alternative relational storage (Company is the same).

Parameter Name   Meaning
K                Maximum number of relations (tables)
A                Maximum number of attributes allowed per relation
S                Maximum disk space
C                Collection size threshold
Supp             Minimum support

Table 1: Parameters for the storage generation algorithm

Of course, some updates to the semistructured data instance cannot be accommodated by the chosen relational storage. For example, we cannot add a new taxpayer with a phone attribute, because that attribute does not exist in the mapping. The solution we propose relies on an overflow graph. Thus the semistructured data is managed by (1) an RDBMS, handling "most" of the data, and (2) a semistructured DBMS handling a small overflow graph. Details of the overflow graph are omitted from this abstract.
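The division of labor between the RDBMS and the overflow graph can be illustrated with a small sketch. Since the details of the overflow graph are omitted from this abstract, the routing rule below (an object goes to the overflow graph as soon as one of its attributes has no place in the mapping) is only an assumption made for illustration, and the names MAPPED_ATTRIBUTES and route are hypothetical.

```python
# Attributes that a (hypothetical) relational mapping for taxpayers can hold.
MAPPED_ATTRIBUTES = {"name", "address", "audited", "taxamount", "taxevasion"}


def route(obj, mapped=MAPPED_ATTRIBUTES):
    """Decide where an object is stored.

    obj is an ordered list of (attribute, value) pairs, as in the data model.
    Assumption: a single unmapped attribute sends the whole object to the
    overflow graph; the paper does not specify this rule.
    """
    labels = {label for label, _ in obj}
    return "relational" if labels <= mapped else "overflow"


# Illustrative objects, not taken from the paper's data.
known = [("name", "Korolev"), ("address", "Baikonur, Russia"), ("taxamount", 0)]
with_phone = [("name", "New Taxpayer"), ("phone", "555-0100")]

print(route(known))       # relational
print(route(with_phone))  # overflow
```

A finer-grained rule, which stores the mapped attributes relationally and only the unmapped ones in the overflow graph, would also be conceivable; the sketch takes the simpler option.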
3 A Repertoire of Storage Techniques

Our method will work only when there is enough regularity in the data to make it "store-able". Often, however, this regularity is hidden, and in order to exploit it we apply a number of techniques described here.
Unnesting. Objects that have a similar deep structure are unnested: only the values on their leaves need to be stored in the relations. For example, the address objects of o24 and o21 are flattened in Taxpayer1.

Nulls. We use nulls to store objects with seemingly different structure in the same relation. For example, o24 and o21 are both stored in Taxpayer1, although they have different sets of attributes.

Horizontal Splitting. Objects whose structures differ too much can be stored in different relations: for example, o24 and o20 in Fig. 2.

Vertical Splitting. Alternatively, "wide" objects can be split vertically and stored in several relations. For example, o24 and o21 are split vertically in Fig. 3 between Taxpayer and Address1.

Multiple Attribute Occurrences. To account for multiple occurrences of the same attribute we introduce several (differently named) attributes in the relations. For instance, o21 has two audited attributes (Fig. 1); hence we provide two attributes, audit1 and audit2, in Taxpayer1 (Fig. 2). We call a multiply occurring attribute with small cardinality a small collection. "Small" is a system parameter.

Nested Relations. When the cardinality of some attribute is too large, we call it a large collection and store it as a nested relation.
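Unnesting, nulls, and multiple attribute occurrences can be seen together in a minimal sketch that turns object o21 of Fig. 1 into its Taxpayer1 row of Fig. 2. The explicit mapping table TAXPAYER1 and the helpers leaves and to_row are hypothetical; in the paper the mapping is only implicit in the column names.

```python
# Hypothetical explicit mapping: (path, occurrence number) -> column name.
TAXPAYER1 = {
    ("name", 1): "name",
    ("address.street", 1): "street",
    ("address.number", 1): "no",
    ("address.apartment", 1): "apt",
    ("address.zip", 1): "zip",
    ("audited", 1): "audit1",
    ("audited", 2): "audit2",
    ("taxamount", 1): "taxamount",
    ("taxevasion", 1): "taxevasion",
}


def leaves(obj, prefix=""):
    """Unnesting: yield (path, atomic value) for every leaf of an object."""
    for label, value in obj:
        path = f"{prefix}.{label}" if prefix else label
        if isinstance(value, list):
            yield from leaves(value, path)
        else:
            yield path, value


def to_row(oid, obj, mapping):
    """Build one row; columns without a matching leaf stay None (null)."""
    row = {"oid": oid, **{column: None for column in mapping.values()}}
    seen = {}
    for path, value in leaves(obj):
        seen[path] = seen.get(path, 0) + 1  # k-th occurrence of this path
        row[mapping[(path, seen[path])]] = value
    return row


o21 = [
    ("name", "Kosberg"),
    ("address", [("street", "Tyuratam"), ("number", 206), ("zip", "92443")]),
    ("audited", "11/1/68"),
    ("audited", "10/12/77"),
    ("taxamount", 0),
    ("taxevasion", "likely"),
]
row = to_row("o21", o21, TAXPAYER1)
print(row)
```

In this sketch a leaf with no entry in the mapping raises a KeyError; a fuller version would presumably send such an object elsewhere, for example to another relation.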
4 Automatic Generation of Relational Storage

Several competing goals determine how "good" or effective a particular relational storage is. First, we want to limit the number of tables: although many commercial RDBMSs do not limit the maximum number of tables, the unusual possibility of storing each object in a separate table is undesirable. Second, we want a bound on the disk space. Although the size of the semistructured data instance D is fixed, its relational storage may be arbitrarily large, because some objects may be stored in more than one relation. A related goal is minimizing the number of nulls. Some RDBMSs store nulls efficiently, e.g., a null entry in a record requires only a byte, and nulls at the end of the record take no space at all. Nonetheless, we do not want to generate very wide, sparse tables, because one byte per null entry can become expensive. A related restriction is that some RDBMSs impose an upper limit on the number of attributes per table (see footnote 1). Depending on the application, other goals may include reducing data fragmentation (i.e., avoiding horizontal and vertical splits), reducing redundant storage of objects in multiple relations, or increasing object redundancy to improve query evaluation. Finally, if a query mix is also given, we may want to optimize the weighted cost of computing those queries as well.

Rather than trying to solve the problem in its full generality, we consider heuristics. We start with data mining techniques, and extend them to account for the query mix and to generate complex storage mappings. A data-mining algorithm for semistructured data has recently been described by Wang and Liu [WL98]; we refer to it as WL's algorithm. Here we extend WL's algorithm and use it to search for relational storage. The result of our algorithm is a collection of relation names, each with a set of attributes and a subset of required attributes. Each object o having all required attributes for relation R will be stored in R; the remaining attributes in R may be filled with nulls.

Our storage generation algorithm has five parameters, listed in Table 1. It generates a relational storage with at most K tables, each having at most A attributes, and with total disk space at most S. We assume fixed-length records. This results in an approximation of the disk space; more accurate measures are possible (e.g., to account for cases when nulls take far less space). C is a parameter distinguishing between "small collections" and "large collections". An attribute with fewer than C occurrences is considered a small collection, and the algorithm attempts to produce one column for each occurrence. Attributes with C or more occurrences are represented by nested relations. Finally, Supp is the minimum support, a parameter for the data-mining algorithm.
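The parameters and the required-attribute rule can be written down as a minimal sketch. The class and function names are hypothetical, the values chosen for S and C are made up, and the field width used in the space estimate is an assumption; only the rules themselves (required attributes decide membership, C separates small from large collections, records have fixed length) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageParams:
    K: int     # maximum number of relations (tables)
    A: int     # maximum number of attributes per relation
    S: int     # maximum disk space
    C: int     # collection size threshold
    Supp: int  # minimum support


@dataclass(frozen=True)
class Relation:
    name: str
    attributes: frozenset
    required: frozenset


def can_store(obj_attributes, relation):
    """An object qualifies for a relation if it has all required attributes."""
    return relation.required <= set(obj_attributes)


def is_large_collection(occurrences, params):
    """Attributes with C or more occurrences become nested relations."""
    return occurrences >= params.C


def estimated_space(relation, row_count, field_width):
    """Fixed-length-record estimate: every row pays for every attribute,
    including null entries. field_width is an assumed constant."""
    return row_count * len(relation.attributes) * field_width


# Illustrative values; S and C are invented for this example.
params = StorageParams(K=3, A=8, S=1_000_000, C=4, Supp=8500)
article2 = Relation(
    name="Article2",
    attributes=frozenset(
        {"author1", "author2", "author3", "pages", "year", "volume", "number"}
    ),
    required=frozenset({"author1", "author2"}),
)

print(can_store({"author1", "title", "year"}, article2))     # False
print(can_store({"author1", "author2", "year"}, article2))   # True
print(is_large_collection(2, params))                        # False
print(estimated_space(article2, row_count=1000, field_width=32))  # 224000
```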
Storage patterns. We use storage patterns, which are different from WL's patterns. A storage pattern has the form $\{l_1 : P_1, \ldots, l_p : P_p\}$, where $P_1, \ldots, P_p$ are other storage patterns, and each label $l_i$ is either a[i] (an indexed attribute) or a[*] (a collection attribute). An example of a pattern is:

Audit.taxpayer:{name[1], phone[2], address[*]:{street[1], city[1]}}
As in WL's algorithm, we assume the semistructured data instance to be given by a collection of root objects: the storage pattern then applies to the root objects, and its support is the number of root objects satisfying it (see footnote 2). Intuitively, this pattern is contained in all taxpayer objects with at least a name, two phones, and C addresses, the latter with streets and cities. Note that phone[1] is missing: only the highest index occurs in a storage pattern. This is different from WL's algorithm. Also, * is not considered in WL's algorithm.
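How a pattern's data support might be computed can be sketched as follows. The dictionary encoding of patterns is hypothetical, and one point of the semantics is an assumption: for a label with a subpattern, only children that themselves satisfy the subpattern are counted towards the required number of occurrences.

```python
STAR = "*"

# Encoding of {name[1], phone[2], address[*]:{street[1], city[1]}}:
# keys are (attribute, index) pairs, values are subpatterns or None.
PATTERN = {
    ("name", 1): None,
    ("phone", 2): None,
    ("address", STAR): {("street", 1): None, ("city", 1): None},
}


def satisfies(obj, pattern, c):
    """obj is a list of (attribute, value) pairs; nested objects are lists.

    a[i] requires at least i occurrences of a, a[*] at least c occurrences.
    """
    for (label, index), sub in pattern.items():
        children = [value for lab, value in obj if lab == label]
        if sub is not None:
            # Assumption: only children matching the subpattern count.
            children = [
                v for v in children
                if isinstance(v, list) and satisfies(v, sub, c)
            ]
        needed = c if index == STAR else index
        if len(children) < needed:
            return False
    return True


def support(roots, pattern, c):
    """Data support: number of root objects satisfying the pattern."""
    return sum(1 for obj in roots if satisfies(obj, pattern, c))


# Illustrative objects, not taken from the paper's data.
addr = [("street", "Main St"), ("city", "Springfield")]
t1 = [("name", "A"), ("phone", "1"), ("phone", "2"),
      ("address", addr), ("address", addr)]
t2 = [("name", "B"), ("phone", "1"), ("address", addr), ("address", addr)]

print(support([t1, t2], PATTERN, c=2))  # 1
```

Only t1 has two phones, so with C = 2 the support is 1; raising C to 3 drops it to 0 because neither object has three addresses.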
Modeling Query Support. Queries on semistructured data typically have one or more data patterns, which specify paths to match in the input data, and conditions on the variables bound by the patterns. To account for a query mix, we represent a query's patterns as data, and extend the data mining algorithms to the data derived from the queries. The generated data ignores any query conditions other than patterns, and each pattern in a query is represented individually. Equivalently, we may assume that each query contains one pattern and no other conditions. Hence, a query is a tree, with edges labeled with attribute constants, attribute variables, or regular path expressions. Given the weight f of the query Q, we assume the data that "corresponds" to the pattern occurs f times. Given a query mix $Q_1, \ldots, Q_k$ with weights $f_1, \ldots, f_k$, we define the query support of a storage pattern P to be the sum of all $f_i$ for which P is contained in $Q_i$. The combined support of P is the sum of the data support and the query support.
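The definitions of query support and combined support translate directly into a few lines of code. In this minimal sketch, patterns and queries are both reduced to sets of attribute paths and "P is contained in Q" is taken to mean set inclusion; that reduction, the function names, and the query weights are assumptions made for illustration. The data support of 307 is the number of book entries reported in Sec. 5.

```python
def query_support(pattern, query_mix):
    """Sum of the weights f_i of all queries Q_i that contain the pattern.

    pattern and each query are frozensets of attribute paths (an assumed
    simplification); query_mix is a list of (query, weight) pairs.
    """
    return sum(weight for query, weight in query_mix if pattern <= query)


def combined_support(data_support, pattern, query_mix):
    """Combined support = data support + query support."""
    return data_support + query_support(pattern, query_mix)


book_pattern = frozenset({"book.author", "book.title"})
query_mix = [
    (frozenset({"book.author", "book.title", "book.year"}), 9000),
    (frozenset({"article.author"}), 100),
]

print(query_support(book_pattern, query_mix))          # 9000
print(combined_support(307, book_pattern, query_mix))  # 9307
```

With these made-up weights, a pattern whose data support alone (307) is far below a minimum support of 8500 clears the threshold once the query mix is taken into account.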
Footnote 1: Oracle 8.0.4 imposes a limit of 1000.
Footnote 2: In the case when the semistructured data is given as a graph with a unique root, the objects of interest need to be discovered in a separate step [DFS98].
Inproceedings: author1(required) author2 author3 title pages year booktitle url
Article1:      author(required) title pages year volume journal number url
Article2:      author1(required) author2(required) author3 pages year volume number

Figure 4: Generated relations for A=8
The algorithm. Our algorithm consists of a data mining part and a storage generation part. In the data mining part we deploy an extension of WL's algorithm to find storage patterns with high combined support. Patterns are grown in WL's algorithm from simple to complex: for each storage pattern P we keep a pointer back to its subpattern with the highest data support. We stop generating patterns either when the support decreases below Supp or when the patterns reach A leaves, whichever comes first. The result of the data mining part is a set of maximally supported storage patterns.

In the storage generation part we start by selecting K of the maximally supported storage patterns. Each such pattern defines a relation. Next we choose the required attributes of that relation. Here we follow the pointers back to smaller and smaller subpatterns with high data support. We try to select the smallest possible such subpattern for each of the K relations, to allow the relation to contain as many objects as possible. At the same time, we have to avoid too much overlap between the relations: a small required set of attributes may result in too much overlap, and this may exceed the maximum disk space S. We deploy some heuristics to reconcile these conflicting goals.
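The growth of patterns and the back pointers can be illustrated with a deliberately simplified sketch. It is not WL's algorithm or the authors' extension: patterns are reduced to flat sets of attribute names, only data support is used, and the growth is a plain level-wise search. The function names and the sample objects are hypothetical.

```python
def mine_patterns(objects, supp, max_leaves):
    """Level-wise growth of flat attribute-set patterns.

    objects: list of frozensets of attribute names (one per root object).
    Returns (maximal supported patterns, back pointers), where each back
    pointer leads to the subpattern with the highest data support.
    """
    def support(pattern):
        return sum(1 for obj in objects if pattern <= obj)

    attributes = sorted(set().union(*objects))
    level = {frozenset([a]) for a in attributes
             if support(frozenset([a])) >= supp}
    supported = set(level)
    parent = {}
    while level:
        next_level = set()
        for pattern in level:
            if len(pattern) >= max_leaves:
                continue  # stop growing once a pattern has A leaves
            for a in attributes:
                grown = pattern | {a}
                if a in pattern or grown in next_level:
                    continue
                if support(grown) >= supp:  # stop when support drops below Supp
                    next_level.add(grown)
                    parent[grown] = max(
                        (grown - {x} for x in grown), key=support
                    )
        supported |= next_level
        level = next_level
    maximal = [p for p in supported if not any(p < q for q in supported)]
    return maximal, parent


def required_candidates(pattern, parent):
    """Follow back pointers to smaller and smaller subpatterns."""
    chain = [pattern]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


# Illustrative objects, described only by their attribute sets.
objects = [
    frozenset({"author", "title", "year"}),
    frozenset({"author", "title", "year"}),
    frozenset({"author", "title"}),
    frozenset({"author", "url"}),
]
maximal, parent = mine_patterns(objects, supp=2, max_leaves=3)
print(maximal)
print(required_candidates(maximal[0], parent))
```

Here the single maximal pattern is {author, title, year}, and following the back pointers yields {author, title} and then {author} as progressively smaller candidates for the required attributes. The heuristics that trade coverage against overlap and disk space are not modeled.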
5 Preliminary Experiments

We ran some preliminary experiments on DBLP, the popular database bibliography Web site, available from:

http://www.informatik.uni-trier.de/~ley/db/about/instr.html

DBLP consists of a collection of XML-like files, with no explicit structure. Each XML file corresponds to one publication: a proceedings paper, a journal article, a book, a PhD thesis, etc. A typical entry is:

Serge Abiteboul, Querying Semi-Structured Data., 1-18, 1997, ICDT, db/conf/icdt/icdt97.html#Abiteboul97
The publication data is irregular: some entries have multiple authors, optional urls, or many citation attributes; a few have unfamiliar attributes. Most attributes have scalar values, but there are exceptions. There are about 92,000 publications (root objects) and 861,000 edges, and the total disk space is 95M. We chose a minimum support of 8500, which is 8.6%. Only publications of type article and inproceedings reached the minimum support. All others had much less than our minimum support; for example, there were only 307 books and 67 PhD theses. In a separate experiment (not reported for lack of space), we also considered a query mix, in which one query with high weight referred to books. In that experiment, books did reach the minimum support, and the system generated a relation for storing book objects. We found no collection attributes; citation was a good candidate, but it did not have high enough support. We also found no nested attributes with high enough support.

Fig. 4 shows the relations generated with A=8. There were only 8 attributes of high support for inproceedings, and all 8 in combination still had high support; therefore a single table was generated for inproceedings. There were more than 8 attributes of high support for article; therefore the article objects were split into two relations. The first, Article1, has only author as a required attribute: this allows it to cover most objects. The second, Article2, requires both author1 and author2, which gives it the best chance of storing those objects not stored in Article1.

We ran two sets of experiments: one that varied the maximum number of attributes per relation (denoted A in Sec. 4), and one that varied the total disk space allocated to the relations. The results are shown in Table 2.

A               3       4       5       6      7      8      9
No. relations   9       9       5       4      4      3      3
Coverage        91%     94%     94%     90%    92%    90%    90%
Space           1.19M   1.59M   1.15M   1M     1M     0.9M   1.2M
Nulls           23k     116k    112k    123k   91k    106k   201k

Space      0.5M   0.78M   0.93M   1.0M
Nulls      2.5k   40k     106k    106k
Coverage   59%    77%     90%     90%

Table 2: Effect of varying maximum number of attributes per relation and maximum disk space.

The first part of Table 2 shows that data fragmentation directly depends on the maximum relation arity. For small A's, many relations are generated, and large objects are split vertically. At query time, split objects must be joined. On the other hand, space is better utilized for small A's, because the number of null entries is smaller. The coverage (the fraction of edges stored in the relational part) is roughly constant, at approximately 90%. The second set of results shows a clear degradation of the coverage when disk space is limited.
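The query-time cost of vertical splitting can be made concrete with a small sketch that loads the Taxpayer and Address1 relations of Fig. 3 into an in-memory SQLite database and rejoins them. The use of SQLite, the column types, and the renaming of the "no" column to street_no (to stay clear of SQL keywords) are choices made for this illustration and are not part of the paper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Taxpayer (oid TEXT PRIMARY KEY, name TEXT, audit1 TEXT,
                           audit2 TEXT, taxamount INTEGER, taxevasion TEXT);
    CREATE TABLE Address1 (taxpayeroid TEXT, street TEXT, street_no INTEGER,
                           apt TEXT, zip TEXT);
    """
)
# Rows from Fig. 3; None stands for a null entry.
conn.executemany(
    "INSERT INTO Taxpayer VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("o24", "Gluschko", "10/12/63", None, 12332, None),
        ("o21", "Kosberg", "11/1/68", "10/12/77", 0, "likely"),
        ("o20", "Korolev", "10/12/86", None, 0, "likely"),
    ],
)
conn.executemany(
    "INSERT INTO Address1 VALUES (?, ?, ?, ?, ?)",
    [
        ("o24", "Tyuratam", None, "2C", "07099"),
        ("o21", "Tyuratam", 206, None, "92443"),
    ],
)

# Reassemble the vertically split objects: one join per split.
rows = conn.execute(
    """
    SELECT t.name, a.street, a.zip
    FROM Taxpayer AS t JOIN Address1 AS a ON a.taxpayeroid = t.oid
    ORDER BY t.oid
    """
).fetchall()
print(rows)
```

Taxpayer o20 does not appear in the result because its address lives in Address2; a query over all addresses would need a second join or a union, which is the fragmentation cost that grows as A shrinks.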
6 Conclusions

We have proposed a novel method for storing semistructured data in a relational database management system. The method assumes a certain regularity to exist in the semistructured data, which is then exploited to derive good relational storage. Our preliminary experiments using the DBLP bibliography database show a 90% data coverage.
References

[BBT92] G.E. Blake, Tim Bray, and F. Tompa. Shortening the OED: experience with a grammar-defined database. ACM Transactions on Information Systems, 10(3):213-232, July 1992.

[CACS94] V. Christophides, S. Abiteboul, S. Cluet, and M. Scholl. From structured documents to novel query facilities. In Richard Snodgrass and Marianne Winslett, editors, Proceedings of 1994 ACM SIGMOD International Conference on Management of Data, Minneapolis, Minnesota, May 1994.

[DFS98] A. Deutsch, M. Fernandez, and D. Suciu. Storing semistructured data with STORED, 1998. Manuscript available from http://www.research.att.com/~suciu.

[FFK+98] Mary Fernandez, Daniela Florescu, Jaewoo Kang, Alon Levy, and Dan Suciu. Catching the boat with Strudel: experience with a web-site management system. In Proceedings of ACM-SIGMOD International Conference on Management of Data, 1998.

[GT87] G. Gonnet and F. Tompa. Mind your grammar: A new approach to modelling text. In Proceedings of 13th International Conference on Very Large Databases, pages 339-346, 1987.

[KKEX97] K. Böhm, K. Aberer, E. Neuhold, and X. Yang. Structured document storage and refined declarative and navigational access mechanisms in HyperStorM. VLDB Journal, 6(4):296-311, November 1997.

[MKK96] M. Volz, K. Aberer, and K. Böhm. Applying a flexible OODBMS-IRS-Coupling to structured document handling. In International Conference on Data Engineering, February 1996.

[NAM98] S. Nestorov, S. Abiteboul, and R. Motwani. Extracting schema from semistructured data. In Proceedings of the ACM Conference on Management of Data, pages 295-306, 1998.

[PGMW95] Y. Papakonstantinou, H. Garcia-Molina, and J. Widom. Object exchange across heterogeneous information sources. In IEEE International Conference on Data Engineering, pages 251-260, March 1995.

[QRS+95] D. Quass, A. Rajaraman, Y. Sagiv, J. Ullman, and J. Widom. Querying semistructured heterogeneous information. In International Conference on Deductive and Object Oriented Databases, pages 319-344, 1995.

[RTW96] D. Raymond, F. Tompa, and D. Wood. From data representation to data models. Computer Standards & Interfaces, 18:25-36, 1996.

[ST94] A. Salminen and F. W. Tompa. Pat expressions: An algebra for text search. Acta Linguistica Hungarica, 41(1-4):277-306, 1994.

[WL98] Ke Wang and Huiqing Liu. Discovering typical structures of documents: a road map approach. In ACM SIGIR Conference on Research and Development in Information Retrieval, August 1998.