IEICE TRANS. INF. & SYST., VOL. E82–D, NO. 1 JANUARY 1999
INVITED PAPER
Special Issue on New Generation Database Technologies
Revisiting the Hierarchical Data Model
H.V. JAGADISH†, Laks V.S. LAKSHMANAN††, and Divesh SRIVASTAVA†, Nonmembers
SUMMARY Much of the data we deal with every day is organized hierarchically: file systems, library classification schemes and yellow page categories are salient examples. Business data, too, benefits from a hierarchical organization, and indeed the hierarchical data model was quite prevalent thirty years ago. Due to the recently increased importance of X.500/LDAP directories, which are hierarchical, and the prevalence of aggregation hierarchies in datacubes, there is now renewed interest in the hierarchical organization of data. In this paper, we develop a framework for a modern hierarchical data model, substantially improved over the original version by taking advantage of the lessons learned in the relational database context. We argue that this new hierarchical data model has many benefits compared with the ubiquitous flat relational data model.
key words: hierarchy, X.500, LDAP, directories, white pages, data warehousing
1. Motivation
Hierarchical organization of data is ubiquitous. For instance, the files we store in most file systems (for example, UNIX) are organized hierarchically, in directories and sub-directories. Usually, this organization is based on characteristics like subject, function, date and author, so that closely related files are found close by in the hierarchy. Standardized taxonomies, such as those used by biologists to classify species or by librarians to classify books, provide instances of hierarchies that are more carefully defined and more rigid. When we browse bookshelves in a library, we typically find similar books next to one another, by some common sense definition of similarity that has been codified by librarians. As a final example, consider yellow pages in a telephone directory. Telephone listings are categorized, and often further divided into sub-categories (for example, automobiles may be divided into automobiles:new car dealers, automobiles:used car dealers, automobiles:parts:wholesale, automobiles:parts:retail and automobiles:repair). This listing by category and sub-category is very useful when one does not know the exact name of a business, but has in mind only a business category.

Manuscript received August 11, 1998.
† The authors are researchers at AT&T Labs–Research, USA.
†† The author is with the faculty at Concordia University, Canada. Much of this work was done when the author was visiting AT&T Labs–Research, USA.
The reason information tends to be organized hierarchically is not too hard to understand: hierarchical organizations place natural limits on the number of “sibling” entries for a given entry in the database; this promotes ease of use for humans and potentially higher performance for machines. Browsing becomes possible. The hierarchical division also provides easy partitioning of the data set into subsets (rooted subtrees) that can be managed autonomously. A simple flat organization does not provide these benefits.

The question we seek to address in this paper is why, in the light of the ubiquity and widely demonstrated value of hierarchical storage, database management systems do not display more hierarchical organization. We begin in Sect. 2 with a quick look at the traditional hierarchical data model and attempt to explain its demise. In Sects. 3 and 4, we propose a new hierarchical data model, and a high-level query language for our model, driven by the recent success of X.500 and LDAP. In Sects. 5 and 6, we demonstrate the applicability of our model and language to two diverse applications: white pages directories and data warehouses. Finally, in Sect. 7, we summarize the many benefits of our new hierarchical data model with respect to the ubiquitous flat relational data model.

2. The Traditional Hierarchical Data Model
Thirty years ago, a hierarchical data model was prevalent in database management systems. Indeed, many database textbooks begin with a short discussion of this model (see, for example, [2], [14]). Even today, there remain many commercial installations of IMS Fastpath [5] (and possibly other) hierarchical database systems. However, it is now universally accepted as fact that the (flat) relational model has clearly supplanted the hierarchical (and the network) model of yesteryear. An argument could conceivably be made based on these facts that databases are more amenable to a flat representation than a hierarchical representation. However, this would imply that databases somehow differ from most other forms of data storage, which tend to be hierarchical; so this thesis is hard to accept. Instead, we find other more compelling reasons through a closer examination of the traditional hierarchical data model. The fundamental notion was a hierarchical scheme,
with objects of one type “owning” or being the “parents” of objects of a second type. Thus, Suppliers could be parents of Parts manufactured. For each supplier, there would be a collection of children corresponding to the parts manufactured by that supplier. This tree was then directly reflected in the physical storage — records were stored using a tree (typically preorder) traversal. (To make the data structure updatable, links to subsequent records were stored as pointers, permitting the actual records to be moved around as needed). Of course, there were parts that had more than one supplier. Also, one may often have wanted to find out the supplier given a part. Neither of these would be possible in the strict hierarchical structure described in the preceding paragraph. Therefore, a virtual record type was introduced, the sole purpose of which was to permit the database structure to become an arbitrary graph rather than a tree. With virtual records in place, the hierarchy simply became a user-visible preferred data organization on what was really a network database. Along with this rather general storage structure came a procedural query language. The most popular, DL/1, looked like procedure calls in a COBOL program. There was a strong notion of the state of computation, including such things as “current record,” with operations like “get leftmost” and “get next.” In effect, data manipulation was extremely low-level and operational. There was no declarative query language, little abstraction, and purely navigational access. The relational revolution was so successful because it raised database concerns to a higher level, providing for the first time a well-defined query language, and the possibility of automated query optimization. The main point to note here is that the value introduced was primarily in terms of the possibility of high-level constructs, rather than in terms of a better data model. 
The extreme simplicity of the relational model was useful in developing the abstraction, but not necessarily a desired end in itself. Thinking of data as records in large flat tables appeared to be appropriate for a variety of applications, so relational databases had no shortage of immediate applications. If we now fast forward to the 1980s we see a strong push made by object-oriented databases; the primary argument for their emergence was to address inadequacies in commercial relational databases (see, for example, [18]). While object-oriented databases added a great deal of conceptualization, introducing concepts such as identity and encapsulation, the fundamental data organization of an OODB has been compared to a network data model [15]. OODBs have not thrived in the market place because the entrenched relational vendors came back with “object-relational” databases to address some of the key shortcomings of relational databases. Further, on the conceptual level, logical models for object databases tend to be rather complex,
and often of limited practical value. In contrast, we claim in this paper that the hierarchical data model now has an opportunity for a new lease on life. The primary reason for the demise of the hierarchical data model has now been addressed: in X.500 and LDAP we have the basis of a well-defined data model, query language and access protocol for hierarchical data. The benefits of a hierarchical model (the partitioning of concerns, the possibility of autonomous management of different pieces of the database, and so forth) can be exploited without losing the main advances made possible by relational technology in the past few decades.

3. The Modern Hierarchical Data Model
X.500 and LDAP (see, for example, [4], [16]) use hierarchy as a central basis to represent and access data, and have been extensively used for corporate and global white pages services, providing, for example, electronic mail addresses of people connected to computer networks. Microsoft has adopted X.500 and LDAP as the core of its Active Directory [13], which is being proposed as a core component for use by multiple Microsoft products. X.500 and LDAP are also the basis of the recent directory enabled networks (DEN) initiative by Cisco and Microsoft for representing profiles of network users, applications and services, as well as policies for the overall management of the network, in a directory [1]. We present here a modern hierarchical data model inspired by X.500 and LDAP.

3.1 Entries

The primary construct in the modern hierarchical model for holding information is the entry, which contains information about one entity, for example, a person in a white pages directory, or a product in a data warehouse, or about one relationship, for example, the sales of a product in a store on a given date. Each entry consists of a set of attribute-value pairs, each pair holding a single piece of information about the entity or the relationship, of a specified type. It is easy to see that individual files in the UNIX file system can be modeled in this fashion using, for example, the attributes fileName, fileContent, fileSize, filePermissions, fileOwner, creationTime, lastModificationTime. In a white pages directory, the entry for the person Divesh Srivastava could contain (among others) the following attribute-value pairs:
    Attribute          Value
    givenName          divesh
    surName            srivastava
    commonName         divesh srivastava
    mail               [email protected]
    telephoneNumber    +1 973 360 8776
    fax                +1 973 360 8871
    objectClass        person
    objectClass        organizationalEmployee
In a Walmart data warehouse, the entry for the sales of Diet Coke in 1997 could contain (among others) the following attribute-value pairs:

    Attribute      Value
    productId      beverages:coca cola:diet coke
    date           year=1997
    unitsSold      100000
    totalSales     103000.00
    objectClass    PSDSales
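The two example entries above can be sketched in code as follows. This is only an illustrative representation, not part of X.500 or LDAP: the class name Entry and its methods are hypothetical, the mail and fax attributes are left out for brevity, and the object class name PSDSales follows the queries of Sect. 6. The point of the sketch is that an entry is a collection of attribute-value pairs in which one attribute (here objectClass) may occur several times.

```python
class Entry:
    """An entry: a set of attribute-value pairs; an attribute may repeat."""

    def __init__(self, pairs):
        # Map each attribute name to the list of its values.
        self._values = {}
        for attribute, value in pairs:
            self._values.setdefault(attribute, []).append(value)

    def values(self, attribute):
        """Return all values of an attribute (empty list if absent)."""
        return list(self._values.get(attribute, []))

    def has(self, attribute):
        """Presence test: does the entry carry this attribute at all?"""
        return attribute in self._values


# The white pages entry from the text (subset of its attributes).
person = Entry([
    ("givenName", "divesh"),
    ("surName", "srivastava"),
    ("commonName", "divesh srivastava"),
    ("telephoneNumber", "+1 973 360 8776"),
    ("objectClass", "person"),
    ("objectClass", "organizationalEmployee"),
])

# The data warehouse entry from the text.
sales = Entry([
    ("productId", "beverages:coca cola:diet coke"),
    ("date", "year=1997"),
    ("unitsSold", 100000),
    ("totalSales", 103000.00),
    ("objectClass", "PSDSales"),
])
```

Storing a list per attribute, rather than a single value, is what lets one entry belong to several object classes at once.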
3.2 Attributes and Hierarchical Types

Each attribute has a specified type, independent of the entries in which it occurs. For example, attributes surName and givenName in the white pages directory are of type string, attributes telephoneNumber and fax in the white pages directory are of type tel, and attribute totalSales in the data warehouse is of type dollaramount.

Often, we have attribute values drawn from domains that can usefully be organized hierarchically. This is certainly true for most categorical attribute domains. For example, attribute productId can be defined to have a hierarchical type product; beverages is a parent of each of beverages:coca cola and beverages:pepsi, and beverages:coca cola:diet coke is a child of beverages:coca cola. As another example, attribute date may have a hierarchical type timeperiod, where the value year=1997 is a parent of year=1997:month=2, and year=1997:month=2:day=15 is a child of year=1997:month=2.

When we do not know the exact value of some attribute, but do know which subtree it falls under in the attribute value hierarchy, the value can be incompletely specified by merely providing the root of the subtree. For instance, a fully specified date is expected to be of the form year=1997:month=2:day=15. Suppose we do not know the exact date of a transaction. We may still record in the data warehouse entry our knowledge that the transaction took place on some day in year=1997:month=2. Such incompletely specified information may be useful in some contexts, compared to the alternative of leaving the attribute value completely unspecified, thereby providing no information.
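A minimal sketch of how such hierarchical values might be compared, assuming that a value is written as a colon-separated path from the root of its domain, as in the examples above. The function names are hypothetical. The last function captures the idea of incomplete specification: a stored value that names only a subtree root is consistent with every value inside that subtree.

```python
def components(value):
    """Split a hierarchical value into its path components."""
    return tuple(value.split(":"))


def parent(value):
    """Return the parent value, or None for a top-level value."""
    parts = components(value)
    return ":".join(parts[:-1]) if len(parts) > 1 else None


def is_ancestor_or_self(general, specific):
    """True if `general` is `specific` itself or lies above it in the hierarchy."""
    g, s = components(general), components(specific)
    return s[:len(g)] == g


def consistent(stored, queried):
    """True if an (possibly incompletely specified) stored value could match.

    A stored value such as year=1997:month=2 is consistent with any day
    of that month, and with the month and the year themselves.
    """
    return is_ancestor_or_self(stored, queried) or is_ancestor_or_self(queried, stored)
```

Comparing whole components, rather than raw string prefixes, keeps year=1997:month=2 from being mistaken for an ancestor of year=1997:month=21.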
3.3 Object Classes

We use the concept of an object class to specify the type of entity or relationship a given entry represents, and an attribute named objectClass, which each entry must contain, to specify to which object class(es) the entry belongs. Each object class has a definition that lists all the attributes that could be present in an entry with that object class, and each attribute in the definition of an object class appears with several kinds of annotations:

Mandatory/Optional: Mandatory attributes are those attributes that must be present, and optional attributes are those that may be present, in an entry of a given object class. For example, the attributes surName and mail could be mandatory attributes, while fax, telephoneNumber and givenName could be optional attributes, of the object class organizationalEmployee.

Local/Inherited: A local attribute is defined explicitly as part of an entry. An inherited attribute has its value specified once at some node and then inherited by all immediate children that belong to a particular object class. For example, the attributes surName and mail are local attributes of the object class organizationalEmployee, whereas the attribute state could be inherited from the parent node, which records that all entries of object class organizationalEmployee in some organizational unit have state as new jersey.

We saw above how many attribute value domains can naturally be organized in hierarchies. We can use this hierarchical representation as a natural bridge to creating more succinct, and more meaningful, representations of data. The full details can be found in [8], [9].

3.4 Hierarchical Namespace of Entries

Each entry is associated with a unique name, and the hierarchical nature of our data model arises from the hierarchical organization of the namespace of the entries; this hierarchical organization is called the information tree. Each entry occupies a certain location in the information tree.
An entry that has no children is called a leaf entry; an entry that has children is a non-leaf entry. Each entry is associated with a relative distinguished name (RDN), which can be viewed as a “key” that distinguishes an entry from its sibling entries. There is a “root” entry that forms the base node of the information tree. The distinguished name (DN) of a specific entry is the sequence of RDNs of the entries on the path from the root entry to the entry in question; the DN can be viewed as a “key” that distinguishes an entry from all other entries. RDNs and DNs are analogous to file names and fully-qualified file names in the UNIX file system.

For example, in a corporate white pages directory, the natural hierarchy of organizational units in the organization can serve to hierarchically organize the namespace of directory entries. In a data warehouse for a US company, the geographical hierarchy of region-state can serve to hierarchically organize the namespace of the data warehouse entries.

3.5 White Pages Directories

Our first application is a white pages directory belonging to a corporation, say AT&T. Assume that the root of this corporate white pages directory is an entry of type organization. This organization entry has multiple children of type organizationalUnit, some of which may have other organizational units as children, reflecting the organizational structure in AT&T; an example organizational entry corresponds to AT&T Labs–Research. The directory also contains entries of type person, and they may be children of any of the entries of type organization or organizationalUnit. Figure 1 depicts an information tree for this example.

Fig. 1 Example white pages directory information tree.

3.6 Data Warehouses

Our second application is a data warehouse for a US retail company, say Walmart. Assume that the root of this warehouse is an entry of type organization. This organization entry has multiple children of type
regionUnit, each of which has multiple children of type stateUnit, each of which has multiple children of type store, each of which has a child entry summarizing the total sales in the store for each combination of productId and date. Figure 2 depicts an information tree for this example.

4. Querying Hierarchically Organized Data
For most purposes, hierarchical classification schemes work very well — after all most of us do manage to find files of interest on a computer typically without too much searching. However, there are times when we forget how we classified a file, or the basis used for classification at creation time is found to be inappropriate for the task at hand. In this case, what one needs is the ability to perform a search (or issue a selection query) over all or some large fraction of the database. In the context of hierarchical file systems, the glimpse utility [11] provides such an index. In the context of searches on the World Wide Web (viewed as a single, huge hierarchy, reflected in the structure of the URLs), [7] describes how to focus keyword-based searches, by restricting the space of documents searched to dynamically defined subtrees of the Web hierarchy. Our hierarchical data model was inspired by X.500/LDAP; hence, it is instructive to look at the LDAP query language and point out its limitations, in order to motivate our constructs for querying hierarchically organized data.
Fig. 2 Example data warehouse information tree.
4.1 LDAP Queries

An LDAP query consists of a base object, a search scope and a filter [4]. Atomic LDAP filters can compare individual attributes with values, test for the presence of an attribute, or do wildcard comparisons with the value of an attribute. Atomic LDAP filters can be combined using the standard boolean operators: and (&), or (|), not (!), in a parenthesis-prefix notation, to form complex LDAP filters.

The applicability of the LDAP filter can be restricted in two ways. First, one can specify a base object, using an LDAP distinguished name that is the base object entry relative to which the filter is to be evaluated. Second, one can specify a scope, which indicates whether the filter is to be evaluated only at the base object (base), or one level below it, among the children of the base object (one), or in the entire directory information subtree rooted at the base object (sub). In particular, the entire database can be searched by specifying the base object to be the root of the hierarchically organized data, and using the subtree scope. We represent an LDAP query using the syntax:

    base object DN ? scope ? filter

A query returns a sequence of zero or more responses, each consisting of an entry found during the search, containing all attributes, complete with the associated values. From the database point of view, this is analogous to a selection query. For example, the LDAP query

    o = AT&T ? sub ? (&(surName = srivastava) (!(givenName = divesh)))

would match all entries in AT&T (base object DN: o = AT&T and scope: sub) whose surName is srivastava and whose givenName is not divesh.

4.2 Limitations of the LDAP Query Language

Often, restricting the query to a single subtree (specified using the root of this subtree as the base of the query) may not suffice. Instead, one may wish to execute the query against subtrees rooted at several different specified entries in the information tree. For example, one may want to ask the query “Find all entries in AT&T whose surName is srivastava, and who are in the same immediate organizational unit as someone whose surName is jagadish.” Of course, one can do so by issuing a sequence of LDAP queries, for example, as follows. First, issue the LDAP query:
    o = AT&T ? sub ? (surName = jagadish)

Next, for each of the entries that are returned by this first query, determine its immediate organizational unit entry, and issue a second LDAP query, for example:†

    o = AT&T : ou = AT&T Labs-Research ? one ? (surName = srivastava)

The number of LDAP queries issued at this second stage depends on the number of entries returned by the first query. Finally, one needs to compute the union of the results of the various LDAP queries issued at the second stage.

An alternative way to express the query is to first issue the LDAP query:

    o = AT&T ? sub ? (surName = srivastava)

Next, for each of the entries that are returned by this first query, test (again using multiple LDAP queries) whether or not its immediate organizational unit has a child entry with surName as jagadish. If it does, then the original entry is part of the final answer, else it is not.

These different ways of expressing the original query using sequences of LDAP queries are not only awkward, but they can also result in considerably higher evaluation costs; clearly, one would prefer to be able to pose a single composite query to get the desired result. A systematically designed query language that allows the user to express complex queries, instead of requiring the user to pose a large number of simpler queries for the same task, also allows for the possibility of automated query optimization. We describe such a language, the marking language, next.

4.3 The Marking Language

The marking language consists of a collection of marking operators, each of which takes (possibly) multiple lists of entries and conditions as arguments, and returns a single list of entries as its result. This allows expressions in the marking language to be composed to form complex marking expressions. The standard boolean operators, and (&), or (|), and not (!), are marking operators.
In addition, the marking language includes hierarchical location operators, structural aggregate selection operators, and value semijoin operators. We illustrate these operators using queries drawn from our two motivating applications in the two sections that follow.
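To make the base object, scope and filter of Sect. 4.1 concrete, and to show why the multi-query workaround of Sect. 4.2 is awkward, here is a small in-memory sketch. It is an illustration under stated assumptions, not an LDAP client: the names Node, search, eq, and_ and not_ are hypothetical, filters are ordinary Python predicates rather than parsed filter strings, the miniature directory (including the cn values and the second organizational unit) is invented, and DNs are written root-first with ":" separators as in the rest of this paper.

```python
class Node:
    """One entry in the information tree; attrs maps attribute -> list of values."""

    def __init__(self, rdn, attrs, parent=None):
        self.rdn = rdn
        self.attrs = attrs
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def dn(self):
        """DN as the root-first sequence of RDNs (this paper's convention)."""
        parts, node = [], self
        while node is not None:
            parts.append(node.rdn)
            node = node.parent
        return " : ".join(reversed(parts))


def subtree(node):
    """Yield the node and all its descendants in preorder."""
    yield node
    for child in node.children:
        yield from subtree(child)


def eq(attribute, value):
    return lambda node: value in node.attrs.get(attribute, [])


def and_(*filters):
    return lambda node: all(f(node) for f in filters)


def not_(flt):
    return lambda node: not flt(node)


def search(base, scope, flt):
    """Evaluate a filter at the base, at its children, or in its whole subtree."""
    if scope == "base":
        candidates = [base]
    elif scope == "one":
        candidates = base.children
    elif scope == "sub":
        candidates = subtree(base)
    else:
        raise ValueError("unknown scope: " + scope)
    return [node for node in candidates if flt(node)]


# Hypothetical miniature directory.
att = Node("o = AT&T", {"objectClass": ["organization"]})
research = Node("ou = AT&T Labs-Research", {"objectClass": ["organizationalUnit"]}, att)
other = Node("ou = Other Unit", {"objectClass": ["organizationalUnit"]}, att)
jag = Node("cn = jagadish", {"surName": ["jagadish"]}, research)
divesh = Node("cn = divesh srivastava",
              {"surName": ["srivastava"], "givenName": ["divesh"]}, research)
other_s = Node("cn = a srivastava",
               {"surName": ["srivastava"], "givenName": ["a"]}, other)

# o = AT&T ? sub ? (&(surName = srivastava) (!(givenName = divesh)))
not_divesh = search(att, "sub",
                    and_(eq("surName", "srivastava"), not_(eq("givenName", "divesh"))))

# Sect. 4.2 workaround: one query for jagadish, then one query per unit found,
# then a union of the second-stage results computed by the client.
units = []
for hit in search(att, "sub", eq("surName", "jagadish")):
    if hit.parent is not None and hit.parent not in units:
        units.append(hit.parent)
answer = []
for unit in units:
    for hit in search(unit, "one", eq("surName", "srivastava")):
        if hit not in answer:
            answer.append(hit)
```

The second half needs client-side loops and a hand-computed union, which is exactly the burden a single composite query would remove.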
† In X.500 and LDAP, the DN is the sequence of RDNs of the entries on the path from the entry in question to the root entry, using commas as separators. The base DN in this LDAP query was specified to be consistent with the convention used in the rest of this paper.

5. White Pages Directories

Since the X.500 model was devised explicitly for the purpose of storing white pages directory information, it is to be expected that this will be the first valuable application of the modern hierarchical data model, and our marking language. We look at some instances of the use of our marking language in the context of our white pages directory application.

5.1 Hierarchical Location

Suppose we want to ask the query “Find organizational units that directly contain persons whose surName is jagadish.” All organizational units can be located using the LDAP query:

    o = AT&T ? sub ? (objectClass = organizationalUnit)

All persons whose surName is jagadish can be located using the LDAP query:

    o = AT&T ? sub ? (&(objectClass = person) (surName = jagadish))

These two LDAP queries can be composed into a single query, to obtain the desired result, using the binary hierarchical selection operator children (c) in the query filter, as follows:

    o = AT&T ? sub ? (c(objectClass = organizationalUnit) (&(objectClass = person) (surName = jagadish)))

Intuitively, the meaning of the c operator can be understood as follows. Each entry that satisfies the condition “(objectClass = organizationalUnit)” (the first argument to the binary children operator), and has at least one child entry that satisfies the condition “(&(objectClass = person)(surName = jagadish))” (the second argument to the binary children operator), is returned.

One can now locate the entries whose surName is srivastava in organizational units that satisfy the above query, by composing the above query with a query that simply locates entries whose surName is srivastava using the symmetric hierarchical selection operator parents (p). Similar examples also arise when trying to model and unambiguously locate organizational and personal lists in directories; see [10] for more details.

5.2 Structural Aggregate Selection

Suppose we wish to locate organizational units in
AT&T that directly employ more than 10000 persons. Although LDAP queries cannot express selections based on aggregate conditions, the useful role played by aggregation in query languages such as SQL (see, for example, [12]) suggests the desirability of supporting marking operators involving aggregation. Our structural aggregate selection operators directly extend each of the hierarchical location operators by adding an extra argument that captures the aggregate selection condition. For example, the above query can be posed, using a ternary children (c) operator, and the aggregate selection condition “(count($2) > 10000),” as follows:

    o = AT&T ? sub ? (c(count($2) > 10000) (objectClass = organizationalUnit) (objectClass = person))

Intuitively, the meaning of the structural aggregate selection operator can be understood as follows. With each entry that satisfies “(objectClass = organizationalUnit)” (the second argument to the ternary children operator), associate all its children entries that satisfy the condition “(objectClass = person)” (the third argument to the ternary children operator). Against each such association, the aggregate selection condition “(count($2) > 10000)” is tested, and only those organizational units that have more than 10000 children person entries are returned. Note that the binary operators are special cases of the ternary operators, obtained by setting the aggregate selection condition to “(count($2) > 0).”

6. Data Warehousing
Data warehousing has gained considerable prominence in the database community (see, for example, [6], [17]). Moreover, it has almost nothing in common with white pages directories. To bring out the versatility of our hierarchical model and query language, we use it as our second example. A typical data warehouse has a very large “fat table” that has one record per business transaction. The business transaction could be the placement of a telephone call, the notation of a credit card charge, the ringing of a cash register, and so on. The main point to note is that in any sizable business, this table is likely to have millions, perhaps even billions of entries. A data cube [3] is used to compute aggregates of this information along several different dimensions. The most commonly cited examples of such dimensions, when aggregating sales information, are product class, location of sale, and time of sale. Thus a single cell in the cube could specify the total sales of a specific product type in a specific store on a specific day. It is easy to see that each of the dimensions
described above lends itself to a natural hierarchy, as described in Sect. 3.2. Figure 2 depicts a fragment of an information tree for the warehouse of a large retailer. Notice that the location hierarchy shown is region-state-store. Such a location hierarchy may be appropriate for the domestic U.S. operations of Walmart. Now suppose that Walmart has a few stores in Asia, and officially defines an “Asian region.” Given the small number of Walmart stores in Asia, one may want the next level of hierarchy below region to be store, directly. (Even if Walmart’s Asian operations grow, and there is a need to show an intermediate level, the most likely next level would be “country.”)

Notice that our hierarchical data model comfortably permits different hierarchical schemes to coexist within the same database. This flexibility permits different businesses, or divisions, to tailor their information to their own needs, and yet to provide an ability to respond to global queries issued without regard to this detail. For instance, a query that asks for total sales by region will get exactly what it expects, even though a drill-down may show that some regions are organized differently than others.

The marking language proposed above turns out to be of considerable value in a data warehousing context, just as it was in a directory context. A few examples are mentioned below.

6.1 Hierarchical Location

Suppose we wish to locate all Walmart stores where the Diet Coke sales have exceeded $100,000.00 in 1997. Recall that the relevant sales summaries are stored as children entries of store entries. The desired result can be obtained, using the binary hierarchical selection operator children (c) in the query filter, as follows:

    o = Walmart ? sub ?
      (c(objectClass = store)
        (&(objectClass = PSDSales) (date = year=1997)
          (productId = beverages:coca cola:diet coke) (totalSales > 100000)))

6.2 Structural Aggregate Selection

Selection using structural aggregation can also make use of aggregates computed over attribute values of entries. Suppose we wish to locate stateUnits of Walmart where the total sales (over all stores in that state) of Diet Coke in 1997 exceeds $1,000,000.00. This query can be posed using a ternary descendant (d) operator, and the aggregate selection condition “(sum($2.totalSales) > 1000000),” as follows:

    o = Walmart ? sub ?
      (d(sum($2.totalSales) > 1000000)
        (objectClass = stateUnit)
        (&(objectClass = PSDSales) (date = year=1997)
          (productId = beverages:coca cola:diet coke)))

6.3 Value Semijoin

Using the hierarchical selection and the structural aggregate selection operators described above, one can express a large number of practically important complex queries over hierarchically organized data. However, some interesting queries still cannot be expressed. Suppose we focus attention on the Walmart store with storeId = 7564312 in New Jersey, and want to locate all entries that summarize Diet Coke sales, and have a larger value of totalSales than the summary entry for Diet Pepsi sales on the same date. Expressing this query requires the use of a value semijoin on the date attribute. Our marking language includes a 4-ary operator value semijoin (v), and the above query can be expressed as follows:

    o = Walmart : . . . : storeId = 7564312 ? sub ?
      (v($1.date = $2.date)
        ($1.totalSales > sum($2.totalSales))
        (&(objectClass = PSDSales) (productId = beverages:coca cola:diet coke))
        (&(objectClass = PSDSales) (productId = beverages:pepsi:diet pepsi)))

Intuitively, the meaning of the value semijoin operator can be understood as follows. With each entry that satisfies the condition “(&(objectClass = PSDSales) (productId = beverages:coca cola:diet coke))” (the third argument to the 4-ary v operator), associate all the entries that satisfy the condition “(&(objectClass = PSDSales)(productId = beverages:pepsi:diet pepsi))” (the fourth argument to the 4-ary v operator), and whose date attributes are equal (the first argument to the 4-ary v operator). Against each such association, the aggregate selection “($1.totalSales > sum($2.totalSales))” is tested, and only those Diet Coke entries whose totalSales is larger than that of the Diet Pepsi entries of the same date are returned. Essentially, value semijoins are similar to structural aggregate selection operators except that the association between entries is based on an arbitrary value-based condition, instead of a hierarchical structural association.
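The intuitive semantics given for the c, d and v operators can be sketched over an in-memory information tree as follows. This is one possible reading, not the authors' implementation: the function names (structural_select, value_semijoin, has, total) are hypothetical, attributes are single-valued for brevity, conditions and aggregates are Python callables rather than parsed filter strings, and the sales figures other than the 103000.00 example of Sect. 3.1 are invented. The state-level threshold is lowered to 150000 so that the tiny data set produces a non-empty answer.

```python
class Node:
    """One entry in the information tree (single-valued attributes for brevity)."""

    def __init__(self, attrs, parent=None):
        self.attrs = attrs
        self.children = []
        if parent is not None:
            parent.children.append(self)


def children(node):
    return node.children


def descendants(node):
    for child in node.children:
        yield child
        yield from descendants(child)


def subtree(node):
    yield node
    yield from descendants(node)


def has(**wanted):
    """Condition: every listed attribute is present with the given value."""
    return lambda n: all(n.attrs.get(k) == v for k, v in wanted.items())


def total(entries):
    return sum(e.attrs["totalSales"] for e in entries)


def structural_select(base, relatives, aggregate, cond1, cond2):
    """Ternary c/d operator: `relatives` is `children` or `descendants`."""
    selected = []
    for first in subtree(base):
        if cond1(first):
            related = [r for r in relatives(first) if cond2(r)]
            if aggregate(first, related):
                selected.append(first)
    return selected


def value_semijoin(base, join, aggregate, cond1, cond2):
    """4-ary v operator: associate entries by a value condition, not by structure."""
    firsts = [n for n in subtree(base) if cond1(n)]
    seconds = [n for n in subtree(base) if cond2(n)]
    return [f for f in firsts
            if aggregate(f, [s for s in seconds if join(f, s)])]


# Hypothetical miniature warehouse: organization > region > state > store > sales.
COKE, PEPSI = "beverages:coca cola:diet coke", "beverages:pepsi:diet pepsi"
root = Node({"objectClass": "organization"})
region = Node({"objectClass": "regionUnit"}, root)
nj = Node({"objectClass": "stateUnit"}, region)
store_a = Node({"objectClass": "store", "storeId": 7564312}, nj)
store_b = Node({"objectClass": "store", "storeId": 1}, nj)


def sale(store, product, date, amount):
    return Node({"objectClass": "PSDSales", "productId": product,
                 "date": date, "totalSales": amount}, store)


coke_97 = sale(store_a, COKE, "year=1997", 103000.00)
pepsi_97 = sale(store_a, PEPSI, "year=1997", 90000.00)
coke_98 = sale(store_a, COKE, "year=1998", 80000.00)
pepsi_98 = sale(store_a, PEPSI, "year=1998", 95000.00)
coke_97_b = sale(store_b, COKE, "year=1997", 50000.00)

coke_1997 = has(objectClass="PSDSales", date="year=1997", productId=COKE)

# Sect. 6.1: binary c is the ternary operator with count($2) > 0.
big_stores = structural_select(
    root, children, lambda first, rel: len(rel) > 0, has(objectClass="store"),
    lambda n: coke_1997(n) and n.attrs["totalSales"] > 100000)

# Sect. 6.2: ternary d with a sum aggregate (threshold lowered for the toy data).
states = structural_select(
    root, descendants, lambda first, rel: total(rel) > 150000,
    has(objectClass="stateUnit"), coke_1997)

# Sect. 6.3: Diet Coke entries outselling Diet Pepsi on the same date.
winners = value_semijoin(
    store_a,
    lambda f, s: f.attrs["date"] == s.attrs["date"],
    lambda f, rel: f.attrs["totalSales"] > total(rel),
    has(objectClass="PSDSales", productId=COKE),
    has(objectClass="PSDSales", productId=PEPSI))
```

Each operator returns a plain list of entries, which is what allows marking expressions to be composed, with the result of one operator serving as input to another.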
7. Discussion
The modern hierarchical data model can be understood in terms of two key aspects: information, where the unit of representing information about individual entities or relationships is the entry, which is a set of attribute-value pairs; and structure, where each entry has a distinguished name, and the namespace of DNs is hierarchically organized. The considerable flexibility of this data model arises for three distinct reasons.

• The hierarchical namespace provides easy partitioning of the set of entries into subsets (rooted sub-trees) that can be managed autonomously; a simple flat organization of data does not provide this benefit.

• Different entries of an object class may contain different optional attributes; one organizationalEmployee entry may specify values for its surName, mail and fax attributes, another organizationalEmployee entry may specify values for each of the attributes surName, givenName and mail, while a third organizationalEmployee entry may specify values only for its mandatory attributes surName and mail.

• An entry can have multiple values for its objectClass attribute; this permits the entry to have values for attributes in the union of the schemas of its object classes, without requiring a single object class to define this union of attributes as its schema. For example, in a directory containing information about objects in the network, where the object classes ftpServer, httpServer, ldapServer, ntServer and unixServer have been specified, it is extremely easy to model servers that understand any subset of the three protocols, and use one of the two operating systems. Doing so in an OO database system would be much more cumbersome.

Standard LDAP provides a very limited query language against a flexible hierarchical data model. Such a query language may be adequate for most simple directory look-ups, but is far from adequate for advanced applications.
We have presented here a language with much greater expressive power, yet with manageable computational cost. The new query language proposed is a far cry from the navigational access of days of yore, and overcomes what we believe were the major limitations of the hierarchical data model of the past: the absence of a declarative query language, and the lack of a suitable abstraction. We thus have a data model that is far more flexible and expressive than the flat relational model, yet enjoys the same benefits of an effective declarative query language. Unlike nested relational or object-oriented schemes, we continue to have a very simple
(and fixed) data model, that is extremely popular in many applications. Finally, on top of all the benefits discussed above, the proposed hierarchical data model has the additional virtue of being extremely amenable to a partitioning of management. Multiple autonomous entities can manage the schema appropriately in their parts of the system, while still permitting global queries to remain oblivious to needless distinctions.
References
[1] Cisco, “Directory enabled networks,” Available from http://www.cisco.com/warp/public/734/den/.
[2] C.J. Date, “An introduction to database systems,” Addison-Wesley, Reading, MA, 1981.
[3] J. Gray, A. Bosworth, A. Layman, and H. Pirahesh, “Datacube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals,” Proc. IEEE International Conference on Data Engineering, pp.152–159, 1996. Also available as Microsoft Technical Report MSR-TR-9522.
[4] T. Howes and M. Smith, “LDAP: Programming directory-enabled applications with lightweight directory access protocol,” Macmillan Technical Publishing, Indianapolis, Indiana, 1997.
[5] IBM, White Plains, NY, “IMS/VS: Application programming reference manual,” Publication SH20-9026, 1978.
[6] W.H. Inmon and C. Kelley, “Rdb/VMS: Developing the data warehouse,” QED Publishing Group, Boston, MA, 1993.
[7] G. Jacobson, B. Krishnamurthy, D. Srivastava, and D. Suciu, “Focusing search in hierarchical structures with directory sets,” Proc. Seventh International Conference on Information and Knowledge Management (CIKM), Washington, DC, Nov. 1998.
[8] H.V. Jagadish, “Incorporating hierarchy in a relational model of data,” Proc. ACM SIGMOD Conference on Management of Data, pp.78–87, Portland, OR, 1989.
[9] H.V. Jagadish, “The INCINERATE data model,” ACM Trans. Database Syst., vol.20, no.1, pp.71–110, March 1995.
[10] H.V. Jagadish, M.A. Jones, D. Srivastava, and D. Vista, “Flexible list management in a directory,” Proc. Seventh International Conference on Information and Knowledge Management (CIKM), Washington, DC, Nov. 1998.
[11] U. Manber and S. Wu, “GLIMPSE: A tool to search through entire file systems,” Usenix Winter 1994 Technical Conference, pp.23–32, San Francisco, CA, Jan. 1994.
[12] J. Melton and A.R. Simon, “Understanding the new SQL: A complete guide,” Morgan Kaufmann, San Francisco, CA, 1993.
[13] Microsoft, “Active directory,” Available from http://www.microsoft.com/ntserver/basics/future/activedirectory/.
[14] J.D. Ullman, “Principles of Database Systems,” Computer Science Press, 1982.
[15] J.D. Ullman, “Principles of Database and Knowledge-Base Systems, Volumes I and II,” Computer Science Press, 1989.
[16] C. Weider, J.K. Reynolds, and S. Heker, “Technical overview of directory services using the X.500 protocol,” Request for Comments 1309. Available from ftp://ds.internic.net/rfc/rfc1309.txt, March 1992.
[17] J. Widom, “Research problems in data warehousing,” Proc. Fourth International Conference on Information and Knowledge Management (CIKM), pp.25–30, Baltimore, MD, Nov. 1995.
[18] S. Zdonik and D. Maier, “Readings in Object-Oriented Database Systems,” Morgan-Kaufmann, 1990.
H.V. Jagadish received his Ph.D. from Stanford University in 1985, and since then has been with AT&T/Bell Labs. He currently heads the database research department at AT&T Labs. Beginning January 1999, he will be a Professor of Computer Science at the University of Illinois, Urbana-Champaign. Jagadish has over 65 major publications and 30 patents to his credit, and has previously served as an Editor of ACM TODS (1992–1995) and Program Chair for the SIGMOD Conference (1996).
Laks V.S. Lakshmanan obtained his Bachelor of Engineering (1981) in Electronic and Communications from the A.C. College of Engineering and Technology, Karaikudi, India, and his Master of Engineering (1983) and Ph.D. (1987) in Computer Science from the Indian Institute of Science, Bangalore, India. He was awarded the Witold Lipski Memorial Best Student Paper Prize at the International Conference on Database Theory, Rome, Italy, September 1986, and the Gold Medal for the Best Doctoral Dissertation in Electrical Science Division at the Indian Institute of Science, Bangalore (1990). He was a postdoctoral fellow in the Department of Computer Science and Computer Systems Research Institute, at the University of Toronto, Canada, during 1987–1989. Currently he is an Associate Professor of Computer Science at Concordia University, Montreal, Canada. His research interests span a wide spectrum of topics in Database Systems and related areas, including: relational and object-oriented databases, advanced data models for novel applications, OLAP and data warehousing, database mining, data integration, and querying the WWW. A common theme underlying his research is to model problems not traditionally viewed as standard database problems and bring database technology to bear on them, thus pushing the frontiers of database technology. He collaborates widely with both industry and academia the world over. He has been a consultant to H.P. Labs, Palo Alto, CA, and AT&T Labs Research, Florham Park, NJ, and a visiting faculty at the Limburg Universitair Centrum, Limburg, Belgium, IASI, CNR, Rome, Italy, and the Indian Institute of Science, Bangalore, India. His research is funded by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Network of Centres of Excellence/Institute of Robotics and Intelligent Systems Phase 3 (NCE/IRIS-3).
Divesh Srivastava received his Ph.D. from the University of Wisconsin, Madison, in 1993, and since then has been with AT&T/Bell Labs. He is currently in the database research department at AT&T Labs, and has several publications and patents to his credit. His current research interests include directories, and directory enabled applications.