RStar: An RDF Storage and Query System for Enterprise Resource Management

Li Ma, Zhong Su, Yue Pan, Li Zhang, Tao Liu
IBM China Research Laboratory, No. 7, 5th Street, ShangDi, Beijing, 100085, P.R. China

Emails: {malli, suzhong, panyue, lizhang, liutao}@cn.ibm.com

ABSTRACT

Modern corporations operate in an extremely complex environment and depend heavily on information resources of all kinds across the enterprise. As an enterprise grows, however, its information resources become not only heterogeneous but also distributed across physically separate systems and databases. How to exploit information across the enterprise effectively is becoming a critical but hard problem. In recent years, metadata, the detailed description of data, has been used to exploit information resources on the web efficiently. The World Wide Web Consortium (W3C) recommends the Resource Description Framework (RDF) as a standard for defining and using metadata descriptions of web resources. In this paper, we present RStar, an RDF storage and query system for enterprise resource management. RStar uses a relational database as its persistent data store and defines the RStar Query Language (RSQL) for resource retrieval. Most existing RDF storage and query systems have been evaluated only on small data sets, and no detailed performance analysis has been given for them. We therefore conduct extensive experiments on a large-scale data set to investigate performance problems in RDF storage. This analysis should help in designing RDF storage and query systems and in understanding the open issues in RDF-based enterprise resource management. We also present experiences and lessons learned from our implementation to inform further research and development.

Categories and Subject Descriptors
H.3.4 [Information Systems]: Systems and Software - Performance evaluation. H.3.2 [Information Systems]: Information Storage. H.3.3 [Information Systems]: Information Search and Retrieval - Query formulation.

General Terms
Design, Performance, Experimentation, Management

Keywords
Ontology, metadata, resource management, RDF storage, RDF query language.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CIKM’04, November 8-13, 2004, Washington D.C., U.S.A. Copyright 2004 ACM 1-58113-874-1/04/0011...$5.00.

1. INTRODUCTION

Modern corporations operate in an extremely complex environment and depend heavily on information resources of all kinds across the enterprise. As an enterprise grows, however, its information resources, such as customer information and archived documents, become not only heterogeneous but also distributed across different systems and databases. How to exploit resources across the enterprise effectively is becoming a critical but hard problem. In recent years, metadata, the detailed description of data, has been used to exploit information resources on the web efficiently. The explicit use of metadata lets machines process and understand information more easily. Metadata management has gained increasing attention with the development of the semantic web since the late 1990s [1,2]. Based on extensive discussions among many dedicated researchers and engineers, the World Wide Web Consortium (W3C) recommends the Resource Description Framework (RDF) as a standard for defining and using metadata descriptions of web resources [3-5]. The objective of RDF is to support the interoperability of metadata across different resource description communities. RDF-based metadata management gives the enterprise a unified and powerful approach to locating, interpreting and transforming resources distributed across physically separate systems.

To illustrate briefly how metadata represented in RDF facilitates enterprise resource management, we take as an example the Market Intelligence Portal (MIP), a project currently under way at IBM Research. In modern enterprises, information about documents is abundant but scattered across different databases and poorly integrated. Using such information effectively can optimize business processes and thus yield considerable productivity gains for individuals and the enterprise. The MIP aims to provide a federated, digital representation of documents and to enable virtual document management in the enterprise. The first generation of the MIP focuses on collecting and classifying documents from both the internet and the enterprise [23]. We are currently attempting to use an ontology to characterize documents and their relationships for more efficient management in the next generation of the MIP. More precisely, we first construct an ontology to characterize the digital attributes of a document, such as its authors, date, named entities and interested users. These digital attributes are in essence the metadata of a document. The RDF representation of all documents based on the defined ontology is a huge graph that maintains the relationships among documents and provides information on how to use and manage related resources. The RDF graph can be regarded as a proxy and serves as a set of control points for the capture, management and use of document information. This will facilitate many operations in the enterprise, such as personalized e-learning.

Based on our experience in the Market Intelligence Portal project, we suggest a three-step method for enterprise resource management using the RDF representation:

(1) Build a powerful ontology to characterize enterprise resources. The ontology not only defines the attributes of each resource class but also describes the relationships among resource classes.
(2) According to the defined ontology, extract the metadata of resources and build an RDF graph to represent them.
(3) Store the RDF graph describing enterprise resources and provide methods to access it. Build high-level applications based on the RDF graph.

The first two steps correspond to the problems of ontology building and of mapping instance data to an ontology, respectively; see [1,2] for details. Since the RDF graph is accessed frequently in real applications, storing and retrieving a huge RDF graph is not trivial. This paper focuses on this key problem.

RDF has a simple but powerful model for describing resources, consisting of a collection of statements [3]. Each statement is a triple comprising a subject S, a property P and an object O. The subject denotes a resource that is uniquely identified by a uniform resource identifier (URI). The property expresses the relationship between the subject and the object. The object can be a resource or a literal (a text string). A statement (S, P, O) asserts that resource S has a property P whose value is O. With this triple model, RDF provides an unambiguous method for conveying semantics in a machine-readable encoding. In essence, the RDF model is a labeled directed graph in which each property is a labeled arc connecting two nodes, the subject and the object. Figure 1 presents a simple example of an RDF graph that includes both ontology and instance data. From this graph, one finds that a paper about DB2 is archived in db10 and thus knows where to retrieve the original document. Likewise, someone who wants the details of a project on RDF storage can learn from the graph to contact john. In addition, RDF Schema (RDFS), RDF's vocabulary description language [4], provides a powerful vocabulary for characterizing the class (a group of resources) and property hierarchies in an RDF graph. More details on RDF and RDFS can be found in [3,4].

Figure 1. A simple RDF graph (See last page)

For an enterprise, it is crucial to store and retrieve the RDF graph, represented as triples, efficiently [2,6], because an enterprise always holds a large number of varied resources. Poor performance will necessarily limit the wide use of RDF for enterprise resource management. We now briefly review existing work on RDF storage and query.

1.1 Related Work

Most existing RDF storage systems use relational or object-relational database management systems as backend stores [6-18]. This is a straightforward approach, since RDF triples map naturally onto a relational table of three columns and the relational DBMS (RDBMS) has been well studied. Real applications often contain millions of triples. How to design the database schema and how to optimize related operations to speed up data retrieval are therefore of great importance for RDF storage and deserve particular attention. In [6], Beckett surveyed the features, maturity levels and details of existing RDF triple stores but did not give a detailed quantitative analysis of system performance. He indicated that the performance of an RDF storage system can be optimized for a specific application using prior knowledge. Beckett and Grant further reported on the database schema designs of existing RDF storage systems in [7]. Alexaki et al. [13] showed that storing RDF triples in separate property tables (all triples in a property table share the same property) and class tables (each class table keeps all instances belonging to the same class) outperforms storing all triples in a single table. However, this may create too many tables and add significant overhead to the response time of all queries when an RDF graph contains many classes and properties. Harris et al. [15] pointed out that existing RDF storage systems are not efficient enough for large-scale data sets; in their 3store system, they used hashing to speed up triple loading and retrieval. Reynolds tested the performance of Jena on a small data set (RDF test files of 20k to 200k in size) [9]. Some valuable observations were obtained, but he explicitly noted that they need to be verified on a larger data set.

These efforts have brought a great deal of progress in RDF storage and query. However, performance remains a critical problem. Given the increasing number of enterprise resources, the performance of RDF storage and query systems on large-scale data sets needs further investigation, and a detailed, quantitative performance analysis on such data sets is called for. Such analysis helps in database schema design and in understanding the open issues in RDF-based enterprise resource management.

1.2 Outline

We have designed and developed RStar, an RDF storage and query system for enterprise resource management. RStar uses an RDBMS as its persistent data store and defines the RStar Query Language (RSQL) for retrieval. Using RStar, we conduct a series of performance experiments on the MusicBrainz RDF data set [22] and discuss the overall experimental results in detail. From this analysis we expect to draw conclusions that are useful for the design and development of RDF storage and query systems. We also present experiences and lessons learned from our implementation to inform further research and development.

The remainder of this paper is organized as follows. Section 2 provides an overview of the RStar system for RDF storage and query. Section 3 presents the experiments, results and detailed discussion. Section 4 concludes the paper.

2. OVERVIEW OF RSTAR

We have designed and developed RStar, an RDF storage and query system for enterprise resource management. Figure 2 shows the client-server architecture of RStar. RStar provides web service functions and is developed in Java. As Figure 2 shows, the function of the client is relatively simple. To search for information, the client sends a query request in RSQL form to the server. After the server finishes the query, the client deserializes the results from XML and presents them to the user. The core components of RStar, namely the backend data store, the data loader and the RSQL query engine, are implemented in the server. The key issues related to these components are addressed in the following subsections.

Figure 2. Overview of RStar

2.1 RStar Storage Design

How to store the RDF graph in a repository is at the heart of an RDF-based resource management system and to a certain extent determines system performance. Because RStar is intended for enterprise resource management, and hence for a large amount of instance data, a database is a better choice for storage than a file system. The persistent store of RStar is a popular relational database, IBM DB2. This section presents our database schema design in detail.

As mentioned in Section 1, we need to construct an ontology to describe the abstract relationships among enterprise resources. A single ontology can characterize millions of instance data (resources), so it plays an especially important role in RDF query. RStar therefore stores ontology information and instance data in different tables, separating the ontology from the instance data. This is expected to make ontology navigation more effective and instance data retrieval faster. Figure 3 details the database schema of RStar. Ontology information representing classes, the class hierarchy, properties and the property hierarchy is captured by the tables Class, SubClass, Property, SubProperty and Property-Class. A table named InstanceOfClass stores the instances of all classes and can be regarded as a bridge between ontology and instance data. The URIs of RDF resources are expressed in the form (namespace + localname), and RDF literals are strings of varying length. Thus, in RStar, each resource and each literal is allocated a unique integer ID and stored in the table Resources or Literals, respectively, both to save storage space and to speed up data retrieval. As mentioned earlier, the RDF graph can be described by a set of triples. Accordingly, the table Triples is created to hold all instance triples (not including ontology triples); it has three columns, namely SubID, PreID and ObjID. Since the table Triples may hold millions of triples (it is generally the largest table) and is frequently accessed, how to build indices on it for efficient retrieval deserves further investigation. Experimental results and our analysis of this problem are given in the next section.

Figure 3. Database schema of RStar

2.2 Data Loader

The data loader takes RDF/XML files as input and provides both the original and the inferred triples to the backend database. These tasks are carried out by three components: the RDF parser, the triple importer and the inference engine. The RDF parser analyzes the statements of an RDF/XML file according to the RDF syntax specification [3] and passes the resulting triples to the triple importer, which then inserts them into the backend database. Several open source RDF parsers exist, such as Another RDF Parser [14], the parser in the Raptor toolkit [11] and the Validating RDF Parser [13]. For RStar, we developed an efficient RDF parser based on the Simple API for XML (SAX), the first widely adopted API for XML in Java. Importing triples into the database is relatively simple but not trivial. In our implementation, JDBC prepared statements are used for efficient data importing. Defining a frequently executed SQL statement as a prepared statement object normally reduces the running time, since the prepared statement is precompiled before execution. Our experimental results showed that prepared statements reduce the load time by a factor of about two.

RDF is an assertional language used to express propositions with precise formal vocabularies, particularly those specified by RDFS [4]. RDF semantics, based on formal model theory [5], is embodied in a set of inference rules. For instance, if A is a subclass of B and B is a subclass of C, then A is a subclass of C. The inference engine of RStar generates entailments, namely the inferred triples, from the original triples according to the inference rules described in the RDF semantics specification [5]. At present, the inference engine is triggered when triples are imported into the database rather than when a query is evaluated. The triples generated by the inference engine are inserted into the database together with the original triples. This unavoidably increases the number of stored triples several-fold. However, the cost of processing a query is reduced because the inferred triples to be retrieved have been generated in advance. In our implementation, we considered the following two points for effective inference.

- Construct a temporary triple table to facilitate inference. Since some inference rules have two or more preconditions, searching for triples that satisfy such rules by traversing millions of triples is not desirable. By contrast, a simple SQL query on a temporary triple table retrieves the relevant triples faster and makes the inferred triples easier to create. This takes full advantage of the built-in optimization methods of RDBMSes.
- Choose the correct order in which to apply the inference rules, since the results of some rules may affect other inferences. An incorrect order may result in incomplete inference.
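The storage and loading ideas of Sections 2.1 and 2.2 can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not RStar's implementation: it uses Python's sqlite3 module in place of Java, JDBC and DB2, keeps ontology and instance triples in a single Triples table for brevity, omits the Literals table, and the names build_store, infer_subclass_closure and SubClassTmp are hypothetical. It shows integer ID allocation for URIs, a parameterized bulk insert (the analogue of a prepared statement), and the subclass transitivity rule evaluated with a SQL self-join on a temporary table.

```python
import sqlite3

def build_store(triples):
    """Load (subject, property, object) URI triples into an ID-based schema."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE Resources (ID INTEGER PRIMARY KEY, URI TEXT UNIQUE);
        CREATE TABLE Triples (SubID INTEGER, PreID INTEGER, ObjID INTEGER,
                              UNIQUE (SubID, PreID, ObjID));
    """)
    # Allocate one integer ID per distinct URI.
    uris = sorted({term for triple in triples for term in triple})
    db.executemany("INSERT INTO Resources (URI) VALUES (?)", [(u,) for u in uris])
    ids = dict(db.execute("SELECT URI, ID FROM Resources"))
    # Parameterized bulk insert: the statement is compiled once and reused per row.
    db.executemany("INSERT INTO Triples VALUES (?, ?, ?)",
                   [(ids[s], ids[p], ids[o]) for s, p, o in triples])
    return db, ids

def infer_subclass_closure(db, subclass_id):
    """Apply 'A subClassOf B, B subClassOf C => A subClassOf C' until a fixpoint."""
    db.execute("CREATE TEMP TABLE SubClassTmp "
               "(Sub INTEGER, Sup INTEGER, UNIQUE (Sub, Sup))")
    db.execute("INSERT INTO SubClassTmp "
               "SELECT SubID, ObjID FROM Triples WHERE PreID = ?", (subclass_id,))

    def size():
        return db.execute("SELECT COUNT(*) FROM SubClassTmp").fetchall()[0][0]

    before = -1
    while before != size():
        before = size()
        # One self-join finds every pair of rows matching both preconditions.
        db.execute("""
            INSERT OR IGNORE INTO SubClassTmp
            SELECT a.Sub, b.Sup
            FROM SubClassTmp a JOIN SubClassTmp b ON a.Sup = b.Sub
        """)
    # Store inferred triples next to the original ones, skipping duplicates.
    db.execute("INSERT OR IGNORE INTO Triples SELECT Sub, ?, Sup FROM SubClassTmp",
               (subclass_id,))
    db.execute("DROP TABLE SubClassTmp")

if __name__ == "__main__":
    db, ids = build_store([("A", "subClassOf", "B"), ("B", "subClassOf", "C")])
    infer_subclass_closure(db, ids["subClassOf"])
    print(db.execute("SELECT COUNT(*) FROM Triples").fetchall()[0][0])
```

Materializing the closure at load time trades storage for query speed, which is the trade-off the section describes; the loop terminates because each pass can only add rows to a finite set of pairs.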

2.3 RSQL Query Engine

An RDF graph describing the various resources across an enterprise is remarkably complex and includes complicated class and property hierarchies. Browsing such a huge graph directly for the desired information is troublesome and time-consuming. RStar defines a query language, RSQL, to make resource retrieval easier.

2.3.1 RSQL Query Language

RSQL has a SQL-like syntax and is generally expressed as follows:

    Select Variable_1, Variable_2, … Variable_n
    From (S_1, P_1, O_1), (S_2, P_2, O_2), … (S_n, P_n, O_n)
    Where Constraints_1, Constraints_2, … Constraints_n
    Using NS_1=Namespace_1, … NS_n=Namespace_n

The select clause specifies which variables should be returned. The from clause consists of a set of triples, called path expressions, which jointly identify specific paths in an RDF graph. Variables may appear in any position in a path expression. The where clause comprises constraints on the variables in the from clause. Constraints are combined using the logical operators and, or and not. Since the URIs that identify resources generally include long namespaces, writing queries with such URIs is inconvenient. RSQL therefore provides the using clause to define abbreviations for long namespaces. Every abbreviated URI is replaced with its full URI when the RSQL query engine parses a query. Additional functions and features of RSQL include:

- Built-in functions. The functions Typeof() and Langof() determine the data type and the language of a literal variable, respectively. The functions IsResources() and IsLiterals() verify whether a variable is a resource or a literal, respectively. These functions are used in the where clause as constraints on variables.
- Fuzzy queries on string variables. The like operation constrains a string variable to contain a specified substring. In addition to the like operation supported by the traditional DBMS, RSQL provides full text search for fuzzy queries. If the substring to be matched does not involve any wildcard (i.e., *, ?), the like operation is carried out by full text search, otherwise by the built-in string matching method of the DBMS.
- Data comparisons. RSQL supports four types of data, namely integer, float, date and string. The definition and use of these data types accord with the XML Schema specification. Supported comparisons for the first three data types include equality (=), lower than (<), greater than (>) and inequality (!=). For the string data type, equality, inequality and the fuzzy matching mentioned above are supported.
- Query using ontology knowledge. Path expressions in RSQL can be ontology triples, instance triples or both. This supports not only complex ontology navigation but also effective resource retrieval based on ontology knowledge.

It is fairly easy to use RSQL to query an RDF graph such as the one shown in Figure 1. For example, someone who wants to know who is an expert on RDF storage and where to find documents related to RDF storage can write a simple RSQL query as follows.

    Select name, directory
    From (name, NS:HasProject, V1), (V1, NS:About, V2), (V1, NS:ArchivedIn, directory)
    Where Like(V2, "RDF Storage") and IsLiterals(V2)
    Using NS:="http://www.ibm.com/example#"

2.3.2 RSQL Parser and Translator

The RSQL query engine essentially relies on the SQL query engine of the backend database to search for resources in an RDF graph. The basic principle underlying its design is therefore to push as many tasks as possible down to the database, because the well-studied relational database can perform much of the work using its built-in query evaluation and optimization mechanisms. The main task of the RSQL query engine is thus to interpret an RSQL query and translate it into a corresponding SQL query. Figure 4 shows how RSQL queries are processed.

Figure 4. RSQL query processing diagram (an RSQL query, RSQL parser, pre-optimized structure, RSQL translator, a SQL query)

The RSQL parser checks whether the submitted query accords with the syntax of the RSQL language defined in Section 2.3.1. A valid RSQL query is transformed into a pre-optimized structure, which consists of a list of parsed triples and a condition tree reflecting the logical relationships of the constraints in the where clause. Using some simple optimization rules, the RSQL translator assembles the pre-optimized structure into a single SQL query. Because a query generally combines ontology information and instance data, and because RStar uses multiple tables to store an RDF graph, we have to determine which tables are needed to construct the SQL query. Simple optimization rules that do not depend on a specific domain are useful and important here. For example, if a variable appears in the subject position of a triple, the variable is a resource and should be retrieved from the table Resources rather than the table Literals. Once a SQL query is generated, it is fed to the database query engine for evaluation.

Unlike RSQL, some existing RDF query languages, such as RQL [19] and SeRQL [21], decompose a query into a set of basic subqueries (or units) for evaluation and obtain the final results by combining the results of the subqueries. This is reasonable but misses an opportunity to make effective use of the SQL query engine of the database. In RSQL, we avoid optimization rules that suit only a specific database, so that RStar can easily be ported to other RDBMSes.

3. EXPERIMENTS

As mentioned in Section 1, performance is a critical problem in RDF storage and query, especially for enterprise resource management, where a large number of resources must be processed. Some factors, such as the indices on the table Triples, can affect system performance. We conduct a series of experiments to study the performance of RStar and to understand which factors seriously affect it. This section presents our experiments and discusses the overall results in detail.
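To make the translation step of Section 2.3.2 concrete, the example query of Section 2.3.1 can be turned into a single SQL self-join. The following is a minimal sketch under simplifying assumptions, not RStar's actual translator: it assumes one table Triples(S, P, O) holding plain strings rather than the ID-based multi-table schema of Section 2.1, it handles only path expressions and like constraints, and the function name translate and its parameters are hypothetical. Each path expression becomes one alias of Triples, constants become bound parameters, and a repeated variable becomes a join condition.

```python
import sqlite3

def translate(select_vars, patterns, variables, like=None):
    """Build one SQL self-join over Triples(S, P, O) from RSQL-style path expressions.

    patterns  -- list of (subject, property, object) terms
    variables -- set of terms that are variables; all other terms are constants
    like      -- optional {variable: substring} constraints
    """
    where, params, first_use = [], [], {}
    for i, pattern in enumerate(patterns):
        for column, term in zip("SPO", pattern):
            ref = f"t{i}.{column}"
            if term not in variables:
                # Constant: compare the column with a bound parameter.
                where.append(f"{ref} = ?")
                params.append(term)
            elif term in first_use:
                # Variable seen before: this is a join condition.
                where.append(f"{ref} = {first_use[term]}")
            else:
                first_use[term] = ref
    for var, substring in (like or {}).items():
        where.append(f"{first_use[var]} LIKE ?")
        params.append(f"%{substring}%")
    columns = ", ".join(f"{first_use[v]} AS {v}" for v in select_vars)
    tables = ", ".join(f"Triples t{i}" for i in range(len(patterns)))
    sql = f"SELECT {columns} FROM {tables}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql, params

if __name__ == "__main__":
    sql, params = translate(
        ["name", "directory"],
        [("name", "NS:HasProject", "V1"),
         ("V1", "NS:About", "V2"),
         ("V1", "NS:ArchivedIn", "directory")],
        {"name", "V1", "V2", "directory"},
        like={"V2": "RDF Storage"},
    )
    print(sql)
    print(params)
```

Producing one SQL statement, rather than evaluating each path expression separately, leaves join ordering and index selection to the database optimizer, which is the design principle stated in Section 2.3.2. Variable names are used directly as column aliases here, so a real translator would need to validate them first.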

3.1 Experimental Conditions

MusicBrainz [22] is a user-maintained music meta-database whose metadata includes the name of an artist, the name of an album and the list of tracks that appear on an album. At present, the RDF schema (i.e., ontology) of MusicBrainz includes 20 classes and 10 properties. The instance data (i.e., resources) consist of 97,400 artists, 156,212 albums, 1,936,298 tracks and 2,284,346 TRMs. Because of its very large scale (a complete RDF file is over 1G in size), we choose MusicBrainz as the testbed for investigating performance characteristics. To meet the requirements of the experiments, we construct a family of subsets of the MusicBrainz database. Table 1 lists the number of triples and the number of resources in each subset.

Table 1. The size of test data sets

File size    No. of triples    No. of resources
500K         5,445             5,249
1M           11,390            10,386
5M           62,991            49,410
10M          140,381           94,002
50M          768,411           414,208
100M         1,499,304         731,307
200M         2,923,830         1,414,533
500M         7,282,676         3,514,836

All experiments were run on a 2.4GHz PC with 1G of physical memory. The operating system is Windows 2000 Professional and the backend database is DB2 UDB 8.0. Table 2 lists the representative RSQL queries used in our experiments together with their descriptions. Each query was run 10 times and the average query time was recorded. Owing to space limitations, the namespaces of resources are omitted in Table 2.

Table 2. Representative queries for performance test (See last page)

3.2 Results

With RStar as an example, we conducted a series of experiments to explore how to improve the performance of RDF storage and query systems as much as possible. The results are reported below.

Database schema design

As mentioned in Section 2, database schema design plays an important role in RDF storage and query. In RStar, ontology information and instance data are stored separately so that class and property hierarchies are easier to capture. Such a design is expected to accelerate resource retrieval. By comparison, a basic schema stores all RDF triples in a single table. We evaluated our schema and the basic schema on a 50M RDF file containing 414,208 resources.

Table 3. Comparative results of schema design

Query           Q3       Q5        Q6       Q11     Q12     Q13
Basic Schema    732.7    1258.4    719.6    30.6    31.2    837.8
RStar Schema    412.5    647.6     392.7    7.3     6.2     568.3

As Table 3 demonstrates, the performance difference between our schema and the basic one is significant. For queries involving only ontology information (Q11 and Q12), the advantage of our design is even more pronounced. The comparative results indicate that the ontology describing the abstract hierarchies of resources is very useful for retrieval and should be processed separately.

Index building on the table Triples

In real applications, the table Triples is the most frequently accessed table and may include millions of statements (it is generally the largest table). How to index this table efficiently is therefore not trivial. A set of typical queries that involve complex operations on the table Triples was created for the test.

Table 4. Experimental results on indexing table Triples

Query    SPO       (S P) O    (S P) (O)    S (P O)    (S) (P) (O)    (S P O)
Q1       868.3     706.0      730.1        850.2      356.5          834.2
Q2       563.9     56.1       40.1         551.8      36.1           45.1
Q3       1173.7    750.1      838.2        1157.6     412.5          949.4
Q4       561.8     40.1       33.0         549.8      27.0           37.0
Q5       1399.0    859.2      893.1        1391.0     647.6          898.3
Q6       1078.9    746.3      820.1        1011.9     392.7          914.5

The above experiments use a 50M RDF file, a subset of the MusicBrainz database. Indices are built on the columns Subject, Property and Object and on their combinations. In Table 4, characters enclosed in a pair of parentheses denote an index built on the corresponding column or columns. For instance, "(S P) O" means a joint index on the columns Subject and Property and no index on the column Object. Table 4 demonstrates that building an independent index on each column yields the minimum query time and is superior to the other index schemes. We use this index scheme in the following series of experiments.

Fuzzy queries on string variables

When writing a query, we often forget the full name of a resource and remember only several isolated words. This leads to extensive use of string matching in information retrieval. The like operation supported by current DBMSes is known to be very time-consuming in comparison with other operations. Our observations show that, in most cases, a query uses a complete word such as "Java" rather than part of a word such as "Jav". We therefore expect full text search to speed up fuzzy queries. Table 5 gives comparative results, where "FTS" denotes full text search and "Like" stands for the traditional like operation.

Table 5. Comparative results of fuzzy queries using full text search and the like operation

Query    Q1        Q5        Q7       Q8        Q9        Q10
FTS      356.5     647.6     15.7     1482.8    503.1     779.7
Like     1978.2    2821.4    162.5    4834.4    2700.0    3089.0

The above experiments were performed on a 50M RDF subset of MusicBrainz. The results show that full text search is generally 4-5 times faster than the like operation supported by the DBMS (DB2 in our experiments). For Q7, which constrains a variable not to include a given string, the advantage of full text search is especially clear: it is about 10 times faster than the like operation. It should be noted, however, that an inverted index must be built to support full text search, which inevitably sacrifices disk space for more efficient retrieval. In RStar, full text search is provided for fuzzy queries in addition to the like operation supported by traditional DBMSes. If the substring to be matched does not involve any wildcard (i.e., *, ?), the like operation is carried out by full text search, otherwise by the built-in string matching method of the DBMS.

Performance changes with the increasing size of test data RStar is designed and developed for enterprise resource

488

results show that RStar is able to effectively process large scale data sets and is suitable for enterprise resource management.

management. This means that it should be able to process large scale data sets with high performance. Figure 5 is the query time vs triple size curve which shows the overall performance of RStar. To clearly observe the effects of the data set size on system performance, we divide queries in Table 2 into three classes, namely queries only for ontology (Q11-Q12), queries only for instance data (Q1-5) and mixed queries (Q6,Q9-10,Q13), and record their average response time respectively.

Aiming to use metadata represented by the RDF for enterprise resource management, we discuss how to design an RDF storage and query system and investigate some general optimization schemes for such systems. Our experiences and experiments show that it is feasible for an enterprise to adopt RDF based metadata management to organize and use various resources since the current RDF systems such as RStar have an acceptable performance. To widely apply RDF resource management systems in the enterprise, however, some problems need to be further explored. It is really true that an RDF storage and query system can be optimized by analyzing the results generated in running phase. For example, observing the running results, we may find that some properties of a specific class have been often retrieved together. Then, a new table can be created to store these related properties for fast retrieval. That is, we can analyze the running results such as the query log to obtain specific knowledge for further system optimization [16]. This is a valuable research issue in the near future. An RDF graph is a resource-centric graph. Compared with SQL like RDF query languages, an object-oriented method for resource management is desirable (a resource can be regarded as an object). This will make developers to understand and implement RDF based high-level applications more easily. Currently, we are designing and developing object-oriented APIs to encapsulate RSQL for easier use.

Figure 5. Performance Curve

The axes of Fig. 5 use a logarithmic scale to improve the legibility of the plots. Fig. 5 shows that the response time of ontology queries is extremely stable, because the ontology does not change as the instance data grows. The results also emphasize the benefit of separating the ontology from instance data. Observing the plots of the other two kinds of queries, we find that the response time is not in direct proportion to the size of the data set. In particular, the average response time for mixed queries is only about 920ms on a 500M RDF file containing 7,282,676 triples. The high performance of RStar makes it suitable for storing and managing large-scale enterprise resources.

As mentioned in Section 1, ontology building is an important step in constructing an RDF-based enterprise resource management system. Generally, an ontology is built by enterprise experts who are very familiar with the related enterprise resources and know how to make effective use of them. In practice, brainstorming discussions are also helpful for ontology building. With the fast development of an enterprise, both the ontology describing the abstract relationships among resources and the resources themselves may change frequently. Therefore, another problem that should be carefully investigated is how to dynamically update an RDF graph at minimum cost. In fact, we are encountering such a problem in our project Market Intelligence Portal. The issues of ontology maintenance and RDF data transformation need more effort.

3.3 Discussions

Based on the above results and analysis, we can draw a number of conclusions and also identify some issues that need further investigation. In RStar, ontology information and instance data are stored separately in the backend database. Experiments have verified that this separation is efficient for retrieval.
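As a rough illustration of the separation described here, together with the per-component indexing scheme (S, P, O) that this section goes on to recommend, the following sketch uses SQLite as a stand-in for the relational backend. The table names, index names and sample data are hypothetical and do not reproduce RStar's actual schema.

```python
import sqlite3


def build_store():
    """Create an in-memory store with separate ontology and instance tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Hypothetical layout: ontology and instance triples kept apart.
        CREATE TABLE ontology_triples (s TEXT, p TEXT, o TEXT);
        CREATE TABLE instance_triples (s TEXT, p TEXT, o TEXT);
        -- One independent index per triple component.
        CREATE INDEX idx_inst_s ON instance_triples (s);
        CREATE INDEX idx_inst_p ON instance_triples (p);
        CREATE INDEX idx_inst_o ON instance_triples (o);
        """
    )
    return conn


def load_example(conn, n_instances=1000):
    """Insert two ontology triples and n_instances illustrative instance triples."""
    conn.executemany(
        "INSERT INTO ontology_triples VALUES (?, ?, ?)",
        [
            ("#TypeLive", "rdfs:subClassOf", "#Type"),
            ("#TypeAlbum", "rdfs:subClassOf", "#Type"),
        ],
    )
    conn.executemany(
        "INSERT INTO instance_triples VALUES (?, ?, ?)",
        [("#artist%d" % i, "#sortName", "Name %d" % i) for i in range(n_instances)],
    )


def subclasses_of(conn, cls):
    """Ontology-only query: it never touches the instance table."""
    rows = conn.execute(
        "SELECT s FROM ontology_triples WHERE p = ? AND o = ? ORDER BY s",
        ("rdfs:subClassOf", cls),
    )
    return [row[0] for row in rows]
```

Because `subclasses_of` reads only the small ontology table, its cost in this sketch does not depend on how many instance triples have been loaded, which is consistent with the stable ontology query times reported for Fig. 5.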

4. CONCLUSIONS

In this paper, we presented an RDF storage and query system called RStar for enterprise resource management. RStar is intended to help enterprises manage various information resources using the resource description framework recommended by the World Wide Web Consortium. First, we briefly introduced the background of RDF, a standard for the definition and use of metadata descriptions of resources, and reviewed the related work. Then, we gave an overview of RStar and detailed its design, including the database schema design, the data loader, the definition of RSQL (RStar Query Language) and RSQL query processing. Extensive experiments on a large-scale data set were performed to investigate the performance problem in RDF storage. Lessons learned in our implementation were also presented throughout the paper. These can be helpful for further research and development.

As far as indexing the table Triples is concerned, building independent indices on each component of a triple (S, P, O) is a reasonable scheme. Compared with the LIKE operator supported by traditional DBMSes, full text search is a better approach to fuzzy queries and can improve system performance significantly. As the number of triples increases, the performance of RStar does not deteriorate seriously. Even on a large-scale data set, RStar still exhibits acceptable performance. In summary, the important features of RStar include an RDBMS data store, an easy-to-use RSQL query language, RDF/RDFS inference support and full text search support. Experimental


5. ACKNOWLEDGEMENTS

The authors would like to thank Bo Hu and Ming Zhao for their constructive comments and help with implementation, and Zhuo Zhang, Guotong Xie and Shixia Liu for their helpful discussions. RStar is currently publicly available at the website of IBM AlphaWorks (http://www.alphaworks.ibm.com/tech/semanticstk).

6. REFERENCES

[1] T. Berners-Lee, J. Hendler, O. Lassila, The Semantic Web, Scientific American, 2001.

[2] J. Davies, D. Fensel, F. Harmelen, Eds., Towards the Semantic Web: Ontology-driven Knowledge Management, England: John Wiley & Sons, Ltd., 2002.

[3] G. Klyne and J. Carroll, Resource Description Framework (RDF): Concepts and Abstract Syntax, W3C Recommendation, http://www.w3.org/TR/rdf-concepts/, 2004.

[4] D. Brickley and R. Guha, RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation, http://www.w3.org/TR/rdf-schema/, 2004.

[5] P. Hayes, Resource Description Framework (RDF): Semantics, W3C Recommendation, http://www.w3.org/TR/2004/REC-rdf-mt-20040210/#rdf_entail, 2004.

[6] D. Beckett, Scalability and Storage: Survey of Free Software / Open Source RDF storage systems, http://www.w3.org/2001/sw/Europe/reports/rdf_scalable_storage_report/, 2003.

[7] D. Beckett and J. Grant, Mapping Semantic Web Data with RDBMSes, http://www.w3.org/2001/sw/Europe/reports/scalable_rdbms_mapping_report/, 2003.

[8] S. Melnik, Storing RDF in a relational database, http://www-db.stanford.edu/~melnik/rdf/db.html, 2001.

[9] D. Reynolds, Jena relational database interface performance notes, http://www.hpl.hp.com/semweb/doc/RDB/rdb-performance.html, 2003.

[10] J. Broekstra, A. Kampman, F. Harmelen, Sesame: A Generic Architecture for Storing and Querying RDF and RDF Schema, Proc. of the 1st International Semantic Web Conference, pp. 54-68, 2002.

[11] D. Beckett, The design and implementation of the Redland RDF application framework, Proc. of the 10th international conference on World Wide Web, pp. 449-456, 2001.

[12] R. Guha, RDFDB: An RDF Database, http://www.guha.com/rdfdb/, 2001.

[13] S. Alexaki, V. Christophides, G. Karvounarakis, D. Plexousakis, K. Tolle, The ICS-FORTH RDFSuite: Managing Voluminous RDF Description Bases, Proc. of the 2nd International Workshop on the Semantic Web, pp. 1-13, 2001.

[14] K. Wilkinson, C. Sayers, H. Kuno, D. Reynolds, Efficient RDF Storage and Retrieval in Jena2, Proc. of the 1st International Workshop on Semantic Web and Databases, pp. 131-151, 2003.

[15] S. Harris, N. Gibbins, 3store: Efficient Bulk RDF Storage, Proc. of the 1st International Workshop on Practical and Scalable Semantic Systems, pp. 1-14, 2003.

[16] L. Ding, K. Wilkinson, C. Sayers, H. Kuno, Application-Specific Schema Design for Storing Large RDF Datasets, Proc. of the 1st International Workshop on Practical and Scalable Semantic Systems, pp. 15-28, 2003.

[17] TAP Project, Stanford University, http://tap.stanford.edu/, 2002.

[18] KAON, The Karlsruhe Ontology and Semantic Web Tool Suite, http://kaon.semanticweb.org/, 2002.

[19] G. Karvounarakis, S. Alexaki, V. Christophides, D. Plexousakis, M. Scholl, RQL: A Declarative Query Language for RDF, Proc. of the 11th international conference on World Wide Web, pp. 592-603, 2002.

[20] Hewlett-Packard Company, RDQL: RDF Data Query Language, http://www.hpl.hp.com/semweb/rdql.htm.

[21] SeRQL, Sesame RDF Query Language, http://www.openrdf.org/doc/users/ch05.html, 2003.

[22] MusicBrainz, http://www.musicbrainz.org/, 2004.

[23] Z. Su, J. Jiang, T. Liu, G. Xie and Y. Pan, Market Intelligence Portal: An entity-based system for managing market intelligence, IBM Systems Journal, vol. 43, no. 3, pp. 534-546, 2004.

Figure 1. A simple RDF graph. Namespace: http://www.ibm.com/example#. Labels in the Ontology part: Employee, Project, Paper, String, HasProject, HasPaper, About, ArchivedIn. Labels in the Instance part: John, Star Project, db10/1.pdf, "db10", "DB2", "\\c1\c:\star", with the same four edge labels.

Table 2. Representative queries for performance test

No. | Query | Description
Q1 | select artist from (artist, #sortName, name) where like(name, "Jack") | Return all instances with a specific property value
Q2 | select name, albumlist from (artist, #sortName, name), (artist, #albumList, albumlist) where artist=#fd38cb646ef4 | Return two related property values of a specified resource
Q3 | select artist, albumlist from (artist, #sortName, name), (artist, #albumList, albumlist) where like(name, "Bill") | Return all instances which have two specific properties with a specific value
Q4 | select p, o from (artist, p, o) where artist=#8f0fe49dff1a | Return all properties and their values of a specified resource
Q5 | select artist, albumlist, p, o from (artist, #sortName, name), (artist, #albumList, albumlist), (albumlist, p, o) where like(name, "Tom") | Return all instances which have two given properties and a given property value
Q6 | select sub from (sub, , #Track), (sub, #Title, obj) where like(obj, "Dizzy") | Return instances of a specific class that have a given property and value
Q7 | select sub from (sub, , #Type) where not like(sub, "TypeLive") | Find all subclasses of a specified class whose names do not contain a given string
Q8 | select artist, name from (artist, #sortName, name) where like(name, "Jack") or like(name, "Bill") | Find instances which have a specific property with a value including given strings
Q9 | select artist, name from (artist, , #Artist), (artist, #sortName, name) where like(name, "Jack") | Find instances of a class which have a specific property with a value including given strings
Q10 | select artist, name, property, value from (artist, , #Artist), (artist, #sortName, name), (artist, property, value) where like(name, "Boss") | Return instances of a class and their property values; the instances must have a property value including a given string
Q11 | select sub from (sub, , #Type) | Find all subclasses of a class
Q12 | select dom from (#sortName, , dom) | Return the domain of a property
Q13 | select artist, property, value from (artist, property, value), (property, , #Artist) | Find all triples which have properties belonging to a specified class
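To make the shape of these queries concrete, the sketch below evaluates a conjunction of triple patterns like the one in Q3 over an in-memory list of triples. It is an illustration only, not RStar's query processor: it assumes that terms starting with `#` are constants, that all other terms are variables, and that `like` means substring containment. The resource identifiers and names in the sample data are made up.

```python
def is_constant(term):
    """Assumed convention: '#'-prefixed pattern terms are constants."""
    return term.startswith("#")


def match_pattern(pattern, triple, binding):
    """Extend binding so that pattern matches triple, or return None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if is_constant(term):
            if term != value:
                return None
        elif term in extended:
            # A variable already bound must agree with the new value.
            if extended[term] != value:
                return None
        else:
            extended[term] = value
    return extended


def evaluate(patterns, triples):
    """Join the patterns left to right with a naive nested loop."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


def q3_like(triples, needle):
    """Q3-shaped query: artists and album lists whose sort name contains needle."""
    patterns = [("artist", "#sortName", "name"), ("artist", "#albumList", "albumlist")]
    return [
        (b["artist"], b["albumlist"])
        for b in evaluate(patterns, triples)
        if needle in b["name"]
    ]


# Made-up sample data for illustration.
sample_triples = [
    ("#a1", "#sortName", "Bill Evans"),
    ("#a1", "#albumList", "#l1"),
    ("#a2", "#sortName", "Jack Johnson"),
    ("#a2", "#albumList", "#l2"),
]
```

With this data, `q3_like(sample_triples, "Bill")` yields the single row `("#a1", "#l1")`. A database-backed system would normally express the same join in SQL over indexed tables rather than use nested loops in application code.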
