An Approach for Indexing, Storing and Retrieving Domain Knowledge
Hao Wu, Hai Jin, Xiaomin Ning
Cluster and Grid Computing Laboratory, School of Computer, Huazhong University of Science and Technology, Wuhan, 430074, China

ABSTRACT
In this poster, we present our solution for indexing, storing and retrieving domain knowledge. The main principle is to use Lucene to index the domain knowledge under the guidance of the domain schema. We describe how to map the domain knowledge structure into the Lucene index structure, how to store and update the indices, and how to translate RDF-based queries into Lucene queries.
Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: Content Analysis and Indexing, Information Search and Retrieval.
General Terms: Management, Design, Experimentation.
Keywords: Semantic Web, Domain Knowledge, Indexing, Retrieval.
1. INTRODUCTION
With the rapid spread of the Semantic Web, semantic information is emerging and accumulating. Several solutions (e.g., [1, 2]) address how to store and retrieve knowledge encoded with domain ontologies, and these systems offer their respective features to various applications. However, some questions need further consideration: (1) A memory-based model can provide fast retrieval, but it is seriously restricted as the data grows. (2) RDBMS persistence is available, but a database index is not specially designed for keyword search, so performance is seriously harmed by queries of the form “like %keyword%... %keyword%”. In this paper, we present our solution (named DKIR), which integrates the traditional information retrieval (IR) mechanism to deal with these issues.

2. DOMAIN KNOWLEDGE INDEXING, STORING AND RETRIEVAL
In general, the key to building an efficient retrieval system lies in creating an inverted index, and Lucene's index falls into this family. Lucene’s data structure resembles the Table→Record→Field structure (corresponding to an E-R graph) provided by a database. Thus many traditional application files and databases can easily be mapped to Lucene’s storage architecture and interface. In the same way, ontologies can be treated as more complex E-R graphs and further translated into the index structure of Lucene.

2.1 Indexing Domain Knowledge
In Lucene, a field is a section of a Document. Each field has two parts, a name and a value. Each concept in the domain ontology can be translated into a Document object, and Field objects can then stand in for its attributes and relations. Generally, attributes play the role of annotations and their values are literals, while relations link to other concepts. Attributes and relations are handled slightly differently: when mapping a relation, its value is a URI reference (URIref) pointing to a concept instance. Moreover, each concept has some properties that are not directly annotated on the concept but can be reached by following relation paths in the ontology graph (an RDF Schema graph, G_RDF). These relation paths also serve as fields of the resources.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SAC’07, March 11-15, 2007, Seoul, Korea. Copyright 2007 ACM 1-59593-480-4/07/0003…$5.00.
Figure 1. Domain knowledge examples from SemreX.
After the domain ontology has been processed, we need to build a parser to analyze the instances in the domain knowledge. For each instance, a corresponding Document instance is created. Meanwhile, the attribute elements and relation elements of the instance, such as subclass and data type attributes, are extracted and added as Fields to the Document instance. We use the example shown in Fig. 1 to explain the process. #1 denotes the paper itself, which cites paper #2. Both are instances of the concept Publication. #hjin, #hwu and _:blanknode (an RDF node without a URIref, which can be assigned a private URIref in the indexing phase) are the authors, as instances of the concept Person. To index this graph, all the predicates are mapped to fields of the document schema. The fields “PATH” and “PLEN” (the length of PATH) are also appended to index the relation paths. The index schema then appears as follows:
Doc (URI, PATH, PLEN, TYPE, #title, #year, #author, #cite, #citedby, #fullname, #email, #affiliation,…)
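The mapping from instances to documents can be illustrated with a minimal Python sketch. It does not call Lucene; a document is modeled as a plain dictionary from field name to values. The triples only loosely follow the description of Fig. 1, the title literal is made up, and the `bnode:` scheme for private URIrefs is a hypothetical choice.

```python
import itertools

# Toy triples: (subject, predicate, object, object_is_literal).
# The literal values are illustrative and not taken from Fig. 1.
TRIPLES = [
    ("#1", "TYPE", "Publication", False),
    ("#1", "#title", "Indexing domain knowledge", True),
    ("#1", "#cite", "#2", False),
    ("#1", "#author", "#hjin", False),
    ("#1", "#author", "_:blanknode", False),
    ("#2", "TYPE", "Publication", False),
    ("#hjin", "TYPE", "Person", False),
    ("#hjin", "#fullname", "Hai Jin", True),
    ("_:blanknode", "TYPE", "Person", False),
]

_private_ids = itertools.count(1)
_bnode_uris = {}


def resolve(node):
    """Give each blank node a private URIref (the 'bnode:' scheme is hypothetical)."""
    if node.startswith("_:"):
        if node not in _bnode_uris:
            _bnode_uris[node] = f"bnode:{next(_private_ids)}"
        return _bnode_uris[node]
    return node


def build_documents(triples):
    """Return {uri: document}; a document maps a field name to a list of values."""
    docs = {}
    for subject, predicate, obj, is_literal in triples:
        uri = resolve(subject)
        doc = docs.setdefault(uri, {"URI": [uri]})
        # Attributes keep their literal text; relations keep a URIref
        # that points to the document of another instance.
        value = obj if is_literal else resolve(obj)
        doc.setdefault(predicate, []).append(value)
    return docs


docs = build_documents(TRIPLES)
print(docs["#1"])
```

Every predicate becomes a field name, so the set of fields across all documents plays the role of the Doc schema listed above.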
Fields such as #title and #year can be treated as paths of length 1. This layout supports matching any triple pattern (?subject, predicate, ?object), where “?x” means that x is a variable to be retrieved, the value of “subject” falls into the field URI, and the predicate can be specified directly by the corresponding field name, e.g. (?paper, #title, “indexing”). It also supports complex queries, such as ((?paper1, #title, “indexing”) OR|AND (?paper1, #abstract, “indexing”)). Moreover, it can handle path-involved queries, such as (?paper, #author, _:o, #fullname, “Hai Jin”), which retrieves all publications in the repository with an author named “Hai Jin”. The values of the start node and the end node of the PATH fall into the field URI and the last field, #fullname, respectively. The predicate path length must be chosen carefully. In theory all predicate paths need to be resolved; in our experience, however, limiting PLEN to 3 or 4 is enough for most (>80%) queries. More complex path-involved queries can be answered simply by combining multiple short path queries. In addition, special solutions can be provided to accommodate special demands. Some relations of collection type, such as rdf:Bag, rdf:Seq and rdf:Alt, can be treated directly as paths of length 1 after a transposition step.
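The path fields and the two query forms above can be sketched in Python as follows. This is an illustration under stated assumptions, not the DKIR implementation: documents are plain dictionaries, a path field is named by joining its predicates with "/", the parameter `max_plen` plays the role of the PLEN bound, and a case-insensitive substring test stands in for Lucene's term matching. The sample data is made up.

```python
# Documents as field -> values; relation fields hold URIs of other documents.
DOCS = {
    "#1": {"#title": ["indexing domain knowledge"],
           "#author": ["#hjin", "#hwu"], "#cite": ["#2"]},
    "#2": {"#title": ["semantic retrieval"], "#author": ["#hwu"]},
    "#hjin": {"#fullname": ["Hai Jin"]},
    "#hwu": {"#fullname": ["Hao Wu"]},
}


def path_values(docs, uri, path):
    """Follow the predicates in `path` from `uri` and return the values reached."""
    frontier = [uri]
    for predicate in path:
        frontier = [v for node in frontier
                    for v in docs.get(node, {}).get(predicate, [])]
    return frontier


def add_path_fields(docs, max_plen):
    """Index every predicate path of up to max_plen hops as its own field."""
    indexed = {}
    for uri, fields in docs.items():
        doc = {"URI": [uri]}
        paths = [(p,) for p in fields]  # plain fields are paths of length 1
        while paths:
            longer = []
            for path in paths:
                values = path_values(docs, uri, path)
                if not values:
                    continue
                doc["/".join(path)] = values
                if len(path) < max_plen:
                    # Extend the path through every node it reaches.
                    for node in values:
                        for predicate in docs.get(node, {}):
                            extended = path + (predicate,)
                            if extended not in longer:
                                longer.append(extended)
            paths = longer
        indexed[uri] = doc
    return indexed


def match(indexed, field, keyword):
    """URIs of documents whose `field` contains `keyword` (case-insensitive)."""
    keyword = keyword.lower()
    return sorted(uri for uri, doc in indexed.items()
                  if any(keyword in v.lower() for v in doc.get(field, [])))


indexed = add_path_fields(DOCS, max_plen=3)
print(match(indexed, "#title", "indexing"))            # (?paper, #title, "indexing")
print(match(indexed, "#author/#fullname", "Hai Jin"))  # path-involved query
```

Because each path is materialized as a field at indexing time, a path-involved query costs the same as a single-field lookup; the price is index size, which is why bounding the path length matters.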
2.2 Indices Storage and Updating
The indices are stored by Lucene's IndexWriter, which writes Document instances into an index base. DKIR provides two storage implementations: FSDirectory stores the indices in the file system, while RAMDirectory builds the indices in memory. DKIR indices evolve via two processes: (1) creating new segments for newly added domain knowledge, and (2) merging existing segments.
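The two update processes can be pictured with a small Python sketch of a segment-based inverted index. It shows the general idea only and is not Lucene's IndexWriter API or on-disk format; the function names and the sample texts are hypothetical.

```python
from collections import defaultdict


def build_segment(docs):
    """Build one small inverted index (term -> sorted doc ids) for a batch of documents."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def merge_segments(segments):
    """Merge several segments into one by taking the union of their postings."""
    merged = defaultdict(set)
    for segment in segments:
        for term, ids in segment.items():
            merged[term].update(ids)
    return {term: sorted(ids) for term, ids in merged.items()}


def search(segments, term):
    """A query consults every live segment and unions the hits."""
    hits = set()
    for segment in segments:
        hits.update(segment.get(term.lower(), []))
    return sorted(hits)


# (1) Newly added knowledge goes into a new segment; existing segments are untouched.
segments = [build_segment({"#1": "indexing domain knowledge"})]
segments.append(build_segment({"#2": "domain ontology retrieval"}))
before = search(segments, "domain")

# (2) Merging replaces the existing segments with a single one.
segments = [merge_segments(segments)]
after = search(segments, "domain")
print(before, after)
```

Adding a segment is cheap because nothing already written is modified, while merging keeps the number of segments a query has to visit small.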
2.3 Query and Retrieval
We extend Lucene’s query interface to support path-involved queries. For compatibility, an RDF query can also be used for front-end search through the following intermediate operators:
Definition 1. Resources = Q (Clause) is a function that performs a query by Clause. Resources are the results returned by Q (Clause), a series of URIs. Clause follows the query syntax of Lucene.
Definition 2. Variable = Resource (Attribute) is a function that gets the attribute value of a resource. Attribute is mapped to the corresponding field of the inverted index structure.
Definition 3. Path = P (Relation + Relation +…) is a function that gets the relation path consisting of multiple connected predicates. Path is also treated as a field, as mentioned above.
Next, an example RDQL query is given, together with the corresponding transformed DKIR query. This query, based on the vCard vocabulary, finds the family name and given name from any vCards with the formatted name (FN) "Hai Jin". The vCard vocabulary has a structured value for the name, using the vcard#N property to point to another node which, in turn, has the various name elements as further statements.
RDQL:
SELECT ?family , ?given
WHERE (?vcard vcard#FN "Hai Jin")
(?vcard vcard#N ?name)
(?name vcard#Family ?family)
(?name vcard#Given ?given)
USING vcard FOR
DKIR:
1: p1 = P (vcard#N + vcard#Family), p2 = P (vcard#N + vcard#Given)
2: vcard = Q (vcard#FN: “Hai Jin” AND PATH: (p1 OR p2))
3: family = vcard (p1), given = vcard (p2)
Three steps are needed to complete the same task. First, identify the relation paths that connect to vcard through the properties vcard#N+vcard#Family and vcard#N+vcard#Given. Second, find all resources vcard whose attribute vcard#FN has the value “Hai Jin”. Finally, return the family value and the given value of the name resource via p1 and p2, respectively.
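The three DKIR steps can be traced with a toy Python sketch. The operators P, Q and the attribute lookup are modeled as ordinary functions over dictionaries, which is an assumption made for illustration rather than the DKIR interface; the sample cards and the layout of the PATH field (a list of path names per document) are also hypothetical.

```python
def P(*relations):
    """Definition 3: name the path made of several connected predicates."""
    return "+".join(relations)


# Toy index: plain fields plus path fields; PATH lists the paths a document has.
INDEX = {
    "#card1": {
        "vcard#FN": ["Hai Jin"],
        P("vcard#N", "vcard#Family"): ["Jin"],
        P("vcard#N", "vcard#Given"): ["Hai"],
        "PATH": [P("vcard#N", "vcard#Family"), P("vcard#N", "vcard#Given")],
    },
    "#card2": {
        "vcard#FN": ["Hao Wu"],
        P("vcard#N", "vcard#Family"): ["Wu"],
        P("vcard#N", "vcard#Given"): ["Hao"],
        "PATH": [P("vcard#N", "vcard#Family"), P("vcard#N", "vcard#Given")],
    },
    "#card3": {"vcard#FN": ["Hai Jin"]},  # no structured name, so no PATH entries
}


def Q(index, field, value, any_path):
    """Definition 1 (simplified): field equals value AND PATH lists one of any_path."""
    return sorted(
        uri for uri, doc in index.items()
        if value in doc.get(field, [])
        and any(path in doc.get("PATH", []) for path in any_path)
    )


def attribute(index, uri, field):
    """Definition 2: read the values of a field (or path field) of a resource."""
    return index[uri].get(field, [])


# Step 1: build the two relation paths.
p1 = P("vcard#N", "vcard#Family")
p2 = P("vcard#N", "vcard#Given")
# Step 2: find the vCards with FN "Hai Jin" that have at least one of the paths.
cards = Q(INDEX, "vcard#FN", "Hai Jin", [p1, p2])
# Step 3: read the family and given names along the paths.
results = [(attribute(INDEX, c, p1), attribute(INDEX, c, p2)) for c in cards]
print(cards, results)
```

The intermediate ?name node of the RDQL query never appears here, because the two-hop path is already stored as a field of the vCard document.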
3. IMPLEMENTATION
We use Jena and Lucene to implement our approach as a part of the SemreX system [3]. Through the GUI, we can generate an RDF query or the equivalent DKIR query to retrieve semantic information about the literature. We selected a dataset (about 36,000 publications and more than 1,000,000 triples) from a SemreX dump and compared DKIR with the Jena RDB model on load time, query speed and disk consumption (index file / source RDF file).
Table 1. Comparison of the load and retrieval efficiency
            Load Time          Query Speed        Disk Storage
Jena RDB    12.5 (ms/triple)   1-8 (ms/triple)    6.0~6.5
DKIR        10.5 (ms/triple)   1-15 (ms/triple)   1.30~1.50
In the current test, the query efficiency of both DKIR and Jena varies widely. DKIR is slightly worse than Jena; however, it handles repeated queries much better than Jena. In addition, DKIR performs stably as the data volume increases. In practice, when the number of triples exceeds one million, most searches still take only several hundred milliseconds.
4. CONCLUSION
We have presented an integrated model and implementation for indexing, storing and retrieving domain knowledge. Our method differs from related work in its indexing, storage and retrieval mechanisms. It makes it easy to combine metadata search with full-text search for managing documents. Moreover, it achieves better scalability even without RDB support.
5. ACKNOWLEDGMENTS
This work is supported by the National Basic Research Program (973) of China under Grant No. 2003CB317003, and the Cultivation Fund of the Key Scientific and Technical Innovation Project, Ministry of Education of China under Grant 705034.
6. REFERENCES
[1] Wilkinson, K., Sayers, C., Kuno, H. A., and Reynolds, D. Efficient RDF storage and retrieval in Jena2. In Proceedings of 1st International Workshop on Semantic Web and Databases. 2003, 131-150.
[2] Broekstra, J., Kampman, A., and van Harmelen, F. Sesame: a generic architecture for storing and querying RDF and RDF Schema. In Proceedings of 1st International Semantic Web Conference (ISWC’02). 2002, 54-68.
[3] Ning, X.M., Jin, H., and Wu, H. SemreX: towards large-scale literature information retrieval and browsing with semantic association. In Proceedings of 2nd IEEE International Symposium on Service-Oriented Applications, Integration and Collaboration (SOAIC'06). 2006.