A native XML database supporting approximate

1 downloads 0 Views 165KB Size Report
search. As an example, consider image search performed by using MPEG-7 vi- ... low-level feature descriptors, as in MPEG-7 [4] visual or audio descriptors.
A native XML database supporting approximate match search Giuseppe Amato and Franca Debole ISTI - CNR Pisa, Italy Giuseppe.Amato,Franca.Debole @isti.cnr.it



Abstract. XML is becoming the standard representation format for metadata. Metadata for multimedia documents, as for instance MPEG-7, require approximate match search functionalities to be supported in addition to exact match search. As an example, consider image search performed by using MPEG-7 visual descriptors. It does not make sense to search for images that are exactly equal to a query image. Rather, images similar to a query image are more likely to be searched. We present the architecture of an XML search engine where special techniques are used to integrate approximate and exact match search functionalities.

1 INTRODUCTION XML is becoming one of the primarily used formats for the representation of heterogeneous information in many and diverse application sectors, such as multimedia digital libraries, public administration, EDI, insurances, etc. This widespread use has posed a significant number of technical requirements to systems used for storage and contentbased retrieval of XML data, and many others is posing today. In particular, retrieval of XML data based on content and structure has been widely studied and it has been solved with the definition of query languages such as XPath [1] and XQuery [2] and with the development of systems able to execute queries expressed in these languages. However, many other research issues are still open. There are many cases where users may have a vague idea of the XML structure, either because it is unknown, or because is too complex, or because many different structures – with similar semantics – are used across the database [3]. In addition there are cases where the content of elements of XML documents cannot be exactly matched against constants expressed in a query, as for instance in case of large text context or low-level feature descriptors, as in MPEG-7 [4] visual or audio descriptors. In the first case structure search capabilities are needed, while in the second case we need approximate content search (sometime also referred as similarity search). In this paper we present the architecture of XMLSe a native XML search engine that allows both structure search and approximate content match to be combined with traditional exact match search operations. Our XML database can store and retrieve any valid XML document without need of specifying or defining their schema. Our system store XML documents natively and uses special indexes for efficient path expression execution, exact content match search, and approximate match search.

2

The paper is organized as follows. In Section 2, we set the context for our work. In Section 3 we present the overall architecture of the XMLSe system. In Section 4 we describe the query algebra at the basis of the query processor. Section 5 shows some example of query execution in terms of the query algebra. Section 6 concludes.

2 MOTIVATION AND RELATED WORK In the Digital Libraries field three different approaches are typically used to support document retrieval by means of XML encoded metadata. The first consists in using relational database to store and to search metadata. In this case metadata should be converted into relational schemes [5] [6] [7] and this is very difficult when complex and descriptive metadata schemes such as ECHO [8] and MPEG-7 [4] should be managed: even simple XML queries are translated into complex sequences of joins among the relational tables. The second approach consists in using full text search engines [9] to index metadata records, and in general this applications are limited to relatively simple and flat metadata schemes. Besides, it is not possible to search by specifying ranges of values. The third and last approach consists in doing full sequential scan of metadata records. In this case no indexing is performed on the metadata and the custom search algorithms always scans the entire metadata set to retrieve searched information. A relatively new promising approach is to store metadata in native XML databases as for instance Tamino [10], eXist [11], Xindice [12], which are some of software products which have been developed in recent years with this approach. However, these systems, in addition to some simple text search functionality, exclusively support exact match queries. They are not suitable to deal with multimedia metadata and to provides users with structure search functionalities. With the continuous increase of production of multimedia documents in digital format, the problem of retrieving stored documents by content from large archives is becoming more and more difficult. A very important direction toward the support of content-based retrieval is feature based similarity access. Similarity based access means that the user specifies some characteristics of the wanted information, usually by an example image (e.g., find images similar to this given image, represents the query). The system retrieves the most relevant objects with respect to the given characteristics, i.e., the objects most similar to the query. Such approach assumes the ability to measure the distance (with some kind of metric) between the query and the data set images. Another advantage of this approach is that the returned images can be ranked by decreasing order of similarity with the query. The standardization effort carried-out by MPEG-7 [4], intending to provide a normative framework for multimedia content description, has permitted several features for images to be represented as visual descriptors to be encoded in XML. In our system we have realized the techniques necessary to support XML represented feature similarity search. For instance, in case of an MPEG-7 visual descriptor, the system administrator can associate an approximate match search index to a specific XML element so that it can be efficiently searched by similarity. The XQuery language has been extended with new operators that deal with approximate match and ranking, in order to deal with these new search functionality.

3

3 SYSTEM ARCHITECTURE In this section we will discuss the architecture of our system, explaining the characteristics of the main components: the data storage and the system indexes. A sketch of the architecture is give in Figure 1. 3.1 Data storage In recent years various projects [6] have proposed several strategies for storing XML data sets. Some of these have used a commercial database management system to store XML documents [5], others have stored XML documents as ASCII files in the file system, and others have also used an Object storage [13]. We have chosen to store each XML document in its native format and to use special access methods to access XML elements. XML documents are stored in a file, called repository. Every XML element is identified by an unique Element Instance IDentifier (eiid ). As depicted in Figure 1, we use an offset file to associate every eiid with a 2-tuple , which contain respectively a reference to the start and end position of the element in the repository. By using structural containment join techniques in [14] containment relationships among elements can be solved. The mapping between XML element names and the corresponding list of eiid is realized through an element index.

  

3.2 System Index To improve the efficiency of XML queries, special indexes are needed. For this reason we have analyzed and realized some indexes to efficiently resolve the mapping between element and its occurrences and to process content predicates, similarity predicates, and navigation operations throughout the XML structure.

PathIndex

1 2



..

3



..













.. Signatures

ContentIndex



Similarity





8

< 13, 18>

10

< 16, 17>

11

< 19, 20>

Offset File

(1,1) (2,2) (3,3) (4,4)John (5) (5,6) Smith(7) (8) (6,9) San Diego (10) (11) (7,12) (8,13) (9,14) Bill(15) (10,16) McCulloc(17) (18) (11,19) San Francisco (20) (21) (22)

XML document System Index

Fig. 1. The components of data storage.

4

Path Index. Without special index, processing a path expression (es: //person/ln ), with optional wildcard, involves two steps: first, the occurrences of elements specified in the path expression (es: person and ln ) should be found and second, hierarchical relationships, according to the path expression being processed, should be verified with containment joins. Using ad hoc indexes, like in [15] [16] [17], which associate entire pathnames with the list of their occurrences in XML documents, processing a path expression is much more efficient. In accord to this approach we have proposed a new path index to resolve efficiently the path expressions. The advantage of our approach with respect to the others, is that also path expressions containing wildcards in arbitrary position can be efficiently processed. This approach, discussed in [18], is based on the construction of a rotated path lexicon, consisting of all possible rotations of all element names in a path. It is inspired by approaches used in text retrieval systems to processing partially specified query terms. In our system the concept of term is substituted by path: each path is associated with the list of its occurrences and for this reason we call path lexicon the set of occurring paths (see Figure 2). Let path, path be pure path expressions, that is path expressions containing just a sequence of element (and attribute) names, with no wildcards, and predicates. We can process with a single index access the following types of path expressions: path, //path, path//path , path//, and //path//. For more details on technique see [18].





Content Index. Processing the queries that, in addition to structural relationships, contains the content predicates (es: /people/person//ln=’McCulloc’ ), can be inefficient. In order to solve this problem we have extended our path index technique to handle simultaneously the content predicates and structural relationships. The content of an element is seen as a special child of an element so it is included as the last element of a path. Of course, it does not make sense to index content of all elements and attributes. The database administrator can decide, tacking into account performance issues, which elements and attributes should have their content indexed. By using this extension, an

Path lexicon

Posting List

/people

{1}

/people/person

{2,10}

/people/person/name

{3,11}

/people/person/name/fn

{4, 12}

/people/person/name/ln

{6, 14}

/people/person/address

{8,16}

people person name

1

3 address

fn

4

ln

6

John

5

Smith

7

10

person

2

name

8 fn

12

11 address

ln

13 9

Bill

San Diego

Fig. 2. The paths and their inverted lists associated

McCulloc

16

14 15

17

S. Francisco

5

expression of comparison can simply be processed by a single access to the path index [18]. Tree Signature. Efficient processing of path expressions in XQuery queries requires the efficient execution of navigation operations on trees (ancestor, descendant, parent etc . . . ), for this reason in our system we have used the signature file approach. Signatures are a compact representations of larger structures, which allow the execution of queries on the signatures instead of the documents. We define the tree signature [19] as sequences of tree-node entries to obtain a compact representation of the tree structures. To transform ordered trees into sequences we apply the preorder and the postorder numbering schema. The preorder and postorder sequences are ordered lists of all nodes of a given tree T. In a preorder sequence a tree node is traversed and assigned its rank before its children are assigned their rank and traversed from left to right, whereas in the postorder sequence a tree node is traversed and assigned its rank after its children are assigned their rank and traversed from left to right. The general structure of tree signature for a document tree T is

!"# %$'&()# +**,# +*-./0, 1$2&(0( 3**#0# +*-40,/)5)556/78- 1$2&(8- +**(8- +*-489 where **#:1*-4:1 is the preorder value of the first following ( first ancestor ) node of the node with the preorder number  . The signature of an XML file is maintained in a

corresponding signature file consisting of a list of records. Through this tree signature the most significant axes of XPath can be efficiently evaluated, resolving any navigation operation. Exploiting the capability of the tree signature is it is also possible to process structure search queries, as discussed in [3]. In fact, there are many cases where the user may have a vague idea of the XML structure, either because it is unknown, or because it is too complex. In these cases, what the user may need to search for are the relationships that exist among the specified components. For instance, in an XML encoded bibliography dataset, one may want to search for relationships between two specific persons to discover whether they were co-authors, editors, editor and co-author.

Approximate Match Index. Recently published papers [20] [21] investigate the possibility to interrogate XML documents not only with the exact-match paradigm but also with the approximate match paradigm. An exact-match approach is restrictive, since it limits the set of relevant and correlated results of queries. With the continuous increase of multimedia document encoded in XML, this problem is even more relevant. In fact rarely a user express exact requests on the features of a multimedia object (e.g., color histogram). Rather, the user will more likely express queries like ”Find all the images similar to this”. For supporting the approximate match search in our system, we have introduced a new operator , which can be applied to content of XML elements. To be able to resolve this type of query we have used suitable index structures. With regard to the generic similarity queries the index structure which we use is the AM-tree [22]. It can be used when a distance function is available to measure the (dis)-similarity among content representations. For instance it can be used to search by similarity MPEG-7 visual descriptors.

;

6

For the text search, we make a use of the functionalities of the full-text search engine library. Specifically we have used Lucene [23].

4 QUERY ALGEBRA An XQuery query is translated into a sequence of simple operations to be executed (the logical query execution plan). Operators of our query algebra take as arguments and return lists of tuples of eiid (see Section 3.1). We call these lists Element Instance Identifier Result (EIIR ). For instance, given an EIIR R, the evaluation of Parthat is the set of elements named article, which ent(R, article) gives back the EIIR are parents of elements contained in R.

=?

E1FG7 %D>HI D>H D>H D>H