Advanced Storage and Retrieval of XML Multimedia Documents¹

Jérôme Godard, Frédéric Andres, and Kinji Ono

National Institute of Informatics, Hitotsubashi 2-1-2, Chiyoda-ku, Tokyo 101-8430, Japan
[email protected] {andres, ono}@nii.ac.jp

Abstract. Multimedia documents are increasingly expressed in XML, which is used for both data representation and data exchange. In this paper, we describe the data and execution models of XML multimedia document management as part of the PHASME information engine. The core data model exploits the Extended Binary Graph structure (the so-called EBG structure); we present the storage and indexing services. The goal of this ongoing project is to handle the growing volume of multilingual multimedia documents within the PHASME prototype, as a key step toward the next generation of information engines.

1 Introduction

In the age of the multimedia World Wide Web, XML [7] was created as a metalanguage for describing tags and the structural relationships between them, so that richly structured documents could be used over the web. HTML and SGML were not practical for this purpose. HTML comes with a fixed set of semantics and does not provide arbitrary structure. SGML provides arbitrary structure but is too difficult to implement just for a web browser. Full SGML systems solve large, complex problems that justify their expense; viewing structured multimedia documents sent over the web rarely carries such justification. To optimize storage and to exploit this markup, it has become important to use an efficient repository for XML multimedia data and its related metadata.

In this paper, we describe the architecture of the XML support inside the PHASME Information Engine, a so-called Application-Oriented DBMS [4]. PHASME storage units are XML documents held inside the core data structure (the EBG structure). We base our approach on the EBG structure because it has shown its efficiency in several case studies [3,4], and we seek improvements to the XML support at the XML content level, at the XML content management level, and in the indexing support. XML content includes the structural information and the values.

¹ This research project is supported by the National Science Foundation (grant no. 9905603) and the Japanese Ministry of Education, Culture, Sports, Science and Technology (grant no. 13480108).

S. Bhalla (Ed.): DNIS 2002, LNCS 2544, pp. 64-73, 2002. © Springer-Verlag Berlin Heidelberg 2002


Figure 1 shows an XML multimedia document describing a fragment of an inventory related to the Silk Road, alongside its associated syntax tree; both representations have equivalent structural properties. We simplified the syntax tree by omitting the CDATA nodes. Furthermore, each node is assigned an OID (noted OID(value); e.g. OID0 for the node “inventory”).



[Figure 1: an XML multimedia document and its syntax tree. The tree is rooted at the node “inventory” (OID0); the other element nodes legible in the figure are “object”, “description”, “section”, “text”, “image”, “location”, “video”, and “string”, each labelled with an OID. The description text reads: “This study is focusing on the creation of the metadata related to the caravanserai object or "Khan" Object. This object has been used in the history to host the caravan (e.g. men, goods, animals) on the economic exchange roads between Europe and Asia. The caravanserai is also used as hostel on the pilgrimage’s West-East and North-South routes. The geography location is the Euro-Asiatic continent.” The string values are “C0001-im.jpeg”, “Teheran, Iran”, and “C0002.mpeg”.]

Fig. 1. XML Example

Given this equivalence, this paper is concerned with providing effective XML management tools. We describe the architecture of the XML support inside the PHASME Information Engine. Our approach is innovative in two respects. First, EBG storage processing does not require any associated XML Schema or DTD. Second, it reduces the volume of stored data: an EBG structure allows multimedia documents to share their common parts, which are stored only once.


The remainder of this paper is organized as follows. Section 2 discusses related work. In Section 3, we describe the XML support in the PHASME Information Engine and present initial qualitative assessments of this approach. Our conclusions are given in Section 4.

2 XML Management and State of the Art

There are two main kinds of XML files: data-centric and document-centric [22]. Data-centric documents use XML as a data transport. They are characterized by a fairly regular structure, fine-grained data, and little or no mixed content. Document-centric documents are usually designed for human consumption. They are characterized by a less regular or irregular structure, larger-grained data, and a lot of mixed content. In fact, the document-centric view derives from SGML. As a general rule, data-centric documents are stored in a traditional database, such as a relational, object-oriented, or hierarchical database, whereas document-centric documents can also be stored in a native XML database (a database designed especially for storing XML). In the real world, however, with complex and heterogeneous XML files such as multimedia ones, it is very difficult to draw the line between these two types. That is why hybrid systems are needed to manipulate XML while keeping its advantages. In this section, we give an overview of the main and most interesting ways to store and retrieve XML documents.

2.1 Storage Management

Storing data in a traditional DBMS

In order to transfer data between XML documents and a database, it is necessary to map the XML document structure (DTD, XML Schema) to the database schema. The structure of the document must exactly match the structure expected by the mapping procedure. Since this is rarely the case, products that use this strategy are often used with XSLT. That is, before data is transferred to the database, the document is first transformed to the structure expected by the mapping; the data is then transferred. Similarly, after data is transferred from the database, the resulting document is transformed to the structure needed by the application. A way to move or correct the structure of multilingual lexical XML data is described in [9].
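The table-based mapping described above can be illustrated with a minimal sketch. It assumes a hypothetical data-centric document whose element names (`inventory`, `object`, `image`, `location`, `video`) loosely follow Fig. 1; the `id` attribute, the `object` table, and the `load` function are illustrative assumptions and not part of any system discussed in this paper. The sketch uses only Python's standard `sqlite3` and `xml.etree.ElementTree` modules.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical data-centric document; element and attribute names are illustrative.
DOC = """<inventory>
  <object id="C0001"><image>C0001-im.jpeg</image><location>Teheran, Iran</location></object>
  <object id="C0002"><video>C0002.mpeg</video></object>
</inventory>"""

# Fixed mapping: one column per expected child element.
COLUMNS = ("image", "location", "video")


def load(doc: str) -> sqlite3.Connection:
    """Map each <object> element to one row of a relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE object (id TEXT PRIMARY KEY, image TEXT, location TEXT, video TEXT)"
    )
    for obj in ET.fromstring(doc).iter("object"):
        unexpected = [child.tag for child in obj if child.tag not in COLUMNS]
        if unexpected:
            # The document must match the structure the mapping expects.
            raise ValueError(f"unmapped elements: {unexpected}")
        # Missing child elements become NULL columns.
        row = [obj.get("id")] + [obj.findtext(col) for col in COLUMNS]
        conn.execute("INSERT INTO object VALUES (?, ?, ?, ?)", row)
    return conn


conn = load(DOC)
rows = conn.execute("SELECT id, image, location, video FROM object ORDER BY id").fetchall()
null_count = sum(value is None for row in rows for value in row)
```

In this toy case, `null_count` counts the columns left empty because an object lacks a child element, and any element outside `COLUMNS` is rejected, which is why a transformation step is typically needed before loading.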
One of the main weaknesses of using a traditional DBMS is the waste of space. The implementation reported in [14] shows that 7.65 MB of XML documents require 11.42 MB once stored as a table-based structure, which is roughly 50% more.

Storing data in a native XML database

A native XML database is valuable when the data is semi-structured, i.e. it has a regular structure, but one that varies enough that mapping it to a relational database results in either a large number of columns with null values (which wastes space) or a large number of tables (which is inefficient). A second reason not to use a traditional DBMS is retrieval speed. Some storage strategies used by native XML databases store entire documents together physically or use physical (rather than logical) pointers between the parts of the document. This allows the documents to be retrieved either without joins or with physical joins, both of which are faster than the logical joins used by relational databases. For example, an entire document might be stored in a single place on the disk, so retrieving it or a fragment of it requires a single index lookup and a single read to retrieve the data. A relational database would require several index lookups and at least as many reads to retrieve the data. However, speed is increased only when data is retrieved in the order in which it is stored on disk; retrieving a different view of the data will probably perform worse than in a relational database.

In [23], storage management is done through an offset space, which is an address space in secondary memory. This is an efficient way to store structures such as trees, and it avoids using multiple relations. An offset space is very similar to a main-memory space and offers the same characteristics as a UNIX file system. We must point out here that this approach has been used in the Phasme project from the beginning [1]. Another problem with storing data in a native XML database is that most native XML databases can only return the data as XML. We also note that version control systems such as CVS make simple transaction management possible.

Encoding problems

By definition, an XML document can contain any Unicode character except some of the control characters. Unfortunately, many databases offer limited or no support for Unicode and require special configuration to handle non-ASCII characters. If the data contains non-ASCII characters, it must be ensured that the database and the data transfer software handle these characters.
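The character constraint mentioned above can be checked mechanically. The following is a minimal sketch, assuming the `Char` production of XML 1.0 (tab, line feed, carriage return, and the ranges U+0020 to U+D7FF, U+E000 to U+FFFD, and U+10000 to U+10FFFF); the function names are hypothetical helpers introduced for illustration only.

```python
def is_xml_char(ch: str) -> bool:
    """Return True if ch is allowed by the Char production of XML 1.0."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def strip_invalid_xml_chars(text: str) -> str:
    """Drop the characters that may not appear in an XML 1.0 document."""
    return "".join(ch for ch in text if is_xml_char(ch))


def has_non_ascii(text: str) -> bool:
    """Flag data that needs Unicode-aware storage and transfer software."""
    return any(ord(ch) > 0x7F for ch in text)


raw = "Jérôme\x00 Godard\x0b"
clean = strip_invalid_xml_chars(raw)
needs_unicode_support = has_non_ascii(clean)
```

Here the NUL and vertical-tab characters are dropped, while the accented letters survive and are flagged as requiring a Unicode-capable database.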
A cleaning process is very often needed to produce pure Unicode files. As an example, a rigorous strategy is described in [9] for multilingual lexical data contained in XML files.

2.2 Retrieval Management

Indexing issues

There are many ways to index XML documents in order to use their content in a database. [15] gives an overview of the two parts of the XRS-II (XML Retrieval System) architecture: a database search engine and a full-text search engine. Text retrieval is performed with indexes based on term frequencies and siblings; it uses the BUS (Bottom-Up Scheme) technique to index and retrieve full text. In that case, however, the query must be very precise and respect the attribute definitions, because the database search engine handles only exact matching. Another method is described in [14]; the authors use path-based indexing to map the tree structure onto a relational table-based structure. They add to each node “regional” data consisting of parameters defined by the node's position in the tree and its hierarchy. In that case, the database schemas for storing XML documents are independent of the XML file structure (DTD or schema), which makes it possible to add any kind of well-formed XML file to the database. Because of the decomposition into fixed tables, DBMS index structures can be used (e.g. B+ trees, R-trees). We also stress that XPath [20] offers strong and well-defined means to describe XML documents without dealing directly with a tree structure.

Query language issues

Query languages dealing with XML are getting more complicated because they have to mix declarative and navigational features. In fact, queries are not necessarily generic (i.e. a query answer does not depend only on the logical level of the data), as the separation of the logical and physical levels is not ensured across the various uses of XML. Another aspect brings further challenges: XML is ordered, which requires compensating for the lack of order within database systems, as explained in [19]. Many proposals have been made for XML query languages [6]; we briefly present the current leading solutions. XSLT allows users to transform documents to the structure dictated by the model before transferring data to the database, as well as the reverse. Because XSLT processing can be expensive, some products also integrate a limited number of transformations into their mappings. The long-term solution to this problem is the implementation of query languages that return XML, since the output has to be expressed as a tree. Currently, most of these languages rely on SELECT statements embedded in templates. This situation is expected to change when XQuery is finalized, as major database vendors are already working on implementations. Unfortunately, almost all XML query languages (including XQuery 1.0) are read-only, so different means will be needed to insert, update, and delete data in the near term (in the long term, XQuery will add these capabilities).
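To make the path-based indexing discussed under the indexing issues of this section more concrete, the sketch below flattens a document tree into fixed rows of (path, start, end, depth). The start and end counters, assigned during a depth-first traversal, are a common simplification of "regional" data and are an assumption here; they are not necessarily the exact parameters used in [14]. The names `NodeRow`, `build_rows`, and `contains` are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, NamedTuple


class NodeRow(NamedTuple):
    path: str   # path from the root, e.g. /inventory/object
    start: int  # counter value when the node is entered
    end: int    # counter value when the node is left
    depth: int  # distance from the root


def build_rows(root: ET.Element) -> List[NodeRow]:
    """Flatten a tree into schema-independent rows, in document order."""
    rows: List[NodeRow] = []
    counter = 0

    def visit(elem: ET.Element, prefix: str, depth: int) -> None:
        nonlocal counter
        path = f"{prefix}/{elem.tag}"
        start = counter
        counter += 1
        slot = len(rows)
        rows.append(NodeRow(path, start, -1, depth))  # placeholder until the node is left
        for child in elem:
            visit(child, path, depth + 1)
        rows[slot] = NodeRow(path, start, counter, depth)
        counter += 1

    visit(root, "", 0)
    return rows


def contains(a: NodeRow, b: NodeRow) -> bool:
    """Ancestor test using only the regions, with no tree navigation."""
    return a.start < b.start and b.end < a.end


root = ET.fromstring(
    "<inventory><object><image/><location/></object><object><video/></object></inventory>"
)
rows = build_rows(root)
objects_with_video = [
    a for a in rows
    if a.path == "/inventory/object"
    and any(b.path == "/inventory/object/video" and contains(a, b) for b in rows)
]
```

Because every well-formed document maps to the same row layout, the rows could be stored in one fixed table regardless of DTD or schema, and the containment test reduces an ancestor query to comparisons on two integer columns.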
2.3 Mixed Approach: Hybrid System

Since most data is becoming more and more complex and heterogeneous (especially in the multimedia area), it seems worthwhile to mix the two storage approaches described above. The first and major reason is the storage constraints (memory space, access time, declustering…); the second is that it enables the use of efficient retrieval methods (indexing, query management…). In the case of a hybrid system (the name commonly used for the mixed approach), it is first necessary to look at the physical organization used to store the data. [11] introduces a hybrid system called Natix, which has a physical record manager in charge of disk memory management and buffering. It uses a tree data model and provides methods to dynamically maintain the physical structure. [15] describes how XML documents can be indexed and how the text retrieval process can be improved with a mixed storage model: attributes are stored in a DBMS, while the element contents and their indices are saved in files. This hybrid approach appears to be a good trade-off between performance and cost in indexing and retrieval.
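The mixed storage model attributed to [15] can be sketched roughly as follows. This is an illustration under stated assumptions, not the XRS-II implementation: the sample document, the `attribute` table, the JSON file names, and the `store` function are all hypothetical, and a plain term-to-node map stands in for the full-text index.

```python
import json
import sqlite3
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical sample; the "id" and "type" attributes are illustrative.
DOC = """<inventory>
  <object id="C0001" type="image"><description>caravanserai in Teheran</description></object>
  <object id="C0002" type="video"><description>caravan routes</description></object>
</inventory>"""


def store(doc: str, directory: Path) -> sqlite3.Connection:
    """Put attributes in a DBMS table and element contents plus a term index in files."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE attribute (node INTEGER, name TEXT, value TEXT)")
    contents, inverted = {}, {}
    # Nodes are numbered in document order.
    for node, elem in enumerate(ET.fromstring(doc).iter()):
        for name, value in elem.attrib.items():
            conn.execute("INSERT INTO attribute VALUES (?, ?, ?)", (node, name, value))
        text = (elem.text or "").strip()
        if text:
            contents[node] = text
            for term in text.lower().split():
                inverted.setdefault(term, []).append(node)
    (directory / "contents.json").write_text(json.dumps(contents), encoding="utf-8")
    (directory / "index.json").write_text(json.dumps(inverted), encoding="utf-8")
    return conn


with tempfile.TemporaryDirectory() as tmp:
    directory = Path(tmp)
    conn = store(DOC, directory)
    # Exact-match search on attributes goes to the database ...
    video_nodes = [row[0] for row in conn.execute(
        "SELECT node FROM attribute WHERE name = 'type' AND value = 'video'")]
    # ... while term lookup goes to the file-based index.
    index = json.loads((directory / "index.json").read_text(encoding="utf-8"))
    caravan_nodes = index.get("caravan", [])
```

The split keeps exact-match attribute queries on DBMS indexes while leaving variable-length text outside the tables, which is the trade-off the paragraph above describes.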


3 XML Support in PHASME Information Engine

The PHASME architecture is shown in Figure 2. XML is supported through the XML plug-in service, which includes the document management functions (creation, manipulation, suppression, indexing). The core of the system is the execution reactor, which mediates the requests coming from external applications (XSQL queries or direct document manipulation). The document-type support includes the metadata [16] associated with each document. PHASME is being extended to support DTDs and XML Schema. The latter support will allow XML representation information, such as the structural properties of documents, to be mapped directly into Extended Binary Graphs. All the vertical XML support depends heavily on the many-sorted algebra that defines the XML manipulation functions. For this reason, a plug-in defines a set of functions based on the PHASME Internal Language. A major goal of this project is to extend PHASME customizability to XML support and to optimize the implementation of such an XML support plug-in.

[Figure 2: layered architecture. Applications submit XML documents and queries in a query-based language (XQuery) to the AO DBMS PHASME. Inside PHASME, the figure lists: data types (document type); operation definition (document creation, document manipulation, document suppression); many-sorted algebra; query optimization; physical structure (document structure, index); and execution model (execution reactor). PHASME runs on operating system services: threads (inter- and intra-operation parallelism) and memory-mapped files.]

Fig. 2. XML Support inside PHASME Engine


3.1 PHASME XML Support

The PHASME storage is based on the Extended Binary Graph (EBG structure). The EBG structure is a combination of three concepts: DSM [18], DBgraph [17], and GDM [12]. It provides a compact data structure that maximizes the probability that the hot data set fits in main memory. Figure 3 shows how XML data are stored as EBGs (e.g. EBG1, EBG2). The left column is referred to as the source, the right column as the destination. An EBG is a set of non-oriented arcs between items that are either OIDs or values. EBGs contain either fixed-size item values or variable-size item values. Each value is stored only once, so data values are shared between OIDs when values belong to at least two different objects (e.g. OID5 and OID11 share the same value). Here, we do not include the description of the tag set for the values themselves. The semantic tagging issues are tackled under the Linguistic DS cooperation [10].

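[Figure 3: two EBGs (EBG1 and EBG2), each drawn as a two-column table with a source column and a destination column. One EBG holds OID-to-OID arcs; the rows legible in the figure are (OID0, OID1), (OID0, OID7), (OID1, OID2), and (OID1, OID6). The other holds arcs to values; the legible values are “The geography location is the Euro-Asiatic continent”, “C0001-im.jpeg”, “Teheran, Iran”, and “C0002.mpeg”, and the remaining legible labels are OID3, OID5, OID10, and OID11.]

Fig. 3. Extended Binary Graph Support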
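The properties of the EBG structure described in this section (non-oriented arcs, values stored only once and shared between OIDs) can be illustrated with a small in-memory stand-in. This is a sketch under assumptions, not the PHASME implementation: the `EBG` class and its methods are hypothetical, the OID-to-OID arcs are those legible in Fig. 3, and the choice of which value OID5 and OID11 share is an assumption made for the example.

```python
from typing import Dict, Hashable, List, Tuple


class EBG:
    """Hypothetical in-memory stand-in for one Extended Binary Graph."""

    def __init__(self) -> None:
        self.items: List[Hashable] = []          # destination items, each stored once
        self._position: Dict[Hashable, int] = {}  # item -> index in self.items
        self.arcs: List[Tuple[str, int]] = []     # (source OID, index of destination item)

    def add(self, source: str, destination: Hashable) -> None:
        pos = self._position.get(destination)
        if pos is None:
            # First occurrence: store the item; later arcs only reference it.
            pos = len(self.items)
            self.items.append(destination)
            self._position[destination] = pos
        self.arcs.append((source, pos))

    def destinations(self, source: str) -> List[Hashable]:
        """Follow arcs from the source column to the destination column."""
        return [self.items[pos] for src, pos in self.arcs if src == source]

    def sources(self, destination: Hashable) -> List[str]:
        """Follow the same arcs in the other direction (arcs are non-oriented)."""
        pos = self._position.get(destination)
        return [src for src, p in self.arcs if p == pos]


# Structural arcs legible in Fig. 3.
structure = EBG()
for src, dst in [("OID0", "OID1"), ("OID0", "OID7"), ("OID1", "OID2"), ("OID1", "OID6")]:
    structure.add(src, dst)

# Assumption: the value shared by OID5 and OID11 is the sentence repeated in Fig. 1.
shared = "The geography location is the Euro-Asiatic continent."
values = EBG()
values.add("OID5", shared)
values.add("OID11", shared)
```

With this layout, `values` holds two arcs but a single copy of the string, which is the space saving the text attributes to value sharing.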

Persistency is managed orthogonally to the data structure. In our case, any persistent index or persistent data structure is therefore stored directly inside EBGs. EBGs map the graph contents of documents into main memory. PHASME uses the mmap file mechanism, which makes it possible to have the same data image on disk and in memory. This mapping makes it possible to tune the granularity of the retrieval mechanism.

3.2 Execution Model

The PHASME query processor includes dynamic query optimization and execution optimization at run time, as described in [5]. PHASME processing is based on the many-sorted algebra approach, with query processing and optimization applied directly to the EBG structures. This provides a high-performance layer that can be customized according to workloads and users. Supporting the W3C XML algebra [21] is one avenue of improvement for our system. Here, we omit a technical presentation of EBG query processing and refer the interested reader to [4] for a comprehensive overview.

3.3 Indexing Support

PHASME XML indexing is based on the EBG structure. The indexing mechanisms are those available in PHASME as plug-ins, which provides a set of indexing mechanisms and strategies that can be selected according to the characteristics of the XML contents. Indexing support builds on the fast access offered by the EBG structure. It includes support for multi-dimensional indexing such as the SR-tree [2] and for signature file indexing [1]. We intend to extend the indexing to support the UB-Tree [13] and to improve this indexing accelerator according to EBG features. Although index processing in the context of EBGs differs from the traditional model for XML support, it makes it possible to tune and customize the use of indexing strategies according to the requirements and workloads of XML-based applications. The tuning/customization issue is another key point to be addressed by this project, in which knowledge mining and environment mining are two relevant domains to be investigated.

3.4 Qualitative Assessments

XML documents can be processed either in a pipeline-based way or in a set-based way. This section compares the alternatives. We assume that the intermediate results of document querying cannot fit in memory.

Definition: An EBG is a graph $G(X, A)$ where $X = (S, D)$ is the set of vertices of $G$ and $A$ is the set of edges of $G$; $S$ is the Source set and $D$ the Destination set. An edge is written $(s, d, S.k)$ with $s \in S$, $d \in D$, and sS.k = d, where $S.k$ represents the $k^{\text{th}}$ item of the Source set; sS.k iff s corresponds to the value in the Destination set.
Processing Complexity: Complex data operations are often expressed as a composition of traversal operations in order to optimize data access in the EBG structure. It has also been demonstrated in [8] that pipeline-based and set-based processing strategies are equivalent for graph-based operations. Depth-first search inside the EBG has a complexity of O(max(card(X), card(A))), which is similar to the breadth-first search complexity of O(card(A)). The processing strategies are chosen according to the index support and the management of intermediate results. Inside PHASME, intermediate results are either materialized or transferred in a pipeline. The main advantages of materialization are the ability to exploit shared common results and to optimize multiple accesses to the XML document set. Multi-query optimization is often seen as an NP-hard problem, so heuristics are necessary. This issue will be the subject of a specific investigation in the context of XML document sets.
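The two traversal strategies whose complexity is compared above can be sketched over a plain adjacency list built from EBG arcs. This is a generic textbook formulation, not the PHASME traversal operators; the function names are hypothetical, and the arcs are the OID-to-OID rows legible in Fig. 3. Each vertex is expanded once and each arc is examined a constant number of times, so both traversals run in time linear in card(X) + card(A).

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Tuple


def adjacency(arcs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Build an adjacency list; arcs are non-oriented, so both directions are stored."""
    graph: Dict[str, List[str]] = defaultdict(list)
    for source, destination in arcs:
        graph[source].append(destination)
        graph[destination].append(source)
    return graph


def depth_first(graph: Dict[str, List[str]], start: str) -> List[str]:
    """Iterative depth-first traversal; a vertex is marked when it is popped."""
    seen, order, stack = set(), [], [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # Push neighbours in reverse so the first neighbour is explored first.
        stack.extend(n for n in reversed(graph[node]) if n not in seen)
    return order


def breadth_first(graph: Dict[str, List[str]], start: str) -> List[str]:
    """Breadth-first traversal; a vertex is marked when it is enqueued."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return order


graph = adjacency([("OID0", "OID1"), ("OID0", "OID7"), ("OID1", "OID2"), ("OID1", "OID6")])
dfs_order = depth_first(graph, "OID0")
bfs_order = breadth_first(graph, "OID0")
```

The two orders differ only in when OID7 is reached: the depth-first traversal exhausts the subtree under OID1 first, whereas the breadth-first traversal visits both children of OID0 before descending.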

4 Conclusions

In this paper, we presented the data model and the execution model of the PHASME engine for efficient processing of XML multimedia documents. We showed the core of the approach: the mapping between XML multimedia documents and the Extended Binary Graph (EBG structure). We highlighted two main issues related to our research: tuning/customization and multi-query optimization. Moreover, we expect to provide a better understanding of XML processing in the context of main-memory information engines, storage, and indexing. Finally, the prototype will give the opportunity to benchmark performance on a very large-scale XML data set. As future work, we will focus on multi-dimensional indexing of XML multimedia documents and on multilingual support. This research project is carried out in cooperation with the Digital Silk Road Initiative of UNESCO.

References

1. Andres, F., Boulos, J., Ono, K.: Accessing Active Application-oriented DBMS from the World Wide Web. In Int’l Symp. on Cooperative Database Systems for Advanced Applications, pp. 232-234, Dec 1996
2. Andres, F., Dessaigne, N., Ono, K., Satoh, S.: Toward The MediaSys Video Search Engine (MEVISE). In Proc. of 5th IFIP 2.6 Working Conference on Visual Database Systems (VDB5), pp. 31-44, Fukuoka, Japan, May 10-12, 2000
3. Andres, F., Kawtrakul, A., et al.: NLP Techniques and AHYDS Architecture for Efficient Document Retrieval System. Proc. of NLPPRS’99, 5th Natural Language Processing Pacific Rim Symposium, Beijing, China, November 5-7, 1999
4. Andres, F., Ono, K.: Phasme: A High Performance Parallel Application-oriented DBMS. Informatica Journal, Special Issue on Parallel and Distributed Database Systems, Vol. 22, pp. 167-177, May 1998
5. Andres, F., Ono, K.: The Distributed Management Mechanism of the Active Hypermedia Delivery System platform. In Trans. on IEICE, Vol. E84-D, No. 8, pp. 1033-1038, August 2001
6. Bonifati, A., Ceri, S.: Comparative Analysis of Five XML Query Languages. ACM SIGMOD Record, 29(1), pp. 68-79, March 2000
7. Bray, T., Paoli, J., Sperberg-McQueen, C.M.: Extensible Markup Language (XML) 1.0. [http://www.w3.org/TR/REC-xml]
8. Gibbons, A.: Algorithmic Graph Theory. Cambridge University Press, 1985
9. Godard, J., Mangeot-Lerebours, M., Andrès, F.: Data Repository Organization and Recuperation Process for Multilingual Lexical Databases. Proc. of SNLP-Oriental COCOSDA 2002, pp. 249-254, Hua Hin, Prachuapkirikhan, Thailand, 9-11 May 2002
10. Hashida, K., Andres, F., Boitet, C., Calzolari, N., Declerck, T., Fotouhi, F., Grosky, W., Ishizaki, S., Kawtrakul, A., Lafourcade, M., Nagao, K., Riza, H., Sornlertlamvanich, V., Zajac, R., Zampolli, A.: Linguistic DS, ISO/IEC JTC1/SC29/WG11, MPEG2001/M7818
11. Kanne, C., Moerkotte, G.: Efficient Storage of XML Data. Proc. of 16th International Conference on Data Engineering (ICDE), p. 198, San Diego, California, February 28 - March 3, 2000
12. Kunii, H.S.: Graph Data Model and its Data Language. 1990
13. Ramsak, F., Markl, V., Fenk, R., Zirkel, M., Elhardt, K., Bayer, R.: Integrating the UB-Tree into a Database System Kernel. In Proc. of VLDB Conf. 2000, Cairo, Egypt, 2000
14. Shimura, T., Yoshikawa, M., Uemura, S.: Storage and Retrieval of XML Documents using Object-Relational Databases. Proc. of the 10th International Conference on Database and Expert Systems Applications (DEXA’99), Lecture Notes in Computer Science, Vol. 1677, Springer-Verlag, pp. 206-217, August-September 1999
15. Shin, D.: XML Indexing and Retrieval with a Hybrid Storage Model. Knowledge and Information Systems, 3, Springer-Verlag, pp. 252-261, 2001
16. The Dublin Core Metadata Initiative, [http://dublincore.org]
17. Thevenin, J.M.: Architecture d’un Systeme de Gestion de Bases de Donnees Grande Memoire. Ph.D. Thesis, Paris VI University, 1989
18. Valduriez, P., Khoshafian, S., Copeland, G.: Implementations techniques of Complex Objects. In Proc. of the International Conference of VLDB, pp. 101-110, Kyoto, Japan, 1986
19. Vianu, V.: A Web Odyssey: from Codd to XML. In Proc. of the ACM Symposium on Principles of Database Systems (PODS), 2001
20. W3C (1999): XML Path Language (XPath) [www.w3.org/TR/xpath]
21. W3C XML algebra [http://www.w3.org/TR/query-algebra/]
22. XMLdev [http://www.xml.org/xml/xmldev.shtml]
23. Yamane, Y., Igata, N., Namba, I.: High-performance XML Storage/Retrieval System. Fujitsu Sci. Tech. J., 36, 2, pp. 185-192, December 2000
