An application-oriented approach for hytime structured ... - Springer Link

An Application-Oriented Approach for HyTime Structured Document Management

Frederic Andres 1, John Buford 2, and Kinji Ono 1

1 National Center for Science Information Systems (NACSIS) - R&D Department, Otsuka, Bunkyo-ku, Tokyo 112, Japan, {andres,ono}@rd.nacsis.ac.jp
2 GTE Laboratories, 40 Sylvan Road, Waltham, MA 02254, USA, jb01@gte.com

Abstract. In this article, we point out the important functionality needed by emerging multimedia applications, such as hypermedia presentations or digital library retrieval systems, to prepare the next generation of database systems. Uniform management of hypermedia data is required to suit various kinds of applications: different data types, different data models, and varied data formats or I/O devices. DBMSs provide efficient data storage facilities but still lack customizability according to the target applications. Moreover, content-based and structure-based retrieval management are required by modern information retrieval systems. In order to combine the requirements of information retrieval systems and open DBMSs, we have implemented information retrieval functions inside the Application-Oriented DBMS Phasme. The document representation is either SGML or HyTime. SGML or HyTime documents are stored inside Phasme and are accessed using full-text retrieval functionality. These functions are implemented as Phasme plug-ins and are stored inside Phasme. The storage management of the documents is independent of the way the user application retrieves them. The developments achieved so far inside the AHYDS project (Active HYpermedia Delivery System), currently in progress at NACSIS, illustrate the chosen architecture design of the retrieval system. The performance of the current prototype is evaluated against a 40 GB document benchmark, showing that our approach yields excellent results.

1 Introduction

In recent years, multimedia information systems based on client/server architectures have integrated various kinds of user tools such as authoring tools, browsers, presentation tools, and animation tools. The information itself is also varied (e.g. various types, various formats). Meanwhile, hypermedia information relates multimedia data by linking them together and enables navigation through links. Existing systems do not yet provide all the functionality required by hypermedia information systems. For example, it is necessary to support content-based and structure-based retrieval [7] as well as database query mechanisms for hypermedia document management. Moreover, information retrieval issues are a key input for document management, as mentioned in [12]. Database services (concurrency control, recovery, versioning) are also very important for efficient multimedia management.

The contribution of this paper is twofold: first, we present the implementation of a hypermedia retrieval system using the Phasme DBMS; second, we evaluate the performance of this implementation and compare it to ORDBMS technology. We show how the hypermedia document description using HyTime [9] and the retrieval functions can be mapped inside the application-oriented system Phasme [3], a state-of-the-art parallel database micro-kernel. Phasme stores data in a uniform way, independently of the various kinds of user applications (different data formats, data types and semantics). Moreover, it provides traditional DBMS services such as persistency, concurrency control, recovery and versioning. The vertical customizability (from the data definition to the execution model) enables the Application-Oriented DBMS (AODBMS) to be tailored to the requirements of the application (e.g. index types, word matching algorithm).

This paper presents a system that performs information retrieval with high performance. The system translates the hypermedia document description (in HyTime) into the Phasme EBG data structure. Phasme is used as a back end for the information retrieval execution. The performance has been demonstrated using the Binary Document Management [2] benchmark on sets of hypermedia documents ranging from 5 GB to 40 GB. The data set and the queries of this benchmark illustrate the performance of our system.

Object-relational databases typically bring together relational table storage and query processing with the object orientation of object-oriented DBMSs. Object-relational systems with vertical customizability (adaptability from the data definition to the execution model) and uniform data storage will be able to adapt themselves dynamically to the application requirements and to provide high performance. The end goal of the AHYDS project is to demonstrate a system with all these features inside a hypermedia application. The work described in this paper is limited to document retrieval management.

The remainder of the paper is organized as follows. Section 2 describes the architecture of the AHYDS platform. Section 3 sketches technical aspects of the hypermedia delivery system, while Section 4 analyses the performance experiments and results. Related work is presented in Section 5. Finally, Section 6 summarizes and concludes the paper.

2 The Active HYpermedia Delivery System (AHYDS)

This section recalls the major requirements of hypermedia retrieval systems. It then discusses the management of SGML/HyTime documents within an AODBMS and, finally, describes the architecture of the AHYDS platform.

2.1 Requirements of Hypermedia Document Retrieval Applications

Hypermedia document management in many library applications requires the following features from the information server:

Support for structured document management. Such documents can include only text or text and media information such as audio, images, and moving pictures. Several document languages such as SGML, HyTime, or ODA have been developed in order to satisfy this requirement.

Support of DBMS services. Transaction control, data integrity, data sharing, and version control are required for handling a large amount of multimedia data in dynamic multi-user environments.

Uniform data management independently of the application data model. Uniform storage structure is appropriate to deliver information to different kinds of multimedia applications (presentational type, conversational type) based on different kinds of data models (relational, object oriented).

Durability against evolving standards. Several standard generalized markup languages have been proposed, and SQL/MM is now emerging to support full-text retrieval functions. The interface of the multimedia server must be flexible and customizable enough to support new languages.

Efficiency and quality of service for the data delivery. Combining functionality with efficiency and quality of service is mandatory in order to support both time-dependent and resolution-dependent media data.

2.2 Handling SGML/HyTime Documents within an AODBMS

We have chosen to integrate the information retrieval functionality inside the AODBMS for the following reasons: (1) it allows greater flexibility and extensibility with regard to the information retrieval system's requirements in terms of standards (e.g. information manipulation languages, information representation, application data models, data exchange); and (2) it provides high performance and customizability mechanisms for durable hypermedia systems. The advantages of such a coupling are a) structure-based and content-based query processing close to the stored data, b) a high degree of concurrency, and c) update management inside the DBMS. Such an integration also reduces the number of layers between the application and the document storage manager, which brings further advantages: a) full AODBMS functionality, e.g. vertical customizability of the kernel to support information retrieval functionality efficiently, and uniform storage management independent of the application's semantics; b) no overhead due to the traditional mapping between application data and database storage; and c) durability of the information system against the evolution of standards and hardware trends. The overall architecture is shown in Fig. 1.

Information about the SGML/HyTime support can be found in [16], which defines the framework to store and retrieve SGML and HyTime documents. A short overview is given in the next section; further details can be found at the web site http://www.rd.nacsis.ac.jp/~andres/db/ahys.html. One of the objectives of the AHYDS (Active Hypermedia Delivery System) project is to demonstrate the value of an application-oriented database framework for structured and hypermedia document storage. The data representation is based on the Extended Binary Graph (EBG) structure of the Phasme DBMS [3]. Each component of the document corresponds to an EBG. The EBGs corresponding to a document's elements make up a hierarchy. Leaf EBGs contain text or media information.

[Figure 1 (diagram not recoverable from extraction). Labels: applications; SQL/OQL, SGML/HyTime, and Phasme Interface Language interfaces; AODBMS Phasme with full-text plug-ins; vertical customizability over the many-sorted algebra (data definition, operation definition, query optimization, physical structure, execution model); document type, document creation, document manipulation, document suppression, document structure index, execution reactor.]

Fig. 1. System Architecture
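The hierarchical representation described in Section 2.2 (one EBG per document component, with text or media held in the leaves) can be pictured with a small tree. The following is only a minimal sketch under stated assumptions: the `Node` class, its fields, the sample element names, and the `media://` reference are hypothetical illustrations, not the Phasme EBG structure or its API.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for one EBG: a named document element."""
    name: str                                  # element name, e.g. "chapter"
    content: Optional[str] = None              # text or media reference (leaves only)
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children

    def leaves(self) -> Iterator["Node"]:
        """Yield leaf nodes in document order (depth-first)."""
        if self.is_leaf():
            yield self
        else:
            for child in self.children:
                yield from child.leaves()


# A made-up document: inner nodes carry structure, leaves carry content.
doc = Node("hydoc", children=[
    Node("title", content="Temples of Kamakura"),
    Node("chapter", children=[
        Node("paragraph", content="Kamakura has many temples."),
        Node("image", content="media://img/0001"),  # hypothetical media reference
    ]),
])

leaf_names = [leaf.name for leaf in doc.leaves()]
print(leaf_names)
```

Keeping content only in the leaves means a structural query can walk the inner nodes without touching text or media data, which is one plausible reading of why the hierarchy is described this way.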

2.3 Architecture of the AHYDS Platform

We applied the application-oriented approach to the SGML/HyTime framework described in the previous section. Integrating the IR functionality with the AODBMS requires designing and developing plug-ins that customize the Phasme DBMS vertically. On the one hand, hypermedia document retrieval processing is done inside the DBMS kernel, close to the data storage manager. On the other hand, the plug-in mechanism makes it possible to adapt the DBMS to the actual application requirements. The Active Hypermedia Delivery System allows the dynamic creation of new components of hypermedia documents. Combined structure-oriented and content-oriented queries are expressed in the user query language or according to the user application's requirements.

3 Technical Aspects of the Hypermedia Document Management

3.1 The HyTime plugins

SGML/HyTime support has been provided by the HyMinder Library[14]. SGML/HyTime documents are stored in the Phasme DBMS. Each SGML/HyTime element is stored as a SGML/HyTime item of Phasme. This enables hypermedia documents to be created and manipulated at various granularities. The hypermedia document plug-in is given in Fig. 2.

    Plug-ins HyDocument ;
    ITEM HyDocument
        ADD   hydocADD ;
        DEL   hydocDEL ;
        GET   hydocGET ;
        SIZE  hydocSIZE ;
        COMP  hydocCOMP ;
        INIT  hydocNULL ;
        INDEX hydocINDEX ;
        MEM   hydocMEM ;
        TOA   hydoc2a ;
        ATO   a2hydoc ;
    END HyDocument ;

Fig. 2. Hypermedia Document Plug-ins

The plug-in includes hypermedia document manipulation, conversion between the EBG format and ASCII format, and index management. For example, the operations hydocADD and hydocDEL respectively insert a document into the Phasme data structure and delete a document from it. The operation hydocGET retrieves a document or some parts of it from the database.
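The plug-in definition binds generic operation slots (ADD, DEL, GET, SIZE, and so on) to type-specific functions. A rough way to picture that binding is a dispatch table. The sketch below is an assumption-laden illustration only: the slot names follow Fig. 2, but the Python functions, the in-memory `store`, and the `call` helper are hypothetical and do not reflect how Phasme registers or invokes plug-ins.

```python
from typing import Callable, Dict

# In-memory stand-in for the document store (an assumption, not Phasme storage).
store: Dict[str, str] = {}


def hydoc_add(doc_id: str, text: str) -> None:
    store[doc_id] = text


def hydoc_del(doc_id: str) -> None:
    store.pop(doc_id, None)


def hydoc_get(doc_id: str) -> str:
    return store[doc_id]


def hydoc_size(doc_id: str) -> int:
    return len(store[doc_id])


# Slot names follow Fig. 2; the bound Python functions are hypothetical.
HYDOCUMENT_PLUGIN: Dict[str, Callable] = {
    "ADD": hydoc_add,
    "DEL": hydoc_del,
    "GET": hydoc_get,
    "SIZE": hydoc_size,
}


def call(slot: str, *args):
    """Dispatch a generic operation to the function registered for it."""
    return HYDOCUMENT_PLUGIN[slot](*args)


call("ADD", "d1", "<doc><title>Sample</title></doc>")
fetched = call("GET", "d1")
size = call("SIZE", "d1")
call("DEL", "d1")
```

With this shape, supporting a new item type amounts to supplying another table with the same slots, which matches the spirit of tailoring the kernel per application.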

3.2 Document Retrieval Queries

Document retrieval queries enable both structural and content-based access. To illustrate our approach, the document retrieval query is formulated as arguments of the plug-in operation hydocGET. The example queries are based on the many-sorted algebra of the Phasme EBG data structure. Here are two examples of query formulation:

"Select all the paragraphs of documents having an image and an audio summary about 'Austria Culture'":

    hydocGET(hydoc.paragraph, hydoc.image.isabout("Kamakura temples") and hydoc.audio.isabout("Austria Culture"));

"Select the title and the abstract of documents created in 1996 containing a paragraph about 'World Cup Soccer'":

    hydocGET((hydoc.title, hydoc.abstract), hydoc.datecreation(1996) and hydoc.paragraph.isabout("World Cup Soccer"));
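To make the shape of such a query concrete, the sketch below evaluates a projection plus a combined structural and content condition over a few in-memory documents. It is an illustration under stated assumptions: documents are flat dictionaries, `isabout` is approximated by a case-insensitive substring match, and the helper names (`created_in`, `all_of`, `hydoc_get`) and sample data are hypothetical. None of this reflects how Phasme evaluates hydocGET.

```python
from typing import Callable, Dict, List

Document = Dict[str, str]  # element name -> text (flat, for illustration only)
Predicate = Callable[[Document], bool]


def isabout(element: str, topic: str) -> Predicate:
    """Naive content predicate: case-insensitive substring match."""
    def predicate(doc: Document) -> bool:
        return topic.lower() in doc.get(element, "").lower()
    return predicate


def created_in(year: str) -> Predicate:
    """Structural predicate on the (hypothetical) datecreation element."""
    return lambda doc: doc.get("datecreation") == year


def all_of(*predicates: Predicate) -> Predicate:
    """Conjunction of predicates, mirroring the 'and' in the queries."""
    return lambda doc: all(p(doc) for p in predicates)


def hydoc_get(docs: List[Document], projection: List[str],
              condition: Predicate) -> List[Document]:
    """Return the projected elements of every document meeting the condition."""
    return [{name: doc[name] for name in projection if name in doc}
            for doc in docs if condition(doc)]


docs = [
    {"title": "Final report", "abstract": "Match summary",
     "datecreation": "1996", "paragraph": "Notes on World Cup Soccer qualifiers."},
    {"title": "Travel guide", "abstract": "Temples",
     "datecreation": "1996", "paragraph": "Kamakura temples in spring."},
]

result = hydoc_get(docs, ["title", "abstract"],
                   all_of(created_in("1996"),
                          isabout("paragraph", "World Cup Soccer")))
print(result)
```

Separating the projection from the condition keeps the structural part (which elements to return) independent of the content part (which documents qualify), as in the two example queries.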

4 Performance Experiments

This section describes the results of experiments that measure the efficiency of hypermedia document manipulation inside the AHYDS platform compared to a commercial ORDBMS.

4.1 Configuration

Our server platform was a SPARC 1000 with 128 MB of main memory and 8 SPARC processors. It has a RAID disk (level 0) with a capacity of 60 GB. The document set includes tags such as title, author, abstract, and keywords, and at least 3 chapters with image and video. Complete details are given in the full report [4]. The test bed used version 2.02 of the Phasme DBMS as the data storage manager of AHYDS. The client test software simply retrieves documents from the server.

4.2 Performance results

Performance evaluation of the response time

The first set of measurements evaluates the performance of the access method for hypermedia documents in the context of text retrieval. Fig. 3 shows the average elapsed time of document retrieval queries as the number of documents varies from 10,000 to 200,000. For each pair of measurements, the hypermedia document identifier is chosen at random, and the same hypermedia application is run on the AHYDS platform and on the ORDBMS platform. The average size of a document is 0.2 MB. The largest document set managed inside both the AHYDS platform and the ORDBMS is 40 GB.
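The 40 GB figure follows from the largest document count and the average document size quoted above. The short check below assumes decimal units (1 GB = 1,000 MB), which the text does not state explicitly; the constant names are illustrative.

```python
# Figures from the text: 200,000 documents averaging 0.2 MB each.
# Assumption: decimal units (1 MB = 1,000 KB, 1 GB = 1,000,000 KB).
NUM_DOCUMENTS = 200_000
AVG_DOC_SIZE_KB = 200  # 0.2 MB expressed in KB to stay in integers

total_kb = NUM_DOCUMENTS * AVG_DOC_SIZE_KB
total_gb = total_kb // 1_000_000
print(total_gb)  # 40
```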

[Figure 3 (plot not recoverable from extraction): elapsed time versus number of documents, with one curve for the AHYDS platform and one for the ORDBMS platform.]

Fig. 3. Performance Evaluation of the AHYDS DBMS and an ORDBMS.

Intra-operation parallelism

Fig. 4 shows the influence of the intra-operation parallelism of the Phasme DBMS on document retrieval access. The database holds 200,000 documents. This experiment was run on a SUN ULTRA 2 workstation. We varied the number of threads from 1 to 8. The set of hypermedia documents has been vertically clustered, so that the set of threads can be dynamically associated with the set of hypermedia document buckets. The intra-operation parallelism mechanism of the execution model considerably improves the performance of document retrieval.

[Figure 4 (plot not recoverable from extraction): elapsed time versus number of threads (1, 2, 4, 8) for the AHYDS platform.]

Fig. 4. Hypermedia Retrieval and Intra-operation Parallelism.

Multi-threading with more than 4 threads produces an overhead because of the small number of processors: threads must be swapped across the processors. Some improvements will be introduced by using a pattern programming approach.
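The idea of associating threads with document buckets can be sketched as one retrieval operation split into per-bucket tasks on a thread pool. This is a minimal illustration, not the Phasme execution model: the modulo partitioning is a simple stand-in for the clustering described in the text, and the function names and sample data are hypothetical. Note also that in CPython, threads do not speed up CPU-bound work, so the sketch shows the structure of the technique rather than its performance.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

Bucket = Dict[int, str]  # document id -> content


def make_buckets(docs: Dict[int, str], n: int) -> List[Bucket]:
    """Partition documents into n buckets by id (a stand-in for clustering)."""
    buckets: List[Bucket] = [{} for _ in range(n)]
    for doc_id, content in docs.items():
        buckets[doc_id % n][doc_id] = content
    return buckets


def search_bucket(bucket: Bucket, term: str) -> List[int]:
    """Scan a single bucket and return the ids of matching documents."""
    return sorted(doc_id for doc_id, text in bucket.items() if term in text)


def parallel_search(buckets: List[Bucket], term: str, num_threads: int) -> List[int]:
    """Run one retrieval operation as one task per bucket on a thread pool."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partial = pool.map(lambda b: search_bucket(b, term), buckets)
        return sorted(i for part in partial for i in part)


# Made-up data: every third document mentions "soccer".
docs = {i: ("soccer" if i % 3 == 0 else "temple") for i in range(12)}
buckets = make_buckets(docs, 4)
hits = parallel_search(buckets, "soccer", num_threads=4)
print(hits)
```

Because the number of buckets is independent of `num_threads`, the same partitioning can be reused while the thread count varies, which mirrors the experiment of running 1 to 8 threads over a fixed document set.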

User scalability and elapsed time

An important issue is scalability: how an increase in the number of users affects the elapsed time needed to receive a complete document. Fig. 5 shows the variation of the performance (elapsed time) of the AHYDS platform as the number of users increases. We verified that the system supports the increase in users. The ORDBMS initially improves as the number of users increases, due to caching effects. Its elapsed time then increases once caching no longer offsets the growing number of users.

[Figure 5 (plot not recoverable from extraction): elapsed time versus number of users (10 to 150), with one curve for the AHYDS platform and one for the ORDBMS platform.]

Fig. 5. Influence of the Number of Users on the Elapsed Time

User scalability and number of documents delivered per second

An important issue is how the number of documents a delivery system delivers varies with the number of users querying the system.

[Figure 6 (plot not recoverable from extraction): number of documents delivered per second versus number of users (up to 250), with one curve for the AHYDS platform and one for the ORDBMS platform.]

Fig. 6. Evaluation of the Number of Documents Delivered per Second

Fig. 6 shows the variation of the number of documents delivered per second for both the AHYDS platform and the ORDBMS-based platform as the number of users increases. The ORDBMS platform does not scale when the number of users increases. Our AHYDS platform supports the increase in users even when the number of users exceeds 200.

5 Related Works

Quite a number of research prototypes of hypermedia database management systems [1, 6, 8, 11, 13, 15] have been developed in recent years, mostly on OODBMS platforms. As an example of commercial development, the Oracle Universal Server [13] can be extended through so-called "Data Cartridges", which are manageable objects. The Oracle full-text retrieval facility (ConText Cartridge) has been provided to extend traditional RDBMSs to manage text within a document together with structured data. The management of hypermedia documents including text, images and video still lacks flexibility and efficiency. Large data sets (over 100 GB) will be used in the future in order to stress the hypermedia document management. In the field of extensibility, ORDBMSs [10] provide a new approach to supporting user-defined data types using DataBlade modules, but the integration between modules and database services is still very loose. The Phasme DBMS integrates user-defined modules in a vertical, tightly coupled way.

6 Conclusion

In this paper, we have described an information delivery system that integrates the functionality of an information retrieval system based on non-query languages (SGML, HyTime) inside the Phasme Application-Oriented DBMS. We have pointed out the problems that arise when traditional DBMSs are used. Further, we believe that our approach is flexible for the following reasons: (1) uniform data management independent of the language or of the text representation standards; (2) customizability of the information server according to different users' tools; (3) plug-in-based hypermedia document management. The performance evaluations indicate that a DBMS kernel based on application-oriented capabilities can be scaled to manage documents and to support information retrieval. The following issues remain for future investigation: integration of an OQL-based query language for HyTime information retrieval with the AHYDS platform, and relevance feedback on very large data sets (over 100 GB of data). Very large database experiments are being carried out within the Text REtrieval Conference (TREC-7) evaluation for text retrieval processing. Beyond that, we are studying dynamic optimization rules to provide high performance for full-text retrieval. Finally, the management of emerging representation languages such as XML, full-text SQL, or SMIL is being investigated as an extension of the current retrieval and delivery system.

Acknowledgement We thank TechnoTeacher for their support and help in the implementation of the HyTime plug-ins for the Phasme DBMS.

References

1. Andrews K., Kappe F., and Maurer H. "The Hyper-G Network Information System", Information Processing and Management, Special Issue: Selected Proceedings of the Workshop on Distributed Multimedia Systems, Graz, Austria, (1994)
2. Andres F., and Asada K. "The BDM Benchmark: an Example of Use on G-BASE", IEICE'95, Fukuoka, Japan, (1995)
3. Andres F., and Ono K. "Phasme: A High Performance Parallel Application-oriented DBMS" in Informatica Journal, Special issue on Parallel and Distributed Database Systems (1997)
4. Andres F., and Ono K. "An Application-oriented Approach for HyTime Structured Document Management" NACSIS report, (1998)
5. Buford J.F., and Rutledge L. "Third Generation Distributed Hypermedia Systems" in Multimedia Information Management Handbook (ed. W. Grozky), Prentice Hall, (1997)
6. Christophides V., Abiteboul S., Cluet S., and Scholl M. "From Structured Documents to Novel Query Facilities", in Proceedings of the ACM SIGMOD International Conference on Management of Data, (1994) 313-324
7. DeFasio S., Daoud A., Smith L.A., and Srinivasan J. "Integrating IR and RDBMS using cooperative indexing" in Proc. of the 18th Annual Int. SIGIR Conf. on Research and Development in Information Retrieval, (1995) 84-92
8. Hara Y., Hirata K., Takano H., and Kawasaki S. "Hypermedia Database "Himotoki" and Its Application" in Proc. Int'l Conference on ICDE, (1996) 372-379
9. HyTime ISO/IEC 10744, (1997)
10. "Developing DataBlade Modules for Informix-Universal Server" White Paper, (1998)
11. Kim H., Zhoo Z., Shin H., and Chang J. "An Object-oriented Hypermedia System for Structured Documents" in Proc. Pacific DBMS'96, Hong Kong, (1996) 286-295
12. Kowalski Gerald "Information Retrieval Systems Theory and Implementation", Kluwer Academic Publishers, second printing, (1998)
13. "Managing Text with Oracle8 ConText Cartridge", An Oracle White Paper, (1997)
14. Ozsu M. T., Iglinski Paul, Szafron Duane, El-Medani Sherine, and Junghanns Manuela, "An Object-oriented SGML/HyTime Compliant Multimedia Database Management System", in Proc. 5th International Multimedia Conference (ACM Multimedia 97), Seattle, WA, (1997) 239-249
15. Volz M., Aberer K., and Bohm K. "An OODBMS-IRS Coupling for Structured Documents" in Proc. IEEE ICDE, (1996) 10-19
16. Technoteacher "HyMinder User Guide", (1996)
