PuntStore: A Non-relational Database Based Architecture of Data Management in Digital Library

Chao Lan, Yong Zhang, Yang Gao, and Chunxiao Xing

Research Institute of Information Technology
Tsinghua National Laboratory for Information Science and Technology
Department of Computer Science and Technology, Tsinghua University, Beijing, China
{lanc11,yang-gao10}@mails.tsinghua.edu.cn, {zhangyong05,xingcx}@tsinghua.edu.cn
Abstract. In the big data era, the traditional architecture of digital libraries must handle massive, complex, heterogeneous, and continuously changing data. This paper proposes PuntStore, a novel architecture based on the NoSQL database PuntDB. PuntStore improves on traditional digital resource management systems in several respects, offering more flexible and efficient metadata management, digital content storage, and label management. PuntStore has been deployed successfully in the Chinese Science and Technology History library, where it solved the problem of managing heterogeneous and complex metadata. Test results show that PuntStore could be an effective solution for similar application scenarios.

Keywords: Digital library, DL, NoSQL, Digital Infrastructure, Architecture.
H.-H. Chen and G. Chowdhury (Eds.): ICADL 2012, LNCS 7634, pp. 357–358, 2012. © Springer-Verlag Berlin Heidelberg 2012

Digital libraries have brought new ways of spreading knowledge over the past few decades. A large amount of resources has been transferred from traditional preservation media to digital library forms. Both academia and industry have raised many questions about how to provide services and long-term preservation for the massive amounts of digital resources held in digital libraries. These problems are especially critical in the Big Data era [1], and new data management architectures need to be designed to meet the challenge.

To address these issues, this paper presents a digital resource management architecture called PuntStore, built on the non-relational database PuntDB. A new service model called the Message Object (MO) system was designed to make the architecture distributed. PuntStore is optimized for unstructured data storage, distribution, and heterogeneous metadata support. PuntDB improves the flexibility of data storage and management by using a new data management model, and it implements a variety of indexing mechanisms for both structured and unstructured data. PuntStore supports a wide range of data queries and offers flexible choices for different types of data. PuntStore is a generic digital resource management system and has been successfully deployed in the History of Chinese Science and Technology library.
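The idea of storing heterogeneous metadata without a fixed schema, with optional per-field indexes, can be illustrated with a minimal sketch. This is not PuntDB's API, which the paper does not describe: the class `ToyDocumentStore`, its methods, and the sample records are all hypothetical and serve only to show how records with different field sets might coexist in one store.

```python
from collections import defaultdict


class ToyDocumentStore:
    """Hypothetical in-memory schema-free store (an illustration, not PuntDB)."""

    def __init__(self):
        self._docs = {}      # doc_id -> metadata dict
        self._indexes = {}   # field name -> {value -> set of doc_ids}

    def put(self, doc_id, metadata):
        """Store a record; each record may carry a different set of fields."""
        if doc_id in self._docs:
            self._unindex(doc_id)
        self._docs[doc_id] = dict(metadata)
        for field, index in self._indexes.items():
            if field in metadata:
                index[metadata[field]].add(doc_id)

    def _unindex(self, doc_id):
        """Remove the old version of a record from every existing index."""
        old = self._docs[doc_id]
        for field, index in self._indexes.items():
            if field in old:
                index[old[field]].discard(doc_id)

    def create_index(self, field):
        """Build an equality index on one field (values must be hashable)."""
        index = defaultdict(set)
        for doc_id, doc in self._docs.items():
            if field in doc:
                index[doc[field]].add(doc_id)
        self._indexes[field] = index

    def find(self, field, value):
        """Return sorted ids of records whose `field` equals `value`."""
        if field in self._indexes:
            # Indexed path: one dictionary lookup.
            return sorted(self._indexes[field].get(value, set()))
        # Unindexed path: scan every record.
        return sorted(
            doc_id
            for doc_id, doc in self._docs.items()
            if field in doc and doc[field] == value
        )


# Made-up sample records with different field sets.
store = ToyDocumentStore()
store.put("item-1", {"type": "book", "title": "Example treatise", "dynasty": "Ming"})
store.put("item-2", {"type": "image", "title": "Example rubbing", "format": "tiff"})
store.put("item-3", {"type": "book", "title": "Example manual"})

store.create_index("type")
print(store.find("type", "book"))
print(store.find("format", "tiff"))
```

Records that lack a field are simply absent from that field's index, so adding a new kind of resource does not require changing a schema.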
We implemented and tested PuntStore on top of PuntDB, which was optimized for the needs of digital libraries. Fig. 1 shows the data import time for PuntDB and MySQL. Fig. 2 shows the time efficiency of indexing data. Fig. 3 shows the query time of PuntDB and MySQL with and without an index.
Fig. 1. Time efficiency of metadata import (x-axis: Data Set (10K); y-axis: Import time (s); series: PuntDB, MySQL)

Fig. 2. Time efficiency of indexing (x-axis: Data Items (10K); y-axis: Indexing time (s); series: PuntDB, MySQL)

Fig. 3. Query time of structured data (x-axis: Data Set (10K); y-axis: Query time (s); series: MySQL without Index, PuntDB without Index, MySQL with Index, PuntDB with Index)
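A with-index versus without-index query measurement of the kind reported in Fig. 3 might be set up as in the sketch below. It is an assumption-laden illustration, not the authors' test harness: it uses Python's standard-library `sqlite3` as a stand-in for a relational database (the paper used MySQL), and the table layout, row count, and synthetic data are invented for the example. Absolute timings depend on the machine and are not comparable to the paper's figures.

```python
import sqlite3
import time


def build_table(n_rows):
    """Create an in-memory table filled with synthetic metadata rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE metadata (id INTEGER PRIMARY KEY, title TEXT, category TEXT)"
    )
    rows = ((i, f"title-{i}", f"cat-{i % 1000}") for i in range(n_rows))
    conn.executemany("INSERT INTO metadata VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def timed_query(conn, category):
    """Return (matching row count, elapsed seconds) for one equality query."""
    start = time.perf_counter()
    count = conn.execute(
        "SELECT COUNT(*) FROM metadata WHERE category = ?", (category,)
    ).fetchone()[0]
    return count, time.perf_counter() - start


conn = build_table(100_000)

# First run: no index on `category`, so the query scans the whole table.
count_scan, t_scan = timed_query(conn, "cat-7")

# Second run: same query after adding an index on the filtered column.
conn.execute("CREATE INDEX idx_category ON metadata (category)")
count_idx, t_idx = timed_query(conn, "cat-7")

print(f"without index: {count_scan} rows in {t_scan:.6f} s")
print(f"with index:    {count_idx} rows in {t_idx:.6f} s")
```

Both runs must return the same row count. Only the elapsed time should differ, and repeating the measurement over growing data sets would give curves of the shape plotted against data set size.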
This paper brought the idea of non-relational database storage to a specific digital library project and developed a practical digital library system architecture and storage solution, exploring one direction for future digital library construction. Compared with relational database storage, the non-relational form is much closer to real-life data and allows the data to be expressed more flexibly.
Reference

1. Rochwerger, B., Breitgand, D., Levy, E., Galis, A., Nagin, K., Llorente, I.M., et al.: The reservoir model and architecture for open federated cloud computing. IBM Journal of Research and Development 53(4), 4:1–4:11 (2009)