Metadata-based Information Resource Integration

1 downloads 0 Views 3MB Size Report
[5] Nassis V, Rajugan R, Dillon TS, Rahayu W. Conceptual Design of XML Document Warehouses. Lecture Notes in Computer Science. 2004; 3181: 1-14.
Available online at www.sciencedirect.com

Procedia Computer Science 17 (2013) 54 – 61

Information Technology and Quantitative Management (ITQM2013)

Metadata-based information resource integration for research management Zhilong Chen a, Dengsheng Wu ub,*, Jingxiu Luc, Yuanping Chenc a

College of Engineering and Information Technology, University of Chinese Academy of Sciences, Beijing 100049, China b Institution of Policy and Management, Chinese Academy of Sciences, Beijing 100190, China c Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China

Abstract With the continuous development of information technology, information data for research management show s an explosive growth. However, it is not an effective integration method which leads to the phenomenon of the islands of information. Various types of data exchange technology can integrate heterogeneous data resources , and provide data support for comprehensive data utilization and analysis. This research proposes the data integration framework and technology based on metadata. By taking the information resource for research management as a case, the research also analyzes the information resources integration for research management.

© 2013 The Authors. Published by Elsevier B.V. Selection and peer-review under responsibility of the organizers of the 2013 International Conference on Information Technology and Quantitative Management Keywords: metadata; information resource; data integration

1. Introduction The research management informat ion system is used to supervise research processes and relevant elements for research organizations. In recent years, with the continuous imp rovement of technology and the increasing research management business, research management departments at all levels have already established a batch of large scale databases with different types of resources and disciplines, which form a large, distributed, heterogeneous and diverse database group. The informat ion resources for research management are generated by different aspects and departments in the management activ ities of research institutions. Since information resources are characteristic of decentralized-managed and separated-stored, it is very difficult to share and manage the information resources for research management. With the launching of dig ital research project, digitized, electronic and networked research resource information will have a rapid accu mulation and growth. Therefore, it becomes urgent problems that how we can make fu ll use of, share, exchange and integrate these

* Corresponding author. Tel.:86-10-59358813 E-mail address: [email protected] 1877-0509 © 2013 The Authors. Published by Elsevier B.V. Selection and peer-review under responsibility of the organizers of the 2013 International Conference on Information Technology and Quantitative Management doi:10.1016/j.procs.2013.05.009

Zhilong Chen et al. / Procedia Computer Science 17 (2013) 54 – 61

huge amounts of exp loding research management information resources in order to better serve the scientific research personnel, research process and research management [1]. Information integration is the answer to these problems. Cu rrently, the methods of information resource integration main ly include data conversion, data federation and data warehouse [2]. The method of data conversion is a kind of schema mapping between databases by a conversion tool, which copies and converts the data from one database to another [3]. The data conversion method is a traditional method of data integration. Its operation is relatively simple, but the cost is high and the data scalability and security are poor. The method of data federation realizes data integration through the establishment of the federal access to data sources rather than move data into the central repository [4]. It does not need to create a new server and move large amounts of data to create a data warehouse. However, it is only suitable for the carefully -selected data subset, and it also needs to consider the network bandwidth, source application performance, data quality etc. Besides, it cannot clear the errors and inconsistencies in data. The method of data warehouse adds a layer of data between the client and data sources to store the data from data sources and provides inquires for data warehouse [5, 6]. It makes the original d istributed application system s till keep independent action and does not destroy the original application architecture, so it is more suitable for a large number of heterogeneous data migrations, and can integrate a variety of data sources. By co mparative analysis of above three data integration methods, the data warehouse method is more suitable to solve these problems. Th is research utilizes metadata to extract, converse and load data for research management informat ion resources, establishes data warehouse, realizes the structured and unstructured data integration, and gives a unified view form to achieve the sharing of the heterogeneous dat a. This research also conducts the study of research management informat ion resource integration based on metadata, which will help provide a reference for it. 2. Scope and characteristics of information resource for research management 2.1. The scope of information resource for research management The source of informat ion resource for research management is very wide. It mainly includes research project data, research output data, research personnel data, etc. The scope of information resource for research management is shown in Table 1. T able 1. Scope of information resource for research management Research project data

Research output data

Research personnel data

project name

research papers

name

start time

patents

title

end time

books

degree

targets

standards

age

host

software copyright

gender

commitment unit

scientific and technological achievements

hometown

fund

new products

salary

2.2. The characteristics of information resource for research management (1) Heterogeneity The information resource for research management is heterogeneous. It includes structured data and

55

56

Zhilong Chen et al. / Procedia Computer Science 17 (2013) 54 – 61

unstructured data. Structured data is main ly the database data and unstructured data main ly refers to electronic documents, audio, static images, printing types, games and microforms. (2) Large quantities of data The source of information resource for research management is very wide and the data update s ext remely fast, making the data grow exp losively. Therefore, the data size of information resource for research management is quite large. (3) Dispersibility The informat ion resource for research management co mes fro m d ifferent organizations and departments, and the management of relevant organizations is decentralized, so the storage of data is very scattered. The information resource for research management is especially dispersed. The above characteristics of informat ion resource for research management make it difficult to realize the sharing of the informat ion resources. In order to take advantage of these valuable information resources and serve the research personal, research process and research management better, the informat ion resource for research management should be integrated. 3. The integrated model of information resource for research management The informat ion integration refers to organically integration of the heterogeneous data from different sources, in different forms and with different characteristics in logical or physical levels to eliminate the differences of the various data sources and uniformly make the heterogeneous data visualized and shared [7]. 3.1. Metadata-based information resource integration framework The information integration process includes data preparation, ETL (Extraction, Transformat ion, Loading) and data usage. Each node of the whole process is correlated and unified. Data preparation is the first stage of data integration. This stage is responsible for collecting data fro m different data sources and unifying the data. In the process of collecting data, we need to consider the different data storage mode and the quality assurance in data transmission. Reliable data acquisition process ensures that the data will enter the next process. ETL refers to the process that foundation database, metadata database and subject database are formed after data source reference metadata is standardized [8]. Data utilization refers to the process that we call foundation database and subject database through the algorith m model to display the data by statements, GIS maps or other visual forms. Figure 1 shows the architecture figure of data integration. 3.2. Metadata-based information resource integration technology Metadata-based informat ion resource integration technology mainly includes: co mbing the source of the informat ion resources, the construction of the metadata standard, ETL operations, resource catalog system construction and so on. (1) Combing the source of the information resources The data is the cornerstone of the information system and the quality of the data is the guarantee of a successful informat ion system. In order to ensure the quality of the data, useful informat ion system data sources should be selected to ensure the accuracy and completeness of the data (2) The construction of the metadata standard Metadata is the data about other data, or the structured data that is used to provide the information of some sources. Metadata is usually used to describe the contents, coverage, quality, management, the owner of data and other related informat ion of elements, data set or data set series, which is an important means of achieving [9].

Zhilong Chen et al. / Procedia Computer Science 17 (2013) 54 – 61

At present, the metadata standard has been established both at home and abroad and the Dublin core metadata is a widely-accepted one. In this research, the Chinese Academy of Sciences Scientific Database Core Metadata is proposed. The construction of the proposed metadata standard references metadata standards at home and abroad and also considers the actual situation of information resources. (3) ETL operation ETL process includes data extracting, transformation and loading [10, 11]. Data extraction refers to the process of source data capture, which mainly involves the scattered data from each business system and different geographic positions. It needs to plan required data sources and data definition , formu lation and extraction methods based on the thoroughly understanding of data definition. Data transformation refers to conversing and cleaning the extraction data according to the pre-designed rules, dealing with some redundant, amb iguous and incomplete data, unifying the data size and making the originally heterogeneous data format unified. Data loading means loading the converted data into the data warehouse in planned iincremental or total quantity. (4) Information resource directory sys temconstruction In order to organize, maintain and use the informat ion resources, we need to build resource directory system. Resource directory system is used to classify f the resources effectively. The classification includes subject classification, o rganizat ion classification, business classification, resources form classification, resource use classification and so on. According to the characteristics of the resources, one or several effective classification approaches are chosen to set up resource directory system.

Fig. 1. T he architecture figure of data integration

4. Research information resources integration analysis based on the Chinese Academy of Sciences ARP The Ch inese Academy of Sciences Resource Planning pro jects (ARP) is the info rmation system engineering to realize resource planning for the Chinese Academy of Sciences. The ARP pro ject aims at constructing an effective in formation management and service platform on which management process of human resources, funds, scientific research conditions and so on are integrated and optimized, by using innovative management

57

58

Zhilong Chen et al. / Procedia Computer Science 17 (2013) 54 – 61

philosophy and advanced information technology. The project is supposed to be a scientific project and management oriented, and it should be in accord with the Academy-Institute two level structure of Chinese Academy of Sciences. The ARP p roject further pro motes the Chinese Academy of Sciences management innovation, constantly enhances the management level and efficiency, and boosts the scientific and technological innovation. After nearly 10 years construction in ARP engineering, ten application systems and two application platforms have been established in the CAS. Ten application systems include: scientific research projects, human resources, comprehensive financial, scientific research conditions, basic construction, CAS and local cooperation, international cooperation, intellectual property, assessment evaluation and educational resources. Two major application platforms are unified portal and public affairs management platform and information resource center. 5HVHDUFK SURMHFWV

+XPDQ UHVRXUFHV

&RQVROLGDWHG ILQDQFLDO

5HVHDUFK FRQGLWLRQV

EDVLF &$6 DQG ORFDO ,QWHUQDWLRQDO FRQVWUXFWLRQ FRRSHUDWLRQ FRRSHUDWLRQ

,QWHOOHFWXDO SURSHUW\ ULJKWV

$VVHVVPHQW HYDOXDWLRQ

(GXFDWLRQDO UHVRXUFHV

Unifi f ed porrtal and public afffairs managemen nt platfo f rm Unifi f ed portal, documeent management,, records manageement, governmeent info f rmation, news and propag ganda, meeting management, sch hedule managem ment, info f rmation n dissemination, Service publicly, Institute of publiic aff ffairs

7KH DSSOLFDWLR Q RI WKH JUDGXDWH VFKRRO

3URMHFW 0DQDJHPHQW

'HSOR\PHQW SODQ

%XGJHW PDQDJHPHQW

5HVXOWV EDVHG PDQDJHPHQW

6DODU\ PDQDJHPHQW

3D\PHQW PDQDJHPHQW

2UJDQL]DWLR Q PDQDJHPHQW 

)LQDO DFFRXQWV PDQDJHPHQW 

2XWSXW REMHFWV 

3URMHFW DSSURYDO 7KH 3URMHFW DSSOLFDWLR DSSOLFDWLRQ Q 2I UHVHDUFK 3URMHFW LQVWLWXWH LPSOHPHQWDW LRQ 

3HUVRQQHO 0DQDJHPHQW +XPDQ 5HVRXUFHV

&RPSHQVDWLR Q 0DQDJHPHQW 

*HQHUDO OHGJHU PDQDJHPHQW 5HFHLYDEOHV SURFHVVLQJ &RSH ZLWK PDQDJHPHQW 

(TXLSPHQW SODQ 6FLHQWLILF UHVHDUFK DQG PDQXIDFWXUL QJ PDQDJHPHQW *RYHUQPHQW SURFXUHPHQW 

$VVHW PDQDJHPHQW 0DWHULDO PDQDJHPHQW ,QYHQWRU\ PDQDJHPHQW 

7KH SDUN SODQQLQJ ,QIUDVWUXFW XUH SURMHFWV 0RQH\ PDQDJHPHQW 

&RRSHUDWLYH FRQVWUXFWLR Q UHVHDUFK LQVWLWXWH &RRSHUDWLYH UHVHDUFK &RRSHUDWLRQ PDQDJHPHQW 

3URSHUW\ PDQDJHPHQW /DQG PDQDJHPHQW &RQVWUXFWLR Q PDQDJHPHQW 

&RRSHUDWLRQ SURJUDP 7KH LQWHUQDWLRQ DO FRQIHUHQFH 5HZDUG PDQDJHPHQW 

9LVLW PDQDJHPHQW ([WHUQDO WUDQVIHU $FKLHYHPHQW V LQWR

3URSHUW\ PDQDJHPHQW

9LVLWRU PDQDJHPHQW ,QWHUQDWLRQ DO FRRSHUDWLRQ

3URSHUW\ ULJKWV LVVXH 2XWSXW PDQDJHPHQW 

3URSHUW\ ULJKWV GHFODUDWLRQ 3URSHUW\ ULJKWV WUDQVIRUPDW LRQ &RQYHUVLRQ 7UDFNLQJ

%DVLF GDWD 6HWWLQJ DVVHVVPHQW

753

([SHUWV DVVHVVPHQW 

6HOI $VVHVVPHQW 'LVFLSOLQDU \ DQDO\VLV

(GXFDWLRQ 0DQDJHPHQW

6WUDWHJLF $QDO\VLV

,QIRUPDWLRQ 5HVRXUFH &HQWHU ,5& &RPSUHKHQVLYH VWDWLVWLFV TXHU\ UHSRUWV GDWD DQDO\VLV SURMHFW PRQLWRULQJ FRQWHQW PDQDJHPHQW

Fig. 2. T he structure of Chinese Academy of Sciences ARP

Ten systems and two platforms of A RP have been accumulated a large nu mber of data resources on scientific research management activities. In order to further mine and use these resources to support scientific research management and scientific decisions, we need to integrate all the scientific res earch information resources. Metadata-based information resource integration for research management is adopted to realize resource integration. 4.1. The carding of the source of the Chinese academy of sciences research management information resources The research management information resources of the Chinese Academy of Sciences main ly co me fro m the ARP system, Chinese Academy of Sciences Network Science Popular Alliance, Science Times, Ch inese Academy of Sciences Websites and external resources. The sources of scientific research management information resource data are shown in Table 2.

Zhilong Chen et al. / Procedia Computer Science 17 (2013) 54 – 61

T able 2. T he sources of scientific research management information resources data No.

Data Sources

Data Content

Data Introduction

1

ARP system

multiple aspects of research management elements

relate to various research and management elements, such as human resources, research projects and the consolidated financial

2

Chinese Academy of Sciences Network Science Popular Alliance

popular science

popular science data

3

Science T imes

research news

the research news data and Web of Science data

4

Chinese Academy of Sciences Websites

multiple aspects of research management elements

overview of research institutions, news reports, achievements in scientific research, the research team, international exchange data

5

National Natural Science Foundation of China

scientific research and talent

natural Science Foundation project data, important project moderator or expert group data

6

State Intellectual Property Office

patent

the data of invention patents, utility model patents and design patents

7

National Science Library

books

Chinese and foreign language books data

8

Web of Science Index System

foreign papers

highly cited SCI, SSCI papers, ordinary SCI, SSCI papers, IST P papers

9

T he China Scientific literature service system

Chinese paper

Chinese Journals

10

Elsevier, springer database

foreign papers and editorial board

SCI and publications editor, deputy editor, SCI included highend publication's editorial board

11

National Office for Science and T echnology Awards

science and technology awards

the country five awards: State Supreme Science and T echnology Award, the National Natural Science Award, the State T echnological Invention Award, the National Science and T echnology Progress Award, the International Science and T echnology Cooperation Award Data

4.2. Establishment of metadata standard The metadata construction of scientific research management in formation resource is mainly based on the Co mbined with the actual situation of information resources, some necessary adjustments and expansionsare were made for the existing metadata standards [12, 13]. For clearer explanations of each entity and elements of the metadata, the metadata standard gives ten characteristics: Ch inese name, English name, logo, definit ion, nature, conditions, maximu m occurrence number, data type, range and notes. The metadata of scientific research management informat ion resource include s six non-reusable main subsets and one reusable secondary subset. Six non-reusable main subsets are content information, establishing informat ion, data quality information, form informat ion, security management info rmation and metadata reference in formation. The reusable secondary subset is contact Information. The metadata of scientific research management information resource are shown in Table 3.

59

60

Zhilong Chen et al. / Procedia Computer Science 17 (2013) 54 – 61

T able 3. T he metadata of scientific research management information resource No.

Name

Definition

Some elements

1

content information

2

establishing information

3

data quality information

name, URI, keywords, introduction, source, and languages the responsible department, production department, maintenance department data log, quality report

4

form information,

the basic information of title, theme, description, source, languages, coverage, date about the data set information such as personnel, organizations and business related to generation, management, maintenance and use of data sets the overall evaluation of the data quality of the data set information such as date, manifestation ,status quo and identifies of the data set

5

security management information

security management information of the data set

6

metadata reference information

the metadata information of the data set

7

Contact Information

contact information of the person or institution related to data set

creation date, resource morphology, unique identifier, the storage location classification of security restrictions, open form ,open time metadata standard name, metadata languages, metadata character set contact names, units, address, telephone, EMAIL

4.3. The construction of scientific research management information resource catalog system Research management information resources are classified according to the different angles and the message catalog system is established. The classification angles include: sources of information, the mechanis m of the informat ion resources, information resource carriers, scientific research management elements of the informat ion resources and information resources themes. Through the information resource catalog system, users can query, manage and prevent metadata. The examp le of classification standard of the scientific research management information resource catalog system is shown in Table 4. T able 4. T he classification standard example of the scientific research management information resource catalog system No.

Classification standard

1

sources of information

2

institutions involved in Information resources

3

information resources carriers

4

elements involved in research management of information resources

5

information resources themes

4.4. Use of data Based on the metadata standard, scientific research management informat ion resources form metadata database, basic database and subject database by ETL operation. Scientific research management information resource catalog system can query and maintain metadata by calling metadata database. For the needs of users, we establish algorithm model and call subject database to show the results of data analysis by a visual way which includes diagrams, reports, GIS maps, etc.

Zhilong Chen et al. / Procedia Computer Science 17 (2013) 54 – 61

5. Discussions and Conclusions This research, starting with the analysis of research data sharing, investigates the existing related researches. In order to solve the problems on data sharing, an academic data integration mode is put forward based on the concept of data integration, in wh ich the data integration process is formulized. The metadata technology and methods of data integration is illustrated and the data mode, the pro cess flow and features are analyzed. This research also takes the research management informat ion as a case to integrate data by the basic mode and process of data integration. The research will provide an important reference for the various types of information resources integration. Acknowledgements This research is supported by National Natural Science Foundation of China (71201156), Informatization Special Funds of the Chinese Academy of Sciences (Y201711601, Y201701601) and Youth Innovation Promotion Association of Chinese Academy of Sciences.

References [1] Wu D, Chen Z, Lu J, Chen Y. Study on Core Metadata of Research Management Information Resource. Science and Technology for Development 2012; (10): 26-30. (In Chinese) [2] Wu T. System of teaching quality analyzing and evaluating based on data warehouse. Computer Engineering and Design 2009; 6(2): 1545-1547. (In Chinese) [3] Smith J, Gounaris A, Watson P. Distributed query processing on the grid. The international Journal of High Performance Computing Applications 2003; 17(4): 353-367. [4] Haas LM, Soffer A. New Challenges in Information Integration. Lecture Notes in Computer Science 2009; 5691:1-8. [5] Nassis V, Rajugan R, Dillon T S, Rahayu W. Conceptual Design of XML Document Warehouses. Lecture Notes in Computer Science 2004; 3181: 1-14. [6] Trujillo J, Palomar M, Gomez J, Song IY. Designing Data Warehouses with OO Conceptual Models. IEEE Computer Society 2001; 34(12): 66 75. [7] Bernstein PA, Haas LM. Information integration in the enterprise. Communications of the ACM; 2008, p.170-177. [8] El Akkaoui Z, Zimányi E, Mazón JN, Trujillo J. A model-driven framework for ETL process development. Proceedings of the ACM 14th international workshop on Data Warehousing and OLAP; 2011, p. 45 52. [9] Ye Y, Jin G. Metadata-based information organization and ontology-based knowledge organization. Journal of Academic Libraries 2004; (4): 43-47. (In Chinese) [10] Skoutas D, Simitsis A. Ontology-based conceptual design of ETL processes for both structured and semi-structured data. International Journal on Semantic Web and Information Systems 2007; 3(4): 1-24. [11] De S. Ribeiro L, Goldschmidt RR, Cavalcanti MC. Complementing Data in the ETL Process. Data Warehousing and Knowledge Discovery 2011; 6862: 112-123. [12] Nguyen T B, T joa AM, Mangisengi O. MetaCube XTM: A Multidimensional Metadata Approach for Semantic Web Warehousing Systems. Lecture Notes in Computer Science 2003; 2737: 76-88. [13] De Santana AS, De Carvalho Moura AM. Metadata to Support Transformations and Data & Metadata Lineage in a Warehousing Environment. Lecture Notes in Computer Science 2004; 3181: 249-258.

61