2013 International Conference on Parallel and Distributed Computing, Applications and Technologies

Implementation of Data Transform Method into NoSQL Database for Healthcare Data

Chao-Tung Yang, Jung-Chun Liu, Wen-Hung Hsu, Hsin-Wen Lu, and William Cheng-Chung Chu
Department of Computer Science, Tunghai University, Taichung City, 40704 Taiwan R.O.C.
Email: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]

Abstract—Currently, most health care systems used among divisions in medical centers still adopt the Excel file format for a variety of scale statistics, such as the clinical self-care ability scale for the Functional Independence Measure. Although these Excel files can be further analyzed with statistical software such as SAS, SPSS, and STATA, the archived data cannot be shared effectively among divisions. We propose to convert the format of these data and store them in a database. Because collections of Excel files cannot be shared with ease, we use HBase, a non-relational database, to further integrate the data. The purpose of this paper is to construct complete import tools and solutions based on HBase to facilitate easy access to data in HBase. In addition, a visual interface is implemented to manage HBase and to provide user-friendly client connection tools for the HBase database.

Keywords: NoSQL Database; Hadoop; HBase; Key-value Store; Healthcare

I. INTRODUCTION

Big data has become a hot issue, mostly encountered in research fields that face the challenge of analyzing and forecasting huge amounts of information, such as weather forecasting, genetic analysis, biological research, and financial or commercial information [1], [4]–[7], [13]–[15]. To model and predict complex phenomena, high-speed computers combined with distributed or parallel computing techniques are often used to deal with huge amounts of data. Moreover, in recent years, more and more enterprises have faced data explosion, with unexpectedly rapid growth in the amount of data in their storage systems, and many companies worry that they will soon encounter the same situation. It is difficult to process big data in most relational database management systems, because doing so requires massively parallel software running concurrently on hundreds or thousands of servers. Common sources of massive data include interactive information such as images, audio, video, Internet search indexes, astronomical data, genetic information, medical records, and website log records transmitted through sensor networks, social networking, and wireless networks [8]–[10]. These raw data illustrate the proliferation of big data. They are mostly non-structured or semi-structured, and are not easily processed by the traditional practice of relational databases, in which data with a fixed field architecture are stored in a relational database for further processing.

In addition to the challenge of the sheer volume of data, information coming from various structures complicates the situation. In today's national health care system, the distributed and independent nature of medical information systems is potentially a serious problem for the authority of each health care system. Because of the different needs of different groups, medical care information systems are highly fragmented and cannot be easily integrated and interconnected. Moreover, since systems are built at different times, there may be no suitable connections among them. The causes of this situation include: (1) information used by the personnel; (2) preferences of data entry personnel; (3) services for different objects; (4) system build time; (5) changes of health regulations; (6) different supporting plans or funding sources; and (7) varying definitions of database field names in different database systems. A medical care database cannot afford errors. Integration without shutting down the medical system, under the premise of ensuring the quality of medical care and maintaining normal operations of the health care system without disruption, is the conservative approach now adopted by medical institutions. Without integration of medical databases, patients visiting hospitals for different illnesses or physical conditions may repeat the same physical checks, which not only delays treatment but also seriously wastes medical resources.

The current health care systems within medical divisions still use Excel files to store a range of scale statistics, such as the clinical self-care ability scale for the Functional Independence Measure (FIM). Values stored in Excel can be analyzed with statistical software such as SAS, SPSS, and STATA, but they cannot be efficiently shared among divisions. To integrate the data, we propose to convert the format of the Excel data and store them in a database based on HBase. HBase is a non-relational database with a structure similar to Excel, but with higher implementation feasibility [2], [3], [11]–[14].


Considering that medical data usage requires fast, accurate, and efficient queries of the required information by people in various areas, we provide a visual graphical interface, instead of query languages, to speed up data queries in the HBase database. As a result, people can make full use of the data in HBase and improve the efficiency of statistical work. As opposed to traditional commercial relational databases, HBase is scalable, high-performance, and low-cost, but it does not have a complete and friendly user environment. Therefore, in addition to data conversion into HBase storage, it is important to provide appropriate technical support, a friendly HBase user interface, approachable operation syntax, and so on. The purpose of this paper is to construct, based on HBase, complete import tools and solutions in its environment. First, we analyzed the characteristics of the source data to be converted into HBase. Then we recognized patterns of data usage and constructed the HBase data model based on user behaviors. To make it easier to access HBase, we also implemented a visual interface to manage HBase as a user-friendly database.

The rest of the paper is organized as follows: Section 2 describes the techniques used and some background knowledge. Section 3 describes the system architecture used in the paper. Section 4 shows experimental results for the proposed system. Section 5 provides conclusions and future work.

II. BACKGROUND REVIEW

Big data is a large and complex issue. We now face numerous challenges in finding useful information by analyzing big data, and in using it to reduce enterprise risks, promote revenues, and improve competitiveness. These challenges include how to obtain, store, search, share, analyze, and visually present big data. Big data is also a new and hot issue in cloud computing. Traditionally, structured data are normalized in advance, stored in databases, and then manipulated as the principal resource and cornerstone supporting enterprise IT systems. The rest of the data, which are unstructured or semi-structured and generally massive in quantity compared with the structured data, are hard to process and have been cast aside. However, as new cloud computing technologies like Hadoop and NoSQL emerge, these formerly discarded data, in quantities so large that they are called big data, are now considered among the most valuable resources as enterprises stride into this new market. Thus, issues like gathering, storing, modeling, analyzing, and manipulating big data have become hot topics in cloud computing research and applications. When big data is mentioned, what comes to mind first is no longer the past hegemon of the database market, Oracle, or the software giant Microsoft, but instead the Apache Foundation's open source Hadoop parallel computing and storage architecture and the HBase NoSQL distributed database.

Hadoop, developed by the Apache Software Foundation, is an open source parallel computing platform and distributed file system. NoSQL database, also called Not Only SQL, is an approach to data management and database design that is useful for very large sets of distributed data. NoSQL, which encompasses a wide range of technologies and architectures, seeks to solve the scalability and big data performance issues that relational databases were not designed to address. NoSQL is especially useful when an enterprise needs to access and analyze massive amounts of unstructured data, or data stored remotely on multiple virtual servers in the cloud. NoSQL is a general term meaning that the database is not an RDBMS that supports SQL as its primary access language; there are many types of NoSQL databases: BerkeleyDB is an example of a local NoSQL database, whereas HBase is very much a distributed database.

HBase, written in Java, is an open source, non-relational, distributed database modeled after Google's BigTable. It is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of the Hadoop Distributed File System (HDFS), providing BigTable-like capabilities for Hadoop; that is, it provides a fault-tolerant way to store large quantities of sparse data. As a column-oriented database management system that runs on top of HDFS, HBase is well suited to the sparse data sets that are common in many big data applications. Unlike relational database systems, HBase does not support a structured query language like SQL; in fact, HBase is not a relational data store at all. HBase applications are written in Java, much like typical MapReduce applications, and HBase also supports applications written through Avro, REST, and Thrift.

Every HBase system comprises a set of tables. Each table contains rows and columns, much like a traditional database. Each table must have an element defined as a primary key, and all access to HBase tables must use this primary key. An HBase column represents an attribute of an object; for example, if a table stores diagnostic logs from servers, where each row is a log record, a typical column would be the timestamp of when the log record was written, or perhaps the name of the server where the record originated. In fact, HBase allows many attributes to be grouped into so-called column families, such that the elements of a column family are all stored together. This is different from a row-oriented relational database, where all the columns of a given row are stored together. A minimal sketch of this data model using the HBase Java client API is shown below.
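To make the data model concrete, the following is a minimal sketch using the HBase 0.94 Java client API (the version used later in this paper). It is illustrative only, not the authors' import tool; the table name, column family, and qualifier are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class DataModelSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Create a table with one column family; a family groups related
        // attributes, and its elements are stored together on disk.
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor desc = new HTableDescriptor("Patients"); // assumed name
        desc.addFamily(new HColumnDescriptor("info"));            // assumed family
        admin.createTable(desc);
        admin.close();

        // Every access goes through the row key, the table's "primary key".
        HTable table = new HTable(conf, "Patients");
        Put put = new Put(Bytes.toBytes("patient-0001"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("Name"), Bytes.toBytes("John Doe"));
        table.put(put);

        // Read the row back through the same key.
        Result result = table.get(new Get(Bytes.toBytes("patient-0001")));
        byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("Name"));
        System.out.println(Bytes.toString(name));
        table.close();
    }
}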


With HBase one must predefine the table schema and specify the column families. However, HBase is very flexible in that new columns can be added to families at any time, so the schema can adapt to changing application requirements. Just as HDFS has a NameNode and slave nodes, and MapReduce has a JobTracker and TaskTracker slaves, HBase is built on similar concepts. In HBase, a master node, HMaster, manages the cluster, and region servers store portions of the tables and perform the work on the data. HMaster is the implementation of the master server; it is responsible for monitoring all RegionServer instances in the cluster and is the interface for all metadata changes. In a distributed cluster, the master typically runs on the NameNode, while HRegionServer, the RegionServer implementation, is responsible for serving and managing regions and runs on a DataNode. In HBase, ZooKeeper selects another machine within the cluster as HMaster when the current master fails, so HBase does not suffer from the single point of availability problem that the NameNode poses in the HDFS architecture. A client only needs to locate the ZooKeeper quorum to reach the cluster, as sketched below.
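Since clients discover HMaster and the RegionServers through ZooKeeper, a client program only has to be pointed at the ZooKeeper quorum. A minimal sketch under stated assumptions: the host name node1 is an assumption matching the test bed described later, and the table name is illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class ClientConnectSketch {
    public static void main(String[] args) throws Exception {
        // Point the client at the ZooKeeper quorum; HBase resolves the
        // HMaster and RegionServer locations from there.
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "node1"); // assumed host name

        HTable table = new HTable(conf, "Patients"); // illustrative table name
        System.out.println("Connected to table: " + Bytes.toString(table.getTableName()));
        table.close();
    }
}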






Figure 1. The roles in an HBase cluster.

Figure 1 shows the roles in an HBase cluster. HBase is built on top of Apache Hadoop and Apache ZooKeeper and, like the rest of the Hadoop ecosystem components, is written in Java. HBase can run in three different modes: standalone, pseudo-distributed, and fully distributed. HBase has many features that support both linear and modular scaling: HBase clusters can be expanded by adding RegionServers hosted on commodity-class servers. For example, when a cluster expands from 10 to 20 RegionServers, it doubles in both storage and processing capacity. An RDBMS can scale well, but only up to a point, specifically the size of a single database server, and for the best performance it requires specialized hardware and storage devices. Notable HBase features are:

• Strongly consistent reads/writes: HBase is not an "eventually consistent" data store, which makes it very suitable for tasks such as high-speed counter aggregation.
• Automatic sharding: HBase tables are distributed on the cluster via regions, which are automatically split and redistributed as the data grow.
• Automatic RegionServer failover.
• Hadoop/HDFS integration: HBase supports HDFS out of the box as its distributed file system.
• MapReduce: HBase supports massively parallelized processing via MapReduce, with HBase as both source and sink.
• Java client API: HBase provides an easy-to-use Java API for programmatic access (see the scan sketch after Figure 2).
• Thrift/REST API: HBase also supports Thrift and REST for non-Java front ends.
• Block cache and Bloom filters: HBase supports a block cache and Bloom filters for high-volume query optimization.
• Operational management: HBase provides built-in web pages for operational insight along with JMX metrics.

Regions are the basic elements of availability and distribution for tables, and each region comprises one store per column family. Figure 2 shows the hierarchy of objects in HBase.

Figure 2. The hierarchy of objects in HBase.
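As an illustration of the Java client API feature listed above, the following hedged sketch scans one column family of a table; the table and family names are assumptions, not part of the paper's system.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "Patients"); // illustrative table name
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("info")); // restrict the scan to one column family
        scan.setCaching(100);                  // fetch rows in batches to cut RPC round trips
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result r : scanner) {
                System.out.println(Bytes.toString(r.getRow())); // print each row key
            }
        } finally {
            scanner.close(); // always release scanner resources
            table.close();
        }
    }
}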

In HBase, the catalog tables, -ROOT- and .META., exist as HBase tables. They are filtered out of the HBase shell's list command, but they are in fact tables just like any other. Table 1 lists the metrics in the catalog tables -ROOT- and .META. Region names consist of the containing table's name, a comma, the start key, a comma, and a randomly generated region id, as in the region names shown in Table I. The -ROOT- and .META. tables are internal system ('catalog') tables: -ROOT- keeps a list of all regions in the .META. table, and .META. keeps a list of all regions in the system. The empty key is used to denote the table start and the table end. A region with an empty start key is the first region in a table; if a region has both an empty start key and an empty end key, it is the only region in the table.


Table I. THE METRICS OF CATALOG TABLES -ROOT- AND .META.

Region Name          | Metrics
-ROOT-,,0.70236052   | numberOfStores=1, numberOfStorefiles=1, storefileUncompressedSizeMB=0, storefileSizeMB=0, memstoreSizeMB=0, storefileIndexSizeMB=0, readRequestsCount=538, writeRequestsCount=1, rootIndexSizeKB=0, totalStaticIndexSizeKB=0, totalStaticBloomSizeKB=0, totalCompactingKVs=6, currentCompactedKVs=6, compactionProgressPct=1.0, coprocessors=[]
.META.,,1.1028785192 | numberOfStores=1, numberOfStorefiles=0, storefileUncompressedSizeMB=0, storefileSizeMB=0, memstoreSizeMB=0, storefileIndexSizeMB=0, readRequestsCount=8711, writeRequestsCount=70, rootIndexSizeKB=0, totalStaticIndexSizeKB=0, totalStaticBloomSizeKB=0, totalCompactingKVs=0, currentCompactedKVs=0, compactionProgressPct=NaN, coprocessors=[]

Table II. HARDWARE SPECIFICATION

Node   | CPU                                    | RAM | Disk speed | Network speed  | OS version        | Java version
Node 1 | Intel(R) Core(TM)2 Quad CPU Q9550      | 4GB | 352 MB/s   | 96.5 Mbits/sec | CentOS 6.4 x86-64 | jre 1.6.0_31-b04
Node 2 | Virtualized 2 cores from Intel i7-2600 | 1GB | 412 MB/s   | 94 Mbits/sec   | CentOS 6.4 x86-64 | jre 1.6.0_31-b04

Table III. SOFTWARE SPECIFICATION AND ARGUMENTS SETTING

Software    | Version | Argument/Option
HBase       | 0.94.2  | Master at Node1:60010
Hadoop      | 2.0.0   |
MapReduce 2 | 2.0.0   | map.tasks.maximum=4, reduce.tasks.maximum=2
HDFS        | 2.0.0   | Block size=64MB, Replication=3
Zookeeper   | 3.4.5   | Quorum at port 218

III. SYSTEM IMPLEMENTATION

This section presents several experiments conducted on one physical machine and one virtual machine. Each node contained a 1-GbE NIC, but the nodes had different CPU and memory levels, as listed in Table 2. We used the Linux command dd to test the disk write performance of each node, and the network testing tool iperf to measure the throughput of the network by creating TCP data streams. In the experiment, we used the HBase-0.94.2 API and the hadoop-client-1.0.3 API, and we used the Java programming language to build a client. Table 3 shows the software specification and arguments setting. The experimental platform is built on two nodes. Node 1 acts as HMaster and consists of one Intel Core(TM)2 Quad Q9550 CPU (12M cache, 2.83 GHz), 4 GB of memory, and a 1TB disk. Node 2 acts as RegionServer and consists of 2 CPU cores virtualized from an Intel i7-2600 and 1 GB of memory. Since disk I/O throughput is important for database systems and the disk of Node 2 is faster than that of Node 1, we used Node 2 to act as the RegionServer. The flowchart in Figure 3 illustrates the operations of writing data into HBase.

Figure 3. Flowchart of data write to HBase.

The sequence diagram in Figure 4 depicts how the main components of the system interact with one another to fulfill the goal of writing data into HBase; a condensed sketch of such an import client appears after Figure 4.


Figure 4. Sequence diagram of data write to HBase.
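The import client at the heart of this flow reads records from an Excel file and writes them into HBase as Puts. The paper does not show the authors' actual code, so the following is only a hedged sketch: it assumes Apache POI for Excel parsing, and the file name, sheet layout, column mapping, and column-family name are all illustrative assumptions.

import java.io.File;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

public class ExcelToHBaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "PatientsInfo");

        // Assumed sheet layout: every row is one record; column 0 holds the
        // patient ID (used as the Rowkey), columns 1..4 hold Name, Birth,
        // Address, and Sex (cf. Table IV). "info" is an assumed family name.
        Workbook wb = WorkbookFactory.create(new File("patients.xls"));
        Sheet sheet = wb.getSheetAt(0);
        String[] qualifiers = {"Name", "Birth", "Address", "Sex"};
        for (Row row : sheet) {
            String patientId = row.getCell(0).getStringCellValue();
            Put put = new Put(Bytes.toBytes(patientId));
            for (int i = 0; i < qualifiers.length; i++) {
                String value = row.getCell(i + 1).getStringCellValue();
                put.add(Bytes.toBytes("info"), Bytes.toBytes(qualifiers[i]),
                        Bytes.toBytes(value));
            }
            table.put(put); // buffered or sent immediately, depending on autoFlush
        }
        table.close(); // flushes any buffered Puts
    }
}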

IV. EXPERIMENTS

Our experimental data consist of basic information about patients over the age of 65, with 260398 records in total. Four datasets were built; the first three datasets contained 65535 records each, and the last dataset contained 63793 records. First, we needed to create HBase schemas by designing the Rowkey, the ColumnFamily, and the column qualifiers. The Rowkey length was kept as short as reasonable while still being useful for accessing the required data. In fact, tradeoffs should be expected when designing Rowkeys: a short key that is useless for data access is not more valuable than a longer key with better get/scan properties. Table 4 shows the schema of the HBase table used to store patient records.

Table IV. SCHEMA OF HBASE TABLE WHICH STORES PATIENT RECORDS

Rowkey      | Name    | Birth    | Address | Sex
Patients ID | Chinese | Type/Day | Home    | Sex

In the experiment, hbase.hregion.max.filesize was set to 1073741824 (1GB). As shown in Figure 5, the 260398 records stored in the four datasets were all put into the table PatientsInfo in HBase by generating monotonically increasing Rowkeys. A table-creation sketch reflecting these settings appears after Figure 5.

Figure 5. Results of scanning the PatientsInfo table.
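For illustration, a hedged sketch of creating the PatientsInfo table with the region size used above; the column-family name "info" is an assumption, since Table IV does not name the family.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreatePatientsInfoSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("PatientsInfo");
        desc.setMaxFileSize(1073741824L); // hbase.hregion.max.filesize = 1 GB, as in the experiment
        desc.addFamily(new HColumnDescriptor("info")); // assumed column family name

        admin.createTable(desc);
        admin.close();
    }
}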

After we converted all the data from the Excel files into the table "PatientsInfo" of the HBase cluster, we could access the RegionServer web interface to check the table information. Figure 6 shows details of the region on the RegionServer web GUI interface.

Figure 6. The regional detail on the RegionServer.

Among the fields shown there:

• NumberOfStores: the number of stores in the RegionServer that have been targeted for compaction.
• NumberOfStorefiles: the number of StoreFiles opened on the RegionServer. A store may have more than one StoreFile (HFile).

A. Optimization of HBase properties

We evaluated the cost of putting data into HBase under different configurations of HBase. Table V tabulates the time cost of putting 260398 records into HBase with various settings of three HBase properties, which are described as follows:

• setAutoFlush(): normally, Puts are sent one at a time to the RegionServer. If autoFlush is set to false, Puts are not sent until the write buffer is full.
• setWriteToWAL(): turning writeToWAL off means that the RegionServer will not write Puts to the Write-Ahead Log (WAL), but only into the memstore; the consequence is that if a RegionServer fails, there will be data loss.
• setWriteBufferSize(10MB): the write buffer size is in bytes. A larger buffer requires more memory on both the client and the server, because the server instantiates the passed write buffer to process it, but it reduces the number of Remote Procedure Calls (RPCs).

In the experiment, we converted the experimental data from Excel document files to HBase without time limits and with no high fault tolerance considerations; the purpose was to find the most efficient combination. Since errors in the conversion process were low, we could choose configuration E as the optimal setting because it significantly reduced the time cost. When higher stability or fault tolerance mechanisms are needed, settings such as configurations A, B, and D, in which setWriteToWAL is on, should be selected, at a relative increase in time cost. When hbase.regionserver.handler.count is configured to 20 on the RegionServer, this property raises the number of RPC server instances spun up on the RegionServer; for configuration A, the time cost is then reduced by 35 seconds, as shown in Table VI. A client-side sketch of these property settings follows.
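The following hedged sketch shows how these client-side knobs might be set with the HBase 0.94 API to reproduce configuration E (autoFlush off, 10 MB write buffer, WAL off). The table name matches the experiment; the Rowkey, family, and value are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ConfigurationESketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "PatientsInfo");

        // Configuration E: buffer Puts on the client and skip the WAL.
        table.setAutoFlush(false);                  // do not send each Put immediately
        table.setWriteBufferSize(10 * 1024 * 1024); // 10 MB client-side write buffer

        Put put = new Put(Bytes.toBytes("patient-0001")); // illustrative Rowkey
        put.setWriteToWAL(false); // skip the Write-Ahead Log: faster, but data loss on failure
        put.add(Bytes.toBytes("info"), Bytes.toBytes("Name"),
                Bytes.toBytes("...")); // assumed family/qualifier
        table.put(put); // buffered until the 10 MB buffer fills

        table.flushCommits(); // force out any remaining buffered Puts
        table.close();
        // Server side, hbase.regionserver.handler.count (set in hbase-site.xml)
        // controls the number of RPC handler instances; Table VI uses 20.
    }
}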

Table V. COST OF DIFFERENT CONFIGURING PROPERTIES OF HBASE

HBase Properties / Configuration | A      | B      | C      | D     | E
setAutoFlush()                   | on     | on     | on     | off   | off
setWriteBufferSize(10MB)         | off    | on     | on     | on    | on
setWriteToWAL()                  | on     | on     | off    | on    | off
Time                             | 950sec | 924sec | 652sec | 63sec | 48sec

Table VI. SET HBASE REGIONSERVER HANDLER COUNT VALUE TO 20

HBase Properties / Configuration | A      | B      | C      | D     | E
setAutoFlush()                   | on     | on     | on     | off   | off
setWriteBufferSize(10MB)         | off    | on     | on     | on    | on
setWriteToWAL()                  | on     | on     | off    | on    | off
Time                             | 915sec | 897sec | 620sec | 63sec | 43sec

In configuration A, where setAutoFlush is turned on, the memstore in the RegionServer was flushed five times during the put phase. Figures 7 to 12 show information including write requests, memory usage, and flush size when configuration E is used to put data into HBase. In configuration E, with setAutoFlush turned off, Figure 10 shows that the memstore in the RegionServer was flushed only once during the put phase. And with setWriteToWAL turned off, the RegionServer did not write the Puts to the Write-Ahead Log (WAL), as shown in Figure 11, which recorded 29790 operations without WAL.

Figure 7. Write requests on configuration E.

Figure 8. Write requests per sec on configuration E.

Figure 9. Memory usage on configuration E.

Figure 10. Flush average size and operations rate on configuration E.

Figure 11. Total: Puts without WAL on configuration E.

Figure 12. Memory heap and Memstore size on configuration E.

Table VII. KEYVALUE OF DATA STORAGE IN HBASE (Rowkey=r1, cf:attr1=v1)

Field              | Value
rowlength          | 2
row                | r1
columnfamilylength | 2
columnfamily       | cf
columnqualifier    | attr1
timestamp          | server time of Put
keytype            | Put

V. CONCLUSION

Row and column sizes should be minimized. The KeyValue class is the heart of data storage in HBase. When we design the Rowkey, ColumnFamily, and column names in HBase, the names must be as short as possible, since all of them are embedded within the KeyValue instance. As shown in Table VII, the longer these identifiers are, the bigger the KeyValue of the stored data becomes.

While creating the corresponding Rowkeys in HBase, problems appeared when we saved monotonically increasing values in alphabetical order: the new writes were not evenly distributed, and the last of the original regions became a new high-hit-rate region in need of a split. Monotonically increasing keys in a single region can reduce the cost of random distribution, but overall it is best to avoid using a sequence number or timestamp as the row key.

As stated, another goal of this paper is to provide an integrated data parallel processing service environment, to meet the various demands for data services from different medical divisions, and to offer the services required for information combination and computing resources. Therefore, the architecture of the system must be modular to support customized, reusable, and scalable data services. Based on the needs of different applications, the cluster resources are adjusted in a timely fashion to serve the data service demands of each application. Through the automatic information integration mechanism, data consistency and integrity requirements can be met for different data service applications, and a unified data access information system can be built to ensure individual as well as overall service needs. In the future, we plan to use virtualization technology to dynamically increase or decrease the RegionServers in the HBase cluster, and then to move data from regions with high access rates during off-peak hours to achieve load balancing among the RegionServers.

ACKNOWLEDGMENT

This work is sponsored by Tunghai University under "The U-Care ICT Integration Platform for the Elderly," No. 102GREEnS004-2, Aug. 2013. This work was supported in part by the National Science Council, Taiwan ROC, under grant numbers NSC102-2218-E-029-002 and NSC101-2218-E-029-004.

REFERENCES

[1] Wenbin Jiang, Hao Li, Hai Jin, Lei Zhang, Yaqiong Peng. VESS: An Unstructured Data-Oriented Storage System for Multi-Disciplined Virtual Experiment Platform. Proceedings of the International Conference on Human-centric Computing 2011 and Embedded and Multimedia Computing 2011.

[2] Shoji Nishimura, Sudipto Das, Divyakant Agrawal, Amr El Abbadi. MD-HBase: Design and Implementation of an Elastic Data Infrastructure for Cloud-scale Location Services. Springer Science+Business Media, LLC, 2012.

[3] Himanshu Vashishtha, Eleni Stroulia. Enhancing Query Support in HBase via an Extended Coprocessors Framework. 4th European Conference, ServiceWave 2011, Poznan, Poland, October 26-28, 2011, Proceedings.

[4] Divyakant Agrawal, Amr El Abbadi, Sudipto Das, Aaron J. Elmore. Database Scalability, Elasticity, and Autonomy in the Cloud. 16th International Conference, DASFAA 2011, Hong Kong, China, April 22-25, 2011, Proceedings, Part I.
[5] Huiju Wang, Xiongpai Qin, Yansong Zhang, Shan Wang, Zhanwei Wang. LinearDB: A Relational Approach to Make Data Warehouse Scale Like MapReduce. 16th International Conference, DASFAA 2011, Hong Kong, China, April 22-25, 2011, Proceedings, Part II.

[6] Olivier Curé, Robin Hecht, Chan Le Duc, Myriam Lamolle. Data Integration over NoSQL Stores Using Access Path Based Mappings. 22nd International Conference, DEXA 2011, Toulouse, France, August 29 - September 2, 2011, Proceedings, Part I.

[7] Feng Zhu, Jie Liu, Lijie Xu. A Fast and High Throughput SQL Query System for Big Data. 13th International Conference, Paphos, Cyprus, November 28-30, 2012, Proceedings.

[8] Chao-Tung Yang, Guan-Han Chen, and Shih-Chi Yu. Implementation of Cloud Computing Environment for Hiding Huge Amounts of Data. Parallel and Distributed Processing with Applications (ISPA), 2010 International Symposium on, Sept. 2010, pages 1-7, Proceedings.

[9] Chao-Tung Yang, Wen-Chung Shih, Chih-Lin Huang. Implementation of a Distributed Data Storage System with Resource Monitoring on Cloud Computing. GPC 2012: 64-73.

[10] Chao-Tung Yang, Cheng-Ta Kuo, Wen-Hung Hsu, Wen-Chung Shih. A Medical Image File Accessing System with Virtualization Fault Tolerance on Cloud. GPC 2012: 338-349.

[11] Cloudera recommendations on Hadoop/HBase cluster capacity planning, http://www.cloudera.com/blog/2010/08/hadoophbase-capacityplanning/.

[12] Jianling Sun. Scalable RDF Store Based on HBase and MapReduce. Advanced Computer Theory and Engineering (ICACTE), 2010 3rd International Conference on, Hangzhou, China, 20-22 Aug. 2010.

[13] Hadoop Wiki - HBase: Bigtable-like structured storage for Hadoop HDFS, http://wiki.apache.org/hadoop/Hbase.

[14] Chen Zhang. Supporting Multi-row Distributed Transactions with Global Snapshot Isolation Using Bare-bones HBase. Grid Computing (GRID), 2010 11th IEEE/ACM International Conference on, Waterloo, Canada, 25-28 Oct. 2010.

[15] D. Abadi, P. Boncz, and S. Harizopoulos. Column-oriented Database Systems. Proceedings of the VLDB Endowment, 2(2):1664-1665, 2009.