2012 ACM/IEEE 13th International Conference on Grid Computing
A Distributed Cache for Hadoop Distributed File System in Real-time Cloud Services
Jing Zhang 1, Gongqing Wu 1, Xuegang Hu 1, Xindong Wu 1,2
1 Department of Computer Science, Hefei University of Technology, Hefei, 230039, China
[email protected], [email protected]
2 Department of Computer Science, University of Vermont, Burlington, VT 05405, U.S.A.
[email protected]
Abstract—The improvement of file access performance is a great challenge in real-time cloud services. In this paper, we analyze the preconditions of dealing with this problem, considering the requirements, hardware, software, and network environments in the cloud. Then we describe the design and implementation of a novel distributed layered cache system built on top of the Hadoop Distributed File System, named the HDFS-based Distributed Cache System (HDCache). The cache system consists of a client library and multiple cache services. The cache services are designed with three access layers: an in-memory cache, a snapshot of the local disk, and the actual disk view as provided by HDFS. Files loaded from HDFS are cached in shared memory which can be accessed directly by the client library. Multiple applications integrated with the client library can access a cache service simultaneously. Cache services are organized in P2P style using a distributed hash table. Every cached file has three replicas on different cache service nodes in order to improve robustness and alleviate the workload. Experimental results show that the novel cache system can store files with a wide range of sizes and delivers millisecond-level access performance in highly concurrent environments.

Keywords—distributed cache system; cloud storage; HDFS; real-time file access; in-memory cloud

I. INTRODUCTION

Apache Hadoop [1] is a well-known project that includes open source implementations of a distributed file system [2] and a MapReduce parallel processing framework, inspired by Google's GFS [3] and MapReduce [4]. The emergence of the open source Hadoop system has eliminated the technical barrier to cloud computing. Several rising international IT companies, such as Facebook and Twitter, are dedicated to making contributions to the Hadoop community as well as deploying and using this system to build their own cloud computing systems. After several years of development, Hadoop has gradually formed a cloud computing ecosystem consisting of a set of technical solutions, including the HBase distributed database, the Hive distributed data warehouse, the ZooKeeper coordination service for distributed applications, etc. All the components are built on top of low-cost commodity hardware, with extensive availability and fault tolerance, which has gradually made Hadoop the mainstream commercial implementation of cloud computing technology.

One of the significant design features of the Hadoop system is high throughput, which is extremely suitable for handling large-scale data analysis and processing problems. This original design gives Hadoop outstanding performance in off-line processing of massive, petabyte-scale data sources. In recent years, with the continuous development of broadband networks, Internet applications with real-time interactive features have increased significantly. This type of real-time cloud computing environment has the following characteristics. (1) Personalized service: one of the goals of cloud computing is to provide a user-adaptive virtual information service system. A personalized service is built on the analysis of personal historical information. (2) A shorter period between users generating data and users consuming data: although the personalized models of individuals are derived from user-generated content, the period of data recreation is shortening substantially. Continuous optimizations of the system take place at any time, so end users perceive smooth improvement as they use the system more frequently. (3) Real-time dynamics: in order to support real-time personalized features, the cloud services must schedule the resources affiliated with a specific person in several seconds or less, while the total resource pool may contain the model data of hundreds of millions of users. (4) Differentiated management of personal data: in cloud services, the personal data of a user can be classified into several categories, such as access logs, profiles, uploaded (or generated) materials and business models. Generally speaking, some information for a user's profile and business model can be extracted from access logs and uploaded materials using data mining techniques. The cloud services manage these categories of data in different manners. The user access logs are simply stored in a backup storage system. The user's uploaded data are stored in the networked storage system and loaded into memory when the user explicitly reads or modifies them. The user's profile and business model data are critical in real-time services, as they determine the business category and quality of services; they are generally loaded implicitly when the user logs into the system and cannot be scheduled out until the user logs out. While the user visits real-time cloud services, this kind of data may be accessed many times. The size of this kind of data is typically not very large due to the way it is generated, and we assume it is about 10MB. Thus, the real system must schedule this kind of data (about 10MB) for a
specific user from several million model packages (about 100TB), then compute the results and send them back to the user. All these procedures must be completed in 2 seconds, so the resource scheduling procedure must be completed at a millisecond level. The Hadoop Distributed File System (HDFS) meets the requirements of massive data storage, but lacks consideration of real-time file access. In HDFS, reading a file may involve several interactions with the NameNode and DataNodes, which dramatically decreases access performance when the system is under a heavy data burden and a concurrent workload. Thus, how to improve HDFS file access performance (especially file reading performance) is a key issue in real-time cloud services. In this paper, we present a novel distributed cache system named HDCache, built on top of HDFS, which has rich features and high access performance. The cache system provides general-purpose storage for variable workloads including both huge and small files. In the rest of the paper, we describe the details of this cache system. In Section II, related work is reviewed. In Section III, we describe the prerequisites, considerations and motivations for designing a new cache system. Section IV provides the design and implementation details of the cache system. Section V evaluates the performance and the hit ratio of the cache system. Section VI concludes this paper with some future work.

II. RELATED WORK

The pitfalls of the Hadoop Distributed File System have been studied widely since the project was launched by Yahoo!. J. Shafer analyzed the performance of HDFS thoroughly and concluded that one of the major causes of its performance bottlenecks is the tradeoff between portability and performance [5]. The most mentioned shortcoming of HDFS is its weak performance when dealing with small files. Some applications solve this problem on top of HDFS by combining abundant small files into large ones and building an index for each small file in order to reduce the file count in the system [6]. Others try to modify the HDFS I/O features and the DataNode meta-data management implementation in order to provide better performance [7][8]. Many approaches [9][10][11] can be classified into these two categories. Neither method can fundamentally improve system performance. (1) Combining small files into a large file is a typically time-consuming operation; thus the key objective of this method is to improve system throughput rather than response time. HBase [12] systematically adopts this idea to solve the small-file storage problem by introducing a Google Bigtable [13] like key-value distributed database that makes file combination and retrieval transparent to end users. K. Dana evaluated the performance of HBase and found that reads slow down as the number of rows written increases, which indicates it may not meet the needs of real-time access [14]. (2) Modifying HDFS I/O features and altering the DataNode meta-data management implementation are comparatively dangerous. Generally speaking, without completely re-designing the system, enhancing certain aspects of a system is bound to damage some other aspects. NoSQL databases, such as HBase, Cassandra [33], and MongoDB [34], are considered a good solution for randomly reading and writing binary data on persistent storage devices; however, evaluations of these systems have shown that their performance still cannot meet the needs of real-time access to big data [35][36]. According to the YCSB benchmark [35], the read latency of both HBase and Cassandra becomes unacceptable when throughput exceeds 8000 ops/sec. HDCache is not a NoSQL DB; it only draws on the key-value storage strategy and the data replication mechanism used in NoSQL DBs. HDCache simplifies the access model to the granularity of files instead of tables, which improves performance dramatically. Another idea for addressing real-time access is to provide an in-memory storage system. J. Ousterhout et al. put forward a tentative plan to build cloud computing systems in DRAM [15][16]. The DRAM cloud is a great ambition in the long run; however, it currently faces many challenges and is impractical in the short term. A practical, simple alternative to the RAM cloud is Memcached [17], which provides general-purpose key-value storage entirely in DRAM and is widely used to offload back-end database systems [18]. Memcached can be used to accelerate MapReduce tasks on a Hadoop cluster [19] and to promote read throughput in massive small-file storage systems [20]. The Memcached and Redis [21] in-memory key-value storage systems are thought to be more feasible in production environments and are widely used in many IT companies, such as Facebook [22] and Twitter [23]. Memcached and Redis are well-implemented open source systems; however, they still have some shortcomings that we analyze in the next section. Our HDCache can be viewed as a first-step attempt to realize a DRAM cloud system. The HDCache system tries to cache content that will be accessed in the near future, which is the same design principle as Memcached and Redis; however, HDCache overcomes their defects, such as failures in dealing with large files, lack of replication, and lack of persistent serialization.
III. PREREQUISITES AND DESIGN CONSIDERATIONS

The goal of our work is to design and implement a distributed cache system on top of HDFS that can accelerate person-specific data access in large-scale real-time cloud services. Our novel HDCache system is based on the following prerequisites and design considerations.

On-the-top Method rather than Built-in Method

As mentioned in Section II, many systems attempt to modify HDFS features to improve performance. These practices violate the original design principles, and it is difficult to get good results. Reducing the HDFS system workload is conducive to performance improvement, and therefore building a distributed cache system on top of HDFS is a better choice. The latter makes the cache and HDFS systems independent of each other, which is essential
for maintaining the compatibility of the entire system in large-scale deployments and system upgrades. From the perspective of software engineering, a loose association between HDCache and HDFS is a better design choice for the independent evolution of both systems. As the open source community advances HDFS, the performance of HDCache will improve simultaneously; meanwhile, any change to HDCache will not affect the performance of the underlying HDFS system.
Motivations for Designing a New Cache System

Can we modify an existing cache system such as Memcached to meet the requirements of real-time services? Although declared the most suitable for large-scale Internet applications, the following defects make Memcached invalid for real-time cloud services. (1) The Memcached system is designed for caching data that are stored in a database; it is not a typical cloud storage system. Memcached uses a pre-allocated memory pool called a slab to manage memory. One slab contains multiple chunks, which are the basic memory allocation units; slabs are divided into several groups according to chunk sizes. This memory management strategy fits DB data well (SQL query results are usually small) but does not fit online personal data whose size ranges from several KB to dozens of MB; using this strategy results in enormous management overhead and a large number of memory fragments. (2) Memcached has no local serialization or snapshot mechanism. When a cache server crashes, the cached contents are lost, and the cost of reconstructing them is extremely high. The concept proposed by Google of building a cloud computing system on inexpensive commodity hardware [37] has been widely accepted; thus, a persistent serialization mechanism is a requirement under the condition of unstable hardware and software. (3) The servers of the Memcached system are independent of one another. The distributed function is provided by the client: the client algorithm decides which server to connect to and sends requests regardless of whether that server is healthy or malfunctioning. This leaves complex management issues to the user, who must design and implement a central management service to coordinate multiple Memcached servers and clients. (4) Memcached has a simple consistency checking process based on setting an expiry time. This may cause a burst of network traffic when a large amount of cached data expires simultaneously. A better way to solve this problem is to disperse the consistency checks over a certain duration, based on the access frequency of the expired data. Our novel HDCache system overcomes the shortcomings mentioned above, making it more suitable for real-time cloud services.

Network I/O rather than Disk I/O

Cloud computing systems are usually built on top of low-cost commodity hardware connected by Gigabit Ethernet. In practice, the network I/O rate is about 100MB/s, which is approximately equal to the disk I/O rate. On one hand, a real-time cloud computing system stores large amounts of data; on the other hand, data access usually arrives in sudden, random bursts, which evidently degrades disk scheduling performance, so the read efficiency is often no more than 50MB/s. Consequently, accessing data over the Ethernet is usually a better choice than reading them from an HDFS DataNode disk. If the cloud computing system is deployed on top of high-speed networks such as 10-Gigabit Ethernet, InfiniBand or Myrinet, network I/O obviously has huge advantages over disk I/O.

Layered Data Accessing Model

There are three data access layers in the system when building a cache on top of HDFS. The first layer is the in-memory cache, in which the data access rate is approximately equal to the memory access rate (ignoring OS memory swap). The second layer is the local disk snapshot and remote in-memory caches, with a data access rate of about 50~100MB/s. The bottom layer is HDFS, where all data are stored in DataNodes, with an access rate influenced by many factors such as data load, thread concurrency and network traffic. Applications using the distributed cache first look for the desired file in the DRAM cache; on a miss, the cache service contacts another cache service for the file or loads it from a local disk snapshot if one exists. If this procedure still cannot obtain the desired file, the cache service requested by the client loads the file from HDFS, as sketched below. The details of this process are discussed in the next section.
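To make the cascade concrete, the following minimal C++ sketch walks the three layers in order. Every helper here (lookup_dram, lookup_peer, lookup_snapshot, load_from_hdfs) is a hypothetical stand-in for HDCache internals, stubbed out so the sketch compiles; it is not the actual implementation.

    #include <optional>
    #include <string>
    #include <vector>

    using Bytes = std::vector<char>;

    // Stubs standing in for the real layers; each reports a miss except HDFS.
    static std::optional<Bytes> lookup_dram(const std::string&)     { return std::nullopt; } // layer 1
    static std::optional<Bytes> lookup_peer(const std::string&)     { return std::nullopt; } // layer 2: remote cache
    static std::optional<Bytes> lookup_snapshot(const std::string&) { return std::nullopt; } // layer 2: local snapshot
    static Bytes load_from_hdfs(const std::string&)                 { return Bytes{}; }      // layer 3: HDFS
    static void  insert_dram(const std::string&, const Bytes&)      {}

    // Three-layer lookup: local DRAM first, then a peer cache or a disk
    // snapshot, and only as a last resort the HDFS DataNodes.
    Bytes fetch(const std::string& name) {
        if (auto hit = lookup_dram(name)) return *hit;                              // ~memory speed
        if (auto hit = lookup_peer(name)) { insert_dram(name, *hit); return *hit; } // ~50-100 MB/s
        if (auto hit = lookup_snapshot(name)) { insert_dram(name, *hit); return *hit; }
        Bytes data = load_from_hdfs(name);                                          // slowest, most loaded path
        insert_dram(name, data);
        return data;
    }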
IV. DESIGN AND IMPLEMENTATION

Our novel HDCache system is built on top of the Hadoop Distributed File System; the cache system and HDFS are loosely coupled. The system can be viewed as a C/S architecture: the only thing third-party applications need to do is integrate a client-side dynamic library, after which they can use the cache to access data stored in HDFS transparently and with very high performance. This section describes the key design and implementation issues of the cache system.

Architecture

The HDCache system currently aims at deployment in intranet environments within an organization, isolated from the outside by firewalls. Although security issues cannot be ignored in a real cloud system, in this paper we assume the cache system runs in a secure environment. Figure 1 describes a simple example deployment of our system. The HDCache system can be deployed on the HDFS NameNode, DataNodes and any other application systems (such as web servers) that can access HDFS through the network and need cache functions, whether the OS is Windows or Linux. The cache system contains two parts: a client dynamic library (libcache) and a service daemon. Users only need to integrate libcache into their applications, and they can access cache services that are on the same machine or connected through
networks. One cache service can serve multiple applications simultaneously. Figure 2 describes the internal architecture of the service daemon and the libcache client library.
Figure 1. Typical network topology of deployment

A cache service runs as a daemon on its host. The HDFS access module uses the HDFS API to load file contents into pre-allocated shared memory. The Shared Memory Manager (SMM) is the core module that bridges the service and the client library. The serialization module periodically writes the meta-data and all cached files into the local OS file system, forming a series of snapshots that can be used to re-establish the cache after a system crash; another function of the serialization module is to provide swap space when the cache is deployed on hosts with small RAM. The client library uses sockets (over TCP) to communicate with a cache service to exchange control messages. Although shared memory is more efficient, we choose sockets as the inter-process communication facility for control messages so that libcache can contact both local and remote cache services in the same way on heterogeneous systems. The consistency of the cache is guaranteed by a validator with comprehensive validation rules that balance resource consumption and timeliness.

Figure 2. Architecture of HDCache system
libcache Library

The HDCache system provides a client library called libcache that is integrated by upper-layer applications. libcache consists of two major components: Communication & Control and Shared Memory Access. The Communication & Control module undertakes the tasks of (1) interoperating with the HDCache service on the same host, (2) communicating with ZooKeeper [28] servers remotely, and (3) calculating the hash values of desired files and locating a specific cached file. Calculating the hash value of a file is a time-consuming operation. To avoid calculating the hash value of the same file system-wide, when a client calculates a hash value, it stores the value in the ZooKeeper servers. The ZooKeeper service can be viewed as a database that stores information as a tree in memory; it is discussed further in the System Management section below. During startup, the libcache initialization procedure contacts the ZooKeeper servers, fetches all files' hash values and stores them, ordered, in its memory. Because a hash value is usually only several bytes, storing these values on the client consumes little memory. When accessing a particular file, libcache looks up the hash value first instead of calculating it every time. HDCache is designed for a write-once-read-many file access model, which approximately holds in a cloud computing environment. Thus, when a libcache client opens a file, the file can be shared with other clients running on the same host through the same file descriptor. In this manner, even file-name string comparisons are avoided, which improves file access efficiency.
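The client-side bookkeeping can be pictured with the following sketch; the container choices and function signatures are our own illustrative assumptions (the paper does not publish libcache's internals), and the publication of newly computed hashes back to ZooKeeper is omitted.

    #include <cstdint>
    #include <map>
    #include <string>

    // Callback types are assumptions for illustration only.
    using KetamaFn      = std::uint32_t (*)(const std::string&);
    using ServiceOpenFn = int (*)(const std::string&);

    struct OpenFile { int fd; int refcount; };    // descriptor shared on this host

    class LibcacheClientState {
        std::map<std::string, std::uint32_t> hashes_; // file name -> Ketama hash, prefetched from ZooKeeper
        std::map<std::string, OpenFile> open_;        // the Client Open Files Table
    public:
        // Look the hash up first; compute (and remember) it only on a miss.
        std::uint32_t hash_of(const std::string& name, KetamaFn ketama) {
            auto it = hashes_.find(name);
            if (it != hashes_.end()) return it->second;
            std::uint32_t hv = ketama(name);          // would also be published to ZooKeeper
            hashes_.emplace(name, hv);
            return hv;
        }
        // Under write-once-read-many, clients on one host share a descriptor,
        // so repeated opens avoid even a file-name comparison in the service.
        int open_shared(const std::string& name, ServiceOpenFn service_open) {
            auto it = open_.find(name);
            if (it != open_.end()) { ++it->second.refcount; return it->second.fd; }
            int fd = service_open(name);
            open_.emplace(name, OpenFile{fd, 1});
            return fd;
        }
    };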
HDCache Service Internal

Compared with the Memcached system, in which client applications copy cached content from a service process into their own process memory space, our cache system uses shared memory, which theoretically has higher performance. Figure 3 gives an overview of the Shared Memory Manager (SMM) module. The shared memory is divided into pages of a fixed size (typically 4KB, resettable by users). The page is the basic memory allocation unit in our system. The last 4 bytes (32-bit OS) or 8 bytes (64-bit OS) of every page store the number of the file's next page, or 0x0 to mark the end of the file; i.e., all pages of a file are organized as a linked list. The first page number of a file is stored in the FirstP field of the Meta Info Map, a data structure storing the cached file's name, hash value and other information about the file. The client library fetches the first page number of a desired file from the cache service and then accesses shared memory directly for the file content. The Mem Bitmap data structure is used to manage allocated and free pages. When a client opens a cached file, libcache records the first page address of the file and returns a file descriptor to the upper-layer application. The file descriptor is related to per-client meta information about the file, such as the read/write pointer. The client uses this file descriptor to read/write the content of the file. Although a cache service can serve multiple clients simultaneously, the clients maintain their read/write pointers separately. When multiple clients share one cached file, the cache service maintains a reference counter for each opened file, whose value is the number of clients sharing the file. The client also maintains a simple Meta Info Map structure, usually called the Client Open Files Table.
Figure 3. Shared memory management of HDCache service: the Meta Info Map (FileName, KetamaHV, FirstP), the Mem Bitmap of allocated pages, and the paged shared memory accessed directly by libcache
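Reading a cached file therefore reduces to walking the page chain in shared memory. The sketch below assumes a 64-bit OS (8-byte link field) and ignores the partially filled last page; the layout constants mirror the description above, but the function itself is illustrative rather than HDCache code (page number 0 is treated as the end marker).

    #include <cstdint>
    #include <cstring>
    #include <vector>

    constexpr std::size_t kPageSize = 4096;                   // default page size, user-resettable
    constexpr std::size_t kLinkSize = sizeof(std::uint64_t);  // trailing next-page field (64-bit OS)
    constexpr std::size_t kPayload  = kPageSize - kLinkSize;

    // Follow the linked pages of one file, starting from the FirstP value
    // recorded in the Meta Info Map; a next-page value of 0x0 marks the end.
    std::vector<char> read_page_chain(const char* shm, std::uint64_t first_page) {
        std::vector<char> out;
        for (std::uint64_t page = first_page; page != 0; ) {
            const char* base = shm + page * kPageSize;
            out.insert(out.end(), base, base + kPayload);     // copy payload bytes
            std::uint64_t next;
            std::memcpy(&next, base + kPayload, kLinkSize);   // read the link field
            page = next;
        }
        return out;
    }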
When a client requests a file and the cache misses, the cache service fetches the file from other cache services, local snapshots or HDFS. When the cache service loads the fetched file into its shared memory, if the memory does not have enough free pages, the service uses the LRU (Least Recently Used) algorithm to evict files that have not been used for a long time. The statistics for the LRU algorithm are also stored in the Meta Info Map. When applying LRU in memory, an evicted file can be swapped to the local disk; if the local disk does not have enough space, LRU can also be applied to the local disk to delete unused files. Each file loaded into the cache system has an expiry time that can be set by the client. When the expiry time fires, the consistency validator decides whether to run a consistency check according to network traffic, system workload, and the file's access frequency. Infrequently accessed files do not undergo a consistency check until they are accessed the next time, while frequently accessed files are checked in the background in order to keep the latest version. When a large number of files expire at the same time, a random delay is added to each consistency checkpoint in order to avoid an unexpected burst of network traffic. While the HDCache service processes consistency validation, the client can still access a file from the cache; however, the latest version of the file will not be seen until the validation completes. The client library also provides functions to trigger immediate validation, either for a whole cache service or for a particular cached file.
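A dispersed check of this kind can be sketched as follows; the delay window derived from access frequency is our own illustrative choice, since the paper states the policy but not its constants.

    #include <chrono>
    #include <random>

    using Clock = std::chrono::steady_clock;

    // Decide when an expired file is actually revalidated: hot files are
    // checked almost immediately, cold files within a wider random window,
    // so that mass expiry does not become a burst of network traffic.
    Clock::time_point next_validation(Clock::time_point expired_at, double accesses_per_hour) {
        static thread_local std::mt19937 rng{std::random_device{}()};
        double window_s = accesses_per_hour > 0.0 ? 3600.0 / accesses_per_hour : 3600.0;
        std::uniform_real_distribution<double> jitter(0.0, window_s);
        return expired_at + std::chrono::duration_cast<Clock::duration>(
                                std::chrono::duration<double>(jitter(rng)));
    }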
Distributed Storage Process

HDFS introduces replicas to enhance system robustness: by default, each file stored in HDFS has three replicas on different machines, which minimizes the impact of a machine crash. Our cache system adopts the same design, so every cached file has three replicas.

    Algorithm: Cache_File
    Input: file name, local node IP address
    Output: the file stored in specific cache services
    Procedure Cache_File(filename, ipaddr)
    1. Store the file in NodeLocal.
    2. File_HV := KetamaFunc(filename)
       NodeLocal_HV := KetamaFunc(ipaddr)
       Find NodeB whose hash value is clockwise equal to or greater than File_HV.
       If NodeB_HV = NodeLocal_HV then
           find NodeB' whose hash value is clockwise greater than NodeLocal_HV, and Node2 := NodeB'
       else
           Node2 := NodeB
       Store the file in Node2.
    3. Node2C_HV := 2^32 - Node2_HV
       If Node2C does not exist or Node2C_HV = Node2_HV then
           find NodeC whose hash value is clockwise equal to or greater than Node2C_HV, and Node3 := NodeC
       else
           Node3 := Node2C
       Store the file in Node3.
    END
    Note: *_HV denotes the hash value of *.

Figure 4. Algorithm of loading a file into the distributed cache
Compared with the HDFS single-NameNode design, our cache chooses a DHT (Distributed Hash Table) instead. DHTs are widely used in P2P system design and have also been introduced into cloud storage; a typical example is Amazon's Dynamo [24], an important part of Amazon's Elastic Compute Cloud. The key concept of a DHT is consistent hashing, which maps a key to a position on a 0 to 2^32 continuum. Quite a few hash functions can be used to implement consistent hashing, such as FNV [25], CRC hash [26]
and Ketama hash [27]. We choose the Ketama hash function not only because it has an open-source implementation but also because it balances computing performance, hit ratio and dispersity. Figure 4 describes the distributed file storage scheme. The procedure Cache_File is called when the cache service receives a file access request and cannot find any replica in the distributed cache system. The cache service calculates the Ketama hash value of the file and finds the other two counterparts on the continuum, then stores the file in these cache nodes as three replicas. Figure 5 demonstrates the algorithm in a typical circumstance. When Node A caches File 1, one replica is stored in the local cache service on Node A; another replica is stored on Node B, whose hash value is clockwise equal to or greater than File 1's hash value on the continuum; and the last replica is stored on the complement node of B if it exists, or on the node clockwise greater than B's complement (Cache X). When a file access request arrives at a cache service, the service retrieves the file from its DRAM; if it misses, the cache service contacts the other two counterparts for the file, or loads it from a local disk snapshot if one exists. If this procedure still cannot obtain the desired file, the cache service loads the file from HDFS and tells the counterparts to store replicas if possible.
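The continuum lookup behind Cache_File can be expressed compactly with an ordered map. The sketch below is a simplified reading of the procedure in Figure 4 (virtual nodes and the degenerate cases of very small rings are ignored), and the FNV-1a hash is only a stand-in for the real Ketama implementation [27]; the ring is assumed to hold {KetamaFunc(ip), ip} entries for all cache nodes.

    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    // Stand-in 32-bit hash (FNV-1a); HDCache uses the Ketama hash function.
    std::uint32_t KetamaFunc(const std::string& key) {
        std::uint32_t h = 2166136261u;
        for (unsigned char c : key) { h ^= c; h *= 16777619u; }
        return h;
    }

    struct Continuum {
        std::map<std::uint32_t, std::string> ring;   // node hash -> node address (non-empty)

        // First node clockwise at or after hv, wrapping past 2^32.
        const std::string& at_or_after(std::uint32_t hv) const {
            auto it = ring.lower_bound(hv);
            return (it != ring.end()) ? it->second : ring.begin()->second;
        }
        // First node strictly after hv.
        const std::string& after(std::uint32_t hv) const {
            auto it = ring.upper_bound(hv);
            return (it != ring.end()) ? it->second : ring.begin()->second;
        }
        // The three replica holders of Cache_File: the local node, the file's
        // clockwise successor, and the node at the complement 2^32 - hash(Node2).
        std::vector<std::string> replicas(const std::string& file, const std::string& local_ip) const {
            std::string node2 = at_or_after(KetamaFunc(file));
            if (node2 == local_ip) node2 = after(KetamaFunc(local_ip));   // skip the local node
            std::uint32_t comp = static_cast<std::uint32_t>(0u - KetamaFunc(node2)); // 2^32 - hv (mod 2^32)
            std::string node3 = at_or_after(comp);
            return {local_ip, node2, node3};
        }
    };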
Figure 5. Novel DHT-based distributed file storage scheme

System Management & User Interface

The coordination and management of multiple processes has always been a great challenge in distributed system design and implementation, although the issue has been studied for decades. Our system chooses ZooKeeper [28][29] as the distributed system management infrastructure. ZooKeeper is a centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services. Compared with its counterpart, Google's Chubby [30] distributed lock service, ZooKeeper provides a richer set of features. ZooKeeper is based on the Zab [31] atomic broadcast protocol, by default using simple majority quorums to decide on a proposal; thus ZooKeeper works as long as a majority of servers are correct (i.e., with 2f + 1 servers we can tolerate f failures). When the cache service starts, it registers itself with a ZooKeeper server, recording its IPs, the Ketama hash values of its IPs, and other information. Any component integrated with the ZooKeeper client library can contact the ZooKeeper servers to acquire global information, perceive other members' status and/or coordinate with other processes. As mentioned above, the Ketama hash values of all visited files are stored in the ZooKeeper servers; thus, in distributed storage, most hash calculations become unnecessary once the system has been running long enough. Introducing ZooKeeper to the HDCache system is not only beneficial to component management but also conducive to performance improvement.

HDFS provides Posix-like C/C++ APIs (libhdfs) [32] for easy integration with third-party applications. In order to eliminate differences in API usage, the user interface of the cache system is designed as Posix-like file system operations, even though internally the data are stored in a key-value fashion. Table I lists some typical APIs of the client library and their meanings.

TABLE I. IMPORTANT APIS OF THE CLIENT LIBRARY

API | Description
cache_connect() / cache_disconnect() | connect (disconnect) libcache to the local HDCache service
cache_fopen(filename, mod) | open a cached file; if it is not cached, the cache service loads it into RAM
cache_fread(fd, buffer, size) / cache_fwrite(fd, buffer, size) | read/write a cached file
cache_fclose(fd) | close a cached file
cache_finfo(filename) | descriptive information about a cached file
cache_status() | current status of the cache service process
cache_validate() / cache_fvalidate(filename) | trigger cache or file consistency validation manually
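A typical client session then looks like the following sketch. The header name and the exact return types are assumptions on our part (Table I gives only the call names), and error handling is reduced to the bare minimum.

    #include "libcache.h"   // assumed header shipped with libcache
    #include <cstdio>

    int main() {
        if (cache_connect() != 0) return 1;             // attach to the local HDCache service

        int fd = cache_fopen("user-1234.profile", "r"); // loaded from a replica or HDFS on a miss
        char buf[4096];
        int n;
        while ((n = cache_fread(fd, buf, sizeof buf)) > 0)
            std::fwrite(buf, 1, n, stdout);             // consume the cached bytes
        cache_fclose(fd);

        cache_fvalidate("user-1234.profile");           // optional: force a consistency check
        cache_disconnect();
        return 0;
    }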
V. EVALUATION
We implemented the novel cache system on both Linux (64-bit) and Windows (64-bit); we chose 64-bit OSes because they are currently the mainstream in commercial environments and provide a large memory address space. Taking performance into account, we implemented the system in C++. We set up a test bed consisting of twenty-one servers running Ubuntu 11.04 64-bit, each with four Intel Xeon E5620 2.6GHz CPUs (16 cores), 16GB of DRAM, four 7200RPM 2TB SATA hard disks and a 1000Mbps Ethernet card. Twenty of these machines are configured as DataNode servers storing files sized from 4KB to 10GB, and the remaining one is configured as the NameNode server. The storage capacity of the cluster is thus about 160TB, and the in-use data volume is about 20TB. On every DataNode we deploy a cache service; the cache client (a multi-threaded tester integrated with the libcache library) is also deployed on the DataNodes.
Basic Performance Benchmark

The first test measures the performance of the cache APIs. We use a client with a single thread running each API many times. In real-time cloud services, an individual's personal data seldom exceed 10MB, so we cache files sized from 4KB to 10MB in this test. In order to compare with other systems, we also benchmark the performance of Redis and Memcached. Table II shows the typical performance of the important cache operations for different file sizes. In the basic performance benchmark, we omit hash calculation and network traffic overheads for the following reasons. (1) When a file is cached, its hash value is also cached, so the value does not need to be calculated twice. (2) Because Memcached and Redis cannot schedule cached content over the network, we only benchmark performance for files that are already cached. (3) When benchmarking large-file storage performance, if the file is missing, the bottleneck is network I/O.

TABLE II. BASIC PERFORMANCE OF CACHE SYSTEM

API | Operation Time (ms) | Operations per Second
cache_fopen | 0.023 | 43,478
cache_fclose | 0.021 | 47,619
cache_fread (10MB) | 7.69 | 130
cache_fwrite (10MB) | 16.45 | 61
cache_fread (4KB) | 0.004 | 250,000
cache_fwrite (4KB) | 0.012 | 83,333
cache_finfo | 0.16 | 6,250

The cache_fread operation reads cached content into a client application. The cache_fopen, cache_fread, cache_fclose operation sequence is equivalent to a Redis GET operation; the cache_fopen, cache_fwrite, cache_fclose sequence is equivalent to a Redis SET operation. During the benchmark, we found that Memcached does not support 10MB files, so we chose 4KB files for the Memcached benchmark. Table III shows the experimental results for Redis and Memcached.

TABLE III. PERFORMANCE OF REDIS AND MEMCACHED

System & Operation | Operation Time (ms) | Operations per Second
Redis GET (10MB) | 12.61 | 79
Redis SET (10MB) | 17.13 | 58
Redis GET (4KB) | 0.039 | 25,641
Redis SET (4KB) | 0.051 | 19,608
Memcached GET (4KB) | 0.034 | 29,412
Memcached SET (4KB) | 0.038 | 26,316

Compared with Table II, the write (SET) performance of the three systems is at the same level (about 0.04ms for 4KB files and about 17ms for 10MB files), while the read (GET) performance differs. For 10MB files, our cache completes 123 GET sequences (including cache_fopen, cache_fread and cache_fclose) per second, which is 56% faster than Redis (79 operations per second). The results indicate that the three systems perform almost equally on small files, while on big files our cache gains better performance.

HBase is considered an alternative for situations that need high GET/PUT performance. In order to compare HDCache with HBase, we benchmarked HBase in a tiny cluster that contains only one HMaster and one Region Server. This configuration guarantees that all operations complete on the local host when the benchmark tools run on the Region Server. Even with this strong constraint, network traffic is still unavoidable, because the HMaster is involved in the process of reading and writing data. Table IV shows the experimental results for HBase.

TABLE IV. PERFORMANCE OF HBASE IN A TINY CLUSTER

Operation | Operation Time (ms) | Operations per Second
HBase Cached GET (4KB) | 0.337 | 2,967
HBase PUT (4KB) | 0.034 | 29,411
HBase Cached GET (10MB) | 204.8 | 4.88
HBase PUT (10MB) | 985.6 | 1.015

The HBase client provides a cache function: when data are queried from a table, the client caches them for further use, which is semantically consistent with our cache system. Thus, in this experiment, data are cached first. The experimental results show: (1) For 4KB files, our HDCache GET sequence (cache_fopen, cache_fread and cache_fclose, about 0.048ms) is about 7 times faster than the HBase cached GET operation; the reason may be that HBase must extract the 4KB of data from a cached column of the table. (2) For 4KB files, the HBase PUT (similar to SET) operation is a bit faster than the HDCache SET sequence (cache_fopen, cache_fwrite and cache_fclose, about 0.056ms); the results suggest the data are stored in the HBase client memory buffer before being streamed to a database table. (3) For 10MB files, HBase performance declines dramatically: our HDCache is 26 times faster for the GET operation and 60 times faster for the SET operation. Quite a few NoSQL databases, such as MongoDB [34], share the same design principle as HBase; they are inclined to combine small data fragments with sizes from several KB to several hundred KB, and when dealing with data spanning a wider size range, these systems may perform worse.
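For reference, single-thread latencies of the kind reported in Tables II-IV can be reproduced with a loop of the following shape; this is a sketch under the same assumed libcache signatures as above, timing the GET-equivalent sequence and averaging over many iterations.

    #include "libcache.h"   // assumed libcache header
    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
        const int iters = 10000;
        std::vector<char> buf(10 * 1024 * 1024);    // 10MB, the largest size tested

        cache_connect();
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < iters; ++i) {           // GET = fopen + fread + fclose
            int fd = cache_fopen("bench-10MB.bin", "r");
            cache_fread(fd, buf.data(), buf.size());
            cache_fclose(fd);
        }
        auto t1 = std::chrono::steady_clock::now();
        cache_disconnect();

        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count() / iters;
        std::printf("avg GET latency: %.3f ms (%.0f ops/sec)\n", ms, 1000.0 / ms);
        return 0;
    }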
Throughput on Concurrency

We evaluate the impact on throughput when a multi-threaded client concurrently accesses the cache system. In this test case, the cache service caches 2GB of files of different sizes, and the number of concurrent threads is varied from 1 to 200. Figure 6 shows the read efficiency under concurrency. The read efficiency declines as the number of client threads increases; however, when the number of threads exceeds 90, the efficiency fluctuates within a constant range. The read efficiency stays above 1GB/s with up to 200 client threads.
Figure 6. Read efficiency for different numbers of concurrent threads

Figure 7 shows cache_fopen performance under concurrency. The response time of the cache_fopen operation grows with the number of client threads and fluctuates within a constant range once the thread count exceeds 30; it stays within 10ms with up to 200 client threads. The concurrency tests show that our cache system maintains very high performance in both the service and the client library in multi-threaded environments.

Figure 7. cache_fopen operation response time for different numbers of concurrent threads

Simulation of Hit Ratio

Our cache system aims at storing personal data in real-time cloud services. We assume that one user has 10MB of personalized data, because in real-world applications a user cannot manage large amounts of data under network bandwidth and timing constraints. In an industrial cloud computing environment, in order to improve the efficiency of the services and reduce the workload on the system, we try to avoid scheduling personal data from one server to another, hoping that a user always logs into a specific server as long as it works. We assume the critical personal data are the profile and business model, about 10MB in size. Considering a normal commercial server with 48GB of RAM, a configuration in which one cache service holds 2,000 users' data is a sound choice. Theoretically, if access were uniformly distributed, cache memory size and hit ratio would be approximately linearly related: to serve 2,000 users, the cache would need 20GB of memory to achieve a 100% hit ratio. However, user access in cloud services has particular characteristics, and we use the following scenario to simulate a user access model in real-time cloud services (a runnable sketch follows the list).

a) On a single cache service node, 2,000 users access their personal data in a day, and the total access count is 100,000 (on average 50 per user per day).

b) Every user's access count is a random number between 5 and 500; in a real system, a user's access count is seldom outside this range.

c) The time at which a specific user first accesses the cache is random, but once a user begins to access the cache system, the user's next access follows the previous one after a short random delay, which we simulate with a random number between 1 and 250. That is, the access sequence of a specific user is neither uniform nor consecutive.
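The scenario can be replayed against a simple LRU model with the sketch below. Capacity is counted in 10MB user-model units; the per-user access counts are drawn from the stated [5, 500] range (so the total differs from the nominal 100,000 unless rescaled), and every constant not given in the scenario is illustrative.

    #include <algorithm>
    #include <cstdio>
    #include <list>
    #include <random>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    struct Lru {                                       // plain LRU over user ids
        std::size_t capacity;                          // in 10MB user-model units
        std::list<int> order;                          // most recently used at front
        std::unordered_map<int, std::list<int>::iterator> pos;
        bool access(int user) {                        // returns true on a hit
            auto it = pos.find(user);
            bool hit = (it != pos.end());
            if (hit) order.erase(it->second);
            else if (pos.size() >= capacity) { pos.erase(order.back()); order.pop_back(); }
            order.push_front(user);
            pos[user] = order.begin();
            return hit;
        }
    };

    int main() {
        std::mt19937 rng(42);
        std::uniform_int_distribution<int>  n_acc(5, 500);     // accesses per user (scenario b)
        std::uniform_int_distribution<int>  gap(1, 250);       // delay to the next access (scenario c)
        std::uniform_int_distribution<long> start(0, 1000000); // random first-access time (illustrative)

        std::vector<std::pair<long, int>> events;              // (time, user id)
        for (int u = 0; u < 2000; ++u) {                       // 2,000 users (scenario a)
            long t = start(rng);
            for (int i = 0, k = n_acc(rng); i < k; ++i) { events.push_back({t, u}); t += gap(rng); }
        }
        std::sort(events.begin(), events.end());               // replay in time order

        for (std::size_t gb = 2; gb <= 20; gb += 2) {          // 1GB holds 100 users' 10MB models
            Lru cache{gb * 100};
            long hits = 0;
            for (const auto& e : events) hits += cache.access(e.second);
            std::printf("%2zu GB -> hit ratio %.1f%%\n", gb, 100.0 * hits / events.size());
        }
        return 0;
    }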
Figure 8. Hit ratio of a single node for different cache memory sizes

Figure 8 shows the relationship between cache size and hit ratio in this scenario. We find that one cache service needs only 2.4GB of memory (12% of 20GB) to achieve a 90% hit ratio. We also extend the simulation to a 20-node distributed cache system. In this scenario, the number of users is set to 40,000 and the file access count is set to 200,000, twenty times the scale of the single-node test above. In this simulation we study the relationship between the hit ratio and the number of file replicas. First, we use a one-replica scheme: users arrive at the cloud system in a random manner and all personal data requests are randomly routed to the cache. Because the requests are dispersed over the whole system, the hit ratio declines dramatically. We then repeat the test with the replica number increased to 3. Figure 9 shows that with the three-replica algorithm introduced in this paper, the hit ratio increases by about 10% compared with the one-replica method. The three-replica method decreases the probability of cache misses and fits the personal data access model of a real-time cloud service.
Figure 9. File replica count affects hit ratio

VI. CONCLUSIONS

More and more real-time services are coming forth on the Internet under cloud computing paradigms, and industry tends to use mature technologies such as Hadoop to build real-time cloud computing systems. To overcome the performance shortcomings of the Hadoop Distributed File System (HDFS), this paper describes a novel distributed cache system built on top of HDFS, named HDCache. The HDCache system uses shared memory as its infrastructure, which on one hand can deal with files spanning a wide range of sizes, and on the other hand still delivers very high performance even when shared by a large number of client application threads. We have introduced a DHT into the design of the distributed cache and improved data storage so that every cached file has three replicas stored on different cache nodes. This improvement makes most file operations complete in local memory or through network I/O between the cache services, which greatly reduces the frequency of access to HDFS, alleviates the workload of the HDFS NameNode, and improves the performance of the entire cloud computing system.

Currently, this work is still a first step toward bridging high-throughput cloud computing and real-time cloud computing, and further studies are ongoing in several directions. The HDCache system is still based on the classic write-once-read-many data access model of the cloud, whereas many real-time services have more complex and dynamic data access models; how to introduce transactional functions needs in-depth study. There are also many other techniques in cloud computing, such as NoSQL databases, the MapReduce framework and distributed data warehouses; how to analyze their real-time characteristics and improve them for real-time usage is still full of challenges.

ACKNOWLEDGMENTS

This work is supported by the National High Technology Research and Development Program of China (863 Program) under Grant No. 2012AA011005, the National Natural Science Foundation of China (NSFC) under Grants No. 61005044 and 60975034, and the Fundamental Research Funds for the Central Universities under Grant No. 2011HGZY0003.

REFERENCES

[1] Apache Hadoop. Available at http://hadoop.apache.org.
[2] Apache Hadoop Distributed File System. Available at http://hadoop.apache.org/hdfs.
[3] S. Ghemawat, H. Gobioff and S. T. Leung, "The Google File System", in Proc. of the Nineteenth ACM Symposium on Operating Systems Principles (SOSP'03), Lake George, New York, 2003, pp. 29-43.
[4] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", in Proc. of the 6th Conference on Symposium on Operating Systems Design & Implementation (OSDI'04), Berkeley, CA, USA, 2004, pp. 137-150.
[5] J. Shafer and S. Rixner, "The Hadoop distributed filesystem: balancing portability and performance", in 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS 2010), White Plains, NY, March 2010, pp. 122-133.
[6] J. Han, Y. Zhong, C. Han and X. He, "Implementing WebGIS on Hadoop: A case study of improving small file I/O performance on HDFS", in IEEE International Conference on Cluster Computing and Workshops (CLUSTER'09), New Orleans, LA, 2009, pp. 1-8.
[7] L. Jiang, B. Li and M. Song, "The optimization of HDFS based on small files", in 3rd IEEE International Conference on Broadband Network and Multimedia Technology (IC-BNMT 2010), Beijing, 2010, pp. 912-915.
[8] G. Mackey, S. Sehrish and J. Wang, "Improving metadata management for small files in HDFS", in 2009 IEEE International Conference on Cluster Computing and Workshops (CLUSTER'09), New Orleans, Sept. 2009, pp. 1-4.
[9] J. Xie, S. Yin, et al., "Improving MapReduce performance through data placement in heterogeneous Hadoop clusters", in 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPSW), Atlanta, April 2010, pp. 1-9.
[10] B. Dong, J. Qiu, et al., "A novel approach to improving the efficiency of storing and accessing small files on Hadoop: a case study by PowerPoint files", in 2010 IEEE International Conference on Services Computing (SCC), Miami, July 2010, pp. 65-72.
[11] H. Zhang, Y. Han, F. Chen and J. Wen, "A Novel Approach in Improving I/O Performance of Small Meteorological Files on HDFS", Applied Mechanics and Materials, vols. 117-119, Oct. 2011, pp. 1759-1765.
[12] Apache HBase. Available at http://hbase.apache.org.
[13] F. Chang, J. Dean, et al., "Bigtable: A Distributed Storage System for Structured Data", ACM Transactions on Computer Systems, vol. 26(2), 2008, pp. 205-218.
[14] K. Dana, "Hadoop HBase Performance Evaluation", unpublished. Available at http://www.cs.duke.edu/~kcd/hadoop/kcdhadoopreport.pdf.
[15] J. Ousterhout, P. Agrawal, et al., "The Case for RAMClouds: Scalable High-Performance Storage Entirely in DRAM", ACM SIGOPS Operating Systems Review, vol. 43(4), Jan. 2010, pp. 92-105.
[16] J. Ousterhout, P. Agrawal, et al., "The case for RAMCloud", Communications of the ACM, vol. 54(7), July 2011, pp. 121-130.
[17] Memcached: a distributed memory object caching system. Available at http://www.danga.com/memcached.
[18] J. Petrovic, "Using Memcached for Data Distribution in Industrial Environment", in Third International Conference on Systems (ICONS 08), Cancun, 2008, pp. 358-372.
[19] S. Zhang, J. Han, Z. Liu and K. Wang, "Accelerating MapReduce with Distributed Memory Cache", in 15th International Conference on Parallel and Distributed Systems (ICPADS 09), Shenzhen, 2009, pp. 472-478.
[20] C. Xu, X. Huang, N. Wu, P. Xu and G. Yang, "Using Memcached to Promote Read Throughput in Massive Small-File Storage System", in 9th International Conference on Grid and Cooperative Computing (GCC), Nanjing, 2010, pp. 24-29.
[21] Redis. Available at http://code.google.com/p/redis.
[22] D. Borthakur, et al., "Apache Hadoop goes realtime at Facebook", in Proc. of the 2011 International Conference on Management of Data (SIGMOD'11), New York, 2011.
[23] S. Ekanayake, J. Mitchell, Y. Sun and J. Qiu, "Memcached Integration with Twister", unpublished. Available at http://salsahpc.org/CloudCom2010/EPoster/cloudcom2010_submission_264.pdf.
[24] G. DeCandia, D. Hastorun, M. Jampani, et al., "Dynamo: Amazon's highly available key-value store", in Proc. of the Twenty-First ACM SIGOPS Symposium on Operating Systems Principles (SOSP'07), Stevenson, Washington, USA, October 2007, pp. 205-220.
[25] G. Fowler, L. Noll, K. Vo and D. Eastlake, "The FNV Non-Cryptographic Hash Algorithm", IETF draft, 2011. Available at http://tools.ietf.org/html/draft-eastlake-fnv-03.
[26] W. Ehrhardt, "CRC Hash". Available at http://www.wolfgangehrhardt.de/crchash_en.html.
[27] Ketama Hash. Available at http://www.audioscrobbler.net/development/ketama.
[28] P. Hunt, M. Konar, F. P. Junqueira and B. Reed, "ZooKeeper: wait-free coordination for internet-scale systems", in Proc. of the 2010 USENIX Annual Technical Conference, Boston, MA, June 23-25, 2010.
[29] Hadoop ZooKeeper. Available at http://zookeeper.apache.org/.
[30] M. Burrows, "The Chubby lock service for loosely-coupled distributed systems", in Proc. of the 7th Symposium on Operating Systems Design and Implementation (OSDI'06), 2006, pp. 335-350.
[31] B. Reed and F. P. Junqueira, "A simple totally ordered broadcast protocol", in LADIS'08: Proc. of the 2nd Workshop on Large-Scale Distributed Systems and Middleware, New York, NY, USA, 2008, pp. 1-6.
[32] Hadoop Distributed File System C/C++ APIs Document. Available at http://hadoop.apache.org/common/docs/current/libhdfs.html.
[33] The Apache Cassandra Project. Available at http://cassandra.apache.org/.
[34] The Mongo Database Project. Available at http://www.mongodb.org/.
[35] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan and R. Sears, "Benchmarking cloud serving systems with YCSB", in Proc. of the 1st ACM Symposium on Cloud Computing (SoCC'10), Indianapolis, IN, 2010, pp. 143-154.
[36] R. Hecht and S. Jablonski, "NoSQL Evaluation: A Use Case Oriented Survey", in 2011 International Conference on Cloud and Service Computing (CSC 2011), Hong Kong, pp. 336-341.
[37] L. A. Barroso, J. Dean and U. Holzle, "Web search for a planet: The Google cluster architecture", IEEE Micro, vol. 23(2), April 2003, pp. 22-28.