RECENT ADVANCES in SOFTWARE ENGINEERING, PARALLEL and DISTRIBUTED SYSTEMS
Design and Implementation of an Efficient Semi-Synchronous Replication Solution for Disaster Recovery

REKHA SINGHAL, SHREYA BOKARE, PRASAD PAWAR
Computer Networks and Internet Engineering
Centre for Development of Advanced Computing
INDIA
[email protected], [email protected], [email protected]

Abstract: - In this paper we propose the design and implementation of an efficient semi-synchronous replication solution using iSCSI for disaster recovery. We replicate data at block level for efficiency, and we exploit features of the database application to reduce the complexity and improve the performance of the disaster recovery solution. The paper presents the detailed design and implementation of the block-level semi-synchronous replication solution, and quantifies the overhead of the solution on a naïve IP SAN.

Key-Words: - Disaster recovery, IP SAN, semi-synchronous replication.

1 Introduction
As reliance on computerized data storage has grown, so has the cost of data unavailability. A few hours of downtime can cost from thousands to millions of dollars, depending on the size of the enterprise and the role of the data. With increasing frequency, companies are instituting disaster recovery plans [14] to ensure appropriate data availability in the event of a catastrophic failure or disaster that destroys a site. It is relatively easy to provide redundant server and storage hardware to protect against the loss of physical resources; without the data, however, the redundant hardware is of little use. Data replication has therefore become one of the most important factors in the disaster recovery process. Data replication can be viewed from two aspects: the replication strategy, and the level at which replication happens.

Current methods for data protection and recovery offer either inadequate protection or are too expensive in performance and network bandwidth. Traditional backup methodologies have struggled to meet the recovery point objectives (RPO) and recovery time objectives (RTO) of today's businesses: backup protects data durably, but backup and recovery are slow. Remote synchronous and asynchronous replication are more recent alternatives. Remote mirroring and synchronous replication solutions such as VERITAS Volume Manager (VxVM) [3][4] and Hitachi TrueCopy [1] keep backup data online and fully synchronized with the primary store, but they do so at a high cost in performance (write latency) and network bandwidth. Asynchronous replication solutions such as Hitachi Universal Replicator [1] and EMC SRDF [2] can reduce the write-latency penalty, but can result in inconsistent, unusable data unless write ordering across the entire data set, not just within one storage device, is guaranteed. Data managers are therefore forced to choose between two extremes: synchronized at great expense, or affordable with a day of data loss.

Replication may be performed at the application level, the host level, the appliance (controller) level, or the storage level. Application- and host-based replication such as CA XOsoft's WANSync [5] places the extra burden of data replication on the application, which may compromise the application's SLA. Appliance-based replication solutions such as the Druvaa replicator [7] are also feasible but have hardware and software dependencies. Storage-based replication solutions such as HP's Continuous Access [8] carry out replication at block level. At all of these levels, replication is possible with both synchronous and asynchronous strategies. At block level, data protection [9], [10], [11] is traditionally done using snapshots, backups, and synchronous and asynchronous replication. SnapMirror, a technology by NetApp [12], implements asynchronous mirrors of the data from a source volume to a destination volume.

In this paper we present a new approach, the CDAC-SEMISYNC solution, based on a block-level replication technique. CDAC-SEMISYNC is optimized for enterprise applications such as databases, where the semi-synchronous replication process and application features can be used to provide
negligible RPO and RTO, respectively [15]. Further, to replicate at block level we make use of the iSCSI [13] protocol for primary-site data storage and for the remote-site data replication process. Additionally, we use buffers to hold the data to be replicated, which speeds up performance.

The rest of the paper proceeds as follows. Section 2 gives the overall design of the CDAC-SEMISYNC replication solution, and Section 3 describes its implementation. The performance analysis of the CDAC-SEMISYNC solution is given in Section 4, Section 5 outlines future scope, and Section 6 concludes the paper.

2 Design of CDAC-SEMISYNC
We consider a system consisting of a primary site and a remote site, each with application servers, database servers, and storage, as shown in Figure 1. Data is replicated from the IP SAN of the primary site to the IP SAN of the remote site at block granularity. We interface with the iSCSI protocol to communicate block-level data from the primary to the remote site. The iSCSI command carries the data block along with its block information, such as LBA, LUN, and length.

In a naïve approach, block-level replication may be done by storing the data blocks on the primary disk along with the block information in a log file, and then replicating across by referring to the log information and reading from the primary disk. CDAC-SEMISYNC instead uses a buffer to hold the data to be replicated to the remote site, avoiding the overhead of disk access. If the buffer is full and the replication link is slow, the block information is logged into a temporary log file and the data is replicated to the remote site later.

As shown in Figure 1, the CDAC-SEMISYNC solution has three main components: the writing module, the sending module, and the decision module. The writing module writes data at the primary site and sets the respective variables for the sending module. The sending module sends data to the remote site. Since data can be written to, and sent from, either the buffer or the primary disk (referred to by the log file), we use a decision module. The decision module decides, for the writing and sending modules, where to store data and from where to read data for replication to the remote site.

Fig. 1: Design of the CDAC-SEMISYNC solution
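To make this buffer-versus-log-file decision concrete, the following C sketch shows one way the write path described above could look. It is a minimal illustration under our reading of the design; all identifiers (repl_buffer_t, stage_for_replication, the 1 MB staging size) are our own assumptions, not symbols from the actual implementation.

/* Sketch: each incoming block is written to the primary disk first; its
 * copy for replication goes to the in-memory buffer when there is room,
 * otherwise only its block information (LUN, LBA, length) is appended to
 * the temporary log file for later disk-based replication. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    uint32_t lun;   /* target device */
    uint64_t lba;   /* logical block address */
    uint32_t len;   /* length in bytes */
} block_info_t;

typedef struct {
    char   data[1 << 20];  /* replication staging area (assumed 1 MB) */
    size_t used;
} repl_buffer_t;

/* Returns 0 if the block was staged in the buffer, 1 if it was deferred
 * to the temporary log file. */
int stage_for_replication(repl_buffer_t *buf, FILE *temp_log,
                          const block_info_t *info, const void *block)
{
    if (buf->used + info->len <= sizeof(buf->data)) {
        memcpy(buf->data + buf->used, block, info->len);
        buf->used += info->len;
        return 0;
    }
    /* Buffer full: record only the block info; the sending module will
     * re-read the block from the primary disk later. */
    fwrite(info, sizeof(*info), 1, temp_log);
    return 1;
}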
3 Implementation of CDAC-SEMISYNC
The CDAC-SEMISYNC modules are implemented as threads named after the respective modules: the writing thread, the sending thread, and the decision thread. Mutexes provide exclusive access to the resources shared by the threads. Figure 2 shows the complete flow of our solution. CDAC-SEMISYNC is initialized along with the iSCSI target; during this initialization, the solution starts all three threads. We use a buffer as a data structure to store block data and related information, bypassing the reading of data blocks from the disk during replication. The buffer structure contains the following fields: a data buffer that stores the data blocks; LUN, the device to which the blocks are written; LBA, the logical block address of the block; and LEN, the length of the block. We use two buffers so that different threads can write to one buffer and send from the other in parallel. Each buffer is implemented as a FIFO linked list: data is inserted at the tail of the list and extracted from the head to be sent to the remote site. Since the buffers have a maximum size, a temporary log file is used alongside them to store block information (LBA, LEN) when the buffers overflow. A sequence list records the order in which data is written to the buffers and the temporary file; it is used to maintain ordered delivery of data at the remote site. Data is read from the buffer or from disk (via the log file) following this sequence list and sent to the remote site. In addition, flag variables are used to signal the threads. Table 1 describes the resources used in the implementation of the CDAC-SEMISYNC solution. The implementation of the individual threads is given below.
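Before turning to the individual threads, the following minimal C sketch shows a buffer entry with the LUN/LBA/LEN fields and the FIFO linked-list insert and extract operations described above. The type and function names are our own illustrative assumptions.

/* Sketch: blocks are inserted at the tail by the writing thread and
 * extracted from the head by the sending thread, preserving write order. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct buf_node {
    void            *data;   /* copy of the data block */
    uint32_t         lun;    /* device the block belongs to */
    uint64_t         lba;    /* logical block address */
    uint32_t         len;    /* length of the block */
    struct buf_node *next;
} buf_node_t;

typedef struct {
    buf_node_t *head;  /* extraction point (oldest block) */
    buf_node_t *tail;  /* insertion point (newest block) */
} fifo_t;

void fifo_push(fifo_t *q, const void *data, uint32_t lun,
               uint64_t lba, uint32_t len)
{
    buf_node_t *n = malloc(sizeof(*n));
    n->data = malloc(len);
    memcpy(n->data, data, len);
    n->lun = lun; n->lba = lba; n->len = len; n->next = NULL;
    if (q->tail) q->tail->next = n; else q->head = n;
    q->tail = n;
}

/* Caller frees node->data and the node after sending it to the remote site. */
buf_node_t *fifo_pop(fifo_t *q)
{
    buf_node_t *n = q->head;
    if (n) {
        q->head = n->next;
        if (!q->head) q->tail = NULL;
    }
    return n;
}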
Fig. 2: Design flow of the CDAC-SEMISYNC solution
Writing Thread: When data arrives at the primary site, it is first written to disk and the Writing Thread is invoked. The Writing Thread then checks the current sequence number from the sequence list and writes the data accordingly.

Decision Thread: Depending on the conditions given in Table 2, the Decision Thread changes the sequence number and sets the flag value so that the Sending Thread can continue sending data to the remote site. When the flag value is zero, it is set to the value at the head of the sequence list; the head of the sequence list is then deleted and the new flag value is broadcast to the Sending Thread. It may also happen that no data arrives at the Writing Thread for a long time. The Decision Thread then checks whether the flag value is zero; if it is, it changes the state of the flag using the sequence list and broadcasts a signal to the Sending Thread.

Sending Thread: The responsibility of the Sending Thread is to read data from the respective buffer or from disk (using the temporary log file) and send it to the remote site.
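The hand-off between the Decision Thread and the Sending Thread can be sketched with a POSIX mutex and condition variable, matching the set-flag-and-broadcast behaviour described above. The enum values mirror the flag states listed in Table 1; all names are illustrative assumptions rather than the actual implementation's symbols.

/* Sketch: the decision thread moves the head of the sequence list into
 * `flag` and broadcasts; the sending thread sleeps while the flag is zero. */
#include <pthread.h>

enum { SEND_IDLE = 0, SEND_BUF1 = 1, SEND_BUF2 = 2, SEND_FILE = 3 };

static int             flag = SEND_IDLE;
static pthread_mutex_t flag_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  flag_cond = PTHREAD_COND_INITIALIZER;

/* Called by the decision thread when the flag is zero and the sequence
 * list is non-empty. */
void decision_publish(int next_source /* head of sequence list */)
{
    pthread_mutex_lock(&flag_lock);
    if (flag == SEND_IDLE) {
        flag = next_source;                 /* e.g. SEND_BUF1 */
        pthread_cond_broadcast(&flag_cond); /* wake the sending thread */
    }
    pthread_mutex_unlock(&flag_lock);
}

/* Called by the sending thread to learn where to read data from next. */
int sending_wait_for_source(void)
{
    pthread_mutex_lock(&flag_lock);
    while (flag == SEND_IDLE)
        pthread_cond_wait(&flag_cond, &flag_lock);
    int src = flag;
    pthread_mutex_unlock(&flag_lock);
    return src;
}

Note that the lock is held only while the flag is inspected or changed, which matches the paper's point that shared resources are locked only for condition checking.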
Table 1: Resources used in the CDAC-SEMISYNC implementation

1. current_node: written by the Decision module; read by the Writing and Decision modules. Objective: specifies the value of the current sending sequence. Values: buff1, buff2, file. Lock: Lock-1.
2. flag: written by the Decision module; read by the Writing, Decision, and Sending modules. Objective: specifies the current state of the sending module. Values: 0 = not sending, sending data from buffer1, sending data from buffer2, sending from file. Lock: Lock-1.
3. write_seq: written by the Decision module; read by the Writing and Decision modules. Objective: specifies the value of the current writing sequence. Values: buff1, buff2, file. Lock: Lock-2.
4. Max_size: written by none; read by the Writing and Sending modules. Objective: constant numeric value specifying the total buffer size (Buffer1 + Buffer2). Values: numeric value fixed initially depending on RAM size. Lock: none.
5. Send_sequence: written by the Decision module; read by the Writing, Decision, and Sending modules. Objective: the head of the linked list specifies the sending sequence and the tail specifies the writing sequence. Values: buff1, buff2, file. Lock: none.
6. Buffer1, Buffer2: written by the Writing module; read by the Sending module. Objective: store data to replicate it to the remote site. Values: integer value indicating the total size occupied by the particular buffer. Lock: none.
Table 2: Sequence flow in various circumstances. Columns: (A) Current Buffer; (B) Next Send Node; (C) Last Sending Sequence; (D) Total_size = 0 and flag = 0; (E) Total_size > Max_size; (F) Total_size > Write Max_size; (G) Flag = 0; (H) Next Write Sequence; (I) Next Entry into Send Sequence; (J) Next Flag. For each current buffer (Buffer 1, Buffer 2, file), the table enumerates the next write sequence, the next entry made into the send sequence list, and the next flag value under each condition.

Since buffers are used, there is a risk of losing buffered data during an operating system or power failure. To handle this situation, we use permanent log files, in which the Writing Thread makes an entry for every incoming data block. The fields of each record in the permanent log file are LBA, LUN, and length. Along with this, we maintain an index list that decides which data is to be sent from the permanent log file. At the start of normal sending, the Sending Thread first sends this remaining data from disk using the index and the permanent log files. The thread implementation of CDAC-SEMISYNC assigns no execution priorities, so no thread can starve: each thread waits until its condition is satisfied and then executes. Data replication to the remote site, the main task, is performed by the Sending Thread. The CDAC-SEMISYNC implementation acquires the locks on the shared resources shown in Table 1 only for condition checking and releases them immediately afterwards, preventing deadlock and starvation that could hamper remote replication.
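A minimal C sketch of such a permanent-log append follows, assuming a record holding the LBA, LUN and length fields named above; the helper name perm_log_append and the durability call are our assumptions, not the paper's code.

/* Sketch: each entry is flushed to stable storage so that, after a
 * power or OS failure, unsent blocks can be re-read from the primary
 * disk and replicated. */
#include <stdint.h>
#include <unistd.h>

typedef struct {
    uint64_t lba;  /* logical block address */
    uint32_t lun;  /* device */
    uint32_t len;  /* length of the block */
} perm_log_rec_t;

int perm_log_append(int log_fd, uint64_t lba, uint32_t lun, uint32_t len)
{
    perm_log_rec_t rec = { .lba = lba, .lun = lun, .len = len };
    if (write(log_fd, &rec, sizeof(rec)) != sizeof(rec))
        return -1;
    return fsync(log_fd);  /* force the entry to disk before acknowledging */
}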
4 Performance Analysis of CDAC-SEMISYNC
It is well known that replicating from a buffer improves I/O performance compared with replicating from disk. In our algorithm, as long as a buffer is free, data is replicated to the remote site through the buffer. Only when both buffers are busy is the primary disk accessed: following the temporary log file entries, data is read from the primary disk and replicated to the remote site. During data writing, whenever the Decision Thread changes the writing sequence, it also makes an entry in the send sequence linked list; this entry records where the data was actually written at that point in time. Fewer file entries in the send sequence linked list therefore indicate less direct use of the disk for replication, and hence better performance. We measured the performance improvement in throughput and IOPS using Iometer, for both the naïve approach to data replication and the CDAC-SEMISYNC replication process. As Figures 3 and 4 show, CDAC-SEMISYNC yields an average performance improvement of 224%.
Fig. 3: Performance comparison based on throughput (MB/sec) of the iSCSI target without replication, the naïve approach to data replication, and CDAC-SEMISYNC data replication, for data block sizes from 512 B to 64 KB.

Fig. 4: Performance comparison based on total I/Os per second for the same three configurations and block sizes.
In an IP SAN appliance, the CDAC-SEMISYNC solution works alongside the iSCSI target, so adding the replication component could degrade the performance of the iSCSI target itself. To calculate this impact, we compared the performance of the iSCSI target software with and without the CDAC-SEMISYNC replication solution, again using Iometer to measure throughput and IOPS. As Figures 3 and 4 show, CDAC-SEMISYNC introduces an average performance overhead of only 2.2%, which is negligible. The impact is larger at smaller data block sizes and shrinks as the block size increases. This overhead may be reduced further by using a quad-core processor instead of a dual-core processor.

5 Future Scope
Future work is to enhance replication performance on the WAN. Replication performance on the WAN can be further increased by using block-level compression or WAN acceleration methods. Block-level compression reduces the data size, which reduces the time required to send the data over the WAN and, in effect, accelerates it.
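As a sketch of how such block-level compression might be applied, the following C example compresses a single block with zlib's one-shot compress() before it would be sent over the WAN. This is an assumption about how the future work could be realized, not part of the current implementation; the block contents are a stand-in.

#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
    unsigned char block[4096];
    memset(block, 'A', sizeof(block));  /* stand-in for a data block */

    /* compressBound() gives the worst-case compressed size. */
    unsigned char out[compressBound(sizeof(block))];
    uLongf out_len = sizeof(out);

    if (compress(out, &out_len, block, sizeof(block)) != Z_OK)
        return 1;
    /* out[0..out_len) would be sent over the WAN instead of the raw block. */
    printf("4096 -> %lu bytes\n", (unsigned long)out_len);
    return 0;
}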
6 Conclusion
In this paper we have presented the design and implementation of a semi-synchronous replication solution for enterprise applications, ensuring disaster recovery with negligible RPO and RTO. Our solution uses the iSCSI protocol for data replication at block granularity with a semi-synchronous replication methodology. The semi-synchronous replication process uses buffers to speed up data replication to the remote site. Our solution guarantees that data is delivered at the remote site in the same order as at the primary site, and it adds only negligible performance overhead to a naïve IP SAN.
References:
[1] R. R. Schulman, "Disaster Recovery Issues and Solutions," white paper, Hitachi Data Systems.
[2] EMC, "SRDF Solution Family of Remote Replication Solutions," guide.
[3] VERITAS, "Performance Brief: Remote Mirroring," http://eval.veritas.com/mktginfo/products/White_Papers/High_Availability/vxvm_vrts_performance.pdf
[4] VERITAS, "A Guide to Understanding Volume Replicator: A Technical Overview of the Replication Capabilities of Veritas Storage Foundation."
[5] CA, "CA XOsoft Replication e12 and CA XOsoft High Availability r12," partner pocket guide.
[6] "Database Replication Using Generalized Snapshot Isolation," Swiss National Science Foundation grant 200021-107824.
[7] Druvaa, "Druvaa Replicator - Disaster Recovery," white paper.
[8] HP, "HP StorageWorks 4x00/6x00/8x00 Enterprise Virtual Array User Guide," white paper.
[9] M. Lu, T. Chiueh, and S. Lin, "An Incremental File System Consistency Checker for Block-Level CDP Systems."
[10] M. Lu, S. Lin, and T. Chiueh, "Efficient Logging and Replication Techniques for Comprehensive Data Protection," Proceedings of the 24th IEEE Conference on Mass Storage Systems and Technologies, pp. 171–184, 2007.
[11] W. Xiao, J. Ren, and Q. Yang, "A Case for Continuous Data Protection at Block Level in Disk Array Storages," IEEE, 2008.
[12] H. Patterson, S. Manley, M. Federwisch, D. Hitz, S. Kleiman, and S. Owara, "SnapMirror: File System Based Asynchronous Mirroring for Disaster Recovery," Network Appliance, Inc.
[13] "Implementing an IP SAN for Disaster Recovery: Using iSCSI as an...," http://www.thefreelibrary.com/Implementing+an+IP+SAN+for+disaster+recovery:+using+iSCSI+as+an...a0118109205
[14] T. Read, "Architecting Availability and Disaster Recovery Solutions," Sun Microsystems, April 2007.
[15] P. Pawar, S. Bokare, and R. Singhal, "Enterprise Storage Architecture for Optimal Business Continuity," accepted at DSDE 2010.
[16] SNIA, "iSCSI + Virtual Server = Better DR," Infostor, October 2008.
[17] Y. Wang, Z. Li, and W. Lin, "A Fast Disaster Recovery Mechanism for Volume Replication Systems," Springer-Verlag Berlin Heidelberg.