Distributed Memory Caching for the Fail Safe Computation to Improve the Grid Performance

Sabir Ismail, Abu Fazal Md Shumon, Md Ruhul Amin
Dept. of CSE, Shahjalal University of Science & Technology, Sylhet, Bangladesh
[email protected], [email protected], [email protected]

Abstract

Grid computing organizes geographically distributed resources under a single platform and lets users access this combined power. In this paper we discuss the application of a distributed memory caching system in the Grid computing environment to improve its computational performance. For our experiments we used Alchemi, a .NET-based Grid computing framework, and Memcached, a distributed memory caching system. We completed a couple of experiments in this environment, and they demonstrated two important outcomes. The first is that a distributed memory caching technique can provide fail safe computation for the Grid environment. The second is a reduction in the total computation time of Grid applications. Based on the results of these experiments, and of previous experiments completed in our distributed computing laboratory, we propose a new technique for the Grid computing environment that provides both performance improvement and a fail safe Grid computing environment.

Keywords: Alchemi, Distributed Memory Caching, Fail Safe Computation, Grid Computing, Memcached.

I. INTRODUCTION

Grid computing [1] has opened a great way for processing, calculating and manipulating large amounts of data in fields ranging from scientific research to business computing. Instead of dedicated servers and storage for each application, Grid computing enables multiple applications to share computing infrastructure, resulting in much greater flexibility, cost reduction, power efficiency, performance, scalability and availability, all at the same time [2]-[5]. In a Grid computing environment the owner node requests the execution of an application from the Grid manager node. The manager node receives the requested application and divides it into many smaller tasks. The manager creates a thread for each smaller task and then submits the threads to the executor nodes.
After successfully completing the execution of each thread, the executor nodes return the results to the manager, which stores them in the database. After all threads have completed, the manager node retrieves the results from the database, combines them and publishes the final result [6]. During the distributed execution of a Grid application, three major types of failure may happen in

the Grid environment: a. failure of an executor node, b. failure of the Grid manager node, and c. failure of both the Grid manager and executor nodes. In this paper we discuss these three issues and the corresponding strategies to provide fail safe computation in the Grid environment. For the first case, we consider the previous experimental outcomes found in the laboratory of our department [7]-[8]. For the second case we carried out new experimental work, and in this paper we mainly present the outcomes of those experiments.

In our experiment we applied the distributed memory caching technique to the Grid platform. We stored all the output of the executor nodes in the distributed cache memory, so when the manager crashed all the results were still in the distributed cache. Hence, as soon as another manager was deployed, it could make use of those outputs. We also used the distributed cache to speed up the procedure of merging the output data of the different executors. We present all these experiments using relevant tables and figures. Based on these two experiments we propose an advanced fail safe solution to minimize the loss that occurs when the Grid manager, as well as a couple of its executor nodes, crashes.

II. PRESENT ALCHEMI STRUCTURE AND MEMCACHED

A. Alchemi

In Alchemi, the owner node generates the application and creates threads. It then submits the threads to the manager node, and the manager node stores them in a pool [9]-[10]. The manager node distributes the threads to the executor nodes on a first-in first-out basis. Each executor node executes its thread and sends the result back to the manager node. When all threads are finished, the manager node combines the results and sends the final result to the owner node.

A.1 Pseudo code for current Alchemi

1. for each thread
       submit (thread)
2. //when a thread is finished, store result//
   store (threadId, result)
3. result = null
   for each thread
       result = result + getResult (threadId)
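As a point of reference for the later modifications, the submit/store/merge flow of the current procedure can be sketched as a tiny runnable model. This is our own illustrative simplification, not Alchemi code: a plain dictionary stands in for the manager's database, and each "thread" is just a callable.

```python
# Illustrative model of the current Alchemi flow: the manager submits
# each thread, stores its result keyed by thread ID, and concatenates
# the stored results in thread order once all threads have finished.

def run_application(tasks):
    results = {}                      # stands in for the manager's database
    for thread_id, task in enumerate(tasks):
        results[thread_id] = task()   # submit + execute + store(threadId, result)
    merged = ""                       # merge phase: result = result + getResult(threadId)
    for thread_id in range(len(tasks)):
        merged += results[thread_id]
    return merged

print(run_application([lambda: "3.", lambda: "14", lambda: "15"]))  # → 3.1415
```

The key weakness the paper addresses is visible here: the results dictionary lives with the manager, so a manager crash loses every stored partial result.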

B. Memcached

Memcached is a distributed memory caching system that creates in-memory key-value storage across multiple machines for small chunks of arbitrary data (strings, objects) [11]-[12]. The system uses a client-server architecture. The servers maintain a key-value associative array, effectively a giant hash table distributed across multiple machines. The clients populate this table and query it using the Memcached client API. The integration of Memcached with the Alchemi framework can be shown as follows:

Fig. 1: A setup for Grid platform with Alchemi and Memcached.
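To make the client-side distribution concrete, the sketch below models the scheme in miniature. This is not the real Memcached client API: each "server" is a plain dict, and the client hashes the key to decide which server's hash table holds it, which is how Memcached clients spread keys across servers.

```python
# Toy model of Memcached's client-side key distribution (assumption:
# simple modulo hashing; real clients often use consistent hashing).
import hashlib

class TinyMemcachedClient:
    def __init__(self, num_servers):
        self.servers = [{} for _ in range(num_servers)]   # one dict per "server"

    def _server_for(self, key):
        # hash the key deterministically to pick the owning server
        digest = hashlib.md5(key.encode()).hexdigest()
        return self.servers[int(digest, 16) % len(self.servers)]

    def set(self, key, value):
        self._server_for(key)[key] = value

    def get(self, key):
        return self._server_for(key).get(key)

cache = TinyMemcachedClient(num_servers=3)
cache.set("thread:42", "partial-result")
print(cache.get("thread:42"))   # → partial-result
print(cache.get("thread:99"))   # → None (never stored)
```

Because set and get hash the same key to the same server, no central index is needed; this is what lets the cache survive independently of the Grid manager node.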

III. PROBLEM DISCUSSION & CORRESPONDING SOLUTIONS

A.1 Scenario 1: Failure of the executor node

At present, if any of the executor nodes fails to complete its assigned job, the task is rescheduled to another executor and is executed again from its initial state. In general, if any job associated with a Grid application fails during its runtime on the executor, the scheduling algorithm simply marks the job as failed and resets it, but keeps no track of the percentage of the work that has already been done. This problem was solved at the distributed computing laboratory of our department using a task restoration point [7]-[8]. From the experimental results, we found that this technique improves the performance of the Grid environment. The technique is discussed in the next section (A.2).

Fig. 2: Failure of the executor node scenario

A.2 Solution: File based Grid thread implementation

A file based Grid thread was implemented to overcome the executor failure problem [7]-[8]. A random access file is created for each thread; it keeps track of the last resultant values of the thread before the thread failed, and contains only the last two successful outputs of the thread. The generated file is stored in the manager node and is deleted automatically after the successful completion of the thread.

Threads from the application owner are placed in a pool and scheduled for execution on the various available executors by the manager. If a thread fails, the manager reassigns it to another executor node with the same Thread ID from the thread pool. When the manager distributes the thread to an executor node, the node first checks whether there is a file whose name matches its own Thread ID. If the file is available for that thread, the node recovers the resultant value of the thread from the file and starts execution from the recovered value. If no file is found, the thread creates a new file in the manager node using the Thread ID as the filename, since the Thread ID remains the same throughout the lifetime of an application. To implement this technique, we work in the start method of the GridThread class, where the thread continuously writes its output, depending on the application. When the entire application is finished, all the files that were created are deleted.
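Before the pseudo code below, the restoration idea can be illustrated with a minimal runnable sketch. This is a hypothetical simplification in Python (the paper's implementation is in .NET, stores the last two outputs, and names the file after the Thread ID): here a single checkpoint file records the loop index and partial result, so a rerun resumes from the checkpoint instead of the start value.

```python
# Minimal sketch of the file-restoration technique (our simplification):
# checkpoint progress to a per-thread file, resume from it after a crash.
import os, json

def run_thread(thread_id, start, end, fail_at=None):
    """Sum the integers in [start, end), checkpointing after every step."""
    fname = f"thread_{thread_id}.ckpt"
    if os.path.exists(fname):                        # recover restoration point
        with open(fname) as f:
            state = json.load(f)
        i, acc = state["next"], state["acc"]
    else:
        i, acc = start, 0
    while i < end:
        if fail_at is not None and i == fail_at:
            raise RuntimeError("executor failed")    # simulated crash
        acc += i
        i += 1
        with open(fname, "w") as f:                  # write restoration point
            json.dump({"next": i, "acc": acc}, f)
    os.remove(fname)                                 # file deleted on success
    return acc

try:
    run_thread(7, 0, 100, fail_at=60)                # first executor crashes
except RuntimeError:
    pass
print(run_thread(7, 0, 100))                         # resumed run → 4950
```

The resumed run does only the remaining 40 iterations rather than all 100, which is exactly the saving the task-restoration-point technique provides.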
A.3 Pseudo code of the Grid thread implementation

public class GridThread : GThread
{
    public GridThread (String _fileName, int _startValue, int _endValue)
    {
        fileName = _fileName
        startValue = _startValue
        endValue = _endValue
    }

    //Start of start () method//
    public override void start ()
    {
        /*At first the start value is assigned as backupValue*/
        backupValue = startValue

        /*If the filename is found in the given path then call readFile ()
          to recover the restoration point, else create a restoration file
          with the given path*/
        if (fileNameExists (fileName))
            backupValue = readFile (fileName)
        else
            createFile (fileName)

        //Start of the thread Task//

        /*Each task starts from the backupValue (file restoration point)
          rather than the start value, and the task continues until the
          endValue*/

        /*Writing the results into the file each time a calculation is done*/
        writeIntoFile (fileName, value)

        //End of the thread Task//
    } //End of the start () method//
} //End of the GridThread Class//

B.1 Scenario 2: Failure of the manager node

Failure of the manager may happen if the manager is disconnected from the network or crashes for any reason. Generally, during the failure of a manager, all the executor nodes stop their ongoing work. Moreover, if the manager node and the database reside on the same computer, then as soon as the manager node fails there is no way to retrieve the output of the already processed sub-tasks. Hence the failure of the Grid manager may seriously hamper the performance of a Grid environment.

To avoid such a situation we used a distributed memory caching system. In our experiment the executor nodes send the processed results to the manager node and also store them in the distributed caching system. If the manager fails for any reason and the Grid application is run again, then for each created thread the manager node first looks in the distributed caching system for the result (if any) of the previous execution, based on the Thread ID. If found, the manager node uses this result rather than assigning the task to an executor again. We implemented this technique by integrating Memcached with Alchemi. Applying such a system not only provided fail safe computation for the Grid manager but also improved the total performance of the Grid environment. In the following sections we discuss the Grid architecture and the corresponding pseudo codes we used to develop the test bed for this experiment, and present our results using statistics and graphs.

Fig. 3: Failure of the Grid manager node scenario

B.2 Solution: Distributed memory caching system to provide fail safe computation

We used Memcached to store the thread results produced by the executors. Memcached evicts the least recently used objects first, so when the total size of all objects becomes greater than the size of the cache, not all results may still be available in Memcached. We kept the Memcached server on a node other than the manager node. Thus, in case of failure of the manager node, the thread results remained in the Memcached server, and we were able to retrieve most of the finished thread results.

Memcached is key-value storage. In our implementation we used the Thread ID as the key and the thread result as the value. Before assigning a thread to an executor, the manager first checked the Memcached key-value storage, using the Thread ID, to see whether a result object already existed. If found, the manager did not assign the thread to an executor. Otherwise, when the thread finished, an entry was put into the Memcached key-value storage with the Thread ID as the key and the result object as the value.

To fulfill our requirements using Memcached, we added code in three places in the current procedure of the Grid manager node. Firstly, before creating a thread we search for that thread's result (if any) in the Memcached server. Secondly, when the execution of a thread is finished we store the thread result in Memcached. Finally, when the application is finished we collect the thread results from Memcached.
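The three hooks just described can be modeled compactly before the pseudo code below. This is our own hedged sketch, not the Alchemi/Memcached integration itself: plain dicts named `cache` and `db` stand in for the Memcached server and the Alchemi database.

```python
# Sketch of the three proposed hooks: (1) skip threads whose result is
# already cached, (2) store each finished result in both cache and DB,
# (3) merge preferring the cache, falling back to the DB.
cache, db = {}, {}            # Memcached stand-in and Alchemi DB stand-in

def submit_threads(num_threads, execute):
    for tid in range(num_threads):
        if cache.get(tid) is not None:
            continue                          # hook 1: reuse cached result
        result = execute(tid)
        cache[tid] = result                   # hook 2: store in Memcached
        db[tid] = result                      #         and in the database

def merge(num_threads):
    out = []
    for tid in range(num_threads):
        out.append(cache.get(tid, db.get(tid)))   # hook 3: cache first
    return "".join(out)

calls = []
submit_threads(3, lambda tid: (calls.append(tid), str(tid))[1])
cache.pop(1)                                  # simulate eviction of one entry
submit_threads(3, lambda tid: str(tid))       # re-run after "manager restart"
print(merge(3))                               # → 012
```

In the re-run only the evicted thread is executed again; the other two are satisfied directly from the cache, which is the fail safe behavior measured in Table II.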
B.3 Pseudo code of the Grid manager node

class MainClass
{
    public static void main ()
    {
        /*Create threads to be executed*/
        for (i = 1 to N number of threads)
        {
            key = threadId (i)
            /*Start of the proposed Task One: Search for the thread result
              in Memcached†; if found then do not create the task thread*/
            if (DistributedMemoryCachingSystem.get (key) == null)
            {
                thread = createGridThread (i)
                GridApplication.Threads.Add (thread)
            }
            /*End of the proposed Task One*/
        }
        /*Start the application*/
        GridApplication.Start ()
    } //End of the main ()

    /*Proposed Task Two: When a thread is finished this method is called to
      store the thread output into Memcached and the Alchemi database*/
    static void ThreadFinished (Thread thread)
    {
        DistributedMemoryCachingSystem.set (threadId, thread.result ())
        AlchemiDatabase.set (threadId, thread.result ())
    } /*End of the ThreadFinished ()*/

    /*When all the executors finish their tasks this method is called*/
    static void ApplicationFinished ()
    {
        /*Collect thread results*/
        for (i = 1 to N number of threads)
        {
            key = threadId (i)
            /*Start of the proposed Task Three: Search for the result in
              Memcached; if not available then read it from the Alchemi
              database††*/
            if (DistributedMemoryCachingSystem.get (key) != null)
                result = DistributedMemoryCachingSystem.get (key)
            else
                result = AlchemiDatabase.get (key)
            /*End of the proposed Task Three*/
        }
    } /*End of the ApplicationFinished ()*/

} //End of the MainClass//

† If the thread result already exists in Memcached, there is no need to create the thread again; we can get the result from the Memcached server.
†† When all threads are finished, first try to get each individual thread result from Memcached; if a certain thread result is not available, get it from the Alchemi database.

B.4 Experimental results

As noted earlier, we developed an experimental Grid environment to see how our proposed procedure works. We executed the PiCalculator application, which generates the number of digits of PI specified by the user. The same input was used for both the existing procedure and our extended procedure. We built a homogeneous Grid computing platform using three computers.

Table I: Resource description used to set up the Grid platform

Serial no.   RAM (MB)   Processing speed (GHz)
1.           256        2.8
2.           256        2.8
3.           256        2.8
                        Total: 8.4

Executed program: PiCalculator

B.4.1 Minimizing the processing time due to fail safe computation of the Grid manager

We executed the PiCalculator and after some time we interrupted the manager node; that is, we disconnected the manager node, then reconnected the computers and restarted the application. Table II below shows the experimental data.

Table II: Time comparison between the existing Alchemi procedure and our procedure

# of    # of     Interrupted    Existing algorithm running time        # of threads   Our implementation running time
digits  threads  after # of     (HH:MM:SS.0)                           in Memcached   (HH:MM:SS.0)
                 threads        Before       After        Total                       Before       After        Total
1000    100      50             00:00:01.8   00:03:48.0   00:03:49.8   50             00:00:01.9   00:00:34.0   00:00:35.9
2000    200      100            00:03:47.7   00:09:51.0   00:13:38.7   100            00:03:21.9   00:06:09.7   00:09:31.6
3000    300      200            00:08:03.0   00:19:57.4   00:28:00.4   200            00:10:33.6   00:10:24.9   00:20:58.5
4000    400      300            00:20:02.4   00:36:15.9   00:56:18.3   300            00:21:12.4   00:16:17.0   00:37:29.4
5000    500      400            00:35:09.3   01:02:42.0   01:37:51.3   400            00:37:13.2   00:24:35.0   01:01:48.2

The table shows that the total processing time in our proposed system is much less than with the existing Alchemi procedure. This is because in the existing procedure the interruption destroyed all of the previously calculated results, so the whole application had to be executed again from the beginning, whereas our extended procedure enabled the Grid manager to reuse the previously calculated thread results. Since we could reuse the already calculated thread results even after the system failed, we can claim that the system achieves a fail safe computational strategy. From the experimental data we have plotted the following graph, which shows that our implemented procedure performs much better than the existing procedure due to its fail safe strategy.

Fig. 4: Comparison between existing and our implemented procedure based on table II data.

B.4.2 Minimizing the thread output merging time of a Grid application


In this experiment we show how much time can be saved by merging all thread results from Memcached rather than from the Alchemi database. After the completion of all executor threads, we measured the time required to merge the thread outputs using both the existing and the proposed system. Our experimental data show that merging from Memcached is less time consuming.

Table III: Merging-time comparison between the existing Alchemi procedure and our implemented procedure

# of     # of      Time to merge datasets       Time to merge datasets
digits   threads   from Alchemi database (s)    from Memcached server (s)
500      50        0.062500                     0.015625
600      60        0.578125                     0.546875
700      70        0.093750                     0.031250
800      80        0.093750                     0.031250
900      90        0.140625                     0.031250
1000     100       0.781250                     0.781250
2000     200       0.156250                     0.093750
3000     300       2.562500                     0.062500
4000     400       4.359375                     0.109375

We have plotted the above data in the following graph, which shows that Memcached contributes significantly to minimizing the time to merge thread outputs. We can also see from the figure that the more data we need to merge from the Alchemi database, the more rapidly the time increases, whereas with Memcached it remains almost the same throughout.

Fig. 5: Comparison between existing and our implemented procedure based on table III data.

C.1 Scenario 3: Failure of both the manager and an executor node

If both the manager and an executor fail in a Grid computing environment, then according to the current architecture there is no way to recover the results calculated so far or to continue the ongoing work on the other alive executors. We propose a solution for this situation based on our experimental results.

Fig. 6: Failure of the manager and the executor node scenario

C.2 Solution: Distributed file system

Suppose we have implemented both the file restoration technique and the distributed memory caching technique. Then, according to the previous discussion, we can retrieve the thread results finished so far from the Memcached server. We can also devise a system, using the distributed cache, in which the alive executors continue their ongoing tasks and put the results into the distributed cache for future retrieval. However, we would still lose all the restoration files of all the executors, since they reside in the manager node. Moreover, if a restoration file grows larger than usual, the overhead of storing it in the manager node increases. We also cannot choose a database server for the file restoration points, since that is more time consuming. Finally, if the size of the calculated thread results becomes greater than the size of the distributed cache, we need to store the older results in secondary storage until we no longer need them.

To solve this situation we propose to use the distributed memory caching technique on top of a distributed file system. The restoration points are created in the distributed file system. A distributed file system generally creates multiple copies of each of its files and distributes them over the network, ensuring high availability and fault tolerance of data. In addition, when the thread results grow larger than the distributed memory cache, the distributed file system serves as the secondary storage. In this way we can achieve not only fail safe computation but also a fault tolerant system for the Grid environment. The next two sections provide the pseudo codes for the Grid executor and manager nodes implementing this procedure:

C.3 Proposed pseudo code for the executor node using distributed file system

1. fileName = _fileName
   startValue = _startValue
   endValue = _endValue

2. if (fileExists (fileName) == true)
       backupValue = readFile (fileName)
       startValue = backupValue
   else
       createFileInDistributedFileSystem (fileName)
3. //start of the task//
   do
       //write the results into the file each time a calculation is done//
       writeIntoDistributedFileSystem (fileName, value)
   while (notReached (endValue))

C.4 Proposed pseudo code for the Grid manager node

1. for each thread
       if (DistributedMemoryCachingSystem.get (threadId) == null)
           submit (threadId)
2. //when a thread is finished, store result//
   DistributedMemoryCachingSystem.set (threadId, result)
   AlchemiDatabase.set (threadId, result)
3. //when the application is finished, merge result//
   result = null
   for each thread
       if (DistributedMemoryCachingSystem.get (threadId) != null)
           result = result + DistributedMemoryCachingSystem.get (threadId)
       else
           result = result + AlchemiDatabase.get (threadId)

IV. CONCLUSION

Grid computing is used to run large applications, and any interruption during run time can cost the whole application dearly. Using the implementation provided here, we can develop a fail safe computation strategy for the Grid manager node. Moreover, our implemented procedure improves the performance of the Grid platform to a great extent. Based on the experiments presented here, we have proposed a fail safe computation strategy for the most vulnerable situation, the failure of both the Grid manager and executor nodes. Although we have not yet completed our third experiment, it indicates that distributed memory caching on top of a distributed file system is the key to a complete fail safe solution for the Grid platform. It also opens the gateway to developing a fault tolerant Grid environment. Hence our future work is to implement

this strategy and complete the corresponding experiments.
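To make the layering proposed in Section C.2 concrete, the sketch below models a bounded in-memory cache that spills evicted thread results to a replicated "distributed file system". All names here are ours and the model is deliberately tiny: the DFS is a set of dicts standing in for replica nodes, and replication illustrates the fault tolerance the proposal relies on.

```python
# Illustrative model of the proposed tiering: a small LRU-style memory
# cache backed by a replicated file tier for spilled thread results.
from collections import OrderedDict

class TieredStore:
    def __init__(self, cache_size, replicas=3):
        self.cache = OrderedDict()                   # distributed memory cache
        self.cache_size = cache_size
        self.nodes = [{} for _ in range(replicas)]   # DFS replica nodes

    def set(self, key, value):
        self.cache[key] = value
        self.cache.move_to_end(key)                  # mark as most recently used
        if len(self.cache) > self.cache_size:        # spill the oldest entry
            old_key, old_val = self.cache.popitem(last=False)
            for node in self.nodes:                  # replicate for fault tolerance
                node[old_key] = old_val

    def get(self, key):
        if key in self.cache:                        # memory tier first
            return self.cache[key]
        for node in self.nodes:                      # fall back to any live replica
            if key in node:
                return node[key]
        return None

store = TieredStore(cache_size=2)
for tid in range(4):
    store.set(tid, f"result-{tid}")
print(store.get(0))    # spilled to the file tier → result-0
print(store.get(3))    # still in the memory tier → result-3
```

Even if the memory tier (or the manager holding it) is lost, every spilled result survives on multiple replica nodes, which is the combined fail safe and fault tolerance property the conclusion argues for.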

REFERENCES

[1] Grid Computing Info Centre, http://www.gridcomputing.com/
[2] Ian Foster and Carl Kesselman, "Globus: A metacomputing infrastructure toolkit", Journal of Supercomputer Applications, 11(2), pp. 115-128, 1997.
[3] Maozhen Li and Mark A. Baker, "The Grid: Core Technologies", Wiley, ISBN 0-470-09417-6, 2005.
[4] Ian Foster, Carl Kesselman, and S. Tuecke, "The anatomy of the Grid: Enabling scalable virtual organizations", Journal of Supercomputer Applications, 15(3), pp. 200-222, 2001.
[5] Rajkumar Buyya and Kris Bubendorfer, "Market Oriented Grid and Utility Computing", Wiley, ISBN 9780470287682, 2009.
[6] S. Venugopal, R. Buyya, and L. Winton, "A Grid Service Broker for Scheduling Distributed Data-Oriented Applications on Global Grids", Proceedings of the 2nd Workshop on Middleware in Grid Computing (MGC 04), ISBN 1-58113-950-0.
[7] Abu Awal Md. Shoeb, Md. Abu Naser Bikas, Altaf Hussain, Md. Khalad Hasan, and Md. Forhad Rabbi, "An Extended Algorithm to Enhance the Performance of the Gridbus Broker with Data Restoring Technique", Proceedings of the 2009 International Conference on Computer Engineering and Technology, Volume 01, pp. 371-375, 2009, ISBN 978-0-7695-3521-0.
[8] Altaf Hussain, Abu Awal Md. Shoeb, Md. Abu Naser Bikas, and Mohammad Khalad Hasan, "File Based GRID Thread Implementation in the .NET-based Alchemi Framework", IEEE Xplore, http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=04777784
[9] Akshay Luther, Rajkumar Buyya, Rajiv Ranjan, and Srikumar Venugopal, "Alchemi: A .NET-Based Enterprise Grid Computing System", Proc. 6th Int'l Conf. on Internet Computing (ICOMP'05), Las Vegas, USA, 2005.
[10] "Alchemi, .NET Grid Computing Framework", http://www.cloudbus.org/~alchemi/
[11] "A distributed object memory caching system", http://www.memcached.org/
[12] "Memcached - Project hosting on Google code", http://code.google.com/p/memcached/
