The 3rd International Conference on Communications and Information Technology (ICCIT-2013): Mobile and Cloud Computing, Beirut
Cloud Providers Collaboration for a Higher Service Level in Cloud Computing

Khaleel Mershad, Abdul Rahman Kaitoua, Hassan Artail, Mazen A. R. Saghir*, and Hazem Hajj
Electrical and Computer Engineering Department, American University of Beirut, Beirut, Lebanon
* Electrical and Computer Engineering Program, Texas A&M University at Qatar, Doha, Qatar
e-mails: {kwm03, aak69, ha27, hh63}@aub.edu.lb, *[email protected]

Abstract—Cloud computing is increasingly becoming a desirable and foundational element in international enterprise computing. Many companies design, develop, and offer cloud technologies; however, cloud providers still operate as lone islands. While current cloud computing models have provided the significant benefit of maximizing the use of resources within a cloud, current solutions still face many challenges, including the lack of cross-leverage of available resources across clouds, the need to move data between clouds in some cases, and the lack of an efficient global cooperation scheme between clouds. In this paper, we address these challenges by providing an approach that enables various cloud providers to cooperate in order to jointly execute common requests. Several enhancements are provided by integrating hardware acceleration with the computation services. We extend the Hadoop framework by adding provisions for hardware acceleration with Field Programmable Gate Arrays (FPGAs) within the cloud, for multi-cloud interaction, and for global cloud management. Hardware acceleration is used to offload computations when needed, or as a service within the clouds. It can provide additional sources of revenue, reduced operating costs, and increased resource utilization. We use a k-means clustering application as a case study to demonstrate the effectiveness of hardware acceleration.

Keywords—Cloud computing; hardware acceleration; multi-cloud network; cloud collaboration; Hadoop; FPGA

I. INTRODUCTION

Over the last decade, the digital information world has been experiencing an exponential growth of data. This explosion in data has presented itself as a challenge to manage and analyze, but also as an opportunity for new discoveries and solutions. These new opportunities have necessitated the availability of scalable data and computational solutions. Emerging high-tech companies such as Google, Amazon, and Facebook have developed large businesses by providing solutions and applications to collect, search, and analyze data on the WWW. The datacenters of these companies became sources of additional business revenue by making use of and leasing idle computational resources already available to the company. As a result, such datacenters came to be called "clouds", where customers can submit requests for resource usage and computational services without having to know where the processing is being conducted, and without having to own or maintain the resources. These clouds can combine thousands of servers with supplementary infrastructure for networking, storage, and cooling. The convergence of demand and outcome has produced the concept of cloud computing, which promises to benefit all parties involved in massive data operations. An important characteristic of cloud computing platforms is their ability to offer globally compliant data-intensive applications that help users search, mine, and extract data and statistics, and execute specialized functionalities within huge datasets.

One of the important current research trends in cloud computing focuses on programming models in the cloud. The MapReduce model [1], along with the Hadoop framework [2], has emerged as the leading choice for distributing computations in the cloud. The model provides abstractions that simplify the process of writing applications that access massively distributed data systems. Hadoop allocates data and processing across clusters of servers and processes the data in parallel, locally at each node. In this way, massive parallel processing can be accomplished using clusters that comprise thousands of nodes. In a standard Hadoop configuration, the master node contains the control parts of the system, such as the NameNode and the JobTracker. The NameNode is responsible for managing reads and writes over the slave nodes and keeps an in-memory lookup table of file locations in the Hadoop Distributed File System (HDFS). The JobTracker initiates the MapReduce job and tracks the processing of the job's tasks over the slave nodes. In Hadoop, when executing a job, input files are divided into data blocks that are stored on DataNodes, as directed by the NameNode. DataNodes periodically report to the NameNode with lists of the blocks they are currently storing.
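To make the MapReduce abstractions concrete, the listing below shows the canonical word-count job written against the public Hadoop MapReduce API; it is a textbook example added here for illustration, not code from the system proposed in this paper. The Map function emits a (word, 1) pair for every token in its input split, and the Reduce function (also used as a combiner for local aggregation) sums the counts per word.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts gathered for each word across all mappers.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}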
While previous cloud computing solutions have provided the significant benefit of maximizing the use of resources within a cloud, current solutions still face many challenges, including the lack of cross-leverage of available resources across clouds, the need to move data from one cloud to another when the data source and computations belong to different providers, and the lack of an efficient global cooperation system across clouds. In this paper, we aim to provide solutions to these issues by achieving two main objectives. The first is to develop a framework for multi-cloud cooperation, while the second is to efficiently integrate FPGAs with hardware acceleration capabilities into cloud datacenters. Cooperation is an important factor in improving the utilization of the various resources in distributed systems. Our design is compatible with the standard Hadoop framework and can be considered an extension to it. Our main approach is to divide the system into two major layers: an intra-cloud layer, whose scope is within a single cloud provider, and an inter-cloud
layer, which handles communication and management between different cloud providers. Our approach will benefit both the clients, since they will be able to run their jobs on more universal datasets and get more meaningful and accurate results (for such types of jobs), and the cloud providers, since they will be able to share resources, cooperate in handling extra loads, and share the corresponding revenues. With respect to our second objective, recent studies have explored the use of FPGAs in MapReduce [3, 4], where FPGAs are used to implement integrated Map/Reduce engines that include multiple application-specific Map and Reduce processors and a hardware scheduler for assigning Map/Reduce tasks to idle processors. A modified design in [5] proposed the use of both CPU workers and FPGA workers that can run in parallel to perform Map and/or Reduce operations. In our design, we describe how it is possible to exploit the data-flow characteristics of the MapReduce model and employ high-level synthesis tools to automatically generate hardware accelerators that can be implemented and dynamically loaded in FPGAs. As a result, we propose a new service model that we call hardware-acceleration-as-a-service (HAaaS). When a hardware accelerator for a given application is available, users can choose to pay a premium for faster data processing rates. The datacenter operator can also use the hardware accelerators to reduce energy consumption and save costs, thus increasing job throughput and datacenter utilization.

II. COOPERATION OF CLOUD PROVIDERS

A. Overall View
We begin by presenting a general view of the components of our system, which is depicted in Fig. 1. In this section, we highlight the overall design and the system's operations; the following sections discuss the details of each part of the system. Our system is divided into two major elements: an Inter-cloud network connected to several Intra-cloud systems (slaves). The Inter-cloud network consists of four main components:

• The Inter-cloud Master: responsible for receiving requests from clients, preparing and distributing jobs among the various Intra-cloud slaves, monitoring the execution of jobs at the Intra-cloud jobtrackers, monitoring the communications and data sharing between Intra-clouds, sending reports and results to the client, analyzing the runtime statistics received from the Intra-clouds, invoking the generation of new hardware acceleration bitstream files, and managing the utilization of the acceleration files based on the requirements of the client applications and the availability of accelerators.

• The Inter-cloud Namenode: maintains metadata from the Namenodes of all Intra-cloud slaves, and the locations of data blocks for the data files stored in the Intra-clouds.

• The Hardware Design Center: responsible for generating hardware acceleration bitstream files for new Map and Reduce functions and testing them on dedicated hardware before sending them to the Bitstream Library.

• The Bitstream Library: responsible for saving bitstream files, their metadata, documentation, etc., and for providing and updating this information when needed.

Fig. 1. Components of the reconfigurable multi-cloud framework.

Within each Intra-cloud system (slave), two networks operate simultaneously: a software-based processing network that is very similar to a traditional Hadoop system (a namenode and several datanodes), and a hardware network which comprises one or more FPGA stacks. The two networks are connected via an FPGA Driver, which acts as a tasktracker for each FPGA node. The FPGA Driver is responsible for programming the FPGAs that are selected to execute a certain bitstream file. The FPGA Driver also sends the bitstreams to the designated FPGAs, and schedules and monitors the data transfers to and from the FPGAs by exploiting the FPGAs' connection to banks of Solid State Drives (SSDs) via a fiber link. The FPGA Driver is managed by the Intra-cloud jobtracker (which plays the role of an Intra-cloud master node, as compared to traditional Hadoop). The Intra-cloud jobtracker is one of the main entities in the software network, which are:

• The Intra-cloud Master: responsible for communicating with the Intra-cloud Masters of other clouds to share data among different clouds. The Intra-cloud Master also keeps a reserve copy of all operations that are executing in its cloud. It periodically updates its data and accordingly informs the Inter-cloud Master. A major responsibility of Intra-cloud Masters is to help reconstruct the Inter-cloud Master's records when a failure occurs to it.

• The Intra-cloud jobtracker: in addition to taking the role of a master node as compared to the traditional Hadoop framework, two additional tasks are assigned to this component in our system: first, it schedules and monitors the execution of bitstreams on the FPGAs; second, it periodically reports the status of the Intra-cloud slave to both the Intra-cloud and the Inter-cloud Masters (a sketch of such a report follows this list).

• The Intra-cloud Namenode: as in a traditional Hadoop framework, it maintains the file system tree of the Intra-cloud slave and the metadata for all the files and directories in the tree. It also keeps a record of the workstations on which all the file blocks are located.

• The Bitstream node: an Intra-cloud library of the bitstreams that have been previously used in this cloud slave.

• The Workstation: similar to a traditional datanode in Hadoop, the workstation contains a tasktracker and a datanode that cooperate to execute a certain MapReduce function on specific data.
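The paper does not specify the contents of these periodic status reports; purely as an illustration, the sketch below (in Java, with every field name hypothetical) shows the kind of snapshot such a report could carry:

import java.util.List;

// Hypothetical sketch of the periodic status report an Intra-cloud jobtracker
// could send to the Intra-cloud and Inter-cloud Masters; all field names are
// illustrative, not part of this paper's specification.
public class SlaveStatusReport {
    String cloudSlaveId;            // which Intra-cloud slave is reporting
    long timestampMillis;           // when the snapshot was taken
    int freeWorkstations;           // remaining software (Hadoop) capacity
    int freeFpgas;                  // remaining hardware-acceleration capacity
    List<String> runningTaskIds;    // Map/Reduce tasks in progress
    List<String> finishedTaskIds;   // tasks completed since the last report
    List<String> loadedBitstreams;  // accelerators currently programmed on FPGAs
}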
Next, we describe the details of the system components and the various communications that occur between them.

B. Inter-Cloud Management and Communication
Our system focuses on finding the best approach by which several cloud providers can cooperate in executing a MapReduce job. If each cloud provider operates by itself in executing the job, the produced result will be only a part of the global result that would be achieved if the job were executed on the data saved on all clouds. Also, if the user tries to run the job on each provider separately and then simply aggregates the final results, he/she will obtain incorrect final results, since some input data might be common to several cloud providers. Hence, the significance of our proposed system is that it enables a user to get complete results that are obtained from data residing in several clouds. Our design requires the existence of an Inter-cloud server that is connected to three other servers: an Inter-cloud Namenode, a Bitstream Library, and a Hardware Design Center. These four entities are not part of any cloud provider. Rather, they could be maintained and administrated by a consortium of cloud providers, or by a third party. Each cloud provider has its own interface with its clients. Hence, a client who wants to use the whole system can connect to any cloud provider, but his/her request will be forwarded to the Inter-cloud Master.

1) Job Partitioning and Distribution
Each cloud provider executes its own standalone jobs (within its clouds) as usual. However, when executing a global job (one that is executed by several providers), each provider needs to know the part of the data it should work on. For this reason, the Inter-cloud Namenode holds a duplicate copy of the data list that is stored at each Intra-cloud Namenode. The Inter-cloud Master uses these lists to assign to each cloud slave the portion of the overall data it is responsible for processing. The Inter-cloud Master distributes the execution of data among cloud slaves according to two criteria: the data that is listed in the cloud slave's Namenode, and the resources available in the cloud slave that enable it to efficiently execute the required tasks. The Inter-cloud Master directs the Intra-cloud Masters such that some of the data is transferred to, and executed at, a cloud that has sufficient resources. During a job execution, each cloud slave executes its own MapReduce part of the overall job in parallel with the other cloud slaves. Hence, we have two levels of parallelism: across cloud slaves and within each cloud slave.

2) Choosing the Appropriate Cloud Slaves
Our proposed design offers a client complete freedom in choosing the cloud(s) on which he/she wants the query to be executed. Each of the cloud slaves continues to operate as a standalone cloud provider in addition to its participation in the global system. Hence, each cloud slave has its own Web interface with clients for offering its own cloud services, in addition to an option for executing jobs on multiple clouds. If a client connects to the Inter-cloud Master or to one of the cloud Web interfaces, he/she is given the option of executing the job on a single cloud or on multiple clouds. Hence, the client can specify the clouds on which he/she wants the job to be executed. Alternatively, the client can choose to let the system (i.e., the Inter-cloud Master) decide on the set of clouds hosting the related data and distribute the job on this set of clouds.

When the Inter-cloud Master receives a job from a client, it adds the job to the global scheduler queue. A distJob() thread is then started to discover (from the Inter-cloud Namenode) the cloud slaves on which the data related to the job resides, to split the job into tasks (Map, Shuffle, Reduce, Combine, etc.), and to distribute these tasks among the chosen cloud slaves according to the resources and data availability at each cloud slave. As in the original Hadoop design, we assume that distJob() tries to execute jobs at the locations where their data reside, to avoid data communication as much as possible. This can be achieved for Map functions. Another factor that plays an important role in distributing the jobs among cloud slaves is the amount of software and hardware resources available within the cloud slaves. Each hardware accelerator has its own documentation stored in the Inter-cloud Bitstream Library. The documentation contains statistics about the amount of resources (software and hardware) that the hardware accelerator consumes (obtained from its previous executions). Hence, in case the data resides on a cloud slave that does not have sufficient hardware resources, distJob() uses the hardware accelerator statistics to decide whether to move the data to another cloud slave with sufficient hardware resources, depending on whether the hardware accelerator execution will produce time savings greater than the delay incurred by moving the data from one cloud to another.
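The paper states this criterion but not its concrete form; a minimal sketch of the test distJob() could apply, assuming the time estimates come from the accelerator's recorded statistics (all names below are illustrative), is:

// Minimal sketch (names and signature assumed, not the paper's API) of
// distJob()'s data-movement test: relocate the data to a slave with free
// hardware resources only when the accelerator's expected time savings
// exceed the inter-cloud transfer delay.
final class MoveDecision {
    static boolean shouldMoveData(double softwareTimeSec,    // estimated software execution time
                                  double acceleratedTimeSec, // estimated FPGA execution time
                                  long dataSizeBytes,
                                  double linkBytesPerSec) {  // inter-cloud link bandwidth
        double expectedSavings = softwareTimeSec - acceleratedTimeSec;
        double transferDelay = dataSizeBytes / linkBytesPerSec;
        return expectedSavings > transferDelay;
    }
}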
3) Data Sharing and Transfer
Suppose data should be moved from a cloud slave C1 to another cloud slave C2. Then distJob() designates the part of the data that should be moved within the sub-jobs that are sent to C1 and C2. Once the sub-job of C1 reaches it, the Intra-cloud Master of C1 contacts that of C2 to organize the data transfer process, which can occur while other parts of the job are being executed. The jobtracker of C2 is responsible for distributing the data from C1 among the resources in C2. With respect to Reduce functions, each Reduce instance processes results that have a certain key value. Hence, results from Map instances that are running on C1 might need to be transferred to a workstation or FPGA at C2. In this case, the jobtracker of C1 notifies the Intra-cloud Master of C1, which contacts the Intra-cloud Master of C2 for a possible data transfer. Upon agreement, the Intra-cloud Master of C1 sends the data packet(s) to that of C2, which forwards the packet(s) to the jobtracker of C2. The latter forwards the data to the corresponding workstation or SSD drives.
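To summarize the hop-by-hop path just described, the stubs below trace the chain of responsibilities in Java; every class and method name is an illustrative assumption rather than an interface defined by the system:

// Illustrative stubs (all names hypothetical) tracing the inter-cloud transfer
// chain for intermediate Map results that must reach a Reduce instance in
// another cloud: jobtracker(C1) -> Master(C1) -> Master(C2) -> jobtracker(C2).
class DataPacket { String reduceKey; byte[] payload; }

class JobTracker {
    IntraCloudMaster localMaster;
    // C1 side: flag results whose key belongs to a remote Reduce instance.
    void notifyOutgoing(DataPacket p, IntraCloudMaster remote) {
        localMaster.requestTransfer(remote, p);
    }
    // C2 side: route the received data to the corresponding workstation or SSD bank.
    void deliver(DataPacket p) { /* dispatch by p.reduceKey */ }
}

class IntraCloudMaster {
    JobTracker localJobTracker;
    // C1's Master contacts C2's Master; upon agreement the packet is forwarded.
    void requestTransfer(IntraCloudMaster remote, DataPacket p) {
        if (remote.agreeToReceive(p)) remote.receive(p);
    }
    boolean agreeToReceive(DataPacket p) { return true; }  // acceptance policy elided
    void receive(DataPacket p) { localJobTracker.deliver(p); }
}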
4) Handling Inter-cloud Master Failure
During the execution of a job, the Intra-cloud jobtracker continuously updates the Intra-cloud Master and the Inter-cloud Master with the status of the different functions and tasks that have finished or are in progress. In addition, each Intra-cloud Master reports all communications that occur between its cloud slave and other slaves to the Inter-cloud Master. The information that exists at the Intra-cloud Masters assists in reconstructing the Inter-cloud Master in case it fails. This is done as follows: if and when the Inter-cloud Master fails, the Intra-cloud Master that first discovers the failure assumes the temporary role of Inter-cloud Master. First, the new Inter-cloud Master notifies all other Intra-cloud Masters about the failure and its new role. This prompts the Intra-cloud Masters to send the information they have about current jobs to the new Inter-cloud Master, enabling it to reconstruct the schedules of all running jobs. The temporary Inter-cloud Master next connects to the Hardware Design Center, the Inter-cloud Namenode, and the Bitstream Library and informs them about its role. Afterwards, it re-opens the connections to all clients that were connected to the system (the ID and IP address of each client are sent as extra information within each sub-job) and informs them about their jobs' status and updates. When the original Inter-cloud Master recovers, the temporary one transfers its data back to it, allowing it to resume its functions.

C. Intra-Cloud Management and Operations
In our system, we propose that cloud providers use FPGAs to execute certain Map and Reduce functions as hardware acceleration modules. In previous work, FPGAs were used with MapReduce on a single node and with no reconfiguration capability [3]. The authors of [3] proposed that several versions of mappers and reducers be programmed on the FPGA. However, the changes in [3] were focused on MapReduce and did not extend to the full Hadoop framework. Also, the implementations were restricted to individual nodes, and the FPGA did not have direct access to storage. In another work [6], a framework called Mars was developed for Graphics Processing Units (GPUs). The framework was evaluated using web applications, and the results showed up to 16 times faster performance compared to a quad-core machine. However, when using GPUs there is a limitation on the number of threads, which is related to: 1) the hardware configuration, such as the number of multiprocessors (MPs) and the number of registers on each MP, and 2) the computation characteristics of the Map and Reduce tasks, e.g., whether they are memory- or computation-intensive. In our system, we focus on integrating FPGAs as part of an extended multi-cloud Hadoop framework in an efficient manner. We propose extensions to Hadoop that support the following: 1) integration of one or more stacks of multi-FPGA systems, 2) communication with FPGA nodes within the cloud and in other clouds, and 3) implementation and integration of accelerated mappers and reducers in clouds.

1) Intra-cloud Task Distribution
In our system, the Inter-cloud Master controls the communication and processing over the cloud slaves. When distributing a job among several slaves, the Inter-cloud Master sends to each slave its part of the job (sub-job), as depicted in step 1 of Fig. 2. The sub-job contains the Intra-cloud jobtracker configuration along with the target data. The Intra-cloud jobtracker extracts from the sub-job the tasks that will be executed on workstations and those that will be executed on FPGAs. The jobtracker adds the workstations' tasks to its scheduler and sends the FPGA tasks to the FPGA Driver (step 3). The latter adds the tasks that it receives from the jobtracker to its schedule (step 4), manages the distribution of these tasks to the appropriate FPGAs (step 5), and monitors the execution of the different MapReduce functions as hardware-accelerated bitstreams. When the final results in the cloud are generated, the Intra-cloud jobtracker sends the locations of the output files and the performance statistics to the Inter-cloud Master (step 7).

Fig. 2. Operation flow within the proposed system.

2) FPGA Driver Operations
In our system, the FPGA Driver plays several roles, which we explain in detail in this section. To maintain compatibility with the Hadoop framework, several requirements must be satisfied by the FPGA Driver. As a bridge between the software Hadoop-based network and the FPGA-based hardware acceleration network, the FPGA Driver handles the following responsibilities:

• Managing all communications between FPGA nodes within the same Intra-cloud.

• Moving data to and from the FPGAs, which involves two main operations: 1) preprocessing the data that needs to be sent to the FPGA accelerators, i.e., transforming the data into (key, value) pairs; and 2) moving the results from the FPGA to storage when processing is complete.

• Managing the processing on each FPGA, which involves several tasks such as: loading an FPGA with the hardware acceleration bitstream, reporting the status of each FPGA to the Intra-cloud jobtracker, reporting the status of each running bitstream, and managing the number of MapReduce functions that should run on each FPGA based on the FPGA's available resources (see the sketch after this list).
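The Driver's internal interface is not given in the paper; the following dispatch-loop sketch, in which every type and method is a hypothetical stand-in, merely illustrates the three responsibilities above working together:

import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative sketch of an FPGA Driver dispatch loop; all types and methods
// are hypothetical stand-ins for the behavior described above.
class FpgaDriverSketch {
    static class FpgaTask { String bitstreamId; String inputBlockId; }
    static class Fpga {
        int freeSlots = 4;                      // MapReduce instances the device can still host
        void load(String bitstreamId) { /* program the device with the bitstream */ }
        void stream(String blockId)   { /* stage (key, value) pairs in via the SSD fiber link */ }
    }

    final Queue<FpgaTask> schedule = new ArrayDeque<>();

    // Called when the Intra-cloud jobtracker hands over FPGA tasks (steps 3 and 4).
    void accept(FpgaTask t) { schedule.add(t); }

    // Dispatch tasks to a device while it has capacity (step 5).
    void dispatch(Fpga fpga) {
        while (fpga.freeSlots > 0 && !schedule.isEmpty()) {
            FpgaTask t = schedule.poll();
            fpga.load(t.bitstreamId);      // load the hardware-acceleration bitstream
            fpga.stream(t.inputBlockId);   // stage the input data as (key, value) pairs
            fpga.freeSlots--;
        }
    }
}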
D. Hardware Acceleration as a Service (HAaaS)
In our system, we propose to add to Hadoop the capability of supporting hardware acceleration (HA) via FPGAs. By using HA, we are able to execute jobs that may be very expensive (in terms of delay and energy consumption) if executed as software functions. To illustrate the importance of using HA, consider a MapReduce matrix multiplication application that executes on data files with a total size of 100 terabytes. If each Map instance takes a 64 MB data block, and considering 10,000 parallel Map instances, then according to [7], the software execution of the matrix multiplication takes 12.8 s per data block, whereas the same execution takes 0.332 s on a hardware-accelerated FPGA. Hence, for our MapReduce application, hardware acceleration saves [(100×10^12)/(64×10^6)]/10^4 × 12.468 s ≈ 32.47 minutes. Besides reducing the execution time, FPGAs consume less energy than other devices in many cases, as shown in [8]. In this section, we explain how hardware accelerators (or bitstreams) are designed, tested, integrated, and offered to clients.

1) Creating and Testing New Hardware Accelerators
In our system, when the Inter-cloud Master receives a new job from a client, it invokes a utility process searchHA() that searches the Bitstream Library for hardware accelerators matching the Map and Reduce functions of the job. If searchHA() does not find any bitstream files that match the Map and Reduce functions, it notifies the Inter-cloud Master, which continues the execution of the job as described in Section II-B. While the job is executed on the Intra-cloud slaves, the searchHA() process monitors the statistics that are sent to the Inter-cloud Master about the executed Map and Reduce functions. The choice of implementing a Map (or Reduce) function in hardware depends on two factors. The first is the execution frequency of the function (only high-frequency functions are considered candidates for hardware acceleration), while the second is whether the function contains operations whose execution times would be reduced if implemented as accelerators. If the two conditions are met by a Map or Reduce function, searchHA() sends the corresponding function to the Hardware Design Center, which, upon receiving a software-coded Map or Reduce function, invokes a synthesizerHA() utility process. The latter is a hardware design process that attempts to generate the best hardware implementation for each software-coded operation. The synthesizerHA() function translates the software operations into a hardware description language (e.g., VHDL), and converts the result into a hardware circuit (bitstream) through a process of logic synthesis.

Upon generating a suitable hardware accelerator for the particular Map or Reduce function, synthesizerHA() uses one or more reserved FPGA stacks to test the new bitstream. It compares the hardware testing results with the software results obtained from the executed job to make sure that the bitstream is faster than its corresponding software function. If the bitstream is found to be slower than its software function, it is either rejected outright or undergoes a second round of modifications and enhancements.

2) Adding Hardware Accelerators to the Bitstream Library
When a new hardware accelerator needs to be added to the Bitstream Library, synthesizerHA() invokes an addHA() process whose responsibility is to generate the hardware accelerator's metadata, documentation, and price. These are added, alongside the bitstream, to the Bitstream Library. The metadata is generated according to the tasks that exist in the hardware accelerator file, while the documentation is a detailed description of the file and its execution, and is continuously updated as new performance results from future jobs are reported. The pricing strategy adopted for a hardware accelerator should depend on the resources and features that it uses (for example, the hardware cost needed to execute the bitstream), its popularity (importance), its execution time, and the savings it achieves relative to its corresponding software implementation, mainly in terms of execution time and/or energy consumption.
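The paper lists these pricing factors without giving a formula; the sketch below shows one possible, entirely hypothetical, weighting that is consistent with that list (the linear form and the weights are assumptions, not the paper's pricing model):

// Entirely hypothetical pricing sketch combining the factors listed above in
// a simple weighted sum; all names, the linear form, and the weights are
// assumptions rather than the paper's model.
final class AcceleratorPricing {
    static double price(double hardwareCost,       // cost of the resources the bitstream occupies
                        double popularity,         // e.g., normalized request frequency
                        double timeSavingsSec,     // savings vs. the software implementation
                        double energySavingsJoule, // savings vs. the software implementation
                        double wCost, double wPop, double wTime, double wEnergy) {
        return wCost * hardwareCost + wPop * popularity
             + wTime * timeSavingsSec + wEnergy * energySavingsJoule;
    }
}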
3) Offering Hardware Accelerators to Clients
Returning to the original searchHA() process in our example: if it finds that one (or both) of the Map and Reduce functions submitted by the user has one or more matching hardware accelerators in the library, it notifies the Inter-cloud Master with the information about each hardware accelerator, its metadata, its price, and the benefits it offers to the client in terms of execution time. The jobtracker sends this data to the client, who can make use of an interactive tool that helps him/her choose suitable hardware accelerators and calculate the number of Map and Reduce instances that he/she should order as hardware accelerators according to his/her budget. If a client requests the use of hardware acceleration, the Inter-cloud Master specifies for each cloud slave the amount of resources that it should use (for example, the type and number of hardware accelerators) according to the client's request and the current resources that exist in the cloud slave (for example, the number of FPGA stacks). The Inter-cloud Master sends this information, encapsulated in the sub-jobs, to the Intra-cloud Masters. Each cloud slave that needs to execute a bitstream which does not exist in its cloud can request it from the Inter-cloud Master, which in turn fetches it from the Bitstream Library and sends it to the cloud slave's jobtracker.

In addition to using hardware accelerators to offer HAaaS, a datacenter administrator can take advantage of the FPGA stacks and the possibility of executing bitstreams to enhance the overall performance of the datacenter. For example, when the datacenter becomes overloaded, it might use hardware accelerators to quickly execute some jobs and release computing resources that can be leased to other users.

III. CASE STUDY: K-MEANS CLUSTERING

We are currently building a complete testing platform for the system presented in this paper. Our aim is to construct the Inter-cloud network, which contains the four main components stated in Section II-A, connected to at least two cloud slaves, where each slave contains, in addition to the four main Intra-cloud components mentioned in Section II-A, a workstation and an FPGA Driver connected to a stack of FPGAs (three or more), which in turn are connected via a fiber channel to several SSD drives. Up to now, we have implemented initial prototypes of the Inter-cloud network components and of the main Intra-cloud elements. We are using Java (NetBeans) to develop the various stand-alone components in a Linux environment, and OPNET for network-wide simulation testing. We designed sample prototypes of the Intra-cloud jobtracker, Namenode, Hardware accelerator node, workstation, and FPGA Driver. We also connected the FPGA Driver to three Xilinx Virtex-6 FPGAs via a switch and tested the execution of several bitstreams on the FPGAs. We are currently working on implementing the remaining parts of the system. In this section, we present a set of results produced by designing and running a hardware accelerator for the k-means clustering algorithm on our system prototype.

A. K-means Clustering Summary
The k-means algorithm has been widely and successfully used to cluster large datasets. It runs as follows: given a set of n data samples (x1, x2, …, xn), where each sample is a d-dimensional real vector, k-means clustering aims to partition the n samples into k sets or clusters (k ≤ n), S = {S1, S2, …, Sk}. The procedure follows a simple way to classify the given dataset into k clusters, where k is fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids should be placed intelligently, because different locations lead to different results; the better choice is to place them as far away from each other as possible. The next step is to take each point belonging to the dataset and associate it with the nearest centroid. When no point is pending, the first iteration is completed and an initial grouping is done. In each new iteration, the algorithm re-calculates the k centroids as the barycenters of the clusters resulting from the previous iteration. After calculating the new k centroids, a new binding is made between each point in the dataset and its nearest new centroid, so each point is re-associated with one of the k new centroids. Over the iterations, the k centroids change their locations step by step until they no longer move. At this point the algorithm has converged, and each centroid with its points forms a single cluster.
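For reference, the listing below is a compact, plain-Java implementation of the algorithm just described (an assignment step followed by a barycenter update, for 2-D points as in our experiments); it is an illustrative software baseline, not the C function or the hardware accelerator evaluated below.

import java.util.Random;

// Compact software k-means for 2-D points, matching the algorithm described
// above; a plain reference implementation, independent of the Hadoop/FPGA
// versions benchmarked in the following subsections.
public class KMeans2D {
    public static double[][] cluster(double[][] points, int k, int iterations, long seed) {
        Random rnd = new Random(seed);
        double[][] centroids = new double[k][2];
        for (int i = 0; i < k; i++)                  // initial centroids: random samples
            centroids[i] = points[rnd.nextInt(points.length)].clone();

        int[] assignment = new int[points.length];
        for (int iter = 0; iter < iterations; iter++) {
            // Assignment step: bind each point to its nearest centroid.
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dx = points[p][0] - centroids[c][0];
                    double dy = points[p][1] - centroids[c][1];
                    double d = dx * dx + dy * dy;
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                assignment[p] = best;
            }
            // Update step: recompute each centroid as the barycenter of its cluster.
            double[][] sums = new double[k][2];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                sums[assignment[p]][0] += points[p][0];
                sums[assignment[p]][1] += points[p][1];
                counts[assignment[p]]++;
            }
            for (int c = 0; c < k; c++)
                if (counts[c] > 0)
                    centroids[c] = new double[] { sums[c][0] / counts[c], sums[c][1] / counts[c] };
        }
        return centroids;
    }
}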
B. Testing Scenarios
To study the efficiency and the performance increase achieved when implementing k-means via hardware acceleration, we designed three different scenarios. In the first, the k-means algorithm is written as a C function that is placed in the DDR3 RAM of the Virtex-6 FPGA board; in this scenario, the FPGA (ML605) runs a MicroBlaze processor at 150 MHz to execute the C function. In the second scenario, we transformed the k-means C function into a hardware design that uses a group of hardware accelerators (one per centroid), with the FPGA operating at 200 MHz. In the third scenario, the C function is executed on an Intel i7 processor with 8 GB RAM running at 3.4 GHz. We tested three datasets of 10, 20, and 40 million random points in a two-dimensional space. The number of centroids (k) was varied between 2 and 32, and we ran 25 iterations for each calculation. We measured the execution time on the three platforms, and the energy consumed on the FPGA board for the first two scenarios.

C. Testing Results
Fig. 3-a shows the total delay of the k-means co-processor (second scenario) for the three tested datasets. We notice that the delay increases as the size of the dataset increases, which is expected, and that it increases linearly with the data size. In Fig. 3-b, we compare the total delays of the three scenarios. We notice that at k = 32, the co-processor is 151 times faster than the Intel i7 processor and 350×10^3 times faster than the MicroBlaze processor (first scenario). Finally, Fig. 3-c shows the energy saving gained by using hardware acceleration as compared to the software implementation (on the MicroBlaze). The energy saving gain is defined as:

Energy Saving = (Total energy consumed by running k-means on the MicroBlaze) / (Total energy consumed by running k-means as a hardware accelerator)

From Fig. 3-c, we notice that the energy consumed by the MicroBlaze processor was as much as 8,376 times greater than that of the hardware accelerators when k = 32. These results demonstrate the significant impact that FPGAs and hardware accelerators can have on the delay and energy efficiency of data-intensive applications.

Fig. 3. Results of the k-means clustering algorithm on three different platforms.

IV. CONCLUSION

We presented the design and architecture of a cooperative multi-cloud computing system with hardware acceleration capabilities. Our system recommends two novel extensions to the current cloud computing model: offering hardware acceleration as a service and enabling clouds to collaborate, and it integrates these operations into the Hadoop framework. The main parts of the system are currently being implemented; we have built initial prototypes and tested the k-means clustering application on them. The results illustrate the great advantage of using hardware acceleration while processing massive amounts of data. Our next step is to finish the various system components, test the complete functionality, and compare the system's performance to that of a standard Hadoop platform.

REFERENCES
[1] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Communications of the ACM, Vol. 51, No. 1, 2008, pp. 107-113.
[2] Apache Hadoop, online: http://hadoop.apache.org/
[3] Y. Shan, B. Wang, J. Yan, Y. Wang, N. Xu, and H. Yang, "FPMR: MapReduce Framework on FPGA: A Case Study of RankBoost Acceleration," in Proc. of the 18th Annual ACM/SIGDA Int'l Symposium on Field Programmable Gate Arrays (FPGA'10), 2010, pp. 93-102.
[4] S. Sun, "Analysis and Acceleration of Data Mining Algorithms on High Performance Reconfigurable Computing Platforms," Ph.D. dissertation, Iowa State University, IA, 2011.
[5] D. Yin, G. Li, and K. Huang, "Scalable MapReduce Framework on FPGA Accelerated Commodity Hardware," Internet of Things, Smart Spaces, and Next Generation Networking (Lecture Notes in Computer Science), Vol. 7469, 2012, pp. 280-294.
[6] B. He, W. Fang, N. K. Govindaraju, Q. Luo, and T. Wang, "Mars: A MapReduce Framework on Graphics Processors," in Proc. of the 17th Int'l Conf. on Parallel Architectures and Compilation Techniques, New York, NY, USA, 2008.
[7] N. Dave, K. Fleming, M. King, M. Pellauer, and M. Vijayaraghavan, "Hardware Acceleration of Matrix Multiplication on a Xilinx FPGA," in Proc. of the 5th IEEE/ACM Int'l Conf. on Formal Methods and Models for Codesign (MEMOCODE 2007), 2007, pp. 97-100.
[8] J. Fowers, G. Brown, P. Cooke, and G. Stitt, "A Performance and Energy Comparison of FPGAs, GPUs, and Multicores for Sliding-Window Applications," in Proc. of the 20th ACM/SIGDA Int'l Symposium on Field Programmable Gate Arrays (FPGA 2012), Feb. 2012, pp. 47-56.