J Supercomput (2008) 46: 276–297 DOI 10.1007/s11227-008-0191-3

The load balancing problem in OTIS-Hypercube interconnection networks Basel A. Mahafzah · Bashira A. Jaradat

Published online: 8 March 2008 © Springer Science+Business Media, LLC 2008

Abstract An interconnection network architecture that promises to be an interesting option for future-generation parallel processing systems is the OTIS (Optical Transpose Interconnection System) optoelectronic architecture. All performance-improvement aspects of such a promising architecture therefore need to be investigated, one of which is the load balancing technique. This paper focuses on devising an efficient algorithm for load balancing on OTIS-Hypercube interconnection networks. The proposed algorithm is called the Clusters Dimension Exchange Method (CDEM). The analytical model and the experimental evaluation demonstrate the advantage of OTIS-Hypercube over Hypercube in terms of various parameters, including execution time, load balancing accuracy, number of communication steps, and speed.

Keywords Load balancing · OTIS · OTIS-Hypercube · Hypercube · Interconnection networks

B.A. Mahafzah, Department of Computer Science, King Abdullah II School for Information Technology, The University of Jordan, Amman 11942, Jordan; e-mail: [email protected]
B.A. Jaradat, Department of Computer Science, School of Computer and Information Technology, Jordan University of Science and Technology, Irbid 22110, Jordan; e-mail: [email protected]

1 Introduction

A potential optoelectronic architecture, known as the Optical Transpose Interconnection System (OTIS), was first proposed by Marsden et al. [1]. Recently, the OTIS


architecture has gained considerable attention, and significant efforts have been devoted to studying and improving several aspects of OTIS networks. Many results and research works exist in the literature regarding the OTIS optoelectronic architecture [2–8]. A few have addressed the performance of such interconnection networks [2, 5]. In addition to the existing studies, complementary efforts must be made to attain the best possible performance; one way is the effective utilization of available resources, which can be carried out through load balancing. The significance of load balancing lies in its effect on improving the speedup in processing time, which is a major objective in all parallel processing systems. Based on this fact, this research is dedicated to studying and solving the load balancing problem on OTIS-Hypercube interconnection networks, with the aspiration of devising an efficient solution that will contribute to the enhancement of OTIS-Hypercube interconnection networks' performance. Another significant motivation for this research is extracted from the following observation reported by leading scientists: "With optical elements. . . light does magic" [9]. Due to the power of light in achieving high-speed communications, hopes are pinned on achieving high-speed parallel processing on the promising OTIS architecture. Therefore, all aspects of performance improvement on this architecture should be studied and evaluated, one of which is the load balancing problem, which is studied on OTIS-Hypercube systems in this paper. The proposed method is called the Clusters Dimension Exchange Method (CDEM), which is based on the well-known Dimension Exchange Method (DEM) for load balancing on Hypercube interconnection networks. The efficiency of the proposed algorithm is shown, and the superiority of the OTIS architecture is demonstrated.
The rest of this paper is organized as follows: Sect. 2 introduces OTIS systems and presents the load balancing techniques applied on both OTIS and Hypercube interconnection networks. The proposed load balancing methodology (CDEM) on OTIS-Hypercube is illustrated in Sect. 3, which also describes the DEM on Hypercube, on which CDEM is based. In Sect. 4, analytical models of both CDEM on OTIS-Hypercube and DEM on Hypercube are presented. The analysis involves the estimation of several performance metrics, including the worst-case time complexity, the load balancing accuracy, the maximum number of communication steps, and the speed at which the load balancing process occurs. The analytical estimation is validated in Sect. 5 through experimental work that measures the metrics studied in Sect. 4, and a comparison is conducted between the attained results of CDEM on OTIS-Hypercube and DEM on Hypercube. Section 6 concludes the paper and discusses future work.

2 Background and related work This section introduces OTIS interconnection networks and presents the research work related to load balancing on Hypercube, which is the basis network from which OTIS-Hypercube under study is constructed.


2.1 OTIS interconnection networks

In an OTIS system, processors are clustered into groups, where processors within the same group are connected by electronic intra-group links forming an interconnection topology known as the factor network, whereas inter-group processors are interconnected by transposing processor and group addresses, such that processor p of group g is connected to processor g of group p. The latter interconnection is achieved optically using free-space optical technology [1]. The factor network can be any of the traditional interconnection networks, such as the Hypercube, in which P processors are organized in log2 P dimensions, with exactly two nodes connected along each dimension. Two nodes in a Hypercube are connected if the Hamming distance between the binary representations of their processor numbers is one. Each of the groups in Fig. 1 is a 2-dimensional Hypercube. Krishnamoorthy et al. have shown that the bandwidth and the power consumption in OTIS are optimized when the number of groups is equal to the number of processors in each group [10]. This means that an optimal N^2-processor OTIS system consists of N groups, each of which contains N processors. For each known topology, an OTIS network can be constructed. For example, an N^2-processor OTIS-Hypercube network can be formed from N copies of an N-processor Hypercube. An instance of the OTIS-Hypercube interconnection network is the 16-processor (4 groups with 4 processors within each group) OTIS-Hypercube shown in Fig. 1, where an optical inter-group link (distinguished by a dashed link) interconnects processor p of group g to processor g of group p, and processors within the same group are connected by electronic intra-group links forming the Hypercube factor network. Each processor is marked by a two-parameter label, where the first parameter indicates the group to which the processor belongs, and the second parameter represents the processor's position within the group.

Fig. 1 16-processor OTIS-Hypercube

In terms of hardware implementation, OTIS promises to provide large-scale systems that are not possible with traditional electronic technology, which is limited in its ability to support higher dimensions; in an OTIS system, the same number of processors can be arranged in fewer dimensions. In terms of performance, the OTIS interconnection network architecture is desirable due to its recursive structure consisting of multiple similar networks, which provides better support for several features, such as modularity, load balancing, fault tolerance, and robustness [5]. The attractive outcomes of the research on OTIS revealed its ability to achieve terabit throughput at a reasonable cost. Based on this, several research efforts have been directed toward studying OTIS and investigating its usefulness for real-life applications [1–8]. Researchers have followed distinct directions in exploring performance issues regarding the OTIS interconnection networks. An important work on evaluating the performance of OTIS was recently published in an M.S. thesis by Najaf-Abadi [2]; it is a valuable study because it concentrated on the performance evaluation and modeling of OTIS networks under important parameters, such as the network bandwidth and message latency. Various algorithms have been developed on OTIS. For instance, Wang and Sahni presented matrix multiplication on OTIS-Mesh [6] and BPC permutations on OTIS-Hypercube [7]. Rajasekaran and Sahni introduced randomized routing, selection, and sorting on OTIS-Mesh [8]. Several other algorithm development efforts have been accomplished on several OTIS instances [3, 4].

2.2 The load balancing problem

One of the most important problems for a parallel processing system is load balancing.
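As an illustration of the two connection rules described in Sect. 2.1, the following Python sketch (function and variable names are ours, not from the paper) enumerates the links of an n²-processor OTIS-Hypercube:

```python
from itertools import product

def otis_hypercube_links(n):
    """Link sets of an n^2-processor OTIS-Hypercube: n groups of n
    processors (n a power of two), nodes labeled (group, processor).
    Electronic links obey the Hypercube rule (Hamming distance 1 between
    processor numbers); optical links obey the transpose rule
    (g, p) <-> (p, g) for g != p."""
    dims = n.bit_length() - 1                  # log2(n) dimensions per group
    electronic, optical = set(), set()
    for g, p in product(range(n), repeat=2):
        for d in range(dims):                  # flip one bit of the processor number
            electronic.add(frozenset({(g, p), (g, p ^ (1 << d))}))
        if g != p:                             # free-space optical transpose link
            optical.add(frozenset({(g, p), (p, g)}))
    return electronic, optical

elec, opt = otis_hypercube_links(4)            # the 16-processor network of Fig. 1
```

For n = 4 this yields 16 electronic links (four 2-dimensional Hypercubes) and 6 optical links, matching Fig. 1; for instance, processor (2, 0) is optically tied to processor (0, 2), the pair used in the running example of Sect. 3.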
The load balancing problem has been studied using different approaches on various networks. Recently, Zhao, Xiao, and Qin proposed hybrid schemes of diffusion and dimension exchange, called DED-X, for load balancing on OTIS networks [11]. The core of DED-X is to divide the load balancing process into three stages, Diffusion-Exchange-Diffusion, where a traditional diffusion scheme X (such as the First Order Scheme (FOS), Second Order Scheme (SOS), or Optimal scheme (OPT)) is applied at the various stages to achieve load balancing on OTIS factor networks [11]. Simulation results of the proposed schemes have shown significant improvement in efficiency and stability [11]. In another work, the same authors generalized several DED-X schemes for load balancing on homogeneous OTIS networks to produce the generalized Diffusion-Exchange-Diffusion schemes, GDED-X, which achieve load balancing on heterogeneous OTIS networks [12]. The proposed schemes were shown, theoretically and experimentally, to be better than traditional X schemes for load balancing on heterogeneous OTIS networks [12]. Ranka, Won, and Sahni [13] introduced the Dimension Exchange Method (DEM) on Hypercube interconnection networks. It is a simple heuristic method that is based


on averaging the loads of directly connected processors: for each dimension d, every two processors connected along the dth dimension exchange their load sizes and, according to the average, the processor with excess load transfers the extra load to its neighbor. The advantage of DEM is that every processor can redistribute tasks to its neighbor processors without knowledge of the global distribution of tasks. However, the worst-case error of this method is log2 P on a P-processor Hypercube, where the error is defined as the difference between the maximum and the minimum number of tasks assigned to processors. Error reduction was the objective of several subsequent studies. Better results were achieved by Rim et al. [14, 15], who adapted DEM to perform efficient dynamic load balancing on Hypercube interconnection networks by proposing a new method, the odd-even method, which reduces the nonuniformity to no more than (1/2) log2 P. Additional advantages are achieved by introducing new techniques for hiding the communication overheads involved in load balancing [15]. Jan and Hwang suggested an efficient algorithm for perfect load balancing on Hypercube multiprocessors, based on the well-known DEM [16].

3 The clusters dimension exchange method for load balancing on OTIS-Hypercube

The proposed Clusters Dimension Exchange Method for load balancing on OTIS-Hypercube is based on the well-known Dimension Exchange Method (DEM) for load balancing on Hypercube. DEM balances the processors' loads in log2 P phases for a P-processor Hypercube organized in log2 P dimensions. This is accomplished by going through all dimensions and balancing processors' loads by redistributing the tasks among directly connected processors in each dimension. Figure 2 illustrates the DEM steps for load balancing on a 4-D Hypercube of 16 processors. The processors along the first dimension exchange their load sizes, and the processor with the higher load transfers the excess load to the lower-loaded processor; the transfer direction is shown by arrows between processors, as Fig. 2b shows. The same steps are performed between processors connected along the second, third, and fourth dimensions, as shown in Figs. 2c, d, and e, respectively. The DEM will be used to devise a new method for load balancing on OTIS-Hypercube interconnection networks, called the Clusters Dimension Exchange Method (CDEM). The dynamic load balancing problem on OTIS-Hypercube can be stated as follows: given an OTIS-Hypercube of P processors, clustered into Sqrt(P) groups, the dynamic load balancing problem is to obtain an exactly or approximately equal load distribution among the OTIS-Hypercube's processors. For a P-processor OTIS-Hypercube, the proposed strategy balances the processors' loads in log2 P phases. The main concept of the proposed method is to obtain an equal load distribution among the groups first, by redistributing the loads among the groups so that all groups have exactly or approximately the same total load. Then each group balances its processors' loads so that all processors' loads are equal or approximately equal.
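The DEM procedure just described can be sketched in a few lines of Python (a minimal sketch; the floor-average convention follows the algorithm's Step 2 below, and the names are ours):

```python
def dem(loads):
    """Dimension Exchange Method on a P-processor Hypercube (P a power of
    two). For each dimension d, the two processors whose numbers differ in
    bit d compute the floor of their average load, and the heavier one
    sends its excess to the other."""
    P = len(loads)
    loads = list(loads)
    for d in range(P.bit_length() - 1):        # one phase per dimension
        for i in range(P):
            j = i ^ (1 << d)
            if i < j:                          # visit each pair once
                avg = (loads[i] + loads[j]) // 2
                if loads[i] > avg:             # i holds the excess
                    loads[j] += loads[i] - avg
                    loads[i] = avg
                else:                          # j holds the excess (or they tie)
                    loads[i] += loads[j] - avg
                    loads[j] = avg
    return loads
```

On any 16-processor instance the result differs from a perfect split by at most log2 16 = 4 units, the worst-case DEM error cited in Sect. 2.2, and the total load is conserved.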
Figure 3 presents the proposed CDEM algorithm for load balancing on a P -processor OTIS-Hypercube.


Fig. 2 A running example of DEM on 4-D Hypercube of 16 processors: a initial state; b load balancing—phase 1; c load balancing—phase 2; d load balancing—phase 3; e load balancing—phase 4

CDEM performs load balancing in a number of phases equal to the basis network's dimension (Fig. 3, line 1). All pairs of groups whose numbers differ in the dth bit position (line 2) perform the following steps in parallel:

Step 1: Groups' total load sizes exchange The groups whose numbers differ in the dth bit position exchange their total load sizes through the optical interconnection, as indicated by line 4 in the algorithm (Fig. 3).


Fig. 3 The CDEM algorithm for load balancing on OTIS-Hypercube

Step 2: Groups' average total load calculation Each pair of groups from Step 1 computes the average of their total loads as the floor of the sum of the two groups' total loads divided by two (line 5).


Step 3: Groups' total load redistribution Each group compares its total load to the average load (the average of the group's total load and its neighbor group's total load). If the total load of the group is greater than the average load (Fig. 3, line 6), the processor interconnecting the two communicating groups is checked to determine whether it holds the excess load to be transferred, which is computed as the difference between the group's total load and the average of the communicating groups' total loads. If it holds the required excess load (line 7), it sends it to the neighbor group (line 8), its own load is decremented by the transferred load (line 9), and the group's total load is set equal to the average load (line 10). On the other hand, if that processor does not hold a sufficient amount of load to transfer (line 11), it requests the additionally required load from its neighbors (line 12) and adds it to its load (line 13). If the total load of the group is less than the average load (Fig. 3, line 17), the group receives its neighbor group's excess load (line 18), the load of the group's processor interconnecting the two groups is incremented by the transferred load (line 20), and the group's total load is increased by the transferred load (line 21). Since all the groups then have the same amount of workload units, balancing the processors' loads within each group, using the DEM presented at the beginning of this section, produces a completely balanced network, with all the processors having the same or approximately the same amount of workload. The load balancing procedure iterates through each of the Hypercube's dimensions (Fig. 3, line 22). All the directly connected processors along the first dimension are balanced in the first phase.
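The two phases just described, balancing group totals over the transpose links and then running DEM inside each group, can be sketched as follows (a simplified Python sketch with our own names; the borrowing case of lines 11–13 of Fig. 3, where the bridge processor lacks the excess, is elided, so the sketch assumes the bridge processors hold enough load):

```python
def dem(loads):
    """Intra-group DEM: pairwise floor-average exchange per dimension."""
    P = len(loads)
    loads = list(loads)
    for d in range(P.bit_length() - 1):
        for i in range(P):
            j = i ^ (1 << d)
            if i < j:
                avg = (loads[i] + loads[j]) // 2
                if loads[i] > avg:
                    loads[j] += loads[i] - avg; loads[i] = avg
                else:
                    loads[i] += loads[j] - avg; loads[j] = avg
    return loads

def cdem(groups):
    """CDEM sketch: groups[g][p] is the load of processor (g, p).
    First balance group totals between groups differing in bit d, moving
    the excess over the transpose link (g, h) <-> (h, g); then run DEM
    inside each group."""
    groups = [list(g) for g in groups]
    n = len(groups)
    for d in range(n.bit_length() - 1):
        for g in range(n):
            h = g ^ (1 << d)
            if g < h:
                tg, th = sum(groups[g]), sum(groups[h])
                avg = (tg + th) // 2
                src, dst = (g, h) if tg > th else (h, g)
                excess = max(tg, th) - avg
                groups[src][dst] -= excess     # bridge processor (src, dst) sends
                groups[dst][src] += excess     # bridge processor (dst, src) receives
    return [dem(g) for g in groups]
```

After the group phase every group total equals the network average up to the accumulated floor remainders, and the final per-processor spread is bounded by log2 Sqrt(P), as Theorem 2 later states.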
All pairs of processors whose binary representations differ in the dth bit (lines 23–25) perform load balancing within each group through the following steps in parallel:

Step 1: Processors' load sizes exchange All directly connected processors along the dth dimension exchange their load sizes (Fig. 3, line 26).

Step 2: Processors' average load calculation The average load is computed for each pair of processors directly connected along the dth dimension as the floor of the sum of the two processors' loads divided by two (line 27).

Step 3: Processors' load redistribution First, each processor compares its load to the average load (the average of the processor's load and its neighbor processor's load). If the processor's load is greater than the average load (line 28), the processor sends the excess load (the processor's load minus the average load) along the dth dimension (lines 29 and 30), and the processor's load is set equal to the average load (line 31). Otherwise, the processor receives its neighbor's excess load along the dth dimension (lines 32–35), and its load is incremented by that amount.

The proposed CDEM method is illustrated through an example of load balancing on a 16-processor OTIS-Hypercube, shown in Fig. 4, where the 16 processors are clustered into 4 groups, each of which consists of 4 processors. The processors are identified by a two-parameter label, where the first parameter indicates the group to which the processor belongs, and the second parameter indicates the processor's position within the group. Each processor operates on an assigned load; the number above each processor indicates the processor's current load. Intra-group processors are connected by electrical links, whereas inter-group processors are interconnected through optical interconnections, shown as dashed lines to distinguish them from electrical links.

Fig. 4 16-processor OTIS-Hypercube

Next, the complete example shown in Figs. 4 to 8 illustrates the method's phases while bringing the OTIS-Hypercube to a balanced state. In the first phase, the groups whose binary representations differ in the first bit position exchange their total load sizes, and the excess load is transferred from the higher-loaded group to the lower-loaded group. In the given example, group 0, whose total load is 28, exchanges its total load size with group 2, whose total load is 38. Each of the two groups computes the average group load and decides that both groups 0 and 2 should hold an average of 33 workload units. As Fig. 5 shows, group 2 sends the extra 5 workload units through processor (2, 0), which sends 5 units of its load to group 0; group 0 receives the new load units through processor (0, 2), increasing its 10 load units to 15, while the load of processor (2, 0) decreases from 13 to 8 load units. Simultaneously, groups 1 and 3 exchange their total load sizes, which are 43 and 27, respectively, so the groups' loads need to be redistributed to reach an average of 35 load units per group. Therefore, group 1 sends the extra 8 workload units it holds through processor (1, 3), which sends 8 units of its load to group 3; group 3 receives the new load units through processor (3, 1). Thus, the 5 load units of processor (3, 1) increase to 13 and the 14 load units of processor (1, 3) decrease to 6, as shown in Fig. 5.

Fig. 5 16-processor OTIS-Hypercube (groups' load balancing—phase 1)

In the same way, the groups' total load balancing proceeds between the groups whose binary representations differ in the second bit, as Fig. 6 indicates, where the groups' total load sizes are exchanged and the excess load is transferred from the higher-loaded to the lower-loaded group. At the end of phase 2, the groups hold the same or approximately the same total load.

Fig. 6 16-processor OTIS-Hypercube (groups' load balancing—phase 2)

To arrive at a balanced state, the processors within each group then perform load balancing using DEM. Figure 7 reveals the result of load balancing among the processors connected along the first dimension in each group. A completely balanced state is achieved by performing load balancing along the second dimension, as demonstrated in Fig. 8.

Fig. 7 16-processor OTIS-Hypercube (processors' load balancing—phase 1)

Fig. 8 16-processor OTIS-Hypercube (processors' load balancing—phase 2)
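The arithmetic of the phase-1 transfers in this example can be checked with a few lines (the helper name is ours):

```python
def group_transfer(total_a, total_b):
    """Floor average of two group totals and the excess the heavier group
    must send over the optical link (Step 3 of the group phase)."""
    avg = (total_a + total_b) // 2
    return avg, max(total_a, total_b) - avg

avg_02, move_02 = group_transfer(28, 38)   # groups 0 and 2
avg_13, move_13 = group_transfer(43, 27)   # groups 1 and 3
```

This reproduces the figures quoted above: averages of 33 and 35 units, with 5 units leaving processor (2, 0) (13 → 8, while (0, 2) goes 10 → 15) and 8 units leaving processor (1, 3) (14 → 6, while (3, 1) goes 5 → 13).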

4 Analytical modeling

This section presents the most important parameters used to evaluate the performance of a parallel processing system when the proposed load balancing procedure is applied; these parameters include execution time, load balancing accuracy, number of communication steps, and speed.


4.1 Execution time

The execution time metric measures the time required to perform the load balancing steps. The worst-case time complexity of the DEM method used for load balancing on Hypercube is O(M log2 P), where M is the maximum load assigned to each processor in a P-processor Hypercube [16]. The worst-case time complexity of CDEM, used for load balancing on OTIS-Hypercube, is given in Theorem 1.

Theorem 1 The worst-case time complexity of CDEM for load balancing on OTIS-Hypercube is O(Sqrt(P) ∗ M log2 P).

Proof If each processor has a maximum of M workload units, then each group has a maximum of Sqrt(P) ∗ M workload units. Thus, during the process of balancing the total loads among groups, there are at most Sqrt(P) ∗ M/2 workload units to be transferred between each pair of groups in each phase. Since the groups' total load balancing is performed in log2 Sqrt(P) phases, the time complexity contributed by the load balancing of the groups' total loads is O((Sqrt(P) ∗ M/2) log2 Sqrt(P)) = O((Sqrt(P) ∗ M/4) log2 P) ≈ O(Sqrt(P) ∗ M log2 P). In addition, during the processors' load balancing in each group, at most M/2 workload units need to be transferred between every two processors connected along dimension d. Since the number of dimensions along which the processors are organized is log2 Sqrt(P), the time complexity contributed by the processors' load balancing is O((M/2) log2 Sqrt(P)) = O((M/4) log2 P) ≈ O(M log2 P). Thus, the time complexity of the whole algorithm is

O((Sqrt(P) ∗ M/4) log2 P + (M/4) log2 P) ≈ O(Sqrt(P) ∗ M log2 P).

4.2 Load balancing accuracy

The load balancing accuracy is determined by the error with which the processors' loads are balanced, where the error is defined as the difference between the maximum number of workload units in any processor and the minimum number of workload units in any other processor.
The significance of the error as an evaluation parameter stems from the fact that an increase in error increases the processing time, since the processing time of a parallel processing system depends on the time taken by the processor that holds the most tasks, where all tasks are of the same size. Therefore, reducing the error is the objective of all load balancing algorithms, since such a reduction allows all processors to hold approximately the same number of tasks and thus finish execution at approximately the same time. The maximum resulting error using the DEM on a Hypercube is e ≤ log2 P [11]. CDEM, on the other hand, balances the OTIS-Hypercube's processors' loads with an error e bounded by log2 Sqrt(P). Theorem 2 gives the load balancing accuracy of CDEM on OTIS-Hypercube.

Theorem 2 The maximum resulting error e using CDEM on OTIS-Hypercube is e ≤ log2 Sqrt(P).


Proof Applying CDEM to balance the total loads among groups yields a maximum error e = log2 Sqrt(P): if the sum of the loads of two groups is odd, one group holds one more unit than the other, and since the load balancing among groups is accomplished in log2 Sqrt(P) phases, this difference can accumulate to log2 Sqrt(P). Such a maximum difference is then redistributed among the processors of the same group, leading to an error of e = log2 Sqrt(P) in the worst case when balancing the processors' loads within each group.

4.3 Number of communication steps

The number of communication steps represents the number of steps required by the processors to communicate in order to achieve load balancing. This number depends on the method used and the architecture of the interconnection network, and it may be affected by the initial load distribution. The significance of this metric stems from its effect on the processing time; less processing time can be achieved with fewer communication steps. Therefore, the objective of any load balancing method is to perform efficient load balancing in the least possible number of communication steps. The maximum number of communication steps required by the DEM for achieving load balancing on a P-processor Hypercube is 3 log2 P [17]. Approximately the same number of communication steps is required for a P-processor OTIS-Hypercube, as shown in Theorem 3.

Theorem 3 The number of communication steps required by CDEM for load balancing on OTIS-Hypercube is 3 log2 P.

Proof The number of communication steps for balancing the total loads among the groups is three per phase: two of these steps exchange load sizes and one transfers excess load between the two groups whose binary representations differ in the bit that corresponds to the load balancing phase number.
Since the number of processors in the network is P, the Sqrt(P) groups are balanced in log2 Sqrt(P) phases, so the number of communication steps is 3 log2 Sqrt(P). Likewise, the number of communication steps for load balancing among the processors in each group is three per dimension: two steps for exchanging load sizes and one step for transferring excess load between the two communicating processors in the same dimension. Since the Sqrt(P) processors in each group are organized in log2 Sqrt(P) dimensions, the number of communication steps is again 3 log2 Sqrt(P). Thus, 3 log2 Sqrt(P) steps for load balancing among groups and 3 log2 Sqrt(P) steps for load balancing among processors give a total of 6 log2 Sqrt(P) = 3 log2 P communication steps for the whole network.

4.4 Speed

The speed at which the load balancing process occurs plays a significant role in improving the system's performance and reducing the processing time. This metric is affected by several factors, the most important of which are the load balancing method applied and the interconnection technology used.


Assuming that the speed of the electrical technology used is 250 Mb/s [18], the speed at which DEM performs load balancing on Hypercube can be expressed as the aggregate speed of the links used during communication, which is

3 log2 P ∗ speed of electrical links = 3 log2 P ∗ 250 Mb/s = 750 log2 P Mb/s.

The speed at which CDEM performs load balancing on OTIS-Hypercube is given in Theorem 4.

Theorem 4 The speed at which CDEM performs load balancing on OTIS-Hypercube is

3 log2 Sqrt(P) ∗ speed of electrical links + 3 log2 Sqrt(P) ∗ speed of optical links = (3/2) log2 P ∗ 250 Mb/s + (3/2) log2 P ∗ 2.5 Gb/s.

Proof Since the OTIS-Hypercube interconnection network combines two link technologies, the speed of each must be taken into account. The speed of the whole system is therefore the aggregate speed of the electrical links traversed plus that of the optical links traversed, giving 3 log2 Sqrt(P) ∗ 250 Mb/s + 3 log2 Sqrt(P) ∗ 2.5 Gb/s = (3/2) log2 P ∗ 250 Mb/s + (3/2) log2 P ∗ 2.5 Gb/s, assuming that the speed of the electrical technology used is 250 Mb/s [18] and the speed of the optical interconnection technology used is 2.5 Gb/s [19].

4.5 Summary of analytical results

The previous analysis shows that, compared to Hypercube interconnection networks, OTIS-Hypercube excels in most performance metrics. This advantage is mainly attributed to the attractive OTIS architecture, in which the same number of processors can be organized in fewer dimensions, and to the use of the most suitable interconnection technology in each position: close processors are connected with electrical technology, while processors at larger distances are interconnected optically, achieving higher speed and lower power consumption, along with several other advantageous features.
Table 1 summarizes the evaluated metrics for both DEM on Hypercube and CDEM on OTIS-Hypercube.
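As a numerical sanity check of the expressions in Table 1, the short script below (helper names are ours; link speeds of 250 Mb/s electrical [18] and 2.5 Gb/s optical [19] assumed) evaluates the communication-step and speed formulas for sample sizes:

```python
import math

def cdem_comm_steps(P):
    """6 log2(sqrt(P)) = 3 log2(P): three steps per phase over
    log2(sqrt(P)) group phases plus log2(sqrt(P)) processor phases."""
    phases = int(math.log2(math.isqrt(P)))
    return 3 * phases + 3 * phases

def dem_speed_mbps(P, electrical=250):
    """DEM on Hypercube: 3 log2(P) electrical steps at 250 Mb/s."""
    return 3 * math.log2(P) * electrical

def cdem_speed_mbps(P, electrical=250, optical=2500):
    """CDEM on OTIS-Hypercube: half the steps electrical, half optical."""
    half = 1.5 * math.log2(P)          # 3 log2(sqrt(P)) steps of each kind
    return half * electrical + half * optical
```

For P = 16, both methods take 3 log2 16 = 12 communication steps, but the aggregate link speed is 3000 Mb/s for DEM versus 16 500 Mb/s for CDEM, reflecting the faster optical half of the steps.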

5 Experimental results and evaluation A significant complementary effort that supports the work accomplished so far is to conduct various experiments to evaluate, compare, and analyze distinct performance


Table 1 Analytical comparison

Metric                   DEM on Hypercube      CDEM on OTIS-Hypercube
Execution time           O(M log2 P)           O(Sqrt(P) ∗ M log2 P)
Error                    e ≤ log2 P            e ≤ log2 Sqrt(P)
Communication steps      3 log2 P              3 log2 P
Speed                    750 log2 P Mb/s       (3/2) log2 P ∗ 250 Mb/s + (3/2) log2 P ∗ 2.5 Gb/s

metrics when the proposed load balancing method, the Clusters Dimension Exchange Method (CDEM), is applied on an OTIS-Hypercube. CDEM has been implemented and applied to various sizes of simulated OTIS-Hypercube interconnection networks. Several experiments were then conducted to evaluate various performance metrics and compare them to the results of applying DEM on Hypercubes of equivalent sizes. The experimental runs of the load balancing methods under study were performed on a Dual-Core Intel Xeon processor (CPU 3.2 GHz) with Hyper-Threading Technology, 2 GB RAM, and 2 MB L2 cache per CPU. The experimental work was conducted under the SUSE Linux 10 operating system. The interconnection network simulation was developed for both OTIS-Hypercube and Hypercube using an object-oriented approach in the C++ programming language. The following major classes have been defined:
• The interconnection network class, which constructs the interconnection network of the given size.
• The processor class, which sets the properties and methods of processors.
• The link class, which connects processors according to the interconnection network architecture.
The simulation starts by constructing the desired interconnection network according to the number of processors determined by the user. The load balancing process is implemented using a multithreaded approach to support parallel execution of the load balancing steps. The Pthread library routines are used to create and manage a dynamic number of threads that perform the load balancing steps simultaneously. The main functions implemented to achieve load balancing include load computation, load size exchange, average load calculation, and excess load transfer. This section presents the experimental results for various metrics, including execution time, load balancing accuracy, number of communication steps, and speed.
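The paper's C++ simulator is not published; the following Python sketch (standard `threading` in place of Pthreads, with all class and method names our own) mirrors the three classes just listed and the threaded excess-load transfer:

```python
import threading

class Processor:
    """Holds a processor's id and its current integer load."""
    def __init__(self, pid, load=0):
        self.pid, self.load = pid, load
        self.lock = threading.Lock()   # guards concurrent load updates

class Link:
    """Connects two processors; transfer() moves workload units across it."""
    def __init__(self, a, b, optical=False):
        self.a, self.b, self.optical = a, b, optical

    def transfer(self, src, dst, units):
        with src.lock:                 # decrement sender under its own lock
            src.load -= units
        with dst.lock:                 # increment receiver under its own lock
            dst.load += units

class Network:
    """Builds the processors; links would be added per the chosen topology."""
    def __init__(self, loads):
        self.processors = [Processor(i, l) for i, l in enumerate(loads)]
        self.links = []

net = Network([10, 2, 6, 4])
link = Link(net.processors[0], net.processors[1])
worker = threading.Thread(target=link.transfer,
                          args=(net.processors[0], net.processors[1], 4))
worker.start()
worker.join()
```

After the threaded transfer, processors 0 and 1 both hold 6 units; a full simulator would spawn one such thread per exchanging pair in each load balancing phase.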
5.1 Execution time

Several experiments have been conducted to compute the time required to execute the proposed load balancing method on 16-, 64-, 256-, and 1024-processor OTIS-Hypercubes. In order to study the effect of the load size on the method's execution


Fig. 9 Average execution time using DEM on hypercube and CDEM on OTIS-Hypercube (maximum load size is 500 workload units per processor)

time, the experiments have been performed with variable average load sizes assigned to the processors. The load sizes used are 10, 50, 100, and 500 workload units, on average, assigned to each processor. The same experiments have been carried out for 16-, 64-, 256-, and 1024-processor Hypercubes.

An intuitive result is the increased execution time for an increasing number of processors. As Fig. 9 shows, the average execution time is a few milliseconds for a small number of processors, such as a 16-processor Hypercube and OTIS-Hypercube, while the increase in execution time becomes more observable for larger sizes of 256 processors and more. For instance, when the maximum load size is 500 workload units per processor, DEM on a 1024-processor Hypercube takes about 292 seconds, whereas CDEM on a 1024-processor OTIS-Hypercube consumes around 45 seconds.

The previous discussion dealt with the effect of the network's size on the execution time. Now consider the role played by the number of workload units assigned to each processor. The experiments revealed that the number of workload units assigned to processors greatly affects the execution time for a large number of processors. In addition, the execution time difference between Hypercube and OTIS-Hypercube becomes more observable for a larger number of processors. Figures 10 and 11 illustrate these conclusions.

Figure 10 compares the average execution time taken by DEM to balance a 64-processor Hypercube's load with the time required by CDEM to achieve load balancing on a 64-processor OTIS-Hypercube. A careful examination of Fig. 10 shows that the greatest execution time difference appears in the cases where the maximum load assigned to each of the networks' processors is 500 workload units. Figure 11 clarifies two facts.
The first, intuitive, fact is that when the number of processors is increased to 1,024, the average execution time difference between CDEM on OTIS-Hypercube and DEM on Hypercube remains roughly constant at about 200 seconds across different numbers of workload units. The second clear fact


Fig. 10 Average execution time using DEM on 64-processor Hypercube and CDEM on 64-processor OTIS-Hypercube

Fig. 11 Average execution time using DEM on 1024-processor Hypercube and CDEM on 1024-processor OTIS-Hypercube

is the supremacy of the proposed CDEM on an OTIS-Hypercube over DEM used for load balancing on a Hypercube. It is evident from Fig. 11 that less execution time is required to perform load balancing on a 1024-processor OTIS-Hypercube than on an equivalent Hypercube. For instance, assuming the maximum load is 10 workload units per processor, a 1024-processor Hypercube executes its load balancing method in about 289 seconds, whereas a 1024-processor OTIS-Hypercube requires around 43 seconds, which is a remarkable contribution to performance improvement.

Fig. 12 Average accuracy using DEM on Hypercube and CDEM on OTIS-Hypercube

5.2 Load balancing accuracy

The error, which determines the load balancing accuracy, is defined as the difference between the maximum number of workload units in any processor and the minimum number of workload units in any other processor. The conducted experiments estimated the average error incurred by applying the Hypercube's DEM and the OTIS-Hypercube's CDEM.

The average error results from applying DEM and CDEM on Hypercube and OTIS-Hypercube, respectively, are depicted in Fig. 12. It is apparent that better accuracy can be achieved on an OTIS-Hypercube. For the set of experiments performed, the error was reduced on 16-, 64-, and 256-processor OTIS-Hypercubes compared to equivalent Hypercubes, while the average error values coincided for the 1024-processor Hypercube and OTIS-Hypercube. This can be explained by the fact that in an OTIS-Hypercube a processor possessing the maximum load may be in a group distinct from that of a processor having the minimum load, which decreases the probability of their meeting in any of the load balancing phases.

5.3 Number of communication steps

The average number of communication steps was computed over several runs of DEM on Hypercubes of 16, 64, 256, and 1,024 processors. The experimental results of load balancing using CDEM on an OTIS-Hypercube revealed that the average number of communication steps is close to the number of communication steps required in Hypercubes of equivalent sizes. Figure 13 shows the average number of communication steps required by DEM and CDEM to achieve load balancing on Hypercube and OTIS-Hypercube, respectively.


Fig. 13 Average no. of communication steps using DEM on Hypercube and CDEM on OTIS-Hypercube

5.4 Speed

The speed at which the load balancing process occurs is an important metric to evaluate, since it reveals the role of the attractive technologies used in OTIS and their contribution to improving the system's performance and reducing processing time. Several experiments were conducted to estimate the speeds at which the proposed CDEM operates on 16-, 64-, 256-, and 1024-processor OTIS-Hypercubes. These experiments were followed by experimental work to evaluate the speeds at which DEM is performed on Hypercubes of equivalent sizes. The experiments assumed that the speed of the electrical interconnection technology is 250 Mb/s [18], and the speed of the optical interconnection technology is 2.5 Gb/s [19].

Remarkable results have been obtained experimentally. Figure 14 indicates the speed achieved on OTIS-Hypercube in comparison with Hypercube. The obtained results show that CDEM performs load balancing on OTIS-Hypercube at a rate about five times faster than DEM on Hypercube. For instance, CDEM performs load balancing on a 1024-processor OTIS-Hypercube at a speed of about 40 Gb/s, while the speed at which DEM performs load balancing on a 1024-processor Hypercube is around 7.5 Gb/s.

A careful examination of the obtained results revealed a match between the analytical and experimental evaluations of both methods, CDEM and DEM, on OTIS-Hypercube and Hypercube, respectively. An example of such a comparison is shown in Fig. 15, which shows that the empirical speeds at which the load balancing process is accomplished using CDEM on various OTIS-Hypercube sizes are very close to the results of the speed analysis.


Fig. 14 Speed of load balancing using DEM on Hypercube and CDEM on OTIS-Hypercube

Fig. 15 Analytical vs. experimental speed of CDEM on an OTIS-Hypercube

6 Conclusions and future work

The proposed load balancing method for OTIS-Hypercube interconnection networks, called the Clusters Dimension Exchange Method (CDEM), has been introduced. The performed analysis and the conducted experiments have been presented and compared to the analytical and experimental results obtained from applying the Dimension Exchange Method (DEM) on Hypercube interconnection networks. These results demonstrate the effectiveness of the proposed CDEM and the superiority of OTIS-Hypercube in terms of several performance metrics.


Reduced execution time was achieved, with higher accuracy and at higher speed, by applying the proposed load balancing method, CDEM, on OTIS-Hypercube, in contrast with the time required to execute load balancing on Hypercube using DEM. It was clear from the empirical and analytical results that the numbers of communication steps required by the CDEM and DEM load balancing methods are approximately the same.

This research work is intended to be extended by applying the proposed CDEM on Extended OTIS-Hypercube interconnection networks, in which groups of processors are interconnected with wraparound links. In traditional OTIS-Hypercube interconnection networks, there is only one interconnection between every two groups, along which excess load can be transferred. The extra wraparound interconnection is expected to allow excess load to be transferred along two interconnections, thus reducing the load transferred along each connection and reducing the amount of each processor's local load that must be transferred, since more than one processor will participate in the excess load transfer.

Acknowledgements The authors would like to express their deep gratitude to the anonymous referees for their valuable comments and suggestions, which improved the paper.

References

1. Marsden G, Marchand P, Harvey P, Esener S (1993) Optical transpose interconnection system architectures. Opt Lett 18(13):1083–1085
2. Najaf-Abadi H (2004) Performance modeling and analysis of OTIS networks. Master's thesis, Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
3. Wang C, Sahni S (1998) Basic operations on the OTIS-mesh optoelectronic computer. IEEE Trans Parallel Distrib Syst 9(12):1226–1236
4. Sahni S, Wang C (1997) BPC permutations on the OTIS-mesh optoelectronic computer. In: IEEE conference on massively parallel processing using optical interconnections (MPPOI 97)
5. Parhami B (2005) The Hamiltonicity of swapped (OTIS) networks built of Hamiltonian component networks. Inf Process Lett 95:441–445
6. Wang C, Sahni S (2001) Matrix multiplication on the OTIS-mesh optoelectronic computer. IEEE Trans Comput 50(7):635–646
7. Wang C, Sahni S (1998) BPC permutations on the OTIS-Hypercube optoelectronic computer. Informatica 22:263–269
8. Rajasekaran S, Sahni S (1998) Randomized routing, selection, and sorting on the OTIS-mesh. IEEE Trans Parallel Distrib Syst 9(9):833–840
9. Zewail A (2002) Light and life. Ninth Rajiv Gandhi Science and Technology Lecture, Bangalore, India
10. Krishnamoorthy A, Marchand P, Kiamilev F, Esener S (1992) Grain-size considerations for optoelectronic multistage interconnection networks. Appl Opt 31(26):5480–5507
11. Zhao C, Xiao W, Qin Y (2007) Hybrid diffusion schemes for load balancing on OTIS networks. In: ICA3PP, pp 421–432
12. Qin Y, Xiao W, Zhao C (2007) GDED-X schemes for load balancing on heterogeneous OTIS networks. In: ICA3PP, pp 482–492
13. Ranka S, Won Y, Sahni S (1988) Programming a hypercube multicomputer. IEEE Softw 5(5):69–77
14. Rim H, Jang J, Kim S (1999) An efficient dynamic load balancing using the dimension exchange method for balancing of quantized loads on hypercube multiprocessors. In: Proc of the second merged symposium (IPPS/SPDP 1999), 13th international parallel processing symposium and 10th symposium on parallel and distributed processing, pp 708–712
15. Rim H, Jang J, Kim S (2003) A simple reduction of non-uniformity in dynamic load balancing of quantized loads on hypercube multiprocessors and hiding balancing overheads. J Comput Syst Sci 67:1–25


16. Jan G, Hwang Y (2003) An efficient algorithm for perfect load balancing on hypercube multiprocessors. J Supercomput 25:5–15
17. Willebeek-LeMair M, Reeves A (1993) Strategies for dynamic load balancing on highly parallel computers. IEEE Trans Parallel Distrib Syst 4(9):979–993
18. Kibar O, Marchand P, Esener S (1998) High speed CMOS switch designs for free-space optoelectronic MINs. IEEE Trans Very Large Scale Integr (VLSI) Syst 6(3):372–386
19. Esener S, Marchand P (2000) Present and future needs of free-space optical interconnects. In: IPDPS 2000 workshop on parallel and distributed processing, pp 1104–1109

Basel A. Mahafzah is an Assistant Professor of Computer Science at the University of Jordan, Jordan. He received his B.Sc. degree in Computer Science in 1991 from Mu'tah University, Jordan. He also earned a B.S.E. degree in Computer Engineering from the University of Alabama in Huntsville, USA, and obtained his M.S. degree in Computer Science and Ph.D. degree in Computer Engineering from the same university, in 1994 and 1999, respectively. During his graduate studies he held a fellowship from the Jordan University of Science and Technology. After obtaining his Ph.D. and before joining the University of Jordan, he joined the Department of Computer Science at Jordan University of Science and Technology, where he held several positions: Assistant Dean, Vice Dean, and Chief Information Officer at King Abdullah University Hospital. His research interests include Performance Evaluation, Parallel and Distributed Computing, Interconnection Networks, Artificial Intelligence, Data Mining, and e-Learning. He has received more than one million U.S. dollars in research and project grants. Moreover, Dr. Mahafzah has supervised Master's students and developed several graduate and undergraduate programs in various fields of Information Technology. His teaching experience extends to eight years.

Bashira A. Jaradat is a Teaching and Research Assistant at the Computer Science Department of the Hashemite University, Jordan. She received her B.Sc. and M.S. degrees in Computer Science from Jordan University of Science and Technology, Jordan, in 2004 and 2007, respectively. Her research interests include Parallel and Distributed Computing, Artificial Intelligence, Spatial Data Mining, and Mobile Databases.