Placement of Storage Nodes in a Network

S.R. Subramanya
Dept. of EE&CS, George Washington University, Washington, DC 20052
subra@seas.gwu.edu

B. Narahari
Dept. of EE&CS, George Washington University, Washington, DC 20052
narahari@seas.gwu.edu

Rahul Simha
Dept. of CS, College of William and Mary, Williamsburg, VA 23185

Abstract.

Media-On-Demand (MOD) servers cater to users' needs for data and information such as news, movies, interactive games, music, merchandise catalogs, etc. This requires the storage, management, and delivery of huge amounts of multimedia data. One of the models of a MOD server is a general network of computers. Some of the nodes in the network are storage nodes, which contain data repositories, and the others are interface nodes, which obtain data from storage nodes and deliver them to the users (clients) in a timely fashion. Determining the locations of the storage nodes to minimize the data traffic between them and the interface nodes is a global optimization problem. This paper presents an offline heuristic for choosing the locations of storage nodes and clustering interface nodes with storage nodes so as to minimize the traffic from the storage nodes. The proposed algorithm has been validated by simulations.

Keywords: MOD Servers, Network, Storage nodes, Clustering, K-means algorithm.

1 Introduction

A Media-on-demand (MOD) server has repositories of multimedia data, such as video, audio, images, graphics, and animation, together with text, stored on disks. The clients (users) request data and information such as news, movies, interactive games, music, merchandise catalogs, etc. The typical operations on video, for example, could be: browse, play, pause, rewind, fast-forward, etc. The objective of the MOD server is to satisfy as many clients as possible with the available resources (disk space, communication links, buffers, etc.), adhering to certain quality-of-service (QoS) requirements. Some simple measures of QoS for video are (1) startup time, (2) response time to commands (like pause, FF, rewind, etc.), (3) the frame rate, etc. Among the several methods to improve the throughput of MOD systems are: optimizing the placement of data on disks [4, 5, 6], pipelining and caching [7, 8], allocation of requests and scheduling of I/O [9, 10], and newer kinds of file systems [11, 13, 14].

One of the models of a MOD server is a general network as shown in Fig. 1, where the nodes denote computers and the edges denote the communication links. Some of these nodes are interface nodes (I-nodes) and the others are storage nodes (S-nodes). The requests from clients are accepted (if possible) and served by the I-nodes. The S-nodes contain repositories of all data (compressed, striped, etc.), while the I-nodes contain a limited amount of disk storage and appropriate buffer space to support the required rate of transfer to meet a given QoS.

One of the key factors related to the performance of a MOD system, based on the above model, is the data transfer time between S-nodes and I-nodes. For a given network connectivity, link delays, and demands of the different nodes, the placement of the S-nodes in the network affects this data transfer time. This is a global optimization problem requiring extensive computations. In this paper, we present a heuristic based on the k-means clustering algorithm [1, 2, 3] and compare it with two greedy schemes. All three schemes have been implemented, and the simulation results demonstrate the improved performance of the heuristic over the greedy schemes.

The next section formulates the problem, Section 3 describes the greedy schemes, Section 4 presents the proposed heuristic, and simulation results are given in Section 5, followed by conclusions.

Figure 1: Media-On-Demand server (network diagram whose nodes are labeled as interface nodes and storage nodes)

2 Notations and Problem Formulation

2.1 Notations and Assumptions

The following notations and definitions are used in the problem formulation and in the subsequent description and algorithm.

$N$: Number of nodes (computers) in the network.
$M$: Number of storage (server) nodes.
$D$: The delay matrix.
$S$: Data demands vector of the interface nodes.
$L$: List of indices of storage nodes.
$C_i$: Cluster $i$ with indices of its interface nodes.
$I$: $\{1 \ldots N\}$. Indices of all nodes in the network.
$n_i$: Node $i$ in the network.

Using the delay matrix $D$, the matrix $D^*$, which contains the minimum delays between all pairs of nodes, is computed. This is easily done by using the solution to the standard all-pairs-shortest-path problem.

We note the following assumptions: (1) The data requirements of all I-nodes in any cluster are satisfied by the S-node of that cluster. (2) The storage and processing capacities of the S-nodes are about the same. (3) The data requirements of all I-nodes in the network are uniformly distributed.

2.2 Problem Formulation

Given: $N$ nodes in a network, a matrix of delays $D = (d_{ij})$, where $d_{ij}$ denotes the delay in the transfer of one unit of data from node $i$ to node $j$, and the vector of the sizes of data demands of the nodes, $S = \{S_1, S_2, \ldots, S_N\}$.

Determine: (1) $L$, the list of the locations of the $M$ storage nodes and (2) the $M$ clusters, $C_1 \ldots C_M$, where $C_i$ contains the locations of all the interface nodes to be served by storage node $i$.

Objective: Minimize the total transfer time of all data from the storage nodes to the interface nodes. Let $T_k$ denote the total time required for all transfers from the storage node $k$ to all the interface nodes which it serves. Then $T_k = \sum_{i \in C_k} d(k, i) \cdot S_i$. The total transfer time $T = \mathrm{MIN}_{k = 1 \ldots M} \{T_k\}$. The objective is to minimize $T$.

This is a variant of the clustering problem, where a finite set of items has to be partitioned into disjoint subsets such that the `distances' between all items in a group are as small as possible, while the `distances' between the groups are as large as possible. In our problem setting, a cluster refers to a set of I-nodes and one S-node, where data transfer takes place from the S-node to the I-nodes in the cluster. Determination of the subsets (clusters) is a global optimization problem, whereby a set of simultaneous algebraic equations describing the optimality criterion must be solved by direct or iterative methods. This is computationally intensive and is the motivation for the heuristics we present in this paper.
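The two quantities defined in Section 2 can be sketched in Python as follows. This is only an illustrative sketch: the function names are hypothetical, the minimum-delay matrix is computed with the Floyd-Warshall algorithm (one standard solution to the all-pairs-shortest-path problem, not necessarily the one the authors used), and the per-cluster time is read as the sum of minimum delay times demand over the I-nodes of the cluster.

```python
from typing import List

INF = float("inf")  # marks a pair of nodes with no direct link


def all_pairs_min_delay(D: List[List[float]]) -> List[List[float]]:
    """Return the minimum-delay matrix D* using Floyd-Warshall."""
    n = len(D)
    dstar = [row[:] for row in D]          # work on a copy of D
    for i in range(n):
        dstar[i][i] = 0                    # a node reaches itself with no delay
    for m in range(n):                     # m is the intermediate node
        for i in range(n):
            for j in range(n):
                via = dstar[i][m] + dstar[m][j]
                if via < dstar[i][j]:
                    dstar[i][j] = via
    return dstar


def cluster_transfer_time(k: int, members: List[int],
                          dstar: List[List[float]], S: List[float]) -> float:
    """T_k: sum over the I-nodes i served by S-node k of delay(k, i) * S_i."""
    return sum(dstar[k][i] * S[i] for i in members)


# Hypothetical 4-node network; node 0 serves nodes 1, 2 and 3.
D = [
    [0, 4, 10, INF],
    [4, 0, 3, 9],
    [10, 3, 0, 2],
    [INF, 9, 2, 0],
]
S = [0, 5, 2, 1]
dstar = all_pairs_min_delay(D)
t0 = cluster_transfer_time(0, [1, 2, 3], dstar, S)
```

In this small instance the route 0-1-2-3 is cheaper than any direct link from node 0 to nodes 2 and 3, which is why the shortest-path step matters before clustering.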

3 Greedy Clustering

In this scheme, the M S-nodes out of the given N nodes are first determined. The locations of the S-nodes can be chosen randomly or by a simple heuristic. Accordingly, we have two schemes called Greedy1 and Greedy2. Then the I-nodes are put in appropriate clusters by finding the S-node which has the least link delay with the given I-node. The pseudocode is given below.

Algorithm 3.1 GreedyCluster (in: $N, M, D, S$; out: $L$)
1. begin {First determine the locations of the M S-nodes. Then do the clustering.}
2. Randomly pick M nodes to serve as S-nodes (Greedy1), or do steps 3 to 7 below (Greedy2).
3. Pick an arbitrary node $n_{k_1}$.
4. $L \leftarrow \{k_1\}$.
5. while $|L| < M$ do
6.     Find $j \notin L$ s.t. $\sum_{k \in L} (D[k, j] + S_j) \ge \sum_{k \in L} (D[k, l] + S_l),\ \forall l \in I - L$. {Among the nodes not in L, j has the maximum delay with the S-nodes currently in L.}
7.     $L \leftarrow L \cup \{j\}$.
   endwhile {L contains the locations (indices) of storage nodes.}
8. $I \leftarrow I - L$.
9. while $I \ne \emptyset$ do
10.    Pick an $i$ from $I$.
11.    Find $k$ such that: $D[k, i] \le D[l, i],\ \forall l \in L$. {i.e., find the closest S-node for node i}
12.    $C_k \leftarrow C_k \cup \{i\}$; $I \leftarrow I - \{i\}$.
    endwhile
13. end
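The greedy schemes above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' C implementation: the function name and the `spread`, `first`, and `seed` parameters are hypothetical, `spread=False` corresponds to Greedy1 and `spread=True` to Greedy2, and ties are broken by taking the lowest node index.

```python
import random
from typing import Dict, List, Optional, Tuple


def greedy_cluster(D: List[List[float]], S: List[float], M: int,
                   spread: bool = True, first: Optional[int] = None,
                   seed: Optional[int] = None
                   ) -> Tuple[List[int], Dict[int, List[int]]]:
    """Pick M S-nodes, then attach every I-node to its closest S-node."""
    rng = random.Random(seed)
    n = len(D)
    if spread:
        # Greedy2: start from one node, then repeatedly add the node with
        # the largest summed (delay + demand) to the S-nodes chosen so far.
        L = [rng.randrange(n) if first is None else first]
        while len(L) < M:
            outside = [j for j in range(n) if j not in L]
            j = max(outside, key=lambda c: sum(D[k][c] + S[c] for k in L))
            L.append(j)
    else:
        # Greedy1: M distinct nodes chosen uniformly at random.
        L = rng.sample(range(n), M)
    clusters: Dict[int, List[int]] = {k: [] for k in L}
    for i in range(n):
        if i not in clusters:                    # i is an I-node
            k = min(L, key=lambda s: D[s][i])    # S-node with the least delay
            clusters[k].append(i)
    return L, clusters


# Hypothetical example: five nodes on a line, delay = distance, unit demands.
pos = [0, 1, 2, 8, 9]
D = [[abs(a - b) for b in pos] for a in pos]
S = [1, 1, 1, 1, 1]
L, clusters = greedy_cluster(D, S, M=2, first=0)
```

Starting from node 0, Greedy2 picks the far end of the line (node 4) as the second S-node, and the remaining nodes attach to whichever end is nearer.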

4 Proposed Algorithm

The proposed algorithm (1) determines the locations of the S-nodes, and (2) assigns a set of I-nodes to each S-node (clustering), with the objective of minimizing the data transfer times between the I-nodes and the S-nodes. The algorithm is based on k-means clustering, which is a commonly used partitional clustering method [1, 3]. (Refer to Section 2.2 for the definition of clustering.)

The outline of the algorithm is as follows. First the M locations are chosen. This can be done (1) randomly, or (2) by considering all nodes as possible locations and repeatedly merging them suitably (bottom-up), or (3) by starting with a single node and repeatedly adding more nodes suitably (top-down). Once the M locations are tentatively determined, the clusters are formed by assigning each I-node to the S-node which minimizes its data transfer time. This takes into account both the link delays and the load on the S-nodes. Then the overall data transfer time is determined. The centroids of the clusters are then determined and taken as the new locations of the S-nodes. The process of determining the clusters, the overall transfer time, and the new centroids is repeated until the `error', the difference between the current data transfer time and the previous one, becomes sufficiently small. In the context of this problem, centroid refers to the node such that the sum of the link delays of this node with all other nodes in the cluster is the minimum. The pseudocode of the heuristic is given below.

Algorithm 4.1 K-M-Cluster (in: $N, M, D, S$; out: $L$)
1. begin
2. Start with M nodes initially chosen to be the storage nodes. These could be chosen randomly or by using methods such as pair-wise nearest neighbor clustering or splitting.
3. $L \leftarrow$ {Indices of the M S-nodes}.
4. $I \leftarrow \{1 \ldots N\}$.
5. Iteration number, $r \leftarrow 1$; $T^0 \leftarrow$ Large value.
6. $C_i \leftarrow \emptyset,\ \forall i \in L$.
7. for each $i \in I - L$ do
8.     Find the S-node, say $k$, which has the minimum data transfer time, i.e., $D^*[k, i] \le D^*[l, i],\ \forall l \in L$.
9.     $C_k \leftarrow C_k \cup \{i\}$.
10. endfor
11. for each cluster $k$ do
12.    Find the data transfer time, $T_k$. $T_k = \sum_{i \in C_k} \{D^*[k, i] + S_i\}$.
13. endfor
14. Compute total transfer time $T^r$ for the current iteration. $T^r = \mathrm{MAX}_{k \in L} \{T_k\}$. {Test for convergence.}
15. if $\left| \frac{T^{r-1} - T^r}{T^{r-1}} \right| < \epsilon$ then
16.    Stop. $L$ is the list of the S-nodes and $C_k,\ 1 \le k \le M$, are the clusters. Exit.
17. endif
18. $\hat{L} \leftarrow \emptyset$.
19. for each cluster $k$ do
20.    $\hat{k} \leftarrow$ FindCentroid($C_k, D^*$).
21.    $\hat{L} \leftarrow \hat{L} \cup \{\hat{k}\}$.
22. endfor
23. Update list of S-nodes: $L \leftarrow \hat{L}$.
24. $r \leftarrow r + 1$. Goto step 6.
25. end
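The iteration of Algorithm 4.1 can be sketched in Python as follows. This is a hedged sketch rather than the authors' implementation: the names `k_m_cluster`, `eps`, and `max_iter` are hypothetical, the iteration cap is an added safeguard, the convergence test is written without a division so that a zero transfer time does not raise an error, and the current S-node is assumed to be a candidate when the centroid of its cluster is recomputed. The inline `find_centroid` helper stands in for Algorithm 4.2.

```python
import math
from typing import Dict, List, Sequence, Tuple


def find_centroid(nodes: Sequence[int], dstar: List[List[float]],
                  S: List[float]) -> int:
    """Node of the cluster with the smallest summed (delay + demand)."""
    return min(nodes, key=lambda i: sum(dstar[i][l] + S[l] for l in nodes))


def k_m_cluster(dstar: List[List[float]], S: List[float],
                initial: Sequence[int], eps: float = 1e-3,
                max_iter: int = 100
                ) -> Tuple[List[int], Dict[int, List[int]], float]:
    """Alternate between assigning I-nodes and moving S-nodes to centroids."""
    n = len(dstar)
    L = list(initial)
    prev_T = math.inf                      # T^0 <- large value
    clusters: Dict[int, List[int]] = {}
    T = math.inf
    for _ in range(max_iter):
        # Steps 6-10: assign every I-node to its closest S-node.
        clusters = {k: [] for k in L}
        for i in range(n):
            if i not in clusters:
                clusters[min(L, key=lambda s: dstar[s][i])].append(i)
        # Steps 11-14: per-cluster times, then the maximum over clusters.
        T = max(sum(dstar[k][i] + S[i] for i in c)
                for k, c in clusters.items())
        # Step 15: relative change test, rearranged to avoid dividing.
        if math.isfinite(prev_T) and abs(prev_T - T) <= eps * prev_T:
            break
        prev_T = T
        # Steps 18-23: the centroid of each cluster becomes its new S-node.
        L = [find_centroid([k] + c, dstar, S) for k, c in clusters.items()]
    return list(clusters), clusters, T


# Hypothetical example: six nodes on a line, delay = distance, unit demands,
# and a deliberately poor initial placement at nodes 0 and 1.
pos = [0, 1, 2, 8, 9, 10]
dstar = [[abs(a - b) for b in pos] for a in pos]
S = [1] * 6
L, clusters, T = k_m_cluster(dstar, S, initial=[0, 1])
```

In this example the S-nodes move from the left end of the line to the middle of each natural group, and the maximum cluster time falls from 29 in the first iteration to 4 at convergence.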

Algorithm 4.2 FindCentroid (in: $C_k, D^*$; out: $\hat{k}$ (centroid node))
1. begin
2. for each $i \in C_k$ do
3.     Find $T(k, i)$, the time to finish the data transfers of all nodes in cluster $C_k$ with $i$ as the S-node. $T(k, i) = \sum_{l \in C_k} \{D^*[i, l] + S_l\}$.
4. endfor
5. Choose node $\hat{k}$ such that $T(k, \hat{k})$ is the minimum of $T(k, i)$ over all $i \in C_k$. {$\hat{k}$ as the S-node of cluster $C_k$ results in minimum transfer time in the cluster.}
6. end
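A small worked instance of the centroid computation, sketched in Python, may help. The function names and the three-node data are hypothetical, and the sum is taken over every node of the cluster, including the candidate itself (whose delay to itself is zero).

```python
from typing import Dict, List, Sequence


def centroid_times(nodes: Sequence[int], dstar: List[List[float]],
                   S: List[float]) -> Dict[int, float]:
    """T(k, i) for every candidate i: summed (delay + demand) over the cluster."""
    return {i: sum(dstar[i][l] + S[l] for l in nodes) for i in nodes}


def find_centroid(nodes: Sequence[int], dstar: List[List[float]],
                  S: List[float]) -> int:
    """Candidate with the smallest T(k, i); ties go to the earliest node."""
    times = centroid_times(nodes, dstar, S)
    return min(times, key=times.get)


# Hypothetical three-node cluster.
dstar = [
    [0, 2, 7],
    [2, 0, 4],
    [7, 4, 0],
]
S = [3, 1, 2]
times = centroid_times([0, 1, 2], dstar, S)
centroid = find_centroid([0, 1, 2], dstar, S)
```

Because the demand terms add the same constant for every candidate over a fixed cluster, the choice is driven by the summed delays alone; here node 1 sits between the other two and is selected.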

5 Simulation results

The greedy algorithm (with the two kinds of placement of servers) and the proposed algorithm were simulated by a C program on a Sparcstation. The data transfer times for the schemes were measured for networks of 50 and 100 nodes, out of which 5 through 25 nodes were chosen as the S-nodes. The delay matrix was generated with random delays uniformly distributed in the interval [50, 150], and the sizes of the data requests were uniformly distributed in the interval [300, 900]. The results are shown in Fig. 2 below. In these simulations the proposed placement heuristic performs better than the greedy schemes.
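A test instance matching the stated intervals could be generated along the following lines. This is only a sketch: the function name and `seed` parameter are hypothetical, and the symmetric delay matrix with a zero diagonal is an assumption, since the paper does not say how the delay matrix was structured.

```python
import random
from typing import List, Optional, Tuple


def random_instance(n: int, seed: Optional[int] = None
                    ) -> Tuple[List[List[float]], List[float]]:
    """Random delays in [50, 150] and data demands in [300, 900]."""
    rng = random.Random(seed)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Assumption: the delay is the same in both directions.
            D[i][j] = D[j][i] = rng.uniform(50, 150)
    S = [rng.uniform(300, 900) for _ in range(n)]
    return D, S


D, S = random_instance(50, seed=1)
```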

6 Conclusions

This paper proposed a heuristic for the placement of storage nodes in a network serving as a MOD server, with the objective of minimizing the data transfer time between the storage nodes and the interface nodes. Simulations showed the heuristic to perform better than two greedy schemes.

Figure 2: Performance of the different schemes. Two panels (50-node network and 100-node network) plot the total data transfer time against the number of servers (5 to 25) for Greedy1, Greedy2, and the Heuristic.

References

[1] MacQueen, J. `Some methods for classification and analysis of multivariate observations', Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Le Cam, L.M. and Neyman, J. (Eds.), 1967, pp. 281-297.
[2] Hartigan, J. Clustering Algorithms, Wiley, 1975.
[3] Jain, A.K. and Dubes, R.C. Algorithms for Clustering Data, Prentice-Hall, 1988.
[4] Vin, H.M. et al. `Optimizing the Placement of Multimedia Objects on Disk Arrays', International Conf. on Multimedia Computing and Systems, 1995, pp. 158-165.
[5] Rangan, P.V. and Vin, H.M. `Efficient Storage Techniques for Digital Continuous Multimedia', IEEE Transactions on Knowledge and Data Engineering, 5(4), August 1993, pp. 564-573.
[6] Brubeck, D.W. and Rowe, L.A. `Hierarchical Storage Management in a Distributed VOD System', IEEE Multimedia, Fall 1996, pp. 37-47.
[7] Ozden, B. et al. `A Framework for the Storage and Retrieval of Continuous Media Data', International Conf. on Multimedia Computing and Systems, 1995, pp. 2-13.
[8] Cohen, A. et al. `Pipelined Disk Arrays for Digital Movie Retrieval', International Conf. on Multimedia Computing and Systems, 1995, pp. 312-317.
[9] Jadav, D. and Choudhary, A. `Techniques for Increasing the Stream Capacity of a Multimedia Server', Int'l. Conf. on High Performance Computing, Dec. 1996, pp. 43-48.
[10] Reddy, A.L.N. and Wyllie, J.C. `I/O Issues in a Multimedia System', Computer, 27(3), March 1994, pp. 69-74.
[11] Tobagi, F.A. et al. `Streaming RAID: A Disk Storage System for Video and Audio Files', Proc. ACM Multimedia Conf., Aug. 1993, pp. 393-400.
[12] Anderson, P. et al. `A File System for Continuous Media', ACM Trans. on Computer Systems, Nov. 1992, pp. 311-337.
[13] Haskin, R.L. `The Shark Continuous-Media File Server', Proc. of IEEE COMPCON, Feb. 1993.
[14] Ozden, B. et al. `Fellini: a File System for Continuous Media', Tech. Report 113880941028-30, AT&T Bell Labs., 1994.
