Proceedings of the 32nd Hawaii International Conference on System Sciences - 1999
Efficient and Dependable Multimedia Data Delivery Service in World Wide Web Environment

Qutaibah M. Malluhi and Gwang S. Jung
Computer Science Department, Jackson State University, Jackson, MS 39217
[email protected] [email protected]

William E. Johnston
Information and Computer Sciences Division, Lawrence Berkeley National Laboratory, MS 50B-2239, Berkeley, CA 94720
[email protected]

Abstract

Multimedia data is characterized by large objects that require high bandwidth. This paper presents a technique that enables efficient and dependable data storage and delivery of large multimedia objects in the World Wide Web environment. The proposed approach stripes data into blocks that are distributed over multiple Web servers. Data distribution is transparent to users. The parallelism of multiple Web servers is exploited to achieve high data rates. The paper presents two data coding techniques to achieve high dependability.

1. Introduction

Multimedia data handling is becoming a core technology in computer science. The World Wide Web (WWW) architecture is evolving to support multimedia protocols and standards. Database management system architectures are departing from text-only data to support multimedia objects. Web documents, databases, and digital libraries are becoming more and more populated with multimedia content. Multimedia objects are characterized by large sizes that require high bandwidth. Therefore, there is an emerging need to provide ways for efficient handling of large data objects that are accessed through limited bandwidth, storage, and processing resources. Moreover, Internet information services are becoming an indispensable part of our lives. The operation of businesses, research institutions, and organizations, as well as individuals, is becoming dependent on the availability of these services and, therefore, failure of these resources is becoming very costly. The computer systems in the Internet collectively provide a large amount of computational and storage resources. Such resources should be effectively exploited to offer efficient, highly available, reliable, and, therefore, dependable services.

In this paper, we develop a method for enabling efficient and dependable data service in a large-scale distributed computing and communication environment like the WWW. The proposed method is based on striping original data into blocks that are distributed across available data servers (e.g., Web servers). Efficient data service can then be achieved by using multiple links, established between the client and the data servers, to transfer data blocks in parallel. The client is responsible for joining the downloaded blocks. The system and communication link capacity of the client can therefore be fully utilized. The paper describes two methods for enabling dependable service. These methods use coding techniques to add redundancy to the original data. This redundancy enables us to retrieve the original data even if a portion of the data is unavailable due to server and/or network failure.

The following section describes data mirroring as a common alternative for achieving load balancing and dependability in the WWW. In Section 3, we discuss the basic idea of the proposed data storage and delivery scheme. Section 4 discusses two coding schemes to accomplish dependable service. In Section 5, a fault tolerance analysis of our scheme versus mirroring is outlined. Finally, Section 6 concludes the paper.
2. Data Mirroring in WWW

Service through mirror sites in the Internet environment has been generally accepted as a technique to improve data and service availability [1][5]. By creating and maintaining replicated data copies in several mirror sites, a service remains operational in spite of node and/or communication link failures. The mirror site approach can also increase efficiency by distributing the workload over multiple mirror sites. Data or document caching is another special type of data replication that is commonly used to improve data availability and performance [2][3][6].
Therefore, two major motivations for service replication are increasing service availability and improving end-to-end performance. Service availability is defined as the probability that there exist enough functioning (or available) resources in the distributed system so that an arriving request can be satisfied. End-to-end performance can in general be measured by the response time required to complete an access request. End-to-end performance depends not only on the failure characteristics of the system, but also on the processing and communication capacities of the environment.

Even if a service is available at a user's request time, it is not guaranteed that the user gets the desired data within a reasonable response time. There is always a chance that a server, router, or network link fails during data transmission. This is especially true when a large object (e.g., a program, audio, or video clip) is transmitted from a server to the user's client. The absence of recovery through mere data replication renders the system less reliable, especially if the failure interrupts on-going operations. Therefore, the service provided by mirror sites may not be dependable. To be dependable, the service should be both initiated successfully and terminated correctly within a reasonable response time. In other words, a dependable service must be highly available and at the same time reliable [9][12].

Maintaining mirror sites attempts to reduce server load [10]. Load balancing among mirror sites, however, cannot be automatically controlled by the system and is not guaranteed. Load balance could only be obtained if users' requests were evenly distributed among the replicated servers during the entire service time. If users' requests are concentrated on only a small subset of the replicated servers, load balancing is lost. It is therefore desirable to achieve a guaranteed load balance that is systematically controlled by the system and not by the users.

Another goal of service through mirror sites is to improve data transfer time between the server and the client. This goal can, to some extent, be achieved by encouraging users to select the server whose geographic distance from the client is shortest. However, due to the current Internet architecture, geographic distance can be misleading. The Internet consists of several large backbones that are connected at several peering points. Inter-backbone traffic has to go through one of these peering points. Therefore, connecting to a server that is a few miles away in the same city may be more expensive than connecting to a faraway server when the two servers are attached to two different backbones.
3. Data striping in WWW

In our approach, we first stripe an object into blocks. The striped blocks are then distributed across Web servers (servers for short). A client gets an object by using multi-threaded communication links to the data block servers. Servers transmit data blocks in parallel to the client. The client collects and combines these blocks. High data rates are achieved by utilizing the cumulative bandwidth of multiple servers and multiple network paths in order to fully exploit the client bandwidth. Scalability can be achieved by simply adding more servers to increase the level of parallelism. Reliability and high availability are achieved by utilizing a set of check blocks, encoded from the original striped blocks, to recompute unavailable data.

The approach described in this paper is much more cost-effective than replication because the encoded redundancy is much smaller in size than the original data. Unlike replication, in our approach load balancing is handled and guaranteed by the system. Our striping approach enables the client to fully utilize its bandwidth: the client bandwidth can be saturated by the aggregate bandwidth of multiple parallel data servers. Therefore, the server speed and bandwidth are never the bottleneck. What the client pays for in bandwidth is what it gets.
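As a concrete illustration of the delivery side of this idea, the following minimal Python sketch fetches the blocks of one object from several Web servers in parallel and joins them at the client. The server names, block-naming convention, and the use of a thread pool are assumptions made only for this example; they are not part of the paper's implementation.

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch_block(url):
    # Each block is published as an ordinary Web document and fetched over HTTP.
    with urlopen(url, timeout=10) as resp:
        return resp.read()

def download_striped(block_urls):
    # block_urls[i] is the URL of block i; blocks are fetched in parallel and joined in order.
    with ThreadPoolExecutor(max_workers=len(block_urls)) as pool:
        blocks = list(pool.map(fetch_block, block_urls))
    return b"".join(blocks)

# Hypothetical layout: block i of 'movie.mpg' is stored on server i mod 4.
servers = [f"http://server{i}.example.edu" for i in range(4)]
urls = [f"{servers[i % 4]}/blocks/movie.mpg.{i}" for i in range(16)]
# data = download_striped(urls)

With plain Web servers each block is just a separate document, so no server-side changes are needed; the thread pool here is only a convenient modern stand-in for the multi-threaded connections described above.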
3.1. Dependable Data Delivery

Dependable service must be highly available and at the same time reliable. A server is called a faulty server if its data is inaccessible; otherwise, the server is operational. A faulty server does not necessarily mean that the server is down. Inaccessibility of server data can be the result of a variety of causes: the server is down, the network delay is very high, network packets are lost, the file is corrupted, destroyed, or deleted, or only part of the file object is transmitted (security considerations may dictate that some sensitive portion of the object not be sent over an insecure or unreliable medium).

In our approach, redundant data blocks are added to the original data blocks. The redundant blocks are used for retrieving original data blocks if a portion of the data blocks is unavailable. The data blocks, along with the redundant blocks, are distributed over the network across multiple servers. The data block layout over the servers is performed in a way that maximizes parallelism. Moreover, data blocks and their corresponding redundancy are stored on distinct servers in order to enable the system to overcome server faults. The client downloads a file object from a set of operational servers storing data blocks corresponding to
the file. Available service can be achieved by providing the client with a mechanism for automatically selecting a set of operational servers for a target object. Fault tolerance can be achieved by recovering from failures when they occur.

Let n be the total number of servers and let m ≤ n be the number of data servers, i.e., the servers storing original data blocks of a file object. The remaining k = n – m servers are used to hold redundant data for fault tolerance and are named check servers. The choice of m and k is a function of the desired storage overhead (amount of redundancy) or, equivalently, the desired degree of fault tolerance. A file object is divided into p data blocks. For simplicity, we assume that m divides p; otherwise, dummy blocks can be added to make p a multiple of m. These p blocks are divided into groups of m blocks each. Each group is encoded to produce k redundant blocks (see Figure 1). The m data blocks and their corresponding k redundant blocks are said to be married. Encoding is done in such a way that if any m out of the n married blocks are available, the original m blocks can be reconstructed. Therefore, by storing married blocks on distinct servers, if any ≤ k Web sites are faulty, there exist at least m operational servers and thus the original m blocks can be produced.

Figure 1. Distribution of data blocks over Web servers: the p data blocks are divided into groups of m; each group is encoded into k married redundant blocks, and the married blocks are spread over the n Web servers (m data servers + k check servers).

The strategy for selecting the sets of married blocks is application and/or data dependent. Since married blocks are distributed over distinct servers, the data layout over the servers depends on the selection of married blocks. The data layout scheme should try to maximize parallelism for data delivery. Blocks accessed close together in time should be placed on distinct servers. For example, it is advantageous to put consecutive movie frames in a single set of married blocks so that these frames are distributed over distinct servers.
3.2. Tasks of the Servers and Clients

Striping file objects, encoding data blocks to produce married blocks, and distributing the data blocks are the major tasks performed by the server sites. Information (meta-data) about the distributed data blocks, such as the logical block identification and its physical address (i.e., URL), must be kept in one of the servers. The meta-data is thus created when data blocks are allocated over the servers and is kept in a server called the master server. The meta-data can be replicated over several servers for higher availability. The data blocks and meta-data are stored in the designated Web document directories of the servers. The major tasks performed at the server sites are enumerated below; a short sketch of the striping and meta-data steps follows the list.

• Striping a large file into p smaller blocks and dividing the blocks into groups of m blocks each.
• Generating the encoding and decoding parameters (matrices) for reliability coding.
• Encoding a set of m blocks into m + k blocks, where k is the number of redundant blocks.
• Distributing the m + k encoded data blocks to n servers.
• Creating the meta-data and storing it in the master server (and in any replicated servers).
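The following Python sketch shows the striping step and one possible shape of the meta-data kept on the master server. The manifest format, URL scheme, and the encode_group() placeholder are illustrative assumptions, not the paper's actual implementation (the encoding itself is the subject of Section 4).

import json

def stripe(data, p):
    # Split a file object into p roughly equal data blocks (ceiling division for the size).
    size = -(-len(data) // p)
    return [data[i * size:(i + 1) * size] for i in range(p)]

def build_manifest(obj_name, blocks, data_servers, check_servers):
    # Record, for every data block, a logical block id and its physical address (URL).
    # Married blocks of one group are assigned to distinct servers.
    m = len(data_servers)
    manifest = {"object": obj_name, "m": m, "k": len(check_servers), "blocks": []}
    for g in range(0, len(blocks), m):
        group = blocks[g:g + m]                 # one group of m data blocks
        # check_blocks = encode_group(group)    # k married redundant blocks (see Section 4)
        for i, _ in enumerate(group):
            manifest["blocks"].append(
                {"id": g + i, "url": f"{data_servers[i]}/{obj_name}.{g + i}"})
    return json.dumps(manifest, indent=2)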
The client runs on any host machine connected to the Internet. The client is responsible for checking the status of the servers; it does so by periodically requesting one-byte (dummy) data from each server. The client maintains server mask bits showing the status of the servers. The mask bit is set to "1" if the corresponding server responds to the client within a given time interval. To download a file object, the client first gets the meta-data for the object and then makes multiple Hypertext Transfer Protocol (HTTP) connections to the data servers. For each block, the client creates a thread to establish a connection to the server where the target block is stored. Based on the server mask, the client selects the server sites from which to download the data blocks needed to assemble the file object. The downloaded data blocks are then merged into a file. If necessary, redundant blocks are retrieved and decoded to recover unavailable original data blocks. Therefore, the major tasks performed by the client can be summarized as follows.
• Checking the status of servers and maintaining the server mask (a sketch of this probe follows the list).
• Obtaining meta-data from the master server and creating connections to the operational servers for downloading data blocks.
• Keeping track of server availability during the download operation.
• Requesting alternative redundant blocks if some data servers are unavailable.
• Merging a set of blocks to construct the original file.
• Decoding any m out of the m + k blocks to reconstruct the original m blocks whenever redundant blocks are delivered.
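A minimal sketch of the status probe that maintains the server mask is shown below. The probe URL, the use of an HTTP Range request to fetch a single byte, and the timeout value are assumptions for illustration only.

from urllib.request import Request, urlopen

def probe(server, timeout=2.0):
    # Request one (dummy) byte from the server; any failure or timeout marks it faulty.
    try:
        req = Request(f"{server}/ping", headers={"Range": "bytes=0-0"})
        with urlopen(req, timeout=timeout):
            return 1
    except OSError:
        return 0

def server_mask(servers):
    # Bit i is "1" if server i responded within the time interval, "0" otherwise.
    return [probe(s) for s in servers]

# mask = server_mask(["http://server0.example.edu", "http://server1.example.edu"])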
4. Coding Schemes

In our method, we encode m symbols of the original data object into n = m + k symbols. The m original symbols are called an information word and the coded n symbols are called a code word. If the information word appears in the code word, the code is said to be systematic. In the absence of faults, the client needs to be able to get the information word stored on the various data servers with no decoding. Therefore, we use systematic codes in our environment. All of the arithmetic operations used in this section are operations in a finite field; a Galois field GF(q) is a finite field of q elements [4][11]. In this section, we describe two different coding techniques. The first technique arranges the blocks in a one-dimensional array and encodes blocks along this single dimension. The second technique logically arranges the blocks in a two-dimensional array and adds parity blocks across the rows and across the columns of this array.
4.1. One Dimensional Coding

Let U = (u_0, u_1, ..., u_{m-1}) be an information word. A code word V = (v_0, v_1, ..., v_{n-1}) is produced by V = U × G, where G is an m × n matrix. We require that the rows of G be linearly independent; otherwise, more than one information word would be mapped onto the same code word. Moreover, we want the code to be systematic. Therefore, G should be of the form

  G = [ I_{m×m} | P_{m×k} ]
    = | 1 0 ... 0   g_{0,m}     ...  g_{0,n-1}   |
      | 0 1 ... 0   g_{1,m}     ...  g_{1,n-1}   |
      | :  :     :  :                :           |
      | 0 0 ... 1   g_{m-1,m}   ...  g_{m-1,n-1} |

Therefore,

  V = U × G = U × [ I_{m×m} | P_{m×k} ] = [ U × I | U × P ] = (u_0, u_1, ..., u_{m-1}, v_m, v_{m+1}, ..., v_{n-1}).

Assume that k = n – m servers are faulty. This means that k elements of V are missing. Our goal is to find a way for the client to reconstruct the original data (i.e., the vector U). Let V' be the set of m married blocks that the client could download, i.e., V' is V without the k missing elements. Let G' be the m × m matrix generated by ignoring the k columns of G corresponding to the missing blocks. Clearly, V' = UG'. If G' is invertible, we can obtain U as follows: V'G'⁻¹ = UG'G'⁻¹ = U. Since U can be reconstructed only if G' is invertible, obtaining an invertible G' matrix is the key to the decoding process. G' is represented as

  G' = [ column subset of I_{m×m} | column subset of P_{m×k} ].

The columns of G' are partitioned into two parts. The first part is a column subset of I_{m×m}; the number of columns in this part is equal to the number of operational data servers. The second part is a column subset of P_{m×k}; this part contains at most as many columns as the number of faulty servers. In practice, the number of faulty servers is much smaller than the number of operational servers. Therefore, G' and its inverse G'⁻¹ are sparse matrices. This sparsity can be exploited to obtain more efficient algorithms [7].

Therefore, the matrix G must satisfy the following three conditions:
1) G = [ I | P ].
2) The rows of G must be linearly independent.
3) Every m columns of G must be linearly independent.
Condition (1) ensures that G produces a systematic
code, condition (2) ensures that no two information words are mapped onto the same code word, and condition (3) ensures that G', which is a column subset of G, is always invertible.

The following algorithm generates the desired G matrix. The algorithm assumes that we are using GF(q) to develop our code and that n ≤ q – 2. This puts a theoretical limit on n, the maximum number of servers that we can use. In practice, however, this is not a limitation because n is usually much smaller than q. For example, for byte data symbols (i.e., GF(2^8)) the limit is n = 254. Let y_1, y_2, y_3, ..., y_n be any n distinct nonzero elements of GF(q). Construct a matrix of the following form:

  G_1 = | y_1    y_2    y_3    ...  y_n    |
        | y_1^2  y_2^2  y_3^2  ...  y_n^2  |
        | y_1^3  y_2^3  y_3^3  ...  y_n^3  |
        | :      :      :           :      |
        | y_1^m  y_2^m  y_3^m  ...  y_n^m  |

where y_i ≠ y_j for 1 ≤ i < j ≤ n and y_i ≠ 0 for all i. Transform G_1 into a systematic matrix G using elementary row operations. An elementary row operation on a matrix is one of the following two operations: 1) multiply one of the rows by a nonzero scalar, or 2) add a scalar multiple of one row to another. In [7], we have shown that the matrix G generated by the above procedure satisfies conditions (1), (2), and (3).
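To make the construction concrete, here is a self-contained Python sketch, under the assumption that the field is GF(2^8) with the common irreducible polynomial 0x11d and that the distinct nonzero elements y_j are simply 1, 2, ..., n. It builds G_1, row-reduces it to the systematic form G = [I | P], encodes an information word of single-byte symbols, and recovers the word from any m surviving positions (a real block would be a byte string, encoded byte-by-byte with the same matrix). None of the names below come from the paper.

from functools import reduce

IRRED = 0x11d  # x^8 + x^4 + x^3 + x^2 + 1, a common irreducible polynomial for GF(2^8)

def gf_mul(a, b):
    # Carry-less multiplication reduced modulo the irreducible polynomial.
    p = 0
    while b:
        if b & 1:
            p ^= a
        a <<= 1
        if a & 0x100:
            a ^= IRRED
        b >>= 1
    return p

def gf_pow(a, e):
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r

def gf_inv(a):
    return gf_pow(a, 254)  # a^(q-2) is the inverse of a nonzero element in GF(256)

def systematic_generator(m, n):
    # Build G_1 with rows (y_1^i, ..., y_n^i), i = 1..m, then reduce the left block to I.
    G = [[gf_pow(j + 1, i + 1) for j in range(n)] for i in range(m)]
    for c in range(m):
        piv = next(r for r in range(c, m) if G[r][c])   # a nonzero pivot always exists
        G[c], G[piv] = G[piv], G[c]
        inv = gf_inv(G[c][c])
        G[c] = [gf_mul(inv, x) for x in G[c]]
        for r in range(m):
            if r != c and G[r][c]:
                f = G[r][c]
                G[r] = [x ^ gf_mul(f, y) for x, y in zip(G[r], G[c])]
    return G  # G = [I | P]

def encode(G, u):
    # Code word v = u x G; the first m symbols equal the data (systematic code).
    m, n = len(G), len(G[0])
    return [reduce(lambda a, b: a ^ b, (gf_mul(u[i], G[i][j]) for i in range(m)), 0)
            for j in range(n)]

def decode(G, survivors):
    # Recover u from any m surviving (position, symbol) pairs by solving u x G' = v'.
    m = len(G)
    A = [[G[i][c] for i in range(m)] + [v] for c, v in survivors[:m]]  # G'^T augmented
    for c in range(m):
        piv = next(r for r in range(c, m) if A[r][c])
        A[c], A[piv] = A[piv], A[c]
        inv = gf_inv(A[c][c])
        A[c] = [gf_mul(inv, x) for x in A[c]]
        for r in range(m):
            if r != c and A[r][c]:
                f = A[r][c]
                A[r] = [x ^ gf_mul(f, y) for x, y in zip(A[r], A[c])]
    return [A[i][m] for i in range(m)]

# Example: m = 8 data symbols, k = 4 check symbols; drop any 4 symbols and recover.
G = systematic_generator(8, 12)
u = [3, 14, 15, 92, 65, 35, 89, 79]
v = encode(G, u)
survivors = [(j, v[j]) for j in (1, 2, 5, 6, 7, 8, 10, 11)]
assert decode(G, survivors) == u

Because the code is systematic, the fault-free case needs no decoding at all; the Gaussian-elimination decode above is exercised only when up to k of the married blocks are missing.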
4.2. Two Dimensional Parity Coding

In 2D coding, servers (or blocks) are logically organized in a 2D structure. Encoding is performed across each logical row of blocks to produce row check blocks. Similarly, we encode the blocks in each column to produce column check blocks. Figure 2 illustrates this idea with single row and column check (simple parity) blocks. A missing block can therefore be computed from either its corresponding row blocks or its corresponding column blocks. This scheme is especially useful when the simple parity operation (XOR) is used to produce the parity check blocks: a missing block can then be reconstructed simply by adding (XORing) the available row or column married blocks. This makes missing block computation much simpler and faster than in the 1D scheme. Moreover, the 2D scheme reduces the communication cost because the number of married blocks needed to compute missing blocks is reduced.

In addition to the row and column parity check blocks, we employ one additional check server. This additional server holds the parity information of all data servers (or, equivalently, the parity of all row or column parity servers). Figure 2 illustrates this scheme; the full parity site is the dark box at the bottom right corner of the 2D array. It is obvious that with this scheme, any 3-server fault is recoverable. The worst case happens when 3 faulty blocks reside at three corners of a rectangle; in this case, the block residing at the fourth corner can be used to recover from the 3-server fault. In addition, it can be shown that most 4- and 5-server faults are recoverable: a 4- or 5-server fault is nonrecoverable only if the fault involves 4 blocks at the 4 corners of a rectangle. This method requires 2m + 1 parity blocks for an m × m array of data servers. Therefore, it has a redundancy rate of (2m + 1)/m².

Figure 2. 2D parity approach illustrated by a 4 × 4 array of data servers, together with the row parity servers, column parity servers, and the full parity server.
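As a concrete illustration of the XOR-based reconstruction, the following minimal Python sketch encodes an m × m grid of equal-length byte-string blocks and recovers a single missing block from its row parity; the function names and data layout are illustrative only.

def xor_blocks(blocks):
    # XOR a list of equal-length byte strings together.
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

def encode_2d(grid):
    # grid is an m x m list of data blocks; returns row, column, and full parity blocks.
    row_parity = [xor_blocks(row) for row in grid]
    col_parity = [xor_blocks([row[j] for row in grid]) for j in range(len(grid[0]))]
    full_parity = xor_blocks(row_parity)       # equivalently, the XOR of all data blocks
    return row_parity, col_parity, full_parity

def recover_from_row(grid, row_parity, r, c):
    # Recompute grid[r][c] when it is the only missing block in row r.
    others = [grid[r][j] for j in range(len(grid[r])) if j != c]
    return xor_blocks(others + [row_parity[r]])

Recovery from a column or via the full parity block is symmetric: the client simply picks whichever married set has no other missing member.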
5. Fault Tolerance Analysis

In this section, we compare coding versus mirroring in terms of their power to tolerate faults and the required amount of redundancy. Fault tolerance is measured by the probability of a nonrecoverable fault. The amount of redundancy is measured in terms of the redundancy rate. The redundancy rate is defined as the ratio between the size of redundant information and the size of the data object.
(a) 1D coding, ρ(8, k):

         Redundancy Rate   Pf = 0.1       Pf = 0.01       Pf = 0.001
  k = 0  0                 0.57           0.077           0.008
  k = 1  0.125             2.25 × 10^-1   3.44 × 10^-3    3.58 × 10^-5
  k = 2  0.25              7.02 × 10^-2   1.14 × 10^-4    1.19 × 10^-7
  k = 4  0.5               4.33 × 10^-3   7.47 × 10^-8    7.87 × 10^-13
  k = 8  1.0               6.38 × 10^-6   1.10 × 10^-14   → 0

(b) 2D parity coding:

  ρ_2D   0.778             2.6 × 10^-3    3.6 × 10^-7     3.6 × 10^-11

(c) Mirroring:

  ρ_m    1.0               7.73 × 10^-2   8.00 × 10^-4    8.00 × 10^-6

Figure 3. (a) Probabilities of nonrecoverable fault (ρ(8, k)) for 1D coding, (b) probabilities of nonrecoverable fault (ρ_2D) for 2D parity coding, and (c) probabilities of nonrecoverable fault (ρ_m) for mirroring.
Define ρ(m, k) as the probability of a nonrecoverable fault for m information servers and k check servers, and ρ_m as the probability of a nonrecoverable fault when data duplication (mirroring) is used. Let Pf be the probability that a particular server is faulty and let Po = 1 – Pf be the probability that a server is operational. Therefore, for a system without redundancy, the probability of a fault is ρ(m, 0) = 1 – Po^m. In addition, we have

  ρ(m, k) = probability that more than k servers are faulty
          = Σ_{i=k+1}^{n} prob(exactly i servers are faulty)
          = Σ_{i=k+1}^{n} C(n, i) Pf^i Po^{n-i},

where C(n, i) is the binomial coefficient.

Figure 3 shows the probabilities of a nonrecoverable fault for a range of Pf values (Pf = 0.1, Pf = 0.01, and Pf = 0.001). In Figure 3, m is taken to be 8. The probabilities are compared for k = 1, 2, 4, and 8. The table in Figure 3 also provides the values of ρ(8, 0) for the case of no redundancy (no coding), and ρ_m for the case of mirroring. Figure 3 illustrates that coding very significantly improves the system reliability. By utilizing a single check block (see the rows for k = 0 and k = 1 in Figure 3), the probability of a fault is reduced by about 60% for Pf = 0.1, 95% for Pf = 0.01, and 99% for Pf = 0.001.

Consider the 2D parity approach. Let the data servers form an m × m array (m² data servers in total) and let n = (m + 1)² be the total number of servers. Let ρ_2D be the probability of a nonrecoverable fault in the 2D parity strategy. We can write

  ρ_2D = Σ_{i=0}^{n} F(i) Pf^i Po^{n-i},

where F(i) is the number of possible i-server faults that are nonrecoverable. Notice that F(0) = F(1) = F(2) = F(3) = 0. F(4) is the number of rectangles in the (m + 1) × (m + 1) array of sites. For each rectangle in the array, there are n – 4 possible 5-server faults; therefore, F(5) = (n – 4) × F(4). For large values of i (i.e., for a large number of block faults), F(i) can be approximated by C(n, i) because almost all of the i-error layouts will produce a nonrecoverable fault. We computed ρ_2D for 9 data servers (n = 16). The results are shown in Figure 3(b).

As expected, Figure 3 demonstrates that for the same redundancy rate, 1D coding has an error probability orders of magnitude smaller than mirroring. By adding only two check blocks, 1D coding achieves a better error probability than duplicating each of the eight blocks. 1D coding is also superior to 2D parity coding. This is achieved, however, at the expense of slower decoding time.
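For concreteness, the 1D-coding and mirroring entries of Figure 3 can be reproduced with a short Python calculation. The mirroring model below (each of the 8 blocks duplicated on an independent server) is our reading of ρ_m; the 2D values additionally depend on the exact F(i) counts and are not reproduced here.

from math import comb

def rho_1d(m, k, pf):
    # Probability that more than k of the n = m + k servers are faulty.
    n = m + k
    return sum(comb(n, i) * pf**i * (1 - pf)**(n - i) for i in range(k + 1, n + 1))

def rho_mirror(m, pf):
    # Probability that, for at least one of the m blocks, both of its copies are lost.
    return 1 - (1 - pf**2)**m

for pf in (0.1, 0.01, 0.001):
    row = [f"k={k}: {rho_1d(8, k, pf):.2e}" for k in (0, 1, 2, 4, 8)]
    print(f"Pf={pf}:  " + "  ".join(row) + f"  mirroring: {rho_mirror(8, pf):.2e}")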
6. Conclusions

In this paper, we have presented a scheme for efficient storage and delivery of multimedia data in the WWW environment. The approach employs a set of distributed data servers (Web servers) to collectively produce high data rates. The paper describes two coding methods to achieve dependable service. The approach described in this paper is much more cost-effective than utilizing mirror sites because the coded redundancy is much smaller in size than the original data. Unlike mirroring, in our approach load balancing is handled and guaranteed by the system. Our striping approach enables the client to fully utilize its bandwidth: the client bandwidth can be saturated by the aggregate data transfer rate obtained by downloading fragments from multiple data servers in parallel. Therefore, the server speed and bandwidth are never the bottleneck.
Acknowledgment

The work described in this paper has been in part supported by DOE (DE-FG02-97ER25339).

References

[1] O. Babaoglu, A. Bartoli, and G. Dini, "Replicated File Management in Large-Scale Distributed Systems", Tech. Rep. UBLCS-94-16, University of Bologna, Bologna, Italy, 1994.
[2] A. Bestavros, "Speculative Data Dissemination and Service to Reduce Server Load, Network Traffic and Service Time for Distributed Information Systems", in Proceedings of ICDE'96: The 1996 International Conference on Data Engineering, New Orleans, LA, March 1996.
[3] A. Bestavros, R. Carter, M. Crovella, C. Cunha, A. Heddaya, and S. Mirdad, "Application-Level Document Caching in the Internet", in Proceedings of the Second International Workshop on Services in Distributed and Networked Environments, Whistler, British Columbia, June 1995.
[4] G.C. Clark, Jr., and J. Bibb Cain, Error-Correction Coding for Digital Communications, Plenum Press, New York, 1981.
[5] P. Danzig, D. Delucia, and K. Obraczka, "Massively Replicating Services in Wide Area Internetworks", Tech. Rep. TR 93-541, University of California at Santa Cruz, 1993.
[6] A. Heddaya and S. Mirdad, "Globally Load Balanced Fully Distributed Caching of Hot Published Documents", Tech. Rep. BU-CS-96024, 1996 (also in Proceedings of the 17th International Conference on Distributed Computing Systems, Baltimore, MD, 1997).
[7] Q. Malluhi and W.E. Johnston, "Coding for High Availability of a Distributed-Parallel Storage System", to appear in IEEE Transactions on Parallel and Distributed Systems.
[8] W.W. Peterson and E.J. Weldon, Error-Correcting Codes, 2nd ed., MIT Press, Cambridge, MA, 1972.
[9] D. Powell, Ed., Delta-4: A Generic Architecture for Dependable Distributed Computing, Springer-Verlag, 1991.
[10] M.O. Rabin, "Efficient Dispersal of Information for Security, Load Balancing, and Fault Tolerance", Journal of the ACM, Vol. 36, No. 2, pp. 335-348, April 1989.
[11] T.R.N. Rao and E. Fujiwara, Error-Control Coding for Computer Systems, Prentice-Hall, Inc., New Jersey, 1989.
[12] M. Satyanarayanan, J.J. Kistler, P. Kumar, M.E. Okasaki, E.H. Siegel, and D.C. Steere, "Coda: A Highly Available File System for a Distributed Workstation Environment", IEEE Transactions on Computers, Vol. 39, No. 4, pp. 447-459, April 1990.
0-7695-0001-3/99 $10.00 (c) 1999 IEEE