A Scalable Parallel Video Server on Shared Ethernet

Chow-Sing Lin, Min-You Wu and Wei Shu
Department of Electrical and Computer Engineering
University of Central Florida
Orlando, FL 32816
E-mail: {lch, wu, [email protected]
Abstract
To increase the throughput of a parallel video server interconnected by a shared network environment, such as Ethernet, the elimination of network collisions is highly desirable. A special mechanism, serialized conflict-free scheduling, is presented in this paper. A performance study of (1) server-push video servers with the conflict-free scheduling algorithm and the serialized conflict-free scheduling mechanism, and (2) a client-pull video server with no scheduling algorithm, is conducted on top of a Non-Uniform-Network Access network environment. The experimental results show that the server with the serialized conflict-free scheduling mechanism can support up to 50% more clients than the other two on shared Ethernet.
1 Introduction

In recent years, real-time digital multimedia systems have moved from concept to reality due to major advances in communication, storage, computing, and data compression technologies. In particular, video-on-demand systems capable of delivering a large selection of video clips to a subscriber's television at the press of a button have attracted a great deal of attention. Examples are movies-on-demand, video magazines, video kiosks, distance learning, computer-aided training, video libraries, etc. Due to the very nature of video data, its large size and real-time constraints, the performance of a video server may be poor without careful design; i.e., either the resource utilization rate will be low or a number of clients will experience a service interrupt or "glitch" during playout. To design a high-performance video-on-demand system, requirements such as scalability, Quality-of-Service (QoS), fault tolerance, high performance, and versatile functionality must be met. Among the several types of video-on-demand systems, parallel video servers have been proven an efficient way to meet
the above requirements and to provide large-scale video-on-demand systems [7, 2, 5, 1, 6]. Tiger uses a distributed scheduling algorithm and the wide striping strategy to balance the load; constant-data-length blocks are used to store both constant-bit-rate and variable-bit-rate encoded video files, which may lead to a low resource utilization rate when the variation of the supported VBR video files is large [2]. MARS solves the mis-ordering of data arrivals by designing a closely coupled system in which each storage node is connected via a custom ATM-based high-speed interconnection network (APIC); the proximity of the storage nodes enables them to be accurately synchronized through the common APIC [1]. A greedy algorithm to generate a conflict-free schedule for a multiprocessor video server connected by a switch has been proposed by Reddy [6], but it cannot guarantee the optimal scheduling capacity without ad hoc rearrangement of requests. Y. B. Lee et al. proposed a client-pull model of video servers [5]; due to the lack of conflict-free scheduling, system resources are under-utilized.

To ensure conflict-free scheduling on a shared network environment, such as Ethernet, a special loose server-level synchronization, serialized conflict-free scheduling, is presented in this paper. Also, a performance comparison of two server-push parallel video servers, with conflict-free scheduling and with serialized conflict-free scheduling, and a client-pull parallel video server without explicit scheduling is explored.

The remainder of this paper is organized as follows. Section 2 discusses the system architecture of a parallel video server. Conflict-free scheduling for video streams is presented in Section 3. Section 4 presents the mechanisms of server-level synchronization on switched and shared networks for parallel video servers, while experimental results are shown in Section 5. Finally, Section 6 presents concluding remarks.
2 Architecture Description

There are two major types of parallel video servers: shared-memory multiprocessors and distributed-memory clustered architectures. In a multiprocessor system, a set of storage nodes and computing nodes are connected to a shared memory. The video data is sent to the memory buffer through a high-speed network or bus, and then to the clients. A mass storage system of this kind has demonstrated the capacity to support hundreds of requests [3]. However, it is not yet clear that a multiprocessor video server can be made scalable. A clustered architecture, on the other hand, is easy to scale to thousands of server nodes. In such a system, a set of storage nodes, a set of delivery nodes, and a control node are connected by an interconnection network such as switches or repeater hubs. A storage node is responsible for storing video clips, retrieving requested data blocks, and sending them to delivery nodes within a time limit. Each storage node handles its own disk scheduling separately to provide enough bandwidth. A delivery node, on the other hand, is responsible for taking requests from clients and forwarding them to the control node for scheduling. Video blocks sent by the storage nodes are buffered in the delivery nodes, where they are re-sequenced if necessary, and then sent to the clients. Admission control, network scheduling for retrieving video data from storage nodes to delivery nodes, network synchronization of data retrieval between nodes, and content management for maintaining video clips reside in the control node. Upon receiving requests from delivery nodes, the control node schedules requests into the next time cycle if they are admitted. Unless it receives further notice from the clients, the scheduler keeps updating the current schedule table and sending it to the storage nodes for retrieving the next video blocks until the streams reach their ends. This is a server-push model that keeps pumping data to clients. Note that the flat architecture, in which a logical storage node and a logical delivery node are mapped onto one physical node called a processing node, is applied in this paper.

By differentiating the accessibility between a client and a processing node from the client's point of view, parallel video-on-demand systems can be categorized into two types, Non-Uniform-Network Access (NUNA) and Uniform-Network Access (UNA). Non-Uniform-Network Access means that the distance between a client and any given delivery node differs, while Uniform-Network Access means it is the same. The difference between NUNA and UNA explicitly expresses the constraint on request relocation discussed in [7]. Since in UNA the distance between a client and a processing node is the same, there is no penalty for relocating
a request to another delivery node, while in NUNA extra bandwidth will be required for forwarding video data between delivery nodes. Therefore, poor system throughput will be experienced if request relocation is not kept to a minimum in the NUNA architecture. Figure 1 shows the diagram of a realized NUNA clustered parallel video server with the flat configuration on a local network. The NUNA architecture is applied in the experiments.

[Figure 1: Diagram of a NUNA clustered parallel video server with the flat configuration on a local network. A control node and processing nodes 0 through N-1 are connected by the interconnection network; each processing node's repeater hub serves a group of clients (set-top boxes with monitors).]
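To make the server-push control flow concrete, the following sketch outlines the control node's per-cycle behavior. It is our illustration under the architecture described above, not the paper's implementation; all names (ControlNode, Stream, admit, next_cycle_table) are hypothetical.

    from dataclasses import dataclass, field

    @dataclass
    class Stream:
        delivery_node: int           # node whose client receives this stream
        next_block: int = 0          # index of the next block to retrieve

    @dataclass
    class ControlNode:
        n_nodes: int                 # number of storage (= delivery) nodes
        m: int                       # time slots per cycle
        streams: list = field(default_factory=list)

        def admit(self, stream):
            # Admission control: a delivery node can serve at most m streams.
            load = sum(1 for s in self.streams
                       if s.delivery_node == stream.delivery_node)
            if load < self.m:
                self.streams.append(stream)
                return True
            return False

        def next_cycle_table(self):
            # One (delivery node, storage node) request per active stream;
            # with round-robin striping, block b resides on node b mod N.
            table = [(s.delivery_node, s.next_block % self.n_nodes)
                     for s in self.streams]
            for s in self.streams:
                s.next_block += 1    # advance unless the client has stopped
            return table

    ctrl = ControlNode(n_nodes=3, m=12)
    ctrl.admit(Stream(delivery_node=1))
    print(ctrl.next_cycle_table())   # [(1, 0)]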
3 Conflict-Free Scheduling for Video Streams

Suppose that a parallel video server consists of N delivery nodes and N storage nodes, interconnected by a high-speed network. The blocks of a video file with block size BlkSize are evenly distributed over the N storage nodes in a round-robin fashion. Assume that the base stream playout rate is BaseRate; then the length of a time cycle is tf = BlkSize / BaseRate. In general, the data transfer rate of a single disk or a disk array can be much higher than BaseRate. Therefore, in a time cycle, multiple video streams can be serviced by a storage node while the real-time constraint of each individual stream is still preserved. The time cycle is thus divided into time slots, where the length of a slot, ts, is equal to or greater than the time required for retrieving a block from the storage node or transmitting it to the delivery node, whichever is larger. The number of slots in a cycle, m, can therefore be determined as m = ⌊tf / ts⌋. In a time slot, if more than one request needs to retrieve blocks from the same storage node, they compete for the resource. To avoid such a conflict, only requests that access different storage nodes can be scheduled into the same time slot. Thus, in every time slot, at most N requests can be scheduled, each of which retrieves a block from a different storage node. Once the first cycle has a conflict-free schedule, the following cycles will be conflict-free as well.
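As a numerical illustration (a sketch using the definitions above; the disk and network rates anticipate the measured values reported in Section 5), the cycle and slot parameters can be computed as follows:

    BlkSize = 1.4e6      # block size in bits (one second of MPEG-1 video)
    BaseRate = 1.4e6     # base playout rate in bits per second
    Diskbw = 38.9e6      # sustained disk bandwidth in bits per second
    Netbw = 51.2e6       # network bandwidth in bits per second

    tf = BlkSize / BaseRate            # time cycle: 1.0 s
    ts = max(BlkSize / Diskbw,         # time to retrieve one block
             BlkSize / Netbw)          # time to transmit one block
    m = int(tf // ts)                  # slots per cycle: 27 here
    print(tf, ts, m)

    # Round-robin striping: block b of a video file resides on node b mod N.
    N = 3
    print([b % N for b in range(8)])   # [0, 1, 2, 0, 1, 2, 0, 1]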
[Figure 2: Bipartite graph for conflict-free scheduling. Panels (a)-(c) show the matchings found for slots 0, 1, and 2; panel (d) shows the resulting optimal schedule.]

Conflict-free scheduling (CFS). To obtain a conflict-free schedule for a given set of requests, a greedy algorithm has been proposed in [6]; it cannot, however, guarantee the maximum system performance. An optimal algorithm for the conflict-free scheduling problem has been proposed in previous work [7]. The algorithm converts the scheduling problem into a matching problem on bipartite graphs. An example is shown in Figure 2. A bipartite graph G with bipartition (X, Y) is constructed from the request set, where X = {x_0, x_1, ..., x_{N-1}}, Y = {y_0, y_1, ..., y_{N-1}}, and x_i is joined to y_j if and only if delivery node i handles a request that retrieves a block from storage node j (Figure 2(a)). When each delivery node handles m requests and m requests retrieve a block from each storage node j, the constructed graph is an m-regular bipartite graph. For the first time slot, deciding whether there exists a set of requests, each of which accesses a different storage node from a different delivery node, is equivalent to finding a perfect matching in G. After a perfect matching is determined for slot 0, the matched edges are eliminated, and the matching algorithm is applied iteratively to the reduced graph (Figure 2(b) and (c)). The resultant schedule is shown in Figure 2(d).
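The following sketch is our illustration of this construction, not the authors' code. It implements Kuhn's augmenting-path matching once per time slot and assumes each request is given as a (delivery node, storage node) pair and that the instance is m-regular, so a perfect matching exists in every round:

    def conflict_free_schedule(requests, n_nodes, m):
        # requests: list of (delivery_node, storage_node) pairs, one per
        # stream; n_nodes: number of delivery (= storage) nodes; m: slots
        # per cycle. Returns, per slot, a dict storage_node -> delivery_node.
        edges = [[0] * n_nodes for _ in range(n_nodes)]
        for d, s in requests:
            edges[d][s] += 1          # multiplicity of edge (x_d, y_s)

        def try_augment(d, match, seen):
            # Kuhn's augmenting-path step: try to match delivery node d.
            for s in range(n_nodes):
                if edges[d][s] > 0 and s not in seen:
                    seen.add(s)
                    if match[s] is None or try_augment(match[s], match, seen):
                        match[s] = d
                        return True
            return False

        schedule = []
        for _ in range(m):
            match = [None] * n_nodes  # storage node -> delivery node
            for d in range(n_nodes):
                try_augment(d, match, set())
            slot = {}
            for s, d in enumerate(match):
                if d is not None:
                    edges[d][s] -= 1  # eliminate the matched edge
                    slot[s] = d
            schedule.append(slot)
        return schedule

    # A 3-regular instance on N = 3 nodes (m = 3 requests per node):
    reqs = [(0, 0), (0, 1), (0, 2), (1, 0), (1, 0), (1, 2),
            (2, 1), (2, 1), (2, 2)]
    for t, slot in enumerate(conflict_free_schedule(reqs, 3, 3)):
        print("slot", t, slot)

Each round removes one matched edge per storage node, so the reduced graph stays regular and, after m rounds, every request has been assigned a slot.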
Request delay. Consider the situation that more than m requests access the same storage node j. In this case, only m requests can be fulfilled by storage node j and the other requests must be delayed. In a large system, the delay can be significant. In [7], a peg-and-hole algorithm is applied to minimize the delay.

Request relocation. Now consider the situation that some delivery nodes receive more than m requests. In this case, some requests need to be relocated. An optimal algorithm for the linear chain has been proposed in [7].

4 Server-Level Synchronization

4.1 Synchronization for Switch-Based Network
[Figure 3: A conflict-free schedule and its sub-schedules.]

The conflict-free scheduling algorithm discussed above assumes that the nodes of the video server are connected by a switch-based interconnection network. Each processing node has a dedicated link to a switch capable of operating in full-duplex mode, i.e., each node can send and receive data simultaneously. With a carefully designed scheduling algorithm, each node can operate disk access and network transmission concurrently, not only within a node but also across nodes, without causing resource conflicts. Figure 3(a) shows an instance of a schedule table. The schedule table is broken into N sub-schedules, shown in Figure 3(b), which are sent to the storage nodes at the beginning of each time cycle. Each request in a sub-schedule accesses data from the same storage node. Once a storage node has received a sub-schedule from the scheduler, it starts to read data blocks from its disks and send them to the corresponding delivery nodes.
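Splitting the schedule table into sub-schedules is mechanical; the sketch below is our illustration (the request labels are made up, not taken from Figure 3) and groups the scheduled requests by storage node:

    def split_sub_schedules(table, n_nodes):
        # table[t] lists the requests of slot t; each request is a
        # (delivery_node, storage_node, request_id) triple.
        sub = [[] for _ in range(n_nodes)]
        for t, slot in enumerate(table):
            for delivery, storage, req in slot:
                sub[storage].append((t, delivery, req))
        return sub

    # A made-up conflict-free schedule for N = 3 nodes and m = 3 slots:
    table = [
        [(0, 1, "r0"), (1, 0, "r1"), (2, 2, "r7")],   # slot 0
        [(0, 2, "r4"), (1, 0, "r2"), (2, 1, "r6")],   # slot 1
        [(0, 0, "r8"), (1, 2, "r3"), (2, 1, "r5")],   # slot 2
    ]
    for j, s in enumerate(split_sub_schedules(table, 3)):
        print("storage node", j, "->", s)

Each resulting sub-schedule lists, slot by slot, the single request its storage node must serve, which is exactly what each node receives at the beginning of the cycle.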
[Figure 4: Operations of storage nodes in a switch-based interconnection network: in each cycle, every node receives its sub-schedule (sch), reads and transmits its blocks (Rd, Tx) concurrently, and sends a control packet (ctrl).]
Assume that the size for storing a request is ReqSize; then the size of a schedule table is N × m × ReqSize. According to conflict-free scheduling, in each time slot only one request can be served by a storage node, so the size of a sub-schedule is m × ReqSize. Suppose the interconnection network bandwidth is Netbw; then the time to receive a sub-schedule from the scheduler is

    T_sch = (m × ReqSize) / Netbw        (1)

Usually T_sch is quite short, since ReqSize is small. As soon as a storage node receives a sub-schedule from the scheduler, the data blocks for those requests are retrieved from the disks. The access time can be calculated as

    T_Rd = (m × BlkSize) / Diskbw        (2)

On the other hand, the time for the interconnection network to transmit m data blocks is

    T_Tx = (m × BlkSize) / Netbw        (3)

To increase the system performance, a ping-pong buffer is used to pipeline data blocks between disk access and network transmission. Let T_Servicing represent the time for pipelining m data blocks. If the bottleneck is the disk bandwidth,

    T_Servicing = T_Rd + BlkSize / Netbw        (4)

or, if the bottleneck is the interconnection network bandwidth,

    T_Servicing = T_Tx + BlkSize / Diskbw        (5)

Some information needs to be sent to the scheduler after the data have been transmitted, such as new requests and synchronization information. To reduce system overhead, this information is packed into a control packet of size ctrlSize and sent to the network. The time T_ctrl to transmit a control packet is

    T_ctrl = ctrlSize / Netbw        (6)

Again, T_ctrl is short due to the quite small packet size. Figure 4 shows the operations performed in storage nodes connected by a switch-based interconnection network, along a time line. These operations can be performed concurrently because each storage node has its own dedicated bandwidth for disk access and network transmission. Each storage node relies on its own internal real-time clock to achieve slot-level synchronization. To ensure Quality-of-Service (QoS), each data block must be accessed and transmitted within one time cycle. Therefore,

    T_sch + T_Servicing + T_ctrl ≤ tf        (7)

The conflict-free scheduling is designed for video-on-demand systems whose nodes are connected by a switch-based interconnection network. It still causes network transmission conflicts, however, if the nodes are connected by a shared network environment such as Ethernet. The following section shows that with proper synchronization, conflict-free operations, especially network transmissions, are still achievable.
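As a feasibility check of Equation 7 (a sketch under the formulas above; the request and control packet sizes are assumed small, and the concrete values are only illustrative), the largest admissible number of slots per cycle can be computed as:

    def max_slots_per_cycle(blk_size, req_size, ctrl_size,
                            disk_bw, net_bw, base_rate):
        # Largest m with T_sch + T_Servicing + T_ctrl <= tf (Equation 7).
        # Sizes in bits; bandwidths and base_rate in bits per second.
        tf = blk_size / base_rate
        m = 0
        while True:
            cand = m + 1
            t_sch = cand * req_size / net_bw            # Equation 1
            if disk_bw < net_bw:                        # disk bottleneck
                t_serv = (cand * blk_size / disk_bw     # Equation 2
                          + blk_size / net_bw)          # Equation 4
            else:                                       # network bottleneck
                t_serv = (cand * blk_size / net_bw      # Equation 3
                          + blk_size / disk_bw)         # Equation 5
            t_ctrl = ctrl_size / net_bw                 # Equation 6
            if t_sch + t_serv + t_ctrl > tf:            # Equation 7 violated
                return m
            m = cand

    Mb = 1_000_000
    print(max_slots_per_cycle(1.4 * Mb, 64 * 8, 64 * 8,
                              38.9 * Mb, 51.2 * Mb, 1.4 * Mb))   # 27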
4.2 Synchronization for Shared Ethernet
Ethernet, commonly thought of as the realization of the IEEE 802.3 standard, is a 1-persistent CSMA/CD LAN. The CSMA (Carrier Sense Multiple Access) protocol means that when a station has data to send, it first listens to the channel to see if anyone else is transmitting. If the channel is busy, the station waits until it becomes idle. When the station detects an idle channel, it transmits a frame. If a collision occurs, the station waits a random amount of time (determined by exponential backoff) and starts all over again. The protocol is called 1-persistent because the station transmits with a probability of 1 whenever it finds the channel idle. CD (Collision Detection) means that stations abort their transmissions as soon as they detect a collision. This protocol, known as CSMA/CD, is widely used in the MAC sublayer of Ethernet [4]. When the number of stations connected by shared Ethernet that try to send data to the network at the same time grows, a large number of network collisions occurs. Under this circumstance, the delay for successfully transmitting packets to the destination nodes becomes unpredictable due to the exponential backoff algorithm of Ethernet. Because of the real-time constraints of video-on-demand systems, such an unpredictable delay results in poor stream throughput. Without a special mechanism, a conflict-free schedule that was originally suitable for a switched interconnection network will generate a huge number of frame collisions and retransmissions.
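For intuition about why this delay is unpredictable, here is a toy model of the truncated binary exponential backoff of IEEE 802.3 (our illustration, not part of the paper); the waiting time grows rapidly with repeated collisions:

    import random

    def backoff_delay(n_collisions, slot_time=51.2e-6):
        # After the n-th collision a station waits a random number of
        # slot times drawn from [0, 2**k - 1], with k = min(n, 10);
        # 51.2 us is the slot time of 10 Mbit/s Ethernet.
        k = min(n_collisions, 10)
        return random.randint(0, 2**k - 1) * slot_time

    random.seed(1)
    for n in range(1, 6):
        print(n, "collisions ->", round(backoff_delay(n) * 1e6, 1), "us")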
To avoid packet collisions, serializing the network transmissions of the storage nodes is highly desirable. Figure 5 shows the serialized operations of the storage nodes along a time line. Disk access operations can still be performed simultaneously, since each storage node has its own bus for accessing data from its disks. However, storage nodes connected by a shared network must serialize their network transmissions to avoid conflicts. After the first round of reading the schedule and accessing the disks, node 0 starts to transmit data to the network. As soon as it finishes its transmission, it receives the next sub-schedule from the scheduler and then broadcasts the control packet to all nodes. The control packet consists of information that needs to be exchanged between nodes, such as new requests and a SYN, which is received by the next storage node to start its network transmission. Without loss of generality, the time to broadcast a control packet to the network can be computed as in Equation 6. The reason to prefetch the next sub-schedule right after network transmission, instead of waiting for the end of the time cycle, is to overlap a node's disk access operations with the operations of the other storage nodes, since they are independent of each other. Furthermore, it can then be the case that when a storage node gets its turn to transmit data, the required data blocks of the cycle have already been retrieved into the buffer and are ready to be transmitted. If the disk access operations can be completely overlapped with the operations of the other storage nodes, as shown in Figure 5, only the network transmission time spent in each storage node accounts for the time cycle. In this case, m × BlkSize of buffer space is necessary to accommodate all the data blocks retrieved in one time cycle. Therefore, the cycle time is
    Σ_{i=0}^{N−1} (T_Tx,i + T_sch,i + T_ctrl,i) = N × (T_Tx + T_sch + T_ctrl)        (8)
Note that Equation 8 holds only when the disk bandwidth does not become the bottleneck. That means the inequality

    T_Rd ≤ (N − 1) × (T_Tx + T_sch + T_ctrl) + T_Tx − BlkSize / Netbw        (9)

must be true. Also, N × (T_Tx + T_sch + T_ctrl) ≤ tf must hold to ensure QoS.
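These conditions can be checked directly. The sketch below is our illustration using the same symbols as Equations 1-9; the request and control packet sizes are assumed, and the example call uses the bandwidths measured in Section 5:

    def serialized_cycle_ok(n_nodes, m, blk_size, req_size, ctrl_size,
                            disk_bw, net_bw, base_rate):
        # Equation 8 gives cycle time = N * (T_Tx + T_sch + T_ctrl); it is
        # valid only if disk reads hide behind the other nodes' transmissions
        # (Equation 9), and the cycle must fit within tf to preserve QoS.
        tf = blk_size / base_rate
        t_tx = m * blk_size / net_bw               # Equation 3
        t_sch = m * req_size / net_bw              # Equation 1
        t_ctrl = ctrl_size / net_bw                # Equation 6
        t_rd = m * blk_size / disk_bw              # Equation 2

        cycle = n_nodes * (t_tx + t_sch + t_ctrl)  # Equation 8
        disk_hidden = t_rd <= ((n_nodes - 1) * (t_tx + t_sch + t_ctrl)
                               + t_tx - blk_size / net_bw)  # Equation 9
        return disk_hidden and cycle <= tf

    # Three nodes and m = 12 streams per node (36 clients in total, as in
    # Section 5) fit within one 1-second cycle:
    Mb = 1_000_000
    print(serialized_cycle_ok(3, 12, 1.4 * Mb, 64 * 8, 64 * 8,
                              38.9 * Mb, 51.2 * Mb, 1.4 * Mb))   # True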
5 Experimental Results

A prototype of the parallel video server has been implemented to evaluate the efficiency of the various synchronization mechanisms during normal playout. The system configuration used for the prototype is shown in Figure 1.
[Figure 5: Operations of processing nodes in the shared network environment: each node reads its sub-schedule (sch) and its disk blocks (Rd), and the nodes take turns transmitting (Tx), passing a SYN in the control packet (ctrl) to the next node.]

It is a flat, Non-Uniform-Network Access cluster architecture consisting of three Pentium II 333 PCs. Each PC, acting as a processing node, runs the Red Hat 5.1 Linux operating system with 64 MBytes of RAM and a Quantum Viking II 9-GByte SCSI disk. One processing node also runs as the control node. The nodes are interconnected by an Intel Express 100B-TX shared Ethernet hub. Only two MPEG-1 movies are stored in the server, Toy Story and Top Gun. These two video files are partitioned into consecutive video blocks, which are evenly distributed over the three storage nodes in a round-robin fashion. Each block consists of 30 frames (two Groups-Of-Pictures), representing one second of playout time. Although these MPEG-1 video files are Variable-Bit-Rate encoded, the variation among video blocks is within 15%. Therefore, a constant-data-length block size of 1.4 Mbits has been chosen to store each video block.

Two models of parallel video servers, server-push and client-pull, have been evaluated in this paper. For the server-push model, a server applying conflict-free scheduling and a server applying the serialized conflict-free scheduling discussed in Section 4 have been implemented. For the client-pull video server, there is no explicit scheduling algorithm for pumping video streams to the clients; a client is responsible for sending a request for a specific video block to the video server whenever one is needed. Without loss of generality, each client sends a request to a specific storage node every second. Also, a FIFO access queue is implemented in each node of the client-pull video server to queue up the requests sent by clients. Furthermore, a ping-pong buffer of size 2.8 Mbits (2 × 1.4 Mbits) is used in both the server-push video server with conflict-free scheduling and the client-pull video server to facilitate disk access and network transmission. On the other hand, a buffer space of m × 1.4 Mbits is used in each storage node to facilitate the serialization of network transmissions.

The performance of these video servers is measured by the maximum number of video streams supportable without clients experiencing a "glitch".
Table 1: Experimental results of the various parallel video servers in normal playout

    Model         Scheduling        Supportable clients
    Server-push   CFS               24
    Server-push   Serialized CFS    36
    Client-pull   None              26
Table 1 shows the maximum number of clients supportable by these three parallel video servers. The experimental results show that on a shared Ethernet the maximum number of clients supportable by the server with the serialized conflict-free scheduling mechanism is 50% and 38% higher than that of the server with the conflict-free scheduling algorithm and that of the client-pull server, respectively. The performance gain of the server with serialized conflict-free scheduling is due to the elimination of frame collisions by serializing network transmissions on the shared Ethernet. Both the server-push video server with the conflict-free scheduling algorithm and the client-pull video server generate many collisions on the shared Ethernet, but the client-pull server supports two more clients than the server with conflict-free scheduling. The conflict-free scheduling algorithm was originally applied to parallel video servers interconnected by a switch-based network to avoid network conflicts; on shared Ethernet, however, it causes even more frame collisions than the client-pull video server because of the precise synchronization imposed by the scheduling algorithm.

Table 2: Average time spent in each operation of the parallel video server with serialized conflict-free scheduling

    Operation   Time spent
    T_Rd        434 ms
    T_sch       1 ms
    T_Tx        330 ms
    T_ctrl      1 ms

The average time spent in each operation of the server with serialized conflict-free scheduling is listed in Table 2. The sustained disk bandwidth and network bandwidth are 38.9 Mbits/sec and 51.2 Mbits/sec, respectively. Equation 9 holds, since 434 ms ≤ (3 − 1) × (330 + 1 + 1) + 330 − ((176/1024) × 8 / 51.2) × 1000 = 967.1 ms. Therefore, only the network operations account for the cycle time.
6 Concluding Remarks

In this paper, we present how to design a conflict-free scheduling algorithm for video streams in a Non-Uniform-Network Access network environment. To increase the throughput of a parallel video server interconnected by a shared network environment, such as Ethernet, the elimination of network collisions is highly desirable. A special mechanism, serialized conflict-free scheduling, is presented in the paper to avoid network collisions. The experimental results show that the server with the serialized conflict-free scheduling mechanism can support up to 50% more clients than the other two on a shared Ethernet, at the expense of larger buffer space. An experiment with parallel video servers interconnected by a switch-based network will be conducted in the near future.
References

[1] M. Buddhikot, G. M. Parulkar, and J. R. Cox Jr., Design of a large scale multimedia storage server, Journal of Computer Networks and ISDN Systems (1994), 504-524.

[2] W. J. Bolosky et al., The Tiger video fileserver, Proc. Sixth Int'l Workshop on Network and Operating System Support for Digital Audio and Video (Los Alamitos, Calif.), IEEE Computer Society Press, 1996.

[3] J. Hsieh, M. Lin, J. C. Liu, D. Du, and T. Ruwart, Performance of a mass storage system for video-on-demand, Journal of Parallel and Distributed Computing 30 (1995), no. 1, 147-167.

[4] IEEE, New York, 802.3: Carrier sense multiple access with collision detection, 1985.

[5] Y. B. Lee and P. C. Wong, A server array approach for video-on-demand service on local area networks, IEEE Infocom 1 (1996), 27-34.

[6] A. Reddy, Scheduling and data distribution in a multiprocessor video server, Proc. Second IEEE Int'l Conf. on Multimedia Computing and Systems (Los Alamitos, Calif.), IEEE Computer Society Press, 1995, pp. 256-263.

[7] M. Y. Wu and W. Shu, Scheduling for large-scale parallel video servers, Proc. Sixth Symp. on the Frontiers of Massively Parallel Computation, IEEE Computer Society Press, 1996, pp. 126-133.