Proceedings of the IASTED International Conference Parallel and Distributed Computing and Systems November 3-6, 1999, MIT, Boston, USA

STUDY OF DATA DISTRIBUTION TECHNIQUES FOR THE IMPLEMENTATION OF AN MPEG-2 VIDEO ENCODER

T. Olivares*, F. Quiles*, P. Cuenca*, L. Orozco-Barbosa✢, I. Ahmadº

* Departamento de Informática, Universidad de Castilla-La Mancha, Campus Universitario, 02071 Albacete, Spain, [teresa, paco, pcuenca]@info-ab.uclm.es
✢ School of Information Technology and Engineering, University of Ottawa, 161 Louis Pasteur, Ottawa, ON K1N 6N5, Canada, [email protected]
º Department of Computer Science, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, [email protected]

302-139

Abstract

Recent developments in computer communications have enabled the deployment of a wide variety of multimedia applications. Among the various media, video is characterized by its stringent requirements in terms of processing power, storage and bandwidth. In this paper, we study a parallel implementation of a software MPEG-2 encoder. We use a platform consisting of a cluster of workstations interconnected by an ATM switch. A parallel software encoder offers great flexibility compared with a hardware encoder, i.e., flexibility in setting parameters and in modifying the various stages of the encoding process. A parallel system should also allow us to reduce turnaround times by considerably reducing the time required to encode the video with a new set of parameters. However, the effective implementation of such an application on a platform without a common memory requires a thorough analysis of the best strategy for distributing the data. This issue is particularly important in video coding, due to the large volume of video data to be handled. In this study, we pay particular attention to data allocation among the processors in order to improve the overall system operation. We explore several methods of data distribution. Encoding times for different numbers of processors are provided. Our results show that the underlying middleware plays a major role in the performance of the overall system.

Keywords: MPEG-2, parallel processing, data allocation, message passing, network of workstations

1. Introduction

The study of novel video encoding schemes is a very active research area. A software implementation of a new encoder is very desirable: it is effective and more flexible than special-purpose hardware, and it allows algorithmic improvements to yield new solutions. However, a major drawback of a software implementation is the turnaround time, i.e., the time required to run the software encoder. To this end, the use of a high-performance computer platform can prove of great help. In this paper, we use a software MPEG-2 video encoder [1] and we focus on obtaining the best data distribution scheme for speeding up the encoding process. This is always advantageous since video encoding is much more complex and time-consuming than decoding.

MPEG encoding [2] is an extremely processor-intensive task. In the last few years, different approaches to the implementation of a parallel MPEG encoder have been proposed. Regarding software, the MPEG encoder developed at the University of California at Berkeley is possibly the most remarkable [3]. It was later modified to run on the Intel Paragon and the Intel Touchstone Delta [4]. A slice-based implementation of MPEG-2 [5] video encoding is described in [6], and a data-parallel approach to MPEG-2 video encoding that is completely portable, flexible and scalable is described in [7]. The MPEG-2 video specifications provide a standard syntax, which offers a trade-off between cost and quality, such as compression efficiency.

2. The MPEG algorithm

MPEG-2 is an ISO/IEC and ATSC standard for high-quality video and audio encoding. It has been designated as the encoding standard of the upcoming U.S. high-quality digital television standard, ATV [8]. This standard is intended for a wide range of applications and is fully compatible with MPEG-1, the first phase of the work of MPEG (Moving Picture Experts Group). The MPEG standards were designed with two requirements in mind: the need for high compression and the need for random access capability. To fulfil these requirements, MPEG exploits the spatial and temporal redundancy within an image sequence for optimum compression.

A group of consecutive frames is combined to form a structure known as a Group of Pictures (GoP) (see Figure 1). This group of frames, consisting of a pattern of I, P and B pictures, is repeated one after another in the video sequence. The encoding/decoding of a picture within a GoP is independent of the processing carried out to encode any other GoP in the video sequence. This structured organization of frames into GoPs provides the basis for random access capabilities.

The compression process of the MPEG-2 video encoder involves the following steps (see also Figure 2):

1. Sample rate reduction: color space transformation.
2. Motion compensation.
3. Spatial-to-DCT (Discrete Cosine Transform) domain transformation.
4. Quantization: discard unimportant DCT domain samples.
5. Entropy coding: lossless coding of DCT domain samples.
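As an illustration of the last step, the run-length stage that Figure 2 shows before Huffman coding can be sketched as follows. This is a minimal sketch, not the encoder's actual code: the function name `run_length_encode` is hypothetical, the input is assumed to be a list of quantized DCT coefficients already placed in scan order, and the sample sequence was constructed only so that it reproduces the (run, level) pairs printed in Figure 2.

```python
def run_length_encode(coeffs):
    """Turn a scan-ordered list of quantized coefficients into (run, level) pairs.

    `run` is the number of zero coefficients preceding a non-zero `level`.
    Trailing zeros produce no pair; a real coder would signal them with an
    end-of-block code instead (not modelled here).
    """
    pairs = []
    run = 0
    for level in coeffs:
        if level == 0:
            run += 1          # extend the current run of zeros
        else:
            pairs.append((run, level))
            run = 0           # a non-zero level closes the run
    return pairs


# Illustrative input (an assumption, not data from the paper).
sample = [9, 7, 8, 3, 6, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0]
print(run_length_encode(sample))
```

Each pair would then be mapped to a variable-length (Huffman) code word, so long runs of zeros left by quantization cost very few bits.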

3. Our parallel platform

We have used a portable and scalable parallel implementation of the MPEG-2 video encoder based on the SPMD programming paradigm [9]. This paradigm is based on partitioning the data into smaller pieces that are assigned to different processors. A single program is written for all processors, which asynchronously execute it on their local piece of data. Communication of data and synchronization are done through message passing using the Message Passing Interface (MPI) library [10]. To distribute frame data across processors, we set up a virtual 2D grid of processors regardless of the underlying hardware topology. Our platform is a cluster of 20 SUN Ultra-1 workstations (64-bit, 167 MHz) interconnected by a ForeRunner ATM switch (ASX-1000). The cluster is transformed into a virtual 2D grid. The processors are assigned x-y coordinates, which facilitates inter-processor communication and the identification of neighboring processors. The data is then mapped onto the virtual processor grid.

Figure 1. MPEG-2 GoP structure (pictures shown: I B B P B B P B B P)
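The mapping between a processor's rank and its position in the virtual 2D grid described in this section can be sketched as follows. This is only an illustration under stated assumptions: ranks are taken to be 0-based and numbered column-wise, and the helper names `rank_to_coords`, `coords_to_rank` and `neighbors` are hypothetical; the encoder itself may rely on the topology facilities of the MPI library instead.

```python
def rank_to_coords(rank, rows, cols):
    """Return (i, j) = (row, column) of a rank in a rows x cols grid.

    Assumption: ranks are 0-based and run down each column first.
    """
    if not 0 <= rank < rows * cols:
        raise ValueError("rank outside the processor grid")
    return rank % rows, rank // rows


def coords_to_rank(i, j, rows, cols):
    """Inverse of rank_to_coords for the same column-wise numbering."""
    if not (0 <= i < rows and 0 <= j < cols):
        raise ValueError("coordinates outside the processor grid")
    return j * rows + i


def neighbors(rank, rows, cols):
    """Ranks of the existing up/down/left/right neighbours (no wrap-around)."""
    i, j = rank_to_coords(rank, rows, cols)
    offsets = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    found = {}
    for name, (di, dj) in offsets.items():
        ni, nj = i + di, j + dj
        if 0 <= ni < rows and 0 <= nj < cols:
            found[name] = coords_to_rank(ni, nj, rows, cols)
    return found


print(rank_to_coords(3, 2, 4))   # position of rank 3 in a 2 x 4 grid
print(neighbors(3, 2, 4))        # ranks it would exchange border data with
```

Because the grid is virtual, the same mapping works for any number of workstations; only `rows` and `cols` change.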

[Figure 2 panels: RGB to YCbCr conversion; macroblock (16x16 pixels) divided into blocks (8x8 pixels); motion compensation; DCT and quantization; run-length coding, e.g. (0,9) (0,7) (0,8) (0,3) (0,6) (5,1) (4,2)...; Huffman coding]

Since the overhead of the unavoidable communication can be the major limiting factor, care should be taken when partitioning the data among the processors so that inter-processor communication is kept to a minimum. Since each processor has enough memory, different data distribution strategies can be explored in order to reduce the overhead introduced by communication. As in [11], the frame data can be distributed among the processors with overlap, so that each processor is allocated some redundant data.

Figure 2. MPEG-2 Encoding Process

MPEG-2 video uses a DCT (Discrete Cosine Transform) coding method to reduce spatial redundancy, and motion compensation to reduce temporal redundancy, encoding only the changes between two contiguous frames. The latter is the most compute-intensive operation [2]. Given a reference picture and a k×l macroblock (usually 16×16 pixels) in the current picture, the objective is to determine the k×l macroblock in the reference picture that best matches (according to a given criterion) the characteristics of the macroblock in the current picture. The location of a macroblock region is usually given by the (x, y) coordinates of its top-left corner. Ideally, the search for the best match should be performed over the whole reference picture; however, this is impractical. Instead, the search is restricted to a region around the original location of the macroblock in the current picture.
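The restricted search described above can be sketched as an exhaustive block-matching loop. The text leaves the matching criterion open, so this sketch assumes the sum of absolute differences (SAD); the function names, the `search` range parameter and the list-of-rows picture representation are likewise hypothetical choices made for illustration.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between the size x size block of `cur`
    at (cx, cy) and the block of `ref` at (rx, ry)."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(cur[cy + dy][cx + dx] - ref[ry + dy][rx + dx])
    return total


def best_match(cur, ref, x, y, size=16, search=7):
    """Search a (2*search+1)^2 window of `ref` around (x, y).

    Returns (cost, dx, dy): the lowest SAD and the displacement (motion
    vector) of the best matching block. Candidate blocks that would fall
    outside the reference picture are skipped.
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = x + dx, y + dy
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(cur, ref, x, y, rx, ry, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best
```

The cost grows with the square of the search range, which is why the search is limited to a window, and why a processor that encodes only part of a frame still needs the pixels surrounding its area.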

4. Data Distribution Techniques

4.1. Frame division

The first study compares three different methods of encoding frames in parallel (Figure 3). The frame data is distributed among the processors such that each processor encodes a prespecified area of every frame in the video sequence.

[Figure 3 panels: METHOD 1, METHOD 2 and METHOD 3, each showing a frame divided among processors 1 to 4]

Figure 3. Different methods of frame division

We denote the size of a video frame by M×N (in pixels), with M = h × Macroblock_size and N = v × Macroblock_size, where Macroblock_size = 16 and h, v are the number of macroblocks and slices in a video frame, respectively. The size of the 2-D processor grid is given by m×n. The set of processors is denoted by Pij for i = 1, 2, …, k, …, m and j = 1, 2, …, l, …, n. We then define the following parameters:

- Pixels not assigned on dimension m: α = MOD(M, (m×16))
- Pixels assigned to each processor on dimension m: β = (M-α)/m
- Pixels not assigned on dimension n: γ = MOD(N, (n×16))
- Pixels assigned to each processor on dimension n: δ = (N-γ)/n
- From this, α = s×16 and γ = t×16, where s and t are non-negative integers, the numbers of macroblocks not assigned in each dimension.
- The processors are numbered in consecutive order column-wise; we denote this number by r.

It is important to note that, for all three methods, each processor needs all the macroblocks surrounding its own area of the frame. This is required to estimate the motion vectors when carrying out motion estimation according to the MPEG-2 specification. In order to eliminate the need to exchange information during the encoding process, the processors retrieve the needed information from the main server [12]. Taking this into account, we define the three methods depicted in Figure 3 as follows:

- Method 1: Block division

For this method the local size of frame data at processor Pij is determined as follows:

\[
\text{Local-size at } P_{ij} =
\begin{cases}
(\beta+16) \times (\delta+16) & \text{if } i \le k,\ j \le l \\
\beta \times (\delta+16) & \text{if } i > k,\ j \le l \\
(\beta+16) \times \delta & \text{if } i \le k,\ j > l \\
\beta \times \delta & \text{if } i > k,\ j > l
\end{cases}
\qquad k \le s,\ l \le t
\]

- Method 2: Horizontal frame division

For this method we have α = 0, β = M, γ = MOD(N, (m×n×16)), δ = (N-γ)/(m×n) and γ = t×16, because the local size of frame data for every processor is M times a fraction of N, i.e., M × (N/(m×n)). N is now divided among all processors, and not only among the processors in dimension n of the processor grid. The local size at Pij is determined as follows:

\[
\text{Local-size at } P_{ij} =
\begin{cases}
\beta \times (\delta+16) & \text{if } r < t \\
\beta \times \delta & \text{if } r \ge t
\end{cases}
\]

- Method 3: Vertical frame division

For this method we have α = MOD(M, (m×n×16)), β = (M-α)/(n×m), γ = 0, δ = N and α = s×16. M is now divided among all processors. The local size at Pij is determined as follows:

\[
\text{Local-size at } P_{ij} =
\begin{cases}
(\beta+16) \times \delta & \text{if } r < s \\
\beta \times \delta & \text{if } r \ge s
\end{cases}
\]

4.2. Sequence division

Since motion estimation requires the frames within a GoP, we consider distributing the sequence by assigning GoPs to the processors. We will refer to this method as Method 4. Let nframes be the number of frames in the sequence, N the number of frames in one GoP, ngops = nframes/N the number of GoPs, and m×n the size of the 2D processor grid, and let:

- α = MOD(ngops, (m×n)), the number of GoPs not assigned
- β = (ngops - α)/(m×n), the number of GoPs assigned to each processor

The GoPs assigned to processor Pij, with rank(Pij) = r, are calculated as:

ng = r + (k × m × n)

with the following conditions: if α = 0, then k = 0, …, β-1; if α ≠ 0, then k = 0, …, β for r < α and k = 0, …, β-1 for r ≥ α. The frames assigned to each processor, nf, are then obtained as follows, with N the number of frames in a GoP: if ng = x, then nf = x, …, (x+N-1).
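The partitioning arithmetic of Sections 4.1 and 4.2 can be sketched for two of the methods as follows. This is a hedged illustration, not the authors' code: the function names are hypothetical, ranks r are assumed to be 0-based, the frame size and the GoP length used in the example (720×480 pixels, 12 frames) are assumptions chosen only to exercise the formulas, and `frames_for_gop` interprets a GoP index x as starting at frame x·N, which is one reading of the range given in the text.

```python
MACROBLOCK = 16


def horizontal_local_sizes(M, N, m, n):
    """Method 2: (width, height) in pixels of the strip given to each rank r."""
    procs = m * n
    gamma = N % (procs * MACROBLOCK)     # pixel rows left over
    delta = (N - gamma) // procs         # pixel rows every processor gets
    t = gamma // MACROBLOCK              # leftover macroblock rows
    return [(M, delta + MACROBLOCK if r < t else delta) for r in range(procs)]


def gops_for_rank(r, nframes, gop_len, m, n):
    """Method 4: GoP indices ng = r + k*m*n assigned to rank r."""
    procs = m * n
    ngops = nframes // gop_len
    alpha = ngops % procs                # GoPs left after an even split
    beta = (ngops - alpha) // procs      # GoPs every processor gets
    count = beta + 1 if r < alpha else beta
    return [r + k * procs for k in range(count)]


def frames_for_gop(ng, gop_len):
    """Frame numbers of GoP ng, assuming GoP ng starts at frame ng*gop_len."""
    return list(range(ng * gop_len, (ng + 1) * gop_len))


print(horizontal_local_sizes(720, 480, 2, 2))
print([gops_for_rank(r, 96, 12, 2, 3) for r in range(6)])
```

In both cases the leftover units (macroblock rows or GoPs) go to the lowest ranks, so no two processors differ by more than one unit of work.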

4.3. Experimental results

This first set of experiments aims to compare the four aforementioned methods. In our experiments, we have used a video sequence consisting of 96 frames in the CCIR-601 format. The number of workstations was set to 1, 2, 4, 8 and 16. By varying the number of processors, we have been mainly interested in evaluating the overall time required to encode the video as well as the overhead introduced by distributing the video sequence among the participating nodes. Each experiment was performed at least five times, and the reported results are averages over all trials. Figures 4.a and 4.b show the encoding times (seconds/frame) with input/output operations (ET-I/O) and without input/output operations (ET), respectively.

[Figure 4 panels: (a) encoding time with I/O operations (sec) and (b) encoding time without I/O operations (sec), each plotted against the number of processors for Methods 1 to 4]

Figure 4. Encoding time a) with I/O b) without I/O

Method 4 exhibits the best results when evaluating the overall process, i.e., including the I/O overhead (Figure 4.a). For the three other methods, the overhead introduced by the data distribution process increases with the number of processors. The only exception occurs when the number of processors is increased from one to two. The main cause is that, under these three methods, all processors have to access all video frames. Since all processors start at the same time, they may attempt to access the same video frame at the same time, which clearly introduces a non-negligible overhead. In the case of Method 4, the workstations do not need to access every frame file. These results clearly show the benefits of distributing the video sequence following a sequence division paradigm.

The encoding time (not counting the I/O overhead) is very similar for all four methods (see Figure 4.b). However, we notice that as the number of processors increases, the time required to encode the video sequence falls behind the ideal performance. That is, the system is unable to perform the task in a 1/N fraction of the time needed by one processor, where N is the number of participating processors.

5. Sequence division with read first

Since we have obtained the best results using Method 4, we explore in this second phase a variation of the way the video data is distributed under this method. In the first phase, the retrieval and encoding processes were interleaved. In this second phase, a processor retrieves all of the video sequence assigned to it before engaging in the encoding process.

The idea is that all processors retrieve their GoPs and store them on their local hard disks. They then read the frames from their local hard disk instead of reading them from the server disk (see Figure 5).

In this new method, referred to from now on as Method 4local, the frames can be distributed from the server to the processors following one of two strategies:

1. A dedicated processor is responsible for distributing the video frames to all the processors.
2. All the processors retrieve the frames from the server disk.

The first strategy creates a bottleneck that may affect the performance as well as the reliability of the system. The second strategy seems to better fit our needs. We therefore set the system to operate under this second strategy.

The experiments compare Method 4 and Method 4local using the same CCIR-601 video sequence of 96 frames used in the first set of experiments. We set the number of processors to 1, 2, 4, 8 and 16. As in the first set of experiments, we have been interested in evaluating the total time required to encode the video sequence as well as the time required to retrieve the video sequence from the central server.

[Figure 5 labels: Server disk, Local disks, I/O time, Allocation time]

Figure 5. a) Method 4, b) Method 4local

Figure 6a shows the total encoding time including the time required to retrieve the data from the main server. Both methods show very similar results. However, it is important to notice that Method 4local shows better results as the number of workstations increases, whereas Method 4 shows slightly better results for configurations of fewer than eight processors. In the case of the encoding time without I/O, both schemes exhibit similar results. As in the previous case, we observe that the system falls behind the ideal performance as the number of processors increases (see Figure 6b).

[Figure 6 panels: (a) total time with I/O operations (sec) and (b) encoding time without I/O operations (sec), each plotted against the number of processors for Method 4 and Method 4local]

Figure 6. Encoding time a) with I/O, b) without I/O

Figure 7 shows the speedup for Method 4 and Method 4local. Speedup is defined as the ratio of the time required by one workstation to compress a given number of images (96 in our case) to the time required by the slowest workstation of the set of N workstations to complete the task. In the figure, M4-I/O and M4local-I/O represent the speedup obtained when the overhead introduced by the retrieval process is taken into account for Method 4 and Method 4local, respectively. The other two metrics, M4 and M4local, only include the processing time. The first two metrics clearly show that the major limiting factor is the distribution strategy being employed. This shows the advantage of retrieving the video sequence into an auxiliary local disk unit. In our future research, we will further explore the main reasons for this performance discrepancy between the two retrieval strategies.

[Figure 7: speedup plotted against the number of processors for Ideal, M4-I/O, M4, M4local-I/O and M4local]

Figure 7. Speedup
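The speedup definition used for Figure 7 can be written down directly. The sketch below is illustrative only: the function names are hypothetical, `efficiency` is an added convenience (speedup divided by the ideal value N) rather than a metric reported in the paper, and the timings in the example are made-up numbers, not measurements.

```python
def speedup(single_time, worker_times):
    """Time on one workstation divided by the time of the slowest of the
    N workstations, as in the definition given for Figure 7."""
    if not worker_times:
        raise ValueError("at least one worker time is required")
    return single_time / max(worker_times)


def efficiency(single_time, worker_times):
    """Fraction of the ideal speedup N that is actually achieved."""
    return speedup(single_time, worker_times) / len(worker_times)


# Made-up timings (seconds) for one workstation and for four workstations.
t_one = 8.0
t_four = [2.0, 2.5, 1.9, 2.2]
print(speedup(t_one, t_four), efficiency(t_one, t_four))
```

Because the slowest workstation sets the completion time, a single node delayed by I/O lowers the speedup of the whole run, which is consistent with the gap between the curves with and without I/O.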

6. Conclusions

In this paper, we have studied various strategies for the parallel implementation of an MPEG-2 video encoder. The use of a cluster of workstations interconnected by an ATM switch has been explored. One of the main challenges in using such a platform is the definition of a proper scheme for distributing the video data among the workstations. To arrive at the most suitable solution, we had to analyze the procedures used in the compression algorithm of the MPEG-2 standard. Four different methods were studied and compared. The distribution based on partitioning the video stream into GoPs proved to be the best alternative. We have further improved this scheme by pre-fetching the video sequence onto the local disk of each workstation. Our future research will focus on the design of direct dynamic scheduling algorithms. This interest stems from the fact that many clusters comprise a heterogeneous set of workstations that vary in processing power, storage capacity and networking facilities. We also plan to explore the use of our encoder for the study of novel algorithms and parameter settings.

Acknowledgements

This work was partially funded by the Ministry of Education of Spain under CICYT project TIC97-08997CO4-02.

References

[1] S.M. Akramullah, I. Ahmad, M.L. Liou, Parallelization of MPEG-2 Video Encoder for Parallel and Distributed Computing System, Midwest Symposium on Circuits and Systems, Vol. 2, pp. 834-837, Rio de Janeiro, Brazil, August 13-16, 1995.
[2] V. Bhaskaran, K. Konstantinides, Image and Video Compression Standards: Algorithms and Architectures, Second edition, Kluwer Academic Publishers, 1997.
[3] K.L. Long, L.A. Rowe, Parallel MPEG-1 video encoder, Picture Coding Symposium, California, September 1994.
[4] K. Shen, L.A. Rowe, E.J. Delp, A parallel implementation of an MPEG-1 encoder: Faster than real time!, Conference on Digital Video Compression on Personal Computers: Algorithms and Technologies, San Jose, CA, February 5-10, 1995, pp. 407-418.
[5] ISO/IEC JTC1/SC29/WG11, Generic Coding of Moving Pictures and Associated Audio, ISO/IEC 13818-2, Draft International Standard, March 1994.
[6] Y. Yu, D. Anastassiou, Software implementations of MPEG-2 video encoding using socket programming in LAN, Conference on Digital Video Compression on Personal Computers: Algorithms and Technologies, San Jose, CA, February 6-10, 1994, Vol. 2187, pp. 229-240.
[7] S.M. Akramullah, I. Ahmad, M.L. Liou, A portable and scalable MPEG-2 Video Encoder on Parallel and Data Distributed Computing Systems, Symposium on Visual Communications and Image Processing, Vol. 2727, pp. 973-984, Orlando, Florida, March 17-20, 1996.
[8] Advanced Television Systems Committee, A compilation of Advanced Television Systems Committee Standards, March 1997.
[9] A. Karp, Programming for parallelism, IEEE Computer, May 1987, pp. 43-57.
[10] P.S. Pacheco, Parallel Programming with MPI, Morgan Kaufmann Publishers, Inc.
[11] S.M. Akramullah, I. Ahmad, M.L. Liou, Performance of a Software-Based MPEG-2 Video Encoder on Parallel and Distributed Systems, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 7, No. 4, pp. 687-695, August 1997.
[12] S.M. Akramullah, I. Ahmad, M.L. Liou, A Software based H.263 Video Encoder using a Cluster of workstations, SPIE's 1997 Optical Science, Engineering and Instrumentation Symposium, Vol. 3166, San Diego, CA, July 27-August 1, 1997.