Real-Time MPEG Encoding in Shared-Memory Multiprocessors∗

Denilson Barbosa¹, Wagner Meira Jr.¹, João Paulo Kitajima²

¹ Computer Science Department, Universidade Federal de Minas Gerais, PO Box 702, Belo Horizonte, MG 30123-970, Brasil. {dmb,meira}@dcc.ufmg.br

² Bioinformatics Laboratory, Universidade Estadual de Campinas, PO Box 6176, Campinas, SP 13083-970, Brasil. [email protected]
Abstract

Advances in digital imaging and faster networking have enabled an unprecedented popularization of digital video applications, not only for entertainment but also for health and education. However, capturing, digitizing, storing, and displaying video in real time remains a challenge for today's technology. Furthermore, the large number of frames that compose a digital video, each occupying hundreds of kilobytes, significantly stresses the storage and networking resources currently available. A common strategy for reducing the amount of data handled by digital video services is compression, as specified by the MPEG coding standards, for instance. Compression then becomes the dominant cost, demanding either specialized hardware or parallel computing to meet real-time constraints. In this work, we propose a novel shared-memory parallel algorithm for MPEG-1 video encoding based on the modularization of different coarse-grained tasks and the exploitation of the maximum parallelism among them. Different types of video sequences are used as benchmarks, and rates of 43 frames per second are achieved for standard sequences on a non-dedicated Sun multiprocessor with 32 CPUs.
1 Introduction
Digital video is no longer an extra feature of multimedia applications, but is essential in some very important ones. Videoconferencing, HDTV, Web TV broadcasting, and telepresence are some examples. However, digital video demands much more storage space and communication bandwidth than traditional computer applications.

∗ This work was partially supported by CAPES and the PROTEM-CC Almadem Project (CNPq grant 680072/95-0). João Paulo Kitajima holds a FAPESP fellowship (no. 99/01389-0) since May 1st, 1999.
Although the current telecommunication infrastructure allows continuous digital media transmission with quality, digital video would be of little practical interest without compression. For example, 15 minutes of uncompressed digital video with "TV like" quality (30 frames per second, 720 × 484 pixels per frame, 12 bits per pixel)¹ occupies 14.11 gigabytes of storage space and requires a bandwidth of roughly 125 megabits per second for real-time transmission. To deal with this problem, almost every video encoding scheme uses some kind of data compression.

¹ This is a usual video format.

Different applications require different levels of video quality and use different schemes for video coding and compression. Videoconferencing applications, for example, tolerate low-quality video as long as the audio is understandable and well synchronized with the images. Remote surgery, on the other hand, requires very high video quality, otherwise the surgeon will not be able to do his job. High-quality video demands high compression rates; moreover, it must be quickly decoded. The MPEG standards are intended to provide high-quality video with high compression rates. Nevertheless, MPEG encoding algorithms demand computational power far beyond what conventional computers can provide, so real-time MPEG encoding is usually not achieved with conventional hardware. Parallel computing techniques, on the other hand, have been used to speed up slow and complex computations for the last two decades. There are some approaches to digital video encoding using parallel machines, but the majority of them employ data-parallel schemes. In this work, we are interested in a control-parallel algorithm for real-time MPEG encoding, running on a shared-memory multiprocessor.
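The storage and bandwidth figures quoted above follow directly from the format parameters; the short C program below (an illustration, not part of the encoder) reproduces them:

```c
#include <stdio.h>

int main(void) {
    const double width  = 720.0;      /* pixels per line     */
    const double height = 484.0;      /* lines per frame     */
    const double bpp    = 12.0;       /* bits per pixel      */
    const double fps    = 30.0;       /* frames per second   */
    const double secs   = 15.0 * 60;  /* 15 minutes of video */

    double bits_per_frame = width * height * bpp;       /* ~4.18 Mbit */
    double bits_per_sec   = bits_per_frame * fps;
    double total_bytes    = bits_per_sec * secs / 8.0;

    printf("bit rate: %.2f Mbit/s\n", bits_per_sec / 1e6);  /* 125.45 */
    printf("storage : %.2f GB\n", total_bytes / 1e9);       /* 14.11  */
    return 0;
}
```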
[Figure 1: MPEG-1 video layers: Sequence, Group of Pictures (GOP), Picture, Slice, Macroblock, Block.]

[Figure 2: Frame dependencies in MPEG encoding: original I, B, B, P frames and the corresponding coded frames.]
This paper is organized as follows. Section 2 describes the MPEG-1 standard and the basics of video encoding. Next, we discuss previous software parallel approaches to video encoding in Section 3. Section 4 presents the proposed algorithm, and Section 5 discusses implementation issues and experimental results. Finally, conclusions and future work are outlined in Section 6.
2 MPEG Video
MPEG is an acronym for Moving Pictures Experts Group, a joint effort of ISO (the International Organization for Standardization) and IEC (the International Electrotechnical Commission). There are currently three established international standards, MPEG-1, MPEG-2, and MPEG-4; a new standard, MPEG-7, is under development. MPEG-1 and MPEG-2 are intended for high-quality video with high compression rates, while MPEG-4 addresses the use of video in environments with low communication capability. The MPEG-7 standard is intended for the storage and retrieval of authoring information, as well as content-based search in video sequences. The MPEG-2 standard is an extension of its predecessor, MPEG-1, and supports more video formats, such as digital TV. However, the encoding algorithms are basically the same in both standards. In this work, we decided to use MPEG-1 due to its relative simplicity; it is easy to extend our encoder to the MPEG-2 standard. We are interested specifically in the MPEG-1 Video standard [4]; audio encoding is left as future work. Hereinafter, we use the general term MPEG to mean the MPEG-1 Video standard.
2.1 MPEG Video Encoding
Figure 1 shows the basic components of an MPEG video. A sequence is made of many GOPs (Groups of Pictures), which, in turn, contain many pictures. Each picture is divided into slices, and each slice has several macroblocks. One macroblock has up to six blocks, depending on the frame type. The first four blocks carry the luminance signal of the image, while the remaining two code the chrominance signal [3]. Most of the video data is coded in these last two layers.

Many encoding decisions are taken in the picture layer, including the encoding model for the frame, which determines the encoding of the remaining layers. There are four picture encoding models (I, P, B, and D) in the MPEG standard, divided into two categories: (1) intra-frame encoding, intended to eliminate the spatial redundancy present in each frame; and (2) inter-frame encoding, dealing with the temporal redundancy present among frames in the sequence [3, 2]. I-frames (intra-coded frames) are encoded independently of other frames in the sequence, using only intra-frame encoding techniques. P (predicted) and B (bidirectionally predicted) frames are encoded as differences from previous (or subsequent) reference frames using inter-frame encoding techniques. Only I and P frames can be used as reference frames. D-frames encode a low-resolution version of the sequence and cannot be mixed with frames of other types; for this reason, we do not implement D-frame encoding. Figure 2 shows the encoding dependencies among frames: a P-frame uses a reference frame located in the past (i.e., before itself), while a B-frame uses two reference frames, one in the past and one in the future relative to itself.

The differences between frames are obtained via motion estimation and motion compensation techniques [3]. There are several implementations of these algorithms [12]. The MPEG standard defines the use of block-matching motion estimation and compensation. In this scheme, each macroblock of the frame being coded is matched against a number of possible macroblocks inside a search window, as depicted in Figure 3. The motion estimation process aims to find a good match for the macroblock being coded inside the search window. The vector (u, v) is the relative displacement between the macroblock and its best match, and is used to code the macroblock. There is not always a perfect match for the macroblock; in these cases, motion compensation is used to encode the differences between the macroblock and its best match. There are situations where neither technique achieves a good result; in these cases, the encoder switches to intra-frame encoding for the macroblock.

Given the number of possible techniques, the computational effort needed to encode a frame depends heavily on the encoding model employed. Moreover, the frame encoding dependencies (Figure 2) turn into data dependencies for the encoder. Therefore, these dependencies must be taken into account when developing a parallel MPEG video encoder; they can hamper the parallelism and, therefore, diminish the desired speedup.
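As a concrete illustration of block matching, the sketch below performs an exhaustive full search over the window using the sum of absolute differences (SAD) as the matching criterion. The names and the use of full-search SAD are our choices for illustration; as discussed in Section 3, the encoder actually uses a faster heuristic search.

```c
#include <stdlib.h>

#define MB 16  /* macroblock side, in pixels */

/* Sum of absolute differences between the macroblock at (x, y) in the
 * current frame and the candidate block at (x+u, y+v) in the reference. */
static long sad(const unsigned char *cur, const unsigned char *ref,
                int stride, int x, int y, int u, int v) {
    long s = 0;
    for (int i = 0; i < MB; i++)
        for (int j = 0; j < MB; j++)
            s += abs(cur[(y + i) * stride + (x + j)] -
                     ref[(y + v + i) * stride + (x + u + j)]);
    return s;
}

/* Full search inside a +/-range window; writes the motion vector that
 * minimizes the SAD into (*bu, *bv). Frame-boundary checks are omitted. */
static void full_search(const unsigned char *cur, const unsigned char *ref,
                        int stride, int x, int y, int range,
                        int *bu, int *bv) {
    long best = -1;
    for (int v = -range; v <= range; v++)
        for (int u = -range; u <= range; u++) {
            long s = sad(cur, ref, stride, x, y, u, v);
            if (best < 0 || s < best) { best = s; *bu = u; *bv = v; }
        }
}
```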
3 Related Work
There are some previous works on parallel MPEG video encoding. The use of special hardware is described in [11]. The major drawback of this approach is that the hardware sooner or later becomes obsolete, as improvements in video coding technology or standards cannot be incorporated into existing ad hoc hardware.

Regarding general-purpose parallel machines, [1] describes a data-parallel algorithm for real-time MPEG-2 encoding, implemented on the Intel Paragon and on networks of workstations. In that work, each frame is divided among the processors, as depicted in Figure 4 for 4 processors. Some portions of the frames, shaded in the figure, have to be sent to more than one processor, because the search window of a macroblock in one processor may overlap the portion of the frame sent to another processor. The problem with this approach is that the amount of data exchanged can be very high. Furthermore, adding more processors makes the problem worse: the proposed solution does not scale. The authors suggest an alternative approach where each processor gathers all the data it needs before starting to encode. This avoids communication during the encoding phase but does not reduce the amount of data exchanged.

Another approach is to send whole frames to each processor, as described in [9]. The problem here is that the amount of data needed by a processor depends on the type of the frame it is encoding: a processor encoding a B-frame needs more data than a processor encoding an I-frame. Clearly, the communication burden is not identical among the processors.

There are also related works on MPEG video encoding using shared-memory parallelism. A parallel algorithm for the MPEG-4 standard is described in [10]. In that work, only the motion estimation process is done in parallel; the authors did not explore other possible parallelizations of the algorithm. Furthermore, there are heuristic algorithms for this process that produce good visual results and demand much less computing power [12]. We have used one such heuristic [3, 6] which, alone, achieves an improvement of 74% [2] over the original sequential encoder.
4 Our Algorithm
We propose an algorithm [2], based on shared-memory parallelism, that overcomes the data communication problems of distributed-memory schemes. Figure 5 shows the structure of this algorithm. The encoding process is split into two kinds of tasks: (1) GOP encoding, performed by GOP tasks; and (2) frame encoding, performed by picture tasks.
[Figure 3: Search window: the macroblock being coded is matched against candidate macroblocks inside a search window of the reference frame; (u, v) is the displacement to the best match.]

[Figure 4: Data parallel approach: each frame is divided among processors P1 to P4; the shaded portions must be sent to more than one processor.]
[Figure 5: Schema of the proposed parallel algorithm: the main task spawns GOP tasks, each GOP task spawns picture (pic) tasks, and coded frames are placed in the write buffer, which is consumed by the write task.]
In addition, there is a main task, which coordinates the GOP tasks, and a write task, responsible for writing the encoded frames to the compressed output file. We define two equally sized buffers: (1) the read buffer, which holds uncompressed frames, and (2) the write buffer, where coded frames are stored prior to being written to the compressed output file. The circles in Figure 5 represent tasks. In this work, the smallest task performed is the encoding of a single frame, regardless of its type. The dashed lines in the figure represent synchronization dependencies, while the solid lines represent data flows. Since the read buffer is shared by all tasks, the processors need not communicate during the encoding process: they read their reference frames directly from memory. Also, there is no problem of concurrent access to the read buffer, because it is not written during encoding.
4.1 Synchronization
There are two synchronization steps in our algorithm: (1) encoding synchronization and (2) writing synchronization. The first guarantees the correct encoding of the frames, given the dependencies among them. This is done by coordinating each GOP task with its corresponding picture tasks. Each GOP task encodes the GOP heading on the output file and creates a number of new picture tasks, one for each frame inside the GOP. Those tasks are performed in parallel and, when they are all finished, the GOP task is considered complete.
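A minimal sketch of this completion synchronization, with our own naming, is shown below. It uses a Pthreads condition variable for brevity, whereas the actual encoder uses the semaphores described in Section 5.1.2; the effect is the same.

```c
#include <pthread.h>

typedef struct {
    int             done;   /* picture tasks finished so far */
    int             total;  /* picture tasks in this GOP     */
    pthread_mutex_t lock;
    pthread_cond_t  all_done;
} gop_sync;

/* Called by a picture task when its frame is fully encoded. */
void picture_finished(gop_sync *g) {
    pthread_mutex_lock(&g->lock);
    if (++g->done == g->total)
        pthread_cond_signal(&g->all_done);
    pthread_mutex_unlock(&g->lock);
}

/* Called by the GOP task after spawning its picture tasks. */
void wait_gop(gop_sync *g) {
    pthread_mutex_lock(&g->lock);
    while (g->done < g->total)
        pthread_cond_wait(&g->all_done, &g->lock);
    pthread_mutex_unlock(&g->lock);
}
```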
The main task groups the frames into GOPs, creating GOP tasks that are executed independently. The synchronization between the main task and the GOP tasks is similar to that between the GOP tasks and the picture tasks: when all GOP tasks are complete, the sequence is encoded.

The writing synchronization prevents occupied write buffers from being incorrectly overwritten. Flags indicate the state of each write buffer: when a coded frame is stored in a buffer, the corresponding flag is set and no other frame can be encoded into that buffer. The write task, after dumping the buffer to the output file, clears its flag, so a later compressed frame can be written to it. Dumping the frames to the output file must be serialized, because the order of the frames in the sequence is unique. Therefore, this synchronization can hinder the full exploitation of the parallelism of our algorithm.
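The writing synchronization admits a similarly compact sketch, again with hypothetical names: the per-buffer flag of the text becomes a `full` field guarded by a mutex, and the write task dumps the slots strictly in frame order.

```c
#include <pthread.h>
#include <stdio.h>

typedef struct {
    unsigned char  *bits;    /* coded bitstream for one frame       */
    size_t          len;
    int             full;    /* flag: slot holds an undumped frame  */
    pthread_mutex_t lock;
    pthread_cond_t  empty, filled;
} coded_slot;

/* Encoder side: wait until the slot is free, deposit the coded frame,
 * then mark the slot full so the write task can dump it. */
void deposit(coded_slot *s, unsigned char *bits, size_t len) {
    pthread_mutex_lock(&s->lock);
    while (s->full)                    /* previous frame not dumped yet */
        pthread_cond_wait(&s->empty, &s->lock);
    s->bits = bits;
    s->len  = len;
    s->full = 1;
    pthread_cond_signal(&s->filled);
    pthread_mutex_unlock(&s->lock);
}

/* Write task: dump the slots strictly in frame order, then recycle them. */
void dump_in_order(coded_slot *slots, int n, FILE *out) {
    for (int i = 0; i < n; i++) {
        coded_slot *s = &slots[i];
        pthread_mutex_lock(&s->lock);
        while (!s->full)
            pthread_cond_wait(&s->filled, &s->lock);
        fwrite(s->bits, 1, s->len, out);
        s->full = 0;                   /* buffer free for a later frame */
        pthread_cond_signal(&s->empty);
        pthread_mutex_unlock(&s->lock);
    }
}
```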
5 Implementation and Experimental Results
The algorithm was implemented in a multithreaded environment, using the C language and Pthreads [5], on the Solaris 2.5.1 operating system. The encoder was executed on a Sun Enterprise 10000 multiprocessor with 32 CPUs and 8 gigabytes of primary memory. Unfortunately, during the experiments, this machine was in use by other users: its average workload was equivalent to 15 fully busy CPUs. There were idle processors, but the processor-memory bus was shared, as was the disk subsystem.
5.1 The Parallel Encoder

The first step towards the intended parallelization was to isolate the encoding routines for I, P, and B frames. Our first approach was to use two available public domain encoders: mpeg2encode [7] and the encoder developed by the Portable Video Research Group at Stanford University [8]. Unfortunately, these encoders did not suit our intended modularization. We therefore implemented a simple MPEG encoder, based on the syntax described in [6]. Some optional features of the MPEG standard were not implemented (e.g., adaptive quantization). Although they are very important, we left them as future work because our major concern was not video quality but evaluating the speedup and the scalability of our algorithm.

5.1.1 Buffer Allocation
Every GOP task is assigned a set of frames. To better exploit the parallelism, we create enough buffers to hold all the frames in the GOP; otherwise, a frame being encoded might have to wait for reference frames that are themselves waiting for free buffers to be loaded. In this work, we fixed the number and the distribution of frames in each GOP: we used 12 frames per GOP, following the sequence I,B,B,B,P,B,B,B,P,B,B,P (written out in the sketch below). This was done to generate a uniform workload in every experiment.
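Written out as code, the fixed pattern is just a constant array; this is an illustrative sketch, and the type and names are ours.

```c
/* The fixed 12-frame GOP pattern used in all experiments. */
typedef enum { I_FRAME, P_FRAME, B_FRAME } frame_type;

static const frame_type gop_pattern[] = {
    I_FRAME, B_FRAME, B_FRAME, B_FRAME,
    P_FRAME, B_FRAME, B_FRAME, B_FRAME,
    P_FRAME, B_FRAME, B_FRAME, P_FRAME
};

/* One read-buffer slot per frame of the GOP, so that no picture task
 * ever stalls waiting for a reference frame to be loaded. */
enum { FRAMES_PER_GOP = sizeof gop_pattern / sizeof gop_pattern[0] };
```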
5.1.2 Implementing the Synchronization

Multithreaded programming environments are well suited to shared-memory parallel programs. Tasks are easily mapped to threads of execution which, by their nature, share every resource available to the program, including main memory. We create different threads to execute the tasks of our algorithm. Shared-memory parallel programs use shared portions of memory for task communication and synchronization. In this work, there is one portion of memory for each task, which we call a task tag. There are two kinds of tags, corresponding to the two kinds of encoding tasks: (1) picture tags and (2) GOP tags. The GOP tags are written by the main task, while the picture tags are written by the corresponding GOP task. Each tag has semaphores indicating the start and the end of the task. There are also semaphores coordinating the writing and dumping of frames in the write buffer. These semaphores implement the synchronization steps described in Section 4.1.

There is a fixed number of GOP and picture encoding threads. Each GOP encoding thread performs a GOP task generated by the main thread. The picture tasks corresponding to the frames in the GOP are created and their tags are stored in a shared array. The picture encoding threads loop forever over this array, looking for new frames to be coded. Eventually, all the frames assigned to a given GOP task are coded and the task is considered complete; at this point the GOP encoding thread waits for another GOP task to perform. This process continues until there are no frames left in the sequence. One important feature of this bag-of-tasks approach is that the number of frame encoding threads is independent of the number of GOP encoding threads. Moreover, the number of threads employed is also independent of the number and type of frames in the GOPs and in the sequence.
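A minimal sketch of a task tag and of a picture encoding thread's bag-of-tasks loop, assuming POSIX semaphores, is shown below. The field and function names are hypothetical, and the real tags carry more state; the point is that sem_trywait guarantees each tag is claimed by exactly one thread.

```c
#include <semaphore.h>

/* A task tag, roughly as described above (field names are ours). */
typedef struct {
    sem_t start;        /* posted by the GOP task when the frame is ready */
    sem_t end;          /* posted by the picture thread when it is coded  */
    int   frame_index;  /* which frame of the sequence this task encodes  */
} task_tag;

#define MAX_TASKS 64
task_tag bag[MAX_TASKS];   /* shared array scanned by the picture threads */

/* Each picture encoding thread loops over the bag, claiming whichever
 * task is ready; encode_frame() stands in for the real frame encoder. */
void *picture_thread(void *arg) {
    (void)arg;
    for (;;) {
        for (int i = 0; i < MAX_TASKS; i++) {
            if (sem_trywait(&bag[i].start) == 0) {
                /* encode_frame(bag[i].frame_index); */
                sem_post(&bag[i].end);  /* signal the owning GOP task */
            }
        }
    }
    return NULL;
}
```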
5.2 Video Sequences Used

The computational effort needed to encode a video sequence depends largely on the sequence itself. Besides screen size and number of frames, one of the most important factors is the complexity of the scenes [3]. Thus, in order to better determine the performance of our algorithm, we chose four different sequences for the experiments; Table 1 shows some of their features. Two of the four sequences are considered simple: susie, which shows a girl answering the phone, and leroy, which shows an interview with a musician. The other two are considered more complex: football shows a scene from an American football game, while sun is a promotional sequence produced by Sun Microsystems. The susie and football sequences were used as test sequences by the MPEG committee. We decided to also use longer sequences to verify the impact of sequence size on the encoder's performance.

Table 1: Test sequences.

  sequence | frames | screen size
  football |   146  |  320 × 240
  susie    |   148  |  320 × 240
  sun      |  4653* |  352 × 240
  leroy    |  2054  |  352 × 240

  * In the first experiment only 1000 frames are used.
5.3 The Experimental Environment
We devised two experiments to investigate both the performance of the encoder and its real-time encoding capability.
[Figure 6: Performance data: (a) average picture encoding time, (b) average wait time, (c) average speedup (with the ideal curve for comparison), and (d) average efficiency (work versus I/O and wait time), for 1 to 32 threads on the football, susie, sun, and leroy sequences.]
Concerning performance, we use two measures: (1) speedup and (2) efficiency. By efficiency we mean the portion of time the encoder spends doing useful work; more specifically, time spent on file reading and synchronization is not considered useful. The goals of the second experiment, concerning real-time encoding, are twofold: (1) verify the impact of sequence size on the encoder's performance, and (2) verify the capability of real-time encoding (i.e., more than 30 frames per second, the standard rate for digital video). The first goal is hard to assess because the complexity of different portions of a sequence may vary; in this experiment we used 500, 1000, 2000, and 4000 frames of the sun sequence, whose scene complexity is approximately uniform across the sequence.

The first experiment used 4 GOP encoding threads with 1, 2, 4, 8, 16, and 32 picture encoding threads. The second experiment used 4 GOP encoding threads with 16 picture encoding threads. For each configuration, 20 measurements were taken. In order to minimize the impact of the machine load, the measurements were compared to the average of the 20 points, and the 5 points presenting the largest deviation were discarded.
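The trimming procedure can be stated precisely; the sketch below is our illustration of it, not the original measurement script. It discards the `drop` points farthest from the mean and averages the rest.

```c
#include <math.h>

/* Mean of the n measurements after discarding the `drop` points with
 * the largest deviation from the overall mean (20 runs, 5 dropped,
 * in the experiments above). */
double trimmed_mean(const double *x, int n, int drop) {
    double mean = 0.0;
    for (int i = 0; i < n; i++) mean += x[i];
    mean /= n;

    int excluded[32] = {0};           /* assumes n <= 32 */
    for (int d = 0; d < drop; d++) {  /* exclude the worst point, drop times */
        int worst = -1;
        for (int i = 0; i < n; i++)
            if (!excluded[i] &&
                (worst < 0 || fabs(x[i] - mean) > fabs(x[worst] - mean)))
                worst = i;
        excluded[worst] = 1;
    }

    double sum = 0.0;
    for (int i = 0; i < n; i++)
        if (!excluded[i]) sum += x[i];
    return sum / (n - drop);
}
```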
5.4 Experimental Results
Our results for the first experiment are summarized in the four graphs of Figure 6. Figure 6(a) shows the average frame encoding time for each sequence. As expected, encoding more complex sequences results in greater average frame encoding times. The graph also shows that encoding times do not change significantly when using up to 8 threads; beyond 8 threads, the threads compete for processors to perform the tasks, and the scheduling overhead was not factored out of the encoding time.
[Figure 7: Real-time encoding: (a) impact of sequence size on average encoding time (work versus I/O wait), and (b) frame rates with and without I/O, for 500 to 4000 frames of the sun sequence.]
Figure 6(b) shows the average wait times due to synchronization for the four sequences. These times are very small for up to 8 threads. The same behavior appears in the average efficiency curve (Figure 6(d)), which also shows that I/O operations are a major source of performance degradation for the algorithm at 32 threads. Finally, Figure 6(c) shows the average speedup achieved. Since our approach does not modify the encoding algorithms themselves, the speedup comes from the parallelism among the write task and the frame encoding tasks. The algorithm presents almost linear speedup, very close to the ideal, for up to 8 threads. The peak performance is obtained with 16 threads, when the speedup is close to 10. We believe that, in a better experimental environment, our algorithm would have performed better with 16 and 32 threads. In this experiment the encoder also exhibited better performance with longer sequences.

Figure 7 shows the data for the second experiment. The graph in Figure 7(a) shows that the encoder's performance is essentially unaffected by the size of the sequence; as mentioned, however, the encoder performed slightly better with longer sequences, as suggested by the slowly decreasing encoding times in the figure. Figure 7(b) shows the frame encoding rates achieved, both with and without the I/O time for frame reading. We disregard I/O times on the grounds that an ideal real-time encoder would grab uncompressed frames from a camera or a VCR, store them in a temporary buffer, and encode them directly from main memory. In this configuration our encoder encodes 43 frames per second, which is almost 50% better than the real-time constraint.
6 Conclusions and Future Work
In this paper we presented a simple and novel approach to real-time parallel MPEG encoding on multiprocessors. The encoder was able to encode up to 43 frames per second, almost 50% better than our original goal, using only 16 threads. This figure was achieved on a Sun multiprocessor with 32 processors. Although such a machine is still expensive, in the medium term machines of this class will be sold as desktop workstations (quad-Pentium PCs are already sold for less than US$ 10,000.00). The performance exhibited by the encoder is promising, and we believe it can be improved. The more-than-necessary performance suggests that the 30 frames per second threshold can still be met when other overheads are considered (e.g., the capture and digitization processes) and the video encoding algorithms are improved. Future work includes: (1) implementing adaptive quantization to improve video quality; (2) audio encoding; (3) tests with a real camera attached to the computer; and (4) code improvements in order to achieve 30 frames per second on a less expensive platform (e.g., a multiprocessor Intel PC).

Finally, we would like to thank CENAPAD-MG/CO for making their multiprocessor available, and the Almadem PROTEM project.
References

[1] S. M. Akramullah, I. Ahmad, and M. L. Liou. A Portable and Scalable MPEG-2 Video Encoder on Parallel and Distributed Computing Systems. In Symposium on Visual Communications and Image Processing '96, 1996.
[2] Denilson M. Barbosa. MPEG Encoding of Digital Video in Parallel. Master's thesis, Federal University of Minas Gerais, March 1999. In Portuguese.

[3] Vasudev Bhaskaran and Konstantinos Konstantinides. Image and Video Compression Standards: Algorithms and Architectures. Kluwer Academic Publishers, second edition, 1997.

[4] ISO/IEC 11172-2: Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to About 1.5 Mbit/s. Part II: Video, 1993.

[5] Bil Lewis and Daniel J. Berg. Multithreaded Programming with Pthreads. Prentice Hall PTR, 1997.

[6] Joan L. Mitchell, William B. Pennebaker, Chad E. Fogg, and Didier J. LeGall. MPEG Video Compression Standard. Chapman and Hall, 1996.

[7] MPEG-2 Video Encoder, Version 1.1a. MPEG Software Simulation Group, 1994.

[8] PVRG MPEG Codec, Version 1.2.1. Portable Video Research Group, Stanford University.

[9] K. Shen, L. Rowe, and E. Delp. A Parallel Implementation of an MPEG-1 Encoder: Faster than Real-Time! In SPIE Conference on Digital Video Compression: Algorithms and Techniques, San Jose, CA, 1995.

[10] M. K. Steliaros, G. R. Martin, and R. A. Packwood. Parallelisation of Block Matching Motion Estimation Algorithms. Technical Report CS-RR-320, Department of Computer Science, University of Warwick, Coventry, UK, January 1997.

[11] H. H. Taylor et al. An MPEG Encoder Implementation on the Princeton Engine Video Supercomputer. In Proceedings of the Data Compression Conference, pages 420-429, Los Alamitos, CA, 1993.

[12] A. Murat Tekalp. Digital Video Processing. Prentice Hall PTR, 1995.