Performance characterization of a low-cost video codec on portable devices

Giulio Iannello, Luca Vollero
Dipartimento di Informatica e Sistemistica, Università di Napoli Federico II
{iannello,vollero}@unina.it

Francesco Delfino
ITEM - Laboratorio Nazionale CINI, Napoli

Abstract

Bandwidth and processing requirements of conventional multimedia applications typically exceed the capabilities of current portable terminals. Applications should hence be able to adapt their requirements in order to run on these devices. In this paper we provide a complete performance characterization of a prototype video codec based on techniques that trade reproduction quality for reduced complexity. Comparison with standard codecs like H263 and MPEG-4 demonstrates a remarkable reduction of coding times, potentially enabling real-time processing of multimedia data on low-power devices. However, these performance gains are achieved at the expense of PSNR, making the approach suited to applications like videoconferencing only if a limited quality is acceptable. Our analysis also points out and quantifies some limitations of current low-power devices.

1 Introduction

The evolution of computer communication networks and the rapid increase in microprocessor performance have enabled a new set of high-quality, real-time multimedia applications. However, as the trend in computer and network architecture is towards mobile wireless systems, applications must be able to adapt to widely different circumstances in terms of network bandwidth, processing power, visualization capabilities, etc. Although CPU performance constantly improves over time, advances in battery technology and low-power circuit design will not alone meet the demands of increasingly complex mobile applications. This seems especially true for multimedia applications, where both network bandwidth and the processing capabilities of clients may easily become a bottleneck.

This paper focuses on the experimental assessment of recently proposed video coding techniques that trade reproduction quality for reduced complexity, potentially enabling real-time processing of multimedia data on low-power devices [1, 2]. We provide a complete characterization of a prototype codec based on these techniques. The characterization includes a comparison with more conventional client architectures and with standard coding schemes like H263 and MPEG-4, which prove to be still too demanding for low-power terminals without specialized hardware support.

Our results show that the considered codec has the potential to deliver good performance on low-power devices. However, this is achieved at the expense of reproduction quality, making the approach suited to applications like video conferencing only if a limited reproduction quality can be accepted. Our analysis also points out and quantifies limitations of current low-power devices, giving hints on how applications can be ported to these devices with acceptable performance.

2 Low-cost video coding techniques

The major current video coding standards, like MPEG-1/2 and H.261/263, guarantee very good performance in a wide range of conditions, but have a significant encoding complexity that cannot be reduced much without a complete change of approach. This is all the more true for MPEG-4, due to the need for complex image analysis algorithms to extract video objects. From this point of view, although wavelet-based techniques [5, 7] appear more promising, they still require a 3D transform, which in certain situations can be too demanding in terms of both complexity and memory requirements [4].

To enable effective coding/decoding of video information on low-power devices, we must resort to simpler compression schemes, even though they entail some performance loss in terms of increased rate or impaired reproduction quality. In this respect, a low-complexity video coding system has been proposed in [2]. The scheme, referred to as the CG codec from now on, is based on conditional replenishment (CR) and hierarchical vector quantization (HVQ). These techniques avoid the complexity of motion compensation and perform part of the computations required by vector quantization (VQ) off-line. They therefore lead to a remarkable complexity reduction with respect to more standard techniques like those used in MPEG-1/2/4 and H.261/263 codecs. In this paper, we used an improved version of the CG coder proposed by Cagnazzo et al. [1], which further reduces the complexity of the algorithm by using ordered code-book VQ in the CR phase.

The CG codec is also a multi-layer scalable codec: the output is arranged so that only a base layer is required to perform the decoding process with limited quality, and additional layers provide increasingly enhanced reproduction quality. The layering is performed using both hierarchical coding schemes and temporal sub-sampling techniques. The coder output is composed of six layers: three for low quality reproduction, and three for enhanced reproduction. The structure of the base (low quality) stream is reported in figure 1. Frames are temporally ordered from left to right, and the coding dependencies are represented by the incoming arrows. To encode (decode) a frame, the codec needs the encoding (decoding) information of the frames linked to it by incoming arrows. The complementary enhancement stream has the same three-level frame structure. Typically, in level L1 of both the base and the enhanced stream there is one I-frame every 24 P-frames.

[Figure 1: diagram of the three levels L1, L2 and L3, containing Intra, Inter P, Inter B1 and Inter B2 frames]

Figure 1. Data layer structure for the low quality full frame rate stream.

The presence of multiple hierarchically arranged layers has the main advantage of enabling adaptation of the transmission rate to the available bandwidth. It can also be exploited to adapt the compression/decompression steps to the available computational power and visualization capabilities of the terminal devices.
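The idea behind conditional replenishment can be illustrated with a minimal sketch. It is not the CG implementation: the block size, the mean-absolute-difference criterion, the threshold value and the function names `changed_blocks` and `replenish` are all assumptions made for illustration, and the selected blocks are copied verbatim instead of being vector quantized.

```python
def changed_blocks(prev, curr, block=4, threshold=8.0):
    """Return the (row, col) origins of the blocks that changed noticeably.

    A block is selected when its mean absolute difference from the previous
    frame exceeds `threshold` (a hypothetical criterion chosen for this sketch).
    """
    height, width = len(curr), len(curr[0])
    selected = []
    for r in range(0, height, block):
        for c in range(0, width, block):
            diff = 0
            count = 0
            for y in range(r, min(r + block, height)):
                for x in range(c, min(c + block, width)):
                    diff += abs(curr[y][x] - prev[y][x])
                    count += 1
            if diff / count > threshold:
                selected.append((r, c))
    return selected


def replenish(prev, curr, selected, block=4):
    """Decoder side: start from the previous frame and update only the
    selected blocks (copied verbatim here; a real codec would decode them)."""
    height, width = len(curr), len(curr[0])
    out = [row[:] for row in prev]
    for r, c in selected:
        for y in range(r, min(r + block, height)):
            end = min(c + block, width)
            out[y][c:end] = curr[y][c:end]
    return out
```

Because unchanged blocks are neither searched for motion nor re-encoded, the per-frame cost of such a scheme scales with the amount of change in the scene, which is consistent with the low coding complexity discussed above.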

3 Experimental setup and methodology

As a low-power device, we used in our experiments an HP iPaq 3850, a current-generation palmtop that seems to meet the minimum requirements for hosting non-trivial multimedia applications. To put the codec performance on the iPaq platform in context, we carried out a preliminary evaluation of its capabilities with respect to a traditional Pentium-based workstation. The characteristics of both platforms, including the operating system and C compiler used, are reported in table 1. The workstation is a mid-range machine that can also be considered representative of a typical laptop.

Platform      iPaq 3850                        Workstation
Processor     Intel StrongARM 1110, 206 MHz    Intel Pentium III, 600 MHz
Memory        64 MB                            384 MB
L2 cache      not present                      512 KB
L1 cache      24 KB (I-16, D-8)                32 KB (I-16, D-16)
OS            Linux 2.4.18                     Linux 2.4.7-10
C compiler    gcc 2.95.2                       gcc 2.96

Table 1. Platforms characteristics

We believe that this experimental setup can reasonably be assumed representative of realistic scenarios, even though forthcoming devices are expected to deliver increasing performance. Indeed, the complexity of coding/decoding the multimedia data that users typically expect is also growing, which makes the observed results fairly representative in the mid-term.

We ran two benchmarks to evaluate relative differences in memory and CPU performance. The first benchmark consists of copying memory blocks of different sizes. It provides information at all levels of the memory hierarchy. The second benchmark consists of four nested loops moving bytes between typical data structures used in the improved CG codec. This benchmark has been designed so as to perform all memory operations in L1 cache, in order to roughly evaluate CPU speed and the effects of compiler optimizations.

The gathered results are reported in table 2. For the memory benchmark (first three rows) the table reports the measured sustained copy bandwidth in MB/s. For CPU performance, it reports the time normalized to the case where all optimizations are enabled on the workstation platform. The third column of the table reports the slowdown factor of the iPaq with respect to the workstation.

From the data reported in the table, we can conclude that the iPaq is about one order of magnitude slower than the workstation if a relevant fraction of memory accesses are out of cache (the difference can be even higher for in-cache computations). Another interesting observation is that only -O1 compiler optimizations seem to have an effect on the iPaq platform. We argue that the main reason for this is that the slow memory system makes performance almost independent of further code optimizations, provided that the memory access pattern is not changed (as is the case for the benchmark used).

parameter         iPaq    workstation   slowdown fact.
memory            30.5    250           8.2
L2 cache          -       1080          -
L1 cache          37.5    1400          37.3
CPU (no opt.)     36.6    9.1           4.0
CPU (-O1)         28.8    4.4           6.5
CPU (all opt.)    28.8    1.0           28.8

Table 2. Performance comparison between platforms for different compiler optimization flags.
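The slowdown column of table 2 can be recomputed from the other two columns. The short sketch below does so, assuming (as the text states) that the memory rows are bandwidths, where higher is better, and the CPU rows are normalized times, where lower is better. The dictionary and function names are introduced here only for illustration.

```python
# Values transcribed from table 2 as (iPaq, workstation) pairs.
BANDWIDTH_MB_S = {
    "memory": (30.5, 250.0),
    "L1 cache": (37.5, 1400.0),
}
NORMALIZED_TIME = {
    "CPU (no opt.)": (36.6, 9.1),
    "CPU (-O1)": (28.8, 4.4),
    "CPU (all opt.)": (28.8, 1.0),
}


def slowdown_factors():
    """Return the iPaq slowdown factor for each row, rounded to one decimal."""
    factors = {}
    for name, (ipaq, workstation) in BANDWIDTH_MB_S.items():
        # Bandwidth: the slower platform has the smaller value.
        factors[name] = round(workstation / ipaq, 1)
    for name, (ipaq, workstation) in NORMALIZED_TIME.items():
        # Time: the slower platform has the larger value.
        factors[name] = round(ipaq / workstation, 1)
    return factors
```

The L2 cache row is omitted because the iPaq has no L2 cache, so no ratio can be formed for it.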

In order to assess the effectiveness of the CG approach, we decided to compare our prototype implementation with standard codecs, and ported two open source codecs to the iPaq: (i) the FFMPEG suite, including an H263 codec [3], and (ii) XviD, an MPEG-4 compliant codec [8]. All measurements reported in the following sections about the improved CG codec and the H263 codec have been obtained by code instrumentation, using a high resolution timer. They refer only to coding/decoding operations, excluding frame acquisition and visualization. Measurements concerning the XviD codec were directly provided by its open source implementation. Finally, to better evaluate codec performance, we repeated all measurements on two video sequences with different characteristics. The first sequence contains typical information generated in videoconferencing applications, with a large fixed background and limited movements of the objects populating the scene. The second sequence is a typical trailer, with scenes characterized by quick changes and fast moving objects.
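The kind of instrumentation described above can be sketched as follows. This is only an illustration of the measurement pattern, not the code used in the experiments: `process_frame` stands for a hypothetical coding or decoding routine, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics
import time


def time_frames(process_frame, frames):
    """Return the per-frame processing times in milliseconds.

    Only the call to `process_frame` is timed, so frame acquisition and
    visualization stay outside the measurement.
    """
    times_ms = []
    for frame in frames:
        start = time.perf_counter()  # high resolution monotonic timer
        process_frame(frame)
        times_ms.append((time.perf_counter() - start) * 1000.0)
    return times_ms


def summarize(times_ms):
    """Return (mean, standard deviation) of a list of times."""
    return statistics.mean(times_ms), statistics.pstdev(times_ms)
```

Keeping I-frame and P-frame times in separate lists would allow per-type statistics to be reported alongside the overall mean and standard deviation.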

4 Porting and code optimization

The codec has been ported to the iPaq platform. This required minor modifications to the code, and we ran a first set of tests using this version of the codec. In an attempt to further improve performance, we also manually introduced two kinds of simple optimizations into the code. First, the original code contained many inner loops moving individual bytes from one data structure to another in order to rearrange blocks for coding/decoding. Since in most cases the bytes moved by a loop were contiguous, we replaced these loops with block transfers through calls to memcpy. Second, we manually performed the complete unrolling of loops cycling a small number of times (typically 3 or 4).

Tables 3-(a) and 3-(b) report frame processing times on the workstation and the iPaq, with and without compiler optimizations. Data refer to decoding P-frames of the base stream of the first video and are representative also of other tests.

(a) without compiler optimizations

Platform      original code   manually opt. code   speedup
workstation   3.8             3.2                  1.18
iPaq          21.7            14.5                 1.50

(b) with compiler optimizations

Platform      original code   manually opt. code   speedup
workstation   1.8             1.7                  1.06
iPaq          9.6             9.1                  1.05

Table 3. Codec performance on the two platforms.

                            base         enhanced
frame size (pixels)         176 × 144    352 × 288
#P-frames per I-frame       24           24
frame rate (frames/s)       6.25         6.25
bit-rate (Kb/s)             80           300
compression factor (%)      16.2         15.4

Table 4. Relevant parameters of the streams used in the experiments.

In all cases, compiler optimizations reduce decoding times, meaning that when applied to the entire codec they also have a beneficial effect on the locality of memory accesses. Manual optimizations also improve performance, but their effect is limited when compiler optimizations are enabled. In our experiments we therefore used the manually optimized version of the codec with all compiler optimizations enabled.
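The first manual optimization, replacing byte-by-byte loops with block transfers, can be illustrated with a Python analogue. The codec itself is written in C and uses memcpy; the sketch below only mirrors the transformation, with slice assignment on a `bytearray` playing the role of the block copy. Both function names are hypothetical.

```python
def copy_bytewise(dst, dst_off, src, src_off, n):
    """Original pattern: move n contiguous bytes one at a time."""
    for i in range(n):
        dst[dst_off + i] = src[src_off + i]


def copy_block(dst, dst_off, src, src_off, n):
    """Optimized pattern: move the same n bytes with one block transfer
    (the counterpart of a memcpy call in C)."""
    dst[dst_off:dst_off + n] = src[src_off:src_off + n]
```

The two functions produce the same result whenever the source and destination ranges are contiguous and do not overlap, which is the condition under which the loops could be replaced.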

5 Performance analysis

In this section we evaluate the performance of the CG codec on the iPaq, using the H263 and XviD codecs as a reference. To make the comparison fair, we set the coding parameters of both reference codecs so as to generate a video with the same frame size, the same compression ratio, and the same GOP structure as the one generated by the CG codec. Since neither the H263 standard nor the current implementation of XviD supports B-frames, we used the GOP structure of the L1 level of the CG codec, whose relevant parameters are summarized in table 4. Under these conditions the comparison can be considered fair, since all codecs process the same frames using the same information. As already discussed, we used two different video streams. In all tests the manually optimized version of the CG codec was used. All times reported are in ms.

[Figure 2: bar charts of I-frame and P-frame times, with overall mean and standard deviation, for the CG, H263 and XVID codecs; y axis in ms, from 0 to 80; panel (a) coding phase, panel (b) decoding phase]

Figure 2. Coding and decoding times of codecs on the iPaq platform (video 1, base stream)

[Figure 3: bar charts of I-frame and P-frame times, with overall mean and standard deviation, for the CG, H263 and XVID codecs; y axis in ms, from 0 to 80; panel (a) coding phase, panel (b) decoding phase]

Figure 3. Coding and decoding times of codecs on the iPaq platform (video 2, base stream)

[Figure 4: bar charts of I-frame and P-frame times, with overall mean and standard deviation, for the CG, H263 and XVID codecs; y axis in ms, from 0 to 400; panel (a) coding phase, panel (b) decoding phase]

Figure 4. Coding and decoding times of codecs on the iPaq platform (video 1, enhanced stream)

[Figure 5: bar charts of I-frame and P-frame times, with overall mean and standard deviation, for the CG, H263 and XVID codecs; y axis in ms, from 0 to 400; panel (a) coding phase, panel (b) decoding phase]

Figure 5. Coding and decoding times of codecs on the iPaq platform (video 2, enhanced stream)

Test       CG      H263     XviD
video 1    12.0    10.62    17.9
video 2    17.4    11.63    25.9

Table 5. Comparison between codecs on the workstation platform (enhanced stream, P-frames, mean coding time).

(a) base

            video 1    video 2
coding      11.200     11.222
decoding    7.724      7.731

(b) enhanced

            video 1    video 2
coding      58.936     56.207
decoding    51.972     47.742

Table 6. Mean coding and decoding times of B-frames (CG codec only).
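As a rough aid for reading the times in table 6 (in ms, like all times in this section), each one can be compared with the time available per frame at a given frame rate. The sketch below is only a back-of-the-envelope check under a simplifying assumption: it looks at B-frames in isolation, whereas a real stream also contains I- and P-frames. The function and constant names are introduced for illustration.

```python
def frame_budget_ms(frame_rate):
    """Time available to process one frame, in ms, at `frame_rate` frames/s."""
    return 1000.0 / frame_rate


def fits_real_time(per_frame_ms, frame_rate):
    """True if a mean per-frame time fits within the frame budget."""
    return per_frame_ms <= frame_budget_ms(frame_rate)


# Largest mean B-frame coding times from table 6, over the two videos.
TABLE6_CODING_MS = {"base": 11.222, "enhanced": 58.936}
```

With these numbers, the budget is 40 ms at 25 frames/s and 80 ms at 12.5 frames/s, so base-stream B-frame coding fits at 25 frames/s, while enhanced-stream B-frame coding fits only at the lower rate.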


Figures 2 and 3 report mean coding and decoding times of the three codecs for the base stream of video 1 and video 2, respectively. For each case, besides individual data about I- and P-frames, the mean coding and decoding times and their standard deviation are reported. From the data, the symmetric complexity of the CG coder is apparent. In particular, for video 1 the CG codec is about 3 times faster than the H263 codec and almost 5 times faster than the MPEG-4 codec in the coding phase. The speedups increase to 4 and 6.5, respectively, for video 2, which requires more expensive motion compensation. As to the decoding phase, for video 1 H263 is slightly faster than CG, which is competitive with XviD. For video 2, CG exhibits its worst performance: it has higher decoding times than both H263 and XviD, with higher variability. Nevertheless, even in this case the absolute decoding times of CG remain within acceptable limits. These results are confirmed by the data concerning the enhanced stream, reported in figures 4 and 5. The CG codec is 2-4 times faster than H263 and XviD in the coding phase and slightly slower in the decoding phase. Table 5 reports representative data on codec performance on the workstation, in order to evaluate the different relative behavior of the three codecs on the two platforms. If the relative speedups are compared with the ones achieved on the iPaq, it is apparent that the CG scheme is especially suited to low-power platforms. All these data confirm that the symmetric nature of the CG coding scheme leads to reduced coding times with respect to more standard codecs. From a quantitative point of view, the speedup achieved on a low-power platform in the coding phase is between 2 and 6, depending on the characteristics of the video stream and the quality of the coding performed. This speedup does not substantially compromise decoding times, which are of the same order of magnitude as for the standard codecs.
These considerations hold even if B-frames are considered. Coding/decoding times of B-frames for the CG codec for all cases considered are reported in table 6. Observing that the number of B-frames in a GOP is three times the number of P-frames, it is apparent that at 25 frames/s the CG codec has even better mean performance than presented so far. However, at this frame rate a comparison with the other codecs, which do not support B-frames, is difficult and we do not discuss the matter further.

Mean coding/decoding times per frame of CG on a relatively slow device like the iPaq potentially allow real-time processing of video information at standard frame rates (25 frames/s), if only the base stream is used. On such a device, H263 or MPEG-4 streams with the same frame size and compression ratio could be decoded in real time (video streaming), but could not be coded (video conference). Enhanced streams cannot be coded in real time on a portable device at standard frame rates. However, the scalable characteristics of the CG codec would allow real-time coding/decoding of these streams on the iPaq already at 12.5 frames/s. If such a lower rate is acceptable, both video streaming and video conferencing applications could be run on a low-power platform. Moreover, in the video streaming case, the intrinsic scalability of CG would make multiple copies of the same video at the source site unnecessary. Conversely, multiple copies would be mandatory for the other streams, for which video streaming applications could use alternative formats only if the video has been coded off-line at the proper rate.

[Figure 6: plot of PSNR (y axis, from 20 to 45) versus bitrate (x axis, from 60 to 140) for the XviD, H263 and CG codecs]

Figure 6. PSNR of the codecs for video 1 at 6.25 frames/s.

For a complete characterization of the codec, reproduction quality also has to be considered. Figure 6 reports the PSNR vs. bit-rate of video 1 coded with the three codecs used for the experiments. Measurements refer to a frame rate of 6.25 frames/s, corresponding to the streams used above for performance comparison. From the graphs, it is apparent that the qualities of H263 and XviD are comparable and that they are much better than that of the CG codec at all bit-rates. In particular, at 80 Kilobits/s, the gap is about 11.75 dB. This result points out that the low-cost coding capabilities of CG are counterbalanced by a remarkable loss in reproduction quality, even in fairly static videos. Nevertheless, it is worth noting that this negative result is partly alleviated by two observations: (i) the subjective quality of the CG streams looks sufficient for interactive applications, and (ii) such applications could not be supported on current low-power devices by the other codecs without special hardware support.
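For reference, PSNR is conventionally computed from the mean squared error (MSE) between the original and the decoded samples. The sketch below uses the standard definition for 8-bit samples; the function names and the flat-list frame representation are assumptions made for illustration, and the text does not say how PSNR was averaged over the frames of a sequence.

```python
import math


def psnr(reference, decoded, peak=255):
    """PSNR in dB between two equally sized sequences of sample values."""
    if len(reference) != len(decoded) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - d) ** 2 for r, d in zip(reference, decoded)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak * peak / mse)


def mse_ratio(db_gap):
    """Factor by which the MSE grows for a given PSNR loss in dB."""
    return 10.0 ** (db_gap / 10.0)
```

Since PSNR is logarithmic in the MSE, a gap of 11.75 dB such as the one reported above corresponds to a mean squared error roughly 15 times larger.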

6 Related work

Reducing the complexity of algorithms for multimedia processing is an active research field. However, little experimental work is reported in the literature, especially as far as low-power devices are concerned. In [4], Johanson analyzes the performance of a wavelet-based codec. His work indicates that different implementations of performance-critical code are necessary in order to use it on different platforms. In our work, which uses a simpler coding/decoding algorithm, we do not need to perform this tuning and obtain software that is portable at the source level. In [6], Sheikh et al. focus on a standard H.263 encoder and optimize its code for an embedded Digital Signal Processor. They demonstrate that access to external memory is a bottleneck for video systems with large memory requirements and they suggest the use of implementation-dependent designs. Our work confirms that memory can be a bottleneck on low-power devices as well, but our results demonstrate that the CG approach can lead to acceptable performance with minimal optimization effort.

7 Conclusions

The processing requirements of expensive multimedia applications (e.g. real-time video processing) cannot be met by devices like palmtops and smart-phones. Our work demonstrates that the CG codec can perform video coding at low cost due to its symmetric properties, making it suited to time-constrained multimedia applications. We also show that state-of-the-art codecs for both streaming (MPEG-4) and videoconferencing (H263) cannot perform video coding in real time on low-power devices, although they lead to acceptable decoding performance on these devices. Another interesting aspect of the CG codec is its adaptability, which allows reproduction quality to be changed at the source (server overload control), in the network (network congestion control) and on the device (computational overload control) without computational overhead in the coding phase.

The counterpart of these good features is the much lower PSNR achieved by the CG scheme in its currently available implementation. This may represent a severe drawback, limiting the future applicability of CG schemes to special purpose applications involving very low-cost devices.

Acknowledgements

This work has been carried out under the financial support of the Ministero dell’Istruzione, dell’Università e della Ricerca (MIUR) in the frameworks of the FIRB project “Middleware for advanced services over large-scale, wired-wireless distributed systems (WEB-MINDS)”, and of the project “Scalability and Quality of Service in Web Systems”.

References

[1] M. Cagnazzo, G. Poggi, L. Verdoliva, “Low-complexity scalable video coding through table lookup VQ and predictive index coding”, IDMS-PROMS, Coimbra (Portugal), Nov. 2002.
[2] N. Chaddha, A. Gupta, “A framework for live multicast of video streams over the Internet”, Procs. Int. Conf. on Image Processing, pp. 1-4, 1996.
[3] FFMPEG, http://ffmpeg.sourceforge.net.
[4] M. Johanson, “Implementation Issues for Scalable Multimedia Communication Systems”, Framkom technical report 2001:2, http://w2.alkit.se/ mathias/publications.html
[5] B.J. Kim, Z. Xiong, W.A. Pearlman, “Low bit-rate scalable video coding with 3-D set partitioning in hierarchical trees (3-D SPIHT)”, IEEE Transactions on Circuits and Systems for Video Technology, Dec. 2000, pp. 1374-1387.
[6] H.R. Sheikh, S. Banerjee, B.L. Evans, A.C. Bovik, “Optimization of a Baseline H.263 Video Encoder on the TMS320C6x”, Proc. Texas Instruments DSP Educator’s Conference, Aug. 2-4, 2000, Houston, TX.
[7] J.W. Woods, G. Lilienfield, “A resolution and frame-rate scalable subband/wavelet video coder”, IEEE Transactions on Circuits and Systems for Video Technology, Sept. 2001, pp. 1035-1044.
[8] XviD, http://www.xvid.org/.
