Cache Optimization for Embedded Systems Running H.264/AVC Video Decoder

Abu Asaduzzaman and Imad Mahgoub
Department of Computer Science & Engineering
Florida Atlantic University
777 Glades Road, Boca Raton, Florida 33431, USA
Tel: (561) 297-3855, Fax: (561) 297-2800
E-mail: [email protected] and [email protected]

1-4244-0212-3/06/$20.00 ©2006 IEEE

Abstract — The computing power of microprocessors has increased exponentially over the past few decades, and so has their support for computation-intensive multimedia applications. With such improved computing power, the memory subsystem becomes the major barrier to supporting video messaging and video telephony/conferencing on mobile handsets. Studies show that multimedia applications exhibit sufficient reuse of values for caching, so there is an opportunity to customize the cache subsystem for improved performance. In our previous work, we optimized the cache to enhance MPEG-4 (Part 2) decoding performance on a mobile device. H.264/AVC (or MPEG-4 Part 10) outperforms both MPEG-4 (Part 2) and H.263 by providing better video quality at a lower bit-rate and a lower latency. As a result, H.264/AVC is becoming the next-generation video codec for embedded systems. In this paper, our focus is to enhance the decoding performance of H.264/AVC through cache optimization for an embedded and mobile device. The simulated architecture includes a processor to run the decoding algorithm and a two-level cache system. The level-1 cache is split into instruction and data caches, and the level-2 cache is a unified cache. We use Cachegrind to characterize the H.264/AVC decoding workload and VisualSim to model the system-level architecture and run the simulation for that workload. Simulation results show that H.264/AVC decoding performance can be enhanced through cache optimization.

Keywords — Cache optimization, embedded system, mobile device, H.264/AVC video, decoder.

1. Introduction

A proverb says that a picture speaks more than a thousand words. Video messaging, video telephony, and/or conferencing have entered the marketplace over the past several years and are becoming more and more popular. The demand to run such video applications on mobile handsets is growing; video applications are the future of embedded multimedia systems. Applications drive the architecture of embedded and mobile devices. For many mobile devices, a single-processor implementation is recommended to meet the application's size and power consumption requirements [1]. Thanks to improvements in the semiconductor industry, the computational power required to implement such a device is not an issue. However, due to the growing disparity between computation power and memory speed, the traffic from CPU to memory leads to a significant processor/memory speed gap [2]. Bus contention also becomes a serious problem for newly introduced multi-core processors such as the Intel Xeon and AMD Opteron [3]. Using one or more caches may improve overall performance by easing memory bandwidth bottlenecks and bus contention; serving requests from data already stored in the cache cuts down bus traffic significantly. Figure 1 shows a memory hierarchy with level-1 (CL1) and level-2 (CL2) caches.

Figure 1: Typical memory hierarchy to bridge the processor-memory speed gap

To improve cache performance, CL1 is divided into instruction (I1) and data (D1) caches. Higher cache miss rates generate significant excess CPU-memory traffic. In a desktop environment, a bigger cache may simply resolve this problem. For mobile devices, where power and bandwidth are limited, cache inefficiency requires the use of higher-capacity components that drive up system cost [4, 5, 6]. H.264/AVC (also known as MPEG-4 Part 10) is a newer technology that outperforms both of its parents, MPEG-4 (Part 2) and H.263, by providing better video quality at a lower bit-rate and a lower latency. It provides up to four times the frame size of video encoded with the MPEG-4 (Part 2) codec at a given data rate, and much higher quality than H.263 across the entire bandwidth spectrum [7]. As a result, H.264/AVC has become an important and demanding application for video messaging and video telephony/conferencing. Like MPEG-4 (Part 2), the H.264/AVC standard does not explicitly define a CODEC (enCOder/DECoder pair). Instead, it defines the syntax of an encoded video bit-stream together with the method of decoding this bit-stream. We strongly believe that research is needed to investigate the impact of various cache parameters on the performance of embedded and mobile systems running H.264/AVC video.

This paper is organized as follows. In Section 2, we discuss related work. We present a short overview of the H.264/AVC video decoder in Section 3. Section 4 explains the simulated architecture. Simulation details are presented in Section 5. In Section 6, the simulation results are analyzed. Finally, we conclude our work in Section 7.
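As a rough illustration of why the two-level hierarchy of Figure 1 narrows the processor-memory speed gap, the average memory access time (AMAT) can be estimated from per-level hit latencies and miss rates. The latencies and miss rates below are illustrative assumptions, not measurements from this paper:

```python
def amat(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_latency):
    """Average memory access time for a two-level cache hierarchy.

    Every access pays the CL1 hit time; CL1 misses additionally pay the
    CL2 hit time; CL2 misses additionally pay the main-memory latency.
    """
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem_latency)

# Illustrative numbers (in cycles); NOT taken from the paper's simulations.
with_cl2 = amat(l1_hit=1, l1_miss_rate=0.01, l2_hit=10,
                l2_miss_rate=0.003, mem_latency=100)
without_cl2 = 1 + 0.01 * 100   # CL1 misses go straight to main memory

print(f"AMAT with CL2:    {with_cl2:.3f} cycles")   # 1.103
print(f"AMAT without CL2: {without_cl2:.3f} cycles")  # 2.000
```

Even with these modest assumed miss rates, the unified CL2 roughly halves the average access time, which is the intuition behind the CL2 experiments in Section 6.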

2. Related work

The primary focus of most cache optimization research is to investigate the impact of various cache parameters on system performance for a target set of applications. In this paper, we enhance the decoding performance of H.264/AVC through cache optimization of a single-processor mobile handset. Selected articles from various journals and conferences that are closely relevant to this work are presented in this section.

A general-purpose (desktop-like) computing platform running an MPEG-2 application is studied in [8]. Most of the data transferred is concentrated on the system bus; at the very least, two streams of encoded and decoded video are concurrently transferred in and out of main memory. It is found that there is sufficient reuse of values for caching to significantly reduce the raw memory bandwidth required for video data. The results indicate that adding a larger second-level cache to a small first-level cache can reduce the memory bandwidth significantly.

In [9], the cache behavior of multimedia and traditional applications is examined. This work shows that multimedia applications exhibit a higher data miss rate but a comparable or lower instruction miss rate, and indicates that data cache line sizes larger than those currently used would be beneficial for multimedia applications. The problem of improving memory hierarchy performance for multitasking data-intensive applications is addressed in [10]. The authors use cache partitioning techniques together with a static task execution order to reduce inter-task data cache misses; with little freedom to reorder task execution, the method concentrates on optimizing the caches instead. In [5], we explore the architecture of a multiprocessor embedded system running an MPEG-4 application and develop a simulation program to evaluate system performance in terms of utilization, delay, and total transactions for various CL1 sizes. In [11], we enhance MPEG-4 decoding performance through cache optimization of a mobile device. We simulate an architecture that has a processor core to run the decoding algorithm and a memory hierarchy with two levels of caches, and use the Cachegrind and VisualSim simulation tools to optimize cache size, line size, associativity level, and cache levels. Using Cachegrind, we characterize the MPEG-4 video decoder and collect data to run the VisualSim simulation model. Simulation results show that the right selection of cache parameters can improve performance and reduce cost for devices running an MPEG-4 decoder.

In this paper, we focus on the impact of various cache design parameters, namely cache size, line size, associativity, and cache levels, on the performance of the H.264/AVC decoding algorithm running on an embedded and mobile device with a single processor and two levels of caches.

3. H.264/AVC video

The standardization of video compression technology has revolutionized broadcast television, telecommunication, and home entertainment systems. H.263 was standardized by the ITU-T (International Telecommunication Union – Telecommunication Standardization Sector) in 1995 and is widely used in videoconferencing systems. MPEG-4 (Part 2) was standardized by the ISO (International Organization for Standardization) in 1998 and enables a new generation of internet-based video applications. The groups that developed these standards, the ISO Motion Picture Experts Group (MPEG) and the ITU-T Video Coding Experts Group (VCEG), have jointly developed a new standard that promises to significantly outperform both MPEG-4 (Part 2) and H.263, providing high-quality, low bit-rate streaming video. The new standard is


called Advanced Video Coding (AVC) and is widely known as H.264/AVC or MPEG-4 Part 10 [12, 13]. Figure 2 shows the time periods in which the various standards were developed.

Figure 2: Joint (ISO MPEG and ITU-T VCEG) standards

In 2001, the Joint Video Team (JVT) was formed, including experts from MPEG and VCEG. The main task of the JVT was to develop the draft H.26L (VCEG's long-term effort toward a new standard) into a full International Standard. The final drafting work on the first version of the new standard was completed in 2003. The outcome is two identical standards: the ISO MPEG standard MPEG-4 Part 10 and the ITU-T standard H.264/AVC [12, 13, 14].

3.1. H.264/AVC codec

Like MPEG-4, the H.264/AVC standard does not explicitly define a CODEC (enCOder/DECoder pair). Instead, it defines the syntax of an encoded video bit-stream together with the method of decoding this bit-stream [12]. In practice, however, a compliant encoder and decoder are likely to include the functional elements shown in Figure 3, since these functions are likely to be necessary for compliance; beyond them, there is scope for considerable variation in the structure of the CODEC. The basic functional elements (such as prediction, transform, quantization, and entropy encoding) differ little from those of previous standards (MPEG-4 and H.263).

3.2. H.264/AVC decoder

We briefly discuss the H.264/AVC decoding algorithm, our target application for this work. The various functions of an H.264/AVC decoder are shown in Figure 3. The decoder receives an encoded bit-stream from the Network Abstraction Layer (NAL). The data elements are entropy decoded and reordered to produce a set of quantized coefficients (X). These are rescaled and inverse transformed to give D'n. Using the header information from the bit-stream, the decoder creates a prediction macro-block P, identical to the original prediction P formed in the encoder. P is added to D'n to produce uF'n, which is filtered to create the decoded macro-block F'n [12, 13].

Figure 3: H.264/AVC decoder

It is very important that both encoder and decoder use identical reference frames to create the prediction. Otherwise, the predictions in the encoder and decoder will differ, leading to a growing error between them.

4. Architecture

4.1. Simulated architecture

In this work, we optimize cache parameters for an embedded and mobile device running the H.264/AVC video decoding algorithm. We use a trace-driven HW/SW co-simulation technique [15]. The system architecture is shown in Figure 4. A processing core is used to run the H.264/AVC decoding algorithm. The system has level-1 (CL1) and level-2 (CL2) caches. CL1 is split into instruction (I1) and data (D1) caches, and CL2 is a unified cache. The processor and the main memory are connected via a shared bus. External and I/O interfaces are added as needed.
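The reconstruction path described in Section 3.2 (rescale, inverse transform, add prediction, filter) can be sketched at a very high level. This is a toy illustration, not the real H.264/AVC transform: the block values and quantization step are invented for the example, and the inverse transform and deblocking filter are reduced to pass-throughs:

```python
def rescale(X, qstep):
    """Inverse quantization: scale the quantized coefficients back up."""
    return [x * qstep for x in X]

def inverse_transform(coeffs):
    """Stand-in for the 4x4 integer inverse transform (identity here)."""
    return list(coeffs)

def deblock(block):
    """Stand-in for the deblocking (loop) filter (pass-through here)."""
    return list(block)

def decode_macroblock(X, P, qstep):
    """X: quantized coefficients, P: prediction macro-block.

    D'n = inverse_transform(rescale(X)); uF'n = P + D'n; F'n = deblock(uF'n).
    """
    Dn = inverse_transform(rescale(X, qstep))
    uFn = [p + d for p, d in zip(P, Dn)]
    return deblock(uFn)

# Toy 2x2 "macro-block": a residual plus a flat prediction from a
# hypothetical reference frame.
X = [1, -2, 0, 3]          # hypothetical quantized residual
P = [100, 100, 100, 100]   # hypothetical prediction
print(decode_macroblock(X, P, qstep=2))  # -> [102, 96, 100, 106]
```

The sketch also makes the reference-frame requirement concrete: if the decoder's P differed from the encoder's, every reconstructed sample would be offset, and the error would propagate into subsequent predictions.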

Figure 4: Simulated architecture for embedded system running H.264/AVC decoder

A system-level simulation program is developed using VisualSim. To drive the trace-driven VisualSim simulation model, the H.264/AVC decoding workload is collected in terms of miss rates and read/write references. Using Cachegrind and JM RS(96), we collect I1, D1, and CL2 miss rates as well as D1 and CL2 read/write references for the H.264/AVC decoding algorithm under various cache parameters. We evaluate system performance in terms of utilization and the total number of transactions processed by different system components for various cache sizes, line sizes, associativity levels, and cache levels. The processor decodes the encoded video streams from the main memory: the CPU reads the encoded data from, and writes the decoded data to, the main memory through its cache hierarchy.
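To make the cache parameters discussed in Section 4.2 concrete, the following sketch shows how cache size, line size, and associativity determine the set index and tag of an address, and how a write-back policy marks lines dirty instead of writing through. This is a minimal illustrative model of the mechanism, not the simulator used in this paper:

```python
class SetAssocCache:
    """Minimal set-associative cache with LRU replacement and write-back."""

    def __init__(self, size, line_size, assoc):
        self.line_size = line_size
        self.assoc = assoc
        self.num_sets = size // (line_size * assoc)
        # Each set is an ordered list of [tag, dirty]; front = most recent.
        self.sets = [[] for _ in range(self.num_sets)]
        self.hits = self.misses = self.writebacks = 0

    def access(self, addr, is_write=False):
        block = addr // self.line_size
        index = block % self.num_sets
        tag = block // self.num_sets
        ways = self.sets[index]
        for entry in ways:
            if entry[0] == tag:          # hit: move line to MRU position
                self.hits += 1
                ways.remove(entry)
                ways.insert(0, entry)
                entry[1] = entry[1] or is_write  # write-back: mark dirty
                return True
        self.misses += 1                 # miss: fetch line, evict LRU
        if len(ways) >= self.assoc:
            victim = ways.pop()
            if victim[1]:                # dirty victim goes back to memory
                self.writebacks += 1
        ways.insert(0, [tag, is_write])
        return False

# 16 KB, 32 B lines, 8-way: one of the CL1 configurations studied here.
d1 = SetAssocCache(size=16 * 1024, line_size=32, assoc=8)
d1.access(0x1000, is_write=True)   # cold miss, line marked dirty
d1.access(0x1008)                  # same 32 B line: hit
print(d1.hits, d1.misses)          # -> 1 1
```

Note how the three parameters interact: `num_sets = size / (line_size * assoc)`, so growing the line size or associativity at a fixed total size shrinks the number of sets, which is why the parameters in Section 4.2 cannot be tuned independently.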

4.2. Cache design parameters

We consider the four most significant cache design parameters in this work: cache size, line size, associativity, and cache levels.

4.2.1. Cache size. In [8], P. Soderquist and M. Leeser show that for MPEG-2 decoding, the cache-memory traffic is a function of cache size, and that increasing the cache size improves performance for multimedia applications. Cache memory has cost and space constraints, so the decision of how large a cache to implement is critical.

4.2.2. Line size. Larger lines tend to provide lower miss rates, but require more data to be read, and possibly written back, on a miss. Studies show that minimal memory traffic occurs with smaller lines [8].

4.2.3. Associativity. Increasing the associativity level of smaller caches may provide better performance. Studies show that changing from a direct-mapped cache to a 2-way set-associative one may reduce memory traffic by as much as 50% for small caches. Set sizes greater than 4, however, show minimal benefit across all cache sizes [8].

4.2.4. Multi-level caches. In general, the addition of CL2 decreases bus traffic and latency. In [11], it is shown that CL2 significantly improves the performance of a system running multimedia applications.

5. Simulation

5.1. Simulation tools

In this work, we use the following simulation tools: Cachegrind from Valgrind [16], VisualSim from Mirabilis Design [17], and the JM Reference Software [18]. Using Cachegrind and JM RS(96), we collect I1, D1, and CL2 miss rates for the H.264/AVC decoding algorithm. Using VisualSim, we develop the simulation model of the architecture running the H.264/AVC decoding algorithm.

Cachegrind is a Linux/Unix-based simulation tool from Valgrind. It is used to characterize the H.264/AVC decoding workload by performing a detailed simulation of the CL1 (I1+D1) and CL2 caches; the total references, misses, and miss rates for the I1, D1, and CL2 caches can be collected. We use JM RS(96) with Cachegrind to decode an H.264/AVC encoded file. The following is a command used to run Cachegrind:

valgrind --skin=Cachegrind --I1=32768,4,64 --D1=32768,4,64 --L2=131072,4,64 ldecod.exe decoder.cfg

The general format to specify cache parameters on the Valgrind command line is --<cache>=<size>,<associativity>,<line size>. Here, the level-1 cache I1 (or D1) size is 32768 bytes (32K), the associativity is 4-way, and the line size is 64 bytes. Similarly, the level-2 cache size is 131072 bytes (128K), the associativity is 4-way, and the line size is 64 bytes. The command "ldecod.exe" runs the H.264/AVC decoder using the "decoder.cfg" configuration file, which includes information about the input bit-stream, output file, etc. The output is saved in a file named cachegrind.out.<pid>.

VisualSim, from Mirabilis Design, Inc., is an effective tool for system-level simulation of embedded devices. VisualSim is used to model the system-level behavior of the embedded system running the H.264/AVC decoding algorithm. The tool provides libraries for various system components, including CPU, caches, bus, and memory. A VisualSim simulation model is developed by selecting the right blocks and making appropriate connections among them. The VisualSim simulation cockpit provides the functionality to run the model and collect simulation results.

5.2. Representative H.264/AVC workload

We generated an H.264/AVC decoding workload to run our simulation model. For the accuracy and completeness of the simulation results, it is very important to use a workload that truly represents the target application; the workload defines all possible scenarios and conditions under which the system-under-study will operate [19, 20, 21]. In this work, we use H.264/AVC, an important and demanding multimedia application for embedded real-time systems. The original YUV file we use to generate the .264 file is a common intermediate format (CIF) YUV 4:2:0 video stream with an image format of 352 pixels by 288 lines at 30 fps. Using the H.264/AVC JM RS(96) CODEC, we decode a .264 file (to a new YUV file). Using Cachegrind, we collect the total references for I1, D1, and CL2; the data (D1) and CL2 references are divided into read and write references.
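For a CIF YUV 4:2:0 stream, the raw data volume the decoder must produce can be worked out directly: each frame carries a full-resolution luma plane plus two chroma planes subsampled by two in each dimension. The arithmetic below is a straightforward consequence of the 4:2:0 format, not a number reported in this paper:

```python
WIDTH, HEIGHT, FPS = 352, 288, 30  # CIF at 30 frames per second

luma = WIDTH * HEIGHT                       # Y plane, full resolution
chroma = 2 * (WIDTH // 2) * (HEIGHT // 2)   # Cb + Cr, subsampled 2x2
bytes_per_frame = luma + chroma             # 8 bits per sample

print(f"bytes per frame : {bytes_per_frame}")             # 152064
print(f"raw rate (MB/s) : {bytes_per_frame * FPS / 1e6:.2f}")  # 4.56
```

Roughly 4.5 MB/s of decoded output (plus the reference frames read back for prediction) is exactly the kind of traffic the cache hierarchy has to absorb.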


As shown in Table 1, decoding the H.264/AVC encoded (.264) file generates 62% instruction and 38% data references.

Table 1. Total I1, D1, and CL2 references while decoding the .264 file
  Total CL1 Refs (K)      CL1 Refs (%)       CL2 Refs (K)
  I1          D1          I1%     D1%
  120,163     74,159      62      38         191

Table 2 shows that 75% of the data references are reads and the remaining 25% are writes.

Table 2. D1 read and write references while decoding the .264 file
  D1 Refs (K)             D1 Refs (%)
  Read        Write       Read%   Write%
  55,765      18,394      75      25

Table 3 similarly shows that 84% of the CL2 references are reads and the remaining 16% are writes.

Table 3. CL2 read and write references while decoding the .264 file
  CL2 Refs (K)            CL2 Refs (%)
  Read        Write       Read%   Write%
  161         30          84      16

The numbers presented in Tables 1, 2, and 3 are used in the VisualSim simulation model.

5.3. Input parameters

Cache size, line size, associativity, and cache levels are varied as input to the VisualSim simulation model. Important system parameters are shown in Table 4.

Table 4. System parameters
  Item                 Value
  CL1 cache sizes      8K+8K to 64K+64K
  CL2 cache sizes      128K to 4M
  Line size            16B to 256B
  Associativity        2-way to 16-way
  Cache levels         L1 and L2
  Task time            1.0 simulation time units
  Task rate            Task time * 1.0
  CPU time             Task time * 0.4
  MEM time             Task time * 0.6
  Bus time             MEM time * 0.4
  CL1 cache time       MEM time * 0.2
  CL2 cache time       MEM time * 0.4
  Bus queue length     100

5.4. Assumptions

To simplify and run the VisualSim simulation model, we make the following assumptions without compromising the purpose of this research:
1) Task time is divided proportionally among the CPU, CL1, CL2 (when present), bus, and main memory [17].
2) The dedicated bus connecting the level-1 and level-2 caches introduces negligible delay compared to the system bus connecting CL2 and main memory.
3) A write-back update policy is implemented to reduce CPU utilization; under this policy, the CPU is released immediately after CL1 is updated.
4) Data transfer between the CPU and cache is in frames, and the cache block size is equal to a frame size.
5) A CIF YUV 4:2:0 formatted (352 pixels by 288 lines, 30 fps) video stream file is used to generate a representative .264 file.

6. Results and discussion

In this work, we investigate the impact of various cache design parameters on the performance of the H.264/AVC decoding algorithm for an embedded and mobile system. Our interests include (1) how miss rates are affected by various cache parameters, (2) how performance (namely, utilization and the total number of transactions) is influenced by the presence of CL2, and (3) how utilization changes for video files with different bit-rates. We explore the effects of cache size, line size, and associativity on cache miss rates. Using Cachegrind, we obtain miss rates for I1+D1 cache sizes from 8K+8K to 64K+64K, CL2 sizes from 128K to 4M, line sizes from 16B to 256B, and associativity levels from 2-way to 16-way. In this section, we discuss the impacts of these cache parameters on H.264/AVC decoding.
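The parameter ranges above map to a set of Cachegrind invocations, one per configuration. The sketch below generates the corresponding command lines, mirroring the command format shown in Section 5.1; the decoder binary and config file names are taken from the paper, but the loop itself is our illustrative reconstruction, not the authors' script:

```python
import itertools

# Sweep ranges from Table 4.
CL1_SIZES = [8, 16, 32, 64]                     # KB per side (8K+8K ... 64K+64K)
CL2_SIZES = [128, 256, 512, 1024, 2048, 4096]   # KB
LINE_SIZES = [16, 32, 64, 128, 256]             # B
ASSOCS = [2, 4, 8, 16]                          # n-way

def cachegrind_cmd(cl1_kb, cl2_kb, line, assoc):
    """Command line in the paper's --<cache>=<size>,<assoc>,<line> format."""
    i1 = d1 = f"{cl1_kb * 1024},{assoc},{line}"
    l2 = f"{cl2_kb * 1024},{assoc},{line}"
    return (f"valgrind --skin=Cachegrind --I1={i1} --D1={d1} "
            f"--L2={l2} ldecod.exe decoder.cfg")

cmds = [cachegrind_cmd(c1, c2, ln, a)
        for c1, c2, ln, a in itertools.product(
            CL1_SIZES, CL2_SIZES, LINE_SIZES, ASSOCS)]
print(len(cmds))   # 4 * 6 * 5 * 4 = 480 configurations
print(cmds[0])
```

In practice, of course, the results in Section 6 vary one parameter at a time while holding the others fixed, so far fewer than the full cross-product of runs is needed.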

6.1. Miss rates versus cache parameters

The miss rates due to variation of the CL1 (I1+D1) cache size, with the CL1 line size fixed at 32B and the associativity at 8-way, are shown in Figure 5. We keep CL2 fixed at a cache size of 1M, a line size of 32B, and 8-way associativity. Our simulation results show that the miss rate decreases when the CL1 size is increased from 8K+8K to 16K+16K, and that the decrease in the D1 miss rate is larger than that of I1. For CL1 (I1+D1) sizes of 16K+16K and larger, the decrease is not significant, so using a CL1 (I1+D1) size greater than 16K+16K may not offer any benefit.


Figure 5. Miss ratio versus CL1 (I1+D1) size (CL1: line size 32B, 8-way; CL2: 1M, line size 32B, 8-way)

The miss rates due to variation of the CL2 cache size, with the CL2 line size fixed at 32B and the associativity at 8-way, are shown in Figure 6. In this case, CL1 is fixed at a cache size of 16K+16K, a line size of 32B, and 8-way associativity.

Figure 6. Miss ratio versus CL2 size (CL1: 16K+16K, line size 32B, 8-way; CL2: line size 32B, 8-way)

It is observed that for CL2 sizes of 256K or less the miss rates remain unchanged; from 256K to 2M the miss rates decrease sharply; and for 2M or bigger the miss rates decrease slowly until they reach 0%. From cost, space, and complexity standpoints, a larger CL2 may therefore provide no significant advantage. For certain applications, up to a certain point, increasing the line size may reduce cache miss rates, as shown in Figure 7.

Using Cachegrind, we collect the miss rates for different line sizes while the level-1 cache (I1+D1) size is fixed at 16K+16K, the level-2 cache size at 1M, and the associativity at 8-way. For smaller line sizes, miss rates decrease as the line size increases. After a certain point, known as the cache pollution point (128B for D1), miss rates start increasing: for line sizes of 128B or more, each miss requires more data to be read and possibly written back.

Using Cachegrind, we also collect miss rates by varying the associativity for a CL1 (I1+D1) cache size of 16K+16K, a CL2 size of 1M, and a line size of 32B. The miss rates for different levels of associativity are shown in Figure 8. The miss rates decrease significantly when going from 2-way to 4-way associativity; from 8-way upward, the changes are not significant.

Figure 8. Miss ratio versus associativity (CL1: 16K+16K, line size 32B; CL2: 1M, line size 32B)

6.2. Performance versus level-2 cache (CL2)

Figure 7. Miss ratio versus cache line size (CL1: 16K+16K, 8-way; CL2: 1M, 8-way)

Using VisualSim, we investigate the impact of the presence of a level-2 cache. The CL2 size is varied, and the performance metrics, namely the total number of transactions and the CPU utilization, are calculated. We keep the CL1 (I1+D1) size fixed at 16K+16K, the line size at 32B, and the associativity at 8-way, and change the CL2 size from 128K to 4M while keeping the CL2 line size fixed at 32B and its associativity at 8-way. In our simulation, memory references are initiated at the CPU and referred to CL1 (D1 or I1); if not satisfied there, they are referred to CL2; finally, unsuccessful requests are satisfied from the main memory (MM) via the shared bus.

Simulation results show that the total number of transactions reaching the bus and memory decreases as the CL2 size increases, as shown in Figure 9. We run our simulation program for a total of 1M transactions; 380,000 of these transactions are data and the remaining 620,000 are instructions [Table 1]. All transactions are first referred to D1 or I1 by the CPU. For a D1 miss ratio of 1.0% and an I1 miss ratio of 0.4%, (3,800 + 2,480) = 6,280 requests go to CL2. For a CL2 size of 128K (miss ratio 0.3%), only 19 requests go to MM via the shared bus. For a CL2 size of 2M (miss ratio 0.03%), only 2 requests go to MM, and for a CL2 size of 4M or higher, no requests go to the main memory as the miss ratio becomes 0.0%.
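The request counts above follow directly from the miss ratios and can be checked with a few lines of arithmetic (the inputs are the paper's numbers; rounding to whole requests is our assumption):

```python
TOTAL = 1_000_000
data_refs, instr_refs = 380_000, 620_000   # Table 1 split: 38% data, 62% instr

d1_miss, i1_miss = 0.010, 0.004            # D1 miss 1.0%, I1 miss 0.4%
to_cl2 = data_refs * d1_miss + instr_refs * i1_miss
print(int(to_cl2))                         # -> 6280 requests reach CL2

for label, cl2_miss in (("128K", 0.003), ("2M", 0.0003), ("4M", 0.0)):
    print(label, round(to_cl2 * cl2_miss))  # -> 19, 2, 0 requests reach MM
```

The filtering effect is striking: out of one million CPU references, fewer than twenty ever touch the shared bus, which is why CL2 size dominates the bus and memory transaction counts in Figure 9.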

Figure 9. Total number of transactions versus CL2 size (CL1: 16K+16K, line size 32B, 8-way; CL2: line size 32B, 8-way; total of 1M transactions)

Studies show that the addition of CL2 decreases bus traffic and latency and improves system performance for multimedia applications. In [11], we present the impact of the presence of CL2 on CPU utilization for MPEG-4 (Part 2) video. Figure 10 shows the impact of CL2 size variation on CPU utilization (with and without CL2) for H.264/AVC video. CPU utilization decreases as the CL2 size increases. Between CL2 sizes of 256K and 1M this decrease is significant; for CL2 sizes of 128K or smaller and 2M or bigger, the change is not significant.

Figure 10. CPU utilization versus CL2 size (with and without CL2; CL1: 16K+16K, line size 32B, 8-way; CL2: line size 32B, 8-way)

Between CL2 sizes of 256K and 1M, increasing the cache size decreases the miss rates. With a decreased miss rate, the CPU gets the required values faster than before, and the reduction in memory access time decreases CPU utilization.

6.3. Utilization versus task-rate

Using the VisualSim simulation model, we investigate the impact of video files with different bit-rates on CPU utilization. Bit-rates may differ for various reasons, including quality of service and the degree of motion in the scene (for example, a landscape pan is coded differently from a football match). In our simulation, we keep CL2 fixed. We start with a task-rate of 0.1 tasks per unit time and increase it up to 2 tasks per unit time. Simulation results show that utilization increases with increased task-rate. At a task-rate of 2 tasks per simulation time unit, utilization is close to 100% for a CL1 (I1+D1) size of 8K+8K, as shown in Figure 11.

Figure 11. CPU utilization versus task-rate (CL1: line size 32B, 8-way; CL2: 1M, line size 32B, 8-way)

A further increase in task-rate (with the other parameters unchanged) causes the system to break down: the system cannot process the tasks, so the task queue starts dropping them. A bigger cache size should reduce the utilization; a CL1 (I1+D1) size of 64K+64K reduces CPU utilization by approximately 30%.

7. Conclusion

Cache memory is used to bridge the processor-memory speed gap. Decoding software running on an embedded system generates more than the required cache-memory traffic, so a customized cache subsystem may improve the performance of an embedded system running a decoding algorithm. In our previous work, we optimized cache parameters for a mobile device running an MPEG-4 decoder. In this paper, we focus on enhancing H.264/AVC decoding performance through cache optimization of an embedded and mobile device. The architecture includes a processor to run the decoding algorithm and a two-level cache system. We optimize cache size, line size, associativity, and cache levels for the system. Simulation results show that H.264/AVC decoding performance can be enhanced by cache optimization. Many embedded and mobile devices require single-chip implementations of multiprocessor systems to meet performance, application size, and energy/power consumption requirements; we are currently exploring such architectures for multimedia applications.

8. References

[1] A. Jerraya, H. Tenhunen, and W. Wolf, "Multiprocessor Systems-on-Chips", IEEE Computer Society, July 2005, pp. 36-40.
[2] E. Rotenberg, "Architectures for Real-Time", www.tinker.ncsu.edu/ericro/research/realtime.htm, 2005.
[3] "AMD Multi-core Opteron CPU", en.wikipedia.org/wiki/Opteron, 2005.
[4] "Enabling Video in Mobile Devices", www.hantro.com/pdf/Hantro2005.pdf, 2005.
[5] A. Asaduzzaman and I. Mahgoub, "Evaluation of Application-Specific Multiprocessor Mobile System", Proceedings of the 2004 Symposium on Performance Evaluation of Computer Telecommunication Systems, San Jose, CA, July 2004, pp. 751-758.
[6] C. Kulkarni, F. Catthoor, and H. DeMan, "Hardware cache optimization for parallel multimedia applications", Proceedings of the 4th International Euro-Par Conference on Parallel Processing, 1998, pp. 923-932.
[7] D. W. Leitner, "What's in a name: H.264", videosystems.com/e-newsletters/whats_name_h264/, 2005.
[8] P. Soderquist and M. Leeser, "Optimizing the Data Cache Performance of a Software MPEG-2 Video Decoder", ACM Multimedia 97 – Electronic Proceedings, Seattle, WA, November 1997.
[9] N.T. Slingerland and A.J. Smith, "Cache Performance for Multimedia Applications", portal.acm.org/ft_gateway.cfm?id=377833&type=pdf.
[10] A.M. Molnos, B. Mesman, et al., "Data Cache Optimization in Multimedia Applications", Proceedings of the 14th Annual Workshop on Circuits, Systems and Signal Processing, ProRISC 2003, Veldhoven, The Netherlands, November 2003, pp. 529-532.
[11] A. Asaduzzaman, I. Mahgoub, et al., "Cache Optimization for Mobile Devices Running Multimedia Applications", Proceedings of the IEEE Sixth ISMSE, Miami, FL, December 2004, pp. 499-506.
[12] I. Richardson, "H.264 / MPEG-4 Part 10 White Paper", www.vcodex.com, 2002.
[13] ITU-T Rec. H.264/ISO/IEC 14496-10, "Advanced Video Coding", Final Committee Draft, Document JVT-E022, September 2002.
[14] R. Schaphorst, "Videoconferencing and Videotelephony – Technology and Standards", Artech House, Norwood, MA, 2nd edition, 1999.
[15] D. Kim, Y. Yi, and S. Ha, "Trace-Driven HW/SW Cosimulation Using Virtual Synchronization Technique", DAC 2005, Anaheim, CA, June 2005.
[16] Cachegrind – a cache profiler: Valgrind; http://valgrind.kde.org/index.html
[17] VisualSim – system-level simulator: Mirabilis Design, Inc.; http://www.mirabilisdesign.com/
[18] K. Sühring, "H.264/AVC Reference Software"; http://iphome.hhi.de/suehring/tml/download/
[19] A. Maxiaguine, S. Kunzli, and L. Thiele, "Workload Characterization Model for Tasks with Variable Execution Demand", Computer Engineering and Networks Laboratory, Swiss Federal Institute of Technology (ETH) Zurich, Switzerland.
[20] A. Avritzer, J. Kondek, D. Liu, and E.J. Weyuker, "Software Performance Testing Based on Workload Characterization", WOSP '02, Rome, Italy, July 2002.
[21] A. Maxiaguine, Y. Liu, S. Chakraborty, and W. T. Ooi, "Identifying 'representative' workloads in designing MpSoC platforms for media processing", The 2nd Workshop on Embedded Systems for Real-Time Multimedia, 2004, pp. 41-46.

