A Low-Power Memory Architecture with Application-Aware Power Management for Motion & Disparity Estimation in Multiview Video Coding

Bruno Zatt1,2, Muhammad Shafique1, Sergio Bampi2, Jörg Henkel1

1 Karlsruhe Institute of Technology (KIT), Chair for Embedded Systems, Karlsruhe, Germany
2 Federal University of Rio Grande do Sul (UFRGS), Informatics Institute/PGMICRO, Porto Alegre, Brazil
{bruno.zatt, muhammad.shafique, henkel}@kit.edu, {bzatt, bampi}@inf.ufrgs.br

Abstract—A low-power architecture for an on-chip multi-banked video memory for motion and disparity estimation in Multiview Video Coding is proposed. The memory organization (size, banks, sectors, etc.) is driven by an extensive analysis of memory-usage behavior for various 3D-video sequences. Considering a multiple-sleep state model, an application-aware power management scheme is employed to reduce the leakage energy of the on-chip memory. Knowledge of the motion and disparity estimation algorithm, in conjunction with video properties, is used to predict the memory requirements of each Macroblock. A cost function is evaluated to determine an appropriate sleep mode for the idle memory sectors, while considering the wakeup overhead (latency and energy). The complete motion and disparity estimation architecture is implemented in a 65nm low-power IBM technology. The experiments (for various test video sequences) demonstrate that our architecture provides up to 80% leakage energy reduction compared to the state of the art. Our scheme processes motion and disparity estimation for encoding four HD1080p views at 30fps with a power consumption of 57mW.


on-chip video memory to avoid frequent off-chip memory accesses, thus reducing the off-chip memory energy. State-of-the-art techniques further reduce the off-chip memory energy by employing search window reuse [6][19] and asymmetric search windows [18]. Because fast ME and DE algorithms like TZ Search [4]³ are adaptive, several pixel regions in the rectangular search window are never accessed, depending upon the search pattern. Therefore, recent techniques [11][22] propose dynamic window sizing algorithms in order to reduce the energy of the on-chip video memory. However, [11] does not provide leakage reduction, which is of crucial importance in deep sub-micron technologies. The work of [22] does not exploit the correlation in the 3D-neighborhood (i.e. spatial, temporal, and view domains) or the memory usage statistics of different MBs and frames, and thus provides limited leakage reduction.

Keywords: Multiview video coding, motion estimation, disparity estimation, low power, application-aware power management, power-gating, on-chip memory, video memory

I. INTRODUCTION AND MOTIVATION

[Fig. 1, panel "ME/DE Energy Breakdown": Computation, On-Chip Memory, Off-Chip Memory; extracted values: 10%, 45%, 45%]

Fig. 1 Energy consumption breakdown of MVC showing the energy of computations, on-chip and off-chip memory for ME/DE⁴

Typically, the on-chip video memories⁵ are one of the primary sources of leakage energy, due to their frequent usage and significant footprint. For instance, a memory of 1.2 Mbits is required to support the search range of ±96 and four prediction directions. Note, a prediction direction refers to the relative position of a reference frame with respect to the current frame. ME searches in frames in the left and right directions, while DE searches in the top and down directions. Our memory-requirement analysis in Section I.A demonstrates that the amount of required on-chip memory varies significantly, as not all parts of the search window are accessed due to (a) the adaptive nature of ME and DE algorithms, (b) differently-sized search windows used for ME and DE [23], and (c) diverse texture and motion properties of different MBs. Therefore, a reduced-sized multi-bank on-chip memory is required for ME and DE in MVC, where each bank is partitioned into multiple sectors that can be independently power-gated (i.e. switching off the power supply to the memory sectors using sleep transistors [14]) depending upon the memory requirements of different frames or even MBs. However, power-gating comes with a wakeup overhead (in terms of latency and energy) for powering on, which can be significant. To cope with similar issues for register files and in-order cores, recent advancements in sleep transistors have enabled (i) power-gating with multiple sleep modes [12][13][14], each having different leakage savings and wakeup overhead; and (ii) state-retentive power-gating

¹ Multiple cameras capture several video sequences of a 3D-scene.
² Authors in [5] demonstrated a similar fact for previous MPEG standards.

978-1-4577-1400-9/11/$26.00 ©2011 IEEE

[Fig. 1 plot: energy consumption of ME/DE and Others over Search Range [±16], [±32], [±64], [±96]]

The advancements in capture/display technologies and incessantly escalating consumer interest in 3D-multimedia have ignited research and development for low-power 3D-applications (e.g., multiview video recording and playback) on battery-powered mobile devices [1][2]. Multiview Video Coding (MVC) [3] has evolved as a coding standard for multiview videos¹. It jointly exploits temporal and inter-view redundancies to provide 20%-50% improved compression compared to the H.264 video encoder, at the cost of significantly higher energy consumption due to motion and disparity estimation (ME, DE) [1]. Our MVC energy analysis in Fig. 1 illustrates that ME and DE may consume >90% of the total MVC energy when encoding the “Ballroom” sequence (resolution: 640x480, content: high texture, high motion) for different search ranges. This makes ME and DE the primary research focus for low-power design when targeting the realization of high-definition MVC on battery-powered mobile devices. The ME and DE algorithms search for the best match for a Macroblock (MB) in a pre-defined search window from a reference frame/view using a block matching process based on the Sum of Absolute Differences (SAD), which incurs enormous computation and memory requirements. The breakdown in Fig. 1 shows that approximately 90% of the ME/DE energy is consumed by the (off-chip + on-chip) memory accesses, which are due to the huge number of reference pixels fetched from the memory and used in SAD computations². Moreover, the increase in ME/DE energy for larger search windows is mainly due to the increased memory energy. Since the pixels in a (rectangular) search window are accessed multiple times, state-of-the-art techniques [6]-[11] employ search window prefetching in the


³ These fast ME and DE algorithms are necessary to curtail the number of search candidates to meet throughput constraints with minimal quality loss.
⁴ TZ Search [4] is used as the fast ME/DE algorithm; it is up to 23x faster than the Full Search while providing comparable video quality.
⁵ Throughout this paper, ‘on-chip memory’ denotes an ‘on-chip video memory’.

needs to be taken, as a misprediction may incur significant misses and thus a high penalty in terms of re-fetching from the external memory and wakeup of additional memory sectors. The less scattered distribution in the box-plot hints at an extensive correlation in the 3D-neighborhood, as MBs of the same object in the neighboring frames and views exhibit similar memory requirements. This fact becomes apparent in the 3D plot of Fig. 3. Therefore, the memory requirements of a frame may be predicted (with a high accuracy) by exploiting the correlation in the 3D-neighborhood, i.e. the memory requirements of the neighboring frames. The frame-level prediction can be further refined by considering the MB-level properties.

[14], the contents of the memory are preserved at the cost of reduced leakage savings; see Section III for the memory and power model. Here, the challenge is to identify the idle periods for different sectors of the on-chip memory and to choose an appropriate sleep mode. However, the idle sectors and their idle periods are hard to predict due to the assorted memory requirements of different MBs, which stem from their diverse texture and temporal properties. State-of-the-art power-gating approaches typically predict the idle memory sectors by monitoring the history of memory usage. However, these approaches ignore the following two important pieces of knowledge that can provide a better estimate of the memory requirements, and consequently an increased potential for leakage savings: i) algorithm-specific knowledge, like the ME and DE algorithms and the number of candidates in different search patterns, ii) video-content properties, like spatial (variance, texture, brightness) and temporal (SADs, motion and disparity vectors) properties of different MBs/frames in the 3D-neighborhood. We believe that the application (considering algorithm and video data) has the best knowledge of its memory requirements at a certain point in time. Consequently, compared to state-of-the-art history-based schemes, an application-aware power management scheme provides a better prediction of the memory requirements. Therefore, raising the abstraction level of the power-gating decision to the application level provides a higher potential for leakage energy savings. The goal of this work is to reduce the memory power of ME and DE in MVC by employing a reduced-sized on-chip memory with an application-aware power management scheme. In the following, we present a detailed memory analysis that provides the motivation for this work, followed by our novel contributions and idea overview.

[Figure: Memory Usage [KBytes] over Time (µs) for ME and DE; extracted labels: Vassar, Ballroom, Exit]
Fig. 4 Memory usage variation within one video frame

[Box-plot legend: 3rd Quartile (75%), 2nd Quartile (50%), 1st Quartile (25%); extracted labels: Exit, Group-2]

Fig. 5 shows four excerpts of the “Ballroom” sequence illustrating the memory access behavior (within the white lines) of MBs with low and high motion/disparity. The memory access behavior of the MB with low motion/disparity is less spread and focused towards the centre, so less memory is used, within a smaller vicinity. In contrast, the memory access behavior of the MB with high motion/disparity shows that memory from a wider region is accessed; see the multiple displaced diamond patterns. Fig. 4 and Fig. 5 demonstrate that the memory requirements of an MB can be accurately predicted by considering its spatial and temporal properties and the memory requirements of MBs of the same group.

[Figure: extracted labels Minimum, Maximum, Average, Ballroom, Group-1, ME]
Fig. 3 3D-plots showing the similarity in memory usage

Instead of a Full Search (impracticable due to its gigantic computation and energy requirements), the adaptive fast ME/DE algorithm TZ Search [4] is deployed for this analysis to represent a real-world embedded system scenario. Such adaptive algorithms are typically based on multiple search stages and patterns and process a different number of search candidates for different MBs, thus exhibiting a highly varying memory usage profile, as shown in Fig. 2 and Fig. 4.


When further analyzing the memory requirements within a frame (see Fig. 4 for the “Ballroom” sequence), two different variation zones are noticed in ME that correspond to two different groups of MBs, where the MBs in a group have similar spatial and temporal properties. MBs in group-1 exhibit a low variation in their memory usage, while MBs in group-2 exhibit a high variation. The distinction between the two groups can be made by evaluating the spatial and temporal properties of MBs. Depending upon the group-level variations, a low-leakage or high-leakage sleep mode may be selected. The large variations for DE are primarily due to the bigger search performed by the TZ algorithm for capturing longer disparity vectors.

A. Motivational Case Study and Memory Usage Analysis

[Figure: axis Memory Usage [KBytes]; annotation: “The frame-level memory usage can be predicted using the temporal neighbors”; extracted label: Vassar]

Fig. 2 Box-plot showing the summary of memory usage of various macroblocks for ME and DE

Fig. 2 shows the box-plot summary of memory usage by different MBs in various video sequences for an on-chip memory of size 37.25 KBytes storing a search window of 193x193 (corresponding to a recommended search range of ±96 [23]) for one prediction direction. However, the maximum memory usages are 20.9 KBytes and 23.2 KBytes, which correspond to memory wastage of 44% and 38% for ME and DE, respectively. Still, most of the MBs require much less memory than the maximum requirements. The minimum and maximum memory requirements vary for different video sequences due to their spatial and temporal properties. In the worst case, more than 80% of the on-chip memory may be idle, leading to significant energy wastage due to leakage. Fig. 2 shows that, in the case of ME, the box plot is less scattered and close to the average, which demonstrates a high correlation in the memory usage profile for ME. The observation is different for DE, where the range between the 25% and 75% quartiles is wider than that of ME. Still, the 75% quartile is well below the maximum usage. However, care

[Fig. 5 panels: High Motion, Low Motion, High Disparity, Low Disparity]

Fig. 5 Comparing the memory access patterns (within the white lines) of MBs with slow and fast motion/disparity

Summarizing the analysis, an application-aware power management scheme for an on-chip video memory needs to consider the knowledge of the ME and DE algorithm, the spatial and temporal video properties at both frame and MB levels, and the correlation in the 3D-neighborhood to determine the number of idle sectors and an appropriate sleep mode for each sector of the memory.


Search pattern and does not consider disparity estimation. Additionally, these works do not consider power-gating techniques to reduce the on-chip memory leakage, which represents a crucial share of the total energy consumption. Generic techniques for reducing the on-chip SRAM leakage (like [12], [13]) propose memories with multiple sleep modes in order to better exploit the leakage vs. wake-up penalty tradeoff. State-retentive power-gating of register files featuring multiple sleep modes is presented in [14]. However, controlling these memories requires efficient power management. In [25] the hardware power-gating is controlled by monitoring the underlying hardware. Such observation-based techniques may lead to mispredictions, especially in case of sudden variations. The techniques in [24][26] consider application knowledge for a video decoder case study, but they only exploit the knowledge at frame level. These techniques consider longer periods and may not cope with severe variations at the MB level. Authors in [30] presented an adaptive pipelined MPSoC for H.264/AVC with a run-time system that exploits the knowledge of macroblock (MB) characterization based on their spatial and temporal properties [27][28] to predict the workload. Based on this knowledge, unused processors are clock-gated or power-gated. These techniques may not be power-efficient in MVC, as they cannot exploit the correlation in the 3D-neighborhood and the spatial and temporal properties of the video data. A multi-level pipelined architecture for fast ME and DE in MVC is proposed in [29], which is equipped with a dynamic window sizing scheme and an on-chip memory [22]. The unused sectors of the memory are gated by considering the memory requirements of consecutive MBs that amortize the power-gating overhead [22].
However, these techniques do not exploit (a) the distribution of memory usage at frame and MB levels, (b) the memory usage correlation in the 3D-neighborhood, and (c) memories with multiple sleep modes. Therefore, these techniques provide limited leakage savings. Unlike the above-discussed related work, our work considers extensive application-specific knowledge at various levels (prediction direction, frame, etc.) along with the memory usage correlation in the 3D-neighborhood for making the power-gating decision in the on-chip video memories. It thereby provides significant leakage reduction, as the algorithms may know (in advance) their exact memory requirements.

B. Our Novel Contributions and Idea Overview

A novel low-power video memory architecture for motion and disparity estimation (ME, DE) in MVC is proposed that employs:

1). An on-chip multi-banked video memory (Section IV): based on the offline memory usage analysis, an algorithm is proposed to determine the size of the on-chip memory by evaluating the tradeoff between leakage reduction and misses (as a result of the reduced-sized video memory). Afterwards, the organization (banks, sectors) is obtained by considering the throughput constraint. Each bank is partitioned into multiple sectors to enable fine-grained power management control. The data for each prediction direction is stored in distinct sectors.

2). An application-aware power management scheme for the on-chip memory (Section V): a multi-level power management scheme is employed. First, depending upon the current prediction direction (top, left, down, right, i.e. using the knowledge from the application that determines a prediction direction), different sectors can be completely power-gated. Afterwards, frame-level memory requirements are predicted by taking the weighted average of the neighboring frames in the 3D-neighborhood. Then, consecutive MBs with similar spatial and temporal properties are grouped together, and sleep modes for their idle sectors are determined by evaluating a cost function of leakage savings and wakeup overhead. In the last step, the power-gating control of different sectors is refined at the MB level.

[Fig. 6 block labels: Application-Aware Power Management Unit (Section V); On-Chip Multi-Bank Video Memory (Section IV); Memory Requirement Predictor (Section V); Application knowledge; (e.g., variance, brightness); Multiview Video Memory; Monitoring; SAD; CPU (executing ME DE algorithm); Hardware Accelerators; Other Modules of an MVC Encoder]

Fig. 6 MVC with motion and disparity estimation hardware showing our novel contribution in blue filled boxes

To the best of our knowledge, this is the first multi-banked video memory architecture that employs a multi-level application-aware power management scheme to enable low-power motion and disparity estimation in MVC. The proposed architecture and power management scheme require knowledge of the ME and DE algorithms and of the search window prefetching technique in order to perform a memory-requirement analysis, though our concept is not limited to any fixed algorithm. Fig. 6 presents an MVC encoder with a joint ME/DE hardware architecture, showing our novel contribution in blue filled boxes.

III. MEMORY AND POWER MODEL

Now, we present the memory and power model used in this work. The on-chip video memory is partitioned into NBanks banks, such that the rows of an MB are stored in different banks to provide parallel data access for the SAD accelerator hardware in order to support high-throughput constraints. Each bank Bi, i ∈ [1…NBanks], is composed of NSector equally-sized sectors. Each sector consists of SSector bytes organized in memory lines, where the size of one memory line is given as NBLine. This implies that the number of lines in a sector Sij is SSector/NBLine. Fig. 7 shows an abstract diagram of our memory organization.
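The memory organization described above can be sketched as a small data structure. This is a minimal illustration, not the authors' implementation: the class name, the method names, the round-robin row-to-bank mapping, and all numeric values are assumptions introduced here; only the parameters NBanks, NSector, SSector, and NBLine and the relation "lines per sector = SSector/NBLine" come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryOrganization:
    """Parameters of the banked memory model (names follow the text)."""
    n_banks: int       # N_Banks: number of banks
    n_sectors: int     # N_Sector: sectors per bank
    sector_bytes: int  # S_Sector: bytes per sector
    line_bytes: int    # NB_Line: bytes per memory line

    def lines_per_sector(self) -> int:
        # The number of lines in a sector S_ij is S_Sector / NB_Line.
        if self.sector_bytes % self.line_bytes:
            raise ValueError("S_Sector must be a multiple of NB_Line")
        return self.sector_bytes // self.line_bytes

    def total_bytes(self) -> int:
        # Every bank holds N_Sector equally-sized sectors.
        return self.n_banks * self.n_sectors * self.sector_bytes

    def bank_of_mb_row(self, row: int) -> int:
        # ASSUMPTION: rows are interleaved round-robin over the banks; the
        # text only states that rows of an MB are stored in different banks.
        return row % self.n_banks


# Illustrative values only (not taken from the paper).
org = MemoryOrganization(n_banks=16, n_sectors=8, sector_bytes=256, line_bytes=16)
print(org.lines_per_sector(), org.total_bytes(), org.bank_of_mb_row(17))
```

With one bank per MB row, the 16 rows of a 16x16 MB would map to 16 distinct banks under this assumed interleaving, which is what allows the parallel SAD access mentioned in the text.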

II. RELATED WORK

A search window-based data reuse scheme for H.264 is proposed in [6] to reduce the off-chip memory power. The work in [7] presents an MVC encoder that employs the full search and requires the fetching and storage of the complete search window. It suffers from excessive leakage due to big on-chip memories, especially for larger search ranges. The architecture proposed in [19] employs multiple on-chip memories that may be used independently for search in different reference frames or merged to search in a big search window of a given reference frame. This approach is inefficient for MVC, which requires significantly larger search ranges for all prediction directions. To deal with the search window size, a search window follower scheme is presented in [8], while [9] presents a candidate-level data reuse scheme. In [10] a search range reduction scheme predicts the center of the search window based on the neighborhood. The work in [11] proposes a cache algorithm for ME with a prefetching scheme. However, it is limited to a fixed Four Step

[Fig. 7 labels: Memory Bank; Memory Line; Sector; Bank1, Bank2, …, Bankn; Sleep Transistor (ST); Vdd; Ctrl.; Group of sectors which is gated with a common circuitry; Application-Aware Power Manager for On-Chip Video Memory. Further extracted labels: Extracting the Spatial Properties of Videos; Offline ME/DE Analysis (Section I.A, IV, V)]

Fig. 7 Architectural model of the on-chip multi-banked memory with sleep transistors and application-aware power manager

All sectors Sij, i ∈ [1…NBanks], are connected to a power-gate circuitry STj in order to simultaneously power-gate the Sij sectors of all banks. In this paper, we assume the power-gate model with multiple sleep


KBytes). The maximum requirement is < 20 KBytes for a search range of ±96, i.e. 54% of the total size of a rectangular search window. Storing the complete rectangular search window therefore leads to increased leakage. However, using a rectangular search window ensures no misses, as all the data is always available in the on-chip memory. If a reduced-sized memory is used, the probability of misses increases. This fact is illustrated in Fig. 10 with the help of two histograms from our memory miss analysis. For the experiments in Fig. 10, a rectangular search window is considered, which is clustered into 16x16-pixel regions fetched independently on demand. After the ME and DE are performed for MBx, MBx+1 is processed. During the ME and DE processing of MBx+1, a miss is counted for each 16x16-pixel region that is not available in the on-chip memory. The histograms show that the number of misses decreases radically with an increase in the memory size. The reduction rate is especially significant for ME. The challenge is to obtain a size of the on-chip video memory for ME and DE such that the leakage savings due to the reduced size (compared to the rectangular search window) are balanced against the energy overhead due to misses.

modes (like in [12][13][14]), where each sleep mode has a certain leakage savings at the cost of a wakeup energy and latency overhead. Therefore, using multiple sleep modes provides the foundation to exploit the wake-up overhead vs. leakage saving tradeoff. Different sleep modes are typically realized by controlling the virtual ground bias using footer transistors. Fig. 8 shows the power state machine (PSM), where each sector can be power-gated in one of the three sleep modes, i.e. S1, S2, and S3. The S0 mode corresponds to the powered-on state. The PSM is given as PSleepMode = {S0, S1, S2, S3}. For the S0 mode, the leakage energy is computed based on the drain current ‘I’ and Vdd, i.e. ES0 = Σi Vdd·Ii·ti. The S1 and S2 modes are intermediate state-retentive sleep modes, i.e. the data inside the memory cells is preserved and these modes do not require re-fetching of data from the off-chip memory. For these sleep modes, the total energy is computed as ES1 = ES0×ΦS1 and ES2 = ES0×ΦS2, where ΦS1 and ΦS2 are calculated using the design curves for footer gate bias vs. normalized leakage and footer gate bias vs. virtual ground voltage, as discussed in [12]. The S3 mode is state-non-retentive, i.e. data is lost and it requires re-fetching from the off-chip memory. It is also termed the powered-off state, and the wakeup energy from S3 to S0 depends upon the capacitance (Ccircuit) and Vdd: Ewake_up = ½·Ccircuit·Vdd² (see Fig. 8). The wake-up penalty for other transitions depends on the scaling factors ξx for wake-up energy and ρx for wake-up latency, where x represents the transition, x ∈ {T1, T2, T3}. The scaling factors are obtained from the design curves for normalized leakage vs. normalized wakeup penalty, as discussed in [12]. The wakeup latencies of S1 and S2 are quite short (see Section VI for values), thus these modes are beneficial for short sleep durations, e.g., in Group-2 with fast variations of memory usage by different MBs (as discussed in Section I.A).
In contrast, S3 is beneficial for longer sleep durations. Correspondingly, the S1 and S2 modes provide reduced leakage savings compared to S3.
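The energy terms of this power model can be written down as a few small functions. The sketch below is illustrative only: the function names are hypothetical, the sample values in the usage line are invented, and the way the scaling factor ξx is applied (as a multiplier on the S3 to S0 wakeup energy) is an assumption, since the text only says that the penalty of the other transitions "depends on" ξx.

```python
from typing import Sequence, Tuple


def leakage_energy_s0(vdd: float, samples: Sequence[Tuple[float, float]]) -> float:
    """E_S0 = sum_i Vdd * I_i * t_i over (current [A], duration [s]) samples."""
    return sum(vdd * current * duration for current, duration in samples)


def leakage_energy_retentive(e_s0: float, phi: float) -> float:
    """E_Sx = E_S0 * Phi_Sx for the state-retentive modes S1 and S2."""
    if not 0.0 <= phi <= 1.0:
        raise ValueError("phi is a normalized leakage factor in [0, 1]")
    return e_s0 * phi


def wakeup_energy_s3(c_circuit: float, vdd: float) -> float:
    """E_wake_up = 1/2 * C_circuit * Vdd^2 for the S3 -> S0 transition."""
    return 0.5 * c_circuit * vdd ** 2


def wakeup_energy_scaled(c_circuit: float, vdd: float, xi: float) -> float:
    # ASSUMPTION: xi_x scales the S3 -> S0 wakeup energy for transition x.
    return xi * wakeup_energy_s3(c_circuit, vdd)


# Invented sample values, for illustration only.
e_s0 = leakage_energy_s0(1.0, [(0.002, 1.0), (0.001, 2.0)])
print(e_s0, leakage_energy_retentive(e_s0, 0.5), wakeup_energy_s3(2e-9, 1.0))
```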

[Figure: two histograms, ME: SR[±96] and DE: SR[±96]; x-axis Memory Usage [KBytes]]
Fig. 9 Histograms of memory usage during ME and DE. The size of the rectangular search window is 37.25 KBytes (Search Range: ±96)

[Figure: two plots, ME and DE; x-axis Mem. Size [KBytes], y-axis #Misses (x10³); size options s1 to s4 marked]
Fig. 10 Analyzing the effect of memory size on the number of misses

Fig. 11 presents the pseudo-code for the proposed algorithm to perform memory size exploration for a given prediction direction. The input is a set of different size options S={s1, s2, …, sn}, obtained from an extensive memory usage analysis for a set A of various video sequences (slow to fast motion), as exemplified in Fig. 9 and Fig. 10. Further inputs are the leakage energy of the rectangular search window (ELeakRecSW); a set B of test video sequences, which are different from the set A to avoid biasing towards the offline analysis; and the prediction direction (dir). Different search window sizes are evaluated in a loop in decreasing order, i.e. starting from the larger window sizes (lines 4-11). The candidate size s is evaluated in a miss analysis by performing video encoding tests for the set B of video sequences, and the energy for misses (EMissTotal) is estimated using Eq. 2 (line 5). Depending upon the duration of ME and DE, the leakage energy of the on-chip memory (ELeak) for a given size is estimated using Eq. 1 (line 5). Afterwards, the energy profit (EProfit) is computed w.r.t. the rectangular search window size as the net energy savings, considering the leakage-energy saving and the miss-energy overhead (line 6). The search window size with the best energy profit (sBest) is selected and returned (lines 7-10, line 12). After the size for a given prediction direction (Sdir) is obtained, the total memory size for Ndir prediction directions is computed as STotal = Σ(i=1…Ndir) Sdir_i. As discussed in Section III, the memory is partitioned into banks to provide parallel access to different MB-lines for parallel SAD computation. The number of

Fig. 8 Power State Machine with multiple sleep modes [12]

The leakage energy of the total on-chip video memory and the energy for memory misses are given as:

$$E_{Leak} = P_{Leak} \times T_{MEDE} \qquad (1)$$

$$E_{MissTotal} = E_{Miss} \times N_{Miss}, \qquad E_{Miss} = E_{offChipAccess} + E_{HWstall} + E_{lineFilling} \qquad (2)$$

PLeak is the accumulated leakage power for the total on-chip video memory. TMEDE is the time for performing motion and disparity estimation including the miss latency. EMiss is the energy required for one memory miss, which includes the energy to fetch data from the off-chip memory (EoffChipAccess), additional energy due to the stalling of the SAD hardware (EHWstall), and the energy to fill the memory line (ElineFilling).

IV. ARCHITECTURE OF OUR MULTI-BANK VIDEO MEMORY

The goal is to determine an appropriate size of the on-chip video memory and its organization in terms of number of banks, sectors in a bank, etc. The parameters that can affect the size of the on-chip video memory are: (a) motion and disparity estimation (ME, DE) search algorithm, number of search candidates in different search stages; (b) search range that also depends upon the video resolution; and (c) spatial and temporal video properties. Fig. 9 shows the histogram of memory accesses during ME and DE for a search range of ±96. The memory usage in ME and DE is much less than the size of a rectangular search window for one prediction direction (37.25


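$$PM_1 \leftarrow F(\mu+\sigma;\ \mu, \sigma^2) - F(0;\ \mu, \sigma^2) \approx 0.84 \qquad (6)$$

$$PM_2 \leftarrow F(\mu+2\sigma;\ \mu, \sigma^2) - F(0;\ \mu, \sigma^2) \approx 0.975 \qquad (7)$$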

banks is computed (Eq. 3) depending upon the given throughput constraints (frame rate FRate in fps, frames per second) and the video resolution (W×H, width and height of the video in pixels).

$$N_{Banks} = \frac{1}{NB_{Line}} \times \left( \frac{f \times 10^{6}}{\frac{W \times H}{256} \times F_{Rate} \times N^{SAD}_{Avg\_dir} \times N_{dir}} \right) \qquad (3)$$

N^SAD_Avg_dir is the average number of SADs per MB and it depends upon the search algorithm. The frequency of the ME/DE hardware is denoted as f in MHz. Afterwards, the size of a sector (SSector in bytes, Eq. 4) is computed by considering the variations in the usage profile (see Fig. 9) in order to increase the potential of power-gating for different MBs that exhibit diverse spatial and temporal properties.

$$S_{Sector} = \left\lfloor (Usage_{Max} - Usage_{Min}) / Usage_{StdDeviation} \right\rfloor \qquad (4)$$

The total number of sectors for each direction ($N^{dir}_{Sector}$) is computed as:

$$N^{dir}_{Sector} = \left\lceil S_{dir} / (N_{Banks} \times S_{Sector}) \right\rceil \qquad (5)$$

Fig. 12 Statistical Distribution of Memory Requirements (ME, DE)

These predicted memory requirements are then forwarded to the algorithm of the application-aware power management (Fig. 13) as a tuple: MROffline={PM3, PM2, PM1}. Further inputs are the prediction direction (dir), camera view (v), video frame (f), total size of the on-chip memory (STotal), and size of a sector (SSector). The power management scheme is explained step-by-step in the following.

1.  DetermineVideoMemorySize(ELeakRecSW, B, dir, S)
2.  BEGIN
3.    EBestProfit ← 0; sBest ← 0;
4.    For all s ∈ S // evaluate sizes in a decreasing order
5.      (ELeak, EMissTotal) ← performMEDE(s, B, dir); // see Eq.1 and Eq.2
6.      EProfit = (ELeakRecSW – ELeak) – (EMissTotal);
7.      If (EProfit ≥ EBestProfit) Then
8.        sBest = s;
9.        EBestProfit = EProfit;
10.     End If;
11.   End For
12.   return sBest;
13. END

Fig. 11 Pseudo-code of the algorithm for finding the memory size for a given prediction direction
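The size exploration of Fig. 11, together with Eq. 1 and Eq. 2, can be sketched in runnable form as follows. This is a minimal sketch under stated assumptions: the `evaluate` callback is a hypothetical stand-in for performMEDE(s, B, dir), which in the paper runs real encoding tests on the sequence set B, and the toy numbers used in the example are invented purely to exercise the loop.

```python
from typing import Callable, Iterable, Tuple


def leakage_energy(p_leak: float, t_mede: float) -> float:
    """Eq. 1: E_Leak = P_Leak * T_MEDE."""
    return p_leak * t_mede


def total_miss_energy(n_miss: int, e_offchip: float, e_stall: float,
                      e_fill: float) -> float:
    """Eq. 2: E_MissTotal = (E_offChipAccess + E_HWstall + E_lineFilling) * N_Miss."""
    return (e_offchip + e_stall + e_fill) * n_miss


def determine_video_memory_size(
    e_leak_rec_sw: float,
    sizes: Iterable[float],
    evaluate: Callable[[float], Tuple[float, float]],
) -> float:
    """Return the candidate size with the best net energy profit (Fig. 11)."""
    best_profit = 0.0
    best_size = 0.0
    # Candidates are visited in decreasing order, as in the pseudo-code.
    for s in sorted(sizes, reverse=True):
        e_leak, e_miss_total = evaluate(s)  # hypothetical stand-in for performMEDE
        profit = (e_leak_rec_sw - e_leak) - e_miss_total
        if profit >= best_profit:
            best_size, best_profit = s, profit
    return best_size


# Toy model (invented numbers): leakage grows with size, misses shrink with it.
TOY_MISS_ENERGY = {20: 0.0, 16: 1.0, 12: 10.0, 8: 40.0}


def toy_evaluate(size: float) -> Tuple[float, float]:
    return leakage_energy(p_leak=size, t_mede=1.0), TOY_MISS_ENERGY[size]


print(determine_video_memory_size(37.25, [8, 12, 16, 20], toy_evaluate))
```

In this toy setting the 16-unit candidate wins: shrinking further saves leakage but the miss energy grows faster, which is the balance the text describes.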

As discussed earlier in Section I.A and illustrated by Fig. 9, not all MBs use the complete on-chip memory, and even with a reduced-size memory, major parts (in several cases more than 40%) of the memory may not be used; see the usage variations in Fig. 9. Furthermore, the memory usage in ME is much less than that in DE, see Fig. 9. Therefore, our proposed power management scheme power-gates the unused sectors. The key challenge is to determine an appropriate sleep mode depending upon the predicted memory requirements, considering the spatial and temporal properties of frames and MBs.

1.  ApplicationAwarePowerManager(dir, v, f, STotal, SSector, MROffline)
2.  BEGIN
3.    list N ← getNeighboringFrames(dir, v, f);
4.    ∀n ∈ N: MRn ← (n is Available) ? getMemReq(n) : MROffline;
5.    MRCurrent ← frameMemReq(MRLeft, MRRight, MRTop, MRDown); // Fig. 15
6.    list G ← getMBGroups(f); // combine MBs in Groups
7.    For all g ∈ G
8.      MRGroup ← reAdjustMemReq(g, MRCurrent, EMissGroup); // Fig. 16
9.      list PS ← setSleepModes(STotal, SSector, MRGroup); // Fig. 17
10.     For all mb ∈ g
11.       {EMissGroup, ELeakGroup, memUsedMB} ← performSearch(); // perform ME and DE search and log memory requirements of the current MB
12.       MRCurrent ← mbLevelPowerGating(PS, memUsedMB); // Fig. 18
13.     End For
14.   End For
15.   MR ← computeMemStatistics(PM3, PM2, PM1);
16.   return MR;
17. END

Fig. 13 Pseudo-code of the Application-Aware Power Manager

Phase 1) Frame-Level Power Management: the memory requirements for the current frame f in a view v (MRCurrent, line 5) are predicted from the neighboring frames in the temporal (left, right) and disparity (top, down) domains using a weighted prediction of their respective MRn, as shown in Fig. 14. First, the neighboring frames are obtained (line 3). If the information about the memory requirements of a certain neighboring frame is not available⁶, its memory requirements are initialized with the offline memory requirements (MROffline), see line 4.
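Lines 3-5 of Fig. 13 (neighbor lookup with offline fallback, followed by a weighted prediction) can be illustrated with the sketch below. It is not the authors' frameMemReq of Fig. 15, which is not reproduced in this section: the function name, the dictionary keys, and in particular the default of equal weights are assumptions made here for illustration.

```python
from typing import Dict, Optional, Tuple

MemReq = Tuple[float, float, float]  # (PM3, PM2, PM1), e.g. in KBytes


def frame_mem_req(
    neighbors: Dict[str, Optional[MemReq]],
    mr_offline: MemReq,
    weights: Optional[Dict[str, float]] = None,
) -> MemReq:
    """Weighted prediction of the current frame's memory requirements."""
    if not neighbors:
        raise ValueError("at least one neighboring direction is required")
    if weights is None:
        # ASSUMPTION: equal weights; the paper's actual weights are not given here.
        weights = {direction: 1.0 for direction in neighbors}
    # Line 4 of Fig. 13: unavailable neighbors fall back to MR_Offline.
    filled = {d: (mr if mr is not None else mr_offline) for d, mr in neighbors.items()}
    total_weight = sum(weights[d] for d in filled)
    return tuple(
        sum(weights[d] * filled[d][k] for d in filled) / total_weight
        for k in range(3)
    )


# Invented example: the top (disparity) neighbor is unavailable.
prediction = frame_mem_req(
    {"left": (20.0, 16.0, 12.0), "right": (22.0, 18.0, 14.0),
     "top": None, "down": (24.0, 20.0, 16.0)},
    mr_offline=(30.0, 26.0, 22.0),
)
print(prediction)
```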

V. APPLICATION-AWARE POWER MANAGEMENT

First, a prediction direction is obtained from the application level. Since the search window for each prediction direction is stored in distinct sectors, the sectors of the unused prediction directions are put in the S2 state-retentive mode, as their data will be required in the MB loop. Note that the different search predictions are processed sequentially for each MB. Afterwards, the application-aware power management is employed for each prediction direction independently. The primary input to the application-aware power management scheme is an offline analysis of the memory requirements (Fig. 6, Fig. 16). From this analysis, three different memory requirement predictions are obtained by performing a Probability Distribution Function (PDF) analysis over various test video sequences. The first prediction is the maximum memory requirement, denoted as PM3. Considering a Gaussian distribution, two further highly probable memory requirement predictions (PM1 and PM2) are computed using Eq. 6 and Eq. 7, where the high-probability zones cover the area under the curve up to µ+σ (PM1) and µ+2σ (PM2). Here, µ denotes the average of the distribution and σ denotes the standard deviation. Fig. 12 shows an abstract example for computing PM1, PM2, and PM3. PM1 covers about 84% of the area under the curve, while PM2 covers about 97.7%.
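The three prediction levels described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and sample values are hypothetical, the population standard deviation is used for σ, and clamping PM1 and PM2 to PM3 is an assumption made here so that PM1 ≤ PM2 ≤ PM3 always holds. For reference, the one-sided Gaussian coverage is about 84.1% for µ+σ and about 97.7% for µ+2σ.

```python
import statistics


def memory_requirement_steps(samples):
    """Return (PM1, PM2, PM3) for observed memory requirements in bytes."""
    mu = statistics.mean(samples)
    sigma = statistics.pstdev(samples)  # assumption: population std. deviation
    pm3 = max(samples)                  # maximum observed requirement
    pm1 = min(mu + sigma, pm3)          # assumption: clamp to PM3
    pm2 = min(mu + 2 * sigma, pm3)      # assumption: clamp to PM3
    return pm1, pm2, pm3


def gaussian_coverage(k):
    """Area under a Gaussian PDF below mu + k*sigma."""
    return statistics.NormalDist().cdf(k)


# Hypothetical per-MB memory requirements in bytes.
samples = [1200, 1400, 1400, 1600, 1600, 1600, 1800, 1800, 2000, 2200]
pm1, pm2, pm3 = memory_requirement_steps(samples)
print(round(pm1, 1), pm2, pm3)
print(round(gaussian_coverage(1), 3), round(gaussian_coverage(2), 3))
```

In this hypothetical sample, µ+2σ exceeds the observed maximum, so PM2 is clamped to PM3.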

Footnote 6: For instance, the neighboring frame is an Intra-frame, or the top disparity neighbor of the first view is not available.

Sectors gated in the state-retentive modes S1 and S2 store the data which might be used later by other MBs of the group (see lines 4-5 of Fig. 17). This leads to a reduced wakeup overhead and reduced leakage savings compared to S3. Since wakeup incurs an energy overhead, our scheme predicts the sleep duration required to amortize the wakeup overhead as a function of the number of MBs in the group; see Eq. 12. Due to the state-non-retentive nature of S3, there is a probability of memory misses. Therefore, in addition to Ewakeup, EMissGroup is also added when evaluating the sleep decision for the S3 mode. The set of sectors in the different power modes is saved and returned.

$$N_{Group} > \begin{cases} E_{wakeup} / E_{Leak} & \text{if } S1 \text{ or } S2 \\ (E_{wakeup} + E_{MissGroup}) / E_{Leak} & \text{else} \end{cases} \tag{12}$$

Fig. 14 2D-weighted prediction using the memory usage of the frames in the 3D-neighborhood

Fig. 15 presents the pseudo-code for frame-level memory requirement prediction. Each predicted memory requirement MR = {PM3, PM2, PM1} is computed as the weighted average of the corresponding MR of the neighboring frames using Eq. 8 (see line 5 in Fig. 15). d_n, ∀n ∈ {Left, Right, Top, Down}, denotes the temporal/disparity distance in number of frames between the current frame and the prediction frame, while α and β are the motion and disparity weighting factors, respectively.

$$MR_{Current} = \frac{(MR_{Left}\, d_{Left} + MR_{Right}\, d_{Right})\,\alpha + (MR_{Top}\, d_{Top} + MR_{Down}\, d_{Down})\,\beta}{4} \tag{8}$$

1.  frameMemReq(MRLeft, MRRight, MRTop, MRDown)
2.    ∀n ∈ {Left, Right, Top, Down}
3.      {PM3, PM2, PM1}n ← getMemReqSteps(MRn);   // see Fig. 12
4.    // compute the weighted average using Eq. 8
5.    ∀i ∈ [1...3]: PMi-Current ← weightedAvg(PMi-Left, PMi-Right, PMi-Top, PMi-Down, α, β);
6.    MRCurrent ← {PM3, PM2, PM1}Current;
7.    return MRCurrent;
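The frame-level prediction can be sketched in Python as below. This is an illustrative sketch only: the dictionary layout, the function names, and the neighbor values are hypothetical, the weights α = 0.65 and β = 0.35 are the values listed in TABLE I, and the expression follows Eq. 8 as printed, including its divisor of 4, which is exposed as a parameter.

```python
ALPHA = 0.65  # motion weighting factor (TABLE I)
BETA = 0.35   # disparity weighting factor (TABLE I)


def weighted_avg(pm, dist, alpha=ALPHA, beta=BETA, divisor=4):
    """Eq. 8 as printed, for one prediction level (PM3, PM2, or PM1)."""
    temporal = pm["Left"] * dist["Left"] + pm["Right"] * dist["Right"]
    disparity = pm["Top"] * dist["Top"] + pm["Down"] * dist["Down"]
    return (temporal * alpha + disparity * beta) / divisor


def frame_mem_req(mr, dist):
    """Predict {PM3, PM2, PM1} of the current frame from its four neighbors."""
    levels = ("PM3", "PM2", "PM1")
    return {lvl: weighted_avg({n: mr[n][lvl] for n in mr}, dist)
            for lvl in levels}


# Hypothetical neighbors with identical requirements and unit distances.
names = ("Left", "Right", "Top", "Down")
neighbors = {n: {"PM3": 2000, "PM2": 1800, "PM1": 1600} for n in names}
unit_dist = {n: 1 for n in names}
mr_current = frame_mem_req(neighbors, unit_dist)
print(mr_current)
```

With unit distances and α + β = 1, four identical neighbors yield half of their common value under the printed divisor, which is why the divisor is left adjustable in this sketch.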

Furthermore, after an MB group is encoded, the energy of misses (EMissGroup), the wakeup energy overhead (EwakeupS3→S2), and the leakage energy (ELeakGroup) are used to predict the number of sectors that should be moved from the sleep mode S3 (state-non-retentive) to S2 (state-retentive), as $N_{PM3} = E_{MissGroup} / (E_{LeakGroup} \times E_{wakeupS3 \to S2})$.

Fig. 15 Pseudo-code of frame-level memory requirement prediction

Phase2). Grouping of MBs: Since different MBs in a frame exhibit different spatial and temporal properties, not all MBs of a frame use the same amount of memory for ME and DE. Therefore, the frame-level memory requirement prediction is adapted for different MBs in order to determine the sleep mode. Since state transitions (especially from S3 to S0) incur a wakeup overhead in terms of energy and latency, consecutive MBs sharing the same spatial and temporal video properties are grouped together (using Eq. 9) in order to increase the sleep duration; see line 6 in Fig. 13.


$$Group = \begin{cases} I & \text{if } SAD_{MB} > TH_{SAD} \text{ and } Var_{MB} < TH_{Var} \\ II & \text{else} \end{cases} \tag{9}$$

where THSAD and THVar are computed using a statistical analysis over various test video sequences and are derived as:

$$TH_{Var} = \mu_{Var} + 1.5\,\sigma_{Var}; \quad TH_{SAD} = \mu_{SAD} + 1.5\,\sigma_{SAD} \tag{10}$$
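The grouping rule can be sketched in Python as follows. This is a hedged illustration: the function names and sample values are hypothetical, the population standard deviation is assumed for σ, and the comparison directions are taken from Eq. 9 as printed.

```python
import statistics


def threshold(values, k=1.5):
    """TH = mu + k * sigma over offline statistics (Eq. 10 uses k = 1.5)."""
    return statistics.mean(values) + k * statistics.pstdev(values)


def mb_group(sad_mb, var_mb, th_sad, th_var):
    """Classify one MB into Group I or II; comparisons follow Eq. 9 as printed."""
    return "I" if (sad_mb > th_sad and var_mb < th_var) else "II"


# Hypothetical offline statistics gathered over test sequences.
th_sad = threshold([0, 2])           # mean 1, sigma 1 -> 2.5
th_var = threshold([10, 10, 10, 10])  # sigma 0 -> 10
print(th_sad, th_var)
print(mb_group(sad_mb=3, var_mb=5, th_sad=th_sad, th_var=th_var))
print(mb_group(sad_mb=2, var_mb=5, th_sad=th_sad, th_var=th_var))
```

Consecutive MBs that fall into the same group can then share one sleep-mode decision, which lengthens the sleep duration of the gated sectors.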

Fig. 17 Pseudo-code for determining the number of sectors and their corresponding sleep modes

Phase4). MB-Level Power Management: Afterwards, for all MBs in an MB-Group, the ME and DE search is performed and EMissGroup, ELeakGroup, and memUsedMB are obtained; see line 11 in Fig. 13. Then, the number of sectors in the powered-on state S0 and in the state-retentive sleep modes S1 and S2 is re-adjusted for the upcoming MB depending upon the memory actually used by the currently encoded MB. Fig. 18 illustrates the procedure for re-adjusting the sleep modes for the upcoming MB in an MB-Group. First, the difference between the predicted memory in S0 and the used memory is computed in number of sectors (line 4). If the difference is zero, the sleep modes are not updated (line 5). If the difference is positive, i.e. the used memory is less than the predicted memory in mode S0, additional sectors are put into the state-retentive sleep mode S1 (line 7). Otherwise, more sectors are powered on and the sectors in the other state-retentive modes are adjusted accordingly (lines 8-15).


 1.  setSleepModes(S, SSector, MR)
 2.  BEGIN
 3.    {PM3, PM2, PM1} ← getMemReqSteps(MR);   // see Fig. 12
 4.    M3 = ⌊(S - PM3) / SSector⌋;  M2 = ⌊(PM3 - PM2) / SSector⌋;  M1 = ⌊(PM2 - PM1) / SSector⌋;
 5.    PowerGate(M3, S3); PowerGate(M2, S2); PowerGate(M1, S1);   // using Eq. 12
 6.    M0 = (S / SSector) - (M3 + M2 + M1);
 7.    SwitchOn(M0, S0);
 8.    list PS ← {M0, M1, M2, M3};
 9.    return PS;
10.  END
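The sector partitioning of Fig. 17 and the amortization check of Eq. 12 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names, the memory and sector sizes, and the energy values (in arbitrary units) are hypothetical and are not taken from the paper's power model.

```python
def min_group_size(e_wakeup, e_leak, e_miss_group=0.0, retentive=True):
    """Right-hand side of Eq. 12: group size above which gating pays off."""
    overhead = e_wakeup if retentive else e_wakeup + e_miss_group
    return overhead / e_leak


def set_sleep_modes(s_total, s_sector, pm3, pm2, pm1):
    """Split the sectors into S0..S3 as in lines 4-8 of Fig. 17."""
    m3 = int((s_total - pm3) // s_sector)  # beyond the maximum requirement
    m2 = int((pm3 - pm2) // s_sector)      # between PM2 and PM3
    m1 = int((pm2 - pm1) // s_sector)      # between PM1 and PM2
    m0 = int(s_total // s_sector) - (m3 + m2 + m1)  # remaining, powered on
    return {"S0": m0, "S1": m1, "S2": m2, "S3": m3}


# Hypothetical numbers: a 4096-byte memory with 256-byte sectors.
ps = set_sleep_modes(4096, 256, pm3=3072, pm2=2304, pm1=1792)
print(ps)

# Hypothetical energies for an MB group of 4 MBs.
n_group = 4
gate_s2 = n_group > min_group_size(e_wakeup=6.0, e_leak=2.0)
gate_s3 = n_group > min_group_size(6.0, 2.0, e_miss_group=4.0, retentive=False)
print(gate_s2, gate_s3)
```

In this example the state-retentive gating is amortized by the group, while the additional miss energy makes S3 gating unprofitable for a group of only four MBs.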


Fig. 16 Statistical distribution of ME and DE memory requirements for homogeneous and textured MBs

Fig. 16 shows the PDF of memory requirements for two different groups of MBs. Group-I contains the homogeneous MBs with slow motion and disparity. Group-II contains highly textured MBs with medium-to-fast motion and/or disparity. It is noteworthy that, in the case of ME, the distribution of MBs in Group-I is more concentrated than that in Group-II. Therefore, the frame-level prediction is readjusted considering the MB group using Eq. 11; see line 8 in Fig. 13. ξi is given as the difference between the average textures (computed using the Sobel operator) of the complete video frame and the MB group.


$$\forall i \in [1 \ldots 3]: \quad PM_{i\text{-}Group} = \xi_i \cdot PM_{i\text{-}Current} \tag{11}$$

Phase3). Group-Level Power Management: The pseudo-code in Fig. 17 provides the flow for making power-gating decisions. First, the group-level memory requirement prediction in terms of {PM3, PM2, PM1} is obtained. Afterwards, the sets of sectors that are candidates for power gating are determined. The memory beyond the maximum requirement (M3) is gated in S3 mode, as it is highly improbable that it will be used by the MB group. The other two sets of sectors, M2 and M1, are candidates for being gated in the state-retentive modes S2 and S1.

 1.  mbLevelPowerGating(PS, memReqMB)
 2.    {M0, M1, M2, M3} ← PS;
 3.    M = ⌊memReqMB / SSector⌋;
 4.    Δmem = M0 - M;
 5.    If (Δmem == 0) Then return PS;
 6.    If (Δmem > 0) Then   // put more sectors in S1 gating mode
 7.      M0' = M0 - Δmem;  M1' = M1 + Δmem;  M2' = M2;
 8.    Else
 9.      M0' = M0 - Δmem;   // switch ON more sectors
10.      If (|Δmem| ≥ M1) Then   // re-adjust S1-gated and S2-gated sectors
11.        M1' = M2 + M1 - |Δmem|;  M2' = 0;
12.      Else
13.        M1' = M1 - |Δmem|;  M2' = M2;
14.      End If
15.    End If
16.    PS ← {M0', M1', M2', M3};
17.    return PS;

Fig. 18 Pseudo-code of MB-Level Power-Gating
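The MB-level re-adjustment can be sketched in Python as follows. This is an illustrative sketch rather than the authors' implementation: the function name, the tuple layout (M0, M1, M2, M3), and the numbers are hypothetical. Two assumptions are made here: the amount taken from S1 in the else branch is |Δmem|, so that the total number of sectors is preserved, and the demand is limited to the sectors available in S1 and S2, so that S3-gated sectors are never touched at this level.

```python
def mb_level_power_gating(ps, mem_used_mb, s_sector):
    """Re-adjust the S0/S1/S2 sector counts after encoding one MB."""
    m0, m1, m2, m3 = ps
    m = mem_used_mb // s_sector  # sectors actually used (floor)
    delta = m0 - m
    if delta == 0:
        return ps
    if delta > 0:
        # Fewer sectors used than powered on: move the surplus to S1.
        return (m0 - delta, m1 + delta, m2, m3)
    # More sectors needed: wake sectors from S1 first, then from S2.
    need = min(-delta, m1 + m2)  # assumption: never wake S3 sectors here
    if need >= m1:
        return (m0 + need, m1 + m2 - need, 0, m3)
    return (m0 + need, m1 - need, m2, m3)


# Hypothetical state: 7 sectors on, 2 in S1, 3 in S2, 4 in S3 (256 B each).
ps = (7, 2, 3, 4)
print(mb_level_power_gating(ps, 1792, 256))  # prediction matched
print(mb_level_power_gating(ps, 1280, 256))  # less memory used
print(mb_level_power_gating(ps, 2048, 256))  # one more sector needed
print(mb_level_power_gating(ps, 2560, 256))  # three more sectors needed
```

In every case the total number of sectors stays at 16 and the S3 count is unchanged.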


It is worth noting that the variations in the energy savings shown in Fig. 20(c) are very frequent. This is because our scheme adapts very quickly to accommodate sudden variations in the memory requirements, thus frequently transitioning between S0, S1, and S2 (S0→S1→S2). An interesting observation in Fig. 20(c) is that the variations between S1 and S2 do not touch the sectors gated in S3 mode. This shows that the frame-level prediction of the maximum requirements is very accurate and that the probability of powering on the S3-gated sectors is very low.

Phase5). Re-compute Statistics: After the ME and DE of the frame are completed, the probabilistic analysis (as in Fig. 12 and Fig. 16) is performed to obtain the MR, which is used as a neighboring-frame MR by the subsequent frames. Note that during Intra-frame encoding, the complete on-chip memory is kept in the S3 mode, as no ME or DE is performed for Intra-frames.

VI. RESULTS AND EVALUATION


To evaluate our low-power memory architecture with application-aware power management, ME and DE are performed using different multiview video sequences ("Ballroom", "Vassar", "Exit", and "Flamenco2"), each with 4 views, considering various motion and disparity configurations. The fast TZ Search [4] algorithm is employed with a search range of [±96, ±96] pixels using three Quantization Parameters QP = {22, 32, 42}. The memory and power model described in Section III, along with the set of thresholds and power mode parameters shown in TABLE I, is used in our experiments.

TABLE I. POWER MODEL PARAMETERS AND THRESHOLDS

Parameter   Value     Parameter   Value
ΦS1         0.5       α           0.65
ΦS2         0.3       β           0.35
ξ1          0.35      ρS1         0.1
ξ2          0.35      ρS2         0.2
ξ3          0.6       ρS3         0.3

Fig. 19 Comparing the leakage savings with state-of-the-art (bars: Level-C+, Level-C, NC, FC, FMBC; sequences: Ballroom, Vassar, Exit, Flamenco)


Fig. 21 compares our predicted memory requirements and those of two history-based predictors with the actual memory usage, considering a rectangular search window. Notably, in cases of sudden variations, our application-aware prediction tracks the actual usage more accurately than the others. This improved accuracy leads to significant energy savings, as shown in Fig. 19.
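The delayed reaction of history-based predictors can be illustrated with a short Python sketch. The paper does not specify how its history-based (mean and median) predictors are configured, so the sliding window of four MBs, the function name, and the usage trace below are assumptions made purely for illustration.

```python
import statistics


def history_predictions(usage, window=4, reducer=statistics.mean):
    """Predict usage[i] from the previous `window` observations."""
    preds = []
    for i in range(len(usage)):
        history = usage[max(0, i - window):i]
        # With no history yet, fall back to the first observation.
        preds.append(reducer(history) if history else usage[0])
    return preds


# Hypothetical trace with a sudden jump in memory usage at the fifth MB.
usage = [40, 40, 40, 40, 80, 80, 80, 80]
mean_pred = history_predictions(usage, reducer=statistics.mean)
median_pred = history_predictions(usage, reducer=statistics.median)
print(mean_pred)
print(median_pred)
```

Both predictors still predict 40 at the jump; the mean predictor stays below 80 for the rest of this trace, and the median predictor reaches 80 only three MBs after the jump, which mirrors the delayed prediction in transitions annotated in Fig. 21.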

Fig. 20 Detailed analysis of memory usage and sleep modes (x-axis: #MB)


The leakage reduction provided by our architecture is compared to the state-of-the-art on-chip storage and search window prefetching techniques Level-C [6] and Level-C+ [6]. For a fair comparison, the same search range is employed; the comparison therefore purely reflects the leakage reduction due to the reduced memory size and power-gating. Level-C and Level-C+ are search-window-based techniques, while our approach stores and manages only the data demanded by the search pattern. Fig. 19 shows the leakage normalized to Level-C+, which has the highest energy consumption, primarily due to its large memory. Due to its reduced size, our memory provides a 50% leakage energy reduction compared to Level-C+ even without power management (NC). When application-aware power-gating only at the frame level (FC) is integrated with our memory, the leakage energy reduction exceeds 75%. The fine-grained power management at the MB level (FMBC) provides a further 3%-5% leakage reduction, altogether providing up to 80% leakage reduction compared to Level-C+.


Fig. 20 presents a detailed analysis of the memory usage, the selected sleep modes, and the corresponding energy savings for a series of MBs. Fig. 20(a) shows the different memory usages in ME and DE for different video sequences in terms of blocks (16x16 pixels). The power states of the blocks are shown by different colors. It is worth noting that the sleep mode S1 is seldom used. This is because our scheme quickly transitions between S1 and S2, as the difference in the wakeup overhead of S1 and S2 is insignificant, even at the MB level, where ME and DE require several hundred cycles. The decision for state S3 is primarily taken at the frame level, as it is state-non-retentive, which is visible in the transition in Fig. 20(b). At the MB and MB-Group levels, our scheme tends to choose the S2 mode due to its low wakeup overhead. The selection of S1 allows for fine-grained power savings and accommodates sudden variations. Fig. 20(c) shows the corresponding energy savings.

[Fig. 21 legend: Actual Memory Requirements; History Based (Mean); History Based (Median); Our Application-Aware Prediction. Annotations: history-based predictors show delayed prediction in transitions; our application-aware predictor reacts to sudden variations. x-axis: #MB]

Fig. 21 Comparing the accuracy of different predictors to the actual memory usage considering a rectangular search window

Our on-chip memory with application-aware power management is integrated in the ME/DE hardware architecture of [22], which features an array of 64x4-sample SAD operators and 21 SAD trees to provide high throughput. All components, including the memory size, are designed to support real-time encoding of up to 4 HD1080p views, performing ME and DE for search ranges of up to [±96, ±96] pixels. TABLE II presents a detailed comparison with two of the most prominent ME/DE hardware architectures available in the current literature [10][22]. Our architecture significantly reduces the gate count in relation to [10] but contains a bigger memory, which is primarily due to the reduced [±16, ±16] search range of [10] compared to our [±96, ±96] search range. Note that a search range of [±96, ±96] is required to support high-quality DE. If the same search range were used, the memory size of [10] would increase to approximately 1.2 Mbits. In this case, our architecture would provide a memory reduction of 30%. Compared to [22], the increase in memory size mainly serves to reduce the number of misses. Altogether, our approach reduces memory misses by 8%-10% compared to [22]. Additionally, our application-aware power management leads to reduced leakage power/energy, which is the primary design concern in this work.


TABLE II. HARDWARE RESULTS COMPARISON WITH STATE-OF-THE-ART

              Tsung'09 [10]                  [22]                        [22] with our on-chip video memory
Technology    TSMC 90nm Low Power LowK Cu    ST 65nm LP 7 metal layer    ST 65nm LP 7 metal layer
Gate Count    230k                           102k                        102k
SRAM          64 Kbits                       512 Kbits                   832 Kbits
Frequency     300 MHz                        300 MHz                     300 MHz
Power         265 mW, 1.2 V                  74 mW, 1.0 V                57 mW, 1.0 V
Throughput    4-views 720p                   4-views HD1080p             4-views HD1080p


VII. CONCLUSIONS


A low-power on-chip video memory is proposed for motion and disparity estimation in Multiview Video Coding. An algorithm is proposed to determine the organization of a reduced-size multi-banked memory by evaluating the leakage energy savings and the energy overhead due to misses. The proposed architecture employs application-aware power management at various levels, i.e. prediction direction, frame, group of MBs, and MB. Experiments demonstrate that, for various 3D-video sequences, the proposed architecture provides up to 80% leakage energy reduction compared to state-of-the-art. Our work demonstrates that raising the abstraction level of power-gating to the application level, while jointly considering the knowledge of the algorithm and the input data, provides a higher potential for leakage reduction.

ACKNOWLEDGMENT

We would like to thank Leandro Max de Lima, Federal University of Rio Grande do Sul, for his contribution to the hardware implementation.

REFERENCES

[1] P. Merkle et al., "Efficient Prediction Structures for Multiview Video Coding", IEEE Trans. on Circuits and Systems for Video Tech., vol. 17, no. 11, pp. 1461-1473, 2007.
[2] Lynx 3D SH-03C: http://www.sharp.co.jp/products/sh03c/index.html
[3] Joint Draft 8.0 on Multiview video coding, JVT-AB204, 2008.
[4] J. Yang et al., "Multiview video coding based on rectified epipolar lines", International Conference on Information, Communications, and Signal Processing, pp. 1-5, 2009.
[5] S. Yang, W. Wolf, N. Vijaykrishnan, "Power and performance analysis of motion estimation based on hardware and software realizations", IEEE Transactions on Computers, vol. 54, no. 6, pp. 714-726, 2005.
[6] C.-Y. Chen et al., "Level C+ data reuse scheme for motion estimation with corresponding coding orders", IEEE Trans. on Circuits and Systems for Video Tech., vol. 16, no. 4, pp. 553-558, 2006.
[7] L.-F. Ding et al., "A 212 MPixels/s 4096×2160p Multiview Video Encoder Chip for 3D/Quad Full HDTV Applications", IEEE Journal of Solid-State Circuits, vol. 45, no. 1, pp. 46-58, 2010.
[8] S. Saponara, L. Fanucci, "Data-adaptive motion estimation algorithm and VLSI architecture design for low-power video systems", IEE Computers and Digital Techniques, vol. 151, no. 1, pp. 51-59, 2004.
[9] T.-C. Chen et al., "Fast Algorithm and Architecture Design of Low-Power Integer Motion Estimation for H.264/AVC", IEEE Trans. on Circuits and Systems for Video Tech., vol. 17, no. 5, pp. 568-577, 2007.
[10] P.-K. Tsung et al., "Cache-based integer motion/disparity estimation for quad-HD H.264/AVC and HD multiview video coding", IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 2013-2016, 2009.
[11] C.-Y. Tsai et al., "Low Power Cache Algorithm and Architecture Design for Fast Motion Estimation in H.264/AVC Encoder System", IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. II-97-II-100, 2007.
[12] H. Singh et al., "Enhanced leakage reduction techniques using intermediate strength power gating", IEEE Transactions on Very Large Scale Integration, vol. 15, no. 11, pp. 1215-1224, 2007.
[13] K. Agarwal, K. Nowka, H. Deogun, D. Sylvester, "Power gating with multiple sleep modes", International Symposium on Quality Electronic Design, pp. 633-637, 2006.
[14] S. Roy, N. Ranganathan, S. Katkoori, "State-Retentive Power Gating of Register Files in Multi-core Processors featuring Multithreaded In-Order Cores", IEEE Transactions on Computers, 2010.
[15] L. Shen et al., "View-Adaptive Motion Estimation and Disparity Estimation for Low Complexity Multiview Video Coding", IEEE Trans. on Circuits and Systems for Video Tech., vol. 20, no. 6, pp. 925-930, 2010.
[16] H.-C. Chang et al., "A Dynamic Quality-Adjustable H.264 Video Encoder for Power-Aware Video Applications", IEEE Trans. on Circuits and Systems for Video Tech., vol. 19, no. 12, pp. 1739-1754, Dec. 2009.
[17] S.-H. Wang, S.-H. Tai, T. Chiang, "A Low-Power and Bandwidth-Efficient Motion Estimation IP Core Design Using Binary Search", IEEE Trans. on Circuits and Systems for Video Tech., vol. 19, no. 5, pp. 760-765, 2009.
[18] X. Xu, Y. He, "Fast disparity motion estimation in MVC based on range prediction", IEEE International Conference on Image Processing, pp. 2000-2003, 2008.
[19] H. Shim, C.-M. Kyung, "Selective Search Area Reuse Algorithm for Low External Memory Access Motion Estimation", IEEE Trans. on Circuits and Systems for Video Tech., vol. 19, no. 7, pp. 1044-1050, 2009.
[20] T. Tuan et al., "A 90nm Low-Power FPGA for Battery-Powered Applications", ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 3-11, 2006.
[21] JMVC 6.0, garcon.ient.rwthaachen.de, Sep. 2009.
[22] B. Zatt, M. Shafique, F. Sampaio, L. Agostini, S. Bampi, J. Henkel, "Run-Time Adaptive Energy-Aware Motion and Disparity Estimation in Multiview Video Coding", IEEE Design Automation Conference, pp. 1026-1031, 2011.
[23] X. Xu, Y. He, "Fast disparity motion estimation in MVC based on range prediction", IEEE International Conference on Image Processing, pp. 2000-2003, 2008.
[24] X. Liu, P. J. Shenoy, M. D. Corner, "Chameleon: Application-level power management", IEEE Trans. Mobile Computing, vol. 7, no. 8, pp. 995-1010, 2008.
[25] S. Mondal, S. O. Memik, "Fine-grain leakage optimization in SRAM based FPGAs", IEEE Great Lakes Symposium on VLSI, pp. 238-243, 2005.
[26] K. Rajamani et al., "Application-Aware Power Management", IEEE International Symposium on Workload Characterization, pp. 39-48, 2006.
[27] M. Shafique, L. Bauer, J. Henkel, "enBudget: A Run-Time Adaptive Predictive Energy-Budgeting Scheme for Energy-Aware Motion Estimation in H.264/MPEG-4 AVC Video Encoder", IEEE Design Automation and Test in Europe Conference, pp. 1725-1730, 2010.
[28] M. Shafique, B. Molkenthin, J. Henkel, "An HVS-based Adaptive Computational Complexity Reduction Scheme for H.264/AVC Video Encoder using Prognostic Early Mode Exclusion", IEEE Design Automation and Test in Europe Conference, pp. 1713-1718, 2010.
[29] B. Zatt, M. Shafique, S. Bampi, J. Henkel, "Multi-Level Pipelined Parallel Hardware Architecture for High Throughput Motion and Disparity Estimation in Multiview Video Coding", IEEE Design Automation and Test in Europe Conference, pp. 1448-1453, 2011.
[30] H. Javed, M. Shafique, S. Parameswaran, J. Henkel, "Low-Power Adaptive Pipelined MPSoCs for Multimedia: An H.264 Video Encoder Case Study", IEEE Design Automation Conference, pp. 1032-1037, 2011.
