Instruction-Based Voltage Scaling for Power Reduction in SIMD-Based MPSoCs Sohaib Majzoub Dept. of Computer Engineering, American University in Dubai, POBox 28282, Dubai, Dubai, UAE
[email protected]
Address: American University in Dubai Department of Computer Engineering POBox 28282 Dubai, Dubai, UAE Office : + 971 (4) 318 3438 Fax : + 971 (4) 318 3419 Email :
[email protected]
Date of Receiving: Date of Acceptance:
Abstract—Power reduction in Multiprocessor System-on-Chip (MPSoC) is a key design challenge today. In this paper, we present an approach to power reduction for homogeneous SIMD MPSoC platforms. The technique involves the use of dynamic voltage scaling and power management of voltage islands. The basic idea is to have different supply voltage for different instructions: slow instructions run at the higher voltage levels while fast instructions run at lower voltages to save on power. We use a reconfigurable MPSoC architecture, namely MorphoSys, as a vehicle to demonstrate the approach, although the methods are suitable for other SIMD MPSoC architectures. A number of image processing benchmarks are manually mapped to the hardware to demonstrate the power reduction. The results show energy savings from 12% up to 24% for the specified benchmarks using the proposed approach. Keywords — Multi-Processor System-on-Chip (MPSoC), Power Modeling and Management, Reconfigurable Systems.
1 INTRODUCTION
The system-on-chip (SoC) paradigm is undergoing a major shift driven largely by the emergence of multiprocessor SoC (MPSoC) platforms [1], which support a large set of embedded cores on a single chip. In the current era of portable computing, power reduction in MPSoCs with tens to hundreds of cores is a vital design goal. The use of multiple supply voltages has been widely adopted in integrated circuit designs to reduce overall switching and leakage power consumption [1][2]. Multi-VDD design fits well with the MPSoC paradigm because the cores usually execute a non-uniform workload: cores with a heavy workload are supplied a higher VDD, while those with a lighter load are supplied a lower VDD to save power. Often, the approach is realized through voltage islands, which allow a finite number of supply voltages to be applied to different blocks of a design to reduce the overall power consumed. Idle power can be reduced separately using power gating, in which an entire block is switched off (put into sleep mode) to greatly reduce leakage power [4][5].
For example, in [10], the authors optimize both dynamic and leakage power in heterogeneous multiprocessor systems. They propose a scheduling algorithm that simultaneously uses dynamic voltage scaling (DVS) and adaptive body biasing (ABB). Experimental results show an average power reduction of 68% for a 70nm CMOS technology. In [1], the authors optimize the selection of the number of processing cores and their corresponding voltage/frequency for a given workload in order to minimize power. Their method is based on functional, cycle-accurate simulation on a virtual SystemC-based platform, MP-ARM. Their results show a 50% power saving for ARM cores in a 0.13µm technology. Dynamic voltage and frequency scaling (DVFS) is used in [5] to minimize power consumption under timing constraints for JPEG and MPEG-1 decoding tasks. Using heuristic approaches, the results show a maximum power saving of 55%. In [8], the same problem as in [5] is tackled: finding the optimum voltage levels that minimize the average power consumption under a given performance constraint. In [9], the optimization is carried out on a heterogeneous multiprocessor platform, and the results show a 58% improvement in power reduction. In other work on multiprocessor platforms, voltage scaling is performed at the application and task levels. Off-line as well as on-line task scheduling algorithms are discussed thoroughly in the literature for uniprocessor systems [1]. However, this topic is still the subject of ongoing research for various multiprocessor architectures [6][7][8][9].
In our work, we consider power reduction on a homogeneous SIMD (single-instruction, multiple-data) MPSoC [1]. That is, all processors in the n x n array are identical. The architecture allows one or more rows or columns, or the entire array, to execute the same instruction on different data. We explore an instruction-level approach because the processing elements in a SIMD architecture are simple.
We take advantage of the fact that different instructions may have different execution times. Our key concept is that slower instructions receive a higher VDD so that they can complete within the clock cycle, while faster instructions receive a lower VDD so that their power dissipation is reduced. Processors are grouped column-wise into a set of voltage islands, and the supply voltage can be changed independently for each island [5]. To identify a suitable set of supply voltages, all instructions are profiled to find the minimum supply voltage each one needs to fit within a given clock cycle. We then pick a small set of supply voltages and create a set of voltage islands on the array, which we call voltage or power masks, based on the instructions used by the application. Changing the voltage mask on every cycle is called cycle-based voltage mask (CBVM) switching. Changing the voltage mask only between different applications (or tasks) is called task-based voltage mask (TBVM) switching.
In this paper, we investigate instruction-based voltage scaling and voltage mask manipulation methods to reduce power. In Section 2, we present the key concept, the different power models, and the scaling methodology used for VDD selection for each instruction. In Sections 2.4 and 2.5, we demonstrate the power saving by mapping a group of image handling benchmarks. Finally, we draw conclusions in Section 3.
2 INSTRUCTION-LEVEL VOLTAGE ISLANDS
Most of the previous work on power reduction in MPSoC platforms dealt with voltage scaling at the program/task level. For instance, if processor A runs program X in 500ns while processor B runs program Y in 1000ns, processor A has to wait another 500ns for processor B to finish.
Thus, processor A can be slowed down by reducing its voltage and/or frequency to fit within 1000ns [9]. While this approach works well, it does not easily translate to a SIMD architecture, which may benefit from voltage scaling at the instruction level rather than the task level. For example, in a typical SIMD architecture, the clock cycle is determined by the slowest instructions. Each instruction may have its own execution time, but instructions can be categorized into a number of bins depending on the degree of timing differences. Figure 1 shows a case, generated randomly for illustration purposes, in which the instructions can be grouped into two categories. Here, a 20ns clock cycle is needed to safely run all instructions, with a nominal voltage level of 1.1V for all instructions. To reduce power, the "Slow Instructions" group in the figure should run at the nominal voltage of 1.1V, while the "Fast Instructions" group can run at a voltage level below 1.1V. The lower voltage level has to be determined from the actual timing of the fast instructions, as we discuss in the coming sections. As a result, the cycle time remains the same but the overall power is reduced.
This approach is possible on SIMD MPSoC platforms that consist of an array of simple ALUs with varying instruction execution times. Previous work was conducted on more sophisticated processors such as ARM processors [1]. Such a processor has a pipelined architecture that might prevent voltage scaling from being carried out at the instruction level. Furthermore, for such processors, voltage scaling at the program or task level is more efficient, as previous work shows [1]. For MPSoC designs with simple ALUs in a SIMD architecture, the proposed method is quite suitable. Based on this, we explore the merits of instruction-level voltage scaling to reduce power.
2.1 MorphoSys Architecture
MorphoSys [11] is a reconfigurable platform with multiple processors that are very simple in nature. We view it as an MPSoC platform with SIMD architecture. Figure 2 shows the block diagram and architecture of the MorphoSys M1 chip [11] and the details of one of the reconfigurable processors. Its structure follows the master-slave processor design. The master processor is a RISC processor called TinyRISC. The slave processors form the reconfigurable part, which consists of an 8x8 array of small reconfigurable processors (RPs)¹. The other supporting blocks are the context memory, the frame buffer, and the DMA controller. The frame buffer and the context memory provide the data and instructions, respectively, in a parallel fashion to the RP array. The role of the frame buffer is to improve the data flow between the main memory and the RP array. The context memory contains the context words that are sent to configure the data input, the interconnection, and the operation to be executed in every RP in the array.
The relevant part of the MorphoSys architecture for this paper is the RP array. It has a parallel structure with a hierarchical communication bus scheme. It is suitable for applications that have a high degree of parallelism, regularity, and intensive computations [11]. Bus transfers can occur in one cycle for neighboring processors and two cycles for distant processors. The execution of the RP array can be configured to be row-wise or column-wise to provide the parallelism needed for the target application. The reconfigurability feature in each RP lies in the multiplexing of the data inputs. The settings on the multiplexers control the data inputs and operations performed [11].
¹ In the original work in [11], the RP array is referred to as the RC array.
2.2 Voltage Scaling on the MorphoSys Instruction Set
The instruction set of the RPs ranges from simple logic and arithmetic instructions to multiply and accumulate instructions. The supply voltage can be scaled based on the type of instructions executed in an application and how often they are executed. To categorize the different instructions, the RP unit was implemented in RTL and then synthesized to the gate level. The delay versus instruction plot for the RP is shown in Figure 3. The plot was obtained with VDD=1.1V for a 90nm CMOS technology using PrimeTime™. The delays can be clearly grouped into fast and slow instructions. The first group consists of the normal arithmetic and logic operations, which require less than 1.7ns. The second group consists of instructions that involve multiply and accumulate operations, which require approximately 4.5ns.
Next, a voltage/delay scale was needed to lower the voltage of each instruction until its slack is eliminated. To do that, a normalized curve of delay versus VDD was obtained using HSPICE™, starting from the PrimeTime™ nominal values. We used this simple function to scale the delay as a function of supply voltage for every instruction, and obtained the required VDD value for each instruction. In our case, the target delay for the overall system is 4.5ns. Using this target value, the voltage versus instruction plot was generated, as shown in Figure 4 and Figure 5.
Ideally, each instruction might have its own voltage level. However, implementing 30 voltage levels adds considerable design complexity, and the resulting timing and energy overhead might take away most of the benefits. Thus, it is reasonable to consider only two voltage levels in the implementation [12]. The next problem is then to determine which voltages to select, i.e. selecting two out of 30 voltage levels such that the total energy is kept at a minimum.
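The per-instruction voltage search described above can be sketched as follows. This is a minimal illustration, not the paper's flow: the paper uses a normalized HSPICE delay-versus-VDD curve, whereas the sketch substitutes an alpha-power-law delay model, and `VTH`, `ALPHA`, the 50mV voltage grid, and the function names are all assumptions chosen for illustration.

```python
# Sketch: lowest supply voltage at which an instruction still meets the cycle time.
# Assumption: delay ~ VDD / (VDD - VTH)**ALPHA (alpha-power law). The paper uses a
# normalized HSPICE curve instead, so VTH and ALPHA are hypothetical placeholders.

V_NOMINAL = 1.1      # V, nominal supply at which the delays were measured
TARGET_DELAY = 4.5   # ns, cycle time set by the slowest instructions
VTH = 0.3            # V, hypothetical threshold voltage
ALPHA = 1.3          # hypothetical velocity-saturation exponent


def delay_factor(vdd):
    """Delay at vdd relative to the delay at V_NOMINAL under the assumed model."""
    def raw(v):
        return v / (v - VTH) ** ALPHA
    return raw(vdd) / raw(V_NOMINAL)


def min_vdd(nominal_delay_ns, step_mv=50, floor_mv=500):
    """Walk down a voltage grid and return the lowest voltage that meets timing."""
    best = V_NOMINAL
    for mv in range(int(round(V_NOMINAL * 1000)), floor_mv - 1, -step_mv):
        vdd = mv / 1000
        if nominal_delay_ns * delay_factor(vdd) <= TARGET_DELAY:
            best = vdd
        else:
            break  # delay only grows as the voltage drops further
    return best


if __name__ == "__main__":
    # 1.7 ns and 4.5 ns are the fast and slow group delays quoted in the text.
    for name, delay in [("fast ALU op", 1.7), ("multiply-accumulate", 4.5)]:
        print(name, "->", min_vdd(delay), "V")
```

With a measured delay curve in place of `delay_factor`, the same loop would yield the per-instruction voltages plotted in Figure 4.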
Although this selection depends heavily on the type of instructions used by the applications after fabrication, it has to be made at a pre-fabrication stage. The two selected voltage levels are fixed and never change after fabrication. From Figure 4, as mentioned earlier, there are two groups of instructions. The first group is the fast one, which includes most of the arithmetic, logic, and bitwise instructions. The supply voltage of these instructions ranges from 0.5V to 0.8V. The second group is the slow one, which includes most of the multiply/accumulate instructions. The supply voltage of these instructions ranges from 0.9V to 1.1V. A voltage level has to satisfy two conflicting requirements: it should be as low as possible, and it should cover as many instructions as possible. Furthermore, the best choice of voltages varies between applications, depending on the instructions they use.
Based on this observation, the candidate voltage levels can be narrowed down to 0.7V, 0.8V, 1.0V, and 1.1V. Since 1.1V has to be selected anyway to meet the 4.5ns timing constraint, 1.0V can be eliminated. The remaining decision is whether to select 0.7V or 0.8V. Only one instruction, |A+B|+C, needs 0.8V, and it is better to group it with the slow instructions: raising the voltage of a single instruction to 1.1V costs less than raising the voltage of all 23 fast instructions by 0.1V. Furthermore, these 23 fast instructions have a much higher usage probability in potential workloads than a single instruction. Thus, the 0.7V supply is used for 23 instructions, and the 1.1V supply is used for the other 7 instructions (those that require more than 0.7V). Using voltage scaling based on VDDL=0.7V and VDDH=1.1V, all instructions will have a critical path delay of roughly 4.5ns.
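The 0.7V versus 0.8V trade-off can be made concrete with a small usage-weighted comparison. This is only a sketch: the instruction counts (23 fast, one at 0.8V, six remaining slow) follow the text, but the usage weights are invented placeholders, energy is assumed proportional to VDD squared, and `GROUPS`, `expected_energy`, and `best_vddl` are illustrative names.

```python
# Sketch: pick the low supply VDDL once VDDH is fixed at the nominal 1.1 V.
# Assumptions: usage weights are hypothetical, and energy per instruction is
# taken as proportional to VDD**2.

VDDH = 1.1

# (group, minimum VDD in volts, number of instructions, usage weight per instruction)
GROUPS = [
    ("fast ALU/logic", 0.7, 23, 0.035),
    ("|A+B|+C", 0.8, 1, 0.015),
    ("multiply/accumulate", 1.1, 6, 0.030),
]


def expected_energy(vddl):
    """Usage-weighted relative energy when each instruction runs on VDDL if
    VDDL satisfies its minimum voltage, and on VDDH otherwise."""
    total = 0.0
    for _name, vmin, count, weight in GROUPS:
        supply = vddl if vmin <= vddl else VDDH
        total += count * weight * supply ** 2
    return total


def best_vddl(candidates):
    """Candidate low supply with the smallest usage-weighted energy."""
    return min(candidates, key=expected_energy)


if __name__ == "__main__":
    for v in (0.7, 0.8):
        print(f"VDDL={v} V -> relative energy {expected_energy(v):.4f}")
    print("selected VDDL:", best_vddl([0.7, 0.8]))
```

Under these placeholder weights the 23 fast instructions dominate, so 0.7V wins even though |A+B|+C is pushed up to 1.1V. A workload that used |A+B|+C very heavily could tip the balance the other way, which is why the choice depends on the expected instruction mix.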
Figure 6 shows the total power consumed by the instructions after scaling the voltage with VDDL=0.7V or VDDH=1.1V. It was generated using PrimeTime PX™. The initial power levels are shown in the light colored part, and the final power levels are shown in the dark colored part. The power of the instructions running at VDDH=1.1V is unchanged.
In order to evaluate different power reduction schemes, a hardware emulator for this MPSoC is used [13]. The emulator is a cycle-based simulator that functionally executes each instruction. The user may specify different VDD values to the tool, and it reports the number of cycles needed and the power dissipated for the mapped application. Each instruction is automatically assigned to the proper voltage island based on the voltage mask defined in advance.
2.3 Voltage Island Partitioning and Voltage Masks
The RP array is first partitioned into voltage islands in a column-wise manner. That is, all processors in a column receive the same supply voltage. A voltage mask is used to set the different power modes of each voltage island. During execution, three power modes are defined: Run mode, Idle mode and Sleep mode [4]. In the Run state, the block is operating with its corresponding VDD value, either VDDL or VDDH. In the Idle state, the island has no work assigned to it. In Sleep mode, the island is shut down [1]. Power equations for Run, Sleep and Idle modes are:
$$P_{run} = \alpha C V_{DD}^{2} f, \qquad P_{sleep} \approx 0, \qquad P_{idle} = I_{o}\, e^{-qV_{T}/(nkT)}\, V_{DD}$$
where $\alpha$ is the switching activity, $C$ is the switching capacitance, $V_{DD}$ is the supply voltage, and $f$ is the operating frequency. In the idle power equation, $V_{T}$ is the threshold voltage, $T$ is the temperature, $q$ is the electron charge, $k$ is the Boltzmann constant, and $n$ and $I_{o}$ are technology-dependent constants.
2.4 Experimental Approach
In this section, we demonstrate the power improvements possible using a number of applications. In this work, the array is taken to be 16x16 processors (originally it was 8x8), and the number of processors used out of the 256 is determined by the workload requirements. The remaining RPs are put in Sleep mode. When an application is executed, the RP array is partitioned column-wise into voltage islands, with regions of high VDD, regions of low VDD, and regions in Sleep mode.
Two cases can be considered for setting the voltage masks. The first case is cycle-based voltage mask switching (CBVM), where the voltage mask is updated on a cycle-by-cycle basis. While this case may not be feasible due to the switching time overhead, it provides an upper bound on the power savings that can be reached. The second case is task-based voltage mask switching (TBVM), where the voltage mask is set before each application is executed. Instructions must be properly routed to the RPs with the specified VDD level. In addition, the needed data must move to a specific RP in order to run a specific instruction. For instance, if the multiplication requires VDDH and only 16 RPs are running at VDDH, then the data has to be transferred to these RPs to do the multiplication. The goal is to find an optimal TBVM voltage mask for a given application.
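A column-wise voltage mask and the routing constraint it imposes can be sketched as below. The representation is hypothetical: the one-character state codes, the function names, and the left-to-right ordering of `LIC_MASK` are assumptions (the actual LIC mask is the one in Figure 7), and the power estimate uses only the dynamic term with power proportional to VDD squared, so it is not expected to reproduce the per-instruction figures behind Table II.

```python
# Sketch: a 16-column voltage mask with Run-at-VDDH (H), Run-at-VDDL (L) and
# Sleep (S) states. Idle mode is left out to keep the sketch short.

VDD = {"H": 1.1, "L": 0.7}   # VDDH and VDDL selected in Section 2.2
N_COLS = 16

# Column counts follow Table I for LIC (3 at VDDH, 9 at VDDL, 4 asleep).
# The ordering is an assumption; it only places the VDDH columns between
# VDDL columns, as the text describes.
LIC_MASK = "LLLLLHHHLLLLSSSS"


def validate(mask):
    """Reject masks of the wrong width or with unknown state codes."""
    if len(mask) != N_COLS or set(mask) - set("HLS"):
        raise ValueError("mask must have 16 entries drawn from H, L, S")
    return mask


def columns_for(mask, needs_vddh):
    """Columns that can execute an instruction class under this mask.
    Assumption: a fast instruction may also run on a VDDH column."""
    allowed = "H" if needs_vddh else "HL"
    return [i for i, state in enumerate(validate(mask)) if state in allowed]


def relative_dynamic_power(mask):
    """Dynamic power relative to all 16 columns running at VDDH,
    with Sleep columns contributing zero."""
    active = sum(VDD[s] ** 2 for s in validate(mask) if s != "S")
    return active / (N_COLS * VDD["H"] ** 2)


if __name__ == "__main__":
    print("columns usable for multiply:", columns_for(LIC_MASK, needs_vddh=True))
    print("relative dynamic power:", round(relative_dynamic_power(LIC_MASK), 3))
```

In the TBVM case a single mask like this is fixed for the whole run, so any data feeding a multiply must first be moved to one of the columns returned by `columns_for(mask, True)`. In the CBVM case the mask would simply be re-evaluated every cycle.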
In order to compare different power saving schemes, a number of image processing benchmarks were manually mapped onto the RP array: lossless image compression (LIC), two-dimensional discrete cosine transform (DCT), fast Fourier transform (FFT), and two-dimensional convolution (2DC). We computed the CBVM case in order to determine the maximum possible power improvement. Then, the TBVM approach was used and compared against the CBVM approach. Ultimately, the compiler, which is part of the future work, should handle the voltage mask definition and determine whether TBVM can meet the target set by CBVM. We elaborate only on the lossless image compression (LIC) example and then show the results for the rest of the benchmarks.
In the CBVM example, we assumed that there is no transition time delay between certain states. That is, switching between VDDL, VDDH, and the Idle modes can be realized in the same cycle. This ideal case is taken as the upper bound on the power saving that can be realized for this specific benchmark. However, we assumed that the transition to and from the Sleep mode is not feasible within a cycle. The minimum number of RPs per voltage island is one column, or 16 RPs. Since the RP array has a SIMD architecture with column/row execution, it is more practical to set the minimum granularity of a voltage island to one column.
The LIC is a highly parallel application that consists of a sequence of multiplications and additions. We considered a 24-bit color BMP image. Every pixel is represented in 3 bytes, so it needs to be processed in 3 RPs. The number of pixels processed in parallel in the RP array is 4x16, occupying 12x16 RPs.
2.5 Results
In the CBVM case, the execution required only 12 clock cycles to complete (see Table II). The supply voltages were controlled column-wise between the different states on a cycle-by-cycle basis. Four columns of the RP array remained off during the operation since they were not needed by the application. Cumulatively, 108 column-cycles used VDDL and 36 used VDDH during the 12 cycles.
We now consider the TBVM case. Figure 7 shows the single mask that defines the island assignment used during the entire execution. The data bus scheme provides the first 4 columns with full connectivity to the second 4 columns, so transfers can occur in one cycle. However, the first 8 columns have limited connectivity to the other 8 columns and require two cycles to transfer data. Therefore, placing the VDDH columns in the middle is vital for the data dependency requirements. The TBVM case required 14 cycles to complete instead of the 12 cycles of the CBVM case, as shown in Table II. The power calculations are based on the results from Figure 6. VT manipulation and the idle state were not considered for these benchmarks; they will be considered in future work.
Table I shows the number of columns in each power state, with its corresponding voltage value, using TBVM for the mapped benchmarks. Table II shows the power saving results for the mapped benchmarks for both CBVM and TBVM. The initial power is based on running the algorithm without any power saving scheme; that is, the entire 16 x 16 array runs with a VDDH supply voltage for all the processors. The bounds on the energy savings are indicated by the CBVM cases. The results demonstrate that TBVM island partitioning can provide power improvements close to those of CBVM in some cases. The TBVM improvements are in the range of 25-56%, depending on the application.
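The bookkeeping behind these numbers can be checked with a few lines of arithmetic. The sketch below recomputes the LIC column-cycle count and the reduction percentages from the initial and final power values listed in Table II. The dictionary layout and function name are illustrative, and the recomputed percentages agree with the reported ones only to within a few tenths of a percentage point, presumably because the table's power values are rounded.

```python
# Sketch: consistency check of the LIC column-cycle count and of Table II.

# (application, scheme) -> (initial mW, final mW, reduction % as reported)
TABLE_II = {
    ("FFT", "CBVM"): (112, 49, 56.4),
    ("FFT", "TBVM"): (112, 49, 56.4),
    ("2DC", "CBVM"): (278, 162, 41.4),
    ("2DC", "TBVM"): (278, 207, 25.4),
    ("DCT", "CBVM"): (1406, 772, 45.1),
    ("DCT", "TBVM"): (1406, 932, 33.7),
    ("LIC", "CBVM"): (232, 130, 43.7),
    ("LIC", "TBVM"): (232, 137, 40.9),
}


def reduction_percent(initial_mw, final_mw):
    """Power reduction relative to the all-VDDH baseline, in percent."""
    return 100.0 * (1.0 - final_mw / initial_mw)


# LIC under CBVM: 4 pixel columns x 3 bytes per pixel = 12 active columns,
# each active for 12 cycles; the other 4 columns stay in Sleep mode.
LIC_ACTIVE_COLUMNS = 4 * 3
LIC_CBVM_CYCLES = 12
lic_column_cycles = LIC_ACTIVE_COLUMNS * LIC_CBVM_CYCLES

if __name__ == "__main__":
    print("LIC column-cycles:", lic_column_cycles, "(reported split: 108 + 36)")
    for (app, scheme), (initial, final, reported) in TABLE_II.items():
        computed = reduction_percent(initial, final)
        print(f"{app} {scheme}: computed {computed:.1f}%, reported {reported}%")
```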
From Table II, the CBVM and TBVM power savings for FFT are identical. The reason is that the power mask uses VDDL for all columns in both cases throughout the execution, since the instructions are mostly additions and shifts. In the case of 2DC with TBVM, only 5 out of 16 columns are used, all set to VDDH (see Table I). With CBVM, however, the same columns switch back and forth to VDDL, which is why the saving is higher. Figure 8 shows the power saving percentage in terms of dynamic and leakage power. It shows that the TBVM and CBVM leakage savings are about the same for 2DC (as 11 columns are in Sleep mode in both cases), but the dynamic saving is higher with CBVM. In the case of DCT, a similar overall result is obtained. However, Figure 8 shows that the difference between CBVM and TBVM is due mostly to leakage savings, since the dynamic savings are about the same. That is because, with TBVM, more columns are set to idle mode; in other words, more columns are leaking. Finally, the LIC shows little difference between the CBVM and TBVM schemes because there is less idle time.
The results in Table II show that TBVM achieves power savings close to those of CBVM. Considering the time and power overhead of cycle-by-cycle mask switching, TBVM seems to be the sensible solution, providing acceptable power saving with less overhead. Thus, we discuss the switching power and timing overhead for TBVM only, as it is the preferred solution. The timing and power overhead due to ON-OFF switching of cores can be considerable; however, the overhead of switching between the two supply levels, VDDH and VDDL, is much smaller [15][16][17]. In the TBVM case, voltage mask switching between different VDD levels takes place only at the start of the execution, and the mask changes only when switching to a different application.
We consider this time and power switching overhead to be four cycles ahead of the application execution [15][16]. Table III shows the TBVM results including switching power.
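The figures in Table III follow from adding the switching overhead to the TBVM power and comparing the total against the initial power from Table II. The short sketch below reproduces that arithmetic; the data layout and function name are illustrative, and the values are copied from Tables II and III.

```python
# Sketch: recompute the Table III savings from the TBVM power, the switching
# overhead, and the initial (all-VDDH) power of Table II.

# application -> (initial mW, TBVM power mW, switching overhead mW, reported saving %)
TABLE_III = {
    "FFT": (112, 49, 49, 12.5),
    "2DC": (278, 207, 34.5, 13.13),
    "DCT": (1406, 932, 35.8, 31.16),
    "LIC": (232, 137, 39.14, 24.1),
}


def saving_with_overhead(initial_mw, tbvm_mw, overhead_mw):
    """Percentage saving once the switching overhead is charged to TBVM."""
    total = tbvm_mw + overhead_mw
    return 100.0 * (1.0 - total / initial_mw)


if __name__ == "__main__":
    for app, (initial, tbvm, overhead, reported) in TABLE_III.items():
        computed = saving_with_overhead(initial, tbvm, overhead)
        print(f"{app}: total {tbvm + overhead:.2f} mW, "
              f"saving {computed:.2f}% (reported {reported}%)")
```

The overhead weighs most heavily on the short benchmarks: for FFT, which runs for only 4 cycles, a four-cycle switching overhead doubles the consumed power and reduces the saving from 56.4% to 12.5%.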
3 CONCLUSIONS
In this paper, we presented new instruction-based voltage scaling and column-wise voltage-island partitioning techniques for SIMD MPSoC designs. We determined the voltage needed for each instruction to operate within the system clock cycle. To estimate the possible power savings, we mapped a group of image processing benchmarks onto the RP array and compared the CBVM and TBVM approaches. TBVM seems to provide acceptable power savings with less timing and power switching overhead than CBVM. The TBVM power savings are in the range of 12-24%.
REFERENCES
[1] A. Coskun, T. Simunic Rosing, K. Mihic, G. De Micheli, Y. Leblebici, "Analysis and Optimization of MPSoC Reliability," Journal of Low-Power Electronics, April 2006.
[2] N. Eisley, V. Soteriou, L. Peh, "High-Level Power Analysis for Multi-core Chips," Int. Conf. on Compilers, Architecture and Synthesis for Embedded Systems, Seoul, Korea, Oct. 2006.
[3] M. Ruggiero, A. Acquaviva, D. Bertozzi, L. Benini, "Application-Specific Power-Aware Workload Allocation for Voltage Scalable MPSoC Platforms," Proc. of the Inter. Conf. on Computer Design, San Jose, CA, USA, Oct. 2005.
[4] "Power Islands: The Evolving Topology of SoC Power Management," Virtual Silicon Technology, Inc., Design & Reuse Industry Articles, 2004, pp. 1-5.
[5] K. Srinivasan, N. Telkar, V. Ramamurthi, K.S. Chatha, "System-Level Design Techniques for Throughput and Power Optimization of Multiprocessor SoC Architectures," Proc. of the IEEE Computer Society Annual Symposium on VLSI, Lafayette, LA, USA, Feb. 2004.
[6] T. Theocharides, M.K. Michael, M. Polycarpou, A. Dingankar, "A Novel System-Level On-Chip Resource Allocation Method for Manycore Architectures," Proceedings of the 2008 IEEE Computer Society Annual Symposium on VLSI, 2008, pp. 99-104.
[7] R. Mishra, N. Rastogi, D. Zhu, D. Mossé, R. Melhem, "Energy Aware Scheduling for Distributed Real-Time Systems," The 17th Inter. Symposium on Parallel and Distributed Processing, Nice, France, April 2003.
[8] N. Bambha, S. Bhattacharyya, J. Teich, E. Zitzler, "Hybrid Global/Local Search Strategies for Dynamic Voltage Scaling in Embedded Multiprocessors," The 9th Inter. Symposium on Hardware/Software Codesign, Copenhagen, Denmark, April 2001.
[9] A. Rae, S. Parameswaran, "Voltage Reduction of Application-Specific Heterogeneous Multiprocessor Systems for Power Minimization," ASP-DAC, Yokohama, Japan, Jan. 2000.
[10] L. Yan, J. Luo, N. Jha, "Combined Dynamic Voltage Scaling and Adaptive Body Biasing for Heterogeneous Distributed Real-time Embedded Systems," Inter. Conf. on Computer-Aided Design, San Jose, CA, USA, Nov. 2003.
[11] H. Singh, G. Lu, M. Lee, F. Kurdahi, N. Bagherzadeh, "MorphoSys: Case Study of A Reconfigurable Computing System Targeting Multimedia Applications," Design Automation Conference, Los Angeles, CA, USA, June 2000.
[12] K. Agarwal, K. Nowka, "Dynamic Power Management by Combination of Dual Static Supply Voltages," Int. Symposium on Quality Electronic Design (ISQED'07), 2007.
[13] S. Yuan, "Reconfigurable System Emulator Software Specification," Technical Report, ECE Dept., University of British Columbia, Vancouver, Canada, 2006.
[14] P. Rong, M. Pedram, "Power-Aware Scheduling and Dynamic Voltage Setting for Tasks Running on a Hard Realtime System," Proc. of the ASP-DAC, Yokohama, Japan, Jan. 2006.
[15] A. Andrei, M. Schmitz, P. Eles, Z. Peng, B.M. Al-Hashimi, "Overhead-conscious voltage selection for dynamic and leakage energy reduction of time-constrained systems," Design, Automation, and Test in Europe (DATE), 2005.
[16] B. Mochocki, X.S. Hu, G. Quan, "Transition-overhead-aware voltage scheduling for fixed-priority real-time systems," ACM Transactions on Design Automation of Electronic Systems, Vol. 12, no. 2, April 2007.
[17] R. Peng, M. Pedram, "Energy-Aware Task Scheduling and Dynamic Voltage Scaling in a Real-Time System," Journal of Low Power Electronics, Vol. 4, no. 1, April 2008, pp. 1-10.
[18] S. Majzoub, "Voltage Island Design in Multi-Core SIMD Processors," The 5th International Design and Test Conference (IDT'10), pp. 18-23, Dec. 14-15, 2010, Abu Dhabi, UAE.
FIGURES AND TABLES
Figure 1. An illustration of Fast/Slow instruction profiling
Figure 2. MorphoSys Block Diagram [block labels recovered from the figure: MUX_A, MUX_B, Constant, ALU+MULT, SHIFT, o/p REG, FLAG, R0-R3, FB, Other RCs]
Figure 3. Instruction Delay vs. Instruction
Figure 4. The minimum required voltage to meet the cycle time for each instruction
Figure 5. Scaling voltages of all instructions to equalize the delay.
Figure 6. Power after instruction voltage scaling.
Figure 7. Voltage Mask for the TBVM case (Lossless Image Compression)
Figure 8. Power saving percentage for the mapped benchmarks

TABLE I
COLUMNS' VOLTAGE VALUE IN THE VOLTAGE MASK IN CASE OF TBVM

Application                        Columns at VDDH   Columns at VDDL   Columns in Sleep mode
FFT                                0                 16                0
2D Convolution (2DC)               5                 0                 11
DCT                                3                 9                 4
Lossless Image Compression (LIC)   3                 9                 4

TABLE II
RESULTS OF IMAGE HANDLING BENCHMARKS

Application                  Scheme   Number of cycles   Initial Power   Final Power   Total Power Reduction
FFT                          CBVM     4                  112 mW          49 mW         56.4%
FFT                          TBVM     4                  112 mW          49 mW         56.4%
2D Convolution               CBVM     24                 278 mW          162 mW        41.4%
2D Convolution               TBVM     24                 278 mW          207 mW        25.4%
DCT                          CBVM     80                 1406 mW         772 mW        45.1%
DCT                          TBVM     104                1406 mW         932 mW        33.7%
Lossless Image Compression   CBVM     12                 232 mW          130 mW        43.7%
Lossless Image Compression   TBVM     14                 232 mW          137 mW        40.9%

TABLE III
FINAL POWER SAVING INCLUDING SWITCHING OVERHEAD IN CASE OF TBVM

Application                        Power Consumed   Switching Overhead   Total Power Consumed   % Power Saving
FFT                                49 mW            49 mW                98 mW                  12.5%
2D Convolution (2DC)               207 mW           34.5 mW              241.5 mW               13.13%
DCT                                932 mW           35.8 mW              967.8 mW               31.16%
Lossless Image Compression (LIC)   137 mW           39.14 mW             176.1 mW               24.1%
BIOGRAPHY
Sohaib Majzoub received his B.E. degree in Electrical Engineering from Beirut Arab University and his M.E. degree in Computer and Communication Engineering from the American University of Beirut, Beirut, Lebanon, in 2000 and 2003, respectively. He then worked for one year at the Processor Architecture Lab, as part of the Pre-Doctoral Fellowship Program at the Swiss Federal Institute of Technology (EPFL), Lausanne, Switzerland. He then joined the PhD program at the University of British Columbia, Vancouver, Canada, and received his degree in 2010. His research was a joint work between the System-on-Chip and Image Processing labs. He is currently an assistant professor at the American University in Dubai, UAE. His main interest is performance and power optimization for Multiprocessor System-on-Chip.