2014 7th International Symposium on Telecommunications (IST'2014)
Estimating Application Workload Using Hardware Performance Counters in Real-time Video Encoding

Siamak Rasoolzadeh
Multimedia Processing Laboratory, School of Electrical and Computer Engineering, College of Engineering, University of Tehran
[email protected]

Morteza Saedpanah
Multimedia Processing Laboratory, School of Electrical and Computer Engineering, College of Engineering, University of Tehran
[email protected]

Mahmoud Reza Hashemi
Multimedia Processing Laboratory, School of Electrical and Computer Engineering, College of Engineering, University of Tehran
[email protected]

978-1-4799-5359-2/14/$31.00 ©2014 IEEE

Abstract: In recent years, cloud computing has emerged as a viable alternative for many computationally intensive applications. Offloading an application to the cloud has many advantages, but power consumption is still an important concern. Service providers should minimize power while maintaining their customers' quality-of-service requirements. Dynamic Voltage and Frequency Scaling (DVFS) is an effective method to optimize power consumption by minimizing the processor speed to a level that satisfies the application constraints. In applications such as real-time video encoding, where computational complexity depends on the video content, it is important to be able to adjust voltage and frequency dynamically. Incorrect DVFS may cause over-provisioning of resources or missed application deadlines. In other words, one of the key challenges of DVFS is the accuracy of workload estimation. In this paper, we propose a workload estimation method using low-level Hardware Performance Counters (HPCs) for content-dependent multimedia applications. The proposed technique has been used to estimate the workload of an H.264/AVC video encoder at the Group-Of-Pictures (GOP) level. Simulation results indicate that 23% energy saving can be achieved in comparison with the ondemand frequency setting policy used in Linux.

Keywords: Workload Prediction, Power Consumption, DVFS, Hardware Performance Counters

I. INTRODUCTION
In recent years, cloud computing has been widely adopted as a new way of delivering Internet-based, on-demand IT services to end users. In this model, user applications run on Virtual Machines (VMs) that are located on distributed servers, and users access these computing resources through the Internet. Computationally intensive real-time multimedia applications, such as video applications, are among the most popular applications that have benefited from this new paradigm. It is anticipated that two-thirds of Internet traffic will be video traffic by 2015 [1].

Efficiently managing resources in terms of energy consumption is one of the major challenges of cloud computing. The CPU is one of the main energy-consuming devices in a server, and its energy consumption depends on the voltage and frequency at which it operates. In CMOS circuits, the energy consumption of a processor is quadratically proportional to the operating frequency [2]. Therefore, reducing the frequency level, known as Dynamic Voltage and Frequency Scaling (DVFS), can have a significant impact on the energy consumption of a processor. In DVFS, the processing speed is minimized to a level where the workload timing and throughput constraints are still met.

One of the key challenges of any DVFS implementation is future workload estimation. Workload estimation is more important when the application's computational requirements change with time and with content, and even more so when a cloud service provider has to guarantee a contracted Service Level Agreement. In real-time video encoding applications, the encoder has to meet predefined deadlines at the frame or Group-Of-Pictures (GOP) level. Estimating the number of clock cycles needed for each task then becomes one of the main objectives of DVFS in video encoders. Many conventional DVFS methods, such as the ondemand method used in Linux, cannot guarantee efficient frequency adjustment since they only consider the CPU utilization level. This may lead to under- or overestimation of the workload.

Considering the time it takes to change the frequency level in DVFS, which is on the order of microseconds, and the power needed in voltage regulators for frequency transitions, selecting the appropriate task granularity is important for DVFS efficiency. Our simulation results indicate that the computational complexity of successive GOPs changes gradually for natural videos. Hence, applying DVFS at the GOP level leads to less frequent changes and therefore lower power consumption in voltage regulators.

Most recent processors have Hardware Performance Counters (HPCs) to monitor micro-architectural events while running workloads. Hence, a low-complexity yet accurate workload estimation can be performed at the OS level by gathering the HPC values for each processor.
HPC statistics can help to characterize the workload in real time, and can be used to predict the application behavior on each processor for efficient application of DVFS. One of the main challenges in this approach is selecting the proper HPCs. In this paper, we propose an HPC-based workload estimation method for video encoding applications that leverages HPC statistics to detect GOP borders and to estimate the number of clock cycles needed for the next GOP.

The rest of this paper is organized as follows. Section II discusses related work. The proposed workload estimation is presented in Section III, followed by experimental results in Section IV. Finally, Section V concludes the paper.

II. RELATED WORKS
Interval-based workload estimation has been the subject of much research in recent years. One of the earliest works [3] monitors the CPU state in consecutive windows, calculates the non-idle cycles of the current interval to predict future ones, and adjusts the frequency accordingly. The PAST algorithm in [4] seems to be practical for real-time applications. It predicts the future workload to be equal to the last workload, without any knowledge of the future. The PEAK method in [5] estimates the future workload to be equal to the minimum or maximum level of the last workload. The Moving Average (MA) method in [6] estimates the workload of the next interval as the average of the workloads of a few previous intervals. In Exponentially Weighted Average (EWA) [7], the upcoming workload is estimated as a weighted sum of prior intervals, where the weights of previous intervals are determined based on their correlation with future intervals.

Much of the research on workload estimation for multimedia applications is restricted to video decoding. One group of these works, known as profile-based estimation, extracts bitstream information from encoded frames by separating entropy decoding from the other components [8]. Considering the linear relation between the decoding time and the encoded frame length, the retrieved information is used to calculate the decoding time of each frame. This type of workload estimation is performed at the macroblock (MB) level [9], frame level [8] and GOP level [10]. In [11], a comparison between these three levels showed that frame-level workload estimation has the lowest estimation error. Although most of these methods report reasonably high estimation accuracy, they suffer from two problems. First, since the entropy decoder is separated, the DVFS technique cannot be applied to this process. Second, separating the entropy decoder is not feasible for most existing decoder implementations.

Another set of works on video decoder workload estimation estimates the complexity of the next frame as a function of previous ones. In [12], the decoding time is divided into two sub-workloads: on-chip and off-chip accesses. For on-chip access, the workload is estimated based on the history of previous frames, and its estimation error is compensated in the upcoming frames. For off-chip access, the lowest possible frequency is assigned.

Given the multi-core trend in most computers, real-time applications are divided into threads, and operating systems should schedule threads on cores that can satisfy their timing constraints. Application-level workload estimation cannot be used by operating systems to schedule threads in real time. Therefore, we propose to use hardware-level statistics for workload estimation. These statistics are gathered through registers known as Hardware Performance Counters (HPCs). Some studies ([13], [14]) investigated HPCs to model the power consumption of applications. Their objective was to find the correlation between power consumption and the hardware resources used, as reported by HPCs. Another work [15] used HPCs to identify the execution behavior of applications and used the results for efficient DVFS. In the next section, we propose our methodology to use HPCs for workload estimation in video coding applications at the GOP level.

III. PROPOSED METHOD FOR WORKLOAD ESTIMATION IN VIDEO CODING APPLICATIONS
In this section, we present our proposed methodology to estimate the computational complexity of an H.264/AVC video encoder at the GOP level in order to use this estimation for DVFS.

A. Framework Overview
Figure 1 illustrates our framework. At the top of this framework, a real-time H.264 encoder application is running. At the next level, the OS schedules the threads of applications on the processors at the hardware level. The main challenge is to determine the resource requirements of the application and to allocate resources in such a way that H.264 encoding is performed in real time and the power consumption is minimized.

Figure 1: Framework overview for an H.264/AVC video encoder

As mentioned before, one of the methods to optimize power consumption is DVFS, where the current voltage and frequency are set based on the workload of the previous interval. In this approach, the accuracy and complexity of the estimation algorithm are key. In this paper, we aim to estimate this workload efficiently, yet accurately, using low-level hardware performance counters. In the next subsection, based on our preliminary simulations, we not only show that this is possible but also determine the proper interval for an H.264/AVC video encoder.

B. Observations
In most predictive video compression standards, such as H.264/AVC, frames are categorized into Intra-coded frames (I-frames) and Inter-coded frames (P-frames). I-frames are encoded internally and independently of other frames, while P-frames are encoded with reference to their previous frames. To reduce drift and increase robustness to errors, an I-frame is inserted every few frames. The distance between two consecutive I-frames is referred to as a GOP.

As will be demonstrated later in this sub-section, our experimental results indicate that estimating the workload and setting the DVFS values accordingly at the GOP level results in reasonably accurate workload estimation and relatively good energy saving. This is mostly because the computational complexity of the current GOP is generally not very different from that of its preceding GOPs.

In order to estimate the workload of a GOP, one needs to detect GOP borders correctly. Our preliminary observations indicate that the borders of GOPs can be detected using Hardware Performance Counters (HPCs), which are available in most modern processors. HPCs are the basic elements of the Performance Monitoring Unit (PMU) of contemporary processors. They monitor the occurrence of events in microprocessors in real time without a performance penalty.
We tested our framework with x264 [16], one of the most efficient software implementations of the H.264/AVC encoder. Table 1 shows the main encoder settings. Our testbed includes an Intel Core i7 4770 processor with the Haswell micro-architecture, 16 GB of RAM, and a Linux OS with kernel 2.3.36.

Table 1: x264 encoding parameters in observation

| Parameter | Value |
|---|---|
| Number of threads | 1 |
| Profile | Baseline |
| Frames per second | 30 |
| GOP size | 10 |
| Input video resolution | 352x288 |

We used the perf_event tool in the Linux kernel to read the hardware performance counters of our processor. We first define the activity-ratio as the ratio between the number of event occurrences in the monitoring interval and the duration of the interval in clock cycles. We ran our simulations for three different videos (parkrun, football and foreman) and observed that the activity-ratio of the L1-DCache-hit event differs for I- and P-frames. The activity-ratio for the parkrun video is shown in Figure 2. The activity-ratio for the other two tested videos shows the same behavior.

Figure 2: Activity ratio of L1-DCache-hit event for parkrun video

As illustrated, in I-frames there is a sudden drop in the activity-ratio of this event, while it is fairly constant during P-frames. Using the above observations, we explain the proposed workload estimation technique in the next subsection.

C. Workload Estimation Methodology for H.264 Encoder
In this sub-section, we describe the proposed workload estimation methodology. The first step of our methodology is shown in Figure 3. At this step, the designer should experimentally determine the HPCs that can detect the GOP borders most accurately. If the activity-ratio of an event is different for I- and P-frames and it demonstrates the same behavior for the majority of I- and P-frames, then the examined event is added to an array of events that can be used for GOP border estimation.

Figure 3: Event selection algorithm for GOP border estimation

The final result of this first step is a set of micro-architectural events that can determine the GOP borders. Note that our methodology depends on the processor micro-architecture, so the initial event list must be defined for each specific processor. The HPCs found in this step are then monitored at run-time to detect GOP borders and to estimate the complexity of the next GOP.

In the second step of our methodology, the computational complexity of the next GOP is estimated based on the history of previous GOPs. The simplest way is to estimate the complexity of the next GOP to be the same as that of its previous GOP. Any other estimation algorithm, such as the Moving Average (MA), can also be applied in this step. After estimating the complexity of the next GOP, we calculate the frequency needed to satisfy its timing constraints. Mispredictions should be compensated in upcoming GOPs. The frequency is calculated as in (1).

$$f_{next} = \frac{(1 + e) \, C_{next} \ast R}{N} \qquad (1)$$

where $f_{next}$ is the next GOP frequency, $C_{next}$ is the estimated complexity of the next GOP in cycles, $R$ is the rate at which the video should be encoded in frames per second, $N$ is the size of each GOP, and $e$ is the relative estimation error, calculated from the measured complexity $C_{prev}^{act}$ and the estimated complexity $C_{prev}^{est}$ of the previous GOP based on equation (2).

$$e = \frac{C_{prev}^{act} - C_{prev}^{est}}{C_{prev}^{est}} \qquad (2)$$
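As a rough illustration of equations (1) and (2), the sketch below estimates the complexity of the next GOP and converts it into a required frequency. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names and the sample cycle counts are hypothetical, and the error is assumed to be the misprediction of the previous GOP normalized by its estimated complexity.

```python
# Hypothetical helper names; the cycle counts below are made-up sample values.


def estimate_next_gop(history, window=1):
    """Estimate the next GOP complexity (cycles) from measured GOP complexities.

    window=1 repeats the last GOP (the simplest estimator in the text);
    window>1 gives a moving average over the last `window` GOPs.
    """
    recent = history[-window:]
    return sum(recent) / len(recent)


def relative_error(actual_cycles, estimated_cycles):
    """Relative misprediction of an already encoded GOP (assumed form of e)."""
    return (actual_cycles - estimated_cycles) / estimated_cycles


def required_frequency(estimated_cycles, frame_rate, gop_size, error=0.0):
    """Frequency in Hz that finishes one GOP within gop_size / frame_rate seconds."""
    return (1.0 + error) * estimated_cycles * frame_rate / gop_size


# Measured complexity of the two GOPs encoded so far, in clock cycles.
history = [6.0e8, 6.6e8]
# The second GOP had been predicted to cost as much as the first one.
e = relative_error(actual_cycles=history[1], estimated_cycles=history[0])
c_next = estimate_next_gop(history)  # last-value estimate
f_next = required_frequency(c_next, frame_rate=30, gop_size=10, error=e)
print(f"e = {e:.2f}, f_next = {f_next / 1e9:.3f} GHz")
```

With these sample values, the previous GOP was underestimated by 10%, so the frequency requested for the next GOP is raised by the same factor, to roughly 2.18 GHz. Passing `window=2` to `estimate_next_gop` switches the estimator to a two-GOP moving average.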
After calculating the frequency of the next GOP, we select a frequency level, from the list of levels available on the specific processor, that is equal to or greater than the calculated frequency.

IV. EXPERIMENTAL RESULTS
In this section, we evaluate the proposed workload estimation technique.

A. Evaluation Platform
Our test system is equipped with an Intel Core i7 4770 processor with 4 physical cores, frequency levels as listed in Table 2, 64 KB of private L1 cache, 256 KB of L2 cache, 8 MB of shared L3 cache, and 16 GB of RAM. The OS was Linux kernel 2.3.36, and we used the perf_event tool to monitor hardware performance counters.

Table 2: Frequency levels of the Core i7 4770 processor

| Level | Frequency (GHz) | Level | Frequency (GHz) |
|---|---|---|---|
| 1 | 0.8 | 8 | 2.3 |
| 2 | 1 | 9 | 2.5 |
| 3 | 1.2 | 10 | 2.7 |
| 4 | 1.5 | 11 | 2.8 |
| 5 | 1.7 | 12 | 3.0 |
| 6 | 1.9 | 13 | 3.2 |
| 7 | 2.1 | 14 | 3.4 |

We used the x264 software to encode six CIF-size video sequences with the configuration parameters listed in Table 3.

Table 3: x264 encoding parameters

| Parameter | Value |
|---|---|
| Number of threads | 1 |
| Profile | Baseline |
| Frames per second | 30 |
| GOP size | 10 |
| Input video resolution | 352x288 |
| Number of reference frames | 1 |
| Motion estimation range | +/- 16 |

B. Results and Discussion
Experimental results show that some of the performance counters exhibit the periodic behavior of video encoding at the GOP level. Figure 4 shows the activity-ratio of the three main events for a few consecutive GOPs. As the results show, the L1 cache activity-ratio differs in I- and P-frames, and hence can be used to signal the start and end of a GOP. This was expected, since P-frames use motion estimation, where the search window is mostly stored in cache, as opposed to I-frames, where there is no data reuse. The other events reported in Figure 4 show the same behavior and can be used to detect GOP borders. This is once again due to the nature of the motion estimation algorithm in P-frames.

Figure 4: Activity-ratio of (a) L1-DCache-hit, (b) Execution-port, and (c) Memory-port events for GOP border estimation
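The border detection described above can be sketched as follows. This is a minimal sketch, assuming one activity-ratio sample per encoded frame and a simple threshold relative to the median; the function names, the `drop_fraction` parameter and the sample counts are hypothetical and are not taken from the paper.

```python
from statistics import median


def activity_ratio(event_count, cycles):
    """Event occurrences per clock cycle over one monitoring interval."""
    return event_count / cycles


def detect_gop_borders(ratios, drop_fraction=0.8):
    """Return the indices of intervals whose activity-ratio drops suddenly.

    The median serves as the typical P-frame level because P-frames
    dominate a GOP; drop_fraction is a hypothetical tuning parameter.
    """
    threshold = drop_fraction * median(ratios)
    return [i for i, r in enumerate(ratios) if r < threshold]


# Made-up (event_count, cycles) samples, one per encoded frame, for two GOPs
# of 10 frames each: the first frame of each GOP (the I-frame) shows the drop.
samples = [(300, 1000)] + [(500, 1000)] * 9 + [(310, 1000)] + [(490, 1000)] * 9
ratios = [activity_ratio(count, cycles) for count, cycles in samples]
borders = detect_gop_borders(ratios)
print(borders)
```

The distance between consecutive detected indices gives the GOP length in frames, and the cycles accumulated between two borders give the measured complexity of that GOP.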
Figure 5 shows the activity-ratio of the offcore-requests event, for which there is no significant difference between P- and I-frames. Events such as this are omitted, since they cannot be used in our workload estimation at the GOP level.

After event selection for GOP border recognition, we monitored the L1-DCache-hit event while running the x264 application on six video sequences. In the next step, we estimated the complexity of the next GOP to be equal to the complexity of the current GOP. After complexity estimation, we calculated the frequency for the next GOP based on equation (1) and Table 2. Table 4 shows the estimation errors and the results of applying our methodology to six CIF-size video sequences. For the energy saving column, we compared the energy consumption of our method with that of the ondemand policy of the Linux OS, which only considers the CPU utilization level for frequency adjustment.
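The mapping from a calculated frequency to one of the discrete levels of Table 2 can be sketched as below. The frequency values are those of Table 2; the function name is hypothetical, and picking the lowest sufficient level (and saturating at the highest level when none is sufficient) is an assumption consistent with the goal of minimizing the processor speed.

```python
from bisect import bisect_left

# Frequency levels of Table 2 in GHz, in ascending order (level 1 to level 14).
LEVELS_GHZ = [0.8, 1.0, 1.2, 1.5, 1.7, 1.9, 2.1, 2.3, 2.5, 2.7, 2.8, 3.0, 3.2, 3.4]


def select_frequency_level(required_ghz, levels=LEVELS_GHZ):
    """Return (level number, frequency) of the lowest level >= required_ghz.

    If the requirement exceeds the highest level, the highest level is
    returned; in that case the GOP deadline may be missed.
    """
    idx = bisect_left(levels, required_ghz)
    if idx == len(levels):
        idx = len(levels) - 1
    return idx + 1, levels[idx]


# Hypothetical required frequency of 2.178 GHz for the next GOP.
print(select_frequency_level(2.178))
```

For the hypothetical requirement of 2.178 GHz, the sketch selects level 8 (2.3 GHz), the first entry of Table 2 that is not below the requirement.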
Figure 5: Activity ratio of offcore-requests event

Table 4: Simulation results

| Video sequence | Energy saving | GOP miss rate | Minimum estimation error | Maximum estimation error | Average estimation error |
|---|---|---|---|---|---|
| Football | 19% | 9% | 0.04% | 12.9% | 4.4% |
| Foreman | 27% | 8% | 0.2% | 12.2% | 3.5% |
| Bowing | 21% | 9% | 0.1% | 6.6% | 2.4% |
| Akiyo | 26% | 7% | 0.03% | 9.5% | 3% |
| Bus | 18% | 11% | 0.1% | 6.2% | 2.1% |
| City | 27% | 12% | 0.06% | 2.2% | 0.77% |
| Average | 23% | 9% | 0.1% | 8.3% | 2.7% |
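The Average row of Table 4 can be recomputed from the per-sequence rows with a few lines of code. The sketch below uses only the values printed in Table 4; the variable and function names are hypothetical.

```python
# Rows of Table 4, all values in percent:
# (energy saving, GOP miss rate, min, max, average estimation error)
TABLE_4 = {
    "Football": (19, 9, 0.04, 12.9, 4.4),
    "Foreman": (27, 8, 0.2, 12.2, 3.5),
    "Bowing": (21, 9, 0.1, 6.6, 2.4),
    "Akiyo": (26, 7, 0.03, 9.5, 3.0),
    "Bus": (18, 11, 0.1, 6.2, 2.1),
    "City": (27, 12, 0.06, 2.2, 0.77),
}


def column_means(table):
    """Arithmetic mean of every column over all video sequences."""
    rows = list(table.values())
    return [sum(column) / len(rows) for column in zip(*rows)]


means = column_means(TABLE_4)
labels = ["energy saving", "GOP miss rate", "min error", "max error", "avg error"]
for label, value in zip(labels, means):
    print(f"{label}: {value:.2f}%")
```

After rounding, the computed means agree with the Average row of Table 4 (23%, 9%, 0.1%, 8.3% and 2.7%).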
It should be mentioned that, while our method achieves 23% energy saving on average in comparison with the ondemand policy, mispredictions caused 9% of all GOPs to miss their deadlines.

V. CONCLUSION
In this paper, we proposed a generic approach for GOP-level workload estimation in video coding applications using hardware performance counters. Experimental results show that, because motion estimation is applied in P-frames, we can find events that distinguish the different behavior of the encoding processes of I- and P-frames. Applying the proposed methodology showed that the activity-ratios of the L1-DCache, execution-port, and memory-port events can be used to find GOP borders. Applying a simple algorithm that estimates the complexity of the next GOP based on the complexity of the previous GOP achieved 23% energy saving in comparison with ondemand frequency selection.

REFERENCES
[1] "Visual networking index: Global mobile data traffic forecast update, 2010-2015," White paper, Cisco Systems, Inc., February 2011, available online.
[2] T. Burd and R. Brodersen, "Processor design for portable systems," The Journal of VLSI Signal Processing, vol. 13, no. 2, pp. 203–221, 1996.
[3] M. Weiser, B. Welch, A. Demers, and S. Shenker, "Scheduling for reduced CPU energy," in Proc. 1st USENIX Symp. Oper. Syst. Des. Implement., Nov. 1994, pp. 13–23.
[4] D. Grunwald, P. Levis, K. I. Farkas, C. B. Morrey, III, and M. Neufeld, "Policies for dynamic clock scheduling," in Proc. 4th Symp. Oper. Syst. Des. Implement., vol. 4, Oct. 2000, pp. 73–86.
[5] K. Govil, E. Chan, and H. Wasserman, "Comparing algorithms for dynamic speed-setting of a low power CPU," in Proc. 1st ACM Int. Conf. Mobile Comput. Netw., 1995, pp. 13–25.
[6] T. Pering, T. Burd, and R. Brodersen, "The simulation and evaluation of dynamic voltage scaling algorithms," in Proc. Int. Symp. Low Power Electron. Des., 1998, pp. 76–81.
[7] A. Sinha and A. P. Chandrakasan, "Dynamic voltage scheduling using adaptive filtering of workload traces," in Proc. 14th Conf. VLSI Des., Jan. 2001, pp. 221–226.
[8] A. Bavier, A. Montz, and L. Peterson, "Predicting MPEG execution times," in Proc. Int. Conf. Measure. Model. Comput. Syst. SIGMETRICS/Performance, 1998, pp. 131–140.
[9] Y. Huang, V. A. Tran, and Y. Wang, "A workload prediction model for decoding MPEG video and its application to workload-scalable transcoding," in Proc. 15th ACM MM, Sep. 2007, pp. 952–961.
[10] D. Son, C. Yu, and H. N. Kim, "Dynamic voltage scaling on MPEG decoding," in Proc. ICPADS, 2001, pp. 633–640.
[11] E. Nurvitadhi, B. Lee, C. Yu, and M. Kim, "A comparative study of dynamic voltage scaling techniques for low-power video decoding," in Proc. Int. Conf. Embed. Syst. Appl., 2003, pp. 633–640.
[12] K. Choi, R. Soma, and M. Pedram, "Off-chip latency-driven dynamic voltage and frequency scaling for an MPEG decoding," in Proc. DAC, 2004, pp. 544–549.
[13] G. D. Costa and H. Hlavacs, "Methodology of measurement for energy consumption of applications," in GRID, IEEE, 2010, pp. 290–297.
[14] K. Singh, M. Bhadauria, and S. A. McKee, "Real time power estimation and thread scheduling via performance counters," SIGARCH Comput. Archit. News, vol. 37, pp. 46–55, 2009.
[15] C. Isci, G. Contreras, and M. Martonosi, "Live, runtime phase monitoring and prediction on real systems with application to dynamic power management," in Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 39), IEEE Computer Society, Washington, DC, USA, 2006, pp. 359–370.
[16] VideoLAN Organization, "x264." [Online]. Available: http://www.videolan.org/developers/x264.html