An Optimal Warning-Zone-Length Assignment Algorithm for Real-time and Multiple-QoS On-Chip Bus Arbitration

Huan-Kai Peng and Youn-Long Lin
Department of Computer Science, National Tsing Hua University
HsinChu, Taiwan 300

In an advanced System-on-Chip (SoC) for real-time applications, the arbiter of its on-chip communication sub-system needs to support multiple QoS criteria while providing a hard real-time guarantee. To fulfill both objectives, the arbitration algorithm must dynamically switch between non-real-time (NRT) and real-time (RT) modes such that use of the RT mode is minimized to best accommodate the overall QoS criteria. In this paper, we define a model for this problem and propose optimal solutions to its associated problems with static and dynamic warning-zone-length assignment. Compared with previous works, the proposed approach enables a bus arbiter to use the RT mode much less in providing a real-time (RT) guarantee and, therefore, gives the arbiter more opportunity to employ non-RT modes to achieve better overall QoS. Experimental results show that the proposed approach reduces RT-mode usage by as much as 37.1%. Moreover, that reduction in RT-mode usage helps cut execution time by 27.0% when applying our approach to an industrial DRAM controller. Another case study on an AMBA-compliant ultra-high-resolution H.264 decoder IP shows that the proposed approach reduces RT-mode usage by 26.4%, which leads to an average reduction of 10.4% in decoding time. Finally, when implementing a 16-master arbiter, the proposed static and dynamic approaches cost only 6.9K and 9.5K gates of overhead, respectively. Therefore, the proposed approach is suitable for real-time SoC applications.
Categories and Subject Descriptors: C3 [Computer Systems Organization]: Special-Purpose and Application-Based Systems — Real-time and embedded systems; J6 [Computer Applications]: Computer-Aided Engineering — Computer-aided design (CAD) General Terms: Algorithms, Performance Additional Key Words and Phrases: System-on-Chip, QoS, Real-time Scheduling, On-chip Communication

This work was supported in part by National Science Council of Taiwan (NSC-96-2220-E-007-013, NSC-96-2220-E-007-019) and the Ministry of Economic Affairs of Taiwan (96-EC-17-A-01-S1-038).
Permission to make digital/hard copy of all or part of this material without fee for personal or classroom use provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee. © 2009 ACM 1529-3785/2009/0700-0001 $5.00

ACM Transactions on Computational Logic, Vol. V, No. N, July 2009, Pages 1–App–7.

1. INTRODUCTION

Advances in semiconductor manufacturing technology make it feasible to integrate an increasing number of Intellectual Properties (IP’s) and shared resources onto a single System-on-Chip (SoC). Because these components behave heterogeneously, they require an on-chip communication sub-system to support them with multiple Quality-of-Service (QoS) criteria [Goossens 2004]. Examples include latency minimization [Lahiri et al. 2006], bandwidth allocation [Lin et al. 2007], and resource utilization improvement [Lee et al. 2005]. Meanwhile, embedded systems often have real-time I/O’s, which function incorrectly without a real-time guarantee. For these systems, the real-time guarantee must be satisfied, and all other QoS criteria must be supported under the condition that the real-time guarantee is strictly guarded. Examples of such systems include multimedia players, cellular phones, medical devices, radar, and flight/missile control systems.

Supporting a real-time guarantee in an increasingly complex multiple-QoS SoC is a problem unlike traditional ones. Traditional real-time research employs a single-mode arbitration algorithm to ensure a real-time guarantee as the only QoS criterion. Multiple-QoS applications, however, require a hybrid arbitration algorithm that switches between real-time (RT) and non-real-time (NRT) modes to support both a real-time guarantee and a number of non-real-time QoS criteria. More aggressively, we would like to maximize the flexibility of NRT-mode arbitration by meeting all real-time constraints with minimum RT-mode usage.

A number of on-chip communication research studies have tried to tackle this problem by reducing bus latencies under multiple QoS requirements. Lahiri et al. [2006] propose LOTTERYBUS to provide bandwidth control with reduced average latency for the highest-priority master; Lu and Koh [2005] propose SAMBA-BUS to reduce all bus masters’ average latency while improving bus bandwidth utilization; Richardson et al. [2006] propose the dTDMA bus to reduce both average latency and power consumption. Various Network-on-Chip research studies [Bolotin et al. ; Goossens et al. 2005; Pestana et al. 2004; Ogras and Marculescu 2006] have also reduced interconnection latencies using iterative simulation-based flows under multiple QoS requirements.
However, these best-effort latency reduction techniques are inherently limited in providing an RT guarantee. They need to use simulation-based QoS evaluation tools [Lahiri et al. 2001; Meyerowitz et al. 2003; Poletti et al. 2003; Conti et al. 2004] iteratively to check whether there are any missed deadlines. Even with this inefficient and laborious process, they still do not ensure an RT guarantee.¹

Outside the on-chip communication field, similar problems are handled using more formal approaches. For software video coding, Wüst et al. [2005] apply the Markov decision process [Puterman 1994] and reinforcement learning [Sutton and Barto 1998] to find the best balance between picture quality, quality smoothness, and real-time guarantee; for multimedia software, Combaz et al. [2005] use a function-based model to maximize time-budget utilization while ensuring an RT guarantee; for web servers, Lu et al. [2001] apply feedback control theory [Franklin et al. 1997] to support relative latency guarantee and real-time requirements; for operating systems, Swaminathan and Chakrabarty [2005] use an off-line scheduler to minimize I/O power consumption without missing deadlines. All the above works, though formally supporting RT requirements under multiple QoS criteria, are targeted to their specific applications and, hence, are not suitable for on-chip environments. Indeed, Kim et al. [2006] and Gill et al. [2001] both use general-purpose dynamic schedulability checking² to decide when to switch between RT and NRT modes.

¹ Corner cases in RT problems are not easily hit even by massive simulations.
² Details of the schedulability checking used by Gill et al. [2001] can be found in Sha et al. [1989].


Fig. 1. Concept of warning-zone based real-time arbitration. (a) Warning-zone not entered, non-urgent case: the current time lies between the request issue time and the start of the warning zone. (b) Warning-zone entered, urgent case: the current time lies inside the warning zone, which spans the Warning Zone Length (WZL) immediately before the deadline.

However, these dynamic schedulability checks are still too complex for hardware implementation: both lead to O(n) critical-path and O(n^3) area complexities for an n-device system. To replace the tedious dynamic schedulability check for efficient hardware implementation, one intuitive approach is to switch to the RT mode only when there are pending RT requests. Another simple approach is the use of a warning-zone [Chen et al. 2006; Jun et al. 2007].³ We use Figure 1 to illustrate this idea in three steps:

(1) To each RT request, a warning-zone is reserved before its deadline.
(2) According to the current time, when an RT request has not yet entered its warning-zone (as is the case in Figure 1(a)), we consider the request non-urgent and are free to handle it using arbitrary NRT-mode arbitration.
(3) If, however, any RT request has entered its warning-zone (as is the case in Figure 1(b)), we consider the request urgent and must handle it using RT-mode arbitration, by which the contention between multiple urgent RT requests is also resolved.

Accordingly, the switching between RT and NRT modes is decided simply by whether there exists any urgent RT request, instead of by a dynamic schedulability check.

Simple as the idea is, it can be quite practical and effective for general RT SoC applications as long as three aspects are carefully developed.⁴ First, an appropriate real-time analysis model needs to be defined because the stochastic SoC behavior is different from that of traditional RT systems. Second, the RT guarantee needs to be backed by a formal proof. Third, the warning-zone lengths (WZL’s) should be carefully assigned such that RT-mode usage is minimized. Among the three aspects, the last has the greatest impact on the system’s overall level of QoS. Figure 2 shows an example of how a smaller WZL assignment leads to less RT-mode usage, a higher level of QoS, and a significant cycle reduction on DRAM access.
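The three-step warning-zone idea illustrated with Figure 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors’ implementation: the function names, the `(deadline, wzl)` tuple layout, and the numbers are hypothetical, and the boundary case (time-to-deadline exactly equal to the WZL) is assumed to count as urgent.

```python
def time_to_deadline(now: int, deadline: int) -> int:
    """Cycles remaining until the request's absolute deadline."""
    return deadline - now


def is_urgent(now: int, deadline: int, wzl: int) -> bool:
    """True once the request has entered its warning zone (Figure 1(b)).

    Assumption: a request is urgent as soon as its time-to-deadline
    is no larger than its warning-zone length.
    """
    return time_to_deadline(now, deadline) <= wzl


def select_mode(now: int, rt_requests: list[tuple[int, int]]) -> str:
    """Choose the arbitration mode from the pending RT requests.

    Each request is a hypothetical (absolute deadline, wzl) pair.
    The RT mode is used only if at least one request is urgent.
    """
    urgent = any(is_urgent(now, d, w) for d, w in rt_requests)
    return "RT" if urgent else "NRT"


# Hypothetical request: deadline at cycle 100, warning zone of 30 cycles.
print(select_mode(60, [(100, 30)]))  # 40 cycles left, outside the zone: NRT
print(select_mode(75, [(100, 30)]))  # 25 cycles left, inside the zone: RT
```

The mode decision needs only one comparison per pending RT request, which is what makes the warning-zone approach attractive compared with a dynamic schedulability check.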
The example in Figure 2 consists of 3 RT and 2 NRT masters, each issuing 4-burst requests to a shared DRAM. Each request has a service cycle and is issued toward a particular bank. The service cycle consists of the activation penalty and the burst-access latency.⁵ However, the penalty cycles can be overlapped with the service cycles of another request if successive requests are scheduled in a bank-interleaving order [Elpeda Inc. 2007; Takizawa and Hirasawa 2001; Lee et al. 2005]. Besides, each request from RT masters has a corresponding deadline.

³ Instead of the original term warning-line used in Chen et al. [2006] and slack (with a slightly different meaning) used in Jun et al. [2007], warning-zone is used throughout this paper, which, according to the authors’ belief, more properly explains the concept.
⁴ These three aspects were omitted in Chen et al. [2006] and Jun et al. [2007] because neither the RT guarantee nor RT/NRT mode switching is the focus of their discussion.
⁵ In this example, we list the detailed components of the service cycles (e.g., activation penalty and 4-burst access time) for illustration purposes only. In the general model defined in Section 2, only the summed-up term, service cycle, is used.

Master     Bank Usage   Deadline   Activ. Penalty   4-burst Acc. Time
m1 (RT)    Bank 0       90         8                17 (write)
m2 (RT)    Bank 0       92         8                17 (read)
m3 (RT)    Bank 0       94         8                6 (write)
m4 (NRT)   Bank 1       -          8                6 (read)
m5 (NRT)   Bank 1       -          8                6 (read)

Fig. 2. Two arbitration scenarios with different warning-zone-length assignments. Left (Scenario I): wzl_1 = wzl_2 = wzl_3 = 78. Right (Scenario II): wzl_1 = 77, wzl_2 = 52, wzl_3 = 27, with a 22% cycle reduction. The timelines show masters m1 to m5 over time on Bank 0 and Bank 1; the legend marks warning zones, usage of RT-mode arbitration, and non-overlapped and overlapped activation penalties.

In Scenario I on the left, the WZL’s are assigned equal length (wzl_1 = wzl_2 = wzl_3 = 78) according to Chen et al. [2006].⁶ Following the previous description, RT-mode arbitration is used 3 times because warning-zones have been entered when the arbitration decisions take place, as indicated by the three circles. These usages of the RT mode prevent us from applying the currently preferred NRT policy (i.e., scheduling requests in a bank-interleaving order). This results in a long total access time of 90 cycles, including 3 non-overlapped activation penalties (indicated by dark-grey triangles).

⁶ In Chen et al. [2006], all WZL’s are assigned by summing up the service cycles of all RT masters plus the maximum service cycle among all NRT masters. In Figure 2, it is 78 = SUM(8 + 17, 8 + 17, 8 + 6) + MAX(8 + 6, 8 + 6).

In Scenario II on the right, another WZL assignment is used (wzl_1 = 77, wzl_2 = 52, wzl_3 = 27). The smaller assignment lets the RT masters enter their warning-zones later, hence allowing more flexibility for using the NRT policy to schedule both RT and NRT requests in a bank-interleaving order. As a result, by reducing unnecessary RT-mode usage, most activation penalties are overlapped, the total access time is reduced by 22%, and the real-time constraints are still met.

The main contribution of this paper is twofold. First, compared with previous works, the proposed approach enables a bus arbiter to use the RT mode much less in providing a real-time (RT) guarantee. It gives the arbiter more opportunity to employ non-RT modes to achieve better overall QoS. Second, the proposed approach is hardware-efficient and high-speed. Therefore, it is suitable for on-chip applications.

The remainder of this article is organized as follows. Section 2 defines our model; Section 3 presents the static optimal WZL assignment, whereas Section 4 presents the dynamic one. Section 5 presents two additional techniques that reduce the RT-mode usage further; Section 6 analyzes the hardware complexity of the proposed approach; Section 7 presents the experimental results. Section 8 concludes the article. Readers are also encouraged to refer to the various appendices of this work.
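As a quick check of the equal-length assignment used in Scenario I, the rule attributed to Chen et al. [2006] in footnote 6 (sum of all RT service cycles plus the largest NRT service cycle) can be reproduced from the Figure 2 table. The sketch below is illustrative only; the function name `equal_wzl` is hypothetical, and each service cycle is taken as activation penalty plus 4-burst access time, as in the footnote.

```python
def equal_wzl(rt_service_cycles, nrt_service_cycles):
    """Equal WZL for every RT master: the sum of all RT service cycles
    plus the maximum service cycle among the NRT masters."""
    return sum(rt_service_cycles) + max(nrt_service_cycles)


# Service cycle = activation penalty + 4-burst access time (Figure 2).
rt_cycles = [8 + 17, 8 + 17, 8 + 6]   # m1, m2, m3
nrt_cycles = [8 + 6, 8 + 6]           # m4, m5

wzl_scenario_1 = equal_wzl(rt_cycles, nrt_cycles)
print(wzl_scenario_1)  # 78

# The Scenario II assignment is no larger than the equal-length one
# for every RT master.
wzl_scenario_2 = [77, 52, 27]
print(all(w <= wzl_scenario_1 for w in wzl_scenario_2))  # True
```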

Fig. 3. Relation between various basic notations of RT masters (r_i, d_i, c_i, wzl_i, ttd_i, and ndl_i for a request issued by m_i at time t_k, shown against the current time).

2. THE MODEL

We assume a single On-Chip Bus (OCB), on which an arbiter arbitrates all requests from a number of bus masters in a Non-Idling and Non-Preemptive (NINP) fashion. By NINP, we mean that the arbiter must grant at least one request when any exists, and an already granted request cannot be interrupted before it has been completely served. While this assumption describes the basic form of many OCB’s [ARM Inc. 1999; 2003; IBM Inc. 2001; OPENCORES 2002; 2007; STMicroelectronics 2006], this work can be extended to broader assumptions including: (1) preemptive arbitration, (2) hybrid bus masters, (3) hierarchical buses, and (4) other on-chip shared resources. Please refer to Appendix A for these extensions.

2.1 Basic Notations

Bus masters are divided into two disjoint sets: M_RT and M_NRT.

RT Masters. M_RT = {m_1, ..., m_n} denotes the n RT masters issuing real-time constrained requests. To model the current request issued by an m_i in M_RT, three parameters are used: the recurrence time (r_i), the service cycle (c_i), and the relative deadline (d_i). Their relationships are shown in Figure 3. Among them, r_i denotes the time interval between the issuing time of the current request and that of the next; c_i denotes the time interval required to serve the current request; d_i denotes the time interval between the current request’s issuing time and its deadline. All three parameters are dynamic: they represent the behavior of the current request, and change once the next request is issued. By default, it is assumed that d_i ≤ r_i ∀i ∈ [1, n], since we assume each request must be completed at or before the issue of the next one by the same RT master.

Let an m_i in M_RT issue a request at time t_k. Then, before m_i is served and within the interval [t_k, t_k + d_i], ndl_i and ttd_i are defined as follows: ndl_i denotes the time t_k + d_i, which is the nearest deadline; ttd_i denotes ndl_i minus the current absolute time, which represents the time-to-deadline.
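The notations above map naturally onto a small record type. The following Python sketch is only an illustration of the definitions of ndl_i and ttd_i; the class name `RTRequest`, its field names, and the example numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RTRequest:
    """Current request of one RT master (field names are illustrative)."""
    issue_time: int   # t_k: absolute time at which the request is issued
    recurrence: int   # r_i: interval until the same master's next request
    service: int      # c_i: cycles needed to serve this request
    deadline: int     # d_i: relative deadline, measured from issue_time

    def __post_init__(self):
        # The model assumes d_i <= r_i for every RT master.
        if self.deadline > self.recurrence:
            raise ValueError("the model assumes d_i <= r_i")

    @property
    def ndl(self) -> int:
        """Nearest (absolute) deadline: t_k + d_i."""
        return self.issue_time + self.deadline

    def ttd(self, now: int) -> int:
        """Time-to-deadline: ndl_i minus the current absolute time."""
        return self.ndl - now


req = RTRequest(issue_time=10, recurrence=50, service=8, deadline=40)
print(req.ndl)      # 50
print(req.ttd(30))  # 20
```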
While ttd_i counts down from d_i to zero, wzl_i serves as a threshold. As a member of L = {wzl_1, ..., wzl_n}, wzl_i is assigned a value in [0, d_i]. Whenever ttd_i falls within wzl_i (that is, ttd_i ≤ wzl_i), we say that m_i enters its warning zone and becomes urgent (to be granted). If ttd_i drops below zero (that is, the current request of m_i is not completely served before ndl_i), we say m_i misses its deadline at ndl_i. All notations used in this article are listed in Table I. Note that RT masters can also issue NRT requests. This case is discussed in detail in Appendix A.2.

Table I. List of notations used in this paper

Notation                           Description
M_RT = {m_1, ..., m_n}             The set of n RT masters
c_i                                The service cycle of the current request issued by m_i
r_i                                The recurrence time of the current request issued by m_i
d_i                                The relative deadline of the current request issued by m_i
ttd_i                              The time to deadline of the current request issued by m_i
ndl_i                              The nearest (absolute) deadline of the current request issued by m_i
l_i                                The user-specified latency of the current request issued by m_i (for irregular RT masters only)
c_i^max                            The service cycle upper bound of all requests issued by m_i
r_i^min                            The recurrence time lower bound of all requests issued by m_i
l_i^min                            The user-specified latency lower bound of all requests issued by m_i (for irregular RT masters only)
r_min                              The recurrence time constraint on r_i^min ∀i ∈ [1, n], specified by Formula (1)
M_NRT = {m_(n+1), ..., m_(n+m)}    The set of m NRT masters
c_max^nrt                          The service cycle upper bound of all requests issued by any NRT master
L = {wzl_1, ..., wzl_n}            The set of n warning-zone-lengths
wzl_i                              The warning-zone-length assigned to m_i
σ_rt                               The RT-mode usage ratio during a period of on-line arbitration
µ_rt                               The RT access loading during a period of on-line arbitration

NRT Masters. M_NRT = {m_(n+1), ..., m_(n+m)} is a set of m masters issuing non-real-time constrained requests. In our model, each m_j in M_NRT needs only one parameter, the service cycle, to model the current request it issues. Moreover, c_max^nrt denotes the upper bound for the service cycles of all requests issued by any master in M_NRT. As for RT masters, it is also assumed that each request must be completed at or before the issue of the next by the same NRT master.

2.2 Problem Formulation

Our target is to provide, simultaneously, an RT guarantee and a high level of QoS in an OCB environment. Using the generalized warning-zone-based algorithm (GWBA), three basic steps are executed whenever an NINP arbitration decision is made:

- Step 1. Check whether ttd_i ≤ wzl_i holds for any i in [1, n].
- Step 2. If the answer is yes, the RT mode is entered, where an RT arbitration policy is employed.
- Step 3. Otherwise, the NRT mode is entered, where an NRT policy is employed.

Note that in general, any RT or NRT policies can be used in Step 2 and Step 3.
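One arbitration decision of GWBA, following the three steps above, might be sketched as below. This is a hedged illustration, not the paper’s hardware design: the `Pending` record, the `first_come` NRT policy, and the master names are hypothetical. The RT policy is pluggable; EDF (the policy the paper adopts in Section 3.1) is used as the default, and it is assumed here that the RT policy chooses among the urgent requests only, while the NRT policy may choose among all pending requests.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Pending:
    """A pending request; ttd and wzl stay None for NRT requests."""
    master: str
    ttd: Optional[int] = None  # time-to-deadline of an RT request
    wzl: Optional[int] = None  # warning-zone length of its master

    @property
    def urgent(self) -> bool:
        """Step 1 test for one request: ttd_i <= wzl_i."""
        return self.ttd is not None and self.wzl is not None and self.ttd <= self.wzl


Policy = Callable[[Sequence[Pending]], Pending]


def edf(urgent: Sequence[Pending]) -> Pending:
    """Earliest deadline first: the smallest time-to-deadline wins."""
    return min(urgent, key=lambda r: r.ttd)


def first_come(pending: Sequence[Pending]) -> Pending:
    """Placeholder NRT policy (hypothetical): grant the oldest request."""
    return pending[0]


def gwba_step(pending: Sequence[Pending],
              rt_policy: Policy = edf,
              nrt_policy: Policy = first_come):
    """Return (mode, granted request), or None if nothing is pending."""
    if not pending:
        return None
    urgent = [r for r in pending if r.urgent]      # Step 1
    if urgent:
        return "RT", rt_policy(urgent)             # Step 2
    return "NRT", nrt_policy(pending)              # Step 3


queue = [Pending("m4"),
         Pending("m1", ttd=40, wzl=30),
         Pending("m2", ttd=20, wzl=25)]
mode, granted = gwba_step(queue)
print(mode, granted.master)   # RT m2

mode, granted = gwba_step(queue[:2])
print(mode, granted.master)   # NRT m4
```

Keeping the two policies as parameters mirrors the remark that any RT or NRT policy can be used in Step 2 and Step 3.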


In any given time interval, let the RT-mode usage ratio (σ_rt) be:

\[ \sigma_{rt} = \frac{\#\ \text{RT-mode arbitrations}}{\#\ \text{total arbitration decisions}} \]

Then our goal is to minimize σ_rt while assuring a real-time guarantee. Since such a goal amounts to using “just enough” RT arbitrations and reserving the rest for high QoS, this problem is also called the precise RT guarantee problem. Moreover, since the choice of NRT policy is application-specific, this article focuses on achieving a precise RT guarantee by finding a good combination of an RT policy and a WZL assignment.

The Recurrence Time Constraint. For our problem, we constrain the minimum recurrence time of each RT master (the r_i^min’s) such that:

\[ r_i^{\min} \ge r_{\min} \quad \forall i \in [1, n], \qquad \text{where} \quad r_{\min} = c^{nrt}_{\max} - 1 + \sum_{j=1}^{n} c_j^{\max} \qquad (1) \]

where c_j^max denotes the service cycle upper bound for an RT master m_j. This constraint excludes the cases when the RT access loading (µ_rt) is too high, where in any given time period, the RT access loading is defined as:

\[ \mu_{rt} = \frac{\#\ \text{bus cycles occupied by RT accesses}}{\#\ \text{total bus cycles}} \]

This constraint is met by typical real-time SoC systems. Exceptions exist in high-end real-time systems involving extremely short-period masters. However, such exceptional applications tend to require a real-time guarantee as the only QoS requirement (for such a requirement is already hard enough), and thus these applications are excluded from our discussion. Also, Appendix B offers more discussion about short-period masters, and finds that the constraint is met in most cases.

The implication of the constraint is that, by the Pigeonhole Principle, any master’s deadline falls inside the interval of any r_i^min at most once. It also implies that inside any wzl_i, any m_i is delayed, at most, only once by each of the other RT masters.

2.3 Dynamic or Static Deadline/WZL Assignment

According to the on-chip environment’s behavior, deadlines and WZL’s should be assigned dynamically. In practice, however, it is sometimes more practical to assign them statically.

As mentioned in Section 2.1, each RT master’s request has its own c_i and r_i updated dynamically at issue time. Therefore, it is natural to assume that wzl_i and d_i will both be assigned dynamically. The dynamic value of wzl_i depends on a combination of certain dynamic c_i’s. For the d_i’s, we divide all RT masters into two sets: regular RT masters and irregular RT masters. Regular RT masters, such as video-display or audio-playback controllers, issue requests with similar recurrence times. Moreover, since such requests only need to be completely served before the issue of the next request by the same master, the d_i’s of such requests are assigned directly by their recurrence time, which is dynamic. Irregular RT masters, on the other hand, represent CPU or Ethernet controllers that issue requests with a wide range of recurrence times, which may be random or dependent on the time when the request is served. In this case, the d_i’s are assigned using a user-specified latency (l_i) according to the system’s RT requirement, which is also dynamic.

Using dynamic assignment, though natural, is sometimes less feasible because existing OCB protocols [ARM Inc. 1999; 2003; IBM Inc. 2001; OPENCORES 2002; 2007; STMicroelectronics 2006] do not specify dynamic r_i and c_i as standard signals. Hence, many IP’s do not, by default, provide these signals. Indeed, most existing protocols allow user-defined signals. However, manually adding them for all IP’s can still be difficult.⁷

⁷ Even with sufficient knowledge of all IP’s in use, adding these signals can be difficult because, by nature, they are hard to predict at issue time. The recurrence time is easy to predict for a regular master, but not for an irregular master. As for the service cycle, although most protocols do have signals specifying the amount of data to be accessed for the current transaction (like BURSTLENTH for AMBA), the actual service cycle depends on the requested device, which often has a variable latency, such as DRAM.

Static assignment of the WZL and deadline is more practical. It uses the upper-bound values of each RT request’s service cycle (c_i^max ∀i ∈ [1, n]) to calculate the static WZL’s for each RT master. Also, it uses the lower bound of each RT request’s recurrence time (r_i^min ∀i ∈ [1, n]) as the relative deadlines for regular RT masters. For irregular RT masters, the lower bounds on the user-specified latencies (l_i^min) are used as the relative deadlines. Note that all the upper and lower bounds can be reconfigured once per period of time (e.g., every few thousand cycles), thereby changing the statically assigned values of the wzl_i’s and d_i’s. They are called static because, unlike dynamically assigned values, they do not change every time an RT request is issued.

Table II compares dynamic and static assignments. They are viewed as two separate problems. The dynamic assignment problem is more generic and flexible, providing an RT guarantee that is more “precise”; on the other hand, the static assignment problem, though less flexible, is more practical. In Section 3, we first focus on the static assignment problem. Then, in Section 4, we show that a similar approach can also be applied to the dynamic one.

Table II. Comparison of dynamic/static relative deadline and WZL assignments

                                     Dynamic assignment    Static assignment
Warning-zone-length                  wzl_i = f(c_i)        wzl_i = f(c_i^max)
Relative deadline                    Reg.: d_i = r_i       Reg.: d_i = r_i^min
                                     Irreg.: d_i = l_i     Irreg.: d_i = l_i^min
Attractiveness for SoC application   Flexibility           Practicality

3. OPTIMAL STATIC WARNING-ZONE-LENGTH ASSIGNMENT

As illustrated in Section 1, decreasing the wzl_i’s (the values of the members of a WZL set L) can significantly reduce σ_rt. However, if they are assigned too small a length, the GWBA algorithm could fail to provide a real-time guarantee. Therefore, the problem becomes how to generate a set of minimum WZL assignments that will preserve a hard real-time guarantee, defined as the minimum WZL assignment problem. In Section 3.1, we define the feasibility and optimality of any static WZL set, and develop theorems for checking them in Section 3.2. An illustration of the feasibility/optimality checking process can be found in Appendix C. Finally, we give an example in Section 3.3 to show how to generate an optimal static WZL set for a given M_RT and c_max^nrt.

3.1 Definition of Feasible and Optimal WZL Sets

Employing the earliest-deadline-first (EDF) scheduling algorithm as our RT policy (for it is found to be optimal among all NINP real-time scheduling algorithms⁸), we first sort a WZL set L = {wzl_1, ..., wzl_n} into ascending order. This order corresponds to one of the n! permutations of L’s associated M_RT. For example, L = {wzl_1, wzl_2, wzl_3} = {9, 4, 6} corresponds to the permutation (m_2, m_3, m_1) because wzl_2 < wzl_3 < wzl_1 (4 < 6 < 9). Also, any WZL set for three RT masters corresponds to one of the 3! = 6 possible permutations.⁹

A predefined permutation has a special meaning. By definition, masters placed toward the front are assigned smaller WZL’s. Once these masters enter their warning-zones, their ttd_i’s are more likely to be small enough to get a higher priority according to EDF. Therefore, the predefined permutation roughly represents the priority among RT masters already in their warning-zones.

According to the above definition, feasible and optimal WZL sets are defined as follows:

- A warning-zone-length set L = {wzl_1, ..., wzl_n} is feasible if and only if it does not induce any deadline misses.
- A warning-zone-length set L^opt = {wzl_1^opt, ..., wzl_n^opt} is optimal if and only if it is feasible and, for any other feasible WZL set L' = {wzl_1', ..., wzl_n'} corresponding to the same permutation as L^opt:

\[ wzl_i^{opt} \le wzl_i' \quad \forall i \in [1, n] \qquad (2) \]
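The correspondence between a WZL set and its permutation, and the component-wise comparison of Formula (2), can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names and the second WZL set are hypothetical, WZL sets are represented as dictionaries keyed by master name, and ties between equal WZL’s (which the text sets aside at this point) are simply broken by insertion order.

```python
def permutation_of(wzl: dict[str, int]) -> tuple[str, ...]:
    """Masters sorted by ascending WZL, e.g. {9, 4, 6} -> (m2, m3, m1).

    Ties are broken by insertion order, a simplification made here.
    """
    return tuple(sorted(wzl, key=wzl.__getitem__))


def no_larger(candidate: dict[str, int], other: dict[str, int]) -> bool:
    """Formula (2): candidate[i] <= other[i] for every master.

    Only WZL sets corresponding to the same permutation are compared.
    """
    if permutation_of(candidate) != permutation_of(other):
        raise ValueError("WZL sets of different permutations are not compared")
    return all(candidate[m] <= other[m] for m in candidate)


L = {"m1": 9, "m2": 4, "m3": 6}
print(permutation_of(L))   # ('m2', 'm3', 'm1')

# Hypothetical second set with the same permutation.
L_other = {"m1": 10, "m2": 5, "m3": 6}
print(no_larger(L, L_other))   # True
```

Note that `no_larger` checks only the ordering condition of Formula (2); whether either set is feasible is a separate question addressed by the theorems of Section 3.2.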

As illustrated in Figure 4, there is one and only one optimal WZL set (indicated by the stars) for each permutation, which is by definition the smallest among all feasible WZL sets holding the same permutation. Note, however, that two optimal WZL sets of distinct permutations are not compared against each other. For example, we do not define whether L_a^opt = {2, 3, 6} is larger or smaller than L_b^opt = {6, 4, 1}, because L_a^opt corresponds to the permutation (m_1, m_2, m_3), whereas L_b^opt corresponds to another permutation (m_3, m_2, m_1).

⁸ Previous research in the real-time scheduling field has provided a strong foundation for our choice of the RT policy. Liu and Layland [1973] first derived the feasibility test of any preemptive scheduling algorithm for periodic tasks, and showed that Earliest Deadline First (EDF) is optimal in the preemptive/periodic subset of real-time scheduling problems in the sense that it finds a valid schedule without deadline misses if any such schedule exists. For non-preemptive scheduling, it was shown that in some cases optimal schedules can be found only by inserting idling cycles. However, finding a feasible non-preemptive schedule with idling cycles was shown to be NP-complete [Howell and Venkatrao 1995]. Later, Jeffay et al. [1991] derived the feasibility test for Non-Idling, Non-Preemptive (NINP) real-time schedules for periodic or sporadic tasks and showed that EDF is also optimal for a subset of NINP scheduling problems. Finally, George et al. [1995] derived a simple proof to assure the optimality of EDF among all NINP real-time scheduling algorithms. For interested readers, George et al. [2000] and Sha et al. [2004] provide a good overview of the field of real-time scheduling.
⁹ At this point, we ignore the cases in which multiple WZL’s share the same value, which will be covered in a later example in Appendix C.

Fig. 4. Relation between feasible and optimal WZL sets: among arbitrary WZL sets with n! possible permutations, Theorem 1 characterizes the feasible WZL sets, and Theorem 2 characterizes the one optimal WZL set for each permutation.

3.2 Conditions for Checking Feasible and Optimal WZL Sets

In this subsection, we develop two theorems to examine a WZL set’s feasibility and optimality, respectively. Theorem 1 is derived from Lemma 1 (as the necessary condition) and Lemma 2 (as the sufficient condition); it is used to examine the feasibility of a WZL set. Theorem 2 is derived from Theorem 1; it is used to examine the optimality.

To describe the lemmas and theorems clearly, we define two sets, SMALL_i and LARGE_i, for each m_i in M_RT. SMALL_i (LARGE_i) represents the subset of all RT masters with smaller (larger) WZL’s than m_i. For example, for all WZL sets corresponding to the permutation (m_2, m_3, m_1) (e.g., L_a = {9, 3, 6} and L_b = {10, 2, 7}), SMALL_3 = {m_2} because wzl_2 < wzl_3 (3 < 6 and 2 < 7). Similarly, LARGE_3 = {m_1} because wzl_1 > wzl_3 (9 > 6 and 10 > 7).

With SMALL_i and LARGE_i defined, we can explore the ideas behind the two lemmas and two theorems before proving them formally. Observe the RHS of Formula (3), which is used in all lemmas and theorems. It can be divided into the MAX part (including the minus-one term) and the SUM (Σ) part. The SUM part suggests that wzl_i, the WZL of m_i, must reserve time for m_i itself and all masters in SMALL_i, because masters in SMALL_i have “roughly higher” priorities than m_i when their warning-zones are entered. (Please refer to the second and the last paragraphs of Section 3.1 for the meaning of “roughly higher”.)
Further, the MAX term suggests that wzl_i must also reserve time for the one worst-case master among LARGE_i and M_NRT, because such a master may be granted at least one cycle before m_i enters its warning-zone, and cannot then be preempted. Accordingly, the underlying spirit of Lemma 1 is that wzl_i must, at least, reserve time for both the SUM part and the MAX part, which is the necessary condition of feasibility. On the other hand, Lemma 2, as the sufficient condition, explains why the SUM part does not have to include masters in LARGE_i, although, according to EDF, they can hold a higher priority over m_i within their warning-zones as long as they have uncompleted requests with deadlines prior to m_i’s. Lemmas 1 and 2 lead to Theorem 1, which establishes that satisfying Formula (3) is both necessary and sufficient for feasibility. Theorem 2 further shows that optimality is attained when equality holds in Formula (3). When reading these lemmas and theorems, readers are encouraged to refer to Appendix C to keep a clear idea of how they are used in practice.
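The SMALL_i/LARGE_i construction and the right-hand side of Formula (3), as described above (MAX part minus one, plus SUM part), can be evaluated mechanically. The sketch below is an illustration under stated assumptions, not the authors’ procedure from Section 3.3: the function names and the service-cycle numbers are hypothetical, at least one NRT master is assumed to exist, and the additional requirements that wzl_i lie in [0, d_i] and that the recurrence time constraint hold are not checked.

```python
def formula3_rhs(small, large, own, c_max, c_nrt_max):
    """MAX part (minus one) plus SUM part for one RT master."""
    max_part = max([c_max[j] for j in large] + [c_nrt_max]) - 1
    sum_part = sum(c_max[j] for j in small) + c_max[own]
    return max_part + sum_part


def minimal_wzl(permutation, c_max, c_nrt_max):
    """WZL set obtained by taking equality in Formula (3) for a permutation.

    Masters earlier in the permutation form SMALL_i, later ones LARGE_i.
    """
    return {
        m: formula3_rhs(permutation[:pos], permutation[pos + 1:], m,
                        c_max, c_nrt_max)
        for pos, m in enumerate(permutation)
    }


def satisfies_formula3(wzl, c_max, c_nrt_max):
    """Check wzl_i >= RHS of Formula (3) for every RT master."""
    for i in wzl:
        small = [j for j in wzl if wzl[j] < wzl[i]]
        large = [j for j in wzl if wzl[j] > wzl[i]]
        if wzl[i] < formula3_rhs(small, large, i, c_max, c_nrt_max):
            return False
    return True


# Hypothetical service-cycle upper bounds (not taken from the paper).
c_max = {"m1": 5, "m2": 3, "m3": 4}
c_nrt_max = 6

wzl = minimal_wzl(("m2", "m3", "m1"), c_max, c_nrt_max)
print(wzl)                                        # {'m2': 8, 'm3': 12, 'm1': 17}
print(satisfies_formula3(wzl, c_max, c_nrt_max))  # True

too_small = dict(wzl, m3=11)
print(satisfies_formula3(too_small, c_max, c_nrt_max))  # False
```

For m3 in this example, the MAX part is max(5, 6) - 1 = 5 (from m1 and the NRT bound) and the SUM part is 3 + 4 = 7 (m2 and m3 itself), giving 12.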

Fig. 5. The critical case of $wzl_i$

Lemma 1. Given the recurrence time constraint, assuming an NINP protocol, let EDF be the RT-policy in use. If a static warning-zone-length set $L = \{wzl_1, \ldots, wzl_n\}$ is feasible, then $L$ satisfies:

\[ wzl_i \ge \mathrm{MAX}(c_j^{max} : \tilde{m}_j \in LARGE_i \cup M_{NRT}) - 1 + \sum (c_j^{max} : \tilde{m}_j \in SMALL_i \cup \{\tilde{m}_i\}) \quad \forall i \in [1, n] \tag{3} \]

Proof. Assume $L = \{wzl_1, \ldots, wzl_n\}$ is feasible. Then for any $\tilde{m}_i$ in $M_{RT}$, consider the case shown in Figure 5 where $\tilde{m}_k$, having the maximum $c_k^{max}$ in the set $M_{NRT} \cup LARGE_i$, enters its warning zone at time = 0 when the other masters have not yet entered their warning zones. Since $\tilde{m}_k$ is the only master entering its warning zone then, it must be immediately granted. Assume that later, at time = 1, $\tilde{m}_i$ and all masters in $SMALL_i$, e.g., $\tilde{m}_s, \ldots, \tilde{m}_u$, enter their warning zones simultaneously. Because of our choice of EDF, $\tilde{m}_i$ will not be granted until $\tilde{m}_k$ and all masters in $SMALL_i$ are completed. Therefore, as shown in the figure, $\tilde{m}_i$ must wait for

\[ \mathrm{MAX}(c_j^{max} : \tilde{m}_j \in LARGE_i \cup M_{NRT}) - 1 + \sum (c_j^{max} : \tilde{m}_j \in SMALL_i) \]

cycles before being granted. Besides, $\tilde{m}_i$ also needs to complete its own request, consuming $c_i^{max}$ cycles. Therefore, to guarantee that any $\tilde{m}_i$ in $M_{RT}$ does not miss its deadline in such a case, it must hold a $wzl_i$ no smaller than the right-hand side of Formula (3). Therefore, the proof is valid.

Lemma 2. Given the recurrence time constraint, assuming an NINP protocol, let EDF be the RT-policy in use. If a static warning-zone-length set $L = \{wzl_1, \ldots, wzl_n\}$ satisfies Formula (3), then $L$ is feasible.

Proof. Serving as the sufficient condition, the proof of Lemma 2 has a recursive flavor and is more complicated than that of Lemma 1. The main challenge is to prove why the last term of Formula (3) does not have to include the service cycles of masters in $LARGE_i$, despite their requests being indeed capable of delaying $\tilde{m}_i$ whenever their deadlines are before that of $\tilde{m}_i$.
We begin by introducing the concept of "properly delay" using Figure 6. In the figure, if $\tilde{m}_i$ is to miss its deadline at $ndl_i$, it must be granted no sooner than $ndl_i - c_i^{max} + 1$. Moreover, since Formula (3) sets a lower bound on $wzl_i$, $\tilde{m}_i$ must enter its warning-zone no later than $ndl_i - wzl_i$. Therefore, between the latest time $\tilde{m}_i$ is granted ($gt_i$) and the latest time $\tilde{m}_i$ enters its warning-zone ($wt_i$), $\tilde{m}_i$ must be "properly delayed" by other masters to avoid being granted before the time $ndl_i - c_i^{max} + 1$.

Fig. 6. The concept of "properly delay" ($tpd_i = gt_i - wt_i$)

This Time interval to be Properly Delayed ($tpd_i = gt_i - wt_i$) can be filled by RT masters or NRT masters under different conditions. An RT master $\tilde{m}_j$, following the EDF principle, must have $ndl_j \le ndl_i$ to delay $\tilde{m}_i$, and it can do so at most once because of the recurrence time constraint. An NRT master can only be granted before $\tilde{m}_i$ enters its warning-zone, or equivalently, $gt_j < wt_i$ must hold. Therefore, any NRT master can delay $\tilde{m}_i$ by at most $c_{nrt\_max} - 1$ cycles.

Under the concept of properly delay, Lemma 2 is proven by contradiction through three steps, as shown in Figure 7:

Step 1: Given the recurrence time constraint, Formula (3) is satisfied, and EDF is the RT policy. Assume that some deadline is missed, and let the first miss occur when $\tilde{m}_d$ misses its deadline at $t_d$. We show that $\tilde{m}_d$ must be properly delayed, where at least one $\tilde{m}_a \in LARGE_d$ must directly contribute to $tpd_d$; otherwise, a contradiction results.

Step 2: For $\tilde{m}_a$ to delay $\tilde{m}_d$ (so that it misses its deadline), we show that $\tilde{m}_a$ itself must also be properly delayed by another $\tilde{m}_b \in LARGE_a$. This step is repeated at most $(n - 2)$ times until finally $\tilde{m}_z$, the master with the largest WZL among all masters in $M_{RT}$, is reached.

Step 3: Two cases are analyzed, in both of which $tpd_z$ is shown to be impossible to fulfill completely, thus leading to contradictions.

Since, if Formula (3) is satisfied, all possible paths after the first deadline miss lead to contradiction, the proof is valid.

Fig. 7. The three-step approach used in Lemma 2

Fig. 8. (a) Step 1 and (b) Step 2 of Lemma 2

Explanation of Step 1. Figure 8(a) shows the situation of Step 1 when $\tilde{m}_d$ misses its deadline at $t_d$. In this step, we calculate the lower bound of $tpd_d$ to show that at least one $\tilde{m}_a \in LARGE_d$ must contribute to $tpd_d$. The lower bound of $tpd_d$, by definition, is determined by $gt_d$ and $wt_d$. For the former, we can observe from the figure that:

\[ gt_d \ge t_d - (c_d^{max} - 1) \]

since $\tilde{m}_d$ is assumed to miss its deadline at $t_d$; for the latter, since Formula (3) sets a lower bound on $wzl_d$, denoted by $\min(wzl_d)$, we can also find from the figure that:

\[ wt_d \le t_d - \min(wzl_d) \]

Consequently, the lower bound of $tpd_d$ is:

\[ tpd_d = gt_d - wt_d \ge \min(wzl_d) - c_d^{max} + 1 = \mathrm{MAX}(c_j^{max} : \tilde{m}_j \in LARGE_d \cup M_{NRT}) + \sum (c_j^{max} : \tilde{m}_j \in SMALL_d) \]

To cause the deadline miss, $tpd_d$ must be completely fulfilled by the masters in $SMALL_d$, $LARGE_d$, or $M_{NRT}$. First, note that all masters in $SMALL_d$ can fulfill only the second term:

\[ \sum (c_j^{max} : \tilde{m}_j \in SMALL_d) \]

Therefore, if we trace in the negative direction of time starting from $gt_d$, before $wt_d$ is reached, we will find a master (that is, $\tilde{m}_a$ in Figure 8(a)) being scheduled right before a non-negative number of consecutive masters in $SMALL_d$, e.g., $\tilde{m}_v$ and $\tilde{m}_w$ in the figure. Not belonging to $SMALL_d$, this $\tilde{m}_a$ can belong to $M_{NRT}$ or $LARGE_d$. However, this master cannot belong to $M_{NRT}$ for two reasons: (1) a master in $M_{NRT}$ cannot be granted after $wt_d$, as previously mentioned; (2) being granted before $wt_d$, a master in $M_{NRT}$ can only delay $\tilde{m}_d$ for at most $(c_{nrt\_max} - 1)$ cycles, which is less than or equal to the first term minus one:

\[ \mathrm{MAX}(c_j^{max} : \tilde{m}_j \in LARGE_d \cup M_{NRT}) - 1 \]

In other words, as with $\tilde{m}_a$ in Figure 8(a), there must be a master in $LARGE_d$ that delays $\tilde{m}_d$ right before a sequence of consecutive masters in $SMALL_d$ do. This statement is visualized in Step 1 of Figure 7.

Explanation of Step 2. With Step 2 in Figure 8(b), we show that $\tilde{m}_a$ also has to be properly delayed by an $\tilde{m}_b$ in $LARGE_a$, by calculating the lower bound on $tpd_a$. Let $USED_a$ denote the set of masters scheduled between $\tilde{m}_a$ and $\tilde{m}_d$, including $\tilde{m}_d$ but not $\tilde{m}_a$. That is, $USED_a$ in Figure 8(b) is equal to the set $\{\tilde{m}_d, \tilde{m}_v, \tilde{m}_w\}$ in Figure 8(a). By the recurrence time constraint, the masters in $USED_a$ cannot delay $\tilde{m}_a$ again, because no master in $M_{RT}$ issues a request with a deadline within $[t_d - r_{min}, t_d]$ more than once. Therefore, masters in $USED_a$ should be excluded when we consider the possible masters to fill $tpd_a$.

To calculate the lower bound on $tpd_a$, again, $gt_a$ and $wt_a$ are considered. From the figure, we can similarly find that:

\[ gt_a \ge t_d - \sum (c_j^{max} : \tilde{m}_j \in USED_a) + 1 - c_a^{max} \]

Moreover, since $\tilde{m}_a$ also needs to hold a deadline no later than $t_d$ to delay $\tilde{m}_d$, we have:

\[ wt_a \le t_d - \min(wzl_a) \]

Consequently, the lower bound of $tpd_a$ is:

\[ tpd_a = gt_a - wt_a \ge \min(wzl_a) - \sum (c_j^{max} : \tilde{m}_j \in USED_a) - c_a^{max} + 1 = \mathrm{MAX}(c_j^{max} : \tilde{m}_j \in LARGE_a \cup M_{NRT}) + \sum (c_j^{max} : \tilde{m}_j \in SMALL_a - USED_a) \tag{4} \]

Again, masters in $SMALL_a$ can fulfill only the second term:

\[ \sum (c_j^{max} : \tilde{m}_j \in SMALL_a - USED_a) \]

since the masters in $USED_a$ are excluded. Then, as with the situation in Step 1, at least one $\tilde{m}_b$ in $LARGE_a$ is required to properly delay $\tilde{m}_a$. Repeating the same process, we sequentially find an $\tilde{m}_c$ in $LARGE_b$ followed by an $\tilde{m}_e$ in $LARGE_c$, etc., until finally $\tilde{m}_z$ is reached. The recursion is visualized in Step 2 of Figure 7.

Explanation of Step 3. In Step 3, we complete the proof by showing that $tpd_z$ is impossible to fulfill completely, which implies that the deadline miss at $t_d$ should never have happened.
This contradiction is deduced in both of the following two cases:

Case (a): when $USED_z = SMALL_z$
Case (b): when $USED_z \ne SMALL_z$

The cases are shown in Figure 9(a) and Figure 9(b), respectively.

Fig. 9. Two cases of Step 3 of Lemma 2

In Case (a), the lower bound of $tpd_z$ can be found by reusing Formula (4) as long as we include two more properties of this situation: (1) $LARGE_z$ = NULL, since $\tilde{m}_z$ by definition holds the largest WZL among all masters in $M_{RT}$; (2) $SMALL_z - USED_z$ = NULL, according to the assumption of Case (a). By applying these two properties to Formula (4), we have:

\[ tpd_z \ge \mathrm{MAX}(c_j^{max} : \tilde{m}_j \in M_{NRT}) \]

or equivalently,

\[ tpd_z \ge c_{nrt\_max}. \]

The fact that $\min(wzl_z) = r_{min}$ (by Formula (3)) is also an important consideration for calculating the above lower bound. It ensures that a sufficiently long part of $tpd_z$ falls within $[t_d - r_{min}, t_d]$, where any delay brought by the masters in $USED_z$ can be ignored because of the recurrence time constraint. Now, $tpd_z$ can only be filled by masters in $M_{NRT}$ (because of the two properties). However, as previously mentioned, the maximum delay that any master in $M_{NRT}$ can contribute to $\tilde{m}_z$ is bounded by $c_{nrt\_max} - 1$, which is less than $c_{nrt\_max}$, the lower bound on $tpd_z$. Therefore, it is impossible for $\tilde{m}_z$ to be properly delayed in such a case, which creates a contradiction.

In Case (b), the lower bound of $tpd_z$ is calculated in a manner similar to that in Case (a), except for the assumption $USED_z \ne SMALL_z$. Hence, we have:

\[ tpd_z = gt_z - wt_z \ge c_{nrt\_max} + \sum (c_j^{max} : \tilde{m}_j \in SMALL_z - USED_z) \]

Though the second term on the right-hand side can be filled by the masters in $SMALL_z - USED_z$, the first term ($c_{nrt\_max}$) is still left to be filled only by the masters in $M_{NRT}$, which is, as mentioned in Case (a), impossible.

Finally, the whole proof is summarized in Step 3 of Figure 7. It shows that, once a WZL set $L$ satisfies Formula (3), every condition after the first deadline miss leads to contradiction. The proof is valid.

Theorem 1. Given the recurrence time constraint, assuming an NINP protocol, let EDF be the RT-policy in use. A static warning-zone-length set $L = \{wzl_1, \ldots, wzl_n\}$ is feasible if and only if $L$ satisfies Formula (3).

Proof. Lemma 1 shows that Formula (3) is necessary for $L$ to be feasible, and Lemma 2 shows that it is sufficient. Therefore, the proof is valid.
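The feasibility test of Theorem 1 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: masters are indexed from 0, `c_nrt_max` stands for the largest service cycle among all NRT masters, all WZL's are assumed distinct so that $SMALL_i$ and $LARGE_i$ are well defined, and the function names are hypothetical.

```python
from typing import Sequence


def formula3_rhs(i: int, wzl: Sequence[int], c_max: Sequence[int],
                 c_nrt_max: int) -> int:
    """Right-hand side of Formula (3) for RT master i (0-based index)."""
    n = len(wzl)
    small = [j for j in range(n) if wzl[j] < wzl[i]]  # SMALL_i
    large = [j for j in range(n) if wzl[j] > wzl[i]]  # LARGE_i
    # MAX part: worst-case master among LARGE_i and the NRT masters, minus one.
    max_part = max([c_max[j] for j in large] + [c_nrt_max]) - 1
    # SUM part: service cycles of SMALL_i plus those of master i itself.
    sum_part = sum(c_max[j] for j in small) + c_max[i]
    return max_part + sum_part


def is_feasible(wzl: Sequence[int], c_max: Sequence[int],
                c_nrt_max: int) -> bool:
    """Theorem 1: L is feasible iff every wzl_i is at least the RHS of (3)."""
    if len(set(wzl)) != len(wzl):
        # Simplifying assumption of this sketch, not a claim of the paper.
        raise ValueError("this sketch assumes pairwise distinct WZL's")
    return all(wzl[i] >= formula3_rhs(i, wzl, c_max, c_nrt_max)
               for i in range(len(wzl)))


if __name__ == "__main__":
    # Service cycles (3, 7, 5, 9) and c_nrt_max = 4 as in the Section 3.3 example.
    print(is_feasible([23, 15, 27, 20], [3, 7, 5, 9], 4))
    print(is_feasible([23, 14, 27, 20], [3, 7, 5, 9], 4))
```

The second call shortens one warning zone by a single cycle, which is enough to violate Formula (3) for that master.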

Theorem 2. Given the recurrence time constraint, assuming an NINP protocol, let EDF be the RT-policy in use. A static warning-zone-length set $L = \{wzl_1, \ldots, wzl_n\}$ is optimal if and only if $L$ satisfies:

\[ wzl_i = \mathrm{MAX}(c_j^{max} : \tilde{m}_j \in LARGE_i \cup M_{NRT}) - 1 + \sum (c_j^{max} : \tilde{m}_j \in SMALL_i \cup \{\tilde{m}_i\}) \quad \forall i \in [1, n] \tag{5} \]

Proof. Formula (5) is the extreme case of Formula (3). By Theorem 1 and the definition of the optimal WZL set in Formula (2), the proof is valid.

Using Theorems 1 and 2, we can check the feasibility and optimality of any given WZL set; the process is illustrated in Appendix C.

3.3 Generating Optimal WZL Sets

Instead of checking the feasibility and optimality of a given WZL set, in practice we wish to generate the optimal WZL set for any given $M_{RT}$ and $c_{nrt\_max}$ using a predefined permutation. We now describe how Theorem 2 can be used this way. In the next example, we demonstrate how to generate the optimal WZL set for the permutation $\{\tilde{m}_2, \tilde{m}_4, \tilde{m}_1, \tilde{m}_3\}$, assuming $M_{RT} = \{\tilde{m}_1, \tilde{m}_2, \tilde{m}_3, \tilde{m}_4\}$, where $\{(r_1^{min}, c_1^{max}), (r_2^{min}, c_2^{max}), (r_3^{min}, c_3^{max}), (r_4^{min}, c_4^{max})\} = \{(30, 3), (28, 7), (45, 5), (35, 9)\}$ and $c_{nrt\_max} = 4$. The desired optimal WZL set can be generated in 3 steps:

Step 1: Let $M'_{RT} = \{\tilde{m}'_1, \tilde{m}'_2, \tilde{m}'_3, \tilde{m}'_4\} = \{\tilde{m}_2, \tilde{m}_4, \tilde{m}_1, \tilde{m}_3\}$ be the predefined permutation.

Step 2: Assign $L' = \{wzl'_1, \ldots, wzl'_n\}$, which corresponds to $M'_{RT}$ rather than $M_{RT}$, by:

\[ wzl'_i = \mathrm{MAX}(\{{c'}_j^{max} : j \in [i+1, n]\}, c_{nrt\_max}) - 1 + \sum_{j=1}^{i} {c'}_j^{max} \quad \forall i \in [1, n] \tag{6} \]

where ${c'}_1^{max} = c_2^{max}$, ${c'}_2^{max} = c_4^{max}$, ${c'}_3^{max} = c_1^{max}$, and ${c'}_4^{max} = c_3^{max}$. Note that Formula (6) is equivalent to Formula (5) if we substitute $SMALL_i$ with $SMALL'_i = \{\tilde{m}'_j : j \in [1, i-1]\}$ and $LARGE_i$ with $LARGE'_i = \{\tilde{m}'_j : j \in [i+1, n]\}$, which are inferred from the definition of $M'_{RT}$ in Step 1. In our example, where $\{{c'}_1^{max}, {c'}_2^{max}, {c'}_3^{max}, {c'}_4^{max}, c_{nrt\_max}\} = \{7, 9, 3, 5, 4\}$, $L'$ is assigned as:

\[ wzl'_1 = \sum\{7\} + \mathrm{MAX}\{9, 3, 5, 4\} - 1 = 15 \]
\[ wzl'_2 = \sum\{7, 9\} + \mathrm{MAX}\{3, 5, 4\} - 1 = 20 \]
\[ wzl'_3 = \sum\{7, 9, 3\} + \mathrm{MAX}\{5, 4\} - 1 = 23 \]
\[ wzl'_4 = \sum\{7, 9, 3, 5\} + \mathrm{MAX}\{4\} - 1 = 27 \]

Step 3: Renumber $L'$ to $L$. In the example, $L' = \{wzl'_1, wzl'_2, wzl'_3, wzl'_4\} = \{wzl_2, wzl_4, wzl_1, wzl_3\}$. Therefore, $L = \{23, 15, 27, 20\}$ is the resulting optimal WZL set for the predefined permutation $M'_{RT} = \{\tilde{m}_2, \tilde{m}_4, \tilde{m}_1, \tilde{m}_3\}$.$^{10}$

At this point, one may ask whether it is possible for Formula (6) to generate an $L' = \{wzl'_1, wzl'_2, wzl'_3, wzl'_4\}$ that is not non-decreasing. If it can, the resulting $L$ in Step 3 will correspond to a permutation different from $M'_{RT}$, implying that the optimal WZL set for some permutations may not exist. To rule this situation out, we show with Lemma 3 that any $L'$ obtained from Formula (6) is non-decreasing.

Lemma 3. If a static warning-zone-length set $L = \{wzl_1, \ldots, wzl_n\}$ is assigned according to Formula (6), then $L$ is non-decreasing, that is, it satisfies:

\[ wzl_{i+1} \ge wzl_i \quad \forall i \in [1, n-1] \tag{7} \]

Proof. The problem is separated into two cases: Case (a), when $\mathrm{MAX}(\{{c'}_j^{max} : j \in [i+1, n]\}, c_{nrt\_max}) = {c'}_{i+1}^{max}$, and Case (b), when $\mathrm{MAX}(\{{c'}_j^{max} : j \in [i+1, n]\}, c_{nrt\_max}) \ne {c'}_{i+1}^{max}$.

In Case (a) and for all $i \in [1, n-1]$:

\[ wzl_{i+1} - wzl_i = \mathrm{MAX}(\{{c'}_j^{max} : j \in [i+2, n]\}, c_{nrt\_max}) - 1 + \sum_{j=1}^{i+1} {c'}_j^{max} - \Big({c'}_{i+1}^{max} - 1 + \sum_{j=1}^{i} {c'}_j^{max}\Big) = \mathrm{MAX}(\{{c'}_j^{max} : j \in [i+2, n]\}, c_{nrt\_max}) \]

Indeed, this value is non-negative.

In Case (b), since $\mathrm{MAX}(\{{c'}_j^{max} : j \in [i+1, n]\}, c_{nrt\_max}) \ne {c'}_{i+1}^{max}$, we have:

\[ \mathrm{MAX}(\{{c'}_j^{max} : j \in [i+1, n]\}, c_{nrt\_max}) = \mathrm{MAX}(\{{c'}_j^{max} : j \in [i+2, n]\}, c_{nrt\_max}) \]

Therefore,

\[ wzl_{i+1} - wzl_i = \mathrm{MAX}(\{{c'}_j^{max} : j \in [i+2, n]\}, c_{nrt\_max}) - 1 + \sum_{j=1}^{i+1} {c'}_j^{max} - \Big(\mathrm{MAX}(\{{c'}_j^{max} : j \in [i+2, n]\}, c_{nrt\_max}) - 1 + \sum_{j=1}^{i} {c'}_j^{max}\Big) = {c'}_{i+1}^{max} \]

Again, this value is non-negative. Since, in both cases, the value of $wzl_{i+1} - wzl_i$ is non-negative, the proof is valid.

In this section, we tried to reduce $\sigma_{rt}$ (while ensuring the RT guarantee) by using minimum static WZL's. We derived two theorems, based on which we found processes for checking and generating the optimal WZL set of each of the $n!$ permutations. Lemma 3 further ensures that all the $n!$ optimal WZL sets exist.

$^{10}$ Note that this result is identical to the $L_3$ in Table X of Appendix C.
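The three generation steps of Section 3.3 can be sketched in Python as follows. This is a minimal illustration rather than the authors' code: the function name and the 0-based indexing are assumptions, and `permutation` is taken to list the master indices from the smallest to the largest WZL.

```python
from typing import List, Sequence


def optimal_wzl(c_max: Sequence[int], c_nrt_max: int,
                permutation: Sequence[int]) -> List[int]:
    """Generate the optimal static WZL set for a predefined permutation.

    permutation[k] is the 0-based index of the master that takes the
    (k+1)-th smallest WZL, i.e. the renumbered master m'_(k+1).
    """
    n = len(permutation)
    # Step 1: renumber the service cycles according to the permutation.
    c_perm = [c_max[m] for m in permutation]
    # Step 2: apply Formula (6) to every renumbered master.
    wzl_perm = []
    prefix_sum = 0
    for i in range(n):
        prefix_sum += c_perm[i]                        # SUM part, j = 1..i
        max_part = max(c_perm[i + 1:] + [c_nrt_max])   # MAX part, j = i+1..n
        wzl_perm.append(max_part - 1 + prefix_sum)
    # Step 3: renumber L' back to the original master order.
    wzl = [0] * n
    for position, master in enumerate(permutation):
        wzl[master] = wzl_perm[position]
    return wzl


if __name__ == "__main__":
    # Worked example of Section 3.3: permutation (m2, m4, m1, m3).
    print(optimal_wzl([3, 7, 5, 9], 4, [1, 3, 0, 2]))  # [23, 15, 27, 20]
```

Because Lemma 3 guarantees that the list built in Step 2 is non-decreasing, the renumbering in Step 3 preserves the requested permutation.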

Fig. 10. Example of (a) static and (b) dynamic deadline assignment

4. DYNAMIC RELATIVE DEADLINE AND WZL ASSIGNMENT

Up to now, we have discussed static WZL's and noted that they are easier to use in practice. However, they are inflexible because the largest service cycles ($c_i^{max}$'s) are always used to calculate WZL's, and the smallest recurrence times ($r_i^{min}$'s) and user-specified latencies ($l_i^{min}$'s) are always used for the relative deadlines. Assuming the dynamic values of the $r_i$'s and $c_i$'s can be provided at issuing time, there is potentially more flexibility for reducing $\sigma_{rt}$.

4.1 Dynamic Relative Deadlines

Figure 10 illustrates an advantageous case for dynamic relative deadlines. In the figure, a regular RT master $\tilde{m}_b$ issues a request with a recurrence time ($r_b$) larger than its recurrence time lower bound ($r_b^{min}$). In Figure 10(a), although it would be fine to set the deadline at $ndl_b = t_b + r_b$, the arbiter is unaware of the dynamic value of $r_b$ and sets the deadline to $ndl'_b = t_b + r_b^{min}$ instead, which makes $\tilde{m}_b$ very likely to receive an urgent grant during the time $[ndl_a, ndl'_b]$ (because the warning zones of $\tilde{m}_a$ and $\tilde{m}_b$ are connected to each other). If the instant recurrence time $r_b$ is provided at issue time $t_b$, the deadline can be set at $ndl_b = t_b + r_b$ as shown in Figure 10(b), in which the request of $\tilde{m}_b$ has a better chance of receiving a non-urgent grant in the time interval $[ndl_a, wt_b]$.

When applying dynamic relative deadlines, the only modification to the static approach is the initialization value of the time-to-deadlines ($ttd_i$'s). Originally, a $ttd_i$ is initialized to a static value: $r_i^{min}$ for regular masters and $l_i^{min}$ for irregular masters. After the modification, it is initialized to a dynamic value, that is, $r_i$ and $l_i$ for regular and irregular masters, respectively. The correctness of our approach is not affected by the modification, since the proof in Section 3 addresses the relation between service cycles and WZL's rather than relative deadlines.
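As a small illustration of Figure 10, the sketch below computes the deadline and the warning-zone entry time of one request under a static and a dynamic relative deadline. It relies only on the relations $ndl = t + r$ and $wt = ndl - wzl$ used in the text; the function name and all numeric values are hypothetical and chosen for illustration.

```python
from typing import Tuple


def deadline_and_warning_time(issue_time: int, relative_deadline: int,
                              wzl: int) -> Tuple[int, int]:
    """Return (ndl, wt): the absolute deadline and the warning-zone entry time."""
    ndl = issue_time + relative_deadline   # ndl = t + r
    wt = ndl - wzl                         # wt = ndl - wzl
    return ndl, wt


if __name__ == "__main__":
    t_b, r_b_min, r_b, wzl_b = 100, 30, 50, 20   # hypothetical values
    # Static: the arbiter only knows the lower bound r_b_min.
    static_ndl, static_wt = deadline_and_warning_time(t_b, r_b_min, wzl_b)
    # Dynamic: the instant recurrence time r_b is supplied at issue time.
    dynamic_ndl, dynamic_wt = deadline_and_warning_time(t_b, r_b, wzl_b)
    # The request stays outside its warning zone for this many extra cycles.
    print(dynamic_wt - static_wt)
```

With the same WZL, the later deadline moves the warning-zone entry time by exactly $r_b - r_b^{min}$ cycles, which is the extra window for non-urgent grants.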
4.2 Dynamic Optimal WZL Assignment

Dynamic WZL assignment uses a dedicated circuit to calculate, in each clock cycle, the instant optimal WZL of each RT master using dynamic $c_j$'s according to:

\[ wzl_i = \mathrm{MAX}(req_j \cdot c_j : \tilde{m}_j \in LARGE_i \cup M_{NRT}) - 1 + \sum (req_j \cdot c_j : \tilde{m}_j \in SMALL_i \cup \{\tilde{m}_i\}) \quad \forall i \in [1, n] \tag{8} \]

where $req_j$ represents the requesting signal of $\tilde{m}_j$ ($req_j = 1$ if $\tilde{m}_j$ is requesting the bus; $req_j = 0$ if not). This approach generates dynamic WZL's much smaller than the static ones (by Formula (5)) for two reasons. First, it uses the dynamic $c_j$'s rather than the $c_j^{max}$'s, the upper bound values. Second, it does not count a term $c_j$ when $\tilde{m}_j$ is not requesting the bus. While the smaller (and dynamic) WZL's imply a lower $\sigma_{rt}$, the RT guarantee still has to be proven safe.

Theorem 3. Given the recurrence time constraint, assuming an NINP protocol, let EDF be the RT-policy in use. A dynamic warning-zone-length set $L = \{wzl_1, \ldots, wzl_n\}$ satisfying Formula (8) is optimal if and only if all user-specified latencies of irregular RT masters' requests are larger than or equal to $r_{min}$, which is specified by Formula (1).

Proof. Proving by contradiction, we begin by introducing the concept of Critical WZL Update (CWU), the only condition in which Formula (8) could be infeasible. Then, we illustrate how a CWU leads to deadline misses. Accordingly, we give two necessary and sufficient conditions for a CWU to make a dynamic WZL set infeasible. The two conditions are then found invalid if the user-defined latencies of irregular masters share a lower bound of $r_{min}$. Finally, the dynamic WZL assignment is found optimal in addition to being feasible.

Since Formula (5), the static version of Formula (8), has been proven feasible, in this proof we need only investigate whether the RT guarantee is affected by the updates of WZL's brought about by issuing and completing requests. Further, also from Theorem 1, we know that the WZL's generated by Formula (8) are sufficiently long to ensure the RT guarantee in conditions$^{11}$ before or after any updates, as long as the warning-zones are entered from the beginning.
However, the only exception happens if a WZL-increasing update (caused by issuing a request) makes a warning-zone be entered from the middle, i.e., when a $wt_i$ is updated from a time after the current time to a time before the current time; this is called a CWU. Figure 11 illustrates a CWU: consider an RT master $\tilde{m}_d$ that issues a request at $t_d$ with a deadline at $ndl_d$. Initially, assume the dynamic $wzl_d$ of $\tilde{m}_d$ is shorter than its static assignment, $wzl_d^{static}$. Given that the service cycles of all other RT masters temporarily remain the same, $\tilde{m}_d$ will not enter its warning-zone until the time $wt_d = ndl_d - wzl_d$. Now, assume that at a time $t_i$ close to $wt_d$, another RT master $\tilde{m}_i$ issues a request with a deadline $ndl_i$ smaller than $ndl_d$. Accordingly, the issuing of this request updates the dynamic WZL of $\tilde{m}_d$ from $wzl_d$ to $wzl'_d$. In such a case, $\tilde{m}_d$ would need to enter its warning-zone at $wt'_d$ to ensure the RT guarantee in the worst case. However, this is impossible since the current time is already $t_i$, which is larger than $wt'_d$. In other words, deadline misses will occur in the worst case (as illustrated in the proof of Lemma 1).

$^{11}$ By condition, we mean the status of all $req_i$'s and $c_i$'s at a certain moment.

Fig. 11. Example of possible deadline miss caused by dynamic WZL assignment: (a) before and (b) after the request of $\tilde{m}_i$ is issued

From observing the above example, we discover that two conditions, holding simultaneously, are necessary and sufficient for a CWU to make a dynamic WZL set following Formula (8) infeasible:

(1) These requests must be issued after the time $wt_d^{static} = ndl_d - wzl_d^{static}$.
(2) These requests must hold a deadline smaller than $ndl_d$.

The first condition is necessary for the infeasibility since, if it fails to hold, $wzl_d$ will be updated at a time before $wt_d^{static}$, in which case the request of $\tilde{m}_d$ would still enter the updated warning-zone on time to avoid deadline misses. The second condition is also necessary because, if it does not hold, the request of $\tilde{m}_i$ will be scheduled after that of $\tilde{m}_d$, so that $\tilde{m}_i$ will not contribute to a deadline miss of $\tilde{m}_d$ because of our choice of EDF. Besides being necessary, the two conditions are also sufficient to make the dynamic WZL set infeasible, since if they occur simultaneously, a deadline miss of $\tilde{m}_d$ will occur whenever $wzl_d$ is updated so that $wt'_d$ becomes smaller than the current time, as in the example.

However, these conditions for infeasibility will never happen if all user-specified latencies are larger than or equal to $r_{min}$, because the two conditions occurring together imply that there must be an RT request, either from a regular RT master or an irregular one, that holds a relative deadline shorter than $wzl_d^{static}$, which is by definition shorter than $r_{min}$. For regular masters, this condition will not happen because their $d_i = r_i$ and their $r_i$'s are assumed to follow the recurrence time constraint. For irregular masters, the condition will not happen either, since the user-specified latencies serving as relative deadlines share $r_{min}$ as their lower bound.

Since the necessary and sufficient conditions of the infeasibility are shown to be invalid, the feasibility is proven. Further, by a proving approach similar to that of Theorem 5, the dynamic WZL assignment is also found optimal.

Dynamic relative deadlines and dynamic WZL assignment are two extensions used when the dynamic values of the $r_i$'s and $c_i$'s are available. Besides providing more flexibility, they also preserve the RT guarantee.

5. TECHNIQUES FURTHER INCREASING NON-URGENT GRANTS

Given that the RT guarantee is ensured, reducing $\sigma_{rt}$ can be viewed as greedily increasing the number of non-urgent grants, which is the number of times RT masters are granted outside their warning-zones. Pursuant to that philosophy, two more techniques are proposed. The first technique suggests which one of the $n!$ optimal WZL sets yields the smallest $\sigma_{rt}$. The second provides a general guideline for altering an NRT-policy when designing a hybrid-mode arbiter.

5.1 Rate-Monotone (RM) Permutation

Up to now, we have shown how to find the optimal WZL set, statically or dynamically, of any given permutation. However, finding such sets and evaluating their performance exhaustively requires $O(n^2 \cdot n!)$ off-line computation time, which is impractical. Therefore, we propose a heuristic of rate-monotone permutation.

Fig. 12. Scenario of two WZL assignments with different permutations: (a) $\tilde{m}_i$ with smaller $r_i$ but larger $wzl_i$, and $\tilde{m}_j$ with larger $r_j$ but smaller $wzl_j$; (b) $\tilde{m}_i$ with smaller $r_i$ and smaller $wzl_i$, and $\tilde{m}_j$ with larger $r_j$ and larger $wzl_j$

The basic idea is that, to increase the number of non-urgent grants (or equivalently, to reduce the number of urgent grants), we should avoid assigning larger WZL's to RT-masters with smaller recurrence times, as in the case illustrated in Figure 12(a). The reasons are twofold. First, a small-recurrence-time master (e.g., $\tilde{m}_i$ in the figure) holds a tight deadline, so a large WZL makes it enter its warning-zone quickly. This increases the possibility that the master's requests will be granted urgently. Second, a small-recurrence-time master, by definition, recurs more frequently, which would probably further increase the total number of urgent grants. In Figure 12(a), all 5 arbitration decisions indicated by vertical dotted lines are very likely to incur urgent grants to $\tilde{m}_i$ or $\tilde{m}_j$, because they all cut through the dotted boxes representing warning zones.
Algorithm: Optimal WZL Assignment with RM-permutation
Input:  $M_{RT} = \{\tilde{m}_1, \ldots, \tilde{m}_n\}$
        $c_1^{max}, \ldots, c_n^{max}$
        $\bar{r}_1, \ldots, \bar{r}_n$
        $c_{nrt\_max}$
Output: $L = \{wzl_1, \ldots, wzl_n\}$

    $M'_{RT}$ ← Sort $M_{RT}$ according to the ascending order of mean recurrence time.
    for i = 1 : n do
        $wzl'_i = \mathrm{MAX}(\{{c'}_j^{max} : j \in [i+1, n]\}, c_{nrt\_max}) - 1 + \sum_{j=1}^{i} {c'}_j^{max}$
    end for
    $L$ ← Renumber $L'$ back to the original order of $M_{RT}$
    return($L$)

Fig. 13. Pseudo code of optimal WZL assignment with RM-permutation

The situation can be significantly improved by assigning WZL's according to the RT-masters' recurrence times, as illustrated in Figure 12(b), where the same 5 arbitration decisions are more likely to be granted non-urgently, since only one of them cuts through the warning-zones. Such an approach is equivalent to (1) pre-defining a rate-monotone permutation to renumber the RT-masters, such that: i

Fig. 15. Inherent resource sharing of dynamic WZL assignment (the terms $\mathrm{MAX}(c_{nrt\_max}, c_4, c_3)$ for $wzl_2$, $\mathrm{MAX}(c_{nrt\_max}, c_4)$ for $wzl_3$, and $\mathrm{MAX}(c_{nrt\_max})$ for $wzl_4$ reuse the same comparators)

if any of the $urg_k$'s becomes high, the OR gate will pull rtModeSel high to signal that RT mode must be used at the next NINP scheduling point. In the meantime, the comparator tree will select the urgent master with the smallest $ttd_k$ and pull its corresponding $gntNxtRt_k$ high to indicate which RT master is to be granted at the next NINP scheduling point. The area complexity of ROWLA using static WZL assignment is $O(n)$ because of the comparators in the TMU's and the comparator tree; the timing complexity is $O(\log n)$ because of the height of the comparator tree.

Applying dynamic WZL assignment needs a slightly different set of input signals and a different design of the TMU's. As shown in the upper-right part of Table IV, dynamic service cycle signals ($c_1, \ldots, c_n$ and $c_{nrt\_max}$) are required instead of WZL signals. Moreover, each TMU should be given a dedicated circuit calculating the dynamic WZL according to Formula (8). This circuit requires $O(n)$ adders and comparators for each of the $n$ TMU's, a requirement that seems to increase the area complexity to $O(n^2)$. However, most of the comparators and adders used in Formula (8) can be shared, as illustrated in Figure 15, where only four comparators are needed to generate all the MAX terms for all four WZL's. Accordingly, only $O(n)$ comparators and adders are needed to calculate all WZL's dynamically, yielding an area complexity of $O(n)$. The timing complexity also remains $O(\log n)$ given that the adders and comparators calculating Formula (8) are connected in a tree-like structure. Finally, although the timing and area complexities remain the same, the actual area and critical path still increase.
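The sharing argument can be mirrored in software: all $n$ dynamic WZL's of Formula (8) follow from one running maximum and one running sum. The Python sketch below is only an illustration under stated assumptions, not a model of the actual circuit: the masters are assumed to be already renumbered in the chosen permutation order (index 0 takes the smallest WZL), the NRT contribution is taken as the single value `c_nrt_max` as in Figure 15, and the function name is hypothetical.

```python
from typing import List, Sequence


def dynamic_wzl(req: Sequence[int], c: Sequence[int],
                c_nrt_max: int) -> List[int]:
    """Compute all dynamic WZL's of Formula (8) with O(n) max and add steps."""
    n = len(c)
    # req_j * c_j: a master that is not requesting contributes nothing.
    eff = [r * x for r, x in zip(req, c)]
    # suffix_max[i] = MAX(c_nrt_max, eff[i], ..., eff[n-1]); each entry reuses
    # the previous comparison, like the shared comparators of Fig. 15.
    suffix_max = [c_nrt_max] * (n + 1)
    for i in range(n - 1, -1, -1):
        suffix_max[i] = max(eff[i], suffix_max[i + 1])
    wzl = []
    prefix_sum = 0
    for i in range(n):
        prefix_sum += eff[i]   # shared SUM over SMALL_i and master i itself
        wzl.append(suffix_max[i + 1] - 1 + prefix_sum)
    return wzl


if __name__ == "__main__":
    # All masters requesting with worst-case service cycles: static result.
    print(dynamic_wzl([1, 1, 1, 1], [7, 9, 3, 5], 4))
    # The second master is idle, so every WZL that counted it shrinks.
    print(dynamic_wzl([1, 0, 1, 1], [7, 9, 3, 5], 4))
```

When every master requests with its worst-case service cycles, the result coincides with the static assignment of Formula (6).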

7. IMPLEMENTATION AND EXPERIMENTAL RESULTS

We conducted comprehensive experiments to show how ROWLA reduces $\sigma_{rt}$ and improves the overall QoS. Section 7.1 presents the area and timing results of the hardware implementation. In Section 7.2, we apply ROWLA to an industrial multi-port DRAM controller [Denali Inc. 2007] to measure the QoS level improvement in terms of DRAM access cycle reduction. In Section 7.3, ROWLA is applied to a QFHD (3840x2160) H.264/AVC video decoding system [Peng et al. 2007] to measure per-frame decoding cycles. Experiments are conducted through cycle-accurate RTL simulation.

Table IV. Timing and area overheads compared to a baseline AHB bus

                 Area                      Timing
                 gates (k)    ov. (%)      delay (ns)    ov. (%)
Baseline         2.4          -            2.9           -
Conventional     3.7          54           2.9           0
Chen             3.9          63           2.9           0
ROWLA, Sta.      3.9          63           2.9           0
ROWLA, Dyn.      6.3          162          4.1           41

7.1 Implementation Results

We implemented ROWLA for 4, 8, and 16 RT masters in Verilog RTL and synthesized it with a TSMC 130nm cell library. The results are summarized in Figure 16. For the static version, the critical path delays for 4, 8, and 16 RT masters are 1.9, 2.7, and 3.0 ns, respectively, and the areas are 1.6, 3.8, and 6.9 K gates, respectively. For the dynamic version, the critical path delays are 4.1, 5.5, and 7.0 ns, respectively, and the areas are 3.5, 5.5, and 9.5 K gates, respectively. While the area costs of both versions are negligible for SoC applications, the static version is significantly faster than the dynamic one.

In order to observe the timing and area overhead, we implemented an 8-master standard AMBA bus with a priority-based arbiter. We then implemented additional circuits on top of this baseline arbiter to provide a precise RT guarantee for 4 masters using different approaches: (1) the conventional approach that uses the RT mode whenever there is any RT request; (2) the static WZL-based approach by Chen et al. [2006]; (3) the static ROWLA; and (4) the dynamic ROWLA. In the last case, dynamic signals are added in addition to the standard AMBA signals. We then compared the area and timing overheads and list them in Table IV. The area overhead ranges from 54% (the conventional approach) to 162% (dynamic ROWLA). The area (and timing) results of Chen and static ROWLA are identical because they have similar hardware architectures and differ only in WZL assignments. Although the area overhead seems high compared to the baseline bus, it is very small for a general SoC. (For example, it is only about 1.2% of our 345K-gate H.264 decoding system.) Therefore, we consider the area overhead minor. The timing overhead is zero for the first three approaches, while it is 41% for the fourth. In the former case, the longest paths of the three implementations are all within 2 ns, while the critical path of the priority-based baseline arbiter is 2.9 ns.
Therefore, they do not degrade the critical path. The 4-master dynamic ROWLA, on the other hand, has a critical path of 4.1 ns. It thus dominates the overall timing, with a 41% degradation. Therefore, when dynamic ROWLA is used on high-speed OCBs, pipelining the arbiter may be required.

[Figure 16: plot of area (gates) against critical path timing (ns) for the series dyn_4mas, dyn_8mas, dyn_16mas, sta_4mas, sta_8mas, and sta_16mas.]
Fig. 16. Synthesis results of ROWLA for 4, 8, and 16 RT masters applying static and dynamic WZL assignments.

ACM Transactions on Computational Logic, Vol. V, No. N, July 2009.

7.2 RT Blockage Ratio and DRAM Access Cycle Reduction

In the two experiments of this subsection, ROWLA is used to add RT guarantee capability to Denali's multi-port DRAM Controller [Denali Inc. 2007]. In Section 7.2.1, the techniques of static/dynamic WZL assignment, RM permutation, and CBI are evaluated for their σrt-reducing and QoS-improving capability. In Section 7.2.2, we present power dissipation and energy consumption results. In Section 7.2.3, the dynamic variations of service cycle and recurrence time are altered to observe their effect on the effectiveness of dynamic deadline and WZL assignment.

Accessing a DRAM controller requires a variable number of service cycles. Seen from the AMBA bus side, the service cycles include a basic penalty (3 cycles for a read operation and 1 for a write), an activation penalty (6 cycles), and burst transfer cycles (1 cycle per word). Accordingly, a 4-burst read access needs 3 + 6 + 4 = 13 cycles in the worst case, or 3 + 4 = 7 cycles if the activation penalty is completely overlapped.

7.2.1 Exp1: Effectiveness of Various Proposed Techniques. In this experiment, we use 4 NRT masters and 4 regular RT masters to measure σrt and the access cycles required to complete 10,000 NRT requests from each NRT master under different degrees of RT access loading (µrt). The service cycle settings of all masters in Exp1 are listed in Table V(a). For example, for master m5,

$c_5^{\min} = 1 + 8 = 9$ cycles (the basic penalty for a write operation plus the burst length of m5), and
$c_5^{\max} = 1 + 8 + 6 = 15$ cycles (the same terms plus the activation penalty).

The quantity △ci, defined by $\Delta c_i = -(c_i^{\max} - c_i^{\min}) / c_i^{\max}$, indicates the degree of ci's dynamic variation and is listed in the second column from the right of Table V. The lower bounds on the recurrence times of the RT masters are set to

$r_i^{\min} = r^{\min} \cdot \gamma_i^{k}, \quad i \in [1, 4],$

where k is iterated from 0 to 39, $r^{\min}$ is assigned according to Formula (1), and $\gamma_1$ to $\gamma_4$ are 1.02, 1.07, 1.15, and 1.30, respectively. The setting of the lower bounds serves the following purposes:
(1) Meeting the recurrence time constraint of Section 2.1.
(2) Generating a wide range of µrt, from 12.3% to 81.3%.
Table V. △ci's and △ri's of different settings in (a) Exp1 and Exp2a, (b) Exp2b, and (c) Exp2c

Group  Master  Regularity  R/W  BLmax_i  △BLi  cmax_i  △ci   △ri
MRT    m1      Regular     W    8        0%    15      -40%  5%
MRT    m2      Regular     R    4        0%    13      -46%  5%
MRT    m3      Regular     W    4        0%    11      -54%  5%
MRT    m4      Regular     R    4        0%    13      -46%  5%
MNRT   m5      -           W    8        0%    15      -40%  -
MNRT   m6      -           R    4        0%    13      -46%  -
MNRT   m7      -           W    4        0%    11      -54%  -
MNRT   m8      -           R    4        0%    13      -46%  -
(a)

Group  Master  Regularity  R/W  BLmax_i  △BLi  cmax_i  △ci   △ri
MRT    m1      Regular     W    8        0%    15      -40%  5%
MRT    m2      Regular     R    4        0%    13      -46%  5%
MRT    m3      Irregular   R/W  4        50%   13      -77%  50%
MRT    m4      Irregular   R/W  4        50%   13      -77%  50%
MNRT   m5      -           R/W  8        50%   17      -71%  -
MNRT   m6      -           R/W  4        50%   13      -77%  -
MNRT   m7      -           R/W  4        50%   13      -77%  -
MNRT   m8      -           R/W  4        50%   13      -77%  -
(b)

Group  Master  Regularity  R/W  BLmax_i  △BLi  cmax_i  △ci   △ri
MRT    m1      Irregular   R/W  8        0%    17      -71%  50%
MRT    m2      Irregular   R/W  4        0%    13      -77%  50%
MRT    m3      Irregular   R/W  4        50%   13      -77%  50%
MRT    m4      Irregular   R/W  4        50%   13      -77%  50%
MNRT   m5      -           R/W  8        50%   17      -71%  -
MNRT   m6      -           R/W  4        50%   13      -77%  -
MNRT   m7      -           R/W  4        50%   13      -77%  -
MNRT   m8      -           R/W  4        50%   13      -77%  -
(c)
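The cmax_i and △ci columns of Table V(a) follow from the service-cycle model of Section 7.2 (basic penalty, activation penalty, and one cycle per burst word). The following is a minimal sketch of that arithmetic for the four regular RT masters; the function names and the short labels m1 to m4 are hypothetical and introduced only for illustration.

```python
# Penalty values quoted in Section 7.2 (cycles, seen from the AMBA bus side).
BASIC_PENALTY = {"R": 3, "W": 1}
ACTIVATION_PENALTY = 6


def service_cycles(op, burst_len):
    """Return (c_min, c_max) for one access of `burst_len` words."""
    c_min = BASIC_PENALTY[op] + burst_len   # activation penalty fully overlapped
    c_max = c_min + ACTIVATION_PENALTY      # activation penalty fully exposed
    return c_min, c_max


def delta_c(c_min, c_max):
    """Dynamic variation -(c_max - c_min) / c_max, as a fraction."""
    return -(c_max - c_min) / c_max


# Regular RT masters of Table V(a): label -> (R/W, maximum burst length).
EXP1_RT_MASTERS = {"m1": ("W", 8), "m2": ("R", 4), "m3": ("W", 4), "m4": ("R", 4)}

for name, (op, burst_len) in EXP1_RT_MASTERS.items():
    c_min, c_max = service_cycles(op, burst_len)
    print(f"{name}: c_min={c_min} c_max={c_max} delta_c={delta_c(c_min, c_max):.1%}")
```

The computed cmax values match the table, and the computed △ci values agree with the tabulated ones to within one percentage point.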

(3) Producing a wide range of recurrence time combinations among the RT masters (r1 : r2 : r3 : r4 ranging from 1 : 1 : 1 : 1 to 1 : 6.4 : 107.5 : 12832.8), which models a variety of situations in SoC systems.

Based on the above lower-bound settings, the distribution of the dynamic ri of each RT master is controlled by its dynamic variance △ri, listed in the rightmost column of Table V(a), such that ri is given a random value in $[r_i^{\min}, r_i^{\min}(1 + \Delta r_i)]$ at each issuing time.

Table VI summarizes the legends representing the combinations of techniques used in Exp1, and Figure 17 shows the results. Figure 17(a) shows the σrt-reducing effectiveness of RM permutation and CBI for static WZL assignment; Figure 17(b), for dynamic WZL assignment. Figure 17(c) compares these σrt results with those of the conventional approach and those of Chen et al. [2006]. Figure 17(d) correlates the reductions in σrt with those in DRAM access cycles.

Table VI. List of legends used in the experiments

Legend       Techniques used
L_s4321      Static optimal WZL assignment with the permutation {m4, m3, m2, m1}
L_s3412      Static optimal WZL assignment with the permutation {m3, m4, m1, m2}
L_s2413      Static optimal WZL assignment with the permutation {m2, m4, m1, m3}
L_s_rm       Static optimal WZL assignment with the RM permutation, {m1, m2, m3, m4}
L_s_rm_cbi   Static optimal WZL assignment with the RM permutation and CBI
L_d4321      Dynamic optimal WZL assignment with the permutation {m4, m3, m2, m1}
L_d3412      Dynamic optimal WZL assignment with the permutation {m3, m4, m1, m2}
L_d2413      Dynamic optimal WZL assignment with the permutation {m2, m4, m1, m3}
L_d_rm       Dynamic optimal WZL assignment with the RM permutation, {m1, m2, m3, m4}
L_d_rm_cbi   Dynamic optimal WZL assignment with the RM permutation and CBI
L_d_dl       L_s_rm_cbi added with dynamic deadlines
L_d_both     L_d_rm_cbi added with dynamic deadlines
L_chen       The static WZL assignment proposed by Chen et al. [2006]
L_conv       The conventional approach that uses the RT mode whenever there are pending RT requests
L_NRT_ONLY   The approach that considers solely NRT QoS criteria while disregarding RT constraints

[Figure 17: four panels plotted against RT_LOADING. Panels (a) to (c) show RT_BLOCKAGE_RATIO and panel (d) shows NORML_CYC. Series: (a) L_s4321, L_s3412, L_s2413, L_s_rm, L_s_rm_cbi; (b) L_d4321, L_d3412, L_d2413, L_d_rm, L_d_rm_cbi; (c) and (d) L_conv, L_chen, L_s_rm_cbi, L_d_rm_cbi, NRT_only.]
Fig. 17. Experiment 1: (a) Resulting σrt of optimal static WZL assignment of different permutations; (b) Resulting σrt of optimal dynamic WZL assignment of different permutations; (c) σrt comparison between our approach and others; (d) Normalized access cycle comparison between our approach and others.

Figure 17(a) shows the resulting σrt of four different permutations (L_s4321, L_s3412, L_s2413, and L_s_rm) under different degrees of µrt. Additionally, a result set applying CBI (L_s_rm_cbi) is included. Three facts can be observed from this figure:
(1) Using the RM permutation, L_s_rm outperforms the other three optimal WZL sets with non-RM permutations by up to 7.0%. L_s2413 comes second since it corresponds to a permutation closer to the RM one.
(2) The reduction brought by L_s_rm is not significant when µrt is above 80% or below 30%. In the former case, the recurrence times of the RT masters are still nearly identical, and therefore a good permutation makes no difference; in the latter case, µrt is so small that a low σrt can be easily achieved by any permutation.
(3) Using the CBI technique, L_s_rm_cbi brings an extra σrt reduction of up to 4.8%. Furthermore, this reduction is not affected when µrt is above 80%, since the technique is independent of the ordering of the RT masters' recurrence times.

Figure 17(b) shows the effectiveness of the RM permutation and CBI techniques using dynamic optimal WZL assignments. While the observations are similar to those for Figure 17(a), the improvement brought by the RM permutation technique is less significant under dynamic assignment, for the reason mentioned in Section 5.1. Moreover, the overall σrt reduction brought by dynamic assignments is greater than that brought by static assignments.

In Figure 17(c), we compare L_s_rm_cbi and L_d_rm_cbi with L_chen (the WZL assignment of Chen et al. [2006]) and L_conv (the conventional approach that uses the RT mode whenever there are RT requests). We observe two facts:
(1) L_s_rm_cbi outperforms L_chen and L_conv by up to 21.5% and 28.7%, respectively. L_d_rm_cbi brings a larger σrt reduction, of up to 37.1% and 30.7% compared to L_conv and L_chen, respectively.
(2) The reduction is less significant when µrt < 30%, for the previously mentioned reason. While less reduction also occurs for L_s_rm_cbi when µrt > 80%, this is not the case for L_d_rm_cbi.
This is because static assignments always use the worst-case (and therefore longest) WZL to prevent any possible deadline miss, while dynamic assignments use the longest WZLs only when the worst case actually occurs. When µrt > 80%, the short recurrence times make the ttd_i's fall below the static WZLs immediately, but not below the dynamic WZLs as soon. Thus, a smaller σrt is achieved even in such cases.

In Figure 17(d), we compare the same result sets as in Figure 17(c) by their total DRAM access cycles, normalized to that of L_conv. Four facts are observed from this figure:
(1) L_s_rm_cbi reduces the DRAM access cycles by up to 23.7% and 16.1% compared to L_conv and L_chen, respectively; for L_d_rm_cbi, the reductions are 27.0% and 23.4%.
(2) Correlating Figure 17(c) and (d), we find that the reduction in σrt leads to a proportional reduction in DRAM access cycles, because a lower σrt gives the DRAM controller's built-in bank-interleaving scheduler more flexibility to overlap the DRAM penalty cycles. This affirms our presumption that a reduction in σrt improves the overall QoS level.
(3) Comparing L_conv and L_d_rm_cbi at µrt = 12.3% (the rightmost points in both Figure 17(c) and (d)), we see that a 5.5% σrt reduction still leads to a 4.5% reduction in DRAM access cycles. This shows that even when RT requests do not interact extensively with NRT ones, reducing RT-mode usage still benefits the overall QoS level.
(4) When µrt < 50%, L_d_rm_cbi almost completely eliminates RT-mode usage (σrt < 1.1%). This makes the difference between its access cycles and those of L_NRT_ONLY smaller than 1.3%. For L_s_rm_cbi, a similar situation occurs when µrt < 40%. In these situations, our approach ensures RT guarantee and yields a high QoS level (indicated by low access cycles), competitive even with L_NRT_ONLY, which does not consider RT constraints at all.

7.2.2 Power Dissipation and Energy Consumption. Besides σrt and total execution cycles, we measure the average power dissipation of the bus under dynamic/static ROWLA and the other approaches. We also measure the total energy consumption and normalize it to the result of L_conv. The results are obtained from the switching activities in Exp1 using Synopsys PrimePower [Synopsis 2005]. For the average power dissipation shown in Figure 18(a), two facts can be observed:
(1) µrt (and its resulting σrt) has no direct impact on average power consumption. The reasons are twofold. First, the bus is always kept fully loaded; therefore, the switching activities under different µrt are similar. Second, both RT and NRT arbitration decisions are calculated no matter which one is chosen in the end; therefore, the power consumption under different σrt is also similar.
(2) The implementation area of each setting, on the other hand, has a direct impact on power. From Table IV, dynamic ROWLA (L_d_rm_cbi) requires the most gates (6.3K gates) and hence consumes the most power (around 1520 µW); L_NRT_ONLY has the fewest gates (2.4K gates) and thus consumes the least power (around 730 µW); L_s_rm_cbi, L_chen, and L_conv have areas ranging from 3.7K to 3.9K gates and consume power from 885 µW to 920 µW.

Accordingly, we observe that the power consumption of the various approaches supporting precise RT guarantee is largely decided by the area cost of their implementations, rather than by their σrt-reducing capability (see footnote 12).

Figure 18(b) presents the total energy consumption normalized to L_conv, which is the product of the average power dissipation (Figure 18(a)) and the total execution cycles (Figure 17(d)).
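The normalization used for Figure 18(b) can be sketched as follows. This is only an illustration of the power-times-cycles product: it assumes that all configurations run at the same clock period so that cycles are proportional to time, and the function name and the example inputs are hypothetical rather than measured data.

```python
def normalized_energy(power_uw, cycles, ref_power_uw, ref_cycles):
    """Energy of a run relative to a reference run.

    Energy is taken as average power multiplied by execution cycles,
    assuming both runs use the same clock period.
    """
    return (power_uw * cycles) / (ref_power_uw * ref_cycles)


# Illustrative inputs only: a design drawing 1520 uW that finishes in 80% of
# the reference cycles, compared against a 900 uW reference design.
ratio = normalized_energy(1520.0, 0.80, 900.0, 1.00)
print(f"normalized energy = {ratio:.2f}")
```

A ratio above 1 means that the extra power outweighs the saving in execution cycles, and a ratio below 1 means the opposite.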
Two interesting facts are observed:
(1) Although L_s_rm_cbi, L_chen, and L_conv are fairly similar in average power consumption, L_s_rm_cbi saves up to 20% of the energy compared to L_conv and 15% compared to L_chen, as a result of its reduced total execution cycles.
(2) Although L_d_rm_cbi also has low execution cycles, which helps reduce total energy consumption, it consumes about 50% more energy than L_conv, as a result of its 70% area overhead over that of L_conv.

Accordingly, we observe that dynamic ROWLA is the more expensive trade-off in terms of both area and power. In large circuits, its better σrt-reducing capability may be worth the relatively low power and area overheads. In small circuits, however, static ROWLA is more attractive.

Footnote 12: Here we exclude the case when the NRT arbitration policy is deliberately designed to save power, in which a low σrt is likely to help reduce power. Swaminathan and Chakrabarty [2005] is one such example.

[Figure 18: two panels plotted against RT_LOADING for L_conv, L_chen, L_s_rm_cbi, L_d_rm_cbi, and NRT_only: (a) POWER (uW); (b) NORMALIZED TOTAL ENERGY.]
Fig. 18. (a) Power dissipation and (b) normalized total energy consumption of Experiment 1.

7.2.3 Exp2: Dynamic Variation and Dynamic Deadline/WZL. From Figure 17(c) and (d), the additional improvement brought by L_d_rm_cbi over L_s_rm_cbi is insignificant (△σrt < 1.4%) when µrt < 50%. Since the dynamic approach is more expensive than the static one, we would like to know when to use the dynamic approach instead of the static one. In this experiment, we raise the dynamic variances △ri's and △ci's to see their impact on the effectiveness of dynamic deadline/WZL assignments. This change is accomplished by substituting irregular RT masters for the originally regular RT masters in Table V(a). Exp2a uses the same setting as the previous experiment; two regular RT masters are substituted in Exp2b, and four in Exp2c. These settings and the resulting △ri's and △ci's are summarized in Table V(b) and (c); Exp2a has the smallest dynamic variances, and Exp2c the greatest. Using these three settings, four result sets are compared: L_d_dl applies dynamic deadline but not dynamic WZL; L_d_rm_cbi applies dynamic WZL but not dynamic deadline; L_s_rm_cbi applies neither; and L_d_both applies both.

Figure 19(a) and (b) show the σrt and the normalized access cycles for Exp2a, respectively; Figure 19(c) and (d), for Exp2b; Figure 19(e) and (f), for Exp2c. Comparing the figures, there are two observations:
(1) For a given µrt, the dynamic deadline and WZL assignments reduce σrt more than the static ones when △ri's and △ci's are larger. For example, when µrt = 50%, △σrt = 0.9% in Exp2a (Figure 19(a)), △σrt = 1.1% in Exp2b (Figure 19(c)), and △σrt = 6.1% in Exp2c (Figure 19(e)).
(2) When △r and △c are raised, the achievable µrt decreases because the recurrence time constraint uses the upper-bound values of the service cycles.

This experiment also suggests guidelines for selecting between static and dynamic assignments (the former being inflexible yet smaller and faster, the latter being larger and slower yet more flexible). For example, assume that the system designer wants to choose the expensive dynamic WZL assignment only in exchange for at least 5% of additional σrt reduction. Then, in a low-△r-and-△c system like that of Exp2a, the designer should use the dynamic assignment only if µrt is larger than a threshold of 70%. In a medium-△r-and-△c system like that of Exp2b, the threshold is 60%, and in a high-△r-and-△c system like that of Exp2c, the threshold is 45%.
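The selection guideline above can be written down as a small decision rule. The sketch below only encodes the three thresholds quoted in the text for the example criterion of at least 5% additional σrt reduction; the function name and the labels "low", "medium", and "high" are hypothetical, and other criteria or systems would need their own thresholds.

```python
# RT-loading thresholds quoted for the "at least 5% additional sigma_rt
# reduction" criterion, keyed by the degree of dynamic variation
# (low: Exp2a-like, medium: Exp2b-like, high: Exp2c-like).
DYNAMIC_THRESHOLD = {"low": 0.70, "medium": 0.60, "high": 0.45}


def choose_wzl_assignment(variation, mu_rt):
    """Return "dynamic" if the RT loading exceeds the quoted threshold,
    otherwise "static" (the smaller and faster option)."""
    threshold = DYNAMIC_THRESHOLD[variation]
    return "dynamic" if mu_rt > threshold else "static"


print(choose_wzl_assignment("low", 0.50))     # below the 70% threshold
print(choose_wzl_assignment("high", 0.50))    # above the 45% threshold
```

With an RT loading of 50%, the rule keeps the static assignment for a low-variation system and switches to the dynamic assignment for a high-variation one.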

[Figure 19: six panels plotted against RT_LOADING for L_s_rm_cbi, L_d_dl, L_d_rm_cbi, and L_d_both; the left panel of each pair shows RT_BLOCKAGE_RATIO and the right panel shows NORML_CYC.]
Fig. 19. Experiment 2: (a) σrt and (b) normalized access cycle of Exp2a; (c) σrt and (d) normalized access cycle of Exp2b; (e) σrt and (f) normalized access cycle of Exp2c.

7.3 Case Study: QFHD H.264 Video Decoding System

QFHD (Quad Full High Definition) is the standard of digital cinema, the next generation of high-definition video technology. Decoding video in QFHD requires extensive computation that is beyond the capability of existing computing systems. Peng et al. [2007] (see footnote 13) proposed a dedicated AMBA-compliant H.264 decoder IP. Running at 250 MHz, it is capable of real-time decoding of a QFHD H.264 video bitstream (see footnote 14). Its performance has been demonstrated in successful integrations on Peking University's SoC platform. The integration photo and the system block diagram are shown in Figure 20.

Footnote 13: This QFHD version is a substantially improved version of the previous CIF version [Peng et al. 2007] published in ASPDAC.
Footnote 14: It needs an average of 250 cycles to decode a 16 x 16 pixel MB (macroblock), or equivalently, 250 M cycles to decode 30 frames of QFHD H.264 video.

[Figure 20: (a) photo of the system; (b) block diagram showing the RT bus masters Display (m1), Ethernet (m2), and CPU (m3), the bus arbiter, and the DRAM controller on the AHB, together with MAU1 to MAU7 (m4 to m10) of the H.264 video decoder and its internal blocks IPRED, INTERP, BSG, DF, MVG, IQ & IT, PARSER, and CAVLD/CABAD.]
Fig. 20. The (a) photo and (b) block diagram of an H.264 decoding system, developed jointly by National Tsing Hua University and Peking University.

Table VII. The function, behavior, and bandwidth requirement of each bus master in the H.264 decoding system

Group  Master  Function       BLmax_i  cmax_i  △ci   rimin  △ri  BW (MB/s)
MRT    m1      Display ctrl.  16       25      -24%  90     5%   949
MRT    m2      Ethernet ctrl. 8        15      -40%  475    25%  80
MRT    m3      CPU            4        13      -62%  500    67%  32
MNRT   m4      Ipt. buf. R.   4        13      -46%  -      -    80
MNRT   m5      Ref. buf. R.   16       25      -24%  -      -    801
MNRT   m6      Line buf. R.   8        17      -35%  -      -    128
MNRT   m7      DF top-row R.  4        13      -46%  -      -    96
MNRT   m8      Ref. buf. W.   12       19      -31%  -      -    356
MNRT   m9      Disp. buf. W.  8        15      -40%  -      -    475
MNRT   m10     Line buf. W.   8        15      -40%  -      -    128
Total                                                            3125

When decoding QFHD video, ten bus masters (m1 to m10 in Figure 20(b)) generate enormous bus traffic toward the DRAM. Each bus master has its own function, behavior, and bandwidth consumption, as shown in Table VII. Masters m1 to m3 are RT masters, each request of which is RT-constrained. Masters m4 to m10 are NRT masters; they are the MAUs (Memory Access Units) of the decoder IP. In total, all bus masters consume a bandwidth of 3.13 GB/s.

Although the MAUs are not RT-constrained, they greatly affect the overall decoding time: if any of them waits too long, its associated sub-IP is suspended, which can cause the whole decoder IP to stall. Whether the waiting of an MAU suspends a sub-IP, and whether the suspension of a sub-IP stalls the whole decoder, depend on the dynamic status of the decoder. If the MAU requests are managed carefully so that the decoder stall cycles are minimized, the total decoding time can be improved.

In all, the hybrid-mode arbiter of the system bus needs to support three competing QoS policies:
(1) Managing all MAU requests to reduce the decoder stall cycles;
(2) Scheduling all bus requests for DRAM to reduce the non-overlapped DRAM penalty;
(3) Scheduling all RT requests to ensure RT guarantee.

The first two policies are supported by the decoder IP's embedded arbiter [Lee 2007]. In this experiment, this embedded arbiter is used in the NRT mode of our hybrid-mode bus arbiter. The seven MAUs are then connected directly to the bus without internal arbitration. For the third policy, which involves the RT policy and the RT/NRT mode-switching policy of the bus arbiter, we use the approach proposed in this paper. With all three QoS policies served, we aim to reduce the total decoding time by using the minimum RT mode needed to ensure RT guarantee, thereby giving the NRT-mode arbitration maximum flexibility to reduce the decoder stall cycles and the non-overlapped DRAM penalty.

We conduct whole-system RTL simulation on three Sun Fire V490 workstations, each equipped with two 1.35 GHz SPARC CPUs and 8 GB of memory. To shorten the extremely long simulation time, we substitute m1 to m3 with three BFMs (Bus Functional Models) configured according to Table VII. With three of the approaches mentioned in Section 7.2 (L_conv, L_chen, and L_s_rm_cbi), five QFHD H.264 sequences are decoded, during which µrt ≈ 31%. The whole simulation takes about nine days on each workstation.

While results for individual sequences are included in Appendix D, Table VIII presents the results averaged over the five sequences. From this table, again, we can see that a lower σrt leads to a higher overall QoS level. In this case, this means fewer decoder stall cycles, fewer non-overlapped DRAM penalty cycles, and a reduced total decoding time.

First, Table VIII(a) presents the σrt results. Notice that the σrt results for I-, P-, and B-frames are very similar to one another, because the characteristics of RT traffic are not affected by frame type.
On average, L_conv yields a σrt of 25.0%; L_chen, 7.2%; and L_s_rm_cbi, 0.4%. In other words, to ensure RT guarantee, L_conv has to use RT mode once in every four arbitration decisions; L_chen, about once in every 14; and L_s_rm_cbi, once in every 250. For 99.6% of the time, L_s_rm_cbi supports RT guarantee without using RT mode, so the NRT mode of the hybrid-mode bus arbiter can be fully used to support the NRT QoS criterion.

The benefit can be observed in Table VIII(b), where the decoder stall-cycle results are presented. On average, L_s_rm_cbi reduces the stall cycles by 64.0%. Different degrees of stall-cycle reduction are observed for different frame types: 72.8% in I-frames, 52.2% in P-frames, and 64.4% in B-frames. These differences arise from the characteristics of the embedded arbiter and the decoder IP, which are beyond the scope of this article.

Uneven reduction between frame types also occurs for the non-overlapped DRAM penalties listed in Table VIII(c), where L_s_rm_cbi reduces the penalty by 5.0% on average. The reduction here is not as significant as that of the decoder stall cycles because reducing the DRAM penalty is the second-level policy of the original embedded arbiter, whereas reducing the decoder stall cycles is the first-level policy.

Table VIII. H.264 video decoding system results, averaged from the results of five sequences

σrt (%)
Policy      I-Frame  P-Frame  B-Frame  Average
L_conv      24.7     25.1     25.3     25.0
L_chen      6.8      7.4      7.5      7.2
L_s_rm_cbi  0.3      0.4      0.4      0.4
(a) σrt

Decoder stall cycles per frame
            I-Frame            P-Frame            B-Frame            Average
Policy      cyc (M) Reduc. (%) cyc (M) Reduc. (%) cyc (M) Reduc. (%) cyc (M) Reduc. (%)
L_conv      1.29    -          1.00    -          1.07    -          1.12    -
L_chen      0.91    -29.1      0.81    -18.9      0.81    -23.8      0.85    -24.4
L_s_rm_cbi  0.35    -72.8      0.47    -52.2      0.38    -64.4      0.40    -64.0
(b) Decoder stall cycles per frame

Non-overlapped DRAM penalty cycles per frame
            I-Frame            P-Frame            B-Frame            Average
Policy      cyc (M) Reduc. (%) cyc (M) Reduc. (%) cyc (M) Reduc. (%) cyc (M) Reduc. (%)
L_conv      3.30    -          2.87    -          3.02    -          3.06    -
L_chen      3.24    -2.0       2.81    -2.1       2.96    -2.0       3.00    -2.1
L_s_rm_cbi  3.14    -5.1       2.72    -5.1       2.87    -4.8       2.91    -5.0
(c) Non-overlapped DRAM penalty cycles per frame

Total decoding time per frame
            I-Frame            P-Frame            B-Frame            Average
Policy      cyc (M) Reduc. (%) cyc (M) Reduc. (%) cyc (M) Reduc. (%) cyc (M) Reduc. (%)
L_conv      9.44    -          10.04   -          10.33   -          9.94    -
L_chen      9.00    -4.7       9.72    -3.2       9.92    -3.9       9.55    -3.9
L_s_rm_cbi  8.26    -12.5      9.22    -8.2       9.26    -10.4      8.91    -10.4
(d) Total decoding time per frame
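The "Reduc. (%)" columns are relative changes against L_conv. As a minimal sketch (the function and variable names are hypothetical), the entries of Table VIII(d) can be recomputed from the tabulated cycle counts as follows:

```python
# Total decoding time per frame in millions of cycles, copied from Table VIII(d).
TOTAL_DECODING_MCYC = {
    "L_conv":     {"I": 9.44, "P": 10.04, "B": 10.33, "avg": 9.94},
    "L_chen":     {"I": 9.00, "P": 9.72,  "B": 9.92,  "avg": 9.55},
    "L_s_rm_cbi": {"I": 8.26, "P": 9.22,  "B": 9.26,  "avg": 8.91},
}


def reduction_percent(baseline, value):
    """Relative change of `value` against `baseline`, in percent.

    A negative result means fewer cycles than the baseline.
    """
    return (value - baseline) / baseline * 100.0


def reductions_vs_conv(table):
    """Recompute the reduction columns for every policy except L_conv."""
    base = table["L_conv"]
    return {
        policy: {key: reduction_percent(base[key], row[key]) for key in row}
        for policy, row in table.items()
        if policy != "L_conv"
    }


for policy, row in reductions_vs_conv(TOTAL_DECODING_MCYC).items():
    print(policy, {key: f"{value:.1f}" for key, value in row.items()})
```

Because the cycle counts in the table are themselves rounded to two decimals, a recomputed entry can differ from the tabulated one by about 0.1 percentage point.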

Finally, the total decoding time is presented in Table VIII(d). L_s_rm_cbi reduces the total decoding time by 12.5% for I-frames, 8.2% for P-frames, and 10.4% for B-frames. On average, the reduction is 10.4%. Note that the reduction in total decoding time is greater than the sum of the reductions in decoder stall cycles and DRAM penalty cycles because the total number of RT requests, as a fixed portion of total bus cycles, also decreases when other portions of bus cycles are reduced. In this case, however, this additional reduction cannot be derived directly from the reductions in the stall cycles and DRAM penalty because they are partly overlapped with the bus cycles of RT requests.

In this case study, the entire QFHD H.264 decoding SoC costs 345K gates and 104K bytes of on-chip SRAM (135K gates for the CPU, 71K for the DRAM controller, 91K for the decoder, and 48K for the rest). In such a design, the proposed RT-handling arbiter circuit takes less than 1.6K gates. For an overall performance improvement of 10.4%, the trade-off is quite attractive.
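To put the gate counts above in proportion, the following minimal sketch sums the quoted breakdown and expresses the arbiter's upper bound of 1.6K gates as a share of the whole SoC. The variable names are hypothetical; the numbers are the ones quoted in the text.

```python
# Gate counts in thousands, as quoted for the QFHD H.264 decoding SoC.
GATES_K = {"CPU": 135, "DRAM controller": 71, "decoder": 91, "rest": 48}
ARBITER_GATES_K = 1.6  # quoted upper bound for the RT-handling arbiter circuit

total_gates_k = sum(GATES_K.values())
arbiter_share_percent = ARBITER_GATES_K / total_gates_k * 100.0

print(f"total = {total_gates_k}K gates")
print(f"arbiter share < {arbiter_share_percent:.2f}% of the SoC")
```

The breakdown sums to the quoted 345K gates, and the arbiter accounts for less than half a percent of that total.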


8. CONCLUSION

In multiple-QoS real-time SoC systems, a new type of real-time guarantee, called precise real-time guarantee, is needed to avoid the drop in overall system performance caused by imprecise RT/NRT mode switching. We propose a general-purpose real-time handler named ROWLA that uses optimal static and dynamic warning-zone-length assignments and two heuristic techniques. Experimental results show that ROWLA reduces the RT-mode usage ratio by up to 37.1% and contributes to a cycle reduction of up to 27.0% when applied to a commercial DRAM controller. On a QFHD H.264 decoding SoC, the proposed approach cuts the total decoding time by an average of 10.4%. The hardware implementation is shown to be low-cost and high-performance, making it applicable to high-speed, cost-sensitive SoC applications. In the future, we intend to explore relaxing the recurrence time constraint as well as application-specific CBI techniques.

ACKNOWLEDGMENTS

The authors would like to thank Dr. TingTing Hwang and Dr. Juinn-Dar Huang for their helpful comments; Wei-Kuan Shih for verifying the correctness of the lemmas and theorems; the NTHU H264 Decoder Team—Chun-Hsin Lee, Sheng-Tsung Hsu, Yuan-Chun Lin, Ping Chao, Wei-Cheng Hung Jian-Wen Chen, Hui-Ting Huang, Adam Shin-Chih Lee, Hao-Ting Huang, and Kai-Hsiang Chang—for designing and implementing the NTHU 4Kx2K H.264 Decoder; Chimei Optoelectronics Inc. for sponsoring the QFHD LCD panels; Dr. Feng Liu and Dong Chang for help integrating the decoder IP into Peking University’s SoC system and providing valuable system-level information; finally, Global Unichip, Inc., for providing continual technical consultancy on system integration and chip implementation. REFERENCES ARM Inc. 1999. AMBA Specification Rev. 2.0. ARM Inc. Available at http://www.arm.com/ products/solutions/AMBA\_Spec.html. ARM Inc. 2003. AMBA AXI Specification. ARM Inc. Available at http://www.arm.com/ products/solutions/axi_spec.html. Bolotin, E., Cidon, I., Ginosar, R., and Kolodny, A. Qnoc: Qos architecture and design process for network on chip. Journal of Systems Architecture 50, 2-3. Chen, C. H., Lee, G. W., Huang, J. D., and Jou, J. Y. 2006. A real-time and bandwidth guaranteed arbitration algorithm for soc bus communication. In Proceedings of Asia South Pacific Design Automation Conference. Yokohama, Japan, 600–605. Combaz, J., Fernandez, J.-C., Lepley, T., and Sifakis, J. 2005. Fine grain qos control for multimedia application software. In Design, Automation and Test in Europe Conference and Exhibition. IEEE Computer Society, Washington, DC, USA, 1038–1043. Conti, M., Caldari, M., Vece, G. B., Orcioni, S., and Turchetti, C. 2004. Performance analysis of different arbitration algorithms of the amba ahb bus. Design Automation Conference 0, 618–621. Davare1, A., Zhu, Q., Natale, M. D., Pinello, C., Kanajan, S., and SangiovanniVincentelli, A. 2007. 
Prediction-based flow control for network-on-chip traffic. In Proceedings of the 44th annual conference on Design automation. ACM, 278–283. Denali Inc. 2007. Databahn DRAM Memory Controller IP. Denali Inc. Available at http: //www.denali.com/products/databahn\_dram.html. ACM Transactions on Computational Logic, Vol. V, No. N, July 2009.

Optimal Warning-Zone-Length Assignment Algorithm

·

37

Digital Cinama Initiatives, LLC 2008. Digital Cinema Specification Version 1.2. Digital Cinama Initiatives, LLC. Available at http://www.dcimovies.com/DCIDigitalCinemaSystemSpecv1_2. pdf. Elpeda Inc. 2007. How to use SDRAM/DDR/DDR2 - Users Manual. Elpeda Inc. Available at http://www.elpida.com/en/products/documents.html. Franklin, G. F., Powell, J. D., and Workman, M. 1997. Digital Control for Dynamic Systems, Third Edition. Addison-Wisley. George, L., Muhlethaler, P., and Rivierre, N. 1995. Optimality and non-preemptive realtime scheduling revisited. INRIA Research Report n2516. George, L., Muhlethaler, P., and Rivierre, N. 2000. A few results on non-preemptive realtime scheduling. INRIA Research Report n3926. Gill, C. D., Levine, D. L., and Schmidt, D. C. 2001. The design and performance of a real-time corba scheduling service. Real-Time Systems 20, 2, 117–154. Goossens, K. 2004. Interconnect-Centric Design for Advanced SoC and NoC, chapter 15. Kluwer. Goossens, K., Dielissen, J., Gangwal, O. P., Pestana, S. G., Radulescu, A., and Rijpkema, E. 2005. A design flow for application-specific networks on chip with guaranteed performance to accelerate soc design and verification. Design, Automation and Test in Europe Conference and Exhibition 2, 1182–1187. Howell, R. R. and Venkatrao, M. K. 1995. On non-preemptive scheduling of recurring tasks using inserted idle times. Information and Computation 117, 1 (Feb.), 50–62. IBM Inc. 2001. CoreConnect Bus Architecture version 3.5. IBM Inc. Available at http:// www-03.ibm.com/chips/products/coreconnect/. Intel Inc. 2008. Intel 82540EM Gigabit Ethernet Controller. Intel Inc. Available at http: //www.intel.com/design/network/products/lan/controllers/82540.htm. Jeffay, K., Stanat, D. F., and Martel, C. U. 1991. On non-preemptive scheduling of periodic and sporadic tasks. In Proceedings of 12th IEEE Real-Time Systems Symposium. San Antonio (TA), U.S.A., 129–139. Joint Video Team, International Telecommunication Union 2007. 
H.264 : Advanced video coding for generic audiovisual services. Joint Video Team, International Telecommunication Union. Available at http://www.itu.int/rec/T-REC-H.264. Jun, M., Bang, K., Lee, H.-J., Chang, N., and Chung, E.-Y. 2007. Slack-based bus arbitration scheme for soft real-time constrained embedded systems. In Proceedings of Asia South Pacific Design Automation Conference. Yokohama, Japan, 159–164. Kim, S., Lee, J., and Lee, J. 2006. Runtime feasibility check for non-preemptive real-time periodic tasks. Information Processing Letters 97, 3 (Feb.), 83–87. Lahiri, K., Dey, S., and Raghunathan, A. 2001. Evaluation of the traffic-performance characteristics of system-on-chip communication architectures. VLSI Design, International Conference on 0, 29. Lahiri, K., Raghunathan, A., and Lakshminarayana, G. 2006. The lotterybus on-chip communication architecture. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 14, 6 (June), 596–608. Lee, A. S. C. 2007. Comprehensive System-level Analysis and Optimization with Milti-QoS Considerations on a H.264 Video Decoder SoC. M.S. thesis, National Tsing Hua University, Hsinchu, Taiwan. Lee, K. B., Lin, T. C., and Jen, C. W. 2005. An efficient quality-aware memory controller for multimedia platform soc. IEEE Transactions on Circuits and Systems for Video Technology 15, 620–633. Lin, B.-C., Lee, G.-W., Huang, J.-D., and Jou, J.-Y. 2007. A precise bandwidth control arbitration algorithm for hard real-time soc buses. Asia and South Pacific Design Automation Conference 0, 165–170. Liu, C. L. and Layland, J. W. 1973. Scheduling algorithms for multiprogramming in a hardhealtime environment. Journal of the Association for Computing Machinery 20, 1 (Jan.), 46–61. ACM Transactions on Computational Logic, Vol. V, No. N, July 2009.

Lu, C., Abdelzaher, T. F., Stankovic, J. A., and Son, S. H. 2001. A feedback control approach for guaranteeing relative delays in web servers. In RTAS ’01: Proceedings of the Seventh Real-Time Technology and Applications Symposium (RTAS ’01). IEEE Computer Society, Washington, DC, USA, 51.

Lu, R. and Koh, C.-K. 2005. Improving the scalability of SAMBA bus architecture. Asia and South Pacific Design Automation Conference 0, 1164–1167.

Meyerowitz, T., Pinello, C., and Sangiovanni-Vincentelli, A. 2003. A tool for describing and evaluating hierarchical real-time bus scheduling policies. Design Automation Conference 0, 312.

Moving Picture Experts Group 2007. Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s – Part 3: Audio. Moving Picture Experts Group. Available at http://www.chiariglione.org/mpeg/standards/mpeg-1/mpeg-1.htm.

Ogras, U. Y. and Marculescu, R. 2006. Prediction-based flow control for network-on-chip traffic. In Proceedings of the 43rd Annual Conference on Design Automation. ACM, 839–844.

OPENCORES 2002. WISHBONE SoC Interconnection Architecture for Portable IP Cores Rev.B3. OPENCORES. Available at http://www.opencores.org/projects.cgi/web/wishbone/wbspec_b3.pdf.

OPENCORES 2007. OCP 2.2 Specification. OPENCORES. Available at http://www.ocpip.org/membership/information/wheel/specification/.

Peng, H. K., Lee, C. H., Chen, J. W., Lo, T. J., Chang, Y. H., Hsu, S. T., Lin, Y. C., Chao, P., Hung, W. C., and Jan, K. Y. 2007. A highly integrated 8mW H.264/AVC main profile real-time CIF video decoder on a 16MHz SoC platform. In Proceedings of Asia South Pacific Design Automation Conference. Yokohama, 112–113.

Peng, H. K., Lee, C. H., Hsu, S. T., and Hung, W. C. 2007. A 4Kx2K Real-time H.264 Decoder IP. Report of the 2007 National Silicon Intellectual Property Contest. Available at http://140.114.75.173/QFHD/.

Pestana, S. G., Rijpkema, E., Rădulescu, A., Goossens, K., and Gangwal, O. P. 2004. Cost-performance trade-offs in networks on chip: A simulation-based approach. Design, Automation and Test in Europe Conference and Exhibition 2, 20764.

Poletti, F., Bertozzi, D., Benini, L., and Bogliolo, A. 2003. Performance analysis of arbitration policies for SoC communication architectures. Design Automation for Embedded Systems 8, 2-3 (June), 189–210.

Puterman, M. L. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience, New York.

Richardson, T. D., Nicopoulos, C., Park, D., Narayanan, V., Xie, Y., Das, C., and Degalahal, V. 2006. A hybrid SoC interconnect with dynamic TDMA-based transaction-less buses and on-chip networks. International Conference on VLSI Design 0, 657–664.

Sha, L., Abdelzaher, K., Arzen, K., Cervin, A., Baker, T., Burns, A., Buttazzo, G., Caccamo, M., Lehoczky, J., and Mok, A. K. 2004. Real time theory: a historical perspective. Real-Time Systems 28, 101–155.

Sha, L., Rajkumar, R., Lehoczky, J., and Ramamritham, K. 1989. Mode change protocols for priority-driven preemptive scheduling. Tech. rep., Amherst, MA, USA.

STMicroelectronics 2006. STBus Interconnect. STMicroelectronics. Available at http://www.st.com/stonline/products/technologies/soc/stbus.htm.

Sutton, R. S. and Barto, A. G. 1998. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA.

Swaminathan, V. and Chakrabarty, K. 2005. Pruning-based, energy-optimal, deterministic I/O device scheduling for hard real-time systems. ACM Transactions on Embedded Computing Systems 4, 1 (Feb.), 141–167.

Synopsys 2005. PrimePower Manual Version 2005.09. Synopsys.

Takizawa, T. and Hirasawa, M. 2001. An efficient memory arbitration algorithm for a single chip MPEG2 AV decoder. IEEE Transactions on Consumer Electronics 47, 3 (Aug.), 660–665.

Wüst, C. C., Steffens, L., Verhaegh, W. F., Bril, R. J., and Hentschel, C. 2005. QoS control strategies for high-quality video processing. Real-Time Syst. 30, 1-2, 7–29.