Journal of VLSI Signal Processing 46, 133–151, 2007

© 2007 Springer Science + Business Media, LLC. Manufactured in The United States. DOI: 10.1007/s11265-006-0019-4

Asynchronous Layered Interface of Multimedia SoCs for Multiple Outstanding Transactions

EUN-GU JUNG AND DONGSOO HAR
Department of Information and Communications, Gwangju Institute of Science and Technology, 1 Oryong-dong, Buk-gu, Gwangju 500-712, South Korea

JEONG-GUN LEE
Computer Laboratory, University of Cambridge, William Gates Building, 15 JJ Thomson Avenue, Cambridge CB3 0FD, UK

KYOUNG-SON JHANG
Department of Computer Engineering, College of Engineering, Chungnam National University, 220 Gung-dong, Yuseong-gu, Daejeon 305-764, South Korea

JEONG-A LEE
Department of Computer Engineering, Chosun University, 375 Susuk-dong, Dong-gu, Gwangju 501-759, South Korea

Received: 31 August 2006; Revised: 18 November 2006; Accepted: 22 November 2006

Abstract. In this paper, a novel asynchronous layered interface for a high performance on-chip bus is designed in a Globally Asynchronous Locally Synchronous (GALS) style. The proposed asynchronous layered interface with distributed and modularized control units supports multiple outstanding in-order/out-of-order transactions to achieve high performance. In the layered architecture, an asynchronous layered interface performing complex functions can be readily extended without repeating the implementation of the whole bus interface. Simulations are carried out to measure the performance and power consumption of the implemented asynchronous on-chip bus with the proposed asynchronous layered interface. Simulation results demonstrate that the throughput of the asynchronous on-chip bus with multiple outstanding out-of-order transactions is increased by 30.9%, with a power consumption overhead of 16.1% and an area overhead of 56.8%, compared to the asynchronous on-chip bus with a single outstanding transaction.

Keywords: multiple outstanding transactions, in-order/out-of-order transaction completion, asynchronous on-chip bus, GALS

1. Introduction

As CMOS technology evolves, the complexity of System-on-a-Chip (SoC) design increases rapidly [1]. In particular, SoCs for multimedia information processing, relevant to IPTV, HDTV, DMB, HD-DVD, mobile phones, etc., typically have high complexity. They have been implemented with various design methods to meet the run-time requirements and design goals of their applications [2–4], and they commonly use on-chip buses to transfer data between processing blocks.

For multimedia SoC design, synchronous on-chip buses have been used [5–8]. With this conventional design based on a single global clock, it is very hard to transfer data between Intellectual Properties (IPs) in an SoC without serious performance degradation. Although an SoC may be divided into multiple clock domains to avoid the performance loss, synchronization failures may occur frequently when communicating between different clock domains. This problem can be mitigated by a highly efficient clock distributor, but its use can lead to excessive power consumption [1, 9, 10]. As an alternative, Globally Asynchronous Locally Synchronous (GALS) design [11] has been studied widely, since it can reuse existing synchronous IPs as local modules and eliminate synchronization failures by means of Global Asynchronous Interconnects (GAIs). Contrary to conventional SoC design, GALS design has the potential to prevent excessive power consumption because it does not require a highly efficient global clock distributor and clock generator. In other words, each local module has a local clock generator and a local clock distributor for its synchronous IP, while the global clock generator and global clock distributor are removed, since the GAI is implemented with self-timed circuits. A GAI also consumes inherently low power, drawing only leakage power when there is no data transfer.

Asynchronous On-Chip Buses (OCBs) [12, 13] have been proposed for low power GAIs. These OCBs provide low latency with small silicon area, like the synchronous OCBs [5–8] widely used in SoC design. New features such as multiple outstanding transactions with in-order/out-of-order transaction completion [8] were introduced for synchronous OCBs to meet the demand for high performance. However, few asynchronous OCBs with such features have appeared in the literature so far.

In this paper, we propose a high performance asynchronous OCB enabling multiple outstanding transactions with in-order/out-of-order transaction completion for a GAI. These features are provided to master IPs and slave IPs through an asynchronous layered interface with distributed and modularized control units. In the layered architecture, an asynchronous layered interface performing complex functions can be readily extended without repeating the implementation of the whole bus interface.

This paper is organized as follows. Section 2 introduces the concepts of the asynchronous handshake protocol, multiple outstanding transactions, and in-order/out-of-order transaction completion. Section 3 presents the proposed asynchronous layered interface. In Section 4, the implementation of asynchronous OCBs with the proposed asynchronous layered interface based on distributed and modularized control units is explained in detail. Section 5 describes the simulation environments and the corresponding simulation results. Concluding remarks are given in Section 6.

2. Asynchronous Interconnection for Multiple Outstanding Transactions

2.1. Asynchronous Handshake Protocol

Asynchronous systems use a handshake protocol [15] for data transmission, as shown in Fig. 1. A sender having data to transmit sends a request signal to a receiver, which, in return, sends back an acknowledge signal to complete a communication cycle. A handshake protocol can be classified by its signaling method into two-phase signaling and four-phase signaling. Four-phase signaling generates four signal transitions on the two control signals, request and acknowledge, in order to transmit a single data item. Two-phase signaling, shown in Fig. 1a, makes two signal transitions, and these transitions leave the control signals in a different state, unlike four-phase signaling. An asynchronous handshake protocol can also be classified into two types depending on its data encoding scheme. The first type uses bundled data encoding, where a single physical wire transmits a logical 1-bit datum, as shown in Fig. 1a. The second type uses Delay Insensitive (DI) data encoding, which needs more than one physical wire to transfer a logical 1-bit datum. The second type does not require a request signal, because the encoded data themselves contain the data validity information carried by the request signal, as in Fig. 1b. An asynchronous handshake protocol with DI data encoding enables robust data transfer, since the receiver obtains both the transferred data value and the data validity from the encoded data.

Figure 1. Asynchronous handshake protocols with: a bundled data encoding; b delay insensitive data encoding.
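As an illustration of the four-phase discipline just described, the following is a minimal Python sketch of one bundled-data transfer cycle. It is a behavioural model only: the class and method names are ours, and the interfaces in this paper are self-timed circuits synthesized from STGs, not software objects.

```python
# Hedged sketch: a software model of one four-phase bundled-data handshake cycle.
class BundledChannel:
    """A req/ack control pair plus a data bus; data is valid only while req is high."""
    def __init__(self):
        self.req = 0
        self.ack = 0
        self.data = None
        self.log = []                      # records the four transitions per transfer

    def four_phase_transfer(self, value, on_receive):
        # Phase 1: the sender places the data, then raises req (bundling constraint:
        # req may rise only after the data lines are stable).
        self.data = value
        self.req = 1
        self.log.append("req+")
        # Phase 2: the receiver latches the data and raises ack.
        on_receive(self.data)
        self.ack = 1
        self.log.append("ack+")
        # Phase 3: the sender releases req (return-to-zero).
        self.req = 0
        self.data = None
        self.log.append("req-")
        # Phase 4: the receiver releases ack; the channel is idle again.
        self.ack = 0
        self.log.append("ack-")

received = []
ch = BundledChannel()
for word in (0x10, 0x20):
    ch.four_phase_transfer(word, received.append)
print(received)   # [16, 32]
print(ch.log)     # four transitions per data item: req+, ack+, req-, ack- (twice)
```

A two-phase channel would instead toggle req and ack once per data item, leaving the wires in the opposite state after each transfer.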

2.2. Multiple Outstanding Transactions with In-order/Out-of-order Transaction Completion

To increase the performance of a synchronous OCB, two design schemes [8, 14] have been considered:

- Multiple outstanding transactions [8, 14]: Since most OCBs support only a single outstanding transaction, each master interface issues only one command before the corresponding response arrives [5–7]. However, if an on-chip bus supports multiple outstanding transactions, each master interface can generate another command before the previous response arrives, so that multiple outstanding transactions increase bus performance. The high performance synchronous OCB AMBA 3 AXI [8] already supports this function and has been used for real applications requiring high performance, such as Freescale's multimedia processor [16], TTPCom's CBEmacro 3G modem [17], ARM-based MPSoCs [18], and so on.

- In-order/out-of-order transaction completion [8, 14]: When a master interface issues several commands as part of multiple outstanding transactions, the corresponding responses may arrive out of order, owing to the different response time associated with each command. It is assumed in Fig. 2a that Master 1 transfers three commands to three slaves sequentially and that the corresponding responses arrive in reverse order, as seen in Fig. 2b. When the responses from the slaves arrive, the master interface of Master 1 transfers the stored responses to Master 1. It is also assumed that the type2 transaction has higher processing priority. If a master IP supports only in-order transaction completion, its master interface must rearrange the sequence of responses with a Re-Order Buffer (ROB) [19]. However, if a master IP supports out-of-order transaction completion, the order of the responses causes no problem, and the master interface need not wait to complete these responses in order. In Fig. 2, for instance, response3, corresponding to command3 of type2, is transferred to Master 1 first, and the other responses of type1 are transmitted to Master 1 in order. As seen in Fig. 2b, the completion time of the out-of-order transaction is reduced from t2 to t1, giving better performance. A small sketch contrasting the two completion styles is given below.
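The sketch below contrasts the two completion styles using the three-command scenario of Fig. 2. The arrival times are made up for illustration; only the ordering behaviour (a re-order buffer delaying early responses versus immediate out-of-order delivery) reflects the discussion above.

```python
# Hedged sketch comparing in-order and out-of-order transaction completion.
# Responses arrive in reverse order: (command id, arrival time); times are invented.
arrivals = [(3, 1.0), (1, 2.0), (2, 3.0)]   # response3 (high priority) arrives first

def in_order_delivery(arrivals):
    """A re-order buffer holds early responses until all earlier ids have arrived."""
    delivered, buffered, next_id = {}, {}, 1
    for cmd_id, t in sorted(arrivals, key=lambda a: a[1]):
        buffered[cmd_id] = t
        # drain the buffer in command order; delivery cannot happen before arrival
        while next_id in buffered:
            delivered[next_id] = max(t, buffered[next_id])
            next_id += 1
    return delivered

def out_of_order_delivery(arrivals):
    """Each response is forwarded to the master IP as soon as it arrives."""
    return {cmd_id: t for cmd_id, t in arrivals}

print(in_order_delivery(arrivals))      # {1: 2.0, 2: 3.0, 3: 3.0}
print(out_of_order_delivery(arrivals))  # {1: 2.0, 2: 3.0, 3: 1.0}
```

With these arrival times the high-priority response3 is not delivered until t = 3.0 under in-order completion, but at t = 1.0 under out-of-order completion, which is the t2-versus-t1 gap of Fig. 2b.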

3. Architecture

The increasing design complexity in implementing a high performance asynchronous OCB makes it difficult to design a centralized control unit.

Figure 2. In-order/out-of-order transaction completion: a transactions between a master and three slaves; b performance improvement in out-of-order transaction due to higher priority of response3.

Such a design with a centralized control unit can suffer from increased area cost, long synthesis time, and degraded performance due to poor optimization [20, 21]. As a solution, a layered architecture [22] for off-chip networks can be considered. However, it is not suitable to apply the layered architecture directly to the implementation of communication protocols for an asynchronous OCB, because the design environment of an SoC differs from that of an off-chip network. Hence, the layered architecture is modified for an asynchronous OCB here. The asynchronous OCB in Fig. 3 is a simplified model based on the layered architecture. The asynchronous OCB has a master interface and a slave interface for communication. Each interface consists of three layers: a physical layer, a data link layer, and a transport layer.

Since the asynchronous OCB employs single physical addressing, like most commercial OCBs, a network layer is not necessary. The physical layer is concerned with data encoding, filtering, and bus driving. The data encoding function in the layered architecture proposed here carries out DI data transmission. The crosstalk effect on wires, which increases data transfer time, can be reduced by the DI data encoding scheme. The filtering function ensures correct behavior of the control circuits by blocking unnecessary signals from the shared bus to the control circuits in the master interface and the slave interface. Since each interface drives highly capacitive loads to transfer data, the bus driving function is indispensable.

Figure 3. GALS system with an asynchronous OCB based on the layered architecture.

The data link layer is concerned with flow control and access control. For flow control, no special mechanism or dedicated control circuit is needed, because the asynchronous handshake protocol prevents data overflow at the destination by withholding the acknowledge signal. An asynchronous arbiter is used for access control of the shared bus. For lock and burst transactions, a special control signal sent from a master IP locks the asynchronous arbiter dedicated to the master interfaces, allowing the particular master IP continued use of the bus.

The transport layer is concerned with flow control for read/write transactions, burst transactions, split transactions, multiple outstanding transactions, and in-order/out-of-order transaction completion [8, 14]. A read/write transaction transfers a single datum between a master interface and a slave interface. Since a split transaction allows a master interface to release the bus before the corresponding response arrives from a slave interface, other master interfaces can use the bus, leading to increased bus performance. Unlike a read/write transaction, a burst transaction transfers many data items using a single destination address.
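As a concrete example of the kind of DI code the physical layer can use, the sketch below shows a dual-rail encoding, a common choice in which each logical bit occupies two wires. The paper does not state which DI code is actually implemented, so the mapping here (01 for logic 0, 10 for logic 1, 00 as the spacer) is an assumption for illustration.

```python
# Hedged sketch of a delay-insensitive dual-rail code (assumed mapping, see above).
SPACER = (0, 0)

def encode(bits):
    """Map each logical bit to a (true_rail, false_rail) pair."""
    return [(1, 0) if b else (0, 1) for b in bits]

def is_complete(word):
    """Data validity is carried by the code itself: a word is valid when every bit
    pair holds exactly one asserted rail, so no separate request wire is needed."""
    return all(pair in ((1, 0), (0, 1)) for pair in word)

def decode(word):
    assert is_complete(word), "codeword not yet complete (still in spacer/transition)"
    return [1 if true_rail else 0 for (true_rail, _false_rail) in word]

data = [1, 0, 1, 1]
wires = encode(data)
print(wires)                      # [(1, 0), (0, 1), (1, 0), (1, 0)]
print(is_complete(wires))         # True  -> the receiver may acknowledge
print(decode(wires))              # [1, 0, 1, 1]
print(is_complete([SPACER] * 4))  # False -> the bus has returned to the spacer state
```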

4. Implementation

Three asynchronous OCBs with the proposed asynchronous layered interface are designed: (1) an asynchronous on-chip bus with a single outstanding transaction (SI-OCB), (2) an asynchronous on-chip bus with multiple outstanding transactions and in-order transaction completion (MI-OCB), and (3) an asynchronous on-chip bus with multiple outstanding transactions and out-of-order transaction completion (MO-OCB). Figure 4 shows the overview of the asynchronous OCB configuration, including the command and response channels. Each channel has an asynchronous arbiter unit implemented in a tree structure [23]. The address, write, control, and tag data are transferred through the command channel from a master to a slave, while the read, error, and tag data are transmitted through the response channel from a slave to a master. A split transaction is supported by a decoupled protocol [12] with two separate channels.
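The following sketch illustrates, in software terms, how two decoupled channels support split transactions: a master releases the command channel as soon as its command (with tag data) is issued, and the response returns later on the separate response channel. The queue-based model and field names are ours, not the signal-level protocol of the implemented bus.

```python
# Hedged sketch of the decoupled (split-transaction) protocol over two channels.
from collections import deque

command_channel = deque()   # address/write/control/tag data, master -> slave
response_channel = deque()  # read/error/tag data, slave -> master

def master_issue(master_id, addr):
    # The tag identifies the issuing master so the response can be routed back later.
    command_channel.append({"master_id": master_id, "addr": addr})
    # returns immediately: the bus is released without waiting for the response

def slave_service(read_memory):
    cmd = command_channel.popleft()
    response_channel.append({"master_id": cmd["master_id"],
                             "rdata": read_memory(cmd["addr"])})

memory = {0x10: 111, 0x20: 222}   # toy slave contents
master_issue(1, 0x10)             # two masters interleave on the command channel
master_issue(2, 0x20)             # before any response arrives
slave_service(memory.get)
slave_service(memory.get)
print(list(response_channel))
# [{'master_id': 1, 'rdata': 111}, {'master_id': 2, 'rdata': 222}]
```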

Figure 4. Configuration of the implemented on-chip bus.

Burst transactions are not supported in the implementation, so each transaction transfers only one read/write datum to a destined address. An exemplary implementation of an asynchronous on-chip bus with burst transactions is addressed in [12].

4.1. Asynchronous OCB for a Single Outstanding Transaction

Figure 5 shows the proposed layered interfaces of SI-OCB with distributed and modularized control units corresponding to the partitioned functionality of each layer. Each interface consists of three layers. Each subblock in each layer is a control circuit generated from a Signal Transition Graph (STG) by the Petrify synthesis tool [24]. The master interface and the slave interface are each divided into two parts by a horizontal dashed line.

The upper part of the master interface transfers information of a master IP to the upper part of the slave interface. The upper part of the slave interface receives the information of the master IP from the upper part of the master interface and transfers it to a slave IP. The lower part of the slave interface transfers information of the slave IP to the lower part of the master interface. The lower part of the master interface receives the information of a slave IP from the lower part of the slave interface and transfers it to the master IP.

The behavior of both interfaces is as follows. The master IP activates the master interface in Fig. 5a to transfer information of the master IP to the slave IP. Here, Addr, Wdata, Ctrl, Rdata, and error indicate the address, write data, control data, read data, and error data, respectively.

Figure 5. Layered interface for asynchronous on-chip bus (SI-OCB) with a single outstanding transaction: a master interface; b slave interface.

The controller AHPC-T1 of the transport layer gets information from a master IP and uses the signals (issue_req, issue_ack) to generate a single token related to a single outstanding transaction. The generated token is transferred to the controller AHPC-T2. The token is consumed by an incoming response from a slave interface, and only then can AHPC-T2 accommodate a new token, i.e., AHPC-T1 can transfer further information of the master IP. Hence, only a single outstanding transaction is allowed. The controller AHPC-L1 in the data link layer uses the arb_req and arb_gnt signals to obtain permission to access the command channel. The physical layer transfers information of a master IP to a slave IP through the command channel, performing data encoding (DE encoder), address decoding (addr decoder) to select a slave IP, filtering (filter), and bus driving (driver).

The transferred information from the master interface activates the slave interface in Fig. 5b. The physical layer performs data decoding, followed by activation of the controller AHPC-L3 in the data link layer. The information consists of address, write, control, and tag data. The tag data contain the identifiers master_id and slave_id of the source (master) and the sink (slave), respectively. AHPC-L3 stores the decoded data into an asynchronous buffer Buffer3 and activates the controller AHPC-T3 of the transport layer. The controller AHPC-T3 transfers the data to a slave IP and stores master_id into an asynchronous buffer Buffer4, so that the corresponding result of the slave IP can be sent back to the master interface identified by master_id.

When the physical layer of the master interface gets the data from the slave IP, it performs data decoding (DE decoder) and activates the controller AHPC-L2 in the data link layer (see Fig. 5a). The response information of the slave IP includes read, error, and tag data. AHPC-L2 then stores the incoming data into an asynchronous buffer Buffer2 and activates the controller AHPC-T2 in the transport layer. Two control signals, buf_req (buffer request) and buf_ack (buffer acknowledge), are used to store data into Buffer2. AHPC-T2 transfers the data to the master IP and generates the issue_rout and issue_aout signals associated with the single outstanding transaction.
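A minimal behavioural sketch of the single-token discipline described above is given below: at most one token exists, it is created when a command is issued, and it is released only when the corresponding response has been delivered, so a second command cannot be issued in between. The class is illustrative; the real AHPC-T1/AHPC-T2 controllers are asynchronous circuits.

```python
# Hedged behavioural model of the single-token mechanism of the SI-OCB master interface.
class SingleTokenMaster:
    def __init__(self):
        self.token_outstanding = False   # at most one token, hence one open transaction

    def issue(self, command):
        if self.token_outstanding:
            return False                 # stall: the previous response has not returned
        self.token_outstanding = True    # token generated together with the command
        print("command issued:", command)
        return True

    def response(self, data):
        assert self.token_outstanding, "response without an outstanding command"
        self.token_outstanding = False   # token consumed; a new command may be issued
        print("response delivered to master IP:", data)

m = SingleTokenMaster()
print(m.issue("read 0x10"))   # True
print(m.issue("read 0x20"))   # False -> blocked until the response arrives
m.response("0xCAFE")
print(m.issue("read 0x20"))   # True
```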

4.2. Asynchronous OCB for Multiple Outstanding In-order Transactions

Figure 6 shows the proposed layered interfaces of MI-OCB, which are implemented by modifying those of SI-OCB. The master interface of MI-OCB is derived by modifying AHPC-T1 in the master interface of SI-OCB and adding a new asynchronous FIFO FIFO1 between AHPC-T1′ and AHPC-T2, an asynchronous Re-Order Buffer (ROB), and a MI_gen subblock, so that AHPC-T1′ in the transport layer, the asynchronous ROB in the data link layer, and the MI_gen subblock in the physical layer are used instead, as shown in Fig. 6a. The other controllers are used in the same manner as in SI-OCB. Here, w_req, w_ack, r_req, and r_ack indicate write request, write acknowledge, read request, and read acknowledge, respectively.

The modifications are related to supporting multiple outstanding transactions and in-order transaction completion. The modified controller AHPC-T1′ and the appended component FIFO1 enable multiple outstanding transactions through a token mechanism. To create a token, the controller AHPC-T1′ of the transport layer uses the issue_rin and issue_ain signals. A token is generated and inserted into FIFO1 when the master interface is activated and FIFO1 has an empty slot. FIFO1 controls the number of multiple outstanding transactions by means of the tokens it stores. A token in FIFO1 is consumed by an incoming response from the slave interface, and then FIFO1 can accommodate a new token, which means another transaction is possible for the master IP. In-order transaction completion is enabled by the asynchronous ROB and the MI_id generated by the MI_gen subblock in the master interface. Since the master interface generates MI_id per transaction in order, the asynchronous ROB can rearrange the sequence of incoming responses based on MI_id.

The slave interface of MI-OCB is identical to that of SI-OCB, except for the asynchronous buffer Buffer3′ and the asynchronous FIFO FIFO2 involved with multiple outstanding transactions and in-order transaction completion (see Fig. 6b). Instead of Buffer4 in the slave interface of SI-OCB, the asynchronous FIFO FIFO2 in the transport layer is used to accommodate multiple outstanding transactions. Buffer3′ is implemented by adding data storage for the MI_id, like Buffer2′. The other controllers are used in the same manner as in SI-OCB. It is noteworthy that MI-OCB can be implemented easily through small modifications of the designed interfaces of SI-OCB, thanks to the highly distributed and modularized control units of the proposed layered architecture.
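The sketch below models how FIFO1 and MI_gen could bound the number of outstanding transactions: one token per issued command, a stall when the FIFO is full, and a sequential MI_id attached to each command as tag data. The depth of four matches the optimum reported later in Section 5; everything else is an illustrative assumption.

```python
# Hedged sketch of FIFO1/MI_gen in the MI-OCB master interface.
from collections import deque

class MultiTokenMaster:
    def __init__(self, depth=4):           # depth 4 = optimum reported in Section 5
        self.depth = depth
        self.fifo1 = deque()                # outstanding-transaction tokens
        self.next_mi_id = 0                 # MI_gen: sequential identifier

    def issue(self, command):
        if len(self.fifo1) == self.depth:
            return None                     # FIFO1 full: issuing must stall
        mi_id = self.next_mi_id
        self.next_mi_id += 1
        self.fifo1.append(mi_id)            # token inserted when the command is issued
        return mi_id                        # MI_id travels with the command as tag data

    def response_arrived(self):
        self.fifo1.popleft()                # one token consumed per incoming response

m = MultiTokenMaster(depth=4)
print([m.issue(f"cmd{i}") for i in range(5)])   # [0, 1, 2, 3, None] -> fifth command stalls
m.response_arrived()                            # a returning response frees one slot
print(m.issue("cmd4"))                          # 4
```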

Figure 6. Layered interface for asynchronous on-chip bus (MI-OCB) with multiple outstanding in-order transactions: a master interface; b slave interface.

4.3. Asynchronous Re-order Buffer for MI-OCB

To rearrange a sequence of incoming responses, an asynchronous ROB is necessary. An asynchronous ROB for an asynchronous microprocessor was proposed in [15, 25]. Direct use of its structure in an asynchronous OCB is inefficient, because the operating environment of an asynchronous ROB in an asynchronous OCB differs significantly from that in an asynchronous microprocessor. Hence, a new asynchronous ROB is implemented for our environment, as shown in Fig. 7. Like the asynchronous ROB copy-back in [15], the implemented asynchronous ROB uses a token passing method. They have similar behavior, but the resulting STGs and circuits differ from each other. Here, buf_req, buf_ack, to_req, to_ack, ti_req, and ti_ack indicate buffer request, buffer acknowledge, token output request, token output acknowledge, token input request, and token input acknowledge, respectively.

Since the proposed ROB uses distributed control units, its capacity can be increased by simply adding cells. A cell consists of a ROB controller and data storage. This feature ensures high modularity and scalability of the proposed ROB.

Figure 7. Proposed asynchronous ROB for MI-OCB.

It has been found from simulations that the optimal number of multiple outstanding transactions is four when throughput and power consumption are considered; this finding will be revisited in Section 5. Since the number of multiple outstanding transactions considered is four, the asynchronous ROB has four buffers and can therefore handle four responses simultaneously. In the initial state, the ROB starter in Fig. 7 generates a token and supplies it to the first cell (Cell #1), permitting the first cell to transfer a stored response to a master IP. When a new response from a slave arrives, the proposed asynchronous ROB stores it into the appropriate cell according to its MI_id. The sequence of the stored responses may be out of order. When the cell holding a stored response gets the token, it transfers the response to the master IP and hands the token over to the neighboring cell. After three token passings from the first cell (Cell #1), the ROB starter transfers the token from the last cell (Cell #4) back to the first cell. If a neighboring cell receiving the token does not hold a response, it keeps the token until a response from a slave arrives. Hence, the proposed asynchronous ROB always outputs the stored responses in order.
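The token-passing behaviour of the proposed ROB can be summarized with the small model below: four cells, one circulating token, responses stored by MI_id, and delivery strictly in issue order. Only the token discipline is captured; the per-cell asynchronous controllers and handshake signals of Fig. 7 are abstracted away.

```python
# Hedged sketch of the token-passing re-order buffer of Fig. 7.
class ReorderBuffer:
    def __init__(self, cells=4):
        self.cells = [None] * cells   # data storage of Cell #1 .. Cell #4
        self.token = 0                # index of the cell currently holding the token
        self.delivered = []           # responses handed to the master IP, in order

    def store(self, mi_id, response):
        # MI_id is assigned in issue order, so it selects the cell directly.
        self.cells[mi_id % len(self.cells)] = response
        self._drain()

    def _drain(self):
        # The cell holding the token forwards its response (if any) and passes the
        # token to its neighbour; an empty cell holds the token until data arrives.
        while self.cells[self.token] is not None:
            self.delivered.append(self.cells[self.token])
            self.cells[self.token] = None
            self.token = (self.token + 1) % len(self.cells)  # wrap Cell #4 -> Cell #1

rob = ReorderBuffer()
rob.store(2, "resp2")     # arrives early: buffered, token still waits at Cell #1
rob.store(0, "resp0")
rob.store(1, "resp1")
print(rob.delivered)      # ['resp0', 'resp1', 'resp2'] -> always in issue order
```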

4.4. Asynchronous OCB for Multiple Outstanding Out-of-Order Transactions

The layered master interface of MO-OCB is almost identical to that of MI-OCB, except for a MIO Buffer subblock, as shown in Fig. 8. Here, MO_MI, MO_id, MI_r_req, MI_r_ack, MO_r_req, MO_r_ack, MI_r_gnt, MO_r_gnt, sel_MI, and sel_MO indicate in-order/out-of-order transaction, out-of-order transaction identification number, read request of an in-order transaction, read acknowledge of an in-order transaction, read request of an out-of-order transaction, read acknowledge of an out-of-order transaction, read grant of an in-order transaction, read grant of an out-of-order transaction, in-order transaction selection, and out-of-order transaction selection, respectively. It is assumed that MO_MI and MO_id are provided by a master IP.

Figure 8. Layered master interface for asynchronous on-chip bus (MO-OCB) with multiple outstanding out-of-order transactions.

The MIO Buffer is mainly composed of the MI Buffer, the MO Buffer, MO_MI_SEL, and a MUX. The MI Buffer stores arriving responses associated with in-order transaction completion and is identical to the asynchronous ROB in the master interface of MI-OCB. The MO Buffer consists of AHPC-L2 and Buffer2 of the SI-OCB master interface and stores incoming responses related to out-of-order transaction completion. MO_MI_SEL is composed of the asynchronous arbiter MUTEX and the MUX selection controller CTRL. Depending on the MO_MI signal, an incoming response is stored in either the MO Buffer or the MI Buffer. Since stored responses in either the MO Buffer or the MI Buffer can be transferred to the master IP at any time, the control circuit MO_MI_SEL is necessary to arbitrate between the two buffers. When the two buffers begin to transfer stored data with the MI_r_req, MI_r_ack, MO_r_req, and MO_r_ack signals, MO_MI_SEL selects one of them, and the stored data of the selected buffer are transmitted through the MUX. In other words, the MUTEX in MO_MI_SEL chooses one of the two "read" request signals (MI_r_req or MO_r_req) and activates the corresponding grant signal (MI_r_gnt or MO_r_gnt). The CTRL in MO_MI_SEL activates the controller AHPC-T2 of the transport layer and chooses one of the two inputs of the MUX to transfer the corresponding data to the master IP.

The slave interface of MO-OCB is identical to that of MI-OCB, except that the asynchronous buffer Buffer3′ and the asynchronous FIFO2 in Fig. 6b also store MO_MI and MO_id. The other controllers are used in the same manner as in MI-OCB. Like MI-OCB, MO-OCB is implemented easily through small modifications of MI-OCB.
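The sketch below captures the arbitration role of MO_MI_SEL in the MIO Buffer: responses are steered by MO_MI into either the MO Buffer or the MI Buffer, and a mutual-exclusion choice forwards one pending response at a time through the MUX. A real MUTEX element grants whichever request arrives first; the deterministic loop here is a simplification.

```python
# Hedged sketch of the MIO Buffer arbitration in the MO-OCB master interface.
from collections import deque

mi_buffer = deque()   # in-order responses, already reordered by the ROB
mo_buffer = deque()   # out-of-order responses, forwarded as soon as they are ready

def store_response(response, mo_mi):
    # mo_mi marks the transaction type carried with the response ("MO" or "MI")
    (mo_buffer if mo_mi == "MO" else mi_buffer).append(response)

def mo_mi_sel():
    """Grant one of the two pending 'read' requests and steer the MUX accordingly."""
    for buf in (mo_buffer, mi_buffer):   # a real MUTEX grants whichever request wins
        if buf:
            return buf.popleft()         # transferred to the master IP through the MUX
    return None

store_response("resp_a", "MO")
store_response("resp_b", "MI")
print(mo_mi_sel(), mo_mi_sel(), mo_mi_sel())   # resp_a resp_b None
```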

5. Simulation

5.1. Simulation Environment

At the transistor level, it is difficult to reproduce the various simulation environments used by other OCBs in order to compare the performance and power consumption of the proposed asynchronous OCBs with those of their counterparts. Hence, we built our own simulation environment, as illustrated in Fig. 9. The three asynchronous OCBs were implemented at the transistor level in a 0.18 um CMOS process. The implemented OCBs consist of a 32-bit address bus, a 32-bit write data bus, a 32-bit read data bus, a 16-bit control bus, and a 2-bit error bus.

In Fig. 9, MI denotes a master interface and SI denotes a slave interface. A master is composed of a synchronous IP, an asynchronous wrapper, and a master interface (MI). A slave consists of a synchronous IP, an asynchronous wrapper, and a slave interface (SI). The simulation environment consists of two parts: (1) an implemented part and (2) a virtual part. The implemented part corresponds to the implemented buses and consists of four master interfaces, eight slave interfaces, two arbiters for the command and response channels, and so on. The virtual part is constructed using the ADFMI of NanoSim from Synopsys and consists of twelve synchronous IPs attached to the master and slave interfaces. It is assumed that each synchronous IP is implemented with a synchronous design technique and has an asynchronous wrapper module to communicate with the proposed asynchronous OCBs.

Figure 9. Simulation environment.

Before measuring performance and power consumption, workloads should be determined, since workloads affect the simulation results. However, it is difficult to find appropriate workloads of real applications, because real workloads can be obtained only when all applications with real input data are modeled exactly. Instead, workloads for the simulations are obtained by synthetic workload generation with the following parameters (a sketch of such a generator is given after the tables below):

- The clock frequency of a synchronous IP.
- The ratio of non-bus transfer time to total transfer time per synchronous IP, where total transfer time consists of non-bus transfer time and bus transfer time.
- The distribution of bus transactions, which indicates what portion of the total bus transactions each master is responsible for.
- The distribution of accessed slaves by each synchronous IP.
- The ratio of out-of-order transactions to total transactions in MO-OCB.

These parameters determine the delay of the components in the virtual part. In the virtual part, the delay of all asynchronous wrapper modules is assumed to be zero, in order to focus on the proposed asynchronous OCBs. Through synthetic workload generation, various situations in which the proposed bus architecture can be utilized are investigated, and two simulation environments are defined.

In the first simulation environment, all synchronous IPs have the same clock frequency. Its parameters are as follows:

- The clock frequency of all synchronous IPs is set to an infinite clock frequency (INF MHz), 400, 266, or 133 MHz.
- The ratio of non-bus transfer time to total transfer time per synchronous IP is 0. In other words, all transfers of each synchronous IP are bus transfers.
- For the distribution of bus transactions, four cases are defined, as shown in Table 1. Master1 is enabled for m_case1, master1 and master2 are enabled for m_case2, and so on.
- Table 2 shows the distribution of accessed slaves by each synchronous IP, which consists of four cases. In s_case1, each master communicates with a single dedicated slave. Each master communicates with two, four, and eight dedicated slaves with the same probability for s_case2, s_case3, and s_case4, respectively.
- The ratio of out-of-order transactions to total transactions is 50%.
- The total number of bus transactions is 4,800.

Table 1. Simulation parameters of the first simulation environment: distribution of bus transactions.

            master1   master2   master3   master4
  m_case1   1         0         0         0
  m_case2   1/2       1/2       0         0
  m_case3   1/3       1/3       1/3       0
  m_case4   1/4       1/4       1/4       1/4

Table 2. Simulation parameters of the first simulation environment: distribution of accessed slaves by each synchronous IP.

            master1     master2     master3     master4
  s_case1   slave1      slave3      slave5      slave7
  s_case2   slave1,2    slave3,4    slave5,6    slave7,8
  s_case3   slave1–4    slave3–6    slave5–8    slave7,8,1,2
  s_case4   slave1–8    slave1–8    slave1–8    slave1–8
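Under the parameters above, a synthetic workload might be generated as in the sketch below. Only a subset of the m_case/s_case settings from Tables 1 and 2 is spelled out, and the transaction record fields are our own; the actual generator used for the NanoSim simulations is not described in the paper.

```python
# Hedged sketch of synthetic workload generation (illustrative, not the paper's tool).
import random

M_CASE = {"m_case1": [1, 0, 0, 0],
          "m_case2": [0.5, 0.5, 0, 0],
          "m_case3": [1/3, 1/3, 1/3, 0],
          "m_case4": [0.25, 0.25, 0.25, 0.25]}

S_CASE = {"s_case1": {1: [1], 2: [3], 3: [5], 4: [7]},
          "s_case4": {m: list(range(1, 9)) for m in (1, 2, 3, 4)}}

def generate(n_transactions=4800, m_case="m_case4", s_case="s_case4",
             ooo_ratio=0.5, seed=0):
    rng = random.Random(seed)
    workload = []
    for _ in range(n_transactions):
        master = rng.choices((1, 2, 3, 4), weights=M_CASE[m_case])[0]
        workload.append({"master": master,
                         "slave": rng.choice(S_CASE[s_case][master]),
                         "out_of_order": rng.random() < ooo_ratio})
    return workload

wl = generate(n_transactions=10, m_case="m_case2", s_case="s_case1")
print(wl[0])   # e.g. {'master': 2, 'slave': 3, 'out_of_order': True}
```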

Table 3. Simulation parameters of the second simulation environment: clock frequencies and distribution of bus transactions of master IPs.

                                          Master IP1   Master IP2   Master IP3   Master IP4
  Clock frequency (MHz)                   300          166          66           33
  Distribution of bus transactions (%)    65           20           10           5

For the second simulation environment, Tables 3 and 4 list parameters such as the clock frequencies of the synchronous IPs and the distribution of bus transactions among the master IPs. This is a more realistic model of a System-on-a-Chip consisting of a CPU, a DSP, RAM/ROM on-chip memory, and peripheral devices. The synchronous master IP1 generates the largest share of bus transactions, as would a CPU handling most of the workload at a high clock frequency. The synchronous slave IP5 can be regarded as an on-chip ROM, and the other IPs act as the other modules of a System-on-a-Chip. The remaining parameters of the second simulation environment are as follows:

- The ratio of non-bus transfer time to total transfer time per synchronous IP is 0.
- The distribution of accessed slaves by each synchronous IP is s_case4 in Table 2. In other words, each master communicates with all slaves with the same probability.
- The ratio of out-of-order transactions to total transactions is 0, 50, or 100%.
- The total number of bus transactions is 4,800.

5.2. Simulation Results on Performance

Figures 10a, b and 11a, b show the simulation results for the first simulation environment. Here, throughput is defined by

    Throughput = N_trans × N_bit / T,    (1)

where N_trans is the total number of bus transactions, T is the completion time of the data transmission, and N_bit is the data bit width.
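As a quick worked example of Eq. (1), with made-up numbers rather than measured ones: 4,800 transactions of 32-bit data completing in 50 microseconds would correspond to a throughput of about 3.1 Gbit/s.

```python
# Worked example of Eq. (1); the inputs are hypothetical, not the paper's measurements.
def throughput_gbit_per_s(n_trans, n_bit, completion_time_s):
    # Throughput = N_trans x N_bit / T, reported in Gbit/s as in Figs. 10-12
    return n_trans * n_bit / completion_time_s / 1e9

print(throughput_gbit_per_s(n_trans=4800, n_bit=32, completion_time_s=50e-6))  # ~3.07
```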

Table 4. Simulation parameters of the second simulation environment: clock frequencies of slave IPs.

                          Slave IP1   Slave IP2   Slave IP3   Slave IP4   Slave IP5   Slave IP6   Slave IP7   Slave IP8
  Clock frequency (MHz)   200         133         66          50          33          66          33          20

Figure 10. Simulation results on performance of the first simulation environment: a throughput of MO-OCB when distribution of accessed slaves is s_case1 and a clock frequency of a synchronous IP is varied from 133 MHz to infinity; b throughput of MO-OCB when the clock frequency of all synchronous IPs is 133 MHz and distribution of accessed slaves by each synchronous IP is s_case1, s_case2, s_case3, or s_case4.

Figure 10a shows the throughput of MO-OCB when the distribution of accessed slaves is s_case1 and the clock frequency of all synchronous IPs is varied from 133 MHz to infinity. The infinite clock frequency means that the response delay of a synchronous IP is 0 ns; that is, a synchronous IP with an infinite clock frequency returns a response without any delay. This extreme case is used to evaluate the pure performance, power consumption, and overhead of the proposed interfaces. For all four clock frequencies, throughput increases in proportion to the number of enabled masters, which is a notable property of an asynchronous OCB. However, if a few masters use up most of the bus bandwidth, the throughput increase saturates and enabling more masters has little effect. For example, when all synchronous IPs have the infinite clock frequency, throughput increases only slightly for three and four enabled masters (m_case3, m_case4), because two enabled masters (m_case2) already use most of the bus bandwidth, as illustrated in Fig. 10a. This trend can also be seen for the other distributions of accessed slaves, s_case2, s_case3, and s_case4. The saturation point is determined by simulation environment parameters such as bus latency, the ratio of master clock to slave clock, average burst size, and so on.

Figure 10b shows the throughput of MO-OCB when the clock frequency of all synchronous IPs is 133 MHz and the distribution of accessed slaves by each synchronous IP is s_case1, s_case2, s_case3, or s_case4. Among the four distributions of accessed slaves, s_case4 has the lowest throughput, since s_case4 has the highest probability that the same slave is selected simultaneously by more than two masters. This high probability of conflict results in a loss of performance. The cases s_case1 and s_case2 have the highest throughput, because all masters communicate with exclusively dedicated slaves (see Table 2). The trends of the simulation results in Fig. 10a and b also appear in the simulation results of the other asynchronous OCBs, SI-OCB and MI-OCB.

The improvement of throughput due to multiple outstanding transactions is shown in Fig. 11a, which plots the throughput of the three asynchronous OCBs when the clock frequency of all synchronous IPs is 400 MHz and the distribution of bus transactions is m_case1. Across the four distributions of accessed slaves, SI-OCB has similar throughput, while the throughput of MI-OCB and MO-OCB increases. Here, the change from s_case1 to s_case4 means that the number of accessed slaves increases, as shown in Table 2. In particular, the throughput increases rapidly at s_case2. At s_case3 and s_case4, the throughput increase of MI-OCB and MO-OCB is small, because most of the bus bandwidth has already been used up at s_case2. This trend also appears at the other clock frequencies, 133 MHz, 266 MHz, and infinity, as shown in Fig. 11b.

Figure 11. Simulation results on performance of the first simulation environment: a throughput of three asynchronous OCBs when the clock frequency of all synchronous IPs is 400 MHz and the distribution of bus transactions is m_case1; b normalized throughput of MO-OCB when the distribution of bus transactions is m_case1 and a clock frequency of a synchronous IP is varied from 133 MHz to infinity.

Figure 11b shows the normalized throughput of MO-OCB when the distribution of bus transactions is m_case1 and the clock frequency of all synchronous IPs is varied from 133 MHz to infinity. The normalized throughput is referenced to the throughput of s_case1 when the frequency of all synchronous IPs is 133 MHz. Compared with s_case1 at each clock frequency, the throughput at s_case4 of each frequency is the highest, due to multiple outstanding transactions. The improvement of throughput is 2.9, 5.1, 7.0, or 18.8% when the frequency of all synchronous IPs is 133, 266, 400 MHz, or infinity, respectively.

As a more realistic model, Fig. 12a and b show the simulation results of the second simulation environment. Compared with SI-OCB, the throughput of MI-OCB, MO-OCB(0%), MO-OCB(50%), and MO-OCB(100%) is increased by 30.5, 30.3, 30.9, and 30%, respectively, because of multiple outstanding transactions (see Fig. 12a). The throughput increment of MI-OCB and MO-OCB with respect to SI-OCB can be accounted for by the presence of FIFO1 in the master interface and FIFO2 in the slave interface. The multiple tokens of FIFO1 and FIFO2 enable MI-OCB and MO-OCB to use the full bus bandwidth. Since the throughput in Eq. (1) is inversely proportional to the completion time T, the increment of throughput represents a reduction of T.

Figure 12. Simulation results on performance of the second simulation environment: a throughput of three asynchronous OCBs; b throughput of three asynchronous OCBs as a function of the number of multiple outstanding transactions.

Figure 12b shows the effect of the number of multiple outstanding transactions on the throughput of MI-OCB and MO-OCB in the second simulation environment. When the number of multiple outstanding transactions is larger than four, the throughput gain diminishes, so four is an appropriate number of multiple outstanding transactions for our simulation environment. The number of multiple outstanding transactions at the saturation point is determined by simulation environment parameters such as bus latency, the ratio of master clock to slave clock, and average burst size. The throughput of MO-OCB also changes with the ratio of out-of-order transactions to total transactions (see Fig. 12b). As the proportion of out-of-order transactions increases, the throughput of MO-OCB improves, because a master interface of MO-OCB does not need to wait to reorder arriving responses for out-of-order transaction completion. When the number of multiple outstanding transactions is two, MO-OCB with all in-order transactions (MO-OCB(0%)) has performance similar to MI-OCB, and MO-OCB with all out-of-order transactions (MO-OCB(100%)) shows the highest throughput. At eight, MO-OCB and MI-OCB have similar throughput, due to saturation of the bus bandwidth.

5.3. Simulation Results on Power Consumption

With respect to power consumption, the energy consumption per bus transaction is measured for the two simulation environments. Figure 13a shows the energy consumption per bus transaction of the three asynchronous OCBs in the first simulation environment. Sixty-four cases are generated from the four distributions of bus transactions, the four distributions of accessed slaves, and the four clock frequencies of a synchronous IP. Across all 64 cases, each asynchronous OCB consumes very similar energy, because the number of bus transactions is identical for the three asynchronous OCBs. Compared with SI-OCB, the energy consumption of MI-OCB and MO-OCB is increased by 11.7 and 20%, respectively. The 11.7% difference in energy consumption between SI-OCB and MI-OCB is caused by the hardware overhead of MI-OCB, e.g., AHPC-T1′, FIFO1, and the asynchronous ROB in the master interface, and Buffer3′ and FIFO2 in the slave interface. The 8.3% (= 20% − 11.7%) difference in energy consumption between MI-OCB and MO-OCB indicates the additional power consumption of the MIO Buffer in the master interface of MO-OCB.

Figure 13. Simulation results on power consumption: a energy consumption per data transaction of three asynchronous OCBs at the first simulation environment; b energy consumption per data transaction of three asynchronous OCBs at the second simulation environment.

Figure 13b shows the energy consumption of the three asynchronous OCBs in the second simulation environment. As in the results of the first simulation environment, compared with SI-OCB, MI-OCB, MO-OCB(0%), MO-OCB(50%), and MO-OCB(100%) with four outstanding transactions consume 11.3, 17.6, 16.1, and 14.4% more energy, respectively, because of the hardware complexity related to multiple outstanding transactions and in-order/out-of-order transaction completion. The energy consumption of MI-OCB is lower than that of MO-OCB, due to the energy consumption of the MIO Buffer. The energy consumption of MO-OCB decreases as the proportion of out-of-order transactions increases, because the activity of the asynchronous ROB in the MIO Buffer decreases. When the number of multiple outstanding transactions is two, MO-OCB with all in-order transactions (MO-OCB(0%)) has the highest energy consumption, and MO-OCB with all out-of-order transactions (MO-OCB(100%)) shows the lowest energy consumption. At sixteen, MO-OCB with all out-of-order transactions and MI-OCB have similar energy consumption, since the energy consumption of the asynchronous ROB dominates MO-OCB. Figure 13b also shows the effect of the number of multiple outstanding transactions on the energy consumption per bus transaction: as the number of multiple outstanding transactions increases, the power consumption increases, since the hardware complexity of the asynchronous ROB increases.

5.4. Area Analysis

Figure 14 shows the results of the area analysis for the three implemented asynchronous OCBs. Figure 14a shows the total number of transistors in the three implemented asynchronous OCBs, calculated on the basis of the transistors of a minimum primitive inverter in the 0.18 um CMOS technology. The master interfaces of MI-OCB and MO-OCB require more than 10,000 transistors, while that of SI-OCB takes at most about half of that. This large difference is due to the area of the asynchronous ROB and related controllers, which corresponds to 52.1 and 56.8% of the total area of the master interface of MI-OCB and MO-OCB, respectively (see Fig. 14b). The proportion of the total area contributed by each layer is illustrated in Fig. 14b. The physical layer of all interfaces takes a large proportion, in the range of 40–76.1%, largely because of the encoder and decoder for data encoding and the driving buffers (see Figs. 5, 6, and 8).

Figure 14. Results of area analysis: a number of transistors of three implemented asynchronous OCBs; b percentage of each layer contributed to the total area.

6. Conclusions

In this paper, an asynchronous layered interface with distributed and modularized control units for high performance asynchronous On-Chip Buses (OCBs) is proposed for Globally Asynchronous Locally Synchronous System-on-a-Chip design. The proposed layered interface is based on a modified layered architecture. To achieve high performance, the proposed asynchronous layered interface accommodates multiple outstanding in-order/out-of-order transactions. Because of the design technique with distributed and modularized control units, an asynchronous layered interface is easily extended from a basic asynchronous OCB to an advanced asynchronous OCB without repeating the implementation of the whole bus interface. Two simulation environments with synthetic workload generation are constructed to measure the performance and power consumption of the implemented asynchronous OCBs with the proposed asynchronous layered interfaces.


Simulation results on throughput in the first simulation environment reveal several properties of the implemented asynchronous OCBs. First, the throughput increases as the number of enabled masters increases, until the throughput saturates. Second, the throughput decreases as the number of accessed slaves increases, because the probability that the same slave is selected simultaneously by more than two masters increases. Last, MI-OCB and MO-OCB have higher throughput than SI-OCB, due to multiple outstanding transactions. Simulation results on power consumption in the first simulation environment show that each asynchronous OCB consumes very similar energy regardless of the 64 cases, because the number of bus transactions is identical for the three asynchronous OCBs. The second simulation environment provides a more realistic model. Compared with SI-OCB, the throughput of MI-OCB and MO-OCB (100% out-of-order transactions) is increased by 30.5 and 30%, respectively, because of multiple outstanding transactions. Also, the power consumption of MI-OCB and MO-OCB (0% out-of-order transactions) is increased by 11.3 and 17.6%, respectively, compared with SI-OCB, due to the hardware complexity of multiple outstanding transactions and the asynchronous re-order buffer. Finally, the area analysis shows that the area of the asynchronous re-order buffer and related controllers is 52.1 and 56.8% of the total area of the master interface of MI-OCB and MO-OCB, respectively, and that the physical layer occupies about half of the total area.

Acknowledgements

This work has been supported in part by the Center for Distributed Sensor Network at GIST, in part by the GIST Technology Initiative (GTI), and in part by the MIC, Korea, under the ITRC support program supervised by the IITA (IITA-2005-C1090-05020029).

References

1. International Technology Roadmap for Semiconductors, Semiconductor Industry Association, 2005.
2. Po-chih Tseng, Yung-chi Chang, Yu-wen Huang, Hung-chi Fang, Chao-tsung Huang, and Liang-gee Chen, "Advances in Hardware Architectures for Image and Video Coding—A Survey," Proc. IEEE, vol. 93, 2004, pp. 184–197.

3. T. R. Jacobs, V. A. Chouliaras, and D. J. Mulvaney, "Thread-Parallel MPEG-2, MPEG-4 and H.264 Video Encoders for SoC Multi-processor Architectures," IEEE Trans. Consum. Electron., vol. 52, 2006, pp. 269–275.
4. A. Dasu and S. Panchanathan, "Reconfigurable Media Processing," International Conference on Information Technology: Coding and Computing, 2001, pp. 300–304.
5. E. Salminen, V. Lahtinen, K. Kuusilinna, and T. Hamalainen, "Overview of Bus-Based System-on-Chip Interconnections," IEEE Int. Symp. Circuits Syst., vol. 2, 2002, pp. 372–375.
6. D. Flynn, "AMBA: Enabling Reusable On-Chip Designs," IEEE Micro, vol. 17, 1997, pp. 20–27.
7. A. Rincon, G. Cherichetti, J. Monzel, D. Stauffer, and M. Trick, "Core Design and System-on-a-Chip Integration," IEEE Des. Test Comput., vol. 14, 1997, pp. 26–35.
8. AMBA AXI Protocol Specification, ARM, 2003.
9. M. Pedram, "Power Minimization in IC Design: Principles and Applications," ACM Transactions on Design Automation, vol. 1, 1996, pp. 3–56.
10. R. Sridhar, "Clocking and Synchronization in Sub-90 nm System-on-chip (SoC) Designs," IEEE Design, Automation and Test in Europe Conference and Exhibition, 2004, pp. 49–84.
11. D. Chapiro, "Globally-Asynchronous Locally-Synchronous Systems," Ph.D. dissertation, Stanford Univ., USA, Oct. 1984.
12. W. J. Bainbridge, "Asynchronous System-on-chip Interconnect," Ph.D. dissertation, Univ. of Manchester, UK, Mar. 2000.
13. J. Kessels, A. Peeters, T. Kramer, M. Feuser, and K. Ully, "Designing an Asynchronous Bus Interface," IEEE International Symposium on Asynchronous Circuits and Systems, Mar. 2001, pp. 108–117.
14. A. Radulescu, J. Dielissen, S. G. Pestana, O. P. Gangwal, E. Rijpkema, P. Wielage, and K. Goossens, "An Efficient On-Chip NI Offering Guaranteed Services, Shared-Memory Abstraction, and Flexible Network Configuration," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 24, 2005, pp. 4–17.
15. J. Sparso and S. Furber, Principles of Asynchronous Circuit Design—A Systems Perspective, Kluwer, 2002.
16. L. Zhou and X. Yan, "Integration Flow of Video Processing Unit into SoC," Freescale Semiconductor, 2006.
17. D. Pajak, "System Solutions for a Baseband SoC," white paper, ARM, 2006.
18. C. Evrard, "ARM MPCore—The First Integrated Symmetric Multiprocessor Core," Sophia Antipolis MicroElectronics Forum, 2004.
19. D. Sima, T. Fountain, and P. Kacsuk, Advanced Computer Architecture, Addison-Wesley, 1997.
20. E. Kim, J.-G. Lee, and D.-I. Lee, "Automatic Process-oriented Control Circuit Generation for Asynchronous High-level Synthesis," IEEE International Symposium on Asynchronous Circuits and Systems, 2000, pp. 104–105.
21. M. Theobald and S. Nowick, "Transformations for the Synthesis and Optimization of Asynchronous Distributed Control," IEEE Design Automation Conference, 2001, pp. 263–268.
22. H. Zimmermann, "OSI Reference Model—The ISO Model of Architecture for Open Systems Interconnection," IEEE Trans. Commun., vol. 28, 1980, pp. 425–432.
23. M. Josephs and J. Yantchev, "CMOS Design of the Tree Arbiter Element," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 4, 1996, pp. 472–476.
24. J. Cortadella, M. Kishinevsky, A. Kondratyev, L. Lavagno, and A. Yakovlev, "Petrify: A Tool for Manipulating Concurrent Specifications and Synthesis of Asynchronous Controllers," IEICE Trans. Inf. Syst., vol. E-80D, 1997, pp. 315–325.
25. D. Gilbert, "Dependency and Exception Handling in an Asynchronous Microprocessor," Ph.D. dissertation, Univ. of Manchester, UK, Mar. 1997.

Eun-Gu Jung received the B.S. degree in Electronic Engineering from Kyungpook National University, Daegu, Korea, in 2000, and the M.S. and Ph.D. degrees in Information and Communications from the Gwangju Institute of Science and Technology (GIST), Gwangju, Korea, in 2002 and 2006, respectively. He is currently a research assistant in the Department of Information and Communications at GIST. His research interests include asynchronous Network-on-Chip (NoC)/On-Chip Bus design for Globally Asynchronous Locally Synchronous (GALS) systems, delay insensitive handshake protocols, and high speed asynchronous circuit design.

Jeong-Gun Lee received the B.S. degree in Computer Science from Hallym University, Korea, in 1996, and the M.S. and Ph.D. degrees in Information and Communications from the Gwangju Institute of Science and Technology (GIST), Gwangju, Korea, in 1998 and 2005, respectively. He is currently a Post-Doctoral Researcher in the Computer Laboratory, University of Cambridge, UK. His research interests include asynchronous Network-on-Chip (NoC), asynchronous FIFO design, asynchronous logic synthesis, performance evaluation of asynchronous systems, and Globally Asynchronous Locally Synchronous (GALS) design methodology.

Kyoung-Son Jhang received the B.S., M.S., and Ph.D. degrees in Computer Engineering from Seoul National University, Seoul, Korea, in 1986, 1988, and 1995, respectively. From March 1996 to August 2001, he was a faculty member of the Computer Engineering Department of Hannam University in Daejeon, Korea. Since September 2001, he has been working in the Department of Computer Engineering of Chungnam National University as a Professor. His research interests include computer architecture, design automation, and system-on-a-chip design. He is a member of the Korean Information Science Society, the Institute of Electronics Engineers of Korea, and the IEEE Computer Society.

Jeong-A Lee received the B.S. degree in Computer Engineering with honors from Seoul National University in 1982, the M.S. degree in Computer Science from Indiana University, Bloomington, in 1985, and the Ph.D. degree in Computer Science from the University of California, Los Angeles, in 1990. From 1990 to 1995, she was an assistant professor in the Department of Electrical and Computer Engineering, University of Houston. She is presently a Professor in the Department of Computer Engineering, Chosun University, which she joined in 1995. Her research interests include computer architecture, fast digital and CORDIC arithmetic, application specific architecture design, and configurable computing. She is the author of more than 100 technical papers, was a guest editor of a special issue on CORDIC of the Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology in 2000, has served as a program committee member for several international conferences, and is a senior member of the IEEE.

Dong-Soo Har received the B.S. and M.S. degrees in Electronic Engineering from Seoul National University, Seoul, Korea, in 1986 and 1988, respectively, and the Ph.D. degree in Electrical Engineering from Polytechnic University, Brooklyn, New York, in 1997. From 1997 to 2000, he was with Vodafone AirTouch, Walnut Creek (US), as a Senior RF Engineer. From 2001 to 2002, he was with Romeo System, Fremont (US), as a Senior System Engineer. From 2002 to 2003, he was an Assistant Professor at Seoul National University, Seoul, Korea. Since 2003 he has been working in the Department of Information and Communications of the Gwangju Institute of Science and Technology (GIST) as an Assistant Professor. His research interests include wireless/wired data network planning, capacity enhancement for multimedia service systems, performance analysis of noisy/interference limited systems, smart antennas, and System-on-a-Chip (SoC) design for multimedia systems. He was a recipient of the 2000 IEEE Best Paper Award (Jack Neubauer Memorial Award) of the Transactions on Vehicular Technology.
