Towards a High-Level Synthesis of Reconfigurable Bit-Serial Architectures

Achim Rettberg, Florian Dittmann, Mauro Zanella, Thomas Lehmann
University of Paderborn, Fuerstenallee 11, 33102 Paderborn, Germany
Email: [email protected], [email protected], [email protected], [email protected]
Abstract

This paper presents high-level synthesis methods for a fully reconfigurable, self-timed, synchronous bit-serial pipeline architecture. The idea is to distribute the central control unit: local controls of the operators are realized through a one-hot implementation of the central control engine. Specialized routing components allow the reconfiguration of the implemented circuit with respect to rapid system prototyping. We describe several high-level synthesis approaches, especially the scheduling, that can be used for this type of architecture. In particular, we optimize specific structures, such as loops, junctions and splitters, during the synthesis phase.
1 Introduction
Power consumption, area minimization, signal delay, and reconfigurability with respect to rapid system prototyping place increasing demands on chip design. While the design space can be reduced by using bit-serial operators, the long control lines of a synchronous bit-serial architecture usually degrade the performance of the circuit [1]. The MACT architecture used here ([7], [8]) replaces long, global control lines with short, local control signals (handshake wires), resulting in lower area consumption but requiring a re-evaluation of the synthesis methods. This paper presents high-level synthesis methods for this new synchronous, fully reconfigurable, self-timed, bit-serial, and fully interlocked pipeline architecture. The synthesis methods include the avoidance of deadlocks and the resolution of arithmetic loops. This contribution is organized as follows: Section 2 briefly describes the pipeline architecture. The high-level synthesis methods are discussed in Section 3. The final section comprises a summary and an outlook.
2 Pipeline Architecture Realization
As mentioned before, our high-level synthesis approach addresses bit-serial synchronous architectures. One implementation of such an architecture is the MACT architecture [7], [8]. MACT works synchronously, but incorporates features typical of asynchronous architectures. Fields of application of such an architecture are, for example, signal processing in terms of digital filters or digital controllers. These algorithms may be realized in hardware by means of the proposed architecture. It requires only a small chip area and an equally small number of input and output pins, thus reducing the size and the complexity of the printed circuit board. The speed of the bit-serial processing is high enough for the application domain in question due to the different internal and external cycle times. For example, in an electrical motor-current control there are delays caused by the input/output converters (e.g., A/Ds and D/As) and by the inertia of the motor. The idea of the MACT architecture is to distribute the central control unit to locally controlled elements. These are synchronized and interlinked by handshake wires, i.e., by using only local control wires. The handshake mechanism used in the architecture is similar to the one used in bit-serial asynchronous architectures (cf. [3], [4] and [5]). The data stream is interpreted as a data package, which contains a control marker and the processed data. Additionally, a separator (gap) may be included between the control marker and the data. A valid data package is shown in Figure 1.
Figure 1. Data package of the pipeline architecture with a control marker, a separator (gap), and data
The data package is shifted through the data path. The bit order of the data requires the LSB (least significant bit) first. The control marker is recognized by scanners in the data path. The scanners identify the marker and activate the component control. Therefore, the time factor of the elements' schedule is mapped to a position within the implemented data path. As we know exactly the length of the control marker, the separator, and the data, we can detect where the data are processed when a marker signal reaches a certain position in the data path. The distance between the control information (the marker) and the operations is defined such that the operations are affected only by the control marker of the data actually being processed. Consequently, the distance between control wires and the controlled elements is locally limited.

Data packages moving relatively independently along the data path have to be synchronized at non-unary operations in order to compensate for the different path lengths. For this purpose, the fastest path in the dataflow graph is blocked by a so-called stall wire until all other necessary data packages have arrived at a specific synchronization point. This functionality is realized by a component named synchronizer. Due to the defined length of the data packages, the length of the stall wire is locally limited. In bit-serial architectures it is mandatory to shift data aligned through the operators. The interconnection structure of operator and synchronizer is depicted in Figure 2. The stall wire in Figure 2 realizes the blocking of the corresponding inputs. When all inputs have valid data, all inputs are aligned and the stall signal is released. The valid data at the inputs are recognized by the control marker that is stored before the separator and the data in each data package. Here, the control marker can be interpreted as a synchronization marker.

Figure 2. Realization of the synchronizer in front of an operator block

The locally limited impact of control information and the limited blocking of operations in the architecture allow the implementation of pipelining, i.e., several data packages can be shifted one after the other in a sector of the dataflow graph. The only constraint is that all data packages keep a minimum distance from each other, thus preventing overwriting. The controller of the synchronizer assumes one of two states: one for waiting for all data packages (wait) and one for pushing them (push). Therefore, a synchronizer (see Figure 2) has a control unit and a scanner for each input. A scanner recognizes when a control marker arrives at its input. When all scanners in the synchronizer detect a control marker, the control unit switches to the state push. Otherwise, the inputs with valid data are blocked and the control unit switches to the state wait. In this case, the synchronizer sends zeroes to the subsequent operator, which ensures that the circuit behind the synchronizer is not blocked.

In order to realize fast reconfiguration within our architecture we developed a component called router. The router component offers the freedom to select different paths within the implemented design. With the help of the router it is possible to map several algorithms, each represented by a dataflow graph, into a single dataflow graph (folding).

The selection of the paths within the design is controlled by an extension of the control marker in the data packet. That means the control marker contains the routing information, see Figure 3. The router component is configured via the names of its output ports. For example, router 1 in Figure 4 has the output ports u and v. Hence, the control marker contains the information whether the data packet is routed to output port u or to output port v.

Figure 3. Extended control marker of the pipeline architecture

Figure 4. Routing example in a dataflow graph

The routing information can be stored in the control marker in different ways. In the first approach, the output ports of all routers along the path to be realized in the dataflow graph are stored in the control marker. For example, if we want to route the dataflow graph of Figure 4 over port v of router 1 and port y of router 3, the control marker contains the control bit-vector (v, y) in the sequence (router 1, router 3); the control marker is not modified along the path, so each router has to select its own part from the control marker. In a second approach, each router consumes its own part of the control bit-vector and deletes its entry from the marker. In our example this means that router 1 deletes v from the control bit-vector. At all operators it has to be ensured that the control marker keeps the same length.
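To make the second routing scheme more concrete, the following Python sketch models a data packet whose control marker carries one routing entry per router on the path; each router consumes its own entry and pads the marker so that its length stays constant. The class names, the pad symbol, and the port assignment of router 3 are illustrative assumptions, not part of the MACT implementation.

# Sketch of the second routing scheme: each router consumes its own entry of the
# control bit-vector and pads the marker so that the marker length is unchanged.
# Class names, the pad symbol, and the ports of router 3 are assumptions.

class Packet:
    def __init__(self, routing, data):
        self.routing = list(routing)   # one output-port name per router on the path
        self.data = data

class Router:
    def __init__(self, name, outputs):
        self.name = name
        self.outputs = outputs         # maps port name -> next node

    def forward(self, packet):
        port = packet.routing.pop(0)   # consume this router's own entry ...
        packet.routing.append("0")     # ... and pad, keeping the marker length constant
        return self.outputs[port], packet

# Path of the example in Figure 4: port v of router 1, then port y of router 3.
router3 = Router("router 3", {"x": "operator x", "y": "operator y"})
router1 = Router("router 1", {"u": "operator u", "v": router3})

node, pkt = router1.forward(Packet(["v", "y"], data=0b1011))
node, pkt = node.forward(pkt)
print(node)                            # -> "operator y"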
3 High-Level Synthesis of the Pipeline Architecture
High-level synthesis relies on elementary guidelines which enable reliable and effective transformations from abstract to more specific representations of the system. As the MACT architecture combines multiple architectural concepts (e.g., synchronous and asynchronous operation, pipelining, etc.), it is necessary to re-evaluate known and proven concepts of high-level synthesis as can be found in [9] and as was done in [10]. We decided to analyse typical parts of dataflow graphs, mainly focusing on optimization, predictability, and reliability. For this purpose we introduced simple, yet practically still acceptable constraints. We assume that all operator cells need the same amount of time, i.e., the same number of clock cycles, to perform their operation. This is reasonable, as the architecture operates bit-serially and it is possible to design basic bit-serial arithmetic operators that each consume the same amount of time. Consequently, we can treat all arithmetic operations as being of the same duration. In addition, we assume no resource constraints, which in turn is reasonable as the targeted control or streaming algorithms are of limited size. We started with the consideration of scheduling policies in order to gain insight into the behaviour of the architecture.
3.1 Junctions
Figure 5 shows a dataflow graph (DFG) consisting of two independent branches which conjoin in a junction (D). As scheduling policy, the left graph uses ASAP (As Soon As Possible), the right one ALAP (As Late As Possible). Standard architectures with a central controller and registers would try to use the mobility (which equals ALAP - ASAP + 1 and indicates nodes with a degree of scheduling freedom if the mobility is > 1) in order to optimize the dataflow.

Figure 5. Example dataflow graphs with junction
However, when using our architecture there is no need to consider ALAP, ASAP, or the mobility of operations before a junction, as the MACT architecture makes scheduling tasks in this case unnecessary due to its self-timing approach. This can be seen by considering the scenario of Figure 5 while assuming the MACT architecture. As operator D is only activated if valid data (i.e., control markers) is present on both inputs, operator B is stalled as long as the data in the left branch is still in operation. To summarize, this means that the data in Figure 5 will, when using the MACT architecture, always be processed in the manner of the right-hand schedule.
3.2 Splitter and Junction
The second general scheduling problem is illustrated in Figure 6. There, we find a splitter followed by a junction one or more time steps later. The left branch represents the critical path and allows no further optimization. It turns out that it is mandatory to examine the right branch in detail, mainly focusing on synchronization.
Figure 6. Dataflow graphs with split and junction
Therefore, we consider the scenario of Figure 6 in detail. Valid data, including the control sequence and the separator, is put on the input line and processing starts. Operation A is performed and the data is passed on to B and C, using a trivial splitter. B and C both start their operation; nevertheless, C will stop once the data starts to reach the scanner of E, as the processing of the data in the left branch is still in progress (remember, we assume equal time periods for all operations). Depending on the actual length of the data and the duration of every operation, even operation A may stop since, due to the pipeline behaviour, data may still be within operator A. As A gets blocked, B and D cannot get any further data and an inconsistency occurs. E will never be able to operate correctly. This results in data confusion.

As the reason for this confusion is the data splitting after operator A, we introduce extended splitters as a first approach. These splitters buffer a whole data stream and are located at both outputs after A. Now no data confusion will occur, as even the longest stall signal will not affect operator A. Yet, in most cases this solution is far from optimal, as it delays the data unnecessarily, consumes too much area, etc.

Considering the problem from a different angle, preventing data confusion means synchronizing data before entering E. Thus, we have to analyze and compare the dataflow in the two branches. Obviously, it would be sufficient to delay the data in the right branch. This optimization problem can be solved by taking two different options into account. The first one aims at area optimization; the second one optimizes the data throughput.

In order to achieve a minimum of area consumption, we add the smallest necessary delay element into the shorter path, taking care of proper operation in operator A. As the operators in both branches themselves store parts of the data word, we have to add a delay element of the length of the difference of both branches. This insertion may be obsolete if the length of a whole data word fits into the shorter branch. Thus, we arrive at the following algorithm (T = operation_time):

if data_length > #operators_in_shorter_path * T then
    insert delay element of length =
        (data_length - #operators_in_shorter_path * T) - (data_length - #operators_in_longer_path * T)
fi

The second approach, maximizing the data throughput, is solved by referring to the mobility, which gives a hint of potential data confusion. By calculating the mobility of every node using the formula

mobility = ALAP - ASAP + 1,

we can see that C in the right branch of Figure 6 will have a mobility of 2, whereas all other nodes will have a mobility of 1. We then insert a delay operation, which reduces the mobility of node C in the right branch to 1. This delay, now labelled as a "register", must postpone the data by as many clock cycles as the operation in the other branch takes longer. In order to evaluate the correct time duration of every delay, we refer to the mobility; e.g., a mobility of 3 means a delay that is as long as twice the calculation time of one operation. The delay of the "registers" equals the mobility of one node of the shorter branch (path) minus one, multiplied by the duration of one operator:

delay = (mobility_of_node - 1) * operation_time

Thus, we can speak of parallel processing in both branches; a maximum data throughput is achieved, as the upper part of the DFG, especially A, can accept data immediately and does not have to wait for a free signal of the first elements in the shorter path. As only whole-numbered multiples occur, we can either think of an abstract delay element concretized with the help of the calculated delay factor, or we can imagine a limited set of "registers" (e.g., simple delay, double delay, fourfold delay, etc.) which are combined until the needed length is reached. Considering the positioning of the "registers", we notice two rules. First, it is of no importance whether we insert the element before or after C, i.e., we can expand A by adding a "register" at its right output, C by adding a "register" at its input or output, E by adding a "register" at its right input, or we can introduce F as a pure delay element. Second, assuming more operations, reducing the mobility of one node to 1 will reduce the overall mobility of this branch to 1, which means no further precautions must be taken. This leads to the generalization described in the next section.
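As a small illustration of the throughput-oriented scheme above, the following Python sketch computes ASAP, ALAP, and mobility for the DFG of Figure 6 under the unit-delay assumption of this section and derives the "register" length from delay = (mobility - 1) * operation_time. The graph encoding and the function names are ours, not part of any MACT tool.

# ASAP/ALAP/mobility for a unit-delay DFG (sketch; assumes every operator takes
# one time step and there are no resource constraints).

def asap(succ):
    pred = {n: [] for n in succ}
    for n, outs in succ.items():
        for m in outs:
            pred[m].append(n)
    t = {}
    def visit(n):
        if n not in t:
            t[n] = 1 + max((visit(p) for p in pred[n]), default=0)
        return t[n]
    for n in succ:
        visit(n)
    return t

def alap(succ, length):
    t = {}
    def visit(n):
        if n not in t:
            t[n] = min((visit(m) for m in succ[n]), default=length + 1) - 1
        return t[n]
    for n in succ:
        visit(n)
    return t

# DFG of Figure 6: A feeds B and C, B -> D, D -> E, C -> E.
succ = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": ["E"], "E": []}

s = asap(succ)
a = alap(succ, length=max(s.values()))
mobility = {n: a[n] - s[n] + 1 for n in succ}
print(mobility)                      # {'A': 1, 'B': 1, 'C': 2, 'D': 1, 'E': 1}

T = 1                                # operation_time (latency of one operator)
delay = {n: (mobility[n] - 1) * T for n in succ if mobility[n] > 1}
print(delay)                         # {'C': 1}  -> insert one "register" next to C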
3.3 Multiple Branches

We now consider the example of Figure 7, where we have multiple and folded branches. The presented solution is derived with the objective of maximizing the data throughput, which also results in area minimization. A straightforward approach derived from the idea of first reducing the highest mobility will often yield sub-optimal results. Such a result is depicted in Figure 7 and was generated as follows. First, we reduced the highest mobility to 1, which means we insert a "register" (filled black node) in the branch corresponding to node D. Subsequently, we have to insert two further "registers" (after G and H in this case), which sums up to a total of three "registers".

Thus, by simply calculating the mobility of every node and starting at the highest mobility, or even by inserting delay elements randomly, we might reach a non-optimal solution that uses too many delay elements. Furthermore, the sizes of the delay elements are depicted beneath the nodes in Figure 7 and sum up to a total of 7. The optimal solution with the minimal amount of delay elements is depicted in the right dataflow graph of Figure 8. This solution was calculated as follows.

Figure 7. Dataflow graphs with multiple branches

We start by reducing the problem to the innermost split and junction pair. In this case, C, D, E, G, H and J form this core. We then calculate the mobility of every node referring only to this subpart, followed by reducing the mobility of every node to 1, which means adding a delay element into the branch corresponding to D. Then we consider the next surrounding split and join pair against the background of having reduced the mobility of the preliminary subpart to 1. In order to avoid splitting the previously optimized subpart, we aggregate the innermost section into one virtual node. Thus, the subpart as a whole gets the mobility, 3 in this case, which in turn is used to insert the "register". We insert this delay element either in front of the subpart (C, D, E, G, H and J) or after it (as displayed in Figure 8). The sizes of the delay elements of the optimal solution sum up to 3, which is less than half of the size of a sub-optimal solution like the one depicted in Figure 7. In case we want to optimize area consumption, we again start from the innermost split and join section and calculate the optimal delay of every branch using the formula of section 3.2.

Figure 8. Optimal synthesis of a dataflow graph with multiple branches

3.4 Loops

Furthermore, control flows may contain one or more loops; an example can be seen in Figure 9. Such a layout results from mathematical equations where a delay is necessary to avoid arithmetic loops. Hence, we cannot avoid the delay, but we have to consider where to insert it best and how long it has to be.

Figure 9. Loop example

Trying to reach the fastest calculation speed, the node at which the backward branch re-merges into the data flow (B in the example of Figure 9) should be in permanent operation. Thus, we first use the mobility of a subpart of the DFG. In terms of the depicted example of Figure 9, we have to consider the subpart from B to E, including both borders. The maximum mobility (3, for B and E) yields the maximum necessary size of the delay of F. Secondly, we have to take the general situation into account, which means B must have finished its n-th operation before starting the (n+1)-st one. This constraint means that the length of the data path from B back to B via the backward branch must equal the length of a data packet (data + separator + control marker). Subsequently, the optimal delay can be calculated as follows:

delay = data_packet_length - mobility * operation_time

Using this formula and carefully considering Figure 9, we notice an important synthesis constraint: loops have to be considered before synchronizing (multiple) branches, as B, C, D and E themselves form a split and join subpart. The "register" again can be inserted either at the beginning or at the end of the backward branch of the loop.
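A small numeric sketch of this loop rule follows; the packet length and the operator latency are assumptions chosen for illustration, not values from the paper.

# Loop rule of section 3.4: delay = data_packet_length - mobility * operation_time
# All concrete numbers below are assumptions for illustration only.

def loop_delay(packet_length, mobility, operation_time):
    """Length of the delay element to insert into the backward branch of a loop."""
    delay = packet_length - mobility * operation_time
    if delay < 0:
        raise ValueError("loop path is already longer than one data packet")
    return delay

# In the spirit of Figure 9: maximum mobility 3 in the subpart B..E, an assumed
# packet of 24 bits (data + separator + control marker) and an assumed operator
# latency of 4 clock cycles.
print(loop_delay(packet_length=24, mobility=3, operation_time=4))   # -> 12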
Figure 10. Two guarded partitions behind a router
3.5 Router
The self-control of the data flow and the reconfigurability obtained by inserting intelligent routers offer the possibility of power saving. We show this by referring to Figure 10. The router (A) decides, by evaluating the control marker, whether to send the data to the left (B) or the right (C) branch. As a practical example, one can imagine two different ways of processing streaming data. The outputs of branches B and C are re-joined and are assumed to be further processed by the same node (D). It is now possible to switch off the unused branch by exploiting gated clocks driven by router A. For example, using the left branch causes router A to stall the right branch, thus resulting in a decrease in power consumption.
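The following behavioural sketch illustrates the idea: the router selects a branch from the routing entry in the control marker and enables the clock of that branch only. The names and the encoding of the routing entry are assumptions; the actual gating is of course realized in hardware.

# Behavioural sketch of the power-saving router of Figure 10 (assumed names).
# The router evaluates the routing entry of the control marker, forwards the
# packet to the selected branch, and gates the clock of the unused branch.

def route_and_gate(routing_entry):
    target = "B" if routing_entry == "left" else "C"
    clock_enable = {"B": target == "B", "C": target == "C"}
    return target, clock_enable

target, enables = route_and_gate("left")
print(target, enables)   # -> B {'B': True, 'C': False}  (right branch is gated off)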
4 Example
As an example we implemented a proportional-integral (PI) controller with the MACT architecture. Sampling rates of mechatronic systems [2] are determined by the system's inertia as well as by the delays of sensors and actuators (converters), resulting in a practical sampling rate of 100 kHz for a PI controller. The clock speed of the design implemented on a Xilinx Virtex 400E (RABBIT system [6]) is approximately 50 MHz. Taking into account the entire overhead related to the control bits of a data package, 16-bit data is processed by this implementation at about 500 kHz. This may be slower than a parallel implementation, but it is much faster than required by such a system.
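A rough consistency check of these figures is sketched below; the exact package overhead is not given in the text, so the total package length used here is an assumption chosen to match the reported numbers.

# Back-of-the-envelope throughput estimate for the PI-controller example.
# f_clk and the required sampling rate are taken from the text; the total
# package length (data + control marker + separator + routing bits) is assumed.

f_clk = 50e6               # internal bit clock of the Virtex 400E design [Hz]
package_bits = 100         # assumed total length of one data package [bits]
f_required = 100e3         # sampling rate required by the mechatronic system [Hz]

f_packet = f_clk / package_bits       # packages processed per second
print(f_packet)                       # 500000.0, i.e. roughly the reported 500 kHz
print(f_packet >= f_required)         # True: fast enough for the application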
5 Conclusion and Outlook
This paper presented high-level synthesis methods for a bit-serial processing architecture that implements an algorithm specification given as a dataflow graph. The architecture has the peculiar feature of being self-timed and comprises a fully interlocked pipelining structure which aims at controlling the different computational paths of a system design. We have shown different synthesis approaches in terms of chip area, low-power design, and speed. Furthermore, the architecture allows mapping of different dataflow graphs into one graph by including routers in the graph and routing information in the data packet. The routers offer the possibility of introducing gated clocks, thus further reducing power consumption. The discussed synthesis methods include methods for resolving deadlocks and for resolving the arithmetic loop problem. The given example shows the elegance of a practical realization. As an outlook, we see potential in pushing operations to a later or earlier point in time, which introduces the idea of using different clock domains in order to reduce the required energy further.
References

[1] H. M. Jacobson, P. N. Kudva, P. Bose, P. W. Cook, S. E. Schuster, E. G. Mercer, and C. J. Myers. "Synchronous Interlocked Pipelines". 8th International Symposium on Asynchronous Circuits and Systems (ASYNC 02), 2002.
[2] M. Zanella, Th. Koch, and F. Scharfeld. "Development and Structuring of Mechatronic Systems, Exemplified by the Modular Vehicle X-mobile". IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM 01), Como, Italy, July 2001.
[3] A. Rettberg and B. Kleinjohann. "A Fast Asynchronous Reconfigurable Architecture for Multimedia Applications". 14th Symposium on Integrated Circuits and Systems Design (SBCCI 2001), Pirenópolis, GO, Brazil, September 2001.
[4] W. Hardt, B. Kleinjohann, and A. Rettberg. "The FLYSIG Prototyping Approach". 11th IEEE International Workshop on Rapid System Prototyping (RSP 2000), Paris, France, June 2000.
[5] A. Rettberg, A. Hennig, and B. Kleinjohann. "Re-Configurable Multiplier Units of the Asynchronous FLYSIG Architecture". 10th NASA Symposium on VLSI Design, Albuquerque, NM, USA, March 2002.
[6] M. Zanella, M. Robrecht, Th. Lehmann, R. Gielow, A. de Freitas Francisco, and A. Horst. "RABBIT: A Modular Rapid Prototyping Platform for Distributed Mechatronic Systems". 14th Symposium on Integrated Circuits and Systems Design (SBCCI 2001), Brasília, Brazil, 2001.
[7] A. Rettberg, M. C. Zanella, C. Bobda, and T. Lehmann. "A Fully Self-Timed Bit-Serial Pipeline Architecture for Embedded Systems". In Proceedings of the Design, Automation and Test in Europe Conference (DATE), Munich, 2003.
[8] A. Rettberg, M. C. Zanella, T. Lehmann, and C. Bobda. "A New Approach of a Self-Timed Bit-Serial Synchronous Pipeline Architecture". In Proceedings of the Rapid System Prototyping Workshop (RSP), San Diego, 2003.
[9] R. Camposano. "High-Level VLSI Synthesis". Kluwer Academic Publishers, 1991.
[10] P. Denyer and D. Renshaw. "VLSI Signal Processing: A Bit-Serial Approach". Addison-Wesley, 1985.