Evaluation of a Field-Programmable VLSI Based ... - Semantic Scholar

3 downloads 489 Views 662KB Size Report
Sep 9, 2008 - blocks and clock distribution, asynchronous bit-serial architecture is pro- ... †The authors are with the Graduate School of Information Sci- ences ...
IEICE TRANS. ELECTRON., VOL.E91–C, NO.9 SEPTEMBER 2008

1419

PAPER

Special Section on Advanced Processors Based on Novel Concepts in Computation

Evaluation of a Field-Programmable VLSI Based on an Asynchronous Bit-Serial Architecture Masanori HARIYAMA†a) , Member, Shota ISHIHARA† , Student Member, and Michitaka KAMEYAMA† , Fellow

SUMMARY This paper presents a novel asynchronous architecture of Field-programmable gate arrays (FPGAs) to reduce the power consumption. In the dynamic power consumption of the conventional FPGAs, the power consumed by the switch blocks and clock distribution is dominant since FPGAs have complex switch blocks and the large number of registers for high programmability. To reduce the power consumption of switch blocks and clock distribution, asynchronous bit-serial architecture is proposed. To ensure the correct operation independent of data-path lengths, we use the level-encoded dual-rail encoding and propose its area-efficient implementation. The proposed field-programmable VLSI is implemented in a 90 nm CMOS technology. The delay and the power consumption of the proposed FPVLSI are respectively 61% and 58% of those of 4-phase dual-rail encoding which is the most common encoding in delay insensitive encoding. key words: FPGAs, reconfigurable VLSIs, asynchronous architecture, LEDR (Level-Encoded Dual-Rail) encoding

1.

Introduction

Field-programmable gate arrays (FPGAs) are widely used to implement special-purpose processors. FPGAs are costeffective for small-lot production and flexible because functions and interconnections of logic resources can be directly programmed by end users. Despite their design cost advantage, FPGAs impose large power consumption overhead compared to custom silicon alternatives [1]. The overhead increases packaging costs and limits integrations of FPGAs into portable devices. One efficient way for low power is asynchronous architecture, where timing is managed locally (as opposed to globally with a clock system as in synchronous architecture). Asynchronous design can reduce power consumption by avoiding two of the problems of synchronous design: • all parts of a synchronous design are clocked, even if they perform no useful function; • the clock line itself is a heavy load, requiring large drivers, and a significant amount of power is wasted just in driving the clock line. There are synchronous solutions to these problems such as clock-gating. However, the solutions are complex and the problems can often be avoided with no extra effort or complexity when asynchronous design. Asynchronous architecture has a lot of advantages over Manuscript received January 18, 2008. Manuscript revised April 13, 2008. † The authors are with the Graduate School of Information Sciences, Tohoku University, Sendai-shi, 980-8579 Japan. a) E-mail: [email protected] DOI: 10.1093/ietele/e91–c.9.1419

synchronous architecture: • Avoidance of clock-skew and the complexity of the clock distribution network • Low power • Better than worst-case performance • Improved electro-magnetic compatibility(EMC) Asynchronous architectures for FPGAs have been proposed [2]–[7]. Reference [2] proposes the first asynchronous FPGA. It extends a synchronous FPGA by adding special arbitration cells and modifying the logic block. As it has the global clock network, it can implement synchronous circuit as well as asynchronous one. However, each of the special elements for asynchronous circuit is composed of a lot of LUTs even though it is simple one. This results in a large area, a large delay and a large power consumption. References [3], [4] and [5] report asynchronous FPGAs based on bundled-data encoding. In this encoding, delay elements larger than the maximum delay of the data-path in various conditions are required. Especially, for FPGAs, complex programmable delay elements are required. These result in low performance. To solve this problem, Refs. [6] and [7] propose asynchronous FPGAs based on dual-rail encoding which requires no delay insertion. They use 4-phase dual-rail encoding because of relatively small overheads. However, in the 4-phase dual-rail encoding, a spacer must be inserted between two consecutive data values. This results in low throughput and high dynamic power consumption because of the large number of signal transitions. Another well-know dual-rail encoding is level-encoded dual-rail (LEDR) [8] which requires no spacer. Its throughput is double and its dynamic power consumption is half of the 4-phase dual-rail encoding since the number of signal transitions is half of the 4-phase dual-rail encoding. The major drawback of the LEDR encoding is its larger overheads in area. Based on this observation, we use the LEDR encoding and propose its area-efficient implementation. To my best knowledge, this is the first implementation of a FPGA based on the LEDR encoding. This paper presents a novel asynchronous architecture of FPGAs to reduce the power consumption. In the dynamic power consumption of the conventional FPGAs, the power consumed by the switch blocks and clock distribution is dominant since FPGAs have complex switch blocks and the large number of registers for high programmabil-

c 2008 The Institute of Electronics, Information and Communication Engineers Copyright 

IEICE TRANS. ELECTRON., VOL.E91–C, NO.9 SEPTEMBER 2008

1420

ity. To reduce the power consumption of switch blocks and clock distribution, asynchronous bit-serial architecture is proposed. To reduce the overhead of the LEDR encoding, it is combined with the bit-serial architecture which minimizes the complexity of switch blocks. This paper is the long version of [9] with detailed evaluations. The proposed asynchronous FPVLSI is based on fine-grained cellar-array architecture [10], where a cell consists of a 2-input-and-1output LUT(Look-up table) and a carry generator to implement an area-efficient adder. For simplicity, in this version of the proposed asynchronous FPVLSI, the carry generator is removed. The proposed field-programmable VLSI is implemented in a 90 nm CMOS technology. The delay and the power consumption of the proposed FPVLSI are respectively 61% and 58% of those of 4-phase dual-rail encoding which is the most common encoding in delay insensitive encoding. 2.

Architecture

2.1 Fine-Grained Pipelining Bit-Serial Architecture As shown in Fig. 1, the FPVLSI consists of a meshconnected cellular array based on a bit-serial architecture to reduce the complexity of the switch block. As described below, we employ dual-rail encoding which is suitable for reconfigurable VLSIs. Hence, bit-serial architecture requires 3 wires: 2 for a data value and request, 1 for acknowledge. Each cell is connected to only four adjacent cells, and the number of programmable switches is reduced in comparison with the typical FPGA. The specification of the functionality of a logic block provides a great impact on the complexity of the switch block, the area of the logic block, and the delay of the logic block. The higher functionality of the logic block requires the larger number of inputs and outputs of a logic block. This increases the complexity of the switch block. On the other hand, the higher functionality of the logic block reduces the number of the logic blocks re-

Fig. 1

Overall architecture.

quired for implementing a target function, and reduces the total delay. Based on this observation, the functionalities of the fine-grained logic block are specified as follows. • 1-bit addition with carry storage. • Arbitrary logic function of two inputs. • 1-bit storage. The specification allows to perform a bit-serial addition on a single cell without carry ripple, and the high functionality can be achieved with the minimum number of inputs and outputs of the logic block. This results in further reduction of interconnection complexity. Figure 2 shows the block diagram of the logic block. As shown in Fig. 2, the logic block has two input registers and an output registers for cell-level fine-grained pipelining as well as a 2-inputand-1-output LUT. In the case of a single context, the performance of the synchronous fine-grained bit-serial architecture is more than two times higher than that of the conventional FPGA architecture under a constraint of the same chip area [10]. 2.2 Problem on Clock Distribution of the Synchronous FPGAs In FPGAs, the clock distribution power is a serious problem because it has an enormously larger number of registers than custom VLSIs. The most well-known technique to reduce the clock distribution power is clock-gating. The idea of clock-gating in ASICs is to gate the clock network in the standby state. Clock-gating can result in saving two kinds of power consumption: 1. Dynamic power consumption of the clock network 2. Dynamic power consumption of registers of a circuit and gates in the fan-out of the registers On the other hand, clock-gating in conventional FPGAs has to be implemented with a different and compromised method. Instead of customizing the fixed global clock network, a feedback multiplexer is used for the contents of the flip-flop to be retained in the stand-by state [11]. Therefore, clock gating in FPGAs only reduces “2. Dynamic power

Fig. 2

Block diagram of the logic block.

HARIYAMA et al.: EVALUATION OF A FIELD-PROGRAMMABLE VLSI

1421

consumption of registers of a circuit and gates in the fan-out of the registers.” The reasons why the global clock network isn’t customized are as follows: • The fixed global clock network is designed with such low skew that no hold-time violation occurs. This simplifies timing analysis. Placement and routing are likewise expedited. As a result, the entire design flow is completed in a fraction of the time required to complete the equivalent flow for an ASIC. This time saving is the main advantage of the FPGAs. • The customized clock network can be implemented using the programmable interconnects. However, the worst case of clock skew can’t be estimated since FPGA venders don’t guarantee the worst case of the minimum delay of components [12]. As a result, it is impossible to guarantee that no hold-time violations occur. From these reasons, in the Xilinx ISE manual, the clock gating method without the customized global clock is recommended. 2.3 Asynchronous Architecture Based on LEDR Encoding Asynchronous encoding schemes are classified into • Bundled-data encoding • Delay insensitive encoding (usually dual-rail encoding) Figure 3 shows the overall architecture for the bundled-data encoding. The bundled-data encoding splits request and value into separate wires. The value is encoded as in a synchronous circuit using N wires to denote a N-bit number, and request is encoded using a dedicated request wire denoted by REQ. The bundled-data encoding requires the explicit insertion of delay in REQ to ensure that a request is never received before the bundled value is valid. The bundled-data encoding is the most frequently-used way in ASICs since its hardware overhead is relatively small. This is because the REQ wire is shared among all the N wires. Hence, to transfer an N-bit value, only N + 2 wires are required. The major disadvantage is that it requires the constraint of the delay length. If the data path is fixed in advance, it is relatively easy to meet the constraint by optimizing layouts of wires. On the other hand, in reconfigurable VLSIs such as FPGAs, it is not easy to always meet the constraint since the data path is programmable.

The delay insensitive encoding makes value implicit in the request and no delay insertion is therefore required [13]. Hence, the delay insensitive encoding is the ideal one for reconfigurable VLSIs. The most common delay insensitive encoding is dual-rail encoding. Figure 4 shows the overall architecture for dual-rail encoding. The dual-rail encoding uses two request wires to send a single bit of data. Hence, to transfer an N-bit value, 2N + 1 wires are required. The disadvantage of the dual-rail encoding is the large hardware overhead since it requires as twice wires as the synchronous manner. The bit-serial architecture described in the previous section is one most efficient way of reducing the hardware overhead of wires because of its inherent wire overhead. There are two major methods for dual-rail encoding: • 4-phase dual-rail encoding • Level encoded 2-phase dual-rail encoding(LEDR) Figure 5 shows the code table of the 4-phase dual-rail encoding, which is the most common one in the dual-rail encoding. Figure 6 shows the example where data values 0, 0 and 1 are transferred. The main feature is that the sender sends spacer (0, 0) after a data value. The receiver knows the arrival of a data value by detecting the change of either bit: 0 to 1. The drawback of the 4-phase dual-rail encoding is low throughput because of insertion of spacers. The LEDR encoding enhances the throughput of the delay insensitive encoding. Figure 7 shows the code table of the LEDR encoding. Note that each data value has two types

Fig. 4

Fig. 5

Fig. 3

Bundled-data encoding for N-bit data.

Fig. 6

Dual-rail encoding for N-bit data.

Code table of the 4-phase dual-rail encoding.

Example of 4-phase dual-rail encoding.

IEICE TRANS. ELECTRON., VOL.E91–C, NO.9 SEPTEMBER 2008

1422

Fig. 7

Dual-rail encoding with 2-phase signaling.

Fig. 10 LUT for the LEDR encoding based on hybrid of decoders and multiplexers. Fig. 8

Example of LEDR encoding.

Fig. 11 Detailed structure and the behavior of the sub-module of the LUT for invalid inputs.

Fig. 9 The multiplexer-based LUT for the LEDR encoding (Only Vout is illustrated).

of code words with different phases. For example, data value 0 is encoded as (0, 0) in phase 0 and (0, 1) in phase 1. The code word consists of V(Value bit) and R(Redundant bit). The value V is encoded as in a synchronous circuit. The redundant bit R is defined by EXOR-ing V and Phase so that R includes the information on Phase and consecutive code words get different only by hamming distance 1. Figure 8 shows the example where data values 0, 0 and 1 are transferred. The main feature is that the sender sends data values alternately in phase 0 and phase 1. The receiver knows the arrival of a data value by detecting the change of phase, and data values are continuously transferred between the sender and the receiver without any break. The throughput is doubled in an ideal case in comparison with the 4-phase dualrail encoding. The drawback of the LEDR encoding is that it requires slightly complex hardware to support the high throughput. In the following, we describe the area-efficient implementation of the FPVLSI for the LEDR encoding. For the LEDR-based FPVLSI, the major concern is designing the compact LUT. Figure 9 shows the conventional multiplexerbased LUT for the LEDR, where only Vout is illustrated, and

another LUT is necessary to obtain Rout . Based on two 2-bit input (Va , Ra ) and (Vb , Rb ), Vout is determined. The previous output is kept by the feed back loop if the combination of the inputs is invalid. For example, the combination of inputs with different phases is invalid. To make a correct output for such invalid combination of inputs, the number of multiplexers becomes large. In the LUT for Vout , 8 memory bits and 8 fed-back bits are used as inputs of the multiplexer-tree. To solve this problem, the LUT based on a hybrid of decoders and multiplexers is proposed. Figure 10 shows the block diagram of the proposed LUT, which consists of 4 sub-modules. Each sub-module consists of a decoder, a multiplexer and a memory bit. The decoders exclude invalid input patterns with different phases. Then, only valid data are fed to the multiplexer. As a result, the numbers of memory bits and multiplexers are reduced, and the transistor count is reduced to 64% compared to multiplex type LUT. Figure 11 shows the detailed structure of a sub-module. If the combination of inputs is invalid, in other words, 2-inputs have the different phase, all pass-transistors turn off according to the output of the decoder; the outputs of the multiplexer are in a high-impedance condition; the previous outputs stored in latch is kept. On the other hand, if input patterns are valid, in other words, when two inputs have the same phase, according to the output of the decoder, these 2 pass-transistors turn on; the value of the memory bit is

HARIYAMA et al.: EVALUATION OF A FIELD-PROGRAMMABLE VLSI

1423

Fig. 12

Behavior of the sub-module of the LUT for valid inputs.

Fig. 14

Chip micro-photograph of the FPVLSI.

Table 1 Features of the FPVLSI. Technology 90 nm CMOS Supply voltage 1V Area 1.5 mm × 1.5 mm (Core) Cell size 30 µm × 46 µm Number of cells 20 × 30 Measured delay of a cell 2.7 ns Fig. 13

Block diagram of the register.

selected as outputs; the outputs are stored in the latches as shown in Fig. 12. As the LEDR is a dual-rail encoding, two double-edgetriggered flip-flops are required as a register to store a data value in the typical manner. To reduce this overhead, we reduce the number of registers. In phase 0, redundant bit Rout is same as Vout while in phase 1, Rout is same as Vout . Hence, Rout can be calculated from Vout and Phase which is stored in the C-element for controlling each register. The signal Vin is stored in a register while Rin not stored. As a result, the number of registers is reduced to half. The block diagram of a register is shown in Fig. 13. Note that an EXOR gate is not used to calculate Rout = Vout EXOR Phase. This is because the LEDR encoding doesn’t allow Vout and Phase to change at the same time when they are used as inputs of the EXOR gate. The output Vthrough is used to hold Phase until Vin is stored in the register for avoiding hold-time violation. 3.

Evaluation

The asynchronous FPVLSI is fabricated in a 90 nm CMOS process. Figure 14 and Table 1 show the micro-photograph and the features of the FPVLSI, respectively. The chip includes 600 cells on 1.5 mm × 0.75 mm area. Figure 15 shows the measured wave form of the FPVLSI. From the delay of series-connected 300 cells, the delay is evaluated to be 2.7 ns. This result agrees well with the simulation result of HSPICE.

Fig. 15

Measured waveform.

Let us compare the proposed FPVLSI with an FPGA based on the 4-phase dual-rail encoding and a synchronous FPGA, both of which have cellular array architecture like the proposed FPVLSI. The cells of the proposed FPVLSI, the 4-phase dualrail FPGA and the synchronous FPGA have the same functionality. The cell consists of a logic block and a switch block. The logic block consists mainly of a 2-input-and-1output LUT and a register. A logic block of the proposed FPVLSI is improved from the cell of the prototype chip. As shown in Fig. 2, the logic block of the prototype chip consists of two input registers, one 2-input-and-1-output LUT and one output register. For area and power efficient design, the two input registers are eliminated as shown in Fig. 16 while slightly decreasing the throughput. Moreover, the circuit parameters are also

IEICE TRANS. ELECTRON., VOL.E91–C, NO.9 SEPTEMBER 2008

1424

Fig. 16

The cell of the proposed FPVLSI. Fig. 18

Fig. 19

Fig. 17

Sub-module of the LUT for valid inputs.

Sub-module of the LUT for invalid inputs.

LUT for the 4-phase dual-rail FPGA.

improved. To design the logic block of the 4-phase dual-rail FPGA in an area-efficient manner, the LUT is composed of decoders and multiplexers like the proposed FPVLSI. As shown in Fig. 17, the difference between the proposed FPVLSI and the 4-phase dual-rail FPGA is that the LUT of the 4-phase dual-rail FPGA requires one additional submodule for detecting invalid values. The size of a submodule is smaller than that of the proposed FPVLSI because the 4-phase dual-rail encoding is simple as shown in Fig. 18 and Fig. 19. As shown in Fig. 20, a register for the 4-phase dual-rail is just the extension of a conventional SR latch. Its size is almost the same as that of the proposed FPVLSI. The logic block of the synchronous FPGA employs a conventional SRAM-based LUT, and a conventional D flipflop as a register. The cells are designed using a schematic CAD: Composer (by CADENCE) and evaluated using a circuit simulator: HSPICE (by Synopsys). The circuit parameters such as a gate width are determined based on the library for the 90 nm technology provided by VDEC. From previous experimental results in the 90 nm technology, the delay of the real chip is about 1.2 times larger than that of the layoutlevel evaluation with parasitic capacitances and parasitic resistance of wires considered, and about 2.7 times larger than

Fig. 20

Register for 4-phase dual-rail FPGA.

Table 2 Comparison result between the FPVLSI and 2-rail and a 4-phase asynchronous FPGA.

that of the schematic-level evaluation. Table 2 shows the comparison between the FPVLSI based on LEDR and the FPGA based on the 4-phase dualrail encoding. The number of transistors per cell of the FPVLSI is only by 13% larger than that of the 2-rail and 4-phase encoding. The delay of a cell and the power consumption per data set are respectively reduced to 61% and 58%. Table 3 shows the comparison between the FPVLSI

HARIYAMA et al.: EVALUATION OF A FIELD-PROGRAMMABLE VLSI

1425 Table 3

Comparison between the FPVLSI and a synchronous FPGA.

Fig. 22

Fig. 21

The peak power at a data rate of 1 G data set/s.

Relations between power consumption and utilization.

and a synchronous FPGA. The synchronous FPGA has also fine-grained architecture like the FPVLSI. Because of the overhead for the asynchronous control, the number of the transistors of the FPVLSI is 2.53 times larger than the synchronous FPGA. The delay and the power consumption of a cell of the FPVLSI are 4.58 times and 1.8 times larger than those of the synchronous FPGA, respectively. In terms of delay of a cell, the synchronous architecture seems better than the LEDR-base asynchronous architecture. However, it is difficult for the synchronous architecture to exploit the maximum performance of a cell since as high as 8 GHz a clock frequency is required from the following reason. To achieve the maximum throughput, we assume that the cell-level pipelining is used. The maximum throughput is 1/0.12 ns = 8.3 G since the delay of cell of the synchronous FPGA is 0.12 ns. Hence, the clock frequency of more than 8 GHz is required to achieve the maximum throughput. Moreover, it is very difficult to distribute such high frequency clock since an FPGA has a heavy clock load because of an enormously large number of registers. For example, even in Xilinx leading-edge FPGA, Virtex5, its maximum clock frequency is only 550 MHz. Note that the power consumption per data set of a cell of the synchronous FPGA includes the power consumption by clock distribution, and it occupies 32% of a total power consumption of a cell. In practical applications, the total power consumption depends on utilization of hardware. Even when the FPGA is not used or has the un-used blocks, the synchronous FPGA requires the power consumed by clock distribution since it is not practical to customize the fixed clock network of the conventional synchronous FPGAs as mentioned in Sect. 2.2. Based on this consideration, we evaluate the relations between the power consumption and utilization for the pro-

Fig. 23

Relation between the peak power and the data rate.

posed FPVLSI and the synchronous FPGA. Figure 21 shows the relations. The number of cells is 256. When the utilization is less than 28%, the proposed FPVLSI has lower power consumption than the synchronous FPGA. Therefore, the proposed FPVLSI would be suitable for applications with low utilization such as mobile applications. To evaluate the peak power, 256 cells are connected in series for pipelining. Figure 22 shows the peak power at a data rate of 1G data set/s which corresponds to synchronous architectures running at 1 GHz. In this condition, the peak power of the FPVLSI is reduced to 39% compared to the synchronous FPGA. In a synchronous architecture, all cells operate in synchronization with the clock signal which results in a large peak power. The peak power is determined by the number of cells, and is independent of the clock frequency (or the data rate). On the other hand, in an asynchronous architecture, all cells operate based on the handshake protocol without the common clock signal. In other words, all the cells operate independently in time, which results in a small peak power. As shown in Fig. 23, the peak power depends on the data rate unlike the synchronous architecture. 4.

Conclusion

We proposed the FPVLSI based on an asynchronous finegrained pipelining architecture. The key technologies are the LEDR encoding to enhance the performance, and its area-efficient circuits. The asynchronous FPGA has a great potential for low power. In evaluation, we have compared

IEICE TRANS. ELECTRON., VOL.E91–C, NO.9 SEPTEMBER 2008

1426

just the performances of the cells. We should evaluate the total performance of a whole chip using a lot of applications since the power consumption of switch blocks strongly depends on their fan-outs. The asynchronous architecture is very suitable for fine-grained power gating since the data word includes information on whether the block is used or not. Therefore, the power gating control is easily archived using the information. On the other hand, in the synchronous FPGA, the fine-grained power gating may be impractical since it requires significant overhead of control circuits. The asynchronous FPVLSI with fine-grained power gating is now undergoing. Acknowledgement This work is supported by VLSI Design and Education Center(VDEC), the University of Tokyo in collaboration with Synopsys, Inc., and Cadence Design Systems, Inc. Authors thank Prof. Takahiro Hanyu, Research Institute of Electrical Communication, Tohoku University, Japan for his helpful support in CAD environment. References [1] V. George, H. Zhang, and J. Rabaey, “The design of a low energy FPGA,” Proc. 1999 International Symposium on Low Power Electronics and Design, pp.188–193, California, USA, Aug. 1999. [2] S. Hauck, S. Burns, G. Borriello, and C. Ebeling, “An FPGA for implementing asynchronous circuits,” IEEE Design & Test, vol.11 no.3, pp.60–69, July 1994. [3] R. Payne, “Asynchronous FPGA architectures,” IEE Computers and Digital Techniques, vol.143, no.5, pp.282–286, 1996. [4] K. Maheswaran and V. Akella, “PGA-STC: Programmable gate array for implementing self-timed circuits,” Int. J. Electron., vol.84, no.3, pp.255–267, 1998. [5] C. Traver, R.B Reese, and M.A. Thornton, “Cell designs for selftimed FPGAs,” Proc. 14th Annual IEEE International ASIC/SOC Conference, pp.175–179, 2001. [6] J. Teifel and R. Manohar, “An asynchronous dataflow FPGA architecture,” IEEE Trans. Comput., vol.53, no.11, pp.1376–1392, Nov. 2004. [7] R. Manohar, “Reconfigurable asynchronous logic,” Proc. IEEE Custom Integrated Circuits Conference, Sept. 2006. [8] M.E. Dean, T.E. Williams, and D.L. Dill, “Efficient self-timing with level-encoded 2-phase dual-rail (LEDR),” University of California/Santa Cruz Conference on Advanced Research in VLSI, pp.55– 70, 1991. [9] M. Hariyama, S. Ishihara, C.C. Wei, and M. Kameyama, “A fieldprogrammable VLSI based on an asynchronous bit-serial architecture,” IEEE Asian Solid-State Circuits Conference (A-SSCC), pp.380–383, Nov. 2007. [10] M. Hariyama, W. Chong, and M. Kameyama, “Field-programmable VLSI based on a bit-serial fine-grain architecture,” IEICE Trans. Electron., vol.E87-C, no.11, pp.1897–1902, Nov. 2004. [11] Y. Zhang, J. Roivainen, and A. Mammela, “Clock-gating in FPGAs: A novel and comparative evaluation,” 9th EUROMICRO Conference on Digital System Design (DSD’06), pp.584–590, 2006. [12] “Gated clock conversion with Synplicity’s synthesis products,” Synplicity Inc., Sunnyvale, CA, Application Note, July 2003. [13] J. Sparsø and S. Furber, Principles of Asynchronous Circuit Design: A Systems Perspective, Kluwer Academic Publishers, 2001.

Masanori Hariyama received the B.E. degree in electronic engineering, M.S. degree in Information Sciences, and Ph.D. in Information Sciences from Tohoku University, Sendai, Japan, in 1992, 1994, and 1997, respectively. He is currently an associate professor in Graduate School of Information Sciences, Tohoku University. His research interests include VLSI computing for real-world application such as robots, high-level design methodology for VLSIs and reconfigurable computing.

Shota Ishihara received the B.E. degree in Information Engineering from Tohoku University, Sendai, Japan, in 2007. He is currently working toward the M.S. degree in Graduate School of Information Sciences, Tohoku University. His primary research interest is in the area of reconfigurable computing.

Michitaka Kameyama received the B.E., M.E. and D.E. degrees in Electronic Engineering from Tohoku University, Sendai, Japan, in 1973, 1975, and 1978, respectively. He is currently a Professor in the Graduate School of Information Sciences, Tohoku University. His general research interests are intelligent integrated systems for real-world applications and robotics, advanced VLSI architecture, and newconcept VLSI including multiple-valued VLSI computing. Dr. Kameyama received the Outstanding Paper Awards at the 1984, 1985, 1987 and 1989 IEEE International Symposiums on Multiple-Valued Logic, the Technically Excellent Award from the Society of Instrument and Control Engineers of Japan in 1986, the Outstanding Transactions Paper Award from the IEICE in 1989, the Technically Excellent Award from the Robotics Society of Japan in 1990, and the Special Award at the 9th LSI Design of the Year in 2002. Dr. Kameyama is an IEEE Fellow.

Suggest Documents