Simulation-Based Circuit-Activity Estimation for FPGAs Containing Hard Blocks

Sean Seeley, Vidya Sankaranaryanan, Zack Deveau, Panagiotis Patros, Kenneth B. Kent
Faculty of Computer Science, University of New Brunswick

ABSTRACT

FPGAs are programmable electronic devices that can be configured to behave equivalently to many other circuits. They are used both for rapid and cheap prototyping of new circuit designs and for replacing outdated chip models. Due to their complexity, circuits cannot practically be designed by hand; instead, specialized Computer Aided Design (CAD) software performs this task. A major concern for devices is their power requirements, which can have adverse effects on both the environment and users. The power requirements of a circuit are directly connected to its activity, which can be estimated by the CAD tools. In this work, we focus on the open source Verilog-to-Routing (VTR) CAD software and propose an improved activity estimation tool built on VTR’s synthesizer (Odin II). The tool extends beyond the capabilities of VTR’s current estimator (ACE2) by providing proper black box activity propagation and support for circuits containing no clocks or more than one clock. We evaluate our results experimentally with VTR’s FPGA architectures and benchmark circuits.

CCS CONCEPTS
• Hardware → Reconfigurable logic and FPGAs; Power estimation and optimization; Logic synthesis; Technology-mapping; Software tools for EDA; Circuit optimization; • Computer systems organization → Embedded hardware;

ACM Reference format:
Sean Seeley, Vidya Sankaranaryanan, Zack Deveau, Panagiotis Patros, and Kenneth B. Kent. 2017. Simulation-Based Circuit-Activity Estimation for FPGAs Containing Hard Blocks. In Proceedings of RSP’17, Seoul, Republic of Korea, October 15–20, 2017, 7 pages. https://doi.org/10.1145/3130265.3130326

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. RSP’17, October 15–20, 2017, Seoul, Republic of Korea © 2017 Association for Computing Machinery. ACM ISBN 978-1-4503-5418-9/17/10. . . $15.00 https://doi.org/10.1145/3130265.3130326

1 INTRODUCTION

The era of Big Data is propelled by the abundance of information produced by all kinds of small, embedded devices that constitute the edges of the Internet of Things (IoT) [14]. Various sensors capture locations, temperatures, pressures, traffic, personal data, etc., which are subsequently filtered and forwarded through the Internet to cloud servers for storage and aggregation. All these embedded devices rely on electronic circuits to implement their functionality. A circuit acts as the logic controller of the device as well as its input and output manager.

A straightforward way to implement a circuit is to directly fabricate its pins, logic components (a.k.a. logic gates) and interconnection network on a chip. Such circuits are referred to as Application Specific Integrated Circuits (ASICs) [2, 6]. However, ASICs are only useful for exactly the application they were designed for. Field Programmable Gate Arrays (FPGAs) address this limitation: they can replicate the functionality of other circuits because they are reprogrammable. In particular, FPGAs comprise a set of Look Up Tables (LUTs), each with a number of inputs and outputs as well as a number of control signals that can modify the function of the LUT in all possible ways. Additionally, FPGAs contain programmable routing elements that connect the LUTs and the input/output pins. Finally, because certain types of logic and routing components tend to repeat, FPGAs contain non-programmable hard blocks that implement a specific functionality, such as addition, memory and multiplication [2, 6].

Circuits for FPGAs are synthesized by specialized CAD software into netlists that abstract the circuit’s functionality into a graph. The netlist is first packed into LUT-sized groups and then placed and routed on a given FPGA architecture; components that can be mapped to the hard blocks of the given architecture are not synthesized into soft logic but are placed directly on one of the available hard blocks and routed accordingly. Verilog-to-Routing (VTR) is a CAD software tool targeting FPGAs [11, 17]. It comprises Odin II [5], which synthesizes Verilog code into a netlist, maps components to the hard blocks available on the given FPGA architecture and also performs simulation-based functional evaluation [10]; ABC, which optimizes the netlist; and Versatile Place and Route (VPR) [1], which packs, places and routes the optimized netlist on the target FPGA architecture.

Odin II does not synthesize logic that can be mapped to hard blocks; instead, its netlist contains specialized black boxes that abstract the functionality of hard blocks, such as memories [18] and multipliers. Hard blocks are used in FPGAs instead of expanded soft logic because they minimize the required area, reduce the power requirements of the implemented circuits and increase their clock frequency by shortening the design’s critical path.

An identified drawback of FPGAs compared to ASICs is increased power consumption [7]. The power required by an ASIC is lower since only dedicated hardware is powered, whereas on an FPGA the circuitry that supports programmability and signal routing requires extra power. Additionally, a LUT is much less power-efficient than the dedicated logic gates it implements. Consequently, increased power requirements directly correlate with increased energy consumption. Furthermore, with every new FPGA generation, the cost gap for producing large numbers of FPGAs instead of ASICs is reduced; thus, FPGAs should further increase their market share [20].

All of these factors can be particularly influential on climate change [4]; the anthropogenic effects that drive this phenomenon must be mitigated as soon as possible to contain the various natural disasters that are either happening or are bound to happen. In the IoT era, where billions of embedded devices are active at any given moment, power-aware circuit design becomes an ethical obligation. Reducing power consumption also has more immediate positive effects on a device’s performance and battery consumption. This motivates having a power model as part of the CAD tool, so that one can experiment with and research new FPGA architectures for power efficiency. An FPGA activity estimation framework, which is included in VTR via ACE2 [8], was proposed by Poon et al. [16].
The model distinguishes between two main sources of power consumption on FPGAs. First, static power refers to the power necessary to keep the various components of the FPGA, such as memories and LUTs, operational. In the model, this boils down to a per-node metric Pl(x) that describes the probability that wire x is at logical high, i.e. 1 or high voltage, while the circuit is operational. Second, dynamic power refers to the power necessary for a circuit to perform its computational activities. In the model, this results in a per-node metric As(x), called switching activity, that describes the probability that wire x will switch states (from 0 to 1 or from 1 to 0). Switching activity is driven both by normal switching and by time-delay-based glitching. Switching occurs while performing computations during every clock cycle as the inputs change. Glitching occurs within a clock cycle due to the various delays the signals encounter while propagating through the circuit. However, ACE2 does not propagate signals from black boxes, nor does it support circuits that do not contain exactly one clock.

In this paper, we make the following contributions:
• We identify VTR’s issues with the estimation of simulation-based activity when hard blocks are present or when the number of clocks is not exactly one.
• We design and implement a solution for these problems in Odin II and as part of the VTR CAD flow.
• We experimentally evaluate our solution using VTR benchmarks and FPGA architectures as well as a simplified power index metric of our design.
• We compare our results with those of a probability-based activity estimator that disregards hard blocks.
• We perform a number of commits into the VTR and Odin II public repository on GitHub regarding both our solution and the fixes necessary to support our experimentation and maintain the research tool.

Figure 1: Simple Circuit for Activity Estimation Example. [Schematic: input in0 drives a NOT gate (delay = 1ns) producing not0; input in1 drives a NOT gate (delay = 2ns) producing not1; not0 and not1 feed an XOR gate (delay = 1ns) producing out.]

Figure 2: Example Inter-Cycle Activity for Example Simple Circuit. [Waveforms of in0, in1, not0, not1 and out; time axis from 0 to 40.]

Figure 3: Example Intra-Cycle Activity for Example Simple Circuit (The shaded areas mark the stabilized time periods.) [Waveforms of in0, in1, not0, not1 and out; time axis from 0 to 40.]

2 AN ACTIVITY CALCULATION EXAMPLE

Consider the simple example circuit displayed in Figure 1. It contains two input pins that are both connected to NOT gates. However, the first NOT gate propagates the signal faster than the second. The outputs of these NOT gates are then passed to an XOR gate, which drives the output pin.

Table 1: Example Calculation of Activity Metrics for Simple Circuit

Signal   Pl                      Ps             As
in0      4/8 = 0.500             7/8 = 0.875    7/8 = 0.875
in1      3/8 = 0.375             3/8 = 0.375    3/8 = 0.375
not0     4/8 = 0.500             3/8 = 0.375    3/8 = 0.375
not1     5/8 = 0.625             7/8 = 0.875    7/8 = 0.875
out      (5 + 0.25)/8 = 0.656    4/8 = 0.500    (4 + 6)/8 = 1.250

In Figure 2, we display how the various signals in the simple circuit evolve over time for a specific pattern of inputs, after the circuit has stabilized. As expected from the circuit’s design, signals not0 and not1 are the inverse of signals in0 and in1 respectively; the output signal out is high when the signals not0 and not1 have the same value. However, this is not the complete picture of what happens in this circuit. In Figure 3, we display the evolution of the signals including the misalignments (glitches) that occur due to the delay difference between the two NOT gates. Three shorter spikes in signal out now appear between times 10-15, 20-25 and 35-40 because the signal from not0 arrives 1ns earlier than the signal from not1.

Finally, we can use this full representation of the circuit’s evolution to calculate the various activity metrics, which are displayed in Table 1. Let us examine the in0 signal first: its static probability is 0.5 because the signal is high in 4 out of 8 time-slices; its switching probability is 0.875 because, over the 8 stable time-slices, this signal switched value 7 times; and its switching activity is the same because the signal behaved in the same way during the unstable periods. In contrast, the Ps and As metrics differ for the signal out: its switching probability is 0.5 because it switched value in 4 out of 8 time-slices, but its switching activity is higher because the three glitches added six more switching events, giving 10 switching events over 8 time-slices, i.e. 1.25.
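The calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the traces are hypothetical (they are not the waveforms of Figures 2 and 3), each wire is sampled at fixed time steps, the last sample of a cycle is taken as its stabilized value, and the function name `activity_metrics` is ours rather than part of VTR or Odin II.

```python
def activity_metrics(samples, cycle_len):
    """Return (Pl, Ps, As) for one wire.

    samples   -- list of 0/1 values, one per time step (hypothetical trace)
    cycle_len -- number of time steps per clock cycle (assumed constant)
    """
    if cycle_len <= 0 or len(samples) % cycle_len != 0:
        raise ValueError("trace must cover a whole number of cycles")
    cycles = len(samples) // cycle_len

    # Static probability: fraction of time the wire spends at logical high.
    p_l = sum(samples) / len(samples)

    # Stabilized value of each cycle: assumed to be the cycle's last sample.
    stable = [samples[(c + 1) * cycle_len - 1] for c in range(cycles)]

    # Switching probability: changes between consecutive stabilized values.
    p_s = sum(a != b for a, b in zip(stable, stable[1:])) / cycles

    # Switching activity: every transition in the trace, glitches included.
    a_s = sum(a != b for a, b in zip(samples, samples[1:])) / cycles
    return p_l, p_s, a_s


# Hypothetical 4-cycle traces with 5 time steps per cycle.
clean = [0] * 5 + [1] * 5 + [1] * 5 + [0] * 5
glitchy = [0] * 5 + [1, 0, 1, 1, 1] + [1] * 5 + [0] * 5

print(activity_metrics(clean, 5))    # stable switching only
print(activity_metrics(glitchy, 5))  # same Ps, higher As due to the glitch
```

In this sketch the glitch leaves the switching probability unchanged but raises the switching activity, which mirrors the behaviour of signal out in Table 1.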

3 RELATED WORK

ACE2 [8] by Lamoureux and Wilton is the closest related work to ours. ACE2 also estimates the activity of circuits but has two major drawbacks that we aim to overcome in this work. First, it ignores the effect of black boxes in the circuit, which are there so that they can be mapped to hard blocks by VPR afterwards. Second, it does not support circuits that do not contain exactly one clock. Todorovich et al. also proposed activity estimation for FPGAs based on statistical analysis [19]. Their work supports only combinational circuits and supports neither black boxes nor activity estimation through input-vector simulation.

Regarding attempts to reduce activity and power for FPGAs, Bsoul et al. proposed an FPGA design that supports dynamic power gating [3]. This design switches off unused components of the FPGA so that its leakage power is reduced. Additionally, Patros et al. proposed that the power consumption of a circuit implemented on an FPGA can be reduced by removing the reset sub-circuit [15]. Unlike ASICs, FPGAs that support power-on reset do not require such circuitry. Instead, in these FPGAs, the initial values of registers can be stored as part of the programming during initialization. Thus, the authors were able to produce smaller and shorter circuits using their technique.

Finally, related work on Odin II and VTR includes hard block reduction [21], the addition of system-on-chip processors on a variety of FPGA architectures [9] and support for hard-block adders and carry chains [12].

4 DESIGN AND IMPLEMENTATION

The metrics we use for our solution are based on the previous ACE2 [8] work currently in VTR: we calculate per wire x its static probability Pl(x), its switching probability Ps(x) and its switching activity As(x). Our design was adjusted so that it runs in the simulator of Odin II, which maintains information about black boxes and hard blocks, instead of executing stand-alone as ACE2 currently does. The various theoretical details of the activity model are similar and are omitted for brevity. However, because the number of wires in a circuit can be exceptionally high, we introduce a new metric for aggregating the various activity scores of the circuit wires. The expected power requirements are proportional to both the static probability Pl(x) and the switching activity As(x), as discussed in the introduction:

$$P \propto P_l(x), \quad \forall x \in \mathit{wires} \tag{1}$$

$$P \propto A_s(x), \quad \forall x \in \mathit{wires} \tag{2}$$

Thus, we introduce the Simplified Power Index (SPI), which is calculated for a certain simulation run of a certain circuit deployed on a certain FPGA architecture as follows:

$$\mathit{SPI} = \sum_{x \in \mathit{wires}} \bigl( P_l(x) + A_s(x) \bigr) \tag{3}$$

Due to its definition as well as Equations 1 and 2, we can deduce that SPI is proportional to the power requirements of the circuit:

$$\mathit{SPI} \propto P \tag{4}$$
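As a small worked instance of the SPI definition in Equation 3, the following sketch sums Pl(x) + As(x) over the five wires of the example circuit, using the Pl and As values listed in Table 1. The function name and the dictionary layout are assumptions made for illustration; they do not reflect the data structures of Odin II.

```python
def simplified_power_index(wires):
    """Sum Pl(x) + As(x) over all wires.

    wires -- mapping from wire name to a (Pl, As) pair (illustrative layout)
    """
    return sum(p_l + a_s for p_l, a_s in wires.values())


# (Pl, As) pairs taken from Table 1 of the example circuit.
example_wires = {
    "in0": (0.500, 0.875),
    "in1": (0.375, 0.375),
    "not0": (0.500, 0.375),
    "not1": (0.625, 0.875),
    "out": (0.656, 1.250),
}

spi = simplified_power_index(example_wires)
print(f"SPI = {spi:.3f}")  # 2.656 (sum of Pl) + 3.750 (sum of As)
```

Because the index is a plain sum, activity that moves from one wire to another leaves it largely unchanged, while activity that is added or removed changes it directly.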

Concerning implementation, our code was developed in separate modules and merged into the Odin II trunk. We implemented hooks into the Odin II simulator, which are invoked at every cycle of the simulation. Because Odin II maintains information regarding the functionality of hard blocks, their effect is no longer disregarded, as happens in ACE2. Additionally, the simulator of Odin II supports circuits that contain no clocks or more than one clock, which allows our tool to estimate the activity of such circuits. Thus, our modifications allow the activity estimation model to utilize the correct outputs of any black boxes, including their propagation effects on the remaining circuit. Any necessary information regarding the composition of the FPGA is found in the architecture file supplied as a parameter. At the end of the simulation, our results are dumped into a special activity output file. For each wire x, we print its static probability Pl, its switching probability Ps and its switching activity As.

4.1 Further Extensions and Maintenance

In addition to implementing, testing and committing our activity estimation tool, we performed a number of other updates and commits for Odin II. These changes concerned both the maintenance of this open source platform and the introduction of new features that were relevant to improved activity (and thus power) estimation of FPGA circuits using the VTR flow. In particular, we made the following contributions to the Odin II repository.

First, we added a command-line parameter for passing a random seed to Odin II, which we then use to initialize the random input-vector producer for simulation. This modification allows the standardization of randomized testing based on the seed; otherwise, the same random input-vectors would be produced and repeating the tests would have no effect. Additionally, we decided against initializing the seed with a function such as time(NULL) because, if multiple tests were started in parallel, they would risk starting within the same time-slice and thus producing the same pseudo-random numbers again.

Second, we implemented register initialization using the Verilog initial block construct. Because VTR mostly focuses on FPGA packing, placement and routing statistics of synthesizable circuits, certain unsynthesizable features had been omitted. However, since our technique performs simulation to estimate activity, it can make a significant difference if a certain register is set to a certain value at the beginning, because this changes the state of the circuit and can lead to different activity patterns.

Third, we expanded the syntax of Odin II to accept newer Verilog constructs, such as comma-based sensitivity lists for the always block and the inference of undeclared identifiers as nets by default. These changes increased the pool of circuits that can have their activity estimated in VTR, particularly newer and thus more relevant circuits.

Fourth, we added support for the Verilog power operator, which we convert to successive multiplications, assuming the exponent is an integer. This is a crucial improvement because, besides expanding the Verilog support of Odin II and VTR, it also creates a number of multiplications, which are a core target for FPGA hard blocks and whose activity we can then properly estimate using our tool.

Fifth, during our tests, we used both Verilog and BLIF files as inputs for circuit estimation. However, because of limitations regarding the usual use of the VTR flow, reading and simulating BLIF files was incomplete: BLIF files can describe their LUTs in both positive and negative ways, but support for the negative way was not present. We added this feature, which in particular enabled BLIF files optimized by ABC to be fed as input to the simulator and activity tool.

Sixth, we performed various memory-management improvements, which enabled the synthesis and simulation of larger circuits using soft-logic-intensive architectures. We used automated memory analysis tools such as Valgrind and Coverity as well as more traditional core dumps and debugging with gdb.

Finally, as part of our training and familiarization with the VTR code base, we responded to numerous issues raised by users in the public repository and pushed fixing commits for any repeatable bugs. All of our commits followed the software engineering guidelines of the VTR community and passed all the required regression tests successfully.
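The conversion of the power operator into successive multiplications can be illustrated with a short sketch. It is not Odin II’s implementation: the helper `expand_power` is hypothetical, works on expression strings rather than on a syntax tree, and assumes a constant, non-negative integer exponent.

```python
def expand_power(base: str, exponent: int) -> str:
    """Rewrite `base ** exponent` as a chain of multiplications.

    Hypothetical helper for illustration; assumes the exponent is a
    constant, non-negative integer known at synthesis time.
    """
    if exponent < 0:
        raise ValueError("only non-negative integer exponents are handled")
    if exponent == 0:
        return "1"
    # exponent copies of the base joined by (exponent - 1) multiplications
    return " * ".join([f"({base})"] * exponent)


print(expand_power("a", 3))      # (a) * (a) * (a)
print(expand_power("x + 1", 2))  # (x + 1) * (x + 1)
```

Each `*` in the expanded expression is a multiplication that the flow may map to a multiplier hard block on architectures that provide one, which is why this extension matters for activity estimation.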

5 EXPERIMENTS

To evaluate our approach to estimating circuit activity, we conducted a series of experiments using the improved VTR flow. First, we selected three sample FPGA architectures from those available in the VTR flow (Table 2): arch0 describes a modern design (comparable to the commercial 40nm Stratix IV) and contains black boxes for 32Kb memories and fracturable 36x36 multipliers; arch1 specializes arch0 with carry chains that link to adjacent blocks; and arch2 simplifies arch0 by removing all black boxes. Thus, two of these architectures (arch0, arch1) contain hard blocks, whereas the third (arch2) does not. Therefore, any changes that correlate with the presence or absence of hard blocks should be detectable. Second, we used the Verilog benchmarks of the VTR flow as circuits because they cover a broad range of applications. Because some of these circuits use memories and multipliers that can be synthesized into hard blocks, we again expect detectable differences.

As a baseline, we considered the probability-based activity estimation provided by ACE2, which we will refer to as ACE2 (Prob). Because this analysis is static, it always provides the same results for a given circuit compiled on a given architecture file; thus, we conducted only a single run for ACE2 (Prob). Concerning our simulation-based technique, we repeated the tests 16 times per circuit and per architecture. For each repetition, we passed a different random seed between 0 and 15 on the command line of Odin II; thus, different pseudo-random vectors were produced for each run. In all cases, the simulation lasted for 10,000 clock cycles.

All tests were conducted on the same machine running CentOS Linux with a 12-core Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz and 32GB of main memory. The large capacity of the main memory was crucial in allowing the synthesis and simulation of large and soft-logic-intensive circuits in a realistic time; efforts to run these tests on smaller machines with paging led to the application not finishing in a timely manner.

The experimental results were the static probability (Pl) and switching activity (As), which were captured per activity-estimation method, FPGA architecture, benchmark circuit and circuit wire. Due to their sheer size, a relational database was used for their aggregation and subsequent analysis using SQL queries. In the following paragraphs, we discuss these results and compare the activity measurements of the two methods, the three architectures and the benchmark circuits.
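The role of the per-run seed can be illustrated with a minimal sketch of seeded input-vector generation. It uses Python’s standard `random` module rather than the generator inside Odin II, and the function name, the number of input pins and the shortened run length are all assumptions chosen for illustration.

```python
import random


def random_vectors(seed, num_inputs, num_cycles):
    """Generate one pseudo-random 0/1 input vector per clock cycle.

    A private generator is seeded explicitly, so the same seed always
    reproduces the same vectors, independent of any global state.
    """
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(num_inputs)]
            for _ in range(num_cycles)]


# Sixteen repetitions with seeds 0..15, as in the experimental setup;
# 4 inputs and 100 cycles are placeholder sizes for this sketch.
runs = {seed: random_vectors(seed, num_inputs=4, num_cycles=100)
        for seed in range(16)}

# Re-running with the same seed reproduces the run exactly.
print(runs[7] == random_vectors(7, num_inputs=4, num_cycles=100))
```

Passing the seed explicitly, instead of deriving it from the wall clock, keeps parallel runs distinct and makes every run repeatable.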

Table 2: Selected VTR FPGA Architectures for Experiments

Name    VTR XML File                              Description       Hard Blocks
Arch0   k6_frac_N10_mem32K_40nm.xml               VTR Flagship      Memories, Multipliers
Arch1   k6_frac_N10_frac_chain_mem32K_40nm.xml    VTR Carry Chain   Memories, Multipliers, Carry Chains
Arch2   k6_frac_N10_40nm.xml                      VTR Soft Logic    None

5.1 Methods’ Comparison

In this section, we evaluate the statistical differences between our activity estimation method and the baseline. The analysis is split into two parts: first, we investigate differences in the Pl and As metrics per circuit wire; second, we investigate differences in our aggregated SPI metric per run.

5.1.1 Per Wire. In Figures 4 and 5, we display the Root Mean Square (RMS) error per wire between the two activity estimation methods for Pl and As respectively. Note that because our technique was repeated multiple times, the various metrics were first averaged across the 16 runs. Our results suggest that there were generally larger differences between our technique and the baseline regarding static probability Pl (Figure 4) than regarding switching activity As (Figure 5). Because the largest differences in Pl concerned the architecture without hard blocks, the main reason for the differences reported in this metric appears to be the use of simulation instead of only statically analyzing probabilities. However, the opposite observation can be made for the As metric: in general, larger differences in switching activity were measured for the architectures containing hard blocks; thus, this metric is more sensitive to the hard block activity calculations that were not present in the baseline. All in all, the relatively high RMS values registered in all cases suggest that performing simulation-based instead of probability-only estimation results in significant differences in the per-wire activities. Thus, our tool can provide a different insight than the baseline technique regarding the activity and power of the various circuit wires.

Figure 4: Comparison of Activity Estimation Methods regarding Static Probability (Pl) per Wire. [Plot: y-axis Methods’ RMS of Pl per Component; x-axis Benchmark Circuit; series arch0, arch1, arch2.]

Figure 5: Comparison of Activity Estimation Methods regarding Switching Activity (As) per Wire. [Plot: y-axis Methods’ RMS of As per Component; x-axis Benchmark Circuit; series arch0, arch1, arch2.]

Figure 6: Comparison of Activity Estimation Methods using Arch0. [Plot: y-axis SPI for Arch0, ticks from 1 to 100,000; x-axis Benchmark Circuit; series ACE2 (Prob), SHB (SIM).]

Figure 7: Comparison of Activity Estimation Methods using Arch1. [Plot: y-axis SPI for Arch1, ticks from 1 to 100,000; x-axis Benchmark Circuit; series ACE2 (Prob), SHB (SIM).]

Figure 8: Comparison of Activity Estimation Methods using Arch2. [Plot: y-axis SPI for Arch2, ticks from 1 to 100,000; x-axis Benchmark Circuit; series ACE2 (Prob), SHB (SIM).]

Figure 9: Comparison of SPI per Circuit per Architecture. [Plot: y-axis SPI using SHB (SIM); x-axis Benchmark Circuit; series arch0, arch1, arch2.]

Figure 10: Aggregated SPI Scores per Architecture. [Plot: y-axis SPI using SHB (SIM); groups Average, GeoMean, Harmonic Mean; series arch0, arch1, arch2.]

5.1.2 With SPI. Because of the sheer number of wires per circuit, our analysis switched to the aggregated Simplified Power Index (SPI) metric discussed previously. The calculated SPIs for the three testing architectures are displayed in Figures 6, 7 and 8 respectively. Since our technique was repeated multiple times, we display the average SPI over the 16 runs as well as error bars showing the standard deviation. In most cases, the SPI measurements acquired with our technique closely match those acquired by the baseline method. This is expected due to the way SPI is calculated: if some activity is shifted to other parts of a circuit, it will still increase the index in similar ways. All in all, measurable differences in the SPI metric can be observed when comparing our technique with the baseline.
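The per-wire comparison of Section 5.1.1 can be sketched as follows: the simulation-based metric is first averaged across runs, and the RMS of the per-wire differences to the baseline is then computed. The function name, the dictionary layout and the sample values are hypothetical, and the sketch assumes both methods report the same set of wires; it is not the SQL analysis used for the experiments.

```python
import math


def rms_difference(baseline, sim_runs):
    """RMS of per-wire differences between two estimation methods.

    baseline -- {wire: value} from the single probability-based run
    sim_runs -- list of {wire: value}, one dict per simulation seed
    Assumes every wire of the baseline appears in every simulation run.
    """
    if not baseline or not sim_runs:
        raise ValueError("need at least one wire and one simulation run")
    # Average the simulation-based metric across all runs first.
    mean_sim = {w: sum(run[w] for run in sim_runs) / len(sim_runs)
                for w in baseline}
    squared = [(mean_sim[w] - baseline[w]) ** 2 for w in baseline]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical static probabilities for two wires and two seeds.
baseline_pl = {"a": 0.25, "b": 0.5}
simulated_pl = [{"a": 0.5, "b": 1.0}, {"a": 1.0, "b": 1.0}]

print(rms_difference(baseline_pl, simulated_pl))  # 0.5
```

The same function applies unchanged to the switching activity As by passing the corresponding per-wire values.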

Figure 11: Aggregated SPI Scores per Circuit. [Plot: y-axis SPI using SHB (SIM); x-axis Benchmark Circuit; series Average, GeoMean, Harmonic Mean, VPR LUTs.]

5.2 Architectures’ Comparison

Next, we use the SPI metric acquired by our activity estimation technique to investigate the levels of activity measured across the three tested architectures. We display the average SPI scores per architecture and benchmark in Figure 9, using the standard deviation for the error bars. Across the various benchmark circuits, arch2 stands out as the one with the highest activity in most cases. This may be because arch2 does not contain hard blocks: in this architecture, the various multiplication and memory operations are no longer contained inside a single component but are instead expanded to many more LUTs. These observations are further supported by the aggregated SPI scores displayed in Figure 10, where we display the average, geometric mean and harmonic mean. In all cases, the soft-logic arch2 is the one with the highest activity and thus the highest expected power requirements. Consequently, our activity estimation technique paired with the SPI score provides a reasonable way to quickly distinguish between the various architectures.
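The three aggregates reported for the SPI scores can be computed with Python’s standard `statistics` module, as in the sketch below. The circuit names and SPI values are placeholders invented for illustration, not measurements from our experiments, and the function name is an assumption.

```python
from statistics import geometric_mean, harmonic_mean, mean


def aggregate_spi(spi_per_circuit):
    """Return the arithmetic, geometric and harmonic mean of SPI scores.

    spi_per_circuit -- {circuit name: SPI}; all values must be positive.
    """
    values = list(spi_per_circuit.values())
    return {
        "average": mean(values),
        "geomean": geometric_mean(values),
        "harmonic": harmonic_mean(values),
    }


# Placeholder scores spanning two orders of magnitude.
scores = aggregate_spi({"small_circuit": 10.0, "large_circuit": 1000.0})
print(scores)
```

When SPI values span several orders of magnitude, the arithmetic mean is dominated by the largest circuits, whereas the geometric and harmonic means give more weight to the smaller ones; reporting all three shows whether a conclusion depends on that choice.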

5.3 Circuits’ Comparison

In our last analysis, we use our SPI metric to compare the activities of the benchmark circuits. The experimental results are displayed in Figure 11 and, as before, they utilize the mean, the geometric mean and the harmonic mean for aggregating the SPI scores. Additionally, we display on the graph the number of LUTs that VPR [11] required to place these circuits. Our results suggest that the aggregated SPI scores correlate with the number of LUTs required by VPR, which itself should be considered a cruder indicator of overall circuit activity. In other words, the larger the circuit, the higher the overall activity and thus the higher the expected power requirements. Therefore, this provides evidence supporting both our improved estimation model and our novel SPI metric.

Additionally, because arch2 was the only soft-logic-only architecture, our results suggest that significant activity and power reductions are expected for ch_intrinsics, diffeq1, diffeq2, mkPktMerge, or1200 and raygentop; these are the circuits that we predict will gain the most in terms of power reductions when implemented on FPGA architectures with the same hard blocks. However, for sha, stereovision0 and stereovision3, no measurable differences in activity were observed when hard blocks were used; therefore, for these circuits, we predict no power reductions on FPGA architectures that contain hard blocks similar to the experimental ones.

6 CONCLUSION

In the era of abundant embedded electronic devices, FPGAs have become prevalent for rapidly and cost-efficiently prototyping new circuit designs. Additionally, FPGAs are used directly in devices, either when more general-purpose functionality is desired or when an older circuit design that needs to be replaced can no longer be found on the market. Designing circuits by hand can be a daunting task. Instead, specialized software synthesizes a circuit’s description from hardware description languages. Such CAD tools pack, place and route the circuit design on an FPGA.

A major concern for circuit design is power requirements. With billions of devices operating at any time on the Internet of Things, it is vital to minimize their power requirements so that their effect on climate change is mitigated. Reducing power requirements can also make these devices more usable and more performant, by extending their battery life and increasing the clock speed without excess heating. Circuit power can be extrapolated from the activity of a circuit, which in turn corresponds to the percentage of time the circuit’s wires stay at high voltage and to how frequently they switch values.

In this paper, we propose, implement and experimentally evaluate an activity estimation tool that runs on the FPGA CAD flow VTR and, in particular, on its synthesizer, Odin II. Our tool improves on VTR’s state-of-the-art activity estimator ACE2 by supporting the proper activity calculation of circuits that contain black boxes (which are mapped to FPGA hard blocks further down the VTR pipeline) and by supporting circuits that do not contain exactly one clock. Additionally, to support our new tool, we performed a number of updates and maintenance tasks on the VTR project; all of our contributions have been committed to VTR’s GitHub repository. Finally, we evaluated our tool with three VTR FPGA architectures and the VTR Verilog circuits.

In conclusion, our tool furnishes circuit designers with incrementally more information, and for more types of designs, which can help them estimate the activity and thus the power requirements of their FPGA circuits. In the future, we want to extend the visualization tool of Odin II [13] to also include activity-related information per node. This could enhance the identification of hot spots in circuits, resulting in rapid and power-aware prototyping. Another future task will be the evaluation of our improved activity model using actual FPGA measurements and, in particular, the investigation of the correlation between our SPI metric and power consumption.

ACKNOWLEDGMENTS

The authors would like to acknowledge the contributions of Sana Oladi, Stephen A. Mackay and the three anonymous RSP reviewers for proofreading the paper and providing useful comments and suggestions. The authors would also like to thank the Natural Sciences and Engineering Research Council of Canada and CMC Microsystems for their contributions of financial and tool support to this project. Finally, we thank the UNB/IBM Centre for Advanced Studies–Atlantic for access to additional resources for conducting our research.

REFERENCES

[1] Vaughn Betz and Jonathan Rose. 1997. VPR: A new packing, placement and routing tool for FPGA research. In Field-Programmable Logic and Applications. Springer, 213–222.
[2] Stephen D. Brown, Robert J. Francis, Jonathan Rose, and Zvonko G. Vranesic. 2012. Field-programmable gate arrays. Vol. 180. Springer Science & Business Media.
[3] Assem AM Bsoul, Steven JE Wilton, Kuen Hung Tsoi, and Wayne Luk. 2016. An FPGA Architecture and CAD Flow Supporting Dynamically Controlled Power Gating. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 24, 1 (2016), 178–191.
[4] Intergovernmental Panel On Climate Change. 2014. IPCC. Climate change (2014).
[5] Peter Jamieson, Kenneth B. Kent, Farnaz Gharibian, and Lesley Shannon. 2010. Odin II - an open-source Verilog HDL synthesis tool for CAD research. In Field-Programmable Custom Computing Machines (FCCM), 2010 18th IEEE Annual International Symposium on. IEEE, 149–156.
[6] Steve Kilts. 2007. Advanced FPGA design: architecture, implementation, and optimization. John Wiley & Sons.
[7] Ian Kuon and Jonathan Rose. 2007. Measuring the gap between FPGAs and ASICs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 26, 2 (2007), 203–215.
[8] Julien Lamoureux and Steven JE Wilton. 2006. Activity estimation for field-programmable gate arrays. In Field Programmable Logic and Applications, 2006. FPL’06. International Conference on. IEEE, 1–8.
[9] Jingjing Li, Konstantin Nasartschuk, and Kenneth B Kent. 2014. System-on-chip processor using different FPGA architectures in the VTR CAD flow. In Rapid System Prototyping (RSP), 2014 25th IEEE International Symposium on. IEEE, 72–77.
[10] Joseph C Libby, Ashley Furrow, Paddy O’Brien, and Kenneth B Kent. 2011. A framework for verifying functional correctness in Odin II. In Field-Programmable Technology (FPT), 2011 International Conference on. IEEE, 1–6.
[11] Jason Luu, Jeffrey Goeders, Michael Wainberg, Andrew Somerville, Thien Yu, Konstantin Nasartschuk, Miad Nasr, Sen Wang, Tim Liu, and Nooruddin Ahmed. 2014. VTR 7.0: Next generation architecture and CAD system for FPGAs. ACM Transactions on Reconfigurable Technology and Systems (TRETS) 7, 2 (2014), 6.
[12] Jason Luu, Conor McCullough, Sen Wang, Safeen Huda, Bo Yan, Charles Chiasson, Kenneth B Kent, Jason Anderson, Jonathan Rose, and Vaughn Betz. 2014. On hard adders and carry chains in FPGAs. In Field-Programmable Custom Computing Machines (FCCM), 2014 IEEE 22nd Annual International Symposium on. IEEE, 52–59.
[13] Konstantin Nasartschuk, Rainer Herpers, and Kenneth B Kent. 2012. Visualization support for FPGA architecture exploration. In Rapid System Prototyping (RSP), 2012 23rd IEEE International Symposium on. IEEE, 128–134.
[14] Huansheng Ning. 2013. Unit and ubiquitous Internet of Things. CRC Press.
[15] Panagiotis Patros and Kenneth B Kent. 2016. Automatic detection and elision of reset sub-circuits. In Rapid System Prototyping (RSP), 2016 International Symposium on. IEEE, 1–7.
[16] Kara KW Poon, Steven JE Wilton, and Andy Yan. 2005. A detailed power model for field-programmable gate arrays. ACM Transactions on Design Automation of Electronic Systems (TODAES) 10, 2 (2005), 279–302.
[17] Jonathan Rose, Jason Luu, Chi Wai Yu, Opal Densmore, Jeffrey Goeders, Andrew Somerville, Kenneth B. Kent, Peter Jamieson, and Jason Anderson. 2012. The VTR project: architecture and CAD for FPGAs from Verilog to routing. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays. ACM, 77–86.
[18] Andrew Somerville and Kenneth B Kent. 2012. Improving memory support in the VTR flow. In Field Programmable Logic and Applications (FPL), 2012 22nd International Conference on. IEEE, 197–202.
[19] Elias Todorovich, M Gilabert, Gustavo Sutter, Sergio López-Buedo, and E Boemo. 2002. A tool for activity estimation in FPGAs. Field-Programmable Logic and Applications: Reconfigurable Computing Is Going Mainstream (2002), 21–33.
[20] Stephen M Trimberger. 2015. Three ages of FPGAs: A retrospective on the first thirty years of FPGA technology. Proc. IEEE 103, 3 (2015), 318–331.
[21] Bo Yan and Kenneth B Kent. 2015. Hard block reduction and synthesis improvements in Odin II. In Rapid System Prototyping (RSP), 2015 International Symposium on. IEEE, 126–132.