
2017 2nd Asia-Pacific Conference on Intelligent Robot Systems

System Generator Model-Based FPGA Design Optimization and Hardware Cosimulation for Lorenz Chaotic Generator

Lei Zhang Faculty of Engineering and Applied Science University of Regina, Regina S4S0A2 Canada e-mail: [email protected]

Abstract: Chaotic systems can be synchronized and used for secure communication to transmit video, audio and text files. Field Programmable Gate Arrays (FPGAs) are well suited to implementing high-speed, low-cost and low-power embedded communication systems. In this paper, a model-based design approach is presented for FPGA implementation and optimization of chaotic generators. The Lorenz attractor is significant in the study of chaotic systems and is used as the design subject in this paper. The conceptual model is built using MATLAB Simulink, and the equivalent hardware model is created using Xilinx System Generator for FPGA implementation. The design models are created with 32-bit and 16-bit fixed-point data formats and implemented on an FPGA to evaluate design performance, including the maximum operating clock frequency, resource utilization and power consumption. The 32-bit and 16-bit fixed-point models are further optimized by using timing analysis and adding delays to break critical paths and improve timing performance. The implementation results show that the proposed design and optimization approach approximately triples the maximum operating frequency for both the 32-bit and 16-bit fixed-point configurations. The FPGA hardware co-simulation results demonstrate the anticipated Lorenz chaotic generator outputs for both designs.

Keywords: chaotic generator; Lorenz attractor; hardware co-simulation; FPGA

978-1-5090-6793-0/17/$31.00 ©2017 IEEE

I. INTRODUCTION

Chaotic systems are aperiodic and appear random in the time domain [1], but they can be synchronized and used for message encryption in communication systems. Various chaotic generators have been implemented on FPGAs in real time for synchronous communication applications [2]. It has been shown that two chaotic systems with the same parameter settings will synchronize with each other [3]. One major challenge in embedded communication system design is to achieve the specified security level under constraints on hardware resources such as memory size and computational capacity, while meeting the performance requirements for speed and power.

A Lorenz chaotic generator conceptual model and its Xilinx System Generator model are presented in [4], using a 32-bit signed fixed-point format with 18 fractional bits and a step size (dt) of 0.01, and achieving a maximum frequency of 2.5 MHz. Another Lorenz attractor implementation, on a Xilinx Spartan 3E FPGA device, is reported in [5], using a 32-bit signed fixed-point data format with 20 fractional bits. A further Lorenz attractor hardware implementation is given in [6], [7], using the 16Q16 fixed-point data format on a Virtex II-Pro FPGA. All of the above designs use a 32-bit fixed-point data format.

In this paper, an extended model-based design approach and optimization methods are presented for FPGA implementation of chaotic systems using Xilinx System Generator on a Zynq 7z020 FPGA. Both 32-bit and 16-bit fixed-point data formats are used for the model design. Timing analysis is carried out based on the critical paths listed in the FPGA implementation timing reports. The models are further optimized with added delays to break long data paths and improve timing performance.

The Lorenz attractor is represented by equation (1):

$$
\begin{aligned}
\frac{dx}{dt} &= \sigma (y - x) \\
\frac{dy}{dt} &= \rho x - y - xz \\
\frac{dz}{dt} &= -\beta z + xy
\end{aligned}
\tag{1}
$$

where the parameters are $\sigma = 10$, $\rho = 28$ and $\beta = 8/3$, the initial values are $x_0 = 10$, $y_0 = 20$ and $z_0 = 30$, and the step size is $dt = 0.01$. The solutions of these three-dimensional ordinary differential equations (ODEs) depend on the initial values.

A. Forward Euler Method

When implementing a chaotic generator such as the Lorenz attractor on an FPGA, a simple discrete integration method can be used to reduce FPGA resource usage, but it may introduce rounding errors and cause the output not to converge. Fixed-point representation errors will always be present; they can be accepted so long as the solutions to the differential equations converge at the given step size [8]. The Euler method is a first-order numerical procedure for solving ODEs. The forward Euler method is based on a truncated Taylor series expansion. Given n ODEs in n variables as in equation (2), the forward Euler method for FPGA implementation is given by equation (3). Note that a large step size (dt) can introduce anomalies in chaotic generators [9]. Other numerical methods, such as the fourth-order Runge-Kutta method (RK-4), can also be used for solving ODEs [7].

$$
\begin{aligned}
\frac{dx_1}{dt} &= f_1(x_1, \ldots, x_n) \\
&\;\;\vdots \\
\frac{dx_n}{dt} &= f_n(x_1, \ldots, x_n)
\end{aligned}
\tag{2}
$$

Applying the forward Euler method gives:

$$
\begin{aligned}
x_1(t + dt) &= x_1(t) + f_1(x_1(t), \ldots, x_n(t))\,dt \\
&\;\;\vdots \\
x_n(t + dt) &= x_n(t) + f_n(x_1(t), \ldots, x_n(t))\,dt
\end{aligned}
\tag{3}
$$
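As an illustration of equation (3) applied to equation (1), the following minimal Python sketch iterates the forward Euler update in double-precision floating point, using the parameters, initial values and step size stated above. The function names (`lorenz_rhs`, `euler_step`, `simulate`) are hypothetical and chosen only for this example; the sketch is a software reference, not the hardware model described in this paper.

```python
def lorenz_rhs(x, y, z, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz system, equation (1)."""
    return sigma * (y - x), rho * x - y - x * z, -beta * z + x * y


def euler_step(state, dt=0.01):
    """One forward Euler update, equation (3), for the three Lorenz variables."""
    x, y, z = state
    fx, fy, fz = lorenz_rhs(x, y, z)
    return x + fx * dt, y + fy * dt, z + fz * dt


def simulate(steps, state=(10.0, 20.0, 30.0), dt=0.01):
    """Return the visited states, beginning with the initial values."""
    trajectory = [state]
    for _ in range(steps):
        state = euler_step(state, dt)
        trajectory.append(state)
    return trajectory


if __name__ == "__main__":
    traj = simulate(5000)
    print("state after one step:", traj[1])
    # The product x*z is one of the intermediate values whose magnitude
    # decides how many integer bits a fixed-point format needs.
    print("largest |x*z| seen:", max(abs(x * z) for x, _, z in traj))
```

A floating-point run of this kind can serve the same purpose as the conceptual Simulink model: it gives the range of the state variables and intermediate products before a fixed-point format is chosen.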

B. Fixed-point FPGA Implementation

The Lorenz attractor model is designed using Xilinx System Generator (XSG) and Simulink. The Simulink blocks are configured with 32-bit and 16-bit fixed-point data formats respectively. The fixed-point models are further optimized to improve timing performance and reduce FPGA resource utilization. The optimization approach is first to create the Simulink conceptual model of the Lorenz attractor and obtain simulation results in the 32-bit floating-point data format. Then, based on the output data range, an initial fixed-point data format is selected for the XSG blocks to create the hardware model for FPGA implementation.

A commonly used signed fixed-point representation, Qm.n, has m integer bits, n fractional bits and 1 sign bit. Its range is from $-2^m$ to $2^m - 2^{-n}$, and its precision is $2^{-n}$. In a System Generator model, a signed fixed-point data format is written as Fixaa_bb, where aa is the total number of bits and bb is the number of fractional bits. For example, Fix32_18 represents a 32-bit fixed-point data format with 1 sign bit, 18 fractional bits and 13 integer bits.

The conceptual simulation of the Lorenz attractor shows that the intermediate values in the model lie approximately between -1024 and +1024. Therefore, at least 10 bits are required for the integer part to avoid overflow. To compare the implementation results with the referenced design [5], the Fix32_18 data format is used.

When the Fix16_5 data format, with 1 sign bit and 10 integer bits, is used for the 16-bit model design, only 5 bits are left for the fraction, so the fractional precision is $2^{-5}$ (0.03125). This is problematic because a step size (dt) smaller than this precision, such as dt = 0.01, is rounded to 0. Therefore, the dt multiplier blocks are set to Fix16_9, with 9 fractional bits. Moreover, the x ∗ z multiplier block overflows during simulation. Therefore, this individual multiplier block is configured with the Fix16_4 data format, with 11 integer bits, to avoid overflow.
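The range and precision arithmetic above can be checked with a short Python sketch. The helper names `q_range` and `quantize` are hypothetical, and the rounding and saturation behaviour is an assumption made for illustration; the quantization and overflow modes of the actual XSG blocks are configurable and are not specified here.

```python
def q_range(total_bits, frac_bits):
    """Return (minimum, maximum, precision) of a signed fixed-point format.

    One bit is the sign bit, so the format has
    m = total_bits - 1 - frac_bits integer bits.
    """
    m = total_bits - 1 - frac_bits
    precision = 2.0 ** -frac_bits
    return -(2.0 ** m), 2.0 ** m - precision, precision


def quantize(value, total_bits, frac_bits):
    """Round to the nearest representable value, then saturate (assumed modes)."""
    lo, hi, precision = q_range(total_bits, frac_bits)
    rounded = round(value / precision) * precision
    return min(max(rounded, lo), hi)


if __name__ == "__main__":
    print("Fix32_18:", q_range(32, 18))
    print("Fix16_5 :", q_range(16, 5))
    print("dt with 5 fractional bits:", quantize(0.01, 16, 5))
    print("dt with 9 fractional bits:", quantize(0.01, 16, 9))
```

With 5 fractional bits the step size 0.01 quantizes to 0, which would freeze the integrators; with 9 fractional bits it becomes 5/512 = 0.009765625, a small but nonzero step.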

C. Maximum Clock Frequency

A Xilinx Zedboard with a Zynq 7020 FPGA is used for FPGA implementation and hardware co-simulation. This board has a 100 MHz system clock, with a 10 ns clock period (Ts). Therefore, the timing constraint for the clock period of the target design is set to 10 ns. Timing closure is the process by which an FPGA design is modified to meet its timing requirements. The maximum frequency of an FPGA design is not reported directly by the Vivado software tool. It can be calculated from the clock period and the Worst Negative Slack (WNS) given in the implementation timing report, as in equation (4):

$$
f_{max} = \frac{1}{T_s - \mathrm{WNS}}
\tag{4}
$$

where $f_{max}$ is the maximum frequency and $T_s$ is the clock period. When an implementation completes and meets all timing constraints, the WNS is positive, which means that a faster clock (a shorter Ts) can be used. If an implementation fails to meet all timing constraints, the WNS is negative, and the timing constraint for Ts can be increased to achieve timing closure without changing the design. The fmax can only be calculated when the implementation completes without timing failure.

II. 32-BIT MODEL FPGA IMPLEMENTATION

The 32-bit fixed-point XSG model is created using the Fix32_18 data format, with 1 sign bit, 18 fractional bits and 13 integer bits. The model is used to generate a Vivado project for FPGA implementation. The same design model is implemented with different timing constraint settings for the clock period. This method can be used to find the maximum frequency of a design that has no predefined timing requirement. Timing closure is achieved by increasing the clock period Ts when a timing failure (negative WNS) occurs during implementation. The new Ts should be set greater than the current Ts minus the negative WNS, that is, greater than Ts + |WNS|. The implementation results are listed in Table I.

TABLE I. 32-BIT FIXED-POINT FPGA IMPLEMENTATION

| Parameter | Spartan 3E (a) | Zynq 7020 | Zynq 7020 | Zynq 7020 |
|---|---|---|---|---|
| Clock period Ts (ns) | 20 | 40 | 30 | 25 |
| Worst Negative Slack (ns) | NA | 9.816 | 2.457 | (-0.683) |
| Maximum Frequency (MHz) | 18.03 | 33.13 | 36.31 | NA |
| Look-up Table (LUT) | 1912 | 868 | 868 | 868 |
| Registers/Flip-flop (FF) | 144 | 96 | 96 | 96 |
| Slices | 1029 | 338 | 338 | 343 |
| DSP48E1 | 8 | 8 | 8 | 8 |
| Total On-chip Power (W) | NA | 0.153 | 0.154 | 0.173 |

(a) Reference design in [5]

The implementation succeeds at Ts = 30 ns but fails at Ts = 25 ns. The fmax at 30 ns is 36.31 MHz, about twice that of the referenced design on the Spartan-3E FPGA (18.03 MHz). The implementation results also show that the power consumption increases as the clock frequency increases.

III. 32-BIT MODEL OPTIMIZATION

The 32-bit fixed-point design is optimized to meet the timing requirement of a 10 ns clock period at the 100 MHz system clock frequency. The implementation with a 25 ns clock period reports a number of critical paths with long delay for register 1. To raise the maximum frequency to the 100 MHz system clock frequency of the Zedboard, several optimization methods are applied [10], [11].

Figure 1. Lorenz Attractor 32-bit Fixed-point Optimization Model with 4 Delays

First, two delay blocks are added to each output. Each delay block has a latency of 1. This arrangement allows one register to be placed next to the FPGA on-chip logic and the other to be packed into an IOB (Input/Output Block) on the FPGA, which avoids creating a critical path from logic to IOB. A single delay block with a latency of 2 is instead implemented as an SRL16 shift register in the FPGA and does not give the same result for the model-based design. The implementation results are improved by this optimization, as shown in Table II. At a 25 ns clock period the design meets the timing requirement with a small margin of 0.020 ns WNS, but it fails at a 20 ns clock period with a negative WNS of -5.181 ns.

TABLE II. 32-BIT IMPLEMENTATION RESULTS

| Parameter | 40 ns | 30 ns | 25 ns | 20 ns |
|---|---|---|---|---|
| Clock period Ts (ns) | 40 | 30 | 25 | 20 |
| Worst Negative Slack (ns) | 10.167 | 2.983 | 0.020 | (-5.181) |
| Maximum Frequency (MHz) | 33.52 | 37.01 | 40.03 | NA |
| Look-up Table (LUT) | 868 | 868 | 868 | 868 |
| Registers/Flip-flop (FF) | 320 | 320 | 320 | 320 |
| Slices | 375 | 388 | 400 | 389 |
| DSP48E1 | 8 | 8 | 8 | 8 |
| Total On-chip Power (W) | 0.154 | 0.165 | 0.175 | 0.188 |
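The maximum frequencies in Tables I and II follow from equation (4). The sketch below, with the hypothetical helper names `fmax_mhz` and `relaxed_period_ns`, reproduces that arithmetic in Python. Returning `None` for a negative WNS is a convention chosen for this example, mirroring the NA entries in the tables.

```python
def fmax_mhz(ts_ns, wns_ns):
    """Maximum frequency in MHz from equation (4), or None if timing failed.

    ts_ns is the constrained clock period and wns_ns the Worst Negative
    Slack from the implementation timing report, both in nanoseconds.
    """
    if wns_ns < 0:
        return None  # negative WNS: the implementation did not meet timing
    return 1000.0 / (ts_ns - wns_ns)


def relaxed_period_ns(ts_ns, wns_ns):
    """Bound Ts - WNS that the next trial clock period should exceed."""
    return ts_ns - wns_ns


if __name__ == "__main__":
    # (Ts, WNS) pairs taken from Table II.
    for ts, wns in [(40, 10.167), (30, 2.983), (25, 0.020), (20, -5.181)]:
        print(ts, "ns ->", fmax_mhz(ts, wns))
    # Failing run from Table I: Ts = 25 ns, WNS = -0.683 ns.
    print("next Ts should exceed", relaxed_period_ns(25, -0.683), "ns")
```

For the failing 25 ns run in Table I, the bound is 25.683 ns, which is consistent with the 30 ns constraint meeting timing.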

Second, delay blocks are added to cut critical paths. Additional delays are also added on the non-critical paths so that all paths have the same number of delays and the output signals are correctly aligned. Timing closure is achieved after adding four delay blocks on the critical data path. The optimization is carried out in the following steps:

* Add two delays (latency) to the Multiplier blocks and one delay to the Constant Gain block; add further delays to match the delays on the three data paths.
* Add delay blocks to the feedback paths of the three integrators; set the latency of the three feedback delay blocks to 4 to match the delays on the forward data paths.
* Set the Adder/Subtracter blocks to be implemented using the dedicated FPGA resource DSP48. This saves FPGA logic resources and achieves better performance.
* Set the Multiplier blocks to optimize for speed and use embedded multipliers (DSP48). This saves FPGA logic resources and achieves better performance.
* Set the Constant Multiplier blocks to be implemented using Distributed RAM. Distributed RAM is implemented using FPGA logic resources and is suitable for small memories. It can be relatively faster and more flexible than Block RAM in the ‘Place and Routing’ process, which helps to meet the timing requirement of the design.
* Add a Down Sample block to each output. The down-sample rate equals the total number of delays on each of the three data paths for x, y and z. Each path has a total delay of 5, so set the rate of the three output Down Sample blocks to 5. Select ‘first value of frame’ as the output sample.
* Add one additional delay block before the ‘Gateway Out’ block. Set the block to be implemented using ‘behavioral HDL’. This block will be bound to the IOB and hence shortens the long path from the logic to the output ports.
* Balance the delays by cutting the critical paths with timing failures listed in the implementation report.

Fig. 1 shows the optimized design model. The implementation results for the original and optimized designs are listed in Table III. In the original design model, without delay blocks, the three outputs are generated within one clock cycle. By adding delay blocks, long critical paths are cut into shorter segments to meet the timing requirement. However, the calculation then takes multiple clock cycles to complete, which reduces the data throughput. The trade-off between timing performance and throughput needs to be considered for specific applications.

TABLE III. 32-BIT OPTIMIZATION RESULTS

| Parameter | No delay | 3-delay | 4-delay |
|---|---|---|---|
| Sample period Ts (ns) | 10 | 10 | 10 |
| Worst Negative Slack (ns) | (-15.603) | (-2.454) | 0.059 |
| Maximum Frequency (MHz) | NA | NA | 100.59 |
| Look-up Table (LUT) | 868 | 875 | 945 |
| Registers/Flip-flop (FF) | 96 | 875 | 1142 |
| Slices | 336 | 408 | 444 |
| DSP48E1 | 8 | 8 | 8 |
| Total On-chip Power (W) | 0.252 | 0.171 | 0.177 |

IV. 16-BIT MODEL FPGA IMPLEMENTATION & OPTIMIZATION

Fig. 2 shows the optimized FPGA hardware implementation model with the 16-bit fixed-point data format, using the same optimization approach as the 32-bit model. The challenge for the 16-bit fixed-point model design is to avoid overflow of the chaotic generator. Overflow occurs when all blocks are configured with the Fix16_5 data format. Trial and error with the model simulation is used to find formats that correctly represent the data range and data precision. The implementation results of the 16-bit fixed-point model and the optimized model for the Lorenz generator are listed in Table IV.

TABLE IV. 16-BIT FIXED-POINT FPGA IMPLEMENTATION AND OPTIMIZATION RESULTS

| Model | Non-Optimized | Non-Optimized | Optimized | Optimized |
|---|---|---|---|---|
| Sample period Ts (ns) | 25 | 20 | 20 | 10 |
| Worst Negative Slack (ns) | 2.429 | -1.117 | 4.808 | 2.564 |
| Maximum Frequency (MHz) | 44.30 | NA | 65.82 | 136.13 |
| Look-up Table (LUT) | 376 | 376 | 377 | 376 |
| Registers/Flip-flop (FF) | 48 | 48 | 48 | 362 |
| Slices | 149 | 151 | 156 | 185 |
| DSP48E1 | 2 | 2 | 2 | 2 |
| Total On-chip Power (W) | 0.145 | 0.151 | 0.141 | 0.200 |
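The throughput trade-off described for the optimized 32-bit model can be put in rough numbers. The sketch below assumes that the effective output sample rate is the clock frequency divided by the number of clock cycles needed per output sample, which for the 32-bit model is taken to be the down-sample rate of 5. Both this relation and the helper name `output_rate_msps` are assumptions made for illustration, not results reported by the implementation tools.

```python
def output_rate_msps(fmax_mhz, cycles_per_sample):
    """Effective output rate in MS/s when one result takes several clock cycles.

    Assumes one valid output every `cycles_per_sample` clock cycles,
    as with a Down Sample block of that rate on each output.
    """
    if cycles_per_sample < 1:
        raise ValueError("cycles_per_sample must be at least 1")
    return fmax_mhz / cycles_per_sample


if __name__ == "__main__":
    # Original 32-bit model: one result per cycle at 36.31 MHz (Table I).
    print("original :", output_rate_msps(36.31, 1), "MS/s")
    # Optimized 32-bit model: 100.59 MHz (Table III), down-sample rate 5.
    print("optimized:", output_rate_msps(100.59, 5), "MS/s")
```

Under this assumption the higher clock frequency does not by itself raise the sample rate, which is why the trade-off has to be weighed per application.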

Figure 2. Lorenz Attractor 16-bit Fixed-point Optimization Model with 1 Delay

The original 16-bit model achieves a maximum frequency of 44.30 MHz at a 25 ns Ts, but has a negative WNS at a 20 ns Ts and fails to meet the timing requirement. After adding one delay to each data path, the optimized model achieves timing closure at a 20 ns Ts with a maximum frequency of 65.82 MHz, as well as at a 10 ns Ts with a maximum frequency of 136.13 MHz, more than triple that of the original model. Adding one delay block halves the data throughput, but this can be compensated by pipelining, at the cost of increased FPGA resource usage.

V. FPGA HARDWARE CO-SIMULATION

The Zedboard is used for hardware co-simulation to evaluate the Lorenz attractor outputs of the fixed-point models. The design models are converted into an FPGA configuration image and downloaded to the FPGA on the Zedboard through the JTAG configuration port. The models run on the FPGA device, and the generated results are sent back to the PC via JTAG and displayed in MATLAB. The hardware co-simulation results demonstrate the same chaotic outputs for the Lorenz attractor for the 32-bit fixed-point model. The hardware co-simulation block generated from the optimized design is shown in Fig. 3. The hardware co-simulation outputs x, y, z and their 3D outputs are shown in Fig. 4. The hardware co-simulation of the optimized 16-bit fixed-point model also generates correct outputs for the Lorenz attractor.

VI. CONCLUSIONS

This paper demonstrated the System Generator model-based FPGA design approach and implementation results for a Lorenz chaotic generator. This design approach can be extended to FPGA implementation of other chaotic generator designs. The aim is to increase the frequency of the Lorenz chaotic generator with FPGA acceleration. Timing closure is achieved by timing analysis of critical paths and by adding delays to break those paths. The implementation results show that the optimized design models achieve better design performance, increasing the maximum operating frequency approximately threefold for both the 32-bit and 16-bit fixed-point models. The optimized 32-bit model achieves a maximum frequency of 100.59 MHz with 5 delays. The optimized 16-bit model achieves a maximum frequency of 136.13 MHz with 1 delay. The FPGA resource usage is reduced by approximately 75% for the 16-bit fixed-point model compared to the equivalent 32-bit fixed-point model. The correct Lorenz attractor outputs are generated by the hardware co-simulation.

Figure 3. Lorenz Attractor Hardware Co-simulation Model

REFERENCES

[1] A. Abel and W. Schwarz, “Chaos communications: Principles, schemes, and system analysis,” Proceedings of the IEEE, vol. 90, no. 5, pp. 691-710, May 2002.
[2] P. Wu, J. Alam, C. Hu, and J. Li, “Controlling unified hyperchaotic system to encryption digital information,” in Proceedings of the 3rd International Conference on Cloud Security and Management, 2015, pp. 118-121.
[3] H. Kamata, T. Endo, and Y. Ishida, “Practical private speech communication system with chaos using digital signal processor,” J. Acoust. Soc. Jpn. (E), vol. 19, no. 6, 1998.
[4] M. Aseeri, M. I. Sobhy, and P. Lee, “Lorenz chaotic model using filed programmable gate array (fpga),” in The 45th Midwest Symposium on Circuits and Systems, vol. 1, 2002.
[5] M. Aseeri and M. I. Sobhy, “Field programmable gate array (fpga) as a new approach to implement the chaotic generators,” in 3rd International Conference on Advanced Engineering Design AED, Prague, Czech Republic, June 2003.
[6] C. Tanougast, Chaos-Based Cryptography: Theory, Algorithms and Application, 2011, ch. 9: Hardware Implementation of Chaos Based Cipher: Design of Embedded Systems for Security Applications.
[7] M. S. Azzaz, C. Tanougast, S. Sadoudi, and A. Dandache, “Realtime fpga implementation of lorenz’s chaotic generator for ciphering telecommunications,” in Circuits and Systems and TAISA Conference, 2009. NEWCAS-TAISA ’09. Joint IEEE North-East Workshop on, June 2009, pp. 1-4.
[8] J. M. E. Cuautle and L. Fraga, Engineering Applications of FPGAs: Chaotic Systems, Artificial Neural Networks, Random Number Generators, and Secure Communication Systems. Switzerland: Springer, 2016.
[9] B. Muthuswamy and S. Banerjee, A Route to Chaos Using FPGAs, Volume I: Experimental Observations. Switzerland: Springer, 2015.
[10] Xilinx, Vivado Design Suite User Guide: Model-Based DSP Design Using System Generator, UG897, v2016.1 ed., Xilinx, Apr. 2016.
[11] Xilinx, Vivado Design Suite Tutorial: Model-Based DSP Design Using System Generator, UG948, v2015.3 ed., Xilinx, Oct. 2015.

Figure 4. 32-bit Fixed-point Hardware Co-simulation Outputs
