Enabling New Product Innovation Across Markets with Zynq-7000 All Programmable SoC, Vivado HLS and IP Integrator
Olivier TREMOIS, XILINX EMEA&I DSP Specialist
IECON 2013
Agenda
• Xilinx: A Generation Ahead
• Zynq-7000 Architecture
• Vivado Design Suite
• Live Demo
• All Programmable Abstraction
Xilinx: A Generation Ahead
Semiconductor Market Leader in All Programmable Devices
• Founded 1984
• $2.17B FY13 revenue
• ~50% market segment share
• 3,000 employees worldwide
• 20,000 customers worldwide
• 2,500+ patents
(World map figure. Legend: Headquarters; R&D; Sales, Marketing, Support; Manufacturing (Fab, Assy, Test))
Driving Industry Mandates
• Programmable Imperative
• Programmable Systems Integration
• Insatiable Intelligent Bandwidth
All Programmable and Smarter Systems
Smarter Networks
Smarter Data Centers
Smarter Factories
Smarter Vision
Smarter Energy
Moving A Generation Ahead
28nm
Programmable Logic Devices
All Programmable Devices
Enables Programmable Logic
Enables All Programmable & Smarter Systems
New All Programmable & Smarter Competencies
(Figure labels: Homogeneous, Heterogeneous, SerDes, Analog-to-Digital)
• 3D IC expertise and supply chain
• SoC and embedded software
• World-class SerDes and analog mixed signal
• SmartCORE IP for Smarter Systems
• Next-generation design automation
Charting an Aggressive Course Forward
(Roadmap figure. Axis labels: Process; System-level price/performance/watt. Process nodes: 28nm HPL, 20nm SoC, 14-16nm FinFET, 10nm. Product lines: FPGA, SoC, 3D IC, grouped under "Programmable systems integration".)
Staying A Generation Ahead at 20nm
• 20nm Portfolio: industry’s first ASIC-class All Programmable architecture for FPGAs and 2nd-generation SoCs and 3D ICs
• Product: “co-optimized” with Vivado for extra performance, power and integration
• Productivity: 1.5-2x system-level performance and systems integration
• Smarts: smarter solutions for smarter networks, data centers, vision and control
Zynq-7000 All Programmable SoC
Why All Programmable?
• Reduce exploding design costs
• Dramatically increase flexibility
• Leverage broad technology portfolio
  – Logic & high-speed I/O
  – S/W-programmable ARM systems
  – 3-D IC
  – Analog/mixed-signal
  – System-to-IC design tools
  – Intellectual property
Benefits: Programmable Systems Integration, Increased System Performance, BOM Cost Reduction, Total Power Reduction, Accelerated Design Productivity
Build better electronic systems with fewer chips… faster!
The First All Programmable SoC
(Timeline figure: 2010 to 2012)
• Production: NOW
• 350+ unique customers actively designing
• 100+ AP SoC specific partners
• All major OSs supported and in use
• 20+ different development boards
• Won every award it entered
A Unique Value Proposition
Breakthrough in “All Programmable” SoC-level integration
• ASIC levels of performance and power consumption
• Flexibility of an FPGA
• Ease of programming of a microprocessor

Programmable Systems Integration: ALL programmable platform (processor, PL, DSP, IOs & AMS); improved security in a fully integrated solution
Increased System Performance: 1GHz dual ARM Cortex-A9, much higher throughput vs. 2-chip solutions, >10X SW acceleration with PL
BOM Cost Reduction: up to 40% system cost savings through fewer components (power supplies, memories, …) and higher volumes with a platform approach for better pricing
Total Power Reduction: up to 50% through processor power control modes, the 28nm HPL process, and integration that saves chip-to-chip interconnection power
Accelerated Design Productivity: flexible and scalable platform; comprehensive ecosystem of tools, OS & IP; high-level synthesis flow for faster PL development
Zynq-7000 AP SoC Applications Mapping
Z-7100
Zynq-7000 AP SoC Architecture
Complete ARM-based Processing System
• Processor Core Complex
  – Dual ARM Cortex-A9 MPCore with NEON™ extensions
  – Single / double precision floating point support
  – Up to 1 GHz operation
• High-BW Memory
  – Internal: L1 cache 32KB/32KB (per core), L2 cache 512KB unified
  – On-chip memory of 256KB
  – Integrated memory controllers (DDR3, DDR3L, DDR2, LPDDR2, 2x QSPI, NOR, NAND Flash)
• AMBA Open Standard Interconnect
  – High-bandwidth interconnect between Processing System and Programmable Logic
  – ACP port for enhanced hardware acceleration and cache coherency for additional soft processors
• Integrated Memory-Mapped Peripherals
  – 2x USB 2.0 (OTG) w/DMA
  – 2x tri-mode Gigabit Ethernet w/DMA
  – 2x SD/SDIO w/DMA
  – 2x UART, 2x CAN 2.0B, 2x I2C, 2x SPI, 32b GPIO
Processing System Ready to Program
Primary System Interconnects: Maximizing Data Transfers
• Programmable Logic to Memory
  – 2 ports to DDR controller
  – 1 port to OCM SRAM
• Central Interconnect
  – Crossbar switches for high-bandwidth communications
• Processing System Master Ports
  – 2x 32b AXI ports from Processing System to Programmable Logic
  – Connects CPU block to common peripherals, through the Central Interconnect
• Processing System Slave Ports
  – 2x 32b AXI ports from Programmable Logic to Processing System
• ACP (Accelerator Coherency Port)
  – Low-latency cache-coherent port for programmable logic
Enables application-specific customizations with a standard programming model
(Block diagram labels: APU, L2 Cache, OCM, DMA, Central Interconnect, DDR Controller, NAND, NOR/SRAM, QSPI Controllers, Peripherals, Programmable Logic to Memory, Master/Slave AXI Interfaces to Programmable Logic, ACP. Arrow direction shows control; data flows in both directions. Legend: Configurable AXI3 32 bit/64 bit; AXI3 64 bit; AXI3 32 bit; AHB 32 bit; APB 32 bit.)
Tightly Integrated Programmable Logic
• Built with state-of-the-art 7 Series programmable logic
  – Artix-7 & Kintex-7 FPGA fabric
  – 28K-444K logic cells
  – 430K-6.6M equivalent ASIC gates
• Over 3000 internal interconnects
  – Up to ~100Gb of BW
  – Memory-mapped interfaces
• Integrated analog capability
  – Dual multi-channel 12-bit A/D converter
  – Up to 1Msps
• Enables massive parallel processing
  – Up to 2020 DSP blocks delivering over 2662 GMACs
Note: ASIC equivalent gates based on analysis over a broad range of designs
Scalable Density and Performance
7-Series Programmable Logic Fabric
IP portability between 7-series FPGA and Zynq-7000 AP SoC
• Logic Cells: LUT6 + 2 DFF
• DSP Slices: 25x18 + Acc. + pre-adder
• Block RAM: 36Kb blocks, Mem. or FIFO
• Clock Management: MMCM + PLL
• Flexible I/O: multi-standard, high speed
• PCI-Express: Gen1 / Gen2, Endpoint or Root Port
• Flexible Transceivers: multi-protocol, up to 12.5Gbps
• A/D converters: 2x 12bit 1MSPS
Programmer’s View of Programmable Logic: Simple Memory-Mapped Interface

Start Address   Description
0x0000_0000     External DDR RAM
0x4000_0000     Custom Peripherals (Programmable Logic including PCIe)
0xE000_0000     Fixed I/O Peripherals
0xF800_0000     Fixed Internal Peripherals (Timers, Watchdog, DMA, Interconnect)
0xFC00_0000     Flash Memory
0xFFFC_0000     On-Chip Memory

Programmer’s View of Custom Accelerators & Peripherals
Start Address   Description
0x4000_0000     Accelerator #1 (Video Scaler)
0x6000_0000     Accelerator #2 (Video Object Identification)
0x8000_0000     Peripheral #1 (Display Controller)
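Code Snippet
int main(void)
{
    /* C literals cannot contain underscores, and an address needs an explicit cast */
    int *data = (int *)0x10000000;                      /* buffer in external DDR RAM */
    volatile int *accel1 = (volatile int *)0x40000000;  /* Accelerator #1 (Video Scaler) */

    // Pure SW processing
    Process_data_sw(data);

    // HW Accelerator-based processing
    Send_data_to_accel(data, accel1);
    process_data_hw(accel1);
    Recv_data_from_accel(data, accel1);
    return 0;
}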
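The memory-mapped view above can be sketched as a handful of register accessors. This is only an illustration: the register layout (REG_CTRL, REG_STATUS, REG_DATA_IN, REG_DATA_OUT) and the helper names are hypothetical and do not come from any Xilinx driver. The base pointer is passed in, so the same code can be exercised on a host against an ordinary array standing in for the accelerator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register layout: word offsets from the accelerator base. */
enum { REG_CTRL = 0, REG_STATUS = 1, REG_DATA_IN = 2, REG_DATA_OUT = 3 };

/* volatile forces every access to reach the bus instead of being cached in a register. */
static inline void reg_write(volatile uint32_t *base, size_t reg, uint32_t value)
{
    base[reg] = value;
}

static inline uint32_t reg_read(volatile uint32_t *base, size_t reg)
{
    return base[reg];
}

/* Hand one operand to the accelerator, then set the (assumed) start bit. */
static void accel_start(volatile uint32_t *base, uint32_t operand)
{
    assert(base != NULL);
    reg_write(base, REG_DATA_IN, operand);
    reg_write(base, REG_CTRL, 1u);
}

/* Bit 0 of the (assumed) status register signals completion. */
static int accel_is_done(volatile uint32_t *base)
{
    assert(base != NULL);
    return (reg_read(base, REG_STATUS) & 1u) != 0u;
}
```

On the target, the base would be a cast of an address from the table, for example 0x40000000 for Accelerator #1; on a host, a four-word array is enough to check the access pattern.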
Flexible External I/O
• 54 dedicated peripheral I/Os
  – Supports integrated peripherals
  – Static memory (NAND, NOR, QSPI)
  – More I/Os available through the Programmable Logic
• 73 dedicated memory I/Os
  – DDR3 / DDR3L / DDR2 / LPDDR2 memory interfaces
  – Configurable as 16-bit or 32-bit
• Up to 400 multi-standard and high-performance I/Os
  – Up to 250 3.3V-capable multi-standard I/Os
  – Up to 150 high-performance I/Os
  – Up to 17 differential ADC inputs
• High-performance integrated serial transceivers (7030 / 7045 / 7100)
  – Up to 16 transceivers
  – Operate at up to 12.5 Gb/s
  – Support popular protocols
  – Integrated PCIe Gen2 block
Flexibility Beyond Any Standard Processing Offering
Efficient Power Control

Device    Estimated Operating Range*
Z-7010    ~1W – 2W
Z-7020    ~2W – 3W
Z-7030    ~3W – 6W
Z-7045    ~5W – 15W
Z-7100    ~6W – 17W

Sleep Mode*: ~100 mW (a single value is shown on the slide, apparently shared by all devices)
* These are typical power numbers
Zynq-7000 Device Portfolio Summary
Scalable platform offers easy migration between devices

Zynq-7000 AP SoC Devices | Z-7010 | Z-7020 | Z-7030 | Z-7045 | Z-7100

Processing System
Processor Core | Dual ARM® Cortex™-A9 MPCore™ (all devices)
Processor Extensions | NEON™ & Single / Double Precision Floating Point (all devices)
Max Frequency | 800 MHz (Z-7010, Z-7020); up to 1 GHz (Z-7030, Z-7045, Z-7100)
Memory | L1 Cache 32KB I / D, L2 Cache 512KB, on-chip Memory 256KB
External Memory Support | DDR3, DDR3L, DDR2, LPDDR2, 2x QSPI, NAND, NOR
Peripherals | 2x USB 2.0 (OTG), 2x Tri-mode Gigabit Ethernet, 2x SD/SDIO, 2x UART, 2x CAN 2.0B, 2x I2C, 2x SPI, 4x 32b GPIO

Programmable Logic
Approximate ASIC Gates | ~430K (28k LC) | ~1.3M (85k LC) | ~1.9M (125k LC) | ~5.2M (350k LC) | ~6.6M (444k LC)
Block RAM | 240KB | 560KB | 1,060KB | 2,180KB | 3,020KB
Peak DSP Performance (Symmetric FIR) | 100 GMACS | 276 GMACS | 593 GMACS | 1334 GMACS | 2662 GMACS
PCI Express® (Root Complex or Endpoint) | - | - | Gen2 x4 | Gen2 x8 | Gen2 x8
Agile Mixed Signal (XADC) | 2x 12bit 1Msps A/D Converter (all devices)

I/O
Processor System IO | 130 (all devices)
Multi Standards 3.3V IO | 100 | 200 | 100 | 212 | 250
Multi Standards High Performance 1.8V IO | - | - | 150 | 150 | 150
Multi Gigabit Transceivers | - | - | 4 | 16 | 16
Zynq-7000 AP SoC Boot Modes and Boot Stages
Boot Mode Selection: Where to boot from?
• Five boot mode signals (wired MIO pins), mode[4:0], indicate the boot source, JTAG mode and PLL bypass selection
• Two voltage mode signals, vmode[1:0], indicate the voltage mode of the multiplexed I/O banks
• All boot devices except JTAG can operate in Secure Mode
  – Secure Mode can be enabled by the user
• SD boot mode supports the FAT file system
  – “BOOT.bin” is the FSBL image
(Figure: JTAG debuggers are NOT SECURE; NAND, NOR, Quad SPI and SD Card* support SECURE BOOT MODE)
* Need to use MIO[45:40] to boot from SD Card
Boot Modes: How to boot?
• Non-secure
  – Standard boot model
• Secure
  – Ability to have secure boot code & secure software for PS
  – Protects bitstream & IP
• Debug & Development (JTAG)
  – Debug the PS & PL
Secure Boot or Non-Secure Boot is defined by the user in the BOOT ROM header
Boot Stages Overview
• Stage 0: Boot ROM
  – Provided by Xilinx
  – Not user accessible
• Stage 1: First Stage Boot Loader
  – User developed
  – Xilinx provides an example
• Stage 2: Second Stage Boot Loader
  – Optional
  – User developed
  – Xilinx provides an example
Non-Secure Boot
Stage 0: Boot Mode Selection
• Power up Zynq-7000 AP SoC
• Boot mode pins identify boot device
• BootROM code runs
• Copies First Stage Boot Loader to OCM
Stage 1: First Stage Boot Loader
• First Stage Boot Loader runs from OCM
• Loads PS boot data into specified memory (e.g. DDR) OR enables second stage boot (optional)
• Loads bitstream and configures PL (optional)
Stage 2: 2nd Stage Boot Loader
• Rest of PS boot data or PL bitstream loaded
• Can be from Ethernet or USB etc.
Example Boot Process for Linux
Stage 0
• Power up Zynq
• On-chip ROM code runs and identifies the boot device by reading the mode pin status
• Copies First Stage Boot Loader from selected boot device to OCM RAM
Stage 1: FSBL
• First Stage Boot Loader runs from OCM RAM
• Bitstream is loaded and PL is configured
• U-Boot is loaded from boot device into DDR
Stage 2: U-Boot
• U-Boot runs from DDR
• Loads OS kernel from selected boot device
• Loads ramdisk from default boot device
Stage 3: OS (Kernel & Drivers)
• Linux kernel loaded to DDR
• RFS can be selected from the Linux command line
• RFS contains Linux applications
Secure Boot
Stage 0: Boot Mode Selection
• Power up Zynq-7000 AP SoC
• Boot mode pins identify boot device
• BootROM code runs
• Decrypts and authenticates FSBL using AES/SHA, then copies it into OCM (PL powered on)
Stage 1: First Stage Boot Loader
• First Stage Boot Loader runs from OCM
• Decrypts and authenticates PS boot data using the AES/SHA engine and puts it into specified memory OR enables second stage boot (optional)
• Decrypts and authenticates bitstream using AES/SHA and configures PL (optional)
Stage 2: U-Boot for Linux
• User developed code can be secure
Zynq-7000 AP SoC Ecosystem Development Elements
Comprehensive Partnership Ecosystem
Software Tools
Intellectual Property
System Architecture
Software OS & Middleware
Design Services
Over 100 Zynq Specific Partners … and Growing
Comprehensive Partnership Ecosystem
• OS, middleware and tools solutions from Alliance Partners
• Design service ecosystem: Alliance Partners qualified as Zynq Design Centers across geographic regions
  – Well trained on Zynq-7000 AP SoC platform
  – Strong expertise on FPGA and embedded processing
  – Close relationship with local Xilinx customers
• Key IP solutions supporting segment-focused applications
  – 2D/3D graphics, video imaging, video codec, PCIe, SATA, USB
• Many boards and modules targeting specific markets and applications available from Alliance members
  – Jump start customers to use Zynq-7000 AP SoC platform
• Strong training ecosystem (Xilinx ATP)
  – Zynq-7000 training classes available to a broad customer base worldwide
http://www.xilinx.com/products/silicon-devices/soc/zynq-7000/ecosystem
Comprehensive Operating Systems Offering
• More than 95% of commercial embedded operating systems supported on Zynq-7000
• Scalable solution ranging from Real Time Operating Systems to fully featured Operating Systems
• Safety critical certifications in key industry segments
• Multi-core support in SMP and AMP mode
• Robust open source initiative for Linux and Android
Strong Embedded Linux Offering
• Open Source
  – Available on Xilinx GIT tree since July 2010 (EA)
  – Available on main GIT tree since October 2011
  – Wiki at http://wiki.xilinx.com
  – Forums at http://forums.xilinx.com
• From Xilinx
  – PetaLinux
• Commercial from Partners
  – WindRiver Linux: Yocto Project compatible
  – MontaVista Linux: Carrier Grade Linux
  – LinuxLink
  – More to come
Xilinx acquired PetaLogix in Aug. 2012 to strengthen its Linux offering
Operating Systems Support Status

OS | Provider | Version | Availability (ZC702 board support) | Market Segment
Linux (group heading) | | | | Used in all segments
PetaLinux | Xilinx | 12.12 | Now |
WR Linux 5 | Wind River | 3.4 | Now |
MVL CGE6 | Montavista | 2.6.32 | May ‘13 | Communications
LinuxLink | Timesys | NA | Now | Industrial, Medical, Automotive
Android | iVeia | 2.3 | Now | Consumer
Windows Embedded Compact 7 | Adeneo Embedded | 7 | Now | ISM, Automotive
VxWorks | Wind River | 6.9.2 | Now | ISM, Automotive, A&D
INTEGRITY | GreenHills Software | NA | Now | A&D, Automotive, ISM
QNX | Adeneo Embedded | 6.5 | Q2CY13 | Automotive, ISM
OSE | ENEA | 5.5 | Now | Communication
ThreadX/NetX | Express Logic | NA | Now | Consumer, Medical, Industrial
FreeRTOS | Xilinx | 7.x | Now | All segments
RTA-OS SC1-4 | ETAS | 3.0 | Now (single core) | Automotive
eCOS | ITR | 3.0 | Q4CY2012 | ISM, Automotive
eT-Kernel | eSOL | TBD | Now | Automotive, Consumer
µc/OS | Micrium | II | Now | Industrial, Medical, A&D
Nucleus | Mentor | NA | Now | Industrial, Medical and Automotive
Quadros | Quadros | NA | Now | Industrial, Medical, POS
Numerous SW Development Tool Options
In addition to the free Xilinx SDK:
• Highly optimized ARM compilers from partners
• Advanced software development tools from world-class partners offering software profiling and tracing solutions
Vivado Design Suite
FPGA designs for fast time to market
• Time to market keeps getting shorter.
• Projects keep getting more complex.
How can we stay ahead of today’s complexity and time-to-market requirements?
• What is the status of the standard HDL flow?
• Are there new design flows?
Next Generation Productivity Challenges
• Implementation bottlenecks
  – Managing hierarchy & reuse in implementation
  – Getting estimates early & closing on timing, utilization, power
  – Debugging across tools, abstractions, changes
  – Runtime & QOR scalability
• System integration bottlenecks
  – Design and IP reuse
  – Integrating algorithmic and RTL level IP
  – Mixing DSP, embedded, connectivity, logic
  – Verification of blocks and “systems”
Vivado Design Suite Accelerates Productivity
Accelerating Implementation Fast, Hierarchical and Deterministic Closure
Accelerating System Integration IP and System-centric Integration with Fast Verification
Vivado Key Enabling Technologies: Shared, Scalable Data Model
• Progressive estimation accuracy across the entire flow
  – Reduced iterations late in the cycle
• Scales to >10M LUT
• Interactive design & debug environment
  – Cross-probe from reports to schematics or HDL
• Integrate IP from any domain
(Figure: flow stages Estimation, IP Integration, RTL Design, Synthesis, Place & Route on top of the Shared, Scalable Data Model. Views: RTL (VHDL entity FIR), Schematics, Placement, Reports (Timing Report with Timing Paths #1 to #3). Feedback: Code Changes, Tool Settings, Placement Edits.)
Analytical Placer: Solving the Interconnect Bottleneck
(Figure: timing cost f(x) versus placement solution x, found by random moves and seeds. Annotations: initial random seed; local moves; not routable; best solution found; optimal solution (not found).)

 | Traditional P&R | Vivado P&R
“Cost” Criteria | 1 dimension: timing minimization | 3 dimensions: timing, congestion, wire length minimization
Primary Algorithm | “Simulated Annealing”: random, iterative search based on initial seed | Analytical: solves simultaneous equations to minimize all dimensions
Runtime | Unpredictable; breaks at high utilization, congestion | Most predictable
Scalability | Does not scale to 1M LC | Designed for 10M+ logic cells
Vivado Delivers Denser, Faster Implementation

 | ISE | Vivado
P&R runtime | 13 hrs | 5 hrs
Memory usage | 16 GB | 9 GB

• Wire length and congestion significantly reduced
• Concurrently optimizes timing and device utilization
Vivado Delivers the Industry’s Fastest and Most Predictable Run-time
• 4x faster than competition
• Most predictable run-times and results
• 2X the design capacity
(Figure: runtime comparison for 100+ designs, Xilinx vs. competition. Vertical axis: runtime (hours), 0 to 25. Horizontal axis: design size (LC), 0K to 2 000K. Callouts: ~4x run-time advantage; predictable run-times; supports 1+M LC designs where competition fails.)
Vivado Design Suite Technology Advantages
Accelerating Implementation Fast, Hierarchical and Deterministic Closure
Accelerating System Integration IP and System-centric Integration with Fast Verification
Vivado High-Level Synthesis (HLS)
Accelerates Algorithmic C to IP, Co-Processing Accelerator Integration
• Available today for C, C++, SystemC
• Proven in customer designs
• Supports:
  – Rapid development and algorithm exploration
  – Variable precision and floating-point
A clear differentiator over anything available!
Introducing IP Integrator: Enabling an IP-Centric Design Flow
• IP Packager inputs, packaged as standardized IP-XACT:
  – Source (C, RTL, IP)
  – Simulation models
  – Documentation
  – Example designs
  – Test bench
• IP sources: Xilinx IP, 3rd Party IP, User IP
• IP Subsystem
  – Uses multiple plug-and-play forms of IP to implement a functional subsystem
  – Includes software drivers and API
  – Accelerates integration and productivity
Vivado IP Integrator: Intelligent IP Integration
• Correct-by-construction
  – Interface level connections
  – Extensible IP repository
  – Real-time DRCs and parameter propagation / resolution
  – Designer Assistance
• Automated IP Subsystems
  – Block automation for rapid design creation
  – One click IP customization
• Board Aware
  – Supports all 7 series and Zynq platforms
(Screenshot callouts: Hierarchy Support; System Hierarchy View; Interface Connections with Live DRCs; TCL Console; Extensible IP Catalog)
Vivado 2013.2 Accelerates Time to Integration: IP Integrator, with Zynq, HLS, SysGen Integration
High-Level Synthesis with Xilinx Vivado HLS
© Copyright 2012 Xilinx
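ESL: What is it?
(Figure: design methodology levels and the steps between them. Levels: ESL, RTL, Netlist, Layout. Content at each level: Functionality / Model, Architecture, Gates, Silicon. Steps: High-Level Synthesis and Model-Based Design (in an IDE), Synthesis, Place & Route.)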
DSP Design Methodology
• Flexible design environment
• Floating-point and fixed-point hardware generation
• Real-time analog data acquisition
• World-class C design flow
(Figure labels: MATLAB / Simulink with Xilinx IP, Toolbox IP, HDL Coder and System Generator; C/C++ with C Libraries and Vivado HLS; RTL; IP Catalog; Vivado IP Integrator; DSP Design Platforms with analog signals through A/D and D/A.)
Where should we use HLS?
• Datapath Centric: Definitely
  – Video and smart video (Consumer, A&D)
  – Imaging with multiple DSP processors (Medical)
  – Baseband processing with multiple DSP processors (Wireless)
  – Radar and smart radar (Automotive, A&D)
• Compute Centric: Maybe
  – Complex compute systems, complicated math (Science HPC, A&D)
  – Fit: GPU for parallel processing, floating point, linear algebra
  – No fit: CPU-optimized code (pointers, casting, …), system calls
• Control Centric: For sure!
  – Motor control (Industrial)
  – Packet processing (Connectivity)
Who is HLS for?
• Embedded Designer: Definitely
  – Already a C expert
  – Comfortable writing C
  – Familiar with C development environments
  – Hardware aware
• Hardware Engineer: Definitely
  – Already a hardware expert
  – Might have some C experience, easy to pick up
  – Needs to fine-tune some algorithmic aspects
  – Gets a lot of test benches from algorithm designer
  – Struggling with verification issues at RTL level
• Algorithm Designer: How willing are you to learn about hardware?
  – Pointer casting
  – Hand optimized loops
  – malloc(), free()
Design Flow …
(Figure: VIVADO HLS flow with RTL Simulation and RTL Export; export formats: IP-XACT, SysGen, Pcore)
Benefits of HLS
Productivity
– Verification
  • Functional
  • Architectural
– Abstraction
  • Datatypes
  • Interface
  • Classes
– Automation

Video Design Example
Input | C Simulation Time | RTL Simulation Time | Improvement
10 frames 1280x720 | 10s | ~2 days (ModelSim) | ~12000x

(Figure: flow comparison with stages RTL (Spec), C (Spec/Sim), RTL (Sim), RTL (Sim))
Block level specification AND verification significantly reduced
Benefits of HLS
Portability
– Processors and FPGAs
– Technology migration
– Cost reduction
– Power reduction
Design and IP reuse

Benefits of HLS
Permutability
– Architecture exploration
  • Timing: parallelization, pipelining
  • Resources: sharing
– Better QoR
Rapid design exploration delivers QoR rivaling hand-coded RTL
Vivado HLS Code Synthesis C, C++ and SystemC
Vivado HLS Projects and Solutions
Vivado HLS is project based
– A project specifies the source code which will be synthesized
– Each project is based on one set of source code
– Each project has a user specified name
A project can contain multiple solutions
– Solutions are different implementations of the same code
– Auto-named solution1, solution2, etc.
– Supports user specified names
– Solutions can have different clock frequencies, target technologies, synthesis directives
Projects and solutions are stored in a hierarchical directory structure
– Top-level is the project directory
– The disk directory structure is identical to the structure shown in the GUI project explorer (except for source code location)
(Screenshot callouts: Source, Project Level, Solution Level)
HLS Control & Datapath Extraction
Code (from any C code example ..)
void fir(data_t *y, coef_t c[4], data_t x) {
    static data_t shift_reg[4];
    acc_t acc;
    int i;

    acc = 0;
    loop: for (i = 3; i >= 0; i--) {
        if (i == 0) {
            acc += x * c[0];
            shift_reg[0] = x;
        } else {
            shift_reg[i] = shift_reg[i-1];
            acc += shift_reg[i] * c[i];
        }
    }
    *y = acc;
}
Operations: RDx, RDc, >=, ==, +, *, +, *, WRy. Operations are extracted…
Control Behavior: Finite State Machine (FSM) states 0, 1, 2. The control is known.
Control & Datapath Behavior: control dataflow that places the operations (RDx, RDc, >=, -, ==, -, +, *, +, *, WRy) in the FSM states.
A unified control dataflow behavior is created.
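The fir function leaves data_t, coef_t and acc_t undefined, so it cannot be compiled as shown. The sketch below assumes all three are plain int (an assumption, not the presenter's choice) and adds a hypothetical helper, fir_step, to feed one sample at a time. Driving the filter with an impulse (1 followed by zeros) should return the coefficients one by one, which is a quick C-level check of the shift-register behavior before synthesis.

```c
#include <assert.h>

/* Assumed typedefs: the slide leaves data_t, coef_t and acc_t undefined. */
typedef int data_t;
typedef int coef_t;
typedef int acc_t;

/* 4-tap FIR from the slide: shifts the delay line and accumulates in one loop. */
void fir(data_t *y, coef_t c[4], data_t x)
{
    static data_t shift_reg[4];
    acc_t acc;
    int i;

    acc = 0;
    for (i = 3; i >= 0; i--) {
        if (i == 0) {
            acc += x * c[0];
            shift_reg[0] = x;
        } else {
            shift_reg[i] = shift_reg[i - 1];
            acc += shift_reg[i] * c[i];
        }
    }
    *y = acc;
}

/* Hypothetical helper: feeds one sample and returns the filter output. */
static data_t fir_step(coef_t c[4], data_t x)
{
    data_t y;
    assert(c != 0);
    fir(&y, c, x);
    return y;
}
```

Because shift_reg is static, successive calls share state, exactly like the registers in the synthesized block.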
The Key Attributes of C code
• Functions: All code is made up of functions which represent the design hierarchy: the same in hardware
• Top Level IO: The arguments of the top-level function determine the hardware RTL interface ports
• Types: All variables are of a defined type. The type can influence the area and performance
• Loops: Functions typically contain loops. How these are handled can have a major impact on area and performance.
• Arrays: Arrays are used often in C code. They can influence the device IO and become performance bottlenecks.
• Operators: Operators in the C code may require sharing to control area or specific hardware implementations to meet performance

void fir(data_t *y, coef_t c[4], data_t x) {
    static data_t shift_reg[4];
    acc_t acc;
    int i;

    acc = 0;
    loop: for (i = 3; i >= 0; i--) {
        if (i == 0) {
            acc += x * c[0];
            shift_reg[0] = x;
        } else {
            shift_reg[i] = shift_reg[i-1];
            acc += shift_reg[i] * c[i];
        }
    }
    *y = acc;
}

Let’s examine the default synthesis behavior of these …
C Functions Become RTL Hierarchy
Each function is translated into an RTL block
– Verilog module, VHDL entity
– Each block can be shared like any other component provided it’s not in use at the same time
– Functions may be inlined to dissolve their hierarchy
  • Small functions may be automatically inlined

Source Code (my_code.c)
void A() { ..body A.. }
void B() { ..body B.. }
void C() { B(); }
void D() { B(); }

void foo_top() {
    A(…);
    C(…);
    D(…);
}

RTL hierarchy (figure): foo_top contains A, C and D; B appears under both C and D
Top-Level Function Arguments Become Ports
// Top Level Function for hardware synthesis
void image_demo(AXI_PIXEL in_pix[MAX_HEIGHT][MAX_WIDTH],
                AXI_PIXEL out_pix2[MAX_HEIGHT][MAX_WIDTH],
                int rows, int cols) {
}

RTL ports AND protocols
– Created by synthesis directives
– Design scheduled to match IO timing

Synthesis result (image_demo RTL):
• ap_clk, ap_rst: Clock & Reset ports
• ap_start, ap_done, ap_idle: Design Start, Idle & Done ports
• rows, cols: No IO protocol (for config data)
• in_pix_data, in_pix_empty_n, in_pix_read: in_pix implemented as FIFO port
• out_pix2_data, out_pix2_full_n, out_pix2_write: out_pix2 implemented as FIFO port
Bus Interfaces: Added on RTL Export
AXI4 interfaces are added at IP export
– RTL ports can be grouped into a single AXI4 slave interface
– C function and header are provided for slave controller / CPU

Exported design (image_demo RTL), as grouped in the figure:
• AXI4 Slave: control ports grouped into a common interface (ap_start, ap_done, ap_idle, rows, cols). Software header files provided for CPU control.
• AXI4 Stream: streaming input data (in_pix_data, in_pix_empty_n, in_pix_read)
• AXI4 Stream: streaming output data (out_pix_data, out_pix_full_n, out_pix_write)
• ap_clk, ap_rst
C, C++ and SystemC Support
The vast majority of C, C++ and SystemC is supported
– Provided it is statically defined at compile time
– If it’s not defined until run time, it won’t be synthesizable
Certain Code Must be Changed: Cannot be Synthesized
– Dynamic memory allocation: size is indeterminate
– System calls: cannot access the disk or OS features (e.g. date)
Certain Code Should be Changed: For Performance
– Code is optimized differently to get performance on a CPU, a DSP, a GPU or an FPGA
  • Changes to synthesize on an FPGA are smaller than those needed to run on a DSP or GPU
– Changes like this are nothing to run away from: it’s normal, sane behavior
Vivado HLS Design Optimization
Data Types Determine the Size of the Operators
Code (from any C code example ..)
void fir(data_t *y, coef_t c[4], data_t x) {
    static data_t shift_reg[4];
    acc_t acc;
    int i;

    acc = 0;
    loop: for (i = 3; i >= 0; i--) {
        if (i == 0) {
            acc += x * c[0];
            shift_reg[0] = x;
        } else {
            shift_reg[i] = shift_reg[i-1];
            acc += shift_reg[i] * c[i];
        }
    }
    *y = acc;
}

Operations: RDx, RDc, >=, ==, +, *, +, *, WRy. Operations are extracted…

Types
• Standard C types: long long (64-bit), int (32-bit), short (16-bit), char (8-bit), float (32-bit), double (64-bit), unsigned types
  – For floats and doubles there must be an FP core in the library that binding can map to, else they cannot be synthesized
• Arbitrary Precision types
  – C: ap(u)int types (1-1024)
  – C++: ap_(u)int types (1-1024), ap_fixed types
  – C++/SystemC: sc_(u)int types (1-1024), sc_fixed types
  – Can be used to define any variable to be a specific bitwidth (e.g. 17-bit, 47-bit etc.)
The C types define the size of the hardware used: handled automatically
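To see what a specific bitwidth means for the values a variable can hold, the sketch below emulates N-bit wraparound in plain C. It does not use the Vivado HLS arbitrary-precision headers; wrap_unsigned and wrap_signed are hypothetical helpers written only for this illustration, and they assume two's-complement wraparound on overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Keep only the low `bits` bits, as an unsigned N-bit register would. */
static uint64_t wrap_unsigned(uint64_t value, unsigned bits)
{
    assert(bits >= 1 && bits <= 63);
    return value & ((UINT64_C(1) << bits) - 1);
}

/* Keep the low `bits` bits and sign-extend from the top retained bit. */
static int64_t wrap_signed(int64_t value, unsigned bits)
{
    assert(bits >= 1 && bits <= 63);
    uint64_t mask = (UINT64_C(1) << bits) - 1;
    uint64_t sign = UINT64_C(1) << (bits - 1);
    uint64_t low = (uint64_t)value & mask;
    /* XOR then subtract maps [sign, mask] onto the negative range. */
    return (int64_t)(low ^ sign) - (int64_t)sign;
}
```

With 17 bits, the largest signed value is 65535; one more wraps around to -65536, which is the behavior a 17-bit signed datapath would show in hardware.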
Review: Latency and Throughput
Design Latency
– The latency of the design is the number of cycles it takes to output the result
  • In this example the latency is 10 cycles

void foo_top (a,b,c,d, *x, *y) {
    ...
    func_A(…);
    func_B(…);
    func_C(…);
    func_D(…);
    return res;
}

(Timeline: func_A, func_B, func_C, func_D run back to back; Latency = 10 cycles)

Design Throughput
– The throughput of the design is the number of cycles between new inputs
  • By default (no concurrency) this is the same as latency
  • Next start/read is when this transaction ends

(Timeline: New Input Read, then func_A, func_B, func_C, func_D, then Next New Input Read; Throughput = 10 cycles)
Latency and Throughput
In the absence of any concurrency
– Latency is the same as throughput

void foo_top (a,b,c,d, *x, *y) {
    ...
    func_A(…);
    func_B(…);
    func_C(…);
    func_D(…);
    return res;
}

(Timeline: New Input Read, then func_A, func_B, func_C, func_D, then Next New Input Read; Latency = 10 cycles, Throughput = 10 cycles)

Pipelining for higher throughput
– VHLS can pipeline functions and loops to improve throughput

(Timeline: a second func_A to func_D sequence starts before the first one finishes; Latency = 10 cycles, Throughput = 4 cycles)

– Latency and throughput are related
– This presentation will discuss optimizing for latency first, then throughput
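The relation between the two numbers can be made concrete with a little arithmetic. The sketch below is a simplified model, assuming a constant latency and a constant interval between new inputs; the function names are hypothetical and the model ignores stalls.

```c
#include <assert.h>

/* Without concurrency each input must finish before the next one starts. */
static unsigned long cycles_sequential(unsigned long n, unsigned long latency)
{
    return n * latency;
}

/* With pipelining the first result takes `latency` cycles and every
 * further input adds only `interval` cycles (the throughput figure). */
static unsigned long cycles_pipelined(unsigned long n, unsigned long latency,
                                      unsigned long interval)
{
    assert(interval >= 1);
    return n == 0 ? 0 : latency + (n - 1) * interval;
}
```

With the figures from the slides (latency 10, throughput 4), three inputs take 30 cycles without pipelining and 10 + 2 x 4 = 18 cycles with it.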
Loops
By default, loops are rolled
– Each C loop iteration is implemented in the same state
– Each C loop iteration is implemented with the same resources

void foo_top (…) {
    ...
    Add: for (i=3;i>=0;i--) {
        b = a[i] + b;
        ...
    }
}

(Synthesis figure: foo_top with array a[N] feeding a single adder that accumulates into b over N iterations)

Loops require labels if they are to be referenced by Tcl directives (GUI will auto-add labels)
– Loops can be unrolled if their indices are statically determinable at elaboration time
  • Not when the number of iterations is variable
Unrolled loops can reduce latency

void foo_top (…) {
    ...
    Add: for (i=3;i>=0;i--) {
        b = a[i] + b;
        ...
    }
}

Select loop “Add” in the directives pane, right-click & select unroll
(Figure: unrolled foo_top with a[3], a[2], a[1], a[0] feeding four adders into b. Schedule figure: three options for placing the reads of a[3] to a[0] across clock cycles.)
Unrolled loops are likely to result in more hardware resources and higher area
Unrolled loops allow more options and exploration
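What unrolling does to the "Add" loop can be written out by hand. The sketch below is a plain C illustration, not tool output: sum_rolled keeps the loop, and sum_unrolled is the same computation with the four iterations spelled out, which is the form that lets a scheduler place the additions in fewer cycles. Both function names are hypothetical.

```c
#include <assert.h>

/* Rolled form: one adder reused over four iterations. */
static int sum_rolled(const int a[4], int b)
{
    int i;
    assert(a != 0);
    for (i = 3; i >= 0; i--) {
        b = a[i] + b;
    }
    return b;
}

/* Fully unrolled form: four additions with no loop control left. */
static int sum_unrolled(const int a[4], int b)
{
    assert(a != 0);
    b = a[3] + b;
    b = a[2] + b;
    b = a[1] + b;
    b = a[0] + b;
    return b;
}
```

The two functions return the same value for any input; only the amount of hardware and the schedule differ after synthesis.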
Loop Flattening
VHLS can automatically flatten nested loops
– A faster approach than manually changing the code
Flattening should be specified on the innermost loop
– It will be flattened into the loop above
– The “off” option can prevent loops in the hierarchy from being flattened
Loops will be flattened by default: use “off” to disable

Before flattening (L1 x4, L2 x4, L3 x4, L4 x4; 36 transitions):
void foo_top (…) {
    ...
    L1: for (i=3;i>=0;i--) {
        [loop body l1 ]
    }
    L2: for (i=3;i>=0;i--) {
        L3: for (j=3;j>=0;j--) {
            [loop body l3 ]
        }
    }
    L4: for (i=3;i>=0;i--) {
        [loop body l4 ]
    }

After flattening (L1 x4, L2 x16, L4 x4; 28 transitions):
void foo_top (…) {
    ...
    L1: for (i=3;i>=0;i--) {
        [loop body l1 ]
    }
    L2: for (k=15;k>=0;k--) {
        [loop body l3 ]
    }
    L4: for (i=3;i>=0;i--) {
        [loop body l4 ]
    }
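Flattening can also be shown as a manual rewrite. In the sketch below, which assumes a 4x4 array and an order-sensitive checksum chosen only so that the visiting order is observable, the nested L2/L3 pair and the single 16-iteration loop touch the elements in exactly the same sequence. The function names and the checksum are hypothetical.

```c
#include <assert.h>

enum { ROWS = 4, COLS = 4 };

/* Nested form: outer loop over rows, inner loop over columns. */
static unsigned checksum_nested(int m[ROWS][COLS])
{
    unsigned acc = 0u;
    int i, j;
    assert(m != 0);
    for (i = ROWS - 1; i >= 0; i--) {
        for (j = COLS - 1; j >= 0; j--) {
            acc = acc * 31u + (unsigned)m[i][j];
        }
    }
    return acc;
}

/* Flattened form: one loop of ROWS*COLS iterations, indices recovered
 * from the single counter. */
static unsigned checksum_flattened(int m[ROWS][COLS])
{
    unsigned acc = 0u;
    int k;
    assert(m != 0);
    for (k = ROWS * COLS - 1; k >= 0; k--) {
        acc = acc * 31u + (unsigned)m[k / COLS][k % COLS];
    }
    return acc;
}
```

The flattened version removes the transitions into and out of the inner loop on every outer iteration, which is where the reduction in transitions on the slide comes from.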
Loop Merging
VHLS can automatically merge loops
– A faster approach than manually changing the code
– Allows for more efficient architecture explorations
– FIFO reads, which must occur in strict order, can prevent loop merging
  • Can be done with the “force” option: user takes responsibility for correctness

Before merging (L1 x4, L2 x4, L3 x4, L4 x4; 36 transitions):
void foo_top (…) {
    ...
    L1: for (i=3;i>=0;i--) {
        [loop body l1 ]
    }
    L2: for (i=3;i>=0;i--) {
        L3: for (j=3;j>=0;j--) {
            [loop body l3 ]
        }
    }   (L2/L3: already flattened)
    L4: for (i=3;i>=0;i--) {
        [loop body l4 ]
    }

After merging (L123 x16; 18 transitions):
void foo_top (…) {
    ...
    L123: for (l=15;l>=0;l--) {
        if (cond1)
            [loop body l1 ]
        [loop body l3 ]
        if (cond4)
            [loop body l4 ]
    }
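A manual version of the merge makes the role of cond1 and cond4 visible. The sketch below assumes two independent loops with trip counts 4 and 16 (the bodies are simple sums chosen for illustration); the merged loop runs 16 times and guards the shorter body with a condition. The function names and the packing of the two sums into one return value are hypothetical.

```c
#include <assert.h>

/* Two separate loops: a 4-iteration sum followed by a 16-iteration sum. */
static long sums_separate(const int a[4], const int b[16])
{
    long s1 = 0, s2 = 0;
    int i;
    assert(a != 0 && b != 0);
    for (i = 0; i < 4; i++) {
        s1 += a[i];
    }
    for (i = 0; i < 16; i++) {
        s2 += b[i];
    }
    return s1 * 100000L + s2;
}

/* Merged loop: runs for the longer trip count; the guard plays the role
 * of cond1 and keeps the shorter body to its original 4 iterations. */
static long sums_merged(const int a[4], const int b[16])
{
    long s1 = 0, s2 = 0;
    int i;
    assert(a != 0 && b != 0);
    for (i = 0; i < 16; i++) {
        if (i < 4) {
            s1 += a[i];
        }
        s2 += b[i];
    }
    return s1 * 100000L + s2;
}
```

The merge is only safe here because the two bodies do not depend on each other; that is the same responsibility the "force" option hands to the user.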
Review: Arrays in HLS
An array in C code is implemented by a memory in the RTL
– By default, arrays are implemented as RAMs, optionally a FIFO

void foo_top(int x, …) {
    int A[N];
    L1: for (i = 0; i < N; i++)
        A[i+x] = A[i] + i;
}

(Synthesis figure: array A[N], elements 0 to N-1, mapped inside foo_top to a RAM_1P with ports DIN, ADDR, DOUT, CE, WE; signals A_in and A_out)

The array can be targeted to any memory resource in the library
– The ports and sequential operation are defined by the library model
  • All RAMs are listed in the VHLS Library Guide (list of available cores)
Example: Array “A” is targeted to a single port distributed RAM resource
Arrays: Performance bottlenecks
Arrays are intuitive and useful software constructs
– They allow the C algorithm to be easily captured and understood
Array accesses can often be performance bottlenecks
– Arrays are targeted to a default RAM
  • May not be the ideal memory for performance

    void foo_top (…) {
      ...
      for (i = 2; i < N; i++)
        mem[i] = mem[i-1] + mem[i-2];
    }

[Figure: two alternative schedules of the reads (RD), add (+) and write (WR) for successive loop iterations]
  • Cannot pipeline with a throughput of 1
Even with a dual-port RAM, we cannot perform all reads and writes in one cycle
VHLS allows arrays to be partitioned and reshaped
– Allows more optimal configuration of the array
– Provides better implementation of the memory resource
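To make the bottleneck concrete, the sketch below contrasts the loop from the slide with a register-cached variant. The cached variant is a common source-level alternative and is not taken from the slides, which recommend partitioning and reshaping instead. The names `recurrence_baseline`, `recurrence_cached` and the size `N = 8` are assumptions made for illustration.

```c
#define N 8

/* Baseline from the slide: every iteration issues two reads and one
 * write to mem, which is more than a dual-port RAM can serve per cycle. */
void recurrence_baseline(int mem[N])
{
    for (int i = 2; i < N; i++)
        mem[i] = mem[i - 1] + mem[i - 2];
}

/* Register-cached variant: the two previous values are carried in
 * local variables, so the loop body performs a single write to mem. */
void recurrence_cached(int mem[N])
{
    int prev2 = mem[0];
    int prev1 = mem[1];
    for (int i = 2; i < N; i++) {
        int next = prev1 + prev2;
        mem[i] = next;
        prev2 = prev1;
        prev1 = next;
    }
}
```

Both functions compute the same result; the second simply reduces the number of memory ports needed per iteration, which is the same goal that array partitioning pursues from the memory side.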
Array Partitioning
Partitioning breaks an array into smaller elements
– If the factor is not an integer multiple the final array has fewer elements
– Arrays can be split along any dimension
  • If none is specified dimension zero is assumed
  • Dimension zero means all dimensions
– All partitions inherit the same resource target
  • That is, whatever RAM is specified as the resource target
  • Except of course “complete”

[Figure: array1[N], elements 0, 1, …, N-1, partitioned three ways]
– block: divided into blocks of N/factor consecutive elements. Shown with two blocks: {0, 1, …, N/2-1} and {N/2, …, N-2, N-1}
– cyclic: divided into blocks 1 word at a time (like “dealing cards”). Shown with two blocks: {0, 2, …, N-2} and {1, …, N-3, N-1}
– complete: individual elements 0, 1, 2, …, N-3, N-2, N-1; breaks a RAM into registers (no “factor” supported)
Multiple memories allow greater parallel access
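The block and cyclic schemes above amount to two different index-to-memory mappings. The sketch below is a minimal illustration; the helper names `block_size`, `block_bank`, `cyclic_bank` and `sum_pairs` are hypothetical, and rounding the block size up is an assumption that matches the statement that the final array has fewer elements when the factor does not divide evenly. `ARRAY_PARTITION` is the Vivado HLS directive; a plain C compiler ignores it.

```c
/* Elements per partition for "block": N / factor, rounded up (assumed). */
int block_size(int n, int factor)
{
    return (n + factor - 1) / factor;
}

/* block: consecutive elements land in the same partition. */
int block_bank(int index, int n, int factor)
{
    return index / block_size(n, factor);
}

/* cyclic: elements are dealt out one at a time, like cards. */
int cyclic_bank(int index, int factor)
{
    return index % factor;
}

#define N 8

/* With a cyclic factor of 2, array1[i] and array1[i + 1] sit in
 * different memories, so both reads can be issued in the same cycle. */
int sum_pairs(int array1[N])
{
#pragma HLS ARRAY_PARTITION variable=array1 cyclic factor=2 dim=1
    int acc = 0;
    for (int i = 0; i < N; i += 2) {
        acc += array1[i] * array1[i + 1];
    }
    return acc;
}
```

Choosing between block and cyclic comes down to the access pattern: neighbouring indices read together favour cyclic, while indices a fixed stride of N/factor apart favour block.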
Array Dimensions
The array options can be performed on dimensions of the array
    my_array[10][6][4]
      Dimension 1: [10]
      Dimension 2: [6]
      Dimension 3: [4]
      Dimension 0: all dimensions
Examples
– my_array[10][6][4], partition dimension 3: my_array_0[10][6], my_array_1[10][6], my_array_2[10][6], my_array_3[10][6]
– my_array[10][6][4], partition dimension 1: my_array_0[6][4], my_array_1[6][4], my_array_2[6][4], my_array_3[6][4], my_array_4[6][4], my_array_5[6][4], my_array_6[6][4], my_array_7[6][4], my_array_8[6][4], my_array_9[6][4]
– my_array[10][6][4], partition dimension 0: 10x6x4 = 240 individual registers
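The counts in these examples can be checked with a few lines of C. This is an illustrative sketch only: `partition_count` and `sum_last_dim` are hypothetical helpers, and the pragma shows how a complete partition of dimension 3 would be requested in Vivado HLS (a standard C compiler ignores it).

```c
#define RANK 3

/* Number of separate arrays produced by completely partitioning one
 * dimension (1-based). Dimension 0 means all dimensions, which leaves
 * one register per element. */
int partition_count(const int dims[RANK], int dim)
{
    if (dim == 0) {
        int total = 1;
        for (int d = 0; d < RANK; d++) total *= dims[d];
        return total;
    }
    return dims[dim - 1];
}

/* Partitioning dimension 3 yields four [10][6] arrays, so the four
 * reads below target four different memories and can occur together. */
int sum_last_dim(int my_array[10][6][4], int i, int j)
{
#pragma HLS ARRAY_PARTITION variable=my_array complete dim=3
    return my_array[i][j][0] + my_array[i][j][1]
         + my_array[i][j][2] + my_array[i][j][3];
}
```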
Loop and Function Pipelining
Without Pipelining
With Pipelining
Loop:for(i=1;i(xn, xk, &fft_status, &fft_config); dummy_proc_be(&fft_status, ovflo, xk, out);
Co-processing with Zynq-7000 AP SoC
All Programmable SoC Approach
[Figure: Requirements feed separate SW and HW flows, each with Spec, Iterate and Verify stages]
Vivado High-Level Synthesis
[Figure: the same Requirements, SW/HW Spec, Iterate and Verify flow]
Accelerates Algorithmic C to Co-Processing Accelerator Integration
Accelerator: Use Case Models
Model 1: Data Flow Model (Control – Data Flow Model)
• Custom IP for complex function & data flows
• PS for control & resource management
Model 2: Acceleration Model
• Balances SW/HW partition
• PS primary compute platform
• PL for HW Acceleration
• Communication between PS & PL: High
[Figure: block diagram of each model showing the PS, Standard Mem, Standard IO, Custom IP and Custom I/O blocks]
[Figure: for each model, Programmable Systems Integration, Increased System Performance, BOM Cost Reduction, Total Power Reduction and Accelerated Design Productivity are ranked 1 to 4; # = Relative importance]
Zynq-7000 Drive Reference Design Platform
[Figure: block diagram. Labels: DDR, UDP/IP @ 1 Gbit, PC, UDP/ Agent Dynamic Link Library, Process Data, SDC; two motor channels, each with Current Measurement, Field Oriented Control IP, SVM - RPFM, Bridge, Observer (Speed), Encoder readout, and motor phases ra/La, rb/Lb, rc/Lc]
Zynq-7000 EtherCAT Performance Advantage Example
ARM core up to 1 GHz – Fabric FOC 4 Motors + EtherCAT®
Best drive in the market: 62.5 us
Zynq: 16 us (4X faster response time)
– Read / Write EtherCAT ESC: Commands + Status
– Octets to Floating Point Conversion
– FOC Motor 0, FOC Motor 1, FOC Motor 2, FOC Motor 3
Field Oriented Control: Zynq 1.6 us (40x faster control loop)
– Positioning Controller: Execute motor position
– Speed Controller: Estimate and correct error in speed
– Current Controller: Estimate and correct error in torque
– Power Modulation: PWM, SVM, RPFM
Zynq-7000 EtherCAT® Intelligent Electric Drive Using ZC702, FMC-MC1 and FMC-ISMNET
© Copyright 2013 Xilinx
Zynq-7000 EtherCAT System Using ZC702, ZedBoard and ZedBoard Master with QNX® and SOEM
ISM-NET
Accelerator Attached as Slave
Pro: Simple System Architecture, Simple Register Interface
Con: Limited communication bandwidth
[Figure: Processing System (ARM® Dual Cortex-A9 MPCore™ System, Common Peripherals, Memory Interfaces) and Programmable Logic (AXI4 interconnect, Acc. 1, Acc. 2), linked through the Slave Port]
Accelerator Attached as Master - High Performance Port Direct to OCM or DDR Memory
Pro: High Data Bandwidth
Con: Increased Design Complexity, Increased Latency
[Figure: Zynq Processing System (ARM® Dual Cortex-A9 MPCore™ System, Common Peripherals, Memory Interfaces) with a Slave Port and a High Performance Port; Programmable Logic with AXI4 interconnects, two AXI_DMA blocks, Acc. 1 and Acc. 2]
Accelerator Attached as Master - With Coherent DMA to L1 Caches
Pro: Low latency, High data bandwidth for short bursts
Con: Increased Design Complexity, Most efficient for data that fits in caches
[Figure: Zynq Processing System (ARM® Dual Cortex-A9 MPCore™ System, Common Peripherals, Memory Interfaces) with a Slave Port and an Accelerator Coherency Port; Programmable Logic with AXI4 interconnects, two AXI_DMA blocks, Acc. 1 and Acc. 2]
Co-processing with Zynq-7000 AP SoC Live Demo
Design example: matrix multiply
Simple and very common linear algebra example: A * B = C
– Easy to integrate floating point variables and arithmetic operations
Algorithm maps naturally to nested loops
– Loop 1: Iterate over rows of A
  • Loop 2: Iterate over columns of B
    – Loop 3: Multiply each index of row vector A with an index of column vector B and accumulate
Algorithm is easily parallelizable
– No data recurrence/dependencies
Design example: the original C code
Code snippet (single-precision floating-point real-valued)

    void matrix_multiply_hw(
        float a[DIM][DIM],
        float b[DIM][DIM],
        float out[DIM][DIM])
    {
      // matrix multiplication A*B
      L1: for (int y = 0; y < DIM; y++)          // Loop 1 (rows of A)
        L2: for (int x = 0; x < DIM; x++) {      // Loop 2 (columns of B)
          float sum = 0;
          L3: for (int i = 0; i < DIM; i++)      // Loop 3 (vector index)
            sum = sum + a[y][i] * b[i][x];       // FP multiply, FP add
          out[y][x] = sum;
        }
      return;
    }
Design example: the original C code
Default implementation
Matrix multiplication uses comparable numbers of adds versus mults, so it is a balanced example
[Figure: default mapping. The A and B matrices (2-D arrays in C/C++) are reached through a memory I/F; control logic drives an FMUL and an FADD with a register (REG)]
Optimizations: array partitioning
Repartition inputs (A matrix, B matrix) to allow parallel read/write
Parallelize mult/add by unrolling inner loop
[Figure: partitioned A and B matrices feeding parallel multipliers (*) and adders (+)]
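The two optimizations on this slide can be sketched with Vivado HLS directives. What follows is a hedged sketch, not the demo's actual source: `DIM = 4`, the function names `matrix_multiply_ref` and `matrix_multiply_opt`, and the choice of a complete partition are all assumptions made for illustration. A standard C compiler ignores the pragmas, so both functions can be compared in C simulation.

```c
#define DIM 4

/* Reference version: same loop nest as the original C code. */
void matrix_multiply_ref(float a[DIM][DIM], float b[DIM][DIM],
                         float out[DIM][DIM])
{
    for (int y = 0; y < DIM; y++)
        for (int x = 0; x < DIM; x++) {
            float sum = 0.0f;
            for (int i = 0; i < DIM; i++)
                sum += a[y][i] * b[i][x];
            out[y][x] = sum;
        }
}

/* Optimized sketch: A is split along its columns and B along its rows,
 * so every a[y][i] and b[i][x] needed by one dot product sits in its
 * own memory; unrolling L3 then lets the multiplies run in parallel. */
void matrix_multiply_opt(float a[DIM][DIM], float b[DIM][DIM],
                         float out[DIM][DIM])
{
#pragma HLS ARRAY_PARTITION variable=a complete dim=2
#pragma HLS ARRAY_PARTITION variable=b complete dim=1
    L1: for (int y = 0; y < DIM; y++) {
        L2: for (int x = 0; x < DIM; x++) {
            float sum = 0.0f;
            L3: for (int i = 0; i < DIM; i++) {
#pragma HLS UNROLL
                sum += a[y][i] * b[i][x];
            }
            out[y][x] = sum;
        }
    }
}
```

Keeping an untouched reference function next to the optimized one gives a simple C test bench: any directive or code change can be checked by comparing the two outputs element by element.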
All Programmable Abstraction The Future of Xilinx Design Tools
Productivity: Automation + Abstraction
[Figure: abstraction versus automation chart. Levels shown: Binary, Assembly, C Programming, C Derivatives and Model Based; Gate/Layout, RTL, IP, Behavioral]
Xilinx Strategy: All Programmable Abstractions
[Figure: abstraction versus automation axes, showing IP and IPI]
Today’s Hardware Design Abstractions
Zynq™-7000 All Programmable SoCs: accelerated time to integration
Abstracting hardware through increasing layers of automation
Automating not dictating design flows
Abstraction layers:
– Behavioral: C/C++, Vivado HLS
– Model Based: MathWorks, National Instruments
– IP Extractions: Vivado IP Integrator
Abstraction Evolution - System Level
Abstracting all hardware through increasing layers of automation
Abstraction layers:
– System Level: System Level abstraction
– Model Based: MathWorks, National Instruments
– IP Extractions: Vivado IPI
– Behavioral: C/C++, Vivado HLS
ALL Programmable Abstractions for All Programmable Devices
Summary
The Coprocessing Value
– Programmable Systems Integration: All Programmable SoCs deliver tight coupling of processors and Programmable Logic, enabling coprocessing accelerators
– Increased System Performance: Offloading processing tasks to Programmable Logic can appreciably increase system level performance
– BOM Cost Reduction: Accelerators can eliminate multiple general purpose or DSP processors
– Total Power Reduction: Moving processing functions to Programmable Logic can result in significant system level power reduction
– Accelerated Design Productivity: Using HLS to create accelerators can dramatically reduce development time. Using IPI provides an unmatched ease of use in system integration
Follow Xilinx
facebook.com/XilinxInc
twitter.com/#!/XilinxInc
youtube.com/XilinxInc
Q&A
Thank You