
XILINX - Enabling New Product Innovation

Enabling New Product Innovation Across Markets with Zynq-7000 All Programmable SoC, Vivado HLS and IP Integrator
Olivier TREMOIS, XILINX EMEA&I DSP Specialist
IECON 2013

Agenda

Xilinx: A Generation Ahead

Zynq-7000 Architecture

Vivado Design Suite

Live Demo

All Programmable Abstraction


Xilinx: A Generation Ahead

Semiconductor Market Leader in All Programmable Devices
– Founded 1984
– $2.17B FY13 revenue
– ~50% market segment share
– 3,000 employees worldwide
– 20,000 customers worldwide
– 2,500+ patents

[World map legend: Headquarters; R&D; Sales, Marketing, Support; Manufacturing (Fab, Assy, Test)]

Driving Industry Mandates

Programmable Imperative

Programmable Systems Integration Insatiable Intelligent Bandwidth


All Programmable and Smarter Systems

Smarter Networks

Smarter Data Centers

Smarter Factories


Smarter Vision

Smarter Energy

Moving A Generation Ahead

28nm

Programmable Logic Devices

All Programmable Devices

Enables Programmable Logic

Enables All Programmable & Smarter Systems


New All Programmable & Smarter Competencies

[Diagram labels: Homogeneous, Heterogeneous, SerDes, Analog-to-Digital]

– 3D IC expertise and supply chain
– SoC and embedded software
– World class SerDes and analog mixed signal
– SmartCORE IP for Smarter Systems
– Next generation design automation

Charting an Aggressive Course Forward

Programmable systems integration

3D IC

SoC

FPGA

28nm HPL

20nm SoC

14-16nm FinFET

System-level price/performance/watt


10nm

Process

Staying A Generation Ahead at 20nm

20nm Portfolio: Industry’s first ASIC-class All Programmable architecture for FPGAs and 2nd generation SoCs and 3D ICs

Product: “Co-optimized” with Vivado for extra performance, power and integration

Productivity: 1.5-2x system-level performance and systems integration

Smarts: Smarter solutions for smarter networks, data centers, vision and control


Zynq-7000 All Programmable SoC

Why All Programmable?

Reduce exploding design costs
Dramatically increase flexibility

Leverage broad technology portfolio
– Logic & high-speed I/O
– S/W-programmable ARM systems
– 3-D IC
– Analog/mixed-signal
– System-to-IC design tools
– Intellectual property

Benefits
– Programmable Systems Integration
– Increased System Performance
– BOM Cost Reduction
– Total Power Reduction
– Accelerated Design Productivity

Build better electronic systems with fewer chips… faster!

The First All Programmable SoC

[Timeline: 2010 to 2012]

• Production: NOW
• 350+ unique customers actively designing
• 100+ AP SoC specific partners
• All major OSs supported and in use
• 20+ different development boards
• Won every award it entered

A Unique Value Proposition

Breakthrough in “All Programmable” SoC-level integration
• ASIC levels of performance and power consumption
• Flexibility of an FPGA
• Ease of programming of a microprocessor

Programmable Systems Integration
– ALL programmable platform: processor, PL, DSP, IOs & AMS
– Improved security in a fully integrated solution

Increased System Performance
– 1 GHz dual ARM Cortex-A9, much higher throughput vs. 2-chip solutions
– >10X SW acceleration with PL

BOM Cost Reduction
– Up to 40% system cost savings: fewer components (power supplies, memories, …), higher volumes with platform approach for better pricing

Total Power Reduction
– Up to 50%: processor power control modes, 28nm HPL process, integration saves chip-to-chip interconnection power

Accelerated Design Productivity
– Flexible and scalable platform, comprehensive ecosystem of tools, OS & IP
– High-level synthesis flow for faster PL development

Zynq-7000 AP SoC Applications Mapping

Z-7100


Zynq-7000 AP SoC Architecture

Complete ARM-based Processing System

Processor Core Complex
• Dual ARM Cortex-A9 MPCore with NEON™ extensions
• Single / Double Precision Floating Point support
• Up to 1 GHz operation

High BW Memory
• Internal
  – L1 Cache – 32KB/32KB (per core)
  – L2 Cache – 512KB unified
• On-Chip Memory of 256KB
• Integrated Memory Controllers (DDR3, DDR3L, DDR2, LPDDR2, 2x QSPI, NOR, NAND Flash)

AMBA Open Standard Interconnect
• High bandwidth interconnect between Processing System and Programmable Logic
• ACP port for enhanced hardware acceleration and cache coherency for additional soft processors

Integrated Memory Mapped Peripherals
• 2x USB 2.0 (OTG) w/DMA
• 2x Tri-mode Gigabit Ethernet w/DMA
• 2x SD/SDIO w/DMA
• 2x UART, 2x CAN 2.0B, 2x I2C, 2x SPI, 32b GPIO

Processing System Ready to Program

Primary System Interconnects: Maximizing Data Transfers

Programmable Logic to Memory
– 2 Ports to DDR Controller
– 1 Port to OCM SRAM

Central Interconnect
• Connects CPU Block to Common Peripherals, through the Central Interconnect

Master/Slave AXI Interfaces to Programmable Logic
– Crossbar switches for high bandwidth communications
– Processing System Master Ports
  • 2x 32b AXI Ports from Processing System to Programmable Logic
– Processing System Slave Ports
  • 2x 32b AXI Ports from Programmable Logic to Processing System

ACP (Accelerator Coherence Port)
– Low-latency cache-coherent port for programmable logic

Enables application-specific customizations with a standard programming model

[Block diagram labels: APU, L2 Cache, OCM, DMA, Central Interconnect, Peripherals, DDR Controller, NAND, NOR/SRAM, QSPI Controllers, ACP. Arrow direction shows control; data flows in both directions. Legend: Configurable AXI3 32 bit/64 bit; AXI3 64 bit; AXI3 32 bit; AHB 32 bit; APB 32 bit]

Tightly Integrated Programmable Logic

Built with State-of-the-art 7 Series Programmable Logic
• Artix-7 & Kintex-7 FPGA fabric
• 28K-444K logic cells
• 430K-6.6M equivalent ASIC gates
  (Note: ASIC equivalent gates based on analysis over a broad range of designs)

Over 3000 Internal Interconnects
• Up to ~100Gb of BW
• Memory-mapped interfaces

Integrated Analog Capability
• Dual multi-channel 12-bit A/D converter
• Up to 1Msps

Enables Massive Parallel Processing
• Up to 2020 DSP blocks delivering over 2662 GMACs

Scalable Density and Performance

7-Series Programmable Logic Fabric
IP portability between 7-series FPGA and Zynq-7000 AP SoC

– Logic Cells: LUT6 + 2 DFF
– DSP Slices: 25x18 + Acc. + pre-adder
– Block RAM: 36Kb blocks, Mem. or FIFO
– PCI-Express: Gen1 / Gen2, Endpoint or Root Port
– Flexible Transceivers: Multi Protocol, up to 12.5Gbps
– Flexible I/O: Multi Standard, High Speed
– A/D converters: 2x 12bit 1MSPS
– Clock Management: MMCM + PLL

Programmer’s View of Programmable Logic
Simple memory-mapped interface

Start Address   Description
0x0000_0000     External DDR RAM
0x4000_0000     Custom Peripherals (Programmable Logic including PCIe)
0xE000_0000     Fixed I/O Peripherals
0xF800_0000     Fixed Internal Peripherals (Timers, Watchdog, DMA, Interconnect)
0xFC00_0000     Flash Memory
0xFFFC_0000     On-Chip Memory

Programmer’s View of Custom Accelerators & Peripherals

Start Address   Description
0x4000_0000     Accelerator #1 (Video Scaler)
0x6000_0000     Accelerator #2 (Video Object Identification)
0x8000_0000     Peripheral #1 (Display Controller)

Code Snippet

    int main(void)
    {
        int *data   = (int *)0x10000000;  // buffer in external DDR
        int *accel1 = (int *)0x40000000;  // Accelerator #1 in the programmable logic

        // Pure SW processing
        Process_data_sw(data);

        // HW Accelerator-based processing
        Send_data_to_accel(data, accel1);
        process_data_hw(accel1);
        Recv_data_from_accel(data, accel1);

        return 0;
    }
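The memory-mapped view above can be sketched with volatile register accesses. This is a minimal illustration only: the register layout (`accel_regs_t`), the `ACCEL_START` / `ACCEL_DONE` bits and the function `accel_process` are hypothetical names, not taken from any Xilinx header. Only the base address 0x4000_0000 comes from the slide.

```c
#include <stdint.h>

/* Hypothetical register layout of a PL accelerator (illustrative only). */
typedef struct {
    volatile uint32_t ctrl;      /* bit 0: start                */
    volatile uint32_t status;    /* bit 0: done                 */
    volatile uint32_t data_in;   /* operand written by the CPU  */
    volatile uint32_t data_out;  /* result read back by the CPU */
} accel_regs_t;

#define ACCEL1_BASE 0x40000000u  /* Accelerator #1 start address from the slide */
#define ACCEL_START 0x1u
#define ACCEL_DONE  0x1u

/* Write one word, start the accelerator, poll until done, read the result. */
uint32_t accel_process(accel_regs_t *regs, uint32_t value)
{
    regs->data_in = value;
    regs->ctrl = ACCEL_START;
    while ((regs->status & ACCEL_DONE) == 0u) {
        /* busy-wait; a real driver would likely use an interrupt or a timeout */
    }
    return regs->data_out;
}
```

On the target, the call would look like `accel_process((accel_regs_t *)ACCEL1_BASE, x)`. On a host machine the same function can be exercised by passing the address of an ordinary `accel_regs_t` variable whose `status` field is preset to `ACCEL_DONE`.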

Flexible External I/O

54 Dedicated Peripheral I/Os
• Supports integrated peripherals
• Static memory (NAND, NOR, QSPI)
• More I/Os available through the Programmable Logic

73 Dedicated Memory I/Os
• DDR3 / DDR3L / DDR2 / LPDDR2 memory interfaces
• Configurable as 16bit or 32bit

Up to 400 Multi-Standard and High Performance I/O
• Up to 250 3.3V-capable multi-standard I/O
• Up to 150 high performance I/O
• Up to 17 differential ADC inputs

High Performance Integrated Serial Transceivers (7030 / 7045 / 7100)
• Up to 16 transceivers
• Operate at up to 12.5 Gb/s
• Support popular protocols
• Integrated PCIe Gen2 block

Flexibility Beyond Any Standard Processing Offering

Efficient Power Control

Sleep Mode*: ~100 mW

Device    Estimated Operating Range*
Z-7010    ~1W – 2W
Z-7020    ~2W – 3W
Z-7030    ~3W – 6W
Z-7045    ~5W – 15W
Z-7100    ~6W – 17W

* These represent typical power numbers

Zynq-7000 Device Portfolio Summary
Scalable platform offers easy migration between devices

Zynq-7000 AP SoC Devices: Z-7010, Z-7020, Z-7030, Z-7045, Z-7100

Processing System
– Processor Core: Dual ARM® Cortex™-A9 MPCore™
– Processor Extensions: NEON™ & Single / Double Precision Floating Point
– Max Frequency: 800 MHz or up to 1 GHz, depending on device
– Memory: L1 Cache 32KB I / D, L2 Cache 512KB, on-chip Memory 256KB
– External Memory Support: DDR3, DDR3L, DDR2, LPDDR2, 2x QSPI, NAND, NOR
– Peripherals: 2x USB 2.0 (OTG), 2x Tri-mode Gigabit Ethernet, 2x SD/SDIO, 2x UART, 2x CAN 2.0B, 2x I2C, 2x SPI, 4x 32b GPIO

Programmable Logic and I/O

                                          Z-7010          Z-7020          Z-7030          Z-7045          Z-7100
Approximate ASIC Gates                    ~430K (28k LC)  ~1.3M (85k LC)  ~1.9M (125k LC) ~5.2M (350k LC) ~6.6M (444k LC)
Block RAM                                 240KB           560KB           1,060KB         2,180KB         3,020KB
Peak DSP Performance (Symmetric FIR)      100 GMACS       276 GMACS       593 GMACS       1334 GMACS      2662 GMACS
PCI Express® (Root Complex or Endpoint)   -               -               Gen2 x4         Gen2 x8         Gen2 x8
Agile Mixed Signal (XADC)                 2x 12bit 1Msps A/D Converter (all devices)
Processor System IO                       130             130             130             130             130
Multi Standards 3.3V IO                   100             200             100             212             250
Multi Standards High Performance 1.8V IO  -               -               150             150             150
Multi Gigabit Transceivers                -               -               4               16              16

Zynq-7000 AP SoC Boot Modes and Boot Stages

Boot Mode Selection: Where to boot from?

• Five boot mode signals (wired MIO pins), mode[4:0], indicate the boot source, JTAG mode and PLL bypass selection
• Two voltage mode signals, vmode[1:0], indicate the voltage mode of the multiplexed I/O banks
• Except for JTAG, all boot devices can operate in Secure Mode
• Secure Mode can be enabled by the user
• SD boot mode supports FAT file system
  – “BOOT.bin” is the FSBL image

[Diagram: boot sources grouped under NOT SECURE and SECURE BOOT MODE; labels: NAND, NOR, Quad SPI, SD Card*, JTAG Debuggers]

* Need to use MIO[45:40] to boot from SD Card

Boot Modes: How to boot?

Non-secure
– Standard boot model

Secure
– Ability to have secure boot code & secure software for PS
– Protects Bitstream & IP

Debug & Development (JTAG)
– Debug the PS & PL

Secure Boot or Non-Secure Boot is defined by the user in the BOOT ROM header

Boot Stages Overview

Stage 0 – Boot ROM
– Provided by Xilinx
– Not user accessible

Stage 1 – First Stage Boot Loader
– User developed
– Xilinx provides as example

Stage 2 – Second Stage Boot Loader
– Optional
– User developed
– Xilinx provides as example

Non-Secure Boot

Stage 0: Boot Mode Selection
• Power up Zynq-7000 AP SoC
• Boot Mode Pins identify Boot Device
• BootROM code runs
• Copies First Stage Boot Loader to OCM

Stage 1: First Stage Boot Loader
• First Stage Boot Loader runs from OCM
• Loads PS Boot Data into specified memory (e.g. DDR) OR
• Enables Second stage boot (optional)
• Loads Bitstream and configures PL (optional)

Stage 2: 2nd Stage Boot Loader
• Rest of PS Boot data or PL Bitstream loaded
• Can be from Ethernet or USB etc.

Example Boot Process for Linux

Stage 0
• Power up Zynq
• On-chip ROM code runs: identifies the boot device by reading the mode pin status
• Copies First Stage Boot Loader from selected boot device to OCM RAM

Stage 1: FSBL
• First Stage Boot Loader runs from OCM RAM
• Bitstream is loaded and PL is configured
• U-Boot is loaded from boot device into DDR

Stage 2: U-Boot
• U-Boot runs from DDR
• Loads OS kernel from selected boot device
• Loads ramdisk from default boot device

Stage 3: OS (Kernel & Drivers)
• Linux kernel loaded to DDR
• RFS can be selected from the Linux command line
• RFS contains Linux applications

Secure Boot

Stage 0: Boot Mode Selection
• Power up Zynq-7000 AP SoC
• Boot Mode Pins identify Boot Device
• BootROM code runs
• Decrypts and authenticates FSBL using AES/SHA and then copies it into OCM (PL powered on)

Stage 1: First Stage Boot Loader
• First Stage Boot Loader runs from OCM
• Decrypts and authenticates PS Boot Data using AES/SHA engine and puts it into specified memory OR
• Enables Second stage boot (optional)
• Decrypts and authenticates Bitstream using AES/SHA and configures PL (optional)

Stage 2: U-Boot for Linux
• User developed code can be secure

Zynq-7000 AP SoC Ecosystem Development Elements

Comprehensive Partnership Ecosystem

Software Tools

Intellectual Property

System Architecture

Software OS & Middleware

Design Services

Over 100 Zynq Specific Partners … and Growing

Comprehensive Partnership Ecosystem

OS, Middleware and Tools solutions (Alliance Partners)

Design Service Ecosystem
– Alliance Partners qualified as Zynq Design Centers across Geo regions
  • Well trained on Zynq-7000 AP SoC platform
  • Strong expertise on FPGA and Embedded Processing
  • Close relationship with local Xilinx customers

Key IP solutions supporting segment focused applications
– 2D/3D Graphics, Video Imaging, Video Codec, PCIe, SATA, USB

Many Boards and Modules targeting specific markets and applications available from Alliance members
– Jump start customers to use Zynq-7000 AP SoC platform

Strong Training Ecosystem (Xilinx ATP)
– Zynq-7000 training classes available to broad customer base WW

http://www.xilinx.com/products/silicon-devices/soc/zynq-7000/ecosystem

Comprehensive Operating Systems Offering

• More than 95% of commercial embedded operating systems supported on Zynq-7000
• Scalable solution ranging from Real Time Operating Systems to fully featured Operating Systems
• Safety critical certifications in key industry segments
• Multi-core support in SMP and AMP mode
• Robust open source initiative for Linux and Android

Strong Embedded Linux Offering

Open Source
– Available on Xilinx GIT tree since July 2010 (EA)
– Available on main GIT tree since October 2011
– Wiki at http://wiki.xilinx.com
– Forums at http://forums.xilinx.com

from Xilinx

– PetaLinux
– WindRiver Linux – Yocto Project Compatible

Commercial from Partners

– MontaVista Linux – Carrier Grade Linux
– LinuxLink
– More to come

Xilinx acquired PetaLogix in Aug. 2012 to strengthen its Linux offering

Operating Systems Support Status

Availability refers to ZC702 board support.

OS                          Provider             Version  Availability       Market Segment

Linux
PetaLinux                   Xilinx               12.12    Now                Used in all segments
WR Linux 5                  Wind River           3.4      Now                Used in all segments
MVL CGE6                    Montavista           2.6.32   May ‘13            Communications
LinuxLink                   Timesys              NA       Now                Industrial, Medical, Automotive

Android                     iVeia                2.3      Now                Consumer
Windows Embedded Compact 7  Adeneo Embedded      7        Now                ISM, Automotive
VxWorks                     Wind River           6.9.2    Now                ISM, Automotive, A&D
INTEGRITY                   GreenHills Software  NA       Now                A&D, Automotive, ISM
QNX                         Adeneo Embedded      6.5      Q2CY13             Automotive, ISM
OSE                         ENEA                 5.5      Now                Communication
ThreadX/NetX                Express Logic        NA       Now                Consumer, Medical, Industrial
FreeRTOS                    Xilinx               7.x      Now                All segments
RTA-OS SC1-4                ETAS                 3.0      Now (single core)  Automotive
eCOS                        ITR                  3.0      Q4CY2012           ISM, Automotive
eT-Kernel                   eSOL                 TBD      Now                Automotive, Consumer
µc/OS                       Micrium              II       Now                Industrial, Medical, A&D
Nucleus                     Mentor               NA       Now                Industrial, Medical and Automotive
Quadros                     Quadros              NA       Now                Industrial, Medical, POS

Numerous SW Development Tool Options

In addition to Xilinx free SDK

Highly optimized ARM compilers from partners Advanced software development tools from world-class partners offering software profiling and tracing solutions


Vivado Design Suite

FPGA designs for fast time to market

Time to market keeps getting shorter, while projects keep getting more complex.

How can we stay ahead of today’s complexity and time-to-market requirements?

What is the status of the standard HDL flow? Are there new design flows?

Next Generation Productivity Challenges

Implementation bottlenecks
– Managing hierarchy & reuse in implementation
– Getting estimates early & closing on timing, utilization, power
– Debugging across tools, abstractions, changes
– Runtime & QOR scalability

System Integration bottlenecks
– Design and IP reuse
– Integrating algorithmic and RTL level IP
– Mixing DSP, embedded, connectivity, logic
– Verification of blocks and “systems”

Vivado Design Suite Accelerates Productivity

Accelerating Implementation Fast, Hierarchical and Deterministic Closure

Accelerating System Integration IP and System-centric Integration with Fast Verification


Vivado Key Enabling Technologies: Shared, Scalable Data Model

• Progressive estimation accuracy across the entire flow
• Reduced iterations late in the cycle
• Scales to >10M LUT
• Interactive design & debug environment
• Cross-probe from reports to schematics or HDL
• Integrate IP from any domain

[Diagram labels: Estimation; IP Integration; RTL Design; Synthesis; Place & Route; Shared, Scalable Data Model; views: RTL, Schematics, Placement, Reports (Timing Report with Timing Paths #1 to #3); edits: Code Changes, Tool Settings, Placement Edits]

Analytical Placer: Solving the Interconnect Bottleneck

[Plot: timing cost f(x) versus placement solution x (found by random moves and seeds); annotations: initial random seed, local moves, not routable, best solution found, optimal solution (not found)]

Traditional P&R versus Vivado P&R

“Cost” Criteria
– Traditional P&R: 1 dimension: timing minimization
– Vivado P&R: 3 dimensions: timing, congestion, wire length minimization

Primary Algorithm
– Traditional P&R: “Simulated Annealing”: random, iterative search based on initial seed
– Vivado P&R: Analytical: solves simultaneous equations to minimize all dimensions

Runtime
– Traditional P&R: Unpredictable; breaks at high utilization, congestion
– Vivado P&R: Most predictable

Scalability
– Traditional P&R: Does not scale to 1M LC
– Vivado P&R: Designed for 10M+ logic cells

Vivado Delivers Denser, Faster Implementation

               ISE      Vivado
P&R runtime    13 hrs   5 hrs
Memory usage   16 GB    9 GB

Wire length & congestion: significantly reduced

Concurrently Optimizes Timing, Device Utilization

Vivado Delivers the Industry’s Fastest and Most Predictable Run-time

• 4x faster than competition
• Most predictable run-times and results
• 2X the design capacity
• Supports 1+M LC designs where competition fails

[Chart: Runtime Comparison for 100+ designs, Xilinx vs. competition; x-axis: Design Size (LC), 0K to 2 000K; y-axis: Runtime (hours), 0 to 25; annotations: predictable run-times, ~4x run-time advantage]

Vivado Design Suite Technology Advantages

Accelerating Implementation Fast, Hierarchical and Deterministic Closure

Accelerating System Integration IP and System-centric Integration with Fast Verification

Vivado High-Level Synthesis (HLS) Accelerates Algorithmic C to IP, Co-Processing Accelerator Integration

Available today for C, C++, SystemC

Proven in customer designs

Supports:
– Rapid development and algorithm exploration
– Variable precision and floating-point

A clear differentiator over anything available!

Introducing IP Integrator: Enabling an IP Centric Design Flow

IP Packager
• Source (C, RTL, IP)
• Simulation models
• Documentation
• Example Designs
• Test bench

Standardized IP-XACT: Xilinx IP, 3rd Party IP, User IP

IP Subsystem
• Uses multiple plug-and-play forms of IP to implement functional subsystem
• Includes software drivers and API
• Accelerates integration and productivity

Vivado IP Integrator: Intelligent IP Integration

Correct-by-construction
– Interface level connections
– Extensible IP repository
– Real-time DRCs and parameter propagation / resolution
– Designer Assistance

Automated IP Subsystems
– Block automation for rapid design creation
– One click IP customization

Board Aware
– Support all 7 series and Zynq Platforms

[Screenshot callouts: Hierarchy Support; System Hierarchy View; Interface Connections with Live DRCs; TCL Console; Extensible IP Catalog]

Vivado 2013.2 Accelerates Time to Integration: IP Integrator, with Zynq, HLS, SysGen Integration


High-Level Synthesis with Xilinx Vivado HLS

© Copyright 2012 Xilinx

ESL: What is it?

[Diagram: design methodology by abstraction level. Labels: ESL, IDE, RTL, Netlist, Layout; Functionality, Model, Architecture, Gates, Silicon; High-Level Synthesis, Model-Based Design, Synthesis, Place & Route]

DSP Design Methodology

• Flexible design environment
• Floating-point and Fixed-point Hardware Generation
• Real time analog data acquisition
• World class C design flow

[Diagram labels: MATLAB / Simulink, Xilinx IP, Toolbox IP, HDL Coder, System Generator, C/C++, C Libraries, Vivado HLS, RTL, IP Catalog, Vivado IP Integrator, DSP Design Platforms, Analog Signals, A/D, D/A]

Where should we use HLS?

Datapath Centric: Definitely
• Video and smart video (Consumer, A&D)
• Imaging with multiple DSP processors (Medical)
• Baseband processing with multiple DSP processors (Wireless)
• Radar and smart radar (Automotive, A&D)

Compute Centric: Maybe
• Complex compute systems, complicated math (Science HPC, A&D)
• Fit
  – GPU for parallel processing
  – Floating Point
  – Linear Algebra
• No Fit
  – CPU-optimized code (pointers, casting, …)
  – System calls

Control Centric: For sure!
• Motor control (Industrial)
• Packet processing (Connectivity)

Who is HLS for?

Embedded Designer: Definitely
• Already a C expert
• Comfortable writing C
• Familiar with C development environments
• Hardware aware

Hardware Engineer: Definitely
• Already a hardware expert
• Might have some C experience, easy to pick up
• Needs to fine-tune some algorithmic aspects
• Gets a lot of test benches from algorithm designer
• Struggling with verification issues at RTL level

Algorithm Designer: How willing are you to learn about hardware?
• Pointer casting
• Hand optimized loops
• malloc(), free()

Design Flow …

VIVADO HLS

RTL Simulation


RTL Export

IP-Xact

SysGen

Pcore

Benefits of HLS

Productivity
– Verification
  • Functional
  • Architectural
– Abstraction
  • Datatypes
  • Interface
  • Classes
– Automation

Video Design Example

Input                 C Simulation Time   RTL Simulation Time   Improvement
10 frames 1280x720    10s                 ~2 days (ModelSim)    ~12000x

[Diagram labels: RTL (Spec), RTL (Sim); C (Spec/Sim), RTL (Sim)]

Block level specification AND verification significantly reduced

Benefits of HLS

Portability
– Processors and FPGAs
– Technology migration
– Cost reduction
– Power reduction

Design and IP reuse

Benefits of HLS

Permutability
– Architecture Exploration
  • Timing: Parallelization, Pipelining
  • Resources: Sharing
– Better QoR

Rapid design exploration delivers QoR rivaling hand-coded RTL

Vivado HLS Code Synthesis C, C++ and SystemC

Vivado HLS Projects and Solutions

Vivado HLS is project based
– A project specifies the source code which will be synthesized
– Each project is based on one set of source code
– Each project has a user specified name

A project can contain multiple solutions
– Solutions are different implementations of the same code
– Auto-named solution1, solution2, etc.
– Supports user specified names
– Solutions can have different clock frequencies, target technologies, synthesis directives

Projects and solutions are stored in a hierarchical directory structure
– Top-level is the project directory
– The disk directory structure is identical to the structure shown in the GUI project explorer (except for source code location)

[Screenshot callouts: Source, Project Level, Solution Level]

HLS Control & Datapath Extraction

Code

    void fir (
        data_t *y,
        coef_t c[4],
        data_t x
    ) {
        static data_t shift_reg[4];
        acc_t acc;
        int i;

        acc = 0;
        loop: for (i = 3; i >= 0; i--) {
            if (i == 0) {
                acc += x * c[0];
                shift_reg[0] = x;
            } else {
                shift_reg[i] = shift_reg[i-1];
                acc += shift_reg[i] * c[i];
            }
        }
        *y = acc;
    }

From any C code example ..

Operations
– RDx, RDc, >=, ==, -, +, *, +, *, WRy
– Operations are extracted…

Control Behavior
– Finite State Machine (FSM) states 0, 1, 2
– The control is known

Control & Datapath Behavior
– Control dataflow: the extracted operations (RDx, RDc, >=, -, ==, -, +, *, +, *, WRy) scheduled across the FSM states

A unified control dataflow behavior is created.
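Before synthesis, the same C function can be checked in plain software. The following is a minimal sketch of such a check. The `typedef`s (all `int`) and the helper `fir_impulse_response` are assumptions made for illustration; the slides do not define `data_t`, `coef_t` or `acc_t`. Feeding a unit impulse should return the coefficients in order.

```c
/* Assumed type definitions; the slides leave data_t, coef_t and acc_t open. */
typedef int data_t;
typedef int coef_t;
typedef int acc_t;

/* 4-tap FIR from the slide, reproduced so the sketch is self-contained. */
void fir(data_t *y, coef_t c[4], data_t x)
{
    static data_t shift_reg[4];
    acc_t acc = 0;
    int i;

    for (i = 3; i >= 0; i--) {
        if (i == 0) {
            acc += x * c[0];
            shift_reg[0] = x;
        } else {
            shift_reg[i] = shift_reg[i - 1];
            acc += shift_reg[i] * c[i];
        }
    }
    *y = acc;
}

/* Hypothetical helper: drive a unit impulse followed by zeros, collect n outputs. */
void fir_impulse_response(coef_t c[4], data_t out[], int n)
{
    int k;
    for (k = 0; k < n; k++) {
        fir(&out[k], c, (k == 0) ? 1 : 0);
    }
}
```

Starting from a cleared shift register, coefficients {1, 2, 3, 4} and five samples give the outputs 1, 2, 3, 4, 0. Because `shift_reg` is static, the filter keeps its state between calls, which is what the hardware registers would do.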

The Key Attributes of C code

Functions: All code is made up of functions which represent the design hierarchy: the same in hardware

Top Level IO: The arguments of the top-level function determine the hardware RTL interface ports

Types: All variables are of a defined type. The type can influence the area and performance

Loops: Functions typically contain loops. How these are handled can have a major impact on area and performance.

Arrays: Arrays are used often in C code. They can influence the device IO and become performance bottlenecks.

Operators: Operators in the C code may require sharing to control area or specific hardware implementations to meet performance

    void fir (
        data_t *y,
        coef_t c[4],
        data_t x
    ) {
        static data_t shift_reg[4];
        acc_t acc;
        int i;

        acc = 0;
        loop: for (i = 3; i >= 0; i--) {
            if (i == 0) {
                acc += x * c[0];
                shift_reg[0] = x;
            } else {
                shift_reg[i] = shift_reg[i-1];
                acc += shift_reg[i] * c[i];
            }
        }
        *y = acc;
    }

Let’s examine the default synthesis behavior of these …

C Functions Become RTL Hierarchy

Each function is translated into an RTL block
– Verilog module, VHDL entity
– Functions may be inlined to dissolve their hierarchy
  • Small functions may be automatically inlined

Source Code (my_code.c)

    void A() { ..body A.. }
    void B() { ..body B.. }
    void C() { B(); }
    void D() { B(); }

    void foo_top() {
        A(…);
        C(…);
        D(…);
    }

RTL hierarchy: foo_top contains A, C and D; C and D each contain an instance of B

Each block can be shared like any other component provided it’s not in use at the same time

Top-Level Function Arguments Become Ports

    //Top Level Function for hardware synthesis
    void image_demo(AXI_PIXEL in_pix[MAX_HEIGHT][MAX_WIDTH],
                    AXI_PIXEL out_pix2[MAX_HEIGHT][MAX_WIDTH],
                    int rows, int cols) {
    }

RTL Ports AND protocols
– Created by synthesis directives
– Design scheduled to match IO timing

Ports of the synthesized image_demo RTL
– ap_clk, ap_rst: Clock & Reset ports
– ap_start, ap_done, ap_idle: Design Start, Idle & Done ports
– rows, cols: No IO protocol (for config data)
– in_pix_data, in_pix_empty_n, in_pix_read: in_pix implemented as FIFO port
– out_pix2_data, out_pix2_full_n, out_pix2_write: out_pix2 implemented as FIFO port

Bus Interfaces: Added on RTL Export

AXI4 interfaces added at IP export
– RTL ports can be grouped into a single AXI4 slave interface
– C function and header are provided for slave controller / CPU

Exported design (image_demo RTL)
– ap_clk, ap_rst
– ap_start, ap_done, ap_idle, rows, cols: AXI4 Slave. Control ports grouped into a common interface. Software header files provided for CPU control.
– in_pix_data, in_pix_empty_n, in_pix_read: AXI4 Stream. Streaming input data.
– out_pix_data, out_pix_full_n, out_pix_write: AXI4 Stream. Streaming output data.

C, C++ and SystemC Support

The vast majority of C, C++ and SystemC is supported
– Provided it is statically defined at compile time
– If it’s not defined until run time, it won’t be synthesizable

Certain Code Must be Changed: Cannot be Synthesized
– Dynamic memory allocation: size is indeterminate
– System calls: cannot access the disk or OS features (e.g. date)

Certain Code Should be Changed: For Performance
– Code is optimized to get performance when run on a CPU, a DSP or a GPU, and the same holds for an FPGA
  • Changes to synthesize on an FPGA are less than to run on a DSP or GPU
– Changes like this are nothing to run away from: it’s normal, sane behavior
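As an illustration of the "statically defined at compile time" rule, the sketch below replaces a run-time sized `malloc()` buffer with a fixed-size array. The function name `sum_of_squares` and the bound `MAX_SAMPLES` are hypothetical and chosen only for this example.

```c
#include <stddef.h>

#define MAX_SAMPLES 64  /* hypothetical compile-time upper bound */

/*
 * A software version might allocate the scratch buffer with malloc(n * sizeof(int)).
 * Here the buffer has a size known at compile time, so a tool can map it to a memory.
 */
long sum_of_squares(const int *in, size_t n)
{
    int scratch[MAX_SAMPLES];
    long total = 0;
    size_t i;

    if (n > MAX_SAMPLES) {
        n = MAX_SAMPLES;  /* clamp instead of growing the buffer at run time */
    }
    for (i = 0; i < n; i++) {
        scratch[i] = in[i] * in[i];
    }
    for (i = 0; i < n; i++) {
        total += scratch[i];
    }
    return total;
}
```

The trade-off is that the worst-case size must be chosen up front; inputs longer than the bound are truncated in this sketch.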

Vivado HLS Design Optimization

Data Types Determine the Size of the Operators

Code

    void fir (
        data_t *y,
        coef_t c[4],
        data_t x
    ) {
        static data_t shift_reg[4];
        acc_t acc;
        int i;

        acc = 0;
        loop: for (i = 3; i >= 0; i--) {
            if (i == 0) {
                acc += x * c[0];
                shift_reg[0] = x;
            } else {
                shift_reg[i] = shift_reg[i-1];
                acc += shift_reg[i] * c[i];
            }
        }
        *y = acc;
    }

From any C code example ..

Operations
– RDx, RDc, >=, ==, +, *, +, *, WRy
– Operations are extracted…

Types

Standard C types
– long long (64-bit), int (32-bit), short (16-bit), char (8-bit)
– float (32-bit), double (64-bit)
– unsigned types

For floats and doubles there must be an FP core in the library that binding can map to; otherwise the code cannot be synthesized.

Arbitrary Precision types
– C: ap(u)int types (1-1024)
– C++: ap_(u)int types (1-1024), ap_fixed types
– C++/SystemC: sc_(u)int types (1-1024), sc_fixed types

Can be used to define any variable to be a specific bitwidth (e.g. 17-bit, 47-bit etc).

The C types define the size of the hardware used: handled automatically
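To give a feel for what a non-standard bit width means, the sketch below emulates an unsigned 17-bit adder in plain C by masking a 32-bit value. This is only a host-side emulation written for this example; it is not how the arbitrary precision types are implemented, and `add_u17` is a hypothetical name.

```c
#include <stdint.h>

#define WIDTH 17u
#define MASK  ((1u << WIDTH) - 1u)  /* 0x1FFFF: keeps the low 17 bits */

/* Add two values and wrap the result to 17 bits, as a 17-bit adder would. */
uint32_t add_u17(uint32_t a, uint32_t b)
{
    return (a + b) & MASK;
}
```

With 17 bits the largest value is 131071, so `add_u17(131071, 1)` wraps to 0. In hardware, the narrower type would translate into a 17-bit adder rather than a 32-bit one.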

Review: Latency and Throughput

Design Latency
– The latency of the design is the number of cycles it takes to output the result
  • In this example the latency is 10 cycles

    void foo_top (a,b,c,d, *x, *y) {
        ...
        func_A(…);
        func_B(…);
        func_C(…);
        func_D(…);
    }

[Timeline: func_A, func_B, func_C, func_D run back to back; Latency = 10 cycles]

Design Throughput
– The throughput of the design is the number of cycles between new inputs
  • By default (no concurrency) this is the same as latency
  • Next start/read is when this transaction ends

[Timeline: new input read, then func_A, func_B, func_C, func_D, then next new input read; Throughput = 10 cycles]

Latency and Throughput

In the absence of any concurrency
– Latency is the same as throughput

    void foo_top (a,b,c,d, *x, *y) {
        ...
        func_A(…);
        func_B(…);
        func_C(…);
        func_D(…);
    }

[Timeline: new input read, then func_A, func_B, func_C, func_D, then next new input read; Latency = 10 cycles, Throughput = 10 cycles]

Pipelining for higher throughput
– VHLS can pipeline functions and loops to improve throughput

[Timeline: a second func_A to func_D sequence starts before the first one has finished; Latency = 10 cycles, Throughput = 4 cycles]

– Latency and throughput are related
– This presentation will discuss optimizing for latency first, then throughput
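The relation between the two numbers can be made concrete with a little arithmetic. Assuming a steady stream of inputs, the first result appears after the latency and each further result follows after one throughput interval. The helper `total_cycles` below is a hypothetical name used only for this sketch.

```c
/*
 * Cycles needed to process n inputs, assuming the first result takes `latency`
 * cycles and a new input is accepted every `interval` cycles.
 */
unsigned long total_cycles(unsigned long n, unsigned long latency,
                           unsigned long interval)
{
    if (n == 0) {
        return 0;
    }
    return latency + (n - 1) * interval;
}
```

With the slide's numbers, 100 inputs take 10 + 99 × 10 = 1000 cycles without pipelining and 10 + 99 × 4 = 406 cycles with a throughput of 4 cycles, even though the latency of a single transaction is unchanged.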

Loops

By default, loops are rolled
– Each C loop iteration is implemented in the same state
– Each C loop iteration is implemented with the same resources

    void foo_top (…) {
        ...
        Add: for (i=3;i>=0;i--) {
            b = a[i] + b;
            ...
        }
    }

[Diagram: synthesis of foo_top; a[N] feeds a single adder (+) that accumulates into b over N iterations]

Loops require labels if they are to be referenced by Tcl directives (GUI will auto-add labels)

– Loops can be unrolled if their indices are statically determinable at elaboration time
  • Not when the number of iterations is variable

Unrolled loops can reduce latency

    void foo_top (…) {
        ...
        Add: for (i=3;i>=0;i--) {
            b = a[i] + b;
            ...
        }
    }

Select loop “Add” in the directives pane, right-click & select unroll

[Diagram: unrolled foo_top; a[3], a[2], a[1] and a[0] each feed an adder (+) contributing to b. Three schedules (Option 1, Option 2, Option 3) show iterations 3, 2, 1, 0 placed in progressively fewer clock cycles]

Unrolled loops are likely to result in more hardware resources and higher area

Unrolled loops allow greater option & exploration
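What unrolling does to the accumulation loop can be sketched by writing both forms by hand in C. In the tool the unrolling is requested through a directive rather than by rewriting the source; the functions `sum_rolled` and `sum_unrolled` below are hypothetical names used only for illustration.

```c
/* Rolled form: one addition per iteration, the same adder reused four times. */
int sum_rolled(const int a[4], int b)
{
    int i;
    for (i = 3; i >= 0; i--) {
        b = a[i] + b;
    }
    return b;
}

/*
 * Hand-unrolled form: all four elements are visible at once, so the two partial
 * sums are independent and could be computed in parallel by separate adders.
 */
int sum_unrolled(const int a[4], int b)
{
    int hi = a[3] + a[2];
    int lo = a[1] + a[0];
    return (hi + lo) + b;
}
```

Both functions return the same value for integer inputs that do not overflow. The unrolled form uses more adders but shortens the dependency chain, which is the area versus latency trade-off described above.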

Loop Flattening

VHLS can automatically flatten nested loops
– A faster approach than manually changing the code

Flattening should be specified on the innermost loop
– It will be flattened into the loop above
– The “off” option can prevent loops in the hierarchy from being flattened

Loops will be flattened by default: use “off” to disable

Before flattening: states 1 (x4), 2 (x4), 3 (x4), 4 (x4); 36 transitions

    void foo_top (…) {
        ...
        L1: for (i=3;i>=0;i--) {
            [loop body l1 ]
        }
        L2: for (i=3;i>=0;i--) {
            L3: for (j=3;j>=0;j--) {
                [loop body l3 ]
            }
        }
        L4: for (i=3;i>=0;i--) {
            [loop body l4 ]
        }
    }

After flattening: states 1 (x4), 2 (x16), 4 (x4); 28 transitions

    void foo_top (…) {
        ...
        L1: for (i=3;i>=0;i--) {
            [loop body l1 ]
        }
        L2: for (k=15;k>=0;k--) {
            [loop body l3 ]
        }
        L4: for (i=3;i>=0;i--) {
            [loop body l4 ]
        }
    }
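The equivalence between the nested L2/L3 loops and a single 16-iteration loop can be checked in software. In this sketch the loop body (a weighted sum) and the names `sum_nested` and `sum_flat` are made up for illustration; the point is only the index mapping `i = k / COLS`, `j = k % COLS`.

```c
#define ROWS 4
#define COLS 4

/* Nested form: outer loop over i, inner loop over j. */
int sum_nested(int m[ROWS][COLS])
{
    int i, j, acc = 0;
    for (i = ROWS - 1; i >= 0; i--) {
        for (j = COLS - 1; j >= 0; j--) {
            acc += m[i][j] * (i + 1);
        }
    }
    return acc;
}

/* Flattened form: one loop of ROWS * COLS iterations, indices recovered from k. */
int sum_flat(int m[ROWS][COLS])
{
    int k, acc = 0;
    for (k = ROWS * COLS - 1; k >= 0; k--) {
        int i = k / COLS;
        int j = k % COLS;
        acc += m[i][j] * (i + 1);
    }
    return acc;
}
```

Both versions visit the same (i, j) pairs in the same order, so they produce identical results. The flattened form simply avoids entering and leaving the inner loop on every outer iteration.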

Loop Merging

VHLS can automatically merge loops
– A faster approach than manually changing the code
– Allows for more efficient architecture explorations
– FIFO reads, which must occur in strict order, can prevent loop merging
  • Can be done with the “force” option: user takes responsibility for correctness

Before merging (36 transitions):

void foo_top (…) { ...
  L1: for (i=3;i>=0;i--) { [loop body l1 ] }
  L2: for (i=3;i>=0;i--) {
    L3: for (j=3;j>=0;j--) { [loop body l3 ] }
  }                                              (already flattened)
  L4: for (i=3;i>=0;i--) { [loop body l4 ] }

[Figure: state diagram with states 1 (x4), 2 (x4), 3 (x4) and 4 (x4)]

After merging (18 transitions):

void foo_top (…) { ...
  L123: for (l=15;l>=0;l--) {
    if (cond1) [loop body l1 ]
    [loop body l3 ]
    if (cond4) [loop body l4 ]
  }

[Figure: state diagram with a single state 1 (x16)]

Page 78
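A minimal sketch of the merge, assuming the three loop bodies work on unrelated data so that interleaving them is safe. The arrays `x`, `y`, `z`, the function names and the guard `l < SHORT_TRIP` (playing the role of cond1 and cond4) are all hypothetical choices for this example.

```c
#define SHORT_TRIP 4  /* trip count of L1 and L4 */
#define LONG_TRIP 16  /* trip count of the flattened L2/L3 loop */

/* Three separate loops, executed one after another. */
void run_separate(int x[SHORT_TRIP], int y[LONG_TRIP], int z[SHORT_TRIP]) {
    for (int i = SHORT_TRIP - 1; i >= 0; i--) { x[i] += 1; } /* body l1 */
    for (int k = LONG_TRIP - 1; k >= 0; k--)  { y[k] *= 2; } /* body l3 */
    for (int i = SHORT_TRIP - 1; i >= 0; i--) { z[i] -= 1; } /* body l4 */
}

/* Merged form: a single loop with the longest trip count. The guards
 * restrict the short bodies to four of the sixteen iterations. */
void run_merged(int x[SHORT_TRIP], int y[LONG_TRIP], int z[SHORT_TRIP]) {
    for (int l = LONG_TRIP - 1; l >= 0; l--) {
        if (l < SHORT_TRIP) { x[l] += 1; }
        y[l] *= 2;
        if (l < SHORT_TRIP) { z[l] -= 1; }
    }
}
```

The two functions produce the same arrays here only because the bodies are independent. If one body consumed data in a strict order, as a FIFO read does, the interleaving could change the result, which is why merging in that case has to be forced by the user.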

Review: Arrays in HLS

An array in C code is implemented by a memory in the RTL
– By default, arrays are implemented as RAMs, optionally a FIFO

void foo_top(int x, …) {
  int A[N];
  L1: for (i = 0; i < N; i++)
    A[i+x] = A[i] + i;
}

[Figure: synthesis maps array A[N] (elements 0 … N-1) in foo_top to a RAM_1P with ports DIN, ADDR, DOUT, CE and WE; signals labeled A_in and A_out]

The array can be targeted to any memory resource in the library
– The ports and sequential operation are defined by the library model
  • All RAMs are listed in the VHLS Library Guide (list of available cores)

Example: Array “A” is targeted to a single port distributed RAM resource

Page 79

Arrays: Performance bottlenecks

Arrays are intuitive and useful software constructs
– They allow the C algorithm to be easily captured and understood

Array accesses can often be performance bottlenecks
– Arrays are targeted to a default RAM
  • May not be the most ideal memory for performance
  • Cannot pipeline with a throughput of 1

void foo_top (…) { ...
  for (i = 2; i < N; i++)
    mem[i] = mem[i-1] + mem[i-2];
}

[Figure: two alternative schedules of the RD, RD, + and WR operations of the loop body]

Even with a dual-port RAM, we cannot perform all reads and writes in one cycle

VHLS allows arrays to be partitioned and reshaped
– Allows more optimal configuration of the array
– Provides better implementation of the memory resource

Page 80
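The bottleneck can be put in numbers with a simple port-count bound. The sketch below assumes each RAM port serves one access per clock cycle and ignores every other constraint, including the loop-carried dependence on mem[i-1]. The helper name `min_ii_from_ports` is hypothetical.

```c
/* Lower bound on the number of cycles between the starts of two
 * consecutive loop iterations, as imposed by memory ports alone:
 * ceil(accesses_per_iteration / ports).
 * Assumes ports > 0 and one access per port per cycle. */
unsigned min_ii_from_ports(unsigned accesses_per_iteration, unsigned ports) {
    return (accesses_per_iteration + ports - 1u) / ports;
}
```

The loop body `mem[i] = mem[i-1] + mem[i-2]` needs two reads and one write, so three accesses. With one port the bound is 3 cycles, with a dual-port RAM it is 2, and only with three independent access paths does it reach 1. Partitioning the array into several memories is one way to obtain those extra paths.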

Array Partitioning

Partitioning breaks an array into smaller elements
• If the factor is not an integer multiple the final array has fewer elements
• Arrays can be split along any dimension
  • If none is specified dimension zero is assumed
  • Dimension zero means all dimensions
• All partitions inherit the same resource target
  • That is, whatever RAM is specified as the resource target
  • Except of course “complete”

Partitioning array1[N] = {0, 1, …, N-1} (block and cyclic shown with a factor of 2):

block:     {0, 1, …, (N/2-1)} and {N/2, …, N-2, N-1}
           Divided into blocks: N/factor elements each
cyclic:    {0, 2, …, N-2} and {1, …, N-3, N-1}
           Divided into blocks: 1 word at a time (like “dealing cards”)
complete:  {0}, {1}, {2}, …, {N-3}, {N-2}, {N-1}
           Individual elements: break a RAM into registers (no “factor” supported)

Multiple memories allow greater parallel access

Page 81
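The block and cyclic schemes can be described as index mappings from the original array to a (memory, offset) pair. The sketch below is one plausible way to write those mappings; the type `location_t` and both function names are hypothetical, and the rounding used for block partitioning (each memory holds ceil(n / factor) elements) is an assumption chosen so that the final memory ends up with fewer elements when the factor does not divide the size.

```c
typedef struct {
    unsigned bank;   /* which of the smaller memories holds the element */
    unsigned offset; /* index of the element inside that memory */
} location_t;

/* Block: consecutive elements stay together. Each bank holds
 * ceil(n / factor) elements, so the last bank may hold fewer. */
location_t block_location(unsigned i, unsigned n, unsigned factor) {
    unsigned bank_size = (n + factor - 1u) / factor;
    location_t loc = { i / bank_size, i % bank_size };
    return loc;
}

/* Cyclic: elements are dealt out one at a time, like cards. */
location_t cyclic_location(unsigned i, unsigned factor) {
    location_t loc = { i % factor, i / factor };
    return loc;
}
```

For N = 8 and a factor of 2, element 5 lands in the second memory in both schemes, but neighbours 4 and 5 share a memory under block partitioning and sit in different memories under cyclic partitioning. Which scheme helps depends on which elements the loop body touches in the same cycle.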

Array Dimensions

The array options can be performed on dimensions of the array

my_array[10][6][4]: Dimension 1 is [10], Dimension 2 is [6], Dimension 3 is [4]; Dimension 0 means all dimensions

Examples

my_array[10][6][4], partition dimension 3:
  my_array_0[10][6]  my_array_1[10][6]  my_array_2[10][6]  my_array_3[10][6]

my_array[10][6][4], partition dimension 1:
  my_array_0[6][4]  my_array_1[6][4]  my_array_2[6][4]  my_array_3[6][4]  my_array_4[6][4]
  my_array_5[6][4]  my_array_6[6][4]  my_array_7[6][4]  my_array_8[6][4]  my_array_9[6][4]

my_array[10][6][4], partition dimension 0:
  10x6x4 = 240 individual registers

Page 82
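The counts in these examples follow from the array shape alone. As a small check, the sketch below computes how many separate pieces a complete partition produces for a given dimension; the function name `partition_count` and its signature are hypothetical.

```c
/* Number of pieces produced when an array is partitioned completely
 * along dimension `dim` (1-based). dim == 0 means all dimensions,
 * which yields one register per element.
 * Assumes dim <= rank. */
unsigned partition_count(const unsigned sizes[], unsigned rank, unsigned dim) {
    if (dim == 0u) {
        unsigned total = 1u;
        for (unsigned d = 0u; d < rank; d++) {
            total *= sizes[d];
        }
        return total;
    }
    return sizes[dim - 1u];
}
```

With sizes {10, 6, 4}, dimension 3 gives 4 pieces, dimension 1 gives 10 pieces and dimension 0 gives 240, matching the three examples.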

Loop and Function Pipelining

[Figure: schedule without pipelining compared with the schedule with pipelining]

Loop:for(i=1;i […] (xn, xk, &fft_status, &fft_config); dummy_proc_be(&fft_status, ovflo, xk, out);

Page 91

Co-processing with Zynq-7000 AP SoC

All Programmable SoC Approach

[Figure: design flow from Requirements into parallel SW and HW tracks, each with Spec, Iterate and Verify steps]

Page 94

Vivado High-Level Synthesis

[Figure: the same flow of Requirements, SW and HW Spec, Iterate and Verify]

Accelerates Algorithmic C to Co-Processing Accelerator Integration

Page 95

Accelerator: Use Case Models

Model 1: Data Flow Model (Control – Data Flow Model)
• Custom IP for complex function & data flows
• PS for control & resource management

Model 2: Acceleration Model
• Balances SW/HW partition
• PS primary compute platform
• PL for HW Acceleration
• Communication between PS & PL: High

[Figure: block diagrams of both models built from PS, Standard Mem, Standard IO, Custom IP and Custom I/O blocks. Each model carries a relative-importance ranking from 1 to 4 of Increased System Performance, BOM Cost Reduction, Total Power Reduction and Accelerated Design Productivity under Programmable Systems Integration; # = relative importance]

Page 96

Zynq-7000 Drive Reference Design Platform

[Figure: block diagram of the drive reference design. Labeled blocks: DDR, PC, UDP/IP @ 1 Gbit, UDP/ Agent Dynamic Link Library, Process Data, SDC, and two motor-control channels, each with Current Measurement, Field Oriented Control IP, SVM - RPFM, Bridge, Observer (Speed), Encoder readout and a three-phase motor model (ra, La, rb, Lb, rc, Lc)]

Page 97

Zynq-7000 EtherCAT Performance Advantage Example

ARM core up to 1 GHz – Fabric FOC 4 Motors + EtherCAT®

Processing steps:
– Read / Write EtherCAT ESC: Commands + Status
– Octets to Floating Point Conversion
– Positioning Controller: Execute motor position
– Speed Controller: Estimate and correct error in speed
– Field Oriented Control (FOC Motor 0, FOC Motor 1, FOC Motor 2, FOC Motor 3)
– Current Controller: Estimate and correct error in torque
– Power Modulation: PWM, SVM, RPFM

Timing:
– Best Drive in the market: 62.5 us
– Zynq: 16 us, 4X faster response time
– Zynq: 1.6 us, 40x faster control loop

Zynq-7000 EtherCAT® Intelligent Electric Drive Using ZC702, FMC-MC1 and FMC-ISMNET

Page 99

© Copyright 2013 Xilinx

Zynq-7000 EtherCAT System Using ZC702, ZedBoard and ZedBoard Master with QNX® and SOEM

ISM-NET

Page 100


Accelerator Attached as Slave

Pro: Simple System Architecture, Simple Register Interface
Con: Limited communication bandwidth

[Figure: Processing System (ARM® Dual Cortex-A9 MPCore™ System, Common Peripherals, Memory Interfaces, Slave Port) and Programmable Logic (AXI4 interconnect, Acc. 1, Acc. 2)]

Page 101
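To give a feel for what a “simple register interface” means on the software side, here is a minimal sketch of a driver for a slave-attached accelerator. The register layout, the bit assignments and all names (`acc_regs_t`, `acc_start`, `acc_poll`, `ACC_CTRL_START`, `ACC_CTRL_DONE`) are hypothetical and are not taken from the presentation or from any specific Xilinx IP. On real hardware the structure pointer would refer to the accelerator's memory-mapped address range.

```c
#include <stdint.h>

/* Hypothetical register layout of one accelerator. */
typedef struct {
    volatile uint32_t ctrl;   /* bit 0: start request, bit 1: done flag */
    volatile uint32_t arg;    /* input operand */
    volatile uint32_t result; /* output value, valid once done is set */
} acc_regs_t;

#define ACC_CTRL_START (1u << 0)
#define ACC_CTRL_DONE  (1u << 1)

/* Write the operand, then request a start. */
void acc_start(acc_regs_t *regs, uint32_t arg) {
    regs->arg = arg;
    regs->ctrl = ACC_CTRL_START;
}

/* Non-blocking check: returns 1 and stores the result once the
 * accelerator reports done, otherwise returns 0. */
int acc_poll(acc_regs_t *regs, uint32_t *result) {
    if ((regs->ctrl & ACC_CTRL_DONE) == 0u) {
        return 0;
    }
    *result = regs->result;
    return 1;
}
```

In this style every operand and result crosses the interconnect as an individual register access issued by the processor, which is easy to program but is also where the limited bandwidth of the slave attachment comes from.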

Accelerator Attached as Master - High Performance Port Direct to OCM or DDR Memory

Pro: High Data Bandwidth
Con: Increased Design Complexity, Increased Latency

[Figure: Zynq Processing System (ARM® Dual Cortex-A9 MPCore™ System, Common Peripherals, Memory Interfaces, Slave Port, High Performance Port) and Programmable Logic (two AXI4 interconnects, two AXI_DMA blocks, Acc. 1, Acc. 2)]

Page 102

Accelerator Attached as Master - With Coherent DMA to L1 Caches

Pro: Low latency, High data bandwidth for short bursts
Con: Increased Design Complexity, Most efficient for data that fits in caches

[Figure: Zynq Processing System (ARM® Dual Cortex-A9 MPCore™ System, Common Peripherals, Memory Interfaces, Slave Port, Accelerator Coherency Port) and Programmable Logic (two AXI4 interconnects, two AXI_DMA blocks, Acc. 1, Acc. 2)]

Page 103

Co-processing with Zynq-7000 AP SoC Live Demo

Design example: matrix multiply

Simple and very common linear algebra example
– Easy to integrate floating point variables and arithmetic operations

A*B=C

Algorithm maps naturally to nested loops
– Loop 1: Iterate over rows of A
  • Loop 2: Iterate over columns of B
    – Loop 3: Multiply each index of row vector A with an index of column vector B and accumulate

Algorithm is easily parallelizable
– No data recurrence/dependencies

Page 105

Design example: the original C code

Code snippet (single-precision floating-point real-valued)

void matrix_multiply_hw(
    float mat_in1[DIM][DIM],
    float mat_in2[DIM][DIM],
    float mat_out[DIM][DIM])
{
  // matrix multiplication of an A*B matrix
  L1: for (int y = 0; y < DIM; y++)                        // Loop 1 (rows of A)
    L2: for (int x = 0; x < DIM; x++) {                    // Loop 2 (columns of B)
      float sum = 0;

      L3: for (int i = 0; i < DIM; i++)                    // Loop 3 (vector index)
        sum = sum + mat_in1[y][i] * mat_in2[i][x];         // FP multiply, FP add

      mat_out[y][x] = sum;
    }
  return;
}

Page 106

Design example: the original C code
Default implementation

Matrix multiplication uses comparable numbers of adds versus mults, so it is a balanced example

[Figure: default mapping. The A matrix and B matrix, 2-D arrays in C/C++, sit behind a Memory I/F; the datapath consists of an FMUL, an FADD and a REG, with a Control block]

Page 107

Optimizations: array partitioning

Repartition inputs to allow parallel read/write

Parallelize mult/add by unrolling inner loop.

[Figure: partitioned A matrix and B matrix feeding parallel multipliers (*) and adders (+)]

Page 108
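The effect of unrolling the inner loop can be written out by hand for a small matrix. The sketch below assumes DIM is 4, a value the slides do not give, and uses the hypothetical function name `matrix_multiply_unrolled`. The directive spellings quoted in the comments are shown only as an indication of how such requests are typically written for Vivado HLS; they are comments here so that any C compiler accepts the code.

```c
#define DIM 4 /* hypothetical size; the slides do not state DIM */

void matrix_multiply_unrolled(float mat_in1[DIM][DIM],
                              float mat_in2[DIM][DIM],
                              float mat_out[DIM][DIM]) {
    /* Partitioning requests would typically look like:
     *   #pragma HLS array_partition variable=mat_in1 complete dim=2
     *   #pragma HLS array_partition variable=mat_in2 complete dim=1
     * so that one row of A and one column of B can be read together. */
    for (int y = 0; y < DIM; y++) {
        for (int x = 0; x < DIM; x++) {
            /* Loop 3 written out for DIM == 4: four multiplies that
             * no longer share a single memory port. */
            float sum = 0;
            sum = sum + mat_in1[y][0] * mat_in2[0][x];
            sum = sum + mat_in1[y][1] * mat_in2[1][x];
            sum = sum + mat_in1[y][2] * mat_in2[2][x];
            sum = sum + mat_in1[y][3] * mat_in2[3][x];
            mat_out[y][x] = sum;
        }
    }
}
```

The additions are kept in the same order as in the rolled loop, so the floating-point result is unchanged. Without the partitioning, the four reads of each input would still queue up on the same memory and the unrolling alone would gain little.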

All Programmable Abstraction
The Future of Xilinx Design Tools

Productivity: Automation + Abstraction

[Figure: abstraction versus automation chart. Labels: C Derivatives and Model Based, C Programming, Behavioral, Assembly, IP, Binary, RTL, Gate/Layout]

Page 110

Xilinx Strategy
All Programmable Abstractions

[Figure: abstraction versus automation chart with IP and IPI]

Page 111

Today’s Hardware Design Abstractions

Zynq™-7000 All Programmable SoCs: accelerated time to integration

Abstracting hardware through increasing layers of automation
Automating not dictating design flows

[Figure: abstraction layers. Behavioral: C/C++, Vivado HLS. Model Based: MathWorks, National Instruments. IP Extractions: Vivado IP Integrator]

Page 112

Abstraction Evolution - System Level

Abstracting all hardware through increasing layers of automation

[Figure: System Level abstraction on top of Model Based (MathWorks, National Instruments), Behavioral (C/C++, Vivado HLS) and IP Extractions (Vivado IPI)]

Page 113

ALL Programmable Abstractions for All Programmable Devices

Summary

The Coprocessing Value

Programmable Systems Integration: All Programmable SoCs deliver tight coupling of processors and Programmable Logic, enabling coprocessing accelerators

Increased System Performance: Offloading processing tasks to Programmable Logic can appreciably increase system level performance

BOM Cost Reduction: Accelerators can eliminate multiple general purpose or DSP processors

Total Power Reduction: Moving processing functions to Programmable Logic can result in significant system level power reduction

Accelerated Design Productivity: Using HLS to create accelerators can dramatically reduce development time. Using IPI provides an unmatched ease of use in system integration

Page 115

Follow Xilinx

facebook.com/XilinxInc

twitter.com/#!/XilinxInc

youtube.com/XilinxInc

Q&A

Thank You
