Implementation of a Motion Estimation Hardware Accelerator on Zynq SoC

Thomas Makryniotis, Minas Dasygenis
Laboratory of Digital Systems and Computer Architecture
Dept. of Informatics and Telecommunications Engineering
University of Western Macedonia

MOCAST 2017 Thessaloniki - Greece, 4-6 May 2017

Outline
● Introduction
● Related Work
● Video Encoding
● ZYNQ Architecture
● Hardware Design
● Software Design
● Future work

Introduction (1/2)
● New trends in multimedia, especially in video applications, demand the best possible quality
● Low bandwidth and memory consumption should be respected
● Ultra-High Definition standards (4K & 8K) are further fueling the need for fast and reliable encoding, with minimal losses
● Motion Estimation is the most time-consuming and computationally expensive procedure of video encoding

Introduction (2/2)
● Many algorithms have been suggested for the Motion Estimation (ME) process
● Full-Search Motion Estimation is the most reliable, producing the best possible results at the cost of low speed and high energy consumption (especially in software implementations)
● To reduce the time and energy needed, we developed an accelerator-based embedded system with the sole purpose of performing ME calculations
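As a point of reference for what the accelerator speeds up, here is a minimal software sketch of full-search block matching. It is not the authors' code: the function names (`sad`, `crop`, `full_search`) and the tiny 2 × 2 example are assumptions made for illustration. Every candidate position is tested, which is why the method is reliable but slow in software.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized 2-D blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def crop(frame, top, left, size):
    """Return the size x size sub-block whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def full_search(current_block, search_area):
    """Test every candidate position; return (best_sad, (dy, dx))."""
    n = len(current_block)
    span = len(search_area) - n + 1  # candidate offsets per axis
    best = None
    for dy in range(span):
        for dx in range(span):
            cost = sad(current_block, crop(search_area, dy, dx, n))
            if best is None or cost < best[0]:
                best = (cost, (dy, dx))
    return best


# Tiny demonstration: a 2x2 block hidden inside a 4x4 search area.
area = [[0, 0, 0, 0],
        [0, 0, 9, 8],
        [0, 0, 7, 6],
        [0, 0, 0, 0]]
block = [[9, 8],
         [7, 6]]
print(full_search(block, area))  # (0, (1, 2))
```

With a 16 × 16 block and a 32 × 32 search area the same loops would visit 17 × 17 candidate positions, each costing 256 absolute differences.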

Related Work
● Y. Ismail et al.: ZYNQ-based Full-Search Motion Estimation Architecture Verification
● Goel, Bayoumi et al.: High-speed Motion Estimation Architecture for Real-time Video Transmission
● Redkar, Kuwelkar et al.: Full Search Block Matching Algorithm Motion Estimation on FPGA
● Olivares et al.: Fast Full-Search Block Matching Algorithm Motion Estimation Alternatives in FPGA

Initial idea
● Develop an embedded system based on the ZYNQ platform, including the ARM processor and the FSME accelerator as an IP peripheral
● Develop the VHDL (or Verilog) module
● Add peripheral support in a custom Linux distribution
● Write the Linux driver for that peripheral

System outline

[Figures: system outline diagrams]

Tools used
● ZYNQ Evaluation and Development board (ZEDBoard)
● Xilinx Vivado Suite 2014.2
● Xilinx Software Development Kit
● Xilinx ARM Toolchain
● PetaLinux was the distribution of choice for our embedded system

The Xilinx ZYNQ Platform
● All-Programmable System-on-Chip (APSoC)
● Dual-core ARM Cortex-A9 CPU
● DDR (2 & 3) controllers, Gigabit Ethernet, USB 2.0
● Reconfigurable logic (FPGA)
● DSP slices (in varying numbers according to the model)


ZEDBoard Basic Specs
● ZYNQ®-7000 All Programmable SoC XC7Z020-CLG484-1
● 512 MB DDR3 RAM, 256 Mb Quad-SPI Flash, SD Card socket
● Onboard USB-JTAG programming
● 10/100/1000 Ethernet
● PS & PL I/O expansion (FMC, Pmod™ compatible, XADC)

ZYNQ Architecture
● A ZYNQ system consists of two discrete entities:
  - Processing System (PS): the ARM CPU
  - Programmable Logic (PL): the FPGA fabric, DSP slices and Block RAM (BRAM)
● These two entities communicate over a shared bus

Architecture of the Accelerator
● The basic architectural concepts have been proposed many times in the past (Olivares et al.)
● The accelerator consists of five main units:
  - Local Memory Unit
  - Sum-of-Absolute-Differences (SAD) Unit
  - Comparator
  - Motion Vector Memory
  - Control Unit

Architectural Diagram

[Figure: architectural diagram of the accelerator]

Local Memory Unit (1/2)
● The Local Memory Unit consists of two main subunits: the data demultiplexer and the Memory Unit
● The demultiplexer (Demux) routes the data either directly to the SAD unit or into the local memory to be stored
● The local memory consists of three "submemories" and is, in practice, a register file
● The structure of the Memory Unit enables the implementation of data reuse techniques

Local Memory Unit (2/2)
● Each submemory consists of a 16 × 16 register file
● Each register is 8 bits wide and can therefore store the value of one greyscale pixel
● The submemories hold part of the 32 × 32 search area; while that part is raster-scanned and compared with the current block, the rest is loaded row by row

Data Reuse
● A major advantage of the architecture is the data reuse mechanism implemented by the local memory
● The addressing mechanism is an internal counter, which gives access to any column when set to the corresponding value
● Writing and reading can be done row by row or column by column, which enables data reuse through a smart dataflow
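To give a feel for how much there is to reuse, the sketch below counts how many pixels two neighbouring candidate windows have in common. It assumes 16 × 16 windows that advance by one pixel per step; the slides do not state the step, and the helper names are hypothetical.

```python
BLOCK = 16  # block edge in pixels, as on the slides


def window_pixels(top, left, size=BLOCK):
    """Set of (row, col) coordinates covered by a size x size candidate window."""
    return {(r, c)
            for r in range(top, top + size)
            for c in range(left, left + size)}


def reuse_fraction(step_rows, step_cols, size=BLOCK):
    """Fraction of a window's pixels already present in the previous window."""
    previous = window_pixels(0, 0, size)
    current = window_pixels(step_rows, step_cols, size)
    return len(previous & current) / len(current)


print(reuse_fraction(0, 1))  # 0.9375: 240 of 256 pixels reused, 16 newly needed
print(reuse_fraction(1, 0))  # 0.9375 for a one-row step as well
```

Under that assumption only one new column or row of 16 pixels is needed per step, which is the kind of saving a column-addressable register file can exploit.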

Processing Unit
● Two different submodules: the Absolute Difference (AD) Processor and the Adder Tree
● SAD criterion: the sum of the absolute differences between the pixels of the current block and those of a candidate reference block
● The AD Processor produces a concatenated vector of 2048 bits
● The produced values must be added very quickly
● The Tree Adder consists mainly of 4:2 compressors and a few generic full adders
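The standard SAD criterion for an N × N block is $\mathrm{SAD}(u,v)=\sum_{i=0}^{N-1}\sum_{j=0}^{N-1}\lvert C(i,j)-R(i+u,\,j+v)\rvert$, where C is the current block and R the reference area (the symbols are chosen here, not taken from the slide). The following behavioural sketch mirrors the two submodules under stated assumptions: the packing order of the 2048-bit vector is a guess, and the tree uses plain pairwise additions rather than the 4:2 compressors of the actual design.

```python
PIXELS = 16 * 16   # pixels per block
WIDTH = 8          # bits per greyscale pixel


def absolute_differences(current, reference):
    """Pack |c - r| for each pixel pair into one integer, 8 bits per difference."""
    assert len(current) == len(reference) == PIXELS
    vector = 0
    for i, (c, r) in enumerate(zip(current, reference)):
        vector |= abs(c - r) << (i * WIDTH)
    return vector


def unpack(vector):
    """Split the packed vector back into its 256 8-bit fields."""
    mask = (1 << WIDTH) - 1
    return [(vector >> (i * WIDTH)) & mask for i in range(PIXELS)]


def adder_tree(values):
    """Sum the values level by level, pairing neighbours as a balanced tree would."""
    level = list(values)
    while len(level) > 1:
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


# Worst case: every pixel differs by the maximum amount.
current = [255] * PIXELS
reference = [0] * PIXELS
vector = absolute_differences(current, reference)
print(vector.bit_length())          # 2048
print(adder_tree(unpack(vector)))   # 65280, which fits in 16 bits
```

The worst case, 256 × 255 = 65280, is below 2^16 = 65536, which is consistent with a 16-bit SAD output.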

Processing Element

[Figure: processing element]

Tree Adder
● Works on 16 × 16 blocks (splits the 2048-bit vector into 16 × 16 groups of 8 bits)
● Each block is calculated separately and in parallel with the rest
● Final output: 16-bit value

Comparison Unit
● This unit compares the SAD just produced (together with the position where it was found) with the previous one, which was stored in an internal register after the current block was compared with an earlier reference block
● If the new SAD is smaller than the stored SAD, it is kept; otherwise it is discarded
● The position is calculated by an internal counter, which increments with every clock cycle
● Output: 11-bit position signal
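The keep-if-smaller rule can be sketched as a small behavioural model. The class and attribute names are hypothetical, and the initial register value (the largest 16-bit number) is an assumption, since the slides do not say how the register is initialised.

```python
class Comparator:
    """Keeps the smallest SAD seen so far and the counter value at which it occurred."""

    def __init__(self, sad_bits=16):
        self.best_sad = (1 << sad_bits) - 1  # assumed reset value: largest 16-bit number
        self.best_position = 0
        self.counter = 0                     # advances once per simulated clock cycle

    def clock(self, sad):
        """Present one SAD value; keep it only if strictly smaller than the stored one."""
        if sad < self.best_sad:
            self.best_sad = sad
            self.best_position = self.counter
        self.counter += 1


cmp_unit = Comparator()
for value in [900, 450, 450, 1200, 30, 75]:
    cmp_unit.clock(value)
print(cmp_unit.best_sad, cmp_unit.best_position)  # 30 4
```

Because the comparison is strict, the first of several equal SAD values wins in this model; whether the hardware resolves ties the same way is not stated in the slides.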

Motion Vector Memory
● FIFO
● 1395 11-bit registers

Control Unit
● Complex subsystem, the most significant one
● Effectively controls the signals within the IP
● Consists of:
  - Incremental counter
  - Signal controller
● State machine: the incremental counter counts up to 400h (1024 decimal) before resetting

Embedded Architecture
● The complete system includes the accelerator and the ARM CPU
● It also contains logic that enables communication and data transfer between components
● This "glue" logic is based on the AXI protocol, as in all modern ARM SoCs

AMBA AXI
● Advanced Microcontroller Bus Architecture
● The ZYNQ system makes use of the AMBA AXI bus (AXI4, introduced with AMBA 4)
● Targets high-performance, high clock frequency designs
● Every peripheral designed for use with ZYNQ should communicate over the AXI shared bus for optimum performance


AXI Protocols
● There are several different "flavors" of AXI interconnection:
  - AXI4: version for memory-mapped, high-performance IP
  - AXI4-Lite: subset of AXI4; simple single transactions between memory-mapped IP
  - AXI4-Stream: for non-memory-mapped IP which requires a high-speed, continuous data stream

AXI4-Stream
● Our accelerator is exactly the type of IP that requires a continuous, fast data stream, so AXI4-Stream is an excellent choice of interconnection protocol
● Uses two synchronization signals, VALID and READY
● Main usage: DMA controller
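The VALID/READY handshake can be illustrated with a cycle-by-cycle sketch. This is a simplified software model, not RTL: the function name and the signal patterns are made up for the example. A word crosses the interface only in a cycle where both signals are high.

```python
def simulate_stream(data, valid_pattern, ready_pattern):
    """Return the (cycle, word) pairs that cross the interface.

    A word is transferred only in a cycle where the source asserts VALID
    and the sink asserts READY; otherwise the source holds the same word.
    """
    transfers = []
    index = 0
    for cycle, (valid, ready) in enumerate(zip(valid_pattern, ready_pattern)):
        if index == len(data):
            break
        if valid and ready:
            transfers.append((cycle, data[index]))
            index += 1
    return transfers


words = [0xA0, 0xA1, 0xA2]
valid = [1, 1, 0, 1, 1, 1]   # source has data to offer
ready = [0, 1, 1, 1, 0, 1]   # sink can accept data
print(simulate_stream(words, valid, ready))  # [(1, 160), (3, 161), (5, 162)]
```

In cycles 0 and 4 the source is stalled by the sink (READY low), and in cycle 2 the sink waits for the source (VALID low), so three words need six cycles in this example.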

Direct Memory Access
● Since we require real-time video encoding, the system's response needs to be as fast as possible
● For that reason we chose to use a DMA engine to speed up the data transactions
● Specifically, we used the Xilinx AXI DMA IP in Scatter-Gather mode to provide a further speed-up
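The idea behind scatter-gather is that the DMA engine follows a chain of descriptors, each pointing to one buffer, instead of being reprogrammed by the CPU for every transfer. The sketch below is a generic illustration only: the `Descriptor` fields are hypothetical and deliberately do not reproduce the register layout of the Xilinx AXI DMA IP.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Descriptor:
    """One entry of a hypothetical descriptor chain (not the Xilinx layout)."""
    address: int                          # start address of this buffer
    length: int                           # number of bytes in this buffer
    next: Optional["Descriptor"] = None   # following descriptor, or None at the end


def build_chain(buffers):
    """Link (address, length) pairs into a chain and return its first descriptor."""
    head = None
    for address, length in reversed(buffers):
        head = Descriptor(address, length, head)
    return head


def gather(memory, head):
    """Walk the chain and concatenate the scattered buffers into one stream."""
    stream = bytearray()
    node = head
    while node is not None:
        stream += memory[node.address:node.address + node.length]
        node = node.next
    return bytes(stream)


memory = bytes(range(64))
chain = build_chain([(48, 4), (0, 4), (16, 2)])
print(list(gather(memory, chain)))  # [48, 49, 50, 51, 0, 1, 2, 3, 16, 17]
```

Once such a chain has been set up, the processor is involved only at the start and the end of the whole transfer, which is where the additional speed-up comes from.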


Software
● To test and demonstrate the system's functionality we used both TCL scripts and bare-metal C applications
● However, to have a complete working embedded system, we compiled and installed PetaLinux along with some of the Xilinx DMA drivers
● In addition, we developed a userspace application that uses the kernel driver and delivers the results from the IP

Compiling PetaLinux kernel
● There are two options for the PetaLinux kernel:
  - Run as-is, built and ready to boot on our ZEDBoard
  - Reconfigure using the appropriate tools and including the BSP package of our choice
● Once the kernel has been compiled appropriately, including support for our peripheral, we can begin developing the driver

Results
● Max operating frequency (theoretical): 111.895 MHz
  - Operates without issues at 112 MHz
● For greater clock speeds there are setup time violations
● Design area ~26% of the total

Results: Xilinx Vivado Power Report

Total On-Chip Power (W)     1.735
Dynamic (W)                 1.576
Device Static (W)           0.159
Effective TJA (C/W)         11.5
Max Ambient (C)             65.0
Junction Temperature (C)    45.0
Thermal Margin (C)          39.7

On-Chip          Power (W)
Clocks           0.03
Signals          0.005
Slice Logic      0.004
PS7              1.532
Static Power     0.159
Total            1.735

FPGA Area Utilization Report

Site Type          Used     Available   Utilization %
Slice LUTs         13912    53200       26.15
LUT as Logic       12041    53200       22.63
LUT as Memory      1871     17400       10.75
LUT FF Pairs       17141    53200       32.22
Slice Registers    11525    106400      10.83
Block RAM Tiles    3        140         2.14
F7 Muxes           232      26600       0.87
F8 Muxes           88       13300       0.66

● Comparable with the design proposed in "High-speed Motion Estimation Architecture for Real-time Video Transmission" by Goel et al.
● The extra ~900 LUTs used are due to other subsystems (DMA, AXI controllers, etc.)
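The utilization column follows directly from the used and available counts, which also confirms the ~26% design area quoted in the results. The short check below recomputes it; the `report` dictionary and the `utilization` helper are just a convenient way to hold the table values and are not part of any Xilinx tool.

```python
# (used, available) pairs copied from the utilization table
report = {
    "Slice LUTs": (13912, 53200),
    "LUT as Logic": (12041, 53200),
    "LUT as Memory": (1871, 17400),
    "LUT FF Pairs": (17141, 53200),
    "Slice Registers": (11525, 106400),
    "Block RAM Tiles": (3, 140),
    "F7 Muxes": (232, 26600),
    "F8 Muxes": (88, 13300),
}


def utilization(used, available):
    """Percentage of a resource in use, rounded to two decimals as in the report."""
    return round(100.0 * used / available, 2)


for site, (used, available) in report.items():
    print(f"{site:16s} {utilization(used, available):6.2f} %")
```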

Conclusions
● The design and implementation of this embedded system was a proof of concept that showcased the capability of designing high-performance heterogeneous systems
● It also showed how we can use the power of this new generation of SoCs to build even better digital systems with far greater potential
● Accelerators are a distinct possibility for the future of everyday computing, especially in processing-intensive applications

Future Work
● Design and implement more complex, specialized embedded systems, focusing on low power consumption
● Implement a greater level of data reuse
● Design an upscaled version, targeting Ultra-High Definition (HEVC) encoding standards