Implementation of a Motion Estimation Hardware Accelerator on Zynq SoC
Thomas Makryniotis, Minas Dasygenis
Laboratory of Digital Systems and Computer Architecture
Dept. of Informatics and Telecommunications Engineering
University of Western Macedonia
[email protected],
[email protected]
MOCAST 2017 Thessaloniki - Greece, 4-6 May 2017
Outline
● Introduction
● Related Work
● Video Encoding
● ZYNQ Architecture
● Hardware Design
● Software Design
● Future Work
Introduction (1/2)
● New trends in multimedia, especially in video applications, demand the best possible quality
● Low bandwidth and memory consumption should be respected
● Ultra-High Definition standards (4K & 8K) are further fueling the need for fast and reliable encoding, with minimal losses
● Motion Estimation is the most time-consuming and computationally expensive procedure of video encoding
Introduction (2/2)
● Many algorithms have been suggested for the Motion Estimation (ME) process
● Full-Search Motion Estimation is the most reliable, producing the best possible results at the cost of low speed and high energy consumption (especially in software implementations)
● To reduce the time and energy needed, we developed an accelerator-based embedded system with the sole purpose of performing ME calculations
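The full-search procedure the accelerator implements can be sketched in plain C. This is an illustrative software model, not the VHDL design; it uses the 16 × 16 block and 32 × 32 search window described later in the talk:

```c
#include <stdint.h>
#include <stdlib.h>

#define BLK 16   /* current-block size, as in the accelerator      */
#define WIN 32   /* search-window size (32 x 32, see Local Memory) */

/* Sum of absolute differences between the 16x16 current block and
 * the candidate block at offset (dy, dx) inside the search window. */
static unsigned sad16(const uint8_t cur[BLK][BLK],
                      const uint8_t win[WIN][WIN], int dy, int dx)
{
    unsigned sad = 0;
    for (int i = 0; i < BLK; i++)
        for (int j = 0; j < BLK; j++)
            sad += abs((int)cur[i][j] - (int)win[dy + i][dx + j]);
    return sad;
}

/* Full search: evaluate every candidate position, keep the one with
 * the minimum SAD; (*best_dy, *best_dx) is the motion vector. */
unsigned full_search(const uint8_t cur[BLK][BLK],
                     const uint8_t win[WIN][WIN],
                     int *best_dy, int *best_dx)
{
    unsigned best = ~0u;
    for (int dy = 0; dy <= WIN - BLK; dy++)
        for (int dx = 0; dx <= WIN - BLK; dx++) {
            unsigned s = sad16(cur, win, dy, dx);
            if (s < best) { best = s; *best_dy = dy; *best_dx = dx; }
        }
    return best;
}
```

The exhaustive double loop is exactly why full search is slow in software, and why the accelerator evaluates candidates in parallel.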
Related Work
● Y. Ismail et al.: ZYNQ-based Full-Search Motion Estimation Architecture Verification
● Goel, Bayoumi et al.: High-speed Motion Estimation Architecture for Real-time Video Transmission
● Redkar, Kuwelkar et al.: Full Search Block Matching Algorithm Motion Estimation on FPGA
● Olivares et al.: Fast Full-Search Block Matching Algorithm Motion Estimation Alternatives in FPGA
Initial idea
● Develop an embedded system based on the ZYNQ platform
● It will include the ARM processor and the FSME accelerator as an IP peripheral
● Develop the VHDL (or Verilog) module
● Add peripheral support in a custom Linux distribution
● Write the Linux driver for that peripheral
System outline
Tools used
● ZYNQ Evaluation and Development board (ZedBoard)
● Xilinx Vivado Suite 2014.2
● Xilinx Software Development Kit
● Xilinx ARM Toolchain
● PetaLinux as the Linux distribution for our embedded system
The Xilinx ZYNQ Platform ●
● ●
● ●
All-Programmable System-on-Chip (APSoC) Dual-Core ARM Cortex A9 CPU DDR (2 & 3) Controllers, Gigabit Ethernet, USB 2.0 Re-configurable logic (FPGA) DSP slices (in varying numbers according to the model) MOCAST 2017
10
ZEDBoard Basic Specs ●
●
ZYNQ®-7000 All Programmable SoC XC7Z020-CLG484-1 512 MB DDR3 RAM, 256 Mb Quad-SPI Flash, SD Card socket
●
Onboard USB-JTAG Programming
●
10/100/1000 Ethernet
●
PS & PL I/O expansion (FMC, Pmod™ Compatible, XADC) MOCAST 2017
13
ZYNQ Architecture
● The ZYNQ system consists of two discrete entities
● Processing System (PS): the ARM CPU
● Programmable Logic (PL): the FPGA fabric, DSP slices and Block RAM (BRAM)
● These two entities communicate over a shared bus
Architecture of the Accelerator
● The basic architectural concepts have been proposed many times in the past (Olivares et al.)
● The accelerator consists of five main units:
  - Local Memory Unit
  - Sum-of-Absolute-Differences (SAD) Unit
  - Comparator
  - Motion Vector Memory
  - Control Unit
Architectural Diagram
Local Memory Unit
● The Local Memory Unit consists of two main subunits: a data demultiplexer and a memory unit
● The demultiplexer (demux) routes the data either directly to the SAD unit or into the local memory to be stored
● The local memory consists of three “submemories” and is practically a register file
● The structure of the memory unit enables the implementation of data re-use techniques
Local Memory Unit
● Each submemory is a 16 × 16 register file
● Each register is 8 bits wide, so it can store the value of one greyscale pixel
● The submemories hold part of the 32 × 32 search area; as it is raster-scanned and compared to the current block, the rest is loaded row by row
Data Reuse
● A major advantage of the architecture is the data-reuse mechanism implemented by the local memory
● The addressing mechanism is an internal counter that enables access to any column by setting it to the correct value
● Writing and reading can be done in a very particular way, row by row or column by column → data reuse by following a smart dataflow
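The column-counter addressing can be modelled as a circular column buffer: as the search window slides by one column, only that new column is written and all others are reused in place. This is an illustrative C sketch of the dataflow idea, not the actual VHDL register file:

```c
#include <stdint.h>

#define ROWS 32
#define COLS 32

/* Simplified model of the local memory's column addressing: a
 * register file indexed through a wrapping column counter, so a
 * sliding search window only ever loads its single new column. */
typedef struct {
    uint8_t mem[COLS][ROWS]; /* one array per physical column       */
    int head;                /* counter selecting the oldest column */
} col_buf;

/* Overwrite the oldest column with fresh pixels: one column loaded,
 * COLS - 1 columns reused in place. */
void push_column(col_buf *b, const uint8_t col[ROWS])
{
    for (int r = 0; r < ROWS; r++)
        b->mem[b->head][r] = col[r];
    b->head = (b->head + 1) % COLS;  /* wrap like the internal counter */
}

/* Read pixel (row, c), where c is a logical column: 0 = oldest. */
uint8_t read_pixel(const col_buf *b, int row, int c)
{
    return b->mem[(b->head + c) % COLS][row];
}
```

Compared with reloading the whole 32 × 32 window per candidate, this cuts the memory traffic per window shift from 32 columns to 1.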
Processing Unit
● Two different submodules: an Absolute Difference (AD) Processor and an Adder Tree
● SAD criterion: SAD = Σᵢ Σⱼ |C(i, j) − R(i, j)|, summed over the 16 × 16 block, where C is the current block and R the reference (candidate) block
● The AD Processor produces a concatenated vector of 2048 bits; the produced values must be added very quickly
● The Adder Tree consists mainly of 4:2 compressors and a few generic full adders
Processing Element
Tree Adder
● Works on 16 × 16 blocks (splits the 2048-bit vector into 16 × 16 = 256 8-bit values)
● Each block is calculated separately and in parallel with the rest
● Final output: a 16-bit value
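A 4:2 compressor, the tree's main building block, reduces four operand bits plus a carry-in to one sum bit and two carry bits while preserving the arithmetic value. The bit-level model below uses one common realization (two chained full adders); the IP's actual gate-level netlist may differ:

```c
/* 4:2 compressor modelled as two chained full adders.  Invariant:
 * a + b + c + d + cin == sum + 2 * (carry + cout),
 * which is why layers of these compress partial sums so quickly. */
void compressor_4_2(int a, int b, int c, int d, int cin,
                    int *sum, int *carry, int *cout)
{
    int s1 = a ^ b ^ c;                         /* first full adder: sum   */
    *cout  = (a & b) | (b & c) | (a & c);       /* first full adder: carry */
    *sum   = s1 ^ d ^ cin;                      /* second full adder       */
    *carry = (s1 & d) | (d & cin) | (s1 & cin);
}
```

Because `cout` does not depend on `cin`, a row of these compressors has no horizontal carry ripple, which is what makes the tree fast.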
Comparison Unit
● This unit compares the currently produced SAD (and the position where it was found) with the previous one, which is stored in an internal register
● If the new SAD < the old SAD, it is kept; otherwise it is discarded
● The position is calculated by an internal counter that increments with every clock cycle
● Output: an 11-bit position signal
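The keep-the-minimum behaviour can be modelled cycle by cycle. This is a behavioural sketch, not the RTL; the register widths follow the slides (16-bit SAD from the adder tree, 11-bit position):

```c
#include <stdint.h>

/* Behavioural model of the comparison unit: hold the smallest SAD
 * seen so far together with the counter value at which it occurred. */
typedef struct {
    uint16_t best_sad;  /* internal register holding the old SAD */
    uint16_t pos;       /* counter, increments every clock cycle */
    uint16_t best_pos;  /* 11-bit position output                */
} comparator;

void comparator_reset(comparator *c)
{
    c->best_sad = 0xFFFF;  /* worst possible SAD */
    c->pos = 0;
    c->best_pos = 0;
}

/* One clock cycle: a new SAD arrives from the adder tree. */
void comparator_clock(comparator *c, uint16_t new_sad)
{
    if (new_sad < c->best_sad) {       /* keep the smaller SAD */
        c->best_sad = new_sad;
        c->best_pos = c->pos & 0x7FF;  /* truncate to 11 bits  */
    }
    c->pos++;                          /* position counter     */
}
```

Note the strict `<`: on a tie the earlier position is kept, which keeps the result deterministic.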
Motion Vector Memory
● FIFO
● 1395 11-bit registers
Control Unit ● ●
●
Complex subsystem, the most significant Effectively controls the signals within the IP Consists of: - Incremental counter - Signal controller
●
State machine – Incremental counter counts until 400'h before resetting MOCAST 2017
25
Embedded Architecture
● The complete system includes the accelerator and the ARM CPU
● It also contains logic that enables component communication and data transfer
● This “glue” logic is based on the AXI protocol, like all modern ARM SoCs
AMBA – AXI
● Advanced Microcontroller Bus Architecture
● The ZYNQ system makes use of the third-generation AMBA – AXI bus (AXI4)
● Targets high-performance, high-clock-frequency designs
● Every peripheral designed for use with ZYNQ should communicate over the AXI shared bus for optimum performance
AXI Protocols
● There are several different “flavors” of an AXI interconnection:
  - AXI4: version for memory-mapped, high-performance IP
  - AXI4-Lite: subset of AXI4, for simple single transactions between memory-mapped IP
  - AXI4-Stream: for non-memory-mapped IP that requires a high-speed, continuous data stream
AXI4 - Stream ●
●
●
Exactly the type of IP that requires continuous and fast data-stream, so AXI4Stream is an excellent choice for interconnection protocol. Utilizes two syncronization signals, VALID and READY Main usage: DMA Controller
MOCAST 2017
30
Direct Memory Access
● Since we require real-time video encoding, the system's response needs to be as fast as possible
● For that reason, we chose to use a DMA engine to speed up the data transactions
● Specifically, we used the Xilinx AXI DMA IP in Scatter-Gather mode to provide a further speed-up
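The point of scatter-gather mode is that the CPU builds a linked chain of buffer descriptors once, and the engine then streams all fragments without per-transfer CPU intervention. The sketch below models that concept with a hypothetical descriptor layout; it is NOT the real Xilinx AXI DMA descriptor format (see Xilinx PG021 for that):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical scatter-gather descriptor: each entry points at one
 * buffer fragment and at the next descriptor in the chain. */
struct sg_desc {
    struct sg_desc *next;  /* next descriptor, NULL terminates */
    uint8_t        *buf;   /* fragment base address            */
    uint32_t        len;   /* fragment length in bytes         */
};

/* What the DMA engine does conceptually: walk the chain and move
 * every fragment, returning the total number of bytes streamed. */
uint32_t sg_run(const struct sg_desc *d, uint8_t *dst)
{
    uint32_t total = 0;
    for (; d != NULL; d = d->next) {
        for (uint32_t i = 0; i < d->len; i++)
            dst[total + i] = d->buf[i];   /* one burst per fragment */
        total += d->len;
    }
    return total;
}
```

In simple (non-SG) mode the CPU would instead be interrupted after every fragment to program the next transfer, which is the overhead scatter-gather removes.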
Software
● To test and demonstrate the system's functionality, we used both TCL scripts and bare-metal C applications
● To have a complete working embedded system, we compiled and installed PetaLinux along with some of the Xilinx DMA drivers
● In addition, we developed a userspace application that makes use of the kernel driver and provides results from the IP
Compiling the PetaLinux kernel
● There are two flavors of the PetaLinux kernel:
  - Run as-is, pre-built and ready to boot on our ZedBoard
  - Reconfigure using the appropriate tools, including the BSP package of our choice
● Having compiled the kernel appropriately, with support for our peripheral, we can begin developing the driver
Results
● Max operating frequency (theoretical): 111.895 MHz; operates without issues at 112 MHz
● For greater clock speeds there are setup-time violations
● Design area: ~26% of the total
Results – Xilinx Vivado Power Report

Summary                      Value
Total On-Chip Power (W)      1.735
Dynamic (W)                  1.576
Device Static (W)            0.159
Effective TJA (°C/W)         11.5
Max Ambient (°C)             65.0
Junction Temperature (°C)    45.0
Thermal Margin (°C)          39.7

On-Chip         Power (W)
Clocks          0.03
Signals         0.005
Slice Logic     0.004
PS7             1.532
Static Power    0.159
Total           1.735
FPGA Area Utilization Report

Site Type          Used    Available   Utilization %
Slice LUTs         13912   53200       26.15
  LUT as Logic     12041   53200       22.63
  LUT as Memory     1871   17400       10.75
LUT FF Pairs       17141   53200       32.22
Slice Registers    11525   106400      10.83
Block RAM Tiles        3   140          2.14
F7 Muxes             232   26600        0.87
F8 Muxes              88   13300        0.66

● Comparable with the architecture proposed in "High-speed Motion Estimation Architecture for Real-time Video Transmission" by Goel et al.
● The extra ~900 LUTs used are due to the other subsystems (DMA, AXI controllers, etc.)
Conclusions
● The design and implementation of this embedded system was a proof of concept that showcased the capability of designing high-performance heterogeneous systems
● It also showed how we can utilize the power of this new generation of SoCs to build even better digital systems with far greater potential
● Accelerators are a quite distinct possibility for the future of everyday computing, especially in processing-intensive applications
Future Work
● Design and implement more complex, specialized embedded systems, focusing on low power consumption
● Implement a greater level of data reuse
● Design an upscaled version targeting Ultra-High Definition (HEVC) encoding standards