Real-Time Space-Time Adaptive Processing on the STI CELL

Institutionen för systemteknik
Department of Electrical Engineering

Examensarbete (Master's Thesis)

Real-Time Space-Time Adaptive Processing on the STI CELL Multiprocessor

Master's thesis carried out in Computer Engineering at Linköping Institute of Technology by

Yi-Hsien Li

LITH-ISY-EX--07/3953--SE

Supervisor: Di Wu, ISY, Linköpings universitet
Examiner: Dake Liu, ISY, Linköpings universitet

Linköping, 6 March, 2007

Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden

URL for electronic version: http://www.ep.liu.se/2007/3953

Abstract

Space-Time Adaptive Processing (STAP) is widely used in modern radar systems, such as Ground Moving Target Indication (GMTI) systems, to suppress jamming and interference. However, this high performance comes at the price of high computational complexity, which normally requires extensive, powerful hardware. The new STI Cell Broadband Engine (CBE) processor, which combines a PowerPC core with eight streamlined high-performance SIMD processing engines, offers an opportunity to implement STAP baseband signal processing without any full-custom hardware. This thesis presents the implementation of a STAP baseband signal processing flow on the state-of-the-art STI CELL multiprocessor, which enables the concept of Software-Defined Radar (SDR). The potential of the Cell BE processor is studied, and kernel subroutines of STAP such as QR decomposition, the Fast Fourier Transform (FFT), and FIR filtering are mapped to the SPE co-processors of the Cell BE processor with a variety of architecture-specific optimization techniques. The report starts with an overview of airborne radar techniques; standard STAP, specifically third-order Doppler-factored STAP, is then introduced. Next follows a thorough description of the Cell BE architecture, its programming toolchain, and parallel programming methods for the Cell BE. Later chapters discuss how STAP is implemented on the Cell BE processor and present the simulation results. Furthermore, based on the benchmarking results, an optimized task partitioning and scheduling method is proposed to improve the overall performance.

Keywords: STAP, Adaptive Processing, Cell Broadband Engine, Parallel Programming


Acknowledgments

First and foremost, I wish to thank my supervisor Di Wu for his stimulating suggestions and encouragement throughout the thesis work. I also gratefully thank my professor Dake Liu for his constructive comments and valuable suggestions during my studies at Linköpings universitet. I wish to thank all my friends in Linköping for always making me feel welcome and less far from home; thank you for the good times we had together. Everybody, including my friends in Taiwan, has in his or her own way helped me through the difficult situations I experienced. Last but not least, I would like to thank my wonderful family, including my parents and my sisters, for their enduring support and for always believing in me.

Yi-Hsien Li
Linköping, March 6, 2007

Contents

1 Introduction
1.1 Background
1.2 Purpose of the Thesis
1.3 Way of Work
1.4 Outline

2 Overview of Radar Signal Processing
2.1 Airborne Radar Application
2.1.1 Moving Target Indicator (MTI)
2.1.2 Two-Dimensional Space-Time Spectrum
2.2 Space Time Adaptive Processing (STAP)
2.2.1 Mathematical Model
2.2.2 Full Rank STAP

3 Functional Specification of STAP System
3.1 Preprocessing
3.1.1 I/Q Conversion
3.1.2 Array Calibration
3.1.3 Pulse Compression
3.1.4 Combination of Array Calibration and Pulse Compression
3.2 Post-Doppler Adaptive Processing
3.2.1 Doppler Filtering
3.2.2 Weight Computation
3.3 Computation Issues

4 Overview of Cell Broadband Engine
4.1 Architecture
4.1.1 Power Processing Element (PPE)
4.1.2 Synergistic Processing Element (SPE)
4.1.3 Element Interconnect Bus (EIB)
4.1.4 Memory and I/O
4.2 Programming Toolchain
4.2.1 Compiler
4.2.2 Accelerated Library Framework
4.2.3 The Simulator
4.3 Parallel Programming Methods for the Cell Broadband Engine
4.3.1 SIMD Vectorization
4.3.2 Interleaved Load
4.3.3 Loop Unrolling
4.3.4 Double-Buffering
4.3.5 Reducing the Impact of Branches
4.3.6 Data Alignment

5 Design Consideration
5.1 Real-Time Performance Issues
5.2 Finite-Length Precision
5.3 System Partition
5.3.1 Design Consideration
5.3.2 Programming Flow

6 Kernel Subroutines
6.1 Implementation
6.1.1 Preprocessing
6.1.2 Doppler Processing
6.1.3 QR Decomposition
6.1.4 Forward and Backward Substitution
6.2 Benchmark Results

7 Original STAP Flow
7.1 Implementation
7.1.1 Multidimensional Data Cube Rotation
7.1.2 System Integration for Whole STAP Flow
7.2 Benchmark Results

8 Optimized STAP Flow
8.1 Implementation
8.1.1 Problems of Original STAP Flow
8.1.2 System Integration for Optimized STAP Flow
8.2 Benchmark Results

9 Conclusion and Future Work
9.1 Conclusion
9.2 Future Work

Bibliography

A SPE Kernel C Intrinsic Code for QR Decomposition

B SPE Kernel C Intrinsic Code for Forward/Backward Substitution

List of Figures

2.1 Illustration of an airborne radar environment
2.2 Airborne radar electromagnetic environment
2.3 Two-dimensional Space-Time spectrum
2.4 General STAP filter structure
2.5 Input data cube for a single CPI
2.6 Taxonomy of reduced-dimension STAP algorithms

3.1 Preprocessing block diagram for a single-array channel
3.2 Overlap-save method
3.3 Third-order Doppler-factored STAP

4.1 Cell Broadband Engine (CBE) block diagram
4.2 Power Processing Element (PPE) block diagram
4.3 Synergistic Processing Element (SPE) block diagram
4.4 Overview of ALF [4]
4.5 Simulation stack of the Cell BE processor
4.6 Example of static analysis of SPE threads
4.7 Example of dynamic analysis of SPE threads
4.8 Software development flow on the CBE
4.9 Example of SIMD addition
4.10 Eliminating data dependency by interleaved loading
4.11 DMA transfer using a double-buffering method

5.1 STAP function flow
5.2 System partition of STAP on the CBE
5.3 Program flow of STAP (right part shows the flow in the SPE for preprocessing)
5.4 Program flow of STAP (right part shows the flow in the SPE for QR decomposition)

6.1 DIT of a length-N DFT into two length-N/2 DFTs followed by a combining stage
6.2 The basic operation of the FFT: the butterfly
6.3 Radix-2 DIT FFT algorithm for a length-8 signal
6.4 Program flow of preprocessing
6.5 Parallelism of preprocessing (DMA transfer, pipeline 0, and pipeline 1 are executed in parallel)
6.6 The computation of the i-th iteration in QR decomposition
6.7 The computation of vector normalization in MGS
6.8 Newton-Raphson's method
6.9 Pseudocode of forward substitution
6.10 Programming flow of forward substitution
6.11 Block diagram of 4x4 matrix forward substitution
6.12 Block diagram to subtract y by x times the first four column elements
6.13 Simulation result of QR decomposition

7.1 Original data flow and functional description of STAP processing stages
7.2 Example of data rotation between the first and third dimension
7.3 The data and computation flow of the original STAP system on each SPE

8.1 Ideal multi-buffering with more computation and less DMA overhead
8.2 Problem of multi-buffering with less computation and more DMA overhead
8.3 The ratio of DMA transfer to computation in each processing stage
8.4 The channel stall cycle rate in each processing stage
8.5 Modified data flow and functional description of the STAP system
8.6 The data and computation flow of the modified STAP system on each SPE
8.7 Comparison between the original data flow and the modified data flow

List of Tables

3.1 Complex operation counts of STAP
3.2 Instruction counts of STAP (one complex operation per cycle)
3.3 Floating-point operations of STAP (one floating-point operation per cycle)

4.1 Example of an unaligned load in the SPU

6.1 Conversion from complex to RFLOPs
6.2 Example of approximating the reciprocal square root
6.3 Performance measurement of kernel subroutines (pure computation without data movement)

7.1 The memory address of data depends on the setting of the dimension
7.2 Theoretical performance of the STAP benchmark (pure computation without data movement)
7.3 Performance of the STAP benchmark with the original data flow (including the latency of the memory subsystem)

8.1 Required data and computation cycles for each processing stage, where permutation/transposition is not included in the computation cycles
8.2 Performance of the STAP benchmark with the modified data flow (including the latency of the memory subsystem)

Chapter 1

Introduction

1.1 Background

Recently, more and more Unmanned Aerial Vehicles (UAVs) have been deployed for combat field surveillance tasks, which require long cruises at relatively low altitude. Compared to manned aircraft, using UAVs can greatly reduce the chance of casualties. However, this places tougher requirements on the design of the sensor system. In order to meet strict constraints such as the space on board, power consumption, and accessibility for maintenance and upgrades, the airborne electronics system must be highly compact, low-power, and flexible. To meet these requirements, Software-Defined Radar (SDR) was first introduced by Wiesbeck as an alternative to fixed-function hardware-based systems; it employs programmable devices to accommodate various radar sensors for different missions by updating the software. In order to meet both the performance and flexibility requirements of an SDR system, state-of-the-art hardware is needed. The scaling of semiconductor processes allows more processing units to be integrated into a single chip, which can make the system compact and powerful enough that a full radar system can be carried by a UAV, with its strict constraints on space and power consumption. STI CELL [3] is the latest state-of-the-art multiprocessor designed by the joint venture of Sony, Toshiba and IBM (STI). At the same time, it brings a brand-new parallel programming model that is unfamiliar to most application programmers. Recently, an estimate of Space-Time Adaptive Processing (STAP) performance on the STI CELL was presented in [7]. However, since it is only an estimate based on counting the Floating-Point Operations Per Second (FLOPS) involved in the computation rather than cycle-accurate simulation, and the overhead of the memory subsystem is not explicitly exposed, the result needs further validation. In this thesis, in order to explore the potential of the CELL for array signal processing, a complete STAP baseband processing flow is implemented on the STI CELL and benchmarked using the cycle-accurate simulator from IBM.

1.2 Purpose of the Thesis

The scope of the project is to design and implement Space-Time Adaptive Processing (STAP) algorithms on the latest Cell Broadband Engine, a heterogeneous multi-core processor. The thesis benchmarks the performance of the CELL architecture for a floating-point radar application, targeting state-of-the-art real-time, high-resolution, interference-nulling adaptive processing (the whole processing flow must complete within a 32.5-millisecond interval). The kernel subroutines of STAP are accelerated using SIMDization, and optimization of task/data partitioning and scheduling is carried out to improve the overall performance.

1.3 Way of Work

The benchmarking was performed by writing C programs using a C-language extension called intrinsics, which maps to inline assembly instructions. The programs were then compiled and executed in the IBM Full System Simulator for the Cell Broadband Engine, which supports both functional and cycle-accurate simulation. The aim was to design a program with high data and task parallelization to achieve as short an execution time as possible, mainly for real-time processing.

1.4 Outline

Chapter 2 presents the basic theory of Space-Time Adaptive Processing, such as Moving Target Indication (MTI), beamforming, and the environment of STAP. In Chapter 3, the functional and timing specification of the real-time STAP benchmark is elaborated. The benchmark case corresponds to third-order Doppler-factored STAP, which consists of several computation stages such as preprocessing, Doppler filtering, and weight computation. Chapter 4 gives an overview of the Cell BE processor: first a thorough description of the architecture, then the programming toolchain, from the compiler to the simulator. Several parallel programming methods for developing software on the multiprocessor Cell BE are presented at the end of that chapter. Chapter 5 discusses the design considerations for implementing the STAP algorithm on the Cell BE processor. It covers the real-time constraints and numerical precision dictated by the specification of the STAP system; system scheduling and partitioning are also discussed to exploit the Cell BE architecture. Chapter 6 presents the implementation and benchmark results of all kernel subroutines in the STAP system: preprocessing, Doppler processing, QR decomposition, forward substitution, and backward substitution. For each processing stage, the optimization of significant modules in the SIMD environment is presented, and simulation results are given at the end of the chapter. In Chapter 7, the whole computation flow and system integration are presented; the benchmark results show the performance including the latency of data movement and permutation. In Chapter 8, based on the earlier benchmarking results, an optimized task partitioning and scheduling method is proposed to improve the overall performance. The simulation results, which explain the improvement of the optimized implementation, are presented in the second part of that chapter. Finally, Chapter 9 presents the conclusions and possible future work.

Chapter 2

Overview of Radar Signal Processing

Space-Time Adaptive Processing (STAP) is a signal processing technique most commonly used in airborne radar. It involves adaptive array processing algorithms to optimize target detection. Radar systems benefit from STAP in environments with strong interference: through the application of STAP, it is possible to suppress clutter and jamming and thereby achieve order-of-magnitude sensitivity improvements in target detection. This chapter discusses the required technical background on adaptive airborne radar, such as Moving Target Indication, beamforming, and Space-Time processing. The underlying theory has broad application in several distinct disciplines, for example space-based radar, sonar, spectral estimation, biomedical imaging, and wireless communications, and in many other fields that utilize estimates of statistical correlations between random quantities.

2.1 Airborne Radar Application

All radars transmit electromagnetic energy and receive scattered echoes from reflective objects. These objects can be classified as targets or clutter, where clutter is defined as any unwanted echo that interferes with the target echoes. Consider a high-altitude airborne surveillance radar designed to detect moving aircraft or slow-moving ground vehicles, as depicted in Figure 2.1. In general, the most dominant source of clutter for a down-looking surveillance radar is the Earth itself, including all natural and man-made structures. Ground clutter, the Earth's echo, is usually several orders of magnitude greater in power than a target echo. Therefore, the ground clutter signals must be mitigated as much as possible through filtering in order to detect the relatively weak target signal buried within them. A non-clutter source of interference to all radars is intentional or unintentional jamming by a source of electromagnetic radiation transmitting signals in the radar's transmit/receive frequency band. This jamming could be generated by hostile sources or by radiating equipment that happens to have a similar interference effect on the radar transceiver.

Figure 2.1. Illustration of an airborne radar environment

Discriminating features of the target, distinct from the clutter, must exist in order to filter a target out of the clutter. The three most commonly used discriminating features are range, radial velocity, and azimuth angle. As shown in Figure 2.2, range gating the received echo helps to separate targets from clutter by limiting the clutter backscatter area to a narrow ring on the ground circling the aircraft, whose effective width equals the compressed pulse width. The figure, where only half of the ring is shown, indicates that low gain is typically found in the antenna back-lobe region. For the front-lobe region, the antenna pattern gain as a function of azimuth angle multiplies the clutter and jamming from the same angles and generally attenuates these signals by several dB if they are not located in the main beam. By making the compressed pulse width smaller, a target confined to a single range bin has less clutter power to compete with during detection processing.

Figure 2.2. Airborne radar electromagnetic environment

The radial velocity discriminant is the velocity of the target relative to the radar velocity vector along the line of sight to the radar; it has a direct relationship with the Doppler frequency of the target echo. The azimuth angle discriminant is simply the target location with respect to the antenna pointing direction. Elevation angle is usually not useful for surveillance radars, since they generally have very modest vertical antenna aperture widths and consequently narrow elevation beam widths.

2.1.1 Moving Target Indicator (MTI)

In many surveillance radars, for each range bin, the CPI pulses are temporally processed by two stages collectively called the Airborne Moving Target Indicator (AMTI). In the first stage, the AMTI circuit attempts to reverse the effect of the radar aircraft's motion by making the radar signals appear as if they had been transmitted from a stationary platform. A complementary circuit technique called TACCAR allows a simple second-stage Moving Target Indicator (MTI) filter design, which significantly attenuates the stationary (or DC) signal component associated with the ground clutter present in the first stage's output. For more details on MTI, please refer to [8].

2.1.2 Two-Dimensional Space-Time Spectrum

Figure 2.3 illustrates the two-dimensional Space-Time spectrum of the received echo for a single range bin; it is typical of airborne surveillance radars. The basic assumption behind such a spectrum is that the radar's looking direction is perpendicular to the direction of flight. The clutter is already spread in azimuth angle due to the antenna pattern; due to the motion of the aircraft, however, the clutter also spreads in Doppler frequency. One important characteristic is that the clutter Doppler frequencies and azimuth angles are correlated, meaning that clutter from a certain azimuth angle has an associated Doppler frequency shift. This correlation produces a structure in the Space-Time spectrum called the clutter ridge: the diagonal ridge in Figure 2.3 cutting across both the azimuth and Doppler axes. Imagine that the moving radar slows down and becomes stationary in the air. The effect would be to rotate the clutter ridge clockwise (still centered at 0 Hz and 0 degrees) until it lies along the zero-Doppler axis. Now every azimuth angle has the same Doppler frequency of zero Hz, and there is no longer a correlation between the two dimensions. This is precisely the effect the MTI algorithm attempts to achieve (via a technique called Displaced Phase Center Antenna, DPCA), but typically with limited success. Since stopping the radar motion rotates the clutter ridge, the moving targets in Figure 2.3 can become unmasked from the clutter. In other words, the projection of the clutter onto the Doppler axis shrinks from the full extent of the Doppler axis down to just the width caused by internal clutter motion, at the same time revealing the relatively weak moving targets at their non-zero Doppler frequencies. Once the clutter spectrum is rotated, performing MTI and Doppler-factored processing produces a near-optimum solution: by reversing the effects of radar motion, targets are unmasked from the clutter Doppler spectrum, which is correlated from pulse to pulse. However, detecting today's low-cross-section targets requires improved clutter and jamming suppression beyond current airborne radar capabilities; the level of clutter and jamming rejection in airborne radars that use cascaded DPCA and sidelobe-canceller techniques is not sufficient to detect these targets. Space-Time Adaptive Processing was developed to solve this problem.

Figure 2.3. Two-dimensional Space-Time spectrum

2.2 Space Time Adaptive Processing (STAP)

STAP is a two-dimensional, linear, adaptive filtering process. It filters the received signals over both space and time in order to extract desired signals while reducing interference. In contrast to one-dimensional spatial or temporal filtering, STAP utilizes both of these orthogonal dimensions simultaneously in order to optimally discriminate between desired and interfering signals. The received field is sampled in space and time using a finite array of N antennas, each with a length-M tapped delay line attached, as shown in Figure 2.4.

Figure 2.4. General STAP filter structure

To solve the target-masking problem, STAP uses space and time samples of the received field, which provide more information than a single (time) dimension does. If only time samples were available, moving targets might be masked by the Doppler-spread clutter ridge. Because of the added dimension of data, STAP has the ability to create a two-dimensional adapted filter pattern that nulls out only the strong clutter ridge, without nulling out moving targets at other azimuth angles and Doppler frequencies. This alleviates the need to first remove platform motion effects (i.e., rotating the clutter ridge is no longer necessary). STAP was developed for the purpose of unmasking targets from Doppler-spread clutter, but it also potentially fulfills the role of multiple-sidelobe cancellation. Moreover, STAP provides a distortionless response (unity gain) to the target signal, and it optimizes the Signal-to-Interference-plus-Noise Ratio (SINR). Thus, theoretically optimal target detection can be performed directly on the magnitude (squared) of the output of the STAP filter without the need for a follow-on Doppler filter bank. As shown in Figure 2.4, the weights are applied by taking the complex inner product of the general weight vector w and the vector of space and time samples χ for the range bin under test (referred to as a snapshot vector), y = w^H χ, where y is a complex scalar quantity.

2.2.1 Mathematical Model

Before outlining the derivation of the STAP weight solution, the mathematical models are described below.

Coherent Processing Interval (CPI): The array of L antenna channels samples the echo spatially, while each receiver samples the echo temporally over P pulses in a CPI. The receiver on each element down-converts the echo to baseband, and R range bins of data are stored along a particular antenna pointing angle. A three-dimensional data cube called the CPI, consisting of L channels, P pulses, and R range samples, is used by a STAP process as shown in Figure 2.5. It holds the received Space-Time complex data for all ranges along one array azimuth pointing angle. Both the range and pulse dimensions are time samples, but the range dimension is called fast time, as it is sampled much faster than the P pulses in the pulse dimension (called slow time). For each of the R range bins, the associated L × P matrix of Space-Time samples is reshaped into a single long column vector of length LP × 1, denoted χ_l; this is called a Space-Time snapshot because it corresponds to a single range gate (or range sample time), indexed by the variable l.

Figure 2.5. Input data cube for a single CPI

Steering Vector: A spatial steering vector is the set of normalized signal values that the antenna array receives, at any given time and over all L elements, from a wave coming from a particular angle:

    a(θ) = [1, e^{j2πθ}, ..., e^{j(L−1)2πθ}]^T    (2.1)

A temporal steering vector comprises the normalized signal values that a single element of the antenna array receives, over all P pulses, from a wave having a particular Doppler frequency:

    b(ω) = [1, e^{j2πω}, ..., e^{j(P−1)2πω}]^T    (2.2)

Thus, the Space-Time steering vector v(θ, ω) is the Kronecker product of the temporal and spatial steering vectors:

    v(θ, ω) = b(ω) ⊗ a(θ)    (2.3)

where a(θ) is the spatial steering vector at relative spatial frequency θ and b(ω) is the temporal steering vector at relative Doppler frequency ω. A Space-Time steering vector is defined as the normalized response of a target having relative spatial frequency θ and relative Doppler frequency ω.

Interference: The Space-Time snapshot may be decomposed as

    χ = α_t v_t + χ_u    (2.4)

where v_t = v(θ, ω) is the target Space-Time steering vector being tested, α_t is the target complex amplitude, and χ_u is the undesired component, defined to be

    χ_u = χ_c + χ_j + χ_n    (2.5)

which consists of clutter χ_c, jamming χ_j, and noise χ_n. Since these are assumed mutually uncorrelated, the total covariance matrix is given by

    Ψ_u = E{χ_u χ_u^H} = Ψ_c + Ψ_j + Ψ_n    (2.6)
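To make Equations (2.1)-(2.3) concrete, the following minimal C sketch (illustrative code, not from the thesis; the interleaved re/im storage and the function names are assumptions) builds a(θ), b(ω), and their Kronecker product v(θ, ω) = b(ω) ⊗ a(θ):

    #include <math.h>

    #define TWO_PI 6.28318530718f

    /* Spatial steering vector a(theta): a[l] = exp(j*2*pi*theta*l), l = 0..L-1.
       Complex numbers are stored interleaved as (re, im) float pairs. */
    static void spatial_steering(float *a, int L, float theta)
    {
        for (int l = 0; l < L; l++) {
            a[2*l]   = cosf(TWO_PI * theta * (float)l);
            a[2*l+1] = sinf(TWO_PI * theta * (float)l);
        }
    }

    /* Temporal steering vector b(omega): b[p] = exp(j*2*pi*omega*p), p = 0..P-1. */
    static void temporal_steering(float *b, int P, float omega)
    {
        for (int p = 0; p < P; p++) {
            b[2*p]   = cosf(TWO_PI * omega * (float)p);
            b[2*p+1] = sinf(TWO_PI * omega * (float)p);
        }
    }

    /* Space-Time steering vector v = b(omega) (x) a(theta), Equation (2.3):
       element (p, l) of v is the complex product b[p] * a[l]. */
    static void spacetime_steering(float *v, const float *a, int L,
                                   const float *b, int P)
    {
        for (int p = 0; p < P; p++)
            for (int l = 0; l < L; l++) {
                float br = b[2*p], bi = b[2*p+1];
                float ar = a[2*l], ai = a[2*l+1];
                v[2*(p*L+l)]   = br*ar - bi*ai;   /* real part */
                v[2*(p*L+l)+1] = br*ai + bi*ar;   /* imaginary part */
            }
    }

The Kronecker structure is visible in the indexing: the spatial vector a is replicated P times, each copy scaled by one temporal phase term b[p].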

2.2.2 Full Rank STAP

Referring to the Finite Impulse Response (FIR) STAP architecture in Figure 2.4, the filter weights are applied to the snapshot vector via a complex inner product that produces a complex scalar output

    y = w^H χ    (2.7)

where w is the general weight vector and, recalling Equation (2.4), the Space-Time snapshot vector for the range bin under test is χ = α_t v_t + χ_u, with the generally complex amplitude α_t being zero if no target energy is present. The general weight vector w is often chosen adaptively using training data in an attempt to maximize the average SINR. The optimal weight vector that maximizes the SINR, while maintaining unity gain in the desired Space-Time signal direction, is well known to be

    w_MVDR = µ Ψ_u^{-1} v_t    (2.8)

where v_t is the target Space-Time steering vector being tested and µ = 1/(v_t^H Ψ_u^{-1} v_t) is a complex constant. The filter is also called a Minimum Variance Distortionless Response (MVDR) beamformer due to the single main-beam constraint used. The optimal solution is never found in practice, since it would require an infinite number of independent and identically distributed snapshot sample vectors. Using Sample Matrix Inversion (SMI), an estimate Ψ̂_u of the interference covariance matrix is inverted and substituted for the true covariance matrix in Equation (2.8) to form the adaptive weight vector, which is an estimate of the optimal weight vector. The filter ŵ_MVDR realized from this estimate has been shown to produce, on average, an SINR within 3 dB of the optimum when applied to the data. Note that once the covariance matrix is estimated, it must be inverted in order to solve for the adaptive weight vector. Since matrix inversion is computationally intensive, alternative methods are presented and evaluated in the following chapter so that our implementation performs efficiently. Because signals nearly aligned with the target vector are interpreted by the adaptive filter as interference and consequently cancelled, the range bin under test and nearby range cells are typically excluded from the interference covariance estimation. Since the interference in the remaining range cells is statistically similar, it can be estimated by averaging. The well-known maximum likelihood estimate of Ψ_u is formed as

    Ψ̂_u = (1/K) Σ_{k=1}^{K} χ_{u_k} χ_{u_k}^H    (2.9)

where χ_{u_k} is the k-th Space-Time snapshot used in the estimate, which may be taken from any range bin.
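As a reference for Equation (2.9), the following scalar C sketch (illustrative only; the names and the interleaved complex layout are assumptions, and the actual implementation never forms this product explicitly but works on the data matrix via QR decomposition, cf. Chapter 3) accumulates the sample covariance estimate over K snapshots of dimension D:

    /* R: D x D complex result, interleaved (re, im); snap: K snapshots,
       each D complex elements. Computes R = (1/K) * sum_k x_k * x_k^H. */
    static void sample_covariance(float *R, const float *snap, int K, int D)
    {
        for (int i = 0; i < 2*D*D; i++) R[i] = 0.0f;
        for (int k = 0; k < K; k++) {
            const float *x = snap + 2*D*k;
            for (int i = 0; i < D; i++)
                for (int j = 0; j < D; j++) {
                    /* R[i][j] += x[i] * conj(x[j]) */
                    float xr = x[2*i], xi = x[2*i+1];
                    float yr = x[2*j], yi = x[2*j+1];
                    R[2*(i*D+j)]   += xr*yr + xi*yi;
                    R[2*(i*D+j)+1] += xi*yr - xr*yi;
                }
        }
        for (int i = 0; i < 2*D*D; i++) R[i] /= (float)K;
    }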

Figure 2.6. Taxonomy of reduced-dimension STAP algorithms.

For most airborne radar applications, the implementation of fully adaptive STAP is not feasible due to the computational complexity of the weight computation process and the amount of data required to train the adaptive weights. Therefore, sub-optimal adaptive techniques are used in practice. Four classes of STAP algorithms are distinguished by the type of processing applied before the adaptive processing: element-space pre-Doppler, element-space post-Doppler, beam-space pre-Doppler, and beam-space post-Doppler, as shown in Figure 2.6. This figure divides STAP architectures into four basic types according to the type of preprocessor, or equivalently the data domain in which the adaptive processing is performed. Element-space approaches retain fully adaptive spatial processing but reduce the dimensionality through temporal preprocessing on each element. The temporal preprocessing may simply select a small number of pulses (pre-Doppler), or it may filter the pulses on each element or beam (post-Doppler); the former is termed pre-Doppler because full-CPI filtering occurs after adaptation. Our implementation focuses on element-space post-Doppler adaptive techniques, which provide increasingly effective clutter mitigation at the cost of higher processing throughput requirements. The benchmark is based on the hard case in [6], called third-order Doppler-factored STAP.

Chapter 3

Functional Specification of STAP System

As discussed in the previous chapter, the STAP algorithms involve a two-dimensional filtering technique that uses a phased-array antenna with multiple spatial channels, filtering over both pulses (temporal) and channels (spatial) to cancel Doppler-spread clutter and interference. Based on the statistics of the interference environment, the STAP weight vector is formed; this weight vector is then applied to the coherent samples received by the radar. The critical problem in implementing STAP algorithms is the overwhelming computational complexity on the radar platform processor, since the weight vectors need to be updated within real-time requirements. Our goal is to provide a benchmarking methodology that can be used to evaluate the CBE processor as a platform for STAP applications. The STAP benchmark in this thesis is based on the RT_STAP benchmark developed by the MITRE Corporation [6]. The benchmark represents a processing mode with L spatial channels sampled at 5 MHz and performs Doppler-factored STAP with order-Q adaptive nulling along with pulse compression and Doppler filtering operations (L = 22, Q = 3). The whole processing flow needs to finish within an interval of 32.5 milliseconds. In this chapter, the functional and timing specification of the application is presented, and the computational complexity is elaborated.

3.1 Preprocessing

Our benchmark includes the implementation of the data preprocessing typically performed before the application of STAP. Preprocessing usually includes: conversion of the received radar signals to in-phase and quadrature (I/Q) samples at baseband, array calibration, and pulse compression. In the past, preprocessing has been implemented using special-purpose hardware. However, implementing the preprocessing functions within the CBE communication fabric should significantly reduce the number of interfaces, thus simplifying the system. A CPI corresponding to L channels, P Pulse Repetition Intervals (PRIs), and N time samples per PRI must be processed by the STAP algorithms. As input, these data samples are real-valued integers. As shown in Figure 3.1, the preprocessing functions are applied to the A/D data samples independently across the L channels.

Figure 3.1. Preprocessing block diagram for a single-array channel

3.1.1 I/Q Conversion

I/Q conversion is used to demodulate the signal to baseband and generate digital samples at a certain sampling rate. In many cases, digital samples are generated at an Intermediate Frequency (IF) and sampled at a higher rate than required to accurately represent the baseband signal. For our system, the digital signal must be demodulated to baseband, low-pass filtered, and decimated to a lower sample rate. Demodulation is performed by multiplying the data with demodulation coefficients (i.e., a complex sinusoid) to translate the signal to baseband. The complex multiplication is applied to both the in-phase and quadrature-phase components of the data; it requires 2N floating-point operations per channel and PRI, resulting in L × P × 2N floating-point operations in total. The samples are then processed by a low-pass filter to remove aliased frequency components, where the low-pass filter is real-valued with length K_a. Decimation is achieved simply by choosing the samples corresponding to the desired sample rate. To minimize computational complexity, direct linear convolution is chosen instead of the Fast Fourier Transform (FFT): because decimation takes place as part of this process, discrete linear convolution of the low-pass filter provides the most efficient implementation, since only one of every D output samples is computed (D is the decimation rate). With this approach, FIR filtering and decimation require 2 × K_a − 1 floating-point operations per output sample and component, for a total of L × P × N_D × (2 × K_a − 1) × 2 floating-point operations, where N_D is the number of decimated filter output samples. However, we assume that the received data are already demodulated and low-pass filtered, so I/Q conversion is not part of our implementation.
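The "compute only every D-th output" idea can be sketched as follows (a minimal illustration, not the thesis code; it assumes interleaved complex floats and an input pre-padded with K_a − 1 zeros):

    /* Decimating FIR: y[n] = sum_k h[k] * x[n*D - k], computed only for
       the ND retained outputs. h is the real anti-aliasing filter (Ka taps),
       x is complex interleaved and pre-padded with Ka-1 leading zeros. */
    static void fir_decimate(float *y, const float *x, const float *h,
                             int ND, int Ka, int D)
    {
        for (int n = 0; n < ND; n++) {
            float acc_re = 0.0f, acc_im = 0.0f;
            for (int k = 0; k < Ka; k++) {
                int idx = n*D + (Ka - 1) - k;   /* offset into padded input */
                acc_re += h[k] * x[2*idx];
                acc_im += h[k] * x[2*idx+1];
            }
            y[2*n]   = acc_re;
            y[2*n+1] = acc_im;
        }
    }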

3.1.2 Array Calibration

Array calibration measures the antenna response and equalizes it across all channels. In general, variations over time in the response of the antenna elements, receivers, and other components of the array introduce unknown amplitude and phase variations in the data. For wideband signals, these variations are often a function of frequency, and such frequency-dependent variations affect the ability of STAP algorithms to null undesired interference. Calibration can be achieved by applying an FIR filter to the data with filter coefficients designed to equalize the antenna response; the coefficients are typically determined offline. The most efficient implementation is to use either overlap-save or overlap-add fast convolution techniques. In addition, combining the computation of array calibration and pulse compression reduces the computational complexity even further, as discussed in Section 3.1.4.

3.1.3 Pulse Compression

Pulse compression is employed to transmit relatively long pulses with low peak power while achieving high signal energy and good detection performance. It is implemented by an FIR filter with coefficients matched to the signal waveform; after this matched filter, the pulse has a duration equivalent to the inverse of the transmitted signal bandwidth. The filter coefficients are matched to the transmitted signal, with a taper applied to reduce range sidelobes and improve range resolution. As noted in the previous section, it is most efficient to combine the implementations of calibration and pulse compression, as described in the following section.

3.1.4 Combination of Array Calibration and Pulse Compression

The combined filter of array calibration and pulse compression has filter length K_cp = K_c + K_p, where K_c is the filter length of calibration and K_p is the filter length of pulse compression. Fast convolution techniques based on either the overlap-add or the overlap-save method represent the most efficient implementation of this linear convolution. In our implementation, overlap-save is chosen and is presented in the subsequent discussion. The overlap-save method is illustrated in Figure 3.2. The method begins by prepending K_cp − 1 zeros to the complex data. The appended data sequence is then divided into B overlapping segments, where B = ⌈N_D/N_fft⌉ and N_fft is the length of the discrete linear convolution; each segment overlaps the previous one by K_cp − 1 samples. Each segment is circularly convolved with the sequence of filter coefficients using FFT techniques: FFTs are applied to both the data block and the sequence of filter coefficients, with both sequences zero-padded so that the length of the FFT is a power of two.

Figure 3.2. Overlap-save method

Once the FFTs are computed, the transformed sequences are multiplied and an inverse FFT is applied to the result to obtain the time-domain representation of the circular convolution. Samples 1 through K_cp − 1 are discarded, and the remaining samples from the B data segments are assembled to form the final output of the preprocessing. The computation of the FFTs and inverse FFTs takes 5 × N_fft × log2 N_fft operations per (inverse) FFT, and the multiplication of the sequences requires 6 × N_fft floating-point operations per data block. In total, this stage requires L × P × 3 × (10 × N_fft × log2 N_fft + 6 × N_fft) floating-point operations.
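A structural sketch of this overlap-save loop is shown below. It is illustrative only: cfft()/icfft() stand for an assumed in-place complex FFT and inverse FFT of size n (e.g., the radix-2 FFT of Section 6.1.1), H is the precomputed n-point FFT of the zero-padded combined filter, and the input x is assumed pre-padded with K_cp − 1 zeros.

    #define MAX_NFFT 256                 /* = N_fft in our configuration */

    /* Assumed helpers: in-place complex FFT / inverse FFT of size n on
       interleaved (re, im) float data. */
    extern void cfft(float *buf, int n);
    extern void icfft(float *buf, int n);

    /* Overlap-save fast convolution. total is the padded input length in
       complex samples. Each iteration consumes step = n - (Kcp - 1) new
       samples; the last Kcp - 1 samples of the previous segment are
       re-read as overlap. */
    static void overlap_save(float *y, const float *x, const float *H,
                             int total, int n, int Kcp)
    {
        float seg[2 * MAX_NFFT];
        int step = n - (Kcp - 1);

        for (int start = 0; start + n <= total; start += step) {
            for (int i = 0; i < 2 * n; i++)      /* load overlapped segment */
                seg[i] = x[2 * start + i];
            cfft(seg, n);
            for (int i = 0; i < n; i++) {        /* pointwise multiply by H */
                float ar = seg[2*i], ai = seg[2*i+1];
                float br = H[2*i],   bi = H[2*i+1];
                seg[2*i]   = ar*br - ai*bi;
                seg[2*i+1] = ar*bi + ai*br;
            }
            icfft(seg, n);
            /* the first Kcp-1 outputs are circular wrap-around: discard */
            for (int i = Kcp - 1; i < n; i++) {
                int o = start + i - (Kcp - 1);
                y[2*o]   = seg[2*i];
                y[2*o+1] = seg[2*i+1];
            }
        }
    }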

3.2 Post-Doppler Adaptive Processing

Modern airborne radar systems must have the capacity to detect targets with relatively small cross-sections in the presence of strong clutter and interference. Due to the motion of the radar platform, the clutter returns have a non-zero Doppler shift that depends on the azimuth of the clutter source. Space-Time Adaptive Processing algorithms have been developed to directly address the cancellation of Doppler-spread clutter, as shown in Figure 2.3. As shown in Figure 2.6, there are four types of STAP architectures; element-space post-Doppler STAP is chosen for implementation in this report. The element-space approach retains full spatial adaptivity but reduced temporal dimensionality, and post-Doppler means that the temporal preprocessing filters the pulses on each element before weight computation. The Doppler filter, with its potential for low sidelobes, can suppress portions of the clutter ridge, thereby localizing the competing clutter in angle. In factored post-Doppler processing, separate spatial adaptive processing is done in each Doppler bin. The weight update rate with post-Doppler STAP is once per CPI.

3.2.1 Doppler Filtering

The first component of the post-Doppler adaptive algorithm is Doppler processing, which transforms the signals from the time domain to the frequency domain to reduce the size of the STAP problem. A precomputed window function is applied to the data to reduce spectral leakage. Doppler filtering is implemented by applying a discrete Fourier transform of length K across the P pulses of the preprocessed data for a given range cell and channel, where K represents the number of Doppler cells to be processed. In practice, the discrete Fourier transform is implemented using the FFT, and the data samples are zero-padded so that the length of the FFT is a power of two. The radix-2 FFT employed in our implementation is presented in Section 6.1.1. Application of the real-valued window function of length P across all pulses of a given range cell and channel requires 2P floating-point operations, and the K-point FFT takes 5 × K × log2 K operations. Therefore, 5 × K × log2 K + 2 × P operations are needed to process the Doppler filtering for each range cell and channel, and L × R × (5 × K × log2 K + 2 × P) operations are required in total.
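For one (channel, range cell) pair, Doppler filtering therefore amounts to a windowing pass followed by a zero-padded FFT, as in this hedged sketch (cfft() is the same assumed in-place complex FFT helper as above; names are illustrative):

    extern void cfft(float *buf, int n);   /* assumed in-place complex FFT */

    /* Doppler filtering for one (channel, range cell) pair: apply the real,
       precomputed window w across the P pulses (2P flops), zero-pad to the
       Doppler FFT size K, and transform (5*K*log2(K) flops). */
    static void doppler_filter(float *bins, const float *pulses,
                               const float *w, int P, int K)
    {
        for (int p = 0; p < P; p++) {
            bins[2*p]   = w[p] * pulses[2*p];
            bins[2*p+1] = w[p] * pulses[2*p+1];
        }
        for (int p = P; p < K; p++)        /* zero pad up to K */
            bins[2*p] = bins[2*p+1] = 0.0f;
        cfft(bins, K);
    }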

3.2.2 Weight Computation

The high-order Doppler-factored STAP algorithm can be one of the most effective STAP techniques known for clutter and interference suppression. Figure 3.3 shows the high-order Doppler-factored STAP architecture of order Q, where Q is three in our implementation. The architecture consists of Doppler processing across all PRIs followed by adaptive filtering across sensors and adjacent Doppler bins. Adaptive filtering of the data simultaneously uses spatial and temporal Degrees Of Freedom (DOF) in each specified Doppler bin. The spatial DOF are provided by the L array channels, while the temporal DOF are provided by the Q adjacent Doppler bins centered on the specified Doppler bin.

Figure 3.3. Third-order Doppler-factored STAP

For third-order Doppler-factored STAP, we define k_l and k_r as the Doppler bins adjacent to the Doppler bin k_c. The L̂ × 1 Space-Time snapshot vector is defined to be

    y(k, r) = [x(k_l, r) x(k_c, r) x(k_r, r)]^T    (3.1)

where L̂ = L × 3 and each x(k, r) contains all channel samples at the r-th range cell and k-th Doppler bin. As illustrated in Equation (2.8), the adaptive weights are obtained by solving the equation

    Ψ(k, r) w(k, r) = µ v_t    (3.2)

where µ is a scale factor, v_t is the target Space-Time steering vector, and Ψ(k, r) is the Space-Time covariance matrix computed by

    Ψ(k, r) = E{y(k, r) y^H(k, r)}    (3.3)

From Equation (3.2), we know that the high-order Doppler-factored STAP algorithm clearly depends on knowledge of the Space-Time covariance matrix. For most practical applications, this matrix is unknown and must be estimated from the data samples. In general, an estimate of the covariance matrix is computed by averaging over snapshot vectors from adjacent range cells. Following [6], the training strategy for selecting the snapshot vectors involves dividing the range cells into M non-overlapping blocks containing N_R range samples each, where M = N_D/N_R. The covariance matrix for the k-th Doppler bin and n-th block of contiguous range cells is computed by averaging the outer products of the snapshots:

    Ψ̂(k, r) = (1/N_R) Σ_{r=r_1}^{r_1+N_R−1} y(k, r) y^H(k, r)    (3.4)

where Ψ̂(k, r) is the estimate of the covariance matrix, which is substituted into Equation (3.2) to compute the adaptive weight vector. If the L̂ × N_R Space-Time data matrix X is defined to be

    X(k, m) = [y(k, mN_R)  y(k, mN_R + 1)  ···  y(k, (m + 1)N_R − 1)]    (3.5)

then Equation (3.4) can be rewritten as

    Ψ̂(k, r) = (1/N_R) X(k, m) X^H(k, m)    (3.6)

and Equation (3.2) becomes

    X(k, m) X^H(k, m) w(k, r) = µ N_R v_t    (3.7)

This equation is the main computation in our implementation of STAP. To obtain the weight vector w, the inverse of X(k, m) X^H(k, m) is needed, and the computation required to invert a matrix grows steeply with the matrix size. The other option for solving this linear equation is to first perform a matrix factorization and then solve the system using the resulting simpler matrices.

QR decomposition

The weight vector is computed by first performing a QR decomposition on the full column-rank Space-Time data matrix X(k, m) defined in Equation (3.7). The QR decomposition produces an N_R × N_R unitary matrix Q and an N_R × L̂ upper triangular matrix R such that X^T = QR. The matrix R can be written as [R_1^T 0]^T, where R_1 is an L̂ × L̂ full-rank upper triangular matrix. The matrix product X(k, m) X^H(k, m) then decomposes as

    X(k, m) X^H(k, m) = R^T Q^T Q^* R^* = R_1^T R_1^*    (3.8)

where Q^T Q^* = I. There are a variety of methods to implement the QR decomposition; Modified Gram-Schmidt (MGS) is chosen due to its lower computational complexity, and more discussion is presented in Section 6.1.3. MGS takes 8 × N_R × L^2 × Q^2 floating-point operations per matrix, so a total of K × M × (8 × N_R × L^2 × Q^2) operations are required for all Doppler bins and blocks of range cells.
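For reference, a real-valued simplification of MGS is sketched below (the actual implementation works on complex data with SPE intrinsics, see Section 6.1.3 and Appendix A; names are illustrative). A is m × n with full column rank, stored column-major; on return, A holds the orthonormal columns of Q and R holds the n × n upper triangular factor.

    #include <math.h>

    static void mgs_qr(float *A, float *R, int m, int n)
    {
        for (int i = 0; i < n; i++) {
            float *qi = A + i*m;
            float nrm = 0.0f;                       /* R[i][i] = ||a_i|| */
            for (int k = 0; k < m; k++) nrm += qi[k]*qi[k];
            nrm = sqrtf(nrm);                       /* nonzero: full rank */
            R[i*n + i] = nrm;
            for (int k = 0; k < m; k++) qi[k] /= nrm; /* normalize column i */
            for (int j = i + 1; j < n; j++) {       /* orthogonalize the rest */
                float *aj = A + j*m;
                float r = 0.0f;                     /* R[i][j] = q_i . a_j */
                for (int k = 0; k < m; k++) r += qi[k]*aj[k];
                R[i*n + j] = r;
                for (int k = 0; k < m; k++) aj[k] -= r*qi[k];
            }
        }
    }

The normalization step is where the reciprocal square root of Section 6.1.3 (approximated via Newton-Raphson on the SPE) enters the real implementation.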

Forward and Backward Substitution

Equation (3.7) can be rewritten as

    X(k, m) X^H(k, m) w(k, r) = R_1^T R_1^* w(k, r) = µ v_t    (3.9)

The vector w is therefore obtained by first applying forward elimination to solve R_1^T p = µ v_t, where R_1^T is a lower triangular matrix, and then using backward substitution on R_1^* w(k, r) = p to get the weight vector w, where R_1^* is an upper triangular matrix. Forward and backward substitution each require 4 × L^2 × Q^2 floating-point operations; in total, K × M × 2 × (4 × L^2 × Q^2) operations are required.

Weights Application

At the output of STAP, N_R values are given by the product of the data matrix and the weight vector:

    w^H(k, m) X(k, m)    (3.10)

Detection algorithms can then be applied to the result to locate targets in range and Doppler. This process requires 8 × L × Q × N_R floating-point operations per Doppler bin and block of range cells, or K × M × (8 × L × Q × N_R) in total.
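A minimal real-valued sketch of forward substitution is given below (the SIMD complex version appears in Appendix B; names are illustrative). Backward substitution is the mirror image, looping from the last row upward against the upper triangle.

    /* Solve L p = b for a lower triangular, row-major n x n matrix L
       (here L stands for R_1^T). */
    static void forward_subst(float *p, const float *L, const float *b, int n)
    {
        for (int i = 0; i < n; i++) {
            float s = b[i];
            for (int j = 0; j < i; j++)
                s -= L[i*n + j] * p[j];    /* subtract already-solved terms */
            p[i] = s / L[i*n + i];         /* divide by the diagonal */
        }
    }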

3.3 Computation Issues

Definitions of the variables used in Table 3.1 are given below:

L: Number of channels (22)
P: Number of pulses per CPI (64)
N: Number of samples per PRI before decimation (1920)
D: Decimation factor (4)
N_D: Number of samples per pulse after decimation, N_D = ⌊N/D⌋ (480)
K_a: FIR filter length used for anti-aliasing in I/Q conversion (36)
N_fft: FFT size of the combined filter for array calibration and pulse compression (256)
K: Doppler FFT size (64)
M: Number of independent non-overlapping blocks N_D/N_R of contiguous range samples used to calculate the adaptive weights (2)
N_R: Number of contiguous range cells per weight computation (240)
Q: Processing order (3)

Function                            Operation Count
I/Q Conversion                      L × P × (N + N_D × K_a) mul
                                    L × P × N_D × (K_a − 1) add
Calibration and Pulse Compression   L × P × 3 × (N_fft + (N_fft/2) × log2 N_fft) mul
                                    L × P × 3 × (N_fft × log2 N_fft) add
Doppler Filtering                   L × N_D × ((K/2) × log2 K + P) mul
                                    L × N_D × (K × log2 K) add
QR Decomposition                    K × M × (N_R × L^2 × Q^2) mul
                                    K × M × (N_R × L × Q) complex-real div
                                    K × M × (N_R × L^2 × Q^2 − (1/2) × L^2 × Q^2 − (1/2) × L × Q) add
Forward Substitution                K × M × (L^2 × Q^2 / 2) mul
                                    K × M × (L^2 × Q^2 / 2) add
Backward Substitution               K × M × (L^2 × Q^2 / 2) mul
                                    K × M × (L^2 × Q^2 / 2) add
Weight Apply                        K × M × (L × Q × N_R) mul
                                    K × M × (L × Q × N_R) add

Table 3.1. Complex operation counts of STAP

Function                            Complex Operations   Rate
I/Q Conversion                      50,688,000           14.6%
Calibration and Pulse Compression   14,057,472           4.0%
Doppler Filtering                   6,758,400            1.9%
QR Decomposition                    269,377,152          77.8%
Forward Substitution                557,568              0.1%
Backward Substitution               557,568              0.1%
Weight Apply                        4,055,040            1.1%
Total                               346,051,200

Table 3.2. Instruction counts of STAP (one complex operation per cycle)

If we consider each complex operation as one computing instruction, then the instruction counts can be derived as presented in Table 3.2. In this table, the total number of required instructions is 346,051,200. The sampling rate is 5 MHz, giving 1920 samples per PRI, which reduces to 1.25 MHz and 480 samples after decimation.

Function                            FLOPs            Rate
I/Q Conversion                      101,376,000      7.75%
Calibration and Pulse Compression   92,995,584       7.11%
Doppler Filtering                   21,626,880       1.65%
QR Decomposition                    1,070,530,560    81.91%
Forward Substitution                2,097,152        0.16%
Backward Substitution               2,097,152        0.16%
Weight Apply                        16,220,160       1.24%
Total                               1,306,943,488

Table 3.3. Floating-point operations of STAP (one floating-point operation per cycle)

If we consider each real-valued floating-point operation as one computing instruction, with a complex multiplication taking 6 FLOPs, a complex addition 2 FLOPs, and a complex-real division 2 FLOPs, then the instruction counts can be derived as in Table 3.3. In this table, the total number of required instructions is 1,306,943,488. The sampling rate is 5 MHz for 1920 samples, which reduces to 1.25 MHz and 480 samples after decimation. The weight computation is by far the most intensive part, accounting for more than 80% of the computation; therefore, one of the most important issues in this thesis is to implement the weight computation efficiently.

Chapter 4

Overview of Cell Broadband Engine

The increasing demands of emerging multimedia, digital entertainment, and other computation-intensive applications have generated a surge in the demand for computing power. Fortunately, modern VLSI technology, with its ever-shrinking transistor sizes over the last decade, has delivered much of the desired performance improvement. However, increasing the operating frequency, refining the architecture, and lengthening pipelines by spending ever larger numbers of transistors no longer yield the significant returns they once did. With that in mind, the most efficient way forward today is to integrate multiple processing units on the same die. Fundamentally, there are two different approaches to multi-core processing. One is Symmetric Multi-Processing (SMP), adopted by vendors like Intel and AMD, in which similar processing cores are integrated. The other approach is the so-called tiled processor, of which the most prominent recent example is the Cell processor. The Cell Broadband Engine (CBE), commonly known as Cell, is a microprocessor developed through the partnership of Sony, Toshiba and IBM (STI). The Cell processor was initially designed for game consoles and media-rich consumer-electronics devices, yet it is flexible enough to serve as a conventional general-purpose microprocessor. Meanwhile, much broader use in a wide variety of application domains, from scientific computing to supercomputing, is envisioned.

4.1 Architecture

The Cell Broadband Engine is a single-chip heterogeneous multi-core processor that combines a general-purpose 64-bit Power Architecture core, called the Power Processing Element (PPE), with eight streamlined high-performance SIMD RISC engines, called Synergistic Processing Elements (SPEs).


The latest Cell processor runs at a core frequency of up to 3.2 GHz. Connecting these nine cores to each other, to the external memory, and to the input/output interface is the Element Interconnect Bus (EIB), a specialized high-bandwidth circular data bus [3]. Figure 4.1 shows a block diagram of the Cell Broadband Engine.

Figure 4.1. Cell Broadband Engine (CBE) block diagram

4.1.1 Power Processing Element (PPE)

The Power Processing Element is, as the name suggests, a 64-bit PowerPC architecture Reduced Instruction Set Computer (RISC) core with the Vector/SIMD Multimedia Extension. In the Cell processor, the PPE runs the operating system and acts as a controller for the other eight SPEs, which handle most of the computational workload. The PPE supports two simultaneous hardware threads of execution, virtually enabling it to process two tasks at the same time. Figure 4.2 shows a block diagram of the PPE.

The PPE consists of two main units, namely the PowerPC Processing Unit (PPU) and the PowerPC Processor Storage Subsystem (PPSS). The PPU has six execution units, including one Vector/Scalar Unit (VSU) for the 128-bit Vector/SIMD Multimedia Extension, which executes floating-point and Vector/SIMD Multimedia Extension instructions. The PPU has a conventional cache hierarchy with 32 kB each of Level-1 (L1) instruction and data cache. The PPSS handles all memory accesses between the PPE and external memory, the SPEs, and I/O devices on the Element Interconnect Bus; it can load 32 bytes and store 16 bytes, independently and memory-coherently, per processor cycle. It contains a 512 kB unified Level-2 (L2) instruction and data cache with error-correction code (ECC).


Figure 4.2. Power Processing Element (PPE) block diagram

4.1.2 Synergistic Processing Element (SPE)

The SPE offers a brand-new processor architecture with a 128-bit dual-issue pipelined SIMD data path that aims to accelerate streaming, data-rich, computation-intensive applications. The SPE consists of two main units, namely the Synergistic Processor Unit (SPU) and the Memory Flow Controller (MFC). Figure 4.3 shows a block diagram of the SPE.

Figure 4.3. Synergistic Processing Element (SPE) block diagram

Each SPU contains 256 kB of dedicated local store (LS) memory (four 64 kB SRAM arrays), which is fully pipelined and single-ported, and supports 16-bytes-per-cycle load and store bandwidth, quadword-aligned-only memory accesses, and 128-byte instruction fetches and DMA transfers. Because the LS has a single port,


load, store, DMA read, DMA write, and instruction fetch operations compete for the same port. DMA operations are buffered and can access the local store at most once every eight cycles. Instruction fetches occur during idle memory cycles, and up to 3.5 fetches may be buffered in the instruction fetch buffer to better tolerate long runs of consecutive memory instructions that would otherwise cause instruction fetch starvation.

Each SPE has a large register file of 128 general-purpose 128-bit registers with six read ports and two write ports, which hold all available data types (integers, single-precision and double-precision floating-point values, scalars, vectors, logicals, bytes, and others). The SPU has two pipelines, an even pipeline and an odd pipeline. With this dual pipeline, the SPU can dispatch and execute up to two instructions per cycle in order, one on each pipeline. The two pipelines contain different types of execution units, and whether an instruction goes to the even or odd pipeline depends on which execution unit performs that instruction type. In general, floating-point and fixed-point instructions use the even pipe, while memory load/store and permute instructions are located in the odd pipe.

The MFC serves as the communication interface, by means of the Element Interconnect Bus (EIB), to external memory, the other processing elements, and I/O devices. The MFC performs the bulk of instruction and data transfers between the LS and external memory through Direct Memory Access (DMA) operations. It supports naturally aligned transfer sizes of 1, 2, 4, or 8 bytes, and multiples of 16 bytes, with a maximum transfer size of 16 kB. Besides the DMA mechanism, the MFC on each SPE provides mailbox and signal-notification messaging, which allows explicit inter-processor communication and synchronization between the processing elements in the system. Mailboxes are intended for passing messages up to 32 bits in length, such as completion flags, storage addresses, and so on. Signaling is similar to the mailbox mechanism, but can additionally be configured for a message accumulation mode (overwrite mode or logical-OR mode).
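As a minimal sketch of mailbox-based synchronization on the SPU side, the blocking mailbox intrinsics from spu_mfcio.h can be used as follows; the completion-flag protocol and the value 0x1 are illustrative assumptions, not part of the STAP implementation:

    #include <spu_mfcio.h>

    /* Hypothetical completion-flag protocol: wait for a "go" word from
       the PPE, process the assigned block, then report a status word
       back. Both calls block: spu_read_in_mbox() stalls until the PPE
       has written the inbound mailbox, and spu_write_out_mbox() stalls
       if the outbound mailbox is full. */
    unsigned int cmd = spu_read_in_mbox();
    /* ... process the data block selected by cmd ... */
    spu_write_out_mbox(0x1);   /* illustrative "done" flag read by the PPE */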

Overall, the SPE does not work like a conventional superscalar CPU, since there is no cache, virtual memory support, or memory coherency. The objective of this memory hierarchy is to address the "memory wall" limitation of conventional architectures, where thousand-cycle penalties are paid when data is not available in the cache. Instructions and data are transferred between main memory and the local store of the respective SPE, or between the local stores of different SPEs, with asynchronous DMA. This allows an SPE to have many concurrent memory accesses in flight, and the DMA latency can easily be hidden with some programming techniques.

The SPE omits hardware branch prediction and scheduling logic, and branches are assumed not taken. A mispredicted branch flushes the pipeline and incurs a high penalty of 18 cycles. To override the default branch decision, the architecture supports a branch hint instruction. A branch hint can prefetch up to 32 cycles ahead, and a correctly hinted taken branch incurs no penalty.

4.1.3 Element Interconnect Bus (EIB)

The EIB is a circular bus made of four 128-bit data rings, two running in each direction, at half the CPU clock speed. It carries data between the PPE and L2 cache and the SPEs, and vice versa. Each ring can convey up to three simultaneous transfers. Each processor element has one on-ramp and one off-ramp and can drive and receive data simultaneously. The EIB is also connected to the Memory Interface Controller and to the FlexIO interface for external communication. Theoretically, the EIB's internal bandwidth is 96 bytes per processor clock cycle (307.2 GB/s at 3.2 GHz), and it can support more than 100 outstanding DMA requests.

4.1.4 Memory and I/O

The Memory Interface Controller (MIC) of the Cell supports one or two channels of high-speed Rambus Extreme Data Rate (XDR) memory, giving a memory bandwidth of 25.6 GB/s (dual 12.8 GB/s channels). The total supported memory is configurable between 64 MB and 64 GB of XDR DRAM.

The Cell Broadband Engine Interface (BEI) unit is the interface the processor uses to communicate with the I/O devices of the system. It consists of the Broadband Interface Controller (BIC), the I/O Controller (IOC), and the Internal Interrupt Controller (IIC). The BEI supports two Rambus FlexIO interfaces: one is a non-coherent I/O interface used for I/O devices such as sound cards, video cards, and so on; the other supports both non-coherent and coherent protocols and can be used to coherently extend the EIB to link up with other Cell processors. The combined bandwidth of the two FlexIO interfaces is 76.8 GB/s.

4.2 Programming Toolchain

In essence, the heterogeneous multi-core nature of the Cell processor means that developing an application introduces additional sources of complexity. Deployments of low-level programming models have demonstrated very competitive performance, but making them work is the trick. IBM provides a broad array of resources to help programmers exploit the performance of the Cell processor.

4.2.1 Compiler

Currently, two compiler sets are available for the Cell BE processor: a modified GCC compiler from Sony and the newly released XL C/C++ compiler from IBM. Both compilers are available for both the PPU and the SPU. The IBM XL C/C++ compiler has been updated to fully exploit the PPE and SPEs of the Cell Broadband Engine Architecture. This compiler provides


state-of-the-art support for user-directed exploitation of parallelization and partitioning over the wide range of heterogeneous parallelism offered by the Cell processor. The user directives adopted to communicate with the Cell processor are based on the OpenMP programming model. This approach allows the programmer to view the computer system as possessing a single shared-memory address space, with all program data residing in this space [1].

Both compilers come with a rich set of C/C++ language extension intrinsics for the SPE and VMX that greatly simplify SIMD programming: the programmer precisely controls the SIMD instructions and data layout, while the compiler deals with instruction scheduling and register allocation. This gives the programmer control over high-level transformations such as loop unrolling while still benefiting from low-level optimization.

Programming the SPE is significantly enhanced by the automatic SIMDization framework in the XL C compiler, which extracts SIMD parallelism from scalar code. This SIMDization optimization extracts SIMD parallelism from the following code structures:

Loop-level SIMDization: Similar to vectorization of innermost loops, instances of a statement are aggregated across consecutive iterations of a loop. This SIMDization is particularly successful at extracting SIMD parallelism in loops and recognizes certain vectorizable patterns such as parallel reductions.

Basic-block level SIMDization: Isomorphic computations on adjacent memory are aggregated into virtual vectors. This SIMDization is particularly successful at extracting SIMD parallelism in unrolled loops, whether unrolled manually by the programmer or automatically by the compiler.

The XL C compiler also enables the OpenMP programming model to guide parallelization decisions. A key component of this parallelization approach is presenting the system with the abstraction of a single shared-memory address space. In the Cell processor, local memories are only addressable by their respective SPEs, and each SPE moves data between its local memory and main memory primarily using DMA transfers. The compiler abstracts this through a compiler-controlled software-cache mechanism that permits reuse of temporary buffers in the SPE local store. In this mechanism, SPE program data reside in system memory, and the compiler automatically manages the data transfers between their locations in system memory and temporary buffers in the respective local memory. Instead of plain load and store instructions, this approach uses instructions that look up the effective address of the data in a directory. If the desired data is present in local memory, the local address of the requested variable is computed and the local copy is used. Otherwise, a miss-handler subroutine is invoked and the requested data is transferred from system memory.


In the SPE, the limited local memory is shared by code and data, so there is always the possibility that a single SPE object does not fit in the limited space available. In this case, overlays can be useful. This approach divides the program into several partitions, and the compiler reserves a small portion of the SPE local memory for a code partition manager. At runtime, the code partition manager is responsible for dynamically loading partitions from system memory into local memory when necessary.

4.2.2 Accelerated Library Framework

The Accelerated Library Framework (ALF) is an Application Programming Interface (API) that accelerates the software development process and provides an abstract view of parallel problems on multi-core memory-hierarchy systems. The implementation of this framework focuses on solving data-parallel problems on a host-accelerator hybrid system. Currently, ALF supports only the Single-Program-Multiple-Data (SPMD) programming scenario, in which the application consists of a control task, typically residing on the PPE, and a single program running on all allocated accelerator elements (SPEs) at a time. ALF's most important features include data transfer management, parallel task management, double buffering, and data partitioning [4].

Figure 4.4. Overview of ALF [4]

Figure 4.4 illustrates an overview of the different components of the ALF. On


the host element, the input data and corresponding output data are broken up into a number of smaller partitions, called work blocks. With the provided ALF API, these work blocks are placed in the work queue, where they wait for ALF on the host to assign them to the accelerators. The accelerators then process the assigned work blocks and return the corresponding output to the host element. Overall, ALF offloads part of the developer's burden, because the runtime framework handles the underlying resource and task management, data movement, load balancing, and synchronization issues. However, designing a good partitioning strategy and optimizing the kernel computation remain the responsibility of the developer. Data partitioning is crucial to the ALF programming model. For the Cell BE architecture, a proper data partition size must be chosen, due to the limited size of the local store and the double-buffering scheme of the SPE. Again, the kernel computation must be optimized for the SIMD nature of the SPE.

4.2.3 The Simulator

IBM Full-System Simulator Version 2.0 is a system software infrastructure used to build modeling applications for program performance analysis and detailed microarchitectural modeling of the Cell BE processor. The simulator can be configured to operate anywhere from fast functional instruction-level simulation to performance simulation of an entire system [3][5].

Figure 4.5. Simulation stack of Cell BE processor

Functional simulation, in this case, runs as a debugger to test the functionality and correctness of the developed software, without modeling the cycle time it takes to execute the targeted application. Performance simulation, on the other hand, runs a cycle-accurate model to gather and provide various types of


performance details that allow the developer to conduct performance analysis.

A key attribute of the IBM Full-System Simulator is its ability to boot and run a complete PowerPC system functionally. This allows an operating system to run interactively in simulation, so the simulator can execute many typical application programs that require operating system interaction in a complete environment. Alternatively, the simulator also provides a standalone environment for self-contained applications, in which all operating system functions are supplied by the simulator. The current version of the Full-System Simulator supports a cycle-accurate performance model for the entire system except the PPU. In addition, instead of a Rambus memory model fully compatible with the real hardware, it contains a DDR2 SDRAM memory model that can be configured to approximately match Rambus memory performance.

The SPE model of the simulator accurately mimics the microarchitectural flow of instructions through the SPE pipelines. This provides a variety of useful performance metrics, including the dual-issue rate, static branch hint performance, issue stall conditions, and many more. These performance metrics can be enabled to provide an extensive variety of data collection and analysis capabilities, including the following:

Performance profiling checkpoints: provide a set of checkpoint instructions to delimit a specific region of code over which performance data are to be collected.

Triggers: provide a set of simulator commands that are executed to collect and aggregate performance data whenever a specific type of event occurs.

Emitters: provide a user-extensible and customizable analysis framework through a set of emitter readers.

Performance visualization: provides a rich set of histograms to show performance data such as SPU and DMA event counts.

Two software tools are available in the SDK to assist in measuring the performance of programs: the spu-timing static timing analyzer and the IBM Full System Simulator for the Cell Broadband Engine.

Static Analysis of SPE Threads

The spu-timing analyzer performs a static timing analysis of a program by annotating its assembly instructions with the instruction-pipeline state. The analysis is useful for coarsely spotting dual-issue rates and for assessing program sections that may be experiencing instruction-dependency and data-dependency stalls. It is useful for determining whether dependencies might be mitigated by unrolling, or whether reordering of instructions or better placement of no-ops will improve the dual-issue behavior in a loop. However, static analysis output typically does not


provide numerical performance information about program execution. Thus, it cannot report anything definitive about cycle counts, branches taken or not taken, branches hinted or not hinted, DMA transfers, and so forth.

Figure 4.6. Example of static analysis of SPE threads

Figure 4.6 shows an example generated by the spu-timing static analyzer for SPE code. The example shows significant dependency stalls (indicated by "-") and poor dual-issue rates. The program has an instruction mix of eight even-pipeline (pipe 0) instructions and ten odd-pipeline (pipe 1) instructions. Any program change that minimizes data dependencies will improve the dual-issue rate and lower the Cycles Per Instruction (CPI). The character columns in the static analysis in Figure 4.6 have the following meanings:

• Column 1: The first column shows the pipeline that issued an instruction. Pipeline 0 is represented by "0" and pipeline 1 by "1".

• Column 2: The second column can contain a "D", "d", or nothing. A "D" signifies a successful dual-issue by the two instructions listed in the row pair. A "d" signifies that dual-issue was possible but did not occur due to dependencies, for example, operands being in flight. If there is no entry, dual-issue could not be performed because the issue rules were not satisfied (for example, an even-pipeline instruction was fetched from an odd LS address or an odd-pipeline instruction was fetched from an even LS address).

• Column 3: The third column is always blank.

• Columns 4 through 53: The next 50 columns represent clock cycles; they are labeled by "0123456789" repeated five times. A digit is displayed in these columns whenever the instruction executes during that clock cycle; an n-cycle instruction will therefore display n digits. Dependency stalls are flagged by a dash ("-").

• Column 54 and beyond: The remaining entries on the row are the assembly-language instructions or assembler-line addresses (for example, ".L19") of the assembly code in the program.

Static-analysis timing files can be quickly interpreted by:

• Scanning the columns of digits. Small slopes (more horizontal) are bad; large slopes (more vertical) are good.

• Looking for instructions with dependencies (those with dashes in the listing).

• Looking for instructions with poor dual-issue rates, that is, either a "d" or nothing in column 2.

Dynamic Analysis of SPE Threads

The IBM Full System Simulator for the Cell Broadband Engine performs a dynamic analysis of program execution. Performance numbers are provided for:

• Instruction histograms (for example, branch, hint, and prefetch)

• Cycles Per Instruction (CPI)

• Single-issue and dual-issue rates

• Stall statistics

• Register usage

Figure 4.7 shows a dynamic timing analysis of the same SPE inner loop using the IBM Full System Simulator for the Cell Broadband Engine. The results confirm the view of program execution from the static timing analysis: a poor dual-issue rate (7%) and large dependency stalls (65%), resulting in an overall CPI of 2.39. Most workloads should be capable of achieving a CPI of 0.7 to 0.9, roughly three times better than this. The number of registers used is 73, a 57.3% utilization of the full 128-register set.

4.3 Parallel Programming Methods for the Cell Broadband Engine

Based on the performance analysis discussed in the previous section, our benchmarking procedure, shown in Figure 4.8, was performed by first writing the program using intrinsics, the C-language extensions that wrap inline assembly instructions.


Figure 4.7. Example of dynamic analysis of SPE threads

4.3 Parallel Programming Methods for the Cell Broadband Engine 37 The written program is then compiled to generate assembly code and the assembly code could execute in the IBM Full System Simulator for the Cell Broadband Engine, which support both static and dynamic performance analysis. The aim was to design a high data parallelization and task parallelization program to achieve as short execution time as possible. To eliminate stalls and improve the CPI, the compiler needs more instruction to schedule, so that the program does not stall. There are several programming methods to exploit the parallelism for multiprocessor as discussed below.

Figure 4.8. Software development flow on CBE

4.3.1 SIMD Vectorization

Single Instruction Multiple Data (SIMD) processing exploits data-level parallelism. Data-level parallelism means that the operations required to transform a set of vector elements can be performed on all elements of the vector at the same time: a single instruction is applied to multiple data elements in parallel. Support for SIMD operations is pervasive in the Cell Broadband Engine. In both the PPE and the SPEs, vector registers hold multiple data elements as a single vector. The data paths and registers supporting SIMD operations are 128 bits


wide, corresponding to four full 32-bit words. This means that four 32-bit words can be loaded into a single register and added to four other words in a different register in a single operation, as shown in Figure 4.9. A simple example of SIMDization is shown below.

Figure 4.9. Example of SIMD addition
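As a minimal code sketch of the addition in Figure 4.9, using the SPU intrinsics (the literal values are illustrative only):

    #include <spu_intrinsics.h>

    /* Four 32-bit floats per 128-bit register are added in a single
       SIMD instruction. */
    vector float a = {1.0f, 2.0f, 3.0f, 4.0f};
    vector float b = {5.0f, 6.0f, 7.0f, 8.0f};
    vector float c = spu_add(a, b);   /* c = {6, 8, 10, 12} */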

4.3.2 Interleaved Load

The SPU has two pipelines, named even (pipeline 0) and odd (pipeline 1), into which it can issue and complete up to two instructions per cycle, one in each pipeline. Whether an instruction goes to the even or odd pipeline depends on its instruction type. In general, loads, stores, quadword rotates, and shuffles execute on pipeline 1, and most arithmetic operations execute on pipeline 0. The SPU issues all instructions in program order according to the pipeline assignment. An instruction becomes issuable when its register dependencies are satisfied and there is no structural hazard (resource conflict) with prior instructions, no memory hazard (DMA load), and no error-correcting code (ECC) activity. Data dependencies commonly occur in loops where a computation immediately follows the load of its operands: the computation stalls until the load instruction has finished, which wastes SPU resources. To remove this data dependency, the load for the next iteration should be issued in advance, as illustrated in Figure 4.10. This technique, called interleaved loading, improves the dual-issue rate and reduces dependency stalls.
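A minimal sketch of interleaved loading, assuming a hypothetical array of vectors in[] and element-wise squaring as the computation:

    #include <spu_intrinsics.h>

    /* The load for iteration i+1 is issued before the compute of
       iteration i, so the arithmetic (pipeline 0) can overlap the
       load (pipeline 1). */
    void square_all(vector float *in, vector float *out, int n)
    {
        vector float cur = in[0];              /* prime the pipeline  */
        for (int i = 0; i < n - 1; i++) {
            vector float next = in[i + 1];     /* load ahead          */
            out[i] = spu_mul(cur, cur);        /* overlaps the load   */
            cur = next;
        }
        out[n - 1] = spu_mul(cur, cur);        /* drain last element  */
    }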

4.3.3 Loop Unrolling

Loop unrolling is a technique for optimizing programs by reducing the number of overhead instructions that the computer has to execute in a loop, thus improving the cache hit rate and reducing branching. To achieve this, the instructions called in multiple iterations of the loop are combined into a single iteration. This speeds up the program if the loop's overhead instructions impair performance significantly. The major side effects of loop unrolling are: a) the increased register usage in a single iteration to store temporary variables, which may affect


Figure 4.10. Eliminate data dependency by interleaved loading

performance; and b) the code size expansion after unrolling, which is undesirable for embedded applications. A simple example of loop unrolling is shown below.

    /* original loop */
    for (int x = 0; x < 100; x++) {
        sum(x);
    }

    /* loop unrolled by a factor of five */
    for (int x = 0; x < 100; x += 5) {
        sum(x);
        sum(x+1);
        sum(x+2);
        sum(x+3);
        sum(x+4);
    }

4.3.4 Double-Buffering

SPE programs use DMA transfers to move data and instructions between main storage and the local memory of the SPE. Consider an SPE program that requires a large amount of data from main storage. The following is a simple scheme to achieve the data transfer:

1. Start a DMA data transfer from main storage to buffer B in the local memory.

2. Wait for the transfer to complete.

3. Use the data in buffer B.


4. Repeat.

This method wastes a great deal of time waiting for DMA transfers to complete. We can speed up the process significantly by allocating two buffers, B0 and B1, and overlapping computation on one buffer with data transfer in the other. This technique is called double-buffering; Figure 4.11 shows a flow diagram of the scheme. Double-buffering is a form of multi-buffering, the method of using multiple buffers in a circular queue to overlap processing and data transfer.

Figure 4.11. DMA transfer using a double-buffering method
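A minimal double-buffering sketch on the SPU, assuming a hypothetical compute kernel process() and a chunk size CHUNK chosen to fit the local store; the MFC calls are the standard ones from spu_mfcio.h:

    #include <spu_mfcio.h>

    #define CHUNK 4096   /* bytes per DMA transfer (illustrative size) */

    extern void process(char *buf);   /* hypothetical compute kernel */

    /* Two buffers, each tagged with its own DMA tag (0 and 1): fetch
       chunk i+1 into one buffer while computing on chunk i in the
       other. ea is the effective address of the data in main storage. */
    static char buf[2][CHUNK] __attribute__((aligned(128)));

    void stream(unsigned long long ea, int nchunks)
    {
        int cur = 0;
        mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);       /* prime buffer 0 */
        for (int i = 0; i < nchunks; i++) {
            int nxt = cur ^ 1;
            if (i + 1 < nchunks)                        /* prefetch next */
                mfc_get(buf[nxt], ea + (unsigned long long)(i + 1) * CHUNK,
                        CHUNK, nxt, 0, 0);
            mfc_write_tag_mask(1 << cur);               /* wait for cur  */
            mfc_read_tag_status_all();
            process(buf[cur]);                          /* compute       */
            cur = nxt;
        }
    }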

4.3.5 Reducing the Impact of Branches

The SPU hardware assumes linear instruction flow, with no stall penalties for sequential instruction execution. A branch instruction can disrupt this assumed sequential flow. Correctly predicted branches execute in one cycle, but a mispredicted (conditional) branch incurs a penalty of 18 cycles. Considering the typical SPU instruction latency of two to seven cycles, mispredicted branches can seriously degrade program performance. Branches also create scheduling barriers. The most effective means of reducing the impact of branches is to eliminate them using three primary methods: function inlining, loop unrolling, and predication. The next most effective means is to use branch-hint instructions. If a branch hint is provided, software speculates that the instruction branches to the target path. If a hint is not provided, software speculates that the branch is not taken (that is, instruction execution continues sequentially). If either speculation is incorrect, there is a large penalty (flush and refetch).
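As a minimal sketch of predication, the branchy scalar form c = (a > b) ? a : b can be rewritten branch-free with a compare mask and a select, here as a per-element maximum of two vectors:

    #include <spu_intrinsics.h>

    vector float vmax(vector float a, vector float b)
    {
        vector unsigned int gt = spu_cmpgt(a, b);  /* per-element mask   */
        return spu_sel(b, a, gt);                  /* pick a where a > b */
    }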


4.3.6 Data Alignment

The SPU supports loads and stores only at quadword (16-byte) alignment: the SPU loads or stores from the address obtained by masking off the 4 least significant bits. This works very well when data accesses are exactly quadword aligned, but that is not always the case during a computation. The problem can be solved with the shuffle/permutation instruction spu_shuffle: if the loaded data does not start at a 16-byte-aligned address, the desired bytes must be extracted from the two quadwords that straddle it. Table 4.1 illustrates an unaligned load in the SPU via table lookup.

    shift = lookup_t[(unsigned int)(pix) & 15];
    qw0   = *((vector unsigned char *)(pix));
    qw1   = *((vector unsigned char *)(pix+16));
    prefv = spu_shuffle( qw0, qw1, shift );

Table 4.1. Example for unaligned load in SPU


Chapter 5

Design Consideration

Adaptive sensor array systems usually have real-time constraints combined with overwhelming computational complexity. The Cell Broadband Engine (CBE) architecture offers massive SIMD compute power on multiple computational units interconnected via a high-bandwidth internal fabric. This thesis explores the mapping of a computationally intensive adaptive nulling problem onto the CBE architecture. To benchmark the capabilities of the CBE, the whole STAP processing flow is implemented as shown in Figure 5.1.

Figure 5.1. STAP function flow

5.1 Real-Time Performance Issues

Since the STAP algorithm is applied to null interference in airborne radar systems, the system needs to process data within real-time constraints. The STAP benchmark in this thesis is based on the RT_STAP benchmark developed by the MITRE corporation [6].

The benchmark represents a processing mode with L spatial channels sampled at 5 MHz and performs post-Doppler adaptive nulling of order Q, along with pulse compression and Doppler filtering operations (L = 22, Q = 3). The whole processing flow needs to finish within an interval of 32.5 milliseconds.

5.2 Finite-Length Precision

Finite-length errors degrade the performance of the radar system and hence need to be carefully addressed. Based on experience, 20-bit floating-point precision is required to obtain sufficient resolution. Fortunately, the Cell SPE processes four-way 32-bit floating-point data in parallel, which means the precision is more than enough for the baseband processing of STAP.

5.3 System Partition

5.3.1 Design Consideration

In general, parallelization involves two kinds of partitioning, namely functional partitioning and data partitioning. With functional partitioning, the application is separated into different tasks, where each task can be executed in parallel on a multiprocessor system. Data partitioning, on the other hand, divides the data into several partitions, and each partition is processed on a separate processor in the multiprocessor environment. Considering the huge computational complexity and memory access requirements of the processing, the data partitioning approach has been adopted in our implementation, based on the following observations:

• With functional partitioning, each task is inherently dependent on the data generated by the previous task in the STAP system flow. This creates a short data lifetime in each SPE and a large amount of data transfer between tasks on different SPEs. With data partitioning, task-to-task communication remains on the same SPE, which inherently reduces the number of DMA transfers required.

• With functional partitioning, the differing computational complexity of the tasks causes an unbalanced load among the SPEs, making it difficult to schedule tasks over the processors. With data partitioning, load balancing among SPEs is easily achieved by dividing the data into a large number of partitions, so that each SPE can fetch a new data partition as soon as it has finished the previous one.

• Data partitioning provides scalability, so that new functions can easily be included in future work. For example, I/Q conversion and low-pass filtering may be added without major rewriting of the code. In addition, more SPEs may be available on a single chip in future releases and can simply be used to process a larger number of data partitions.


Figure 5.2. System partition of STAP on CBE

Because the data of a STAP system is a multidimensional data cube and the processing dimension changes after each stage, the data rotation/permutation between processing stages becomes one of the crucial problems in our implementation. It is desirable to decide in which dimension the data should reside, and how to partition it over the eight SPEs of the CBE, before starting the implementation of each functional computation. Without proper data management, the stall cycles caused by memory accesses become significant, possibly exceeding the computation time. We first partitioned the implemented system as discussed in [7]; a later chapter proposes a new data dimension and system partitioning method to decrease the memory latency in the system.

There are some constraints in the CBE architecture. One is the size limit of the local store, where 256 kB holds both program and data. Since the data is divided such that each SPE's share occupies 675 kB, the size constraint of the local memory is also a factor to consider for scheduling: only a part of the data can be resident in the local memory during program execution. Another bottleneck of SPE performance is the overhead of DMA data transfers, which causes the SPE to stall while waiting for the DMA data. To increase performance, two buffers may be allocated, overlapping computation on one data buffer with DMA transfer into the other; this is the double-buffering method described earlier.

5.3.2 Programming Flow

The CBE has one PPE and eight SPEs, where the SPEs are optimized for running compute-intensive applications. Therefore, a PPE-centric model is employed, in which the PPE allocates threads, generates control signals, and manages resources among the


SPEs. Each data partition is sent to an individual SPE and processed concurrently, as shown in Figure 5.2. To synchronize the processing among the eight SPEs, the PPE records the received flags and waits until completion messages from all SPEs have been obtained. Figure 5.3 and Figure 5.4 illustrate the top loop of the STAP implementation on the CBE.

For the PPE:

Parse parameter: System initialization. The coefficients of preprocessing and Doppler filtering, as well as the steering vectors of weight compensation, are calculated in this step.

Read CPI: Read the input data cube.

Preprocessing: Perform preprocessing such as array calibration and pulse compression.

Synchronize: Synchronization among all SPEs at the completion of a processing stage. An SPE that finishes early stalls until all the other SPEs have finished their work.

Doppler: Perform Doppler filtering to form the frequency bins for the Doppler-factored STAP algorithms.

QRD: Perform QR decomposition and generate the desired upper triangular matrix R.

FS: Apply forward substitution on the lower triangular matrix.

BS: Apply backward substitution on the upper triangular matrix. The final target vector is then formed.

For the SPEs:

Kernel Compute: The kernel computation of each processing stage.

Permute: Transpose data, exchanging the first and second dimensions of the data cube.

DMA Transfer: Double-buffering is implemented by overlapping computation on one data buffer with DMA transfer into the other.


Figure 5.3. Program flow of STAP (right part shows the flow in SPE for preprocessing)


Figure 5.4. Program flow of STAP (right part shows the flow in SPE for QR decomposition)

Chapter 6

Kernel Subroutines

The implementation of each computation subroutine, such as preprocessing, Doppler processing, QR decomposition, forward substitution, and backward substitution, is discussed in this chapter.

6.1 Implementation

6.1.1 Preprocessing

The I/Q conversion, array calibration, and pulse compression must be performed independently across the L channels before applying the space-time algorithms, as illustrated in Figure 3.1. In our implementation, I/Q conversion is not included. The array calibration and pulse compression stages, each of which can be realized individually as an FIR filter, are combined and applied using the overlap-save method. In our implementation, the sequence of 480 range samples for one pulse is processed at a time during preprocessing. The data sequence is preceded by 64 leading zeros and then divided into three segments, where each segment has 256 values and overlaps the previous segment by 64 values; zero padding is applied to the last segment. Each segment is processed by a 256-point FFT, a 256-point complex multiplication with the filter, and a 256-point inverse FFT. The final result is obtained by discarding samples 1 through 64 of each segment.
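A minimal sketch of this overlap-save flow in plain C, assuming hypothetical in-place helpers fft256()/ifft256() and a precomputed 256-bin frequency-domain filter H; the actual SPE version instead uses the SDK FFT routine and SIMD multiplication:

    #include <complex.h>
    #include <string.h>

    #define SEG 256
    #define OVL 64
    #define NSAMP 480

    extern void fft256(float complex *x);   /* assumed in-place transforms */
    extern void ifft256(float complex *x);

    void pulse_compress(const float complex *in, const float complex *H,
                        float complex *out)
    {
        float complex padded[OVL + NSAMP + SEG] = {0};  /* leading zeros + tail pad */
        float complex seg[SEG];
        memcpy(padded + OVL, in, NSAMP * sizeof *in);

        for (int s = 0; s < 3; s++) {                   /* three overlapping segments */
            memcpy(seg, padded + s * (SEG - OVL), SEG * sizeof *seg);
            fft256(seg);
            for (int k = 0; k < SEG; k++)
                seg[k] *= H[k];                         /* frequency-domain filtering */
            ifft256(seg);
            int left = NSAMP - s * (SEG - OVL);         /* discard first OVL samples  */
            int n = left < SEG - OVL ? left : SEG - OVL;
            memcpy(out + s * (SEG - OVL), seg + OVL, n * sizeof *seg);
        }
    }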


Fast Fourier Transform

The Cooley-Tukey algorithm is the most common Fast Fourier Transform (FFT) algorithm. It re-expresses the Discrete Fourier Transform (DFT) of an arbitrary composite size $N = N_1 N_2$ in terms of smaller DFTs of sizes $N_1$ and $N_2$. The algorithm is applied recursively in order to reduce the computation time to $O(N \log N)$. A radix-2 Decimation-In-Time (DIT) FFT is the simplest and most common form of the Cooley-Tukey algorithm. Radix-2 DIT divides a DFT of size $N$ into two interleaved DFTs of size $N/2$ at each recursive stage.

Figure 6.1. DIT of a length-N DFT into two length-N /2 DFTs followed by a combining stage.

Radix-2 DIT first computes the Fourier transforms of the even-indexed samples $e_m = x_{2m}$ ($x_0, x_2, \ldots, x_{N-2}$) and of the odd-indexed samples $o_m = x_{2m+1}$ ($x_1, x_3, \ldots, x_{N-1}$), and then combines those two results to produce the Fourier transform of the whole sequence:

$$
\begin{aligned}
X(k) &= \sum_{n=0}^{N-1} x(n)\, e^{-i\frac{2\pi nk}{N}} \\
     &= \sum_{n=0}^{N/2-1} x(2n)\, e^{-i\frac{2\pi(2n)k}{N}}
      + \sum_{n=0}^{N/2-1} x(2n+1)\, e^{-i\frac{2\pi(2n+1)k}{N}} \\
     &= \sum_{n=0}^{N/2-1} x(2n)\, e^{-i\frac{2\pi nk}{N/2}}
      + e^{-i\frac{2\pi k}{N}} \sum_{n=0}^{N/2-1} x(2n+1)\, e^{-i\frac{2\pi nk}{N/2}} \\
     &= \mathrm{DFT}_{N/2}[x(0), x(2), \ldots, x(N-2)]
      + W_N^k\, \mathrm{DFT}_{N/2}[x(1), x(3), \ldots, x(N-1)]
\end{aligned}
$$

The full radix-2 DIT decomposition illustrated in Figure 6.3 involves $\log_2 N$ stages, each with $N/2$ butterflies. Each butterfly requires 1 complex multiplication and 2 complex additions. Thus, there are in total $\frac{N}{2}\log_2 N$ complex multiplications and $N \log_2 N$ complex additions.

Figure 6.2. The basic operation of FFT: butterfly

Figure 6.3. Radix-2 DIT FFT algorithm for a length-8 signal

The FFT is implemented using the library routine _fft_1d_r2 included in the Cell SDK 2.0, which performs a complex FFT with radix-2 DIT, following the Cooley-Tukey algorithm with $\log_2 N$ stages. For each pulse, an FFT, a complex multiplication, and an inverse FFT are performed in the preprocessing stage. A transpose of the data follows, since a data rotation is required for the next processing stage. The computational flow is illustrated in Figure 6.4, and the execution parallelism is shown in Figure 6.5. In both figures, 'multiply' represents the 256-point complex multiplication, and 'shuffle' re-arranges the real and imaginary parts of the data, which reside together for the FFT and inverse FFT but are stored separately during the multiplication. 'Transpose' marks the transposition of the matrix between the pulse and range dimensions. As mentioned in Section 4.3.2, most arithmetic instructions execute in pipeline 0 and permutation instructions in pipeline 1, so data movement can be executed concurrently with arithmetic computations. However, only a limited window of the program is considered for parallel issue, and instructions across a branch cannot be issued in parallel; therefore, in Figure 6.5, 'Transpose' cannot be parallelized with the FFT or inverse FFT computation. In addition, the DMA engine works in parallel with the computation units in the SPE and can be regarded as another data path. Making instructions dual-issuable, meaning that two instructions are issued and executed in parallel in both pipelines, and applying double-buffering, as illustrated in Section 4.3.4, are the main techniques used to optimize performance.


Figure 6.4. Program flow of preprocessing


Figure 6.5. Parallelism of preprocessing(DMA transfer, pipeline 0, and pipeline 1 are executed in parallel)

6.1.2 Doppler Processing

After preprocessing, Doppler filtering is required to form the Doppler frequency bins for the STAP algorithms. Doppler filtering is also implemented with the FFT presented in Section 6.1.1; however, the FFT size is only 64 for Doppler processing.

6.1.3 QR Decomposition

Since the weight computation is the most significant part of the whole signal processing chain (about 80% of the total computing power needed), its implementation dominates the overall performance. The weight vector is given by $w_{opt} = \mu \Psi_u^{-1} v_t$ in Equation (2.8), where $\Psi_u$ is the sample-average estimation matrix; the inversion of $\Psi_u$ is commonly performed via QR decomposition, as shown in Equations (3.7) and (3.8). QR decomposition factors a matrix X as X = QR, where Q is an orthogonal matrix (meaning that $Q^T Q = I$) and R is an upper triangular matrix. There are several methods for computing the QR decomposition, such as Modified Gram-Schmidt, Householder transformations, and Givens rotations [2]. To choose the most efficient method for the CBE architecture, the following subsections estimate the computation cost of each method. Since the radar data is complex-valued but the CBE processor operates on real values, the practical computation cost must be converted from complex operations to Real Floating-Point Operations (RFLOPs). It is assumed that real multiplications, divisions, additions, and subtractions all require the same amount of time, and each is counted as a single RFLOP. The conversion is presented in Table 6.1.


Operator                      Number of Real Floating-Point Operations
complex-complex multiply      4 real multiplies + 2 real adds = 6 RFLOP
complex-complex add/subtract  2 real adds/subtracts = 2 RFLOP
magnitude squared             2 real multiplies + 1 real add = 3 RFLOP
real-complex multiply         2 real multiplies = 2 RFLOP
complex divided by real       1 real inverse + 2 real multiplies = 3 RFLOP
complex inverse               1 magnitude-squared + 1 complex-divided-by-real = 6 RFLOP
complex-complex divide        1 complex inverse + 1 complex multiply = 12 RFLOP

Table 6.1. Conversion from complex operations to RFLOPs

Modified Gram-Schmidt

The Modified Gram-Schmidt (MGS) process is a method for obtaining an orthonormal vector set in an inner product space. To compute the ith vector $q_i$ of the matrix Q, the projection of $a_i$ onto the subspace generated by $q_1, \ldots, q_{i-1}$ is subtracted from $a_i$; the resulting vector is then guaranteed to be orthogonal to all the vectors in that subspace. The QR factors of A can be found using MGS as follows. Assume the columns of the $m \times n$ matrix $A = [a_1 \;\ldots\; a_n]$ and of the matrix $Q = [q_1 \;\ldots\; q_n]$, where each entry of R is $r_{ij}$.

• For the first step, set $a_1^{(1)} = a_1$ and normalize to find $q_1 = a_1^{(1)} / \|a_1^{(1)}\|$. Set the first diagonal entry of R to $r_{11} = \|a_1^{(1)}\|$.

• Now assume $q_1, \ldots, q_{i-1}$ have already been computed. First calculate the projection of $a_i^{(1)}$ on $q_1$ as $q_1^T a_i^{(1)}$; the projection is then subtracted from $a_i^{(1)}$ to give $a_i^{(2)}$, and so on:

$$a_i^{(1)} = a_i$$
$$a_i^{(2)} = a_i^{(1)} - \big(q_1^T a_i^{(1)}\big)\, q_1$$
$$a_i^{(3)} = a_i^{(2)} - \big(q_2^T a_i^{(2)}\big)\, q_2$$
$$\vdots$$
$$a_i' = a_i^{(i)} = a_i^{(i-1)} - \big(q_{i-1}^T a_i^{(i-1)}\big)\, q_{i-1}$$

We normalize the final result $a_i'$ to obtain $q_i = a_i' / \|a_i'\|$. The ith column of the matrix R is thus

$$\left[\; q_1^T a_i^{(1)} \;\; q_2^T a_i^{(2)} \;\; \ldots \;\; q_{i-1}^T a_i^{(i-1)} \;\; \|a_i'\| \;\; 0 \;\ldots\; 0 \;\right]^T.$$

The pseudo code of the MGS method, with operation counts per line, is presented below.

    for i = 1 to n
        v_i = a_i
    for i = 1 to n
        r_ii = ||v_i||                (m multiplications, m − 1 additions)
        q_i = v_i / r_ii              (m divisions)
        for j = i + 1 to n
            r_ij = q_i^T v_j          (m multiplications, m − 1 additions)
            v_j = v_j − r_ij q_i      (m multiplications, m subtractions)

$$\text{TotalCost} = n\big(3m + (m-1) + 1 + 2m\big) + \frac{n^2 - n}{2}\big(6m + 2(m-1) + 6m + 2m\big) = 8mn^2 - 2mn - n^2 + n$$

In our case, the matrix has 240 rows and 66 columns; performing the QR decomposition by the MGS method requires in total 8,327,550 RFLOPs.

Householder

The Householder matrix is defined as $H = I - 2vv^*$, where v is a column unit vector. By properly choosing the vector v, the Householder matrix will partially zero out an input vector. It can be used to compute a QR decomposition by bringing the matrix A to upper triangular form with a sequence of Householder transformations. In addition, when applying a Householder transformation $(I - 2uu^T)A$, the naive implementation costs $O(2n^3)$; computing it instead as $A - 2u(u^T A)$ needs a matrix-vector product ($2n^2$ instructions), an outer product ($2n$ instructions), and a subtraction ($2n^2$ instructions), for a total cost of only $O(4n^2)$. The QR factors of A can be found using Householder matrices as follows:

• Find a Householder matrix $H_1$ such that the product $H_1 A$ has zeros below the diagonal element in the first column. Finding $H_1$ is equivalent to finding v.

• Find a Householder matrix $H_2$ such that the product $H_2 H_1 A$ has zeros below the diagonal element in the second column.

• Continue in this way for the first $(n-1)$ columns. Then R and Q are defined by $R = H_{n-1} H_{n-2} \cdots H_1 A$ and $Q = H_1 H_2 \cdots H_{n-1}$.

The pseudo code of the Householder method, with operation counts per line, is presented below. The input matrix A is assumed to be $m \times n$.


    for i = 1 to n − 2
        x_(i:m) = A(i:m, i)
        c = e^(iθ) ||x_(i+1:m)||_2                ((m−i) magnitude squares, (m−i−1) real adds,
                                                   1 square root, 1 complex multiply)
        v_(i:m) = x_(i:m) − c·e_1                 (i complex adds)
        u_(i:m) = v_(i:m) / ||v_(i:m)||           ((m−i+1) magnitude squares, (m−i) real adds,
                                                   1 reciprocal square root, (m−i+1) real-complex multiplies)
        w_(i:n) = u_(i:m)^T A(i:m, i:n)           ((m−i+1)(n−i+1) complex multiplies,
                                                   (m−i)(n−i+1) complex adds)
        A(i:m, i:n) = A(i:m, i:n) − 2 u_(i:m) w_(i:n)   ((m−i+1)(n−i+1) complex multiplies,
                                                         (m−i+1)(n−i+1) complex subtracts)

$$\text{TotalCost} = \sum_{i=1}^{n-2} \Big\{ \big[3(m-i) + (m-i-1) + 1 + 6\big] + 2 + \big[3(m-i+1) + (m-i) + 1 + 2(m-i+1)\big] + \big[6(m-i+1)(n-i+1) + 2(m-i)(n-i+1)\big] + \big[6(m-i+1)(n-i+1) + 2(m-i)(n-i+1)\big] \Big\} = 16\left(m^2 n - m^2 + \tfrac{1}{3}n^3\right)$$

In our case, the matrix has 240 rows and 66 columns; performing the QR decomposition by the Householder method requires in total 61,437,312 RFLOPs.

Givens Rotation

The main use of Givens rotations is to introduce zeros in vectors or matrices, as shown in the equation

$$\begin{bmatrix} c & s \\ -s & c \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} r \\ 0 \end{bmatrix} \qquad (6.1)$$

where $c = \frac{a}{\sqrt{a^2+b^2}}$ and $s = \frac{b}{\sqrt{a^2+b^2}}$. QR decompositions can also be computed with a series of Givens rotations. Each rotation zeros an element in the subdiagonal of the matrix, forming the R matrix, and the concatenation of all the Givens rotations forms the orthogonal Q matrix. The pseudo code of the Givens rotation method, with operation counts per line, is presented below.

    for i = 1 to n
        for j = m down to i + 1
            t1 = a² + b²                          (two magnitude squares, one real add)
            t2 = 1/√t1                            (one reciprocal square root)
            c = a_(i,j) × t2                      (one real-complex multiply)
            s = a_(i,j+1) × t2                    (one real-complex multiply)
            a_(i,j) = t2 × t1                     (one real multiplication)
            a_(i,j+1) = 0
            for k = i + 1 to n
                t = a_(k,j)
                a_(k,j)   = c × t + s × a_(k,j+1)     (two complex multiplies, one complex add)
                a_(k,j+1) = −s × t + c × a_(k,j+1)    (two complex multiplies, one complex add)

Inner loop: $2 \times 6 + 2 + 2 \times 6 + 2 = 28$
Outer loop: $28(n-i) + 2 \times 3 + 1 + 1 + 2 + 2 + 1 = 28n - 28i + 13$

$$
\begin{aligned}
\text{TotalCost} &= \sum_{i=1}^{n}(m-i)(28n - 28i + 13) \\
&= \sum_{i=1}^{n}\left[28mn + 13m - (28m + 28n + 13)\,i + 28i^2\right] \\
&= (28mn + 13m)\,n - (28m + 28n + 13)\,\frac{(n+1)n}{2} + 28\,\frac{2n^3 + 3n^2 + n}{6} \\
&= 14mn^2 - \frac{14}{3}n^3 - mn - \frac{13}{2}n^2 - \frac{11}{6}n
\end{aligned}
$$

In our case, the matrix has 240 rows and 66 columns; performing the QR decomposition by the Givens rotation method requires in total 13,250,237 RFLOPs.

Because Modified Gram-Schmidt (MGS) requires the least computational power and is well suited to parallel processing in the SPE SIMD data path, it is employed in our implementation. In the MGS algorithm, the input vector $a_i$ is never accessed again once the orthogonal vector $q_i$ has been formed. Exploiting this property, $q_i$ is written back to $a_i$'s space in local memory, so the storage of the input matrix A holds the matrix Q after the whole QR decomposition. Since MGS operates on column vectors, we store the matrix by columns instead of rows.

Figure 6.6 illustrates the program flow of the MGS method. To generate one orthogonal vector $q_i$ for the QR decomposition, it starts by loading the vector $a_i$ of the input matrix and $q_1$ of the orthogonal matrix. 'Vector Projection', 'Scale Vector', and 'Subtract Vector' are computed to remove the projection of $a_i$ on $q_1$. The computation then moves on to the next orthogonal vector $q_2$ and iterates up to the orthogonal vector $q_{i-1}$. 'Vector Normalize' is required at the end to generate the orthonormal vector $q_i$.

Some further explanations of Figure 6.6 follow. 'Scale Vector' multiplies the obtained scalar with the vector. 'Subtract Vector' computes the


Figure 6.6. The computation of ith iteration in QR decomposition

subtraction between two vectors. 'Vector Projection' and 'Vector Normalize' are discussed below.

Vector Projection

Vector projection is one of the computations used to generate orthogonal vectors; it is also referred to as the inner product. The inner product is by definition

$$\langle a, b \rangle = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n \qquad (6.2)$$

where a and b are vectors, and $a_i$, $b_i$ are their elements. This operation is easily performed by iterative multiply-accumulation with the instruction spu_madd over the whole vector.
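A minimal sketch of such a multiply-accumulate inner product, assuming real-valued data packed four floats per vector and a length that is a multiple of four:

    #include <spu_intrinsics.h>

    float inner_product(vector float *a, vector float *b, int n4 /* = n/4 */)
    {
        vector float acc = spu_splats(0.0f);
        for (int i = 0; i < n4; i++)
            acc = spu_madd(a[i], b[i], acc);   /* acc += a[i] * b[i] */
        /* horizontal sum of the four partial results */
        return spu_extract(acc, 0) + spu_extract(acc, 1)
             + spu_extract(acc, 2) + spu_extract(acc, 3);
    }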

Vector Normalization

In the MGS method, the orthogonal vector is normalized before being stored back to local memory. The normalization is by definition

$$q_i = \frac{a_i}{\|a_i\|} \qquad (6.3)$$

where $a_i$ is the orthogonal vector and $q_i$ is the orthonormal vector after normalization. It is calculated by accumulating the squared magnitudes, taking the reciprocal square root, and then multiplying the vector by the scalar:

$$q_i = a_i \times \frac{1}{\sqrt{a_{i1}^2 + \cdots + a_{im}^2}} \qquad (6.4)$$

Since the same arithmetic operation is applied to each individual element of the vectors, parallelism can be inherently achieved through SPU SIMD instructions.


The flow of the normalization computation with SPU intrinsics, the assembly-like function calls written in C, is illustrated in Figure 6.7. The operation uses the spu_madd instruction to accumulate the squared magnitudes of four elements of the vector concurrently, repeated over the subsequent elements of the vector. A cascade of $\log_2 N$ element-wise rotations and additions, using the instructions spu_rl and spu_add, then sums up the N partial results in the accumulator. After the accumulation, the instruction spu_rsqrte takes the reciprocal square root. Finally, the instructions spu_mul and spu_madd, which multiply a scalar value with all elements of a vector, perform the complex multiplication.

Approximation of reciprocal operations

The reciprocal operations such as spu_re and spu_rsqrte used for normalizing vectors generate only estimated results. To improve the precision, a common mathematical method, the Newton-Raphson method, can be used to efficiently refine the approximation, as illustrated in Figure 6.8. The idea of the method is as follows: the algorithm starts with an initial guess that is reasonably close to the true zero, replaces the function by its tangent, and computes the zero of this tangent. This zero of the tangent is typically a better approximation of the function's zero, and the process is repeated iteratively. The idea can be written as the equation

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} \qquad (6.5)$$

Consider the problem of computing $\frac{1}{\sqrt{a}}$. We can rephrase it as finding the zero of $f(x) = \frac{1}{x^2} - a$, with $f'(x) = -2x^{-3}$. The method then approximates the result by $x_{n+1} = x_n + 0.5x_n(1 - ax_n^2)$; each such iteration roughly doubles the number of accurate bits of the estimate. The iteration can be expressed in a programming language as follows (since the reciprocal is itself an estimated value, the computation must not involve division): $x_{n+1} = x_n + 0.5x_n(1 - ax_n^2) = (3 - ax_n^2)\,x_n/2$. Table 6.2 illustrates an example of refining the reciprocal square root. Assume that the input is a and the initial estimated result is x.

//x_square = x2 //x_half = 0.5x //x_tmp = 3 − ax2 //x_new = 0.5x(3 − ax2 )

Table 6.2. Example for approaching reciprocal square root

Loop iterations of vector operations, such as the inner product and normalization, are the most intensively used kernels in the QR decomposition. Each iteration loads input


Figure 6.7. The computation of vector normalization in MGS


Figure 6.8. Newton-Raphson's method

data, performs the vector operations, and finally stores the results, and these steps are inherently dependent. However, as indicated in Section 4.1.2, load, store, quadword rotate, and shuffle operations execute in pipeline 1, while most arithmetic operations execute in pipeline 0. The loop can therefore be software-pipelined, as illustrated in Section 4.3.2, by computing vector operations at the same time as loading and storing data, which improves the dual-issue rate and reduces dependency stalls. Furthermore, the dependency stalls can be reduced and the dual-issue rate improved even more with loop unrolling, as described in Section 4.3.3, taking advantage of the large register file of the SPU. As there are no inter-loop dependencies between the vector operation loops, this optimization is easily applied.

6.1.4 Forward and Backward Substitution

In STAP systems, forward substitution and backward substitution follow the QR decomposition to calculate the adaptive weights. The weight vector can be computed by first solving for $\vec{p}$ in the expression

$$R_1^T \vec{p} = N_R \vec{s} \qquad (6.6)$$

using forward substitution, where $R_1^T$ is a lower triangular matrix and $\vec{s}$ is the steering vector of the adaptive processing. The weight vector is then calculated by solving the expression

$$R_1 \vec{w}(k, m) = \vec{p} \qquad (6.7)$$

using backward substitution, where $R_1$ is an upper triangular matrix. The final weight vector $\vec{w}(k, m)$ is formed at the completion of this processing. Since forward substitution is almost the same as backward substitution, only the implementation details of forward substitution are presented in this report. Consider solving the linear equation $A\vec{x} = \vec{y}$:




$$
\begin{bmatrix}
a_{11} & 0      & \cdots & 0 \\
a_{21} & a_{22} & \cdots & 0 \\
\vdots & \vdots & \ddots & 0 \\
a_{n1} & a_{n2} & \cdots & a_{nn}
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}
=
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
\qquad (6.8)
$$

where A is a lower triangular matrix, so forward substitution applies. The solution of the first equation is $x_1 = y_1/a_{11}$. This result can then be substituted into the second equation to calculate the value of $x_2$, and repetition of this substitution process forms the complete result vector. Backward substitution can be expressed as

$$x_i = \frac{y_i - \sum_{j=i+1}^{n} a_{ij}\, x_j}{a_{ii}}, \qquad i = 1, 2, \ldots, n. \qquad (6.9)$$

Figure 6.9 illustrates the pseudo code of forward substitution.

    for k = 1 to N do
        for i = 1 to k − 1 do
            y_k = y_k − a_(k,i) × x_i
        end for
        x_k = y_k / a_(k,k)
    end for

Figure 6.9. Pseudo code of forward substitution

Before all the computations, the first step is to transform the triangular matrix into a unit triangular matrix, in which all diagonal entries are one. A division is required to calculate each x value, as shown in Equation (6.9), but the division disappears if the diagonal value of the matrix is one. Since division takes many cycles and has data dependencies with the following operations, an efficient way to decrease the stalled cycles is thus to make the matrix unit-diagonal, eliminating the divisions from the forward substitution itself. Making the matrix unit-diagonal is done by dividing all the elements of the matrix by their diagonal values.

To fully utilize the 128-bit SIMD data path of the SPU, where four 32-bit floating-point samples are processed concurrently, the problem is solved by dividing the matrix into 4x4 sub-matrices. The process, shown in Figure 6.10, iteratively computes a 4x4 sub-matrix forward substitution and applies the returned values to the first four columns of all remaining rows, while the target matrix shrinks by four rows and four columns after each iteration. The computation of the 4x4 unit lower-triangular forward substitution consists of four steps, as shown in Figure 6.11. To exploit the parallelism, the com-


Figure 6.10. Programming flow of forward substitution

putation is applied column-wise. A 4x4 matrix transpose is required to rearrange the matrix so that each row of data can be processed with identical operations. The core of the forward substitution progresses through four steps of multiply-subtract using the instructions spu_nmsub and spu_shuffle. The performance of this computation is limited by stall cycles due to the data dependencies between the equations.

Applying the returned x values to the first four columns of all the rows is illustrated in Figure 6.12. Each row can be multiplied by x concurrently without any data dependency, using the spu_mul instruction. The four products for one row, stored in one SPU SIMD register, must then be accumulated and subtracted from the vector y. Usually the instruction spu_shuffle would be applied twice to permute the elements within a register to enable the accumulation; to exploit parallelism instead, four registers, each holding four values, are transposed so that values of the same column end up in one register, and the instruction spu_add then sums up the products in parallel. Finally, the accumulated sum is subtracted from the corresponding value of the vector y, and the result is written back to y. The loop is repeated over all subsequent rows. The input vector y holds the desired vector x after the whole forward substitution computation.
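For reference, a minimal scalar sketch (not the SIMD kernel) of forward substitution on a unit lower-triangular matrix, showing how the unit diagonal removes the long-latency dependent division:

    /* Assumes a has been pre-scaled to a unit diagonal (a[k][k] == 1),
       with y scaled accordingly; x receives the solution. */
    void forward_subst_unit(int n, float a[n][n], float *y, float *x)
    {
        for (int k = 0; k < n; k++) {
            float s = y[k];
            for (int i = 0; i < k; i++)
                s -= a[k][i] * x[i];
            x[k] = s;               /* no division: diagonal is one */
        }
    }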

6.2 Benchmark Results

The benchmarking of the STAP processing is based on the IBM Full-System Simulator included in Cell SDK 2.0. The simulator allows cycle-accurate simulation of both the SPE and the memory subsystem. In Table 6.3, 'Single Cycle' is the number of cycles with single issue in the SPE and 'Dual Cycle' is the number of cycles with dual issue. In other words, 'Dual Cycle' means that two instructions are fetched and start to execute simultaneously in one cycle, while 'Single Cycle' means that only one instruction is executed. 'Branch Miss' is the cycle overhead caused by branch misses and 'Depend. Stall' is the cycle overhead caused by dependency stalls. 'Inst. Count' is the total number of instructions issued. 'Others' is the number of stall cycles due to prefetch misses and waiting for hint targets. 'Perf. CPI' is the average number of Cycles-Per-Instruction (CPI), which measures the efficiency of the program. Since the SPE is a dual-issue machine, the lowest (best) CPI in theory is 0.5, which means that the computation and the data permutation instructions are always issued in parallel without any stalls caused by dependencies.

The implementations of all kernel subroutines of the STAP processing flow were presented in the previous section of this chapter. The design is benchmarked with the IBM Full-System Simulator, with the results depicted in Table 6.3. In this table, '256FFT' is the computation of a 256-point FFT, 'mul' is a 256-point complex multiplication, '64FFT' is a 64-point FFT, 'QR' is QR decomposition, 'FS' is forward substitution, and 'BS' is backward substitution. '256FFT' and 'mul' belong to the preprocessing computation, while '64FFT' is used for Doppler processing.

Kernel   Single Cycle   Dual Cycle   Branch Miss   Depend. Stall   Others   Total Cycle   Inst. Count   Perf. CPI
256FFT   2134           1512         413           112             31       4202          5536          0.76
mul      357            452          17            32              0        858           1263          0.68
64FFT    463            308          87            17              12       887           1107          0.80
QR       1356106        1131598      79677         223903          4488     2795772       3761004       0.74
FS       5783           1573         832           1561            29       9778          9680          1.01
BS       5479           1259         800           1583            151      9272          8800          1.05

Table 6.3. Performance measurement of kernel subroutines (pure computation without data movement)

Figure 6.13. Simulation result of QR decomposition


Taking QR decomposition as an example, the simulation result from Table 6.3 is visualized in Figure 6.13: 48% of the execution time is 'Single Cycle' and 40% is 'Dual Cycle'. The main stall cycles are 2.8% due to branch misses and 8% due to data dependencies. The result shows that 3,761,004 instructions are performed in 2,795,772 cycles, i.e., every instruction takes 0.74 cycles on average. The table also shows that the CPI of FS and BS is the highest among the kernel subroutines. The reason is that both have higher data dependency than the other subroutines: the computation of $x_i$ requires all the previous values $x_1, x_2, \ldots, x_{i-1}$. This dependency causes stall cycles, and the triangular structure reduces the utilization of SIMDization. Therefore, FS and BS cannot be optimized as efficiently as the other processing stages.


Chapter 7

Original STAP Flow

All the functional subroutines were implemented in Chapter 6. In this chapter we discuss the integration of the STAP system. In our implementation, we first realize the data flow of the STAP system as described in [7] and illustrated in Figure 7.1; the optimized data flow is then presented in the next chapter to improve performance.

Figure 7.1. Original data flow and functional description of STAP processing stages

7.1 Implementation

7.1.1 Multidimensional Data Cube Rotation

In the STAP algorithm, the multi-dimensional data cube corresponds to L channels, P pulse repetition intervals (PRI), and N time samples per PRI. Each stage requires a realignment of the data cube to position the data samples in memory for the next stage of processing. For example, the preprocessing stage operates on the sequential time samples in each pulse, while the next stage, Doppler filtering, operates along the pulse dimension. The rotation appears twice as "permutation" in Figure 7.1, and both appear at the completion of a processing stage. The problem of rotating the multi-dimensional data cube therefore has to be solved before the different processing stages can be integrated.

Though the data cube is regarded as multi-dimensional, memory is only a one-dimensional array. The dimensionality of the data is decided by the programmer by ordering the addresses in a certain way. For example, the element in the x-th row and y-th column of a matrix is stored at address y × rows + x in memory. Rotating the data dimensions therefore means changing the memory addresses of the stored data. In our implementation, data cube rotation is applied to re-arrange the data positions in memory in order to gather samples from scattered positions. The physical memory address of a data sample for each dimension ordering is listed in Table 7.1, where data is always stored in the order of row, column, and height.

Dimension order (x-y-z)           Address in memory
Range(R)-Channel(L)-Pulse(P)      P*480*22 + L*480 + R
Pulse(P)-Channel(L)-Range(R)      R*64*22 + L*64 + P
Range(R)-Pulse(P)-Channel(L)      L*480*64 + P*480 + R
Pulse(P)-Range(R)-Channel(L)      L*64*480 + R*64 + P

Table 7.1. The memory address of a data sample depends on the dimension ordering.

Rotation of the data cube is performed by transposing the matrix between two dimensions. In our implementation, most rotations have to be performed between the first and third dimension. As an example, consider rotating the data from range-channel-pulse to pulse-channel-range, which is a critical case because the data of the third dimension is distributed across the memory. The DMA-list command is applied in this case; it specifies a sequence of DMA transfers between a single area of local memory and possibly discontinuous areas in main storage. The rotation is implemented as shown in Figure 7.2: blocks of range samples are fetched from discontinuous space in main memory and gathered in the local memory. The data in local memory is regarded as a two-dimensional matrix with 'Range' as the row axis and 'Pulse' as the column axis. The matrix is transposed to exchange the 'Range' and 'Pulse' dimensions using the instruction spu_shuffle. In the end, R range-blocks of pulse samples are stored back to the scattered locations in main memory.
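To make the address mapping of Table 7.1 concrete, the sketch below computes the linear index of one complex sample for the range-channel-pulse ordering; the function name and the hard-coded cube dimensions (R = 480, L = 22, P = 64, taken from the table) are our own illustration:

/* Hypothetical helper: linear index of sample (r, l, p) for the
 * Range(R)-Channel(L)-Pulse(P) ordering of Table 7.1, i.e. the range
 * index varies fastest. R = 480 ranges, L = 22 channels, P = 64 pulses. */
static inline unsigned cube_index_rlp(unsigned r, unsigned l, unsigned p)
{
    return p * 480 * 22 + l * 480 + r;   /* P*480*22 + L*480 + R */
}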

7.1.2 System Integration for the Whole STAP Flow

After all the subroutines have been implemented and the problem of data cube rotation has been solved, we are ready to integrate the whole STAP system.


Figure 7.2. Example of data rotation between first and third dimension
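The gather step of this rotation relies on the MFC's DMA-list facility. A minimal sketch is given below; the buffer sizes, names, and address arithmetic are illustrative assumptions, not the exact code of our implementation:

#include <spu_mfcio.h>

/* Sketch of gathering scattered range blocks with one DMA list.
 * Each list element fetches one contiguous block from main storage;
 * the list supplies the low 32 bits of each effective address, while
 * the ea argument of mfc_getl supplies the high 32 bits. */
#define NBLOCKS 8
static volatile float ls_buf[NBLOCKS * 480 * 2] __attribute__((aligned(128)));
static mfc_list_element_t dma_list[NBLOCKS] __attribute__((aligned(8)));

static void gather_blocks(uint64_t ea_base, uint32_t stride, uint32_t tag)
{
    for (int i = 0; i < NBLOCKS; i++) {
        dma_list[i].notify = 0;
        dma_list[i].size   = 480 * 2 * sizeof(float);          /* one block of complex samples */
        dma_list[i].eal    = (uint32_t)(ea_base + i * stride); /* low 32 bits of the EA        */
    }
    mfc_getl(ls_buf, ea_base, dma_list, sizeof(dma_list), tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();   /* wait until the whole list has completed */
}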

The data flow and computation in each SPE is illustrated in Figure 7.3 and explained below.

First of all, the data of the preprocessing stage is aligned in the order range-channel-pulse and is partitioned over the 8 SPEs such that each SPE processes a unique set of 8 pulses for all channels. The working buffer in the local memory of an SPE loads all range samples of 8 pulses for one channel. Since the data of each pulse is located discontinuously and far from that of the other pulses in main memory, a DMA-list command is applied which contains 8 DMA transfer requests, each transferring 480 complex floating-point samples.

The main computation of the preprocessing stage, such as array calibration and pulse compression, is implemented with the overlap-save method as a fast convolution technique. In the overlap-save method, each swath of range samples is divided and packaged into three segments of 256 points. FFT, multiplication, and inverse FFT are then computed for each segment. This iterates until the samples of all pulses of one channel have been computed, which amounts to 24 iterations (8 pulses, each containing three segments). A transposition is required at the completion of preprocessing in order to realign the data to pulse-channel-range. Meanwhile, the computation can go on with the next channel, and it progresses until all 22 channels have been processed.

As shown in Figure 7.3, the data partitioning over the SPEs changes for the next processing stage. Therefore, each SPE has to wait for the preprocessed data from all the other SPEs before computing the next stage. In our implementation, a PPE-centric model is employed, which means that the synchronization is controlled by the PPE.
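The per-pulse overlap-save computation can be outlined as follows. This is a minimal sketch: fft_256, ifft_256, load_segment, and store_valid_part are hypothetical hooks standing in for the actual kernels, and complex samples are abstracted as a cfloat struct:

typedef struct { float re, im; } cfloat;

extern void load_segment(cfloat *seg, const cfloat *in, int s);    /* hypothetical: gather 256 pts with overlap */
extern void fft_256(cfloat *seg);                                  /* hypothetical FFT kernel                   */
extern void ifft_256(cfloat *seg);                                 /* hypothetical inverse FFT kernel           */
extern void store_valid_part(cfloat *out, const cfloat *seg, int s); /* hypothetical: drop wrapped samples      */

/* Overlap-save fast convolution of one pulse (480 range samples split
 * into three 256-point segments, as described above). */
static void overlap_save_pulse(const cfloat *range_in, cfloat *range_out,
                               const cfloat *filter_freq /* 256-pt frequency response */)
{
    cfloat seg[256];
    for (int s = 0; s < 3; s++) {
        load_segment(seg, range_in, s);
        fft_256(seg);
        for (int k = 0; k < 256; k++) {          /* point-wise complex product */
            float re = seg[k].re * filter_freq[k].re - seg[k].im * filter_freq[k].im;
            float im = seg[k].re * filter_freq[k].im + seg[k].im * filter_freq[k].re;
            seg[k].re = re; seg[k].im = im;
        }
        ifft_256(seg);
        store_valid_part(range_out, seg, s);
    }
}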


Figure 7.3. The data and computation flow of original STAP system on each SPE


When an SPE has finished its computation, it sends a notification signal to the PPE. Meanwhile, the PPE counts the number of finished SPEs through the received signals. Once the PPE has received the finished message from all SPEs, it broadcasts a new start signal. An SPE stays idle after finishing the previous processing stage until it receives the new start notification from the PPE.

For the second stage, Doppler filtering, we partition by range sample and process a P-pulse by (R/8)-range swath of samples for each channel in each SPE. Similar to the previous stage, the buffer loads all pulse samples of 64 ranges for one channel at the same time, where the samples of each range are scattered across main memory. A DMA-list command is used with 64 DMA transfer requests, each transferring 64 complex floating-point samples. Doppler filtering itself is implemented as an FFT that forms the Doppler frequency bins: the 64-point FFT is applied to each swath of pulse samples, i.e., the FFT is performed 64 times. A transposition again follows Doppler filtering, rearranging the data cube from Doppler-channel-range to range-channel-Doppler. In addition, the real and imaginary parts of the complex-valued data are split and stored separately during the transposition, to allow efficient computation in the following stages. While the transposition takes place, the computation can go on with the next channel, and Doppler filtering progresses until all 22 channels have been processed. Because the data partitioning changes again for the next processing stage, the SPEs need to be synchronized by the PPE as mentioned before; each SPE stalls until all the other SPEs have finished Doppler filtering.

Next, the QR workload is partitioned by Doppler bin among the eight SPEs: each SPE processes eight consecutive Doppler bins, and each bin is assigned two non-overlapping range intervals over which the training windows are defined. In our implementation, third-order Doppler-factored STAP is employed, which provides adaptive filtering across sensors and adjacent Doppler bins. Each QR decomposition involves three adjacent Doppler bins, and each Doppler bin contributes 240 range samples as its training window. In other words, the QR decomposition is computed for a matrix with 240 rows and 66 columns. To parallelize efficiently, the data was chosen to be stored range-by-channel in the previous design step. The buffer is again loaded with the DMA-list command. The workspace of the matrix to be QR-factored is divided into 4 buffers, and the QR decomposition can start as soon as the first buffer has been loaded. Buffer one has to hold its data during most of the QR decomposition, until the last vector has had its projections onto the earlier vectors subtracted. Once that is done, the space of buffer one can be freed and the DMA request for loading buffer one of the next matrix can be issued. Since the matrix Q is not involved in the weight computation, only the upper triangular matrix R needs to be stored back to main memory. The computation iterates over 16 matrices.

No synchronization is needed after QR decomposition, since forward and backward substitution only require the matrix R.
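The SPE side of this PPE-centric barrier can be sketched with the mailbox intrinsics of spu_mfcio.h. A minimal sketch, assuming the PPE counts the completion tokens and then writes a start token into each SPE's inbound mailbox; the message values are our own:

#include <spu_mfcio.h>

#define MSG_DONE  1u   /* illustrative message values */
#define MSG_START 2u

/* Report completion to the PPE, then block until the PPE broadcasts
 * the start token for the next processing stage. Both mailbox
 * intrinsics block when the mailbox is full/empty, respectively. */
static void barrier_with_ppe(void)
{
    spu_write_out_mbox(MSG_DONE);            /* notify PPE: this SPE is done */
    while (spu_read_in_mbox() != MSG_START)  /* wait for the new start signal */
        ;                                    /* ignore any stray tokens       */
}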


Forward substitution, which requires pre-computed steering vectors, is computed first, using the transpose of matrix R. Backward substitution then follows to generate the weight vectors, which are multiplied with the input matrix to produce the adaptive radar data.

During the computation of preprocessing and Doppler filtering, data is fetched along the first and third dimension instead of the first and second dimension. The reason is the transposition of the data from the first to the third dimension at the completion of each computation stage. Another advantage of computing many third-dimension samples at the same time is that the block size of the write-back DMA transfers is proportional to the number of third-dimension samples. However, this is a dilemma, since the overhead of the DMA transfers rises considerably due to the distribution of the data across main memory. The ideal DMA transfer fetches data blocks that are as large as possible, but unfortunately loading larger data blocks implies storing back smaller data blocks, because of the transposition in our situation. Taking these factors into account, the implementation processes all data of one channel at a time.

7.2 Benchmark Results

Table 7.2 shows the simulation result of the kernel computation of each processing stage. The required data is assumed to already be in local memory when the SPU fetches it; data movement and data rotation (transposition) are not included in this simulation. Based on the simulation result of each processing stage, we can calculate the theoretical computation cycle count of the whole STAP flow by multiplying the cycle count of each stage with its number of iterations and accumulating the products. As shown in the last row of Table 7.2, the resulting CPI is 0.75, which is an ideal result.

Stage   Single Cycle   Dual Cycle   Branch Miss   Depend. Stall   Total Cycle   Inst. Count   Perf. CPI   Loop/SPE
Pre     5154           4074         631           634             10513         13564         0.78        528
Dop     463            308          87            17              887           1107          0.80        1320
QR      1356106        1131598      79677         223903          2795772       3761004       0.74        16
FS      5783           1573         832           1561            9778          9680          1.01        16
BS      5479           1259         800           1583            9272          8800          1.05        16
All     -              -            -             -               51758856      69026956      0.75        -

Table 7.2. Theoretical performance of the STAP benchmark (pure computation without data movement)
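As a quick cross-check of the 'All' row, the total cycle count follows directly from the per-stage cycle counts and loop counts:

$$
528 \times 10513 + 1320 \times 887 + 16 \times (2795772 + 9778 + 9272) = 51\,758\,856
$$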

However, when the whole processing flow is considered, the cycles needed for data movement must be included to make the benchmarking result accurate. For example, as illustrated in Figure 8.5, data permutation is needed twice when the dimensions of the data in external memory are exchanged (e.g., between range and pulse), which takes a considerable number of cycles. Based on the task partitioning presented in [6], we implemented the whole STAP flow as the initial STAP system and benchmarked it on the cycle-accurate CELL simulator, with the results depicted in Table 7.3. There are two simulation results for each processing stage: the first row shows the simulation with the latency of the CELL memory subsystem disabled in the simulator, and the second row shows the simulation with the memory subsystem latency enabled. In other words, the rows show whether the overhead of the memory subsystem is included in the simulation or not.

Stage (mem. latency)   Single Cycle   Dual Cycle   Branch Miss   Depend. Stall   Channel Stall   Total Cycle   Inst. Count   Perf. CPI
Pre (off)              2905699        1638325      332807        595437          31495           5527095       6295523       0.88
Pre (on)               2619542        1723861      332713        173036          1209737         6081699       6197158       0.98
Dop (off)              695930         367687       44390         75028           54324           1238123       1462681       0.85
Dop (on)               695977         367692       44551         75073           2754343         3938397       1462740       2.69
QR (off)               25774140       16541200     1225463       5133476         5332            48953839      61219446      0.80
QR (on)                25774140       16541200     1225463       5133476         26981           48975488      61219446      0.80
FS (off)               242180         69430        28004         52681           6516            399490        382760        1.04
FS (on)                242181         69430        28004         52681           626136          1019112       382762        2.66
BS (off)               99304          21266        5454          27351           6513            162217        143787        1.13
BS (on)                99303          21266        5454          27351           169236          324938        143779        2.26
All (off)              26779649       16883459     1382845       5287441         45338011        98971171      63050327      1.57
All (on)               26754076       16895770     1380398       5294395         106151181       159778700     63041307      2.53

Table 7.3. Performance of the STAP benchmark with the original data flow (including the latency of the memory subsystem)

Since data movement is included in Table 7.3, the CPI of each kernel subroutine increases considerably even without the latency of the memory subsystem. When the cycle-accurate simulation of the memory subsystem is enabled, the CPI increases significantly for some subroutines (e.g., Doppler and FS/BS) while others stay unchanged (e.g., QR decomposition). The reason is that the number of cycles involved in the computation of the QRD is much larger than that of its data movement, making the channel stall trivial compared to the total number of cycles. In comparison, subroutines such as Doppler and FS/BS involve little computation but a lot of data movement, which brings heavy channel stalls during the DMA transfers. Due to the heavy channel stalls in some subroutines, the overall performance is severely degraded. As depicted in the table, the total CPI is as high as 2.53, which means that the efficiency of the initial STAP flow is only about one third of the estimation in [7].


Chapter 8

Optimized STAP Flow

As depicted in Table 7.3, the total CPI is 2.53, which is very inefficient: the theoretical CPI is 0.5, and an efficient CPI is expected to be around 0.7 to 1.2. Since this simulation result does not fulfill the real-time constraints of the STAP system, the performance needs to be further optimized.

8.1 Implementation

8.1.1 Problems of the Original STAP Flow

The DMA transfer engine of the SPE, called the MFC, operates independently of the computing element, called the SPU. The standard method to remove the waiting time for DMA transfers is to allocate multiple buffers and overlap the computation on one buffer with the data transfers in the other buffers, as shown in Figure 8.1.

Figure 8.1. Ideal multi buffering with more computation and less DMA overhead

However, when we simulated the implementation, the major problem we noticed was a large number of stall cycles caused by DMA transfers, even though the maximum number of buffers had been used; the number of buffers is limited by the size of the local memory. It turned out that there was too little computation to hide the memory latency, as shown in Figure 8.2.

Figure 8.2. Problem of multi buffering with less computation and more DMA overhead

To describe the DMA problem precisely, the required data and the related computation cycles of each processing stage are compared in Table 8.1.

Stage                    Required data   Weight   Computation cycles
Preprocessing            480             8        3 × 10513
Doppler Filtering        64              8        887
QR decomposition         240 × 66        8        2795772
Forward Substitution     66 × 66         1        9778
Backward Substitution    66 × 66 + 66    1        9272

Table 8.1. Required data and computation cycles for each processing stage. The permutation/transposition is not included in the computation cycles, and each data item is a complex sample of 32-bit floating-point values. All processing coefficients are assumed to already reside in the SPE.

Based on Table 8.1, the ratio between data access and computation cycles is shown in Figure 8.3, which can be regarded as the frequency of data accesses to main memory. The overhead of loading distributed data is assumed to be eight times larger than that of loading continuous data; to account for this, the amount of required data is weighted with the 'Weight' column of Table 8.1, i.e., multiplied by eight for the first three stages and by one for the last two. We notice that three stages of STAP, namely Doppler filtering, forward substitution, and backward substitution, have so few computation cycles that their data access becomes relatively heavy in Figure 8.3. If we compare this data access frequency with the simulation results, we discover that the channel stall cycle rate of each processing stage, shown in Figure 8.4, is proportional to the data access frequency. The channel stall cycles are the number of cycles the SPE stalls due to DMA transfers. In summary, we conclude that a processing stage with too little computation and too much data access cannot hide the DMA latency by multi-buffering.
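The multi-buffering scheme of Figure 8.1 can be sketched as a double-buffered load loop. This is a minimal sketch: the chunk size, buffer names, and the process() hook are illustrative assumptions, not the code of our implementation:

#include <spu_mfcio.h>

#define CHUNK 4096                     /* illustrative block size in bytes */
static volatile char buf[2][CHUNK] __attribute__((aligned(128)));

extern void process(volatile char *data, int nbytes);  /* hypothetical compute hook */

/* Double buffering: while chunk n is processed out of one buffer,
 * the MFC already streams chunk n+1 into the other buffer. The DMA
 * tag equals the buffer index, so each wait targets one buffer only. */
static void stream_chunks(uint64_t ea, int nchunks)
{
    int cur = 0;
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);            /* prime the first buffer */
    for (int n = 0; n < nchunks; n++) {
        int nxt = cur ^ 1;
        if (n + 1 < nchunks)                            /* prefetch the next chunk */
            mfc_get(buf[nxt], ea + (uint64_t)(n + 1) * CHUNK, CHUNK, nxt, 0, 0);
        mfc_write_tag_mask(1 << cur);                   /* wait for the current buffer */
        mfc_read_tag_status_all();
        process(buf[cur], CHUNK);
        cur = nxt;
    }
}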


Figure 8.3. The ratio of DMA transfer to computation in each processing stage

Figure 8.4. The channel stall cycles rate in each processing stage


Figure 8.5. Modified data flow and functional description of STAP system

8.1.2 System Integration for the Optimized STAP Flow

Since both preprocessing and Doppler filtering are one-dimensional signal processing, the second and third dimensions of the data cube can be varied without any effect on the result. To remove the bottleneck of the STAP system discussed in Section 7.1.2, we propose another data flow for the STAP system, shown in Figure 8.5. Two things have been modified in the new data flow: first, the data dimensions are re-aligned, and second, several computation stages are merged. Changing the data dimensions simplifies the rotation from a first-third dimension rotation to a first-second dimension rotation, where the first-third dimension rotation causes larger DMA overhead due to the discontinuous allocation in main memory. The first-third dimension rotation instead moves to the following processing stage, QR decomposition. We still benefit from this change, since QR decomposition accounts for more than 80% of the computational load of the STAP system, so the DMA overhead of loading distributed data can be hidden by its large number of computation cycles. The other reason for the modification is the merging of several computation stages, which not only reduces the number of data accesses but also hides the DMA latency behind longer computation periods. The modified data flow has only two stages, as illustrated in Figure 8.6: the first stage computes preprocessing and Doppler filtering, and the second comprises QR decomposition, forward substitution, and backward substitution. The number of times each data element is fetched from memory over the whole STAP flow is thereby reduced from five to two, which is an important factor in reducing the memory latency. Moreover, every fetched data element is used for several processing stages, so the number of computation cycles per fetch increases and the DMA latency can be hidden.


Figure 8.6. The data and computation flow of modified STAP system on each SPE


However, we cannot merge as many processing stages as we would like. Merging several computation stages usually requires a large amount of data to be held in the local memory, which is limited to 256 kB for both data and program. The local memory therefore has to be utilized carefully: parallelism should be extended by multi-buffering, but space must also be saved by allocating no more than necessary and freeing buffers as soon as possible. Even though the SPE has a large register file, 128 registers of 128 bits each, registers need to be freed or reused during processing for other computations. Otherwise, the swapping of registers to memory increases the memory overhead significantly once the usage exceeds the available registers. For example, the first task, preprocessing plus Doppler filtering, forces 192×64 complex values to be operated on at the same time, which requires not only an internal buffer of about 100 kB for the 192×64 samples but also an output buffer of 32 kB for the 64×64 samples of the transposition (these buffer sizes are checked in the short calculation at the end of this section).

In the very beginning, the data is aligned range-pulse-channel, and the first task comprises preprocessing and Doppler filtering. To process Doppler filtering directly after preprocessing, 64 pulses must be computed at the same time. There are 480 range samples in each pulse, divided into three segments of 192, 192, and 96 samples for the overlap-save method. Due to the size limit of the local memory in the SPE, only one segment of range samples per pulse can be loaded into local memory. Therefore, the workload of range samples in each channel is separated into three parts, giving 22 × 3 = 66 working blocks, which are assigned to the 8 SPEs as 8 blocks for six of them and 9 blocks for the other two. Each working block holds 64 pulses, each containing 192 or 96 range samples. The data is packaged to compute a 256-point FFT, a multiplication, and an inverse FFT for each pulse. After the overlap-save computation, a transposition exchanges the pulse and range dimensions, and Doppler filtering is applied to each swath of pulse samples with a 64-point FFT. Another transposition is required after all the Doppler frequency bins have been formed. The data is then ready to be written back to the XDR memory, while the computation goes on with the next working block. Once an SPE finishes its workload, it has to wait until all the others finish; this synchronization is centrally controlled by the PPE as mentioned before.

For the second task, the data is aligned range-pulse-channel and the computation comprises QR decomposition, forward substitution, and backward substitution. The data is partitioned by Doppler bin among the SPEs, so that each SPE computes 8 consecutive Doppler bins, each containing two range training windows. After QR decomposition, the upper triangular matrix R is generated. Forward substitution is computed using the transpose of R and the pre-computed steering vectors. Finally, backward substitution generates the desired adaptive weight vectors of the STAP system.
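The two buffer sizes quoted for the first task can be checked directly, assuming 8 bytes per complex sample (two 32-bit floats):

$$
192 \times 64 \times 8\,\mathrm{B} = 98\,304\,\mathrm{B} \approx 96\,\mathrm{kB}, \qquad
64 \times 64 \times 8\,\mathrm{B} = 32\,768\,\mathrm{B} = 32\,\mathrm{kB}
$$

Together these already consume about half of the 256 kB local store, before program code and multi-buffering are accounted for.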

8.2 Benchmark Results

As mentioned in Chapter 7, a modified data flow and computation partitioning is proposed to improve the overall performance. In this modified solution, the preprocessing part is bundled with the Doppler processing as one task loop, and the QRD is bundled with FS and BS as another single task loop. The benchmarking result in Table 8.2 clearly shows that the resulting CPI (1.04) is reduced by more than 60% compared to that of Table 7.3 (2.53).

Task (mem. latency)   Single Cycle   Dual Cycle   Branch Miss   Depend. Stall   Channel Stall   Total Cycle   Inst. Count   Perf. CPI
part1 (off)           3854317        2571003      495555        541923          2392            7498459       9149128       0.82
part1 (on)            3854316        2571003      495555        541923          112507          7608572       9149126       0.83
part2 (off)           26106713       16632367     1262391       5234003         11276           49521556      61736788      0.80
part2 (on)            26106712       16632367     1262391       5234003         22468           49532746      61736786      0.80
Total (off)           29961021       19203374     1759253       5775935         7334474         64342140      70885916      0.91
Total (on)            29495188       18896458     1696931       5715179         16278577        72386091      69787304      1.04

Table 8.2. Performance of the STAP benchmark with the modified data flow (including the latency of the memory subsystem)

The reason for the improvement is that when double buffering and loop unrolling are used at the same time, the movement of new data can be carried out during the computation on the old data loaded earlier, so that the channel stalls are significantly reduced. The CPI of the bundled subroutines stays unchanged whether or not the latency of the memory subsystem is considered, which means that the channel stalls have been successfully avoided in the modified STAP flow. Note that the overall CPI of 1.04 is slightly higher than that of the subroutines; the extra cycles are spent on the synchronization of the SPEs, which is needed because the DMA transfers complete out of order. The cycle-accurate benchmarking of the modified STAP system shows that roughly 40% more cycles are needed compared to the ideal result in Table 7.2, which is quite reasonable.

Finally, the execution time is calculated from the benchmarking result to check whether our implementation fulfills the real-time constraints. The execution time is derived by dividing the total number of cycles, 72,386,091 in our implementation, by the CBE operating frequency of 3.2 GHz. This yields an execution time of 22.6 milliseconds, which is well below the time constraint of 32.5 milliseconds. The results show that the new STAP data flow improves the overall performance by 220% by significantly reducing the DMA stall cycles, and the achieved CPI of 1.04 for the whole STAP flow is acceptable for the real-time constraints.


Figure 8.7. Comparison between original data flow and modified data flow

Chapter 9

Conclusion and Future Work

9.1 Conclusion

The aim of this thesis work was to design and implement a Space-Time Adaptive Processing system on the Cell BE processor. As explained in the previous chapters, several programming techniques are employed to optimize the performance on the Cell BE multiprocessor, such as loop unrolling, SIMDization, interleaved loads, multi-buffering, and branch hints. In addition, the overhead of the DMA transfers, which significantly affects the performance, is reduced by the proposed data reallocation and task partitioning. Although this thesis does not claim to provide the best possible optimization for implementing STAP on the Cell BE processor, the benchmarking result of the optimized solution shows that the CELL can well accommodate the baseband processing of multiple radar technologies, which supports the view that the concept of SDR is mature enough to be deployed in practice for light-weight radar systems.

9.2 Future Work

There is still room for improvement of the STAP system implemented on the Cell BE processor, especially regarding the stall cycles caused by DMA transfers, which still account for 22% of the overall execution time of the optimized solution. Since the simulator allows the size of the local memory to be varied, and that size was the limiting factor when merging computation stages, it would be very interesting to know how the size of the local memory affects the performance. In addition, the simulator supports execution on a dual Cell BE system. The performance could be improved by the parallelism of the dual datapath, but at the same time the communication between the two chips may stall the execution if there are data dependencies. It would also be interesting to see these performance results.


A further effort to improve the performance would be to optimize the addressing algorithms of the data permutation that takes place at the end of each processing stage. The optimal addressing scheme could be derived by formulating a mathematical model and solving it for the optimum. In this way, not only the DMA transfers but also the memory allocation can be fully optimized. Moreover, with the bottlenecks of implementing STAP on the Cell BE processor understood, a dedicated processor architecture could be proposed to compute radar signal processing more efficiently. One proposal would be to implement complex-valued operations instead of real-valued operations per instruction; another would be to design a memory architecture that can efficiently perform multi-dimensional data cube rotation.

Bibliography

[1] A. E. Eichenberger, J. K. O'Brien, K. M. O'Brien, P. Wu, T. Chen, P. H. Oden, D. A. Prener, J. C. Shepherd, B. So, Z. Sura, A. Wang, T. Zhang, P. Zhao, M. K. Gschwind, R. Archambault, Y. Gao, and R. Koo. Optimizing compiler for the Cell processor. IBM Systems Journal, 45(1), 2006.

[2] Gene H. Golub and Charles F. Van Loan. Matrix Computations, Third Edition.

[3] International Business Machines Corporation. Cell Broadband Engine Programming Tutorial, October 2005.

[4] International Business Machines Corporation. Accelerated Library Framework Programmer's Guide and API Reference, Version 1.0, December 2006.

[5] International Business Machines Corporation. Performance Analysis with the IBM Full-System Simulator, November 2006.

[6] Kenneth C. Cain, Jose A. Torres, and Ronald T. Williams. RT_STAP: Real-Time Space-Time Adaptive Processing Benchmark. MITRE Technical Report, 1997.

[7] Luke Cico and Jon Greene. Space-Time Adaptive Processing Estimates for the IBM/Sony/Toshiba Cell Broadband Engine Processor. Mercury Computer Systems, 2005.

[8] Mark A. Richards. Fundamentals of Radar Signal Processing. 1st edition, 2005.


Appendix A

SPE Kernel C Intrinsic Code for QR decomposition

/****************************************************************
 * Function Name : normalize
 * function:
 *     normalize the complex-value vector
 *     u = u / |u|
 * input :
 *     real   : pointer to real-value part of vector array
 *     img    : pointer to imaginary-value part of vector array
 *     number : number of elements in the vector
 *****************************************************************/
static float normalize( volatile float * real, volatile float * img, int number)
{
    int i;
    vector float r1,i1,r2,i2;
    vector float load_r1,load_i1,load_r2,load_i2;
    vector float sumr1,sumi1,sumr2,sumi2;
    vector float sum1,sum2,sum3,sum4,sum;
    vector float inv_squroot,squrt;
    float t;

    /* Initialization */
    sumr1 = (vector float){ 0, 0, 0, 0 };
    sumi1 = (vector float){ 0, 0, 0, 0 };
    sumr2 = (vector float){ 0, 0, 0, 0 };
    sumi2 = (vector float){ 0, 0, 0, 0 };
    load_r1 = (vector float){ 0, 0, 0, 0 };
    load_i1 = (vector float){ 0, 0, 0, 0 };
    load_r2 = (vector float){ 0, 0, 0, 0 };
    load_i2 = (vector float){ 0, 0, 0, 0 };

    /* summarize square of each element */
    for (i=0 ; i< vectorSize/4 ; i+=2) {
        r1 = load_r1;
        i1 = load_i1;
        r2 = load_r2;
        i2 = load_i2;
        sumr1 = spu_madd ( r1, r1 , sumr1);
        sumi1 = spu_madd ( i1, i1 , sumi1);
        sumr2 = spu_madd ( r2, r2 , sumr2);
        sumi2 = spu_madd ( i2, i2 , sumi2);
        /* Interleaved load variables for next iteration */
        load_r1 = * ((volatile vector float *) real +i) ;
        load_i1 = * ((volatile vector float *) img +i) ;
        load_r2 = * ((volatile vector float *) real +i+1) ;
        load_i2 = * ((volatile vector float *) img +i+1) ;
    }

    /* One more computation iteration after the loop due to interleaved load */
    sumr1 = spu_madd ( load_r1, load_r1 , sumr1);
    sumr2 = spu_madd ( load_r2, load_r2 , sumr2);
    sumi1 = spu_madd ( load_i1, load_i1 , sumi1);
    sumi2 = spu_madd ( load_i2, load_i2 , sumi2);

    /* add results in all registers */
    sumr1 = spu_add(sumr1,sumr2);
    sumi1 = spu_add(sumi1,sumi2);

    /* sum four elements in one register by shifting */
    sum1 = spu_add(sumr1,sumi1);   /* A B C D                 */
    sum2 = spu_rlqwbyte(sum1,8);   /* C D A B                 */
    sum3 = spu_add(sum1,sum2);     /* A+C B+D A+C B+D         */
    sum4 = spu_rlqwbyte(sum3,4);   /* B+D A+C B+D A+C         */
    sum  = spu_add(sum3,sum4);     /* A+B+C+D in all elements */

    /* reciprocal square root for the sum */
    inv_squroot = spu_rsqrte(sum);
    squrt = spu_mul(inv_squroot,sum);
    t = spu_extract(squrt,0);      /* |u|, the value returned to the caller */

    /* pre-load variables of first iteration for following loop */
    load_r1 = * ((volatile vector float *) real) ;
    load_i1 = * ((volatile vector float *) img ) ;
    load_r2 = * ((volatile vector float *) real+1) ;
    load_i2 = * ((volatile vector float *) img +1) ;

    /* each element in the vector is multiplied with 1/|u| */
    for (i=0 ; i< vectorSize/4-2 ; i+=2) {
        r1 = load_r1;
        i1 = load_i1;
        r2 = load_r2;
        i2 = load_i2;
        r1 = spu_mul ( inv_squroot,r1);
        i1 = spu_mul ( inv_squroot,i1);
        r2 = spu_mul ( inv_squroot,r2);
        i2 = spu_mul ( inv_squroot,i2);
        /* Interleaved load variables for next iteration */
        load_r1 = * ((volatile vector float *) real +i+2) ;
        load_i1 = * ((volatile vector float *) img  +i+2) ;
        load_r2 = * ((volatile vector float *) real +i+3) ;
        load_i2 = * ((volatile vector float *) img  +i+3) ;
        /* Store results for this iteration */
        * ((volatile vector float *) real +i) = r1;
        * ((volatile vector float *) img  +i) = i1;
        * ((volatile vector float *) real +i+1) = r2;
        * ((volatile vector float *) img  +i+1) = i2;
    }

    /* One more computation iteration after the loop due to interleaved load */
    r1 = load_r1;
    i1 = load_i1;
    r2 = load_r2;
    i2 = load_i2;
    r1 = spu_mul ( inv_squroot,r1);
    i1 = spu_mul ( inv_squroot,i1);
    r2 = spu_mul ( inv_squroot,r2);
    i2 = spu_mul ( inv_squroot,i2);

    * ((volatile vector float *) real +i) = r1;
    * ((volatile vector float *) img  +i) = i1;
    * ((volatile vector float *) real +i+1) = r2;
    * ((volatile vector float *) img  +i+1) = i2;

    return t;
}

/***********************************************************************
 * Function Name : in_product
 * function:
 *     inner product of two vectors, returned as out = < a , b >,
 *     followed by the in-place update a = a - < a , b > * b
 *     for each element: real_part = a_real * b_real - a_img * b_img
 *                       imag_part = a_real * b_img + a_img * b_real
 * input :
 *     a_real : pointer to the real-value part of vector a
 *     a_img  : pointer to the imaginary-value part of vector a
 *     b_real : pointer to the real-value part of vector b
 *     b_img  : pointer to the imaginary-value part of vector b
 ***********************************************************************/
static complex in_product(volatile float * a_real, volatile float * a_img,
                          volatile float * b_real, volatile float * b_img)
{
    int i;
    vector float ar1,ai1,br1,bi1;
    vector float tar1,tai1,tbr1,tbi1;
    vector float sumrr1,sumii1,sumri1,sumir1;
    vector float outr1,outi1;

    vector float ar2,ai2,br2,bi2;
    vector float tar2,tai2,tbr2,tbi2;
    vector float sumrr2,sumii2,sumri2,sumir2;
    vector float outr2,outi2;

    /* Initialization */
    sumrr1 = (vector float){ 0, 0, 0, 0 };
    sumii1 = (vector float){ 0, 0, 0, 0 };
    sumri1 = (vector float){ 0, 0, 0, 0 };
    sumir1 = (vector float){ 0, 0, 0, 0 };
    tar1 = (vector float){ 0, 0, 0, 0 };
    tai1 = (vector float){ 0, 0, 0, 0 };
    tbr1 = (vector float){ 0, 0, 0, 0 };
    tbi1 = (vector float){ 0, 0, 0, 0 };
    sumrr2 = (vector float){ 0, 0, 0, 0 };
    sumii2 = (vector float){ 0, 0, 0, 0 };
    sumri2 = (vector float){ 0, 0, 0, 0 };
    sumir2 = (vector float){ 0, 0, 0, 0 };
    tar2 = (vector float){ 0, 0, 0, 0 };
    tai2 = (vector float){ 0, 0, 0, 0 };
    tbr2 = (vector float){ 0, 0, 0, 0 };
    tbi2 = (vector float){ 0, 0, 0, 0 };

    /* Compute inner product of the two vectors */
    for ( i=0 ; i< vectorSize/4 ; i+=2 ) {
        ar1=tar1; ai1=tai1; br1=tbr1; bi1=tbi1;
        ar2=tar2; ai2=tai2; br2=tbr2; bi2=tbi2;
        /* Real_part = ar*br-ai*bi , imaginary_part = ar*bi+ai*br */
        sumrr1 = spu_madd(ar1,br1,sumrr1);
        sumii1 = spu_madd(ai1,bi1,sumii1);
        sumri1 = spu_madd(ar1,bi1,sumri1);
        sumir1 = spu_madd(ai1,br1,sumir1);
        sumrr2 = spu_madd(ar2,br2,sumrr2);
        sumii2 = spu_madd(ai2,bi2,sumii2);
        sumri2 = spu_madd(ar2,bi2,sumri2);
        sumir2 = spu_madd(ai2,br2,sumir2);
        /* Interleaved load variables for next iteration */
        tar1 = * ((volatile vector float *) a_real +i) ;
        tai1 = * ((volatile vector float *) a_img +i) ;
        tbr1 = * ((volatile vector float *) b_real +i) ;
        tbi1 = * ((volatile vector float *) b_img +i) ;
        tar2 = * ((volatile vector float *) a_real +i+1) ;
        tai2 = * ((volatile vector float *) a_img +i+1) ;
        tbr2 = * ((volatile vector float *) b_real +i+1) ;
        tbi2 = * ((volatile vector float *) b_img +i+1) ;
    }

    /* One more iteration after the loop due to interleaved load */
    sumrr1 = spu_madd(tar1,tbr1,sumrr1);
    sumii1 = spu_madd(tai1,tbi1,sumii1);
    sumri1 = spu_madd(tar1,tbi1,sumri1);
    sumir1 = spu_madd(tai1,tbr1,sumir1);
    sumrr2 = spu_madd(tar2,tbr2,sumrr2);
    sumii2 = spu_madd(tai2,tbi2,sumii2);
    sumri2 = spu_madd(tar2,tbi2,sumri2);
    sumir2 = spu_madd(tai2,tbr2,sumir2);

    /* add values in all registers */
    outr1 = spu_sub(sumrr1,sumii1);
    outi1 = spu_add(sumri1,sumir1);
    outr2 = spu_sub(sumrr2,sumii2);
    outi2 = spu_add(sumri2,sumir2);
    outr1 = spu_add(outr1,outr2);
    outi1 = spu_add(outi1,outi2);

    vector float temp1,temp2,temp3,temp4,temp5,temp6;
    complex out;

    /* add the four elements of one register vector */
    temp1 = spu_rlqwbyte(outr1,8);   /* C D A B                 */
    temp4 = spu_rlqwbyte(outi1,8);   /* C D A B                 */
    temp2 = spu_add(outr1,temp1);    /* A+C B+D A+C B+D         */
    temp5 = spu_add(outi1,temp4);    /* A+C B+D A+C B+D         */
    temp3 = spu_rlqwbyte(temp2,4);   /* B+D A+C B+D A+C         */
    temp6 = spu_rlqwbyte(temp5,4);   /* B+D A+C B+D A+C         */
    outr1 = spu_add(temp3,temp2);    /* A+B+C+D in all elements */
    outi1 = spu_add(temp5,temp6);    /* A+B+C+D in all elements */

    /* pre-load variables of first iteration for following loop */
    tar1 = * ((volatile vector float *) a_real) ;
    tai1 = * ((volatile vector float *) a_img) ;
    tbr1 = * ((volatile vector float *) b_real) ;
    tbi1 = * ((volatile vector float *) b_img) ;
    tar2 = * ((volatile vector float *) a_real+1) ;
    tai2 = * ((volatile vector float *) a_img+1) ;
    tbr2 = * ((volatile vector float *) b_real+1) ;
    tbi2 = * ((volatile vector float *) b_img+1) ;

    /* elements of b are multiplied with the inner product and subtracted
     * from a:  a = a - < outr*br-outi*bi , outr*bi+outi*br > */
    for ( i=0 ; i< vectorSize/4-2 ; i+=2 ) {
        ar1=tar1; ai1=tai1; br1=tbr1; bi1=tbi1;
        ar2=tar2; ai2=tai2; br2=tbr2; bi2=tbi2;

        temp1 = spu_mul ( bi1,outi1);
        temp2 = spu_mul ( bi1,outr1);
        temp3 = spu_mul ( bi2,outi1);
        temp4 = spu_mul ( bi2,outr1);

        temp1 = spu_msub ( br1,outr1,temp1);
        temp2 = spu_madd ( br1,outi1,temp2);
        temp3 = spu_msub ( br2,outr1,temp3);
        temp4 = spu_madd ( br2,outi1,temp4);

        ar1 = spu_sub(ar1,temp1);
        ai1 = spu_sub(ai1,temp2);
        ar2 = spu_sub(ar2,temp3);
        ai2 = spu_sub(ai2,temp4);

        /* Interleaved load variables for next iteration */
        tbr1 = * ((volatile vector float *) b_real+i+2) ;
        tbi1 = * ((volatile vector float *) b_img +i+2) ;
        tar1 = * ((volatile vector float *) a_real+i+2) ;
        tai1 = * ((volatile vector float *) a_img+i+2) ;
        tbr2 = * ((volatile vector float *) b_real+i+3) ;
        tbi2 = * ((volatile vector float *) b_img +i+3) ;
        tar2 = * ((volatile vector float *) a_real+i+3) ;
        tai2 = * ((volatile vector float *) a_img+i+3) ;

        /* Store results for this iteration */
        * ((volatile vector float *)a_real + i) = ar1 ;
        * ((volatile vector float *)a_img + i) = ai1 ;
        * ((volatile vector float *)a_real + i+1) = ar2 ;
        * ((volatile vector float *)a_img + i+1) = ai2 ;
    }

    /* One more iteration after the loop due to interleaved load */
    out.real = (float) spu_extract(outr1,0);
    out.imag = (float) spu_extract(outi1,0);

    ar1=tar1; ai1=tai1; br1=tbr1; bi1=tbi1;
    ar2=tar2; ai2=tai2; br2=tbr2; bi2=tbi2;

    temp1 = spu_mul ( bi1,outi1);
    temp2 = spu_mul ( bi1,outr1);
    temp3 = spu_mul ( bi2,outi1);
    temp4 = spu_mul ( bi2,outr1);

    temp1 = spu_msub ( br1,outr1,temp1);
    temp2 = spu_madd ( br1,outi1,temp2);
    temp3 = spu_msub ( br2,outr1,temp3);
    temp4 = spu_madd ( br2,outi1,temp4);

    ar1 = spu_sub(ar1,temp1);
    ai1 = spu_sub(ai1,temp2);
    ar2 = spu_sub(ar2,temp3);
    ai2 = spu_sub(ai2,temp4);

    * ((volatile vector float *)a_real + i) = ar1 ;
    * ((volatile vector float *)a_img + i) = ai1 ;
    * ((volatile vector float *)a_real + i+1) = ar2 ;
    * ((volatile vector float *)a_img + i+1) = ai2 ;

    return out;
}

/***********************************************************************
 * Function Name : QR_decompose
 * function:
 *     Perform QR decomposition, A=QR, to decompose a matrix into an
 *     orthogonal matrix Q and an upper triangular matrix R.
 * inputs:
 *     At_real,At_img: real and imaginary part of input matrix A
 *     Rt_real,Rt_img: real and imaginary part of output matrix R
 ***********************************************************************/
static void QR_decompose(float *At_real,float *At_img,
                         float *Rt_real,float *Rt_img)
{
    int i,j;
    complex result;
    float * Ar;
    float * Ai;

    for (i=0 ; i< Cols ; i++) {
        Ar = & At_real[i*Rows];
        Ai = & At_img [i*Rows];
        for (j=0 ; j < i ; j++) {
            result = in_product(Ar,Ai, &(At_real[j*Rows]), &(At_img[j*Rows]));
            Rt_real[i*Cols+j] = result.real;
            Rt_img [i*Cols+j] = result.imag;
        }
        Rt_real[i*Cols+i] = normalize(&(At_real[i*Rows]), &(At_img[i*Rows]), Rows);
    }
}

Appendix B

SPE Kernel C Intrinsic Code for Forward/Backward Substitution

/***********************************************************************
 * Function Name : solve_lower
 * function:
 *     Solve lower triangular matrix by forward substitution
 *     AX=b, where A is a lower triangular matrix, b is a vector
 *     and X is the desired output vector.
 * input :
 *     m : the size of the vector
 *     A_r,A_i: real and imaginary part of input matrix A
 *     b_r,b_i: real and imaginary part of input vector b
 ***********************************************************************/
static void solve_lower(int m,float *A_r,float *A_i, float *b_r,float *b_i)
{
    int i,k;
    int off1,off2,off3,off4;
    float *a_r,*a_i;
    vector float a0_r,a1_r,a2_r,a3_r, a0_i,a1_i,a2_i,a3_i;
    vector float t0_r,t1_r,t2_r,t3_r, t0_i,t1_i,t2_i,t3_i;
    vector float t02h_r,t02l_r,t13h_r,t13l_r;
    vector float t02h_i,t02l_i,t13h_i,t13l_i;
    vector float tt0_r,tt1_r,tt2_r,tt3_r;
    vector float tt0_i,tt1_i,tt2_i,tt3_i;
    vector float col0_r,col1_r,col2_r, col0_i,col1_i,col2_i;
    vector float bk_r,bk_i,bi_r,bi_i,bb_r,bb_i;
    vector float *bkv_r,*bkv_i,*biv_r,*biv_i;
    float xr,xi;

    /* define shuffle vectors */
    vector unsigned char shuffle_2637;
    vector unsigned char shuffle_0415 = VEC_LITERAL(vector unsigned char,
        0x00, 0x01, 0x02, 0x03, 0x10, 0x11, 0x12, 0x13,
        0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17);
    vector unsigned char shuffle_zzz2 = VEC_LITERAL(vector unsigned char,
        0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80,
        0x80, 0x80, 0x80, 0x80, 0x08, 0x09, 0x0A, 0x0B);
    vector unsigned char shuffle_z000 = VEC_LITERAL(vector unsigned char,
        0x80, 0x80, 0x80, 0x80, 0x00, 0x01, 0x02, 0x03,
        0x00, 0x01, 0x02, 0x03, 0x00, 0x01, 0x02, 0x03);
    vector unsigned char shuffle_z014 = VEC_LITERAL(vector unsigned char,
        0x80, 0x80, 0x80, 0x80, 0x00, 0x01, 0x02, 0x03,
        0x04, 0x05, 0x06, 0x07, 0x10, 0x11, 0x12, 0x13);
    vector unsigned char shuffle_zz15 = VEC_LITERAL(vector unsigned char,
        0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80,
        0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17);
    shuffle_2637 = spu_or(shuffle_0415, 8);

    off1 = m;
    off2 = m << 1;
    off3 = off2 + m;
    off4 = m << 2;

    bkv_r = (vector float*)b_r;
    bkv_i = (vector float*)b_i;
    for( k=0;k