Architecture Design of FPGA-Based Wavefront Processor for Correlating Shack-Hartmann Sensor

Xiaofeng Peng (a, b), Mei Li (a), ChangHui Rao (a)
a. Institute of Optics and Electronics, Chinese Academy of Sciences, Chengdu 610209, China
b. Graduate University of Chinese Academy of Sciences, Beijing 100039, China

ABSTRACT

During solar observation, atmospheric turbulence blurs the solar image coming from the solar telescope. To improve the quality of the solar image, a solar Adaptive Optics (AO) system is employed. In a typical solar AO system, a correlating Shack-Hartmann (SH) wavefront sensor detects the aberration of the blurred image. To detect the aberration as well as possible, the frame rate of the CCD behind the SH sensor must be high enough to keep pace with the variation of the turbulence; CCDs with a 1000 Hz frame rate are common in solar AO systems. Moreover, next-generation telescopes are so large that CCD resolutions keep growing, which demands enormous processing power from the wavefront processor. As FPGA (Field Programmable Gate Array) technology becomes more powerful, FPGAs can deliver this processing ability through high clock speeds and parallelism. This paper presents a design for an FPGA-based wavefront processor for a solar AO system, characterized by a pipelined and parallel architecture. The peak operation rate is over 86 G operations per second and the calculation latency is 7.04 µs in a system with a 16×16 sub-aperture array, where each sub-aperture image is 16×16 pixels and the reference image is 8×8 pixels. With this processor, the frame rate of the CCD can reach 8800 fps. Built in a single FPGA, the processor is low-cost, compact, and easy to upgrade.

Key words: FPGA; CCD; solar adaptive optics; absolute difference algorithm; systolic array; pipeline

1. INTRODUCTION

Understanding the physics of small-scale structures on the Sun can help answer many scientific questions about the Sun. However, atmospheric turbulence between the telescope and the Sun makes observing these small-scale structures difficult. An Adaptive Optics (AO) system attached to a regular solar telescope to minimize the impact of atmospheric turbulence is called a solar AO system; a typical one is shown in Figure 1. As a key component of the solar AO system, the SH wavefront sensor, which is a lens array (as shown in Figure 1), detects the aberration of the wavefront. After sampling by the SH sensor, the image from the solar telescope is divided into small sub-apertures and processed by the wavefront processor. Because of the nature of solar observation, an absolute difference algorithm with an extremely high bandwidth requirement runs on the processor. A high-frame-rate wavefront processor running the absolute difference algorithm is characterized by high throughput and low latency. Moreover, for the wavefront processor of a next-generation AO system, which has more sub-apertures and a higher frame rate, the performance requirement is even higher. All of this poses a huge challenge to today's Von Neumann architecture processors.


Figure 1. Typical solar adaptive optics system. The uncorrected beam reflects off a deformable mirror; a beam splitter passes the corrected beam onward and diverts part of it to the Shack-Hartmann lens array and CCD, whose output drives the image processor.

FPGA technology has developed rapidly in recent years. It provides substantial processing ability through high clock speeds and parallelism. Moreover, the flexibility of FPGAs makes it possible to integrate the entire electronics of a solar AO system into a single chip. An FPGA test bench for AO was developed in Spain, in which a Xilinx Virtex-4 LX25 FPGA is used and the frame rate reaches 955 fps with 64 sub-apertures [1, 2]. Durham University produced a high-performance, low-cost real-time control system for AO based on an FPGA, called DARTS (Durham Advanced Real-Time System), to replace their aging C40 DSP-based system [3]. A Xilinx Virtex-2 Pro FPGA is used in DARTS, and the frame rate is up to 500 fps with 100 sub-apertures. To exploit the processing ability of FPGAs further, multiple FPGA devices have been clustered to perform sub-aperture processing [4]. The wavefront processor of SPARTA contains several Xilinx Virtex-2 Pro FPGAs and processes 64 sub-apertures at a frame rate of 1.5 kfps [5, 6, 7, 8].

2. ABSOLUTE DIFFERENCE ALGORITHM AND SYSTEM REQUIREMENT

The absolute difference algorithm is used in the solar AO system. Its expression is given in equation (1). $I_L$ refers to the image template, also referred to as the reference image, with a resolution of M×M pixels; $I_L(i, j)$ is the value of the pixel at position $(i, j)$ in the reference image. $I_R$ refers to the real image, with a resolution of N×N pixels; $I_R(i+u, j+v)$ is the value of the pixel at position $(i+u, j+v)$ in the real image. M is always smaller than N. $D_{RL}(u, v)$ is the absolute difference result at position $(u, v)$ in the real image, where $(u, v)$ is a coordinate in the real image area. The absolute difference algorithm finds the position of the area in the real image that most resembles the reference image; that position has the smallest $D_{RL}(u, v)$.

$$D_{RL}(u, v) = \sum_{i=0}^{M-1} \sum_{j=0}^{M-1} \left| I_R(i+u, j+v) - I_L(i, j) \right| \qquad (1)$$

To find the area of greatest likeness, the whole real image has to be searched. For each $D_{RL}(u, v)$ there are M² subtractions, M² absolute values, and M²−1 additions according to equation (1), and there are (N−M+1)² values of $D_{RL}$ in an N×N real image. In total, therefore, about 3×M²×(N−M+1)² operations are needed for one real image.
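As a software reference for the hardware design, the algorithm is easy to state directly. The following is a minimal NumPy sketch of equation (1) and the exhaustive search it implies; the function and array names are illustrative, not part of the hardware design.

```python
import numpy as np

def abs_diff_search(real, ref):
    """Exhaustive absolute difference search, equation (1).

    real: N x N sub-aperture image; ref: M x M reference image.
    Returns the (N-M+1) x (N-M+1) map of D_RL and the best (u, v).
    """
    N, M = real.shape[0], ref.shape[0]
    D = np.empty((N - M + 1, N - M + 1), dtype=np.int64)
    for u in range(N - M + 1):
        for v in range(N - M + 1):
            window = real[u:u + M, v:v + M].astype(np.int64)
            D[u, v] = np.abs(window - ref.astype(np.int64)).sum()
    best = np.unravel_index(D.argmin(), D.shape)  # most-alike position
    return D, best
```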


In the experiment, the resolutions of the real image and the reference image are 16×16 and 8×8 pixels respectively, so about 15552 operations are required to run the absolute difference algorithm on one real image. Furthermore, there are 256 real images comprising a 16×16 sub-aperture array (as shown in Figure 2) behind the Shack-Hartmann lens array, and the frame rate is 2500 fps. In this case, the wavefront processor must be capable of 15552×256×2500 operations, almost 10,000,000,000 operations per second.

Figure 2. 16×16 sub-aperture array; each sub-aperture image is 16×16 pixels.
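A quick check of this requirement under the stated geometry (M = 8, N = 16, 16×16 sub-apertures, 2500 fps):

```python
M, N = 8, 16
subapertures, fps = 16 * 16, 2500
ops_per_image = 3 * M**2 * (N - M + 1)**2         # 15552 operations
ops_per_second = ops_per_image * subapertures * fps
print(ops_per_second)                             # 9,953,280,000: ~10 G ops/s
```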

3. SYSTOLIC ARRAY

A systolic array is a hardware architecture that makes efficient use of the data read out of memory (shown in Figure 3). After being used by one Processing Element (PE) of the systolic array, data read from memory is passed on to other PEs that need it, and this process repeats until no PE needs the data any more. The bandwidth requirement on the memory decreases because it is no longer necessary to access the memory as frequently.

Figure 3. Systolic Array

In the absolute difference algorithm, a PE is defined as the unit that computes the $D_{RL}$ of one position in the real image. Ideally there would be 81 PEs for one real image, since there are 81 values of $D_{RL}$ at 81 positions. Figure 4 shows how the PEs work. Real image data is broadcast to all PEs, while reference image data is passed systolically from one PE to another. Each PE decides whether the data being broadcast belongs to its area; if so, both real and reference image data are picked up and the absolute difference computation in this PE runs.


Figure 4. Absolute difference systolic array: PE 0 through PE 80 produce D(0,0), D(0,1), D(0,2), …, D(8,8); real image data is broadcast to all PEs, and reference image data flows systolically along the chain.
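To make the dataflow concrete, here is a behavioral (not cycle-accurate) sketch of the broadcast scheme for the paper's M = 8, N = 16 geometry. For simplicity each PE indexes the reference image directly rather than modeling the systolic shifting of reference pixels; the class and variable names are illustrative.

```python
import numpy as np

M, N = 8, 16
rng = np.random.default_rng(0)
ref = rng.integers(0, 256, (M, M))
real = rng.integers(0, 256, (N, N))

class PE:
    def __init__(self, u, v):
        self.u, self.v = u, v          # search position this PE owns
        self.acc = 0                   # running D_RL(u, v)

    def consume(self, r, c, pixel, ref_img):
        # A PE picks up a broadcast pixel only if it falls inside its window.
        i, j = r - self.u, c - self.v
        if 0 <= i < M and 0 <= j < M:
            self.acc += abs(int(pixel) - int(ref_img[i, j]))

pes = [PE(u, v) for u in range(N - M + 1) for v in range(N - M + 1)]
for r in range(N):
    for c in range(N):                 # one pixel broadcast per "cycle"
        for pe in pes:
            pe.consume(r, c, real[r, c], ref)

D = np.array([pe.acc for pe in pes]).reshape(N - M + 1, N - M + 1)
best = np.unravel_index(D.argmin(), D.shape)   # most-alike position
```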

Inside each PE, a 3-step pipeline implements the function of equation (1) (shown in Figure 5). A comparator is used at the first step to keep the minuend larger than the subtrahend. A subtraction is then applied, yielding a non-negative result. Finally, the subtraction result is accumulated into the running sum.

Figure 5. Absolute difference pipeline: compare, subtract, accumulate.
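The per-pixel datapath can be expressed in a few lines. This is a software analogue of the three pipeline stages under the assumptions above, not the RTL itself:

```python
def pe_accumulate(real_pixels, ref_pixels):
    """Software analogue of the 3-step PE pipeline."""
    acc = 0
    for r, t in zip(real_pixels, ref_pixels):
        hi, lo = (r, t) if r >= t else (t, r)  # step 1: comparator orders operands
        diff = hi - lo                         # step 2: subtraction, result >= 0
        acc += diff                            # step 3: accumulate into D_RL
    return acc
```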

In the experiment there are actually only 18 PEs for each real image, because the logic resources of the FPGA are limited. These 18 PEs compute only part of the $D_{RL}$ values and are reused several times to finish the whole real image.
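With 81 positions and 18 PEs, the schedule works out to four full passes plus a final partial pass, which matches the data counts used in the performance evaluation of section 5:

```python
positions, pe_count = 81, 18
full_passes = positions // pe_count   # 4 full passes of 18 PEs
remainder = positions % pe_count      # 9 positions left: a final "half" pass
print(full_passes, remainder)         # 4 9
```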

4. SYSTEM DESIGN

4.1 Image Buffer

To deal with the 256 sub-apertures, a ping-pong image buffer is introduced. The size of each buffer equals that of one sub-aperture image. As shown in Figure 6, buffers A and B work alternately to feed data belonging to different sub-apertures into the systolic array. After one buffer is filled with the data of a certain sub-aperture, it delivers that data to the systolic array; in the meantime, the other buffer can accept the data of another sub-aperture. This structure works as long as data is sent to the systolic array faster than it is received from the CCD.

Figure 6. Image buffer: two banks alternately filled from the CCD and drained by the systolic array.
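A behavioral sketch of the ping-pong buffer follows; the class and method names are illustrative, not taken from the design:

```python
class PingPongBuffer:
    """Minimal sketch of the ping-pong image buffer."""
    def __init__(self, size):
        self.banks = [bytearray(size), bytearray(size)]
        self.fill_bank = 0                     # bank currently written by the CCD

    def write(self, offset, pixel):
        self.banks[self.fill_bank][offset] = pixel

    def swap(self):
        # Called once a sub-aperture is complete: the filled bank becomes
        # readable by the systolic array while the other bank starts filling.
        self.fill_bank ^= 1

    def read_bank(self):
        return self.banks[self.fill_bank ^ 1]  # bank drained by the systolic array
```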

The sub-aperture images are acquired by a CCD image sensor. CCD data is sent out row by row, so the 16 sub-apertures in the first row of Figure 2 are output first, and then the remaining 15 rows of sub-apertures are output sequentially. Accordingly, a 16-channel processor is used: every channel processes the images of one sub-aperture column, and all 16 channels work in parallel. That is, once the data of the 16 sub-apertures of the same row has filled the buffers of the 16 channels, all 16 systolic arrays start to work. In the meantime, the data of the next row of sub-apertures begins to fill the other half of the ping-pong buffers of the 16 processing channels. Again, as long as data is sent to the systolic arrays faster than it is received, the wavefront processor functions correctly. The mapping from CCD pixels to channels is sketched below.
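Since each channel owns one 16-pixel-wide sub-aperture column, routing a CCD pixel to a channel is a simple integer division. A minimal sketch, with assumed names:

```python
SUB_SIZE = 16                       # sub-aperture width/height in pixels

def channel_for(col):
    """Processing channel (0..15) that owns a given CCD pixel column."""
    return col // SUB_SIZE

def subaperture_row_for(row):
    """Sub-aperture row (0..15) that a given CCD pixel row belongs to."""
    return row // SUB_SIZE
```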

4.2 Processing Channel

Figure 7. One processing channel: CCD data and synchronization signals feed the pixel address, gate, image buffer, directive buffer, and systolic array modules.

Figure 7 shows the structure of a processing channel. Besides the image buffer and the systolic array, several function modules constitute a processing channel. The Pixel Address module generates position information about the incoming pixel from the CCD synchronization signals FV (Frame Valid), LV (Line Valid), and PV (Pixel Valid); counting the valid pulses of these signals yields the coordinate of the incoming pixel on the CCD surface. The Gate module is essentially a comparator: two coordinates defining the position of a sub-aperture are pre-stored in the Gate module, which decides whether to store the incoming pixel by comparing the pre-stored coordinates with the coordinate from Pixel Address. The image buffer is enabled by the Gate module when data belonging to its sub-aperture area arrives. The Address module generates addresses for accessing the image buffer and the directive buffer, under the control of Pixel Address. When incoming data is valid, the Address module is enabled to generate image buffer write addresses to store the data. After all the data of one sub-aperture has been stored in the image buffer, the Address module starts to generate read addresses for the image buffer and the directive buffer. Image data read from the image buffer is sent to the systolic array to perform the absolute difference calculation, while directives read from the directive buffer control the systolic array. Because the directives and the data are stored in different RAMs and accessed through different data buses, the absolute difference processing channel is a Harvard architecture. Compared with the Von Neumann architecture, the Harvard architecture has a relatively high directive bandwidth because there is a dedicated bus for directives; there is no conflict when data and directives are read simultaneously, so the processor works more efficiently.
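A behavioral sketch of the Pixel Address counter and the Gate comparator, assuming active-high FV/LV/PV signals sampled once per pixel clock (the signal handling is simplified and the names are illustrative):

```python
class PixelAddress:
    """Counts CCD sync pulses to recover the (row, col) of each valid pixel."""
    def __init__(self):
        self.row, self.col = -1, 0
        self.prev_lv = False

    def tick(self, fv, lv, pv):
        if not fv:                       # between frames: reset counters
            self.row, self.prev_lv = -1, False
            return None
        if lv and not self.prev_lv:      # rising edge of LV: a new line starts
            self.row += 1
            self.col = 0
        self.prev_lv = lv
        if lv and pv:                    # a valid pixel on a valid line
            coord = (self.row, self.col)
            self.col += 1
            return coord
        return None

class Gate:
    """Passes only pixels inside this channel's pre-stored sub-aperture window."""
    def __init__(self, top_left, bottom_right):
        self.tl, self.br = top_left, bottom_right   # the two stored coordinates

    def accept(self, coord):
        r, c = coord
        return self.tl[0] <= r <= self.br[0] and self.tl[1] <= c <= self.br[1]
```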


4.3 Parallel Processing Channel

Figure 8. Parallel 16-channel structure: CCD data and synchronization signals are broadcast to all 16 processing channels.

When the processor is working, the CCD synchronization and data signals are broadcast to the 16 channels. Once the data of the first sub-aperture row arrives, the Pixel Address module of each channel generates address information for the incoming pixel from the CCD synchronization signals. The Gate module then decides from this address information whether to process the incoming pixel; if so, the pixel is written into the ping-pong buffer of that channel. After all the data of the first sub-aperture row has been stored, the data of 16 sub-apertures sits in the 16 ping-pong buffers. The 16 Address modules of the 16 channels then start addressing the ping-pong buffers for image data and the directive buffers for directives at the same moment, which drives the systolic arrays. In the meantime, the data of the second sub-aperture row is broadcast to the 16 channels as before, and its 16 sub-apertures are distributed into the other halves of the 16 ping-pong buffers. Before those halves are filled with the data of the second sub-aperture row, the systolic arrays will have finished the absolute difference calculation of the first row, so the processing core can start another computation as soon as the second sub-aperture row is fully buffered. This process repeats 16 times to complete the calculation of the 16 sub-aperture rows; the row loop is sketched below. The absolute difference results are output row by row. They can be used directly by the control module that generates the control signals of the deformable mirror, or they can be stored in a buffer first and used later when the control module is free. Implementation of the control module is relatively easy; it can be implemented in any kind of device.
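The overall schedule, with buffering of row k+1 overlapped with computation on row k, can be summarized as a timing sketch in Python (toy classes, not the RTL):

```python
class Channel:
    """Toy stand-in for one processing channel with a two-bank ping-pong buffer."""
    def __init__(self):
        self.banks, self.fill = [None, None], 0

    def buffer(self, subap):           # CCD side: fill the current write bank
        self.banks[self.fill] = subap

    def swap(self):                    # filled bank becomes the compute bank
        self.fill ^= 1

    def compute(self):                 # systolic-array side: drain the other bank
        return f"D_RL map for {self.banks[self.fill ^ 1]}"

def process_frame(rows_of_subaps, channels):
    """Yields one row of results per iteration; buffering overlaps computing."""
    for ch, sub in zip(channels, rows_of_subaps[0]):
        ch.buffer(sub)                               # prime with row 0
    for k in range(len(rows_of_subaps)):
        for ch in channels:
            ch.swap()
        if k + 1 < len(rows_of_subaps):              # stream the next row now
            for ch, sub in zip(channels, rows_of_subaps[k + 1]):
                ch.buffer(sub)
        yield [ch.compute() for ch in channels]      # one sub-aperture row done
```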

5. PERFORMANCE EVALUATION

A Xilinx Virtex-4 LX80 FPGA is used to build the processor, which runs at 100 MHz. There are 288 (18×16) PEs in the processor, each with a 3-step pipeline, providing 288×3×100 MHz = 86.4 G operations per second. Because the sub-aperture array is processed row by row, the latency of processing the whole sub-aperture array is the processing time of one sub-aperture row; and since the 16 sub-apertures in a row are processed by the 16 channels simultaneously, the latency equals the processing time of one sub-aperture. For each sub-aperture, the 18 PEs are reused four and a half times to obtain all the $D_{RL}(u, v)$ of equation (1). While the 18 PEs are working, 16×9 = 144 data reads are needed per pass, sharing as much data within the systolic array as possible. After four passes, only 81−18×4 = 9 values of $D_{RL}(u, v)$ remain, so the 18-PE systolic array is reused for a final half pass, which needs only 16×8 = 128 data reads. To finish one sub-aperture, 144×4+128 = 704 data reads are performed; at a working frequency of 100 MHz this takes 7040 ns, which is the time to process one sub-aperture. In other words, the latency of the processor is 7.04 µs. Finishing the whole sub-aperture array of 16 rows takes 7.04×16 = 112.64 µs, so the CCD frame rate must stay below 8877 fps, and the system requirement of 2500 fps is satisfied with a large margin. The FPGA resources used are listed in Table 1.

Table 1. FPGA resources used
Resource          Used                  Utilization
BUFGs             4 out of 32           12%
DCM_ADVs          1 out of 12           8%
External IOBs     46 out of 768         5%
LOCed IOBs        4 out of 46           8%
RAMB16s           36 out of 200         18%
Slices            29410 out of 35840    82%
SLICEMs           9 out of 17920        1%
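The throughput, latency, and maximum frame rate quoted above follow directly from the clock and the read counts; a quick check:

```python
clock_hz = 100e6
reads = 144 * 4 + 128              # 704 buffer reads per sub-aperture
latency_s = reads / clock_hz       # 7.04e-6 s: latency of the processor
frame_s = 16 * latency_s           # 112.64e-6 s for 16 sub-aperture rows
max_fps = 1 / frame_s              # ~8877 fps upper bound on the CCD
ops_per_s = 288 * 3 * clock_hz     # 8.64e10: peak operation rate
```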

6. CONCLUSION

This FPGA-based wavefront processor works well under the requirement of 2500 fps. Within the processor, the largest logic delay comes from the first step of the absolute difference pipeline; that is, the comparator of the absolute difference pipeline is the slowest component and limits the performance of the pipeline, and hence of the processor. The wider the bit width of a comparator, the slower it runs. For compatibility, the bit width of the comparator is set to 16 in this processor, but for most scientific CCD sensors used in solar AO the bit width is much smaller, so suitably decreasing the bit width of the comparator will increase the speed of the absolute difference pipeline. This makes the processor workable with higher-frame-rate CCD sensors. For sub-aperture arrays larger than 16×16, the unused logic resources of the FPGA can be used to generate more processing channels; for example, a 24×24 sub-aperture array needs 8 more channels. If there are not enough logic resources left, another FPGA can be introduced, and the processors in the two FPGAs are exactly the same except for the sub-aperture position information pre-stored in the Gate modules.

REFERENCES

[1] Luis F. Rodríguez-Ramos, Teodora Viera, Guillermo Herrera, José V. Gigante, Fernando Gago, Ángel Alonso, "Testing FPGAs for real-time control of adaptive optics in giant telescopes", in Advances in Adaptive Optics II, Proc. of SPIE Vol. 6272, 62723X (2006).
[2] Luis F. Rodríguez-Ramos, Teodora Viera, José V. Gigante, Fernando Gago, Guillermo Herrera, Ángel Alonso, Nicolas Descharmes, "FPGA adaptive optics system test bench", in Astronomical Adaptive Optics Systems and Applications II, Proc. of SPIE Vol. 5903, 59030D (2005).
[3] S. J. Goodsell, N. A. Dipper, D. Geng, R. M. Myers, C. D. Saunter, "DARTS: a low-cost high-performance FPGA implemented real-time control platform for adaptive optics", in Astronomical Adaptive Optics Systems and Applications II, Proc. of SPIE Vol. 5903, 59030E (2005).
[4] Deli Geng, Stephen J. Goodsell, Alastair G. Basden, Nigel A. Dipper, Richard M. Myers, Chris D. Saunter, "FPGA Cluster for High Performance AO Real-time Control System", in Advances in Adaptive Optics II, Proc. of SPIE Vol. 6272, 627240 (2006).
[5] Enrico Fedrigo, Robert Donaldson, Christian Soenke, Richard Myers, Stephen Goodsell, Deli Geng, Chris Saunter, Nigel Dipper, "SPARTA, the ESO Standard Platform for Adaptive optics Real Time Applications", in Advances in Adaptive Optics II, Proc. of SPIE Vol. 6272, 627210 (2006).
[6] S. J. Goodsell, E. Fedrigo, N. A. Dipper, R. Donaldson, D. Geng, R. M. Myers, C. D. Saunter, C. Soenke, "FPGA developments for the SPARTA project", in Astronomical Adaptive Optics Systems and Applications II, Proc. of SPIE Vol. 5903, 59030G (2005).
[7] S. J. Goodsell, D. Geng, E. Fedrigo, C. Soenke, R. Donaldson, C. D. Saunter, R. M. Myers, A. G. Basden, N. A. Dipper, "FPGA developments for the SPARTA project Part 2", in Advances in Adaptive Optics II, Proc. of SPIE Vol. 6272, 627241 (2006).
[8] S. J. Goodsell, D. Geng, E. J. Younger, E. Fedrigo, C. Soenke, R. Donaldson, N. A. Dipper, R. M. Myers, "FPGA developments for the SPARTA project Part 3", in Astronomical Adaptive Optics Systems and Applications III, Proc. of SPIE Vol. 6691, 669103 (2007).
[9] C. D. Saunter, G. D. Love, M. Johns, J. Holmes, "FPGA Technology for High Speed, Low Cost Adaptive Optics", in 5th International Workshop on Adaptive Optics for Industry and Medicine, Proc. of SPIE Vol. 6018, 60181G (2005).
