Pattern Recognition and Reconstruction on a FPGA Coprocessor Board

R. Männer, M. Sessler, H. Simmler
Lehrstuhl für Informatik V, Universität Mannheim

1 Introduction
High energy accelerator labs use huge detector systems to track particles. The ATLAS detector at CERN, Geneva, will provide complex three-dimensional images. A trigger system at the detector output is used to reduce the amount of data to a manageable size. Each trigger level applies certain filter algorithms to select the very rare physically interesting events. The algorithm presented here processes data from a special sub-detector, the TRT, to generate a trigger decision within ≈10 ms. Together with other results, this decision is then used to decide whether the event is rejected or passed on to the next trigger level. Due to the restricted execution time available for the decision, fast pattern recognition algorithms are required. These algorithms demand a high I/O bandwidth and high computing power. These requirements and the high degree of parallelism make the task well suited for custom computing machines.
2 Algorithm Description
The image of the whole TRT detector can be separated into a Barrel (BAR) and an EndCap (EC) part, where the BAR image consists of ≈55,000 pixels and the EC image of ≈159,000 pixels. A part of a BAR image is shown in figure 1.
Figure 1: Pixel image of the Barrel.

The pattern algorithm matches active hits in the image to a set of predefined patterns which have their origin in the detector center. The patterns for the BAR can be classified as a two-dimensional array of 80 pT blocks with 1024 possible φ values each. The curve radius of a pattern is determined by its pT value and φ defines its starting angle. Two pT blocks are shown in figure 2.

Figure 2: Pattern Classes.

Altogether there are 81920 different patterns for the BAR and another 76800 (80 pT × 960 φ) for the EC. The algorithm consists of the following steps: initial track finding (ITF), local maximum finding (LMF), track splitting (TS), fitting and final selection. For the FPGA implementation, only the first three steps are of interest, because they need ≈98% of the complete execution time (measured on a 600 MHz Athlon). During ITF, the hit image data are applied to a look-up table storing the Hough-transformed values for the hits in order to find the matching patterns. The result of the ITF is a two-dimensional histogram over all patterns, as shown in figure 3 for a single muon without pile-up. Afterwards the LMF step compares all histogram values to a given threshold and filters out the local maxima. The TS step then performs a reverse matching: each found pattern loops over all its pixels and checks whether they were active or not. Patterns with large gaps between active pixels are rejected, because they contain no interesting tracks. Both the LMF and the TS step are performed to reduce the number of candidate tracks to a minimum and thereby the number of uninteresting events. Fitting and final selection are the last steps of the algorithm and use floating point arithmetic. Due to this arithmetic and the lack of a performance gain, these steps are performed on a CPU.
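To make the data flow of the first three steps concrete, the following C sketch models ITF, LMF and TS in software. It is only an illustrative sketch under assumptions: the look-up table layout, the threshold value, the local-maximum criterion and the gap limit are invented here; only the pattern array dimensions (80 pT blocks × 1024 φ for the BAR) and the overall flow are taken from the text.

```c
/* Illustrative software sketch of ITF, LMF and TS.  Sizes and data
 * layouts other than the 80 x 1024 pattern array are assumptions. */
#include <stdint.h>

#define N_PT      80     /* pT blocks (Barrel)            */
#define N_PHI     1024   /* phi patterns per pT block     */
#define THRESHOLD 8      /* LMF threshold (assumed value) */

/* ITF look-up: one entry per hit pixel lists the patterns (packed as
 * pT * N_PHI + phi) that run through that pixel. */
typedef struct {
    int       n;          /* number of matching patterns  */
    uint32_t *pattern;    /* packed pattern indices       */
} hit_lut_entry;

/* initial track finding: build the (pT, phi) histogram */
static void itf(const uint32_t *hits, int n_hits,
                const hit_lut_entry *lut, uint8_t hist[N_PT][N_PHI])
{
    for (int h = 0; h < n_hits; h++) {
        const hit_lut_entry *e = &lut[hits[h]];
        for (int i = 0; i < e->n; i++) {
            uint32_t p = e->pattern[i];
            hist[p / N_PHI][p % N_PHI]++;   /* increment pattern counter */
        }
    }
}

/* local maximum finding: keep bins above threshold that are
 * local maxima in phi (one possible criterion) */
static int lmf(uint8_t hist[N_PT][N_PHI], uint32_t *found, int max_found)
{
    int n = 0;
    for (int pt = 0; pt < N_PT; pt++)
        for (int phi = 0; phi < N_PHI; phi++) {
            uint8_t v = hist[pt][phi];
            if (v < THRESHOLD) continue;
            uint8_t left  = hist[pt][(phi + N_PHI - 1) % N_PHI];
            uint8_t right = hist[pt][(phi + 1) % N_PHI];
            if (v >= left && v > right && n < max_found)
                found[n++] = (uint32_t)pt * N_PHI + phi;
        }
    return n;
}

/* track splitting: re-check every pixel of a found pattern against the
 * hit map and reject patterns with a large gap of inactive pixels */
static int ts_accept(const uint32_t *pattern_pixels, int n_pixels,
                     const uint8_t *hit_map, int max_gap)
{
    int gap = 0;
    for (int i = 0; i < n_pixels; i++) {
        if (hit_map[pattern_pixels[i]]) gap = 0;
        else if (++gap > max_gap) return 0;   /* gap too large: reject */
    }
    return 1;
}
```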
3 Previous Implementation
A previous implementation on a large FPGA processor has already been presented [1]. This version implemented only the ITF, running on Enable++ [2]. It made use of the very high RAM bandwidth of up to 576 independent data bits. Each bit was assigned to a single pattern and each pattern to a six bit counter. This required a high data bandwidth and allowed only a small number of search patterns per clock cycle.
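A minimal sketch of this counting scheme is shown below, assuming one 576-bit look-up word per hit; only the bit count and the counter width come from the text, the loop structure and saturation behaviour are illustrative assumptions.

```c
/* Sketch of the per-pattern counter scheme of the Enable++ version:
 * each of the 576 look-up table data bits is tied to one pattern and
 * one 6-bit counter. */
#include <stdint.h>

#define N_BITS 576                  /* independent RAM data bits        */

static uint8_t counter[N_BITS];     /* 6-bit counters, one per pattern  */

/* process one hit: lut_word is the 576-bit look-up word for this hit,
 * modelled here as an array of 0/1 flags */
static void itf_enable(const uint8_t lut_word[N_BITS])
{
    for (int b = 0; b < N_BITS; b++)
        if (lut_word[b] && counter[b] < 63)   /* saturate at 6 bits */
            counter[b]++;
}
```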
Figure 3: Histogram example (histogram contents versus φ (rad) and 1/pT (1/GeV)).
4 Follow-up Algorithm
An analysis of the image data showed that each hit increments at most two consecutive φ patterns within a pT block; all other φ patterns are not incremented by that hit. This knowledge and the availability of internal RAM blocks in new FPGA devices, such as the Xilinx Virtex, make it possible to increase the number of patterns dramatically and at the same time to reduce the FPGA system size.

In contrast to the pattern-assigned counters, the RAM version stores the histogram values within the internal RAM blocks. Each RAM block represents an individual pT with its 1024 φ values. The counters are eight bits deep, which leads to a RAM block size of 1024 × 8 bit. The internal RAMs themselves are dual-ported to allow an effective read-increment-write cycle. Each RAM block needs only 11 bits from the external look-up table for its 1024 φ patterns, a reduction of ≈99% compared to the previous implementation. The new version requires two clock cycles per hit but searches 12 times more patterns than the previous one. Matching all patterns requires 80 of the internal pT RAM blocks plus the additional external RAM for the look-up table and the hit hash table; the required external RAM size is 512k × 880 bit.

After the ITF, all pT RAM blocks are read out sequentially so that in each clock cycle 80 histogram values of different pT are passed to the maximum finder. This parallel read-out makes the LMF very simple: the maxima can be processed in parallel in a pipeline, and the results are stored in small internal FIFOs. TS uses the found patterns and compares the single pixels of each pattern to the current hits in the hit hash table. An "inverse" look-up table, which is not identical to the ITF look-up table, is used for this task. An efficient implementation is only possible when the "inverse" look-up table and the hit hash table can be accessed in parallel through two independent RAM banks. The required RAM sizes here are 256k × 17 bit for the "inverse" look-up table and 256k × 2 bit for the hit hash table.

Each found pattern and its active pixels are then transferred to the host CPU to perform the last steps of the algorithm. Finally, the histogram RAM blocks and the hit hash table have to be cleared for the next event.
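The following C model illustrates one such internal pT RAM block. The 11-bit encoding is an assumption: it is read here as a 10-bit φ index plus one flag marking whether the neighbouring φ bin is incremented as well, which matches the observation that a hit touches at most two consecutive φ patterns, but the text does not spell the encoding out. With 80 blocks this gives the quoted 80 × 11 = 880 bit look-up word.

```c
/* Software model of one internal pT RAM block: a 1024 x 8 bit
 * dual-ported histogram memory updated with a read-increment-write
 * cycle.  The 11-bit encoding below is an assumption. */
#include <stdint.h>

#define N_PHI 1024

typedef struct {
    uint8_t bin[N_PHI];      /* 1024 x 8 bit histogram counters */
} pt_block;

/* apply the 11-bit look-up value of one hit to one pT block */
static void pt_block_update(pt_block *b, uint16_t lut11)
{
    uint16_t phi  = lut11 & 0x3FF;       /* bits 9..0: phi index      */
    int      both = (lut11 >> 10) & 1;   /* bit 10: also hit phi + 1  */

    b->bin[phi]++;                       /* read-increment-write      */
    if (both)
        b->bin[(phi + 1) % N_PHI]++;     /* second consecutive phi    */
}

/* in the FPGA all 80 pT blocks are updated in parallel; in software
 * this becomes a loop over the 80 slices of the 880-bit look-up word */
static void histogram_hit(pt_block block[80], const uint16_t lut[80])
{
    for (int pt = 0; pt < 80; pt++)
        pt_block_update(&block[pt], lut[pt]);
}
```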
5 Implementation
The new RAM version has recently been implemented in VHDL and mapped onto an FPGA coprocessor board which is currently under development. The new FPGA coprocessor board will be a PCI card with one Xilinx XCV300 FPGA. Eight external RAM blocks with a size of 256k × 36 bit are connected to the FPGA through a memory switch. The Virtex XCV300 device with its 64 kbit of block SelectRAM allows the implementation of up to eight internal pT blocks, so only 8192 patterns can be matched at the same time. Matching all 81920 patterns can then be done either with several passes on one board or with 10 boards working in parallel. The algorithm itself is nearly identical for BAR and EC and can be run on both detector parts; BAR and EC differ only in the look-up table contents. The data transfer between the host CPU and the FPGA coprocessor is done by DMA transfers to achieve maximum performance. An execution frequency of 38 MHz was obtained for the placed design on an XCV300-4. This leads to a complete execution time of only ≈1.2 ms for an average hit block of 16000 hits. This time includes the data transfer from and to the host memory but is calculated for 8192 patterns only. The final steps are then performed on the CPU while the FPGA coprocessor is processing the next event. The same algorithm performed on a 600 MHz Athlon CPU takes 150 ms for all 81920 patterns [3].
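As a back-of-envelope consistency check of the quoted ≈1.2 ms (our own estimate, not a breakdown given in the text), the following snippet evaluates the two dominant contributions, assuming two clock cycles per hit and a 1024-cycle parallel histogram read-out; the remainder would be spent on track splitting and the DMA transfers.

```c
/* Rough timing estimate under the assumptions stated above. */
#include <stdio.h>

int main(void)
{
    const double f_clk  = 38e6;                 /* execution frequency   */
    const int    n_hits = 16000;                /* average hit block     */
    const double t_itf  = n_hits * 2 / f_clk;   /* ITF: ~0.84 ms         */
    const double t_read = 1024.0 / f_clk;       /* read-out: ~0.03 ms    */

    printf("ITF      %.2f ms\n", t_itf  * 1e3);
    printf("read-out %.2f ms\n", t_read * 1e3);
    /* the remaining ~0.3 ms of the ~1.2 ms covers TS and DMA transfers */
    return 0;
}
```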
6 Conclusion
The described system executes a pattern recognition algorithm for high energy physics. The algorithm generates a histogram over the pattern array and evaluates the local maxima as identified patterns. An additional pattern check further reduces the number of found patterns to achieve a good result. It was shown that the number of search patterns could be increased dramatically, by a factor of 14, through the use of distributed internal RAM blocks within the FPGAs. In addition, the data width of the external look-up table was decreased to only 1 percent of that of the previous implementation. This results in a reduction of the required custom computing machine from the 24-FPGA system Enable++ to a single FPGA coprocessor board.
References

[1] A. Kugel et al., "50kHz Pattern Recognition on the Large FPGA Processor Enable++", IEEE Symposium on FPGAs for Custom Computing Machines, pp. 262-263, April 1998.
[2] H. Högl et al., "Enable++: A Second Generation FPGA Processor", IEEE Symposium on FPGAs for Custom Computing Machines, pp. 45-53, April 1995.
[3] J. Baines et al., "Global Pattern Recognition in the TRT for B-Physics in the ATLAS Trigger", ATL-DAQ-99-012, 21 September 1999.