SIFT Implementation and Optimization for Multi-Core Systems

Qi Zhang†‡, Yurong Chen‡, Yimin Zhang‡, Yinlong Xu†*

† Dept. of Computer Science, Univ. of Science and Technology of China
‡ Intel China Research Center, Intel Corporation
{yurong.chen}@intel.com

* This work was partly supported by the NSFC No. 60533020 and the Innovation Foundation of USTC for Graduates No. KD2007054.

Abstract

Scale Invariant Feature Transform (SIFT) is an approach for extracting distinctive invariant features from images, and it has been successfully applied to many computer vision problems (e.g., face recognition and object detection). However, SIFT feature extraction is compute-intensive, and real-time or even super-real-time processing capability is required in many emerging scenarios. With multi-core processors becoming mainstream, SIFT can be accelerated by fully utilizing the computing power of available multi-core processors. In this paper, we propose two parallel SIFT algorithms and present several optimization techniques to improve the implementation's performance on multi-core systems. The results show that our improved parallel SIFT implementation can process general video images in super-real-time on a dual-socket, quad-core system, much faster than implementations on GPUs. We also conduct a detailed scalability and memory performance analysis on the 8-core system and on a 32-core Chip Multiprocessor (CMP) simulator. The analysis helps us identify possible causes of bottlenecks, and we suggest avenues for scalability improvement to make this application more powerful on future large-scale multi-core systems.
1. Introduction

Image matching is a fundamental aspect of many problems in computer vision, including object and scene recognition, recovering 3D structure from multiple images [5], stereo correspondence, and motion tracking. The Scale Invariant Feature Transform (SIFT) [1] image feature has many properties that make it suitable for matching different images of an object or scene. SIFT features are invariant to image scaling and rotation, and partially invariant to changes in illumination and 3D camera viewpoint. In addition, the features
are highly distinctive, which allows a single feature to be correctly matched with high probability against a large database of features, providing a basis for object and scene recognition. At the same time, SIFT feature extraction is a very time-consuming task, and many scenarios (e.g., online object recognition) require SIFT features to be extracted and matched in real-time or even in super-real-time. Recent research mainly focuses on the implementation and acceleration of SIFT feature extraction on graphics processing units (GPUs). Sinha et al. implemented SIFT on a GPU [7]. Due to hardware and OpenGL limitations, some data transfers are needed between the GPU and the CPU, which consume a significant portion of the time. As a result, their implementation can extract about 800 keypoint features from 640x480 video at 10 frames per second (FPS). Recently, Heymann et al. proposed another SIFT implementation on GPUs, which can be applied to image sequences with 640x480 pixels at 20 FPS [2]. However, with the boom in multi-core processors, we can also take full advantage of the computing power of today's multi-core systems to accelerate the use of SIFT features in computer vision applications. In this paper, we propose two parallel SIFT feature extraction algorithms and present several techniques to optimize their performance on multi-core systems. After optimization, the SIFT implementation with the improved parallel algorithm obtains up to a 6.7x speedup on a dual-socket, quad-core system. The processing speed averages 45 FPS for 640x480 video (with about 200~1000 keypoints per frame) in our experiments, exceeding the real-time requirement (at least 30 FPS for general video streams). The underlying optimization techniques and parallel methods are representative of those used in video/image-analysis applications, and can be applied to other applications to maximize their performance on multi-core systems. We also conduct a detailed performance analysis and identify possible causes of bottlenecks for the improved SIFT implementation on the 8-core system. To further understand the scalability, we also conduct a cycle-accurate simulation of the SIFT
implementation on a large-scale chip multiprocessor (CMP) simulator.

The remainder of this paper is organized as follows. In Section 2, we introduce the original SIFT algorithm and propose two parallel SIFT algorithms. Section 3 describes the serial and parallel optimization techniques for the SIFT implementation on multi-core systems. Next, Section 4 shows our experimental results and performance analysis of the improved parallel SIFT implementation. Finally, we conclude in Section 5.
2. The SIFT Algorithm

In this section, we first introduce the original SIFT algorithm [1], and then propose two parallel algorithms that exploit the inherent parallelism of SIFT feature extraction. One is a straightforward parallel algorithm, which directly parallelizes the most time-consuming modules in the original SIFT algorithm. The other is an improved parallel SIFT algorithm with much better load balancing.
2.1. Original SIFT Algorithm

SIFT is an approach for detecting and extracting local feature descriptors from images. The flow chart of the original SIFT algorithm is shown in Figure 1. The major computation stages used to generate the set of image features are: building the Gaussian scale space, keypoint detection and localization, orientation assignment, and keypoint descriptor generation [1].
Figure 1. Flow chart of the original SIFT algorithm

Building Gaussian scale space. Interest points for SIFT features correspond to local extrema of difference-of-Gaussian (DoG) images at different scales. An efficient approach to constructing the DoG images is shown in
Figure 2.

Figure 2. The blurred images at different scales and the computation of DoG images [1]. For each octave of scale space, the initial image is repeatedly convolved with Gaussians to produce the set of scale-space images shown on the left of the figure. Adjacent Gaussian images are subtracted to produce the DoG images shown on the right. After each octave, the Gaussian image is down-sampled by a factor of 2 and the process is repeated; the convolved images are grouped by octave, and a fixed number of DoG images is obtained per octave.

Keypoint detection and localization. Keypoints are identified as local maxima or minima of the DoG images across scales. Each pixel in a DoG image is compared to its 26 neighbors in the 3x3 regions at the current and adjacent scales. If the pixel is a local maximum or minimum, it is selected as a candidate keypoint. Low-contrast points and edge responses are then removed, leaving the final keypoints.

Orientation assignment. To determine the keypoint orientation, a gradient orientation histogram is computed in the neighborhood of the keypoint. The contribution of each neighboring pixel is weighted by its gradient magnitude and by a Gaussian window with a σ that is 1.5 times the scale of the keypoint. Peaks in the histogram correspond to dominant orientations. A separate keypoint is created for the direction corresponding to the histogram maximum and for any other direction within 80% of the maximum value.

Keypoint descriptor. Once a keypoint orientation has been selected, the feature descriptor is computed as a set of orientation histograms over 4x4 pixel neighborhoods. The orientation histograms are relative to the keypoint orientation, and the gradient data comes from the Gaussian image closest in scale to the keypoint's scale. Each histogram contains 8 bins, and each descriptor contains a 4x4 array of histograms around the keypoint. This leads to a SIFT feature vector with 4x4x8 = 128 elements. The vector is finally normalized to enhance invariance to illumination changes [14].
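The 26-neighbor extremum test in the keypoint detection stage above reduces to a few nested loops. The following C sketch illustrates it; the dog array layout and the function name are our own illustrative assumptions, not code from the paper.

    /* Illustrative 26-neighbor extremum test for one DoG pixel.
       Assumes dog[s] is the DoG image at scale s, stored row-major
       with width w; (x, y) must be at least one pixel away from
       every border and scale s must have both adjacent scales. */
    static int is_extremum(float **dog, int s, int x, int y, int w)
    {
        float v = dog[s][y * w + x];
        int is_max = 1, is_min = 1;
        for (int ds = -1; ds <= 1; ds++)
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++) {
                    if (ds == 0 && dy == 0 && dx == 0)
                        continue;   /* skip the center pixel itself */
                    float n = dog[s + ds][(y + dy) * w + (x + dx)];
                    if (n >= v) is_max = 0;
                    if (n <= v) is_min = 0;
                }
        return is_max || is_min;    /* candidate keypoint if either holds */
    }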
2.2. Straightforward Parallel SIFT Algorithm
Task-level and data-level decomposition are the two primary schemes for parallelizing an algorithm. In the SIFT algorithm, the major work is conducted on 2-D images, which offer abundant data parallelism at the row and pixel levels, so data parallelism is the natural choice for exploiting the inherent parallelism. Due to the complexity of the SIFT algorithm, we take the conventional approach to parallelizing it: we first rank the modules in the application by their computational cost, and then parallelize them in decreasing order of importance. After analyzing the serial algorithm, we identified the four most compute-intensive modules of the SIFT algorithm; together they account for more than 99.8% of the total SIFT execution time. We introduce these key modules and the related parallel schemes as follows.

Build Gaussian Scale Space (BGSS). The BGSS module convolves the input image with Gaussian filters. Since this is a two-dimensional convolution, each thread can process one part of the input image. In the original program this module writes its result back into the input image immediately, so there is contention between threads; moreover, to compute the DoG after the convolution, the original program has to keep a copy of the input image. We therefore change the BGSS module to save its result to a new data array, which removes both the thread contention and the copy operation (a sketch follows at the end of this subsection).

Keypoint Detection and Localization (KDL). The KDL module detects the local maxima and minima of the DoG images and removes low-contrast points from the extremum candidates. Finally, it saves the locations of the keypoints to a keypoint list. In this module every pixel in the DoG image is examined to determine whether it is a keypoint, so data partitioning is a natural fit; synchronization between threads is needed only when pushing a keypoint onto the shared result list.

Orientation Assignment and Keypoint Descriptor (OAKD). The OAKD module assigns orientations and computes feature descriptors for the keypoints. Since the number of keypoints and the computational effort required for each keypoint are both uncertain, we dynamically schedule the keypoints onto the worker threads to achieve parallel processing.

Matrix Operations (MO). The MO module includes matrix operations for image processing, such as matrix subtraction and image down-sampling. Since the loop iterations of these operations are independent, this module is easily parallelized using data parallelism.
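To make the BGSS change concrete, the sketch below shows one horizontal pass of the separable Gaussian filter, parallelized over rows with OpenMP and writing to a separate output buffer so that the input image is never overwritten. This is a minimal illustration under assumed names and signatures, not the paper's actual code.

    /* One (horizontal) pass of the separable Gaussian filter.
       Each thread processes a band of rows; results go to a separate
       buffer, removing the write contention on the input image and
       the extra copy otherwise needed for the DoG subtraction. */
    void gauss_rows_parallel(const float *in, float *out, int w, int h,
                             const float *kernel, int radius)
    {
        #pragma omp parallel for
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                float acc = 0.0f;
                for (int k = -radius; k <= radius; k++) {
                    int xx = x + k;
                    if (xx < 0)  xx = 0;        /* clamp at the borders */
                    if (xx >= w) xx = w - 1;
                    acc += kernel[k + radius] * in[y * w + xx];
                }
                out[y * w + x] = acc;           /* input stays intact */
            }
        }
    }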
2.3. Improved Parallel SIFT Algorithm

Load balance is one of the most important factors influencing the scalability of a parallel application. In the original algorithm flow shown in Figure 1, each keypoint is assigned an orientation and given a descriptor immediately after the keypoints of one scale are detected. There may then be very few keypoints available for this next step, since only a few keypoints are detected in the later octaves and scales. With the straightforward parallel algorithm, load imbalance therefore occurs in the "Assign Orientation" and "Generate Descriptor" steps (the OAKD module). Furthermore, as the image is down-sampled, the number of keypoints detected per scale decreases gradually, so the load imbalance becomes more serious in the later stages.
Figure 3. Flow chart of the modified SIFT algorithm

To get better load balance, we usually need enough tasks to schedule, and several task sets can sometimes be merged into a larger one. Motivated by this idea, we carefully analyzed the original SIFT algorithm flow and looked for a way to restructure it. We observed that, for any input image, the number of scales in each octave is constant while the number of octaves varies. Thus, we can gather all keypoints detected from all scales of one octave, and then compute their features in parallel. In this way we obtain a modified SIFT algorithm, whose flow chart is shown in Figure 3. Based on the modified algorithm, we derive an improved parallel SIFT algorithm; its pseudo-code is shown in Figure 4. In this algorithm, we assign one keypoint list to each octave. After detecting the keypoints in each scale of
this octave, we append them to the octave's keypoint list and move on to the next scale, instead of computing features for those keypoints immediately. Only after gathering all keypoints of the octave do we obtain a larger keypoint list and schedule those keypoints onto threads for feature extraction. In this way, the load imbalance of the OAKD module is reduced significantly.

    for all octaves {
        List keypoint_list;
        for all scales {
            ConvolveImageGaussParallel();
            BuildDoGParallel();
            // Detect keypoints
            #pragma omp parallel for
            for all pixels p in Image {
                if (IsKeypoint(p))
                    #pragma omp critical
                    keypoint_list.add(p);
            }
        }
        // Extract features once per octave, over the gathered list
        #pragma omp parallel for
        for all keypoints kp in keypoint_list {
            ExtractFeature(kp);
        }
        DownSampleImageParallel();
    }
Figure 4. Pseudo-code of the improved parallel SIFT algorithm
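The dynamic scheduling of keypoints in the OAKD stage maps naturally onto OpenMP's schedule(dynamic) clause. A minimal sketch, assuming a keypoint_list array, a keypoint count n_keypoints, and an ExtractFeature function as in the Figure 4 pseudo-code:

    /* Hand out keypoints one at a time so threads stay busy even
       though the per-keypoint cost varies. */
    #pragma omp parallel for schedule(dynamic, 1)
    for (int i = 0; i < n_keypoints; i++)
        ExtractFeature(&keypoint_list[i]);  /* orientation + descriptor */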
3. Performance Optimization

In this study, we choose the OpenMP programming model to implement our parallel SIFT algorithms, since OpenMP [11] provides an effective and quick way to develop parallel applications. In the implementation, we apply both serial and parallel optimizations to improve performance on multi-core systems.
3.1. Serial Performance Optimization

Before diving into the parallelization study, we first describe several optimization techniques that improve the application's performance; some of them benefit both the serial and the parallel versions.

First, we look for opportunities to apply generic optimization techniques, such as loop optimizations and memory alignment [9]. We make extensive use of loop fission in our serial optimization: this technique breaks a large loop into two or more smaller loops to improve memory locality and eliminate data dependences.

In addition, cache performance and bandwidth requirements heavily influence an application's performance on multi-core systems [10]. We apply cache-conscious optimizations to improve data locality; the effect is more pronounced for the parallel program because it reduces last-level cache misses as well as off-chip bandwidth demands. For example, the program convolves each row and each column of the input image with a Gaussian filter. Since a nested loop that traverses the image in column order causes poor cache performance, we transform it to access the data in row order (see the sketch below). We also try to reduce bandwidth demand and contention, for example by removing memory copy operations: the flow of computation in the SIFT algorithm is complex, and the original program performs frequent memory copies, which require high bandwidth and scale poorly, so we carefully restructure the data flow to avoid them. Finally, to take advantage of the data-level parallelism (DLP) features of modern processors, we apply SIMD (Single-Instruction Multiple-Data) optimization to the floating-point computations throughout the SIFT program.
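As an illustration of the row-order transform described above, the vertical Gaussian pass below keeps the row loop outermost, so the inner loop walks consecutive addresses and the cache lines fetched for neighboring rows are reused across the whole row. Names and the border-clamping policy are our assumptions, not the paper's code.

    /* Vertical pass written cache-consciously for a row-major image:
       iterating x in the inner loop reuses the cache lines of rows
       y-radius..y+radius instead of striding down a column. */
    void gauss_cols_rowmajor(const float *in, float *out, int w, int h,
                             const float *kernel, int radius)
    {
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                float acc = 0.0f;
                for (int k = -radius; k <= radius; k++) {
                    int yy = y + k;
                    if (yy < 0)  yy = 0;        /* clamp at the borders */
                    if (yy >= h) yy = h - 1;
                    acc += kernel[k + radius] * in[yy * w + x];
                }
                out[y * w + x] = acc;
            }
        }
    }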
3.2. Parallel Performance Optimization

After the serial optimization, we further apply several parallel optimization techniques to enhance the performance of the parallel SIFT implementation on the multi-core system. The following methods are employed in our parallel SIFT application.

Reducing Synchronization Overhead. Threads are often not totally independent, which forces the program to add synchronization to guarantee the execution order of the threads. Frequent synchronization calls and the associated waiting degrade scaling performance on multi-core processors. In OpenMP implementations, synchronization generally appears in the form of critical sections, locks, and barriers. In the SIFT application we also have to employ some synchronization operations, and we tune them carefully. For example, in the KDL module all threads push keypoints onto one shared keypoint list, so a critical section is necessary; acquiring a lock every time a thread pushes a keypoint consumes too much time. We therefore design a lock-free mechanism to reduce the synchronization overhead: the shared keypoint list is replicated into several private lists, each thread operates on its local list without exclusion, and the local lists are merged at the end of the parallel region (see the sketch at the end of this subsection). In addition, we manage per-thread buffers carefully, since frequent memory allocation/deallocation causes severe lock contention in the heap, and such requests essentially run serially even inside a parallel region.

Removing False Sharing. False sharing is a common pitfall in shared-memory parallel processing. It occurs when two or more cores/processors update different bytes of memory that happen to be located on the same cache line. Since multiple cores cannot hold the same line in a writable state at the same time, when one thread writes to the cache line, the copy referenced by the other thread is invalidated, and any new reference to data in that line by the second thread results in a cache miss and potentially large memory latencies. Therefore, it is important to ensure that the memory referenced by individual threads lies on separate cache lines. We manually resolve false sharing in the parallel SIFT application by padding each thread's data elements so that elements owned by different threads never share a cache line. For example, we dynamically schedule keypoints to different threads for feature computation, and the size of one keypoint's feature vector is 532 bytes, so some false sharing between threads is inevitable. We therefore pad each feature vector with 128 bytes of blank space to force the threads never to share cache lines.

Applying Thread Affinity. The thread affinity mechanism [12] attaches a thread to a specific core in a multi-core or multi-processor system. It improves cache performance, minimizes thread migration and context switches among cores, improves data locality, and mitigates the cost of maintaining cache coherency among the cores/processors. Since multi-core processors are likely to have a non-uniform cache architecture (NUCA), the communication latency between cores varies with the memory hierarchy. When a group of threads shows heavy data sharing, we can schedule them onto the same cluster to use the shared cache for data transfer. (A cluster is a collection of closely-coupled cores; e.g., two cores sharing the same L2 cache in a Core 2 quad-core processor form a cluster.) Conversely, for applications with high bandwidth demands, we prefer to schedule the threads on different clusters to exploit the aggregate bandwidth. After the row-based parallelization in the SIFT application, the image chunk assigned to one thread/core is also used by other threads, and significant coherence traffic occurs when the image data does not reside in cores sharing the same last-level cache. Therefore, scheduling the threads within the same cluster mitigates the data transfers between loosely-coupled cores.
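A minimal sketch of the lock-free keypoint collection and the cache-line padding discussed above; the list capacity, the flat keypoint representation, and the stand-in contrast test are illustrative assumptions only.

    #include <omp.h>
    #include <string.h>

    #define MAX_KP  4096                /* illustrative capacity */
    #define MAX_THR 64

    typedef struct {
        int  count;
        int  idx[MAX_KP];               /* flat pixel indices */
        char pad[128];                  /* keep neighboring lists on
                                           separate cache lines */
    } local_list_t;

    static local_list_t lists[MAX_THR];

    /* Each thread appends to its own padded list with no locking;
       the lists are merged once, after the parallel region. */
    int detect_keypoints(const float *dog, int npixels, int *out_idx)
    {
        #pragma omp parallel
        {
            local_list_t *my = &lists[omp_get_thread_num()];
            my->count = 0;
            #pragma omp for
            for (int i = 0; i < npixels; i++)
                if (dog[i] > 0.03f && my->count < MAX_KP) /* stand-in test */
                    my->idx[my->count++] = i;
        }
        int total = 0;                  /* serial merge */
        for (int t = 0; t < omp_get_max_threads(); t++) {
            memcpy(out_idx + total, lists[t].idx,
                   lists[t].count * sizeof(int));
            total += lists[t].count;
        }
        return total;
    }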
4. Experimental Results and Performance Analysis

This section presents the experimental results of the SIFT implementation and optimization on a multi-core system and on a large-scale CMP simulator. To characterize the performance of the parallel SIFT running on multi-core systems, we investigate the application from several aspects, including processing speed, scalability, and memory behavior.

The multi-core platform is a dual-socket, quad-core system (HP ProLiant DL380 G5) with 8 cores in total. It has two Intel Core 2 quad-core processors running at 2.33 GHz and 4 GB of FB-DDR2 RAM. Each socket has four cores, and each core is equipped with a 32 KB L1 data cache and a 32 KB L1 instruction cache. Each pair of cores on a chip shares a 4 MB unified L2 cache. The maximum FSB (front-side bus) bandwidth is 21 GB/s.

In addition to the existing multi-core system, we further study the parallel SIFT application's performance on a large-scale CMP simulator [10] with cycle-accurate simulation, to see how the application will scale with an increasing number of cores. We assume a very high main memory bandwidth so that we do not artificially limit the scalability of the modules.

The parallel SIFT algorithms are implemented in C using the OpenMP programming model. We use the Intel C/C++ Compiler version 9.1 to compile the program under Linux kernel 2.6.5-7.283-smp (x86_64) with full compiler optimization. The performance data is collected with Intel performance analysis tools such as the VTune Performance Analyzer [8] and the Thread Profiler [13].

We use three kinds of data sets in our experiments, as shown in Table 1. MPG2 is a single 720x576 image. The F series consists of 5 images of fixed size (640x480) with an increasing number of keypoints, and the S series consists of 5 images with almost the same number of keypoints (about 1000 each) and increasing image size.

Table 1. Label, image size and number of keypoints of the 3 kinds of test datasets

    Label   Size      Keypoints    Label   Size        Keypoints
    MPG2    720x576   509          S600    600x600     1015
    F200    640x480   200          S700    700x700     1005
    F400    640x480   394          S800    800x800     1001
    F600    640x480   620          S900    900x900     1019
    F800    640x480   802          S1000   1000x1000   990
    F1000   640x480   1028
4.1. Performance Improvement

The aggregated serial performance improvement for the different data sets on the multi-core platform is shown in Figure 5. From this figure, we can see that the processing speed of the original SIFT is about 2 FPS on average. After the serial optimization (generic optimization and cache-conscious optimization), the processing speed improves by about 26%, and the SIMD optimization contributes a further improvement of about 153%. As a result, the optimized serial SIFT implementation achieves an average speedup of 2.8x and an average processing speed of 5.5 FPS over all data sets.

Figure 5. Serial processing speeds of the optimized SIFT implementation (series: Original, Serial Optimized, SIMD Optimized; y-axis: processing speed in FPS; x-axis: the 11 test images)
After the parallelization and optimization, the SIFT implementation based on the improved parallel algorithm achieves the best performance on the 8-core system. The results with 8 threads are shown in Figure 6. The processing speed of most test cases (7 of the 11 test images) exceeds the real-time requirement of 30 FPS. The average speed is 34 FPS over all 11 test images and 45 FPS for the 5 images with 640x480 pixels, which is much faster than the GPU speeds reported in [7] (10 FPS) and [2] (20 FPS). Furthermore, Figure 6 indicates that the processing speed decreases as the keypoint count and image size increase.

Figure 6. Processing speeds of the improved parallel SIFT with 8 threads on an 8-core system (series: Original, Synchronization Reduction, False Sharing Removal, Thread Affinity; y-axis: processing speed in FPS; x-axis: the 11 test images)

Figure 6 also shows the cumulative parallel performance improvement from each parallel optimization method described in Section 3.2: synchronization overhead reduction, false sharing removal, and thread affinity improve the parallel SIFT performance by 9%, 9%, and 7%, respectively. After the parallel optimization, the improved parallel SIFT gains 25% performance on average over all test images.
4.2. Performance Scalability Analysis

Figure 7 shows the scaling performance of the parallel SIFT implementation using the straightforward and the improved parallel algorithms on the multi-core platform. For the 11 test images on eight cores, the average speedup of the straightforward parallel SIFT is 5.6x, while the improved parallel SIFT achieves an average speedup of 6.4x. From the right side of Figure 7, we can also see that the improved parallel SIFT scales well with an increasing number of cores for all test images: it achieves almost linear speedup on 2 cores, 3.5~3.7x speedup on four cores, and 5.9~6.7x speedup on eight cores. Furthermore, for the F series images the speedup increases with the number of keypoints, whereas for the S series images the speedup decreases as the image size grows.
Figure 7. Speedups of the straightforward (left panel, "Forward Parallel SIFT") and improved (right panel) parallel SIFT applications for the 11 test images on an 8-core system (y-axis: speedup; bars: 2T, 4T, 8T per image)

Table 2 shows the general parallel profiling data collected by the Intel Thread Profiler [13] for the straightforward and the improved parallel SIFT on the MPG2 image with eight threads. The percentage of load imbalance is reduced from 14.3% to 5.5% in our improved parallel SIFT. Moreover, the table indicates that the improved parallel SIFT has low synchronization and parallel overheads and spends more than 90% of its execution time in parallel regions, which suggests that the application exhibits good parallel performance.

Table 2. Breakdown of time spent in parallel regions, sequential regions, load imbalance, synchronization and parallel overheads for the straightforward (SF) and improved parallel SIFT applications on an 8-core system (MPG2, 8 threads)

    MPG2, 8T   Par.     Seq.   Imb.     Sync.   O/H
    SF         83.57%   0%     14.34%   1.05%   1.05%
    Improved   91.54%   0%     5.51%    1.47%   1.47%
The above experimental result shows that the improved parallel SIFT has much less load imbalance and
achieves better scaling performance than the straightforward parallel SIFT. In the following experiments, we therefore analyze only the improved version.

To further understand the scalability of the improved parallel SIFT, we also examined the scaling performance of the key modules and their serial execution time breakdown. From Figure 8, we can see that the BGSS and OAKD modules scale well, while the scaling performance of the KDL and MO modules is poor. The speedup of OAKD on eight cores is 7.1x, better than that of BGSS (6.5x on eight cores). From Figure 9, we can see that the share of OAKD grows and the share of BGSS shrinks across the F series images, so the speedups of the F series images in Figure 7 increase gradually. For the S series images the situation is the opposite, and the speedup decreases.

Figure 8. Speedups of the key modules (BGSS, KDL, OAKD, MO) in the improved parallel SIFT for the MPG2 image with 1, 2, 4 and 8 threads

Figure 9. Execution time breakdown of the key modules (BGSS, KDL, OAKD, MO) in the serial SIFT application for the 11 test images (y-axis: time percentage)

4.3. Memory Behavior Analysis

Based on the general parallel limiting factor metrics shown in Table 2, the theoretical speedup of the improved parallel SIFT for the MPG2 image with eight threads should be 7.3x (91.54% × 8 ≈ 7.3), but the measured speedup is only 6.1x, as shown in Figure 7. We observe that the aggregate running time of the parallel regions increases from the single-threaded run to the multi-threaded run, so it is highly likely that some operations run slower in the multi-core configuration than in the single-core configuration. To find the real reason, we further analyze the cache and FSB bandwidth behavior with the Intel VTune Performance Analyzer. Figure 10 shows the L1 and L2 cache miss rates of the improved parallel SIFT with the MPG2 input image. The L1 cache miss rate varies little with the number of threads, while the L2 cache performance changes considerably as the thread count scales. Since SIFT uses a hierarchical parallel decomposition, the down-scaled image has to be broadcast to all the private L2 caches after each iteration, incurring significant cache coherency misses when we scale to four and eight cores.

Figure 10. L1 and L2 cache miss rates of the improved parallel SIFT for the MPG2 image (y-axis: cache miss rate; x-axis: 1, 2, 4 and 8 threads)
We also analyze the FSB utilization of the improved parallel SIFT and its key modules. As shown in Figure 11, the bandwidth utilization of the whole application is not very high (3.3 GB/s with eight threads) and increases nearly linearly with the number of threads. For the KDL and MO modules, however, the bandwidth utilization with eight threads reaches about 8 GB/s, 38% of the peak FSB bandwidth. Evidently, the bandwidth demands of these two modules exceed the bandwidth the system can actually sustain.

Figure 11. Bandwidth usage of the improved parallel SIFT and its 4 key modules for the MPG2 image (series: SIFT, BGSS, KDL, OAKD, MO; y-axis: bandwidth in GB/s; x-axis: 1, 2, 4 and 8 threads)
4.4. Simulation Result Analysis

To further understand the scalability of the improved parallel SIFT, we simulate it with the MPG2 input image on a 32-core CMP simulator [10]. This cycle-accurate, execution-driven CMP simulator has been validated against real systems. In the simulator, each core is in-order and has a private L1 data cache, and all cores share an L2 cache. We assume a very high main memory bandwidth (59 bytes/cycle) so that we do not artificially limit the application's scalability. As expected, the improved parallel SIFT scales well as the number of threads increases on the 32-core simulator, as shown in Figure 12. It achieves almost linear speedup on 2, 4 and 8 cores, 14x speedup on 16 cores, and 25x speedup on 32 cores. The performance data shows that load imbalance is still a limiting factor for this application when it runs on large-scale CMP systems with a shared last-level cache (LLC) and high main memory bandwidth.
Figure 12. Scalability of the improved parallel SIFT on a 32-core CMP simulator (y-axis: speedup; x-axis: 1, 2, 4, 8, 16 and 32 threads)

5. Conclusion

The SIFT algorithm extracts distinctive invariant features from images and has been widely applied to many computer vision problems. In this paper, we propose two parallel algorithms and several optimization techniques for the SIFT implementation on multi-core systems. The experimental results show that the improved parallel SIFT implementation scales well on an 8-core system (up to a 6.7x speedup with eight threads). Its average processing speed is 45 FPS for images with 640x480 pixels, which is much faster than the implementations on GPUs. In addition, we analyze its scalability and memory performance on the 8-core system and on a 32-core CMP simulator. Based on our analysis, the main limiting factors of the parallel SIFT are the cache coherence misses due to the separate last-level caches, the amount of available system bandwidth, and the load imbalance. The parallel SIFT would perform better on multi-core systems with a shared LLC and higher memory bandwidth. Furthermore, finer-grained parallelization may be needed to reduce the load imbalance and reach maximum performance on future large-scale multi-core systems.
References

[1] David G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints", International Journal of Computer Vision, 60(2), November 2004.
[2] S. Heymann, K. Müller, A. Smolic, B. Froehlich, and T. Wiegand, "SIFT Implementation and Optimization for General-Purpose GPU", In Proc. of WSCG'07, Plzen, Czech Republic, January 2007.
[3] M. Brown and D.G. Lowe, "Invariant Features from Interest Point Groups", In British Machine Vision Conference, Cardiff, Wales, 2002.
[4] David G. Lowe, "Object Recognition from Local Scale-Invariant Features", In Proc. of the International Conference on Computer Vision, 1999.
[5] Iryna Skrypnyk and David G. Lowe, "Scene Modelling, Recognition and Tracking with Invariant Image Features", In Proc. of ISMAR'04, November 2004.
[6] Andrea Vedaldi, "SIFT++ - A Lightweight C++ Implementation of SIFT", http://vision.ucla.edu/~vedaldi/code/siftpp/siftpp.html.
[7] Sudipta N. Sinha, Jan-Michael Frahm, Marc Pollefeys, and Yakup Genc, "Feature Tracking and Matching in Video Using Programmable Graphics Hardware", Machine Vision and Applications, March 2007.
[8] Intel Corporation, "Intel VTune Performance Analyzer", http://www.intel.com/software/products/vtune.
[9] Intel Corporation, Intel 64 and IA-32 Architectures Optimization Reference Manual, 2006.
[10] C.J. Hughes, R. Grzeszczuk, E. Sifakis, D. Kim, S. Kumar, A.P. Selle, J. Chhugani, M. Holliman, and Y.K. Chen, "Physical Simulation for Animation and Visual Effects: Parallelization and Characterization for Chip Multiprocessors", In Proc. of the 34th Annual Intl. Symp. on Computer Architecture, 2007.
[11] OpenMP Application Program Interface, Version 2.5, May 2005.
[12] J.D. Salehi, J.F. Kurose, and D.F. Towsley, "The Effectiveness of Affinity-Based Scheduling in Multiprocessor Networking", In Proc. of INFOCOM'96, 1996.
[13] Intel Corporation, "Intel Threading Analysis Tools", http://www.intel.com/software/products/Threading.
[14] F. Estrada, A. Jepson, and D. Fleet, "Local Features Tutorial", http://www.cs.toronto.edu/~jepson/csc2503, 2004.