Real-Time Texture Detection Using the LU-Transform

Alireza Tavakoli Targhi, Mårten Björkman, Eric Hayman and Jan-Olof Eklundh
Computational Vision and Active Perception Laboratory
School of Computer Science and Communication
Royal Institute of Technology (KTH), SE-100 44, Stockholm, Sweden
{att,celle,hayman,joe}@nada.kth.se

Abstract. This paper introduces a fast texture descriptor, the LU-transform. It is inspired by previous methods, the SVD-transform and Eigen-transform, which yield measures of image roughness by considering the singular values or eigenvalues of matrices formed by copying greyvalues from a square patch around a pixel directly into a matrix of the same size. The SVD and Eigen-transforms therefore capture the degree to which linear dependencies are present in the image patch. In this paper we demonstrate that similar information can be recovered by examining the properties of the LU factorization of the matrix, and in particular the diagonal part of the U matrix. While the LU-transform yields an output qualitatively similar to those of the SVD and Eigen-transforms, it can be computed about an order of magnitude faster. It is a much simpler algorithm and well-suited to implementation on parallel architectures. We capitalise on these properties in an implementation of the algorithm on a Graphics Processor Unit (GPU), which makes it even faster than a CPU implementation and frees the CPU for other computations.

1  Introduction

Texture is an important cue in many applications of computer vision such as image segmentation [1, 2], the classification of objects [3] or materials [4–7], and texture synthesis for computer graphics [8, 5]. Many of these applications benefit from a fast and simple texture descriptor. Although no formal or mathematical definition exists, texture is frequently considered to be small-scale structure in images, and many different descriptors for texture have been proposed in the literature [9]. Filter banks, for instance, are especially popular, and may be motivated by early processing in biological visual systems. Descriptors which use the greylevels themselves, as opposed to an intermediate filter-based representation, have also regained popularity [2, 8, 5].

Recently, Tavakoli Targhi and coworkers proposed the SVD-transform [10] and Eigen-transform [11]. These texture descriptors are derived from matrix decompositions. The basic idea is to form a matrix from the greyvalues in a small, square window centred at a pixel, compute either the singular values or eigenvalues, and form a descriptor as the average of the smallest singular values / eigenvalues. This yields a one-dimensional descriptor which fires in "rough" areas of the image. The procedure is repeated for all pixels, or on a subsampled regular grid. The suitability of this approach for applications in attention (object detection) and image segmentation was demonstrated in [11].

Our work is motivated by similar applications, but with one crucial difference: we require real-time performance, e.g. for use on a robot. The SVD-transform and Eigen-transform are not quite fast enough on existing off-the-shelf hardware. The bottleneck is the computation of singular values or eigenvalues. The main contribution of this paper is therefore the introduction of a new texture descriptor, inspired by the framework of [10, 11]. Rather than calculating singular values or eigenvalues, we perform an LU-decomposition [12, 13], and define the LU-transform as the average of the smallest diagonal values in the resulting upper triangular matrix U. It is well-known that LU decomposition is much faster than finding singular values or eigenvalues; in our experiments we observed a speed-up of an order of magnitude. Moreover, the method lends itself well to implementation on a Graphics Processor Unit (GPU). For small window sizes this proves even faster, and it allows other processing to take place on the CPU.

The output of the descriptor is qualitatively similar to those obtained in [10, 11]. This may be briefly explained as follows (we refer also to Section 2, which reviews [10, 11] in a little more detail, and Section 3, which discusses the LU-transform). The eigenvalues / singular values provide information about the dependence between rows and columns of the matrix of the local patch. In a patch of uniform brightness, all but the largest eigenvalue / singular value are zero. If any two rows or columns are identical, the matrix drops rank, that is, the smallest eigenvalue / singular value becomes zero. If those two rows or columns are similar but not quite identical, the smallest eigenvalues / singular values will be close to, but not exactly equal to, zero. Thus the SVD and Eigen-transforms essentially encode the degree to which rows or columns of the patch approach being linearly dependent, by taking a sum over the smallest singular values or eigenvalues respectively. This information about rank is also captured within the LU factorization.

Indeed, while achieving a considerable speed-up, the LU-transform inherits the original properties of [10, 11] for bottom-up processing in real-world applications: (i) it captures small-scale structure in terms of roughness or smoothness of the image patch; (ii) it provides a compact representation and low-dimensional output (usually just a single dimension) which is easy to store and perform calculations on; (iii) few parameters need tuning, the most significant being a notion of scale provided by the size of the local image patch; and (iv) unlike most other texture descriptors, it does not generate spurious responses around brightness edges. This is in contrast to, for instance, filters, which tend to identify a strip around a brightness edge as a separate region.

The remainder of the paper is organized as follows. Section 2 reviews the SVD-transform and Eigen-transform. The LU-transform is introduced and its output compared to [10, 11] in Section 3. Section 4 focuses on computational efficiency and describes our GPU implementation. Finally, conclusions are drawn in Section 5.

2  Review of the SVD and Eigen-transforms

In this section we briefly review the SVD and Eigen-transforms [10, 11], and analyse what these texture descriptors capture. The general framework is as follows: we consider a w × w square neighbourhood centred at a pixel, and copy its greyvalues directly into a w × w real matrix W. We proceed by computing a matrix decomposition of W which provides a vector of numbers. For the SVD-transform, this vector consists of the singular values of W, while for the Eigen-transform we instead insert the eigenvalues of W into the vector. Then we take the magnitudes of the numbers in this vector and sort them in decreasing order, {‖α_1‖, ‖α_2‖, ..., ‖α_w‖}. At this stage we have a set of w numbers describing each pixel in an image. It was shown in [10, 11] that the largest number, ‖α_1‖, gives a smooth version of the original image, while the small α_i capture the texture. The texture transform may therefore be defined as

    Φ(l, w) = Σ_{k=l}^{w} ‖α_k‖ ,   1 ≤ l ≤ w .                    (1)

The original papers [10, 11] took the average rather than the sum of these numbers, but this differs only by a constant scale factor. l and w are parameters set by the user; w is a scale parameter. For a descriptor that reacts to texture as opposed to brightness, the largest few α_i should be ignored. The transform is fairly insensitive to l; suitable values are in the range [2, w/2]. The results of these transforms are shown in Section 3. To save computation time, and because the resulting output was found to change slowly spatially, we do not compute the transform for every pixel. Instead we define a spacing parameter δ, which means that we calculate the transform only every δ pixels in both horizontal and vertical directions.

It is useful to indicate why this approach yields a successful texture descriptor. Initially we focus on the SVD-transform. Suppose we are given a w × w real matrix A with SVD A = U Σ V^T, where Σ is a diagonal matrix of singular values in decreasing order, and U and V are orthogonal matrices [14, 15, 12, 13]. Then a rank r approximation to A is the matrix A_r = U_r Σ_r V_r^T, where Σ_r is the top-left r × r submatrix of Σ, U_r consists of the first r columns of U, and V_r^T of the first r rows of V^T. Indeed, A_r is the optimal rank r approximation to A in the sense of minimizing the Frobenius norm of the residual ‖A − B_r‖_F over all rank r matrices B_r. Setting B_r = A_r, this residual can be written as

    ‖A − A_r‖_F = sqrt( Σ_{i=r+1}^{w} σ_i² )   [14, 15, 12, 13].

This expression is identical to the definition of the texture transform in Equation (1) if α_i = σ_i². In [10] α_i was just σ_i (not squared), because at that time we were unaware of this argument. In unreported experiments we have, however, found that the two expressions give qualitatively similar results.

Both these expressions based on the SVD, and also the Eigen-transform, can be seen to detect rank deficient or nearly rank deficient matrices formed from image patches. If the matrix A has low rank then some of its singular values and eigenvalues are zero, and the texture transform has a low response; this means the rows and columns are linearly dependent, which occurs if the image patch is uniform or partly uniform. Likewise, if there is little structure, then the rows and columns are close to being linearly dependent and the texture transform is still relatively low. On the other hand, if the image patch has complex structure, then rows and columns are much less likely to be dependent and the texture transform will be high. Thus, the SVD and Eigen-transforms fire in image areas of rough texture.
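To make the definitions concrete, the following Matlab fragment is a minimal sketch of Equation (1) for a single w × w patch W; the variable names are ours, not from the original implementations.

    % Sketch of Equation (1) for one w x w patch W (variable names are ours).
    s = svd(W);                        % singular values, in decreasing order
    Phi_svd = sum(s(l:w));             % SVD-transform: sum of the smallest values
    e = sort(abs(eig(W)), 'descend');  % eigenvalue magnitudes, sorted decreasing
    Phi_eig = sum(e(l:w));             % Eigen-transform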

3  The LU-transform

In this section, we first explain how to compute the LU-transform and justify why it should work. Then we compare it in experiments to the SVD and Eigen-transforms.

3.1  Computation of the LU-transform

Recall the methodology from Section 2, where we considered matrix decompositions of w × w square image patches. Now, rather than calculating the singular values or eigenvalues of the matrix, we compute its LU factorization. The LU factorization [14, 15, 12, 13] of a matrix is a decomposition of the general form A = P L U, involving a permutation matrix P, a unit lower triangular matrix L with ones on the diagonal, and an upper triangular matrix U. The LU factorization is used for solving sets of linear equations by Gaussian elimination: L contains the row multipliers used during elimination, U contains the pivot values on the diagonal and other information in the off-diagonal part, and P records the pivoting operations carried out during elimination. Here we use partial rather than full pivoting [14, 15, 12, 13].

The LU factorization is generally computed only for nonsingular square matrices, but it exists and is useful even if the matrix is singular or rectangular. When A is singular, U has zeros among its diagonal elements [16]. The number of zero elements on the diagonal of U gives the dimensionality of the nullspace of A: for example, if r is the rank of an n × n matrix A, then n − r zeros will appear on the diagonal of U in the LU factorization of A. Therefore the diagonal of U captures the same rank information as the eigenvalues and singular values. So, analogous to the Eigen and SVD-transforms, we can define the LU-transform Ω(l, w) by inserting the sorted absolute values of the pivots as the coefficients α_i in Equation (1),

    Ω(l, w) = Σ_{k=l}^{w} ‖u_kk‖ ,   1 ≤ l ≤ w ,                    (2)

where u_kk are the diagonal elements of U. Homogeneous image patches induce linear dependence and have a low rank r, so Ω has low magnitude. Conversely, image patches with complex structure have a high value of r, so Ω has large magnitude and the LU-transform has a high response. Empirically we have found that image patches which have full rank but small singular values also give rise to low values of Ω. We are currently studying these properties from a theoretical standpoint.
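As a concrete illustration, the following Matlab function is a minimal sketch of the LU-transform over a whole image, assuming Matlab's built-in lu (which uses partial pivoting, as above); the function and variable names are ours, not those of the original implementation.

    % Minimal sketch of the LU-transform, Equation (2); names are ours.
    function T = lu_transform(img, w, l, delta)
      img = double(img);
      [h, wd] = size(img);
      ys = 1:delta:(h - w + 1);          % window positions, spacing delta
      xs = 1:delta:(wd - w + 1);
      T = zeros(numel(ys), numel(xs));
      for r = 1:numel(ys)
        for c = 1:numel(xs)
          W = img(ys(r):ys(r)+w-1, xs(c):xs(c)+w-1);
          [~, U] = lu(W);                % U carries the pivots on its diagonal
          p = sort(abs(diag(U)), 'descend');
          T(r, c) = sum(p(l:w));         % discard the l-1 largest pivots
        end
      end
    end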

3.2  Examples

To illustrate the output of the LU-transform, we have applied the algorithm to an image (Figure 1a) of a natural scene from the Corel database [17]. The size of the image is 720 × 480. Figure 1b shows the LU-transform for window size w = 8 with l = 4, i.e. discarding the three largest coefficients, and Figure 1c shows the corresponding result with parameters l = 22 and w = 32. We have used the spacing δ = 8 for all of the experiments in this paper.

Fig. 1. The LU-transform with varying window size: (a) original image; (b) LU-transform Ω(4, 8); (c) LU-transform Ω(22, 32).

Fig. 2. Comparison of the output of the LU, SVD and Eigen transforms with the same parameters, applied to the image in Figure 2a. The first row shows the results for w = 8, and the second row the results for w = 32.

Recall that w is a scale parameter, and as in many image processing applications its optimal value depends on the task. We have not found it necessary to fine-tune the w parameter, and have found it sufficient to concentrate on just these two values (w = 8, w = 32). For example, for this image (Figure 1a), w = 32 gives a better result if the objective is to segment the cheetah out from the background (Figure 1c). On the other hand, small scales such as w = 8 can be useful for detecting small objects or defects and abnormalities, in which case there is considerable risk that a 32 × 32 image patch contains background as well as foreground.

Figure 2 shows the SVD, Eigen and LU-transforms of the same image. The results are qualitatively very similar. This is not surprising since the LU, SVD and Eigen-transforms all indicate the rank of the windows. Figure 3 shows the three texture transforms applied to an image containing a mosaic of different textures. First, notice that the three methods give very similar visual results, and second, that they respond weakly to smooth textures and strongly to rough textures. The transform captures the roughness and smoothness of small-scale structure within an image. To demonstrate this further, we take materials with different structure (Figure 4) from the KTH-TIPS2 image material database [18]. Table 1 shows the average and standard deviation of the texture transforms for each material from Figure 4.
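For reference, the two operating points used throughout this section can be computed with the lu_transform sketch from Section 3.1 (the function name is ours):

    T8  = lu_transform(img, 8, 4, 8);    % fine scale, Omega(4, 8)
    T32 = lu_transform(img, 32, 22, 8);  % coarse scale, Omega(22, 32)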

Fig. 3. Comparison of the output of the LU, SVD and Eigen transforms with the same parameters (w = 32 and l = 22): (a) original image; (b) LU-transform; (c) SVD-transform; (d) Eigen-transform.

Fig. 4. Images of six materials (numbered 1–6) taken from the KTH-TIPS2 database.

Higher scores are indicative of rough, coarse material structure, and lower numbers correspond to smooth, fine material structure. The results give the impression that the SVD, Eigen and LU coefficients capture the same roughness properties of the scenes. However, the standard deviations of the SVD-transform are very small in comparison to those of the Eigen and LU-transforms. This indicates that the SVD-transform is more consistent and uniform, which should come as no surprise, as the SVD is more stable and gives better rank information than other matrix decompositions. Figure 5 gives a further illustration of this fact. To illustrate the behaviour of the coefficients, we scan an arbitrary row of an image and collect K patches A_j of size 32 × 32 with a spacing of δ = 8. Figure 5b shows the texture transforms for the A_j along that row. Furthermore, Figure 5a shows (1/K) Σ_{j=1}^{K} ‖α_i^{(j)}‖ for each α_i, where the α_i (see Equation (1)) can be eigenvalues, singular values, or the pivots from the LU factorization as in Equation (2). Here we ignored the first coefficient as it is much larger than the rest. Figure 5a indicates that the LU coefficients form a flat curve in comparison with the other two transforms; the sorted pivots ‖u_kk‖ converge to zero very slowly. Figure 5b shows that the SVD-transform's output is much smoother than that of the LU-transform. This agrees with the result in Table 1 that the SVD gives more uniform and stable output.

We have presented different applications of the texture transform in our previous work [10, 11] and compared the results with other methods. In this paper the main focus is on computational efficiency, but here we briefly present results on two applications. Real-time object detection and visual attention are important tasks in robotics and computer vision. Figure 6 presents the result of the LU-transform used as a texture cue for visual attention. The first row of Figure 6 shows a breakfast table with different materials from the KTH-TIPS2 database. The second row in Figure 6 shows a table with some objects. Most of the interesting objects with small structure or texture pop out in the LU-transform (w = 8 and l = 4, Figure 6b). Figure 6c shows the result of thresholding the LU-transform, yielding a segmentation. More examples of the LU-transform are shown in Figure 7.
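The following is a hedged sketch of the attention experiment in Figure 6, using the lu_transform function sketched in Section 3.1; the paper does not state the threshold, so the value below is our assumption.

    % Saliency from the LU-transform (cf. Figure 6); threshold is assumed.
    T = lu_transform(img, 8, 4, 8);   % w = 8, l = 4, delta = 8
    mask = T > 0.5 * max(T(:));       % very simple global thresholding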

    Image No        1          2          3          4           5            6
    LU         2.0±0.32   3.4±0.80   7.8±2.86   9.4±3.82   14.2±6.33   33.0±11.62
    SVD        0.8±0.04   0.9±0.07   3.8±0.46   4.6±1.20    5.6±1.60    14.7±1.74
    Eigen      1.0±0.26   1.1±0.58   4.1±4.04   5.9±5.60    6.1±5.60   17.3±11.18

Table 1. The average and standard deviation of the texture transforms for the images in Figure 4.

Fig. 5. Results from a scanline of image 1 in Figure 4. (a) compares the coefficients α_i used in the SVD, Eigen and LU-transforms, averaged over the scanline. (b) plots the texture transforms themselves as they vary along the scanline.

4  Computational efficiency

In this section we first evaluate the computational efficiency of the LU-transform relative to the SVD and Eigen-transforms. Then we describe our GPU implementation of the LU-transform and compare it to the CPU version.

4.1  CPU benchmarks

We implemented the three transforms in both Matlab and C++ and benchmarked them on an AMD Opteron 250 CPU running Linux. The matrix decompositions are the dominant factor in the computational expense. Both Matlab and C++ implementations use LAPACK [19] functions. LAPACK is a library of Fortran 77 routines for solving the most commonly occurring problems in numerical linear algebra. It was designed to be efficient on a wide range of modern high-performance computers. For our C++ implementation we used exactly the same LAPACK routines as those used by Matlab, and even linked to the LAPACK and BLAS libraries shipped with Matlab. This enabled us to evaluate the overheads associated with using Matlab. Furthermore, Matlab provides BLAS libraries highly optimized for each architecture, making use of available SSE routines. Pre-compiled libraries from Linux vendors may not be as highly optimized.

Fig. 6. An experiment using the LU-transform as a saliency map for attention for a mobile robot: (a) original image; (b) LU-transform Ω(4, 8); (c) segmentation result obtained by very simple thresholding of the LU-transform. The objects on the table which have small structure or texture have higher saliency values.

An important aspect of the texture transforms is that we always deal with square, real matrices, and that we do not need singular vectors or eigenvectors, which saves computation time. For instance, in Matlab d = svd(X) returns only the vector of singular values, which is considerably faster than [u,d,v] = svd(X), which also computes all singular vectors. Different LAPACK functions are used depending on what output is required. All three texture transforms have a complexity of O(w³) in general, where w × w is the size of the matrices to be transformed. These matrices originate from small window patches in the image, and are typically w = 8, 16 or 32 pixels in size. Sizes larger than w = 32 are uncommon; if there is an interest in collecting statistics from windows with w > 32, one might instead subsample the original image and compute the transform using a smaller window.

For benchmarking, we used a greyscale image of size 720 × 480 pixels. The spacing between windows is kept constant and equal to δ = 8 pixels for all experiments in this section. This yields an output transform image 8 times smaller than the original image in each dimension. Table 2 shows the computational cost (in ms) of the LU, SVD and Eigen-transforms for different window sizes. The costs are also illustrated as a graph in Figure 8. The first property to notice is that the LU-transform is much faster to compute, in fact by roughly an order of magnitude. It is clearly well-suited to real-time applications. Second, rather surprisingly, the methods do not seem to exhibit O(w³) behaviour. If we look at the LU-transform in particular (see the algorithm in Figure 9), the cost of the transform is dominated by level-2 rather than level-3 operations. In order to understand this we conducted a deeper analysis of which parts of the algorithm the costs are associated with. On the AMD Opteron 250 used for our experiments all level-3 operations can be executed fully in the 64 KB L1 cache. Each multiply-subtract operation (last line of the algorithm) requires a total of 2 loads, a store, a product and a subtraction. We verified that this can be done in 5.5 cycles per 4-point group of operations, using the SSE instruction set and exploiting full 4-way parallelism.
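A simple way to get a feel for the per-decomposition costs (not the exact benchmark used here) is the following Matlab sketch; absolute numbers will of course depend on the machine and the LAPACK/BLAS build.

    % Rough per-call timing of the three decompositions (Matlab sketch).
    w = 32; trials = 1000;
    A = rand(w);                       % stand-in for a 32 x 32 image patch
    tic; for t = 1:trials, lu(A);  end; t_lu  = toc / trials;
    tic; for t = 1:trials, svd(A); end; t_svd = toc / trials;
    tic; for t = 1:trials, eig(A); end; t_eig = toc / trials;
    fprintf('LU %.3f ms, SVD %.3f ms, Eig %.3f ms\n', ...
            1e3 * [t_lu, t_svd, t_eig]);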

Fig. 7. The first row shows the original images and the second row the corresponding LU-transforms.

It can further be shown that the total number of such operations is about w³/3. Thus the total cost of all level-3 operations should be about 35 ms for 720 × 480 pixel images with a matrix size of w = 32 and a δ = 8 pixel spacing. This is considerably lower than what we see in our experiments, which implies that our runs are dominated by the level-2 operations, which are harder to make L1-cache efficient. This explains the O(w²)-like behaviour seen in the experiments.

4.2  GPU implementation

Next we describe an implementation of the LU-transform on a Graphics Processing Unit (GPU) and compare its performance to the previously mentioned CPU versions. What makes our attempt different from those of earlier studies [20, 21] is that we have many small matrices to be transformed, instead of a single very large one. This affects the way parallelism can be exploited, as will be explained below.

GPUs are the key components of modern cards for 3D graphics acceleration. Due to the introduction of reprogrammable shading hardware and accompanying languages, GPUs have recently been applied to more general purpose processing. See [22] for more information on general purpose GPU initiatives. Some examples of applications in real-time computer vision exist, such as depth estimation [23], motion estimation [24] and figure-ground segmentation [25]. Unlike typical CPUs, most GPUs are based on single-instruction-multiple-data (SIMD) architectures, with multiple processing units working in parallel on different parts of an output image. Thus, to avoid unnecessary sacrifices in performance, care has to be taken when porting existing methods to GPU hardware. An understanding of the underlying hardware is critical.

Off-screen rendering is supported by most graphics cards using either pixel-buffers (pbuffers) or frame-buffer objects (FBOs). In both cases images and temporary data are mapped to the texture memory of the graphics card. Advanced filtering operations are possible, as image data are rendered from a set of texture buffers into a new one. With the introduction of programmable fragment shaders in recent GPUs, filtering can be controlled on a per-pixel basis.

                  Matlab                    C++
           w = 8   w = 16   w = 32   w = 8   w = 16   w = 32
    LU        68       89      182      11       29      101
    SVD      145      249      771      83      255      708
    Eigen    201      572     2081     143      516     2088

Table 2. The computational costs (in ms) of the LU, SVD and Eigen-transforms.

Fig. 8. The computational costs (in ms) of the LU, SVD and Eigen-transforms for different values of w and a constant spacing of δ = 8.

Each fragment shader is assigned a set of input and output textures, and for each point in the output a program associated with the shader is executed. There are a number of languages that can be used for programming shaders. In our study we use the OpenGL Shading Language (GLSL). However, since the hardware is the same, GLSL has many similarities to alternatives like Cg (NVidia specific) and HLSL (Microsoft specific). FBOs are typically faster than pbuffers, since they avoid repeated context switches when multiple textures are in use, which is typically the case in general purpose processing, where the operations and operands constantly change. Floating-point arithmetic is another recent innovation that has affected the way GPUs can be used for operations that require higher levels of accuracy, such as those used in this study.

The most critical part of a GPU implementation of the LU-transform is the mapping of image data to texture memory, so that parallelism can be exploited as much as possible. A typical GPU includes multiple fragment shaders and it is important to keep all shaders busy, while using as much available texture bandwidth as possible. Unfortunately, this means that the implementation of choice might vary between GPUs.

As described in Section 3, the original image f(y, x) is divided into a number of overlapping matrices, each w × w points in size. If M is the number of pixels in the original image, then with a δ-pixel spacing between locations at which the LU-transform is computed, there are N = M/δ² matrices in total, each requiring O(w³) operations to factorize. We then parallelize the computations, so that when a point-to-point matrix operation is performed, it is done on all matrices simultaneously during the same GPU call.

    for k = 1:N-1
      Pivot by swapping A(l,k) and A(k,k), where l = argmax_i |A(i,k)|
      for i = k+1:N
        F(i,k) = A(i,k) / A(k,k)
      for j = k+1:N
        Pivot by swapping A(l,j) and A(k,j)
        for i = k+1:N
          A(i,j) = A(i,j) - F(i,k) * A(k,j)

Fig. 9. Gaussian elimination with pivoting.

Parallelism is achieved by initially shuffling the input image into a large texture map that consists of small subsampled and shifted versions of the original image. There are w² patches, each patch corresponding to a different point in the matrices. As a result of shuffling, the (j, i)-th such patch is given by

    A(j, i) = [ f(j, i)       f(j, i+δ)       ...  f(j, i+w−δ)
                f(j+δ, i)     f(j+δ, i+δ)     ...  f(j+δ, i+w−δ)
                   ...            ...         ...      ...
                f(j+h−δ, i)   f(j+h−δ, i+δ)   ...  f(j+h−δ, i+w−δ) ]

Once this is done, we compute the LU factorization using Gaussian elimination with pivoting (see Figure 9 above). The rest of the process is thus similar to a typical CPU implementation, with the distinction that computations are performed on patches rather than on individual matrix points.
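To clarify the data layout, the shuffling step can be expressed compactly on the CPU; the following Matlab sketch (our naming, not the actual GLSL code, which operates on texture maps) builds the w² subsampled, shifted patches described above.

    % Sketch of the shuffling step: patch (j,i) collects entry (j,i) of every
    % w x w window, with windows spaced delta pixels apart (our naming).
    function A = shuffle_patches(f, w, delta)
      [h, wd] = size(f);
      A = cell(w, w);
      for j = 1:w
        for i = 1:w
          A{j,i} = f(j:delta:(h - w + j), i:delta:(wd - w + i));
        end
      end
    end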

4.3  Performance evaluation

Unlike our CPU implementations, we cannot exploit 4-way parallelism along rows or columns on the GPU; instead we exploit parallelism within the matrix point patches mentioned above. We do this in order to fully utilize the fragment shaders of the GPU. Since there are 12 such shaders in the NVidia 6800 GT GPU used in our experiments, the level of parallelism is 12-way at most. Unfortunately, the level of parallelism cannot be measured, since we have no control over which processing cores are active. Theoretically, based on knowledge of the processing cores and memory systems, one might draw some conclusions; however, since NVidia keeps some critical information on their memory systems hidden, these conclusions would hardly be accurate. One might also compare different, but similar, GPUs with different numbers of cores and draw conclusions on efficiency based on their relative performance. However, since we do not know the details of the memory systems, it is hard to tell whether these are indeed comparable. Even if 12-way parallelism is never reached, for various reasons such as the limited access to texture memory, we can be certain that the same level of parallelism would never be reached if computations were done on matrices similar to those of the CPU implementations: on the GPU, parallelism is exploited within each individual operation and is highly affected by the size of the polygons and textures involved,

              Matlab   C++   GPU
    w = 8         68    11     3
    w = 16        89    29    16
    w = 32       182   101   184

Fig. 10. The computational costs (in ms) of the LU-transform for implementations in Matlab, C++ and on a GPU.

and a matrix is typically considerably smaller than a matrix point patch. However, since the l indices in Figure 9 differ from one matrix to the next, they vary within the matrix point patches. Consequently, pivoting has to be tested, and rows potentially swapped, in the last loop, increasing the level-3 costs. In total we have three stores and four loads per level-3 operation: one extra store for pivoting, one extra load since the F factors (see Figure 9) have to be stored in a texture map, and an additional load and store due to the fact that the GPU cannot simultaneously store at two different locations in the same texture map.

Summaries of the computational costs of each implementation for different matrix widths can be seen in Figure 10. The real benefit of a GPU implementation is the low overhead for small matrix sizes. Especially for w = 8, but also for w = 16, the GPU version is considerably faster than the C++ CPU algorithm. Even with larger matrices (w = 32) there is still a benefit to be had from the GPU implementation, since the CPU is freed for other processing. The excellent performance at low w is due to the way parallelism is used. For the CPU implementations, the possibility of performing operations in parallel along rows or columns is small when matrices are small; for larger matrix sizes more operations can be performed in parallel. It is not until a full matrix no longer fits into the L1 cache, which occurs for sizes larger than w = 128, that a large increase in computational cost is noticed. On the GPU, parallelism is implemented within the matrix point patches. While the size of each patch is kept constant, the number of patches is w × w, which means that the total amount of accessed texture memory becomes very large for larger matrix sizes. Thus already at w = 32, texture read latencies increase due to poor caching on the GPU. In conclusion, for GPU implementations parallelism should be exploited in different ways depending on the size of the matrices; the approach we chose is suitable for sizes smaller than w = 32.

An operation that should not be underestimated is the transfer of the original input image from main memory to the texture memory of the GPU. For the graphics card used in our study this bandwidth is about 600 MB/s, which in practice means about 0.6 ms for a grey-level 720 × 480 pixel image (720 × 480 bytes ≈ 0.35 MB; 0.35 MB / 600 MB/s ≈ 0.6 ms). In the opposite direction the bandwidth is even lower. However, once the image is uploaded, transfers to and from texture memory are quick; in our case this bandwidth is about 22 GB/s, which is considerably faster than for most CPUs.

5  Conclusion

In this paper we presented a real-time texture descriptor, the LU-transform, with several useful properties. First, it provides a low-dimensional output which is easy to store and perform calculations on. Second, it is easy to implement, even on parallel architectures, and has few parameters to tune. Although the C++ implementation is fast, the simple structure of the algorithm allowed us to implement it on the GPU and keep the CPU free for other calculations. The efficiency of a GPU implementation depends on the way parallelism can be exploited and cache misses avoided; our implementation was shown to be particularly successful for matrix sizes smaller than w = 32. The properties of the method make it very suitable for real-time segmentation and object detection. We are currently using this method as a texture input to a multiple-cue attention system.

Still, there are several issues which we intend to study in future work. One is the transform's use as a feature for object recognition. Another concerns the scale parameter w: different patch sizes capture different properties, and while in this work we used only a single scale, in the future we would like to exploit a multi-scale representation. Finally, we need a more formal understanding of what the pivots used in the LU-transform capture.

Acknowledgments. We gratefully acknowledge support from the European Commission within the projects MUSCLE (A. Tavakoli Targhi) and MOBVIS (M. Björkman, E. Hayman), and the Swedish Foundation for Strategic Research within the project VISCOS (E. Hayman).

References

1. Malik, J., Belongie, S., Leung, T., Shi, J.: Contour and texture analysis for image segmentation. IJCV 43 (2001) 7–27
2. Ojala, T., Pietikainen, M.: Unsupervised texture segmentation using feature distributions. Pattern Recognition 32 (1999) 477–486
3. Schiele, B., Crowley, J.: Recognition without correspondence using multidimensional receptive field histograms. Intl. Journal of Computer Vision 36 (2000) 31–50
4. Leung, T., Malik, J.: Representing and recognizing the visual appearance of materials using three-dimensional textons. Intl. Journal of Computer Vision 43 (2001) 29–44
5. Varma, M., Zisserman, A.: Texture classification: are filter banks necessary? In: Proc. Computer Vision and Pattern Recognition. (2003) II:691–698
6. Pietikainen, M., Nurmela, T., Maenpaa, T., Turtinen, M.: View-based recognition of real-world textures. Pattern Recognition 37 (2004) 313–323
7. Hayman, E., Caputo, B., Fritz, M., Eklundh, J.O.: On the significance of real-world conditions for material classification. In: Proc. 8th European Conf. on Computer Vision, Prague. (2004) IV:253–266
8. Efros, A., Leung, T.: Texture synthesis by non-parametric sampling. In: Proc. Int. Conf. on Computer Vision. (1999) 1033–1038
9. Sonka, M., Hlavac, V., Boyle, R.: Image Processing, Analysis and Machine Vision, 2nd edn. Thomson Learning Vocational (1999)
10. Tavakoli Targhi, A., Shademan, A.: Clustering of singular value decomposition of image data with applications to texture classification. In: VCIP. (2003) 972–979
11. Tavakoli Targhi, A., Hayman, E., Eklundh, J.O., Shahshahani, M.: The eigen-transform and applications. In: ACCV (1). (2006) 70–79
12. Golub, G.H., van Loan, C.F.: Matrix Computations. The Johns Hopkins University Press, Baltimore, MD (1989)
13. Press, W., Teukolsky, S., Vetterling, W., Flannery, B.: Numerical Recipes in C, 2nd edn. Cambridge University Press (1992)
14. Demmel, J.W.: Applied Numerical Linear Algebra. Society for Industrial and Applied Mathematics, Philadelphia, PA (1997)
15. Coleman, T., Van Loan, C.F.: Handbook for Matrix Computations. SIAM, Philadelphia (1988)
16. Chan, T.F.: On the existence and computation of LU-factorizations with small pivots. Mathematics of Computation 42 (1984) 535–547
17. Li, J., Wang, J.Z.: Automatic linguistic indexing of pictures by a statistical modeling approach. IEEE Trans. Pattern Anal. Mach. Intell. 25 (2003) 1075–1088
18. Mallikarjuna, P., Fritz, M., Tavakoli Targhi, A., Hayman, E., Caputo, B., Eklundh, J.O.: The KTH-TIPS2 database (2004–5). Available at www.nada.kth.se/cvap/databases/kth-tips
19. Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D.: LAPACK Users' Guide. Third edn. Society for Industrial and Applied Mathematics, Philadelphia, PA (1999)
20. Krueger, J., Westermann, R.: Linear algebra operators for GPU implementation of numerical algorithms. ACM Transactions on Graphics 22 (2003) 908–916
21. Galoppo, N., Govindaraju, N., Henson, M., Manocha, D.: LU-GPU: Efficient algorithms for solving dense linear systems on graphics hardware. In: Proc. ACM/IEEE SC05 Conf., Seattle, WA (2005)
22. Harris, M., Wooley, C.: General-purpose computation using graphics hardware (2006). Available at http://www.gpgpu.org/
23. Woetzel, J., Koch, R.: Multi-camera real-time depth estimation with discontinuity handling on PC graphics hardware. In: Proc. Int'l Conf. Pattern Recognition (ICPR), Cambridge, United Kingdom (2004) 741–744
24. Strzodka, R., Garbe, C.: Real-time motion estimation and visualization on graphics cards. In: Proc. IEEE Visualization 2004, Austin, Texas (2004) 545–552
25. Griesser, A., Roeck, S.D., Neubeck, A., van Gool, L.: GPU-based foreground-background segmentation using an extended colinearity criterion. In: Proc. Vision, Modeling and Visualization (VMV), Erlangen, Germany (2005) 319–326
