Jan Roters · Xiaoyi Jiang
FestGPU: A Framework for Fast Robust Estimation on GPU
Abstract Robust estimation is used in a wide range of applications. One of the most popular algorithms for robust estimation is the Random Sample Consensus (RANSAC), which achieves a high degree of accuracy even with a significant amount of outliers. A major drawback of RANSAC is that the number of iterations grows rapidly with the outlier ratio, resulting in increasing computational costs. In this paper FestGPU, a framework for Fast robust ESTimation on GPU, is presented which reaches a speedup of up to 135x compared to a single-core CPU. Together with a C++ and a Matlab interface the framework is made publicly available on the authors' website for the research community.
Keywords Robust Estimation · RANSAC · GPGPU · Framework

J. Roters (E-mail: [email protected]) · X. Jiang (E-mail: [email protected])
Department of Mathematics and Computer Science, University of Münster, Einsteinstr. 62, 48149 Münster, Germany
Tel.: +49-251-83-33759, Fax: +49-251-83-33755

1 Introduction

In several application domains model parameters have to be estimated to fit a specific dataset. While estimating models with exact data measurements works fine, even a small amount of outliers can lead to a falsely estimated model. Especially for automated applications, e.g. 3D reconstruction from unordered images, the impact of outliers is enormous. To reduce this impact robust estimation algorithms are required.

Problems related to robust estimation can often be solved with randomized algorithms like Random Sample Consensus (RANSAC [8]) and Least Median of Squares (LMS [23]). On the one hand these algorithms provide an accurate estimate even with a significant amount of outliers. On the other hand the computation time required increases dramatically with the ratio of outliers and the number of datapoints required to compute a hypothesis. RANSAC, LMS and other subtypes of the RANSAC family are applicable to a wide range of estimation problems, e.g. homography and fundamental matrix estimation [12], pose estimation [8] and surface estimation from range data [24]. RANSAC is often used in structure from motion systems [22] where the structure is estimated automatically from multiple images. These systems are prone to outliers and noisy data. Even large estimation problems [2,3] can be solved with RANSAC. Robust estimation using random hypothesis computation and testing algorithms is also used in other research fields, e.g. the finance market [28].

Not only with estimation algorithms like RANSAC and LMS, but also with all sorts of pattern recognition, computer vision and machine learning tasks, the computational time requirements are increasing dramatically. To solve these problems with a certain amount of speedup the usage of GPU computation is a popular topic in the community. Recently, many problems have been solved using programmable graphics hardware. Feature detectors [25] and matching algorithms [7] show a certain amount of speedup. In dense 3D reconstruction systems [21] GPU implementations are used to speed up the feature matching and triangulation. Even large non-linear optimization problems, for instance bundle adjustment [29], have been implemented on the graphics device. Furthermore, several segmentation problems [17] as well as learning algorithms [4] and video encoders [14] have been transferred to the GPU. Other robust estimation methods have been developed running on the GPU, e.g. the detection of lines using the Hough transform [13]. On a higher level, some computer vision frameworks based on GPU usage have been developed giving access to different functionality [1,10].

In this work a framework for fast robust estimation on GPU is introduced that reduces the computation time significantly. Different subtypes of the RANSAC family and LMS can be used to estimate the model parameters. We are focusing on four main objectives: (1) The framework should deliver a reasonable speedup especially for larger estimation
problems, (2) it should be easy to use even with limited programming skills, (3) it should be as generic as possible to be adaptable to other subtypes of the RANSAC family, and (4) it should be usable in a wide range of applications. We do not aim at providing the fastest, customized solution to a single robust estimation problem. Instead, we aim to provide a generic framework for robust estimation on the GPU that is easy to use and achieves a certain amount of speedup compared to the estimation on the CPU. The framework provides generic routines to estimate a model. It is designed to be as generic as possible, so that it can be used through its interfaces from different programming languages. Two interfaces are provided by the framework. First, a Matlab interface which acts as a high-level API and gives easy access to robust estimation. It is an example interface that demonstrates how to implement additional interfaces. Second, a C++ interface which can be directly called from other applications and which can be used to implement additional interfaces such as Python, Java, Halcon and more.

Two different examples are evaluated to measure the speedup of the framework: (1) line estimation from 2D points and (2) fundamental matrix estimation from 2D point matches. The evaluation shows a speedup of more than 130 times using the GPU instead of the CPU. The framework and the shown examples will be published on the authors' website [20].

The remainder of the paper is structured as follows: Section 2 covers the RANSAC family and describes the basics of the RANSAC estimation algorithm. In Section 3 our framework is presented. The practical usage of the framework is described in Section 4. An evaluation of two examples and their results are shown in Section 5. Section 6 shortly covers the estimation of LMS on the GPU. We conclude the paper in Section 7 and give a brief overview of future work.

2 Random Sample Consensus

Even if the underlying dataset is contaminated with outliers, estimators of the RANSAC family [5] can be used to robustly estimate the model parameters. Many of these estimators are based on different loss functions (e.g. RANSAC [8], MSAC [27] and MLESAC [27]), different model selection (e.g. MAPSAC [26] and QDEGSAC [9]) or guided sampling (e.g. PROSAC [6] and GOODSAC [16]). The main advantage of the algorithms of the RANSAC family is the robust estimation with a high degree of accuracy even if the underlying datasets contain a significant number of outliers. In general, problems with automatic data acquisition and data processing are known to have a high amount of outliers. One example is fundamental matrix estimation: it is common that the dataset contains more than 50% outliers since the point correspondences are determined automatically. In addition to the robustness against outliers, estimators of the RANSAC family are applicable to a wide range of different estimation problems.

In contrast, there are two major disadvantages. First, the estimation performance depends on thresholds that are specific for the given problem. Second, with randomized algorithms like RANSAC there is no upper bound on computation time since RANSAC-like algorithms do not memorize the previously chosen samples, so it is possible that a certain subset of datapoints is used multiple times. Since there is no upper bound limiting the computation time, an optimal solution cannot be guaranteed. Especially because of the latter disadvantage it is favorable to estimate the model parameters on the GPU to reduce the computation time.

[Fig. 1: Impact of outliers with (a) non-robust line fitting using least squares and (b) robust line fitting using RANSAC. The blue line represents the estimated result. Gray lines show the RANSAC iterations.]

In the following we present the RANSAC algorithm in more detail. For this purpose we concentrate on the non-adaptive RANSAC algorithm [8] since other subtypes of the RANSAC family can be adapted easily. First, the structure of the RANSAC algorithm is covered (cf. Sec. 2.1). Thereafter, we discuss the different parameters used in RANSAC (cf. Sec. 2.2).
2.1 Structure of RANSAC

In general, non-robust estimation algorithms use the entire dataset to find the model parameters, so that the outliers are included as well. These may have a large impact on the result (cf. Fig. 1a). In comparison, RANSAC does not use the entire dataset in every step. RANSAC uses a randomly chosen subset to compute a hypothesis; the remaining datapoints are used to validate it. By repeating the hypothesis computation and validation with different subsets, the probability increases to find a hypothesis that fits the data well (cf. Fig. 1b).

Suppose a dataset contains n datapoints and the specified problem requires k < n datapoints to compute a hypothesis. Then the remaining n − k datapoints are used to evaluate this hypothesis: for each remaining datapoint a loss value is computed, and these loss values are finally used to rate the hypothesis. The concrete RANSAC algorithm is outlined in Algorithm 1.
Algorithm 1: Computation steps of the RANSAC algorithm. N is the number of iterations, k is the number of datapoints required to compute a hypothesis, and t is a problem-specific threshold.

(1) Iterate steps (2)–(5) N times
(2) Randomly choose a subset of k datapoints
(3) Compute a hypothesis with the chosen subset
(4) Compute the loss value e_i for each remaining datapoint i of the input dataset
(5) Count the datapoints for which e_i > t, called outliers
(6) Choose the hypothesis with the minimum number of outliers

2.2 Parameters of the RANSAC Algorithm

The RANSAC algorithm requires several parameters (cf. Alg. 1). Parameter k is the number of chosen datapoints to compute a hypothesis. This number depends on the specific problem and the corresponding algorithm to compute a hypothesis; e.g. for fundamental matrix estimation there are several algorithms to compute the hypothesis with k ≥ 7 [12]. The problem-specific threshold t describes the maximum loss value, so that a tested datapoint i with a loss value e_i ≤ t is called an inlier; otherwise it is called an outlier. For instance, when estimating 2D lines from 2D points, t may describe the maximum distance between a tested point and a line (the hypothesis) used to decide whether the point is an inlier or an outlier.

The remaining parameter is the number of iterations N. In the following we discuss the minimum iteration count N required to estimate the model parameters with a specified probability p, called confidence, when the dataset is contaminated with an outlier ratio ε. In many problems ε is unknown; this case is discussed later. Given the outlier ratio ε, the probability q that each of the N randomly chosen subsets of k datapoints contains at least one outlier is given by

    q = (1 − (1 − ε)^k)^N .    (1)

Then the converse probability p = 1 − q of selecting at least one outlier-free subset in N iterations is given by

    p = 1 − (1 − (1 − ε)^k)^N .    (2)

Solving for N, the iteration count required to estimate the model parameters with confidence p from a dataset with outlier ratio ε is

    N = log(1 − p) / log(1 − (1 − ε)^k) .    (3)

For a specific problem the computation time to estimate a model with the framework presented in this paper depends on the number of required RANSAC iterations N as given in Equation (3). Since the confidence p of choosing at least one outlier-free subset of k datapoints is usually set to p ≥ 0.99, the number of iterations N depends only on the number of required datapoints k and the assumed outlier ratio ε. For both parameters a larger value leads to more iterations, as illustrated in Figure 2. The number of datapoints k depends on the specific problem, whereas the outlier ratio ε depends on the quality of the input data.

[Fig. 2: The influence of the RANSAC parameters on the number of iterations N for k = 2, 4, ..., 20 (p = 0.99). For larger values of k the number of iterations N increases faster for a given outlier ratio ε.]

For many applications and datasets the ratio of outliers is unknown. In this case the iteration count can be adapted during the estimation: the RANSAC algorithm is initialized with a worst-case outlier ratio ε = 1. After each iteration the number of inliers n_in is counted and the outlier ratio is updated to

    ε = 1 − n_in / n ,    (4)

where n is the size of the dataset. Based on ε the number of iterations N is updated using Equation (3).
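Equation (3) is the heart of the termination criterion and translates directly into code. The following is a minimal C++ sketch of Equation (3); the function name and the handling of the edge cases ε = 0 and ε = 1 are our own choices, not part of the framework's API.

    #include <cmath>
    #include <limits>

    // Minimal sketch of Equation (3): the number of RANSAC iterations N
    // needed to draw at least one outlier-free subset of k datapoints with
    // confidence p from a dataset with outlier ratio eps.
    long requiredIterations(double p, double eps, int k)
    {
        double good = std::pow(1.0 - eps, k); // P(one subset is outlier-free)
        if (good >= 1.0) return 1;            // eps = 0: first subset is clean
        if (good <= 0.0)                      // eps = 1: no finite bound
            return std::numeric_limits<long>::max();
        return static_cast<long>(std::ceil(std::log(1.0 - p)
                                           / std::log(1.0 - good)));
    }

For example, requiredIterations(0.99, 0.5, 8) yields N = 1177, which matches the fundamental matrix scenario with 50% outliers discussed in Section 5.2.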
3 Estimation Framework
With this paper we aim at providing a framework for fast robust estimation on GPU. Furthermore, it is as generic as possible and, due to the C++ interface, usable from many different programming languages, e.g. Matlab. The framework is based on CUDA C since it provides good documentation and debugging capabilities. These make it as easy as possible to help the user provide the required problem-specific functions. On the downside, only NVIDIA graphics devices are currently supported by the framework.

Our framework provides access to many estimators of the RANSAC family. For instance, GOODSAC or PROSAC can be used to estimate the model parameters when the order of random samples is provided. By modifying the loss function, MSAC or MLESAC can be realized. Without any modifications or any data provision, the model parameters are estimated with adaptive RANSAC or non-adaptive RANSAC. Besides, it is possible to estimate with LMS. Since the framework is open source, it can be extended easily by implementing additional functionality.
In Section 3.1 the basic structure of the framework and the workflow of the estimation process are described. We investigate the memory limit of graphics devices and its influence on RANSAC estimation in Section 3.2.

[Fig. 3: Overview of the RANSAC and LMS estimation framework. A program using the framework calls the interfaces (C, Matlab, ...), which invoke the estimation algorithm (RANSAC, LMS, ...) and the CUDA estimation kernel; the application-specific methods are the hypothesis computation and the loss value computation. The gray parts are provided by the framework, the white parts have to be implemented for each application.]

3.1 Workflow and Structure

The framework for robust estimation on GPU (cf. Fig. 3) is divided into two parts: (1) the functions to be implemented by the user (white) and (2) the provided functionality of the framework (gray). An application using the framework only calls functions of the interfaces. These calls start the internal estimation process, which uses problem-specific functions implemented in CUDA C.

Figure 4 shows the internal structure of the estimation process of RANSAC and LMS. For convenience only the non-adaptive case is presented; the differences to the adaptive case are discussed at the end of this section. The advantage of the GPU is gained by executing each RANSAC iteration in a separate CUDA thread. As shown in Figure 4(2), each thread computes the random numbers that are used to choose the subset for the further steps. After checking for duplicates and degeneration, the hypothesis is computed with this subset, and the loss values are computed and rated. Besides step (2) of Figure 4, step (3) is executed on the GPU as well: when the ratings of all iterations have been computed, the minimum rated loss value is determined using the reduction principle as described in [19].

[Fig. 4: Overview of the internal calls of the non-adaptive estimation with RANSAC and LMS: (1) prepare the memory of the graphics device, (2) the CUDA estimation kernel with the per-iteration steps (a) generate (random) samples, (b) check for duplicates, (c) check for degeneration, (d) compute hypothesis, (e) compute loss values and (f) rate loss values, and (3) find the hypothesis with the minimum loss value. The CUDA estimation kernel (2) is called with a number of threads that is equal to the number of required RANSAC iterations.]

When the first estimation is started the framework initializes itself automatically. To reduce the computation time of the first estimation caused by the initialization, it is possible to call the initialization manually before the first estimation. The internal estimation process consists of three major steps (cf. Fig. 4):
(1) The internal estimation allocates enough memory on the graphics device and transfers the required data. Besides the data provided by the user, other data has to be transferred such as the seeds of the random number generator. For each concurrently executed thread a seed consisting of two integer values is required.

(2) The CUDA estimation kernel is executed on the graphics device with a given number of CUDA threads T and CUDA blocks B. These depend on the number of required iterations N = T · B. Since the number of threads T is constant, the number of iterations is directly related to the number of blocks B. In general, T should be a multiple of the warp size of the CUDA device. In the following the index of the iteration is i ∈ {0, ..., N − 1}.

(a) At the beginning of each iteration, samples have to be chosen to compute a hypothesis. With RANSAC random samples are chosen. For other subtypes of the RANSAC family it is also possible to specify the order of the samples to provide some sort of assessment-driven selection, e.g. PROSAC [6] and GOODSAC [16]. Since there is no built-in random number generator on the GPU, we have chosen to use the multiply-with-carry method invented by George Marsaglia in conjunction with the add-with-carry method [15].

(b) Before a hypothesis is computed, there are two steps to filter subsets that are in poor condition. The first filter step checks for duplicates; when duplicates are detected, the subset is rejected. In general, this check could also be done in the next step, the check for degeneration, but since it is required by almost all applications we decided to integrate this check into the framework, with the possibility to omit it.

(c) The second filter for ill-conditioned subsets is optional and is done by a problem-specific function that may be implemented by the user. This function can decide whether or not the chosen subset should be rejected.

(d) To compute the hypothesis another problem-specific function is called which takes the chosen subset as input and returns a hypothesis based on the implementation.

(e) For each remaining datapoint in the dataset a loss value is computed with the problem-specific loss function (cf. Fig. 5).

(f) The rating of the loss values e_j with j ∈ {0, ..., n − k − 1} differs between RANSAC and LMS. When estimating with LMS the rating r_i of the hypothesis of the i-th iteration is given by the median of the loss values,

    r_i = median_j e_j .    (5)

For many subtypes of the RANSAC family the rating is given by the sum of the loss values,

    r_i = Σ_j e_j ,    (6)

e.g. RANSAC [8], MSAC [27] and MLESAC [27]. The loss functions are shown in Figure 5; with RANSAC the outliers are counted.

(3) The index i_best of the best rating r_i is given by

    i_best = arg min_i r_i .    (7)

[Fig. 5: Visualization of different loss functions of the RANSAC family (RANSAC, MSAC, MLESAC) and least squares.]

The hypothesis with index i_best represents the best model parameters and is returned. Besides, it is possible to retrieve the indices of the datapoints that have been used to compute the hypothesis and the indices of the inliers. The datapoints, for instance, can be used to recompute the final result with a higher precision, e.g. by applying least squares regression to the estimated inliers.

Adaptive termination. In contrast to the non-adaptive RANSAC, where the CUDA estimation kernel is called only once to estimate a hypothesis from a given dataset, with adaptive RANSAC the estimation kernel is called multiple times with a smaller thread count. To achieve the adaptive termination, after each call of the estimation kernel the required number of iterations is adapted based on the measured count of inliers as given in Equations (3) and (4).

Each execution of a CUDA kernel and each data transfer between the graphics device and the host requires additional computation time. Since the adaptive estimation is based on multiple executions of the estimation kernel and multiple data transfers, we propose an optimization strategy to reduce this overhead. The main idea of the optimization strategy is to start with a small number of iterations and to double this number each time the estimation kernel is called. The first call computes a minimum count of iterations N_0. Thus, the estimation with datasets containing only small amounts of outliers may be finished after the first call. Otherwise, the results of the first call give a hint about the outlier ratio in the dataset due to Equation (4), and the new required total count of iterations N_total is computed with Equation (3). For a subsequent call j ∈ N+ of the estimation kernel the number of iterations is doubled:

    N_j = 2 · N_{j−1} = 2^j · N_0 .

Increasing the number of iterations exponentially has the advantage of quickly estimating the model parameters even if the dataset contains a lot of outliers. As a further optimization we propose to reduce the number of iterations for the last kernel call of an estimation. For the j-th call of the estimation kernel the number of remaining iterations is given by

    N_rem,j = N_total − Σ_{i=1}^{j−1} N_i .

In general, the last call j requires fewer than 2^j · N_0 iterations; thus, only the remaining number of iterations is computed to save computation time. In addition to the exponentially increasing number of iterations, we propose to cap the iteration count due to the memory usage discussed in the next section, so that N_j ≤ N_max.
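To make the interplay of Equations (3) and (4) with the doubling schedule concrete, the following host-side C++ sketch reuses requiredIterations from Section 2.2. The callback launch stands in for one estimation kernel call; all names are our own illustration, not the FestGPU API.

    #include <algorithm>
    #include <functional>

    // Sketch of the adaptive termination strategy: start with N0 iterations,
    // double per kernel call, cap at Nmax, and stop once the executed total
    // reaches the bound from Equation (3). 'launch(Nj)' runs one kernel call
    // of Nj iterations and returns the best inlier count found so far.
    long adaptiveSchedule(double p, int k, long n, long N0, long Nmax,
                          const std::function<long(long)>& launch)
    {
        long total  = 0;    // iterations executed so far
        long Ntotal = N0;   // first call: minimum batch size
        long Nj     = N0;
        do {
            Nj = std::min({Nj, Nmax, Ntotal - total}); // memory cap, last call
            long inliers = launch(Nj);
            total += Nj;
            double eps = 1.0 - double(inliers) / double(n); // Equation (4)
            Ntotal = requiredIterations(p, eps, k);         // Equation (3)
            Nj *= 2;        // exponential growth for the next call
        } while (total < Ntotal);
        return total;
    }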
3.2 Memory Limit

In this section the memory limit of graphics devices is discussed with respect to executing the robust estimation using our framework. Compared to the main memory, graphics devices in general offer less memory. Furthermore, this memory is less flexible since the required data has to be transferred to the graphics device before a CUDA kernel is executed.

To determine the amount of required memory we count the number of memory elements used by the estimation kernel. The size of a memory element is defined by the size of the data type used on the graphics device. In our framework currently only 32-bit datatypes (4 bytes) are used, due to the fact that graphics devices provide many more single precision streaming processors than double precision ones. The memory requirements differ between RANSAC, subtypes of the RANSAC family and LMS. In the following the three common major memory requirements of the estimation on the GPU are described. Thereafter, the differences between the adaptive and the non-adaptive case are discussed. Furthermore, we briefly discuss the memory usage of other subtypes of the RANSAC family and of LMS.

1. Dataset and additional data. Independent of whether the kernel is executed once or many times, the whole dataset has to be available on the graphics device. It is required to compute hypotheses and loss values. The number of memory elements m_1 required by the dataset is given by the number of datapoints n available for the estimation and the size s of each datapoint. Furthermore, the user is able to pass d additional data elements to the problem-specific functions, e.g. additional parameters. Thus, m_1 is given by

    m_1 = n · s + d .    (8)

For instance, the size of a datapoint for 2D line estimation is s = 2 since one 2D point consists of two coordinates.

2. Internal memory. When the estimation kernel is called, a number of RANSAC iterations N is computed, each of which requires memory for internal computations. For each iteration the indices of the k chosen datapoints, the computed hypothesis of size h and the sum of the loss values have to be stored. The total number of memory elements m_2 for internal usage is then given by

    m_2 = N · (k + h + 1) .    (9)

For instance, the number of elements in a hypothesis in the example of fundamental matrix estimation is h = 9 since the fundamental matrix is of size 3 × 3. Depending on the chosen algorithm, the number of datapoints required to compute a fundamental matrix is k ≥ 7.

3. Problem-specific functions. In the functions that have to be implemented by the user it is possible to allocate additional memory m_u. Since each of the C_T concurrently running threads requires its own memory, the number of memory elements is given by

    m_3 = C_T · m_u .    (10)

The number of concurrently running CUDA threads is C_T = #MP · #WS, where #MP is the number of streaming multiprocessors and #WS is the warp size of the CUDA device.

Since each memory element occupies 4 bytes, the total amount of used memory is given by

    m = 4 · (m_1 + m_2 + m_3) .    (11)

This is only an approximation of the memory usage since there are some minor memory uses not covered in the above description. These can be neglected due to their small size and for reasons of convenience.

[Fig. 6: Possible outlier ratio plotted against available memory (up to 1024 MB) with non-adaptive RANSAC for estimations with different numbers of required samples k = 3, 6, 9, 12, 15. (p = 0.99, n = 2 000, h = 9)]
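As an illustration, Equations (8)–(11) translate directly into a small helper; the struct and field names below are our own, not part of the framework:

    #include <cstdint>

    // Memory model of Equations (8)-(11); returns the estimated usage in
    // bytes, assuming 4-byte (32-bit) memory elements as in the framework.
    struct MemoryModel {
        std::int64_t n, s, d;   // datapoints, datapoint size, additional data
        std::int64_t N, k, h;   // iterations, sample size, hypothesis size
        std::int64_t CT, mu;    // concurrent threads, per-thread user memory

        std::int64_t bytes() const {
            std::int64_t m1 = n * s + d;       // Eq. (8): dataset
            std::int64_t m2 = N * (k + h + 1); // Eq. (9): per-iteration data
            std::int64_t m3 = CT * mu;         // Eq. (10): user allocations
            return 4 * (m1 + m2 + m3);         // Eq. (11)
        }
    };

Solving bytes() ≤ M for N reproduces the maximum iteration count used below: N_max = (M/4 − m_1 − m_3) / (k + h + 1).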
Non-adaptive. In the non-adaptive case the kernel is executed only once. The whole memory required by the kernel needs to be allocated before the estimation process is started. In this case the number of iterations N in the call of the estimation kernel is equal to the total number of iterations of the RANSAC estimation (cf. Eq. (3)). Depending on the memory limitations and the size of the estimation problem, it may not be possible to estimate the model parameters with a given confidence p and outlier ratio ε. In particular, m_2 is the only memory requirement that depends on the iteration count N, which increases very quickly with the number of required samples and the outlier ratio. The maximum count of iterations for non-adaptive RANSAC on the GPU can be computed by solving Equation (11) for N. Based on this count it is possible to determine the maximum outlier ratio for a given application by solving Equation (3) for ε. Figure 6 illustrates the relation between available memory and outlier ratio. For instance, on a graphics device with 128 MB of free memory it is possible to estimate a fundamental matrix (k = 8) from a set of 2 000 2D point correspondences with a confidence of p = 0.99 from a dataset with an outlier ratio of up to ε ≈ 0.81.

Adaptive. The adaptive RANSAC is much more flexible since the computation kernel is executed many times and, in-between, memory can be reallocated and new data can be copied onto the graphics device. In Section 3.1 an optimization strategy has been described which increases the number of iterations N exponentially for each estimation kernel execution. This strategy helps
to decrease the number of kernel calls and the number of data transfers between the host system and the graphics device, thus saving computation time. The number of iterations is capped, so that for each kernel call j the number of iterations is at most N_j ≤ N_max. The advantage is that the whole required memory can be allocated before the first call of the estimation kernel, saving further computation time and memory resources. On that account only new data has to be transferred between the host and the device between consecutive calls.

Other subtypes of the RANSAC family. Subtypes of the RANSAC family which use additional information also require additional memory. For instance, GOODSAC [16] and PROSAC [6] provide some sort of assessment-driven selection of the input data that depends on a previous quality measurement, e.g. the goodness of feature matches. Hence, not only the input data has to be provided, but also a list of indices that can be used instead of random numbers.

Least Median of Squares (LMS). When estimating model parameters with LMS, each concurrent CUDA thread stores the loss values of all remaining datapoints in order to search for the median. For this purpose the memory requirement m_3 (cf. Eq. (10)) has to be changed to

    m_3 = C_T · (m_u + n − k) .

4 Framework Usage

For convenience, the framework for robust estimation on the GPU handles much of the functionality itself while being adaptable to many different problems. To this end, the framework estimates model parameters by using problem-specific functions which have to be implemented by the user. These functions have to be implemented in CUDA C, so that their implementation is similar to usual C libraries.

In this section the practical usage of the estimation framework is shown to demonstrate that the framework is easy to use. For this purpose, the example of 2D line estimation is used to explain the required problem-specific functions that prepare the framework (cf. Sec. 4.1). Afterwards, the provided interfaces in C++ and Matlab are described (cf. Sec. 4.2).

4.1 Preparation

Currently, there are several restrictions with programmable graphics devices. For instance, when estimating model parameters, the amount of memory allocated inside the kernel and the problem-specific functions has to be known before the framework is compiled. To make the memory management as easy as possible, the memory is allocated and managed by the framework. The required amount of memory inside the kernel functions is given by three problem-specific values:

1. HYPOTHESIS_SIZE (h) defines the number of memory elements required to store a hypothesis.
2. PARAMS_REQUIRED (k) stores the number of datapoints required to compute a hypothesis.
3. DATA_PT_SIZE (s) defines the number of memory elements of one datapoint.

The user has to implement a class that includes the problem-specific functions. The class is then instantiated with these three problem-specific values as template parameters, so that the parameters are available when the framework is compiled. Additional data like thresholds or other parameters required by the computation may be passed in as class attributes; the framework takes care of transferring them to the graphics device.

As a convention of the framework, the input data is structured in datapoints. Each datapoint i ∈ {0, ..., k − 1} consists of s memory elements e_i0, ..., e_is−1. These elements are stored at the input array positions i · s, ..., (i + 1) · s − 1. The user is responsible for the order of the memory elements in the datapoints, so that in the problem-specific functions the order of each datapoint is exactly the same. For instance, the datapoints for 2D line estimation are given by {x0, y0, x1, y1, ...}.

In the following we describe the two functions which are required by our framework, the hypothesis computation and the loss value computation, to explain the basic structure and how problem-specific functions have to be implemented correctly. In addition, there are optional functions which affect the estimation process, for instance, a function that checks for degeneracies of the chosen subsamples. The functions running on the GPU are labeled __device__, which gives the CUDA compiler the hint to compile them for the graphics device.

Hypothesis computation. This function computes a hypothesis from a given subset of datapoints. The subset contains exactly the k datapoints required to compute the hypothesis. For 2D line estimation the two chosen points themselves represent the line:

    __device__ void computeHypothesis(const float *inDataPts,
                                      float *outHypothesis)
    {
        outHypothesis[0] = inDataPts[0]; // x1
        outHypothesis[1] = inDataPts[1]; // y1
        outHypothesis[2] = inDataPts[2]; // x2
        outHypothesis[3] = inDataPts[3]; // y2
    }

The datapoints are stored in inDataPts, which has the size k · s. The framework takes care of allocating inDataPts and outHypothesis; furthermore, inDataPts gets filled with the chosen subsamples.

Loss value computation. To validate an estimated hypothesis, a loss value is computed for each remaining datapoint, i.e. the datapoints that have not been used to compute the hypothesis.

    __device__ float computeLossValue(float *inHypothesis, float *inDataPt)
    {
        float px = inDataPt[0];
        float py = inDataPt[1];
        float x1 = inHypothesis[0];
        float y1 = inHypothesis[1];
        float x2 = inHypothesis[2];
        float y2 = inHypothesis[3];
        // Squared distance between the point (px, py) and the line
        // through (x1, y1) and (x2, y2).
        float x2x1 = x2 - x1;
        float y2y1 = y2 - y1;
        float z = x2x1*(y1-py) - y2y1*(x1-px);
        float d = z*z / (x2x1*x2x1 + y2y1*y2y1);
        return (d > threshold);
    }

For better readability the required values from the input arrays are copied to local variables. This step is not required; it just helps to illustrate the computation of the squared distance d between the line inHypothesis and the point inDataPt. The loss value returned by this function depends on the chosen estimation algorithm, e.g. RANSAC, MSAC or MLESAC. In the above example the loss value is either 0 or 1, depending on the comparison between the squared distance d and the squared threshold stored in the class attribute threshold. This corresponds to the loss values of RANSAC.

Since the problem-specific functions are executed many times, they should be optimized to require as little computation time as possible. The hypothesis computation is executed once per iteration, i.e. N times. In contrast, the number of executions of the loss value function is N · (n − k), i.e. for each iteration the loss values of all remaining datapoints are computed.
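The optional degeneracy check mentioned above can be implemented in the same style. The following is a hypothetical sketch for the line example; the exact callback name expected by the framework is not shown in this paper, so the signature is our assumption:

    // Hypothetical degeneracy check for 2D line estimation: two coincident
    // points do not define a line, so such subsets should be rejected.
    __device__ bool checkDegeneration(const float *inDataPts)
    {
        float dx = inDataPts[2] - inDataPts[0];
        float dy = inDataPts[3] - inDataPts[1];
        return (dx*dx + dy*dy) == 0.0f; // true means: reject this subset
    }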
4.2 Interfaces

Two different interfaces are included in the estimation framework. The basic interface is written in C++ and can be used to implement additional interfaces such as the Matlab interface provided with this framework. The usage of both is described briefly to illustrate the simplicity of using the framework. Some further functions and calls are not described here but are covered in the documentation.

C++ interface. To call the estimation from C++, the header of the framework, FestGPU.cu, has to be included in the application. Two functions of the C++ interface are presented here, one for the initialization and one to start the estimation. The initialization of the GPU framework is explicitly executed by

    ResultValueType state = FestGPU_Initialize();

A value state is returned that indicates either that the framework has been initialized successfully or that a specific error has occurred. To estimate a model using the framework with adaptive RANSAC, the following function can be called:

    ResultValueType state = FestGPU_Adaptive(estimator, dataset, bestModel);

Here estimator is an instance of the class with the problem-specific methods. In the vector dataset the datapoints are stored as described in Section 4.1. When the estimation is finished, the estimated model parameters are stored in bestModel, which is a reference to a vector. Besides, there are function calls with additional arguments, for instance, to specify the confidence p. If not specified, the default confidence is p = 0.99.

Matlab interface. Two Matlab functions of the interface are presented here. The initialization of the GPU framework is explicitly executed by

    [state, msg] = FestGPU_Initialize

and returns an error value. If the framework has been initialized successfully, state is true; otherwise it is false and an error description is given in msg. The estimation of a model using the framework is executed with

    [model] = FestGPU_Estimate(dataset, options)

The datapoints of the dataset are stored in the columns of the matrix dataset. To provide additional data to the problem-specific functions in the estimation process, the options parameter can be used. There are additional optional parameters for this call, e.g. the confidence or the inlier ratio. If an inlier ratio is given, the non-adaptive estimation is started. The default confidence is p = 0.99.
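Putting the C++ calls above together, a minimal host program could look as follows. The estimator class LineEstimator and its threshold attribute are assumptions for the sake of the example; only FestGPU.cu, FestGPU_Initialize and FestGPU_Adaptive are taken from the interface described above.

    #include "FestGPU.cu"
    #include <vector>

    int main()
    {
        // One-time GPU setup; can also be left to the first estimation call.
        FestGPU_Initialize();

        LineEstimator estimator;             // hypothetical user class with
        estimator.threshold = 0.01f * 0.01f; // the problem-specific functions

        std::vector<float> dataset;          // filled with {x0, y0, x1, ...}
        std::vector<float> bestModel;        // receives the best hypothesis

        ResultValueType state = FestGPU_Adaptive(estimator, dataset, bestModel);
        return 0;
    }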
5 Examples and Results

Although estimation with many different subtypes is possible with the proposed framework, for reasons of convenience only the adaptive and the non-adaptive RANSAC are evaluated in this section. We have measured the mean time of our framework while running on the GPU, on the CPU using only one core (CPU 1x), and on the CPU using all four cores (CPU 4x). To compute the mean time each estimation is repeated 100 times. Especially for the adaptive estimation this repetition is required since, due to the randomization, a single estimation can be much slower or much faster than the mean. Both the computations on the GPU and those on the CPU are based on single precision floating point datatypes.

The system on which the framework has been evaluated is equipped with a quad-core CPU running at 2.80 GHz, the Intel Core i7 930, and 12 GB of RAM. The CPU features the TurboBoost capability at 3.06 GHz, i.e. when only one core is used the operating frequency is increased. The CUDA device is an NVIDIA GeForce GTX 470 running at 1.22 GHz with 1280 MB of device memory. To process the results all of its 448 CUDA cores have been used. The evaluation has been performed with CUDA version 5.0.

At the first execution of an estimation the framework has to be initialized. This process is required only once and takes about 600 ms. For the measurements the framework has been initialized before the first estimation.

To evaluate the framework two applications for robust estimation are presented. For each example 13 outlier ratios from ε = 0.2 to ε = 0.8 in steps of 0.05 are evaluated. In
Section 5.1 we show the results of estimating 2D lines from sets of 2D points. The second example shows fundamental matrix estimation from sets of 2D point correspondences (cf. Sec. 5.2). For both examples the adaptive and the non-adaptive RANSAC versions are evaluated. Finally, the results are discussed (cf. Sec. 5.3).

5.1 2D Line Estimation

The problem-specific functions for 2D line estimation in this evaluation are given in Section 4.1. To estimate a line from 2D points, the computation of the model only requires two 2D points. Since each pair of distinct points represents a line without any explicit transformation, the hypothesis computation does not require a large amount of time when executed. The loss value is given by the squared Euclidean distance between a 2D point of the dataset and the line that has been estimated.

Dataset. To evaluate the RANSAC framework with 2D line estimation, we have generated a ground truth dataset consisting of n = 10 000 2D points for each outlier ratio from ε = 0.2 to ε = 0.8 in steps of 0.05 (cf. Fig. 7). The 2D points have been generated randomly in the following way. At first, a line has been chosen manually which defines our ground truth. This line is given by

    (x, y)ᵀ = (0, 0.4)ᵀ + s · (1, 0.2)ᵀ with s ∈ R.

Second, a threshold t = d_t² with distance d_t = 0.01 has been chosen to decide whether a point is an inlier or an outlier. Finally, for each evaluated outlier ratio ε_i, ε_i · n outlier points and (1 − ε_i) · n inlier points have been selected randomly. The inliers have been selected randomly with a uniform distribution in the direction of the line and with a normal distribution perpendicular to the line. The outliers have been chosen with a uniform distribution in the range x, y ∈ [0, 1] with a gap of width 2 · d_t perpendicular to the line.

[Fig. 7: Example of ground truth points for line estimation with (a) outlier ratio ε = 0.20 and (b) outlier ratio ε = 0.80. In each figure 1 000 sample points are shown. Blue points are inliers with a threshold of squared distance t = d_t² with d_t = 0.01, red points are outliers.]
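The description above translates into a short generator. The following C++ sketch is our reconstruction of the procedure; the spread of the perpendicular noise is not given in the paper, so its value here is an assumption:

    #include <cmath>
    #include <random>
    #include <vector>

    // Ground-truth generator for the line (0, 0.4) + s*(1, 0.2), dt = 0.01.
    // Returns n points as {x0, y0, x1, y1, ...} with eps*n outliers.
    std::vector<float> makeLineDataset(int n, double eps, unsigned seed)
    {
        std::mt19937 rng(seed);
        std::uniform_real_distribution<double> uni(0.0, 1.0);
        std::normal_distribution<double> perp(0.0, 0.003); // assumed spread
        const double dt = 0.01;
        const double len = std::sqrt(1.0 + 0.2 * 0.2);     // |(1, 0.2)|
        const double nx = -0.2 / len, ny = 1.0 / len;      // unit normal
        const int inliers = static_cast<int>((1.0 - eps) * n);

        std::vector<float> pts;
        pts.reserve(2 * n);
        for (int i = 0; i < n; ++i) {
            double x, y;
            if (i < inliers) {      // uniform along the line, normal across
                double s = uni(rng), d = perp(rng);
                x = s + d * nx;
                y = 0.4 + 0.2 * s + d * ny;
            } else {                // uniform outliers outside the gap of
                do {                // total width 2*dt around the line
                    x = uni(rng);
                    y = uni(rng);
                } while (std::fabs((y - 0.4 - 0.2 * x) / len) <= dt);
            }
            pts.push_back(static_cast<float>(x));
            pts.push_back(static_cast<float>(y));
        }
        return pts;
    }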
Measurements. Figure 8 shows the measurements of line estimation with non-adaptive RANSAC: Figure 8a presents the mean computation time, and Figure 8b shows the speedup of the GPU compared to CPU 1x and CPU 4x. For outlier ratios ε ≥ 0.52 the GPU requires less computation time than CPU 1x, and for ε ≥ 0.71 it requires less time than CPU 4x.

[Fig. 8: Performance of 2D line estimation with non-adaptive RANSAC (p = 0.99): (a) mean computation time in ms for GPU, CPU 4x and CPU 1x, (b) speedup of the GPU over CPU 4x and CPU 1x, both plotted against the outlier ratio.]

The results of the adaptive RANSAC are shown in Figure 9. Compared to the non-adaptive RANSAC, the adaptive version performs slightly worse. Estimating 2D lines adaptively on the GPU is beneficial for ε ≥ 0.55 compared to CPU 1x and for ε ≥ 0.74 compared to CPU 4x.

[Fig. 9: Performance of 2D line estimation with adaptive RANSAC (p = 0.99): (a) mean computation time in ms, (b) speedup, both plotted against the outlier ratio.]
5.2 Fundamental Matrix Estimation

The fundamental matrix is a 3 × 3 matrix that describes the geometric relationship between corresponding points in the images of a pair of cameras. For the evaluation the fundamental matrix is computed using the normalized 8-point algorithm [11]. This algorithm requires k = 8 point correspondences to compute the matrix. The loss values are given by the Sampson distance [12]. In general, the point correspondences are acquired automatically. Therefore, it is usual that the dataset is contaminated with a large amount of outliers, e.g. 50% or more.
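As an illustration of a problem-specific loss function beyond the line example, a Sampson-distance loss for the fundamental matrix could look as follows. The formula is the standard first-order Sampson approximation [12]; the memory layout (F stored row-major with h = 9, a correspondence stored as (x, y, x', y') with s = 4) is our assumption, mirroring the conventions of Section 4.1:

    __device__ float computeLossValue(float *inHypothesis, float *inDataPt)
    {
        const float *F = inHypothesis;          // 3x3 matrix, row-major
        float x = inDataPt[0], y = inDataPt[1]; // point in the first image
        float u = inDataPt[2], v = inDataPt[3]; // point in the second image
        // F * (x, y, 1)^T and F^T * (u, v, 1)^T
        float Fx0  = F[0]*x + F[1]*y + F[2];
        float Fx1  = F[3]*x + F[4]*y + F[5];
        float Fx2  = F[6]*x + F[7]*y + F[8];
        float Ftu0 = F[0]*u + F[3]*v + F[6];
        float Ftu1 = F[1]*u + F[4]*v + F[7];
        float r = u*Fx0 + v*Fx1 + Fx2;          // epipolar residual x'^T F x
        return r*r / (Fx0*Fx0 + Fx1*Fx1 + Ftu0*Ftu0 + Ftu1*Ftu1);
    }

For a RANSAC-style loss the returned distance would again be compared against a threshold, as in the line example of Section 4.1.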
Dataset. To evaluate the estimation of the fundamental matrix, we have chosen 2D point correspondences of 10 pairs of images. Each image pair has a similar number of correct point correspondences (about 2 200). To verify that these correspondences are correct, we have generated a multiple view reconstruction; each correspondence is subject to the condition that it has to be visible in at least three images with a small reprojection error. To add outliers, we have generated random 2D point correspondences. For each such point we have checked that the corresponding point has a distance greater than 2 · t to the corresponding epipolar line. For each evaluated outlier ratio ε_i, ε_i · n outliers and (1 − ε_i) · n inliers have been chosen randomly.

Measurements. An example result of the fundamental matrix estimation with the framework for robust estimation is presented in Figure 10. The epipolar lines corresponding to the points in the left image pass through the same object point in the right image.

[Fig. 10: Example result for fundamental matrix estimation. The points in the left image and the corresponding epipolar lines in the right image are shown.]

To show that there is a speedup even with smaller outlier ratios, e.g. ε ≤ 0.5, the figures showing the mean computation time are separated into two plots: the first one presents the outlier ratios 0.2 ≤ ε ≤ 0.5, the second one shows the higher outlier ratios 0.5 ≤ ε ≤ 0.8. Furthermore, this separation is useful to visualize the intersection between each CPU curve and the GPU curve, i.e. the outlier ratio where the speedup is equal to one.

In Figure 11 the results of the fundamental matrix estimation with non-adaptive RANSAC are presented. Our GPU framework estimates the fundamental matrix faster than CPU 1x for outlier ratios ε ≥ 0.2 and faster than CPU 4x for ε ≥ 0.33. With an outlier ratio of ε = 0.5, which is common for fundamental matrix estimation from automatically computed point correspondences, the speedup is about 20x compared to CPU 1x and about 8x compared to CPU 4x. For larger outlier ratios (ε ≥ 0.75) the speedup reaches about 125x for CPU 1x and about 36x for CPU 4x.

[Fig. 11: Performance of fundamental matrix estimation with non-adaptive RANSAC (p = 0.99): (a) mean time for ε ≤ 0.5, (b) mean time for ε ≥ 0.5, (c) speedup.]

Figure 12 shows the results of the adaptive RANSAC estimation. The adaptive version shows a higher speedup compared to the non-adaptive version, especially with lower outlier ratios. For all tested outlier ratios, i.e. 0.2 ≤ ε ≤ 0.8, the GPU framework is faster than the CPU-based implementation. At the outlier ratio ε = 0.5 the speedup is about 30x compared to CPU 1x and about 10x compared to CPU 4x. For extreme outlier ratios (ε = 0.8) the speedup reaches about 135x for CPU 1x and about 36x for CPU 4x.

[Fig. 12: Performance of fundamental matrix estimation with adaptive RANSAC (p = 0.99): (a) mean time for ε ≤ 0.5, (b) mean time for ε ≥ 0.5, (c) speedup.]

5.3 Discussion

The results show that the more iterations of RANSAC have to be computed, the more advantageous the framework for robust estimation on the GPU becomes. The iteration count is related to the number of datapoints k required to compute the hypothesis and to the outlier ratio ε. Thus, the framework is most beneficial for datasets contaminated with a high degree of outliers and for more complex problems such as fundamental matrix estimation with k = 8.

In contrast to complex estimation problems, the example of 2D line estimation with k = 2 shows that there is a speedup even for problems based on a smaller number of required datapoints k (cf. Fig. 8a and 9a), but for those problems the speedup is smaller. This is mainly due to the following reasons. On the one hand, line estimation only requires two random samples, so that even with a large outlier ratio (ε = 0.8) only N = 113 iterations of RANSAC have to be computed (cf. Eq. (3)). On the other hand, the minimum number of threads which can be executed at the same time on the GPU is larger than the required iteration count N of the line estimation example. This is indicated by the mean computation time of the GPU, which is almost constant in both the adaptive and the non-adaptive estimations (cf. Fig. 8a and 9a). Due to our proposed optimization strategy (cf. Sec. 3.1) there is a slight difference between the minimum mean computation times of the GPU for non-adaptive and for adaptive RANSAC: since in the latter case all streaming multiprocessors of the GPU are used, the minimum number of iterations N_0 is larger than in the non-adaptive case.

In the evaluation the time to initialize the framework has not been taken into account. On the system where the framework has been evaluated the initialization requires about 600 ms. For that reason, it may not be beneficial to use the framework for problems that require only a small number of estimations.

Like other implementations on the GPU, the framework is based on single precision floating point computations. The accuracy of the framework has not been measured in the evaluation since it depends only on the error propagation of the problem-specific functions that have to be implemented by the user. The normalization of the input data is a common technique to reduce the error propagation and, consequently, the computation error (cf. Fig. 10). For instance, the computation of the hypothesis for the fundamental matrix estimation is based on the normalized 8-point algorithm [11]. To counter the remaining errors of the single precision computations, it is possible to retrieve the indices of the inliers and the indices of the datapoints that have been used to compute the hypothesis. With this information, the hypothesis can be recomputed with a higher precision to compensate for the computation error.
6 Least Median of Squares

RANSAC and LMS are similar algorithms. In this section we briefly discuss the estimation on the GPU with LMS. The difference between RANSAC and LMS lies in the rating of the loss values (cf. step (5) of Alg. 1): while for RANSAC the sum of the loss values is minimized (cf. Eq. (6), (7)), LMS minimizes the median of the loss values (cf. Eq. (5), (7)).

The advantage of LMS is that it does not require a problem-specific threshold. The disadvantage is its breakdown point of 50%, which indicates that LMS may not estimate a model correctly from a dataset contaminated with an outlier ratio of ε ≥ 0.5. For some problems such as fundamental matrix estimation the outlier ratio may exceed 50% since the point correspondences are usually computed automatically.

Figure 13 shows the mean computation time of 100 repetitions and the speedup against the CPU implementation for fundamental matrix estimation with outlier ratios between ε = 0.1 and ε < 0.5 with step size 0.05. Compared to RANSAC we observe a lower speedup with LMS. Estimation on the GPU is faster than estimation on CPU 1x for ε ≥ 0.34. Compared to CPU 4x, a speedup greater than one is only given for 0.433 < ε < 0.5. Hence, the use of LMS on the GPU for fundamental matrix estimation is not beneficial. Besides the fundamental matrix estimation, we have observed that for line estimation the CPU implementation is faster than the GPU implementation.

[Fig. 13: Performance of LMS estimation of the fundamental matrix with the framework (p = 0.99): (a) mean computation time, (b) speedup.]

The bottleneck in LMS is the computation of the median since everything else is equal to the RANSAC algorithm. We assume that the quickselect implementation of the median is not optimal to run on CUDA devices due to the use of different branches causing warp divergence [18].
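For reference, the LMS rating of Equation (5) amounts to selecting the middle loss value; a minimal host-side C++ sketch using std::nth_element, a quickselect variant like the one discussed above, is:

    #include <algorithm>
    #include <vector>

    // LMS rating of Equation (5): the rating of a hypothesis is the median
    // of the loss values of the n - k remaining datapoints.
    float lmsRating(std::vector<float> losses)
    {
        auto mid = losses.begin() + losses.size() / 2;
        std::nth_element(losses.begin(), mid, losses.end());
        return *mid; // median loss value
    }

On the GPU each CUDA thread performs this selection on its own loss values, which is where the divergent branches mentioned above arise.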
7 Conclusion
In this paper we have presented FestGPU, a framework for robust estimation on GPU which is applicable to a wide range of estimation problems. Two example problems have been presented and evaluated: line estimation from 2D points and fundamental matrix estimation from 2D point correspondences.

The framework aims at estimating models with random hypothesis computation and testing algorithms. Only two problem-specific functions need to be implemented to execute estimations with non-adaptive RANSAC, adaptive RANSAC, LMS, GOODSAC, PROSAC, MSAC, and more. To call the estimation from other applications, two simple interfaces in Matlab and C++ are available.

An evaluation of the framework has been presented in which the computation times of the adaptive and non-adaptive RANSAC algorithms have been measured. Overall, the framework using adaptive and non-adaptive RANSAC shows a good performance: a speedup of up to 135 times compared to a single-core CPU has been measured, and depending on the problem a reasonable speedup has been observed even for small outlier ratios. Finally, we briefly discussed the estimation with LMS on the GPU. Due to the breakdown point of 50% and the current implementation of the median computation on the GPU, the speedup is not as substantial as with RANSAC.

The presented framework for robust estimation on GPU is easy to use and allows a significant speedup. Without additional implementations the framework can be used for both presented examples, i.e. line estimation and fundamental matrix estimation. Currently, the framework is implemented in CUDA C, so that an NVIDIA graphics card is required. We plan to port the framework to OpenCL to support a broader base of computation hardware. Furthermore, the use of multiple graphics devices at the same time should be supported to gain additional speedup.
Acknowledgements This work was developed in the project AVIGLE funded by the State of North Rhine-Westphalia (NRW), Germany, and the European Union, European Regional Development Fund "Europe – Investing in your future". AVIGLE is conducted in cooperation with several industrial and academic partners. We thank all project partners for their work and contributions to the project. Furthermore, we thank Cenalo GmbH for their image acquisition.

References

1. Babenko, P., Shah, M.: MinGPU: A minimum GPU library for computer vision. Journal of Real-Time Image Processing 3(4), 255–268 (2008)
2. Barreto, J., Daniilidis, K.: Fundamental matrix for cameras with radial distortion. In: Proc. of IEEE Int. Conf. on Computer Vision, vol. 1, pp. 625–632 (2005)
3. Brito, J., Angst, R., Köser, K., Zach, C., Branco, P., Ferreira, M., Pollefeys, M.: Unknown radial distortion centers in multiple view geometry problems. In: Computer Vision – ACCV, LNCS, vol. 7727, pp. 136–149. Springer Berlin Heidelberg (2012)
4. Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM Trans. on Intelligent Systems and Technology 2(3), 1–27 (2011)
5. Choi, S., Kim, T., Yu, W.: Performance evaluation of RANSAC family. In: Proc. of the British Machine Vision Conf., pp. 1–12 (2009)
6. Chum, O., Matas, J.: Matching with PROSAC – progressive sample consensus. In: Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, vol. 1, pp. 220–226 (2005)
7. Cornelis, N., Van Gool, L.: Fast scale invariant feature detection and matching on programmable graphics hardware. In: Proc. of IEEE Conf. on Computer Vision and Pattern Recognition (Workshops), pp. 1–8 (2008)
8. Fischler, M.A., Bolles, R.C.: Random Sample Consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Comm. of the ACM 24(6), 381–395 (1981)
9. Frahm, J.M., Pollefeys, M.: RANSAC for (quasi-)degenerate data (QDEGSAC). In: Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, vol. 1, pp. 453–460 (2006)
10. Fung, J., Mann, S.: OpenVIDIA: Parallel GPU computer vision. In: Proc. of the 13th Annual ACM Int. Conf. on Multimedia, pp. 849–852 (2005)
11. Hartley, R.I.: In defense of the eight-point algorithm. IEEE Trans. on Pattern Analysis and Machine Intelligence 19(6), 580–593 (1997)
12. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press (2004)
13. Havel, J., Dubská, M., Herout, A., Jošth, R.: Real-time detection of lines using parallel coordinates and CUDA. Journal of Real-Time Image Processing 9(1), 205–216 (2014)
14. Ko, Y., Yi, Y., Ha, S.: An efficient parallelization technique for x264 encoder on heterogeneous platforms consisting of CPUs and GPUs. Journal of Real-Time Image Processing 9(1), 5–18 (2014)
15. Marsaglia, G., Zaman, A.: A new class of random number generators. The Annals of Applied Probability 1(3), 462–480 (1991)
16. Michaelsen, E., Hansen, W.V., Meidow, J., Kirchhof, M., Stilla, U.: Estimating the essential matrix: GOODSAC versus RANSAC. In: Symposium on Photogrammetric Computer Vision (2006)
17. Montañés Laborda, M., Torres Moreno, E., Martínez del Rincón, J., Herrero Jaraba, J.: Real-time GPU color-based segmentation of football players. Journal of Real-Time Image Processing 7(4), 267–279 (2012)
18. NVIDIA: CUDA C Programming Guide (Version 5.5) (2013)
19. Pharr, M., Fernando, R. (eds.): GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation. Addison-Wesley (2005)
20. Roters, J., Jiang, X.: FestGPU: A framework for Fast robust ESTimation on GPU. http://cvpr.uni-muenster.de/research/gpu-estimation
21. Roters, J., Jiang, X.: Incremental dense reconstruction from sparse 3D points with an integrated level-of-detail concept. In: X. Jiang, O.R.P. Bellon, D. Goldgof, T. Oishi (eds.) Advances in Depth Image Analysis and Applications, LNCS, vol. 7854, pp. 116–125. Springer Berlin Heidelberg (2013)
22. Roters, J., Steinicke, F., Hinrichs, K.H.: Quasi-real-time 3D reconstruction from low-altitude aerial images. In: S. Zlatanova, H. Ledoux, E. Fendel, M. Rumor (eds.) Proc. of the 28th Urban Data Management Symposium, pp. 231–241 (2011)
23. Rousseeuw, P.J.: Least median of squares regression. Journal of the American Statistical Association 79(388), 871–880 (1984)
24. Stewart, C.V.: Robust parameter estimation in computer vision. SIAM Reviews 41, 513–537 (1999)
25. Terriberry, T., French, L., Helmsen, J.: GPU accelerating speeded-up robust features. In: Proc. of the 4th Int. Symposium on 3D Data Processing, Visualization and Transmission, pp. 355–362 (2008)
26. Torr, P.: Bayesian model estimation and selection for epipolar geometry and generic manifold fitting. Int. Journal of Computer Vision 50(1), 35–61 (2002)
27. Torr, P.H.S., Zisserman, A.: MLESAC: A new robust estimator with application to estimating image geometry. Computer Vision and Image Understanding 78, 138–156 (2000)
28. Winker, P., Lyra, M., Sharpe, C.: Least median of squares estimation by optimization heuristics with an application to the CAPM and a multi-factor model. Computational Management Science 8(1), 103–123 (2011)
29. Wu, C., Agarwal, S., Curless, B., Seitz, S.M.: Multicore bundle adjustment. In: Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, pp. 3057–3064 (2011)
Jan Roters studied Computer Science at the University of Münster. He received his Ph.D. in Computer Science from the University of Münster in 2013.

Xiaoyi Jiang studied Computer Science at Peking University and received his Ph.D. and Venia Docendi (Habilitation) degree in Computer Science from the University of Bern, Switzerland. He was an Associate Professor at the Technical University of Berlin, Germany. Since 2002 he is a full professor of Computer Science at the University of Münster, Germany. He is Senior Member of IEEE and Fellow of IAPR.