SUBMITTED TO THE SPECIAL ISSUE OF IEEE TPAMI ON ENERGY MINIMIZATION IN COMPUTER VISION AND PATTERN RECOGNITION
Grouping in images: from local to multi-local classification

Joes Staal, Stiliyan N. Kalitzin and Max A. Viergever

J. Staal and M. A. Viergever are with the Image Sciences Institute, University Medical Centre Utrecht, Heidelberglaan 100, E01.335, 3584 CX Utrecht, The Netherlands. Email: {joes, max}@isi.uu.nl. S. N. Kalitzin is with the Dutch Epilepsy Clinics Foundation, Achterweg 5, 2103 SW Heemstede, The Netherlands. Email: [email protected].

Abstract— A method is presented that uses grouping to improve local classification of image primitives. The grouping process is based on a spin-glass system, in which the image primitives are treated as possessing a spin. The system is subject to an energy functional consisting of a local and a bilocal part, which allows interaction between the image primitives. Instead of defining the state of lowest energy as the grouping result, the mean state of the system is taken; in this way, instabilities caused by multiple minima in the energy are avoided. The means of the spins are taken as the a posteriori probabilities of the grouping result. The energy functional is defined in such a way that, in case of no interactions between the elements, the means of the spins equal the a priori local probabilities. The grouping process thus fuses the a priori local and bilocal probabilities into a posteriori probabilities. The method is illustrated both on grouping of line elements in synthetic images and on vessel detection in retinal fundus images.

Keywords— Statistical Pattern Recognition, Statistical Learning, Bayesian Grouping.

I. Introduction

As long as the field of digital image analysis has existed, segmentation has been the bottleneck for object extraction, object-specific measurements, and fast object rendering from multi-dimensional image data. Simple segmentation techniques based on local pixel-neighborhood classification fail to capture the global nature of objects and often require intensive operator assistance to produce acceptable results. The reason is that the notion of an object does not necessarily follow the characteristics of its local image representation; only in idealized cases do local operations directly yield a definition of an object. Local properties, such as texture, edgeness, ridgeness etc., generally do not represent connected features of an object. Therefore, a method is needed that groups image primitives into objects. In such a method the solution of the segmentation problem will involve the use of domain knowledge that derives from the recognition task. Similar arguments have motivated earlier work on model-driven grouping and segmentation applied to real-world images [1-10].

In our view the segmentation problem can only be tackled successfully in conjunction with the recognition problem. The recognition task provides a notion of the objects to be defined using the segmentation method; this allows us to incorporate model knowledge of the objects in the grouping process, either by predefining properties that are characteristic of an object, or by deriving such properties by statistical means. The grouping process described in this study relies on local and bilocal prior object probabilities that are based on the predefined recognition task. In the grouping process, image primitives interact with each other, and through these interactions posterior probabilities for being part of the group are computed. In this sense, the method is based on Bayesian statistics. The grouping process itself can be regarded as finding the mean state of a spin-glass system subject to the Gibbs-Boltzmann distribution. Instead of searching for the configuration that maximizes the posterior probabilities, the mean values over all possible configurations of the image primitives determine the posterior probabilities. This is an important difference between the proposed method and Boltzmann-machine-like approaches [11]. Also, unlike relaxation labelling methods, the method does not have to re-evaluate the probabilities of an objective function [12]. Other methods for grouping define affinity matrices between primitives and try to split the primitives based on eigenanalysis [7, 13] or normalized cuts [9]. The problem with such methods is that their foundation is not statistical in nature, so that user-defined rules have to be incorporated in constructing the affinity matrices, although [10] has made an attempt to overcome this shortcoming. The purpose of the present study is to develop a general statistical method that improves the classification of image primitives by introducing (bilocal) interaction between them.

The setup of the paper is as follows. In section II the grouping process is considered as a spin-glass system. The solution of the grouping problem is formulated as finding the means of the state variables. Section III gives an overview of implementational issues, followed in section IV by illustrations of the approach, both on synthetic and on real-world images. Concluding remarks and a discussion of the results are presented in section V.

II. Formal problem statement

In this section the grouping process will be regarded as a spin-glass system. The system is governed by an energy functional, consisting of a local and a bilocal part. In the next subsection, the spin-glass formulation will be derived. Section II-B discusses the contribution of the local energy, followed in section II-C by the investigation of the bilocal energy.
A. Probabilistic formulation

The task is to group K elements ξ_i out of a set Ξ = {ξ_1, ..., ξ_N} of N elements. The number K is unknown beforehand. If every element is regarded as having a spin s_i that can be in one of two states (down or 0, and up or 1), the probability of a certain grouping, i.e. a configuration {s_i} of the spins, can be formulated as

P(\{s_i\}) = \frac{1}{Z} e^{-\beta E(\{s_i\})} .   (1)
Equation (1) is known as the Gibbs-Boltzmann distribution. The constant Z is the partition function, the sum of e^{-βE({s_i})} over all configurations of the spins; it ensures that the sum of P({s_i}) over all configurations equals 1. The functional E({s_i}) plays the role of the energy belonging to a state {s_i} of the system, whereas β is a control parameter determining the inverse temperature of the system. Once the energy E for every configuration is known, the optimal grouping could be defined as the configuration that maximizes P({s_i}), or equivalently, minimizes E({s_i}). Because there are 2^N different configurations and because the state-space parameters are discrete, deterministic minimization of E is not an option. A solution is to fall back on simulated annealing methods [14] and force the system into a frozen configuration (β → ∞). The problem with such an approach is that the energy may have multiple minima, yielding unstable and non-unique results. Therefore, instead of searching for the most probable configuration, the mean state of the system is investigated. The mean of a spin s_i is defined as

\langle s_i \rangle = \sum_{\{s_j\}} s_i \, P(\{s_j\}) ,   (2)
where the sum runs over all configurations. Elements with mean spins close to one are very probably in the group, whereas values close to zero indicate background. So, once the mean values of the spins are determined, the grouped elements can be extracted by setting a threshold. For each element ξ_i, its mean spin ⟨s_i⟩ plays the role of the a posteriori probability of being part of the group. Given that ⟨s_i⟩ is the observable on which the classification decision is based, it makes sense to look at higher order statistics. In particular, the covariance ⟨s_i s_j⟩ of the spins is a measure for the strength of the bond between elements. Elements that both belong to the group will have a covariance close to one, while the other configurations (one element in the group and one not, or both not) will show a covariance close to zero. The values of the mean spins and their covariances can be computed using the Metropolis algorithm [15], which will be discussed in Section III-A. But first, definitions for the energy will be given in the next subsections.

B. Local classification

In the previous subsection the problem was formulated as finding the mean spins of a spin-glass system, which is
governed by an energy functional E. In this subsection a definition for the energy is investigated without interactions between the spins. It is supposed that local knowledge of every element ξ_i is available in the form of an a priori probability p_i = p(s_i = 1). Since every element can only be in one of two states, we have p(s_i = 0) = 1 − p_i. If there is no interaction between the elements, the energy should be defined in such a way that the a posteriori probability equals the a priori probability, i.e.

\langle s_i \rangle = p_i .   (3)
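Property (3), and more generally the expectations in eq. (2), can be checked exactly for small systems by enumerating all 2^N configurations. The following sketch is our own illustrative code and not part of the method; the function name and the generic energy callable are hypothetical, and the routine simply evaluates eqs. (1) and (2) by brute force.

```python
import itertools
import numpy as np

def exact_mean_spins(energy, N, beta=1.0):
    """Compute <s_i> of eq. (2) by summing over all 2^N configurations.

    `energy` maps a configuration (tuple of 0/1 spins) to E({s_i}).
    Only feasible for small N; this is the exact counterpart of the
    Metropolis estimator used later in the paper.
    """
    configs = list(itertools.product((0, 1), repeat=N))
    weights = np.array([np.exp(-beta * energy(s)) for s in configs])  # unnormalized P({s_i})
    Z = weights.sum()                       # partition function
    probs = weights / Z
    spins = np.array(configs, dtype=float)  # shape (2^N, N)
    return probs @ spins                    # <s_i> = sum over configurations of s_i P({s})
```

For the purely local energy introduced next (eqs. (4) and (5)), this routine returns ⟨s_i⟩ = p_i up to floating-point error, which is exactly the requirement of eq. (3).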
Now, consider the following energy functional, in which there is no interaction between the spins:

E(\{s_i\}) = \sum_i L_i s_i .   (4)
Because E is regarded as the energy of the system, L_i is referred to as the local potential induced by element ξ_i. With eq. (1) the probability that the system is in state {s_i} equals

P(\{s_i\}) = \frac{1}{Z} \prod_i e^{-\beta L_i s_i} ,

showing that without interactions the elements are classified independently of each other. To satisfy eq. (3), the local potential must be defined as

L_i = -\frac{1}{\beta} \log_e \frac{p_i}{1 - p_i} .   (5)

With eq. (5) the local part of the energy has been determined.

C. Classification with interactions

We next consider a system in which interaction between the elements does exist. The reason for allowing such interaction is that we want it to reinforce the mean spins of the grouped elements, yielding stronger groups. Interactions between the spins can be introduced by defining the following energy functional

E(\{s_i\}) = \sum_i L_i s_i + \frac{1}{2} \sum_i \sum_j B_{ij} s_i s_j ,   (6)
where B_{ij} denotes the bilocal potential caused by the interaction between elements ξ_i and ξ_j. This bilocal potential perturbs the mean values of the spins, and instead of eq. (3) we will have for the a posteriori probabilities

\langle s_i \rangle = p_i + \Delta_i .   (7)
The value and sign of ∆_i depend upon the definition of the bilocal potential B_{ij}. Before a definition for the bilocal potential is investigated, some observations will be made on its role in the energy functional of eq. (6). Only pairs of spins that are both in the up state contribute to the bilocal part of the energy. This is in accordance with the goal that grouped elements should reinforce each other's mean spin. Extra terms like C_{ij} s_i (1 − s_j) could be added
to discriminate between foreground and background, or D_{ij} (1 − s_i)(1 − s_j) for grouping background elements. It depends on the application whether or not this makes sense. If multiple classes are available, grouping the background will probably be of little value, unless the background elements share common properties. In this paper eq. (6) will only be used in its unmodified form. The bilocal part of the energy can be viewed as a discrete Hopfield network [16] with connections B_{ij} between the neurons. For the grouping process it is important that the elements influence each other, which is accomplished by Hopfield networks since they exhibit strong feedback coupling. The definition of the local potentials, eq. (5), suggests a straightforward definition for the bilocal potential:

B_{ij} = -\frac{1}{\beta} \log_e \frac{p_{ij}}{1 - p_{ij}} ,   (8)
with p_{ij} the a priori conditional probability p(s_i = 1|s_j = 1). Note that p(s_i = 0|s_j = 1) = 1 − p_{ij}. Since we are indifferent with respect to background grouping, one could say we regard p(s_i = 1|s_j = 0) = p(s_i = 0|s_j = 0) = 1/2. Note that with the definitions of the potentials in eqs. (5) and (8) the inverse temperature β drops out of eq. (1).

III. Implementation

In this section we start with a discussion of the Metropolis algorithm, a method for finding the expected values of the state variables of a system that is characterized by the Gibbs-Boltzmann distribution. After this discussion, the determination of the a priori probabilities p_i and p_{ij} will be dealt with.

A. The Metropolis algorithm

The Metropolis algorithm [15] is a Monte-Carlo method for calculating the expected values of the state variables of a system that is subject to the Gibbs-Boltzmann distribution. Before the algorithm is executed the system is in a certain state, quite probably not the equilibrium state. At the start of the algorithm one element is chosen at random and its spin is reversed. The reversal will change the energy of the system. If the change of energy ∆E ≤ 0, the reversal of the spin is accepted; if ∆E > 0, the reversal is accepted with probability exp(−β∆E). The remaining N − 1 elements are checked in random order and the system changes its state according to the same rules as before. This procedure is referred to as a Metropolis step. The Metropolis step is repeated M times, where M has to be large enough for the samples to represent the system's thermal equilibrium. The mean of the spins is then estimated as

\langle s_i \rangle = \frac{1}{M} \sum_{k=1}^{M} s_i(k) ,   (9)
where s_i(k) is the value of the spin after the k-th Metropolis step.
Similarly, the covariance is found as

\langle s_i s_j \rangle = \frac{1}{M} \sum_{k=1}^{M} s_i(k) \, s_j(k) .   (10)
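As an illustration of eqs. (9) and (10), a minimal sketch of such a sampler is given below. It is our own illustrative code, not the authors' implementation; the function and variable names are hypothetical, the potentials L and B are assumed to be given as arrays (with B symmetric and zero on the diagonal), and no burn-in or convergence diagnostics are included.

```python
import numpy as np

def metropolis_means(L, B, M=10000, beta=1.0, rng=None):
    """Estimate <s_i> and <s_i s_j> for E({s}) = sum_i L_i s_i + 0.5 sum_ij B_ij s_i s_j.

    L : (N,) local potentials, eq. (5).
    B : (N, N) bilocal potentials, eq. (8), assumed symmetric with zero diagonal.
    Returns the running averages of eqs. (9) and (10) over M Metropolis steps.
    """
    rng = np.random.default_rng(rng)
    N = len(L)
    s = rng.integers(0, 2, size=N).astype(float)    # arbitrary initial state
    mean_s = np.zeros(N)
    mean_ss = np.zeros((N, N))
    for _ in range(M):                              # one Metropolis step: visit all elements
        for i in rng.permutation(N):
            s_new = 1.0 - s[i]
            # energy change of reversing spin i: (2*s_new - 1) * (L_i + sum_j B_ij s_j)
            dE = (2.0 * s_new - 1.0) * (L[i] + B[i] @ s)
            if dE <= 0 or rng.random() < np.exp(-beta * dE):
                s[i] = s_new
        mean_s += s
        mean_ss += np.outer(s, s)
    return mean_s / M, mean_ss / M
```

The returned estimates can then be thresholded (on the mean spins and, optionally, on the covariances) to extract the group, as described in section IV.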
B. Determination of the a priori probabilities

The main issue in the implementation of the proposed method is the determination of the a priori probabilities p_i and p_{ij}. Once those are known, the potentials of eq. (5) and eq. (8) can be computed. In the implementation of the Metropolis algorithm, knowledge of the energy itself is not necessary; only

\Delta E(s_i \rightarrow \bar{s}_i) = (2\bar{s}_i - 1)\Big(L_i + \sum_j B_{ij} s_j\Big)

is needed, where ∆E(s_i → s̄_i) is the change of the energy due to the spin reversal of element ξ_i. For the determination of the probabilities p_i and p_{ij} two choices can be made:
1. Prior information suggests a functional form based on properties of the elements.
2. The probability densities are estimated from training data based on properties of the elements.
In this paper the probabilities p_i(φ_i) and p_{ij}(ψ_{ij}) have been learned from training data. The feature vector φ_i represents local properties and ψ_{ij} bilocal properties. An obvious choice for training the a priori probabilities would be a feed-forward neural network with the sum of squared differences as error measure. With appropriate squashing functions, the probabilities (given the features) will be obtained [17, see chapter 6]. However, training such networks can be difficult and many parameters have to be adjusted. Therefore, we have chosen to use the k-Nearest-Neighbor (kNN) classifier for approximating the probabilities. The main drawback of this method is that it is believed to be slow; however, there exist optimized and fast implementations [18]. Using a kNN classifier, the search space for the local probabilities is built with all data points, which are labelled with 1 for spin up and with 0 for spin down. Then, the probability that an element has spin up given its features φ_i is given by

p_i(\phi_i) = \frac{n_i}{k} ,

where n_i is the number of neighbors in the feature space that have label 1 and k is the total number of neighbors. The p_{ij} are the a priori probabilities that spin s_i = 1 given that s_j = 1. To approximate p_{ij} by a kNN classifier, all elements with spin equal to one are paired with all elements, and the labels are set to 1 if a pair has the second spin up and to 0 if the second spin is down. After this labelling, an unseen pair has its probability approximated in the same way as in the local case. To reduce the number of bilocal probabilities that have to be learned, and to avoid long-range interaction, a neighborhood or "clique" can be used, in which element ξ_i only interacts with a limited number of other elements.
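As an illustration of this estimator, the fragment below sketches the kNN approximation of p_i(φ_i). It is our own illustrative sketch under its own assumptions: it uses scikit-learn's NearestNeighbors instead of the optimized implementation of [18], and the function and argument names are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_prior(train_features, train_labels, query_features, k=20):
    """Approximate p(spin up | features) as n_i / k, cf. the text above.

    train_features : (n_samples, n_features) feature vectors phi of the training elements.
    train_labels   : (n_samples,) array of 0/1 spin labels.
    query_features : (n_queries, n_features) features of unseen elements.
    """
    nn = NearestNeighbors(n_neighbors=k).fit(train_features)
    _, idx = nn.kneighbors(query_features)     # indices of the k nearest training points
    labels = np.asarray(train_labels)
    return labels[idx].mean(axis=1)            # fraction of neighbors with label 1 = n_i / k
```

The bilocal probabilities p_{ij} can be approximated with the same routine by stacking the bilocal feature vectors ψ_{ij} of the labelled pairs.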
IV. Examples

In this section we illustrate the proposed method with two examples. The first example deals with synthetic data and shows grouping of line elements into a cord. In the second example real-world data is used in the detection of the vasculature in retinal fundus images.

A. Grouping line elements into a cord

As a first example, cords consisting of line elements are discussed. Ten training images of size 400 × 400 pixels are generated, each containing 420 line elements of which 20 form a cord (an example is shown in figure 1). All line elements have a length of 10.0 ± 1.0 pixels (all distributions used to generate the training and test data in this section are uniform). The orientations of the line elements that form the background vary between 0° and 360°. The cords have a random orientation θ and the orientation of their constituting line elements varies between θ − 1.8° and θ + 1.8°. Only one local feature is taken into account, the mean µ_i of the gray values of the line elements, which is 6.0 ± 4.3 for background elements and 10.2 ± 0.2 for foreground elements. These values cause some overlap between the distributions of the local features. The width of the background distribution of the µ's, relative to the width of the foreground distribution, is slightly larger than the ratio between the number of background and foreground elements (20 : 1), so that with these settings a fair number of foreground elements will be selected as background in local classification. Also, based on local classification, some background elements will be regarded as foreground. For the bilocal a priori probabilities five bilocal features are computed, three based on the geometry of the line elements and two on the local features. The latter two are the sum and the absolute difference of the µ's of every considered pair. The geometrical features are measures for distance, mutual orientation and alignment, see figure 2. Note that a parallel displacement of one line element with respect to another does not change their mutual orientation. To decrease computational costs and to avoid long-range interaction, for every element only interactions within a neighborhood of 25 pixels around the element are computed. So, only those elements are taken into account that have at least one endpoint closer than 25 pixels to the element under consideration, see figure 3. The method is tested on ten test images that are constructed in the same way as the training data, except that no labels are attached to classify foreground and background. The threshold value for the mean spin is set to 0.5 in all experiments. A local classification for one of the images, shown in figure 4 (a), is presented in figure 4 (b). Notice that a few background elements are classified as foreground and that about 25% of the cord is classified as background, see figure 4 (c). Figure 4 (d) shows the results after bilocal classification. The Metropolis algorithm is run for 10,000 steps. All the elements of the cord are classified as foreground, but a few background elements end up in the foreground as well. To further improve on this result, use can be made of the covariance between the spins. Only foreground elements having a strong connection to other foreground neighbors are assumed to belong to the object. Using this reasoning, figure 4 (e) is obtained.
Fig. 1. Example of a training image. In black the elements that constitute the cord, in gray the background elements.
Elements that are selected as foreground but have low covariance with their neighbors (in fact, the covariances are all zero) are marked with triangles at their endpoints. The found object now coincides with the cord in the input image. To evaluate the result of the grouping, several measures have been computed, see table I. The table clearly shows that the classification result improves after grouping: instead of an error of 20% in the foreground classification, an error of 0.5% is obtained. The error made with respect to the background elements, which is already low, decreases, but not at the same rate as the foreground error. If, however, isolated elements are removed by setting a threshold on the covariance, no errors are made in the background. The threshold value used is 0.1. The changes in the values of the mean spins show a clear increase of 0.33 of the bilocal a posteriori probabilities of the foreground elements with respect to the local a priori probabilities. The background elements show an overall very small (negligible) decrease in the a posteriori probabilities.

B. Segmenting ridges in retinal fundus images

In this subsection the method is tested on two-dimensional medical images of the retina of the human eye; for an example see figure 5. The images are obtained with a scanning laser ophthalmoscope (SLO). In this procedure patients are injected with a contrast agent to enhance the vessel structure. The image processing task is to delineate the vessel structure. Since image ridges are natural indicators of vessels, we start our analysis with ridge detection. The appendix gives a short overview of ridge detection in two-dimensional gray value images; for a more extensive discussion on this subject, see [19]. The ridges of figure 5 are shown in figure 6.
Fig. 4. (a) Input image. (b) Locally classified image. The gray value of the elements measures the probability of spin up (darker denotes higher probability, lighter denotes lower probability). The endpoints of the elements that are classified as foreground are drawn as circles; the background elements are denoted by line elements. (c) Enlargement of the cord elements of (b). Note that 5 elements are classified as background (drawn as line elements). (d) As in (b), but now with bilocal interaction added. Note that the spins of the elements in the cord have become stronger: the circles are darker than in (b). (e) Bilocal classification where foreground elements that have low covariance with all their neighbors are marked with triangles at their endpoints.
Fig. 2. (a) The shortest distance between the end-points of two line elements is taken as the distance d between the elements. Note that there are four distances between the end-points of two elements. (b) The angle between two line elements is characterized by the absolute value of the inner product of the unit vectors v̂_i and v̂_j that are aligned with the line elements. (c) A (symmetric) measure for alignment is found by looking for the end-points which are closest to each other and forming a vector r̂ of unit length along the line between the two other end-points. Note that those end-points are not necessarily the end-points with the longest distance between the two line elements. The alignment measure is now defined as the mean of the absolute values of the inner products of r̂ with v̂_i and v̂_j: ½(|r̂ · v̂_i| + |r̂ · v̂_j|).
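The three geometric measures of figure 2 are straightforward to compute from the endpoints of a pair of line elements; a possible implementation is sketched below. This is our own illustrative code with hypothetical names, not the authors' implementation.

```python
import numpy as np

def geometric_features(a0, a1, b0, b1):
    """Distance, mutual orientation and alignment of two line elements,
    given as endpoint pairs (a0, a1) and (b0, b1), cf. figure 2."""
    a0, a1, b0, b1 = map(np.asarray, (a0, a1, b0, b1))
    v_i = (a1 - a0) / np.linalg.norm(a1 - a0)        # unit vector along element i
    v_j = (b1 - b0) / np.linalg.norm(b1 - b0)        # unit vector along element j
    # (a) shortest of the four endpoint-to-endpoint distances
    pairs = [(a0, b0), (a0, b1), (a1, b0), (a1, b1)]
    dists = [np.linalg.norm(p - q) for p, q in pairs]
    k = int(np.argmin(dists))
    d = dists[k]
    # (b) mutual orientation
    orientation = abs(np.dot(v_i, v_j))
    # (c) alignment: unit vector along the line between the two endpoints
    #     that are NOT the closest pair
    p_far = a1 if k in (0, 1) else a0
    q_far = b1 if k in (0, 2) else b0
    r = (q_far - p_far) / np.linalg.norm(q_far - p_far)
    alignment = 0.5 * (abs(np.dot(r, v_i)) + abs(np.dot(r, v_j)))
    return d, orientation, alignment
```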
TABLE I Results from the experiments of section IV-A. The total number of background elements considered is 4000 and the total number of foreground elements 200. The first row shows how many background elements are classified wrongly in the local and the bilocal classification, respectively. The second row shows the same, when the covariance is taken into account. In the third row the results are shown for the foreground elements (same results with and without covariance). The fourth row shows for the background elements how much the means of the spins increase because of the grouping (see eq. (7)). The last row shows the same, but then for the foreground elements.
                        Local             Bilocal
                        #     fraction    #        fraction
Background error 1      94    0.024       69       0.017
Background error 2      94    0.024        0       0.000
Foreground error        41    0.205        1       0.005
∆_i background                           −0.937   −0.002
∆_i foreground                            6.609    0.330
Fig. 3. Neighborhood for the white element. Only those elements (dark gray) which have at least one end-point within the neighborhood are taken into account. The black elements are outside the neighborhood.
The problem of detecting the vessels in figure 5 is thus reduced to detecting which ridge pixels in figure 6 delineate vessels. It is obvious from the abundance of ridges in figure 6 that this representation is still suboptimal. To improve the representation, the ridge point sets are fragmented into convex subsets. Each of these convex subsets represents a line segment. The set of line segments obtained in this way is the basic "grouping set" of geometrical image primitives. A convex point set is a set of points such that for every pair of points belonging to the set, all points that lie between these two points on the line connecting them also belong to the set. In more formal terms:

Definition 1: A point set C is convex iff z(s) ∈ C for all x ∈ C, y ∈ C, s ∈ [0, 1] and z(s) = xs + y(1 − s).

A slightly different form of this basic definition is used, in which the directional information of the ridges is exploited (see the appendix). This definition replaces Euclidean convexity by geodesic convexity. The resulting sets are called affine convex sets. The mechanism to obtain affine convex sets is a simple region growing algorithm, which compares an already grouped ridge pixel with ungrouped pixels in a neighborhood of radius ε_c, where the subscript c stands for connectivity.
Fig. 5. An optical fluorescence image of the fundus of the human retina obtained from a scanning laser ophthalmoscope. Image size is 512 × 512 pixels. The artefacts on the boundaries of the image are due to signal underflow.
The condition for grouping a grouped and a candidate pixel within the neighborhood is based on two comparisons:
1. Are the directions of the ridges on which the pixels are found similar?
2. If so, are the pixels on the same ridge or on parallel ridges?
The first question can be checked by taking the scalar product of the principal eigenvectors of the Hessian matrix at the locations of the ridge pixels. The principal eigenvectors are perpendicular to the ridges (appendix). If the orientations are similar, the scalar product will be close to 1.
Fig. 6. The ridges (black) of figure 5, obtained at scale t = 1.0 pixel². Note the large response of the ridge detector to the noise in the background.
The second question can be checked by computing the unit-length normalized vector r̂ between the locations of the two pixels under consideration and taking the vector product of this vector with the principal direction of the grouped ridge pixel. If the pixels are on the same segment, the magnitude of the vector product will be close to 1. See also figure 7 for the construction of the sets. Mathematically, the following inequalities are checked:

\|x_g - x_u\| \le \epsilon_c ,   (11)
|\hat{v}(x_g, t) \cdot \hat{v}(x_u, t)| \ge \epsilon_o ,   (12)
|\hat{v}(x_g, t) \wedge \hat{r}| \ge \epsilon_p ,   (13)
where the subscript g stands for grouped, u for ungrouped, o for orientation and p for parallelism. The ε's determine the measure of similarity. For the other symbols, see figure 7.
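As a concrete illustration of conditions (11)-(13), the fragment below sketches the pairwise test at the heart of the region growing step. It is our own illustrative code with hypothetical names, not the authors' implementation; the ridge pixel coordinates and principal eigenvectors v̂(x, t) are assumed to be available, and the default thresholds are the values reported in figure 9.

```python
import numpy as np

def may_group(x_g, v_g, x_u, v_u, eps_c=3.0, eps_o=0.98, eps_p=0.98):
    """Return True if an ungrouped ridge pixel x_u may join the affine convex
    set of a grouped pixel x_g, following conditions (11)-(13).

    x_g, x_u : pixel coordinates (2-vectors).
    v_g, v_u : unit principal eigenvectors of the Hessian at x_g and x_u.
    """
    x_g, v_g, x_u, v_u = map(np.asarray, (x_g, v_g, x_u, v_u))
    r = x_u - x_g
    dist = np.linalg.norm(r)
    if dist == 0:
        return True                                   # same pixel
    if dist > eps_c:                                  # condition (11): connectivity
        return False
    if abs(np.dot(v_g, v_u)) < eps_o:                 # condition (12): similar orientation
        return False
    r_hat = r / dist
    cross = v_g[0] * r_hat[1] - v_g[1] * r_hat[0]     # 2-D wedge product v_g ^ r_hat
    return abs(cross) >= eps_p                        # condition (13): same ridge, not a parallel one
```

Region growing then repeatedly applies this test between pixels already in a set and candidate ridge pixels within the ε_c-neighborhood.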
Fig. 7. The curved gray lines are two ridges. The disk represents the neighborhood within which pixels are compared. The dark arrow is the direction in which the ridge is detected (see appendix) at the location of the grouped pixel. The light arrows are the directions at the location of candidate pixels. The pixel that belongs to the same ridge will be grouped to the pixel with the dark arrow, because it satisfies conditions in (11)–(13). The pixel on the parallel ridge does not satisfy condition (13) and will not be grouped.
Using these techniques the convex sets of the ridges in figure 6 are displayed in figure 9 (a). These convex sets have been used for local and bilocal classification. For this purpose, the convex sets of four SLO images are computed and labelled by hand as vessel and non-vessel. To approximate the local a priori probabilities p_i and the bilocal a priori probabilities p_{ij}, features are computed for every convex set (local) and for every combination of a vessel convex set with any other set (bilocal). To avoid long-range interactions, neighbors are taken into account only within a neighborhood of eight pixels. The following local features are computed for every convex set i:

1. The mean µ_i of the image gray values at the M pixel locations of the convex set,

\mu_i = \frac{1}{M} \sum_m L(x^i_m, y^i_m) ,

where L denotes the gray value image and (x^i, y^i) the pixel locations of the convex set.

2. A measure for the width of vessels is computed in the following way. For every pixel (x^i_m, y^i_m) in the convex set the principal direction v̂^i_m is known (see appendix and the discussion above). The principal directions are perpendicular to the ridges, i.e. perpendicular to the vessels. One-dimensional gray value profiles, centred at (x^i_m, y^i_m) and oriented along v̂^i_m, are extracted from the image. In every profile, the edges on the left and right hand side of (x^i_m, y^i_m) are detected. The distance between the locations in the profile with the strongest edge response on the left and right side is taken as the width δ^i_m for profile m. The measure w_i for the width is the mean of the widths of all profiles,

w_i = \frac{1}{M} \sum_m \delta^i_m .

3. A measure σ_i for the edge strength in the convex set is computed as follows. The responses of the strongest edges on the left and right side of the profiles (see the previous item), λ^i_m and ρ^i_m respectively, are averaged, yielding

\sigma_i = \frac{1}{M} \sum_m \left( \lambda^i_m + \rho^i_m \right) .

4. The curvature κ_i of the convex set, defined as

\kappa_i = \frac{1}{M-1} \sum_{m=1}^{M-1} \hat{v}^i_m \cdot \hat{v}^i_{m-1} ,

where v̂^i_m is the principal direction corresponding to pixel m of the convex set.
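Two of these local features, the mean gray value µ_i and the curvature κ_i, are simple to compute once the pixels and principal directions of a convex set are known; a minimal sketch is given below. It is our own illustrative code with hypothetical names; the width and edge-strength features would additionally require the profile extraction described in items 2 and 3.

```python
import numpy as np

def mean_gray(image, pixels):
    """mu_i: mean gray value over the (row, col) pixel locations of a convex set."""
    rows, cols = np.asarray(pixels).T
    return image[rows, cols].mean()

def curvature(directions):
    """kappa_i: mean inner product of consecutive principal directions,
    for an (M, 2) array of unit vectors ordered along the convex set."""
    v = np.asarray(directions)
    return np.mean(np.sum(v[1:] * v[:-1], axis=1))
```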
TABLE II Results from the experiments of section IV-B. The total number of elements considered is 23317. The first row shows the number of true positives, the second row the true negatives, the third row the false positives and the fourth row the false negatives. With these numbers, the accuracy ((tp + tn)/(tp + tn + fp + fn)) , sensitivity (tp/(tp + fn)) and specificity (tn/(tn + fp)) are computed, shown in lines five to seven. The eighth row shows the total and average increase of the mean spins that are selected as foreground and that are vessel. The average is computed by dividing by the number of bilocal true positives. The last row shows the same for the background elements; here the increase is computed for the sets that are non-vessel. The average increase is computed by dividing by the number of bilocal true negatives.
Fig. 8. ROC curves for the local and bilocal classification of retinal fundus images. The dotted lines are the lower and upper bounds of the asymmetric 95% confidence interval. The area under the curve is 0.677 for the local curve and 0.834 for the bilocal curve.
For the bilocal features between convex sets i and j the following measures are taken:
1. The Euclidean distance between the closest endpoints of the sets.
2. The sum of µ_i and µ_j (see item 1 of the local features).
3. The absolute difference of µ_i and µ_j (see item 1 of the local features).
4. The sum of σ_i and σ_j (see item 3 of the local features).
5. The absolute difference of σ_i and σ_j (see item 3 of the local features).
To be able to evaluate the method, leave-one-out experiments are performed on the data sets, i.e. three of the data sets are used to train the potentials, and with this training set the fourth data set is classified, both locally and bilocally. This is done for all four data sets. For the kNN classifier a k-value of 20 is taken. ROC analysis shows that the area under the curve is 0.677 for the local classification and 0.834 for the bilocal classification. The ROC curves are shown in figure 8. If a convex set is classified as vessel when ⟨s_i⟩ > 0.5, the numerical results shown in table II are obtained. An example of which convex sets are classified as vessels based on this threshold is shown in figure 9 (b) and (c). From table II it can be concluded that the foreground classification has benefited from the grouping method. The total number of correctly classified vessel sets has increased. Grouping also causes a small decrease in the correct classification of the background. A count reveals that 3266 of the local true positives remain true positives after bilocal classification. The total number of false negatives that move to the true positives is 656. Since the total increase of true positives is 543, this means that 113 locally correctly classified vessel elements are lost to the other groups.
                   Local      Bilocal     Average
tp                 3379       3922
tn                 16742      16447
fp                 1278       1573
fn                 1918       1375
Accuracy           0.863      0.874
Sensitivity        0.638      0.740
Specificity        0.929      0.913
∆_i foreground                935.3       0.238
∆_i background               −639.4      −0.039
Not only has the number of true positives increased; their a posteriori probability increased on average by 0.238, which is the purpose of the method.

V. Discussion

In this paper a method is presented for grouping image primitives based on local and bilocal features. The method performs well on synthetic data. Compared to local classification, the number of classification errors is reduced and the confidence with which the elements are classified is increased. In the retinal fundus images, ROC analysis shows that bilocal classification clearly outperforms local classification. For a threshold of 0.5 on ⟨s_i⟩, the mean increase of the posterior probabilities for correctly classified convex sets is 0.238. Yet, the number of false positives increases as well. It must be noted that the test on the fundus images is meant to serve as an illustration only. For a genuine evaluation of the approach on real-world images, the characteristic features of the images at hand must be determined. The method itself can be applied to a variety of grouping and classification problems. In this study we consider grouping of line elements and convex sets, but grouping of individual pixels, pixel sets or other structures can be studied as well. The extension to higher-dimensional images is straightforward. The complexity remains the same, O(N^2) for fully connected bilocal interactions, with N the number of elements. It is also possible to include higher-order interactions (ternary, n-ary), although for this extension the complexity increases as O(N^n).
Fig. 9. (a) The convex sets of the ridges of figure 6. Every grouped set has its own color. The settings used are ε_c = 3.0 pixel for the neighborhood (connectivity) accuracy, ε_o = 0.98 for the orientation accuracy and ε_p = 0.98 for the parallelism accuracy, see also eqs. (11)-(13). Note that sets consisting of a single pixel have been removed. (b) Locally classified convex sets. Only those sets are depicted which have ⟨s_i⟩ > 0.5. Sets with higher mean spin are shown darker. (c) Bilocally classified convex sets. Again, darker elements denote higher mean spin. Only mean spins larger than 0.5 are shown. (d) Manually labelled convex sets for the vessels.
Acknowledgements

This work was carried out in the framework of the NWO research project STW-UGN/4496.

Appendix

The ridge detection used in this paper is described in full detail in [20]. Here a short overview for two-dimensional images is presented. Ridges and valleys are defined as those points where the image has an extremum in the direction of the largest surface curvature. Mathematically, we search for the points in the image L(x), with x = (x_1, x_2)^T, where the first derivative of the luminance in the direction of the largest surface curvature changes sign. The direction of largest surface curvature is the eigenvector v̂ of the matrix of second-order derivatives of the image that has the largest absolute eigenvalue λ. This matrix is often referred to as the Hessian matrix. The first derivative of the image in the direction of v̂ is found by projecting the gradient of the image onto it. The sign of λ determines whether a valley (λ > 0) or a ridge (λ < 0) is found.
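The fragment below sketches this ridge test on a discrete image; it is our own illustrative code, not the implementation of [20]. It assumes the image derivatives are computed with Gaussian derivative filters at scale t (taken to equal σ²), as described next in the text, and uses a nearest-pixel approximation of the sign-change test formulated below in eq. (15) with ε = 1 pixel.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def detect_ridges(image, t=1.0, eps=1.0):
    """Return a map that is 1 on ridges, -1 on valleys and 0 elsewhere,
    following the sign-change test of eq. (15) with nearest-pixel sampling."""
    sigma = np.sqrt(t)                       # assuming t = sigma^2
    L = image.astype(float)
    Lx  = gaussian_filter(L, sigma, order=(0, 1))   # d/dx  (x = column axis)
    Ly  = gaussian_filter(L, sigma, order=(1, 0))   # d/dy  (y = row axis)
    Lxx = gaussian_filter(L, sigma, order=(0, 2))
    Lxy = gaussian_filter(L, sigma, order=(1, 1))
    Lyy = gaussian_filter(L, sigma, order=(2, 0))

    rho = np.zeros_like(L)
    rows, cols = L.shape
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            H = np.array([[Lxx[y, x], Lxy[y, x]],
                          [Lxy[y, x], Lyy[y, x]]])
            eigval, eigvec = np.linalg.eigh(H)
            k = int(np.argmax(np.abs(eigval)))      # eigenvalue largest in absolute value
            lam, v = eigval[k], eigvec[:, k]        # v = (v_x, v_y), unit length
            # directional derivative g . v sampled at x +/- eps*v (nearest pixel)
            xp, yp = int(round(x + eps * v[0])), int(round(y + eps * v[1]))
            xm, ym = int(round(x - eps * v[0])), int(round(y - eps * v[1]))
            dp = Lx[yp, xp] * v[0] + Ly[yp, xp] * v[1]
            dm = Lx[ym, xm] * v[0] + Ly[ym, xm] * v[1]
            rho[y, x] = -0.5 * np.sign(lam) * abs(np.sign(dp) - np.sign(dm))
    return rho
```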
Because taking derivatives of discrete images is an ill-posed operation, they are taken at a scale t using the Gaussian scale-space technique (see e.g. [21] and references therein). The main idea is that the image derivatives can be taken by convolving the image with derivatives of a Gaussian,

\frac{\partial^i L(x, t)}{\partial x_j^i} = \frac{1}{2\pi t} \int_{x' \in \mathbb{R}^2} \frac{\partial^i e^{-\|x - x'\|^2 / (2t)}}{\partial x_j^i} \, L(x') \, dx' ,   (14)

where x_j is the image coordinate with respect to which the derivative is taken. Mixed derivatives are computed by taking mixed derivatives of the Gaussian kernel. It is now possible to define a scalar field ρ(x, t) over the image that takes value −1 for valleys, 1 for ridges and 0 elsewhere as follows:

\rho(x, t) = -\tfrac{1}{2} \, \mathrm{sign}(\lambda(x, t)) \times \left| \, \mathrm{sign}\big(g(x + \epsilon\hat{v}, t) \cdot \hat{v}\big) - \mathrm{sign}\big(g(x - \epsilon\hat{v}, t) \cdot \hat{v}\big) \, \right| ,   (15)
where the gradient vector g(x, t) is defined as ∇L(x, t), λ(x, t) is the eigenvalue of the Hessian matrix H(x, t) = ∇∇^T L(x, t) that is largest in absolute value, and v̂(x, t) is the unit-length normalized eigenvector belonging to that eigenvalue. In (15) v̂ is evaluated at (x, t). The parameter ε is the spatial accuracy with which the point sets are detected. In the continuous case the limit ε → 0 is taken, but in the discrete pixel case ε = 1.0 pixel is a natural choice. Figure 6 shows an example of ridge detection on a fundus image (valleys are not shown).

References

[1]
P. Parent and S. W. Zucker, “Trace inference, curvature consistency and curve detection,” IEEE Trans. on Patt. Anal. and Mach. Intell., vol. 11, no. 8, pp. 823–839, 1989. [2] G. N. Khan and D. F. Gillies, “Extracting contours by perceptual grouping,” Image and Vision Computing, vol. 10, no. 2, pp. 77–88, 1992. [3] T. Pun, “Electromagnetic models for perceptual grouping,” in Advances in machine vision: strategies and applications, C. Archibald, Ed. World Scientific Publishing Co., 1992. [4] G. Guy and G. Medioni, “Inferring global perceptual contours from local features,” Int. Journ. Comp. Vis., vol. 20, no. 1/2, pp. 113–133, 1996. [5] M. Pilu and R. B. Fisher, “Model-driven grouping and recognition of generic object parts from single images,” Journal of Robotics and Autonomous Systems, vol. 21, pp. 107–122, 1997. [6] L. R. Williams and D. W. Jacobs, “Stochastic completion fields: a neural model of illusory contour shape and salience,” Neural Computation, vol. 9, no. 4, pp. 837–858, 1997. [7] P. Perona and W. T. Freeman, “A factorization approach to grouping,” in ECCV (1). 1998, vol. 1406 of Lecture Notes in Computer Science, pp. 655–670, Springer. [8] S. N. Kalitzin, J. J. Staal, B. M. ter Haar Romeny, and M. A. Viergever, “Image segmentation and object recognition by Bayesian grouping,” in Proc. ICIP 2000, Vancouver, 2000. [9] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Trans. on Patt. Anal. and Mach. Intell., vol. 22, no. 8, pp. 888–905, 2000. [10] A. Robles-Kelly and E. R. Hancock, “Perceptual grouping using eigendecomposition and the EM algorithm,” in Proc. 12th Scand. Conf. Im. Anal., 2001, pp. 214–221. [11] E. Aarts and J. Korst, Simulated annealing and Boltzmann machines, John Wiley and Sons, 1989. [12] J. Kittler and J. Illingworth, “Relaxation labelling algorithms— a review,” Image and Vision Computing, vol. 3, no. 4, pp. 206– 216, 1985.
[13] Y. Weiss, “Segmentation using eigenvectors: a unifying view,” in IEEE Int. Conf. Comp. Vis., 1999, pp. 975–982. [14] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, “Optimization by simulated annealing,” Science, vol. 220, no. 4598, pp. 671–680, 1983. [15] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller, “Equation of state calculations by fast computing machines,” Journal of Chemical Physics, vol. 21, no. 6, pp. 1087–1092, 1953. [16] J. J. Hopfield, “Neural networks and physical systems with emergent collective computational abilities,” Proc. Natl. Acad. Sci. USA, vol. 79, pp. 2554–2558, 1982. [17] C. M. Bishop, Neural networks for pattern recognition, Oxford University Press, 1995. [18] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu, “An optimal algorithm for approximate nearest neighbor searching,” Journal of the ACM, vol. 45, pp. 891–923, 1998, for an implementation see: http://www.cs.umd.edu/~mount/ANN/. [19] D. Eberly, Ridges in image and data analysis, Kluwer Academic Publishers, Dordrecht, 1996. [20] S. N. Kalitzin, J. J. Staal, B. M. ter Haar Romeny, and M. A. Viergever, “A computational method for segmenting topological point sets and application to image analysis,” IEEE Trans. on Patt. Anal. and Mach. Intell., vol. 23, no. 5, pp. 447–459, 2001. [21] L. M. J. Florack, Image structure, Kluwer Academic Press, Dordrecht, 1997.