Department of Physics and Astronomy University of Heidelberg
Diploma thesis in Physics submitted by Nathan Hüsken, born in Landau in der Pfalz, 2010
Active and Online Learning for Interactive Image Analysis
This diploma thesis has been carried out by Nathan Hüsken at the Heidelberg Collaboratory for Image Processing (HCI) under the supervision of Prof. Dr. F. A. Hamprecht and Dr. Ullrich Köthe
Active and Online Learning for Interactive Image Analysis: In modern image analysis, classifiers such as Support Vector Machines and Random Forests are used to classify regions in images. Gathering labels to train classifiers is both time-consuming and expensive. Active learning strategies reduce the labeling demand by requesting labels that are valuable for the classification. The method Active Segmentation has been developed to apply active learning to segmentation tasks. To enable real-time user interaction, active learning algorithms need to respond fast. To decrease training time, an online support vector machine based on laSvm has been developed and online Random Forest variants have been investigated. Both were integrated into the interactive image labeling tool Ilastik. While Random Forests work out of the box, the performance of a Support Vector Machine highly depends on several hyperparameters. Here, automatic model selection by gradient descent on the Xi-Alpha bound, an error bound for the generalization performance of SVMs, has been investigated and applied. The prediction time of online Random Forests has been significantly reduced by online prediction clustering. Cover trees and the removal of linearly dependent support vectors have been investigated to speed up the prediction of the online Support Vector Machine.
Declaration: I certify that I have written this thesis independently and have used no sources or aids other than those indicated.
Heidelberg, February 9, 2010 ....................................... (Signature)
Acknowledgements I wish to thank Professor Fred A. Hamprecht for the opportunity of writing this thesis and my adviser Dr. Ullrich Köthe for his guidance throughout the thesis and his support with various tasks. I wish to thank the whole multidimensional image processing group for being nice colleagues, open whenever I needed someone to discuss ideas. Among the people in the group I owe special thanks to Christoph Sommer. Last but not least I wish to thank Eva Münich for her understanding and support.
Contents

Introduction
Datasets

Part I. Active Segmentation

1. Active segmentation
   1.1. Introduction
   1.2. Preliminaries
      1.2.1. The Rand Index
      1.2.2. Watershed segmentation
      1.2.3. The maximin path
   1.3. The active segmentation algorithm
      1.3.1. Getting the maximin path from a watershed segmentation
      1.3.2. Setting temporary labels
      1.3.3. Finding valuable pixel pairs
   1.4. Implementation
   1.5. Experiments
      1.5.1. Berkeley segmentation database
      1.5.2. SBFSEM
   1.6. Discussion

Part II. Online learning

2. Online Support Vector Machine
   2.1. Support vector machines
   2.2. Karush-Kuhn-Tucker conditions
   2.3. τ tolerance
   2.4. Sequential minimal optimization
   2.5. laSvm
   2.6. Unlearning
   2.7. Online learning with laSvm
      2.7.1. Resampling

3. Model Selection for online SVMs
   3.1. Generalization error bounds
      3.1.1. Radius Margin bound
      3.1.2. Empirical error
      3.1.3. Xi-Alpha bound
      3.1.4. Smooth Xi-Alpha bound
      3.1.5. Deriving the smooth Xi-Alpha bound
      3.1.6. Experiments
      3.1.7. Discussion
   3.2. Gradient Descent on Error bounds
      3.2.1. Gradient Descent strategies
      3.2.2. Experiments
      3.2.3. Results
      3.2.4. Discussion

4. Speeding up SVM prediction
   4.1. Linear independent support vectors
      4.1.1. Merging of support vectors
      4.1.2. Experiments
      4.1.3. Results
      4.1.4. Discussion
   4.2. Cover Trees
      4.2.1. Theory
      4.2.2. SVM Prediction utilizing cover trees
      4.2.3. Cover tree kernel sum tailored for SVM prediction
      4.2.4. Speeding up the evaluation of exp
      4.2.5. Experiments
      4.2.6. Results
      4.2.7. Finding bottlenecks in the prediction
      4.2.8. Discussion

5. Online Random Forest
   5.1. Random Forest
   5.2. Online Random Forest
      5.2.1. Online Bagging
      5.2.2. Growing of trees
      5.2.3. Threshold adjustment
      5.2.4. Relearning of trees
   5.3. Online prediction clustering
      5.3.1. Algorithm
      5.3.2. Tree relearning with online prediction clustering
   5.4. Experiments
   5.5. Results
      5.5.1. Threshold adjusting vs. no threshold adjusting
      5.5.2. Tree relearning
      5.5.3. Batch vs. online Random Forest
      5.5.4. Online prediction clustering
   5.6. Discussion

Part III. Implementation and software Infrastructure

6. Implementations of the online learners
   6.1. Online SVM
      6.1.1. svhandler
      6.1.2. OnlineSVMBase
      6.1.3. laSvmBase
      6.1.4. laSvm
      6.1.5. Unit tests
   6.2. Online Random Forest
      6.2.1. Online Learning
      6.2.2. Online prediction

7. VIGRA software infrastructure
   7.1. Exporting python functions
   7.2. vigranumpy documentation
   7.3. Organizing the modules

8. Online learning in Ilastik
   8.1. Label Queue
      8.1.1. Online learner base class
      8.1.2. Controlling the online learner
   8.2. Results and Discussion

Conclusion
Introduction Vital tools for modern image processing and analysis are machine learning algorithms such as support vector machines [Schölkopf and Smola, 2001] and Random Forests [Breiman, 2001]. A general approach is to train a classifier to discriminate different regions in the image. For example, a classifier could be used to classify every pixel in an image as belonging either to the foreground or the background. Training a classifier requires two inputs: features have to be computed on the pixels and their neighborhood, and labels have to be created for a subset of the pixels. The features embed the samples in a high-dimensional space called the feature space, while the label denotes to which class the sample belongs. The assignment of labels to pixels is usually done by a human expert annotating regions or pixels in the image. This task is often expensive and time-consuming. For building an accurate classifier, some labels are more informative than others. A label for a sample very close to an already labeled sample of the same class is much less valuable than a label for a sample in an unexplored region of the feature space. If the labels are more valuable, fewer of them are needed to build a sufficiently accurate classifier. Ilastik [Sommer et al., 2010] is a graphical user interface for classifying and segmenting image data. The user can create classes, each represented by a color in which its labels are displayed. Pixels of images can then be labeled using a brush tool. Features can be selected and computed per mouse click. On these features and labels a classifier can be trained, and the loaded data items can be predicted. A goal of this thesis is to research and develop an interactive mode for Ilastik. To improve the classifier accuracy and reduce labeling time, the user should receive feedback informing him which labels would be informative and improve the classifier most.
Active learning attempts to decrease the number of labels needed by allowing the learning algorithm to actively ask the user for labels. Settles [2009] briefly describes different scenarios for active learning. Depending on the kind of data, different strategies to query samples are available. In the “membership query synthesis” scenario [Angluin, 1988], the active learner may ask for labels anywhere in the input space, including instances generated by the active learner itself. While generating query instances could be most useful for the active learner, it might not be possible for an oracle such as a human annotator [Baum and Lang, 1992] to label such instances. In the case of labeling image pixels, the human annotator labels pixels while the classifier works in feature space; labeling directly in feature space is difficult if not impossible for a human labeler.
In “stream-based selective sampling” [Atlas et al., 1990], a stream of unlabeled samples is presented to the learner, which decides for each sample whether to query it or not. The advantage over membership query synthesis is that the samples come from the underlying distribution, so the learner focuses on dense areas of the feature space. The scenario that applies to classifying pixels in images is called “pool-based sampling”: a pool of unlabeled data exists and the active learner chooses the sample from the pool which it finds most beneficial. Many strategies have been developed to query new instances for labeling. A basic active learning strategy, known as uncertainty sampling [Lewis and Gale, 1994], asks for the label of the sample for which the decision is most uncertain. For Support Vector Machines (which will be discussed in chapter 2), this approach was given a theoretical foundation in Tong and Koller [2000], who introduced it as a heuristic for halving the version space. Another query strategy with a more solid theoretical background [Seung et al., 1992] is the “query-by-committee” strategy: a set of models representing competing hypotheses is trained on the currently labeled samples, the unlabeled samples are predicted by all models, and the sample on which the models most disagree is queried. Interestingly, this is exactly what happens when uncertainty sampling is applied using Random Forests [Breiman, 2001]. The concept of Random Forests will be discussed in chapter 5. An overview of active learning strategies is presented in [Settles, 2009]. In chapter 1 an active learning algorithm for image segmentation is introduced and evaluated. In Ilastik, a human labeler is equipped with a brush tool and can quickly label many pixels located closely together. This advantage would be lost if an active learner asked for single labels, jumping between images or regions of one image.
In addition, the labeler would have to adapt to a new image or image region for every given label. A solution is to display a map overlaying the image, marking regions that would improve the classification when labeled. A good active learning strategy might be to query samples that are predicted incorrectly by the current model. Since the active learner does not have this information, it cannot follow this strategy. But when the current classification result is displayed as an overlay over the image, the human annotator, knowing the problem at hand, can quickly identify these instances and correct them. Therefore, in the case of classifying pixels in an image, the best active learning strategy might be to interact with the user by displaying the intermediate classification results. Since the labeler can label many pixels quickly and easily, waiting for a response of the active learning method wastes valuable time; the method has to respond quickly in order to be valuable. For presenting an intermediate result to the user, a classifier needs to be trained on the labels gathered so far. Most active learning strategies also rely on a trained classifier. During the labeling process, the set of training samples grows over time. An offline
classifier needs to train on this growing set over and over again. The idea of online, or incremental, learning is to train a classifier on the complete training data by starting from the classifier trained on the last subset. Since only a few samples were added, the full classifier is usually not very different from the last one, and the model does not have to be changed much. Depending on the learning method, the new model can be reached in much less time than a full retrain would take. Derived from the perceptron algorithm [Aizerman, 1964], early online learners perform a single sweep over the data: a linear hyperplane is learned and modified whenever a misclassified sample is encountered. Very little memory is required because samples can be discarded after examination. Novikoff’s theorem [Novikoff, 1962] proves the convergence of these algorithms if a solution exists. However, the algorithm cannot handle noisy datasets very well. Support vector machines and their success [Boser et al., 1992] showed that large margins were desirable, and perceptron algorithms explicitly constructing large margins were developed [Frieß et al., 1998, Gentile, 2002, Li and Long, 2000, Crammer et al., 2003]. They keep a working set of input vectors, called support vectors. With the demand for a large margin, the number of support vectors grows fast and memory limitations became a problem. In Crammer et al. [2003] a kernel perceptron with a removal step, removing vectors from the working set, was suggested. On noise-free datasets, this budget kernel perceptron performs quite well; unfortunately, noise-free datasets seldom occur in the real world. In Cauwenberghs and Poggio [2000] a recursive algorithm for learning samples one at a time is presented, also featuring unlearning. Unlearning means that the classifier can be transformed to the state it would have reached without ever seeing a specific sample.
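The single-sweep perceptron scheme described above can be sketched in a few lines. This is a minimal illustration on made-up toy data, not the budgeted kernel variants cited above:

```python
import numpy as np

def perceptron_online(stream, n_features, lr=1.0):
    """Single-sweep online perceptron: the hyperplane is modified only
    when a misclassified sample is encountered.

    `stream` yields (x, y) pairs with y in {-1, +1}; each sample can be
    discarded after examination, so memory stays constant.
    """
    w = np.zeros(n_features)
    b = 0.0
    for x, y in stream:
        if y * (np.dot(w, x) + b) <= 0:  # misclassified (or on the boundary)
            w += lr * y * x              # move the hyperplane toward the sample
            b += lr * y
    return w, b

# Toy linearly separable data: the class is the sign of the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] > 0, 1, -1)
w, b = perceptron_online(zip(X, y), n_features=2)
```

Because samples are seen exactly once and only mistakes trigger an update, the memory footprint is independent of the stream length; this is exactly the property the large-margin and budgeted variants later had to give up by keeping a working set of support vectors.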
Another SVM-based online learner, also featuring a working set of support vectors and a removal step, is laSvm [Bordes et al., 2005], which will be used and adapted in this thesis. In Part II, we will investigate two online learners. The first is based on laSvm [Bordes et al., 2005], an SVM optimizer based on sequential minimal optimization [Platt, 1998]. The advantage of laSvm is that an approximate solution can be reached very fast and improved when more time is available. This fits the situation in Ilastik very well: as soon as the user has labeled, a new solution is needed fast, but while the user looks for new labels, there is time to improve the existing solution. The second online learner is an online version of Random Forests [Breiman, 2001], based on Saffari et al. [2009]. While a Random Forest works “out of the box” without optimizing any hyperparameter, the generalization performance of a support vector machine highly depends on the chosen kernel and its parameters [Adankon and Cheriet, 2007]. For Ilastik the online learner should work as a “plug-in” method, requiring all parameters to be optimized automatically. Traditional methods like cross validation are too computationally expensive. In chapter 3 a fast method for automatic model selection is investigated.
Training a model is only the first step toward presenting an intermediate prediction to the user. Predicting the complete unlabeled data of a decent-sized image of 500 · 500 = 250,000 pixels can take too long to be interactive. In sections 4.1 and 4.2 methods for speeding up SVM prediction are investigated, and in section 5.3 a method for decreasing Random Forest prediction time is developed.
Datasets

Table 1 lists the datasets used throughout this thesis.

Table 1.: Datasets used in this thesis

Short name       # train samples   # test samples   # features   Source
a1a                        1,605           30,956          123   1)
a2a                        2,265           30,296          123   1)
a3a                        3,185           29,376          123   1)
breast-cancer                683             none           10   1)
cod-rna                   59,535          271,617            8   2)
german.numer               1,000             none           24   1)
ijcnn1                    49,990           91,701           22   3)
ionosphere                   351             none           34   1)
svmguide1                  3,089            4,000            3   4)
mushrooms                  8,124             none          112   1)
splice                     1,000            2,175           60   1)
australian                   690             none           14   1)
voxels                    10,018             none           40   5)

1) UCI [Asuncion and Newman, 2007]   2) Uzilov et al. [2006]   3) Prokhorov [2010]   4) Hsu et al. [2003]   5) Andres et al. [2008]
All datasets but the voxels dataset were obtained from the libsvm website [Chang and Lin, 2001]. For details on the datasets, see the corresponding citations and the libsvm website at http://www.csie.ntu.edu.tw/~cjlin/. The voxels dataset was provided by Björn Andres and is also described in the corresponding citation. We briefly describe a few of the datasets as examples. The a1a, a2a and a3a datasets are extracts from the UCI Adult dataset, in which the prediction task is to determine whether a person makes more than 50,000 US dollars a year based on an extract of personal information. They have been preprocessed by discretizing the continuous features into quantiles and representing each quantile as a binary feature. Categorical variables with n categories are converted into n binary features.
In the splice dataset, also from the UCI Machine Learning Repository, the task is to recognize two types of splice junctions in DNA sequences. A splice junction is a point on a DNA sequence at which ‘superfluous’ DNA is removed during protein creation in higher organisms; splice junctions form the boundaries between exons (the parts of the DNA remaining after splicing) and introns (the removed parts). The features are DNA sequences. In the german.numer dataset, also originating from the UCI Machine Learning Repository, credit applicants are classified as “good” or “bad”. The features include several categorical variables transformed into series of binary variables; the meaning of the features can be looked up on the UCI homepage. The voxels dataset was created by Björn Andres and contains features computed on neural tissue data from Denk and Horstmann [2004]. The voxels were labeled as described in Jain et al. [2007]. 40 generic rotation-invariant features were computed on the data, comprising the eigenvalues of the structure tensor and the Hessian matrix, the gradient magnitude and the original gray value. We will refer to the complete set of datasets in table 1 as the svm_sets. Because ijcnn1 and cod-rna are much bigger than the rest, they had to be omitted in many experiments. We will refer to all datasets but ijcnn1 and cod-rna as the svm_small_sets.
Part I.
Active Segmentation
1. Active segmentation 1.1. Introduction In Turaga et al. [2009] a novel image segmentation algorithm was introduced, which improves segmentation by directly optimizing the Rand Index, a well-known segmentation performance measure. The algorithm works by learning an affinity graph on the pixel neighborhood; all edges below a certain threshold are removed to create the final segmentation. The novelty of Turaga et al. [2009] lies in the selection of the edges to be learned: instead of learning the affinity edges for randomly chosen neighboring pixel pairs, the minimum gray value along the maximin path between two arbitrary pixels is learned. We propose the active learning method Active Segmentation, adapting the ideas from Turaga et al. [2009] to watershed segmentations. Active Segmentation asks the user if certain pairs of pixels belong to the same segment. The pairs of pixels gathered this way are called marked pixel pairs. Using the information from the marked pixel pairs, temporary labels are set based on the maximin path. In addition, the maximin path can reveal inconsistencies between the current labeling and the information from the marked pixel pairs. The user can then be asked to correct these problems and thereby provide valuable labels. In watershed segmentation, no affinities between neighboring pixels are learned; instead, the segmentation is done on a probability map assigning to every pixel a probability of belonging either to a border separating segments or to the interior of a segment. We will refer to pixels separating segments as “border pixels” and to the rest as “interior pixels”. The probability map is created by learning a binary classifier on a few labeled pixels and predicting on the whole image. For further information about the original maximin algorithm, refer to Turaga et al. [2009].
1.2. Preliminaries 1.2.1. The Rand Index We define a segmentation as an assignment S of every pixel i to a segment labeled s_i. Two pixels i, j belong to the same segment iff s_i = s_j. The Rand Index [Turaga et al., 2009] measures the similarity between two segmentations, in our case between the ground truth and the learned segmentation. The indicator function δ(s_i, s_j) is 1 if
two pixels belong to the same segment (if s_i = s_j) and 0 otherwise. Let S and Ŝ be two segmentations. The Rand Index RI is defined as in Turaga et al. [2009]:

RI(Ŝ, S) = (N choose 2)^{-1} ∑_{i<j} |δ(s_i, s_j) − δ(ŝ_i, ŝ_j)|

The factor (N choose 2)^{-1} normalizes the Rand Index, making sure it is between 0 and 1. An intuitive way of understanding the sum is to observe that it adds one for every pixel pair that has a different indicator value in the two segmentations. Therefore the Rand Index is higher the more different the segmentations are. Turaga et al. [2009] suggest that, in order to improve the segmentation, this Rand Index should be minimized.
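The definition above can be computed directly by comparing the indicator values over all pixel pairs. A brute-force sketch follows; for large images one would instead count segment co-occurrences, since the pairwise loop is quadratic in the number of pixels:

```python
from itertools import combinations

def rand_index(s, s_hat):
    """Rand Index as defined above: the fraction of pixel pairs whose
    same-segment indicator differs between the two segmentations.
    0 means identical partitions, 1 means maximally different."""
    assert len(s) == len(s_hat)
    n = len(s)
    # Add one for every pair (i, j), i < j, with differing indicator values.
    disagreements = sum(
        (s[i] == s[j]) != (s_hat[i] == s_hat[j])
        for i, j in combinations(range(n), 2)
    )
    # Normalize by the number of pairs, (N choose 2).
    return disagreements / (n * (n - 1) / 2)

# Identical partitions (up to renaming of segments) have index 0.
print(rand_index([0, 0, 1, 1], [5, 5, 7, 7]))  # 0.0
```

Note that the segment labels themselves are irrelevant; only the partition matters, which is why the renamed partition above still scores 0.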
1.2.2. Watershed segmentation The watershed segmentation [Roerdink and Meijster, 2000] is one of the state-of-the-art segmentation algorithms. Its input is a height map and a seed map. In our case the height map is the probability output of a binary classifier; it is represented by a grayscale image whose gray values are the probabilities of belonging to a border. The seeds in the seed map are the starting points for the segments created by the watershed segmentation: every seed, or every group of completely connected seeds, will form exactly one segment in the final result. The watershed algorithm maintains a threshold variable which is increased in every step, and augments every region with neighboring pixels that have a gray value below the threshold and do not belong to any other region. Besides the segmentation, the watershed algorithm can output the watercourses, which are of crucial importance to us. Definition 1.1. Let D_{i,j} be the set of connections between seed i and seed j that do not cross any segments but the segments originating from seeds i and j. The watercourse wc_{i,j} is the connection in D_{i,j} for which the highest gray value it crosses is lowest. We use this property in a later section. The watercourses are found by keeping track of the positions where two regions first meet and recursively following the neighboring pixels responsible for adding the current pixels, until arriving at a seed. For the sake of simplicity we will treat seeds as if they were single pixels, although they can be regions; all concepts carry over to regional seeds. Here a watershed implementation by Björn Andres is used.
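The flooding procedure described above can be sketched as a priority-queue flood from the seeds. This is a minimal illustration of the principle only: it ignores watershed lines and watercourse tracking, and it is not the implementation by Björn Andres used in this thesis:

```python
import heapq
import numpy as np

def seeded_watershed(height, seeds):
    """Seeded watershed by priority flooding: unlabeled pixels are absorbed
    into a neighboring region in order of increasing height, so each group
    of connected seeds grows into exactly one segment.

    `height` is a 2-D array (e.g. border probabilities); `seeds` is an
    integer array with 0 for unlabeled pixels and k > 0 for seed k.
    """
    labels = seeds.copy()
    rows, cols = height.shape
    heap = []
    for r in range(rows):
        for c in range(cols):
            if seeds[r, c]:
                heapq.heappush(heap, (height[r, c], r, c))
    while heap:
        h, r, c = heapq.heappop(heap)  # lowest pixel first: rising threshold
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and labels[nr, nc] == 0:
                labels[nr, nc] = labels[r, c]  # absorb into the region
                heapq.heappush(heap, (height[nr, nc], nr, nc))
    return labels

# Two seeds separated by a high ridge in the middle column.
height = np.array([[0.1, 0.1, 0.9, 0.1, 0.1]] * 3)
seeds = np.zeros((3, 5), dtype=int)
seeds[1, 0], seeds[1, 4] = 1, 2
labels = seeded_watershed(height, seeds)
```

Popping pixels in order of height plays the role of the rising threshold: a region cannot expand across the ridge until every lower pixel on both sides has been flooded, so the two seeds meet at the ridge.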
1.2.3. The maximin path We define the maximin path for watershed segmentation following Turaga et al. [2009]. Let h_k be the height, or gray value, of pixel k. For two pixels i, j, let P_{i,j} be the set of all paths between i and j. For a given path p ∈ P_{i,j} – which is represented by the set of pixels it crosses on the image – we define its height as the maximum gray value over all pixels making up p:

h(p) := max_{k∈p} h_k

The corresponding pixel is defined by:

maxpixel(p) := argmax_{k∈p} h_k

The maximin path p*_{i,j} for pixels i and j is the path in P_{i,j} with minimal h(p). In other words, p*_{i,j} ∈ P_{i,j} is the path that minimizes the maximum height on its way:

p*_{i,j} := argmin_{p∈P_{i,j}} h(p) = argmin_{p∈P_{i,j}} max_{k∈p} h_k

Figure 1.1.: Toy example illustrating a maximin path on a probability map. The brightness of the pixels encodes the height h_k; in our example, this is the border probability. The red pixels are seeds for the watershed segmentation and the green line follows the maximin path.
Figure 1.1 shows a toy example illustrating a maximin path on a probability map. The importance of the maximin path is illustrated by the following lemma:

Lemma 1.1. If the seeds for the watershed are found by thresholding the height map with a threshold t, then two seeded pixels i, j belong to the same segment if and only if h(p*_{i,j}) < t.

Proof. If h(p*_{i,j}) < t, then every pixel along p*_{i,j} has a height lower than t. Therefore there is a path from i to j completely filled with seeds, and i and j will belong to the same segment. If h(p*_{i,j}) ≥ t, then every path p ∈ P_{i,j} contains a pixel k with h_k ≥ t, and the seeds must be separated. Since the watershed does not merge seeds, they will not belong to the same segment.
Assume a pixel pair has been marked as belonging to the same segment or to different segments. Following up on lemma 1.1, to strengthen this property the maxpixel of the maximin path between the pixels should be labeled “border” or “interior”, depending on whether the pixels belong to the same segment. We will refer to such pixel pairs as “marked pixel pairs” in the following. Definition 1.2. A marked pixel pair is a tuple (p, c) where p is a set of two pixels and c a truth value indicating whether the pixels belong to the same segment.
1.3. The active segmentation algorithm The active segmentation algorithm consists of two main building blocks, which will be discussed in the next subsections. The routine RequestPixelPair picks a pair of pixels and asks the user if they belong to the same segment. It returns a marked pixel pair together with labels for the requested pixels. The routine GetLabelsFromMarkedPixels creates a set of labels from a set of marked pixel pairs, as suggested by lemma 1.1. Because these labels may be based on an intermediate result, they may be wrong; they are removed and recalculated when new sure labels arrive. In a later version, GetLabelsFromMarkedPixels also returns a set of labeled pixels for which the label is certain. For a better understanding of the development of the building blocks, the main algorithm combining them is given in algorithm 1. It requires a starting set of pixel labels L. The GetLabelsFromMarkedPixels function returns two sets of labels: the first, SureLabels, are certain because the user has been asked for them; the second, TempLabels, are temporary labels found based on the marked pixel pairs.

Algorithm 1 Active segmentation algorithm
 1: procedure ActiveSegmentation(start labels L)
 2:     T ← {}    . Set temporary labels to empty set
 3:     C ← {}    . Set marked pixel pairs to empty set
 4:     while user does not stop process do
 5:         (SureLabels, TempLabels) ← GetLabelsFromMarkedPixels(C, L)
 6:         T ← TempLabels
 7:         L ← L ∪ SureLabels
 8:         Get height map H by learning a random forest on L ∪ T and predicting
 9:         (NewLabeledPixels, MarkedPixelPair) ← RequestPixelPair(H)
10:         L ← L ∪ NewLabeledPixels
11:         C ← C ∪ {MarkedPixelPair}
12:     end while
13:     return L
14: end procedure
The algorithm has been implemented in "classy", the predecessor of Ilastik (see chapter 8), which is written in Matlab.
1.3.1. Getting the maximin path from a watershed segmentation

To find the maximin path between marked pixels based on a height map, the watercourse property seems useful. And indeed, if the watershed is applied to an image with only two seeds, the watercourse is identical to the maximin path. We could in principle find all maximin paths by applying the watershed for every pair of marked pixels as seeds and retrieving the watercourse. But in practice there are many marked pixels, and the watershed would need to be recalculated for every pair, which leads to an infeasible number of watershed calculations.

In general, if we applied the watershed algorithm with all marked pixels as seeds and restricted the image to two segments, the watercourse would be the maximin path between the seeds of the segments it originates from. Unfortunately this property does not hold if the restriction to the two segments is removed. This is illustrated in figure 1.2. From definition 1.1, we know that the watercourse wc consists only of pixels belonging to the segments it connects. The maximin path can be a "shortcut" leading over segments that are different from the segments of the watercourse. In addition, there is no guarantee for a pair of seeds to have a common watercourse.
Figure 1.2.: Marked pixels used as seeds for a watershed segmentation are denoted by circles. The black lines illustrate the watercourses between the marked pixels and are labeled by the height of the watercourse. The maximin path between the red and blue seeds is illustrated by the yellow arrow.

Definition 1.3. Watercourse graph: Based on a watershed segmentation, where marked pixel pairs have been used as seeds, we build a weighted graph with marked pixels as vertices and
watercourses as edges. The graph is guaranteed to be connected. Each edge corresponds to a watercourse wc and is assigned a weight of h(wc). We call this weighted graph the watercourse graph. Analogous to the definition of a maximin path in an image, we define the maximin path between two vertices in the graph as the path whose maximal edge weight is minimal.

A maximin path between a marked pixel pair used as seeds in a watershed can be found by concatenating all the watercourses of the maximin path in the corresponding watercourse graph. This is easily seen by noticing that any path between the marked pixel pair has to touch all segments corresponding to seeds lying on some path in the watercourse graph. To get from seed i to seed j, the path has to cross some pixel k with h_k ≥ h(wc_{i,j}). The path composed of the watercourses can therefore not have any point higher than the highest point on any path between the marked pixels.

To find the maximin path on a watercourse graph, a slightly modified version of Dijkstra's algorithm [Dijkstra, 1959] has been applied. Instead of summing up all the weights on its way, it just remembers the highest weight it encountered.
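The modification to Dijkstra's algorithm can be sketched as follows. This is a minimal illustration, not the thesis implementation; the adjacency-list representation and the function name are assumptions.

```python
import heapq

def maximin_path_height(graph, src, dst):
    """Minimal achievable maximum edge weight ("bottleneck") between two
    vertices: Dijkstra's algorithm with max() in place of summation.
    `graph` maps a vertex to a list of (neighbour, edge_weight) pairs."""
    best = {src: 0.0}                  # lowest known bottleneck per vertex
    heap = [(0.0, src)]
    while heap:
        bottleneck, v = heapq.heappop(heap)
        if v == dst:
            return bottleneck
        if bottleneck > best.get(v, float("inf")):
            continue                   # stale queue entry
        for n, w in graph[v]:
            cand = max(bottleneck, w)  # remember highest weight, do not sum
            if cand < best.get(n, float("inf")):
                best[n] = cand
                heapq.heappush(heap, (cand, n))
    return float("inf")                # dst not reachable
```

Tracking a predecessor for each vertex in the same loop would additionally recover the maximin path itself, i.e. the sequence of watercourses to concatenate.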
1.3.2. Setting temporary labels

The first naive implementation of GetLabelsFromMarkedPixels learns the height map, applies the watershed and finds all maximin paths as described above. Figure 1.3 shows a partial screenshot of the resulting algorithm in action. The green and red pixels show pixels for which the labels are "interior" and "border" respectively. The yellow lines denote the maximin paths between these pixel pairs, and the red cross is a temporary label set by GetLabelsFromMarkedPixels. The pink arrows illustrate that all the pixel pairs result in the same temporary label. And indeed, as can be seen from the yellow lines, all maximin paths go through the same central point and have the same maxpixel.

In the example, about 10 pixel pairs are marked, yet very little is extracted from them. The red cross is not the only position having a low border probability on paths between the known marked pixel pairs. Indeed, the algorithm asked for most of the indicated pixel pairs based on a classifier which had been trained including the label marked by the red cross. Therefore there must be other potential "cracks" that could and should be closed by labeling the corresponding pixels. A solution would be to make the temporary labels permanent and reuse them in the next round. This unfortunately yields a high risk of making a wrong label permanent and decreasing the classifier performance in the long term. But we can mark more temporary labels by retraining the classifier with the temporary labels collected so far: the temporary labels are added, and the classifier is relearned. Based on the new classifier output, the watershed is repeated and new temporary labels are added. This process is repeated until no interesting labels are found. In the next invocation of GetLabelsFromMarkedPixels, when new sure labels from the user are available, the temporary labels are discarded.
Figure 1.3.: The algorithm displaying its actions: The green dots show pixels for which the label is "interior" while the red pixels have a "border" label. The yellow line denotes a found maximin path and the red cross marks a temporary border label. The pink arrows visualize that all pixel pairs suggest the same temporary label.

Often the algorithm sets premature labels in wrong places. In such situations the label often contradicts the current classifier output. Let us assume the classifier knows that pixels i and j do not belong to the same segment. Therefore there must be a border pixel on p*_{i,j}. Assume further that h(p*_{i,j}) is low. The classifier could now label the maxpixel, but chances are that that label is not correct. Another label on the path (also having a low probability) could be the correct border position. For this reason, the algorithm is extended by asking the user for the correct pixel label if the maxpixel has a low border probability.

While this worked in many cases, the algorithm, while searching for the border, tended to ask quite a lot on interior pixels based on a single pixel pair not belonging to the same segment. The algorithm requested labels on the maximin path until a border was found. When the maximin path contains only points with very low border probability, the probability is high that the maxpixel on that path is an interior pixel. In such cases the algorithm gathered a lot of labels for interior pixels while searching for the border pixel on the maximin path. Because the gathered pixels already had a very low border probability, their labels were of little value. A solution to this problem was adding yet
another question. If the maxpixel has a very low border probability (below a certain threshold), the algorithm displays the current maximin path to the user and requests the position of the border within this path. Labels the user has been asked for are added to the set of permanent labels.

The user can only be asked for the border on the maximin path of pixel pairs if the pixels belong to different segments. In the preliminary experiments done in classy, the problem of the classifier often asking for invaluable pixels did not occur with marked pixels belonging to the same segment. Figure 1.4 shows the situation and illustrates that the algorithm is indeed capable of localizing cracks between segments.

The threshold at which the user is queried for the label of the maxpixel on the maximin path has been adjusted manually by visual inspection in classy; it has been set to 0.5 both for marked pixel pairs belonging to the same segment and for marked pixel pairs belonging to different segments. The threshold for asking for the border position on the maximin path has been set the same way, to 0.2. Collecting temporary labels is stopped when no temporary labels are found anymore that are expected to contribute much to the classification performance. We set this criterion to be true if the probability of misclassification is below 30%.

The final version of GetLabelsFromMarkedPixels is given in algorithm 2. The procedure GetMaximinPath returns the maximin path between two seeds as described above.
1.3.3. Finding valuable pixel pairs

The marked pixel pairs are requested from the user based on a watershed segmentation done the same way as the final segmentation will be created. The height map is predicted with a classifier trained on all sure and temporary labels collected so far. The seed map is constructed by thresholding the height map with a seed threshold t. Watershed segmentation is applied and a watercourse wc is selected by applying uncertainty sampling [Lewis and Gale, 1994] to the maxpixels of the watercourses. This results in the watercourse for which the border probability of the maxpixel is closest to 0.5 being selected. A marked pixel pair which would result in temporarily labeling maxpixel(wc) is then searched for.

Let a and b be seeds and wc_{a,b} the selected watercourse. We set k = maxpixel(wc_{a,b}). The algorithm needs to find a pair of pixels i, j such that maxpixel(p*_{i,j}) is the desired point k. The seeds a and b responsible for the watercourse are good candidates for such a pixel pair, but as we have seen in the last paragraph, maxpixel(p*_{a,b}) does not necessarily have to be equal to maxpixel(wc_{a,b}). Algorithm 3 returns a set of pixel pairs for which the maxpixel of the maximin path is the desired one, which will be proved in lemma 1.2. Algorithm 3 can return an empty set, meaning that there are no pixel pairs with the desired property. In that case the uncertainty of maxpixel(wc) is set to 0 and another watercourse is searched, until one for which algorithm 3 does not return an empty set is found.
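The uncertainty-sampling selection of a watercourse can be sketched as follows. This is an illustration under assumed data structures (watercourses as lists of pixel coordinates, the height map as a probability lookup), not the thesis code.

```python
def select_watercourse(watercourses, height_map):
    """Uncertainty sampling over watercourses: pick the one whose maxpixel
    border probability lies closest to 0.5, i.e. the one the classifier is
    most uncertain about. `height_map[p]` is the predicted border
    probability of pixel p."""
    def uncertainty(wc):
        maxpixel = max(wc, key=lambda p: height_map[p])  # highest point on wc
        return abs(height_map[maxpixel] - 0.5)
    return min(watercourses, key=uncertainty)
```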
Algorithm 2 Set temporary labels based on marked pixels. GetMaximinPath returns the maximin path between two seeds based on the watercourse graph as described in the text.
 1: procedure GetLabelsFromMarkedPixels(marked pixel pairs C, labels L)
 2:     T ← {}                       ▷ start with empty temporary label set
 3:     S ← {}                       ▷ this will be the set of labels we are sure about
 4:     stop ← False
 5:     while stop = False do
 6:         stop ← True              ▷ this stays until a reason not to stop is found
 7:         train classifier based on L ∪ T ∪ S and get height map H
 8:         seeds ← marked pixels from C
 9:         (segmentation, WC) ← watershed(H, seeds)
10:         for all c ∈ C do
11:             m ← GetMaximinPath(c, WC)
12:             p ← maxpixel(m)
13:             if not Stop(c, h(p)) then
14:                 stop ← False     ▷ do not stop unless all given labels are of little use
15:             end if
16:             if c marks pixels of same segment then
17:                 if h(m) < 0.5 then
18:                     T ← T ∪ {(p, Interior)}      ▷ temporarily label pixel
19:                 else
20:                     Ask user for label of p and insert into S
21:                 end if
22:             else
23:                 if h(m) > 0.5 then
24:                     T ← T ∪ {(p, Border)}
25:                 else if h(m) > 0.2 then
26:                     Ask user for label of p and insert into S
27:                 else
28:                     Show m to user and ask him to label border pixels. Add these to S
29:                 end if
30:             end if
31:         end for
32:     end while
33:     return (S, T)
34: end procedure

35: ▷ Returns true if new information is of little value
36: procedure Stop(marked pixel pair c, height h(p))
37:     if (c marks pixels to be of same segment) and h(p) < 0.3 then
38:         return True
39:     end if
40:     if (c marks pixels to be of different segments) and h(p) > 0.7 then
41:         return True
42:     end if
43:     return False
44: end procedure
Figure 1.4.: The algorithm displaying its actions: The green dots show pixels for which the label is "interior" while the red pixels have a "border" label. The yellow lines denote the maximin path on which the algorithm currently requests a label from the user.

Lemma 1.2. For a given seeded pixel pair (i, j) with common watercourse wc_{i,j}, algorithm 3 returns all pairs of seeded pixels (a, b) for which

    maxpixel(p*_{a,b}) = maxpixel(wc_{i,j})    (1.1)
Proof. The pairs returned in line 20 consist of pairs taken from site_i and site_j. site_i is built up for s = i and not touched for s = j. We first show that if the algorithm finishes in line 20, then:

    t ∈ site_i ⇔ there is a path p_{t,i} from i to t with h(p_{t,i}) ≤ h(wc_{i,j})

Every element of site_i is inserted in line 14, and every element in site_i has been in Queue before being inserted. The condition in line 13 ensures that when a seed t is inserted into site_i and Queue, there is a path (the watercourse wc_{q,t}) from an element q in Queue with h(wc_{q,t}) < h(wc_{i,j}). Since Queue started with i, there is by induction already a path p_{q,i} with the desired condition. p_{q,i} can be extended by wc_{q,t} to build a new path
Algorithm 3 Find all seed pairs whose maximin path crosses the watercourse of seeded pixels i, j with common watercourse wc_{i,j}
 1: procedure FindPixelPairs(Seeded pixels i, j. Seeds S. Watercourses WC = {wc_{a,b}})
 2:     site_i ← {i}
 3:     site_j ← {j}
 4:     limit ← h(wc_{i,j})
 5:     for s ∈ {i, j} do
 6:         Queue ← {s}
 7:         while Queue ≠ {} do
 8:             q ← any element from Queue
 9:             if {s, q} = {i, j} then
10:                 return {}
11:             end if
12:             for all n with ∃ wc_{n,q} ∈ WC do
13:                 if h(wc_{n,q}) < limit then
14:                     site_s ← site_s ∪ {n}
15:                     Queue ← Queue ∪ {n}
16:                 end if
17:             end for
18:         end while
19:     end for
20:     return {(a, b) | a ∈ site_i, b ∈ site_j}
21: end procedure
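Algorithm 3 can be sketched in a few lines. This is an illustrative re-implementation under an assumed representation (watercourses as a dictionary from seed pairs to heights), not the thesis code.

```python
from collections import deque

def find_pixel_pairs(i, j, watercourses):
    """Grow a "site" around each of the seeds i and j by following
    watercourses strictly lower than wc_{i,j}; every pair (a, b) with
    a in site_i and b in site_j then has maxpixel(wc_{i,j}) on its maximin
    path. `watercourses` maps frozenset({a, b}) -> height h(wc_{a,b})."""
    limit = watercourses[frozenset((i, j))]
    neighbours = {}
    for pair, h in watercourses.items():
        a, b = tuple(pair)
        neighbours.setdefault(a, []).append((b, h))
        neighbours.setdefault(b, []).append((a, h))
    sites = {i: {i}, j: {j}}
    for s in (i, j):
        queue = deque([s])
        while queue:
            q = queue.popleft()
            if {s, q} == {i, j}:
                return set()       # a lower path between i and j exists
            for n, h in neighbours.get(q, []):
                if h < limit and n not in sites[s]:
                    sites[s].add(n)
                    queue.append(n)
    return {(a, b) for a in sites[i] for b in sites[j]}
```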
p_{t,i} with h(p_{t,i}) ≤ h(wc_{i,j}). Let, on the other hand, t be a seed for which there is a path p_{t,i} with h(p_{t,i}) ≤ h(wc_{i,j}). Then there is a series

    S = (seg_1, ..., seg_n)  and  s = (s_1, ..., s_n), s_1 = i, s_n = a

of segments seg_i and corresponding seeds s_i which are crossed by p_{t,i} in order. These seeds are then inserted into Queue and site_i in that order, and t is inserted into site_i.

We now consider the algorithm exiting in line 10. This means that a path p_{i,j} from i to j with h(p_{i,j}) ≤ h(wc_{i,j}) has been found. Therefore h(p*_{i,j}) < h(wc_{i,j}) and there cannot be any path with maxpixel(p*_{a,b}) = maxpixel(wc_{i,j}). If, on the other hand, the algorithm exits in line 20, there is no path p_{i,j} between i and j with h(p_{i,j}) < h(wc_{i,j}). Because of the shown properties of site_i and site_j, for every pair (a, b), a ∈ site_i, b ∈ site_j, p*_{a,b} includes maxpixel(wc_{i,j}) and it fulfills 1.1.
When the algorithm asks the user whether a certain pixel pair belongs to the same segment, it can happen that the requested pixels are not interior pixels, and a corresponding response of the user is possible. This implies that from a user response, the labels for the individual pixels of the pair are also known. It therefore makes sense to select from the pixel pairs returned by algorithm 3 with some active learning criterion. Here pixels with border probability closer to 0.5 are preferred, which corresponds to uncertainty sampling.

Algorithm 4 finds a pixel pair for which it asks the user if they belong to the same segment and returns the gathered information together with additional labels. The seed threshold t for the watershed segmentation has been set to 0.01. A Random Forest (see chapter 5 and Breiman [2001]) with 100 trees has been used. The seed threshold results in seeds being set only in places where all trees voted for "interior".

Algorithm 4 Ask the user if a pixel pair belongs to the same segment based on a height map.
 1: procedure RequestPixelPair(height map H)
 2:     (segmentation, WC) ← watershed(H)
 3:     Order WC by active learning criterion, such that the best come first
 4:     for all wc ∈ WC do
 5:         i, j ← seeds of wc
 6:         pair ← FindPixelPairs(i, j, WC)
 7:         if not pair = {} then
 8:             ask user if pair belongs to the same segment or not
 9:             return marked pair and corresponding labels
10:         end if
11:     end for
12: end procedure
1.4. Implementation

Because classy is implemented in Matlab, the first ideas were implemented in Matlab as an extension to classy. For the Random Forest and watershed algorithm, the Matlab bindings described in Nair [2010] have been used to interface C++ code.

The first implementation requested pixel labels and border positions by displaying dialog boxes. To be able to carry out automatic experiments, these requests have been moved into separate functions. The functions display the dialog boxes or take the information from some ground truth, depending on a switch passed to them. This way it was possible to use the same code for the automatic experiments and for testing in classy. For the experiments in section 1.5.2, the algorithm also needed to work in three dimensions. It has therefore been generalized to arbitrary dimensions.
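The switch between interactive and automatic label requests can be sketched as follows. The original code is Matlab; this Python sketch only illustrates the idea, and all names are hypothetical.

```python
def request_label(pixel, interactive, ground_truth=None, ask_user=None):
    """One code path for both modes: in interactive mode a dialog queries
    the human annotator (`ask_user` is a callback, e.g. a dialog box); in
    automatic experiments the answer is read from ground truth."""
    if interactive:
        return ask_user(pixel)       # interactive session
    return ground_truth[pixel]       # automatic experiment
```

Keeping the switch inside one function means the surrounding active segmentation code is identical in both settings, which is what made the automatic experiments possible.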
1.5. Experiments
Often active learning strategies are compared with random sampling. But in this case the classes are very unbalanced (many interior pixels and few border pixels). A human labeler would not behave like random sampling, but would label more border pixels than random sampling would. Therefore the active segmentation algorithm is not compared against random sampling, but against what we define as "balanced random sampling". In balanced random sampling, labels are sampled from border and interior pixels separately, ensuring that as many border pixels as interior pixels are sampled. This imitates the expected behavior of a human labeler better than random sampling.
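Balanced random sampling amounts to stratified sampling with equal class shares; a minimal sketch (function name and signature assumed):

```python
import random

def balanced_random_sample(border_pixels, interior_pixels, n, rng=random):
    """Draw n labels, half from the border class and half from the interior
    class, imitating a human labeler who concentrates on the rare border
    pixels instead of sampling uniformly over all pixels."""
    half = n // 2
    return rng.sample(border_pixels, half) + rng.sample(interior_pixels, n - half)
```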
1.5.1. Berkeley segmentation database

The Berkeley Segmentation Dataset [Martin et al., 2001] consists of 12,000 hand-labeled segmentations of 1,000 Corel images from 30 human subjects. One half of the segmentations were created while a color image was available; the other half was done on grayscale images. Because the segmentations given by different labelers are contradictory, a single segmentation has been taken as ground truth.

The Berkeley Segmentation Dataset provides ground truth in the form of a segmented image labeling pixels with a different number for every segment. For the algorithm, additional labels marking border and interior pixels are necessary. This ground truth has been created by labeling pixels with "border" if they have a neighbor belonging to a different segment, and with "interior" otherwise. This results in a 2 pixel wide border between segments.

The experiment has been conducted on a single image from the Berkeley segmentation database, which is shown together with the segmentation results in figure 1.6. The experiments were started with 10 samples drawn from each class. For balanced random sampling, the rand index has been measured every 100 drawn labels. In the case of active segmentation it turned out to be difficult to measure the rand index at precisely every 100 labels, because many labels are drawn within GetLabelsFromMarkedPixels and no intermediate result exists at that point. The rand index has therefore been measured at roughly every 100 drawn labels. As classifier, a Random Forest [Breiman, 2001] has been trained using features calculated on the image. The features are listed in table 1.1. vigra [Köthe, 2000] has been used to calculate the features.

Result

Figure 1.5 shows the rand index in dependence of the number of drawn labels for both balanced random sampling and Active Segmentation. Because the absolute value of
Feature indexes   Description                                           Parameters
1-3               Color values of the original image                    —
4-21              Gaussian filter on scale σ for all color channels     σ = 3/4, 3/2, 5/4, 5/2, 7/5, 7/2
21-57             Eigenvalues of the Hessian on scale σ                 σ = 3/4, 3/2, 5/4, 5/2, 7/5, 7/2
57-75             Eigenvalues of the Structure tensor on scale σ        σ = 3/4, 5/4, 7/4
75-87             Gabor filter with DC component compensation 1)        σ = 1/3, 1/5, 1/7, ν = σ/2, ω = 0, π/4

1) Gabor filters are described in Movellan [2008]. Here the gaussian envelope is centered at (0, 0) with variance σ. The sinusoid has a complex polar frequency of (ν, ω). The gabor filter outputs a real and an imaginary part.
Table 1.1.: Features used for the classifier

the rand index is relatively meaningless, it has been scaled to the rand index at the first 20 labels. Since the starting label set is the same for Active Segmentation and balanced random sampling, both graphs in figure 1.5 start off at 1.0.

The results in figure 1.5 look promising. But on closer inspection, taking also into account the final segmentation, the results seem much less useful. Figure 1.6 shows the segmentations produced by balanced random sampling and Active Segmentation compared to the ground truth after roughly 1000 labels. Figure 1.6 reveals that the segmentations produced by both balanced random sampling and active segmentation are of little use. The distinction between border and interior pixels is most likely too difficult for the classifier with the features used in this experiment.
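A feature stack of the kind listed in table 1.1 can be sketched as follows. The thesis computes these features with vigra; scipy is substituted here, the structure tensor and Gabor features are omitted, and the function name and default scales are illustrative only.

```python
import numpy as np
from scipy import ndimage

def pixel_features(channel, sigmas=(0.75, 1.5, 1.25, 2.5)):
    """Per-pixel feature stack for one image channel: raw values,
    Gaussian-smoothed values, and the two eigenvalues of the Hessian
    at several scales."""
    img = channel.astype(float)
    feats = [img]
    for s in sigmas:
        feats.append(ndimage.gaussian_filter(img, s))
        # Hessian entries via Gaussian derivative filters
        hxx = ndimage.gaussian_filter(img, s, order=(2, 0))
        hyy = ndimage.gaussian_filter(img, s, order=(0, 2))
        hxy = ndimage.gaussian_filter(img, s, order=(1, 1))
        # closed-form eigenvalues of the symmetric 2x2 Hessian
        tr, det = hxx + hyy, hxx * hyy - hxy ** 2
        root = np.sqrt(np.maximum((tr / 2) ** 2 - det, 0.0))
        feats += [tr / 2 + root, tr / 2 - root]
    return np.stack(feats, axis=-1)  # shape (h, w, 1 + 3 * len(sigmas))
```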
1.5.2. SBFSEM

SBFSEM (short for serial block-face scanning electron microscopy) is described in Denk and Horstmann [2004]. It produces an image stack by scanning the surface (called block face) of a sample of tissue and cutting off a slice to reach the next plane. The slices can be as thin as 25 nm, while the resolution of every image reaches 10-20 nm. This resolution allows for the first time truly 3-dimensional image analysis [Andres et al., 2008]. Volume images gathered by SBFSEM can eventually be as large as 40000³ voxels. Here we have used a dataset of 110 × 110 × 110 voxels from the inner plexiform layer of rabbit retina [Andres et al., 2008] at a resolution of 22 × 22 × 30 nm³. Table 1.2 has been provided by Andres [2010] and lists the 28 features that have been used. The entire framework is calibrated on a training set of 110³ voxels hand-labeled by experts as described by Jain et al. [2007].

Not every pixel in the volume has been labeled. Especially on the border of the volume, labels are missing. To assure dense labels and remove artifacts in the features, a border of 10 pixels in each direction has been removed. For generating ground truth, a classifier has been trained utilizing all labels and the whole dataset has been predicted, generating a height map. To seed the watershed, different seed thresholds have been tested and the number of
Figure 1.5.: Rand index for the experiment on an image from the Berkeley Segmentation Dataset, shown in figure 1.6

connected components (segments) has been counted. The result is shown in table 1.3. After personal communication with Björn Andres and visual inspection of the segmentations, a seed threshold of 0.1 was selected.

The border labels have been generated by the same method as described in section 1.5.1. But in this dataset some borders are wider than 2 pixels, and marking such border pixels as interior pixels would generate contradictory ground truth for the classifier. For that reason all voxels having a border probability of more than 90% have been added to the set of border pixels. Also, because calculating the rand index on a 90 × 90 × 90 = 729,000 voxel dataset took too long, only a neighborhood of 15 pixels in each direction has been considered for the rand index calculation.

A big problem was the time needed for the experiment to run through. The algorithm worked on a 50 × 50 × 50 subset of the considered 90 × 90 × 90 volume. When learning was done until 1000 labels had been collected, the algorithm took about 3 days. On the other hand, it has been observed in preliminary experiments that the results vary from run to run. Therefore a statistic, be it only over a small number of runs, should be made.

Shortly before finishing the diploma thesis, a bug in the generation of the ground truth was found. The iteration variables for setting the border labels by iterating over the
Table 1.2.: The contrast of the SBFSEM volume image is enhanced based on 28 rotation-invariant non-linear features. The gradient magnitude as well as the eigenvalues of the Structure Tensor and the Hessian matrix are computed from the volume image as well as from the output of a 4-times iterated bilateral filter, for which the functions w_{σs} and w_{σv} with parameters σs and σv are used for weighting in the spatial and in the intensity domain. Derivatives are computed by DoG filters at scales σm, σi, and σh. Entries of the Structure Tensor are averaged with a Gaussian at scale σo.

Index   Feature                                     Scale
1       SBFSEM volume image                         —
2       Iterated bilateral filter [Barash, 2002]    σs = 1, σv = 3
        with w_{σs}(r) = 1/(σs (2π)^{3/2}) · exp(−r²/(2σs²))
        and  w_{σv}(v) = 1/(1 + v²/σv²)
3–4     Gradient magnitude                          σm = 1
5–16    Structure Tensor eigenvalues                σi ∈ {1, 1.5}, σo = 2σi
17–28   Hessian matrix eigenvalues                  σh ∈ {1, 1.5}
Table 1.3.: Seed thresholds on the border probability map and the resulting number of connected components.

Seed threshold            0.5   0.4   0.3   0.2   0.1
# connected components     48    59    81   104   114
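The seed-threshold sweep of table 1.3 can be sketched as follows (a minimal illustration with scipy; the function name is assumed):

```python
import numpy as np
from scipy import ndimage

def count_seed_components(height_map, seed_threshold):
    """Seeds are placed wherever the predicted border probability falls
    below the seed threshold; the connected components of that seed mask
    correspond to the segments the seeded watershed will produce."""
    seeds = height_map < seed_threshold
    _, n_components = ndimage.label(seeds)
    return n_components
```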
Figure 1.6.: Experiment on an image from the Berkeley Segmentation Dataset. The image shows the segmentation results of Active Segmentation and balanced random sampling and the ground truth. The borders between segments are marked by pink lines.

volume were upper-bounded by the size of the volume minus one, as it has to be done in C++. Matlab, on the other hand, indexes arrays such as the voxel volume from 1 to the size of the volume. This way, some border labels on the edges of the volume were not set, producing contradictory ground truth. The experiments had to be repeated in the time remaining.

Figure 1.7 shows a slice of size 50 × 50 × 1 from the volume. There are only a few different segments in this slice. When the training volume was reduced even more, no interesting watercourses remained after about 100 labels. The only remaining option was reducing the number of labels at which the experiment stops. Moreover, due to the increasing number of marked pairs, the active segmentation algorithm needs more time the more labels it has already collected. The number of labels collected has been reduced to 400.

Because one pixel wide gaps connecting segments were produced even when labeling with the complete data in the training set, the results have been smoothed with a gaussian filter before measuring the rand index. This has also been done with the height map generated in FindPixelPairs. In addition to active segmentation and balanced random sampling, a standard active learning method known as "uncertainty sampling" has been tested. For a 2-class problem, uncertainty sampling requests the label of the voxel for which the boundary probability is closest to 0.5 [Lewis and Gale, 1994].
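Per-voxel uncertainty sampling can be sketched as follows (an illustration with numpy; function name and masking of already-labeled voxels are assumptions):

```python
import numpy as np

def uncertainty_query(prob_map, labeled_mask):
    """Uncertainty sampling for a 2-class problem [Lewis and Gale, 1994]:
    request the label of the still-unlabeled voxel whose boundary
    probability is closest to 0.5."""
    score = np.abs(prob_map - 0.5)
    score[labeled_mask] = np.inf       # never re-query labeled voxels
    return np.unravel_index(np.argmin(score), prob_map.shape)
```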
Figure 1.7.: A slice of 50 × 50 × 1.

Result

Figure 1.8 shows the results for the voxel dataset. For uncertainty sampling and balanced random sampling, the results have been averaged at each measuring point and error bars are shown. Since active segmentation did not measure the rand index at the same numbers of labels in different runs, averaging was not possible; the different active segmentation measurements are shown separately in figure 1.8.

In the beginning, balanced random sampling clearly beats all other strategies. At about 150 labels, active segmentation catches up. From this point on, no real improvement in either of the two strategies can be seen. Uncertainty sampling never reaches the rand index of the others.
1.6. Discussion

Finding data with good ground truth was a big problem. The Berkeley segmentation database provides a lot of pictures with ground truth, but doing segmentation on these pictures with the features used here is extremely difficult. Even when half the picture was labeled, no satisfactory segmentation could be retrieved. For the SBFSEM dataset, the ground truth made by human labelers is sparse, and the dense ground truth has been auto-generated. In figure 1.8, it can be seen that with less than 30 labels a rand index of 5 · 10⁻⁴ was possible, while when labeling the complete learning volume it goes down to only 3 · 10⁻⁴. Problems like 1 pixel wide cracks between segments have been detected. Smoothing the probability map with a gaussian filter helped closing these cracks, but it is unclear how this post-processing affects the
Figure 1.8.: Results on the voxel dataset.

performance of the different labeling strategies. The validity of the results is therefore doubtful.

Pushing aside these doubts, figures 1.8 and 1.5 show that Active Segmentation has trouble when only few labels are available. This is most likely due to the probability map being extremely contradictory, so that temporary labels are placed at incorrect positions. When more samples are available, Active Segmentation can compete with balanced random sampling. On the one hand, this is a very good result, because balanced random sampling is hard to beat. On the other hand, it is not good enough to justify using Active Segmentation in practice. Additionally, for practical usage Active Segmentation is much too slow: the user has to wait too long between label requests.

Variants of Active Segmentation might be worth investigating. For example, the marked pixels could be collected by giving the user a tool to mark regions that have to be separated. While the user labels the image, the Active Segmentation algorithm would check in the background whether the marked pixel pairs are contradictory with the probability map. If an inconsistency is found, the user could be informed or the Active Segmentation algorithm could set labels itself.
Part II.
Online learning
2. Online Support Vector Machine

We investigate and implement an online Support Vector Machine based on laSvm [Bordes et al., 2005]. The implementation will be presented in section 6. While the online method of laSvm introduced in Bordes et al. [2005] works very well when the order of labeled samples is randomized, problems occur when there is a bias in the order of the labeled samples. If the order of the training data is determined by the labeling of a human annotator, a high risk of biased sampling exists. We will introduce resampling to solve this problem and improve the accuracy of the online Support Vector Machine.
2.1. Support vector machines

In Bordes et al. [2005] an online SVM called laSvm has been introduced. The basic idea of laSvm is to not give every vector equal attention. Instead, the algorithm focuses on an active set of support vectors.

Support vector machines have been introduced in Boser et al. [1992] and have received ample treatment, being both theoretically well founded and showing excellent generalization performance in practice. Consider a binary classification problem {(x_1, y_1), ..., (x_n, y_n)}, x_i ∈ R^d, y_i ∈ {−1, 1}, consisting of n labeled training samples in a d-dimensional input space. SVMs train a model yielding a discriminative function f which labels a prediction sample x by

    label(x) = sign(f(x)) = sign(⟨w, Φ(x)⟩ + b)    (2.1)
where Φ maps the input vectors into the so-called feature space, which is usually a higher-dimensional space than the input space. w is the normal of a hyperplane separating the data in feature space and b is called the bias term. The parameters w and b are the model learned by the SVM.

Definition 2.1. The margin of a hyperplane separating the training data is defined as the minimum distance between the hyperplane and any input vector.

The core idea of the SVM is to find a separating hyperplane which maximizes the margin [Boser et al., 1992]. Because the input data might not necessarily be separable, slack variables ξ_i for every input vector are introduced, penalizing input vectors on the wrong side of or within the margin. This also accounts for noise. Formulated as an optimization problem, this gives [Bordes et al., 2005, Burges, 1998, Schölkopf and Smola, 2001]

    min_{w,b} ‖w‖² + C ∑_{i=1}^{n} ξ_i   with   ∀i: y_i f(x_i) ≥ 1 − ξ_i, ξ_i ≥ 0    (2.2)

where C is a hyperparameter controlling the penalty of the slack variables ξ_i, f is the discriminative function from 2.1, and w is the normal of the separating hyperplane as defined in 2.1. Problem 2.2 can be transformed using the Wolfe dual to

    max_{α,b} W(α)   with    (2.3)

    ∑_{i=1}^{n} α_i = 0,   ∀i: A_i ≤ α_i ≤ B_i,  A_i = min(0, C y_i),  B_i = max(0, C y_i)    (2.4)

α are the Lagrangian variables of the Wolfe dual. The objective function W is defined by
$$W(\alpha) := \sum_{i=1}^{n} \alpha_i y_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j \langle \Phi(x_i), \Phi(x_j) \rangle \tag{2.5}$$
In the dual the discriminative function 2.1 transforms to n
f ( x) =
∑ αi hΦ(xi ), Φ(x)i + b
i =1
The advantage of this formulation is, of the input vectors xi → Φ( xi )
that the mappings is only used in the scalar product Φ( xi ), Φ( x j ) . The well established “Kernel Trick” now replaces this scalar product by a kernel function K ( x, y) := hΦ( x), Φ(y)i
(2.6)
This way the Φ-s have not to be explicitly known to solve 2.3. and mappings, which are not explicitly computable, are possible. The kernel functions must be a dot product of the input vectors mapped into the feature space. In other words, there must be a mapping Φ such that 2.6 is fulfilled for all input vectors x, y. It has been shown [Burges, 1998] that any Kernel fulfilling Mercer’s Condition represents a scalar product in some feature space. A Kernel K fulfils Mercer’s Condition iff for any g( x ) with Z
26
g( x)2 dx is finite
2.2. Karush-Kuhn-Tucker conditions it follows, that
Z
K ( x, y) g( x) g(y)dxdy > 0 .
If $K$ fulfills Mercer’s condition, the kernel matrix defined by $K_{i,j} = K(x_i, x_j)$ is invertible and positive definite for any choice of the $x_i$-s. A commonly used kernel is the gaussian radial basis function (RBF), which can be defined with one parameter $\gamma$

$$K(x, y) := \exp\left(-\gamma \cdot |x - y|^2\right) \tag{2.7}$$

or with as many parameters as dimensions $d$ of the input vectors ($\gamma_i$, $i \in \{1 \dots d\}$):

$$K(x, y) := \exp\left(-\sum_{i=1}^{d} \gamma_i \, |x_i - y_i|^2\right) \tag{2.8}$$

The gaussian RBF is a kernel function known to work well on most datasets and will be used throughout this thesis. The kernel trick transforms 2.1 to its final form:

$$f(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x) + b \tag{2.9}$$
The formulation in 2.3, taken from Bordes et al. [2005], varies slightly from the formulation found in most textbooks, but has the advantage that input vectors with different labels are treated more symmetrically, simplifying the resulting algorithm.
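As an illustration, the RBF kernel 2.7 and the dual decision function 2.9 can be sketched in a few lines of Python. The function and variable names are our own, not part of any implementation discussed here; a real solver would of course cache kernel values instead of recomputing them:

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian RBF kernel K(x, y) = exp(-gamma * |x - y|^2), cf. 2.7."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-gamma * np.dot(d, d))

def decision_function(x, support_vectors, alphas, b, gamma=1.0):
    """Evaluate f(x) = sum_i alpha_i * K(x_i, x) + b from 2.9; the sign of
    the result is the predicted label."""
    return sum(a * rbf_kernel(sv, x, gamma)
               for a, sv in zip(alphas, support_vectors)) + b
```

Note that only vectors with $\alpha_i \ne 0$ (the support vectors) contribute to the sum, which is why prediction cost grows with the number of support vectors.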
2.2. Karush-Kuhn-Tucker conditions

Since the SVM problem is a quadratic programming problem, the so-called Karush-Kuhn-Tucker (KKT) conditions are both sufficient and necessary for the optimum [Jensen and Bard, 2008]. Before we introduce these conditions, we define (following Bordes et al. [2005]) the gradient of the objective function of the SVM with respect to the $\alpha$-s:

$$g_k := \frac{\partial W(\alpha)}{\partial \alpha_k} = y_k - \sum_{i=1}^{n} \alpha_i K_{i,k} = y_k - f(x_k) + b \tag{2.10}$$

The Karush-Kuhn-Tucker conditions for the SVM dual are: $\exists b \in \mathbb{R}$ such that for all input vectors with index $i$

$$\alpha_i = A_i \;\Rightarrow\; f(x_i) - y_i \ge 0 \;\Leftrightarrow\; g_i \le b \tag{2.11}$$
$$A_i < \alpha_i < B_i \;\Rightarrow\; f(x_i) - y_i = 0 \;\Leftrightarrow\; g_i = b \tag{2.12}$$
$$\alpha_i = B_i \;\Rightarrow\; f(x_i) - y_i \le 0 \;\Leftrightarrow\; g_i \ge b \tag{2.13}$$

where $f$ is defined in 2.9 and $b$ is the same as in 2.9. These conditions can be reformulated to

$$\forall i \text{ with } \alpha_i > A_i: \; g_i \ge b \tag{2.14}$$
$$\forall i \text{ with } \alpha_i < B_i: \; g_i \le b \tag{2.15}$$

Proof. 2.14 follows because for $\alpha_i > A_i$ one of 2.12 and 2.13 must be fulfilled; 2.15 follows analogously using 2.11 and 2.12. In the other direction, if $\alpha_i = A_i$, then from 2.15 it follows that $g_i \le b$, showing 2.11. 2.12 and 2.13 can be concluded analogously from 2.14 and 2.15.

Another transformation of the conditions gives

$$g_{\max} \le g_{\min} \quad \text{with} \quad b = \frac{g_{\min} + g_{\max}}{2} \tag{2.16}$$

where

$$g_{\max} := \max_{j \in S,\; \alpha_j < B_j} g_j \tag{2.17}$$
$$g_{\min} := \min_{j \in S,\; \alpha_j > A_j} g_j \tag{2.18}$$

Proof. If 2.14 and 2.15 are fulfilled, then $g_{\max} \le b \le g_{\min}$, showing 2.16. In the other direction, if 2.16 is fulfilled, then $g_{\max} \le b$ implies $\forall i$ with $\alpha_i < B_i: g_i \le b$, which is 2.15; 2.14 follows analogously.
2.3. τ tolerance

In practice one does not require the KKT condition 2.16 to be fulfilled exactly. Instead a tolerance parameter $\tau$ is introduced and 2.16 is relaxed to

$$g_{\max} \le g_{\min} + \tau \tag{2.19}$$

Following this definition, a $\tau$-violating pair is defined as

$$(i, j) \text{ is a } \tau\text{-violating pair} \;\Leftrightarrow\; \alpha_i < B_i \;\wedge\; \alpha_j > A_j \;\wedge\; g_i - g_j > \tau \tag{2.20}$$

It can easily be seen that the approximate KKT condition 2.19 is fulfilled if no $\tau$-violating pair exists. In such a case the corresponding solution is called a $\tau$-approximate solution. In the appendix of Bordes et al. [2005] it is shown that for small $\tau$ this solution approximates the exact solution.
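A minimal sketch of these two tests in Python, assuming arrays `alpha`, `g`, `A` and `B` holding the per-sample quantities defined in 2.3 (the names are our own assumptions, not laSvm's actual code):

```python
import numpy as np

def is_tau_violating(i, j, alpha, g, A, B, tau):
    """Condition 2.20: (i, j) is a tau-violating pair iff
    alpha_i < B_i, alpha_j > A_j and g_i - g_j > tau."""
    return alpha[i] < B[i] and alpha[j] > A[j] and g[i] - g[j] > tau

def is_tau_approximate(alpha, g, A, B, tau):
    """Approximate KKT condition 2.19: g_max <= g_min + tau, with g_max
    taken over alpha_j < B_j and g_min over alpha_j > A_j."""
    alpha, g = np.asarray(alpha), np.asarray(g)
    up = alpha < np.asarray(B)    # alpha may still be increased
    down = alpha > np.asarray(A)  # alpha may still be decreased
    if not up.any() or not down.any():
        return True
    return g[up].max() <= g[down].min() + tau
```

If `is_tau_approximate` returns `True`, no pair can satisfy 2.20 and the solution is a $\tau$-approximate solution.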
2.4. Sequential minimal optimization

laSvm utilizes the sequential minimal optimization (SMO) algorithm for SVM optimization introduced in Platt [1998]. Like most SVM solvers, SMO optimizes a subset of the $\alpha_i$, called a working set, in every step. In SMO the working set has size 2, yielding an analytical solution for the subset. SMO makes successive searches along well-chosen directions $u$, updating $\alpha$ by $\alpha' = \alpha + \lambda u$. Let $i, j$ be the indices of the $\alpha$-s to be optimized in a specific step. Because of 2.4, it must hold that $u_k = 0$ for $k \ne i, j$ and $u_i = -u_j$. Because $u$ gives only the direction and not the step length, we can assume $|u_i| = |u_j| = 1$. The step length $\lambda$ is found by

$$\lambda = \operatorname*{argmax}_{\lambda^*} W(\alpha + \lambda^* u) \quad \text{with} \quad 0 \le \lambda^* \le B(\alpha, u)$$

where $B$ ensures that $\alpha$ stays feasible. Fixing without loss of generality $u_i = 1$ and $u_j = -1$, $B$ is defined as

$$B(\alpha, u) = \min\left(B_i - \alpha_i, \; \alpha_j - A_j\right)$$

It can easily be seen that the optimum is achieved for

$$\lambda = \min\left(B(\alpha, u), \; \frac{g_i - g_j}{K_{i,i} + K_{j,j} - K_{i,j} - K_{j,i}}\right)$$

where $K_{i,j} := K(x_i, x_j)$ and $g$ is the gradient of $W(\alpha)$ defined in 2.10. The question that remains is how to choose $i$ and $j$. We observe that $\lambda = 0$ for $g_i - g_j \le 0$; it therefore does not make sense to choose $i, j$ such that $g_i - g_j \le 0$. Recall that for the KKT conditions to be fulfilled, all $\tau$-violating pairs (2.20) have to be removed. Keerthi and Gilbert [2000] showed that any algorithm choosing a $\tau$-violating pair in every step, and stopping when no such pairs exist anymore, converges in a finite number of steps. In Bordes et al. [2005], $(i, j)$ is chosen to be the “maximum violating pair”, which is defined as the $\tau$-violating pair that maximises $g_i - g_j$.
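The pair update described above can be sketched as follows. This is a hypothetical illustration assuming a precomputed symmetric kernel matrix `K` and gradient vector `g` (so the denominator simplifies to $K_{i,i} + K_{j,j} - 2K_{i,j}$), not the laSvm implementation itself:

```python
import numpy as np

def smo_step(i, j, alpha, g, K, A, B):
    """One SMO update on the pair (i, j): alpha_i grows by lam, alpha_j
    shrinks by lam, so sum(alpha) = 0 is preserved.  The step length is the
    unconstrained optimum clipped to the feasible region B(alpha, u)."""
    curvature = K[i, i] + K[j, j] - 2.0 * K[i, j]
    lam = (g[i] - g[j]) / max(curvature, 1e-12)       # unconstrained optimum
    lam = min(lam, B[i] - alpha[i], alpha[j] - A[j])  # feasibility clip
    lam = max(lam, 0.0)  # guard; for a tau-violating pair lam is positive
    alpha[i] += lam
    alpha[j] -= lam
    g -= lam * (K[i] - K[j])  # gradient update: g_s -= lam * (K_is - K_js)
    return lam
```

The closing gradient update is what makes the later search for the maximum violating pair cheap: the $g_s$ values are kept current rather than recomputed from scratch.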
2.5. laSvm

Vectors with $\alpha_i \ne 0$ are called support vectors. The laSvm algorithm keeps track of a set $S$ of vectors called the active set. Vectors are inserted into the active set if they are found to be part of a $\tau$-violating pair, and removed if they are no longer part of any $\tau$-violating pair. Any vector not in the set has $\alpha_i = 0$; the active set is therefore a superset of all support vectors. laSvm features two core procedures, PROCESS and REPROCESS. PROCESS takes one input vector $x_k$ not in the active set and tries to insert it. It tests whether the input vector can be part of any $\tau$-violating pair. If so, it is optimized together with the vector forming the biggest $\tau$-violating pair with $x_k$, and inserted.
Assume $x_k$ is not in the active set and $y_k = 1$. This yields $\alpha_k = 0 = A_k$ and means that $x_k$ can only be part of a $\tau$-violating pair if there is a $j$ with $g_k - g_j > \tau$ and $\alpha_j > A_j$. Using the definitions of $g_{\max}$ and $g_{\min}$ from 2.17 and 2.18, the only necessary test is whether the vector corresponding to $g_{\min}$ forms a $\tau$-violating pair with $x_k$. If this is the case, $(k, j)$ is also the most violating pair for $x_k$. The case $y_k = -1$ works analogously. Because $g_{\max}$ and $g_{\min}$ are needed so often and the $g$-s are fairly costly to compute, the $g$-s are kept and updated for every vector in $S$. Algorithm 5 states the pseudo code, taken from Bordes et al. [2005].

Algorithm 5 PROCESS(k)
 1: procedure PROCESS(k)
 2:   if k ∈ S then
 3:     return
 4:   end if
 5:   α_k ← 0, g_k ← y_k − f(x_k) + b
 6:   if y_k = +1 then
 7:     i ← k, j ← argmin_{j∈S, α_j > A_j} g_j
 8:   else
 9:     j ← k, i ← argmax_{j∈S, α_j < B_j} g_j
10:   end if
11:   if (i, j) is not a τ-violating pair then
12:     return
13:   end if
14:   S ← S ∪ {k}
15:   λ ← min( B_i − α_i, α_j − A_j, (g_i − g_j) / (K_{i,i} + K_{j,j} − 2K_{i,j}) )
16:   α_i ← α_i + λ
17:   α_j ← α_j − λ
18:   g_s ← g_s − λ(K_{i,s} − K_{j,s})  ∀s ∈ S
19: end procedure
REPROCESS (algorithm 6) optimizes the maximum violating pair and then tests whether any $s \in S$ can be removed, i.e. whether some $s \in S$ can no longer be part of a $\tau$-violating pair with any other $s' \in S$. In Bordes et al. [2005] a few samples are inserted into the active set $S$, and for a predefined number of epochs the algorithm iterates, calling PROCESS on all vectors in the input set. Between successive invocations of PROCESS, REPROCESS is executed once, potentially removing vectors from $S$. At the end, the KKT conditions are ensured for $S$ by calling FINISH, which calls REPROCESS until no $\tau$-violating pairs exist anymore (see algorithm 7). FINISH is the same as in other SMO based algorithms, but operates only on the active set $S$. We have modified FINISH slightly from Bordes et al. [2005]: at the end, we remove all vectors with $\alpha = 0$ from the active set $S$.
Algorithm 6 REPROCESS()
 1: procedure REPROCESS
 2:   j ← argmin_{j∈S, α_j > A_j} g_j
 3:   i ← argmax_{j∈S, α_j < B_j} g_j
 4:   if (i, j) is not a τ-violating pair then
 5:     return
 6:   end if
 7:   λ ← min( B_i − α_i, α_j − A_j, (g_i − g_j) / (K_{i,i} + K_{j,j} − 2K_{i,j}) )
 8:   α_i ← α_i + λ
 9:   α_j ← α_j − λ
10:   g_s ← g_s − λ(K_{i,s} − K_{j,s})  ∀s ∈ S
11:   ▷ Throw out vectors from the working set
12:   j ← argmin_{s∈S, α_s > A_s} g_s
13:   i ← argmax_{s∈S, α_s < B_s} g_s
14:   for all s ∈ S with α_s = 0 do
15:     if y_s = −1 and g_s ≥ g_i then
16:       S ← S − {s}
17:     end if
18:     if y_s = 1 and g_s ≤ g_j then
19:       S ← S − {s}
20:     end if
21:   end for
22: end procedure
This has been found to bring significant improvements in some cases without harming the accuracy of the SVM much.

Algorithm 7 FINISH()
1: procedure FINISH
2:   while there are still τ-violating pairs in S do
3:     REPROCESS()
4:   end while
5:   ▷ added, not found in the original laSvm
6:   remove all vectors with α = 0 from S
7: end procedure
2.6. Unlearning

The online SVM is integrated into Ilastik (see chapter 8). It can happen that the labeler makes an error and wants to remove an already learned sample. To avoid a complete retraining of the SVM, a way of removing a support vector without changing the solution too much has to be found. The main difficulty is to preserve the equality constraint $\sum \alpha_i = 0$. The equality constraint is preserved by “moving” the $\alpha$ of the removed vector to other support vectors. For the gaussian RBF kernel, the discriminant function 2.9 is changed least if the $\alpha$ is moved to support vectors as close to the removed one as possible. When support vector $i$ should be unlearned, it is first checked whether $\alpha_i = 0$. If so, the SV can simply be removed. If not, the closest SV $j$ with $\alpha_j \ne y_i C$ is found. Assume without loss of generality that $y_i = y_j = 1$. The $\alpha$-s are then changed by $\alpha_i = \alpha_i - \Delta$, $\alpha_j = \alpha_j + \Delta$ where $\Delta = \min(\alpha_i, C - \alpha_j)$. This ensures that $\alpha_j \le C$ is not violated after the removal. The step is repeated until $\alpha_i = 0$ and the vector is removed. After removing support vectors, FINISH has to be called to ensure the KKT conditions within $S$.
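The redistribution step might look as follows. This is a simplified sketch (array names and the tolerance are our own assumptions) that handles both classes via the sign of $\alpha$ and ignores the search structure one would use to find the closest support vector efficiently:

```python
import numpy as np

def unlearn(i, alpha, X, y, C):
    """Redistribute alpha_i to the closest same-class support vectors whose
    |alpha| is below the bound C, so that sum(alpha) = 0 is preserved;
    afterwards alpha_i = 0 and vector i can be dropped."""
    idx = np.arange(len(alpha))
    while abs(alpha[i]) > 1e-12:
        # candidate receivers: same class, not yet at the bound, not i itself
        mask = (y == y[i]) & (np.abs(alpha) < C) & (idx != i)
        if not mask.any():
            break  # nothing can absorb the weight; a retrain would be needed
        dist = np.linalg.norm(X[mask] - X[i], axis=1)
        j = np.flatnonzero(mask)[np.argmin(dist)]
        # transfer as much weight as the bound on alpha_j permits
        delta = np.sign(alpha[i]) * min(abs(alpha[i]), C - abs(alpha[j]))
        alpha[i] -= delta
        alpha[j] += delta
```

Each transfer strictly reduces $|\alpha_i|$, so the loop terminates; the subsequent FINISH call then restores the KKT conditions on $S$.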
2.7. Online learning with laSvm

Bordes et al. [2005] suggest using laSvm as an online SVM by doing only one epoch over the data. As this seemed to work well on the datasets used in Bordes et al. [2005], the idea has been adopted here. While experimenting in Ilastik, potential pitfalls have been observed; they will be discussed in the next section.
2.7.1. Resampling

Motivation and theory

Consider the toy dataset in figure 2.1, to which we will refer as the “xor” dataset. Imagine the user labels in a way that the SVM is first trained with all samples fulfilling x1 > 0, which is the right half in figure 2.1. The SVM would then most likely learn a solution placing the decision boundary at x2 = 0 and predict completely wrong for x1 < 0. This means that all support vectors are located close to x2 = 0. Now the user starts labeling samples with x1 < 0, correcting the solution in that area. Because all new samples are now predicted wrongly, they get inserted into the active set, including samples far from the old decision boundary at x2 = 0. For the old set of samples (those with x1 > 0), no support vectors exist far away from x2 = 0. Therefore the decision boundary is “pushed” by the new samples into the area of the old samples, leading to a wrong decision boundary there.
Figure 2.1.: The “xor” toy dataset

The result of such a training is shown in figure 2.2. The influence area of the new samples extends into the area of the old samples, causing misclassification. We will refer to this type of sampling, where the order of the training data is not coherent with the input distribution, as “biased sampling”. It is therefore natural to retest samples that have been learned but not inserted into the
active set by PROCESS and check whether they must be inserted now because of newly gained support vectors. We refer to this retest as resampling.

Figure 2.2.: Biased sampling on the “xor” toy dataset

Which samples should be resampled? The simplest strategy would be to just resample all vectors not in the active set. But as the learning problem grows, the time for resampling (and relearning of the resampled vectors) grows as well, and the question arises whether we can do better. We motivate a few resampling strategies that will be tested in the next section.

No resampling: To determine if resampling is at all necessary, no resampling is considered an option.

All: As already discussed, the best classification results are expected when all vectors are resampled.

Random: If we have no knowledge which vectors are best to resample, the only choice is to resample randomly. This might even be advantageous over a bad resampling heuristic, because it is free of bias.

Close to new SVs: Resampling makes most sense in areas of the feature space in which the solution changed. This is true for areas where new support vectors have been added in the last learning round. It therefore makes sense to resample vectors close to these new support vectors.
Close to misclassified: When a newly learned vector was originally misclassified, the margin changes most. Resampling close to these vectors might therefore be advantageous.

Close to new SVs and Close to misclassified only make sense because the kernel used here (see 2.8) is monotone in the distance between the samples. For other kernels, the kernel value itself may make more sense as a criterion than the distance between the samples.

Experiments

The resampling strategies have been tested on the svm_small_sets and on two toy datasets. Biased sampling, as it could be introduced by the labeler, has been simulated on the toy datasets. The “xor” and “stripes” datasets are shown in figures 2.1 and 2.3. They both consist of 1000 samples. For the xor dataset, the 500 samples with x1 > 0 have been drawn first and the samples with x1 ≤ 0 afterwards. Because the dataset is distributed differently for x1 > 0 and x1 ≤ 0, this should introduce difficulties if specific vectors are not resampled.
Figure 2.3.: The “stripes” toy dataset

The stripes dataset consists of 4 “stripes” of alternating labels in the x2 direction. There
is a gap of 0.1 between the stripes. For this dataset, the 500 samples in the interval −0.5 < x2 < 0.5 have been sampled first, followed by the rest. laSvm has been trained online on the datasets, and after every 20 samples the resampling strategies have been invoked. In addition, after every 20 samples FINISH has been called, producing an intermediate model with high accuracy. All but the No resampling and All strategies resampled as many vectors as new samples have been learned (which is 20). For Close to new SVs, only the one closest vector has been resampled for every new support vector inserted into the active set; the rest of the 20 resampled vectors have been filled following the Random strategy. The Close to misclassified strategy has been handled similarly. All experiments have been repeated 20 times, ensuring that a strategy does not fail or do better by chance. For every repeat, the order of the training data has been shuffled. The slack penalty parameter of the SVM has been set to C = 1. A gaussian kernel with one parameter has been used with γ = 1/d, where d is the number of dimensions in the dataset. Only for the toy datasets has γ been set to γ = 4 by visual inspection.

Precision results

We first examine the precision results on the xor and stripes datasets. Figure 2.4 shows the precision of laSvm as a function of the number of learned samples on the xor dataset. Two graphs are shown: the first displays the result over the whole 1000 samples, the second focuses on the last 500 samples. For the first 500 samples, not much happens. Here the problem is a simple linear separation problem, misclassifying all test samples with x1 < 0. When the second 500 samples are touched, the real distribution of the dataset is revealed and the precision increases rapidly. The results are very surprising. Up to 540 samples, No resampling is better than the rest; then it suddenly loses precision and becomes much worse than all other strategies.
Random is also almost as good as No resampling, but does not lose precision with more samples. It is overtaken by Close to new SVs and All at about 740 samples. Close to new is very similar to All throughout, while Close to misclassified behaves somewhere between All and Random.

To investigate why the All strategy has such bad precision shortly after the first 580 samples, the intermediate SVM results for All and Random resampling are shown in figure 2.5. The decision boundaries returned by the SVM are black while the true decision boundaries are white. Because of the biased sampling, only very few (80) samples with x1 < 0 have been drawn. In the All resampling strategy, these samples are "overwhelmed" by the first 500 samples and the decision boundary moves to the left. In the Random resampling, the effect is reduced because many of the old samples (those not in the active set) do not force the decision boundary to move leftwards. The result from All in figure 2.5 better follows the distribution of the training data. But
because the training and testing data do not follow the same distribution, this leads to bad test error rates. The same is true for Close to new SVs. While we in general want the SVM solution to model the training distribution, the example on the xor dataset shows that there are cases where this is not desired. For such cases, one could strengthen underrepresented samples by increasing their misclassification penalty (the C value in the SVM formalism). In the laSvm implementation developed in this thesis, the misclassification penalty is stored for every training sample; changing it for individual samples is therefore a straightforward extension. Figure 2.6 shows the SVM model for All resampling after 580 samples on the xor dataset where C has been set to C = 10. The results follow the true decision boundary better. If such an effect is wanted, the corresponding labels could be marked by the labeler.

The precision results for the stripes toy dataset are shown in figure 2.7; the effects seen in the xor dataset are not visible here. The All resampling strategy does best, while Close to new SVs is second. Figure 2.8 shows the intermediate solution of the SVM for All resampling after 580 samples. The effects seen in the xor dataset do not occur because of the gaps between the classes.

The results for the remaining datasets are shown in figures 2.9, 2.10 and 2.11. In table 2.1, the error rates after the whole datasets have been learned are listed. For each dataset, the resampling method with the highest precision is marked with a color: if the method is more than 0.05% better than the next best method, it is marked green, otherwise blue. On all datasets but the toy datasets, there are no big differences between the strategies. Even No resampling does best on ionosphere and splice. Close to new and Close to misclassified do better on a few datasets, but the difference is never higher than 0.1%. Even All does not do much better on any dataset and is a little worse on some. The only big differences are on the toy datasets (which have been learned with biased sampling). Close to new does significantly better than the others on xor, but still loses 0.18% precision to All. On stripes it is comparable to All and does 0.42% better than the next best. No resampling is very bad on these datasets.

Timing results

Table 2.2 shows the time needed by the different resampling strategies to learn the last 20 samples and do resampling. The best results, with the exception of No resampling, are marked by a color; green is used when the time is at least 10 ms better than the second best result. The table shows that the strategy All takes much more time on many datasets, while for the rest the difference is not very large.

Discussion

The timing results clearly show that for online learning, where training time is crucial, the strategy All is a bad candidate for resampling. In the precision results, the differences on all but the toy datasets are so small that there is no clear best candidate.
Dataset        All           Random        Close to new  Close to miscl.  No resampling
xor            1.91 ± 0.00   3.58 ± 0.04   2.09 ± 0.01   3.98 ± 0.05      10.16 ± 0.15
stripes        5.10 ± 0.00   5.56 ± 0.02   5.08 ± 0.01   5.50 ± 0.02      9.49 ± 0.06
a1a            16.43 ± 0.00  16.44 ± 0.01  16.47 ± 0.00  16.48 ± 0.01     16.81 ± 0.03
a2a            15.90 ± 0.00  15.99 ± 0.01  15.91 ± 0.00  16.00 ± 0.00     16.23 ± 0.01
a3a            16.18 ± 0.00  16.36 ± 0.00  16.23 ± 0.00  16.21 ± 0.00     16.53 ± 0.01
mushrooms      0.11 ± 0.00   0.11 ± 0.00   0.11 ± 0.00   0.11 ± 0.00      0.11 ± 0.00
breast-cancer  2.20 ± 0.00   2.20 ± 0.00   2.20 ± 0.00   2.20 ± 0.00      2.20 ± 0.00
australian     16.52 ± 0.00  16.57 ± 0.02  16.48 ± 0.01  16.35 ± 0.02     16.43 ± 0.02
ionosphere     10.04 ± 0.03  10.04 ± 0.03  9.96 ± 0.05   9.91 ± 0.05      9.91 ± 0.06
splice         9.93 ± 0.00   9.95 ± 0.00   9.96 ± 0.00   9.92 ± 0.00      9.91 ± 0.00
svmguide1      3.02 ± 0.00   2.93 ± 0.00   2.98 ± 0.00   2.88 ± 0.00      2.94 ± 0.00
german.numer   25.54 ± 0.03  25.57 ± 0.06  25.32 ± 0.06  25.41 ± 0.07     25.45 ± 0.14
voxels         4.93 ± 0.00   4.92 ± 0.00   4.90 ± 0.00   4.91 ± 0.00      4.93 ± 0.00

Table 2.1.: Error rate in percent after learning the complete dataset with different resampling strategies. The colors are explained in the text.
Dataset        All          Random   Close to new  Close to miscl.  No resampling
xor            31 ± 0       2 ± 0    2 ± 0         2 ± 0            1 ± 0
stripes        22 ± 0       1 ± 0    1 ± 0         1 ± 0            1 ± 0
a1a            336 ± 3      24 ± 0   25 ± 0        21 ± 0           16 ± 0
a2a            695 ± 2      40 ± 0   42 ± 0        37 ± 0           27 ± 0
a3a            1421 ± 13    58 ± 0   63 ± 0        55 ± 0           41 ± 0
mushrooms      4163 ± 1458  53 ± 12  48 ± 3        36 ± 4           15 ± 1
breast-cancer  9 ± 0        1 ± 0    1 ± 0         1 ± 0            1 ± 0
australian     14 ± 0       3 ± 0    4 ± 0         3 ± 0            3 ± 0
ionosphere     4 ± 0        1 ± 0    1 ± 0         1 ± 0            1 ± 0
splice         80 ± 0       21 ± 0   21 ± 0        18 ± 0           17 ± 0
svmguide1      207 ± 1      5 ± 0    5 ± 0         4 ± 0            3 ± 0
german.numer   36 ± 6       11 ± 2   7 ± 0         7 ± 0            6 ± 0
voxels         1938 ± 45    24 ± 0   28 ± 0        23 ± 0           17 ± 0

Table 2.2.: Time in milliseconds needed for learning the last 20 samples including resampling on the complete datasets. The colors are explained in the text.
But on the generated toy datasets with simulated biased sampling, which can occur in practice, the Close to new strategy outperforms the others significantly. As it is not known beforehand how the labels will arrive, a resampling strategy performing well in all situations is needed. For that reason, the Close to new resampling strategy has been chosen for the integration into Ilastik.
Figure 2.4.: Error rate on the xor dataset (above: all 1000 samples; below: the last 500 samples)
Figure 2.5.: Above: intermediate solution on the xor dataset doing random resampling. Below: intermediate solution on the xor dataset doing all resampling. The true decision boundaries are marked by white lines.
Figure 2.6.: Intermediate results on the xor dataset with all resampling, setting C = 10 for the samples with x0 < 0
Figure 2.7.: Error rate on the stripes dataset
Figure 2.8.: Intermediate results on the stripes dataset with all resampling
Figure 2.9.: Precision while doing online learning with different resampling strategies on the svm_small_sets datasets: (a) a1a, (b) a2a, (c) a3a, (d) australian
Figure 2.10.: Precision while doing online learning with different resampling strategies on the svm_small_sets datasets: (a) breast-cancer, (b) german.numer, (c) ionosphere, (d) mushrooms
Figure 2.11.: Precision while doing online learning with different resampling strategies on the svm_small_sets datasets: (a) splice, (b) svmguide1, (c) voxels
3. Model Selection for online SVMs

The performance of a Support Vector Machine greatly depends on the used hyperparameters. In addition to the slack penalty $C$, the chosen kernel usually introduces some parameters. The gaussian RBF kernel is often defined with one parameter $\gamma$:

$$K_{RBF}(x, y) := e^{-\gamma |x - y|^2}$$

This definition gives the feature dimensions equal weights. But in real data, the different feature dimensions typically have different discriminative power and are defined on different scales. If we are able to optimize multiple kernel parameters simultaneously, it makes sense to allow scaling each feature with a different factor and to define the gaussian RBF kernel as

$$K_{RBF}(x, y) := e^{-\sum_{i=0}^{D} \gamma_i \cdot (x_i - y_i)^2}$$

where $D$ is the number of feature dimensions. The online Support Vector Machine developed in this thesis is supposed to work as a plug-in method for Ilastik. The algorithm should work without manual adjustment of the parameters. A method for automatic model selection is therefore investigated.

Traditionally, SVM hyperparameters have been optimized using cross validation, which requires retraining the SVM on a subset of the dataset $N$ times, where $N$ is the number of folds in the cross validation (usually between 5 and 10). Another error estimate is the leave-one-out error estimate, for which the SVM is trained once for every training example, leaving out that particular example. The left-out example is classified, and the percentage of samples misclassified when left out is used as the generalization approximation $T_{loo}$. Many heuristic error bounds are based on approximating the leave-one-out error $T_{loo}$.

In Sonnenburg et al. [2006b] a multiple kernel learning (MKL) algorithm is given, training an SVM on a linear combination of kernels and adjusting the weights for the kernels in parallel. Sonnenburg et al. [2006a] rewrote the MKL as a semi-infinite linear program, recycling the standard SVM implementation. While laSvm could be used here, the main bottleneck of calculating the kernel values grows with the number of used kernels. Vapnik [1995] suggested a gradient descent on an error bound for the leave-one-out error. The approach has been followed in Adankon and Cheriet [2007] and Peng and Wang [2009], but with a different error bound requiring a separate validation set. Here the generalization performance is approximated by the smooth Xi-Alpha bound, which is a differentiable variant of the Xi-Alpha bound introduced in Joachims [2000].
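The difference between the two kernel variants can be illustrated with a short sketch (the function names are our own):

```python
import numpy as np

def rbf_shared(x, y, gamma):
    """K(x, y) = exp(-gamma * |x - y|^2): one width for all dimensions."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-gamma * np.dot(d, d))

def rbf_per_dimension(x, y, gammas):
    """K(x, y) = exp(-sum_i gamma_i * (x_i - y_i)^2): one width per feature,
    so each dimension can be rescaled (or effectively switched off with
    gamma_i = 0) independently."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-np.dot(np.asarray(gammas, dtype=float), d * d))
```

With all $\gamma_i$ equal, the two definitions coincide; the per-dimension form is what makes gradient-based model selection over many kernel parameters attractive.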
The gradient of the smooth Xi-Alpha bound is approximated using approximations from Adankon and Cheriet [2007], and a gradient descent is performed to find the minimum.
3.1. Generalization error bounds

Since during training the test data is not available, some procedure for estimating the generalization error of the SVM is needed. In the following we present the error bounds that have been examined.
3.1.1. Radius Margin bound

In Vapnik [1995] the radius margin bound has been proposed for SVMs without threshold and slack variables as an upper bound for the leave-one-out procedure:

$$T_{rm} := \frac{1}{l} \frac{R^2}{\gamma^2}$$

$l$ is the number of training examples, $\gamma$ is the width of the margin and $R$ is the radius of the minimal bounding sphere around the training data in feature space. $\gamma$ can be calculated by

$$\gamma := \frac{1}{|w|}$$

where $w$ is the normal of the separating hyperplane of the SVM solution. Given the kernel matrix $K_{i,j}$, $R$ can be found by solving the quadratic problem [Chapelle et al., 2002]:

$$R^2 := \max_{\beta} \sum_{i=1}^{l} \beta_i K_{i,i} - \sum_{i,j=1}^{l} \beta_i \beta_j K_{i,j} \, . \tag{3.1}$$

This problem has to be solved again whenever the kernel parameters change.
3.1.2. Empirical error

In Ayat et al. [2002] optimization of kernel parameters based on an empirical error on a separate validation set has been proposed; it has been successfully used in Peng and Wang [2009] and Adankon and Cheriet [2007]. Let $t_i \in \{0, 1\}$ be the label of a validation sample with index $i$. The output of the SVM can be converted to probabilities using techniques described in Platt [1999] and the algorithm from Lin et al. [2007]. Let $\hat{p}_i$ be the probability of sample $i$ having class 1, as output by the classifier. The empirical error is then defined as

$$T_{emp} := \sum_{i=1}^{|V|} E_i = \sum_{i=1}^{|V|} |\hat{p}_i - t_i| \, , \tag{3.2}$$

where $E_i := |\hat{p}_i - t_i|$ is the estimated empirical error of the $i$-th validation sample and $|V|$ the size of the validation set. Using the empirical error for kernel parameter optimization requires recalculating the probabilistic output of the SVM in every step and keeping a separate validation set. Especially this last requirement is difficult when training samples are rare. We will examine whether the training set itself can be used for validation.
3.1.3. Xi-Alpha bound

In Joachims [2000] it has been proven that the leave-one-out error T_loo of an SVM is upper bounded by

    T_loo ≤ (1/l) card{i : (ρ|α_i| R² + ξ_i) ≥ 1} ,    (3.3)

where ρ := 2 and R² ≥ K_{i,i} − K_{i,j} ∀i, j is an upper bound for the differences between the diagonal of the kernel matrix and the other kernel values. card denotes the cardinality of a set. We inserted the absolute value bars around α_i because in the SVM formulation used here, α_i can become negative, in contrast to the SVM formulation Joachims [2000] is based on. For a Gaussian kernel, R² = 1 can be chosen. The proof of 3.3 works by showing that any training sample i that is misclassified in a leave-one-out test fulfills ρ|α_i| R² + ξ_i ≥ 1.
3.1.4. Smooth Xi-Alpha bound

While the Xi-Alpha bound seems to approximate the SVM generalization error well, it is unfortunately not differentiable. For convenience we define the Xi-Alpha term by

    β_i := ρ|α_i| R² + ξ_i .

From Joachims [2000] we know that β_i ≥ 1 holds for every misclassified sample. Negating this expression: β_i < 1 ⇒ sample i is not misclassified in a leave-one-out test. On the other hand, samples with ξ_i ≥ 1 are already misclassified when included in the training set. Since for ξ_i > 0, |α_i| = C follows from the KKT conditions, it can be concluded that

    β_i ≥ ρ·C·R² + 1  ⇒  ξ_i = β_i − ρ|α_i| R² ≥ 1  ⇒  the i-th sample gets misclassified,

which gives us a lower and an upper bound for T_loo:

    (1/l) card({i : β_i ≥ 1}) ≥ T_loo ≥ (1/l) card({i : ξ_i ≥ 1}) = (1/l) card({i : β_i ≥ ρ·C·R² + 1}) .
3. Model Selection for online SVMs

To approximate the area between the bounds, we define a function of β_i approaching 0 for β_i → 1 and 1 for β_i → ρ·C·R² + 1. An appropriate choice seems to be the sigmoid function, defined by

    s(x) := 1 / (1 + exp(−a·(x + b))) .    (3.4)

Setting the parameters a and b will be discussed in the experiments section. We call the modified Xi-Alpha bound the smooth Xi-Alpha bound. It is given by

    T_smooth := ∑_{i=1}^{l} s(β_i) .    (3.5)
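For illustration, both the hard count 3.3 and the smoothed sum 3.5 can be written down directly from the dual solution. This is a minimal sketch with hypothetical inputs alpha and xi; ρ = 2 and R² = 1 (Gaussian kernel) as in the text, and the sigmoid parameters a, b are placeholders, not tuned values:

```python
# Sketch: Xi-Alpha bound (3.3) and smooth Xi-Alpha bound (3.5).
# alpha, xi are hypothetical dual coefficients and slack variables.
import numpy as np

def xi_alpha_bound(alpha, xi, rho=2.0, R2=1.0):
    beta = rho * np.abs(alpha) * R2 + np.asarray(xi, float)
    return np.mean(beta >= 1.0)  # (1/l) * card{i : beta_i >= 1}

def smooth_xi_alpha_bound(alpha, xi, a=1.0, b=-1.0, rho=2.0, R2=1.0):
    beta = rho * np.abs(alpha) * R2 + np.asarray(xi, float)
    return np.sum(1.0 / (1.0 + np.exp(-a * (beta + b))))  # sum of s(beta_i), (3.4)
```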
3.1.5. Deriving the smooth Xi-Alpha bound

Given the derivatives of |α_i| and ξ_i with respect to the kernel parameters Θ_k, and knowing that ∂_x s(x) = a · s(x) · (1 − s(x)), the derivative of 3.5 is

    ∂T_smooth/∂Θ_k = ∑_{i=1}^{l} a · s(β_i) · (1 − s(β_i)) · ( ρR² · ∂|α_i|/∂Θ_k + ∂ξ_i/∂Θ_k ) .    (3.6)

In order to find the derivative of T_smooth we therefore need the derivatives of α_i and ξ_i.

Finding the derivatives of α_i and ξ_i

From the KKT conditions and the constraints of the SVM solution, we know

    y_i · f(x_i) = y_i · ( ∑_{j=1}^{l} α_j · K_{i,j} + b ) = 1 − ξ_i    (3.7)

    ∑_{i=1}^{l} α_i = 0 .    (3.8)

Let K be the kernel matrix. Similar to Vapnik [1995] we can formulate this for all i in matrix notation:

    [ K    1 ]   [ α ]   [ y ]   [ ξ^y ]
    [ 1^T  0 ] · [ b ] = [ 0 ] − [  0  ]    (3.9)
      =: K̂       =: α̂    =: ŷ    =: ξ̂

    ⇒ K̂ · α̂ = ŷ − ξ̂ ,    (3.10)

where ξ^y_i := ξ_i · y_i and 1 is a vector with 1 in every entry.

Lemma 3.1. K̂ is invertible.
3.1. Generalization error bounds Proof. We show that the rows of Kˆ are linear independent. Because the Kenrel fulfills Mercer’s Condition, K is invertible, and therefore all but the last row of Kˆ are already ˆ Suppose r could be shown to be linear independent. Let r = (1, 0) be the last row of K. written as a linear combination of the remaining rows. Than there is λ ∈ Rl such that l
l
j =1
j =1
∑ K j,i λ j = ∑ Kˆ i,j λ j = ri = 1
∀ i ∈ {1 . . . l } :
l
l
j =1
j =1
∑ λ j = ∑ Kˆ j,l λ j = rl = 0
(3.11) (3.12)
Combining these yields λ T Kλ =
l
∑
λi K j,i λ j
i,j=1
=
l
l
i =1
j =1
∑ λi · ∑ K j,i λ j
!
l
=
∑ λi
(by 3.11)
i =1
=0
(by 3.12) .
which is means that K can not be positive definite. This is a contradiction to Mercer’s Condition. Deriving 3.10 with respect to the kernel parameter Θk yields ∂ ∂ξˆ Kˆ · αˆ = − ∂Θk ∂Θk ˆ ∂ˆα ∂ξ ∂Kˆ ⇒ Kˆ · + = − · αˆ . ∂Θk ∂Θk ∂Θk
(3.13) (3.14)
Since ∂ξ/∂Θ_k as well as ∂α/∂Θ_k have l elements each, these are l equations for 2l unknowns. But the KKT conditions also yield

    ξ_i > 0 ⇒ α_i = y_i · C
    α_i ≠ y_i · C ⇒ ξ_i = 0 .

Assuming that α_i and ξ_i are smooth functions of Θ_k, these conditions also hold in a small ε-environment around Θ_k. It can be concluded that

    ξ_i = 0 ⇒ ∂ξ_i/∂Θ_k = 0
    ξ_i ≠ 0 ⇒ ∂α_i/∂Θ_k = 0 .
We define the vector β with the remaining unknown entries of α_i and ξ_i:

    β_i := ∂α̂_i/∂Θ_k   if ξ̂_i = 0
    β_i := ∂ξ̂_i/∂Θ_k   if ξ̂_i ≠ 0 .

In addition, let B be the matrix constructed by replacing the columns of K̂ for which ξ_i ≠ 0 with the corresponding columns of the identity matrix.

Lemma 3.2. B is invertible.

Proof. We show the lemma by induction over the number of replaced columns in B. We write B_{i,j} to denote the matrix created by removing the i-th row and j-th column from B. If no columns are replaced, B is invertible because K̂ is invertible. Since K is constructed from a Mercer kernel, B_{i,j} is also invertible for any i, j ∈ {1 … l + 1}. Let B′ be constructed by replacing column k of B by the corresponding column of the identity matrix. According to the Laplace expansion, the determinant of B′ can be calculated by

    det(B′) = ∑_{i=1}^{l} (−1)^{i+k} · B′_{i,k} · det(B′_{i,k}) = det(B′_{k,k}) = det(B_{k,k}) ≠ 0 .

The last two equalities hold because B′_{i,k} = δ_{i,k} and B′_{k,k} = B_{k,k}. By the same argument B′_{i,j} for i, j ∈ {1 … l + 1} is also invertible.
Equation 3.14 can then be transformed to

    B · β = − ∂K̂/∂Θ_k · α̂  ⇒  β = − B^{−1} · ∂K̂/∂Θ_k · α̂ .    (3.15)
From the resulting β, the corresponding derivatives of α_i and ξ_i can be extracted.

Approximating the derivatives of α_i and ξ_i

Finding the derivatives of α_i and ξ_i using equation 3.15 involves inverting a matrix of size l + 1, which can be expensive when the SVM solution contains many support vectors. A fast approximation of 3.15 would therefore be desirable. Equation 3.7 can be transformed to yield

    K_{i,i} α_i + y_i ξ_i = − ∑_{j=1, j≠i}^{l} K_{i,j} α_j − b + y_i ,    (3.16)

where K_{i,i} = 1 for an RBF kernel. In Adankon and Cheriet [2007] it was noted that

    K_{i,j} · ∂α_j/∂Θ_k  ≪  α_j · ∂K_{i,j}/∂Θ_k .

Using this result, 3.16 yields

    ∂α_i/∂Θ_k + y_i · ∂ξ_i/∂Θ_k = − ∑_{j=1, j≠i}^{l} ( K_{i,j} · ∂α_j/∂Θ_k + α_j · ∂K_{i,j}/∂Θ_k ) ≈ − ∑_{j=1, j≠i}^{l} α_j · ∂K_{i,j}/∂Θ_k ,
from which the approximate derivatives of α_i and ξ_i can be extracted.

Derivative of the bias b

b was defined as

    b := (1/2) · ( min_{α_i < B_i} g_i + max_{α_i > A_i} g_i ) .

The KKT conditions of the dual SVM solution yield

    ∀ i with A_i < α_i < B_i :  g_i = b ,    (3.17)

while for any i with A_i = α_i or B_i = α_i the value g_i has an arbitrary offset to b. Assuming that the set of indices i fulfilling the condition in 3.17 does not change in a small area around the current kernel parameters, the derivative of b should be

    ∂b/∂Θ_k = ∂g_i/∂Θ_k ,  ∀ i with A_i < α_i < B_i .

To minimize the approximation error, we average the derivatives over all appropriate g_i. Let S := {i : A_i < α_i < B_i} and Z := |S|. Then

    ∂b/∂Θ_k = (1/Z) ∑_{i∈S} ∂g_i/∂Θ_k ≈ − (1/Z) ∑_{i∈S} ∑_j α_j · ∂K_{i,j}/∂Θ_k .

4. Speeding up SVM prediction

4.1. Linear independent support vectors

Since for a Gaussian RBF kernel all kernel values are greater than 0, there is never an angle bigger than 90° between two vectors in feature space. Basic geometry dictates that in this scenario vectors can only be linearly dependent if they are equal. Since we are using a Gaussian RBF kernel, this is the only case of interest to us. We will now show that a slight modification to the dual formulation allows finding the exact solution using only linearly independent vectors.

Assume (α, b) is the solution of the SVM optimization problem with samples X = {(x_i, y_i), i = 1 … N}. Let, without loss of generality, (x_{N−1}, y_{N−1}) = (x_N, y_N). We define the reduced problem by removing (x_{N−1}, y_{N−1}) and (x_N, y_N) and inserting (x̂, ŷ) := (x_N, y_N) with Â := A_{N−1} + A_N and B̂ := B_{N−1} + B_N at position N − 1.

Lemma 4.1. α̂ with
    α̂_i := α_i               for i ∈ {1 … N − 2}
    α̂_{N−1} := α_{N−1} + α_N

and b̂ = b are the solution of the reduced problem and yield the same decision function as α and b.
Proof. The decision function given by α̂, b̂ is

    f̂(x) = ∑_{i=1}^{N−1} α̂_i K(x, x_i) + b̂
         = (α_{N−1} + α_N) · K(x, x_{N−1}) + ∑_{i=1}^{N−2} α_i K(x, x_i) + b
         = ∑_{i=1}^{N} α_i K(x, x_i) + b = f(x)
and therefore equal to the decision function given by (α, b). From this and 2.10, the gradients ĝ_i of this solution are

    ĝ_i = y_i − f̂(x_i) + b̂ = y_i − f(x_i) + b = g_i

and thereby equal to the original gradients. We now show that the KKT conditions are fulfilled. Remember from section 2.2 that if the KKT conditions are fulfilled, the solution is optimal. Because the KKT conditions did not change for samples 1 to N − 2, we only have to show them for the newly inserted sample.

Case Â < α̂ < B̂: In this case either A_N < α_N < B_N or A_{N−1} < α_{N−1} < B_{N−1}, which yields ĝ_{N−1} = g_N = g_{N−1} = b = b̂.

Case Â = α̂: It follows that A_N = α_N and therefore ĝ_{N−1} = g_{N−1} ≤ b = b̂.

Case B̂ = α̂: This case is analogous to the last.

The fulfillment of the KKT conditions shows that (α̂, b̂) is indeed the solution of the reduced problem.

Lemma 4.1 implies that we can merge linearly dependent vectors simply by removing one and setting A and B of the other to the respective sums. As mentioned above, for a Gaussian RBF kernel two support vectors are only linearly dependent if they are equal. Unfortunately, this is very unlikely to happen. In Orabona et al. [2010] a criterion was developed for almost linearly dependent support vectors. Following Orabona et al. [2010], assume Φ(x) is linearly dependent on Φ(x_i), i = {1 … l}. It follows that
    ∃ d ∈ R^l such that Φ(x) = ∑_{i=1}^{l} d_i · Φ(x_i) .
If this criterion cannot be fulfilled, we are interested in the "best" representation of Φ(x) that can be achieved by a linear combination of the Φ(x_i). In Orabona et al. [2010] it is defined via the minimal squared distance ∆ of Φ(x) to the hyperplane spanned by the Φ(x_i):
    ∆ := min_{d ∈ R^l} ‖ ∑_{i=1}^{l} d_i · Φ(x_i) − Φ(x) ‖² .    (4.1)
In the online learning case, it is computationally too expensive to check whether a new support vector can be represented by a combination of all existing support vectors. New vectors are therefore only checked against single vectors in the working set, reducing 4.1 to

    ∆_i := min_{d ∈ R} ‖ d · Φ(x_i) − Φ(x) ‖² .    (4.2)
Geometry dictates that for ∆_i to be minimal, Φ(x_i) and d · Φ(x_i) − Φ(x) have to be orthogonal. Let d_min be the d reaching this minimum, and let ⟨·,·⟩ be the scalar product of the feature space on which the Gaussian kernel K is built. Then

    0 = ⟨Φ(x_i), d_min · Φ(x_i) − Φ(x)⟩
    ⇒ 0 = d_min · ⟨Φ(x_i), Φ(x_i)⟩ − ⟨Φ(x_i), Φ(x)⟩ = d_min · K(x_i, x_i) − K(x_i, x)
    ⇒ d_min = K(x_i, x) / K(x_i, x_i) .

Plugging this result into 4.2 and remembering that K(x, x) = 1 reveals

    ∆_i = ‖ d_min · Φ(x_i) − Φ(x) ‖²
        = K(x_i, x)² / K(x_i, x_i)² · K(x_i, x_i) + K(x, x) − 2 · K(x_i, x) / K(x_i, x_i) · K(x_i, x)
        = 1 − K(x_i, x)² .
4.1.1. Merging of support vectors

Orabona et al. [2010] proposed to represent a vector by approximately linearly dependent vectors if ∆ < t, where t is an externally set threshold. We only consider pairs of vectors: whenever a vector is considered for insertion into the active set in the PROCESS procedure (algorithm 5), it is checked against all vectors in the active set for approximate linear dependence. The major computational expense in this process is the calculation of the kernel values; since they are already calculated for the gradient of the new vector, the overhead is minimal. When vectors i, j fulfill

    1 − K(x_i, x_j)² < t ,    (4.3)

the situation is completely symmetric in i and j, and it is unclear whether i should be represented by j or the other way around. We choose a "fair" approach: the vector representing i and j is built by averaging x_i and x_j:

    x_avg = (x_i + x_j) / 2 .
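The pairwise test 4.3 and the averaging merge can be sketched as follows, assuming a Gaussian RBF kernel with unit diagonal; the function names and the gamma parameter are our own illustration, not laSvm's interface:

```python
# Sketch: approximate linear dependence test (4.3) for a Gaussian RBF kernel
# and the "fair" merge by averaging; inputs are hypothetical feature vectors.
import numpy as np

def rbf(x, y, gamma=1.0):
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.exp(-gamma * np.sum((x - y) ** 2))

def maybe_merge(xi, xj, t, gamma=1.0):
    """Return the averaged vector if Delta = 1 - K(xi, xj)^2 < t, else None."""
    if 1.0 - rbf(xi, xj, gamma) ** 2 < t:
        return (np.asarray(xi, float) + np.asarray(xj, float)) / 2.0
    return None
```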
4.1.2. Experiments

The merging threshold t has been varied on a logarithmic scale from 0 to 1. For t = 0 only identical support vectors are merged; for t = 1 all vectors of one class are merged into a single vector. On every set in svm_sets, an SVM has been trained with a multi-parameter Gaussian kernel. The parameters of the kernel have been optimized using the techniques described in chapter 3. With these parameters, the SVM has been trained for the different threshold values t, and the number of support vectors and the test accuracy have been measured.
4.1.3. Results

The results are shown in figures 4.1 and 4.2. Because of their similarity to a1a, the results for a2a and a3a have been omitted. The number of support vectors is not monotonically decreasing, because the merging of support vectors can cause other vectors to become support vectors. The dataset for which the test error increases significantly first is svmguide1: from around t = 10^−2 its testing error increases rapidly. For all other datasets a value of t = 10^−1 still does not decrease the accuracy significantly. For the a1a, svmguide1, ijcnn1 and voxels datasets, the number of support vectors has already decreased significantly at t = 10^−2; for ijcnn1 it has decreased by a factor of about 2. At t = 10^−1 the number of support vectors for ijcnn1 has decreased to a fraction of about 1/10, and for the other datasets it has also decreased much more than at t = 10^−2. For the breast-cancer dataset, the number of support vectors could be decreased below 50% of the original amount without increasing the error rate, but at t = 10^−1 there is only a very small reduction in the number of support vectors for this dataset. For online learning it would be extremely helpful to have a measure of the reduction in accuracy that does not require a separate testing set. One could try to use the fraction of remaining support vectors as such a measure. But as the results on splice and ionosphere in figure 4.2 show, there are examples where the accuracy decreases as soon as the number of support vectors decreases even by a small amount.
4.1.4. Discussion

A method for reducing the number of support vectors in an SVM has been discussed and adapted for the laSvm implementation developed in this thesis. The experiments show that for most datasets, the number of support vectors can be reduced dramatically
Figure 4.1.: Number of support vectors and testing error in dependence of the threshold t. Panels: (a) a1a, (b) australian, (c) breast-cancer, (d) cod-rna, (e) german.numer, (f) ijcnn1. Each panel plots the error rate (%) and the number of support vectors against t.
Figure 4.2.: Number of support vectors and testing error in dependence of the threshold t. Panels: (a) ionosphere, (b) mushrooms, (c) splice, (d) svmguide1, (e) voxels. Each panel plots the error rate (%) and the number of support vectors against t.
without decreasing the SVM accuracy significantly. Unfortunately it is as yet unclear how to determine at which merging threshold t the accuracy starts to decrease significantly. If one does not want to harm the performance of the SVM, one can only set the threshold to a safe value of t = 10^−2, which yields a much smaller reduction in the number of support vectors than could otherwise be achieved. For an offline SVM, one could use cross validation and gradually increase t; as soon as the accuracy starts to decrease, a good value for t has been found. This has a good chance of decreasing the overall SVM training time, because the SVM only has to be retrained for the larger values of t. A good criterion for SVM accuracy that does not rely on retraining and can be computed fast would be most valuable; it would decrease the training and prediction time of the SVM significantly for most datasets. In the online SVM integrated in Ilastik, t has been set to 10^−2 to ensure that performance is not harmed.
4.2. Cover Trees

We will investigate decreasing SVM prediction time by using cover trees to evaluate the kernel sum in 2.9. A cover tree based kernel sum method tailored for SVM prediction will be introduced and its performance compared to the cover tree based method from Ram et al. [2009].
4.2.1. Theory

Cover trees have been introduced in Beygelzimer et al. [2006]. A cover tree is a data structure for fast nearest neighbor search and can also be used for kernel summation [Ram et al., 2009]. A cover tree T stores a data set D in a levelled tree. Every node in the tree is associated with a point x ∈ D. Although a point can be contained in several nodes, it may not occur more than once in one level of the tree. We will denote nodes by their points when the context resolves the ambiguity. Let C_i denote the set of points at level i and d(x, y) the distance between x, y ∈ D. For a cover tree, the following invariants hold:

Nesting: C_i ⊂ C_{i−1}.

Covering: ∀ x ∈ C_{i−1}: ∃ y ∈ C_i such that d(x, y) ≤ 2^i. The node associated with y at level i is the parent of the node associated with x at level i − 1.

Separation: ∀ x, y ∈ C_i: d(x, y) > 2^i.

As discussed in Beygelzimer et al. [2006], in theory a cover tree goes from level i = −∞ to level i = ∞. In practice this is neither possible nor necessary. Because of the Nesting invariant, every node has a child node containing the same point; we call these child nodes the "self nodes". If all levels consisting only of self nodes are removed, the explicit representation of the cover tree is created. Its uppermost level is one above the first level not containing any node but the root node; its lowermost level is the first level containing all points.

Building the cover tree

Constructing any tree structure can happen either by inserting the elements one at a time into a growing structure, or by a batch mode algorithm constructing the tree from all samples at once. In the case of a cover tree it can be assumed that batch mode creation is faster, since the samples do not have to traverse down the tree separately for insertion. While Beygelzimer et al. [2006] provides pseudo code for a function inserting an element into a cover tree, batch mode creation of cover trees is not discussed.
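The three invariants can be checked mechanically on a small explicit level structure. The sketch below is our own illustration, not part of the reimplementation; levels maps a level i to the list of points in C_i, and parent maps a (level, point) pair to the parent point one level up:

```python
# Sketch: verifying the Nesting, Covering and Separation invariants of a
# cover tree given as explicit per-level point lists (hypothetical structure).
import math

def check_invariants(levels, parent):
    for i in sorted(levels):
        pts = levels[i]
        # Separation: any two points on level i are more than 2^i apart.
        assert all(math.dist(p, q) > 2 ** i
                   for a, p in enumerate(pts) for q in pts[a + 1:])
        if i + 1 in levels:
            # Nesting: C_{i+1} is a subset of C_i.
            assert all(p in pts for p in levels[i + 1])
            # Covering: each point's parent one level up is within 2^{i+1}.
            assert all(math.dist(p, parent[(i, p)]) <= 2 ** (i + 1) for p in pts)
    return True
```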
In the supplementary material of Beygelzimer et al. [2006], source code for a cover tree, including batch mode creation, is provided. Unfortunately the code is badly documented, and its internal mechanisms did not reveal themselves even after deep study of the code.
Because the code is in pure C, it is extremely inflexible, and integration into the laSvm was by no means obvious. It has been decided to redesign and reimplement the cover tree in C++. The resulting batch construction is stated in pseudo code in algorithm 8. A comparison of the C++ code and the original C implementation suggests that the idea is the same.

Algorithm 8 ConstructCoverTree(root point x_root, point set P ⊂ D, level i)
 1: procedure ConstructCoverTree(root point x_root, point set P ⊂ D, level i)
 2:   Create node n based on point x_root and level i
 3:   if P = ∅ then
 4:     return (n, ∅)
 5:   end if
 6:   P_next ← { x ∈ P | d(x_root, x) ≤ 2^i }
 7:   P_unused ← P \ P_next
 8:   (c, P_conflict) ← ConstructCoverTree(x_root, P_next, i − 1)
 9:   add c to children of n
10:   while P_conflict ≠ ∅ do
11:     choose arbitrary x_next ∈ P_conflict
12:     P_conflict ← P_conflict \ { x_next }
13:     P_next ← { x ∈ P_conflict ∪ P_unused | d(x_next, x) ≤ 2^i }
14:     P_conflict ← P_conflict \ P_next
15:     P_unused ← P_unused \ P_next
16:     (c, P_return) ← ConstructCoverTree(x_next, P_next, i − 1)
17:     add c to children of n
18:     P_conflict ← P_conflict ∪ { x ∈ P_return | d(x, x_root) ≤ 2^i }
19:     P_unused ← P_unused ∪ { x ∈ P_return | d(x, x_root) > 2^i }
20:   end while
21:   return (n, P_unused)
22: end procedure

We will prove that the returned tree obeys the cover tree invariants. First we need a lemma required for the proofs.

Lemma 4.2. Let T_sub be a subtree of an explicitly represented cover tree T. With T \ T_sub we denote the cover tree T without its subtree T_sub. Let i be the root level of T_sub and x_root the point in its root node. Then any point y with d(x_root, y) > 2^{i+1} cannot be in a node of T \ T_sub that conflicts with the Separation invariant with one of the nodes in T_sub.

Proof. Because of the Covering invariant, any point z in T_sub at level i − j must fulfill

    d(x_root, z) ≤ ∑_{k=i−j}^{i} 2^k ≤ 2^{i+1} − 2^{i−j} .
Let y be such that d(x_root, y) > 2^{i+1} and y belongs to a node in T \ T_sub. Then

    d(z, y) ≥ d(x_root, y) − d(x_root, z) > 2^{i+1} − 2^{i+1} + 2^{i−j} = 2^{i−j} ,

which proves the statement.

The Nesting invariant is fulfilled because the given root point is passed as the root point of the first child, created in line 8. The Covering invariant is fulfilled because all further children, created in line 16, are rooted at points chosen from P_conflict in line 11, into which points conflicting with the invariant are never inserted. We now prove the Separation invariant.

Lemma 4.3. The set of points returned by algorithm 8 cannot violate the Separation invariant with the nodes in the returned cover tree.

Proof. The proof is by induction over the cardinality of P. If |P| = 0, no points are returned (line 4). Let the statement be proven for all P with |P| < b. If P_next = P in line 6, the recursion continues with the same P and x_root until P ≠ P_next. Because of lemma 4.2 and the induction hypothesis, P_unused in line 7 cannot violate the Separation invariant with nodes in the subtree. The loop in lines 10-20 maintains the following invariants at line 10:

1. No x ∈ P_conflict ∪ P_unused can violate the Separation invariant with any node in a subtree rooted at any of the children of n.

2. No x ∈ P_unused can violate the Separation invariant with any node in the subtree rooted at n.

P_unused is initialized at line 7 with only points having a distance of at least 2^i to the first child; because of lemma 4.2, none of these can violate the Separation invariant with it. P_conflict, on the other hand, is initialized at line 8 and, by the induction hypothesis, contains no points that could violate the Separation invariant with the first child. In every iteration of the loop, P_conflict and P_unused are filtered (lines 13-15) for any points that could violate the Separation invariant with the new child x_next. In lines 18 and 19 the sets P_conflict and P_unused are extended with points from P_return.
Because P_return ⊂ P_next ⊂ (P_conflict ∪ P_unused), the added points cannot conflict with any node in a subtree rooted at any of the children of n but x_next. Because P_return is returned from the recursion, the same holds for the new child by the induction hypothesis. This shows that invariant 1 holds. Invariant 2 follows from invariant 1 and the observation that in line 19 only points satisfying the Separation invariant for level i are inserted into P_unused. Since P_unused is returned in line 21, the assertion follows from invariant 2.

Lemma 4.3 yields the Separation invariant by recursion.
4.2.2. SVM Prediction utilizing cover trees

The goal of this section is to speed up the SVM prediction, i.e. the calculation of 2.9, utilizing cover tree structures. In Ram et al. [2009] a kernel sum evaluation method based on cover trees is given. A prerequisite is that the kernel is a monotone function of some distance measure. We use the kernel in 2.8 and can use

    d(x, y) := sqrt( ∑_{i=1}^{d} γ_i (x_i − y_i)² )

as a distance measure. Then 2.8 becomes

    K(x, y) := exp(−d(x, y)²) ,

fulfilling the needed constraint. The idea of the kernel sum algorithm is to prune away parts of the tree for which the contributed error of every point within the pruned subtree is bounded by a threshold e. This is possible because the distance of a node in a subtree rooted at level i to its root node can be no more than ∑_{k=−∞}^{i} 2^k = 2^{i+1}. If the number of points in the cover tree is N, the total error on f(x) can be no more than N·e. While no significant error was observed during preliminary experiments with e = 0.1, this cover tree kernel sum only guarantees the error to be smaller than N·e. If we have 100 support vectors, the bound would be 10, which is by no means acceptable. To get an acceptable bound on f(x), one would have to set e = 0.001; but with e = 0.001 the computational benefit vanishes, yielding no significant speedup over the baseline method.
4.2.3. Cover tree kernel sum tailored for SVM prediction

To get the label of a prediction sample from an SVM, one only needs to know whether the discriminant function 2.9 is positive or negative. In the case of active learning, the exact value is only needed if it is close to the decision boundary. We will refer to the direct computation of 2.9 as the complete kernel sum. We introduce an algorithm exploiting this observation. It traverses every prediction sample down the cover tree while keeping track of the possible range [f_min, f_max] of 2.9 imposed by the nodes that have not been investigated yet, until one of the following stopping criteria becomes true:

1. f_max − f_min ≤ δ: f(x) needs to be known only up to a certain precision δ.

2. f_max ≤ −e ∨ f_min ≥ e: When f(x) is known to have a certain distance to the decision boundary, its exact value does not matter; only its sign does.
Consider a subtree T_sub of a cover tree filled with support vectors. Let the root node be x at level l. Let X = {x_i} be the set of support vectors contained in T_sub and α = {α_i} their corresponding weights. Remember from section 2 that the α_i are negative for samples with negative label and positive otherwise. Let

    S_pos := ∑_{α_i > 0} α_i ,    S_neg := ∑_{α_i < 0} α_i .

For any support vector x_i in T_sub and any prediction sample y, the triangle inequality yields d(x, y) − 2^{l+1} ≤ d(x_i, y) ≤ d(x, y) + 2^{l+1}, and therefore

    K(d(x, y) + 2^{l+1}) ≤ K(d(x_i, y)) ≤ K̂(d(x, y) − 2^{l+1}) ,

where, for convenience,

    K̂(a) := K(a) if a > 0,  K(0) otherwise .

The second case of the upper bound is needed because for negative arguments K(·) cannot be monotone anymore. The contribution of the positive weights is bounded by

    S_pos · K(d(x, y) + 2^{l+1}) ≤ ∑_{α_i > 0} α_i · K(d(x_i, y)) ≤ S_pos · K̂(d(x, y) − 2^{l+1}) .

Similar reasoning applied to the negative weights results in

    f_min = S_neg · K̂(d(x, y) − 2^{l+1}) + S_pos · K(d(x, y) + 2^{l+1})    (4.4)
    f_max = S_pos · K̂(d(x, y) − 2^{l+1}) + S_neg · K(d(x, y) + 2^{l+1})    (4.5)
In the practical implementation, the maximum distance of a vector in a subtree to the subtree's root is stored during creation. Algorithm 9 needs a function getRange returning the range [f_min, f_max] of a subtree given its root node.
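A possible getRange for a Gaussian kernel, following 4.4 and 4.5, is sketched below. The node is represented here only by its precomputed weight sums S_pos and S_neg, its level and the distance of its root point to the query; these are assumptions for illustration, not the actual C++ interface:

```python
# Sketch of getRange: the [f_min, f_max] contribution of a subtree per (4.4)-(4.5).
# dist_root is d(x, y) for the subtree root x and query y; level is the root level.
import math

def K(dist):  # Gaussian kernel as a monotone function of the distance
    return math.exp(-dist ** 2)

def K_hat(dist):  # K is only monotone for positive arguments, clamp otherwise
    return K(dist) if dist > 0 else K(0.0)

def get_range(dist_root, level, S_pos, S_neg):
    far = K(dist_root + 2 ** (level + 1))       # smallest possible kernel value
    near = K_hat(dist_root - 2 ** (level + 1))  # largest possible kernel value
    return S_neg * near + S_pos * far, S_pos * near + S_neg * far  # (f_min, f_max)
```

As the level decreases, near and far approach each other and the interval collapses onto the exact contribution of the subtree.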
4.2.4. Speeding up the evaluation of exp

In the implementation of the cover trees, every distance evaluation is cached. Therefore the cover tree strategies never need more distance calculations than the complete kernel sum. But for finding f_min and f_max as in 4.4, two evaluations of exp are necessary.
Algorithm 9 Approximation of a kernel sum tailored for SVM prediction
 1: procedure ApproximateKernelSum(cover tree ct, prediction vector y, precision δ, margin e)
 2:   r ← rootNode(ct)
 3:   (f_min, f_max) ← getRange(r, y)
 4:   Q ← {(f_min, f_max, r)}    ▷ create a stack with nodes and their ranges
 5:   while Q ≠ ∅ do
 6:     (g_min, g_max, r) ← pop(Q)
 7:     f_min ← f_min − g_min    ▷ remove the range of the popped stack entry
 8:     f_max ← f_max − g_max
 9:     for all c ∈ Children(r) do
10:       (g_min, g_max) ← getRange(c, y)
11:       f_min ← f_min + g_min    ▷ add the ranges of the children
12:       f_max ← f_max + g_max
13:       Q ← Q ∪ {(g_min, g_max, c)}
14:     end for
15:     if f_min ≥ e ∨ f_max ≤ −e ∨ (f_max − f_min) ≤ δ then    ▷ stopping criteria
16:       break
17:     end if
18:   end while
19:   return (f_min + f_max)/2
20: end procedure
If the pruning of the tree is unsuccessful, there can be more exp evaluations than in the complete kernel sum. To speed up exp, a lookup table has been created. Only values for x ∈ (−∞, 0] are needed. The lookup table has a size of 500 and stores values of exp(x) for x equidistantly distributed in [−5, 0]. For any x < −5, 0 is returned. The distance between two lookup entries is 5/500 = 0.01. Because |∂exp(x)/∂x| ≤ 1 for x ≤ 0 and exp(−5) < 0.01, the error is never greater than 0.01.
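The lookup table can be sketched as follows (a Python illustration of the idea; the thesis implementation is in C++):

```python
# Sketch: exp lookup table with 500 equidistant samples on [-5, 0];
# returns 0 for x < -5, and the error stays below 0.01 for all x <= 0.
import math

TABLE_SIZE, X_MIN = 500, -5.0
STEP = -X_MIN / TABLE_SIZE  # 0.01
TABLE = [math.exp(X_MIN + i * STEP) for i in range(TABLE_SIZE + 1)]

def fast_exp(x):
    """Nearest-entry lookup of exp(x) for x <= 0."""
    if x < X_MIN:
        return 0.0
    return TABLE[round((x - X_MIN) / STEP)]
```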
4.2.5. Experiments

For all svm_sets datasets an SVM has been trained and optimized as described in chapter 3. Using the resulting SVM, predictions have been done on the corresponding test sets. The following prediction strategies have been compared:

Complete sum: The complete sum in 2.9 is evaluated.

CT: A cover tree is created from the support vectors, and prediction is done using the kernel sum described in Ram et al. [2009]. This has been done for e = 0.1 and e = 0.001.
CT (marg.): Short for cover tree with margin. The kernel sum is evaluated using algorithm 9 with the parameters e = 0.5, δ = 0.1 and e = 0.0, δ = 0.1.

The experiments have been repeated 30 times, yielding statistics for the running time measurements.
4.2.6. Results

First, the experiments have been done with the exp implementation from the C standard library. In figure 4.3 the error rates of the strategies on svm_sets are given and compared in a bar plot. Besides the cover tree strategy with e = 0.1, which is much worse than the others on the ijcnn1 and cod-rna datasets, there are no notable differences between the strategies. It is no surprise that the cover tree strategy with e = 0.1 fails on ijcnn1 and cod-rna: these datasets still have a lot of support vectors after being optimized. For the ijcnn1 dataset, there are about 9000 support vectors. The error bound for the cover tree kernel sum is proportional to the number of support vectors; for ijcnn1 it is e · N ≈ 0.1 · 9000 = 900.

Figure 4.4 shows run time measurements for the different strategies. The times have been measured for all strategies and divided by the time needed for the complete kernel sum, which makes the results for different datasets comparable. The cover tree strategy with e = 0.1 seems to be the fastest in general. Unfortunately, as we have seen in figure 4.3, it is also not very reliable. For the cover tree strategy with margin, it does not make a big difference whether e = 0 or e = 0.5 is set. As figure 4.4 shows, in general the cover tree strategies do not come off well. Besides CT with e = 0.1, only on three datasets (australian, splice and mushrooms) is a real improvement visible. On the other datasets, especially the big ones (ijcnn1 and cod-rna), they perform badly.

For the cover tree, the calculation of the kernel 2.8 has been split into two parts: the calculation of a distance, and the evaluation of exp on this distance. While the number of distance calculations (if properly cached) is never bigger for any of the cover tree strategies than for the complete kernel sum, the exponential function may have to be evaluated far more often, as can be seen in equation 4.4.

Figure 4.5 shows the number of distance calculations needed by the different strategies, while figure 4.6 shows the number of evaluations of exp. These figures reveal that for most datasets many more exp evaluations are needed than for the complete sum. The experiments have been repeated with exp from the C standard library replaced by the lookup table implementation. For the complete kernel sum, the speedup gained by this replacement is stated in table 4.1. Figure 4.7 lists the times needed with the lookup table. All methods gained speedup, but the complete sum gained the most, as can be seen in figure 4.7.
4. Speeding up SVM prediction

Dataset         Speedup        Dataset         Speedup
a1a             2.3            svmguide1       1.8
a2a             2.3            german.numer    2.2
a3a             2.3            voxels          2.2
breast-cancer   2.2            mushrooms       1.6
australian      2.2            ijcnn1          1.8
ionosphere      2.3            cod-rna         2.0
splice          2.4

Table 4.1.: Relative time gain by using a lookup table for the exponential function
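The lookup table replacement for exp can be sketched as follows. This is a minimal Python sketch, not the C++ implementation used in the thesis; the class name, cutoff and table size are illustrative. For an RBF kernel, only exp(-x) for x ≥ 0 is needed, and beyond a cutoff the kernel contribution is negligible, so the table covers a bounded interval:

```python
import numpy as np

# Sketch of a lookup-table replacement for exp (illustrative, not the
# thesis implementation). exp(-x) is tabulated on [0, cutoff]; beyond
# the cutoff the kernel value is negligible and 0 is returned.

class ExpTable:
    def __init__(self, cutoff=20.0, size=1 << 16):
        self.cutoff = cutoff
        self.scale = (size - 1) / cutoff
        self.table = np.exp(-np.linspace(0.0, cutoff, size))

    def neg_exp(self, x):
        """Approximate exp(-x) for x >= 0 by nearest-entry lookup."""
        if x >= self.cutoff:
            return 0.0
        return self.table[int(x * self.scale + 0.5)]

t = ExpTable()
# The approximation error is bounded by the table resolution:
assert abs(t.neg_exp(3.7) - np.exp(-3.7)) < 1e-3
```

The trade-off is table resolution versus memory: a finer table reduces the approximation error but increases cache pressure, which can eat into the speedup.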
4.2.7. Finding bottlenecks in the prediction

In order to find the reasons for the unsatisfactory running times of the SVM tailored cover tree prediction, a test program predicting with the complete kernel sum and with the margin cover tree has been profiled on the ionosphere dataset, using the exp implementation from the C standard library. The callgrind tool from the valgrind bundle1 has been used, and the results have been visualized with kcachegrind2. The program has been compiled in "release with debug symbols" mode, meaning that debug symbols are created but the program is still optimized. The debug symbols are necessary to see which machine code belongs to which C++ function. If optimization had been turned off, the results would not have been representative. Unfortunately, turning optimization on also reduces the ability to assign machine code to C++ functions, because functions can be inlined and transformed in other ways.

The profiling revealed that the complete kernel sum spends about 20% of its time evaluating exp and 70% on the distance calculations. In the margin cover tree method, 15% is spent in exp. The amount of time spent calculating distances was not recoverable, because in this method the distance calculations are distributed over different places and the optimizer mangled them too much. Construction of the cover tree took less than 1% of the time. One more bottleneck was found: 20% of the time spent in the margin cover tree method was used for organizing the priority queue. The implementation uses a heap from the C++ standard library to store and retrieve the element with the biggest range in O(log(n))3, where n is the number of stored ranges. This explains why this method is so tremendously inefficient on the big datasets (ijcnn1 and cod-rna): on these datasets, the heap grows particularly large.
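The priority queue pattern that showed up as a bottleneck can be sketched as follows. This is an illustrative Python sketch of the traversal order, not the C++ implementation; node and range structures are assumptions. Each push and pop costs O(log n) heap maintenance, which adds up when the heap grows large:

```python
import heapq

# Sketch of the priority queue pattern used in the margin cover tree
# traversal: nodes are expanded in order of decreasing kernel-sum range.
# Structures are illustrative, not the thesis implementation.

def traverse_by_range(root_nodes, expand):
    # heapq is a min-heap, so ranges are negated to pop the largest first;
    # the counter breaks ties without comparing node objects.
    heap = [(-r, i, node) for i, (r, node) in enumerate(root_nodes)]
    heapq.heapify(heap)
    order, counter = [], len(heap)
    while heap:
        _, _, node = heapq.heappop(heap)
        order.append(node)
        for child_range, child in expand(node):
            heapq.heappush(heap, (-child_range, counter, child))
            counter += 1
    return order

# Toy tree: node "a" has children "b" (range 5) and "c" (range 9).
children = {"a": [(5, "b"), (9, "c")], "b": [], "c": []}
assert traverse_by_range([(7, "a")], lambda n: children[n]) == ["a", "c", "b"]
```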
4.2.8. Discussion

Cover trees have been investigated to speed up SVM prediction. An algorithm tailored for SVM prediction, which bails out once the prediction is certain, has been developed. Unfortunately, these cover tree methods are faster only on a few datasets and much slower on some others. Since it cannot be guaranteed in advance that the cover tree methods are faster, it is difficult to decide whether they should be used on a particular dataset or not. Because of the tremendous speedup on some datasets (splice, australian and mushrooms), it would be very desirable to either know in advance whether a cover tree is advantageous on a dataset, or to improve it so that it has no such huge speed disadvantages on any dataset. The latter might be achievable by further analysing the bottlenecks found in section 4.2.7 and trying to improve them. A major bottleneck is the heap in the queue that always returns the element with the highest kernel sum range. If this criterion, and with it the heap, could be replaced by something simpler, a speedup is to be expected.

While the cover trees do not improve the prediction times on all datasets, the number of needed distance calculations was never worse on any dataset, and brought big improvements on some. Within the last month of this thesis, several problems with the handling of prediction data were detected that caused significant slowdowns of the distance calculations. The preliminary experiments had been conducted while those problems were present and showed much bigger improvements by the cover tree methods. On datasets with many thousands of features, the distance calculations are the major bottleneck in the kernel sum, and cover tree based evaluations should have a higher chance of decreasing prediction times.

In addition, the exp function has been sped up using a lookup table. For the complete kernel sum, a speedup of around 2 has been achieved for most datasets.

1 http://valgrind.org/
2 http://kcachegrind.sourceforge.net/html/Home.html
3 http://www.sgi.com/tech/stl/push_heap.html
Dataset         complete sum   CT: e = 0.001   CT: e = 0.1   CT (marg.): δ = 0.1, e = 0   CT (marg.): δ = 0.1, e = 0.5
a1a             17.8           17.8            17.8          17.8                         17.8
a2a             17.9           17.9            17.8          17.9                         17.9
a3a             17.9           17.9            17.9          17.9                         17.9
breast-cancer   2.2            2.2             2.2           2.2                          2.2
australian      13.0           13.0            13.0          13.0                         13.0
ionosphere      6.8            6.8             7.7           6.8                          6.8
splice          9.5            9.5             9.5           9.5                          9.5
svmguide1       3.5            3.5             3.9           3.5                          3.5
german.numer    23.4           23.4            23.1          23.4                         23.4
voxels          6.1            6.1             12.1          6.0                          6.0
mushrooms       0.1            0.1             0.1           0.1                          0.1
ijcnn1          2.2            2.2             43.4          2.2                          2.2
cod-rna         4.1            4.1             12.7          4.1                          4.1

Figure 4.3.: Error rates (in %) for the different strategies on the svm_sets
Dataset         complete sum      CT: e = 0.001      CT: e = 0.1       CT (marg.): δ = 0.1, e = 0   CT (marg.): δ = 0.1, e = 0.5
a1a             (1975 ± 3)s       (1886 ± 3)s        (715 ± 1)s        (1746 ± 3)s                  (1814 ± 2)s
a2a             (2842 ± 5)s       (2858 ± 4)s        (2095 ± 2)s       (2748 ± 3)s                  (2791 ± 2)s
a3a             (3595 ± 4)s       (3755 ± 6)s        (2134 ± 2)s       (3676 ± 5)s                  (3730 ± 5)s
breast-cancer   (416 ± 1)ms       (624 ± 1)ms        (616 ± 1)ms       (596 ± 1)ms                  (610 ± 1)ms
australian      (1.58 ± 0.01)s    (982 ± 1)ms        (270 ± 1)ms       (340 ± 1)ms                  (496 ± 1)ms
ionosphere      (508 ± 1)ms       (619 ± 1)ms        (524 ± 1)ms       (462 ± 1)ms                  (490 ± 1)ms
splice          (89.5 ± 0.8)s     (24.2 ± 0.1)s      (24.1 ± 0.1)s     (19.3 ± 0.0)s                (19.6 ± 0.0)s
svmguide1       (42.0 ± 0.2)s     (62.6 ± 0.1)s      (38.0 ± 0.1)s     (59.0 ± 0.0)s                (61.4 ± 0.1)s
german.numer    (4.73 ± 0.02)s    (6.52 ± 0.00)s     (6.41 ± 0.00)s    (6.61 ± 0.01)s               (6.67 ± 0.01)s
voxels          (144 ± 0)s        (222 ± 0)s         (58.8 ± 0.1)s     (213 ± 0)s                   (219 ± 0)s
mushrooms       (240 ± 1)s        (7.22 ± 0.00)s     (6.01 ± 0.00)s    (4.77 ± 0.01)s               (5.09 ± 0.00)s
ijcnn1          (19381 ± 76)s     (41060 ± 36)s      (11991 ± 22)s     (43353 ± 54)s                (44106 ± 44)s
cod-rna         (72651 ± 402)s    (156199 ± 1241)s   (43705 ± 79)s     (176088 ± 1050)s             (179975 ± 1464)s

Figure 4.4.: Time for the different strategies on the svm_sets. For the graphic, the times have been normalized to the time needed for the complete sum.
Dataset                 complete sum   CT: e = 0.001   CT: e = 0.1   CT (marg.): δ = 0.1, e = 0   CT (marg.): δ = 0.1, e = 0.5
a1a (×10^7)             202            189             83            173                          179
a2a (×10^7)             289            270             203           256                          259
a3a (×10^7)             360            336             199           320                          324
breast-cancer (×10^4)   138            138             136           128                          131
australian (×10^3)      3818           1862            386           469                          779
ionosphere (×10^3)      1193           1080            912           757                          810
splice (×10^5)          779            322             321           312                          315
svmguide1 (×10^6)       149            145             79            126                          130
german.numer (×10^5)    127            127             124           124                          125
voxels (×10^6)          322            322             92            305                          310
mushrooms (×10^4)       21529          587             469           344                          371
ijcnn1 (×10^8)          499            479             135           448                          455
cod-rna (×10^9)         248            206             54            196                          199

Figure 4.5.: Number of distance calculations done in the different kernel sum strategies on the svm_sets. For the graphic, they have been normalized to the distance calculations needed by the complete sum.
Dataset                 complete sum   CT: e = 0.001   CT: e = 0.1   CT (marg.): δ = 0.1, e = 0   CT (marg.): δ = 0.1, e = 0.5
a1a (×10^7)             202            513             285           350                          360
a2a (×10^7)             289            750             649           509                          515
a3a (×10^7)             360            925             675           637                          644
breast-cancer (×10^4)   138            375             373           252                          258
australian (×10^4)      382            513             135           126                          197
ionosphere (×10^4)      119            286             243           152                          162
splice (×10^5)          779            995             995           619                          625
svmguide1 (×10^6)       149            381             270           253                          260
german.numer (×10^5)    127            345             342           240                          242
voxels (×10^6)          322            920             361           605                          615
mushrooms (×10^4)       21529          2527            1863          899                          980
ijcnn1 (×10^8)          499            1313            512           891                          904
cod-rna (×10^9)         248            578             216           388                          393

Figure 4.6.: Relative number of exp evaluations done in the different kernel sum strategies on the svm_sets
Dataset         complete sum      CT: e = 0.001     CT: e = 0.1      CT (marg.): δ = 0.1, e = 0   CT (marg.): δ = 0.1, e = 0.5
a1a             (873 ± 0)s        (1213 ± 2)s       (368 ± 1)s       (1225 ± 3)s                  (1273 ± 1)s
a2a             (1220 ± 0)s       (1892 ± 4)s       (1212 ± 2)s      (2029 ± 3)s                  (2060 ± 2)s
a3a             (1535 ± 0)s       (2565 ± 3)s       (1505 ± 2)s      (2842 ± 4)s                  (2874 ± 3)s
breast-cancer   (191 ± 1)ms       (276 ± 1)ms       (266 ± 1)ms      (342 ± 1)ms                  (356 ± 1)ms
australian      (716 ± 1)ms       (396 ± 1)ms       (152 ± 1)ms      (360 ± 1)ms                  (430 ± 1)ms
ionosphere      (225 ± 1)ms       (263 ± 1)ms       (243 ± 1)ms      (280 ± 1)ms                  (292 ± 1)ms
splice          (37.3 ± 0.0)s     (11.1 ± 0.0)s     (11.1 ± 0.0)s    (13.1 ± 0.0)s                (13.2 ± 0.0)s
svmguide1       (23.4 ± 0.0)s     (33.0 ± 0.0)s     (9.39 ± 0.00)s   (41.2 ± 0.0)s                (43.1 ± 0.0)s
german.numer    (2.12 ± 0.00)s    (3.47 ± 0.01)s    (3.40 ± 0.00)s   (4.53 ± 0.01)s               (4.54 ± 0.00)s
voxels          (64.5 ± 0.0)s     (123 ± 0)s        (46.0 ± 0.1)s    (147 ± 0)s                   (149 ± 0)s
mushrooms       (146 ± 0)s        (4.12 ± 0.00)s    (3.22 ± 0.00)s   (2.62 ± 0.01)s               (2.84 ± 0.01)s
ijcnn1          (10788 ± 0)s      (26155 ± 44)s     (6091 ± 19)s     (32317 ± 57)s                (32634 ± 18)s
cod-rna         (36476 ± 54)s     (99933 ± 671)s    (28244 ± 110)s   (134498 ± 394)s              (137579 ± 1185)s

Figure 4.7.: Time for the different strategies on the svm_sets using a lookup table for the exponential function. For the graphic, the times have been normalized to the time needed for the complete sum.
5. Online Random Forest

We will investigate an online Random Forest based on Saffari et al. [2009]. An extension called threshold adjustment will be introduced. In addition, we will introduce online prediction clustering to speed up the prediction of the online Random Forest by exploiting a clustering derived from the last online Random Forest prediction.
5.1. Random Forest

The Random Forest has been introduced in Breiman [2001]. A Random Forest consists of an ensemble of decision trees. Each tree in the ensemble is built independently on a bootstrap training set generated by sampling with replacement from the original training set.

We differentiate between two kinds of nodes in a decision tree. Interior nodes have two children, a left and a right child. They split the feature space with a threshold t along a feature dimension d. During training, all input vectors x with xd < t are used to build the left child, while the rest are used to build the right child. To create a split, first a random set of tests is selected; in the Random Forest variant used here, the tests are selected by choosing a random subset of the features to be considered. The split is then made according to some quality measure identifying the best split; the implementation from Nair [2010] used here employs the Gini index. During prediction, interior nodes propagate the prediction samples to their left or right child according to the threshold t in dimension d, the same way as during training.

Leaf nodes have no children. They contain a prediction value which is created from the labels of the samples the leaf node has been built from. During prediction, every sample ends up in a specific leaf node. It gets assigned a probability distribution over the possible classes which is proportional to the fraction of training samples of each class the node has been built with. Here, trees are grown until purity, meaning that a leaf node is not created before all of its samples are of only one class. Consequently, during prediction the leaf nodes assign a probability of 1.0 to the class their training samples were labeled with. A Random Forest assigns a probability to a prediction sample by averaging the predictions of the single decision trees.

To differentiate between online and "non-online" Random Forests, we refer to the "non-online" Random Forest as the batch Random Forest.
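The prediction scheme described above can be sketched as follows. This is an illustrative Python sketch; the dictionary-based node structures are assumptions for the example, not the implementation from Nair [2010]:

```python
import numpy as np

# Minimal sketch of Random Forest prediction as described above: each tree
# routes a sample to a leaf (xd < t goes left) and returns a class
# distribution; the forest averages the tree predictions.

def predict_tree(node, x):
    while "leaf" not in node:                 # interior node: split on (d, t)
        node = node["left"] if x[node["d"]] < node["t"] else node["right"]
    return node["leaf"]                       # class probability vector

def predict_forest(trees, x):
    return np.mean([predict_tree(t, x) for t in trees], axis=0)

# Two stumps grown to purity on one feature dimension:
t1 = {"d": 0, "t": 0.5, "left": {"leaf": [1.0, 0.0]}, "right": {"leaf": [0.0, 1.0]}}
t2 = {"d": 0, "t": 0.3, "left": {"leaf": [1.0, 0.0]}, "right": {"leaf": [0.0, 1.0]}}
probs = predict_forest([t1, t2], np.array([0.4]))
assert np.allclose(probs, [0.5, 0.5])   # the trees disagree, the forest averages
```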
5.2. Online Random Forest

In Saffari et al. [2009] an online Random Forest based on extremely randomized forests [Geurts et al., 2006] is introduced. It uses online bagging modeled by a Poisson distribution, a method from Oza and Russell [2001]. A Random Forest randomly selects the dimensions along which the split criterion (for example the Gini index) is evaluated. In an extremely randomized forest, the tested threshold values are also selected randomly. In other words, in every node of a tree of an extremely randomized forest a set of tests T is randomly sampled, among which the best is selected. In the online Random Forest of Saffari et al. [2009], the set of tests is remembered in every node. New samples propagate to their corresponding nodes and the tests are redone. A node is split if the number of samples in the node exceeds one threshold and the gain in the quality measure exceeds another.

Here, those ideas are adapted. It is examined how this method works with normal instead of extremely randomized forests. Also, an extension is introduced that adjusts the thresholds of interior nodes in some cases.
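Selecting the best test among a candidate set by a quality measure can be sketched as follows. This is a minimal sketch of Gini-based split selection; the function names and the representation of candidate tests as (dimension, threshold) pairs are illustrative, not taken from any of the cited implementations:

```python
# Sketch of split selection by the Gini index: among a set of candidate
# (d, t) tests, the one minimizing the weighted child impurity is chosen.
# Illustrative only.

def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(samples, labels, tests):
    """tests: candidate (d, t) pairs; returns the pair with minimal
    weighted child impurity."""
    def cost(d, t):
        left = [y for x, y in zip(samples, labels) if x[d] < t]
        right = [y for x, y in zip(samples, labels) if x[d] >= t]
        if not left or not right:
            return float("inf")    # degenerate split: everything on one side
        n = len(labels)
        return len(left) / n * gini(left) + len(right) / n * gini(right)
    return min(tests, key=lambda dt: cost(*dt))

X = [[0.1], [0.2], [0.8], [0.9]]
y = [0, 0, 1, 1]
# The threshold separating the two classes yields pure children (cost 0):
assert best_split(X, y, [(0, 0.15), (0, 0.5), (0, 0.85)]) == (0, 0.5)
```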
5.2.1. Online Bagging

Similar to Saffari et al. [2009], we use the method of online bagging proposed in Oza and Russell [2001]. Each tree receives k copies of each new sample, where k is generated by a Poisson distribution p(λ). We set λ = 1 in accordance with Saffari et al. [2009].
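The online bagging step can be sketched as follows. This is an illustrative Python sketch of the Oza and Russell [2001] scheme as used above; the list-based stand-in for tree insertion is an assumption for the example:

```python
import numpy as np

# Sketch of online bagging: each new sample is inserted k times into each
# tree, with k drawn independently per tree from a Poisson distribution
# with lambda = 1. Illustrative only.

rng = np.random.default_rng(0)

def online_bag(sample, trees, lam=1.0):
    for tree in trees:
        k = rng.poisson(lam)          # number of copies for this tree
        for _ in range(k):
            tree.append(sample)       # stand-in for "insert into tree"

trees = [[] for _ in range(100)]
online_bag("x", trees)
# On average, a fraction exp(-1) (about 37%) of the trees receive no copy,
# which is what makes the trees of the ensemble diverse.
```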
5.2.2. Growing of trees

Again, similar to Saffari et al. [2009], new samples are propagated to their corresponding leafs. Since the criteria for when to split a node in Saffari et al. [2009] are weakly motivated, we instead split leafs until they are pure (contain only samples from one class). In summary, the online Random Forest used here works like this:

1. A Random Forest is trained on an initial training set.
2. Every time new samples arrive, for every tree the number of instances of a sample to be inserted into that tree is drawn from a Poisson distribution with λ = 1. This means that not every sample is inserted into every tree.
3. For every tree, the samples are propagated to their leaf nodes. The corresponding nodes are split until all are pure.
5.2.3. Threshold adjustment

In Saffari et al. [2009] and the online Random Forest described above, trees are extended online by continuing to grow the tree from its leaf nodes. Modifying any of its interior nodes is difficult, because the tree must stay in accordance with the training data it has seen so far. We propose an extension to the Random Forest which exploits that it is sometimes possible to adjust the threshold without changing the prediction on the training samples the tree has seen so far.

Assume an interior node has been created based on training samples (xi, yi), xi ∈ R^D, yi ∈ L, where L is the set of possible labels. The interior node gets assigned a dimension d and a threshold t. Assume there is a gap between the threshold and the closest training samples along dimension d; in other words, there is an e ∈ R such that |xdi − t| > e for all xi. If a new training sample (x, y) arriving in the node lies within the gap, i.e. |xd − t| < e, the threshold can be adjusted to propagate the new training sample to either of the children without modifying the decision on any of the old training samples.

Without loss of generality, assume xd < t. When the threshold is not adjusted, the new sample propagates to the left child. If we modify the threshold to

t' = (xd + t − e) / 2

then

t − xd < e
⇒ xd > t − e
⇒ 2xd > xd + t − e
⇒ xd > (xd + t − e) / 2 = t'

and therefore the new sample is propagated to the right child. This adjustment does not change the decision for the old training samples. For any xi with xdi > t > t' this is obvious, so assume xdi < t. We know xd > t − e and therefore

t − e < (xd + t − e) / 2 = t'.

Because of the gap prerequisite it follows that xdi < t − e < t', meaning that the sample is still assigned to the left child.

We propose adjusting the threshold if the Gini index can be reduced by this measure. The gap is remembered for all interior nodes during the initial training of the Random Forest. Then, during online learning, for every new sample in an interior node it is checked whether the sample falls into the gap. If that is the case, the threshold is adjusted such that the Gini index is minimal. Calculating the Gini index for both choices (putting the new sample into the left child or into the right child) requires knowing how many samples of which label have been propagated to the left and right child. In addition to the gap, this information has to be stored in every interior node. The idea is illustrated in figure 5.1. We call the online Random Forest with this addition "online RF with threshold adjustment".
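The adjustment test above can be sketched as follows. This is an illustrative Python sketch of the derivation; the function name and its return convention are assumptions, not the thesis implementation:

```python
# Sketch of the threshold adjustment described above: an interior node
# splitting dimension d at threshold t remembers the gap e to the nearest
# training sample. A new sample with x_d inside the gap can be sent to
# either child by moving the threshold, without changing the decision for
# any old sample. Illustrative only.

def adjusted_threshold(t, e, xd):
    """Return a threshold sending xd to the right child, or None if
    xd lies outside the gap (t - e, t + e)."""
    if abs(xd - t) >= e:
        return None                # not in the gap, nothing to adjust
    if xd >= t:
        return t                   # already goes right
    return (xd + t - e) / 2.0      # t' from the derivation above

t, e = 5.0, 1.0
t_new = adjusted_threshold(t, e, 4.5)     # new sample falls in the gap
assert t_new is not None and t - e < t_new < 4.5
# Old left samples (x_d < t - e) stay left; the new sample now goes right.
```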
Figure 5.1.: Threshold adjustment illustrated. The colored circles denote training samples with their label indicated by their color. The new sample is marked by a black boundary. The threshold is adjusted such that the Gini index is minimal.
5.2.4. Relearning of trees

Because the first splits of the online Random Forest can be biased by the samples with which it was initially started, it might not generalize as well as a batch learned Random Forest. If time is left during online learning, the model can be improved by discarding and relearning whole trees.
5.3. Online prediction clustering

For a Random Forest with 100 trees, predicting an image with 500x500 pixels took much longer than one second in preliminary experiments, making the whole process unsuitable for interactive feedback to the user. For prediction on a dataset, every data vector needs to traverse down all trees in the Random Forest. If the data were clustered, one could try to traverse a whole cluster down a tree. If the cluster is completely encapsulated in one leaf of the tree, the cluster has to be traversed down the tree only once to get the prediction for all samples within it. Although a cluster could be split while traversing down the tree, this idea can only bring a computational benefit if the samples in a cluster belong to the same leaf most of the time. In the case of online learning in Ilastik, two facts contribute favorably to this situation:

1. In every online step, the trees are only expanded, not changed. Since the number of online samples should be small compared to the number of already learned samples, the tree also does not change much.
2. The prediction is done on the same datasets every time the forest is expanded online.

As illustrated in figure 5.2, having a clustering of the prediction data close to the clustering imposed by the leaf nodes of a decision tree would benefit the prediction time.
Because we do the prediction over and over on the same dataset, a clustering is given by the last prediction, and because the trees do not change much, this clustering is close to the clustering imposed by the tree. In fact, because the online learned trees are only expansions of the original trees, the clusters can go down the tree without having to be split until they reach the online learned expansion.
5.3.1. Algorithm

In addition to the feature matrix, a separate clustering of the data for every tree is input into the prediction function. For every cluster, a bounding box is passed. The clusters are then traversed down the tree. At every interior node, it is checked, using the bounding box, whether the cluster fits completely into one of the child nodes.

• If yes, the cluster continues by traversing to the child node it fits into.
• If no, it is partitioned into sub-clusters fitting into the child nodes. These are then traversed down the corresponding sub-trees.

If a cluster reaches a leaf node, the predictions of all elements of the cluster are updated and the cluster is remembered for the next prediction. Figure 5.2 illustrates the process for a one dimensional Random Forest. The cluster, marked by a black circle, traverses down the tree. On the upper level it completely fits into the left child and continues traversing. On the next level it has to be partitioned into two sub-clusters, which fall into leaf nodes.

Figure 5.2.: A cluster traversing down a tree and being split. For details refer to the text.

Every tree in the Random Forest creates its own clustering. In every prediction of the datasets by the online Random Forest, the clustering from the last prediction is used as input. Details of the implementation are given in section 6.2.2.
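The cluster traversal can be sketched as follows. This is an illustrative Python sketch, not the thesis implementation: the node structures are assumptions, and for brevity the bounding box is recomputed from the data at each node, whereas the implementation described above passes it along with the cluster:

```python
# Sketch of online prediction clustering: a cluster of samples with a
# per-dimension bounding box is traversed down a decision tree as a whole
# and only split when the box straddles an interior node's threshold.

def bounding_box(indices, X):
    pts = [X[i] for i in indices]
    return ([min(p[d] for p in pts) for d in range(len(pts[0]))],
            [max(p[d] for p in pts) for d in range(len(pts[0]))])

def traverse(node, indices, X, out):
    """Predict all samples in the cluster; return the leaf clusters to be
    reused as the starting clustering of the next prediction."""
    if "leaf" in node:
        for i in indices:
            out[i] = node["leaf"]       # one prediction for the whole cluster
        return [indices]
    lo, hi = bounding_box(indices, X)
    d, t = node["d"], node["t"]
    if hi[d] < t:                       # whole box fits into the left child
        return traverse(node["left"], indices, X, out)
    if lo[d] >= t:                      # whole box fits into the right child
        return traverse(node["right"], indices, X, out)
    left = [i for i in indices if X[i][d] < t]    # split the cluster
    right = [i for i in indices if X[i][d] >= t]
    return (traverse(node["left"], left, X, out) +
            traverse(node["right"], right, X, out))

tree = {"d": 0, "t": 0.5,
        "left": {"leaf": 0},
        "right": {"d": 0, "t": 0.8, "left": {"leaf": 1}, "right": {"leaf": 0}}}
X = [[0.6], [0.7], [0.1]]
out = {}
clusters = traverse(tree, [0, 1, 2], X, out)
assert out == {0: 1, 1: 1, 2: 0}
assert sorted(map(sorted, clusters)) == [[0, 1], [2]]
```

Note how samples 0 and 1 reach their leaf as one cluster with a single descent, which is where the speedup comes from when clusters rarely split.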
5.3.2. Tree relearning with online prediction clustering

As discussed above, during online learning trees are relearned to account for the bias from the first training samples. Assume we decide to relearn n trees. Which trees should be relearned? One possibility is going through the trees sequentially, relearning the trees in the order they have been trained (which is a random order). We refer to this strategy as sequential tree relearning. If the prediction time is to be reduced by tree relearning, the trees causing the highest prediction time should be relearned. With online prediction clustering, a measure for the prediction time is obtained by summing, over all clusters, the depth at which each cluster reaches a leaf node. We refer to this strategy as longest prediction length.
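The longest prediction length strategy can be sketched as follows; the function name and the per-tree list of cluster depths are illustrative assumptions, not the thesis implementation:

```python
# Sketch of the "longest prediction length" relearning strategy: for each
# tree, the depths at which its clusters reached a leaf in the last
# prediction are summed, and the n costliest trees are relearned.

def select_trees_to_relearn(cluster_depths_per_tree, n):
    """cluster_depths_per_tree[i] lists, for tree i, the depth at which
    each cluster reached a leaf; returns the indices of the n costliest
    trees."""
    costs = [(sum(depths), i) for i, depths in enumerate(cluster_depths_per_tree)]
    costs.sort(reverse=True)
    return [i for _, i in costs[:n]]

depths = [[2, 3], [5, 6, 7], [1]]       # tree 1 has the longest prediction paths
assert select_trees_to_relearn(depths, 1) == [1]
```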
5.4. Experiments

Different variants of the online Random Forest and a batch Random Forest have been trained on the svm_small_sets. For sets with fewer than 4000 training samples, the training data has been learned in subsets of 20 samples. For larger training sets, the training data has been split into 200 subsets. The classifiers have been trained on the subsets sequentially until the whole dataset was learned. The training time for the subsets has been measured. After every subset, prediction has been done using online prediction clustering, and the time needed for prediction and the accuracy of the classifier have been recorded. The experiments have been repeated 30 times, randomizing the order of the training samples. For better readability of the tables and graphics, the following abbreviations are used:

online RF: Online Random Forest without threshold adjusting.
online with t.a.: Online Random Forest with threshold adjusting.
no rel.: No trees have been relearned.
seq. n: n trees have been relearned in sequential order after every training subset.
long. n: n trees have been relearned after every training subset. They were selected by the prediction length described above.

The following Random Forest variants have been tested:

• batch Random Forest
• online RF no rel.
• online RF seq. 1
• online RF seq. 10
• online RF long. 1
• online RF long. 10
• online RF with t.a. seq. 1
• online RF with t.a. seq. 10
• online RF with t.a. long. 1
• online RF with t.a. long. 10
5.5. Results

Table 5.1(a) shows the error rates of the different Random Forest variants after learning the complete datasets. Table 5.1(b) shows the training times for the last subset. The a1a, splice, german.numer and voxels datasets have been chosen as representatives for the graphs. In figures 5.3 and 5.4 the progress of the error rate and training time is plotted against the number of learned labels. For clarity, the variants relearning 1 tree and the errorbars have been omitted. For comparison between the online Random Forest with and without threshold adjustment, the tables have been color coded: in both tables 5.1(a) and 5.1(b), blue entries mark the better result between online RF and online RF with threshold adjustment.

Table 5.2(a) shows the time needed for prediction, in milliseconds, after the whole datasets have been learned. For the a1a, splice, german.numer and voxels datasets, these times are plotted in figure 5.5, again omitting the variants where 1 tree has been relearned and the errorbars for clarity. Similar to table 5.1(b), the results in table 5.2(a) are color coded; the colors mark the better results between online RF with and without threshold adjusting. Table 5.2(b) additionally shows the speedup achieved by online prediction clustering.
5.5.1. Threshold adjusting vs. no threshold adjusting

Although more entries are marked blue in the online RF with t.a. columns of table 5.1(a), none of these show a significant improvement over the counterpart without threshold adjustment. Due to the overhead of storing the gap for all interior nodes and testing all new samples against it during online learning, threshold adjusting increases the training times. But again, as can be seen in table 5.1(b), these time differences are not significant. The results suggest that the situations in which a threshold can be adjusted are rare. The overhead of storing and testing gaps nevertheless exists and increases the training time.
5.5.2. Tree relearning

Relearning even 1 tree after every subset already brings an advantage, no matter which strategy is used for selecting the tree to relearn. For relearning 10 instead of 1 tree, no
(a) Prediction errors in %

online RF:
Dataset         batch RF     no rel.      seq. 1       seq. 10      long. 1      long. 10
a1a             17.0 ± 0.0   18.4 ± 0.1   17.5 ± 0.1   17.1 ± 0.1   17.8 ± 0.1   17.4 ± 0.1
a2a             17.2 ± 0.0   18.6 ± 0.2   17.4 ± 0.1   17.2 ± 0.1   18.1 ± 0.1   17.7 ± 0.1
a3a             17.2 ± 0.0   18.6 ± 0.2   17.3 ± 0.1   17.2 ± 0.1   18.0 ± 0.1   17.7 ± 0.1
mushrooms       0.0 ± 0.0    0.0 ± 0.0    0.0 ± 0.0    0.0 ± 0.0    0.0 ± 0.0    0.0 ± 0.0
breast-cancer   2.9 ± 0.0    2.5 ± 0.5    2.5 ± 0.6    3.2 ± 0.5    2.5 ± 0.6    2.9 ± 0.5
australian      16.3 ± 0.0   15.4 ± 0.8   15.2 ± 0.8   16.0 ± 0.9   16.0 ± 0.9   16.0 ± 0.6
ionosphere      6.6 ± 0.0    7.0 ± 1.3    6.9 ± 1.1    7.5 ± 1.0    6.9 ± 1.1    7.2 ± 1.2
splice          3.4 ± 0.0    6.8 ± 0.9    4.5 ± 0.5    3.8 ± 0.4    4.8 ± 0.6    4.0 ± 0.3
german.numer    23.3 ± 0.0   24.7 ± 1.1   24.2 ± 1.0   23.9 ± 1.4   24.4 ± 1.0   23.4 ± 1.5
voxels          4.9 ± 0.0    5.3 ± 0.2    5.0 ± 0.1    4.9 ± 0.1    5.0 ± 0.1    4.9 ± 0.1

online RF with t.a.:
Dataset         no rel.      seq. 1       seq. 10      long. 1      long. 10
a1a             18.4 ± 0.1   17.5 ± 0.1   17.1 ± 0.1   17.9 ± 0.1   17.5 ± 0.1
a2a             18.6 ± 0.2   17.4 ± 0.1   17.3 ± 0.1   18.1 ± 0.2   17.7 ± 0.1
a3a             18.6 ± 0.2   17.3 ± 0.1   17.2 ± 0.1   18.1 ± 0.1   17.7 ± 0.1
mushrooms       0.0 ± 0.0    0.0 ± 0.0    0.0 ± 0.0    0.0 ± 0.0    0.0 ± 0.0
breast-cancer   2.3 ± 0.5    2.6 ± 0.5    3.2 ± 0.5    2.4 ± 0.6    3.1 ± 0.5
australian      15.1 ± 0.9   15.1 ± 0.7   16.1 ± 0.8   15.7 ± 0.7   16.4 ± 0.8
ionosphere      6.7 ± 0.9    6.6 ± 0.9    7.0 ± 1.1    6.5 ± 0.7    6.9 ± 0.9
splice          6.6 ± 0.7    4.6 ± 0.5    3.9 ± 0.3    5.0 ± 0.5    4.0 ± 0.4
german.numer    24.6 ± 1.1   24.0 ± 1.0   23.4 ± 0.9   23.7 ± 0.8   23.7 ± 1.0
voxels          5.3 ± 0.1    5.0 ± 0.1    4.9 ± 0.1    5.0 ± 0.1    4.9 ± 0.1

(b) Training times in milliseconds

online RF:
Dataset         batch RF      no rel.    seq. 1     seq. 10     long. 1    long. 10
a1a             1356 ± 0      26 ± 5     33 ± 4     180 ± 6     34 ± 4     175 ± 6
a2a             1881 ± 0      33 ± 4     40 ± 3     243 ± 5     41 ± 3     236 ± 4
a3a             2913 ± 1      32 ± 2     50 ± 4     351 ± 7     52 ± 4     345 ± 9
mushrooms       2091 ± 1      5 ± 1      24 ± 3     227 ± 10    25 ± 3     225 ± 12
breast-cancer   116 ± 0       5 ± 4      6 ± 3      18 ± 4      6 ± 4      18 ± 5
australian      184 ± 0       10 ± 2     12 ± 2     30 ± 4      11 ± 3     31 ± 3
ionosphere      128 ± 0       10 ± 4     11 ± 4     24 ± 4      11 ± 4     24 ± 5
splice          491 ± 0       18 ± 2     19 ± 2     70 ± 4      19 ± 2     68 ± 5
german.numer    360 ± 0       14 ± 2     17 ± 2     56 ± 4      17 ± 2     56 ± 5
voxels          10656 ± 613   26 ± 16    134 ± 28   1143 ± 94   140 ± 26   1137 ± 88

online RF with t.a.:
Dataset         no rel.    seq. 1     seq. 10     long. 1    long. 10
a1a             28 ± 5     40 ± 4     192 ± 7     40 ± 4     198 ± 6
a2a             29 ± 3     49 ± 3     263 ± 7     50 ± 3     260 ± 4
a3a             30 ± 4     60 ± 5     378 ± 6     62 ± 5     379 ± 9
mushrooms       4 ± 0      26 ± 4     233 ± 11    26 ± 3     232 ± 10
breast-cancer   5 ± 4      6 ± 3      19 ± 5      7 ± 3      19 ± 5
australian      12 ± 3     13 ± 3     33 ± 3      13 ± 3     33 ± 4
ionosphere      10 ± 4     11 ± 4     24 ± 4      11 ± 4     25 ± 5
splice          21 ± 2     23 ± 2     73 ± 5      22 ± 2     74 ± 6
german.numer    17 ± 2     20 ± 3     62 ± 4      19 ± 2     62 ± 4
voxels          30 ± 17    142 ± 30   1153 ± 66   147 ± 25   1146 ± 86

Table 5.1.: Error rates in % and training times in milliseconds for different online Random Forests after learning the complete datasets. The colors mark the better results between online RF with and without threshold adjusting. Details are explained in the text.
(a) Error rates for a1a    (b) Error rates for voxel
(c) Training time for a1a    (d) Training time for voxel

Figure 5.3.: Error rates and training times on a1a and voxel.
(a) Error rates for german.numer    (b) Error rates for splice
(c) Training time for german.numer    (d) Training time for splice

Figure 5.4.: Error rates and training times on splice and german.numer.
(a) Prediction times in milliseconds for different online Random Forests after learning the complete datasets using online prediction clustering. The colors mark the better results between online RF with and without threshold adjusting. Details are explained in the text.

online RF:
Dataset         batch RF        no rel.        seq. 1         seq. 10        long. 1        long. 10
a1a             1797.6 ± 1.2    188.1 ± 28.5   205.5 ± 24.8   353.7 ± 23.1   202.7 ± 24.9   351.2 ± 23.7
a2a             1688.3 ± 0.1    144.0 ± 15.0   164.8 ± 23.2   305.4 ± 15.9   159.1 ± 15.7   302.1 ± 14.4
a3a             1817.3 ± 0.0    160.4 ± 10.6   180.2 ± 19.9   330.2 ± 9.7    176.0 ± 10.3   328.1 ± 9.1
mushrooms       47.9 ± 0.0      11.4 ± 0.3     10.4 ± 0.2     13.3 ± 0.3     9.2 ± 0.1      11.5 ± 0.2
breast-cancer   3.6 ± 0.0       1.7 ± 0.1      1.7 ± 0.1      1.7 ± 0.1      1.6 ± 0.1      1.6 ± 0.1
australian      6.7 ± 0.0       3.7 ± 0.2      3.4 ± 0.1      3.0 ± 0.1      3.3 ± 0.1      3.0 ± 0.1
ionosphere      3.4 ± 0.0       2.0 ± 0.1      1.7 ± 0.1      1.7 ± 0.1      1.7 ± 0.1      1.7 ± 0.1
splice          69.2 ± 0.1      24.3 ± 2.0     21.0 ± 1.7     19.7 ± 1.8     19.9 ± 1.5     19.2 ± 1.6
german.numer    20.0 ± 0.0      8.0 ± 0.5      7.7 ± 1.4      7.7 ± 1.5      7.5 ± 1.7      7.5 ± 1.0
voxels          106.5 ± 0.1     44.6 ± 4.3     23.3 ± 2.2     28.4 ± 3.5     23.7 ± 2.8     27.9 ± 1.6

online RF with t.a.:
Dataset         no rel.        seq. 1         seq. 10        long. 1        long. 10
a1a             192.5 ± 24.6   200.7 ± 25.4   354.6 ± 31.7   197.9 ± 25.3   350.0 ± 23.3
a2a             149.0 ± 14.8   157.5 ± 15.7   302.8 ± 16.2   155.0 ± 15.2   302.4 ± 14.6
a3a             167.7 ± 10.5   173.1 ± 10.2   326.3 ± 9.8    171.3 ± 10.1   323.3 ± 9.3
mushrooms       11.9 ± 0.3     10.1 ± 0.2     13.0 ± 0.3     9.1 ± 0.2      11.4 ± 0.3
breast-cancer   1.7 ± 0.1      1.7 ± 0.1      1.6 ± 0.1      1.7 ± 0.1      1.6 ± 0.1
australian      3.5 ± 0.2      3.6 ± 0.1      3.0 ± 0.1      3.5 ± 0.1      3.0 ± 0.1
ionosphere      1.8 ± 0.1      2.0 ± 0.1      1.9 ± 0.1      2.0 ± 0.1      1.8 ± 0.1
splice          23.5 ± 2.2     21.3 ± 1.6     18.8 ± 1.7     20.2 ± 1.6     18.7 ± 1.6
german.numer    7.7 ± 0.6      8.1 ± 1.8      7.4 ± 1.0      7.9 ± 1.0      8.1 ± 4.0
voxels          33.0 ± 3.0     22.4 ± 1.8     26.9 ± 2.9     23.0 ± 2.4     27.8 ± 2.8

(b) Prediction speedups for online RF without threshold adjustment on the svm_small_sets

Dataset         batch RF   no rel.   seq. 1   seq. 10   long. 1   long. 10
a1a             1          9.34      8.75     5.08      8.87      5.12
a2a             1          11.33     10.25    5.53      10.61     5.59
a3a             1          10.84     10.08    5.50      10.33     5.54
mushrooms       1          4.04      4.59     3.61      5.19      4.16
breast-cancer   1          2.18      2.20     2.20      2.26      2.23
australian      1          1.93      1.97     2.22      2.06      2.22
ionosphere      1          1.92      1.93     1.93      1.98      2.00
splice          1          2.95      3.30     3.51      3.48      3.61
german.numer    1          2.61      2.61     2.61      2.65      2.67
voxels          1          3.23      4.57     3.75      4.50      3.82

Table 5.2.: Prediction times in milliseconds and speedups by online prediction clustering compared to prediction of the batch Random Forest.
[Four plots, omitted here: (a) Prediction times for german.numer, (b) Prediction times for splice, (c) Prediction times for a1a, (d) Prediction times for voxels. Each plot shows prediction time (s) over #samples for the batch random forest and for the online RF with and without threshold adjustment (no relearning, seq. 10, long. 10).]
Figure 5.5.: Prediction times on german.numer, splice, a1a and voxels.
No significant improvements can be seen in table 5.0(a). Whether the sequential or the longest-tree relearning selection strategy is used also does not affect the results much. In general the sequential strategy is better, but the differences are not significant. The slightly better performance of the sequential strategy can be explained by noticing that the longest-tree strategy might never touch some trees, which do not necessarily perform better. In table 5.0(b) we see that while retraining one tree costs little extra time compared to retraining no trees, the time needed for retraining 10 trees is significantly higher. In addition, the prediction time with online prediction clustering increases by up to a factor of two, because the clustering has to be redone when a tree is relearned.
5.5.3. Batch vs. online Random Forest

Table 5.0(b) shows that the speedups of online learning compared to the batch Random Forest are significant. On the voxel dataset, the online Random Forest trains about 11 times faster than the batch Random Forest even when ten trees are relearned. While the accuracy of the online Random Forest is worse than that of the batch Random Forest, the decrease in training time justifies its use.
5.5.4. Online prediction clustering

Table 5.2(b) lists the speedup achieved by online prediction clustering compared to prediction by a batch Random Forest. For some datasets the speedup is tremendous: for a2a a speedup of up to 11 is recorded. The minimal speedup is 1.93.
5.6. Discussion

We have implemented an online Random Forest similar to Saffari et al. [2009]. An extension, called threshold adjustment, has been introduced and experimentally evaluated. While the accuracy of the online Random Forest is acceptable, the extension does not bring any significant improvements.

Relearning of single trees does improve the accuracy of the online Random Forest, but also increases the training time. The number of trees to be relearned is configurable. In the Ilastik integration (see chapter 8), no trees are relearned when new data arrives; instead, as many trees as time allows are relearned while the user is labeling new pixels.

In addition, the prediction time of the online Random Forest has been decreased by a technique we call online prediction clustering. The improvements in prediction time are dramatic. If one tree is relearned in every online learning step, a speedup factor of up to 10 is reached on some datasets. When 10 trees are relearned, a speedup of 5 is still observed on some datasets. The minimum speedup on the recorded datasets is 1.93. For Ilastik, online prediction clustering yields the crucial running-time improvement that allows interactive prediction feedback by the online Random Forest.
Part III.
Implementation and Software Infrastructure
6. Implementations of the online learners

6.1. Online SVM

laSvm has been implemented in C++ starting from the C implementation provided with Bordes et al. [2005]. The class hierarchy of the implementation is sketched in the following diagram.
svhandler
  └── onlineSVMBase
        └── laSvmBase
              └── laSvm
We will discuss the classes in the following sections.
6.1.1. svhandler

An important part that has been completely redesigned and implemented is the kernel cache, which is packed into the svhandler class. This class is responsible for handling input vectors in an active set S and an inactive set N by providing functions for moving vectors between these sets. We will refer to vectors in S as support vectors and to vectors in N as normal vectors, as that is how they are used in the laSvm implementation.

A vector initially enters N with svhandler::AddSample, which returns the index of the new vector in N. The method svhandler::MakeSV moves a vector from N to S and svhandler::MakeNonSV moves it back. S is represented by an array svhandler::support_vectors, for which svhandler::MakeSV returns an index. The svhandler class makes sure that this index can be used to access the support vector until it is removed with svhandler::MakeNonSV. This way, support vectors with special properties can be remembered by classes utilizing svhandler.

It is assumed that both the active set and the inactive set are generally growing and that shrinking happens comparatively seldom. Justified by this assumption, the internal arrays for S (svhandler::support_vectors) and N (svhandler::normal_vectors) are never shrunk. Instead, “garbage lists” of unused elements in these arrays are kept.

The main task of the svhandler class is to manage the kernel cache storing often-used kernel values for vectors in S. The kernel values are needed at three points in laSvm: in PROCESS the initial gradients are calculated (algorithm 5, line 5), and in algorithm 5, line 18 and algorithm 6, line 10 the gradients of all support vectors are updated. In both cases the iteration fixes one of the indices of the kernel matrix, so whenever kernel values are needed, a whole kernel row is needed. It therefore makes sense for the kernel cache to store often-used values by caching kernel rows.
The number of kernel rows that can be stored is bounded above by the memory limit given to svhandler on initialization. This upper bound shrinks when the number of support vectors grows, because the length of the kernel rows increases. The svhandler takes care of shrinking the number of rows in the kernel cache accordingly. It keeps a list identifying which kernel row is used by which support vector. The kernel cache can also be invalidated completely (for example when the kernel parameters change) using invalidateCache.

svhandler provides two interface functions to access kernel rows. svhandler::GetKernelRow returns a reference to a kernel row, recalculating it if necessary. Because a reference is returned, the svhandler class must not change its internal storage while the row is in use; therefore svhandler::UnlockKernelRow has to be called when the kernel row is no longer used.

For laSvm, the kernel rows are arrays and the support vectors are classes storing α and g. But for other SVM variants, different information may be needed in every support vector and kernel row. To implement, for example, a multiple kernel learning SVM as described in Sonnenburg et al. [2006b], a kernel row needs to consist of as many arrays as there are kernels in the combined kernel. For this reason, the types of a kernel row and of a support vector are template parameters, making svhandler suitable as a base class for other SVM algorithms utilizing an active set of support vectors.
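The locking contract of GetKernelRow/UnlockKernelRow can be illustrated with a small model. The actual implementation is C++; the following is a minimal Python sketch of the idea with hypothetical names, in which a row may only be evicted once its lock count has dropped to zero:

```python
import math

class KernelCache:
    """Sketch of an svhandler-style kernel-row cache (illustrative names).

    get_kernel_row locks a row so the cache may not evict it until
    unlock_kernel_row is called; eviction picks the oldest unlocked row.
    """

    def __init__(self, kernel, vectors, max_rows):
        self.kernel = kernel          # kernel(x, y) -> float
        self.vectors = vectors        # current support vectors
        self.max_rows = max_rows      # memory limit, expressed in rows
        self.rows = {}                # sv index -> cached kernel row
        self.locks = {}               # sv index -> lock count
        self.lru = []                 # least-recently-used order

    def get_kernel_row(self, i):
        if i not in self.rows:
            if len(self.rows) >= self.max_rows:
                self._evict_one()
            self.rows[i] = [self.kernel(self.vectors[i], v)
                            for v in self.vectors]
        self.locks[i] = self.locks.get(i, 0) + 1
        if i in self.lru:
            self.lru.remove(i)
        self.lru.append(i)
        return self.rows[i]

    def unlock_kernel_row(self, i):
        self.locks[i] -= 1

    def invalidate_cache(self):
        # e.g. after a kernel parameter change; simplified: assumes
        # no row is locked at this point
        self.rows.clear()
        self.locks.clear()
        self.lru.clear()

    def _evict_one(self):
        for i in self.lru:            # evict the oldest unlocked row
            if self.locks.get(i, 0) == 0:
                self.lru.remove(i)
                del self.rows[i]
                return
        raise RuntimeError("all cached rows are locked")
```

A caller would pair every get_kernel_row with an unlock_kernel_row, mirroring the C++ interface described above.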
6.1.2. OnlineSVMBase

The OnlineSVMBase class derives from svhandler and provides functionality that should be common to online support vector machine algorithms. It assumes that support vectors (which are still template parameters in this class) have a member alpha storing the weight of the support vector. Utilizing this, methods to predict labels and to calculate the discriminant function from 2.9 with different methods are provided. These methods include the direct calculation of the sum in 2.9 and its approximation with the methods described in chapter 4.2. OnlineSVMBase handles the vector data and implements the external interface for adding data to and removing data from the SVM. In addition, the class provides functions for merging and separating support vectors to implement the techniques described in chapter 4.1.
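As a concrete illustration of the direct method, the following Python sketch evaluates the kernel sum f(x) = Σ_i α_i k(x_i, x) + b for an RBF kernel (names and signatures are illustrative, not the actual C++ interface):

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def decision_function(support_vectors, alphas, b, x, kernel=rbf_kernel):
    """Direct evaluation of f(x) = sum_i alpha_i * k(x_i, x) + b."""
    return sum(a * kernel(sv, x)
               for sv, a in zip(support_vectors, alphas)) + b

def predict_label(support_vectors, alphas, b, x, kernel=rbf_kernel):
    """Binary prediction: the sign of the discriminant function."""
    return 1 if decision_function(support_vectors, alphas, b, x,
                                  kernel) >= 0 else -1
```

The cost of this direct evaluation is linear in the number of support vectors, which motivates the approximation techniques of chapter 4.2.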
6.1.3. laSvmBase

The laSvmBase class derives from onlineSVMBase and implements the building blocks of laSvm, mainly PROCESS and REPROCESS. Unlearning of input vectors is also implemented in this class. The class for support vectors is still a template parameter, but it is assumed to have a member g that can be used like a floating point number and stores the gradients defined in 2.10.
6.1.4. laSvm

The laSvm class builds on laSvmBase and assumes that a kernel row is an array of float (or double) values that can be accessed using the [] operator. Optimization of the kernel parameters is implemented in this class; it requires a number of functions in the interface of the kernel, which is given as a template parameter.
6.1.5. Unit tests

Using the test suite CppUnit1, unit tests for the laSvm class hierarchy and for the cover tree have been written. In the following we shortly describe these tests.

svhandler: The function checkIntegrity is called from the different unit tests for the svhandler class and checks whether the internal structures are correct. It checks whether the garbage lists for S and N are correct and whether the mapping between S and the kernel rows is consistent. In addition, it checks that no kernel row is accidentally locked. The unit tests for svhandler test different functions, calling checkIntegrity between every test. These tests are:

• Adding samples to N using AddSample and verifying the returned indices of N.
• Moving samples from N to S and verifying the returned indices of S.
• Requesting kernel rows with getKernelRow and verifying their values.
• Invalidating the kernel cache and re-requesting the kernel rows based on a different kernel.
• Moving samples from S back to N and verifying the returned indices of N.

The SVM classes: The laSvm class hierarchy is tested as a bundle on a 10-dimensional toy dataset generated within the unit test. The toy dataset consists of two classes: one for vectors within a sphere around the origin, and one for vectors outside the sphere. Similar to the unit tests of svhandler, a central function laSvmConsistency checks central invariants of the SVM. These invariants are:

• ∑ αi = 0 and Ai ≤ αi ≤ Bi within the floating point precision.
• The gi are correct.
• The Ai and Bi are correct for combined samples (merging of support vectors is needed for the techniques described in 4.1).

1 http://sourceforge.net/projects/cppunit/
The following tests are included in the unit tests, calling laSvmConsistency between every step:

• Merging and splitting of two support vectors.
• Training on the toy dataset with trainOnline and finish.
• Optimizing the kernel parameters.
• Learning the SVM with the complete toy dataset, unlearning half of it and relearning it.

All these tests are repeated for linear independence merging thresholds of 0, 0.1 and 0.01.
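A minimal Python sketch of such an invariant check (the real laSvmConsistency is a C++ test helper; the gradient definition g_i = y_i − Σ_j α_j K_ij follows Bordes et al. [2005]):

```python
def lasvm_consistency(alphas, gradients, lower, upper, labels, K, eps=1e-8):
    """Check the central SVM invariants from the text (sketch):
    sum_i alpha_i == 0, A_i <= alpha_i <= B_i, and
    g_i = y_i - sum_j alpha_j * K_ij."""
    assert abs(sum(alphas)) < eps                    # sum_i alpha_i == 0
    for a, lo, hi in zip(alphas, lower, upper):      # A_i <= alpha_i <= B_i
        assert lo - eps <= a <= hi + eps
    for i, g in enumerate(gradients):                # gradients consistent
        expected = labels[i] - sum(a * K[i][j]
                                   for j, a in enumerate(alphas))
        assert abs(g - expected) < eps
    return True
```

Running such a check between every test step localizes the step that first corrupts the optimizer's state.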
6.2. Online Random Forest

The online Random Forest has been implemented based on the implementation from Nair [2010]. The visitor concept of that implementation proved to be particularly useful.
6.2.1. Online Learning

In addition to the standard Random Forest, the online Random Forest needs the following information:

• Which training samples ended up in which leaf.
• For every split, the absolute difference between the split threshold and the nearest training samples. This is only needed if split threshold adjusting is enabled.

Both pieces of information are collected in the visitAfterSplit method of an online learning visitor. In the implementation from Nair [2010], every tree is given a sequential number that can be used as a unique id. The nodes are stored in a flat array, and the position in that array can be used as a unique identifier for a node. The online visitor contains two hash maps: the first maps leaf nodes of the Random Forest to the sets of training samples ending in that leaf node, while the second maps interior nodes to “gap” information storing how far the split of the interior node can be moved without conflicting with the training data (see chapter 5.2.3 for details). The keys of the hash maps are built from the tree and node ids described above.

During the initial learning, the online Random Forest is learned the same way the normal Random Forest is learned, utilizing the online learning visitor described above. A new method onlineLearn has been written which incrementally trains the Random Forest given new training data. The following adjustments to the Random Forest code have been made for this:

• The function decision_tree::learn learns a single decision tree. It works by keeping a stack of unfinished nodes and corresponding training data. It has been
changed to not reset (or delete) the current decision tree and to take an additional parameter initializing the stack with a given node and training samples. The adjusted function is called decision_tree::continueLearn, and a new function decision_tree::learn wraps it, having the same signature and behavior as the old decision_tree::learn.

• A new function onlineLearn has been added, for which pseudo code is given in algorithm 10.

Before we introduce the onlineLearn function, we have to explain some of the functions used in it:

decision_tree::getToLeaf: Takes a sample and returns the leaf node the sample ends in (originally used for prediction).

onlineVisitor::addSampleToLeaf: Adds a sample to the hash map entry of a certain leaf in a certain tree of the online learning visitor.

onlineVisitor::getLeafSamples: Returns all samples stored in the hash map for a certain leaf. It also clears that entry, expecting the leaf to stop existing.

Algorithm 10 online learn Random Forest
 1: procedure onlineLearn(Random Forest rf, features F, labels L)
 2:   for all trees t of rf do
 3:     S ← PoissonSample(F, L)                ▷ online bootstrapping
 4:     N ← ∅                                  ▷ set of leaf nodes that have received new samples
 5:     for all s ∈ S do
 6:       leafId ← t.getToLeaf(s)
 7:       onlineVisitor.addSampleToLeaf(t, leafId, s)
 8:       N ← N ∪ {leafId}
 9:     end for
10:     for all n ∈ N do
11:       t.continueLearn(n, onlineVisitor.getLeafSamples(t, n))
12:     end for
13:   end for
14: end procedure
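The bookkeeping of algorithm 10 can be sketched in a few lines of Python (the real code is C++; PoissonSample uses Poisson(1) draws as in online bagging, implemented here via Knuth's algorithm; names are illustrative):

```python
import math
import random

def poisson_draw(lam=1.0, rng=random):
    """Draw from Poisson(lam) with Knuth's algorithm (fine for lam = 1)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def poisson_sample(samples, lam=1.0, rng=random):
    """Online bootstrapping: each sample is replicated Poisson(lam) times."""
    resampled = []
    for s in samples:
        resampled.extend([s] * poisson_draw(lam, rng))
    return resampled

def online_learn_tree(tree_id, get_to_leaf, leaf_samples, new_samples,
                      continue_learn, rng=random):
    """One tree of algorithm 10: route bootstrapped samples to their
    leaves, record them in the visitor's hash map (keyed by
    (tree_id, leaf_id)), then continue learning from every touched leaf."""
    touched = set()
    for s in poisson_sample(new_samples, rng=rng):
        leaf_id = get_to_leaf(s)
        leaf_samples.setdefault((tree_id, leaf_id), []).append(s)
        touched.add(leaf_id)
    for leaf_id in touched:
        continue_learn(leaf_id, leaf_samples[(tree_id, leaf_id)])
```

continue_learn stands in for decision_tree::continueLearn, and get_to_leaf for decision_tree::getToLeaf.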
6.2.2. Online prediction

We now describe the implementation of the online prediction clustering introduced in section 5.3. It has been implemented using the Random Forest from Nair [2010]; a new predict function has been added. A stack stores all not yet fully processed clusters together with the position in the tree to which they have already been traversed. Pseudo code for the prediction process is given in algorithm 11. The following basic functions are used in the pseudo code:
rootNode: Returns the root node of a tree.

minBox, maxBox: Given a cluster and a feature dimension, these functions return the minimum and maximum of the bounding box, respectively.

leftChild, rightChild: Return the left or right child of an interior node.

isLeafNode: Returns true if the given node is a leaf node.

splitDimension: Returns the dimension in which a specific node splits.

Algorithm 11 Prediction with online prediction clustering
 1: procedure predict(Random Forest rf, clustered data C)
 2:   for all trees t in rf do
 3:     S ← ∅                                  ▷ empty stack
 4:     for all clusters c belonging to tree t in C do
 5:       S ← S ∪ {(c, rootNode(t))}
 6:     end for
 7:     while S ≠ ∅ do
 8:       (c, n) ← pop(S)
 9:       if isLeafNode(n) then
10:         Update predictions of samples in c according to n.
11:         continue
12:       end if
13:       d ← splitDimension(n)
14:       if minBox(d, c) ≥ threshold(n) then
15:         S ← S ∪ {(c, rightChild(n))}
16:         continue
17:       end if
18:       if maxBox(d, c) ≤ threshold(n) then
19:         S ← S ∪ {(c, leftChild(n))}
20:         continue
21:       end if                               ▷ at this point we know that the cluster has to be split
22:       l ← {x ∈ c | x_d ≤ threshold(n)}
23:       r ← {x ∈ c | x_d > threshold(n)}
24:       S ← S ∪ {(l, leftChild(n))}
25:       S ← S ∪ {(r, rightChild(n))}
26:     end while
27:   end for
28: end procedure
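A minimal runnable Python model of algorithm 11 for a single tree (the actual implementation is C++; nodes are represented here as plain tuples, samples carry an id so the predictions can be collected, and the bounding-box tests are phrased so that the boundary case matches the split rule x_d ≤ threshold → left):

```python
# A node is ('leaf', label) or ('split', dim, threshold, left, right);
# a cluster is a list of (sample_id, feature_tuple) pairs.

def predict_clusters(tree, clusters):
    """Route whole clusters down one tree, splitting a cluster only when
    its bounding box straddles a node's split threshold (algorithm 11)."""
    predictions = {}                      # sample id -> leaf label
    stack = [(c, tree) for c in clusters]
    while stack:
        cluster, node = stack.pop()
        if node[0] == 'leaf':
            for sid, _ in cluster:
                predictions[sid] = node[1]
            continue
        _, dim, thr, left, right = node
        lo = min(x[dim] for _, x in cluster)
        hi = max(x[dim] for _, x in cluster)
        if lo > thr:                      # whole bounding box right of split
            stack.append((cluster, right))
        elif hi <= thr:                   # whole bounding box left of split
            stack.append((cluster, left))
        else:                             # box straddles: split the cluster
            stack.append(([p for p in cluster if p[1][dim] <= thr], left))
            stack.append(([p for p in cluster if p[1][dim] > thr], right))
    return predictions
```

The saving comes from the first two branches: as long as a cluster's bounding box lies entirely on one side of a split, the whole cluster is routed in one step instead of one traversal per sample.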
7. VIGRA software infrastructure

7.1. Exporting python functions

Ilastik is written in the scripting language python but benefits from using the C++ image processing library vigra [Köthe, 2000]. Therefore the vigra functions needed to be exposed to python. The underlying concept and structures have been designed and implemented by Ullrich Köthe. He developed an array class hierarchy interfacing numpy [Jones et al., 2001–] with the MultiArray classes of vigra, representing images and volumes with different numbers of channels and pixel types. It is based on boost python1.

Wrapper functions around the vigra functions, whose signatures can be exported by boost python, have been written. This has been done mainly by Christoph Sommer, Stephan Kassemeyer, Michael Hanselmann and myself. While many of these wrapper functions do nothing but pass the data to the vigra functions, some extend the functionality. For example, many 2D filter functions can be applied to multispectral images, resulting in the filter being applied to every channel.

Unit tests are important for ensuring software quality and stability and have been implemented using the nosetests framework2. Because the functionality of the original vigra functions is already covered by the vigra unit tests, the vigranumpy unit tests concentrate on ensuring the correct call signatures. Most of the vigranumpy tests only test the call signature by invoking the functions with the correct parameter types and checking the type of the return value.
7.2. vigranumpy documentation

Being the python interface of vigra, the vigranumpy documentation ought to be very similar to the vigra documentation. Nevertheless, there are differences between vigra and vigranumpy, such as the division of functions into modules and the renaming of functions to differentiate between the 2D and 3D cases, that justify a separate vigranumpy documentation.

Python provides a mechanism for documenting functions called “docstrings”. With docstrings, the documentation can be accessed from the interactive python shell ipython [Pérez and Granger, 2007] and as in-place help in text editors supporting this mechanism.

1 http://www.boost.org/doc/libs/1_42_0/libs/python/doc/index.html
2 http://somethingaboutorange.com/mrl/projects/nose/
Documentation utilities such as “pydoc”3 and “sphinx”4 support the inclusion of python docstrings into html documentation. For vigranumpy, “sphinx” has been chosen as the documentation tool. The arraytypes module has been documented by Ullrich Köthe, while for the vigranumpy modules the docstrings have been written and imported using the “automodule” directive of sphinx.

For exported vigra functions, the documentation of the corresponding vigranumpy function should contain a reference to the original vigra function in the vigra html documentation. It would be possible to include the html link directly in the docstrings of the vigranumpy functions. This is undesirable for the following reasons:

• Typing complete html links is very error prone due to typos.
• When the layout of the vigra and/or vigranumpy documentation installation changes, all relative paths would have to be adjusted.
• A complete html link looks unaesthetic and cluttered in the help output of ipython or text editors.

For these reasons, a central place for the relative paths of the documentations is needed, and it is also desirable that the links appear as not much more than the vigra function name in the docstring. For the vigra doxygen generation, a python script collecting all vigra functions and the links to their html documentation already existed. Its purpose is to generate an html page listing all vigra functions for the vigra documentation. This script has been extended to additionally write a list of vigra function names and documentation locations into a file. The sphinx configuration script for the vigranumpy documentation has been extended to parse this file and generate replacement rules for the final documentation. These replacement rules replace vigra function names with an appended “_” (an underscore) by the corresponding link.
If one wants to reference vigraFunction from within the vigranumpy documentation, one only needs to write “vigraFunction_” in the docstring, and the sphinx configuration takes care of replacing it with a visually appealing link to vigraFunction in the vigra HTML documentation. If the relative path between the vigra and vigranumpy documentation ever changes, only one line in the sphinx configuration has to be adjusted. When the help function of ipython is used on the vigranumpy functions, the link appears as the entered vigra function name with an appended underscore, making it swift to read.

3 http://docs.python.org/library/pydoc.html
4 http://sphinx.pocoo.org/
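The replacement-rule generation can be sketched as follows; this is a Python illustration assuming a simple "name location" line format for the generated file, and relying on the standard reST mechanism that a trailing-underscore reference (vigraFunction_) resolves against a hyperlink target:

```python
def make_rst_epilog(link_file_lines, vigra_doc_root):
    """Turn 'name location' pairs (assumed file format) into reST
    hyperlink targets, so that writing vigraFunction_ in a docstring
    resolves to the vigra HTML documentation."""
    targets = []
    for line in link_file_lines:
        line = line.strip()
        if not line:
            continue
        name, location = line.split(None, 1)
        targets.append(".. _%s: %s/%s"
                       % (name, vigra_doc_root.rstrip('/'), location))
    return "\n".join(targets)
```

Appending the result to sphinx's rst_epilog makes the targets available in every documentation page, so only the vigra_doc_root value has to change when the relative path between the two documentations changes.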
7.3. Organizing the modules

Building the documentation should be possible even before installing vigranumpy on the system. For this, the automodule directive of sphinx needs to access the vigranumpy package as if it were installed, so the modules have to be organized in the build directory the same way as when they are installed. vigranumpy consists of many modules which lie in a subdirectory called “vigra”. In this directory there are also the arraytypes module and an init file called “__init__.py” organizing the loading of vigranumpy. vigra and vigranumpy are built using cmake5. The cmake configuration has been adjusted to copy all relevant files into a directory within the build path and to pass a corresponding parameter to the sphinx configuration script.
5 http://www.cmake.org/
8. Online learning in Ilastik

The Ilastik project aims to create a software tool for interactive exploration and segmentation of image data covering a wide range of use cases. Ilastik is usable without higher-level semantic image understanding: using a convenient user interface, the user can apply a wide variety of image processing algorithms without any background knowledge of their internal workings. Similar to a paint program, the user labels different regions with classes represented by different colors, selects computed features with a mouse click, and a classifier can be run on the created data. A segmentation algorithm can then be applied to segment the image. The framework has been successfully applied to biomedical and industrial applications.

During the labeling, visual feedback telling the user about the classification results based on the current labels should be displayed. With this information, the user can see the errors of the current classifier and correct them. This information is only useful if it is displayed less than a second after the user has given new labels, making the application interactive. For this purpose, the online learners described in the last part have been integrated into Ilastik.
8.1. Label Queue

To collect new labels for the online learner, the Ilastik core object has been extended by named label queues. All labels the user sets are accumulated in all active label queues. Labels can also be removed from a queue when they have been processed. These queues are used for the online learners, and the undo function of Ilastik has been modified to build on this mechanism.
8.1.1. Online learner base class

An online learner integrated in Ilastik should have the following properties:

• When new labels arrive, it should learn them and arrive at a new model as soon as possible.
• While the user is labeling, there is normally enough time to improve the current model or to prepare for the arrival of new labels. This time should be used.
• It should be able to detect and unlearn labels that have been overwritten by the user (e.g. by pressing undo or drawing over existing labels).
To interface with the online learner, an abstract base class has been written in python. Its interface includes:

fastLearn: This function gets a set of new labels and features, including unique ids. It learns the new data, arriving at a new model as fast as possible. The unique ids are stored and used for unlearning.

removeData: Given a set of unique ids, data is unlearned from the classifier. The Random Forest currently implements this by relearning on the remaining data.

improveSolution: When there is time remaining, this function is called and the classifier is expected to use the time to improve the current model. The SVM can do model selection here, while the Random Forest can rebuild individual trees.

addPredictionSet: Any data the classifier should predict on in the future is added here. This way the classifier can prepare the prediction sets, ensuring fast prediction.

predict: Given the id of a prediction set, prediction is done using the current model.

The online Random Forest implementation needs to be passed the previously learned data in every update. For this reason, another base class called CumulativeOnlineClassifier, storing all data seen so far, has been written. It also takes care of removing data from the stored set when removeData is called.
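A Python sketch of such a base class (method names follow the text; the signatures and the stand-in predict logic are assumptions for illustration, not the actual Ilastik code):

```python
import abc

class OnlineLearnerBase(abc.ABC):
    """Sketch of the abstract online-learner interface described above."""

    @abc.abstractmethod
    def fastLearn(self, features, labels, ids):
        """Incorporate new labeled data as fast as possible."""

    @abc.abstractmethod
    def removeData(self, ids):
        """Unlearn previously learned samples by their unique ids."""

    @abc.abstractmethod
    def improveSolution(self):
        """Use idle time to improve the current model."""

    @abc.abstractmethod
    def addPredictionSet(self, features, set_id):
        """Register data that will be predicted on later."""

    @abc.abstractmethod
    def predict(self, set_id):
        """Predict labels for a registered prediction set."""

class CumulativeOnlineClassifier(OnlineLearnerBase):
    """Stores all data seen so far, as the online Random Forest needs
    the previously learned data in every update."""

    def __init__(self):
        self.data = {}                   # unique id -> (feature, label)
        self.prediction_sets = {}

    def fastLearn(self, features, labels, ids):
        for f, l, i in zip(features, labels, ids):
            self.data[i] = (f, l)

    def removeData(self, ids):
        for i in ids:
            self.data.pop(i, None)       # drop from the stored set

    def improveSolution(self):
        pass                             # a real learner rebuilds trees here

    def addPredictionSet(self, features, set_id):
        self.prediction_sets[set_id] = features

    def predict(self, set_id):
        # trivial majority-vote stand-in for a real classifier
        labels = [l for _, l in self.data.values()]
        majority = max(set(labels), key=labels.count) if labels else None
        return [majority] * len(self.prediction_sets[set_id])
```

A real subclass would retrain the online Random Forest (or update the online SVM) inside fastLearn and removeData instead of only updating the stored set.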
8.1.2. Controlling the online learner

When an online learner is started, a thread is run interfacing with the online learner base class. The thread is controlled through a command queue. The commands are:

learn: This command is always bundled with a set of labels and features. The online classifier incorporates these into the current model using fastLearn.

unlearn: Already learned data is unlearned using removeData, because the user pressed “undo” or because existing labels have been painted over.

stop: Stop the online classifier and finish the thread.

If the command queue is empty for more than half a second, the thread calls improveSolution on the online learner.
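The control loop can be sketched as follows (a Python illustration, not the actual Ilastik code; in Ilastik this loop runs in its own thread):

```python
import queue

def run_online_learner(learner, commands, idle_timeout=0.5):
    """Process learn/unlearn/stop commands from the command queue and
    call improveSolution whenever the queue stays empty for more than
    idle_timeout seconds."""
    while True:
        try:
            cmd, payload = commands.get(timeout=idle_timeout)
        except queue.Empty:
            learner.improveSolution()    # idle: improve the current model
            continue
        if cmd == 'learn':
            features, labels, ids = payload
            learner.fastLearn(features, labels, ids)
        elif cmd == 'unlearn':
            learner.removeData(payload)
        elif cmd == 'stop':
            return
```

The GUI thread only pushes commands into the queue, so labeling stays responsive regardless of how long an individual learning step takes.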
8.2. Results and Discussion

For the integration of the online SVM into Ilastik, the resampling method has been set to “close to new”. When the online SVM is started, it optimizes the hyperparameters on the initial training set using gradient descent with gradient-independent step size in 10 iterations. The threshold for merging approximately linearly dependent support vectors has been set to t = 0.01. Prediction is done by evaluating the complete kernel sum from 2.9.

For the online Random Forest, trees are relearned with sequential relearning when the classifier is idle. When the online classifier does not become idle, no trees are relearned at all. Threshold adjustment has been turned off. Prediction is done using online prediction clustering.

Figure 8.1 shows Ilastik with the online SVM running. The interactivity of the online learners has been tested on several test images. The online Random Forest generally performs well and responds within one second on all test images. In the case of the online SVM, online training was always sufficiently fast. When the online SVM was started, it needed some time to create the initial model and optimize the hyperparameters; afterwards, the online process is fast enough for interactivity. Unfortunately, predicting the whole image took too long if the number of support vectors was too high. The test image shown in figure 8.1 has a size of 512 × 324 pixels. The response was within one second when the number of support vectors was below 70, but often, with a precise labeling of the image, the number of support vectors grew much bigger than 70, making the response time of the SVM unbearably large.
Figure 8.1.: The online SVM running in Ilastik
Conclusion

Different methods for online and active learning have been investigated.

In part I an active learning method for image segmentation, called Active Segmentation, has been proposed. It is based on asking the user whether pairs of pixels belong to the same segment or not (gathering so-called marked pixel pairs). Using this information, the maximin path [Turaga et al., 2009] is examined and temporary labels are set. The maximin path can also reveal inconsistencies between the current labeling and the information from the marked pixel pairs. If so, the user is asked to correct these inconsistencies and thereby provide labels that are very valuable. Because the labels are extremely unbalanced in the classification task of identifying border pixels, the algorithm has been compared to “balanced random sampling”, which simulates the behavior of a human annotator. Balanced random sampling has access to the labels of the unlabeled samples beforehand, and therefore a big advantage over strategies without access to this information. Active Segmentation has been found to compete with balanced random sampling after a certain amount of pixels has been labeled; the standard active learning method, uncertainty sampling, does not reach that goal. But to be useful in practice, Active Segmentation would have to outperform balanced random sampling. Variants of Active Segmentation could still be useful in practice, as discussed in section 1.6. Finding consistent ground truth was problematic, as also discussed in section 1.6; for this reason, the relevance of the results is questionable.

To decrease the response time of active learning strategies, online learners have been developed and investigated in part II with the goal of integrating them into Ilastik. An online Support Vector Machine based on laSvm [Bordes et al., 2005] has been implemented.
In section 2.7 the strategy for online learning with laSvm is discussed and an improvement of its accuracy by resampling is introduced. While the resampling strategies are irrelevant for online learning tasks where the data arrives in random order, it has been shown that in the case of a biased order of the training samples, resampling is crucial. If the order of the training data is determined by the labeling of a human annotator, a high risk of biased sampling exists.

To be able to use the online Support Vector Machine as a plug-in method in Ilastik, automatic model selection has been investigated in chapter 3, focusing on gaussian radial basis functions as the kernel of the SVM. Different error bounds have been analysed, and the presented smooth Xi-Alpha bound, based on the Xi-Alpha
bound from Joachims [2000], has been selected. The gradient of the smooth Xi-Alpha bound was efficiently approximated utilizing results from Adankon and Cheriet [2007], and gradient descent methods have been applied to it. The results in section 3.2.3 show that reasonable kernel parameters are found in all cases. As discussed in section 3.2.4, even better results might be possible by applying techniques to escape local minima of the error bound. Unfortunately, the smooth Xi-Alpha bound did not show good properties for optimizing the slack penalty C. Optimizing this parameter might bring additional improvements. In future work the parameter C could be optimized on a different error bound, separating the optimization of the kernel parameters from that of C. Even better generalization performance of the SVM might be possible this way.

Online Random Forest variants based on Saffari et al. [2009] have been investigated in chapter 5. The extension threshold adjustment has been proposed, but found to bring no improvement. The implemented online Random Forest, which varies only slightly from Saffari et al. [2009], has been found to be fast and sufficiently accurate.

Prediction on complete images has been found to be too slow for interactive usage with the online Support Vector Machine as well as with online Random Forests. To address this problem for the online SVM, cover trees have been investigated and a cover tree prediction for support vector machines has been introduced in section 4.2.3. Unfortunately, the overhead introduced by this method even increased the prediction time on many datasets. In section 4.2.4 the evaluation of exp has been identified as one of the bottlenecks of SVM prediction and replaced by a lookup table, which brought a noticeable speedup. The prediction time of an SVM is linear in the number of support vectors.
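The lookup-table replacement for exp can be sketched as follows; the cutoff and table resolution below are illustrative choices, not the values used in the implementation of section 4.2.4:

```python
import math

class NegExpTable:
    """Tabulated exp(-x) for x in [0, cutoff); arguments beyond the
    cutoff return 0.0, which is harmless for RBF kernel values
    k(x, y) = exp(-gamma * ||x - y||^2) since they are vanishingly
    small there. Assumes x >= 0, which holds for squared distances."""

    def __init__(self, cutoff=30.0, steps=100000):
        self.cutoff = cutoff
        self.scale = steps / cutoff
        self.table = [math.exp(-i / self.scale) for i in range(steps + 1)]

    def __call__(self, x):
        if x >= self.cutoff:
            return 0.0
        # Nearest-floor lookup; quantization error is bounded by the
        # step width of the table.
        return self.table[int(x * self.scale)]

exp_neg = NegExpTable()
approx = exp_neg(1.0)     # close to math.exp(-1.0)
```

The table trades a small quantization error for avoiding a transcendental-function call in the inner loop of the kernel sum, which is where the speedup comes from.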
For that reason, the removal of support vectors by merging nearby support vectors has been analyzed in section 4.1 and adapted for the online Support Vector Machine. While the results in section 4.1.3 show that a significant reduction of support vectors is possible without harming the classifier’s accuracy on most datasets, the reduction depends on the merge threshold t, and it remained unclear how to set t without harming the classifier’s accuracy. A safe value for t does exist, but does not provide a sufficient reduction of support vectors on the datasets.

For the online Random Forest, online prediction clustering has been proposed in section 5.3. It exploits the clustering induced by the last online Random Forest prediction on the test dataset. The method brought speedups of up to a factor of 11 on some datasets; the minimal speedup measured was 1.93.

The implementation of the online Support Vector Machine has been described in chapter 6, together with the adjustments made to the Random Forest from Nair [2010] to turn it into an online Random Forest. The integration into Ilastik is described in chapter 8. Both online learners have been tested on several images. The online Random Forest responded sufficiently fast in all cases, but for the online Support Vector Machine
the prediction time was too high when the number of support vectors grew too big.

We have discussed that a good active learning strategy in image analysis is to show the labeler an intermediate result. Since the labeler knows the problem at hand, he also knows which crucial regions of the image have been misclassified and must be corrected by additional labels. But the labeler does not know which labels would affect the prediction of other pixels and which labels are pointless because the classifier would not adjust its decision. Other active learning strategies might therefore perform better in certain situations. Nevertheless, most active learning strategies require a trained classifier, and predicting which pixels are valuable is often very similar to predicting the label of a pixel. Active learning therefore requires a classifier that trains and predicts fast and should be based on online learners such as those developed in this thesis.
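The merging of nearby support vectors analyzed in section 4.1 can be sketched greedily as below; the concrete merge rule (coefficient-weighted mean of positions, summed coefficients) is an illustrative simplification, not necessarily the exact scheme used in the thesis:

```python
import math

def merge_close_support_vectors(svs, alphas, t):
    """Sketch: greedily merge pairs of same-class support vectors
    whose Euclidean distance is below the threshold t. The merged
    vector is the |alpha|-weighted mean and the coefficients are
    summed, which keeps the decision function roughly unchanged
    when the merged vectors are close together."""
    svs = [list(v) for v in svs]
    alphas = list(alphas)
    i = 0
    while i < len(svs):
        j = i + 1
        while j < len(svs):
            same_class = alphas[i] * alphas[j] > 0
            if same_class and math.dist(svs[i], svs[j]) < t:
                wi, wj = abs(alphas[i]), abs(alphas[j])
                svs[i] = [(wi * a + wj * b) / (wi + wj)
                          for a, b in zip(svs[i], svs[j])]
                alphas[i] += alphas[j]
                del svs[j], alphas[j]   # re-scan against the merged vector
                j = i + 1
            else:
                j += 1
        i += 1
    return svs, alphas

# Two nearby positive-class vectors collapse into one; the distant
# negative-class vector is untouched.
svs, alphas = merge_close_support_vectors(
    [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]], [1.0, 1.0, -2.0], t=0.5)
```

Prediction time then scales with the reduced support vector count, which is exactly why the choice of t matters: too large a threshold distorts the decision function, too small a one removes almost nothing.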
List of Tables

1. Datasets used in this thesis
1.1. Features used for the classifier
1.2. The contrast of the SBFSEM volume image is enhanced, based on 28 rotation-invariant non-linear features. The gradient magnitude as well as the eigenvalues of the Structure Tensor and the Hessian matrix are computed from the volume image as well as from the output of a 4-times iterated bilateral filter, for which the functions wσs and wσv with parameters σs and σv are used for weighting in the spatial and in the intensity domain. Derivatives are computed by DoG filters at scales σm, σi, and σh. Entries of the Structure Tensor are averaged with a Gaussian at scale σo.
1.3. Seed thresholds on the border probability map and the resulting number of connected components.
2.1. Error rate in percent after learning the complete dataset with different resampling strategies. The colors are explained in the text.
2.2. Time in milliseconds needed for learning the last 20 samples including resampling on the complete datasets. The colors are explained in the text.
3.1. Test error at the minima of the error bounds
3.2. Test error at points where a gradient descent could converge
3.3. Classification error in % after optimization of the hyperparameters, after 30 iterations.
4.1. Relative time gain by using a lookup table for the exponential function
5.1. Error rates in % and training times in milliseconds for different online Random Forests after learning the complete datasets. The colors mark the better results between online RF with and without threshold adjusting. Details are explained in the text.
5.2. Prediction times in milliseconds and speedups by online prediction clustering compared to prediction of the batch Random Forest.
List of Figures

1.1. Toy example illustrating a maximin path on a probability map. The brightness of the pixels encodes the height hk; in our example, this would be the border probability. The red pixels are seeds for the watershed segmentation and the green line follows the maximin path.
1.2. Marked pixels used as seeds for a watershed segmentation are denoted by circles. The black lines illustrate the watercourses between the marked pixels and are labeled with the height of the watercourse. The maximin path between the red and blue seeds is illustrated by the yellow arrow.
1.3. The algorithm displaying its actions: The green dots show pixels for which the label is “interior” while the red pixels have a “border” label. The yellow line denotes the found maximin path and the red cross marks temporary border labels. The pink arrows visualize that all pixel pairs suggest the same temporary label.
1.4. The algorithm displaying its actions: The green dots show pixels for which the label is “interior” while the red pixels have a “border” label. The yellow lines denote the maximin path on which the algorithm currently requests a label from the user.
1.5. Rand index for the experiment on an image from the Berkeley Segmentation Dataset shown in figure 1.6
1.6. Experiment on an image from the Berkeley Segmentation Dataset. The image shows the segmentation results of Active Segmentation and balanced random sampling and the ground truth. The borders between segments are marked by pink lines.
1.7. A slice of 50 × 50 × 1.
1.8. Results on the voxel dataset.
2.1. The “xor” toy dataset
2.2. Biased sampling on the “xor” toy dataset
2.3. The “stripes” toy dataset
2.4. Error rate on the xor dataset
2.5. Above: Intermediate solution on the xor dataset doing random resampling. Below: Intermediate solution on the xor dataset doing all resampling. The true decision boundaries are marked by white lines.
2.6. Intermediate results on the xor dataset with all resampling, setting C = 10 for the samples with x0 < 0
2.7. Error rate on the stripes dataset
2.8. Intermediate results on the stripes dataset with all resampling
2.9. Precision during online learning with different resampling strategies on the svm_small_sets datasets
2.10. Precision during online learning with different resampling strategies on the svm_small_sets datasets
2.11. Precision during online learning with different resampling strategies on the svm_small_sets datasets
3.1. Sigmoid function for different choices of a and b.
3.2. The different error bounds compared to the test error in dependence of the kernel parameter γ
3.3. The different error bounds compared to the test error in dependence of the kernel parameter γ
3.4. The different error bounds compared to the test error in dependence of the kernel parameter γ
3.5. Smooth Xi-Alpha bound and its approximate derivative. The minimum of the Xi-Alpha bound and the locations where the derivative changes from negative to positive have been marked.
3.6. Smooth Xi-Alpha bound and its approximate derivative. The minimum of the Xi-Alpha bound and the locations where the derivative changes from negative to positive have been marked.
3.7. Smooth Xi-Alpha bound and its approximate derivative. The minimum of the Xi-Alpha bound and the locations where the derivative changes from negative to positive have been marked.
3.8. Probabilistic output for the SVM learned on the a1a dataset for different kernel parameters γ
3.9. Smooth Xi-Alpha bound in dependence of C on the splice dataset.
4.1. Number of support vectors and testing error in dependence of the threshold t
4.2. Number of support vectors and testing error in dependence of the threshold t
4.3. Error rates (in %) for the different strategies on the svm_sets
4.4. Time for the different strategies on the svm_sets. For the graphic, the times have been normalized to the time needed for the complete sum.
4.5. Number of distance calculations done in the different kernel sum strategies on the svm_sets. For the graphic, they have been normalized to the distance calculations needed by the complete sum.
4.6. Relative number of exp evaluations done in the different kernel sum strategies on the svm_sets
4.7. Time for the different strategies on the svm_sets using a lookup table for the exponential function. For the graphic, the times have been normalized to the time needed for the complete sum.
5.1. Threshold adjustment illustrated. The colored circles denote training samples with their label indicated by their color. The new sample is marked by a black boundary. The threshold is adjusted such that the Gini index is minimal.
5.2. A cluster traversing down a tree and being split. For details refer to the text.
5.3. Error rates and training times on a1a and voxel.
5.4. Error rates and training times on splice and german.numer.
5.5. Prediction times on german.numer, splice, a1a and voxels.
8.1. The online SVM running in Ilastik
Bibliography

Mathias M. Adankon and Mohamed Cheriet. Optimizing resources in model selection for support vector machine. Pattern Recogn., 40(3):953–963, 2007. ISSN 0031-3203. doi: http://dx.doi.org/10.1016/j.patcog.2006.06.012.

M. A. Aizerman. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821–837, 1964. URL http://ci.nii.ac.jp/naid/10017600992/en/.

Björn Andres. Personal communication, 2010.

Björn Andres, Ullrich Köthe, Moritz Helmstaedter, Winfried Denk, and Fred A. Hamprecht. Segmentation of SBFSEM volume data of neural tissue by hierarchical classification. In Gerhard Rigoll, editor, Pattern Recognition, volume 5096 of LNCS, pages 142–152. Springer, 2008. ISBN 978-3-540-69320-8. doi: 10.1007/978-3-540-69321-5_15.

Dana Angluin. Queries and concept learning. Mach. Learn., 2(4):319–342, 1988. ISSN 0885-6125. doi: http://dx.doi.org/10.1023/A:1022821128753.

A. Asuncion and D.J. Newman. UCI machine learning repository, 2007. URL http://www.ics.uci.edu/~mlearn/MLRepository.html.
Les Atlas, David Cohn, Richard Ladner, M. A. El-Sharkawi, and R. J. Marks, II. Training connectionist networks with queries and selective sampling. pages 566–573, 1990.

N.E. Ayat, M. Cheriet, and C.Y. Suen. Empirical error based optimization of svm kernels: Application to digit image recognition. In The 8th IWFHR, Niagara-on-the-Lake, pages 105–110. Springer Verlag, 2002.

Danny Barash. A fundamental relationship between bilateral filtering, adaptive smoothing, and the nonlinear diffusion equation. Transactions on Pattern Analysis and Machine Intelligence, 24(6):844–847, 2002.

E. B. Baum and K. Lang. Query learning can work poorly when a human oracle is used. In International Joint Conference in Neural Networks, pages 335–340. IEEE Press, 1992.

Alina Beygelzimer, Sham Kakade, and John Langford. Cover trees for nearest neighbor. In ICML ’06: Proceedings of the 23rd international conference on Machine learning, pages 97–104, New York, NY, USA, 2006. ACM. ISBN 1-59593-383-2. doi: http://doi.acm.org/10.1145/1143844.1143857.
Antoine Bordes, Seyda Ertekin, Jason Weston, and Léon Bottou. Fast kernel classifiers with online and active learning. Journal of Machine Learning Research, 6:1579–1619, September 2005. URL http://leon.bottou.org/papers/bordes-ertekin-weston-bottou-2005.

Bernhard E. Boser, Isabelle Guyon, and Vladimir Vapnik. A training algorithm for optimal margin classifiers. In COLT, pages 144–152, 1992.

L. Breiman. Random forests. Machine Learning, 45(1):5–32, October 2001.

Christopher J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2:121–167, 1998.

Gert Cauwenberghs and Tomaso Poggio. Incremental and decremental support vector machine learning, 2000.

Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Olivier Chapelle, Vladimir Vapnik, Olivier Bousquet, and Sayan Mukherjee. Choosing multiple parameters for support vector machines. Mach. Learn., 46(1-3):131–159, 2002. ISSN 0885-6125. doi: http://dx.doi.org/10.1023/A:1012450327387.

Kai-Min Chung, Wei-Chun Kao, Chia-Liang Sun, Li-Lun Wang, and Chih-Jen Lin. Radius margin bounds for support vector machines with the rbf kernel. Neural Comput., 15(11):2643–2681, 2003. ISSN 0899-7667. doi: http://dx.doi.org/10.1162/089976603322385108.

Corinna Cortes and Vladimir Vapnik. Support-vector networks. In Machine Learning, pages 273–297, 1995.

Koby Crammer, Jaz Kandola, and Yoram Singer. Online classification on a budget. In Advances in Neural Information Processing Systems 16. MIT Press, 2003.

Winfried Denk and Heinz Horstmann. Serial block-face scanning electron microscopy to reconstruct three-dimensional tissue nanostructure. PLoS Biology, 2:1900–1909, November 2004.

Edsger W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1:269–271, 1959. URL http://jmvidal.cse.sc.edu/library/dijkstra59a.pdf.
Tom Downs, Kevin E. Gates, and Annette Masters. Exact simplification of support vector solutions. J. Mach. Learn. Res., 2:293–297, 2002. ISSN 1532-4435.
Thilo-Thomas Frieß, Nello Cristianini, and Colin Campbell. The kernel-adatron algorithm: a fast and simple learning procedure for support vector machines. In Machine Learning: Proceedings of the Fifteenth International Conference. Morgan Kaufmann Publishers, 1998.

Claudio Gentile. A new approximate maximal margin classification algorithm. J. Mach. Learn. Res., 2:213–242, 2002. ISSN 1532-4435.

Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Mach. Learn., 63(1):3–42, 2006. ISSN 0885-6125. doi: http://dx.doi.org/10.1007/s10994-006-6226-1.

C. W. Hsu, C. C. Chang, and C. J. Lin. A practical guide to support vector classification. Technical report, Taipei, 2003. URL http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.

Viren Jain, Joseph F. Murray, Fabian Roth, Srinivas Turaga, Valentin Zhigulin, Kevin L. Briggman, Moritz N. Helmstaedter, Winfried Denk, and H. Sebastian Seung. Supervised learning of image restoration with convolutional networks. In ICCV 2007, 2007.

Paul A. Jensen and Jonathan F. Bard. Quadratic programming. In Operations Research Models and Methods. Wiley, 2008.

T. Joachims. Estimating the generalization performance of a SVM efficiently. In International Conference on Machine Learning, pages 431–438, San Francisco, 2000. Morgan Kaufman.

Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientific tools for Python, 2001–. URL http://www.scipy.org/.

S.S. Keerthi and E. G. Gilbert. Convergence of a generalized smo algorithm for svm classifier design. In Machine Learning, pages 351–360, 2000.

Ullrich Köthe. Generische Programmierung für die Bildverarbeitung. Dissertation, Universität Hamburg, February 2000.

David D. Lewis and William A. Gale. A sequential algorithm for training text classifiers. In SIGIR ’94: Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval, pages 3–12, New York, NY, USA, 1994. Springer-Verlag New York, Inc. ISBN 0-387-19889-X.

Yi Li and Philip M. Long. The relaxed online maximum margin algorithm. In Machine Learning, 2000.

Hsuan-Tien Lin, Chih-Jen Lin, and Ruby C. Weng. A note on platt’s probabilistic outputs for support vector machines. Mach. Learn., 68(3):267–276, 2007. ISSN 0885-6125. doi: http://dx.doi.org/10.1007/s10994-007-5018-6.
D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. 8th Int’l Conf. Computer Vision, volume 2, pages 416–423, July 2001.

N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. J. Chem. Phys., 21:1087–1092, June 1953. doi: 10.1063/1.1699114. URL http://dx.doi.org/10.1063/1.1699114.

J. R. Movellan. Tutorial on Gabor Filters. Tutorial paper, 2008. URL http://mplab.ucsd.edu/tutorials/pdfs/gabor.pdf.
Rahul Nair. Construction and analysis of random tree ensembles. Master’s thesis, University of Heidelberg, Germany, February 2010.

A.B.J. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, volume 12, pages 615–622, 1962.

Francesco Orabona, Claudio Castellini, Barbara Caputo, Luo Jie, and Giulio Sandini. On-line independent support vector machines. Pattern Recognition, 43(4):1402–1412, 2010. ISSN 0031-3203. doi: 10.1016/j.patcog.2009.09.021. URL http://www.sciencedirect.com/science/article/B6V14-4XBX714-2/2/dcd507c4f94aa5b3ca5b2de10443bde1.

Nikunj C. Oza and Stuart Russell. Online bagging and boosting. In Artificial Intelligence and Statistics 2001, pages 105–112. Morgan Kaufmann, 2001.

Xinjun Peng and Yifei Wang. A geometric method for model selection in support vector machine. Expert Systems with Applications, 36(3, Part 1):5745–5749, 2009. ISSN 0957-4174. doi: 10.1016/j.eswa.2008.06.096. URL http://www.sciencedirect.com/science/article/B6V03-4SVC5K0-S/2/f37f8d3e990de4fc5895757b4ebf742f.

Fernando Pérez and Brian E. Granger. IPython: a System for Interactive Scientific Computing. Comput. Sci. Eng., 9(3):21–29, May 2007. URL http://ipython.scipy.org.

John C. Platt. Sequential minimal optimization: A fast algorithm for training support vector machines, 1998.

John C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61–74. MIT Press, 1999.

Danil Prokhorov. Personal communication, 2010.
Parikshit Ram, Dongryeol Lee, William March, and Alexander Gray. Linear-time algorithms for pairwise statistical problems. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1527–1535. 2009.

J.B.T.M. Roerdink and A. Meijster. The watershed transform: definitions, algorithms, and parallellization strategies. Fundamenta Informaticae, 41:187–228, 2000.

Amir Saffari, Christian Leistner, Jakob Santner, Martin Godec, and Horst Bischof. On-line random forests. In ICCV 2009, 2009.

Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond (Adaptive Computation and Machine Learning). The MIT Press, 1st edition, December 2001. ISBN 0262194759. URL http://www.worldcat.org/isbn/0262194759.

Burr Settles. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2009.

H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In COLT ’92: Proceedings of the fifth annual workshop on Computational learning theory, pages 287–294, New York, NY, USA, 1992. ACM. ISBN 0-89791-497-X. doi: http://doi.acm.org/10.1145/130385.130417.

Christoph Sommer, Nathan Hüsken, Ullrich Köthe, and Fred A. Hamprecht. ILASTIK: Interactive learning and segmentation tool kit, 2010. Software available at http://hci.iwr.uni-heidelberg.de/download3/ilastik.php.

Sören Sonnenburg, Gunnar Rätsch, and Christin Schäfer. A General and Efficient Multiple Kernel Learning Algorithm. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1273–1280, Cambridge, MA, 2006a. MIT Press.

Sören Sonnenburg, Gunnar Rätsch, Christin Schäfer, and Bernhard Schölkopf. Large scale multiple kernel learning. J. Mach. Learn. Res., 7:1531–1565, 2006b. ISSN 1532-4435.

Simon Tong and Daphne Koller. Support vector machine active learning with applications to text classification. In Journal of Machine Learning Research, pages 999–1006, 2000.

Srinivas C. Turaga, Kevin L. Briggman, Moritz Helmstaedter, Winfried Denk, and H. Sebastian Seung. Maximin affinity learning of image segmentation. CoRR, abs/0911.5372, 2009.

Andrew V. Uzilov, Joshua M. Keegan, and David H. Mathews. Detection of non-coding rnas on the basis of predicted secondary structure formation free energy change. Bioinformatics, 7:173, 2006.
Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995. URL http://books.google.com/books?id=sna9BaxVbj8C&printsec=frontcover.