Scaling up semi-supervised learning: an efficient and effective LLGC variant

Bernhard Pfahringer1, Claire Leschi2, and Peter Reutemann1

1 Department of Computer Science, University of Waikato, Hamilton, New Zealand
2 INSA Lyon, France
Abstract. Domains like text classification can easily supply large amounts of unlabeled data, but labeling itself is expensive. Semi-supervised learning tries to exploit this abundance of unlabeled training data to improve classification. Unfortunately, most of the theoretically well-founded algorithms that have been described in recent years are cubic or worse in the total number of both labeled and unlabeled training examples. In this paper we apply modifications to the standard LLGC algorithm to improve efficiency to a point where we can handle datasets with hundreds of thousands of training examples. The modifications are priming of the unlabeled data and, most importantly, sparsification of the similarity matrix. We report promising results on large text classification problems.
1 Introduction
Semi-supervised learning (and transduction) addresses the problem of learning from both labeled and unlabeled data. In recent years, this problem has generated a lot of interest in the Machine Learning community [26, 38]. This learning paradigm is motivated by both practical and theoretical issues. Indeed, it provides a very attractive framework for current application domains such as web categorization (e.g. [35]), text classification (e.g. [22, 15, 17]), camera image classification (e.g. [3, 25]), or computational biology (e.g. [31]). More generally, it is of high interest in all domains in which one can easily collect huge amounts of data, but labeling this data is expensive, time consuming, requires the availability of human experts, or is even infeasible. Moreover, it has been shown experimentally that, under certain conditions, the use of a small set of labeled data together with a large supplementary set of unlabeled data allows classifiers to learn a better hypothesis, and thus to significantly improve the generalization performance of supervised learning algorithms. Thus, one could sum up transductive learning as "less human effort and better accuracy". However, as has been noted by Seeger [26], issues in semi-supervised learning have to be addressed using (probably) genuinely new ideas. Most semi-supervised learning approaches use the labeled and unlabeled data simultaneously, or at least in close collaboration. Roughly speaking, the unlabeled data provides information about the structure of the domain, i.e. it helps to capture the underlying distribution of the data, whereas the labeled
data identifies the classification task within this structure. The challenge for the algorithms can be viewed as realizing a kind of trade-off between robustness and information gain [26]. To make use of unlabeled data, one must make assumptions, either implicitly or explicitly. As reported in [34], the key to semi-supervised learning is the prior assumption of consistency, which allows exploiting the geometric structure of the data distribution. This assumption comes in a local and a global form. The former (also shared by most supervised learning algorithms) states that nearby data points should belong to the same class. The latter, called the cluster assumption, states that the decision boundary should lie in regions of low data density; points which are connected by a path through regions of high data density should then have the same label. A common approach to taking the consistency assumption into account is to design an objective function which is sufficiently smooth w.r.t. the intrinsic structure revealed by the labeled and unlabeled data. Early methods in transductive learning used mixture models (in which each mixture component is associated with a class) and extensions of the EM algorithm [22]. More recent approaches belong to the following categories: self-training, co-training, transductive SVMs, split learning and graph-based methods. In the self-training approach, a classifier is trained on the labeled data and then used to classify the unlabeled data. The most confident of these newly labeled points are added to the training set, together with their predicted labels, and the process is repeated until convergence [32, 25]. The approaches related to co-training [7, 17] build on the hypothesis that the features describing the objects can be divided into two subsets such that each of them is sufficient to train a good classifier, and the two sets are conditionally independent given the classes. Two classifiers are iteratively trained, each on one subset, and each teaches the other with the unlabeled examples (and their predicted labels) it is most confident about. Transductive SVMs [29, 15] are a "natural" extension of SVMs to the semi-supervised setting. They aim at finding a labeling of the unlabeled data such that the decision boundary has a maximum margin on both the original labeled data and the (newly labeled) unlabeled data. Another category of methods, called split learning algorithms, represents an extreme alternative, using the unlabeled and labeled data in two different phases of the learning process [23]. As stated by Ando and Zhang [1], the basic idea is to learn good functional structures from the unlabeled data, and then to use the labeled data for supervised learning based on these structures. A detailed presentation of all these approaches is beyond the scope of this paper. In the following, we focus on graph-based methods, which are most directly related to the Local and Global Consistency (LLGC) algorithm [34] for which we propose some improvements. Graph-based methods attempt to capture the underlying structure of the data within a graph whose vertices are the available data points (both labeled and unlabeled) and whose (possibly weighted) edges encode the pairwise relationships among these points. As noted in [33], examples of recent work in this direction include Markov random walks [28], cluster kernels [9], regularization on graphs [27,
34] and directed graphs [35]. The graph is most often fully connected. Nevertheless, if sparsity is desired, the pairwise relationships between vertices can reflect a nearest neighbor property, thresholding either the degree (k-NN) or the distance (ε-NN). The learning problem on graphs can generally be thought of as estimating a classifying function f which should be close to a given function y on the labeled data and smooth on the whole graph [34]. For most graph-based methods, this can be formally expressed in a regularization framework [38] where the first term is a loss function and the second term a regularizer. The so-defined cost (or energy) function is minimized over the whole graph by means of (iterative) tuning of the edge values. Consequently, different graph-based methods mainly vary in the choice of the loss function and the regularizer [38]. For example, the work on graph cuts [6] minimizes the cost of a cut in the graph for a two-class problem, while [16] minimizes the normalized cut cost and [39, 34] minimize a quadratic cost. As noted in [38], these differences are not actually crucial. What is far more important is the construction and the quality of the graph, which should reflect domain knowledge through the similarity function used to assign edges (and their weights). One can find a discussion of this issue in [38, 3]. Other important issues such as consistency and scalability of semi-supervised learning methods are discussed in [37].
2 Related Work
The LLGC method of Zhou et al. [34] is a graph-based approach which addresses the semi-supervised learning problem by designing a function f that satisfies both the local and the global consistency assumption. The graph G is fully connected, with no self-loops. The edges of G are weighted with a positive and symmetric function w which represents the pairwise relationships between the vertices. This function is further normalized w.r.t. the convergence conditions of the algorithm [21, 9]. The goal is to label the unlabeled data. According to Zhou et al., the key point of the method is to let every point iteratively spread its label information to its neighbors until a global state is reached. Thus, viewing LLGC as an iterative process, one can intuitively understand the iteration as a process of information diffusion on graphs [18]. The weights are scaled by a parameter σ for propagation. During each iteration, each point receives information from its neighbors while also retaining its initial information. A parameter α adjusts the relative amount of information provided by the neighbors versus the initial information. When convergence is reached, each unlabeled point is assigned the label of the class it has received the most information for during the iteration process. One can also consider the LLGC method through the regularization framework. There, the first term of the cost function Q(f) is a fitting constraint that binds f to stay close to the initial label assignment. The second term is a smoothness constraint that maintains local consistency. The global consistency is maintained by a parameter µ which balances the two terms.
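For reference, this cost function can be written out explicitly; we restate it here from [34] (with the fitting term written first, to match the description above; W is the affinity matrix, D its diagonal degree matrix, and Y the initial label matrix):

$$Q(F) = \frac{1}{2}\left( \mu \sum_{i=1}^{n} \left\| F_i - Y_i \right\|^2 \;+\; \sum_{i,j=1}^{n} W_{ij} \left\| \frac{F_i}{\sqrt{D_{ii}}} - \frac{F_j}{\sqrt{D_{jj}}} \right\|^2 \right), \qquad F^{*} = \arg\min_F Q(F).$$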
As stated by Zhou et al., the graph-based approach closest to LLGC is the method using Gaussian random fields and harmonic functions presented in [39]. In this method, label propagation is formalized in a probabilistic framework. The probability distribution assigned to the classification function f is a Gaussian random field defined on the graph. This function is constrained to reproduce the initial labels on the labeled data. In terms of regularization networks, this approach can be viewed as having a quadratic loss function with infinite weight, so that the labeled data are clamped, and a regularizer based on the graph Laplacian [38]. The minimization of the cost function results in a harmonic function. In [14], the LLGC method and the Gaussian Random Field Model (GRFM) are further compared to each other and to the Low Density Separation (LDS) method of Chapelle and Zien [10]. Huang and Kecman note that both algorithms are manifold-like methods and share the property of searching for the class boundary in low-density regions (in this respect they also resemble the Gradient Transductive SVMs of [10]). LLGC has recently been extended to clustering and ranking problems. Relying on the fact that LLGC has demonstrated impressive performance on relatively complex manifold structures, the authors of [8] propose a new clustering algorithm which builds upon the LLGC method. They claim that LLGC naturally leads to an optimization framework that picks clusters on a manifold by minimizing the mean distance between points inside the clusters while maximizing the mean distance between points in different clusters. Moreover, they show that this framework is able to: (i) simultaneously optimize all learning parameters, (ii) pick the optimal number of clusters, (iii) allow easy detection of both global outliers and outliers within clusters; it can also be used to add previously unseen points to clusters without re-learning the original cluster model. Similarly, in [30], Vinueza and Grudic show that LLGC performs at least as well as the best known outlier detection algorithm, and can predict class outliers not only for training points but also for points introduced after training. Zhou et al. [36] propose a simple universal ranking algorithm derived from LLGC for data lying in a Euclidean space, and show that this algorithm is superior to local methods which rank data simply by pairwise Euclidean distances or inner products. Also note that for large-scale real-world problems they prefer to use the iterative version of the algorithm instead of the closed form based on matrix inversion. Empirically, a small number of iterations usually seems sufficient to yield high-quality ranking results.
In the following we propose extensions to LLGC to cope with its computational complexity, to broaden its range of applicability, and to improve its predictive accuracy. As reported in [37], the complexity of many graph-based methods is close to O(n³). Speed-up improvements have been proposed, for example in [20, 11, 40, 33, 13], but their effectiveness has not yet been demonstrated on large real-world problems. Section 3 gives the definition of the original LLGC algorithm and details our extensions. In Section 4 we support our claims with experiments on textual data. Finally, Section 5 summarizes and provides directions for future work.
3 Original LLGC and extensions
A standard semi-supervised learning algorithm is the so-called LLGC algorithm [34], which tries to balance two potentially conflicting goals: locally, similar examples should have similar class labels, and globally, the predicted labels should agree well with the given training labels. The way LLGC achieves this can intuitively be seen as the steady state of a random walk on the weighted graph given by the pairwise similarities of all instances, both labeled and unlabeled. At each step, each example passes on its current probability distribution to all other instances, where the distributions are weighted by the respective similarities. In detail, LLGC works as follows:

1. Set up an affinity matrix A, where $A_{ij} = e^{-\|x_i - x_j\|^2 / \sigma^2}$ for $i \neq j$, and $A_{ii} = 0$.
2. Symmetrically normalize A, yielding S = D^{-1/2} A D^{-1/2}, where D is a diagonal matrix with D(i, i) being the sum of the i-th row of A (which is also the sum of the i-th column, as A is a symmetric matrix).
3. Set up Y as an n × k matrix, where n is the number of examples and k is the number of class values. Set Y_ik = 1 if the class value of example i is k. All other entries are zero, i.e. unlabeled examples are represented by all-zero rows.
4. Initialise F(0) = Y, i.e. start with the given labels.
5. Repeat F(t+1) = αSF(t) + (1−α)Y until F converges.

α is a user-specified parameter in the range [0, 1]. High values of α focus the process on the propagation of the sums over the neighbourhood, i.e. local consistency, whereas low values put more emphasis on the constant injection of the originally given labels, and thereby focus more on global consistency. The seminal LLGC paper [34] proves that this iteration converges to

$$F^{*} = (1-\alpha)(I - \alpha S)^{-1} Y$$

The normalized rows of F* can be interpreted as class probability distributions for every example. The necessary conditions for convergence are that 0 ≤ α ≤ 1 holds and that all eigenvalues of S lie inside [−1, 1]. Before introducing the extensions designed to achieve the goals mentioned in the previous section, note the following: LLGC's notion of similarity is based on RBF kernels, which are general and work well for a range of applications, but they are not always the best choice for computing similarity. For text classification problems the so-called cosine similarity is usually the method of choice; likewise, other domains have their own preferred similarity measures. Generally, it is possible to replace the RBF kernel in the computation of the affinity matrix with any arbitrary kernel function, as long as one can show that the eigenvalues of S will still be within [−1, 1], thus guaranteeing convergence of the algorithm. One way of achieving this is to use "normalized" kernels. Any kernel k can be normalized as follows ([2]):
$$k_{\mathrm{norm}}(x, y) = \frac{k(x, y)}{\sqrt{k(x, x)\,k(y, y)}}$$

As the experiments reported below concern text classification problems, for which the cosine similarity measure is the method of choice, we employ this similarity measure (which is already normalized) instead of RBF kernels.
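As an illustration, the following sketch (our own, not the authors' original implementation; the function name llgc_dense is hypothetical) shows the dense iterative LLGC update with a cosine-similarity affinity matrix:

```python
import numpy as np

def llgc_dense(X, Y, alpha=0.9, n_iter=50):
    """Dense iterative LLGC with cosine similarity.

    X : (n, d) feature matrix (rows will be L2-normalized)
    Y : (n, k) initial label matrix; unlabeled rows are all zeros
    Returns F, whose normalized rows act as class distributions.
    """
    # Cosine similarity = dot product of L2-normalized rows (already a normalized kernel).
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    A = Xn @ Xn.T
    np.fill_diagonal(A, 0.0)            # no self-loops: A_ii = 0

    # Symmetric normalization S = D^{-1/2} A D^{-1/2}
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    S = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

    # Iterate F(t+1) = alpha * S * F(t) + (1 - alpha) * Y
    F = Y.astype(float).copy()
    for _ in range(n_iter):
        F = alpha * (S @ F) + (1.0 - alpha) * Y
    return F
```

This dense form is only practical for a few thousand examples; the next subsection removes that limitation.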
3.1 Reduce computational complexity by sparsifying A
The main source of complexity in the LLGC algorithm is the affinity matrix. It needs O(n²) memory, and the matrix inversion necessary for computing the closed form needs, depending on the algorithm used, roughly O(n^2.7) time, where n is the number of examples. If there are only a few thousand examples in total (both labeled and unlabeled), this is feasible. But we also want to work with 10^5 examples and more. In such a setting, even just storing the affinity matrix in main memory becomes impossible, let alone computing the matrix inversion. Our approach to sparsification is based on the insight that most values in the original affinity matrix are very close to zero anyway. Consequently, we enforce sparsity by only allowing the k nearest neighbours of each example to supply a non-zero affinity value. Typical well-performing values for k range from a few dozen to a hundred. There is one caveat here: kNN is not a symmetrical relationship, but the affinity matrix has to be symmetrical. It is easy to repair this shortcoming in a post-processing step after the sparse affinity matrix has been generated: simply add all "missing" entries. In the worst case this will at most double the number of non-zero entries in A. The memory complexity of LLGC is therefore reduced from O(n²) to a mere O(kn), which for small enough values of k makes it possible to deal with even millions of examples. Additionally, when using the iterative version of the algorithm to compute F*, the computational complexity is reduced to O(k · n · #iterations), which is a significant improvement in speed over the original formulation, especially as the number of iterations needed to achieve (de facto) convergence is usually rather low. For example, even after only ten iterations, most labels usually do not change any more. Computing the sparse affinity matrix is still O(n²) timewise, but for cases where n ≥ 5000 we use a hierarchical clustering-based approximation, which is O(n log n). Alternatively, there is currently a lot of research on speeding up nearest-neighbour queries using smart data structures, e.g. kD-trees or cover trees [4].
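A minimal sketch of this sparsification step (our own illustration; it uses scikit-learn's exact brute-force k-NN search in place of the hierarchical clustering-based approximation mentioned above, and the function name sparse_affinity is hypothetical):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import normalize

def sparse_affinity(X, k=50):
    """Sparse symmetric cosine-similarity affinity with k nearest neighbours per row."""
    Xn = normalize(X)                       # L2-normalize rows -> cosine = dot product
    nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(Xn)
    dist, idx = nn.kneighbors(Xn)           # assumes each point's nearest neighbour is itself

    n = Xn.shape[0]
    rows = np.repeat(np.arange(n), k)
    cols = idx[:, 1:].ravel()               # drop the self-neighbour
    vals = 1.0 - dist[:, 1:].ravel()        # cosine similarity = 1 - cosine distance
    A = csr_matrix((vals, (rows, cols)), shape=(n, n))

    # kNN is not symmetric: add the "missing" entries (at most doubles the non-zeros).
    return A.maximum(A.T)
```

The symmetric normalization S = D^{-1/2} A D^{-1/2} and the iterative update then carry over unchanged, with sparse matrix-vector products costing O(kn) per iteration.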
3.2 Allow pre-labeling of the unlabeled data
LLGC starts with all-zero class distributions for the unlabeled data. We allow pre-labeling by using class distributions for the unlabeled data that have been computed in some way using the training data:
$$Y_{ij} = \mathrm{prob}_j\big(\mathrm{classifier}_{\mathrm{labeledData}}(x_i)\big)$$

where classifier_labeledData is some classifier that has been trained on just the labeled subset of the given data. For the text mining experiments described below this is usually a linear support vector machine. There are at least two arguments for allowing this pre-labeling (or priming) of the class probability distributions inside LLGC. A pragmatic argument is simply that in all experiments we have performed, we uniformly achieve better final results when using priming. We suspect that this might not hold in extreme cases where the number of labeled examples is very small, as any classifier trained on such a small set of examples will necessarily be rather unreliable. There is also a second, more fundamental argument in favour of priming. Due to the sparsification of the affinity matrix, which in its non-sparse version describes a fully connected, though weighted, graph, this graph might be split into several isolated subgraphs. Some of these subgraphs may not contain any labeled points at all. The propagation algorithm would then have no information left to propagate, and would simply return all-zero distributions for every example in such a neighbourhood. Priming resolves this issue in a principled way. One potential problem with priming is the fact that the predicted labels might be less reliable than the explicitly given labels. In a different and much simpler algorithm for semi-supervised learning [12], this problem was solved by using different weights for labeled and unlabeled examples. When the weights reflected the ratio of labeled to unlabeled examples, predictive accuracy was usually satisfactory. In a similar spirit we introduce a second parameter β, which scales down the initial predictions for unlabeled data in the primed LLGC algorithm:

$$Y_{ij} = \beta \cdot \mathrm{prob}_j\big(\mathrm{classifier}_{\mathrm{labeledData}}(x_i)\big)$$

if x_i is an unlabeled example. In the experiments reported below we usually find that the values for β chosen by cross-validation are reasonably close to the value that the ratio heuristic would suggest.
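A sketch of how the primed label matrix Y could be assembled (again our own illustration; a calibrated linear SVM is one plausible stand-in for the base classifier, and the function name primed_label_matrix is hypothetical):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

def primed_label_matrix(X_lab, y_lab, X_unlab, n_classes, beta=0.1):
    """Build the initial Y matrix: one-hot rows for labeled data,
    classifier probabilities scaled by beta for unlabeled data.
    Assumes integer class labels 0..n_classes-1, all present in y_lab."""
    # Wrap a linear SVM in a calibrator so it outputs class probabilities.
    clf = CalibratedClassifierCV(LinearSVC())
    clf.fit(X_lab, y_lab)

    n_lab, n_unlab = X_lab.shape[0], X_unlab.shape[0]
    Y = np.zeros((n_lab + n_unlab, n_classes))
    Y[np.arange(n_lab), y_lab] = 1.0                    # given labels
    Y[n_lab:, :] = beta * clf.predict_proba(X_unlab)    # primed, down-weighted by beta
    return Y
```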
4 Experiments
In this section we evaluate the extended LLGC algorithm on text classification problems by comparing it to a standard linear support vector machine. As we cannot compare to the original LLGC algorithm for computational reasons (see the previous section for details), we at least include both a "pure" version, which uses only sparsification and the cosine similarity, and the "primed" version, which uses the labels predicted by the linear support vector machine to initialise the class distributions for the unlabeled data. As explained in the previous section, we down-weigh these pre-labels by setting β = 0.1 and also β = 0.01, to see how sensitive the algorithm is with respect to β. We also investigate differently sized neighbourhoods of sizes 25, 50, and 100. The dataset we use for this comparison is the recently released large and cleaned-up Reuters corpus RCV1-v2 [19]. We use a predefined set of 23149 labeled examples as the proper training data, and another 199328 examples as the unlabeled or test data. Therefore we have training labels for slightly more than 10% of the data. RCV1-v2 defines hundreds of overlapping categories or labels for the data. We have run experiments on the 80 largest categories, treating each category separately as a binary prediction problem. For evaluation we have chosen AUC (area under the ROC curve), which has recently become very popular especially for text classification [5], as it is independent of a specific threshold. Secondly, as some typical text classification tasks can also be cast as ranking tasks (e.g. the separation of spam email from proper email messages), AUC seems especially appropriate, as it measures how much better the ranking computed by some algorithm is than a random ranking. As there is not enough space to present the results for all 80 categories here, we have selected only two (CCAT and E14), where CCAT is the largest one and E14 is considerably smaller. Table 1 depicts AUC for the various algorithms over a range of values for α. From top to bottom we have graphs for neighbourhoods of size 25, 50, and 100. The trends that can be seen in these graphs hold for all the other categories not shown here as well. Usually all LLGC variants outperform the support vector machine, which was trained only on the labeled examples. The difference becomes more pronounced for the smaller categories, i.e. where the binary learning problem is more skewed. Pure LLGC itself is also usually outperformed by the primed version, except sometimes at extreme values of α (0.01 or 0.99). For larger categories the differences between pure and primed LLGC are more pronounced, and the influence of α is also larger, with the best results found around the middle of the range. Also, with respect to β, the best results for β = 0.1 are usually found in the upper half of the α range, whereas for the smaller β = 0.01 the best results are usually found at lower values of α. Globally, β = 0.1 seems to be the slightly better value, which confirms the heuristic presented in the last section, as the ratio between labeled and unlabeled data in this domain is about 0.1.
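For completeness, the per-category AUC evaluation described above could be computed along these lines (a sketch using scikit-learn's roc_auc_score; the scores would be the positive-class column of the normalized F*, or the SVM decision values):

```python
from sklearn.metrics import roc_auc_score

def auc_per_category(category_labels, scores):
    """category_labels: dict mapping category name -> binary 0/1 array over test docs.
    scores: dict mapping category name -> real-valued ranking scores for the same docs."""
    return {cat: roc_auc_score(y_true, scores[cat])
            for cat, y_true in category_labels.items()}
```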
4.1 Spam detection
Another very specific text classification problem is the detection of spam email. Recently a competition was held to determine successful learning algorithms for this problem [5]. One of the problems comprised a labeled mailbox with 7500 messages gathered from publicly available corpora and spam sources, whereas for prediction three different mailboxes of size 4000 each were supplied. Each mailbox had an equal amount of spam and non-spam, but this was not known to the participants in the competition. The three unlabeled prediction mailboxes were very coherent in their non-spam messages, as each contained the messages of a single Enron user. Again, this was not known to the participants. A solution based on a lazy
Table 1. AUC values for categories CCAT and E14, for SVM, pure LLGC, and primed LLGC, over a range of α values. CCAT is the largest category, E14 a rather small one.
[Six plots of AUC vs. alpha (0.01–0.99): AUC for CCAT (k=25, prob=49.17%), AUC for CCAT (k=50, prob=49.17%), AUC for CCAT (k=100, prob=49.17%), AUC for E14 (k=25, prob=0.27%), AUC for E14 (k=50, prob=0.27%), AUC for E14 (k=100, prob=0.27%); series: SVM, pureLLGC, primedLLGC(0.1), primedLLGC(0.01).]
feature selection technique in conjunction with the fast LLGC method described here was able to tie for first place in this competition [24]. In Table 2 we report the respective AUC values for the submitted solution, as well as for a support vector machine and a pure LLGC approach. This time the support vector machine outperforms the pure LLGC solution, but again the primed LLGC version is the overall winner.

Table 2. AUC values for the ECML Challenge submission, pure LLGC, and a linear support vector machine, averaged over three mailboxes; ranks are hypothetical except for the first row:

Algorithm                                   AUC      Rank
primedLLGC(k=100, alpha=0.8, beta=1.0)      0.9491   1/21
support vector machine                      0.9056   7/21
pure LLGC(alpha=0.99)                       0.6533   19/21

5 Conclusions
In this paper we have extended the well-known LLGC algorithm in three directions: we have broadened the range of admissible similarity functions, we have improved the computational complexity by sparsification, and we have improved predictive accuracy by priming. A preliminary experimental evaluation using a large text corpus has shown promising results, as has the application to spam detection. Future work will include a more complete sensitivity analysis of the algorithm, as well as applications to non-textual data utilizing a variety of different kernels.
References

1. R.K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Technical Report RC23462, IBM T.J. Watson Research Center, Yorktown Heights, NY, USA, 2004.
2. M.-F. Balcan and A. Blum. On a theory of learning with similarity functions. In ICML '06: Proceedings of the 23rd International Conference on Machine Learning, pages 73–80, New York, NY, USA, 2006. ACM Press.
3. M.-F. Balcan, A. Blum, P.P. Choi, J. Lafferty, B. Pantano, M.R. Rwebangira, and X. Zhu. Person identification in webcam images: an application of semi-supervised learning. In Proc. of the 22nd International Conference on Machine Learning (ICML 05), Workshop on Learning with Partially Classified Training Data, pages 1–9, Bonn, Germany, August 2005.
4. A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In ICML '06: Proceedings of the 23rd International Conference on Machine Learning, New York, NY, USA, 2006. ACM Press.
5. S. Bickel, editor. Proceedings of the ECML/PKDD 2006 Discovery Challenge Workshop. Humboldt University Berlin, 2006.
6. A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In C.E. Brodley and A. Pohoreckyj Danyluk, editors, Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001). Morgan Kaufmann, 2001.
7. A. Blum and T.M. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT), pages 92–100, Madison, Wisconsin, USA, July 1998.
8. M. Breitenbach and G.Z. Grudic. Clustering with local and global consistency. Technical Report CU-CS-973-04, University of Colorado, Department of Computer Science, 2004.
9. O. Chapelle, J. Weston, and B. Schölkopf. Cluster kernels for semi-supervised learning. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 585–592. MIT Press, 2002.
10. O. Chapelle and A. Zien. Semi-supervised classification by low density separation. In Proc. of the 10th International Workshop on Artificial Intelligence and Statistics (AISTATS), pages 57–64, Barbados, January 2005.
11. O. Delalleau, Y. Bengio, and N. Le Roux. Efficient non-parametric function induction in semi-supervised learning. In Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics (AISTATS 2005), 2005.
12. K. Driessens, P. Reutemann, B. Pfahringer, and C. Leschi. Using weighted nearest neighbor to benefit from unlabeled data. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2006), 2006.
13. J. Garcke and M. Griebel. Semi-supervised learning with sparse grids. In Proceedings of the Workshop on Learning with Partially Classified Training Data (ICML 2005), Bonn, Germany, 2005.
14. T.M. Huang and V. Kecman. Performance comparisons of semi-supervised learning algorithms. In Proc. of the 22nd International Conference on Machine Learning (ICML 05), Workshop on Learning with Partially Classified Training Data, pages 45–49, Bonn, Germany, August 2005.
15. T. Joachims. Transductive inference for text classification using support vector machines. In I. Bratko and S. Dzeroski, editors, Proceedings of the Sixteenth International Conference on Machine Learning (ICML 1999), Bled, Slovenia, June 27-30, 1999, pages 200–209. Morgan Kaufmann, 1999.
16. T. Joachims. Transductive learning via spectral graph partitioning. In T. Fawcett and N. Mishra, editors, Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA, pages 290–297. AAAI Press, 2003.
17. R. Jones. Learning to extract entities from labeled and unlabeled text. PhD thesis, Carnegie Mellon University, School of Computer Science, Pittsburgh, Pennsylvania, USA, 2005.
18. R.I. Kondor and J.D. Lafferty. Diffusion kernels on graphs and other discrete input spaces. In C. Sammut and A.G. Hoffmann, editors, Machine Learning, Proceedings of the Nineteenth International Conference (ICML 2002), 2002.
19. D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.
20. M. Mahdavani, N. de Freitas, B. Fraser, and F. Hamze. Fast computation methods for visually guided robots. In Proceedings of the 2005 International Conference on Robotics and Automation (ICRA), 2005.
21. A.Y. Ng, M.I. Jordan, and Y. Weiss. On spectral clustering: analysis and an algorithm. In T.G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 849–856. MIT Press, 2001.
22. K. Nigam, A. McCallum, S. Thrun, and T.M. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3), 2000.
23. C.S. Oliveira, F.G. Cozman, and I. Cohen. Splitting the unsupervised and supervised components of semi-supervised learning. In Proc. of the 22nd International Conference on Machine Learning (ICML 05), Workshop on Learning with Partially Classified Training Data, pages 67–73, Bonn, Germany, August 2005.
24. B. Pfahringer. A semi-supervised spam mail detector. In S. Bickel, editor, Proceedings of the ECML/PKDD 2006 Discovery Challenge Workshop, pages 48–53. Humboldt University Berlin, 2006.
25. C. Rosenberg, M. Hebert, and H. Schneiderman. Semi-supervised self-training of object detection models. In 7th IEEE Workshop on Applications of Computer Vision, pages 29–36. IEEE Computer Society, 2005.
26. M. Seeger. Learning from labeled and unlabeled data. Technical report, University of Edinburgh, Institute for Adaptive and Neural Computation, 2001.
27. A.J. Smola and R. Kondor. Kernels and regularization on graphs. In B. Schölkopf and M.K. Warmuth, editors, Computational Learning Theory and Kernel Machines, Lecture Notes in Computer Science 2777, pages 144–158, 2003.
28. M. Szummer and T. Jaakkola. Partially labeled classification with Markov random walks. In T.G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 945–952. MIT Press, 2001.
29. V.N. Vapnik. Statistical Learning Theory. Wiley, New York, USA, 1998.
30. A. Vinueza and G.Z. Grudic. Unsupervised outlier detection and semi-supervised learning. Technical Report CU-CS-976-04, University of Colorado, Department of Computer Science, 2004.
31. J. Weston, C. Leslie, E. Ie, D. Zhou, A. Elisseeff, and W.S. Noble. Semi-supervised protein classification using cluster kernels. Bioinformatics, 21(15):3241–3247, 2005.
32. D. Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In Proc. of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 189–196, 1995.
33. K. Yu, S. Yu, and V. Tresp. Blockwise supervised inference on large graphs. In Proc. of the 22nd International Conference on Machine Learning, Workshop on Learning with Partially Classified Training Data, Bonn, Germany, 2005.
34. D. Zhou, O. Bousquet, T. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In S. Thrun, L.K. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, 2004.
35. D. Zhou, J. Huang, and B. Schölkopf. Learning from labeled and unlabeled data on a directed graph. In Proc. of the 22nd International Conference on Machine Learning (ICML 05), pages 1041–1048, Bonn, Germany, August 2005.
36. D. Zhou, J. Weston, A. Gretton, O. Bousquet, and B. Schölkopf. Ranking on data manifolds. In S. Thrun, L.K. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, 2004.
37. X. Zhu. Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2005.
38. X. Zhu. Semi-supervised learning with graphs. PhD thesis, Carnegie Mellon University, School of Computer Science, Pittsburgh, Pennsylvania, USA, 2005.
39. X. Zhu, Z. Ghahramani, and J.D. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In T. Fawcett and N. Mishra, editors, Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), 2003.
40. X. Zhu and J. Lafferty. Harmonic mixtures: combining mixture models and graph-based methods for inductive and scalable semi-supervised learning. In Proceedings of the 22nd International Conference on Machine Learning (ICML 2005), 2005.