
An Improvement on Learning with Local and Global Consistency

Jie Gui 1,2
1 Hefei Institute of Intelligent Machines, Chinese Academy of Sciences, P.O. Box 1130, Hefei, Anhui 230031, China
2 Department of Automation, University of Science and Technology of China, Hefei 230027, China

De-Shuang Huang
Intelligent Computing Lab, Hefei Institute of Intelligent Machines, Chinese Academy of Sciences, P.O. Box 1130, Hefei, Anhui 230031, China

Zhuhong You 1,2
1 Hefei Institute of Intelligent Machines, Chinese Academy of Sciences, P.O. Box 1130, Hefei, Anhui 230031, China
2 Department of Automation, University of Science and Technology of China, Hefei 230027, China

Abstract

This paper proposes a modified version of the semi-supervised learning algorithm with local and global consistency. The new method incorporates the label information and adopts the geodesic distance rather than the Euclidean distance as the measure of the difference between two data points. In addition, we add class prior knowledge, and we find that its effect differs between high and low label rates. The experimental results show that these changes yield better classification performance than the original algorithms.

1. Introduction

In many applications of pattern classification and data mining, collecting enough labeled samples can be costly and time-consuming, whereas unlabeled samples are far easier to obtain. For example, in web categorization it is easy to collect a large number of web pages, but assigning labels such as sports or economics to these data requires inspection, or even time-consuming reading, by human assessors, which is fairly expensive. Consequently, semi-supervised learning has attracted much interest from researchers [1]. What does semi-supervised learning mean? Given a dataset in which the first l examples are labeled while the remaining examples have no labels, the goal of semi-supervised learning is to predict the labels of the unlabeled examples according to the knowledge hidden in the relation between the labeled and unlabeled data.



Many semi-supervised learning methods have been proposed in which the label information is propagated from the labeled data to the unlabeled data, even though the history of research on semi-supervised learning is not long. Among these methods is a promising family of techniques based on Gaussian fields, which assume that nearest neighbors (according to some similarity measure) in the high-dimensional input space will have similar outputs [2][6] or be close to each other on a low-dimensional manifold [3][5]. These methods are called graph-based methods. Graph-based semi-supervised methods define a graph whose nodes are the labeled and unlabeled examples in the dataset and whose edges (possibly weighted) reflect the similarity of the examples [1]. These methods usually assume label smoothness over the graph, and they are nonparametric, discriminative, and transductive in nature. In this paper, we address the method of learning with local and global consistency [2]. This method designs a classifying function that is sufficiently smooth with respect to the intrinsic structure collectively revealed by the labeled and unlabeled points, and Zhou [2] presents a simple algorithm to obtain such a smooth solution. Unfortunately, the label information and the class prior knowledge, which can be beneficial to classification, are not used. Though the labeled samples may be few, the label information and the class prior knowledge are very important for improving classification performance.

In this paper, we propose a modified version of learning with local and global consistency that makes use of the label information and the class prior knowledge for semi-supervised learning.

2. Method

Compared with the original learning with local and global consistency algorithm (Lgc), our semi-supervised version, abbreviated Slgc, makes two changes. The first is to consider not only the distances between pairs of data points but also their label information, even though the labeled samples may be few; the basic idea is to make two data points with the same class label as close as possible to each other, and two points with different class labels as far as possible from each other. The second is to use the geodesic distance rather than the straight-line Euclidean distance as the measure of the difference between two data points; on an underlying manifold, the geodesic distance along the surface of the manifold preserves the intrinsic geometry of the data better than the Euclidean distance. In addition, a second version, named Slgc-cmn, adds class prior knowledge on top of Slgc. We found that the effect of class prior knowledge differs between high and low label rates.

Given a dataset X = {X_1, ..., X_l, X_{l+1}, ..., X_n}, the first l samples are labeled by {t_1, t_2, ..., t_l}, while the remaining ones have no labels. The label set is L = {1, ..., c}. Let F denote the set of n × c matrices with nonnegative entries. A matrix F = [F_1^T, ..., F_n^T]^T corresponds to a classification on the dataset X obtained by labeling each point x_i with the label y_i = arg max_{j≤c} F_ij. We can view F as a vectorial function F : X → R^c which assigns a vector F_i to each point x_i. Define an n × c matrix Y with Y_ij = 1 if x_i is labeled as y_i = j and Y_ij = 0 otherwise. Obviously, Y is consistent with the initial labels according to the decision rule. Our algorithm is summarized as follows.

Step 1) Form the adjacency graph. First set the parameter K. Define the graph G over all n data points, and link the point X_i to X_j with an edge of length

d(i, j) = ||X_i − X_j|| if X_i ~_K X_j, and d(i, j) = +∞ otherwise,   (1)

where X_i ~_K X_j denotes that the points X_i and X_j are among the K-nearest neighbors of each other.

Step 2) Estimate the geodesic distances. For a neighboring point pair, the geodesic distance is approximated by the Euclidean distance; for a faraway point pair, it is approximated by adding up a sequence of "short hops" between neighboring points. First initialize d_G(i, j) = d(i, j); then, for each value of k = 1, 2, ..., n in turn, replace all entries d_G(i, j) by min{d_G(i, j), d_G(i, k) + d_G(k, j)}.
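For concreteness, Steps 1 and 2 can be sketched in NumPy as follows. This is a minimal illustration only; the function and variable names (geodesic_distances, mutual, and so on) are ours, not part of the original algorithm description.

```python
import numpy as np

def geodesic_distances(X, K=10):
    """Steps 1-2 (sketch): mutual K-NN graph with Euclidean edge lengths,
    then geodesic distances approximated by Floyd-Warshall relaxation."""
    n = X.shape[0]
    # Pairwise Euclidean distances.
    sq = (X ** 2).sum(axis=1)
    euclid = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))

    # X_i ~_K X_j: each point is among the other's K nearest neighbors.
    order = np.argsort(euclid, axis=1)
    knn = np.zeros((n, n), dtype=bool)
    for i in range(n):
        knn[i, order[i, 1:K + 1]] = True      # position 0 is the point itself
    mutual = knn & knn.T

    # Equation (1): edge length ||X_i - X_j|| on graph edges, +inf otherwise.
    d_G = np.where(mutual, euclid, np.inf)
    np.fill_diagonal(d_G, 0.0)

    # Step 2: d_G(i, j) <- min{d_G(i, j), d_G(i, k) + d_G(k, j)} for k = 1..n.
    for k in range(n):
        d_G = np.minimum(d_G, d_G[:, k:k + 1] + d_G[k:k + 1, :])
    return d_G, mutual
```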

Step 3) Define the affinity matrix W = (w_ij) by the following rule. If i ≤ l and j ≤ l (both points are labeled), then

w_ij = 1 if t_i = t_j, and w_ij = eps if t_i ≠ t_j;

otherwise,

w_ij = exp(−d_G(i, j)^2 / β) if X_i ~_K X_j, and w_ij = eps otherwise.   (2)

Here eps is a value very close to zero; in our experiments we set it to 2^(−52).
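A sketch of this step, continuing the assumptions above, is given below; the convention that labels holds t_1, ..., t_l for the first l points and −1 for unlabeled points is ours.

```python
def affinity_matrix(d_G, mutual, labels, beta, eps=2.0 ** -52):
    """Step 3 (sketch): rule (2), with labeled-labeled pairs overridden by
    the label information."""
    # Gaussian weight of the geodesic distance on graph edges, eps elsewhere.
    W = np.where(mutual, np.exp(-d_G ** 2 / beta), eps)

    # For pairs of labeled points: 1 if the labels agree, eps if they differ.
    li = np.where(labels >= 0)[0]
    same = labels[li][:, None] == labels[li][None, :]
    W[np.ix_(li, li)] = np.where(same, 1.0, eps)
    return W
```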

Step 4) Construct the matrix S = D^(−1/2) W D^(−1/2), where D = diag(d) is the diagonal matrix with entries d_i = Σ_j w_ij and W = (w_ij) is the weight matrix.

Step 5) Iterate F(t + 1) = αSF(t) + (1 − α)Y until convergence, where α is a free parameter between zero and one.

Step 6) Let F* denote the limit of the sequence {F(t)}. Label each point x_i with the label

y_i = arg max_{j≤c} F*_ij.   (3)
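Steps 4-6 admit a direct sketch; the values of alpha and the iteration count below are illustrative defaults of ours, not values prescribed here.

```python
def propagate_labels(W, Y, alpha=0.99, n_iter=1000):
    """Steps 4-6 (sketch): symmetric normalization, iterative propagation,
    and labeling by the row-wise argmax of the limit matrix."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    S = D_inv_sqrt @ W @ D_inv_sqrt            # Step 4
    F = Y.astype(float)
    for _ in range(n_iter):                    # Step 5
        F = alpha * (S @ F) + (1.0 - alpha) * Y
    # Step 6, equation (3): label each point by the largest entry of its row.
    return F, F.argmax(axis=1)
```

Since the sequence converges, the loop can equivalently be replaced by solving the linear system (I − αS)F = (1 − α)Y for the limit, as noted in [2].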

The algorithm above is called semi-supervised learning with local and global consistency (Slgc). If we incorporate the next step, the algorithm becomes Slgc-cmn.

Step 7) Zhu [6] first suggested incorporating class prior knowledge, adopting a simple procedure called class mass normalization (cmn) to adjust the class distributions to match the priors. If there are c classes, let F_CMN(i, j) denote the probability of node i belonging to label j, j = 1, 2, ..., c. The obvious decision rule is to assign the label y_i = arg max_{j≤c} F_CMN(i, j) to node i. Let q(j) denote the proportion of class j in the labeled dataset. F_CMN(i, j) is defined as follows:

F_CMN(i, j) = F*(i, j) × q(j) / Σ_i F*(i, j).   (4)
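Equation (4) amounts to rescaling the column masses of F* to match the labeled-class proportions, as in the sketch below (again using our convention that labels is −1 for unlabeled points).

```python
def class_mass_normalization(F_star, labels, c):
    """Step 7 (sketch): equation (4), followed by the argmax decision rule."""
    observed = labels[labels >= 0]
    q = np.array([(observed == j).mean() for j in range(c)])   # class priors
    F_cmn = F_star * (q / F_star.sum(axis=0))                  # rescale columns
    return F_cmn.argmax(axis=1)
```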

If the warning "Matrix is singular to working precision" appears while the algorithm is running, it usually means that the original graph is disconnected; in that case, eps in the weight matrix can be adjusted slightly while keeping it very close to zero. Note that although two free parameters, K and β, appear in our algorithm, β can be set easily. Since it serves to prevent w_ij from falling too fast when d_G(i, j) is relatively large, the value of β is set to the average geodesic distance between all pairs of data points.
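Under the sketches above, this heuristic reads simply as follows; restricting the average to finite off-diagonal entries is our own guard for disconnected graphs.

```python
# beta = average geodesic distance over all pairs of data points.
off_diag = ~np.eye(d_G.shape[0], dtype=bool)
beta = d_G[np.isfinite(d_G) & off_diag].mean()
```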

3. Experimental Results and Discussion

The performance of our semi-supervised version has been compared with that of the original version and other related methods on the well-known USPS (U.S. Postal Service) database. The database was collected from actual handwritten postal codes and contains 9298 normalized grey-scale images of size 16 × 16. The images come from 10 classes representing the digits 0 to 9, and all images are provided with class labels. For each trial, we randomly draw a set of 1000 samples from the database, among which a random subset is used with its labels as the labeled set, and the remaining samples are used as the unlabeled set. For all trials, we set the parameter K to K = 10 and the parameter σ in the original algorithm to σ = 1. The average classification results of 10 independent runs are shown in Tables 1-3. The method of Zhu [6] and the method of Zhu [6] with cmn are abbreviated "Harmonic" and "Harmonic-cmn" below, respectively.
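A minimal sketch of one such trial, tying together the functions sketched in Section 2, is shown below; the USPS loading step is omitted, and the helper names are ours.

```python
import numpy as np

def run_trial(X_all, y_all, label_rate, K=10, seed=0):
    """One trial of the protocol above: sample 1000 points, label a random
    fraction, run Slgc, and report the error rate on the unlabeled points."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X_all), size=1000, replace=False)
    X, y = X_all[idx], y_all[idx].astype(int)

    n, c = len(X), int(y_all.max()) + 1
    labeled = rng.permutation(n)[: int(label_rate * n)]
    labels = np.full(n, -1)
    labels[labeled] = y[labeled]
    Y = np.zeros((n, c))
    Y[labeled, y[labeled]] = 1.0

    d_G, mutual = geodesic_distances(X, K=K)
    off_diag = ~np.eye(n, dtype=bool)
    beta = d_G[np.isfinite(d_G) & off_diag].mean()
    W = affinity_matrix(d_G, mutual, labels, beta)
    _, y_pred = propagate_labels(W, Y)

    test = labels < 0
    return (y_pred[test] != y[test]).mean()
```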

Table 1. The classification error rates under 30 percent label rate

classifiers    Slgc-cmn      Lgc          Harmonic    Harmonic-cmn   Slgc
error rates    0.12328       0.11272      0.21043     0.135          0.11814
variances      0.000770233   0.00045854   0.0017837   0.000209       0.000179694

Table 2. The classification error rates under 50 percent label rate

classifiers    Slgc-cmn      Lgc        Harmonic     Harmonic-cmn   Slgc
error rates    0.0926        0.094      0.1916       0.104          0.0972
variances      0.000250711   0.000248   0.00130649   0.000157       0.000196622

Table 3. The classification error rates under 70 percent label rate

classifiers    Slgc-cmn      Lgc          Harmonic     Harmonic-cmn   Slgc
error rates    0.0734        0.0794       0.1517       0.0924         0.0887
variances      0.000197822   0.00015449   0.00049023   0.0003083      0.000300678

From the average results in the tables above, Slgc and Slgc-cmn are the best methods under both the 70 and 50 percent label rates, while Slgc-cmn is the best under the 30 percent label rate. Slgc is slightly worse than Harmonic-cmn and a little better than Harmonic under the 30 percent label rate, and Slgc is better than Slgc-cmn under the 70 and 50 percent label rates. Why is Slgc-cmn worse than Slgc under the 70 and 50 percent label rates? We believe this is because, under a high label rate, the class prior information is already sufficiently represented by the labeled data, and applying class mass normalization then tends to overfit; under a low label rate, the class priors are important, so Slgc-cmn, which adopts cmn to adjust the class distributions to match the priors, performs better than Slgc. From the variances in the tables, Slgc-cmn is the best of all five methods under the 70 percent label rate, and it is better than both Slgc and Lgc under the 30 and 50 percent label rates. Figure 1 compares the results for different values of K.

[Figure 1. Comparison among different K: error rate versus K for Slgc (left panel) and Slgc-cmn (right panel) under the 30, 50, and 70 percent label rates.]

From the figure above, it can be seen that Slgc and Slgc-cmn achieve their lowest error rates under the 30, 50, and 70 percent label rates when K is 8, 10, and 20, respectively. When K is very small, such as 1, both Slgc and Slgc-cmn have very high error rates in all cases. When K is larger than 8 under the 30 percent label rate, the error rate grows slightly with K; under the other label rates, the error rates differ little from each other once K is larger than 10. For large high-dimensional datasets, a single experiment takes a very long time. For the original algorithm, the free parameter of the weight is σ, and many experiments are needed to find the best σ because its range is from zero to infinity. From the figure above, we can see that our algorithm is not very sensitive to K when K is between 5 and 20: under a low label rate we can choose K between 5 and 10, and under a high label rate we can choose K between 10 and 20, or even larger than 20. Even if we run experiments for every K between 5 and 20, only 16 experiments are needed. From what has been discussed above, we can safely conclude that our algorithm is more adaptive than the original algorithm.

4. Conclusions

We have presented a modified version of learning with local and global consistency for partially labeled classification. Our method adds the label information and the class prior knowledge, and adopts the geodesic distance rather than the Euclidean distance as the measure of the difference between two data points. These changes yield significant benefits for partially labeled classification and make the modified version outperform the other semi-supervised learning algorithms compared.

References
[1] X. Zhu, "Semi-supervised learning literature survey," http://www.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf.
[2] D. Y. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Scholkopf, "Learning with local and global consistency," Max Planck Institute for Biological Cybernetics Technical Report, 2003.
[3] W. Du, K. Inoue, and K. Urahama, "Dimensionality reduction for semi-supervised face recognition," Lecture Notes in Computer Science, 3614, 1-10, 2005.
[4] M. Belkin and P. Niyogi, "Laplacian eigenmaps for dimensionality reduction and data representation," Neural Computation, 15(6), 1373-1396, 2003.
[5] M. Belkin and P. Niyogi, "Semi-supervised learning on Riemannian manifolds," Machine Learning, 56, 209-239, 2004.
[6] X. Zhu, Z. Ghahramani, and J. Lafferty, "Semi-supervised learning using Gaussian fields and harmonic functions," Proceedings of the 20th International Conference on Machine Learning (ICML-2003), Washington, DC, 2003.
