Particle Swarm Optimization Based Nearest Neighbor Algorithm on Chinese Text Categorization

Shi Cheng∗†, Yuhui Shi†, Quande Qin‡, T. O. Ting†
∗ Department of Electrical Engineering and Electronics, University of Liverpool, Liverpool, UK
† Department of Electrical & Electronic Engineering, Xi'an Jiaotong-Liverpool University, Suzhou, China
‡ College of Management, Shenzhen University, Shenzhen, China
[email protected], [email protected]
Abstract—In this paper, the nearest neighbor method for Chinese text categorization is formulated as an optimization problem, and particle swarm optimization is utilized to optimize a nearest neighbor classifier that solves the Chinese text categorization problem. The parameter k is first optimized to obtain the minimum error; the categorization problem is then formulated as a discrete, constrained, single objective optimization problem in which the dimensions of the solution vector depend on each other. The parameter k and the number of labeled examples for each class are optimized together to reach the minimum categorization error. The experiments show that, with particle swarm optimization, the performance of a nearest neighbor classifier can be improved and the minimum categorization error rate can be obtained.

Index Terms—Particle swarm optimization, nearest neighbor, k-weighted nearest neighbor, text categorization, parameter optimization
I. INTRODUCTION

Particle swarm optimization (PSO) was introduced by Eberhart and Kennedy in 1995 [1], [2]. It is a population-based stochastic algorithm modeled on the social behaviors observed in flocking birds. A particle flies through the search space with a velocity that is dynamically adjusted according to its own and its companions' historical behaviors. Each particle's position represents a solution to the problem, and particles tend to fly toward better and better search areas over the course of the search process [3], [4].

Text categorization, also termed text classification (TC), is the problem of finding the correct category (or categories) for documents, given a set of categories (subjects, topics) and a collection of text documents. Text categorization can be considered a mapping f : D → C from the document space D onto the set of classes C. The objective of a classifier is to obtain accurate categorization results, or a categorization prediction with high confidence.

In the text categorization process, the text is divided into a collection of words. All words that can appear in any document are considered features of a document, so the dimension of the feature space equals the number of different words that can appear in all documents. In other words, the data for each document is not the text itself but a vector. The methods for assigning weights to the features vary; the simplest is the binary method, in which the feature weight is one if the corresponding word
is present in the document and zero otherwise [5], [6].

Nearest neighbor approaches are very effective in categorization [7]–[9]. For large scale data, however, the similarity search problem is very difficult to solve due to "the curse of dimensionality" [10]. Several approximate search methods have been proposed, such as Locality Sensitive Hashing [11]–[14], which is based on hashing functions with strong "local sensitivity" in order to retrieve nearest neighbors in a Euclidean space with a complexity sublinear in the amount of data. A deficiency of nearest neighbor methods is the difficulty of parameter optimization, since most settings are chosen from user experience [15]. Evolutionary algorithms can be applied to this kind of problem; for example, the genetic algorithm has been utilized to find a compact reference set for nearest neighbor classification [16], [17].

In this paper, the nearest neighbor method for Chinese text categorization is formulated as an optimization problem, and particle swarm optimization is utilized to optimize a nearest neighbor classifier that solves the Chinese text categorization problem. The parameter k is first optimized to obtain the minimum error; the categorization problem is then formulated as a discrete, constrained, single objective optimization problem in which the dimensions of the solution vector depend on each other. The parameter k and the number of labeled examples for each class are optimized together to reach the minimum categorization error.

The rest of the paper is organized as follows. The canonical PSO algorithm and the Cosine similarity metric are reviewed in Section II. In Section III, the problem and process of text categorization are discussed. Section IV reviews the weighted nearest neighbor classifier and introduces the optimization of the k value, alone and together with the number of labeled examples per class. In Section V, the results of the nearest neighbor classifier and of the particle swarm optimization based nearest neighbor methods on a text corpus are given, and the properties of the different methods are discussed. Finally, Section VI concludes with some remarks and future research directions.

II. PRELIMINARIES

A. Particle Swarm Optimization

The canonical particle swarm optimization algorithm is simple in concept and easy to implement [18]–[21].
The basic equations are as follows:

vi ← w·vi + c1·rand()·(pi − xi) + c2·Rand()·(pg − xi)    (1)

xi ← xi + vi    (2)
where w denotes the inertia weight and usually is less than 1, c1 and c2 are two positive acceleration constants, rand() and Rand() are two random functions that generate uniformly distributed random numbers in the range [0, 1], xi represents the ith particle's position, vi represents the ith particle's velocity, pi is termed the personal best, which refers to the best position found by the ith particle, and pg is termed the local best, which refers to the best position found so far by the members of the ith particle's neighborhood.

Random variables are frequently utilized in swarm optimization algorithms; the length of the search step is not determined in advance. This places PSO in the interesting class of algorithms known as randomized algorithms. A randomized algorithm does not guarantee an exact result but instead provides a high-probability guarantee that it will return the correct answer or one close to it. The result of the optimization may differ from run to run, but the algorithm has a high probability of finding a "good enough" solution.

The basic procedure of PSO is shown as Algorithm 1. A particle updates its velocity according to equation (1) and its position according to equation (2). The c1·rand()·(pi − xi) part can be seen as cognitive behavior, while the c2·Rand()·(pg − xi) part can be seen as social behavior. In particle swarm optimization, a particle learns not only from its own experience but also from its companions: a particle's next position is determined by its own experience and by its neighbors' experience [22].

Algorithm 1 The basic procedure of particle swarm optimization
1: Initialize velocity and position randomly for each particle in every dimension
2: while a "good enough" solution has not been found and the maximum number of iterations has not been reached do
3:   Calculate each particle's fitness value
4:   Compare the fitness value of the current position with that of the best position in history (personal best, termed pbest); for each particle, if the fitness value of the current position is better than pbest, update pbest to the current position
5:   Select the particle with the best fitness value in the current particle's neighborhood; this particle is called the neighborhood best (termed nbest). If the neighborhood includes all particles, the neighborhood best is the global best (termed gbest); otherwise, it is the local best (termed lbest)
6:   for each particle do
7:     Update the particle's velocity and position according to equations (1) and (2), respectively
8:   end for
9: end while

B. Similarity Metric

A simple approach to text categorization is to compute the similarity between the query text q and all database texts in order to find the k best texts. The k nearest neighbor (KNN) method is a very intuitive method that classifies unlabeled examples based on their similarity or distance to examples in the training set. The degree of similarity of two or more objects can be measured by a distance over certain properties. Similarity is defined as a mapping from two vectors x and y to the interval [0, 1]. The Cosine measure is utilized to calculate the similarity between a labeled example and the test example [23]. For two vectors x and y, the Cosine measure is

sim(x, y) = ( Σ_{i=1}^{k} xi·yi ) / ( sqrt(Σ_{i=1}^{k} xi^2) · sqrt(Σ_{i=1}^{k} yi^2) )

If x and y are two document vectors, then

sim(x, y) = (x · y) / (||x|| ||y||)

where · indicates the vector dot product, x · y = Σ_{i=1}^{k} xi·yi, and ||x|| is the length of vector x, ||x|| = sqrt(Σ_{i=1}^{k} xi^2) = sqrt(x · x).
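For concreteness, the Cosine measure above can be computed in a few lines of code. The following is a minimal sketch; the function name and the toy term-frequency vectors are ours and not part of the paper.

```python
import math

def cosine_similarity(x, y):
    """Cosine measure between two equal-length term-frequency vectors."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # an empty document shares no terms with any other document
    return dot / (norm_x * norm_y)

# Two toy term-frequency vectors over the same four-word vocabulary.
doc_a = [2, 0, 1, 3]
doc_b = [1, 1, 0, 2]
print(cosine_similarity(doc_a, doc_b))  # ~0.873
```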
The cosine similarity is a measure of the (cosine of the) angle between x and y. Thus, if the cosine similarity is 1, the angle between x and y is 0°, and x and y are the same except for magnitude (length). If the cosine similarity is 0, the angle between x and y is 90°, and they do not share any common terms (words).

In the Chinese text categorization process, the texts are formed as sets of words, and each word represents one dimension in the vector space. The vector space model ("vector model") is an algebraic model for representing text documents as vectors of identifiers. The Cosine metric is utilized to measure the similarity between the test example (a vector) and the examples in the training set.

III. TEXT CATEGORIZATION PROBLEMS

The task of text categorization is to classify documents into a fixed number of predefined classes. Depending on the requirement or application, a document may be assigned to one class or to multiple classes. A class is considered a semantic category that groups documents which have certain properties in common. Generally, a document can belong to multiple classes, exactly one class, or no class. Yet, with the task of information filtering in mind, i.e., the categorization of documents as either relevant or non-relevant, we assume that each document is assigned to exactly one class. More precisely, we define the text categorization problem as follows [24].
Assume a space of textual documents D and a fixed set of k classes C = {c1, · · · , ck}, which implies a disjoint, exhaustive partition of D. Text categorization is a mapping, f : D → C, from the document space onto the set of classes.

For text categorization, texts are first converted into a collection of words. This requires preprocessing before a classifier can work on the text. The preprocessing includes text preprocessing, Chinese word segmentation, and feature selection.

A. Text Preprocessing

One important step in text preprocessing is tokenization, or punctuation cleaning [25], [26]. Because of the properties of the language, a single word contains no punctuation, so changing all punctuation marks to empty spaces is a useful way to simplify the text. After punctuation cleaning, the text becomes a set of short sentences, and the sentences become sets of words that can be looked up in a dictionary. The content of a text becomes a vector whose elements are the frequencies with which the individual words occur in the text. Each word is one dimension of the text space, and the text is represented in a vector space model; the distance and similarity between two texts can then be measured using this model.

B. The Chinese Word Segmentation

Chinese characters are examples of "ideograms", and many have specific logographic qualities or functions. There is no white space to separate the words in Chinese texts, and indexing of Chinese documents is impossible without a proper segmentation algorithm [27], [28]. The most frequently used method for segmenting a Chinese sentence is dictionary based: the text is split into terms that contain two or three Chinese characters, and if a term is found in the dictionary, the term is accepted as a word. The quality of the dictionary may affect the performance of a classifier, so constructing a proper dictionary is an important step in Chinese word segmentation. The contents of each text are transferred to a collection of words, each word is a dimension of the categorization space, and the frequency with which each word occurs in a text is a feature of the text.

C. Feature Selection

The number of different words that occur in documents is large even for relatively small documents such as short news articles or paper abstracts, and for large document collections it can be huge. The dimension of the bag-of-words feature space for a large collection can reach hundreds of thousands; moreover, the document representation vectors, although sparse, may still have hundreds or thousands of nonzero components. Most of those words are irrelevant to the categorization task and can be dropped with no harm to a classifier's performance; dropping them may even improve performance owing to noise reduction. Such words, like "an" or "the" in English, are termed stop words. The feature selection process removes these irrelevant words from the word collections.
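To make the pipeline of this section concrete, the sketch below strings together punctuation cleaning, a simple dictionary-based segmentation (greedy forward maximum matching over terms of up to three characters, one common dictionary strategy), stop-word removal, and term-frequency feature extraction. The function names, the dictionary, and the stop-word list passed in are illustrative assumptions, not the resources used in the paper.

```python
import re
from collections import Counter

def clean_punctuation(text):
    """Text preprocessing (Section III-A): turn every punctuation mark into a space."""
    return re.sub(r"[^\w\s]", " ", text)

def segment(chunk, dictionary, max_len=3):
    """Dictionary-based segmentation (Section III-B) by greedy forward maximum matching."""
    words, i = [], 0
    while i < len(chunk):
        for length in range(min(max_len, len(chunk) - i), 0, -1):
            candidate = chunk[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

def to_feature_vector(text, dictionary, stop_words):
    """Feature selection (Section III-C): term frequencies with stop words removed."""
    tokens = []
    for chunk in clean_punctuation(text).split():
        tokens.extend(segment(chunk, dictionary))
    return Counter(token for token in tokens if token not in stop_words)
```

In a real system the dictionary and stop-word list would be much larger, and the resulting sparse vectors would be aligned to a shared vocabulary before the Cosine measure of Section II-B is applied.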
IV. PARTICLE SWARM OPTIMIZATION BASED NEAREST NEIGHBOR

A. k Weighted Nearest Neighbor

The k nearest neighbor (KNN) classifier finds the k training examples that are most similar to the attributes of the test example. These examples, which are known as nearest neighbors, are used to determine the class label of the test example. Nearest neighbor classification, also termed the nearest neighbor rule, is a very intuitive method that categorizes unlabeled examples based on their similarities or distances to the labeled examples.

The nearest neighbor rule is a non-probabilistic classification procedure and was first formulated by Fix and Hodges [7]. They investigated a rule called the k-nearest-neighbor rule, which assigns to an unclassified sample the class most heavily represented among its k nearest neighbors. Cover and Hart [8] studied the properties of this rule most extensively, including the lower and upper bounds on the probability of error. For a problem with a parameter k and a collection of n correctly classified samples, it has been demonstrated that when k and n tend to infinity in such a manner that k/n → 0, the risk of this rule approaches the Bayes risk. The distance-weighted k nearest neighbor rule was proposed by Dudani [9]; it weighs the evidence of a neighbor close to an unclassified observation more heavily than the evidence of another neighbor at a greater distance from the unclassified observation.

A summary of the nearest neighbor categorization method is given in Algorithm 2. The algorithm computes the distance (or similarity) between each unclassified sample z = (x′, y′) and all the training examples (x, y) ∈ D to determine its nearest-neighbor list Dz. Such computation can be costly if the number of training examples is large. However, efficient indexing techniques are available to reduce the amount of computation needed to find the nearest neighbors of a test example.

Algorithm 2 The nearest neighbor categorization algorithm
1: Let k be the number of nearest neighbors and D be the set of training examples.
2: for each unclassified sample z = (x′, y′) do
3:   Compute d(x′, x), the distance between the test example x′ and every training example x, (x, y) ∈ D.
4:   Select Dz ⊆ D, the set of the k training examples closest to x′.
5:   The prediction is y′ = arg max_v Σ_{(xi, yi) ∈ Dz} I(v = yi)
6: end for

Once the nearest-neighbor lists are obtained, the test example is classified based on the majority class of its nearest neighbors:

Majority voting:  y′ = arg max_v Σ_{(xi, yi) ∈ Dz} I(v = yi)
where v is a class label, yi is the class label of one of the nearest neighbors, and I(·) is an indicator function that returns 1 if its argument is true and 0 otherwise.

In the majority voting approach, every neighbor has the same impact on the categorization, which makes the algorithm sensitive to the choice of k. The k weighted nearest neighbor (KWNN) method places a weight on each neighbor. Distance-weighted voting is a straightforward way to weight each neighbor: a training example located far away from the unlabeled example has a weaker impact on the categorization result than one at a short distance. The rule is defined as follows:

y′ = arg max_v Σ_{(xi, yi) ∈ Dz} wi · I(v = yi)
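The two voting rules can be illustrated with a short sketch. Here the weight wi of a neighbor is simply taken to be its cosine similarity to the test example, which is one common instantiation of distance-weighted voting; the function names and data layout are ours, not the paper's implementation.

```python
from collections import defaultdict

def k_nearest(test_vec, training_set, k, similarity):
    """Return the k labeled training examples most similar to test_vec.

    training_set is a list of (vector, label) pairs; similarity is a function
    such as the Cosine measure of Section II-B (larger means closer)."""
    ranked = sorted(training_set,
                    key=lambda example: similarity(test_vec, example[0]),
                    reverse=True)
    return ranked[:k]

def majority_vote(neighbors):
    """Plain k nearest neighbor rule: every neighbor carries the same weight."""
    votes = defaultdict(int)
    for _, label in neighbors:
        votes[label] += 1
    return max(votes, key=votes.get)

def weighted_vote(neighbors, test_vec, similarity):
    """k weighted nearest neighbor rule: more similar neighbors carry more weight."""
    votes = defaultdict(float)
    for vec, label in neighbors:
        votes[label] += similarity(test_vec, vec)  # weight w_i = similarity to the test example
    return max(votes, key=votes.get)
```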
Nearest neighbor categorization is part of a more general technique known as instance-based (example-based) learning, which uses specific training instances to make predictions without maintaining an abstraction (or model) derived from the data. Instance-based learning algorithms require a proximity measure to determine the similarity or distance between instances and a categorization function that returns the predicted class of a test instance based on its proximity to other instances.

Learning algorithms can be divided into lazy learning and eager learning [29]. Lazy learning, which includes instance-based learning and nearest neighbor classifiers, simply stores the training data (with at most minor processing) and waits until it is given a test tuple. Lazy learning does not require model building; however, classifying a test example can be quite expensive because the proximity values between the test example and every training example must be computed individually. For this reason, the number of training examples cannot be too large, otherwise the categorization process becomes inefficient [6]. In contrast, eager learning, which includes decision trees, support vector machines, neural networks, etc., constructs a classification model from the training set before receiving new (i.e., test) data to classify. Eager learners often spend the bulk of their computing resources on model building, but once a model has been built, classifying a test example is extremely fast.

B. k Value Optimization

The k weighted nearest neighbor algorithm is given in Algorithm 3. This algorithm is easy to implement; it only requires an integer k, a set of labeled examples (training data), and a metric to measure "closeness". The number of labeled examples is set to 100 for each category, and the Cosine metric is utilized to measure the similarity of texts.

The performance of the categorization is affected by the setting of the parameter k, so it is important to choose the right value of k. If k is too small, the nearest neighbor classifier may be susceptible to overfitting because of noise in the training data. On the other hand, if k is too large, the nearest neighbor classifier may misclassify the test instance because its list of nearest neighbors may include data points that are located far away from its neighborhood.
Algorithm 3 The k value optimized nearest neighbor categorization algorithm
1: Let k be the number of nearest neighbors and D be the set of training examples.
2: Initialize random k values in the swarm; each particle is a classifier.
3: while a "good enough" solution has not been found and the maximum number of iterations has not been reached do
4:   for each unclassified sample z = (x′, y′) do
5:     Compute d(x′, x), the distance between the test example x′ and every training example x, (x, y) ∈ D.
6:     Select Dz ⊆ D, the set of the k training examples closest to x′.
7:     The prediction is y′ = arg max_v Σ_{(xi, yi) ∈ Dz} I(v = yi)
8:   end for
9:   Calculate the number of wrong predictions and update the k of each classifier.
10: end while
The fitness function of text categorization is defined as follows when a similarity measure is utilized in the classifier [6]; documents are likely to belong to the same class if they have higher similarity:

f(x) = max Σ_{i=1}^{k} sim(xli, xu)
where xli is a labeled document and xu is an unlabeled document.

In most experiments the value of k is set from experience [15], and finding the best k by brute-force search is ineffective. The particle swarm optimization algorithm is very effective for optimizing the parameter k, and the minimum categorization error can be obtained with the optimized k. The objective of the optimization is to obtain the minimum categorization error:

f(k) = min( wrong predictions / number of predictions )
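As a concrete illustration of this objective, using the overall result for k = 5 reported later in Table II, where 342 of the 2816 test documents are wrongly predicted:

f(5) = 342 / 2816 ≈ 0.1214

which is exactly the total error rate listed in that column of the table.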
Different categorization errors have different risks or loss functions in some real categorization tasks. In some specific cases, a particular class is required to have the minimum categorization error; the objective of the categorization is then to obtain the minimum categorization error on that specific class. The objective function is as follows:

f(k) = min( wrong predictions in the specific class / number of predictions in the specific class )
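The following sketch shows one way Algorithm 3 and the objective above could be realized with the canonical PSO updates of equations (1) and (2). The helper error_rate(k), the inertia and acceleration values, and the caching of already evaluated k values are our assumptions (the swarm size of 5 and the 12 iterations follow the experimental settings reported in Section V); the class-specific objective is obtained by simply swapping in the corresponding error function.

```python
import random

def pso_optimize_k(error_rate, k_min=1, k_max=100,
                   swarm_size=5, iterations=12, w=0.7, c1=1.5, c2=1.5):
    """Search for the integer k that minimizes error_rate(k) with canonical PSO.

    error_rate(k) is assumed to run the (weighted) nearest neighbor classifier
    with parameter k and return the fraction of wrong predictions."""
    cache = {}  # re-use the fitness of k values that have already been evaluated

    def fitness(k):
        if k not in cache:
            cache[k] = error_rate(k)
        return cache[k]

    xs = [random.randint(k_min, k_max) for _ in range(swarm_size)]  # positions (k values)
    vs = [0.0] * swarm_size                                         # velocities
    pbest = xs[:]                                                   # personal bests
    gbest = min(pbest, key=fitness)                                 # global best

    for _ in range(iterations):
        for i in range(swarm_size):
            # velocity and position updates, equations (1) and (2)
            vs[i] = (w * vs[i]
                     + c1 * random.random() * (pbest[i] - xs[i])
                     + c2 * random.random() * (gbest - xs[i]))
            xs[i] = max(k_min, min(k_max, int(round(xs[i] + vs[i]))))
            if fitness(xs[i]) < fitness(pbest[i]):
                pbest[i] = xs[i]
        gbest = min(pbest, key=fitness)
    return gbest, fitness(gbest)
```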
C. k Value and Labeled Examples Optimization

The nearest neighbor method requires a parameter k and a set of labeled examples. The error rate of the predictions is reduced by increasing the number of labeled examples; however, for many large scale learning problems, acquiring a large amount of labeled training data is difficult and time consuming. Different classes have different "hardness" of categorization, so with a limited number of labeled examples, giving each class the same number of labeled examples does not reach the minimum categorization error.

Algorithm 4 The k value and labeled examples optimized nearest neighbor categorization algorithm
1: Let k be the number of nearest neighbors and D be the set of training examples.
2: Initialize a random k and random numbers of labeled examples for each class; each particle is a classifier.
3: while a "good enough" solution has not been found and the maximum number of iterations has not been reached do
4:   for each unclassified sample z = (x′, y′) do
5:     Compute d(x′, x), the distance between the test example x′ and every training example x, (x, y) ∈ D.
6:     Select Dz ⊆ D, the set of the k training examples closest to x′.
7:     The prediction is y′ = arg max_v Σ_{(xi, yi) ∈ Dz} I(v = yi)
8:   end for
9:   Calculate the number of wrong predictions and update the k and the numbers of labeled examples for each class.
10: end while

The number of labeled examples and the parameter k should be optimized together to reach the minimum categorization error. The k value and labeled examples optimized algorithm is given in Algorithm 4. The categorization process is formulated as a discrete, constrained, single objective optimization problem in which the dimensions of the solution space depend on each other. The objective of the optimization is to obtain the minimum categorization error:

f(k, ni) = min( wrong predictions / number of predictions )    (3)

subject to:

ni ≤ N    (4)

where ni is the number of labeled examples for each class and N is the maximum number of labeled examples.

V. EXPERIMENTAL RESULTS AND ANALYSIS

A. Categorization Corpus

The test corpus is given in Table I; it has 10 categories and 2816 news articles in total. The documents are distributed unequally over the categories: the class "Politics" has the most articles (505), while the class "Computer" has only 200 articles.

In the following, the number of categorization errors and the error rate are utilized to measure the performance of categorization. The error rate is a straightforward way to measure the performance of a categorization method; it represents the percentage of wrongly classified patterns of the test data sets. The error rate is defined as follows:

Error rate = wrong predictions / total number of predictions.

TABLE I
THE TEST CORPUS USED IN OUR EXPERIMENTAL STUDY. THIS CORPUS HAS 2816 TEXTS IN TOTAL, AND DIFFERENT CATEGORIES HAVE DIFFERENT NUMBERS OF TEXTS.

Item   Categories     Numbers
1      Transport        214
2      Sports           450
3      Military         249
4      Medicine         204
5      Politics         505
6      Education        220
7      Environment      201
8      Economics        325
9      Art              248
10     Computer         200
       Total           2816

B. k Weighted Nearest Neighbor

Table II gives the categorization results of the k weighted nearest neighbor classifier. The number of wrongly categorized documents and the error rate are given when the number of training examples of each category is 100. The parameter k is set within the range [1, 100]; Table II reports the categorization results for k set to 1, 5, 11, 20, 30, 40, 50, and 100, respectively. Figure 1 gives a more intuitive view of the categorization results of the weighted nearest neighbor algorithm with the different settings of k.

From the changing curves of the error rate, some conclusions can be drawn. The error rate does not change linearly with increasing k. Also, for a specific class, a different k should be set to reach the minimum error rate.

C. k Value Optimization
In this experiment, the population size is 5, the maximum number of iterations is 12, and 5 independent runs are performed. The variable of this optimization is k. The number of evaluations can be reduced by recording each evaluated k and its corresponding fitness value; the fitness value is returned directly if the k has already been evaluated. The number of evaluations is thereby reduced to roughly 20 to 30 per run.

Table III gives the categorization results of the k nearest neighbor classifier; the number of wrongly categorized documents and the error rate are given when the number of training examples of each category is 100. Figure 2 displays the performance of particle swarm optimization in the nearest neighbor method. The solutions converged to the minimum error very quickly in all five runs. The approach is very effective at finding the optimized k that reaches the minimum error, either for the whole categorization process or for a specified class.

D. k Value and Labeled Examples Optimization

In this experiment, the population size is 10, the maximum number of iterations is 100, and 5 independent runs are performed. Table IV gives the categorization results of the nearest neighbor method with the k value and the numbers of labeled examples optimized. Figure 3 displays the performance of particle swarm optimization used to optimize k and the numbers of labeled examples in the nearest neighbor method. From the figure, it can be concluded that the speed of convergence is very fast; in general, the algorithm reaches a local optimum after about 30 iterations.
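Before turning to the result tables, the sketch below shows how the discrete, constrained search of Algorithm 4 might be encoded for the experiment just described: each particle carries the integer k together with one labeled-example count per class, and positions that violate the bound ni ≤ N are repaired by clamping. The encoding, the repair strategy, the helper error_rate(position), and the inertia and acceleration values are our assumptions; the swarm size of 10 and the 100 iterations follow the settings above.

```python
import random

def pso_optimize_k_and_examples(error_rate, num_classes, max_examples,
                                k_max=100, swarm_size=10, iterations=100,
                                w=0.7, c1=1.5, c2=1.5):
    """Jointly search k and the per-class numbers of labeled examples.

    error_rate(position) is assumed to train the nearest neighbor classifier with
    k = position[0] and position[1 + c] labeled examples for class c, and to
    return the overall error rate on the test documents."""
    dims = 1 + num_classes
    low = [1] * dims
    high = [k_max] + [max_examples] * num_classes

    def clamp(position):
        # repair: round to integers and keep every dimension inside its bounds
        return [max(l, min(h, int(round(p)))) for p, l, h in zip(position, low, high)]

    xs = [clamp([random.uniform(l, h) for l, h in zip(low, high)])
          for _ in range(swarm_size)]
    vs = [[0.0] * dims for _ in range(swarm_size)]
    pbest = [x[:] for x in xs]
    pbest_fit = [error_rate(x) for x in xs]
    best = min(range(swarm_size), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[best][:], pbest_fit[best]

    for _ in range(iterations):
        for i in range(swarm_size):
            for d in range(dims):  # equations (1) and (2), dimension by dimension
                vs[i][d] = (w * vs[i][d]
                            + c1 * random.random() * (pbest[i][d] - xs[i][d])
                            + c2 * random.random() * (gbest[d] - xs[i][d]))
            xs[i] = clamp([x + v for x, v in zip(xs[i], vs[i])])
            fit = error_rate(xs[i])
            if fit < pbest_fit[i]:
                pbest[i], pbest_fit[i] = xs[i][:], fit
                if fit < gbest_fit:
                    gbest, gbest_fit = xs[i][:], fit
    return gbest, gbest_fit
```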
TABLE II
THE CATEGORIZATION RESULTS OF THE NEAREST NEIGHBOR METHOD WITH DIFFERENT k. THE NUMBER OF LABELED EXAMPLES IS 100 FOR EACH CLASS. EACH CELL GIVES THE NUMBER OF WRONGLY CATEGORIZED DOCUMENTS / THE ERROR RATE.

Categories (size)    k=1            k=5            k=11           k=20           k=30           k=40           k=50           k=100
Transport (214)      23 / 0.10747   41 / 0.19158   47 / 0.21962   71 / 0.33177   77 / 0.35981   79 / 0.36915   81 / 0.37850   98 / 0.45794
Sports (450)         18 / 0.04      14 / 0.03111   20 / 0.04444   18 / 0.04      23 / 0.05111   23 / 0.05111   20 / 0.04444   20 / 0.04444
Military (249)       47 / 0.18875   39 / 0.15662   39 / 0.15662   48 / 0.19277   50 / 0.20080   55 / 0.22088   60 / 0.24096   77 / 0.30923
Medicine (204)       50 / 0.24509   49 / 0.24019   51 / 0.25      56 / 0.27450   58 / 0.28431   58 / 0.28431   59 / 0.28921   68 / 0.33333
Politics (505)       65 / 0.12871   35 / 0.06930   24 / 0.04752   17 / 0.03366   12 / 0.02376    9 / 0.01782    7 / 0.01386    6 / 0.01188
Education (220)      37 / 0.16818   27 / 0.12272   27 / 0.12272   26 / 0.11818   24 / 0.10909   24 / 0.10909   22 / 0.1       22 / 0.1
Environment (201)    40 / 0.19900   34 / 0.16915   37 / 0.18407   36 / 0.17910   35 / 0.17412   35 / 0.17412   36 / 0.17910   40 / 0.19900
Economics (325)      99 / 0.30461   80 / 0.24615   81 / 0.24923   80 / 0.24615   84 / 0.25846   89 / 0.27384   89 / 0.27384   98 / 0.30153
Art (248)            21 / 0.08467    9 / 0.03629   13 / 0.05241   18 / 0.07258   19 / 0.07661   21 / 0.08467   19 / 0.07661   29 / 0.11693
Computer (200)       20 / 0.1       14 / 0.07      13 / 0.065     15 / 0.075     20 / 0.1       25 / 0.125     28 / 0.14      28 / 0.14
Total (2816)        420 / 0.14914  342 / 0.12144  352 / 0.125    385 / 0.13671  402 / 0.14275  418 / 0.14843  421 / 0.14950  486 / 0.17258
Fig. 1. The error rate of the nearest neighbor method with different k: the black ∗, +, and ∇ represent the 'Transport', 'Sports', and 'Military' classes, the blue ∗, +, and ∇ represent the 'Medicine', 'Politics', and 'Education' classes, the red ∗, +, and ∇ represent the 'Environment', 'Economics', and 'Art' classes, and the green markers represent the 'Computer' class and the total error rate, respectively. The number of labeled examples is 100 for each class.
Fig. 2. The performance of particle swarm optimization used to optimize k in the nearest neighbor method. Each class has 100 labeled examples, and an optimized k is found to reach the minimum categorization error: (a) minimum error for the whole categorization, (b) minimum error for the class "Computer".
TABLE III
THE MINIMUM ERROR RATE FOR EACH CLASS AND FOR THE WHOLE CATEGORIZATION PROCESS. WITH THE UTILIZATION OF PARTICLE SWARM OPTIMIZATION IN THE NEAREST NEIGHBOR METHOD, DIFFERENT SETTINGS OF k ARE FOUND TO REACH THE MINIMUM ERROR RATE.

Categories     Number   k             error   error rate
Transport        214    1               23    0.107476
Sports           450    3               13    0.028888
Military         249    10              39    0.156626
Medicine         204    2               46    0.225490
Politics         505    49, 99, 100      6    0.011881
Education        220    77              21    0.095454
Environment      201    3, 28           33    0.164179
Economics        325    10              73    0.224615
Art              248    5                9    0.036290
Computer         200    10, 13          11    0.055
Total           2816    10             338    0.120028

Fig. 3. The performance of particle swarm optimization used to optimize k and the numbers of labeled examples in the nearest neighbor method.

TABLE IV
THE MINIMUM ERROR RATE FOR THE CATEGORIZATION. THE OPTIMIZED k = 9, AND DIFFERENT NUMBERS OF LABELED EXAMPLES ARE FOUND TO REACH THE MINIMUM CATEGORIZATION ERROR.

Categories     Number   Examples   error   error rate
Transport        214       107       32    0.149532
Sports           450        50       30    0.066666
Military         249       106       21    0.084337
Medicine         204       156       11    0.053921
Politics         505        68       44    0.087128
Education        220       114       24    0.109090
Environment      201        80       37    0.184079
Economics        325        95       69    0.212307
Art              248        92        8    0.032258
Computer         200       126        5    0.025
Total           2816       994      281    0.099786

VI. CONCLUSIONS
The nearest neighbor method is a simple and effective categorization algorithm; however, the parameter k and the number of labeled examples affect the performance of the categorization. For large scale categorization problems, labeled training data is difficult to acquire, and because different classes have different "hardness" of categorization, the numbers of labeled examples in the classes should differ when the number of examples is limited. Finding the proper k and setting the numbers of labeled examples are therefore important in the nearest neighbor method.

In this paper, we have utilized particle swarm optimization in a nearest neighbor classifier to solve the Chinese text categorization problem. The parameter k was first optimized to obtain the minimum error; then the categorization problem
was formulated as a discrete, constrained, single objective optimization problem in which the variables in the solution space depend on each other. The parameter k and the number of labeled examples for each class were optimized together to reach the minimum categorization error. In the experiments, with the utilization of particle swarm optimization, the performance of the nearest neighbor method was improved and the minimum categorization error rate was obtained.

The error rate is utilized in this paper to measure the performance of the different classifiers, but there are many conflicting objectives in the text categorization task [30]. Besides the error rate, for large scale categorization problems the number of labeled examples is also important for prediction, and the error rate and the number of labeled examples conflict with each other. Reaching the minimum error rate with limited labeled examples, finding the minimum number of labeled examples that reaches an expected error rate, or finding an acceptable number of labeled examples that reaches a satisfactory error rate are common problems in categorization. These problems can be solved by merging them into a multi-objective problem: in multi-objective optimization, the number of labeled examples and the error rate of the predictions can be considered at the same time, and a proper classifier can be found to suit different situations.

Testing the classifier on different Chinese text corpora, comparing this method with other classification methods such as the support vector machine, and modeling text categorization as a multi-objective problem, e.g., finding the minimum error rate and the minimum number of labeled examples concurrently, remain promising directions for future research.

ACKNOWLEDGMENT

The authors' work is partially supported by the National Natural Science Foundation of China under Grants No. 60975080 and No. 61273367, and by Suzhou Science and Technology Project Grant SYJG0919.

REFERENCES

[1] R. Eberhart and J. Kennedy, "A new optimizer using particle swarm theory," in Proceedings of the Sixth International Symposium on Micro Machine and Human Science, 1995, pp. 39–43.
[2] J. Kennedy and R. Eberhart, "Particle swarm optimization," in Proceedings of IEEE International Conference on Neural Networks (ICNN), 1995, pp. 1942–1948.
[3] R. Eberhart and Y. Shi, "Particle swarm optimization: Developments, applications and resources," in Proceedings of the 2001 Congress on Evolutionary Computation (CEC 2001), 2001, pp. 81–86.
[4] S. Cheng, Y. Shi, and Q. Qin, "Population diversity based study on search information propagation in particle swarm optimization," in Proceedings of 2012 IEEE Congress on Evolutionary Computation (CEC 2012). Brisbane, Australia: IEEE, 2012, pp. 1272–1279.
[5] A. Cervantes, I. M. Galván, and P. Isasi, "AMPSO: A new particle swarm method for nearest neighborhood classification," IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 39, no. 5, pp. 1082–1091, October 2009.
[6] S. Cheng, Y. Shi, and Q. Qin, "Particle swarm optimization based semi-supervised learning on Chinese text categorization," in Proceedings of 2012 IEEE Congress on Evolutionary Computation (CEC 2012). Brisbane, Australia: IEEE, 2012, pp. 3131–3198.
[7] E. Fix and J. L. Hodges, "Discriminatory analysis–nonparametric discrimination: Consistency properties," USAF School of Aviation Medicine, Randolph Field, Texas, Tech. Rep. Project Number 21-49004, February 1951.
[8] T. M. Cover and P. E. Hart, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, January 1967.
[9] S. A. Dudani, "The distance-weighted k-nearest-neighbor rule," IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-6, no. 4, pp. 325–327, April 1976.
[10] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., ser. Springer Series in Statistics. Springer, February 2009.
[11] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, "Locality-sensitive hashing scheme based on p-stable distributions," in Proceedings of the 20th ACM Symposium on Computational Geometry, J. Snoeyink and J.-D. Boissonnat, Eds. Brooklyn, New York, USA: ACM, June 2004, pp. 253–262.
[12] G. Shakhnarovich, T. Darrell, and P. Indyk, Eds., Nearest-Neighbor Methods in Learning and Vision: Theory and Practice, ser. Neural Information Processing Series. The MIT Press, March 2006.
[13] M. Slaney and M. Casey, "Locality-sensitive hashing for finding nearest neighbors," IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 128–131, March 2008.
[14] B. Kulis and K. Grauman, "Kernelized locality-sensitive hashing," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 6, pp. 1092–1104, June 2012.
[15] X. Li, H. Yan, and J. Wang, Search Engine: Principle, Technology and Systems. Science Press, 2004.
[16] T. Nakashima and H. Ishibuchi, "GA-based approaches for finding the minimum reference set for nearest neighbor classification," in Proceedings of the 1998 IEEE International Conference on Evolutionary Computation (CEC 1998), 1998, pp. 709–714.
[17] H. Ishibuchi and T. Nakashima, "Evolution of reference sets in nearest neighbor classification," in Simulated Evolution and Learning, ser. Lecture Notes in Computer Science, B. McKay, X. Yao, C. S. Newton, J.-H. Kim, and T. Furuhashi, Eds. Springer Berlin / Heidelberg, 1999, vol. 1585, pp. 82–89.
[18] Y. Shi and R. Eberhart, "A modified particle swarm optimizer," in Proceedings of the 1998 Congress on Evolutionary Computation (CEC 1998), 1998, pp. 69–73.
[19] ——, "Empirical study of particle swarm optimization," in Proceedings of the 1999 Congress on Evolutionary Computation (CEC 1999), 1999, pp. 1945–1950.
[20] J. Kennedy, R. Eberhart, and Y. Shi, Swarm Intelligence, 1st ed. Morgan Kaufmann Publisher, 2001.
[21] R. Eberhart and Y. Shi, Computational Intelligence: Concepts to Implementations, 1st ed. Morgan Kaufmann Publisher, 2007.
[22] S. Cheng, Y. Shi, and Q. Qin, "Experimental study on boundary constraints handling in particle swarm optimization: From population diversity perspective," International Journal of Swarm Intelligence Research (IJSIR), vol. 2, no. 3, pp. 43–69, 2011.
[23] P. N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Addison Wesley, 2006.
[24] C. Lanquillon, "Enhancing text classification to improve information filtering," Ph.D. dissertation, DaimlerChrysler AG, Research & Technology, December 2001.
[25] Y. Liu, M. Zhang, L. Ru, and S. Ma, "Data cleansing for web information retrieval using query independent features," Journal of The American Society for Information Science and Technology, vol. 58, no. 12, pp. 1884–1898, 2007.
[26] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press, 2008.
[27] Z. Wu and G. Tseng, "Chinese text segmentation for text retrieval: Achievements and problems," Journal of The American Society for Information Science, vol. 44, no. 9, pp. 532–542, 1993.
[28] F. L. Wang and C. C. Yang, "Mining web data for Chinese segmentation," Journal of The American Society for Information Science and Technology, vol. 58, no. 12, pp. 1820–1837, 2007.
[29] D. W. Aha, Lazy Learning. Kluwer Academic Publishers, 1997.
[30] R. Davis, H. Shrobe, and P. Szolovits, "What is a knowledge representation?" AI Magazine, vol. 14, no. 1, pp. 17–33, Spring 1993.