A New Machine Learning Technique Based on Straight Line Segments

João Henrique Burckas Ribeiro
Departamento de Ciência da Computação
Instituto de Matemática e Estatística
Universidade de São Paulo, Brazil
[email protected]

Ronaldo Fumio Hashimoto
Departamento de Ciência da Computação
Instituto de Matemática e Estatística
Universidade de São Paulo, Brazil
[email protected]

Abstract

This paper presents a new supervised machine learning technique based on distances between points and straight line segments. Basically, given a training data set, this technique estimates a function whose value is calculated from the distances between points and two sets of straight line segments. A training algorithm has been developed to find the sets of straight line segments that minimize the mean square error. This technique has been applied to two real pattern recognition problems: (1) the breast cancer data set, to classify tumors as benign or malignant; and (2) the wine data set, to classify wines into one of the three different cultivars from which they could be derived. The technique was also tested on two artificial data sets in order to show its ability to solve function approximation problems. The obtained results show that this technique performs well on all of these problems, and they indicate that it is a good candidate for use in Machine Learning applications.

1 Introduction

Machine Learning is a field that can be divided into a broad range of categories, such as supervised learning, unsupervised learning, semi-supervised learning, analytical learning, reinforcement learning and active learning. Supervised learning [4, 5] deals with learning a function from labeled data sets. Unsupervised learning (or clustering) [4, 6] is concerned with algorithms that form clusters or groupings to learn patterns or associations from data sets that have no attached class labels. Semi-supervised learning [3, 13, 17] deals with data sets that are a combination of labeled and unlabeled examples. Analytical learning [10] uses data sets that have background knowledge or a domain theory instead of labeled examples. Reinforcement learning [14] deals with algorithms that learn a control policy through reinforcement from an environment. Active learning [7, 12] is concerned with unlabeled data sets that can be labeled in a sequential process, in the sense that the corresponding labeled examples contribute to obtaining a more accurate function. This paper focuses on supervised machine learning. It is known that supervised learning has a large variety of applications, such as speech recognition, optical character recognition (OCR), text classification, handwriting recognition, human face image recognition, industrial inspection and medical diagnosis [4, 5, 11]. The major supervised machine learning techniques are linear classifiers, k-nearest neighbor, neural networks, Bayesian networks, decision trees and support vector machines [4, 5]. In this paper, a new supervised machine learning technique (a novel algorithm) based on straight line segments (SLSs) is presented. This technique finds a function that best fits the data set based on the distance between points and two sets of SLSs. It has been applied to two real Pattern Recognition problems [1, 16], and the results have shown that this technique is a good candidate for supervised machine learning applications. Following this brief introduction, Section 2 gives the mathematical foundations of this new technique, while in Section 3 we present a training algorithm for learning the classifier from the data set. Section 4 shows some experimental results and, finally, in Section 5, we present the conclusion of this work and our future research.

2 Mathematical Foundations

This section presents some mathematical foundations necessary for this paper. Let p, q ∈ Rd+1. The SLS Lp,q with extremities p and q is defined as:

Lp,q = {x ∈ Rd+1 : x = p + λ · (q − p), 0 ≤ λ ≤ 1}        (1)


For notational simplicity, we will use just L to indicate the SLS Lp,q whenever the extremity points p and q are clear from the context. Given a point x ∈ Rd, we define the extension of x as the point xe ∈ Rd+1 such that xe = (x, 0), that is, the point x (with dimension d) is extended to dimension d + 1 by adding an extra coordinate with value zero. Given a point x ∈ Rd and a SLS L with extremities p, q ∈ Rd+1, we define the pseudo-distance between x and L in the following way:

distP(x, L) = [dist(xe, p) + dist(xe, q) − dist(p, q)] / 2        (2)

where dist(a, b) is the Euclidean distance between the points a ∈ Rd+1 and b ∈ Rd+1. The pseudo-distance distP(x, L) is not exactly the Euclidean distance between x and L, but it equals zero if xe ∈ L. In addition, the farther xe is from L, the greater distP(x, L) is. Note that, if p = q, then distP(x, L) = dist(xe, p) = dist(xe, q). Let L denote a collection of SLSs, that is, L = {Lpi,qi : pi, qi ∈ Rd+1, i = 1, . . . , n}. Let L0 and L1 be two collections of SLSs. Given a point x ∈ Rd, let us define T(x, L0, L1) as

T(x, L0, L1) = Σ_{L ∈ L1} 1/(distP(x, L) + ε) − Σ_{L ∈ L0} 1/(distP(x, L) + ε)        (3)

where ε is a small positive number to avoid division by zero. Note that, assuming that all SLSs in L1 are “far” from all SLSs in L0, as the point xe gets “near” to the SLSs in L0, from Eq. 3, T(x, L0, L1) tends to −∞. On the other hand, if xe is “near” to the SLSs in L1, T(x, L0, L1) tends to +∞. Thus, if L = {L0, L1}, consider the following function yL : Rd → [0, 1]:

yL(x) = 1 / (1 + e^{−T(x, L0, L1)})        (4)

Note that yL(x) is a sigmoid with respect to T(x, L0, L1), that is,

• as T(x, L0, L1) → 0 (i.e., xe tends to be equidistant from the SLSs in L0 and in L1), yL(x) → 0.5;

• as T(x, L0, L1) → −∞ (i.e., xe tends to be “near” to the SLSs in L0 and “far” from all SLSs in L1), yL(x) → 0; and

• as T(x, L0, L1) → +∞ (i.e., xe tends to be “near” to the SLSs in L1 and “far” from all SLSs in L0), yL(x) → 1.

Eq. 4 is able to approximate complex functions even with a relatively small set of SLSs. In addition, the VC-dimension [15] of the classifier derived from Eq. 4 is directly proportional to the number of SLSs. This implies that, with this approach, one can control (by setting the number of SLSs) two desired features of supervised machine learning techniques: computational efficiency and a good balance between generalization and overfitting. This paper deals with two types of problems in the Machine Learning field: function approximation and supervised classification. These problems are defined in the following way: given a sample set S = {(xi, yi) : i = 1, 2, . . . , n; xi ∈ Rd; yi ∈ [0, 1]}, which comes from an unknown function f : Rd → [0, 1] (function approximation problem) or from an unknown joint probability distribution for the pair of random variables (x, y) ∈ Rd × [0, 1] (supervised classification problem), it is desired to find the function yL (or equivalently, two sets of SLSs, L0 and L1) that minimizes a certain error function.
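To make these definitions concrete, the following C++ sketch implements Eqs. 1–4 directly. The type names (Point, SLS), the helper extend, and the constant EPS are our own naming choices and are not taken from the paper's implementation.

#include <cmath>
#include <vector>

using Point = std::vector<double>;            // a point in R^d or R^{d+1}

struct SLS { Point p, q; };                   // straight line segment with extremities p and q (Eq. 1)

// Euclidean distance between two points of R^{d+1}.
double dist(const Point& a, const Point& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(s);
}

// Extension of x in R^d to R^{d+1} by appending a zero coordinate.
Point extend(const Point& x) {
    Point xe = x;
    xe.push_back(0.0);
    return xe;
}

// Pseudo-distance between a point x in R^d and an SLS L (Eq. 2).
double distP(const Point& x, const SLS& L) {
    Point xe = extend(x);
    return (dist(xe, L.p) + dist(xe, L.q) - dist(L.p, L.q)) / 2.0;
}

const double EPS = 1e-9;                      // the small positive epsilon of Eq. 3 (value is our assumption)

// T(x, L0, L1) as in Eq. 3.
double T(const Point& x, const std::vector<SLS>& L0, const std::vector<SLS>& L1) {
    double t = 0.0;
    for (const SLS& L : L1) t += 1.0 / (distP(x, L) + EPS);
    for (const SLS& L : L0) t -= 1.0 / (distP(x, L) + EPS);
    return t;
}

// y_L(x) = 1 / (1 + exp(-T(x, L0, L1))) as in Eq. 4.
double yL(const Point& x, const std::vector<SLS>& L0, const std::vector<SLS>& L1) {
    return 1.0 / (1.0 + std::exp(-T(x, L0, L1)));
}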

3 Training Algorithm

The main objective of the training algorithm presented in this section is to solve the function approximation or the supervised classification problem. Given a sample set S, if L = {L0, L1}, then an estimated error of the function yL can be defined as follows:

ei(yL) = yL(xi) − yi        (5)

E(yL) = (1/n) Σ_{i=1}^{n} [ei(yL)]²        (6)

Eq. 5 gives the error between the estimated function at the point xi and the corresponding ideal value yi. Note that the sign of ei(yL) indicates whether yL(xi) > yi or yL(xi) < yi. Eq. 6 is the estimated error of yL, where n is the number of samples, that is, n = |S|. The training algorithm uses a heuristic to minimize Eq. 6; obviously, other error functions could be used.
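As a sketch, the estimated error of Eq. 6 can be computed directly from a sample set. The Sample struct and the generic model callable (standing for Eq. 4) are our own assumptions, not identifiers from the paper.

#include <functional>
#include <vector>

struct Sample { std::vector<double> x; double y; };   // (x_i, y_i) with y_i in [0, 1]

// E(y_L): mean square error over the sample set (Eqs. 5 and 6).
// `model` is any callable implementing Eq. 4, e.g. the yL sketch from Section 2.
double estimatedError(const std::vector<Sample>& S,
                      const std::function<double(const std::vector<double>&)>& model) {
    double E = 0.0;
    for (const Sample& s : S) {
        double e = model(s.x) - s.y;                  // e_i(y_L), Eq. 5
        E += e * e;                                   // squared error contribution
    }
    return E / static_cast<double>(S.size());
}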

3.1 Placing Algorithm

Note that, from Eq. 4, the sets of SLSs (L0 and L1) must have at least one element each. So, the initial step of the TRAINING algorithm is to place one SLS in each set L0 and L1 (see Line 4 of the TRAINING algorithm in Fig. 2). For that, we build the PLACING procedure, which takes a sample set S = {(xi, yi) : xi ∈ Rd, yi ∈ [0, 1], i = 1, 2, . . . , n} as an argument and returns two sets of SLSs L0 and L1, each one with one SLS. Initially, the samples in S are separated into two sets X0 = {xi ∈ Rd : (xi, yi) ∈ S and yi ≤ 0.5} and X1 = {xi ∈ Rd : (xi, yi) ∈ S and yi > 0.5} (see Lines 4 and 5 of the PLACING algorithm in Fig. 1). For j = 0, 1, a SLS will be built for Lj from the set Xj. The average (xj), the variance (varj) and the standard deviation (sdj) of each set Xj are computed (see Lines 7 to 9 of the PLACING algorithm). All these values are calculated in each dimension independently, so xj, varj and sdj ∈ Rd. Let dirj be a random unit vector in Rd. The extremities aj = xj − ||varj|| · dirj and bj = xj + ||varj|| · dirj of the SLS for Lj are computed (see Line 11 of the PLACING algorithm). Note that the center of the SLS with extremities aj and bj is placed at the average of each set Xj, with length proportional to the norm of its variance and with a random direction. The motivation for making the length of the SLSs proportional to the norm of the variance is to tie it to the spread of the samples. Observe that, since each SLS has a random direction, it is possible to run the TRAINING algorithm many times in order to select the initial SLSs that lead to the lowest error function value. Thus, we have the SLS Laj,bj in Rd to be attributed to the set Lj. In order to have SLSs in Rd+1, their extremities aj, bj ∈ Rd are extended to dimension d + 1 (points pj, qj ∈ Rd+1) by adding an extra coordinate with value proportional to the sum of the norms of both standard deviations, that is, pj = (aj, ||sd0|| + ||sd1||) and qj = (bj, ||sd0|| + ||sd1||) (see Line 14 of the PLACING algorithm). This is done so that the distance of the initial SLSs from the space Rd is proportional to the spread of both sets X0 and X1. Thus, Lj = {x ∈ Rd+1 : x = pj + λ · (qj − pj), 0 ≤ λ ≤ 1} (see Line 15 of the PLACING algorithm). Finally, the PLACING procedure returns L0 = {L0} and L1 = {L1}.

 1: PLACING(S)
 2: Input: S = {(xi, yi) : xi ∈ Rd, yi ∈ [0, 1], i = 1, 2, . . . , n}.
 3: Output: The sets L0 and L1 such that |L0| = |L1| = 1.
 4: X0 ← {xi ∈ Rd : (xi, yi) ∈ S and yi ≤ 0.5};
 5: X1 ← {xi ∈ Rd : (xi, yi) ∈ S and yi > 0.5};
 6: for each set Xj ∈ {X0, X1} do
 7:   xj ← average of Xj;
 8:   varj ← variance of Xj;
 9:   sdj ← √varj;
10:   let dirj be a random unit vector in Rd;
11:   aj ← xj − ||varj|| · dirj;  bj ← xj + ||varj|| · dirj;
12: end for
13: for each computed pair (aj, bj) do
14:   pj ← (aj, ||sd0|| + ||sd1||);  qj ← (bj, ||sd0|| + ||sd1||);
15:   Lj ← {x ∈ Rd+1 : x = pj + λ · (qj − pj), 0 ≤ λ ≤ 1};
16: end for
17: return L0 = {L0} and L1 = {L1};

Figure 1. The PLACING algorithm.
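For illustration, a possible C++ rendering of the PLACING procedure is sketched below. The helper functions (mean, variance, norm, randomUnitVector) and the use of <random> for the random direction are our own choices, not details taken from the paper's implementation.

#include <cmath>
#include <random>
#include <vector>

using Vec = std::vector<double>;
struct SLS { Vec p, q; };                                   // extremities in R^{d+1}
struct Sample { Vec x; double y; };

static Vec mean(const std::vector<Vec>& X) {
    Vec m(X.front().size(), 0.0);
    for (const Vec& x : X)
        for (std::size_t i = 0; i < m.size(); ++i) m[i] += x[i];
    for (double& v : m) v /= static_cast<double>(X.size());
    return m;
}

static Vec variance(const std::vector<Vec>& X, const Vec& m) {
    Vec v(m.size(), 0.0);
    for (const Vec& x : X)
        for (std::size_t i = 0; i < v.size(); ++i) v[i] += (x[i] - m[i]) * (x[i] - m[i]);
    for (double& c : v) c /= static_cast<double>(X.size());
    return v;
}

static double norm(const Vec& v) {
    double s = 0.0;
    for (double c : v) s += c * c;
    return std::sqrt(s);
}

static Vec randomUnitVector(std::size_t d, std::mt19937& g) {
    std::normal_distribution<double> n(0.0, 1.0);
    Vec u(d);
    for (double& c : u) c = n(g);
    double len = norm(u);
    for (double& c : u) c /= len;
    return u;
}

// PLACING: builds one initial SLS for each of L0 and L1 (Fig. 1).
// Assumes each class subset X0, X1 is non-empty.
void placing(const std::vector<Sample>& S, SLS& L0, SLS& L1, std::mt19937& g) {
    std::vector<Vec> X[2];                                  // X0: y <= 0.5, X1: y > 0.5
    for (const Sample& s : S) X[s.y > 0.5 ? 1 : 0].push_back(s.x);

    Vec a[2], b[2];
    double sdNorm = 0.0;                                    // ||sd0|| + ||sd1||
    for (int j = 0; j < 2; ++j) {
        Vec m = mean(X[j]);
        Vec var = variance(X[j], m);
        Vec sd(var.size());
        for (std::size_t i = 0; i < sd.size(); ++i) sd[i] = std::sqrt(var[i]);
        Vec dir = randomUnitVector(m.size(), g);
        double vn = norm(var);
        a[j] = m; b[j] = m;
        for (std::size_t i = 0; i < m.size(); ++i) {        // a_j = mean - ||var||*dir, b_j = mean + ||var||*dir
            a[j][i] -= vn * dir[i];
            b[j][i] += vn * dir[i];
        }
        sdNorm += norm(sd);
    }
    SLS* L[2] = { &L0, &L1 };
    for (int j = 0; j < 2; ++j) {                           // extend to R^{d+1}: extra coordinate ||sd0|| + ||sd1||
        L[j]->p = a[j]; L[j]->p.push_back(sdNorm);
        L[j]->q = b[j]; L[j]->q.push_back(sdNorm);
    }
}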

 1: TRAINING(S, maxSLS)
 2: Input: S = {(xi, yi) : i = 1, 2, . . . , n; xi ∈ Rd; yi ∈ [0, 1]} and a desired number of SLSs maxSLS = |L0| + |L1| ≥ 2.
 3: Output: Two sets of SLSs L0 and L1.
 4: [L0, L1] ← PLACING(S);
 5: countSLS ← 2;
 6: repeat
 7:   α ← 0.1; Error ← 1; ∆Error ← 1;
 8:   repeat
 9:     Disp ← COMPUTEDISP(S, L0, L1);
10:     β ← α; dir ← 1;
11:     for (count ← 1 to MAX) do
12:       [L0, L1] ← MOVESLS(L0, L1, Disp, dir, β);
13:       OldError ← Error;
14:       Error ← E(yL);
15:       ∆Error ← OldError − Error;
16:       if ∆Error < 0 then
17:         dir ← (−1) · dir;
18:       end if
19:       β ← β/2;
20:       α ← α + dir · β;
21:     end for
22:   until (|∆Error| ≤ 10⁻⁸) or (α ≤ 10⁻⁸)
23:   if countSLS < maxSLS then
24:     ONEMORESLS(S, L0, L1);
25:   end if
26:   countSLS ← countSLS + 1;
27: until countSLS > maxSLS;
28: return L0 and L1;

Figure 2. The TRAINING algorithm.

3.2 Minimizing the Error Function

After obtaining the initial sets of SLSs L0 and L1, let us move on to the TRAINING algorithm (see Fig. 2). The idea of the TRAINING algorithm is to minimize the error function (Eq. 6) by displacing the extremities of the initial SLSs. We should remark that there are many ways to find good displacements of the SLSs; here, we present a strategy that has shown good preliminary results. The TRAINING algorithm is iterative and its central loop (Lines 8 to 22) has basically the following steps: (1) the displacements of the extremities of the SLSs in both L0 and L1 are computed in order to decrease the error function given by Eq. 6 (Line 9); (2) the extremities of the SLSs in L0 and L1 are displaced accordingly (Line 12); (3) with these new SLSs, the new error is computed (Line 14); (4) the loop continues until there is no decrease in the error function or some other stopping condition is satisfied (Line 22). In our algorithm, we have added one more stopping condition: the displacements of the SLSs are no longer big enough to yield a significant decrease in the error function. This is controlled by the scalar α, as we will see later. In addition, one may want to increase the number of SLSs in the sets L0 and L1 in order to decrease the error function. This is done by adding an outer loop (Lines 6 to 27) that places a new SLS in either L0 or L1. Once a certain number of SLSs, given by the user, is reached, the training algorithm stops and the sets L0 and L1 are returned.
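The control flow of the TRAINING algorithm (Fig. 2), including the adaptive computation of dir, β and α discussed later in Section 3.5, can be sketched in C++ as follows. The helper procedures are only declared here and stand for the operations described in Sections 3.3–3.6 and Eq. 6; all identifiers and signatures are our own assumptions, not the paper's code.

#include <cmath>
#include <vector>

// Minimal types; the helpers below are assumed to be defined elsewhere
// (see the other sketches) and stand for Sections 3.3-3.6 and Eq. 6.
struct SLS { std::vector<double> p, q; };
struct Sample { std::vector<double> x; double y; };
struct Disp { /* displacement vectors per SLS extremity (Section 3.3) */ };

Disp computeDisp(const std::vector<Sample>& S,
                 const std::vector<SLS>& L0, const std::vector<SLS>& L1);
void moveSLS(std::vector<SLS>& L0, std::vector<SLS>& L1,
             const Disp& disp, double dir, double beta);
void oneMoreSLS(const std::vector<Sample>& S,
                std::vector<SLS>& L0, std::vector<SLS>& L1);
void placing(const std::vector<Sample>& S,
             std::vector<SLS>& L0, std::vector<SLS>& L1);
double estimatedError(const std::vector<Sample>& S,
                      const std::vector<SLS>& L0, const std::vector<SLS>& L1);

// Control flow of the TRAINING algorithm (Fig. 2).
void training(const std::vector<Sample>& S, int maxSLS,
              std::vector<SLS>& L0, std::vector<SLS>& L1) {
    const int MAX = 10;                               // precision iterations for alpha (Section 3.5)
    placing(S, L0, L1);
    int countSLS = 2;
    do {
        double alpha = 0.1, error = 1.0, deltaError = 1.0;
        do {
            Disp disp = computeDisp(S, L0, L1);
            double beta = alpha, dir = 1.0;
            for (int count = 1; count <= MAX; ++count) {
                moveSLS(L0, L1, disp, dir, beta);     // displace the extremities
                double oldError = error;
                error = estimatedError(S, L0, L1);    // Eq. 6
                deltaError = oldError - error;
                if (deltaError < 0) dir = -dir;       // error increased: reverse direction
                beta /= 2.0;                          // refine the step
                alpha += dir * beta;                  // adapt the displacement magnitude
            }
        } while (std::fabs(deltaError) > 1e-8 && alpha > 1e-8);
        if (countSLS < maxSLS) oneMoreSLS(S, L0, L1); // outer loop: grow the model
        ++countSLS;
    } while (countSLS <= maxSLS);
}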

3.3 Computing the Displacements of the SLSs

Now, we present a procedure (called COMPUTEDISP) that finds the displacements of the SLSs in the sets L0 and L1 in order to move them and decrease the error function. This procedure takes three arguments, the sample set S and the sets of SLSs L0 and L1, and returns the set Disp that will be described later. The first step to compute the displacements of the SLSs is to find a partition of the sample set S. Let L = {L0, L1}. For each sample (xi, yi) ∈ S, the error ei = yL(xi) − yi is computed. An error vector e = (e1, e2, . . . , en) is built. Then, the samples in S are split into two subsets S0 = {(xi, yi) ∈ S : ei ≤ M(e)} and S1 = {(xi, yi) ∈ S : ei > M(e)}, where M(e) is the median of the vector e. Then, a sub-partition of the subset Sj is computed in such a way that each subset of that sub-partition is attributed unequivocally to a SLS in Lj. Let mj = |Lj| and Lj = {L1^(j), L2^(j), . . . , Lmj^(j)}. Then, the algorithm finds a collection of subsets of Sj, Pj = {S1^(j), S2^(j), . . . , Smj^(j)} (a partition of Sj), such that |Sk^(j)| ≈ |Sl^(j)|, for k ≠ l. Basically, this is accomplished by running the following loop until |Sj| = 0.

for each Lk^(j) ∈ Lj do
  if |Sj| ≠ 0 then
    let (xi, yi) ∈ Sj be the nearest sample to Lk^(j);
    Sk^(j) ← Sk^(j) ∪ {(xi, yi)};
    Sj ← Sj \ {(xi, yi)};
  end if
end for

Now, using the collection Pj = {S1^(j), S2^(j), . . . , Smj^(j)}, we will determine how the SLSs in Lj = {L1^(j), L2^(j), . . . , Lmj^(j)} will be displaced. For that, considering that the SLS Lk^(j) ∈ Lj has extremities p and q, we will split the set Sk^(j) ∈ Pj into two disjoint subsets Sp,k^(j), Sq,k^(j) ⊆ Sk^(j). The criterion to split Sk^(j) is similar to the one we have used previously to split Sj. The subset Sp,k^(j) contains the samples in Sk^(j) that are as near as possible to the extremity p of Lk^(j); the same holds for the other extremity q.

Now, consider Sk^(j) ∈ Pj and Lk^(j) ∈ Lj with extremities p and q, and let Sp,k^(j) and Sq,k^(j) be the subsets obtained by the previous steps. For each extremity r of Lk^(j), the displacement direction of the extremity r is determined by the samples in Sr,k^(j) in the following way:

for each sample (x, y) ∈ Sr,k^(j) do
  e ← yL(x) − y;
  v ← r − xe;  /* xe is the extension of x to dimension d + 1 */
  v ← v/|v|;
  Dispr,k^(j) ← Dispr,k^(j) + (−1)^(j+1) · e · v;
end for

The sample (x, y) ∈ Sr,k^(j) is used to move the extremity r of Lk^(j). The value of e depends on the error (Eq. 5): the bigger this error is (in absolute value), the bigger is its influence on the displacement of the extremity r. The vector v (a vector in Rd+1) gives the spatial orientation of the displacement, and it points from the sample xe towards r. Now, the extremity r should be moved away from or towards xe depending on the sign of e and on whether Lk^(j) ∈ L0 or Lk^(j) ∈ L1 (that is, on the value of the index j). For example, if e < 0, then yL(x) < y. So, in order to decrease the error function (Eq. 5), yL(x) should be increased, and consequently, if j = 0 (respectively, j = 1), the extremity r should be moved away from (respectively, towards) the point xe, since Lk^(0) ∈ L0 (respectively, Lk^(1) ∈ L1). Thus, in this case, the sign of (−1)^(j+1) · e is positive (respectively, negative) and correctly computed. The variable Dispr,k^(j) is a weighted vector of all displacement directions and indicates the displacement direction of the extremity r of the SLS Lk^(j) ∈ Lj.

Now, for j = 0, 1, let Dispj = {Dispp,k^(j), Dispq,k^(j) : p, q are the extremities of Lk^(j), k = 1, 2, . . . , mj}. The COMPUTEDISP procedure returns the set Disp = {Disp0, Disp1}.

3.4 Moving the SLSs

Now, considering that the sign (variable dir) and the magnitude (variable β) of the displacements are given, we have built the MOVESLS procedure that iteratively moves the extremities of the SLSs in both L0 and L1:

for each Lj ∈ {L0, L1} do
  for each Lk^(j) ∈ Lj do
    for each extremity r of Lk^(j) do
      r ← r + dir · β · Dispr,k^(j);
    end for
  end for
end for

The scalar dir is a sign that indicates whether the extremity r should be displaced in the direction of the vector Dispr,k^(j) or in the opposite direction in order to decrease the error function (Eq. 6). The magnitude β indicates how large this displacement is. These two scalars change dynamically and are computed iteratively.
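As an illustration of Sections 3.3 and 3.4, the sketch below accumulates the displacement of a single extremity and then applies the move step. The sample-to-extremity assignment is assumed to have been computed already, and the function and type names are ours, not the paper's.

#include <cmath>
#include <vector>

using Vec = std::vector<double>;
struct Sample { Vec x; double y; };

// Accumulates the displacement direction for one extremity r (in R^{d+1}) of an
// SLS in L_j (j = 0 or 1), from the samples S_{r,k}^{(j)} assigned to that extremity.
// `yl(x)` is assumed to implement Eq. 4 for the current sets of SLSs.
template <class Model>
Vec displacementForExtremity(const Vec& r, const std::vector<Sample>& Srk,
                             int j, const Model& yl) {
    Vec disp(r.size(), 0.0);
    for (const Sample& s : Srk) {
        double e = yl(s.x) - s.y;                      // error of Eq. 5
        Vec xe = s.x;                                  // extend x to R^{d+1}
        xe.push_back(0.0);
        Vec v(r.size());
        double len = 0.0;
        for (std::size_t i = 0; i < r.size(); ++i) {   // v points from x_e towards r
            v[i] = r[i] - xe[i];
            len += v[i] * v[i];
        }
        len = std::sqrt(len);
        double w = ((j == 0) ? -1.0 : 1.0) * e;        // the sign (-1)^{j+1} times e
        for (std::size_t i = 0; i < r.size(); ++i)
            disp[i] += w * v[i] / len;                 // weighted unit direction
    }
    return disp;
}

// MOVESLS step for one extremity (Section 3.4): r <- r + dir * beta * Disp_{r,k}^{(j)}.
void moveExtremity(Vec& r, const Vec& disp, double dir, double beta) {
    for (std::size_t i = 0; i < r.size(); ++i) r[i] += dir * beta * disp[i];
}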

3.5 Computation of dir and β

The values of the variables dir and β are computed between Lines 16 and 19 of the TRAINING algorithm (see Fig. 2). The sign of dir depends on whether the error function increases or decreases. Note that at Line 20 of the TRAINING algorithm a scalar α is computed. The scalar α controls how large the displacement magnitude can be, and β is an auxiliary variable used to calculate the ideal α value. Note that α will decrease (dir is negative) if the displacement increases the error. On the other hand, α will increase (dir is positive) if the displacement decreases the error. In fact, at the end, each extremity r will be displaced by the magnitude α. The precision with which α changes is determined by the for statement at Line 11: the more the algorithm iterates in this for loop, the more precision we have for the α value. We have used only MAX = 10 iterations, and this has proved to be good enough since α increases or decreases very fast: in MAX iterations, α can increase by the value α · (1 − 1/2^MAX) (dir always positive) and decrease by the value α/2^MAX (dir always negative).

3.6 Adding One More SLS

The function ONEMORESLS at Line 24 of the TRAINING algorithm (see Fig. 2) places one more SLS. The rules to place the new SLS are the same as those used in the PLACING algorithm; however, ONEMORESLS places just one SLS. If the error

M(yL) = (1/n) Σ_{i=1}^{n} ei(yL)        (7)

is positive, the new SLS is placed in the set L0; otherwise, it is placed in the set L1. Since yL depends on all SLSs, they need to be moved in order to avoid that some SLSs stay static in a bad place with respect to the new SLS. Thus, all the SLSs are moved away from the space Rd by a distance proportional to the standard deviation of the sample sets, in the same way as the SLSs are placed in the PLACING algorithm.
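A sketch of the decision made by ONEMORESLS based on Eq. 7 is shown below; placeOneSLS is a hypothetical helper standing for the placement rules of the PLACING algorithm, and yl is assumed to implement Eq. 4.

#include <vector>

struct SLS { std::vector<double> p, q; };
struct Sample { std::vector<double> x; double y; };

// Builds one SLS from the samples following the same rules as PLACING
// (assumed to be implemented elsewhere).
SLS placeOneSLS(const std::vector<Sample>& S);

// Adds one SLS to L0 if the mean signed error M(y_L) of Eq. 7 is positive,
// otherwise to L1. `yl(x)` is assumed to implement Eq. 4.
template <class Model>
void oneMoreSLS(const std::vector<Sample>& S, const Model& yl,
                std::vector<SLS>& L0, std::vector<SLS>& L1) {
    double M = 0.0;
    for (const Sample& s : S) M += yl(s.x) - s.y;      // sum of e_i(y_L)
    M /= static_cast<double>(S.size());                // Eq. 7
    if (M > 0.0) L0.push_back(placeOneSLS(S));         // y_L overshoots on average: strengthen L0
    else         L1.push_back(placeOneSLS(S));
}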

4 Experimental Results

In order to show the application of the new Machine Learning technique presented in this paper, it has been implemented (in the C++ language) and tested on an artificial data set and on two databases for testing pattern recognition techniques (UCI ML Repository Content Summary, http://www1.ics.uci.edu/~mlearn/MLSummary.html). All experiments were run on a PC (Pentium Dual Core 3.00 GHz, 2 GB of RAM) with Linux OS.

4.1 Artificial Data Set

We used two artificial data sets to test the ability of this new technique to approximate functions. Two functions have been chosen for this purpose. The first function f1 : [0, 1] × [0, 1] → [0, 1] is a linear function given by f1(x0, x1) = (x0 + x1)/2. The second function f2 : [0, 1] × [0, 1] → [0, 1] is a sinusoidal function, f2(x0, x1) = [sin(2 · π · x0) + cos(2 · π · x1)]/4 + 0.5. To use the TRAINING algorithm, we have sampled the domain [0, 1] × [0, 1] of these two functions on a grid D = {0.0, 0.1, 0.2, . . . , 1.0} × {0.0, 0.1, 0.2, . . . , 1.0}. So, the sample set for each function fj, j = 1, 2, has the form Sj = {(xi, yi) : i = 1, 2, . . . , n; xi ∈ D and yi = fj(xi) ∈ [0, 1]}. Thus, we have 121 samples for each function. To approximate the function f1, we have used 4 SLSs (|L0| + |L1|). The TRAINING algorithm iterated 17 times through the loop between Lines 6 and 22 (we call this loop the training cycle). The final error (Eq. 6) was 0.006. For the second function, f2, we have used 8 SLSs in total. The TRAINING algorithm spent 70 training cycles and the final error was 0.014. The time to train the first function was 0.06 s and the second one 0.41 s.
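For reference, the grid sample sets used in this experiment can be generated as in the sketch below; the call to the training procedure is only indicated in a comment, since its exact interface depends on the implementation.

#include <cmath>
#include <vector>

struct Sample { std::vector<double> x; double y; };

// Builds the 11 x 11 grid sample set over [0,1] x [0,1] for a target function f.
template <class F>
std::vector<Sample> gridSamples(const F& f) {
    std::vector<Sample> S;
    for (int i = 0; i <= 10; ++i)
        for (int j = 0; j <= 10; ++j) {
            double x0 = i / 10.0, x1 = j / 10.0;
            S.push_back(Sample{{x0, x1}, f(x0, x1)});
        }
    return S;                                            // 121 samples
}

int main() {
    const double PI = 3.14159265358979323846;
    auto f1 = [](double x0, double x1) { return (x0 + x1) / 2.0; };
    auto f2 = [&](double x0, double x1) {
        return (std::sin(2 * PI * x0) + std::cos(2 * PI * x1)) / 4.0 + 0.5;
    };
    std::vector<Sample> S1 = gridSamples(f1);            // linear target
    std::vector<Sample> S2 = gridSamples(f2);            // sinusoidal target
    // training(S1, 4, ...); training(S2, 8, ...);       // as in the experiments above
    return 0;
}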

4.2 Real Data Sets

We have used two databases from the UCI (University of California, Irvine) Machine Learning Repository: the wine recognition database, donated by Stefan Aeberhard, which uses chemical analysis to determine the origin of wines [1]; and the Wisconsin Breast Cancer database, donated by Olvi Mangasarian, which uses 9 attributes (features) to classify malignant and benign tumors [2, 8, 9, 16]. The origin of the wine is classified into 3 possible categories by observing 13 chemical attributes. All attributes were linearly normalized to the interval [0, 1]. For the classification of the origin of the wines, we have used three SLS functions yL1, yL2 and yL3 in the following way: yLi(x) = 1 if the category of the origin of the wine represented by the point x is i, and yLi(x) = 0 otherwise, where i ∈ {1, 2, 3}. Thus, the final classification of the point x is the index j for which yLj(x) is the maximum value of yLi(x) over i = 1, 2, 3. In order to control the confidence of the classification, a confidence index, given by Eq. 8, has been defined:

χ(x) = max{yLi(x) : i = 1, 2, 3} / (yL1(x) + yL2(x) + yL3(x))        (8)


Figure 3. Wine classification: correct classification rate; average confidence index of the correct classifications; average confidence index of the wrong classifications.

Note that if yL1(x), yL2(x) and yL3(x) have similar values, the confidence index χ(x) tends to 1/3; however, if one yLi(x) equals 1 and the others are zero, then χ(x) tends to 1. That is, the nearer χ(x) is to 1, the more trustworthy the classification is. The classification rate was calculated using the leave-one-out method. We have run the TRAINING algorithm three times with 2, 4, 6, 8, 10, 12 and 14 SLSs. The results are presented in Fig. 3. This graph shows that increasing the number of SLSs improves the classification. In addition, the average confidence index of the correct classifications is greater than that of the wrong ones. The best result obtained was 100% correct classification, achieved using 14 SLSs. The average time to train this wine classifier with 14 SLSs was 4.17 s. The best result reported in the UCI Machine Learning Repository was 100% correct classification using 1-nearest neighbor [1].
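Given the three trained functions, the winning category and the confidence index of Eq. 8 reduce to a few lines of C++ (a sketch; the function name and the 0-based category index are our choices):

#include <algorithm>
#include <utility>
#include <vector>

// Returns the predicted category (0-based index of the maximum) and the
// confidence index chi(x) of Eq. 8, given the three one-vs-rest outputs
// y_{L1}(x), y_{L2}(x), y_{L3}(x).
std::pair<int, double> classifyWine(const std::vector<double>& scores) {
    auto it = std::max_element(scores.begin(), scores.end());
    double sum = 0.0;
    for (double s : scores) sum += s;
    double chi = *it / sum;                               // Eq. 8
    return { static_cast<int>(it - scores.begin()), chi };
}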

Figure 4. Breast cancer classification: correct classification rate; average confidence index of the correct classifications; average confidence index of the wrong classifications.

The breast cancer database has 682 samples with 9 attributes

and two classes, malignant and benign tumors. All of these 9 attributes are in the interval [1, 10], and consequently they do not need to be normalized into the interval [0, 1]. We have used one yL function for classification, assigning to the point x the label 1 for a malignant tumor and the label 0 for a benign tumor. That is, when yL(x) > 0.5, the point x is classified as a malignant tumor; otherwise, x is classified as a benign tumor. The confidence index χb for the breast cancer database is:

χb(x) = yL(x), if yL(x) > 0.5;  χb(x) = 1 − yL(x), if yL(x) ≤ 0.5        (9)

In this way, χb(x) lies in the interval [0.5, 1.0]. We have run the TRAINING algorithm three times with 2, 3, 4, 5, 7 and 9 SLSs. The leave-one-out method was applied to estimate the classification error. The average time to train this classifier with 9 SLSs was 4.80 s. The averages of the results are plotted in Fig. 4. The figure shows that increasing the number of SLSs improves the classification, but not significantly; only the average confidence index shows a significant improvement. All tests yielded correct classification rates between 95.99% and 97.21%. This is a very good result, since the best result registered in the UCI Machine Learning Repository is 96.9%. This does not mean that the technique presented here is better than the one used there: we have used leave-one-out with 682 instances, while they used cross-validation with 369 instances, separating 30% of the instances for testing. We have also tested the same data set using the k-nearest neighbor technique with k = 1, 3, 5, 7, 9. The best result was obtained for k = 5 (97.36% of correct classifications). Therefore, the results presented here indicate that our technique can be a good candidate for Machine Learning applications.
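Similarly, the breast cancer decision rule and the confidence index χb of Eq. 9 can be sketched as:

#include <utility>

// Returns true for "malignant" (y_L(x) > 0.5) together with the
// confidence index chi_b(x) of Eq. 9, which lies in [0.5, 1.0].
std::pair<bool, double> classifyTumor(double yl) {
    bool malignant = yl > 0.5;
    double chi = malignant ? yl : 1.0 - yl;    // Eq. 9
    return { malignant, chi };
}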

5 Conclusion

This paper presents a new machine learning technique based on the distance between points and sets of straight line segments. We have shown from experimental results that, using this technique, one can obtain classification accuracy similar to that of traditional supervised machine learning techniques such as k-NN and artificial neural networks. Therefore, it should be considered as a possibility to be included in Machine Learning systems. The TRAINING algorithm has a good performance, especially in Pattern Recognition problems. However, it is clear that the TRAINING algorithm is just a heuristic, and consequently we cannot assure that it finds the best positions of the SLSs. We intend to improve the TRAINING algorithm by using the gradient descent method. In the TRAINING algorithm, the initial SLSs are placed with random directions. Since the TRAINING algorithm does not guarantee the best sets of SLSs (because it is a heuristic), one could run it many times and take the sets of SLSs that give the best result. However, since the results presented here are so similar, it is possible that initializing the SLSs with random directions confuses more than it helps. We intend to use eigenvalues and eigenvectors to find the initial placements of the SLSs in a deterministic way. For future research, we will use the final positions of the SLSs obtained from the training algorithm to investigate which features (attributes) most affect the final result. This investigation could lead to a new feature selection method.

References

[1] S. Aeberhard, D. Coomans, and O. Y. de Vel. Comparative Analysis of Statistical Pattern Recognition Methods in High Dimensional Settings. Pattern Recognition, 27(8):1065–1077, 1994.

[2] K. P. Bennett and O. L. Mangasarian. Robust Linear Programming Discrimination of Two Linearly Inseparable Sets. Optimization Methods and Software, 1:23–34, 1992.

[3] O. Chapelle, B. Schölkopf, and A. Zien. Semi-Supervised Learning. MIT Press, Cambridge, MA, 2006.

[4] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley and Sons, 2001.

[5] A. K. Jain, R. P. W. Duin, and J. Mao. Statistical Pattern Recognition: A Review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1):4–37, 2000.

[6] A. K. Jain, M. N. Murty, and P. J. Flynn. Data Clustering: A Review. ACM Computing Surveys, 31(3):264–323, 1999.

[7] D. D. Lewis and W. A. Gale. A Sequential Algorithm for Training Text Classifiers. In W. B. Croft and C. J. van Rijsbergen, editors, 17th ACM International Conference on Research and Development in Information Retrieval, pages 3–12. Springer Verlag, 1994.

[8] O. L. Mangasarian, R. Setiono, and W. H. Wolberg. Pattern Recognition via Linear Programming: Theory and Application to Medical Diagnosis. SIAM Publications, pages 22–30, 1990.

[9] O. L. Mangasarian and W. H. Wolberg. Cancer Diagnosis via Linear Programming. SIAM News, 23(5):1–18, September 1990.

[10] T. M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.

[11] J. H. B. Ribeiro and F. M. Azevedo. Development of Pattern Recognition Technique Based on Euclidean Distance. In 2nd European Medical and Biological Engineering Conference (EMBEC'02), volume 1, pages 508–509, 2002.

[12] N. Roy and A. McCallum. Toward Optimal Active Learning through Sampling Estimation of Error Reduction. In 18th International Conference on Machine Learning (ICML 2001), pages 441–448. Morgan Kaufmann, 2001.

[13] M. Seeger. Learning with Labeled and Unlabeled Data. Technical report, Edinburgh University, 2001.

[14] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.

[15] V. N. Vapnik. An Overview of Statistical Learning Theory. IEEE Transactions on Neural Networks, 10(5):988–999, September 1999.

[16] W. H. Wolberg and O. L. Mangasarian. Multisurface Method of Pattern Separation for Medical Diagnosis Applied to Breast Cytology. Proceedings of the National Academy of Sciences, U.S.A., 87:9193–9196, December 1990.

[17] X. Zhu. Semi-Supervised Learning Literature Survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2005. http://www.cs.wisc.edu/~jerryzhu/pub/ssl survey.pdf.
