Use of Median String for Classification

Carlos D. Martínez-Hinarejos, Alfons Juan, Francisco Casacuberta
Dep. de Sistemes Informàtics i Computació, Institut Tecnològic d'Informàtica
Universitat Politècnica de València
Camí de Vera, s/n, 46022, València, Spain
{cmartine,ajuan,[email protected]}

Abstract

A string that minimizes the sum of distances to the strings of a given set is known as the (generalized) median string of the set. This concept is important in Pattern Recognition for modelling a (large) set of garbled strings or patterns. The search for such a string is an NP-Hard problem and, therefore, no efficient algorithm to compute median strings can be designed. Recently, a greedy approach was proposed to compute an approximate median string of a set of strings. In this work, an algorithm is proposed that iteratively improves that approximate solution. Experiments were carried out on synthetic and real data to compare the performance of the approximate median string with the conventional set median. These experiments showed that the proposed median string is a better representative of a given set than the corresponding set median.

1. Introduction

In some applications of Pattern Recognition, a string or a small set of strings is used to model a (large) set of garbled strings or patterns. A number of techniques have been developed for finding such a string or set of strings. One of these techniques is Clustering for Syntactic Patterns [5], which is based on a distance measure between two strings. The Levenshtein distance is one of the most typical: the Levenshtein distance (or edit distance) is the smallest number of deletions, insertions and substitutions needed to transform one string (or sequence) into another. This very natural distance can be adapted to the case where some operations are more likely than others. Polynomial algorithms have been proposed to compute this distance [11].

The optimal prototype of a cluster is the (generalized) median string. The median string of a given set of strings is defined as a string that minimizes the sum of distances to each string of the set. In general, the problem of finding the median string is NP-Hard [3], and therefore no efficient algorithm can be designed to compute it. In some Pattern Recognition problems, such as Speech Recognition, heuristics are currently used to obtain approximations to median strings [7, 6]. One of these approximations is the search for a set median; in this case, the search is constrained to the given input set and is a polynomial problem [5, 6]. In some cases, the set median may not be a good approximation to a median string (for example, in the extreme case of a set of two strings). Another proposal was made in [9], but it was based on numerical discrete sequences; since we want to apply this technique to sequences of discrete symbols without a clear and natural numerical representation, we do not follow this approach.

A heuristic approach for computing the median string of a given set was proposed in [7], based on a process of systematic perturbation of the set median. This approach is not time consuming if the median is close to the set median [7]. However, the algorithm was not completely specified in [7] and we could not reproduce it. A new, simple greedy algorithm was proposed in [2] to efficiently compute an approximation to the median string of a set. That proposal is improved in this work by introducing a procedure that iteratively refines the original approximate median string. Exhaustive experiments comparing the performance of the median string and the set median are reported.

(This work was supported by the European ESPRIT project 30268 EUTRANS and by the Ministerio de Educación y Cultura, Becas de Formación de Profesorado Universitario.)
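As an illustration of the dynamic-programming recurrence behind these polynomial algorithms, here is a minimal unit-cost sketch (the weighted variant mentioned above only changes the three cost terms):

```python
def levenshtein(a: str, b: str) -> int:
    """Wagner-Fischer dynamic programming for the unit-cost edit
    distance: prev[j] holds d(a[:i-1], b[:j]) while row i is built."""
    prev = list(range(len(b) + 1))           # distances from the empty prefix
    for i, ca in enumerate(a, start=1):
        curr = [i]                           # deleting all i symbols of a[:i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1      # substitution cost
            curr.append(min(prev[j] + 1,     # deletion
                            curr[j - 1] + 1, # insertion
                            prev[j - 1] + cost))
        prev = curr
    return prev[-1]
```

The computation takes O(|a| · |b|) time and, with the rolling row used here, O(|b|) space.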

2. Set median and median string

Let Σ* be the free monoid over the alphabet Σ, and let S ⊆ Σ* be a set of strings. We define the set median string x of S by:

    x = argmin_{y ∈ S} Σ_{z ∈ S} d(y, z)

where d is a distance function such as the conventional edit distance or its normalized counterpart, which averages the number of elementary editing operations over the length of the editing path. This string can be computed in time O(l² · n²), where l is the maximum length of the strings in S and n = |S|. We define the median string m of S by:

    m = argmin_{y ∈ Σ*} Σ_{z ∈ S} d(y, z)

As we can see, the only difference from the set median definition is that m is not constrained to belong to S. As pointed out above, computing this string is an NP-Hard problem, so it cannot be computed in reasonable time.

One possible way to compute an approximation to the median string efficiently is the greedy procedure of [2]. This process starts with the empty string and adds a new symbol at every iteration. To select the symbol to be added, a new string is built for each possible symbol by appending that symbol to the previous string; the chosen symbol is the one whose string has the lowest accumulated Levenshtein distance to the whole set of strings. More formally:

1. Let M be a hypothesized mean string, initially set to the empty string.
2. For each symbol a ∈ Σ, produce a provisional mean string Ma as the concatenation of M and a.
3. Compute the sum sa of the edit distances between Ma and each string x ∈ S.
4. Let a ∈ Σ be a symbol such that sa is minimum. The new mean string M is Ma and the new score s is sa.
5. If the value of s is lower than the one in the previous iteration, go to step 2; otherwise stop.
6. The mean string is M.

The approximate median string computed by the above method can be improved by applying the edit operations (insertion, deletion and substitution) over each symbol of the string, looking for a reduction of the accumulated edit distance (the sum of the edit distances to each string in S). This procedure can be iterated until no further improvement is possible. In this work we use two different approximations, depending on the initial string, which can be either the approximate median string just described or the set median string. We denote the two approximations as:

- GR median string: the initial string is computed as in [2].
- ME median string: the initial string is the set median string.

Over the initial string s, we apply the following procedure until there is no improvement. For each position i in the string s:

1. Build alternatives.
   - Substitution: Make x = s. For each symbol a ∈ Σ, let x′ be the string that results from substituting x_i by a. If the accumulated distance from x′ to S is lower than the accumulated distance from x to S, then make x = x′.
   - Deletion: Let y be the string that results from deleting the i-th symbol of s.
   - Insertion: Make z = s. For each symbol a ∈ Σ, let z′ be the string that results from inserting a at position i of s. If the accumulated distance from z′ to S is lower than the accumulated distance from z to S, then make z = z′.
2. Choose an alternative.
   - Take from the set {s, x, y, z} the string s′ with the lowest accumulated distance to S, and make s = s′.

In this case, the computational time cost of the algorithm is O(l³ · n · |Σ|) per global iteration. In the following sections we will see that all approximate median strings yield a lower accumulated distance than the set median, and we will also evaluate the use of these approximate median strings as prototypes for classification. In both cases we use the normalized edit distance.
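The greedy construction and the perturbation refinement described above can be sketched as follows. This is a simplified illustration, not the paper's implementation: it accumulates plain unit-cost Levenshtein distances rather than the normalized edit distance, and it picks the single best perturbation per position instead of tracking x, y and z separately (the effect is the same).

```python
def levenshtein(a, b):
    """Unit-cost edit distance (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def greedy_median(S, alphabet):
    """Greedy construction of [2]: grow M one symbol per iteration,
    keeping the extension with the lowest accumulated distance to S."""
    D = lambda m: sum(levenshtein(m, x) for x in S)
    M, best = "", D("")
    while True:
        cand = min((M + a for a in alphabet), key=D)
        if D(cand) >= best:               # no improvement: stop
            return M
        M, best = cand, D(cand)

def refine(s, S, alphabet):
    """Iterative perturbation: at every position try substitutions,
    deletion and insertions, keeping any string that lowers the
    accumulated distance, until a full pass brings no improvement."""
    D = lambda m: sum(levenshtein(m, x) for x in S)
    improved = True
    while improved:
        improved = False
        for i in range(len(s) + 1):
            cands = [s]
            if i < len(s):
                cands += [s[:i] + a + s[i + 1:] for a in alphabet]  # substitution
                cands.append(s[:i] + s[i + 1:])                     # deletion
            cands += [s[:i] + a + s[i:] for a in alphabet]          # insertion
            best = min(cands, key=D)
            if D(best) < D(s):
                s, improved = best, True
    return s
```

Since each accepted perturbation strictly decreases an integer-valued score, the refinement loop always terminates.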

3. Corpora used for experiments

3.1. EuTrans Traveller Task corpus

The first corpus used in the experiments was composed of artificially corrupted samples of the Traveller Task vocabulary [10]. The general domain of the task is a visit by a tourist to a foreign country; the selected scenario corresponds to person-to-person communication situations at the reception desk of a hotel. The vocabulary size is 678 Spanish words. In speech recognition it is usual to have several acoustic presentations of each word available, and each of these presentations can be a different sequence of phonetic units (or, for simplicity, a different orthographic representation). The problem is to model the acoustic variability of a word by choosing the "best" representative (or representatives) of the different presentations. To simulate this problem, a set of distorted samples was generated for each word of the vocabulary: from each word, 6 sets of 50 samples were generated. Each set

Table 1. Mean accumulated distance from the different prototypes to the training set for the EuTrans corpus and for different degrees of distortion.

Distortion   Set median   All approximate median strings
    5           0.05                 0.05
   10           0.11                 0.11
   20           0.21                 0.21
   30           0.30                 0.29
   40           0.42                 0.36
   50           0.52                 0.46

corresponds to a different distortion degree (5%, 10%, 20%, 30%, 40%, and 50%) [8]. Each set was divided into two subsets of 20 and 30 samples, used for training and test purposes respectively; thus the training set has 13560 words and the test set has 20340 words.

3.2. Chicken corpus

The second corpus used in the experiments comprised images of 5 parts of a chicken: wing (117 samples), back (76), drumstick (96), thigh and back (61), and breast (96) [1]. Images were clipped to their minimum squared bounding box, subsampled to normalize their size to 16 × 16, and then chain coded into non-cyclic strings representing their outer contours. The alphabet of these strings was the conventional one of eight symbols (Σ = {0, 1, 2, 3, 4, 5, 6, 7}), one for each of the eight possible 45° directions that define the discrete contours over a grid with a resolution of 8 pixels. The resulting 446 strings were divided into a training set of 346 samples and a test set of 100 samples, preserving a priori class probabilities.
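The chain coding used here maps each unit move along a contour to one of eight symbols. As a hypothetical sketch (the direction-to-symbol numbering below is an assumption for illustration; the paper does not specify its mapping):

```python
# Hypothetical 8-direction chain coder: each move (dx, dy) between
# consecutive contour pixels becomes one symbol 0..7.  The numbering
# (0 = +x, counter-clockwise with y growing downwards) is an
# illustrative assumption, not taken from the paper.
DIRS = {(1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
        (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7}

def chain_code(contour):
    """Encode an 8-connected pixel contour, given as a list of (x, y)
    points, as a string over {0,...,7}: one symbol per unit move."""
    return "".join(str(DIRS[(x2 - x1, y2 - y1)])
                   for (x1, y1), (x2, y2) in zip(contour, contour[1:]))
```

For example, walking the four sides of a unit square produces one symbol per side.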

4. Distance results

One way to estimate the quality of the strings obtained is to compute the mean accumulated distance of the prototypes to the training set. This measure sums all the distances (we use the normalized edit distance) from the prototype of each class to the training samples of its class, and divides the total by the number of training samples. In this section, we present the results achieved on both corpora using this quality measure.

Table 2. Mean accumulated distance from the different prototypes to the training set for the Chicken corpus.

Training set size   Set median   All approximate median strings
        5              0.34                 0.31
       10              0.37                 0.34
       15              0.37                 0.34
       20              0.38                 0.35

4.1. EuTrans Traveller Task corpus

We present here the mean accumulated distances of each prototype (set median and the two approximate median strings) to the training set for the EuTrans corpus. We must point out that for low distortion rates (5, 10 and 20%), we obtained the same prototypes with all methods (set median included), so the distance results are the same for all prototypes. The mean accumulated distances to the training set are shown in Table 1. The same results were achieved for the GR and ME median strings, and consequently only one column of distances is reported in all tables. However, the median strings obtained were different, as shown by the classification results. We can see a considerable improvement for high distortion (about 15%). However, since this corpus is synthetic, these results are not as significant as those obtained with a real corpus, so we now turn to experiments with a real corpus.

4.2. Chicken corpus

In this corpus, we carried out several experiments, each with a training set size varying from 1 to 20 samples. We performed 10 different experiments, drawing the training set randomly in each one. The accumulated distances of each prototype (set median and the two approximate median strings) to the training set, for different numbers of samples, are summarized in Table 2. There are no significant differences between the two median string prototypes (as in the previous corpus), so we present only one column of distances for median strings. As expected, the accumulated distance is (slightly) improved by using median strings (about 8% for 20 prototypes).

5. Classification experiments

Another evident quality measure is the classification error rate, since the aim of prototype extraction is to use the obtained prototypes for classification. In this section we therefore evaluate the quality of the techniques by classifying the test sets.

5.1. EuTrans Traveller Task corpus

We used the obtained prototypes (set median string and the different approximate median strings) to classify the test set of the EuTrans corpus. For this classification we used the Nearest-Neighbour (NN) rule, as described in [4], applying the classic exhaustive algorithm with the normalized edit distance. The classification errors are shown in Table 3. The error rate drops dramatically (about 15 points) when any of the median string approximations is used instead of the set median, but only for high distortion; this is because the prototypes for low distortion were the same for the set median and for all median string approximations. We can also see that there are no significant differences among the several median string approximations, and that the error increases with distortion, as expected. However, we must again remember that this is a synthetic corpus, and results on real corpora are needed to be sure of the quality of the median string for classification.

Table 3. Classification error (%) on the EuTrans corpus for each set of prototypes (set median and the different approximate median strings) and different degrees of distortion.

Distortion   Set median   GR m. str.   ME m. str.
    5           1.89         1.89         1.89
   10           2.10         2.10         2.10
   20           9.10         9.10         9.10
   30          15.11        14.53        14.53
   40          29.53        21.44        21.44
   50          52.18        37.24        37.29
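The exhaustive NN rule over string prototypes can be sketched as below. This is an illustrative simplification using plain unit-cost Levenshtein distance rather than the normalized edit distance of the paper, with one prototype per class:

```python
def levenshtein(a, b):
    """Unit-cost edit distance (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def nn_classify(test_string, prototypes):
    """Exhaustive 1-NN rule: prototypes maps class label -> prototype
    string; return the label of the closest prototype."""
    return min(prototypes,
               key=lambda c: levenshtein(test_string, prototypes[c]))
```

With several prototypes per class, `prototypes` would map each label to a list of strings and the minimum would be taken over all of them.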

5.2. Chicken corpus

The mean classification errors for the different sets of prototypes and different training set sizes are summarized in Table 4. As mentioned above, 10 different experiments were carried out to obtain significant results. In general, the median string improves on the results obtained with the set median; the only exception in Table 4 is for 10 prototypes with greedy initialization, and the differences there are not significant. The differences between the two initializations are also not significant.

Table 4. Mean classification error rate (%) for the Chicken task.

Training set size   Set median   GR m. str.   ME m. str.
        5             72.57        69.79        69.95
       10             69.95        70.55        69.60
       15             69.62        66.12        66.01
       20             70.17        66.99        66.76

6. Conclusions and future work

In this work we have shown that the proposed approximate median strings improve on the set median from two viewpoints: distance to the set they represent and classification error. With respect to the two methods used to compute the approximate median string, we can conclude that there are no significant differences between the two initialization options. Future work will address reducing the computational cost of the implemented algorithms and working with several other corpora in order to extend and generalize these results.

References

[1] G. Andreu, A. Crespo, and J. Valiente. Selecting the toroidal self-organizing feature maps (TSOFM) best organized to object recognition. In Proceedings of the International Conference on Neural Networks (ICNN'97, IEEE), volume 2, pages 1341–1346, 1997.
[2] F. Casacuberta and M. de Antonio. A greedy algorithm for computing approximate median strings. In VII Simposium Nacional de Reconocimiento de Formas y Análisis de Imágenes, pages 193–198, April 1997.
[3] C. de la Higuera and F. Casacuberta. The topology of strings: two NP-complete problems. Theoretical Computer Science, to appear.
[4] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley and Sons, 1973.
[5] K. S. Fu. Syntactic Pattern Recognition. Prentice-Hall, 1982.
[6] A. Juan and E. Vidal. Fast median search in metric spaces. In 2nd Int. Workshop on Statistical Techniques in Pattern Recognition (SPR98), Sydney, Australia, August 1998.
[7] T. Kohonen. Median strings. Pattern Recognition Letters, 3:309–313, 1985.
[8] F. Prat. Distorsión estocástica de cadenas de símbolos mediante dos métodos basados en modelos ocultos de Markov. Technical report, DSIC, 1994.
[9] D. Sankoff and J. B. Kruskal. Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, 1983.
[10] E. Vidal. Finite-state speech-to-speech translation. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, volume I, pages 111–114, 1997.
[11] R. Wagner and M. Fischer. The string-to-string correction problem. Journal of the ACM, 21:168–178, 1974.
