Kernel-Based Noise-Aware Machine Learning Algorithms
by Hugo Jair Escalante Balderas
A Dissertation Submitted to the Program in Computer Science, Computer Science Department, in partial fulfillment of the requirements for the degree of
MASTER IN COMPUTER SCIENCE
at the National Institute for Astrophysics, Optics and Electronics January 2006 Tonantzintla, Puebla
Advisor: Dr. Olac Fuentes Chávez Principal Research Scientist Computer Science Department INAOE
© INAOE 2006. All rights reserved. The author hereby grants to INAOE permission to reproduce and to distribute copies of this thesis document in whole or in part.
Kernel-Based Noise-Aware Machine Learning Algorithms Hugo Jair Escalante Balderas
Computer Science Department, Instituto Nacional de Astrofísica, Óptica y Electrónica, Luis Enrique Erro 1, Puebla, 72840, México
Thesis Advisor: Dr. Olac Fuentes Chávez
Thesis Committee: Dr. Jesús A. González Bernal, Dr. Aurelio López López, Dr. José Francisco Martínez Trinidad
January 28, 2006
Abstract

Machine learning applications require reliable information to attain accurate results. However, real-world data are never as good as we would like them to be and often suffer from corruption, degrading the performance of processes applied to the data. Proposed methods for data cleaning eliminate not only erroneous data but also rare-but-correct observations, resulting in unreliable data sets. For this reason we want to answer the following question: how can we improve data quality and prediction accuracy in the presence of corrupted data without losing useful information? We propose noise-aware algorithms as an answer to this question. These algorithms are based on a process that we call re-measurement, which attempts to emulate the way humans behave when they have a doubt about an observation. With these algorithms we can correct erroneous observations without eliminating any correct instance. The proposed algorithms use kernel methods in a straightforward approach. We will see along this report that kernel methods can be useful tools for solving several problems related to data processing and machine learning. Furthermore, we applied our algorithms to a very interesting problem: the prediction of stellar population parameters, a challenging astronomical domain in which a data-quality improvement is needed.
Acknowledgments

I would like to thank my family for their unconditional support and love, always pushing and motivating me. To Vero and Jair, my two main motivations in all that I do: both of you have always been with me. Thank you very much to my parents and grandparents for your infinite support; uncles and aunts are not excluded, of course. I thank my thesis advisor, Dr. Olac Fuentes, for his stimulating comments and new ideas towards the culmination of this work. Thank you for the new opportunities that you are giving me. I also thank my thesis committee, Aurelio López, Francisco Martínez and Jesús González, who provided me with valuable feedback and posed some interesting questions. I would like to thank CONACyT for the financial support received under grant 181498. Thank you for your friendship to all of my friends and classmates at INAOE: Lupita C., Maria Luisa, Lalo, Carlos, Yazid, Beton, Mago, Geovany, Lupita, Alicia, Virgilio, Alma, Agustin, Trilce, Thamar, Juan Carlos, Jorge, Luis A., Jose, Tono, Esau and Oscar; sorry if I have omitted somebody.
Dedication

To my lovely new family: Vero and Jair
Para mi amada nueva familia: Vero y Jair
Contents

1 Introduction ................................................. 5
  1.1 Problem statement ....................................... 6
  1.2 Overview of this thesis ................................. 8

2 Background Information ...................................... 9
  2.1 Locally Weighted Linear Regression (LWLR) ............... 9
  2.2 Estimation of Stellar Populations Parameters ............ 10
      2.2.1 Analysis of Galactic Spectra ...................... 11
  2.3 Dimensionality reduction ................................ 12
      2.3.1 Principal component analysis ...................... 14
      2.3.2 Kernel Principal component analysis ............... 15
      2.3.3 Experimental Results .............................. 17
      2.3.4 Discussion ........................................ 21

3 Outlier Detection ........................................... 23
  3.1 Related work ............................................ 23
  3.2 Outlier Detection Methods ............................... 25
      3.2.1 Distance Based Method ............................. 25
      3.2.2 Distance K-Based Method ........................... 25
      3.2.3 Statistical Based (ST) Method ..................... 26
      3.2.4 Kernel-Based Novelty Detection Method ............. 26
      3.2.5 ν and One-class SVM ............................... 27
  3.3 Experimental Results .................................... 28
  3.4 Discussion .............................................. 32

4 Noise-Aware Algorithms ...................................... 35
  4.1 Introduction ............................................ 35
  4.2 Appropriate Domains ..................................... 36
  4.3 Re-Measuring Process .................................... 39
  4.4 Noise-aware Algorithms .................................. 40
  4.5 Reducing the number of re-measurements .................. 43
  4.6 Experimental Results .................................... 47
  4.7 Discussion .............................................. 58

5 Conclusions ................................................. 59
  5.1 Future Work ............................................. 61
Chapter 1
Introduction

Real world data are never as good as we would like them to be and often suffer from corruption that may affect data interpretation, data processing, classifiers and models generated from the data, as well as decisions made based on the data. On the other hand, data can also contain useful anomalies, which often result in interesting findings, motivating further investigation. Unusual data can thus be due to several factors, including: ignorance and human mistakes, the inherent variability of the domain, rounding and transcription errors, instrument malfunction, biases and, most important, rare but correct and useful behavior. For these reasons it is necessary to develop techniques that allow us to deal with unusual data. Unusual data may be noise (erroneous data, due to corruption processes) or outliers (rare but correct data, also called anomalies throughout this thesis), and it would be useful to distinguish among noise, outliers and common objects. If we were able to differentiate unusual observations from the rest of the data, we could improve data quality by eliminating or isolating those objects. Moreover, if we were able to identify noise, outliers and common instances, we could eliminate noisy data, analyze or isolate outliers, and leave the rest of the correct data intact, which would result in a data quality improvement.

Data cleaning is a well studied task in many areas dealing with databases; nevertheless, it requires a large time investment. Indeed, between 30% and 80% of the data analysis effort is spent on cleaning and understanding the data [13]. An expert can clean the data, but this requires a time investment that grows with the number of observations in the data set, resulting in high costs. From this arises the need to automate this task as far as possible. However, this is not easy, since outliers and noise may look very similar to an algorithm; for this reason we need to add to such an algorithm a more human-like reasoning. In this thesis the re-measurement idea is proposed. This approach consists of detecting suspect data and, by analyzing new observations of these objects, we can substitute errors,
while retaining correct data for later analysis. The re-measurement idea is based on the natural way in which a human clarifies his/her doubts when he/she is not sure about the correctness of a datum. When a person is doubtful about the accuracy of an object's observation, one or several new observations can be obtained to confirm or discard the observer's hypothesis. Although this idea may sound simple, it is not so simple to implement in an algorithm, since we need an accurate method able to detect both outliers and noise. Also, it is necessary to develop a module that accurately discriminates between correct data (including outliers) and noise. This last process is crucial, since confusing outliers with noise, or vice versa, will lead to an incorrect interpretation of the data, as well as to the loss of useful information and the retention of corrupted data. It would be very useful to develop an algorithm that improved the quality of data and, consequently, allowed us to build accurate systems and classifiers. In this work, the re-measurement process was used to develop methods that we call noise-aware algorithms, since such algorithms can identify and correct noise, leaving the rest of the data intact. A noise-aware machine learning algorithm is therefore an algorithm, based on re-measurement, used to improve the accuracy of classifiers built for learning tasks. These algorithms are not limited to machine learning; they can also be useful in data mining, pattern recognition, data cleansing, data warehousing and applications such as scientific research, credit fraud detection, security systems, medical diagnostics, network intrusion detection and information retrieval, among others. In this work, we oriented our efforts towards improving data quality and prediction accuracy for machine learning problems, specifically for the estimation of stellar population parameters, a challenging astronomical domain in which a noise-aware algorithm is suitable to test. This domain is not the only suitable one; similar machine learning domains include recognition and detection tasks (face, digit, pedestrian, intrusion, fraud), spectra and time series prediction (astronomy, forecasting), medical diagnosis of deadly diseases and, in general, any domain in which data are obtained from a known and controlled process and in which the cost of each new measurement is affordable.
1.1 Problem statement
In most scientific disciplines we are facing a massive overload of data to be analyzed, classified or processed. With the development of new and automated measurement instruments, several terabytes of information are being generated. Such analyses and processes become impossible for a human, so automated tools should be developed. For some time, machine learning algorithms have been used to automate these tasks. However, if there exist corrupted observations in our data, the performance of machine learning algorithms will not be as good as it could be. If we were able to correct the erroneous values in the data without eliminating correct observations, data quality would improve and the automated analyses, classifications or processes on the data would become more accurate. Artificial data sets can be free of corrupted data, but real and useful domains are consistent with the following comment by Hampel [25]: "altogether 5 to 10% wrong values in a data set seem to be the rule rather than the exception". Therefore, it is necessary to develop techniques that allow us to deal with data sets containing both affected data and useful anomalies, in order to obtain an accurate analysis of the data. Existing approaches "to clean" data sets do not distinguish between outliers and noise; they just eliminate the detected suspect data [1, 9, 8, 15, 41, 24, 23, 57, 35, 55, 46, 28, 30]. This fact leads us to our research question: how can we improve data quality and prediction accuracy in the presence of corrupted data without losing useful information? We believe that the use of a noise-aware algorithm, instead of an unconscious-elimination one, can be useful, because we could identify and eliminate noise by re-measuring only a few observations, while leaving intact the rest of the data, including correct anomalies. An unconscious-elimination algorithm identifies suspect observations and eliminates them from the data, which results in the elimination of useful observations, too. An obvious solution for corrupted data sets may be the re-measurement of all of the data, but this is unfeasible. A noise-aware algorithm, on the other hand, identifies a relatively small subset of suspect data (an optimal number of observations is desired) and requests new measurements of those suspect data from the user, in order to confirm or discard the observations' validity. This property makes a noise-aware algorithm feasible and useful for data quality improvement. A data quality improvement is very likely to be obtained; however, it is not clear whether the prediction accuracy (which is often referred to as a data quality metric [13]) will improve, since we will retain some peculiar observations. As an additional motivation for this work, technological innovations result in new and sophisticated measurement instruments. In astronomy, new lenses and storage capacity pose a challenge to the scientific community, which must analyze thousands of spectra and photometric images. Therefore, it is very important to develop automated tools that can be both accurate and reliable. A way to be reliable is to have a method that ensures data quality, and one way to be accurate is to rely on true information only. We selected the domain of stellar populations, starting from synthetic models in order to control the re-measurement process, with the aim of taking this algorithm to real data sets in the near future. Furthermore, this domain is very interesting since, in a multidisciplinary approach, we can help to understand the evolution of our own and other galaxies.
1.2 Overview of this thesis
This thesis is organized as follows. The next chapter presents background information, introducing kernel principal component analysis for dimensionality reduction, as well as the astronomical domain used in this work, the prediction of stellar population parameters; we will see that this is a difficult and interesting problem that can be solved using machine learning methods. In Chapters 3 and 4 the main contributions of this work are presented. In Chapter 3 we present a review of related work in the outlier detection area, as well as a comparison of six outlier detection methods. In Chapter 4, noise-aware algorithms are proposed, and experimental results that show the performance of the algorithms on the astronomical domain, as well as on benchmark data sets, are presented. Furthermore, an implementation of a noise-aware machine learning algorithm and experimental results in a very realistic scenario are presented. Finally, in Chapter 5 we summarize our findings and discuss future directions of this work.
Chapter 2
Background Information

In this chapter fundamental methods and concepts used in the thesis are presented. First, the locally weighted linear regression algorithm, used in most of the experiments performed, is described. Then, the astronomical domain in which the noise-aware algorithms were applied is introduced. Finally, in Section 2.3 the PCA and KPCA methods for dimensionality reduction are described and compared.
2.1 Locally Weighted Linear Regression (LWLR)

Locally weighted linear regression (LWLR) [2] belongs to the family of instance-based learning algorithms. These algorithms build query-specific local models, which attempt to fit the training examples only in a region around the query point. Learning with these algorithms consists of storing some or all of the training examples and postponing any generalization until a new instance must be classified. In contrast to most other learning algorithms, instance-based approaches can construct a different approximation of the target function for each distinct query instance that must be classified. LWLR uses nearby or distance-weighted training examples to form a local approximation to the target function f. We can approximate f in the neighborhood surrounding x_q using a linear function, a quadratic function, a multilayer neural network, or some other functional form. In this work we use a linear model around the query point to approximate the target function f. Given a query point x_q, to predict its output parameters y_q, we assign, to a fixed number of nearest neighbors of x_q in the training set, a weight given by the inverse of the distance from the training point to the query point:

    w_i = 1 / |x_q − x_i|                                                     (2.1)

Let W, the weight matrix, be a diagonal matrix with entries w_1, ..., w_n. Let X be a matrix whose rows are the vectors x_1, ..., x_n, the input parameters of
the examples chosen from the training set, with the addition of a "1" in the last column. Let Y be a matrix whose rows are the vectors y_1, ..., y_n, the output parameters of the examples x_1, ..., x_n. Then the weighted training data are given by Z = WX and the weighted target function is V = WY. Then we use the estimator for the target function:

    y_q = x_q Z* V                                                            (2.2)

where Z* is the pseudoinverse (Moore-Penrose inverse) of Z, Z* = (Z^T Z)^{−1} Z^T.
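The estimator translates almost directly into code. Below is a minimal sketch of LWLR with inverse-distance weights (an illustrative helper, not the thesis implementation; a small constant is added to the distances to avoid division by zero when the query coincides with a training point, a detail the text does not discuss):

```python
import numpy as np

def lwlr_predict(X_train, Y_train, x_q, k=10, eps=1e-8):
    """Predict outputs for query x_q with locally weighted linear regression.

    X_train: (n, d) inputs, Y_train: (n, m) outputs, x_q: (d,) query point.
    """
    # Distances from the query to every training point.
    dists = np.linalg.norm(X_train - x_q, axis=1)
    nn = np.argsort(dists)[:k]                   # k nearest neighbors

    w = 1.0 / (dists[nn] + eps)                  # inverse-distance weights (Eq. 2.1)
    W = np.diag(w)

    # Append the constant "1" column to the inputs (and to the query).
    X = np.hstack([X_train[nn], np.ones((k, 1))])
    Y = Y_train[nn]

    Z = W @ X                                    # weighted inputs
    V = W @ Y                                    # weighted target function
    # y_q = x_q Z* V, with Z* the Moore-Penrose pseudoinverse (Eq. 2.2)
    return np.append(x_q, 1.0) @ np.linalg.pinv(Z) @ V
```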
2.2 Estimation of Stellar Populations Parameters

The term stellar population is often used to refer to a mixture of stars and gas that share a common chemical and dynamical history. In general there are two stellar population types, populations I and II. Population I is mostly composed of young stars, with a high presence of heavy elements and a high stellar formation rate; its metal content is generally high. Population II is the opposite: it is mainly composed of old stars, with only a few young stars, a low heavy-element content and a low stellar formation rate. Observing the properties of these two populations, we can infer a third population lying between populations I and II, presenting an equilibrium between their parameters. Therefore, for practical purposes, a galaxy can be considered to be formed by these three principal stellar populations. Models of evolving stellar populations, and comparisons among observations, help us to understand the star formation process. Galaxies with active stellar formation can be observed; in many of them evidence of violent formation processes of massive stars can be found, a phenomenon called a "starburst". Determining this phenomenon from spectral information is very important for understanding the formation of galaxies and their evolution. However, this is not an easy task, since spectra may contain noise or may be corrupted by external processes; furthermore, it is necessary to detect patterns in the spectra that identify such properties. Common approaches to the determination of stellar population parameters include template fitting and the classic means: analysis of the stars' metallicity and kinematics. In most scientific disciplines we are facing a massive overload of data, and astronomy is no exception. With the development of new automated telescopes for sky surveys, terabytes of information are being generated. Such amounts of information need to be analyzed in order to provide knowledge and insight that can improve our understanding of the evolution of the universe. Such an analysis becomes impossible using traditional techniques, so automated tools should be developed. Recently, machine learning researchers and astronomers have been collaborating towards the
goal of automating astronomical data analysis tasks. Such collaborations have resulted in the automation of several astronomical tasks, including galaxy classification [4, 14], prediction of stellar atmospheric parameters [19, 20, 54] and even the estimation of stellar population parameters [22, 16]. In this thesis we applied our noise-aware algorithms to the prediction of the following stellar population parameters: ages, relative contributions, metal content, reddening and redshift. In the remainder of this section the data used in our experiments are briefly described.
2.2.1 Analysis of Galactic Spectra

Almost all relevant information about a star can be obtained from its spectrum, which is a plot of flux against wavelength. The analysis of a galactic spectrum can reveal valuable information about star formation, as well as other physical parameters such as metal content, mass and shape. Accurate knowledge of these parameters is very important for cosmological studies and for the understanding of galaxy formation and evolution. Template fitting has been used to estimate the distribution of age and metallicity from spectral data. Although this technique achieves good results, it is very expensive in terms of computing time and can therefore be applied only to small samples.

Modeling Galactic Spectra

Theoretical studies have shown that a galactic spectrum can be modeled with good accuracy as a linear combination of three spectra, corresponding to young, medium and old stellar populations (see Figure 2.1), with their respective metallicities, together with a model of the effects of interstellar dust on these individual spectra. Interstellar dust absorbs energy preferentially at short wavelengths, near the blue end of the visible spectrum, while its effects on longer wavelengths, near the red end of the spectrum, are small. This effect is called reddening in the astronomical literature. Let f(λ) be the energy flux emitted by a star or group of stars at wavelength λ. The flux detected by a measuring device is then d(λ) = f(λ)(1 − e^{−rλ}), where r is a constant that defines the amount of reddening in the observed spectrum and depends on the size and density of the dust particles in the interstellar medium. In a more realistic scenario we also considered redshift, which tells us how the light emitted by distant galaxies is shifted to longer wavelengths when compared to the spectra of closer galaxies. This is taken as evidence that the universe is expanding and that it started in a Big Bang. More distant objects generally exhibit larger redshifts; these more distant objects are also seen as if they were further back in time, because their light has taken longer to reach us. Redshift can be due to several factors, including movement
Figure 2.1: Stellar spectra of young, intermediate and old populations.
of the source, the expansion of space or gravitational effects. In this work we considered a non-relativistic formula to simulate redshift in spectra. We build a simulated galactic spectrum g(λ) (as in [22]) given c_1, c_2, c_3, with Σ_{i=1}^{3} c_i = 1 and c_i ≥ 0, the relative contributions of the young, medium and old stellar populations, respectively; their reddening parameters r_1, r_2, r_3; and the ages of the populations a_1 ∈ {10^6, 10^{6.3}, 10^{6.6}, 10^7, 10^{7.3}} years, a_2 ∈ {10^{7.6}, 10^8, 10^{8.3}, 10^{8.6}} years, a_3 ∈ {10^9, 10^{10.2}} years:

    g(λ) = Σ_{i=1}^{3} c_i s_i(a_i, λ)(1 − e^{−r_i λ})

with each s_i chosen with a different metallicity m ∈ {0.0004, 0.004, 0.008, 0.02, 0.05} in solar units and m_{s1} ≥ m_{s2} ≥ m_{s3}; finally, we add an artificial redshift Z by λ = λ_0(Z + 1).
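As an illustration only, the following sketch assembles a spectrum according to this model. The basis spectra s_i(a_i, λ) would come from a stellar population synthesis library; here they are simply an input array, and the function name and resampling step are assumptions:

```python
import numpy as np

def simulate_spectrum(basis, c, r, wavelengths, z):
    """Build g(lambda) as a reddened linear combination of three basis spectra.

    basis: (3, L) young/medium/old spectra s_i(a_i, lambda) on L wavelengths
    c: (3,) relative contributions, c_i >= 0, sum(c) == 1
    r: (3,) reddening parameters
    z: artificial redshift applied as lambda = lambda_0 * (z + 1)
    """
    reddened = basis * (1.0 - np.exp(-np.outer(r, wavelengths)))  # (1 - e^{-r_i lambda})
    g = (c[:, None] * reddened).sum(axis=0)
    # Shift the wavelength grid and resample the flux back onto the original grid.
    shifted = wavelengths * (z + 1.0)
    return np.interp(wavelengths, shifted, g)
```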
3.2.2 Distance K-Based Method

"An object O is a D_n^k outlier if no more than n − 1 other objects P in the data set have a higher value of D^k(P) than D^k(O)", where D^k(O) is the distance from O to its kth nearest neighbor. Objects are sorted by their distance to their kth nearest neighbor. However,
it is possible that the kth nearest neighbor of a point P lies at a large distance from P, while the distances of P to its m nearest neighbors, with m < k, are much smaller than the distance from P to its kth neighbor; therefore, P is not necessarily an outlier. Instead of using the original D_n^k method, a modification (DK) was proposed, in which the average distance from a point to its k nearest neighbors is used. With this small change to the original algorithm, the outliers are the points farthest from their k nearest neighbors, and not only from the kth nearest neighbor. Here, the objects with the highest average distance to their k nearest neighbors are ranked, and the top n points in this ranking are considered to be outliers. Again, the Euclidean distance between instances was used. In this approach the parameters n and k need to be specified, but this is an easy task, since we can specify the probable number of outliers present in a data set (n) and the number of neighbors to consider (k). For all experiments we used k = 3 and n = 10% of the total objects. A code sketch of this ranking is shown below.
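A minimal sketch of the DK ranking (illustrative only; function and parameter names are not from the thesis):

```python
import numpy as np

def dk_outliers(X, k=3, frac=0.10):
    """Rank points by average distance to their k nearest neighbors (DK method).

    Returns the indices of the top `frac` fraction, flagged as outliers.
    """
    # Pairwise Euclidean distances.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)            # exclude each point from its own neighbors
    knn = np.sort(d, axis=1)[:, :k]        # distances to the k nearest neighbors
    score = knn.mean(axis=1)               # average k-NN distance
    n = max(1, int(frac * len(X)))
    return np.argsort(score)[::-1][:n]     # indices of the n highest-scoring points
```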
3.2.3 Statistical Based (ST) Method

Many statistical approaches perform very well if we know the distribution of the data (normal, Poisson, Student's t, etc.); however, real data sets either do not follow any common distribution or it is too difficult to find it. Regardless of the distribution of the data, statisticians have widely used the mean and standard deviation to identify outliers [3]. Therefore, we introduced the following definition of statistical-based outliers, independent of any distribution: "An object O is an ST-(k, ρ) outlier if at least k attributes of O have values more than ρ standard deviations from the mean". Those objects that have k attribute values outside a predefined number ρ of standard deviations from the mean of the corresponding attribute are labeled as outliers.
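A direct transcription of the ST-(k, ρ) definition into code could look as follows (an illustrative sketch, not the thesis implementation):

```python
import numpy as np

def st_outliers(X, k=2, rho=3.0):
    """Flag objects with at least k attributes beyond rho standard deviations
    of the attribute mean (the ST-(k, rho) definition)."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    # Boolean matrix: attribute value farther than rho sigmas from its mean.
    extreme = np.abs(X - mu) > rho * sigma
    return np.where(extreme.sum(axis=1) >= k)[0]   # indices of ST outliers
```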
3.2.4 Kernel-Based Novelty Detection Method

This method, presented in [52], uses kernels. It calculates the center of mass for a data set in feature space F by using a kernel matrix K as in (2.10). A threshold is fixed by considering an estimation error EE (Equation 3.1) of the empirical center of mass, together with the distances between the objects and that center of mass in a data set:

    EE = √(2·φ / n) · (√2 + √(ln(1/δ)))                                       (3.1)

where φ = max(diag(K)), K is the n × n kernel matrix of the data set, and δ is a confidence parameter for the detection process. This kernel-based method is easy to implement, efficient and very precise. For this thesis we used a polynomial kernel, see Equation (2.12), to perform the mapping to feature space.
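In feature space, the squared distance from φ(x_j) to the empirical center of mass can be computed from the kernel matrix alone, ||φ(x_j) − c||² = K_jj − (2/n)·Σ_i K_ij + (1/n²)·Σ_{i,l} K_il, so the whole detector needs only K. A minimal sketch follows; the rule for combining the distances with EE is an assumption, since the text does not spell out the exact threshold of [52]:

```python
import numpy as np

def poly_kernel(X, degree=2):
    return (X @ X.T + 1.0) ** degree

def novelty_detect(X, degree=2, delta=0.05):
    """Flag points far from the empirical center of mass in feature space.

    Distances are computed from the kernel matrix only; the threshold combines
    the mean distance with the estimation error EE of Eq. (3.1). The exact
    combination rule is an assumption, not taken from [52].
    """
    K = poly_kernel(X, degree)
    n = len(K)
    # Squared distance of each point to the center of mass in feature space.
    d2 = np.diag(K) - 2.0 * K.mean(axis=1) + K.mean()
    d = np.sqrt(np.maximum(d2, 0.0))

    phi = np.max(np.diag(K))
    ee = np.sqrt(2.0 * phi / n) * (np.sqrt(2.0) + np.sqrt(np.log(1.0 / delta)))
    return np.where(d > d.mean() + ee)[0]
```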
3.2.5 ν and One-class SVM
The support vector machine (SVM) is a training algorithm that finds a hyperplane whose distance to a set of examples is maximized. Such examples are previously mapped into a feature space by using a kernel. Based on the structural risk minimization principle, the SVM minimizes both the empirical risk on a data set and a bound on the VC-dimension, which controls the complexity of the learning machine; see Equations (3.2), (3.3) and (3.4). We will not go into detail here about SVM theory; we just present the optimization functions and their constraints (for a detailed description of the SVM algorithm and the variants presented here we refer the reader to [40, 47]), in order to differentiate the original algorithm from the variants presented here:

    max_α  Σ_{i=1}^{n} α_i − (1/2) Σ_{i,j=1}^{n} α_i α_j y_i y_j k(x_i, x_j)   (3.2)

    subject to  0 ≤ α_i ≤ C,  i = 1, ..., n,                                   (3.3)

                Σ_{i=1}^{n} α_i y_i = 0,                                        (3.4)

where the α's are Lagrange multipliers, with values α_i ≠ 0 for the support vectors x_i; y is the vector of labels of the example vectors x_1, ..., x_n; n is the size of the training set; and C > 0 is a constant which regularizes the trade-off between model complexity and empirical risk. Since the introduction of the support vector algorithm [6, 56], many modifications and extensions have been proposed [40]. One such modification is the ν-SVM algorithm, first proposed for regression tasks [49] and then extended to classification problems [50]. This algorithm substitutes the parameter C of the classical SVM (3.3) by a ν parameter, which is an upper bound on the fraction of outliers and a lower bound on the fraction of support vectors [50]. Advantages of the ν-SVM over the classical SVM for regression are the elimination of the parameters C and the ε-bound, which often are difficult to set. In the ν-SVM we can distinguish outliers from support vectors by checking the value of the Lagrange multiplier of each example: examples with Lagrangian α_i equal to 1/n are outliers, while examples with α_i greater than zero and lower than 1/n are support vectors. For the ν-SVM, Equations (3.2), (3.3) and (3.4) become:

    max_α  − (1/2) Σ_{i,j=1}^{n} α_i α_j y_i y_j k(x_i, x_j)                    (3.5)

    subject to  0 ≤ α_i ≤ 1/n,  i = 1, ..., n,
                Σ_{i=1}^{n} α_i y_i = 0,   Σ_{i=1}^{n} α_i = ν,                 (3.6)

with ν ∈ (0, 1]. Another variant of the SVM, combining approaches from [55] and [50], is presented in [46]. In that work, a modification of the constraints of the ν-SVM equations leads to maximizing the distance of a hyperplane from the origin; a fraction of the data will then lie beyond that hyperplane, while some outliers are allowed (between the origin and the hyperplane). The equations for the one-class SVM are:

    min_α  (1/2) Σ_{i,j} α_i α_j k(x_i, x_j)                                    (3.7)

    subject to  0 ≤ α_i ≤ 1/(νn),  i = 1, ..., n,   Σ_{i=1}^{n} α_i = 1.        (3.8)

Summarizing, the main difference between the ν-SVM and the one-class SVM is the modification of the constraints of the optimization process; see Equations (3.6) and (3.8). Furthermore, the one-class algorithm does not need the labels of the training examples, in contrast to the ν-SVM, which is applicable only in supervised learning.
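Both variants are available in modern libraries. As a brief illustrative sketch (not part of the thesis experiments), one-class outlier detection with scikit-learn's OneClassSVM, where the ν parameter plays the role described above:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                 # stand-in for a training set

# nu upper-bounds the fraction of training points treated as outliers.
detector = OneClassSVM(kernel="poly", degree=2, nu=0.05)
labels = detector.fit_predict(X)               # +1 inlier, -1 outlier
outlier_idx = np.where(labels == -1)[0]
```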
3.3 Experimental Results

In this section experimental results comparing the performance of the methods described above are presented. We used benchmark data sets from the UCI repository [5], to which we randomly added artificial outliers and noise in order to determine which method performs best; we also used the stellar populations data set. The data sets used are described in Table 3.1.

    Dataset                   #Cases   #Attributes   #Outliers
    Triazines                   1116            60          56
    Pyrimidines                  220            28          15
    WaveformII                  1000            21          50
    Musk                        1000           166          50
    Ionosphere                   351            34          18
    Stellar Population Data      200            10          10

Table 3.1: Description of the data sets used for the outlier detection experiments. #Outliers is the number of observations affected.

As can be seen, the number of outliers present in each data set is approximately 5% of the total data, which agrees with related work in the area [25, 37]; other percentages could, however, be used. The UCI data sets were normalized to the range [0, 1] and we affected the data in two ways. In the first, we added gaussian noise with µ = 0 and σ² = 0.3 to 5% of the data for all data sets, aiming to simulate a 30% level of affectation. The second
way is by inserting artificial outliers; that is, we multiplied the data by a normally distributed random factor in order to simulate rare objects, again affecting only 5% of the data.

Preliminary results showed that classification accuracy is not an appropriate measure for evaluating outlier detection methods: comparing prediction accuracy once the objects detected by each method have been eliminated gives no insight into how many real outliers were detected, only into which method decreased the prediction error; moreover, it cannot show how many useful observations were eliminated. Consequently, we decided to use a more appropriate measure, based on recall R = TP / (TP + FN) and precision P = TP / (TP + FP), called the F-measure [36]:

    F = (2·R·P) / (R + P)                                                      (3.9)

where TP denotes true positives, FP false positives and FN false negatives. This metric, from information retrieval, has been used for comparisons of outlier detection methods [38, 36]. The F-measure expresses, as a real number in [0, 1], the performance of an outlier detection method, based on its detection rate and precision: an F-measure of 1 indicates perfect performance, while a value of 0 means the method detected none of the outliers present in the data. Furthermore, the F-measure is more reliable because it balances the two quantities that interest us most: how many real anomalies a method detected (R), and how many observations were falsely flagged (P).

    Dataset       T    DB      DK      ST      ν-SVM   KB      OC-SVM
    Triazines     A    0.633   0.587   0.402   0       0.883   0.647
                  O    0.640   0.778   0.779   0.103   1       0.034
    Pyrimidines   A    0.4     0.444   0.222   0.129   0.629   0.490
                  O    0.4     0.667   0.8     0.278   0.966   0.078
    Waveform      A    0.175   0.667   0.863   0.014   0.990   0.662
                  O    0.16    0.667   1       0.039   1       0.039
    Ionosphere    A    0.275   0.075   1       0.2     0.226   0.182
                  O    0.205   0.566   0.941   0       0.516   0.067
    Musk          A    0.543   0.667   0.980   0       1       0.649
                  O    0.617   0.667   1       0       1       0.078
    Average       -    0.405   0.578   0.799   0.127   0.821   0.293

Table 3.2: F-measure obtained by the tested methods on the UCI data, affected with added (A) gaussian noise and simulated outliers (O).
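As a sanity check of Equation (3.9), the F-measure can be computed directly from a detector's confusion counts. The example below reproduces the DB entry of Table 3.4 (shown later in this section), where 20 of the 20 affected observations are detected with 23 false alarms:

```python
def f_measure(tp, fp, fn):
    """F-measure from true positives, false positives and false negatives (Eq. 3.9)."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return 2 * recall * precision / (recall + precision)

# DB method in Table 3.4: TP = 20, FP = 23, and no missed outliers (FN = 0).
print(f_measure(tp=20, fp=23, fn=0))   # ~0.635, matching the reported 0.6349
```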
In Table 3.2 the F-measure is presented for the UCI data, together with the average for each method. As can be seen, the best performer is the kernel-based method, which shows perfect performance on four data sets, obtaining the best result overall. The kernel-based method showed poor performance on the ionosphere data: although it detected 100% of the outliers present, it also flagged several false positives. The statistical and DK methods also performed well on all data sets. The worst performer was the ν-SVM, which detected only a few of the truly affected observations. The one-class method, as well as the DB approach, detected almost 100% of the outliers in the data, but their numbers of false positives were high.

In Table 3.3 the performance of the methods on the stellar populations data set is presented. We used KPCA with a polynomial kernel of degree 1 and, based on results from the previous chapter, 10 PCs. Additionally, considering the ideas in [3, 29] for outlier detection, we used a combination of the first 10 components and the last 5 components in order to detect outliers that are difficult to identify. Experiments with additive (A) gaussian noise and artificial outliers (O) were conducted; the average for each method is also presented.

                 T    DB       DK     ST     Novelty   One-Class
    10PC         A    0.7407   1      1      1         0.4737
                 O    0        0.9    0      1         0.2286
    10PC+5LC     A    0        0.9    0      1         0.4615
                 O    0        0.8    0      1         0.4390
    Average      -    0.1852   0.9    0.25   1         0.4007

Table 3.3: Outlier detection on the stellar populations data; F-measure values are reported, for artificially inserted high-level gaussian noise (A) and simulated outliers (O), together with the average for each method.

From Table 3.3 we can see that the kernel-based algorithm showed perfect performance, and that the DK method also performed well; the other methods performed poorly on both added noise and simulated outliers. These results show that using the last 5 components returned by KPCA does not improve the detection accuracy for these data; notice, however, that the accuracy of the best performers does not diminish. The methods that degraded in performance were ST and DB, which did not detect any of the affected observations when the last 5 components were included. Again, the one-class SVM algorithm detected almost all affected observations, but it also produced many false positives.

Real data are rarely clean and may contain low-level noise due to systematic errors. Consequently, we need to select the method that
allows us to detect both outliers and highly noisy observations, even in the presence of low-level noise. In order to determine which of the methods performs best in such a scenario, the following experiment was performed. A data set of 200 spectra was generated, and we added a distribution of noise and outliers to all of the data in the following way: 90% of the data were affected with gaussian noise with µ = 0 and σ² = 1 at a signal-to-noise ratio of 50; we also affected the data with two distributions of gaussian noise with positive and negative extreme means, with probability p = 0.05; and simulated outliers were introduced by shifting a spectrum by a normally distributed random factor (f ∈ R : 1 < f < 10), with probability p = 0.05. See Figure 3.1 for a sample spectrum, and the sketch below for the protocol.

Figure 3.1: A randomly selected spectrum: a) original, b) affected with low-level noise, c) affected with negative extreme noise, d) affected with positive extreme noise, e) shifted spectrum, simulating an outlier, and f) all in one plot.

The results of this experiment are presented in Table 3.4 and in Figure 3.2. In the table, we include the percentage reduction of the M.A.E., Equation (2.14), obtained by LWLR after eliminating the data detected by each method, compared with the performance of LWLR on all of the data, using 10-fold cross validation.
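A compact sketch of this corruption protocol follows; the thesis fixes only the probabilities and distribution types, so the noise amplitudes and the clipping of the shift factor are assumptions:

```python
import numpy as np

def corrupt_spectra(S, rng=None):
    """Three-way corruption of a set of spectra S (n_spectra x n_wavelengths):
    ~90% get low-level gaussian noise, ~5% extreme-mean noise, and ~5% become
    shifted outliers."""
    if rng is None:
        rng = np.random.default_rng(0)
    S = S.copy()
    roles = rng.choice(["low", "extreme", "outlier"], size=len(S), p=[0.90, 0.05, 0.05])
    for i, role in enumerate(roles):
        scale = np.abs(S[i]).mean()
        if role == "low":
            # Gaussian noise with mu=0, sigma^2=1, scaled to a signal-to-noise ratio of 50.
            S[i] += rng.normal(0.0, 1.0, S.shape[1]) * scale / 50.0
        elif role == "extreme":
            # Positive or negative extreme-mean gaussian noise (amplitude assumed).
            S[i] += rng.normal(rng.choice([-5.0, 5.0]) * scale, 1.0, S.shape[1])
        else:
            # Shift the spectrum by a random factor f with 1 < f < 10
            # (normally distributed in the thesis; parameters assumed here).
            S[i] *= np.clip(np.abs(rng.normal(5.0, 2.0)), 1.0, 10.0)
    return S, roles
```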
    Method      TP   FP   F-measure   Reduction-av   Time (s)
    DB          20   23   0.6349       7.5%          1.793
    DK          15   15   0.6          3%            0.421
    ST          18    4   0.9473      -3.2%          0.561
    Novelty     19    2   0.89         8.3%          0.38
    One-Class   12   14   0.5333       0.9%          0.601

Table 3.4: Performance of the outlier detection methods when the data are noise-affected: 90% of the data were affected with low-level additive noise, 5% with extreme additive noise and 5% with artificial outliers. We present the true positives, false positives, F-measure, average M.A.E. reduction and processing time. The accuracy obtained by LWLR on the data after removing the observations detected by each method is compared with the accuracy of LWLR on the full data.
From Table 3.4 and Figure 3.2, the best performer according to the F-measure is the ST method, with the kernel-based novelty detection algorithm attaining similar performance; indeed, the kernel-based method detected one more outlier than the ST method. Comparing accuracy, the best performer is the kernel-based method; moreover, this method is the most efficient. Comparing TP, FP, F-measure, accuracy and processing time together, the best performer is again the kernel-based algorithm. Results on the data from the UCI repository and on the astronomical domain suggest that the most suitable technique for detecting abnormality is the kernel-based novelty detection algorithm: it identifies both kinds of observations, rare and highly noisy, even in the presence of low-level noise, while investing the least processing time and improving the estimation accuracy of a classifier.
3.4 Discussion

In this chapter six methods for outlier detection were compared. Several experiments were performed using benchmark data sets and a synthetic astronomical domain. The best performer according to the F-measure, in most of the experiments, was the kernel-based novelty detection method. In this method, the center of mass for a kernel matrix is computed and a threshold for normal behavior is fixed; this method is very effective and simple. Results using synthetic spectra show that the kernel-based method reaches perfect performance on data affected with extreme noise and artificial outliers; the DK and ST methods showed good performance too. The use of the last few components returned by KPCA does not improve the detection accuracy of the methods. In addition, the kernel-based algorithm detects both rare objects and highly noisy examples, even in the presence of low-level noise in the data set; good prediction accuracy and processing time are further advantages of the kernel-based novelty detection algorithm.

Figure 3.2: ROC plot of the performance of each method, for outlier detection, on the data set of stellar population parameters.
Chapter 4
Noise-Aware Algorithms

In this chapter the main contribution of this thesis is presented. The re-measurement process is defined, and suitable domains for the application of noise-aware algorithms are described, together with examples of practical applications in each domain. Furthermore, two versions of noise-aware algorithms are presented, and experimental results show how both algorithms improve data quality and prediction accuracy in training data sets. Moreover, experiments in very realistic scenarios for the testing phase are presented.
4.1 Introduction
Elimination of suspect data has been used in most outlier detection methods targeted to data cleaning [1, 9, 8, 41, 24, 23, 57, 35, 55, 46, 28, 30], due to the fact that suspect data can alter calculated statistics [10] (e.g., inflate variances, alter calculated means), increase prediction error [15], make models based on the data more complex [24, 28], or possibly introduce a bias into the process to which the data are dedicated. However, we should not eliminate an object unless we can determine that the datum is invalid. It is very important to determine whether a suspect instance is an error or the correct observation of a rare object. This often is not possible, for several reasons, including: human-hour cost, time investment, ignorance about the domain we are dealing with, and even uncertainty. Nevertheless, if we could guarantee that an algorithm will successfully distinguish errors from correct observations, this difficult problem would be solved. But what form should such an algorithm take, and what techniques should it use to guarantee an almost perfect differentiation? We believe that an answer to these questions is: by "re-measuring". Like a human does, an algorithm can confirm or discard a hypothesis by analyzing several measurements of the same object. Re-measuring is safer than elimination for several reasons: we can eliminate or isolate erroneous data; unusual observations can be preserved and we can decide what to do with them; we can ensure that a rare observation is the correct measurement of an abnormal object; moreover, we can be sure that a correct instance will never be rejected. All of these reasons make the use of noise-aware algorithms (based on re-measuring), instead of a naive-elimination one, attractive in certain domains. In the next section, suitable domains for noise-aware algorithms are described.
4.2 Appropriate Domains
There are many domains and practical applications in which a noise-aware algorithm is suitable; however, it is necessary to emphasize that such algorithms are appropriate only for certain types of domains and applications. It is also clear that there are domains in which the re-measuring process would have unfavorable consequences; therefore, we should be careful about the problems to which we apply this algorithm. In general, a noise-aware algorithm can be applied to any domain in which the re-measuring process is affordable and feasible; domains requiring highly reliable information; domains in which novelty is more useful than the rest of the observations; and domains in which decisions made on the data are crucial. Suitable domains include, but are not limited to: medical diagnosis, security systems, scientific domains such as astronomy, bioinformatics and optics, commercial applications in data warehouses and business intelligence, as well as information retrieval for text mining and natural language processing.

Noise-aware algorithms for scientific data analysis

The development of new and automated measurement instruments such as cameras, telescopes, microscopes and spectrographs has produced more information than scientists are able to process. Such amounts of data need to be analyzed in order to make new discoveries and, therefore, to advance scientific research. However, the analysis of large databases is infeasible for humans; consequently, automated techniques have been proposed. In astronomy, for example, automated techniques suffer from noise in observations due to several factors, including: bad guiding, focusing, seeing, fringing, reflection effects, satellite trails and diffraction spikes. Therefore, a method for data cleaning is needed. On the other hand, most objects can be explained by current theories and models; the rest are anomalies. Most of these anomalies (suspect data) are uninteresting, being due to corruption processes such as the ones mentioned above. Useful anomalies are extraordinary objects worthy of future research; an astronomer is interested in them, since he/she might allocate telescope time to observe them in detail. See Figure 4.1 for examples of anomalies in astronomy. However, since telescope time is scarce, an astronomer might not have enough observation time to check all of the suspect data. An astronomer can, however, use spectral or photometric data to identify useful anomalies, and then allocate telescope time for useful anomalies only. An astronomer can perform this process, but it requires a large time investment, which results in high costs. A noise-aware algorithm can automate this process with high accuracy, provided that the user or instrument supplies the algorithm with the re-measurements. An active learning approach for the detection of useful anomalies has been proposed in [42]; however, a domain expert is needed to discriminate between noise and anomalies. Other suitable scientific applications for noise-aware algorithms include land cover mapping [9] and interferogram analysis [21].

Figure 4.1: Examples of noisy objects in astrophysics data; the two left images show diffraction spikes, the two right images satellite trails.
Noise-aware algorithms for deadly disease diagnosis

Medical diagnosis applications require both reliable information and accurate abnormality detection; therefore, a noise-aware algorithm is suitable. Useful applications include the analysis of magnetic resonance images (MRI), electrocardiograms and ultrasounds. For example, a medical center can have a large database of, say, MRIs, and the neurology division may want an accurate application for the automated diagnosis of brain diseases. However, due to systematic errors or human mistakes, some data may be affected by noise. A noise-aware algorithm can be used to correct noisy observations. Furthermore, the algorithm can detect useful anomalies (if they exist), which it is very important to detect early. The algorithm will flag a subset of suspect MRIs of some patients and will request that each such patient undergo another MRI, in order to discard or confirm the diagnosis. Very possibly, each patient will consent to this request if the physician guarantees a highly accurate diagnosis. The same can be done with electrocardiograms, radiographies, tomographies and ultrasounds, to diagnose cardiac and pulmonary diseases, bone deformations and high-risk pregnancies (see Figure 4.2).
Figure 4.2: Examples of abnormality in medical data; the first two images show a cerebral malformation, probably a tumor, followed by the radiography of a patient with lung emphysema and, finally, the radiography of a patient with a malformation of the foot.

Noise-aware algorithms in security systems

In security systems, highly reliable information is needed, as well as precise novelty detection. A security agency, for example the AFI (Federal Agency of Investigations), would need an accurate system for identifying authorized personnel, keeping in mind that a failure of the system could result in espionage, sabotage or terrorist attacks. A noise-aware algorithm is therefore suitable for this domain, since it could guarantee highly reliable data and differentiation between noise and useful anomalies. Such an identification system can be based on face, iris, digit or voice recognition. Consider, for example, a face recognition system in which erroneous observations as well as anomalies are present in the data. A noisy observation in this domain could be the picture of an authorized employee wearing sunglasses or in a different pose from the rest of the authorized people. A useful anomaly could be an intruder wearing sunglasses or a cap, trying to enter a federal institution illegally; see Figure 4.3. Our algorithm could accurately differentiate between those objects. Furthermore, by requesting new pictures of the suspect authorized personnel, data quality will be improved, while outliers will remain unaffected and can be identified. In [44] a modification of the one-class algorithm is proposed for the detection of abnormalities in pictures for face recognition.

Figure 4.3: Examples of noisy objects and useful anomalies in face recognition; all of the pictures show personnel in different poses or wearing accessories that make them different from a common instance.

Noise-aware algorithms for information retrieval

Information technology applications are not excluded as suitable domains for our algorithm. For example, in information retrieval we could use a noise-aware algorithm for data quality improvement in the training set of an application that automatically chooses sites of interest for us. Training instances could be the user's activities, the frequencies of visited sites, the time spent on each site and other personalized features. In such a scenario, an error could be a page that was accidentally opened and remained open for a long time, say, over lunch time. On the other hand, useful anomalies could be pay or adult-content sites that are accessed by an intruder while the user is eating or sleeping. Therefore, it would be very useful to apply our algorithm: by requesting the user's confirmation for some suspect instances, we can correct noisy observations and detect the intruder. Other applications include cleaning data for question answering and document classification systems. In [38] the one-class algorithm was used to detect anomalies for document classification.

The domains described above are not the only ones; as we can see, there are many domains and specific applications in which a noise-aware algorithm is suitable to test. We leave to the reader's imagination the way in which a noise-aware algorithm could be applied to other domains. There may also exist domains that are not suitable for our methods. These include domains in which the re-measurement process is not affordable, domains in which eliminating a suspicious instance does not degrade the processes on the data, and domains in which the cost of implementing the algorithm is not justified by the obtained benefits. For example, consider the case of a robot in a dynamic environment. There, the use of a noise-aware algorithm would not make sense, since the obstacles move dynamically. If the main sensor of the robot is a camera, getting new pictures of suspicious objects could be impossible: the sensor could detect a suspicious object at time t, and when the algorithm requested a new picture of the same object at time t + 1, it is very possible that the object would no longer be in the same position as at time t.
4.3 Re-Measuring Process

Before introducing the noise-aware algorithms, the "re-measuring" process should be explained. Given a set of instances X = {x_1, x_2, ..., x_n}, with each x_i ∈ R^n (generated from a known and controlled process by means of measurement instruments or human recording), we have a subset S ⊂ X of instances x_{s_i}, with S = {x_{s_1}, x_{s_2}, ..., x_{s_m}} and m